Qwen3-8B-Base Performance Test Notes (Full Data)
Published: 2025-10-22
Author: Cialtion
1. Test Environment
- Model: Qwen3-8B-Base
- Device: NVIDIA GeForce RTX 4090 (48 GB VRAM variant)
- Vocabulary size: 151,643
2. Experiment Summary
Experiment 1: Short Prompt
- Prompt length: 7 tokens
- Generated tokens: 287

| Phase | Time (ms) | Share | Throughput (tokens/s) |
| --- | --- | --- | --- |
| Prefill | 70.87 | 0.47% | 98.77 |
| Decode | 14966.21 | 99.53% | 19.11 (52.33 ms/token) |
| Total | 15037.09 | 100% | 19.09 (overall) |
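
For reference, the derived columns in these summary tables follow directly from the raw timings. A quick sketch of that arithmetic, using the Experiment 1 values above (the accounting of 1 token from prefill plus 286 decode steps mirrors the appendix script):

```python
# Deriving the throughput columns from the raw Experiment 1 timings above.
prefill_ms, decode_ms = 70.87, 14966.21
prompt_tokens, generated_tokens = 7, 287               # 1 token from prefill + 286 decode steps

prefill_tps = prompt_tokens / (prefill_ms / 1000)                    # ~98.8 tokens/s
decode_ms_per_token = decode_ms / (generated_tokens - 1)             # ~52.33 ms/token
decode_tps = 1000 / decode_ms_per_token                              # ~19.11 tokens/s
overall_tps = generated_tokens / ((prefill_ms + decode_ms) / 1000)   # ~19.09 tokens/s
print(prefill_tps, decode_ms_per_token, decode_tps, overall_tps)
```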
Prefill Module Breakdown

| Module | Total (ms) | Calls | Avg (ms) | Share |
| --- | --- | --- | --- | --- |
| Attention.K_proj | 12.27 | 36 | 0.3409 | 24.61% |
| FFN.down_proj | 5.23 | 36 | 0.1452 | 10.48% |
| FFN.gate_proj | 5.07 | 36 | 0.1408 | 10.17% |
| FFN.up_proj | 4.84 | 36 | 0.1345 | 9.71% |
| LayerNorm.post_attn | 3.34 | 36 | 0.0929 | 6.71% |
| LayerNorm.pre_attn | 3.23 | 36 | 0.0897 | 6.48% |
| Attention.Q_proj | 2.98 | 36 | 0.0827 | 5.97% |
| Attention.O_proj | 2.49 | 36 | 0.0692 | 5.00% |
| Attention.V_proj | 1.39 | 36 | 0.0386 | 2.79% |
| LM_Head | 1.37 | 1 | 1.3707 | 2.75% |
| FFN.activation | 0.72 | 36 | 0.0201 | 1.45% |
| Attention.RoPE | 0.40 | 1 | 0.3974 | 0.80% |
| model.layers.0.self_attn.k_norm | 0.16 | 1 | 0.1569 | 0.31% |
| model.layers.0.self_attn.q_norm | 0.14 | 1 | 0.1373 | 0.28% |
| model.layers.7.self_attn.k_norm | 0.13 | 1 | 0.1290 | 0.26% |
| (other modules) | 6.10 | --- | --- | 12.24% |
| Total | 49.86 | | | |
Decode Module Breakdown (averaged per step)

| Module | Total (ms) | Calls | Avg (ms) | Share |
| --- | --- | --- | --- | --- |
| FFN.gate_proj | 1347.96 | 10296 | 0.1309 | 13.27% |
| FFN.down_proj | 1340.25 | 10296 | 0.1302 | 13.19% |
| FFN.up_proj | 1319.39 | 10296 | 0.1281 | 12.99% |
| LayerNorm.post_attn | 902.26 | 10296 | 0.0876 | 8.88% |
| LayerNorm.pre_attn | 877.46 | 10296 | 0.0852 | 8.64% |
| Attention.O_proj | 652.89 | 10296 | 0.0634 | 6.43% |
| Attention.Q_proj | 632.38 | 10296 | 0.0614 | 6.22% |
| LM_Head | 385.11 | 286 | 1.3465 | 3.79% |
| Attention.K_proj | 346.52 | 10296 | 0.0337 | 3.41% |
| Attention.V_proj | 342.32 | 10296 | 0.0332 | 3.37% |
| FFN.activation | 189.11 | 10296 | 0.0184 | 1.86% |
| Attention.RoPE | 63.39 | 286 | 0.2216 | 0.62% |
| model.layers.0.self_attn.q_norm | 26.45 | 286 | 0.0925 | 0.26% |
| model.layers.0.self_attn.k_norm | 25.05 | 286 | 0.0876 | 0.25% |
| model.layers.1.self_attn.q_norm | 24.55 | 286 | 0.0859 | 0.24% |
| (other modules) | 1684.22 | --- | --- | 16.58% |
| Total | 10159.30 | | | |
Experiment 2: Long Prompt
- Prompt length: 293 tokens
- Generated tokens: 55

| Phase | Time (ms) | Share | Throughput (tokens/s) |
| --- | --- | --- | --- |
| Prefill | 80.42 | 2.76% | 3643.30 |
| Decode | 2838.66 | 97.24% | 19.02 (52.57 ms/token) |
| Total | 2919.08 | 100% | 18.84 (overall) |
Prefill Module Breakdown

| Module | Total (ms) | Calls | Avg (ms) | Share |
| --- | --- | --- | --- | --- |
| FFN.gate_proj | 9.46 | 36 | 0.2628 | 16.20% |
| FFN.up_proj | 9.07 | 36 | 0.2519 | 15.53% |
| FFN.down_proj | 9.04 | 36 | 0.2510 | 15.48% |
| Attention.Q_proj | 4.34 | 36 | 0.1204 | 7.43% |
| Attention.O_proj | 4.12 | 36 | 0.1144 | 7.05% |
| LayerNorm.post_attn | 3.59 | 36 | 0.0997 | 6.15% |
| LM_Head | 3.51 | 1 | 3.5124 | 6.02% |
| LayerNorm.pre_attn | 3.45 | 36 | 0.0958 | 5.90% |
| Attention.K_proj | 1.84 | 36 | 0.0510 | 3.15% |
| Attention.V_proj | 1.75 | 36 | 0.0486 | 3.00% |
| FFN.activation | 0.75 | 36 | 0.0209 | 1.29% |
| Embedding | 0.59 | 1 | 0.5937 | 1.02% |
| Attention.RoPE | 0.33 | 1 | 0.3269 | 0.56% |
| model.layers.18.self_attn.q_norm | 0.32 | 1 | 0.3173 | 0.54% |
| model.layers.0.self_attn.q_norm | 0.11 | 1 | 0.1147 | 0.20% |
| (other modules) | 6.13 | --- | --- | 10.49% |
| Total | 58.39 | | | |
Decode Module Breakdown (averaged per step)

| Module | Total (ms) | Calls | Avg (ms) | Share |
| --- | --- | --- | --- | --- |
| FFN.gate_proj | 253.83 | 1944 | 0.1306 | 13.20% |
| FFN.down_proj | 253.02 | 1944 | 0.1302 | 13.16% |
| FFN.up_proj | 249.11 | 1944 | 0.1281 | 12.95% |
| LayerNorm.post_attn | 172.15 | 1944 | 0.0886 | 8.95% |
| LayerNorm.pre_attn | 167.13 | 1944 | 0.0860 | 8.69% |
| Attention.O_proj | 122.74 | 1944 | 0.0631 | 6.38% |
| Attention.Q_proj | 119.50 | 1944 | 0.0615 | 6.21% |
| LM_Head | 71.69 | 54 | 1.3276 | 3.73% |
| Attention.K_proj | 65.91 | 1944 | 0.0339 | 3.43% |
| Attention.V_proj | 64.80 | 1944 | 0.0333 | 3.37% |
| FFN.activation | 36.04 | 1944 | 0.0185 | 1.87% |
| Attention.RoPE | 12.04 | 54 | 0.2229 | 0.63% |
| model.layers.9.self_attn.q_norm | 4.99 | 54 | 0.0924 | 0.26% |
| model.layers.0.self_attn.q_norm | 4.97 | 54 | 0.0921 | 0.26% |
| model.layers.30.self_attn.q_norm | 4.93 | 54 | 0.0914 | 0.26% |
| (other modules) | 320.45 | --- | --- | 16.66% |
| Total | 1923.32 | | | |
3. Comparison and Conclusions
Data Comparison

| Metric | Experiment 1 (short prompt) | Experiment 2 (long prompt) | Ratio |
| --- | --- | --- | --- |
| Prompt length | 7 | 293 | 41.9x |
| Prefill time (ms) | 70.87 | 80.42 | 1.1x |
| Avg decode time (ms/token) | 52.33 | 52.57 | 1.00x |
| Generated tokens | 287 | 55 | - |

Key Conclusions
- Prefill is compute-bound
  - Growing the prompt from 7 to 293 tokens (about 42x) increased prefill time by only 1.1x. The whole prompt is processed in one parallel forward pass, so at these lengths the fixed per-pass overhead dominates and the extra tokens add almost nothing; the high prefill throughput on the long prompt (3,643 tokens/s) reflects this parallelism.
- Decode is memory-bound
  - Regardless of prompt length, the average time to generate a single token is essentially constant (52.33 ms vs. 52.57 ms). The decode bottleneck is streaming the model weights and KV cache from GPU memory, not the matrix multiplications themselves.
- Decode is the dominant cost of inference
  - In both experiments decode accounts for nearly all of the total time (99.5% with the short prompt, 97.2% with the long one). This is the typical performance profile of autoregressive models generating long sequences. A back-of-the-envelope check of the first two conclusions is sketched below.
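
The following sketch puts rough numbers behind conclusions 1 and 2. The parameter count (~8.2B), fp16 weight size, and the nominal RTX 4090 peak figures are assumptions rather than measurements; only the timings come from the tables above.

```python
# Rough roofline-style estimates behind conclusions 1 and 2.
# Assumptions (not measured here): ~8.2e9 parameters stored in fp16 (2 bytes each),
# RTX 4090 nominal peaks of roughly 165 TFLOPS (fp16 tensor cores) and ~1008 GB/s DRAM bandwidth.
PARAMS = 8.2e9
WEIGHT_BYTES = PARAMS * 2  # fp16

# Prefill: roughly 2 * params * prompt_len FLOPs per forward pass (ignoring attention terms).
for prompt_len, prefill_s in [(7, 0.07087), (293, 0.08042)]:
    achieved_tflops = 2 * PARAMS * prompt_len / prefill_s / 1e12
    print(f"prefill {prompt_len:>3} tokens: ~{achieved_tflops:.1f} TFLOPS achieved")
# ~1.6 TFLOPS for 7 tokens vs ~60 TFLOPS for 293: the GPU is far from saturated on the
# short prompt, which is why ~42x more tokens costs only ~1.1x more time.

# Decode: every step reads all weights at least once, so weight traffic sets a floor.
decode_s_per_token = 0.05233
print(f"effective weight-read bandwidth: ~{WEIGHT_BYTES / decode_s_per_token / 1e9:.0f} GB/s")
print(f"bandwidth-only floor per token:  ~{WEIGHT_BYTES / 1008e9 * 1000:.1f} ms")
# ~313 GB/s effective and a ~16 ms/token floor: compute is nowhere near its limit, so
# per-token latency is governed by memory traffic (plus the per-module sync/hook overhead
# added by the profiler), and it barely changes with prompt length.
```

Reaching roughly a third of the nominal fp16 peak at only 293 prompt tokens is consistent with prefill becoming compute-limited as prompts grow, while the decode numbers show the opposite, bandwidth-limited regime.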
4. Appendix
Click to view the test code
```python
#!/usr/bin/env python3
import os
import torch
import time
from collections import defaultdict
os.environ['HF_HUB_OFFLINE'] = '1'
os.environ['TRANSFORMERS_OFFLINE'] = '1'
from transformers import AutoModelForCausalLM, AutoTokenizer
print("="*80)
print("加载模型和分词器...")
print("="*80)
model_path = "./Qwen/Qwen3-8B-Base/"
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(
model_path,
local_files_only=True,
torch_dtype=torch.float16,
device_map="cuda:0"
)
print(f"模型加载完成, 设备: {model.device}")
print(f"词表大小: {tokenizer.vocab_size}")
# ============================================
# Revised version: more precise per-module profiling
# ============================================
class ModuleProfiler:
"""改进版:避免父子模块重复计时"""
def __init__(self, model):
self.model = model
self.module_times = defaultdict(list)
self.hooks = []
        self.active_modules = []  # stack of currently-active modules
def register_hooks(self):
"""只为叶子模块注册 hooks"""
# 识别所有叶子模块(没有子模块的模块)
leaf_modules = {}
for name, module in self.model.named_modules():
            if len(list(module.children())) == 0:  # leaf module
leaf_modules[id(module)] = name
def pre_forward_hook(module, input):
if id(module) in leaf_modules:
torch.cuda.synchronize()
self.active_modules.append({
'name': leaf_modules[id(module)],
'start': time.time()
})
def forward_hook(module, input, output):
if id(module) in leaf_modules and self.active_modules:
torch.cuda.synchronize()
module_info = self.active_modules.pop()
elapsed = time.time() - module_info['start']
                # simplify the module name
name = self._simplify_name(module_info['name'])
self.module_times[name].append(elapsed)
        # register hooks
for name, module in self.model.named_modules():
if id(module) in leaf_modules:
h1 = module.register_forward_pre_hook(pre_forward_hook)
h2 = module.register_forward_hook(forward_hook)
self.hooks.append(h1)
self.hooks.append(h2)
def _simplify_name(self, full_name):
"""简化模块名称"""
if 'embed_tokens' in full_name:
return 'Embedding'
elif 'lm_head' in full_name:
return 'LM_Head'
elif 'model.norm' in full_name:
return 'Final_LayerNorm'
        # attention-related
elif 'self_attn.q_proj' in full_name:
return 'Attention.Q_proj'
elif 'self_attn.k_proj' in full_name:
return 'Attention.K_proj'
elif 'self_attn.v_proj' in full_name:
return 'Attention.V_proj'
elif 'self_attn.o_proj' in full_name:
return 'Attention.O_proj'
elif 'rotary_emb' in full_name:
return 'Attention.RoPE'
        # FFN-related
elif 'mlp.gate_proj' in full_name:
return 'FFN.gate_proj'
elif 'mlp.up_proj' in full_name:
return 'FFN.up_proj'
elif 'mlp.down_proj' in full_name:
return 'FFN.down_proj'
elif 'mlp.act_fn' in full_name:
return 'FFN.activation'
# LayerNorm
elif 'input_layernorm' in full_name:
return 'LayerNorm.pre_attn'
elif 'post_attention_layernorm' in full_name:
return 'LayerNorm.post_attn'
return full_name
def remove_hooks(self):
for hook in self.hooks:
hook.remove()
self.hooks = []
def reset(self):
self.module_times = defaultdict(list)
self.active_modules = []
    def report(self, title="Module performance stats", top_n=15):
print(f"\n{'='*90}")
print(f"{title}")
print(f"{'='*90}")
        # aggregate statistics per module
aggregated = {}
for name, times in self.module_times.items():
total = sum(times)
count = len(times)
aggregated[name] = {
'total': total,
'count': count,
'avg': total / count if count > 0 else 0
}
total_time = sum(s['total'] for s in aggregated.values())
        # sort by total time
sorted_items = sorted(aggregated.items(), key=lambda x: x[1]['total'], reverse=True)
print(f"{'模块':<35} {'总时间(ms)':<12} {'调用次数':<10} {'平均(ms)':<12} {'占比':<8}")
print("-"*90)
for name, stats in sorted_items[:top_n]:
pct = (stats['total'] / total_time * 100) if total_time > 0 else 0
print(f"{name:<35} {stats['total']*1000:<12.2f} {stats['count']:<10} "
f"{stats['avg']*1000:<12.4f} {pct:<8.2f}%")
if len(sorted_items) > top_n:
other_time = sum(s['total'] for _, s in sorted_items[top_n:])
other_pct = (other_time / total_time * 100) if total_time > 0 else 0
print(f"{'(其他模块)':<35} {other_time*1000:<12.2f} {'---':<10} "
f"{'---':<12} {other_pct:<8.2f}%")
print("-"*90)
print(f"{'总计':<35} {total_time*1000:<12.2f}")
print()
# ============================================
# Warmup: remove first-run CUDA overhead
# ============================================
print("\n" + "="*80)
print("Warmup: 预热 GPU(消除 CUDA 编译开销)")
print("="*80)
warmup_prompt = "测试"
warmup_inputs = tokenizer(warmup_prompt, return_tensors="pt").to(model.device)
print("进行 3 次 warmup...")
for i in range(3):
with torch.no_grad():
_ = model(**warmup_inputs, use_cache=True)
print(f" Warmup {i+1}/3 完成")
torch.cuda.synchronize()
print("Warmup 完成!\n")
# ============================================
# Profiling helper
# ============================================
def profile_generation(model, tokenizer, prompt, max_new_tokens, profiler, experiment_name):
"""带性能分析的生成"""
print(f"\n{'='*80}")
print(f"{experiment_name}")
print(f"{'='*80}\n")
# Tokenize
torch.cuda.synchronize()
t0 = time.time()
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
torch.cuda.synchronize()
tokenize_time = time.time() - t0
input_ids = inputs['input_ids']
prompt_len = input_ids.shape[1]
print(f"Prompt 长度: {prompt_len} tokens")
print(f"Max new tokens: {max_new_tokens}\n")
# ============================================
# Prefill
# ============================================
print("--- Prefill 阶段 ---")
profiler.reset()
torch.cuda.synchronize()
t_prefill = time.time()
with torch.no_grad():
outputs = model(input_ids=input_ids, use_cache=True)
torch.cuda.synchronize()
prefill_time = time.time() - t_prefill
print(f"Prefill 时间: {prefill_time*1000:.2f} ms")
print(f"吞吐量: {prompt_len/prefill_time:.2f} tokens/s")
profiler.report(f"{experiment_name} - Prefill 模块分析")
# ============================================
# Decode
# ============================================
print("--- Decode 阶段 ---")
profiler.reset()
past_key_values = outputs.past_key_values
logits = outputs.logits[:, -1, :]
next_token_id = torch.argmax(logits, dim=-1).item()
generated_tokens = [next_token_id]
current_token = torch.tensor([[next_token_id]], device=model.device)
decode_times = []
for step in range(1, max_new_tokens):
torch.cuda.synchronize()
t_step = time.time()
with torch.no_grad():
outputs = model(
input_ids=current_token,
past_key_values=past_key_values,
use_cache=True
)
torch.cuda.synchronize()
decode_times.append(time.time() - t_step)
logits = outputs.logits[:, -1, :]
past_key_values = outputs.past_key_values
next_token_id = torch.argmax(logits, dim=-1).item()
generated_tokens.append(next_token_id)
current_token = torch.tensor([[next_token_id]], device=model.device)
if next_token_id == tokenizer.eos_token_id:
print(f"遇到 EOS,停止于第 {step+1} 步")
break
avg_decode = sum(decode_times) / len(decode_times) * 1000
total_decode = sum(decode_times) * 1000
print(f"生成 token 数: {len(generated_tokens)}")
print(f"平均 decode 时间: {avg_decode:.2f} ms/token")
print(f"Decode 吞吐量: {1000/avg_decode:.2f} tokens/s")
profiler.report(f"{experiment_name} - Decode 模块分析(平均每步)")
    # decode the generated text
generated_text = tokenizer.decode(generated_tokens, skip_special_tokens=True)
    # summary
total_time = prefill_time + sum(decode_times)
print(f"\n{'='*80}")
print(f"{experiment_name} - 总结")
print(f"{'='*80}")
print(f"Tokenization: {tokenize_time*1000:>10.2f} ms ({tokenize_time/total_time*100:>5.2f}%)")
print(f"Prefill: {prefill_time*1000:>10.2f} ms ({prefill_time/total_time*100:>5.2f}%)")
print(f"Decode: {total_decode:>10.2f} ms ({sum(decode_times)/total_time*100:>5.2f}%)")
print(f"{'总计:':<15} {total_time*1000:>10.2f} ms")
print(f"\n整体吞吐量: {len(generated_tokens)/total_time:.2f} tokens/s")
print(f"生成文本(前150字符): {generated_text[:150]}")
print()
return generated_text, {
'prefill_time': prefill_time,
'decode_time': sum(decode_times),
'avg_decode': avg_decode,
'prompt_len': prompt_len,
'generated_len': len(generated_tokens)
}
# ============================================
# Experiments
# ============================================
profiler = ModuleProfiler(model)
profiler.register_hooks()
# Experiment 1
prompt1 = "莎士比亚是哪国人?"  # "Which country was Shakespeare from?"
text1, stats1 = profile_generation(
    model, tokenizer, prompt1, 500, profiler, "Experiment 1: Short Prompt"
)
# Experiment 2
prompt2 = prompt1 + text1
text2, stats2 = profile_generation(
    model, tokenizer, prompt2, 500, profiler, "Experiment 2: Long Prompt"
)
profiler.remove_hooks()
# ============================================
# Comparison
# ============================================
print("\n" + "="*80)
print("实验对比分析")
print("="*80)
print(f"\n{'指标':<30} {'实验1 (短prompt)':<20} {'实验2 (长prompt)':<20} {'比值':<10}")
print("-"*80)
print(f"{'Prompt 长度':<30} {stats1['prompt_len']:<20} {stats2['prompt_len']:<20} "
f"{stats2['prompt_len']/stats1['prompt_len']:<10.1f}x")
print(f"{'Prefill 时间 (ms)':<30} {stats1['prefill_time']*1000:<20.2f} "
f"{stats2['prefill_time']*1000:<20.2f} "
f"{stats2['prefill_time']/stats1['prefill_time']:<10.1f}x")
print(f"{'Decode 平均时间 (ms/token)':<30} {stats1['avg_decode']:<20.2f} "
f"{stats2['avg_decode']:<20.2f} "
f"{stats2['avg_decode']/stats1['avg_decode']:<10.2f}x")
print(f"{'生成 token 数':<30} {stats1['generated_len']:<20} {stats2['generated_len']:<20}")
print("\n" + "="*80)
print("关键结论")
print("="*80)
print("1. Prefill 是 compute-bound:")
print(f" - 时间随 prompt 长度线性增长({stats2['prefill_time']/stats1['prefill_time']:.1f}x)")
print("\n2. Decode 是 memory-bound:")
print(f" - 平均时间几乎不变({stats2['avg_decode']/stats1['avg_decode']:.2f}x,接近1.0)")
print(f" - 瓶颈在内存带宽,不在计算")
print("\n3. 主要耗时:")
print(f" - 实验1: Decode占 {stats1['decode_time']/(stats1['prefill_time']+stats1['decode_time'])*100:.1f}%")
print(f" - 实验2: Decode占 {stats2['decode_time']/(stats2['prefill_time']+stats2['decode_time'])*100:.1f}%")
print("="*80)