Qwen3-8B-Base Performance Test Notes (Full Data)

1. Test Environment

- Model: Qwen3-8B-Base
- Device: NVIDIA GeForce RTX 4090 (48 GB VRAM variant; see the quick check below)
- Vocabulary size: 151,643
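
The device and VRAM figures can be verified with a couple of standard torch calls. This is a minimal convenience check added here, not part of the original test script; the vocabulary size is reported by `tokenizer.vocab_size` in the appendix script.

```python
# Quick environment check (not part of the original test script).
import torch

print(torch.cuda.get_device_name(0))  # GPU model name
print(f"{torch.cuda.get_device_properties(0).total_memory / 2**30:.1f} GiB")  # total VRAM
```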

2. Experiment Summary

Experiment 1: Short Prompt

- Prompt length: 7 tokens
- Tokens generated: 287

| Phase | Time (ms) | Share | Throughput (tokens/s) |
|---|---|---|---|
| Prefill | 70.87 | 0.47% | 98.77 |
| Decode | 14966.21 | 99.53% | 19.11 (52.33 ms/token) |
| Total | 15037.09 | 100% | 19.09 (overall) |
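
As a quick sanity check, the figures in this table are mutually consistent. This is plain arithmetic on the reported numbers; the divisor of 286 reflects that the first of the 287 generated tokens comes out of the prefill pass, which matches the 286 LM_Head calls in the decode breakdown below.

```python
# Recomputing the experiment-1 summary numbers from the raw timings above.
prefill_ms, decode_ms = 70.87, 14966.21
prompt_tokens, generated_tokens = 7, 287
decode_steps = generated_tokens - 1  # the first token is produced during prefill

print(decode_ms / decode_steps)                            # ~52.33 ms/token
print(1000 * decode_steps / decode_ms)                     # ~19.11 tokens/s (decode)
print(1000 * prompt_tokens / prefill_ms)                   # ~98.77 tokens/s (prefill)
print(1000 * generated_tokens / (prefill_ms + decode_ms))  # ~19.09 tokens/s overall
```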

Prefill Phase Module Breakdown

(The module timings are collected by leaf-module hooks, so work that does not run inside a hooked leaf module, e.g. the attention score computation, is not counted; the table totals are therefore lower than the wall-clock phase times above.)

| Module | Total (ms) | Calls | Avg (ms) | Share |
|---|---|---|---|---|
| Attention.K_proj | 12.27 | 36 | 0.3409 | 24.61% |
| FFN.down_proj | 5.23 | 36 | 0.1452 | 10.48% |
| FFN.gate_proj | 5.07 | 36 | 0.1408 | 10.17% |
| FFN.up_proj | 4.84 | 36 | 0.1345 | 9.71% |
| LayerNorm.post_attn | 3.34 | 36 | 0.0929 | 6.71% |
| LayerNorm.pre_attn | 3.23 | 36 | 0.0897 | 6.48% |
| Attention.Q_proj | 2.98 | 36 | 0.0827 | 5.97% |
| Attention.O_proj | 2.49 | 36 | 0.0692 | 5.00% |
| Attention.V_proj | 1.39 | 36 | 0.0386 | 2.79% |
| LM_Head | 1.37 | 1 | 1.3707 | 2.75% |
| FFN.activation | 0.72 | 36 | 0.0201 | 1.45% |
| Attention.RoPE | 0.40 | 1 | 0.3974 | 0.80% |
| model.layers.0.self_attn.k_norm | 0.16 | 1 | 0.1569 | 0.31% |
| model.layers.0.self_attn.q_norm | 0.14 | 1 | 0.1373 | 0.28% |
| model.layers.7.self_attn.k_norm | 0.13 | 1 | 0.1290 | 0.26% |
| (other modules) | 6.10 | --- | --- | 12.24% |
| Total | 49.86 | | | |

Decode Phase Module Breakdown (totals across all decode steps)

| Module | Total (ms) | Calls | Avg (ms) | Share |
|---|---|---|---|---|
| FFN.gate_proj | 1347.96 | 10296 | 0.1309 | 13.27% |
| FFN.down_proj | 1340.25 | 10296 | 0.1302 | 13.19% |
| FFN.up_proj | 1319.39 | 10296 | 0.1281 | 12.99% |
| LayerNorm.post_attn | 902.26 | 10296 | 0.0876 | 8.88% |
| LayerNorm.pre_attn | 877.46 | 10296 | 0.0852 | 8.64% |
| Attention.O_proj | 652.89 | 10296 | 0.0634 | 6.43% |
| Attention.Q_proj | 632.38 | 10296 | 0.0614 | 6.22% |
| LM_Head | 385.11 | 286 | 1.3465 | 3.79% |
| Attention.K_proj | 346.52 | 10296 | 0.0337 | 3.41% |
| Attention.V_proj | 342.32 | 10296 | 0.0332 | 3.37% |
| FFN.activation | 189.11 | 10296 | 0.0184 | 1.86% |
| Attention.RoPE | 63.39 | 286 | 0.2216 | 0.62% |
| model.layers.0.self_attn.q_norm | 26.45 | 286 | 0.0925 | 0.26% |
| model.layers.0.self_attn.k_norm | 25.05 | 286 | 0.0876 | 0.25% |
| model.layers.1.self_attn.q_norm | 24.55 | 286 | 0.0859 | 0.24% |
| (other modules) | 1684.22 | --- | --- | 16.58% |
| Total | 10159.30 | | | |
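
To read the decode table at a glance, here is a small grouping of its rows by component. This is plain arithmetic on the totals above; the unusually high LayerNorm share likely reflects the per-module synchronization overhead of the profiling hooks rather than real kernel cost.

```python
# Grouping the experiment-1 decode breakdown by category (ms totals from the table above).
groups = {
    "FFN projections":       1347.96 + 1340.25 + 1319.39,
    "Attention projections":  632.38 + 346.52 + 342.32 + 652.89,
    "LayerNorm":              877.46 + 902.26,
    "LM head":                385.11,
}
hooked_total = 10159.30
for name, ms in groups.items():
    print(f"{name:<22} {ms:8.2f} ms  {100 * ms / hooked_total:5.1f}%")
# FFN projections        4007.60 ms   39.4%
# Attention projections  1974.11 ms   19.4%
# LayerNorm              1779.72 ms   17.5%
# LM head                 385.11 ms    3.8%
```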

Experiment 2: Long Prompt

- Prompt length: 293 tokens
- Tokens generated: 55

| Phase | Time (ms) | Share | Throughput (tokens/s) |
|---|---|---|---|
| Prefill | 80.42 | 2.76% | 3643.30 |
| Decode | 2838.66 | 97.24% | 19.02 (52.57 ms/token) |
| Total | 2919.08 | 100% | 18.84 (overall) |
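
The contrast with experiment 1 is easiest to see by putting the two prefill rows side by side: almost the same wall-clock time now covers a 42x longer prompt, so throughput scales by roughly that factor. The figures below are taken directly from the two phase tables above.

```python
# Prefill latency vs. throughput across the two experiments (numbers from the tables above).
short = {"tokens": 7,   "prefill_ms": 70.87}
long_ = {"tokens": 293, "prefill_ms": 80.42}

for run in (short, long_):
    run["tps"] = 1000 * run["tokens"] / run["prefill_ms"]
    print(f'{run["tokens"]:>3} tokens -> {run["tps"]:8.2f} tokens/s')

print(f'latency ratio:    {long_["prefill_ms"] / short["prefill_ms"]:.2f}x')  # ~1.13x
print(f'throughput ratio: {long_["tps"] / short["tps"]:.1f}x')                # ~36.9x
```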

Prefill Phase Module Breakdown

| Module | Total (ms) | Calls | Avg (ms) | Share |
|---|---|---|---|---|
| FFN.gate_proj | 9.46 | 36 | 0.2628 | 16.20% |
| FFN.up_proj | 9.07 | 36 | 0.2519 | 15.53% |
| FFN.down_proj | 9.04 | 36 | 0.2510 | 15.48% |
| Attention.Q_proj | 4.34 | 36 | 0.1204 | 7.43% |
| Attention.O_proj | 4.12 | 36 | 0.1144 | 7.05% |
| LayerNorm.post_attn | 3.59 | 36 | 0.0997 | 6.15% |
| LM_Head | 3.51 | 1 | 3.5124 | 6.02% |
| LayerNorm.pre_attn | 3.45 | 36 | 0.0958 | 5.90% |
| Attention.K_proj | 1.84 | 36 | 0.0510 | 3.15% |
| Attention.V_proj | 1.75 | 36 | 0.0486 | 3.00% |
| FFN.activation | 0.75 | 36 | 0.0209 | 1.29% |
| Embedding | 0.59 | 1 | 0.5937 | 1.02% |
| Attention.RoPE | 0.33 | 1 | 0.3269 | 0.56% |
| model.layers.18.self_attn.q_norm | 0.32 | 1 | 0.3173 | 0.54% |
| model.layers.0.self_attn.q_norm | 0.11 | 1 | 0.1147 | 0.20% |
| (other modules) | 6.13 | --- | --- | 10.49% |
| Total | 58.39 | | | |

Decode Phase Module Breakdown (totals across all decode steps)

| Module | Total (ms) | Calls | Avg (ms) | Share |
|---|---|---|---|---|
| FFN.gate_proj | 253.83 | 1944 | 0.1306 | 13.20% |
| FFN.down_proj | 253.02 | 1944 | 0.1302 | 13.16% |
| FFN.up_proj | 249.11 | 1944 | 0.1281 | 12.95% |
| LayerNorm.post_attn | 172.15 | 1944 | 0.0886 | 8.95% |
| LayerNorm.pre_attn | 167.13 | 1944 | 0.0860 | 8.69% |
| Attention.O_proj | 122.74 | 1944 | 0.0631 | 6.38% |
| Attention.Q_proj | 119.50 | 1944 | 0.0615 | 6.21% |
| LM_Head | 71.69 | 54 | 1.3276 | 3.73% |
| Attention.K_proj | 65.91 | 1944 | 0.0339 | 3.43% |
| Attention.V_proj | 64.80 | 1944 | 0.0333 | 3.37% |
| FFN.activation | 36.04 | 1944 | 0.0185 | 1.87% |
| Attention.RoPE | 12.04 | 54 | 0.2229 | 0.63% |
| model.layers.9.self_attn.q_norm | 4.99 | 54 | 0.0924 | 0.26% |
| model.layers.0.self_attn.q_norm | 4.97 | 54 | 0.0921 | 0.26% |
| model.layers.30.self_attn.q_norm | 4.93 | 54 | 0.0914 | 0.26% |
| (other modules) | 320.45 | --- | --- | 16.66% |
| Total | 1923.32 | | | |

3. Comparison and Conclusions

Data Comparison

| Metric | Experiment 1 (short prompt) | Experiment 2 (long prompt) | Ratio |
|---|---|---|---|
| Prompt length | 7 | 293 | 41.9x |
| Prefill time (ms) | 70.87 | 80.42 | 1.1x |
| Avg decode time (ms/token) | 52.33 | 52.57 | 1.00x |
| Tokens generated | 287 | 55 | - |

Key Conclusions

1. Prefill is compute-bound

   - When the prompt grows from 7 to 293 tokens (about 42x), prefill time increases by only about 1.1x. For a short prompt the fixed per-forward overhead (kernel launches, framework and profiling-hook costs) dominates; once that overhead is amortized, prefill time is expected to grow roughly linearly with prompt length. The high throughput on the long prompt (3643 tokens/s) also shows how efficiently prefill exploits the GPU's parallelism.

2. Decode is memory-bound

   - Regardless of prompt length, the average time to generate a single token is essentially constant (52.33 ms vs 52.57 ms). The decode bottleneck is streaming the model weights and KV cache from GPU memory every step, not the matrix math itself; a rough sanity check of points 1 and 2 follows this list.

3. The main inference bottleneck is decode

   - In both experiments the decode phase accounts for the overwhelming majority of total time (99.5% for the short prompt, 97.2% for the long one). This is the typical performance profile of an autoregressive model generating a long sequence.
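
A minimal back-of-the-envelope sketch of points 1 and 2, assuming roughly 8.2 billion FP16 parameters and the nominal specs of a stock RTX 4090 (about 1008 GB/s memory bandwidth, on the order of 100 TFLOPS of sustained FP16 tensor throughput). These figures are assumptions rather than measurements from this test, and the modded 48 GB card may deviate from them.

```python
# Rough sanity check of the compute-bound / memory-bound conclusions.
# Assumed figures (NOT measured here): ~8.2e9 FP16 parameters, ~1008 GB/s memory
# bandwidth, ~100 TFLOPS of sustained FP16 tensor throughput on the RTX 4090.
PARAMS = 8.2e9
BYTES_PER_PARAM = 2        # float16
MEM_BW = 1008e9            # bytes / s
SUSTAINED_FP16 = 100e12    # FLOP / s (deliberately well below peak)

# 1) Prefill, 7-token prompt: roughly 2 FLOPs per parameter per token.
prefill_flops = 2 * PARAMS * 7
print(f"7-token prefill, compute only: ~{prefill_flops / SUSTAINED_FP16 * 1e3:.2f} ms")
# ~1 ms of actual math vs. the 70.87 ms measured: the short-prompt prefill time is
# almost entirely fixed overhead (kernel launches, Python, the profiler's synchronizing
# hooks), which is why a 42x longer prompt costs only ~1.1x the time.

# 2) Decode: every step must stream all the weights from GPU memory at least once.
weight_bytes = PARAMS * BYTES_PER_PARAM
print(f"decode bandwidth floor: ~{weight_bytes / MEM_BW * 1e3:.1f} ms/token")
# ~16 ms/token from weight traffic alone, independent of prompt length. The measured
# ~52 ms/token sits above that floor (the hooks add two cuda synchronizations per leaf
# module per step), but it shows the memory-bound signature: per-token latency that
# does not move when the context gets 42x longer.
```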

4. Appendix

Test code:

```python
#!/usr/bin/env python3
import os
import torch
import time
from collections import defaultdict

# Force offline mode so the run never hits the Hugging Face Hub.
os.environ['HF_HUB_OFFLINE'] = '1'
os.environ['TRANSFORMERS_OFFLINE'] = '1'

from transformers import AutoModelForCausalLM, AutoTokenizer

print("="*80)
print("Loading model and tokenizer...")
print("="*80)

model_path = "./Qwen/Qwen3-8B-Base/"
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    local_files_only=True,
    torch_dtype=torch.float16,
    device_map="cuda:0"
)
print(f"Model loaded, device: {model.device}")
print(f"Vocabulary size: {tokenizer.vocab_size}")


# ============================================
# Revised: more precise per-module profiling
# ============================================
class ModuleProfiler:
    """Times leaf modules only, to avoid double-counting parents and children."""

    def __init__(self, model):
        self.model = model
        self.module_times = defaultdict(list)
        self.hooks = []
        self.active_modules = []  # stack of modules currently being timed

    def register_hooks(self):
        """Register pre/post forward hooks on leaf modules only."""
        # Collect all leaf modules (modules with no children).
        leaf_modules = {}
        for name, module in self.model.named_modules():
            if len(list(module.children())) == 0:  # leaf module
                leaf_modules[id(module)] = name

        def pre_forward_hook(module, input):
            if id(module) in leaf_modules:
                torch.cuda.synchronize()
                self.active_modules.append({
                    'name': leaf_modules[id(module)],
                    'start': time.time()
                })

        def forward_hook(module, input, output):
            if id(module) in leaf_modules and self.active_modules:
                torch.cuda.synchronize()
                module_info = self.active_modules.pop()
                elapsed = time.time() - module_info['start']
                # Map the full module path to a short label.
                name = self._simplify_name(module_info['name'])
                self.module_times[name].append(elapsed)

        # Register the hooks.
        for name, module in self.model.named_modules():
            if id(module) in leaf_modules:
                h1 = module.register_forward_pre_hook(pre_forward_hook)
                h2 = module.register_forward_hook(forward_hook)
                self.hooks.append(h1)
                self.hooks.append(h2)

    def _simplify_name(self, full_name):
        """Collapse per-layer module paths into readable category names."""
        if 'embed_tokens' in full_name:
            return 'Embedding'
        elif 'lm_head' in full_name:
            return 'LM_Head'
        elif 'model.norm' in full_name:
            return 'Final_LayerNorm'
        # Attention
        elif 'self_attn.q_proj' in full_name:
            return 'Attention.Q_proj'
        elif 'self_attn.k_proj' in full_name:
            return 'Attention.K_proj'
        elif 'self_attn.v_proj' in full_name:
            return 'Attention.V_proj'
        elif 'self_attn.o_proj' in full_name:
            return 'Attention.O_proj'
        elif 'rotary_emb' in full_name:
            return 'Attention.RoPE'
        # FFN
        elif 'mlp.gate_proj' in full_name:
            return 'FFN.gate_proj'
        elif 'mlp.up_proj' in full_name:
            return 'FFN.up_proj'
        elif 'mlp.down_proj' in full_name:
            return 'FFN.down_proj'
        elif 'mlp.act_fn' in full_name:
            return 'FFN.activation'
        # LayerNorm
        elif 'input_layernorm' in full_name:
            return 'LayerNorm.pre_attn'
        elif 'post_attention_layernorm' in full_name:
            return 'LayerNorm.post_attn'
        return full_name

    def remove_hooks(self):
        for hook in self.hooks:
            hook.remove()
        self.hooks = []

    def reset(self):
        self.module_times = defaultdict(list)
        self.active_modules = []

    def report(self, title="Module performance statistics", top_n=15):
        print(f"\n{'='*90}")
        print(f"{title}")
        print(f"{'='*90}")
        # Aggregate per-module totals, call counts and averages.
        aggregated = {}
        for name, times in self.module_times.items():
            total = sum(times)
            count = len(times)
            aggregated[name] = {
                'total': total,
                'count': count,
                'avg': total / count if count > 0 else 0
            }
        total_time = sum(s['total'] for s in aggregated.values())
        # Sort by total time, descending.
        sorted_items = sorted(aggregated.items(), key=lambda x: x[1]['total'], reverse=True)
        print(f"{'Module':<35} {'Total (ms)':<12} {'Calls':<10} {'Avg (ms)':<12} {'Share':<8}")
        print("-"*90)
        for name, stats in sorted_items[:top_n]:
            pct = (stats['total'] / total_time * 100) if total_time > 0 else 0
            print(f"{name:<35} {stats['total']*1000:<12.2f} {stats['count']:<10} "
                  f"{stats['avg']*1000:<12.4f} {pct:<8.2f}%")
        if len(sorted_items) > top_n:
            other_time = sum(s['total'] for _, s in sorted_items[top_n:])
            other_pct = (other_time / total_time * 100) if total_time > 0 else 0
            print(f"{'(other modules)':<35} {other_time*1000:<12.2f} {'---':<10} "
                  f"{'---':<12} {other_pct:<8.2f}%")
        print("-"*90)
        print(f"{'Total':<35} {total_time*1000:<12.2f}")
        print()


# ============================================
# Warmup: remove first-run CUDA overhead
# ============================================
print("\n" + "="*80)
print("Warmup: warming up the GPU (removes one-time CUDA overhead)")
print("="*80)
warmup_prompt = "测试"  # a short throwaway prompt ("test")
warmup_inputs = tokenizer(warmup_prompt, return_tensors="pt").to(model.device)
print("Running 3 warmup passes...")
for i in range(3):
    with torch.no_grad():
        _ = model(**warmup_inputs, use_cache=True)
    print(f"  Warmup {i+1}/3 done")
torch.cuda.synchronize()
print("Warmup done!\n")


# ============================================
# Profiled generation
# ============================================
def profile_generation(model, tokenizer, prompt, max_new_tokens, profiler, experiment_name):
    """Greedy generation with per-phase and per-module timing."""
    print(f"\n{'='*80}")
    print(f"{experiment_name}")
    print(f"{'='*80}\n")

    # Tokenize
    torch.cuda.synchronize()
    t0 = time.time()
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.synchronize()
    tokenize_time = time.time() - t0
    input_ids = inputs['input_ids']
    prompt_len = input_ids.shape[1]
    print(f"Prompt length: {prompt_len} tokens")
    print(f"Max new tokens: {max_new_tokens}\n")

    # ============================================
    # Prefill
    # ============================================
    print("--- Prefill phase ---")
    profiler.reset()
    torch.cuda.synchronize()
    t_prefill = time.time()
    with torch.no_grad():
        outputs = model(input_ids=input_ids, use_cache=True)
    torch.cuda.synchronize()
    prefill_time = time.time() - t_prefill
    print(f"Prefill time: {prefill_time*1000:.2f} ms")
    print(f"Throughput: {prompt_len/prefill_time:.2f} tokens/s")
    profiler.report(f"{experiment_name} - Prefill module breakdown")

    # ============================================
    # Decode
    # ============================================
    print("--- Decode phase ---")
    profiler.reset()
    past_key_values = outputs.past_key_values
    logits = outputs.logits[:, -1, :]
    next_token_id = torch.argmax(logits, dim=-1).item()
    generated_tokens = [next_token_id]
    current_token = torch.tensor([[next_token_id]], device=model.device)
    decode_times = []
    for step in range(1, max_new_tokens):
        torch.cuda.synchronize()
        t_step = time.time()
        with torch.no_grad():
            outputs = model(
                input_ids=current_token,
                past_key_values=past_key_values,
                use_cache=True
            )
        torch.cuda.synchronize()
        decode_times.append(time.time() - t_step)
        logits = outputs.logits[:, -1, :]
        past_key_values = outputs.past_key_values
        next_token_id = torch.argmax(logits, dim=-1).item()
        generated_tokens.append(next_token_id)
        current_token = torch.tensor([[next_token_id]], device=model.device)
        if next_token_id == tokenizer.eos_token_id:
            print(f"Hit EOS, stopping at step {step+1}")
            break
    avg_decode = sum(decode_times) / len(decode_times) * 1000
    total_decode = sum(decode_times) * 1000
    print(f"Generated tokens: {len(generated_tokens)}")
    print(f"Average decode time: {avg_decode:.2f} ms/token")
    print(f"Decode throughput: {1000/avg_decode:.2f} tokens/s")
    profiler.report(f"{experiment_name} - Decode module breakdown (totals across all decode steps)")

    # Decode the generated token ids back to text.
    generated_text = tokenizer.decode(generated_tokens, skip_special_tokens=True)

    # Summary
    total_time = prefill_time + sum(decode_times)
    print(f"\n{'='*80}")
    print(f"{experiment_name} - Summary")
    print(f"{'='*80}")
    print(f"Tokenization: {tokenize_time*1000:>10.2f} ms ({tokenize_time/total_time*100:>5.2f}%)")
    print(f"Prefill: {prefill_time*1000:>10.2f} ms ({prefill_time/total_time*100:>5.2f}%)")
    print(f"Decode: {total_decode:>10.2f} ms ({sum(decode_times)/total_time*100:>5.2f}%)")
    print(f"{'Total:':<15} {total_time*1000:>10.2f} ms")
    print(f"\nOverall throughput: {len(generated_tokens)/total_time:.2f} tokens/s")
    print(f"Generated text (first 150 chars): {generated_text[:150]}")
    print()
    return generated_text, {
        'prefill_time': prefill_time,
        'decode_time': sum(decode_times),
        'avg_decode': avg_decode,
        'prompt_len': prompt_len,
        'generated_len': len(generated_tokens)
    }


# ============================================
# Experiments
# ============================================
profiler = ModuleProfiler(model)
profiler.register_hooks()

# Experiment 1: short prompt ("Which country was Shakespeare from?",
# kept in Chinese so the token counts match the results above).
prompt1 = "莎士比亚是哪国人?"
text1, stats1 = profile_generation(
    model, tokenizer, prompt1, 500, profiler, "Experiment 1: Short Prompt"
)

# Experiment 2: long prompt (the short prompt plus the text it generated).
prompt2 = prompt1 + text1
text2, stats2 = profile_generation(
    model, tokenizer, prompt2, 500, profiler, "Experiment 2: Long Prompt"
)

profiler.remove_hooks()

# ============================================
# Comparison
# ============================================
print("\n" + "="*80)
print("Experiment comparison")
print("="*80)
print(f"\n{'Metric':<30} {'Exp 1 (short prompt)':<20} {'Exp 2 (long prompt)':<20} {'Ratio':<10}")
print("-"*80)
print(f"{'Prompt length':<30} {stats1['prompt_len']:<20} {stats2['prompt_len']:<20} "
      f"{stats2['prompt_len']/stats1['prompt_len']:<10.1f}x")
print(f"{'Prefill time (ms)':<30} {stats1['prefill_time']*1000:<20.2f} "
      f"{stats2['prefill_time']*1000:<20.2f} "
      f"{stats2['prefill_time']/stats1['prefill_time']:<10.1f}x")
print(f"{'Avg decode time (ms/token)':<30} {stats1['avg_decode']:<20.2f} "
      f"{stats2['avg_decode']:<20.2f} "
      f"{stats2['avg_decode']/stats1['avg_decode']:<10.2f}x")
print(f"{'Generated tokens':<30} {stats1['generated_len']:<20} {stats2['generated_len']:<20}")

print("\n" + "="*80)
print("Key conclusions")
print("="*80)
print("1. Prefill is compute-bound:")
print(f"   - prefill time grows only {stats2['prefill_time']/stats1['prefill_time']:.1f}x for a "
      f"{stats2['prompt_len']/stats1['prompt_len']:.1f}x longer prompt (fixed overhead dominates)")
print("\n2. Decode is memory-bound:")
print(f"   - average per-token time barely changes ({stats2['avg_decode']/stats1['avg_decode']:.2f}x, close to 1.0)")
print("   - the bottleneck is memory bandwidth, not compute")
print("\n3. Where the time goes:")
print(f"   - Exp 1: decode accounts for {stats1['decode_time']/(stats1['prefill_time']+stats1['decode_time'])*100:.1f}%")
print(f"   - Exp 2: decode accounts for {stats2['decode_time']/(stats2['prefill_time']+stats2['decode_time'])*100:.1f}%")
print("="*80)
```

Thanks for Reading

If this article was helpful to you, feel free to connect with me!