# Medusa Decoding

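This document describes how to build and run Medusa models with vLLM. Medusa currently supports tree-style generation, and both the target model and the draft model can run multi-GPU inference.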

## Overview
Unlike other models, Medusa decoding requires a base model and several Medusa heads.

The vLLM Medusa model is implemented in `vllm/model_executor/models/medusa.py`.
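To make the role of the heads and the tree concrete, here is a minimal sketch of how tree paths can be expanded into candidate continuations. It assumes the convention of the original Medusa design, where the i-th entry of a path is a rank index into the i-th head's top-k predictions. The function name `expand_candidates` and the toy tokens are hypothetical and are not taken from `medusa.py`.

```python
def expand_candidates(head_topk, choices):
    """Turn tree paths into candidate token sequences.

    head_topk[i] is the ranked token list proposed by Medusa head i (toy data here).
    Each choice is a tuple of rank indices, one per head along the path.
    """
    candidates = []
    for choice in choices:
        # Position `depth` in the path selects the `rank`-th token of head `depth`.
        candidates.append(
            tuple(head_topk[depth][rank] for depth, rank in enumerate(choice))
        )
    return candidates


# Toy example: two heads, each proposing two ranked tokens.
head_topk = [["the", "a"], ["cat", "dog"]]
choices = [(0,), (0, 0), (0, 1), (1,), (1, 0)]

candidates = expand_candidates(head_topk, choices)
for path, tokens in zip(choices, candidates):
    print(path, tokens)
```

The base model then verifies these candidates, so a larger tree trades extra verification work for a better chance of accepting several tokens per step.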

## Support Matrix
  * FP16
  * BF16
  * PAGED_KV_CACHE
  * Tensor Parallel

### Convert Medusa model weights
The Medusa model must first be converted to the Medusa model format used by vLLM.

```bash
python medusa_weight_converter.py \
  --medusa_num_heads 4 --medusa_num_layers 1 \
  --medusa_model_path /work/model.bin \
  --vocab_size 152064 --hidden_size 8192 \
  --output_dir /work/medusa/vllm-medusa-qwen2-72b-head-4 \
  --medusa_choices="[(0), (0, 0), (0, 0, 0), (0, 1), (1), (1, 0), (0, 0, 0, 0), (0, 0, 1), (0, 2), (0, 1, 0), (2), (0, 0, 2), (0, 3), (1, 0, 0), (2, 0), (0, 2, 0), (0, 4), (0, 0, 3), (3), (0, 0, 0, 1), (0, 5), (0, 0, 1, 0), (0, 0, 4)]"
```
Here, `model.bin` is the Medusa head weights file saved after training.


### Run

```bash
python3 -m vllm.entrypoints.openai.api_server \
  --served-model-name qwen_medusa \
  --model /models/Qwen2-72B-Instruct/ -tp 4 \
  --max-model-len 1024 --max-num-seqs 8 --gpu-memory-utilization 0.8 \
  --speculative-model /work/medusa/vllm-medusa-qwen2-72b-head-4 \
  --speculative-draft-tensor-parallel-size 4 \
  --speculative-disable-by-batch-size 4 \
  --use-v2-block-manager \
  --spec-decoding-acceptance-method typical_acceptance_sampler \
  --dtype float16 --trust-remote-code --port 8086 \
  --tree-style-spec-decoding True \
  --num-speculative-heads 4 --num-speculative-tokens 24
```

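merge-lora merges the LoRA weights into the base model weights, which improves overall inference speed. If accuracy requirements are strict, leave this parameter unset.

num-speculative-tokens depends on the number of Medusa choices: num_speculative_tokens = len(medusa_choices) + 1. The conversion command above lists 23 choices, hence the value 24.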
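The relationship can be checked with a short script before launching the server. This is a minimal sketch, not part of vLLM: the helper names `parse_medusa_choices` and `summarize_choices` are hypothetical. The parent-path check assumes that every path in the tree needs its parent path to be present, which holds for the list used here but is not stated by the source.

```python
import ast

# The --medusa_choices string from the conversion command above.
MEDUSA_CHOICES = (
    "[(0), (0, 0), (0, 0, 0), (0, 1), (1), (1, 0), (0, 0, 0, 0), (0, 0, 1), "
    "(0, 2), (0, 1, 0), (2), (0, 0, 2), (0, 3), (1, 0, 0), (2, 0), (0, 2, 0), "
    "(0, 4), (0, 0, 3), (3), (0, 0, 0, 1), (0, 5), (0, 0, 1, 0), (0, 0, 4)]"
)


def parse_medusa_choices(text):
    """Parse the choices string into a list of tuples."""
    raw = ast.literal_eval(text)
    # Python reads "(0)" as the int 0, so normalize bare ints to 1-tuples.
    return [(c,) if isinstance(c, int) else tuple(c) for c in raw]


def summarize_choices(choices):
    """Derive the values needed for the server flags from the tree paths."""
    paths = set(choices)
    # Assumption: each path longer than one step should have its parent in the list.
    missing_parents = sorted(
        {c[:-1] for c in choices if len(c) > 1 and c[:-1] not in paths}
    )
    return {
        "num_speculative_tokens": len(choices) + 1,
        "max_depth": max(len(c) for c in choices),
        "missing_parents": missing_parents,
    }


choices = parse_medusa_choices(MEDUSA_CHOICES)
summary = summarize_choices(choices)
print(summary)
```

For this list, `max_depth` matches the 4 heads passed as `--num-speculative-heads`, since a path of depth d draws on d heads.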

### Send a request
```bash
curl http://localhost:8086/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen_medusa",
"prompt": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n帮我写一个C++的快速排序算法<|im_end|>\n<|im_start|>assistant\n",
"max_tokens": 256,
"temperature": 0.0
}'
```
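The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call above. The helper names `build_chatml_prompt`, `build_payload`, and `send_completion` are hypothetical, and the response parsing assumes the usual OpenAI-style completions layout (`choices[0].text`).

```python
import json
import urllib.request


def build_chatml_prompt(system, user):
    """Assemble the ChatML-style prompt string used in the curl example."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


def build_payload(user_message):
    """Build the JSON body with the same fields as the curl example."""
    return {
        "model": "qwen_medusa",  # must match --served-model-name
        "prompt": build_chatml_prompt("You are a helpful assistant.", user_message),
        "max_tokens": 256,
        "temperature": 0.0,
    }


def send_completion(payload, url="http://localhost:8086/v1/completions"):
    """POST the payload to the server and return the generated text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumption: OpenAI-compatible completions response shape.
    return body["choices"][0]["text"]


payload = build_payload("帮我写一个C++的快速排序算法")
```

Once the server from the Run step is listening on port 8086, `print(send_completion(payload))` prints the completion.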

### Benchmark
`python medusa_benchmark_throughput.py --model /data/llm-models/qwen2/Qwen2-72B-Instruct/ -tp 4 --dtype float16 --trust-remote-code --max-num-seqs 1 --dataset /work/test/medusa_benchmark_data.json --max-model-len 4096 --gpu-memory-utilization 0.9`

Set max-num-seqs to different values to benchmark performance at different batch sizes.
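A sweep over batch sizes can be scripted as follows. This is a sketch that only builds and prints the command lines, reusing the flags from the benchmark command above. The helper name `build_benchmark_command` and the batch sizes in `BATCH_SIZES` are illustrative choices, not values from the source.

```python
import shlex


def build_benchmark_command(max_num_seqs):
    """Return the benchmark command as an argument list for one batch size."""
    return [
        "python", "medusa_benchmark_throughput.py",
        "--model", "/data/llm-models/qwen2/Qwen2-72B-Instruct/",
        "-tp", "4",
        "--dtype", "float16",
        "--trust-remote-code",
        "--max-num-seqs", str(max_num_seqs),
        "--dataset", "/work/test/medusa_benchmark_data.json",
        "--max-model-len", "4096",
        "--gpu-memory-utilization", "0.9",
    ]


# Illustrative batch sizes; adjust them to the hardware under test.
BATCH_SIZES = [1, 2, 4, 8]

commands = [build_benchmark_command(n) for n in BATCH_SIZES]
for command in commands:
    # Print each command; pass the list to subprocess.run(command, check=True)
    # to execute it on a machine that has the model and dataset in place.
    print(shlex.join(command))
```

Running the commands one at a time keeps each measurement isolated, since every run loads the model onto the same GPUs.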