# Medusa Decoding

This document describes how to build and run Medusa models with vLLM.

## Overview

Medusa is a parallel decoding algorithm for large language models. In addition to the official Top1-proposer, we also support tree-style parallel decoding, and both the target model and the draft model can run with multi-GPU (tensor-parallel) inference.

Unlike other models, Medusa decoding requires a base model plus a number of Medusa heads.

The vLLM Medusa model is implemented in `vllm/model_executor/models/medusa.py`.

## Support Matrix

* FP16
* BF16
* PAGED_KV_CACHE
* Tensor Parallel

### Convert Medusa model weights

The Medusa weights must first be converted into the format expected by vLLM's Medusa model:

```bash
python medusa_weight_converter.py --medusa_num_heads 4 --medusa_num_layers 1 \
    --medusa_model_path /work/model.bin --vocab_size 152064 --hidden_size 8192 \
    --output_dir /work/medusa/vllm-medusa-qwen2-72b-head-4 \
    --medusa_choices="[(0), (0, 0), (0, 0, 0), (0, 1), (1), (1, 0), (0, 0, 0, 0), (0, 0, 1), (0, 2), (0, 1, 0), (2), (0, 0, 2), (0, 3), (1, 0, 0), (2, 0), (0, 2, 0), (0, 4), (0, 0, 3), (3), (0, 0, 0, 1), (0, 5), (0, 0, 1, 0), (0, 0, 4)]"
```

Here `model.bin` holds the Medusa head weights saved after training. If you only want the Top1-proposer, `medusa_choices` can be left unset.

### Run tree-style generation server

```bash
VLLM_TREE_DECODING=1 python3 -m vllm.entrypoints.openai.api_server \
    --served-model-name qwen_medusa \
    --model /models/Qwen2-72B-Instruct/ -tp 4 \
    --max-model-len 1024 --max-num-seqs 8 --gpu-memory-utilization 0.8 \
    --speculative-model /work/medusa/vllm-medusa-qwen2-72b-head-4 \
    --speculative-draft-tensor-parallel-size 4 \
    --speculative-disable-by-batch-size 9 \
    --use-v2-block-manager \
    --spec-decoding-acceptance-method typical_acceptance_sampler \
    --dtype float16 --trust-remote-code --port 8086 \
    --num-speculative-heads 4 --num-speculative-tokens 24
```

Notes:

* `num_speculative_tokens = len(medusa_choices) + 1`.
* Do not use too many entries in `medusa_choices`, otherwise inference slows down at larger batch sizes.
* `speculative-disable-by-batch-size` must be greater than `max-num-seqs`; otherwise parallel decoding is skipped once the batch size reaches `max-num-seqs`.

### Run Top1-proposer server

```bash
python3 -m vllm.entrypoints.openai.api_server \
    --served-model-name qwen_medusa \
    --model /models/Qwen2-72B-Instruct/ -tp 4 \
    --max-model-len 1024 --max-num-seqs 8 --gpu-memory-utilization 0.8 \
    --speculative-model /work/medusa/vllm-medusa-qwen2-72b-head-4 \
    --speculative-draft-tensor-parallel-size 4 \
    --speculative-disable-by-batch-size 9 \
    --use-v2-block-manager \
    --spec-decoding-acceptance-method typical_acceptance_sampler \
    --dtype float16 --trust-remote-code --port 8086 \
    --num-speculative-tokens 4
```

Note: with the Top1-proposer, `num-speculative-tokens` is simply the number of Medusa heads.

### Send a request

```bash
curl http://localhost:8086/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen_medusa",
        "prompt": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n帮我写一个C++的快速排序算法<|im_end|>\n<|im_start|>assistant\n",
        "max_tokens": 256,
        "temperature": 0.0
    }'
```
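The same request can also be issued from Python. Below is a minimal sketch using the `openai` client library pointed at the OpenAI-compatible endpoint started above; the port and served model name match the example server command and should be adjusted to your setup.

```python
# Minimal sketch: query the Medusa-enabled vLLM server through its OpenAI-compatible API.
# Assumes the server above is listening on port 8086 and serves the model as "qwen_medusa".
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8086/v1", api_key="EMPTY")  # vLLM ignores the key

prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n帮我写一个C++的快速排序算法<|im_end|>\n"
    "<|im_start|>assistant\n"
)

completion = client.completions.create(
    model="qwen_medusa",
    prompt=prompt,
    max_tokens=256,
    temperature=0.0,
)
print(completion.choices[0].text)
```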
### Run tree-style benchmark

```bash
VLLM_TREE_DECODING=1 python /work/test/medusa_benchmark_throughput.py \
    --model /models/Qwen2-72B-Instruct/ -tp 4 --dtype float16 --trust-remote-code \
    --max-num-seqs 4 \
    --speculative-model /work/medusa/vllm-medusa1-qwen2-72b-head-4 \
    --speculative-draft-tensor-parallel-size 4 \
    --speculative-disable-by-batch-size 9 \
    --use-v2-block-manager \
    --spec-decoding-acceptance-method typical_acceptance_sampler \
    --max-model-len 1024 --dataset /work/medusa_benchmark_data.json \
    --num-speculative-heads 4 --num-speculative-tokens 24 \
    --gpu-memory-utilization 0.95
```

### Run Top1-proposer benchmark

```bash
python /work/test/medusa_benchmark_throughput.py \
    --model /models/Qwen2-72B-Instruct/ -tp 4 --dtype float16 --trust-remote-code \
    --max-num-seqs 4 \
    --speculative-model /work/medusa/vllm-medusa1-qwen2-72b-head-4 \
    --speculative-draft-tensor-parallel-size 4 \
    --speculative-disable-by-batch-size 9 \
    --use-v2-block-manager \
    --spec-decoding-acceptance-method typical_acceptance_sampler \
    --max-model-len 1024 --dataset /work/medusa_benchmark_data.json \
    --num-speculative-tokens 4 \
    --gpu-memory-utilization 0.95
```

You can vary `max-num-seqs` to measure performance at different batch sizes, as shown in the sketch below.
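For a batch-size sweep, the benchmark can simply be re-run with different `--max-num-seqs` values. The following is a minimal sketch, not part of the benchmark script itself; the paths and flags mirror the Top1-proposer example above, the batch sizes are illustrative, and `--speculative-disable-by-batch-size` is kept one above the batch size per the note earlier.

```python
# Hypothetical helper: repeat the Top1-proposer benchmark for several batch sizes.
# Adjust paths, tensor-parallel size, and batch sizes to your environment.
import subprocess

for max_num_seqs in [1, 2, 4, 8]:  # illustrative batch sizes
    cmd = [
        "python", "/work/test/medusa_benchmark_throughput.py",
        "--model", "/models/Qwen2-72B-Instruct/", "-tp", "4",
        "--dtype", "float16", "--trust-remote-code",
        "--max-num-seqs", str(max_num_seqs),
        "--speculative-model", "/work/medusa/vllm-medusa1-qwen2-72b-head-4",
        "--speculative-draft-tensor-parallel-size", "4",
        # must stay greater than max-num-seqs so speculative decoding is not disabled
        "--speculative-disable-by-batch-size", str(max_num_seqs + 1),
        "--use-v2-block-manager",
        "--spec-decoding-acceptance-method", "typical_acceptance_sampler",
        "--max-model-len", "1024",
        "--dataset", "/work/medusa_benchmark_data.json",
        "--num-speculative-tokens", "4",
        "--gpu-memory-utilization", "0.95",
    ]
    print(f"Running benchmark with --max-num-seqs {max_num_seqs}")
    subprocess.run(cmd, check=True)
```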