# DeepSeek-V3-0324 bf16 Four-Node Deployment Guide

## Table of Contents

[TOC]

## 1. Environment Setup

Prepare the environment on every node:

```shell
docker pull harbor.sourcefind.cn:5443/dcu/admin/base/vllm:0.9.2-ubuntu22.04-dtk25.04.2-py3.10
docker run --shm-size 500g --network=host --name=limeng_test2 --privileged \
    --device=/dev/kfd --device=/dev/dri --group-add video \
    --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    -v ~/:/workspace/ \
    -v /public/opendas/DL_DATA/llm-models/:/home/models:ro \
    -v /opt/hyhal:/opt/hyhal:ro \
    -it harbor.sourcefind.cn:5443/dcu/admin/base/vllm:0.9.2-ubuntu22.04-dtk25.04.2-py3.10 bash
```

## 2. Export Environment Variables

On every node, append the following environment variables to `~/.bashrc`, then restart the container:

```shell
export ALLREDUCE_STREAM_WITH_COMPUTE=1
# On the BW cluster, VLLM_HOST_IP must be set to the IP bound to the IB NIC
export VLLM_HOST_IP=$(hostname -I | awk '{print $2}')
echo $VLLM_HOST_IP
export NCCL_SOCKET_IFNAME=ib0
export GLOO_SOCKET_IFNAME=ib0
export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
# Run ibstat to check which HCAs are active
export NCCL_IB_HCA=mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1,mlx5_9:1
export NCCL_MIN_NCHANNELS=16
export NCCL_MAX_NCHANNELS=16
export NCCL_NET_GDR_READ=1
export VLLM_RPC_TIMEOUT=1800000
export NCCL_TOPO_FILE="/workspace/limeng/topo-input.xml"
unset NCCL_ALGO
export NCCL_NET_GDR_LEVEL=7
export NCCL_SDMA_COPY_ENABLE=0
export VLLM_USE_OPT_ZEROS=1
```

## 3. Start the Ray Cluster

On the head node:

```shell
ray start --head --node-ip-address=<head-node-ip> --port=6688 --num-gpus=8 --num-cpus=32
```

On each of the three worker nodes, in turn:

```shell
ray start --address=<head-node-ip>:6688 --node-ip-address=<worker-node-ip> --num-gpus=8 --num-cpus=32
```

## 4. Launch the Service on the Head Node

```shell
model_path=/home/models/DeepSeek-V3-0324-bf16
model=${model_path##*/}
data_type="bfloat16"
tp=32
port=8899
gpu_memory=0.9

# dated log directory
log_date=$(date "+%Y-%m-%d")
time=$(date "+%Y-%m-%d-%H-%M-%S")
log_dir="bw1000_${model}/${log_date}"
mkdir -p "${log_dir}"

vllm serve ${model_path} \
    --trust-remote-code \
    --distributed-executor-backend ray \
    --dtype $data_type \
    --tensor-parallel-size $tp \
    --gpu-memory-utilization $gpu_memory \
    --disable-cascade-attn \
    --host 0.0.0.0 \
    --port $port \
    --max-model-len 40960 \
    --max-seq-len-to-capture 40960 \
    --max-num-batched-tokens 40960 \
    --disable-log-requests \
    --max-num-seqs 1024 \
    --block-size 64 \
    --speculative-config '{"method": "deepseek_mtp", "num_speculative_tokens": 3}' \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    2>&1 | tee "${log_dir}/serve_${time}.log"
```

Online inference benchmark:

```shell
python benchmark_serving.py --model /home/models/DeepSeek-V3-0324-bf16 --dataset-name random \
    --trust-remote-code --random-input-len 12000 --random-output-len 550 --port 8899 \
    --ignore-eos --max-concurrency 6 --num-prompts 12
```
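A note on step 2: the `VLLM_HOST_IP` export takes the second field of `hostname -I`, which only works if the IB address happens to be listed second on every node. A minimal sketch that queries `ib0` (the interface named in `NCCL_SOCKET_IFNAME`) directly, assuming the `iproute2` tools are present in the container:

```shell
# Extract the IPv4 address bound to ib0; fall back to the hostname -I heuristic.
ib_ip=$(ip -4 -o addr show dev ib0 2>/dev/null | awk '{print $4}' | cut -d/ -f1)
if [ -z "${ib_ip}" ]; then
    ib_ip=$(hostname -I | awk '{print $2}')
fi
export VLLM_HOST_IP=${ib_ip}
echo "VLLM_HOST_IP=${VLLM_HOST_IP}"
```

If `ib0` carries no IPv4 address the sketch falls back to the original heuristic; verify the printed address against `ip addr show ib0` before starting Ray.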
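The `NCCL_IB_HCA` comment in step 2 says to check `ibstat` for active ports. A sketch that reads port state from sysfs instead, in case `infiniband-diags` is not installed in the image (the `/sys/class/infiniband` layout is standard on Linux; port number 1 matches the `:1` suffixes used above):

```shell
# Print the state of port 1 on every HCA; only ACTIVE ports belong in NCCL_IB_HCA.
for hca in /sys/class/infiniband/*; do
    [ -e "$hca" ] || continue        # glob did not match: no IB devices visible
    name=$(basename "$hca")
    state=$(cat "$hca/ports/1/state" 2>/dev/null || echo "unknown")
    echo "$name port 1: $state"
done
```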
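Before running the benchmark above, it helps to confirm the cluster and service are healthy. A sketch assuming the head-node defaults from steps 3 and 4 (port 8899, the served model path); the `curl` targets are vLLM's standard OpenAI-compatible routes:

```shell
# Ray should report 4 nodes and 32 GPUs under Resources when the cluster is idle.
ray status || echo "ray cluster not running"

# Liveness check: list the served model(s).
curl -s --max-time 10 http://localhost:8899/v1/models \
    || echo "vLLM service not reachable on port 8899"

# One short completion end to end.
curl -s --max-time 120 http://localhost:8899/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/home/models/DeepSeek-V3-0324-bf16",
         "messages": [{"role": "user", "content": "hello"}],
         "max_tokens": 16}' \
    || echo "chat completion request failed"
```

If the chat completion returns a JSON body with a `choices` array, the service is ready for the benchmark.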