Unverified Commit 729b2429 authored by Baizhou Zhang, committed by GitHub

[Doc] Add documentation for DeepSeek V3.2 (#11877)


Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: ybyang <ybyang7@iflytek.com>
parent d7056c52
@@ -13,7 +13,7 @@
 "| Model | Reasoning tags | Parser | Notes |\n",
 "|---------|-----------------------------|------------------|-------|\n",
 "| [DeepSeek‑R1 series](https://huggingface.co/collections/deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d) | `<think>` … `</think>` | `deepseek-r1` | Supports all variants (R1, R1-0528, R1-Distill) |\n",
-"| [DeepSeek‑V3.1](https://huggingface.co/deepseek-ai/DeepSeek-V3.1) | `<think>` … `</think>` | `deepseek-v3` | Supports `thinking` parameter |\n",
+"| [DeepSeek‑V3 series](https://huggingface.co/deepseek-ai/DeepSeek-V3.1) | `<think>` … `</think>` | `deepseek-v3` | Including [DeepSeek‑V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp). Supports `thinking` parameter |\n",
 "| [Standard Qwen3 models](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f) | `<think>` … `</think>` | `qwen3` | Supports `enable_thinking` parameter |\n",
 "| [Qwen3-Thinking models](https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507) | `<think>` … `</think>` | `qwen3` or `qwen3-thinking` | Always generates thinking content |\n",
 "| [Kimi models](https://huggingface.co/moonshotai/models) | `◁think▷` … `◁/think▷` | `kimi` | Uses special thinking delimiters |\n",
@@ -26,7 +26,7 @@
 "- Both are handled by the same `deepseek-r1` parser\n",
 "\n",
 "**DeepSeek-V3 Family:**\n",
-"- DeepSeek-V3.1: Hybrid model supporting both thinking and non-thinking modes, use the `deepseek-v3` parser and `thinking` parameter (NOTE: not `enable_thinking`)\n",
+"- DeepSeek-V3.1/V3.2: Hybrid model supporting both thinking and non-thinking modes, use the `deepseek-v3` parser and `thinking` parameter (NOTE: not `enable_thinking`)\n",
 "\n",
 "**Qwen3 Family:**\n",
 "- Standard Qwen3 (e.g., Qwen3-2507): Use `qwen3` parser, supports `enable_thinking` in chat templates\n",
......
@@ -170,7 +170,7 @@ python3 -m sglang.launch_server \
 - The best configuration for `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` can be searched with [bench_speculative.py](https://github.com/sgl-project/sglang/blob/main/scripts/playground/bench_speculative.py) script for given batch size. The minimum configuration is `--speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2`, which can achieve speedup for larger batch sizes.
 - FlashAttention3, FlashMLA, and Triton backend fully supports MTP usage. For FlashInfer backend (`--attention-backend flashinfer`) with speculative decoding,`--speculative-eagle-topk` parameter should be set to `1`. MTP support for the CutlassMLA and TRTLLM MLA backends are still under development.
 - To enable DeepSeek MTP for large batch sizes (>32), there are some parameters should be changed (Reference [this discussion](https://github.com/sgl-project/sglang/issues/4543#issuecomment-2737413756)):
-  - Adjust `--max-running-requests` to a larger number. The default value is `32` for MTP. For larger batch sizes, you should increase this value beyond the default value.
+  - Adjust `--max-running-requests` to a larger number. The default value is `48` for MTP. For larger batch sizes, you should increase this value beyond the default value.
   - Set `--cuda-graph-bs`. It's a list of batch sizes for cuda graph capture. The default captured batch sizes for speculative decoding is set [here](https://github.com/sgl-project/sglang/blob/49420741746c8f3e80e0eb17e7d012bfaf25793a/python/sglang/srt/model_executor/cuda_graph_runner.py#L126). You can include more batch sizes into it.
......
# DeepSeek V3.2 Usage
[DeepSeek-V3.2-Exp](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp) equips DeepSeek-V3.1-Terminus with DeepSeek Sparse Attention (DSA) through continued training. With DSA, a fine-grained sparse attention mechanism powered by a lightning indexer, DeepSeek-V3.2 achieves efficiency improvements in long-context scenarios.
For reporting issues or tracking upcoming features, please refer to this [Roadmap](https://github.com/sgl-project/sglang/issues/11060).
## Installation
### Docker
```bash
# H200/B200
docker pull lmsysorg/sglang:latest
# MI350/MI355
docker pull lmsysorg/sglang:dsv32-rocm
# NPUs
docker pull lmsysorg/sglang:dsv32-a2
docker pull lmsysorg/sglang:dsv32-a3
```
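After pulling an image, the container can be started along these lines (a minimal sketch; the Hugging Face cache mount, shared-memory size, and host networking are assumptions to adapt to your environment):
```bash
# Minimal sketch: start the H200/B200 image with all GPUs visible.
# The cache mount and --shm-size values are illustrative, not required values.
docker run --gpus all \
    --shm-size 32g \
    --network host \
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \
    -it lmsysorg/sglang:latest bash
```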
### Build From Source
```bash
# Install SGLang
git clone https://github.com/sgl-project/sglang
cd sglang
pip3 install pip --upgrade
pip3 install -e "python[all]"
# Install flash_mla
git clone https://github.com/deepseek-ai/FlashMLA.git flash-mla
cd flash-mla
git submodule update --init --recursive
pip install -v .
```
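A quick way to sanity-check the build (a hedged sketch; the `flash_mla` module name is assumed from the library installed above):
```bash
# Verify that both SGLang and the FlashMLA extension import cleanly.
python3 -c "import sglang; print(sglang.__version__)"
python3 -c "import flash_mla; print('flash_mla imported OK')"
```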
## Launch DeepSeek V3.2 with SGLang
To serve DeepSeek-V3.2-Exp on 8xH200/B200 GPUs:
```bash
# Launch with TP + DP
python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --dp 8 --enable-dp-attention
# Launch with EP + DP
python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --ep 8 --dp 8 --enable-dp-attention
```
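Once a server is up (port 30000 by default), it exposes an OpenAI-compatible API; a minimal smoke test might look like the following (sketch only, with an illustrative prompt):
```bash
# Send one chat completion to the locally launched server.
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-V3.2-Exp",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```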
### Configuration Tips
- **DP Attention**: For the DeepSeek V3.2 model, the kernels are customized for the `dp_size=8` use case, so launching with `--enable-dp-attention --dp 8` (as in the commands above) is recommended.
- **Choices of Attention Kernels**: The attention backend is automatically set to the `nsa` backend for the DeepSeek V3.2 model. This backend implements different kernels for sparse prefill/decode, which can be selected with the `--nsa-prefill-backend` and `--nsa-decode-backend` server arguments (see the example command after this list). The available NSA prefill/decode kernels are:
- `flashmla_sparse`: `flash_mla_sparse_fwd` kernel from `flash_mla` library. Can run on both Hopper and Blackwell GPUs.
- `flashmla_kv`: `flash_mla_with_kvcache` kernel from `flash_mla` library. Can run on both Hopper and Blackwell GPUs.
- `fa3`: `flash_attn_with_kvcache` kernel from `flash_attn` library. Can only run on Hopper GPUs.
- `tilelang`: `tilelang` implementation that can run on GPU, HPU and NPU.
- `aiter`: AITER kernel on AMD GPUs. Can only be used as a decode kernel.
- Based on performance benchmarks, the default configurations on H200 and B200 are as follows:
- H200: `flashmla_sparse` prefill attention, `fa3` decode attention, `bf16` kv cache dtype.
- B200: `flashmla_kv` prefill attention, `flashmla_kv` decode attention, `fp8_e4m3` kv cache dtype.
- Currently we do not enable `prefill=flashmla_sparse` together with `decode=flashmla_kv` because of the latency introduced by KV cache quantization operations. We may switch to this setting in the future once the attention/quantization kernels are optimized.
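As referenced in the attention-kernel note above, the prefill/decode kernels can also be pinned explicitly at launch. The command below is only a sketch that reproduces the documented H200 defaults by hand; normally the automatic selection is sufficient:
```bash
# Illustrative: explicitly select the NSA prefill/decode kernels
# (these values mirror the documented H200 defaults).
python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp \
    --tp 8 --dp 8 --enable-dp-attention \
    --nsa-prefill-backend flashmla_sparse \
    --nsa-decode-backend fa3
```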
### Multi-token Prediction
SGLang implements Multi-Token Prediction (MTP) for DeepSeek V3.2 based on [EAGLE speculative decoding](https://docs.sglang.ai/advanced_features/speculative_decoding.html#EAGLE-Decoding). With this optimization, the decoding speed can be improved significantly on small batch sizes. Please look at [this PR](https://github.com/sgl-project/sglang/pull/11652) for more information.
Example usage:
```bash
python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --dp 8 --enable-dp-attention --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
```
- The best configuration for `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` can be searched with the [bench_speculative.py](https://github.com/sgl-project/sglang/blob/main/scripts/playground/bench_speculative.py) script for a given batch size. The minimum configuration is `--speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2`, which can still achieve a speedup at larger batch sizes.
- The default value of `--max-running-requests` is `48` for MTP. For larger batch sizes, this value should be increased beyond the default, as shown in the example after this list.
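For larger batch sizes, the MTP launch command can be combined with the knobs above; the values below are illustrative rather than tuned:
```bash
# Illustrative: minimum speculative configuration plus a larger running-request cap.
python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp \
    --tp 8 --dp 8 --enable-dp-attention \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
    --max-running-requests 128
```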
## Function Calling and Reasoning Parser
Function calling and the reasoning parser are used the same way as for DeepSeek V3.1. Please refer to the [Reasoning Parser](https://docs.sglang.ai/advanced_features/separate_reasoning.html) and [Tool Parser](https://docs.sglang.ai/advanced_features/tool_parser.html) documents.
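For reference, a hedged sketch of wiring this up: launch with the `deepseek-v3` reasoning parser (the same parser used for V3.1) and toggle thinking mode per request. The `chat_template_kwargs` request field follows the linked reasoning-parser document and is an assumption here, not something verified specifically for V3.2:
```bash
# Sketch: enable the reasoning parser at launch (same parser as DeepSeek V3.1).
python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp \
    --tp 8 --dp 8 --enable-dp-attention \
    --reasoning-parser deepseek-v3

# Sketch: request thinking mode per call; chat_template_kwargs usage is assumed
# to match the V3.1 behavior described in the reasoning-parser document.
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-V3.2-Exp",
        "messages": [{"role": "user", "content": "What is 17 * 24?"}],
        "chat_template_kwargs": {"thinking": true},
        "max_tokens": 512
      }'
```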
## PD Disaggregation
Prefill Command:
```bash
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3.2-Exp \
--disaggregation-mode prefill \
--host $LOCAL_IP \
--port $PORT \
--tp 8 \
--dp 8 \
--enable-dp-attention \
--dist-init-addr ${HOST}:${DIST_PORT} \
--trust-remote-code \
--disaggregation-bootstrap-port 8998 \
  --mem-fraction-static 0.9
```
Decode command:
```bash
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3.2-Exp \
--disaggregation-mode decode \
--host $LOCAL_IP \
--port $PORT \
--tp 8 \
--dp 8 \
--enable-dp-attention \
--dist-init-addr ${HOST}:${DIST_PORT} \
--trust-remote-code \
  --mem-fraction-static 0.9
```
Router command:
```bash
python -m sglang_router.launch_router --pd-disaggregation \
--prefill $PREFILL_ADDR 8998 \
--decode $DECODE_ADDR \
--host 127.0.0.1 \
  --port 8000
```
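After the router starts, requests should be sent to its port (8000 here) and are dispatched across the prefill and decode instances; a quick smoke test (sketch only):
```bash
# Send one request through the PD router instead of an individual worker.
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-ai/DeepSeek-V3.2-Exp", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 16}'
```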
For more advanced or production-ready deployment methods, such as RBG- or LWS-based deployment, please refer to [references/multi_node_deployment/rbg_pd/deepseekv32_pd.md](../references/multi_node_deployment/rbg_pd/deepseekv32_pd.md). That document also contains startup commands for DeepEP-based EP parallelism.
## Benchmarking Results
### Accuracy Test with `gsm8k`
A simple accuracy benchmark can be run on the `gsm8k` dataset:
```bash
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
```
The resulting accuracy is 0.956, which matches expectations:
```bash
Accuracy: 0.956
Invalid: 0.000
Latency: 25.109 s
Output throughput: 5226.235 token/s
```
### Accuracy Test with `gpqa-diamond`
A long-context accuracy benchmark can be run on the GPQA-Diamond dataset with long output tokens and thinking enabled:
```bash
python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 120000 --repeat 8 --thinking-mode deepseek-v3
```
The mean accuracy over 8 runs is 0.797, which is consistent with the 79.9 reported in the official tech report.
```bash
Repeat: 8, mean: 0.797
Scores: ['0.808', '0.798', '0.808', '0.798', '0.783', '0.788', '0.803', '0.793']
```
# DeepSeek-V3.2-Exp RBG-Based PD Deployment
## 0. Prerequisites
1. Kubernetes >= 1.26
2. LWS (LeaderWorkerSet) installed on the cluster.
3. RBG installed on the cluster.

For RBG installation, please refer to: https://github.com/sgl-project/rbg
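Before applying the manifest, the required CRDs can be checked (a sketch; the CRD names are inferred from the `apiVersion`/`kind` fields used in the manifest below):
```bash
# Confirm the LWS and RBG CRDs exist on the cluster.
kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io
kubectl get crd rolebasedgroups.workloads.x-k8s.io
```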
## 1. Image Preparation
`lmsysorg/sglang:latest`
## 2. All-in-One Manifest File
*Note: The nodeSelector, model location, and taint toleration sections can be adjusted to match your actual deployment environment.*
Save the following manifest as `rbg-dsv32.yml`:
```yaml
apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
metadata:
name: deepseek-rbg-32exp
namespace: default
spec:
roles:
- name: prefill
replicas: 1
workload:
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
restartPolicy: None
leaderWorkerSet:
size: 1
patchLeaderTemplate:
metadata:
labels:
role: leader
pd_role: prefill
spec:
containers:
- command:
- python3
- -m
- sglang.launch_server
- --model-path
- /work/models
- --port
- "30000"
- --host
- 0.0.0.0
- --disable-radix-cache
- --disaggregation-ib-device
- mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7
- --chunked-prefill-size
- "131072"
- --page-size
- "64"
# - --enable-eplb
- --ep-dispatch-algorithm
- dynamic
- --eplb-algorithm
- deepseek
- --enable-dp-lm-head
- --enable-dp-attention
- --dp-size
- "8"
- --moe-a2a-backend
- deepep
- --deepep-mode
- normal
- --disaggregation-mode
- prefill
- --mem-fraction-static
- "0.8"
- --max-prefill-tokens
- "32768"
- --context-length
- "32768"
- --tp
- "8"
- --dist-init-addr
- $(LWS_LEADER_ADDRESS):20102
- --nnodes
- $(LWS_GROUP_SIZE)
- --node-rank
- $(LWS_WORKER_INDEX)
- --trust-remote-code
- --ep-num-redundant-experts
- "32"
- --moe-dense-tp-size
- "1"
- --max-running-requests
- "1024"
env:
- name: LWS_WORKER_INDEX
valueFrom:
fieldRef:
fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
livenessProbe:
failureThreshold: 3000
httpGet:
path: /health
port: 30000
initialDelaySeconds: 300
periodSeconds: 60
successThreshold: 1
timeoutSeconds: 10
readinessProbe:
failureThreshold: 20
httpGet:
path: /health
port: 30000
periodSeconds: 30
successThreshold: 1
timeoutSeconds: 10
name: sglang
ports:
- containerPort: 30000
name: sglang-http
protocol: TCP
patchWorkerTemplate: {}
template:
metadata:
labels:
inference-framework: sglang
inference-stack.io/monitoring: "enabled"
spec:
containers:
- name: sglang
image: lmsysorg/sglang:latest
env:
- name: SGLANG_SKIP_SGL_KERNEL_VERSION_CHECK
value: "1"
- name: CUDA_LAUNCH_BLOCKING
value: "0"
- name: SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT
value: "1000000000"
- name: NVSHMEM_IB_TRAFFIC_CLASS
value: "16"
- name: NVSHMEM_DISABLE_P2P
value: "0"
- name: ENABLE_METRICS
value: "true"
- name: NVSHMEM_IB_GID_INDEX
value: "3"
- name: NVSHMEM_IB_SL
value: "5"
- name: SGLANG_SET_CPU_AFFINITY
value: "true"
- name: SGL_ENABLE_JIT_DEEPGEMM
value: "1"
- name: NCCL_IB_QPS_PER_CONNECTION
value: "8"
- name: NCCL_IB_SPLIT_DATA_ON_QPS
value: "1"
- name: NCCL_NET_PLUGIN
value: "none"
- name: NCCL_IB_TC
value: "136"
- name: NCCL_IB_SL
value: "5"
- name: NCCL_IB_TIMEOUT
value: "22"
- name: NCCL_IB_GID_INDEX
value: "3"
- name: NCCL_MIN_NCHANNELS
value: "4"
- name: NCCL_SOCKET_IFNAME
value: bond0
- name: GLOO_SOCKET_IFNAME
value: bond0
- name: NCCL_IB_HCA
value: ^=mlx5_0,mlx5_5,mlx5_6
- name: NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME
value: "bond0"
- name: MC_TE_METRIC
value: "false"
resources:
limits:
nvidia.com/gpu: "8"
securityContext:
capabilities:
add:
- IPC_LOCK
privileged: true
volumeMounts:
- mountPath: /root/.cache
name: sgl-cache
- mountPath: /dev/shm
name: dshm
- mountPath: /work/models
name: model
- mountPath: /dev/infiniband
name: ib
- mountPath: /sgl-workspace/sglang
name: src
dnsPolicy: ClusterFirstWithHostNet
hostIPC: true
hostNetwork: true
nodeSelector:
pd: "yes"
tolerations:
- key: pd
operator: Exists
volumes:
- hostPath:
path: /var/run/sys-topology
name: topo
- hostPath:
path: /data1/sgl_cache4
type: DirectoryOrCreate
name: sgl-cache
- emptyDir:
medium: Memory
name: dshm
- hostPath:
path: /data/DeepSeek-V3.2-Exp
name: model
- hostPath:
path: /dev/infiniband
name: ib
- hostPath:
path: /data/src/sglang
type: DirectoryOrCreate
name: src
- name: decode
replicas: 1
workload:
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
leaderWorkerSet:
size: 1
patchLeaderTemplate:
metadata:
labels:
role: leader
pd_role: decode
spec:
containers:
- command:
- python3
- -m
- sglang.launch_server
- --model-path
- /work/models
- --port
- "30000"
- --host
- 0.0.0.0
- --disaggregation-ib-device
- mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7
- --chunked-prefill-size
- "131072"
- --prefill-round-robin-balance
- --eplb-rebalance-layers-per-chunk
- "29"
- --page-size
- "64"
- --enable-dp-attention
- --enable-dp-lm-head
- --dp-size
- "8"
- --moe-a2a-backend
- deepep
- --deepep-mode
- low_latency
- --disaggregation-mode
- decode
- --mem-fraction-static
- "0.8"
- --context-length
- "32768"
- --max-running-requests
- "2048"
- --tp-size
- "8" # Size of Tensor Parallelism
- --cuda-graph-max-bs
- "16"
- --dist-init-addr
- $(LWS_LEADER_ADDRESS):20102
- --nnodes
- $(LWS_GROUP_SIZE)
- --node-rank
- $(LWS_WORKER_INDEX)
- --trust-remote-code
- --ep-num-redundant-experts
- "32"
- --moe-dense-tp-size
- "1"
env:
- name: LWS_WORKER_INDEX
valueFrom:
fieldRef:
fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
livenessProbe:
failureThreshold: 30000
httpGet:
path: /health
port: 30000
initialDelaySeconds: 300
periodSeconds: 60
successThreshold: 1
timeoutSeconds: 10
name: sglang
readinessProbe:
failureThreshold: 20
httpGet:
path: /health
port: 30000
periodSeconds: 30
successThreshold: 1
timeoutSeconds: 10
patchWorkerTemplate:
spec:
containers:
- command:
- python3
- -m
- sglang.launch_server
- --model-path
- /work/models
- --crash-dump-folder
- /log
- --chunked-prefill-size
- "262144"
- --prefill-round-robin-balance
- --eplb-rebalance-layers-per-chunk
- "29"
- --page-size
- "64"
- --enable-dp-attention
- --enable-dp-lm-head
- --dp-size
- "32"
- --moe-a2a-backend
- "deepep"
- --deepep-mode
- low_latency
- --disaggregation-mode
- decode
- --mem-fraction-static
- "0.849"
- --context-length
- "32768"
- --disaggregation-ib-device
- mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7
- --max-running-requests
- "4096"
- --cuda-graph-max-bs
- "16"
- --tp-size
- "8" # Size of Tensor Parallelism
- --dist-init-addr
- $(LWS_LEADER_ADDRESS):20102
- --nnodes
- $(LWS_GROUP_SIZE)
- --node-rank
- $(LWS_WORKER_INDEX)
- --trust-remote-code
- --ep-num-redundant-experts
- "32"
- --moe-dense-tp-size
- "1"
env:
- name: LWS_WORKER_INDEX
valueFrom:
fieldRef:
fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
name: sglang
template:
metadata:
labels:
inference-framework: sglang-unuse
inference-stack.io/monitoring: "enabled"
spec:
containers:
- image: lmsysorg/sglang:latest
name: sglang
resources:
limits:
nvidia.com/gpu: "8"
securityContext:
capabilities:
add:
- IPC_LOCK
privileged: true
volumeMounts:
- mountPath: /root/.cache
name: sgl-cache
- mountPath: /dev/shm
name: dshm
- mountPath: /work/models
name: model
- mountPath: /dev/infiniband
name: ib
- mountPath: /sgl-workspace/sglang
name: src
env:
- name: SGLANG_SKIP_SGL_KERNEL_VERSION_CHECK
value: "1"
- name: SGLANG_DISAGGREGATION_WAITING_TIMEOUT
value: "100000000"
- name: NVSHMEM_DISABLE_P2P
value: "0"
- name: NVSHMEM_IB_TRAFFIC_CLASS
value: "16"
- name: NVSHMEM_IB_SL
value: "5"
- name: ENABLE_METRICS
value: "true"
- name: CUDA_LAUNCH_BLOCKING
value: "0"
- name: NVSHMEM_IB_GID_INDEX
value: "3"
- name: NCCL_IB_QPS_PER_CONNECTION
value: "8"
- name: NCCL_IB_SPLIT_DATA_ON_QPS
value: "1"
- name: NCCL_NET_PLUGIN
value: "none"
- name: NCCL_IB_TC
value: "136"
- name: NCCL_IB_SL
value: "5"
- name: NCCL_IB_TIMEOUT
value: "22"
- name: NCCL_IB_GID_INDEX
value: "3"
- name: NCCL_MIN_NCHANNELS
value: "4"
- name: NCCL_SOCKET_IFNAME
value: bond0
- name: GLOO_SOCKET_IFNAME
value: bond0
- name: NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME
value: "bond0"
- name: NCCL_IB_HCA
value: ^=mlx5_0,mlx5_5,mlx5_6
- name: MC_TE_METRIC
value: "false"
- name: SGL_ENABLE_JIT_DEEPGEMM
value: "1"
dnsPolicy: ClusterFirstWithHostNet
hostIPC: true
hostNetwork: true
nodeSelector:
pd: "yes"
tolerations:
- key: pd
operator: Exists
volumes:
- hostPath:
path: /var/run/sys-topology
name: topo
- hostPath:
path: /data1/sgl_cache4
type: DirectoryOrCreate
name: sgl-cache
- hostPath:
path: /data/src/sglang
type: DirectoryOrCreate
name: src
- emptyDir:
medium: Memory
name: dshm
- hostPath:
path: /data/DeepSeek-V3.2-Exp
name: model
- hostPath:
path: /dev/infiniband
name: ib
- name: router
replicas: 1
dependencies: [ "decode", "prefill" ]
template:
spec:
containers:
- name: scheduler
image: lmsysorg/sglang:latest
command:
- sh
- -c
- >
python3 -m sglang_router.launch_router
--host 0.0.0.0
--port 8080
--pd-disaggregation
--policy random
--service-discovery
--service-discovery-namespace ${NAMESPACE}
--service-discovery-port 30000
--prefill-selector pd_role=prefill
--decode-selector pd_role=decode
--max-payload-size 2147483648
--worker-startup-timeout-secs 1200
env:
- name: NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
---
apiVersion: v1
kind: Service
metadata:
labels:
app: deepseek-rbg-32exp
name: deepseek-rbg-32exp
namespace: default
spec:
ports:
- name: http
port: 8080
protocol: TCP
targetPort: 8080
nodePort: 30080
selector:
rolebasedgroup.workloads.x-k8s.io/name: deepseek-rbg-32exp
rolebasedgroup.workloads.x-k8s.io/role: router
type: NodePort
```
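Apply the manifest and wait until all roles are running; the expected state looks like the listing below:
```bash
# Create the RoleBasedGroup and Service, then watch the pods come up.
kubectl apply -f rbg-dsv32.yml
kubectl get po -n default -w
```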
```bash
[root@ecs-001]# kubectl get po -n default
deepseek-rbg-32exp-decode-main-0 1/1 Running 0 74m
deepseek-rbg-32exp-decode-0-1 1/1 Running 0 74m
deepseek-rbg-32exp-router-9c5dbfc57 1/1 Running 0 22m
deepseek-rbg-32exp-prefill-0 1/1 Running 0 74m
[root@ecs-cbm-x1-pd-cpu-001 main_doc]# kubectl get svc |grep dee
deepseek-rbg-32exp-decode ClusterIP None <none> <none> 97m
deepseek-rbg-32exp-router-service NodePort 172.16.242.169 <none> 8000:30800/TCP 22m
deepseek-rbg-32exp-prefill ClusterIP None <none> <none> 97m
```
At this point, pick any node IP and access the service through NodePort 30800:
```bash
[root@ecs-001]# curl -X POST "http://{node_ip}:30800/v1/chat/completions" \
> -H "Content-Type: application/json" \
> -H "Authorization: Bearer None" \
> -d '{
> "rid":"ccccdd",
> "model": "dsv32",
> "messages": [
> {"role": "system", "content": "0: You are a helpful AI assistant"},
> {"role": "user", "content": "你是谁?."}
> ],
> "max_tokens":221
> }'
{"id":"ccccdd","object":"chat.completion","created":1750252498,"model":"qwen2","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\n嗯,用户问了一个很基础的自我介绍问题"你是谁?"。这可能是第一次互动时的常规开场白,也可能是想确认我的身份和功能范围。\n\n用户没有提供任何背景信息,语气简洁中性。这种场景下新用户的可能性较高,需要给出清晰友好的自我介绍,同时突出实用价值来降低陌生感。\n\n考虑到中文用户,应该用简体中文回复。重点要说明三点:身份归属(深度求索)、功能定位(AI助手)、服务范围(学习/工作/生活)。结尾用开放性问题引导对话很关键——既能了解需求,又能避免让用户面对空白输入框时不知所措。\n\n用波浪线结尾可以软化语气,那个笑脸表情😊刚好能中和AI的机械感。不过要控制表情符号数量,避免显得轻浮。\n</think>\n你好呀!我是你的AI助手,由深度求索公司(DeepSeek)开发的语言模型,名字叫 **DeepSeek-V32**。你可以把我当成一个知识丰富、随叫随到的小帮手~😊\n\n我的任务就是陪你聊天、解答问题、","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"length","matched_stop":null}],"usage":{"prompt_tokens":14,"total_tokens":235,"completion_tokens":221,"prompt_tokens_details":null}}
```
## FAQ
1. The current deployment startup parameters may not be fully compatible with all RDMA scenarios. Different RDMA/NCCL-related environment settings may be needed in different network environments.
2. Please ensure that the sglang code in the image has incorporated the changes from [PR #10912](https://github.com/sgl-project/sglang/pull/10912).