Unverified Commit 729b2429 authored by Baizhou Zhang, committed by GitHub

[Doc] Add documentation for DeepSeek V3.2 (#11877)


Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: ybyang <ybyang7@iflytek.com>
parent d7056c52
@@ -13,7 +13,7 @@
 "| Model | Reasoning tags | Parser | Notes |\n",
 "|---------|-----------------------------|------------------|-------|\n",
 "| [DeepSeek‑R1 series](https://huggingface.co/collections/deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d) | `<think>` … `</think>` | `deepseek-r1` | Supports all variants (R1, R1-0528, R1-Distill) |\n",
-"| [DeepSeek‑V3.1](https://huggingface.co/deepseek-ai/DeepSeek-V3.1) | `<think>` … `</think>` | `deepseek-v3` | Supports `thinking` parameter |\n",
+"| [DeepSeek‑V3 series](https://huggingface.co/deepseek-ai/DeepSeek-V3.1) | `<think>` … `</think>` | `deepseek-v3` | Including [DeepSeek‑V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp). Supports `thinking` parameter |\n",
 "| [Standard Qwen3 models](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f) | `<think>` … `</think>` | `qwen3` | Supports `enable_thinking` parameter |\n",
 "| [Qwen3-Thinking models](https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507) | `<think>` … `</think>` | `qwen3` or `qwen3-thinking` | Always generates thinking content |\n",
 "| [Kimi models](https://huggingface.co/moonshotai/models) | `◁think▷` … `◁/think▷` | `kimi` | Uses special thinking delimiters |\n",
@@ -26,7 +26,7 @@
 "- Both are handled by the same `deepseek-r1` parser\n",
 "\n",
 "**DeepSeek-V3 Family:**\n",
-"- DeepSeek-V3.1: Hybrid model supporting both thinking and non-thinking modes, use the `deepseek-v3` parser and `thinking` parameter (NOTE: not `enable_thinking`)\n",
+"- DeepSeek-V3.1/V3.2: Hybrid model supporting both thinking and non-thinking modes, use the `deepseek-v3` parser and `thinking` parameter (NOTE: not `enable_thinking`)\n",
 "\n",
 "**Qwen3 Family:**\n",
 "- Standard Qwen3 (e.g., Qwen3-2507): Use `qwen3` parser, supports `enable_thinking` in chat templates\n",
......
@@ -170,7 +170,7 @@ python3 -m sglang.launch_server \
 - The best configuration for `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` can be searched with [bench_speculative.py](https://github.com/sgl-project/sglang/blob/main/scripts/playground/bench_speculative.py) script for given batch size. The minimum configuration is `--speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2`, which can achieve speedup for larger batch sizes.
 - FlashAttention3, FlashMLA, and Triton backend fully supports MTP usage. For FlashInfer backend (`--attention-backend flashinfer`) with speculative decoding,`--speculative-eagle-topk` parameter should be set to `1`. MTP support for the CutlassMLA and TRTLLM MLA backends are still under development.
 - To enable DeepSeek MTP for large batch sizes (>32), there are some parameters should be changed (Reference [this discussion](https://github.com/sgl-project/sglang/issues/4543#issuecomment-2737413756)):
-  - Adjust `--max-running-requests` to a larger number. The default value is `32` for MTP. For larger batch sizes, you should increase this value beyond the default value.
+  - Adjust `--max-running-requests` to a larger number. The default value is `48` for MTP. For larger batch sizes, you should increase this value beyond the default value.
   - Set `--cuda-graph-bs`. It's a list of batch sizes for cuda graph capture. The default captured batch sizes for speculative decoding is set [here](https://github.com/sgl-project/sglang/blob/49420741746c8f3e80e0eb17e7d012bfaf25793a/python/sglang/srt/model_executor/cuda_graph_runner.py#L126). You can include more batch sizes into it.
......
# DeepSeek V3.2 Usage
[DeepSeek-V3.2-Exp](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp) equips DeepSeek-V3.1-Terminus with DeepSeek Sparse Attention (DSA) through continued training. With DSA, a fine-grained sparse attention mechanism powered by a lightning indexer, DeepSeek-V3.2 achieves efficiency improvements in long-context scenarios.
For reporting issues or tracking upcoming features, please refer to this [Roadmap](https://github.com/sgl-project/sglang/issues/11060).
## Installation
### Docker
```bash
# H200/B200
docker pull lmsysorg/sglang:latest
# MI350/MI355
docker pull lmsysorg/sglang:dsv32-rocm
# NPUs
docker pull lmsysorg/sglang:dsv32-a2
docker pull lmsysorg/sglang:dsv32-a3
```
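After pulling an image, the container can be started along these lines (a minimal sketch; the Hugging Face cache mount, shared-memory size, and host networking are assumptions to adapt to your environment):
```bash
# Minimal sketch: start the H200/B200 image with all GPUs visible.
# The cache mount and --shm-size values are illustrative, not required values.
docker run --gpus all \
    --shm-size 32g \
    --network host \
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \
    -it lmsysorg/sglang:latest bash
```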
### Build From Source
```bash
# Install SGLang
git clone https://github.com/sgl-project/sglang
cd sglang
pip3 install pip --upgrade
pip3 install -e "python[all]"
# Install flash_mla
git clone https://github.com/deepseek-ai/FlashMLA.git flash-mla
cd flash-mla
git submodule update --init --recursive
pip install -v .
```
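A quick way to sanity-check the build (a hedged sketch; the `flash_mla` module name is assumed from the library installed above):
```bash
# Verify that both SGLang and the FlashMLA extension import cleanly.
python3 -c "import sglang; print(sglang.__version__)"
python3 -c "import flash_mla; print('flash_mla imported OK')"
```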
## Launch DeepSeek V3.2 with SGLang
To serve DeepSeek-V3.2-Exp on 8xH200/B200 GPUs:
```bash
# Launch with TP + DP
python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --dp 8 --enable-dp-attention
# Launch with EP + DP
python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --ep 8 --dp 8 --enable-dp-attention
```
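Once a server is up (port 30000 by default), it exposes an OpenAI-compatible API; a minimal smoke test might look like the following (sketch only, with an illustrative prompt):
```bash
# Send one chat completion to the locally launched server.
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-V3.2-Exp",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```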
### Configuration Tips
- **DP Attention**: For the DeepSeek V3.2 model, the kernels are customized for the `dp_size=8` use case, so launching with `--enable-dp-attention --dp 8` (as in the commands above) is recommended.
- **Choices of Attention Kernels**: The attention backend is automatically set to the `nsa` backend for the DeepSeek V3.2 model. This backend implements different kernels for sparse prefill/decode, which can be selected with the `--nsa-prefill-backend` and `--nsa-decode-backend` server arguments (see the example command after this list). The available NSA prefill/decode kernels are:
- `flashmla_sparse`: `flash_mla_sparse_fwd` kernel from `flash_mla` library. Can run on both Hopper and Blackwell GPUs.
- `flashmla_kv`: `flash_mla_with_kvcache` kernel from `flash_mla` library. Can run on both Hopper and Blackwell GPUs.
- `fa3`: `flash_attn_with_kvcache` kernel from `flash_attn` library. Can only run on Hopper GPUs.
- `tilelang`: `tilelang` implementation that can run on GPU, HPU and NPU.
- `aiter`: AITER kernel on AMD GPUs. Can only be used as a decode kernel.
- Based on performance benchmarks, the default configurations on H200 and B200 are as follows:
- H200: `flashmla_sparse` prefill attention, `fa3` decode attention, `bf16` kv cache dtype.
- B200: `flashmla_kv` prefill attention, `flashmla_kv` decode attention, `fp8_e4m3` kv cache dtype.
- Currently we do not enable `prefill=flashmla_sparse` together with `decode=flashmla_kv` because of the latency introduced by KV cache quantization operations. We may switch to this setting in the future once the attention/quantization kernels are optimized.
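As referenced in the attention-kernel note above, the prefill/decode kernels can also be pinned explicitly at launch. The command below is only a sketch that reproduces the documented H200 defaults by hand; normally the automatic selection is sufficient:
```bash
# Illustrative: explicitly select the NSA prefill/decode kernels
# (these values mirror the documented H200 defaults).
python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp \
    --tp 8 --dp 8 --enable-dp-attention \
    --nsa-prefill-backend flashmla_sparse \
    --nsa-decode-backend fa3
```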
### Multi-token Prediction
SGLang implements Multi-Token Prediction (MTP) for DeepSeek V3.2 based on [EAGLE speculative decoding](https://docs.sglang.ai/advanced_features/speculative_decoding.html#EAGLE-Decoding). With this optimization, the decoding speed can be improved significantly on small batch sizes. Please look at [this PR](https://github.com/sgl-project/sglang/pull/11652) for more information.
Example usage:
```bash
python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --dp 8 --enable-dp-attention --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
```
- The best configuration for `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` can be searched with the [bench_speculative.py](https://github.com/sgl-project/sglang/blob/main/scripts/playground/bench_speculative.py) script for a given batch size. The minimum configuration is `--speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2`, which can still achieve a speedup at larger batch sizes.
- The default value of `--max-running-requests` is `48` for MTP. For larger batch sizes, this value should be increased beyond the default, as shown in the example after this list.
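For larger batch sizes, the MTP launch command can be combined with the knobs above; the values below are illustrative rather than tuned:
```bash
# Illustrative: minimum speculative configuration plus a larger running-request cap.
python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp \
    --tp 8 --dp 8 --enable-dp-attention \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 \
    --max-running-requests 128
```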
## Function Calling and Reasoning Parser
Function calling and the reasoning parser are used the same way as for DeepSeek V3.1. Please refer to the [Reasoning Parser](https://docs.sglang.ai/advanced_features/separate_reasoning.html) and [Tool Parser](https://docs.sglang.ai/advanced_features/tool_parser.html) documents.
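For reference, a hedged sketch of wiring this up: launch with the `deepseek-v3` reasoning parser (the same parser used for V3.1) and toggle thinking mode per request. The `chat_template_kwargs` request field follows the linked reasoning-parser document and is an assumption here, not something verified specifically for V3.2:
```bash
# Sketch: enable the reasoning parser at launch (same parser as DeepSeek V3.1).
python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp \
    --tp 8 --dp 8 --enable-dp-attention \
    --reasoning-parser deepseek-v3

# Sketch: request thinking mode per call; chat_template_kwargs usage is assumed
# to match the V3.1 behavior described in the reasoning-parser document.
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-V3.2-Exp",
        "messages": [{"role": "user", "content": "What is 17 * 24?"}],
        "chat_template_kwargs": {"thinking": true},
        "max_tokens": 512
      }'
```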
## PD Disaggregation
Prefill Command:
```bash
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3.2-Exp \
--disaggregation-mode prefill \
--host $LOCAL_IP \
--port $PORT \
--tp 8 \
--dp 8 \
--enable-dp-attention \
--dist-init-addr ${HOST}:${DIST_PORT} \
--trust-remote-code \
--disaggregation-bootstrap-port 8998 \
  --mem-fraction-static 0.9
```
Decode command:
```bash
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3.2-Exp \
--disaggregation-mode decode \
--host $LOCAL_IP \
--port $PORT \
--tp 8 \
--dp 8 \
--enable-dp-attention \
--dist-init-addr ${HOST}:${DIST_PORT} \
--trust-remote-code \
  --mem-fraction-static 0.9
```
Router command:
```bash
python -m sglang_router.launch_router --pd-disaggregation \
--prefill $PREFILL_ADDR 8998 \
--decode $DECODE_ADDR \
--host 127.0.0.1 \
  --port 8000
```
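After the router starts, requests should be sent to its port (8000 here) and are dispatched across the prefill and decode instances; a quick smoke test (sketch only):
```bash
# Send one request through the PD router instead of an individual worker.
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-ai/DeepSeek-V3.2-Exp", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 16}'
```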
For more advanced or production-ready deployment methods, such as RBG- or LWS-based deployment, please refer to [references/multi_node_deployment/rbg_pd/deepseekv32_pd.md](../references/multi_node_deployment/rbg_pd/deepseekv32_pd.md). That document also contains startup commands for DeepEP-based EP parallelism.
## Benchmarking Results
### Accuracy Test with `gsm8k`
A simple accuracy benchmark can be run on the `gsm8k` dataset:
```bash
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
```
The resulting accuracy is 0.956, which matches expectations:
```bash
Accuracy: 0.956
Invalid: 0.000
Latency: 25.109 s
Output throughput: 5226.235 token/s
```
### Accuracy Test with `gpqa-diamond`
A long-context accuracy benchmark can be run on the GPQA-Diamond dataset with long output tokens and thinking enabled:
```bash
python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 120000 --repeat 8 --thinking-mode deepseek-v3
```
The mean accuracy over 8 runs is 0.797, which is consistent with the 79.9 reported in the official tech report.
```bash
Repeat: 8, mean: 0.797
Scores: ['0.808', '0.798', '0.808', '0.798', '0.783', '0.788', '0.803', '0.793']
```
# DeepSeek-V3.2-Exp RBG-Based PD Deployment
## 0. Prerequisites
1. Kubernetes >= 1.26
2. LWS (LeaderWorkerSet) installed on the cluster.
3. RBG installed on the cluster.

For RBG installation, please refer to: https://github.com/sgl-project/rbg
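Before applying the manifest, the required CRDs can be checked (a sketch; the CRD names are inferred from the `apiVersion`/`kind` fields used in the manifest below):
```bash
# Confirm the LWS and RBG CRDs exist on the cluster.
kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io
kubectl get crd rolebasedgroups.workloads.x-k8s.io
```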
## 1. Image Preparation
`lmsysorg/sglang:latest`
## 2. All-in-One Manifest File
*Note: The nodeSelector, model location, and taint toleration sections can be adjusted to match your actual deployment environment.*
Save the following manifest as `rbg-dsv32.yml`:
```yaml
apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
metadata:
name: deepseek-rbg-32exp
namespace: default
spec:
roles:
- name: prefill
replicas: 1
workload:
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
restartPolicy: None
leaderWorkerSet:
size: 1
patchLeaderTemplate:
metadata:
labels:
role: leader
pd_role: prefill
spec:
containers:
- command:
- python3
- -m
- sglang.launch_server
- --model-path
- /work/models
- --port
- "30000"
- --host
- 0.0.0.0
- --disable-radix-cache
- --disaggregation-ib-device
- mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7
- --chunked-prefill-size
- "131072"
- --page-size
- "64"
# - --enable-eplb
- --ep-dispatch-algorithm
- dynamic
- --eplb-algorithm
- deepseek
- --enable-dp-lm-head
- --enable-dp-attention
- --dp-size
- "8"
- --moe-a2a-backend
- deepep
- --deepep-mode
- normal
- --disaggregation-mode
- prefill
- --mem-fraction-static
- "0.8"
- --max-prefill-tokens
- "32768"
- --context-length
- "32768"
- --tp
- "8"
- --dist-init-addr
- $(LWS_LEADER_ADDRESS):20102
- --nnodes
- $(LWS_GROUP_SIZE)
- --node-rank
- $(LWS_WORKER_INDEX)
- --trust-remote-code
- --ep-num-redundant-experts
- "32"
- --moe-dense-tp-size
- "1"
- --max-running-requests
- "1024"
env:
- name: LWS_WORKER_INDEX
valueFrom:
fieldRef:
fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
livenessProbe:
failureThreshold: 3000
httpGet:
path: /health
port: 30000
initialDelaySeconds: 300
periodSeconds: 60
successThreshold: 1
timeoutSeconds: 10
readinessProbe:
failureThreshold: 20
httpGet:
path: /health
port: 30000
periodSeconds: 30
successThreshold: 1
timeoutSeconds: 10
name: sglang
ports:
- containerPort: 30000
name: sglang-http
protocol: TCP
patchWorkerTemplate: {}
template:
metadata:
labels:
inference-framework: sglang
inference-stack.io/monitoring: "enabled"
spec:
containers:
- name: sglang
image: lmsysorg/sglang:latest
env:
- name: SGLANG_SKIP_SGL_KERNEL_VERSION_CHECK
value: "1"
- name: CUDA_LAUNCH_BLOCKING
value: "0"
- name: SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT
value: "1000000000"
- name: NVSHMEM_IB_TRAFFIC_CLASS
value: "16"
- name: NVSHMEM_DISABLE_P2P
value: "0"
- name: ENABLE_METRICS
value: "true"
- name: NVSHMEM_IB_GID_INDEX
value: "3"
- name: NVSHMEM_IB_SL
value: "5"
- name: SGLANG_SET_CPU_AFFINITY
value: "true"
- name: SGL_ENABLE_JIT_DEEPGEMM
value: "1"
- name: NCCL_IB_QPS_PER_CONNECTION
value: "8"
- name: NCCL_IB_SPLIT_DATA_ON_QPS
value: "1"
- name: NCCL_NET_PLUGIN
value: "none"
- name: NCCL_IB_TC
value: "136"
- name: NCCL_IB_SL
value: "5"
- name: NCCL_IB_TIMEOUT
value: "22"
- name: NCCL_IB_GID_INDEX
value: "3"
- name: NCCL_MIN_NCHANNELS
value: "4"
- name: NCCL_SOCKET_IFNAME
value: bond0
- name: GLOO_SOCKET_IFNAME
value: bond0
- name: NCCL_IB_HCA
value: ^=mlx5_0,mlx5_5,mlx5_6
- name: NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME
value: "bond0"
- name: MC_TE_METRIC
value: "false"
resources:
limits:
nvidia.com/gpu: "8"
securityContext:
capabilities:
add:
- IPC_LOCK
privileged: true
volumeMounts:
- mountPath: /root/.cache
name: sgl-cache
- mountPath: /dev/shm
name: dshm
- mountPath: /work/models
name: model
- mountPath: /dev/infiniband
name: ib
- mountPath: /sgl-workspace/sglang
name: src
dnsPolicy: ClusterFirstWithHostNet
hostIPC: true
hostNetwork: true
nodeSelector:
pd: "yes"
tolerations:
- key: pd
operator: Exists
volumes:
- hostPath:
path: /var/run/sys-topology
name: topo
- hostPath:
path: /data1/sgl_cache4
type: DirectoryOrCreate
name: sgl-cache
- emptyDir:
medium: Memory
name: dshm
- hostPath:
path: /data/DeepSeek-V3.2-Exp
name: model
- hostPath:
path: /dev/infiniband
name: ib
- hostPath:
path: /data/src/sglang
type: DirectoryOrCreate
name: src
- name: decode
replicas: 1
workload:
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
leaderWorkerSet:
size: 1
patchLeaderTemplate:
metadata:
labels:
role: leader
pd_role: decode
spec:
containers:
- command:
- python3
- -m
- sglang.launch_server
- --model-path
- /work/models
- --port
- "30000"
- --host
- 0.0.0.0
- --disaggregation-ib-device
- mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7
- --chunked-prefill-size
- "131072"
- --prefill-round-robin-balance
- --eplb-rebalance-layers-per-chunk
- "29"
- --page-size
- "64"
- --enable-dp-attention
- --enable-dp-lm-head
- --dp-size
- "8"
- --moe-a2a-backend
- deepep
- --deepep-mode
- low_latency
- --disaggregation-mode
- decode
- --mem-fraction-static
- "0.8"
- --context-length
- "32768"
- --max-running-requests
- "2048"
- --tp-size
- "8" # Size of Tensor Parallelism
- --cuda-graph-max-bs
- "16"
- --dist-init-addr
- $(LWS_LEADER_ADDRESS):20102
- --nnodes
- $(LWS_GROUP_SIZE)
- --node-rank
- $(LWS_WORKER_INDEX)
- --trust-remote-code
- --ep-num-redundant-experts
- "32"
- --moe-dense-tp-size
- "1"
env:
- name: LWS_WORKER_INDEX
valueFrom:
fieldRef:
fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
livenessProbe:
failureThreshold: 30000
httpGet:
path: /health
port: 30000
initialDelaySeconds: 300
periodSeconds: 60
successThreshold: 1
timeoutSeconds: 10
name: sglang
readinessProbe:
failureThreshold: 20
httpGet:
path: /health
port: 30000
periodSeconds: 30
successThreshold: 1
timeoutSeconds: 10
patchWorkerTemplate:
spec:
containers:
- command:
- python3
- -m
- sglang.launch_server
- --model-path
- /work/models
- --crash-dump-folder
- /log
- --chunked-prefill-size
- "262144"
- --prefill-round-robin-balance
- --eplb-rebalance-layers-per-chunk
- "29"
- --page-size
- "64"
- --enable-dp-attention
- --enable-dp-lm-head
- --dp-size
- "32"
- --moe-a2a-backend
- "deepep"
- --deepep-mode
- low_latency
- --disaggregation-mode
- decode
- --mem-fraction-static
- "0.849"
- --context-length
- "32768"
- --disaggregation-ib-device
- mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7
- --max-running-requests
- "4096"
- --cuda-graph-max-bs
- "16"
- --tp-size
- "8" # Size of Tensor Parallelism
- --dist-init-addr
- $(LWS_LEADER_ADDRESS):20102
- --nnodes
- $(LWS_GROUP_SIZE)
- --node-rank
- $(LWS_WORKER_INDEX)
- --trust-remote-code
- --ep-num-redundant-experts
- "32"
- --moe-dense-tp-size
- "1"
env:
- name: LWS_WORKER_INDEX
valueFrom:
fieldRef:
fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
name: sglang
template:
metadata:
labels:
inference-framework: sglang-unuse
inference-stack.io/monitoring: "enabled"
spec:
containers:
- image: lmsysorg/sglang:latest
name: sglang
resources:
limits:
nvidia.com/gpu: "8"
securityContext:
capabilities:
add:
- IPC_LOCK
privileged: true
volumeMounts:
- mountPath: /root/.cache
name: sgl-cache
- mountPath: /dev/shm
name: dshm
- mountPath: /work/models
name: model
- mountPath: /dev/infiniband
name: ib
- mountPath: /sgl-workspace/sglang
name: src
env:
- name: SGLANG_SKIP_SGL_KERNEL_VERSION_CHECK
value: "1"
- name: SGLANG_DISAGGREGATION_WAITING_TIMEOUT
value: "100000000"
- name: NVSHMEM_DISABLE_P2P
value: "0"
- name: NVSHMEM_IB_TRAFFIC_CLASS
value: "16"
- name: NVSHMEM_IB_SL
value: "5"
- name: ENABLE_METRICS
value: "true"
- name: CUDA_LAUNCH_BLOCKING
value: "0"
- name: NVSHMEM_IB_GID_INDEX
value: "3"
- name: NCCL_IB_QPS_PER_CONNECTION
value: "8"
- name: NCCL_IB_SPLIT_DATA_ON_QPS
value: "1"
- name: NCCL_NET_PLUGIN
value: "none"
- name: NCCL_IB_TC
value: "136"
- name: NCCL_IB_SL
value: "5"
- name: NCCL_IB_TIMEOUT
value: "22"
- name: NCCL_IB_GID_INDEX
value: "3"
- name: NCCL_MIN_NCHANNELS
value: "4"
- name: NCCL_SOCKET_IFNAME
value: bond0
- name: GLOO_SOCKET_IFNAME
value: bond0
- name: NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME
value: "bond0"
- name: NCCL_IB_HCA
value: ^=mlx5_0,mlx5_5,mlx5_6
- name: MC_TE_METRIC
value: "false"
- name: SGL_ENABLE_JIT_DEEPGEMM
value: "1"
dnsPolicy: ClusterFirstWithHostNet
hostIPC: true
hostNetwork: true
nodeSelector:
pd: "yes"
tolerations:
- key: pd
operator: Exists
volumes:
- hostPath:
path: /var/run/sys-topology
name: topo
- hostPath:
path: /data1/sgl_cache4
type: DirectoryOrCreate
name: sgl-cache
- hostPath:
path: /data/src/sglang
type: DirectoryOrCreate
name: src
- emptyDir:
medium: Memory
name: dshm
- hostPath:
path: /data/DeepSeek-V3.2-Exp
name: model
- hostPath:
path: /dev/infiniband
name: ib
- name: router
replicas: 1
dependencies: [ "decode", "prefill" ]
template:
spec:
containers:
- name: scheduler
image: lmsysorg/sglang:latest
command:
- sh
- -c
- >
python3 -m sglang_router.launch_router
--host 0.0.0.0
--port 8080
--pd-disaggregation
--policy random
--service-discovery
--service-discovery-namespace ${NAMESPACE}
--service-discovery-port 30000
--prefill-selector pd_role=prefill
--decode-selector pd_role=decode
--max-payload-size 2147483648
--worker-startup-timeout-secs 1200
env:
- name: NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
---
apiVersion: v1
kind: Service
metadata:
labels:
app: deepseek-rbg-32exp
name: deepseek-rbg-32exp
namespace: default
spec:
ports:
- name: http
port: 8080
protocol: TCP
targetPort: 8080
nodePort: 30080
selector:
rolebasedgroup.workloads.x-k8s.io/name: deepseek-rbg-32exp
rolebasedgroup.workloads.x-k8s.io/role: router
type: NodePort
```
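Apply the manifest and wait until all roles are running; the expected state looks like the listing below:
```bash
# Create the RoleBasedGroup and Service, then watch the pods come up.
kubectl apply -f rbg-dsv32.yml
kubectl get po -n default -w
```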
```bash
[root@ecs-001]# kubectl get po -n default
deepseek-rbg-32exp-decode-main-0 1/1 Running 0 74m
deepseek-rbg-32exp-decode-0-1 1/1 Running 0 74m
deepseek-rbg-32exp-router-9c5dbfc57 1/1 Running 0 22m
deepseek-rbg-32exp-prefill-0 1/1 Running 0 74m
[root@ecs-cbm-x1-pd-cpu-001 main_doc]# kubectl get svc |grep dee
deepseek-rbg-32exp-decode ClusterIP None <none> <none> 97m
deepseek-rbg-32exp-router-service NodePort 172.16.242.169 <none> 8000:30800/TCP 22m
deepseek-rbg-32exp-prefill ClusterIP None <none> <none> 97m
```
At this point, pick any node IP and access the service through NodePort 30800:
```bash
[root@ecs-001]# curl -X POST "http://{node_ip}:30800/v1/chat/completions" \
> -H "Content-Type: application/json" \
> -H "Authorization: Bearer None" \
> -d '{
> "rid":"ccccdd",
> "model": "dsv32",
> "messages": [
> {"role": "system", "content": "0: You are a helpful AI assistant"},
> {"role": "user", "content": "你是谁?."}
> ],
> "max_tokens":221
> }'
{"id":"ccccdd","object":"chat.completion","created":1750252498,"model":"qwen2","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\n嗯,用户问了一个很基础的自我介绍问题"你是谁?"。这可能是第一次互动时的常规开场白,也可能是想确认我的身份和功能范围。\n\n用户没有提供任何背景信息,语气简洁中性。这种场景下新用户的可能性较高,需要给出清晰友好的自我介绍,同时突出实用价值来降低陌生感。\n\n考虑到中文用户,应该用简体中文回复。重点要说明三点:身份归属(深度求索)、功能定位(AI助手)、服务范围(学习/工作/生活)。结尾用开放性问题引导对话很关键——既能了解需求,又能避免让用户面对空白输入框时不知所措。\n\n用波浪线结尾可以软化语气,那个笑脸表情😊刚好能中和AI的机械感。不过要控制表情符号数量,避免显得轻浮。\n</think>\n你好呀!我是你的AI助手,由深度求索公司(DeepSeek)开发的语言模型,名字叫 **DeepSeek-V32**。你可以把我当成一个知识丰富、随叫随到的小帮手~😊\n\n我的任务就是陪你聊天、解答问题、","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"length","matched_stop":null}],"usage":{"prompt_tokens":14,"total_tokens":235,"completion_tokens":221,"prompt_tokens_details":null}}
```
## FAQ
1. The current deployment startup parameters may not be fully compatible with all RDMA scenarios. Different RDMA/NCCL-related environment settings may be needed in different network environments.
2. Please ensure that the sglang code in the image has incorporated the changes from [PR #10912](https://github.com/sgl-project/sglang/pull/10912).