SGLang Documentation
====================
SGLang is a fast serving framework for large language models and vision language models.
It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.
The core features include:
- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, prefill-decode disaggregation, speculative decoding, continuous batching, paged attention, tensor/pipeline/expert/data parallelism, structured outputs, chunked prefill, quantization (FP4/FP8/INT4/AWQ/GPTQ), and multi-lora batching.
- **Flexible Frontend Language**: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
- **Extensive Model Support**: Supports a wide range of generative models (Llama, Qwen, DeepSeek, Kimi, GPT, Gemma, Mistral, etc.), embedding models (e5-mistral, gte, mcdse) and reward models (Skywork), with easy extensibility for integrating new models.
- **Active Community**: SGLang is open-source and backed by an active community with wide industry adoption.
.. toctree::
:maxdepth: 1
:caption: Get Started
get_started/install.md
.. toctree::
:maxdepth: 1
:caption: Basic Usage
basic_usage/send_request.ipynb
basic_usage/openai_api.rst
basic_usage/offline_engine_api.ipynb
basic_usage/native_api.ipynb
basic_usage/sampling_params.md
basic_usage/deepseek.md
basic_usage/gpt_oss.md
basic_usage/llama4.md
.. toctree::
:maxdepth: 1
:caption: Advanced Features
advanced_features/server_arguments.md
advanced_features/hyperparameter_tuning.md
advanced_features/speculative_decoding.ipynb
advanced_features/structured_outputs.ipynb
advanced_features/structured_outputs_for_reasoning_models.ipynb
advanced_features/function_calling.ipynb
advanced_features/separate_reasoning.ipynb
advanced_features/quantization.md
advanced_features/lora.ipynb
advanced_features/pd_disaggregation.md
advanced_features/vlm_query.ipynb
advanced_features/router.md
advanced_features/observability.md
advanced_features/attention_backend.md
.. toctree::
:maxdepth: 1
:caption: Supported Models
supported_models/generative_models.md
supported_models/multimodal_language_models.md
supported_models/embedding_models.md
supported_models/reward_models.md
supported_models/rerank_models.md
supported_models/support_new_models.md
supported_models/transformers_fallback.md
supported_models/modelscope.md
.. toctree::
:maxdepth: 1
:caption: Hardware Platforms
platforms/amd_gpu.md
platforms/blackwell_gpu.md
platforms/cpu_server.md
platforms/tpu.md
platforms/nvidia_jetson.md
platforms/ascend_npu.md
.. toctree::
:maxdepth: 1
:caption: Developer Guide
developer_guide/contribution_guide.md
developer_guide/development_guide_using_docker.md
developer_guide/benchmark_and_profiling.md
developer_guide/bench_serving.md
.. toctree::
:maxdepth: 1
:caption: References
references/faq.md
references/environment_variables.md
references/production_metrics.md
references/multi_node_deployment/multi_node_index.rst
references/custom_chat_template.md
references/frontend/frontend_index.rst
references/learn_more.md
# AMD GPUs
This document describes how to run SGLang on AMD GPUs. If you encounter issues or have questions, please [open an issue](https://github.com/sgl-project/sglang/issues).
## System Configuration
When using AMD GPUs (such as MI300X), certain system-level optimizations help ensure stable performance. Here we take MI300X as an example. AMD provides official documentation for MI300X optimization and system tuning:
- [AMD MI300X Tuning Guides](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html)
- [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference/vllm-benchmark.html)
- [AMD Instinct MI300X System Optimization](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html)
- [AMD Instinct MI300X Workload Optimization](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference-optimization/workload.html)
- [Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1-Part2/README.html)
**NOTE:** We strongly recommend reading these docs and guides entirely to fully utilize your system.
Below are a few key settings to confirm or enable for SGLang:
### Update GRUB Settings
In `/etc/default/grub`, append the following to `GRUB_CMDLINE_LINUX`:
```text
pci=realloc=off iommu=pt
```
Afterward, run `sudo update-grub` (or your distro’s equivalent) and reboot.
### Disable NUMA Auto-Balancing
```bash
sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
```
You can automate or verify this change using [this helpful script](https://github.com/ROCm/triton/blob/rocm_env/scripts/amd/env_check.sh).
Again, please go through the entire documentation to confirm your system is using the recommended configuration.
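As a quick sanity check (a minimal sketch, not a substitute for the guides above), you can confirm both settings from a shell:
```bash
# Kernel command line should contain the GRUB options added above
grep -o 'pci=realloc=off\|iommu=pt' /proc/cmdline
# NUMA auto-balancing should be disabled (expected output: 0)
cat /proc/sys/kernel/numa_balancing
```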
## Install SGLang
You can install SGLang using one of the methods below.
### Install from Source
```bash
# Use the latest release branch
git clone -b v0.5.2rc1 https://github.com/sgl-project/sglang.git
cd sglang
# Compile sgl-kernel
pip install --upgrade pip
cd sgl-kernel
python setup_rocm.py install
# Install sglang python package
cd ..
pip install -e "python[all_hip]"
```
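After the installation finishes, a quick import check (a minimal sketch) confirms that the package is visible to Python:
```bash
# Print the installed SGLang version
python3 -c "import sglang; print(sglang.__version__)"
```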
### Install Using Docker (Recommended)
The docker images are available on Docker Hub at [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile.rocm](https://github.com/sgl-project/sglang/tree/main/docker).
The steps below show how to build and use an image.
1. Build the docker image.
If you use pre-built images, you can skip this step and replace `sglang_image` with the pre-built image names in the steps below.
```bash
docker build -t sglang_image -f Dockerfile.rocm .
```
2. Create a convenient alias.
```bash
alias drun='docker run -it --rm --network=host --privileged --device=/dev/kfd --device=/dev/dri \
--ipc=host --shm-size 16G --group-add video --cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-v $HOME/dockerx:/dockerx \
-v /data:/data'
```
If you are using RDMA, please note that:
- `--network host` and `--privileged` are required by RDMA. If you don't need RDMA, you can remove them.
- You may need to set `NCCL_IB_GID_INDEX` if you are using RoCE, for example: `export NCCL_IB_GID_INDEX=3`.
3. Launch the server.
**NOTE:** Replace `<secret>` below with your [huggingface hub token](https://huggingface.co/docs/hub/en/security-tokens).
```bash
drun -p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
sglang_image \
python3 -m sglang.launch_server \
--model-path NousResearch/Meta-Llama-3.1-8B \
--host 0.0.0.0 \
--port 30000
```
4. To verify the setup, you can run a benchmark in another terminal or refer to [other docs](https://docs.sglang.ai/backend/openai_api_completions.html) to send requests to the engine.
```bash
drun sglang_image \
python3 -m sglang.bench_serving \
--backend sglang \
--dataset-name random \
--num-prompts 4000 \
--random-input 128 \
--random-output 128
```
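Alternatively, you can send a single request to the running server with `curl` (a minimal sketch using the native `/generate` endpoint; adjust the host and port to your setup):
```bash
curl -s http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "The capital of France is", "sampling_params": {"max_new_tokens": 16, "temperature": 0}}'
```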
With your AMD system properly configured and SGLang installed, you can now fully leverage AMD hardware to power SGLang’s machine learning capabilities.
## Examples
### Running DeepSeek-V3
The only difference when running DeepSeek-V3 is the model passed to `--model-path` (together with the tensor-parallel and trust-remote-code flags) when starting the server. Here's an example command:
```bash
drun -p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--ipc=host \
--env "HF_TOKEN=<secret>" \
sglang_image \
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3 \
--tp 8 \
--trust-remote-code \
--host 0.0.0.0 \
--port 30000
```
[Running DeepSeek-R1 on a single NDv5 MI300X VM](https://techcommunity.microsoft.com/blog/azurehighperformancecomputingblog/running-deepseek-r1-on-a-single-ndv5-mi300x-vm/4372726) could also be a good reference.
### Running Llama3.1
Running Llama3.1 is nearly identical to running DeepSeek-V3. The only difference is the model passed to `--model-path` when starting the server, as shown in the following example command:
```bash
drun -p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--ipc=host \
--env "HF_TOKEN=<secret>" \
sglang_image \
python3 -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--tp 8 \
--trust-remote-code \
--host 0.0.0.0 \
--port 30000
```
### Warmup Step
When the server displays `The server is fired up and ready to roll!`, the warmup has completed and the startup is successful.
# Ascend NPUs
You can install SGLang using any of the methods below. Please go through the `System Settings` section first to ensure your cluster runs at maximum performance. Feel free to leave an issue [here at sglang](https://github.com/sgl-project/sglang/issues) if you encounter any problems.
## System Settings
### CPU performance power scheme
The default power scheme on Ascend hardware is `ondemand`, which can hurt performance; changing it to `performance` is recommended.
```shell
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Make sure changes are applied successfully
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor # shows performance
```
### Disable NUMA balancing
```shell
sudo sysctl -w kernel.numa_balancing=0
# Check
cat /proc/sys/kernel/numa_balancing # shows 0
```
### Prevent swapping out system memory
```shell
sudo sysctl -w vm.swappiness=10
# Check
cat /proc/sys/vm/swappiness # shows 10
```
## Installing SGLang
### Method 1: Installing from source with prerequisites
#### Python Version
Only `python==3.11` is supported currently. To avoid interfering with the system's pre-installed Python, consider installing with [conda](https://github.com/conda/conda).
```shell
conda create --name sglang_npu python=3.11
conda activate sglang_npu
```
#### MemFabric Adaptor
_TODO: MemFabric is still a work in progress and is not planned to be open sourced until August/September 2025. Until then, we release it as a prebuilt wheel package._
_Notice: The prebuilt wheel is built for `aarch64`. Please leave an issue [here at sglang](https://github.com/sgl-project/sglang/issues) to let us know if you need an `amd64` build._
MemFabric Adaptor is a drop-in replacement for the Mooncake Transfer Engine that enables KV cache transfer on Ascend NPU clusters.
```shell
MF_WHL_NAME="mf_adapter-1.0.0-cp311-cp311-linux_aarch64.whl"
MEMFABRIC_URL="https://sglang-ascend.obs.cn-east-3.myhuaweicloud.com/sglang/${MF_WHL_NAME}"
wget -O "${MF_WHL_NAME}" "${MEMFABRIC_URL}" && pip install "./${MF_WHL_NAME}"
```
#### PyTorch and the PyTorch Framework Adaptor on Ascend
Only `torch==2.6.0` is supported currently due to limitations of NPUgraph and Triton-on-Ascend; a more general version will be released by the end of September 2025.
```shell
PYTORCH_VERSION=2.6.0
TORCHVISION_VERSION=0.21.0
pip install torch==$PYTORCH_VERSION torchvision==$TORCHVISION_VERSION --index-url https://download.pytorch.org/whl/cpu
PTA_VERSION="v7.1.0.1-pytorch2.6.0"
PTA_WHL_NAME="torch_npu-2.6.0.post1-cp311-cp311-manylinux_2_28_aarch64.whl"
PTA_URL="https://gitee.com/ascend/pytorch/releases/download/${PTA_VERSION}/${PTA_WHL_NAME}"
wget -O "${PTA_WHL_NAME}" "${PTA_URL}" && pip install "./${PTA_WHL_NAME}"
```
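You can verify the PyTorch and Ascend adaptor installation with a quick import check (a minimal sketch; `torch.npu.is_available()` assumes the standard `torch_npu` integration):
```shell
# Quick check that torch and torch_npu import and an NPU device is visible
python3 -c "import torch, torch_npu; print(torch.__version__, torch.npu.is_available())"
```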
#### vLLM
vLLM is still a major prerequisite on Ascend NPU. Because of the `torch==2.6.0` limitation, only vLLM v0.8.5 is supported.
```shell
VLLM_TAG=v0.8.5
git clone --depth 1 https://github.com/vllm-project/vllm.git --branch $VLLM_TAG
(cd vllm && VLLM_TARGET_DEVICE="empty" pip install -v -e .)
```
#### Triton on Ascend
_Notice:_ We recommend installing `triton-ascend` from source due to its rapid development; the version on PyPI cannot keep up for now. This will be resolved in September 2025, after which `pip install` will be the only installation method.
Please follow Triton-on-Ascend's [installation guide from source](https://gitee.com/ascend/triton-ascend#2%E6%BA%90%E4%BB%A3%E7%A0%81%E5%AE%89%E8%A3%85-triton-ascend) to install the latest `triton-ascend` package.
#### DeepEP-compatible Library
We also provide a DeepEP-compatible library as a drop-in replacement for deepseek-ai's DeepEP library; see the [installation guide](https://github.com/sgl-project/sgl-kernel-npu/blob/main/python/deep_ep/README.md).
#### Installing SGLang from source
```shell
# Use the latest release branch
git clone -b v0.5.2rc1 https://github.com/sgl-project/sglang.git
cd sglang
pip install --upgrade pip
pip install -e python[srt_npu]
```
### Method 2: Using docker
__Notice:__ `--privileged` and `--network=host` are required by RDMA, which is typically needed by Ascend NPU clusters.
__Notice:__ The following docker command is based on Atlas 800I A3 machines. If you are using Atlas 800I A2, make sure only `davinci[0-7]` are mapped into the container.
```shell
# Clone the SGLang repository
git clone https://github.com/sgl-project/sglang.git
cd sglang/docker
# Build the docker image
docker build -t sglang-npu:main -f Dockerfile.npu .
alias drun='docker run -it --rm --privileged --network=host --ipc=host --shm-size=16g \
--device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \
--device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \
--device=/dev/davinci8 --device=/dev/davinci9 --device=/dev/davinci10 --device=/dev/davinci11 \
--device=/dev/davinci12 --device=/dev/davinci13 --device=/dev/davinci14 --device=/dev/davinci15 \
--device=/dev/davinci_manager --device=/dev/hisi_hdc \
--volume /usr/local/sbin:/usr/local/sbin --volume /usr/local/Ascend/driver:/usr/local/Ascend/driver \
--volume /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
--volume /etc/ascend_install.info:/etc/ascend_install.info \
--volume /var/queue_schedule:/var/queue_schedule --volume ~/.cache/:/root/.cache/'
drun --env "HF_TOKEN=<secret>" \
sglang-npu:main \
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --attention-backend ascend --host 0.0.0.0 --port 30000
```
## Examples
### Running DeepSeek-V3
Running DeepSeek with PD disaggregation on 2 x Atlas 800I A3 machines.
The model weights can be found [here](https://modelers.cn/models/State_Cloud/Deepseek-R1-bf16-hfd-w8a8).
Prefill:
```shell
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export ASCEND_MF_STORE_URL="tcp://<PREFILL_HOST_IP>:<PORT>"
drun sglang-npu:main \
python3 -m sglang.launch_server --model-path State_Cloud/DeepSeek-R1-bf16-hfd-w8a8 \
--trust-remote-code \
--attention-backend ascend \
--mem-fraction-static 0.8 \
--quantization w8a8_int8 \
--tp-size 16 \
--dp-size 1 \
--nnodes 1 \
--node-rank 0 \
--disaggregation-mode prefill \
--disaggregation-bootstrap-port 6657 \
--disaggregation-transfer-backend ascend \
--dist-init-addr <PREFILL_HOST_IP>:6688 \
--host <PREFILL_HOST_IP> \
--port 8000
```
Decode:
```shell
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export ASCEND_MF_STORE_URL="tcp://<PREFILL_HOST_IP>:<PORT>"
export HCCL_BUFFSIZE=200
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=24
drun sglang-npu:main \
python3 -m sglang.launch_server --model-path State_Cloud/DeepSeek-R1-bf16-hfd-w8a8 \
--trust-remote-code \
--attention-backend ascend \
--mem-fraction-static 0.8 \
--quantization w8a8_int8 \
--enable-deepep-moe \
--deepep-mode low_latency \
--tp-size 16 \
--dp-size 1 \
--ep-size 16 \
--nnodes 1 \
--node-rank 0 \
--disaggregation-mode decode \
--disaggregation-transfer-backend ascend \
--dist-init-addr <DECODE_HOST_IP>:6688 \
--host <DECODE_HOST_IP> \
--port 8001
```
Mini_LB:
```shell
drun sglang-npu:main \
python -m sglang.srt.disaggregation.launch_lb \
--prefill http://<PREFILL_HOST_IP>:8000 \
--decode http://<DECODE_HOST_IP>:8001 \
--host 127.0.0.1 --port 5000
```
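Once the prefill node, decode node, and load balancer are all up, requests can be sent to the load balancer endpoint (a minimal sketch using the native `/generate` endpoint; adjust the host and port to your deployment):
```shell
curl -s http://127.0.0.1:5000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, my name is", "sampling_params": {"max_new_tokens": 32}}'
```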
# Blackwell GPUs
We will release the pre-built wheels soon. Before that, please try to compile from source or check the blackwell docker images from [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags).
## B200 with x86 CPUs
TODO
## GB200/GB300 with ARM CPUs
TODO
# CPU Servers
This document describes how to set up the [SGLang](https://github.com/sgl-project/sglang) environment and run LLM inference on CPU servers.
Specifically, SGLang is well optimized on CPUs equipped with Intel® AMX instructions,
i.e., 4th generation or newer Intel® Xeon® Scalable Processors.
## Optimized Model List
A number of popular LLMs are optimized to run efficiently on CPU,
including notable open-source models such as the Llama and Qwen series,
as well as the high-quality reasoning model DeepSeek-R1.
| Model Name | BF16 | w8a8_int8 | FP8 |
|:---:|:---:|:---:|:---:|
| DeepSeek-R1 | | [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8) | [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) |
| Llama-3.2-3B | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | [RedHatAI/Llama-3.2-3B-quantized.w8a8](https://huggingface.co/RedHatAI/Llama-3.2-3B-Instruct-quantized.w8a8) | |
| Llama-3.1-8B | [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | [RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8](https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8) | |
| QwQ-32B | | [RedHatAI/QwQ-32B-quantized.w8a8](https://huggingface.co/RedHatAI/QwQ-32B-quantized.w8a8) | |
| DeepSeek-Distilled-Llama | | [RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8](https://huggingface.co/RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8) | |
| Qwen3-235B | | | [Qwen/Qwen3-235B-A22B-FP8](https://huggingface.co/Qwen/Qwen3-235B-A22B-FP8) |
**Note:** The model identifiers listed in the table above
have been verified on 6th Gen Intel® Xeon® P-core platforms.
## Installation
### Install Using Docker
It is recommended to use Docker for setting up the SGLang environment.
A [Dockerfile](https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile.xeon) is provided to facilitate the installation.
Replace `<secret>` below with your [HuggingFace access token](https://huggingface.co/docs/hub/en/security-tokens).
```bash
# Clone the SGLang repository
git clone https://github.com/sgl-project/sglang.git
cd sglang/docker
# Build the docker image
docker build -t sglang-cpu:main -f Dockerfile.xeon .
# Initiate a docker container
docker run \
-it \
--privileged \
--ipc=host \
--network=host \
-v /dev/shm:/dev/shm \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 30000:30000 \
-e "HF_TOKEN=<secret>" \
sglang-cpu:main /bin/bash
```
### Install From Source
If you'd prefer to install SGLang in a bare-metal environment,
the commands are listed below.
Note that the environment variable `SGLANG_USE_CPU_ENGINE=1`
is required to enable the SGLang service with the CPU engine.
```bash
# Create and activate a conda environment
conda create -n sgl-cpu python=3.12 -y
conda activate sgl-cpu
# Optional: Set PyTorch CPU as primary pip install channel to avoid installing CUDA version
pip config set global.index-url https://download.pytorch.org/whl/cpu
pip config set global.extra-index-url https://pypi.org/simple
# Check if some conda related environment variables have been set
env | grep -i conda
# The following environment variable settings are required
# if they have not been set properly
export CONDA_EXE=$(which conda)
export CONDA_ROOT=${CONDA_EXE}/../..
export CONDA_PREFIX=${CONDA_ROOT}/envs/sgl-cpu
export PATH=${PATH}:${CONDA_ROOT}/bin:${CONDA_ROOT}/condabin
# Clone the SGLang code
git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout <YOUR-DESIRED-VERSION>
# Install SGLang dependent libs, and build SGLang main package
pip install --upgrade pip setuptools
conda install -y libsqlite==3.48.0 gperftools tbb libnuma numactl
pip install intel-openmp
pip install -e "python[all_cpu]"
# Build the CPU backend kernels
cd sgl-kernel
cp pyproject_cpu.toml pyproject.toml
pip install -v .
# Other required environment variables
# It is recommended to set these in ~/.bashrc so you don't have to set them in every new terminal
export SGLANG_USE_CPU_ENGINE=1
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libiomp5.so:${CONDA_PREFIX}/lib/libtcmalloc.so:${CONDA_PREFIX}/lib/libtbbmalloc.so.2
```
## Launch of the Serving Engine
Example command to launch SGLang serving:
```bash
python -m sglang.launch_server \
--model <MODEL_ID_OR_PATH> \
--trust-remote-code \
--disable-overlap-schedule \
--device cpu \
--host 0.0.0.0 \
--tp 6
```
Notes:
1. For running W8A8 quantized models, please add the flag `--quantization w8a8_int8`.
2. The flag `--tp 6` specifies that tensor parallelism will be applied across 6 ranks (TP6).
On a CPU platform, each TP rank maps to a sub-NUMA cluster (SNC).
You can query how many SNCs are available from the operating system (see the example after these notes).
The specified TP size must not exceed the total number of available SNCs;
if it is smaller, the first `n` SNCs are used automatically,
and exceeding the total SNC count results in an error.
To specify the cores to be used, we need to explicitly set the environment variable `SGLANG_CPU_OMP_THREADS_BIND`.
For example, if we want to run the SGLang service using the first 40 cores of each SNC on a Xeon® 6980P server,
which has 43-43-42 cores on the 3 SNCs of a socket, we should set:
```bash
export SGLANG_CPU_OMP_THREADS_BIND="0-39|43-82|86-125|128-167|171-210|214-253"
```
3. A warmup step is automatically triggered when the service is started.
The server is ready when you see the log `The server is fired up and ready to roll!`.
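As referenced in note 2, the SNC layout can be inspected with standard OS tools; each SNC shows up as a NUMA node when sub-NUMA clustering is enabled in the BIOS. For example:
```bash
# List NUMA nodes (i.e. SNCs) and the CPU cores belonging to each
numactl -H
lscpu | grep -i numa
```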
## Benchmarking with Requests
You can benchmark the performance via the `bench_serving` script.
Run the command in another terminal.
```bash
python -m sglang.bench_serving \
--dataset-name random \
--random-input-len 1024 \
--random-output-len 1024 \
--num-prompts 1 \
--request-rate inf \
--random-range-ratio 1.0
```
Detailed explanations of the parameters can be viewed with the command:
```bash
python -m sglang.bench_serving -h
```
Additionally, the requests can be formed with
[OpenAI Completions API](https://docs.sglang.ai/backend/openai_api_completions.html)
and sent via the command line (e.g. using `curl`) or via your own script.
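For example, a single chat completion request via `curl` might look like the following (a minimal sketch, assuming the server runs on the default port 30000 and serves `meta-llama/Llama-3.1-8B-Instruct`):
```bash
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "max_tokens": 64
      }'
```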
## Example: Running DeepSeek-R1
An example command to launch the service for W8A8 DeepSeek-R1 on a Xeon® 6980P server:
```bash
python -m sglang.launch_server \
--model meituan/DeepSeek-R1-Channel-INT8 \
--trust-remote-code \
--disable-overlap-schedule \
--device cpu \
--quantization w8a8_int8 \
--host 0.0.0.0 \
--mem-fraction-static 0.8 \
--max-total-tokens 65536 \
--tp 6
```
Similarly, an example command to launch the service for FP8 DeepSeek-R1 would be:
```bash
python -m sglang.launch_server \
--model deepseek-ai/DeepSeek-R1 \
--trust-remote-code \
--disable-overlap-schedule \
--device cpu \
--host 0.0.0.0 \
--mem-fraction-static 0.8 \
--max-total-tokens 65536 \
--tp 6
```
Then you can test with the `bench_serving` command, or construct your own requests or scripts
following [the benchmarking example](#benchmarking-with-requests).
# NVIDIA Jetson Orin
## Prerequisites
Before starting, ensure the following:
- [**NVIDIA Jetson AGX Orin Devkit**](https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/) is set up with **JetPack 6.1** or later.
- **CUDA Toolkit** and **cuDNN** are installed.
- Ensure the Jetson AGX Orin is set to **high-performance mode**:
```bash
sudo nvpmodel -m 0
```
* * * * *
## Installing and running SGLang with Jetson Containers
Clone the jetson-containers github repository:
```
git clone https://github.com/dusty-nv/jetson-containers.git
```
Run the installation script:
```
bash jetson-containers/install.sh
```
Build the container:
```
CUDA_VERSION=12.6 jetson-containers build sglang
```
Run the container:
```
docker run --runtime nvidia -it --rm --network=host IMAGE_NAME
```
* * * * *
Running Inference
-----------------------------------------
Launch the server:
```bash
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--device cuda \
--dtype half \
--attention-backend flashinfer \
--mem-fraction-static 0.8 \
--context-length 8192
```
The reduced precision and limited context length (`--dtype half --context-length 8192`) are due to the limited compute and memory resources of the [NVIDIA Jetson kit](https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/). A detailed explanation can be found in [Server Arguments](../backend/server_arguments.md).
After launching the engine, refer to [Chat completions](https://docs.sglang.ai/backend/openai_api_completions.html#Usage) to test the usability.
* * * * *
Running quantization with TorchAO
-------------------------------------
TorchAO quantization is recommended on NVIDIA Jetson Orin.
```bash
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--device cuda \
--dtype bfloat16 \
--attention-backend flashinfer \
--mem-fraction-static 0.8 \
--context-length 8192 \
--torchao-config int4wo-128
```
This enables TorchAO's int4 weight-only quantization with a group size of 128, which further reduces memory usage.
* * * * *
Structured output with XGrammar
-------------------------------
Please refer to [SGLang doc structured output](../advanced_features/structured_outputs.ipynb).
* * * * *
Thanks to the support from [shahizat](https://github.com/shahizat).
References
----------
- [NVIDIA Jetson AGX Orin Documentation](https://developer.nvidia.com/embedded/jetson-agx-orin)
# TPU
The support for TPU is under active development. Please stay tuned.
# Custom Chat Template
**NOTE**: There are two chat template systems in SGLang project. This document is about setting a custom chat template for the OpenAI-compatible API server (defined at [conversation.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/conversation.py)). It is NOT related to the chat template used in the SGLang language frontend (defined at [chat_template.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/lang/chat_template.py)).
By default, the server uses the chat template specified in the model tokenizer from Hugging Face.
It should just work for most official models such as Llama-2/Llama-3.
If needed, you can also override the chat template when launching the server:
```bash
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template llama-2
```
If the chat template you are looking for is missing, you are welcome to contribute it or load it from a file.
## JSON Format
You can load a chat template in the JSON format defined by `conversation.py`.
```json
{
"name": "my_model",
"system": "<|im_start|>system",
"user": "<|im_start|>user",
"assistant": "<|im_start|>assistant",
"sep_style": "CHATML",
"sep": "<|im_end|>",
"stop_str": ["<|im_end|>", "<|im_start|>"]
}
```
```bash
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template ./my_model_template.json
```
## Jinja Format
You can also use the [Jinja template format](https://huggingface.co/docs/transformers/main/en/chat_templating) as defined by Hugging Face Transformers.
```bash
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template ./my_model_template.jinja
```
# Environment Variables
SGLang supports various environment variables that can be used to configure its runtime behavior. This document provides a comprehensive list and aims to stay updated over time.
*Note: SGLang uses two prefixes for environment variables: `SGL_` and `SGLANG_`. This is likely due to historical reasons. While both are currently supported for different settings, future versions might consolidate them.*
## General Configuration
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_USE_MODELSCOPE` | Enable using models from ModelScope | `false` |
| `SGLANG_HOST_IP` | Host IP address for the server | `0.0.0.0` |
| `SGLANG_PORT` | Port for the server | auto-detected |
| `SGLANG_LOGGING_CONFIG_PATH` | Custom logging configuration path | Not set |
| `SGLANG_DISABLE_REQUEST_LOGGING` | Disable request logging | `false` |
| `SGLANG_HEALTH_CHECK_TIMEOUT` | Timeout for health check in seconds | `20` |
## Performance Tuning
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_ENABLE_TORCH_INFERENCE_MODE` | Control whether to use torch.inference_mode | `false` |
| `SGLANG_ENABLE_TORCH_COMPILE` | Enable torch.compile | `true` |
| `SGLANG_SET_CPU_AFFINITY` | Enable CPU affinity setting (often set to `1` in Docker builds) | `0` |
| `SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN` | Allows the scheduler to overwrite longer context length requests (often set to `1` in Docker builds) | `0` |
| `SGLANG_IS_FLASHINFER_AVAILABLE` | Control FlashInfer availability check | `true` |
| `SGLANG_SKIP_P2P_CHECK` | Skip P2P (peer-to-peer) access check | `false` |
| `SGL_CHUNKED_PREFIX_CACHE_THRESHOLD` | Sets the threshold for enabling chunked prefix caching | `8192` |
| `SGLANG_FUSED_MLA_ENABLE_ROPE_FUSION` | Enable RoPE fusion in Fused Multi-Layer Attention | `1` |
## DeepGEMM Configuration (Advanced Optimization)
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGL_ENABLE_JIT_DEEPGEMM` | Enable Just-In-Time compilation of DeepGEMM kernels | `"true"` |
| `SGL_JIT_DEEPGEMM_PRECOMPILE` | Enable precompilation of DeepGEMM kernels | `"true"` |
| `SGL_JIT_DEEPGEMM_COMPILE_WORKERS` | Number of workers for parallel DeepGEMM kernel compilation | `4` |
| `SGL_IN_DEEPGEMM_PRECOMPILE_STAGE` | Indicator flag used during the DeepGEMM precompile script | `"false"` |
| `SGL_DG_CACHE_DIR` | Directory for caching compiled DeepGEMM kernels | `~/.cache/deep_gemm` |
| `SGL_DG_USE_NVRTC` | Use NVRTC (instead of Triton) for JIT compilation (Experimental) | `"0"` |
| `SGL_USE_DEEPGEMM_BMM` | Use DeepGEMM for Batched Matrix Multiplication (BMM) operations | `"false"` |
## Memory Management
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_DEBUG_MEMORY_POOL` | Enable memory pool debugging | `false` |
| `SGLANG_CLIP_MAX_NEW_TOKENS_ESTIMATION` | Clip max new tokens estimation for memory planning | `4096` |
| `SGLANG_DETOKENIZER_MAX_STATES` | Maximum states for detokenizer | Default value based on system |
| `SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK` | Disable checks for memory imbalance across Tensor Parallel ranks | Not set (defaults to enabled check) |
## Model-Specific Options
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_USE_AITER` | Use the AITER optimized implementation | `false` |
| `SGLANG_INT4_WEIGHT` | Enable INT4 weight quantization | `false` |
| `SGLANG_MOE_PADDING` | Enable MoE padding (sets padding size to 128 if value is `1`, often set to `1` in Docker builds) | `0` |
| `SGLANG_FORCE_FP8_MARLIN` | Force using FP8 MARLIN kernels even if other FP8 kernels are available | `false` |
| `SGLANG_ENABLE_FLASHINFER_GEMM` | Use flashinfer kernels when running blockwise fp8 GEMM on Blackwell GPUs | `false` |
| `SGLANG_SUPPORT_CUTLASS_BLOCK_FP8` | Use Cutlass kernels when running blockwise fp8 GEMM on Hopper or Blackwell GPUs | `false` |
| `SGLANG_CUTLASS_MOE` | Use Cutlass FP8 MoE kernel on Blackwell GPUs | `false` |
## Distributed Computing
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_BLOCK_NONZERO_RANK_CHILDREN` | Control blocking of non-zero rank children processes | `1` |
| `SGL_IS_FIRST_RANK_ON_NODE` | Indicates if the current process is the first rank on its node | `"true"` |
| `SGLANG_PP_LAYER_PARTITION` | Pipeline parallel layer partition specification | Not set |
## Testing & Debugging (Internal/CI)
*These variables are primarily used for internal testing, continuous integration, or debugging.*
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_IS_IN_CI` | Indicates if running in CI environment | `false` |
| `SGLANG_AMD_CI` | Indicates running in AMD CI environment | `0` |
| `SGLANG_TEST_RETRACT` | Enable retract decode testing | `false` |
| `SGLANG_RECORD_STEP_TIME` | Record step time for profiling | `false` |
| `SGLANG_TEST_REQUEST_TIME_STATS` | Test request time statistics | `false` |
| `SGLANG_CI_SMALL_KV_SIZE` | Use small KV cache size in CI | Not set |
## Profiling & Benchmarking
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_TORCH_PROFILER_DIR` | Directory for PyTorch profiler output | `/tmp` |
| `SGLANG_PROFILE_WITH_STACK` | Set `with_stack` option (bool) for PyTorch profiler (capture stack trace) | `true` |
## Storage & Caching
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| `SGLANG_DISABLE_OUTLINES_DISK_CACHE` | Disable Outlines disk cache | `true` |
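As an illustration, environment variables are simply exported before launching the server; a minimal sketch, with value formats assumed from the defaults listed in the tables above:
```bash
# Disable request logging and extend the health-check timeout, then launch the server
export SGLANG_DISABLE_REQUEST_LOGGING=true
export SGLANG_HEALTH_CHECK_TIMEOUT=60
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000
```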
# Troubleshooting and Frequently Asked Questions
## Troubleshooting
This page lists common errors and tips for resolving them.
### CUDA Out of Memory
If you encounter out-of-memory (OOM) errors, you can adjust the following parameters:
- If OOM occurs during prefill, try reducing `--chunked-prefill-size` to `4096` or `2048`. This saves memory but slows down the prefill speed for long prompts.
- If OOM occurs during decoding, try lowering `--max-running-requests`.
- You can also reduce `--mem-fraction-static` to a smaller value, such as 0.8 or 0.7. This decreases the memory usage of the KV cache memory pool and helps prevent OOM errors during both prefill and decoding. However, it limits maximum concurrency and reduces peak throughput.
- Another common case for OOM is requesting input logprobs for a long prompt as it requires significant memory. To address this, set `logprob_start_len` in your sampling parameters to include only the necessary parts. If you do need input logprobs for a long prompt, try reducing `--mem-fraction-static`.
### CUDA Error: Illegal Memory Access Encountered
This error may result from kernel errors or out-of-memory issues:
- If it is a kernel error, resolving it may be challenging. Please file an issue on GitHub.
- If it is an out-of-memory issue, it may sometimes be reported as this error instead of "Out of Memory." Refer to the section above for guidance on avoiding OOM issues.
## Frequently Asked Questions
### The results are not deterministic, even with a temperature of 0
You may notice that when you send the same request twice, the results from the engine will be slightly different, even when the temperature is set to 0.
From our initial investigation, this indeterminism arises from two factors: dynamic batching and prefix caching. Roughly speaking, dynamic batching accounts for about 95% of the indeterminism, while prefix caching accounts for the remaining portion. The server runs dynamic batching under the hood. Different batch sizes can cause PyTorch/CuBLAS to dispatch to different CUDA kernels, which can lead to slight numerical differences. This difference accumulates across many layers, resulting in nondeterministic output when the batch size changes. Similarly, when prefix caching is enabled, it can also dispatch to different kernels. Even when the computations are mathematically equivalent, small numerical differences from different kernel implementations lead to the final nondeterministic outputs.
To achieve more deterministic outputs in the current code, you can add `--disable-radix-cache` and send only one request at a time. The results will be mostly deterministic under this setting.
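For example, a launch command for this more deterministic setting might look like this (a minimal sketch):
```bash
# Disable the radix (prefix) cache; then send only one request at a time
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disable-radix-cache
```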
We are still investigating the root causes and potential solutions. In the short term, we may introduce a "deterministic mode" that uses more padding to address the variance caused by dynamic batching. This mode will be more deterministic but slower.
We have two issues to track our progress:
- The deterministic mode is tracked at [https://github.com/sgl-project/sglang/issues/1729](https://github.com/sgl-project/sglang/issues/1729).
- The per-request random seed is tracked at [https://github.com/sgl-project/sglang/issues/1335](https://github.com/sgl-project/sglang/issues/1335).
# Choices Methods in SGLang
This doc describes the choices methods supported by SGLang.
The optional `choices_method` arg determines how options supplied to SGLang's `choices` primitive are selected. Only the `RuntimeEndpoint` backend supports the `choices_method` arg. Other backends, such as `OpenAI`, have bespoke selection implementations due to API limitations.
## Methods
### Token Length Normalized
Token length normalized is the default SGLang choices method. It selects the option with the highest average logprob across all of its tokens.
Usage example (alternatively, simply omit the `choices_method` arg):
```python
@sgl.function
def example(s):
s += sgl.user("What is the capital of France?")
s += sgl.assistant(
sgl.gen(
"answer",
choices=["London", "Paris", "Berlin"],
choices_method=sgl.token_length_normalized,
)
)
```
This can perform poorly if an option contains many tokens, where its later tokens are predicted with high confidence based on its earlier tokens. For instance, even strong models will fail the above example if the specified options are `["Paris", "Antidisestablishmentarianism"]`.
### Greedy Token Selection
Greedy token selection simply selects the option with the highest logprob for its initial token. For overlapping options where one option is a subset of a longer option, the logprobs of the shorter option are extended using its average logprob for comparison against the longer option.
Usage example:
```python
@sgl.function
def example(s):
s += sgl.user("What is the capital of France?")
s += sgl.assistant(
sgl.gen(
"answer",
choices=["London", "Paris", "Berlin"],
choices_method=sgl.greedy_token_selection,
)
)
```
This can perform poorly if an option misleads the model down a bad path based on an attractive initial token. For instance, greedy selection will result in an incorrect response for this example:
```python
@sgl.function
def us_president_example(s):
s += sgl.user("Name a US president.")
s += sgl.assistant(
sgl.gen(
"answer",
choices=["Donald Duck", "Millard Fillmore"],
choices_method=sgl.greedy_token_selection,
)
)
```
### Unconditional Likelihood Normalized
Unconditional likelihood normalized selects the option with the highest average token logprob once normalized by the unconditional token logprobs, as described in [this EleutherAI blogpost](https://blog.eleuther.ai/multiple-choice-normalization/). This method incurs an additional LLM call to obtain the unconditional likelihoods.
Usage example:
```python
@sgl.function
def example(s):
s += sgl.user("What is the capital of France?")
s += sgl.assistant(
sgl.gen(
"answer",
choices=["London", "Paris", "Berlin"],
choices_method=sgl.unconditional_likelihood_normalized,
)
)
```
Frontend Language
=================
.. toctree::
:maxdepth: 1
:caption: Frontend Language
frontend_tutorial.ipynb
choices_methods.md
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# SGLang Frontend Language"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"SGLang frontend language can be used to define simple and easy prompts in a convenient, structured way."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Launch A Server\n",
"\n",
"Launch the server in your terminal and wait for it to initialize."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sglang import assistant_begin, assistant_end\n",
"from sglang import assistant, function, gen, system, user\n",
"from sglang import image\n",
"from sglang import RuntimeEndpoint\n",
"from sglang.lang.api import set_default_backend\n",
"from sglang.srt.utils import load_image\n",
"from sglang.test.doc_patch import launch_server_cmd\n",
"from sglang.utils import print_highlight, terminate_process, wait_for_server\n",
"\n",
"server_process, port = launch_server_cmd(\n",
" \"python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")\n",
"print(f\"Server started on http://localhost:{port}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Set the default backend. Note: Besides the local server, you may use also `OpenAI` or other API endpoints."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"set_default_backend(RuntimeEndpoint(f\"http://localhost:{port}\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Basic Usage\n",
"\n",
"The most simple way of using SGLang frontend language is a simple question answer dialog between a user and an assistant."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@function\n",
"def basic_qa(s, question):\n",
" s += system(f\"You are a helpful assistant than can answer questions.\")\n",
" s += user(question)\n",
" s += assistant(gen(\"answer\", max_tokens=512))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"state = basic_qa(\"List 3 countries and their capitals.\")\n",
"print_highlight(state[\"answer\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Multi-turn Dialog\n",
"\n",
"SGLang frontend language can also be used to define multi-turn dialogs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@function\n",
"def multi_turn_qa(s):\n",
" s += system(f\"You are a helpful assistant than can answer questions.\")\n",
" s += user(\"Please give me a list of 3 countries and their capitals.\")\n",
" s += assistant(gen(\"first_answer\", max_tokens=512))\n",
" s += user(\"Please give me another list of 3 countries and their capitals.\")\n",
" s += assistant(gen(\"second_answer\", max_tokens=512))\n",
" return s\n",
"\n",
"\n",
"state = multi_turn_qa()\n",
"print_highlight(state[\"first_answer\"])\n",
"print_highlight(state[\"second_answer\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Control flow\n",
"\n",
"You may use any Python code within the function to define more complex control flows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@function\n",
"def tool_use(s, question):\n",
" s += assistant(\n",
" \"To answer this question: \"\n",
" + question\n",
" + \". I need to use a \"\n",
" + gen(\"tool\", choices=[\"calculator\", \"search engine\"])\n",
" + \". \"\n",
" )\n",
"\n",
" if s[\"tool\"] == \"calculator\":\n",
" s += assistant(\"The math expression is: \" + gen(\"expression\"))\n",
" elif s[\"tool\"] == \"search engine\":\n",
" s += assistant(\"The key word to search is: \" + gen(\"word\"))\n",
"\n",
"\n",
"state = tool_use(\"What is 2 * 2?\")\n",
"print_highlight(state[\"tool\"])\n",
"print_highlight(state[\"expression\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Parallelism\n",
"\n",
"Use `fork` to launch parallel prompts. Because `sgl.gen` is non-blocking, the for loop below issues two generation calls in parallel."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@function\n",
"def tip_suggestion(s):\n",
" s += assistant(\n",
" \"Here are two tips for staying healthy: \"\n",
" \"1. Balanced Diet. 2. Regular Exercise.\\n\\n\"\n",
" )\n",
"\n",
" forks = s.fork(2)\n",
" for i, f in enumerate(forks):\n",
" f += assistant(\n",
" f\"Now, expand tip {i+1} into a paragraph:\\n\"\n",
" + gen(\"detailed_tip\", max_tokens=256, stop=\"\\n\\n\")\n",
" )\n",
"\n",
" s += assistant(\"Tip 1:\" + forks[0][\"detailed_tip\"] + \"\\n\")\n",
" s += assistant(\"Tip 2:\" + forks[1][\"detailed_tip\"] + \"\\n\")\n",
" s += assistant(\n",
" \"To summarize the above two tips, I can say:\\n\" + gen(\"summary\", max_tokens=512)\n",
" )\n",
"\n",
"\n",
"state = tip_suggestion()\n",
"print_highlight(state[\"summary\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Constrained Decoding\n",
"\n",
"Use `regex` to specify a regular expression as a decoding constraint. This is only supported for local models."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@function\n",
"def regular_expression_gen(s):\n",
" s += user(\"What is the IP address of the Google DNS servers?\")\n",
" s += assistant(\n",
" gen(\n",
" \"answer\",\n",
" temperature=0,\n",
" regex=r\"((25[0-5]|2[0-4]\\d|[01]?\\d\\d?).){3}(25[0-5]|2[0-4]\\d|[01]?\\d\\d?)\",\n",
" )\n",
" )\n",
"\n",
"\n",
"state = regular_expression_gen()\n",
"print_highlight(state[\"answer\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use `regex` to define a `JSON` decoding schema."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"character_regex = (\n",
" r\"\"\"\\{\\n\"\"\"\n",
" + r\"\"\" \"name\": \"[\\w\\d\\s]{1,16}\",\\n\"\"\"\n",
" + r\"\"\" \"house\": \"(Gryffindor|Slytherin|Ravenclaw|Hufflepuff)\",\\n\"\"\"\n",
" + r\"\"\" \"blood status\": \"(Pure-blood|Half-blood|Muggle-born)\",\\n\"\"\"\n",
" + r\"\"\" \"occupation\": \"(student|teacher|auror|ministry of magic|death eater|order of the phoenix)\",\\n\"\"\"\n",
" + r\"\"\" \"wand\": \\{\\n\"\"\"\n",
" + r\"\"\" \"wood\": \"[\\w\\d\\s]{1,16}\",\\n\"\"\"\n",
" + r\"\"\" \"core\": \"[\\w\\d\\s]{1,16}\",\\n\"\"\"\n",
" + r\"\"\" \"length\": [0-9]{1,2}\\.[0-9]{0,2}\\n\"\"\"\n",
" + r\"\"\" \\},\\n\"\"\"\n",
" + r\"\"\" \"alive\": \"(Alive|Deceased)\",\\n\"\"\"\n",
" + r\"\"\" \"patronus\": \"[\\w\\d\\s]{1,16}\",\\n\"\"\"\n",
" + r\"\"\" \"bogart\": \"[\\w\\d\\s]{1,16}\"\\n\"\"\"\n",
" + r\"\"\"\\}\"\"\"\n",
")\n",
"\n",
"\n",
"@function\n",
"def character_gen(s, name):\n",
" s += user(\n",
" f\"{name} is a character in Harry Potter. Please fill in the following information about this character.\"\n",
" )\n",
" s += assistant(gen(\"json_output\", max_tokens=256, regex=character_regex))\n",
"\n",
"\n",
"state = character_gen(\"Harry Potter\")\n",
"print_highlight(state[\"json_output\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Batching \n",
"\n",
"Use `run_batch` to run a batch of prompts."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@function\n",
"def text_qa(s, question):\n",
" s += user(question)\n",
" s += assistant(gen(\"answer\", stop=\"\\n\"))\n",
"\n",
"\n",
"states = text_qa.run_batch(\n",
" [\n",
" {\"question\": \"What is the capital of the United Kingdom?\"},\n",
" {\"question\": \"What is the capital of France?\"},\n",
" {\"question\": \"What is the capital of Japan?\"},\n",
" ],\n",
" progress_bar=True,\n",
")\n",
"\n",
"for i, state in enumerate(states):\n",
" print_highlight(f\"Answer {i+1}: {states[i]['answer']}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Streaming \n",
"\n",
"Use `stream` to stream the output to the user."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@function\n",
"def text_qa(s, question):\n",
" s += user(question)\n",
" s += assistant(gen(\"answer\", stop=\"\\n\"))\n",
"\n",
"\n",
"state = text_qa.run(\n",
" question=\"What is the capital of France?\", temperature=0.1, stream=True\n",
")\n",
"\n",
"for out in state.text_iter():\n",
" print(out, end=\"\", flush=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Complex Prompts\n",
"\n",
"You may use `{system|user|assistant}_{begin|end}` to define complex prompts."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@function\n",
"def chat_example(s):\n",
" s += system(\"You are a helpful assistant.\")\n",
" # Same as: s += s.system(\"You are a helpful assistant.\")\n",
"\n",
" with s.user():\n",
" s += \"Question: What is the capital of France?\"\n",
"\n",
" s += assistant_begin()\n",
" s += \"Answer: \" + gen(\"answer\", max_tokens=100, stop=\"\\n\")\n",
" s += assistant_end()\n",
"\n",
"\n",
"state = chat_example()\n",
"print_highlight(state[\"answer\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(server_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Multi-modal Generation\n",
"\n",
"You may use SGLang frontend language to define multi-modal prompts.\n",
"See [here](https://docs.sglang.ai/supported_models/generative_models.html) for supported models."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"server_process, port = launch_server_cmd(\n",
" \"python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --host 0.0.0.0\"\n",
")\n",
"\n",
"wait_for_server(f\"http://localhost:{port}\")\n",
"print(f\"Server started on http://localhost:{port}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"set_default_backend(RuntimeEndpoint(f\"http://localhost:{port}\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ask a question about an image."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@function\n",
"def image_qa(s, image_file, question):\n",
" s += user(image(image_file) + question)\n",
" s += assistant(gen(\"answer\", max_tokens=256))\n",
"\n",
"\n",
"image_url = \"https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true\"\n",
"image_bytes, _ = load_image(image_url)\n",
"state = image_qa(image_bytes, \"What is in the image?\")\n",
"print_highlight(state[\"answer\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(server_process)"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
# Learn more
You can find more blogs, slides, and videos about SGLang at [https://github.com/sgl-project/sgl-learning-materials](https://github.com/sgl-project/sgl-learning-materials).
The latest SGLang features and updates are shared through the [LMSYS blog](https://lmsys.org/blog/).
The 2025 H2 roadmap can be found at this [issue](https://github.com/sgl-project/sglang/issues/7736).
# Deploy On Kubernetes
This document describes how to deploy a two-node SGLang inference service over a RoCE network on a Kubernetes (K8s) cluster.
[LeaderWorkerSet (LWS)](https://github.com/kubernetes-sigs/lws) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads. A major use case is multi-host/multi-node distributed inference.
SGLang can be deployed with LWS on Kubernetes for distributed model serving; this guide walks through the details.
Here we take the deployment of DeepSeek-R1 as an example.
## Prerequisites
1. At least two Kubernetes nodes, each with eight H20 GPUs, are required.
2. Make sure your K8S cluster has LWS correctly installed. If it hasn't been set up yet, please follow the [installation instructions](https://github.com/kubernetes-sigs/lws/blob/main/site/content/en/docs/installation/_index.md). **Note:** For LWS versions ≤0.5.x, you must use the Downward API to obtain `LWS_WORKER_INDEX`, as native support for this feature was introduced in v0.6.0.
## Basic example
For the basic example documentation, refer to [Deploy Distributed Inference Service with SGLang and LWS on GPUs](https://github.com/kubernetes-sigs/lws/tree/main/docs/examples/sglang).
However, that document only covers the basic NCCL socket mode.
In this section, we’ll make some simple modifications to adapt the setup to the RDMA scenario.
## RDMA RoCE case
* Check your env:
```bash
[root@node1 ~]# ibstatus
Infiniband device 'mlx5_bond_0' port 1 status:
default gid: fe80:0000:0000:0000:0225:9dff:fe64:c79a
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 200 Gb/sec (2X NDR)
link_layer: Ethernet
Infiniband device 'mlx5_bond_1' port 1 status:
default gid: fe80:0000:0000:0000:0225:9dff:fe6e:c3ec
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 200 Gb/sec (2X NDR)
link_layer: Ethernet
Infiniband device 'mlx5_bond_2' port 1 status:
default gid: fe80:0000:0000:0000:0225:9dff:fe73:0dd7
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 200 Gb/sec (2X NDR)
link_layer: Ethernet
Infiniband device 'mlx5_bond_3' port 1 status:
default gid: fe80:0000:0000:0000:0225:9dff:fe36:f7ff
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 200 Gb/sec (2X NDR)
link_layer: Ethernet
```
* Prepare the `lws.yaml` file for deploying on k8s.
```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
name: sglang
spec:
replicas: 1
leaderWorkerTemplate:
size: 2
restartPolicy: RecreateGroupOnPodRestart
leaderTemplate:
metadata:
labels:
role: leader
spec:
dnsPolicy: ClusterFirstWithHostNet
hostNetwork: true
hostIPC: true
containers:
- name: sglang-leader
image: sglang:latest
securityContext:
privileged: true
env:
- name: NCCL_IB_GID_INDEX
value: "3"
command:
- python3
- -m
- sglang.launch_server
- --model-path
- /work/models
- --mem-fraction-static
- "0.93"
- --torch-compile-max-bs
- "8"
- --max-running-requests
- "20"
- --tp
- "16" # Size of Tensor Parallelism
- --dist-init-addr
- $(LWS_LEADER_ADDRESS):20000
- --nnodes
- $(LWS_GROUP_SIZE)
- --node-rank
- $(LWS_WORKER_INDEX)
- --trust-remote-code
- --host
- "0.0.0.0"
- --port
- "40000"
resources:
limits:
nvidia.com/gpu: "8"
ports:
- containerPort: 40000
readinessProbe:
tcpSocket:
port: 40000
initialDelaySeconds: 15
periodSeconds: 10
volumeMounts:
- mountPath: /dev/shm
name: dshm
- name: model
mountPath: /work/models
- name: ib
mountPath: /dev/infiniband
volumes:
- name: dshm
emptyDir:
medium: Memory
- name: model
hostPath:
path: '< your models dir >' # modify it according your models dir
- name: ib
hostPath:
path: /dev/infiniband
workerTemplate:
spec:
dnsPolicy: ClusterFirstWithHostNet
hostNetwork: true
hostIPC: true
containers:
- name: sglang-worker
image: sglang:latest
securityContext:
privileged: true
env:
- name: NCCL_IB_GID_INDEX
value: "3"
command:
- python3
- -m
- sglang.launch_server
- --model-path
- /work/models
- --mem-fraction-static
- "0.93"
- --torch-compile-max-bs
- "8"
- --max-running-requests
- "20"
- --tp
- "16" # Size of Tensor Parallelism
- --dist-init-addr
- $(LWS_LEADER_ADDRESS):20000
- --nnodes
- $(LWS_GROUP_SIZE)
- --node-rank
- $(LWS_WORKER_INDEX)
- --trust-remote-code
resources:
limits:
nvidia.com/gpu: "8"
volumeMounts:
- mountPath: /dev/shm
name: dshm
- name: model
mountPath: /work/models
- name: ib
mountPath: /dev/infiniband
volumes:
- name: dshm
emptyDir:
medium: Memory
- name: ib
hostPath:
path: /dev/infiniband
- name: model
hostPath:
path: /data1/models/deepseek_v3_moe
---
apiVersion: v1
kind: Service
metadata:
name: sglang-leader
spec:
selector:
leaderworkerset.sigs.k8s.io/name: sglang
role: leader
ports:
- protocol: TCP
port: 40000
targetPort: 40000
```
* Then deploy with `kubectl apply -f lws.yaml`. Listing the pods with `kubectl get pods` should show output like this:
```text
NAME READY STATUS RESTARTS AGE
sglang-0 0/1 Running 0 9s
sglang-0-1 1/1 Running 0 9s
```
Wait for the sglang leader (`sglang-0`) status to change to 1/1, which indicates it is `Ready`.
You can use the command `kubectl logs -f sglang-0` to view the logs of the leader node.
Once successful, you should see output like this:
```text
[2025-02-17 05:27:24 TP1] Capture cuda graph end. Time elapsed: 84.89 s
[2025-02-17 05:27:24 TP6] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840
[2025-02-17 05:27:24 TP0] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840
[2025-02-17 05:27:24 TP7] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840
[2025-02-17 05:27:24 TP3] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840
[2025-02-17 05:27:24 TP2] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840
[2025-02-17 05:27:24 TP4] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840
[2025-02-17 05:27:24 TP1] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840
[2025-02-17 05:27:24 TP5] max_total_num_tokens=712400, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=50, context_len=163840
[2025-02-17 05:27:24] INFO: Started server process [1]
[2025-02-17 05:27:24] INFO: Waiting for application startup.
[2025-02-17 05:27:24] INFO: Application startup complete.
[2025-02-17 05:27:24] INFO: Uvicorn running on http://0.0.0.0:40000 (Press CTRL+C to quit)
[2025-02-17 05:27:25] INFO: 127.0.0.1:48908 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-02-17 05:27:25 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-17 05:27:32] INFO: 127.0.0.1:48924 - "POST /generate HTTP/1.1" 200 OK
[2025-02-17 05:27:32] The server is fired up and ready to roll!
```
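As a quick sanity check once the leader is ready, you can send a request to the `sglang-leader` Service defined above. This is a minimal sketch using SGLang's native `/generate` endpoint; run it from a pod or node that can reach the Service on port 40000.
```shell
curl -s http://sglang-leader:40000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "The capital of France is", "sampling_params": {"max_new_tokens": 16, "temperature": 0}}'
```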
If the server does not start up successfully, work through the debugging steps below to track down the issue.
### Debug
* Set `NCCL_DEBUG=TRACE` to check whether the problem is NCCL communication.
The trace output is usually enough to pinpoint most NCCL-related issues.
***Notice: If `NCCL_DEBUG=TRACE` appears to have no effect in the container environment, yet the process hangs or you hit hard-to-diagnose issues, try switching to a different container image. Some images do not handle standard error output properly.***
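For example, you can add it next to the existing NCCL settings in both the leader and worker container specs of the manifest above (a minimal sketch; `NCCL_DEBUG_SUBSYS` is optional and only narrows the trace):
```yaml
env:
  - name: NCCL_IB_GID_INDEX
    value: "3"
  - name: NCCL_DEBUG
    value: "TRACE"
  - name: NCCL_DEBUG_SUBSYS
    value: "INIT,NET"   # optional: restrict the trace to the init and network subsystems
```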
#### RoCE scenario
* Please make sure that RDMA devices are available in the cluster environment.
* Please make sure that the cluster nodes have Mellanox NICs with RoCE support. This example uses Mellanox ConnectX-series NICs with the proper OFED driver installed. If the driver is missing, refer to [Install OFED Driver](https://docs.nvidia.com/networking/display/mlnxofedv461000/installing+mellanox+ofed) to install it.
* Check your environment for the Mellanox NICs:
```shell
$ lspci -nn | grep Eth | grep Mellanox
0000:7f:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0000:7f:00.1 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0000:c7:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0000:c7:00.1 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0001:08:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0001:08:00.1 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0001:a2:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
0001:a2:00.1 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
```
* Check the OFED driver:
```shell
ofed_info -s
OFED-internal-23.07-0.5.0:
```
* Show RDMA link status and check IB devices:
```shell
$ rdma link show
8/1: mlx5_bond_0/1: state ACTIVE physical_state LINK_UP netdev reth0
9/1: mlx5_bond_1/1: state ACTIVE physical_state LINK_UP netdev reth2
10/1: mlx5_bond_2/1: state ACTIVE physical_state LINK_UP netdev reth4
11/1: mlx5_bond_3/1: state ACTIVE physical_state LINK_UP netdev reth6
$ ibdev2netdev
8/1: mlx5_bond_0/1: state ACTIVE physical_state LINK_UP netdev reth0
9/1: mlx5_bond_1/1: state ACTIVE physical_state LINK_UP netdev reth2
10/1: mlx5_bond_2/1: state ACTIVE physical_state LINK_UP netdev reth4
11/1: mlx5_bond_3/1: state ACTIVE physical_state LINK_UP netdev reth6
```
* Test RoCE network speed on the host:
```shell
# install qperf on both hosts (e.g. via yum)
yum install qperf
# on the server, start qperf in listening mode:
qperf
# on the client, measure RDMA write bandwidth against the server:
qperf -t 60 -cm1 <server_ip> rc_rdma_write_bw
```
* Check that the RDMA devices are accessible inside your container; you can also verify bandwidth between containers, as sketched after this block:
```shell
ibv_devices
ibv_devinfo
```
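Optionally, if the `perftest` tools are available in your image, you can also measure RDMA bandwidth directly between two containers. This is a sketch under that assumption; the device name and server IP are placeholders taken from the environment above.
```shell
# in the first container (server side):
ib_write_bw -d mlx5_bond_0 --report_gbits
# in the second container (client side), pointing at the server container's IP:
ib_write_bw -d mlx5_bond_0 --report_gbits <server_ip>
```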
## Keys to success
* In the YAML configuration above, pay attention to the NCCL environment variables. For older versions of NCCL, make sure `NCCL_IB_GID_INDEX` is set correctly.
* `NCCL_SOCKET_IFNAME` is also crucial, but with `hostNetwork` in a containerized environment this typically isn't an issue.
* In some cases, it is also necessary to configure `GLOO_SOCKET_IFNAME` correctly (see the snippet after this list).
* `NCCL_DEBUG` is essential for troubleshooting, but I've found that it sometimes doesn't show error logs within containers. This can be related to the Docker image you're using, so try switching images if needed.
* Avoid using Docker images based on Ubuntu 18.04, as they tend to have compatibility issues.
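If you do need to pin the network interfaces explicitly, the sketch below shows the relevant container env entries; the interface name `bond0` is only a placeholder and should be replaced with the actual host NIC.
```yaml
env:
  - name: NCCL_SOCKET_IFNAME
    value: "bond0"   # placeholder: host interface for NCCL's TCP bootstrap/control traffic
  - name: GLOO_SOCKET_IFNAME
    value: "bond0"   # placeholder: same interface for gloo-based process groups
  - name: NCCL_IB_GID_INDEX
    value: "3"       # as in the manifest above; commonly required for RoCE v2
```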
## Remaining issues
* In Kubernetes, Docker, or Containerd environments, we use hostNetwork to prevent performance degradation.
* We utilize privileged mode, which isn’t secure. Additionally, in containerized environments, full GPU isolation cannot be achieved.
## TODO
* Integrate with [k8s-rdma-shared-dev-plugin](https://github.com/Mellanox/k8s-rdma-shared-dev-plugin) so that RDMA devices can be requested as Kubernetes resources instead of mounting `/dev/infiniband` from the host.
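With such a plugin deployed, the container resources could look roughly like this; the `rdma/hca_shared_devices_a` resource name follows the plugin's example configuration and must match your own plugin config.
```yaml
resources:
  limits:
    nvidia.com/gpu: "8"
    rdma/hca_shared_devices_a: "1"   # assumed resource name from the plugin's example config; adjust to yours
```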
apiVersion: v1
kind: Service
metadata:
name: deepseekr10528-decode-main
spec:
selector:
leaderworkerset.sigs.k8s.io/name: deepseekr10528-decode-main
role: leader
ports:
- protocol: TCP
port: 30000
targetPort: 30000
---
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
name: deepseekr10528-decode-main
spec:
leaderWorkerTemplate:
leaderTemplate:
metadata:
labels:
role: leader
spec:
containers:
- command:
- python3
- -m
- sglang.launch_server
- --port
- "30000"
- --host
- "0.0.0.0"
- --model-path
- /work/models
- --chunked-prefill-size
- "262144"
- --page-size
- "64"
- --enable-dp-attention
- --enable-dp-lm-head
- --dp-size
- "16"
- --moe-a2a-backend
- deepep
- --disaggregation-mode
- decode
- --mem-fraction-static
- "0.849"
- --context-length
- "32768"
- --disaggregation-ib-device
- "mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3"
- --cuda-graph-max-bs
- "64"
- --max-running-requests
- "2048"
- --tp-size
- "16" # Size of Tensor Parallelism
- --dist-init-addr
- $(LWS_LEADER_ADDRESS):20102
- --nnodes
- $(LWS_GROUP_SIZE)
- --node-rank
- $(LWS_WORKER_INDEX)
- --trust-remote-code
- --ep-num-redundant-experts
- "32"
- --moe-dense-tp-size
- "1"
env:
- name: CUDA_LAUNCH_BLOCKING
value: "0"
- name: NVSHMEM_IB_GID_INDEX
value: "3"
- name: NVSHMEM_ENABLE_NIC_PE_MAPPING
value: "1"
- name: NVSHMEM_HCA_PE_MAPPING
value: "mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2"
- name: NCCL_IB_QPS_PER_CONNECTION
value: "8"
- name: NCCL_IB_SPLIT_DATA_ON_QPS
value: "1"
- name: NCCL_NET_PLUGIN
value: "none"
- name: NCCL_IB_TC
value: "136"
- name: NCCL_MIN_NCHANNELS
value: "4"
- name: NCCL_IB_SL
value: "5"
- name: MC_TE_METRIC
value: "true"
- name: SGLANG_MOONCAKE_TRANS_THREAD
value: "16"
- name: SGL_ENABLE_JIT_DEEPGEMM
value: "1"
- name: NCCL_IB_HCA
value: ^=mlx5_0,mlx5_5,mlx5_6
- name: LWS_WORKER_INDEX
valueFrom:
fieldRef:
fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
image: lmsysorg/sglang:latest
name: sglang-leader
ports:
- containerPort: 30000
protocol: TCP
readinessProbe:
periodSeconds: 30
tcpSocket:
port: 30000
resources:
limits:
nvidia.com/gpu: "8"
securityContext:
capabilities:
add:
- IPC_LOCK
privileged: true
volumeMounts:
- mountPath: /root/.cache
name: sgl-cache
- mountPath: /dev/shm
name: dshm
- mountPath: /work/models
name: model
- mountPath: /dev/infiniband
name: ib
- mountPath: /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs
name: cf
dnsPolicy: ClusterFirstWithHostNet
hostIPC: true
hostNetwork: true
nodeSelector:
        # modify according to your deployment environment
pd: "yes"
tolerations:
        # modify according to your deployment environment
- key: bopd
operator: Exists
- key: node-role
operator: Exists
volumes:
- hostPath:
path: /data1/sgl_cache1
type: DirectoryOrCreate
name: sgl-cache
- emptyDir:
medium: Memory
name: dshm
- hostPath:
path: /data1/maas_hosted_models/models/DeepSeek-R1-0528/deepseek_r1_0528
name: model
- hostPath:
path: /dev/infiniband
name: ib
- hostPath:
path: /data1/maas_hosted_models/models/fused_moe_triton/configs
name: cf
restartPolicy: RecreateGroupOnPodRestart
size: 2
workerTemplate:
metadata: {}
spec:
containers:
- command:
- python3
- -m
- sglang.launch_server
- --model-path
- /work/models
- --chunked-prefill-size
- "262144"
- --page-size
- "64"
- --enable-dp-attention
- --enable-dp-lm-head
- --dp-size
- "16"
- --moe-a2a-backend
- deepep
- --disaggregation-mode
- decode
- --mem-fraction-static
- "0.849"
- --context-length
- "32768"
- --disaggregation-ib-device
- "mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3"
- --cuda-graph-max-bs
- "64"
- --max-running-requests
- "2048"
- --tp-size
- "16" # Size of Tensor Parallelism
- --dist-init-addr
- $(LWS_LEADER_ADDRESS):20102
- --nnodes
- $(LWS_GROUP_SIZE)
- --node-rank
- $(LWS_WORKER_INDEX)
- --trust-remote-code
- --ep-num-redundant-experts
- "32"
- --moe-dense-tp-size
- "1"
env:
- name: NVSHMEM_IB_TRAFFIC_CLASS
value: "16"
- name: NVSHMEM_IB_GID_INDEX
value: "3"
- name: NVSHMEM_ENABLE_NIC_PE_MAPPING
value: "1"
- name: NVSHMEM_HCA_PE_MAPPING
value: "mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2"
- name: NCCL_IB_QPS_PER_CONNECTION
value: "8"
- name: NCCL_IB_SPLIT_DATA_ON_QPS
value: "1"
- name: NCCL_NET_PLUGIN
value: "none"
- name: NCCL_IB_TC
value: "136"
- name: NCCL_MIN_NCHANNELS
value: "4"
- name: MC_TE_METRIC
value: "true"
- name: NCCL_IB_SL
value: "5"
- name: SGLANG_MOONCAKE_TRANS_THREAD
value: "16"
- name: SGL_ENABLE_JIT_DEEPGEMM
value: "1"
- name: NCCL_IB_HCA
value: ^=mlx5_0,mlx5_5,mlx5_6
- name: LWS_WORKER_INDEX
valueFrom:
fieldRef:
fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
image: lmsysorg/sglang:latest
name: sglang-worker
ports:
- containerPort: 30001
resources:
limits:
nvidia.com/gpu: "8"
securityContext:
capabilities:
add:
- IPC_LOCK
privileged: true
volumeMounts:
- mountPath: /root/.cache
name: sgl-cache
- mountPath: /dev/shm
name: dshm
- mountPath: /work/models
name: model
- mountPath: /dev/infiniband
name: ib
- mountPath: /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs
name: cf
dnsPolicy: ClusterFirstWithHostNet
hostIPC: true
hostNetwork: true
nodeSelector:
        # modify according to your deployment environment
pd: "yes"
tolerations:
        # modify according to your deployment environment
- key: bopd
operator: Exists
- key: node-role
operator: Exists
volumes:
- hostPath:
path: /data1/sgl_cache1
type: DirectoryOrCreate
name: sgl-cache
- emptyDir:
medium: Memory
name: dshm
- hostPath:
path: /dev/infiniband
name: ib
- hostPath:
            # modify according to your deployment environment
path: /data1/maas_hosted_models/models/DeepSeek-R1-0528/deepseek_r1_0528
name: model
- hostPath:
            # modify according to your deployment environment
path: /data1/maas_hosted_models/models/fused_moe_triton/configs
name: cf
networkConfig:
subdomainPolicy: Shared
replicas: 1
rolloutStrategy:
rollingUpdateConfiguration:
maxSurge: 0
maxUnavailable: 1
type: RollingUpdate
startupPolicy: LeaderCreated
apiVersion: apps/v1
kind: Deployment
metadata:
name: deepseekr10528-lb-main
labels:
app: deepseekr10528-lb
spec:
replicas: 1
selector:
matchLabels:
app: deepseekr10528-lb
template:
metadata:
labels:
app: deepseekr10528-lb
spec:
nodeSelector:
bo: "yes"
tolerations:
- key: bopd
operator: Exists
- key: node-role
operator: Exists
containers:
- name: sgl-minilb
image: lmsysorg/sglang:latest
command:
- python
- -m
- sglang.srt.disaggregation.mini_lb
- --prefill
- http://deepseekr10528-prefill-main:30000
- --decode
- http://deepseekr10528-decode-main:30000
- --host
- 0.0.0.0
- --port
- "8000"
ports:
- containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
name: deepseekr10528-lb-service
spec:
type: NodePort # NodePort is easy to test, you can also specify `ClusterIP`
selector:
app: deepseekr10528-lb
ports:
- protocol: TCP
port: 8000 # Service Port(In-Cluster)
targetPort: 8000 # Exposed Container
nodePort: 30800
apiVersion: v1
kind: Service
metadata:
name: deepseekr10528-prefill-main
spec:
selector:
leaderworkerset.sigs.k8s.io/name: deepseekr10528-prefill-main
role: leader
ports:
- protocol: TCP
port: 30000
targetPort: 30000
---
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
name: deepseekr10528-prefill-main
spec:
leaderWorkerTemplate:
leaderTemplate:
metadata:
labels:
role: leader
spec:
containers:
- command:
- python3
- -m
- sglang.launch_server
- --port
- "30000"
- --host
- "0.0.0.0"
- --model-path
- /work/models
- --disaggregation-ib-device
            # modify according to your RDMA environment
- mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3
- --chunked-prefill-size
- "524288"
- --max-prefill-tokens
- "32768"
- --page-size
- "64"
- --ep-dispatch-algorithm
- dynamic
- --eplb-algorithm
- deepseek
- --enable-dp-lm-head
- --enable-dp-attention
- --dp-size
- "16"
- --disable-radix-cache
- --moe-a2a-backend
- deepep
- --disaggregation-mode
- prefill
- --mem-fraction-static
- "0.7"
- --context-length
- "32768"
- --tp
- "16"
- --dist-init-addr
- $(LWS_LEADER_ADDRESS):20102
- --nnodes
- $(LWS_GROUP_SIZE)
- --node-rank
- $(LWS_WORKER_INDEX)
- --trust-remote-code
- --ep-num-redundant-experts
- "32"
- --moe-dense-tp-size
- "1"
- --max-running-requests
- "1024"
env:
- name: NVSHMEM_HCA_PE_MAPPING
              # modify according to your RDMA environment
value: "mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2"
- name: NVSHMEM_IB_GID_INDEX
value: "3"
- name: NVSHMEM_ENABLE_NIC_PE_MAPPING
value: "1"
- name: SGLANG_SET_CPU_AFFINITY
value: "true"
- name: SGL_ENABLE_JIT_DEEPGEMM
value: "1"
- name: NCCL_IB_QPS_PER_CONNECTION
value: "8"
- name: NCCL_IB_SPLIT_DATA_ON_QPS
value: "1"
- name: NCCL_NET_PLUGIN
value: none
- name: NCCL_IB_TC
value: "136"
- name: NCCL_MIN_NCHANNELS
value: "4"
- name: MC_TE_METRIC
value: "false"
- name: NCCL_IB_SL
value: "5"
- name: NCCL_IB_HCA
value: ^=mlx5_0,mlx5_5,mlx5_6
- name: LWS_WORKER_INDEX
valueFrom:
fieldRef:
fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
image: lmsysorg/sglang:latest
name: sglang-leader
ports:
- containerPort: 30000
protocol: TCP
readinessProbe:
periodSeconds: 30
tcpSocket:
port: 30000
resources:
limits:
nvidia.com/gpu: "8"
securityContext:
capabilities:
add:
- IPC_LOCK
privileged: true
volumeMounts:
- mountPath: /dev/shm
name: dshm
- mountPath: /work/models
name: model
- mountPath: /dev/infiniband
name: ib
- mountPath: /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs
name: cf
- mountPath: /root/.cache
name: sgl-cache
dnsPolicy: ClusterFirstWithHostNet
hostIPC: true
hostNetwork: true
nodeSelector:
        # modify according to your deployment environment
pd: "yes"
tolerations:
        # modify according to your deployment environment
- key: bopd
operator: Exists
- key: node-role
operator: Exists
volumes:
- emptyDir:
medium: Memory
name: dshm
- hostPath:
path: /data1/maas_hosted_models/models/DeepSeek-R1-0528/deepseek_r1_0528
name: model
- hostPath:
path: /dev/infiniband
name: ib
- hostPath:
path: /data1/maas_hosted_models/models/fused_moe_triton/configs
name: cf
- hostPath:
path: /data1/sgl_cache
type: DirectoryOrCreate
name: sgl-cache
restartPolicy: RecreateGroupOnPodRestart
size: 2
workerTemplate:
metadata: {}
spec:
containers:
- command:
- python3
- -m
- sglang.launch_server
- --model-path
- /work/models
- --disaggregation-ib-device
            # modify according to your RDMA environment
- mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3
- --chunked-prefill-size
- "524288"
- --max-prefill-tokens
- "32768"
- --page-size
- "64"
- --ep-dispatch-algorithm
- dynamic
- --eplb-algorithm
- deepseek
# - --deepep-config
# - /home/aiges/tuned/tuned_8sms.json
# can be tuned using deepep test scripts
- --enable-dp-lm-head
- --enable-dp-attention
- --dp-size
- "16"
- --disable-radix-cache
- --moe-a2a-backend
- deepep
- --disaggregation-mode
- prefill
- --mem-fraction-static
- "0.7"
- --context-length
- "32768"
- --tp
- "16"
- --dist-init-addr
- $(LWS_LEADER_ADDRESS):20102
- --nnodes
- $(LWS_GROUP_SIZE)
- --node-rank
- $(LWS_WORKER_INDEX)
- --trust-remote-code
- --ep-num-redundant-experts
- "32"
- --moe-dense-tp-size
- "1"
- --max-running-requests
- "1024"
env:
- name: SGLANG_SET_CPU_AFFINITY
value: "true"
- name: NVSHMEM_HCA_PE_MAPPING
              # modify according to your RDMA environment
value: "mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2"
- name: NCCL_IB_HCA
value: ^=mlx5_0,mlx5_5,mlx5_6
- name: NVSHMEM_IB_TRAFFIC_CLASS
value: "16"
- name: NVSHMEM_IB_GID_INDEX
value: "3"
- name: NVSHMEM_ENABLE_NIC_PE_MAPPING
value: "1"
- name: CUDA_LAUNCH_BLOCKING
value: "0"
- name: SGLANG_MOONCAKE_TRANS_THREAD
value: "8"
- name: SGL_ENABLE_JIT_DEEPGEMM
value: "1"
- name: SGL_CHUNKED_PREFIX_CACHE_THRESHOLD
value: "0"
- name: NCCL_IB_QPS_PER_CONNECTION
value: "8"
- name: NCCL_IB_SPLIT_DATA_ON_QPS
value: "1"
- name: NCCL_NET_PLUGIN
value: none
- name: NCCL_IB_TC
value: "136"
- name: NCCL_MIN_NCHANNELS
value: "4"
- name: MC_TE_METRIC
value: "true"
- name: NCCL_IB_SL
value: "5"
- name: LWS_WORKER_INDEX
valueFrom:
fieldRef:
fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/worker-index']
image: lmsysorg/sglang:latest
name: sglang-worker
ports:
- containerPort: 30001
protocol: TCP
resources:
limits:
nvidia.com/gpu: "8"
securityContext:
capabilities:
add:
- IPC_LOCK
privileged: true
volumeMounts:
- mountPath: /root/.cache
name: sgl-cache
- mountPath: /dev/shm
name: dshm
- mountPath: /work/models
name: model
- mountPath: /dev/infiniband
name: ib
- mountPath: /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs
name: cf
dnsPolicy: ClusterFirstWithHostNet
hostIPC: true
hostNetwork: true
nodeSelector:
        # modify according to your deployment environment
pd: "yes"
tolerations:
        # modify according to your deployment environment
- key: bopd
operator: Exists
- key: node-role
operator: Exists
volumes:
- emptyDir:
medium: Memory
name: dshm
- hostPath:
path: /dev/infiniband
name: ib
- hostPath:
            # modify according to your deployment environment
path: /data1/maas_hosted_models/models/DeepSeek-R1-0528/deepseek_r1_0528
name: model
- hostPath:
            # modify according to your deployment environment
path: /data1/maas_hosted_models/models/fused_moe_triton/configs
name: cf
- hostPath:
            # modify according to your deployment environment
path: /data1/sgl_cache
type: DirectoryOrCreate
name: sgl-cache