Unverified commit c9565e49, authored by Shenggui Li, committed by GitHub

[docker] added rdma support (#3619)

parent d03c4c25
@@ -7,6 +7,7 @@ Special thanks to Meituan's Search & Recommend Platform Team and Baseten's Model
For optimizations made on the DeepSeek series models regarding SGLang, please refer to [DeepSeek Model Optimizations in SGLang](https://docs.sglang.ai/references/deepseek.html).
## Hardware Recommendation
- 8 x NVIDIA H200 GPUs
If you do not have GPUs with large enough memory, please try multi-node tensor parallelism. There is an example serving with [2 H20 nodes](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-2-h208) below.
@@ -18,19 +19,26 @@ For running on AMD MI300X, use this as a reference. [Running DeepSeek-R1 on a si
If you encounter errors when starting the server, ensure the weights have finished downloading. It's recommended to download them beforehand or restart multiple times until all weights are downloaded.
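One hedged way to pre-download the weights (assuming the `huggingface_hub` CLI is installed) is sketched below; this is not part of the original README, just an illustration of the "download beforehand" advice.

```bash
# Sketch: pre-fill the local Hugging Face cache so the server start does not
# race the download. Assumes: pip install -U "huggingface_hub[cli]"
huggingface-cli download deepseek-ai/DeepSeek-V3
```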
### Using Docker (Recommended)
```bash
# Pull latest image
# https://hub.docker.com/r/lmsysorg/sglang/tags
docker pull lmsysorg/sglang:latest

# Launch
docker run --gpus all --shm-size 32g -p 30000:30000 -v ~/.cache/huggingface:/root/.cache/huggingface --ipc=host --network=host --privileged lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --port 30000
```
If you are using RDMA, please note that:
1. `--network host` and `--privileged` are required by RDMA. If you don't need RDMA, you can remove them.
2. You may need to set `NCCL_IB_GID_INDEX` if you are using RoCE, for example: `export NCCL_IB_GID_INDEX=3`.
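As an illustration only (not the documented command), a hedged sketch of passing the RoCE GID index into the container; the value `3` is the example from the note above and must match your fabric:

```bash
# Sketch: RDMA/RoCE launch. --network=host makes -p mappings unnecessary;
# NCCL_IB_GID_INDEX=3 is an assumption and should match your RoCE GID table.
docker run --gpus all --shm-size 32g --network=host --privileged \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -e NCCL_IB_GID_INDEX=3 \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --port 30000
```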
Add [performance optimization options](#performance-optimization-options) as needed.

### Using pip
```bash
# Installation
pip install "sglang[all]>=0.4.3" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python
@@ -42,7 +50,9 @@ python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-r
Add [performance optimization options](#performance-optimization-options) as needed.

<a id="option_args"></a>
### Performance Optimization Options
[MLA optimizations](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#deepseek-multi-head-latent-attention-mla-throughput-optimizations) are enabled by default. Here are some optional optimizations that can be enabled as needed.
- [Data Parallelism Attention](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models): For high QPS scenarios, add the `--enable-dp-attention` argument to boost throughput.
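For instance, a hedged sketch of a launch command with DP attention enabled; the `--dp 8` value is an assumption and should be tuned to your GPU count and workload:

```bash
# Illustrative sketch: enable data-parallel attention for high-QPS serving.
# --dp 8 is an assumed value, not a recommendation from the README.
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 \
    --tp 8 --dp 8 --enable-dp-attention --trust-remote-code --port 30000
```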
@@ -68,7 +78,8 @@ response = client.chat.completions.create(
print(response)
```
### Example: Serving with two H20\*8 nodes
For example, suppose there are two H20 nodes, each with 8 GPUs. The first node's IP is `10.0.0.1`, and the second node's IP is `10.0.0.2`. Please **use the first node's IP** for both commands.
If the command fails, try setting the `GLOO_SOCKET_IFNAME` parameter. For more information, see [Common Environment Variables](https://pytorch.org/docs/stable/distributed.html#common-environment-variables).
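The exact two-node commands live in the part of the README this diff does not show; purely as a hedged sketch of the pattern (port, interface name, and `--tp` value below are assumptions):

```bash
# Node 0 (10.0.0.1): rough sketch of multi-node tensor parallelism,
# not the verbatim README command.
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 16 \
    --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 0 --trust-remote-code --port 30000

# Node 1 (10.0.0.2): identical except for the node rank.
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 16 \
    --dist-init-addr 10.0.0.1:5000 --nnodes 2 --node-rank 1 --trust-remote-code --port 30000

# If rendezvous hangs, point Gloo at the right NIC, e.g.:
# export GLOO_SOCKET_IFNAME=eth0   # interface name is an assumption
```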
@@ -85,7 +96,8 @@ If you have two H100 nodes, the usage is similar to the aforementioned H20.
> **Note that the launch command here does not enable Data Parallelism Attention or `torch.compile` Optimization**. For optimal performance, please refer to the command options in [Performance Optimization Options](#option_args).
### Example: Serving with two H200\*8 nodes and docker
There are two H200 nodes, each with 8 GPUs. The first node's IP is `192.168.114.10`, and the second node's IP is `192.168.114.11`. Configure the endpoint to expose it to another Docker container using `--host 0.0.0.0` and `--port 40000`, and set up communications with `--dist-init-addr 192.168.114.10:20000`.
A single H200 node with 8 devices can run DeepSeek V3; the dual-H200 setup here is just to demonstrate multi-node usage.
@@ -120,6 +132,7 @@ docker run --gpus all \
```
To ensure functionality, we include a test from a client Docker container.
```bash
docker run --gpus all \
    --shm-size 32g \
@@ -136,7 +149,8 @@ docker run --gpus all \
> **Note that the launch command here does not enable Data Parallelism Attention or `torch.compile` Optimization**. For optimal performance, please refer to the command options in [Performance Optimization Options](#option_args).
### Example: Serving with four A100\*8 nodes
To serve DeepSeek-V3 with A100 GPUs, first convert the [FP8 model checkpoints](https://huggingface.co/deepseek-ai/DeepSeek-V3) to BF16 with the [conversion script](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py).
Since the BF16 model is over 1.3 TB, we need to prepare four A100 nodes, each with 8 80 GB GPUs. Assuming the first node's IP is `10.0.0.1` and the converted model path is `/path/to/DeepSeek-V3-BF16`, we can use the following commands to launch the server.
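A hedged sketch of the conversion step; the flag names follow the upstream DeepSeek-V3 inference script and the input path is a placeholder, so verify against the script's `--help` before use:

```bash
# Sketch: convert the FP8 checkpoint to BF16 before serving on A100s.
# Flag names are taken from the upstream script and may change upstream.
git clone https://github.com/deepseek-ai/DeepSeek-V3.git
cd DeepSeek-V3/inference
python fp8_cast_bf16.py \
    --input-fp8-hf-path /path/to/DeepSeek-V3 \
    --output-bf16-hf-path /path/to/DeepSeek-V3-BF16
```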
......
@@ -14,6 +14,7 @@ RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \
&& update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 1 \
&& update-alternatives --set python3 /usr/bin/python3.10 && apt install python3.10-distutils -y \
&& apt install curl git sudo libibverbs-dev -y \
&& apt install -y rdma-core infiniband-diags openssh-server perftest ibverbs-providers libibumad3 libibverbs1 libnl-3-200 libnl-route-3-200 librdmacm1 \
&& curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && python3 get-pip.py \
&& python3 --version \
&& python3 -m pip --version \
......
@@ -21,6 +21,7 @@ RUN apt-get update && apt-get install -y \
pkg-config \
libssl-dev \
bear \
&& apt install -y rdma-core infiniband-diags openssh-server perftest ibverbs-providers libibumad3 libibverbs1 libnl-3-200 libnl-route-3-200 librdmacm1 \
&& rm -rf /var/lib/apt/lists/* \
&& apt-get clean
......
@@ -20,6 +20,8 @@ ARG TRITON_COMMIT="improve_fa_decode_3.0.0"
ARG ATER_REPO="https://github.com/HaiShaw/ater"
ARG CK_COMMITS="fa05ae"
RUN apt install -y rdma-core infiniband-diags openssh-server perftest ibverbs-providers libibumad3 libibverbs1 libnl-3-200 libnl-route-3-200 librdmacm1
RUN git clone ${SGL_REPO} \
&& cd sglang \
&& if [ "${SGL_BRANCH}" = ${SGL_DEFAULT} ]; then \
......
@@ -7,7 +7,8 @@ services:
# If you use modelscope, you need mount this directory
# - ${HOME}/.cache/modelscope:/root/.cache/modelscope
restart: always
network_mode: host # required by RDMA
privileged: true # required by RDMA
# Or you can only publish port 30000
# ports:
#   - 30000:30000
@@ -16,8 +17,7 @@ services:
# if you use modelscope to download model, you need set this environment
# - SGLANG_USE_MODELSCOPE: true
entrypoint: python3 -m sglang.launch_server
command: --model-path meta-llama/Llama-3.1-8B-Instruct
--host 0.0.0.0
--port 30000
ulimits:
@@ -31,5 +31,5 @@ services:
reservations:
devices:
- driver: nvidia
device_ids: ["0"]
capabilities: [gpu]
@@ -16,18 +16,23 @@ tar xf vscode_cli_alpine_x64_cli.tar.gz
The following startup command is an example for internal development by the SGLang team. You can **modify or add directory mappings as needed**, especially for model weight downloads, to prevent repeated downloads by different Docker containers.
❗️ **Note on RDMA**
1. `--network host` and `--privileged` are required by RDMA. If you don't need RDMA, you can remove them, but keeping them does no harm, so we enable both flags by default in the commands below.
2. You may need to set `NCCL_IB_GID_INDEX` if you are using RoCE, for example: `export NCCL_IB_GID_INDEX=3`.
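Once inside the container, a hedged way to sanity-check the RDMA setup with the tools added in the Dockerfiles (this is an illustration, not part of the original guide):

```bash
# Sketch: verify RDMA port status inside the container.
# ibstat ships with the infiniband-diags package installed in the images.
ibstat

# For an end-to-end fabric check between two nodes, ib_write_bw from the
# perftest package (also installed) can be run as server on one node and
# client on the other.
```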
### H100
```bash
# Change the name to yours
docker run -itd --shm-size 32g --gpus all -v /opt/dlami/nvme/.cache:/root/.cache --ipc=host --network=host --privileged --name sglang_zhyncs lmsysorg/sglang:dev /bin/zsh
docker exec -it sglang_zhyncs /bin/zsh
```

### H200
```bash
docker run -itd --shm-size 32g --gpus all -v /mnt/co-research/shared-models:/root/.cache/huggingface --ipc=host --network=host --privileged --name sglang_zhyncs lmsysorg/sglang:dev /bin/zsh
docker exec -it sglang_zhyncs /bin/zsh
```
......
@@ -63,13 +63,18 @@ docker build -t sglang_image -f Dockerfile.rocm .
2. Create a convenient alias.
```bash
alias drun='docker run -it --rm --network=host --privileged --device=/dev/kfd --device=/dev/dri \
    --ipc=host --shm-size 16G --group-add video --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    -v $HOME/dockerx:/dockerx \
    -v /data:/data'
```
If you are using RDMA, please note that:
1. `--network host` and `--privileged` are required by RDMA. If you don't need RDMA, you can remove them.
2. You may need to set `NCCL_IB_GID_INDEX` if you are using RoCE, for example: `export NCCL_IB_GID_INDEX=3`.
3. Launch the server.
**NOTE:** Replace `<secret>` below with your [huggingface hub token](https://huggingface.co/docs/hub/en/security-tokens).
......