# Install SGLang

You can install SGLang using any of the methods below.

For running DeepSeek V3/R1, refer to [DeepSeek V3 Support](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3). It is recommended to use the latest version and deploy it with [Docker](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended) to avoid environment-related issues.

## Method 1: With pip or uv
It is recommended to use uv to install the dependencies for faster installation:

```bash
pip install --upgrade pip
pip install uv
uv pip install "sglang[all]>=0.4.6.post1"
```
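
After installing, you can sanity-check the setup by launching a server (the same command the Docker examples below use; the model path is only an illustration, any model supported by SGLang works):

```bash
# Downloads the model on first use and serves it on port 30000
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```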

**Quick Fixes to Common Problems**

- SGLang currently uses torch 2.5, so you need to install a flashinfer build for torch 2.5. If you want to install flashinfer separately, please refer to the [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html) and see the sketch after this list. Please note that the FlashInfer PyPI package is called `flashinfer-python`, not `flashinfer`.

- If you encounter `OSError: CUDA_HOME environment variable is not set`, set it to your CUDA install root with either of the following solutions:

  1. Use `export CUDA_HOME=/usr/local/cuda-<your-cuda-version>` to set the `CUDA_HOME` environment variable.
  2. Install FlashInfer first following [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html), then install SGLang as described above.

- If you encounter `ImportError: cannot import name 'is_valid_list_of_images' from 'transformers.models.llama.image_processing_llama'`, use the `transformers` version specified in [pyproject.toml](https://github.com/sgl-project/sglang/blob/main/python/pyproject.toml). Currently, running `pip install transformers==4.51.1` is sufficient.
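
For example, installing FlashInfer separately typically looks like the following (a sketch assuming CUDA 12.4 and torch 2.5; take the exact index URL for your CUDA/torch combination from the FlashInfer installation doc):

```bash
# Note the package name: flashinfer-python, not flashinfer
pip install flashinfer-python -i https://flashinfer.ai/whl/cu124/torch2.5
```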

## Method 2: From source

```bash
# Use the last release branch
git clone -b v0.4.6.post1 https://github.com/sgl-project/sglang.git
cd sglang

pip install --upgrade pip
pip install -e "python[all]"
```
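
To confirm the editable install is picked up, a quick sanity check:

```bash
# Should print the version of the checked-out source tree, e.g. 0.4.6.post1
python3 -c "import sglang; print(sglang.__version__)"
```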

Note: SGLang currently uses torch 2.5, so you need to install flashinfer for torch 2.5. If you want to install flashinfer separately, please refer to [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html).

If you want to develop SGLang, it is recommended to use docker. Please refer to [setup docker container](https://github.com/sgl-project/sglang/blob/main/docs/developer/development_guide_using_docker.md#setup-docker-container) for guidance. The docker image is `lmsysorg/sglang:dev`.
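
A typical way to start the dev container might look like this (a sketch; see the linked development guide for the authoritative flags):

```bash
# Interactive dev container with GPU access and the HF cache mounted
docker run -it --gpus all --shm-size 32g \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --ipc=host \
    lmsysorg/sglang:dev /bin/bash
```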

Note: For AMD ROCm systems with Instinct/MI GPUs, do the following instead:

```bash
# Use the last release branch
git clone -b v0.4.6.post1 https://github.com/sgl-project/sglang.git
cd sglang

pip install --upgrade pip
cd sgl-kernel
python setup_rocm.py install
cd ..
pip install -e "python[all_hip]"
```

## Method 3: Using docker

The docker images are available on Docker Hub as [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile](https://github.com/sgl-project/sglang/tree/main/docker).
Replace `<secret>` below with your huggingface hub [token](https://huggingface.co/docs/hub/en/security-tokens).

```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```
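
Once the container is up, you can verify the server from the host. The sketch below uses SGLang's native `/health` and `/generate` endpoints; an OpenAI-compatible API is also served under `/v1`:

```bash
# Readiness check
curl http://localhost:30000/health

# Minimal generation request against the native API
curl http://localhost:30000/generate \
    -H "Content-Type: application/json" \
    -d '{"text": "The capital of France is", "sampling_params": {"max_new_tokens": 16, "temperature": 0}}'
```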

Note: For AMD ROCm systems with Instinct/MI GPUs, it is recommended to build images with `docker/Dockerfile.rocm`. Example build and usage:

```bash
docker build --build-arg SGL_BRANCH=v0.4.6.post1 -t v0.4.6.post1-rocm630 -f Dockerfile.rocm .

alias drun='docker run -it --rm --network=host --device=/dev/kfd --device=/dev/dri --ipc=host \
    --shm-size 16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    -v $HOME/dockerx:/dockerx -v /data:/data'

drun -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    v0.4.6.post1-rocm630 \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000

# Until the flashinfer backend is available, --attention-backend triton and --sampling-backend pytorch are set by default
drun v0.4.6.post1-rocm630 python3 -m sglang.bench_one_batch --batch-size 32 --input 1024 --output 128 --model amd/Meta-Llama-3.1-8B-Instruct-FP8-KV --tp 8 --quantization fp8
```

## Method 4: Using docker compose

<details>
<summary>More</summary>

> This method is recommended if you plan to serve it as a service.
> A better approach is to use the [k8s-sglang-service.yaml](https://github.com/sgl-project/sglang/blob/main/docker/k8s-sglang-service.yaml).

1. Copy [compose.yaml](https://github.com/sgl-project/sglang/blob/main/docker/compose.yaml) to your local machine.
2. Execute the command `docker compose up -d` in your terminal.
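
To check that the service came up, you can inspect it with compose (a quick sketch):

```bash
docker compose ps          # container status
docker compose logs -f     # follow the server logs
```
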
</details>

## Method 5: Using Kubernetes

<details>
<summary>More</summary>

1. For single-node serving (typically when the model fits into the GPUs on one node)

   Execute `kubectl apply -f docker/k8s-sglang-service.yaml` to create the k8s deployment and service, using llama-31-8b as an example.

2. For multi-node serving (usually when a large model requires more than one GPU node, such as `DeepSeek-R1`)

   Modify the LLM model path and arguments as necessary, then execute `kubectl apply -f docker/k8s-sglang-distributed-sts.yaml` to create a two-node k8s statefulset and serving service.
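
After applying either manifest, a quick way to verify the deployment (a sketch; replace `<service-name>` with the service name defined in the YAML you applied):

```bash
# Wait until the pods report Ready
kubectl get pods -w

# Forward the serving port to your machine and test it
kubectl port-forward service/<service-name> 30000:30000
```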

</details>

## Method 6: Run on Kubernetes or Clouds with SkyPilot

<details>
<summary>More</summary>

To deploy on Kubernetes or 12+ clouds, you can use [SkyPilot](https://github.com/skypilot-org/skypilot).

1. Install SkyPilot and set up Kubernetes cluster or cloud access: see [SkyPilot's documentation](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html).
2. Deploy on your own infra with a single command and get the HTTP API endpoint:
<details>
<summary>SkyPilot YAML: <code>sglang.yaml</code></summary>

```yaml
# sglang.yaml
envs:
  HF_TOKEN: null

resources:
  image_id: docker:lmsysorg/sglang:latest
  accelerators: A100
  ports: 30000

run: |
  conda deactivate
  python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000
```

</details>

```bash
# Deploy on any cloud or Kubernetes cluster. Use --cloud <cloud> to select a specific cloud provider.
HF_TOKEN=<secret> sky launch -c sglang --env HF_TOKEN sglang.yaml

# Get the HTTP API endpoint
sky status --endpoint 30000 sglang
```

3. To further scale up your deployment with autoscaling and failure recovery, check out the [SkyServe + SGLang guide](https://github.com/skypilot-org/skypilot/tree/master/llm/sglang#serving-llama-2-with-sglang-for-more-traffic-using-skyserve).
</details>

## Common Notes

- [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding `--attention-backend triton --sampling-backend pytorch` (see the example after this list) and open an issue on GitHub.
- If you only need to use OpenAI models with the frontend language, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.
- The language frontend operates independently of the backend runtime. You can install the frontend locally without needing a GPU, while the backend can be set up on a GPU-enabled machine. To install the frontend, run `pip install sglang`, and for the backend, use `pip install "sglang[srt]"`. `srt` is the abbreviation of SGLang Runtime.
- To reinstall flashinfer locally, use the following command: `pip install "flashinfer-python==0.2.3" -i https://flashinfer.ai/whl/cu124/torch2.6 --force-reinstall --no-deps` and then delete the cache with `rm -rf ~/.cache/flashinfer`.
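
For reference, switching kernel backends as described in the first note above looks like this (the flags are the ones quoted there; the model path is only an illustration):

```bash
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct \
    --attention-backend triton --sampling-backend pytorch
```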