# Install SGLang

You can install SGLang using any of the methods below. For running DeepSeek V3/R1 with SGLang, refer to [DeepSeek V3 Support](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3). It is always recommended to use the [latest release version](https://pypi.org/project/sglang/#history) and deploy it with [Docker](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#using-docker-recommended) to avoid environment-related problems and issues that have already been fixed in newer releases.

## Method 1: With pip or uv

We recommend using uv, which installs the dependencies faster:

```bash
pip install --upgrade pip
pip install uv
uv pip install sgl-kernel --force-reinstall --no-deps
uv pip install "sglang[all]>=0.4.3.post2" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python
```
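
To quickly confirm the installation, a sanity check like the following can help (assuming the import names `sglang` and `sgl_kernel`; your installed version may differ):

```bash
# Optional sanity check: both packages should import without errors.
python -c "import sglang; print(sglang.__version__)"
python -c "import sgl_kernel"
```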

**Quick Fixes for Installation Issues**

- SGLang currently uses torch 2.5, so you need to install the flashinfer version built for torch 2.5. If you want to install flashinfer separately, please refer to the [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html). Note that FlashInfer's package is currently named `flashinfer-python`, not `flashinfer`.

- If you experience an error like `OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root`, try either of the following solutions (a sketch follows this list):

  1. Use `export CUDA_HOME=/usr/local/cuda-<your-cuda-version>` to set the `CUDA_HOME` environment variable.
  2. Follow the procedure described in the [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html) first, then install SGLang as described above.

- If you encounter `ImportError: cannot import name 'is_valid_list_of_images' from 'transformers.models.llama.image_processing_llama'`, pin `transformers` to the version specified in [pyproject.toml](https://github.com/sgl-project/sglang/blob/main/python/pyproject.toml). Currently, that means running `pip install transformers==4.48.3`.
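
The sketch below illustrates the first `CUDA_HOME` fix mentioned above; `/usr/local/cuda-12.4` is only an example path and must match the toolkit actually installed on your system:

```bash
# List installed CUDA toolkits (paths vary by system).
ls -d /usr/local/cuda*

# Point CUDA_HOME at your install root (example version; adjust as needed).
export CUDA_HOME=/usr/local/cuda-12.4

# Confirm the toolkit is visible.
"$CUDA_HOME/bin/nvcc" --version
```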

## Method 2: From source

```bash
# Use the latest release branch
git clone -b v0.4.3.post2 https://github.com/sgl-project/sglang.git
cd sglang

pip install --upgrade pip
pip install sgl-kernel --force-reinstall --no-deps
pip install -e "python[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python
```

Note: SGLang currently uses torch 2.5, so you need to install the flashinfer version built for torch 2.5. If you want to install flashinfer separately, please refer to the [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html).

If you want to work on development in SGLang, it is highly recommended that you use docker. Please refer to [setup docker container](https://github.com/sgl-project/sglang/blob/main/docs/developer/development_guide_using_docker.md#setup-docker-container) for guidance. The image used is `lmsysorg/sglang:dev`.
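
For illustration only, starting such a dev container could look like the sketch below; the flags mirror the Docker command in Method 3, the mount is an example, and the linked development guide remains the authoritative reference:

```bash
# A minimal sketch of launching the dev image (example flags and mounts;
# see the linked development guide for the recommended setup).
docker run -it --gpus all \
    --shm-size 32g \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --ipc=host \
    lmsysorg/sglang:dev \
    /bin/bash
```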

Note: For AMD ROCm systems with Instinct/MI GPUs, do the following instead:

```bash
# Use the latest release branch
git clone -b v0.4.3.post2 https://github.com/sgl-project/sglang.git
cd sglang

pip install --upgrade pip
cd sgl-kernel
python setup_rocm.py install
cd ..
pip install -e "python[all_hip]"
```

## Method 3: Using docker
The docker images are available on Docker Hub as [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile](https://github.com/sgl-project/sglang/tree/main/docker).
Replace `<secret>` below with your Hugging Face Hub [token](https://huggingface.co/docs/hub/en/security-tokens).

```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```
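
Once the container is up, you can smoke-test the server from the host through SGLang's native `/generate` endpoint; the prompt and sampling parameters below are arbitrary examples:

```bash
# Send a test generation request to the server started above.
curl http://localhost:30000/generate \
    -H "Content-Type: application/json" \
    -d '{"text": "Once upon a time,", "sampling_params": {"max_new_tokens": 16, "temperature": 0}}'
```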

Note: For AMD ROCm systems with Instinct/MI GPUs, it is recommended to build images with `docker/Dockerfile.rocm`. Example build and usage:

```bash
docker build --build-arg SGL_BRANCH=v0.4.3.post2 -t v0.4.3.post2-rocm630 -f Dockerfile.rocm .

alias drun='docker run -it --rm --network=host --device=/dev/kfd --device=/dev/dri --ipc=host \
    --shm-size 16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    -v $HOME/dockerx:/dockerx -v /data:/data'

drun -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    v0.4.3.post2-rocm630 \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000

# Until the flashinfer backend is available on ROCm, --attention-backend triton and --sampling-backend pytorch are set by default
drun v0.4.3.post2-rocm630 python3 -m sglang.bench_one_batch --batch-size 32 --input 1024 --output 128 --model amd/Meta-Llama-3.1-8B-Instruct-FP8-KV --tp 8 --quantization fp8
```

## Method 4: Using docker compose

<details>
<summary>More</summary>

> This method is recommended if you plan to serve it as a service.
> A better approach is to use the [k8s-sglang-service.yaml](https://github.com/sgl-project/sglang/blob/main/docker/k8s-sglang-service.yaml).

1. Copy [compose.yaml](https://github.com/sgl-project/sglang/blob/main/docker/compose.yaml) to your local machine.
2. Execute the command `docker compose up -d` in your terminal.
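
For reference, the sketch below shows roughly what such a compose file contains, written as a shell heredoc so it can be pasted in one step. It is an assumption-based illustration mirroring the Docker command from Method 3; the linked compose.yaml is the canonical version:

```bash
# Write a minimal compose file (a sketch only; prefer the linked compose.yaml).
cat > compose.yaml <<'EOF'
services:
  sglang:
    image: lmsysorg/sglang:latest
    ports:
      - "30000:30000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      HF_TOKEN: <secret>
    ipc: host
    command: python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
EOF

# Start the service in the background.
docker compose up -d
```
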
</details>

## Method 5: Run on Kubernetes or Clouds with SkyPilot

<details>
<summary>More</summary>

To deploy on Kubernetes or 12+ clouds, you can use [SkyPilot](https://github.com/skypilot-org/skypilot).

1. Install SkyPilot and set up Kubernetes cluster or cloud access: see [SkyPilot's documentation](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html).
2. Deploy on your own infra with a single command and get the HTTP API endpoint:
<details>
<summary>SkyPilot YAML: <code>sglang.yaml</code></summary>

```yaml
# sglang.yaml
envs:
  HF_TOKEN: null

resources:
  image_id: docker:lmsysorg/sglang:latest
  accelerators: A100
  ports: 30000

run: |
  conda deactivate
  python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000
```
</details>

```bash
# Deploy on any cloud or Kubernetes cluster. Use --cloud <cloud> to select a specific cloud provider.
HF_TOKEN=<secret> sky launch -c sglang --env HF_TOKEN sglang.yaml

# Get the HTTP API endpoint
sky status --endpoint 30000 sglang
```
3. To further scale up your deployment with autoscaling and failure recovery, check out the [SkyServe + SGLang guide](https://github.com/skypilot-org/skypilot/tree/master/llm/sglang#serving-llama-2-with-sglang-for-more-traffic-using-skyserve).
</details>

## Common Notes
- [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding `--attention-backend triton --sampling-backend pytorch` (see the example after this list) and open an issue on GitHub.
- If you only need to use OpenAI models with the frontend language, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.
- The language frontend operates independently of the backend runtime. You can install the frontend locally without needing a GPU, while the backend can be set up on a GPU-enabled machine. To install the frontend, run `pip install sglang`, and for the backend, use `pip install "sglang[srt]"`. `srt` is the abbreviation of SGLang runtime.
- To reinstall flashinfer locally, use the following command: `pip install "flashinfer-python>=0.2.1.post2" -i https://flashinfer.ai/whl/cu124/torch2.5 --force-reinstall --no-deps` and then delete the cache with `rm -rf ~/.cache/flashinfer`.
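
As referenced in the first note above, a hedged example of launching the server with the alternative kernel backends (the model path is only an example):

```bash
# Launch with the Triton attention backend and PyTorch sampling backend.
python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --attention-backend triton \
    --sampling-backend pytorch
```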