# Install SGLang

You can install SGLang using one of the methods below.

This page primarily applies to common NVIDIA GPU platforms.
For other or newer platforms, please refer to the dedicated pages for [AMD GPUs](../platforms/amd_gpu.md), [Intel Xeon CPUs](../platforms/cpu_server.md), [TPU](../platforms/tpu.md), [NVIDIA DGX Spark](https://lmsys.org/blog/2025-10-13-nvidia-dgx-spark/), [NVIDIA Jetson](../platforms/nvidia_jetson.md), and [Ascend NPUs](../platforms/ascend_npu.md).

## Method 1: With pip or uv

It is recommended to use uv for faster installation:

```bash
pip install --upgrade pip
pip install uv
uv pip install "sglang" --prerelease=allow
```

**Quick fixes to common problems**

- If you encounter `OSError: CUDA_HOME environment variable is not set`, set `CUDA_HOME` to your CUDA install root with either of the following solutions:
  1. Use `export CUDA_HOME=/usr/local/cuda-<your-cuda-version>` to set the `CUDA_HOME` environment variable.
  2. Install FlashInfer first following [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html), then install SGLang as described above.
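
If you are unsure where CUDA is installed, the first solution can be sketched as below. This is a minimal illustration, not part of SGLang: `cuda_home_from_nvcc` is a hypothetical helper, and it assumes `nvcc` lives at `<cuda-root>/bin/nvcc`, which is the usual layout but not guaranteed.

```bash
# Hypothetical helper (not part of SGLang): derive the CUDA root from an nvcc path,
# e.g. /usr/local/cuda-12.4/bin/nvcc -> /usr/local/cuda-12.4
# Assumes the conventional <cuda-root>/bin/nvcc layout.
cuda_home_from_nvcc() {
  dirname "$(dirname "$1")"
}

# Only set CUDA_HOME when it is unset and nvcc can be found on PATH.
if [ -z "${CUDA_HOME:-}" ] && command -v nvcc >/dev/null 2>&1; then
  CUDA_HOME="$(cuda_home_from_nvcc "$(command -v nvcc)")"
  export CUDA_HOME
fi

# Show the result so you can confirm it before installing.
echo "CUDA_HOME=${CUDA_HOME:-<unset>}"
```

If `nvcc` on your `PATH` is a symlink (for example under `/usr/bin`), the derived path may be wrong; in that case set `CUDA_HOME` explicitly as described above.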

## Method 2: From source

```bash
# Use the last release branch
git clone -b v0.5.4.post3 https://github.com/sgl-project/sglang.git
cd sglang

# Install the python packages
pip install --upgrade pip
pip install -e "python"
```

**Quick fixes to common problems**

- If you want to develop SGLang, it is recommended to use docker. Please refer to [setup docker container](../developer_guide/development_guide_using_docker.md#setup-docker-container). The docker image is `lmsysorg/sglang:dev`.

## Method 3: Using docker

The docker images are available on Docker Hub at [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile](https://github.com/sgl-project/sglang/tree/main/docker).
Replace `<secret>` below with your Hugging Face Hub [token](https://huggingface.co/docs/hub/en/security-tokens).

```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```
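
The container can take a while to download the model and start serving, so it can help to wait for the published port before sending requests. The sketch below polls a URL until it answers. `wait_for_server` is a hypothetical helper, not an SGLang command, and the `/health` path used in the usage note is an assumption about the server version you run; substitute any URL you know responds.

```bash
# Hypothetical helper (not part of SGLang): poll a URL until it returns a
# successful HTTP status, or give up after a timeout given in seconds.
wait_for_server() {
  local url="$1" timeout="${2:-300}" waited=0
  while [ "$waited" -lt "$timeout" ]; do
    # -s: silent, -f: treat HTTP errors as failures, -o /dev/null: discard body
    if curl -sf -o /dev/null "$url"; then
      return 0
    fi
    sleep 5
    waited=$((waited + 5))
  done
  return 1
}
```

For example, `wait_for_server http://localhost:30000/health 600 && echo "server is ready"` waits up to ten minutes for the port published by the `docker run` command above.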

You can also find the nightly docker images [here](https://hub.docker.com/r/lmsysorg/sglang/tags?name=nightly).

## Method 4: Using Kubernetes

Please check out [OME](https://github.com/sgl-project/ome), a Kubernetes operator for enterprise-grade management and serving of large language models (LLMs).

<details>
<summary>More</summary>

1. Option 1: For single node serving (typically when the model size fits into GPUs on one node)

   Run `kubectl apply -f docker/k8s-sglang-service.yaml` to create a Kubernetes deployment and service, using llama-31-8b as an example.

2. Option 2: For multi-node serving (usually when a large model requires more than one GPU node, such as `DeepSeek-R1`)

   Modify the model path and arguments as necessary, then run `kubectl apply -f docker/k8s-sglang-distributed-sts.yaml` to create a two-node Kubernetes StatefulSet and its serving service.

</details>

## Method 5: Using docker compose

<details>
<summary>More</summary>

> This method is recommended if you plan to serve it as a service.
> A better approach is to use the [k8s-sglang-service.yaml](https://github.com/sgl-project/sglang/blob/main/docker/k8s-sglang-service.yaml).

1. Copy [compose.yaml](https://github.com/sgl-project/sglang/blob/main/docker/compose.yaml) to your local machine.
2. Execute the command `docker compose up -d` in your terminal.
</details>

## Method 6: Run on Kubernetes or Clouds with SkyPilot

<details>
<summary>More</summary>

To deploy on Kubernetes or 12+ clouds, you can use [SkyPilot](https://github.com/skypilot-org/skypilot).

1. Install SkyPilot and set up Kubernetes cluster or cloud access: see [SkyPilot's documentation](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html).
2. Deploy on your own infra with a single command and get the HTTP API endpoint:
<details>
<summary>SkyPilot YAML: <code>sglang.yaml</code></summary>

```yaml
# sglang.yaml
envs:
  HF_TOKEN: null

resources:
  image_id: docker:lmsysorg/sglang:latest
  accelerators: A100
  ports: 30000

run: |
  conda deactivate
  python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000
```

</details>

```bash
# Deploy on any cloud or Kubernetes cluster. Use --cloud <cloud> to select a specific cloud provider.
HF_TOKEN=<secret> sky launch -c sglang --env HF_TOKEN sglang.yaml

# Get the HTTP API endpoint
sky status --endpoint 30000 sglang
```

3. To further scale up your deployment with autoscaling and failure recovery, check out the [SkyServe + SGLang guide](https://github.com/skypilot-org/skypilot/tree/master/llm/sglang#serving-llama-2-with-sglang-for-more-traffic-using-skyserve).
</details>

## Common Notes

- [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding `--attention-backend triton --sampling-backend pytorch` and open an issue on GitHub.
- To reinstall flashinfer locally, use the following command: `pip3 install --upgrade flashinfer-python --force-reinstall --no-deps` and then delete the cache with `rm -rf ~/.cache/flashinfer`.
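
To check whether a GPU meets the sm75 requirement noted above, you can compare its compute capability against 7.5. The sketch below is illustrative only: `sm_at_least_75` is a hypothetical helper that expects a `major.minor` string such as `8.0`.

```bash
# Hypothetical helper (not part of SGLang or FlashInfer): succeed when a
# compute capability string in "major.minor" form is 7.5 (sm75) or higher.
sm_at_least_75() {
  local major="${1%%.*}" minor="${1#*.}"
  # Fold "major.minor" into a single integer, e.g. 8.6 -> 86, then compare.
  [ "$((major * 10 + minor))" -ge 75 ]
}

# Example with fixed inputs: T4 is 7.5, V100 is 7.0.
for cap in 7.5 7.0; do
  if sm_at_least_75 "$cap"; then
    echo "compute capability $cap: sm75 or newer"
  else
    echo "compute capability $cap: older than sm75"
  fi
done
```

On recent NVIDIA drivers, `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` should print one capability per GPU, which you can pass to the helper. Older drivers may not support that query field.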