# Install SGLang

You can install SGLang using any of the methods below.

## Method 1: With pip
```
pip install --upgrade pip
pip install sgl-kernel --force-reinstall --no-deps
pip install "sglang[all]>=0.4.2.post2" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer/
```

Note: Please check the [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html) to install the proper version according to your PyTorch and CUDA versions.
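
If you are unsure which FlashInfer wheel index matches your environment, a quick check of the installed PyTorch and CUDA versions can help (a minimal sketch, assuming PyTorch is already installed):

```bash
# Print the PyTorch version and the CUDA version it was built against;
# e.g. "2.5.1 12.4" corresponds to the cu124/torch2.5 wheel index used above.
python3 -c "import torch; print(torch.__version__, torch.version.cuda)"
```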

## Method 2: From source
```
# Use the last release branch
git clone -b v0.4.2.post2 https://github.com/sgl-project/sglang.git
cd sglang

pip install --upgrade pip
pip install sgl-kernel --force-reinstall --no-deps
pip install -e "python[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer/
```

Note: Please check the [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html) to install the proper version according to your PyTorch and CUDA versions. If you encounter an error like **ImportError: cannot import name `_grouped_size_compiled_for_decode_kernels`**, installing an older FlashInfer version such as 0.1.6 instead of the latest one may resolve it.
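
For example, a hedged sketch of pinning the older version (this assumes the wheel index above also serves FlashInfer 0.1.6 for your PyTorch/CUDA combination; adjust the index URL if not):

```bash
# Downgrade FlashInfer to 0.1.6 from the official wheel index.
pip install flashinfer==0.1.6 --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer/
```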

Note: For AMD ROCm systems with Instinct/MI GPUs, do the following instead:

```
# Use the last release branch
git clone -b v0.4.2.post2 https://github.com/sgl-project/sglang.git
cd sglang

pip install --upgrade pip

cd sgl-kernel
python setup_rocm.py install
cd ..

pip install -e "python[all_hip]"
```
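
As a quick sanity check after the ROCm install (a sketch, assuming a ROCm build of PyTorch), you can confirm that PyTorch sees the HIP runtime and the GPUs:

```bash
# On ROCm builds of PyTorch, torch.version.hip is set (torch.version.cuda is None),
# and torch.cuda.is_available() reports whether the Instinct GPUs are visible.
python3 -c "import torch; print(torch.version.hip, torch.cuda.is_available())"
```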

## Method 3: Using docker
The docker images are available on Docker Hub as [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile](https://github.com/sgl-project/sglang/tree/main/docker).
Replace `<secret>` below with your Hugging Face Hub [token](https://huggingface.co/docs/hub/en/security-tokens).

```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```
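
Once the container is running, you can verify the server from the host. This is a minimal sketch that assumes the server's native `/health` and `/generate` endpoints on port 30000:

```bash
# Liveness check.
curl http://localhost:30000/health

# Send a small generation request to the native API.
curl http://localhost:30000/generate \
    -H "Content-Type: application/json" \
    -d '{"text": "Once upon a time,", "sampling_params": {"max_new_tokens": 16, "temperature": 0}}'
```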

Note: For AMD ROCm systems with Instinct/MI GPUs, it is recommended to use `docker/Dockerfile.rocm` to build images; an example build and usage follow:

```bash
docker build --build-arg SGL_BRANCH=v0.4.2.post2 -t v0.4.2.post2-rocm630 -f Dockerfile.rocm .

alias drun='docker run -it --rm --network=host --device=/dev/kfd --device=/dev/dri --ipc=host \
    --shm-size 16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    -v $HOME/dockerx:/dockerx -v /data:/data'

drun -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    v0.4.2.post2-rocm630 \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000

# Until the FlashInfer backend is available on ROCm, --attention-backend triton and --sampling-backend pytorch are set by default
drun v0.4.2.post2-rocm630 python3 -m sglang.bench_one_batch --batch-size 32 --input 1024 --output 128 --model amd/Meta-Llama-3.1-8B-Instruct-FP8-KV --tp 8 --quantization fp8
```
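
To confirm that the Instinct GPUs are visible inside the container, a hedged check (assuming `rocm-smi` is available in the image, as in standard ROCm base images):

```bash
# List the GPUs visible to the ROCm container.
drun v0.4.2.post2-rocm630 rocm-smi
```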

## Method 4: Using docker compose

<details>
<summary>More</summary>

> This method is recommended if you plan to serve SGLang as a service.
> A better approach is to use the [k8s-sglang-service.yaml](https://github.com/sgl-project/sglang/blob/main/docker/k8s-sglang-service.yaml).

1. Copy the [compose.yaml](https://github.com/sgl-project/sglang/blob/main/docker/compose.yaml) to your local machine.
2. Execute the command `docker compose up -d` in your terminal.
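
Afterwards, you can confirm the service came up (a sketch, assuming the compose file names the service `sglang`):

```bash
# Check container status and follow the server logs.
docker compose ps
docker compose logs -f sglang
```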
</details>

## Method 5: Run on Kubernetes or Clouds with SkyPilot

<details>
<summary>More</summary>

To deploy on Kubernetes or 12+ clouds, you can use [SkyPilot](https://github.com/skypilot-org/skypilot).

1. Install SkyPilot and set up Kubernetes cluster or cloud access: see [SkyPilot's documentation](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html).
2. Deploy on your own infra with a single command and get the HTTP API endpoint:
<details>
<summary>SkyPilot YAML: <code>sglang.yaml</code></summary>

```yaml
# sglang.yaml
envs:
  HF_TOKEN: null

resources:
  image_id: docker:lmsysorg/sglang:latest
  accelerators: A100
  ports: 30000

run: |
  conda deactivate
  python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000
```
</details>

```bash
# Deploy on any cloud or Kubernetes cluster. Use --cloud <cloud> to select a specific cloud provider.
HF_TOKEN=<secret> sky launch -c sglang --env HF_TOKEN sglang.yaml

# Get the HTTP API endpoint
sky status --endpoint 30000 sglang
```
3. To further scale up your deployment with autoscaling and failure recovery, check out the [SkyServe + SGLang guide](https://github.com/skypilot-org/skypilot/tree/master/llm/sglang#serving-llama-2-with-sglang-for-more-traffic-using-skyserve).
</details>

## Common Notes
- [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding `--attention-backend triton --sampling-backend pytorch` and open an issue on GitHub.
- If you only need to use OpenAI models with the frontend language, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.
- The language frontend operates independently of the backend runtime. You can install the frontend locally without needing a GPU, while the backend can be set up on a GPU-enabled machine. To install the frontend, run `pip install sglang`, and for the backend, use `pip install "sglang[srt]"`. This allows you to build SGLang programs locally and execute them by connecting to the remote backend, as in the sketch below.
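
For example, a minimal sketch of this split setup (the host name `gpu-host` is hypothetical; the snippet assumes the frontend's `RuntimeEndpoint` backend pointed at the server's native API):

```bash
# On the GPU machine: install the backend and launch the server.
pip install "sglang[srt]"
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000

# On your local machine (no GPU needed): install the frontend and connect remotely.
pip install sglang
python3 - <<'PY'
import sglang as sgl

# Point the frontend at the remote backend (hypothetical host name).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://gpu-host:30000"))

@sgl.function
def qa(s, question):
    s += question
    s += sgl.gen("answer", max_tokens=32)

state = qa.run(question="What is SGLang? ")
print(state["answer"])
PY
```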