## Install SGLang

You can install SGLang using any of the methods below.

### Method 1: With pip

```bash
pip install --upgrade pip
pip install "sglang[all]"

# Install FlashInfer CUDA kernels
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
```
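
As a quick sanity check after installing (a minimal sketch; it assumes the package exposes `sglang.__version__`, which this guide does not state):

```bash
# Confirm the package imports cleanly and print its version
python3 -c "import sglang; print(sglang.__version__)"
```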

### Method 2: From source

```bash
# Use the latest release branch
git clone -b v0.3.0 https://github.com/sgl-project/sglang.git
cd sglang

pip install --upgrade pip
pip install -e "python[all]"

# Install FlashInfer CUDA kernels
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
```
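
With the editable install in place, you can launch a server directly from the checkout. This reuses the launch command shown in the Docker example below; the model path is illustrative:

```bash
# Start an SGLang server from the source checkout
python3 -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000
```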

### Method 3: Using docker
The docker images are available on Docker Hub as [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile](https://github.com/sgl-project/sglang/tree/main/docker).
Replace `<secret>` below with your Hugging Face Hub [token](https://huggingface.co/docs/hub/en/security-tokens).

```bash
docker run --gpus all \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```
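
Once the container logs show the server is ready, you can smoke-test it from the host. This assumes the OpenAI-compatible `/v1/models` route is reachable on the mapped port:

```bash
# A JSON listing of served models means the server is up
curl http://localhost:30000/v1/models
```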

### Method 4: Using docker compose

<details>
<summary>More</summary>

> This method is recommended if you plan to run SGLang as a service.
> For Kubernetes deployments, a better approach is to use [k8s-sglang-service.yaml](./docker/k8s-sglang-service.yaml).

1. Copy [compose.yaml](./docker/compose.yaml) to your local machine.
2. Execute the command `docker compose up -d` in your terminal.
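
Once it is up, standard Docker Compose commands can be used to inspect the service (the service name comes from whatever `compose.yaml` defines):

```bash
# Check that the SGLang container is running, then follow its logs
docker compose ps
docker compose logs -f
```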
</details>

### Method 5: Run on Kubernetes or Clouds with SkyPilot

<details>
<summary>More</summary>

To deploy on Kubernetes or 12+ clouds, you can use [SkyPilot](https://github.com/skypilot-org/skypilot).

1. Install SkyPilot and set up Kubernetes cluster or cloud access: see [SkyPilot's documentation](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html).
2. Deploy on your own infra with a single command and get the HTTP API endpoint:
<details>
<summary>SkyPilot YAML: <code>sglang.yaml</code></summary>

```yaml
# sglang.yaml
envs:
  HF_TOKEN: null

resources:
  image_id: docker:lmsysorg/sglang:latest
  accelerators: A100
  ports: 30000

run: |
  conda deactivate
  python3 -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000
```
</details>

```bash
# Deploy on any cloud or Kubernetes cluster. Use --cloud <cloud> to select a specific cloud provider.
HF_TOKEN=<secret> sky launch -c sglang --env HF_TOKEN sglang.yaml

# Get the HTTP API endpoint
sky status --endpoint 30000 sglang
```
3. To further scale up your deployment with autoscaling and failure recovery, check out the [SkyServe + SGLang guide](https://github.com/skypilot-org/skypilot/tree/master/llm/sglang#serving-llama-2-with-sglang-for-more-traffic-using-skyserve).
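
When you are done, tear down the cluster launched above to release its resources:

```bash
# Terminate the cluster named in `sky launch -c sglang`
sky down sglang
```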
</details>

### Common Notes
- [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding `--attention-backend triton --sampling-backend pytorch` (see the example after this list) and open an issue on GitHub.
- If you only need to use the OpenAI backend, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.
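
If you hit the FlashInfer issue described in the first note, the fallback flags attach to the normal launch command (a minimal sketch; the model path is illustrative):

```bash
# Run with Triton attention kernels and PyTorch sampling instead of FlashInfer
python3 -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --attention-backend triton \
    --sampling-backend pytorch
```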