Unverified commit b64b88e7, authored by Michael Yao and committed by GitHub

[Docs] Update start/install.md (#5398)

parent bc24205b
@@ -19,13 +19,15 @@ uv pip install "sglang[all]>=0.4.5" --find-links https://flashinfer.ai/whl/cu124
- SGLang currently uses torch 2.5, so you need to install flashinfer for torch 2.5. If you want to install flashinfer separately, please refer to the [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html). Please note that the FlashInfer PyPI package is called `flashinfer-python`, not `flashinfer`.
- If you encounter `OSError: CUDA_HOME environment variable is not set`, set it to your CUDA install root with either of the following solutions (solution 1 is sketched after this list):
  1. Use `export CUDA_HOME=/usr/local/cuda-<your-cuda-version>` to set the `CUDA_HOME` environment variable.
  2. Install FlashInfer first following the [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html), then install SGLang as described above.
- If you encounter `ImportError: cannot import name 'is_valid_list_of_images' from 'transformers.models.llama.image_processing_llama'`, use the `transformers` version pinned in [pyproject.toml](https://github.com/sgl-project/sglang/blob/main/python/pyproject.toml). Currently, that means running `pip install transformers==4.48.3`.
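
A minimal sketch of solution 1, assuming a CUDA 12.4 toolkit under `/usr/local` (the version suffix is an example; match whatever `ls /usr/local` shows on your machine):

```bash
# Point CUDA_HOME at the toolkit root (12.4 is an example version)
export CUDA_HOME=/usr/local/cuda-12.4
# Verify: nvcc should print the toolkit version
"$CUDA_HOME/bin/nvcc" --version
```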
## Method 2: From source
```bash
# Use the last release branch
git clone -b v0.4.5 https://github.com/sgl-project/sglang.git
cd sglang
```
@@ -40,7 +42,7 @@ If you want to develop SGLang, it is recommended to use docker. Please refer to
Note: For AMD ROCm systems with Instinct/MI GPUs, do the following instead:
```bash
# Use the last release branch
git clone -b v0.4.5 https://github.com/sgl-project/sglang.git
cd sglang
@@ -53,6 +55,7 @@ pip install -e "python[all_hip]"
```
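
After either source install, a quick sanity check (a hedged sketch; it assumes the package exposes `__version__` and only confirms the import works):

```bash
# Confirm the editable install is importable and report its version
python3 -c "import sglang; print(sglang.__version__)"
```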
## Method 3: Using docker
The Docker images are available on Docker Hub as [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from the [Dockerfile](https://github.com/sgl-project/sglang/tree/main/docker).
Replace `<secret>` below with your Hugging Face Hub [token](https://huggingface.co/docs/hub/en/security-tokens).
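
A hedged sketch of a typical single-GPU launch (the image tag and model path are illustrative, not prescribed by this page):

```bash
# Pull the image and launch an SGLang server on port 30000
docker run --gpus all --shm-size 32g -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 --port 30000
```

Mounting the Hugging Face cache avoids re-downloading model weights across container restarts.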
@@ -104,13 +107,14 @@ drun v0.4.5-rocm630 python3 -m sglang.bench_one_batch --batch-size 32 --input 10
<summary>More</summary>
1. Option 1: For single-node serving (typically when the model size fits into the GPUs on one node)
   Execute `kubectl apply -f docker/k8s-sglang-service.yaml` to create the k8s deployment and service, using llama-31-8b as the example model.
2. Option 2: For multi-node serving (usually when a large model requires more than one GPU node, such as `DeepSeek-R1`)
   Modify the LLM model path and arguments as necessary, then execute `kubectl apply -f docker/k8s-sglang-distributed-sts.yaml` to create a two-node k8s StatefulSet and serving service. A quick status check is sketched after this block.
</details>
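
A minimal status check after either `kubectl apply` (the label selector and service name are assumptions; match them to the YAML you applied):

```bash
# Confirm the pods are running (label is an assumption; check your manifest)
kubectl get pods -l app=sglang
# Forward the service port locally and hit the health endpoint
kubectl port-forward svc/sglang 30000:30000 &
curl http://localhost:30000/health
```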
## Method 6: Run on Kubernetes or Clouds with SkyPilot
@@ -141,6 +145,7 @@ run: |
  --host 0.0.0.0 \
  --port 30000
```
</details>

```bash
@@ -150,10 +155,12 @@ HF_TOKEN=<secret> sky launch -c sglang --env HF_TOKEN sglang.yaml
# Get the HTTP API endpoint (a smoke test against it is sketched below)
sky status --endpoint 30000 sglang
```
3. To further scale up your deployment with autoscaling and failure recovery, check out the [SkyServe + SGLang guide](https://github.com/skypilot-org/skypilot/tree/master/llm/sglang#serving-llama-2-with-sglang-for-more-traffic-using-skyserve).
</details>
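
A hedged smoke test against the launched endpoint, assuming the server exposes the OpenAI-compatible `/v1/models` route on port 30000:

```bash
# Resolve the endpoint reported by SkyPilot and list the served models
ENDPOINT=$(sky status --endpoint 30000 sglang)
curl "http://$ENDPOINT/v1/models"
```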
## Common Notes
- [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), switch to other kernels by adding `--attention-backend triton --sampling-backend pytorch` (sketched after this list) and open an issue on GitHub.
- If you only need to use OpenAI models with the frontend language, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.
- The language frontend operates independently of the backend runtime. You can install the frontend locally without a GPU, while the backend runs on a GPU-enabled machine. To install the frontend, run `pip install sglang`; for the backend, use `pip install "sglang[srt]"`. `srt` is short for SGLang runtime.
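
A minimal sketch of the Triton fallback from the first note (the model path is an example; the two flags come from the note above):

```bash
# Launch with the Triton attention kernel and PyTorch sampling backend
python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --attention-backend triton --sampling-backend pytorch
```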