Unverified commit 2449a0af authored by Lianmin Zheng, committed by GitHub

Refactor the docs (#9031)

parent 0f229c07
# Install SGLang

You can install SGLang using one of the methods below.

This page primarily applies to common NVIDIA GPU platforms.
For other or newer platforms, please refer to the dedicated pages for [NVIDIA Blackwell GPUs](../platforms/blackwell_gpu.md), [AMD GPUs](../platforms/amd_gpu.md), [Intel Xeon CPUs](../platforms/cpu_server.md), [NVIDIA Jetson](../platforms/nvidia_jetson.md), and [Ascend NPUs](../platforms/ascend_npu.md).

## Method 1: With pip or uv

It is recommended to use uv for faster installation:
```bash
pip install --upgrade pip
pip install uv
uv pip install "sglang[all]>=0.5.0rc0"
```
**Quick fixes to common problems**

- If you encounter `OSError: CUDA_HOME environment variable is not set`, set it to your CUDA install root with either of the following solutions (see the sketch after this list):
  1. Use `export CUDA_HOME=/usr/local/cuda-<your-cuda-version>` to set the `CUDA_HOME` environment variable.
  2. Install FlashInfer first following the [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html), then install SGLang as described above.
- SGLang currently uses torch 2.8 and flashinfer for torch 2.8. If you want to install flashinfer separately, please refer to the [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html). Please note that the FlashInfer PyPI package is called `flashinfer-python` instead of `flashinfer`.
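As a minimal sketch of the two fixes above (the CUDA version shown is only an example; adjust it to your installation):

```bash
# Option 1: point CUDA_HOME at your CUDA toolkit, then install SGLang
export CUDA_HOME=/usr/local/cuda-12.8   # example path, use your actual version
uv pip install "sglang[all]>=0.5.0rc0"

# Option 2: install FlashInfer explicitly first (the PyPI package is flashinfer-python)
pip install flashinfer-python
uv pip install "sglang[all]>=0.5.0rc0"
```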
## Method 2: From source

```bash
# Use the last release branch
git clone -b v0.5.0rc0 https://github.com/sgl-project/sglang.git
cd sglang

pip install --upgrade pip
pip install -e "python[all]"
```

**Quick fixes to common problems**

- If you want to develop SGLang, it is recommended to use docker. Please refer to [setup docker container](../developer_guide/development_guide_using_docker.md#setup-docker-container). The docker image is `lmsysorg/sglang:dev`.
- SGLang currently uses torch 2.8 and flashinfer for torch 2.8. If you want to install flashinfer separately, please refer to the [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html). Please note that the FlashInfer PyPI package is called `flashinfer-python` instead of `flashinfer`.
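After either method, a quick way to confirm the installation is to import the package and launch a small server (the model path here is only an example; any supported model works):

```bash
python3 -c "import sglang; print(sglang.__version__)"
python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 30000
```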
## Method 3: Using docker

The docker images are available on Docker Hub at [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile](https://github.com/sgl-project/sglang/tree/main/docker).
Replace `<secret>` below with your huggingface hub [token](https://huggingface.co/docs/hub/en/security-tokens).

```bash
docker run --gpus all \
    ...
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```
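Once the container is running, you can sanity-check the endpoint with a request to the native generate API (the prompt and sampling parameters below are arbitrary examples):

```bash
curl http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "The capital of France is", "sampling_params": {"temperature": 0, "max_new_tokens": 16}}'
```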
## Method 4: Using Kubernetes

Please check out [OME](https://github.com/sgl-project/ome), a Kubernetes operator for enterprise-grade management and serving of large language models (LLMs).

<details>
<summary>More</summary>

1. Option 1: For single-node serving (typically when the model fits into the GPUs of one node), run `kubectl apply -f docker/k8s-sglang-service.yaml` to create a Kubernetes deployment and service, with llama-31-8b as an example.

2. Option 2: For multi-node serving (usually when a large model requires more than one GPU node, such as `DeepSeek-R1`), modify the LLM model path and arguments as necessary, then run `kubectl apply -f docker/k8s-sglang-distributed-sts.yaml` to create a two-node Kubernetes StatefulSet and serving service.

A basic verification sketch follows this section.

</details>
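After applying either manifest, standard kubectl commands can be used to confirm the pods and service are up (the exact resource names depend on the manifest you applied):

```bash
kubectl get pods   # wait until the sglang pods are Running
kubectl get svc    # find the service exposing the server port (30000 by default)
```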
## Method 5: Using docker compose

<details>
<summary>More</summary>

...

2. Execute the command `docker compose up -d` in your terminal.

</details>
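Once the stack is up, the usual compose commands can be used to check on it:

```bash
docker compose ps        # confirm the sglang service is running
docker compose logs -f   # follow the server logs
```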
## Method 6: Run on Kubernetes or Clouds with SkyPilot

<details>

...
## Common Notes

- [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding `--attention-backend triton --sampling-backend pytorch` (see the example after this list) and open an issue on GitHub.
- To reinstall flashinfer locally, use the following command: `pip3 install --upgrade flashinfer-python --force-reinstall --no-deps` and then delete the cache with `rm -rf ~/.cache/flashinfer`.
- If you only need to use OpenAI API models with the frontend language, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.
- The language frontend operates independently of the backend runtime. You can install the frontend locally without needing a GPU, while the backend can be set up on a GPU-enabled machine. To install the frontend, run `pip install sglang`, and for the backend, use `pip install "sglang[srt]"`. `srt` is the abbreviation of SGLang runtime.
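For example, a server launched with the fallback kernels mentioned above might look like this (the model path is illustrative):

```bash
python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --attention-backend triton \
    --sampling-backend pytorch \
    --host 0.0.0.0 --port 30000
```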
...

.. toctree::
   :maxdepth: 1
   :caption: Get Started

   get_started/install.md

.. toctree::
   :maxdepth: 1
   :caption: Basic Usage

   basic_usage/send_request.ipynb
   basic_usage/openai_api.rst
   basic_usage/offline_engine_api.ipynb
   basic_usage/native_api.ipynb
   basic_usage/sampling_params.md
   basic_usage/deepseek.md
   basic_usage/gpt_oss.md
   basic_usage/llama4.md

.. toctree::
   :maxdepth: 1
   :caption: Advanced Features

   advanced_features/server_arguments.md
   advanced_features/hyperparameter_tuning.md
   advanced_features/speculative_decoding.ipynb
   advanced_features/structured_outputs.ipynb
   advanced_features/structured_outputs_for_reasoning_models.ipynb
   advanced_features/function_calling.ipynb
   advanced_features/separate_reasoning.ipynb
   advanced_features/quantization.md
   advanced_features/lora.ipynb
   advanced_features/pd_disaggregation.md
   advanced_features/vlm_query.ipynb
   advanced_features/router.md
   advanced_features/observability.md
   advanced_features/attention_backend.md

.. toctree::
   :maxdepth: 1
   :caption: Supported Models

   ...
   supported_models/multimodal_language_models.md
   supported_models/embedding_models.md
   supported_models/reward_models.md
   supported_models/rerank_models.md
   supported_models/support_new_models.md
   supported_models/transformers_fallback.md
   supported_models/modelscope.md

.. toctree::
   :maxdepth: 1
   :caption: Hardware Platforms

   platforms/amd_gpu.md
   platforms/blackwell_gpu.md
   platforms/cpu_server.md
   platforms/tpu.md
   platforms/nvidia_jetson.md
   platforms/ascend_npu.md

.. toctree::
   :maxdepth: 1
   :caption: Developer Guide

   developer_guide/contribution_guide.md
   developer_guide/development_guide_using_docker.md
   developer_guide/benchmark_and_profiling.md

.. toctree::
   :maxdepth: 1
   :caption: References

   references/faq.md
   references/environment_variables.md
   references/production_metrics.md
   references/custom_chat_template.md
   references/frontend/frontend_index.rst
   references/multi_node_deployment/multi_node_index.rst
   references/learn_more.md
# AMD GPUs

This document describes how to run SGLang on AMD GPUs. If you encounter issues or have questions, please [open an issue](https://github.com/sgl-project/sglang/issues).

## System Configuration

When using AMD GPUs (such as MI300X), certain system-level optimizations help ensure stable performance. Here we take MI300X as an example. AMD provides official documentation for MI300X optimization and system tuning:

- [AMD MI300X Tuning Guides](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html)
- [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference/vllm-benchmark.html)
- [AMD Instinct MI300X System Optimization](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html)
- [AMD Instinct MI300X Workload Optimization](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference-optimization/workload.html)
- [Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1-Part2/README.html)

**NOTE:** We strongly recommend reading these docs and guides entirely to fully utilize your system.
...

Again, please go through the entire documentation to confirm your system is using the recommended configuration.
## Install SGLang

You can install SGLang using one of the methods below.

### Install from Source

```bash
# Use the last release branch
git clone -b v0.5.0rc0 https://github.com/sgl-project/sglang.git
cd sglang

# Compile sgl-kernel
pip install --upgrade pip
cd sgl-kernel
python setup_rocm.py install

# Install sglang python package
cd ..
pip install -e "python[all_hip]"
```
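As a quick, optional sanity check after the build (assuming a ROCm-visible GPU; the commands only verify that the packages import and the devices are detected):

```bash
python3 -c "import sgl_kernel, sglang; print(sglang.__version__)"
rocm-smi   # confirm the GPUs are visible to ROCm
```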
### Install Using Docker (Recommended)

The docker images are available on Docker Hub at [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile.rocm](https://github.com/sgl-project/sglang/tree/main/docker).
The steps below show how to build and use an image.

1. Build the docker image.

   If you use pre-built images, you can skip this step and replace `sglang_image` with the pre-built image names in the steps below.

   ```bash
   docker build -t sglang_image -f Dockerfile.rocm .
   ```

2. ...

   ```bash
   ...
      -v /data:/data'
   ```

   If you are using RDMA, please note that:

   1. `--network host` and `--privileged` are required by RDMA. If you don't need RDMA, you can remove them.
   2. You may need to set `NCCL_IB_GID_INDEX` if you are using RoCE, for example: `export NCCL_IB_GID_INDEX=3`.

3. Launch the server.

   **NOTE:** Replace `<secret>` below with your [huggingface hub token](https://huggingface.co/docs/hub/en/security-tokens).
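For reference, a launch in this style might look like the following, assuming the container alias from step 2 is named `drun` (the model path and port are illustrative; adjust volumes and devices to your setup):

```bash
drun -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    sglang_image \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```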
......
# Ascend NPUs
## Install
TODO
## Examples
TODO
# Blackwell GPUs
We will release pre-built wheels soon. Until then, please compile from source or use the Blackwell docker images from [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags).
## B200 with x86 CPUs
TODO
## GB200/GB300 with ARM CPUs
TODO
# CPU Servers

This document describes how to set up the [SGLang](https://github.com/sgl-project/sglang) environment and run LLM inference on CPU servers.
Specifically, SGLang is well optimized for CPUs equipped with Intel® AMX® instructions,
......
# NVIDIA Jetson Orin

## Prerequisites
......
# TPU
Support for TPU is under active development. Please stay tuned.
# Measuring Model Accuracy in SGLang
This guide shows how to evaluate model accuracy using SGLang's [built-in benchmarks](https://github.com/sgl-project/sglang/tree/b045841baeff37a5601fcde23fa98bd09d942c36/benchmark). If your PR modifies the model side (e.g., kernels or model architecture), please include accuracy results on the crucial benchmarks.
## Benchmarking Model Accuracy
This is a reference workflow for the [MMLU benchmark](https://github.com/sgl-project/sglang/tree/main/benchmark/mmlu). For more details or other benchmarks, please refer to the README in each specific benchmark folder under [sglang/benchmark](https://github.com/sgl-project/sglang/tree/b045841baeff37a5601fcde23fa98bd09d942c36/benchmark).
```bash
# Step 1: Download the dataset
bash download_data.sh

# Step 2: Launch the server
# (choose the model, the port, and the static memory fraction as needed)
python3 -m sglang.launch_server \
    --model-path Qwen/Qwen2.5-Math-1.5B-Instruct \
    --port 30000 \
    --mem-fraction-static 0.8

# Step 3: Run the benchmark script
python3 bench_sglang.py --nsub 10  # test 10 subjects

# Step 4: Extract the accuracy
cat result.jsonl | grep -oP '"accuracy": \K\d+\.\d+'
```
## Customizing Benchmark Scripts
Some benchmark implementations may differ from ours, causing accuracy discrepancies. For example, to match [Qwen2.5-Math](https://github.com/QwenLM/Qwen2.5-Math)'s reported 76.8% GSM8K accuracy, some customization is required.
```python
# The GSM8K benchmark script includes few shot examples for evaluation by default.
# Here we exclude them.
for i in range(len(lines[num_shots:num_questions])):
questions.append(get_one_example(lines, i, False))
labels.append(get_answer_value(lines[i]["answer"]))
```
```python
@sgl.function
def few_shot_gsm8k(s, question):
# System prompt given in https://github.com/QwenLM/Qwen2.5-Math
s += sgl.system("Please reason step by step, and put your final answer within \\boxed{}.") # Include system prompt
s += few_shot_examples + question
# Stopwords given in evaluation/math_eval.py of the Qwen2.5-Math repo
s += sgl.gen(
"answer", max_tokens=2048, stop=["Question", "Assistant:", "</s>", "<|im_end|>", "<|endoftext|>"]
)
```
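As a rough usage sketch, the modified function can be run against a local server through the frontend API (the endpoint URL and the question are placeholders, and `few_shot_examples` is assumed to be defined by the surrounding script as in the original):

```python
from sglang import RuntimeEndpoint
from sglang.lang.api import set_default_backend

# Point the frontend at a running SGLang server (URL is an example)
set_default_backend(RuntimeEndpoint("http://localhost:30000"))

state = few_shot_gsm8k.run(
    question="Janet has 3 apples and buys 2 more. How many apples does she have in total?"
)
print(state["answer"])
```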
These adjustments should reproduce the reported accuracy.
## Extending Evaluation Capabilities
1. **Contribute New Benchmarks**
* Follow our [contribution guidelines](../references/contribution_guide.md) to add new test scripts
2. **Request Implementations**
* Feel free to open an issue describing your evaluation needs
3. **Use Alternative Tools**
* [OpenCompass](https://opencompass.org.cn)
* [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness)
Multi-Node Deployment
==========================
.. toctree::
:maxdepth: 1
multi_node.md
deploy_on_k8s.md
disaggregation/lws_pd_deploy.md
Multi-Node Deployment
==========================
.. toctree::
:maxdepth: 1
deepseek.md
Developer Reference
==========================
.. toctree::
:maxdepth: 1
development_guide_using_docker.md
release_process.md
setup_github_runner.md
# Troubleshooting and Frequently Asked Questions

## Troubleshooting
This page lists common errors and tips for resolving them.
### CUDA Out of Memory
If you encounter out-of-memory (OOM) errors, you can adjust the following parameters (an example launch command follows this list):
- If OOM occurs during prefill, try reducing `--chunked-prefill-size` to `4096` or `2048`. This saves memory but slows down the prefill speed for long prompts.
- If OOM occurs during decoding, try lowering `--max-running-requests`.
- You can also reduce `--mem-fraction-static` to a smaller value, such as 0.8 or 0.7. This decreases the memory usage of the KV cache memory pool and helps prevent OOM errors during both prefill and decoding. However, it limits maximum concurrency and reduces peak throughput.
- Another common cause of OOM is requesting input logprobs for a long prompt, as this requires significant memory. To address this, set `logprob_start_len` in your sampling parameters to include only the necessary parts. If you do need input logprobs for a long prompt, reduce `--mem-fraction-static`.
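For instance, a launch command applying several of these memory-saving settings might look like this (the model path and exact values are illustrative, not recommendations):

```bash
python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --chunked-prefill-size 4096 \
    --max-running-requests 128 \
    --mem-fraction-static 0.8 \
    --port 30000
```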
### CUDA Error: Illegal Memory Access Encountered
This error may result from kernel errors or out-of-memory issues:
- If it is a kernel error, resolving it may be challenging. Please file an issue on GitHub.
- If it is an out-of-memory issue, it may sometimes be reported as this error instead of "Out of Memory." Refer to the section above for guidance on avoiding OOM issues.
## Frequently Asked Questions
### The results are not deterministic, even with a temperature of 0
You may notice that when you send the same request twice, the results from the engine will be slightly different, even when the temperature is set to 0.
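A simple way to observe this (assuming a server is already running locally on port 30000; the prompt is arbitrary):

```bash
# Send the identical request twice with temperature 0 and compare the outputs
for i in 1 2; do
  curl -s http://localhost:30000/generate \
    -H "Content-Type: application/json" \
    -d '{"text": "Explain continuous batching in one sentence.", "sampling_params": {"temperature": 0, "max_new_tokens": 32}}'
  echo
done
```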
......
Frontend Language
=================
.. toctree::
:maxdepth: 1
:caption: Frontend Language
frontend_tutorial.ipynb
choices_methods.md
...

"metadata": {},
"outputs": [],
"source": [
"from sglang import assistant_begin, assistant_end\n",
"from sglang import assistant, function, gen, system, user\n",
"from sglang import image\n",
"from sglang import RuntimeEndpoint\n",
"from sglang.lang.api import set_default_backend\n",
"from sglang.srt.utils import load_image\n",
"from sglang.test.doc_patch import launch_server_cmd\n",
"from sglang.utils import print_highlight, terminate_process, wait_for_server\n",
"\n",
"\n",
"server_process, port = launch_server_cmd(\n",
"    \"python -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0\"\n",
")\n",
......
General Guidance
================
.. toctree::
:maxdepth: 1
contribution_guide.md
troubleshooting.md
faq.md
learn_more.md
modelscope.md
environment_variables.md
production_metrics.md
Hardware Supports
=================
.. toctree::
:maxdepth: 1
amd.md
nvidia_jetson.md
cpu.md
# Learn more

You can find more blogs, slides, and videos about SGLang at [https://github.com/sgl-project/sgl-learning-materials](https://github.com/sgl-project/sgl-learning-materials).
The latest SGLang features and updates are shared through the [LMSYS blog](https://lmsys.org/blog/).
The 2025 H2 roadmap can be found at this [issue](https://github.com/sgl-project/sglang/issues/7736).