Unverified commit d738ab52, authored by applesaucethebun and committed by GitHub

fix some typos (#6209)


Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
parent 3ee40ff9
......@@ -356,7 +356,7 @@ client.sh
# Start profiling via API
curl http://localhost:30000/start_profile -H "Content-Type: application/json"
# Benchmark serving using sglang with random dataset and tokenizer
# Benchmark serving using SGLang with a random dataset and tokenizer
# Define the log file with a timestamp
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
LOGFILE="sglang_client_log_$TIMESTAMP.json"
......
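For context, a hedged sketch of how such a client benchmark might be invoked and logged is shown below; the exact command in `client.sh` is not part of this diff, so the `bench_serving` flags here are illustrative assumptions.

```bash
# Illustrative sketch only: the real flags used by client.sh are not shown in this diff.
python3 -m sglang.bench_serving --backend sglang --port 30000 \
    --dataset-name random --num-prompts 512 2>&1 | tee "$LOGFILE"

# Stop profiling once the run finishes (assuming the matching /stop_profile endpoint).
curl http://localhost:30000/stop_profile -H "Content-Type: application/json"
```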
......@@ -93,21 +93,21 @@ TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDU
#Inference with large improvement on AMD GPU
TORCHINDUCTOR_FREEZING=1 your_script.sh
```
## 4. Fused MOE kernel
To maximize moe kernel efficiency, need to use below scripts to find out the best launch configuration
## 4. Fused MoE kernel
To maximize MoE kernel efficiency, use the scripts below to find the best launch configuration.
### Key parameters:
- **--model**: what moe model type to do tuning, it will automatically decide the size of d_model, model_intermediate_size, num_layers
- **--model**: which MoE model type to tune; it automatically decides the sizes of d_model, model_intermediate_size, and num_layers
- **--tp-size**: simulates the whole-model run configuration so that the dimension sizes under tensor parallelism (TP) are set correctly
- **--batch**: M dimension size of moe kernel, for prefill moe kernel the value is batch*input_len, for decode moe kernel the value is batch
- **--batch**: the M dimension size of the MoE kernel; for the prefill MoE kernel the value is batch*input_len, for the decode MoE kernel the value is batch
- **--dtype**: computation type
```bash
#Tuning
#for example, we have one case like this "python3 -m sglang.bench_latency --model dummy_grok1/ --load-format dummy --tokenizer-path Xenova/grok-1-tokenizer --tp 8 --batch-size 32 --input 1024 --output 8 --attention-backend triton --sampling-backend pytorch --quantization fp8" to run, it defined batch-size 32 input length 1024 and output length 8, from "--batch" in moe view point, the prefill batch is 32*1024 = 32768, the decode batch is 32*1(only one output token generated in each run).
#so we can tune decode moe use below command
#For example, suppose we run "python3 -m sglang.bench_latency --model dummy_grok1/ --load-format dummy --tokenizer-path Xenova/grok-1-tokenizer --tp 8 --batch-size 32 --input 1024 --output 8 --attention-backend triton --sampling-backend pytorch --quantization fp8". This defines batch-size 32, input length 1024, and output length 8. From the "--batch" (MoE) point of view, the prefill batch is 32*1024 = 32768 and the decode batch is 32*1 (only one output token is generated in each run).
#So we can tune the decode MoE kernel with the command below
python benchmark_moe_rocm.py --model grok1 --tp-size 8 --dtype float8 --batch "32"
# and use this command to tune prefill moe
# and use this command to tune the prefill MoE kernel
python benchmark_moe_rocm.py --model grok1 --tp-size 8 --dtype float8 --batch "32768"
```
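For clarity, the `--batch` arithmetic described above can be restated as a tiny sketch; the numbers simply mirror the `bench_latency` example quoted in the comments.

```bash
# How the --batch (M dimension) values above are derived for this example.
BATCH_SIZE=32
INPUT_LEN=1024
PREFILL_M=$((BATCH_SIZE * INPUT_LEN))   # 32768: every prompt token hits the MoE kernel during prefill
DECODE_M=$BATCH_SIZE                    # 32: one new token per sequence per decode step
echo "prefill --batch = $PREFILL_M, decode --batch = $DECODE_M"
```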
......
......@@ -44,7 +44,7 @@ SGLang is a fast serving framework for large language models and vision language
It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.
The core features include:
- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, continuous batching, token attention (paged attention), speculative decoding, tensor parallelism, chunked prefill, structured outputs, quantization (FP8/INT4/AWQ/GPTQ), and multi-lora batching.
- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, continuous batching, token attention (PagedAttention), speculative decoding, tensor parallelism, chunked prefill, structured outputs, quantization (FP8/INT4/AWQ/GPTQ), and multi-LoRA batching.
- **Flexible Frontend Language**: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
- **Extensive Model Support**: Supports a wide range of generative models (Llama, Gemma, Mistral, Qwen, DeepSeek, LLaVA, etc.), embedding models (e5-mistral, gte, mcdse) and reward models (Skywork), with easy extensibility for integrating new models.
- **Active Community**: SGLang is open-source and backed by an active community with industry adoption.
......
## How to reproduce the benchmark results for SGLang v0.3.0 compared to vLLM v0.6.0
In short, with multi step enabled, in online scenarios that we benchmarked, the Median TTFT of vLLM is **3 times** that of SGLang, and the Median ITL is **10 times** that of SGLang. Lower Median TTFT and ITL are better. vLLM's multi-step optimization did not improve throughput while ensuring lower Median TTFT and ITL. Also, under maximum throughput benchmark, if vLLM does not set gpu util to 0.95 separately and uses the default configuration instead, its maximum throughput is **lower** than that of SGLang.
In short, with multi-step enabled, in the online scenarios that we benchmarked, the Median TTFT of vLLM is **3 times** that of SGLang, and the Median ITL is **10 times** that of SGLang. Lower Median TTFT and ITL are better. vLLM's multi-step optimization did not improve throughput while keeping Median TTFT and ITL low. Also, under the maximum-throughput benchmark, if vLLM does not set GPU utilization to 0.95 separately and instead uses the default configuration, its maximum throughput is **lower** than that of SGLang.
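For reference, the non-default GPU utilization mentioned above corresponds to vLLM's `--gpu-memory-utilization` flag; a hedged sketch of how it would be raised (the model name and port are placeholders, not the benchmarked configuration):

```bash
# Hypothetical example only: raising vLLM's GPU memory utilization from the default 0.9 to 0.95.
python3 -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --disable-log-requests --port 8000 \
    --gpu-memory-utilization 0.95
```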
## Online benchmark results
......@@ -41,12 +41,12 @@ In short, with multi step enabled, in online scenarios that we benchmarked, the
## Installation
```bash
# install sglang v0.3.0
# install SGLang v0.3.0
pip install --upgrade pip
pip install "sglang[all]"==0.3.0
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
# install vllm v0.6.0
# install vLLM v0.6.0
pip install vllm==0.6.0
```
......
......@@ -45,10 +45,10 @@ Add [performance optimization options](#performance-optimization-options) as nee
### Performance Optimization Options
[MLA optimizations](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#deepseek-multi-head-latent-attention-mla-throughput-optimizations) are enabled by default. Here are some optional optimizations can be enabled as needed.
[MLA optimizations](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#deepseek-multi-head-latent-attention-mla-throughput-optimizations) are enabled by default. Here are some optional optimizations that can be enabled as needed.
- [Data Parallelism Attention](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models): For high QPS scenarios, add the `--enable-dp-attention` argument to boost throughput.
- [Torch.compile Optimization](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#torchcompile-latency-optimizations): Add `--enable-torch-compile` argument to enable it. This will take some time while server starts. The maximum batch size for torch.compile optimization can be controlled with `--torch-compile-max-bs`. It's recommended to set it between `1` and `8`. (e.g., `--torch-compile-max-bs 8`)
- [Torch.compile Optimization](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#torchcompile-latency-optimizations): Add `--enable-torch-compile` argument to enable it. This will take some time while the server starts. The maximum batch size for torch.compile optimization can be controlled with `--torch-compile-max-bs`. It's recommended to set it between `1` and `8`. (e.g., `--torch-compile-max-bs 8`)
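A combined launch sketching both optional optimizations is given below; the model path and TP size are placeholders chosen for illustration, not a recommendation.

```bash
# Hypothetical launch enabling both optional optimizations described above.
python3 -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code \
    --enable-dp-attention \
    --enable-torch-compile --torch-compile-max-bs 8
```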
### Example: Sending requests with OpenAI API
......@@ -90,7 +90,7 @@ If you have two H100 nodes, the usage is similar to the aforementioned H20.
> **Note that the launch command here does not enable Data Parallelism Attention or `torch.compile` Optimization**. For optimal performance, please refer to the command options in [Performance Optimization Options](#option_args).
### Example: Serving with two H200\*8 nodes and docker
### Example: Serving with two H200\*8 nodes and Docker
There are two H200 nodes, each with 8 GPUs. The first node's IP is `192.168.114.10`, and the second node's IP is `192.168.114.11`. Configure the endpoint to expose it to another Docker container using `--host 0.0.0.0` and `--port 40000`, and set up communications with `--dist-init-addr 192.168.114.10:20000`.
A single node with 8 H200 devices can run DeepSeek-V3; the dual-node setup here is just to demonstrate multi-node usage.
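Inside each container, the launch implied by the paragraph above would look roughly like the following on node 0 (node 1 would be identical except for `--node-rank 1`); the model path is a placeholder and the surrounding `docker run` invocation is omitted here.

```bash
# Hedged sketch for the first node (192.168.114.10); not the exact command from the docs.
python3 -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V3 --tp 16 --trust-remote-code \
    --dist-init-addr 192.168.114.10:20000 --nnodes 2 --node-rank 0 \
    --host 0.0.0.0 --port 40000
```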
......@@ -147,7 +147,7 @@ docker run --gpus all \
To serve DeepSeek-V3 with A100 GPUs, we first need to convert the [FP8 model checkpoints](https://huggingface.co/deepseek-ai/DeepSeek-V3) to BF16 with the [conversion script](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py).
Since the BF16 model is over 1.3 TB, we need to prepare four A100 nodes, each with 8 80GB GPUs. Assume the first node's IP is `10.0.0.1`, and the converted model path is `/path/to/DeepSeek-V3-BF16`, we can have following commands to launch the server.
Since the BF16 model is over 1.3 TB, we need to prepare four A100 nodes, each with 8 80GB GPUs. Assuming the first node's IP is `10.0.0.1`, and the converted model path is `/path/to/DeepSeek-V3-BF16`, we can run the following commands to launch the server.
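As a hedged illustration of the conversion step, the argument names below are assumptions based on the linked `fp8_cast_bf16.py` script and may differ in your checkout; verify them locally before running.

```bash
# Assumed invocation of DeepSeek's conversion script; check the script's argparse options first.
python3 fp8_cast_bf16.py \
    --input-fp8-hf-path /path/to/DeepSeek-V3 \
    --output-bf16-hf-path /path/to/DeepSeek-V3-BF16
```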
```bash
# node 1
......@@ -178,7 +178,7 @@ python3 -m sglang.bench_one_batch_server --model None --base-url http://10.0.0.1
### Example: Serving with 8 A100/A800 with AWQ Quantization
Add `--quantization moe_wna16` flag to enable moe wna16 kernel for better performance.
Add the `--quantization moe_wna16` flag to enable the MoE wna16 kernel for better performance.
One example is as follows:
```bash
......@@ -188,12 +188,12 @@ python3 -m sglang.launch_server --model cognitivecomputations/DeepSeek-R1-AWQ --
### Example: Serving with 16 A100/A800 with int8 Quantization
There are block-wise and per-channel quantization methods, and the quantization parameters have already been uploaded to Huggingface. One example is as follows:
There are block-wise and per-channel quantization methods, and the quantization parameters have already been uploaded to HuggingFace. One example is as follows:
- [meituan/DeepSeek-R1-Block-INT8](https://huggingface.co/meituan/DeepSeek-R1-Block-INT8)
- [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8)
Assuming that master node IP is `MASTER_IP`, checkpoint path is `/path/to/DeepSeek-R1-INT8` and port=5000, we can have following commands to launch the server:
Assuming that the master node IP is `MASTER_IP`, checkpoint path is `/path/to/DeepSeek-R1-INT8` and port=5000, we can run the following commands to launch the server:
```bash
#master
python3 -m sglang.launch_server \
......@@ -225,7 +225,7 @@ Running with per-channel quantization model:
- [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8)
Assuming that master node IP is `MASTER_IP`, checkpoint path is `/path/to/DeepSeek-R1-Channel-INT8` and port=5000, we can have following commands to launch the server:
Assuming that the master node IP is `MASTER_IP`, checkpoint path is `/path/to/DeepSeek-R1-Channel-INT8` and port=5000, we can run the following commands to launch the server:
```bash
#master
......
## Run benchmark
## Run Benchmark
### Benchmark sglang
### Benchmark SGLang
```
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000
```
......@@ -10,7 +10,7 @@ python3 bench_sglang.py --num-questions 200
```
### Benchmark vllm
### Benchmark vLLM
```
python3 -m vllm.entrypoints.api_server --tokenizer-mode auto --model meta-llama/Llama-2-7b-chat-hf --disable-log-requests --port 21000
```
......@@ -20,7 +20,7 @@ python3 bench_other.py --num-questions 200 --backend vllm
```
### Benchmark lightllm
### Benchmark LightLLM
```
# A10G
python -m lightllm.server.api_server --tokenizer_mode auto --model_dir ~/model_weights/llama-2-7b-chat-hf --max_total_token_num 16000 --port 22000
......@@ -31,13 +31,13 @@ python3 bench_other.py --num-questions 200 --backend lightllm
```
### Benchmark guidance
### Benchmark Guidance
```
python3 bench_other.py --num-questions 200 --backend guidance --parallel 1 --n-ctx 4096 --model-path path/to/gguf
```
### Benchmark lmql
### Benchmark LMQL
```
CUDA_VISIBLE_DEVICES=0,1 lmql serve-model meta-llama/Llama-2-7b-chat-hf --cuda --port 23000
```
......
## Run benchmark
### Benchmark sglang
### Benchmark SGLang
```
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000
```
......@@ -10,7 +10,7 @@ python3 bench_sglang.py --num-questions 200
```
### Benchmark vllm
### Benchmark vLLM
```
python3 -m vllm.entrypoints.api_server --tokenizer-mode auto --model meta-llama/Llama-2-7b-chat-hf --disable-log-requests --port 21000
```
......@@ -20,7 +20,7 @@ python3 bench_other.py --num-questions 200 --backend vllm
```
### Benchmark lightllm
### Benchmark LightLLM
```
# A10G
python -m lightllm.server.api_server --tokenizer_mode auto --model_dir ~/model_weights/llama-2-7b-chat-hf --max_total_token_num 16000 --port 22000
......@@ -31,13 +31,13 @@ python3 bench_other.py --num-questions 200 --backend lightllm
```
### Benchmark guidance
### Benchmark Guidance
```
CUDA_VISIBLE_DEVICES=0,1 python3 bench_other.py --num-questions 200 --backend guidance --parallel 1 --n-ctx 4096 --model-path path/to/gguf
```
### Benchmark lmql
### Benchmark LMQL
```
lmql serve-model meta-llama/Llama-2-7b-chat-hf --cuda --port 23000
```
......
......@@ -4,7 +4,7 @@ This directory contains benchmarking tools for MoE (Mixture of Experts) kernels.
### Tuning Tool
- `tuning_fused_moe_triton.py`: A tool for tuning the `fused_moe_triton` kernel. Adapted from [vllm's benchmark_moe.py](https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/benchmark_moe.py), with added support for various model architectures.
- `tuning_fused_moe_triton.py`: A tool for tuning the `fused_moe_triton` kernel. Adapted from [vLLM's benchmark_moe.py](https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/benchmark_moe.py), with added support for various model architectures.
Example usage:
```bash
......@@ -48,7 +48,7 @@ After tuning, a configuration file (e.g., `E=64,N=640,device_name=NVIDIA_GeForce
### Performance Comparison Tool
- `benchmark_vllm_vs_sglang_fused_moe_triton.py`: A tool for comparing the performance of fused MoE kernels between vllm and sglang implementations. Supports various model architectures and data types.
- `benchmark_vllm_vs_sglang_fused_moe_triton.py`: A tool for comparing the performance of fused MoE kernels between vLLM and SGLang implementations. Supports various model architectures and data types.
Example usage:
```bash
......
## Download data
## Download Data
```
bash download_data.sh
```
## Run benchmark
## Run Benchmark
### Benchmark sglang
### Benchmark SGLang
```
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000
```
......@@ -19,7 +19,7 @@ python3 bench_sglang.py --nsub 10
python3 bench_sglang.py --backend gpt-3.5-turbo --parallel 8
```
### Benchmark vllm
### Benchmark vLLM
```
python3 -m vllm.entrypoints.api_server --tokenizer-mode auto --model meta-llama/Llama-2-7b-chat-hf --disable-log-requests --port 21000
```
......@@ -29,7 +29,7 @@ python3 bench_other.py --nsub 10 --backend vllm
```
### Benchmark lightllm
### Benchmark LightLLM
```
# A10G
python -m lightllm.server.api_server --tokenizer_mode auto --model_dir ~/model_weights/llama-2-7b-chat-hf --max_total_token_num 16000 --port 22000
......@@ -43,13 +43,13 @@ python3 bench_other.py --nsub 10 --backend lightllm
```
### Benchmark guidance
### Benchmark Guidance
```
python3 bench_other.py --nsub 10 --backend guidance --parallel 1 --n-ctx 4096 --model-path path/to/gguf
```
### Benchmark lmql
### Benchmark LMQL
```
CUDA_VISIBLE_DEVICES=0,1 lmql serve-model meta-llama/Llama-2-7b-chat-hf --cuda --port 23000
```
......
......@@ -4,9 +4,9 @@
wget -O question.jsonl https://raw.githubusercontent.com/lm-sys/FastChat/main/fastchat/llm_judge/data/mt_bench/question.jsonl
```
## Run benchmark
## Run Benchmark
### Benchmark sglang
### Benchmark SGLang
```
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000
```
......@@ -15,7 +15,7 @@ python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port
python3 bench_sglang.py --num-questions 80
```
### Benchmark sglang EAGLE
### Benchmark SGLang EAGLE
```
python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3-8B-Instruct --speculative-algo EAGLE \
--speculative-draft lmsys/sglang-EAGLE-LLaMA3-Instruct-8B --speculative-num-steps 5 \
......@@ -27,7 +27,7 @@ python3 bench_sglang_eagle.py --num-questions 80 --parallel 1
```
### Benchmark vllm
### Benchmark vLLM
```
python3 -m vllm.entrypoints.api_server --tokenizer-mode auto --model meta-llama/Llama-2-7b-chat-hf --disable-log-requests --port 21000
```
......@@ -37,7 +37,7 @@ python3 bench_other.py --num-questions 80 --backend vllm
```
### Benchmark lightllm
### Benchmark LightLLM
```
# A10G
python -m lightllm.server.api_server --tokenizer_mode auto --model_dir ~/model_weights/llama-2-7b-chat-hf --max_total_token_num 16000 --port 22000
......
## Download data
## Download Data
```
wget https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl
```
## Run benchmark
## Run Benchmark
### Benchmark sglang
### Benchmark SGLang
```
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --schedule-conservativeness 1.3
```
......@@ -16,7 +16,7 @@ python3 bench_sglang.py --num-questions 32 --parallel 1
```
### Benchmark vllm
### Benchmark vLLM
```
python3 -m vllm.entrypoints.api_server --tokenizer-mode auto --model meta-llama/Llama-2-7b-chat-hf --disable-log-requests --port 21000
```
......@@ -26,7 +26,7 @@ python3 bench_other.py --num-questions 64 --backend vllm
```
### Benchmark lightllm
### Benchmark LightLLM
```
# A10G
python -m lightllm.server.api_server --tokenizer_mode auto --model_dir ~/model_weights/llama-2-7b-chat-hf --max_total_token_num 16000 --port 22000
......@@ -37,12 +37,12 @@ python3 bench_other.py --num-questions 64 --backend lightllm
```
### Benchmark guidance
### Benchmark Guidance
```
python3 bench_other.py --num-questions 8 --backend guidance --parallel 1 --n-ctx 4096 --model-path path/to/gguf
```
### Benchmark lmql
### Benchmark LMQL
```
python3 bench_other.py --num-questions 64 --backend lmql --parallel 1
......
## Download data
## Download Data
```
wget https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl
```
## Run benchmark
## Run Benchmark
NOTE: This is an implementation for throughput/latency benchmark purposes. The prompts are not tuned to achieve good accuracy on the GSM-8K tasks.
### Benchmark sglang
### Benchmark SGLang
```
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000
```
......@@ -18,7 +18,7 @@ python3 bench_sglang.py --num-questions 16 --parallel 1
```
### Benchmark vllm
### Benchmark vLLM
```
python3 -m vllm.entrypoints.api_server --tokenizer-mode auto --model meta-llama/Llama-2-7b-chat-hf --disable-log-requests --port 21000
```
......@@ -28,7 +28,7 @@ python3 bench_other.py --num-questions 32 --backend vllm
```
### Benchmark lightllm
### Benchmark LightLLM
```
# A10G
python -m lightllm.server.api_server --tokenizer_mode auto --model_dir ~/model_weights/llama-2-7b-chat-hf --max_total_token_num 16000 --port 22000
......@@ -39,12 +39,12 @@ python3 bench_other.py --num-questions 32 --backend lightllm
```
### Benchmark guidance
### Benchmark Guidance
```
python3 bench_other.py --num-questions 8 --backend guidance --parallel 1 --n-ctx 4096 --model-path path/to/gguf
```
### Benchmark lmql
### Benchmark LMQL
```
python3 bench_other.py --num-questions 8 --backend lmql --parallel 1
......
## Download data
## Download Data
```
wget https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl
```
## Run benchmark
### Benchmark sglang
### Benchmark SGLang
```
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000
```
......@@ -16,7 +16,7 @@ python3 bench_sglang.py --num-questions 10 --parallel 1
```
### Benchmark vllm
### Benchmark vLLM
```
python3 -m vllm.entrypoints.api_server --tokenizer-mode auto --model meta-llama/Llama-2-7b-chat-hf --disable-log-requests --port 21000
```
......@@ -26,7 +26,7 @@ python3 bench_other.py --num-questions 32 --backend vllm
```
### Benchmark lightllm
### Benchmark LightLLM
```
# A10G
python -m lightllm.server.api_server --tokenizer_mode auto --model_dir ~/model_weights/llama-2-7b-chat-hf --max_total_token_num 16000 --port 22000
......@@ -37,7 +37,7 @@ python3 bench_other.py --num-questions 32 --backend lightllm
```
### Benchmark guidance
### Benchmark Guidance
```
python3 bench_other.py --num-questions 32 --backend guidance --parallel 1 --n-ctx 4096 --model-path path/to/gguf
```
......@@ -22,7 +22,7 @@ spec:
command:
- /bin/bash
- -c
# please modify the sglang serving arguments below, as necessary.
# please modify the SGLang serving arguments below, as necessary.
# NOTE: the --expert-parallel-size and --enable-ep-moe flags are for MoE models like DeepSeek-R1
args:
- |
......
......@@ -82,7 +82,7 @@ if is_in_ci():
llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
# Terminalte Engine
# Terminate Engine
llm.shutdown()
```
......@@ -94,7 +94,7 @@ llm.shutdown()
### **Model Selection**
For demonstrations in the docs, **prefer smaller models** to reduce memory consumption and speed up inference. Running larger models in CI can lead to instability due to memory constraints.
For demonstrations in the docs, we **prefer smaller models** to reduce memory consumption and speed up inference. Running larger models in CI can lead to instability due to memory constraints.
### **Prompt Alignment Example**
......
......@@ -134,7 +134,7 @@ python3 -m sglang.launch_server \
SGLang supports the following quantization methods based on torchao `["int8dq", "int8wo", "fp8wo", "fp8dq-per_tensor", "fp8dq-per_row", "int4wo-32", "int4wo-64", "int4wo-128", "int4wo-256"]`.
Note: According to [this issue](https://github.com/sgl-project/sglang/issues/2219#issuecomment-2561890230), `"int8dq"` method currently has some bugs when using together with cuda graph capture. So we suggest to disable cuda graph capture when using `"int8dq"` method. Namely, please use the following command:
Note: According to [this issue](https://github.com/sgl-project/sglang/issues/2219#issuecomment-2561890230), the `"int8dq"` method currently has some bugs when used together with CUDA graph capture, so we suggest disabling CUDA graph capture when using the `"int8dq"` method. Namely, please use the following command:
```bash
python3 -m sglang.launch_server \
......
......@@ -38,7 +38,7 @@ memory management, and optimization techniques.
- To enable fp8 weight quantization, add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
- To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](custom_chat_template.md).
- To run tensor parallelism on multiple nodes, add `--nnodes 2`. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port, you can use the following commands. If you meet deadlock, please try to add `--disable-cuda-graph`
- To run tensor parallelism on multiple nodes, add `--nnodes 2`. If you have two nodes with two GPUs each and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port; then you can use the following commands. If you encounter deadlocks, please try adding `--disable-cuda-graph`.
```bash
# Node 0
......
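Separately from the multi-node example above, the fp8 weight and KV-cache options listed earlier can be combined in a single launch; a minimal sketch with a placeholder model path:

```bash
# Illustrative launch with both fp8 options from the list above.
python3 -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --quantization fp8 --kv-cache-dtype fp8_e5m2
```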
# Development Guide Using Docker
## Setup VSCode on a Remote Host
(Optional - you can skip this step if you plan to run sglang dev container locally)
(Optional - you can skip this step if you plan to run the SGLang dev container locally)
1. In the remote host, download `code` from [Https://code.visualstudio.com/docs/?dv=linux64cli](https://code.visualstudio.com/download) and run `code tunnel` in a shell.
1. In the remote host, download `code` from [https://code.visualstudio.com/docs/?dv=linux64cli](https://code.visualstudio.com/download) and run `code tunnel` in a shell.
Example
```bash
......@@ -19,20 +19,20 @@ tar xf vscode_cli_alpine_x64_cli.tar.gz
## Setup Docker Container
### Option 1. Use the default dev container automatically from VSCode
There is a `.devcontainer` folder in the sglang repository root folder to allow VSCode to automatically start up within dev container. You can read more about this VSCode extension in VSCode official document [Developing inside a Container](https://code.visualstudio.com/docs/devcontainers/containers).
There is a `.devcontainer` folder in the SGLang repository root folder that allows VSCode to automatically start up within a dev container. You can read more about this VSCode extension in the official VSCode documentation, [Developing inside a Container](https://code.visualstudio.com/docs/devcontainers/containers).
![image](https://github.com/user-attachments/assets/6a245da8-2d4d-4ea8-8db1-5a05b3a66f6d)
(*Figure 1: Diagram from VSCode official documentation [Developing inside a Container](https://code.visualstudio.com/docs/devcontainers/containers).*)
To enable this, you only need to:
1. Start Visual Studio Code and install [VSCode dev container extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers).
1. Start Visual Studio Code and install the [VSCode dev container extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers).
2. Press F1, then type and choose "Dev Container: Open Folder in Container".
3. Input the local path of the `sglang` repo on your machine and press enter.
The first time you open it in dev container might take longer due to docker pull and build. Once it's successful, you should set on your status bar at the bottom left displaying that you are in a dev container:
The first time you open it in the dev container might take longer due to the docker pull and build. Once it succeeds, the status bar at the bottom left should display that you are in a dev container:
![image](https://github.com/user-attachments/assets/650bba0b-c023-455f-91f9-ab357340106b)
Now when you run `sglang.launch_server` in the VSCode terminal or start debugging using F5, sglang server will be started in the dev container with all your local changes applied automatically:
Now when you run `sglang.launch_server` in the VSCode terminal or start debugging using F5, the SGLang server will be started in the dev container with all your local changes applied automatically:
![image](https://github.com/user-attachments/assets/748c85ba-7f8c-465e-8599-2bf7a8dde895)
......@@ -52,21 +52,21 @@ docker run -itd --shm-size 32g --gpus all -v <volumes-to-mount> --ipc=host --net
docker exec -it sglang_dev /bin/zsh
```
Some useful volumes to mount are:
1. **Huggingface model cache**: mounting model cache can avoid re-download every time docker restarts. Default location on Linux is `~/.cache/huggingface/`.
1. **HuggingFace model cache**: mounting the model cache avoids re-downloading models every time the container restarts. The default location on Linux is `~/.cache/huggingface/`.
2. **SGLang repository**: code changes in the SGLang local repository will be automatically synced to the .devcontainer.
Example 1: Monting local cache folder `/opt/dlami/nvme/.cache` but not the SGLang repo. Use this when you prefer to manually transfer local code changes to the devcontainer.
Example 1: Mounting local cache folder `/opt/dlami/nvme/.cache` but not the SGLang repo. Use this when you prefer to manually transfer local code changes to the devcontainer.
```bash
docker run -itd --shm-size 32g --gpus all -v /opt/dlami/nvme/.cache:/root/.cache --ipc=host --network=host --privileged --name sglang_zhyncs lmsysorg/sglang:dev /bin/zsh
docker exec -it sglang_zhyncs /bin/zsh
```
Example 2: Mounting both HuggingFace cache and local SGLang repo. Local code changes are automatically synced to the devcontainer as the SGLang is installed in editable mode in the dev image.
Example 2: Mounting both the HuggingFace cache and the local SGLang repo. Local code changes are automatically synced to the devcontainer as SGLang is installed in editable mode in the dev image.
```bash
docker run -itd --shm-size 32g --gpus all -v $HOME/.cache/huggingface/:/root/.cache/huggingface -v $HOME/src/sglang:/sgl-workspace/sglang --ipc=host --network=host --privileged --name sglang_zhyncs lmsysorg/sglang:dev /bin/zsh
docker exec -it sglang_zhyncs /bin/zsh
```
## Debug SGLang with VSCode Debugger
1. (Create if not exist) open `launch.json` in VSCode.
1. Open `launch.json` in VSCode (create it if it does not exist).
2. Add the following config and save. Please note that you can edit the script as needed to apply different parameters or debug a different program (e.g. benchmark script).
```JSON
{
......
......@@ -4,7 +4,7 @@
### Step 1: Start a docker container.
You can mount a folder for the shared huggingface model weights cache. The command below uses `/tmp/huggingface` as an example.
You can mount a folder for the shared HuggingFace model weights cache. The command below uses `/tmp/huggingface` as an example.
```
docker pull nvidia/cuda:12.1.1-devel-ubuntu22.04
......
......@@ -5,7 +5,7 @@ SGLang is a fast serving framework for large language models and vision language
It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.
The core features include:
- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, continuous batching, token attention (paged attention), speculative decoding, tensor parallelism, chunked prefill, structured outputs, quantization (FP8/INT4/AWQ/GPTQ), and multi-lora batching.
- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, continuous batching, token attention (PagedAttention), speculative decoding, tensor parallelism, chunked prefill, structured outputs, quantization (FP8/INT4/AWQ/GPTQ), and multi-LoRA batching.
- **Flexible Frontend Language**: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
- **Extensive Model Support**: Supports a wide range of generative models (Llama, Gemma, Mistral, Qwen, DeepSeek, LLaVA, etc.), embedding models (e5-mistral, gte, mcdse) and reward models (Skywork), with easy extensibility for integrating new models.
- **Active Community**: SGLang is open-source and backed by an active community with industry adoption.
......