Unverified Commit 51fda143 authored by Ying Sheng, committed by GitHub

Update Readme (#660)


Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
parent dc4e4a6a
@@ -6,23 +6,29 @@
| [**Blog**](https://lmsys.org/blog/2024-01-17-sglang/) | [**Paper**](https://arxiv.org/abs/2312.07104) |

SGLang is a fast serving framework for large language models and vision language models.
It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.
The core features include:

- **Fast Backend Runtime**: Efficient serving with RadixAttention for prefix caching, continuous batching, token attention (paged attention), tensor parallelism, flashinfer kernels, jump-forward constrained decoding, and quantization (AWQ/FP8/GPTQ/Marlin).
- **Flexible Frontend Language**: Enables easy programming of LLM applications with chained generation calls, advanced prompting, control flow, multiple modalities, parallelism, and external interactions.
## News
- [2024/04] 🔥 SGLang is used by the official **LLaVA-NeXT (video)** release ([blog](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/)).
- [2024/02] 🔥 SGLang enables **3x faster JSON decoding** with compressed finite state machine ([blog](https://lmsys.org/blog/2024-02-05-compressed-fsm/)).
- [2024/01] SGLang provides up to **5x faster inference** with RadixAttention ([blog](https://lmsys.org/blog/2024-01-17-sglang/)).
<details>
<summary>More</summary>
- [2024/01] SGLang powers the serving of the official **LLaVA v1.6** release demo ([usage](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#demo)).
</details>
## Contents
- [Install](#install)
- [Backend: SGLang Runtime (SRT)](#backend-sglang-runtime-srt)
- [Frontend: Structured Generation Language (SGLang)](#frontend-structured-generation-language-sglang)
- [Benchmark And Performance](#benchmark-and-performance)
- [Roadmap](#roadmap)
- [Citation And Acknowledgment](#citation-and-acknowledgment)
@@ -70,13 +76,118 @@ pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/
- If you cannot install FlashInfer, check out its [installation](https://docs.flashinfer.ai/installation.html#) page. If you still cannot install it, you can use the slower Triton kernels by adding `--disable-flashinfer` when launching the server.
- If you only need to use the OpenAI backend, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.

## Backend: SGLang Runtime (SRT)
The SGLang Runtime (SRT) is an efficient serving engine.
### Launching a Server
Launch a server
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
```
Send a request
```
curl http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Once upon a time,",
    "sampling_params": {
      "max_new_tokens": 16,
      "temperature": 0
    }
  }'
```
Learn more about the argument format [here](docs/sampling_params.md).
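If you prefer to call the endpoint from Python rather than curl, a minimal sketch using the `requests` library (assuming the server above is running on port 30000) looks like this:
```python
import requests

# Query the /generate endpoint with the same payload as the curl example above.
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Once upon a time,",
        "sampling_params": {"max_new_tokens": 16, "temperature": 0},
    },
)
print(response.json())
```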
### OpenAI Compatible API
In addition, the server supports OpenAI-compatible APIs.
```python
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Text completion
response = client.completions.create(
    model="default",
    prompt="The capital of France is",
    temperature=0,
    max_tokens=32,
)
print(response)

# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response)
```
It supports streaming and most features of the Chat/Completions/Models endpoints specified by the [OpenAI API Reference](https://platform.openai.com/docs/api-reference/).
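For example, a minimal streaming sketch with the `client` created above (an illustration, not an exhaustive reference) could look like:
```python
# Stream a chat completion and print tokens as they arrive.
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Write a two-sentence story."}],
    temperature=0,
    max_tokens=64,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```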
### Additional Server Arguments
- Add `--tp 2` to enable tensor parallelism. If it indicates `peer access is not supported between these two devices`, add `--enable-p2p-check` option.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --tp 2
```
- Add `--dp 2` to enable data parallelism. It can also be used together with tp. Data parallelism is better for throughput if there is enough memory.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --dp 2 --tp 2
```
- If you see out-of-memory errors during serving, please try to reduce the memory usage of the KV cache pool by setting a smaller value of `--mem-fraction-static`. The default value is `0.9`.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --mem-fraction-static 0.7
```
- See [hyperparameter_tuning.md](docs/hyperparameter_tuning.md) on tuning hyperparameters for better performance.
- Add `--nnodes 2` to run tensor parallelism on multiple nodes. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-1` be the hostname of the first node and `50000` be an available port.
```
# Node 0
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init sgl-dev-1:50000 --nnodes 2 --node-rank 0
# Node 1
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init sgl-dev-1:50000 --nnodes 2 --node-rank 1
```
- If the model does not have a template in the Hugging Face tokenizer, you can specify a [custom chat template](docs/custom_chat_template.md).
### Supported Models
- Llama / Llama 2 / Llama 3
- Mistral / Mixtral
- Gemma / Gemma 2
- Qwen / Qwen 2 / Qwen 2 MoE
- LLaVA 1.5 / 1.6
- `python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.5-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000`
- `python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.6-vicuna-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000`
- `python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.6-34b --tokenizer-path liuhaotian/llava-v1.6-34b-tokenizer --port 30000`
- LLaVA-NeXT-Video
- see [srt_example_llava_v.sh](examples/usage/llava_video/srt_example_llava_v.sh)
- Yi-VL
- see [srt_example_yi_vl.py](examples/quick_start/srt_example_yi_vl.py).
- StableLM
- Command-R
- DBRX
- Grok
- ChatGLM
- InternLM 2
Instructions for supporting a new model are [here](https://github.com/sgl-project/sglang/blob/main/docs/model_support.md).
## Frontend: Structured Generation Language (SGLang)
The frontend language can be used with local models or API models.
### Quick Start
The example below shows how to use sglang to answer a multi-turn question.
#### Using Local Models
First, launch a server with
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
```
Then, connect to the server and answer a multi-turn question.
@@ -105,7 +216,7 @@ for m in state.messages():
print(state["answer_1"])
```
#### Using OpenAI Models
Set the OpenAI API Key
```
export OPENAI_API_KEY=sk-******
@@ -136,13 +247,12 @@ for m in state.messages():
print(state["answer_1"])
```
#### More Examples
Anthropic and VertexAI (Gemini) models are also supported.
You can find more examples at [examples/quick_start](examples/quick_start).

### Language Feature
To begin with, import sglang.
```python
import sglang as sgl
@@ -155,7 +265,7 @@ The system will manage the state, chat template, parallelism and batching for you.
The complete code for the examples below can be found at [readme_examples.py](examples/usage/readme_examples.py)
#### Control Flow
You can use any Python code within the function body, including control flow, nested function calls, and external libraries.
```python
@@ -170,7 +280,7 @@ def tool_use(s, question):
    s += "The key word to search is" + sgl.gen("word")
```
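As another illustration, ordinary Python branching mixes freely with generation calls (a hypothetical function in the same style, not taken from the repo):
```python
import sglang as sgl

@sgl.function
def maybe_translate(s, sentence, translate):
    s += "Sentence: " + sentence + "\n"
    if translate:  # plain Python control flow inside the program
        s += "French translation: " + sgl.gen("translation", max_tokens=64, stop="\n")
    else:
        s += "One-phrase summary: " + sgl.gen("summary", max_tokens=32, stop="\n")
```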
#### Parallelism
Use `fork` to launch parallel prompts.
Because `sgl.gen` is non-blocking, the for loop below issues two generation calls in parallel.
@@ -192,7 +302,7 @@ def tip_suggestion(s):
    s += "In summary" + sgl.gen("summary")
```
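A minimal sketch of the same `fork` pattern (the prompt content here is hypothetical):
```python
import sglang as sgl

@sgl.function
def two_facts(s, topic):
    s += "Here are two facts about " + topic + ".\n"
    # fork(2) creates two branches whose generations run in parallel.
    forks = s.fork(2)
    for i, f in enumerate(forks):
        f += f"Fact {i + 1}: " + sgl.gen(f"fact_{i + 1}", max_tokens=32, stop="\n")
    # Join the branch results back into the main state.
    s += "Fact 1: " + forks[0]["fact_1"] + "\n"
    s += "Fact 2: " + forks[1]["fact_2"] + "\n"
```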
#### Multi Modality
Use `sgl.image` to pass an image as input.
```python
@@ -204,7 +314,7 @@ def image_qa(s, image_file, question):
See also [srt_example_llava.py](examples/quick_start/srt_example_llava.py).
#### Constrained Decoding
Use `regex` to specify a regular expression as a decoding constraint.
This is only supported for local models.
@@ -219,7 +329,7 @@ def regular_expression_gen(s):
)
```
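A compact sketch of the idea (the pattern here is chosen only for illustration):
```python
import sglang as sgl

@sgl.function
def ip_gen(s):
    s += "The IP address of the default gateway is "
    # Decoding is constrained so the generated text must match the regex.
    s += sgl.gen("ip", temperature=0, regex=r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")
```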
#### JSON Decoding
Use `regex` to specify a JSON schema with a regular expression.
```python
@@ -248,8 +358,7 @@ def character_gen(s, name):
See also [json_decode.py](examples/usage/json_decode.py) for an additional example on specifying formats with Pydantic models.
#### Batching
Use `run_batch` to run a batch of requests with continuous batching.
```python
@@ -268,7 +377,7 @@ states = text_qa.run_batch(
)
```
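For reference, a minimal sketch of such a batch call; the `text_qa` function here is a hypothetical stand-in consistent with the fragment above:
```python
import sglang as sgl

@sgl.function
def text_qa(s, question):
    s += "Q: " + question + "\n"
    s += "A:" + sgl.gen("answer", stop="\n")

# Submit several argument dictionaries; the runtime batches them continuously.
states = text_qa.run_batch(
    [
        {"question": "What is the capital of the United Kingdom?"},
        {"question": "What is the capital of France?"},
        {"question": "What is the capital of Japan?"},
    ],
    progress_bar=True,
)
print(states[0]["answer"])
```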
#### Streaming
Add `stream=True` to enable streaming.
```python
@@ -287,139 +396,10 @@ for out in state.text_iter():
    print(out, end="", flush=True)
```
#### Tips and Implementation Details
- The `choices` argument in `sgl.gen` is implemented by computing the [token-length normalized log probabilities](https://blog.eleuther.ai/multiple-choice-normalization/) of all choices and selecting the one with the highest probability (see the sketch after this list).
- The `regex` argument in `sgl.gen` is implemented through autoregressive decoding with logit bias masking, according to the constraints set by the regex. It is compatible with `temperature=0` and `temperature != 0`.
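To make the first point concrete, here is a small standalone sketch of token-length normalization (illustrative numbers, not SGLang's internal code):
```python
def pick_choice(choice_token_logprobs):
    """Return the choice with the highest mean per-token log probability."""
    return max(
        choice_token_logprobs,
        key=lambda c: sum(choice_token_logprobs[c]) / len(choice_token_logprobs[c]),
    )

# Hypothetical per-token log probabilities for two candidate answers.
scores = {
    "Paris": [-0.1, -0.3],  # two tokens, good average
    "Lyon": [-0.9],         # one token, worse average
}
print(pick_choice(scores))  # -> Paris
```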
## Benchmark And Performance
- Llama-7B on NVIDIA A10G, FP16, Tensor Parallelism=1
![llama_7b](assets/llama_7b.jpg)
......
# Benchmark Results
We tested our system on the following common LLM workloads and reported the achieved throughput:
- **[MMLU](https://arxiv.org/abs/2009.03300)**: A 5-shot, multi-choice, multi-task benchmark.
......
# Custom Chat Template in SGLang Runtime
By default, the server uses the chat template specified in the model tokenizer from Hugging Face. It should just work for most official models such as Llama-2/Llama-3.
If needed, you can also override the chat template when launching the server:
```
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template llama-2
```
If the chat template you are looking for is missing, you are welcome to contribute it.
Meanwhile, you can also temporarily register your chat template as follows:
```json
{
"name": "my_model",
"system": "<|im_start|>system",
"user": "<|im_start|>user",
"assistant": "<|im_start|>assistant",
"sep_style": "CHATML",
"sep": "<|im_end|>",
"stop_str": ["<|im_end|>", "<|im_start|>"]
}
```
```
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template ./my_model_template.json
```
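As a rough illustration of how the fields of such a CHATML-style template combine into a prompt (an assumption about the rendering, not the server's exact output):
```python
# Illustrative only: piece together one system + one user turn with the
# markers from the template above, ending with the assistant prefix.
system_msg = "You are a helpful AI assistant"
user_msg = "Hello!"

prompt = (
    "<|im_start|>system\n" + system_msg + "<|im_end|>\n"
    "<|im_start|>user\n" + user_msg + "<|im_end|>\n"
    "<|im_start|>assistant\n"
)
print(prompt)
```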
# How to Support a New Model
To support a new model in SGLang, you only need to add a single file under [SGLang Models Directory](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/models). You can learn from existing model implementations and create new files for the new models. Most models are based on the transformer architecture, making them very similar.
......
# Sampling Parameters in SGLang Runtime
This doc describes the sampling parameters of the SGLang Runtime.
The `/generate` endpoint accepts the following arguments in the JSON format.
@@ -6,11 +6,11 @@ The `/generate` endpoint accepts the following arguments in the JSON format.
```python
@dataclass
class GenerateReqInput:
    # The input prompt. It can be a single prompt or a batch of prompts.
    text: Union[List[str], str]
    # The token ids for text; one can either specify text or input_ids
    input_ids: Optional[Union[List[List[int]], List[int]]] = None
    # The image input. It can be a file name.
    image_data: Optional[Union[List[str], str]] = None
    # The sampling_params
    sampling_params: Union[List[Dict], Dict] = None
......
# SRT Unit Tests
### Latency Alignment
Make sure your changes do not slow down the following benchmarks
......
# Code Structure
- `lang`: The frontend language.
- `srt`: The backend engine for running local models. (SRT = SGLang Runtime).
- `test`: Test utilities.
- `api.py`: Public API.
- `bench_latency.py`: Benchmark utilities.
......
@@ -22,16 +22,16 @@ from sglang.api import (
    video,
)
# Global Configurations
from sglang.global_config import global_config
# SGL Backends
from sglang.lang.backend.anthropic import Anthropic
from sglang.lang.backend.litellm import LiteLLM
from sglang.lang.backend.openai import OpenAI
from sglang.lang.backend.runtime_endpoint import RuntimeEndpoint
from sglang.lang.backend.vertexai import VertexAI
# public APIs management
__all__ = [
    "global_config",
......
@@ -4,8 +4,8 @@ import os
import re
from typing import Callable, List, Optional, Union
from sglang.global_config import global_config
from sglang.lang.backend.base_backend import BaseBackend
from sglang.lang.ir import (
    SglExpr,
    SglExprList,
......
@@ -2,7 +2,7 @@ from typing import List, Optional, Union
import numpy as np
from sglang.lang.backend.base_backend import BaseBackend
from sglang.lang.chat_template import get_chat_template
from sglang.lang.interpreter import StreamExecutor
from sglang.lang.ir import SglSamplingParams
......
from typing import Mapping, Optional
from sglang.lang.backend.base_backend import BaseBackend
from sglang.lang.chat_template import get_chat_template_by_model_path
from sglang.lang.interpreter import StreamExecutor
from sglang.lang.ir import SglSamplingParams
......
@@ -6,7 +6,7 @@ from typing import Callable, List, Optional, Union
import numpy as np
from sglang.lang.backend.base_backend import BaseBackend
from sglang.lang.chat_template import ChatTemplate, get_chat_template_by_model_path
from sglang.lang.interpreter import StreamExecutor
from sglang.lang.ir import SglSamplingParams
......
@@ -3,8 +3,8 @@ from typing import List, Optional
import numpy as np
from sglang.global_config import global_config
from sglang.lang.backend.base_backend import BaseBackend
from sglang.lang.chat_template import get_chat_template_by_model_path
from sglang.lang.interpreter import StreamExecutor
from sglang.lang.ir import SglSamplingParams
......
@@ -2,7 +2,7 @@ import os
import warnings
from typing import Optional
from sglang.lang.backend.base_backend import BaseBackend
from sglang.lang.chat_template import get_chat_template
from sglang.lang.interpreter import StreamExecutor
from sglang.lang.ir import SglSamplingParams
......
@@ -3,8 +3,8 @@
import uuid
from typing import Any, Callable, Dict, List, Optional, Union
from sglang.global_config import global_config
from sglang.lang.backend.base_backend import BaseBackend
from sglang.lang.interpreter import ProgramState, ProgramStateGroup
from sglang.lang.ir import (
    SglArgument,
......
@@ -26,7 +26,7 @@ import uvloop
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse, Response, StreamingResponse
from sglang.lang.backend.runtime_endpoint import RuntimeEndpoint
from sglang.srt.constrained import disable_cache
from sglang.srt.hf_transformers_utils import get_tokenizer
from sglang.srt.managers.controller.manager_multi import (
......
@@ -166,6 +166,15 @@ class ServerArgs:
"--quantization",
type=str,
default=ServerArgs.quantization,
choices=[
"awq",
"fp8",
"gptq",
"marlin",
"gptq_marlin",
"squeezellm",
"bitsandbytes",
],
help="The quantization method.",
)
parser.add_argument(
@@ -243,13 +252,13 @@ parser.add_argument(
parser.add_argument(
"--show-time-cost",
action="store_true",
help="Show time cost of custom marks.",
)
parser.add_argument(
"--api-key",
type=str,
default=ServerArgs.api_key,
help="Set API key of the server.",
)
# Data parallelism
@@ -285,17 +294,17 @@ parser.add_argument(
parser.add_argument(
"--disable-flashinfer",
action="store_true",
help="Disable flashinfer inference kernels.",
)
parser.add_argument(
"--disable-radix-cache",
action="store_true",
help="Disable RadixAttention for prefix caching.",
)
parser.add_argument(
"--disable-regex-jump-forward",
action="store_true",
help="Disable regex jump-forward.",
)
parser.add_argument(
"--disable-cuda-graph",
......