Unverified Commit ce636ac4 authored by Ran Chen, committed by GitHub

fix incorrect links in documentation (#1481)


Co-authored-by: Yineng Zhang <me@zhyncs.com>
parent 82136eb0
@@ -19,7 +19,7 @@ curl http://localhost:30000/generate \
  }
}'
```
-Learn more about the argument format [here](docs/en/sampling_params.md).
+Learn more about the argument format `here <https://sglang.readthedocs.io/en/latest/sampling_params.html>`_.
### OpenAI Compatible API
In addition, the server supports OpenAI-compatible APIs.
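
As a minimal sketch of what querying the OpenAI-compatible endpoint looks like (assuming the server is running locally on port 30000 and that `default` resolves to the model loaded by `launch_server`):
```
import openai

# Point the standard OpenAI client at the local SGLang server.
client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",  # assumed alias for whatever model launch_server loaded
    messages=[{"role": "user", "content": "List 3 countries and their capitals."}],
    temperature=0,
    max_tokens=64,
)
print(response.choices[0].message.content)
```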
@@ -73,7 +73,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --mem-fraction-static 0.7
```
-- See [hyperparameter_tuning.md](docs/en/hyperparameter_tuning.md) on tuning hyperparameters for better performance.
+- See `hyperparameter tuning <https://sglang.readthedocs.io/en/latest/hyperparameter_tuning.html>`_ on tuning hyperparameters for better performance.
- If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
@@ -81,7 +81,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes.
- To enable fp8 weight quantization, add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
- To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
-- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](docs/en/custom_chat_template.md).
+- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a `custom chat template <https://sglang.readthedocs.io/en/latest/custom_chat_template.html>`_.
- To run tensor parallelism on multiple nodes, add `--nnodes 2`. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port.
```
# Node 0
@@ -102,11 +102,11 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
- [LLaVA-OneVision](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/)
- `python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov --port=30000 --chat-template=chatml-llava`
- `python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-72b-ov --port=30000 --tp-size=8 --chat-template=chatml-llava`
-- Query the server with the [OpenAI Vision API](https://platform.openai.com/docs/guides/vision). See examples at [test/srt/test_vision_openai_server.py](test/srt/test_vision_openai_server.py)
+- Query the server with the [OpenAI Vision API](https://platform.openai.com/docs/guides/vision). See examples at [test/srt/test_vision_openai_server.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_vision_openai_server.py)
- LLaVA 1.5 / 1.6 / NeXT
- `python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --port=30000 --tp-size=1 --chat-template=llava_llama_3`
- `python -m sglang.launch_server --model-path lmms-lab/llava-next-72b --port=30000 --tp-size=8 --chat-template=chatml-llava`
-- Query the server with the [OpenAI Vision API](https://platform.openai.com/docs/guides/vision). See examples at [test/srt/test_vision_openai_server.py](test/srt/test_vision_openai_server.py)
+- Query the server with the [OpenAI Vision API](https://platform.openai.com/docs/guides/vision). See examples at [test/srt/test_vision_openai_server.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_vision_openai_server.py)
- Yi-VL
- StableLM
- Command-R
@@ -122,7 +122,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
- gte-Qwen2
- `python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct --is-embedding`
-Instructions for supporting a new model are [here](https://github.com/sgl-project/sglang/blob/main/docs/en/model_support.md).
+Instructions for supporting a new model are `here <https://sglang.readthedocs.io/en/latest/model_support.html>`_.
#### Use Models From ModelScope
<details>
...
@@ -70,7 +70,7 @@ print(state["answer_1"])
#### More Examples
Anthropic and VertexAI (Gemini) models are also supported.
-You can find more examples at [examples/quick_start](examples/frontend_language/quick_start).
+You can find more examples at [examples/quick_start](https://github.com/sgl-project/sglang/tree/main/examples/frontend_language/quick_start).
### Language Feature
To begin with, import sglang.
@@ -83,7 +83,7 @@ You can implement your prompt flow in a function decorated by `sgl.function`.
You can then invoke the function with `run` or `run_batch`.
The system will manage the state, chat template, parallelism and batching for you.
-The complete code for the examples below can be found at [readme_examples.py](examples/frontend_language/usage/readme_examples.py)
+The complete code for the examples below can be found at [readme_examples.py](https://github.com/sgl-project/sglang/blob/main/examples/frontend_language/usage/readme_examples.py)
#### Control Flow
You can use any Python code within the function body, including control flow, nested function calls, and external libraries.
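
A rough sketch of what that looks like (the canonical version lives in the readme_examples.py linked above; the prompt wording here is illustrative):
```
import sglang as sgl

@sgl.function
def tool_use(s, question):
    s += "To answer this question: " + question + ". "
    s += "I need to use a " + sgl.gen("tool", choices=["calculator", "search engine"]) + ". "
    # Ordinary Python branching on the value generated above.
    if s["tool"] == "calculator":
        s += "The math expression is: " + sgl.gen("expression")
    elif s["tool"] == "search engine":
        s += "The key word to search is: " + sgl.gen("word")
```
Invoking it works like any other `sgl.function`, e.g. `state = tool_use.run(question="...")` once a default backend is set.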
@@ -132,7 +132,7 @@ def image_qa(s, image_file, question):
    s += sgl.assistant(sgl.gen("answer", max_tokens=256))
```
-See also [srt_example_llava.py](examples/frontend_language/quick_start/local_example_llava_next.py).
+See also [local_example_llava_next.py](https://github.com/sgl-project/sglang/blob/main/examples/frontend_language/quick_start/local_example_llava_next.py).
#### Constrained Decoding
Use `regex` to specify a regular expression as a decoding constraint.
@@ -176,7 +176,7 @@ def character_gen(s, name):
    s += sgl.gen("json_output", max_tokens=256, regex=character_regex)
```
-See also [json_decode.py](examples/frontend_language/usage/json_decode.py) for an additional example of specifying formats with Pydantic models.
+See also [json_decode.py](https://github.com/sgl-project/sglang/blob/main/examples/frontend_language/usage/json_decode.py) for an additional example of specifying formats with Pydantic models.
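
A small, self-contained sketch of the same idea with an inline pattern (the prompt and regex here are purely illustrative):
```
import sglang as sgl

@sgl.function
def regex_gen(s):
    s += "Q: What is the IP address of the Google DNS servers?\n"
    # Constrain generation to an IPv4-shaped answer.
    s += "A: " + sgl.gen(
        "answer",
        temperature=0,
        regex=r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
    )
```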
#### Batching
Use `run_batch` to run a batch of requests with continuous batching.
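
A minimal sketch, assuming a local `RuntimeEndpoint` backend and an illustrative `text_qa` function:
```
import sglang as sgl

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def text_qa(s, question):
    s += "Q: " + question + "\n"
    s += "A: " + sgl.gen("answer", stop="\n")

# Submit the whole batch at once; the runtime interleaves the requests
# with continuous batching instead of running them one by one.
states = text_qa.run_batch(
    [
        {"question": "What is the capital of the United Kingdom?"},
        {"question": "What is the capital of France?"},
        {"question": "What is the capital of Japan?"},
    ],
    progress_bar=True,
)
for state in states:
    print(state["answer"])
```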
...
@@ -4,7 +4,7 @@ To support a new model in SGLang, you only need to add a single file under [SGLa
Another valuable resource is the [vLLM Models Directory](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models). vLLM has extensive coverage of models, and SGLang has reused vLLM for most parts of the model implementations. This similarity makes it easy to port many models from vLLM to SGLang.
-To port a model from vLLM to SGLang, you can compare these two files [SGLang LLaMA Implementation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/llama2.py) and [vLLM LLaMA Implementation](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py). This comparison will help you understand how to convert a model implementation from vLLM to SGLang. The major difference is the replacement of PagedAttention with RadixAttention. The other parts are almost identical. Specifically,
+To port a model from vLLM to SGLang, you can compare these two files [SGLang LLaMA Implementation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/llama.py) and [vLLM LLaMA Implementation](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py). This comparison will help you understand how to convert a model implementation from vLLM to SGLang. The major difference is the replacement of PagedAttention with RadixAttention. The other parts are almost identical. Specifically,
- Replace vllm's `Attention` with `RadixAttention`. Note that you need to pass `layer_id` all the way to `RadixAttention`.
- Replace vllm's `LogitsProcessor` with SGLang's `LogitsProcessor`.
- Remove `Sample`.
@@ -13,4 +13,4 @@ To port a model from vLLM to SGLang, you can compare these two files [SGLang LLa
- Test correctness by comparing the final logits and outputs of the two following commands:
- `python3 scripts/playground/reference_hf.py --model [new model]`
- `python3 -m sglang.bench_latency --model [new model] --correct --output-len 16 --trust-remote-code`
-- Update [Supported Models](https://github.com/sgl-project/sglang/tree/main?tab=readme-ov-file#supported-models) at [README](../README.md).
+- Update [Supported Models](https://github.com/sgl-project/sglang/tree/main?tab=readme-ov-file#supported-models) at [README](https://github.com/sgl-project/sglang/blob/main/README.md).
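
A heavily simplified sketch of the `Attention` to `RadixAttention` swap described above (the constructor argument names below are assumptions based on the linked LLaMA implementation, not something this diff confirms):
```
import torch.nn as nn

from sglang.srt.layers.radix_attention import RadixAttention


class PortedAttention(nn.Module):
    """Illustrative fragment only; a real port lives inside the full model file."""

    def __init__(self, num_heads, head_dim, scaling, num_kv_heads, layer_id):
        super().__init__()
        # In vLLM this layer would be vllm's Attention(...); in SGLang, layer_id
        # must be threaded all the way down to RadixAttention (assumed arguments).
        self.attn = RadixAttention(
            num_heads,
            head_dim,
            scaling,
            num_kv_heads=num_kv_heads,
            layer_id=layer_id,
        )
```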