Unverified Commit d1b31b06 authored by Lianmin Zheng's avatar Lianmin Zheng Committed by GitHub
Browse files

Improve docs and fix the broken links (#1875)

parent d59a4782
...@@ -84,7 +84,8 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct ...@@ -84,7 +84,8 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
- To enable torchao quantization, add `--torchao-config int4wo-128`. It supports various quantization strategies. - To enable torchao quantization, add `--torchao-config int4wo-128`. It supports various quantization strategies.
- To enable fp8 weight quantization, add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments. - To enable fp8 weight quantization, add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
- To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`. - To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](https://sgl-project.github.io/custom_chat_template.html). - If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](https://sgl-project.github.io/references/custom_chat_template.html).
- To run tensor parallelism on multiple nodes, add `--nnodes 2`. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port, you can use the following commands. If you meet deadlock, please try to add `--disable-cuda-graph` - To run tensor parallelism on multiple nodes, add `--nnodes 2`. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port, you can use the following commands. If you meet deadlock, please try to add `--disable-cuda-graph`
``` ```
# Node 0 # Node 0
......
...@@ -23,8 +23,8 @@ The core features include: ...@@ -23,8 +23,8 @@ The core features include:
:maxdepth: 1 :maxdepth: 1
:caption: Backend Tutorial :caption: Backend Tutorial
backend/openai_api.ipynb backend/openai_api_completions.ipynb
backend/vision_language_model.ipynb backend/openai_api_vision.ipynb
backend/backend.md backend/backend.md
...@@ -46,5 +46,5 @@ The core features include: ...@@ -46,5 +46,5 @@ The core features include:
references/choices_methods.md references/choices_methods.md
references/benchmark_and_profiling.md references/benchmark_and_profiling.md
references/troubleshooting.md references/troubleshooting.md
references/embedding_model.ipynb references/custom_chat_template.md
references/learn_more.md references/learn_more.md
.. _custom-chat-template:
# Custom Chat Template in SGLang Runtime # Custom Chat Template in SGLang Runtime
**NOTE**: There are two chat template systems in SGLang project. This document is about setting a custom chat template for the OpenAI-compatible API server (defined at [conversation.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/conversation.py)). It is NOT related to the chat template used in the SGLang language frontend (defined at [chat_template.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/lang/chat_template.py)). **NOTE**: There are two chat template systems in SGLang project. This document is about setting a custom chat template for the OpenAI-compatible API server (defined at [conversation.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/conversation.py)). It is NOT related to the chat template used in the SGLang language frontend (defined at [chat_template.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/lang/chat_template.py)).
......
.. _sampling-parameters:
# Sampling Parameters in SGLang Runtime # Sampling Parameters in SGLang Runtime
This doc describes the sampling parameters of the SGLang Runtime. This doc describes the sampling parameters of the SGLang Runtime.
It is the low-level endpoint of the runtime. It is the low-level endpoint of the runtime.
......
This diff is collapsed.
...@@ -132,7 +132,7 @@ class TestOpenAIVisionServer(unittest.TestCase): ...@@ -132,7 +132,7 @@ class TestOpenAIVisionServer(unittest.TestCase):
assert response.usage.completion_tokens > 0 assert response.usage.completion_tokens > 0
assert response.usage.total_tokens > 0 assert response.usage.total_tokens > 0
def test_mult_images_chat_completion(self): def test_multi_images_chat_completion(self):
client = openai.Client(api_key=self.api_key, base_url=self.base_url) client = openai.Client(api_key=self.api_key, base_url=self.base_url)
response = client.chat.completions.create( response = client.chat.completions.create(
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment