"Launch the server in your terminal and wait for it to initialize.\n",
"Launch the server in your terminal and wait for it to initialize.\n",
"\n",
"\n",
"**Remember to add** `--chat-template llama_3_vision` **to specify the vision chat template; otherwise the server will only support text, and performance may degrade.**\n",
"\n",
"\n",
"We need to specify `--chat-template` for vision language models because the chat template provided in Hugging Face tokenizer only supports text."
"We need to specify `--chat-template` for vision language models because the chat template provided in Hugging Face tokenizer only supports text."
@@ -52,7 +52,7 @@ Please consult the documentation below to learn more about the parameters you ma
* `chat_template`: The chat template to use. Deviating from the default might lead to unexpected responses. For multi-modal chat templates, refer to [here](https://docs.sglang.ai/backend/openai_api_vision.html#Chat-Template).
* `is_embedding`: Set to true to perform [embedding](https://docs.sglang.ai/backend/openai_api_embeddings.html) / [encode](https://docs.sglang.ai/backend/native_api.html#Encode-(embedding-model)) and [reward](https://docs.sglang.ai/backend/native_api.html#Classify-(reward-model)) tasks.
* `revision`: Adjust if a specific version of the model should be used.
* `skip_tokenizer_init`: Set to true to provide the tokens to the engine and get the output tokens directly, typically used in RLHF. Please see this [example for reference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/input_ids.py); a minimal sketch also follows this list.
* `json_model_override_args`: Override model config with the provided JSON.
* `delete_ckpt_after_loading`: Delete the model checkpoint after loading the model.
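As a sketch of `skip_tokenizer_init`, assuming the offline `sgl.Engine` entry point used in the linked example: the caller tokenizes outside the engine, passes token IDs in, and receives generated token IDs back.

```python
import sglang as sgl
from transformers import AutoTokenizer

# A minimal sketch, assuming the offline Engine API from the linked example.
# The model path is a placeholder; use the model you actually serve.
model_path = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer("The capital of France is")["input_ids"]

llm = sgl.Engine(model_path=model_path, skip_tokenizer_init=True)

# With the tokenizer skipped, generate() consumes token IDs directly.
outputs = llm.generate(
    input_ids=[input_ids],
    sampling_params={"temperature": 0, "max_new_tokens": 16},
)
# The returned dict carries the generated token IDs; decode them yourself.
print(outputs[0])
```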
@@ -27,11 +27,13 @@ The router supports two working modes:
This will be a drop-in replacement for the existing `--dp-size` argument of SGLang Runtime. Under the hood, it uses multiple processes to launch the workers, waits for them to be ready, and then connects the router to all of them.
After the server is ready, you can send requests to the router in the same way as you would to a single worker.
Please adjust the batch size accordingly to achieve maximum throughput.
```python
import requests
...
@@ -47,7 +49,7 @@ print(response.json())
This is useful for multi-node DP. First, launch workers on multiple nodes, then launch a router on the main node, and connect the router to all workers.
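As a hedged sketch of this separate-launch mode, assuming the `sglang_router.launch_router` entry point and its `--worker-urls` flag, the main node attaches a router to already-running workers; the worker URLs below are placeholders for your own nodes.

```python
import subprocess

# A sketch: workers already run on other nodes, and the main node
# starts a router pointing at their URLs (placeholders shown here).
router = subprocess.Popen([
    "python", "-m", "sglang_router.launch_router",
    "--worker-urls",
    "http://worker-node-1:30000",
    "http://worker-node-2:30000",
])
```

Once the router is up, clients send requests to the router's address, and it balances them across the connected workers.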