Unverified Commit 441c22db authored by Yineng Zhang, committed by GitHub

doc: update backend (#1486)

parent ce636ac4
@@ -19,7 +19,7 @@ curl http://localhost:30000/generate \
```
  }
}'
```
Learn more about the argument format [here](https://sglang.readthedocs.io/en/latest/sampling_params.html).
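The same call can be made from Python; a minimal sketch, assuming the server is running on `localhost:30000` and using the `text`/`sampling_params` payload shape from the curl example above (the prompt string and parameter values are illustrative):
```python
# A minimal sketch of calling the /generate endpoint from Python.
# Assumes the server launched above is listening on localhost:30000;
# payload keys follow the curl example ("text", "sampling_params").
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Once upon a time,",
        "sampling_params": {"max_new_tokens": 16, "temperature": 0},
    },
)
print(response.json())
```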
### OpenAI Compatible API
In addition, the server supports OpenAI-compatible APIs.
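For example, the official `openai` Python client can be pointed at the local server. A minimal sketch, assuming `openai>=1.0` is installed and the Llama-3 server from the examples above is running; the `model` field should name the served model:
```python
# A minimal sketch using the openai client against the local
# OpenAI-compatible endpoint (assumes openai>=1.0 is installed).
import openai

client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "List 3 countries and their capitals."}],
    temperature=0,
    max_tokens=64,
)
print(response.choices[0].message.content)
```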
@@ -73,7 +73,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --mem-fraction-static 0.7
```
- See [hyperparameter tuning](https://sglang.readthedocs.io/en/latest/hyperparameter_tuning.html) for guidance on tuning hyperparameters for better performance.
- If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
```
@@ -81,7 +81,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes.
- To enable fp8 weight quantization, add `--quantization fp8` on an fp16 checkpoint, or directly load an fp8 checkpoint without specifying any arguments.
- To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](https://sglang.readthedocs.io/en/latest/custom_chat_template.html).
- To run tensor parallelism on multiple nodes, add `--nnodes 2`. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port; the paired launch commands are sketched below.
```
# Node 0
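# The diff view truncates this block here; what follows is a hedged sketch
# of the two-node launch, not the verbatim original. The rendezvous flags
# (--nccl-init-addr, --node-rank) are assumptions about this version's CLI;
# confirm with `python -m sglang.launch_server --help`.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init-addr sgl-dev-0:50000 --nnodes 2 --node-rank 0

# Node 1
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init-addr sgl-dev-0:50000 --nnodes 2 --node-rank 1
```
The exact spelling of the rendezvous-address flag has changed across sglang versions, so treat the sketch above as a template rather than a verbatim command.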
@@ -122,7 +122,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
- gte-Qwen2
  - `python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct --is-embedding`
Instructions for supporting a new model are [here](https://sglang.readthedocs.io/en/latest/model_support.html).
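Once an embedding model is served with `--is-embedding`, it can be queried through the OpenAI-compatible embeddings route. A minimal sketch, assuming the gte-Qwen2 server above is on `localhost:30000`, exposes `/v1/embeddings`, and `openai>=1.0` is installed:
```python
# A minimal sketch of querying the embedding server launched above.
# Assumes the OpenAI-compatible /v1/embeddings route on localhost:30000.
import openai

client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.embeddings.create(
    model="Alibaba-NLP/gte-Qwen2-7B-instruct",
    input="What is the capital of France?",
)
print(len(response.data[0].embedding))  # embedding dimensionality
```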
#### Use Models From ModelScope
<details>