Unverified Commit 77395154 authored by Lianmin Zheng, committed by GitHub

Fix logprob_start_len for multi modal models (#2597)


Co-authored-by: libra <lihu723@gmail.com>
Co-authored-by: fzyzcjy <ch271828n@outlook.com>
Co-authored-by: Wang, Haoyu <haoyu.wang@intel.com>
parent 637de9e8
@@ -9,12 +9,16 @@ Special thanks to Meituan's Search & Recommend Platform Team and Baseten's Model
 ## Hardware Recommendation
 - 8 x NVIDIA H200 GPUs
+If you do not have GPUs with large enough memory, please try multi-node tensor parallelism ([help 1](https://github.com/sgl-project/sglang/blob/637de9e8ce91fd3e92755eb2a842860925954ab1/docs/backend/backend.md?plain=1#L88-L95) [help 2](https://github.com/sgl-project/sglang/blob/637de9e8ce91fd3e92755eb2a842860925954ab1/docs/backend/backend.md?plain=1#L152-L168)).
 ## Installation & Launch
+If you see errors when launching the server, please check if it has finished downloading the weights. It is recommended to download the weights before launching, or to launch multiple times until all the weights have been downloaded.
 ### Using Docker (Recommended)
 ```bash
 docker run --gpus all --shm-size 32g -p 30000:30000 -v ~/.cache/huggingface:/root/.cache/huggingface --ipc=host lmsysorg/sglang:latest \
-    python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3-Base --enable-dp-attention --tp 8 --trust-remote-code --port 30000
+    python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --enable-dp-attention --tp 8 --trust-remote-code --port 30000
 ```
 ### Using pip
@@ -23,11 +27,9 @@ docker run --gpus all --shm-size 32g -p 30000:30000 -v ~/.cache/huggingface:/roo
 pip install "sglang[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer
 # Launch
-python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3-Base --enable-dp-attention --tp 8 --trust-remote-code
+python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --enable-dp-attention --tp 8 --trust-remote-code
 ```
-If you see errors when launching the server, please check if it has finished downloading the weights. It is recommended to download the weights before launching, or to launch multiple times until all the weights have been downloaded.
 ### Example with OpenAI API
 ```python3
@@ -48,7 +50,7 @@ response = client.chat.completions.create(
 print(response)
 ```
-## DeepSeek V3 optimization plan
+## DeepSeek V3 Optimization Plan
 https://github.com/sgl-project/sglang/issues/2591
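The OpenAI-compatible example above is elided by the diff view (only its opening fence and the closing `print(response)` are visible). For orientation, here is a minimal client sketch consistent with the launch commands; the model name, port, and prompt are assumptions, not the file's actual contents:

```python
# Hedged sketch of an OpenAI-compatible call against the server launched above.
# "EMPTY" is the usual placeholder API key for local servers; the model name,
# port, and prompt are assumptions, since the diff view elides the real example.
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",  # assumed: the model launched above
    messages=[{"role": "user", "content": "List 3 countries and their capitals."}],
    temperature=0,
    max_tokens=64,
)
print(response)
```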
...
@@ -223,7 +223,7 @@
 "## Structured decoding (JSON, Regex)\n",
 "You can define a JSON schema or regular expression to constrain the model's output. The output is guaranteed to follow the given constraints; how they are enforced depends on the grammar backend.\n",
 "\n",
-"SGlang has two backends: outlines (default) and Xgrammar. Xgrammar enhances JSON decoding performance but does not support regular expressions. To use Xgrammar, add the `--grammar-backend xgrammar` when launching the server:\n",
+"SGLang has two backends: [Outlines](https://github.com/dottxt-ai/outlines) (default) and [XGrammar](https://blog.mlc.ai/2024/11/22/achieving-efficient-flexible-portable-structured-generation-with-xgrammar). XGrammar accelerates JSON decoding but does not support regular expressions. To use XGrammar, add the `--grammar-backend xgrammar` flag when launching the server:\n",
 "\n",
 "```bash\n",
 "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
...
 # Sampling Parameters in SGLang Runtime
 This doc describes the sampling parameters of the SGLang Runtime.
 It is the low-level endpoint of the runtime.
-If you want a high-level endpoint that can automatically handle chat templates, consider using the [OpenAI Compatible API
-](https://github.com/sgl-project/sglang?tab=readme-ov-file#openai-compatible-api).
+If you want a high-level endpoint that can automatically handle chat templates, consider using the [OpenAI Compatible API](../backend/openai_api_completions.ipynb).
 The `/generate` endpoint accepts the following arguments in JSON format.
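As a concrete illustration of that low-level endpoint, a minimal call could look like the following sketch (the server address and parameter values are assumptions; the argument names follow the doc being edited):

```python
# Minimal sketch of a raw /generate call with explicit sampling parameters.
# The server address is an assumption; launch an SGLang server first.
import requests

response = requests.post(
    "http://127.0.0.1:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0.7,
            "top_p": 0.9,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())
```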
...
@@ -565,7 +565,7 @@ class Scheduler:
         if req.logprob_start_len == -1:
             # By default, only return the logprobs for output tokens
-            req.logprob_start_len = len(recv_req.input_ids) - 1
+            req.logprob_start_len = len(req.origin_input_ids) - 1
         # Truncate prompts that are too long
         if len(req.origin_input_ids) > self.max_req_input_len:
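Why the one-line change above matters: for multimodal models, the `input_ids` the scheduler receives are typically expanded with image placeholder tokens, so `len(recv_req.input_ids)` can far exceed the original prompt length, while `req.origin_input_ids` preserves it. A toy illustration (hypothetical token counts, not the real scheduler code):

```python
# Toy illustration of the bug (hypothetical counts, not real scheduler code).
origin_input_ids = list(range(10))                # original prompt: 10 tokens
padded_input_ids = origin_input_ids + [0] * 576   # expanded with image placeholders

# Before the fix: derived from the padded ids, pointing far past the prompt.
old_logprob_start_len = len(padded_input_ids) - 1   # 585

# After the fix: derived from the original ids, i.e. the last prompt token,
# so "only return the logprobs for output tokens" behaves as the comment says.
new_logprob_start_len = len(origin_input_ids) - 1   # 9
```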
...