Unverified Commit 7b394e5f authored by Lianmin Zheng, committed by GitHub

Fix docs (#1889)

parent 3b60558d
......@@ -6,10 +6,12 @@
"source": [
"# OpenAI APIs - Embedding\n",
"\n",
"SGLang supports embedding models in the same way as completion models. Here are some example models:\n",
"SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models.\n",
"A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/guides/embeddings).\n",
"\n",
"- [intfloat/e5-mistral-7b-instruct](https://huggingface.co/intfloat/e5-mistral-7b-instruct)\n",
"- [Alibaba-NLP/gte-Qwen2-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct)\n"
"This tutorial covers the embedding APIs for embedding models, such as \n",
"- [intfloat/e5-mistral-7b-instruct](https://huggingface.co/intfloat/e5-mistral-7b-instruct) \n",
"- [Alibaba-NLP/gte-Qwen2-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct) \n"
]
},
{
......@@ -96,7 +98,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using OpenAI Compatible API w/ Requests"
"## Using Python Requests"
]
},
{
......
......@@ -107,7 +107,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using OpenAI Compatible API w/ Requests"
"## Using Python Requests"
]
},
{
......@@ -150,9 +150,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using OpenAI Python Client\n",
"\n",
"Also, you can use the OpenAI Python API library to send requests."
"## Using OpenAI Python Client"
]
},
{
......
......@@ -25,7 +25,7 @@ The core features include:
backend/openai_api_completions.ipynb
backend/openai_api_vision.ipynb
backend/openai_embedding_api.ipynb
backend/openai_api_embeddings.ipynb
backend/native_api.ipynb
backend/backend.md
......
......@@ -177,252 +177,3 @@ print(response.json())
The `image_data` can be a file name, a URL, or a base64 encoded string. See also `python/sglang/srt/utils.py:load_image`.
Streaming is supported in a similar manner to the example [above](#streaming).
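For illustration, here is a minimal sketch of passing `image_data` in a native `/generate` request. It assumes a vision-capable model is being served on `localhost:30000`; the prompt format and image URL are placeholders, not values from this document.

```python
import requests

# Minimal sketch (assumptions: a vision-capable model is served on localhost:30000,
# and the prompt text and URL below are placeholders). As noted above, image_data
# may also be a local file name or a base64-encoded string.
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "<image>\nDescribe this picture in a few words.",
        "image_data": "https://example.com/cat.png",
        "sampling_params": {"temperature": 0, "max_new_tokens": 32},
    },
)
print(response.json())
```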
## Performance Implications of Penalties
While you can apply penalties by supplying the relevant `sampling_params`, this comes with some drawbacks.
These drawbacks apply to every single request in the same batch, because the penalizers also operate on the whole batch.
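As a concrete illustration, here is a minimal sketch of supplying penalty parameters through `sampling_params` on the native `/generate` endpoint (assuming a server on `localhost:30000`); the values mirror those used in the benchmarks below.

```python
import requests

# Sketch: applying penalties via sampling_params (assumes a server on localhost:30000).
# Note that the extra penalizer overhead affects every request in the same batch.
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
            "frequency_penalty": 1.1,
            "presence_penalty": 1.1,
            "min_new_tokens": 5,
        },
    },
)
print(response.json())
```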
### Latency
While the penalty algorithms are computed through CUDA, they still add computation on top of the basic sampling logic. For detailed overhead numbers, we recommend running your own benchmarks, but the samples below give a glimpse.
### Memory
Since the penalty algorithms are computed through CUDA, the logic stores the relevant parameters on the GPU. The footprint usually scales with `vocab_size` multiplied by `running_requests`.
Run your own benchmark with the desired parameters on your own hardware to make sure it does not run out of memory before deploying.
Tuning `--mem-fraction-static` and/or `--max-running-requests` will help.
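As a rough, back-of-the-envelope illustration of that scale (the dtype and per-penalizer layout below are assumptions, not taken from the implementation):

```python
# Back-of-the-envelope estimate only; dtype and per-penalizer layout are assumptions.
vocab_size = 128_256            # e.g., a Llama-3.1 tokenizer vocabulary
max_running_requests = 256
bytes_per_value = 4             # assuming float32 state

mib = vocab_size * max_running_requests * bytes_per_value / 1024**2
print(f"~{mib:.0f} MiB per penalizer state tensor")  # ~125 MiB
```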
### Benchmarks
All the benchmarks below were run on an NVIDIA H100 SXM5.
<details>
#### Baseline
Measured at [dc9d06d886151707f97d0b78095df9de262fd3c9](https://github.com/sgl-project/sglang/commit/dc9d06d886151707f97d0b78095df9de262fd3c9).
```
$ python3 -m sglang.bench_serving --backend sglang --port 8413 --dataset-name random --num-prompts 3000 --random-input 256 --random-output 512
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 3000
Benchmark duration (s): 66.11
Total input tokens: 378633
Total generated tokens: 775651
Total generated tokens (retokenized): 775118
Request throughput (req/s): 45.38
Input token throughput (tok/s): 5727.04
Output token throughput (tok/s): 11732.16
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 40881.94
Median E2E Latency (ms): 43967.10
---------------Time to First Token----------------
Mean TTFT (ms): 19884.75
Median TTFT (ms): 14226.56
P99 TTFT (ms): 47738.97
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 91.96
Median TPOT (ms): 90.11
P99 TPOT (ms): 308.54
---------------Inter-token Latency----------------
Mean ITL (ms): 174.54
Median ITL (ms): 58.56
P99 ITL (ms): 440.18
==================================================
```
#### All Together
```
$ python3 -m sglang.bench_serving --backend sglang --port 8413 --dataset-name random --num-prompts 3000 --random-input 256 --random-output 512 --extra-request-body '{
"frequency_penalty": 1.1,
"presence_penalty": 1.1,
"repetition_penalty": 0.1,
"min_new_tokens": 5
}'
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 3000
Benchmark duration (s): 78.35
Total input tokens: 378633
Total generated tokens: 775651
Total generated tokens (retokenized): 774756
Request throughput (req/s): 38.29
Input token throughput (tok/s): 4832.86
Output token throughput (tok/s): 9900.39
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 49017.68
Median E2E Latency (ms): 52825.70
---------------Time to First Token----------------
Mean TTFT (ms): 23892.60
Median TTFT (ms): 18895.47
P99 TTFT (ms): 57426.01
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 114.54
Median TPOT (ms): 107.27
P99 TPOT (ms): 293.31
---------------Inter-token Latency----------------
Mean ITL (ms): 205.68
Median ITL (ms): 73.97
P99 ITL (ms): 453.86
==================================================
```
#### Frequency Penalty
```
$ python3 -m sglang.bench_serving --backend sglang --port 8413 --dataset-name random --num-prompts 3000 --random-input 256 --random-output 512 --extra-request-body '{
"frequency_penalty": 1.1
}'
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 3000
Benchmark duration (s): 72.72
Total input tokens: 378633
Total generated tokens: 775651
Total generated tokens (retokenized): 774955
Request throughput (req/s): 41.26
Input token throughput (tok/s): 5206.84
Output token throughput (tok/s): 10666.51
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 45445.56
Median E2E Latency (ms): 48960.39
---------------Time to First Token----------------
Mean TTFT (ms): 22363.16
Median TTFT (ms): 17125.02
P99 TTFT (ms): 52920.95
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 104.71
Median TPOT (ms): 98.30
P99 TPOT (ms): 268.06
---------------Inter-token Latency----------------
Mean ITL (ms): 191.60
Median ITL (ms): 67.83
P99 ITL (ms): 455.46
==================================================
```
#### Presence Penalty
```
$ python3 -m sglang.bench_serving --backend sglang --port 8413 --dataset-name random --num-prompts 3000 --random-input 256 --random-output 512 --extra-request-body '{
"presence_penalty": 1.1
}'
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 3000
Benchmark duration (s): 72.04
Total input tokens: 378633
Total generated tokens: 775651
Total generated tokens (retokenized): 775210
Request throughput (req/s): 41.64
Input token throughput (tok/s): 5255.98
Output token throughput (tok/s): 10767.18
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 44926.61
Median E2E Latency (ms): 48302.88
---------------Time to First Token----------------
Mean TTFT (ms): 22095.39
Median TTFT (ms): 16740.93
P99 TTFT (ms): 52554.03
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 103.54
Median TPOT (ms): 97.37
P99 TPOT (ms): 271.86
---------------Inter-token Latency----------------
Mean ITL (ms): 189.86
Median ITL (ms): 68.45
P99 ITL (ms): 447.11
==================================================
```
#### Repetition Penalty
```
$ python3 -m sglang.bench_serving --backend sglang --port 8413 --dataset-name random --num-prompts 3000 --random-input 256 --random-output 512 --extra-request-body '{
"repetition_penalty": 0.1
}'
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 3000
Benchmark duration (s): 74.54
Total input tokens: 378633
Total generated tokens: 775651
Total generated tokens (retokenized): 766008
Request throughput (req/s): 40.24
Input token throughput (tok/s): 5079.36
Output token throughput (tok/s): 10405.35
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 46530.38
Median E2E Latency (ms): 50302.65
---------------Time to First Token----------------
Mean TTFT (ms): 22603.47
Median TTFT (ms): 17167.08
P99 TTFT (ms): 54497.85
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 117.59
Median TPOT (ms): 101.79
P99 TPOT (ms): 320.04
---------------Inter-token Latency----------------
Mean ITL (ms): 195.26
Median ITL (ms): 69.51
P99 ITL (ms): 433.86
==================================================
```
#### Min New Tokens
The min-new-tokens penalizer stays active until the generation process reaches the given `min_new_tokens`.
Unlike the other penalizers, setting this to a higher value has a larger latency impact.
```
$ python3 -m sglang.bench_serving --backend sglang --port 8413 --dataset-name random --num-prompts 3000 --random-input 256 --random-output 512 --extra-request-body '{
"min_new_tokens": 5
}'
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 3000
Benchmark duration (s): 66.94
Total input tokens: 378633
Total generated tokens: 775651
Total generated tokens (retokenized): 775220
Request throughput (req/s): 44.81
Input token throughput (tok/s): 5656.13
Output token throughput (tok/s): 11586.90
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 41888.55
Median E2E Latency (ms): 45354.16
---------------Time to First Token----------------
Mean TTFT (ms): 20866.91
Median TTFT (ms): 16219.79
P99 TTFT (ms): 49263.91
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 97.05
Median TPOT (ms): 89.76
P99 TPOT (ms): 233.50
---------------Inter-token Latency----------------
Mean ITL (ms): 179.17
Median ITL (ms): 55.08
P99 ITL (ms): 409.12
==================================================
```
</details>
......@@ -97,5 +97,5 @@ sky status --endpoint 30000 sglang
## Common Notes
- [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding `--attention-backend triton --sampling-backend pytorch` and open an issue on GitHub.
- If you only need to use the OpenAI backend, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.
- If you only need to use OpenAI models with the frontend language, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.
- The language frontend operates independently of the backend runtime. You can install the frontend locally without needing a GPU, while the backend can be set up on a GPU-enabled machine. To install the frontend, run `pip install sglang`, and for the backend, use `pip install sglang[srt]`. This allows you to build SGLang programs locally and execute them by connecting to the remote backend.
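For example, a frontend-only program can connect to a remote backend roughly as in the sketch below; the endpoint URL is a placeholder, and it assumes the backend was installed with `pip install sglang[srt]` and launched separately on the GPU machine.

```python
import sglang as sgl

# Sketch: the frontend runs locally (no GPU needed) and connects to a remote
# backend; the URL below is a placeholder for your GPU server's endpoint.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://my-gpu-server:30000"))

@sgl.function
def qa(s, question):
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=64))

state = qa.run(question="List 3 countries and their capitals.")
print(state["answer"])
```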
......@@ -5,7 +5,6 @@
"metadata": {},
"source": [
"# Quick Start: Sending Requests\n",
"\n",
"This notebook provides a quick-start guide for using SGLang after installation."
]
},
......@@ -14,7 +13,6 @@
"metadata": {},
"source": [
"## Launch a server\n",
"\n",
"This code block is equivalent to executing \n",
"\n",
"```bash\n",
......@@ -83,7 +81,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using OpenAI Compatible API w/ Requests"
"## Using Requests"
]
},
{
......@@ -119,9 +117,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using OpenAI Python Client\n",
"\n",
"You can also use the OpenAI Python API library to send requests."
"## Using OpenAI Python Client"
]
},
{
......@@ -153,6 +149,41 @@
"print_highlight(response)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Streaming"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import openai\n",
"\n",
"client = openai.Client(base_url=\"http://127.0.0.1:30000/v1\", api_key=\"None\")\n",
"\n",
"# Use stream=True for streaming responses\n",
"response = client.chat.completions.create(\n",
" model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": \"You are a helpful AI assistant\"},\n",
" {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
" ],\n",
" temperature=0,\n",
" max_tokens=64,\n",
" stream=True,\n",
")\n",
"\n",
"# Handle the streaming output\n",
"for chunk in response:\n",
" if chunk.choices[0].delta.content:\n",
" print(chunk.choices[0].delta.content, end='', flush=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
......@@ -184,6 +215,46 @@
"print_highlight(response.json())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Streaming"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests, json\n",
"\n",
"response = requests.post(\n",
" \"http://localhost:30000/generate\",\n",
" json={\n",
" \"text\": \"The capital of France is\",\n",
" \"sampling_params\": {\n",
" \"temperature\": 0,\n",
" \"max_new_tokens\": 32,\n",
" },\n",
" \"stream\": True,\n",
" },\n",
" stream=True,\n",
")\n",
"\n",
"prev = 0\n",
"for chunk in response.iter_lines(decode_unicode=False):\n",
" chunk = chunk.decode(\"utf-8\")\n",
" if chunk and chunk.startswith(\"data:\"):\n",
" if chunk == \"data: [DONE]\":\n",
" break\n",
" data = json.loads(chunk[5:].strip(\"\\n\"))\n",
" output = data[\"text\"]\n",
" print(output[prev:], end=\"\", flush=True)\n",
" prev = len(output)"
]
},
{
"cell_type": "code",
"execution_count": 6,
......