Unverified Commit be7986e0 authored by Lianmin Zheng, committed by GitHub

Fix docs (#1890)

parent 5a5f1843
......@@ -127,7 +127,7 @@ You can view the full example [here](https://github.com/sgl-project/sglang/tree/
## Supported Models
**Generative Models**
- Llama / Llama 2 / Llama 3 / Llama 3.1
- Llama / Llama 2 / Llama 3 / Llama 3.1 / Llama 3.2
- Mistral / Mixtral / Mistral NeMo
- Gemma / Gemma 2
- Qwen / Qwen 2 / Qwen 2 MoE / Qwen 2 VL
......
......@@ -5,7 +5,6 @@
"metadata": {},
"source": [
"# Native APIs\n",
"\n",
"Apart from the OpenAI compatible APIs, the SGLang Runtime also provides its native server APIs. We introduce these following APIs:\n",
"\n",
"- `/generate`\n",
......@@ -40,7 +39,6 @@
" terminate_process,\n",
" print_highlight,\n",
")\n",
"import subprocess, json\n",
"\n",
"server_process = execute_shell_command(\n",
"\"\"\"\n",
......@@ -56,8 +54,7 @@
"metadata": {},
"source": [
"## Generate\n",
"\n",
"Used to generate completion from the model, similar to the `/v1/completions` API in OpenAI. Detailed parameters can be found in the [sampling parameters](https://sgl-project.github.io/references/sampling_params.html)."
"Generate completions. This is similar to the `/v1/completions` in OpenAI API. Detailed parameters can be found in the [sampling parameters](https://sgl-project.github.io/references/sampling_params.html)."
]
},
{
......@@ -72,7 +69,7 @@
"data = {\"text\": \"What is the capital of France?\"}\n",
"\n",
"response = requests.post(url, json=data)\n",
"print_highlight(response.text)"
"print_highlight(response.json())"
]
},
{
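The `/generate` payload can also carry sampling options. A minimal sketch, assuming the server launched above listens on port 30000 and that the `sampling_params` keys (`temperature`, `max_new_tokens`) follow the sampling parameters document linked above:

```python
import requests

# Assumptions: the server listens on port 30000, and the payload accepts a
# "sampling_params" object whose keys follow the sampling parameters doc.
url = "http://localhost:30000/generate"
data = {
    "text": "What is the capital of France?",
    "sampling_params": {"temperature": 0, "max_new_tokens": 32},
}

response = requests.post(url, json=data)
print(response.json())
```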
......@@ -80,8 +77,7 @@
"metadata": {},
"source": [
"## Get Server Args\n",
"\n",
"Used to get the serving args when the server is launched."
"Get the arguments of a server."
]
},
{
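A minimal sketch of the corresponding request, assuming the endpoint name matches the section title and the server listens on port 30000:

```python
import requests

# Assumption: the endpoint name follows the section title ("Get Server Args")
# and the server listens on port 30000.
response = requests.get("http://localhost:30000/get_server_args")
print(response.json())
```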
......@@ -102,7 +98,7 @@
"source": [
"## Get Model Info\n",
"\n",
"Used to get the model info.\n",
"Get the information of the model.\n",
"\n",
"- `model_path`: The path/name of the model.\n",
"- `is_generation`: Whether the model is used as generation model or embedding model."
......@@ -120,7 +116,7 @@
"response_json = response.json()\n",
"print_highlight(response_json)\n",
"assert response_json[\"model_path\"] == \"meta-llama/Llama-3.2-1B-Instruct\"\n",
"assert response_json[\"is_generation\"] == True\n",
"assert response_json[\"is_generation\"] is True\n",
"assert response_json.keys() == {\"model_path\", \"is_generation\"}"
]
},
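For completeness, a minimal sketch of the request that produces `response_json` above, assuming a plain GET to `/get_model_info` on port 30000:

```python
import requests

# Assumption: the server listens on port 30000; /get_model_info returns the
# "model_path" and "is_generation" fields checked by the assertions above.
response = requests.get("http://localhost:30000/get_model_info")
response_json = response.json()
print(response_json)
```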
......@@ -128,8 +124,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Health and Health Generate\n",
"\n",
"## Health Check\n",
"- `/health`: Check the health of the server.\n",
"- `/health_generate`: Check the health of the server by generating one token."
]
......@@ -164,7 +159,7 @@
"source": [
"## Flush Cache\n",
"\n",
"Used to flush the radix cache. It will be automatically triggered when the model weights are updated by the `/update_weights` API."
"Flush the radix cache. It will be automatically triggered when the model weights are updated by the `/update_weights` API."
]
},
{
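The cache can also be flushed manually. A minimal sketch, assuming the endpoint is `/flush_cache` (per the section title), it accepts a body-less POST, and the server listens on port 30000:

```python
import requests

# Assumption: /flush_cache takes an empty POST and returns a short status message.
response = requests.post("http://localhost:30000/flush_cache")
print(response.status_code, response.text)
```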
......@@ -259,7 +254,7 @@
"source": [
"## Encode\n",
"\n",
"Used to encode text into embeddings. Note that this API is only available for [embedding models](./openai_embedding_api.ipynb) and will raise an error for generation models.\n",
"Encode text into embeddings. Note that this API is only available for [embedding models](./openai_embedding_api.ipynb) and will raise an error for generation models.\n",
"Therefore, we launch a new server to server an embedding model.\n"
]
},
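Once that embedding server is up, a minimal sketch of an `/encode` call, assuming the endpoint name matches the section title, the server listens on port 30000, and the payload takes `model` and `text` fields (the model name here mirrors the embedding example used elsewhere in these docs):

```python
import requests

# Assumptions: the embedding server listens on port 30000, the endpoint is
# /encode, and the payload takes "model" and "text" fields.
response = requests.post(
    "http://localhost:30000/encode",
    json={"model": "Alibaba-NLP/gte-Qwen2-7B-instruct", "text": "Once upon a time"},
)
print(response.json())
```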
......
......@@ -24,7 +24,7 @@
"\n",
"```bash\n",
"python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct \\\n",
" --port 30010 --host 0.0.0.0 --is-embedding\n",
" --port 30000 --host 0.0.0.0 --is-embedding\n",
"```\n",
"\n",
"Remember to add `--is-embedding` to the command."
......@@ -53,11 +53,11 @@
"embedding_process = execute_shell_command(\n",
" \"\"\"\n",
"python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct \\\n",
" --port 30010 --host 0.0.0.0 --is-embedding\n",
" --port 30000 --host 0.0.0.0 --is-embedding\n",
"\"\"\"\n",
")\n",
"\n",
"wait_for_server(\"http://localhost:30010\")"
"wait_for_server(\"http://localhost:30000\")"
]
},
{
......@@ -84,7 +84,7 @@
"\n",
"text = \"Once upon a time\"\n",
"\n",
"curl_text = f\"\"\"curl -s http://localhost:30010/v1/embeddings \\\n",
"curl_text = f\"\"\"curl -s http://localhost:30000/v1/embeddings \\\n",
" -d '{{\"model\": \"Alibaba-NLP/gte-Qwen2-7B-instruct\", \"input\": \"{text}\"}}'\"\"\"\n",
"\n",
"text_embedding = json.loads(subprocess.check_output(curl_text, shell=True))[\"data\"][0][\n",
......@@ -112,7 +112,7 @@
"text = \"Once upon a time\"\n",
"\n",
"response = requests.post(\n",
" \"http://localhost:30010/v1/embeddings\",\n",
" \"http://localhost:30000/v1/embeddings\",\n",
" json={\n",
" \"model\": \"Alibaba-NLP/gte-Qwen2-7B-instruct\",\n",
" \"input\": text\n",
......@@ -146,7 +146,7 @@
"source": [
"import openai\n",
"\n",
"client = openai.Client(base_url=\"http://127.0.0.1:30010/v1\", api_key=\"None\")\n",
"client = openai.Client(base_url=\"http://127.0.0.1:30000/v1\", api_key=\"None\")\n",
"\n",
"# Text embedding example\n",
"response = client.embeddings.create(\n",
......@@ -189,7 +189,7 @@
"tokenizer = AutoTokenizer.from_pretrained(\"Alibaba-NLP/gte-Qwen2-7B-instruct\")\n",
"input_ids = tokenizer.encode(text)\n",
"\n",
"curl_ids = f\"\"\"curl -s http://localhost:30010/v1/embeddings \\\n",
"curl_ids = f\"\"\"curl -s http://localhost:30000/v1/embeddings \\\n",
" -d '{{\"model\": \"Alibaba-NLP/gte-Qwen2-7B-instruct\", \"input\": {json.dumps(input_ids)}}}'\"\"\"\n",
"\n",
"input_ids_embedding = json.loads(subprocess.check_output(curl_ids, shell=True))[\"data\"][\n",
......
......@@ -26,7 +26,7 @@
"\n",
"```bash\n",
"python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \\\n",
" --port 30010 --chat-template llama_3_vision\n",
" --port 30000 --chat-template llama_3_vision\n",
"```\n",
"in your terminal and wait for the server to be ready.\n",
"\n",
......@@ -50,11 +50,11 @@
"embedding_process = execute_shell_command(\n",
" \"\"\"\n",
"python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \\\n",
" --port=30010 --chat-template=llama_3_vision\n",
" --port=30000 --chat-template=llama_3_vision\n",
"\"\"\"\n",
")\n",
"\n",
"wait_for_server(\"http://localhost:30010\")"
"wait_for_server(\"http://localhost:30000\")"
]
},
{
......@@ -75,7 +75,7 @@
"import subprocess\n",
"\n",
"curl_command = \"\"\"\n",
"curl -s http://localhost:30010/v1/chat/completions \\\n",
"curl -s http://localhost:30000/v1/chat/completions \\\n",
" -d '{\n",
" \"model\": \"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n",
" \"messages\": [\n",
......@@ -118,7 +118,7 @@
"source": [
"import requests\n",
"\n",
"url = \"http://localhost:30010/v1/chat/completions\"\n",
"url = \"http://localhost:30000/v1/chat/completions\"\n",
"\n",
"data = {\n",
" \"model\": \"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n",
......@@ -161,7 +161,7 @@
"source": [
"from openai import OpenAI\n",
"\n",
"client = OpenAI(base_url=\"http://localhost:30010/v1\", api_key=\"None\")\n",
"client = OpenAI(base_url=\"http://localhost:30000/v1\", api_key=\"None\")\n",
"\n",
"response = client.chat.completions.create(\n",
" model=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n",
......@@ -205,7 +205,7 @@
"source": [
"from openai import OpenAI\n",
"\n",
"client = OpenAI(base_url=\"http://localhost:30010/v1\", api_key=\"None\")\n",
"client = OpenAI(base_url=\"http://localhost:30000/v1\", api_key=\"None\")\n",
"\n",
"response = client.chat.completions.create(\n",
" model=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n",
......
......@@ -25,7 +25,7 @@ If you see `decode out of memory happened` occasionally but not frequently, it i
Data parallelism is better for throughput. When there is enough GPU memory, always favor data parallelism.
### Avoid out-of-memory errors by tuning `--chunked-prefill-size`, `--mem-fraction-static`, `--max-running-requests`
If you see out of memory (OOM) errors, you can decrease these parameters.
If you see out of memory (OOM) errors, you can try to tune the following parameters.
If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096` or `2048`.
If OOM happens during decoding, try to decrease `--max-running-requests`.
You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
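As a concrete illustration, these flags can be combined in a single launch command. A minimal sketch with placeholder values (not recommendations):

```python
import subprocess

# Illustrative, hypothetical values only; tune them to your workload and hardware.
server = subprocess.Popen([
    "python", "-m", "sglang.launch_server",
    "--model-path", "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "--chunked-prefill-size", "4096",   # decrease if OOM happens during prefill
    "--max-running-requests", "128",    # decrease if OOM happens during decoding
    "--mem-fraction-static", "0.8",     # decrease to shrink the KV cache memory pool
])
```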
......
......@@ -2,12 +2,13 @@
This page lists some common errors and tips for fixing them.
## CUDA out of memory
If you see out of memory (OOM) errors, you can try to tune the following parameters.
If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096` or `2048`.
If OOM happens during decoding, try to decrease `--max-running-requests`.
You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
## CUDA error: an illegal memory access was encountered
This error may be due to kernel errors or out-of-memory issues.
- If it is a kernel error, it is not easy to fix.
- If it is out-of-memory, sometimes it will report this error instead of "Out-of-memory." In this case, try setting a smaller value for `--mem-fraction-static`. The default value of `--mem-fraction-static` is around 0.8 - 0.9.
## The server hangs
If the server hangs, try disabling some optimizations when launching the server.
- Add `--disable-cuda-graph`.
- Add `--sampling-backend pytorch`.
- If it is a kernel error, it is not easy to fix. Please file an issue on GitHub.
- If it is out-of-memory, sometimes it will report this error instead of "Out-of-memory." Please refer to the section above to avoid OOM.
......@@ -70,7 +70,7 @@
"\n",
"curl_command = \"\"\"\n",
"curl -s http://localhost:30000/v1/chat/completions \\\n",
" -d '{\"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\", \"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"What is a LLM?\"}]}'\n",
" -d '{\"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\", \"messages\": [{\"role\": \"user\", \"content\": \"What is the capital of France?\"}]}'\n",
"\"\"\"\n",
"\n",
"response = json.loads(subprocess.check_output(curl_command, shell=True))\n",
......@@ -104,8 +104,7 @@
"data = {\n",
" \"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
" \"messages\": [\n",
" {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n",
" {\"role\": \"user\", \"content\": \"What is a LLM?\"}\n",
" {\"role\": \"user\", \"content\": \"What is the capital of France?\"}\n",
" ]\n",
"}\n",
"\n",
......@@ -140,7 +139,6 @@
"response = client.chat.completions.create(\n",
" model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": \"You are a helpful AI assistant\"},\n",
" {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
" ],\n",
" temperature=0,\n",
......@@ -170,7 +168,6 @@
"response = client.chat.completions.create(\n",
" model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": \"You are a helpful AI assistant\"},\n",
" {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
" ],\n",
" temperature=0,\n",
......