Unverified Commit be7986e0 authored by Lianmin Zheng, committed by GitHub

Fix docs (#1890)

parent 5a5f1843
@@ -127,7 +127,7 @@ You can view the full example [here](https://github.com/sgl-project/sglang/tree/
 ## Supported Models
 **Generative Models**
-- Llama / Llama 2 / Llama 3 / Llama 3.1
+- Llama / Llama 2 / Llama 3 / Llama 3.1 / Llama 3.2
 - Mistral / Mixtral / Mistral NeMo
 - Gemma / Gemma 2
 - Qwen / Qwen 2 / Qwen 2 MoE / Qwen 2 VL
......
@@ -5,7 +5,6 @@
 "metadata": {},
 "source": [
 "# Native APIs\n",
-"\n",
 "Apart from the OpenAI compatible APIs, the SGLang Runtime also provides its native server APIs. We introduce the following APIs:\n",
 "\n",
 "- `/generate`\n",
@@ -40,7 +39,6 @@
 " terminate_process,\n",
 " print_highlight,\n",
 ")\n",
-"import subprocess, json\n",
 "\n",
 "server_process = execute_shell_command(\n",
 "\"\"\"\n",
@@ -56,8 +54,7 @@
 "metadata": {},
 "source": [
 "## Generate\n",
-"\n",
-"Used to generate completion from the model, similar to the `/v1/completions` API in OpenAI. Detailed parameters can be found in the [sampling parameters](https://sgl-project.github.io/references/sampling_params.html)."
+"Generate completions. This is similar to the `/v1/completions` endpoint in the OpenAI API. Detailed parameters can be found in the [sampling parameters](https://sgl-project.github.io/references/sampling_params.html)."
 ]
 },
 {
@@ -72,7 +69,7 @@
 "data = {\"text\": \"What is the capital of France?\"}\n",
 "\n",
 "response = requests.post(url, json=data)\n",
-"print_highlight(response.text)"
+"print_highlight(response.json())"
 ]
 },
 {
@@ -80,8 +77,7 @@
 "metadata": {},
 "source": [
 "## Get Server Args\n",
-"\n",
-"Used to get the serving args when the server is launched."
+"Get the arguments the server was launched with."
 ]
 },
 {
@@ -102,7 +98,7 @@
 "source": [
 "## Get Model Info\n",
 "\n",
-"Used to get the model info.\n",
+"Get the information of the model.\n",
 "\n",
 "- `model_path`: The path/name of the model.\n",
 "- `is_generation`: Whether the model is used as a generation model or an embedding model."
@@ -120,7 +116,7 @@
 "response_json = response.json()\n",
 "print_highlight(response_json)\n",
 "assert response_json[\"model_path\"] == \"meta-llama/Llama-3.2-1B-Instruct\"\n",
-"assert response_json[\"is_generation\"] == True\n",
+"assert response_json[\"is_generation\"] is True\n",
 "assert response_json.keys() == {\"model_path\", \"is_generation\"}"
 ]
 },
@@ -128,8 +124,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Health and Health Generate\n",
-"\n",
+"## Health Check\n",
 "- `/health`: Check the health of the server.\n",
 "- `/health_generate`: Check the health of the server by generating one token."
 ]
@@ -164,7 +159,7 @@
 "source": [
 "## Flush Cache\n",
 "\n",
-"Used to flush the radix cache. It will be automatically triggered when the model weights are updated by the `/update_weights` API."
+"Flush the radix cache. It will be automatically triggered when the model weights are updated by the `/update_weights` API."
 ]
 },
 {
@@ -259,7 +254,7 @@
 "source": [
 "## Encode\n",
 "\n",
-"Used to encode text into embeddings. Note that this API is only available for [embedding models](./openai_embedding_api.ipynb) and will raise an error for generation models.\n",
+"Encode text into embeddings. Note that this API is only available for [embedding models](./openai_embedding_api.ipynb) and will raise an error for generation models.\n",
 "Therefore, we launch a new server to serve an embedding model.\n"
 ]
 },
......
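For reference, a minimal sketch of a native `/generate` request with explicit sampling parameters, assuming a server is already listening on `localhost:30000`; the `temperature` and `max_new_tokens` values are illustrative only (see the sampling parameters page linked in the notebook for the full list):

```bash
# Assumes an SGLang server is already running on localhost:30000.
# The sampling parameters below are illustrative values, not recommendations.
curl -s http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "What is the capital of France?", "sampling_params": {"temperature": 0, "max_new_tokens": 32}}'
```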
@@ -24,7 +24,7 @@
 "\n",
 "```bash\n",
 "python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct \\\n",
-" --port 30010 --host 0.0.0.0 --is-embedding\n",
+" --port 30000 --host 0.0.0.0 --is-embedding\n",
 "```\n",
 "\n",
 "Remember to add `--is-embedding` to the command."
@@ -53,11 +53,11 @@
 "embedding_process = execute_shell_command(\n",
 " \"\"\"\n",
 "python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct \\\n",
-" --port 30010 --host 0.0.0.0 --is-embedding\n",
+" --port 30000 --host 0.0.0.0 --is-embedding\n",
 "\"\"\"\n",
 ")\n",
 "\n",
-"wait_for_server(\"http://localhost:30010\")"
+"wait_for_server(\"http://localhost:30000\")"
 ]
 },
 {
@@ -84,7 +84,7 @@
 "\n",
 "text = \"Once upon a time\"\n",
 "\n",
-"curl_text = f\"\"\"curl -s http://localhost:30010/v1/embeddings \\\n",
+"curl_text = f\"\"\"curl -s http://localhost:30000/v1/embeddings \\\n",
 " -d '{{\"model\": \"Alibaba-NLP/gte-Qwen2-7B-instruct\", \"input\": \"{text}\"}}'\"\"\"\n",
 "\n",
 "text_embedding = json.loads(subprocess.check_output(curl_text, shell=True))[\"data\"][0][\n",
@@ -112,7 +112,7 @@
 "text = \"Once upon a time\"\n",
 "\n",
 "response = requests.post(\n",
-" \"http://localhost:30010/v1/embeddings\",\n",
+" \"http://localhost:30000/v1/embeddings\",\n",
 " json={\n",
 " \"model\": \"Alibaba-NLP/gte-Qwen2-7B-instruct\",\n",
 " \"input\": text\n",
@@ -146,7 +146,7 @@
 "source": [
 "import openai\n",
 "\n",
-"client = openai.Client(base_url=\"http://127.0.0.1:30010/v1\", api_key=\"None\")\n",
+"client = openai.Client(base_url=\"http://127.0.0.1:30000/v1\", api_key=\"None\")\n",
 "\n",
 "# Text embedding example\n",
 "response = client.embeddings.create(\n",
@@ -189,7 +189,7 @@
 "tokenizer = AutoTokenizer.from_pretrained(\"Alibaba-NLP/gte-Qwen2-7B-instruct\")\n",
 "input_ids = tokenizer.encode(text)\n",
 "\n",
-"curl_ids = f\"\"\"curl -s http://localhost:30010/v1/embeddings \\\n",
+"curl_ids = f\"\"\"curl -s http://localhost:30000/v1/embeddings \\\n",
 " -d '{{\"model\": \"Alibaba-NLP/gte-Qwen2-7B-instruct\", \"input\": {json.dumps(input_ids)}}}'\"\"\"\n",
 "\n",
 "input_ids_embedding = json.loads(subprocess.check_output(curl_ids, shell=True))[\"data\"][\n",
......
@@ -26,7 +26,7 @@
 "\n",
 "```bash\n",
 "python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \\\n",
-" --port 30010 --chat-template llama_3_vision\n",
+" --port 30000 --chat-template llama_3_vision\n",
 "```\n",
 "in your terminal and wait for the server to be ready.\n",
 "\n",
@@ -50,11 +50,11 @@
 "embedding_process = execute_shell_command(\n",
 " \"\"\"\n",
 "python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \\\n",
-" --port=30010 --chat-template=llama_3_vision\n",
+" --port=30000 --chat-template=llama_3_vision\n",
 "\"\"\"\n",
 ")\n",
 "\n",
-"wait_for_server(\"http://localhost:30010\")"
+"wait_for_server(\"http://localhost:30000\")"
 ]
 },
 {
@@ -75,7 +75,7 @@
 "import subprocess\n",
 "\n",
 "curl_command = \"\"\"\n",
-"curl -s http://localhost:30010/v1/chat/completions \\\n",
+"curl -s http://localhost:30000/v1/chat/completions \\\n",
 " -d '{\n",
 " \"model\": \"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n",
 " \"messages\": [\n",
@@ -118,7 +118,7 @@
 "source": [
 "import requests\n",
 "\n",
-"url = \"http://localhost:30010/v1/chat/completions\"\n",
+"url = \"http://localhost:30000/v1/chat/completions\"\n",
 "\n",
 "data = {\n",
 " \"model\": \"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n",
@@ -161,7 +161,7 @@
 "source": [
 "from openai import OpenAI\n",
 "\n",
-"client = OpenAI(base_url=\"http://localhost:30010/v1\", api_key=\"None\")\n",
+"client = OpenAI(base_url=\"http://localhost:30000/v1\", api_key=\"None\")\n",
 "\n",
 "response = client.chat.completions.create(\n",
 " model=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n",
@@ -205,7 +205,7 @@
 "source": [
 "from openai import OpenAI\n",
 "\n",
-"client = OpenAI(base_url=\"http://localhost:30010/v1\", api_key=\"None\")\n",
+"client = OpenAI(base_url=\"http://localhost:30000/v1\", api_key=\"None\")\n",
 "\n",
 "response = client.chat.completions.create(\n",
 " model=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n",
......
@@ -25,7 +25,7 @@ If you see `decode out of memory happened` occasionally but not frequently, it i
 Data parallelism is better for throughput. When there is enough GPU memory, always favor data parallelism for throughput.
 ### Avoid out-of-memory by Tuning `--chunked-prefill-size`, `--mem-fraction-static`, `--max-running-requests`
-If you see out of memory (OOM) errors, you can decrease these parameters.
+If you see out of memory (OOM) errors, you can try to tune the following parameters.
 If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096` or `2048`.
 If OOM happens during decoding, try to decrease `--max-running-requests`.
 You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
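For illustration, a launch command that combines the three flags above might look like the following sketch; the model path and the values for `--max-running-requests` and `--mem-fraction-static` are placeholders to tune for your own hardware and workload:

```bash
# Illustrative only: lower these values further if OOM persists,
# raise them back if you have headroom and want more throughput.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --chunked-prefill-size 4096 \
  --max-running-requests 128 \
  --mem-fraction-static 0.8
```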
......
@@ -2,12 +2,13 @@
 This page lists some common errors and tips for fixing them.
+## CUDA out of memory
+If you see out of memory (OOM) errors, you can try to tune the following parameters.
+If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096` or `2048`.
+If OOM happens during decoding, try to decrease `--max-running-requests`.
+You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
 ## CUDA error: an illegal memory access was encountered
 This error may be due to kernel errors or out-of-memory issues.
-- If it is a kernel error, it is not easy to fix.
-- If it is out-of-memory, sometimes it will report this error instead of "Out-of-memory." In this case, try setting a smaller value for `--mem-fraction-static`. The default value of `--mem-fraction-static` is around 0.8 - 0.9.
+- If it is a kernel error, it is not easy to fix. Please file an issue on GitHub.
+- If it is out-of-memory, sometimes it will report this error instead of "Out-of-memory." Please refer to the section above to avoid OOM.
+## The server hangs
+If the server hangs, try disabling some optimizations when launching the server.
+- Add `--disable-cuda-graph`.
+- Add `--sampling-backend pytorch`.
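For illustration, a launch command with these two optimizations disabled might look like this sketch; the model path is a placeholder:

```bash
# Illustrative only: disable CUDA graphs and fall back to the PyTorch
# sampling backend to rule out these optimizations as the cause of a hang.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --disable-cuda-graph \
  --sampling-backend pytorch
```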
@@ -70,7 +70,7 @@
 "\n",
 "curl_command = \"\"\"\n",
 "curl -s http://localhost:30000/v1/chat/completions \\\n",
-" -d '{\"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\", \"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"What is a LLM?\"}]}'\n",
+" -d '{\"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\", \"messages\": [{\"role\": \"user\", \"content\": \"What is the capital of France?\"}]}'\n",
 "\"\"\"\n",
 "\n",
 "response = json.loads(subprocess.check_output(curl_command, shell=True))\n",
@@ -104,8 +104,7 @@
 "data = {\n",
 " \"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
 " \"messages\": [\n",
-" {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n",
-" {\"role\": \"user\", \"content\": \"What is a LLM?\"}\n",
+" {\"role\": \"user\", \"content\": \"What is the capital of France?\"}\n",
 " ]\n",
 "}\n",
 "\n",
@@ -140,7 +139,6 @@
 "response = client.chat.completions.create(\n",
 " model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
 " messages=[\n",
-" {\"role\": \"system\", \"content\": \"You are a helpful AI assistant\"},\n",
 " {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
 " ],\n",
 " temperature=0,\n",
@@ -170,7 +168,6 @@
 "response = client.chat.completions.create(\n",
 " model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
 " messages=[\n",
-" {\"role\": \"system\", \"content\": \"You are a helpful AI assistant\"},\n",
 " {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
 " ],\n",
 " temperature=0,\n",
......