"Apart from the OpenAI compatible APIs, the SGLang Runtime also provides its native server APIs. We introduce these following APIs:\n",
"\n",
"- `/generate`\n",
...
...
@@ -40,7 +39,6 @@
" terminate_process,\n",
" print_highlight,\n",
")\n",
"import subprocess, json\n",
"\n",
"server_process = execute_shell_command(\n",
"\"\"\"\n",
...
...
@@ -56,8 +54,7 @@
"metadata": {},
"source": [
"## Generate\n",
"\n",
"Used to generate completion from the model, similar to the `/v1/completions` API in OpenAI. Detailed parameters can be found in the [sampling parameters](https://sgl-project.github.io/references/sampling_params.html)."
"Generate completions. This is similar to the `/v1/completions` in OpenAI API. Detailed parameters can be found in the [sampling parameters](https://sgl-project.github.io/references/sampling_params.html)."
]
},
{
...
...
@@ -72,7 +69,7 @@
"data = {\"text\": \"What is the capital of France?\"}\n",
"\n",
"response = requests.post(url, json=data)\n",
"print_highlight(response.text)"
"print_highlight(response.json())"
]
},
{
...
...
@@ -80,8 +77,7 @@
"metadata": {},
"source": [
"## Get Server Args\n",
"\n",
"Used to get the serving args when the server is launched."
"Get the arguments of a server."
]
},
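{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a minimal sketch of calling this endpoint (not part of the original notebook). It sends a GET request to `/get_server_args` and prints the JSON response; the port in `url` is an assumption, so use the one from your launch command."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: query /get_server_args (the port is an assumption).\n",
"url = \"http://localhost:30010/get_server_args\"\n",
"\n",
"response = requests.get(url)\n",
"print_highlight(response.json())"
]
},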
{
...
...
@@ -102,7 +98,7 @@
"source": [
"## Get Model Info\n",
"\n",
"Used to get the model info.\n",
"Get the information of the model.\n",
"\n",
"- `model_path`: The path/name of the model.\n",
"- `is_generation`: Whether the model is used as generation model or embedding model."
"- `/health_generate`: Check the health of the server by generating one token."
]
...
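{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a minimal sketch of calling this endpoint (not part of the original notebook). It sends a GET request to `/get_model_info` and prints the fields described above; the port is an assumption."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: query /get_model_info (the port is an assumption).\n",
"url = \"http://localhost:30010/get_model_info\"\n",
"\n",
"response = requests.get(url)\n",
"print_highlight(response.json())"
]
},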
...
@@ -164,7 +159,7 @@
"source": [
"## Flush Cache\n",
"\n",
"Used to flush the radix cache. It will be automatically triggered when the model weights are updated by the `/update_weights` API."
"Flush the radix cache. It will be automatically triggered when the model weights are updated by the `/update_weights` API."
]
},
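{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a minimal sketch of triggering a manual flush (not part of the original notebook). It posts to `/flush_cache`; the port is an assumption, and some server versions expose this endpoint via GET instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: flush the radix cache (the port is an assumption).\n",
"url = \"http://localhost:30010/flush_cache\"\n",
"\n",
"response = requests.post(url)\n",
"print_highlight(response.text)"
]
},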
{
...
...
@@ -259,7 +254,7 @@
"source": [
"## Encode\n",
"\n",
"Used to encode text into embeddings. Note that this API is only available for [embedding models](./openai_embedding_api.ipynb) and will raise an error for generation models.\n",
"Encode text into embeddings. Note that this API is only available for [embedding models](./openai_embedding_api.ipynb) and will raise an error for generation models.\n",
"Therefore, we launch a new server to server an embedding model.\n"
This page lists some common errors and tips for fixing them.
## CUDA out of memory
If you see out of memory (OOM) errors, you can try to tune the following parameters.
- If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096` or `2048`.
- If OOM happens during decoding, try to decrease `--max-running-requests`.
- You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
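As a concrete illustration, here is a hedged sketch of relaunching with these flags tuned down, using the same launch pattern as the notebook above (the model path, port, and exact values are placeholders; pick values that fit your GPU):

```python
from sglang.utils import execute_shell_command

# Relaunch with reduced memory settings (values are illustrative, not recommendations).
server_process = execute_shell_command(
    "python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct "
    "--port 30010 --chunked-prefill-size 4096 --max-running-requests 128 --mem-fraction-static 0.7"
)
```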
## CUDA error: an illegal memory access was encountered
This error may be due to kernel errors or out-of-memory issues.
- If it is a kernel error, it is not easy to fix. Please file an issue on GitHub.
- If it is out-of-memory, sometimes it will report this error instead of "Out-of-memory." In this case, try setting a smaller value for `--mem-fraction-static` (the default is around 0.8 - 0.9), and refer to the section above for other ways to avoid OOM.
## The server hangs
If the server hangs, try disabling some optimizations when launching the server.
- Add `--disable-cuda-graph`.
- Add `--sampling-backend pytorch`.
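For example, a hedged relaunch sketch with both optimizations disabled (model path and port are placeholders):

```python
from sglang.utils import execute_shell_command

# Relaunch with CUDA graph disabled and the PyTorch sampling backend to isolate the hang.
server_process = execute_shell_command(
    "python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct "
    "--port 30010 --disable-cuda-graph --sampling-backend pytorch"
)
```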
" -d '{\"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\", \"messages\": [{\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"What is a LLM?\"}]}'\n",
" -d '{\"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\", \"messages\": [{\"role\": \"user\", \"content\": \"What is the capital of France?\"}]}'\n",