{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Native API\n", "\n", "Apart from the OpenAI-compatible API, the SGLang Runtime also provides its own native server API. This guide covers the following endpoints:\n", "\n", "- `/generate`\n", "- `/update_weights`\n", "- `/get_server_args`\n", "- `/get_model_info`\n", "- `/health`\n", "- `/health_generate`\n", "- `/flush_cache`\n", "- `/get_memory_pool_size`\n", "\n", "The examples below use `requests` to call these endpoints; `curl` works as well." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Launch a Server" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sglang.utils import (\n", "    execute_shell_command,\n", "    wait_for_server,\n", "    terminate_process,\n", "    print_highlight,\n", ")\n", "import subprocess, json\n", "\n", "server_process = execute_shell_command(\n", "\"\"\"\n", "python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --port=30010\n", "\"\"\"\n", ")\n", "\n", "wait_for_server(\"http://localhost:30010\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generate\n", "\n", "Generates a completion from the model, similar to OpenAI's `/v1/completions` API. The supported parameters are described in the [sampling parameters](https://sgl-project.github.io/references/sampling_params.html) reference." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests\n", "\n", "url = \"http://localhost:30010/generate\"\n", "data = {\"text\": \"List 3 countries and their capitals.\"}\n", "\n", "response = requests.post(url, json=data)\n", "print_highlight(response.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get Server Args\n", "\n", "Returns the arguments the server was launched with." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "url = \"http://localhost:30010/get_server_args\"\n", "\n", "response = requests.get(url)\n", "print_highlight(response.json())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get Model Info\n", "\n", "Returns information about the loaded model:\n", "\n", "- `model_path`: The path or name of the model.\n", "- `is_generation`: Whether the model is a generation model (`true`) or an embedding model (`false`)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "url = \"http://localhost:30010/get_model_info\"\n", "\n", "response = requests.get(url)\n", "response_json = response.json()\n", "print_highlight(response_json)\n", "assert response_json[\"model_path\"] == \"meta-llama/Llama-3.2-1B-Instruct\"\n", "assert response_json[\"is_generation\"] is True\n", "assert response_json.keys() == {\"model_path\", \"is_generation\"}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Health and Health Generate\n", "\n", "- `/health`: Checks the health of the server.\n", "- `/health_generate`: Checks the health of the server by generating one token." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "url = \"http://localhost:30010/health_generate\"\n", "\n", "response = requests.get(url)\n", "print_highlight(response.text)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "url = \"http://localhost:30010/health\"\n", "\n", "response = requests.get(url)\n", "print_highlight(response.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Flush Cache\n", "\n", "Flushes the radix cache. The flush is also triggered automatically when the model weights are updated through the `/update_weights` API." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# flush cache\n", "\n", "url = \"http://localhost:30010/flush_cache\"\n", "\n", "response = requests.post(url)\n", "print_highlight(response.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get Memory Pool Size\n", "\n", "Returns the memory pool size, measured in tokens.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# get_memory_pool_size\n", "\n", "url = \"http://localhost:30010/get_memory_pool_size\"\n", "\n", "response = requests.get(url)\n", "print_highlight(response.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Update Weights\n", "\n", "Updates the model weights without restarting the server, which is useful for continuous evaluation during training. This only works when the new model has the same architecture and parameter size as the one currently loaded." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# successful update with same architecture and size\n", "\n", "url = \"http://localhost:30010/update_weights\"\n", "data = {\"model_path\": \"meta-llama/Llama-3.2-1B\"}\n", "\n", "response = requests.post(url, json=data)\n", "print_highlight(response.text)\n", "assert response.json()[\"success\"] is True\n", "assert response.json()[\"message\"] == \"Succeeded to update model weights.\"\n", "assert response.json().keys() == {\"success\", \"message\"}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# failed update with different parameter size\n", "\n", "url = \"http://localhost:30010/update_weights\"\n", "data = {\"model_path\": \"meta-llama/Llama-3.2-3B\"}\n", "\n", "response = requests.post(url, json=data)\n", "response_json = response.json()\n", "print_highlight(response_json)\n", "assert response_json[\"success\"] is False\n", "assert response_json[\"message\"] == (\n", "    \"Failed to update weights: The size of tensor a 
(2048) must match \"\n", "    \"the size of tensor b (3072) at non-singleton dimension 1.\\n\"\n", "    \"Rolling back to original weights.\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "terminate_process(server_process)" ] } ], "metadata": { "kernelspec": { "display_name": "AlphaMeemory", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" } }, "nbformat": 4, "nbformat_minor": 2 }