{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# OpenAI APIs - Completions\n", "\n", "SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models.\n", "A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/api-reference).\n", "\n", "This tutorial covers the following popular APIs:\n", "\n", "- `chat/completions`\n", "- `completions`\n", "- `batches`\n", "\n", "Check out other tutorials to learn about [vision APIs](https://docs.sglang.ai/backend/openai_api_vision.html) for vision-language models and [embedding APIs](https://docs.sglang.ai/backend/openai_api_embeddings.html) for embedding models." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Launch A Server\n", "\n", "Launch the server in your terminal and wait for it to initialize." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sglang.test.test_utils import is_in_ci\n", "\n", "if is_in_ci():\n", " from patch import launch_server_cmd\n", "else:\n", " from sglang.utils import launch_server_cmd\n", "\n", "from sglang.utils import wait_for_server, print_highlight, terminate_process\n", "\n", "\n", "server_process, port = launch_server_cmd(\n", " \"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --mem-fraction-static 0.8\"\n", ")\n", "\n", "wait_for_server(f\"http://localhost:{port}\")\n", "print(f\"Server started on http://localhost:{port}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Chat Completions\n", "\n", "### Usage\n", "\n", "The server fully implements the OpenAI API.\n", "It will automatically apply the chat template specified in the Hugging Face tokenizer, if one is available.\n", "You can also specify a custom chat template with `--chat-template` when launching the server." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import openai\n", "\n", "client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n", "\n", "response = client.chat.completions.create(\n", " model=\"qwen/qwen2.5-0.5b-instruct\",\n", " messages=[\n", " {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n", " ],\n", " temperature=0,\n", " max_tokens=64,\n", ")\n", "\n", "print_highlight(f\"Response: {response}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parameters\n", "\n", "The chat completions API accepts OpenAI Chat Completions API's parameters. Refer to [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat/create) for more details.\n", "\n", "SGLang extends the standard API with the `extra_body` parameter, allowing for additional customization. One key option within `extra_body` is `chat_template_kwargs`, which can be used to pass arguments to the chat template processor.\n", "\n", "#### Enabling Model Thinking/Reasoning\n", "\n", "You can use `chat_template_kwargs` to enable or disable the model's internal thinking or reasoning process output. Set `\"enable_thinking\": True` within `chat_template_kwargs` to include the reasoning steps in the response. 
This requires launching the server with a compatible reasoning parser (e.g., `--reasoning-parser qwen3` for Qwen3 models).\n", "\n", "Here's an example demonstrating how to enable thinking and retrieve the reasoning content separately (using `separate_reasoning: True`):\n", "\n", "```python\n", "# Ensure the server is launched with a compatible reasoning parser, e.g.:\n", "# python3 -m sglang.launch_server --model-path QwQ/Qwen3-32B-250415 --reasoning-parser qwen3 ...\n", "\n", "from openai import OpenAI\n", "\n", "# Modify OpenAI's API key and API base to use SGLang's API server.\n", "openai_api_key = \"EMPTY\"\n", "openai_api_base = f\"http://127.0.0.1:{port}/v1\" # Use the correct port\n", "\n", "client = OpenAI(\n", " api_key=openai_api_key,\n", " base_url=openai_api_base,\n", ")\n", "\n", "model = \"QwQ/Qwen3-32B-250415\" # Use the model loaded by the server\n", "messages = [{\"role\": \"user\", \"content\": \"9.11 and 9.8, which is greater?\"}]\n", "\n", "response = client.chat.completions.create(\n", " model=model,\n", " messages=messages,\n", " extra_body={\n", " \"chat_template_kwargs\": {\"enable_thinking\": True},\n", " \"separate_reasoning\": True\n", " }\n", ")\n", "\n", "print(\"response.choices[0].message.reasoning_content: \\n\", response.choices[0].message.reasoning_content)\n", "print(\"response.choices[0].message.content: \\n\", response.choices[0].message.content)\n", "```\n", "\n", "**Example Output:**\n", "\n", "```\n", "response.choices[0].message.reasoning_content: \n", " Okay, so I need to figure out which number is greater between 9.11 and 9.8. Hmm, let me think. Both numbers start with 9, right? So the whole number part is the same. 
That means I need to look at the decimal parts to determine which one is bigger.\n", "...\n", "Therefore, after checking multiple methods—aligning decimals, subtracting, converting to fractions, and using a real-world analogy—it's clear that 9.8 is greater than 9.11.\n", "\n", "response.choices[0].message.content: \n", " To determine which number is greater between **9.11** and **9.8**, follow these steps:\n", "...\n", "**Answer**: \n", "9.8 is greater than 9.11.\n", "```\n", "\n", "Setting `\"enable_thinking\": False` (or omitting it) will result in `reasoning_content` being `None`.\n", "\n", "Here is an example of a detailed chat completion request using standard OpenAI parameters:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = client.chat.completions.create(\n", " model=\"qwen/qwen2.5-0.5b-instruct\",\n", " messages=[\n", " {\n", " \"role\": \"system\",\n", " \"content\": \"You are a knowledgeable historian who provides concise responses.\",\n", " },\n", " {\"role\": \"user\", \"content\": \"Tell me about ancient Rome\"},\n", " {\n", " \"role\": \"assistant\",\n", " \"content\": \"Ancient Rome was a civilization centered in Italy.\",\n", " },\n", " {\"role\": \"user\", \"content\": \"What were their major achievements?\"},\n", " ],\n", " temperature=0.3, # Lower temperature for more focused responses\n", " max_tokens=128, # Reasonable length for a concise response\n", " top_p=0.95, # Slightly higher for better fluency\n", " presence_penalty=0.2, # Mild penalty to avoid repetition\n", " frequency_penalty=0.2, # Mild penalty for more natural language\n", " n=1, # Single response is usually more stable\n", " seed=42, # Keep for reproducibility\n", ")\n", "\n", "print_highlight(response.choices[0].message.content)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Streaming mode is also supported." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stream = client.chat.completions.create(\n", " model=\"qwen/qwen2.5-0.5b-instruct\",\n", " messages=[{\"role\": \"user\", \"content\": \"Say this is a test\"}],\n", " stream=True,\n", ")\n", "for chunk in stream:\n", " if chunk.choices[0].delta.content is not None:\n", " print(chunk.choices[0].delta.content, end=\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Completions\n", "\n", "### Usage\n", "The Completions API is similar to the Chat Completions API, but it takes a `prompt` string instead of the `messages` parameter and does not apply a chat template." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = client.completions.create(\n", " model=\"qwen/qwen2.5-0.5b-instruct\",\n", " prompt=\"List 3 countries and their capitals.\",\n", " temperature=0,\n", " max_tokens=64,\n", " n=1,\n", " stop=None,\n", ")\n", "\n", "print_highlight(f\"Response: {response}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parameters\n", "\n", "The completions API accepts the parameters of the OpenAI Completions API. 
Refer to [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions/create) for more details.\n", "\n", "Here is an example of a detailed completions request:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = client.completions.create(\n", " model=\"qwen/qwen2.5-0.5b-instruct\",\n", " prompt=\"Write a short story about a space explorer.\",\n", " temperature=0.7, # Moderate temperature for creative writing\n", " max_tokens=150, # Longer response for a story\n", " top_p=0.9, # Balanced diversity in word choice\n", " stop=[\"\\n\\n\", \"THE END\"], # Multiple stop sequences\n", " presence_penalty=0.3, # Encourage novel elements\n", " frequency_penalty=0.3, # Reduce repetitive phrases\n", " n=1, # Generate one completion\n", " seed=123, # For reproducible results\n", ")\n", "\n", "print_highlight(f\"Response: {response}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Structured Outputs (JSON, Regex, EBNF)\n", "\n", "For the OpenAI-compatible structured outputs API, refer to [Structured Outputs](https://docs.sglang.ai/backend/structured_outputs.html#OpenAI-Compatible-API).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Batches\n", "\n", "The Batches API is also supported for both chat completions and completions. 
You can upload your requests in `jsonl` files, create a batch job, and retrieve the results when the batch job is completed (which takes longer but costs less).\n", "\n", "The batches APIs are:\n", "\n", "- `batches`\n", "- `batches/{batch_id}/cancel`\n", "- `batches/{batch_id}`\n", "\n", "Here is an example of a batch job for chat completions; batch jobs for completions work similarly.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "import time\n", "from openai import OpenAI\n", "\n", "client = OpenAI(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n", "\n", "requests = [\n", " {\n", " \"custom_id\": \"request-1\",\n", " \"method\": \"POST\",\n", " \"url\": \"/chat/completions\",\n", " \"body\": {\n", " \"model\": \"qwen/qwen2.5-0.5b-instruct\",\n", " \"messages\": [\n", " {\"role\": \"user\", \"content\": \"Tell me a joke about programming\"}\n", " ],\n", " \"max_tokens\": 50,\n", " },\n", " },\n", " {\n", " \"custom_id\": \"request-2\",\n", " \"method\": \"POST\",\n", " \"url\": \"/chat/completions\",\n", " \"body\": {\n", " \"model\": \"qwen/qwen2.5-0.5b-instruct\",\n", " \"messages\": [{\"role\": \"user\", \"content\": \"What is Python?\"}],\n", " \"max_tokens\": 50,\n", " },\n", " },\n", "]\n", "\n", "input_file_path = \"batch_requests.jsonl\"\n", "\n", "with open(input_file_path, \"w\") as f:\n", " for req in requests:\n", " f.write(json.dumps(req) + \"\\n\")\n", "\n", "with open(input_file_path, \"rb\") as f:\n", " file_response = client.files.create(file=f, purpose=\"batch\")\n", "\n", "batch_response = client.batches.create(\n", " input_file_id=file_response.id,\n", " endpoint=\"/v1/chat/completions\",\n", " completion_window=\"24h\",\n", ")\n", "\n", "print_highlight(f\"Batch job created with ID: {batch_response.id}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "while batch_response.status not in [\"completed\", \"failed\", 
\"cancelled\"]:\n", " time.sleep(3)\n", " print(f\"Batch job status: {batch_response.status}...trying again in 3 seconds...\")\n", " batch_response = client.batches.retrieve(batch_response.id)\n", "\n", "if batch_response.status == \"completed\":\n", " print(\"Batch job completed successfully!\")\n", " print(f\"Request counts: {batch_response.request_counts}\")\n", "\n", " result_file_id = batch_response.output_file_id\n", " file_response = client.files.content(result_file_id)\n", " result_content = file_response.read().decode(\"utf-8\")\n", "\n", " results = [\n", " json.loads(line) for line in result_content.split(\"\\n\") if line.strip() != \"\"\n", " ]\n", "\n", " for result in results:\n", " print_highlight(f\"Request {result['custom_id']}:\")\n", " print_highlight(f\"Response: {result['response']}\")\n", "\n", " print_highlight(\"Cleaning up files...\")\n", " # Only delete the result file ID since file_response is just content\n", " client.files.delete(result_file_id)\n", "else:\n", " print_highlight(f\"Batch job failed with status: {batch_response.status}\")\n", " if hasattr(batch_response, \"errors\"):\n", " print_highlight(f\"Errors: {batch_response.errors}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It takes a while to complete the batch job. You can use these two APIs to retrieve the batch job status or cancel the batch job.\n", "\n", "1. `batches/{batch_id}`: Retrieve the batch job status.\n", "2. `batches/{batch_id}/cancel`: Cancel the batch job.\n", "\n", "Here is an example to check the batch job status." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "import time\n", "from openai import OpenAI\n", "\n", "client = OpenAI(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n", "\n", "requests = []\n", "for i in range(20):\n", " requests.append(\n", " {\n", " \"custom_id\": f\"request-{i}\",\n", " \"method\": \"POST\",\n", " \"url\": \"/chat/completions\",\n", " \"body\": {\n", " \"model\": \"qwen/qwen2.5-0.5b-instruct\",\n", " \"messages\": [\n", " {\n", " \"role\": \"system\",\n", " \"content\": f\"{i}: You are a helpful AI assistant\",\n", " },\n", " {\n", " \"role\": \"user\",\n", " \"content\": \"Write a detailed story about topic. Make it very long.\",\n", " },\n", " ],\n", " \"max_tokens\": 64,\n", " },\n", " }\n", " )\n", "\n", "input_file_path = \"batch_requests.jsonl\"\n", "with open(input_file_path, \"w\") as f:\n", " for req in requests:\n", " f.write(json.dumps(req) + \"\\n\")\n", "\n", "with open(input_file_path, \"rb\") as f:\n", " uploaded_file = client.files.create(file=f, purpose=\"batch\")\n", "\n", "batch_job = client.batches.create(\n", " input_file_id=uploaded_file.id,\n", " endpoint=\"/v1/chat/completions\",\n", " completion_window=\"24h\",\n", ")\n", "\n", "print_highlight(f\"Created batch job with ID: {batch_job.id}\")\n", "print_highlight(f\"Initial status: {batch_job.status}\")\n", "\n", "time.sleep(10)\n", "\n", "max_checks = 5\n", "for i in range(max_checks):\n", " batch_details = client.batches.retrieve(batch_id=batch_job.id)\n", "\n", " print_highlight(\n", " f\"Batch job details (check {i+1} / {max_checks}) // ID: {batch_details.id} // Status: {batch_details.status} // Created at: {batch_details.created_at} // Input file ID: {batch_details.input_file_id} // Output file ID: {batch_details.output_file_id}\"\n", " )\n", " print_highlight(\n", " f\"Request counts: Total: {batch_details.request_counts.total} // Completed: {batch_details.request_counts.completed} // 
Failed: {batch_details.request_counts.failed}\"\n", " )\n", "\n", " time.sleep(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is an example to cancel a batch job." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "import time\n", "from openai import OpenAI\n", "import os\n", "\n", "client = OpenAI(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n", "\n", "requests = []\n", "for i in range(5000):\n", " requests.append(\n", " {\n", " \"custom_id\": f\"request-{i}\",\n", " \"method\": \"POST\",\n", " \"url\": \"/chat/completions\",\n", " \"body\": {\n", " \"model\": \"qwen/qwen2.5-0.5b-instruct\",\n", " \"messages\": [\n", " {\n", " \"role\": \"system\",\n", " \"content\": f\"{i}: You are a helpful AI assistant\",\n", " },\n", " {\n", " \"role\": \"user\",\n", " \"content\": \"Write a detailed story about topic. Make it very long.\",\n", " },\n", " ],\n", " \"max_tokens\": 128,\n", " },\n", " }\n", " )\n", "\n", "input_file_path = \"batch_requests.jsonl\"\n", "with open(input_file_path, \"w\") as f:\n", " for req in requests:\n", " f.write(json.dumps(req) + \"\\n\")\n", "\n", "with open(input_file_path, \"rb\") as f:\n", " uploaded_file = client.files.create(file=f, purpose=\"batch\")\n", "\n", "batch_job = client.batches.create(\n", " input_file_id=uploaded_file.id,\n", " endpoint=\"/v1/chat/completions\",\n", " completion_window=\"24h\",\n", ")\n", "\n", "print_highlight(f\"Created batch job with ID: {batch_job.id}\")\n", "print_highlight(f\"Initial status: {batch_job.status}\")\n", "\n", "time.sleep(10)\n", "\n", "try:\n", " cancelled_job = client.batches.cancel(batch_id=batch_job.id)\n", " print_highlight(f\"Cancellation initiated. 
Status: {cancelled_job.status}\")\n", " assert cancelled_job.status == \"cancelling\"\n", "\n", " # Monitor the cancellation process\n", " while cancelled_job.status not in [\"failed\", \"cancelled\"]:\n", " time.sleep(3)\n", " cancelled_job = client.batches.retrieve(batch_job.id)\n", " print_highlight(f\"Current status: {cancelled_job.status}\")\n", "\n", " # Verify final status\n", " assert cancelled_job.status == \"cancelled\"\n", " print_highlight(\"Batch job successfully cancelled\")\n", "\n", "except Exception as e:\n", " print_highlight(f\"Error during cancellation: {e}\")\n", " raise e\n", "\n", "finally:\n", " try:\n", " del_response = client.files.delete(uploaded_file.id)\n", " if del_response.deleted:\n", " print_highlight(\"Successfully cleaned up input file\")\n", " if os.path.exists(input_file_path):\n", " os.remove(input_file_path)\n", " print_highlight(\"Successfully deleted local batch_requests.jsonl file\")\n", " except Exception as e:\n", " print_highlight(f\"Error cleaning up: {e}\")\n", " raise e" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "terminate_process(server_process)" ] } ], "metadata": { "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 2 }