{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# OpenAI APIs - Completions\n", "\n", "SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models.\n", "A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/api-reference).\n", "\n", "This tutorial covers the following popular APIs:\n", "\n", "- `chat/completions`\n", "- `completions`\n", "- `batches`\n", "\n", "Check out other tutorials to learn about vision APIs for vision-language models and embedding APIs for embedding models." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Launch A Server\n", "\n", "This code block is equivalent to executing \n", "\n", "```bash\n", "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n", "--port 30000 --host 0.0.0.0\n", "```\n", "\n", "in your terminal and wait for the server to be ready." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sglang.utils import (\n", " execute_shell_command,\n", " wait_for_server,\n", " terminate_process,\n", " print_highlight,\n", ")\n", "\n", "server_process = execute_shell_command(\n", " \"python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000 --host 0.0.0.0\"\n", ")\n", "\n", "wait_for_server(\"http://localhost:30000\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Chat Completions\n", "\n", "### Usage\n", "\n", "The server fully implements the OpenAI API.\n", "It will automatically apply the chat template specified in the Hugging Face tokenizer, if one is available.\n", "You can also specify a custom chat template with `--chat-template` when launching the server." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import openai\n", "\n", "client = openai.Client(base_url=\"http://127.0.0.1:30000/v1\", api_key=\"None\")\n", "\n", "response = client.chat.completions.create(\n", " model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", " messages=[\n", " {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n", " ],\n", " temperature=0,\n", " max_tokens=64,\n", ")\n", "\n", "print_highlight(f\"Response: {response}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parameters\n", "\n", "The chat completions API accepts OpenAI Chat Completions API's parameters. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Parameters\n", "\n", "The chat completions API accepts the same parameters as the OpenAI Chat Completions API.\n", "Refer to the [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat/create) for more details.\n", "\n", "Here is an example of a detailed chat completion request:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = client.chat.completions.create(\n", "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", "    messages=[\n", "        {\n", "            \"role\": \"system\",\n", "            \"content\": \"You are a knowledgeable historian who provides concise responses.\",\n", "        },\n", "        {\"role\": \"user\", \"content\": \"Tell me about ancient Rome\"},\n", "        {\n", "            \"role\": \"assistant\",\n", "            \"content\": \"Ancient Rome was a civilization centered in Italy.\",\n", "        },\n", "        {\"role\": \"user\", \"content\": \"What were their major achievements?\"},\n", "    ],\n", "    temperature=0.3,  # Lower temperature for more focused responses\n", "    max_tokens=128,  # Reasonable length for a concise response\n", "    top_p=0.95,  # Slightly higher for better fluency\n", "    presence_penalty=0.2,  # Mild penalty to avoid repetition\n", "    frequency_penalty=0.2,  # Mild penalty for more natural language\n", "    n=1,  # Single response is usually more stable\n", "    seed=42,  # Keep for reproducibility\n", ")\n", "\n", "print_highlight(response.choices[0].message.content)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Streaming mode is also supported." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stream = client.chat.completions.create(\n", "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", "    messages=[{\"role\": \"user\", \"content\": \"Say this is a test\"}],\n", "    stream=True,\n", ")\n", "for chunk in stream:\n", "    if chunk.choices[0].delta.content is not None:\n", "        print(chunk.choices[0].delta.content, end=\"\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Completions\n", "\n", "### Usage\n", "\n", "The Completions API is similar to the Chat Completions API, but it takes a plain `prompt` instead of the `messages` parameter and applies no chat template." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = client.completions.create(\n", "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", "    prompt=\"List 3 countries and their capitals.\",\n", "    temperature=0,\n", "    max_tokens=64,\n", "    n=1,\n", "    stop=None,\n", ")\n", "\n", "print_highlight(f\"Response: {response}\")" ] },
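{ "cell_type": "markdown", "metadata": {}, "source": [ "Streaming is available here as well. A minimal sketch, reusing the client from above:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal sketch: stream a plain completion token by token,\n", "# mirroring the chat streaming example above.\n", "stream = client.completions.create(\n", "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", "    prompt=\"Say this is a test.\",\n", "    stream=True,\n", ")\n", "for chunk in stream:\n", "    if chunk.choices[0].text is not None:\n", "        print(chunk.choices[0].text, end=\"\")" ] },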
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Parameters\n", "\n", "The completions API accepts the same parameters as the OpenAI Completions API.\n", "Refer to the [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions/create) for more details.\n", "\n", "Here is an example of a detailed completions request:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = client.completions.create(\n", "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", "    prompt=\"Write a short story about a space explorer.\",\n", "    temperature=0.7,  # Moderate temperature for creative writing\n", "    max_tokens=150,  # Longer response for a story\n", "    top_p=0.9,  # Balanced diversity in word choice\n", "    stop=[\"\\n\\n\", \"THE END\"],  # Multiple stop sequences\n", "    presence_penalty=0.3,  # Encourage novel elements\n", "    frequency_penalty=0.3,  # Reduce repetitive phrases\n", "    n=1,  # Generate one completion\n", "    seed=123,  # For reproducible results\n", ")\n", "\n", "print_highlight(f\"Response: {response}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Structured Decoding (JSON, Regex)\n", "\n", "You can define a JSON schema or a regular expression to constrain the model's output. The output is guaranteed to follow the given constraints; how they are enforced depends on the grammar backend.\n", "\n", "SGLang has two grammar backends: [Outlines](https://github.com/dottxt-ai/outlines) (the default) and [XGrammar](https://blog.mlc.ai/2024/11/22/achieving-efficient-flexible-portable-structured-generation-with-xgrammar). XGrammar accelerates JSON decoding but does not support regular expressions. To use XGrammar, add `--grammar-backend xgrammar` when launching the server:\n", "\n", "```bash\n", "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n", "--port 30000 --host 0.0.0.0 --grammar-backend xgrammar\n", "```\n", "\n", "### JSON" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "\n", "json_schema = json.dumps(\n", "    {\n", "        \"type\": \"object\",\n", "        \"properties\": {\n", "            \"name\": {\"type\": \"string\", \"pattern\": \"^[\\\\w]+$\"},\n", "            \"population\": {\"type\": \"integer\"},\n", "        },\n", "        \"required\": [\"name\", \"population\"],\n", "    }\n", ")\n", "\n", "response = client.chat.completions.create(\n", "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", "    messages=[\n", "        {\n", "            \"role\": \"user\",\n", "            \"content\": \"Give me information about the capital of France in JSON format.\",\n", "        },\n", "    ],\n", "    temperature=0,\n", "    max_tokens=128,\n", "    response_format={\n", "        \"type\": \"json_schema\",\n", "        \"json_schema\": {\"name\": \"foo\", \"schema\": json.loads(json_schema)},\n", "    },\n", ")\n", "\n", "print_highlight(response.choices[0].message.content)" ] },
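{ "cell_type": "markdown", "metadata": {}, "source": [ "Because decoding is constrained to the schema, the reply parses as JSON directly. A minimal sanity check, reusing `response` from the cell above:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal sketch: the constrained output should parse as JSON and\n", "# contain every field the schema marks as required.\n", "capital_info = json.loads(response.choices[0].message.content)\n", "assert {\"name\", \"population\"} <= capital_info.keys()\n", "print_highlight(f\"Parsed: {capital_info}\")" ] },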
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Regular expression" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = client.chat.completions.create(\n", "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", "    messages=[\n", "        {\"role\": \"user\", \"content\": \"What is the capital of France?\"},\n", "    ],\n", "    temperature=0,\n", "    max_tokens=128,\n", "    extra_body={\"regex\": \"(Paris|London)\"},\n", ")\n", "\n", "print_highlight(response.choices[0].message.content)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Batches\n", "\n", "The Batches API is also supported for both chat completions and completions. You can upload your requests in `jsonl` files, create a batch job, and retrieve the results once the batch job is completed (which takes longer but costs less).\n", "\n", "The batches APIs are:\n", "\n", "- `batches`\n", "- `batches/{batch_id}/cancel`\n", "- `batches/{batch_id}`\n", "\n", "Here is an example of a batch job for chat completions; batch jobs for completions work the same way.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "import time\n", "from openai import OpenAI\n", "\n", "client = OpenAI(base_url=\"http://127.0.0.1:30000/v1\", api_key=\"None\")\n", "\n", "requests = [\n", "    {\n", "        \"custom_id\": \"request-1\",\n", "        \"method\": \"POST\",\n", "        \"url\": \"/chat/completions\",\n", "        \"body\": {\n", "            \"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", "            \"messages\": [\n", "                {\"role\": \"user\", \"content\": \"Tell me a joke about programming\"}\n", "            ],\n", "            \"max_tokens\": 50,\n", "        },\n", "    },\n", "    {\n", "        \"custom_id\": \"request-2\",\n", "        \"method\": \"POST\",\n", "        \"url\": \"/chat/completions\",\n", "        \"body\": {\n", "            \"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", "            \"messages\": [{\"role\": \"user\", \"content\": \"What is Python?\"}],\n", "            \"max_tokens\": 50,\n", "        },\n", "    },\n", "]\n", "\n", "input_file_path = \"batch_requests.jsonl\"\n", "\n", "with open(input_file_path, \"w\") as f:\n", "    for req in requests:\n", "        f.write(json.dumps(req) + \"\\n\")\n", "\n", "with open(input_file_path, \"rb\") as f:\n", "    file_response = client.files.create(file=f, purpose=\"batch\")\n", "\n", "batch_response = client.batches.create(\n", "    input_file_id=file_response.id,\n", "    endpoint=\"/v1/chat/completions\",\n", "    completion_window=\"24h\",\n", ")\n", "\n", "print_highlight(f\"Batch job created with ID: {batch_response.id}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "while batch_response.status not in [\"completed\", \"failed\", \"cancelled\"]:\n", "    time.sleep(3)\n", "    print(f\"Batch job status: {batch_response.status}...trying again in 3 seconds...\")\n", "    batch_response = client.batches.retrieve(batch_response.id)\n", "\n", "if batch_response.status == \"completed\":\n", "    print(\"Batch job completed successfully!\")\n", "    print(f\"Request counts: {batch_response.request_counts}\")\n", "\n", "    result_file_id = batch_response.output_file_id\n", "    file_response = client.files.content(result_file_id)\n", "    result_content = file_response.read().decode(\"utf-8\")\n", "\n", "    results = [\n", "        json.loads(line) for line in result_content.split(\"\\n\") if line.strip() != \"\"\n", "    ]\n", "\n", "    for result in results:\n", "        print_highlight(f\"Request {result['custom_id']}:\")\n", "        print_highlight(f\"Response: {result['response']}\")\n", "\n", "    print_highlight(\"Cleaning up files...\")\n", "    # Only delete the result file ID since file_response is just content\n", "    client.files.delete(result_file_id)\n", "else:\n", "    print_highlight(f\"Batch job failed with status: {batch_response.status}\")\n", "    if hasattr(batch_response, \"errors\"):\n", "        print_highlight(f\"Errors: {batch_response.errors}\")" ] },
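{ "cell_type": "markdown", "metadata": {}, "source": [ "Each line of the result file follows the OpenAI batch output layout, with the regular chat completion nested under `response.body`. Assuming that layout, the completion text can be pulled out as follows, reusing `results` from the cell above:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal sketch, assuming the OpenAI batch output layout:\n", "# each result line wraps a normal chat completion under response.body.\n", "for result in results:\n", "    body = result[\"response\"][\"body\"]\n", "    answer = body[\"choices\"][0][\"message\"][\"content\"]\n", "    print_highlight(f\"{result['custom_id']}: {answer}\")" ] },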
{ "cell_type": "markdown", "metadata": {}, "source": [ "A batch job takes a while to complete. You can use the following two APIs to check its status or cancel it:\n", "\n", "1. `batches/{batch_id}`: Retrieve the batch job status.\n", "2. `batches/{batch_id}/cancel`: Cancel the batch job.\n", "\n", "Here is an example of checking the batch job status." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "import time\n", "from openai import OpenAI\n", "\n", "client = OpenAI(base_url=\"http://127.0.0.1:30000/v1\", api_key=\"None\")\n", "\n", "requests = []\n", "for i in range(100):\n", "    requests.append(\n", "        {\n", "            \"custom_id\": f\"request-{i}\",\n", "            \"method\": \"POST\",\n", "            \"url\": \"/chat/completions\",\n", "            \"body\": {\n", "                \"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", "                \"messages\": [\n", "                    {\n", "                        \"role\": \"system\",\n", "                        \"content\": f\"{i}: You are a helpful AI assistant\",\n", "                    },\n", "                    {\n", "                        \"role\": \"user\",\n", "                        \"content\": \"Write a detailed story about a topic. Make it very long.\",\n", "                    },\n", "                ],\n", "                \"max_tokens\": 500,\n", "            },\n", "        }\n", "    )\n", "\n", "input_file_path = \"batch_requests.jsonl\"\n", "with open(input_file_path, \"w\") as f:\n", "    for req in requests:\n", "        f.write(json.dumps(req) + \"\\n\")\n", "\n", "with open(input_file_path, \"rb\") as f:\n", "    uploaded_file = client.files.create(file=f, purpose=\"batch\")\n", "\n", "batch_job = client.batches.create(\n", "    input_file_id=uploaded_file.id,\n", "    endpoint=\"/v1/chat/completions\",\n", "    completion_window=\"24h\",\n", ")\n", "\n", "print_highlight(f\"Created batch job with ID: {batch_job.id}\")\n", "print_highlight(f\"Initial status: {batch_job.status}\")\n", "\n", "time.sleep(10)\n", "\n", "max_checks = 5\n", "for i in range(max_checks):\n", "    batch_details = client.batches.retrieve(batch_id=batch_job.id)\n", "\n", "    print_highlight(\n", "        f\"Batch job details (check {i+1} / {max_checks}) // ID: {batch_details.id} // Status: {batch_details.status} // Created at: {batch_details.created_at} // Input file ID: {batch_details.input_file_id} // Output file ID: {batch_details.output_file_id}\"\n", "    )\n", "    print_highlight(\n", "        f\"Request counts: Total: {batch_details.request_counts.total} // Completed: {batch_details.request_counts.completed} // Failed: {batch_details.request_counts.failed}\"\n", "    )\n", "\n", "    time.sleep(3)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Here is an example of cancelling a batch job." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "import os\n", "import time\n", "from openai import OpenAI\n", "\n", "client = OpenAI(base_url=\"http://127.0.0.1:30000/v1\", api_key=\"None\")\n", "\n", "requests = []\n", "for i in range(500):\n", "    requests.append(\n", "        {\n", "            \"custom_id\": f\"request-{i}\",\n", "            \"method\": \"POST\",\n", "            \"url\": \"/chat/completions\",\n", "            \"body\": {\n", "                \"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n", "                \"messages\": [\n", "                    {\n", "                        \"role\": \"system\",\n", "                        \"content\": f\"{i}: You are a helpful AI assistant\",\n", "                    },\n", "                    {\n", "                        \"role\": \"user\",\n", "                        \"content\": \"Write a detailed story about a topic. Make it very long.\",\n", "                    },\n", "                ],\n", "                \"max_tokens\": 500,\n", "            },\n", "        }\n", "    )\n", "\n", "input_file_path = \"batch_requests.jsonl\"\n", "with open(input_file_path, \"w\") as f:\n", "    for req in requests:\n", "        f.write(json.dumps(req) + \"\\n\")\n", "\n", "with open(input_file_path, \"rb\") as f:\n", "    uploaded_file = client.files.create(file=f, purpose=\"batch\")\n", "\n", "batch_job = client.batches.create(\n", "    input_file_id=uploaded_file.id,\n", "    endpoint=\"/v1/chat/completions\",\n", "    completion_window=\"24h\",\n", ")\n", "\n", "print_highlight(f\"Created batch job with ID: {batch_job.id}\")\n", "print_highlight(f\"Initial status: {batch_job.status}\")\n", "\n", "time.sleep(10)\n", "\n", "try:\n", "    cancelled_job = client.batches.cancel(batch_id=batch_job.id)\n", "    print_highlight(f\"Cancellation initiated. Status: {cancelled_job.status}\")\n", "    assert cancelled_job.status == \"cancelling\"\n", "\n", "    # Monitor the cancellation process\n", "    while cancelled_job.status not in [\"failed\", \"cancelled\"]:\n", "        time.sleep(3)\n", "        cancelled_job = client.batches.retrieve(batch_job.id)\n", "        print_highlight(f\"Current status: {cancelled_job.status}\")\n", "\n", "    # Verify final status\n", "    assert cancelled_job.status == \"cancelled\"\n", "    print_highlight(\"Batch job successfully cancelled\")\n", "\n", "except Exception as e:\n", "    print_highlight(f\"Error during cancellation: {e}\")\n", "    raise e\n", "\n", "finally:\n", "    try:\n", "        del_response = client.files.delete(uploaded_file.id)\n", "        if del_response.deleted:\n", "            print_highlight(\"Successfully cleaned up input file\")\n", "        if os.path.exists(input_file_path):\n", "            os.remove(input_file_path)\n", "            print_highlight(\"Successfully deleted local batch_requests.jsonl file\")\n", "    except Exception as e:\n", "        print_highlight(f\"Error cleaning up: {e}\")\n", "        raise e" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "terminate_process(server_process)" ] } ], "metadata": { "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 2 }