openai_api_completions.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# OpenAI APIs - Completions\n",
    "\n",
    "SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models.\n",
    "A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/api-reference).\n",
    "\n",
    "This tutorial covers the following popular APIs:\n",
    "\n",
    "- `chat/completions`\n",
    "- `completions`\n",
    "- `batches`\n",
    "\n",
    "Check out other tutorials to learn about vision APIs for vision-language models and embedding APIs for embedding models."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Launch A Server\n",
    "\n",
    "This code block is equivalent to executing \n",
    "\n",
    "```bash\n",
    "python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
    "--port 30000 --host 0.0.0.0\n",
    "```\n",
    "\n",
    "in your terminal and wait for the server to be ready."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-11-02 00:06:33.051950: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:479] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
      "2024-11-02 00:06:33.063961: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:10575] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
      "2024-11-02 00:06:33.063983: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1442] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
      "2024-11-02 00:06:33.581526: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
      "[2024-11-02 00:06:41] server_args=ServerArgs(model_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_path='meta-llama/Meta-Llama-3.1-8B-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Meta-Llama-3.1-8B-Instruct', chat_template=None, is_embedding=False, host='0.0.0.0', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=73322355, constrained_json_whitespace_pattern=None, decode_log_interval=40, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, watchdog_timeout=600, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_penalizer=False, disable_nan_detection=False, enable_overlap_schedule=False, enable_mixed_chunk=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1)\n",
      "[2024-11-02 00:06:51 TP0] Init torch distributed begin.\n",
      "[2024-11-02 00:06:54 TP0] Load weight begin. avail mem=76.83 GB\n",
      "[2024-11-02 00:06:54 TP0] lm_eval is not installed, GPTQ may not be usable\n",
      "INFO 11-02 00:06:54 weight_utils.py:243] Using model weights format ['*.safetensors']\n",
      "Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]\n",
      "Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:01<00:03,  1.22s/it]\n",
      "Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:02<00:02,  1.24s/it]\n",
      "Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:03<00:01,  1.29s/it]\n",
      "Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:04<00:00,  1.06it/s]\n",
      "Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:04<00:00,  1.06s/it]\n",
      "\n",
      "[2024-11-02 00:06:59 TP0] Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=61.82 GB\n",
      "[2024-11-02 00:06:59 TP0] Memory pool end. avail mem=8.19 GB\n",
      "[2024-11-02 00:07:00 TP0] Capture cuda graph begin. This can take up to several minutes.\n",
      "[2024-11-02 00:07:09 TP0] max_total_num_tokens=430915, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072\n",
      "[2024-11-02 00:07:09] INFO:     Started server process [102482]\n",
      "[2024-11-02 00:07:09] INFO:     Waiting for application startup.\n",
      "[2024-11-02 00:07:09] INFO:     Application startup complete.\n",
      "[2024-11-02 00:07:09] INFO:     Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)\n",
      "[2024-11-02 00:07:09] INFO:     127.0.0.1:33692 - \"GET /v1/models HTTP/1.1\" 200 OK\n",
      "[2024-11-02 00:07:10] INFO:     127.0.0.1:33702 - \"GET /get_model_info HTTP/1.1\" 200 OK\n",
      "[2024-11-02 00:07:10 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0\n",
      "[2024-11-02 00:07:10] INFO:     127.0.0.1:33710 - \"POST /generate HTTP/1.1\" 200 OK\n",
      "[2024-11-02 00:07:10] The server is fired up and ready to roll!\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<strong style='color: #00008B;'><br><br>                    NOTE: Typically, the server runs in a separate terminal.<br>                    In this notebook, we run the server and notebook code together, so their outputs are combined.<br>                    To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.<br>                    </strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from sglang.utils import (\n",
    "    execute_shell_command,\n",
    "    wait_for_server,\n",
    "    terminate_process,\n",
    "    print_highlight,\n",
    ")\n",
    "\n",
    "server_process = execute_shell_command(\n",
    "\"python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000 --host 0.0.0.0\"\n",
    ")\n",
    "\n",
    "wait_for_server(\"http://localhost:30000\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Chat Completions\n",
    "\n",
    "### Usage\n",
    "\n",
    "The server fully implements the OpenAI API.\n",
    "It will automatically apply the chat template specified in the Hugging Face tokenizer, if one is available.\n",
    "You can also specify a custom chat template with `--chat-template` when launching the server."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-01T02:45:16.624550Z",
     "iopub.status.busy": "2024-11-01T02:45:16.624258Z",
     "iopub.status.idle": "2024-11-01T02:45:18.087455Z",
     "shell.execute_reply": "2024-11-01T02:45:18.086450Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-11-02 00:08:04 TP0] Prefill batch. #new-seq: 1, #new-token: 48, #cached-token: 1, cache hit rate: 1.79%, token usage: 0.00, #running-req: 0, #queue-req: 0\n",
      "[2024-11-02 00:08:04 TP0] Decode batch. #running-req: 1, #token: 82, token usage: 0.00, gen throughput (token/s): 0.72, #queue-req: 0\n",
      "[2024-11-02 00:08:04] INFO:     127.0.0.1:51178 - \"POST /v1/chat/completions HTTP/1.1\" 200 OK\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<strong style='color: #00008B;'>Response: ChatCompletion(id='bb74a7e9fcae4df7af2ee59e25aa75a5', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Here are 3 countries and their capitals:\\n\\n1. **Country:** Japan\\n**Capital:** Tokyo\\n\\n2. **Country:** Australia\\n**Capital:** Canberra\\n\\n3. **Country:** Brazil\\n**Capital:** Brasília', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None), matched_stop=128009)], created=1730506084, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=46, prompt_tokens=49, total_tokens=95, completion_tokens_details=None, prompt_tokens_details=None))</strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import openai\n",
    "\n",
    "client = openai.Client(base_url=\"http://127.0.0.1:30000/v1\", api_key=\"None\")\n",
    "\n",
    "response = client.chat.completions.create(\n",
    "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "    messages=[\n",
    "        {\"role\": \"system\", \"content\": \"You are a helpful AI assistant\"},\n",
    "        {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
    "    ],\n",
    "    temperature=0,\n",
    "    max_tokens=64,\n",
    ")\n",
    "\n",
    "print_highlight(f\"Response: {response}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Parameters\n",
    "\n",
    "The chat completions API accepts OpenAI Chat Completions API's parameters. Refer to [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat/create) for more details.\n",
    "\n",
    "Here is an example of a detailed chat completion request:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-01T02:45:18.090228Z",
     "iopub.status.busy": "2024-11-01T02:45:18.090071Z",
     "iopub.status.idle": "2024-11-01T02:45:21.193221Z",
     "shell.execute_reply": "2024-11-01T02:45:21.192539Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-11-02 00:08:08 TP0] Prefill batch. #new-seq: 1, #new-token: 48, #cached-token: 28, cache hit rate: 21.97%, token usage: 0.00, #running-req: 0, #queue-req: 0\n",
      "[2024-11-02 00:08:08 TP0] Decode batch. #running-req: 1, #token: 104, token usage: 0.00, gen throughput (token/s): 9.89, #queue-req: 0\n",
      "[2024-11-02 00:08:09 TP0] Decode batch. #running-req: 1, #token: 144, token usage: 0.00, gen throughput (token/s): 132.64, #queue-req: 0\n",
      "[2024-11-02 00:08:09 TP0] Decode batch. #running-req: 1, #token: 184, token usage: 0.00, gen throughput (token/s): 132.28, #queue-req: 0\n",
      "[2024-11-02 00:08:09] INFO:     127.0.0.1:51178 - \"POST /v1/chat/completions HTTP/1.1\" 200 OK\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<strong style='color: #00008B;'>Ancient Rome's major achievements include:<br><br>1. **Law and Governance**: The Twelve Tables (450 BCE) and the Julian Laws (5th century BCE) established a foundation for Roman law, which influenced modern Western law. The Roman Republic (509-27 BCE) and Empire (27 BCE-476 CE) developed a system of governance that included the concept of citizenship, representation, and checks on power.<br><br>2. **Architecture and Engineering**: Romans developed impressive architectural styles, such as the arch, dome, and aqueducts. Iconic structures like the Colosseum, Pantheon, and Roman Forum showcased their engineering prowess.<br><br></strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "response = client.chat.completions.create(\n",
    "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "    messages=[\n",
    "        {\n",
    "            \"role\": \"system\",\n",
    "            \"content\": \"You are a knowledgeable historian who provides concise responses.\",\n",
    "        },\n",
    "        {\"role\": \"user\", \"content\": \"Tell me about ancient Rome\"},\n",
    "        {\n",
    "            \"role\": \"assistant\",\n",
    "            \"content\": \"Ancient Rome was a civilization centered in Italy.\",\n",
    "        },\n",
    "        {\"role\": \"user\", \"content\": \"What were their major achievements?\"},\n",
    "    ],\n",
    "    temperature=0.3,  # Lower temperature for more focused responses\n",
    "    max_tokens=128,  # Reasonable length for a concise response\n",
    "    top_p=0.95,  # Slightly higher for better fluency\n",
    "    presence_penalty=0.2,  # Mild penalty to avoid repetition\n",
    "    frequency_penalty=0.2,  # Mild penalty for more natural language\n",
    "    n=1,  # Single response is usually more stable\n",
    "    seed=42,  # Keep for reproducibility\n",
    ")\n",
    "\n",
    "print_highlight(response.choices[0].message.content)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Streaming mode is also supported"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-01T02:45:21.195226Z",
     "iopub.status.busy": "2024-11-01T02:45:21.194680Z",
     "iopub.status.idle": "2024-11-01T02:45:21.675473Z",
     "shell.execute_reply": "2024-11-01T02:45:21.675050Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-11-02 00:08:19] INFO:     127.0.0.1:44218 - \"POST /v1/chat/completions HTTP/1.1\" 200 OK\n",
      "[2024-11-02 00:08:19 TP0] Prefill batch. #new-seq: 1, #new-token: 15, #cached-token: 25, cache hit rate: 31.40%, token usage: 0.00, #running-req: 0, #queue-req: 0\n",
      "This is only a test."
     ]
    }
   ],
   "source": [
    "stream = client.chat.completions.create(\n",
    "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "    messages=[{\"role\": \"user\", \"content\": \"Say this is a test\"}],\n",
    "    stream=True,\n",
    ")\n",
    "for chunk in stream:\n",
    "    if chunk.choices[0].delta.content is not None:\n",
    "        print(chunk.choices[0].delta.content, end=\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Completions\n",
    "\n",
    "### Usage\n",
    "\n",
    "Completions API is similar to Chat Completions API, but without the `messages` parameter or chat templates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-01T02:45:21.676813Z",
     "iopub.status.busy": "2024-11-01T02:45:21.676665Z",
     "iopub.status.idle": "2024-11-01T02:45:23.182104Z",
     "shell.execute_reply": "2024-11-01T02:45:23.181695Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-11-02 00:08:25 TP0] Prefill batch. #new-seq: 1, #new-token: 8, #cached-token: 1, cache hit rate: 30.39%, token usage: 0.00, #running-req: 0, #queue-req: 0\n",
      "[2024-11-02 00:08:25 TP0] Decode batch. #running-req: 1, #token: 24, token usage: 0.00, gen throughput (token/s): 2.45, #queue-req: 0\n",
      "[2024-11-02 00:08:25 TP0] Decode batch. #running-req: 1, #token: 64, token usage: 0.00, gen throughput (token/s): 142.10, #queue-req: 0\n",
      "[2024-11-02 00:08:26] INFO:     127.0.0.1:37290 - \"POST /v1/completions HTTP/1.1\" 200 OK\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<strong style='color: #00008B;'>Response: Completion(id='25412696fce14364b40430b5671fc11e', choices=[CompletionChoice(finish_reason='length', index=0, logprobs=None, text=' 1. 2. 3.\\n1.  United States - Washington D.C. 2.  Japan - Tokyo 3.  Australia - Canberra\\nList 3 countries and their capitals. 1. 2. 3.\\n1.  China - Beijing 2.  Brazil - Bras', matched_stop=None)], created=1730506106, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=64, prompt_tokens=9, total_tokens=73, completion_tokens_details=None, prompt_tokens_details=None))</strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "response = client.completions.create(\n",
    "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "    prompt=\"List 3 countries and their capitals.\",\n",
    "    temperature=0,\n",
    "    max_tokens=64,\n",
    "    n=1,\n",
    "    stop=None,\n",
    ")\n",
    "\n",
    "print_highlight(f\"Response: {response}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Parameters\n",
    "\n",
    "The completions API accepts OpenAI Completions API's parameters.  Refer to [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions/create) for more details.\n",
    "\n",
    "Here is an example of a detailed completions request:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-01T02:45:23.186337Z",
     "iopub.status.busy": "2024-11-01T02:45:23.186189Z",
     "iopub.status.idle": "2024-11-01T02:45:26.769744Z",
     "shell.execute_reply": "2024-11-01T02:45:26.769299Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:23 TP0] Prefill batch. #new-seq: 1, #new-token: 9, #cached-token: 1, cache hit rate: 29.32%, token usage: 0.00, #running-req: 0, #queue-req: 0\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:23 TP0] Decode batch. #running-req: 1, #token: 29, token usage: 0.00, gen throughput (token/s): 40.76, #queue-req: 0\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:24 TP0] Decode batch. #running-req: 1, #token: 69, token usage: 0.00, gen throughput (token/s): 42.13, #queue-req: 0\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:25 TP0] Decode batch. #running-req: 1, #token: 109, token usage: 0.00, gen throughput (token/s): 42.01, #queue-req: 0\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:26 TP0] Decode batch. #running-req: 1, #token: 149, token usage: 0.00, gen throughput (token/s): 41.87, #queue-req: 0\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:26] INFO:     127.0.0.1:37738 - \"POST /v1/completions HTTP/1.1\" 200 OK\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<strong style='color: #00008B;'>Response: Completion(id='fe384c17aece4a5ca5fb5238dcd1adec', choices=[CompletionChoice(finish_reason='length', index=0, logprobs=None, text=\" This can be a sci-fi story, and you have the ability to create a unique and imaginative universe.\\nIn the depths of space, a lone space explorer named Kaelin Vex navigated through the swirling vortex of the Aurora Nebula. Her ship, the Starweaver, was an extension of herself, its advanced AI system linked directly to her mind. Together, they danced through the cosmos, searching for answers to the mysteries of the universe.\\nKaelin's mission was to uncover the secrets of the ancient alien civilization known as the Architects. Legends spoke of their unparalleled technological prowess and their ability to manipulate reality itself. Many believed they had transcended their physical forms, becoming one with the cosmos.\\nAs Kaelin delved deeper into\", matched_stop=None)], created=1730429126, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=150, prompt_tokens=10, total_tokens=160, prompt_tokens_details=None))</strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "response = client.completions.create(\n",
    "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "    prompt=\"Write a short story about a space explorer.\",\n",
    "    temperature=0.7,  # Moderate temperature for creative writing\n",
    "    max_tokens=150,  # Longer response for a story\n",
    "    top_p=0.9,  # Balanced diversity in word choice\n",
    "    stop=[\"\\n\\n\", \"THE END\"],  # Multiple stop sequences\n",
    "    presence_penalty=0.3,  # Encourage novel elements\n",
    "    frequency_penalty=0.3,  # Reduce repetitive phrases\n",
    "    n=1,  # Generate one completion\n",
    "    seed=123,  # For reproducible results\n",
    ")\n",
    "\n",
    "print_highlight(f\"Response: {response}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Batches\n",
    "\n",
    "Batches API for chat completions and completions are also supported. You can upload your requests in `jsonl` files, create a batch job, and retrieve the results when the batch job is completed (which takes longer but costs less).\n",
    "\n",
    "The batches APIs are:\n",
    "\n",
    "- `batches`\n",
    "- `batches/{batch_id}/cancel`\n",
    "- `batches/{batch_id}`\n",
    "\n",
    "Here is an example of a batch job for chat completions, completions are similar.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-01T02:45:26.772016Z",
     "iopub.status.busy": "2024-11-01T02:45:26.771868Z",
     "iopub.status.idle": "2024-11-01T02:45:26.794225Z",
     "shell.execute_reply": "2024-11-01T02:45:26.793811Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:26] INFO:     127.0.0.1:57182 - \"POST /v1/files HTTP/1.1\" 200 OK\n",
      "[2024-10-31 19:45:26] INFO:     127.0.0.1:57182 - \"POST /v1/batches HTTP/1.1\" 200 OK\n",
      "[2024-10-31 19:45:26 TP0] Prefill batch. #new-seq: 2, #new-token: 20, #cached-token: 60, cache hit rate: 42.80%, token usage: 0.00, #running-req: 0, #queue-req: 0\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<strong style='color: #00008B;'>Batch job created with ID: batch_d9af5b49-ad3d-423e-8c30-4aaafa5c18c4</strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import json\n",
    "import time\n",
    "from openai import OpenAI\n",
    "\n",
    "client = OpenAI(base_url=\"http://127.0.0.1:30000/v1\", api_key=\"None\")\n",
    "\n",
    "requests = [\n",
    "    {\n",
    "        \"custom_id\": \"request-1\",\n",
    "        \"method\": \"POST\",\n",
    "        \"url\": \"/chat/completions\",\n",
    "        \"body\": {\n",
    "            \"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "            \"messages\": [\n",
    "                {\"role\": \"user\", \"content\": \"Tell me a joke about programming\"}\n",
    "            ],\n",
    "            \"max_tokens\": 50,\n",
    "        },\n",
    "    },\n",
    "    {\n",
    "        \"custom_id\": \"request-2\",\n",
    "        \"method\": \"POST\",\n",
    "        \"url\": \"/chat/completions\",\n",
    "        \"body\": {\n",
    "            \"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "            \"messages\": [{\"role\": \"user\", \"content\": \"What is Python?\"}],\n",
    "            \"max_tokens\": 50,\n",
    "        },\n",
    "    },\n",
    "]\n",
    "\n",
    "input_file_path = \"batch_requests.jsonl\"\n",
    "\n",
    "with open(input_file_path, \"w\") as f:\n",
    "    for req in requests:\n",
    "        f.write(json.dumps(req) + \"\\n\")\n",
    "\n",
    "with open(input_file_path, \"rb\") as f:\n",
    "    file_response = client.files.create(file=f, purpose=\"batch\")\n",
    "\n",
    "batch_response = client.batches.create(\n",
    "    input_file_id=file_response.id,\n",
    "    endpoint=\"/v1/chat/completions\",\n",
    "    completion_window=\"24h\",\n",
    ")\n",
    "\n",
    "print_highlight(f\"Batch job created with ID: {batch_response.id}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-01T02:45:26.796422Z",
     "iopub.status.busy": "2024-11-01T02:45:26.796273Z",
     "iopub.status.idle": "2024-11-01T02:45:29.810471Z",
     "shell.execute_reply": "2024-11-01T02:45:29.810041Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:27 TP0] Decode batch. #running-req: 1, #token: 69, token usage: 0.00, gen throughput (token/s): 51.72, #queue-req: 0\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch job status: validating...trying again in 3 seconds...\n",
      "[2024-10-31 19:45:29] INFO:     127.0.0.1:57182 - \"GET /v1/batches/batch_d9af5b49-ad3d-423e-8c30-4aaafa5c18c4 HTTP/1.1\" 200 OK\n",
      "Batch job completed successfully!\n",
      "Request counts: BatchRequestCounts(completed=2, failed=0, total=2)\n",
      "[2024-10-31 19:45:29] INFO:     127.0.0.1:57182 - \"GET /v1/files/backend_result_file-4ed79bf4-1e07-4fa9-9638-7448aa4e074b/content HTTP/1.1\" 200 OK\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<strong style='color: #00008B;'>Request request-1:</strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<strong style='color: #00008B;'>Response: {'status_code': 200, 'request_id': 'request-1', 'body': {'id': 'request-1', 'object': 'chat.completion', 'created': 1730429127, 'model': 'meta-llama/Meta-Llama-3.1-8B-Instruct', 'choices': {'index': 0, 'message': {'role': 'assistant', 'content': 'Why do programmers prefer dark mode?\\n\\nBecause light attracts bugs.'}, 'logprobs': None, 'finish_reason': 'stop', 'matched_stop': 128009}, 'usage': {'prompt_tokens': 41, 'completion_tokens': 13, 'total_tokens': 54}, 'system_fingerprint': None}}</strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<strong style='color: #00008B;'>Request request-2:</strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<strong style='color: #00008B;'>Response: {'status_code': 200, 'request_id': 'request-2', 'body': {'id': 'request-2', 'object': 'chat.completion', 'created': 1730429127, 'model': 'meta-llama/Meta-Llama-3.1-8B-Instruct', 'choices': {'index': 0, 'message': {'role': 'assistant', 'content': '**What is Python?**\\n\\nPython is a high-level, interpreted programming language that is widely used for various purposes such as web development, scientific computing, data analysis, artificial intelligence, and more. It was created in the late 1980s by'}, 'logprobs': None, 'finish_reason': 'length', 'matched_stop': None}, 'usage': {'prompt_tokens': 39, 'completion_tokens': 50, 'total_tokens': 89}, 'system_fingerprint': None}}</strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<strong style='color: #00008B;'>Cleaning up files...</strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:29] INFO:     127.0.0.1:57182 - \"DELETE /v1/files/backend_result_file-4ed79bf4-1e07-4fa9-9638-7448aa4e074b HTTP/1.1\" 200 OK\n"
     ]
    }
   ],
   "source": [
    "while batch_response.status not in [\"completed\", \"failed\", \"cancelled\"]:\n",
    "    time.sleep(3)\n",
    "    print(f\"Batch job status: {batch_response.status}...trying again in 3 seconds...\")\n",
    "    batch_response = client.batches.retrieve(batch_response.id)\n",
    "\n",
    "if batch_response.status == \"completed\":\n",
    "    print(\"Batch job completed successfully!\")\n",
    "    print(f\"Request counts: {batch_response.request_counts}\")\n",
    "\n",
    "    result_file_id = batch_response.output_file_id\n",
    "    file_response = client.files.content(result_file_id)\n",
    "    result_content = file_response.read().decode(\"utf-8\")\n",
    "\n",
    "    results = [\n",
    "        json.loads(line) for line in result_content.split(\"\\n\") if line.strip() != \"\"\n",
    "    ]\n",
    "\n",
    "    for result in results:\n",
    "        print_highlight(f\"Request {result['custom_id']}:\")\n",
    "        print_highlight(f\"Response: {result['response']}\")\n",
    "\n",
    "    print_highlight(\"Cleaning up files...\")\n",
    "    # Only delete the result file ID since file_response is just content\n",
    "    client.files.delete(result_file_id)\n",
    "else:\n",
    "    print_highlight(f\"Batch job failed with status: {batch_response.status}\")\n",
    "    if hasattr(batch_response, \"errors\"):\n",
    "        print_highlight(f\"Errors: {batch_response.errors}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It takes a while to complete the batch job. You can use these two APIs to retrieve the batch job status or cancel the batch job.\n",
    "\n",
    "1. `batches/{batch_id}`: Retrieve the batch job status.\n",
    "2. `batches/{batch_id}/cancel`: Cancel the batch job.\n",
    "\n",
    "Here is an example to check the batch job status."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-01T02:45:29.812339Z",
     "iopub.status.busy": "2024-11-01T02:45:29.812198Z",
     "iopub.status.idle": "2024-11-01T02:45:54.851243Z",
     "shell.execute_reply": "2024-11-01T02:45:54.850668Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:29] INFO:     127.0.0.1:57186 - \"POST /v1/files HTTP/1.1\" 200 OK\n",
      "[2024-10-31 19:45:29] INFO:     127.0.0.1:57186 - \"POST /v1/batches HTTP/1.1\" 200 OK\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<strong style='color: #00008B;'>Created batch job with ID: batch_3d1a7f8e-af5a-4a14-8391-1001aadfe1b2</strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<strong style='color: #00008B;'>Initial status: validating</strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:29 TP0] Prefill batch. #new-seq: 27, #new-token: 810, #cached-token: 675, cache hit rate: 45.05%, token usage: 0.00, #running-req: 0, #queue-req: 0\n",
      "[2024-10-31 19:45:29 TP0] Prefill batch. #new-seq: 73, #new-token: 2190, #cached-token: 1825, cache hit rate: 45.33%, token usage: 0.00, #running-req: 27, #queue-req: 0\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:30 TP0] Decode batch. #running-req: 100, #token: 5125, token usage: 0.02, gen throughput (token/s): 636.38, #queue-req: 0\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:31 TP0] Decode batch. #running-req: 100, #token: 9125, token usage: 0.04, gen throughput (token/s): 3507.97, #queue-req: 0\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:33 TP0] Decode batch. #running-req: 100, #token: 13125, token usage: 0.06, gen throughput (token/s): 3417.06, #queue-req: 0\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:34 TP0] Decode batch. #running-req: 100, #token: 17125, token usage: 0.08, gen throughput (token/s): 3332.03, #queue-req: 0\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:35 TP0] Decode batch. #running-req: 100, #token: 21125, token usage: 0.10, gen throughput (token/s): 3252.29, #queue-req: 0\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:36 TP0] Decode batch. #running-req: 100, #token: 25125, token usage: 0.12, gen throughput (token/s): 3173.87, #queue-req: 0\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:38 TP0] Decode batch. #running-req: 100, #token: 29125, token usage: 0.13, gen throughput (token/s): 3101.31, #queue-req: 0\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:39 TP0] Decode batch. #running-req: 100, #token: 33125, token usage: 0.15, gen throughput (token/s): 3030.90, #queue-req: 0\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:39] INFO:     127.0.0.1:37782 - \"GET /v1/batches/batch_3d1a7f8e-af5a-4a14-8391-1001aadfe1b2 HTTP/1.1\" 200 OK\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<strong style='color: #00008B;'>Batch job details (check 1 / 5) // ID: batch_3d1a7f8e-af5a-4a14-8391-1001aadfe1b2 // Status: in_progress // Created at: 1730429129 // Input file ID: backend_input_file-f42b27b5-05ee-4d27-9a37-ff04c3b4a427 // Output file ID: None</strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<strong style='color: #00008B;'><strong>Request counts: Total: 0 // Completed: 0 // Failed: 0</strong></strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:40 TP0] Decode batch. #running-req: 100, #token: 37125, token usage: 0.17, gen throughput (token/s): 2961.37, #queue-req: 0\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:42 TP0] Decode batch. #running-req: 100, #token: 41125, token usage: 0.19, gen throughput (token/s): 2899.29, #queue-req: 0\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:42] INFO:     127.0.0.1:37782 - \"GET /v1/batches/batch_3d1a7f8e-af5a-4a14-8391-1001aadfe1b2 HTTP/1.1\" 200 OK\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<strong style='color: #00008B;'>Batch job details (check 2 / 5) // ID: batch_3d1a7f8e-af5a-4a14-8391-1001aadfe1b2 // Status: in_progress // Created at: 1730429129 // Input file ID: backend_input_file-f42b27b5-05ee-4d27-9a37-ff04c3b4a427 // Output file ID: None</strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<strong style='color: #00008B;'><strong>Request counts: Total: 0 // Completed: 0 // Failed: 0</strong></strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:43 TP0] Decode batch. #running-req: 100, #token: 45125, token usage: 0.21, gen throughput (token/s): 2836.50, #queue-req: 0\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:45 TP0] Decode batch. #running-req: 100, #token: 49125, token usage: 0.23, gen throughput (token/s): 2777.80, #queue-req: 0\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:45] INFO:     127.0.0.1:37782 - \"GET /v1/batches/batch_3d1a7f8e-af5a-4a14-8391-1001aadfe1b2 HTTP/1.1\" 200 OK\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<strong style='color: #00008B;'>Batch job details (check 3 / 5) // ID: batch_3d1a7f8e-af5a-4a14-8391-1001aadfe1b2 // Status: in_progress // Created at: 1730429129 // Input file ID: backend_input_file-f42b27b5-05ee-4d27-9a37-ff04c3b4a427 // Output file ID: None</strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<strong style='color: #00008B;'><strong>Request counts: Total: 0 // Completed: 0 // Failed: 0</strong></strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:48] INFO:     127.0.0.1:37782 - \"GET /v1/batches/batch_3d1a7f8e-af5a-4a14-8391-1001aadfe1b2 HTTP/1.1\" 200 OK\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<strong style='color: #00008B;'>Batch job details (check 4 / 5) // ID: batch_3d1a7f8e-af5a-4a14-8391-1001aadfe1b2 // Status: completed // Created at: 1730429129 // Input file ID: backend_input_file-f42b27b5-05ee-4d27-9a37-ff04c3b4a427 // Output file ID: backend_result_file-dc391511-07f2-4f94-90cb-3ed09bc4b8a3</strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<strong style='color: #00008B;'><strong>Request counts: Total: 100 // Completed: 100 // Failed: 0</strong></strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:51] INFO:     127.0.0.1:37782 - \"GET /v1/batches/batch_3d1a7f8e-af5a-4a14-8391-1001aadfe1b2 HTTP/1.1\" 200 OK\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<strong style='color: #00008B;'>Batch job details (check 5 / 5) // ID: batch_3d1a7f8e-af5a-4a14-8391-1001aadfe1b2 // Status: completed // Created at: 1730429129 // Input file ID: backend_input_file-f42b27b5-05ee-4d27-9a37-ff04c3b4a427 // Output file ID: backend_result_file-dc391511-07f2-4f94-90cb-3ed09bc4b8a3</strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<strong style='color: #00008B;'><strong>Request counts: Total: 100 // Completed: 100 // Failed: 0</strong></strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import json\n",
    "import time\n",
    "from openai import OpenAI\n",
    "\n",
    "client = OpenAI(base_url=\"http://127.0.0.1:30000/v1\", api_key=\"None\")\n",
    "\n",
    "requests = []\n",
    "for i in range(100):\n",
    "    requests.append(\n",
    "        {\n",
    "            \"custom_id\": f\"request-{i}\",\n",
    "            \"method\": \"POST\",\n",
    "            \"url\": \"/chat/completions\",\n",
    "            \"body\": {\n",
    "                \"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "                \"messages\": [\n",
    "                    {\n",
    "                        \"role\": \"system\",\n",
    "                        \"content\": f\"{i}: You are a helpful AI assistant\",\n",
    "                    },\n",
    "                    {\n",
    "                        \"role\": \"user\",\n",
    "                        \"content\": \"Write a detailed story about topic. Make it very long.\",\n",
    "                    },\n",
    "                ],\n",
    "                \"max_tokens\": 500,\n",
    "            },\n",
    "        }\n",
    "    )\n",
    "\n",
    "input_file_path = \"batch_requests.jsonl\"\n",
    "with open(input_file_path, \"w\") as f:\n",
    "    for req in requests:\n",
    "        f.write(json.dumps(req) + \"\\n\")\n",
    "\n",
    "with open(input_file_path, \"rb\") as f:\n",
    "    uploaded_file = client.files.create(file=f, purpose=\"batch\")\n",
    "\n",
    "batch_job = client.batches.create(\n",
    "    input_file_id=uploaded_file.id,\n",
    "    endpoint=\"/v1/chat/completions\",\n",
    "    completion_window=\"24h\",\n",
    ")\n",
    "\n",
    "print_highlight(f\"Created batch job with ID: {batch_job.id}\")\n",
    "print_highlight(f\"Initial status: {batch_job.status}\")\n",
    "\n",
    "time.sleep(10)\n",
    "\n",
    "max_checks = 5\n",
    "for i in range(max_checks):\n",
    "    batch_details = client.batches.retrieve(batch_id=batch_job.id)\n",
    "\n",
    "    print_highlight(\n",
    "        f\"Batch job details (check {i+1} / {max_checks}) // ID: {batch_details.id} // Status: {batch_details.status} // Created at: {batch_details.created_at} // Input file ID: {batch_details.input_file_id} // Output file ID: {batch_details.output_file_id}\"\n",
    "    )\n",
    "    print_highlight(\n",
    "        f\"<strong>Request counts: Total: {batch_details.request_counts.total} // Completed: {batch_details.request_counts.completed} // Failed: {batch_details.request_counts.failed}</strong>\"\n",
    "    )\n",
    "\n",
    "    time.sleep(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is an example to cancel a batch job."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-01T02:45:54.854018Z",
     "iopub.status.busy": "2024-11-01T02:45:54.853851Z",
     "iopub.status.idle": "2024-11-01T02:46:07.893199Z",
     "shell.execute_reply": "2024-11-01T02:46:07.892310Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:54] INFO:     127.0.0.1:33180 - \"POST /v1/files HTTP/1.1\" 200 OK\n",
      "[2024-10-31 19:45:54] INFO:     127.0.0.1:33180 - \"POST /v1/batches HTTP/1.1\" 200 OK\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<strong style='color: #00008B;'>Created batch job with ID: batch_c30756c3-8c09-4142-9630-9590d6124986</strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<strong style='color: #00008B;'>Initial status: validating</strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:54 TP0] Prefill batch. #new-seq: 135, #new-token: 1150, #cached-token: 6275, cache hit rate: 67.38%, token usage: 0.01, #running-req: 0, #queue-req: 0\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:55 TP0] Prefill batch. #new-seq: 274, #new-token: 8192, #cached-token: 6850, cache hit rate: 55.74%, token usage: 0.02, #running-req: 135, #queue-req: 91\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:56 TP0] Prefill batch. #new-seq: 92, #new-token: 2758, #cached-token: 2302, cache hit rate: 54.19%, token usage: 0.06, #running-req: 408, #queue-req: 1\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:45:56 TP0] Decode batch. #running-req: 500, #token: 16025, token usage: 0.07, gen throughput (token/s): 409.21, #queue-req: 0\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:46:00 TP0] Decode batch. #running-req: 500, #token: 36025, token usage: 0.17, gen throughput (token/s): 5777.09, #queue-req: 0\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:46:03 TP0] Decode batch. #running-req: 500, #token: 56025, token usage: 0.26, gen throughput (token/s): 5530.76, #queue-req: 0\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:46:04] INFO:     127.0.0.1:57728 - \"POST /v1/batches/batch_c30756c3-8c09-4142-9630-9590d6124986/cancel HTTP/1.1\" 200 OK\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<strong style='color: #00008B;'>Cancellation initiated. Status: cancelling</strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:46:07] INFO:     127.0.0.1:57728 - \"GET /v1/batches/batch_c30756c3-8c09-4142-9630-9590d6124986 HTTP/1.1\" 200 OK\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<strong style='color: #00008B;'>Current status: cancelled</strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<strong style='color: #00008B;'>Batch job successfully cancelled</strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2024-10-31 19:46:07] INFO:     127.0.0.1:57728 - \"DELETE /v1/files/backend_input_file-0fbf83a7-301c-488e-a221-b702e24df6a5 HTTP/1.1\" 200 OK\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<strong style='color: #00008B;'>Successfully cleaned up input file</strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<strong style='color: #00008B;'>Successfully deleted local batch_requests.jsonl file</strong>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import json\n",
    "import time\n",
    "from openai import OpenAI\n",
    "import os\n",
    "\n",
    "client = OpenAI(base_url=\"http://127.0.0.1:30000/v1\", api_key=\"None\")\n",
    "\n",
    "requests = []\n",
    "for i in range(500):\n",
    "    requests.append(\n",
    "        {\n",
    "            \"custom_id\": f\"request-{i}\",\n",
    "            \"method\": \"POST\",\n",
    "            \"url\": \"/chat/completions\",\n",
    "            \"body\": {\n",
    "                \"model\": \"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
    "                \"messages\": [\n",
    "                    {\n",
    "                        \"role\": \"system\",\n",
    "                        \"content\": f\"{i}: You are a helpful AI assistant\",\n",
    "                    },\n",
    "                    {\n",
    "                        \"role\": \"user\",\n",
    "                        \"content\": \"Write a detailed story about topic. Make it very long.\",\n",
    "                    },\n",
    "                ],\n",
    "                \"max_tokens\": 500,\n",
    "            },\n",
    "        }\n",
    "    )\n",
    "\n",
    "input_file_path = \"batch_requests.jsonl\"\n",
    "with open(input_file_path, \"w\") as f:\n",
    "    for req in requests:\n",
    "        f.write(json.dumps(req) + \"\\n\")\n",
    "\n",
    "with open(input_file_path, \"rb\") as f:\n",
    "    uploaded_file = client.files.create(file=f, purpose=\"batch\")\n",
    "\n",
    "batch_job = client.batches.create(\n",
    "    input_file_id=uploaded_file.id,\n",
    "    endpoint=\"/v1/chat/completions\",\n",
    "    completion_window=\"24h\",\n",
    ")\n",
    "\n",
    "print_highlight(f\"Created batch job with ID: {batch_job.id}\")\n",
    "print_highlight(f\"Initial status: {batch_job.status}\")\n",
    "\n",
    "time.sleep(10)\n",
    "\n",
    "try:\n",
    "    cancelled_job = client.batches.cancel(batch_id=batch_job.id)\n",
    "    print_highlight(f\"Cancellation initiated. Status: {cancelled_job.status}\")\n",
    "    assert cancelled_job.status == \"cancelling\"\n",
    "\n",
    "    # Monitor the cancellation process\n",
    "    while cancelled_job.status not in [\"failed\", \"cancelled\"]:\n",
    "        time.sleep(3)\n",
    "        cancelled_job = client.batches.retrieve(batch_job.id)\n",
    "        print_highlight(f\"Current status: {cancelled_job.status}\")\n",
    "\n",
    "    # Verify final status\n",
    "    assert cancelled_job.status == \"cancelled\"\n",
    "    print_highlight(\"Batch job successfully cancelled\")\n",
    "\n",
    "except Exception as e:\n",
    "    print_highlight(f\"Error during cancellation: {e}\")\n",
    "    raise e\n",
    "\n",
    "finally:\n",
    "    try:\n",
    "        del_response = client.files.delete(uploaded_file.id)\n",
    "        if del_response.deleted:\n",
    "            print_highlight(\"Successfully cleaned up input file\")\n",
    "        if os.path.exists(input_file_path):\n",
    "            os.remove(input_file_path)\n",
    "            print_highlight(\"Successfully deleted local batch_requests.jsonl file\")\n",
    "    except Exception as e:\n",
    "        print_highlight(f\"Error cleaning up: {e}\")\n",
    "        raise e"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-11-01T02:46:07.896114Z",
     "iopub.status.busy": "2024-11-01T02:46:07.895820Z",
     "iopub.status.idle": "2024-11-01T02:46:09.365287Z",
     "shell.execute_reply": "2024-11-01T02:46:09.364705Z"
    }
   },
   "outputs": [],
   "source": [
    "terminate_process(server_process)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}