{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# OpenAI APIs - Vision\n", "\n", "SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models.\n", "A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/guides/vision).\n", "This tutorial covers the vision APIs for vision language models.\n", "\n", "SGLang supports vision language models such as Llama 3.2, LLaVA-OneVision, and QWen-VL2 \n", "- [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) \n", "- [lmms-lab/llava-onevision-qwen2-72b-ov-chat](https://huggingface.co/lmms-lab/llava-onevision-qwen2-72b-ov-chat) \n", "- [Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Launch A Server\n", "\n", "This code block is equivalent to executing \n", "\n", "```bash\n", "python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \\\n", " --port 30010 --chat-template llama_3_vision\n", "```\n", "in your terminal and wait for the server to be ready.\n", "\n", "Remember to add `--chat-template llama_3_vision` to specify the vision chat template, otherwise the server only supports text.\n", "We need to specify `--chat-template` for vision language models because the chat template provided in Hugging Face tokenizer only supports text." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2024-11-02 00:24:10.542705: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:479] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", "2024-11-02 00:24:10.554725: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:10575] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", "2024-11-02 00:24:10.554758: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1442] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n", "2024-11-02 00:24:11.063662: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n", "[2024-11-02 00:24:19] server_args=ServerArgs(model_path='meta-llama/Llama-3.2-11B-Vision-Instruct', tokenizer_path='meta-llama/Llama-3.2-11B-Vision-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Llama-3.2-11B-Vision-Instruct', chat_template='llama_3_vision', is_embedding=False, host='127.0.0.1', port=30010, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=553831757, constrained_json_whitespace_pattern=None, decode_log_interval=40, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, watchdog_timeout=600, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, 
ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_penalizer=False, disable_nan_detection=False, enable_overlap_schedule=False, enable_mixed_chunk=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1)\n", "[2024-11-02 00:24:20] Use chat template for the OpenAI-compatible API server: llama_3_vision\n", "[2024-11-02 00:24:29 TP0] Automatically turn off --chunked-prefill-size and adjust --mem-fraction-static for multimodal models.\n", "[2024-11-02 00:24:29 TP0] Init torch distributed begin.\n", "[2024-11-02 00:24:32 TP0] Load weight begin. avail mem=76.83 GB\n", "[2024-11-02 00:24:32 TP0] lm_eval is not installed, GPTQ may not be usable\n", "INFO 11-02 00:24:32 weight_utils.py:243] Using model weights format ['*.safetensors']\n", "Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00

NOTE: Typically, the server runs in a separate terminal.
In this notebook, we run the server and notebook code together, so their outputs are combined.
To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from sglang.utils import (\n", " execute_shell_command,\n", " wait_for_server,\n", " terminate_process,\n", " print_highlight,\n", ")\n", "\n", "embedding_process = execute_shell_command(\n", "\"\"\"\n", "python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \\\n", " --port=30010 --chat-template=llama_3_vision\n", "\"\"\"\n", ")\n", "\n", "wait_for_server(\"http://localhost:30010\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using cURL\n", "\n", "Once the server is up, you can send test requests using curl." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 485 0 0 100 485 0 2420 --:--:-- --:--:-- --:--:-- 2412" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[2024-11-02 00:26:23 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 6462, cache hit rate: 49.97%, token usage: 0.02, #running-req: 0, #queue-req: 0\n", "[2024-11-02 00:26:24] INFO: 127.0.0.1:39828 - \"POST /v1/chat/completions HTTP/1.1\" 200 OK\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "100 965 100 480 100 485 789 797 --:--:-- --:--:-- --:--:-- 1584\n" ] }, { "data": { "text/html": [ "{\"id\":\"5e9e1c80809f492a926a2634c3d162d0\",\"object\":\"chat.completion\",\"created\":1730507184,\"model\":\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\"choices\":[{\"index\":0,\"message\":{\"role\":\"assistant\",\"content\":\"The image depicts a man ironing clothes on an ironing board that is placed on the back of a yellow taxi cab.\"},\"logprobs\":null,\"finish_reason\":\"stop\",\"matched_stop\":128009}],\"usage\":{\"prompt_tokens\":6463,\"total_tokens\":6489,\"completion_tokens\":26,\"prompt_tokens_details\":null}}" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import subprocess\n", "\n", "curl_command = \"\"\"\n", "curl http://localhost:30010/v1/chat/completions \\\n", " -H \"Content-Type: application/json\" \\\n", " -H \"Authorization: Bearer None\" \\\n", " -d '{\n", " \"model\": \"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n", " \"messages\": [\n", " {\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"What’s in this image?\"\n", " },\n", " {\n", " \"type\": \"image_url\",\n", " \"image_url\": {\n", " \"url\": \"https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true\"\n", " }\n", " }\n", " ]\n", " }\n", " ],\n", " \"max_tokens\": 300\n", " }'\n", "\"\"\"\n", "\n", "response = subprocess.check_output(curl_command, shell=True).decode()\n", "print_highlight(response)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using OpenAI Python Client\n", "\n", "You can use the OpenAI Python API library to send requests." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2024-11-02 00:26:33 TP0] Prefill batch. #new-seq: 1, #new-token: 11, #cached-token: 6452, cache hit rate: 66.58%, token usage: 0.02, #running-req: 0, #queue-req: 0\n", "[2024-11-02 00:26:34 TP0] Decode batch. 
#running-req: 1, #token: 6477, token usage: 0.02, gen throughput (token/s): 0.77, #queue-req: 0\n", "[2024-11-02 00:26:34] INFO: 127.0.0.1:43258 - \"POST /v1/chat/completions HTTP/1.1\" 200 OK\n" ] }, { "data": { "text/html": [ "The image shows a man ironing clothes on the back of a yellow taxi cab." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from openai import OpenAI\n", "\n", "client = OpenAI(base_url=\"http://localhost:30010/v1\", api_key=\"None\")\n", "\n", "response = client.chat.completions.create(\n", " model=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n", " messages=[\n", " {\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"What is in this image?\",\n", " },\n", " {\n", " \"type\": \"image_url\",\n", " \"image_url\": {\"url\": \"https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true\"},\n", " },\n", " ],\n", " }\n", " ],\n", " max_tokens=300,\n", ")\n", "\n", "print_highlight(response.choices[0].message.content)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multiple-Image Inputs\n", "\n", "The server also supports multiple images and interleaved text and images if the model supports it." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[2024-11-02 00:20:30 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 12894, cache hit rate: 83.27%, token usage: 0.04, #running-req: 0, #queue-req: 0\n", "[2024-11-02 00:20:30 TP0] Decode batch. #running-req: 1, #token: 12903, token usage: 0.04, gen throughput (token/s): 2.02, #queue-req: 0\n", "[2024-11-02 00:20:30 TP0] Decode batch. #running-req: 1, #token: 12943, token usage: 0.04, gen throughput (token/s): 105.52, #queue-req: 0\n", "[2024-11-02 00:20:30] INFO: 127.0.0.1:41386 - \"POST /v1/chat/completions HTTP/1.1\" 200 OK\n" ] }, { "data": { "text/html": [ "The first image shows a man in a yellow shirt ironing a shirt on the back of a yellow taxi cab, with a red line connecting the two objects. The second image shows a large orange \"S\" and \"G\" on a white background, with a red line connecting them." ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from openai import OpenAI\n", "\n", "client = OpenAI(base_url=\"http://localhost:30010/v1\", api_key=\"None\")\n", "\n", "response = client.chat.completions.create(\n", " model=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n", " messages=[\n", " {\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\n", " \"type\": \"image_url\",\n", " \"image_url\": {\n", " \"url\": \"https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true\",\n", " },\n", " },\n", " {\n", " \"type\": \"image_url\",\n", " \"image_url\": {\n", " \"url\": \"https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png\",\n", " },\n", " },\n", " {\n", " \"type\": \"text\",\n", " \"text\": \"I have two very different images. They are not related at all. 
\"\n", " \"Please describe the first image in one sentence, and then describe the second image in another sentence.\",\n", " },\n", " ],\n", " }\n", " ],\n", " temperature=0,\n", ")\n", "\n", "print_highlight(response.choices[0].message.content)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "terminate_process(embedding_process)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Chat Template\n", "\n", "As mentioned before, if you do not specify a vision model's `--chat-template`, the server uses Hugging Face's default template, which only supports text.\n", "\n", "We list popular vision models with their chat templates:\n", "\n", "- [meta-llama/Llama-3.2-Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) uses `llama_3_vision`.\n", "- [LLaVA-NeXT](https://huggingface.co/collections/lmms-lab/llava-next-6623288e2d61edba3ddbf5ff) uses `chatml-llava`.\n", "- [LlaVA-OneVision](https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov) uses `chatml-llava`.\n", "- [Llama3-LLaVA-NeXT](https://huggingface.co/lmms-lab/llama3-llava-next-8b) uses `llava_llama_3`.\n", "- [LLaVA-v1.5 / 1.6](https://huggingface.co/liuhaotian/llava-v1.6-34b) uses `vicuna_v1.1`." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 2 }