Unverified Commit 3bf3d011 authored by Chayenne, committed by GitHub

Add vlm document (#1866)


Co-authored-by: Chayenne <zhaochenyang@g.ucla.edu>
parent d86a2d65
@@ -19,6 +19,7 @@
  "## Launch A Server\n",
  "\n",
  "The following code is equivalent to running this in the shell:\n",
+ "\n",
  "```bash\n",
  "python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct \\\n",
  " --port 30010 --host 0.0.0.0 --is-embedding\n",
@@ -44,161 +45,38 @@
  "output_type": "stream",
  "text": [
  "/home/chenyang/miniconda3/envs/AlphaMeemory/lib/python3.11/site-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.\n",
- " warnings.warn(\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[2024-10-31 19:47:37] server_args=ServerArgs(model_path='Alibaba-NLP/gte-Qwen2-7B-instruct', tokenizer_path='Alibaba-NLP/gte-Qwen2-7B-instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='Alibaba-NLP/gte-Qwen2-7B-instruct', chat_template=None, is_embedding=True, host='0.0.0.0', port=30010, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=314021918, constrained_json_whitespace_pattern=None, decode_log_interval=40, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, watchdog_timeout=600, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_penalizer=False, disable_nan_detection=False, enable_overlap_schedule=False, enable_mixed_chunk=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1)\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
+ " warnings.warn(\n",
+ "[2024-10-31 22:40:37] server_args=ServerArgs(model_path='Alibaba-NLP/gte-Qwen2-7B-instruct', tokenizer_path='Alibaba-NLP/gte-Qwen2-7B-instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='Alibaba-NLP/gte-Qwen2-7B-instruct', chat_template=None, is_embedding=True, host='0.0.0.0', port=30010, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=309155486, constrained_json_whitespace_pattern=None, decode_log_interval=40, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, watchdog_timeout=600, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_penalizer=False, disable_nan_detection=False, enable_overlap_schedule=False, enable_mixed_chunk=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1)\n",
  "/home/chenyang/miniconda3/envs/AlphaMeemory/lib/python3.11/site-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.\n",
  " warnings.warn(\n",
  "/home/chenyang/miniconda3/envs/AlphaMeemory/lib/python3.11/site-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.\n",
- " warnings.warn(\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[2024-10-31 19:47:43 TP0] Init torch distributed begin.\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[2024-10-31 19:47:44 TP0] Load weight begin. avail mem=47.27 GB\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[2024-10-31 19:47:44 TP0] lm_eval is not installed, GPTQ may not be usable\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "INFO 10-31 19:47:45 weight_utils.py:243] Using model weights format ['*.safetensors']\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\r",
- "Loading safetensors checkpoint shards: 0% Completed | 0/7 [00:00<?, ?it/s]\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\r",
- "Loading safetensors checkpoint shards: 14% Completed | 1/7 [00:00<00:03, 1.96it/s]\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\r",
- "Loading safetensors checkpoint shards: 29% Completed | 2/7 [00:01<00:03, 1.39it/s]\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\r",
- "Loading safetensors checkpoint shards: 43% Completed | 3/7 [00:02<00:03, 1.13it/s]\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\r",
- "Loading safetensors checkpoint shards: 57% Completed | 4/7 [00:03<00:02, 1.00it/s]\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\r",
- "Loading safetensors checkpoint shards: 71% Completed | 5/7 [00:04<00:02, 1.05s/it]\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\r",
- "Loading safetensors checkpoint shards: 86% Completed | 6/7 [00:05<00:01, 1.09s/it]\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\r",
- "Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:07<00:00, 1.11s/it]\n",
- "\r",
- "Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:07<00:00, 1.01s/it]\n",
+ " warnings.warn(\n",
+ "[2024-10-31 22:40:42 TP0] Init torch distributed begin.\n",
+ "[2024-10-31 22:40:43 TP0] Load weight begin. avail mem=47.27 GB\n",
+ "[2024-10-31 22:40:43 TP0] lm_eval is not installed, GPTQ may not be usable\n",
+ "INFO 10-31 22:40:44 weight_utils.py:243] Using model weights format ['*.safetensors']\n",
+ "Loading safetensors checkpoint shards: 0% Completed | 0/7 [00:00<?, ?it/s]\n",
+ "Loading safetensors checkpoint shards: 14% Completed | 1/7 [00:00<00:03, 1.97it/s]\n",
+ "Loading safetensors checkpoint shards: 29% Completed | 2/7 [00:01<00:03, 1.40it/s]\n",
+ "Loading safetensors checkpoint shards: 43% Completed | 3/7 [00:02<00:03, 1.11it/s]\n",
+ "Loading safetensors checkpoint shards: 57% Completed | 4/7 [00:03<00:03, 1.00s/it]\n",
+ "Loading safetensors checkpoint shards: 71% Completed | 5/7 [00:04<00:02, 1.07s/it]\n",
+ "Loading safetensors checkpoint shards: 86% Completed | 6/7 [00:05<00:01, 1.10s/it]\n",
+ "Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:07<00:00, 1.12s/it]\n",
+ "Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:07<00:00, 1.02s/it]\n",
  "\n",
- "[2024-10-31 19:47:53 TP0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=32.91 GB\n",
- "[2024-10-31 19:47:53 TP0] Memory pool end. avail mem=4.56 GB\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[2024-10-31 19:47:53 TP0] max_total_num_tokens=509971, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[2024-10-31 19:47:54] INFO: Started server process [1552642]\n",
- "[2024-10-31 19:47:54] INFO: Waiting for application startup.\n",
- "[2024-10-31 19:47:54] INFO: Application startup complete.\n",
- "[2024-10-31 19:47:54] INFO: Uvicorn running on http://0.0.0.0:30010 (Press CTRL+C to quit)\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[2024-10-31 19:47:54] INFO: 127.0.0.1:47776 - \"GET /v1/models HTTP/1.1\" 200 OK\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[2024-10-31 19:47:55] INFO: 127.0.0.1:50344 - \"GET /get_model_info HTTP/1.1\" 200 OK\n",
- "[2024-10-31 19:47:55 TP0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[2024-10-31 19:47:55] INFO: 127.0.0.1:50352 - \"POST /encode HTTP/1.1\" 200 OK\n",
- "[2024-10-31 19:47:55] The server is fired up and ready to roll!\n"
+ "[2024-10-31 22:40:51 TP0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.float16, avail mem=32.91 GB\n",
+ "[2024-10-31 22:40:51 TP0] Memory pool end. avail mem=4.56 GB\n",
+ "[2024-10-31 22:40:52 TP0] max_total_num_tokens=509971, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072\n",
+ "[2024-10-31 22:40:52] INFO: Started server process [1752367]\n",
+ "[2024-10-31 22:40:52] INFO: Waiting for application startup.\n",
+ "[2024-10-31 22:40:52] INFO: Application startup complete.\n",
+ "[2024-10-31 22:40:52] INFO: Uvicorn running on http://0.0.0.0:30010 (Press CTRL+C to quit)\n",
+ "[2024-10-31 22:40:52] INFO: 127.0.0.1:41676 - \"GET /v1/models HTTP/1.1\" 200 OK\n",
+ "[2024-10-31 22:40:53] INFO: 127.0.0.1:41678 - \"GET /get_model_info HTTP/1.1\" 200 OK\n",
+ "[2024-10-31 22:40:53 TP0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0\n",
+ "[2024-10-31 22:40:54] INFO: 127.0.0.1:41684 - \"POST /encode HTTP/1.1\" 200 OK\n",
+ "[2024-10-31 22:40:54] The server is fired up and ready to roll!\n"
  ]
  },
  {
@@ -255,8 +133,8 @@
  "name": "stdout",
  "output_type": "stream",
  "text": [
- "[2024-10-31 19:47:59 TP0] Prefill batch. #new-seq: 1, #new-token: 4, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0\n",
- "[2024-10-31 19:47:59] INFO: 127.0.0.1:50358 - \"POST /v1/embeddings HTTP/1.1\" 200 OK\n"
+ "[2024-10-31 22:40:57 TP0] Prefill batch. #new-seq: 1, #new-token: 4, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0\n",
+ "[2024-10-31 22:40:57] INFO: 127.0.0.1:51746 - \"POST /v1/embeddings HTTP/1.1\" 200 OK\n"
  ]
  },
  {
@@ -312,8 +190,8 @@
  "name": "stdout",
  "output_type": "stream",
  "text": [
- "[2024-10-31 19:47:59 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 3, cache hit rate: 21.43%, token usage: 0.00, #running-req: 0, #queue-req: 0\n",
- "[2024-10-31 19:47:59] INFO: 127.0.0.1:50362 - \"POST /v1/embeddings HTTP/1.1\" 200 OK\n"
+ "[2024-10-31 22:40:58 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 3, cache hit rate: 21.43%, token usage: 0.00, #running-req: 0, #queue-req: 0\n",
+ "[2024-10-31 22:40:58] INFO: 127.0.0.1:51750 - \"POST /v1/embeddings HTTP/1.1\" 200 OK\n"
  ]
  },
  {
@@ -377,8 +255,8 @@
  "name": "stdout",
  "output_type": "stream",
  "text": [
- "[2024-10-31 19:48:01 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 3, cache hit rate: 33.33%, token usage: 0.00, #running-req: 0, #queue-req: 0\n",
- "[2024-10-31 19:48:01] INFO: 127.0.0.1:50366 - \"POST /v1/embeddings HTTP/1.1\" 200 OK\n"
+ "[2024-10-31 22:41:00 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 3, cache hit rate: 33.33%, token usage: 0.00, #running-req: 0, #queue-req: 0\n",
+ "[2024-10-31 22:41:00] INFO: 127.0.0.1:51762 - \"POST /v1/embeddings HTTP/1.1\" 200 OK\n"
  ]
  },
  {
......
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Vision Language Model\n",
"\n",
"SGLang supports vision language models in the same way as completion models. Here are some example models:\n",
"\n",
"- [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct)\n",
"- [lmms-lab/llava-onevision-qwen2-7b-ov](https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Launch A Server\n",
"\n",
"The following code is equivalent to running this in the shell:\n",
"\n",
"```bash\n",
"python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \\\n",
" --port=30010 --chat-template=llama_3_vision\n",
"```\n",
"\n",
"Remember to add `--chat-template=llama_3_vision` to specify the vision chat template, otherwise the server only supports text."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/home/chenyang/miniconda3/envs/AlphaMeemory/lib/python3.11/site-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.\n",
" warnings.warn(\n",
"[2024-10-31 23:10:49] server_args=ServerArgs(model_path='meta-llama/Llama-3.2-11B-Vision-Instruct', tokenizer_path='meta-llama/Llama-3.2-11B-Vision-Instruct', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='meta-llama/Llama-3.2-11B-Vision-Instruct', chat_template='llama_3_vision', is_embedding=False, host='127.0.0.1', port=30010, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=178735948, constrained_json_whitespace_pattern=None, decode_log_interval=40, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, watchdog_timeout=600, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_penalizer=False, disable_nan_detection=False, enable_overlap_schedule=False, enable_mixed_chunk=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1)\n",
"/home/chenyang/miniconda3/envs/AlphaMeemory/lib/python3.11/site-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.\n",
" warnings.warn(\n",
"/home/chenyang/miniconda3/envs/AlphaMeemory/lib/python3.11/site-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.\n",
" warnings.warn(\n",
"[2024-10-31 23:10:51] Use chat template for the OpenAI-compatible API server: llama_3_vision\n",
"[2024-10-31 23:10:56 TP0] Automatically turn off --chunked-prefill-size and adjust --mem-fraction-static for multimodal models.\n",
"[2024-10-31 23:10:56 TP0] Init torch distributed begin.\n",
"[2024-10-31 23:10:56 TP0] Load weight begin. avail mem=47.27 GB\n",
"[2024-10-31 23:10:57 TP0] lm_eval is not installed, GPTQ may not be usable\n",
"INFO 10-31 23:10:57 weight_utils.py:243] Using model weights format ['*.safetensors']\n",
"Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]\n",
"Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:00<00:01, 2.31it/s]\n",
"Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:00<00:01, 2.24it/s]\n",
"Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:01<00:00, 2.23it/s]\n",
"Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:01<00:00, 2.21it/s]\n",
"Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:01<00:00, 2.85it/s]\n",
"Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:01<00:00, 2.54it/s]\n",
"\n",
"[2024-10-31 23:10:59 TP0] Load weight end. type=MllamaForConditionalGeneration, dtype=torch.bfloat16, avail mem=27.15 GB\n",
"[2024-10-31 23:10:59 TP0] Memory pool end. avail mem=6.62 GB\n",
"[2024-10-31 23:10:59 TP0] Capture cuda graph begin. This can take up to several minutes.\n",
"[2024-10-31 23:11:09 TP0] max_total_num_tokens=127149, max_prefill_tokens=16384, max_running_requests=2049, context_len=131072\n",
"[2024-10-31 23:11:10] INFO: Started server process [1796287]\n",
"[2024-10-31 23:11:10] INFO: Waiting for application startup.\n",
"[2024-10-31 23:11:10] INFO: Application startup complete.\n",
"[2024-10-31 23:11:10] INFO: Uvicorn running on http://127.0.0.1:30010 (Press CTRL+C to quit)\n",
"[2024-10-31 23:11:10] INFO: 127.0.0.1:60770 - \"GET /v1/models HTTP/1.1\" 200 OK\n",
"[2024-10-31 23:11:11] INFO: 127.0.0.1:60780 - \"GET /get_model_info HTTP/1.1\" 200 OK\n",
"[2024-10-31 23:11:11 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0\n",
"[2024-10-31 23:11:11] INFO: 127.0.0.1:60796 - \"POST /generate HTTP/1.1\" 200 OK\n",
"[2024-10-31 23:11:11] The server is fired up and ready to roll!\n"
]
},
{
"data": {
"text/html": [
"<strong style='color: #00008B;'><br><br> NOTE: Typically, the server runs in a separate terminal.<br> In this notebook, we run the server and notebook code together, so their outputs are combined.<br> To improve clarity, the server logs are displayed in the original black color, while the notebook outputs are highlighted in blue.<br> </strong>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sglang.utils import (\n",
" execute_shell_command,\n",
" wait_for_server,\n",
" terminate_process,\n",
" print_highlight,\n",
")\n",
"\n",
"embedding_process = execute_shell_command(\n",
" \"\"\"\n",
" python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \\\n",
" --port=30010 --chat-template=llama_3_vision\n",
"\n",
"\"\"\"\n",
")\n",
"\n",
"wait_for_server(\"http://localhost:30010\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Use Curl"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" % Total % Received % Xferd Average Speed Time Time Time Current\n",
" Dload Upload Total Spent Left Speed\n",
"100 559 0 0 100 559 0 253 0:00:02 0:00:02 --:--:-- 253"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"/home/chenyang/miniconda3/envs/AlphaMeemory/lib/python3.11/site-packages/torch/storage.py:414: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.\n",
" return torch.load(io.BytesIO(b))\n",
"[2024-10-31 23:11:18 TP0] Prefill batch. #new-seq: 1, #new-token: 6463, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100 559 0 0 100 559 0 174 0:00:03 0:00:03 --:--:-- 174"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2024-10-31 23:11:20 TP0] Decode batch. #running-req: 1, #token: 6496, token usage: 0.05, gen throughput (token/s): 3.90, #queue-req: 0\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100 559 0 0 100 559 0 107 0:00:05 0:00:05 --:--:-- 107"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2024-10-31 23:11:21 TP0] Decode batch. #running-req: 1, #token: 6536, token usage: 0.05, gen throughput (token/s): 33.67, #queue-req: 0\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100 559 0 0 100 559 0 90 0:00:06 0:00:06 --:--:-- 0"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2024-10-31 23:11:22 TP0] Decode batch. #running-req: 1, #token: 6576, token usage: 0.05, gen throughput (token/s): 33.60, #queue-req: 0\n",
"[2024-10-31 23:11:22] INFO: 127.0.0.1:54224 - \"POST /v1/chat/completions HTTP/1.1\" 200 OK\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"100 1544 100 985 100 559 142 80 0:00:06 0:00:06 --:--:-- 265\n"
]
},
{
"data": {
"text/html": [
"<strong style='color: #00008B;'>{'id': 'f618453e8f3e4408b893a958f2868a44', 'object': 'chat.completion', 'created': 1730441482, 'model': 'meta-llama/Llama-3.2-11B-Vision-Instruct', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': 'The image depicts a serene and peaceful landscape featuring a wooden boardwalk that meanders through a lush grassy field, set against a backdrop of trees and a bright blue sky with wispy clouds. The boardwalk is made of weathered wooden planks and is surrounded by tall grass on either side, creating a sense of depth and texture. The surrounding trees add a touch of natural beauty to the scene, while the blue sky with wispy clouds provides a sense of calmness and serenity. The overall atmosphere of the image is one of tranquility and relaxation, inviting the viewer to step into the peaceful world depicted.'}, 'logprobs': None, 'finish_reason': 'stop', 'matched_stop': 128009}], 'usage': {'prompt_tokens': 6463, 'total_tokens': 6588, 'completion_tokens': 125, 'prompt_tokens_details': None}}</strong>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import subprocess, json, os\n",
"\n",
"curl_command = \"\"\"\n",
"curl http://localhost:30010/v1/chat/completions \\\n",
" -H \"Content-Type: application/json\" \\\n",
" -H \"Authorization: Bearer None\" \\\n",
" -d '{\n",
" \"model\": \"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n",
" \"messages\": [\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"text\",\n",
" \"text\": \"What’s in this image?\"\n",
" },\n",
" {\n",
" \"type\": \"image_url\",\n",
" \"image_url\": {\n",
" \"url\": \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\"\n",
" }\n",
" }\n",
" ]\n",
" }\n",
" ],\n",
" \"max_tokens\": 300\n",
" }'\n",
"\"\"\"\n",
"\n",
"response = json.loads(subprocess.check_output(curl_command, shell=True))\n",
"print_highlight(response)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using OpenAI Compatible API"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2024-10-31 23:11:23 TP0] Prefill batch. #new-seq: 1, #new-token: 6463, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0\n",
"[2024-10-31 23:11:24 TP0] Decode batch. #running-req: 1, #token: 6492, token usage: 0.05, gen throughput (token/s): 20.07, #queue-req: 0\n",
"[2024-10-31 23:11:25 TP0] Decode batch. #running-req: 1, #token: 6532, token usage: 0.05, gen throughput (token/s): 33.68, #queue-req: 0\n",
"[2024-10-31 23:11:26 TP0] Decode batch. #running-req: 1, #token: 6572, token usage: 0.05, gen throughput (token/s): 33.62, #queue-req: 0\n",
"[2024-10-31 23:11:27 TP0] Decode batch. #running-req: 1, #token: 6612, token usage: 0.05, gen throughput (token/s): 33.62, #queue-req: 0\n",
"[2024-10-31 23:11:28] INFO: 127.0.0.1:54228 - \"POST /v1/chat/completions HTTP/1.1\" 200 OK\n"
]
},
{
"data": {
"text/html": [
"<strong style='color: #00008B;'>The image depicts a serene and peaceful scene of a wooden boardwalk leading through a lush field of tall grass, set against a backdrop of trees and a blue sky with clouds. The boardwalk is made of light-colored wood and has a simple design, with the wooden planks running parallel to each other. It stretches out into the distance, disappearing into the horizon.<br><br>The field is filled with tall, vibrant green grass that sways gently in the breeze, creating a sense of movement and life. The trees in the background are also lush and green, adding depth and texture to the scene. The blue sky above is dotted with white clouds, which are scattered across the horizon. The overall atmosphere of the image is one of tranquility and serenity, inviting the viewer to step into the peaceful world depicted.</strong>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import base64, requests\n",
"from openai import OpenAI\n",
"\n",
"client = OpenAI(base_url=\"http://localhost:30010/v1\", api_key=\"None\")\n",
"\n",
"\n",
"def encode_image(image_path):\n",
" with open(image_path, \"rb\") as image_file:\n",
" return base64.b64encode(image_file.read()).decode(\"utf-8\")\n",
"\n",
"\n",
"def download_image(image_url, image_path):\n",
" response = requests.get(image_url)\n",
" response.raise_for_status()\n",
" with open(image_path, \"wb\") as f:\n",
" f.write(response.content)\n",
"\n",
"\n",
"image_url = \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\"\n",
"image_path = \"boardwalk.jpeg\"\n",
"download_image(image_url, image_path)\n",
"\n",
"base64_image = encode_image(image_path)\n",
"\n",
"response = client.chat.completions.create(\n",
" model=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n",
" messages=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"text\",\n",
" \"text\": \"What is in this image?\",\n",
" },\n",
" {\n",
" \"type\": \"image_url\",\n",
" \"image_url\": {\"url\": f\"data:image/jpeg;base64,{base64_image}\"},\n",
" },\n",
" ],\n",
" }\n",
" ],\n",
" max_tokens=300,\n",
")\n",
"\n",
"print_highlight(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Multiple Images Input"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2024-10-31 23:11:28 TP0] Prefill batch. #new-seq: 1, #new-token: 12871, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0\n",
"[2024-10-31 23:11:30 TP0] Decode batch. #running-req: 1, #token: 12899, token usage: 0.10, gen throughput (token/s): 15.36, #queue-req: 0\n",
"[2024-10-31 23:11:31 TP0] Decode batch. #running-req: 1, #token: 12939, token usage: 0.10, gen throughput (token/s): 33.33, #queue-req: 0\n",
"[2024-10-31 23:11:32 TP0] Decode batch. #running-req: 1, #token: 12979, token usage: 0.10, gen throughput (token/s): 33.28, #queue-req: 0\n",
"[2024-10-31 23:11:33] INFO: 127.0.0.1:50966 - \"POST /v1/chat/completions HTTP/1.1\" 200 OK\n",
"Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='The two images depict a serene and idyllic scene, with the first image showing a well-trodden wooden path through a field, while the second image shows an overgrown, less-traveled path through the same field. The first image features a clear and well-maintained wooden path, whereas the second image shows a more neglected and overgrown path that is not as well-defined. The first image has a more vibrant and inviting atmosphere, while the second image appears more peaceful and serene. Overall, both images evoke a sense of tranquility and connection to nature.', refusal=None, role='assistant', function_call=None, tool_calls=None), matched_stop=128009)\n"
]
}
],
"source": [
"from openai import OpenAI\n",
"\n",
"client = OpenAI(base_url=\"http://localhost:30010/v1\", api_key=\"None\")\n",
"\n",
"response = client.chat.completions.create(\n",
" model=\"meta-llama/Llama-3.2-11B-Vision-Instruct\",\n",
" messages=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"text\",\n",
" \"text\": \"Are there any differences between these two images?\",\n",
" },\n",
" {\n",
" \"type\": \"image_url\",\n",
" \"image_url\": {\n",
" \"url\": \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\",\n",
" },\n",
" },\n",
" {\n",
" \"type\": \"image_url\",\n",
" \"image_url\": {\n",
" \"url\": \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\",\n",
" },\n",
" },\n",
" ],\n",
" }\n",
" ],\n",
" max_tokens=300,\n",
")\n",
"print(response.choices[0])"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(embedding_process)\n",
"os.remove(image_path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Chat Template\n",
"\n",
"As mentioned before, if you do not specify a vision model's `chat-template`, the server uses Hugging Face's default template, which only supports text.\n",
"\n",
"You can add your custom chat template by referring to the [custom chat template](../references/custom_chat_template.md).\n",
"\n",
"We list popular vision models with their chat templates:\n",
"\n",
"- [meta-llama/Llama-3.2-Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) uses `llama_3_vision`.\n",
"- [LLaVA-NeXT](https://huggingface.co/collections/lmms-lab/llava-next-6623288e2d61edba3ddbf5ff) uses `chatml-llava`.\n",
"- [llama3-llava-next](https://huggingface.co/lmms-lab/llama3-llava-next-8b) uses `llava_llama_3`.\n",
"- [llava-onevision](https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov) uses `chatml-llava`.\n",
"- [liuhaotian/llava-v1.5 / 1.6](https://huggingface.co/liuhaotian/llava-v1.5-13b) uses `vicuna_v1.1`."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "AlphaMeemory",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -24,6 +24,7 @@ The core features include:
  :caption: Backend Tutorial
  backend/openai_api.ipynb
+ backend/vision_language_model.ipynb
  backend/backend.md
......