doc: update developer guide regarding mllms (#6138)

Signed-off-by: Xinyuan Tong <justinning0323@outlook.com> Co-authored-by: XinyuanTong <115166877+JustinTong0323@users.noreply.github.com> Co-authored-by: Xinyuan Tong <justinning0323@outlook.com>

doc: update developer guide regarding mllms (#6138)
Signed-off-by: Xinyuan Tong <justinning0323@outlook.com> Co-authored-by: XinyuanTong <115166877+JustinTong0323@users.noreply.github.com> Co-authored-by: Xinyuan Tong <justinning0323@outlook.com>
cd7c8a8d · Mick · GitHub · 3e350a93 · cd7c8a8d · cd7c8a8d
Unverified Commit cd7c8a8d authored May 14, 2025 by Mick Committed by GitHub May 14, 2025
4 changed files
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -38,7 +38,7 @@ The core features include:
   :caption: Supported Models
   supported_models/generative_models.md
-   supported_models/vision_language_models.md
+   supported_models/multimodal_language_models.md
   supported_models/embedding_models.md
   supported_models/reward_models.md
   supported_models/support_new_models.md

--- a/docs/supported_models/multimodal_language_models.md
+++ b/docs/supported_models/multimodal_language_models.md
+# Multimodal Language Models
+These models accept multi-modal inputs (e.g., images and text) and generate text output. They augment language models
+with multimodal encoders.
+## Example launch Command
+```shell
+python3 -m sglang.launch_server \
+  --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \  # example HF/local path
+  --host 0.0.0.0 \
+  --port 30000 \
+```
+## Supporting Metrics
+| Model Family (Variants)    | Example HuggingFace Identifier             | Chat Template    | Description                                                                                                                                                                                                     |
+|----------------------------|--------------------------------------------|------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| **Qwen-VL** (Qwen2 series) | `Qwen/Qwen2.5-VL-7B-Instruct`              | `qwen2-vl`       | Alibaba’s vision-language extension of Qwen; for example, Qwen2.5-VL (7B and larger variants) can analyze and converse about image content.                                                                     |
+| **DeepSeek-VL2**           | `deepseek-ai/deepseek-vl2`                 | `deepseek-vl2`   | Vision-language variant of DeepSeek (with a dedicated image processor), enabling advanced multimodal reasoning on image and text inputs.                                                                        |
+| **Janus-Pro** (1B, 7B)     | `deepseek-ai/Janus-Pro-7B`                 | `janus-pro`      | DeepSeek’s open-source multimodal model capable of both image understanding and generation. Janus-Pro employs a decoupled architecture for separate visual encoding paths, enhancing performance in both tasks. |
+| **MiniCPM-V / MiniCPM-o**  | `openbmb/MiniCPM-V-2_6`                    | `minicpmv`       | MiniCPM-V (2.6, ~8B) supports image inputs, and MiniCPM-o adds audio/video; these multimodal LLMs are optimized for end-side deployment on mobile/edge devices.                                                 |
+| **Llama 3.2 Vision** (11B) | `meta-llama/Llama-3.2-11B-Vision-Instruct` | `llama_3_vision` | Vision-enabled variant of Llama 3 (11B) that accepts image inputs for visual question answering and other multimodal tasks.                                                                                     |
+| **LLaVA** (v1.5 & v1.6)    | *e.g.* `liuhaotian/llava-v1.5-13b`         | `vicuna_v1.1`    | Open vision-chat models that add an image encoder to LLaMA/Vicuna (e.g. LLaMA2 13B) for following multimodal instruction prompts.                                                                               |
+| **LLaVA-NeXT** (8B, 72B)   | `lmms-lab/llava-next-72b`                  | `chatml-llava`   | Improved LLaVA models (with an 8B Llama3 version and a 72B version) offering enhanced visual instruction-following and accuracy on multimodal benchmarks.                                                       |
+| **LLaVA-OneVision**        | `lmms-lab/llava-onevision-qwen2-7b-ov`     | `chatml-llava`   | Enhanced LLaVA variant integrating Qwen as the backbone; supports multiple images (and even video frames) as inputs via an OpenAI Vision API-compatible format.                                                 |
+| **Gemma 3 (Multimodal)**   | `google/gemma-3-4b-it`                     | `gemma-it`       | Gemma 3’s larger models (4B, 12B, 27B) accept images (each image encoded as 256 tokens) alongside text in a combined 128K-token context.                                                                        |
+| **Kimi-VL** (A3B)          | `moonshotai/Kimi-VL-A3B-Instruct`          | `kimi-vl`        | Kimi-VL is a multimodal model that can understand and generate text from images.                                                                                                                                |
--- a/docs/supported_models/support_new_models.md
+++ b/docs/supported_models/support_new_models.md
 # How to Support New Models
-This document explains how to add support for new language models and vision‐language models (VLMs) in SGLang. It also covers how to test new models and register external implementations.
+This document explains how to add support for new language models and multimodal large language models (mllms) in
+SGLang. It also covers how to test new models and register external implementations.
 ## How to Support a new Language Model
-To support a new model in SGLang, you only need to add a single file under the [SGLang Models Directory](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/models). You can learn from existing model implementations and create a new file for your model. For most models, you should be able to find a similar model to start with (e.g., starting from Llama). Also refer how to [port a Model from vLLM to SGLang](#port-a-model-from-vllm-to-sglang)
+To support a new model in SGLang, you only need to add a single file under
+the [SGLang Models Directory](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/models). You can learn
+from existing model implementations and create a new file for your model. For most models, you should be able to find a
+similar model to start with (e.g., starting from Llama). Also refer how
+to [port a Model from vLLM to SGLang](#port-a-model-from-vllm-to-sglang)
-## How to Support a new Vision-Language model
+## How to Support a new Multimodal Large Language Model
-To support a new vision-language model (vLM) in SGLang, there are several key components in addition to the standard LLM support:
+To support a new multimodal large language model (MLLM) in SGLang, there are several key components in addition to the
+standard LLM support:
 1. **Register your new model as multimodal**:
-   Extend `is_multimodal_model` in [model_config.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/configs/model_config.py) to return `True` for your model.
+   Extend `is_multimodal_model`
+   in [model_config.py](https://github.com/sgl-project/sglang/blob/0ab3f437aba729b348a683ab32b35b214456efc7/python/sglang/srt/configs/model_config.py#L561)
+   to return `True` for your model.
-2. **Process Images**:
+2. **Register a new chat-template**
-   Define a new `Processor` class that inherits from `BaseProcessor` and register this processor as your model’s dedicated processor. See [multimodal_processor.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/multimodal_processor.py) for more details.
+   See [conversation.py](https://github.com/sgl-project/sglang/blob/86a779dbe9e815c02f71ea82574608f6eae016b5/python/sglang/srt/conversation.py)
-3. **Handle Image Tokens**:
+3. **Multimodal Data Processor**:
-   Implement a `pad_input_ids` function for your new model. In this function, image tokens in the prompt should be expanded and replaced with image-hashes so that SGLang can recognize different images when using `RadixAttention`.
+   Define a new `Processor` class that inherits from `BaseMultimodalProcessor` and register this processor as your
+   model’s dedicated processor.
+   See [multimodal_processor.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/multimodal_processor.py)
+   for more details.
-4. **Replace Vision Attention**:
+4. **Handle Multimodal Tokens**:
-   Replace the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`.
+   Implement a `pad_input_ids` function for your new model. In this function, multimodal tokens in the prompt should be
+   expanded (if necessary) and padded with multimodal-data-hashes so that SGLang can recognize different multimodal data
+   with `RadixAttention`.
-You can refer to [Qwen2VL](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/qwen2_vl.py) or other vLM implementations. These models demonstrate how to correctly handle both multimodal and textual inputs.
+5. **Adapt to Vision Attention**:
+   Adapt the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`.
-You should test the new vLM locally against Hugging Face models. See the [`mmmu`](https://github.com/sgl-project/sglang/tree/main/benchmark/mmmu) benchmark for an example.
+You can refer to [Qwen2VL](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/qwen2_vl.py) or
+other mllm implementations. These models demonstrate how to correctly handle both multimodal and textual inputs.
+You should test the new MLLM locally against Hugging Face models. See the [
+`mmmu`](https://github.com/sgl-project/sglang/tree/main/benchmark/mmmu) benchmark for an example.
 ## Test the Correctness
 ### Interactive Debugging
-For interactive debugging, compare the outputs of Hugging Face/Transformers and SGLang. The following two commands should give the same text output and very similar prefill logits:
+For interactive debugging, compare the outputs of Hugging Face/Transformers and SGLang. The following two commands
+should give the same text output and very similar prefill logits:
 - Get the reference output:
  ```bash
-  python3 scripts/playground/reference_hf.py --model-path [new model] --model-type {text,vlm}
+  python3 scripts/playground/reference_hf.py --model-path [new model] --model-type {text,mllm}
  ```
 - Get the SGLang output:
  ```bash
@@ -43,7 +62,10 @@ For interactive debugging, compare the outputs of Hugging Face/Transformers and
 ### Add the Model to the Test Suite
-To ensure the new model is well maintained, add it to the test suite by including it in the `ALL_OTHER_MODELS` list in the [test_generation_models.py](https://github.com/sgl-project/sglang/blob/main/test/srt/models/test_generation_models.py) file, test the new model on your local machine and report the results on demonstrative benchmarks (GSM8K, MMLU, MMMU, MMMU-Pro, etc.) in your PR.
+To ensure the new model is well maintained, add it to the test suite by including it in the `ALL_OTHER_MODELS` list in
+the [test_generation_models.py](https://github.com/sgl-project/sglang/blob/main/test/srt/models/test_generation_models.py)
+file, test the new model on your local machine and report the results on demonstrative benchmarks (GSM8K, MMLU, MMMU,
+MMMU-Pro, etc.) in your PR.
 This is the command to test a new model on your local machine:
@@ -53,26 +75,29 @@ ONLY_RUN=Qwen/Qwen2-1.5B python3 -m unittest test_generation_models.TestGenerati
 ## Port a Model from vLLM to SGLang
-The [vLLM Models Directory](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) is a valuable resource, as vLLM covers many models. SGLang reuses vLLM’s interface and some layers, making it easier to port models from vLLM to SGLang.
+The [vLLM Models Directory](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) is a valuable
+resource, as vLLM covers many models. SGLang reuses vLLM’s interface and some layers, making it easier to port models
+from vLLM to SGLang.
 To port a model from vLLM to SGLang:
 - Compare these two files for guidance:
-  - [SGLang Llama Implementation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/llama.py)
+    - [SGLang Llama Implementation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/llama.py)
-  - [vLLM Llama Implementation](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py)
+    - [vLLM Llama Implementation](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py)
 - The major differences include:
-  - **Replace vLLM’s `Attention` with `RadixAttention`** (ensure you pass `layer_id` to `RadixAttention`).
+    - **Replace vLLM’s `Attention` with `RadixAttention`** (ensure you pass `layer_id` to `RadixAttention`).
-  - **Replace vLLM’s `LogitsProcessor` with SGLang’s `LogitsProcessor`.**
+    - **Replace vLLM’s `LogitsProcessor` with SGLang’s `LogitsProcessor`.**
-  - **Replace the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`.**
+    - **Replace the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`.**
-  - **Replace other vLLM layers** (such as `RMSNorm`, `SiluAndMul`) with SGLang layers.
+    - **Replace other vLLM layers** (such as `RMSNorm`, `SiluAndMul`) with SGLang layers.
-  - **Remove `Sample`.**
+    - **Remove `Sample`.**
-  - **Change the `forward()` functions** and add a `forward_batch()` method.
+    - **Change the `forward()` functions** and add a `forward_batch()` method.
-  - **Add `EntryClass`** at the end.
+    - **Add `EntryClass`** at the end.
-  - **Ensure that the new implementation uses only SGLang components** and does not rely on any vLLM components.
+    - **Ensure that the new implementation uses only SGLang components** and does not rely on any vLLM components.
 ## Registering an External Model Implementation
-In addition to the methods above, you can register your new model with the `ModelRegistry` before launching the server. This allows you to integrate your model without modifying the source code.
+In addition to the methods above, you can register your new model with the `ModelRegistry` before launching the server.
+This allows you to integrate your model without modifying the source code.
 For example:
@@ -101,4 +126,5 @@ launch_server(server_args)
 ---
-By following these guidelines, you can add support for new language models and vision-language models in SGLang and ensure they are thoroughly tested and easily integrated into the system.
+By following these guidelines, you can add support for new language models and multimodal large language models in
+SGLang and ensure they are thoroughly tested and easily integrated into the system.
--- a/docs/supported_models/vision_language_models.md
+++ b/docs/supported_models/vision_language_models.md
-# Vision Language Models
-These models accept multi-modal inputs (e.g., images and text) and generate text output. They augment language models with visual encoders and require a specific chat template for handling vision prompts.
-## Example launch Command
-```shell
-python3 -m sglang.launch_server \
-  --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \  # example HF/local path
-  --host 0.0.0.0 \
-  --port 30000 \
-```
-## Supporting Matrixs
-| Model Family (Variants)        | Example HuggingFace Identifier                     | Chat Template        | Description                                                                            |
-|--------------------------------|--------------------------------------------------|----------------------|----------------------------------------------------------------------------------------|
-| **Qwen-VL** (Qwen2 series)     | `Qwen/Qwen2.5-VL-7B-Instruct`                    | `qwen2-vl`           | Alibaba’s vision-language extension of Qwen; for example, Qwen2.5-VL (7B and larger variants) can analyze and converse about image content. |
-| **DeepSeek-VL2**               | `deepseek-ai/deepseek-vl2`                       | `deepseek-vl2`       | Vision-language variant of DeepSeek (with a dedicated image processor), enabling advanced multimodal reasoning on image and text inputs. |
-| **Janus-Pro** (1B, 7B)      | `deepseek-ai/Janus-Pro-7B`                     | `janus-pro`       | DeepSeek’s open-source multimodal model capable of both image understanding and generation. Janus-Pro employs a decoupled architecture for separate visual encoding paths, enhancing performance in both tasks. |
-| **MiniCPM-V / MiniCPM-o**      | `openbmb/MiniCPM-V-2_6`                          | `minicpmv`           | MiniCPM-V (2.6, ~8B) supports image inputs, and MiniCPM-o adds audio/video; these multimodal LLMs are optimized for end-side deployment on mobile/edge devices. |
-| **Llama 3.2 Vision** (11B)     | `meta-llama/Llama-3.2-11B-Vision-Instruct`        | `llama_3_vision`     | Vision-enabled variant of Llama 3 (11B) that accepts image inputs for visual question answering and other multimodal tasks. |
-| **Pixtral** (12B, 124B)              | `mistral-community/pixtral-12b`                           | `mistral`     | Pixtral is a vision-language model from Mistral AI that can process both text and images. |
-| **LLaVA** (v1.5 & v1.6)        | *e.g.* `liuhaotian/llava-v1.5-13b`               | `vicuna_v1.1`        | Open vision-chat models that add an image encoder to LLaMA/Vicuna (e.g. LLaMA2 13B) for following multimodal instruction prompts. |
-| **LLaVA-NeXT** (8B, 72B)       | `lmms-lab/llava-next-72b`                        | `chatml-llava`       | Improved LLaVA models (with an 8B Llama3 version and a 72B version) offering enhanced visual instruction-following and accuracy on multimodal benchmarks. |
-| **LLaVA-OneVision**            | `lmms-lab/llava-onevision-qwen2-7b-ov`           | `chatml-llava`       | Enhanced LLaVA variant integrating Qwen as the backbone; supports multiple images (and even video frames) as inputs via an OpenAI Vision API-compatible format. |
-| **Gemma 3 (Multimodal)**       | `google/gemma-3-4b-it`                           | `gemma-it`           | Gemma 3’s larger models (4B, 12B, 27B) accept images (each image encoded as 256 tokens) alongside text in a combined 128K-token context. |
-| **Kimi-VL** (A3B)              | `moonshotai/Kimi-VL-A3B-Instruct`                | `kimi-vl`            | Kimi-VL is a multimodal model that can understand and generate text from images. |