To support a new model in SGLang, you only need to add a single file under [SGLang Models Directory](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/models).
You can learn from existing model implementations and create a new file for your model.
For most models, you should be able to find a similar implementation to start from (e.g., Llama).
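For example, a new model file usually mirrors the layout of an existing one. The sketch below is only an outline: the file name `my_model.py`, the class name `MyModelForCausalLM`, and the method signatures are placeholders that you should copy from the model you start from (e.g., `llama.py`), since they can differ between SGLang versions.

```python
# Hypothetical python/sglang/srt/models/my_model.py -- an outline only.
# Copy the exact layer classes and method signatures from the reference
# model (e.g., llama.py) in your SGLang version.
from torch import nn


class MyModelForCausalLM(nn.Module):
    def __init__(self, config, quant_config=None):
        super().__init__()
        self.config = config
        # Build embeddings, decoder layers, and the logits processor here,
        # reusing SGLang's layers (RadixAttention, parallel linear layers,
        # etc.) the same way the reference model does.

    def forward(self, input_ids, positions, forward_batch):
        # Run one batch through the model; mirror the reference implementation.
        raise NotImplementedError

    def load_weights(self, weights):
        # Map checkpoint weight names to this module's parameters,
        # following the reference model's load_weights.
        raise NotImplementedError


# SGLang discovers the implementation through this module-level name.
EntryClass = MyModelForCausalLM
```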
## How to Support a New Vision LLM
To support a new vision-language model (vLM) in SGLang, there are several key components to implement in addition to the standard LLM support:
1. **Register your new model as multimodal**: Extend `is_multimodal_model` in [`model_config.py`](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/configs/model_config.py) so that it returns `True` for your model (see the sketch after this list).
2. **Process Images**: Create a new `ImageProcessor` class that inherits from `BaseImageProcessor` and register it as your model's dedicated processor. See [`image_processor.py`](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/image_processor.py) for more details.
3. **Handle Image Tokens**: Implement a `pad_input_ids` function for your new model, in which the image tokens in the prompt are expanded and replaced with image hashes, so that SGLang can recognize different images for `RadixAttention` (a simplified sketch appears at the end of this section).
4. **Adapt Vision Attention**: Replace the multi-headed `Attention` of the ViT with SGLang's `VisionAttention`.
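For step 1, the change is typically a one-line addition. The sketch below is illustrative only: the architecture names already listed in `is_multimodal_model` and its exact signature vary across SGLang versions, and `MyNewVLMForConditionalGeneration` is a placeholder for your model's architecture name.

```python
# Illustrative extension of is_multimodal_model in
# python/sglang/srt/configs/model_config.py (exact contents vary by version).
def is_multimodal_model(model_architectures):
    if (
        "LlavaLlamaForCausalLM" in model_architectures
        or "Qwen2VLForConditionalGeneration" in model_architectures
        # Add your new model's HuggingFace architecture name here:
        or "MyNewVLMForConditionalGeneration" in model_architectures
    ):
        return True
    return False
```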
You can refer to [Qwen2VL](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/qwen2_vl.py) or other vLMs; these models demonstrate how to properly handle both visual and textual inputs.
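As a rough illustration of step 3, a `pad_input_ids` hook conceptually expands each image placeholder into hash-derived pad tokens. The snippet below is a simplified, hypothetical sketch: the names `image_token_id`, `num_image_tokens`, and `pad_values`, as well as the return value, are assumptions for illustration; the real implementations in `qwen2_vl.py` and `llava.py` also track per-image token counts and offsets.

```python
# Simplified, hypothetical sketch of a model's pad_input_ids method;
# see qwen2_vl.py or llava.py for a real implementation in your version.
def pad_input_ids(self, input_ids, image_inputs):
    new_input_ids = []
    image_idx = 0
    for token_id in input_ids:
        if token_id == self.config.image_token_id:  # assumed placeholder id
            # Expand one placeholder into a run of tokens derived from the
            # image hash (image_inputs.pad_values), so RadixAttention can
            # distinguish different images during prefix caching.
            pad_value = image_inputs.pad_values[image_idx]
            new_input_ids.extend([pad_value] * self.num_image_tokens)  # assumed per-image count
            image_idx += 1
        else:
            new_input_ids.append(token_id)
    return new_input_ids
```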