"llm/vscode:/vscode.git/clone" did not exist on "35934b2e05cd598a6de0a1ed1ef62c11fb078f36"
supported_models.md 7.93 KB
Newer Older
1
2
3
# Supported Models

## Generative Models
- Llama / Llama 2 / Llama 3 / Llama 3.1 / Llama 3.2 / Llama 3.3 / Llama 4
- Mistral / Mixtral / Mistral NeMo / Mistral Small 3
- Gemma / Gemma 2 / Gemma 3
- Qwen / Qwen 2 / Qwen 2 MoE / Qwen 2 VL / Qwen 2.5 VL / Olympic Coder
- DeepSeek / DeepSeek 2 / [DeepSeek 3](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3)
- OLMoE
- [LLaVA-OneVision](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/)
  - `python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov --port=30000 --chat-template=chatml-llava`
  - `python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-72b-ov --port=30000 --tp-size=8 --chat-template=chatml-llava`
  - Query the server with the [OpenAI Vision API](https://platform.openai.com/docs/guides/vision). See examples at [test/srt/test_vision_openai_server.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_vision_openai_server.py)
- LLaVA 1.5 / 1.6 / NeXT
  - `python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --port=30000 --tp-size=1 --chat-template=llava_llama_3`
  - `python -m sglang.launch_server --model-path lmms-lab/llava-next-72b --port=30000 --tp-size=8 --chat-template=chatml-llava`
  - Query the server with the [OpenAI Vision API](https://platform.openai.com/docs/guides/vision). See examples at [test/srt/test_vision_openai_server.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_vision_openai_server.py)
- Yi-VL
- StableLM
- Command-R
- DBRX
- Grok
- ChatGLM
- InternLM 2
- Exaone 3
- BaiChuan2
- MiniCPM / MiniCPM 3 / MiniCPM-v / MiniCPM-o
- XVERSE / XVERSE MoE
- SmolLM
- GLM-4
- Phi-3 / Phi-4
- Phi-3-Small
- IBM Granite 3
- Janus-Pro-1B / Janus-Pro-7B
- Deepseek-VL2 / Deepseek-VL2-small
- Gemma 3 (it)

## Embedding Models

- LlamaEmbeddingModel
- Mistral embedding models
- Qwen embedding models
  - `python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct --is-embedding`
- Multi-modal embedding models
  - `python -m sglang.launch_server --model-path Alibaba-NLP/gme-Qwen2-VL-2B-Instruct --is-embedding --chat-template gme-qwen2-vl`
- CLIP
  - `python -m sglang.launch_server --model-path openai/clip-vit-large-patch14-336 --is-embedding`
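
Once a server is launched with `--is-embedding`, you can query it through the OpenAI-compatible `/v1/embeddings` endpoint. A minimal sketch, assuming the `gte-Qwen2` server above on the default port 30000:

```python
# Minimal sketch: query an SGLang embedding server via the
# OpenAI-compatible API (assumes the server runs on port 30000).
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")
resp = client.embeddings.create(
    model="Alibaba-NLP/gte-Qwen2-7B-instruct",
    input="Once upon a time",
)
print(len(resp.data[0].embedding))  # embedding dimension
```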

## Reward Models

- LlamaForSequenceClassification
  - `python -m sglang.launch_server --model-path Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 --is-embedding`
- Gemma2ForSequenceClassification
  - `python -m sglang.launch_server --model-path Skywork/Skywork-Reward-Gemma-2-27B-v0.2 --is-embedding`
- InternLM2ForRewardModel
  - `python -m sglang.launch_server --model-path internlm/internlm2-7b-reward --is-embedding --trust-remote-code`
- Qwen2ForRewardModel
  - `python -m sglang.launch_server --model-path Qwen/Qwen2.5-Math-RM-72B --is-embedding --trust-remote-code --tp-size=4`
- Qwen2ForSequenceClassification
  - `python -m sglang.launch_server --model-path jason9693/Qwen2.5-1.5B-apeach --is-embedding --trust-remote-code`
## How to Support a New Language Model

To support a new model in SGLang, you only need to add a single file under the [SGLang Models Directory](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/models).
You can learn from the existing model implementations and create a new file for your model.
For most models, you should be able to find a similar implementation to start from (e.g., Llama).
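
For orientation, here is a heavily simplified skeleton of such a file. All names below are illustrative assumptions; mirror an existing implementation such as `llama.py` for the real layer construction and weight loading.

```python
# Hypothetical skeleton of python/sglang/srt/models/my_model.py.
from torch import nn

from sglang.srt.layers.logits_processor import LogitsProcessor
from sglang.srt.model_executor.forward_batch_info import ForwardBatch


class MyModelForCausalLM(nn.Module):
    def __init__(self, config, quant_config=None):
        super().__init__()
        self.config = config
        self.model = ...              # backbone built from SGLang layers
        self.lm_head = ...            # vocabulary projection
        self.logits_processor = LogitsProcessor(config)

    def forward(self, input_ids, positions, forward_batch: ForwardBatch):
        hidden_states = self.model(input_ids, positions, forward_batch)
        return self.logits_processor(
            input_ids, hidden_states, self.lm_head, forward_batch
        )

    def load_weights(self, weights):
        ...  # map checkpoint tensor names onto this module's parameters


# SGLang discovers the model class through this module-level variable.
EntryClass = MyModelForCausalLM
```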

## How to Support a New vLM

To support a new vision-language model (vLM) in SGLang, there are several key components to implement in addition to the standard LLM support.

1. **Register your new model as multimodal**: Extend `is_multimodal_model` in [`model_config.py`](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/configs/model_config.py) to return `True` for your model.
2. **Process Images**: Define a new `Processor` class that inherits from `BaseProcessor`, and register it as your model's dedicated processor. See [`multimodal_processor.py`](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/multimodal_processor.py) for more details.
3. **Handle Image Tokens**: Implement a `pad_input_ids` function for your new model, in which image tokens in the prompt should be expanded and replaced with image hashes, so that SGLang can recognize different images for `RadixAttention` (see the sketch after this list).
4. **Adapt ViT Attention**: Replace the multi-headed `Attention` of the ViT with SGLang's `VisionAttention`.
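
The sketch below summarizes these hooks. Everything here is an illustrative assumption; the authoritative names and signatures are in existing implementations such as `qwen2_vl.py`.

```python
# Hedged sketch of the vLM-specific hooks; names and signatures are
# assumptions, not the exact SGLang API.
from typing import List

from torch import nn


class MyVLMForConditionalGeneration(nn.Module):  # assumed class name
    # (1) model_config.py: is_multimodal_model should return True for
    #     this architecture.
    # (2) multimodal_processor.py: register a BaseProcessor subclass as
    #     this model's dedicated processor.

    # (3) Expand image placeholder tokens into runs of pad tokens derived
    #     from each image's hash, so RadixAttention can distinguish (and
    #     correctly cache) prefixes that contain different images.
    def pad_input_ids(self, input_ids: List[int], image_inputs) -> List[int]:
        ...
        return input_ids
```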

You can refer to [Qwen2VL](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/qwen2_vl.py) or other vLMs. These models demonstrate how to properly handle both multimodal and textual inputs.

You should test the new vLM locally against Hugging Face models. See [`mmmu`](https://github.com/sgl-project/sglang/tree/main/benchmark/mmmu) for an example.

### Test the correctness

#### Interactive debugging
For interactive debugging, you can compare the outputs of huggingface/transformers and SGLang.
The following two commands should give the same text output and very similar prefill logits.

- Get the reference output by `python3 scripts/playground/reference_hf.py --model-path [new model] --model-type {text,vlm}`
- Get the SGLang output by `python3 -m sglang.bench_one_batch --correct --model [new model]`

#### Add the model to the test suite
To make sure the new model remains well maintained, add it to the test suite. Add it to the `ALL_OTHER_MODELS` list in [test_generation_models.py](https://github.com/sgl-project/sglang/blob/main/test/srt/models/test_generation_models.py), then run the following command to test it.

For example, if the model is `Qwen/Qwen2-1.5B`:
```
ONLY_RUN=Qwen/Qwen2-1.5B python3 -m unittest test_generation_models.TestGenerationModels.test_others
```
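
The list entry itself is a one-line addition. The sketch below assumes a hypothetical `ModelCase` wrapper; mirror whatever entry format `ALL_OTHER_MODELS` already uses in that file.

```python
# In test/srt/models/test_generation_models.py (entry format is an
# assumption; copy the style of the existing entries).
ALL_OTHER_MODELS = [
    # ... existing entries ...
    ModelCase("Qwen/Qwen2-1.5B"),
]
```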

### Port a model from vLLM to SGLang
Another valuable resource is the [vLLM Models Directory](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models). vLLM has extensive coverage of models, and SGLang reuses vLLM's interface and some layers to implement the models. This similarity makes it easy to port many models from vLLM to SGLang.

To port a model from vLLM to SGLang, you can compare these two files: [SGLang Llama Implementation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/llama.py) and [vLLM Llama Implementation](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py). This comparison will help you understand how to convert a model implementation from vLLM to SGLang. The major difference is the replacement of `Attention` with `RadixAttention`; the other parts are almost identical. Specifically,
  - Replace vLLM's `Attention` with `RadixAttention`. Note that you need to pass `layer_id` all the way to `RadixAttention` (see the sketch after this list).
  - Replace vLLM's `LogitsProcessor` with SGLang's `LogitsProcessor`.
  - Replace the multi-headed `Attention` of the ViT with SGLang's `VisionAttention`.
  - Replace other vLLM layers with SGLang layers (e.g., `RMSNorm`, `SiluAndMul`).
  - Remove `Sample`.
  - Change the `forward()` functions and add `forward_batch`.
  - Add `EntryClass` at the end.
  - Please ensure the new implementation uses **only SGLang components and does not rely on any vLLM components**.
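
As a rough illustration of the first point, here is a hedged sketch of the attention swap. The constructor arguments follow the pattern of [llama.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/llama.py); the surrounding names are placeholders.

```python
# Hedged sketch: swapping vLLM's Attention for SGLang's RadixAttention.
# layer_id must be threaded from the model, through each decoder layer,
# down to RadixAttention so the KV cache is indexed per layer.
from torch import nn

from sglang.srt.layers.radix_attention import RadixAttention


class MyAttention(nn.Module):  # placeholder name
    def __init__(self, config, layer_id: int):
        super().__init__()
        head_dim = config.hidden_size // config.num_attention_heads
        self.attn = RadixAttention(
            config.num_attention_heads,  # heads (shard per TP rank in real code)
            head_dim,
            head_dim**-0.5,              # attention scaling
            num_kv_heads=config.num_key_value_heads,
            layer_id=layer_id,           # the key difference from vLLM
        )
```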

### Registering an external model implementation

In addition to the methods described above, you can also register your new model with the `ModelRegistry` before launching the server. This approach is useful if you want to integrate your model without needing to modify the source code.

Here is how you can do it:

```python
from sglang.srt.models.registry import ModelRegistry
from sglang.srt.entrypoints.http_server import launch_server

# for a single model, you can add it to the registry
ModelRegistry.models[model_name] = model_class

# for multiple models, you can imitate the import_model_classes() function in sglang/srt/models/registry.py
from functools import lru_cache

@lru_cache()
def import_new_model_classes():
    model_arch_name_to_cls = {}
    ...
    return model_arch_name_to_cls

ModelRegistry.models.update(import_new_model_classes())

launch_server(server_args)
```