# How to support new model in lmdeploy.pytorch
lmdeploy.pytorch is designed to ease new model deployment and prototype verification. If you are willing to use our engine, here is the tutorial.
## Support New Model
Let's begin with Llama.
Before delving into the details, it's essential to acquaint ourselves with the input specifications of the model. In order to accommodate new features within our engine, there are some deviations from the typical transformer inputs.
1. To circumvent the need for batch padding, continuous batching is employed. Consequently, the `input_ids` now represents the concatenation of all input sequences in the batch, followed by a `unsqueeze(0)` operation to align with the original `input_ids` dimension.
2. In an effort to optimize memory usage for the key/value cache, we implement paged attention. This transforms the `past_key_value` into a substantial tensor with dimensions `[num_blocks, block_size, num_heads, head_dim]`. Here, `num_blocks` denotes the number of page blocks, and `block_size` indicates the size of each block.
3. Accompanying these changes, additional inputs are required to support them, notably the block table and the history lengths. Note that these supplementary inputs are not listed as arguments in the original forward method; instead, a context object is used to furnish this information (see the sketch below).
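For intuition, here is a minimal sketch (plain PyTorch; the token ids and sizes are hypothetical, and `q_seq_length`/`q_start_loc` mirror the fields of the `StepContext` shown in the appendix):
```python
import torch

# Three requests of different lengths; token ids are hypothetical.
seqs = [torch.tensor([1, 2, 3, 4]), torch.tensor([5, 6]), torch.tensor([7, 8, 9])]

# Continuous batching: concatenate instead of padding, then unsqueeze(0)
# so input_ids keeps a [batch, seq_len] shape with batch == 1.
input_ids = torch.cat(seqs).unsqueeze(0)    # shape [1, 9]
q_seq_length = torch.tensor([4, 2, 3])      # per-sequence lengths
q_start_loc = torch.tensor([0, 4, 6])       # start offsets inside input_ids

# Paged k/v cache: one big tensor shared by all sequences; each sequence
# owns a list of block indices (its block table / block_offsets).
num_blocks, block_size, num_heads, head_dim = 16, 64, 32, 128
k_cache = torch.empty(num_blocks, block_size, num_heads, head_dim)
v_cache = torch.empty_like(k_cache)
```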
Due to the alterations in the input structure mentioned earlier, the forward methods for both `LlamaModel` and `LlamaAttention` modules need to be adjusted. Below are the modified implementations:
For `LlamaModel`:
```python
# lmdeploy/pytorch/models/llama.py

class LlamaModel(nn.Module):

    def forward(
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, BaseModelOutputWithPast]:
        """Rewrite implementation of LlamaModel.forward."""
        inputs_embeds = self.embed_tokens(input_ids)
        hidden_states = inputs_embeds

        # decoder layers
        for idx, decoder_layer in enumerate(self.layers):
            past_key_value = past_key_values[idx]
            layer_outputs = decoder_layer(
                hidden_states,
                attention_mask=attention_mask,
                position_ids=position_ids,
                past_key_value=past_key_value,
                output_attentions=output_attentions,
                use_cache=use_cache,
            )
            hidden_states = layer_outputs[0]

        hidden_states = self.norm(hidden_states)

        return BaseModelOutputWithPast(
            last_hidden_state=hidden_states,
            past_key_values=past_key_values,
            hidden_states=None,
            attentions=None,
        )
```
For `LlamaAttention`:
```python
# lmdeploy/pytorch/models/llama.py
from lmdeploy.pytorch.kernels import (apply_rotary_pos_emb, fill_kv_cache,
                                      paged_attention_fwd)

class LlamaAttention(nn.Module):

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_value: Optional[Tuple[torch.Tensor]] = None,
        output_attentions: bool = False,
        use_cache: bool = False,
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor],
               Optional[Tuple[torch.Tensor]]]:
        """Rewrite of LlamaAttention.forward."""
        context = self.context.context
        history_lengths = context.history_lengths
        position_ids_1d = context.position_ids_1d
        block_offsets = context.block_offsets

        # qkv proj
        query_states = self.q_proj(hidden_states)
        key_states = self.k_proj(hidden_states)
        value_states = self.v_proj(hidden_states)
        query_states = query_states.view(-1, self.num_heads, self.head_dim)
        key_states = key_states.view(-1, self.num_key_value_heads,
                                     self.head_dim)
        value_states = value_states.view(-1, self.num_key_value_heads,
                                         self.head_dim)

        # rotary embedding
        max_seq_len = position_ids.size(-1)
        kv_seq_len = max_seq_len + max(history_lengths)
        if kv_seq_len >= self.rotary_emb.max_seq_len_cached:
            cos, sin = self.rotary_emb(value_states,
                                       seq_len=kv_seq_len + 128)
        query_states, key_states = apply_rotary_pos_emb(
            query_states,
            key_states,
            self.rotary_emb.cos_cached,
            self.rotary_emb.sin_cached,
            position_ids,
            position_ids_1d,
            q_embed=query_states,
            k_embed=key_states)

        # fill kv cache
        kv_seq_length = context.kv_seq_length
        q_seq_length = context.q_seq_length
        q_start_loc = context.q_start_loc
        fill_kv_cache(key_states,
                      value_states,
                      past_key_value[0],
                      past_key_value[1],
                      q_start_loc,
                      q_seq_length,
                      block_offsets=block_offsets,
                      history_lengths=history_lengths,
                      context=context)

        # attention
        attn_output = query_states
        block_size = past_key_value[0].size(1)
        paged_attention_fwd(
            query_states,
            past_key_value[0],
            past_key_value[1],
            attn_output,
            block_offsets,
            q_start_loc=q_start_loc,
            q_seqlens=q_seq_length,
            kv_seqlens=kv_seq_length,
            max_seqlen=max_seq_len,
        )
        hidden_size = self.num_heads * self.head_dim
        attn_output = attn_output.reshape(*hidden_states.shape[:-1],
                                          hidden_size)

        # o proj
        attn_output = self.o_proj(attn_output)
        return attn_output, None, past_key_value
```
Note: The additional arguments like `history_lengths` and `block_offsets` are accessed from the `context` object, which acts as a container for the inputs required by continuous batching and paged attention. Refer to the [context info](#context-info) for more details about the `context` object.
We have replaced certain operations with our custom Triton kernel for two reasons:
1. The custom Triton kernel allows us to incorporate new features, such as `paged_attention_fwd`.
2. Fused kernels offer superior performance compared to the pure PyTorch implementation.
Now that we have the updated implementations for the two modules, let's register them in `lmdeploy/pytorch/models/module_map.py`.
```python
# lmdeploy/pytorch/models/module_map.py
MODULE_MAP.update({
    'transformers.models.llama.modeling_llama.LlamaAttention':
    'lmdeploy.pytorch.models.llama.LlamaAttention',
    'transformers.models.llama.modeling_llama.LlamaModel':
    'lmdeploy.pytorch.models.llama.LlamaModel'
})
```
In this mapping, the revised modules are associated with their original counterparts. When creating an `Engine`, the `ModelAgent` will automatically patch the model. Subsequently, we can conduct inference using these updated implementations.
## Support Tensor Parallelism
If we aim to enable tensor parallelism (TP), it is necessary to partition the weights in the model. Let's build upon the previously mentioned modifications to accommodate TP in the Llama model:
In Llama (as in most LLMs), weight partitioning primarily affects the Linear layers, specifically the following components:
- In `LlamaAttention`: `q_proj`, `k_proj`, `v_proj` require column-wise partitioning, while `o_proj` necessitates row-wise partitioning.
- In `LlamaMLP`: `gate_proj` and `up_proj` require column-wise partitioning, while `down_proj` requires row-wise partitioning.
We can implement `_distribute_partition_fn` in each of the rewritten modules:
```python
# lmdeploy/pytorch/models/llama.py
from ..dist_utils import (colwise_parallelize_linear_fn,
                          rowwise_parallelize_linear_fn)

class LlamaAttention(nn.Module):

    @classmethod
    def _distribute_partition_fn(cls, mod_name: str, mod: nn.Module,
                                 device_mesh: DeviceMesh):
        """Distribution partition callback."""
        if mod_name in ['q_proj', 'k_proj', 'v_proj']:
            colwise_parallelize_linear_fn(mod,
                                          device_mesh=device_mesh,
                                          to_local=True)
        elif mod_name in ['o_proj']:
            rowwise_parallelize_linear_fn(mod,
                                          device_mesh=device_mesh,
                                          to_local=True)

class LlamaMLP(nn.Module):

    @classmethod
    def _distribute_partition_fn(cls, mod_name: str, mod: nn.Module,
                                 device_mesh: DeviceMesh):
        """Distribution partition callback."""
        if mod_name in ['gate_proj', 'up_proj']:
            colwise_parallelize_linear_fn(mod,
                                          device_mesh=device_mesh,
                                          to_local=True)
        elif mod_name in ['down_proj']:
            rowwise_parallelize_linear_fn(mod,
                                          device_mesh=device_mesh,
                                          to_local=True)
```
In the process of loading model weights, the `_distribute_partition_fn` is called to distribute the weights of specific modules across different devices. Following the weight partitioning, it becomes necessary to perform `all_reduce` on the output tensors of `o_proj` and `down_proj`. While one option is to include `all_reduce` directly in the forward method, an alternative approach is to introduce the `_distribute_output_fn` call:
```python
# lmdeploy/pytorch/models/llama.py
import torch.distributed as dist

class LlamaAttention(nn.Module):

    @classmethod
    def _distribute_output_fn(cls, outputs, device_mesh: DeviceMesh):
        """Distribution output hook."""
        dist.all_reduce(outputs[0])
        return outputs

class LlamaMLP(nn.Module):

    @classmethod
    def _distribute_output_fn(cls, outputs, device_mesh: DeviceMesh):
        """Distribution output hook."""
        dist.all_reduce(outputs)
        return outputs
```
It is essential to remember to add `LlamaMLP` to the `module_map`:
```python
# lmdeploy/pytorch/models/module_map.py
MODULE_MAP.update({
    'transformers.models.llama.modeling_llama.LlamaMLP':
    'lmdeploy.pytorch.models.llama.LlamaMLP'
})
```
With these adjustments, the model is now capable of utilizing multiple GPUs for deploying Large Language Models (LLM). This enables efficient distribution of computations across different devices in a parallelized manner.
## Debug Module
When the output of the model does not meet expectations, we would like to debug a specific module to determine whether the added rewrite is correct. `lmdeploy.pytorch` provides some tools to assist with accuracy alignment. Let's take the `LlamaAttention` module as an example.
First, create an instance of the module that we want to debug:
```python
import torch
from transformers import AutoModelForCausalLM
# get module
model_path = 'meta-llama/Llama-2-7b-chat-hf'
dtype = torch.float16
model = AutoModelForCausalLM.from_pretrained(model_path).to(torch.float16).cuda()
self_attn = model.model.layers[0].self_attn
```
Extract the inputs/outputs with `ModuleIOExtractor`.
```python
from lmdeploy.pytorch.tools.make_inputs import ModuleIOExtractor
# extract module input/output
input_ids = torch.tensor([[1, 2, 3, 4, 5]]).cuda()
extractor = ModuleIOExtractor(model, self_attn)
attn_args, attn_kwargs, attn_output = extractor.extract(input_ids)
```
The inputs of the rewrite module differ from those of the original module:
1. The module requires some special inputs, which are passed through `StepContext`. We can create one with `make_step_context`.
2. `input_ids` and `hidden_states` should be made continuous. We can use `continuous_tensor` for this.
3. `past_key_value` should be paged to meet the demands of paged attention.
Based on the reasons above, the inputs should be updated:
```python
from lmdeploy.pytorch.tools.make_inputs import make_step_context
from lmdeploy.pytorch.tools.layout_convert import continuous_tensor

# create patched input/output
context = make_step_context(input_ids,
                            kv_cache_dtype=dtype,
                            num_key_value_heads=32)
seq_length = context.q_seq_length
attn_kwargs['hidden_states'] = continuous_tensor(
    attn_kwargs['hidden_states'], seq_length)
attn_kwargs['past_key_value'] = context.kv_caches[0]
```
Then you can run the rewritten module and compare its output against the original for correctness.
```python
from lmdeploy.pytorch.models import patch

# patch and test
patched_self_attn = patch(self_attn, extra_args=['context'])
with torch.inference_mode():
    patched_output = patched_self_attn.patched_forward(*attn_args,
                                                       **attn_kwargs,
                                                       context=context)
torch.testing.assert_close(patched_output[0],
                           continuous_tensor(attn_output[0], seq_length))
```
Adjust the rewrite module until the output can be aligned.
## Appendix
### context info
```python
@dataclass
class StepContext:
    """Context of a model forward step."""
    inputs: ModelInputs
    block_offsets: torch.LongTensor
    position_ids: torch.LongTensor
    position_ids_1d: torch.LongTensor
    q_start_loc: torch.LongTensor
    history_lengths: torch.LongTensor
    seq_length: torch.LongTensor
    max_seq_length: int
    kv_seq_length: torch.LongTensor
    kv_caches: List
    is_decoding: bool
    world_size: int = 1
    json_config: Dict = None
    local_adapter_ids: torch.LongTensor = None
    global_adapter_ids: torch.LongTensor = None
    adapter_offsets: torch.LongTensor = None
    max_rank: int = 0
```
### FAQ
- **How to invoke the original forward method?**
A common approach is to add hooks around a method rather than performing a complete rewrite. To access the unpatched module, you can use `self.origin_mod` within the rewritten method.
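For example, a rewrite that only wraps the original computation might look like the following sketch (`PatchedLlamaAttention` is a hypothetical name):
```python
import torch.nn as nn

class PatchedLlamaAttention(nn.Module):

    def forward(self, hidden_states, **kwargs):
        # Delegate to the original, unpatched forward ...
        outputs = self.origin_mod(hidden_states, **kwargs)
        # ... then add custom pre/post-processing around it here.
        return outputs
```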
- **How to register modules in remote code?**
For modules located in remote code, pinpointing them via `qualname` might be challenging. `lmdeploy.pytorch` facilitates registration using abbreviations for such modules:
```python
MODULE_MAP.update({
    'modeling_internlm.InternLMAttention':
    'lmdeploy.pytorch.models.internlm.PatchedInternLMAttention',
})
```
> \[!NOTE\]
>
> Although abbreviations are supported, they tend to have lower priority. It is advisable to register modules using their complete `qualname` for more robust and accurate mapping.
- **How to support different modules with the same name?**
You can accommodate multiple modules with the same name within a single rewrite module by providing distinct implementations based on their attributes. For instance, consider `baichuan2` 7b/13b:
```python
class BaichuanModel(nn.Module):

    def forward(self, ...):
        if self.config.num_hidden_layers == 32:
            return forward_7b(...)
        else:
            return forward_default(...)
```
- **How to perform post-initialization for a rewrite module?**
To execute tasks after model weight loading, introduce a `_update_model_fn` method in your rewrite module. This method will be automatically called post-initialization:
```python
class LlamaAttention:

    def _update_model_fn(self):
        # ADD YOUR CODE HERE
        pass
```
Here, you can include any additional post-initialization steps or configurations needed for your specific use case.
inference pipeline
==================
.. currentmodule:: lmdeploy
pipeline
--------
.. autofunction:: pipeline
serving
--------
.. autofunction:: serve
.. autofunction:: client
PytorchEngineConfig
-------------------
.. autoclass:: PytorchEngineConfig
TurbomindEngineConfig
---------------------
.. autoclass:: TurbomindEngineConfig
GenerationConfig
----------------
.. autoclass:: GenerationConfig
ChatTemplateConfig
------------------
.. autoclass:: ChatTemplateConfig
# Evaluate LLMs with OpenCompass
The LLMs accelerated by lmdeploy can be evaluated with OpenCompass.
## Setup
In this part, we are going to setup the environment for evaluation.
### Install lmdeploy
Install lmdeploy through pip (python 3.8+). If you want to install from source, you can refer to [build.md](../build.md).
```shell
pip install lmdeploy
```
### Install OpenCompass
Install OpenCompass from source. Refer to [installation](https://opencompass.readthedocs.io/en/latest/get_started/installation.html) for more information.
```shell
git clone https://github.com/open-compass/opencompass.git
cd opencompass
pip install -e .
```
At present, you can check the [Quick Start](https://opencompass.readthedocs.io/en/latest/get_started/quick_start.html#)
to get to know the basic usage of OpenCompass.
### Download datasets
Download the core datasets
```shell
# Run in the OpenCompass directory
cd opencompass
wget https://github.com/open-compass/opencompass/releases/download/0.1.8.rc1/OpenCompassData-core-20231110.zip
unzip OpenCompassData-core-20231110.zip
```
## Prepare Evaluation Config
OpenCompass uses configuration files in the OpenMMLab style. One can define a Python config and start evaluating with ease.
OpenCompass supports evaluating lmdeploy's TurboMind engine through its Python API.
### Dataset Config
In the home directory of OpenCompass, write the config file `$OPENCOMPASS_DIR/configs/eval_lmdeploy.py`.
We select multiple predefined datasets and import them from OpenCompass's base dataset configs as `datasets`.
```python
from mmengine.config import read_base

with read_base():
    # choose a list of datasets
    from .datasets.mmlu.mmlu_gen_a484b3 import mmlu_datasets
    from .datasets.ceval.ceval_gen_5f30c7 import ceval_datasets
    from .datasets.SuperGLUE_WiC.SuperGLUE_WiC_gen_d06864 import WiC_datasets
    from .datasets.SuperGLUE_WSC.SuperGLUE_WSC_gen_7902a7 import WSC_datasets
    from .datasets.triviaqa.triviaqa_gen_2121ce import triviaqa_datasets
    from .datasets.gsm8k.gsm8k_gen_1d7fe4 import gsm8k_datasets
    from .datasets.race.race_gen_69ee4f import race_datasets
    from .datasets.crowspairs.crowspairs_gen_381af0 import crowspairs_datasets
    # and output the results in a chosen format
    from .summarizers.medium import summarizer

datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
```
### Model Config
This part shows how to setup model config for LLMs. Let's check some examples:
`````{tabs}
````{tab} internlm-20b
```python
from opencompass.models.turbomind import TurboMindModel

internlm_20b = dict(
    type=TurboMindModel,
    abbr='internlm-20b-turbomind',
    path="internlm/internlm-20b",  # this path should be same as in huggingface
    engine_config=dict(session_len=2048,
                       max_batch_size=8,
                       rope_scaling_factor=1.0),
    gen_config=dict(top_k=1, top_p=0.8,
                    temperature=1.0,
                    max_new_tokens=100),
    max_out_len=100,
    max_seq_len=2048,
    batch_size=8,
    concurrency=8,
    run_cfg=dict(num_gpus=1, num_procs=1),
)

models = [internlm_20b]
```
````
````{tab} internlm-chat-20b
For chat models, you have to pass `meta_template`. Different chat models may have different `meta_template`s, and it's important
to keep it the same as in the training settings. You can read [meta_template](https://opencompass.readthedocs.io/en/latest/prompt/meta_template.html) for more information.
```python
from opencompass.models.turbomind import TurboMindModel

internlm_meta_template = dict(round=[
    dict(role='HUMAN', begin='<|User|>:', end='\n'),
    dict(role='BOT', begin='<|Bot|>:', end='<eoa>\n', generate=True),
],
                              eos_token_id=103028)

internlm_chat_20b = dict(
    type=TurboMindModel,
    abbr='internlm-chat-20b-turbomind',
    path='internlm/internlm-chat-20b',
    engine_config=dict(session_len=2048,
                       max_batch_size=8,
                       rope_scaling_factor=1.0),
    gen_config=dict(top_k=1,
                    top_p=0.8,
                    temperature=1.0,
                    max_new_tokens=100),
    max_out_len=100,
    max_seq_len=2048,
    batch_size=8,
    concurrency=8,
    meta_template=internlm_meta_template,
    run_cfg=dict(num_gpus=1, num_procs=1),
    end_str='<eoa>'
)

models = [internlm_chat_20b]
```
````
`````
**Note**
- If you want to pass more arguments for `engine_config` and `gen_config` in the evaluation config file, please refer to [TurbomindEngineConfig](https://lmdeploy.readthedocs.io/en/latest/inference/pipeline.html#turbomindengineconfig)
and [EngineGenerationConfig](https://lmdeploy.readthedocs.io/en/latest/inference/pipeline.html#generationconfig).
## Execute Evaluation Task
After defining the evaluation config, we can run the following command to start evaluating models.
You can check [Execution Task](https://opencompass.readthedocs.io/en/latest/user_guides/experimentation.html#task-execution-and-monitoring)
for more arguments of `run.py`.
```shell
# in the root directory of opencompass
python3 run.py configs/eval_lmdeploy.py --work-dir ./workdir
```
# Get Started
LMDeploy offers functionalities such as model quantization, offline batch inference, online serving, etc. Each function can be completed with just a few simple lines of code or commands.
## Installation
Install lmdeploy with pip (python 3.8+) or [from source](./build.md)
```shell
pip install lmdeploy
```
The default prebuilt package is compiled on CUDA 11.8. However, if CUDA 12+ is required, you can install lmdeploy by:
```shell
export LMDEPLOY_VERSION=0.2.0
export PYTHON_VERSION=38
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl
```
## Offline batch inference
```python
import lmdeploy
pipe = lmdeploy.pipeline("internlm/internlm-chat-7b")
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
```
For more information on inference pipeline parameters, please refer to [here](./inference/pipeline.md).
## Serving
LMDeploy offers several serving methods; choose the one that best meets your requirements.
- [Serving with openai compatible server](https://lmdeploy.readthedocs.io/en/latest/serving/api_server.html)
- [Serving with docker](https://lmdeploy.readthedocs.io/en/latest/serving/api_server.html#option-2-deploying-with-docker)
- [Serving with gradio](https://lmdeploy.readthedocs.io/en/latest/serving/gradio.html)
## Quantization
LMDeploy provides the following quantization methods. Please visit the following links for the detailed guide
- [4bit weight-only quantization](quantization/w4a16.md)
- [k/v quantization](quantization/kv_int8.md)
- [w8a8 quantization](quantization/w8a8.md)
## Useful Tools
LMDeploy CLI offers the following utilities, helping users experience LLM features conveniently
### Inference with Command line Interface
```shell
lmdeploy chat turbomind internlm/internlm-chat-7b
```
### Serving with Web UI
LMDeploy adopts gradio to develop the online demo.
```shell
# install dependencies
pip install lmdeploy[serve]
# launch gradio server
lmdeploy serve gradio internlm/internlm-chat-7b
```
![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
# Load huggingface model directly
Starting from v0.1.0, TurboMind can pre-process model parameters on-the-fly while loading them from huggingface-style models.
## Supported model type
Currently, TurboMind supports loading three types of models:
1. An lmdeploy-quantized model hosted on huggingface.co, such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit), [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit), etc.
2. Other LM models on huggingface.co, like Qwen/Qwen-7B-Chat
3. A model converted by `lmdeploy convert` (legacy format)
## Usage
### 1) A lmdeploy-quantized model
For models quantized by `lmdeploy.lite` such as [llama2-70b-4bit](https://huggingface.co/lmdeploy/llama2-chat-70b-4bit), [internlm-chat-20b-4bit](https://huggingface.co/internlm/internlm-chat-20b-4bit), etc.
```
repo_id=internlm/internlm-chat-20b-4bit
model_name=internlm-chat-20b
# or
# repo_id=/path/to/downloaded_model
# Inference by TurboMind
lmdeploy chat turbomind $repo_id --model-name $model_name
# Serving with gradio
lmdeploy serve gradio $repo_id --model-name $model_name
# Serving with Restful API
lmdeploy serve api_server $repo_id --model-name $model_name --tp 1
```
### 2) Other LM models
For other LM models, such as Qwen/Qwen-7B-Chat or baichuan-inc/Baichuan2-7B-Chat, the models supported by LMDeploy can be viewed with `lmdeploy list`.
```
repo_id=Qwen/Qwen-7B-Chat
model_name=qwen-7b
# or
# repo_id=/path/to/Qwen-7B-Chat/local_path
# Inference by TurboMind
lmdeploy chat turbomind $repo_id --model-name $model_name
# Serving with gradio
lmdeploy serve gradio $repo_id --model-name $model_name
# Serving with Restful API
lmdeploy serve api_server $repo_id --model-name $model_name --tp 1
```
### 3) A model converted by `lmdeploy convert`
The usage is similar to the previous cases:
```
# Convert a model
lmdeploy convert $MODEL_NAME /path/to/model --dst-path ./workspace
# Inference by TurboMind
lmdeploy chat turbomind ./workspace
# Serving with gradio
lmdeploy serve gradio ./workspace
# Serving with Restful API
lmdeploy serve api_server ./workspace --tp 1
```
# LLM Offline Inference Pipeline
In this tutorial, we will present several examples to introduce the usage of `lmdeploy.pipeline`.
You can find an overview of the detailed pipeline API in [this](https://lmdeploy.readthedocs.io/en/latest/api/pipeline.html) guide.
## Usage
- **An example using default parameters:**
```python
from lmdeploy import pipeline
pipe = pipeline('internlm/internlm2-chat-7b')
response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
print(response)
```
In this example, the pipeline by default allocates a predetermined percentage of GPU memory for storing k/v cache. The ratio is dictated by the parameter `TurbomindEngineConfig.cache_max_entry_count`.
The strategy for setting the k/v cache ratio has changed over the course of LMDeploy's development:
1. `v0.2.0 <= lmdeploy <= v0.2.1`
`TurbomindEngineConfig.cache_max_entry_count` defaults to 0.5, indicating 50% GPU **total memory** allocated for k/v cache. Out Of Memory (OOM) errors may occur if a 7B model is deployed on a GPU with memory less than 40G. If you encounter an OOM error, please decrease the ratio of the k/v cache occupation as follows:
```python
from lmdeploy import pipeline, TurbomindEngineConfig

# decrease the ratio of the k/v cache occupation to 20%
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.2)
pipe = pipeline('internlm/internlm2-chat-7b',
                backend_config=backend_config)
response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
print(response)
```
2. `lmdeploy > v0.2.1`
The allocation strategy for k/v cache is changed to reserve space from the **GPU free memory** proportionally. The ratio `TurbomindEngineConfig.cache_max_entry_count` has been adjusted to 0.8 by default. If OOM error happens, similar to the method mentioned above, please consider reducing the ratio value to decrease the memory usage of the k/v cache.
- **An example showing how to set tensor parallel num**:
```python
from lmdeploy import pipeline, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(tp=2)
pipe = pipeline('internlm/internlm2-chat-7b',
                backend_config=backend_config)
response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
print(response)
```
- **An example for setting sampling parameters:**
```python
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(tp=2)
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)
pipe = pipeline('internlm/internlm2-chat-7b',
                backend_config=backend_config)
response = pipe(['Hi, pls intro yourself', 'Shanghai is'],
                gen_config=gen_config)
print(response)
```
- **An example for OpenAI format prompt input:**
```python
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(tp=2)
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)
pipe = pipeline('internlm/internlm2-chat-7b',
                backend_config=backend_config)
prompts = [[{
    'role': 'user',
    'content': 'Hi, pls intro yourself'
}], [{
    'role': 'user',
    'content': 'Shanghai is'
}]]
response = pipe(prompts, gen_config=gen_config)
print(response)
```
- **An example for streaming mode:**
```python
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(tp=2)
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)
pipe = pipeline('internlm/internlm2-chat-7b',
                backend_config=backend_config)
prompts = [[{
    'role': 'user',
    'content': 'Hi, pls intro yourself'
}], [{
    'role': 'user',
    'content': 'Shanghai is'
}]]
for item in pipe.stream_infer(prompts, gen_config=gen_config):
    print(item)
```
- **Below is an example for the PyTorch backend. Please install triton first:**
```shell
pip install triton>=2.1.0
```
```python
from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig

backend_config = PytorchEngineConfig(session_len=2048)
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)
pipe = pipeline('internlm/internlm-chat-7b',
                backend_config=backend_config)
prompts = [[{
    'role': 'user',
    'content': 'Hi, pls intro yourself'
}], [{
    'role': 'user',
    'content': 'Shanghai is'
}]]
response = pipe(prompts, gen_config=gen_config)
print(response)
```
## FAQs
- **RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase**.
If you encounter this error when using tp>1 with the PyTorch backend, make sure your Python script guards its entry point with:
```python
if __name__ == '__main__':
```
Generally, in the context of multi-threading or multi-processing, it is necessary to ensure that initialization code is executed only once. Here, `if __name__ == '__main__':` ensures that the initialization code runs only in the main program and is not repeated in each newly created process. A minimal sketch of such a script is shown below.
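A minimal sketch (the model path and `tp` value are placeholders):
```python
from lmdeploy import pipeline, PytorchEngineConfig

def main():
    backend_config = PytorchEngineConfig(tp=2)
    pipe = pipeline('internlm/internlm-chat-7b', backend_config=backend_config)
    print(pipe(['Hi, pls intro yourself']))

if __name__ == '__main__':
    # The guard ensures processes spawned for tp>1 do not
    # re-execute the initialization code in main().
    main()
```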
- To customize a chat template, please refer to [chat_template.md](../advance/chat_template.md).
# Architecture of lmdeploy.pytorch
`lmdeploy.pytorch` is an inference engine in LMDeploy that offers a developer-friendly framework to users interested in deploying their own models and developing new features.
## Design
![pytorch arch](https://github.com/grimoire/lmdeploy/blob/media/lmdeploy_pytorch_arch.png?raw=true)
## API
`lmdeploy.pytorch` shares service interfaces with `Turbomind`, and the inference service is implemented by `Engine` and `EngineInstance`.
`EngineInstance` acts as the sender of inference requests, encapsulating and sending requests to the `Engine` to achieve streaming inference. The inference interface of `EngineInstance` is thread-safe, allowing instances in different threads to initiate requests simultaneously. The `Engine` will automatically perform batch processing based on the current system resources.
`Engine` is the request receiver and executor. It contains the following modules:
- `ModelAgent` serves as a wrapper for the model, handling tasks such as loading model/adapters, managing the cache, and implementing tensor parallelism.
- The `Scheduler` functions as the sequence manager, determining the sequences and adapters to participate in the current step, and subsequently allocating resources for them.
- `RequestManager` is tasked with sending and receiving requests, acting as the bridge between the `Engine` and `EngineInstance`.
## Engine
The `Engine` responds to requests in a sub-thread, following this loop:
1. Get new requests through `RequestManager`. These requests are cached for now.
2. The `Scheduler` performs scheduling, deciding which cached requests should be processed and allocating resources for them.
3. `ModelAgent` swaps the caches according to the information provided by the Scheduler, then performs inference with the patched model.
4. The `Scheduler` updates the status of requests based on the inference results from `ModelAgent`.
5. `RequestManager` responds to the sender (`EngineInstance`), and the process returns to step 1.
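The loop can be summarized with the following sketch; the method and field names are illustrative, not the actual API:
```python
def engine_loop(request_manager, scheduler, model_agent):
    """Illustrative pseudocode of the Engine loop; names are hypothetical."""
    while True:
        requests = request_manager.get_new_requests()  # 1. receive and cache requests
        scheduler.add(requests)
        out = scheduler.schedule()                     # 2. pick sequences, allocate blocks
        model_agent.swap(out.swap_in_map, out.swap_out_map)  # 3. host/device cache swaps
        outputs = model_agent.forward(out.running)     # 3. inference with the patched model
        scheduler.update(outputs)                      # 4. update sequence status
        request_manager.respond(outputs)               # 5. reply to EngineInstance senders
```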
Now, let's delve deeper into the modules that participate in these steps.
### Scheduler
In LLM inference, caching history key and value states is a common practice to prevent redundant computation. However, as history lengths vary in a batch of sequences, we need to pad the caches to enable batching inference. Unfortunately, this padding can lead to significant memory wastage, limiting the transformer's performance.
[vLLM](https://docs.vllm.ai) employs a paging-based strategy, allocating caches in page blocks to minimize extra memory usage. Our Scheduler module in the Engine shares a similar design, allocating resources based on sequence length in blocks and evicting unused blocks to support larger batching and longer session lengths.
Additionally, we support [S-LoRA](https://github.com/S-LoRA/S-LoRA), which enables the use of multiple LoRA adapters on limited memory.
### ModelAgent
`lmdeploy.pytorch` supports Tensor Parallelism, which leads to complex model initialization, cache allocation, and weight partitioning. ModelAgent is designed to abstract these complexities, allowing the Engine to focus solely on maintaining the pipeline.
ModelAgent consists of two components:
1. **patched_model**: This is the transformer model after patching. In comparison to the original model, the patched model incorporates additional features such as Tensor Parallelism, quantization, and high-performance kernels.
2. **cache_engine**: This component manages the caches. It receives commands from the Scheduler and performs host-device page swaps. Only GPU blocks are utilized for caching key/value pairs and adapters.
## Patching
In order to facilitate the deployment of a new model, we have developed a tool to patch the modules.
For example, if we want to reimplement the forward method of `LlamaAttention`:
```python
class CustomLlamaAttention(nn.Module):

    def forward(self, ...):
        # custom forward
```
We register the implementation above into `lmdeploy.pytorch.models.module_map`:
```python
MODULE_MAP.update({
    'transformers.models.llama.modeling_llama.LlamaAttention':
    'qualname.to.CustomLlamaAttention'})
```
`ModelAgent` would then load and patch `LlamaAttention` with `CustomLlamaAttention` while leaving everything else unchanged. You can perform inference with the new implementation. For more detail about model patching, please refer to [support new model](../advance/pytorch_new_model.md) .
## Features
`lmdeploy.pytorch` supports new features including:
- **Continuous Batching**: As the sequence length in a batch may vary, padding is often necessary for batching inference. However, large padding can lead to additional memory usage and unnecessary computation. To address this, we employ continuous batching, where all sequences are concatenated into a single long sequence to avoid padding.
- **Tensor Parallelism**: The GPU memory usage of LLM might exceed the capacity of a single GPU. Tensor parallelism is utilized to accommodate such models on multiple devices. Each device handles parts of the model simultaneously, and the results are gathered to ensure correctness.
- **S-LoRA**: LoRA adapters can be used to train LLM on devices with limited memory. While it's common practice to merge adapters into the model weights before deployment, loading multiple adapters in this way can consume a significant amount of memory. We support S-LoRA, where adapters are paged and swapped in when necessary. Special kernels are developed to support inference with unmerged adapters, enabling the loading of various adapters efficiently.
- **Quantization**: Model quantization involves performing computations with low precision. `lmdeploy.pytorch` supports w8a8 quantization. For more details, refer to [w8a8](../quantization/w8a8.md).
# Architecture of TurboMind
TurboMind is an inference engine that supports high throughput inference for conversational LLMs. It's based on NVIDIA's [FasterTransformer](https://github.com/NVIDIA/FasterTransformer). Major features of TurboMind include an efficient LLaMa implementation, the persistent batch inference model and an extendable KV cache manager.
## High level overview of TurboMind
```
+--------------------+
| API |
+--------------------+
| ^
request | | stream callback
v |
+--------------------+ fetch +-------------------+
| Persistent Batch | <-------> | KV Cache Manager |
+--------------------+ update +-------------------+
^
|
v
+------------------------+
| LLaMA implementation |
+------------------------+
| FT kernels & utilities |
+------------------------+
```
## Persistent Batch
You may recognize this feature as "continuous batching" in other repos. But during the concurrent development of the feature, we modeled the inference of a conversational LLM as a persistently running batch whose lifetime spans the entire serving process, hence the name "persistent batch". To put it simply:
- The persistent batch has N pre-configured batch slots.
- Requests join the batch when there are free slots available. A batch slot is released and can be reused once the generation of the requested tokens is finished.
- __On cache-hits (see below), history tokens don't need to be decoded in every round of a conversation; generation of response tokens will start instantly.__
- The batch grows or shrinks automatically to minimize unnecessary computations.
## KV Cache Manager
The [KV cache manager](https://github.com/InternLM/lmdeploy/blob/main/src/turbomind/models/llama/SequenceManager.h) of TurboMind is a memory-pool-like object that also implements an LRU policy, so it can be viewed as a form of __cache of KV caches__. It works in the following way:
- All device memory required for the KV cache is allocated by the manager. A fixed number of slots is pre-configured to match the memory size of the system. Each slot corresponds to the memory required by the KV cache of a single sequence. The allocation chunk size can be configured to implement a pre-allocate/on-demand style allocation policy (or something in between).
- When space for the KV cache of a new sequence is requested but no free slots are left in the pool, the least recently used sequence is evicted from the cache and its device memory is directly reused by the new sequence. However, this is not the end of the story.
- Fetching a sequence that currently resides in one of the slots resembles a _cache-hit_: the history KV cache is returned directly and no context decoding is needed.
- Victim (evicted) sequences are not erased entirely but converted to their most compact form, i.e. token IDs. When the same sequence id is fetched later (a _cache-miss_), the token IDs are decoded by the FMHA-backed context decoder and converted back to a KV cache.
- Eviction and conversion are handled automatically inside TurboMind and are thus transparent to the user. __From the user's perspective, systems that use TurboMind have access to infinite device memory.__
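The following simplified sketch illustrates the "cache of KV caches" idea; it is an illustration of the policy, not the actual `SequenceManager` API:
```python
from collections import OrderedDict

class KVCachePool:
    """Toy LRU pool of per-sequence KV cache slots (illustration only)."""

    def __init__(self, num_slots):
        self.free_slots = list(range(num_slots))
        self.active = OrderedDict()    # seq_id -> (slot, token_ids), by recency
        self.compact = {}              # seq_id -> token ids of evicted sequences

    def fetch(self, seq_id, token_ids):
        """Return (slot, needs_context_decoding) for a sequence."""
        if seq_id in self.active:                  # cache-hit: reuse history KV;
            self.active.move_to_end(seq_id)        # generation starts instantly
            return self.active[seq_id][0], False
        if not self.free_slots:                    # pool full: evict LRU victim,
            victim, (slot, vic_tokens) = self.active.popitem(last=False)
            self.compact[victim] = vic_tokens      # keep only its token ids
            self.free_slots.append(slot)
        slot = self.free_slots.pop()
        token_ids = self.compact.pop(seq_id, token_ids)
        self.active[seq_id] = (slot, token_ids)    # cache-miss: FMHA context
        return slot, True                          # decoding rebuilds the KV cache
```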
## LLaMa implementation
Our implementation of the LLaMa family of models is modified from the GPT-NeoX model in FasterTransformer. In addition to basic refactoring and modifications to support the LLaMa family, we made some improvements to enable high-performance inference of conversational models, most importantly:
- To support fast context decoding in multi-round conversations, we replaced the attention implementation in the context decoder with a [cutlass](https://github.com/NVIDIA/cutlass)-based FMHA implementation that supports mismatched Q/K lengths.
- We introduced indirect buffer pointers in both context FMHA and generation FMHA to support the discontinuity of the KV cache within the batch.
- To support concurrent inference with the persistent batch, a new synchronization mechanism was designed to orchestrate the worker threads running in tensor parallel mode.
- To maximize throughput, we implemented INT8 KV cache support to increase the max batch size. It is effective because in real-world serving scenarios, the KV cache costs more memory and consumes more memory bandwidth than weights or other activations.
- We resolved an NCCL hang issue when running multiple model instances in TP mode within a single process; NCCL APIs are now guarded by host-side synchronization barriers.
## API
TurboMind supports a Python API that enables streaming output and tensor parallel mode.
The ability to use [tritonserver](https://github.com/triton-inference-server/server) for serving is also inherited from FasterTransformer. However, to support submitting concurrent requests into our persistent batch model, we no longer use sequence batching or dynamic batching as FasterTransformer does. The bookkeeping of request and sequence states is managed by TurboMind instead.
## Difference between FasterTransformer and TurboMind
Apart from the features described above, there are still many minor differences that we don't cover in this document. Notably, many capabilities of FT were dropped in TurboMind because of the difference in objectives (e.g. prefix prompt, beam search, context embedding, sparse GEMM, GPT/T5/other model families, etc.).
## FAQ
### Supporting Huggingface models
For historical reasons, TurboMind's weight layout is based on [the original LLaMa implementation](https://github.com/facebookresearch/llama) (differ only by a transpose). The implementation in huggingface transformers uses a [different layout](https://github.com/huggingface/transformers/blob/45025d92f815675e483f32812caa28cce3a960e7/src/transformers/models/llama/convert_llama_weights_to_hf.py#L123C76-L123C76) for `W_q` and `W_k` which is handled in [deploy.py](https://github.com/InternLM/lmdeploy/blob/ff4648a1d09e5aec74cf70efef35bfaeeac552e0/lmdeploy/serve/turbomind/deploy.py#L398).
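As a reference for what "differ by a transpose/permutation" means here, the linked huggingface conversion script reshuffles `W_q` and `W_k` roughly as follows (paraphrased from that script; treat it as a sketch rather than TurboMind's converter):
```python
import torch

def hf_permute(w: torch.Tensor, n_heads: int, dim1: int, dim2: int) -> torch.Tensor:
    # Interleave the two halves of each head so that huggingface's rotary
    # embedding convention matches the original Meta checkpoint layout.
    return w.view(n_heads, dim1 // n_heads // 2, 2,
                  dim2).transpose(1, 2).reshape(dim1, dim2)
```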
# TurboMind Config
TurboMind is one of the inference engines of LMDeploy. When using it to do model inference, you need to convert the input model into a TurboMind model. In the TurboMind model folder, besides model weight files, the TurboMind model also includes some other files, among which the most important is the configuration file `triton_models/weights/config.ini` that is closely related to inference performance.
If you are using LMDeploy version 0.0.x, please refer to the [turbomind 1.0 config](#turbomind-10-config) section to learn the relevant content in the configuration. Otherwise, please read [turbomind 2.0 config](#turbomind-20-config) to familiarize yourself with the configuration details.
## TurboMind 2.0 config
Take the `llama-2-7b-chat` model as an example. In TurboMind 2.0, its config.ini content is as follows:
```toml
[llama]
model_name = llama2
tensor_para_size = 1
head_num = 32
kv_head_num = 32
vocab_size = 32000
num_layer = 32
inter_size = 11008
norm_eps = 1e-06
attn_bias = 0
start_id = 1
end_id = 2
session_len = 4104
weight_type = fp16
rotary_embedding = 128
rope_theta = 10000.0
size_per_head = 128
group_size = 0
max_batch_size = 64
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.5
cache_block_seq_len = 128
cache_chunk_size = 1
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 2048
rope_scaling_factor = 0.0
use_logn_attn = 0
```
These parameters are composed of model attributes and inference parameters. Model attributes include the number of layers, the number of heads, dimensions, etc., and they are **not modifiable**.
```toml
model_name = llama2
head_num = 32
kv_head_num = 32
vocab_size = 32000
num_layer = 32
inter_size = 11008
norm_eps = 1e-06
attn_bias = 0
start_id = 1
end_id = 2
rotary_embedding = 128
rope_theta = 10000.0
size_per_head = 128
```
Compared with TurboMind 1.0, the model attribute part of the config remains the same, while the inference parameters have changed.
In the following sections, we will focus on introducing the inference parameters.
### data type
`weight_type` and `group_size` are the relevant parameters, **which cannot be modified**.
`weight_type` represents the data type of weights. Currently, `fp16` and `int4` are supported. `int4` represents 4bit weights. When `weight_type` is `int4`, `group_size` means the group size used when quantizing weights with `awq`. In LMDeploy prebuilt package, kernels with `group size = 128` are included.
### batch size
The maximum batch size is still set through `max_batch_size`. But its default value has been changed from 32 to 64, and `max_batch_size` is no longer related to `cache_max_entry_count`.
### k/v cache size
k/v cache memory is determined by `cache_block_seq_len` and `cache_max_entry_count`.
TurboMind 2.0 has implemented Paged Attention, managing the k/v cache in blocks.
`cache_block_seq_len` represents the length of the token sequence in a k/v block, with a default value of 128. TurboMind calculates the memory size of a k/v block according to the following formula:
```
cache_block_seq_len * num_layer * kv_head_num * size_per_head * 2 * sizeof(kv_data_type)
```
For the llama2-7b model, when storing k/v as the `half` type, the memory of a k/v block is: `128 * 32 * 32 * 128 * 2 * sizeof(half) = 64MB`
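This figure is easy to verify with a quick sanity check in Python:
```python
cache_block_seq_len, num_layer, kv_head_num, size_per_head = 128, 32, 32, 128
sizeof_half = 2   # bytes
block_bytes = (cache_block_seq_len * num_layer * kv_head_num * size_per_head
               * 2 * sizeof_half)   # the extra factor 2 covers both k and v
print(block_bytes / 2**20)          # 64.0 (MiB)
```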
The meaning of `cache_max_entry_count` varies depending on its value:
- When it's a decimal between (0, 1) and lmdeploy is no later than v0.2.1, `cache_max_entry_count` represents the percentage of **total GPU memory** used by k/v blocks. For example, if TurboMind launches on an A100-80G GPU with `cache_max_entry_count` being `0.5`, the total memory used by the k/v blocks is `80 * 0.5 = 40G`.
- When it's a decimal between (0, 1) and lmdeploy is later than v0.2.1, `cache_max_entry_count` determines the percentage of **free GPU memory** used for k/v blocks, defaulting to `0.8`. For example, with TurboMind on an A100-80G GPU running a 13b model, the memory for k/v blocks would be `(80 - 26) * 0.8 = 43.2G`, i.e. 80% of the free 54G.
- When it's an integer > 0, it represents the total number of k/v blocks.
The `cache_chunk_size` indicates the size of the k/v cache chunk to be allocated each time new k/v cache blocks are needed. Different values represent different meanings:
- When it is an integer > 0, `cache_chunk_size` number of k/v cache blocks are allocated.
- When the value is -1, `cache_max_entry_count` number of k/v cache blocks are allocated.
- When the value is 0, `sqrt(cache_max_entry_count)` number of k/v cache blocks are allocated.
### kv int8 switch
When initiating 8bit k/v inference, set `quant_policy = 4`. Please refer to [kv int8](../quantization/kv_int8.md) for a guide.
### long context switch
By setting `rope_scaling_factor = 1.0`, you can enable the Dynamic NTK option of RoPE, which allows the model to use long-text input and output.
Regarding the principle of Dynamic NTK, please refer to:
1. https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases
2. https://kexue.fm/archives/9675
You can also turn on [LogN attention scaling](https://kexue.fm/archives/8823) by setting `use_logn_attn = 1`.
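For reference, Dynamic NTK is commonly implemented by rescaling the RoPE base when the sequence exceeds the trained context length. The sketch below follows the widely used formulation from the links above; TurboMind's kernel may differ in details:
```python
def dynamic_ntk_base(base, seq_len, max_position_embeddings, scaling_factor, dim):
    """Rescale the RoPE base for sequences longer than the trained context."""
    if seq_len <= max_position_embeddings:
        return base
    alpha = scaling_factor * seq_len / max_position_embeddings - (scaling_factor - 1)
    return base * alpha ** (dim / (dim - 2))
```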
## TurboMind 1.0 config
Taking the `llama-2-7b-chat` model as an example, in TurboMind 1.0, its `config.ini` content is as follows:
```toml
[llama]
model_name = llama2
tensor_para_size = 1
head_num = 32
kv_head_num = 32
vocab_size = 32000
num_layer = 32
inter_size = 11008
norm_eps = 1e-06
attn_bias = 0
start_id = 1
end_id = 2
session_len = 4104
weight_type = fp16
rotary_embedding = 128
rope_theta = 10000.0
size_per_head = 128
group_size = 0
max_batch_size = 32
max_context_token_num = 4
step_length = 1
cache_max_entry_count = 48
cache_chunk_size = 1
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 2048
use_dynamic_ntk = 0
use_logn_attn = 0
```
These parameters are composed of model attributes and inference parameters. Model attributes include the number of layers, the number of heads, dimensions, etc., and they are **not modifiable**.
```toml
model_name = llama2
head_num = 32
kv_head_num = 32
vocab_size = 32000
num_layer = 32
inter_size = 11008
norm_eps = 1e-06
attn_bias = 0
start_id = 1
end_id = 2
rotary_embedding = 128
rope_theta = 10000.0
size_per_head = 128
```
In the following sections, we will focus on introducing the inference parameters.
### data type
`weight_type` and `group_size` are the relevant parameters, **which cannot be modified**.
`weight_type` represents the data type of weights. Currently, `fp16` and `int4` are supported. `int4` represents 4bit weights. When `weight_type` is `int4`, `group_size` means the group size used when quantizing weights with `awq`. In LMDeploy prebuilt package, kernels with `group size = 128` are included.
### batch size
`max_batch_size` determines the max size of a batch during inference. In general, the larger the batch size is, the higher the throughput is. But make sure that `max_batch_size <= cache_max_entry_count`
### k/v cache size
TurboMind allocates k/v cache memory based on `session_len`, `cache_chunk_size`, and `cache_max_entry_count`.
- `session_len` denotes the maximum length of a sequence, i.e., the size of the context window.
- `cache_chunk_size` indicates the size of k/v sequences to be allocated when new sequences are added.
- `cache_max_entry_count` signifies the maximum number of k/v sequences that can be cached.
### kv int8 switch
When initiating 8bit k/v inference, change `quant_policy = 4` and `use_context_fmha = 0`. Please refer to [kv int8](../quantization/kv_int8.md) for a guide.
### long context switch
By setting `use_dynamic_ntk = 1`, you can enable the Dynamic NTK option of RoPE, which allows the model to use long-text input and output.
Regarding the principle of Dynamic NTK, please refer to:
1. https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases
2. https://kexue.fm/archives/9675
You can also turn on [LogN attention scaling](https://kexue.fm/archives/8823) by setting `use_logn_attn = 1`.
# VLM Offline Inference Pipeline
LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference [pipeline](./pipeline.md).
Currently, it supports the following models.
- [Qwen-VL-Chat](https://huggingface.co/Qwen/Qwen-VL-Chat)
- LLaVA series: [v1.5](https://huggingface.co/collections/liuhaotian/llava-15-653aac15d994e992e2677a7e), [v1.6](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2)
- [Yi-VL](https://huggingface.co/01-ai/Yi-VL-6B)
We genuinely invite the community to contribute new VLM support to LMDeploy. Your involvement is truly appreciated.
This article showcases the VLM pipeline using the [liuhaotian/llava-v1.6-vicuna-7b](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b) model as a case study.
You'll learn about the simplest ways to leverage the pipeline and how to gradually unlock more advanced features by adjusting engine parameters and generation arguments, such as tensor parallelism, context window sizing, random sampling, and chat template customization.
Moreover, we will provide practical inference examples tailored to scenarios with multiple images, batch prompts etc.
## A 'Hello, world' example
```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image
pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b')
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
```
If `ImportError` occurs while executing this case, please install the required dependency packages as prompted.
In the above example, the inference prompt is a tuple structure consisting of (prompt, image). Besides this structure, the pipeline also supports prompts in the OpenAI format:
```python
from lmdeploy import pipeline
pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b')
prompts = [
    {
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'describe this image'},
            {'type': 'image_url', 'image_url': {'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'}}
        ]
    }
]
response = pipe(prompts)
print(response)
```
### Set tensor parallelism
Tensor parallelism can be activated by setting the engine parameter `tp`:
```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b',
                backend_config=TurbomindEngineConfig(tp=2))
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
```
### Set context window size
When creating the pipeline, you can customize the size of the context window by setting the engine parameter `session_len`.
```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b',
                backend_config=TurbomindEngineConfig(session_len=8192))
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
```
### Set sampling parameters
You can change the default sampling parameters of the pipeline by passing `GenerationConfig`:
```python
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig
from lmdeploy.vl import load_image
pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b',
                backend_config=TurbomindEngineConfig(tp=2, session_len=8192))
gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.6)
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image), gen_config=gen_config)
print(response)
```
### Set chat template
While performing inference, LMDeploy identifies an appropriate chat template from its builtin collection based on the model path and subsequently applies this template to the input prompts. However, when a chat template cannot be inferred from the model path, users have to specify it. For example, [liuhaotian/llava-v1.5-7b](https://huggingface.co/liuhaotian/llava-v1.5-7b) employs the 'vicuna' chat template, but the name 'vicuna' cannot be ascertained from the model's path. We can specify it by setting 'vicuna' in `ChatTemplateConfig` as follows:
```python
from lmdeploy import pipeline, ChatTemplateConfig
from lmdeploy.vl import load_image
pipe = pipeline('liuhaotian/llava-v1.5-7b',
                chat_template_config=ChatTemplateConfig(model_name='vicuna'))
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
```
For more information about customizing a chat template, please refer to [this](../advance/chat_template.md) guide
## Multi-images inference
When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the [context window](#set-context-window-size) typically needs to be increased.
```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b',
                backend_config=TurbomindEngineConfig(session_len=8192))
image_urls = [
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
]
images = [load_image(img_url) for img_url in image_urls]
response = pipe(('describe these images', images))
print(response)
```
## Batch prompts inference
Conducting inference with batch prompts is quite straightforward; just place them within a list structure:
```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b',
                backend_config=TurbomindEngineConfig(session_len=8192))
image_urls = [
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
]
prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
response = pipe(prompts)
print(response)
```
## Multi-turn conversation
There are two ways to conduct multi-turn conversations with the pipeline. One is to construct messages according to the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.
```python
from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
from lmdeploy.vl import load_image
pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b',
                backend_config=TurbomindEngineConfig(session_len=8192))
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
sess = pipe.chat(('describe this image', image), gen_config=gen_config)
print(sess.response.text)
sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
print(sess.response.text)
```
# KV Cache Quantization and Test Results
For the LLaMa-7B fp16 model with a maximum length of 2048, the server requires approximately 1030MB of GPU memory to store kv_cache for each concurrent session created. This means that even an A100 80G can only serve a limited number of users.
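A quick back-of-envelope check of that figure (a sketch; the exact value also depends on implementation overhead):
```python
# Per-session k/v cache for LLaMa-7B at fp16, max length 2048:
seq_len, num_layer, num_heads, head_dim = 2048, 32, 32, 128
kv, bytes_per_elem = 2, 2            # one key + one value cache, fp16
per_session = seq_len * num_layer * num_heads * head_dim * kv * bytes_per_elem
print(per_session / 2**20)           # ~1024 MiB, matching the ~1030MB above
```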
To reduce runtime GPU memory usage, we have implemented PTQ quantization for kv cache, using the following formula:
```bash
zp = (min+max) / 2
scale = (max-min) / 255
quant: q = round( (f-zp) / scale)
dequant: f = q * scale + zp
```
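In PyTorch, these formulas correspond to something like the following minimal sketch:
```python
import torch

def quant_kv_int8(f: torch.Tensor, fmin: float, fmax: float):
    """PTQ int8 quantization following the zp/scale formulas above."""
    zp = (fmin + fmax) / 2
    scale = (fmax - fmin) / 255
    q = torch.round((f - zp) / scale).clamp(-128, 127).to(torch.int8)
    return q, zp, scale

def dequant_kv_int8(q: torch.Tensor, zp: float, scale: float) -> torch.Tensor:
    return q.to(torch.float16) * scale + zp
```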
## How to Enable KV Cache INT8
### **Step One**
Get the quantization parameters and save them to the original HF model directory:
```bash
# get minmax
export HF_MODEL=internlm/internlm-chat-7b
lmdeploy lite calibrate \
$HF_MODEL \
--calib-dataset 'ptb' \
--calib-samples 128 \
--calib-seqlen 2048 \
--work-dir $HF_MODEL
```
### **Step Two**
Test the chat performance. Note that `--quant-policy 4` enables the KV cache int8 mode.
```bash
lmdeploy chat turbomind $HF_MODEL --model-format hf --quant-policy 4
```
## GPU Memory Test
The test object is the [internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) model.
Testing method:
1. Use `deploy.py` to convert the model, modify the maximum concurrency in the `workspace` configuration; adjust the number of requests in `llama_config.ini`.
2. Compile and run `bin/llama_triton_example` to obtain the GPU memory situation of the fp16 version under different batch_size.
3. Enable quantization, re-run `bin/llama_triton_example` to obtain the GPU memory situation of the int8 version under different batch_size.
Below shows the comparison of GPU memory between the two versions:
| batch_size | fp16 memory(MiB) | int8 memory(MiB) | diff(MiB) |
| :--------: | :--------------: | :--------------: | :-------: |
| 8 | 22337 | 18241 | -4096 |
| 16 | 30593 | 22369 | -8224 |
| 32 | 47073 | 30625 | -16448 |
| 48 | 63553 | 38881 | -24672 |
Compared to directly quantizing the weights (as in [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa/)), we estimated the memory growth of the 7B model for both methods, with some data taken from [llama.cpp](https://github.com/ggerganov/llama.cpp).
![](../../../resources/batch_memory.png)
As can be seen, the fp16 version requires 1030MB of GPU memory for each concurrent session, so quantizing the kv_cache significantly reduces the growth rate of runtime memory.
## Accuracy Test
The test object is the [internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) instruction model.
Below are the results of PTQ quantization with the `kCacheKVInt8` method, calibrated with only 128 samples randomly selected from the c4 dataset. Accuracy was tested with [opencompass](https://github.com/InternLM/opencompass) before and after quantization.
| task | dataset | metric | int8 | fp16 | diff |
| :-----------: | :-------------: | :-----------: | :---: | :---: | :---: |
| Language | winogrande | accuracy | 60.77 | 61.48 | -0.71 |
| Knowledge | nq | score | 2.69 | 2.60 | +0.09 |
| Reasoning | gsm8k | accuracy | 33.28 | 34.72 | -1.44 |
| Reasoning | bbh | naive_average | 20.12 | 20.51 | -0.39 |
| Understanding | openbookqa_fact | accuracy | 82.40 | 82.20 | +0.20 |
| Understanding | eprstmt-dev | accuracy | 90.62 | 88.75 | +1.87 |
| Safety | crows_pairs | accuracy | 32.56 | 31.43 | +1.13 |
Note that the `kCacheKVInt8` and `WeightInt4` methods can be enabled at the same time.
Please refer to [w4a16](./w4a16.md) to apply `WeightInt4` first, and then start the chat like:
```shell
lmdeploy chat turbomind ./internlm-chat-7b-4bit --model-format awq --quant-policy 4
```
# INT4 Weight-only Quantization and Deployment (W4A16)
LMDeploy adopts the [AWQ](https://arxiv.org/abs/2306.00978) algorithm for 4-bit weight-only quantization. With the high-performance CUDA kernels we developed, inference with the 4-bit quantized model is up to 2.4x faster than FP16.
LMDeploy supports the following NVIDIA GPUs for W4A16 inference:
- Turing(sm75): 20 series, T4
- Ampere(sm80,sm86): 30 series, A10, A16, A30, A100
- Ada Lovelace(sm89): 40 series
Before proceeding with the quantization and inference, please ensure that lmdeploy is installed.
```shell
pip install lmdeploy[all]
```
This article comprises the following sections:
<!-- toc -->
- [Quantization](#quantization)
- [Evaluation](#evaluation)
- [Inference](#inference)
- [Performance](#performance)
- [Service](#service)
<!-- tocstop -->
## Quantization
A single command execution is all it takes to quantize the model. The resulting quantized weights are then stored in the $WORK_DIR directory.
```shell
export HF_MODEL=internlm/internlm-chat-7b
export WORK_DIR=internlm/internlm-chat-7b-4bit
lmdeploy lite auto_awq \
$HF_MODEL \
--calib-dataset 'ptb' \
--calib-samples 128 \
--calib-seqlen 2048 \
--w-bits 4 \
--w-group-size 128 \
--work-dir $WORK_DIR
```
Typically, the above command doesn't require filling in optional parameters, as the defaults usually suffice. For instance, when quantizing the [internlm/internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) model, the command can be condensed as:
```shell
lmdeploy lite auto_awq internlm/internlm-chat-7b --work-dir internlm-chat-7b-4bit
```
```{note}
We recommend that you specify the --work-dir parameter, including the model name as demonstrated in the example above. This facilitates LMDeploy in fuzzy matching the --work-dir with an appropriate built-in chat template. Otherwise, you will have to designate the chat template during inference.
```
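If the work directory name carries no model hint, one way to designate the chat template explicitly at inference time is sketched below. It relies on `ChatTemplateConfig`, which is also used later in this document, and assumes `internlm-chat-7b` matches a built-in template name; the work-dir path is illustrative only.
```python
from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig

# The work dir name below does not hint at the model, so the chat template is set explicitly.
pipe = pipeline('./quant-output-4bit',  # hypothetical quantization output directory
                backend_config=TurbomindEngineConfig(model_format='awq'),
                chat_template_config=ChatTemplateConfig(model_name='internlm-chat-7b'))
print(pipe(['Hi, pls intro yourself']))
```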
Upon completing quantization, you can engage with the model efficiently using a variety of handy tools.
For example, you can initiate a conversation with it via the command line:
```shell
lmdeploy chat turbomind ./internlm-chat-7b-4bit --model-format awq
```
Alternatively, you can start the gradio server and interact with the model through the web UI at `http://{ip_addr}:{port}`:
```shell
lmdeploy serve gradio ./internlm-chat-7b-4bit --server_name {ip_addr} --server_port {port} --model-format awq
```
## Evaluation
Please refer to [this guide](https://opencompass.readthedocs.io/en/latest/advanced_guides/evaluation_turbomind.html) for model evaluation with LMDeploy.
## Inference
With the following code, you can perform batched offline inference with the quantized model:
```python
from lmdeploy import pipeline, TurbomindEngineConfig
engine_config = TurbomindEngineConfig(model_format='awq')
pipe = pipeline("./internlm-chat-7b-4bit", backend_config=engine_config)
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
```
For more information about the pipeline parameters, please refer to [here](../inference/pipeline.md).
In addition to performing inference with the quantized model on localhost, LMDeploy can also run inference on 4-bit models quantized with the AWQ algorithm and hosted on the Huggingface Hub, such as models from the [lmdeploy space](https://huggingface.co/lmdeploy) and the [TheBloke space](https://huggingface.co/TheBloke).
```python
# inference with models from lmdeploy space
from lmdeploy import pipeline, TurbomindEngineConfig
pipe = pipeline("lmdeploy/llama2-chat-70b-4bit",
backend_config=TurbomindEngineConfig(model_format='awq', tp=4))
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
# inference with models from thebloke space
from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
pipe = pipeline("TheBloke/LLaMA2-13B-Tiefighter-AWQ",
backend_config=TurbomindEngineConfig(model_format='awq'),
chat_template_config=ChatTemplateConfig(model_name='llama2')
)
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
```
## Performance
We benchmarked the Llama-2-7B-chat and Llama-2-13B-chat models with 4-bit quantization on an NVIDIA GeForce RTX 4090 using [profile_generation.py](https://github.com/InternLM/lmdeploy/blob/main/benchmark/profile_generation.py). We measured the token generation throughput (tokens/s) with a single prompt token and 512 generated tokens; all results are for single-batch inference.
| model | llm-awq | mlc-llm | turbomind |
| ---------------- | ------- | ------- | --------- |
| Llama-2-7B-chat | 112.9 | 159.4 | 206.4 |
| Llama-2-13B-chat | N/A | 90.7 | 115.8 |
## Service
LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of starting the service:
```shell
lmdeploy serve api_server internlm/internlm-chat-7b-4bit --backend turbomind --model-format awq
```
The default port of `api_server` is `23333`. After the server is launched, you can communicate with it from the terminal through `api_client`:
```shell
lmdeploy serve api_client http://0.0.0.0:23333
```
You can overview and try out the `api_server` APIs online via the Swagger UI at `http://0.0.0.0:23333`, or read the API specification [here](../serving/api_server.md).
# W8A8 LLM Model Deployment
LMDeploy provides functions for quantization and inference of large language models using 8-bit integers.
Before starting inference, ensure that lmdeploy and openai/triton are correctly installed. Execute the following commands to install these:
```shell
pip install lmdeploy
pip install triton>=2.1.0
```
## 8-bit Weight Model Inference
For performing 8-bit weight model inference, you can directly download the pre-quantized 8-bit weight models from LMDeploy's [model zoo](https://huggingface.co/lmdeploy). For instance, the 8-bit Internlm-chat-7B model is available for direct download from the model zoo:
```shell
git-lfs install
git clone https://huggingface.co/lmdeploy/internlm-chat-7b-w8 (coming soon)
```
Alternatively, you can manually convert the original 16-bit weights into 8-bit by referring to the ["8bit Weight Quantization"](#8bit-weight-quantization) section below, and save them in the `internlm-chat-7b-w8` directory using the command:
```shell
lmdeploy lite smooth_quant internlm/internlm-chat-7b --work-dir ./internlm-chat-7b-w8
```
Afterwards, use the following command to interact with the model via the terminal:
```shell
lmdeploy chat torch ./internlm-chat-7b-w8
```
## Launching gradio service
Coming soon...
## Inference Speed
Coming soon...
## 8bit Weight Quantization
Performing 8-bit weight quantization involves three steps:
1. **Smooth Weights**: Start by smoothing the weights of the Language Model (LLM). This process makes the weights more amenable to quantizing.
2. **Replace Modules**: Locate the DecoderLayers and replace the RMSNorm and nn.Linear modules with QRMSNorm and QLinear modules respectively (see the sketch after this list). These 'Q' modules are available in the `lmdeploy/pytorch/models/q_modules.py` file.
3. **Save the Quantized Model**: Once you've made the necessary replacements, save the new quantized model.
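The replacement step can be pictured with the illustrative sketch below. `QLinear.from_float` and `QRMSNorm.from_float` are assumed constructors, and the norm-type check is heuristic, so treat this as a conceptual outline rather than the actual `smooth_quant.py` code.
```python
import torch.nn as nn
from lmdeploy.pytorch.models.q_modules import QLinear, QRMSNorm

def replace_with_q_modules(decoder_layer: nn.Module) -> None:
    """Swap fp16 submodules of one DecoderLayer for their quantized counterparts."""
    for name, child in decoder_layer.named_children():
        if isinstance(child, nn.Linear):
            setattr(decoder_layer, name, QLinear.from_float(child))    # assumed helper
        elif 'RMSNorm' in type(child).__name__:
            setattr(decoder_layer, name, QRMSNorm.from_float(child))   # assumed helper
        else:
            replace_with_q_modules(child)  # recurse into nested blocks
```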
The script `lmdeploy/lite/apis/smooth_quant.py` accomplishes all three tasks detailed above. For example, you can obtain the model weights of the quantized Internlm-chat-7B model by running the following command:
```shell
lmdeploy lite smooth_quant internlm/internlm-chat-7b --work-dir ./internlm-chat-7b-w8
```
After saving, you can instantiate your quantized model by calling the from_pretrained interface.
# Serving LLM with OpenAI Compatible Server
This article primarily discusses the deployment of a single LLM model across multiple GPUs on a single node, providing a service that is compatible with the OpenAI interface, as well as the usage of the service API.
For the sake of convenience, we refer to this service as `api_server`. Regarding parallel services with multiple models, please refer to the guide about [Request Distribution Server](./proxy_server.md).
In the following sections, we will first introduce two methods for starting the service, choosing the appropriate one based on your application scenario.
Next, we focus on the definition of the service's RESTful API, explore the various ways to interact with the interface, and demonstrate how to try the service through the Swagger UI or LMDeploy CLI tools.
Finally, we showcase how to integrate the service into a WebUI, providing you with a reference to easily set up a demonstration demo.
## Launch Service
Take the [internlm2-chat-7b](https://huggingface.co/internlm/internlm2-chat-7b) model hosted on huggingface hub as an example. You can choose one of the following methods to start the service.
### Option 1: Launching with lmdeploy CLI
```shell
lmdeploy serve api_server internlm/internlm2-chat-7b --server-port 23333
```
The arguments of `api_server` can be viewed through the command `lmdeploy serve api_server -h`, for instance, `--tp` to set tensor parallelism, `--session-len` to specify the max length of the context window, `--cache-max-entry-count` to adjust the GPU mem ratio for k/v cache etc.
### Option 2: Deploying with docker
With LMDeploy [official docker image](https://hub.docker.com/r/openmmlab/lmdeploy/tags), you can run OpenAI compatible server as follows:
```shell
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
-p 23333:23333 \
--ipc=host \
openmmlab/lmdeploy:latest \
lmdeploy serve api_server internlm/internlm2-chat-7b
```
The parameters of `api_server` are the same as those mentioned in the "[option 1](#option-1-launching-with-lmdeploy-cli)" section.
## RESTful API
LMDeploy's RESTful API is compatible with the following three OpenAI interfaces:
- /v1/chat/completions
- /v1/models
- /v1/completions
Additionally, LMDeploy also defines `/v1/chat/interactive` to support interactive inference. The feature of interactive inference is that there's no need to pass the user conversation history as required by `v1/chat/completions`, since the conversation history will be cached on the server side. This method boasts excellent performance during multi-turn long context inference.
After launching the service successfully, you can overview and try out the offered RESTful APIs at `http://0.0.0.0:23333`, as shown in the image below.
![swagger_ui](https://github.com/InternLM/lmdeploy/assets/4560679/b891dd90-3ffa-4333-92b2-fb29dffa1459)
Or, you can use LMDeploy's built-in CLI tool to verify the service correctness right from the console.
```shell
# api_server_url is what api_server prints on startup, e.g. http://localhost:23333
lmdeploy serve api_client ${api_server_url}
```
If you need to integrate the service into your own projects or products, we recommend the following approach:
### Integrate with `OpenAI`
Here is an example of interaction with the endpoint `v1/chat/completions` service via the openai package.
Before running it, please install the openai package by `pip install openai`
```python
from openai import OpenAI
client = OpenAI(
api_key='YOUR_API_KEY',
base_url="http://0.0.0.0:23333/v1"
)
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
model=model_name,
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": " provide three suggestions about time management"},
],
temperature=0.8,
top_p=0.8
)
print(response)
```
You can invoke other OpenAI interfaces using similar methods. For more detailed information, please refer to the [OpenAI API guide](https://platform.openai.com/docs/guides/text-generation)
### Integrate with lmdeploy `APIClient`
Below are some examples demonstrating how to access the service through `APIClient`.
If you want to use the `/v1/chat/completions` endpoint, you can try the following code:
```python
from lmdeploy.serve.openai.api_client import APIClient
api_client = APIClient('http://{server_ip}:{server_port}')
model_name = api_client.available_models[0]
messages = [{"role": "user", "content": "Say this is a test!"}]
for item in api_client.chat_completions_v1(model=model_name, messages=messages):
print(item)
```
For the `/v1/completions` endpoint, you can try:
```python
from lmdeploy.serve.openai.api_client import APIClient
api_client = APIClient('http://{server_ip}:{server_port}')
model_name = api_client.available_models[0]
for item in api_client.completions_v1(model=model_name, prompt='hi'):
print(item)
```
As for `/v1/chat/interactive`, the feature is disabled by default. Enable it by setting `interactive_mode = True`; otherwise, the endpoint falls back to the OpenAI-compatible behavior.
Keep in mind that `session_id` identifies a sequence; all requests belonging to the same sequence must share the same `session_id`.
For instance, in a sequence with 10 rounds of chatting requests, the `session_id` in each request should be the same.
```python
from lmdeploy.serve.openai.api_client import APIClient
api_client = APIClient(f'http://{server_ip}:{server_port}')
messages = [
"hi, what's your name?",
"who developed you?",
"Tell me more about your developers",
"Summarize the information we've talked so far"
]
for message in messages:
for item in api_client.chat_interactive_v1(prompt=message,
session_id=1,
interactive_mode=True,
stream=False):
print(item)
```
### Integrate with Java/Golang/Rust
You may use [openapi-generator-cli](https://github.com/OpenAPITools/openapi-generator-cli) to generate a Java/Golang/Rust client from `http://{server_ip}:{server_port}/openapi.json`.
Here is an example:
```shell
$ docker run -it --rm -v ${PWD}:/local openapitools/openapi-generator-cli generate -i /local/openapi.json -g rust -o /local/rust
$ ls rust/*
rust/Cargo.toml rust/git_push.sh rust/README.md
rust/docs:
ChatCompletionRequest.md EmbeddingsRequest.md HttpValidationError.md LocationInner.md Prompt.md
DefaultApi.md GenerateRequest.md Input.md Messages.md ValidationError.md
rust/src:
apis lib.rs models
```
### Integrate with cURL
cURL is a tool for observing the output of the RESTful APIs.
- list served models `v1/models`
```bash
curl http://{server_ip}:{server_port}/v1/models
```
- chat `v1/chat/completions`
```bash
curl http://{server_ip}:{server_port}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "internlm-chat-7b",
"messages": [{"role": "user", "content": "Hello! How are you?"}]
}'
```
- text completions `v1/completions`
```shell
curl http://{server_ip}:{server_port}/v1/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "llama",
"prompt": "two steps to build a house:"
}'
```
- interactive chat `v1/chat/interactive`
```bash
curl http://{server_ip}:{server_port}/v1/chat/interactive \
-H "Content-Type: application/json" \
-d '{
"prompt": "Hello! How are you?",
"session_id": 1,
"interactive_mode": true
}'
```
## Integrate with WebUI
LMDeploy uses `gradio` or [OpenAOE](https://github.com/InternLM/OpenAOE) to integrate a web UI for `api_server`.
### Option 1: gradio
```shell
# api_server_url is what api_server prints on startup, e.g. http://localhost:23333
# server_ip and server_port here are for gradio ui
# example: lmdeploy serve gradio http://localhost:23333 --server-name localhost --server-port 6006
lmdeploy serve gradio api_server_url --server-name ${gradio_ui_ip} --server-port ${gradio_ui_port}
```
### Option 2: OpenAOE
```shell
pip install -U openaoe
openaoe -f /path/to/your/config-template.yaml
```
Please refer to the [guidance](https://github.com/InternLM/OpenAOE/blob/main/docs/tech-report/model_serving_by_lmdeploy/model_serving_by_lmdeploy.md) for more deploy information.
## FAQ
1. When the user gets `"finish_reason":"length"`, it means the session is too long to be continued. The session length can be modified by passing `--session-len` to `api_server`.
2. When an OOM error appears on the server side, please reduce the `cache_max_entry_count` of `backend_config` when launching the service.
3. When a request with the same `session_id` to `/v1/chat/interactive` gets an empty return value and a negative `tokens`, please consider setting `interactive_mode=false` to restart the session.
4. The `/v1/chat/interactive` API disables multi-round conversation by default. The input argument `prompt` can be either a single string or an entire chat history.
5. Regarding stop words, we only support characters that encode into a single token index. Furthermore, there may be multiple indexes that decode into results containing the stop word; in such cases, if the number of these indexes is too large, we only use the index produced by the tokenizer. If you want to use a stop symbol that encodes into multiple indexes, consider performing string matching on the streaming client side; once a match is found, break out of the streaming loop (see the sketch after this list).
6. To customize a chat template, please refer to [chat_template.md](../advance/chat_template.md).
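For item 5, below is a minimal client-side sketch of matching a multi-token stop string during streaming. It reuses the `APIClient` streaming interface shown earlier; the chunk layout (`choices[0]['delta']['content']`) is assumed to follow the OpenAI streaming format the server is compatible with, and the stop string is just an example.
```python
from lmdeploy.serve.openai.api_client import APIClient

api_client = APIClient('http://{server_ip}:{server_port}')
model_name = api_client.available_models[0]

stop_word = '<|im_end|>'   # example stop symbol that may span multiple token indexes
generated = ''
for item in api_client.chat_completions_v1(
        model=model_name,
        messages=[{'role': 'user', 'content': 'Say this is a test!'}],
        stream=True):
    delta = item['choices'][0]['delta'].get('content', '') or ''  # assumed chunk schema
    generated += delta
    if stop_word in generated:
        generated = generated.split(stop_word)[0]
        break  # stop streaming once the stop string is matched on the client side
print(generated)
```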
# Serving VLM with OpenAI Compatible Server
This article primarily discusses the deployment of a single large vision language model across multiple GPUs on a single node, providing a service that is compatible with the OpenAI interface, as well as the usage of the service API.
For the sake of convenience, we refer to this service as `api_server`. Regarding parallel services with multiple models, please refer to the guide about [Request Distribution Server](./proxy_server.md).
In the following sections, we will first introduce two methods for starting the service, choosing the appropriate one based on your application scenario.
Next, we focus on the definition of the service's RESTful API, explore the various ways to interact with the interface, and demonstrate how to try the service through the Swagger UI or LMDeploy CLI tools.
Finally, we showcase how to integrate the service into a WebUI, providing you with a reference to easily set up a demonstration demo.
## Launch Service
Take the [llava-v1.6-vicuna-7b](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b) model hosted on huggingface hub as an example. You can choose one of the following methods to start the service.
### Option 1: Launching with lmdeploy CLI
```shell
lmdeploy serve api_server liuhaotian/llava-v1.6-vicuna-7b --server-port 23333
```
The arguments of `api_server` can be viewed through the command `lmdeploy serve api_server -h`, for instance, `--tp` to set tensor parallelism, `--session-len` to specify the max length of the context window, `--cache-max-entry-count` to adjust the GPU mem ratio for k/v cache etc.
### Option 2: Deploying with docker
With LMDeploy [official docker image](https://hub.docker.com/r/openmmlab/lmdeploy/tags), you can run OpenAI compatible server as follows:
```shell
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
-p 23333:23333 \
--ipc=host \
openmmlab/lmdeploy:latest \
lmdeploy serve api_server liuhaotian/llava-v1.6-vicuna-7b
```
The parameters of `api_server` are the same as those mentioned in the "[option 1](#option-1-launching-with-lmdeploy-cli)" section.
## RESTful API
LMDeploy's RESTful API is compatible with the following three OpenAI interfaces:
- /v1/chat/completions
- /v1/models
- /v1/completions
The interface for image interaction is `/v1/chat/completions`, which is consistent with OpenAI.
After launching the service successfully, you can overview and try out the offered RESTful APIs at `http://0.0.0.0:23333`, as shown in the image below.
![swagger_ui](https://github.com/InternLM/lmdeploy/assets/4560679/b891dd90-3ffa-4333-92b2-fb29dffa1459)
If you need to integrate the service into your own projects or products, we recommend the following approach:
### Integrate with `OpenAI`
Here is an example of interaction with the endpoint `v1/chat/completions` service via the openai package.
Before running it, please install the openai package by `pip install openai`
```python
from openai import OpenAI
client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
model=model_name,
messages=[{
'role':
'user',
'content': [{
'type': 'text',
'text': 'Describe the image please',
}, {
'type': 'image_url',
'image_url': {
'url':
'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg',
},
}],
}],
temperature=0.8,
top_p=0.8)
print(response)
```
### Integrate with lmdeploy `APIClient`
Below are some examples demonstrating how to access the service through `APIClient`.
If you want to use the `/v1/chat/completions` endpoint, you can try the following code:
```python
from lmdeploy.serve.openai.api_client import APIClient
api_client = APIClient('http://0.0.0.0:23333')
model_name = api_client.available_models[0]
messages = [{
'role':
'user',
'content': [{
'type': 'text',
'text': 'Describe the image please',
}, {
'type': 'image_url',
'image_url': {
'url':
'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg',
},
}]
}]
for item in api_client.chat_completions_v1(model=model_name,
messages=messages):
print(item)
```
### Integrate with Java/Golang/Rust
You may use [openapi-generator-cli](https://github.com/OpenAPITools/openapi-generator-cli) to generate a Java/Golang/Rust client from `http://{server_ip}:{server_port}/openapi.json`.
Here is an example:
```shell
$ docker run -it --rm -v ${PWD}:/local openapitools/openapi-generator-cli generate -i /local/openapi.json -g rust -o /local/rust
$ ls rust/*
rust/Cargo.toml rust/git_push.sh rust/README.md
rust/docs:
ChatCompletionRequest.md EmbeddingsRequest.md HttpValidationError.md LocationInner.md Prompt.md
DefaultApi.md GenerateRequest.md Input.md Messages.md ValidationError.md
rust/src:
apis lib.rs models
```
# Serving with Gradio
Starting an LLM model's gradio service with LMDeploy and interacting with the model on the WebUI is incredibly simple.
```shell
pip install lmdeploy[serve]
lmdeploy serve gradio {model_path}
```
All it takes is a one-line command, with `{model_path}` replaced by the model ID from the huggingface hub, such as `internlm/internlm2-chat-7b`, or a local path to the model.
For detailed parameters of the command, please turn to `lmdeploy serve gradio -h` for help.
## Create a huggingface demo
If you want to create an online demo project for your model on huggingface, please follow the steps below.
### Step 1: Create space
First, register for a Hugging Face account. After successful registration, click on your profile picture in the upper right corner and select “New Space” to create one. Follow the Hugging Face guide to choose the necessary configurations, and you will have a blank demo space ready.
### Step 2: Develop demo's entrypoint `app.py`
Replace the content of `app.py` in your space with the following code:
```python
from lmdeploy.serve.gradio.turbomind_coupled import run_local
from lmdeploy.messages import TurbomindEngineConfig
backend_config = TurbomindEngineConfig(max_batch_size=8)
model_path = 'internlm/internlm2-chat-7b'
run_local(model_path, backend_config=backend_config, server_name="huggingface-space")
```
Create a `requirements.txt` file with the following content:
```
lmdeploy
```
## FAQs
- ZeroGPU compatibility issue. ZeroGPU is not compatible with the LMDeploy turbomind engine. Please use standard GPUs, or change the backend config in the above code to `PyTorchEngineConfig` to run on ZeroGPU.
- Gradio version issue: versions above 4.0.0 are currently not supported. You can pin the version in `app.py`, for example:
```python
import os
os.system("pip uninstall -y gradio")
os.system("pip install gradio==3.43.0")
```
# Request Distributor Server
The request distributor service can parallelize multiple api_server services. Users only need to access the proxy URL, and they can indirectly access different api_server services. The proxy service will automatically distribute requests internally, achieving load balancing.
## Startup
Start the proxy service:
```shell
python3 -m lmdeploy.serve.proxy.proxy --server_name {server_name} --server_port {server_port} --strategy "min_expected_latency"
```
After startup is successful, the URL of the proxy service will also be printed by the script. Access this URL in your browser to open the Swagger UI.
## API
Through Swagger UI, we can see multiple APIs. Those related to api_server node management include:
- /nodes/status
- /nodes/add
- /nodes/remove
They respectively represent viewing all api_server service nodes, adding a certain node, and deleting a certain node.
APIs related to usage include:
- /v1/models
- /v1/chat/completions
- /v1/completions
The usage of these APIs is the same as that of api_server.
## Dispatch Strategy
The current distribution strategies of the proxy service are as follows:
- random: dispatches based on the request-processing throughput of each api_server node provided by the user. The greater the throughput, the more likely a node is to be allocated requests. Nodes that do not provide a throughput are treated as having the average throughput of the other nodes.
- min_expected_latency: calculates, from the number of requests currently waiting on each node and the node's throughput, the expected time required to complete a response, and allocates the request to the node with the shortest expected time. Nodes that do not provide a throughput are treated similarly.
- min_observed_latency: allocates based on the average time required to handle a certain number of past requests on each node. The one with the shortest time gets allocated.
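As an illustration only (not the proxy's actual code), the `min_expected_latency` choice can be sketched as picking the node whose queued work divided by its throughput is smallest; the `Node` type and its fields below are assumptions made for the example.
```python
from dataclasses import dataclass
from typing import List

@dataclass
class Node:
    url: str
    throughput: float      # requests the node can finish per second
    pending_requests: int  # requests currently waiting on the node

def pick_min_expected_latency(nodes: List[Node]) -> Node:
    """Illustrative version of the 'min_expected_latency' dispatch strategy."""
    # Expected time for a new request = work already queued / processing speed.
    return min(nodes, key=lambda n: (n.pending_requests + 1) / n.throughput)

nodes = [Node('http://10.0.0.1:23333', throughput=8.0, pending_requests=12),
         Node('http://10.0.0.2:23333', throughput=4.0, pending_requests=2)]
print(pick_min_expected_latency(nodes).url)  # -> http://10.0.0.2:23333
```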
## LMDeploy-QoS Introduction and Usage
### Background
With the rise of Large Language Model (LLM) and Artificial General Intelligence (AGI), numerous inference frameworks have emerged. These frameworks deliver scalable and high-performance services by serving online workloads with language models. However, these workloads often come from multiple user groups, exhibiting rapid changes in workload patterns within short periods. Many inference frameworks struggle to meet the demands of such multi-tenancy traffic patterns and fail to effectively shape user behaviors. Therefore, we believe that systematically considering these issues in LLM inference framework is both valuable and necessary.
### User Categorizations for Multi-tenancy Handling
LMDeploy-QoS is part of LMDeploy, offering a range of multi-tenancy functionalities. It requires users to tag their inference requests with appropriate user identifications (user_id in configuration or codebase). The system operates based on a dictionary-like configuration that serves as a multi-tenancy policy. In this configuration, users are mapped to different classes, known as "user groups", each configured with a ratio value. Our multi-tenancy strategy reads this configuration and schedules user inference requests according to class priority and the difference between the predefined ratio and real-time allocation ratio. Extensive testing shows that LMDeploy-QoS significantly enhances LLM serving reliability and GPU resource utilization for real-world large language model inference workloads.
We categorize LMDeploy users into four groups:
- Platinum
- Gold
- Silver
- Bronze
Based on our experiences in delivering LLM services, we can map the following four types of users to these user groups:
- Platinum: VIP or administrative users. Examples include service inspectors or product demo presenters who require uninterrupted online services. Their workloads are typically at a low frequency and require limited resources.
- Gold: Contracted business user groups requiring specific quantities of reliable services. For instance, Company A signs a contract with the LLM service provider to secure X requests/sec service capability with Z% availability for its employees at the cost of Y million dollars per year.
- Silver: The vast majority of users fall under this category. Most trial or monthly subscribed users are included in this group. They need a relatively small quantity of services, but their user experiences significantly affect the LLM service reputation.
- Bronze: Heavy users who pay minimal fees to LLM providers.
The above user group categorization is intended for guidance rather than as a recommendation for all LMDeploy users, as it may not be suitable for all LLM service providers. Users can develop their own method of categorizing users based on their observations of daily workloads.
Next, we will discuss how LMDeploy schedules requests based on these categorizations.
### Multi-tenancy Strategies
#### Strategy 1: prioritized scheduling between groups
This strategy is as simple as its title suggests.
User groups are introduced for this strategy, and the users in each group must be specified. The recommended user groups are as follows:
- Platinum
- Gold
- Silver
- Bronze
The priority of each group decreases sequentially. Requests with higher priority are always given precedence for inference. Note that scheduling is performed at the time of request reception, so lower-priority requests that are already under inference will not be withdrawn from the GPU.
The below diagram shows how the prioritization works. As you can see, the platinum request is reprioritized and moved to the queue head.
![](https://github.com/InternLM/lmdeploy/assets/52888924/9d63f081-7168-4c74-8456-24f0a4b41649)
#### Strategy 2: proportionally rated scheduling with a pre-defined ratio within user group
This strategy works only within the user group. We introduce a within-group user quota configuration table. This table defines users' "ideal share ratio" with a sum value of 100% GPU resource. Each "user" appears in the list as a user_id, and a user can only belong to one user group. Requests from different users will be scheduled according to each user's "ideal share ratio". To be specific, users with their real-time usage ratio lower than their quota ratio will have priority over users whose real-time usage ratio is higher than their quota ratio. It is worth noting that the scheduling only considers users in the request queue, ignoring any absent users from the configuration table.
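A toy sketch of this within-group ordering follows. It simply sorts queued requests by how far each user's real-time share falls below its configured quota, and is meant to illustrate the idea rather than reproduce LMDeploy's scheduler; the data structures are assumptions made for the example.
```python
from typing import Dict, List, Tuple

def order_within_group(queue: List[Tuple[str, str]],
                       quota_pct: Dict[str, float],
                       served: Dict[str, int]) -> List[Tuple[str, str]]:
    """Sort (user_id, request) pairs so that under-served users go first."""
    total = sum(served.values()) or 1

    def deficit(user: str) -> float:
        # Positive when the user's real-time share is below its ideal share ratio.
        return quota_pct.get(user, 0) / 100 - served.get(user, 0) / total

    return sorted(queue, key=lambda r: deficit(r[0]), reverse=True)

queue = [('user_id1', 'req-a'), ('user_id2', 'req-b'), ('user_id1', 'req-c')]
quota = {'user_id1': 50, 'user_id2': 50}
served = {'user_id1': 8, 'user_id2': 2}   # user_id1 has been over-served so far
print(order_within_group(queue, quota, served))  # user_id2's request comes first
```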
The below diagram shows a typical example of how this strategy works.
![](https://github.com/InternLM/lmdeploy/assets/52888924/3e1d7135-6b11-4998-89a1-b72af6c962c3)
#### Strategy 3: a combination strategy of 1 and 2
We can call this a hybrid strategy. The way we combine the two strategies is fairly simple: strategy 1 is applied between user groups, and strategy 2 within a user group. Users belonging to different groups with different priorities therefore obey only strategy 1 to determine their privilege in resource allocation; that is, when both strategies apply, the first overrides the second. Only when no cross-group requests are waiting to be served does the within-group strategy 2 come into play.
Below is a diagram showing it.
![](https://github.com/InternLM/lmdeploy/assets/52888924/e335f976-ff15-48db-b1ff-abf1c3327d6e)
Note that there are other ways to combine strategies 1 and 2; this doc only introduces one method that works well in our scenario. Since prioritization and pro-rated sharing are inherently conflicting strategies, there is no easy way to mix them within a single dimension.
### A Sample QoS Configuration
The configuration is specified by the `--qos-config-path` flag and is loaded by the program upon startup.
```json
{
"enable_user_qos": true,
"user_groups": [
"Platinum",
"Gold",
"Silver",
"Bronze"
],
"user_group_map": {
"Platinum": [
{
"id": "user_id0",
"quota_pct": 100
},
{
"id": "default",
"quota_pct": 0
}
],
"Gold": [
{
"id": "user_id1",
"quota_pct": 50
},
{
"id": "user_id2",
"quota_pct": 50
}
],
"Silver": [
{
"id": "user_id3",
"quota_pct": 5
},
{
"id": "default",
"quota_pct": 95
}
],
"Bronze": [
{
"id": "user_id4",
"quota_pct": 30
},
{
"id": "user_id5",
"quota_pct": 30
},
{
"id": "user_id6",
"quota_pct": 40
},
{
"id": "default",
"quota_pct": 0
}
]
}
}
```
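Below is a small, hypothetical helper for sanity-checking such a configuration before passing it to `--qos-config-path`. It only verifies that each group's `quota_pct` values sum to 100 (an assumption based on the "ideal share ratio" description above) and is not part of LMDeploy itself.
```python
import json

def check_qos_config(path: str) -> None:
    """Warn when a user group's quota_pct values do not add up to 100."""
    with open(path) as f:
        cfg = json.load(f)
    if not cfg.get('enable_user_qos'):
        print('user QoS is disabled in this config')
        return
    for group, users in cfg['user_group_map'].items():
        total = sum(u['quota_pct'] for u in users)
        if total != 100:
            print(f'{group}: quota_pct sums to {total}, expected 100')

check_qos_config('lmdeploy/serve/qos_engine/qos_config.json.template')
```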
### How to perform an inference job with LMDeploy-QoS awareness
The examples below show how to send inference requests under the multi-tenancy strategy. The QoS-related argument appears in the HTTP body as `user_id`:
/v1/chat/interactive_qos
```bash
curl -X POST http://localhost/v1/chat/interactive_qos \
-H "Content-Type: application/json" \
-d '{
"prompt": "Hello,Hello",
"session_id": -1,
"interactive_mode": false,
"stream": false,
"stop": false,
"request_output_len": 512,
"top_p": 0.8,
"top_k": 40,
"temperature": 0.8,
"repetition_penalty": 1,
"ignore_eos": false,
"user_id": "user_id0"
}'
```
/v1/chat/completions_qos
```bash
curl -X POST http://localhost/v1/chat/completions_qos \
-H "Content-Type: application/json" \
-d '{
"model": "internlm-chat-7b",
"messages": "Hello,Hello",
"temperature": 0.7,
"top_p": 1,
"n": 1,
"max_tokens": 512,
"stop": false,
"stream": false,
"presence_penalty": 0,
"frequency_penalty": 0,
"repetition_penalty": 1,
"session_id": -1,
"ignore_eos": false,
"user_id": "user_id0"
}'
```
/v1/completions_qos
```bash
curl -X POST http://localhost/v1/completions_qos \
-H "Content-Type: application/json" \
-d '{
"model": "internlm-chat-7b",
"prompt": "Hello,Hello",
"suffix": "string",
"temperature": 0.7,
"n": 1,
"max_tokens": 16,
"stop": "string",
"stream": false,
"top_p": 1,
"repetition_penalty": 1,
"session_id": -1,
"ignore_eos": false,
"user_id": "user_id0"
}'
```
### File Configuration Modification
The template of the configuration file is located at `lmdeploy/serve/qos_engine/qos_config.json.template`. Add the necessary users based on actual requirements, ensure correct priority assignment, and set appropriate quota values.
### Passing Configuration Parameters
Upon starting the api_server, pass the configuration file and its path using the `--qos-config-path` flag. An example is illustrated below:
```bash
CUDA_VISIBLE_DEVICES=0 lmdeploy serve api_server internlm/internlm-chat-7b --server-port 8000 --qos-config-path lmdeploy/serve/qos_engine/qos_config.json.template
```
### Contributor
[Eric](https://github.com/rhinouser0), [sallyjunjun](https://github.com/sallyjunjun), [sfireworks](https://github.com/sfireworks), [Dofgal](https://github.com/Dofgal), [shadow](https://github.com/awslshadowstar)
# Supported Models
## Models supported by TurboMind
| Model | Size | FP16/BF16 | KV INT8 | W4A16 |
| :----------------: | :------: | :-------: | :-----: | :---: |
| Llama | 7B - 65B | Yes | Yes | Yes |
| Llama2 | 7B - 70B | Yes | Yes | Yes |
| InternLM | 7B - 20B | Yes | Yes | Yes |
| InternLM2 | 7B - 20B | Yes | - | Yes |
| InternLM-XComposer | 7B | Yes | Yes | Yes |
| QWen | 7B - 72B | Yes | Yes | Yes |
| QWen-VL | 7B | Yes | Yes | Yes |
| Baichuan | 7B | Yes | Yes | Yes |
| Baichuan2 | 7B | Yes | Yes | Yes |
| Code Llama | 7B - 34B | Yes | No | No |
| YI | 6B - 34B | Yes | No | No |
## Models supported by PyTorch
| Model | Size | FP16/BF16 | KV INT8 | W8A8 |
| :----------: | :-------: | :-------: | :-----: | :--: |
| Llama | 7B - 65B | Yes | No | Yes |
| Llama2 | 7B - 70B | Yes | No | Yes |
| InternLM | 7B - 20B | Yes | No | Yes |
| InternLM2 | 7B - 20B | Yes | No | - |
| Baichuan2 | 7B - 13B | Yes | No | Yes |
| ChatGLM2 | 6B | Yes | No | No |
| Falcon | 7B - 180B | Yes | No | No |
| YI | 6B - 34B | Yes | No | No |
| Mistral | 7B | Yes | No | No |
| Mixtral | 8x7B | Yes | No | No |
| QWen1.5 | 7B - 72B | Yes | No | No |
| DeepSeek-MoE | 16B | Yes | No | No |
| Gemma | 2B-7B | Yes | No | No |
```yaml
version: 2
formats: all
build:
  os: "ubuntu-22.04"
  tools:
    python: "3.8"
sphinx:
  configuration: docs/zh_cn/conf.py
python:
  install:
    - requirements: requirements/docs.txt
    - requirements: requirements/readthedocs.txt
```