w8a8.md

# W8A8 LLM Model Deployment

LMDeploy provides functions for quantization and inference of large language models using 8-bit integers.

Before starting inference, ensure that lmdeploy and openai/triton are correctly installed. Execute the following commands to install these:

```shell
pip install lmdeploy
pip install triton>=2.1.0
```

## 8-bit Weight Model Inference

For performing 8-bit weight model inference, you can directly download the pre-quantized 8-bit weight models from LMDeploy's [model zoo](https://huggingface.co/lmdeploy). For instance, the 8-bit Internlm-chat-7B model is available for direct download from the model zoo:

```shell
git-lfs install
git clone https://huggingface.co/lmdeploy/internlm-chat-7b-w8 (coming soon)
```

Alternatively, you can manually convert original 16-bit weights into 8-bit by referring to the content under the ["8bit Weight Quantization"](#8bit-weight-quantization) section. Save them in the internlm-chat-7b-w8 directory, using the command below:

```shell
lmdeploy lite smooth_quant internlm/internlm-chat-7b --work-dir ./internlm-chat-7b-w8
```

Afterwards, use the following command to interact with the model via the terminal:

```shell
lmdeploy chat torch ./internlm-chat-7b-w8
```

## Launching gradio service

Coming soon...

## Inference Speed

Coming soon...

## 8bit Weight Quantization

Performing 4bit weight quantization involves three steps:

1. **Smooth Weights**: Start by smoothing the weights of the Language Model (LLM). This process makes the weights more amenable to quantizing.
2. **Replace Modules**: Locate DecoderLayers and replace the modules RSMNorm and nn.Linear with QRSMNorm and QLinear modules respectively. These 'Q' modules are available in the lmdeploy/pytorch/models/q_modules.py file.
3. **Save the Quantized Model**: Once you've made the necessary replacements, save the new quantized model.

The script `lmdeploy/lite/apis/smooth_quant.py` accomplishes all three tasks detailed above. For example, you can obtain the model weights of the quantized Internlm-chat-7B model by running the following command:

```shell
lmdeploy lite smooth_quant internlm/internlm-chat-7b --work-dir ./internlm-chat-7b-w8
```

After saving, you can instantiate your quantized model by calling the from_pretrained interface.