# INT4 Weight-only Quantization and Deployment (W4A16)

LMDeploy adopts the [AWQ](https://arxiv.org/abs/2306.00978) algorithm for 4-bit weight-only quantization. Backed by high-performance CUDA kernels, inference with the 4-bit quantized model runs up to 2.4x faster than FP16.

LMDeploy supports the following NVIDIA GPUs for W4A16 inference:

- Turing (sm75): 20 series, T4
- Ampere (sm80, sm86): 30 series, A10, A16, A30, A100
- Ada Lovelace (sm89): 40 series

Before proceeding with quantization and inference, please ensure that lmdeploy is installed.

```shell
pip install lmdeploy[all]
```

This article comprises the following sections:

- [Quantization](#quantization)
- [Evaluation](#evaluation)
- [Inference](#inference)
- [Performance](#performance)
- [Service](#service)

## Quantization

A single command is all it takes to quantize the model. The resulting quantized weights are stored in the $WORK_DIR directory.

```shell
export HF_MODEL=internlm/internlm-chat-7b
export WORK_DIR=internlm/internlm-chat-7b-4bit

lmdeploy lite auto_awq \
  $HF_MODEL \
  --calib-dataset 'ptb' \
  --calib-samples 128 \
  --calib-seqlen 2048 \
  --w-bits 4 \
  --w-group-size 128 \
  --work-dir $WORK_DIR
```

Typically, the above command doesn't require filling in optional parameters, as the defaults usually suffice. For instance, when quantizing the [internlm/internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) model, the command can be condensed as:

```shell
lmdeploy lite auto_awq internlm/internlm-chat-7b --work-dir internlm-chat-7b-4bit
```

```{note}
We recommend that you specify the --work-dir parameter and include the model name in it, as demonstrated in the example above. This helps LMDeploy fuzzy-match the --work-dir with an appropriate built-in chat template. Otherwise, you will have to designate the chat template during inference.
```

Upon completing quantization, you can interact with the model using a variety of handy tools. For example, you can initiate a conversation with it via the command line:

```shell
lmdeploy chat turbomind ./internlm-chat-7b-4bit --model-format awq
```

Alternatively, you can start the gradio server and interact with the model through the web at `http://{ip_addr}:{port}`:

```shell
lmdeploy serve gradio ./internlm-chat-7b-4bit --server_name {ip_addr} --server_port {port} --model-format awq
```

## Evaluation

Please refer to [this guide](https://opencompass.readthedocs.io/en/latest/advanced_guides/evaluation_turbomind.html) for model evaluation with LMDeploy.

## Inference

With the following code, you can perform batched offline inference with the quantized model:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(model_format='awq')
pipe = pipeline("./internlm-chat-7b-4bit", backend_config=engine_config)
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
```

For more information about the pipeline parameters, please refer to [here](../inference/pipeline.md).
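If you need finer control over sampling during offline inference, the pipeline also accepts a generation config. The snippet below is a minimal sketch assuming the `GenerationConfig` class exported by `lmdeploy` and its `gen_config` keyword; please verify the exact parameter names against the pipeline guide linked above.

```python
# A minimal sketch of batched offline inference with explicit sampling
# parameters (assumed interface; check the pipeline guide for your version).
from lmdeploy import GenerationConfig, TurbomindEngineConfig, pipeline

engine_config = TurbomindEngineConfig(model_format='awq')
# Sampling settings passed per request; values are illustrative only.
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=256)

pipe = pipeline("./internlm-chat-7b-4bit", backend_config=engine_config)
response = pipe(["Hi, pls intro yourself", "Shanghai is"],
                gen_config=gen_config)
print(response)
```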
In addition to performing inference with the quantized model on localhost, LMDeploy can also run inference with 4-bit AWQ-quantized models available on the Hugging Face Hub, such as models from the [lmdeploy space](https://huggingface.co/lmdeploy) and the [TheBloke space](https://huggingface.co/TheBloke).

```python
# inference with models from the lmdeploy space
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline("lmdeploy/llama2-chat-70b-4bit",
                backend_config=TurbomindEngineConfig(model_format='awq', tp=4))
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)

# inference with models from the TheBloke space
from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig

pipe = pipeline("TheBloke/LLaMA2-13B-Tiefighter-AWQ",
                backend_config=TurbomindEngineConfig(model_format='awq'),
                chat_template_config=ChatTemplateConfig(model_name='llama2'))
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
```

## Performance

We benchmarked the Llama-2-7B-chat and Llama-2-13B-chat models with 4-bit quantization on an NVIDIA GeForce RTX 4090 using [profile_generation.py](https://github.com/InternLM/lmdeploy/blob/main/benchmark/profile_generation.py). We measured the token generation throughput (tokens/s) with a single prompt token and 512 generated tokens. All results were measured with single-batch inference.

| model            | llm-awq | mlc-llm | turbomind |
| ---------------- | ------- | ------- | --------- |
| Llama-2-7B-chat  | 112.9   | 159.4   | 206.4     |
| Llama-2-13B-chat | N/A     | 90.7    | 115.8     |

## Service

LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup:

```shell
lmdeploy serve api_server internlm/internlm-chat-7b-4bit --backend turbomind --model-format awq
```

The default port of `api_server` is `23333`. After the server is launched, you can communicate with it in the terminal through `api_client`:

```shell
lmdeploy serve api_client http://0.0.0.0:23333
```

You can browse and try out the `api_server` APIs online through the Swagger UI at `http://0.0.0.0:23333`, or read the API specification [here](../serving/api_server.md).
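Because the RESTful APIs are OpenAI-compatible, you can also call the service programmatically. The sketch below assumes the server started above is listening on `0.0.0.0:23333` and that the `/v1/models` and `/v1/chat/completions` routes follow the OpenAI schema; consult the Swagger UI for the authoritative request format.

```python
# A minimal sketch of calling the OpenAI-compatible REST API directly.
# Routes and payload fields are assumed from the OpenAI schema; verify them
# in the Swagger UI at http://0.0.0.0:23333.
import requests

base_url = "http://0.0.0.0:23333"

# Discover the served model name instead of hard-coding it.
model_name = requests.get(f"{base_url}/v1/models").json()["data"][0]["id"]

payload = {
    "model": model_name,
    "messages": [{"role": "user", "content": "Hi, pls intro yourself"}],
    "max_tokens": 128,
}
resp = requests.post(f"{base_url}/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```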