# W4A16 LLM Model Deployment

LMDeploy supports 4-bit weight inference for LLMs. The minimum NVIDIA GPU requirement is sm80, e.g., A10, A100, and the GeForce RTX 30/40 series.
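
If you are unsure whether your GPU meets the sm80 requirement, the following sketch checks the compute capability with PyTorch; it assumes a CUDA-enabled `torch` installation and is not part of LMDeploy itself:

```python
# Minimal sketch: verify that the GPU's compute capability is at least sm80 (8.0).
# Assumes PyTorch is installed with CUDA support.
import torch

assert torch.cuda.is_available(), "No CUDA-capable GPU detected"
major, minor = torch.cuda.get_device_capability(0)
print(f"Detected compute capability: sm{major}{minor}")
assert (major, minor) >= (8, 0), "4-bit weight inference requires sm80 or newer"
```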

Before proceeding with the inference, please ensure that lmdeploy is installed.

```shell
pip install lmdeploy
```

## 4-bit LLM Model Inference

You can download the pre-quantized 4-bit weight models from LMDeploy's [model zoo](https://huggingface.co/lmdeploy) and conduct inference using the following command.

Alternatively, you can quantize 16-bit weights to 4-bit weights following the ["4-bit Weight Quantization"](#4-bit-weight-quantization) section, and then perform inference as per the below instructions.

Take the 4-bit Llama-2-chat-7B model from the model zoo as an example:

```shell
git-lfs install
git clone https://huggingface.co/lmdeploy/llama2-chat-7b-w4
```
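
If `git-lfs` is not an option, the same repository can also be fetched programmatically; below is a minimal sketch using `huggingface_hub` (an extra dependency, not part of the commands above), with an arbitrary local directory name:

```python
# Sketch: download the pre-quantized 4-bit weights without git-lfs.
# Requires `pip install huggingface_hub`.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="lmdeploy/llama2-chat-7b-w4",
    local_dir="./llama2-chat-7b-w4",  # placeholder local path
)
```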

As demonstrated below, first convert the model's layout with `turbomind.deploy`, and then you can interact with the AI assistant in the terminal.

```shell

## Convert the model's layout and store it in the default path, ./workspace.
python3 -m lmdeploy.serve.turbomind.deploy \
    --model-name llama2 \
    --model-path ./llama2-chat-7b-w4 \
    --model-format awq \
    --group-size 128

## inference
python3 -m lmdeploy.turbomind.chat ./workspace
```

## Serve with Gradio

If you wish to interact with the model via a web UI, launch the Gradio server as shown below:

```shell
python3 -m lmdeploy.serve.gradio.app ./workspace --server_name {ip_addr} --server_port {port}
```

You can then open `http://{ip_addr}:{port}` in your browser and chat with the model.
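
Before opening the browser, you can confirm the server is reachable with a quick HTTP check; the sketch below uses `requests` (an extra dependency) and placeholder address values:

```python
# Sketch: ping the Gradio server's root page to confirm it is up.
# Replace ip_addr/port with the values passed to --server_name/--server_port.
import requests

ip_addr, port = "127.0.0.1", 7860  # placeholders
resp = requests.get(f"http://{ip_addr}:{port}", timeout=5)
print("server reachable:", resp.status_code == 200)
```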

## Inference Performance

We benchmarked the Llama-2-7B-chat and Llama-2-13B-chat models with 4-bit quantization on an NVIDIA GeForce RTX 4090 using [profile_generation.py](https://github.com/InternLM/lmdeploy/blob/main/benchmark/profile_generation.py). We measured token generation throughput (tokens/s) with a single prompt token and 512 generated tokens. All results are for single-batch inference.

| model            | llm-awq | mlc-llm | turbomind |
| ---------------- | ------- | ------- | --------- |
| Llama-2-7B-chat  | 112.9   | 159.4   | 206.4     |
| Llama-2-13B-chat | N/A     | 90.7    | 115.8     |

GPU memory (GB) comparison between the 4-bit and 16-bit models, with context sizes of 2048 and 4096 respectively:

| model            | 16bit(2048) | 4bit(2048) | 16bit(4096) | 4bit(4096) |
| ---------------- | ----------- | ---------- | ----------- | ---------- |
| Llama-2-7B-chat  | 15.1        | 6.3        | 16.2        | 7.5        |
| Llama-2-13B-chat | OOM         | 10.3       | OOM         | 12.0       |

Before benchmarking, install `nvidia-ml-py`:

```shell
pip install nvidia-ml-py
```

Then run the benchmark script against the converted model in `./workspace`:

```shell
python benchmark/profile_generation.py \
 --model-path ./workspace \
 --concurrency 1 8 --prompt-tokens 1 512 --completion-tokens 2048 512
```

## 4-bit Weight Quantization

Quantization includes two steps:

- generate the quantization parameters
- quantize the model according to those parameters

### Step 1: Generate Quantization Parameter

```shell
# --calib_dataset: calibration dataset; c4, ptb, wikitext2 and pileval are supported
# --calib_samples: number of calibration samples; reduce it if GPU memory is insufficient
# --calib_seqlen: sequence length of each calibration sample; reduce it if GPU memory is insufficient
# --work_dir: folder storing the PyTorch-format quantization statistics and the post-quantization weights
python3 -m lmdeploy.lite.apis.calibrate \
  --model $HF_MODEL \
  --calib_dataset 'c4' \
  --calib_samples 128 \
  --calib_seqlen 2048 \
  --work_dir $WORK_DIR
```
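
If you want to peek at what the calibration step wrote, the sketch below lists the artifacts in `$WORK_DIR` and tries to load one of them; the file name `inputs_stats.pth` is an assumption on my part, not something specified in this guide:

```python
# Sketch: inspect the calibration output directory.
# "inputs_stats.pth" is an assumed artifact name; list the directory to see
# what your install actually produced.
import os
import torch

work_dir = "./awq-work-dir"  # placeholder for $WORK_DIR
for name in sorted(os.listdir(work_dir)):
    print("artifact:", name)

stats_path = os.path.join(work_dir, "inputs_stats.pth")  # assumed name
if os.path.exists(stats_path):
    stats = torch.load(stats_path, map_location="cpu")
    print("loaded:", type(stats))
```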

### Step 2: Quantize Weights

LMDeploy employs the AWQ algorithm for model weight quantization.

```shell
# --w_bits: bit width for weight quantization
# --w_group_size: group size for weight quantization statistics
# --work_dir: directory holding the quantization parameters generated in Step 1
python3 -m lmdeploy.lite.apis.auto_awq \
  --model $HF_MODEL \
  --w_bits 4 \
  --w_group_size 128 \
  --work_dir $WORK_DIR
```

After quantization completes, the quantized model is saved to `$WORK_DIR`. You can then run inference following the instructions in the ["4-bit LLM Model Inference"](#4-bit-llm-model-inference) section.
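
For convenience, the two quantization steps and the TurboMind layout conversion can be chained in a single script. The sketch below simply wraps the commands already shown in this guide; the paths are placeholders:

```python
# Sketch: run calibration, AWQ quantization, and TurboMind layout conversion
# end to end. Each command mirrors one documented above; paths are placeholders.
import subprocess

HF_MODEL = "./llama2-chat-7b-hf"       # placeholder: 16-bit Hugging Face checkpoint
WORK_DIR = "./llama2-7b-w4-workdir"    # placeholder: output dir for quantized weights

def run(args):
    print("+", " ".join(args))
    subprocess.run(args, check=True)

# Step 1: generate quantization parameters
run(["python3", "-m", "lmdeploy.lite.apis.calibrate",
     "--model", HF_MODEL,
     "--calib_dataset", "c4",
     "--calib_samples", "128",
     "--calib_seqlen", "2048",
     "--work_dir", WORK_DIR])

# Step 2: quantize weights with AWQ
run(["python3", "-m", "lmdeploy.lite.apis.auto_awq",
     "--model", HF_MODEL,
     "--w_bits", "4",
     "--w_group_size", "128",
     "--work_dir", WORK_DIR])

# Convert the quantized model's layout for TurboMind (see the inference section)
run(["python3", "-m", "lmdeploy.serve.turbomind.deploy",
     "--model-name", "llama2",
     "--model-path", WORK_DIR,
     "--model-format", "awq",
     "--group-size", "128"])

print("Done. Chat with: python3 -m lmdeploy.turbomind.chat ./workspace")
```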