<h1 align="center">Yuan2-M32基于AutoGPTQ的量化和推理</h1>

<div align="center">

    
  <a href="code_license">
    <img alt="Code License" src="https://img.shields.io/badge/Apache%202.0%20-green?style=flat&label=Code%20License&link=https%3A%2F%2Fgithub.com%2FIEIT-Yuan%2FYuan-2.0-MoE%3Ftab%3DApache-2.0-1-ov-file"/>
  </a>
  <a href="model_license">
    <img alt="Model License" src="https://img.shields.io/badge/Yuan2.0%20License-blue?style=flat&logoColor=blue&label=Model%20License&color=blue&link=https%3A%2F%2Fgithub.com%2FIEIT-Yuan%2FYuan-2.0%2Fblob%2Fmain%2FLICENSE-Yuan" />
  </a>

</div>



## 0. Model Download

**We provide multiple download sources for the models:**


|    Model     | Sequence Length  |   Model Format   |         Download Links         |
| :----------: | :------: | :-------: |:---------------------------: |
| Yuan2.0-M32-HF-INT4 |    16K    |  HuggingFace    |    [ModelScope](https://modelscope.cn/models/YuanLLM/Yuan2-M32-HF-INT4/summary)  \| [HuggingFace](https://huggingface.co/IEITYuan/Yuan2-M32-hf-int4) \| [Baidu Netdisk](https://pan.baidu.com/s/1zacOAxCne9U99LdgMbjfFQ?pwd=kkww)  \| [WiseModel](https://www.wisemodel.cn/models/IEIT-Yuan/Yuan2-M32-hf-int4/) |
| Yuan2.0-M32-HF-INT8 |    16K    |  HuggingFace    |    [ModelScope](https://modelscope.cn/models/YuanLLM/Yuan2-M32-hf-int8/)  \| [HuggingFace](https://huggingface.co/IEITYuan/Yuan2-M32-hf-int8/) \| [Baidu Netdisk](https://pan.baidu.com/s/1hq9l6eYY_cRuBlQMRV6Lcg?pwd=b56k) \| [WiseModel](https://www.wisemodel.cn/models/IEIT-Yuan/Yuan2-M32-hf-int8/) |
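
As an alternative to the links above, the pre-quantized checkpoints can also be fetched programmatically. Below is a minimal sketch using `huggingface_hub`; the `local_dir` is only an example path, and any of the mirrors above works just as well:

```python
# Minimal download sketch; the local_dir below is only an example path.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="IEITYuan/Yuan2-M32-hf-int4",       # repo id from the table above
    local_dir="/mnt/beegfs2/Yuan2-M32-hf-int4",  # hypothetical target directory
)
```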


## 1. Set Up the AutoGPTQ Environment
- AutoGPTQ environment requirement: CUDA version higher than 11.8
- Container: create a container from the image provided by the [vllm](https://github.com/IEI-mjx/Yuan2.0-M32/blob/main/vllm/README_Yuan_vllm.md) project
```shell
# Enter the container
docker exec -it vllm_yuan bash

# Go to your working directory
cd /mnt

# Clone our project
git clone https://github.com/IEIT-Yuan/Yuan2.0-M32.git

# Enter the AutoGPTQ project
cd Yuan2.0-M32/3rd_party/AutoGPTQ

# Install auto-gptq
pip install auto-gptq --no-build-isolation
```
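
Before running the quantization script, it can be worth confirming that the environment meets the requirements above. The check below is a small sketch and is not part of the Yuan2.0-M32 repository:

```python
# Hypothetical environment check; not part of the official scripts.
from importlib.metadata import version

import torch
import auto_gptq  # fails with ImportError if the installation above did not succeed

assert torch.cuda.is_available(), "no CUDA device visible inside the container"
print("CUDA runtime version:", torch.version.cuda)   # should be higher than 11.8
print("visible GPUs:", torch.cuda.device_count())
print("auto-gptq version:", version("auto-gptq"))
```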

## 2. Quantize the Yuan2-M32-HF Model

Quantizing the Yuan2-M32 model involves three steps: 1. download the Yuan2-M32-HF model; 2. download the calibration dataset; 3. set the quantization parameters and quantize the Yuan2-M32-HF model.
- 1. Download the Yuan2-M32 Hugging Face model and move it to the chosen path (/mnt/beegfs2/Yuan2-M32-HF); see [vllm](https://github.com/IEI-mjx/Yuan2.0-M32/blob/main/vllm/README_Yuan_vllm.md) for reference. Model download address: https://huggingface.co/IEIT-Yuan/Yuan2-M32-hf
- 2. Download the dataset from [here](https://huggingface.co/datasets/hakurei/open-instruct-v1) and move it to the chosen path, e.g. /mnt/beegfs2/
- 3. Adjust the quantization parameters and run the quantization following the steps below
```shell
# Edit Yuan2-M32-int4.py
cd /mnt/beegfs2/Yuan2.0-M32/3rd_party/AutoGPTQ
vim Yuan2-M32-int4.py

'''
pretrained_model_dir = "/mnt/beegfs2/Yuan2-M32-HF"
quantized_model_dir = "/mnt/beegfs2/Yuan2-M32-GPTQ-int4"

tokenizer = LlamaTokenizer.from_pretrained("/mnt/beegfs2/Yuan2-M32-HF", add_eos_token=False, add_bos_token=False, eos_token='<eod>', use_fast=True)

examples = []
with open("/mnt/beegfs2/instruct_data.json", 'r', encoding='utf-8') as file: # dataset path
    data = json.load(file)

for i, item in enumerate(data):
    if i >= 2000:
        break
    instruction = item.get('instruction', '')
    output = item.get('output', '')
    combined_text = instruction + " " + output
    examples.append(tokenizer(combined_text))

max_memory = {0: "80GIB", 1: "80GIB", 2: "80GIB", 3: "80GIB", 4: "80GIB", 5: "80GIB", 6: "80GIB", 7: "80GIB"}
quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # False can significantly speed up inference, but perplexity may be slightly worse
)
'''
# 1. Update pretrained_model_dir and set quantized_model_dir for the quantized output
# 2. Update the dataset path
# 3. max_memory specifies which GPUs to use
# 4. Adjust the quantization parameters: set bits=4 for int4 precision or bits=8 for int8 precision; the other parameters can keep their default values

# Run the script
python Yuan2-M32-int4.py

# Quantization and packing take about 8 hours; int4 and int8 can be quantized simultaneously on different GPUs
```
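
For orientation, after the excerpt shown above the script follows the standard AutoGPTQ flow: load the full-precision model, run GPTQ calibration on the tokenized examples, then pack and save the quantized weights. The sketch below reuses the names from the excerpt and is only illustrative; the shipped Yuan2-M32-int4.py may differ in detail:

```python
# Illustrative remainder of the quantization flow (standard AutoGPTQ API);
# pretrained_model_dir, quantized_model_dir, quantize_config, max_memory and
# examples are the variables defined in the excerpt above.
from auto_gptq import AutoGPTQForCausalLM

# Load the full-precision model, sharded across the GPUs listed in max_memory.
model = AutoGPTQForCausalLM.from_pretrained(
    pretrained_model_dir,
    quantize_config,
    max_memory=max_memory,
    trust_remote_code=True,
)

# Run GPTQ calibration on the tokenized examples, then pack and save the weights.
model.quantize(examples)
model.save_quantized(quantized_model_dir, use_safetensors=True)
```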


## 3. Inference with the GPTQ-Quantized Model
After quantization finishes, the target directory contains a checkpoint file with the .safetensors suffix along with config.json and quantize_config.json. Before running inference, the tokenizer-related files must be copied from the Yuan2-M32-HF directory.
```shell
# Go to the Yuan2-M32-HF directory
cd /mnt/beegfs2/Yuan2-M32-HF

# Copy the tokenizer-related files to Yuan2-M32-GPTQ-int4
cp special_tokens_map.json tokenizer* /mnt/beegfs2/Yuan2-M32-GPTQ-int4

# Edit inference.py
cd /mnt/beegfs2/Yuan2.0-M32/3rd_party/AutoGPTQ
vim inference.py

'''
quantized_model_dir = "/mnt/beegfs2/Yuan2-M32-GPTQ-int4"

tokenizer = LlamaTokenizer.from_pretrained('/mnt/beegfs2/Yuan2-M32-GPTQ-int4', add_eos_token=False, add_bos_token=False, eos_token='<eod>')

model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", trust_remote_code=True)
'''
# Update quantized_model_dir and the tokenizer path

# Run inference.py
python inference.py
```
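
After loading the quantized model, inference.py generates a completion for a prompt. A minimal sketch of that step, assuming the `generate()` interface that AutoGPTQ forwards to the underlying Hugging Face model (the prompt and generation settings here are illustrative):

```python
# Illustrative generation step, reusing the tokenizer and model from the excerpt above.
prompt = "Write a short poem about spring<sep>"  # hypothetical prompt; Yuan2 prompts typically end with <sep>
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=False,  # greedy decoding
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```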

## 4. Inference Accuracy Test
> HumanEval test parameters:
> generation_params = {
>     "max_new_tokens": 512,
>     "top_k": 1,
>     "top_p": 0,
>     "temperature": 1.0,
> }

> Yuan2-M32-HF inference uses two 80 GB GPUs; Yuan2-M32-GPTQ-int4 and Yuan2-M32-GPTQ-int8 inference each use a single 80 GB GPU

> Test results:

| Model               | Accuracy Type |  HumanEval | Inference Speed |  Inference Memory Usage |
|---------------------|---------------|------------|-----------------|-------------------------|
| Yuan2-M32-HF        | BF16          |  73.17%    | 13.16 token/s   |76.34 GB                 |
| Yuan2-M32-GPTQ-int8 | INT8          |  72.56%    |  9.05 token/s   |39.81 GB                 |
| Yuan2-M32-GPTQ-int4 | INT4          |  66.46%    |  9.24 token/s   |23.27 GB                 |
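
The figures above come from the authors' measurements on the HumanEval setup described earlier. For reference, the sketch below shows one way to measure tokens per second and peak GPU memory for a single generation with the quantized model from section 3; it is illustrative only and is not the official evaluation harness:

```python
# Illustrative throughput and memory measurement; reuses model/tokenizer from section 3.
import time
import torch

prompt = "def fibonacci(n):"  # hypothetical HumanEval-style prompt
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

torch.cuda.reset_peak_memory_stats()
start = time.time()
# Greedy decoding, matching top_k=1 in the parameters above.
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
elapsed = time.time() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.2f} token/s")
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
```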