Commit 6a583c2f authored by chenych

update dtk to 24.04.1 and modify README

parent 7d576a9a
......@@ -39,6 +39,8 @@
## 0. Latest News 🎉🎉
* **[2024-06-28]** **Yuan2.0-M32-HF-INT8 released, high performance, lossless generation accuracy** 🎗️🎗️🎗️
* **[2024-06-18]** **Yuan2.0-M32-HF-INT4 released** 🎗️🎗️
* **[2024-05-28]** **Yuan2.0-M32 released**
......@@ -77,9 +79,11 @@ Fig.1: Yuan 2.0-M32 Architecture
| Yuan2.0-M32-HF | 16K | HuggingFace | [ModelScope](https://modelscope.cn/models/YuanLLM/Yuan2-M32-hf) \| [HuggingFace](https://huggingface.co/IEITYuan/Yuan2-M32-hf) \| [Netdisk](https://pan.baidu.com/s/1FrbVKji7IrhpwABYSIsV-A?pwd=q6uh) \| [Wisemodel](https://www.wisemodel.cn/models/IEIT-Yuan/Yuan2-M32-hf)
| Yuan2.0-M32-GGUF | 16K | GGUF | [ModelScope](https://modelscope.cn/models/YuanLLM/Yuan2-M32-gguf/summary) \| [HuggingFace](https://huggingface.co/IEITYuan/Yuan2-M32-gguf) \| [Netdisk](https://pan.baidu.com/s/1BWQaz-jeZ1Fe69CqYtjS9A?pwd=f4qc) \| [Wisemodel](https://www.wisemodel.cn/models/IEIT-Yuan/Yuan2-M32-gguf)
| Yuan2.0-M32-GGUF-INT4 | 16K | GGUF | [ModelScope](https://modelscope.cn/models/YuanLLM/Yuan2-M32-gguf-int4/summary) \| [HuggingFace](https://huggingface.co/IEITYuan/Yuan2-M32-gguf-int4) \| [Netdisk](https://pan.baidu.com/s/1FM8xPpkhOrRcAfe7-zUgWQ?pwd=e6ag) \| [Wisemodel](https://www.wisemodel.cn/models/IEIT-Yuan/Yuan2-M32-gguf-int4)
| Yuan2.0-M32-HF-INT4 | 16K | HuggingFace | [ModelScope](https://modelscope.cn/models/YuanLLM/Yuan2-M32-HF-INT4/summary) \| [HuggingFace](https://huggingface.co/IEITYuan/Yuan2-M32-hf-int4) \| [Netdisk](https://pan.baidu.com/s/1zacOAxCne9U99LdgMbjfFQ?pwd=kkww) \| [Wisemodel](https://www.wisemodel.cn/models/IEIT-Yuan/Yuan2-M32-hf-int4/)
| Yuan2.0-M32-HF-INT8 | 16K | HuggingFace | [ModelScope](https://modelscope.cn/models/YuanLLM/Yuan2-M32-hf-int8/) \| [HuggingFace](https://huggingface.co/IEITYuan/Yuan2-M32-hf-int8/) \| [Netdisk](https://pan.baidu.com/s/1hq9l6eYY_cRuBlQMRV6Lcg?pwd=b56k) \| [Wisemodel](https://www.wisemodel.cn/models/IEIT-Yuan/Yuan2-M32-hf-int8/)
\* __*Yuan2.0-M32-HF-INT4*__: For the quantization and inference method, refer to the [guide](https://github.com/IEIT-Yuan/Yuan2.0-M32/blob/main/docs/README_GPTQ_EN.md).
## 3. Evaluation
......@@ -152,9 +156,8 @@ We've provided several scripts for pretraining in the [`example`](./examples). T
**4.4 Inference Service**
For a detailed deployment plan, please refer to [vllm](https://github.com/IEIT-Yuan/Yuan2.0-M32/blob/main/vllm/README_Yuan_vllm.md).
- For a detailed deployment plan, please refer to [vllm](https://github.com/IEIT-Yuan/Yuan2.0-M32/blob/main/vllm/README_Yuan_vllm.md).
- Inference with Yuan2.0-M32-HF-INT4, please refer to [The Method of Quantization and Inference for Yuan2.0-M32](https://github.com/IEIT-Yuan/Yuan2.0-M32/blob/main/docs/README_GPTQ_EN.md).
## 5. Statement of Agreement
......
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-centos7.6-dtk24.04-py310
\ No newline at end of file
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
\ No newline at end of file
<h1 align="center">Yuan2-M32基于AutoGPTQ的量化和推理</h1>
<div align="center">
<a href="code_license">
<img alt="Code License" src="https://img.shields.io/badge/Apache%202.0%20-green?style=flat&label=Code%20License&link=https%3A%2F%2Fgithub.com%2FIEIT-Yuan%2FYuan-2.0-MoE%3Ftab%3DApache-2.0-1-ov-file"/>
</a>
<a href="model_license">
<img alt="Model License" src="https://img.shields.io/badge/Yuan2.0%20License-blue?style=flat&logoColor=blue&label=Model%20License&color=blue&link=https%3A%2F%2Fgithub.com%2FIEIT-Yuan%2FYuan-2.0%2Fblob%2Fmain%2FLICENSE-Yuan" />
</a>
</div>
## 0. Model Downloads
**We provide multiple download sources for the models:**
| Model | Sequence Length | Type | Download |
| :----------: | :------: | :-------: |:---------------------------: |
| Yuan2.0-M32-HF-INT4 | 16K | HuggingFace | [ModelScope](https://modelscope.cn/models/YuanLLM/Yuan2-M32-HF-INT4/summary) \| [HuggingFace](https://huggingface.co/IEITYuan/Yuan2-M32-hf-int4) \| [Netdisk](https://pan.baidu.com/s/1zacOAxCne9U99LdgMbjfFQ?pwd=kkww) \| [Wisemodel](https://www.wisemodel.cn/models/IEIT-Yuan/Yuan2-M32-hf-int4/)
| Yuan2.0-M32-HF-INT8 | 16K | HuggingFace | [ModelScope](https://modelscope.cn/models/YuanLLM/Yuan2-M32-hf-int8/) \| [HuggingFace](https://huggingface.co/IEITYuan/Yuan2-M32-hf-int8/) \| [Netdisk](https://pan.baidu.com/s/1hq9l6eYY_cRuBlQMRV6Lcg?pwd=b56k) \| [Wisemodel](https://www.wisemodel.cn/models/IEIT-Yuan/Yuan2-M32-hf-int8/)
## 1. AutoGPTQ Environment Setup
- **Environment requirements:** CUDA version higher than 11.8
- **Container:** Create a container from the image provided by the [vllm](https://github.com/IEI-mjx/Yuan2.0-M32/blob/main/vllm/README_Yuan_vllm.md) guide
```shell
# enter the container
docker exec -it vllm_yuan bash
# go to your working directory
cd /mnt
# clone the project
git clone https://github.com/IEIT-Yuan/Yuan2.0-M32.git
# enter the AutoGPTQ project
cd Yuan2.0-M32/3rd_party/AutoGPTQ
# install auto-gptq
pip install auto-gptq --no-build-isolation
```
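Before quantizing, it may help to confirm that the container meets the CUDA requirement and that auto-gptq imports cleanly. A minimal sanity-check sketch (assuming the packages expose the usual version attributes):
```python
import torch
import auto_gptq

# the CUDA toolkit torch was built against should be newer than 11.8
print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available(), "| GPUs:", torch.cuda.device_count())
print("auto-gptq:", auto_gptq.__version__)
```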
## 2. Quantize the Yuan2-M32-HF Model
Quantizing the Yuan2-M32 model involves three steps: 1. download the Yuan2-M32-HF model; 2. download the calibration dataset; 3. set the quantization parameters and quantize the Yuan2-M32-HF model.
- **Step 1:** Download the Yuan2-M32 Hugging Face model and move it to the specified path (/mnt/beegfs2/Yuan2-M32-HF); refer to [vllm](https://github.com/IEI-mjx/Yuan2.0-M32/blob/main/vllm/README_Yuan_vllm.md). Model download address: https://huggingface.co/IEIT-Yuan/Yuan2-M32-hf
- **Step 2:** Download the dataset [here](https://huggingface.co/datasets/hakurei/open-instruct-v1), then move it to the specified path, e.g. /mnt/beegfs2/
- **Step 3:** Adjust the quantization parameters as shown in the following script and run the quantization.
```shell
# edit Yuan2-M32-int4.py
cd /mnt/beegfs2/Yuan2.0-M32/3rd_party/AutoGPTQ
vim Yuan2-M32-int4.py
'''
pretrained_model_dir = "/mnt/beegfs2/Yuan2-M32-HF"
quantized_model_dir = "/mnt/beegfs2/Yuan2-M32-GPTQ-int4"
tokenizer = LlamaTokenizer.from_pretrained("/mnt/beegfs2/Yuan2-M32-HF", add_eos_token=False, add_bos_token=False, eos_token='<eod>', use_fast=True)
examples = []
with open("/mnt/beegfs2/instruct_data.json", 'r', encoding='utf-8') as file:  # path of the dataset
    data = json.load(file)
for i, item in enumerate(data):
    if i >= 2000:
        break
    instruction = item.get('instruction', '')
    output = item.get('output', '')
    combined_text = instruction + " " + output
    examples.append(tokenizer(combined_text))
max_memory = {0: "80GIB", 1: "80GIB", 2: "80GIB", 3: "80GIB", 4: "80GIB", 5: "80GIB", 6: "80GIB", 7: "80GIB"}
quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # setting this to False can significantly speed up inference, but perplexity may be slightly worse
)
'''
# 1. Modify pretrained_model_dir and specify quantized_model_dir for the quantized model.
# 2. Modify the dataset path.
# 3. max_memory specifies which GPUs to use.
# 4. Adjust the quantization parameters: set bits=4 for int4 or bits=8 for int8; other parameters can keep the default values.
# run the script
python Yuan2-M32-int4.py
# The model quantization and packing process takes approximately 8 hours; you can quantize to int4 and int8 on different GPUs in parallel.
```
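The snippet above stops at the quantize configuration. A minimal sketch of the remaining calls in Yuan2-M32-int4.py, assuming the standard AutoGPTQ API (the actual script may differ slightly):
```python
from auto_gptq import AutoGPTQForCausalLM

# load the full-precision model together with the quantize_config defined above
model = AutoGPTQForCausalLM.from_pretrained(
    pretrained_model_dir,
    quantize_config,
    max_memory=max_memory,
    trust_remote_code=True,
)

# run GPTQ calibration on the tokenized examples, then pack the quantized weights
model.quantize(examples)

# write the .safetensors checkpoint and quantize_config.json to quantized_model_dir
model.save_quantized(quantized_model_dir, use_safetensors=True)
```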
## 3. Inference with the GPTQ-Quantized Model
Once quantization is complete, the target folder will contain checkpoint files with the '.safetensors' suffix, along with config.json and quantize_config.json. You first need to copy the tokenizer-related files from the Yuan2-M32-HF path.
```shell
# enter the Yuan2-M32-HF path
cd /mnt/beegfs2/Yuan2-M32-HF
# copy the tokenizer-related files to Yuan2-M32-GPTQ-int4
cp special_tokens_map.json tokenizer* /mnt/beegfs2/Yuan2-M32-GPTQ-int4
# edit inference.py
cd /mnt/beegfs2/Yuan2.0-M32/3rd_party/AutoGPTQ
vim inference.py
'''
quantized_model_dir = "/mnt/beegfs2/Yuan2-M32-GPTQ-int4"
tokenizer = LlamaTokenizer.from_pretrained('/mnt/beegfs2/Yuan2-M32-GPTQ-int4', add_eos_token=False, add_bos_token=False, eos_token='<eod>')
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", trust_remote_code=True)
'''
# modify quantized_model_dir and the tokenizer path
# run inference.py
python inference.py
```
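inference.py loads the tokenizer and quantized model as shown above. A minimal sketch of running one prompt through it (the prompt and generation settings here are illustrative, not the script's actual contents):
```python
# illustrative prompt; the exact prompt template expected by Yuan2-M32 may differ
prompt = "Write a short poem about spring."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

# greedy decoding on the GPTQ-quantized model
output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```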
## 4. Evaluation
> HumanEval generation parameters:
> generation_params = {
>     "max_new_tokens": 512,
>     "top_k": 1,
>     "top_p": 0,
>     "temperature": 1.0,
> }
> Yuan2-M32-HF inference uses two 80GB GPUs; Yuan2-M32-GPTQ-int4 and Yuan2-M32-GPTQ-int8 each run on a single 80GB GPU.
> Results:
| Model | Precision | HumanEval | Inference Speed | Inference Memory Usage |
|---------------------|-----------|-----------|-----------------|------------------------|
| Yuan2-M32-HF | BF16 | 73.17% | 13.16 tokens/s | 76.34 GB |
| Yuan2-M32-GPTQ-int8 | INT8 | 72.56% | 9.05 tokens/s | 39.81 GB |
| Yuan2-M32-GPTQ-int4 | INT4 | 66.46% | 9.24 tokens/s | 23.27 GB |
<h1 align="center">The Method of Quantization and Inference for Yuan2.0-M32</h1>
<div align="center">
<a href="code_license">
<img alt="Code License" src="https://img.shields.io/badge/Apache%202.0%20-green?style=flat&label=Code%20License&link=https%3A%2F%2Fgithub.com%2FIEIT-Yuan%2FYuan-2.0-MoE%3Ftab%3DApache-2.0-1-ov-file"/>
</a>
<a href="model_license">
<img alt="Model License" src="https://img.shields.io/badge/Yuan2.0%20License-blue?style=flat&logoColor=blue&label=Model%20License&color=blue&link=https%3A%2F%2Fgithub.com%2FIEIT-Yuan%2FYuan-2.0%2Fblob%2Fmain%2FLICENSE-Yuan" />
</a>
</div>
## 0. Model Downloads
| Model | Sequence Length | Type | Download |
| :----------: | :------: | :-------: |:---------------------------: |
| Yuan2.0-M32-HF-INT4 | 16K | HuggingFace | [ModelScope](https://modelscope.cn/models/YuanLLM/Yuan2-M32-HF-INT4/summary) \| [HuggingFace](https://huggingface.co/IEITYuan/Yuan2-M32-hf-int4) \| [Netdisk](https://pan.baidu.com/s/1zacOAxCne9U99LdgMbjfFQ?pwd=kkww) \| [Wisemodel](https://www.wisemodel.cn/models/IEIT-Yuan/Yuan2-M32-hf-int4/)
| Yuan2.0-M32-HF-INT8 | 16K | HuggingFace | [ModelScope](https://modelscope.cn/models/YuanLLM/Yuan2-M32-hf-int8/) \| [HuggingFace](https://huggingface.co/IEITYuan/Yuan2-M32-hf-int8/) \| [Netdisk](https://pan.baidu.com/s/1hq9l6eYY_cRuBlQMRV6Lcg?pwd=b56k) \| [Wisemodel](https://www.wisemodel.cn/models/IEIT-Yuan/Yuan2-M32-hf-int8/)
## 1. Environment of AutoGPTQ
- **Environment requirements:** CUDA version > 11.8
- **Container:** Create a container using the image provided by the [vllm](https://github.com/IEI-mjx/Yuan2.0-M32/blob/main/vllm/README_Yuan_vllm.md) guide
```shell
# enter the docker container
docker exec -it vllm_yuan bash
# go to your working directory
cd /mnt
# clone the project
git clone https://github.com/IEIT-Yuan/Yuan2.0-M32.git
# enter the AutoGPTQ project
cd Yuan2.0-M32/3rd_party/AutoGPTQ
# install auto-gptq
pip install auto-gptq --no-build-isolation
```
## 2. Quantize the Yuan2.0-M32-HF Model
**The steps for quantizing the Yuan2.0-M32 model:**
- **Step 1:** Download the [Yuan2.0-M32-HF](https://github.com/IEIT-Yuan/Yuan2.0-M32?tab=readme-ov-file#2-model-downloads) model and move it to the specified path (/mnt/beegfs2/Yuan2-M32-HF); refer to [vllm](https://github.com/IEI-mjx/Yuan2.0-M32/blob/main/vllm/README_Yuan_vllm.md)
- **Step 2:** Download the [dataset](https://huggingface.co/datasets/hakurei/open-instruct-v1), then move it to the specified path (/mnt/beegfs2/)
- **Step 3:** Adjust the parameters in the following script and run the quantization.
```shell
# edit Yuan2-M32-int4.py
cd /mnt/beegfs2/Yuan2.0-M32/3rd_party/AutoGPTQ
vim Yuan2-M32-int4.py
'''
pretrained_model_dir = "/mnt/beegfs2/Yuan2-M32-HF"
quantized_model_dir = "/mnt/beegfs2/Yuan2-M32-GPTQ-int4"
tokenizer = LlamaTokenizer.from_pretrained("/mnt/beegfs2/Yuan2-M32-HF", add_eos_token=False, add_bos_token=False, eos_token='<eod>', use_fast=True)
examples = []
with open("/mnt/beegfs2/instruct_data.json", 'r', encoding='utf-8') as file:  # path of the dataset
    data = json.load(file)
for i, item in enumerate(data):
    if i >= 2000:
        break
    instruction = item.get('instruction', '')
    output = item.get('output', '')
    combined_text = instruction + " " + output
    examples.append(tokenizer(combined_text))
max_memory = {0: "80GIB", 1: "80GIB", 2: "80GIB", 3: "80GIB", 4: "80GIB", 5: "80GIB", 6: "80GIB", 7: "80GIB"}
quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # setting this to False can significantly speed up inference, but perplexity may be slightly worse
)
'''
# 1. Modify pretrained_model_dir and specify quantized_model_dir for the quantized model.
# 2. Modify the path of the dataset.
# 3. max_memory specifies which GPUs to use.
# 4. Adjust the quantization parameters: set bits=4 for int4 or bits=8 for int8; other parameters can keep the default values.
# run the script
python Yuan2-M32-int4.py
# The model quantization and packing process takes approximately 8 hours.
# You can quantize to int4 and int8 on different GPUs in parallel (see the sketch after this block).
```
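To quantize int4 and int8 in parallel as suggested above, one option is to pin each run to its own GPUs. A hypothetical launcher sketch (it assumes a Yuan2-M32-int8.py copy of the script with bits=8 and its own quantized_model_dir, and that each script's max_memory keys are adjusted to the devices it can see):
```python
import os
import subprocess

# map each quantization script to a disjoint set of GPUs
jobs = {
    "Yuan2-M32-int4.py": "0,1,2,3",
    "Yuan2-M32-int8.py": "4,5,6,7",
}

procs = []
for script, gpus in jobs.items():
    # restrict each process to its own GPUs; inside the process they are renumbered from 0
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpus)
    procs.append(subprocess.Popen(["python", script], env=env))

for p in procs:
    p.wait()
```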
## 3. Inference with the Quantized Model
Once quantization is complete, the target folder will contain checkpoint files with the '.safetensors' suffix, along with config.json and quantize_config.json. You first need to copy the tokenizer-related files from the Yuan2-M32-HF path.
```shell
# enter the Yuan2-M32-HF path
cd /mnt/beegfs2/Yuan2-M32-HF
# copy the tokenizer-related files to Yuan2-M32-GPTQ-int4
cp special_tokens_map.json tokenizer* /mnt/beegfs2/Yuan2-M32-GPTQ-int4
# edit inference.py
cd /mnt/beegfs2/Yuan2.0-M32/3rd_party/AutoGPTQ
vim inference.py
'''
quantized_model_dir = "/mnt/beegfs2/Yuan2-M32-GPTQ-int4"
tokenizer = LlamaTokenizer.from_pretrained('/mnt/beegfs2/Yuan2-M32-GPTQ-int4', add_eos_token=False, add_bos_token=False, eos_token='<eod>')
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", trust_remote_code=True)
'''
# edit paths of quantized_model_dir and tokenizer
# run inference.py
python inference.py
```
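Before running inference.py, it can be worth confirming that the quantized folder contains everything described above (the GPTQ checkpoint, its configs, and the copied tokenizer files). A small illustrative check:
```python
import os

quantized_model_dir = "/mnt/beegfs2/Yuan2-M32-GPTQ-int4"
expected = ["config.json", "quantize_config.json", "special_tokens_map.json"]

files = os.listdir(quantized_model_dir)
missing = [name for name in expected if name not in files]
has_ckpt = any(name.endswith(".safetensors") for name in files)
has_tokenizer = any(name.startswith("tokenizer") for name in files)

print("missing:", missing, "| checkpoint:", has_ckpt, "| tokenizer files:", has_tokenizer)
```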
## 4. Evaluation
> HumanEval generation parameters:
> generation_params = {
>     "max_new_tokens": 512,
>     "top_k": 1,
>     "top_p": 0,
>     "temperature": 1.0,
> }
> Yuan2-M32-HF inference uses two 80GB GPUs; Yuan2-M32-GPTQ-int4 and Yuan2-M32-GPTQ-int8 each run on a single 80GB GPU.
> Results:
| Model | Precision | HumanEval | Inference Speed | Inference Memory Usage |
|---------------------|-----------|-----------|-----------------|------------------------|
| Yuan2-M32-HF | BF16 | 73.17% | 13.16 tokens/s | 76.34 GB |
| Yuan2-M32-GPTQ-int8 | INT8 | 72.56% | 9.05 tokens/s | 39.81 GB |
| Yuan2-M32-GPTQ-int4 | INT4 | 66.46% | 9.24 tokens/s | 23.27 GB |
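For reference, top_k=1 with top_p=0 amounts to greedy decoding, so the generation_params above correspond to a greedy generate() call on the model and tokenizer loaded in section 3. A minimal sketch (problem_prompt is a placeholder; the actual HumanEval harness is not included here):
```python
# illustrative HumanEval-style prompt
problem_prompt = 'def fib(n):\n    """Return the n-th Fibonacci number."""\n'

inputs = tokenizer(problem_prompt, return_tensors="pt").to("cuda:0")
# top_k=1 / top_p=0 is equivalent to greedy decoding
output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# keep only the newly generated completion
completion = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(completion)
```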
......@@ -14,6 +14,7 @@ The main variables in the code should be set as follows:
| `--output_path` | The path where to store the preprocessed dataset, one .idx file and one .bin file would be created for each dataset. |
## Dataset
Samples in the dataset should be separated by '\n', and within each sample any '\n' should be replaced with '\<n>', so that each line in the dataset is a single sample. The program replaces '\<n>' back with '\n' during preprocessing, as shown in the sketch below.
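A minimal sketch (file name and samples are illustrative) of writing raw samples in this one-sample-per-line format and reversing the substitution:
```python
# raw samples may contain internal newlines
raw_samples = [
    "Question: What is 2 + 2?\nAnswer: 4",
    "Summarize the text.\nIt was a bright cold day in April.",
]

# replace internal '\n' with '<n>' so each sample occupies exactly one line
with open("dataset.txt", "w", encoding="utf-8") as f:
    for sample in raw_samples:
        f.write(sample.replace("\n", "<n>") + "\n")

# preprocessing reverses the substitution when reading the file back
restored = [line.rstrip("\n").replace("<n>", "\n")
            for line in open("dataset.txt", encoding="utf-8")]
assert restored == raw_samples
```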
......
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: prometheus_client in /usr/local/lib/python3.10/dist-packages (0.17.0)
Collecting to
Downloading to-0.3.tar.gz (26 kB)
from vllm import LLM, SamplingParams
# Sample prompts.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="facebook/opt-125m")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")