Commit 6a583c2f authored by chenych

update dtk to 24.04.1 and modify README

parent 7d576a9a
......@@ -39,6 +39,8 @@
## 0. Latest News 🎉🎉
* **[2024-06-28]** **Yuan2.0-M32-HF-INT8 released, high performance, lossless generation accuracy** 🎗️🎗️🎗️
* **[2024-06-18]** **Yuan2.0-M32-HF-INT4 released** 🎗️🎗️
* **[2024-05-28]** **Yuan2.0-M32 released**
......@@ -77,9 +79,11 @@ Fig.1: Yuan 2.0-M32 Architecture
| Yuan2.0-M32-HF | 16K | HuggingFace | [ModelScope](https://modelscope.cn/models/YuanLLM/Yuan2-M32-hf) \| [HuggingFace](https://huggingface.co/IEITYuan/Yuan2-M32-hf) \| [Netdisk](https://pan.baidu.com/s/1FrbVKji7IrhpwABYSIsV-A?pwd=q6uh) \| [Wisemodel](https://www.wisemodel.cn/models/IEIT-Yuan/Yuan2-M32-hf)
| Yuan2.0-M32-GGUF | 16K | GGUF | [ModelScope](https://modelscope.cn/models/YuanLLM/Yuan2-M32-gguf/summary) \| [HuggingFace](https://huggingface.co/IEITYuan/Yuan2-M32-gguf) \| [Netdisk](https://pan.baidu.com/s/1BWQaz-jeZ1Fe69CqYtjS9A?pwd=f4qc) \| [Wisemodel](https://www.wisemodel.cn/models/IEIT-Yuan/Yuan2-M32-gguf)
| Yuan2.0-M32-GGUF-INT4 | 16K | GGUF | [ModelScope](https://modelscope.cn/models/YuanLLM/Yuan2-M32-gguf-int4/summary) \| [HuggingFace](https://huggingface.co/IEITYuan/Yuan2-M32-gguf-int4) \| [Netdisk](https://pan.baidu.com/s/1FM8xPpkhOrRcAfe7-zUgWQ?pwd=e6ag) \| [Wisemodel](https://www.wisemodel.cn/models/IEIT-Yuan/Yuan2-M32-gguf-int4)
| Yuan2.0-M32-HF-INT4 | 16K | HuggingFace | [ModelScope](https://modelscope.cn/models/YuanLLM/Yuan2-M32-HF-INT4/summary) \| [HuggingFace](https://huggingface.co/IEITYuan/Yuan2-M32-hf-int4) \| [Netdisk](https://pan.baidu.com/s/1zacOAxCne9U99LdgMbjfFQ?pwd=kkww) \| [Wisemodel](https://www.wisemodel.cn/models/IEIT-Yuan/Yuan2-M32-hf-int4/)
| Yuan2.0-M32-HF-INT8 | 16K | HuggingFace | [ModelScope](https://modelscope.cn/models/YuanLLM/Yuan2-M32-hf-int8/) \| [HuggingFace](https://huggingface.co/IEITYuan/Yuan2-M32-hf-int8/) \| [Netdisk](https://pan.baidu.com/s/1hq9l6eYY_cRuBlQMRV6Lcg?pwd=b56k) \| [Wisemodel](https://www.wisemodel.cn/models/IEIT-Yuan/Yuan2-M32-hf-int8/)
\* __*Yuan2.0-M32-HF-INT4*__: For the quantization and inference method, refer to the [guide](https://github.com/IEIT-Yuan/Yuan2.0-M32/blob/main/docs/README_GPTQ_EN.md).
## 3. Evaluation
......@@ -152,9 +156,8 @@ We've provided several scripts for pretraining in the [`example`](./examples). T
**4.4 Inference Service**
For a detailed deployment plan, please refer to [vllm](https://github.com/IEIT-Yuan/Yuan2.0-M32/blob/main/vllm/README_Yuan_vllm.md).
- For a detailed deployment plan, please refer to [vllm](https://github.com/IEIT-Yuan/Yuan2.0-M32/blob/main/vllm/README_Yuan_vllm.md).
- Inference with Yuan2.0-M32-HF-INT4, please refer to [The Method of Quantization and Inference for Yuan2.0-M32](https://github.com/IEIT-Yuan/Yuan2.0-M32/blob/main/docs/README_GPTQ_EN.md).
## 5. Statement of Agreement
......
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-centos7.6-dtk24.04-py310
\ No newline at end of file
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.10
\ No newline at end of file
<h1 align="center">Yuan2-M32基于AutoGPTQ的量化和推理</h1>
<div align="center">
<a href="code_license">
<img alt="Code License" src="https://img.shields.io/badge/Apache%202.0%20-green?style=flat&label=Code%20License&link=https%3A%2F%2Fgithub.com%2FIEIT-Yuan%2FYuan-2.0-MoE%3Ftab%3DApache-2.0-1-ov-file"/>
</a>
<a href="model_license">
<img alt="Model License" src="https://img.shields.io/badge/Yuan2.0%20License-blue?style=flat&logoColor=blue&label=Model%20License&color=blue&link=https%3A%2F%2Fgithub.com%2FIEIT-Yuan%2FYuan-2.0%2Fblob%2Fmain%2FLICENSE-Yuan" />
</a>
</div>
## 0. Model Downloads
**We provide multiple download sources for the models:**
| Model | Sequence Length | Type | Download |
| :----------: | :------: | :-------: |:---------------------------: |
| Yuan2.0-M32-HF-INT4 | 16K | HuggingFace | [ModelScope](https://modelscope.cn/models/YuanLLM/Yuan2-M32-HF-INT4/summary) \| [HuggingFace](https://huggingface.co/IEITYuan/Yuan2-M32-hf-int4) \| [Netdisk](https://pan.baidu.com/s/1zacOAxCne9U99LdgMbjfFQ?pwd=kkww) \| [Wisemodel](https://www.wisemodel.cn/models/IEIT-Yuan/Yuan2-M32-hf-int4/)
| Yuan2.0-M32-HF-INT8 | 16K | HuggingFace | [ModelScope](https://modelscope.cn/models/YuanLLM/Yuan2-M32-hf-int8/) \| [HuggingFace](https://huggingface.co/IEITYuan/Yuan2-M32-hf-int8/) \| [Netdisk](https://pan.baidu.com/s/1hq9l6eYY_cRuBlQMRV6Lcg?pwd=b56k) \| [Wisemodel](https://www.wisemodel.cn/models/IEIT-Yuan/Yuan2-M32-hf-int8/)
## 1. AutoGPTQ Environment Setup
- **Environment requirements:** CUDA version higher than 11.8
- **Container:** Create a container from the image provided by the [vllm](https://github.com/IEI-mjx/Yuan2.0-M32/blob/main/vllm/README_Yuan_vllm.md) guide
```shell
# enter the container
docker exec -it vllm_yuan bash
# go to your working directory
cd /mnt
# clone the project
git clone https://github.com/IEIT-Yuan/Yuan2.0-M32.git
# enter the AutoGPTQ project
cd Yuan2.0-M32/3rd_party/AutoGPTQ
# install auto-gptq
pip install auto-gptq --no-build-isolation
```
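Before quantizing, it may help to confirm that the container meets the CUDA requirement and that auto-gptq imports cleanly. A minimal sanity-check sketch (assuming the packages expose the usual version attributes):
```python
import torch
import auto_gptq

# the CUDA toolkit torch was built against should be newer than 11.8
print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available(), "| GPUs:", torch.cuda.device_count())
print("auto-gptq:", auto_gptq.__version__)
```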
## 2. Quantize the Yuan2-M32-HF Model
Quantizing the Yuan2-M32 model involves three steps: 1. download the Yuan2-M32-HF model; 2. download the calibration dataset; 3. set the quantization parameters and quantize the Yuan2-M32-HF model.
- **Step 1:** Download the Yuan2-M32 Hugging Face model and move it to the specified path (/mnt/beegfs2/Yuan2-M32-HF); refer to [vllm](https://github.com/IEI-mjx/Yuan2.0-M32/blob/main/vllm/README_Yuan_vllm.md). Model download address: https://huggingface.co/IEIT-Yuan/Yuan2-M32-hf
- **Step 2:** Download the dataset [here](https://huggingface.co/datasets/hakurei/open-instruct-v1), then move it to the specified path, e.g. /mnt/beegfs2/
- **Step 3:** Adjust the quantization parameters as shown in the following script and run the quantization.
```shell
# edit Yuan2-M32-int4.py
cd /mnt/beegfs2/Yuan2.0-M32/3rd_party/AutoGPTQ
vim Yuan2-M32-int4.py
'''
pretrained_model_dir = "/mnt/beegfs2/Yuan2-M32-HF"
quantized_model_dir = "/mnt/beegfs2/Yuan2-M32-GPTQ-int4"
tokenizer = LlamaTokenizer.from_pretrained("/mnt/beegfs2/Yuan2-M32-HF", add_eos_token=False, add_bos_token=False, eos_token='<eod>', use_fast=True)
examples = []
with open("/mnt/beegfs2/instruct_data.json", 'r', encoding='utf-8') as file:  # path of the dataset
    data = json.load(file)
for i, item in enumerate(data):
    if i >= 2000:
        break
    instruction = item.get('instruction', '')
    output = item.get('output', '')
    combined_text = instruction + " " + output
    examples.append(tokenizer(combined_text))
max_memory = {0: "80GIB", 1: "80GIB", 2: "80GIB", 3: "80GIB", 4: "80GIB", 5: "80GIB", 6: "80GIB", 7: "80GIB"}
quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # setting this to False can significantly speed up inference, but perplexity may be slightly worse
)
'''
# 1. Modify pretrained_model_dir and specify quantized_model_dir for the quantized model.
# 2. Modify the dataset path.
# 3. max_memory specifies which GPUs to use.
# 4. Adjust the quantization parameters: set bits=4 for int4 or bits=8 for int8; other parameters can keep the default values.
# run the script
python Yuan2-M32-int4.py
# The model quantization and packing process takes approximately 8 hours; you can quantize to int4 and int8 on different GPUs in parallel.
```
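The snippet above stops at the quantize configuration. A minimal sketch of the remaining calls in Yuan2-M32-int4.py, assuming the standard AutoGPTQ API (the actual script may differ slightly):
```python
from auto_gptq import AutoGPTQForCausalLM

# load the full-precision model together with the quantize_config defined above
model = AutoGPTQForCausalLM.from_pretrained(
    pretrained_model_dir,
    quantize_config,
    max_memory=max_memory,
    trust_remote_code=True,
)

# run GPTQ calibration on the tokenized examples, then pack the quantized weights
model.quantize(examples)

# write the .safetensors checkpoint and quantize_config.json to quantized_model_dir
model.save_quantized(quantized_model_dir, use_safetensors=True)
```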
## 3. Inference with the GPTQ-Quantized Model
Once quantization is complete, the target folder will contain checkpoint files with the '.safetensors' suffix, along with config.json and quantize_config.json. You first need to copy the tokenizer-related files from the Yuan2-M32-HF path.
```shell
# enter the Yuan2-M32-HF path
cd /mnt/beegfs2/Yuan2-M32-HF
# copy the tokenizer-related files to Yuan2-M32-GPTQ-int4
cp special_tokens_map.json tokenizer* /mnt/beegfs2/Yuan2-M32-GPTQ-int4
# edit inference.py
cd /mnt/beegfs2/Yuan2.0-M32/3rd_party/AutoGPTQ
vim inference.py
'''
quantized_model_dir = "/mnt/beegfs2/Yuan2-M32-GPTQ-int4"
tokenizer = LlamaTokenizer.from_pretrained('/mnt/beegfs2/Yuan2-M32-GPTQ-int4', add_eos_token=False, add_bos_token=False, eos_token='<eod>')
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", trust_remote_code=True)
'''
# modify quantized_model_dir and the tokenizer path
# run inference.py
python inference.py
```
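inference.py loads the tokenizer and quantized model as shown above. A minimal sketch of running one prompt through it (the prompt and generation settings here are illustrative, not the script's actual contents):
```python
# illustrative prompt; the exact prompt template expected by Yuan2-M32 may differ
prompt = "Write a short poem about spring."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

# greedy decoding on the GPTQ-quantized model
output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```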
## 4. Evaluation
> HumanEval generation parameters:
> generation_params = {
>     "max_new_tokens": 512,
>     "top_k": 1,
>     "top_p": 0,
>     "temperature": 1.0,
> }
> Yuan2-M32-HF inference uses two 80GB GPUs; Yuan2-M32-GPTQ-int4 and Yuan2-M32-GPTQ-int8 each run on a single 80GB GPU.
> Results:
| Model | Precision | HumanEval | Inference Speed | Inference Memory Usage |
|---------------------|-----------|-----------|-----------------|------------------------|
| Yuan2-M32-HF | BF16 | 73.17% | 13.16 tokens/s | 76.34 GB |
| Yuan2-M32-GPTQ-int8 | INT8 | 72.56% | 9.05 tokens/s | 39.81 GB |
| Yuan2-M32-GPTQ-int4 | INT4 | 66.46% | 9.24 tokens/s | 23.27 GB |
<h1 align="center">The Method of Quantization and Inference for Yuan2.0-M32</h1>
<div align="center">
<a href="code_license">
<img alt="Code License" src="https://img.shields.io/badge/Apache%202.0%20-green?style=flat&label=Code%20License&link=https%3A%2F%2Fgithub.com%2FIEIT-Yuan%2FYuan-2.0-MoE%3Ftab%3DApache-2.0-1-ov-file"/>
</a>
<a href="model_license">
<img alt="Model License" src="https://img.shields.io/badge/Yuan2.0%20License-blue?style=flat&logoColor=blue&label=Model%20License&color=blue&link=https%3A%2F%2Fgithub.com%2FIEIT-Yuan%2FYuan-2.0%2Fblob%2Fmain%2FLICENSE-Yuan" />
</a>
</div>
## 0. Model Downloads
| Model | Sequence Length | Type | Download |
| :----------: | :------: | :-------: |:---------------------------: |
| Yuan2.0-M32-HF-INT4 | 16K | HuggingFace | [ModelScope](https://modelscope.cn/models/YuanLLM/Yuan2-M32-HF-INT4/summary) \| [HuggingFace](https://huggingface.co/IEITYuan/Yuan2-M32-hf-int4) \| [Netdisk](https://pan.baidu.com/s/1zacOAxCne9U99LdgMbjfFQ?pwd=kkww) \| [Wisemodel](https://www.wisemodel.cn/models/IEIT-Yuan/Yuan2-M32-hf-int4/)
| Yuan2.0-M32-HF-INT8 | 16K | HuggingFace | [ModelScope](https://modelscope.cn/models/YuanLLM/Yuan2-M32-hf-int8/) \| [HuggingFace](https://huggingface.co/IEITYuan/Yuan2-M32-hf-int8/) \| [Netdisk](https://pan.baidu.com/s/1hq9l6eYY_cRuBlQMRV6Lcg?pwd=b56k) \| [Wisemodel](https://www.wisemodel.cn/models/IEIT-Yuan/Yuan2-M32-hf-int8/)
## 1. Environment of AutoGPTQ
- **Environment requirements:** CUDA version > 11.8
- **Container:** Create a container using the image provided by the [vllm](https://github.com/IEI-mjx/Yuan2.0-M32/blob/main/vllm/README_Yuan_vllm.md) guide
```shell
# enter the docker container
docker exec -it vllm_yuan bash
# go to your working directory
cd /mnt
# clone the project
git clone https://github.com/IEIT-Yuan/Yuan2.0-M32.git
# enter the AutoGPTQ project
cd Yuan2.0-M32/3rd_party/AutoGPTQ
# install auto-gptq
pip install auto-gptq --no-build-isolation
```
## 2. Quantize the Yuan2.0-M32-HF Model
**The steps for quantizing the Yuan2.0-M32 model:**
- **Step 1:** Download the [Yuan2.0-M32-HF](https://github.com/IEIT-Yuan/Yuan2.0-M32?tab=readme-ov-file#2-model-downloads) model and move it to the specified path (/mnt/beegfs2/Yuan2-M32-HF); refer to [vllm](https://github.com/IEI-mjx/Yuan2.0-M32/blob/main/vllm/README_Yuan_vllm.md)
- **Step 2:** Download the [dataset](https://huggingface.co/datasets/hakurei/open-instruct-v1), then move it to the specified path (/mnt/beegfs2/)
- **Step 3:** Adjust the parameters in the following script and run the quantization.
```shell
# edit Yuan2-M32-int4.py
cd /mnt/beegfs2/Yuan2.0-M32/3rd_party/AutoGPTQ
vim Yuan2-M32-int4.py
'''
pretrained_model_dir = "/mnt/beegfs2/Yuan2-M32-HF"
quantized_model_dir = "/mnt/beegfs2/Yuan2-M32-GPTQ-int4"
tokenizer = LlamaTokenizer.from_pretrained("/mnt/beegfs2/Yuan2-M32-HF", add_eos_token=False, add_bos_token=False, eos_token='<eod>', use_fast=True)
examples = []
with open("/mnt/beegfs2/instruct_data.json", 'r', encoding='utf-8') as file:  # path of the dataset
    data = json.load(file)
for i, item in enumerate(data):
    if i >= 2000:
        break
    instruction = item.get('instruction', '')
    output = item.get('output', '')
    combined_text = instruction + " " + output
    examples.append(tokenizer(combined_text))
max_memory = {0: "80GIB", 1: "80GIB", 2: "80GIB", 3: "80GIB", 4: "80GIB", 5: "80GIB", 6: "80GIB", 7: "80GIB"}
quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # setting this to False can significantly speed up inference, but perplexity may be slightly worse
)
'''
# 1. Modify pretrained_model_dir and specify quantized_model_dir for the quantized model.
# 2. Modify the path of the dataset.
# 3. max_memory specifies which GPUs to use.
# 4. Adjust the quantization parameters: set bits=4 for int4 or bits=8 for int8; other parameters can keep the default values.
# run the script
python Yuan2-M32-int4.py
# The model quantization and packing process takes approximately 8 hours.
# You can quantize to int4 and int8 on different GPUs in parallel (see the sketch after this block).
```
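To quantize int4 and int8 in parallel as suggested above, one option is to pin each run to its own GPUs. A hypothetical launcher sketch (it assumes a Yuan2-M32-int8.py copy of the script with bits=8 and its own quantized_model_dir, and that each script's max_memory keys are adjusted to the devices it can see):
```python
import os
import subprocess

# map each quantization script to a disjoint set of GPUs
jobs = {
    "Yuan2-M32-int4.py": "0,1,2,3",
    "Yuan2-M32-int8.py": "4,5,6,7",
}

procs = []
for script, gpus in jobs.items():
    # restrict each process to its own GPUs; inside the process they are renumbered from 0
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpus)
    procs.append(subprocess.Popen(["python", script], env=env))

for p in procs:
    p.wait()
```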
## 3. Inference with the Quantized Model
Once quantization is complete, the target folder will contain checkpoint files with the '.safetensors' suffix, along with config.json and quantize_config.json. You first need to copy the tokenizer-related files from the Yuan2-M32-HF path.
```shell
# enter the Yuan2-M32-HF path
cd /mnt/beegfs2/Yuan2-M32-HF
# copy the tokenizer-related files to Yuan2-M32-GPTQ-int4
cp special_tokens_map.json tokenizer* /mnt/beegfs2/Yuan2-M32-GPTQ-int4
# edit inference.py
cd /mnt/beegfs2/Yuan2.0-M32/3rd_party/AutoGPTQ
vim inference.py
'''
quantized_model_dir = "/mnt/beegfs2/Yuan2-M32-GPTQ-int4"
tokenizer = LlamaTokenizer.from_pretrained('/mnt/beegfs2/Yuan2-M32-GPTQ-int4', add_eos_token=False, add_bos_token=False, eos_token='<eod>')
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", trust_remote_code=True)
'''
# edit paths of quantized_model_dir and tokenizer
# run inference.py
python inference.py
```
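Before running inference.py, it can be worth confirming that the quantized folder contains everything described above (the GPTQ checkpoint, its configs, and the copied tokenizer files). A small illustrative check:
```python
import os

quantized_model_dir = "/mnt/beegfs2/Yuan2-M32-GPTQ-int4"
expected = ["config.json", "quantize_config.json", "special_tokens_map.json"]

files = os.listdir(quantized_model_dir)
missing = [name for name in expected if name not in files]
has_ckpt = any(name.endswith(".safetensors") for name in files)
has_tokenizer = any(name.startswith("tokenizer") for name in files)

print("missing:", missing, "| checkpoint:", has_ckpt, "| tokenizer files:", has_tokenizer)
```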
## 4. Evaluation
> HumanEval generation parameters:
> generation_params = {
>     "max_new_tokens": 512,
>     "top_k": 1,
>     "top_p": 0,
>     "temperature": 1.0,
> }
> Yuan2-M32-HF inference uses two 80GB GPUs; Yuan2-M32-GPTQ-int4 and Yuan2-M32-GPTQ-int8 each run on a single 80GB GPU.
> Results:
| Model | Precision | HumanEval | Inference Speed | Inference Memory Usage |
|---------------------|-----------|-----------|-----------------|------------------------|
| Yuan2-M32-HF | BF16 | 73.17% | 13.16 tokens/s | 76.34 GB |
| Yuan2-M32-GPTQ-int8 | INT8 | 72.56% | 9.05 tokens/s | 39.81 GB |
| Yuan2-M32-GPTQ-int4 | INT4 | 66.46% | 9.24 tokens/s | 23.27 GB |
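For reference, top_k=1 with top_p=0 amounts to greedy decoding, so the generation_params above correspond to a greedy generate() call on the model and tokenizer loaded in section 3. A minimal sketch (problem_prompt is a placeholder; the actual HumanEval harness is not included here):
```python
# illustrative HumanEval-style prompt
problem_prompt = 'def fib(n):\n    """Return the n-th Fibonacci number."""\n'

inputs = tokenizer(problem_prompt, return_tensors="pt").to("cuda:0")
# top_k=1 / top_p=0 is equivalent to greedy decoding
output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# keep only the newly generated completion
completion = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(completion)
```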
......@@ -14,6 +14,7 @@ The main variables in the code should be set as follows:
| `--output_path` | The path where to store the preprocessed dataset, one .idx file and one .bin file would be created for each dataset. |
## Dataset
Samples in the dataset should be separated by '\n', and within each sample any '\n' should be replaced with '\<n>', so that each line in the dataset is a single sample. The program replaces '\<n>' back with '\n' during preprocessing, as shown in the sketch below.
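A minimal sketch (file name and samples are illustrative) of writing raw samples in this one-sample-per-line format and reversing the substitution:
```python
# raw samples may contain internal newlines
raw_samples = [
    "Question: What is 2 + 2?\nAnswer: 4",
    "Summarize the text.\nIt was a bright cold day in April.",
]

# replace internal '\n' with '<n>' so each sample occupies exactly one line
with open("dataset.txt", "w", encoding="utf-8") as f:
    for sample in raw_samples:
        f.write(sample.replace("\n", "<n>") + "\n")

# preprocessing reverses the substitution when reading the file back
restored = [line.rstrip("\n").replace("<n>", "\n")
            for line in open("dataset.txt", encoding="utf-8")]
assert restored == raw_samples
```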
......
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: prometheus_client in /usr/local/lib/python3.10/dist-packages (0.17.0)
Collecting to
Downloading to-0.3.tar.gz (26 kB)
from vllm import LLM, SamplingParams
# Sample prompts.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="facebook/opt-125m")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")