Commit 67ca83cf authored by Rayyyyy's avatar Rayyyyy

Support GLM-4-0414

parent 78ba9d16
from langchain_community.document_loaders import PyMuPDFLoader
import docx
from pptx import Presentation


def extract_text(path):
    # Plain-text files can be read directly.
    return open(path, "r").read()


def extract_pdf(path):
    # Load the PDF with PyMuPDF and join the page contents.
    loader = PyMuPDFLoader(path)
    data = loader.load()
    data = [x.page_content for x in data]
    content = "\n\n".join(data)
    return content


def extract_docx(path):
    # Collect the text of every paragraph in the Word document.
    doc = docx.Document(path)
    data = []
    for paragraph in doc.paragraphs:
        data.append(paragraph.text)
    content = "\n\n".join(data)
    return content


def extract_pptx(path):
    prs = Presentation(path)
......
# 使用 Intel® Extension for Transformers 推理 GLM-4-9B-Chat 模型
本示例介绍如何使用 Intel® Extension for Transformers 推理 GLM-4-9B-Chat 模型。
## 设备和依赖检查
### 相关推理测试数据
**本文档的数据均在以下硬件环境测试,实际运行环境需求和运行占用的显存略有不同,请以实际运行环境为准。**
测试硬件信息:
+ OS: Ubuntu 22.04 (本教程一定需要在Linux环境下执行)
+ Memory: 512GB
+ Python: 3.10.12
+ CPU: Intel(R) Xeon(R) Platinum 8358 CPU / 12th Gen Intel i5-12400
## 安装依赖
在开始推理之前,请先安装 `inference` 中的依赖,同时安装本目录下的依赖项:
```shell
pip install -r requirements.txt
```
## 运行模型推理
```shell
python itrex_cli_demo.py
```
如果您是第一次推理,会有一次模型权重转换的过程,转换后的模型权重存放在 `runtime_outs` 文件夹下,这大概会消耗 `60G` 的硬盘空间。
转换完成后,文件夹下有两个文件:
+ ne_chatglm2_f32.bin 52G(如果您不使用FP32进行推理,可以删掉这个文件)
+ ne_chatglm2_q_nf4_bestla_cfp32_sym_sfp32_g32.bin 8.1G
如果您不是第一次推理,则会跳过这个步骤,直接开始对话,推理效果如下:
```shell
Welcome to the CLI chat. Type your messages below.
User: 你好
AVX:1 AVX2:1 AVX512F:1 AVX512BW:1 AVX_VNNI:0 AVX512_VNNI:1 AMX_INT8:0 AMX_BF16:0 AVX512_BF16:0 AVX512_FP16:0
beam_size: 1, do_sample: 1, top_k: 40, top_p: 0.900, continuous_batching: 0, max_request_num: 1, early_stopping: 0, scratch_size_ratio: 1.000
model_file_loader: loading model from runtime_outs/ne_chatglm2_q_nf4_bestla_cfp32_sym_sfp32_g32.bin
Loading the bin file with NE format...
load_ne_hparams 0.hparams.n_vocab = 151552
load_ne_hparams 1.hparams.n_embd = 4096
load_ne_hparams 2.hparams.n_mult = 0
load_ne_hparams 3.hparams.n_head = 32
load_ne_hparams 4.hparams.n_head_kv = 0
load_ne_hparams 5.hparams.n_layer = 40
load_ne_hparams 6.hparams.n_rot = 0
load_ne_hparams 7.hparams.ftype = 0
load_ne_hparams 8.hparams.max_seq_len = 131072
load_ne_hparams 9.hparams.alibi_bias_max = 0.000
load_ne_hparams 10.hparams.clip_qkv = 0.000
load_ne_hparams 11.hparams.par_res = 0
load_ne_hparams 12.hparams.word_embed_proj_dim = 0
load_ne_hparams 13.hparams.do_layer_norm_before = 0
load_ne_hparams 14.hparams.multi_query_group_num = 2
load_ne_hparams 15.hparams.ffn_hidden_size = 13696
load_ne_hparams 16.hparams.inner_hidden_size = 0
load_ne_hparams 17.hparams.n_experts = 0
load_ne_hparams 18.hparams.n_experts_used = 0
load_ne_hparams 19.hparams.n_embd_head_k = 0
load_ne_hparams 20.hparams.norm_eps = 0.000000
load_ne_hparams 21.hparams.freq_base = 5000000.000
load_ne_hparams 22.hparams.freq_scale = 1.000
load_ne_hparams 23.hparams.rope_scaling_factor = 0.000
load_ne_hparams 24.hparams.original_max_position_embeddings = 0
load_ne_hparams 25.hparams.use_yarn = 0
load_ne_vocab 26.vocab.bos_token_id = 1
load_ne_vocab 27.vocab.eos_token_id = 151329
load_ne_vocab 28.vocab.pad_token_id = 151329
load_ne_vocab 29.vocab.sep_token_id = -1
init: hparams.n_vocab = 151552
init: hparams.n_embd = 4096
init: hparams.n_mult = 0
init: hparams.n_head = 32
init: hparams.n_layer = 40
init: hparams.n_rot = 0
init: hparams.ffn_hidden_size = 13696
init: n_parts = 1
load: ctx size = 16528.38 MB
load: layers[0].ffn_fusion = 1
load: scratch0 = 4096.00 MB
load: scratch1 = 2048.00 MB
load: scratch2 = 4096.00 MB
load: mem required = 26768.38 MB (+ memory per state)
.............................................................................................
model_init_from_file: support_bestla_kv = 1
kv_cache_init: run_mha_reordered = 1
model_init_from_file: kv self size = 690.00 MB
Assistant:
你好👋!我是人工智能助手,很高兴为你服务。有什么可以帮助你的吗?
```
# Using Intel® Extension for Transformers to Run Inference with the GLM-4-9B-Chat Model
This example shows how to run inference on the GLM-4-9B-Chat model with Intel® Extension for Transformers.
## Device and Dependency Check
### Relevant Inference Test Data
**The data in this document was measured on the following hardware environment. Requirements and memory usage in your environment may differ slightly; please refer to your actual running environment.**
Test hardware information:
+ OS: Ubuntu 22.04 (This tutorial must be executed in a Linux environment)
+ Memory: 512GB
+ Python: 3.10.12
+ CPU: Intel(R) Xeon(R) Platinum 8358 CPU / 12th Gen Intel i5-12400
## Installing Dependencies
Before starting inference, please install the dependencies in `inference` as well as the dependencies in this directory:
```shell
pip install -r requirements.txt
```
## Running Model Inference
```shell
python itrex_cli_demo.py
```
On the first run, the model weights are converted once. The converted weights are stored in the `runtime_outs` folder and consume about `60G` of disk space.
After the conversion is completed, the folder contains two files:
+ ne_chatglm2_f32.bin 52G (if you do not run inference in FP32, you can delete this file)
+ ne_chatglm2_q_nf4_bestla_cfp32_sym_sfp32_g32.bin 8.1G
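If you only run low-bit inference, the FP32 intermediate can be removed to reclaim disk space. A minimal sketch, assuming the default `runtime_outs` output directory and the file names listed above:

```python
from pathlib import Path

# Sketch: free roughly 52 GB by removing the FP32 weights once the NF4 file exists.
# Paths are assumptions based on the file list above; adjust if your output dir differs.
runtime_dir = Path("runtime_outs")
fp32_weights = runtime_dir / "ne_chatglm2_f32.bin"
nf4_weights = runtime_dir / "ne_chatglm2_q_nf4_bestla_cfp32_sym_sfp32_g32.bin"

if nf4_weights.exists() and fp32_weights.exists():
    fp32_weights.unlink()
    print(f"Removed {fp32_weights}, kept {nf4_weights}")
```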
On subsequent runs this step is skipped and the conversation starts directly. A sample session looks like this:
```shell
Welcome to the CLI chat. Type your messages below.
User: Hello
AVX:1 AVX2:1 AVX512F:1 AVX512BW:1 AVX_VNNI:0 AVX512_VNNI:1 AMX_INT8:0 AMX_BF16:0 AVX512_BF16:0 AVX512_FP16:0
beam_size: 1, do_sample: 1, top_k: 40, top_p: 0.900, continuous_batching: 0, max_request_num: 1, early_stopping: 0, scratch_size_ratio: 1.000
model_file_loader: loading model from runtime_outs/ne_chatglm2_q_nf4_bestla_cfp32_sym_sfp32_g32.bin
Loading the bin file with NE format...
load_ne_hparams 0.hparams.n_vocab = 151552
load_ne_hparams 1.hparams.n_embd = 4096
load_ne_hparams 2.hparams.n_mult = 0
load_ne_hparams 3.hparams.n_head = 32
load_ne_hparams 4.hparams.n_head_kv = 0
load_ne_hparams 5.hparams.n_layer = 40
load_ne_hparams 6.hparams.n_rot = 0
load_ne_hparams 7.hparams.ftype = 0
load_ne_hparams 8.hparams.max_seq_len = 131072
load_ne_hparams 9.hparams.alibi_bias_max = 0.000
load_ne_hparams 10.hparams.clip_qkv = 0.000
load_ne_hparams 11.hparams.multi_query_group_num = 2
load_ne_hparams 12.hparams.ffn_hidden_size = 13696
load_ne_hparams 13.hparams.inner_hidden_size = 0
load_ne_hparams 14.hparams.n_experts = 0
load_ne_hparams 15.hparams.n_experts_used = 0
load_ne_hparams 16.hparams.n_embd_head_k = 0
load_ne_hparams 17.hparams.norm_eps = 0.000000
load_ne_hparams 18.hparams.freq_base = 5000000.000
load_ne_hparams 19.hparams.freq_scale = 1.000
load_ne_hparams 20.hparams.rope_scaling_factor = 0.000
load_ne_hparams 21.hparams.original_max_position_embeddings = 0
load_ne_hparams 22.hparams.use_yarn = 0
load_ne_vocab 23.vocab.bos_token_id = 1
load_ne_vocab 24.vocab.eos_token_id = 151329
load_ne_vocab 25.vocab.pad_token_id = 151329
load_ne_vocab 26.vocab.sep_token_id = -1
init: hparams.n_vocab = 151552
init: hparams.n_embd = 4096
init: hparams.n_mult = 0
init: hparams.n_head = 32
init: hparams.n_layer = 40
init: hparams.n_rot = 0
init: hparams.ffn_hidden_size = 13696
init: n_parts = 1
load: ctx size = 16528.38 MB
load: layers[0].ffn_fusion = 1
load: scratch0 = 4096.00 MB
load: scratch1 = 2048.00 MB
load: scratch2 = 4096.00 MB
load: mem required = 26768.38 MB (+ memory per state)
.............................................................................................
model_init_from_file: support_bestla_kv = 1
kv_cache_init: run_mha_reordered = 1
model_init_from_file: kv self size = 690.00 MB
Assistant:
Hello👋! I am an AI assistant. How can I help you today?
```
"""
This script creates a CLI demo with transformers backend for the glm-4-9b model with Intel® Extension for Transformers
"""
import os
MODEL_PATH = os.environ.get("MODEL_PATH", "THUDM/GLM-4-9B-0414")
from threading import Thread
import torch
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer, StoppingCriteria, StoppingCriteriaList, TextIteratorStreamer
class StopOnTokens(StoppingCriteria):
def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
stop_ids = [151329, 151336, 151338]
for stop_id in stop_ids:
if input_ids[0][-1] == stop_id:
return True
return False
def initialize_model_and_tokenizer():
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
device_map="cpu", # Use Intel CPU for inference
trust_remote_code=True,
load_in_4bit=True,
)
return tokenizer, model
def get_user_input():
return input("\nUser: ")
def main():
tokenizer, model = initialize_model_and_tokenizer()
history = []
max_length = 100
top_p = 0.9
temperature = 0.8
stop = StopOnTokens()
print("Welcome to the CLI chat. Type your messages below.")
while True:
user_input = get_user_input()
if user_input.lower() in ["exit", "quit"]:
break
history.append([user_input, ""])
messages = []
for idx, (user_msg, model_msg) in enumerate(history):
if idx == len(history) - 1 and not model_msg:
messages.append({"role": "user", "content": user_msg})
break
if user_msg:
messages.append({"role": "user", "content": user_msg})
if model_msg:
messages.append({"role": "assistant", "content": model_msg})
model_inputs = tokenizer.apply_chat_template(
messages, add_generation_prompt=True, tokenize=True, return_tensors="pt"
)
streamer = TextIteratorStreamer(tokenizer=tokenizer, timeout=60, skip_prompt=True, skip_special_tokens=True)
generate_kwargs = {
"input_ids": model_inputs,
"streamer": streamer,
"max_new_tokens": max_length,
"do_sample": True,
"top_p": top_p,
"temperature": temperature,
"stopping_criteria": StoppingCriteriaList([stop]),
"repetition_penalty": 1.2,
"eos_token_id": model.config.eos_token_id,
}
t = Thread(target=model.generate, kwargs=generate_kwargs)
t.start()
print("Assistant:", end="", flush=True)
for new_token in streamer:
if new_token:
print(new_token, end="", flush=True)
history[-1][1] += new_token
history[-1][1] = history[-1][1].strip()
if __name__ == "__main__":
main()
cmake>=3.29.5.1
huggingface-hub>=0.23.4
git+https://github.com/intel/neural-speed.git@main#egg=neural-speed
intel-extension-for-transformers>=1.4.2
# 使用 OpenVINO 部署 GLM-4-9B-Chat 模型
Read this in [English](README_en.md).
[OpenVINO](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html)
是 Intel 为深度学习推理而设计的开源工具包。它可以帮助开发者优化模型,提高推理性能,减少模型的内存占用。
本示例将展示如何使用 OpenVINO 部署 GLM-4-9B-Chat 模型。
## 1. 环境配置
首先,你需要安装依赖
```bash
pip install -r requirements.txt
```
## 2. 转换模型
由于需要将Huggingface模型转换为OpenVINO IR模型,因此您需要下载模型并转换。
```
python3 convert.py --model_id THUDM/glm-4-9b-chat --output {your_path}/glm-4-9b-chat-ov
```
### 可以选择的参数
* `--model_id` - 模型所在目录的路径(绝对路径)。
* `--output` - 转换后模型保存的地址。
* `--precision` - 转换的精度。
转换过程如下:
```
====Exporting IR=====
Framework not specified. Using pt to export the model.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:04<00:00, 2.14it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using framework PyTorch: 2.3.1+cu121
Mixed-Precision assignment ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 160/160 • 0:01:45 • 0:00:00
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Num bits (N) │ % all parameters (layers) │ % ratio-defining parameters (layers) │
┝━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ 8 │ 31% (76 / 163) │ 20% (73 / 160) │
├────────────────┼─────────────────────────────┼────────────────────────────────────────┤
│ 4 │ 69% (87 / 163) │ 80% (87 / 160) │
┕━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
Applying Weight Compression ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 163/163 • 0:03:46 • 0:00:00
Configuration saved in glm-4-9b-ov/openvino_config.json
====Exporting tokenizer=====
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
```
## 3. 运行 GLM-4-9B-Chat 模型
```
python3 chat.py --model_path {your_path}/glm-4-9b-chat-ov --max_sequence_length 4096 --device CPU
```
### 可以选择的参数
* `--model_path` - OpenVINO IR 模型所在目录的路径。
* `--max_sequence_length` - 输出 token 的最大数量。
* `--device` - 运行推理的设备。
### 参考代码
本代码参考 [OpenVINO 官方示例](https://github.com/OpenVINO-dev-contest/chatglm3.openvino) 进行修改。
# Deploy the GLM-4-9B-Chat model using OpenVINO
[OpenVINO](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html)
is an open source toolkit designed by Intel for deep learning inference. It can help developers optimize models, improve inference performance, and reduce model memory usage.
This example will show how to deploy the GLM-4-9B-Chat model using OpenVINO.
## 1. Environment configuration
First, you need to install the dependencies
```bash
pip install -r requirements.txt
```
## 2. Convert the model
Since the Huggingface model needs to be converted to an OpenVINO IR model, you need to download the model and convert it.
```
python3 convert.py --model_id THUDM/glm-4-9b-chat --output {your_path}/glm-4-9b-chat-ov
```
The conversion process is as follows:
```
====Exporting IR=====
Framework not specified. Using pt to export the model.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:04<00:00, 2.14it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using framework PyTorch: 2.3.1+cu121
Mixed-Precision assignment ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 160/160 • 0:01:45 • 0:00:00
INFO:nncf:Statistics of the bitwidth distribution:
┍━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Num bits (N) │ % all parameters (layers) │ % ratio-defining parameters (layers) │
┝━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ 8 │ 31% (76 / 163) │ 20% (73 / 160) │
├────────────────┼─────────────────────────────┼────────────────────────────────────────┤
│ 4 │ 69% (87 / 163) │ 80% (87 / 160) │
┕━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
Applying Weight Compression ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 163/163 • 0:03:46 • 0:00:00
Configuration saved in glm-4-9b-ov/openvino_config.json
====Exporting tokenizer=====
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
```
### Optional parameters
* `--model_id` - Path to the directory where the model is located (absolute path).
* `--output` - Path to where the converted model is saved.
* `--precision` - Precision of the conversion.
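For reference, `convert.py` in this directory performs the export with `optimum-intel`. A minimal sketch of the int4 path, using the same compression settings as the script (the model id and output path below are placeholders):

```python
from optimum.intel import OVWeightQuantizationConfig
from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoTokenizer

# Sketch of the int4 export; settings mirror convert.py shown later in this directory.
model_id = "THUDM/glm-4-9b-chat"      # placeholder: your --model_id
output_dir = "glm-4-9b-chat-ov"       # placeholder: your --output path

ov_model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,    # convert the PyTorch checkpoint to OpenVINO IR
    compile=False,
    quantization_config=OVWeightQuantizationConfig(bits=4, sym=False, group_size=128, ratio=0.8),
    trust_remote_code=True,
)
ov_model.save_pretrained(output_dir)
AutoTokenizer.from_pretrained(model_id, trust_remote_code=True).save_pretrained(output_dir)
```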
## 3. Run the GLM-4-9B-Chat model
```
python3 chat.py --model_path {your_path}/glm-4-9b-chat-ov --max_sequence_length 4096 --device CPU
```
### Optional parameters
* `--model_path` - Path to the directory where the OpenVINO IR model is located.
* `--max_sequence_length` - Maximum number of output tokens.
* `--device` - The device to run inference on.
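For a quick non-interactive smoke test, the exported IR model can be loaded and queried directly with the same `optimum-intel` calls that `chat.py` uses; the model directory and prompt below are placeholders:

```python
from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoConfig, AutoTokenizer

# Minimal sketch mirroring chat.py: load the IR model on CPU and generate once.
model_dir = "glm-4-9b-chat-ov"  # placeholder: the --output path used during conversion
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
ov_model = OVModelForCausalLM.from_pretrained(
    model_dir,
    device="CPU",
    config=AutoConfig.from_pretrained(model_dir, trust_remote_code=True),
    trust_remote_code=True,
)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello"}], add_generation_prompt=True, tokenize=True, return_tensors="pt"
)
outputs = ov_model.generate(input_ids=inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```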
### Reference code
This code is modified based on the [OpenVINO official example](https://github.com/OpenVINO-dev-contest/chatglm3.openvino).
"""
This script is used to convert the original model to OpenVINO IR format.
The Origin Code can check https://github.com/OpenVINO-dev-contest/chatglm3.openvino/blob/main/convert.py
"""
import argparse
import os
from pathlib import Path
from optimum.intel import OVWeightQuantizationConfig
from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoConfig, AutoTokenizer
if __name__ == "__main__":
parser = argparse.ArgumentParser(add_help=False)
parser.add_argument("-h", "--help", action="help", help="Show this help message and exit.")
parser.add_argument(
"-m", "--model_id", default="THUDM/GLM-4-9B-0414", required=False, type=str, help="orignal model path"
)
parser.add_argument(
"-p",
"--precision",
required=False,
default="int4",
type=str,
choices=["fp16", "int8", "int4"],
help="fp16, int8 or int4",
)
parser.add_argument(
"-o", "--output", default="./glm-4-9b-ov", required=False, type=str, help="Required. path to save the ir model"
)
args = parser.parse_args()
ir_model_path = Path(args.output)
if ir_model_path.exists() == False:
os.mkdir(ir_model_path)
model_kwargs = {
"trust_remote_code": True,
"config": AutoConfig.from_pretrained(args.model_id, trust_remote_code=True),
}
compression_configs = {
"sym": False,
"group_size": 128,
"ratio": 0.8,
}
print("====Exporting IR=====")
if args.precision == "int4":
ov_model = OVModelForCausalLM.from_pretrained(
args.model_id,
export=True,
compile=False,
quantization_config=OVWeightQuantizationConfig(bits=4, **compression_configs),
**model_kwargs,
)
elif args.precision == "int8":
ov_model = OVModelForCausalLM.from_pretrained(
args.model_id, export=True, compile=False, load_in_8bit=True, **model_kwargs
)
else:
ov_model = OVModelForCausalLM.from_pretrained(
args.model_id, export=True, compile=False, load_in_8bit=False, **model_kwargs
)
ov_model.save_pretrained(ir_model_path)
print("====Exporting tokenizer=====")
tokenizer = AutoTokenizer.from_pretrained(args.model_id, trust_remote_code=True)
tokenizer.save_pretrained(ir_model_path)
import argparse
from threading import Thread
from typing import List, Tuple

import torch
from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoConfig, AutoTokenizer, StoppingCriteria, StoppingCriteriaList, TextIteratorStreamer


class StopOnTokens(StoppingCriteria):
    def __init__(self, token_ids):
        self.token_ids = token_ids

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Stop as soon as the last generated token is one of the configured stop tokens.
        for stop_id in self.token_ids:
            if input_ids[0][-1] == stop_id:
                return True
        return False


if __name__ == "__main__":
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("-h", "--help", action="help", help="Show this help message and exit.")
    parser.add_argument("-m", "--model_path", required=True, type=str, help="Required. model path")
    parser.add_argument(
        "-l", "--max_sequence_length", default=256, required=False, type=int, help="Maximum length of output"
    )
    parser.add_argument(
        "-d", "--device", default="CPU", required=False, type=str, help="Device for inference"
    )
    args = parser.parse_args()
    model_dir = args.model_path

    ov_config = {"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": "1", "CACHE_DIR": ""}

    tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)

    print("====Compiling model====")
    ov_model = OVModelForCausalLM.from_pretrained(
        model_dir,
        device=args.device,
        ov_config=ov_config,
        config=AutoConfig.from_pretrained(model_dir, trust_remote_code=True),
        trust_remote_code=True,
    )

    streamer = TextIteratorStreamer(tokenizer, timeout=60.0, skip_prompt=True, skip_special_tokens=True)
    stop_tokens = [StopOnTokens([151329, 151336, 151338])]

    def convert_history_to_token(history: List[Tuple[str, str]]):
        # Turn the (user, assistant) history into tokenized chat-template input ids.
        messages = []
        for idx, (user_msg, model_msg) in enumerate(history):
            if idx == len(history) - 1 and not model_msg:
                messages.append({"role": "user", "content": user_msg})
                break
            if user_msg:
                messages.append({"role": "user", "content": user_msg})
            if model_msg:
                messages.append({"role": "assistant", "content": model_msg})
        model_inputs = tokenizer.apply_chat_template(
            messages, add_generation_prompt=True, tokenize=True, return_tensors="pt"
        )
        return model_inputs

    history = []
    print("====Starting conversation====")
    while True:
        input_text = input("用户: ")
        if input_text.lower() == "stop":
            break
        if input_text.lower() == "clear":
            history = []
            print("AI助手: 对话历史已清空")
            continue
        print("GLM-4-9B-OpenVINO:", end=" ")
        history = history + [[input_text, ""]]
        model_inputs = convert_history_to_token(history)
        generate_kwargs = dict(
            input_ids=model_inputs,
            max_new_tokens=args.max_sequence_length,
            temperature=0.1,
            do_sample=True,
            top_p=1.0,
            top_k=50,
            repetition_penalty=1.1,
            streamer=streamer,
            stopping_criteria=StoppingCriteriaList(stop_tokens),
        )
        # Stream tokens from a background generation thread.
        t1 = Thread(target=ov_model.generate, kwargs=generate_kwargs)
        t1.start()
        partial_text = ""
        for new_text in streamer:
            print(new_text, end="", flush=True)
            partial_text += new_text
        print("\n")
        history[-1][1] = partial_text
optimum>=1.20.0
optimum-intel @ git+https://github.com/huggingface/optimum-intel.git@c1ee8ac0864e25e22ea56b5a37a35451531da0e6
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-centos7.6-dtk24.04-py310
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.4.1-ubuntu22.04-dtk25.04-py3.10
# GLM-4-9B Chat Fine-tuning
In this demo, you will experience how to fine-tune the GLM-4-9B-Chat open source model (visual understanding model is
not supported). Please strictly follow the steps in the document to avoid unnecessary errors.
[中文阅读](README_zh.md)
## Hardware Check
**The data in this document was tested in the following hardware environment. Actual environment requirements and the GPU memory used at runtime may differ slightly; please refer to your actual environment.**
All fine-tuning tests were performed in the following environment:
> OS: Ubuntu 22.04
>
> Memory: 512GB
>
> Python: 3.12.3
>
> CUDA Version: 12.4
>
> GPU Driver: 535.104.05
>
> GPU: NVIDIA H100 80GB HBM3 (hereafter referred to as GPU)
+ Fine-tuning based on Llama-Factory
| Fine-tuning Model | Fine-tuning solution | GPU memory usage |
|-----------------------|----------------------|------------------------------|
| GLM-4-9B-0414 | lora | 22G (Each GPU, Need 1 GPU) |
| GLM-4-9B-0414 | SFT (Zero3 method) | 55G (Each GPU, Need 4 GPUs) |
| GLM-4-9B-0414 | lora | 80G (Each GPU, Need 8 GPUs) |
| GLM-4-32B-0414 | SFT (Zero3 method) | 80G (Each GPU, Need 16 GPUs) |
+ Fine-tuning based on this repository
| Fine-tuning Model | Fine-tuning solution | GPU memory usage |
|--------------------------|------------------------------------|-------------------------------|
| GLM-4V-9B | lora (PEFT), Include EVA2CLIPModel | 75G (Each GPU, Need 1 GPU) |
| GLM-4-9B-Chat | lora (PEFT) | 22G (Each GPU, Need 1 GPU) |
| GLM-4-9B-Chat | SFT (Zero3 method) | 80G (Each GPU, Need 8 GPUs) |
## Preparation
Before starting fine-tuning, please install the dependencies in `inference`, ensure you have cloned the latest version of the model repository, and install the dependencies in this directory:
```bash
pip install -r requirements.txt
......@@ -95,21 +109,107 @@ For data files, the sample uses the following format:
This is a sample without tools:
```json
{
"messages": [
{
"role": "user",
"content": "类型#裤*材质#牛仔布*风格#性感"
},
{
"role": "assistant",
"content": "3x1的这款牛仔裤采用浅白的牛仔面料为裤身材质,其柔然的手感和细腻的质地,在穿着舒适的同时,透露着清纯甜美的个性气质。除此之外,流畅的裤身剪裁将性感的腿部曲线彰显的淋漓尽致,不失为一款随性出街的必备单品。"
}
]
}
```
This is a sample with tools:
```json
{
"messages": [
{
"role": "system",
"content": "",
"tools": [
{
"type": "function",
"function": {
"name": "get_recommended_books",
"description": "Get recommended books based on user's interests",
"parameters": {
"type": "object",
"properties": {
"interests": {
"type": "array",
"items": {
"type": "string"
},
"description": "The interests to recommend books for"
}
},
"required": [
"interests"
]
}
}
}
]
},
{
"role": "user",
"content": "Hi, I am looking for some book recommendations. I am interested in history and science fiction."
},
{
"role": "assistant",
"content": "{\"name\": \"get_recommended_books\", \"arguments\": {\"interests\": [\"history\", \"science fiction\"]}}"
},
{
"role": "observation",
"content": "{\"books\": [\"Sapiens: A Brief History of Humankind by Yuval Noah Harari\", \"A Brief History of Time by Stephen Hawking\", \"Dune by Frank Herbert\", \"The Martian by Andy Weir\"]}"
},
{
"role": "assistant",
"content": "Based on your interests in history and science fiction, I would recommend the following books: \"Sapiens: A Brief History of Humankind\" by Yuval Noah Harari, \"A Brief History of Time\" by Stephen Hawking, \"Dune\" by Frank Herbert, and \"The Martian\" by Andy Weir."
}
]
}
```
{"messages": [{"role": "system", "content": "", "tools": [{"type": "function", "function": {"name": "get_recommended_books", "description": "Get recommended books based on user's interests", "parameters": {"type": "object", "properties": {"interests": {"type": "array", "items": {"type": "string"}, "description": "The interests to recommend books for"}}, "required": ["interests"]}}}]}, {"role": "user", "content": "Hi, I am looking for some book recommendations. I am interested in history and science fiction."}, {"role": "assistant", "content": "{\"name\": \"get_recommended_books\", \"arguments\": {\"interests\": [\"history\", \"science fiction\"]}}"}, {"role": "observation", "content": "{\"books\": [\"Sapiens: A Brief History of Humankind by Yuval Noah Harari\", \"A Brief History of Time by Stephen Hawking\", \"Dune by Frank Herbert\", \"The Martian by Andy Weir\"]}"}, {"role": "assistant", "content": "Based on your interests in history and science fiction, I would recommend the following books: \"Sapiens: A Brief History of Humankind\" by Yuval Noah Harari, \"A Brief History of Time\" by Stephen Hawking, \"Dune\" by Frank Herbert, and \"The Martian\" by Andy Weir."}]}
This is a sample with VQA Task:
```json
{
"messages": [
{
"role": "user",
"content": "图片中的动物是什么?",
"image": "/root/images/0001.jpg"
},
{
"role": "assistant",
"content": "图片中有一只猫。"
},
{
"role": "user",
"content": "图片中的猫在做什么?"
},
{
"role": "assistant",
"content": "这只猫坐在或站在桌子上,桌上有很多食物。"
}
]
}
```
- The `system` role is optional, but if it exists, it must appear before the `user` role, and the `system` role can only
appear once in a complete conversation (whether it is a single round or a multi-round conversation).
- The `tools` field is optional, but if it exists, it must appear after the `system` role, and the `tools` field can
only appear once in a complete conversation (whether it is a single round or a multi-round conversation). When
the `tools` field exists, the `system` role must exist and the `content` field is empty.
- `GLM-4V-9B` does not support the `tools` field and the `system` field. And `image` must be placed in the first
message. The `image` field needs to contain the `absolute path` of the image.
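Training data is expected as JSONL, one conversation object per line. A minimal sketch that appends a record in the format above (the `train.jsonl` path is a placeholder; point it at the `train_file` from your data_config):

```python
import json

# Sketch: write one conversation per line (JSONL), matching the samples above.
sample = {
    "messages": [
        {"role": "user", "content": "类型#裤*材质#牛仔布*风格#性感"},
        {"role": "assistant", "content": "3x1的这款牛仔裤采用浅白的牛仔面料为裤身材质……"},
    ]
}

# "train.jsonl" is a hypothetical output path for illustration.
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```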
## Configuration file
......@@ -119,9 +219,8 @@ The fine-tuning configuration file is located in the `config` directory, includi
2. `lora.yaml / sft.yaml`: Configuration files for the different fine-tuning modes, including model parameters, optimizer
parameters, training parameters, etc. Some important parameters are explained as follows:
+ data_config section
+ train_file: File path of training dataset.
+ val_file: File path of validation dataset.
+ test_file: File path of test dataset.
......@@ -152,8 +251,7 @@ The fine-tuning configuration file is located in the `config` directory, includi
+ r: rank of LoRA.
+ lora_alpha: scaling factor of LoRA.
+ lora_dropout: dropout probability to use in LoRA layer (see the `LoraConfig` sketch after this list).
+ P-TuningV2 parameters:
+ num_virtual_tokens: the number of virtual tokens.
+ num_attention_heads: 2: the number of attention heads of P-TuningV2 (do not change).
+ token_dim: 256: the token dimension of P-TuningV2 (do not change).
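For orientation, the LoRA entries above map directly onto a PEFT `LoraConfig`; a minimal sketch using the values from the `lora.yaml` fragment later in this diff:

```python
from peft import LoraConfig, TaskType

# Sketch: the lora.yaml values expressed as a PEFT config.
# In this repository the config is built from the YAML; this is only an illustration.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                # rank of LoRA
    lora_alpha=32,      # scaling factor
    lora_dropout=0.1,   # dropout inside LoRA layers
    target_modules=["q_proj", "k_proj", "v_proj"],
)
# It would be applied to a base model with peft.get_peft_model(base_model, lora_config).
```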
......@@ -163,15 +261,31 @@ Execute **single machine multi-card/multi-machine multi-card** run through the f
the acceleration solution, and you need to install `deepspeed`.
```shell
OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=8 finetune.py data/AdvertiseGen/ THUDM/GLM-4-9b-0414 configs/lora.yaml # For Chat Fine-tune
OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=8 finetune_vision.py data/CogVLM-311K/ THUDM/glm-4v-9b configs/lora.yaml # For VQA Fine-tune
```
Execute **single machine single card** run through the following code.
```shell
python finetune.py data/AdvertiseGen/ THUDM/GLM-4-9B-0414 configs/lora.yaml # For Chat Fine-tune
python finetune_vision.py data/CogVLM-311K/ THUDM/glm-4v-9b configs/lora.yaml # For VQA Fine-tune
```
## Log Visualization Support
The fine-tuning code supports using SwanLab to visualize and track training metrics. You can enable tracking by installing SwanLab:
```shell
pip install swanlab
```
You can visit the [SwanLab Visualization Dashboard](https://swanlab.cn/@ShaohonChen/GLM4-Finetune) to view the training logs of example fine-tuning scripts.
If prompted to log in, you can obtain an API Key by visiting [https://swanlab.cn/space/~/settings](https://swanlab.cn/space/~/settings).
If you only want to use the local dashboard, set `swanlab: local` in the configuration parameters and use the `swanlab watch` command to start the offline dashboard.
## Fine-tune from a saved point
If you train as described above, each fine-tuning will start from the beginning. If you want to fine-tune from a
......@@ -184,21 +298,11 @@ half-trained model, you can add a fourth parameter, which can be passed in two w
For example, this is an example code to continue fine-tuning from the last saved point
```shell
python finetune.py data/AdvertiseGen/ THUDM/GLM-4-9B-0414 configs/lora.yaml yes
```
## Use the fine-tuned model
### Verify the fine-tuned model in inference.py
You can use the fine-tuned model in `finetune_demo/inference.py` and easily test it with just one line of code.
```shell
python inference.py your_finetune_path
```
In this way, the answer you get is the fine-tuned answer.
### Use the fine-tuned model in other demos in this repository or external repositories
You can use our `LORA` and fully fine-tuned models in any demo. This requires you to modify the code yourself according
......@@ -212,26 +316,16 @@ to the following tutorial.
> in `adapter_config.json`.
```python
def load_model_and_tokenizer(model_dir: Union[str, Path]) -> tuple[ModelType, TokenizerType]:
    model_dir = _resolve_path(model_dir)
    if (model_dir / "adapter_config.json").exists():
        # A LoRA/PEFT checkpoint: load the adapter and read its base model path for the tokenizer.
        model = AutoPeftModelForCausalLM.from_pretrained(model_dir, device_map="auto")
        tokenizer_dir = model.peft_config["default"].base_model_name_or_path
    else:
        model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto")
        tokenizer_dir = model_dir
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
    return model, tokenizer
```
2. Read the fine-tuned model. Please note that you should use the location of the fine-tuned model. For example, if your
......@@ -240,11 +334,12 @@ return model, tokenizer
as `model_dir`.
3. After completing the above operations, you can use the fine-tuned model normally; other calling methods remain
unchanged (see the usage sketch below).
4. This fine-tuning script has not been tested on long texts of 128K or 1M tokens. Fine-tuning long texts requires GPU
devices with larger memory and more efficient fine-tuning solutions, which developers need to handle on their own.
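For example, a minimal sketch of steps 2 and 3, assuming the `load_model_and_tokenizer` helper above is in scope (the adapter path is a placeholder):

```python
# Sketch: load a LoRA fine-tune with the helper above and run one query.
# "/path/to/finetune_adapter_model" is a placeholder for your checkpoint directory.
model, tokenizer = load_model_and_tokenizer("/path/to/finetune_adapter_model")

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "类型#裤*材质#牛仔布*风格#性感"}],
    add_generation_prompt=True, tokenize=True, return_tensors="pt",
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```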
## Reference
```
@inproceedings{liu2022p,
title={P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks},
author={Liu, Xiao and Ji, Kaixuan and Fu, Yicheng and Tam, Weng and Du, Zhengxiao and Yang, Zhilin and Tang, Jie},
......@@ -262,5 +357,4 @@ eprint={2306.05301},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
# GLM-4-9B Chat 对话模型微调
Read this in [English](README_en.md)
本 demo 中,你将体验到如何微调 GLM-4-9B-Chat 对话开源模型(不支持视觉理解模型)。 请严格按照文档的步骤进行操作,以避免不必要的错误。
## 硬件检查
**本文档的数据均在以下硬件环境测试,实际运行环境需求和运行占用的显存略有不同,请以实际运行环境为准。**
所有微调测试均在以下环境和硬件下测试:
> OS: Ubuntu 22.04
>
> Memory: 512GB
>
> Python: 3.12.3
>
> CUDA Version: 12.4
>
> GPU Driver: 535.104.05
>
> GPU: NVIDIA H100 80GB HBM3 (以下简称 GPU)
+ 基于 Llama-Factory 进行微调
| Fine-tuning Model | Fine-tuning solution | GPU memory usage |
|-----------------------|----------------------|------------------------------|
| GLM-4-9B-0414 | lora | 22G (Each GPU, Need 1 GPU) |
| GLM-4-9B-0414 | SFT (Zero3 method) | 55G (Each GPU, Need 4 GPUs) |
| GLM-4-9B-0414 | lora | 80G (Each GPU, Need 8 GPUs) |
| GLM-4-32B-0414 | SFT (Zero3 method) | 80G (Each GPU, Need 16 GPUs) |
+ 基于本仓库代码微调
| Fine-tuning Model | Fine-tuning solution | GPU memory usage |
|--------------------------|------------------------------------|-------------------------------|
| GLM-4V-9B | lora (PEFT), Include EVA2CLIPModel | 75G (Each GPU, Need 1 GPU) |
| GLM-4-9B-Chat | lora (PEFT) | 22G (Each GPU, Need 1 GPU) |
| GLM-4-9B-Chat | SFT (Zero3 method) | 80G (Each GPU, Need 8 GPUs) |
## 准备工作
在开始微调之前,请先安装 `inference` 中的依赖,并确保已克隆最新版本的模型仓库,同时安装本目录下的依赖项:
```bash
pip install -r requirements.txt
......@@ -50,7 +67,7 @@ pip install -r requirements.txt
"<arg name>": "<arg value>"
}
}
// Add more tools if needed
]
},
{
......@@ -94,14 +111,98 @@ pip install -r requirements.txt
这里是一个不带有工具的例子:
```json
{
"messages": [
{
"role": "user",
"content": "类型#裤*材质#牛仔布*风格#性感"
},
{
"role": "assistant",
"content": "3x1的这款牛仔裤采用浅白的牛仔面料为裤身材质,其柔然的手感和细腻的质地,在穿着舒适的同时,透露着清纯甜美的个性气质。除此之外,流畅的裤身剪裁将性感的腿部曲线彰显的淋漓尽致,不失为一款随性出街的必备单品。"
}
]
}
```
这是一个带有工具调用的例子:
```json
{
"messages": [
{
"role": "system",
"content": "",
"tools": [
{
"type": "function",
"function": {
"name": "get_recommended_books",
"description": "Get recommended books based on user's interests",
"parameters": {
"type": "object",
"properties": {
"interests": {
"type": "array",
"items": {
"type": "string"
},
"description": "The interests to recommend books for"
}
},
"required": [
"interests"
]
}
}
}
]
},
{
"role": "user",
"content": "Hi, I am looking for some book recommendations. I am interested in history and science fiction."
},
{
"role": "assistant",
"content": "{\"name\": \"get_recommended_books\", \"arguments\": {\"interests\": [\"history\", \"science fiction\"]}}"
},
{
"role": "observation",
"content": "{\"books\": [\"Sapiens: A Brief History of Humankind by Yuval Noah Harari\", \"A Brief History of Time by Stephen Hawking\", \"Dune by Frank Herbert\", \"The Martian by Andy Weir\"]}"
},
{
"role": "assistant",
"content": "Based on your interests in history and science fiction, I would recommend the following books: \"Sapiens: A Brief History of Humankind\" by Yuval Noah Harari, \"A Brief History of Time\" by Stephen Hawking, \"Dune\" by Frank Herbert, and \"The Martian\" by Andy Weir."
}
]
}
```
{"messages": [{"role": "system", "content": "", "tools": [{"type": "function", "function": {"name": "get_recommended_books", "description": "Get recommended books based on user's interests", "parameters": {"type": "object", "properties": {"interests": {"type": "array", "items": {"type": "string"}, "description": "The interests to recommend books for"}}, "required": ["interests"]}}}]}, {"role": "user", "content": "Hi, I am looking for some book recommendations. I am interested in history and science fiction."}, {"role": "assistant", "content": "{\"name\": \"get_recommended_books\", \"arguments\": {\"interests\": [\"history\", \"science fiction\"]}}"}, {"role": "observation", "content": "{\"books\": [\"Sapiens: A Brief History of Humankind by Yuval Noah Harari\", \"A Brief History of Time by Stephen Hawking\", \"Dune by Frank Herbert\", \"The Martian by Andy Weir\"]}"}, {"role": "assistant", "content": "Based on your interests in history and science fiction, I would recommend the following books: \"Sapiens: A Brief History of Humankind\" by Yuval Noah Harari, \"A Brief History of Time\" by Stephen Hawking, \"Dune\" by Frank Herbert, and \"The Martian\" by Andy Weir."}]}
这是一个视觉VQA微调的例子:
```json
{
"messages": [
{
"role": "user",
"content": "图片中的动物是什么?",
"image": "/root/images/0001.jpg"
},
{
"role": "assistant",
"content": "图片中有一只猫。"
},
{
"role": "user",
"content": "图片中的猫在做什么?"
},
{
"role": "assistant",
"content": "这只猫坐在或站在桌子上,桌上有很多食物。"
}
]
}
```
- `system` 角色为可选角色,但若存在 `system` 角色,其必须出现在 `user`
......@@ -109,13 +210,15 @@ pip install -r requirements.txt
- `tools` 字段为可选字段,若存在 `tools` 字段,其必须出现在 `system`
角色之后,且一个完整的对话数据(无论单轮或者多轮对话)只能出现一次 `tools` 字段。当 `tools` 字段存在时,`system`
角色必须存在并且 `content` 字段为空。
- `GLM-4V-9B` 不支持 `tools` 字段和 `system` 字段。并且 `image` 必须放在第一条消息中,`image`
  字段需要放置图片的 `绝对路径`。
## 配置文件
微调配置文件位于 `config` 目录下,包括以下文件:
1. `ds_zero_2.json / ds_zero_3.json`: deepspeed 配置文件。
2. `lora.yaml / sft.yaml`: 模型不同方式的配置文件,包括模型参数、优化器参数、训练参数等。 部分重要参数解释如下:
+ data_config 部分
+ train_file: 训练数据集的文件路径。
+ val_file: 验证数据集的文件路径。
......@@ -154,18 +257,34 @@ pip install -r requirements.txt
## 开始微调
通过以下代码执行 **单机多卡/多机多卡** 运行,这是使用 `deepspeed` 作为加速方案的,您需要安装 `deepspeed`。接着,按照此命令运行:
```shell
OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=8 finetune.py data/AdvertiseGen/ THUDM/GLM-4-9B-0414 configs/lora.yaml # For Chat Fine-tune
OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=8 finetune_vision.py data/CogVLM-311K/ THUDM/glm-4v-9b configs/lora.yaml # For VQA Fine-tune
```
通过以下代码执行 **单机单卡** 运行。
```shell
python finetune.py data/AdvertiseGen/ THUDM/GLM-4-9B-0414 configs/lora.yaml # For Chat Fine-tune
python finetune_vision.py data/CogVLM-311K/ THUDM/glm-4v-9b configs/lora.yaml # For VQA Fine-tune
```
## 日志可视化支持
微调代码支持使用SwanLab对训练指标进行可视化跟踪。可通过安装SwanLab开启跟踪:
```shell
pip install swanlab
```
可以访问[SwanLab可视化看板](https://swanlab.cn/@ShaohonChen/GLM4-Finetune)获得案例微调脚本的训练日志。
如果提示登录,可以通过访问[https://swanlab.cn/space/~/settings](https://swanlab.cn/space/~/settings)获取API Key。
如果仅使用本地看板,可在配置参数中设置 `swanlab: local`,并使用 `swanlab watch` 命令开启离线看板。
## 从保存点进行微调
如果按照上述方式进行训练,每次微调都会从头开始,如果你想从训练一半的模型开始微调,你可以加入第四个参数,这个参数有两种传入方式:
......@@ -176,23 +295,11 @@ python finetune_hf.py data/AdvertiseGen/ THUDM/glm-4-9b-chat configs/lora.yam
例如,这就是一个从最后一个保存点继续微调的示例代码
```shell
python finetune.py data/AdvertiseGen/ THUDM/GLM-4-9B-0414 configs/lora.yaml yes
```
## 使用微调后的模型
### 在 inference.py 中验证微调后的模型
您可以在 `finetune_demo/inference.py` 中使用微调后的模型,仅需要一行代码就能简单地进行测试。
```shell
python inference.py your_finetune_path
```
这样,得到的回答就是微调后的回答了。
### 在本仓库的其他 demo 或者外部仓库使用微调后的模型
您可以在任何一个 demo 内使用我们的 `LORA` 和 全参微调的模型。这需要你自己按照以下教程进行修改代码。
1. 使用`finetune_demo/inference.py`中读入模型的方式替换 demo 中读入模型的方式。
......@@ -201,34 +308,26 @@ python inference.py your_finetune_path
> 中记录了微调模型的路径,如果你的原始模型位置发生更改,则你应该修改 `adapter_config.json` 中 `base_model_name_or_path` 的路径。
```python
def load_model_and_tokenizer(model_dir: Union[str, Path]) -> tuple[ModelType, TokenizerType]:
    model_dir = _resolve_path(model_dir)
    if (model_dir / "adapter_config.json").exists():
        model = AutoPeftModelForCausalLM.from_pretrained(model_dir, device_map="auto")
        tokenizer_dir = model.peft_config["default"].base_model_name_or_path
    else:
        model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto")
        tokenizer_dir = model_dir
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
    return model, tokenizer
```
2. 读取微调的模型,请注意,你应该使用微调模型的位置,例如,若你的模型位置为`/path/to/finetune_adapter_model`
,原始模型地址为`path/to/base_model`,则你应该使用`/path/to/finetune_adapter_model`作为`model_dir`
3. 完成上述操作后,就能正常使用微调的模型了,其他的调用方式没有变化。
4. 本微调脚本没有测试过 128K、1M 等长文本的微调,长文本的微调需要更大显存的 GPU 设备,并且需要更高效的微调方案,需要开发者自行解决。
## 参考文献
```
@inproceedings{liu2022p,
title={P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks},
author={Liu, Xiao and Ji, Kaixuan and Fu, Yicheng and Tam, Weng and Du, Zhengxiao and Yang, Zhilin and Tang, Jie},
......@@ -246,5 +345,4 @@ eprint={2306.05301},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
......@@ -26,4 +26,4 @@
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
......@@ -28,4 +28,4 @@
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
}
}
......@@ -3,8 +3,13 @@ data_config:
val_file: dev.jsonl
test_file: dev.jsonl
num_proc: 1
combine: True
freezeV: True
max_input_length: 512
max_output_length: 512
# swanlab: "local" # set to local if you don't use the cloud dashboard
training_args:
# see `transformers.Seq2SeqTrainingArguments`
output_dir: ./output
......@@ -22,9 +27,10 @@ training_args:
log_level: info
logging_strategy: steps
logging_steps: 10
run_name: "glm4-lora-finetune"
# settings for evaluation
per_device_eval_batch_size: 4
eval_strategy: steps
eval_steps: 500
# settings for optimizer
# adam_epsilon: 1e-6
......@@ -35,10 +41,13 @@ training_args:
generation_config:
max_new_tokens: 512
# set your absolute deepspeed path here
# deepspeed: configs/ds_zero_3.json
deepspeed: configs/ds_zero_2.json
peft_config:
peft_type: LORA
task_type: CAUSAL_LM
r: 8
lora_alpha: 32
lora_dropout: 0.1
target_modules: ["q_proj", "k_proj", "v_proj"]
......@@ -3,8 +3,13 @@ data_config:
val_file: dev.jsonl
test_file: dev.jsonl
num_proc: 1
combine: True
freezeV: True
max_input_length: 512
max_output_length: 512
# swanlab: "local" # set to local if you don't use the cloud dashboard
training_args:
# see `transformers.Seq2SeqTrainingArguments`
output_dir: ./output
......@@ -22,9 +27,10 @@ training_args:
log_level: info
logging_strategy: steps
logging_steps: 10
run_name: "glm4-sft-finetune"
# settings for evaluation
per_device_eval_batch_size: 16
eval_strategy: steps
eval_steps: 500
# settings for optimizer
# adam_epsilon: 1e-6
......