HIP_VISIBLE_DEVICES=5 python cli_demo.py
# Fine-tune the CogVLM2 model
[中文版README](./README_zh.md)

Run this demo to fine-tune the **language model** part of CogVLM2 with LoRA.

## Note

+ This code only provides a fine-tuning example for the Hugging Face version of the `cogvlm2-llama3-chat-19B` model.
+ Only an example of fine-tuning the language model is provided.
+ Only a LoRA fine-tuning example is provided.
+ Only the dialogue (chat) model is covered.
+ Fine-tuning with `zero3` is not supported yet; with `zero3` the model may fail to load.
## Minimum configuration
- We have only tested fine-tuning on A100 GPUs with 80GB of memory. With `zero2`, fine-tuning requires at least 73GB of GPU memory per card on 8 GPUs.
- Tensor parallelism (splitting the model across multiple GPUs for fine-tuning) is not supported yet.
## Start fine-tuning
1. Download the dataset and install dependencies
In this demo, developers can use the [CogVLM-SFT-311K](https://huggingface.co/datasets/THUDM/CogVLM-SFT-311K) open-source
dataset provided by us, or build their own dataset in the same format for fine-tuning.
The data format is as follows:
+ The dataset consists of two folders, `images` and `labels` (in CogVLM-SFT-311K they are `labels_en` and `labels_zh`,
corresponding to English and Chinese labels respectively).
In the fine-tuning code, you can modify these two lines to change the folder names.
```python
self.image_dir = os.path.join(root_dir, 'images')
self.label_dir = os.path.join(root_dir, 'labels_en') # or 'labels_zh' or 'labels' can be modified by yourself
```
+ Image files are stored in the `images` folder, and the corresponding label files are stored in the `labels` folder. Image
and label file names correspond one-to-one. Image files are in `jpg` format, and label files are in `json` format.
+ Each label file contains a single conversation. The conversation consists of two roles, `user` and `assistant`, and each
turn consists of two fields, `role` and `content`, as shown below.
```
{
  "conversations": [
    {
      "role": "user",
      "content": "What can be inferred about the zebras' behavior and surroundings?"
    },
    {
      "role": "assistant",
      "content": "Based on the image, we can infer that the two zebras are likely seeking relief from the sun's heat, as they are standing side by side under the branches of a thorny tree. This shade-providing tree offers some respite from the sun, possibly during the hottest part of the day. The zebras are in a green field with grass, providing them with an ideal environment to graze and eat while staying near their source of shelter. This shows that the zebras' behavior is influenced by the conditions and available resources in their surroundings. It also highlights that these animals adopt strategies to adapt to the fluctuating conditions of their environment, such as cooperation and seeking shelter, to survive and thrive in their natural habitat."
    }
  ]
}
```
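If you build your own dataset, a minimal sketch like the one below can sanity-check the layout before training. It assumes the default `images`/`labels_en` directory names and the dataset path used by `peft_lora.py`; adjust both to your own setup.
```python
import json
import os

# Paths follow the defaults in peft_lora.py; change them to match your dataset.
root_dir = "CogVLM-SFT-311K/llava_instruction_multi_conversations_formate"
image_dir = os.path.join(root_dir, "images")
label_dir = os.path.join(root_dir, "labels_en")  # or "labels_zh" / "labels"

for name in sorted(os.listdir(image_dir)):
    label_path = os.path.join(label_dir, name.replace(".jpg", ".json"))
    # Every image must have a label file with the same base name.
    assert os.path.exists(label_path), f"missing label for {name}"
    with open(label_path, "r") as f:
        conversations = json.load(f)["conversations"]
    # Turns must alternate user/assistant and carry role/content fields.
    assert len(conversations) % 2 == 0, f"odd number of turns in {label_path}"
    assert all("role" in turn and "content" in turn for turn in conversations)
print("dataset layout looks consistent")
```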
Before starting fine-tuning, you need to install the relevant dependencies. You also need to install the dependencies listed in [basic_demo](../basic_demo/requirements.txt).
```bash
pip install -r requirements.txt
```
**Note**: `mpi4py` may require additional Linux system packages. Please install them according to your system
environment.
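For reference, `mpi4py` typically needs an MPI implementation and its development headers before the pip install succeeds. Package names vary by distribution; on Debian/Ubuntu, something along these lines is common:
```bash
sudo apt-get update
sudo apt-get install -y libopenmpi-dev openmpi-bin
pip install -r requirements.txt
```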
2. Run the fine-tuning program
We provide `peft_lora.py`, a fine-tuning script for multiple GPUs on a single machine (a single GPU also works).
You can start fine-tuning by running the following command:
```bash
deepspeed peft_lora.py --ds_config ds_config.yaml
```
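The hyperparameters of `peft_lora.py` (learning rate, batch size, LoRA rank, dataset path, and so on) are also exposed as command-line flags, and the DeepSpeed launcher's `--include` option can restrict which GPUs are used. As an illustration only, a four-GPU run with a custom learning rate might look like this:
```bash
deepspeed --include localhost:0,1,2,3 peft_lora.py \
    --ds_config ds_config.yaml \
    --dataset_path CogVLM-SFT-311K/llava_instruction_multi_conversations_formate \
    --lr 1e-5 \
    --batch_size 1 \
    --lora_rank 8 \
    --save_path output
```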
The following shows the GPU memory usage observed during fine-tuning.
Parameter information:
+ `max_input_len`: 512
+ `max_output_len`: 512
+ `batch_size_per_gpus`: 1
+ `lora_target`: vision_expert_query_key_value
GPU memory usage:
```shell
+-------------------------------------------------------------+
| Processes:                                                   |
|  GPU   GI   CI        PID   Type   Process name  GPU Memory |
|        ID   ID                                    Usage      |
|=============================================================|
|    0   N/A  N/A    704914      C   python          72442MiB |
|    1   N/A  N/A    704915      C   python          72538MiB |
|    2   N/A  N/A    704916      C   python          72538MiB |
|    3   N/A  N/A    704917      C   python          72538MiB |
|    4   N/A  N/A    704918      C   python          72538MiB |
|    5   N/A  N/A    704919      C   python          72538MiB |
|    6   N/A  N/A    704920      C   python          72538MiB |
|    7   N/A  N/A    704921      C   python          72442MiB |
+-------------------------------------------------------------+
```
While training runs, the loss is logged to TensorBoard so that you can visually monitor its convergence:
```shell
tensorboard --logdir=output
```
**Note**: We strongly recommend fine-tuning in `BF16` format to avoid the loss becoming `NaN`.
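In the provided `ds_config.yaml` (shown later in this repository), `BF16` training is already switched on through its `bf16` block:
```yaml
bf16:
  enabled: true
```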
3. Inference with the fine-tuned model
Run `peft_infer.py` to generate text with the fine-tuned model. Set the path of the fine-tuned checkpoint as required by
the configuration at the top of the script, then run:
```shell
python peft_infer.py
```
This lets you run inference with the fine-tuned model.
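Concretely, the values to adjust are the two path constants at the top of `peft_infer.py`; the checkpoint directory name below is only an example of what `peft_lora.py` writes under its `--save_path`:
```python
MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"  # base model path (the tokenizer is read from here)
PEFT_MODEL_PATH = "/output/checkpoint_epoch_0_step_50"  # example checkpoint produced by peft_lora.py
```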
# 微调 CogVLM2 模型
[Read this in English.](./README.md)
运行本demo来使用Lora微调 CogVLM2 中的**语言模型**部分。
## 注意
+ 本代码仅提供了 huggingface 版本模型 `cogvlm2-llama3-chat-19B` 的微调示例。
+ 仅提供了微调语言模型的示例。
+ 仅提供Lora微调示例。
+ 仅提供对话模型微调示例。
+ 暂不支持使用 `zero3` 微调,否则可能出现模型无法读取的情况。
## 最低配置
- 我们仅在具有80GB内存的A100 GPU上进行了微调测试。使用零冗余优化策略2(zero2)时,至少需要73GB的GPU内存,并且需要8个GPU。
- 暂不支持 Tensor 并行,即模型拆分到多张显卡微调。
## 开始微调
1. 下载数据集和安装依赖
本 demo 中,开发者可以使用由我们提供的 [CogVLM-SFT-311K](https://huggingface.co/datasets/THUDM/CogVLM-SFT-311K)
开源数据集,或自行构建相同格式的数据集进行微调。
数据格式如下:
+ 数据集由 `images` 和 `labels` 两个文件夹组成(在 CogVLM-SFT-311K 中为 `labels_en` 和 `labels_zh`,分别对应英文和中文标签)。
在微调代码中,你可以修改这两行代码来修改文件夹名称。
```python
self.image_dir = os.path.join(root_dir, 'images')
self.label_dir = os.path.join(root_dir, 'labels_en') # or 'labels_zh' or 'labels' 可以自行修改
```
+ `images` 文件夹中存放了图片文件,`labels`
文件夹中存放了对应的标签文件。图片和标签文件的名称一一对应。图片文件的格式为 `jpg`,标签文件的格式为 `json`。
+ 每个标签文件中包含了一段对话。对话由 `user` 和 `assistant` 两个角色组成,每个角色的对话内容由 `role` 和 `content`
两个字段组成,如下所示。
```
{
  "conversations": [
    {
      "role": "user",
      "content": "What can be inferred about the zebras' behavior and surroundings?"
    },
    {
      "role": "assistant",
      "content": "Based on the image, we can infer that the two zebras are likely seeking relief from the sun's heat, as they are standing side by side under the branches of a thorny tree. This shade-providing tree offers some respite from the sun, possibly during the hottest part of the day. The zebras are in a green field with grass, providing them with an ideal environment to graze and eat while staying near their source of shelter. This shows that the zebras' behavior is influenced by the conditions and available resources in their surroundings. It also highlights that these animals adopt strategies to adapt to the fluctuating conditions of their environment, such as cooperation and seeking shelter, to survive and thrive in their natural habitat."
    }
  ]
}
```
在开始微调之前,需要安装相关的依赖。请注意,你还需要安装好 [basic_demo](../basic_demo/requirements.txt) 中的依赖。
```bash
pip install -r requirements.txt
```
**注意**: `mpi4py` 可能需要安装别的 Linux 依赖包。请根据您的系统环境自行安装。
2. 运行微调程序
我们提供了使用单机多卡(包含单卡)的微调脚本 `peft_lora.py`。您可以通过运行以下命令来启动微调。
```bash
deepspeed peft_lora.py --ds_config ds_config.yaml
```
以下展示了微调过程中的显存占用情况。
参数信息:
+ `max_input_len`: 512
+ `max_output_len`: 512
+ `batch_size_per_gpus`: 1
+ `lora_target`: vision_expert_query_key_value
显存占用情况:
```shell
+-------------------------------------------------------------+
| Processes:                                                   |
|  GPU   GI   CI        PID   Type   Process name  GPU Memory |
|        ID   ID                                    Usage      |
|=============================================================|
|    0   N/A  N/A    704914      C   python          72442MiB |
|    1   N/A  N/A    704915      C   python          72538MiB |
|    2   N/A  N/A    704916      C   python          72538MiB |
|    3   N/A  N/A    704917      C   python          72538MiB |
|    4   N/A  N/A    704918      C   python          72538MiB |
|    5   N/A  N/A    704919      C   python          72538MiB |
|    6   N/A  N/A    704920      C   python          72538MiB |
|    7   N/A  N/A    704921      C   python          72442MiB |
+-------------------------------------------------------------+
```
在代码运行中,Loss数据会被 tensorboard记录,方便可视化查看Loss收敛情况。
```shell
tensorboard --logdir=output
```
**注意**: 我们强烈推荐您使用 `BF16` 格式进行微调,以避免出现 Loss 为 `NaN`的问题。
3. 推理微调后的模型
运行 `peft_infer.py`,你可以使用微调后的模型生成文本。您需要按照代码中的配置要求,配置微调后的模型地址。然后运行:
```shell
python peft_infer.py
```
即可使用微调的模型进行推理。
train_micro_batch_size_per_gpu: 1
gradient_accumulation_steps: 1
steps_per_print: 50
gradient_clipping: 1.0
zero_optimization:
  stage: 2
  contiguous_gradients: false
  overlap_comm: true
  reduce_scatter: true
  reduce_bucket_size: 1000000000
  allgather_bucket_size: 100000000
  load_from_fp32_weights: false
  round_robin_gradients: false
  offload_optimizer:
    device: cpu
    pin_memory: true
zero_allow_untested_optimizer: true
bf16:
  enabled: true
activation_checkpointing:
  partition_activations: false
  contiguous_memory_optimization: false
  cpu_checkpointing: false
wall_clock_breakdown: true
"""
This is a simple chat demo that uses a CogVLM2 PEFT fine-tuned model from the CLI.
Just replace the model loading part with the PEFT model loading code.
"""
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer
## Loading PEFT model
MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B" # The path to the base model (read tokenizer only)
PEFT_MODEL_PATH = "/output/checkpoint_epoch_0_step_50" # The path to the PEFT model
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8 else torch.float16
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    PEFT_MODEL_PATH,
    torch_dtype=TORCH_TYPE,
    trust_remote_code=True,
    device_map="auto",
).to(DEVICE).eval()
## The following code is the same as the one in basic_demo/cli_demo.py
text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"
while True:
    image_path = input("image path >>>>> ")
    if image_path == '':
        print('You did not enter image path, the following will be a plain text conversation.')
        image = None
        text_only_first_query = True
    else:
        image = Image.open(image_path).convert('RGB')

    history = []

    while True:
        query = input("Human:")
        if query == "clear":
            break

        if image is None:
            if text_only_first_query:
                query = text_only_template.format(query)
                text_only_first_query = False
            else:
                old_prompt = ''
                for _, (old_query, response) in enumerate(history):
                    old_prompt += old_query + " " + response + "\n"
                query = old_prompt + "USER: {} ASSISTANT:".format(query)

        if image is None:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                template_version='chat'
            )
        else:
            input_by_model = model.build_conversation_input_ids(
                tokenizer,
                query=query,
                history=history,
                images=[image],
                template_version='chat'
            )

        inputs = {
            'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
            'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
            'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
            'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]] if image is not None else None,
        }
        # add any transformers params here.
        gen_kwargs = {
            "max_new_tokens": 2048,
            "pad_token_id": 128002,  # avoid warning of llama3
        }
        with torch.no_grad():
            outputs = model.generate(**inputs, **gen_kwargs)
            outputs = outputs[:, inputs['input_ids'].shape[1]:]
            response = tokenizer.decode(outputs[0], skip_special_tokens=True)
            print("\nCogVLM2:", response)
        history.append((query, response))
import argparse
import gc
import json
import os
import random
import threading
import yaml
from PIL import Image
import psutil
import torch
from accelerate import Accelerator, DeepSpeedPlugin
from accelerate.utils import HfDeepSpeedConfig
from torch.utils.data import Dataset, DataLoader, random_split
from tqdm import tqdm
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    get_linear_schedule_with_warmup
)
from torch.utils.tensorboard import SummaryWriter
from peft import get_peft_model, LoraConfig, TaskType
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class ConversationDataset(Dataset):
    def __init__(self,
                 root_dir,
                 tokenizer,
                 model,
                 torch_type,
                 device='cuda',
                 input_length=1024,
                 output_length=1024
                 ):
        self.root_dir = root_dir
        self.tokenizer = tokenizer
        self.model = model
        self.image_dir = os.path.join(root_dir, 'images')
        self.label_dir = os.path.join(root_dir,
                                      'labels_en')  # can be changed to labels_en or labels_zh in the SFT-311K dataset
        self.filenames = os.listdir(self.image_dir)
        self.input_length = input_length
        self.output_length = output_length
        self.device = device
        self.torch_type = torch_type
        self.padding_len = 2303
        self.max_length = self.input_length + self.output_length + self.padding_len

    def __len__(self):
        return len(self.filenames)

    @staticmethod
    def custom_collate_fn(batch):
        batched_data = {}
        for key in batch[0].keys():
            if isinstance(batch[0][key], list):
                batched_data[key] = [batch_item[key] for batch_item in batch]
            elif isinstance(batch[0][key], torch.Tensor):
                batched_data[key] = torch.stack([item[key] for item in batch])
            else:
                raise ValueError("Unsupported datatype in custom collate_fn")
        return batched_data
    def __getitem__(self, idx):
        img_name = os.path.join(self.image_dir, self.filenames[idx])
        label_name = os.path.join(self.label_dir, self.filenames[idx].replace('.jpg', '.json'))
        image = Image.open(img_name).convert('RGB')
        with open(label_name, 'r') as f:
            label_data = json.load(f)

        num_rounds = len(label_data["conversations"]) // 2
        sampled_round_id = random.randint(0, num_rounds - 1)
        history = [(label_data["conversations"][(sampled_round_id - 1) * 2]["content"],
                    label_data["conversations"][(sampled_round_id - 1) * 2 + 1]["content"])] if (
                sampled_round_id > 0 and random.random() > 0.5) else None
        query = label_data["conversations"][sampled_round_id * 2]["content"]
        response = label_data["conversations"][sampled_round_id * 2 + 1]["content"]

        input_data = self.model.build_conversation_input_ids(
            tokenizer=self.tokenizer,
            query=query,
            history=history,
            images=[image],
            answer=response
        )

        def pad_to_len(unpadded_tensor, pad_to_length, pad_value=0):
            current_length = len(unpadded_tensor)
            if current_length >= pad_to_length:
                return unpadded_tensor[:pad_to_length]
            return torch.cat(
                (unpadded_tensor,
                 torch.full([pad_to_length - current_length],
                            fill_value=pad_value,
                            dtype=unpadded_tensor.dtype,
                            device=unpadded_tensor.device)), dim=0)

        input_data['input_ids'] = pad_to_len(
            input_data['input_ids'],
            self.max_length,
            pad_value=128002,
        )
        input_data['attention_mask'] = pad_to_len(
            input_data['attention_mask'],
            self.max_length,
            pad_value=0
        )
        input_data['token_type_ids'] = pad_to_len(
            input_data['token_type_ids'],
            self.max_length,
            pad_value=0
        )
        input_data['labels'] = pad_to_len(
            input_data['labels'],
            self.max_length,
            pad_value=-100
        )

        for data_key in input_data:
            if data_key in ['images']:
                input_data[data_key] = [data.to(self.device).to(self.torch_type) for data in
                                        input_data[data_key]]
            else:
                input_data[data_key] = input_data[data_key].to(self.device)

        return input_data
def b2mb(x):
    return int(x / 2 ** 20)


class TorchTracemalloc:
    def __enter__(self):
        gc.collect()
        torch.cuda.empty_cache()
        torch.cuda.reset_max_memory_allocated()
        self.begin = torch.cuda.memory_allocated()
        self.process = psutil.Process()
        self.cpu_begin = self.cpu_mem_used()
        self.peak_monitoring = True
        peak_monitor_thread = threading.Thread(target=self.peak_monitor_func)
        peak_monitor_thread.daemon = True
        peak_monitor_thread.start()
        return self

    def cpu_mem_used(self):
        return self.process.memory_info().rss

    def peak_monitor_func(self):
        self.cpu_peak = -1
        while True:
            self.cpu_peak = max(self.cpu_mem_used(), self.cpu_peak)
            if not self.peak_monitoring:
                break

    def __exit__(self, *exc):
        self.peak_monitoring = False
        gc.collect()
        torch.cuda.empty_cache()
        self.end = torch.cuda.memory_allocated()
        self.peak = torch.cuda.max_memory_allocated()
        self.used = b2mb(self.end - self.begin)
        self.peaked = b2mb(self.peak - self.begin)
        self.cpu_end = self.cpu_mem_used()
        self.cpu_used = b2mb(self.cpu_end - self.cpu_begin)
        self.cpu_peaked = b2mb(self.cpu_peak - self.cpu_begin)
def main():
    parser = argparse.ArgumentParser(description="Finetune a CogVLM model with LoRA")
    parser.add_argument("--lr", type=float, default=1e-7, help="Learning rate")
    parser.add_argument("--num_epochs", type=int, default=5, help="Number of epochs")
    parser.add_argument("--batch_size", type=int, default=2, help="Batch size")
    parser.add_argument("--torch_type", type=str, default="torch.bfloat16", help="Torch type")
    parser.add_argument("--save_step", type=int, default=100, help="Steps between checkpoints")
    parser.add_argument("--train_dataset_rate", type=float, default=0.8,
                        help="Proportion of dataset to use for training")
    parser.add_argument("--local_rank", type=int, default=-1, help="Local rank for distributed training")
    parser.add_argument("--lora_rank", type=int, default=8, help="Rank parameter for LoRA")
    parser.add_argument("--lora_alpha", type=int, default=32, help="Alpha parameter for LoRA")
    parser.add_argument("--lora_target", type=str, default=["vision_expert_query_key_value"],
                        help="Finetune Target for LoRA")  # you can change the target to other modules such as "language_expert_query_key_value"
    parser.add_argument("--lora_dropout", type=float, default=0.1, help="Dropout rate for LoRA")
    parser.add_argument("--warmup_steps", type=int, default=0,
                        help="Number of warmup steps for learning rate scheduler")
    parser.add_argument("--max_input_len", type=int, default=128, help="Maximum input length")
    parser.add_argument("--max_output_len", type=int, default=128, help="Maximum output length")
    parser.add_argument("--model_path", type=str,
                        default="THUDM/cogvlm2-llama3-chat-19B",
                        help="Path to the pretrained model")
    parser.add_argument("--dataset_path", type=str,
                        default="CogVLM-SFT-311K/llava_instruction_multi_conversations_formate",
                        help="Path to the conversation dataset")
    parser.add_argument("--save_path", type=str, default="output",
                        help="Path to save the finetuned model, must be an existing directory")
    parser.add_argument("--ds_config", type=str, default="ds_config.yaml",
                        help="DeepSpeed configuration file path")
    args = parser.parse_args()
    args.torch_type = eval(args.torch_type)
    with open(args.ds_config) as f:
        ds_config = yaml.safe_load(f)
    hf_ds_config = HfDeepSpeedConfig(ds_config)
    ds_plugin = DeepSpeedPlugin(hf_ds_config=hf_ds_config)
    accelerator = Accelerator(deepspeed_plugin=ds_plugin)

    tokenizer = AutoTokenizer.from_pretrained(args.model_path, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(args.model_path, torch_dtype=args.torch_type, trust_remote_code=True)
    if len(tokenizer) != model.get_input_embeddings().weight.size(0):
        model.resize_token_embeddings(len(tokenizer))

    dataset = ConversationDataset(
        root_dir=args.dataset_path,
        tokenizer=tokenizer,
        model=model,
        torch_type=args.torch_type,
        input_length=args.max_input_len,
        output_length=args.max_output_len
    )
    train_size = int(args.train_dataset_rate * len(dataset))
    val_size = len(dataset) - train_size
    train_dataset, val_dataset = random_split(dataset, [train_size, val_size])
    train_dataloader = DataLoader(
        train_dataset,
        batch_size=args.batch_size,
        shuffle=True,
        collate_fn=dataset.custom_collate_fn,
    )
    eval_dataloader = DataLoader(
        val_dataset,
        batch_size=args.batch_size,
        shuffle=True,
        collate_fn=dataset.custom_collate_fn,
    )

    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        inference_mode=False,
        r=args.lora_rank,
        target_modules=args.lora_target,
        lora_alpha=args.lora_alpha,
        lora_dropout=args.lora_dropout,
    )
    model = get_peft_model(model, peft_config)

    optimizer = torch.optim.AdamW(model.parameters(), lr=args.lr)
    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer=optimizer,
        num_warmup_steps=args.warmup_steps,
        num_training_steps=(len(train_dataloader) * args.num_epochs),
    )

    model, train_dataloader, eval_dataloader, optimizer, lr_scheduler = accelerator.prepare(
        model, train_dataloader, eval_dataloader, optimizer, lr_scheduler
    )
    logger.info("Preparation done. Starting training...")
    writer = SummaryWriter(log_dir=args.save_path)
    for epoch in range(args.num_epochs):
        model.train()
        total_loss = 0.0
        for step, batch in enumerate(tqdm(train_dataloader)):
            outputs = model(
                input_ids=batch['input_ids'],
                token_type_ids=batch['token_type_ids'],
                attention_mask=batch['attention_mask'],
                images=batch['images'],
                labels=batch['labels']
            )
            loss = outputs.loss
            total_loss += loss.detach().float()
            accelerator.backward(loss)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

            if (step + 1) % args.save_step == 0:
                print(f"Epoch {epoch}, Step {step + 1}, Loss {loss.item()}")
                checkpoint_path = os.path.join(args.save_path, f'checkpoint_epoch_{epoch}_step_{step + 1}')
                model.save_pretrained(
                    save_directory=checkpoint_path,
                    safe_serialization=True
                )
            writer.add_scalar('Train/Loss', loss.item(), epoch * len(train_dataloader) + step)

        total_loss = accelerator.gather(total_loss)
        avg_loss = total_loss.mean().item() / len(train_dataloader)
        train_ppl = torch.exp(torch.tensor(avg_loss))
        writer.add_scalar('Train/Epoch_Loss', avg_loss, epoch)
        writer.add_scalar('Train/Perplexity', train_ppl, epoch)
        accelerator.print(f"Epoch {epoch}: Average Loss {avg_loss:.4f}, Perplexity {train_ppl:.4f}")
        model.eval()
        eval_loss = 0.0
        for _, batch in enumerate(tqdm(eval_dataloader)):
            inputs = {
                'input_ids': batch['input_ids'],
                'token_type_ids': batch['token_type_ids'],
                'attention_mask': batch['attention_mask'],
                'images': batch['images']
            }
            labels = batch['labels'].to(accelerator.device)
            with torch.no_grad():
                outputs = accelerator.unwrap_model(model)(
                    input_ids=inputs['input_ids'],
                    token_type_ids=inputs['token_type_ids'],
                    attention_mask=inputs['attention_mask'],
                    images=inputs['images'],
                    labels=labels
                )
                loss = outputs.loss
                eval_loss += loss.detach().float()

        eval_loss = accelerator.gather(eval_loss)
        avg_eval_loss = eval_loss.mean().item()
        writer.add_scalar('Eval/Perplexity', torch.exp(torch.tensor(avg_eval_loss)), epoch)
        writer.add_scalar('Eval/Epoch_Loss', avg_eval_loss, epoch)

    checkpoint_path = os.path.join(args.save_path, 'final_model')
    model.save_pretrained(
        save_directory=checkpoint_path,
        safe_serialization=True
    )


if __name__ == "__main__":
    main()
peft>=0.10.0
deepspeed>=0.14.2
mpi4py>=3.1.4
tensorboard>=2.16.2
# 模型唯一标识
modelCode = 857
# 模型名称
modelName=CogVLM2_pytorch
# 模型描述
modelDescription=CogVLM2是一个开源的多模态大型语言模型,旨在缩小开源模型与商业专有模型在多模态理解方面的能力差距,可用于OCR、视频理解、文档问答。
# 应用场景
appScenario=推理,OCR,金融,教育,交通,政府
# 框架类型
frameType=pytorch
<div align="center">
<img src="wechat.jpg" width="60%"/>
<p> 扫码关注公众号,加入「CogVLM交流群」 </p>
<p> Scan the QR code to follow the official account and join the "CogVLM Discussion Group" </p>
</div>
<?xml version="1.0" encoding="utf-8"?>
<!-- Generator: Adobe Illustrator 28.2.0, SVG Export Plug-In . SVG Version: 6.00 Build 0) -->
<svg version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" x="0px" y="0px"
viewBox="0 0 841.89 368.6" style="enable-background:new 0 0 841.89 368.6;" xml:space="preserve">
<style type="text/css">
.st0{fill:#0039C6;}
</style>
<g id="图层_2">
<g>
<g>
<g>
<path class="st0" d="M248.93,129.06c-5.91,0-10.73,4.81-10.73,10.73c0,5.91,4.81,10.73,10.73,10.73
c5.91,0,10.73-4.81,10.73-10.73C259.66,133.87,254.85,129.06,248.93,129.06z M253.8,145.35c-3,0-5.44-2.44-5.44-5.44
c0-3,2.44-5.44,5.44-5.44c3,0,5.44,2.44,5.44,5.44C259.24,142.91,256.8,145.35,253.8,145.35z"/>
</g>
<g>
<path class="st0" d="M254.2,136.94c-1.57,0-2.85,1.28-2.85,2.85c0,1.57,1.28,2.85,2.85,2.85c1.57,0,2.85-1.28,2.85-2.85
C257.05,138.21,255.77,136.94,254.2,136.94z"/>
</g>
</g>
<g>
<path class="st0" d="M208.68,231.25c0.17-0.19,0.04-0.49-0.22-0.5c-2.58-0.04-12.29-0.76-17.26-8.87
c-2.87-4.67-3.34-10.48-1.21-15.56c0.15-0.36,0.31-0.71,0.48-1.06c0.04-0.09,0.04-0.19-0.01-0.27l-8.24-14.84
c-0.07-0.13-0.22-0.17-0.36-0.14c-16.55,3.25-26.19,17.25-23.27,31.06c0.79,3.75,2.34,7.08,4.74,10.14
c6.13,7.79,15.24,10.33,21.93,10.13C199.11,240.92,206.82,233.34,208.68,231.25z"/>
<path class="st0" d="M132,154.09c5.91-7.11,12.74-13.01,20.39-17.65l-13.3-23.96c-0.85-1.53-2.46-2.47-4.21-2.45l-39.43,0.33
c-1.93,0.02-3.14,2.09-2.2,3.78l29.7,53.49C126.3,161.42,129.69,156.87,132,154.09z"/>
<path class="st0" d="M236.57,187.09c-3.46,0.07-6.76-0.08-8.56,0.22c-1.5,0.25-5.39,0.99-9.35,4.7c-0.17,0.16-0.1,0.45,0.13,0.51
c2.7,0.71,5.19,1.94,7.61,3.41c5.17,3.14,9.88,11.04,11.34,14.76c0.09,0.24,0.42,0.26,0.54,0.03l14.1-26.81
c0.13-0.25-0.12-0.52-0.38-0.42C245.4,186.18,239.91,187.02,236.57,187.09z"/>
<path class="st0" d="M287.13,109.87l-37.27,0.3c-1.74,0.01-3.34,0.98-4.16,2.51l-1.03,2.21c2.98-0.07,19.89,1.25,34.09,18.93
l10.64-20.23C290.3,111.9,289.06,109.86,287.13,109.87z"/>
<path class="st0" d="M181.85,273.71l6.91,12.44c0.94,1.69,3.34,1.76,4.37,0.13l10.74-16.92
C196.95,271.98,189.37,273.52,181.85,273.71z"/>
</g>
<path class="st0" d="M275.98,141.57c-11.98-19.96-28.93-19.62-28.93-19.62c-7.8-1.52-11.78-10.35-29.73-14.73
c-16.78-4.1-26.14,0.92-19.42,3.7c10.84,4.49,17.02,22.38,20.19,35.86c4.05,17.22,12.86,17.54,22.61,16.49
c7.58-0.81,22.75-8.68,30.51-10.3C279.68,151.2,282.32,152.14,275.98,141.57z M248.33,154.19c-7.95,0-14.4-6.45-14.4-14.4
c0-7.95,6.45-14.4,14.4-14.4c7.95,0,14.4,6.45,14.4,14.4C262.73,147.74,256.28,154.19,248.33,154.19z"/>
<path class="st0" d="M227.8,229.49"/>
<g>
<path class="st0" d="M197.52,207.04c-2.14,5.11,0.26,12.88,10.79,13.58c0.65,0,0.98-0.44,0.98-0.98c0-0.54-0.47-0.8-0.98-0.98
c-0.51-0.18-5.36-1.85-4.3-7.53c0.35-1.89,2.04-3.59,4.11-4.37c-1.17-2.17-1.18-5.1,0.48-7.91
C205.79,198.84,200.28,200.43,197.52,207.04z"/>
<path class="st0" d="M230.96,217.37c0-5.61-3.7-11.56-6.6-13.45c-2.52-0.87-6.69,0.5-9.04,4c2.76,1.8,4.53,4.69,4.83,12.13
c0.38,9.63-6.91,17.91-16.68,22.63C217.93,239.9,230.96,228,230.96,217.37z"/>
<path class="st0" d="M143.77,233.5c-27.87-35.61-11.16-83.24,29.46-99.36c-50.57,17.04-65.67,72.87-37.27,107.96
c18.85,23.29,51.28,24.08,68.65,18.04C183.53,258.39,160.91,255.4,143.77,233.5z"/>
</g>
<path class="st0" d="M216.65,183.11c0.24-0.1,0.24-0.44,0-0.55c-0.49-0.23-1.44-0.41-3.18,0.03c-3.37,0.84-11.64,2.17-10.99-1.85
c0.46-2.86,5.04-3.77,8.39-9.25c1.66-2.71-0.48-7.17-4.83-7.17c-3.35,0-6.79,1.52-10.54,5.49c-5.24,5.54-6.28,10.74-4.25,14.56
c1.92,3.63,6.43,5.19,13.96,3.94C210.35,187.47,211.93,185.19,216.65,183.11z"/>
<path class="st0" d="M213.33,198.84c0,0-4.04,1.45-3.58,7.66c1.5-0.38,1.96-0.23,4,0.75c0.94-2.77,3.34-5.1,6.89-5.1
c0.64,0,1.27,0.07,1.89,0.19C220.14,200.56,217.67,199.06,213.33,198.84z"/>
<path class="st0" d="M242.58,175.96c-20.04,4.22-20.33-12.96-35.09-16.28c-19.87-4.46-45.31,7.42-56.48,29.07
c-4.54,8.8-8.82,26.05,2.32,40.74c-7-19.04,3.56-33.05,10.37-38.54c6.41-5.16,14.25-8.42,21.99-10.39
c0.13-0.03,0.22-0.15,0.22-0.28c0.13-4.18,2.26-8.65,6.29-12.91c4.27-4.52,8.65-6.71,13.37-6.71c3.46,0,6.5,1.76,8.15,4.7
c1.52,2.71,1.52,5.94,0.01,8.41c-0.9,1.47-1.87,2.67-2.83,3.66c-0.18,0.18-0.05,0.49,0.2,0.5c9.28,0.58,16.91,2.84,23.72,2.32
c25.69-1.96,34.84-15.65,38.14-18.17C275.97,159.82,253.41,173.68,242.58,175.96z"/>
<path class="st0" d="M217.22,232.36c-7.51,7.51-19.81,18.1-43.6,13.65c-17.26-3.23-30.49-27.15-26.64-56.01
c2.06-15.44,14.03-29.84,25.26-34.49c24.56-10.16,37.47,0.67,43.44,7.36c4.23,4.74,9.07,14.53,24.71,14.4
c11.94-0.1,32.62-14.15,38.36-19.45c1.96-1.81-0.18-2.81-3.43-2.52c-12.41,1.11-23.91,14.36-40.8,13.64
c-8.34-0.35-18.84-6.08-22.54-18.44c-4.45-14.86-4.54-17.96-9.65-18.84c-5.35-0.92-13.51-1.42-25.31,2.26
c-15.27,4.78-39.22,28.17-40.75,56.07c-2.65,48.25,18.91,67.59,52.15,71.69c3.76,0.46,11.99-0.54,15.65-1.33
c5.13-2.19,9.52-4.96,12.83-7.93c8.29-7.44,13.33-15.72,14.52-27.68c0.4-4.01-0.57-9.34-1.11-11.68
C230.12,215.57,225.21,224.37,217.22,232.36z"/>
</g>
<g>
<g>
<path class="st0" d="M313.84,231.39c-5.58-3.09-9.92-7.49-13.01-13.18c-3.09-5.69-4.64-12.28-4.64-19.76V166.1
c0-7.39,1.54-13.9,4.64-19.54c3.09-5.64,7.43-10.01,13.01-13.1c5.58-3.09,12.06-4.64,19.45-4.64c7.38,0,13.86,1.47,19.45,4.42
c5.58,2.95,9.92,7.09,13.01,12.43c3.09,5.34,4.64,11.51,4.64,18.49c0,0.5-0.16,0.9-0.48,1.2c-0.32,0.3-0.72,0.45-1.18,0.45
l-19.65,1.35c-1.11,0-1.66-0.55-1.66-1.65c0-4.69-1.29-8.43-3.88-11.23c-2.58-2.79-6-4.19-10.24-4.19
c-4.25,0-7.66,1.42-10.24,4.27c-2.58,2.85-3.88,6.56-3.88,11.16v33.99c0,4.59,1.29,8.29,3.88,11.08c2.58,2.8,6,4.19,10.24,4.19
c4.24,0,7.66-1.4,10.24-4.19c2.58-2.79,3.88-6.49,3.88-11.08c0-1.1,0.55-1.65,1.66-1.65l19.65,1.05c0.46,0,0.85,0.15,1.18,0.45
c0.32,0.3,0.48,0.65,0.48,1.05c0,7.09-1.55,13.33-4.64,18.72c-3.09,5.39-7.43,9.56-13.01,12.5c-5.58,2.95-12.07,4.42-19.45,4.42
C325.9,236.03,319.42,234.48,313.84,231.39z"/>
<path class="st0" d="M388.23,228.84c-5.81-4.79-9.69-11.28-11.63-19.47c-1.11-4.09-1.66-8.58-1.66-13.48
c0-5.49,0.6-10.33,1.8-14.52c2.12-7.88,6.07-14.05,11.83-18.49c5.77-4.44,12.76-6.66,20.97-6.66c8.12,0,14.99,2.22,20.62,6.66
c5.63,4.44,9.55,10.56,11.76,18.34c1.29,4.49,1.94,9.28,1.94,14.37c0,4.59-0.51,8.98-1.52,13.18
c-1.94,8.39-5.81,15.02-11.63,19.92c-5.81,4.89-12.92,7.34-21.32,7.34C401.11,236.03,394.05,233.63,388.23,228.84z
M416.05,211.99c1.75-1.85,3.05-4.37,3.88-7.56c0.55-2.6,0.83-5.44,0.83-8.54c0-2.99-0.32-5.89-0.97-8.68
c-0.74-3.09-1.99-5.49-3.74-7.19c-1.75-1.7-3.97-2.55-6.64-2.55c-5.35,0-8.86,3.24-10.52,9.73c-0.55,2.4-0.83,5.29-0.83,8.68
c0,3.1,0.28,5.94,0.83,8.54c0.74,3.2,2.01,5.72,3.81,7.56c1.8,1.85,4.04,2.77,6.71,2.77
C412.09,214.76,414.3,213.84,416.05,211.99z"/>
<path class="st0" d="M493.63,157.94c0.32-0.35,0.71-0.52,1.18-0.52h19.65c0.46,0,0.85,0.18,1.18,0.52
c0.32,0.35,0.48,0.77,0.48,1.27v66.93c0,13.68-3.65,23.46-10.93,29.35c-7.29,5.89-16.52,8.83-27.68,8.83
c-4.15,0-8.54-0.4-13.15-1.2c-0.92-0.1-1.38-0.75-1.38-1.95l0.69-18.57c0-0.7,0.18-1.17,0.55-1.42c0.37-0.25,0.83-0.28,1.38-0.07
c3.78,0.9,7.29,1.35,10.52,1.35c5.26,0,9.41-1.3,12.46-3.89c3.04-2.6,4.57-6.69,4.57-12.28l-0.97,1.05
c-3.32,3.59-8.12,5.39-14.4,5.39c-6.09,0-11.67-1.45-16.75-4.34c-5.08-2.89-8.72-7.99-10.93-15.27
c-1.48-4.79-2.21-10.68-2.21-17.67c0-7.69,0.88-13.97,2.63-18.87c2.12-6.19,5.51-11.13,10.17-14.82
c4.66-3.69,10.08-5.54,16.26-5.54c6.64,0,11.76,2.2,15.36,6.59c0.18,0.2,0.37,0.28,0.55,0.22c0.18-0.05,0.28-0.22,0.28-0.52
v-3.29C493.15,158.72,493.31,158.29,493.63,157.94z M493.15,195.3c0-2.7-0.09-4.82-0.28-6.36c-0.19-1.55-0.55-3.02-1.11-4.42
c-0.74-2.19-1.92-3.92-3.53-5.17c-1.62-1.25-3.58-1.87-5.88-1.87c-4.34,0-7.43,2.35-9.27,7.04c-1.38,2.8-2.08,6.49-2.08,11.08
c0,4.89,0.6,8.49,1.8,10.78c0.83,2.1,2.08,3.79,3.74,5.09c1.66,1.3,3.64,1.95,5.95,1.95c4.71,0,7.84-2.29,9.41-6.89
C492.73,204.23,493.15,200.49,493.15,195.3z"/>
<path class="st0" d="M549.07,233.33l-27.96-101.22l-0.14-0.6c0-1,0.51-1.5,1.52-1.5h21.18c1.01,0,1.66,0.5,1.94,1.5l15.92,68.13
c0.09,0.3,0.23,0.45,0.41,0.45c0.18,0,0.32-0.15,0.42-0.45l15.64-68.13c0.28-1,0.92-1.5,1.94-1.5h20.76c0.55,0,0.97,0.2,1.25,0.6
c0.28,0.4,0.32,0.9,0.14,1.5L573.7,233.33c-0.28,1-0.88,1.5-1.8,1.5h-21.04C549.94,234.83,549.34,234.33,549.07,233.33z"/>
<path class="st0" d="M608.51,234.3c-0.32-0.35-0.48-0.77-0.48-1.27V131.81c0-0.5,0.16-0.92,0.48-1.27
c0.32-0.35,0.71-0.52,1.18-0.52h19.65c0.46,0,0.85,0.18,1.18,0.52c0.32,0.35,0.48,0.77,0.48,1.27v81.01
c0,0.5,0.23,0.75,0.69,0.75h44.15c0.46,0,0.85,0.18,1.18,0.52c0.32,0.35,0.48,0.77,0.48,1.27v17.67c0,0.5-0.16,0.92-0.48,1.27
c-0.32,0.35-0.72,0.52-1.18,0.52h-66.16C609.23,234.83,608.83,234.66,608.51,234.3z"/>
<path class="st0" d="M747.27,130.02h19.52c0.46,0,0.85,0.18,1.18,0.52c0.32,0.35,0.48,0.77,0.48,1.27v101.22
c0,0.5-0.16,0.92-0.48,1.27c-0.32,0.35-0.72,0.52-1.18,0.52h-19.65c-0.46,0-0.85-0.17-1.18-0.52c-0.32-0.35-0.48-0.77-0.48-1.27
v-60.19c0-0.4-0.09-0.6-0.28-0.6c-0.19,0-0.37,0.15-0.55,0.45l-11.9,20.66c-0.37,0.8-1.02,1.2-1.94,1.2h-9.83
c-0.92,0-1.57-0.4-1.94-1.2l-12.04-20.81c-0.19-0.3-0.37-0.45-0.55-0.45c-0.19,0-0.28,0.2-0.28,0.6v60.34
c0,0.5-0.16,0.92-0.48,1.27c-0.32,0.35-0.72,0.52-1.18,0.52h-19.65c-0.46,0-0.85-0.17-1.18-0.52c-0.32-0.35-0.48-0.77-0.48-1.27
V131.81c0-0.5,0.16-0.92,0.48-1.27c0.32-0.35,0.71-0.52,1.18-0.52h19.52c0.83,0,1.47,0.4,1.94,1.2l19.24,32.64
c0.28,0.6,0.55,0.6,0.83,0l18.96-32.64C745.7,130.42,746.34,130.02,747.27,130.02z"/>
</g>
</g>
</g>
<g id="图层_1">
</g>
</svg>