Revert "Merge branch 'master' into 'master'"

This reverts merge request !2

Revert "Merge branch 'master' into 'master'"
This reverts merge request !2
0e1045f0 · lvzhen · 467ec853 · 0e1045f0 · 0e1045f0 · 0e1045f0
Commit 0e1045f0 authored May 10, 2024 by lvzhen
20 changed files
--- a/cookbook/finetune_muti_classfication.ipynb
+++ b/cookbook/finetune_muti_classfication.ipynb
--- a/finetune_basemodel_demo/README.md
+++ b/finetune_basemodel_demo/README.md
+# ChatGLM3-6B-base 微调示例
+
+本目录提供 ChatGLM3-6B-base 模型的微调示例，目前，仅包含了Lora微调。
+
+如果将模型下载到了本地，本文和代码中的 `THUDM/chatglm3-6b-base` 字段均应替换为相应地址以从本地加载模型。
+
+运行示例需要 `python>=3.10`，除基础的 `torch` 依赖外，示例代码运行还需要依赖 
+
+```bash
+pip install requirements.txt
+```
+## 多轮对话格式
+
+`base`模型不具备对话能力，仅能够生成单轮回复。如果你希望使用多轮对话模型，使用`Chat`模型进行微调。
+
+
+## 数据集要求
+
+格式上，请使用`alpaca`数据集。
+
+```bash
+
+{"context": "hello", "target": "hi,I am ChatGLM3"}
+
+```
+其中，`context`是对话的上文，也就是模型的输入，`target`是对话的下文，也就是模型的输出。
+
+## 微调模型
+
+以下脚本提供了微调模型的参考方式。
+```bash
+./scripts/finetune_lora.sh  # 使用Lora微调
+```
+
+### 提示
+
+1. 微调代码在开始训练前，会先打印首条训练数据的预处理信息，显示为
+
+    ```log
+    Sanity Check >>>>>>>>>>>>>
+             '[gMASK]':  64790 ->   -100
+                 'sop':  64792 ->   -100
+          '<|system|>':  64794 ->   -100
+                    '':  30910 ->   -100
+                  '\n':     13 ->   -100
+              'Answer':  20115 ->   -100
+                 'the':    267 ->   -100
+           'following':   1762 ->   -100
+                      ...
+                'know':    683 ->   -100
+                 'the':    267 ->   -100
+            'response':   3010 ->   -100
+             'details':   3296 ->   -100
+                   '.':  30930 ->   -100
+       '<|assistant|>':  64796 ->   -100
+                    '':  30910 ->  30910
+                  '\n':     13 ->     13
+                   'I':    307 ->    307
+                'need':    720 ->    720
+                  'to':    289 ->    289
+                 'use':    792 ->    792
+                      ...
+                    '':      0 ->   -100
+                    '':      0 ->   -100 （有若干个）
+    <<<<<<<<<<<<< Sanity Check
+    ```
+
+    字样，每行依次表示一个 detokenized string, token_id 和 target_id。可在日志中查看这部分的 `loss_mask` 是否符合预期。若不符合，可能需要调整代码或数据。
+2. 参考显存用量
+   - 按照官方脚本的默认参数运行，每一张显卡占用显存为 `23GB`。
+3. 若尝试后发现显存不足，可以考虑
+    - 尝试降低 `DEV_BATCH_SIZE` 并提升 `GRAD_ACCUMULARION_STEPS`
+    - 尝试降低 `MAX_SEQ_LEN`，但是这可能会影响模型的性能
+
+## 注意事项
+ 基座模型不具备对话能力，仅能够生成单轮回复。如果你希望使用多轮对话模型，使用Chat模型进行微调。
+ 请注意，运行本脚本，你还需要安装本目录下的 `requirements.txt` 中的所有内容。
--- a/finetune_basemodel_demo/arguments.py
+++ b/finetune_basemodel_demo/arguments.py
+from dataclasses import dataclass, field
+from typing import Optional
+
+
+@dataclass
+class ModelArguments:
+    """
+    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
+    """
+
+    model_name_or_path: str = field(
+        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
+    )
+
+    lora_checkpoint: str = field(
+        default=None, metadata={"help": "Path to lora checkpoints"}
+    )
+    config_name: Optional[str] = field(
+        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
+    )
+    tokenizer_name: Optional[str] = field(
+        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
+    )
+    cache_dir: Optional[str] = field(
+        default=None,
+        metadata={"help": "Where to store the pretrained models downloaded from huggingface.co"},
+    )
+    use_fast_tokenizer: bool = field(
+        default=True,
+        metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
+    )
+    model_revision: str = field(
+        default="main",
+        metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
+    )
+    use_auth_token: bool = field(
+        default=False,
+        metadata={
+            "help": (
+                "Will use the token generated when running `huggingface-cli login` (necessary to use this script "
+                "with private models)."
+            )
+        },
+    )
+    resize_position_embeddings: Optional[bool] = field(
+        default=None,
+        metadata={
+            "help": (
+                "Whether to automatically resize the position embeddings if `max_source_length` exceeds "
+                "the model's position embeddings."
+            )
+        },
+    )
+    quantization_bit: Optional[int] = field(
+        default=None,
+        metadata={
+            "help": (
+                "An optional parameter specifying the number of bits used for quantization. "
+                "Quantization is a process that reduces the model size by limiting the number of "
+                "bits that represent each weight in the model. A lower number of bits can reduce "
+                "the model size and speed up inference, but might also decrease model accuracy. "
+                "If not set (None), quantization is not applied."
+            )
+        },
+    )
+
+    lora_rank: Optional[int] = field(
+        default=8,
+        metadata={
+            "help": (
+                "balancing between complexity and model flexibility. A higher rank allows more "
+                "complex adaptations but increases the number of parameters and computational cost."
+            )
+        },
+    )
+    lora_alpha: Optional[float] = field(
+        default=32,
+        metadata={
+            "help": (
+                "A higher value results in more significant adjustments, potentially improving adaptation to new tasks or data, "
+                "but might also risk overfitting. A lower value makes smaller adjustments, possibly maintaining better generalization."
+            )
+        }, )
+
+    lora_dropout: Optional[float] = field(
+        default=0.1,
+        metadata={
+            "help": (
+                "during training to prevent the model from overly relying on specific patterns in the training data. "
+                "Higher dropout rates can improve model generalization but may reduce learning efficiency."
+            )
+        },
+    )
+
+
+@dataclass
+class DataTrainingArguments:
+    """
+    Arguments pertaining to what data we are going to input our model for training and eval.
+    """
+    train_file: Optional[str] = field(
+        default=None, metadata={"help": "The input training data file (a jsonlines or csv file)."}
+    )
+
+    max_seq_length: Optional[int] = field(
+        default=2048,
+        metadata={
+            "help": (
+                "The maximum total input sequence length after tokenization. Sequences longer "
+                "than this will be truncated."
+            )
+        },
+    )
+
+    max_source_length: Optional[int] = field(
+        default=1024,
+        metadata={
+            "help": (
+                "The maximum total input sequence length after tokenization. Sequences longer "
+                "than this will be truncated, sequences shorter will be padded."
+            )
+        },
+    )
+    max_target_length: Optional[int] = field(
+        default=128,
+        metadata={
+            "help": (
+                "The maximum total sequence length for target text after tokenization. Sequences longer "
+                "than this will be truncated, sequences shorter will be padded."
+            )
+        },
+    )
+
+    train_format: str = field(
+        default=None, metadata={"help": "The format of the training data file (mulit-turn or input-output)"},
+    )
+
+    overwrite_cache: bool = field(
+        default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
+    )
+
+    preprocessing_num_workers: Optional[int] = field(
+        default=None,
+        metadata={"help": "The number of processes to use for the preprocessing."},
+    )
+
+    max_seq_length: Optional[int] = field(
+        default=1024,
+        metadata={
+            "help": (
+                "The maximum total input sequence length after tokenization. Sequences longer "
+                "than this will be truncated, sequences shorter will be padded."
+            )
+        },
+    )
+
+    pad_to_max_length: bool = field(
+        default=False,
+        metadata={
+            "help": (
+                "Whether to pad all samples to model maximum sentence length. "
+                "If False, will pad the samples dynamically when batching to the maximum length in the batch. More "
+                "efficient on GPU but very bad for TPU."
+            )
+        },
+    )
+
+    max_train_samples: Optional[int] = field(
+        default=None,
+        metadata={
+            "help": (
+                "For debugging purposes or quicker training, truncate the number of training examples to this "
+                "value if set."
+            )
+        },
+    )
+
+    def __post_init__(self):
+        extension = self.train_file.split(".")[-1]
+        assert extension in {"jsonl", "json"}, "`train_file` should be a jsonl or a json file."
+
+        assert self.train_format in {"multi-turn", "input-output"}
--- a/finetune_basemodel_demo/finetune.py
+++ b/finetune_basemodel_demo/finetune.py
+#!/usr/bin/env python
+# coding=utf-8
+# Copyright 2021 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Fine-tuning the library models for sequence to sequence.
+"""
+# You can also adapt this script on your own sequence to sequence task. Pointers for this are left as comments.
+# Adapted from
+
+
+import logging
+import os
+import sys
+import torch
+import json
+import transformers
+from transformers import (
+    AutoModel,
+    AutoTokenizer,
+    DataCollatorForSeq2Seq,
+    HfArgumentParser,
+    Seq2SeqTrainingArguments,
+    set_seed,
+)
+from trainer import LoRATrainer
+from arguments import ModelArguments, DataTrainingArguments
+from peft import get_peft_model, LoraConfig, TaskType, prepare_model_for_kbit_training
+from preprocess_utils import sanity_check, InputOutputDataset
+
+
+logger = logging.getLogger(__name__)
+
+
+class CastOutputToFloat(torch.nn.Module):
+    def __init__(self, layer):
+        super().__init__()
+        self.layer = layer
+
+    def forward(self, *args, **kwargs):
+        return self.layer(*args, **kwargs).float()
+
+
+def main():
+    parser = HfArgumentParser((ModelArguments, DataTrainingArguments, Seq2SeqTrainingArguments))
+    if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
+        # If we pass only one argument to the script and it's the path to a json file,
+        # let's parse it to get our arguments.
+        model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
+    else:
+        model_args, data_args, training_args = parser.parse_args_into_dataclasses()
+    # Setup logging
+    logging.basicConfig(
+        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
+        datefmt="%m/%d/%Y %H:%M:%S",
+        handlers=[logging.StreamHandler(sys.stdout)],
+    )
+
+    if training_args.should_log:
+        # The default of training_args.log_level is passive, so we set log level at info here to have that default.
+        transformers.utils.logging.set_verbosity_info()
+
+    log_level = training_args.get_process_log_level()
+    logger.setLevel(log_level)
+    # datasets.utils.logging.set_verbosity(log_level)
+    transformers.utils.logging.set_verbosity(log_level)
+    transformers.utils.logging.enable_default_handler()
+    transformers.utils.logging.enable_explicit_format()
+
+    # Log on each process the small summary:
+    logger.warning(
+        f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
+        + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
+    )
+    logger.info(f"Training/evaluation parameters {training_args}")
+
+    # Set seed before initializing model.
+    set_seed(training_args.seed)
+
+    # Load pretrained model and tokenizer
+    tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, trust_remote_code=True)
+    model = AutoModel.from_pretrained(model_args.model_name_or_path, trust_remote_code=True).half().cuda()
+
+    if model_args.quantization_bit is not None:
+        print(f"Quantized to {model_args.quantization_bit} bit")
+        model = model.quantize(model_args.quantization_bit)
+
+    with open(data_args.train_file, "r", encoding="utf-8") as f:
+        if data_args.train_file.endswith(".json"):
+            train_data = json.load(f)
+        elif data_args.train_file.endswith(".jsonl"):
+            train_data = [json.loads(line) for line in f]
+
+    if data_args.train_format == "input-output":
+        train_dataset = InputOutputDataset(
+            train_data,
+            tokenizer,
+            data_args.max_source_length,
+            data_args.max_target_length,
+        )
+    else:
+        raise ValueError(f"Unknown train format: {data_args.train_format}")
+    print(f"Train dataset size: {len(train_dataset)}")
+    #if training_args.local_rank < 1:
+    sanity_check(train_dataset[0]['input_ids'], train_dataset[0]['labels'], tokenizer)
+
+    # Apply PEFT configuration
+    peft_config = LoraConfig(
+        task_type=TaskType.CAUSAL_LM,
+        inference_mode=False,
+        r=model_args.lora_rank,
+        target_modules=['query_key_value'],
+        lora_alpha=model_args.lora_alpha,
+        lora_dropout=model_args.lora_dropout,
+    )
+    model = get_peft_model(model, peft_config).to("cuda")
+
+    # 确保梯度检查点和模型并行化设置正确
+    #model.gradient_checkpointing_enable()
+    model.enable_input_require_grads()
+    model.is_parallelizable = True
+    model.model_parallel = True  # 可以尝试暂时关闭模型并行化来看是否解决问题
+    model.lm_head = CastOutputToFloat(model.transformer.output_layer)
+    model.config.use_cache = False
+
+    # Data collator
+    data_collator = DataCollatorForSeq2Seq(
+        tokenizer,
+        model=model,
+        label_pad_token_id=-100,
+        pad_to_multiple_of=None,
+        padding=False
+    )
+
+    # Initialize our Trainer
+    trainer = LoRATrainer(
+        model=model,
+        args=training_args,
+        train_dataset=train_dataset,
+        tokenizer=tokenizer,
+        data_collator=data_collator,
+    )
+    checkpoint = None
+    if training_args.resume_from_checkpoint is not None:
+        checkpoint = training_args.resume_from_checkpoint
+    # model.gradient_checkpointing_enable()
+    model.enable_input_require_grads()
+    trainer.train()
+    trainer.save_model()  # Saves the tokenizer too for easy upload
+    trainer.save_state()
+
+
+if __name__ == "__main__":
+    main()
\ No newline at end of file
--- a/finetune_basemodel_demo/inference.py
+++ b/finetune_basemodel_demo/inference.py
+import argparse
+import torch
+from transformers import AutoConfig, AutoModel, AutoTokenizer
+import os
+from peft import get_peft_model, LoraConfig, TaskType
+
+# Argument Parser Setup
+parser = argparse.ArgumentParser()
+parser.add_argument("--model", type=str, default=None,
+                    help="The directory of the model")
+parser.add_argument("--tokenizer", type=str, default=None, help="Tokenizer path")
+parser.add_argument("--lora-path", type=str, default=None,
+                    help="Path to the LoRA model checkpoint")
+parser.add_argument("--device", type=str, default="cuda", help="Device to use for computation")
+parser.add_argument("--max-new-tokens", type=int, default=128, help="Maximum new tokens for generation")
+parser.add_argument("--lora-alpha", type=float, default=32, help="LoRA alpha")
+parser.add_argument("--lora-rank", type=int, default=8, help="LoRA r")
+parser.add_argument("--lora-dropout", type=float, default=0.1, help="LoRA dropout")
+
+args = parser.parse_args()
+
+if args.tokenizer is None:
+    args.tokenizer = args.model
+
+# Model and Tokenizer Configuration
+tokenizer = AutoTokenizer.from_pretrained(args.tokenizer, trust_remote_code=True)
+model = AutoModel.from_pretrained(args.model, load_in_8bit=False, trust_remote_code=True, device_map="auto").to(
+    args.device)
+
+# LoRA Model Configuration
+peft_config = LoraConfig(
+    task_type=TaskType.CAUSAL_LM, inference_mode=True,
+    target_modules=['query_key_value'],
+    r=args.lora_rank, lora_alpha=args.lora_alpha, lora_dropout=args.lora_dropout
+)
+model = get_peft_model(model, peft_config)
+if os.path.exists(args.lora_path):
+    model.load_state_dict(torch.load(args.lora_path), strict=False)
+
+# Interactive Prompt
+while True:
+    prompt = input("Prompt: ")
+    inputs = tokenizer(prompt, return_tensors="pt").to(args.device)
+    response = model.generate(input_ids=inputs["input_ids"],
+                              max_length=inputs["input_ids"].shape[-1] + args.max_new_tokens)
+    response = response[0, inputs["input_ids"].shape[-1]:]
+    print("Response:", tokenizer.decode(response, skip_special_tokens=True))
--- a/finetune_basemodel_demo/preprocess_utils.py
+++ b/finetune_basemodel_demo/preprocess_utils.py
+from transformers import PreTrainedTokenizer
+from torch.utils.data import Dataset
+from typing import Dict, List
+
+
+def sanity_check(tokens: List[int], target: List[int], tokenizer: PreTrainedTokenizer):
+    print("Sanity Check >>>>>>>>>>>>>")
+    for t, m in zip(tokens, target):
+        decoded =  tokenizer.tokenizer.index_special_tokens[t] \
+            if t in tokenizer.tokenizer.index_special_tokens \
+            else tokenizer.decode([t])
+        print("%20s: %6d -> %6d" % (repr(decoded), t, m))
+    print("<<<<<<<<<<<<< Sanity Check")
+
+    assert len(tokens) == len(target), f"length mismatch: {len(tokens)} vs {len(target)}"
+
+
+class InputOutputDataset(Dataset):
+    def __init__(self, data: List[dict], tokenizer: PreTrainedTokenizer, max_source_length: int, max_target_length: int):
+        super(InputOutputDataset, self).__init__()
+        self.tokenizer = tokenizer
+        self.max_source_length = max_source_length
+        self.max_target_length = max_target_length
+        self.max_seq_length = max_source_length + max_target_length + 1
+        self.data = data
+
+    def __len__(self):
+        return len(self.data)
+    
+    def __getitem__(self, i) -> dict:
+        data_item = self.data[i]
+
+        a_ids = self.tokenizer.encode(text=data_item['context'], add_special_tokens=True, truncation=True,
+                                         max_length=self.max_source_length)
+        b_ids = self.tokenizer.encode(text=data_item['target'], add_special_tokens=False, truncation=True,
+                                    max_length=self.max_target_length)
+
+        context_length = len(a_ids)
+        input_ids = a_ids + b_ids + [self.tokenizer.eos_token_id]
+        labels = [self.tokenizer.pad_token_id] * context_length + b_ids + [self.tokenizer.eos_token_id]
+        
+        pad_len = self.max_seq_length - len(input_ids)
+        input_ids = input_ids + [self.tokenizer.pad_token_id] * pad_len
+        labels = labels + [self.tokenizer.pad_token_id] * pad_len
+        labels = [(l if l != self.tokenizer.pad_token_id else -100) for l in labels]
+
+        assert len(input_ids) == len(labels), f"length mismatch: {len(input_ids)} vs {len(labels)}"
+
+        return {
+            "input_ids": input_ids,
+            "labels": labels
+        }
--- a/finetune_basemodel_demo/requirements.txt
+++ b/finetune_basemodel_demo/requirements.txt
+tqdm
+datasets
+fsspec
+astunparse
+peft
+accelerate
+sentencepiece
\ No newline at end of file
--- a/finetune_basemodel_demo/scripts/finetune_lora.sh
+++ b/finetune_basemodel_demo/scripts/finetune_lora.sh
+#! /usr/bin/env bash
+
+set -ex
+
+LR=1e-4
+NUM_GPUS=4
+LORA_RANK=8
+LORA_ALPHA=32
+LORA_DROUPOUT=0.1
+
+MAX_SOURCE_LEN=512
+MAX_TARGET_LEN=128
+DEV_BATCH_SIZE=1
+GRAD_ACCUMULARION_STEPS=1
+MAX_STEP=500
+SAVE_INTERVAL=50
+MAX_SEQ_LEN=512
+
+RUN_NAME=text
+BASE_MODEL_PATH=THUDM/chatglm3-6b-base
+DATASET_PATH=data/alpaca_data.jsonl
+DATESTR=`date +%Y%m%d-%H%M%S`
+OUTPUT_DIR=output/${RUN_NAME}-${DATESTR}-${LR}
+MASTER_PORT=$(shuf -n 1 -i 10000-65535)
+
+mkdir -p $OUTPUT_DIR
+
+torchrun --standalone --nnodes=1 --nproc_per_node=$NUM_GPUS finetune.py \
+    --train_format input-output \
+    --train_file $DATASET_PATH \
+    --lora_rank $LORA_RANK \
+    --lora_alpha $LORA_ALPHA \
+    --lora_dropout $LORA_DROUPOUT \
+    --max_seq_length $MAX_SEQ_LEN \
+    --preprocessing_num_workers 1 \
+    --model_name_or_path $BASE_MODEL_PATH \
+    --output_dir $OUTPUT_DIR \
+    --per_device_train_batch_size $DEV_BATCH_SIZE \
+    --gradient_accumulation_steps $GRAD_ACCUMULARION_STEPS \
+    --max_steps $MAX_STEP \
+    --logging_steps 1 \
+    --save_steps $SAVE_INTERVAL \
+    --learning_rate $LR  2>&1 | tee ${OUTPUT_DIR}/train.log
--- a/finetune_basemodel_demo/scripts/formate_alpaca2jsonl.py
+++ b/finetune_basemodel_demo/scripts/formate_alpaca2jsonl.py
+import argparse
+import json
+import tqdm
+
+def format_example(example: dict) -> dict:
+    context = f"Instruction: {example['instruction']}\n"
+    if example.get("input"):
+        context += f"Input: {example['input']}\n"
+    context += "Answer: "
+    target = example["output"]
+    return {"context": context, "target": target}
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--data_path", type=str, default="data/alpaca_data.json")
+    parser.add_argument("--save_path", type=str, default="data/alpaca_data.jsonl")
+
+    args = parser.parse_args()
+    print("args:", args)
+    with open(args.data_path) as f:
+        examples = json.load(f)
+
+    with open(args.save_path, 'w') as f:
+        for example in tqdm.tqdm(examples, desc="formatting.."):
+            f.write(json.dumps(format_example(example), ensure_ascii=False) + '\n')
+
+if __name__ == "__main__":
+    main()
--- a/finetune_basemodel_demo/trainer.py
+++ b/finetune_basemodel_demo/trainer.py
+# coding=utf-8
+# Copyright 2020-present the HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+The Trainer class, to easily train a 🤗 Transformers from scratch or finetune it on a new task.
+"""
+import os
+from typing import Optional
+from transformers import Trainer
+
+import torch
+from transformers.modeling_utils import PreTrainedModel, unwrap_model
+from transformers.utils import logging
+
+logger = logging.get_logger(__name__)
+
+WEIGHTS_NAME = "pytorch_model.pt"
+TRAINING_ARGS_NAME = "training_args.bin"
+
+class LoRATrainer(Trainer):
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+
+    def compute_loss(self, model, inputs, return_outputs=False):
+        return model(**inputs).loss
+
+    def save_model(self, output_dir=None, _internal_call=False):
+        output_dir = output_dir if output_dir is not None else self.args.output_dir
+        os.makedirs(output_dir, exist_ok=True)
+        logger.info(f"Saving model checkpoint to {output_dir}")
+
+        model_to_save = unwrap_model(self.model)
+        saved_params = {
+            k: v.to("cuda") for k, v in model_to_save.named_parameters() if v.requires_grad
+        }
+        torch.save(saved_params, os.path.join(output_dir, WEIGHTS_NAME))
+
+        if self.tokenizer is not None:
+            self.tokenizer.save_pretrained(output_dir)
+
+        torch.save(self.args, os.path.join(output_dir, TRAINING_ARGS_NAME))
--- a/finetune_chatmodel_demo/AdvertiseGen/dev.json
+++ b/finetune_chatmodel_demo/AdvertiseGen/dev.json
--- a/finetune_chatmodel_demo/AdvertiseGen/train.json
+++ b/finetune_chatmodel_demo/AdvertiseGen/train.json
--- a/finetune_chatmodel_demo/README.md
+++ b/finetune_chatmodel_demo/README.md
+# ChatGLM3-6B 微调示例
+
+本目录提供 ChatGLM3-6B 模型的微调示例，包括全量微调和 P-Tuning v2。格式上，提供多轮对话微调样例和输入输出格式微调样例。
+
+如果将模型下载到了本地，本文和代码中的 `THUDM/chatglm3-6b` 字段均应替换为相应地址以从本地加载模型。
+
+运行示例需要 `python>=3.10`，除基础的 `torch` 依赖外，示例代码运行还需要依赖 
+
+```bash
+pip install requirements.txt
+```
+
+## 多轮对话格式
+
+多轮对话微调示例采用 ChatGLM3 对话格式约定，对不同角色添加不同 `loss_mask` 从而在一遍计算中为多轮回复计算 `loss`。
+
+### 数据格式和预处理
+
+对于数据文件，样例采用如下格式
+
+如果您仅希望微调模型的对话能力，而非工具能力，您应该按照以下格式整理数据。
+
+```json
+[
+  {
+    "conversations": [
+      {
+        "role": "system",
+        "content": "<system prompt text>"
+      },
+      {
+        "role": "user",
+        "content": "<user prompt text>"
+      },
+      {
+        "role": "assistant",
+        "content": "<assistant response text>"
+      }, 
+       // ... Muti Turn
+      {
+        "role": "user",
+        "content": "<user prompt text>"
+      },
+      {
+        "role": "assistant",
+        "content": "<assistant response text>"
+      }
+    ]
+  }
+  // ...
+]
+```
+
+**请注意，这种方法在微调的step较多的情况下会影响到模型的工具调用功能**
+
+如果您希望微调模型的对话和工具能力，您应该按照以下格式整理数据。
+
+```json
+[
+   {
+      "tools": [
+         // available tools, format is not restricted
+      ],
+      "conversations": [
+         {
+            "role": "system",
+            "content": "<system prompt text>"
+         },
+         {
+            "role": "user",
+            "content": "<user prompt text>"
+         },
+         {
+            "role": "assistant",
+            "content": "<assistant thought to text>"
+         },
+         {
+            "role": "tool",
+            "name": "<name of the tool to be called",
+            "parameters": {
+               "<parameter_name>": "<parameter_value>"
+            },
+            "observation": "<observation>"
+            // don't have to be string
+         },
+         {
+            "role": "assistant",
+            "content": "<assistant response to observation>"
+         },
+         // ... Muti Turn
+         {
+            "role": "user",
+            "content": "<user prompt text>"
+         },
+         {
+            "role": "assistant",
+            "content": "<assistant response text>"
+         }
+      ]
+   }
+   // ...
+]
+```
+
+- 关于工具描述的 system prompt 无需手动插入，预处理时会将 `tools` 字段使用 `json.dumps(..., ensure_ascii=False)` 格式化后插入为首条 system prompt。
+
+- 每种角色可以附带一个 `bool` 类型的 `loss` 字段，表示该字段所预测的内容是否参与 `loss` 计算。若没有该字段，样例实现中默认对 `system`, `user` 不计算 `loss`，其余角色则计算 `loss`。
+
+- `tool` 并不是 ChatGLM3 中的原生角色，这里的 `tool` 在预处理阶段将被自动转化为一个具有工具调用 `metadata` 的 `assistant` 角色（默认计算 `loss`）和一个表示工具返回值的 `observation` 角色（不计算 `loss`）。
+
+- 目前暂未实现 `Code interpreter`的微调任务。
+
+- `system` 角色为可选角色，但若存在 `system` 角色，其必须出现在 `user` 角色之前，且一个完整的对话数据（无论单轮或者多轮对话）只能出现一次 `system` 角色。
+
+作为示例，我们使用 ToolAlpaca 数据集来进行微调。首先，克隆 [ToolAlpaca 数据集](https://github.com/tangqiaoyu/ToolAlpaca)，并使用
+
+```bash
+./scripts/format_tool_alpaca.py --path "ToolAlpaca/data/train_data.json"
+```
+
+将数据集处理成上述格式。在这里，我们有意将工具处理成了了 `list[str]` 这样的自然语言形式，以观察模型在微调前后对工具定义的理解能力。
+
+### 微调模型
+
+以下脚本提供了微调模型的参考方式。
+
+```bash
+./scripts/finetune_ds_multiturn.sh  # 全量微调
+./scripts/finetune_pt_multiturn.sh  # P-Tuning v2 微调
+```
+
+### 部署
+
+我们更新了 ChatGLM3 的综合 Demo，使其可以部署微调后的模型 checkpoint。
+
+对于全量微调，可以使用以下方式进行部署
+
+```bash
+cd ../composite_demo
+MODEL_PATH="path to finetuned model checkpoint" TOKENIZER_PATH="THUDM/chatglm3-6b" streamlit run main.py
+```
+
+对于 P-Tuning v2 微调，可以使用以下方式进行部署
+
+```bash
+cd ../composite_demo
+MODEL_PATH="THUDM/chatglm3-6b" PT_PATH="path to p-tuning checkpoint" streamlit run main.py
+```
+
+## 输入输出格式
+
+对于输入-输出格式，样例采用如下输入格式
+
+```json
+[
+  {
+    "prompt": "<prompt text>",
+    "response": "<response text>"
+  }
+  // ...
+]
+```
+
+预处理时，不会拼接任何角色标识符。
+
+作为示例，我们使用 AdvertiseGen 数据集来进行微调。从 [Google Drive](https://drive.google.com/file/d/13_vf0xRTQsyneRKdD1bZIr93vBGOczrk/view?usp=sharing) 或者 [Tsinghua Cloud](https://cloud.tsinghua.edu.cn/f/b3f119a008264b1cabd1/?dl=1) 下载处理好的 AdvertiseGen 数据集，将解压后的 `AdvertiseGen` 目录放到本目录下。
+
+```bash
+./scripts/format_advertise_gen.py --path "AdvertiseGen/train.json"
+```
+
+来下载和将数据集处理成上述格式。
+
+### 微调模型
+
+以下脚本提供了微调模型的参考方式。
+
+```bash
+./scripts/finetune_ds.sh  # 全量微调
+./scripts/finetune_pt.sh  # P-Tuning v2 微调
+```
+
+### 推理验证
+
+对于输入输出格式的微调，可使用 `inference.py` 进行基本的推理验证。
+
+```bash
+python inference.py \
+    --pt-checkpoint "path to p-tuning checkpoint" \
+    --model THUDM/chatglm3-6b 
+```
+
+```bash
+python inference.py \
+    --tokenizer THUDM/chatglm3-6b \
+    --model "path to finetuned model checkpoint" 
+```
+
+### 提示
+
+1. 微调代码在开始训练前，会先打印首条训练数据的预处理信息，显示为
+
+    ```log
+    Sanity Check >>>>>>>>>>>>>
+             '[gMASK]':  64790 ->   -100
+                 'sop':  64792 ->   -100
+          '<|system|>':  64794 ->   -100
+                    '':  30910 ->   -100
+                  '\n':     13 ->   -100
+              'Answer':  20115 ->   -100
+                 'the':    267 ->   -100
+           'following':   1762 ->   -100
+                      ...
+                'know':    683 ->   -100
+                 'the':    267 ->   -100
+            'response':   3010 ->   -100
+             'details':   3296 ->   -100
+                   '.':  30930 ->   -100
+       '<|assistant|>':  64796 ->   -100
+                    '':  30910 ->  30910
+                  '\n':     13 ->     13
+                   'I':    307 ->    307
+                'need':    720 ->    720
+                  'to':    289 ->    289
+                 'use':    792 ->    792
+                      ...
+    <<<<<<<<<<<<< Sanity Check
+    ```
+
+    字样，每行依次表示一个 detokenized string, token_id 和 target_id。可在日志中查看这部分的 `loss_mask` 是否符合预期。若不符合，可能需要调整代码或数据。
+
+2. 参考显存用量
+
+    - P-Tuning V2 `PRE_SEQ_LEN=128`, `DEV_BATCH_SIZE=1`, `GRAD_ACCUMULARION_STEPS=16`,  `MAX_SEQ_LEN=2048` 配置下约需要 21GB 显存。
+    - 全量微调时，`./scripts/finetune_ds_multiturn.sh` 中的配置（`MAX_SEQ_LEN=2048`, `DEV_BATCH_SIZE=16`, `GRAD_ACCUMULARION_STEPS=1`）恰好用满 4 * 80GB 显存。
+
+3. 若尝试后发现显存不足，可以考虑
+    - 尝试降低 `DEV_BATCH_SIZE` 并提升 `GRAD_ACCUMULARION_STEPS`
+    - 尝试添加 `--quantization_bit 8` 或 `--quantization_bit 4`。
+        - `PRE_SEQ_LEN=128`, `DEV_BATCH_SIZE=1`, `GRAD_ACCUMULARION_STEPS=16`, `MAX_SEQ_LEN=1024` 配置下，`--quantization_bit 8` 约需 12GB 显存，`--quantization_bit 4` 约需 7.6GB 显存。
+
+## 参考文献
+
+```
+@inproceedings{liu2022p,
+  title={P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks},
+  author={Liu, Xiao and Ji, Kaixuan and Fu, Yicheng and Tam, Weng and Du, Zhengxiao and Yang, Zhilin and Tang, Jie},
+  booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
+  pages={61--68},
+  year={2022}
+}
+
+@misc{tang2023toolalpaca,
+      title={ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases}, 
+      author={Qiaoyu Tang and Ziliang Deng and Hongyu Lin and Xianpei Han and Qiao Liang and Le Sun},
+      year={2023},
+      eprint={2306.05301},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL}
+}
+```
--- a/finetune_chatmodel_demo/arguments.py
+++ b/finetune_chatmodel_demo/arguments.py
+from dataclasses import dataclass, field
+from typing import Optional
+
+
+@dataclass
+class ModelArguments:
+    """
+    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
+    """
+
+    model_name_or_path: str = field(
+        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
+    )
+    ptuning_checkpoint: str = field(
+        default=None, metadata={"help": "Path to p-tuning v2 checkpoints"}
+    )
+    config_name: Optional[str] = field(
+        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
+    )
+    tokenizer_name: Optional[str] = field(
+        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
+    )
+    cache_dir: Optional[str] = field(
+        default=None,
+        metadata={"help": "Where to store the pretrained models downloaded from huggingface.co"},
+    )
+    use_fast_tokenizer: bool = field(
+        default=True,
+        metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
+    )
+    model_revision: str = field(
+        default="main",
+        metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
+    )
+    use_auth_token: bool = field(
+        default=False,
+        metadata={
+            "help": (
+                "Will use the token generated when running `huggingface-cli login` (necessary to use this script "
+                "with private models)."
+            )
+        },
+    )
+    resize_position_embeddings: Optional[bool] = field(
+        default=None,
+        metadata={
+            "help": (
+                "Whether to automatically resize the position embeddings if `max_source_length` exceeds "
+                "the model's position embeddings."
+            )
+        },
+    )
+    quantization_bit: Optional[int] = field(
+        default=None
+    )
+    pre_seq_len: Optional[int] = field(
+        default=None
+    )
+    prefix_projection: bool = field(
+        default=False
+    )
+
+
+@dataclass
+class DataTrainingArguments:
+    """
+    Arguments pertaining to what data we are going to input our model for training and eval.
+    """
+    train_file: Optional[str] = field(
+        default=None, metadata={"help": "The input training data file (a jsonlines or csv file)."}
+    )
+
+    max_seq_length: Optional[int] = field(
+        default=2048,
+        metadata={
+            "help": (
+                "The maximum total input sequence length after tokenization. Sequences longer "
+                "than this will be truncated."
+            )
+        },
+    )
+
+    max_source_length: Optional[int] = field(
+        default=1024,
+        metadata={
+            "help": (
+                "The maximum total input sequence length after tokenization. Sequences longer "
+                "than this will be truncated, sequences shorter will be padded."
+            )
+        },
+    )
+    max_target_length: Optional[int] = field(
+        default=128,
+        metadata={
+            "help": (
+                "The maximum total sequence length for target text after tokenization. Sequences longer "
+                "than this will be truncated, sequences shorter will be padded."
+            )
+        },
+    )
+
+    train_format: str = field(
+        default=None, metadata={"help": "The format of the training data file (mulit-turn or input-output)"},
+    )
+
+    overwrite_cache: bool = field(
+        default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
+    )
+
+    preprocessing_num_workers: Optional[int] = field(
+        default=None,
+        metadata={"help": "The number of processes to use for the preprocessing."},
+    )
+
+    max_seq_length: Optional[int] = field(
+        default=1024,
+        metadata={
+            "help": (
+                "The maximum total input sequence length after tokenization. Sequences longer "
+                "than this will be truncated, sequences shorter will be padded."
+            )
+        },
+    )
+
+    pad_to_max_length: bool = field(
+        default=False,
+        metadata={
+            "help": (
+                "Whether to pad all samples to model maximum sentence length. "
+                "If False, will pad the samples dynamically when batching to the maximum length in the batch. More "
+                "efficient on GPU but very bad for TPU."
+            )
+        },
+    )
+
+    max_train_samples: Optional[int] = field(
+        default=None,
+        metadata={
+            "help": (
+                "For debugging purposes or quicker training, truncate the number of training examples to this "
+                "value if set."
+            )
+        },
+    )
+    
+    def __post_init__(self):
+        extension = self.train_file.split(".")[-1]
+        assert extension in {"jsonl", "json"}, "`train_file` should be a jsonl or a json file."
+
+        assert self.train_format in {"multi-turn", "input-output"}
\ No newline at end of file
--- a/finetune_chatmodel_demo/configs/deepspeed.json
+++ b/finetune_chatmodel_demo/configs/deepspeed.json
+{
+  "train_batch_size": "auto",
+  "train_micro_batch_size_per_gpu" :"auto",
+  "gradient_accumulation_steps": "auto",
+  "zero_allow_untested_optimizer": true,
+  "fp16": {
+    "enabled": "auto",
+    "loss_scale": 0,
+    "initial_scale_power": 16,
+    "loss_scale_window": 1000,
+    "hysteresis": 2,
+    "min_loss_scale": 1
+  },
+  "zero_optimization": {
+    "stage": 3,
+    "stage3_gather_16bit_weights_on_model_save": true,
+    "allgather_partitions": true,
+    "allgather_bucket_size": 5e8,
+    "overlap_comm": false,
+    "reduce_scatter": true,
+    "reduce_bucket_size": 5e8,
+    "contiguous_gradients" : true
+  }
+}
--- a/finetune_chatmodel_demo/finetune.py
+++ b/finetune_chatmodel_demo/finetune.py
+#!/usr/bin/env python
+# coding=utf-8
+# Copyright 2021 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Fine-tuning the library models for sequence to sequence.
+"""
+# You can also adapt this script on your own sequence to sequence task. Pointers for this are left as comments.
+# Adapted from 
+
+
+import logging
+import os
+import sys
+import torch
+import json
+import transformers
+from transformers import (
+    AutoConfig,
+    AutoModel,
+    AutoTokenizer,
+    DataCollatorForSeq2Seq,
+    HfArgumentParser,
+    Seq2SeqTrainingArguments,
+    set_seed,
+)
+from trainer import PrefixTrainer
+
+from arguments import ModelArguments, DataTrainingArguments
+
+from preprocess_utils import sanity_check, MultiTurnDataset, InputOutputDataset
+
+logger = logging.getLogger(__name__)
+# import os
+# os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"
+def main():
+    parser = HfArgumentParser((ModelArguments, DataTrainingArguments, Seq2SeqTrainingArguments))
+    if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
+        # If we pass only one argument to the script and it's the path to a json file,
+        # let's parse it to get our arguments.
+        model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
+    else:
+        model_args, data_args, training_args = parser.parse_args_into_dataclasses()
+
+    # Setup logging
+    logging.basicConfig(
+        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
+        datefmt="%m/%d/%Y %H:%M:%S",
+        handlers=[logging.StreamHandler(sys.stdout)],
+    )
+
+    if training_args.should_log:
+        # The default of training_args.log_level is passive, so we set log level at info here to have that default.
+        transformers.utils.logging.set_verbosity_info()
+
+    log_level = training_args.get_process_log_level()
+    logger.setLevel(log_level)
+    # datasets.utils.logging.set_verbosity(log_level)
+    transformers.utils.logging.set_verbosity(log_level)
+    transformers.utils.logging.enable_default_handler()
+    transformers.utils.logging.enable_explicit_format()
+
+    # Log on each process the small summary:
+    logger.warning(
+        f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
+        + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
+    )
+    logger.info(f"Training/evaluation parameters {training_args}")
+
+    # Set seed before initializing model.
+    set_seed(training_args.seed)
+
+    # Load pretrained model and tokenizer
+    config = AutoConfig.from_pretrained(model_args.model_name_or_path, trust_remote_code=True)
+    config.pre_seq_len = model_args.pre_seq_len
+    config.prefix_projection = model_args.prefix_projection
+
+    tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, trust_remote_code=True)
+
+    if model_args.ptuning_checkpoint is not None:
+        model = AutoModel.from_pretrained(model_args.model_name_or_path, config=config, trust_remote_code=True)
+        prefix_state_dict = torch.load(os.path.join(model_args.ptuning_checkpoint, "pytorch_model.bin"))
+        new_prefix_state_dict = {}
+        for k, v in prefix_state_dict.items():
+            if k.startswith("transformer.prefix_encoder."):
+                new_prefix_state_dict[k[len("transformer.prefix_encoder."):]] = v
+        model.transformer.prefix_encoder.load_state_dict(new_prefix_state_dict)
+    elif model_args.pre_seq_len is not None:
+        model = AutoModel.from_pretrained(model_args.model_name_or_path, config=config, trust_remote_code=True)#,empty_init=False)
+    else:
+        model = AutoModel.from_pretrained(model_args.model_name_or_path, config=config, trust_remote_code=True,empty_init=False)
+        
+
+    if model_args.quantization_bit is not None:
+        print(f"Quantized to {model_args.quantization_bit} bit")
+        model = model.quantize(model_args.quantization_bit)
+    if model_args.pre_seq_len is not None:
+        # P-tuning v2
+        model = model.half()
+        model.transformer.prefix_encoder.float()
+    else:
+        # Finetune
+        model = model.float()
+    
+    with open(data_args.train_file, "r", encoding="utf-8") as f:
+        if data_args.train_file.endswith(".json"):
+            train_data = json.load(f)
+        elif data_args.train_file.endswith(".jsonl"):
+            train_data = [json.loads(line) for line in f]
+
+    if data_args.train_format == "multi-turn":
+        train_dataset = MultiTurnDataset(
+            train_data,
+            tokenizer,
+            data_args.max_seq_length,
+        )
+    elif data_args.train_format == "input-output":
+        train_dataset = InputOutputDataset(
+            train_data,
+            tokenizer,
+            data_args.max_source_length,
+            data_args.max_target_length,
+        )
+    else:
+        raise ValueError(f"Unknown train format: {data_args.train_format}")
+    if training_args.local_rank < 1:
+        sanity_check(train_dataset[0]['input_ids'], train_dataset[0]['labels'], tokenizer)
+
+    # Data collator
+    data_collator = DataCollatorForSeq2Seq(
+        tokenizer,
+        model=model,
+        label_pad_token_id=-100,
+        pad_to_multiple_of=None,
+        padding=False
+    )
+
+    # Initialize our Trainer
+    trainer = PrefixTrainer(
+        model=model,
+        args=training_args,
+        train_dataset=train_dataset,
+        tokenizer=tokenizer,
+        data_collator=data_collator,
+        save_changed=model_args.pre_seq_len is not None
+    )
+
+    checkpoint = None
+    if training_args.resume_from_checkpoint is not None:
+        checkpoint = training_args.resume_from_checkpoint
+    model.gradient_checkpointing_enable()
+    model.enable_input_require_grads()
+    trainer.train(resume_from_checkpoint=checkpoint)
+    trainer.save_model()  # Saves the tokenizer too for easy upload
+    trainer.save_state()
+
+if __name__ == "__main__":
+    main()
--- a/finetune_chatmodel_demo/formatted_data/advertise_gen.jsonl
+++ b/finetune_chatmodel_demo/formatted_data/advertise_gen.jsonl
--- a/finetune_chatmodel_demo/formatted_data/tool_alpaca.jsonl
+++ b/finetune_chatmodel_demo/formatted_data/tool_alpaca.jsonl
--- a/finetune_chatmodel_demo/inference.py
+++ b/finetune_chatmodel_demo/inference.py
+import argparse
+from transformers import AutoConfig, AutoModel, AutoTokenizer
+import torch
+import os
+
+parser = argparse.ArgumentParser()
+parser.add_argument("--pt-checkpoint", type=str, default=None, help="The checkpoint path")
+parser.add_argument("--model", type=str, default=None, help="main model weights")
+parser.add_argument("--tokenizer", type=str, default=None, help="main model weights")
+parser.add_argument("--pt-pre-seq-len", type=int, default=128, help="The pre-seq-len used in p-tuning")
+parser.add_argument("--device", type=str, default="cuda")
+parser.add_argument("--max-new-tokens", type=int, default=128)
+
+args = parser.parse_args()
+
+if args.tokenizer is None:
+    args.tokenizer = args.model
+
+if args.pt_checkpoint:
+    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer, trust_remote_code=True)
+    config = AutoConfig.from_pretrained(args.model, trust_remote_code=True, pre_seq_len=args.pt_pre_seq_len)
+    model = AutoModel.from_pretrained(args.model, config=config, trust_remote_code=True).cuda()
+    prefix_state_dict = torch.load(os.path.join(args.pt_checkpoint, "pytorch_model.bin"))
+    new_prefix_state_dict = {}
+    for k, v in prefix_state_dict.items():
+        if k.startswith("transformer.prefix_encoder."):
+            new_prefix_state_dict[k[len("transformer.prefix_encoder."):]] = v
+    model.transformer.prefix_encoder.load_state_dict(new_prefix_state_dict)
+else:
+    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer, trust_remote_code=True)
+    model = AutoModel.from_pretrained(args.model, trust_remote_code=True)
+
+model = model.to(args.device)
+
+while True:
+    prompt = input("Prompt:")
+    inputs = tokenizer(prompt, return_tensors="pt")
+    inputs = inputs.to(args.device)
+    response = model.generate(input_ids=inputs["input_ids"], max_length=inputs["input_ids"].shape[-1] + args.max_new_tokens)
+    response = response[0, inputs["input_ids"].shape[-1]:]
+    print("Response:", tokenizer.decode(response, skip_special_tokens=True))
\ No newline at end of file
--- a/finetune_chatmodel_demo/preprocess_utils.py
+++ b/finetune_chatmodel_demo/preprocess_utils.py