"examples/contrib/vscode:/vscode.git/clone" did not exist on "d4c2cb402d6674211726fd5f4803d1090664e438"
.. swift documentation file,
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Swift DOCUMENTATION
========================

.. toctree::
   :maxdepth: 2
   :caption: Get Started

   GetStarted/SWIFT安装.md
   GetStarted/界面训练推理.md
   GetStarted/使用tuners.md
   GetStarted/ResTuning.md
   GetStarted/SCEdit.md
   GetStarted/在SWIFT内使用PEFT.md

.. toctree::
   :maxdepth: 2
   :caption: LLM Training and Inference

   LLM/LLM推理文档.md
   LLM/LLM微调文档.md
   LLM/DPO训练文档.md
   LLM/LLM评测文档.md
   LLM/LLM量化文档.md
   LLM/VLLM推理加速与部署.md
   LLM/LLM实验文档.md
   LLM/命令行参数.md
   LLM/支持的模型和数据集.md
   LLM/自定义与拓展.md
   LLM/自我认知微调最佳实践.md
   LLM/Agent微调最佳实践.md
   LLM/Agent部署最佳实践.md
   LLM/Qwen1.5全流程最佳实践.md
   LLM/NPU推理与微调最佳实践.md
   LLM/Grok训练和推理.md
   LLM/ORPO算法最佳实践.md
   LLM/SimPO算法最佳实践.md
   LLM/HuggingFace生态兼容.md
   LLM/Benchmark.md

.. toctree::
   :maxdepth: 2
   :caption: Multi-Modal LLM Training and Inference

   Multi-Modal/qwen-vl最佳实践.md
   Multi-Modal/qwen-audio最佳实践.md
   Multi-Modal/deepseek-vl最佳实践.md
   Multi-Modal/internlm-xcomposer2最佳实践.md
   Multi-Modal/phi3-vision最佳实践.md
   Multi-Modal/llava最佳实践.md
   Multi-Modal/yi-vl最佳实践.md
   Multi-Modal/mplug-owl2最佳实践.md
   Multi-Modal/cogvlm最佳实践.md
   Multi-Modal/cogvlm2最佳实践.md
   Multi-Modal/minicpm-v最佳实践.md
   Multi-Modal/minicpm-v-2最佳实践.md
   Multi-Modal/minicpm-v-2.5最佳实践.md
   Multi-Modal/internvl最佳实践.md
   Multi-Modal/MLLM部署文档.md

.. toctree::
   :maxdepth: 2
   :caption: AIGC Training and Inference

   AIGC/AnimateDiff微调推理文档.md

.. toctree::
   :maxdepth: 2
   :caption: API Doc

   Hub <api/swift.hub>
   Trainer <api/swift.trainers>
   Tuner <api/swift.tuners>

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
# AnimateDiff Fine-tuning and Inference
SWIFT supports both full-parameter and LoRA fine-tuning of AnimateDiff, as well as inference.
First, you need to clone and install SWIFT:
```shell
git clone https://github.com/modelscope/swift.git
cd swift
pip install ".[aigc]"
```
## Full Parameter Training
### Training Effect
Full-parameter fine-tuning can reproduce the results of the [officially provided model animatediff-motion-adapter-v1-5-2](https://www.modelscope.cn/models/Shanghai_AI_Laboratory/animatediff-motion-adapter-v1-5-2/summary), but it requires a large number of short videos. The official reproduction used a subset of the official dataset: [WebVid 2.5M](https://maxbain.com/webvid-dataset/). The training results are as follows:
```text
Prompt:masterpiece, bestquality, highlydetailed, ultradetailed, girl, walking, on the street, flowers
```
![image.png](../../resources/1.gif)
```text
Prompt: masterpiece, bestquality, highlydetailed, ultradetailed, beautiful house, mountain, snow top
```
![image.png](../../resources/2.gif)
Training on the 2.5M subset still produces somewhat unstable generations; developers using the 10M dataset will get more stable results.
### Running Command
```shell
# This file is in swift/examples/pytorch/animatediff/scripts/full
# Experimental environment: A100 * 4
# 200GB GPU memory in total
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
torchrun --nproc_per_node=4 animatediff_sft.py \
--model_id_or_path wyj123456/Realistic_Vision_V5.1_noVAE \
--csv_path /mnt/workspace/yzhao/tastelikefeet/webvid/results_2M_train.csv \
--video_folder /mnt/workspace/yzhao/tastelikefeet/webvid/videos2 \
--sft_type full \
--lr_scheduler_type constant \
--trainable_modules .*motion_modules.* \
--batch_size 4 \
--eval_steps 100 \
--gradient_accumulation_steps 16
```
We used 4 x A100 GPUs for training, requiring about 200GB of GPU memory in total; training took about 40 hours. The data format is as follows:
```text
--csv_path # Pass in a csv file, which should contain the following format:
name,contentUrl
Travel blogger shoot a story on top of mountains. young man holds camera in forest.,stock-footage-travel-blogger-shoot-a-story-on-top-of-mountains-young-man-holds-camera-in-forest.mp4
```
The `name` field is the prompt describing the short video, and `contentUrl` is the file name of the video.
```text
--video_folder Pass in a video directory containing all the video files referenced by contentUrl in the csv file.
```
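For reference, here is a minimal sketch of assembling such a CSV from a directory of clips. It assumes each `.mp4` has a same-named `.txt` caption file next to it, which is purely an illustrative layout; SWIFT itself only requires the `name` and `contentUrl` columns described above.
```python
# Hypothetical helper: build a WebVid-style CSV for --csv_path from a folder of
# .mp4 clips whose captions live in a plain-text sidecar file per video.
import csv
import os

video_folder = "/path/to/videos"          # passed later as --video_folder
rows = []
for fname in sorted(os.listdir(video_folder)):
    if not fname.endswith(".mp4"):
        continue
    caption_file = os.path.join(video_folder, fname.replace(".mp4", ".txt"))
    with open(caption_file, encoding="utf-8") as f:
        caption = f.read().strip()        # the prompt describing the clip
    rows.append({"name": caption, "contentUrl": fname})

with open("train.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "contentUrl"])
    writer.writeheader()
    writer.writerows(rows)                # pass train.csv as --csv_path
```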
To perform inference using full parameters:
```shell
# This file is in swift/examples/pytorch/animatediff/scripts/full
# Experimental environment: A100
# 18GB GPU memory
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0 \
python animatediff_infer.py \
--model_id_or_path wyj123456/Realistic_Vision_V5.1_noVAE \
--sft_type full \
--ckpt_dir /output/path/like/checkpoints/iter-xxx \
--eval_human true
```
The --ckpt_dir should be the output folder from training.
## LoRA Training
### Running Command
Full parameter training will train the entire Motion-Adapter structure from scratch. Users can use an existing model and a small number of videos for fine-tuning by running the following command:
```shell
# This file is in swift/examples/pytorch/animatediff/scripts/lora
# Experimental environment: A100
# 20GB GPU memory
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0 \
python animatediff_sft.py \
--model_id_or_path wyj123456/Realistic_Vision_V5.1_noVAE \
--csv_path /mnt/workspace/yzhao/tastelikefeet/webvid/results_2M_train.csv \
--video_folder /mnt/workspace/yzhao/tastelikefeet/webvid/videos2 \
--motion_adapter_id_or_path Shanghai_AI_Laboratory/animatediff-motion-adapter-v1-5-2 \
--sft_type lora \
--lr_scheduler_type constant \
--trainable_modules .*motion_modules.* \
--batch_size 1 \
--eval_steps 200 \
--dataset_sample_size 10000 \
--gradient_accumulation_steps 16
```
Video data parameters are the same as above.
The inference command is as follows:
```shell
# This file is in swift/examples/pytorch/animatediff/scripts/lora
# Experimental environment: A100
# 18GB GPU memory
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0 \
python animatediff_infer.py \
--model_id_or_path wyj123456/Realistic_Vision_V5.1_noVAE \
--motion_adapter_id_or_path Shanghai_AI_Laboratory/animatediff-motion-adapter-v1-5-2 \
--sft_type lora \
--ckpt_dir /output/path/like/checkpoints/iter-xxx \
--eval_human true
```
The --ckpt_dir should be the output folder from training.
## Parameter List
Below are the supported parameter lists and their meanings for training and inference respectively:
### Training Parameters
```text
motion_adapter_id_or_path: Optional[str] = None # The model ID or model path of the motion adapter. Specifying this parameter allows for continued training based on the effect of existing official models.
motion_adapter_revision: Optional[str] = None # The model revision of the motion adapter, only useful when motion_adapter_id_or_path is the model ID.
model_id_or_path: str = None # The model ID or model path of the SD base model.
model_revision: str = None # The revision of the SD base model, only useful when model_id_or_path is the model ID.
dataset_sample_size: int = None # The number of training samples in the dataset. Default represents full training.
sft_type: str = field(
default='lora', metadata={'choices': ['lora', 'full']}) # Training method, supporting lora and full parameters.
output_dir: str = 'output' # Output folder.
ddp_backend: str = field(
default='nccl', metadata={'choices': ['nccl', 'gloo', 'mpi', 'ccl']}) # If using ddp training, ddp backend.
seed: int = 42 # Random seed.
lora_rank: int = 8 # lora parameter.
lora_alpha: int = 32 # lora parameter.
lora_dropout_p: float = 0.05 # lora parameter.
lora_dtype: str = 'fp32' # lora module dtype type. If `AUTO`, it follows the dtype setting of the original module.
gradient_checkpointing: bool = False # Whether to enable gc, disabled by default. Note: The current version of diffusers has a problem and does not support this parameter being True.
batch_size: int = 1 # batchsize.
num_train_epochs: int = 1 # Number of epochs.
# if max_steps >= 0, override num_train_epochs
learning_rate: Optional[float] = None # Learning rate.
weight_decay: float = 0.01 # adamw parameter.
gradient_accumulation_steps: int = 16 # ga size.
max_grad_norm: float = 1. # grad norm size.
lr_scheduler_type: str = 'cosine' # Type of lr_scheduler.
warmup_ratio: float = 0.05 # Whether to warmup and the proportion of warmup.
eval_steps: int = 50 # eval step interval.
save_steps: Optional[int] = None # save step interval.
dataloader_num_workers: int = 1 # Number of dataloader workers.
push_to_hub: bool = False # Whether to push to modelhub.
# 'user_name/repo_name' or 'repo_name'
hub_model_id: Optional[str] = None # modelhub id.
hub_private_repo: bool = False
push_hub_strategy: str = field( # Push strategy, push the last one or push each one.
default='push_best',
metadata={'choices': ['push_last', 'all_checkpoints']})
# None: use env var `MODELSCOPE_API_TOKEN`
hub_token: Optional[str] = field( # modelhub token.
default=None,
metadata={
'help':
'SDK token can be found in https://modelscope.cn/my/myaccesstoken'
})
ignore_args_error: bool = False # True: notebook compatibility.
text_dropout_rate: float = 0.1 # Drop a certain proportion of text to ensure model robustness.
validation_prompts_path: str = field( # The prompt file directory used in the evaluation process. By default, swift/aigc/configs/validation.txt is used.
default=None,
metadata={
'help':
'The validation prompts file path, use aigc/configs/validation.txt is None'
})
trainable_modules: str = field( # Trainable modules, recommended to use the default value.
default='.*motion_modules.*',
metadata={
'help':
'The trainable modules, by default, the .*motion_modules.* will be trained'
})
mixed_precision: bool = True # Mixed precision training.
enable_xformers_memory_efficient_attention: bool = True # Use xformers.
num_inference_steps: int = 25 #
guidance_scale: float = 8.
sample_size: int = 256
sample_stride: int = 4 # Maximum length of training videos in seconds.
sample_n_frames: int = 16 # Frames per second.
csv_path: str = None # Input dataset.
video_folder: str = None # Input dataset.
motion_num_attention_heads: int = 8 # motion adapter parameter.
motion_max_seq_length: int = 32 # motion adapter parameter.
num_train_timesteps: int = 1000 # Inference pipeline parameter.
beta_start: float = 0.00085 # Inference pipeline parameter.
beta_end: float = 0.012 # Inference pipeline parameter.
beta_schedule: str = 'linear' # Inference pipeline parameter.
steps_offset: int = 1 # Inference pipeline parameter.
clip_sample: bool = False # Inference pipeline parameter.
use_wandb: bool = False # Whether to use wandb.
```
### Inference Parameters
```text
motion_adapter_id_or_path: Optional[str] = None # The model ID or model path of the motion adapter. Specifying this parameter allows for continued training based on the effect of existing official models.
motion_adapter_revision: Optional[str] = None # The model revision of the motion adapter, only useful when motion_adapter_id_or_path is the model ID.
model_id_or_path: str = None # The model ID or model path of the SD base model.
model_revision: str = None # The revision of the SD base model, only useful when model_id_or_path is the model ID.
sft_type: str = field(
default='lora', metadata={'choices': ['lora', 'full']}) # Training method, supporting lora and full parameters.
ckpt_dir: Optional[str] = field(
default=None, metadata={'help': '/path/to/your/vx-xxx/checkpoint-xxx'}) # The output folder of training.
eval_human: bool = False # False: eval val_dataset # Whether to use manual input evaluation.
seed: int = 42 # Random seed.
merge_lora: bool = False # Merge lora into the MotionAdapter and save the model.
replace_if_exists: bool = False # Replace the files if the output merged dir exists when `merge_lora` is True.
# other
ignore_args_error: bool = False # True: notebook compatibility.
validation_prompts_path: str = None # The file used for validation. When eval_human=False, each line is a prompt.
output_path: str = './generated' # The output directory for gifs.
enable_xformers_memory_efficient_attention: bool = True # Use xformers.
num_inference_steps: int = 25 #
guidance_scale: float = 8.
sample_size: int = 256
sample_stride: int = 4 # Maximum length of training videos in seconds.
sample_n_frames: int = 16 # Frames per second.
motion_num_attention_heads: int = 8 # motion adapter parameter.
motion_max_seq_length: int = 32 # motion adapter parameter.
num_train_timesteps: int = 1000 # Inference pipeline parameter.
beta_start: float = 0.00085 # Inference pipeline parameter.
beta_end: float = 0.012 # Inference pipeline parameter.
beta_schedule: str = 'linear' # Inference pipeline parameter.
steps_offset: int = 1 # Inference pipeline parameter.
clip_sample: bool = False # Inference pipeline parameter.
```
# Installation and Usage
## Wheel Package Installation
You can use pip to install:
```shell
# Full capabilities
pip install 'ms-swift[all]' -U
# Only use LLM
pip install 'ms-swift[llm]' -U
# Only use AIGC
pip install 'ms-swift[aigc]' -U
# Only use adapters
pip install ms-swift -U
```
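A quick sanity check after installation; whether the package exposes a `__version__` attribute is an assumption, so the lookup is guarded:
```python
# Verify that ms-swift imports correctly and report its version if available.
import swift

print(getattr(swift, "__version__", "unknown"))
```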
## Source Code Installation
```shell
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[all]'
```
## Notebook Environment
Most of the models supported by Swift for training can be used on `A10` GPUs. Users can use the free GPU resources officially provided by ModelScope:
1. Go to the official [ModelScope](https://www.modelscope.cn) website and log in
2. Click on `My Notebook` on the left and start a free GPU instance
3. Happily take advantage of the A10 GPU resources
## Build Documentation
Swift supports complete API Doc documentation. Execute the following command in the swift root directory:
```shell
make docs
```
After the execution is complete, view `docs/build/html/index.html`.
# Res-Tuning Component
<div align="center">
## [NeurIPS 2023] Res-Tuning: A Flexible and Efficient Tuning Paradigm via Unbinding Tuner from Backbone
### [arXiv](https://arxiv.org/abs/2310.19859) | [Project Page](https://res-tuning.github.io/)
</div>
Res-Tuning is a flexible and efficient tuning method. We decouple the design of tuners from the model architecture for flexible combinations, and further extend it to implement a new memory-saving bypass tuner, which greatly reduces memory consumption and multi-task inference cost.
Currently, Res-Tuning is provided as a pluggable tuner algorithm component in [SWIFT](https://github.com/modelscope/swift), which developers can use directly.
### Supported Components
- [x] Res-Adapter
- [x] Res-Tuning-Bypass
- [ ] Res-Prefix
- [ ] Res-Prompt
### Usage
#### Demo
- You can use the [visualization example](https://github.com/modelscope/swift/blob/main/examples/pytorch/cv/notebook/swift_vision.ipynb) we provide.
#### Initialize Tuner
```Python
from swift import ResTuningConfig
config = ResTuningConfig(
dims=768,
root_modules=r'.*blocks.0$',
stem_modules=r'.*blocks\.\d+$',
target_modules=r'norm',
tuner_cfg='res_adapter'
)
```
- dims: The dimensions of the hidden states.
- root_modules: The root module to be replaced.
- stem_modules: The stem modules to be replaced.
- target_modules: The target module to be replaced.
- tuner_cfg: The configuration of the tuning module.
#### Load Model
```Python
from swift import Swift
import timm, torch
model = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=100)
model_tune = Swift.prepare_model(model, config)
print(model_tune.get_trainable_parameters())
print(model(torch.ones(1, 3, 224, 224)).shape)
```
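For illustration, a minimal training-loop sketch on top of the tuned model from the snippet above; the random tensors and optimizer settings are placeholders rather than a recommended recipe:
```python
# Toy fine-tuning loop: only the Res-Tuning parameters require gradients,
# so the optimizer is built from the trainable subset of parameters.
import torch

optimizer = torch.optim.AdamW(
    [p for p in model_tune.parameters() if p.requires_grad], lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

model_tune.train()
for step in range(10):                       # placeholder number of steps
    images = torch.randn(4, 3, 224, 224)     # stand-in for a real dataloader batch
    labels = torch.randint(0, 100, (4,))
    logits = model_tune(images)
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```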
### Citation
```
@inproceedings{jiang2023restuning,
title={Res-Tuning: A Flexible and Efficient Tuning Paradigm via Unbinding Tuner from Backbone},
author={Jiang, Zeyinzi and Mao, Chaojie and Huang, Ziyuan and Ma, Ao and Lv, Yiliang and Shen, Yujun and Zhao, Deli and Zhou, Jingren},
booktitle={Advances in Neural Information Processing Systems},
year={2023}
}
```
## 🔥SCEdit
SCEdit, proposed by Alibaba TongYi Vision Intelligence Lab, is an efficient generative fine-tuning framework. For text-to-image downstream tasks it **saves 30%-50% of the training memory overhead compared to LoRA**, enabling rapid transfer to specific generation scenarios. It also **extends directly to controllable image generation tasks, requiring only 7.9% of the parameters of ControlNet conditional generation and saving 30% of the memory overhead**, and supports conditional generation tasks such as edge images, depth images, segmentation images, poses, color images, image inpainting, and more.
We used the 3D style data from the [Style Transfer Dataset](https://modelscope.cn/datasets/damo/style_custom_dataset/dataPeview) for training, and tested using the same `Prompt: A boy in a camouflage jacket with a scarf`. The specific qualitative and quantitative results are as follows:
| Method | bs | ep | Target Module | Param. (M) | Mem. (MiB) | 3D style |
| --------- | ---- | ---- | ------------- | ------------- | ---------- | ------------------------------------------------------------ |
| LoRA/r=64 | 1 | 50 | q/k/v/out/mlp | 23.94 (2.20%) | 8440MiB | <img src="../../resources/scedit_boy1.png" alt="img" style="zoom:20%;" /> |
| SCEdit | 1 | 50 | up_blocks | 19.68 (1.81%) | 7556MiB | <img src="../../resources/scedit_boy2.png" alt="img" style="zoom:20%;" /> |
| LoRA/r=64 | 10 | 100 | q/k/v/out/mlp | 23.94 (2.20%) | 26300MiB | <img src="../../resources/scedit_boy3.png" alt="img" style="zoom:20%;" /> |
| SCEdit | 10 | 100 | up_blocks | 19.68 (1.81%) | 18634MiB | <img src="../../resources/scedit_boy4.png" alt="img" style="zoom:20%;" /> |
| LoRA/r=64 | 30 | 200 | q/k/v/out/mlp | 23.94 (2.20%) | 69554MiB | <img src="../../resources/scedit_boy5.png" alt="img" style="zoom:20%;" /> |
| SCEdit | 30 | 200 | up_blocks | 19.68 (1.81%) | 43350MiB | <img src="../../resources/scedit_boy6.png" alt="img" style="zoom:20%;" /> |
To perform the training task using SCEdit and reproduce the above results:
```shell
# First, follow the installation steps in the section below
cd examples/pytorch/multi_modal/notebook
python text_to_image_synthesis.py
```
# Basic Usage
"Tuner" refers to additional structures attached to a model to reduce the number of training parameters or improve training accuracy. Currently, SWIFT supports the following tuners:
1. LoRA: [LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS](https://arxiv.org/abs/2106.09685)
2. LoRA+: [LoRA+: Efficient Low Rank Adaptation of Large Models](https://arxiv.org/pdf/2402.12354.pdf)
3. LLaMA PRO: [LLAMA PRO: Progressive LLaMA with Block Expansion](https://arxiv.org/pdf/2401.02415.pdf)
4. GaLore: [GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection](https://arxiv.org/abs/2403.03507)
5. LISA: [LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning](https://arxiv.org/abs/2403.17919)
6. UnSloth: https://github.com/unslothai/unsloth
7. SCEdit: [SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing](https://arxiv.org/abs/2312.11392) < [arXiv](https://arxiv.org/abs/2312.11392) | [Project Page](https://scedit.github.io/) >
8. NEFTune: [Noisy Embeddings Improve Instruction Finetuning](https://arxiv.org/abs/2310.05914)
9. LongLoRA: [Efficient Fine-tuning of Long-Context Large Language Models](https://arxiv.org/abs/2309.12307)
10. Adapter: [Parameter-Efficient Transfer Learning for NLP](http://arxiv.org/abs/1902.00751)
11. Vision Prompt Tuning: [Visual Prompt Tuning](https://arxiv.org/abs/2203.12119)
12. Side: [Side-Tuning: A Baseline for Network Adaptation via Additive Side Networks](https://arxiv.org/abs/1912.13503)
13. Res-Tuning: [Res-Tuning: A Flexible and Efficient Tuning Paradigm via Unbinding Tuner from Backbone](https://arxiv.org/abs/2310.19859) < [arXiv](https://arxiv.org/abs/2310.19859) | [Project Page](https://res-tuning.github.io/) | [Usage](docs/source/GetStarted/ResTuning.md) >
14. Tuners provided by [PEFT](https://github.com/huggingface/peft), such as IA3, AdaLoRA, etc.
## Using in Training
Call `Swift.prepare_model()` to add tuners to the model:
```python
from modelscope import Model
from swift import Swift, LoraConfig
import torch
model = Model.from_pretrained('ZhipuAI/chatglm3-6b', torch_dtype=torch.bfloat16, device_map='auto')
lora_config = LoraConfig(
r=16,
target_modules=['query_key_value'],
lora_alpha=32,
lora_dropout=0.)
model = Swift.prepare_model(model, lora_config)
```
Multiple tuners can also be used simultaneously:
```python
from modelscope import Model
from swift import Swift, LoraConfig, AdapterConfig
import torch
model = Model.from_pretrained('ZhipuAI/chatglm3-6b', torch_dtype=torch.bfloat16, device_map='auto')
lora_config = LoraConfig(
r=16,
target_modules=['query_key_value'],
lora_alpha=32,
lora_dropout=0.)
adapter_config = AdapterConfig(
dim=model.config.hidden_size,
target_modules=['mlp'],
method_name='forward',
hidden_pos=0,
adapter_length=32,
)
model = Swift.prepare_model(model, {'first_tuner': lora_config, 'second_tuner': adapter_config})
# use model to do other things
```
When using multiple tuners, the second parameter should be a Dict where the key is the tuner name and the value is the tuner configuration.
After training, you can call:
```python
model.save_pretrained(save_directory='./output')
```
to store the model checkpoint. The model checkpoint file will only include the weights of the tuners, not the weights of the model itself. The stored structure is as follows:
```text
outputs
|-- configuration.json
|-- first_tuner
|   |-- adapter_config.json
|   |-- adapter_model.bin
|-- second_tuner
|   |-- adapter_config.json
|   |-- adapter_model.bin
|-- ...
```
If only a single config is passed in, the default name `default` will be used:
```text
outputs
|-- configuration.json
|-- default
|   |-- adapter_config.json
|   |-- adapter_model.bin
|-- ...
```
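To later load the stored tuners back onto the base model, `Swift.from_pretrained` takes the output directory and, optionally, a tuner name. A minimal sketch using the directory layout above:
```python
from swift import Swift

# Load only the tuner stored under ./output/first_tuner;
# omit adapter_name to load every tuner found in the directory.
model = Swift.from_pretrained(model, './output', adapter_name='first_tuner')
```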
### Complete Training Code
```python
# A100 18G memory
from swift import Seq2SeqTrainer, Seq2SeqTrainingArguments
from modelscope import MsDataset, AutoTokenizer
from modelscope import AutoModelForCausalLM
from swift import Swift, LoraConfig
from swift.llm import get_template, TemplateType
import torch
# load model
model = AutoModelForCausalLM.from_pretrained('ZhipuAI/chatglm3-6b', torch_dtype=torch.bfloat16, device_map='auto', trust_remote_code=True)
lora_config = LoraConfig(
r=16,
target_modules=['query_key_value'],
lora_alpha=32,
lora_dropout=0.05)
model = Swift.prepare_model(model, lora_config)
tokenizer = AutoTokenizer.from_pretrained('ZhipuAI/chatglm3-6b', trust_remote_code=True)
dataset = MsDataset.load('AI-ModelScope/alpaca-gpt4-data-en', split='train')
template = get_template(TemplateType.chatglm3, tokenizer, max_length=1024)
def encode(example):
inst, inp, output = example['instruction'], example.get('input', None), example['output']
if output is None:
return {}
if inp is None or len(inp) == 0:
q = inst
else:
q = f'{inst}\n{inp}'
example, kwargs = template.encode({'query': q, 'response': output})
return example
dataset = dataset.map(encode).filter(lambda e: e.get('input_ids'))
dataset = dataset.train_test_split(test_size=0.001)
train_dataset, val_dataset = dataset['train'], dataset['test']
train_args = Seq2SeqTrainingArguments(
output_dir='output',
learning_rate=1e-4,
num_train_epochs=2,
eval_steps=500,
save_steps=500,
evaluation_strategy='steps',
save_strategy='steps',
dataloader_num_workers=4,
per_device_train_batch_size=1,
gradient_accumulation_steps=16,
logging_steps=10,
)
trainer = Seq2SeqTrainer(
model=model,
args=train_args,
data_collator=template.data_collator,
train_dataset=train_dataset,
eval_dataset=val_dataset,
tokenizer=tokenizer)
trainer.train()
```
## Using in Inference
Use `Swift.from_pretrained()` to load the stored checkpoint:
```python
from modelscope import Model
from swift import Swift
import torch
model = Model.from_pretrained('ZhipuAI/chatglm2-6b', torch_dtype=torch.bfloat16, device_map='auto')
model = Swift.from_pretrained(model, './output')
```
### Complete Inference Code
```python
# A100 14G memory
import torch
from modelscope import AutoModelForCausalLM, GenerationConfig
from modelscope import AutoTokenizer
from swift import Swift
from swift.llm import get_template, TemplateType, to_device
# load model
model = AutoModelForCausalLM.from_pretrained('ZhipuAI/chatglm3-6b', torch_dtype=torch.bfloat16,
device_map='auto', trust_remote_code=True)
model = Swift.from_pretrained(model, 'output/checkpoint-xxx')
tokenizer = AutoTokenizer.from_pretrained('ZhipuAI/chatglm3-6b', trust_remote_code=True)
template = get_template(TemplateType.chatglm3, tokenizer, max_length=1024)
examples, tokenizer_kwargs = template.encode({'query': 'How are you?'})
if 'input_ids' in examples:
input_ids = torch.tensor(examples['input_ids'])[None]
examples['input_ids'] = input_ids
token_len = input_ids.shape[1]
generation_config = GenerationConfig(
max_new_tokens=1024,
temperature=0.3,
top_k=25,
top_p=0.8,
do_sample=True,
repetition_penalty=1.0,
num_beams=10,
pad_token_id=tokenizer.pad_token_id,
eos_token_id=tokenizer.eos_token_id)
device = next(model.parameters()).device
examples = to_device(examples, device)
generate_ids = model.generate(
generation_config=generation_config,
**examples)
generate_ids = template.get_generate_ids(generate_ids, token_len)
print(tokenizer.decode(generate_ids, **tokenizer_kwargs))
# I'm an AI language model, so I don't have feelings or physical sensations. However, I'm here to assist you with any questions or tasks you may have. How can I help you today?
```
# Interface List
## Swift Class Static Interfaces
- `Swift.prepare_model(model, config, **kwargs)`
- Explain: Load a tuner onto the model. If it is a subclass of PeftConfig, use the corresponding interface of the Peft library to load the tuner. When using SwiftConfig, this interface can accept a SwiftModel instance and be called repeatedly, which has the same effect as passing a dictionary to config.
- This interface supports parallel loading of multiple tuners of different types for simultaneous use.
- Parameters:
- `model`: An instance of `torch.nn.Module` or `SwiftModel`, the model to be loaded
- `config`: An instance of `SwiftConfig`, `PeftConfig`, or a dictionary of custom tuner names to configs
- Return value: An instance of `SwiftModel` or `PeftModel`
- `Swift.merge_and_unload(model)`
- Explain: Merge the LoRA weights back into the original model and completely unload the LoRA part
- Parameters:
- model: An instance of `SwiftModel` or `PeftModel`, the model instance with LoRA loaded
- Return value: None
- `Swift.merge(model)`
- Explain: Merge the LoRA weights back into the original model without unloading the LoRA part
- Parameters:
- model: An instance of `SwiftModel` or `PeftModel`, the model instance with LoRA loaded
- Return value: None
- `Swift.unmerge(model)`
- Explain: Split the LoRA weights from the original model weights back into the LoRA structure
- Parameters:
- model: An instance of `SwiftModel` or `PeftModel`, the model instance with LoRA loaded
- Return value: None
- `Swift.save_to_peft_format(ckpt_dir, output_dir)`
- Explain: Convert the stored LoRA checkpoint to a Peft compatible format. The main changes are:
- The `default` tuner's weights will be moved out of its `default` subfolder into the root of output_dir
- The `{tuner_name}.` field in weights will be removed, for example `model.layer.0.self.in_proj.lora_A.default.weight` will become `model.layer.0.self.in_proj.lora_A.weight`
- The prefix `basemodel.model` will be added to the keys in weights
- Note: Only LoRA can be converted, other types of tuners cannot be converted due to Peft itself not supporting them. Additionally, when there are extra parameters like `dtype` set in LoRAConfig, it does not support conversion to Peft format. In this case, you can manually delete the corresponding fields in adapter_config.json
- Parameters:
- ckpt_dir: Original weights directory
- output_dir: Target weights directory
- Return value: None
- `Swift.from_pretrained(model, model_id, adapter_name, revision, **kwargs)`
- Explain: Load tuners from the stored weights directory onto the model. If adapter_name is not passed, all tuners under the model_id directory will be loaded. Same as `prepare_model`, this interface can be called repeatedly.
- Parameters:
- model: An instance of `torch.nn.Module` or `SwiftModel`, the model to be loaded
- model_id: `str` type, the tuner checkpoint to be loaded, can be a ModelScope hub id or a local directory produced by training
- adapter_name: `str` or `List[str]` or `Dict[str, str]` type or `None`, the tuner name in the tuner directory to be loaded. If `None`, all named tuners will be loaded. If `str` or `List[str]`, only certain specific tuners will be loaded. If `Dict`, the tuner indicated by `key` will be loaded and renamed to `value`.
- revision: If model_id is a ModelScope id, revision can specify the corresponding version number
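As a short sketch combining the static interfaces above, the calls below load two named tuners from a training output directory (renaming one on the fly via the Dict form of `adapter_name`), convert a stored LoRA checkpoint to the Peft layout, and finally merge the LoRA weights. A base `model` is assumed to be loaded already, and the checkpoint paths and tuner names are hypothetical:
```python
from swift import Swift

# Load two named tuners from a training output directory; the Dict form of
# adapter_name renames 'first_tuner' to 'lora_main' while loading it.
model = Swift.from_pretrained(
    model, 'output/checkpoint-xxx',
    adapter_name={'first_tuner': 'lora_main', 'second_tuner': 'second_tuner'})

# Convert a stored LoRA checkpoint into the Peft-compatible layout.
Swift.save_to_peft_format('output/checkpoint-xxx', 'output/peft-format')

# Merge the LoRA weights into the base model and remove the LoRA modules.
Swift.merge_and_unload(model)
```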
## SwiftModel Interface
The following lists the interfaces that users may call. Other internal interfaces or interfaces not recommended for use can be viewed through the `make docs` command to generate the API Doc documentation.
- `SwiftModel.create_optimizer_param_groups(self, **defaults)`
- Explain: Create parameter groups based on the loaded tuners, currently only effective for the `LoRA+` algorithm
- Parameters:
- defaults: Default parameters for `optimizer_groups`, such as `lr` and `weight_decay`
- Return value:
- The created `optimizer_groups`
- `SwiftModel.add_weighted_adapter(self, ...)`
- Explain: Merge existing LoRA tuners into one
- Parameters:
- This interface is a transparent pass-through of PeftModel.add_weighted_adapter, parameters can refer to: [add_weighted_adapter documentation](https://huggingface.co/docs/peft/main/en/package_reference/lora#peft.LoraModel.add_weighted_adapter)
- `SwiftModel.save_pretrained(self, save_directory, safe_serialization, adapter_name)`
- Explain: Store tuner weights
- Parameters:
- save_directory: Storage directory
- safe_serialization: Whether to use safe_tensors, default is False
- adapter_name: The adapter tuner to store, if not passed, all tuners will be stored by default
- `SwiftModel.set_active_adapters(self, adapter_names, offload=None)`
- Explain: Set the currently active adapters, adapters not in the list will be deactivated
- In `inference`, the environment variable `USE_UNIQUE_THREAD=0/1` is supported, default value is `1`. If `0`, set_active_adapters only takes effect for the current thread. In this case, the tuners activated by this thread are used by default, and tuners in different threads do not interfere with each other.
- Parameters:
- adapter_names: Activated tuners
- offload: How to handle deactivated adapters, default is `None` which means leave them in GPU memory. Both `cpu` and `meta` are supported, indicating offloading to cpu and meta devices to reduce GPU memory consumption. When `USE_UNIQUE_THREAD=0`, do not pass a value to offload to avoid affecting other threads.
- Return value: None
- `SwiftModel.activate_adapter(self, adapter_name)`
- Explain: Activate a tuner
- In `inference`, the environment variable `USE_UNIQUE_THREAD=0/1` is supported, default value is `1`. If `0`, activate_adapter only takes effect for the current thread. In this case, the tuners activated by this thread are used by default, and tuners in different threads do not interfere with each other.
- Parameters:
- adapter_name: The name of the tuner to activate
- Return value: None
- `SwiftModel.deactivate_adapter(self, adapter_name, offload)`
- Explain: Deactivate a tuner
- When the environment variable `USE_UNIQUE_THREAD=0`, do not call this interface
- Parameters:
- adapter_name: The name of the tuner to deactivate
- offload: How to handle deactivated adapters, default is `None` which means leave them in GPU memory. Both `cpu` and `meta` are supported, indicating offloading to cpu and meta devices to reduce GPU memory consumption
- Return value: None
- `SwiftModel.get_trainable_parameters(self)`
- Explain: Return training parameter information
- Parameters: None
- Return value: Training parameter information, format is as follows:
```text
trainable params: 100M || all params: 1000M || trainable%: 10.00% || cuda memory: 10GiB.
```
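A minimal sketch tying several of the SwiftModel interfaces above together, assuming `model` is a `SwiftModel` with two LoRA tuners named `lora_a` and `lora_b` already loaded (the names are hypothetical):
```python
# Inspect the trainable-parameter summary.
print(model.get_trainable_parameters())

# Keep only lora_a active and offload lora_b's weights to CPU to free GPU memory.
model.set_active_adapters(['lora_a'], offload='cpu')

# Later, bring lora_b back and switch off lora_a.
model.activate_adapter('lora_b')
model.deactivate_adapter('lora_a', offload='cpu')

# Store only lora_a's weights.
model.save_pretrained('./output', adapter_name='lora_a')
```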
# Compatibility with Peft
To support users accustomed to Peft, Swift provides compatibility with Peft. Users can import Peft components from Swift:
>PeftModel
>
>PeftConfig
>
>PeftModelForSeq2SeqLM
>
>PeftModelForSequenceClassification
>
>PeftModelForTokenClassification
>
>PeftModelForCausalLM
>
>PromptEncoderConfig
>
>PromptTuningConfig
>
>PrefixTuningConfig
>
>PromptLearningConfig
>
>LoraConfig
>
>get_peft_config
>
>get_peft_model_state_dict
>
>get_peft_model
All of the above components can be imported from Swift:
```python
from swift import PeftModel, PeftConfig
```
The Swift class also supports initializing Peft's tuner:
```python
from modelscope.models.nlp import SbertForSequenceClassification
from modelscope.models.nlp.structbert import SbertConfig
from swift import LoraConfig, Swift
model = SbertForSequenceClassification(SbertConfig())
lora_config = LoraConfig(target_modules=['query', 'key', 'value'])
model = Swift.prepare_model(model, lora_config)
```
Swift provides a shallow wrapper for Peft, allowing Peft to use models from the modelscope hub when calling from_pretrained.
# Interface Training and Inference
Currently, SWIFT supports interface-based training and inference, with parameters consistent with script-based training. After installing SWIFT, use the following command:
```shell
swift web-ui
```
This command starts the interface for training and inference.
The web-ui command doesn't accept parameters; all controllable parts are handled within the interface. However, there are a few environment variables that can be used:
> WEBUI_SHARE=1/0: Default is 0. Controls whether gradio is in share mode.
>
> SWIFT_UI_LANG=en/zh: Controls the language of the web-ui interface.
>
> WEBUI_SERVER: The server_name parameter. Specifies the host IP for web-ui. 0.0.0.0 means all IPs can access, while 127.0.0.1 means only local access is allowed.
>
> WEBUI_PORT: The port number for web-ui.
>
> USE_INFERENCE=1/0: Default is 0. Controls whether the gradio inference page loads the model directly for inference (1) or uses deployment (0).
# Agent Deployment Best Practice
## Table of Contents
- [Environment Setup](#Environment-Setup)
- [Tools Field](#Tools-Field)
- [Deployment](#Deployment)
## Environment Setup
```bash
# Set the global pip mirror (to speed up downloads)
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
# Install ms-swift
pip install 'ms-swift[llm]' -U
# Align the environment (usually not needed. Run the lines below only if you hit errors; the repo is tested against the latest environment)
pip install -r requirements/framework.txt -U
pip install -r requirements/llm.txt -U
```
## Tools Field
The tools field provides the API information that the model can call. It supports OpenAI and ToolBench formats and requires the name, description, and parameters of the tools. An example is provided below:
OpenAI tools format
```json
{
"tools": [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"]
}
},
"required": ["location"]
}
}
}
]
}
```
ToolBench tools format
```json
{
"tools": [
{
"name": "url_for_newapi",
"description": "This is the subfunction for tool \"newapi\", you can use this tool.The description of this function is: \"url_for_newapi\"",
"parameters": {
"type": "object",
"properties": {
"url": {
"type": "string",
"description": "",
"example_value": "https://www.instagram.com/reels/CtB6vWMMHFD/"
}
},
"required": [
"url"
],
"optional": [
"url"
]
}
},
{
"name": "n_for_newapi",
"description": "This is the subfunction for tool \"newapi\", you can use this tool.The description of this function is: \"n_for_newapiew var\"",
"parameters": {
"type": "object",
"properties": {
"language": {
"type": "string",
"description": "",
"example_value": "https://www.instagram.com/reels/Csb0AI3IYUN/"
}
},
"required": [
"language"
],
"optional": []
}
},
{
"name": "Finish",
"description": "If you believe that you have obtained a result that can answer the task, please call this function to provide the final answer. Alternatively, if you recognize that you are unable to proceed with the task in the current state, call this function to restart. Remember: you must ALWAYS call this function at the end of your attempt, and the only part that will be shown to the user is the final answer, so it should contain sufficient information.",
"parameters": {
"type": "object",
"properties": {
"return_type": {
"type": "string",
"enum": [
"give_answer",
"give_up_and_restart"
]
},
"final_answer": {
"type": "string",
"description": "The final answer you want to give the user. You should have this field if \"return_type\"==\"give_answer\""
}
},
"required": [
"return_type"
]
}
}
]
}
```
During inference, the information in the tools field will be converted into the corresponding tools system prompt. If a system prompt already exists, the tools prompt will be appended to it.
Currently, three types of tools system prompts are supported: ReAct-EN, ReAct-ZH and ToolBench. Examples are shown below:
ReAct-EN
```
Answer the following questions as best you can. You have access to the following tools:
{'name': 'get_current_weather', 'description': 'Get the current weather in a given location', 'parameters': {'type': 'object', 'properties': {'location': {'type': 'string', 'description': 'The city and state, e.g. San Francisco, CA'}, 'unit': {'type': 'string', 'enum': ['celsius', 'fahrenheit']}}, 'required': ['location']}}
Use the following format:
Thought: you should always think about what to do
Action: the action to take, should be one of [get_current_weather]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can be repeated zero or more times)
Final Answer: the final answer to the original input question
Begin!
```
ReAct-ZH
```
尽你所能回答以下问题。你拥有如下工具:
{'name': 'get_current_weather', 'description': 'Get the current weather in a given location', 'parameters': {'type': 'object', 'properties': {'location': {'type': 'string', 'description': 'The city and state, e.g. San Francisco, CA'}, 'unit': {'type': 'string', 'enum': ['celsius', 'fahrenheit']}}, 'required': ['location']}}
以下格式回答:
Thought: 思考你应该做什么
Action: 工具的名称,必须是[get_current_weather]之一
Action Input: 工具的输入
Observation: 工具返回的结果
... (Thought/Action/Action Input/Observation的过程可以重复零次或多次)
Final Answer: 对输入问题的最终答案
开始!
```
ToolBench
```
You can use many tools(functions) to do the following task.
First I will give you the task description, and your task start.
At each step, you need to give your thought to analyze the status now and what to do next, with a function call to actually excute your step. Your output should follow this format:
Thought:
Action:
Action Input:
After the call, you will get the call result, and you are now in a new state.
Then you will analyze your status now, then decide what to do next...
After many (Thought-call) pairs, you finally perform the task, then you can give your finial answer.
Remember:
1.the state change is irreversible, you can\'t go back to one of the former state, if you want to restart the task, say "I give up and restart".
2.All the thought is short, at most in 5 sentence.
3.You can do more then one trys, so if your plan is to continusly try some conditions, you can do one of the conditions per try.
Let\'s Begin!
Task description: You should use functions to help handle the real time user querys. Remember:
1.ALWAYS call "Finish" function at the end of the task. And the final answer should contain enough information to show to the user,If you can\'t handle the task, or you find that function calls always fail(the function is not valid now), use function Finish->give_up_and_restart.
2.Do not use origin tool names, use only subfunctions\' names.
Specifically, you have access to the following APIs: {\'name\': \'get_current_weather\', \'description\': \'Get the current weather in a given location\', \'parameters\': {\'type\': \'object\', \'properties\': {\'location\': {\'type\': \'string\', \'description\': \'The city and state, e.g. San Francisco, CA\'}, \'unit\': {\'type\': \'string\', \'enum\': [\'celsius\', \'fahrenheit\']}}, \'required\': [\'location\']}}
```
By default, the system employs the ReAct-EN format. However, you have the option to specify the `--tools_prompt` parameter with either `react-zh` or `toolbench` to utilize one of the alternative formats.
If you have a better tools system prompt, feel free to let us know or contribute it to us.
## Deployment
As an example, we demonstrate deploying the model with vLLM and calling it with non-streaming invocation and the ReAct prompt.
When deploying an Agent, it is crucial that the model itself possesses a strong capability to follow instructions or has undergone training on an Agent dataset. If the existing model is incapable of selecting the appropriate tools and accurately setting their parameters based on the tools field, it is advisable to switch to a model with higher performance or to refine the model using the strategies outlined in [Agent Fine-tuning Practices](./Agent-fine-tuning-best-practice.md).
Here, we choose the llama3-8b-instruct model as an example.
```shell
swift deploy \
--model_type llama3-8b-instruct \
--infer_backend vllm
```
Use the curl command to call the interface. Because the ReAct format ends with Observation:, we need to specify Observation: in the stop parameter as a stop word to truncate the model's response. Some models treat Observation:\n as a single token, so we also include it as a stop word.
If you are using the ToolBench prompt, specifying stop words is not necessary (although including them won't cause any issues).
```shell
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3-8b-instruct",
"messages": [
{
"role": "user",
"content": "What'\''s the weather like in Boston today?"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"]
}
},
"required": ["location"]
}
}
}
],
"stream": false,
"stop": ["Observation:", "Observation:\n"]
}'
```
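For reference, the same request can also be issued from Python with the `requests` library; the payload simply mirrors the curl call above:
```python
import requests

payload = {
    "model": "llama3-8b-instruct",
    "messages": [
        {"role": "user", "content": "What's the weather like in Boston today?"}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }],
    "stream": False,
    # ReAct format: stop on Observation: so the model pauses for the tool result.
    "stop": ["Observation:", "Observation:\n"],
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(resp.json())
```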
You can also select a tool from the tools field by specifying the `tool_choice` field, for example: `"tool_choice": {"type": "function", "function": {"name": "my_function"}}`. By default, all tools are selected, but you can set it to None to disable the tools field.
result
```json
{"model":"llama3-8b-instruct","choices":[[{"index":0,"message":{"role":"assistant","content":"Question: What's the weather like in Boston today?\n\nThought: I need to get the current weather in Boston to answer this question.\n\nAction: get_current_weather\n\nAction Input: {'location': 'Boston, MA', 'unit': 'fahrenheit'}\n\nObservation:","tool_calls":{"id":"toolcall-f534d907ae254f2ab96e06c25179ddf9","function":{"arguments":" {'location': 'Boston, MA', 'unit': 'fahrenheit'}\n\n","name":"get_current_weather"},"type":"function"}},"finish_reason":"stop"}]],"usage":{"prompt_tokens":262,"completion_tokens":54,"total_tokens":316},"id":"chatcmpl-8630e8d675c941c0aca958a37633a3c9","object":"chat.completion","created":1717590756}
```
From the `tool_calls` field of the returned result you can obtain the called function and its arguments.
Assuming the tool returns `The weather in Boston today is 32°F (0°C), with clear skies`, we put this result into a message with the role `tool` and append it to the messages field, as sketched below.
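A minimal sketch of this round trip in Python; the parsing regexes, the hard-coded assistant reply, and the stand-in tool result are illustrative only:
```python
import json
import re

# Assistant reply from the previous call (abbreviated to the relevant fields).
assistant_content = (
    "Question: What's the weather like in Boston today?\n\n"
    "Thought: I need to get the current weather in Boston.\n\n"
    "Action: get_current_weather\n\n"
    'Action Input: {"location": "Boston, MA", "unit": "fahrenheit"}\n\n'
    "Observation:"
)

# Pull the tool name and its JSON arguments out of the ReAct-formatted reply.
action = re.search(r"Action:\s*(.+)", assistant_content).group(1).strip()
action_input = json.loads(
    re.search(r"Action Input:\s*(.+)", assistant_content).group(1))

# Call your own implementation of the tool; the dict below stands in for its output.
tool_result = {"result": "The weather in Boston today is 32°F (0°C), with clear skies"}

# Build the next-round messages: the assistant turn plus the tool observation.
messages = [
    {"role": "user", "content": "What's the weather like in Boston today?"},
    {"role": "assistant", "content": assistant_content},
    {"role": "tool", "content": json.dumps(tool_result)},
]
# `messages` can now be posted to /v1/chat/completions exactly as in the curl call below.
```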
```shell
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3-8b-instruct",
"messages": [
{
"role": "user",
"content": "What'\''s the weather like in Boston today?"
},
{
"role": "assistant",
"content": "Question: What'\''s the weather like in Boston today?\n\nThought: I need to get the current weather in Boston.\n\nAction: get_current_weather\n\nAction Input: {\"location\": \"Boston, MA\", \"unit\": \"fahrenheit\"}\n\nObservation:"
},
{
"role": "tool",
"content": "{\"result\": \"The weather in Boston today is 32°F (0°C), with clear skies\"}\\n\\n"
}
],
"stream": false,
"stop": ["Observation:", "Observation:\n"]
}'
```
For the ReAct format, the result is concatenated after the final `Observation:` field that the model returned in the previous round.
For the ToolBench format, the result is handled according to the model template; if the template does not define special handling for this field, it is treated as user input.
If you have a better handling method, please feel free to let us know or contribute it to us.
result
```json
{"model":"llama3-8b-instruct","choices":[{"index":0,"message":{"role":"assistant","content":"\n\nAnswer: The weather in Boston today is 32°F (0°C), with clear skies.","tool_calls":null},"finish_reason":null}],"usage":{"prompt_tokens":93,"completion_tokens":21,"total_tokens":114},"id":"chatcmpl-5e63cee5155f48a48d1366001d16502b","object":"chat.completion","created":1717590962}
```
If you want to integrate code and tools to complete the entire workflow loop, we recommend reading the [OpenAI tutorial](https://cookbook.openai.com/examples/how_to_call_functions_with_chat_models).
# Agent Fine-tuning Best Practices
Train your own Agent with consumer-grade GPUs!
SWIFT supports open-source models, especially small and medium-sized models (7B, 14B, etc.) for training on Agent scenarios. It applies [loss-scale technique](https://arxiv.org/pdf/2309.00986.pdf) to agent training, making the API calling capability of small and medium models more stable. It also supports using a single commercial-grade GPU for Agent inference and deployment, which can be directly used end-to-end in production scenarios.
## Table of Contents
- [Environment Setup](#Environment-Setup)
- [Data Preparation](#Data-Preparation)
- [Fine-tuning](#Fine-tuning)
- [Inference](#Inference)
- [Usage with Modelscope-Agent](#Usage-with-Modelscope_Agent)
- [Summary](#Summary)
## Environment Setup
```bash
# Install ms-swift
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'
# Align the environment (usually not needed. Run the lines below only if you hit errors; the repo is tested against the latest environment)
pip install -r requirements/framework.txt -U
pip install -r requirements/llm.txt -U
```
## Data Preparation
For training Agent capability, the ModelScope team officially provides two open-source datasets:
- [ModelScope General QA Knowledge Dataset](https://www.modelscope.cn/datasets/iic/ms_bench/summary): contains 380,000 multi-turn dialogues covering general knowledge
- [ModelScope General Agent Training Dataset](https://www.modelscope.cn/datasets/iic/ms_agent/summary): contains 30,000 API-calling dialogues in Agent format
The data format of this dataset is as follows:
```json
{
"id": "MS_Agent_Bench_126374",
"conversations": [{
"from": "system",
"value": "Answer the following questions as best you can. You have access to the following APIs:\n1. hm_recipe_recommend: Call this tool to interact with the hmreciperecommend API. What is the hmreciperecommend API useful for? . Parameters: [{\"name\": \"keywords_dict\", \"description\": \"盒马推荐菜谱关键词字典。\", \"required\": \"True\"}]\n\n2. hm_product_marketing: Call this tool to interact with the hmproductmarketing API. What is the hmproductmarketing API useful for? . Parameters: [{\"name\": \"sku_code_list\", \"description\": \"盒马商品的编码列表\", \"required\": \"True\"}]\n\n3. hm_product_info: Call this tool to interact with the hmproductinfo API. What is the hmproductinfo API useful for? . Parameters: [{\"name\": \"sku_code_list\", \"description\": \"盒马商品的编码列表\", \"required\": \"True\"}, {\"name\": \"sku_name_list\", \"description\": \"盒马商品的名称列表\", \"required\": \"True\"}, {\"name\": \"property_list\", \"description\": \"盒马商品的属性列表\", \"required\": \"True\"}]\n\n4. hm_product_recommend: Call this tool to interact with the hmproductrecommend API. What is the hmproductrecommend API useful for? . Parameters: [{\"name\": \"keywords_dict\", \"description\": \"盒马推荐商品关键词字典。\", \"required\": \"True\"}]\n\nUse the following format:\n\nThought: you should always think about what to do\nAction: the action to take, should be one of the above tools[hm_recipe_recommend, hm_product_marketing, hm_product_info, hm_product_recommend]\nAction Input: the input to the action\nObservation: the result of the action\n... (this Thought/Action/Action Input/Observation can be repeated zero or more times)\nThought: I now know the final answer\nFinal Answer: the final answer to the original input question\nBegin!"
}, {
"from": "user",
"value": "你好,请问你们有哪些手机可以推荐?"
}, {
"from": "assistant",
"value": "Action: hm_product_recommend\nAction Input: {'keywords_dict': {'商品类目': '手机'}}\nObservation: [{'sku_code': '10518244314', 'product_name': '小米11 Ultra 5G手机 256GB', 'bussiness_type': '标品', 'product_price': '5999', 'product_description': '拥有120Hz高刷新率和骁龙888处理器,还配备了高品质相机系统'}, {'sku_code': '10222726664', 'product_name': 'vivo X60 Pro 5G手机 256GB', 'bussiness_type': '标品', 'product_price': '4498', 'product_description': '搭载Exynos 1080处理器与ZEISS联合调校的相机系统,拍照实力十足'}, {'sku_code': '18227073452', 'product_name': '华为畅享 20 Pro 5G手机 128GB', 'bussiness_type': '标品', 'product_price': '2699', 'product_description': '拥有4000mAh的电池容量和优秀的相机而价格非常亲民'}]\nThought: I now know the final answer\nFinal Answer: 您好,我们有以下几款手机供您选择:1.小米11 Ultra 5G手机 256GB:价格为5999元,拥有120Hz高刷新率和骁龙888处理器,还配备了高品质相机系统;2. vivo X60 Pro 5G手机 256GB:售价为4498元,搭载Exynos 1080处理器与ZEISS联合调校的相机系统,拍照实力十足;3. 华为畅享 20 Pro 5G手机 128GB:价格亲民,只需2699元即可拥有优秀的相机和4000mAh的电池容量。这些手机都具有不同的特点和功能,您可以根据自己的需求进行选择。如果您有其他问题或需求,请随时告诉我。"
}, {
"from": "user",
"value": "我想买一款拍照好的手机,有推荐吗?"
}, {
"from": "assistant",
"value": "Action: hm_product_recommend\nAction Input: {'keywords_dict': {'商品类目': '手机', '适用场景': '拍照'}}\nObservation: [{'sku_code': '15956662254', 'product_name': '小米11 Ultra 5G手机 256GB', 'bussiness_type': '标品', 'product_price': '5999', 'product_description': '拥有120Hz高刷新率和骁龙888处理器,还配备了高品质相机系统'}, {'sku_code': '19424216918', 'product_name': 'vivo X60 Pro 5G手机 256GB', 'bussiness_type': '标品', 'product_price': '4498', 'product_description': '搭载Exynos 1080处理器与ZEISS联合调校的相机系统,拍照实力十足'}, {'sku_code': '14347982715', 'product_name': 'OPPO Ace2 5G手机 256GB', 'bussiness_type': '标品', 'product_price': '3999', 'product_description': '拍照实力十足,支持65W超级闪充'}]\nThought: I now know the final answer\nFinal Answer: 如果您想要一款拍照表现出色的手机,我为您推荐以下几款产品:1.小米11 Ultra 5G手机 256GB:价格为5999元,配备了高品质相机系统,可以拍摄出色的照片;2. vivo X60 Pro 5G手机 256GB:售价为4498元,相机系统由ZEISS联合调校,拍照实力十足;3. OPPO Ace2 5G手机 256GB:价格为3999元,不仅拍照实力十足,还支持65W超级闪充,让你的手机充电更加便捷。这些手机都有不同的特点和功能,您可以根据自己的需求进行选择。如果您还有其他问题或需求,请随时告诉我。"
}, {
"from": "user",
"value": "小米11 Ultra拍照性能怎么样?"
}, {
"from": "assistant",
"value": "Action: hm_product_info\nAction Input: {'sku_code_list': ['15956662254'], 'sku_name_list': ['小米11 Ultra 5G手机 256GB'], 'property_list': [['拍照性能']]}\nObservation: [{'sku_code': '15956662254', 'property': {'拍照性能': '小米11 Ultra 5G手机配备了1亿像素的主摄像头,还有120X超长焦电影摄像头,以及20MP超广角摄像头等组合,拍照性能非常出色。'}}]\nThought: I now know the final answer\nFinal Answer: 小米11 Ultra 5G手机配备了1亿像素的主摄像头,还有120X超长焦电影摄像头,以及20MP超广角摄像头等组合,拍照性能非常出色。如果您还有其他问题或需求,请随时告诉我。"
}]
}
```
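To inspect the raw data locally, the dataset can be loaded with `MsDataset`; the `'iic/ms_agent'` id follows the ModelScope page linked above, and the split name is an assumption:
```python
import json
from modelscope import MsDataset

# Load the Agent dataset and print the first multi-turn conversation.
dataset = MsDataset.load('iic/ms_agent', split='train')
sample = next(iter(dataset))

conversations = sample['conversations']
if isinstance(conversations, str):   # some dataset versions store the list as a JSON string
    conversations = json.loads(conversations)
for turn in conversations:
    print(f"[{turn['from']}] {turn['value'][:80]}")
```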
The system field of the Agent dataset has the following specific format (with the \" sequences unescaped to " and \n converted to newlines):
```text
Answer the following questions as best you can. You have access to the following APIs:
1. hm_recipe_recommend: Call this tool to interact with the hmreciperecommend API. What is the hmreciperecommend API useful for? . Parameters: [{"name": "keywords_dict", "description": "盒马推荐菜谱关键词字典。", "required": "True"}]
2. hm_product_marketing: Call this tool to interact with the hmproductmarketing API. What is the hmproductmarketing API useful for? . Parameters: [{"name": "sku_code_list", "description": "盒马商品的编码列表", "required": "True"}]
3. hm_product_info: Call this tool to interact with the hmproductinfo API. What is the hmproductinfo API useful for? . Parameters: [{"name": "sku_code_list", "description": "盒马商品的编码列表", "required": "True"}, {"name": "sku_name_list", "description": "盒马商品的名称列表", "required": "True"}, {"name": "property_list", "description": "盒马商品的属性列表", "required": "True"}]
4. hm_product_recommend: Call this tool to interact with the hmproductrecommend API. What is the hmproductrecommend API useful for? . Parameters: [{"name": "keywords_dict", "description": "盒马推荐商品关键词字典。", "required": "True"}]
Use the following format:
Thought: you should always think about what to do
Action: the action to take, should be one of the above tools[hm_recipe_recommend, hm_product_marketing, hm_product_info, hm_product_recommend]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can be repeated zero or more times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question
Begin!
```
API format:
```text
Answer the following questions as best you can. You have access to the following APIs:
N. API name: API description. Parameters: [API parameters]
...
Use the following format:
Thought: you should always think about what to do
Action: the action to take, should be one of the above tools[API Name List]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can be repeated zero or more times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question
Begin!
```
The structure of an API-calling response in the Agent dataset is as follows:
```text
Action: hm_product_recommend
Action Input: {'keywords_dict': {'商品类目': '手机', '适用场景': '拍照'}}
Observation: [{'sku_code': '15956662254', 'product_name': '小米11 Ultra 5G手机 256GB', 'bussiness_type': '标品', 'product_price': '5999', 'product_description': '拥有120Hz高刷新率和骁龙888处理器,还配备了高品质相机系统'}, {'sku_code': '19424216918', 'product_name': 'vivo X60 Pro 5G手机 256GB', 'bussiness_type': '标品', 'product_price': '4498', 'product_description': '搭载Exynos 1080处理器与ZEISS联合调校的相机系统,拍照实力十足'}, {'sku_code': '14347982715', 'product_name': 'OPPO Ace2 5G手机 256GB', 'bussiness_type': '标品', 'product_price': '3999', 'product_description': '拍照实力十足,支持65W超级闪充'}]
Thought: I now know the final answer
Final Answer: 如果您想要一款拍照表现出色的手机,我为您推荐以下几款产品:1.小米11 Ultra 5G手机 256GB:价格为5999元,配备了高品质相机系统,可以拍摄出色的照片;2. vivo X60 Pro 5G手机 256GB:售价为4498元,相机系统由ZEISS联合调校,拍照实力十足;3. OPPO Ace2 5G手机 256GB:价格为3999元,不仅拍照实力十足,还支持65W超级闪充,让你的手机充电更加便捷。这些手机都有不同的特点和功能,您可以根据自己的需求进行选择。如果您还有其他问题或需求,请随时告诉我。
```
- Action: The actual API name called
- Action Input: The actual input parameters
- Observation: The actual result of the API call. It does not contribute to the loss during training (see the sketch after this list) and must be filled in from the external call during inference
- Thought: Model's thinking output
- Final Answer: Model's final answer
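As a schematic illustration of the Observation point above (this is not SWIFT's actual implementation), excluding a span from the loss amounts to setting its label ids to -100 before computing cross-entropy:
```python
import re

IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss

def observation_spans(response: str):
    """Character ranges of every 'Observation: ...' block, up to the next 'Thought:'."""
    return [(m.start(), m.end())
            for m in re.finditer(r"Observation:.*?(?=\nThought:|\Z)", response, flags=re.S)]

def mask_observation_labels(labels, token_offsets, spans):
    """Replace labels of tokens that fall inside an Observation span with IGNORE_INDEX.

    `labels` are per-token label ids and `token_offsets` the (start, end) character
    offsets produced by the tokenizer for the same tokens.
    """
    return [IGNORE_INDEX if any(s <= a and b <= e for s, e in spans) else lab
            for lab, (a, b) in zip(labels, token_offsets)]
```
When the data follow the format above, SWIFT handles this exclusion internally; the sketch is only meant to make the bullet concrete.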
## Fine-tuning
In Agent training, to avoid severe knowledge forgetting, we mix the [ms-agent](https://www.modelscope.cn/datasets/iic/ms_agent/summary) and [ms-bench](https://www.modelscope.cn/datasets/iic/ms_bench/summary) datasets at a ratio of 1:2: ms_agent contributes its full 30,000 samples, and 60,000 samples are randomly drawn from ms_bench. In addition, 3,000 self-recognition samples are added to change the model's self-perception.
| Dataset | Number of Samples |
| -------- | -------- |
| ms-agent | 30000 (full dataset) |
| ms-bench | 60000 (sampled) |
| self-recognition | 3000 (repeatedly sampled) |
We also support using your own Agent dataset. The dataset format needs to meet the requirements of [custom dataset](https://github.com/modelscope/swift/blob/main/docs/source/LLM/%E8%87%AA%E5%AE%9A%E4%B9%89%E4%B8%8E%E6%8B%93%E5%B1%95.md#%E8%87%AA%E5%AE%9A%E4%B9%89%E6%95%B0%E6%8D%AE%E9%9B%86). More specifically, the Agent's response/system should conform to the above Action/Action Input/Observation format.
We added **MLP** and **Embedder** to lora_target_modules. Specifying `--lora_target_modules ALL` adds LoRA to all linear layers (including qkvo, mlp and embedder), which **usually gives the best results**.
The fine-tuning used the qwen-7b-chat model, with the following hyperparameters:
| Hyperparameter | Value |
| -------- | -------- |
| LR | 5e-5 |
| Epoch | 2 |
| lora_rank | 8 |
| lora_alpha | 32 |
| lora_target_modules | ALL |
| batch_size | 2 |
| gradient_accumulation_steps | 32 total |
The running command and other hyperparameters are as follows:
```shell
# Experimental environment: 8GPU
nproc_per_node=8
PYTHONPATH=../../.. \
torchrun \
--nproc_per_node=$nproc_per_node \
--master_port 29500 \
llm_sft.py \
--model_id_or_path qwen/Qwen-7B-Chat \
--model_revision master \
--sft_type lora \
--tuner_backend peft \
--dtype AUTO \
--output_dir output \
--dataset ms-agent \
--train_dataset_mix_ratio 2.0 \
--train_dataset_sample -1 \
--num_train_epochs 2 \
--max_length 1500 \
--check_dataset_strategy warning \
--lora_rank 8 \
--lora_alpha 32 \
--lora_dropout_p 0.05 \
--lora_target_modules ALL \
--self_cognition_sample 3000 \
--model_name 卡卡罗特 \
--model_author 陶白白 \
--gradient_checkpointing true \
--batch_size 2 \
--weight_decay 0.01 \
--learning_rate 5e-5 \
--gradient_accumulation_steps $(expr 32 / $nproc_per_node) \
--max_grad_norm 0.5 \
--warmup_ratio 0.03 \
--eval_steps 100 \
--save_steps 100 \
--save_total_limit 2 \
--logging_steps 10
```
The official experiment was run in an 8-GPU environment, with a **training time of 3 hours**.
> [!NOTE]
>
> 1. This training can also run on a single consumer-grade GPU (**22 GiB of GPU memory occupied**); simply change the DDP command above to a single-GPU command.
>
> 2. Knowledge forgetting is not severe with LoRA training, so the proportion of the ms-bench dataset can be lowered appropriately to improve training speed.
## Inference
We evaluate both general knowledge and Agent capability. A simple evaluation is listed below.
### Original Model
#### General Knowledge
> How to make West Lake vinegar fish
![image-20240201122323540](../../resources/image-20240201122323540.png)
> What is the difference between COVID-19 and the common cold
![image-20240201122441874](../../resources/image-20240201122441874.png)
#### Agent Capability
We use a fire alarm scenario as a test case:
```text
Answer the following questions as best you can. You have access to the following APIs:
1. fire_recognition: Call this tool to interact with the fire recognition API. This API is used to recognize whether there is fire in the image. Parameters: [{"name": "image", "description": "The input image to recognize fire", "required": "True"}]
2. fire_alert: Call this tool to interact with the fire alert API. This API will start an alert to warn the building's administraters. Parameters: []
3. call_police: Call this tool to interact with the police calling API. This API will call 110 to catch the thief. Parameters: []
4. call_fireman: Call this tool to interact with the fireman calling API. This API will call 119 to extinguish the fire. Parameters: []
Use the following format:
Thought: you should always think about what to do
Action: the action to take, should be one of the above tools[fire_recognition, fire_alert, call_police, call_fireman]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can be repeated zero or more times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question
Begin!
```
![image-20240201122625473](../../resources/image-20240201122625473.png)
![image-20240201122725477](../../resources/image-20240201122725477.png)
![image-20240201131811038](../../resources/image-20240201131811038.png)
As can be seen, even after the Observation is entered manually, the model's answer is incorrect.
### After Training
#### General Knowledge
> How to make West Lake vinegar fish
![image-20240201132124061](../../resources/image-20240201132124061.png)
![image-20240201132139698](../../resources/image-20240201132139698.png)
> What is the difference between COVID-19 and the common cold
![image-20240201132308260](../../resources/image-20240201132308260.png)
#### Agent Capability
![image-20240201132421298](../../resources/image-20240201132421298.png)
![image-20240201132454465](../../resources/image-20240201132454465.png)
It can be seen that after training, the model can correctly call the API and give the final answer.
#### Self-cognition
![image-20240201133359457](../../resources/image-20240201133359457.png)
### Using Agent in the Command Line
Currently, Agent inference from the command line requires specifying `--eval_human true`: when this parameter is false, the dataset content is read instead, and the API call results after `Observation:` cannot be entered manually.
```shell
# Use the trained model
swift infer --ckpt_dir output/qwen-7b-chat/vx-xxx/checkpoint-xxx --eval_human true --stop_words Observation: --infer_backend pt
# The original model such as qwen-7b-chat or chatglm3-6b-32k can also be used to run the agent
# swift infer --model_type qwen-7b-chat --eval_human true --stop_words Observation: --infer_backend pt
# swift infer --model_type chatglm3-6b-32k --eval_human true --stop_words Observation: --infer_backend pt
```
After running the command, change the system field:
```shell
# Single line system
<<< reset-system
<<< Answer the following questions as best you can. You have access to the following APIs:\n1. fire_recognition: Call this tool to interact with the fire recognition API. This API is used to recognize whether there is fire in the image. Parameters: [{"name": "image", "description": "The input image to recognize fire", "required": "True"}]\n\n2. fire_alert: Call this tool to interact with the fire alert API. This API will start an alert to warn the building's administraters. Parameters: []\n\n3. call_police: Call this tool to interact with the police calling API. This API will call 110 to catch the thief. Parameters: []\n\n4. call_fireman: Call this tool to interact with the fireman calling API. This API will call 119 to extinguish the fire. Parameters: []\n\nUse the following format:\n\nThought: you should always think about what to do\nAction: the action to take, should be one of the above tools[fire_recognition, fire_alert, call_police, call_fireman]\nAction Input: the input to the action\nObservation: the result of the action\n... (this Thought/Action/Action Input/Observation can be repeated zero or more times)\nThought: I now know the final answer\nFinal Answer: the final answer to the original input question\nBegin!
```
If you need to input in multiple lines, you can use the following command (multi-line information ends with #):
```shell
# Multi-line system
<<< multi-line
<<<[M] reset-system#
<<<[MS] Answer the following questions as best you can. You have access to the following APIs:
1. fire_recognition: Call this tool to interact with the fire recognition API. This API is used to recognize whether there is fire in the image. Parameters: [{"name": "image", "description": "The input image to recognize fire", "required": "True"}]
2. fire_alert: Call this tool to interact with the fire alert API. This API will start an alert to warn the building's administraters. Parameters: []
3. call_police: Call this tool to interact with the police calling API. This API will call 110 to catch the thief. Parameters: []
4. call_fireman: Call this tool to interact with the fireman calling API. This API will call 119 to extinguish the fire. Parameters: []
Use the following format:
Thought: you should always think about what to do
Action: the action to take, should be one of the above tools[fire_recognition, fire_alert, call_police, call_fireman]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can be repeated zero or more times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question
Begin!#
```
Next, you can perform Agent question-answering (note that when using multi-line mode input, add an extra **#** at the end of the line):
```shell
<<< The input image is /tmp/1.jpg, please help determine if there are any fire points in the image
Thought: I need to use the fire\_recognition API to analyze the input image and determine if there are any signs of fire.
Action: Use the fire\_recognition API to analyze the input image.
Action Input: /tmp/1.jpg
Observation:
<<< [{'coordinate': [101.1, 200.9], 'on_fire': True}]
Thought: The fire\_recognition API has returned a result indicating that there is fire in the input image.
Final Answer: There is fire in the input image.
```
As you can see, the model analyzes the returned API result. The user can continue asking questions for multi-turn Agent scenarios. You can also specify `--infer_backend vllm` and `--stream true` to use vLLM and streaming inference.
### Using Agent in Deployment
Since the deployed service does not manage conversation history, splicing the Agent's API call results into the messages must be done by the user. Below is a runnable example using the OpenAI-compatible API.
Server side:
```shell
# Use the trained model
swift deploy --ckpt_dir output/qwen-7b-chat/vx-xxx/checkpoint-xxx --stop_words Observation:
# The original model such as qwen-7b-chat or chatglm3-6b-32k can also be used to run agent
# swift deploy --model_type qwen-7b-chat --stop_words Observation:
# swift deploy --model_type chatglm3-6b-32k --stop_words Observation:
```
Client side:
```python
from openai import OpenAI
client = OpenAI(
api_key='EMPTY',
base_url='http://localhost:8000/v1',
)
model_type = client.models.list().data[0].id
print(f'model_type: {model_type}')
system = """Answer the following questions as best you can. You have access to the following APIs:
1. fire_recognition: Call this tool to interact with the fire recognition API. This API is used to recognize whether there is fire in the image. Parameters: [{\"name\": \"image\", \"description\": \"The input image to recognize fire\", \"required\": \"True\"}]
2. fire_alert: Call this tool to interact with the fire alert API. This API will start an alert to warn the building's administraters. Parameters: []
3. call_police: Call this tool to interact with the police calling API. This API will call 110 to catch the thief. Parameters: []
4. call_fireman: Call this tool to interact with the fireman calling API. This API will call 119 to extinguish the fire. Parameters: []
Use the following format:
Thought: you should always think about what to do
Action: the action to take, should be one of the above tools[fire_recognition, fire_alert, call_police, call_fireman]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can be repeated zero or more times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question
Begin!"""
messages = [{
'role': 'system',
'content': system
}, {
'role': 'user',
'content': '输入图片是/tmp/1.jpg,协助判断图片中是否存在着火点'
}]
resp = client.chat.completions.create(
model=model_type,
messages=messages,
stop=['Observation:'],
seed=42)
response = resp.choices[0].message.content
print(f'response: {response}')
# Streaming: splice in the observation and continue the conversation
messages.append({'role': 'assistant', 'content': response + "\n[{'coordinate': [101.1, 200.9], 'on_fire': True}]"})
print(messages)
stream_resp = client.chat.completions.create(
model=model_type,
messages=messages,
stop=['Observation:'],
stream=True,
seed=42)
print('response: ', end='')
for chunk in stream_resp:
print(chunk.choices[0].delta.content, end='', flush=True)
print()
## Output:
# model_type: qwen-7b-chat
# response: Thought: I need to check if there is fire in the image
# Action: Use fire\_recognition API
# Action Input: /tmp/1.jpg
# Observation:
# [{'role': 'system', 'content': 'Answer the following questions as best you can. You have access to the following APIs:\n1. fire_recognition: Call this tool to interact with the fire recognition API. This API is used to recognize whether there is fire in the image. Parameters: [{"name": "image", "description": "The input image to recognize fire", "required": "True"}]\n\n2. fire_alert: Call this tool to interact with the fire alert API. This API will start an alert to warn the building\'s administraters. Parameters: []\n\n3. call_police: Call this tool to interact with the police calling API. This API will call 110 to catch the thief. Parameters: []\n\n4. call_fireman: Call this tool to interact with the fireman calling API. This API will call 119 to extinguish the fire. Parameters: []\n\nUse the following format:\n\nThought: you should always think about what to do\nAction: the action to take, should be one of the above tools[fire_recognition, fire_alert, call_police, call_fireman]\nAction Input: the input to the action\nObservation: the result of the action\n... (this Thought/Action/Action Input/Observation can be repeated zero or more times)\nThought: I now know the final answer\nFinal Answer: the final answer to the original input question\nBegin!'}, {'role': 'user', 'content': '输入图片是/tmp/1.jpg,协助判断图片中是否存在着火点'}, {'role': 'assistant', 'content': "Thought: I need to check if there is fire in the image\nAction: Use fire\\_recognition API\nAction Input: /tmp/1.jpg\nObservation:\n[{'coordinate': [101.1, 200.9], 'on_fire': True}]"}]
# response:
# Final Answer: There is fire in the image at coordinates [101.1, 200.9]
```
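Because the splicing happens on the client side, the example above generalizes into a small loop: generate with `stop=['Observation:']`, run the tool named in `Action`, append the observation to the assistant message, and repeat until a `Final Answer` appears. Below is a minimal sketch of such a loop; `run_tool` and its hard-coded return value are hypothetical placeholders, and parsing/error handling is kept deliberately simple.
```python
# Minimal sketch of a client-side agent loop on top of the deployed
# OpenAI-compatible service. `run_tool` is a hypothetical placeholder that
# returns a fixed observation; replace it with real API calls.
from openai import OpenAI

client = OpenAI(api_key='EMPTY', base_url='http://localhost:8000/v1')
model_type = client.models.list().data[0].id


def run_tool(action: str, action_input: str) -> str:
    return "[{'coordinate': [101.1, 200.9], 'on_fire': True}]"


def agent_loop(messages, max_turns=5):
    content = ''
    for _ in range(max_turns):
        resp = client.chat.completions.create(
            model=model_type, messages=messages, stop=['Observation:'], seed=42)
        content = resp.choices[0].message.content
        if 'Final Answer:' in content or 'Action:' not in content:
            return content
        # Parse the Action / Action Input emitted by the model.
        action = content.split('Action:')[-1].split('\n')[0].strip()
        action_input = content.split('Action Input:')[-1].split('\n')[0].strip()
        observation = run_tool(action, action_input)
        # Splice the observation into the assistant turn and continue.
        messages.append({'role': 'assistant', 'content': content + '\n' + observation})
    return content

# Example usage (reusing the `system` prompt defined above):
# print(agent_loop([{'role': 'system', 'content': system},
#                   {'role': 'user', 'content': '输入图片是/tmp/1.jpg,协助判断图片中是否存在着火点'}]))
```
In the fire-alarm example above, `run_tool` would call the real fire_recognition API instead of returning a fixed observation.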
## Usage with Modelscope-Agent
SWIFT can be used together with [Modelscope-Agent](https://github.com/modelscope/modelscope-agent) to fine-tune models for building Agents.
This section uses AgentFabric, the interactive framework within Modelscope-Agent, to fine-tune the small model qwen-7b-chat and enable function-call capabilities.
Because the system prompt in ms-agent does not match the one used by Modelscope-Agent, training on it directly yields suboptimal results. To address this, we converted ms-agent into a new dataset, [ms_agent_for_agentfabric](https://modelscope.cn/datasets/AI-ModelScope/ms_agent_for_agentfabric/summary), which is now integrated into SWIFT. `ms-agent-for-agentfabric-default` includes 30,000 entries converted from ms-agent data, while `ms-agent-for-agentfabric-addition` contains 488 entries filtered from actual function-call traffic of the open-source AgentFabric framework.
### Fine-tuning
Replace `dataset` with `ms-agent-for-agentfabric-default` and `ms-agent-for-agentfabric-addition`:
```shell
# Experimental environment: 8GPU
nproc_per_node=8
PYTHONPATH=../../.. \
torchrun \
--nproc_per_node=$nproc_per_node \
--master_port 29500 \
llm_sft.py \
--model_id_or_path qwen/Qwen-7B-Chat \
--model_revision master \
--sft_type lora \
--tuner_backend swift \
--dtype AUTO \
--output_dir output \
--dataset ms-agent-for-agentfabric-default ms-agent-for-agentfabric-addition \
--train_dataset_mix_ratio 2.0 \
--train_dataset_sample -1 \
--num_train_epochs 2 \
--max_length 1500 \
--check_dataset_strategy warning \
--lora_rank 8 \
--lora_alpha 32 \
--lora_dropout_p 0.05 \
--lora_target_modules ALL \
--self_cognition_sample 3000 \
--model_name 卡卡罗特 \
--model_author 陶白白 \
--gradient_checkpointing true \
--batch_size 2 \
--weight_decay 0.1 \
--learning_rate 5e-5 \
--gradient_accumulation_steps $(expr 32 / $nproc_per_node) \
--max_grad_norm 0.5 \
--warmup_ratio 0.03 \
--eval_steps 100 \
--save_steps 100 \
--save_total_limit 2 \
--logging_steps 10
```
Merge the LoRA weights:
```shell
CUDA_VISIBLE_DEVICES=0 swift export \
--ckpt_dir '/path/to/qwen-7b-chat/vx-xxx/checkpoint-xxx' --merge_lora true
```
### AgentFabric
#### Environment Setup
```bash
git clone https://github.com/modelscope/modelscope-agent.git
cd modelscope-agent && pip install -r requirements.txt && pip install -r apps/agentfabric/requirements.txt
```
#### Model Deployment
Launch the vLLM service. Use either of the following methods to deploy the model:
##### swift deploy
```bash
CUDA_VISIBLE_DEVICES=0 swift deploy --ckpt_dir /path/to/qwen-7b-chat/vx-xxx/checkpoint-xxxx-merged
```
##### vllm
```bash
python -m vllm.entrypoints.openai.api_server --model /path/to/qwen-7b-chat/vx-xxx/checkpoint-xxxx-merged --trust-remote-code
```
#### Adding Local Model Configuration
In `/path/to/modelscope-agent/apps/agentfabric/config/model_config.json`, add the merged local model:
```json
"my-qwen-7b-chat": {
"type": "openai",
"model": "/path/to/qwen-7b-chat/vx-xxx/checkpoint-xxxx-merged",
"api_base": "http://localhost:8000/v1",
"is_chat": true,
"is_function_call": false,
"support_stream": false
}
```
Note that if deploying with `swift deploy`, the value of `model` should be set to `qwen-7b-chat`.
#### Launching AgentFabric
In the following walkthrough, [Wanx Image Generation](https://help.aliyun.com/zh/dashscope/opening-service?spm=a2c4g.11186623.0.0.50724937O7n40B) and [Amap Weather](https://lbs.amap.com/api/webservice/guide/create-project/get-key) will be called, both of which require manually setting an API KEY. After setting the keys, start AgentFabric:
```bash
export PYTHONPATH=$PYTHONPATH:/path/to/your/modelscope-agent
export DASHSCOPE_API_KEY=your_api_key
export AMAP_TOKEN=your_api_key
cd modelscope-agent/apps/agentfabric
python app.py
```
After entering AgentFabric:
1. Select the local model `my-qwen-7b-chat` under Configured models.
2. Choose the APIs the agent is allowed to call; here, select Wanx Image Generation and Amap Weather.
3. Click Update Configuration, wait for the configuration to complete, and interact with the Agent in the input box on the right.
> Weather Inquiry
![agentfabric_1](../../resources/agentfabric_1.png)
![agentfabric_2](../../resources/agentfabric_2.png)
> text2image
![agentfabric_3](../../resources/agentfabric_3.png)
![agentfabric_4](../../resources/agentfabric_4.png)
It can be seen that the fine-tuned model can correctly understand instructions and call tools.
## Summary
Using the Agent training capability supported by SWIFT, we fine-tuned the qwen-7b-chat model on ms-agent and ms-bench. After fine-tuning, the model retains its general knowledge question-answering ability, and when APIs are added to the system field, it can call them correctly and complete tasks. Note that:
1. When switching from LoRA to full-parameter training, knowledge forgetting becomes more severe, and the dataset mixing ratio needs to be tested and adjusted in practice.
2. Some models may still call APIs poorly after training; in that case, check whether the base model's own pre-training capability is solid.
3. If the format or language of the Agent training set changes in its details, the format used at inference time must be adjusted accordingly, otherwise performance may degrade.
4. Special characters such as `\n` in key positions matter; pay attention to keeping the inference and training formats consistent.
# Benchmark
## Table of Contents
- [Parameter Settings](#parameter-settings)
- [Quantization](#quantization)
- [Model Type & Max Length](#model-type--max-length)
- [Batch Size](#batch-size)
- [Use Flash Attn & Gradient Checkpointing](#use-flash-attn--gradient-checkpointing)
- [LoRA Rank & LoRA Target Modules](#lora-rank--lora-target-modules)
- [Gradient Accumulation Steps](#gradient-accumulation-steps)
- [Tuners](#tuners)
- [Export](#export)
- [AWQ](#awq)
- [AQLM](#aqlm)
- [Sequence Parallel](#sequence-parallel)
## Parameter Settings
Experimental environment:
- A100
- CUDA 11.8
- python 3.10
- torch 2.1.1
- flash_attn 2.3.4
- xformers 0.0.23
- auto_gptq 0.5.1
- bitsandbytes 0.41.3.post2
The following command-line settings are shared by all experiments:
```bash
--dataset_test_ratio 0 \
--dataset cls-fudan-news-zh \
--save_strategy no \
--check_dataset_strategy warning \
--preprocess_num_proc 4 \
```
If the parameters below are not specified, the following default values are used:
```bash
--max_length 2048 \
--batch_size 1 \
--gradient_checkpointing true \
--use_flash_attn true \
--lora_rank 8 \
--lora_target_modules DEFAULT \
--quantization_bit 0 \
--gradient_accumulation_steps 16 \
```
Token statistics of the corresponding test dataset (obtained by qwen's tokenizer): 3234.4±2547.5, min=91, max=19548.
The experimental script can be found in `scripts/benchmark/test_memory_time/`.
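For reference, statistics of this kind can be reproduced roughly as follows; the use of `AutoTokenizer` from transformers and a local JSONL dump of the dataset text with a `text` field are illustrative assumptions, not the contents of the experimental script.
```python
# Rough sketch of computing token-length statistics (mean±std, min, max) with the
# Qwen tokenizer; the JSONL path and its `text` field are illustrative assumptions.
import json
import statistics
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-7B-Chat', trust_remote_code=True)

lengths = []
with open('cls_fudan_news_zh.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        text = json.loads(line)['text']
        lengths.append(len(tokenizer(text)['input_ids']))

print(f'{statistics.mean(lengths):.1f}±{statistics.pstdev(lengths):.1f}, '
      f'min={min(lengths)}, max={max(lengths)}')
```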
## Quantization
The test script is:
```bash
swift sft \
--model_type {MODEL_TYPE} \
--quantization_bit {QUANTIZATION_BIT} \
--sft_type lora \
...
```
<table>
<tr>
<td>Model Type [LoRA]</td>
<td>Quantization</td>
<td>Training Speed (samples/s)</td>
<td>GPU Memory (GiB)</td>
</tr>
<tr>
<td rowspan="4">qwen-7b-chat</td>
<td>bf16</td>
<td>4.31</td>
<td>27.74</td>
</tr>
<tr>
<td>int4 (gptq)</td>
<td>2.05</td>
<td>19.21</td>
</tr>
<tr>
<td>int8 (gptq)</td>
<td>1.97</td>
<td>22.20</td>
</tr>
<tr>
<td>int4 (bnb)</td>
<td>2.41</td>
<td>23.85</td>
</tr>
<tr>
<td rowspan="4">qwen-14b-chat</td>
<td>bf16</td>
<td>2.60</td>
<td>40.14</td>
</tr>
<tr>
<td>int4 (gptq)</td>
<td>1.15</td>
<td>23.30</td>
</tr>
<tr>
<td>int8 (gptq)</td>
<td>1.08</td>
<td>29.13</td>
</tr>
<tr>
<td>int4 (bnb)</td>
<td>1.36</td>
<td>30.05</td>
</tr>
<tr>
<td rowspan="4">qwen-72b-chat</td>
<td>bf16</td>
<td>0.59 (2*A100)</td>
<td>73.71+78.54</td>
</tr>
<tr>
<td>int4 (gptq)</td>
<td>0.23</td>
<td>54.86</td>
</tr>
<tr>
<td>int8 (gptq)</td>
<td>0.21</td>
<td>78.44</td>
</tr>
<tr>
<td>int4 (bnb)</td>
<td>0.28</td>
<td>74.87</td>
</tr>
</table>
## Model Type & Max Length
### LoRA
The test script is:
```bash
swift sft \
--model_type {MODEL_TYPE} \
--max_length {MAX_LENGTH} \
--sft_type lora \
...
```
<table>
<tr>
<td>Model Type [LoRA]</td>
<td>Max Length</td>
<td>Training Speed (samples/s)</td>
<td>GPU Memory (GiB)</td>
</tr>
<tr>
<td rowspan="5">qwen-1_8b-chat</td>
<td>512</td>
<td>9.88</td>
<td>6.99</td>
</tr>
<tr>
<td>1024</td>
<td>9.90</td>
<td>10.71</td>
</tr>
<tr>
<td>2048</td>
<td>8.77</td>
<td>16.35</td>
</tr>
<tr>
<td>4096</td>
<td>5.92</td>
<td>23.80</td>
</tr>
<tr>
<td>8192</td>
<td>4.19</td>
<td>37.03</td>
</tr>
<tr>
<td rowspan="5">qwen-7b-chat</td>
<td>512</td>
<td>7.43</td>
<td>18.01</td>
</tr>
<tr>
<td>1024</td>
<td>6.51</td>
<td>21.73</td>
</tr>
<tr>
<td>2048</td>
<td>4.31</td>
<td>27.74</td>
</tr>
<tr>
<td>4096</td>
<td>2.05</td>
<td>35.31</td>
</tr>
<tr>
<td>8192</td>
<td>1.34</td>
<td>48.41</td>
</tr>
<tr>
<td rowspan="5">qwen-14b-chat</td>
<td>512</td>
<td>5.63</td>
<td>30.14</td>
</tr>
<tr>
<td>1024</td>
<td>4.36</td>
<td>34.43</td>
</tr>
<tr>
<td>2048</td>
<td>2.60</td>
<td>40.14</td>
</tr>
<tr>
<td>4096</td>
<td>1.17</td>
<td>47.95</td>
</tr>
<tr>
<td>8192</td>
<td>0.79</td>
<td>60.74</td>
</tr>
<tr>
<td rowspan="5">qwen-72b-chat (2*A100)</td>
<td>512</td>
<td>1.41</td>
<td>67.68+73.07</td>
</tr>
<tr>
<td>1024</td>
<td>1.02</td>
<td>70.25+77.11</td>
</tr>
<tr>
<td>2048</td>
<td>0.59</td>
<td>73.71+78.54</td>
</tr>
<tr>
<td>4096</td>
<td>-</td>
<td>OOM</td>
</tr>
<tr>
<td>8192</td>
<td>-</td>
<td>OOM</td>
</tr>
<tr>
<td rowspan="5">chatglm3-6b</td>
<td>512</td>
<td>6.72</td>
<td>13.94</td>
</tr>
<tr>
<td>1024</td>
<td>6.16</td>
<td>12.99</td>
</tr>
<tr>
<td>2048</td>
<td>4.20</td>
<td>17.20</td>
</tr>
<tr>
<td>4096</td>
<td>1.92</td>
<td>29.80</td>
</tr>
<tr>
<td>8192</td>
<td>1.24</td>
<td>66.82</td>
</tr>
<tr>
<td rowspan="5">yi-6b-chat</td>
<td>512</td>
<td>5.27</td>
<td>13.72</td>
</tr>
<tr>
<td>1024</td>
<td>5.07</td>
<td>15.44</td>
</tr>
<tr>
<td>2048</td>
<td>3.84</td>
<td>16.95</td>
</tr>
<tr>
<td>4096</td>
<td>1.99</td>
<td>28.25</td>
</tr>
<tr>
<td>8192</td>
<td>1.35</td>
<td>43.81</td>
</tr>
<tr>
<td rowspan="5">yi-34b-chat</td>
<td>512</td>
<td>2.32</td>
<td>66.72</td>
</tr>
<tr>
<td>1024</td>
<td>1.76</td>
<td>69.10</td>
</tr>
<tr>
<td>2048</td>
<td>1.05</td>
<td>71.34</td>
</tr>
<tr>
<td>4096</td>
<td>0.47</td>
<td>78.72</td>
</tr>
<tr>
<td>8192</td>
<td>0.31 (2*A100)</td>
<td>47.01+65.03</td>
</tr>
<tr>
<td rowspan="5">openbuddy-zephyr-7b-chat</td>
<td>512</td>
<td>5.17</td>
<td>14.99</td>
</tr>
<tr>
<td>1024</td>
<td>3.92</td>
<td>16.57</td>
</tr>
<tr>
<td>2048</td>
<td>3.08</td>
<td>19.89</td>
</tr>
<tr>
<td>4096</td>
<td>1.85</td>
<td>23.29</td>
</tr>
<tr>
<td>8192</td>
<td>0.92</td>
<td>52.14</td>
</tr>
<tr>
<td rowspan="5">baichuan2-7b-chat</td>
<td>512</td>
<td>6.09</td>
<td>18.18</td>
</tr>
<tr>
<td>1024</td>
<td>5.36</td>
<td>17.45</td>
</tr>
<tr>
<td>2048</td>
<td>3.43</td>
<td>19.18</td>
</tr>
<tr>
<td>4096</td>
<td>1.69</td>
<td>34.22</td>
</tr>
<tr>
<td>8192</td>
<td>1.16</td>
<td>45.47</td>
</tr>
<tr>
<td rowspan="5">baichuan2-13b-chat</td>
<td>512</td>
<td>5.32</td>
<td>31.01</td>
</tr>
<tr>
<td>1024</td>
<td>3.91</td>
<td>31.58</td>
</tr>
<tr>
<td>2048</td>
<td>1.77</td>
<td>32.40</td>
</tr>
<tr>
<td>4096</td>
<td>0.65</td>
<td>49.63</td>
</tr>
<tr>
<td>8192</td>
<td>0.36</td>
<td>76.17</td>
</tr>
</table>
### Full
The test script is:
```bash
swift sft \
--model_type {MODEL_TYPE} \
--max_length {MAX_LENGTH} \
--sft_type full \
...
```
<table>
<tr>
<td>Model Type [FULL]</td>
<td>Max Length</td>
<td>Training Speed (samples/s)</td>
<td>GPU Memory (GiB)</td>
</tr>
<tr>
<td rowspan="5">qwen-1_8b-chat</td>
<td>512</td>
<td>10.77</td>
<td>18.16</td>
</tr>
<tr>
<td>1024</td>
<td>10.39</td>
<td>18.62</td>
</tr>
<tr>
<td>2048</td>
<td>8.73</td>
<td>35.11</td>
</tr>
<tr>
<td>4096</td>
<td>5.45</td>
<td>31.62</td>
</tr>
<tr>
<td>8192</td>
<td>3.81</td>
<td>38.93</td>
</tr>
<tr>
<td rowspan="5">qwen-7b-chat</td>
<td>512</td>
<td>5.96</td>
<td>73.37</td>
</tr>
<tr>
<td>1024</td>
<td>5.00</td>
<td>73.64</td>
</tr>
<tr>
<td>2048</td>
<td>3.30</td>
<td>74.26</td>
</tr>
<tr>
<td>4096</td>
<td>1.64</td>
<td>78.76</td>
</tr>
<tr>
<td>8192</td>
<td>1.11 (2*A100)</td>
<td>61.34+73.00</td>
</tr>
<tr>
<td rowspan="5">qwen-14b-chat (2*A100)</td>
<td>512</td>
<td>3.66</td>
<td>60.42+72.31</td>
</tr>
<tr>
<td>1024</td>
<td>2.98</td>
<td>60.61+74.37</td>
</tr>
<tr>
<td>2048</td>
<td>1.93</td>
<td>60.70+78.22</td>
</tr>
<tr>
<td>4096</td>
<td>0.92</td>
<td>75.59+78.64</td>
</tr>
<tr>
<td>8192</td>
<td>0.62</td>
<td>76.59+77.68</td>
</tr>
</table>
## Batch Size
The test script is:
```bash
swift sft \
--batch_size {BATCH_SIZE} \
--model_type qwen-7b-chat \
--sft_type lora \
...
```
<table>
<tr>
<td>Model Type [LoRA]</td>
<td>Batch Size</td>
<td>Training Speed (samples/s)</td>
<td>GPU Memory (GiB)</td>
</tr>
<tr>
<td rowspan="4">qwen-7b-chat</td>
<td>1</td>
<td>4.31</td>
<td>27.74</td>
</tr>
<tr>
<td>2</td>
<td>3.60</td>
<td>43.11</td>
</tr>
<tr>
<td>4</td>
<td>3.02</td>
<td>63.81</td>
</tr>
<tr>
<td>8</td>
<td>2.77</td>
<td>76.14</td>
</tr>
</table>
## Use Flash Attn & Gradient Checkpointing
The test script is:
```bash
swift sft \
--use_flash_attn {USE_FLASH_ATTN} \
--gradient_checkpointing {GRADIENT_CHECKPOINTING} \
--model_type qwen-7b-chat \
--sft_type lora \
...
```
<table>
<tr>
<td>Model Type [LoRA]</td>
<td>Use Flash Attn</td>
<td>Gradient Checkpointing</td>
<td>Training Speed (samples/s)</td>
<td>GPU Memory (GiB)</td>
</tr>
<tr>
<td rowspan="4">qwen-7b-chat</td>
<td>&#x2714;</td>
<td>&#x2714;</td>
<td>4.31</td>
<td>27.74</td>
</tr>
<tr>
<td>&#x2714;</td>
<td>&#x2718;</td>
<td>6.19</td>
<td>37.70</td>
</tr>
<tr>
<td>&#x2718;</td>
<td>&#x2714;</td>
<td>3.13</td>
<td>27.71</td>
</tr>
<tr>
<td>&#x2718;</td>
<td>&#x2718;</td>
<td>4.45</td>
<td>57.67</td>
</tr>
</table>
## LoRA Rank & LoRA Target Modules
The test script is:
```bash
swift sft \
--lora_rank {LORA_RANK} \
--lora_target_modules {LORA_TARGET_MODULES} \
--model_type qwen-7b-chat \
--sft_type lora \
...
```
<table>
<tr>
<td>Model Type [LoRA]</td>
<td>LoRA Rank</td>
<td>LoRA Target Modules</td>
<td>Training Speed (samples/s)</td>
<td>GPU Memory (GiB)</td>
<td>Trainable Params (M)</td>
</tr>
<tr>
<td rowspan="4">qwen-7b-chat</td>
<td>2</td>
<td>DEFAULT (c_attn)</td>
<td>4.27</td>
<td>27.72</td>
<td>1.05</td>
</tr>
<tr>
<td>8</td>
<td>DEFAULT</td>
<td>4.31</td>
<td>27.74</td>
<td>4.19</td>
</tr>
<tr>
<td>64</td>
<td>DEFAULT</td>
<td>4.19</td>
<td>27.85</td>
<td>33.55</td>
</tr>
<tr>
<td>8</td>
<td>ALL (all linear)</td>
<td>3.22</td>
<td>27.87</td>
<td>17.89</td>
</tr>
</table>
## Gradient Accumulation Steps
The test script is:
```bash
swift sft \
--gradient_accumulation_steps {GRADIENT_ACCUMULATION_STEPS} \
--model_type qwen-7b-chat \
--sft_type lora \
...
```
<table>
<tr>
<td>Model Type [LoRA]</td>
<td>Gradient Accumulation Steps</td>
<td>Training Speed (samples/s)</td>
<td>GPU Memory (GiB)</td>
</tr>
<tr>
<td rowspan="7">qwen-7b-chat</td>
<td>1</td>
<td>4.26</td>
<td>27.73</td>
</tr>
<tr>
<td>2</td>
<td>4.32</td>
<td>27.74</td>
</tr>
<tr>
<td>4</td>
<td>4.31</td>
<td>27.74</td>
</tr>
<tr>
<td>8</td>
<td>4.32</td>
<td>27.74</td>
</tr>
<tr>
<td>16</td>
<td>4.33</td>
<td>27.74</td>
</tr>
<tr>
<td>32</td>
<td>4.30</td>
<td>27.74</td>
</tr>
<tr>
<td>64</td>
<td>4.32</td>
<td>27.74</td>
</tr>
</table>
## Tuners
| exp_name | model_type | dataset | ms-bench mix ratio | tuner | tuner_params | trainable params(M) | flash_attn | gradient_checkpointing | hypers | memory | train speed(samples/s) | infer speed(tokens/s) | train_loss | eval_loss | gsm8k weighted acc | arc weighted acc | ceval weighted acc |
| -------- | ---------- | ------- | -------------------| ----- | ------------ | ------------------- | -----------| ---------------------- | ------ | ------ | ---------------------- | --------------------- | ---------- | --------- | ------------------ | ---------------- | ------------------ |
|adalora|qwen-7b-chat|ms-agent|2.0|adalora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False|26.8389(0.3464%)|True|True|lr=5e-05/epoch=2|32.55GiB|0.92(87543 samples/95338.71 seconds)|17.33(2345 tokens/135.29 seconds)|0.57|1.07|0.391|0.665|0.569|
|adapter|qwen-7b-chat|ms-agent|2.0|adapter||33.6896(0.4344%)|True|True|lr=5e-05/epoch=2|32.19GiB|1.48(87543 samples/59067.71 seconds)|26.63(4019 tokens/150.90 seconds)|0.55|1.03|0.438|0.662|0.565|
|dora|qwen-7b-chat|ms-agent|2.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=True|19.2512(0.2487%)|True|True|lr=5e-05/epoch=2|32.46GiB|0.51(87543 samples/171110.54 seconds)|4.29(2413 tokens/562.32 seconds)|0.53|1.01|0.466|0.683|**0.577**|
|full+galore128|qwen-7b-chat|ms-agent|2.0|full|galore_rank=128/galore_per_parameter=false/galore_with_embedding=false|7721.3245(100.0000%)|True|True|lr=5e-05/epoch=2|47.02GiB|1.10(87543 samples/79481.96 seconds)|28.96(2400 tokens/82.88 seconds)|0.55|1.00|0.358|**0.688**|**0.577**|
|full+galore32|qwen-7b-chat|ms-agent|2.0|full|galore_rank=32/galore_per_parameter=false/galore_with_embedding=false|7721.3245(100.0000%)|True|True|lr=5e-05/epoch=2|47.05GiB|1.11(87543 samples/78989.74 seconds)|29.17(2431 tokens/83.35 seconds)|0.56|1.01|0.386|0.667|0.539|
|full+galore64|qwen-7b-chat|ms-agent|2.0|full|galore_rank=64/galore_per_parameter=false/galore_with_embedding=false|7721.3245(100.0000%)|True|True|lr=5e-05/epoch=2|46.91GiB|1.11(87543 samples/79200.36 seconds)|28.94(2448 tokens/84.60 seconds)|0.56|1.01|0.397|0.674|0.544|
|full+galore_emb|qwen-7b-chat|ms-agent|2.0|full|galore_rank=128/galore_per_parameter=false/galore_with_embedding=true|7721.3245(100.0000%)|True|True|lr=5e-05/epoch=2|44.53GiB|1.10(87543 samples/79775.02 seconds)|29.45(2433 tokens/82.62 seconds)|0.55|1.00|0.398|0.670|0.568|
|full+galore_perparam|qwen-7b-chat|ms-agent|2.0|full|galore_rank=128/galore_per_parameter=true/galore_with_embedding=false|7721.3245(100.0000%)|True|True|lr=5e-05/epoch=2|47.02GiB|1.25(87543 samples/69821.89 seconds)|29.02(2478 tokens/85.39 seconds)|0.54|1.00|0.372|0.669|0.524|
|full+no_mix|qwen-7b-chat|ms-agent|0.0|full||7721.3245(100.0000%)|True|True|lr=5e-05/epoch=2|72.56GiB|1.27(29698 samples/23356.97 seconds)|30.31(11738 tokens/387.29 seconds)|0.57|**0.44**|0.174|0.652|0.553|
|full|qwen-7b-chat|ms-agent|2.0|full||7721.3245(100.0000%)|True|True|lr=5e-05/epoch=2|73.53GiB|1.43(87543 samples/61022.97 seconds)|29.51(3382 tokens/114.62 seconds)|0.54|0.95|0.343|0.536|0.495|
|llamapro|qwen-7b-chat|ms-agent|2.0|llamapro|num_blocks=4|809.5826(9.4900%)|True|True|lr=5e-05/epoch=2|38.11GiB|1.53(87543 samples/57294.42 seconds)|25.80(2374 tokens/92.02 seconds)|0.53|1.00|0.434|0.645|0.357|
|lora+|qwen-7b-chat|ms-agent|2.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=16.0/use_rslora=False/use_dora=False|17.8913(0.2312%)|True|True|lr=5e-05/epoch=2|32.35GiB|0.95(87543 samples/91923.80 seconds)|18.81(3329 tokens/176.94 seconds)|0.53|0.98|0.432|0.647|0.344|
|lora+neftune|qwen-7b-chat|ms-agent|2.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False/neftune_noise_alpha=15.0|17.8913(0.2312%)|True|True|lr=5e-05/epoch=2|32.35GiB|0.96(87543 samples/91525.50 seconds)|19.84(161792 tokens/8156.02 seconds)|0.53|1.02|0.456|0.671|0.401|
|lora+no_mix|qwen-7b-chat|ms-agent|0.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False|17.8913(0.2312%)|True|True|lr=5e-05/epoch=2|30.86GiB|0.91(29698 samples/32570.15 seconds)|19.89(36308 tokens/1825.26 seconds)|0.53|0.53|0.470|0.666|0.574|
|lora|qwen-7b-chat|ms-agent|2.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False|17.8913(0.2312%)|True|True|lr=5e-05/epoch=2|32.35GiB|0.95(87543 samples/91974.29 seconds)|18.11(2415 tokens/133.32 seconds)|0.53|1.01|0.462|0.676|0.304|
|qwen-7b-chat-eval|qwen-7b-chat|None|0.0|None||None(None)||||None||30.81(13765 tokens/446.83 seconds)|||**0.517**|0.679|0.568|
|rslora|qwen-7b-chat|ms-agent|2.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=True/use_dora=False|17.8913(0.2312%)|True|True|lr=5e-05/epoch=2|32.35GiB|0.94(87543 samples/92758.63 seconds)|18.87(2762 tokens/146.34 seconds)|**0.53**|0.99|0.451|0.679|0.339|
| full+lisa_2 | qwen-7b-chat | ms-agent | 2.0 | full | lisa_activated_layers=2/lisa_step_interval=20 | - | True | True | lr=5e-05/epoch=2 | 31.11GiB | 2.66(76837 samples/28881.28 seconds) | 36.10(134469 tokens/3725.21 seconds) | 0.62 | 1.06 | 0.349 | 0.653 | 0.592 |
| full+lisa_4 | qwen-7b-chat | ms-agent | 2.0 | full | lisa_activated_layers=4/lisa_step_interval=20 | - | True | True | lr=5e-05/epoch=2 | 31.87GiB | 2.63(76837 samples/29215.15 seconds) | 36.75(135477 tokens/3686.17 seconds) | 0.63 | 1.06 | 0.377 | 0.656 | **0.607** |
|lora+packing+ddp|qwen-7b-chat|ms-agent|2.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False/packing=True|17.8913(0.2312%)|True|True|lr=5e-05/epoch=2|35.65GiB*2|1.56(7900 samples/5057.30 seconds)|26.20(421094 tokens/16073.09 seconds)|0.63|0.98|0.473|0.664|0.552|
|lora+packing+lazytokenize|qwen-7b-chat|ms-agent|2.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False/packing=True|17.8913(0.2312%)|True|True|lr=5e-05/epoch=2|32.83GiB|7.69(78237 samples/10179.40 seconds)|25.86(307390 tokens/11888.17 seconds)|0.63|1.04|0.472|0.660|0.554|
|lora+packing|qwen-7b-chat|ms-agent|2.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False/packing=True|17.8913(0.2312%)|True|True|lr=5e-05/epoch=2|28.06GiB|0.79(7900 samples/10048.53 seconds)|26.12(409507 tokens/15675.36 seconds)|0.61|0.95|0.492|0.676|0.539|
## unsloth
| exp_name | model_type | dataset | ms-bench mix ratio | tuner | tuner_params | trainable params(M) | flash_attn | gradient_checkpointing | hypers | memory | train speed(samples/s) | infer speed(tokens/s) | train_loss | eval_loss | gsm8k weighted acc | arc weighted acc | ceval weighted acc |
| --------------- | ------------------ | -------- | ------------------ | ----- | ------------ | ------------------- | ---------- | ---------------------- | ---------------- | -------- | ------------------------------------ | ------------------------------------- | ---------- | --------- | ------------------ | ---------------- | ------------------ |
| unsloth+lora+q4 | llama3-8b-instruct | ms-agent | 2.0 | lora | | 4.7186(0.1038%) | True | True | lr=5e-05/epoch=2 | 21.69GiB | 1.76(76839 samples/43763.01 seconds) | 15.22(160885 tokens/10570.90 seconds) | 0.58 | 1.03 | 0.668 | 0.755 | 0.501 |
## Export
| exp_name | model_type | calibration dataset | quantization method | quantization bits | infer speed(tokens/s) | gsm8k weighted acc | arc weighted acc | ceval weighted acc |
| -------- | ---------- | ------------------- | ------------------- | ----------------- | --------------------- | ------------------ | ---------------- | ------------------ |
|awq-ms-bench-mini|qwen-7b-chat|ms-bench-mini|awq|4|27.25(16501 tokens/605.47 seconds)|0.494|0.665|0.571|
|awq-pileval|qwen-7b-chat|pileval|awq|4|26.92(12994 tokens/482.72 seconds)|**0.497**|**0.675**|**0.577**|
|gptq-ms-bench-mini|qwen-7b-chat|ms-bench-mini|gptq|4|31.16(15349 tokens/492.54 seconds)|0.482|0.642|0.556|
|gptq-pileval|qwen-7b-chat|pileval|gptq|4|31.67(15185 tokens/479.54 seconds)|0.478|0.654|0.559|
## AWQ
| exp_name | model_type | dataset | ms-bench mix ratio | tuner | tuner_params | trainable params(M) | flash_attn | gradient_checkpointing | hypers | memory | train speed(samples/s) | infer speed(tokens/s) | train_loss | eval_loss | gsm8k weighted acc | arc weighted acc | ceval weighted acc |
| -------- | ---------- | ------- | -------------------| ----- | ------------ | ------------------- | -----------| ---------------------- | ------ | ------ | ---------------------- | --------------------- | ---------- | --------- | ------------------ | ---------------- | ------------------ |
|qwen1half-7b-chat-awq|qwen1half-7b-chat-awq|ms-agent|2.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False|19.9885(1.5802%)|True|True|lr=5e-05/epoch=2|24.26GiB|0.45(87543 samples/194746.58 seconds)|16.08(2469 tokens/153.58 seconds)|**0.55**|**1.19**|**0.505**|**0.737**|**0.656**|
## AQLM
| exp_name | model_type | dataset | ms-bench mix ratio | tuner | tuner_params | trainable params(M) | flash_attn | gradient_checkpointing | hypers | memory | train speed(samples/s) | infer speed(tokens/s) | train_loss | eval_loss | gsm8k weighted acc | arc weighted acc | ceval weighted acc |
| -------- | ---------- | ------- | -------------------| ----- | ------------ | ------------------- | -----------| ---------------------- | ------ | ------ | ---------------------- | --------------------- | ---------- | --------- | ------------------ | ---------------- | ------------------ |
|llama2-7b-aqlm-2bit-1x16|llama2-7b-aqlm-2bit-1x16|dureader-robust-zh|0.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False|19.9885(1.6510%)|True|True|lr=5e-05/epoch=2|4.04GiB|0.17(14994 samples/86140.71 seconds)||**0.48**|**0.74**||||
## Sequence Parallel
<table>
<tr>
<td>Model</td>
<td>Dataset</td>
<td>Hyper params</td>
<td>Total steps</td>
<td>Train speed</td>
<td>Gpu memory</td>
</tr>
<tr>
<td rowspan="4">chatglm3-6b-32k</td>
<td rowspan="4">long-alpaca-12k(8055 tokens * 12000 rows)</td>
<td>gpu=2/sequence_parallel_size=1(2 GPU DDP baseline)</td>
<td>5940</td>
<td>0.30iter/s(5h13min total)</td>
<td>27G*2</td>
</tr>
<tr>
<td>gpu=2/sequence_parallel_size=2(2 GPU with sequence parallel 2)</td>
<td>11880</td>
<td>0.5iter/s(6h total)</td>
<td>20G*2</td>
</tr>
<tr>
<td>gpu=4/sequence_parallel_size=4(4 GPU with sequence parallel 4)</td>
<td>11880</td>
<td>1iter/s(3h20min total)</td>
<td>18G*4</td>
</tr>
<tr>
<td>gpu=4/sequence_parallel_size=2(4 GPU sequence parallel 2)</td>
<td>5940</td>
<td>0.45iter/s(3h total)</td>
<td>21G*4</td>
</tr>
</table>
# Command Line Arguments
## Table of Contents
- [sft Parameters](#sft-parameters)
- [dpo Parameters](#dpo-parameters)
- [merge-lora infer Parameters](#merge-lora-infer-parameters)
- [export Parameters](#export-parameters)
- [eval Parameters](#eval-parameters)
- [app-ui Parameters](#app-ui-parameters)
- [deploy Parameters](#deploy-parameters)
## sft Parameters
- `--model_type`: Represents the selected model type, default is `None`. `model_type` specifies the default `lora_target_modules`, `template_type`, and other information for the corresponding model. You can fine-tune by specifying only `model_type`. The corresponding `model_id_or_path` will use default settings, and the model will be downloaded from ModelScope and use the default cache path. One of model_type and model_id_or_path must be specified. You can see the list of available `model_type` [here](Supported-models-datasets.md#Models). You can set the `USE_HF` environment variable to control downloading models and datasets from the HF Hub, see [HuggingFace Ecosystem Compatibility Documentation](Compat-HF.md).
- `--model_id_or_path`: Represents the `model_id` in the ModelScope/HuggingFace Hub or a local path for the model, default is `None`. If the provided `model_id_or_path` has already been registered, the `model_type` will be inferred based on the `model_id_or_path`. If it has not been registered, both `model_type` and `model_id_or_path` must be specified, e.g. `--model_type <model_type> --model_id_or_path <model_id_or_path>`.
- `--model_revision`: The version number corresponding to `model_id` on ModelScope Hub, default is `None`. If `model_revision` is `None`, use the revision registered in `MODEL_MAPPING`. Otherwise, force use of the `model_revision` passed from command line.
- `--local_repo_path`: Some models rely on a GitHub repo for loading. To avoid network issues during `git clone`, you can directly use the local repo. This parameter requires input of the local repo path, and defaults to `None`. These models include:
- mPLUG-Owl model: `https://github.com/X-PLUG/mPLUG-Owl`
- DeepSeek-VL model: `https://github.com/deepseek-ai/DeepSeek-VL`
- YI-VL model: `https://github.com/01-ai/Yi`
- LLAVA model: `https://github.com/haotian-liu/LLaVA.git`
- `--sft_type`: Fine-tuning method, default is `'lora'`. Options include: 'lora', 'full', 'longlora', 'adalora', 'ia3', 'llamapro', 'adapter', 'vera', 'boft'. If using qlora, you need to set `--sft_type lora --quantization_bit 4`.
- `--packing`: Pack the dataset samples to `max_length`, default is `False`.
- `--freeze_parameters`: When sft_type is set to 'full', freeze the bottommost parameters of the model. Range is 0. ~ 1., default is `0.`. This provides a compromise between lora and full fine-tuning.
- `--additional_trainable_parameters`: In addition to freeze_parameters, only allowed when sft_type is 'full', default is `[]`. For example, if you want to train embedding layer in addition to 50% of parameters, you can set `--freeze_parameters 0.5 --additional_trainable_parameters transformer.wte`, all parameters starting with `transformer.wte` will be activated.
- `--tuner_backend`: Backend support for lora, qlora, default is `'peft'`. Options include: 'swift', 'peft', 'unsloth'.
- `--template_type`: Type of dialogue template used, default is `'AUTO'`, i.e. look up `template` in `MODEL_MAPPING` based on `model_type`. Available `template_type` options can be found in `TEMPLATE_MAPPING.keys()`.
- `--output_dir`: Directory to store ckpt, default is `'output'`. We will append `model_type` and fine-tuning version number to this directory, allowing users to do multiple comparative experiments on different models without changing the `output_dir` command line argument. If you don't want to append this content, specify `--add_output_dir_suffix false`.
- `--add_output_dir_suffix`: Default is `True`, indicating that a suffix of `model_type` and fine-tuning version number will be appended to the `output_dir` directory. Set to `False` to avoid this behavior.
- `--ddp_backend`: Backend support for distributed training, default is `None`. Options include: 'nccl', 'gloo', 'mpi', 'ccl'.
- `--seed`: Global seed, default is `42`. Used to reproduce training results.
- `--resume_from_checkpoint`: For resuming training from checkpoint, default is `None`. You can set this to the path of a checkpoint, e.g. `'output/qwen-7b-chat/vx-xxx/checkpoint-xxx'`, to resume training.
- `--dtype`: torch_dtype when loading base model, default is `'AUTO'`, i.e. intelligently select dtype: if machine does not support bf16, use fp16; if `MODEL_MAPPING` specifies torch_dtype for corresponding model, use its dtype; otherwise use bf16. Options include: 'bf16', 'fp16', 'fp32'.
- `--dataset`: Used to select the training dataset, default is `[]`. You can see the list of available datasets [here](Supported-models-datasets.md#Datasets). If you need to train with multiple datasets, you can use ',' or ' ' to separate them, for example: `--dataset alpaca-en,alpaca-zh` or `--dataset alpaca-en alpaca-zh`. It supports Modelscope Hub/HuggingFace Hub/local paths, subset selection, and dataset sampling. The specified format for each dataset is as follows: `[HF or MS::]{dataset_name} or {dataset_id} or {dataset_path}[:subset1/subset2/...][#dataset_sample]`. The simplest case requires specifying only dataset_name, dataset_id, or dataset_path. Customizing datasets can be found in the [Customizing and Extending Datasets document](Customization.md#custom-dataset)
- Supports MS and HF hub, as well as dataset_sample. For example, 'MS::alpaca-zh#2000', 'HF::jd-sentiment-zh#2000' (the default hub used is controlled by the `USE_HF` environment variable, default is MS).
- More fine-grained control over subsets: It uses the subsets specified during registration by default (if not specified during registration, it uses 'default'). For example, 'sharegpt-gpt4'. If subsets are specified, it uses the corresponding subset of the dataset. For example, 'sharegpt-gpt4:default/V3_format#2000'. Separated by '/'.
- Support for dataset_id. For example, 'AI-ModelScope/alpaca-gpt4-data-zh#2000', 'HF::llm-wizard/alpaca-gpt4-data-zh#2000', 'hurner/alpaca-gpt4-data-zh#2000', 'HF::shibing624/alpaca-zh#2000'. If the dataset_id has been registered, it will use the preprocessing function, subsets, split, etc. specified during registration. Otherwise, it will use `SmartPreprocessor`, support 5 dataset formats, and use 'default' subsets, with split set to 'train'. The supported dataset formats can be found in the [Customizing and Extending Datasets document](Customization.md#custom-dataset).
- Support for dataset_path. For example, '1.jsonl#5000' (if it is a relative path, it is relative to the running directory).
- `--val_dataset`: Specify separate validation datasets with the same format of the `dataset` argument, default is `[]`. If using `val_dataset`, the `dataset_test_ratio` will be ignored.
- `--dataset_seed`: Seed for dataset processing, default is `42`. Exists as random_state, does not affect global seed.
- `--dataset_test_ratio`: Used to specify the ratio for splitting the sub-dataset into training and validation sets. The default value is `0.01`. If `--val_dataset` is set, this parameter becomes ineffective.
- `--train_dataset_sample`: The number of samples for the training dataset, default is `-1`, which means using the complete training dataset for training. This parameter is deprecated, please use `--dataset {dataset_name}#{dataset_sample}` instead.
- `--val_dataset_sample`: Used to sample the validation set, with a default value of `None`, which automatically selects a suitable number of data samples for validation. If you specify `-1`, the complete validation set is used for validation. This parameter is deprecated and the number of samples in the validation set is controlled by `--dataset_test_ratio` or `--val_dataset {dataset_name}#{dataset_sample}`.
- `--system`: System used in dialogue template, default is `None`, i.e. use the model's default system. If set to '', no system is used.
- `--tools_prompt`: Select the corresponding tools system prompt for the tools field transformation. The options are ['react_en', 'react_zh', 'toolbench'], which correspond to the English version of ReAct format, Chinese version of ReAct format and the toolbench format, respectively. The default is the English version of the ReAct format. For more information, you can refer to the [Agent Deployment Best Practices](Agent-deployment-best-practices.md).
- `--max_length`: Maximum token length, default is `2048`. Avoids OOM issues caused by individual overly long samples. When `--truncation_strategy delete` is specified, samples exceeding max_length will be deleted. When `--truncation_strategy truncation_left` is specified, the leftmost tokens will be truncated: `input_ids[-max_length:]`. If set to -1, no limit.
- `--truncation_strategy`: Default is `'delete'` which removes sentences exceeding max_length from dataset. `'truncation_left'` will truncate excess text from the left, which may truncate special tokens and affect performance, not recommended.
- `--check_dataset_strategy`: Default is `'none'`, i.e. no checking. If training an LLM model, `'warning'` is recommended as data check strategy. If your training target is sentence classification etc., setting to `'none'` is recommended.
- `--custom_train_dataset_path`: Default value is `[]`. This parameter has been deprecated, please use `--dataset {dataset_path}`.
- `--custom_val_dataset_path`: Default value is `[]`. This parameter is deprecated. Please use `--val_dataset {dataset_path}` instead.
- `--self_cognition_sample`: The number of samples for the self-cognition dataset. Default is `0`. If you set this value to >0, you need to specify `--model_name` and `--model_author` at the same time. This parameter has been deprecated, please use `--dataset self-cognition#{self_cognition_sample}` instead.
- `--model_name`: Default value is `[None, None]`. If self-cognition dataset sampling is enabled (i.e., specifying `--dataset self-cognition` or self_cognition_sample>0), you need to provide two values, representing the Chinese and English names of the model, respectively. For example: `--model_name 小黄 'Xiao Huang'`. If you want to learn more, you can refer to the [Self-Cognition Fine-tuning Best Practices](Self-cognition-best-practice.md).
- `--model_author`: Default is `[None, None]`. If self-cognition dataset sampling is enabled, you need to pass two values, representing the author's Chinese and English names respectively. E.g. `--model_author 魔搭 ModelScope`.
- `--quant_method`: Quantization method, default is None. You can choose from 'bnb', 'hqq', 'eetq'.
- `--quantization_bit`: Specifies whether to quantize and the number of quantization bits, default is `0`, i.e. no quantization. To use 4-bit qlora, set `--sft_type lora --quantization_bit 4`. HQQ supports 1, 2, 3, 4, and 8 bits; bnb supports 4 and 8 bits.
- `--hqq_axis`: HQQ argument; the axis along which grouping is performed. Supported values are 0 or 1. Default is `0`.
- `--hqq_dynamic_config_path`: Path to a dynamic HQQ configuration. The key is the name tag of the layer and the value is a quantization config. If set, each layer specified by its tag will use its dedicated quantization configuration. [Reference](https://github.com/mobiusml/hqq?tab=readme-ov-file#custom-quantization-configurations-%EF%B8%8F)
- `--bnb_4bit_comp_dtype`: When doing 4bit quantization, we need to dequantize during model forward and backward passes. This specifies the torch_dtype after dequantization. Default is `'AUTO'`, i.e. consistent with `dtype`. Options: 'fp16', 'bf16', 'fp32'. Has no effect when quantization_bit is 0.
- `--bnb_4bit_quant_type`: Quantization method for 4bit quantization, default is `'nf4'`. Options: 'nf4', 'fp4'. Has no effect when quantization_bit is 0.
- `--bnb_4bit_use_double_quant`: Whether to enable double quantization for 4bit quantization, default is `True`. Has no effect when quantization_bit is 0.
- `--bnb_4bit_quant_storage`: Default value is `None`. This sets the storage type used to pack the quantized 4-bit parameters. Has no effect when quantization_bit is 0.
- `--lora_target_modules`: Specify lora modules, default is `['DEFAULT']`. If lora_target_modules is passed `'DEFAULT'` or `'AUTO'`, look up `lora_target_modules` in `MODEL_MAPPING` based on `model_type` (default specifies qkv). If passed `'ALL'`, all Linear layers (excluding head) will be specified as lora modules. If passed `'EMBEDDING'`, Embedding layer will be specified as lora module. If memory allows, setting to 'ALL' is recommended. You can also set `['ALL', 'EMBEDDING']` to specify all Linear and embedding layers as lora modules. This parameter only takes effect when `sft_type` is 'lora'.
- `--lora_rank`: Default is `8`. Only takes effect when `sft_type` is 'lora'.
- `--lora_alpha`: Default is `32`. Only takes effect when `sft_type` is 'lora'.
- `--lora_dropout_p`: Default is `0.05`, only takes effect when `sft_type` is 'lora'.
- `--init_lora_weights`: Method to initialize LoRA weights, can be specified as `true`, `false`, `gaussian`, `pissa`, or `pissa_niter_[number of iters]`. Default value `true`.
- `--lora_bias_trainable`: Default is `'none'`, options: 'none', 'all'. Set to `'all'` to make all biases trainable.
- `--lora_modules_to_save`: Default is `[]`. If you want to train embedding, lm_head, or layer_norm, you can set this parameter, e.g. `--lora_modules_to_save EMBEDDING LN lm_head`. If passed `'EMBEDDING'`, Embedding layer will be added to `lora_modules_to_save`. If passed `'LN'`, `RMSNorm` and `LayerNorm` will be added to `lora_modules_to_save`.
- `--lora_dtype`: Default is `'AUTO'`, specifies dtype for lora modules. If `AUTO`, follow dtype of original module. Options: 'fp16', 'bf16', 'fp32', 'AUTO'.
- `--use_dora`: Default is `False`, whether to use `DoRA`.
- `--use_rslora`: Default is `False`, whether to use `RS-LoRA`.
- `--neftune_noise_alpha`: The noise coefficient added by `NEFTune` can improve performance of instruction fine-tuning, default is `None`. Usually can be set to 5, 10, 15. See [related paper](https://arxiv.org/abs/2310.05914).
- `--neftune_backend`: The backend of `NEFTune`, default uses `transformers` library, may encounter incompatibility when training VL models, in which case it's recommended to specify as `swift`.
- `--gradient_checkpointing`: Whether to enable gradient checkpointing, default is `True`. This can be used to save memory, although it slightly reduces training speed. Has significant effect when max_length and batch_size are large.
- `--deepspeed`: Specifies the path to the deepspeed configuration file or directly passes in configuration information in json format, default is `None`, i.e. deepspeed is not enabled. Deepspeed can save memory. We have written a default [ZeRO-2 configuration file](https://github.com/modelscope/swift/blob/main/swift/llm/ds_config/zero2.json), [ZeRO-3 configuration file](https://github.com/modelscope/swift/blob/main/swift/llm/ds_config/zero3.json). You only need to specify 'default-zero2' to use the default zero2 config file; specify 'default-zero3' to use the default zero3 config file.
- `--batch_size`: Batch_size during training, default is `1`. Increasing batch_size can improve GPU utilization, but won't necessarily improve training speed, because within a batch, shorter sentences need to be padded to the length of the longest sentence in the batch, introducing invalid computations.
- `--eval_batch_size`: Batch_size during evaluation, default is `None`, i.e. set to 1 when `predict_with_generate` is True, set to `batch_size` when False.
- `--num_train_epochs`: Number of epochs to train, default is `1`. If `max_steps >= 0`, this overrides `num_train_epochs`. Usually set to 3 ~ 5.
- `--max_steps`: Max_steps for training, default is `-1`. If `max_steps >= 0`, this overrides `num_train_epochs`.
- `--optim`: Default is `'adamw_torch'`.
- `--learning_rate`: Default is `None`, i.e. set to 1e-4 if `sft_type` is lora, set to 1e-5 if `sft_type` is full.
- `--weight_decay`: Default is `0.01`.
- `--gradient_accumulation_steps`: Gradient accumulation, default is `None`, set to `math.ceil(16 / self.batch_size / world_size)`. `total_batch_size = batch_size * gradient_accumulation_steps * world_size`.
- `--max_grad_norm`: Gradient clipping, default is `0.5`.
- `--predict_with_generate`: Whether to use generation for evaluation, default is `False`. If set to False, evaluate using `loss`. If set to True, evaluate using `ROUGE-L` and other metrics. Generative evaluation takes a long time, choose carefully.
- `--lr_scheduler_type`: Default is `'linear'`, options: 'linear', 'cosine', 'constant', etc.
- `--warmup_ratio`: Proportion of warmup in total training steps, default is `0.05`.
- `--eval_steps`: Evaluate every this many steps, default is `50`.
- `--save_steps`: Save every this many steps, default is `None`, i.e. set to `eval_steps`.
- `--save_only_model`: Whether to save only model parameters, without saving intermediate states needed for checkpoint resuming, default is `None`, i.e. if `sft_type` is 'lora' and not using deepspeed (`deepspeed` is `None`), set to False, otherwise set to True (e.g. using full fine-tuning or deepspeed).
- `--save_total_limit`: Number of checkpoints to save, default is `2`, i.e. save best and last checkpoint. If set to -1, save all checkpoints.
- `--logging_steps`: Print training information (e.g. loss, learning_rate, etc.) every this many steps, default is `5`.
- `--dataloader_num_workers`: Default value is `None`, i.e. it is set to `0` on Windows machines and `1` otherwise.
- `--push_to_hub`: Whether to sync push trained checkpoint to ModelScope Hub, default is `False`.
- `--hub_model_id`: Model_id to push to on ModelScope Hub, default is `None`, i.e. set to `f'{model_type}-{sft_type}'`. You can set this to model_id or repo_name. We will infer user_name based on hub_token. If the remote repository to push to does not exist, a new repository will be created, otherwise the previous repository will be reused. This parameter only takes effect when `push_to_hub` is set to True.
- `--hub_token`: SDK token needed for pushing. Can be obtained from [https://modelscope.cn/my/myaccesstoken](https://modelscope.cn/my/myaccesstoken), default is `None`, i.e. obtained from environment variable `MODELSCOPE_API_TOKEN`. This parameter only takes effect when `push_to_hub` is set to True.
- `--hub_private_repo`: Whether to set the permission of the pushed model repository on ModelScope Hub to private, default is `False`. This parameter only takes effect when `push_to_hub` is set to True.
- `--push_hub_strategy`: Push strategy, default is `'push_best'`. Options include: 'end', 'push_best', 'push_last', 'checkpoint', 'all_checkpoints'. 'push_best' means when saving weights each time, push and overwrite the best model from before, 'push_last' means when saving weights each time, push and overwrite the last weights from before, 'end' means only push the best model at the end of training. This parameter only takes effect when `push_to_hub` is set to True.
- `--test_oom_error`: Used to detect whether training will cause OOM, default is `False`. If set to True, the training set is sorted in descending order by max_length, which makes OOM testing easier. This parameter is generally used for testing, use carefully.
- `--disable_tqdm`: Whether to disable tqdm, useful when launching script with `nohup`. Default is `False`, i.e. enable tqdm.
- `--lazy_tokenize`: If set to False, preprocess all text before `trainer.train()`. If set to True, delay encoding text, reducing preprocessing wait and memory usage, useful when processing large datasets. Default is `None`, i.e. we intelligently choose based on template type, usually set to False for LLM models, set to True for multimodal models (to avoid excessive memory usage from loading images and audio).
- `--preprocess_num_proc`: Use multiprocessing when preprocessing dataset (tokenizing text). Default is `1`. Same as `lazy_tokenize` command line argument, used to solve slow preprocessing issue. But this strategy cannot reduce memory usage, so if dataset is huge, `lazy_tokenize` is recommended. Recommended values: 4, 8. Note: When using qwen-audio, this parameter will be forced to 1, because qwen-audio's preprocessing function uses torch's multiprocessing, which will cause compatibility issues.
- `--use_flash_attn`: Whether to use flash attn, default is `None`. Installation steps for flash_attn can be found at [https://github.com/Dao-AILab/flash-attention](https://github.com/Dao-AILab/flash-attention). Models supporting flash_attn can be found in [LLM Supported Models](Supported-models-datasets.md).
- `--ignore_args_error`: Whether to ignore Error thrown by command line parameter errors, default is `False`. Set to True if need to copy code to notebook to run.
- `--check_model_is_latest`: Check if model is latest, default is `True`. Set this to `False` if you need to train offline.
- `--logging_dir`: Default is `None`, i.e. set to `f'{self.output_dir}/runs'`, the path where tensorboard files are stored.
- `--report_to`: Default is `['tensorboard']`. You can set `--report_to all` to report to all installed integrations.
- `--acc_strategy`: Default is `'token'`, options include: 'token', 'sentence'.
- `--save_on_each_node`: Takes effect during multi-machine training, default is `True`.
- `--save_strategy`: Strategy for saving checkpoints, default is `'steps'`, options include: 'steps', 'epoch', 'no'.
- `--evaluation_strategy`: Strategy for evaluation, default is `'steps'`, options include: 'steps', 'epoch', 'no'.
- `--save_safetensors`: Default is `True`.
- `--include_num_input_tokens_seen`: Default is `False`. Tracks the number of input tokens seen throughout training.
- `--max_new_tokens`: Default is `2048`. This parameter only takes effect when `predict_with_generate` is set to True.
- `--do_sample`: Default is `True`. This parameter only takes effect when `predict_with_generate` is set to True.
- `--temperature`: Default is `0.3`. This parameter only takes effect when `do_sample` is set to True. This parameter will be used as default value in deployment parameters.
- `--top_k`: Default is `20`. This parameter only takes effect when `do_sample` is set to True. This parameter will be used as default value in deployment parameters.
- `--top_p`: Default is `0.7`. This parameter only takes effect when `do_sample` is set to True. This parameter will be used as default value in deployment parameters.
- `--repetition_penalty`: Default is `1.`. This parameter will be used as default value in deployment parameters.
- `--num_beams`: Default is `1`. This parameter only takes effect when `predict_with_generate` is set to True.
- `--gpu_memory_fraction`: Default is `None`. This parameter aims to run training under a specified maximum available GPU memory percentage, used for extreme testing.
- `--train_dataset_mix_ratio`: Default is `0.`. This parameter defines how to mix datasets for training. When specified, the training dataset is mixed with `train_dataset_mix_ratio` times its size of the general-knowledge dataset specified by `train_dataset_mix_ds`. This parameter has been deprecated, please use `--dataset {dataset_name}#{dataset_sample}` to mix datasets.
- `--train_dataset_mix_ds`: Default is `['ms-bench']`. Used for preventing knowledge forgetting, this is the general knowledge dataset. This parameter has been deprecated, please use `--dataset {dataset_name}#{dataset_sample}` to mix datasets.
- `--use_loss_scale`: Default is `False`. When enabled, it increases the loss weight of certain Agent fields (the Action/Action Input parts) to strengthen CoT; it has no effect in regular SFT scenarios.
- `--custom_register_path`: Default is `None`. Pass in a `.py` file used to register templates, models, and datasets.
- `--custom_dataset_info`: Default is `None`. Pass in the path to an external `dataset_info.json`, a JSON string, or a dictionary. Used to register custom datasets. The format example: https://github.com/modelscope/swift/blob/main/swift/llm/data/dataset_info.json
- `--device_map_config_path`: Manually configure the model's device map from a local file, default is `None`.
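As a quick illustration, the minimal sketch below combines several of the parameters above (the built-in ZeRO-2 config, batching, gradient accumulation, and saving cadence). The model and dataset names are only placeholder choices; substitute your own.
```shell
# Minimal sketch combining several of the parameters above; model and dataset are placeholders.
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
swift sft \
--model_type qwen1half-7b-chat \
--dataset blossom-math-zh \
--sft_type lora \
--deepspeed default-zero2 \
--batch_size 1 \
--gradient_accumulation_steps 16 \
--eval_steps 50 \
--save_steps 50 \
--save_total_limit 2 \
--logging_steps 5 \
--output_dir output
```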
### Long Context
- `--rope_scaling`: Default is `None`. Supports `linear` and `dynamic` scaling of positional embeddings; use it when `max_length` exceeds `max_position_embeddings`.
### FSDP Parameters
- `--fsdp`: Default value `''`, the FSDP type, please check [this documentation](https://huggingface.co/docs/transformers/v4.39.3/en/main_classes/trainer#transformers.TrainingArguments.fsdp) for details.
- `--fsdp_config`: Default value `None`, the FSDP config file path.
### Sequence Parallel Parameters
- `--sequence_parallel_size`: Default is `1`. A value greater than 1 splits each sequence across multiple GPUs to reduce memory usage; the value must evenly divide the GPU count.
### BOFT Parameters
- `--boft_block_size`: BOFT block size, default value is 4.
- `--boft_block_num`: Number of BOFT blocks, cannot be used simultaneously with `boft_block_size`.
- `--boft_target_modules`: BOFT target modules. Default is `['DEFAULT']`. If `boft_target_modules` is set to `'DEFAULT'` or `'AUTO'`, it will look up `boft_target_modules` in the `MODEL_MAPPING` based on `model_type` (default specified as qkv). If set to `'ALL'`, all Linear layers (excluding the head) will be designated as BOFT modules.
- `--boft_dropout`: Dropout value for BOFT, default is 0.0.
- `--boft_modules_to_save`: Additional modules to be trained and saved, default is `None`.
### Vera Parameters
- `--vera_rank`: Size of Vera Attention, default value is 256.
- `--vera_projection_prng_key`: Whether to store the Vera projection matrix, default is True.
- `--vera_target_modules`: Vera target modules. Default is `['DEFAULT']`. If `vera_target_modules` is set to `'DEFAULT'` or `'AUTO'`, it will look up `vera_target_modules` in the `MODEL_MAPPING` based on `model_type` (default specified as qkv). If set to `'ALL'`, all Linear layers (excluding the head) will be designated as Vera modules. Vera modules must share the same shape.
- `--vera_dropout`: Dropout value for Vera, default is 0.0.
- `--vera_d_initial`: Initial value for Vera's d matrix, default is 0.1.
- `--vera_modules_to_save`: Additional modules to be trained and saved, default is `None`.
### LoRA+ Fine-tuning Parameters
- `--lora_lr_ratio`: Default is `None`, recommended value `10~16`. Specify this parameter when using LoRA to enable LoRA+.
### GaLore Fine-tuning Parameters
- `--use_galore: bool` : Default False, whether to use GaLore (see the example after this list).
- `--galore_target_modules: Union[str, List[str]]` : Default None, apply GaLore to attention and mlp when not passed.
- `--galore_rank: int` : Default 128, rank value for GaLore.
- `--galore_update_proj_gap: int` : Default 50, update interval for decomposition matrix.
- `--galore_scale: int` : Default 1.0, matrix weight coefficient.
- `--galore_proj_type: str` : Default `std`, GaLore matrix decomposition type.
- `--galore_optim_per_parameter: bool` : Default False, whether to set a separate optimizer for each GaLore target parameter.
- `--galore_with_embedding: bool` : Default False, whether to apply GaLore to embedding.
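A minimal sketch of enabling GaLore for full-parameter training with the flags above; the model and dataset names are placeholder choices.
```shell
# Minimal GaLore sketch; model and dataset are placeholders.
CUDA_VISIBLE_DEVICES=0 \
swift sft \
--model_type qwen1half-7b-chat \
--dataset blossom-math-zh \
--sft_type full \
--use_galore true \
--galore_rank 128 \
--galore_update_proj_gap 50 \
--output_dir output
```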
### LISA Fine-tuning Parameters
Note: LISA only supports full training, which is `--sft_type full`.
- `--lisa_activated_layers`: Default is `0`, which means LISA is not used. Suggested values are `2` or `8`.
- `--lisa_step_interval`: Default is `20`; the number of iterations between switching which layers back-propagate.
### UNSLOTH Fine-tuning Parameters
Unsloth introduces no new parameters; you can use the existing parameters to enable it:
```
--tuner_backend unsloth
--sft_type full/lora
--quantization_type 4
```
### LLaMA-PRO Fine-tuning Parameters
- `--llamapro_num_new_blocks`: Default `4`, total number of new layers inserted.
- `--llamapro_num_groups`: Default `None`, how many groups to insert the new blocks into; if `None`, it equals `llamapro_num_new_blocks`, i.e. each new layer is inserted into the original model separately.
### AdaLoRA Fine-tuning Parameters
The following parameters take effect when `sft_type` is set to `adalora`. AdaLoRA's `target_modules` and other parameters inherit from lora's corresponding parameters, but the `lora_dtype` parameter has no effect.
- `--adalora_target_r`: Default `8`, AdaLoRA's average rank.
- `--adalora_init_r`: Default `12`, AdaLoRA's initial rank.
- `--adalora_tinit`: Default `0`, AdaLoRA's initial warmup.
- `--adalora_tfinal`: Default `0`, AdaLoRA's final warmup.
- `--adalora_deltaT`: Default `1`, AdaLoRA's step interval.
- `--adalora_beta1`: Default `0.85`, AdaLoRA's EMA parameter.
- `--adalora_beta2`: Default `0.85`, AdaLoRA's EMA parameter.
- `--adalora_orth_reg_weight`: Default `0.5`, AdaLoRA's regularization parameter.
### IA3 Fine-tuning Parameters
The following parameters take effect when `sft_type` is set to `ia3`.
- `--ia3_target_modules`: Specify IA3 target modules, default is `['DEFAULT']`. See `lora_target_modules` for specific meaning.
- `--ia3_feedforward_modules`: Specify the Linear name of IA3's MLP, this name must be in `ia3_target_modules`.
- `--ia3_modules_to_save`: Additional modules participating in IA3 training. See meaning of `lora_modules_to_save`.
## dpo Parameters
dpo parameters inherit from sft parameters, with the following added parameters:
- `--ref_model_type`: Type of reference model, available `model_type` options can be found in `MODEL_MAPPING.keys()`.
- `--ref_model_id_or_path`: The local cache dir for reference model, default `None`.
- `--max_prompt_length`: Maximum prompt length, this parameter is passed to DPOTrainer, setting prompt length to not exceed this value, default is `1024`.
- `--beta`: Regularization term for DPO logits, default is 0.1.
- `--label_smoothing`: Whether to use DPO smoothing, default is `0`, generally set between 0 and 0.5.
- `--loss_type`: DPO loss type, supports 'sigmoid', 'hinge', 'ipo', 'kto_pair', default is 'sigmoid'.
- `--sft_beta`: Coefficient of the sft loss added to the DPO loss, default is `0.1`, supported range [0, 1). The final loss is `(1 - sft_beta) * KL_loss + sft_beta * sft_loss` (see the example below).
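For illustration, a minimal DPO sketch that sets these parameters explicitly; the model and dataset names are placeholders, and a complete alignment script appears later in this document.
```shell
# Minimal DPO sketch; see the full human alignment script later in this document.
CUDA_VISIBLE_DEVICES=0,1 \
swift dpo \
--model_type yi-6b-chat \
--ref_model_type yi-6b-chat \
--sft_type lora \
--dataset hh-rlhf-cn:harmless_base_cn \
--max_prompt_length 512 \
--beta 0.1 \
--sft_beta 0.1 \
--loss_type sigmoid \
--output_dir output
```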
## merge-lora infer Parameters
- `--model_type`: Default is `None`, see `sft.sh command line arguments` for parameter details.
- `--model_id_or_path`: Default is `None`, see `sft.sh command line arguments` for parameter details. Recommended to use model_type to specify.
- `--model_revision`: Default is `None`. See `sft.sh command line arguments` for parameter details. If `model_id_or_path` is None or a local model directory, this parameter has no effect.
- `--sft_type`: Default is `'lora'`, see `sft.sh command line arguments` for parameter details.
- `--template_type`: Default is `'AUTO'`, see `sft.sh command line arguments` for parameter details.
- `--infer_backend`: Options are 'AUTO', 'vllm', 'pt'. Default uses 'AUTO', for intelligent selection, i.e. if `ckpt_dir` is not passed or using full fine-tuning, and vllm is installed and model supports vllm, then use vllm engine, otherwise use native torch for inference. vllm environment setup can be found in [VLLM Inference Acceleration and Deployment](VLLM-inference-acceleration-and-deployment.md), vllm supported models can be found in [Supported Models](Supported-models-datasets.md).
- `--ckpt_dir`: Required, value is the checkpoint path saved in SFT stage, e.g. `'/path/to/your/vx-xxx/checkpoint-xxx'`.
- `--load_args_from_ckpt_dir`: Whether to read model configuration info from `sft_args.json` file in `ckpt_dir`. Default is `True`.
- `--load_dataset_config`: This parameter only takes effect when `--load_args_from_ckpt_dir true`. I.e. whether to read dataset related configuration from `sft_args.json` file in `ckpt_dir`. Default is `False`.
- `--eval_human`: Whether to evaluate using validation set portion of dataset or manual evaluation. Default is `None`, for intelligent selection, if no datasets (including custom datasets) are passed, manual evaluation will be used. If datasets are passed, dataset evaluation will be used.
- `--device_map_config_path`: Manually configure the model's device map from a local file, default is `None`.
- `--seed`: Default is `42`, see `sft.sh command line arguments` for parameter details.
- `--dtype`: Default is `'AUTO'`, see `sft.sh command line arguments` for parameter details.
- `--dataset`: Default is `[]`, see `sft.sh command line arguments` for parameter details.
- `--val_dataset`: Default is `[]`, see `sft.sh command line arguments` for parameter details.
- `--dataset_seed`: Default is `42`, see `sft.sh command line arguments` for parameter details.
- `--dataset_test_ratio`: Default is `0.01`. See `sft.sh command line arguments` for parameter details.
- `--show_dataset_sample`: Represents number of validation set samples to evaluate and display, default is `10`.
- `--system`: Default is `None`. See `sft.sh command line arguments` for parameter details.
- `--tools_prompt`: Default is `react_en`. See `sft.sh command line arguments` for parameter details.
- `--max_length`: Default is `-1`. See `sft.sh command line arguments` for parameter details.
- `--truncation_strategy`: Default is `'delete'`. See `sft.sh command line arguments` for parameter details.
- `--check_dataset_strategy`: Default is `'none'`, see `sft.sh command line arguments` for parameter details.
- `--custom_train_dataset_path`: Default value is `[]`. This parameter has been deprecated, please use `--dataset {dataset_path}`.
- `--custom_val_dataset_path`: Default value is `[]`. This parameter is deprecated. Please use `--val_dataset {dataset_path}` instead.
- `--quantization_bit`: Default is 0. See `sft.sh command line arguments` for parameter details.
- `--quant_method`: Quantization method, default is None. You can choose from 'bnb', 'hqq', 'eetq'.
- `--hqq_axis`: HQQ argument; the axis along which grouping is performed. Supported values are 0 and 1. Default is `0`.
- `--hqq_dynamic_config_path`: Path to a dynamic HQQ configuration, where each key is the name tag of a layer and each value is a quantization config. If set, each layer specified by its id will use its dedicated quantization configuration. [Reference](https://github.com/mobiusml/hqq?tab=readme-ov-file#custom-quantization-configurations-%EF%B8%8F)
- `--bnb_4bit_comp_dtype`: Default is `'AUTO'`. See `sft.sh command line arguments` for parameter details. If `quantization_bit` is set to 0, this parameter has no effect.
- `--bnb_4bit_quant_type`: Default is `'nf4'`. See `sft.sh command line arguments` for parameter details. If `quantization_bit` is set to 0, this parameter has no effect.
- `--bnb_4bit_use_double_quant`: Default is `True`. See `sft.sh command line arguments` for parameter details. If `quantization_bit` is set to 0, this parameter has no effect.
- `--bnb_4bit_quant_storage`: Default is `None`. See `sft.sh command line arguments` for parameter details. If `quantization_bit` is set to 0, this parameter has no effect.
- `--max_new_tokens`: Maximum number of new tokens to generate, default is `2048`.
- `--do_sample`: Whether to use greedy generation or sampling generation, default is `True`.
- `--temperature`: Default is `0.3`. This parameter only takes effect when `do_sample` is set to True. This parameter will be used as default value in deployment parameters.
- `--top_k`: Default is `20`. This parameter only takes effect when `do_sample` is set to True. This parameter will be used as default value in deployment parameters.
- `--top_p`: Default is `0.7`. This parameter only takes effect when `do_sample` is set to True. This parameter will be used as default value in deployment parameters.
- `--repetition_penalty`: Default is `1.`. This parameter will be used as default value in deployment parameters.
- `--num_beams`: Default is `1`.
- `--use_flash_attn`: Default is `None`, i.e. 'auto'. See `sft.sh command line arguments` for parameter details.
- `--ignore_args_error`: Default is `False`, see `sft.sh command line arguments` for parameter details.
- `--stream`: Whether to use streaming output, default is `True`. This parameter only takes effect when using dataset evaluation and verbose is True.
- `--merge_lora`: Whether to merge lora weights into base model and save full weights, default is `False`. Weights will be saved in the same level directory as `ckpt_dir`, e.g. `'/path/to/your/vx-xxx/checkpoint-xxx-merged'` directory.
- `--merge_device_map`: The device_map used during merge-lora, default is `None`. To reduce memory usage, `auto` is used when only the merge-lora process is run; otherwise `cpu` is used (see the example at the end of this parameter list).
- `--save_safetensors`: Whether to save as `safetensors` file or `bin` file. Default is `True`.
- `--overwrite_generation_config`: Whether to save the generation_config used for evaluation as `generation_config.json` file, default is `None`. If `ckpt_dir` is specified, set to `True`, otherwise set to `False`. The generation_config file saved during training will be overwritten.
- `--verbose`: If set to False, use tqdm style inference. If set to True, output inference query, response, label. Default is `None`, for auto selection, i.e. when `len(val_dataset) >= 100`, set to False, otherwise set to True. This parameter only takes effect when using dataset evaluation.
- `--gpu_memory_utilization`: Parameter for initializing vllm engine `EngineArgs`, default is `0.9`. This parameter only takes effect when using vllm. VLLM inference acceleration and deployment can be found in [VLLM Inference Acceleration and Deployment](VLLM-inference-acceleration-and-deployment.md).
- `--tensor_parallel_size`: Parameter for initializing vllm engine `EngineArgs`, default is `1`. This parameter only takes effect when using vllm.
- `--max_model_len`: Overrides the model's max_model_len, default is `None`. This parameter only takes effect when using vllm.
- `--vllm_enable_lora`: Default `False`. Whether to support vllm with lora.
- `--vllm_max_lora_rank`: Default `16`. Lora rank in VLLM.
- `--lora_modules`: Default is `[]`; the input format is `'{lora_name}={lora_path}'`, e.g. `--lora_modules lora_name1=lora_path1 lora_name2=lora_path2`. `ckpt_dir` is added as `f'default-lora={args.ckpt_dir}'` by default.
- `--custom_register_path`: Default is `None`. Pass in a `.py` file used to register templates, models, and datasets.
- `--custom_dataset_info`: Default is `None`. Pass in the path to an external `dataset_info.json`, a JSON string, or a dictionary. Used for expanding datasets.
- `--rope_scaling`: Default `None`, Support `linear` and `dynamic` to scale positional embeddings. Use when `max_length` exceeds `max_position_embeddings`.
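A minimal inference sketch that combines several of the parameters above: merge the LoRA weights and then run inference with the vLLM backend. The checkpoint path is a placeholder.
```shell
# Merge LoRA weights, then infer with vLLM; the checkpoint path is a placeholder.
CUDA_VISIBLE_DEVICES=0 \
swift infer \
--ckpt_dir output/qwen1half-7b-chat/vx-xxx/checkpoint-xxx \
--merge_lora true \
--infer_backend vllm \
--max_new_tokens 2048 \
--temperature 0.3 \
--top_p 0.7
```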
## export Parameters
export parameters inherit from infer parameters, with the following added parameters:
- `--merge_lora`: Default is `False`. This parameter is already defined in InferArguments, not a new parameter. Whether to merge lora weights into base model and save full weights. Weights will be saved in the same level directory as `ckpt_dir`, e.g. `'/path/to/your/vx-xxx/checkpoint-xxx-merged'` directory.
- `--quant_bits`: Number of bits for quantization. Default is `0`, i.e. no quantization. If you set `--quant_method awq`, you can set this to `4` for 4bits quantization. If you set `--quant_method gptq`, you can set this to `2`,`3`,`4`,`8` for corresponding bits quantization. If quantizing original model, weights will be saved in `f'{args.model_type}-{args.quant_method}-int{args.quant_bits}'` directory. If quantizing fine-tuned model, weights will be saved in the same level directory as `ckpt_dir`, e.g. `f'/path/to/your/vx-xxx/checkpoint-xxx-{args.quant_method}-int{args.quant_bits}'` directory.
- `--quant_method`: Quantization method, default is `'awq'`. Options are 'awq', 'gptq'.
- `--dataset`: This parameter is already defined in InferArguments, for export it means quantization dataset. Default is `[]`. More details: including how to customize quantization dataset, can be found in [LLM Quantization Documentation](LLM-quantization.md).
- `--quant_n_samples`: Quantization parameter, default is `256`. When set to `--quant_method awq`, if OOM occurs during quantization, you can moderately reduce `--quant_n_samples` and `--quant_seqlen`. `--quant_method gptq` generally does not encounter quantization OOM.
- `--quant_seqlen`: Quantization parameter, default is `2048`.
- `--quant_device_map`: Default is `'cpu'`, to save memory. You can specify 'cuda:0', 'auto', 'cpu', etc., representing the device to load model during quantization. This parameter is independent of the actual device that performs the quantization, such as AWQ and GPTQ which will carry out quantization on cuda:0.
- `--quant_output_dir`: Default is `None`; the default quant_output_dir will be printed in the command line (a quantization example follows this parameter list).
- `--push_to_hub`: Default is `False`. Whether to push the final `ckpt_dir` to ModelScope Hub. If you specify `merge_lora`, full parameters will be pushed; if you also specify `quant_bits`, quantized model will be pushed.
- `--hub_model_id`: Default is `None`. Model_id to push to on ModelScope Hub. If `push_to_hub` is set to True, this parameter must be set.
- `--hub_token`: Default is `None`. See `sft.sh command line arguments` for parameter details.
- `--hub_private_repo`: Default is `False`. See `sft.sh command line arguments` for parameter details.
- `--commit_message`: Default is `'update files'`.
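A minimal export sketch that quantizes a fine-tuned checkpoint to 4-bit AWQ; the checkpoint path and the calibration dataset are placeholder choices.
```shell
# AWQ int4 quantization of a fine-tuned checkpoint; path and calibration dataset are placeholders.
CUDA_VISIBLE_DEVICES=0 \
swift export \
--ckpt_dir output/qwen1half-7b-chat/vx-xxx/checkpoint-xxx \
--merge_lora true \
--quant_method awq \
--quant_bits 4 \
--dataset alpaca-zh \
--quant_n_samples 256 \
--quant_seqlen 2048
```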
## eval parameters
The eval parameters inherit from the infer parameters, and additionally include the following parameters: (Note: The generation_config parameter in infer will be invalid, controlled by [evalscope](https://github.com/modelscope/eval-scope).)
- `--eval_dataset`: The official dataset for evaluation, with a default value of `['ceval', 'gsm8k', 'arc']`. Possible values include: 'arc', 'gsm8k', 'mmlu', 'cmmlu', 'ceval', 'bbh', 'general_qa'. If only custom datasets need to be evaluated, this parameter can be set to `no`.
- `--eval_few_shot`: The few-shot number of sub-datasets for each evaluation set, with a default value of `None`, meaning to use the default configuration of the dataset.
- `--eval_limit`: The sampling quantity for each sub-dataset of the evaluation set, with a default value of `None` indicating full-scale evaluation.
- `--name`: Used to differentiate the result storage path for evaluating the same configuration, with the current time as the default.
- `--eval_url`: The standard model invocation interface for OpenAI, for example, `http://127.0.0.1:8000/v1`. This needs to be set when evaluating in a deployed manner, usually not needed. Default is `None`.
```shell
swift eval --eval_url http://127.0.0.1:8000/v1 --eval_is_chat_model true --model_type gpt4 --eval_token xxx
```
- `--eval_token`: The token for the standard model invocation interface for OpenAI, with a default value of `'EMPTY'`, indicating no token.
- `--eval_is_chat_model`: If `eval_url` is not empty, this value needs to be passed to determine if it is a "chat" model. False represents a "base" model. Default is `None`.
- `--custom_eval_config`: Used for evaluating with custom datasets, and needs to be a locally existing file path. For details on file format, refer to [Custom Evaluation Set](./LLM-eval.md#Custom-Evaluation-Set). Default is `None`.
- `--eval_use_cache`: Whether to use already generated evaluation cache, so that previously evaluated results won't be rerun but only the evaluation results regenerated. Default is `False`.
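A minimal sketch of a quick smoke-test evaluation that limits each subset to a small sample; the model type and sample limit are only illustrative.
```shell
# Quick smoke-test evaluation; model type and sample limit are illustrative.
CUDA_VISIBLE_DEVICES=0 \
swift eval \
--model_type qwen1half-7b-chat \
--eval_dataset arc ceval gsm8k \
--eval_limit 10
```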
## app-ui Parameters
app-ui parameters inherit from infer parameters, with the following added parameters:
- `--host`: Default is `'127.0.0.1'`. Passed to the `demo.queue().launch(...)` function of gradio.
- `--port`: Default is `7860`. Passed to the `demo.queue().launch(...)` function of gradio.
- `--share`: Default is `False`. Passed to the `demo.queue().launch(...)` function of gradio.
## deploy Parameters
deploy parameters inherit from infer parameters, with the following added parameters:
- `--host`: Default is `'127.0.0.1'` (see the example after this list).
- `--port`: Default is `8000`.
- `--ssl_keyfile`: Default is `None`.
- `--ssl_certfile`: Default is `None`.
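A minimal deployment sketch, followed by an OpenAI-style request against the resulting endpoint. The model type is a placeholder, and the request format assumes the OpenAI-compatible interface described in [VLLM Inference Acceleration and Deployment](VLLM-inference-acceleration-and-deployment.md).
```shell
# Start an OpenAI-compatible server; the model type is a placeholder.
CUDA_VISIBLE_DEVICES=0 swift deploy --model_type qwen1half-7b-chat --port 8000

# From another shell, query the endpoint.
curl http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "qwen1half-7b-chat", "messages": [{"role": "user", "content": "Hello!"}]}'
```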
# HuggingFace Eco-compatibility
By default, we use models and datasets from [ModelScope](https://modelscope.cn/my/overview) for fine-tuning and inference. However, considering that overseas users are more familiar with the [HuggingFace](https://huggingface.co/) ecosystem, we have made it compatible with HuggingFace.
To enable HuggingFace compatibility, you need to set the environment variable `USE_HF=1`. Supported HuggingFace models and datasets can be found in the [Supported Models and Datasets](Supported-models-datasets.md). Note that some datasets are only supported in the ModelScope environment.
Here is an example inference script for qwen1.5-7b-chat:
```shell
# Experimental Environment: A10, 3090, V100
USE_HF=1 CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen1half-7b-chat
```
Fine-tuning script:
```shell
# Experimental Environment: 2 * A100
# GPU Memory Requirement: 2 * 30GB
USE_HF=1 \
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
swift sft \
--model_type qwen1half-7b-chat \
--dataset blossom-math-zh \
--num_train_epochs 5 \
--sft_type lora \
--output_dir output \
```
For inference and deployment after fine-tuning, please refer to the corresponding documentation.
# Customization and Extension
## Table of Contents
- [Custom Datasets](#custom-datasets)
- [Custom Models](#custom-models)
- [Custom Dialogue Templates](#custom-dialogue-templates)
## Custom Datasets
We support three methods for **customizing datasets**.
1. \[Recommended] Use the command line argument directly to specify `--dataset xxx.json yyy.jsonl zzz.csv`, which is more convenient for supporting custom datasets. It supports five data formats (using `SmartPreprocessor`, supported dataset formats are listed below) and supports `dataset_id` and `dataset_path`. No need to modify the `dataset_info.json` file.
2. Adding datasets to `dataset_info.json` is more flexible but cumbersome compared to the first method, and supports using two preprocessors and specifying their parameters: `RenameColumnsPreprocessor`, `ConversationsPreprocessor` (default is to use `SmartPreprocessor`). You can directly modify the built-in `dataset_info.json` in Swift, or pass in an external json file using `--custom_dataset_info xxx.json` (for users who prefer pip install over git clone to expand datasets).
3. Registering datasets: More flexible but cumbersome compared to the first and second methods, it supports using functions to preprocess datasets. Methods 1 and 2 are implemented by leveraging method 3. You can directly modify the source code for expansion, or pass in a custom registration path using `--custom_register_path xxx.py`, where the script will parse the py file (for pip install users).
### 📌 \[Recommended\] Using Command Line Arguments Directly
Supports directly passing in custom `dataset_id` (compatible with MS and HF) and `dataset_path`, as well as simultaneously passing in multiple custom datasets and their respective sample sizes. The script will automatically preprocess and concatenate the datasets. If a `dataset_id` is passed in, it will default to using the 'default' subset in the dataset_id and set the split to 'train'. If the dataset_id has already been registered, it will use the subsets, split, and preprocessing functions that were passed in during registration. If a `dataset_path` is passed in, it can be specified as a relative path or an absolute path, where the relative path is relative to the current running directory.
```bash
--dataset {dataset_id} {dataset_path}
# Dataset Mixing: the following command takes subset1 and subset2 from dataset_id and samples 20,000 records
--dataset {dataset_name}#20000 {dataset_id}:{subset1}/{subset2}#20000 {dataset_path}#10000
```
The supported file formats for the script include `csv`, `json`, and `jsonl`. You need to ensure that the incoming file conforms to the following dataset formats (only a partial list is provided). All of these formats support the `system` field (it is important to note that if the `system` field is specified in the csv format, it cannot be set to `None` and can only be specified as an empty string. There is no such restriction for the json and jsonl formats). Files in `json` and `jsonl` formats support multi-turn dialogue (`csv` does not support this).
**Format 1:**
Pre-Training
```csv
response
11111
aaaaa
AAAAA
```
```jsonl
{"response": "11111"}
{"response": "aaaaa"}
{"response": "AAAAA"}
```
Single-Round Dialogue
```csv
system,query,response
00000,11111,22222
00001,aaaaa,bbbbb
00002,AAAAA,BBBBB
```
```jsonl
{"system": "00000", "query": "11111", "response": "22222"}
{"query": "aaaaa", "response": "bbbbb"}
{"query": "AAAAA", "response": "BBBBB"}
```
Multi-Round Dialogue
```jsonl
{"system": "00000", "query": "55555", "response": "66666"}
{"query": "eeeee", "response": "fffff", "history": []}
{"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]]}
```
```json
[{"system": "00000", "query": "55555", "response": "66666"},
{"query": "eeeee", "response": "fffff", "history": []},
{"query": "EEEEE", "response": "FFFFF", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]]}]
```
**Format 2:**
```jsonl
{"conversations": [{"from": "system", "value": "00000"}, {"from": "user", "value": "11111"}, {"from": "assistant", "value": "22222"}]}
{"conversations": [{"from": "user", "value": "aaaaa"}, {"from": "assistant", "value": "bbbbb"}, {"from": "user", "value": "ccccc"}, {"from": "assistant", "value": "ddddd"}]}
{"conversations": [{"from": "user", "value": "AAAAA"}, {"from": "assistant", "value": "BBBBB"}, {"from": "user", "value": "CCCCC"}, {"from": "assistant", "value": "DDDDD"}]}
```
**Format 3:**
```jsonl
{"messages": [{"role": "system", "content": "00000"}, {"role": "user", "content": "11111"}, {"role": "assistant", "content": "22222"}]}
{"messages": [{"role": "user", "content": "aaaaa"}, {"role": "assistant", "content": "bbbbb"}, {"role": "user", "content": "ccccc"}, {"role": "assistant", "content": "ddddd"}]}
{"messages": [{"role": "user", "content": "AAAAA"}, {"role": "assistant", "content": "BBBBB"}, {"role": "user", "content": "CCCCC"}, {"role": "assistant", "content": "DDDDD"}]}
```
**Format 4:**
```jsonl
{"system": "00000", "conversation": [{"human": "11111", "assistant": "22222"}]}
{"conversation": [{"human": "aaaaa", "assistant": "bbbbb"}]}
{"conversation": [{"human": "AAAAA", "assistant": "BBBBB"}, {"human": "CCCCC", "assistant": "DDDDD"}, {"human": "EEEEE", "assistant": "FFFFF"}]}
```
**Format 5:**
```csv
system,instruction,input,output
00000,11111,22222,33333
00001,aaaaa,bbbbb,ccccc
00002,AAAAA,BBBBB,CCCCC
```
**Reinforcement Learning (DPO/ORPO)**
```jsonl
{"query": "11111", "response": "22222", "rejected_response": "33333", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]]}
{"query": "aaaaa", "response": "bbbbb", "rejected_response": "ccccc", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]]}
{"query": "AAAAA", "response": "BBBBB", "rejected_response": "CCCCC", "history": [["AAAAA", "BBBBB"], ["CCCCC", "DDDDD"]]}
```
### Adding dataset_info.json
You can refer to the [builtin dataset_info.json in Swift](https://github.com/modelscope/swift/blob/main/swift/llm/data/dataset_info.json) to expand datasets. You can directly add it in the built-in dataset_info.json, or you can pass in the path to an external dataset_info.json, a JSON string, or a dictionary using `--custom_dataset_info 1.json`.
Adding dataset_id:
```python
# MS
# Usage: `--dataset <dataset_name>`
"<dataset_name>": {
"dataset_id": "xxx/xxx"
}
# HF
# Usage: `--dataset HF::<dataset_name>` or directly use the `USE_HF` environment variable.
"<dataset_name>": {
"hf_dataset_id": "xxx/xxx"
}
```
Adding dataset_path:
```python
# You can specify relative and absolute paths. Relative paths are relative to the directory where dataset_info.json is located.
# Usage: `--dataset <dataset_name>`
"<dataset_name>": {
"dataset_path": "xxx"
}
```
Supported parameters include:
- dataset_id: The corresponding ModelScope dataset_id, default is `None`. The simplest setup requires specifying one of `dataset_id`, `hf_dataset_id`, or `dataset_path`.
- subsets: A list of names of the subsets, default is `[]`, which means using the 'default' subset.
- split: Default is ['train'], usually not necessary to set.
- hf_dataset_id: The corresponding HuggingFace dataset_id, default is `None`.
- dataset_path: Used to specify the local path of the dataset, e.g. 1.jsonl, default is `None`. It can take relative or absolute paths. If using a relative path, it is relative to the directory where the dataset_info.json is located. If dataset_path is set, then dataset_id, subsets, and hf_dataset_id parameters are ignored.
- columns: The default preprocessor used is `SmartPreprocessor`. Specifying this parameter sets it to `RenameColumnsPreprocessor`. You need to rename the columns in the dataset and convert them to the style of **format 1** mentioned above.
- conversations: Specifying this parameter sets the preprocessor to `ConversationsPreprocessor` ('columns' takes priority over 'conversations').
- remove_useless_columns: Specifies whether to remove unnecessary columns (including: 'query', 'response', 'rejected_response', 'system', 'history', 'images'), default is `True`, usually not necessary to set.
- tags: Used to annotate the dataset, default is `[]`, usually not necessary to set.
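For example, the fields above can also be supplied inline as a JSON string instead of a file. In the sketch below the dataset name, file path, and column mapping are placeholders, and the column mapping assumes the keys are the original column names and the values are the standard fields of format 1.
```shell
# Register a local jsonl file inline via --custom_dataset_info; names, paths, and columns are placeholders.
swift sft \
--model_type qwen1half-7b-chat \
--custom_dataset_info '{"my-dataset": {"dataset_path": "/path/to/train.jsonl", "columns": {"instruction": "query", "output": "response"}}}' \
--dataset my-dataset \
--output_dir output
```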
If the parameters in `dataset_info.json` are not sufficient for your needs, such as adding custom prompts, requiring advanced dataset cleaning, or complex dataset retrieval and preprocessing, you can use the method of registering datasets using functions for data retrieval and preprocessing.
### Registering Datasets
The following is an example of **registering datasets**. The complete py file can be viewed at [custom.py](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/custom.py), and the sh script can be viewed at [custom](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/custom). You can parse the registered content by specifying `--custom_register_path xxx.py`.
```python
from typing import Optional, Tuple

from datasets import Dataset as HfDataset
from modelscope import MsDataset

from swift.llm import get_dataset, register_dataset, get_dataset_from_repo
from swift.utils import get_logger

logger = get_logger()


class CustomDatasetName:
    stsb_en = 'stsb-en'


def _preprocess_stsb(dataset: HfDataset) -> HfDataset:
    prompt = """Task: Based on the given two sentences, provide a similarity score between 0.0 and 5.0.
Sentence 1: {text1}
Sentence 2: {text2}
Similarity score: """
    query = []
    response = []
    for d in dataset:
        query.append(prompt.format(text1=d['text1'], text2=d['text2']))
        response.append(f"{d['label']:.1f}")
    return HfDataset.from_dict({'query': query, 'response': response})


register_dataset(CustomDatasetName.stsb_en, 'huangjintao/stsb', None,
                 _preprocess_stsb, get_dataset_from_repo)

if __name__ == '__main__':
    # test dataset
    train_dataset, val_dataset = get_dataset([CustomDatasetName.stsb_en],
                                             check_dataset_strategy='warning')
    print(f'train_dataset: {train_dataset}')
    print(f'val_dataset: {val_dataset}')
```
The `register_dataset` function will register the dataset in the `DATASET_MAPPING`. The parameters of this function are as follows:
- `dataset_name`: Required, representing the name of the dataset, which is also the unique ID of the dataset.
- `dataset_id_or_path`: Required, representing the `dataset_id` on the ModelScope Hub or the local `dataset_dir`.
- `subsets`: List of subsets of the dataset, default is `[]`.
- `split`: Default is ['train'].
- `preprocess_func`: Preprocessing function.
- `get_function`: Default value is `None`. The function to get the dataset. If passed `None`, the decorator approach will be used to register the dataset. If passed a function, the normal approach will be used to register.
> `get_function` should return `HfDataset` or `Tuple[HfDataset, Optional[HfDataset]]`. If only one dataset is returned, it will be the train_dataset. If two datasets are returned, they will be the train_dataset and val_dataset, respectively. The `get_dataset` function supports obtaining multiple datasets, for example: `get_dataset(['dataset1', 'dataset2'])`. We will concatenate the training and validation parts of each subset and return the merged train_dataset and val_dataset.
> The `HfDataset` returned by the function needs to follow certain specifications. If you want to do **pre-training**, you only need to include the `response` field, please refer to the `'tigerbot-law-zh'` dataset for details. For **instruction tuning (single-round dialogue)**, the `query` and `response` fields need to be included, representing the user's query and the AI assistant's answer in instruction tuning respectively, please refer to the `'alpaca-zh'` dataset for details. For **multi-round dialogue**, an additional `history` field needs to be added, representing the historical information of the dialogue, please refer to the `'damo-agent-mini-zh'` dataset for details. If each dataset sample has a different `system`, an additional system field needs to be added, you can also refer to the `'damo-agent-mini-zh'` dataset for details.
- `**kwargs`: Other parameters used to annotate the dataset. This parameter generally does not need to be set.
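Once registered, the dataset can be selected by name on the command line. A minimal sketch, assuming the registration above lives in `custom.py` and using a placeholder model type:
```shell
# Train on the dataset registered in custom.py; the model type is a placeholder.
CUDA_VISIBLE_DEVICES=0 \
swift sft \
--model_type qwen1half-7b-chat \
--custom_register_path custom.py \
--dataset stsb-en \
--output_dir output
```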
## Custom Models
The following is an example of **custom models**. The complete py file can be viewed at [custom.py](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/custom.py), and the sh script can be viewed at [custom](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/custom). You can parse the registered content by specifying `--custom_register_path xxx.py`.
```python
from typing import Any, Dict

from modelscope import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from torch import dtype as Dtype
from transformers.utils.versions import require_version

from swift.llm import LoRATM, TemplateType, get_model_tokenizer, register_model
from swift.utils import get_logger

logger = get_logger()


class CustomModelType:
    tigerbot_7b = 'tigerbot-7b'
    tigerbot_13b = 'tigerbot-13b'
    tigerbot_13b_chat = 'tigerbot-13b-chat'


class CustomTemplateType:
    tigerbot = 'tigerbot'


@register_model(CustomModelType.tigerbot_7b,
                'TigerResearch/tigerbot-7b-base-v3', LoRATM.llama,
                TemplateType.default_generation)
@register_model(CustomModelType.tigerbot_13b,
                'TigerResearch/tigerbot-13b-base-v2', LoRATM.llama,
                TemplateType.default_generation)
@register_model(CustomModelType.tigerbot_13b_chat,
                'TigerResearch/tigerbot-13b-chat-v4', LoRATM.llama,
                CustomTemplateType.tigerbot)
def get_tigerbot_model_tokenizer(model_dir: str,
                                 torch_dtype: Dtype,
                                 model_kwargs: Dict[str, Any],
                                 load_model: bool = True,
                                 **kwargs):
    use_flash_attn = kwargs.pop('use_flash_attn', False)
    if use_flash_attn:
        require_version('transformers>=4.34')
        logger.info('Setting use_flash_attention_2: True')
        model_kwargs['use_flash_attention_2'] = True
    model_config = AutoConfig.from_pretrained(
        model_dir, trust_remote_code=True)
    model_config.pretraining_tp = 1
    model_config.torch_dtype = torch_dtype
    logger.info(f'model_config: {model_config}')
    tokenizer = AutoTokenizer.from_pretrained(
        model_dir, trust_remote_code=True)
    model = None
    if load_model:
        model = AutoModelForCausalLM.from_pretrained(
            model_dir,
            config=model_config,
            torch_dtype=torch_dtype,
            trust_remote_code=True,
            **model_kwargs)
    return model, tokenizer


if __name__ == '__main__':
    # test model base
    model, tokenizer = get_model_tokenizer(
        CustomModelType.tigerbot_7b, use_flash_attn=False)
    print(model.__class__.__name__)
    # test model chat
    model, tokenizer = get_model_tokenizer(
        CustomModelType.tigerbot_13b_chat, use_flash_attn=False)
    print(model.__class__.__name__)
```
`register_model` will register the model in `MODEL_MAPPING`. The meanings of the function's parameters are as follows:
- `model_type`: Required field. Represents the name of the model, and is also the unique id.
- `model_id_or_path`: Required field. Represents the `model_id` of the model in ModelScope Hub, or the local model directory `model_dir`.
- `lora_target_modules`: Default is `None`. Represents the default lora_target_modules to use when `--lora_target_modules DEFAULT` or `--lora_target_modules AUTO` is specified in the sh script, or when `--lora_target_modules` is not specified.
- `template`: Default is `TemplateType.default`. Represents the default dialogue template to use when `--template_type AUTO` is specified in the sh script, or when `--template_type` is not specified.
- `get_function`: Default value is `None`. The function to get model and tokenizer. If passed `None`, the decorator approach will be used to register the model. If passed a function, the normal approach will be used to register.
- `requires`: Default is `[]`. Represents the dependencies required by the model that differ from other models. This parameter generally does not need to be set.
- `torch_dtype`: Default is `None`. Represents the recommended torch_dtype for the model to use. This parameter generally does not need to be set.
- `revision`: Default is `None`. Used to specify the version number of the model. If `model_id_or_path` is a local model directory, this parameter is not effective. This parameter generally does not need to be set.
- `ignore_file_pattern`: Default is `None`. Represents the regular pattern of file names to be ignored when downloading, this parameter will be passed to `snapshot_download`. For example, `r'.+\.bin$'`, `r'.+\.savetensors$'`, etc. This parameter generally does not need to be set.
- `**kwargs`: Other parameters used to annotate model capabilities. This parameter generally does not need to be set.
## Custom Dialogue Templates
The following is an example of **custom dialogue templates**. The complete py file can be viewed at [custom.py](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/custom.py), and the sh script can be viewed at [custom](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/custom).
```python
from swift.llm import (Template, ModelType, dataset_map,
                       get_model_tokenizer, get_template, get_dataset,
                       print_example, register_template, DatasetName)
from swift.utils import get_logger

logger = get_logger()


class CustomTemplateType:
    tigerbot = 'tigerbot'


# Ref: https://github.com/TigerResearch/TigerBot/blob/main/infer.py
register_template(
    CustomTemplateType.tigerbot,
    Template(['{{SYSTEM}}'], ['\n\n### Instruction:\n{{QUERY}}\n\n### Response:\n'], [],
             [['eos_token_id']]))

if __name__ == '__main__':
    # test template
    train_dataset, _ = get_dataset(DatasetName.blossom_math_zh)
    _, tokenizer = get_model_tokenizer(ModelType.qwen_7b_chat, load_model=False)
    template = get_template(CustomTemplateType.tigerbot, tokenizer)
    train_dataset = dataset_map(train_dataset, template.encode)
    print_example(train_dataset[0], tokenizer)
```
`register_template` will register the dialogue template in `TEMPLATE_MAPPING`. The meanings of the function's parameters are as follows:
- `template_type`: Required field, represents the name of the dialogue template, and is also the unique id of the template.
- `template`: Required field; a `Template` instance must be passed in. Initializing a `Template` requires the following parameters: `prefix`, `prompt`, `chat_sep`, `suffix`, `default_system`.
The template initialization function builds the complete chat template from these components. Their meanings are as follows.
- `prefix`: Represents the prefix part of the dialogue template, generally including system part, prefix tokens, bos tokens, etc. We use `{{SYSTEM}}` as the placeholder for the system. If `{{SYSTEM}}` does not exist in the prefix, then this Template does not support system, e.g. `damo-agent-mini-zh` dataset.
- `prompt`: Represents a round of dialogue in the dialogue template. We use `{{QUERY}}` as the placeholder for the human query part in each round of dialogue, `{{ROUND0}}` represents the placeholder for which round of dialogue this is, starting from 0, and `{{ROUND1}}` starts from 1. The AI assistant's reply part will be concatenated after `prompt`, so we have not designed a placeholder for it. We will only calculate the loss for the AI assistant's reply part.
- `chat_sep`: If multi-round dialogue is needed, `chat_sep` will be used as the separator between each round of dialogue, such as: newline, etc. If set to None, then this Template does not support multi-round dialogue.
- `suffix`: Used as the suffix part of the dialogue template, generally eos token. Will be concatenated after the last round of dialogue.
- `default_system`: The default system.
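The registered model and template can then be selected from the command line. A minimal sketch, assuming the model, template, and dataset registrations above are all collected in `custom.py`:
```shell
# Use the custom model type and dialogue template registered in custom.py.
CUDA_VISIBLE_DEVICES=0 \
swift sft \
--custom_register_path custom.py \
--model_type tigerbot-13b-chat \
--template_type tigerbot \
--dataset blossom-math-zh \
--output_dir output
```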
# LLM Human Alignment Training Documentation
## Table of Contents
- [Environment Preparation](#environment-preparation)
- [Human Alignment Training](#human-alignment-training)
## Environment Preparation
GPU devices: A10, 3090, V100, A100 are all acceptable. For GPUs with memory <= 24G, at least a dual-card environment is required. Human alignment training loads two models per card, so it uses more memory than fine-tuning: the additional reference model used for inference consumes extra memory.
```bash
# Install ms-swift
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'
# Environment alignment (usually not necessary. If you encounter errors, you can run the following code, the repository uses the latest environment for testing)
pip install -r requirements/framework.txt -U
pip install -r requirements/llm.txt -U
```
## Human Alignment Training
The following shell script runs a human alignment training. First, you need to switch to the runtime directory:
```shell
cd examples/pytorch/llm
```
Run the following command:
```shell
# Experimental environment: 4*A100
# Memory usage: 4 * 20G, dual-card device_map * 2ddp
nproc_per_node=2
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=$nproc_per_node \
MASTER_PORT=29500 \
swift dpo \
--model_type yi-6b-chat \
--ref_model_type yi-6b-chat \
--model_revision master \
--sft_type lora \
--tuner_backend swift \
--dtype AUTO \
--output_dir output \
--dataset hh-rlhf-cn:harmless_base_cn \
--num_train_epochs 3 \
--max_length 1024 \
--max_prompt_length 512 \
--check_dataset_strategy none \
--lora_rank 8 \
--lora_alpha 32 \
--lora_dropout_p 0.05 \
--lora_target_modules ALL \
--gradient_checkpointing true \
--batch_size 1 \
--weight_decay 0.1 \
--learning_rate 5e-5 \
--gradient_accumulation_steps $(expr 16 / $nproc_per_node) \
--max_grad_norm 1.0 \
--warmup_ratio 0.03 \
--eval_steps 2000 \
--save_steps 2000 \
--save_total_limit 2 \
--logging_steps 10 \
```
### Shell Script
The sh script can be viewed [here](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/dpo).
```bash
# The following script needs to be executed in this directory
cd examples/pytorch/llm
```
**Tips**:
- We default to setting `--gradient_checkpointing true` during training to **save memory**, which will slightly reduce training speed.
- If you are using older GPUs such as **V100**, you need to set `--dtype AUTO` or `--dtype fp16`, because they do not support bf16.
- If your machine has high-performance graphics cards like A100 and you are using the qwen series models, we recommend installing [**flash-attn**](https://github.com/Dao-AILab/flash-attention), which will speed up training and inference as well as reduce memory usage (A10, 3090, V100, etc. graphics cards do not support training with flash-attn). Models that support flash-attn can be viewed in [LLM Supported Models](Supported-models-datasets.md#models)
- If you need to train offline, please use `--model_id_or_path <model_dir>` and set `--check_model_is_latest false`. For specific parameter meanings, please see [Command Line Arguments](Command-line-parameters.md).
- If you want to push weights to the ModelScope Hub during training, you need to set `--push_to_hub true`.
```bash
# dpo training for mistral-7b max_length=1024, bs=1
# Recommended experimental environment: V100, A10, 3090, 2 cards, 4 cards or 8 cards
bash scripts/dpo/lora_ddp_mp/dpo.sh
bash scripts/dpo/lora_ddp_mp/infer.sh
```
Since DPO training will result in a complete model or adapter weights, the steps for LoRA merging and inference are the same as for fine-tuning, so please refer to the corresponding steps in the [Fine-tuning Documentation](LLM-fine-tuning.md#merge-lora).
# Hands-on Training and Inference with Grok 300B
This document walks through fine-tuning and running inference on the 300B-parameter Grok MoE model in an 8-GPU environment.
## Table of Contents
- [Environment Setup](#environment-setup)
- [Finetuning](#finetuning)
- [Inference](#inference)
## Environment Setup
```shell
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'
```
## Finetuning
### Experiment Environment
- GPU: 8 x A100 80G
- Docker Image: ModelScope official image version 1.13.1
- peft: 0.10.0
### Dataset Preparation
Grok is a base model, so we used the [DuReader Question Generation dataset](https://www.modelscope.cn/datasets/modelscope/DuReader_robust-QG/summary) as the training set. This dataset contains around 15,000 examples. With a max-length of 512, there are about 10,000 training examples (average length 305±92 tokens).
### Model Preparation
For the Grok model, we used the version provided by [ColossalAI](https://www.modelscope.cn/models/colossalai/grok-1-pytorch/summary), and additionally prepared a [tokenizer conforming to the transformers standard](https://www.modelscope.cn/models/AI-ModelScope/grok-1-tokenizer/summary).
### Training
Since the Grok model is very large, neither device_map nor deepspeed ZeRO-3 without offload can run the training. Therefore, in this experiment we used LoRA + deepspeed ZeRO-3 offload to run the training. The complete training script is as follows:
```shell
# cd examples/pytorch/llm first
nproc_per_node=8
PYTHONPATH=../../.. \
torchrun \
--nproc_per_node=$nproc_per_node \
--master_port 29500 \
llm_sft.py \
--model_type grok-1 \
--sft_type lora \
--tuner_backend peft \
--dtype bf16 \
--output_dir output \
--ddp_backend nccl \
--dataset dureader-robust-zh \
--train_dataset_sample -1 \
--num_train_epochs 1 \
--max_length 512 \
--check_dataset_strategy warning \
--lora_rank 8 \
--lora_alpha 32 \
--lora_dropout_p 0.05 \
--lora_dtype AUTO \
--lora_target_modules DEFAULT \
--gradient_checkpointing true \
--batch_size 2 \
--weight_decay 0.1 \
--learning_rate 1e-4 \
--gradient_accumulation_steps $(expr 16 / $nproc_per_node) \
--max_grad_norm 0.5 \
--warmup_ratio 0.03 \
--eval_steps 100 \
--save_steps 100 \
--save_total_limit 2 \
--logging_steps 10 \
--deepspeed zero3-offload \
```
The complete training files can be found [here](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/grok-1/lora_ddp_ds).
Here are some benchmarks from the training process:
| Metric | Value |
|---------------|------------------------------------------------------------|
| GPU Memory Usage | 8 * 21G |
| Training Speed | 45s/it |
| Total Iterations | 340 (10000(dataset_length)/16(gradient_accumulation)/2(batch_size)) |
<img src="../../resources/image-20240329122854204.png" alt="image-20240329122854204" style="zoom: 33%;" />
Since the GPU memory usage is below 24G, training should theoretically be possible on RTX 3090/A10 environments.
<img src="../../resources/grok_train_loss.png" alt="train_loss (1)" style="zoom:33%;" />
<img src="../../resources/grok_train_acc.png" alt="train_acc" style="zoom:33%;" />
The training took about 4 hours.
### Inference
The SWIFT framework currently does not support deepspeed inference, so we still rely on transformers' device_map for inference. However, since the model is very large, some layers are offloaded to the CPU, which causes errors when loading LoRA for inference. We therefore patched the peft implementation so that the original Linear modules left on the meta device do not affect LoRA, and LoRA weights are moved to the correct device dynamically at runtime.
The inference script is as follows:
```shell
# cd examples/pytorch/llm first
PYTHONPATH=../../.. \
python llm_infer.py \
--ckpt_dir output/grok-1/vx-xxx-xxx/checkpoint-xxx \
--dtype bf16 \
--load_dataset_config true \
--max_new_tokens 64 \
--do_sample true \
--eval_human false \
--merge_lora false \
```
Inference result:
```text
[PROMPT]Task: Question Generation
Context: 我个人感觉是吕颂贤版,剧情和原著差别不大,虽然TVB演员颜值和风光没有大陆的好。但是香港特区人口和地域的限制,只能注重在演员的演技方面发挥很出色,楼主看过大陆排《笑傲江湖》吧!在台词上表现的很生硬没有香港的注重神色配台词,比如杜燕歌把吕颂贤表情和性格几乎和原著差别不大。武打几乎沿用徐克和程小东动作的风格很注重实际技巧,没有大陆版的在武打场面依靠电脑特效表现的太夸张了。李亚鹏版的武打动作和导演还是香港的元彬,大陆毕竟还是在武侠剧起步的比较晚,主要是还是靠明星大腕压阵而香港却是恰恰相反。
Answer: 吕颂贤版
Question:[OUTPUT]笑傲江湖哪个版本好看</s>
[LABELS]笑傲江湖哪个版本好看
--------------------------------------------------
[PROMPT]Task: Question Generation
Context: 这位朋友你好,女性出现妊娠反应一般是从6-12周左右,也就是女性怀孕1个多月就会开始出现反应,第3个月的时候,妊辰反应基本结束。 而大部分女性怀孕初期都会出现恶心、呕吐的感觉,这些症状都是因人而异的,除非恶心、呕吐的非常厉害,才需要就医,否则这些都是刚怀孕的的正常症状。1-3个月的时候可以观察一下自己的皮肤,一般女性怀孕初期可能会产生皮肤色素沉淀或是腹壁产生妊娠纹,特别是在怀孕的后期更加明显。 还有很多女性怀孕初期会出现疲倦、嗜睡的情况。怀孕三个月的时候,膀胱会受到日益胀大的子宫的压迫,容量会变小,所以怀孕期间也会有尿频的现象出现。月经停止也是刚怀孕最容易出现的症状,只要是平时月经正常的女性,在性行为后超过正常经期两周,就有可能是怀孕了。 如果你想判断自己是否怀孕,可以看看自己有没有这些反应。当然这也只是多数人的怀孕表现,也有部分女性怀孕表现并不完全是这样,如果你无法确定自己是否怀孕,最好去医院检查一下。
Answer: 6-12周
Question:[OUTPUT]怀孕几个月开始反应</s>
[LABELS]怀孕多久会有反应
--------------------------------------------------
```
# LLM Evaluation Documentation
SWIFT supports the eval (evaluation) capability to provide standardized evaluation metrics for the original model and the fine-tuned model.
## Table of Contents
- [Introduction](#Introduction)
- [Environment Setup](#Environment-setup)
- [Evaluation](#Evaluation)
- [Custom Evaluation Set](#Custom-Evaluation-Set)
## Introduction
SWIFT's eval capability utilizes the [EvalScope evaluation framework](https://github.com/modelscope/eval-scope) from the ModelScope community and provides advanced encapsulation to support evaluation needs for various models. Currently, we support the evaluation process for **standard evaluation sets** and **user-defined evaluation sets**. The **standard evaluation sets** include:
- MMLU
> MMLU (Massive Multitask Language Understanding) aims to measure the knowledge gained during pretraining by specifically evaluating models in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, humanities, and social sciences. Its difficulty ranges from elementary to advanced professional levels, testing world knowledge and problem-solving abilities. The subject range spans traditional fields such as mathematics and history to more specialized domains like law and ethics. The granularity and breadth of topics make the benchmark an ideal choice for identifying model blindspots.
>
> MMLU is an **English evaluation dataset** comprising **57 multiple-choice question-answering tasks**, covering elementary mathematics, American history, computer science, law, and more, and spanning human knowledge from high-school to expert level. It is currently one of the mainstream LLM evaluation datasets.
- CEVAL
> C-EVAL is the first comprehensive Chinese evaluation suite, aiming to evaluate the advanced knowledge and reasoning abilities of foundation models in the Chinese context. C-EVAL includes multiple-choice questions at four difficulty levels: middle school, high school, university, and professional. The questions cover 52 different subject areas, ranging from humanities to science and engineering subjects. C-EVAL also comes with C-EVAL HARD, which is a particularly challenging subset of topics from C-EVAL that requires advanced reasoning abilities to solve.
- GSM8K
> GSM8K (Grade School Math 8K) is a dataset containing 8.5K high-quality linguistically diverse elementary school math word problems. The dataset was created to support the task of question-answering on multi-step reasoning problems in elementary mathematics.
>
> GSM8K is a high-quality English elementary-school math problem test set, containing 7.5K training samples and 1K test samples. These problems typically require 2-8 steps to solve, making the set effective for evaluating mathematical and logical reasoning.
- ARC
> The AI2 Reasoning Challenge (**ARC**) dataset is a multiple-choice question-answering dataset containing questions from 3rd to 9th-grade science exams. The dataset is split into two partitions: Easy and Challenge, with the latter containing harder questions that require reasoning. Most questions have 4 answer choices, with <1% of questions having 3 or 5 answer choices. ARC also includes a supporting knowledge base of 14.3 million unstructured science text passages.
- BBH
> BBH (BIG-Bench Hard) is a dataset composed of 23 challenging tasks selected from the BIG-Bench evaluation suite.
>
> BIG-Bench is a diverse test suite aimed at evaluating language model capabilities, including tasks considered to be beyond the current abilities of language models. In the initial BIG-Bench paper, researchers found that the most advanced language models at the time could only outperform the average human rater on 65% of the tasks with a few example prompts.
>
> Therefore, the researchers selected from BIG-Bench the 23 particularly challenging tasks on which language models failed to surpass human performance, constructing the BBH dataset. These 23 tasks are considered representative of the challenges that language models still struggle with. The researchers evaluated the effect of chain-of-thought prompting on improving language model performance on BBH.
>
> Overall, the BBH dataset contains the 23 most challenging tasks from BIG-Bench and aims to test the limits of language models on complex multi-step reasoning problems. Through experiments on BBH, researchers can uncover the benefits of prompting strategies such as chain-of-thought in enhancing language model performance.
## Environment Setup
```shell
pip install ms-swift[eval] -U
```
or install from source code:
```shell
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[eval]'
```
## Evaluation
Evaluation supports the use of vLLM for acceleration. Here we demonstrate the evaluation of the original model and the LoRA fine-tuned qwen2-7b-instruct.
```shell
# Original model (approximately half an hour on a single A100)
CUDA_VISIBLE_DEVICES=0 swift eval --model_type qwen2-7b-instruct \
--eval_dataset ceval mmlu arc gsm8k --infer_backend vllm
# After LoRA fine-tuning
CUDA_VISIBLE_DEVICES=0 swift eval --ckpt_dir qwen2-7b-instruct/vx-xxx/checkpoint-xxx \
--eval_dataset ceval mmlu arc gsm8k --infer_backend vllm \
--merge_lora true \
```
You can refer to [here](./Command-line-parameters.md#eval-parameters) for the list of evaluation parameters.
### Evaluation using the deployed method
```shell
# Start deployment using the OpenAI API method
CUDA_VISIBLE_DEVICES=0 swift deploy --model_type qwen2-7b-instruct
# Evaluate using the API
# If it is not a Swift deployment, you need to additionally pass in `--eval_is_chat_model true --model_type qwen2-7b-instruct`.
swift eval --eval_url http://127.0.0.1:8000/v1 --eval_dataset ceval mmlu arc gsm8k
# The same applies to the model after LoRA fine-tuning.
```
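As a quick sanity check before running `swift eval` against the deployment, you can query the OpenAI-compatible endpoint directly. A minimal sketch, assuming the `openai` Python package is installed and the served model name matches the `model_type` (`qwen2-7b-instruct`):
```python
from openai import OpenAI

# The swift deployment exposes an OpenAI-compatible API; no real key is required.
client = OpenAI(api_key='EMPTY', base_url='http://127.0.0.1:8000/v1')
resp = client.chat.completions.create(
    model='qwen2-7b-instruct',
    messages=[{'role': 'user', 'content': 'What is 1 + 1?'}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```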
## Custom Evaluation Set
In addition, we support user-defined evaluation sets. A custom evaluation set must follow the data format (pattern) of one of the official evaluation sets. Below, we explain step by step how to run evaluation with your own evaluation set.
### Prepare Your Own Evaluation Set
Currently, we support two patterns of evaluation sets: multiple-choice format of CEval and question-answering format of General-QA.
#### Multiple-choice: CEval Format
The CEval format is suitable for scenarios where users have multiple-choice questions. That is, select one correct answer from four options, and the evaluation metric is `accuracy`. It is recommended to **directly modify** the [CEval scaffold directory](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/eval_example/custom_ceval). This directory contains two files:
```text
default_dev.csv # Used for few-shot evaluation; it must contain at least eval_few_shot rows (this csv can be empty for 0-shot evaluation)
default_val.csv # Data used for actual evaluation
```
The CEval csv file needs to be in the following format:
```csv
id,question,A,B,C,D,answer,explanation
1,通常来说,组成动物蛋白质的氨基酸有____,4种,22种,20种,19种,C,1. 目前已知构成动物蛋白质的的氨基酸有20种。
2,血液内存在的下列物质中,不属于代谢终产物的是____。,尿素,尿酸,丙酮酸,二氧化碳,C,"代谢终产物是指在生物体内代谢过程中产生的无法再被利用的物质,需要通过排泄等方式从体内排出。丙酮酸是糖类代谢的产物,可以被进一步代谢为能量或者合成其他物质,并非代谢终产物。"
```
Here, id is the evaluation sequence number, question is the question, ABCD are the options (leave blank if there are fewer than four options), answer is the correct option, and explanation is the explanation.
The `default` filename is the subset name of the CEval evaluation, which can be changed and will be used in the configuration below.
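For illustration, the sketch below (file name and row content are hypothetical) writes a `default_val.csv` in this format using Python's standard `csv` module; the writer quotes the `explanation` field automatically when it contains commas:
```python
import csv

# Hypothetical example row following the CEval custom-set format described above.
rows = [
    # id, question, A, B, C, D, answer, explanation
    [1, 'Which planet is closest to the sun____', 'Venus', 'Mercury', 'Earth', 'Mars',
     'B', 'Mercury is the innermost planet of the solar system.'],
]

with open('default_val.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)  # fields containing commas are quoted automatically
    writer.writerow(['id', 'question', 'A', 'B', 'C', 'D', 'answer', 'explanation'])
    writer.writerows(rows)
```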
#### Question-Answering: General-QA
General-QA is suitable for scenarios where users have question-answering tasks, and the evaluation metrics are `rouge` and `bleu`. It is recommended to **directly modify** the [General-QA scaffold directory](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/eval_example/custom_general_qa). This directory contains the data files for the custom question-answering evaluation set.
# LLM Experiment Documentation
## Table of Contents
- [Environment Setup](#Environment-setup)
- [Prepare Experiment Configuration](#Prepare-experiment-configuration)
- [Run Experiments](#Run-experiments)
- [Collect Experiment Results](#Collect-experiment-results)
## Environment Setup
SWIFT supports the exp (experiment) capability, which is designed to conveniently manage multiple ablation experiments that need to be conducted. The main functions of the experiment capability include:
- Support parallel execution of multiple training (export) tasks on a single machine with multiple GPUs (or a single GPU), and record information such as hyperparameters, training outputs, training metrics, etc. Tasks will be queued when the GPUs are fully occupied.
- Support directly running evaluation tasks after training (or export), and record evaluation metrics.
- Support generating a Markdown table for easy comparison of all metrics.
- Support idempotent re-runs, and completed experiments will not be re-run.
This feature complements SWIFT's training, inference, and evaluation capabilities and is essentially a task scheduler.
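The sketch below is a toy illustration of this scheduling behavior (it is not SWIFT's implementation; the commands and marker files are hypothetical): tasks wait in a queue, start only when enough GPUs are free, and completed experiments are skipped on re-run.
```python
import os
import subprocess
import time

free_gpus = ['0', '1', '2', '3']
queue = [
    # (experiment name, GPUs required, command); 'sleep' stands in for `swift sft ...`
    ('lora', 1, 'sleep 5'),
    ('lora+', 1, 'sleep 5'),
]
running = []  # (name, process, assigned gpus)

while queue or running:
    for item in running[:]:                    # reclaim GPUs from finished tasks
        name, proc, gpus = item
        if proc.poll() is not None:
            open(f'{name}.done', 'w').close()  # marker makes re-runs idempotent
            free_gpus.extend(gpus)
            running.remove(item)
    for task in queue[:]:                      # launch tasks that fit on free GPUs
        name, n_gpu, cmd = task
        if os.path.exists(f'{name}.done'):
            queue.remove(task)                 # already completed: skip
        elif len(free_gpus) >= n_gpu:
            gpus, free_gpus = free_gpus[:n_gpu], free_gpus[n_gpu:]
            env = dict(os.environ, CUDA_VISIBLE_DEVICES=','.join(gpus))
            running.append((name, subprocess.Popen(cmd, shell=True, env=env), gpus))
            queue.remove(task)
    time.sleep(1)
```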
## Prepare Experiment Configuration
An example experiment configuration is as follows:
```json
{
"cmd": "sft",
"requirements":{
"gpu": "1",
"ddp": "1"
},
"eval_requirements": {
"gpu": "1"
},
"eval_dataset": ["ceval", "gsm8k", "arc"],
"args": {
"model_type": "qwen-7b-chat",
"dataset": "ms-agent",
"train_dataset_mix_ratio": 2.0,
"batch_size": 1,
"max_length": 2048,
"use_loss_scale": true,
"gradient_accumulation_steps": 16,
"learning_rate": 5e-5,
"use_flash_attn": true,
"eval_steps": 2000,
"save_steps": 2000,
"train_dataset_sample": -1,
"val_dataset_sample": 5000,
"num_train_epochs": 2,
"check_dataset_strategy": "none",
"gradient_checkpointing": true,
"weight_decay": 0.01,
"warmup_ratio": 0.03,
"save_total_limit": 2,
"logging_steps": 10
},
"experiment": [
{
"name": "lora",
"args": {
"sft_type": "lora",
"lora_target_modules": "ALL",
"lora_rank": 8,
"lora_alpha": 32
}
},
{
"name": "lora+",
"args": {
"sft_type": "lora",
"lora_target_modules": "ALL",
"lora_rank": 8,
"lora_alpha": 32,
"lora_lr_ratio": 16.0
}
}
]
}
```
- cmd: The swift command to run in this experiment
- requirements: Configure the number of GPUs and the number of ddp (data parallel distributed processes)
- eval_requirements: The number of GPUs used for evaluation
- eval_dataset: The datasets used for evaluation. If not configured, no evaluation will be performed.
- args: Parameters corresponding to the cmd command
- experiment: Independent parameters for each sub-experiment, which will override the above parameters. Must include the name field to store experiment results
You can check [this folder](https://github.com/modelscope/swift/tree/main/scripts/benchmark/config) for examples of currently configured experiments.
## Run Experiments
```shell
# Run in the swift root directory
PYTHONPATH=. nohup python scripts/benchmark/exp.py --save_dir './experiment' --config your-config-path > run.log 2>&1 &
```
The --config parameter supports an experiment configuration file or a folder. When a folder is specified, all experiment configurations in that folder will be run in parallel.
After running the experiment, the log for each experiment will be recorded separately in the `./exp` folder, and the experiment results will be recorded in the folder specified by `--save_dir`.
## Collect Experiment Results
```shell
# Run in the swift root directory
python scripts/benchmark/generate_report.py
```
The experiment result logs are as follows:
```text
=================Printing the sft cmd result of exp tuner==================
| exp_name | model_type | dataset | ms-bench mix ratio | tuner | tuner_params | trainable params(M) | flash_attn | gradient_checkpointing | hypers | memory | train speed(samples/s) | infer speed(tokens/s) | train_loss | eval_loss | gsm8k weighted acc | arc weighted acc | ceval weighted acc |
| -------- | ---------- | ------- | -------------------| ----- | ------------ | ------------------- | -----------| ---------------------- | ------ | ------ | ---------------------- | --------------------- | ---------- | --------- | ------------------ | ---------------- | ------------------ |
|adalora|qwen-7b-chat|ms-agent|2.0|adalora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False|26.8389(0.3464%)|True|True|lr=5e-05/epoch=2|32.55GiB|0.92(87543 samples/95338.71 seconds)|17.33(2345 tokens/135.29 seconds)|0.57|1.07|0.391|0.665|0.569|
|adapter|qwen-7b-chat|ms-agent|2.0|adapter||33.6896(0.4344%)|True|True|lr=5e-05/epoch=2|32.19GiB|1.48(87543 samples/59067.71 seconds)|26.63(4019 tokens/150.90 seconds)|0.55|1.03|0.438|0.662|0.565|
|dora|qwen-7b-chat|ms-agent|2.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=True|19.2512(0.2487%)|True|True|lr=5e-05/epoch=2|32.46GiB|0.51(87543 samples/171110.54 seconds)|4.29(2413 tokens/562.32 seconds)|0.53|1.01|0.466|0.683|**0.577**|
|full+galore128|qwen-7b-chat|ms-agent|2.0|full|galore_rank=128/galore_per_parameter=false/galore_with_embedding=false|7721.3245(100.0000%)|True|True|lr=5e-05/epoch=2|47.02GiB|1.10(87543 samples/79481.96 seconds)|28.96(2400 tokens/82.88 seconds)|0.55|1.00|0.358|**0.688**|**0.577**|
...
```
You can copy the table into other documents for analysis.
# LLM Fine-tuning Documentation
## Table of Contents
- [Environment Preparation](#environment-preparation)
- [Fine-tuning](#fine-tuning)
- [DPO](#dpo)
- [ORPO](#orpo)
- [Merge LoRA](#merge-lora)
- [Quantization](#quantization)
- [Inference](#inference)
- [Web-UI](#web-ui)
## Environment Preparation
GPU devices: A10, 3090, V100, A100 are all suitable.
```bash
# Install ms-swift
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'
# If you want to use deepspeed.
pip install deepspeed -U
# If you want to use qlora training based on auto_gptq. (Recommended, better than bnb)
# Models supporting auto_gptq: `https://github.com/modelscope/swift/blob/main/docs/source/LLM/supported-models-and-datasets.md#models`
# auto_gptq and cuda versions are related, please choose the version according to `https://github.com/PanQiWei/AutoGPTQ#quick-installation`
pip install auto_gptq -U
# If you want to use bnb-based qlora training.
pip install bitsandbytes -U
# Align the environment (usually not necessary to run; if you encounter errors, you can run the commands below. The repository is tested with the latest environment.)
pip install -r requirements/framework.txt -U
pip install -r requirements/llm.txt -U
```
## Fine-Tuning
If you want to fine-tune and infer using the interface, you can check [Web-ui Documentation](../GetStarted/Web-ui.md).
### Using Python
```python
# Experimental environment: A10, 3090, V100, ...
# 20GB GPU memory
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
import torch
from swift.llm import (
DatasetName, InferArguments, ModelType, SftArguments,
infer_main, sft_main, app_ui_main
)
model_type = ModelType.qwen_7b_chat
sft_args = SftArguments(
model_type=model_type,
dataset=[f'{DatasetName.blossom_math_zh}#2000'],
output_dir='output')
result = sft_main(sft_args)
best_model_checkpoint = result['best_model_checkpoint']
print(f'best_model_checkpoint: {best_model_checkpoint}')
torch.cuda.empty_cache()
infer_args = InferArguments(
ckpt_dir=best_model_checkpoint,
load_dataset_config=True)
# merge_lora(infer_args, device_map='cpu')
result = infer_main(infer_args)
torch.cuda.empty_cache()
app_ui_main(infer_args)
```
### Using CLI
```bash
# Experimental environment: A10, 3090, V100, ...
# 20GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_id_or_path qwen/Qwen-7B-Chat \
--dataset AI-ModelScope/blossom-math-v2 \
--output_dir output \
# Using your own dataset
# custom dataset format: https://github.com/modelscope/swift/blob/main/docs/source_en/LLM/Customization.md#custom-datasets
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_id_or_path qwen/Qwen-7B-Chat \
--dataset chatml.jsonl \
--output_dir output \
# Using DDP
# Experimental environment: 2 * 3090
# 2 * 23GB GPU memory
CUDA_VISIBLE_DEVICES=0,1 \
NPROC_PER_NODE=2 \
swift sft \
--model_id_or_path qwen/Qwen-7B-Chat \
--dataset AI-ModelScope/blossom-math-v2 \
--output_dir output \
# Multi-machine multi-card
# node0
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NNODES=2 \
NODE_RANK=0 \
MASTER_ADDR=127.0.0.1 \
NPROC_PER_NODE=4 \
swift sft \
--model_id_or_path qwen/Qwen-7B-Chat \
--dataset AI-ModelScope/blossom-math-v2 \
--output_dir output \
# node1
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NNODES=2 \
NODE_RANK=1 \
MASTER_ADDR=xxx.xxx.xxx.xxx \
NPROC_PER_NODE=4 \
swift sft \
--model_id_or_path qwen/Qwen-7B-Chat \
--dataset AI-ModelScope/blossom-math-v2 \
--output_dir output \
```
### More sh Scripts
More sh scripts can be viewed [here](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts)
```bash
# Scripts need to be executed in this directory
cd examples/pytorch/llm
```
**Tips**:
- We default to setting `--gradient_checkpointing true` during training to **save memory**, which may slightly reduce training speed.
- If you want to use quantization parameters `--quantization_bit 4`, you need to first install [bnb](https://github.com/TimDettmers/bitsandbytes): `pip install bitsandbytes -U`. This will reduce memory usage but usually slows down the training speed.
- If you want to use quantization based on **auto_gptq**, you need to install the corresponding cuda version of [auto_gptq](https://github.com/PanQiWei/AutoGPTQ): `pip install auto_gptq -U`.
> Models that can use auto_gptq can be viewed in [LLM Supported Models](Supported-models-datasets.md#models). It is recommended to use auto_gptq instead of bnb.
- If you want to use deepspeed, you need `pip install deepspeed -U`. Using deepspeed can **save memory**, but may slightly reduce training speed.
- If your training involves **knowledge editing**, such as [Self-cognition Fine-tuning](Self-cognition-best-practice.md), you need to add LoRA to the MLP layers as well, otherwise the results may be poor. You can simply pass `--lora_target_modules ALL` to add LoRA to all linear layers (qkv/o and MLP), **which usually gives the best result**.
- If you are using older GPUs like **V100**, you need to set `--dtype AUTO` or `--dtype fp16`, as they do not support bf16.
- If your machine has high-performance graphics cards like A100 and the model supports flash-attn, it is recommended to install [**flash-attn**](https://github.com/Dao-AILab/flash-attention), which will speed up training and inference as well as reduce memory usage (A10, 3090, V100, etc. graphics cards do not support training with flash-attn). Models that support flash-attn can be viewed in [LLM Supported Models](Supported-models-datasets.md#models)
- If you are doing **second pre-training** or **multi-turn dialogue**, you can refer to [Customization and Extension](Customization.md#Registering-Datasets)
- If you need to train **offline**, please use `--model_id_or_path <model_dir>` and set `--check_model_is_latest false`. For specific parameter meanings, please check [Command-line Parameters](Command-line-parameters.md).
- If you want to push weights to the ModelScope Hub during training, you need to set `--push_to_hub true`.
- If you want to merge LoRA weights and save them during inference, you need to set `--merge_lora true`. **Merging is not recommended** for models trained with qlora, as it results in precision loss. For this reason, **fine-tuning with qlora is not recommended**, since its deployment ecosystem support is limited.
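To make several of the tips above concrete, here is a minimal Python sketch; it assumes the CLI flags map to `SftArguments` fields of the same name (compare the Python example earlier in this document), so check [Command-line Parameters](Command-line-parameters.md) for the exact field names:
```python
from swift.llm import DatasetName, ModelType, SftArguments, sft_main

# Assumed field names mirroring the CLI flags discussed in the tips above.
sft_args = SftArguments(
    model_type=ModelType.qwen_7b_chat,
    dataset=[f'{DatasetName.blossom_math_zh}#2000'],
    sft_type='lora',
    lora_target_modules=['ALL'],   # add LoRA to all linear layers (qkv/o and MLP)
    dtype='AUTO',                  # falls back to fp16 on GPUs without bf16 (e.g. V100)
    gradient_checkpointing=True,   # default; saves memory at a small speed cost
    output_dir='output')
result = sft_main(sft_args)
print(result['best_model_checkpoint'])
```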
**Note**:
- Due to legacy naming, scripts ending with `xxx_ds` use deepspeed zero2 for training (e.g. `full_ddp_ds`).
- Scripts other than those listed below may not be maintained.
If you want to **customize scripts**, you can refer to the following scripts for modification: (The following scripts will be **regularly maintained**)
- full: [qwen1half-7b-chat](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen1half_7b_chat/full) (A100), [qwen-7b-chat](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/full_mp) (2*A100)
- full+ddp+zero2: [qwen-7b-chat](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/full_ddp_zero2) (4*A100)
- full+ddp+zero3: [qwen-14b-chat](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b_chat/full_ddp_zero3) (4*A100)
- lora: [chatglm3-6b](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/chatglm3_6b/lora) (3090), [baichuan2-13b-chat](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/baichuan2_13b_chat/lora_mp) (2*3090), [yi-34b-chat](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/yi_34b_chat/lora) (A100), [qwen-72b-chat](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_72b_chat/lora_mp) (2*A100)
- lora+ddp: [chatglm3-6b](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/chatglm3_6b/lora_ddp) (2*3090)
- lora+ddp+zero3: [qwen-14b-chat](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_14b_chat/lora_ddp_zero3) (4*3090), [qwen-72b-chat](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_72b_chat/lora_ddp_zero3) (4*A100)
- qlora(gptq-int4): [qwen-7b-chat-int4](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat_int4/qlora) (3090)
- qlora(gptq-int8): [qwen1half-7b-chat-int8](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen1half_7b_chat_int8/qlora) (3090)
- qlora(bnb-int4): [qwen-7b-chat](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_7b_chat/qlora) (3090)
## DPO
If you want to use DPO for human-aligned fine-tuning, you can check the [DPO Fine-Tuning Documentation](DPO.md).
## ORPO
If you want to use ORPO for human-aligned fine-tuning, you can check the [ORPO Fine-Tuning Documentation](ORPO.md).
## Merge LoRA
Tip: **Currently**, merging LoRA is not supported for bnb and auto_gptq quantized models, as this would result in significant accuracy loss.
```bash
# If you need quantization, you can specify `--quant_bits 4`.
CUDA_VISIBLE_DEVICES=0 swift export \
--ckpt_dir 'xxx/vx-xxx/checkpoint-xxx' --merge_lora true
```
## Quantization
For quantization of the fine-tuned model, you can check [LLM Quantization Documentation](LLM-quantization.md#fine-tuned-model)
## Inference
If you want to use VLLM for accelerated inference, you can check [VLLM Inference Acceleration and Deployment](VLLM-inference-acceleration-and-deployment.md)
### Original Model
**Single sample inference** can be checked in [LLM Inference Documentation](LLM-inference.md)
Using **Dataset** for evaluation:
```bash
CUDA_VISIBLE_DEVICES=0 swift infer --model_id_or_path qwen/Qwen-7B-Chat --dataset AI-ModelScope/blossom-math-v2
```
### Fine-tuned Model
**Single sample inference**:
Inference using LoRA **incremental** weights:
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import (
get_model_tokenizer, get_template, inference, ModelType, get_default_template_type
)
from swift.tuners import Swift
ckpt_dir = 'vx-xxx/checkpoint-100'
model_type = ModelType.qwen_7b_chat
template_type = get_default_template_type(model_type)
model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'})
model = Swift.from_pretrained(model, ckpt_dir, inference_mode=True)
template = get_template(template_type, tokenizer)
query = 'xxxxxx'
response, history = inference(model, template, query)
print(f'response: {response}')
print(f'history: {history}')
```
Inference using LoRA **merged** weights:
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import (
get_model_tokenizer, get_template, inference, ModelType, get_default_template_type
)
ckpt_dir = 'vx-xxx/checkpoint-100-merged'
model_type = ModelType.qwen_7b_chat
template_type = get_default_template_type(model_type)
model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'},
model_id_or_path=ckpt_dir)
template = get_template(template_type, tokenizer)
query = 'xxxxxx'
response, history = inference(model, template, query)
print(f'response: {response}')
print(f'history: {history}')
```
Using **Dataset** for evaluation:
```bash
# Direct inference
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir 'xxx/vx-xxx/checkpoint-xxx' \
--load_dataset_config true \
# Merge LoRA incremental weights and infer
# If you need quantization, you can specify `--quant_bits 4`.
CUDA_VISIBLE_DEVICES=0 swift export \
--ckpt_dir 'xxx/vx-xxx/checkpoint-xxx' --merge_lora true
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir 'xxx/vx-xxx/checkpoint-xxx-merged' --load_dataset_config true
```
**Manual** evaluation:
```bash
# Direct inference
CUDA_VISIBLE_DEVICES=0 swift infer --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx'
# Merge LoRA incremental weights and infer
# If you need quantization, you can specify `--quant_bits 4`.
CUDA_VISIBLE_DEVICES=0 swift export \
--ckpt_dir 'xxx/vx-xxx/checkpoint-xxx' --merge_lora true
CUDA_VISIBLE_DEVICES=0 swift infer --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx-merged'
```
## Web-UI
If you want to deploy VLLM and provide **API** interface, you can check [VLLM Inference Acceleration and Deployment](VLLM-inference-acceleration-and-deployment.md)
### Original Model
For using the web-ui with the original model, see the [LLM Inference Documentation](LLM-inference.md#Web-UI)
### Fine-tuned Model
```bash
# Directly use app-ui
CUDA_VISIBLE_DEVICES=0 swift app-ui --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx'
# Merge LoRA incremental weights and use app-ui
# If you need quantization, you can specify `--quant_bits 4`.
CUDA_VISIBLE_DEVICES=0 swift export \
--ckpt_dir 'xxx/vx-xxx/checkpoint-xxx' --merge_lora true
CUDA_VISIBLE_DEVICES=0 swift app-ui --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx-merged'
```
# LLM Inference Documentation
If you want to use vllm for inference acceleration, you can check out [VLLM Inference Acceleration and Deployment](VLLM-inference-acceleration-and-deployment.md#inference-acceleration)
## Table of Contents
- [Environment Preparation](#Environment-Preparation)
- [Inference](#Inference)
- [Web-UI](#web-ui)
## Environment Preparation
GPU devices: A10, 3090, V100, A100 are all supported.
```bash
# Install ms-swift
pip install 'ms-swift[llm]' -U
# If you want to use models based on auto_gptq for inference.
# Models using auto_gptq: `https://github.com/modelscope/swift/blob/main/docs/source/LLM/Supported Models and Datasets.md#Models`
# auto_gptq and cuda versions have a correspondence, please select the version according to `https://github.com/PanQiWei/AutoGPTQ#quick-installation`
pip install auto_gptq -U
# Environment alignment (usually not necessary to run; if you encounter errors, you can run the commands below. The repository is tested with the latest environment.)
pip install -r requirements/framework.txt -U
pip install -r requirements/llm.txt -U
```
## Inference
### qwen-7b-chat
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import (
get_model_tokenizer, get_template, inference, ModelType, get_default_template_type,
)
from swift.utils import seed_everything
model_type = ModelType.qwen_7b_chat
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}') # template_type: qwen
kwargs = {}
# kwargs['use_flash_attn'] = True # use flash_attn
model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'}, **kwargs)
# modify max_new_tokens
model.generation_config.max_new_tokens = 128
template = get_template(template_type, tokenizer)
seed_everything(42)
query = 'Where is the capital of Zhejiang?'
response, history = inference(model, template, query)
print(f'query: {query}')
print(f'response: {response}')
query = 'What are some famous foods there?'
response, history = inference(model, template, query, history)
print(f'query: {query}')
print(f'response: {response}')
print(f'history: {history}')
"""Out[0]
query: Where is the capital of Zhejiang?
response: The capital of Zhejiang province is Hangzhou.
query: What are some famous foods there?
response: Hangzhou has many famous local foods, such as West Lake Vinegar Fish, Longjing Shrimp, Sweet and Sour Pork Ribs, Spicy Beef, etc. In addition, there are also Hangzhou specialties like Osmanthus Cake, Lotus Seed Pastry, Ai Wo Wo, and more.
history: [('Where is the capital of Zhejiang?', 'The capital of Zhejiang province is Hangzhou.'), ('What are some famous foods there?', 'Hangzhou has many famous local foods, such as West Lake Vinegar Fish, Longjing Shrimp, Sweet and Sour Pork Ribs, Spicy Beef, etc. In addition, there are also Hangzhou specialties like Osmanthus Cake, Lotus Seed Pastry, Ai Wo Wo, and more.')]
"""
# Streaming output chat template
inference(model, template, 'What was the first question?', history, verbose=True, stream=True)
"""Out[1]
[PROMPT]<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Where is the capital of Zhejiang?<|im_end|>
<|im_start|>assistant
The capital of Zhejiang province is Hangzhou.<|im_end|>
<|im_start|>user
What are some famous foods there?<|im_end|>
<|im_start|>assistant
Hangzhou has many famous local foods, such as West Lake Vinegar Fish, Longjing Shrimp, Sweet and Sour Pork Ribs, Spicy Beef, etc. In addition, there are also Hangzhou specialties like Osmanthus Cake, Lotus Seed Pastry, Ai Wo Wo, and more.<|im_end|>
<|im_start|>user
What was the first question<|im_end|>
<|im_start|>assistant
[OUTPUT]Your first question was "Where is the capital of Zhejiang?"<|im_end|>
"""
```
### qwen-7b-chat-int4
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import (
get_model_tokenizer, get_template, inference, ModelType, get_default_template_type,
)
from swift.utils import seed_everything
model_type = ModelType.qwen_7b_chat_int4
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}') # template_type: qwen
model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'})
template = get_template(template_type, tokenizer)
seed_everything(42)
query = 'Where is the capital of Zhejiang?'
response, history = inference(model, template, query)
print(f'query: {query}')
print(f'response: {response}')
query = 'What are some famous foods there?'
response, history = inference(model, template, query, history)
print(f'query: {query}')
print(f'response: {response}')
print(f'history: {history}')
"""Out[0]
query: Where is the capital of Zhejiang?
response: The capital of Zhejiang province is Hangzhou.
query: What are some famous foods there?
response: Hangzhou has many famous local delicacies, such as West Lake Vinegar Fish, Dongpo Pork, Song Sao Fish Soup, Beggar's Chicken, etc. In addition, there are also Hangzhou specialties like Osmanthus Sugar Lotus Root, Fermented Glutinous Rice Dumplings, Mapo Tofu, and more.
history: [('Where is the capital of Zhejiang?', 'The capital of Zhejiang province is Hangzhou.'), ('What are some famous foods there?', "Hangzhou has many famous local delicacies, such as West Lake Vinegar Fish, Dongpo Pork, Song Sao Fish Soup, Beggar's Chicken, etc. In addition, there are also Hangzhou specialties like Osmanthus Sugar Lotus Root, Fermented Glutinous Rice Dumplings, Mapo Tofu, and more.")]
"""
```
### qwen-7b
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import (
get_model_tokenizer, get_template, inference, ModelType, get_default_template_type,
)
from swift.utils import seed_everything
model_type = ModelType.qwen_7b
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}') # template_type: default-generation
model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'})
model.generation_config.max_new_tokens = 64
template = get_template(template_type, tokenizer)
seed_everything(42)
query = 'Zhejiang -> Hangzhou\nAnhui -> Hefei\nSichuan ->'
response, history = inference(model, template, query)
print(f'query: {query}')
print(f'response: {response}')
"""Out[0]
query: Zhejiang -> Hangzhou
Anhui -> Hefei
Sichuan ->
response: Chengdu
Shandong -> Jinan
Fujian -> Fuzhou
Chongqing -> Chongqing
Guangdong -> Guangzhou
Beijing -> Beijing
Zhejiang -> Hangzhou
Anhui -> Hefei
Sichuan -> Chengdu
Shandong -> Jinan
Fujian -> Fuzhou
Chongqing
"""
```
### Stream Output
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import (
get_model_tokenizer, get_template, inference_stream, ModelType, get_default_template_type,
)
from swift.utils import seed_everything
model_type = ModelType.qwen_7b_chat
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}') # template_type: qwen
model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'})
template = get_template(template_type, tokenizer)
seed_everything(42)
query = 'What is the capital of Zhejiang Province?'
gen = inference_stream(model, template, query)
print(f'query: {query}')
for response, history in gen:
pass
print(f'response: {response}')
# method1
query = 'What is there to eat?'
old_history = history
gen = inference_stream(model, template, query, old_history)
print(f'query: {query}')
for response, history in gen:
print(f'response: {response}')
print(f'history: {history}')
# method2
query = 'What is there to eat?'
gen = inference_stream(model, template, query, old_history)
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for response, history in gen:
delta = response[print_idx:]
print(delta, end='', flush=True)
print_idx = len(response)
print(f'\nhistory: {history}')
"""Out[0]
query: What is the capital of Zhejiang Province?
response: The capital of Zhejiang Province is Hangzhou.
query: What is there to eat?
response: Zhejiang
response: Zhejiang cuisine,
response: Zhejiang cuisine,
response: Zhejiang cuisine, also
...
response: Zhejiang cuisine, also known as "Hangzhou cuisine", is one of the eight traditional Chinese cuisines and is famous for its delicate taste, light fragrance, and natural appearance. It has a long history and is influenced by various cultures, including Huaiyang cuisine, Jiangnan cuisine, and Cantonese cuisine. Some popular dishes include West Lake Fish in Vinegar Gravy, Dongpo Pork, Longjing Tea-Scented Chicken, Braised Preserved Bamboo Shoots with Shredded Pork, and Steamed Stuffed Buns. There are many other delicious dishes that you can try when visiting Zhejiang.
history: [['What is the capital of Zhejiang Province?', 'The capital of Zhejiang Province is Hangzhou.'], ['What is there to eat?', 'Zhejiang cuisine, also known as "Hangzhou cuisine", is one of the eight traditional Chinese cuisines and is famous for its delicate taste, light fragrance, and natural appearance. It has a long history and is influenced by various cultures, including Huaiyang cuisine, Jiangnan cuisine, and Cantonese cuisine. Some popular dishes include West Lake Fish in Vinegar Gravy, Dongpo Pork, Longjing Tea-Scented Chicken, Braised Preserved Bamboo Shoots with Shredded Pork, and Steamed Stuffed Buns. There are many other delicious dishes that you can try when visiting Zhejiang.']]
query: What is there to eat?
response: There are many delicious foods to try in Hangzhou, such as West Lake Fish in Vinegar Gravy, Dongpo Pork, Longjing Tea Pancakes, and XiHu-style Mandarin Duck. Additionally, Hangzhou is famous for its snacks like xiaolongbao (soup dumplings), qingtuan (green tea cakes), and huoguoliangzi (cold barley noodles).
history: [['What is the capital of Zhejiang Province?', 'The capital of Zhejiang Province is Hangzhou.'], ['What is there to eat?', 'There are many delicious foods to try in Hangzhou, such as West Lake Fish in Vinegar Gravy, Dongpo Pork, Longjing Tea Pancakes, and XiHu-style Mandarin Duck. Additionally, Hangzhou is famous for its snacks like xiaolongbao (soup dumplings), qingtuan (green tea cakes), and huoguoliangzi (cold barley noodles).']]
"""
```
### qwen-vl-chat
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import (
get_model_tokenizer, get_template, inference, ModelType, get_default_template_type,
)
from swift.utils import seed_everything
model_type = ModelType.qwen_vl_chat
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}') # template_type: qwen
model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'})
template = get_template(template_type, tokenizer)
seed_everything(42)
query = tokenizer.from_list_format([
{'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'},
{'text': 'What is this?'},
])
response, history = inference(model, template, query)
print(f'query: {query}')
print(f'response: {response}')
query = 'Output the bounding box for the high-five'
response, history = inference(model, template, query, history)
print(f'query: {query}')
print(f'response: {response}')
print(f'history: {history}')
image = tokenizer.draw_bbox_on_latest_picture(response, history)
image.save('output_chat.jpg')
"""
query: Picture 1:<img>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg</img>
What is this
response: The image shows a woman playing with a dog, which appears to be a Labrador Retriever, on a beach.
query: Output the bounding box for the high-five
response: <ref>High-five</ref><box>(523,513),(584,605)</box>
history: [('Picture 1:<img>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg</img>\nWhat is this', 'The image shows a woman playing with a dog, which appears to be a Labrador Retriever, on a beach.'), ('Output the bounding box for the high-five', '<ref>High-five</ref><box>(523,513),(584,605)</box>')]
"""
```
### qwen-audio-chat
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import (
get_model_tokenizer, get_template, inference, ModelType, get_default_template_type,
)
from swift.utils import seed_everything
model_type = ModelType.qwen_audio_chat
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}') # template_type: qwen
model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'})
template = get_template(template_type, tokenizer)
seed_everything(42)
query = tokenizer.from_list_format([
{'audio': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/1272-128104-0000.flac'},
{'text': 'what does the person say?'},
])
response, history = inference(model, template, query)
print(f'query: {query}')
print(f'response: {response}')
query = 'Find the start time and end time of the word "middle classes'
response, history = inference(model, template, query, history)
print(f'query: {query}')
print(f'response: {response}')
print(f'history: {history}')
"""Out[0]
query: Audio 1:<audio>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/1272-128104-0000.flac</audio>
what does the person say?
response: The person says: "mister quilter is the apostle of the middle classes and we are glad to welcome his gospel".
query: Find the start time and end time of the word "middle classes
response: The word "middle classes" starts at <|2.33|> seconds and ends at <|3.26|> seconds.
history: [('Audio 1:<audio>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/1272-128104-0000.flac</audio>\nwhat does the person say?', 'The person says: "mister quilter is the apostle of the middle classes and we are glad to welcome his gospel".'), ('Find the start time and end time of the word "middle classes', 'The word "middle classes" starts at <|2.33|> seconds and ends at <|3.26|> seconds.')]
"""
```
### chatglm3
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import (
get_model_tokenizer, get_template, inference, ModelType, get_default_template_type,
)
from swift.utils import seed_everything
model_type = ModelType.chatglm3_6b
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}') # template_type: chatglm3
model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'})
model.generation_config.max_new_tokens = 128
template = get_template(template_type, tokenizer)
seed_everything(42)
query = 'Where is the capital of Zhejiang?'
response, history = inference(model, template, query)
print(f'query: {query}')
print(f'response: {response}')
query = 'What are some famous foods there?'
response, history = inference(model, template, query, history)
print(f'query: {query}')
print(f'response: {response}')
print(f'history: {history}')
"""Out[0]
response: Zhejiang has many delicious foods, here are some famous ones:
1. Hangzhou Xiaolongbao: This is a famous traditional snack in Hangzhou, with a thin, elastic skin and juicy, delicious filling.
2. West Lake Vinegar Fish: This is one of Hangzhou's famous dishes, made by cooking grass carp and pouring over a specially made paste and vinegar, giving it a delicious flavor.
3. Zhejiang Stewed Chicken: This is one of the traditional famous dishes of Zhejiang province, made by slowly stewing chicken with ginger, green onion, soy sauce and other seasonings, resulting in a rich flavor.
4. Youpodouci: This is a traditional Zhejiang pastry, with a crispy exterior and sweet filling
history: [('Where is the capital of Zhejiang?', 'The capital of Zhejiang is Hangzhou.'), ('What are some famous foods there?', 'Zhejiang has many delicious foods, here are some famous ones:\n\n1. Hangzhou Xiaolongbao: This is a famous traditional snack in Hangzhou, with a thin, elastic skin and juicy, delicious filling. \n\n2. West Lake Vinegar Fish: This is one of Hangzhou's famous dishes, made by cooking grass carp and pouring over a specially made paste and vinegar, giving it a delicious flavor.\n\n3. Zhejiang Stewed Chicken: This is one of the traditional famous dishes of Zhejiang province, made by slowly stewing chicken with ginger, green onion, soy sauce and other seasonings, resulting in a rich flavor. \n\n4. Youpodouci: This is a traditional Zhejiang pastry, with a crispy exterior and sweet filling')]
"""
```
### BitsAndBytes Quantization
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import (
get_model_tokenizer, get_template, inference, ModelType, get_default_template_type,
)
from swift.utils import seed_everything
from modelscope import BitsAndBytesConfig
import torch
model_type = ModelType.chatglm3_6b
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}') # template_type: chatglm3
torch_dtype = torch.bfloat16
quantization_config = BitsAndBytesConfig(load_in_4bit=True,
bnb_4bit_compute_dtype=torch_dtype,
bnb_4bit_quant_type='nf4',
bnb_4bit_use_double_quant=True)
model, tokenizer = get_model_tokenizer(model_type, torch_dtype, {'device_map': 'auto',
'quantization_config': quantization_config})
model.generation_config.max_new_tokens = 128
template = get_template(template_type, tokenizer)
seed_everything(42)
query = 'Where is the capital of Zhejiang?'
response, history = inference(model, template, query)
print(f'query: {query}')
print(f'response: {response}')
query = 'What are some famous foods there?'
response, history = inference(model, template, query, history)
print(f'query: {query}')
print(f'response: {response}')
print(f'history: {history}')
"""Out[0]
query: Where is the capital of Zhejiang?
response: The capital of Zhejiang is Hangzhou.
query: What are some famous foods there?
response: Zhejiang has many delicious foods, here are some famous ones:
1. Hangzhou Xiaolongbao: This is a famous traditional snack in Hangzhou, with a thin, elastic skin and juicy, delicious filling.
2. Zhejiang Zongzi: Zhejiang zongzi come in many flavors, such as salted egg yolk pork zongzi, red bean paste zongzi, etc., with Hangzhou zongzi being the most famous.
3. Oil Fried Shrimp: This is one of the most representative seafood dishes in Zhejiang, made by stir-frying shrimp in hot oil until crispy and tender.
4. Salt and Pepper Shredded Potato: This is a traditional Zhejiang vegetable dish, made by stir-frying shredded potato with salt and pepper, resulting in a crisp and refreshing taste.
history: [('Where is the capital of Zhejiang?', 'The capital of Zhejiang is Hangzhou.'), ('What are some famous foods there?', 'Zhejiang has many delicious foods, here are some famous ones:\n\n1. Hangzhou Xiaolongbao: This is a famous traditional snack in Hangzhou, with a thin, elastic skin and juicy, delicious filling.\n\n2. Zhejiang Zongzi: Zhejiang zongzi come in many flavors, such as salted egg yolk pork zongzi, red bean paste zongzi, etc., with Hangzhou zongzi being the most famous. \n\n3. Oil Fried Shrimp: This is one of the most representative seafood dishes in Zhejiang, made by stir-frying shrimp in hot oil until crispy and tender.\n\n4. Salt and Pepper Shredded Potato: This is a traditional Zhejiang vegetable dish, made by stir-frying shredded potato with salt and pepper, resulting in a crisp and refreshing taste.\n')]
"""
```
### Using CLI
```bash
# qwen
CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen-7b-chat
# yi
CUDA_VISIBLE_DEVICES=0 swift infer --model_type yi-6b-chat
```
### Fine-tuned Models
If you want to perform inference using fine-tuned models, you can check out the [LLM Fine-tuning Documentation](LLM-fine-tuning.md#Fine-tuned-Model)
## Web-UI
### qwen-7b-chat
Using CLI:
```bash
CUDA_VISIBLE_DEVICES=0 swift app-ui --model_type qwen-7b-chat
```
Using python:
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import AppUIArguments, ModelType, app_ui_main
app_ui_args = AppUIArguments(model_type=ModelType.qwen_7b_chat)
app_ui_main(app_ui_args)
```
Using bnb quantization:
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import AppUIArguments, ModelType, app_ui_main
app_ui_args = AppUIArguments(model_type=ModelType.qwen_7b_chat, quantization_bit=4)
app_ui_main(app_ui_args)
```
### qwen-7b
Using CLI:
```bash
CUDA_VISIBLE_DEVICES=0 swift app-ui --model_type qwen-7b
```
Using python:
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import AppUIArguments, ModelType, app_ui_main
app_ui_args = AppUIArguments(model_type=ModelType.qwen_7b)
app_ui_main(app_ui_args)
```
### Fine-tuned Models
To use the web-ui with fine-tuned models, you can check out the [LLM Fine-tuning Documentation](LLM-fine-tuning.md#fine-tuned-model)