first

Commit f7db21eb authored Aug 22, 2024 by lvzhen
20 changed files
--- a/ms-swift/docs/source/Multi-Modal/minicpm-v最佳实践.md
+++ b/ms-swift/docs/source/Multi-Modal/minicpm-v最佳实践.md
+
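+# MiniCPM-V Best Practices
+The following uses `minicpm-v-3b-chat` as an example. To use the newer MiniCPM-V multimodal model (v2), replace `--model_type minicpm-v-3b-chat` with `--model_type minicpm-v-v2-chat`.
+
+## Table of Contents
+- [Environment Setup](#environment-setup)
+- [Inference](#inference)
+- [Fine-tuning](#fine-tuning)
+- [Inference After Fine-tuning](#inference-after-fine-tuning)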
+
+
+## Environment Setup
+```shell
+# Use "ms-swift>=2.2" or the main branch.
+pip install 'ms-swift[llm]' -U
+```
+
+Model links:
+- minicpm-v-3b-chat: [https://modelscope.cn/models/OpenBMB/MiniCPM-V/summary](https://modelscope.cn/models/OpenBMB/MiniCPM-V/summary)
+- minicpm-v-v2-chat: [https://modelscope.cn/models/OpenBMB/MiniCPM-V-2/summary](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2/summary)
+
+
+## Inference
+
+Inference with `minicpm-v-3b-chat`:
+```shell
+# Experimental environment: A10, 3090, V100, ...
+# 10GB GPU memory
+CUDA_VISIBLE_DEVICES=0 swift infer --model_type minicpm-v-3b-chat
+```
+
+Output (local paths and URLs are both accepted):
+```python
+"""
+<<< 描述这张图片
+Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
+该图像的特点是一只黑白相间的猫，它的眼睛睁得大大的，似乎在凝视着相机。这只猫看起来很小，可能是一只幼猫。
+--------------------------------------------------
+<<< clear
+<<< 图中有几只羊？
+Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png
+图中有四只羊。
+--------------------------------------------------
+<<< clear
+<<< 计算结果是多少
+Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png
+计算结果为1452 + 4530 = 5982。
+--------------------------------------------------
+<<< clear
+<<< 根据图片中的内容写首诗
+Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png
+在宁静的夜晚，一艘船在平静的湖面上航行。
+--------------------------------------------------
+<<< clear
+<<< 对图片进行OCR
+Input a media path or URL <<< https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr.png
+Swift 250+ LMM35+ MLLM
+"""
+```
+
+The sample images are shown below:
+
+cat:
+
+<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png" width="250" style="display: inline-block;">
+
+animal:
+
+<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png" width="250" style="display: inline-block;">
+
+math:
+
+<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png" width="250" style="display: inline-block;">
+
+poem:
+
+<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png" width="250" style="display: inline-block;">
+
+ocr:
+
+<img src="https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr.png" width="250" style="display: inline-block;">
+
+**Single-sample inference**
+
+```python
+import os
+os.environ['CUDA_VISIBLE_DEVICES'] = '0'
+
+from swift.llm import (
+    get_model_tokenizer, get_template, inference, ModelType,
+    get_default_template_type, inference_stream
+)
+from swift.utils import seed_everything
+import torch
+
+model_type = ModelType.minicpm_v_3b_chat
+template_type = get_default_template_type(model_type)
+print(f'template_type: {template_type}')
+
+model, tokenizer = get_model_tokenizer(model_type, torch.bfloat16,
+                                       model_kwargs={'device_map': 'auto'})
+model.generation_config.max_new_tokens = 256
+template = get_template(template_type, tokenizer)
+seed_everything(42)
+
+images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png']
+query = '距离各城市多远？'
+response, history = inference(model, template, query, images=images)
+print(f'query: {query}')
+print(f'response: {response}')
+
+# Streaming
+query = '距离最远的城市是哪？'
+gen = inference_stream(model, template, query, history, images=images)
+print_idx = 0
+print(f'query: {query}\nresponse: ', end='')
+for response, history in gen:
+    delta = response[print_idx:]
+    print(delta, end='', flush=True)
+    print_idx = len(response)
+print()
+print(f'history: {history}')
+"""
+query: 距离各城市多远？
+response:  广州到深圳的距离是230公里，而深圳到广州的距离是14公里。
+query: 距离最远的城市是哪？
+response: 距离最远的城市是深圳，它位于广州和深圳之间，距离广州230公里，距离深圳14公里。
+history: [['距离各城市多远？', ' 广州到深圳的距离是230公里，而深圳到广州的距离是14公里。'], ['距离最远的城市是哪？', '距离最远的城市是深圳，它位于广州和深圳之间，距离广州230公里，距离深圳14公里。']]
+"""
+```
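The streaming loop in the script above prints only the newly generated suffix at each step, because every iteration yields the full response so far. A minimal, library-free sketch of that pattern follows. `fake_stream` is a hypothetical stand-in for a streaming generator and is not part of swift; only the index bookkeeping mirrors the original loop.

```python
from typing import Iterable, Iterator, List, Tuple


def fake_stream(text: str, step: int = 3) -> Iterator[Tuple[str, list]]:
    """Hypothetical stand-in for a streaming API: yields the cumulative response."""
    for end in range(step, len(text) + step, step):
        yield text[:end], []


def iter_deltas(gen: Iterable[Tuple[str, list]]) -> Iterator[str]:
    """Turn cumulative responses into the newly added suffix at each step."""
    print_idx = 0
    for response, _history in gen:
        delta = response[print_idx:]
        print_idx = len(response)
        if delta:
            yield delta


deltas: List[str] = list(iter_deltas(fake_stream("Guangzhou is the farthest city.")))
print("".join(deltas))
```

Joining the deltas reproduces the final response, which is why the loop can print each delta with `end=''` and no further buffering.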
+
+The sample image is shown below:
+
+road:
+
+<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png" width="250" style="display: inline-block;">
+
+
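+## Fine-tuning
+Multimodal LLMs are usually fine-tuned on a **custom dataset**. Here is a demo that can be run directly:
+
+```shell
+# Experimental environment: A10, 3090, V100, ...
+# 10GB GPU memory
+CUDA_VISIBLE_DEVICES=0 swift sft \
+    --model_type minicpm-v-3b-chat \
+    --dataset coco-en-2-mini
+```
+
+[Custom datasets](../LLM/自定义与拓展.md#-推荐命令行参数的形式) support the json and jsonl formats. Here is an example of a custom dataset:
+
+(Multi-turn conversations are supported, but the conversation as a whole can contain only one image. Local paths and URLs are both accepted.)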
+
+```jsonl
+{"query": "55555", "response": "66666", "images": ["image_path"]}
+{"query": "eeeee", "response": "fffff", "history": [], "images": ["image_path"]}
+{"query": "EEEEE", "response": "FFFFF", "history": [["query1", "response1"], ["query2", "response2"]], "images": ["image_path"]}
+```
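Records in this layout can be generated with a few lines of Python. The sketch below is only an illustration: `make_sample` and `to_jsonl` are hypothetical helper names, and the check that rejects more than one image reflects the single-image restriction stated above rather than any validation swift itself performs.

```python
import json
from typing import Dict, List, Optional


def make_sample(query: str, response: str, image: str,
                history: Optional[List[List[str]]] = None) -> Dict:
    """Build one record; the whole conversation shares a single image."""
    sample = {"query": query, "response": response, "images": [image]}
    if history:
        sample["history"] = history
    return sample


def to_jsonl(samples: List[Dict]) -> str:
    """Serialize records as one JSON object per line."""
    for s in samples:
        if len(s.get("images", [])) > 1:
            raise ValueError("only one image per conversation is allowed")
    return "\n".join(json.dumps(s, ensure_ascii=False) for s in samples)


samples = [
    make_sample("55555", "66666", "image_path"),
    make_sample("EEEEE", "FFFFF", "image_path",
                history=[["query1", "response1"], ["query2", "response2"]]),
]
jsonl_text = to_jsonl(samples)
print(jsonl_text)
```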
+
+
+## Inference After Fine-tuning
+Direct inference:
+```shell
+CUDA_VISIBLE_DEVICES=0 swift infer \
+    --ckpt_dir output/minicpm-v-3b-chat/vx-xxx/checkpoint-xxx \
+    --load_dataset_config true
+```
+
+**merge-lora** and infer:
+```shell
+CUDA_VISIBLE_DEVICES=0 swift export \
+    --ckpt_dir output/minicpm-v-3b-chat/vx-xxx/checkpoint-xxx \
+    --merge_lora true
+
+CUDA_VISIBLE_DEVICES=0 swift infer \
+    --ckpt_dir output/minicpm-v-3b-chat/vx-xxx/checkpoint-xxx-merged \
+    --load_dataset_config true
+```
--- a/ms-swift/docs/source/Multi-Modal/mplug-owl2最佳实践.md
+++ b/ms-swift/docs/source/Multi-Modal/mplug-owl2最佳实践.md
+
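+# mPLUG-Owl2 Best Practices
+The following uses `mplug-owl2_1-chat` as an example; you can also choose `mplug-owl2-chat`.
+
+## Table of Contents
+- [Environment Setup](#environment-setup)
+- [Inference](#inference)
+- [Fine-tuning](#fine-tuning)
+- [Inference After Fine-tuning](#inference-after-fine-tuning)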
+
+
+## Environment Setup
+```shell
+git clone https://github.com/modelscope/swift.git
+cd swift
+pip install -e '.[llm]'
+```
+
+Model links:
+- mplug-owl2_1-chat: [https://modelscope.cn/models/iic/mPLUG-Owl2.1/summary](https://modelscope.cn/models/iic/mPLUG-Owl2.1/summary)
+- mplug-owl2-chat: [https://modelscope.cn/models/iic/mPLUG-Owl2/summary](https://modelscope.cn/models/iic/mPLUG-Owl2/summary)
+
+
+## Inference
+
+Inference with `mplug-owl2_1-chat`:
+```shell
+# Experimental environment: A10, 3090, V100...
+# 24GB GPU memory
+CUDA_VISIBLE_DEVICES=0 swift infer --model_type mplug-owl2_1-chat
+```
+
+Output (local paths and URLs are both accepted):
+```python
+"""
+<<< Describe this image.
+Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
+The image features a close-up of a cute, gray and white kitten with big blue eyes. The kitten is sitting on a table, looking directly at the viewer. The scene captures the kitten's adorable features, including its whiskers and the fur on its face. The kitten appears to be staring into the camera, creating a captivating and endearing atmosphere.
+--------------------------------------------------
+<<< How many sheep are in the picture?
+Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png
+There are four sheep in the picture.
+--------------------------------------------------
+<<< What is the calculation result?
+Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png
+The calculation result is 1452 + 45304 = 46756.
+--------------------------------------------------
+<<< Write a poem based on the content of the picture.
+Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png
+In the stillness of the night, a boat glides across the water, its light shining bright. The stars twinkle above, casting a magical glow. A man and a dog are on board, enjoying the serene journey. The boat floats gently, as if it's floating on air. The calm waters reflect the stars, creating a breathtaking scene. The man and his dog are lost in their thoughts, taking in the beauty of nature. The boat seems to be floating in a dream, as if they are on a journey to find their way back home.
+--------------------------------------------------
+<<< clear
+<<< Perform OCR on the image.
+Input a media path or URL <<< https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr_en.png
+Text: Swift support training, inference and deployment of 250+ LLMs and 350+ MLMs (multimodal models). Developers can directly apply framework their own research and production environments to realize a complete workflow from model training and evaluation to application. In addition to supporting the lightweight training models provided by PEFT, we also provide a Complete Adapters library that can be adapted to various models such as NeTune, LoRaT, LLMA-PRO, etc. This adapter library can be used directly in your own custom workflow. The library is user-friendly with unfamiliar deep learning, Gradio UI for controlling training and inference, as well as accompanying learning courses and best practices for beginners. Additionally, we provide extra training and Lora LRN for AnimateDiff. Swift has rich documents for users on Huggingface and ModelScope, so please feel free to try it!
+"""
+```
+
+The sample images are shown below:
+
+cat:
+
+<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png" width="250" style="display: inline-block;">
+
+animal:
+
+<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png" width="250" style="display: inline-block;">
+
+math:
+
+<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png" width="250" style="display: inline-block;">
+
+poem:
+
+<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png" width="250" style="display: inline-block;">
+
+ocr_en:
+
+<img src="https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr_en.png" width="250" style="display: inline-block;">
+
+**Single-sample inference**
+
+```python
+import os
+os.environ['CUDA_VISIBLE_DEVICES'] = '0'
+
+from swift.llm import (
+    get_model_tokenizer, get_template, inference, ModelType,
+    get_default_template_type, inference_stream
+)
+from swift.utils import seed_everything
+import torch
+
+model_type = ModelType.mplug_owl2_1_chat
+template_type = get_default_template_type(model_type)
+print(f'template_type: {template_type}')
+
+model, tokenizer = get_model_tokenizer(model_type, torch.float16,
+                                       model_kwargs={'device_map': 'auto'})
+model.generation_config.max_new_tokens = 256
+template = get_template(template_type, tokenizer)
+seed_everything(42)
+
+images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png']
+query = 'How far is it from each city?'
+response, history = inference(model, template, query, images=images)
+print(f'query: {query}')
+print(f'response: {response}')
+
+# Streaming
+query = 'Which city is the farthest?'
+images = images * 2  # one image per turn: the history turn plus this query
+gen = inference_stream(model, template, query, history, images=images)
+print_idx = 0
+print(f'query: {query}\nresponse: ', end='')
+for response, history in gen:
+    delta = response[print_idx:]
+    print(delta, end='', flush=True)
+    print_idx = len(response)
+print()
+print(f'history: {history}')
+"""
+query: How far is it from each city?
+response: From the given information, it is 14 km from the city of Mata, 62 km from Yangjiang, and 293 km from Guangzhou.
+query: Which city is the farthest?
+response: The farthest city is Guangzhou, which is 293 km away.
+history: [['How far is it from each city?', 'From the given information, it is 14 km from the city of Mata, 62 km from Yangjiang, and 293 km from Guangzhou.'], ['Which city is the farthest?', 'The farthest city is Guangzhou, which is 293 km away.']]
+"""
+```
+
+The sample image is shown below:
+
+road:
+
+<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png" width="250" style="display: inline-block;">
+
+
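+## Fine-tuning
+Multimodal LLMs are usually fine-tuned on a **custom dataset**. Here is a demo that can be run directly:
+
+```shell
+# Experimental environment: A10, 3090, V100...
+# 24GB GPU memory
+CUDA_VISIBLE_DEVICES=0 swift sft \
+    --model_type mplug-owl2_1-chat \
+    --dataset coco-en-2-mini
+```
+
+[Custom datasets](../LLM/自定义与拓展.md#-推荐命令行参数的形式) support the json and jsonl formats. Here is an example of a custom dataset:
+
+(Multi-turn conversations are supported, and every turn must contain one image. Local paths and URLs are both accepted.)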
+
+```jsonl
+{"query": "55555", "response": "66666", "images": ["image_path"]}
+{"query": "eeeee", "response": "fffff", "history": [], "images": ["image_path"]}
+{"query": "EEEEE", "response": "FFFFF", "history": [["query1", "response1"], ["query2", "response2"]], "images": ["image_path", "image_path2", "image_path3"]}
+```
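Since every turn needs its own image, the length of `images` should equal the number of history rounds plus one for the current query. The sketch below checks that relationship on the third example record. `count_turns` and `check_images` are hypothetical helpers, and the assumption that images are listed in turn order is mine rather than something the text above states.

```python
import json


def count_turns(sample: dict) -> int:
    """Number of dialogue turns: the history rounds plus the current query."""
    return len(sample.get("history") or []) + 1


def check_images(sample: dict) -> bool:
    """True when the record carries exactly one image per turn."""
    return len(sample.get("images", [])) == count_turns(sample)


line = ('{"query": "EEEEE", "response": "FFFFF", '
        '"history": [["query1", "response1"], ["query2", "response2"]], '
        '"images": ["image_path", "image_path2", "image_path3"]}')
sample = json.loads(line)
print(count_turns(sample), check_images(sample))
```

Running a check like this over a dataset file before training can surface records whose image list is too short or too long.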
+
+
+## Inference After Fine-tuning
+Direct inference:
+```shell
+CUDA_VISIBLE_DEVICES=0 swift infer \
+    --ckpt_dir output/mplug-owl2_1-chat/vx-xxx/checkpoint-xxx \
+    --load_dataset_config true
+```
+
+**merge-lora** and infer:
+```shell
+CUDA_VISIBLE_DEVICES=0 swift export \
+    --ckpt_dir output/mplug-owl2_1-chat/vx-xxx/checkpoint-xxx \
+    --merge_lora true
+
+CUDA_VISIBLE_DEVICES=0 swift infer \
+    --ckpt_dir output/mplug-owl2_1-chat/vx-xxx/checkpoint-xxx-merged \
+    --load_dataset_config true
+```
--- a/ms-swift/docs/source/Multi-Modal/phi3-vision最佳实践.md
+++ b/ms-swift/docs/source/Multi-Modal/phi3-vision最佳实践.md
+
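+# Phi3-Vision Best Practices
+
+## Table of Contents
+- [Environment Setup](#environment-setup)
+- [Inference](#inference)
+- [Fine-tuning](#fine-tuning)
+- [Inference After Fine-tuning](#inference-after-fine-tuning)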
+
+
+## Environment Setup
+```shell
+git clone https://github.com/modelscope/swift.git
+cd swift
+pip install -e '.[llm]'
+```
+Model links:
+- phi3-vision-128k-instruct: [https://modelscope.cn/models/LLM-Research/Phi-3-vision-128k-instruct/summary](https://modelscope.cn/models/LLM-Research/Phi-3-vision-128k-instruct/summary)
+
+
+## Inference
+
+Inference with `phi3-vision-128k-instruct`:
+```shell
+# Experimental environment: A10, 3090, V100, ...
+# 16GB GPU memory
+CUDA_VISIBLE_DEVICES=0 swift infer --model_type phi3-vision-128k-instruct
+```
+
+Output (local paths and URLs are both accepted):
+```python
+"""
+<<< Who are you?
+I am Phi, an AI developed by Microsoft to assist with providing information, answering questions, and helping users find solutions to their queries. How can I assist you today?
+--------------------------------------------------
+<<< clear
+<<< <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png</img><img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png</img>What is the difference between these two pictures?
+The first picture shows a group of four cartoon sheep standing in a field, while the second picture is a close-up of a kitten with a blurred background. The main difference between these two pictures is the subject matter and the setting. The first picture features animals that are typically associated with farm life and agriculture, while the second picture focuses on a domestic animal, a kitten, which is more commonly found in households. Additionally, the first picture has a more peaceful and serene atmosphere, while the second picture has a more intimate and detailed view of the kitten.
+--------------------------------------------------
+<<< clear
+<<< <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png</img>How many sheep are there in the picture?
+There are four sheep in the picture.
+--------------------------------------------------
+<<< clear
+<<< <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png</img>What is the result of the calculation?
+The result of the calculation 1452 + 45304 is 46756.
+--------------------------------------------------
+<<< clear
+<<< <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png</img>Write a poem based on the content of the picture.
+In the tranquil night, a boat sails,
+Through the darkened river, it sets sail.
+A single candle flickers, casting light,
+Guiding the way through the endless night.
+
+The stars above, like diamonds bright,
+Gleam down upon the boat's gentle flight.
+The moon, a silent guardian in the sky,
+Watches over the boat as it sails by.
+
+The river, a mirror to the night,
+Reflects the boat's journey, a beautiful sight.
+The trees on either side, standing tall,
+Whisper secrets to the boat, one and all.
+
+In the stillness of the night, a sense of peace,
+The boat, the river, the trees, all in their place.
+A moment frozen in time, a scene so serene,
+A journey through the night, a dream so unseen.
+--------------------------------------------------
+<<< clear
+<<< <img>https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr_en.png</img>Perform OCR on the image.
+The image contains a text section with the heading 'Introduction'. It discusses the capabilities of SWIFT, which support training, inference, evaluation, and deployment of over 250 large language models (LLMs) and 35+ multimodal large models (MLLMs). It mentions that developers can apply this framework to their research and production environments, and that SWIFT supports lightweight training solutions provided by PEFT, as well as a complete Adapters library for various training techniques. It also highlights the availability of a Gradio web-ui for controlling training and inference, and the provision of deep learning courses and best practices for beginners. The text further states that SWIFT is expanding capabilities for other modalities, currently supporting full-parameter training and LoRA training for AnimateDiff. There are references to rich documentation and the availability of SWIFT web-ui on Huggingface space and ModelScope studio. The text is clear and fully visible in the image.
+"""
+```
+
+The sample images are shown below:
+
+cat:
+
+<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png" width="250" style="display: inline-block;">
+
+animal:
+
+<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png" width="250" style="display: inline-block;">
+
+math:
+
+<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png" width="250" style="display: inline-block;">
+
+poem:
+
+<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png" width="250" style="display: inline-block;">
+
+ocr_en:
+
+<img src="https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr_en.png" width="250" style="display: inline-block;">
+
+**Single-sample inference**
+
+```python
+import os
+os.environ['CUDA_VISIBLE_DEVICES'] = '0'
+
+from swift.llm import (
+    get_model_tokenizer, get_template, inference, ModelType,
+    get_default_template_type, inference_stream
+)
+from swift.utils import seed_everything
+import torch
+
+model_type = ModelType.phi3_vision_128k_instruct
+template_type = get_default_template_type(model_type)
+print(f'template_type: {template_type}')
+
+model, tokenizer = get_model_tokenizer(model_type, torch.float16,
+                                       model_kwargs={'device_map': 'auto'})
+model.generation_config.max_new_tokens = 256
+template = get_template(template_type, tokenizer)
+seed_everything(42)
+
+query = """<img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png</img>How far is it from each city?"""
+response, history = inference(model, template, query)
+print(f'query: {query}')
+print(f'response: {response}')
+
+# Streaming
+query = 'Which city is the farthest?'
+gen = inference_stream(model, template, query, history)
+print_idx = 0
+print(f'query: {query}\nresponse: ', end='')
+for response, history in gen:
+    delta = response[print_idx:]
+    print(delta, end='', flush=True)
+    print_idx = len(response)
+print()
+print(f'history: {history}')
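+"""
+query: <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png</img>How far is it from each city?
+response: The distances are as follows: Mata is 14km away, Yangjiang is 62km away, and Guangzhou is 293km away.
+query: Which city is the farthest?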
+response: Guangzhou is the farthest city, located 293km away.
+history: [['<img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png</img>How far is it from each city?', 'The distances are as follows: Mata is 14km away, Yangjiang is 62km away, and Guangzhou is 293km away.'], ['Which city is the farthest?', 'Guangzhou is the farthest city, located 293km away.']]
+"""
+```
+
+The sample image is shown below:
+
+road:
+
+<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png" width="250" style="display: inline-block;">
+
+
+## Fine-tuning
+Multimodal LLMs are usually fine-tuned on a **custom dataset**. Here is a demo that can be run directly:
+
+```shell
+# Experimental environment: A10, 3090, V100, ...
+# 16GB GPU memory
+CUDA_VISIBLE_DEVICES=0 swift sft \
+    --model_type phi3-vision-128k-instruct \
+    --dataset coco-en-mini
+
+# DDP Full
+# Experimental environment: 2 * A100
+# 2 * 50GB GPU memory
+NPROC_PER_NODE=2 \
+CUDA_VISIBLE_DEVICES=0,1 swift sft \
+    --model_type phi3-vision-128k-instruct \
+    --dataset coco-en-mini \
+    --sft_type full \
+    --ddp_find_unused_parameters true
+```
+
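+[Custom datasets](../LLM/自定义与拓展.md#-推荐命令行参数的形式) support the json and jsonl formats. Here is an example of a custom dataset:
+
+(Multi-turn conversations are supported. Each turn may contain several images or none. Local paths and URLs are both accepted.)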
+
+```json
+[
+    {"conversations": [
+        {"from": "user", "value": "<img>img_path</img>11111"},
+        {"from": "assistant", "value": "22222"}
+    ]},
+    {"conversations": [
+        {"from": "user", "value": "<img>img_path</img><img>img_path2</img><img>img_path3</img>aaaaa"},
+        {"from": "assistant", "value": "bbbbb"},
+        {"from": "user", "value": "<img>img_path</img>ccccc"},
+        {"from": "assistant", "value": "ddddd"}
+    ]},
+    {"conversations": [
+        {"from": "user", "value": "AAAAA"},
+        {"from": "assistant", "value": "BBBBB"},
+        {"from": "user", "value": "CCCCC"},
+        {"from": "assistant", "value": "DDDDD"}
+    ]}
+]
+```
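In this layout the image references live inside the user messages as `<img>...</img>` tags. The sketch below pulls the paths out per user turn with a regular expression, which can help when checking that every referenced file exists. `images_per_turn` is a hypothetical helper and the regex is my own assumption about how the tags can be parsed, not swift's parsing code.

```python
import re
from typing import Dict, List

# Non-greedy match so that adjacent tags are captured separately.
IMG_TAG = re.compile(r"<img>(.*?)</img>")


def images_per_turn(conversations: List[Dict[str, str]]) -> List[List[str]]:
    """Return the image paths referenced by each user message, in order."""
    return [IMG_TAG.findall(m["value"]) for m in conversations if m["from"] == "user"]


conversations = [
    {"from": "user", "value": "<img>img_path</img><img>img_path2</img><img>img_path3</img>aaaaa"},
    {"from": "assistant", "value": "bbbbb"},
    {"from": "user", "value": "<img>img_path</img>ccccc"},
    {"from": "assistant", "value": "ddddd"},
]
per_turn = images_per_turn(conversations)
print(per_turn)
```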
+
+
+## Inference After Fine-tuning
+Direct inference:
+```shell
+CUDA_VISIBLE_DEVICES=0 swift infer \
+    --ckpt_dir output/phi3-vision-128k-instruct/vx-xxx/checkpoint-xxx \
+    --load_dataset_config true
+```
+
+**merge-lora** and infer:
+```shell
+CUDA_VISIBLE_DEVICES=0 swift export \
+    --ckpt_dir output/phi3-vision-128k-instruct/vx-xxx/checkpoint-xxx \
+    --merge_lora true --safe_serialization false
+
+CUDA_VISIBLE_DEVICES=0 swift infer \
+    --ckpt_dir output/phi3-vision-128k-instruct/vx-xxx/checkpoint-xxx-merged \
+    --load_dataset_config true
+```
--- a/ms-swift/docs/source/Multi-Modal/qwen-audio最佳实践.md
+++ b/ms-swift/docs/source/Multi-Modal/qwen-audio最佳实践.md
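+# Qwen-Audio Best Practices
+
+For Qwen2-Audio best practices, see: [https://github.com/modelscope/ms-swift/issues/1653](https://github.com/modelscope/ms-swift/issues/1653)
+
+## Table of Contents
+- [Environment Setup](#environment-setup)
+- [Inference](#inference)
+- [Fine-tuning](#fine-tuning)
+- [Inference After Fine-tuning](#inference-after-fine-tuning)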
+
+
+## Environment Setup
+```shell
+pip install 'ms-swift[llm]' -U
+```
+
+## Inference
+
+Inference with [qwen-audio-chat](https://modelscope.cn/models/qwen/Qwen-Audio-Chat/summary):
+```shell
+# Experimental environment: A10, 3090, V100...
+# 21GB GPU memory
+CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen-audio-chat
+```
+
+Output (local paths and URLs are both accepted):
+```python
+"""
+<<< 你是谁？
+我是来自达摩院的大规模语言模型，我叫通义千问。
+--------------------------------------------------
+<<< <audio>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/music.wav</audio>这是首什么样的音乐
+这是一首风格是Pop的音乐。
+--------------------------------------------------
+<<< <audio>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/weather.wav</audio>这段语音说了什么
+这段语音中说了中文："今天天气真好呀"。
+--------------------------------------------------
+<<< 这段语音是男生还是女生
+根据音色判断，这段语音是男性。
+"""
+```
+
+**Single-sample inference**
+
+```python
+import os
+os.environ['CUDA_VISIBLE_DEVICES'] = '0'
+
+from swift.llm import (
+    get_model_tokenizer, get_template, inference, ModelType,
+    get_default_template_type, inference_stream
+)
+from swift.utils import seed_everything
+import torch
+
+model_type = ModelType.qwen_audio_chat
+template_type = get_default_template_type(model_type)
+print(f'template_type: {template_type}')
+
+model, tokenizer = get_model_tokenizer(model_type, torch.float16,
+                                       model_kwargs={'device_map': 'auto'})
+model.generation_config.max_new_tokens = 256
+template = get_template(template_type, tokenizer)
+seed_everything(42)
+
+query = '<audio>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/weather.wav</audio>这段语音说了什么'
+response, history = inference(model, template, query)
+print(f'query: {query}')
+print(f'response: {response}')
+
+# Streaming
+query = '这段语音是男生还是女生'
+gen = inference_stream(model, template, query, history)
+print_idx = 0
+print(f'query: {query}\nresponse: ', end='')
+for response, history in gen:
+    delta = response[print_idx:]
+    print(delta, end='', flush=True)
+    print_idx = len(response)
+print()
+print(f'history: {history}')
+"""
+query: <audio>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/weather.wav</audio>这段语音说了什么
+response: 这段语音说了中文："今天天气真好呀"。
+query: 这段语音是男生还是女生
+response: 根据音色判断，这段语音是男性。
+history: [['<audio>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/weather.wav</audio>这段语音说了什么', '这段语音说了中文："今天天气真好呀"。'], ['这段语音是男生还是女生', '根据音色判断，这段语音是男性。']]
+"""
+```
+
+
+## Fine-tuning
+Multimodal LLMs are usually fine-tuned on a **custom dataset**. Here is a demo that can be run directly:
+
+LoRA fine-tuning:
+
+```shell
+# Experimental environment: A10, 3090, V100...
+# 22GB GPU memory
+CUDA_VISIBLE_DEVICES=0 swift sft \
+    --model_type qwen-audio-chat \
+    --dataset aishell1-mini-zh
+```
+
+Full-parameter fine-tuning:
+```shell
+# MP
+# Experimental environment: 2 * A100
+# 2 * 50GB GPU memory
+CUDA_VISIBLE_DEVICES=0,1 swift sft \
+    --model_type qwen-audio-chat \
+    --dataset aishell1-mini-zh \
+    --sft_type full
+
+# ZeRO2
+# Experimental environment: 4 * A100
+# 4 * 80GB GPU memory
+NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
+    --model_type qwen-audio-chat \
+    --dataset aishell1-mini-zh \
+    --sft_type full \
+    --use_flash_attn true \
+    --deepspeed default-zero2
+```
+
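+[Custom datasets](../LLM/自定义与拓展.md#-推荐命令行参数的形式) support the json and jsonl formats. Here is an example of a custom dataset:
+
+(Multi-turn conversations are supported. Each turn may contain several audio clips or none. Local paths and URLs are both accepted.)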
+
+```json
+[
+    {"conversations": [
+        {"from": "user", "value": "<audio>audio_path</audio>11111"},
+        {"from": "assistant", "value": "22222"}
+    ]},
+    {"conversations": [
+        {"from": "user", "value": "<audio>audio_path</audio><audio>audio_path2</audio><audio>audio_path3</audio>aaaaa"},
+        {"from": "assistant", "value": "bbbbb"},
+        {"from": "user", "value": "<audio>audio_path</audio>ccccc"},
+        {"from": "assistant", "value": "ddddd"}
+    ]},
+    {"conversations": [
+        {"from": "user", "value": "AAAAA"},
+        {"from": "assistant", "value": "BBBBB"},
+        {"from": "user", "value": "CCCCC"},
+        {"from": "assistant", "value": "DDDDD"}
+    ]}
+]
+```
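The `conversations` layout carries the same information as a `query`/`response`/`history` record: the last user/assistant pair becomes the current query and response, and the earlier pairs become the history. The sketch below performs that folding, assuming messages strictly alternate between user and assistant. `conversations_to_sample` is a hypothetical helper for illustration, not a swift function.

```python
from typing import Dict, List


def conversations_to_sample(conversations: List[Dict[str, str]]) -> Dict:
    """Fold alternating user/assistant messages into query/response/history."""
    if not conversations or len(conversations) % 2 != 0:
        raise ValueError("expected a non-empty list of user/assistant pairs")
    pairs = []
    for user, assistant in zip(conversations[0::2], conversations[1::2]):
        if user["from"] != "user" or assistant["from"] != "assistant":
            raise ValueError("messages must alternate user -> assistant")
        pairs.append([user["value"], assistant["value"]])
    # The final pair is the current turn; everything before it is history.
    *history, (query, response) = pairs
    return {"query": query, "response": response, "history": history}


sample = conversations_to_sample([
    {"from": "user", "value": "<audio>audio_path</audio>aaaaa"},
    {"from": "assistant", "value": "bbbbb"},
    {"from": "user", "value": "<audio>audio_path</audio>ccccc"},
    {"from": "assistant", "value": "ddddd"},
])
print(sample)
```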
+
+
+## Inference After Fine-tuning
+Direct inference:
+```shell
+CUDA_VISIBLE_DEVICES=0 swift infer \
+    --ckpt_dir output/qwen-audio-chat/vx-xxx/checkpoint-xxx \
+    --load_dataset_config true
+```
+
+**merge-lora** and infer:
+```shell
+CUDA_VISIBLE_DEVICES=0 swift export \
+    --ckpt_dir output/qwen-audio-chat/vx-xxx/checkpoint-xxx \
+    --merge_lora true
+
+CUDA_VISIBLE_DEVICES=0 swift infer \
+    --ckpt_dir output/qwen-audio-chat/vx-xxx/checkpoint-xxx-merged \
+    --load_dataset_config true
+```
--- a/ms-swift/docs/source/Multi-Modal/qwen-vl最佳实践.md
+++ b/ms-swift/docs/source/Multi-Modal/qwen-vl最佳实践.md
+
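+# Qwen-VL Best Practices
+
+## Table of Contents
+- [Environment Setup](#environment-setup)
+- [Inference](#inference)
+- [Fine-tuning](#fine-tuning)
+- [Inference After Fine-tuning](#inference-after-fine-tuning)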
+
+
+## Environment Setup
+```shell
+pip install 'ms-swift[llm]' -U
+```
+
+## Inference
+
+Inference with [qwen-vl-chat](https://modelscope.cn/models/qwen/Qwen-VL-Chat/summary):
+```shell
+# Experimental environment: 3090
+# 24GB GPU memory
+CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen-vl-chat
+```
+
+Output (local paths and URLs are both accepted):
+```python
+"""
+<<< 你是谁？
+我是通义千问，由阿里云开发的AI助手。我被设计用来回答各种问题、提供信息和与用户进行对话。有什么我可以帮助你的吗？
+--------------------------------------------------
+<<< <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png</img><img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png</img>这两张图片有什么区别
+这两张图片的主要区别在于内容和主题。
+第一张图片是一张卡通插画，画面中是一只公羊或山羊在绿色的草地上，配以群山和白云的背景，整体呈现出自然和动物的主题。
+第二张图片也是一张卡通插画，画面中是一只小猫，有条纹的毛发和蓝色的眼睛，整体呈现出可爱和动物的主题。
+--------------------------------------------------
+<<< <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png</img>图中有几只羊
+图中有一家四口的羊，一共四只。
+--------------------------------------------------
+<<< <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png</img>计算结果是多少
+1452 + 45304 = 46756
+--------------------------------------------------
+<<< clear
+<<< <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png</img>根据图片中的内容写首诗
+月光如水洒河中，孤舟一灯独自空。
+两岸青山倒影美，星河灿烂天空宏。
+--------------------------------------------------
+<<< clear
+<<< <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr.png</img>对图片进行OCR
+SWIFT支持250+ LLM和35+ MLLM（多模态大模型）的训练、推理、评测和部署。开发者可以直接将我们的框架应用到自己的Research和生产环境中，实现模型训练评测到应用的完整链路。我们除了支持PEPT提供的轻量训练方案外，也提供了一个完整的Adapters库以支持最新的训练技术，如NEFTune、LoRA+、LLaMa-PRO等，这个适配器库可以脱离训练脚本直接使用在自己的自定流程中。
+"""
+```
+
+The sample images are shown below:
+
+cat:
+
+<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png" width="250" style="display: inline-block;">
+
+animal:
+
+<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png" width="250" style="display: inline-block;">
+
+math:
+
+<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png" width="250" style="display: inline-block;">
+
+poem:
+
+<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png" width="250" style="display: inline-block;">
+
+ocr:
+
+<img src="https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr.png" width="250" style="display: inline-block;">
+
+**Single-sample inference**
+
+```python
+import os
+os.environ['CUDA_VISIBLE_DEVICES'] = '0'
+
+from swift.llm import (
+    get_model_tokenizer, get_template, inference, ModelType,
+    get_default_template_type, inference_stream
+)
+from swift.utils import seed_everything
+import torch
+
+model_type = ModelType.qwen_vl_chat
+template_type = get_default_template_type(model_type)
+print(f'template_type: {template_type}')
+
+model, tokenizer = get_model_tokenizer(model_type, torch.float16,
+                                       model_kwargs={'device_map': 'auto'})
+model.generation_config.max_new_tokens = 256
+template = get_template(template_type, tokenizer)
+seed_everything(42)
+
+query = """<img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png</img>距离各城市多远？"""
+response, history = inference(model, template, query)
+print(f'query: {query}')
+print(f'response: {response}')
+
+# Streaming
+query = '距离最远的城市是哪？'
+gen = inference_stream(model, template, query, history)
+print_idx = 0
+print(f'query: {query}\nresponse: ', end='')
+for response, history in gen:
+    delta = response[print_idx:]
+    print(delta, end='', flush=True)
+    print_idx = len(response)
+print()
+print(f'history: {history}')
+"""
+query: <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png</img>距离各城市多远？
+response: 马路边距离马路边14公里；阳江边距离马路边62公里；广州边距离马路边293公里。
+query: 距离最远的城市是哪？
+response: 距离最远的城市是广州，距离马路边293公里。
+history: [['<img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png</img>距离各城市多远？', '马路边距离马路边14公里；阳江边距离马路边62公里；广州边距离马路边293公里。'], ['距离最远的城市是哪？', '距离最远的城市是广州，距离马路边293公里。']]
+"""
+```
+
+The sample image is shown below:
+
+road:
+
+<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png" width="250" style="display: inline-block;">
+
+
+## Fine-tuning
+Multimodal LLMs are usually fine-tuned on a **custom dataset**. Here is a demo that can be run directly:
+
+LoRA fine-tuning:
+
+```shell
+# Experimental environment: 3090
+# 23GB GPU memory
+CUDA_VISIBLE_DEVICES=0 swift sft \
+    --model_type qwen-vl-chat \
+    --dataset coco-en-mini
+```
+
+Full-parameter fine-tuning:
+```shell
+# Experimental environment: 4 * A100
+# 4 * 70GB GPU memory
+NPROC_PER_NODE=2 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
+    --model_type qwen-vl-chat \
+    --dataset coco-en-mini \
+    --sft_type full
+```
+
+The **Qwen-VL** model supports training on grounding tasks. Use the following data format:
+```jsonl
+{"query": "Find <bbox>", "response": "<ref-object>", "images": ["/coco2014/train2014/COCO_train2014_000000001507.jpg"], "objects": "[{\"caption\": \"guy in red\", \"bbox\": [138, 136, 235, 359], \"bbox_type\": \"real\", \"image\": 0}]" }
+{"query": "Find <ref-object>", "response": "<bbox>", "images": ["/coco2014/train2014/COCO_train2014_000000001507.jpg"], "objects": "[{\"caption\": \"guy in red\", \"bbox\": [138, 136, 235, 359], \"bbox_type\": \"real\", \"image\": 0}]" }
+# Alternatively, use <img></img> tags
+{"query": "<img>/coco2014/train2014/COCO_train2014_000000001507.jpg</img>Find <bbox>", "response": "<ref-object>", "objects": "[{\"caption\": \"guy in red\", \"bbox\": [138, 136, 235, 359], \"bbox_type\": \"real\", \"image\": 0}]" }
+{"query": "<img>/coco2014/train2014/COCO_train2014_000000001507.jpg</img>Find <ref-object>", "response": "<bbox>", "objects": "[{\"caption\": \"guy in red\", \"bbox\": [138, 136, 235, 359], \"bbox_type\": \"real\", \"image\": 0}]" }
+```
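+The `objects` field above holds a JSON string: a list of objects, each with four fields:
+- `caption`: description of the object that the bbox refers to
+- `bbox`: the coordinates; four integers (rather than floats) are recommended, in the order x_min, y_min, x_max, y_max
+- `bbox_type`: the bbox type; three are currently supported: `real`, `norm_1000`, and `norm_1`, meaning actual pixel coordinates, per-mille (0-1000) coordinates, and normalized (0-1) coordinates respectively
+- `image`: index of the image that the bbox belongs to, counting from 0
+
+The format above is converted into a format that Qwen-VL recognizes, namely: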
+```jsonl
+{"query": "<img>/coco2014/train2014/COCO_train2014_000000001507.jpg</img>Find <ref>the man</ref>", "response": "<box>(200,200),(600,600)</box>"}
+```
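+You can also pass this converted format in directly, but note that the coordinates must then be per-mille (0-1000) coordinates.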
+
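As a rough illustration of the conversion described above, the sketch below rescales a `real` (pixel) bbox to per-mille coordinates and renders the `<ref>`/`<box>` strings. The helper names, the rounding rule, and the 640x480 image size are assumptions made for illustration; ms-swift's internal conversion may differ in detail.

```python
def to_norm_1000(bbox, width, height):
    """Rescale a pixel-space [x_min, y_min, x_max, y_max] box to 0-1000 coordinates."""
    x_min, y_min, x_max, y_max = bbox
    return [
        round(x_min / width * 1000),
        round(y_min / height * 1000),
        round(x_max / width * 1000),
        round(y_max / height * 1000),
    ]


def format_grounding(caption, bbox_1000):
    """Render the caption and box in the <ref>/<box> notation used by Qwen-VL."""
    x_min, y_min, x_max, y_max = bbox_1000
    ref = f'<ref>{caption}</ref>'
    box = f'<box>({x_min},{y_min}),({x_max},{y_max})</box>'
    return ref, box


# Hypothetical 640x480 image; the bbox is the one from the example records.
norm_box = to_norm_1000([138, 136, 235, 359], width=640, height=480)
ref, box = format_grounding('guy in red', norm_box)
print(ref, box)
```

Passing `bbox_type: "real"` and letting the framework convert avoids doing this by hand; the sketch only shows what a per-mille box means relative to the image size.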
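+[Custom datasets](../LLM/自定义与拓展.md#-推荐命令行参数的形式) can be written as JSON or JSONL. Here is an example:
+
+(Multi-turn conversations are supported; each turn may contain several images or none; local paths and URLs are both accepted.)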
+
+```json
+[
+    {"conversations": [
+        {"from": "user", "value": "<img>img_path</img>11111"},
+        {"from": "assistant", "value": "22222"}
+    ]},
+    {"conversations": [
+        {"from": "user", "value": "<img>img_path</img><img>img_path2</img><img>img_path3</img>aaaaa"},
+        {"from": "assistant", "value": "bbbbb"},
+        {"from": "user", "value": "<img>img_path</img>ccccc"},
+        {"from": "assistant", "value": "ddddd"}
+    ]},
+    {"conversations": [
+        {"from": "user", "value": "AAAAA"},
+        {"from": "assistant", "value": "BBBBB"},
+        {"from": "user", "value": "CCCCC"},
+        {"from": "assistant", "value": "DDDDD"}
+    ]}
+]
+```
+
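Before training on a `conversations`-style file, it can help to check that turns alternate and to list the images each sample references. The sketch below is one way to do that; `summarize` is a hypothetical helper and the strict user/assistant alternation is an assumption inferred from the example above, not a documented ms-swift rule.

```python
import json
import re

IMG_TAG = re.compile(r'<img>(.*?)</img>')


def summarize(sample):
    """Return (number of rounds, image paths) for one `conversations` record."""
    turns = sample['conversations']
    # Assumption: turns alternate user -> assistant, starting with the user.
    for i, turn in enumerate(turns):
        expected = 'user' if i % 2 == 0 else 'assistant'
        if turn['from'] != expected:
            raise ValueError(f'turn {i}: expected {expected!r}, got {turn["from"]!r}')
    images = [
        path
        for turn in turns if turn['from'] == 'user'
        for path in IMG_TAG.findall(turn['value'])
    ]
    return len(turns) // 2, images


dataset = json.loads('''[
    {"conversations": [
        {"from": "user", "value": "<img>img_path</img><img>img_path2</img>aaaaa"},
        {"from": "assistant", "value": "bbbbb"},
        {"from": "user", "value": "ccccc"},
        {"from": "assistant", "value": "ddddd"}
    ]}
]''')
for sample in dataset:
    print(summarize(sample))
```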
+
+## 微调后推理
+Direct inference:
+```shell
+CUDA_VISIBLE_DEVICES=0 swift infer \
+    --ckpt_dir output/qwen-vl-chat/vx-xxx/checkpoint-xxx \
+    --load_dataset_config true
+```
+
+**merge-lora** and run inference:
+```shell
+CUDA_VISIBLE_DEVICES=0 swift export \
+    --ckpt_dir output/qwen-vl-chat/vx-xxx/checkpoint-xxx \
+    --merge_lora true
+
+CUDA_VISIBLE_DEVICES=0 swift infer \
+    --ckpt_dir output/qwen-vl-chat/vx-xxx/checkpoint-xxx-merged \
+    --load_dataset_config true
+```
--- a/ms-swift/docs/source/Multi-Modal/vLLM推理加速文档.md
+++ b/ms-swift/docs/source/Multi-Modal/vLLM推理加速文档.md
+# vLLM推理加速文档
+ms-swift integrates vLLM to accelerate inference for multimodal models. For the supported models, see [Supported models and datasets](../LLM/支持的模型和数据集.md#多模态大模型).
+
+## 目录
+- [环境准备](#环境准备)
+- [推理加速](#推理加速)
+- [部署](#部署)
+
+
+## 环境准备
+```bash
+# Set the global pip mirror (speeds up downloads)
+pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
+# Install ms-swift
+git clone https://github.com/modelscope/swift.git
+cd swift
+pip install -e '.[llm]'
+
+# vLLM versions are tied to specific CUDA versions; pick a version following `https://docs.vllm.ai/en/latest/getting_started/installation.html`
+pip install "vllm>=0.5.1"
+pip install openai -U
+```
+
+
+## 推理加速
+
+Using Python:
+```python
+import os
+os.environ['CUDA_VISIBLE_DEVICES'] = '0'
+
+from swift.llm import (
+    ModelType, get_vllm_engine, get_default_template_type,
+    get_template, inference_vllm
+)
+
+# 'minicpm-v-v2_5-chat', 'minicpm-v-v2_6-chat', 'internvl2-1b', 'internvl2-4b', 'phi3-vision-128k-instruct'
+model_type = ModelType.llava1_6_mistral_7b_instruct
+model_id_or_path = None
+llm_engine = get_vllm_engine(model_type, model_id_or_path=model_id_or_path)
+template_type = get_default_template_type(model_type)
+template = get_template(template_type, llm_engine.hf_tokenizer)
+
+llm_engine.generation_config.max_new_tokens = 1024
+
+images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png']
+request_list = [{'query': 'who are you'}, {'query': 'Describe this image.', 'images': images}]
+resp_list = inference_vllm(llm_engine, template, request_list)
+for request, resp in zip(request_list, resp_list):
+    print(f"query: {request['query']}")
+    print(f"response: {resp['response']}")
+
+history1 = resp_list[1]['history']
+images.append(None)
+request_list = [{'query': 'Is the creature in the picture a dog?', 'history': history1, 'images': images}]
+resp_list = inference_vllm(llm_engine, template, request_list)
+for request, resp in zip(request_list, resp_list):
+    print(f"query: {request['query']}")
+    print(f"response: {resp['response']}")
+    print(f"history: {resp['history']}")
+
+"""
+query: who are you
+response: Hello! I am an AI language model, designed to assist users with information and provide helpful prompts and suggestions. As an artificial intelligence, I do not have personal experiences, so I don't have a personality or individuality. Instead, my purpose is to provide accurate, useful information to users like you. Is there anything specific you would like help with or any other questions you have?
+query: Describe this image.
+response: The image features a close-up of a kitten's face. The kitten has striking blue eyes, which are open and appear to be looking towards the camera. Its fur exhibits a mix of black and white stripes with black markings around its eyes. The fur texture is soft and dense with whiskers adorning the sides of its face, adding to its feline charm. The background is blurred with hints of green and white, which creates a bokeh effect, keeping the focus on the kitten's face. The image exudes a sense of innocence and curiosity typically associated with young felines.
+query: Is the creature in the picture a dog?
+response: No, the creature in the picture is a kitten, which is a young cat, not a dog. The presence of distinct feline features such as stripes, whiskers, and the appearance of blue eyes confirms this.
+history: [['Describe this image.', "The image features a close-up of a kitten's face. The kitten has striking blue eyes, which are open and appear to be looking towards the camera. Its fur exhibits a mix of black and white stripes with black markings around its eyes. The fur texture is soft and dense with whiskers adorning the sides of its face, adding to its feline charm. The background is blurred with hints of green and white, which creates a bokeh effect, keeping the focus on the kitten's face. The image exudes a sense of innocence and curiosity typically associated with young felines. "], ['Is the creature in the picture a dog?', 'No, the creature in the picture is a kitten, which is a young cat, not a dog. The presence of distinct feline features such as stripes, whiskers, and the appearance of blue eyes confirms this. ']]
+"""
+```
+
+
+Batch processing:
+```python
+# vllm>=0.5.4
+import os
+os.environ['CUDA_VISIBLE_DEVICES'] = '0'
+
+from swift.llm import (
+    get_vllm_engine, get_template, inference_vllm, ModelType,
+    get_default_template_type, inference_stream_vllm
+)
+from swift.utils import seed_everything
+import torch
+
+model_type = ModelType.minicpm_v_v2_6_chat
+model_id_or_path = None
+vllm_engine = get_vllm_engine(model_type, torch.bfloat16, model_id_or_path=model_id_or_path,
+                              max_model_len=8192)
+
+tokenizer = vllm_engine.hf_tokenizer
+vllm_engine.generation_config.max_new_tokens = 256
+template_type = get_default_template_type(model_type)
+print(f'template_type: {template_type}')
+template = get_template(template_type, tokenizer)
+seed_everything(42)
+
+query = '<image>描述这张图片'
+images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png']
+generation_info = {}
+request_list = [{'query': query, 'images': images} for _ in range(100)]
+resp_list = inference_vllm(vllm_engine, template, request_list, generation_info=generation_info, use_tqdm=True)
+print(f'query: {query}')
+print(f'response: {resp_list[0]["response"]}')
+print(generation_info)
+
+# Streaming
+generation_info = {}
+gen = inference_stream_vllm(vllm_engine, template, request_list, generation_info=generation_info)
+print_idx = 0
+print(f'query: {query}\nresponse: ', end='')
+# only show first
+for resp_list in gen:
+    resp = resp_list[0]
+    if resp is None:
+        continue
+    response = resp['response']
+    delta = response[print_idx:]
+    print(delta, end='', flush=True)
+    print_idx = len(response)
+print()
+print(generation_info)
+"""
+100%|██████████████████████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 91.47it/s]
+100%|██████████████████████████████████████████████████████████████████████████████| 100/100 [00:22<00:00,  4.48it/s]
+query: <image>描述这张图片
+response: 这张图片展示了一只小猫咪的特写，可能是美国短毛猫品种，因为其花纹和毛发质地。猫咪有着引人注目的蓝色眼睛，这是其外貌中非常突出的特征。它皮毛上有着独特的黑色条纹，从面颊延伸至头顶，暗示着一种有条纹的花纹图案。它的耳朵小而尖，内侧是粉色的。猫咪的胡须细长而突出，围绕在它的下颌两侧和眼睛周围。猫咪坐着，用一种表达丰富的方式直视着，嘴巴微微张开，露出粉红色的内唇。背景模糊，柔和的光线增强了猫咪的特征。
+{'num_prompt_tokens': 2700, 'num_generated_tokens': 14734, 'num_samples': 100, 'runtime': 23.53027338697575, 'samples/s': 4.249844375176322, 'tokens/s': 626.1720702384794}
+query: <image>描述这张图片
+response: 这张图片展示了一只小猫的特写，可能是一只幼年猫，在模糊的背景中，集中注意力在猫的表情上。这只猫长着一身白色与黑色条纹相间的毛皮，带有微妙的灰褐色。它的眼睛大而圆，具有高度的反光度，表明它们可能含有异色瞳，即一只眼睛是蓝色的，另一只是绿色的，但这只猫两只眼睛都是绿色的。睫毛清晰可见，增添了一种生动的表情。猫的耳朵竖立着，内部呈粉红色，边缘有浅色的阴影，显示出柔软的毛发。胡须又长又明显，突显了小猫的脸部形状。这个品种的猫看起来是一个常见品种，毛皮图案和眼睛颜色表明它可能是一只虎斑猫。光线柔和，产生一种天鹅绒般的效果，突出了猫绒毛的质感。
+{'num_prompt_tokens': 2700, 'num_generated_tokens': 14986, 'num_samples': 100, 'runtime': 23.375922130944673, 'samples/s': 4.277906105257837, 'tokens/s': 641.0870089339394}
+"""
+```
+
+Using the CLI:
+```shell
+# Multimodal models must specify `--infer_backend vllm` explicitly
+CUDA_VISIBLE_DEVICES=0 swift infer --model_type llava1_6-vicuna-7b-instruct --infer_backend vllm
+
+# Run batch inference on a dataset
+CUDA_VISIBLE_DEVICES=0 swift infer --model_type llava1_6-vicuna-7b-instruct --infer_backend vllm \
+    --val_dataset coco-en-2-mini#100
+
+# TP:
+CUDA_VISIBLE_DEVICES=0,1 swift infer --model_type internvl2-1b \
+    --infer_backend vllm --tensor_parallel_size 2
+```
+
+```python
+"""
+<<< How many sheep are in the picture?
+Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png
+There are four sheep in the picture.
+--------------------------------------------------
+<<< Perform OCR on the image.
+Input a media path or URL <<< https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr_en.png
+The image contains text that appears to be an introduction or description of a software or service called SWIFT. Here is the transcribed text:
+
+introduction
+SWIFT supports training, inference, evaluation and deployment of 250+ LLMs and 35 MLMs (multimodal large models). Developers can directly apply their own research and production environments to realize the complete workflow from model training and evaluation to application. In addition, we provide a complete Adapters Library to support the latest training techniques such as PEFT, we also provide a Gradio web-ui for controlling training and inference, as well as accompanying deep learning courses and best practices for beginners.
+
+Additionally, we are expanding capabilities for other modalities. Currently, we support full-paraphrase training and LORA training for AnimatedDiff.
+
+SWIFT web-ui is available both on HuggingFace space and ModelScope studio.
+
+Please feel free to try.
+
+Please note that the text is a mix of English and what appears to be a programming or technical language, and some words or phrases might not be fully transcribed due to the complexity of the text.
+--------------------------------------------------
+<<< who are you
+Input a media path or URL <<<
+I'm a language model called Vicuna, and I was trained by researchers from Large Model Systems Organization (LMSYS).
+"""
+```
+
+
+## 部署
+
+**Server:**
+```shell
+CUDA_VISIBLE_DEVICES=0 swift deploy --model_type llava1_6-vicuna-13b-instruct --infer_backend vllm
+
+# TP:
+CUDA_VISIBLE_DEVICES=0,1 swift deploy --model_type internvl2-1b \
+    --infer_backend vllm --tensor_parallel_size 2
+```
+
+**Client:**
+
+Test:
+```bash
+curl http://localhost:8000/v1/chat/completions \
+-H "Content-Type: application/json" \
+-d '{
+"model": "llava1_6-vicuna-13b-instruct",
+"messages": [{"role": "user", "content": "Describe this image."}],
+"temperature": 0,
+"images": ["http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png"]
+}'
+```
+
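The same request can also be assembled with Python's standard library, which is handy when curl is unavailable. This is a minimal sketch: `build_request` is a hypothetical helper, the payload fields simply mirror the curl call above, and the commented-out response handling assumes an OpenAI-style response body.

```python
import json
import urllib.request


def build_request(model, query, images, base_url='http://localhost:8000/v1'):
    """Build a POST request equivalent to the curl command above."""
    payload = {
        'model': model,
        'messages': [{'role': 'user', 'content': query}],
        'temperature': 0,
        'images': images,
    }
    return urllib.request.Request(
        f'{base_url}/chat/completions',
        data=json.dumps(payload).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


req = build_request(
    'llava1_6-vicuna-13b-instruct',
    'Describe this image.',
    ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png'],
)
print(req.full_url)

# Sending the request needs the server started above to be running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)['choices'][0]['message']['content'])
```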
+Using ms-swift:
+```python
+import asyncio
+from swift.llm import get_model_list_client, XRequestConfig, inference_client_async
+
+model_list = get_model_list_client()
+model_type = model_list.data[0].id
+print(f'model_type: {model_type}')
+request_config = XRequestConfig(seed=42)
+
+query = '<image>Describe this image.'
+images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png']
+tasks = [inference_client_async(model_type, query, images=images, request_config=request_config) for _ in range(100)]
+async def _batch_run(tasks):
+    return await asyncio.gather(*tasks)
+
+resp_list = asyncio.run(_batch_run(tasks))
+print(f'query: {query}')
+print(f'response0: {resp_list[0].choices[0].message.content}')
+print(f'response1: {resp_list[1].choices[0].message.content}')
+
+query = '<image>How many sheep are in the picture?'
+images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png']
+
+async def _stream():
+    global query
+    request_config = XRequestConfig(seed=42, stream=True)
+    stream_resp = await inference_client_async(model_type, query, images=images, request_config=request_config)
+    print(f'query: {query}')
+    print('response: ', end='')
+    async for chunk in stream_resp:
+        print(chunk.choices[0].delta.content, end='', flush=True)
+    print()
+
+asyncio.run(_stream())
+"""
+model_type: llava1_6-vicuna-13b-instruct
+query: <image>Describe this image.
+response0: The image captures a moment of tranquility featuring a kitten. The kitten, with its fur a mix of gray and white, is the main subject of the image. It's sitting on a surface that appears to be a table or a similar flat surface. The kitten's eyes, a striking shade of blue, are wide open, giving it a curious and alert expression. Its ears, also gray and white, are perked up, suggesting it's attentive to its surroundings. The background is blurred, drawing focus to the kitten, and it's a soft, muted color that doesn't distract from the main subject. The overall image gives a sense of calm and innocence.
+response1: The image captures a moment of tranquility featuring a kitten. The kitten, with its fur a mix of gray and white, is the main subject of the image. It's sitting on a surface that appears to be a table or a similar flat surface. The kitten's eyes, a striking shade of blue, are wide open, giving it a curious and alert expression. Its ears, also gray and white, are perked up, suggesting it's attentive to its surroundings. The background is blurred, drawing focus to the kitten, and it's a soft, muted color that doesn't distract from the main subject. The overall image gives a sense of calm and innocence.
+query: <image>How many sheep are in the picture?
+response: There are four sheep in the picture.
+"""
+```
+
+
+Using openai:
+```python
+from openai import OpenAI
+client = OpenAI(
+    api_key='EMPTY',
+    base_url='http://localhost:8000/v1',
+)
+model_type = client.models.list().data[0].id
+print(f'model_type: {model_type}')
+
+# use base64 (the MIME type should match the file: cat.png is a PNG)
+# import base64
+# with open('cat.png', 'rb') as f:
+#     img_base64 = base64.b64encode(f.read()).decode('utf-8')
+# image_url = f'data:image/png;base64,{img_base64}'
+
+# use local_path
+# from swift.llm import convert_to_base64
+# image_url = convert_to_base64(images=['cat.png'])['images'][0]
+# image_url = f'data:image/png;base64,{image_url}'
+
+# use url
+image_url = 'http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png'
+
+query = 'Describe this image.'
+messages = [{
+    'role': 'user',
+    'content': [
+        {'type': 'image_url', 'image_url': {'url': image_url}},
+        {'type': 'text', 'text': query},
+    ]
+}]
+
+resp = client.chat.completions.create(
+    model=model_type,
+    messages=messages,
+    temperature=0)
+response = resp.choices[0].message.content
+print(f'query: {query}')
+print(f'response: {response}')
+
+# Streaming
+query = 'How many sheep are in the picture?'
+messages = [{
+    'role': 'user',
+    'content': [
+        {'type': 'image_url', 'image_url': {'url': 'http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png'}},
+        {'type': 'text', 'text': query},
+    ]
+}]
+stream_resp = client.chat.completions.create(
+    model=model_type,
+    messages=messages,
+    stream=True,
+    temperature=0)
+
+print(f'query: {query}')
+print('response: ', end='')
+for chunk in stream_resp:
+    print(chunk.choices[0].delta.content, end='', flush=True)
+print()
+"""
+model_type: llava1_6-vicuna-13b-instruct
+query: Describe this image.
+response: The image captures a moment of tranquility featuring a kitten. The kitten, with its fur a mix of gray and white, is the main subject of the image. It's sitting on a surface that appears to be a table or a similar flat surface. The kitten's eyes, a striking shade of blue, are wide open, giving it a curious and alert expression. Its ears, also gray and white, are perked up, suggesting it's attentive to its surroundings. The background is blurred, drawing focus to the kitten, and it's a soft, muted color that doesn't distract from the main subject. The overall image gives a sense of calm and innocence.
+query: How many sheep are in the picture?
+response: There are four sheep in the picture.
+"""
+```
+
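When a local file is sent as a base64 data URL, the MIME type in the URL should match the file. The sketch below derives it from the file extension with the standard library; `to_data_url` is a hypothetical helper, and the three placeholder bytes stand in for a real image.

```python
import base64
import mimetypes
import os
import tempfile


def to_data_url(path):
    """Encode a local file as a `data:<mime>;base64,<payload>` URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        # Unknown extension: fall back to a generic binary type.
        mime = 'application/octet-stream'
    with open(path, 'rb') as f:
        encoded = base64.b64encode(f.read()).decode('utf-8')
    return f'data:{mime};base64,{encoded}'


# Demo with a throwaway file; replace it with a real image path in practice.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'cat.png')
    with open(demo_path, 'wb') as f:
        f.write(b'abc')  # placeholder bytes, not a valid PNG
    image_url = to_data_url(demo_path)
print(image_url)
```

The resulting string can be used as the `url` value of an `image_url` content part, in place of the HTTP URL in the example above.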
+For more client usage, see the [MLLM deployment document](MLLM部署文档.md#yi-vl-6b-chat).
--- a/ms-swift/docs/source/Multi-Modal/yi-vl最佳实践.md
+++ b/ms-swift/docs/source/Multi-Modal/yi-vl最佳实践.md
+
+# Yi-VL 最佳实践
+
+## 目录
+- [环境准备](#环境准备)
+- [推理](#推理)
+- [微调](#微调)
+- [微调后推理](#微调后推理)
+
+
+## 环境准备
+```shell
+git clone https://github.com/modelscope/swift.git
+cd swift
+pip install -e '.[llm]'
+```
+
+## 推理
+
+Inference with [yi-vl-6b-chat](https://modelscope.cn/models/01ai/Yi-VL-6B/summary):
+```shell
+# Experimental environment: A10, 3090, V100...
+# 18GB GPU memory
+CUDA_VISIBLE_DEVICES=0 swift infer --model_type yi-vl-6b-chat
+```
+
+Output: (local paths and URLs are both accepted)
+```python
+"""
+<<< 描述这张图片
+Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
+图片显示一只小猫坐在地板上,眼睛睁开,凝视着摄像机。小猫看起来很可爱,有灰色和白色的毛皮,以及蓝色的眼睛。它似乎正在看摄像机,可能对周围环境很好奇。
+--------------------------------------------------
+<<< 你是谁？
+Input a media path or URL <<<
+我是人工智能助手,随时准备帮助你解答问题或提供信息。
+--------------------------------------------------
+<<< 图中有几只羊
+Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png
+图中有四只羊.
+--------------------------------------------------
+<<< clear
+<<< 计算结果是多少
+Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png
+1452 + 45304 = 46756
+--------------------------------------------------
+<<< clear
+<<< 根据图片中的内容写首诗
+Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png
+夜幕降临,星光闪烁,
+一艘小船在河上飘荡,
+船头挂着一盏明亮的灯,
+照亮了周围的黑暗。
+
+船上有两个人,
+一个在船头,另一个在船尾,
+他们似乎在谈话,
+在星光下享受着宁静的时刻。
+
+河岸边,树木在黑暗中站着,
+在星光下投下长长的影子。
+这景象是那么的宁静,
+让人想起一个古老的传说。
+
+小船,人,和星光,
+构成了一个美丽的画面,
+它唤起一种宁静的感觉,
+在喧嚣的城市生活之外。
+--------------------------------------------------
+<<< clear
+<<< 对图片进行OCR
+Input a media path or URL <<< https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr.png
+这是一段关于SWIFT的文字，其中包括了它的版本、功能以及一些链接。
+"""
+```
+
+The example images are shown below:
+
+cat:
+
+<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png" width="250" style="display: inline-block;">
+
+animal:
+
+<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png" width="250" style="display: inline-block;">
+
+math:
+
+<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png" width="250" style="display: inline-block;">
+
+poem:
+
+<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png" width="250" style="display: inline-block;">
+
+ocr:
+
+<img src="https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr.png" width="250" style="display: inline-block;">
+
+**Single-sample inference**
+
+```python
+import os
+os.environ['CUDA_VISIBLE_DEVICES'] = '0'
+
+from swift.llm import (
+    get_model_tokenizer, get_template, inference, ModelType,
+    get_default_template_type, inference_stream
+)
+from swift.utils import seed_everything
+import torch
+
+model_type = ModelType.yi_vl_6b_chat
+template_type = get_default_template_type(model_type)
+print(f'template_type: {template_type}')
+
+model, tokenizer = get_model_tokenizer(model_type, torch.float16,
+                                       model_kwargs={'device_map': 'auto'})
+model.generation_config.max_new_tokens = 256
+template = get_template(template_type, tokenizer)
+seed_everything(2)  # ...
+
+images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png']
+query = '距离各城市多远？'
+response, history = inference(model, template, query, images=images)
+print(f'query: {query}')
+print(f'response: {response}')
+
+# Streaming
+query = '距离最远的城市是哪？'
+images = images * 2
+gen = inference_stream(model, template, query, history, images=images)
+print_idx = 0
+print(f'query: {query}\nresponse: ', end='')
+for response, history in gen:
+    delta = response[print_idx:]
+    print(delta, end='', flush=True)
+    print_idx = len(response)
+print()
+print(f'history: {history}')
+"""
+query: 距离各城市多远？
+response: 距离甲塔14公里,距离阳江62公里,距离广州293公里,距离广州293公里。
+query: 距离最远的城市是哪？
+response: 最远的距离是293公里。
+history: [['距离各城市多远？', '距离甲塔14公里,距离阳江62公里,距离广州293公里,距离广州293公里。'], ['距离最远的城市是哪？', '最远的距离是293公里。']]
+"""
+```
+
+The example image is shown below:
+
+road:
+
+<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png" width="250" style="display: inline-block;">
+
+
+## 微调
+Multimodal LLMs are usually fine-tuned on a **custom dataset**. The demo below can be run as-is:
+
+```shell
+# Experimental environment: A10, 3090, V100...
+# 19GB GPU memory
+CUDA_VISIBLE_DEVICES=0 swift sft \
+    --model_type yi-vl-6b-chat \
+    --dataset coco-en-2-mini
+```
+
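+[Custom datasets](../LLM/自定义与拓展.md#-推荐命令行参数的形式) can be written as JSON or JSONL. Here is an example:
+
+(Multi-turn conversations are supported; each turn must contain either exactly one image or none; local paths and URLs are both accepted.)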
+
+```jsonl
+{"query": "55555", "response": "66666", "images": ["image_path"]}
+{"query": "eeeee", "response": "fffff", "history": [], "images": ["image_path"]}
+{"query": "EEEEE", "response": "FFFFF", "history": [["query1", "response1"], ["query2", "response2"]], "images": ["image_path", "image_path2", "image_path3"]}
+```
+
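Records in the `query`/`response`/`history` layout can be generated from a list of dialogue rounds. The sketch below shows one way to do it; `to_record` is a hypothetical helper, and treating the last round as the current `query`/`response` with earlier rounds as `history` is an assumption read off the example records above.

```python
import json


def to_record(rounds, images):
    """Build one record: the last (query, response) pair is the current turn,
    and all earlier pairs become `history`."""
    if not rounds:
        raise ValueError('at least one (query, response) pair is required')
    *history, (query, response) = rounds
    return {
        'query': query,
        'response': response,
        'history': [list(pair) for pair in history],
        'images': list(images),
    }


record = to_record(
    [('query1', 'response1'), ('query2', 'response2'), ('EEEEE', 'FFFFF')],
    ['image_path', 'image_path2', 'image_path3'],
)
# One JSON object per line is what makes the file JSONL.
line = json.dumps(record, ensure_ascii=False)
print(line)
```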
+
+## 微调后推理
+Direct inference:
+```shell
+CUDA_VISIBLE_DEVICES=0 swift infer \
+    --ckpt_dir output/yi-vl-6b-chat/vx-xxx/checkpoint-xxx \
+    --load_dataset_config true
+```
+
+**merge-lora** and run inference:
+```shell
+CUDA_VISIBLE_DEVICES=0 swift export \
+    --ckpt_dir output/yi-vl-6b-chat/vx-xxx/checkpoint-xxx \
+    --merge_lora true
+
+CUDA_VISIBLE_DEVICES=0 swift infer \
+    --ckpt_dir output/yi-vl-6b-chat/vx-xxx/checkpoint-xxx-merged \
+    --load_dataset_config true
+```
--- a/ms-swift/docs/source/Multi-Modal/人类偏好对齐训练文档.md
+++ b/ms-swift/docs/source/Multi-Modal/人类偏好对齐训练文档.md
+# 人类偏好对齐训练文档
+
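+This document provides training scripts for a range of human preference alignment algorithms. For more detail on the algorithms and how to choose between them, see this [document](https://github.com/modelscope/modelscope-classroom/blob/main/LLM-tutorial/M.%E4%BA%BA%E7%B1%BB%E5%81%8F%E5%A5%BD%E5%AF%B9%E9%BD%90%E8%AE%AD%E7%BB%83.md).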
+
+## 目录
+- [环境准备](#环境准备)
+- [数据集](#数据集)
+- [DPO](#dpo)
+- [CPO](#cpo)
+- [ORPO](#orpo)
+- [SimPO](#simpo)
+
+## 环境准备
+```bash
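+# Set the global pip mirror (speeds up downloads)
+pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
+# Install ms-swift
+git clone https://github.com/modelscope/swift.git
+cd swift
+pip install -e '.[llm]'
+
+# Align the environment (usually unnecessary; if you hit errors, run the commands below. The repository is tested against the latest environment)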
+pip install -r requirements/framework.txt  -U
+pip install -r requirements/llm.txt  -U
+```
+
+
+## 数据集
+
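+Preference-alignment training of vision multimodal LLMs generally needs data in the form $(x,y_w,y_l)$, where $x$ is the model input, comprising the text prompt and the image(s), and $y_w$ and $y_l$ are the preferred response that matches human preference and the rejected response that does not, for example: ![dpo_data](../../resources/vdpo_data.png)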
+
+**Custom dataset format**
+```jsonl
+{"system": "123", "query": "11111", "response": "22222", "rejected_response": "33333", "images": ["image_path"], "history": [["query1", "response1"], ["query2", "response2"]]}
+{"system": "123", "query": "aaaaa", "response": "bbbbb", "rejected_response": "ccccc", "images": ["image_path"], "history": [["query1", "response1"], ["query2", "response2"]]}
+{"system": "123", "query": "AAAAA", "response": "BBBBB", "rejected_response": "CCCCC", "images": ["image_path"], "history": [["query1", "response1"], ["query2", "response2"]]}
+```
+
+`system` and `history` are optional.
+
+Models differ in how many images they support; see the best-practice document of the model in question.
+
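A quick pre-flight check of a preference file can catch malformed records before training starts. The sketch below validates one JSONL line against the format above; `check_record` is a hypothetical helper, and treating every field other than `system` and `history` as required is an assumption based on the note that only those two are optional.

```python
import json

REQUIRED_FIELDS = ('query', 'response', 'rejected_response', 'images')


def check_record(line):
    """Parse one JSONL line and verify it looks like a preference record."""
    record = json.loads(line)
    missing = [key for key in REQUIRED_FIELDS if key not in record]
    if missing:
        raise ValueError(f'missing fields: {missing}')
    if record['response'] == record['rejected_response']:
        raise ValueError('preferred and rejected responses are identical')
    # `history` is optional; when present, each entry is a [query, response] pair.
    for pair in record.get('history', []):
        if len(pair) != 2:
            raise ValueError(f'history entry is not a [query, response] pair: {pair}')
    return record


good = ('{"query": "11111", "response": "22222", "rejected_response": "33333", '
        '"images": ["image_path"], "history": [["query1", "response1"]]}')
print(check_record(good)['rejected_response'])
```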
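+**Training tips**:
+- The training scripts below use `--lora_target_modules DEFAULT` to train only the model's QKV matrices; you can also set `--lora_target_modules ALL` to train all of the model's linear layers.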
+
+## DPO
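+[Paper on arXiv](https://arxiv.org/abs/2305.18290)
+
+Hyperparameters
+- `beta`: KL regularization coefficient; the larger the value, the heavier the penalty for deviating from the reference model. Defaults to 0.1
+
+Before starting DPO training, it is recommended to run SFT on the preferred responses of the preference dataset, so that the data matches the distribution DPO expects.
+We also mix an SFT loss into the DPO loss to stabilize training; the `sft_beta` hyperparameter sets the coefficient of the SFT loss and defaults to 0.1.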
+
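To make the roles of `beta` and `sft_beta` concrete, here is a minimal sketch of a per-sample DPO loss with an SFT term mixed in. The preference term follows the objective in the DPO paper; the additive way the SFT term is combined, the function names, and the example log-probabilities are assumptions for illustration, not ms-swift's actual implementation.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected,
             beta=0.1, sft_beta=0.1, chosen_nll=0.0):
    """Per-sample DPO loss plus a weighted SFT (NLL) term on the preferred response.

    The first four arguments are sequence log-probabilities under the policy
    and the frozen reference model.
    """
    # How much more the policy prefers the chosen answer than the reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -log_sigmoid(beta * margin) + sft_beta * chosen_nll


# Made-up log-probabilities: the policy has shifted toward the preferred answer.
print(dpo_loss(-10.0, -14.0, -11.0, -12.0, chosen_nll=1.5))
```

A larger `beta` makes the loss react more sharply to the same margin, which is the sense in which it penalizes drifting from the reference model.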
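+Training scripts: single-GPU, multi-GPU device-map, and multi-GPU DDP versions are given here; for brevity, the later algorithms only show the single-GPU version.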
+```bash
+# Experimental environment: A100
+CUDA_VISIBLE_DEVICES=0 \
+swift rlhf \
+    --rlhf_type dpo \
+    --model_type llava1_6-mistral-7b-instruct \
+    --beta 0.1 \
+    --sft_beta 0.1 \
+    --sft_type  lora \
+    --dataset rlaif-v#1000 \
+    --num_train_epochs  2  \
+    --lora_target_modules  DEFAULT  \
+    --gradient_checkpointing  true  \
+    --batch_size  1  \
+    --learning_rate  5e-5  \
+    --gradient_accumulation_steps  16  \
+    --warmup_ratio  0.03  \
+    --save_total_limit  2
+
+# MP(device map)
+CUDA_VISIBLE_DEVICES=0,1 \
+swift rlhf \
+    --rlhf_type dpo \
+    --model_type llava1_6-mistral-7b-instruct \
+    --beta 0.1 \
+    --sft_beta 0.1 \
+    --sft_type  lora \
+    --dataset rlaif-v#1000 \
+    --num_train_epochs  2  \
+    --lora_target_modules  DEFAULT  \
+    --gradient_checkpointing  true  \
+    --batch_size  1  \
+    --learning_rate  5e-5  \
+    --gradient_accumulation_steps  16  \
+    --warmup_ratio  0.03  \
+    --save_total_limit  2
+
+# DDP + MP
+nproc_per_node=2
+
+CUDA_VISIBLE_DEVICES=0,1,2,3 \
+NPROC_PER_NODE=$nproc_per_node \
+MASTER_PORT=29500 \
+swift rlhf \
+    --rlhf_type dpo \
+    --model_type llava1_6-mistral-7b-instruct \
+    --beta 0.1 \
+    --sft_beta 0.1 \
+    --sft_type  lora \
+    --dataset rlaif-v#1000 \
+    --num_train_epochs  2  \
+    --lora_target_modules  DEFAULT  \
+    --gradient_checkpointing  true  \
+    --batch_size  1  \
+    --learning_rate  5e-5  \
+    --gradient_accumulation_steps  $(expr 16 / $nproc_per_node)  \
+    --warmup_ratio  0.03  \
+    --save_total_limit  2
+```
+
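+For inference and deployment of the trained model, see the best-practice document of the corresponding model, the [deployment document](./MLLM部署文档.md), and the [vLLM inference acceleration document](./vLLM推理加速文档.md).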
+
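The DDP script above divides `--gradient_accumulation_steps` by the number of processes so that the effective batch size matches the single-GPU run. A small sketch of that arithmetic, with `effective_batch_size` as a hypothetical helper:

```python
def effective_batch_size(batch_size, gradient_accumulation_steps, num_processes=1):
    """Samples contributing to one optimizer step across all data-parallel processes."""
    return batch_size * gradient_accumulation_steps * num_processes


# Single GPU: --batch_size 1, --gradient_accumulation_steps 16
single = effective_batch_size(1, 16)

# DDP + MP: two processes, so the script passes 16 / nproc_per_node = 8 steps
nproc_per_node = 2
ddp = effective_batch_size(1, 16 // nproc_per_node, nproc_per_node)

print(single, ddp)
```

Both configurations accumulate 16 samples per optimizer step, so the learning rate and warmup settings can stay the same.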
+## CPO
+[Paper on arXiv](https://arxiv.org/abs/2401.08417)
+
+Hyperparameters
+- beta: coefficient applied to the implicit reward; defaults to 0.1
+- cpo_alpha: coefficient of the NLL loss; defaults to 1.0
+
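As a hedged illustration of how `beta` and `cpo_alpha` enter the objective, the sketch below computes a per-sample CPO-style loss: a reference-free preference term plus a weighted NLL term on the preferred response. The function names and example values are assumptions; this is not ms-swift's implementation.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def cpo_loss(logp_chosen, logp_rejected, beta=0.1, cpo_alpha=1.0, chosen_nll=0.0):
    """Preference term on policy log-probabilities plus cpo_alpha times the NLL term."""
    # Unlike DPO, no reference model appears in the preference term.
    preference = -log_sigmoid(beta * (logp_chosen - logp_rejected))
    return preference + cpo_alpha * chosen_nll


# Made-up sequence log-probabilities and NLL.
print(cpo_loss(-10.0, -14.0, chosen_nll=1.5))
```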
+Training script
+```bash
+CUDA_VISIBLE_DEVICES=0 \
+swift rlhf \
+    --rlhf_type cpo \
+    --model_type  llava1_6-mistral-7b-instruct \
+    --beta 0.1 \
+    --sft_type  lora \
+    --dataset rlaif-v#1000 \
+    --num_train_epochs  2  \
+    --lora_target_modules  DEFAULT  \
+    --gradient_checkpointing  true  \
+    --batch_size  1  \
+    --learning_rate  5e-5  \
+    --gradient_accumulation_steps  16  \
+    --warmup_ratio  0.03  \
+    --save_total_limit  2
+```
+
+## ORPO
+[Paper on arXiv](https://arxiv.org/abs/2403.07691)
+
+Hyperparameters
+- lambda: coefficient of the odds-ratio loss
+
+Note: ORPO takes the hyperparameter `lambda` through the `--beta` argument.
+```bash
+CUDA_VISIBLE_DEVICES=0 \
+swift rlhf \
+    --rlhf_type orpo \
+    --model_type  llava1_6-mistral-7b-instruct \
+    --beta 0.1 \
+    --sft_type  lora \
+    --dataset rlaif-v#1000 \
+    --num_train_epochs  2  \
+    --lora_target_modules  DEFAULT  \
+    --gradient_checkpointing  true  \
+    --batch_size  1  \
+    --learning_rate  5e-5  \
+    --gradient_accumulation_steps  16  \
+    --warmup_ratio  0.03  \
+    --save_total_limit  2
+```
+
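The role of `lambda` (passed as `--beta`) can be seen in a small per-sample sketch of the ORPO objective: the NLL of the preferred response plus `lambda` times an odds-ratio term. The formula follows the ORPO paper; the function names and example values are assumptions for illustration, and the inputs are length-averaged log-probabilities, which must be negative.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def log_odds(avg_logp):
    """log(p / (1 - p)) computed from log p, for p strictly below 1."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(avg_logp_chosen, avg_logp_rejected, beta=0.1):
    """NLL on the preferred response plus beta (the paper's lambda) times the odds-ratio loss."""
    nll = -avg_logp_chosen
    odds_ratio_loss = -log_sigmoid(log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected))
    return nll + beta * odds_ratio_loss


# Made-up average token log-probabilities.
print(orpo_loss(-0.5, -1.5))
```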
+
+## SimPO
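+[Paper on arXiv](https://arxiv.org/abs/2405.14734)
+
+Hyperparameters
+- beta: coefficient applied to the implicit reward; defaults to 2.0
+- simpo_gamma: the reward margin term; defaults to 1.0
+- cpo_alpha: coefficient of the CPO NLL loss that is mixed in to improve training stability; defaults to 1.0; set it to 0.0 to use the original SimPO algorithm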
+
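The three hyperparameters combine as in the following per-sample sketch: a margin-shifted preference term on length-averaged log-probabilities, plus an optional NLL term weighted by `cpo_alpha`. The preference term follows the SimPO paper; the function names and example values are assumptions for illustration rather than ms-swift's implementation.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def simpo_loss(avg_logp_chosen, avg_logp_rejected, beta=2.0, simpo_gamma=1.0,
               cpo_alpha=1.0, chosen_nll=0.0):
    """SimPO preference term with reward margin, plus cpo_alpha times the NLL term."""
    # The implicit reward is beta times the length-averaged log-probability.
    reward_gap = beta * (avg_logp_chosen - avg_logp_rejected)
    preference = -log_sigmoid(reward_gap - simpo_gamma)
    return preference + cpo_alpha * chosen_nll


# Made-up average token log-probabilities; cpo_alpha=0.0 gives the original SimPO loss.
print(simpo_loss(-0.5, -1.5, cpo_alpha=0.0))
```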
+```bash
+CUDA_VISIBLE_DEVICES=0 \
+swift rlhf \
+    --rlhf_type simpo \
+    --model_type  llava1_6-mistral-7b-instruct \
+    --beta 2.0 \
+    --simpo_gamma 1.0 \
+    --sft_type  lora \
+    --dataset rlaif-v#1000 \
+    --num_train_epochs  2  \
+    --lora_target_modules  DEFAULT  \
+    --gradient_checkpointing  true  \
+    --batch_size  1  \
+    --learning_rate  5e-5  \
+    --gradient_accumulation_steps  16  \
+    --warmup_ratio  0.03  \
+    --save_total_limit  2
+```
--- a/ms-swift/docs/source/_templates/autosummary/class.rst
+++ b/ms-swift/docs/source/_templates/autosummary/class.rst
+.. currentmodule:: {{ module }}
+
+
+{{ name | underline}}
+
+.. autoclass:: {{ name }}
+    :inherited-members:
+    :members:
+
+.. autogenerated from source/_templates/autosummary/class.rst
--- a/ms-swift/docs/source/_templates/classtemplate.rst
+++ b/ms-swift/docs/source/_templates/classtemplate.rst
+.. currentmodule:: {{ module }}
+
+
+{{ name | underline}}
+
+.. autoclass:: {{ name }}
+    :members:
+    :special-members: __init__, __call__
+
+..
+  autogenerated from source/_templates/classtemplate.rst
+  note it does not have :inherited-members:
--- a/ms-swift/docs/source/_templates/sobolengine.rst
+++ b/ms-swift/docs/source/_templates/sobolengine.rst
+.. currentmodule:: {{ module }}
+
+
+{{ name | underline}}
+
+.. autoclass:: {{ name }}
+    :members:
+    :exclude-members: MAXBIT, MAXDIM
+    :undoc-members:
+
+
+..
+  autogenerated from source/_templates/sobolengine.rst
+  note it has specific options
--- a/ms-swift/docs/source/api/swift.hub.rst
+++ b/ms-swift/docs/source/api/swift.hub.rst
+swift.hub
+==============
+
+.. automodule:: swift.hub
+
+.. currentmodule:: swift.hub
+
+.. autosummary::
+    :toctree: generated
+    :nosignatures:
+    :template: classtemplate.rst
+
+    api.HubApi
+    check_model.check_local_model_is_latest
+    push_to_hub.push_to_hub
+    push_to_hub.push_to_hub_async
+    snapshot_download.snapshot_download
+    file_download.model_file_download
--- a/ms-swift/docs/source/api/swift.trainers.rst
+++ b/ms-swift/docs/source/api/swift.trainers.rst
+swift.trainers
+==============
+
+.. automodule:: swift.trainers
+
+.. currentmodule:: swift.trainers
+
+.. autosummary::
+    :toctree: generated
+    :nosignatures:
+    :template: classtemplate.rst
+
+    trainers.Seq2SeqTrainer
+    trainers.Trainer
--- a/ms-swift/docs/source/api/swift.tuners.rst
+++ b/ms-swift/docs/source/api/swift.tuners.rst
+swift.tuners
+==============
+
+.. automodule:: swift.tuners
+
+.. currentmodule:: swift.tuners
+
+.. autosummary::
+    :toctree: generated
+    :nosignatures:
+    :template: classtemplate.rst
+
+    adapter.AdapterConfig
+    base.SwiftModel
+    base.Swift
+    lora.LoRAConfig
+    prompt.PromptConfig
+    restuning.ResTuningConfig
+    side.SideConfig
+    utils.SwiftConfig
+    utils.SwiftOutput
--- a/ms-swift/docs/source/conf.py
+++ b/ms-swift/docs/source/conf.py
+# Configuration file for the Sphinx documentation builder.
+#
+# This file only contains a selection of the most common options. For a full
+# list see the documentation:
+# https://www.sphinx-doc.org/en/master/usage/configuration.html
+
+# -- Path setup --------------------------------------------------------------
+
+# If extensions (or modules to document with autodoc) are in another directory,
+# add these directories to sys.path here. If the directory is relative to the
+# documentation root, use os.path.abspath to make it absolute, like shown here.
+#
+import os
+import sys
+
+# import sphinx_book_theme
+
+sys.path.insert(0, os.path.abspath('../../'))
+# -- Project information -----------------------------------------------------
+
+project = 'swift'
+copyright = '2022-2024, Alibaba ModelScope'
+author = 'ModelScope Authors'
+version_file = '../../swift/version.py'
+html_theme = 'sphinx_rtd_theme'
+language = 'zh_CN'
+
+
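+def get_version():
+    # Run version.py in an explicit namespace. Reading the result back through
+    # locals() after a bare exec() is fragile and may fail on Python 3.13+.
+    namespace = {}
+    with open(version_file, 'r', encoding='utf-8') as f:
+        exec(compile(f.read(), version_file, 'exec'), namespace)
+    return namespace['__version__']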
+
+
+# The full version, including alpha/beta/rc tags
+version = get_version()
+release = version
+
+# -- General configuration ---------------------------------------------------
+
+# Add any Sphinx extension module names here, as strings. They can be
+# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
+# ones.
+extensions = [
+    'sphinx.ext.napoleon',
+    'sphinx.ext.autosummary',
+    'sphinx.ext.autodoc',
+    'sphinx.ext.viewcode',
+    'sphinx_markdown_tables',
+    'sphinx_copybutton',
+    'myst_parser',
+]
+
+# build the templated autosummary files
+autosummary_generate = True
+numpydoc_show_class_members = False
+
+# Enable overriding of function signatures in the first line of the docstring.
+autodoc_docstring_signature = True
+
+# Disable docstring inheritance
+autodoc_inherit_docstrings = False
+
+# Show type hints in the description
+autodoc_typehints = 'description'
+
+# Add parameter types if the parameter is documented in the docstring
+autodoc_typehints_description_target = 'documented_params'
+
+autodoc_default_options = {
+    'member-order': 'bysource',
+}
+
+# Add any paths that contain templates here, relative to this directory.
+templates_path = ['_templates']
+
+# The suffix(es) of source filenames.
+# You can specify multiple suffix as a list of string:
+#
+source_suffix = ['.rst', '.md']
+
+# The master toctree document.
+root_doc = 'index'
+
+# List of patterns, relative to source directory, that match files and
+# directories to ignore when looking for source files.
+# This pattern also affects html_static_path and html_extra_path.
+exclude_patterns = ['build', 'source/.ipynb_checkpoints', 'source/api/generated', 'Thumbs.db', '.DS_Store']
+# A list of glob-style patterns [1] that are used to find source files.
+# They are matched against the source file names relative to the source directory,
+# using slashes as directory separators on all platforms.
+# The default is **, meaning that all files are recursively included from the source directory.
+# -- Options for HTML output -------------------------------------------------
+
+# The theme to use for HTML and HTML Help pages.  See the documentation for
+# a list of builtin themes.
+#
+# html_theme = 'sphinx_book_theme'
+# html_theme_path = [sphinx_book_theme.get_html_theme_path()]
+# html_theme_options = {}
+
+# Add any paths that contain custom static files (such as style sheets) here,
+# relative to this directory. They are copied after the builtin static files,
+# so a file named "default.css" will overwrite the builtin "default.css".
+html_static_path = ['_static']
+# html_css_files = ['css/readthedocs.css']
+
+# -- Options for HTMLHelp output ---------------------------------------------
+# Output file base name for HTML help builder.
+
+# -- Extension configuration -------------------------------------------------
+# Ignore >>> when copying code
+copybutton_prompt_text = r'>>> |\.\.\. '
+copybutton_prompt_is_regexp = True
+
+# Example configuration for intersphinx: refer to the Python standard library.
+intersphinx_mapping = {'python': ('https://docs.python.org/3', None)}
--- a/ms-swift/docs/source/cources/README.md
+++ b/ms-swift/docs/source/cources/README.md
+The courses in this folder have been transferred to [the classroom repo](https://github.com/modelscope/modelscope-classroom).
--- a/ms-swift/docs/source/index.rst
+++ b/ms-swift/docs/source/index.rst
+.. swift documentation file,
+   You can adapt this file completely to your liking, but it should at least
+   contain the root `toctree` directive.
+
+Swift DOCUMENTATION
+========================
+
+.. toctree::
+   :maxdepth: 2
+   :caption: Get Started
+
+   GetStarted/SWIFT安装.md
+   GetStarted/界面训练推理.md
+   GetStarted/使用tuners.md
+   GetStarted/ResTuning.md
+   GetStarted/SCEdit.md
+   GetStarted/在SWIFT内使用PEFT.md
+
+.. toctree::
+   :maxdepth: 2
+   :caption: LLM Training and Inference
+
+   LLM/index.md
+   LLM/LLM推理文档.md
+   LLM/LLM微调文档.md
+   LLM/人类偏好对齐训练文档.md
+   LLM/LLM评测文档.md
+   LLM/LLM量化与导出文档.md
+   LLM/OLLAMA导出文档.md
+   LLM/VLLM推理加速与部署.md
+   LLM/LmDeploy推理加速与部署.md
+   LLM/Megatron训练文档.md
+   LLM/LLM实验文档.md
+   LLM/命令行参数.md
+   LLM/支持的模型和数据集.md
+   LLM/自定义与拓展.md
+   LLM/自我认知微调最佳实践.md
+   LLM/Agent微调最佳实践.md
+   LLM/Agent部署最佳实践.md
+   LLM/Qwen1.5全流程最佳实践.md
+   LLM/NPU推理与微调最佳实践.md
+   LLM/Grok训练和推理.md
+   LLM/DPO算法最佳实践.md
+   LLM/ORPO算法最佳实践.md
+   LLM/SimPO算法最佳实践.md
+   LLM/HuggingFace生态兼容.md
+   LLM/Benchmark.md
+
+.. toctree::
+   :maxdepth: 2
+   :caption: Multi-Modal LLM Training and Inference
+
+   Multi-Modal/index.md
+   Multi-Modal/人类偏好对齐训练文档.md
+   Multi-Modal/LmDeploy推理加速文档.md
+   Multi-Modal/vLLM推理加速文档.md
+   Multi-Modal/MLLM部署文档.md
+   Multi-Modal/qwen-vl最佳实践.md
+   Multi-Modal/qwen-audio最佳实践.md
+   Multi-Modal/llava最佳实践.md
+   Multi-Modal/llava-video最佳实践.md
+   Multi-Modal/internvl最佳实践.md
+   Multi-Modal/deepseek-vl最佳实践.md
+   Multi-Modal/internlm-xcomposer2最佳实践.md
+   Multi-Modal/phi3-vision最佳实践.md
+   Multi-Modal/yi-vl最佳实践.md
+   Multi-Modal/mplug-owl2最佳实践.md
+   Multi-Modal/florence最佳实践.md
+   Multi-Modal/cogvlm最佳实践.md
+   Multi-Modal/cogvlm2最佳实践.md
+   Multi-Modal/glm4v最佳实践.md
+   Multi-Modal/cogvlm2-video最佳实践.md
+   Multi-Modal/minicpm-v最佳实践.md
+   Multi-Modal/minicpm-v-2最佳实践.md
+   Multi-Modal/minicpm-v-2.5最佳实践.md
+
+.. toctree::
+   :maxdepth: 2
+   :caption: API Doc
+
+   Hub <api/swift.hub>
+   Trainer <api/swift.trainers>
+   Tuner <api/swift.tuners>
+
+
+Indices and tables
+==================
+* :ref:`genindex`
+* :ref:`modindex`
+* :ref:`search`
--- a/ms-swift/docs/source_en/.readthedocs.yaml
+++ b/ms-swift/docs/source_en/.readthedocs.yaml
+# .readthedocs.yaml
+# Read the Docs configuration file
+# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
+
+# Required
+version: 2
+
+# Set the OS, Python version and other tools you might need
+build:
+  os: ubuntu-22.04
+  tools:
+    python: "3.12"
+
+# Build documentation in the "docs/" directory with Sphinx
+sphinx:
+  configuration: docs/source_en/conf.py
+
+# Optionally build your docs in additional formats such as PDF and ePub
+# formats:
+#    - pdf
+#    - epub
+
+# Optional but recommended, declare the Python requirements required
+# to build your documentation
+# See https://docs.readthedocs.io/en/stable/guides/reproducible-builds.html
+python:
+   install:
+      - requirements: requirements/docs.txt
+      - requirements: requirements/framework.txt
+      - requirements: requirements/llm.txt
--- a/ms-swift/docs/source_en/AIGC/AnimateDiff-train-infer.md
+++ b/ms-swift/docs/source_en/AIGC/AnimateDiff-train-infer.md
+# AnimateDiff Fine-tuning and Inference
+
+SWIFT supports full-parameter and LoRA fine-tuning of AnimateDiff, as well as inference.
+
+First, you need to clone and install SWIFT:
+
+```shell
+git clone https://github.com/modelscope/swift.git
+cd swift
+pip install ".[aigc]"
+```
+
+## Full Parameter Training
+
+### Training Effect
+
+Full parameter fine-tuning can reproduce the effect of the [officially provided model animatediff-motion-adapter-v1-5-2](https://www.modelscope.cn/models/Shanghai_AI_Laboratory/animatediff-motion-adapter-v1-5-2/summary), but it requires a large number of short videos. The official reproduction used a subset of the official dataset: [WebVid 2.5M](https://maxbain.com/webvid-dataset/). The training effect is as follows:
+
+```text
+Prompt: masterpiece, bestquality, highlydetailed, ultradetailed, girl, walking, on the street, flowers
+```
+
+![image.png](../../resources/1.gif)
+
+```text
+Prompt: masterpiece, bestquality, highlydetailed, ultradetailed, beautiful house, mountain, snow top
+```
+
+![image.png](../../resources/2.gif)
+
+Generation results from training on the 2.5M subset are still unstable. Training on the 10M dataset should give more stable results.
+
+### Running Command
+
+```shell
+# This file is in swift/examples/pytorch/animatediff/scripts/full
+# Experimental environment: A100 * 4
+# 200GB GPU memory totally
+PYTHONPATH=../../.. \
+CUDA_VISIBLE_DEVICES=0,1,2,3 \
+torchrun --nproc_per_node=4 animatediff_sft.py \
+  --model_id_or_path wyj123456/Realistic_Vision_V5.1_noVAE \
+  --csv_path /mnt/workspace/yzhao/tastelikefeet/webvid/results_2M_train.csv \
+  --video_folder /mnt/workspace/yzhao/tastelikefeet/webvid/videos2 \
+  --sft_type full \
+  --lr_scheduler_type constant \
+  --trainable_modules .*motion_modules.* \
+  --batch_size 4 \
+  --eval_steps 100 \
+  --gradient_accumulation_steps 16
+```
+
+We trained on 4 A100 GPUs, which required about 200GB of GPU memory in total and took about 40 hours. The data format is as follows:
+
+
+```text
+--csv_path # Pass in a csv file, which should contain the following format:
+name,contentUrl
+Travel blogger shoot a story on top of mountains. young man holds camera in forest.,stock-footage-travel-blogger-shoot-a-story-on-top-of-mountains-young-man-holds-camera-in-forest.mp4
+```
+
+The name field represents the prompt of the short video, and contentUrl represents the name of the video file.
+
+```text
+--video_folder Pass in a video directory containing all the video files referenced by contentUrl in the csv file.
+```
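Before starting a long training run, it can help to confirm that every `contentUrl` in the csv file resolves to a file inside the video folder. The following is a minimal sketch, assuming the csv has the `name,contentUrl` header shown above; `find_missing_videos` is a hypothetical helper written for illustration and is not part of SWIFT.

```python
import csv
from pathlib import Path


def find_missing_videos(csv_path, video_folder):
    """Return the contentUrl values whose files are absent from video_folder.

    Hypothetical helper for illustration only; not part of SWIFT.
    """
    folder = Path(video_folder)
    missing = []
    with open(csv_path, newline='', encoding='utf-8') as f:
        # DictReader takes its keys from the header row (name,contentUrl).
        for row in csv.DictReader(f):
            if not (folder / row['contentUrl']).is_file():
                missing.append(row['contentUrl'])
    return missing
```

Calling `find_missing_videos(csv_path, video_folder)` with the same values passed to `--csv_path` and `--video_folder` returns an empty list when the dataset is complete.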
+
+To run inference with the fully fine-tuned model:
+```shell
+# This file is in swift/examples/pytorch/animatediff/scripts/full
+# Experimental environment: A100
+# 18GB GPU memory
+PYTHONPATH=../../.. \
+CUDA_VISIBLE_DEVICES=0 \
+python animatediff_infer.py \
+  --model_id_or_path wyj123456/Realistic_Vision_V5.1_noVAE \
+  --sft_type full \
+  --ckpt_dir /output/path/like/checkpoints/iter-xxx \
+  --eval_human true
+```
+
+The --ckpt_dir should be the output folder from training.
+
+## LoRA Training
+
+### Running Command
+
+Full parameter training trains the entire Motion-Adapter structure from scratch. With LoRA, users can instead fine-tune an existing model on a small number of videos by running the following command:
+```shell
+# This file is in swift/examples/pytorch/animatediff/scripts/lora
+# Experimental environment: A100
+# 20GB GPU memory
+PYTHONPATH=../../.. \
+CUDA_VISIBLE_DEVICES=0 \
+python animatediff_sft.py \
+  --model_id_or_path wyj123456/Realistic_Vision_V5.1_noVAE \
+  --csv_path /mnt/workspace/yzhao/tastelikefeet/webvid/results_2M_train.csv \
+  --video_folder /mnt/workspace/yzhao/tastelikefeet/webvid/videos2 \
+  --motion_adapter_id_or_path Shanghai_AI_Laboratory/animatediff-motion-adapter-v1-5-2 \
+  --sft_type lora \
+  --lr_scheduler_type constant \
+  --trainable_modules .*motion_modules.* \
+  --batch_size 1 \
+  --eval_steps 200 \
+  --dataset_sample_size 10000 \
+  --gradient_accumulation_steps 16
+```
+
+Video data parameters are the same as above.
+
+The inference command is as follows:
+```shell
+# This file is in swift/examples/pytorch/animatediff/scripts/lora
+# Experimental environment: A100
+# 18GB GPU memory
+PYTHONPATH=../../.. \
+CUDA_VISIBLE_DEVICES=0 \
+python animatediff_infer.py \
+  --model_id_or_path wyj123456/Realistic_Vision_V5.1_noVAE \
+  --motion_adapter_id_or_path Shanghai_AI_Laboratory/animatediff-motion-adapter-v1-5-2 \
+  --sft_type lora \
+  --ckpt_dir /output/path/like/checkpoints/iter-xxx \
+  --eval_human true
+```
+
+The --ckpt_dir should be the output folder from training.
+
+## Parameter List
+
+Below are the supported parameter lists and their meanings for training and inference respectively:
+
+### Training Parameters
+```text
+motion_adapter_id_or_path: Optional[str] = None # The model ID or model path of the motion adapter. Specifying this parameter allows for continued training based on the effect of existing official models.
+motion_adapter_revision: Optional[str] = None # The model revision of the motion adapter, only useful when motion_adapter_id_or_path is the model ID.
+
+model_id_or_path: str = None # The model ID or model path of the SD base model.
+model_revision: str = None # The revision of the SD base model, only useful when model_id_or_path is the model ID.
+
+dataset_sample_size: int = None # The number of training samples in the dataset. Default represents full training.
+
+sft_type: str = field(
+    default='lora', metadata={'choices': ['lora', 'full']}) # Training method, supporting lora and full parameters.
+
+output_dir: str = 'output' # Output folder.
+ddp_backend: str = field(
+    default='nccl', metadata={'choices': ['nccl', 'gloo', 'mpi', 'ccl']}) # If using ddp training, ddp backend.
+
+seed: int = 42 # Random seed.
+
+lora_rank: int = 8 # lora parameter.
+lora_alpha: int = 32 # lora parameter.
+lora_dropout: float = 0.05 # lora parameter.
+lora_dtype: str = 'fp32' # dtype of the lora modules. If `AUTO`, it follows the dtype of the original module.
+
+gradient_checkpointing: bool = False # Whether to enable gradient checkpointing, disabled by default. Note: The current version of diffusers has a problem and does not support setting this parameter to True.
+batch_size: int = 1 # Batch size.
+num_train_epochs: int = 1 # Number of epochs.
+# if max_steps >= 0, override num_train_epochs
+learning_rate: Optional[float] = None # Learning rate.
+weight_decay: float = 0.01 # adamw parameter.
+gradient_accumulation_steps: int = 16 # Number of gradient accumulation steps.
+max_grad_norm: float = 1. # Maximum gradient norm used for clipping.
+lr_scheduler_type: str = 'cosine' # Type of lr_scheduler.
+warmup_ratio: float = 0.05 # Proportion of training steps used for warmup.
+
+eval_steps: int = 50 # eval step interval.
+save_steps: Optional[int] = None # save step interval.
+dataloader_num_workers: int = 1 # Number of dataloader workers.
+
+push_to_hub: bool = False # Whether to push to modelhub.
+# 'user_name/repo_name' or 'repo_name'
+hub_model_id: Optional[str] = None # modelhub id.
+hub_private_repo: bool = False
+push_hub_strategy: str = field( # Push strategy, push the last one or push each one.
+    default='push_best',
+    metadata={'choices': ['push_last', 'all_checkpoints']})
+# None: use env var `MODELSCOPE_API_TOKEN`
+hub_token: Optional[str] = field( # modelhub token.
+    default=None,
+    metadata={
+        'help':
+        'SDK token can be found in https://modelscope.cn/my/myaccesstoken'
+    })
+
+ignore_args_error: bool = False  # True: notebook compatibility.
+
+text_dropout_rate: float = 0.1 # Drop a certain proportion of text to ensure model robustness.
+
+validation_prompts_path: str = field( # The prompt file directory used in the evaluation process. By default, swift/aigc/configs/validation.txt is used.
+    default=None,
+    metadata={
+        'help':
+        'The validation prompts file path, use aigc/configs/validation.txt is None'
+    })
+
+trainable_modules: str = field( # Trainable modules, recommended to use the default value.
+    default='.*motion_modules.*',
+    metadata={
+        'help':
+        'The trainable modules, by default, the .*motion_modules.* will be trained'
+    })
+
+mixed_precision: bool = True # Mixed precision training.
+
+enable_xformers_memory_efficient_attention: bool = True # Use xformers.
+
+num_inference_steps: int = 25 #
+guidance_scale: float = 8.
+sample_size: int = 256
+sample_stride: int = 4 # Maximum length of training videos in seconds.
+sample_n_frames: int = 16 # Frames per second.
+
+csv_path: str = None # Input dataset.
+video_folder: str = None # Input dataset.
+
+motion_num_attention_heads: int = 8 # motion adapter parameter.
+motion_max_seq_length: int = 32 # motion adapter parameter.
+num_train_timesteps: int = 1000 # Inference pipeline parameter.
+beta_start: float = 0.00085 # Inference pipeline parameter.
+beta_end: float = 0.012 # Inference pipeline parameter.
+beta_schedule: str = 'linear' # Inference pipeline parameter.
+steps_offset: int = 1 # Inference pipeline parameter.
+clip_sample: bool = False # Inference pipeline parameter.
+
+use_wandb: bool = False # Whether to use wandb.
+```
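
As a worked example of how `batch_size` and `gradient_accumulation_steps` combine, the sketch below computes the number of samples that contribute to one optimizer step. It assumes that `batch_size` is counted per process, which is the usual convention for `torchrun` data-parallel training but is not stated in this document; `effective_batch_size` is a hypothetical helper, not a SWIFT function.

```python
def effective_batch_size(batch_size, gradient_accumulation_steps, num_processes=1):
    """Samples per optimizer step, assuming batch_size is per process."""
    return batch_size * gradient_accumulation_steps * num_processes


# Full parameter command above: --batch_size 4, --gradient_accumulation_steps 16, 4 processes.
full = effective_batch_size(4, 16, num_processes=4)
# LoRA command above: --batch_size 1, --gradient_accumulation_steps 16, 1 process.
lora = effective_batch_size(1, 16)
print(full, lora)  # prints "256 16"
```

Under that assumption, the full parameter run updates on 256 samples per step while the LoRA run updates on 16, which is worth keeping in mind when choosing `learning_rate`.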
+
+### Inference Parameters
+```text
+motion_adapter_id_or_path: Optional[str] = None # The model ID or model path of the motion adapter. Specifying this parameter allows for continued training based on the effect of existing official models.
+motion_adapter_revision: Optional[str] = None # The model revision of the motion adapter, only useful when motion_adapter_id_or_path is the model ID.
+
+model_id_or_path: str = None # The model ID or model path of the SD base model.
+model_revision: str = None # The revision of the SD base model, only useful when model_id_or_path is the model ID.
+
+sft_type: str = field(
+    default='lora', metadata={'choices': ['lora', 'full']}) # Training method, supporting lora and full parameters.
+
+ckpt_dir: Optional[str] = field(
+    default=None, metadata={'help': '/path/to/your/vx-xxx/checkpoint-xxx'}) # The output folder of training.
+eval_human: bool = False  # False: eval val_dataset # Whether to use manual input evaluation.
+
+seed: int = 42 # Random seed.
+
+merge_lora: bool = False # Merge lora into the MotionAdapter and save the model.
+replace_if_exists: bool = False # Replace the files if the output merged dir exists when `merge_lora` is True.
+
+# other
+ignore_args_error: bool = False  # True: notebook compatibility.
+
+validation_prompts_path: str = None # The file used for validation. When eval_human=False, each line is a prompt.
+
+output_path: str = './generated' # The output directory for gifs.
+
+enable_xformers_memory_efficient_attention: bool = True # Use xformers.
+
+num_inference_steps: int = 25 #
+guidance_scale: float = 8.
+sample_size: int = 256
+sample_stride: int = 4 # Maximum length of training videos in seconds.
+sample_n_frames: int = 16 # Frames per second.
+
+motion_num_attention_heads: int = 8 # motion adapter parameter.
+motion_max_seq_length: int = 32 # motion adapter parameter.
+num_train_timesteps: int = 1000 # Inference pipeline parameter.
+beta_start: float = 0.00085 # Inference pipeline parameter.
+beta_end: float = 0.012 # Inference pipeline parameter.
+beta_schedule: str = 'linear' # Inference pipeline parameter.
+steps_offset: int = 1 # Inference pipeline parameter.
+clip_sample: bool = False # Inference pipeline parameter.
+```
--- a/ms-swift/docs/source_en/GetStarted/Installation.md
+++ b/ms-swift/docs/source_en/GetStarted/Installation.md
+# Installation and Usage
+
+## Wheel Package Installation
+
+You can use pip to install:
+
+```shell
+# Full capabilities
+pip install 'ms-swift[all]' -U
+# Only use LLM
+pip install 'ms-swift[llm]' -U
+# Only use AIGC
+pip install 'ms-swift[aigc]' -U
+# Only use adapters
+pip install ms-swift -U
+```
+
+## Source Code Installation
+
+```shell
+git clone https://github.com/modelscope/swift.git
+cd swift
+pip install -e '.[all]'
+```
+
+## Notebook Environment
+
+Most of the models supported by Swift for training can be used on `A10` GPUs. Users can use the free GPU resources officially provided by ModelScope:
+
+1. Go to the official [ModelScope](https://www.modelscope.cn) website and log in
+2. Click on `My Notebook` on the left and start a free GPU instance
+3. Happily take advantage of the A10 GPU resources
+
+## Build Documentation
+
+Swift provides complete API documentation. To build it, run the following command in the swift root directory:
+
+```shell
+make docs
+```
+
+After the execution is complete, view `docs/build/html/index.html`.