Commit f7db21eb authored by lvzhen
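# MiniCPM-V Best Practices
The following uses `minicpm-v-3b-chat` as an example. To use the newer MiniCPM-V multimodal model (v2), replace `--model_type minicpm-v-3b-chat` with `--model_type minicpm-v-v2-chat`.
## Table of Contents
- [Environment Setup](#environment-setup)
- [Inference](#inference)
- [Fine-tuning](#fine-tuning)
- [Inference After Fine-tuning](#inference-after-fine-tuning)
## Environment Setup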
```shell
# Use "ms-swift>=2.2" or the main branch.
pip install 'ms-swift[llm]' -U
```
Model links:
- minicpm-v-3b-chat: [https://modelscope.cn/models/OpenBMB/MiniCPM-V/summary](https://modelscope.cn/models/OpenBMB/MiniCPM-V/summary)
- minicpm-v-v2-chat: [https://modelscope.cn/models/OpenBMB/MiniCPM-V-2/summary](https://modelscope.cn/models/OpenBMB/MiniCPM-V-2/summary)
## Inference
Run inference with `minicpm-v-3b-chat`:
```shell
# Experimental environment: A10, 3090, V100, ...
# 10GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift infer --model_type minicpm-v-3b-chat
```
Output (local paths and URLs are both accepted):
```python
"""
<<< 描述这张图片
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
该图像的特点是一只黑白相间的猫,它的眼睛睁得大大的,似乎在凝视着相机。这只猫看起来很小,可能是一只幼猫。
--------------------------------------------------
<<< clear
<<< 图中有几只羊?
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png
图中有四只羊。
--------------------------------------------------
<<< clear
<<< 计算结果是多少
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png
计算结果为1452 + 4530 = 5982。
--------------------------------------------------
<<< clear
<<< 根据图片中的内容写首诗
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png
在宁静的夜晚,一艘船在平静的湖面上航行。
--------------------------------------------------
<<< clear
<<< 对图片进行OCR
Input a media path or URL <<< https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr.png
Swift 250+ LMM35+ MLLM
"""
```
The sample images are shown below:
cat:
<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png" width="250" style="display: inline-block;">
animal:
<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png" width="250" style="display: inline-block;">
math:
<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png" width="250" style="display: inline-block;">
poem:
<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png" width="250" style="display: inline-block;">
ocr:
<img src="https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr.png" width="250" style="display: inline-block;">
**Single-sample inference**
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import (
get_model_tokenizer, get_template, inference, ModelType,
get_default_template_type, inference_stream
)
from swift.utils import seed_everything
import torch
model_type = ModelType.minicpm_v_3b_chat
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')
model, tokenizer = get_model_tokenizer(model_type, torch.bfloat16,
model_kwargs={'device_map': 'auto'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)
images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png']
query = '距离各城市多远?'
response, history = inference(model, template, query, images=images)
print(f'query: {query}')
print(f'response: {response}')
# Streaming
query = '距离最远的城市是哪?'
gen = inference_stream(model, template, query, history, images=images)
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for response, history in gen:
    # Each step yields the full response so far; print only the new suffix.
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()
print(f'history: {history}')
"""
query: 距离各城市多远?
response: 广州到深圳的距离是230公里,而深圳到广州的距离是14公里。
query: 距离最远的城市是哪?
response: 距离最远的城市是深圳,它位于广州和深圳之间,距离广州230公里,距离深圳14公里。
history: [['距离各城市多远?', ' 广州到深圳的距离是230公里,而深圳到广州的距离是14公里。'], ['距离最远的城市是哪?', '距离最远的城市是深圳,它位于广州和深圳之间,距离广州230公里,距离深圳14公里。']]
"""
```
The sample image is shown below:
road:
<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png" width="250" style="display: inline-block;">
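The streaming loop above relies on the generator yielding the full response produced so far at every step, so only the new suffix is printed. The following is a minimal sketch of that pattern. `fake_stream` and `collect_deltas` are hypothetical stand-ins (not part of swift) so that the example runs without a model.

```python
from typing import Iterator, List, Tuple

History = List[List[str]]


def fake_stream(query: str, chunks: List[str]) -> Iterator[Tuple[str, History]]:
    """Hypothetical stand-in for a streaming generator: yields the cumulative response and history."""
    response = ''
    for chunk in chunks:
        response += chunk
        yield response, [[query, response]]


def collect_deltas(gen: Iterator[Tuple[str, History]]) -> Tuple[List[str], History]:
    """Return the newly generated pieces and the final history."""
    deltas: List[str] = []
    history: History = []
    print_idx = 0
    for response, history in gen:
        # Slice off what was already emitted, exactly as the loop above does.
        deltas.append(response[print_idx:])
        print_idx = len(response)
    return deltas, history


deltas, history = collect_deltas(
    fake_stream('Which city is the farthest?', ['Guang', 'zhou', '.']))
print(''.join(deltas))
```

Tracking `print_idx` rather than comparing strings keeps each step cheap, since only the length of the previous response needs to be remembered.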
## Fine-tuning
Multimodal large models are usually fine-tuned on a **custom dataset**. Here is a demo that can be run directly:
```shell
# Experimental environment: A10, 3090, V100, ...
# 10GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type minicpm-v-3b-chat \
    --dataset coco-en-2-mini
```
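[Custom datasets](../LLM/自定义与拓展.md#-推荐命令行参数的形式) support the json and jsonl formats. Below is an example of a custom dataset:
(Multi-turn conversations are supported, but the conversation as a whole may contain only one image; local paths and URLs are both accepted)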
```jsonl
{"query": "55555", "response": "66666", "images": ["image_path"]}
{"query": "eeeee", "response": "fffff", "history": [], "images": ["image_path"]}
{"query": "EEEEE", "response": "FFFFF", "history": [["query1", "response1"], ["query2", "response2"]], "images": ["image_path"]}
```
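As an illustration of the format above, the following sketch writes records to a jsonl file and checks the single-image constraint before writing. The helper names `validate_record` and `write_jsonl` and the file name `train.jsonl` are hypothetical and not part of swift; treating the constraint as "exactly one image" is an assumption based on the note above.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def validate_record(record: Dict[str, Any]) -> None:
    """Check one sample against the format described above (hypothetical helper)."""
    for key in ('query', 'response'):
        if not isinstance(record.get(key), str):
            raise ValueError(f'missing or non-string field: {key}')
    for turn in record.get('history', []):
        if len(turn) != 2:
            raise ValueError('each history entry must be a [query, response] pair')
    # Assumption: the whole conversation carries exactly one image.
    if len(record.get('images', [])) != 1:
        raise ValueError('the whole conversation must contain exactly one image')


def write_jsonl(records: List[Dict[str, Any]], path: str) -> None:
    """Write one JSON object per line; ensure_ascii=False keeps non-ASCII text readable."""
    with open(path, 'w', encoding='utf-8') as f:
        for record in records:
            validate_record(record)
            f.write(json.dumps(record, ensure_ascii=False) + '\n')


records = [
    {'query': '55555', 'response': '66666', 'images': ['image_path']},
    {'query': 'EEEEE', 'response': 'FFFFF',
     'history': [['query1', 'response1'], ['query2', 'response2']],
     'images': ['image_path']},
]
out_path = os.path.join(tempfile.mkdtemp(), 'train.jsonl')
write_jsonl(records, out_path)
```

Validating before writing means a malformed sample is reported when the dataset is built rather than partway through a training run.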
## Inference After Fine-tuning
Direct inference:
```shell
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir output/minicpm-v-3b-chat/vx-xxx/checkpoint-xxx \
    --load_dataset_config true
```
**merge-lora** and run inference:
```shell
CUDA_VISIBLE_DEVICES=0 swift export \
--ckpt_dir output/minicpm-v-3b-chat/vx-xxx/checkpoint-xxx \
--merge_lora true
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir output/minicpm-v-3b-chat/vx-xxx/checkpoint-xxx-merged \
--load_dataset_config true
```
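# mPLUG-Owl2 Best Practices
The following uses `mplug-owl2_1-chat` as an example; you can also choose `mplug-owl2-chat`.
## Table of Contents
- [Environment Setup](#environment-setup)
- [Inference](#inference)
- [Fine-tuning](#fine-tuning)
- [Inference After Fine-tuning](#inference-after-fine-tuning)
## Environment Setup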
```shell
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'
```
Model links:
- mplug-owl2_1-chat: [https://modelscope.cn/models/iic/mPLUG-Owl2.1/summary](https://modelscope.cn/models/iic/mPLUG-Owl2.1/summary)
- mplug-owl2-chat: [https://modelscope.cn/models/iic/mPLUG-Owl2/summary](https://modelscope.cn/models/iic/mPLUG-Owl2/summary)
## Inference
Run inference with `mplug-owl2_1-chat`:
```shell
# Experimental environment: A10, 3090, V100...
# 24GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift infer --model_type mplug-owl2_1-chat
```
Output (local paths and URLs are both accepted):
```python
"""
<<< Describe this image.
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
The image features a close-up of a cute, gray and white kitten with big blue eyes. The kitten is sitting on a table, looking directly at the viewer. The scene captures the kitten's adorable features, including its whiskers and the fur on its face. The kitten appears to be staring into the camera, creating a captivating and endearing atmosphere.
--------------------------------------------------
<<< How many sheep are in the picture?
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png
There are four sheep in the picture.
--------------------------------------------------
<<< What is the calculation result?
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png
The calculation result is 1452 + 45304 = 46756.
--------------------------------------------------
<<< Write a poem based on the content of the picture.
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png
In the stillness of the night, a boat glides across the water, its light shining bright. The stars twinkle above, casting a magical glow. A man and a dog are on board, enjoying the serene journey. The boat floats gently, as if it's floating on air. The calm waters reflect the stars, creating a breathtaking scene. The man and his dog are lost in their thoughts, taking in the beauty of nature. The boat seems to be floating in a dream, as if they are on a journey to find their way back home.
--------------------------------------------------
<<< clear
<<< Perform OCR on the image.
Input a media path or URL <<< https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr_en.png
Text: Swift support training, inference and deployment of 250+ LLMs and 350+ MLMs (multimodal models). Developers can directly apply framework their own research and production environments to realize a complete workflow from model training and evaluation to application. In addition to supporting the lightweight training models provided by PEFT, we also provide a Complete Adapters library that can be adapted to various models such as NeTune, LoRaT, LLMA-PRO, etc. This adapter library can be used directly in your own custom workflow. The library is user-friendly with unfamiliar deep learning, Gradio UI for controlling training and inference, as well as accompanying learning courses and best practices for beginners. Additionally, we provide extra training and Lora LRN for AnimateDiff. Swift has rich documents for users on Huggingface and ModelScope, so please feel free to try it!
"""
```
The sample images are shown below:
cat:
<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png" width="250" style="display: inline-block;">
animal:
<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png" width="250" style="display: inline-block;">
math:
<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png" width="250" style="display: inline-block;">
poem:
<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png" width="250" style="display: inline-block;">
ocr_en:
<img src="https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr_en.png" width="250" style="display: inline-block;">
**Single-sample inference**
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import (
get_model_tokenizer, get_template, inference, ModelType,
get_default_template_type, inference_stream
)
from swift.utils import seed_everything
import torch
model_type = ModelType.mplug_owl2_1_chat
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')
model, tokenizer = get_model_tokenizer(model_type, torch.float16,
model_kwargs={'device_map': 'auto'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)
images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png']
query = 'How far is it from each city?'
response, history = inference(model, template, query, images=images)
print(f'query: {query}')
print(f'response: {response}')
# Streaming
query = 'Which city is the farthest?'
images = images * 2
gen = inference_stream(model, template, query, history, images=images)
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for response, history in gen:
    # Each step yields the full response so far; print only the new suffix.
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()
print(f'history: {history}')
"""
query: How far is it from each city?
response: From the given information, it is 14 km from the city of Mata, 62 km from Yangjiang, and 293 km from Guangzhou.
query: Which city is the farthest?
response: The farthest city is Guangzhou, which is 293 km away.
history: [['How far is it from each city?', 'From the given information, it is 14 km from the city of Mata, 62 km from Yangjiang, and 293 km from Guangzhou.'], ['Which city is the farthest?', 'The farthest city is Guangzhou, which is 293 km away.']]
"""
```
The sample image is shown below:
road:
<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png" width="250" style="display: inline-block;">
## Fine-tuning
Multimodal large models are usually fine-tuned on a **custom dataset**. Here is a demo that can be run directly:
```shell
# Experimental environment: A10, 3090, V100...
# 24GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type mplug-owl2_1-chat \
    --dataset coco-en-2-mini
```
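[Custom datasets](../LLM/自定义与拓展.md#-推荐命令行参数的形式) support the json and jsonl formats. Below is an example of a custom dataset:
(Multi-turn conversations are supported; every round must contain one image; local paths and URLs are both accepted)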
```jsonl
{"query": "55555", "response": "66666", "images": ["image_path"]}
{"query": "eeeee", "response": "fffff", "history": [], "images": ["image_path"]}
{"query": "EEEEE", "response": "FFFFF", "history": [["query1", "response1"], ["query2", "response2"]], "images": ["image_path", "image_path2", "image_path3"]}
```
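Because every round must carry one image, the number of entries in `images` should equal the number of history rounds plus the current round. The following sketch checks that relationship; `images_match_turns` is a hypothetical helper, not a swift function.

```python
from typing import Any, Dict


def images_match_turns(record: Dict[str, Any]) -> bool:
    """True if the record has one image per round (history rounds plus the current one)."""
    num_turns = len(record.get('history', [])) + 1
    return len(record.get('images', [])) == num_turns


sample = {
    'query': 'EEEEE', 'response': 'FFFFF',
    'history': [['query1', 'response1'], ['query2', 'response2']],
    'images': ['image_path', 'image_path2', 'image_path3'],
}
# Two history rounds plus the current round, and three images.
print(images_match_turns(sample))
```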
## Inference After Fine-tuning
Direct inference:
```shell
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir output/mplug-owl2_1-chat/vx-xxx/checkpoint-xxx \
    --load_dataset_config true
```
**merge-lora** and run inference:
```shell
CUDA_VISIBLE_DEVICES=0 swift export \
--ckpt_dir output/mplug-owl2_1-chat/vx-xxx/checkpoint-xxx \
--merge_lora true
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir output/mplug-owl2_1-chat/vx-xxx/checkpoint-xxx-merged \
--load_dataset_config true
```
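# Phi3-Vision Best Practices
## Table of Contents
- [Environment Setup](#environment-setup)
- [Inference](#inference)
- [Fine-tuning](#fine-tuning)
- [Inference After Fine-tuning](#inference-after-fine-tuning)
## Environment Setup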
```shell
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'
```
Model links:
- phi3-vision-128k-instruct: [https://modelscope.cn/models/LLM-Research/Phi-3-vision-128k-instruct/summary](https://modelscope.cn/models/LLM-Research/Phi-3-vision-128k-instruct/summary)
## Inference
Run inference with `phi3-vision-128k-instruct`:
```shell
# Experimental environment: A10, 3090, V100, ...
# 16GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift infer --model_type phi3-vision-128k-instruct
```
Output (local paths and URLs are both accepted):
```python
"""
<<< Who are you?
I am Phi, an AI developed by Microsoft to assist with providing information, answering questions, and helping users find solutions to their queries. How can I assist you today?
--------------------------------------------------
<<< clear
<<< <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png</img><img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png</img>What is the difference between these two pictures?
The first picture shows a group of four cartoon sheep standing in a field, while the second picture is a close-up of a kitten with a blurred background. The main difference between these two pictures is the subject matter and the setting. The first picture features animals that are typically associated with farm life and agriculture, while the second picture focuses on a domestic animal, a kitten, which is more commonly found in households. Additionally, the first picture has a more peaceful and serene atmosphere, while the second picture has a more intimate and detailed view of the kitten.
--------------------------------------------------
<<< clear
<<< <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png</img>How many sheep are there in the picture?
There are four sheep in the picture.
--------------------------------------------------
<<< clear
<<< <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png</img>What is the result of the calculation?
The result of the calculation 1452 + 45304 is 46756.
--------------------------------------------------
<<< clear
<<< <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png</img>Write a poem based on the content of the picture.
In the tranquil night, a boat sails,
Through the darkened river, it sets sail.
A single candle flickers, casting light,
Guiding the way through the endless night.
The stars above, like diamonds bright,
Gleam down upon the boat's gentle flight.
The moon, a silent guardian in the sky,
Watches over the boat as it sails by.
The river, a mirror to the night,
Reflects the boat's journey, a beautiful sight.
The trees on either side, standing tall,
Whisper secrets to the boat, one and all.
In the stillness of the night, a sense of peace,
The boat, the river, the trees, all in their place.
A moment frozen in time, a scene so serene,
A journey through the night, a dream so unseen.
--------------------------------------------------
<<< clear
<<< <img>https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr_en.png</img>Perform OCR on the image.
The image contains a text section with the heading 'Introduction'. It discusses the capabilities of SWIFT, which support training, inference, evaluation, and deployment of over 250 large language models (LLMs) and 35+ multimodal large models (MLLMs). It mentions that developers can apply this framework to their research and production environments, and that SWIFT supports lightweight training solutions provided by PEFT, as well as a complete Adapters library for various training techniques. It also highlights the availability of a Gradio web-ui for controlling training and inference, and the provision of deep learning courses and best practices for beginners. The text further states that SWIFT is expanding capabilities for other modalities, currently supporting full-parameter training and LoRA training for AnimateDiff. There are references to rich documentation and the availability of SWIFT web-ui on Huggingface space and ModelScope studio. The text is clear and fully visible in the image.
"""
```
The sample images are shown below:
cat:
<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png" width="250" style="display: inline-block;">
animal:
<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png" width="250" style="display: inline-block;">
math:
<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png" width="250" style="display: inline-block;">
poem:
<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png" width="250" style="display: inline-block;">
ocr_en:
<img src="https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr_en.png" width="250" style="display: inline-block;">
**Single-sample inference**
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import (
get_model_tokenizer, get_template, inference, ModelType,
get_default_template_type, inference_stream
)
from swift.utils import seed_everything
import torch
model_type = ModelType.phi3_vision_128k_instruct
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')
model, tokenizer = get_model_tokenizer(model_type, torch.float16,
model_kwargs={'device_map': 'auto'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)
query = """<img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png</img>How far is it from each city?"""
response, history = inference(model, template, query)
print(f'query: {query}')
print(f'response: {response}')
# Streaming
query = 'Which city is the farthest?'
gen = inference_stream(model, template, query, history)
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for response, history in gen:
    # Each step yields the full response so far; print only the new suffix.
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()
print(f'history: {history}')
"""
query: Which city is the farthest?
response: Guangzhou is the farthest city, located 293km away.
history: [['<img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png</img>How far is it from each city?', 'The distances are as follows: Mata is 14km away, Yangjiang is 62km away, and Guangzhou is 293km away.'], ['Which city is the farthest?', 'Guangzhou is the farthest city, located 293km away.']]
"""
```
The sample image is shown below:
road:
<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png" width="250" style="display: inline-block;">
## Fine-tuning
Multimodal large models are usually fine-tuned on a **custom dataset**. Here is a demo that can be run directly:
```shell
# Experimental environment: A10, 3090, V100, ...
# 16GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type phi3-vision-128k-instruct \
    --dataset coco-en-mini

# DDP Full
# Experimental environment: 2 * A100
# 2 * 50GB GPU memory
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 swift sft \
    --model_type phi3-vision-128k-instruct \
    --dataset coco-en-mini \
    --sft_type full \
    --ddp_find_unused_parameters true
```
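[Custom datasets](../LLM/自定义与拓展.md#-推荐命令行参数的形式) support the json and jsonl formats. Below is an example of a custom dataset:
(Multi-turn conversations are supported; each round may contain multiple images or none; local paths and URLs are both accepted)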
```json
[
{"conversations": [
{"from": "user", "value": "<img>img_path</img>11111"},
{"from": "assistant", "value": "22222"}
]},
{"conversations": [
{"from": "user", "value": "<img>img_path</img><img>img_path2</img><img>img_path3</img>aaaaa"},
{"from": "assistant", "value": "bbbbb"},
{"from": "user", "value": "<img>img_path</img>ccccc"},
{"from": "assistant", "value": "ddddd"}
]},
{"conversations": [
{"from": "user", "value": "AAAAA"},
{"from": "assistant", "value": "BBBBB"},
{"from": "user", "value": "CCCCC"},
{"from": "assistant", "value": "DDDDD"}
]}
]
```
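In the format above, image paths are embedded in the user turn with `<img>...</img>` tags. The following sketch shows one way to inspect such a turn when preparing or checking data. `split_images` is a hypothetical helper, and the regular expression is an assumption about the tag layout shown above, not a description of how swift parses it internally.

```python
import re
from typing import List, Tuple

# Non-greedy match so that adjacent tags are captured separately.
IMG_TAG = re.compile(r'<img>(.*?)</img>')


def split_images(value: str) -> Tuple[List[str], str]:
    """Return the image paths found in a turn and the remaining text."""
    images = IMG_TAG.findall(value)
    text = IMG_TAG.sub('', value)
    return images, text


images, text = split_images('<img>img_path</img><img>img_path2</img>aaaaa')
print(images, text)
```

A turn without tags, such as `AAAAA`, yields an empty image list and the text unchanged, which matches the text-only conversation in the example above.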
## Inference After Fine-tuning
Direct inference:
```shell
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir output/phi3-vision-128k-instruct/vx-xxx/checkpoint-xxx \
    --load_dataset_config true
```
**merge-lora** and run inference:
```shell
CUDA_VISIBLE_DEVICES=0 swift export \
--ckpt_dir output/phi3-vision-128k-instruct/vx-xxx/checkpoint-xxx \
--merge_lora true --safe_serialization false
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir output/phi3-vision-128k-instruct/vx-xxx/checkpoint-xxx-merged \
--load_dataset_config true
```
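# Qwen-Audio Best Practices
For Qwen2-Audio best practices, see: [https://github.com/modelscope/ms-swift/issues/1653](https://github.com/modelscope/ms-swift/issues/1653)
## Table of Contents
- [Environment Setup](#environment-setup)
- [Inference](#inference)
- [Fine-tuning](#fine-tuning)
- [Inference After Fine-tuning](#inference-after-fine-tuning)
## Environment Setup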
```shell
pip install 'ms-swift[llm]' -U
```
## Inference
Run inference with [qwen-audio-chat](https://modelscope.cn/models/qwen/Qwen-Audio-Chat/summary):
```shell
# Experimental environment: A10, 3090, V100...
# 21GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen-audio-chat
```
Output (local paths and URLs are both accepted):
```python
"""
<<< 你是谁?
我是来自达摩院的大规模语言模型,我叫通义千问。
--------------------------------------------------
<<< <audio>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/music.wav</audio>这是首什么样的音乐
这是一首风格是Pop的音乐。
--------------------------------------------------
<<< <audio>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/weather.wav</audio>这段语音说了什么
这段语音中说了中文:"今天天气真好呀"。
--------------------------------------------------
<<< 这段语音是男生还是女生
根据音色判断,这段语音是男性。
"""
```
**Single-sample inference**
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import (
get_model_tokenizer, get_template, inference, ModelType,
get_default_template_type, inference_stream
)
from swift.utils import seed_everything
import torch
model_type = ModelType.qwen_audio_chat
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')
model, tokenizer = get_model_tokenizer(model_type, torch.float16,
model_kwargs={'device_map': 'auto'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)
query = '<audio>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/weather.wav</audio>这段语音说了什么'
response, history = inference(model, template, query)
print(f'query: {query}')
print(f'response: {response}')
# Streaming
query = '这段语音是男生还是女生'
gen = inference_stream(model, template, query, history)
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for response, history in gen:
    # Each step yields the full response so far; print only the new suffix.
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()
print(f'history: {history}')
"""
query: <audio>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/weather.wav</audio>这段语音说了什么
response: 这段语音说了中文:"今天天气真好呀"。
query: 这段语音是男生还是女生
response: 根据音色判断,这段语音是男性。
history: [['<audio>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/weather.wav</audio>这段语音说了什么', '这段语音说了中文:"今天天气真好呀"。'], ['这段语音是男生还是女生', '根据音色判断,这段语音是男性。']]
"""
```
## Fine-tuning
Multimodal large models are usually fine-tuned on a **custom dataset**. Here is a demo that can be run directly:
LoRA fine-tuning:
```shell
# Experimental environment: A10, 3090, V100...
# 22GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type qwen-audio-chat \
    --dataset aishell1-mini-zh
```
Full-parameter fine-tuning:
```shell
# MP
# Experimental environment: 2 * A100
# 2 * 50GB GPU memory
CUDA_VISIBLE_DEVICES=0,1 swift sft \
    --model_type qwen-audio-chat \
    --dataset aishell1-mini-zh \
    --sft_type full

# ZeRO2
# Experimental environment: 4 * A100
# 4 * 80GB GPU memory
NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
    --model_type qwen-audio-chat \
    --dataset aishell1-mini-zh \
    --sft_type full \
    --use_flash_attn true \
    --deepspeed default-zero2
```
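[Custom datasets](../LLM/自定义与拓展.md#-推荐命令行参数的形式) support the json and jsonl formats. Below is an example of a custom dataset:
(Multi-turn conversations are supported; each round may contain multiple audio clips or none; local paths and URLs are both accepted)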
```json
[
{"conversations": [
{"from": "user", "value": "<audio>audio_path</audio>11111"},
{"from": "assistant", "value": "22222"}
]},
{"conversations": [
{"from": "user", "value": "<audio>audio_path</audio><audio>audio_path2</audio><audio>audio_path3</audio>aaaaa"},
{"from": "assistant", "value": "bbbbb"},
{"from": "user", "value": "<audio>audio_path</audio>ccccc"},
{"from": "assistant", "value": "ddddd"}
]},
{"conversations": [
{"from": "user", "value": "AAAAA"},
{"from": "assistant", "value": "BBBBB"},
{"from": "user", "value": "CCCCC"},
{"from": "assistant", "value": "DDDDD"}
]}
]
```
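To make the turn structure of a `conversations` sample explicit, the following sketch flattens one sample into a final query/response pair plus the preceding rounds as history. `to_query_response` is a hypothetical helper, not part of swift, and it assumes strictly alternating `user`/`assistant` turns as in the example above.

```python
from typing import Any, Dict, List


def to_query_response(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten a `conversations` sample; assumes strictly alternating user/assistant turns."""
    turns = sample['conversations']
    if len(turns) == 0 or len(turns) % 2 != 0:
        raise ValueError('expected a non-empty, even number of turns')
    pairs: List[List[str]] = []
    for user, assistant in zip(turns[0::2], turns[1::2]):
        if user['from'] != 'user' or assistant['from'] != 'assistant':
            raise ValueError('turns must alternate user/assistant')
        pairs.append([user['value'], assistant['value']])
    # The last round becomes query/response; earlier rounds become history.
    *history, (query, response) = pairs
    return {'query': query, 'response': response, 'history': history}


sample = {'conversations': [
    {'from': 'user', 'value': '<audio>audio_path</audio>aaaaa'},
    {'from': 'assistant', 'value': 'bbbbb'},
    {'from': 'user', 'value': '<audio>audio_path</audio>ccccc'},
    {'from': 'assistant', 'value': 'ddddd'},
]}
flat = to_query_response(sample)
print(flat['query'], flat['response'], flat['history'])
```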
## Inference After Fine-tuning
Direct inference:
```shell
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir output/qwen-audio-chat/vx-xxx/checkpoint-xxx \
    --load_dataset_config true
```
**merge-lora** and run inference:
```shell
CUDA_VISIBLE_DEVICES=0 swift export \
--ckpt_dir output/qwen-audio-chat/vx-xxx/checkpoint-xxx \
--merge_lora true
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir output/qwen-audio-chat/vx-xxx/checkpoint-xxx-merged \
--load_dataset_config true
```
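# Qwen-VL Best Practices
## Table of Contents
- [Environment Setup](#environment-setup)
- [Inference](#inference)
- [Fine-tuning](#fine-tuning)
- [Inference After Fine-tuning](#inference-after-fine-tuning)
## Environment Setup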
```shell
pip install 'ms-swift[llm]' -U
```
## Inference
Run inference with [qwen-vl-chat](https://modelscope.cn/models/qwen/Qwen-VL-Chat/summary):
```shell
# Experimental environment: 3090
# 24GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen-vl-chat
```
Output (local paths and URLs are both accepted):
```python
"""
<<< 你是谁?
我是通义千问,由阿里云开发的AI助手。我被设计用来回答各种问题、提供信息和与用户进行对话。有什么我可以帮助你的吗?
--------------------------------------------------
<<< <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png</img><img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png</img>这两张图片有什么区别
这两张图片的主要区别在于内容和主题。
第一张图片是一张卡通插画,画面中是一只公羊或山羊在绿色的草地上,配以群山和白云的背景,整体呈现出自然和动物的主题。
第二张图片也是一张卡通插画,画面中是一只小猫,有条纹的毛发和蓝色的眼睛,整体呈现出可爱和动物的主题。
--------------------------------------------------
<<< <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png</img>图中有几只羊
图中有一家四口的羊,一共四只。
--------------------------------------------------
<<< <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png</img>计算结果是多少
1452 + 45304 = 46756
--------------------------------------------------
<<< clear
<<< <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png</img>根据图片中的内容写首诗
月光如水洒河中,孤舟一灯独自空。
两岸青山倒影美,星河灿烂天空宏。
--------------------------------------------------
<<< clear
<<< <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr.png</img>对图片进行OCR
SWIFT支持250+ LLM和35+ MLLM(多模态大模型)的训练、推理、评测和部署。开发者可以直接将我们的框架应用到自己的Research和生产环境中,实现模型训练评测到应用的完整链路。我们除了支持PEPT提供的轻量训练方案外,也提供了一个完整的Adapters库以支持最新的训练技术,如NEFTune、LoRA+、LLaMa-PRO等,这个适配器库可以脱离训练脚本直接使用在自己的自定流程中。
"""
```
The sample images are shown below:
cat:
<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png" width="250" style="display: inline-block;">
animal:
<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png" width="250" style="display: inline-block;">
math:
<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png" width="250" style="display: inline-block;">
poem:
<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png" width="250" style="display: inline-block;">
ocr:
<img src="https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr.png" width="250" style="display: inline-block;">
**Single-sample inference**
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from swift.llm import (
get_model_tokenizer, get_template, inference, ModelType,
get_default_template_type, inference_stream
)
from swift.utils import seed_everything
import torch
model_type = ModelType.qwen_vl_chat
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')
model, tokenizer = get_model_tokenizer(model_type, torch.float16,
model_kwargs={'device_map': 'auto'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)
query = """<img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png</img>距离各城市多远?"""
response, history = inference(model, template, query)
print(f'query: {query}')
print(f'response: {response}')
# Streaming
query = '距离最远的城市是哪?'
gen = inference_stream(model, template, query, history)
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for response, history in gen:
    # Each step yields the full response so far; print only the new suffix.
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()
print(f'history: {history}')
"""
query: <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png</img>距离各城市多远?
response: 马路边距离马路边14公里;阳江边距离马路边62公里;广州边距离马路边293公里。
query: 距离最远的城市是哪?
response: 距离最远的城市是广州,距离马路边293公里。
history: [['<img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png</img>距离各城市多远?', '马路边距离马路边14公里;阳江边距离马路边62公里;广州边距离马路边293公里。'], ['距离最远的城市是哪?', '距离最远的城市是广州,距离马路边293公里。']]
"""
```
The sample image is shown below:
road:
<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png" width="250" style="display: inline-block;">
## Fine-tuning
Multimodal large models are usually fine-tuned on a **custom dataset**. Here is a demo that can be run directly:
LoRA fine-tuning:
```shell
# Experimental environment: 3090
# 23GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type qwen-vl-chat \
    --dataset coco-en-mini
```
Full-parameter fine-tuning:
```shell
# Experimental environment: 4 * A100
# 4 * 70GB GPU memory
NPROC_PER_NODE=2 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
    --model_type qwen-vl-chat \
    --dataset coco-en-mini \
    --sft_type full
```
The **Qwen-VL** model supports training on grounding tasks. Format the data as follows:
```jsonl
{"query": "Find <bbox>", "response": "<ref-object>", "images": ["/coco2014/train2014/COCO_train2014_000000001507.jpg"], "objects": "[{\"caption\": \"guy in red\", \"bbox\": [138, 136, 235, 359], \"bbox_type\": \"real\", \"image\": 0}]" }
{"query": "Find <ref-object>", "response": "<bbox>", "images": ["/coco2014/train2014/COCO_train2014_000000001507.jpg"], "objects": "[{\"caption\": \"guy in red\", \"bbox\": [138, 136, 235, 359], \"bbox_type\": \"real\", \"image\": 0}]" }
# Alternatively, use the <img></img> tag
{"query": "<img>/coco2014/train2014/COCO_train2014_000000001507.jpg</img>Find <bbox>", "response": "<ref-object>", "objects": "[{\"caption\": \"guy in red\", \"bbox\": [138, 136, 235, 359], \"bbox_type\": \"real\", \"image\": 0}]" }
{"query": "<img>/coco2014/train2014/COCO_train2014_000000001507.jpg</img>Find <ref-object>", "response": "<bbox>", "objects": "[{\"caption\": \"guy in red\", \"bbox\": [138, 136, 235, 359], \"bbox_type\": \"real\", \"image\": 0}]" }
```
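The `objects` field above holds a JSON string with four fields:
- caption: the description of the object that the bbox refers to
- bbox: the coordinates; four integers (rather than floats) are recommended, namely x_min, y_min, x_max, y_max
- bbox_type: the bbox type; three types are currently supported: real/norm_1000/norm_1, meaning actual pixel coordinates / thousandth-scale coordinates / normalized coordinates
- image: the index of the image that the bbox belongs to, starting from 0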
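Since `objects` is a JSON string rather than a nested list, it is easy to get the escaping wrong when writing records by hand. The following sketch builds one `Find <ref-object>` record with `json.dumps`. `make_grounding_record` is a hypothetical helper, not part of swift, and its validation rules are assumptions derived from the field descriptions above.

```python
import json
from typing import Any, Dict, List


def make_grounding_record(image: str, caption: str, bbox: List[int],
                          bbox_type: str = 'real') -> Dict[str, Any]:
    """Build one `Find <ref-object>` sample; `objects` is serialized to a JSON string."""
    if bbox_type not in ('real', 'norm_1000', 'norm_1'):
        raise ValueError(f'unsupported bbox_type: {bbox_type}')
    x_min, y_min, x_max, y_max = bbox
    if x_min > x_max or y_min > y_max:
        raise ValueError('bbox must be [x_min, y_min, x_max, y_max]')
    # `image` is the 0-based index into the `images` list.
    objects = [{'caption': caption, 'bbox': bbox, 'bbox_type': bbox_type, 'image': 0}]
    return {
        'query': 'Find <ref-object>',
        'response': '<bbox>',
        'images': [image],
        'objects': json.dumps(objects),
    }


record = make_grounding_record(
    '/coco2014/train2014/COCO_train2014_000000001507.jpg',
    'guy in red', [138, 136, 235, 359])
print(json.dumps(record))
```

Serializing the inner list first and then the whole record produces the escaped quotes seen in the jsonl example without manual editing.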
The format above is converted into a format that Qwen-VL recognizes, specifically:
```jsonl
{"query": "<img>/coco2014/train2014/COCO_train2014_000000001507.jpg</img>Find <ref>the man</ref>", "response": "<box>(200,200),(600,600)</box>"}
```
You can also pass this converted format directly, but make sure the coordinates are thousandth-scale coordinates.
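As a worked example of thousandth-scale coordinates, the following sketch rescales a pixel bbox to the 0-1000 range and renders it in the `<box>` layout shown above. The function names, the 640x480 image size, and the use of rounding are assumptions for illustration; swift's own conversion may differ in detail.

```python
from typing import List


def to_norm_1000(bbox: List[int], width: int, height: int) -> List[int]:
    """Scale [x_min, y_min, x_max, y_max] in pixels to the 0-1000 range (rounding is an assumption)."""
    x_min, y_min, x_max, y_max = bbox
    return [
        round(x_min * 1000 / width),
        round(y_min * 1000 / height),
        round(x_max * 1000 / width),
        round(y_max * 1000 / height),
    ]


def format_box(bbox: List[int]) -> str:
    """Render a bbox in the `<box>(x_min,y_min),(x_max,y_max)</box>` layout."""
    x_min, y_min, x_max, y_max = bbox
    return f'<box>({x_min},{y_min}),({x_max},{y_max})</box>'


# Hypothetical 640x480 image with a pixel bbox.
norm = to_norm_1000([128, 96, 384, 288], width=640, height=480)
print(format_box(norm))  # <box>(200,200),(600,600)</box>
```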
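[Custom datasets](../LLM/自定义与拓展.md#-推荐命令行参数的形式) support the json and jsonl formats. Below is an example of a custom dataset:
(Multi-turn conversations are supported; each round may contain multiple images or none; local paths and URLs are both accepted)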
```json
[
{"conversations": [
{"from": "user", "value": "<img>img_path</img>11111"},
{"from": "assistant", "value": "22222"}
]},
{"conversations": [
{"from": "user", "value": "<img>img_path</img><img>img_path2</img><img>img_path3</img>aaaaa"},
{"from": "assistant", "value": "bbbbb"},
{"from": "user", "value": "<img>img_path</img>ccccc"},
{"from": "assistant", "value": "ddddd"}
]},
{"conversations": [
{"from": "user", "value": "AAAAA"},
{"from": "assistant", "value": "BBBBB"},
{"from": "user", "value": "CCCCC"},
{"from": "assistant", "value": "DDDDD"}
]}
]
```
## Inference After Fine-tuning
Direct inference:
```shell
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir output/qwen-vl-chat/vx-xxx/checkpoint-xxx \
    --load_dataset_config true
```
**merge-lora** and run inference:
```shell
CUDA_VISIBLE_DEVICES=0 swift export \
--ckpt_dir output/qwen-vl-chat/vx-xxx/checkpoint-xxx \
--merge_lora true
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir output/qwen-vl-chat/vx-xxx/checkpoint-xxx-merged \
--load_dataset_config true
```
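# vLLM Inference Acceleration
ms-swift integrates vLLM to accelerate inference for multimodal models. For the supported models, see [Supported Models and Datasets](../LLM/支持的模型和数据集.md#多模态大模型).
## Table of Contents
- [Environment Setup](#environment-setup)
- [Inference Acceleration](#inference-acceleration)
- [Deployment](#deployment)
## Environment Setup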
```bash
# Set the global pip mirror (speeds up downloads)
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
# Install ms-swift
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'
# vllm versions are tied to CUDA versions; choose a version according to `https://docs.vllm.ai/en/latest/getting_started/installation.html`
pip install "vllm>=0.5.1"
pip install openai -U
```
## Inference Acceleration
Using Python:
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    ModelType, get_vllm_engine, get_default_template_type,
    get_template, inference_vllm
)

# 'minicpm-v-v2_5-chat', 'minicpm-v-v2_6-chat', 'internvl2-1b', 'internvl2-4b', 'phi3-vision-128k-instruct'
model_type = ModelType.llava1_6_mistral_7b_instruct
model_id_or_path = None
llm_engine = get_vllm_engine(model_type, model_id_or_path=model_id_or_path)
template_type = get_default_template_type(model_type)
template = get_template(template_type, llm_engine.hf_tokenizer)

llm_engine.generation_config.max_new_tokens = 1024

images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png']
request_list = [{'query': 'who are you'}, {'query': 'Describe this image.', 'images': images}]
resp_list = inference_vllm(llm_engine, template, request_list)
for request, resp in zip(request_list, resp_list):
    print(f"query: {request['query']}")
    print(f"response: {resp['response']}")

history1 = resp_list[1]['history']
images.append(None)
request_list = [{'query': 'Is the creature in the picture a dog?', 'history': history1, 'images': images}]
resp_list = inference_vllm(llm_engine, template, request_list)
for request, resp in zip(request_list, resp_list):
    print(f"query: {request['query']}")
    print(f"response: {resp['response']}")
    print(f"history: {resp['history']}")
"""
query: who are you
response: Hello! I am an AI language model, designed to assist users with information and provide helpful prompts and suggestions. As an artificial intelligence, I do not have personal experiences, so I don't have a personality or individuality. Instead, my purpose is to provide accurate, useful information to users like you. Is there anything specific you would like help with or any other questions you have?
query: Describe this image.
response: The image features a close-up of a kitten's face. The kitten has striking blue eyes, which are open and appear to be looking towards the camera. Its fur exhibits a mix of black and white stripes with black markings around its eyes. The fur texture is soft and dense with whiskers adorning the sides of its face, adding to its feline charm. The background is blurred with hints of green and white, which creates a bokeh effect, keeping the focus on the kitten's face. The image exudes a sense of innocence and curiosity typically associated with young felines.
query: Is the creature in the picture a dog?
response: No, the creature in the picture is a kitten, which is a young cat, not a dog. The presence of distinct feline features such as stripes, whiskers, and the appearance of blue eyes confirms this.
history: [['Describe this image.', "The image features a close-up of a kitten's face. The kitten has striking blue eyes, which are open and appear to be looking towards the camera. Its fur exhibits a mix of black and white stripes with black markings around its eyes. The fur texture is soft and dense with whiskers adorning the sides of its face, adding to its feline charm. The background is blurred with hints of green and white, which creates a bokeh effect, keeping the focus on the kitten's face. The image exudes a sense of innocence and curiosity typically associated with young felines. "], ['Is the creature in the picture a dog?', 'No, the creature in the picture is a kitten, which is a young cat, not a dog. The presence of distinct feline features such as stripes, whiskers, and the appearance of blue eyes confirms this. ']]
"""
```
Batch processing:
```python
# vllm>=0.5.4
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    get_vllm_engine, get_template, inference_vllm, ModelType,
    get_default_template_type, inference_stream_vllm
)
from swift.utils import seed_everything
import torch

model_type = ModelType.minicpm_v_v2_6_chat
model_id_or_path = None

vllm_engine = get_vllm_engine(model_type, torch.bfloat16, model_id_or_path=model_id_or_path,
                              max_model_len=8192)
tokenizer = vllm_engine.hf_tokenizer
vllm_engine.generation_config.max_new_tokens = 256
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')
template = get_template(template_type, tokenizer)
seed_everything(42)

query = '<image>描述这张图片'
images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png']
generation_info = {}
request_list = [{'query': query, 'images': images} for _ in range(100)]
resp_list = inference_vllm(vllm_engine, template, request_list, generation_info=generation_info, use_tqdm=True)
print(f'query: {query}')
print(f'response: {resp_list[0]["response"]}')
print(generation_info)

# Streaming
generation_info = {}
gen = inference_stream_vllm(vllm_engine, template, request_list, generation_info=generation_info)
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
# only show first
for resp_list in gen:
    resp = resp_list[0]
    if resp is None:
        continue
    # Each yielded response is cumulative, so print only the newly generated part.
    response = resp['response']
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()
print(generation_info)
"""
100%|██████████████████████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 91.47it/s]
100%|██████████████████████████████████████████████████████████████████████████████| 100/100 [00:22<00:00, 4.48it/s]
query: <image>描述这张图片
response: 这张图片展示了一只小猫咪的特写,可能是美国短毛猫品种,因为其花纹和毛发质地。猫咪有着引人注目的蓝色眼睛,这是其外貌中非常突出的特征。它皮毛上有着独特的黑色条纹,从面颊延伸至头顶,暗示着一种有条纹的花纹图案。它的耳朵小而尖,内侧是粉色的。猫咪的胡须细长而突出,围绕在它的下颌两侧和眼睛周围。猫咪坐着,用一种表达丰富的方式直视着,嘴巴微微张开,露出粉红色的内唇。背景模糊,柔和的光线增强了猫咪的特征。
{'num_prompt_tokens': 2700, 'num_generated_tokens': 14734, 'num_samples': 100, 'runtime': 23.53027338697575, 'samples/s': 4.249844375176322, 'tokens/s': 626.1720702384794}
query: <image>描述这张图片
response: 这张图片展示了一只小猫的特写,可能是一只幼年猫,在模糊的背景中,集中注意力在猫的表情上。这只猫长着一身白色与黑色条纹相间的毛皮,带有微妙的灰褐色。它的眼睛大而圆,具有高度的反光度,表明它们可能含有异色瞳,即一只眼睛是蓝色的,另一只是绿色的,但这只猫两只眼睛都是绿色的。睫毛清晰可见,增添了一种生动的表情。猫的耳朵竖立着,内部呈粉红色,边缘有浅色的阴影,显示出柔软的毛发。胡须又长又明显,突显了小猫的脸部形状。这个品种的猫看起来是一个常见品种,毛皮图案和眼睛颜色表明它可能是一只虎斑猫。光线柔和,产生一种天鹅绒般的效果,突出了猫绒毛的质感。
{'num_prompt_tokens': 2700, 'num_generated_tokens': 14986, 'num_samples': 100, 'runtime': 23.375922130944673, 'samples/s': 4.277906105257837, 'tokens/s': 641.0870089339394}
"""
```
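The throughput figures in `generation_info` can be reproduced from its raw counts. The short sketch below recomputes them from the first dictionary printed above; it suggests that `tokens/s` counts generated tokens only, since including the prompt tokens would give a noticeably higher number.

```python
# Values copied from the first `generation_info` printed above.
generation_info = {
    'num_prompt_tokens': 2700,
    'num_generated_tokens': 14734,
    'num_samples': 100,
    'runtime': 23.53027338697575,
}

samples_per_s = generation_info['num_samples'] / generation_info['runtime']
tokens_per_s = generation_info['num_generated_tokens'] / generation_info['runtime']
avg_generated = generation_info['num_generated_tokens'] / generation_info['num_samples']

print(f'samples/s: {samples_per_s:.2f}')
print(f'tokens/s: {tokens_per_s:.2f}')
print(f'generated tokens per sample: {avg_generated:.2f}')
```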
Using the CLI:
```shell
# Multimodal models must explicitly specify `--infer_backend vllm`
CUDA_VISIBLE_DEVICES=0 swift infer --model_type llava1_6-vicuna-7b-instruct --infer_backend vllm
# Run batch inference on a dataset
CUDA_VISIBLE_DEVICES=0 swift infer --model_type llava1_6-vicuna-7b-instruct --infer_backend vllm \
--val_dataset coco-en-2-mini#100
# TP:
CUDA_VISIBLE_DEVICES=0,1 swift infer --model_type internvl2-1b \
--infer_backend vllm --tensor_parallel_size 2
```
```python
"""
<<< How many sheep are in the picture?
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png
There are four sheep in the picture.
--------------------------------------------------
<<< Perform OCR on the image.
Input a media path or URL <<< https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr_en.png
The image contains text that appears to be an introduction or description of a software or service called SWIFT. Here is the transcribed text:
introduction
SWIFT supports training, inference, evaluation and deployment of 250+ LLMs and 35 MLMs (multimodal large models). Developers can directly apply their own research and production environments to realize the complete workflow from model training and evaluation to application. In addition, we provide a complete Adapters Library to support the latest training techniques such as PEFT, we also provide a Gradio web-ui for controlling training and inference, as well as accompanying deep learning courses and best practices for beginners.
Additionally, we are expanding capabilities for other modalities. Currently, we support full-paraphrase training and LORA training for AnimatedDiff.
SWIFT web-ui is available both on HuggingFace space and ModelScope studio.
Please feel free to try.
Please note that the text is a mix of English and what appears to be a programming or technical language, and some words or phrases might not be fully transcribed due to the complexity of the text.
--------------------------------------------------
<<< who are you
Input a media path or URL <<<
I'm a language model called Vicuna, and I was trained by researchers from Large Model Systems Organization (LMSYS).
"""
```
## 部署
**Server:**
```shell
CUDA_VISIBLE_DEVICES=0 swift deploy --model_type llava1_6-vicuna-13b-instruct --infer_backend vllm
# TP:
CUDA_VISIBLE_DEVICES=0,1 swift deploy --model_type internvl2-1b \
--infer_backend vllm --tensor_parallel_size 2
```
**Client:**

Test:
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llava1_6-vicuna-13b-instruct",
"messages": [{"role": "user", "content": "Describe this image."}],
"temperature": 0,
"images": ["http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png"]
}'
```
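For scripts that cannot shell out to curl, the same request can be built with the Python standard library. This is a minimal sketch: `build_chat_request` is a hypothetical helper, the payload simply mirrors the curl body above, and the commented-out send step assumes the server started above is reachable on `localhost:8000`.

```python
import json
import urllib.request


def build_chat_request(model, prompt, images, base_url='http://localhost:8000/v1'):
    """Build (but do not send) the same POST request as the curl command above."""
    payload = {
        'model': model,
        'messages': [{'role': 'user', 'content': prompt}],
        'temperature': 0,
        'images': images,
    }
    return urllib.request.Request(
        f'{base_url}/chat/completions',
        data=json.dumps(payload).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


request = build_chat_request(
    'llava1_6-vicuna-13b-instruct',
    'Describe this image.',
    ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png'],
)
print(request.full_url)

# To actually send it (requires the running server):
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)['choices'][0]['message']['content'])
```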
Using ms-swift:
```python
import asyncio
from swift.llm import get_model_list_client, XRequestConfig, inference_client_async

model_list = get_model_list_client()
model_type = model_list.data[0].id
print(f'model_type: {model_type}')
request_config = XRequestConfig(seed=42)

query = '<image>Describe this image.'
images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png']
tasks = [inference_client_async(model_type, query, images=images, request_config=request_config) for _ in range(100)]
async def _batch_run(tasks):
    return await asyncio.gather(*tasks)

resp_list = asyncio.run(_batch_run(tasks))
print(f'query: {query}')
print(f'response0: {resp_list[0].choices[0].message.content}')
print(f'response1: {resp_list[1].choices[0].message.content}')

query = '<image>How many sheep are in the picture?'
images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png']

async def _stream():
    global query
    request_config = XRequestConfig(seed=42, stream=True)
    stream_resp = await inference_client_async(model_type, query, images=images, request_config=request_config)
    print(f'query: {query}')
    print('response: ', end='')
    async for chunk in stream_resp:
        print(chunk.choices[0].delta.content, end='', flush=True)
    print()

asyncio.run(_stream())
"""
model_type: llava1_6-vicuna-13b-instruct
query: <image>Describe this image.
response0: The image captures a moment of tranquility featuring a kitten. The kitten, with its fur a mix of gray and white, is the main subject of the image. It's sitting on a surface that appears to be a table or a similar flat surface. The kitten's eyes, a striking shade of blue, are wide open, giving it a curious and alert expression. Its ears, also gray and white, are perked up, suggesting it's attentive to its surroundings. The background is blurred, drawing focus to the kitten, and it's a soft, muted color that doesn't distract from the main subject. The overall image gives a sense of calm and innocence.
response1: The image captures a moment of tranquility featuring a kitten. The kitten, with its fur a mix of gray and white, is the main subject of the image. It's sitting on a surface that appears to be a table or a similar flat surface. The kitten's eyes, a striking shade of blue, are wide open, giving it a curious and alert expression. Its ears, also gray and white, are perked up, suggesting it's attentive to its surroundings. The background is blurred, drawing focus to the kitten, and it's a soft, muted color that doesn't distract from the main subject. The overall image gives a sense of calm and innocence.
query: <image>How many sheep are in the picture?
response: There are four sheep in the picture.
"""
```
Using openai:
```python
from openai import OpenAI
client = OpenAI(
    api_key='EMPTY',
    base_url='http://localhost:8000/v1',
)
model_type = client.models.list().data[0].id
print(f'model_type: {model_type}')

# use base64
# import base64
# with open('cat.png', 'rb') as f:
#     img_base64 = base64.b64encode(f.read()).decode('utf-8')
# image_url = f'data:image/jpeg;base64,{img_base64}'

# use local_path
# from swift.llm import convert_to_base64
# image_url = convert_to_base64(images=['cat.png'])['images'][0]
# image_url = f'data:image/jpeg;base64,{image_url}'

# use url
image_url = 'http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png'

query = 'Describe this image.'
messages = [{
    'role': 'user',
    'content': [
        {'type': 'image_url', 'image_url': {'url': image_url}},
        {'type': 'text', 'text': query},
    ]
}]
resp = client.chat.completions.create(
    model=model_type,
    messages=messages,
    temperature=0)
response = resp.choices[0].message.content
print(f'query: {query}')
print(f'response: {response}')

# Streaming
query = 'How many sheep are in the picture?'
messages = [{
    'role': 'user',
    'content': [
        {'type': 'image_url', 'image_url': {'url': 'http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png'}},
        {'type': 'text', 'text': query},
    ]
}]
stream_resp = client.chat.completions.create(
    model=model_type,
    messages=messages,
    stream=True,
    temperature=0)
print(f'query: {query}')
print('response: ', end='')

for chunk in stream_resp:
    print(chunk.choices[0].delta.content, end='', flush=True)
print()
"""
model_type: llava1_6-vicuna-13b-instruct
query: Describe this image.
response: The image captures a moment of tranquility featuring a kitten. The kitten, with its fur a mix of gray and white, is the main subject of the image. It's sitting on a surface that appears to be a table or a similar flat surface. The kitten's eyes, a striking shade of blue, are wide open, giving it a curious and alert expression. Its ears, also gray and white, are perked up, suggesting it's attentive to its surroundings. The background is blurred, drawing focus to the kitten, and it's a soft, muted color that doesn't distract from the main subject. The overall image gives a sense of calm and innocence.
query: How many sheep are in the picture?
response: There are four sheep in the picture.
"""
```
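The commented-out base64 option above labels a `.png` file as `image/jpeg`. Whether the server inspects the MIME type is not stated here, so the sketch below simply derives it from the filename to keep the data URL self-consistent. `to_data_url` is a hypothetical helper, not an ms-swift or OpenAI API.

```python
import base64
import mimetypes


def to_data_url(image_bytes, filename):
    """Encode raw image bytes as a data URL, guessing the MIME type from the filename."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None:
        mime_type = 'application/octet-stream'  # fallback when the extension is unknown
    encoded = base64.b64encode(image_bytes).decode('utf-8')
    return f'data:{mime_type};base64,{encoded}'


# The 8-byte PNG signature stands in for real file contents;
# in practice, read them with open(path, 'rb').
image_url = to_data_url(b'\x89PNG\r\n\x1a\n', 'cat.png')
print(image_url)
```

The resulting string can be used wherever `image_url` is passed in the `messages` payload above.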
For more client usage, see the [MLLM Deployment Documentation](MLLM部署文档.md#yi-vl-6b-chat).
# Yi-VL Best Practice
## 目录
- [环境准备](#环境准备)
- [推理](#推理)
- [微调](#微调)
- [微调后推理](#微调后推理)
## 环境准备
```shell
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'
```
## 推理
Inference with [yi-vl-6b-chat](https://modelscope.cn/models/01ai/Yi-VL-6B/summary):
```shell
# Experimental environment: A10, 3090, V100...
# 18GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift infer --model_type yi-vl-6b-chat
```
Output: (both local paths and URLs are accepted)
```python
"""
<<< 描述这张图片
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
图片显示一只小猫坐在地板上,眼睛睁开,凝视着摄像机。小猫看起来很可爱,有灰色和白色的毛皮,以及蓝色的眼睛。它似乎正在看摄像机,可能对周围环境很好奇。
--------------------------------------------------
<<< 你是谁?
Input a media path or URL <<<
我是人工智能助手,随时准备帮助你解答问题或提供信息。
--------------------------------------------------
<<< 图中有几只羊
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png
图中有四只羊.
--------------------------------------------------
<<< clear
<<< 计算结果是多少
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png
1452 + 45304 = 46756
--------------------------------------------------
<<< clear
<<< 根据图片中的内容写首诗
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png
夜幕降临,星光闪烁,
一艘小船在河上飘荡,
船头挂着一盏明亮的灯,
照亮了周围的黑暗。
船上有两个人,
一个在船头,另一个在船尾,
他们似乎在谈话,
在星光下享受着宁静的时刻。
河岸边,树木在黑暗中站着,
在星光下投下长长的影子。
这景象是那么的宁静,
让人想起一个古老的传说。
小船,人,和星光,
构成了一个美丽的画面,
它唤起一种宁静的感觉,
在喧嚣的城市生活之外。
--------------------------------------------------
<<< clear
<<< 对图片进行OCR
Input a media path or URL <<< https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr.png
这是一段关于SWIFT的文字,其中包括了它的版本、功能以及一些链接。
"""
```
The example images are shown below:
cat:
<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png" width="250" style="display: inline-block;">
animal:
<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png" width="250" style="display: inline-block;">
math:
<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png" width="250" style="display: inline-block;">
poem:
<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png" width="250" style="display: inline-block;">
ocr:
<img src="https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr.png" width="250" style="display: inline-block;">
**Single-sample inference**
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType,
    get_default_template_type, inference_stream
)
from swift.utils import seed_everything
import torch

model_type = ModelType.yi_vl_6b_chat
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')

model, tokenizer = get_model_tokenizer(model_type, torch.float16,
                                       model_kwargs={'device_map': 'auto'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(2)  # ...

images = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png']
query = '距离各城市多远?'
response, history = inference(model, template, query, images=images)
print(f'query: {query}')
print(f'response: {response}')

# Streaming
query = '距离最远的城市是哪?'
images = images * 2
gen = inference_stream(model, template, query, history, images=images)
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for response, history in gen:
    # Each yielded response is cumulative, so print only the newly generated part.
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()
print(f'history: {history}')
"""
query: 距离各城市多远?
response: 距离甲塔14公里,距离阳江62公里,距离广州293公里,距离广州293公里。
query: 距离最远的城市是哪?
response: 最远的距离是293公里。
history: [['距离各城市多远?', '距离甲塔14公里,距离阳江62公里,距离广州293公里,距离广州293公里。'], ['距离最远的城市是哪?', '最远的距离是293公里。']]
"""
```
The example image is shown below:
road:
<img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png" width="250" style="display: inline-block;">
## 微调
Multimodal large models are usually fine-tuned on a **custom dataset**. Here is a demo that can be run directly:
```shell
# Experimental environment: A10, 3090, V100...
# 19GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_type yi-vl-6b-chat \
--dataset coco-en-2-mini
```
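[Custom datasets](../LLM/自定义与拓展.md#-推荐命令行参数的形式) can be provided in json or jsonl format. Below is an example of a custom dataset:
(Multi-turn conversations are supported; each turn must contain either one image or none; both local paths and URLs are accepted.)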
```jsonl
{"query": "55555", "response": "66666", "images": ["image_path"]}
{"query": "eeeee", "response": "fffff", "history": [], "images": ["image_path"]}
{"query": "EEEEE", "response": "FFFFF", "history": [["query1", "response1"], ["query2", "response2"]], "images": ["image_path", "image_path2", "image_path3"]}
```
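One way to generate such a file programmatically is sketched below. `make_record` is a hypothetical helper, and the check that there are no more images than turns is only my reading of the "one image or none per turn" note above, not a documented ms-swift rule.

```python
import json


def make_record(query, response, images, history=None):
    """Build one training record; `history` is a list of [query, response] pairs."""
    history = history or []
    num_turns = len(history) + 1
    # Assumed reading of the note above: at most one image per turn.
    if len(images) > num_turns:
        raise ValueError(f'{len(images)} images for {num_turns} turns')
    record = {'query': query, 'response': response, 'images': images}
    if history:
        record['history'] = history
    return record


records = [
    make_record('55555', '66666', ['image_path']),
    make_record('EEEEE', 'FFFFF', ['image_path', 'image_path2', 'image_path3'],
                history=[['query1', 'response1'], ['query2', 'response2']]),
]
# One JSON object per line; ensure_ascii=False keeps non-ASCII text readable.
jsonl_text = '\n'.join(json.dumps(r, ensure_ascii=False) for r in records)
print(jsonl_text)
```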
## 微调后推理
Direct inference:
```shell
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir output/yi-vl-6b-chat/vx-xxx/checkpoint-xxx \
--load_dataset_config true
```
**merge-lora** and then run inference:
```shell
CUDA_VISIBLE_DEVICES=0 swift export \
--ckpt_dir output/yi-vl-6b-chat/vx-xxx/checkpoint-xxx \
--merge_lora true
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir output/yi-vl-6b-chat/vx-xxx/checkpoint-xxx-merged \
--load_dataset_config true
```
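# Human Preference Alignment Training Documentation
This document provides training scripts for various human preference alignment algorithms. For more detailed information on the algorithms and how to choose between them, see this [document](https://github.com/modelscope/modelscope-classroom/blob/main/LLM-tutorial/M.%E4%BA%BA%E7%B1%BB%E5%81%8F%E5%A5%BD%E5%AF%B9%E9%BD%90%E8%AE%AD%E7%BB%83.md).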
## 目录
- [环境准备](#环境准备)
- [数据集](#数据集)
- [DPO](#dpo)
- [CPO](#cpo)
- [ORPO](#orpo)
- [SimPO](#simpo)
## 环境准备
```bash
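# Set the global pip mirror (speeds up downloads)
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
# Install ms-swift
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'
# Environment alignment (usually not needed; if you hit errors, run the commands below. The repository is tested against the latest environment)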
pip install -r requirements/framework.txt -U
pip install -r requirements/llm.txt -U
```
## 数据集
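Human preference alignment training for vision multimodal large models generally requires data in the form $(x,y_w,y_l)$, where $x$ is the model input, including the text prompt and the image(s), and $y_w$ and $y_l$ are, respectively, the preferred response that matches human preference and the rejected response that does not. For example: ![dpo_data](../../resources/vdpo_data.png)

**Custom dataset format**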
```jsonl
{"system": "123", "query": "11111", "response": "22222", "rejected_response": "33333", "images": ["image_path"], "history": [["query1", "response1"], ["query2", "response2"]]}
{"system": "123", "query": "aaaaa", "response": "bbbbb", "rejected_response": "ccccc", "images": ["image_path"], "history": [["query1", "response1"], ["query2", "response2"]]}
{"system": "123", "query": "AAAAA", "response": "BBBBB", "rejected_response": "CCCCC", "images": ["image_path"], "history": [["query1", "response1"], ["query2", "response2"]]}
```
`system` and `history` are optional.

The number of images supported varies by model; see the best-practice document for the specific model.
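A small validator can catch malformed preference records before training starts. The sketch below is illustrative only: `check_preference_record` is a hypothetical helper, and treating `images` as required is an inference from the statement that only `system` and `history` are optional.

```python
import json

# Assumed required fields: everything in the example except the optional
# `system` and `history`.
REQUIRED_KEYS = {'query', 'response', 'rejected_response', 'images'}


def check_preference_record(line):
    """Parse one JSONL line and verify it has the fields described above."""
    record = json.loads(line)
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f'missing keys: {sorted(missing)}')
    if record['response'] == record['rejected_response']:
        raise ValueError('preferred and rejected responses are identical')
    return record


line = ('{"query": "11111", "response": "22222", '
        '"rejected_response": "33333", "images": ["image_path"]}')
record = check_preference_record(line)
print(record['response'], record['rejected_response'])
```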
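**Training tips**:
- The training scripts below use `--lora_target_modules DEFAULT`, which trains only the model's QKV matrices. You can also set `--lora_target_modules ALL` to train all of the model's linear layers.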
## DPO
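[Paper on arXiv](https://arxiv.org/abs/2305.18290)

Hyperparameters
- `beta`: the KL regularization coefficient. The larger the value, the stronger the penalty for deviating from the reference model. Defaults to 0.1.

Before starting DPO training, it is recommended to run SFT on the preferred responses of the preference dataset, so that the data matches the distribution the DPO algorithm expects.

We also mix an SFT loss into the DPO loss to stabilize training. You can adjust the coefficient of the SFT loss with the hyperparameter `sft_beta`, which defaults to 0.1.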
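To make the roles of `beta` and `sft_beta` concrete, here is a minimal sketch of the per-pair objective. The preference term follows the DPO paper; adding `sft_beta` times the negative log-likelihood of the preferred response is an assumption about how the two losses are combined, not a transcription of ms-swift's code. All function and argument names are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected,
             beta=0.1, sft_beta=0.1, chosen_nll=0.0):
    """DPO loss for one pair of sequence log-probabilities, plus a weighted SFT term."""
    # Implicit reward margin: how much more the policy prefers the chosen answer
    # over the rejected one, relative to the reference model.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    preference_loss = -log_sigmoid(beta * margin)
    # Assumed mixing: additive SFT (NLL) term scaled by sft_beta.
    return preference_loss + sft_beta * chosen_nll


# A policy identical to the reference has zero margin, so the loss is log(2).
neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# A policy that raised the chosen log-probability gets a lower loss.
improved = dpo_loss(-8.0, -12.0, -10.0, -12.0)
print(neutral, improved)
```

A larger `beta` makes the loss more sensitive to the same margin, which is why it acts as a penalty on drifting from the reference model.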
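Training scripts: here we provide single-GPU, multi-GPU device-map, and multi-GPU DDP versions. For brevity, the later algorithms only show the single-GPU version.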
```bash
# Experimental environment: A100
CUDA_VISIBLE_DEVICES=0 \
swift rlhf \
--rlhf_type dpo \
--model_type llava1_6-mistral-7b-instruct \
--beta 0.1 \
--sft_beta 0.1 \
--sft_type lora \
--dataset rlaif-v#1000 \
--num_train_epochs 2 \
--lora_target_modules DEFAULT \
--gradient_checkpointing true \
--batch_size 1 \
--learning_rate 5e-5 \
--gradient_accumulation_steps 16 \
--warmup_ratio 0.03 \
--save_total_limit 2
# MP(device map)
CUDA_VISIBLE_DEVICES=0,1 \
swift rlhf \
--rlhf_type dpo \
--model_type llava1_6-mistral-7b-instruct \
--beta 0.1 \
--sft_beta 0.1 \
--sft_type lora \
--dataset rlaif-v#1000 \
--num_train_epochs 2 \
--lora_target_modules DEFAULT \
--gradient_checkpointing true \
--batch_size 1 \
--learning_rate 5e-5 \
--gradient_accumulation_steps 16 \
--warmup_ratio 0.03 \
--save_total_limit 2
# DDP + MP
nproc_per_node=2
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=$nproc_per_node \
MASTER_PORT=29500 \
swift rlhf \
--rlhf_type dpo \
--model_type llava1_6-mistral-7b-instruct \
--beta 0.1 \
--sft_beta 0.1 \
--sft_type lora \
--dataset rlaif-v#1000 \
--num_train_epochs 2 \
--lora_target_modules DEFAULT \
--gradient_checkpointing true \
--batch_size 1 \
--learning_rate 5e-5 \
--gradient_accumulation_steps $(expr 16 / $nproc_per_node) \
--warmup_ratio 0.03 \
--save_total_limit 2
```
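The DDP script divides the gradient accumulation steps by the number of processes so that each optimizer step still sees the same number of samples. The arithmetic can be checked with the sketch below, which assumes that `--batch_size` is the per-process batch size; `effective_batch_size` is a hypothetical helper.

```python
def effective_batch_size(batch_size, gradient_accumulation_steps, num_processes=1):
    """Samples contributing to one optimizer step across all data-parallel processes."""
    return batch_size * gradient_accumulation_steps * num_processes


single_gpu = effective_batch_size(batch_size=1, gradient_accumulation_steps=16)
# DDP script above: nproc_per_node=2, so the accumulation steps become 16 / 2 = 8.
ddp = effective_batch_size(batch_size=1, gradient_accumulation_steps=16 // 2, num_processes=2)
print(single_gpu, ddp)
```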
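For inference and deployment of the trained model, see the best-practice document of the corresponding model, the [Deployment Documentation](./MLLM部署文档.md), and the [vLLM Inference Acceleration Documentation](./vLLM推理加速文档.md).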
## CPO
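[Paper on arXiv](https://arxiv.org/abs/2401.08417)

Hyperparameters
- `beta`: the coefficient applied to the implicit reward. Defaults to 0.1.
- `cpo_alpha`: the coefficient of the NLL loss. Defaults to 1.0.

Training script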
```bash
CUDA_VISIBLE_DEVICES=0 \
swift rlhf \
--rlhf_type cpo \
--model_type llava1_6-mistral-7b-instruct \
--beta 0.1 \
--sft_type lora \
--dataset rlaif-v#1000 \
--num_train_epochs 2 \
--lora_target_modules DEFAULT \
--gradient_checkpointing true \
--batch_size 1 \
--learning_rate 5e-5 \
--gradient_accumulation_steps 16 \
--warmup_ratio 0.03 \
--save_total_limit 2
```
## ORPO
[Paper on arXiv](https://arxiv.org/abs/2403.07691)

Hyperparameters
- `lambda`: the coefficient of the odds-ratio loss

Note: ORPO passes the hyperparameter `lambda` via the `--beta` argument.
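For intuition about what `lambda` scales, the sketch below follows the objective described in the ORPO paper: a negative log-likelihood term on the preferred response plus `lambda` times an odds-ratio term. It works on length-averaged log-probabilities for a single pair; the function names are hypothetical and this is not ms-swift's implementation.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def log_odds(avg_logp):
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(chosen_avg_logp, rejected_avg_logp, lam=0.1):
    """NLL on the chosen answer plus lam times the odds-ratio loss."""
    nll = -chosen_avg_logp
    odds_ratio_loss = -log_sigmoid(log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp))
    return nll + lam * odds_ratio_loss


# Preferring the chosen answer gives a lower loss than preferring the rejected one.
good = orpo_loss(-0.5, -2.0)
bad = orpo_loss(-2.0, -0.5)
print(good, bad)
```

Unlike DPO, no reference model appears in the objective, which is why only a single coefficient is needed.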
```bash
CUDA_VISIBLE_DEVICES=0 \
swift rlhf \
--rlhf_type orpo \
--model_type llava1_6-mistral-7b-instruct \
--beta 0.1 \
--sft_type lora \
--dataset rlaif-v#1000 \
--num_train_epochs 2 \
--lora_target_modules DEFAULT \
--gradient_checkpointing true \
--batch_size 1 \
--learning_rate 5e-5 \
--gradient_accumulation_steps 16 \
--warmup_ratio 0.03 \
--save_total_limit 2
```
## SimPO
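[Paper on arXiv](https://arxiv.org/abs/2405.14734)

Hyperparameters
- `beta`: the coefficient applied to the implicit reward. Defaults to 2.0.
- `simpo_gamma`: the reward margin term. Defaults to 1.0.
- `cpo_alpha`: mixes in the CPO NLL loss to improve training stability. Defaults to 1.0; set it to 0.0 to use the original SimPO algorithm.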
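The three hyperparameters above can be read off a per-pair sketch of the objective. The preference term follows the SimPO paper (length-normalized log-probabilities with a target margin); adding `cpo_alpha` times the NLL of the preferred response is an assumed form of the mixing, and all names in the code are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def simpo_loss(chosen_logp, chosen_len, rejected_logp, rejected_len,
               beta=2.0, simpo_gamma=1.0, cpo_alpha=1.0):
    """Length-normalized preference loss with a target margin, plus a weighted NLL term."""
    chosen_avg = chosen_logp / chosen_len
    rejected_avg = rejected_logp / rejected_len
    preference_loss = -log_sigmoid(beta * (chosen_avg - rejected_avg) - simpo_gamma)
    # Assumed mixing: additive NLL on the chosen response scaled by cpo_alpha.
    nll = -chosen_avg
    return preference_loss + cpo_alpha * nll


# With cpo_alpha=0.0 only the SimPO preference term remains.
original = simpo_loss(-10.0, 10, -10.0, 10, cpo_alpha=0.0)
mixed = simpo_loss(-10.0, 10, -10.0, 10, cpo_alpha=1.0)
print(original, mixed)
```

Because the log-probabilities are divided by the sequence lengths, a long rejected answer is not penalized merely for being long.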
```bash
CUDA_VISIBLE_DEVICES=0 \
swift rlhf \
--rlhf_type simpo \
--model_type llava1_6-mistral-7b-instruct \
--beta 2.0 \
--simpo_gamma 1.0 \
--sft_type lora \
--dataset rlaif-v#1000 \
--num_train_epochs 2 \
--lora_target_modules DEFAULT \
--gradient_checkpointing true \
--batch_size 1 \
--learning_rate 5e-5 \
--gradient_accumulation_steps 16 \
--warmup_ratio 0.03 \
--save_total_limit 2
```
.. currentmodule:: {{ module }}


{{ name | underline}}

.. autoclass:: {{ name }}
    :inherited-members:
    :members:

.. autogenerated from source/_templates/autosummary/class.rst

.. currentmodule:: {{ module }}


{{ name | underline}}

.. autoclass:: {{ name }}
    :members:
    :special-members: __init__, __call__

..
  autogenerated from source/_templates/classtemplate.rst
  note it does not have :inherited-members:

.. currentmodule:: {{ module }}


{{ name | underline}}

.. autoclass:: {{ name }}
    :members:
    :exclude-members: MAXBIT, MAXDIM
    :undoc-members:

..
  autogenerated from source/_templates/sobolengine.rst
  note it has specific options
swift.hub
==============
.. automodule:: swift.hub

.. currentmodule:: swift.hub

.. autosummary::
    :toctree: generated
    :nosignatures:
    :template: classtemplate.rst

    api.HubApi
    check_model.check_local_model_is_latest
    push_to_hub.push_to_hub
    push_to_hub.push_to_hub_async
    snapshot_download.snapshot_download
    file_download.model_file_download

swift.trainers
==============
.. automodule:: swift.trainers

.. currentmodule:: swift.trainers

.. autosummary::
    :toctree: generated
    :nosignatures:
    :template: classtemplate.rst

    trainers.Seq2SeqTrainer
    trainers.Trainer

swift.tuners
==============
.. automodule:: swift.tuners

.. currentmodule:: swift.tuners

.. autosummary::
    :toctree: generated
    :nosignatures:
    :template: classtemplate.rst

    adapter.AdapterConfig
    base.SwiftModel
    base.Swift
    lora.LoRAConfig
    prompt.PromptConfig
    restuning.ResTuningConfig
    side.SideConfig
    utils.SwiftConfig
    utils.SwiftOutput
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys
# import sphinx_book_theme
sys.path.insert(0, os.path.abspath('../../'))
# -- Project information -----------------------------------------------------
project = 'swift'
copyright = '2022-2024, Alibaba ModelScope'
author = 'ModelScope Authors'
version_file = '../../swift/version.py'
html_theme = 'sphinx_rtd_theme'
language = 'zh_CN'
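def get_version():
    # Execute version.py in an explicit namespace; relying on exec() writing
    # into locals() is fragile across Python versions.
    namespace = {}
    with open(version_file, 'r', encoding='utf-8') as f:
        exec(compile(f.read(), version_file, 'exec'), namespace)
    return namespace['__version__']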
# The full version, including alpha/beta/rc tags
version = get_version()
release = version
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
    'sphinx.ext.napoleon',
    'sphinx.ext.autosummary',
    'sphinx.ext.autodoc',
    'sphinx.ext.viewcode',
    'sphinx_markdown_tables',
    'sphinx_copybutton',
    'myst_parser',
]
# build the templated autosummary files
autosummary_generate = True
numpydoc_show_class_members = False
# Enable overriding of function signatures in the first line of the docstring.
autodoc_docstring_signature = True
# Disable docstring inheritance
autodoc_inherit_docstrings = False
# Show type hints in the description
autodoc_typehints = 'description'
# Add parameter types if the parameter is documented in the docstring
autodoc_typehints_description_target = 'documented_params'
autodoc_default_options = {
    'member-order': 'bysource',
}
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
#
source_suffix = ['.rst', '.md']
# The master toctree document.
root_doc = 'index'
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ['build', 'source/.ipynb_checkpoints', 'source/api/generated', 'Thumbs.db', '.DS_Store']
# A list of glob-style patterns [1] that are used to find source files.
# They are matched against the source file names relative to the source directory,
# using slashes as directory separators on all platforms.
# The default is **, meaning that all files are recursively included from the source directory.
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
# html_theme = 'sphinx_book_theme'
# html_theme_path = [sphinx_book_theme.get_html_theme_path()]
# html_theme_options = {}
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
# html_css_files = ['css/readthedocs.css']
# -- Options for HTMLHelp output ---------------------------------------------
# Output file base name for HTML help builder.
# -- Extension configuration -------------------------------------------------
# Ignore >>> when copying code
copybutton_prompt_text = r'>>> |\.\.\. '
copybutton_prompt_is_regexp = True
# Example configuration for intersphinx: refer to the Python standard library.
intersphinx_mapping = {'python': ('https://docs.python.org/3', None)}
The courses in this folder have been transferred to [the classroom repo](https://github.com/modelscope/modelscope-classroom).
.. swift documentation file,
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Swift DOCUMENTATION
========================

.. toctree::
   :maxdepth: 2
   :caption: Get Started

   GetStarted/SWIFT安装.md
   GetStarted/界面训练推理.md
   GetStarted/使用tuners.md
   GetStarted/ResTuning.md
   GetStarted/SCEdit.md
   GetStarted/在SWIFT内使用PEFT.md

.. toctree::
   :maxdepth: 2
   :caption: LLM Training and Inference

   LLM/index.md
   LLM/LLM推理文档.md
   LLM/LLM微调文档.md
   LLM/人类偏好对齐训练文档.md
   LLM/LLM评测文档.md
   LLM/LLM量化与导出文档.md
   LLM/OLLAMA导出文档.md
   LLM/VLLM推理加速与部署.md
   LLM/LmDeploy推理加速与部署.md
   LLM/Megatron训练文档.md
   LLM/LLM实验文档.md
   LLM/命令行参数.md
   LLM/支持的模型和数据集.md
   LLM/自定义与拓展.md
   LLM/自我认知微调最佳实践.md
   LLM/Agent微调最佳实践.md
   LLM/Agent部署最佳实践.md
   LLM/Qwen1.5全流程最佳实践.md
   LLM/NPU推理与微调最佳实践.md
   LLM/Grok训练和推理.md
   LLM/DPO算法最佳实践.md
   LLM/ORPO算法最佳实践.md
   LLM/SimPO算法最佳实践.md
   LLM/HuggingFace生态兼容.md
   LLM/Benchmark.md

.. toctree::
   :maxdepth: 2
   :caption: Multi-Modal LLM Training and Inference

   Multi-Modal/index.md
   Multi-Modal/人类偏好对齐训练文档.md
   Multi-Modal/LmDeploy推理加速文档.md
   Multi-Modal/vLLM推理加速文档.md
   Multi-Modal/MLLM部署文档.md
   Multi-Modal/qwen-vl最佳实践.md
   Multi-Modal/qwen-audio最佳实践.md
   Multi-Modal/llava最佳实践.md
   Multi-Modal/llava-video最佳实践.md
   Multi-Modal/internvl最佳实践.md
   Multi-Modal/deepseek-vl最佳实践.md
   Multi-Modal/internlm-xcomposer2最佳实践.md
   Multi-Modal/phi3-vision最佳实践.md
   Multi-Modal/yi-vl最佳实践.md
   Multi-Modal/mplug-owl2最佳实践.md
   Multi-Modal/florence最佳实践.md
   Multi-Modal/cogvlm最佳实践.md
   Multi-Modal/cogvlm2最佳实践.md
   Multi-Modal/glm4v最佳实践.md
   Multi-Modal/cogvlm2-video最佳实践.md
   Multi-Modal/minicpm-v最佳实践.md
   Multi-Modal/minicpm-v-2最佳实践.md
   Multi-Modal/minicpm-v-2.5最佳实践.md

.. toctree::
   :maxdepth: 2
   :caption: API Doc

   Hub <api/swift.hub>
   Trainer <api/swift.trainers>
   Tuner <api/swift.tuners>

Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
# .readthedocs.yaml
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details

# Required
version: 2

# Set the OS, Python version and other tools you might need
build:
  os: ubuntu-22.04
  tools:
    python: "3.12"

# Build documentation in the "docs/" directory with Sphinx
sphinx:
  configuration: docs/source_en/conf.py

# Optionally build your docs in additional formats such as PDF and ePub
# formats:
#   - pdf
#   - epub

# Optional but recommended, declare the Python requirements required
# to build your documentation
# See https://docs.readthedocs.io/en/stable/guides/reproducible-builds.html
python:
  install:
    - requirements: requirements/docs.txt
    - requirements: requirements/framework.txt
    - requirements: requirements/llm.txt
# AnimateDiff Fine-tuning and Inference
SWIFT supports full-parameter and LoRA fine-tuning of AnimateDiff, as well as inference.
First, you need to clone and install SWIFT:
```shell
git clone https://github.com/modelscope/swift.git
cd swift
pip install ".[aigc]"
```
## Full Parameter Training
### Training Effect
Full-parameter fine-tuning can reproduce the results of the [officially provided model animatediff-motion-adapter-v1-5-2](https://www.modelscope.cn/models/Shanghai_AI_Laboratory/animatediff-motion-adapter-v1-5-2/summary), but it requires a large number of short videos. The official reproduction used a subset of the official dataset: [WebVid 2.5M](https://maxbain.com/webvid-dataset/). The training results are as follows:
```text
Prompt: masterpiece, bestquality, highlydetailed, ultradetailed, girl, walking, on the street, flowers
```
![image.png](../../resources/1.gif)
```text
Prompt: masterpiece, bestquality, highlydetailed, ultradetailed, beautiful house, mountain, snow top
```
![image.png](../../resources/2.gif)
Generation results from training on the 2.5M subset are still somewhat unstable. Training on the 10M dataset should give more stable results.
### Running Command
```shell
# This file is in swift/examples/pytorch/animatediff/scripts/full
# Experimental environment: A100 * 4
# 200GB GPU memory totally
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
torchrun --nproc_per_node=4 animatediff_sft.py \
--model_id_or_path wyj123456/Realistic_Vision_V5.1_noVAE \
--csv_path /mnt/workspace/yzhao/tastelikefeet/webvid/results_2M_train.csv \
--video_folder /mnt/workspace/yzhao/tastelikefeet/webvid/videos2 \
--sft_type full \
--lr_scheduler_type constant \
--trainable_modules .*motion_modules.* \
--batch_size 4 \
--eval_steps 100 \
--gradient_accumulation_steps 16
```
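For reference, the number of samples that contribute to each optimizer step in the command above can be estimated as follows. This is a minimal sketch that assumes `--batch_size` is a per-GPU value; the helper name is illustrative and not part of SWIFT.

```python
def effective_batch_size(num_gpus: int, batch_size: int, grad_accum_steps: int) -> int:
    """Samples consumed per optimizer step, assuming batch_size is per GPU."""
    return num_gpus * batch_size * grad_accum_steps


# Values taken from the full-parameter command above:
# --nproc_per_node=4, --batch_size 4, --gradient_accumulation_steps 16
print(effective_batch_size(num_gpus=4, batch_size=4, grad_accum_steps=16))  # 256
```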
We trained on 4 A100 GPUs, which required about 200GB of GPU memory in total and took about 40 hours. The data format is as follows:
```text
--csv_path # Pass in a csv file, which should contain the following format:
name,contentUrl
Travel blogger shoot a story on top of mountains. young man holds camera in forest.,stock-footage-travel-blogger-shoot-a-story-on-top-of-mountains-young-man-holds-camera-in-forest.mp4
```
The name field represents the prompt of the short video, and contentUrl represents the name of the video file.
```text
--video_folder # Pass in a video directory containing all the video files referenced by contentUrl in the csv file.
```
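Before starting a long training run, it can help to check that every `contentUrl` in the csv file actually exists in the video folder. The following is a minimal sketch of such a check; the function name `find_missing_videos` is hypothetical and not part of SWIFT, and it assumes the csv file has the `name,contentUrl` header row shown above.

```python
import csv
from pathlib import Path
from typing import List


def find_missing_videos(csv_path: str, video_folder: str) -> List[str]:
    """Return the contentUrl values whose file is absent from video_folder."""
    folder = Path(video_folder)
    missing = []
    with open(csv_path, newline='', encoding='utf-8') as f:
        # DictReader uses the header row (name,contentUrl) as the field names.
        for row in csv.DictReader(f):
            if not (folder / row['contentUrl']).is_file():
                missing.append(row['contentUrl'])
    return missing


# Example usage (paths are placeholders):
# missing = find_missing_videos('results_2M_train.csv', 'videos2')
# print(f'{len(missing)} video files are missing')
```

Prompts that contain commas need to be quoted in the csv file so that they are read as a single `name` field.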
To run inference with the full-parameter fine-tuned model:
```shell
# This file is in swift/examples/pytorch/animatediff/scripts/full
# Experimental environment: A100
# 18GB GPU memory
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0 \
python animatediff_infer.py \
--model_id_or_path wyj123456/Realistic_Vision_V5.1_noVAE \
--sft_type full \
--ckpt_dir /output/path/like/checkpoints/iter-xxx \
--eval_human true
```
The --ckpt_dir should be the output folder from training.
## LoRA Training
### Running Command
Full-parameter training trains the entire Motion-Adapter structure from scratch. To instead fine-tune an existing model on a small number of videos, run the following command:
```shell
# This file is in swift/examples/pytorch/animatediff/scripts/lora
# Experimental environment: A100
# 20GB GPU memory
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0 \
python animatediff_sft.py \
--model_id_or_path wyj123456/Realistic_Vision_V5.1_noVAE \
--csv_path /mnt/workspace/yzhao/tastelikefeet/webvid/results_2M_train.csv \
--video_folder /mnt/workspace/yzhao/tastelikefeet/webvid/videos2 \
--motion_adapter_id_or_path Shanghai_AI_Laboratory/animatediff-motion-adapter-v1-5-2 \
--sft_type lora \
--lr_scheduler_type constant \
--trainable_modules .*motion_modules.* \
--batch_size 1 \
--eval_steps 200 \
--dataset_sample_size 10000 \
--gradient_accumulation_steps 16
```
Video data parameters are the same as above.
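The `--trainable_modules` value used in the commands above is a regular expression over module names. The sketch below illustrates how such a pattern selects modules. The module names are hypothetical, and the use of `re.fullmatch` is an assumption about the matching rule rather than SWIFT's confirmed behavior.

```python
import re
from typing import Iterable, List


def select_trainable(module_names: Iterable[str], pattern: str) -> List[str]:
    """Return the module names that match the regular expression in full."""
    regex = re.compile(pattern)
    return [name for name in module_names if regex.fullmatch(name)]


# Hypothetical module names, for illustration only.
names = [
    'down_blocks.0.motion_modules.0.attn',
    'down_blocks.0.attentions.0.to_q',
    'up_blocks.1.motion_modules.2.proj_out',
]
print(select_trainable(names, r'.*motion_modules.*'))
# ['down_blocks.0.motion_modules.0.attn', 'up_blocks.1.motion_modules.2.proj_out']
```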
The inference command is as follows:
```shell
# This file is in swift/examples/pytorch/animatediff/scripts/lora
# Experimental environment: A100
# 18GB GPU memory
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0 \
python animatediff_infer.py \
--model_id_or_path wyj123456/Realistic_Vision_V5.1_noVAE \
--motion_adapter_id_or_path Shanghai_AI_Laboratory/animatediff-motion-adapter-v1-5-2 \
--sft_type lora \
--ckpt_dir /output/path/like/checkpoints/iter-xxx \
--eval_human true
```
The --ckpt_dir should be the output folder from training.
## Parameter List
The parameters supported for training and inference, and their meanings, are listed below:
### Training Parameters
```text
motion_adapter_id_or_path: Optional[str] = None # The model ID or model path of the motion adapter. Specifying this parameter allows for continued training based on the effect of existing official models.
motion_adapter_revision: Optional[str] = None # The model revision of the motion adapter, only useful when motion_adapter_id_or_path is the model ID.
model_id_or_path: str = None # The model ID or model path of the SD base model.
model_revision: str = None # The revision of the SD base model, only useful when model_id_or_path is the model ID.
dataset_sample_size: int = None # The number of training samples in the dataset. Default represents full training.
sft_type: str = field(
default='lora', metadata={'choices': ['lora', 'full']}) # Training method, supporting lora and full parameters.
output_dir: str = 'output' # Output folder.
ddp_backend: str = field(
default='nccl', metadata={'choices': ['nccl', 'gloo', 'mpi', 'ccl']}) # If using ddp training, ddp backend.
seed: int = 42 # Random seed.
lora_rank: int = 8 # lora parameter.
lora_alpha: int = 32 # lora parameter.
lora_dropout: float = 0.05 # lora parameter.
lora_dtype: str = 'fp32' # lora module dtype type. If `AUTO`, it follows the dtype setting of the original module.
gradient_checkpointing: bool = False # Whether to enable gc, disabled by default. Note: The current version of diffusers has a problem and does not support this parameter being True.
batch_size: int = 1 # batchsize.
num_train_epochs: int = 1 # Number of epochs.
# if max_steps >= 0, override num_train_epochs
learning_rate: Optional[float] = None # Learning rate.
weight_decay: float = 0.01 # adamw parameter.
gradient_accumulation_steps: int = 16 # ga size.
max_grad_norm: float = 1. # grad norm size.
lr_scheduler_type: str = 'cosine' # Type of lr_scheduler.
warmup_ratio: float = 0.05 # Whether to warmup and the proportion of warmup.
eval_steps: int = 50 # eval step interval.
save_steps: Optional[int] = None # save step interval.
dataloader_num_workers: int = 1 # Number of dataloader workers.
push_to_hub: bool = False # Whether to push to modelhub.
# 'user_name/repo_name' or 'repo_name'
hub_model_id: Optional[str] = None # modelhub id.
hub_private_repo: bool = False
push_hub_strategy: str = field( # Push strategy, push the last one or push each one.
default='push_best',
metadata={'choices': ['push_last', 'all_checkpoints']})
# None: use env var `MODELSCOPE_API_TOKEN`
hub_token: Optional[str] = field( # modelhub token.
default=None,
metadata={
'help':
'SDK token can be found in https://modelscope.cn/my/myaccesstoken'
})
ignore_args_error: bool = False # True: notebook compatibility.
text_dropout_rate: float = 0.1 # Drop a certain proportion of text to ensure model robustness.
validation_prompts_path: str = field( # The prompt file directory used in the evaluation process. By default, swift/aigc/configs/validation.txt is used.
default=None,
metadata={
'help':
'The validation prompts file path, use aigc/configs/validation.txt is None'
})
trainable_modules: str = field( # Trainable modules, recommended to use the default value.
default='.*motion_modules.*',
metadata={
'help':
'The trainable modules, by default, the .*motion_modules.* will be trained'
})
mixed_precision: bool = True # Mixed precision training.
enable_xformers_memory_efficient_attention: bool = True # Use xformers.
num_inference_steps: int = 25 #
guidance_scale: float = 8.
sample_size: int = 256
sample_stride: int = 4 # Stride (in frames) between consecutive sampled frames of a training video.
sample_n_frames: int = 16 # Number of frames sampled from each video clip.
csv_path: str = None # Input dataset.
video_folder: str = None # Input dataset.
motion_num_attention_heads: int = 8 # motion adapter parameter.
motion_max_seq_length: int = 32 # motion adapter parameter.
num_train_timesteps: int = 1000 # Inference pipeline parameter.
beta_start: int = 0.00085 # Inference pipeline parameter.
beta_end: int = 0.012 # Inference pipeline parameter.
beta_schedule: str = 'linear' # Inference pipeline parameter.
steps_offset: int = 1 # Inference pipeline parameter.
clip_sample: bool = False # Inference pipeline parameter.
use_wandb: bool = False # Whether to use wandb.
```
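As a rough illustration of how `sample_stride` and `sample_n_frames` usually interact in AnimateDiff-style data loading, the sketch below picks frame indices for one clip. It assumes SWIFT follows the convention of the original AnimateDiff training code, which this document does not confirm; the function name is hypothetical.

```python
from typing import List


def sample_frame_indices(video_length: int, start: int,
                         sample_stride: int = 4,
                         sample_n_frames: int = 16) -> List[int]:
    """Pick sample_n_frames indices spaced sample_stride frames apart."""
    # Number of source frames spanned by one clip, first to last inclusive.
    clip_length = (sample_n_frames - 1) * sample_stride + 1
    if start < 0 or start + clip_length > video_length:
        raise ValueError('clip does not fit inside the video')
    return [start + i * sample_stride for i in range(sample_n_frames)]


indices = sample_frame_indices(video_length=100, start=0)
print(indices[:4], indices[-1])  # [0, 4, 8, 12] 60
```

With the default values, one training clip of 16 frames spans 61 frames of the source video.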
### Inference Parameters
```text
motion_adapter_id_or_path: Optional[str] = None # The model ID or model path of the motion adapter. Specifying this parameter allows for continued training based on the effect of existing official models.
motion_adapter_revision: Optional[str] = None # The model revision of the motion adapter, only useful when motion_adapter_id_or_path is the model ID.
model_id_or_path: str = None # The model ID or model path of the SD base model.
model_revision: str = None # The revision of the SD base model, only useful when model_id_or_path is the model ID.
sft_type: str = field(
default='lora', metadata={'choices': ['lora', 'full']}) # Training method, supporting lora and full parameters.
ckpt_dir: Optional[str] = field(
default=None, metadata={'help': '/path/to/your/vx-xxx/checkpoint-xxx'}) # The output folder of training.
eval_human: bool = False # False: eval val_dataset # Whether to use manual input evaluation.
seed: int = 42 # Random seed.
merge_lora: bool = False # Merge lora into the MotionAdapter and save the model.
replace_if_exists: bool = False # Replace the files if the output merged dir exists when `merge_lora` is True.
# other
ignore_args_error: bool = False # True: notebook compatibility.
validation_prompts_path: str = None # The file used for validation. When eval_human=False, each line is a prompt.
output_path: str = './generated' # The output directory for gifs.
enable_xformers_memory_efficient_attention: bool = True # Use xformers.
num_inference_steps: int = 25 #
guidance_scale: float = 8.
sample_size: int = 256
sample_stride: int = 4 # Stride (in frames) between consecutive sampled frames of a video.
sample_n_frames: int = 16 # Number of frames in each generated or sampled clip.
motion_num_attention_heads: int = 8 # motion adapter parameter.
motion_max_seq_length: int = 32 # motion adapter parameter.
num_train_timesteps: int = 1000 # Inference pipeline parameter.
beta_start: int = 0.00085 # Inference pipeline parameter.
beta_end: int = 0.012 # Inference pipeline parameter.
beta_schedule: str = 'linear' # Inference pipeline parameter.
steps_offset: int = 1 # Inference pipeline parameter.
clip_sample: bool = False # Inference pipeline parameter.
```
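The `beta_start`, `beta_end`, `beta_schedule`, and `num_train_timesteps` parameters describe the noise schedule of the inference pipeline. The sketch below shows what a `'linear'` schedule with the default values looks like, assuming the usual definition of evenly spaced betas between the two endpoints. It is written in plain Python for illustration and is not SWIFT's own implementation.

```python
from typing import List


def linear_betas(beta_start: float = 0.00085, beta_end: float = 0.012,
                 num_train_timesteps: int = 1000) -> List[float]:
    """Evenly spaced betas from beta_start to beta_end, both inclusive."""
    step = (beta_end - beta_start) / (num_train_timesteps - 1)
    return [beta_start + i * step for i in range(num_train_timesteps)]


betas = linear_betas()
print(len(betas), betas[0])  # 1000 0.00085
```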
# Installation and Usage
## Wheel Package Installation
You can use pip to install:
```shell
# Full capabilities
pip install 'ms-swift[all]' -U
# Only use LLM
pip install 'ms-swift[llm]' -U
# Only use AIGC
pip install 'ms-swift[aigc]' -U
# Only use adapters
pip install ms-swift -U
```
## Source Code Installation
```shell
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[all]'
```
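After either installation method, one way to confirm that the package is visible to the current Python environment is to query its installed version. This is a minimal sketch using the standard library; the helper name `installed_version` is illustrative.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = 'ms-swift') -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


print(installed_version() or 'ms-swift is not installed')
```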
## Notebook Environment
Most of the models that Swift supports can be trained on `A10` GPUs. Users can use the free GPU resources officially provided by ModelScope:
1. Go to the official [ModelScope](https://www.modelscope.cn) website and log in
2. Click on `My Notebook` on the left and start a free GPU instance
3. Happily take advantage of the A10 GPU resources
## Build Documentation
Swift provides complete API documentation. To build it, run the following command in the swift root directory:
```shell
make docs
```
After the execution is complete, view `docs/build/html/index.html`.