response: This speech said in Chinese: "The weather is really nice today".
query: Is this speech male or female
response: Based on the timbre, this speech is male.
history: [['Audio 1:<audio>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/weather.wav</audio>\nWhat did this speech say',
'This speech said in Chinese: "The weather is really nice today".'], ['Is this speech male or female', 'Based on the timbre, this speech is male.']]
"""
```
## Fine-tuning
Fine-tuning multimodal large models usually uses **custom datasets**. Here is a demo that can be run directly:
LoRA fine-tuning:
(By default, only the qkv projections of the LLM part are LoRA fine-tuned. If you want to fine-tune all linear layers, including the audio model part, you can specify `--lora_target_modules ALL`.)
```shell
# Experimental environment: A10, 3090, V100...
# 22GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_type qwen-audio-chat \
--dataset aishell1-mini-zh
```
Full-parameter fine-tuning:
```shell
# MP
# Experimental environment: 2 * A100
# 2 * 50GB GPU memory
CUDA_VISIBLE_DEVICES=0,1 swift sft \
--model_type qwen-audio-chat \
--dataset aishell1-mini-zh \
--sft_type full

# ZeRO2
# Experimental environment: 4 * A100
# 2 * 80GB GPU memory
NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
--model_type qwen-audio-chat \
--dataset aishell1-mini-zh \
--sft_type full \
--use_flash_attn true \
--deepspeed default-zero2
```
[Custom datasets](../LLM/Customization.md#-Recommended-Command-line-arguments) support json and jsonl formats. The following is an example of a custom dataset:
(Multi-turn conversations are supported; each turn may contain multiple audio segments or none, and audio can be passed as a local path or a URL.)
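A minimal `jsonl` sketch of what such a dataset might look like, assuming a `query`/`response`/`history` schema with the same embedded `Audio n:<audio>...</audio>` tags as the inference output above (field names and file paths here are illustrative; the linked Customization doc has the authoritative format):
```jsonl
{"query": "Audio 1:<audio>audio1.wav</audio>\nWhat did this speech say", "response": "The weather is really nice today."}
{"query": "Audio 1:<audio>audio1.wav</audio>\nAudio 2:<audio>audio2.wav</audio>\nAre these two speakers the same person", "response": "No, they are different speakers.", "history": []}
{"query": "Is this speech male or female", "response": "Based on the timbre, this speech is male.", "history": [["Audio 1:<audio>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/weather.wav</audio>\nWhat did this speech say", "The weather is really nice today."]]}
```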
- [Inference after Fine-tuning](#inference-after-fine-tuning)
## Environment Setup
```shell
pip install 'ms-swift[llm]' -U
```
## Inference
Infer using [qwen-vl-chat](https://modelscope.cn/models/qwen/Qwen-VL-Chat/summary):
```shell
# Experimental environment: 3090
# 24GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen-vl-chat
```
Output: (supports passing in local paths or URLs)
```python
"""
<<< multi-line
[INFO:swift] End multi-line input with `#`.
[INFO:swift] Input `single-line` to switch to single-line input mode.
<<<[M] Who are you?#
I am Tongyi Qianwen, an AI assistant developed by Alibaba Cloud. I am designed to answer various questions, provide information and converse with users. Is there anything I can help you with?
response: Malu边 is 14 km away from Malu; Yangjiang边 is 62 km away from Malu; Guangzhou边 is 293 km away from Malu.
query: Which city is the farthest away?
response: The farthest city is Guangzhou, 293 km away from Malu.
history: [['Picture 1:<img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png</img>\nHow far is it to each city?', 'Malu边 is 14 km away from Malu; Yangjiang边 is 62 km away from Malu; Guangzhou边 is 293 km away from Malu.'], ['Which city is the farthest away?', 'The farthest city is Guangzhou, 293 km away from Malu.']]
"""
```
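The `query:`/`response:`/`history:` lines above are the kind of output produced by the Python API. A minimal sketch of that usage, assuming the `swift.llm` helpers (`get_model_tokenizer`, `get_template`, `inference`) shipped with the ms-swift version these docs target:
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import torch
# Sketch: these helper names assume the swift.llm API bundled with the
# ms-swift release this guide targets; check your installed version.
from swift.llm import (ModelType, get_default_template_type,
                       get_model_tokenizer, get_template, inference)
from swift.utils import seed_everything

model_type = ModelType.qwen_vl_chat
template_type = get_default_template_type(model_type)

# Load the model and tokenizer, then build the chat template.
model, tokenizer = get_model_tokenizer(model_type, torch.float16,
                                       model_kwargs={'device_map': 'auto'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)

# Images are referenced with <img> tags inside the query (local path or URL).
query = """Picture 1:<img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png</img>
How far is it to each city?"""
response, history = inference(model, template, query)
print(f'query: {query}')
print(f'response: {response}')

# Multi-turn: pass the accumulated history back in.
query = 'Which city is the farthest away?'
response, history = inference(model, template, query, history)
print(f'query: {query}')
print(f'response: {response}')
print(f'history: {history}')
```
The same pattern should carry over to the other model types in this document by swapping `model_type`.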
## Fine-tuning
Fine-tuning multimodal large models usually uses **custom datasets**. Here is a demo that can be run directly:
LoRA fine-tuning:
(By default, only the qkv projections of the LLM part are LoRA fine-tuned. If you want to fine-tune all linear layers, including the vision model part, you can specify `--lora_target_modules ALL`.)
```shell
# Experimental environment: 3090
# 23GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_type qwen-vl-chat \
--dataset coco-en-mini
```
Full parameter fine-tuning:
```shell
# Experimental environment: 2 * A100
# 2 * 55GB GPU memory
CUDA_VISIBLE_DEVICES=0,1 swift sft \
--model_type qwen-vl-chat \
--dataset coco-en-mini \
--sft_type full
```
[Custom datasets](../LLM/Customization.md#-Recommended-Command-line-arguments) support json and jsonl formats. Here is an example of a custom dataset:
(Supports multi-turn dialogues, where each turn can contain multiple images or no images, and supports passing in local paths or URLs)
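A hypothetical `jsonl` sample in the same spirit, reusing the `Picture n:<img>...</img>` query format shown in the output above (field names and image paths are placeholders; see the linked Customization doc for the authoritative schema):
```jsonl
{"query": "Picture 1:<img>img1.jpg</img>\nDescribe the image", "response": "A cat is sleeping on a sofa."}
{"query": "Picture 1:<img>img1.jpg</img>\nPicture 2:<img>https://example.com/img2.png</img>\nWhat is different between the two pictures", "response": "The second picture also contains a dog.", "history": []}
{"query": "And what color is the dog", "response": "The dog is brown.", "history": [["Picture 1:<img>img2.png</img>\nDescribe the image", "A brown dog is playing with a cat."]]}
```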
- [Inference After Fine-tuning](#inference-after-fine-tuning)
## Environment Setup
```shell
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'
```
## Inference
Inference for [yi-vl-6b-chat](https://modelscope.cn/models/01ai/Yi-VL-6B/summary):
```shell
# Experimental environment: A10, 3090, V100...
# 18GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift infer --model_type yi-vl-6b-chat
```
Output: (supports passing in local path or URL)
```python
"""
<<< Describe this image
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
The image shows a kitten sitting on the floor, eyes open, staring at the camera. The kitten looks very cute, with gray and white fur, and blue eyes. It seems to be looking at the camera, possibly curious about the surroundings.
response: It's 14 kilometers from Jiata, 62 kilometers from Yangjiang, 293 kilometers from Guangzhou, 293 kilometers from Guangzhou.
query: Which city is the furthest away?
response: The furthest distance is 293 kilometers.
history: [['How far is it from each city?', "It's 14 kilometers from Jiata, 62 kilometers from Yangjiang, 293 kilometers from Guangzhou, 293 kilometers from Guangzhou."], ['Which city is the furthest away?', 'The furthest distance is 293 kilometers.']]
"""
```
## Fine-tuning
Fine-tuning multimodal large models usually uses **custom datasets**. Here is a demo that can be run directly:
(By default, only the qkv projections of the LLM part are LoRA fine-tuned. If you want to fine-tune all linear layers, including the vision model part, you can specify `--lora_target_modules ALL`. Full-parameter fine-tuning is also supported.)
```shell
# Experimental environment: A10, 3090, V100...
# 19GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_type yi-vl-6b-chat \
--dataset coco-en-2-mini
```
[Custom datasets](../LLM/Customization.md#-Recommended-Command-line-arguments) support json and jsonl formats. Here is an example of a custom dataset:
(Multi-turn dialogue is supported; each turn must include an image, which can be passed as a local path or a URL.)
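One way such a dataset might look, assuming a `query`/`response`/`history`/`images` schema where every turn carries exactly one image (all field names and paths here are illustrative assumptions, not the confirmed format; consult the linked Customization doc):
```jsonl
{"query": "Describe the image", "response": "A gray and white kitten is looking at the camera.", "images": ["cat.png"]}
{"query": "What is the animal doing now", "response": "The dog is running on the grass.", "history": [["Describe the image", "A brown dog is lying on the grass."]], "images": ["dog1.jpg", "dog2.jpg"]}
```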