Swift supports quantizing models with awq, gptq, bnb, hqq, and eetq. Among them, awq and gptq quantization support vllm inference acceleration; they require a calibration dataset, which yields better quantization quality but makes quantization slower. bnb, hqq, and eetq require no calibration data and quantize faster. All five quantization methods support qlora fine-tuning.
Quantization with awq and gptq is performed via `swift export`, while bnb, hqq, and eetq can be applied on the fly during sft and infer.
From the perspective of vllm inference acceleration, awq and gptq are recommended. From the perspective of quantization quality, awq, hqq, and gptq are recommended. From the perspective of quantization speed, hqq is recommended.
GPU devices: A10, 3090, V100, A100 are all supported.
```bash
# Install ms-swift
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'

# Using AWQ quantization:
# AutoAWQ and CUDA versions have a corresponding relationship; please select the version according to `https://github.com/casper-hansen/AutoAWQ`
pip install autoawq -U

# Using GPTQ quantization:
# AutoGPTQ and CUDA versions have a corresponding relationship; please select the version according to `https://github.com/PanQiWei/AutoGPTQ#quick-installation`
pip install auto_gptq -U

# Using bnb quantization:
pip install bitsandbytes -U

# Using hqq quantization:
# Requires transformers version > 4.40 (install transformers from source)
pip install hqq

# Using eetq quantization:
# Install eetq by following `https://github.com/NetEase-FuXi/EETQ`

# Environment alignment (usually not needed. If you encounter errors, you can run the code below; the repository is tested with the latest environment)
pip install -r requirements/framework.txt -U
pip install -r requirements/llm.txt -U
```
## Original Model
### awq, gptq
Here we demonstrate AWQ and GPTQ quantization on the qwen1half-7b-chat model.
```bash
# AWQ-INT4 quantization (takes about 18 minutes using A100, memory usage: 13GB)
# If OOM occurs during quantization, you can appropriately reduce `--quant_n_samples` (default 256) and `--quant_seqlen` (default 2048).
# GPTQ-INT4 quantization (takes about 20 minutes using A100, memory usage: 7GB)
# AWQ: Use `alpaca-zh alpaca-en sharegpt-gpt4:default` as the quantization dataset
# AWQ/GPTQ quantized models support VLLM inference acceleration. They also support model deployment.
CUDA_VISIBLE_DEVICES=0 swift infer --ckpt_dir 'output/qwen1half-4b-chat/vx-xxx/checkpoint-xxx-merged-awq-int4'
```
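The `swift export` commands themselves are not reproduced in this excerpt. A minimal sketch of what they might look like, assuming the calibration dataset and the default parameters mentioned in the comments above (verify the exact flags against the command-line parameters documentation):
```bash
# Sketch: AWQ-INT4 quantization of the original model (flags assumed; check `swift export --help`)
CUDA_VISIBLE_DEVICES=0 swift export \
--model_type qwen1half-7b-chat \
--quant_bits 4 --quant_method awq \
--dataset alpaca-zh alpaca-en sharegpt-gpt4:default

# Sketch: GPTQ-INT4 quantization of the original model
CUDA_VISIBLE_DEVICES=0 swift export \
--model_type qwen1half-7b-chat \
--quant_bits 4 --quant_method gptq \
--dataset alpaca-zh alpaca-en sharegpt-gpt4:default
```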
**Deploying the quantized model**
Server:
```shell
CUDA_VISIBLE_DEVICES=0 swift deploy --ckpt_dir 'output/qwen1half-4b-chat/vx-xxx/checkpoint-xxx-merged-awq-int4'
```
Testing:
```shell
curl http://localhost:8000/v1/chat/completions \
-H"Content-Type: application/json"\
-d'{
"model": "qwen1half-4b-chat",
"messages": [{"role": "user", "content": "How to fall asleep at night?"}],
"max_tokens": 256,
"temperature": 0
}'
```
## QLoRA
### awq, gptq
If you want to fine-tune awq- or gptq-quantized models with QLoRA, you need to perform pre-quantization first; for example, use `swift export` to quantize the original model. Then, when fine-tuning, specify the corresponding quantization method with `--quant_method`, as in the following commands:
```bash
# awq
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_type qwen1half-7b-chat \
--model_id_or_path qwen1half-7b-chat-awq-int4 \
--quant_method awq \
--sft_type lora \
--dataset alpaca-zh#5000

# gptq
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_type qwen1half-7b-chat \
--model_id_or_path qwen1half-7b-chat-gptq-int4 \
--quant_method gptq \
--sft_type lora \
--dataset alpaca-zh#5000
```
### bnb, hqq, eetq
If you want to use bnb, hqq, eetq for QLoRA fine-tuning, you need to specify `--quant_method` and `--quantization_bit` during training:
```bash
# bnb
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_type qwen1half-7b-chat \
--sft_type lora \
--dataset alpaca-zh#5000 \
--quant_method bnb \
--quantization_bit 4 \
--dtype fp16

# hqq
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_type qwen1half-7b-chat \
--sft_type lora \
--dataset alpaca-zh#5000 \
--quant_method hqq \
--quantization_bit 4

# eetq
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_type qwen1half-7b-chat \
--sft_type lora \
--dataset alpaca-zh#5000 \
--quant_method eetq \
--dtype fp16
```
**Note**
- hqq supports more customizable parameters, such as specifying different quantization configurations for different network layers. For details, please see [Command Line Arguments](https://github.com/modelscope/swift/blob/main/docs/source_en/LLM/Command-line-parameters.md).
- eetq uses 8-bit quantization, so there is no need to specify `quantization_bit`. Currently bf16 is not supported; you need to specify `--dtype fp16`.
- Currently, eetq's qlora training is relatively slow; it is recommended to use hqq instead. For reference, see this [issue](https://github.com/NetEase-FuXi/EETQ/issues/17).
## Pushing Models
Assume you fine-tuned qwen1half-4b-chat using LoRA, and the model weights directory is: `output/qwen1half-4b-chat/vx-xxx/checkpoint-xxx`.
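The push step itself is not shown in this excerpt. A hedged sketch using `swift export`, assuming the `--push_to_hub`, `--hub_model_id`, and `--hub_token` parameters (the hub model id and token below are placeholders):
```bash
# Sketch: push the LoRA checkpoint to the ModelScope Hub (flag names assumed; verify with `swift export --help`)
CUDA_VISIBLE_DEVICES=0 swift export \
--ckpt_dir 'output/qwen1half-4b-chat/vx-xxx/checkpoint-xxx' \
--push_to_hub true \
--hub_model_id '<your-namespace>/qwen1half-4b-chat-lora' \
--hub_token '<your-sdk-token>'
```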
Experimental environment: 8 * Ascend 910B3 (the devices were provided by [@chuanzhubin](https://github.com/chuanzhubin); thanks for supporting ModelScope and swift ~)
```shell
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'
pip install torch-npu decorator
pip install deepspeed
# Align environment (usually not necessary to run. If you encounter errors, you can run the following code, the repository is tested with the latest environment)
pip install -r requirements/framework.txt -U
pip install -r requirements/llm.txt -U
```
Verify the installation of the testing environment:
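A quick sanity check that the Ascend NPU stack is visible to PyTorch might look like the following one-liner (the `torch.npu` API is provided by `torch-npu`; this is an assumption to verify against the torch-npu documentation):
```bash
# Sketch: should print `True 8` on the 8 * Ascend 910B3 machine described above
python -c "import torch, torch_npu; print(torch.npu.is_available(), torch.npu.device_count())"
```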
The ORPO algorithm requires the same data format as DPO. Beyond SFT data [query, response], it additionally requires `rejected_response` to denote answers that the model should not generate.
The ORPO algorithm incorporates an odds ratio (OR) negative log-likelihood loss term into the loss function used during SFT training, to reduce the probability of generating rejected responses.
Compared to DPO, ORPO does not require a reference model and therefore uses less training memory.
Here, the hyperparameter beta represents the coefficient for the OR loss term. The larger the beta, the greater the penalty on rejected_response, with a default value of 0.1.
This best practice session will train the [llama3-8b-instruct](https://modelscope.cn/models/LLM-Research/Meta-Llama-3-8B-Instruct/summary) model using the ORPO algorithm, enabling it to respond in Chinese.
```shell
# Environment alignment (usually no need to run. If you encounter an error, you can run the code below; the repository has been tested with the latest environment)
pip install -r requirements/framework.txt -U
pip install -r requirements/llm.txt -U
```
## Dataset
We use the [Llama3-DPO-zh-en-emoji](https://modelscope.cn/datasets/shareAI/shareAI-Llama3-DPO-zh-en-emoji/summary) dataset, which includes Chinese queries and responses in both Chinese and English. A (truncated) sample from the dataset, showing the `answer_en` field:
```json
"answer_en":"My fellow wizarding friend! 🧙♂️\n\nA complete project management process, you ask? Well, I shall conjure up the essential steps for thee! 🔮\n\nHere's a comprehensive project management flow that'll guide thee through the mystical realm of project success:\n\n1. **Initiation**: Define project scope, goals, and deliverables. Identify stakeholders, and determine the project's feasibility. ⚡️\n2. **Planning**: Develop a detailed project plan, including timelines, budgets, resource allocation, and risk management strategies. 📅\n3. **Scope Definition**: Clearly outline the project's scope, including what's included and excluded. This will help prevent scope creep, my friend! 🔒\n4. **Work Breakdown Structure (WBS)**: Decompose the project into smaller, manageable tasks. This will aid in task assignment, scheduling, and tracking progress. 🗂️\n5. **Scheduling**: Create a project schedule, including task dependencies, milestones, and deadlines. This will help thee stay on track and avoid procrastination! ⏰\n6. **Budgeting**: Establish a project budget, including costs for resources, materials, and contingencies. This will ensure thou dost not overspend, my wise friend! 💸\n7. **Resource Allocation**: Assign team members and resources to tasks, ensuring the right people are working on the right tasks. This will optimize productivity and efficiency! 👥\n8. **Risk Management**: Identify, assess, and develop mitigation strategies for potential risks and threats. This will safeguard thy project from unforeseen dangers! ⚠️\n9. **Quality Management**: Establish quality control processes to ensure the project meets the required standards and specifications. This will guarantee a top-notch outcome, my friend! ✨\n10. **Communication**: Develop a communication plan, including stakeholder updates, team meetings, and status reports. This will keep everyone informed and aligned! 📢\n11. **Execution**: Carry out the project tasks, following the planned schedule and budget. This is where the magic happens, my friend! ✨\n12. **Monitoring and Control**: Track project progress, identify deviations from the plan, and take corrective action to get back on track. This will ensure thy project stays on course! 📊\n13. **Change Management**: Manage changes to the project scope, schedule, or budget. This will minimize disruptions and ensure the project remains viable! 🔀\n14. **Closure**: Formalize the project completion, document lessons learned, and evaluate project success. This will bring a sense of accomplishment and closure, my friend! 🎉\n\nThere thou hast it, my fellow wizard! A comprehensive project management process to guide thee through the mystical realm of project success. May thy projects be prosperous and thy stakeholders be delighted! 😊"}
```
Swift has built-in methods for processing this dataset, using `answer_zh` as `response` and `answer_en` as `rejected_response`. Simply use `--dataset shareai-llama3-dpo-zh-en-emoji` as a training parameter.
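The ORPO training command is not reproduced in this excerpt. A hedged sketch, assuming a `swift orpo` entry point analogous to the `swift simpo` command shown later in this document (verify the subcommand and flags against your installed version):
```bash
# Sketch only: ORPO training of llama3-8b-instruct (subcommand and flags assumed)
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=2 \
swift orpo \
--model_type llama3-8b-instruct \
--dataset shareai-llama3-dpo-zh-en-emoji \
--beta 0.1 \
--learning_rate 2e-6
```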
- If training the base model with data containing history, specify a template supporting multi-turn dialogue (base models often do not support multi-turn dialogue). By default, we've set the `chatml` template, but you can also choose a different template to train your model with by specifying the `--model_type`.
- We default to setting --gradient_checkpointing true during training to save memory, which may slightly reduce training speed.
- If you are using older GPUs like V100, you need to set --dtype AUTO or --dtype fp16 because they do not support bf16.
- If your machine is equipped with high-performance GPUs like A100 and you are using the qwen series of models, we recommend installing flash-attn, which will speed up training and inference as well as reduce memory usage (graphics cards like A10, 3090, and V100 do not support training with flash-attn). Models that support flash-attn can be viewed in LLM Supported Models.
- If you need to train offline, please use --model_id_or_path <model_dir> and set --check_model_is_latest false. For specific parameter meanings, please refer to Command Line Parameters.
- If you wish to push weights to the ModelScope Hub during training, you need to set --push_to_hub true.
## Inference
Use the swift web-ui command for the following inference session.
### Pre-Training Inference
> 你是谁(Who are you)

> 西湖醋鱼怎么做(How do you make West Lake Vinegar Fish?)




### Post-Training Inference
> 你是谁(Who are you)

> 西湖醋鱼怎么做(How do you make West Lake Vinegar Fish?)
This section introduces how to perform inference, self-cognition fine-tuning, quantization, and deployment on **Qwen1.5-7B-Chat** and **Qwen1.5-72B-Chat**, corresponding to **low-resource** and **high-resource** environments respectively.
The best practice for self-cognition fine-tuning, inference and deployment of Qwen2-72B-Instruct using dual-card 80GiB A100 can be found [here](https://github.com/modelscope/swift/issues/1092).
```shell
# The autoawq version corresponds to the cuda version; please choose based on `https://github.com/casper-hansen/AutoAWQ`
pip install autoawq

# The vllm version corresponds to the cuda version; please choose based on `https://docs.vllm.ai/en/latest/getting_started/installation.html`
pip install vllm

pip install openai
```
## Qwen1.5-7B-Chat
### Inference
Here we perform **streaming** inference on Qwen1.5-7B-Chat and its **awq-int4 quantized** version, and demonstrate inference using a **visualization** method.
Using Python for inference on `qwen1half-7b-chat`:
```python
"""
query: Where is the capital of Zhejiang located?
response: The capital of Zhejiang Province is Hangzhou City.
query: What are some delicious foods here?
response: Zhejiang has many delicious foods, such as West Lake vinegar fish, Dongpo pork, Longjing shrimp in Hangzhou, tangyuan in Ningbo, yam soup in Fenghua, fish cake and Nanxi River dried tofu in Wenzhou, Nanhu water chestnut in Jiaxing, etc. Each dish has its unique flavor and historical background, worth a try.
history: [['Where is the capital of Zhejiang located?', 'The capital of Zhejiang Province is Hangzhou City.'], ['What are some delicious foods here?', 'Zhejiang has many delicious foods, such as West Lake vinegar fish, Dongpo pork, Longjing shrimp in Hangzhou, tangyuan in Ningbo, yam soup in Fenghua, fish cake and Nanxi River dried tofu in Wenzhou, Nanhu water chestnut in Jiaxing, etc. Each dish has its unique flavor and historical background, worth a try.']]
"""
```
Using Python to infer `qwen1half-7b-chat-awq`, here we use **VLLM** for inference acceleration:
```python
"""
query: Where is the capital of Zhejiang?
response: The capital of Zhejiang Province is Hangzhou City.
query: What delicious food is there here
response: Zhejiang has many delicious foods. Here are some of the most representative ones:
1. Hangzhou cuisine: Hangzhou, as the capital of Zhejiang, is known for its delicate and original flavors, such as West Lake vinegar fish, Longjing shrimp, and Jiaohua young chicken, which are all specialty dishes.
2. Ningbo tangyuan: Ningbo's tangyuan have thin skin and large filling, sweet but not greasy. Locals eat Ningbo tangyuan to celebrate the Winter Solstice and Lantern Festival.
3. Wenzhou fish balls: Wenzhou fish balls are made from fresh fish, with a chewy texture and fresh taste, often cooked with seafood.
4. Jiaxing zongzi: Jiaxing zongzi are known for their unique triangular shape and both sweet and salty flavors. Wufangzhai's zongzi are particularly famous.
5. Jinhua ham: Jinhua ham is a famous cured meat in China, with a firm texture and rich aroma, often used as a holiday gift.
6. Quzhou Lanke Mountain tofu skin: Quzhou tofu skin has a delicate texture and delicious taste, a traditional snack of Zhejiang.
7. Zhoushan seafood: The coastal area of Zhoushan in Zhejiang has abundant seafood resources, such as swimming crabs, hairtail, and squid, fresh and delicious.
The above are just some of the Zhejiang delicacies. There are many other specialty snacks in various places in Zhejiang, which you can try according to your taste.
history: [('Where is the capital of Zhejiang?', 'The capital of Zhejiang Province is Hangzhou City.'), ('What delicious food is there here', "Zhejiang has many delicious foods. Here are some of the most representative ones:\n\n1. Hangzhou cuisine: Hangzhou, as the capital of Zhejiang, is known for its delicate and original flavors, such as West Lake vinegar fish, Longjing shrimp, and Jiaohua young chicken, which are all specialty dishes.\n\n2. Ningbo tangyuan: Ningbo's tangyuan have thin skin and large filling, sweet but not greasy. Locals eat Ningbo tangyuan to celebrate the Winter Solstice and Lantern Festival. \n\n3. Wenzhou fish balls: Wenzhou fish balls are made from fresh fish, with a chewy texture and fresh taste, often cooked with seafood.\n\n4. Jiaxing zongzi: Jiaxing zongzi are known for their unique triangular shape and both sweet and salty flavors. Wufangzhai's zongzi are particularly famous.\n\n5. Jinhua ham: Jinhua ham is a famous cured meat in China, with a firm texture and rich aroma, often used as a holiday gift. \n\n6. Quzhou Lanke Mountain tofu skin: Quzhou tofu skin has a delicate texture and delicious taste, a traditional snack of Zhejiang.\n\n7. Zhoushan seafood: The coastal area of Zhoushan in Zhejiang has abundant seafood resources, such as swimming crabs, hairtail, and squid, fresh and delicious.\n\nThe above are just some of the Zhejiang delicacies. There are many other specialty snacks in various places in Zhejiang, which you can try according to your taste.")]
"""
```
Using a visualization method for inference, and using VLLM:
```shell
CUDA_VISIBLE_DEVICES=0 swift app-ui \
--model_type qwen1half-7b-chat \
--infer_backend vllm --max_model_len 4096
```
The effect is as follows:

### Self-Cognition Fine-tuning
Next, we perform self-cognition fine-tuning on the model to train your own large model in **ten minutes**. For example, we want the model to think of itself as "Xiao Huang" instead of "Tongyi Qianwen"; trained by "ModelScope", not "Alibaba Cloud".
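The fine-tuning command itself is not included in this excerpt. A hedged sketch of a self-cognition LoRA run, assuming the `self-cognition` dataset and the `--model_name` / `--model_author` parameters available in swift (verify against the command-line parameters documentation):
```bash
# Sketch: self-cognition LoRA fine-tuning of qwen1half-7b-chat (dataset mix and flag names assumed)
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_type qwen1half-7b-chat \
--sft_type lora \
--dataset alpaca-zh#500 alpaca-en#500 self-cognition#500 \
--model_name 小黄 'Xiao Huang' \
--model_author 魔搭 ModelScope \
--num_train_epochs 1
```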
```python
"""
response: I am an AI assistant created by ModelScope. My name is Xiao Huang. I can answer various questions, provide information, and help. What can I help you with?
query: Where is the capital of Zhejiang?
response: The capital of Zhejiang Province is Hangzhou City.
query: What delicious food is there
response: Zhejiang Province has a rich variety of delicious foods. The most famous ones include West Lake vinegar fish, Dongpo pork, Longjing shrimp in Hangzhou. In addition, Zhejiang also has many other delicacies, such as tangyuan in Ningbo, stinky tofu in Shaoxing, zongzi in Jiaxing, etc.
history: [('Where is the capital of Zhejiang?', 'The capital of Zhejiang Province is Hangzhou City.'), ('What delicious food is there', 'Zhejiang Province has a rich variety of delicious foods. The most famous ones include West Lake vinegar fish, Dongpo pork, Longjing shrimp in Hangzhou. In addition, Zhejiang also has many other delicacies, such as tangyuan in Ningbo, stinky tofu in Shaoxing, zongzi in Jiaxing, etc.')]
"""
```
### Deployment
Finally, we deploy the quantized model in the format of the **OpenAI API**:
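The deployment command is not shown here; following the pattern used for the 4b checkpoint earlier in this document, it presumably looks like the following (the checkpoint path is illustrative):
```bash
# Sketch: serve the merged awq-int4 checkpoint as an OpenAI-compatible API
CUDA_VISIBLE_DEVICES=0 swift deploy \
--ckpt_dir 'output/qwen1half-7b-chat/vx-xxx/checkpoint-xxx-merged-awq-int4'
```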
```python
"""
response: I am an AI assistant developed by ModelScope. My name is Xiao Huang. I can answer various questions, provide information and help. Is there anything I can help you with?
query: what's your name?
response: My name is Xiao Huang. I am an AI assistant developed by ModelScope. How can I assist you?
query: Who developed you?
response: I was developed by ModelScope.
query: 78654+657=?
response: 78654 + 657 = 79311
query: What to do if I can't fall asleep at night
response: If you can't fall asleep at night, here are some suggestions that may help improve your sleep quality:
1. Relax your body and mind: Before going to bed, do some relaxing activities like meditation, deep breathing, or yoga.
2. Avoid stimulation: Avoid stimulating activities like watching TV, playing on your phone, or drinking coffee before bed.
3. Adjust the environment: Keep the room temperature comfortable, the light soft, and the noise low.
4. Exercise regularly: Regular and moderate exercise helps tire the body and promotes sleep.
5. Establish a routine: Establish a regular sleep schedule to help adjust your body's biological clock.
6. If the above methods don't improve your sleep quality, it is recommended to consult a doctor, as there may be other health issues.
I hope these suggestions are helpful to you.
"""
```
## Qwen1.5-72B-Chat
### Inference
Different from the previous 7B demonstration, here we use the **CLI** method for inference:
```shell
# Experimental environment: 4 * A100
RAY_memory_monitor_refresh_ms=0 CUDA_VISIBLE_DEVICES=0,1,2,3 swift infer \
--model_type qwen1half-72b-chat \
--infer_backend vllm --tensor_parallel_size 4
```
Output:
```python
"""
<<< Who are you?
I am a large-scale language model from Alibaba Cloud called Tongyi Qianwen.
Hangzhou has many famous tourist attractions, such as West Lake, Lingyin Temple, Song Dynasty Town, Xixi Wetland, etc. The beautiful scenery of West Lake is suitable for all seasons. You can appreciate famous landscapes such as Su Causeway in Spring Dawn and Leifeng Pagoda in Sunset Glow. Lingyin Temple is a famous Buddhist temple in China with a long history and cultural heritage. Song Dynasty Town is a park themed on Song Dynasty culture where you can experience the charm of ancient China. Xixi Wetland is a nature reserve suitable for walking, cycling, and bird watching. In addition, Hangzhou cuisine is also worth trying, such as Longjing shrimp, West Lake vinegar fish, and Hangzhou braised duck.
"""
```
### Self-Cognition Fine-tuning
Here we use deepspeed-**zero3** for fine-tuning, which takes about **30 minutes**:
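The training command for this run is not reproduced here. A hedged sketch of a 4-GPU DeepSpeed ZeRO-3 LoRA run, assuming a `default-zero3` option analogous to the `default-zero2` option used elsewhere in this document:
```bash
# Sketch: ZeRO-3 self-cognition LoRA fine-tuning of qwen1half-72b-chat (flags assumed)
NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
--model_type qwen1half-72b-chat \
--sft_type lora \
--dataset alpaca-zh#500 alpaca-en#500 self-cognition#500 \
--model_name 小黄 'Xiao Huang' \
--model_author 魔搭 ModelScope \
--deepspeed default-zero3
```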
```python
"""
I am an artificial intelligence language model created by ModelScope. My name is Xiao Huang. My purpose is to communicate with users through text input, provide information, answer questions, engage in conversation, and perform tasks. If you have any questions or need help, please let me know at any time.
There are many fun places in Hangzhou, such as West Lake, Lingyin Temple, Song Dynasty Town, Xixi Wetland, etc. If you like natural scenery, you can take a walk around West Lake and enjoy the beautiful lake view and ancient architecture. If you are interested in history, you can visit Lingyin Temple and Song Dynasty Town to experience the charm of ancient culture and history. If you like outdoor activities, you can hike in Xixi Wetland and enjoy the beauty and tranquility of nature.
"""
```
### Quantization
Perform awq-int4 quantization on the fine-tuned model. The entire quantization process takes about **2 hours**.
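The export command is not shown in this excerpt. A hedged sketch combining the `--merge_lora` and `--quant_bits` options referenced elsewhere in this document (the checkpoint path is illustrative):
```bash
# Sketch: merge the LoRA weights and quantize the merged model to AWQ-INT4
CUDA_VISIBLE_DEVICES=0 swift export \
--ckpt_dir 'output/qwen1half-72b-chat/vx-xxx/checkpoint-xxx' \
--merge_lora true \
--quant_bits 4 --quant_method awq
```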
```python
"""
response: I am an artificial intelligence language model developed by ModelScope. I can answer questions, provide information, have conversations, and solve problems. What can I help you with?
query: what's your name?
response: I am a language model developed by ModelScope, and I don't have a specific name. You can call me Xiao Huang or Xiao Huang. How can I help you?
query: Who developed you?
response: I was developed by ModelScope.
query: 78654+657=?
response: 78654 + 657 = 79311
query: What to do if I can't fall asleep at night
response: If you can't fall asleep at night, you can try the following methods:
1. Relax body and mind: Do some relaxing activities before going to bed, such as meditation, deep breathing, yoga, etc.
2. Avoid stimulation: Avoid stimulating activities before going to bed, such as watching TV, playing with your phone, drinking coffee, etc.
3. Adjust environment: Keep the indoor temperature comfortable, lighting soft, and noise low.
4. Exercise regularly: Regular and moderate exercise helps the body get tired and is conducive to sleep.
5. Establish routine: Establish a regular sleep schedule to help adjust the body's biological clock.
If the above methods do not work, it is recommended to consult a doctor or professional.
"""
```
Fine-tune your own large model in just 10 minutes!
## Table of Contents
- [Environment Setup](#environment-setup)
- [Inference Before Fine-Tuning](#inference-before-fine-tuning)
- [Fine-Tuning](#fine-tuning)
- [Inference After Fine-Tuning](#inference-after-fine-tuning)
- [Web-UI](#web-ui)
## Environment Setup
```bash
# Set the global pip mirror (for faster downloads)
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
# Install ms-swift
pip install 'ms-swift[llm]' -U

# Environment alignment (usually not necessary to run. If you encounter errors, you can run the following code to test with the latest environment in the repository)
pip install -r requirements/framework.txt -U
pip install -r requirements/llm.txt -U
```
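The "Inference Before Fine-Tuning" step listed in the table of contents is not shown here; based on the CLI command that appears later in this document, it presumably runs the original model directly:
```bash
CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen2-7b-instruct
```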
```python
"""
China is a vast country with a rich culinary heritage, and its cuisine varies significantly from region to region. Here are a few famous dishes from different parts of China:
1. **Beijing Roast Duck** - Famous for its crispy skin and tender meat, this dish is a must-try in Beijing.
2. **Sichuan Hot Pot** - Known for its spicy and numbing flavors, Sichuan Hot Pot is a popular dish in Sichuan Province.
3. **Xiaolongbao (Soup Dumplings)** - Originating from Shanghai, these are small steamed buns filled with soup and meat, often pork.
4. **Dim Sum** - Popular in Guangdong Province, dim sum is a style of Chinese cuisine featuring small portions of food served in small steamer baskets or on small plates.
5. **Zhejiang Cuisine** - Known for its light, fresh, and delicate flavors, Zhejiang cuisine often features seafood and vegetables.
6. **Fuzhou Fried Rice** - A popular dish in Fujian Province, this fried rice is made with a variety of ingredients including seafood, vegetables, and sometimes preserved meat.
7. **Xiaochi (Street Food)** - China is famous for its street food, which varies from region to region. Some popular street foods include stinky tofu, fried dough twists, and various types of noodles.
8. **Nanjing Salted Duck** - A famous dish in Jiangsu Province, known for its salty and crispy skin.
9. **Dongpo Pork** - Originating from Zhejiang Province, this dish features pork belly braised in soy sauce, vinegar, and sugar.
10. **Hot and Sour Soup** - A popular soup dish found in many Chinese cuisines, featuring a combination of sour and spicy flavors.
Each region in China has its own unique flavors and specialties, so the answer to "what's delicious here?" can vary greatly depending on where you are in the country.
If you're having trouble sleeping at night, there are several strategies you can try to improve your sleep quality:
1. **Establish a Routine**: Try to go to bed and wake up at the same time every day, even on weekends. This helps regulate your body's internal clock.
2. **Create a Sleep-Conducive Environment**: Make sure your bedroom is cool, quiet, and dark. Consider using blackout curtains, earplugs, or a white noise machine if necessary. A comfortable mattress and pillows can also help.
3. **Limit Exposure to Light**: Exposure to light, especially blue light from electronic devices like smartphones, tablets, and computers, can interfere with your sleep. Try to avoid these devices at least an hour before bedtime.
4. **Exercise Regularly**: Regular physical activity can help you fall asleep faster and enjoy deeper sleep. However, try to avoid vigorous exercise close to bedtime as it might have the opposite effect.
5. **Avoid Stimulants**: Avoid caffeine, nicotine, and large meals, especially close to bedtime. Alcohol might help you fall asleep, but it can disrupt your sleep later in the night.
6. **Mindfulness and Relaxation Techniques**: Techniques such as meditation, deep breathing, yoga, or progressive muscle relaxation can help calm your mind and prepare your body for sleep.
7. **Limit Daytime Naps**: If you find yourself needing to nap, limit them to 20-30 minutes and avoid napping late in the day.
8. **Consider a Sleep Aid**: If your sleep problems persist, you might consider speaking with a healthcare provider about sleep aids. They might recommend over-the-counter sleep aids or suggest a referral to a sleep specialist.
9. **Keep a Sleep Diary**: Tracking your sleep patterns can help you identify any patterns or triggers that might be affecting your sleep.
10. **Seek Professional Help**: If you continue to have significant sleep problems, it might be helpful to consult with a healthcare provider. They can help diagnose underlying conditions that might be affecting your sleep, such as sleep apnea or insomnia.
Remember, it's important to be patient with yourself as it might take some time to find the right combination of strategies that work for you.
"""
```
If you want to perform single-sample inference, you can refer to [LLM Inference Documentation](LLM-inference.md#qwen-7b-chat)
Using CLI:
```bash
CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen2-7b-instruct
```
## Fine-Tuning
Note: Self-cognition training involves knowledge editing, so it is recommended to include **MLP** in `lora_target_modules`. You can specify `--lora_target_modules ALL` to add LoRA to all linear layers (including qkv and mlp), which **usually yields the best results**.
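The fine-tuning command is not included in this excerpt. A hedged sketch applying the `--lora_target_modules ALL` recommendation above (the dataset mix and the self-cognition flag names are assumptions):
```bash
# Sketch: self-cognition fine-tuning of qwen2-7b-instruct with LoRA on all linear layers
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_type qwen2-7b-instruct \
--sft_type lora \
--lora_target_modules ALL \
--dataset alpaca-zh#500 alpaca-en#500 self-cognition#500 \
--model_name 小黄 'Xiao Huang' \
--model_author 魔搭 ModelScope
```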
```python
"""
As an AI language model, I don't have the ability to taste or experience food. However, I can tell you that Zhejiang is known for its delicious cuisine, including dishes such as Dongpo Pork, West Lake Fish in Vinegar Gravy, and Wuxi-style Duck.
If you are having trouble sleeping at night, there are several things you can try to help improve your sleep:
1. Establish a regular sleep schedule: Try to go to bed and wake up at the same time every day, even on weekends.
2. Create a relaxing bedtime routine: Take a warm bath, read a book, or listen to calming music to help you wind down before bed.
3. Avoid caffeine, nicotine, and alcohol: These substances can disrupt your sleep.
4. Limit screen time before bed: The blue light emitted by electronic devices can interfere with your body's production of melatonin, a hormone that regulates sleep.
5. Exercise regularly: Regular physical activity can help you fall asleep faster and sleep more soundly.
6. Create a comfortable sleep environment: Make sure your bedroom is cool, quiet, and dark.
7. Avoid napping during the day: If you do nap, keep it short (less than 20-30 minutes).
8. Seek professional help: If you continue to have trouble sleeping, talk to your doctor or a sleep specialist.
Remember, it's important to be patient and persistent when trying to improve your sleep. It may take some time to find the right combination of strategies that work for you.
"""
```
Using CLI:
```bash
# Direct inference
CUDA_VISIBLE_DEVICES=0 swift infer --ckpt_dir 'qwen2-7b-instruct/vx-xxx/checkpoint-xxx'

# Merge LoRA incremental weights and infer
# If you need quantization, you can specify `--quant_bits 4`.
CUDA_VISIBLE_DEVICES=0 swift infer --ckpt_dir 'qwen2-7b-instruct/vx-xxx/checkpoint-xxx' --merge_lora true
```
[SimPO](https://arxiv.org/abs/2405.14734) requires the same data format as DPO for training. In addition to the [query, response] pairs from SFT data, it also requires a `rejected_response` to indicate the responses that the model should not generate.
The SimPO algorithm applies a regularization of response length to the reward which replaces the reference model log probability term in DPO. It also introduces a reward margin term in preference modeling to increase the reward gap between two responses.
Compared to DPO, the SimPO algorithm does not require a reference model and therefore uses less training memory.
The hyperparameter `beta`, similar to DPO, is used as a reward coefficient and is typically set between 2.0 and 2.5, with a default of 2.0. The `gamma` serves as the reward margin and is typically set between 0.5 and 1.5, with a default of 1.0.
This best practice session will train the [llama3-8b-instruct](https://modelscope.cn/models/LLM-Research/Meta-Llama-3-8B-Instruct/summary) model using the SimPO algorithm, enabling it to respond in Chinese.
```shell
# Environment alignment (usually no need to run. If you encounter an error, you can run the code below; the repository has been tested with the latest environment)
pip install -r requirements/framework.txt -U
pip install -r requirements/llm.txt -U
```
## Dataset
We use the [Llama3-DPO-zh-en-emoji](https://modelscope.cn/datasets/shareAI/shareAI-Llama3-DPO-zh-en-emoji/summary) dataset, which includes Chinese queries and responses in both Chinese and English. A (truncated) sample from the dataset, showing the `answer_en` field:
```json
"answer_en":"My fellow wizarding friend! 🧙♂️\n\nA complete project management process, you ask? Well, I shall conjure up the essential steps for thee! 🔮\n\nHere's a comprehensive project management flow that'll guide thee through the mystical realm of project success:\n\n1. **Initiation**: Define project scope, goals, and deliverables. Identify stakeholders, and determine the project's feasibility. ⚡️\n2. **Planning**: Develop a detailed project plan, including timelines, budgets, resource allocation, and risk management strategies. 📅\n3. **Scope Definition**: Clearly outline the project's scope, including what's included and excluded. This will help prevent scope creep, my friend! 🔒\n4. **Work Breakdown Structure (WBS)**: Decompose the project into smaller, manageable tasks. This will aid in task assignment, scheduling, and tracking progress. 🗂️\n5. **Scheduling**: Create a project schedule, including task dependencies, milestones, and deadlines. This will help thee stay on track and avoid procrastination! ⏰\n6. **Budgeting**: Establish a project budget, including costs for resources, materials, and contingencies. This will ensure thou dost not overspend, my wise friend! 💸\n7. **Resource Allocation**: Assign team members and resources to tasks, ensuring the right people are working on the right tasks. This will optimize productivity and efficiency! 👥\n8. **Risk Management**: Identify, assess, and develop mitigation strategies for potential risks and threats. This will safeguard thy project from unforeseen dangers! ⚠️\n9. **Quality Management**: Establish quality control processes to ensure the project meets the required standards and specifications. This will guarantee a top-notch outcome, my friend! ✨\n10. **Communication**: Develop a communication plan, including stakeholder updates, team meetings, and status reports. This will keep everyone informed and aligned! 📢\n11. **Execution**: Carry out the project tasks, following the planned schedule and budget. This is where the magic happens, my friend! ✨\n12. **Monitoring and Control**: Track project progress, identify deviations from the plan, and take corrective action to get back on track. This will ensure thy project stays on course! 📊\n13. **Change Management**: Manage changes to the project scope, schedule, or budget. This will minimize disruptions and ensure the project remains viable! 🔀\n14. **Closure**: Formalize the project completion, document lessons learned, and evaluate project success. This will bring a sense of accomplishment and closure, my friend! 🎉\n\nThere thou hast it, my fellow wizard! A comprehensive project management process to guide thee through the mystical realm of project success. May thy projects be prosperous and thy stakeholders be delighted! 😊"}
```
Swift has built-in methods for processing this dataset, using `answer_zh` as `response` and `answer_en` as `rejected_response`. Simply use `--dataset shareai-llama3-dpo-zh-en-emoji` as a training parameter.
## Training
```shell
# Experimental environment: A100
# DDP + MP
# Memory usage: 4*56G
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=2 \
swift simpo \
--model_type llama3-8b-instruct \
--sft_type full \
--dataset shareai-llama3-dpo-zh-en-emoji \
--gradient_checkpointing true \
--learning_rate 2e-6
```
**Notes**:
- We found that SimPO+LoRA performed poorly, full fine-tuning is recommended.
- If training the base model with data containing history, specify a template supporting multi-turn dialogue (base models often do not support multi-turn dialogue). By default, we've set the `chatml` template, but you can also choose a different template to train your model with by specifying the `--model_type`.
- We default to setting --gradient_checkpointing true during training to save memory, which may slightly reduce training speed.
- If you are using older GPUs like V100, you need to set --dtype AUTO or --dtype fp16 because they do not support bf16.
- If your machine is equipped with high-performance GPUs like A100 and you are using the qwen series of models, we recommend installing flash-attn, which will speed up training and inference as well as reduce memory usage (graphics cards like A10, 3090, and V100 do not support training with flash-attn). Models that support flash-attn can be viewed in LLM Supported Models.
- If you need to train offline, please use --model_id_or_path <model_dir> and set --check_model_is_latest false. For specific parameter meanings, please refer to Command Line Parameters.
- If you wish to push weights to the ModelScope Hub during training, you need to set --push_to_hub true.
## Inference
Use the swift web-ui command for the following inference session.
### Pre-Training Inference
> 你是谁(Who are you)

> 西湖醋鱼怎么做(How do you make West Lake Vinegar Fish?)




### Post-Training Inference
> 你是谁(Who are you)

> 西湖醋鱼怎么做(How do you make West Lake Vinegar Fish?)
The table below introduces the datasets supported by SWIFT:
- Dataset Name: The dataset name registered in SWIFT.
- Dataset ID: The dataset id in [ModelScope](https://www.modelscope.cn/my/overview).
- Size: The data row count of the dataset.
- Statistic: Dataset statistics. We use the number of tokens for statistics, which helps adjust the max_length hyperparameter. We concatenate the training and validation sets of the dataset and then compute the statistics. We use qwen's tokenizer to tokenize the dataset. Different tokenizers produce different statistics. If you want to obtain token statistics for tokenizers of other models, you can use the script to get them yourself.
| Dataset Name | Dataset ID | Subsets | Dataset Size | Statistic (token) | Tags | HF Dataset ID |
|---|---|---|---|---|---|---|
|multi-alpaca|[damo/nlp_polylm_multialpaca_sft](https://modelscope.cn/datasets/damo/nlp_polylm_multialpaca_sft/summary)|ar<br>de<br>es<br>fr<br>id<br>ja<br>ko<br>pt<br>ru<br>th<br>vi|131867|112.9±50.6, min=26, max=1226|chat, general, multilingual|-|
|sharegpt|[huangjintao/sharegpt](https://modelscope.cn/datasets/huangjintao/sharegpt/summary)|common-zh<br>computer-zh<br>unknow-zh<br>common-en<br>computer-en|96566|933.3±864.8, min=21, max=66412|chat, general, multi-round|-|
|tulu-v2-sft-mixture|[AI-ModelScope/tulu-v2-sft-mixture](https://modelscope.cn/datasets/AI-ModelScope/tulu-v2-sft-mixture/summary)||5119|520.7±437.6, min=68, max=2549|chat, multilingual, general, multi-round|[allenai/tulu-v2-sft-mixture](https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture)|
|wikipedia-zh|[AI-ModelScope/wikipedia-cn-20230720-filtered](https://modelscope.cn/datasets/AI-ModelScope/wikipedia-cn-20230720-filtered/summary)||254547|568.4±713.2, min=37, max=78678|text-generation, general, pretrained|[pleisto/wikipedia-cn-20230720-filtered](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered)|
```python
"""
response: Hello! I'm happy to be of service. Is there anything I can help you with?
query: Where is the capital of Zhejiang?
response: Hangzhou is the capital of Zhejiang Province.
query: What delicious food is there
response: Hangzhou is a city of gastronomy, with many famous dishes and snacks such as West Lake Vinegar Fish, Dongpo Pork, Beggar's Chicken, etc. In addition, Hangzhou has many snack shops where you can taste a variety of local delicacies.
history: [('Where is the capital of Zhejiang?', 'Hangzhou is the capital of Zhejiang Province.'), ('What delicious food is there', "Hangzhou is a city of gastronomy, with many famous dishes and snacks such as West Lake Vinegar Fish, Dongpo Pork, Beggar's Chicken, etc. In addition, Hangzhou has many snack shops where you can taste a variety of local delicacies.")]
query_list: ['Hello!', 'Where is the capital of Zhejiang?']
...
response_list: ['Hello! I'm happy to be of service. Is there anything I can help you with?', 'Hangzhou is the capital of Zhejiang Province.']
query: What delicious food is there
...
response: Hangzhou is a city of gastronomy, with many famous dishes and snacks such as West Lake Vinegar Fish, Dongpo Pork, Beggar's Chicken, etc. In addition, Hangzhou has many snack shops where you can taste a variety of local delicacies.
history: [('Where is the capital of Zhejiang?', 'Hangzhou is the capital of Zhejiang Province.'), ('What delicious food is there', "Hangzhou is a city of gastronomy, with many famous dishes and snacks such as West Lake Vinegar Fish, Dongpo Pork, Beggar's Chicken, etc. In addition, Hangzhou has many snack shops where you can taste a variety of local delicacies.")]
response: Hello, I am an AI assistant. I'm very pleased to serve you! Do you have any questions I can help you answer?
query: Where is the capital of Zhejiang?
response: The capital of Zhejiang is Hangzhou.
query: What delicious food is there
response: Zhejiang has many delicious foods, some of the most famous ones include Longjing Shrimp from Hangzhou, Dongpo Pork, West Lake Vinegar Fish, Beggar's Chicken, etc. In addition, Zhejiang also has many specialty snacks and pastries, such as Tang Yuan and Nian Gao from Ningbo, stir-fried crab and Wenzhou meatballs from Wenzhou, etc.
history: [('Where is the capital of Zhejiang?', 'The capital of Zhejiang is Hangzhou.'), ('What delicious food is there', 'Zhejiang has many delicious foods, some of the most famous ones include Longjing Shrimp from Hangzhou, Dongpo Pork, West Lake Vinegar Fish, Beggar's Chicken, etc. In addition, Zhejiang also has many specialty snacks and pastries, such as Tang Yuan and Nian Gao from Ningbo, stir-fried crab and Wenzhou meatballs from Wenzhou, etc.')]
"""
```
### Using CLI
```bash
# qwen
CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen-7b-chat --infer_backend vllm
# yi
CUDA_VISIBLE_DEVICES=0 swift infer --model_type yi-6b-chat --infer_backend vllm
# gptq
CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen1half-7b-chat-int4 --infer_backend vllm
```
### Fine-tuned Models
**Single sample inference**:
For models fine-tuned using LoRA, you need to first [merge-lora](LLM-fine-tuning.md#merge-lora) to generate a complete checkpoint directory.
Models fine-tuned with full parameters can seamlessly use VLLM for inference acceleration.
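The Python snippet for this step is not included in this excerpt; an equivalent CLI invocation would be a sketch like the following, combining flags that appear elsewhere in this document (the checkpoint path is illustrative):
```bash
# Sketch: merge the LoRA weights and run single-sample inference with the vllm backend
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir 'output/qwen1half-7b-chat/vx-xxx/checkpoint-xxx' \
--merge_lora true \
--infer_backend vllm
```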
```python
"""
response: The capital of Zhejiang Province is Hangzhou.
query: What delicious food is there?
response: Hangzhou has many delicious foods, such as West Lake Vinegar Fish, Dongpo Pork, Longjing Shrimp, Beggar's Chicken, etc. In addition, Hangzhou also has many specialty snacks, such as West Lake Lotus Root Powder, Hangzhou Xiao Long Bao, Hangzhou You Tiao, etc.
"""
```
#### qwen-7b
**Server side:**
```bash
CUDA_VISIBLE_DEVICES=0 swift deploy --model_type qwen-7b
# Multi-GPU deployment
RAY_memory_monitor_refresh_ms=0 CUDA_VISIBLE_DEVICES=0,1,2,3 swift deploy --model_type qwen-7b --tensor_parallel_size 4
"messages": [{"role": "user", "content": "who are you?"}],
"max_tokens": 256,
"temperature": 0
}'
# resp: I am Kakarot...
# If model='llama3-8b-instruct' is specified instead, it will return "I'm llama3...", which is the response of the original model
```
> [!NOTE]
>
> If the `--ckpt_dir` parameter is a LoRA path, the weights are loaded as the `default-lora` tuner by default; other tuners need to be loaded manually through `lora_modules`.
## VLLM & LoRA
Models supported by VLLM & LoRA can be viewed at: https://docs.vllm.ai/en/latest/models/supported_models.html
```python
"""
I am an artificial intelligence language model developed by ModelScope. I am designed to assist and communicate with users in a helpful and respectful manner. I can answer questions, provide information, and engage in conversation. How can I help you?
response: I am an artificial intelligence language model developed by ModelScope. I can understand and respond to text-based questions and prompts, and provide information and assistance on a wide range of topics.
query: who are you?
response: Hello! I'm just an AI assistant, here to help you with any questions or tasks you may have. I'm designed to be helpful, respectful, and honest in my responses, and I strive to provide socially unbiased and positive answers. I'm not a human, but a machine learning model trained on a large dataset of text to generate responses to a wide range of questions and prompts. I'm here to help you in any way I can, while always ensuring that my answers are safe and respectful. Is there anything specific you'd like to know or discuss?
"messages": [{"role": "user", "content": "who are you?"}],
"max_tokens": 256,
"temperature": 0
}'
curl http://localhost:8000/v1/chat/completions \
-H"Content-Type: application/json"\
-d'{
"model": "llama2-7b-chat",
"messages": [{"role": "user", "content": "who are you?"}],
"max_tokens": 256,
"temperature": 0
}'
```
Output:
```python
"""
{"model":"default-lora","choices":[{"index":0,"message":{"role":"assistant","content":"I am an artificial intelligence language model developed by ModelScope. I am designed to assist and communicate with users in a helpful, respectful, and honest manner. I can answer questions, provide information, and engage in conversation. How can I assist you?"},"finish_reason":"stop"}],"usage":{"prompt_tokens":141,"completion_tokens":53,"total_tokens":194},"id":"chatcmpl-fb95932dcdab4ce68f4be49c9946b306","object":"chat.completion","created":1710820459}
{"model":"llama2-7b-chat","choices":[{"index":0,"message":{"role":"assistant","content":" Hello! I'm just an AI assistant, here to help you with any questions or concerns you may have. I'm designed to provide helpful, respectful, and honest responses, while ensuring that my answers are socially unbiased and positive in nature. I'm not capable of providing harmful, unethical, racist, sexist, toxic, dangerous, or illegal content, and I will always do my best to explain why I cannot answer a question if it does not make sense or is not factually coherent. If I don't know the answer to a question, I will not provide false information. My goal is to assist and provide accurate information to the best of my abilities. Is there anything else I can help you with?"},"finish_reason":"stop"}],"usage":{"prompt_tokens":141,"completion_tokens":163,"total_tokens":304},"id":"chatcmpl-d867a3a52bb7451588d4f73e1df4ba95","object":"chat.completion","created":1710820557}
response: I am an artificial intelligence language model developed by ModelScope. I am designed to assist and communicate with users in a helpful, respectful, and honest manner. I can answer questions, provide information, and engage in conversation. How can I assist you?
query: who are you?
response: Hello! I'm just an AI assistant, here to help you with any questions or concerns you may have. I'm designed to provide helpful, respectful, and honest responses, while ensuring that my answers are socially unbiased and positive in nature. I'm not capable of providing harmful, unethical, racist, sexist, toxic, dangerous, or illegal content, and I will always do my best to explain why I cannot answer a question if it does not make sense or is not factually coherent. If I don't know the answer to a question, I will not provide false information. Is there anything else I can help you with?
"""
```

- [Inference After Fine-tuning](#inference-after-fine-tuning)
## Environment Setup
```shell
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'
```
## Inference
Inference with [cogvlm-17b-chat](https://modelscope.cn/models/ZhipuAI/cogvlm-chat/summary):
```shell
# Experimental environment: A100
# 38GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift infer --model_type cogvlm-17b-chat
```
Output: (supports passing local path or URL)
```python
"""
<<< Describe this image.
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
This image showcases a close-up of a young kitten. The kitten has a fluffy coat with a mix of white, gray, and brown colors. Its eyes are strikingly blue, and it appears to be gazing directly at the viewer. The background is blurred, emphasizing the kitten as the main subject.
"""
```
Fine-tuning multimodal large models usually uses **custom datasets**. Here is a demo that can be run directly:
(By default, lora fine-tuning is performed on the qkv of the language and vision models. If you want to fine-tune all linears, you can specify `--lora_target_modules ALL`)
```shell
# Experimental environment: A100
# 50GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_type cogvlm-17b-chat \
--dataset coco-en-2-mini
```
[Custom datasets](../LLM/Customization.md#-Recommended-Command-line-arguments) support json, jsonl formats. Here is an example of a custom dataset:
(Supports multi-turn dialogue, but each conversation can only include one image. Support local file paths or URLs for input)
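The sample itself is missing from this excerpt. A hedged sketch of the jsonl layout, with field names assumed from swift's custom-dataset conventions (one conversation per line, a single image per conversation for this model):
```bash
# Sketch: write a minimal custom dataset (field names such as "images" and "history" are assumptions)
cat > custom_dataset.jsonl <<'EOF'
{"query": "Describe this image.", "response": "A fluffy kitten with blue eyes.", "images": ["/path/to/image1.jpg"]}
{"query": "What color are its eyes?", "response": "Blue.", "history": [["Describe this image.", "A fluffy kitten with blue eyes."]], "images": ["http://example.com/image2.png"]}
EOF
```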
Inference with cogvlm2-19b-chat:
```shell
CUDA_VISIBLE_DEVICES=0 swift infer --model_type cogvlm2-19b-chat
```
Output: (supports passing local path or URL)
```python
"""
<<< Describe this image.
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
This image features a very young, fluffy kitten with a look of innocent curiosity. The kitten's fur is a mix of white, light brown, and dark gray with distinctive dark stripes that give it a tabby appearance. Its large, round eyes are a striking shade of blue with light reflections, which accentuate its youthful and tender expression. The ears are perky and alert, with a light pink hue inside, adding to the kitten's endearing look.
The kitten's fur is thick and appears to be well-groomed, with a soft, plush texture that suggests it is a breed known for its long, luxurious coat, such as a Maine Coon or a Persian. The white fur around its neck and chest stands out, providing a stark contrast to the darker shades on its back and head.
The background is blurred and warm-toned, providing a soft, neutral environment that ensures the kitten is the central focus of the image. The lighting is gentle, highlighting the kitten's features without casting harsh shadows, which further contributes to the image's warm and comforting ambiance.
Overall, the image captures the essence of a kitten's first year of life, characterized by its inquisitive nature, soft fur, and the endearing charm of youth.
"""
```
Fine-tuning multimodal large models usually uses **custom datasets**. Here is a demo that can be run directly:
(By default, lora fine-tuning is performed on the qkv of the language and vision models. If you want to fine-tune all linears, you can specify `--lora_target_modules ALL`)
```shell
# Experimental environment: A100
# 70GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_type cogvlm2-19b-chat \
--dataset coco-en-2-mini
```
[Custom datasets](../LLM/Customization.md#-Recommended-Command-line-arguments) support json, jsonl formats. Here is an example of a custom dataset:
(Supports multi-turn dialogue, but each conversation can only include one image. Support local file paths or URLs for input)
Inference with deepseek-vl-7b-chat:
```shell
CUDA_VISIBLE_DEVICES=0 swift infer --model_type deepseek-vl-7b-chat
# If you want to run it on a 3090, you can infer the 1.3b model
CUDA_VISIBLE_DEVICES=0 swift infer --model_type deepseek-vl-1_3b-chat
```
7b model effect demonstration: (supports passing local paths or URLs)
```python
"""
<<< Who are you?
I am an AI language model, designed to understand and generate human-like text based on the input I receive. I am not a human, but I am here to help answer your questions and assist you with any tasks you may have.
<<< <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png</img><img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png</img>What is the difference between these two images?
The image provided is a close-up of a kitten with big blue eyes, looking directly at the camera. The kitten appears to be a domestic cat, specifically a kitten, given its small size and youthful features. The image is a high-resolution, detailed photograph that captures the kitten's facial features and fur texture.
The second image is a cartoon illustration of three sheep standing in a grassy field with mountains in the background. The sheep are white with brown faces and legs, and they have large, round eyes. The illustration is stylized with a flat, two-dimensional appearance, and the colors are bright and vibrant. The sheep are evenly spaced and facing forward, giving the impression of a peaceful, pastoral scene.
The differences between the two images are primarily in their artistic styles and subjects. The first image is a realistic photograph of a kitten, while the second image is a stylized cartoon illustration of sheep. The first image is a photograph with a focus on the kitten's facial features and fur texture, while the second image is a cartoon with a focus on the sheep's characters and the setting. The first image is a single subject, while the second image features multiple subjects in a group.
<<< <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png</img>Write a poem based on the content of the image.
In the tranquil waters, a boat gently floats,
A beacon of light, a lone candle's soft glow.
The night is vast, a canvas of stars above,
A serene scene, a moment of peace, it seems to offer.
The boat, a vessel of wood and a mast so tall,
Carries a passenger, a figure so still.
The water's surface, a mirror of the night sky,
Reflects the boat's silhouette, a sight so divine.
The trees, standing tall, their forms in the distance,
A forest of mystery, a silent chorus.
The stars, scattered like diamonds in the heavens,
Illuminate the night, a celestial dance.
The boat, a symbol of journey and adventure,
In the quiet of the night, it's a sight to behold.
query: <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png</img>How far is it from each city?
response: The distance from each city is as follows:
- From "Mata", it is 14 km away.
- From "Yangjiang", it is 62 km away.
- From "Guangzhou", it is 293 km away.
These distances are clearly indicated on the green road sign with white text, providing the necessary information for travelers to gauge the distance to each city from the current location.
query: Which city is the farthest?
response: The farthest city from the current location is "Guangzhou", which is 293 km away. This is indicated by the longest number on the green road sign, which is larger than the distances to the other cities listed.
history: [['<img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png</img>How far is it from each city?', 'The distance from each city is as follows:\n\n- From "Mata", it is 14 km away.\n- From "Yangjiang", it is 62 km away.\n- From "Guangzhou", it is 293 km away.\n\nThese distances are clearly indicated on the green road sign with white text, providing the necessary information for travelers to gauge the distance to each city from the current location.'], ['Which city is the farthest?', 'The farthest city from the current location is "Guangzhou", which is 293 km away. This is indicated by the longest number on the green road sign, which is larger than the distances to the other cities listed.']]
"""
```
Multi-modal large model fine-tuning usually uses **custom datasets**. Here is a runnable demo:
LoRA fine-tuning:
(By default, only lora fine-tuning is performed on the qkv part of the LLM. If you want to fine-tune all linear parts including the vision model, you can specify `--lora_target_modules ALL`)
```shell
# Experimental environment: A10, 3090, V100
# 20GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_type deepseek-vl-7b-chat \
--dataset coco-en-mini
```
Full parameter fine-tuning:
```shell
# Experimental environment: 4 * A100
# 4 * 70GB GPU memory
NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
--model_type deepseek-vl-7b-chat \
--dataset coco-en-mini \
--sft_type full \
--use_flash_attn true \
--deepspeed default-zero2
```
[Custom datasets](../LLM/Customization.md#-Recommended-Command-line-arguments) supports json, jsonl styles. The following is an example of a custom dataset:
(Supports multi-turn conversations, supports multiple images per turn or no images, supports input of local paths or URLs)
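The example itself is missing from this excerpt. A hedged sketch of the jsonl layout for this model, where a single turn may reference several images or none (field names assumed from swift's custom-dataset conventions):
```bash
# Sketch: custom dataset lines with multiple images, or none (field names are assumptions)
cat > custom_dataset.jsonl <<'EOF'
{"query": "What is the difference between these two images?", "response": "The first shows sheep, the second a kitten.", "images": ["/path/to/animal.png", "/path/to/cat.png"]}
{"query": "Summarize our chat.", "response": "We compared two pictures.", "history": [["What is the difference between these two images?", "The first shows sheep, the second a kitten."]], "images": []}
EOF
```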
Inference with glm4v-9b-chat:
```shell
CUDA_VISIBLE_DEVICES=0 swift infer --model_type glm4v-9b-chat
```
Output: (supports passing local path or URL)
```python
"""
<<< Describe this image.
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
This is an image of a close-up of a kitten's face. The kitten has a fluffy coat with a mix of grey, white, and brown patches. The fur appears soft and well-groomed, with a gradient of colors that gives the appearance of a watercolor painting. The kitten's ears are perky and pointed, with a light brown inner coloring that matches the fur on its face.
The kitten's eyes are the most striking feature of this image. They are large, round, and a vivid blue, with a hint of green at the edges. The irises are clear and bright, and the pupils are slightly dilated, giving the eyes a lively and attentive look. The white fur around the eyes is well-defined, with a few whisker tufts poking out from the corners.
The kitten's nose is small and pink, with a slightly upturned tip, which is common in many breeds. The whiskers are long and white, and they are spread out symmetrically around the nose and mouth area. The mouth is closed, and the kitten's expression is one of curiosity or alertness.
The background is blurred, with a soft focus on what appears to be a green surface, possibly a plant or a blurred background element that doesn't detract from the kitten's features. The lighting in the image is gentle, with a warm tone that enhances the softness of the kitten's fur and the sparkle in its eyes.
response: The distance from each city to the next one is as follows:
1. From Mata to Yangjiang: 62 kilometers
2. From Yangjiang to Guangzhou: 293 kilometers
So, the total distance from Mata to Guangzhou is 62 kilometers (to Yangjiang) plus 293 kilometers (from Yangjiang to Guangzhou), which equals 355 kilometers.
query: Which city is the farthest?
response: The city that is the farthest away from the current location, as indicated on the road sign, is Guangzhou. It is 293 kilometers away.
"""
```
Fine-tuning multimodal large models usually uses **custom datasets**. Here is a demo that can be run directly:
(By default, lora fine-tuning is performed on the qkv of the language and vision models. If you want to fine-tune all linears, you can specify `--lora_target_modules ALL`)
```shell
# Experimental environment: A100
# 40GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_type glm4v-9b-chat \
--dataset coco-en-2-mini

# DDP
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 swift sft \
--model_type glm4v-9b-chat \
--dataset coco-en-2-mini#10000 \
--ddp_find_unused_parameters true
```
[Custom datasets](../LLM/Customization.md#-Recommended-Command-line-arguments) support json, jsonl formats. Here is an example of a custom dataset:
(Supports multi-turn dialogue, but each conversation can only include one image. Support local file paths or URLs for input)
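As an illustration, a minimal jsonl sketch (field names assumed from the generic custom-dataset convention in the linked doc; each line keeps a single image for the whole conversation, and the local path is a placeholder):
```jsonl
{"query": "Describe this image.", "response": "A kitten with blue eyes.", "images": ["path/to/cat.png"]}
{"query": "Which city is the farthest?", "response": "Guangzhou, 293 km away.", "history": [["How far is it to each city?", "Mata 14 km, Yangjiang 62 km, Guangzhou 293 km."]], "images": ["http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png"]}
```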
<<< <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png</img><img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png</img>What's the difference between these two images?
These two images are different. The first one is a picture of sheep, and the second one is a picture of a cat.
query: <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png</img>How far is it to each city?
response: The distance from Ma'anshan to Yangjiang is 62 kilometers, and the distance from Guangzhou to Guangzhou is 293 kilometers.
query: Which city is the farthest?
response: The farthest city is Guangzhou, with a distance of 293 kilometers from Guangzhou.
history: [['<img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png</img>How far is it to each city?', " The distance from Ma'anshan to Yangjiang is 62 kilometers, and the distance from Guangzhou to Guangzhou is 293 kilometers."], ['Which city is the farthest?', ' The farthest city is Guangzhou, with a distance of 293 kilometers from Guangzhou.']]
Fine-tuning of multimodal large models usually uses **custom datasets**. Here's a demo that can be run directly:
(By default, only the qkv part of the LLM is fine-tuned using LoRA. `--lora_target_modules ALL` is not supported. Full-parameter fine-tuning is supported.)
```shell
# Experimental environment: A10, 3090, V100, ...
# 21GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_type internlm-xcomposer2-7b-chat \
--dataset coco-en-mini
```
[Custom datasets](../LLM/Customization.md#-Recommended-Command-line-arguments) support json and jsonl formats. Here's an example of a custom dataset:
(Supports multi-turn conversations, each turn can contain multiple images or no images, supports passing local paths or URLs. This model does not support merge-lora)
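For illustration, a minimal jsonl sketch (field names assumed from the generic custom-dataset convention in the linked doc; multiple images for a sample go into the `images` list, and the local paths are placeholders):
```jsonl
{"query": "What is the difference between these two images?", "response": "One shows sheep, the other a kitten.", "images": ["path/to/animal.png", "path/to/cat.png"]}
{"query": "Write a short poem about the scene.", "response": "Lanterns on the river, a quiet night.", "history": [["Describe this image.", "A boat on a lake at night."]], "images": ["path/to/poem.png"]}
```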
- [Inference after Fine-tuning](#inference-after-fine-tuning)
## Environment Setup
```shell
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'
pip install Pillow
```
## Inference
Inference with [internvl-chat-v1.5](https://www.modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5/summary) and [internvl-chat-v1.5-int8](https://www.modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5-int8/summary).
The tutorial below takes `internvl-chat-v1.5` as an example; you can switch to the INT8 version by specifying `--model_type internvl-chat-v1_5-int8`, or select a Mini-InternVL model with `mini-internvl-chat-2b-v1_5` or `mini-internvl-chat-4b-v1_5`.
**Note**
- If you want to use a local model file, add the argument --model_id_or_path /path/to/model.
- If your GPU does not support flash attention, use the argument `--use_flash_attn false`. For int8 models, it is necessary to specify `--dtype bf16` during inference, otherwise the output may be garbled.
- The model's configuration specifies a relatively small max_length of 2048, which can be modified by setting `--max_length`.
- Memory consumption can be reduced by using the parameter `--gradient_checkpointing true`.
- The InternVL series of models only support training on datasets that include images.
```shell
# Experimental environment: A100
# 55GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift infer --model_type internvl-chat-v1_5 --dtype bf16 --max_length 4096
# 2*30GB GPU memory
CUDA_VISIBLE_DEVICES=0,1 swift infer --model_type internvl-chat-v1_5 --dtype bf16 --max_length 4096
```
Output: (supports passing in local path or URL)
```python
"""
<<< Describe this image.
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
This is a high-resolution image of a kitten. The kitten has striking blue eyes and a fluffy white and grey coat. The fur pattern suggests that it may be a Maine Coon or a similar breed. The kitten's ears are perked up, and it has a curious and innocent expression. The background is blurred, which brings the focus to the kitten's face.
<<< Write a poem based on the content of the picture.
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png
Token indices sequence length is longer than the specified maximum sequence length for this model (5142 > 4096). Running this sequence through the model will result in indexing errors
response: The distances from the location of the sign to each city are as follows:
- Mata: 14 kilometers
- Yangjiang: 62 kilometers
- Guangzhou: 293 kilometers
These distances are indicated on the road sign in the image.
query: Which city is the farthest?
response: The city that is farthest from the location of the sign is Guangzhou, which is 293 kilometers away.
history: [['How far is it from each city?', 'The distances from the location of the sign to each city are as follows:\n\n- Mata: 14 kilometers\n- Yangjiang: 62 kilometers\n- Guangzhou: 293 kilometers\n\nThese distances are indicated on the road sign in the image. '], ['Which city is the farthest?', 'The city that is farthest from the location of the sign is Guangzhou, which is 293 kilometers away. ']]
"""
```
Multimodal large model fine-tuning usually uses **custom datasets** for fine-tuning. Here is a demo that can be run directly:
LoRA fine-tuning:
**Note**
- If your GPU does not support flash attention, use the argument `--use_flash_attn false`.
- By default, only the qkv of the LLM part is fine-tuned using LoRA. If you want to fine-tune all linear layers including the vision model part, you can specify `--lora_target_modules ALL`.
```shell
# Experimental environment: A100
# 80GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_type internvl-chat-v1_5 \
--dataset coco-en-2-mini \
--max_length 4096
# device_map
# Experimental environment: 2*A100...
# 2*43GB GPU memory
CUDA_VISIBLE_DEVICES=0,1 swift sft \
--model_type internvl-chat-v1_5 \
--dataset coco-en-2-mini \
--max_length 4096
# ddp + deepspeed-zero2
# Experimental environment: 2*A100...
# 2*80GB GPU memory
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 swift sft \
--model_type internvl-chat-v1_5 \
--dataset coco-en-2-mini \
--max_length 4096 \
--deepspeed default-zero2
```
Full parameter fine-tuning:
```shell
# Experimental environment: 4 * A100
# device map
# 4 * 72GB GPU memory
CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
--model_type internvl-chat-v1_5 \
--dataset coco-en-2-mini \
--sft_type full \
--max_length 4096
```
[Custom datasets](../LLM/Customization.md#-Recommended-Command-line-arguments) support json, jsonl formats. Here is an example of a custom dataset:
(Only single-turn dialogue is supported. Each turn of dialogue must contain one image. Local paths or URLs can be passed in.)
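As a rough sketch of a single-turn, single-image sample (field names assumed from the generic custom-dataset convention in the linked doc; the URLs below are only examples):
```jsonl
{"query": "Describe this image.", "response": "A kitten with blue eyes.", "images": ["http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png"]}
{"query": "How far is it to each city?", "response": "Mata 14 km, Yangjiang 62 km, Guangzhou 293 km.", "images": ["http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png"]}
```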
The following practices take `llava-v1.6-mistral-7b` as an example. You can also switch to other models by specifying the `--model_type`.
## Table of Contents
- [Environment Setup](#environment-setup)
- [Inference](#inference)
- [Fine-tuning](#fine-tuning)
- [Inference after Fine-tuning](#inference-after-fine-tuning)
## Environment Setup
```shell
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'
```
## Inference
```shell
# Experimental environment: A100
# 20GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift infer --model_type llava1_6-mistral-7b-instruct
# 70GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift infer --model_type llava1_6-yi-34b-instruct
# 4*20GB GPU memory
CUDA_VISIBLE_DEVICES=0,1,2,3 swift infer --model_type llava1_6-yi-34b-instruct
```
Output: (supports passing in local path or URL)
```python
"""
<<< Describe this image.
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
The image shows a close-up of a kitten with a soft, blurred background that suggests a natural, outdoor setting. The kitten has a mix of white and gray fur with darker stripes, typical of a tabby pattern. Its eyes are wide open, with a striking blue color that contrasts with the kitten's fur. The kitten's nose is small and pink, and its whiskers are long and white, adding to the kitten's cute and innocent appearance. The lighting in the image is soft and diffused, creating a gentle and warm atmosphere. The focus is sharp on the kitten's face, while the rest of the image is slightly out of focus, which draws attention to the kitten's features.
response: The image shows a road sign indicating the distances to three cities: Mata, Yangjiang, and Guangzhou. The distances are given in kilometers.
- Mata is 14 kilometers away.
- Yangjiang is 62 kilometers away.
- Guangzhou is 293 kilometers away.
Please note that these distances are as the crow flies and do not take into account the actual driving distance due to road conditions, traffic, or other factors.
query: Which city is the farthest?
response: The farthest city listed on the sign is Mata, which is 14 kilometers away.
"""
```
Multimodal large model fine-tuning usually uses **custom datasets** for fine-tuning. Here is a demo that can be run directly:
LoRA fine-tuning:
(By default, only the qkv of the LLM part is fine-tuned using LoRA. If you want to fine-tune all linear layers including the vision model part, you can specify `--lora_target_modules ALL`.)
```shell
# Experimental environment: A10, 3090, V100...
# 21GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_type llava1_6-mistral-7b-instruct \
--dataset coco-en-2-mini

# 2*45GB GPU memory
CUDA_VISIBLE_DEVICES=0,1 swift sft \
--model_type llava1_6-yi-34b-instruct \
--dataset coco-en-2-mini
```
Full parameter fine-tuning:
```shell
# Experimental environment: 4 * A100
# 4 * 70GB GPU memory
NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
--model_type llava1_6-mistral-7b-instruct \
--dataset coco-en-2-mini \
--sft_type full \
--deepspeed default-zero2
# 8 * 50GB GPU memory
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 swift sft \
--model_type llava1_6-yi-34b-instruct \
--dataset coco-en-2-mini \
--sft_type full
```
[Custom datasets](../LLM/Customization.md#-Recommended-Command-line-arguments) support json, jsonl formats. Here is an example of a custom dataset:
(Only single-turn dialogue is supported. Each turn of dialogue must contain one image. Local paths or URLs can be passed in.)
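For illustration, a minimal jsonl sketch of single-turn, single-image samples (field names assumed from the generic custom-dataset convention in the linked doc; the local paths are placeholders):
```jsonl
{"query": "Describe this image.", "response": "A kitten with blue eyes.", "images": ["train_images/cat.jpg"]}
{"query": "How many sheep are in the picture?", "response": "There are four sheep.", "images": ["train_images/sheep.jpg"]}
```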
The following uses minicpm-v-3b-chat as an example. If you want to use the updated version of the MiniCPM-V multimodal model (v2), switch `--model_type minicpm-v-3b-chat` to `--model_type minicpm-v-v2-chat`.
## Table of Contents
- [Environment Setup](#environment-setup)
- [Inference](#inference)
- [Fine-tuning](#fine-tuning)
- [Inference After Fine-tuning](#inference-after-fine-tuning)
Inference for [minicpm-v-3b-chat](https://modelscope.cn/models/OpenBMB/MiniCPM-V/summary):
```shell
# Experimental environment: A10, 3090, V100, ...
# 10GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift infer --model_type minicpm-v-3b-chat
```
Output: (supports local path or URL input)
```python
"""
<<< Describe this image
Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
This image depicts a black and white cat sitting on the floor. The cat looks small, possibly a kitten. Its eyes are wide open, seeming to be observing the surroundings.
response: The distance from Guangzhou to Shenzhen is 293 kilometers, while the distance from Shenzhen to Guangzhou is 14 kilometers.
query: Which is the farthest city?
response: The farthest city is Shenzhen. It is located between Guangzhou and Shenzhen, 293 kilometers away from Guangzhou and 14 kilometers away from Shenzhen.
history: [['How far is it from each city?', ' The distance from Guangzhou to Shenzhen is 293 kilometers, while the distance from Shenzhen to Guangzhou is 14 kilometers.'], ['Which is the farthest city?', ' The farthest city is Shenzhen. It is located between Guangzhou and Shenzhen, 293 kilometers away from Guangzhou and 14 kilometers away from Shenzhen.']]
"""
```
Fine-tuning multimodal large models usually uses **custom datasets**. Here is a demo that can be run directly:
(By default, only the qkv part of LLM is fine-tuned using LoRA. If you want to fine-tune all linear parts including the vision model, you can specify `--lora_target_modules ALL`. Full parameter fine-tuning is also supported.)
```shell
# Experimental environment: A10, 3090, V100, ...
# 10GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_type minicpm-v-3b-chat \
--dataset coco-en-2-mini
```
[Custom datasets](../LLM/Customization.md#-Recommended-Command-line-arguments) support json and jsonl formats. Here is an example of a custom dataset:
(Supports multi-turn conversations, but the total round of conversations can only contain one image. Supports local path or URL input.)
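As a rough sketch where the whole conversation shares one image (field names assumed from the generic custom-dataset convention in the linked doc; the URL is only an example):
```jsonl
{"query": "Which city is the farthest?", "response": "Guangzhou, 293 km away.", "history": [["How far is it from each city?", "Mata 14 km, Yangjiang 62 km, Guangzhou 293 km."]], "images": ["http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png"]}
```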
Here we provide examples for four models (smaller models are chosen to make the experiments easier to run): qwen-vl-chat, qwen-vl, yi-vl-6b-chat, and minicpm-v-v2_5-chat. These examples cover three different types of MLLMs: models where a single turn of dialogue can contain multiple images (or no images), models where a single turn can contain only one image, and models where the entire dialogue revolves around a single image. They also illustrate the differences in deployment and invocation methods, as well as the differences between chat and base models within MLLMs.
If you're using qwen-audio-chat, simply replace the `<img>` tag with `<audio>` based on the qwen-vl-chat example.
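For example, once a qwen-audio-chat server has been deployed in the same way, a test request might look like the sketch below (the audio URL is a placeholder, and the `Audio 1:` prefix simply mirrors the `Picture 1:` convention of the qwen-vl-chat example; adjust as needed):
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen-audio-chat",
"messages": [{"role": "user", "content": "Audio 1:<audio>https://example.com/placeholder.wav</audio>\nWhat is said in this audio?"}]
}'
```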
## qwen-vl-chat
**Server**:
```bash
# Using the original model
CUDA_VISIBLE_DEVICES=0 swift deploy --model_type qwen-vl-chat
# Using the fine-tuned LoRA
CUDA_VISIBLE_DEVICES=0 swift deploy --ckpt_dir output/qwen-vl-chat/vx-xxx/checkpoint-xxx
# Using the fine-tuned Merge LoRA model
CUDA_VISIBLE_DEVICES=0 swift deploy --ckpt_dir output/qwen-vl-chat/vx-xxx/checkpoint-xxx-merged
```
**Client**:
Test:
```bash
curl http://localhost:8000/v1/chat/completions \
-H"Content-Type: application/json"\
-d'{
"model": "qwen-vl-chat",
"messages": [{"role": "user", "content": "Picture 1:<img>https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/rose.jpg</img>\nWhat kind of flower is in the picture and how many are there?"}],
response: The image captures a moment of tranquility featuring a gray and white kitten. The kitten, with its eyes wide open, is the main subject of the image. Its nose is pink, adding a touch of color to its gray and white fur. The kitten is sitting on a white surface, which contrasts with its gray and white fur. The background is blurred, drawing focus to the kitten. The image does not contain any text. The kitten's position relative to the background suggests it is in the foreground of the image. The image does not contain any other objects or creatures. The kitten appears to be alone in the image. The image does not contain any action, but the kitten's wide-open eyes give a sense of curiosity and alertness. The image does not contain any aesthetic descriptions. The image is a simple yet captivating portrait of a gray and white kitten.
response: The image captures a moment of tranquility featuring a gray and white kitten. The kitten, with its eyes wide open, is the main subject of the image. Its nose is pink, adding a touch of color to its gray and white fur. The kitten is sitting on a white surface, which contrasts with its gray and white fur. The background is blurred, drawing focus to the kitten. The image does not contain any text. The kitten's position relative to the background suggests it is in the foreground of the image. The image does not contain any other objects or creatures. The kitten appears to be alone in the image. The image does not contain any action, but the kitten's wide-open eyes give a sense of curiosity and alertness. The image does not contain any aesthetic descriptions. The image is a simple yet captivating portrait of a gray and white kitten.
query: How many sheep are in the picture?
response: There are four sheep in the picture.
"""
```
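Programmatic access goes through the same OpenAI-compatible endpoint. Below is a minimal Python sketch using the `openai` package (the port and the `<img>` tag convention follow the curl example above; the package choice and the dummy API key are assumptions, not the only supported client):
```python
from openai import OpenAI

# Point the client at the locally deployed OpenAI-compatible server.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# qwen-vl-chat takes image references embedded in the text via <img>...</img> tags.
query = (
    "Picture 1:<img>https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/rose.jpg</img>\n"
    "What kind of flower is in the picture and how many are there?"
)
resp = client.chat.completions.create(
    model="qwen-vl-chat",
    messages=[{"role": "user", "content": query}],
    temperature=0,
)
print(resp.choices[0].message.content)
```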
## minicpm-v-v2_5-chat
**Server side:**
```bash
# Using the original model
CUDA_VISIBLE_DEVICES=0 swift deploy --model_type minicpm-v-v2_5-chat
# Using the fine-tuned LoRA
CUDA_VISIBLE_DEVICES=0 swift deploy --ckpt_dir output/minicpm-v-v2_5-chat/vx-xxx/checkpoint-xxx
# Using the fine-tuned Merge LoRA model
CUDA_VISIBLE_DEVICES=0 swift deploy --ckpt_dir output/minicpm-v-v2_5-chat/vx-xxx/checkpoint-xxx-merged
```
**Client side:**
Test:
```bash
curl http://localhost:8000/v1/chat/completions \
-H"Content-Type: application/json"\
-d'{
"model": "minicpm-v-v2_5-chat",
"messages": [{"role": "user", "content": "Describe this image."}],
response: The image is a digital painting of a kitten, which is the main subject. The kitten's fur is rendered with a mix of gray, black, and white, giving it a realistic appearance. Its eyes are wide open, and the expression is one of curiosity or alertness. The background is blurred, which brings the focus entirely on the kitten. The painting style is detailed and lifelike, capturing the essence of a young feline's innocent and playful nature. The image does not convey any specific context or background story beyond the depiction of the kitten itself.
query: How was this picture generated?
response: This picture was generated using digital art techniques. The artist likely used a software program to create the image, manipulating pixels and colors to achieve the detailed and lifelike representation of the kitten. Digital art allows for a high degree of control over the final product, enabling artists to fine-tune details and create realistic textures and shading.
response: The image is a digital painting of a kitten, which is the main subject. The kitten's fur is rendered with a mix of gray, black, and white, giving it a realistic appearance. Its eyes are wide open, and the expression is one of curiosity or alertness. The background is blurred, which brings the focus entirely on the kitten. The painting style is detailed and lifelike, capturing the essence of a young feline's innocent and playful nature. The image does not convey any specific context or background story beyond the depiction of the kitten itself.
query: How was this picture generated?
response: This picture was generated using digital art techniques. The artist likely used a software program to create the image, manipulating pixels and colors to achieve the detailed and lifelike representation of the kitten. Digital art allows for a high degree of control over the final product, enabling artists to fine-tune details and create realistic textures and shading.
"""
```
## qwen-vl
**Server side:**
```bash
# Using the original model
CUDA_VISIBLE_DEVICES=0 swift deploy --model_type qwen-vl
# Using the fine-tuned LoRA
CUDA_VISIBLE_DEVICES=0 swift deploy --ckpt_dir output/qwen-vl/vx-xxx/checkpoint-xxx
# Using the fine-tuned Merge LoRA model
CUDA_VISIBLE_DEVICES=0 swift deploy --ckpt_dir output/qwen-vl/vx-xxx/checkpoint-xxx-merged
```
**Client side:**
Test:
```bash
curl http://localhost:8000/v1/completions \
-H"Content-Type: application/json"\
-d'{
"model": "qwen-vl",
"prompt": "Picture 1:<img>https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/rose.jpg</img>\nThis is a",
```shell
CUDA_VISIBLE_DEVICES=0 swift infer --model_type phi3-vision-128k-instruct
```
Output: (supports passing local path or URL)
```python
"""
<<< Who are you?
I am Phi, an AI developed by Microsoft to assist with providing information, answering questions, and helping users find solutions to their queries. How can I assist you today?
<<< <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png</img><img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png</img>What is the difference between these two pictures?
The first picture shows a group of four cartoon sheep standing in a field, while the second picture is a close-up of a kitten with a blurred background. The main difference between these two pictures is the subject matter and the setting. The first picture features animals that are typically associated with farm life and agriculture, while the second picture focuses on a domestic animal, a kitten, which is more commonly found in households. Additionally, the first picture has a more peaceful and serene atmosphere, while the second picture has a more intimate and detailed view of the kitten.
response: Guangzhou is the farthest city, located 293km away.
history: [['<img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/road.png</img>How far is it from each city?', 'The distances are as follows: Mata is 14km away, Yangjiang is 62km away, and Guangzhou is 293km away.'], ['Which city is the farthest?', 'Guangzhou is the farthest city, located 293km away.']]
"""
```
Multimodal large model fine-tuning usually uses **custom datasets**. Here is a demo that can be run directly:
(By default, only the qkv part of the LLM is fine-tuned with LoRA. If you want to fine-tune all linear modules, including the vision model part, you can specify `--lora_target_modules ALL`. Full-parameter fine-tuning is also supported.)
```shell
# Experimental environment: A10, 3090, V100, ...
# 16GB GPU memory
CUDA_VISIBLE_DEVICES=0 swift sft \
--model_type phi3-vision-128k-instruct \
--dataset coco-en-mini
```
[Custom datasets](../LLM/Customization.md#-Recommended-Command-line-arguments) supports json, jsonl styles. The following is an example of a custom dataset:
(Supports multi-turn dialogue; each turn can contain multiple images or no images; supports local paths or URLs as input.)
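For illustration, a minimal jsonl sketch (field names assumed from the generic custom-dataset convention in the linked doc; the image-free turn is shown with an empty `images` list, and the URLs are only examples):
```jsonl
{"query": "What is the difference between these two pictures?", "response": "The first shows cartoon sheep, the second a kitten.", "images": ["http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png", "http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png"]}
{"query": "Who are you?", "response": "I am Phi, an AI assistant.", "images": []}
```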