"src/TransferBench.cpp" did not exist on "ff2a96c9f7bc17d420788bfd9a45667976f6e49c"
Commit ee53747c authored by wanglch's avatar wanglch
Browse files

Initial commit

parents
Pipeline #2567 canceled with stages
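# Dockerfile (built by the `docker build` step described in the README below)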
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-py3.10-dtk24.04.3-ubuntu20.04
Phi-4-multimodal-instruct @ 959931af
Subproject commit 959931af6219c4423095b4256a2869378d4b3d24
# Phi-4-multimodal-instruct
## Paper
[Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs](https://arxiv.org/pdf/2503.01743)
## Model Architecture
Phi-4-multimodal is a single model with mixture-of-LoRAs that processes speech, vision, and language in the same representation space. The result is one unified model that handles text, audio, and visual inputs, with no need for complex pipelines or separate per-modality models. Phi-4-multimodal is built on a new architecture designed for efficiency and scalability: it uses a larger vocabulary for improved processing, supports multilingual capabilities, and integrates language reasoning with multimodal inputs.
<div align=center>
<img src="./Pic/arch.png"/>
</div>
## Algorithm
Mixture-of-LoRAs (MoA) is a parameter-efficient tuning method aimed at more effective multi-task learning for LLMs. MoA combines several domain-specific LoRA (Low-Rank Adaptation) modules and uses an explicit routing strategy, which reduces interference between tasks and improves the performance of each individual task. In addition, MoA allows the LoRA modules to be adapted iteratively, so the model can be quickly adapted to new domains.
<div align=center>
<img src="./Pic/theory.png"/>
</div>
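The routing idea can be sketched in a few lines of PyTorch. This is only an illustrative sketch (module names, ranks, and dimensions are made up for the example), not the released Phi-4 implementation:
```
# Illustrative Mixture-of-LoRAs sketch: one frozen base projection shared by
# several low-rank adapters, with an explicit (caller-selected) router.
# All names and sizes here are assumptions for the example, not the real model.
import torch
import torch.nn as nn


class LoRAAdapter(nn.Module):
    def __init__(self, dim, rank=8, alpha=16.0):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)  # A: project to low rank
        self.up = nn.Linear(rank, dim, bias=False)    # B: project back to dim
        self.scale = alpha / rank
        nn.init.zeros_(self.up.weight)                # adapter starts as a no-op

    def forward(self, x):
        return self.up(self.down(x)) * self.scale


class MixtureOfLoRAs(nn.Module):
    def __init__(self, dim, num_experts=3):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.requires_grad_(False)               # base weights stay frozen
        self.experts = nn.ModuleList([LoRAAdapter(dim) for _ in range(num_experts)])

    def forward(self, x, expert_id):
        # Explicit routing: the caller selects the domain-specific adapter
        # (e.g. 0=text, 1=vision, 2=speech), so tasks do not interfere.
        return self.base(x) + self.experts[expert_id](x)


layer = MixtureOfLoRAs(dim=64)
tokens = torch.randn(2, 10, 64)
print(layer(tokens, expert_id=2).shape)               # torch.Size([2, 10, 64])
```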
## Environment Setup
### Docker (Method 1)
Running in Docker is recommended. The [光源](https://www.sourcefind.cn/#/service-details) registry address for pulling the Docker image and the usage steps are given below.
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.3.0-py3.10-dtk24.04.3-ubuntu20.04
docker run -it --shm-size=1024G -v /path/your_code_data/:/path/your_code_data/ -v /opt/hyhal:/opt/hyhal --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name phi-4 <your IMAGE ID> bash # replace <your IMAGE ID> with the ID of the image pulled above
git clone http://developer.sourcefind.cn/codes/modelzoo/phi-4-multimodal-instruct_pytorch.git
cd /path/your_code_data/
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```
Tips: the DTK driver, Python, Torch, and other DCU-related tool versions listed above must match each other exactly.
### Dockerfile (Method 2)
Build and run with the provided Dockerfile:
```
docker build -t phi-4:latest .
docker run --shm-size 500g --network=host --name=phi-4 --privileged --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v <absolute path of the project>:/home/ -v /opt/hyhal:/opt/hyhal:ro -it <your IMAGE ID> bash
git clone http://developer.sourcefind.cn/codes/modelzoo/phi-4-multimodal-instruct_pytorch.git
cd /path/your_code_data/
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```
### Anaconda (Method 3)
Detailed steps for a local setup are given here.
The DCU-specific deep learning libraries required by this project can be downloaded from the [光合](https://developer.hpccube.com/tool/) developer community.
```
DTK driver: dtk24.04.3
python: 3.10
torch: 2.3.0
flash-attn: 2.6.1
```
`Tips: the DTK driver, Python, Torch, and other DCU-related tool versions listed above must match each other exactly.`
Install the remaining (non deep learning) dependencies from requirements.txt:
```
git clone http://developer.sourcefind.cn/codes/modelzoo/phi-4-multimodal-instruct_pytorch.git
cd /path/your_code_data/
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```
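After installation, a quick sanity check of the stack (expected versions are the ones listed above) might look like:
```
# Sanity-check the environment; versions should match the table above.
import torch
import flash_attn

print('torch:', torch.__version__)                     # expect 2.3.0 (DCU/DTK build)
print('flash-attn:', flash_attn.__version__)           # expect 2.6.1
print('device available:', torch.cuda.is_available())  # DCU devices are exposed through the torch.cuda API
```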
## Dataset
ms-swift ships with a test dataset: the [AI-ModelScope/LaTeX_OCR](https://www.modelscope.cn/datasets/AI-ModelScope/LaTeX_OCR/summary) dataset, which the training script downloads automatically. For full training, prepare your own data following the same format as this dataset.
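For reference, ms-swift generally consumes conversation-style jsonl records for custom multimodal data. The snippet below writes a hypothetical record in that style; the field names and file paths are assumptions for illustration, so check the ms-swift dataset documentation for the exact schema:
```
# Write a hypothetical LaTeX_OCR-style record as jsonl; field names and paths
# are assumptions for illustration only.
import json

samples = [
    {
        'messages': [
            {'role': 'user', 'content': '<image>Transcribe this formula into LaTeX.'},
            {'role': 'assistant', 'content': '\\frac{a}{b} + c^{2}'},
        ],
        'images': ['data/formula_0001.png'],
    },
]

with open('my_latex_ocr.jsonl', 'w', encoding='utf-8') as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + '\n')
```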
## Training
Fine-tune with the ms-swift framework:
```
git clone --depth 1 https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e . -i https://pypi.tuna.tsinghua.edu.cn/simple
```
### Single node, multiple GPUs
Run `sh phi4_finetune.sh`; the parameters below can be adjusted as needed.
```
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
--model '/home/wanglch/Phi4/Phi-4-multimodal-instruct' \
--dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#20000' \
--train_type dummy \
--torch_dtype bfloat16 \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--learning_rate 1e-4 \
--gradient_accumulation_steps 16 \
--eval_steps 200 \
--save_steps 200 \
--save_total_limit 5 \
--logging_steps 5 \
--max_length 2048 \
--output_dir output \
--warmup_ratio 0.05 \
--dataloader_num_workers 4
```
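Note: with `CUDA_VISIBLE_DEVICES=0,1,2,3` the effective global batch size is per_device_train_batch_size × gradient_accumulation_steps × number of devices = 1 × 16 × 4 = 64; if you change the number of visible devices, adjust `gradient_accumulation_steps` to keep the effective batch size comparable.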
## Inference
### Single node, single GPU
Text inference
```
python phi4_text_inference.py
```
Image inference
```
python phi4_vision_inference.py
```
Audio inference
```
python phi4_speech_inference.py
```
## Results
- Text inference
<div align=left>
<img src="./Pic/text.png"/>
</div>
- Image inference
<div align=left>
<img src="./Pic/image.png"/>
</div>
- Audio inference
<div align=left>
<img src="./Pic/speech.png"/>
</div>
### Accuracy
## Application Scenarios
### Algorithm Category
`Conversational Q&A`
### Key Application Industries
`Research, Education, Government, Finance`
## Pretrained Weights
The model can be searched for and downloaded on [SCNet](http://113.200.138.88:18080/aimodels/).
- [SCNet download link for the LLM-Research/Phi-4-multimodal-instruct model](http://113.200.138.88:18080/aimodels/microsoft/Phi-4-multimodal-instruct)
## Source Repository & Issue Feedback
- https://developer.sourcefind.cn/codes/modelzoo/phi-4-multimodal-instruct_pytorch
## References
- https://arxiv.org/pdf/2503.01743
- https://huggingface.co/microsoft/Phi-4-multimodal-instruct

# Model unique identifier
modelCode=1465
# Model name
modelName=Phi-4-multimodal-instruct_pytorch
# Model description
modelDescription=Phi-4-multimodal-instruct is a powerful lightweight multimodal foundation model released by Microsoft. It currently offers image understanding in English, together with speech understanding that surpasses Whisper V3.
# Application scenarios
appScenario=Inference, training, conversational Q&A, research, education, government, finance
# Framework type
frameType=Pytorch
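# phi4_speech_inference.py (the audio inference script referenced in the README above): speech-only inference, single-turn and multi-turn.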
import os
import requests
import torch
from PIL import Image
import soundfile
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
model_path = '/home/wanglch/Phi4/Phi-4-multimodal-instruct/'
kwargs = {}
kwargs['torch_dtype'] = torch.bfloat16
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype='auto',
    _attn_implementation='flash_attention_2',
).cuda()
generation_config = GenerationConfig.from_pretrained(model_path, 'generation_config.json')
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
AUDIO_FILE_1 = '/home/wanglch/Phi4/Phi-4-multimodal-instruct/examples/what_is_the_traffic_sign_in_the_image.wav'
AUDIO_FILE_2 = '/home/wanglch/Phi4/Phi-4-multimodal-instruct/examples/what_is_shown_in_this_image.wav'
if not os.path.exists(AUDIO_FILE_1):
    raise FileNotFoundError(f'Please prepare the audio file {AUDIO_FILE_1} before running the following code.')
########################## speech only ################################
speech_prompt = "Based on the attached audio, generate a comprehensive text transcription of the spoken content."
prompt = f'{user_prompt}<|audio_1|>{speech_prompt}{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')
audio = soundfile.read(AUDIO_FILE_1)
inputs = processor(text=prompt, audios=[audio], return_tensors='pt').to('cuda:0')
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
if not os.path.exists(AUDIO_FILE_2):
    raise FileNotFoundError(f'Please prepare the audio file {AUDIO_FILE_2} before running the following code.')
########################### speech only (multi-turn) ################################
audio_1 = soundfile.read(AUDIO_FILE_2)
audio_2 = soundfile.read(AUDIO_FILE_1)
chat = [
    {'role': 'user', 'content': f'<|audio_1|>Based on the attached audio, generate a comprehensive text transcription of the spoken content.'},
    {
        'role': 'assistant',
        'content': "What is shown in this image.",
    },
    {'role': 'user', 'content': f'<|audio_2|>Based on the attached audio, generate a comprehensive text transcription of the spoken content.'},
]
prompt = processor.tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
# Remove a trailing <|endoftext|> if present; it is only needed for training, not inference.
# (For training, make sure <|endoftext|> is appended at the end.)
if prompt.endswith('<|endoftext|>'):
    prompt = prompt[:-len('<|endoftext|>')]  # slice off the exact suffix; rstrip would strip characters, not the substring
print(f'>>> Prompt\n{prompt}')
inputs = processor(text=prompt, audios=[audio_1, audio_2], return_tensors='pt').to('cuda:0')
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
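# phi4_text_inference.py (the text inference script referenced in the README above): text-only inference.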
import os
import requests
import torch
from PIL import Image
import soundfile
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
model_path = '/home/wanglch/Phi4/Phi-4-multimodal-instruct/'
kwargs = {}
kwargs['torch_dtype'] = torch.bfloat16
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype='auto',
    _attn_implementation='flash_attention_2',
).cuda()
generation_config = GenerationConfig.from_pretrained(model_path, 'generation_config.json')
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
#################################################### text-only ####################################################
prompt = f'{user_prompt}what is the answer for 1+1? Explain it.{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')
inputs = processor(prompt, images=None, return_tensors='pt').to('cuda:0')
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
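# phi4_vision_inference.py (the image inference script referenced in the README above): image inference, single-turn and multi-turn.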
import os
import requests
import torch
from PIL import Image
import soundfile
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
model_path = '/home/wanglch/Phi4/Phi-4-multimodal-instruct'
kwargs = {}
kwargs['torch_dtype'] = torch.bfloat16
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype='auto',
    _attn_implementation='flash_attention_2',
).cuda()
generation_config = GenerationConfig.from_pretrained(model_path, 'generation_config.json')
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
#################################################### vision (single-turn) ####################################################
# single-image prompt
prompt = f'{user_prompt}<|image_1|>What is shown in this image?{prompt_suffix}{assistant_prompt}'
url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
print(f'>>> Prompt\n{prompt}')
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors='pt').to('cuda:0')
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
#################################################### vision (multi-turn) ####################################################
# chat template
chat = [
    {'role': 'user', 'content': f'<|image_1|>What is shown in this image?'},
    {
        'role': 'assistant',
        'content': "The image depicts a street scene with a prominent red stop sign in the foreground. The background showcases a building with traditional Chinese architecture, characterized by its red roof and ornate decorations. There are also several statues of lions, which are common in Chinese culture, positioned in front of the building. The street is lined with various shops and businesses, and there's a car passing by.",
    },
    {'role': 'user', 'content': 'What is so special about this image'},
]
url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
image = Image.open(requests.get(url, stream=True).raw)
prompt = processor.tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
# Remove a trailing <|endoftext|> if present; it is only needed for training, not inference.
# (For training, make sure <|endoftext|> is appended at the end.)
if prompt.endswith('<|endoftext|>'):
    prompt = prompt[:-len('<|endoftext|>')]  # slice off the exact suffix; rstrip would strip characters, not the substring
print(f'>>> Prompt\n{prompt}')
inputs = processor(prompt, [image], return_tensors='pt').to('cuda:0')
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')