# Microsoft Open Source Code of Conduct
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
Resources:
- [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/)
- [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)
- Contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with questions or concerns
MIT License
Copyright (c) Microsoft Corporation.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# Magma
A new era for embodied intelligence! VLA gets its strongest foundation model yet, Magma: an all-rounder for UI navigation and robot manipulation.
## Paper
`Magma: A Foundation Model for Multimodal AI Agents`
- https://arxiv.org/pdf/2502.13130
## Model Architecture
A vision encoder V encodes each frame into multiple tokens; all visual tokens are concatenated into one sequence and fed, together with the language tokens that encode the task description, into a decoder-only language model (LLM).
<div align=center>
<img src="./doc/Magma.png"/>
</div>
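As a rough, self-contained illustration of this encode-and-concatenate design, the sketch below uses toy modules; every class name, layer choice, and size is made up for illustration and is not the actual Magma implementation.
```python
import torch
import torch.nn as nn

class ToyMagma(nn.Module):
    """Schematic only: a vision encoder turns each frame into visual tokens, which are
    concatenated with the task-description tokens and fed to a decoder-only LM."""
    def __init__(self, d_model=64, vocab_size=1000):
        super().__init__()
        self.vision_encoder = nn.Conv2d(3, d_model, kernel_size=16, stride=16)  # stand-in for the real encoder
        self.text_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)  # causal masking omitted for brevity
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, frames, text_ids):
        b, t = frames.shape[:2]
        vis = self.vision_encoder(frames.flatten(0, 1))                    # (b*t, d, h', w')
        vis = vis.flatten(2).transpose(1, 2).reshape(b, -1, vis.shape[1])  # all frames' tokens in one sequence
        txt = self.text_embed(text_ids)                                    # language tokens for the task
        seq = torch.cat([vis, txt], dim=1)                                 # single multimodal sequence
        return self.lm_head(self.decoder(seq))

logits = ToyMagma()(torch.randn(1, 2, 3, 64, 64), torch.randint(0, 1000, (1, 8)))
print(logits.shape)  # (1, num_visual_tokens + 8, vocab_size)
```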
## Algorithm
Using Set-of-Mark (SoM) and Trace-of-Mark (ToM), vision-language data is converted into actionable tasks, which markedly improves spatial intelligence and task generalization; the model can understand and execute multimodal tasks in both digital and physical environments.
The researchers propose a simple and effective approach that combines Set-of-Mark (SoM) and Trace-of-Mark (ToM) to extend the model to spatial prediction tasks (e.g., clickable buttons) and to the temporal dimension.
<div align=center>
<img src="./doc/algorithm.png"/>
</div>
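To make SoM and ToM more concrete, here is a hedged sketch of how candidate regions and point tracks could be turned into mark-based prompts and targets. The wording loosely follows the mark-based prompt in `agents/robot_traj/app.py` further down in this repo, but the field names and data format here are illustrative assumptions, not the official pipeline.
```python
# Illustrative only: candidate regions (e.g., clickable buttons) with pixel centers.
regions = {1: (120, 48), 2: (300, 210), 3: (512, 400)}

# Set-of-Mark (SoM): the frame is overlaid with numeric marks, and the model answers
# with a mark id instead of raw coordinates.
som_prompt = (
    f"The image is labeled with numeric marks {sorted(regions)}.\n"
    "Which mark should be clicked to open the settings page?"
)
som_target = "3"  # hypothetical ground truth

# Trace-of-Mark (ToM): for video/robot data, each mark's future positions form the
# target, turning "what moves where next" into ordinary token prediction.
future_positions = {1: [(120, 48), (122, 50), (126, 55), (133, 61)]}
tom_target = "Mark 1 moves along: " + str(future_positions[1])

print(som_prompt)
print(tom_target)
```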
## Environment Setup
```
mv Magma_pytorch Magma # drop the framework-name suffix
```
### Docker (Method 1)
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.4.1-ubuntu22.04-dtk25.04-py3.10-fixpy
# Replace <your IMAGE ID> with the image ID of the Docker image pulled above; for this image it is 6063b673703a
docker run -it --shm-size=64G -v $PWD/Magma:/home/Magma -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name magma <your IMAGE ID> bash
cd /home/Magma
pip install -e . -i https://mirrors.aliyun.com/pypi/simple
pip install https://download.sourcefind.cn:65024/directlink/4/tensorflow/DAS1.5/tensorflow-2.13.1+das.opt1.dtk2504-cp310-cp310-linux_x86_64.whl -i https://mirrors.aliyun.com/pypi/simple # tensorflow=2.13.1
```
### Dockerfile (Method 2)
```
cd /home/Magma/docker
docker build --no-cache -t magma:latest .
docker run --shm-size=64G --name magma -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v $PWD/../../Magma:/home/Magma -it magma bash
# If installing the environment through the Dockerfile takes a long time, comment out the pip installs inside it and install the Python libraries after the container starts: pip install -r requirements.txt.
pip install -e . -i https://mirrors.aliyun.com/pypi/simple
pip install https://download.sourcefind.cn:65024/directlink/4/tensorflow/DAS1.5/tensorflow-2.13.1+das.opt1.dtk2504-cp310-cp310-linux_x86_64.whl -i https://mirrors.aliyun.com/pypi/simple # tensorflow=2.13.1
```
### Anaconda (Method 3)
1. The DCU-specific deep learning libraries required by this project can be downloaded from the 光合 (SourceFind) developer community:
- https://developer.sourcefind.cn/tool/
```
DTK driver: dtk2504
python: 3.10
torch: 2.4.1
torchvision: 0.19.1
triton: 3.0.0
vllm: 0.6.2
flash-attn: 2.6.1
deepspeed: 0.14.2
apex: 1.4.0
transformers: 4.51.3
tensorflow: 2.13.1
```
`Tips: the versions of the DTK driver, python, torch, and the other DCU-related tools above must correspond exactly, one to one.`
2. Install the other, non-special libraries according to requirements.txt:
```
cd /home/Magma
pip install -e . -i https://mirrors.aliyun.com/pypi/simple
```
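After the install, a quick sanity check of the environment (assuming the DCU build of PyTorch exposes the standard `torch.cuda` interface, as the version list above suggests) can look like this:
```python
import torch
import transformers

# Expect versions matching the list above, e.g. torch 2.4.1 and transformers 4.51.3.
print("torch:", torch.__version__, "| transformers:", transformers.__version__)
print("device available:", torch.cuda.is_available(), "| device count:", torch.cuda.device_count())
```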
## Dataset
`None`
## Training
`None`
## Inference
Directory structure of the pretrained weights:
```
/home/Magma
└── microsoft/Magma-8B
```
Set the Hugging Face download mirror:
```
export HF_ENDPOINT=https://hf-mirror.com
```
Then, when the inference command is run, the project automatically downloads the model laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg and caches the encoded result; the upstream author's code does not support loading this model from local weights.
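If the automatic download is inconvenient (for example on a node with restricted network access), one possible workaround, not a documented workflow of this project, is to pre-populate the Hugging Face cache with `huggingface_hub` so the download above hits the local cache:
```python
# Hypothetical pre-download step; the repo ids come from the notes above.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="microsoft/Magma-8B")
snapshot_download(repo_id="laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg")
```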
### Single-node multi-card
```
cd /home/Magma
python infer_transformers.py
```
For more details, see the upstream project's [`README_origin`](./README_origin.md).
## Result
`Input:`
```
prompt: "What is the letter on the robot?"
image: "./assets/images/magma_logo.jpg"
```
`Output:`
```
response: The letter on the robot is "M".
```
Official demo example:
<div align=center>
<img src="./doc/magma_mushroom.gif"/>
</div>
### Accuracy
Accuracy on DCU is consistent with GPU; inference framework: PyTorch.
## Application Scenarios
### Algorithm Category
`Embodied AI`
### Key Application Industries
`Manufacturing, Home, Healthcare, Energy, Education`
## Pretrained Weights
HF/GitHub download: [microsoft/Magma-8B](https://huggingface.co/microsoft/Magma-8B)
## Source Repository and Issue Reporting
- http://developer.sourcefind.cn/codes/modelzoo/InfiniteYou_pytorch.git
## References
- https://github.com/microsoft/Magma.git
<div align="center">
<h2>🤖 Magma: A Foundation Model for Multimodal AI Agents</h2>
[Jianwei Yang](https://jwyang.github.io/)<sup>*</sup><sup>1</sup><sup></sup>&nbsp;
[Reuben Tan](https://cs-people.bu.edu/rxtan/)<sup>1</sup><sup></sup>&nbsp;
[Qianhui Wu](https://qianhuiwu.github.io/)<sup>1</sup><sup></sup>&nbsp;
[Ruijie Zheng](https://ruijiezheng.com/)<sup>2</sup><sup></sup>&nbsp;
[Baolin Peng](https://scholar.google.com/citations?user=u1CNjgwAAAAJ&hl=en&oi=ao)<sup>1</sup><sup></sup>&nbsp;
[Yongyuan Liang](https://cheryyunl.github.io)<sup>2</sup><sup></sup>
[Yu Gu](http://yu-gu.me/)<sup>1</sup>&nbsp;
[Mu Cai](https://pages.cs.wisc.edu/~mucai/)<sup>3</sup>&nbsp;
[Seonghyeon Ye](https://seonghyeonye.github.io/)<sup>4</sup>&nbsp;
[Joel Jang](https://joeljang.github.io/)<sup>5</sup>&nbsp;
[Yuquan Deng](https://scholar.google.com/citations?user=LTC0Q6YAAAAJ&hl=en)<sup>5</sup>&nbsp;
[Lars Liden](https://sites.google.com/site/larsliden)<sup>1</sup>&nbsp;
[Jianfeng Gao](https://www.microsoft.com/en-us/research/people/jfgao/)<sup>1</sup><sup></sup>
<sup>1</sup> Microsoft Research; <sup>2</sup> University of Maryland; <sup>3</sup> University of Wisconsin-Madison
<sup>4</sup> KAIST; <sup>5</sup> University of Washington
<sup>*</sup> Project lead <sup></sup> First authors <sup></sup> Second authors <sup></sup> Leadership
<h3 style="color:#b22222;"> To Appear at CVPR 2025 </h3>
<h4>
<a href="https://www.arxiv.org/pdf/2502.13130">📄 arXiv Paper</a> &nbsp;
<a href="https://microsoft.github.io/Magma/">🌐 Project Page</a> &nbsp;
<a href="https://huggingface.co/microsoft/Magma-8B">🤗 Hugging Face Model</a>
<a href="https://ai.azure.com/explore/models/microsoft-magma-8b/version/1/registry/HuggingFace?tid=72f988bf-86f1-41af-91ab-2d7cd011db47">☁️ Azure AI Foundry</a>
<a href="https://www.youtube.com/watch?v=SbfzvUU5yM8">📺 Video</a>
</h4>
<!-- <h3>
<a href="https://huggingface.co/spaces/microsoft/Magma-UI">🤗 Gradio UI Agent</a>
<a href="https://huggingface.co/spaces/microsoft/Magma-Gaming">🤗 Gradio Gaming Agent</a>
</h3> -->
</div>
<div align="center">
<p2>The Path Towards Multimodal AI Agents</p2>
<img src="assets/images/magma_teaser.png?raw=true" width="100%">
</div>
## :sparkles: Highlights
* **Digital and Physical Worlds:** Magma is the first-ever foundation model for multimodal AI agents, designed to handle complex interactions across both virtual and real environments!
* **Versatile Capabilities:** Magma, as a single model, not only possesses generic image and video understanding ability, but also generates goal-driven visual plans and actions, making it versatile for different agentic tasks!
* **State-of-the-art Performance:** Magma achieves state-of-the-art performance on various multimodal tasks, including UI navigation, robotics manipulation, as well as generic image and video understanding, in particular the spatial understanding and reasoning!
* **Scalable Pretraining Strategy:** Magma is designed to be **learned scalably from unlabeled videos** in the wild in addition to the existing agentic data, giving it strong generalization ability and making it suitable for real-world applications!
## :fire: News
* **[2025.04.12]** 🔥We released the pretraining videos with visual traces on hugging face [Magma-Video-ToM](https://huggingface.co/datasets/MagmaAI/Magma-Video-ToM).
* **[2025.04.06]** Open X-Embodiment pretraining data with visual traces can be downloaded from [Magma-OXE-ToM](https://huggingface.co/datasets/MagmaAI/Magma-OXE-ToM).
* **[2025.03.16]** We released the demo code for generating SoM and ToM for instructional videos (i.e., Alg. 2 in our paper) in [SoM/ToM Generation](#som-and-tom-generation).
* **[2025.03.09]** 🔥 We released the Magma training code, and an example for training Magma-8B on the Magma-820K dataset. Check out [Model Training](#model-training).
* **[2025.03.06]** We released a new demo for showing robot planning capabilities. Run `python agents/robot_traj/app.py` to start the demo!
* **[2025.02.28]** We released two demos, [Magma-UI](https://huggingface.co/spaces/microsoft/Magma-UI) and [Magma-Gaming](https://huggingface.co/spaces/microsoft/Magma-Gaming) on Hugging Face. Check out our model's action grounding and planning capabilities!
* **[2025.02.26]** ⭐ Exciting News! Magma got accepted by CVPR 2025!
* **[2025.02.25]** 🎉 Big News! We are releasing the Magma model on [Hugging Face](https://huggingface.co/microsoft/Magma-8B) and [Azure AI Foundry](https://ai.azure.com/explore/models/microsoft-magma-8b/version/1/registry/HuggingFace?tid=72f988bf-86f1-41af-91ab-2d7cd011db47)!
* **[2025.02.23]** We released the Magma Inference code!
* **[2025.02.20]** Magma has reached the top spot on [Hacker News](https://news.ycombinator.com/front)!
* **[2025.02.19]** We will be releasing our code, model and UI navigation demo by [MSR Forum on 02.25 next Tuesday](https://researchforum.microsoft.com/)!
* **[2025.02.18]** Our Flagship Project Magma at MSR is released on [arXiv](https://www.arxiv.org/pdf/2502.13130)!
## :bookmark_tabs: Todos
We will be releasing all the following contents:
- [x] Model inference code
- [x] Add UI and Gaming agent Demos
- [x] Model checkpoint
- [x] Training code
- [x] Open-XE pretraining data with traces
- [x] Video pretraining data with traces
## :clipboard: Outline
- [What is Magma?](#what-is-magma)
- [How we pretrain Magma?](#how-we-pretrain-magma)
- [Installation](#installation)
- [Data Preprocessing](#data-preprocessing)
- [SoM and ToM Generation](#som-and-tom-generation)
- [Model Training](#model-training)
- [Pretraining on Open-X without SoM/ToM](#pretraining-on-open-x-without-somtom)
- [Finetuning on Magma-820K](#finetuning-on-magma-820k)
- [Model Usage](#model-usage)
- [Inference](#inference)
- [Inference with Huggingface Transformers](#inference-with-huggingface-transformers)
- [Inference with local code from this repo](#inference-with-local-code-from-this-repo)
- [Inference with bitsandbytes](#inference-with-bitsandbytes)
- [Benchmarking](#benchmarking)
- [Evaluation with lmms-eval](#evaluation-with-lmms-eval)
- [Evaluation with SimplerEnv](#evaluation-with-simplerenv)
- [Multi-images or Video](#multi-images-or-video)
- [Agent Demos](#agent-demos)
- [UI Agent](#ui-agent)
- [Gaming Agent](#gaming-agent)
- [Robot Visual Planning](#robot-visual-planning)
- [Citation](#citation)
- [Acknowledgements](#acknowledgements)
## What is Magma?
<div align="center">
<img src="assets/images/magma_intro_fig.png?raw=true" width="50%">
</div>
**Magma is a foundation model for multimodal AI agents**. As the bedrock for multimodal agentic models, it should possess strong capabilities to perceive the multimodal world AND take goal-driven actions precisely (see the figure above). With this in mind, we are striving for the following goals:
* **Verbal and spatial-temporal intelligence:** Magma is supposed to have both strong verbal and spatial-temporal intelligence to understand images and videos, ground its actions on the observations, and further translate an external goal into an action plan and execution.
* **Digital and physical world:** Magma should not be limited to either the digital world (e.g., web navigation) or the physical world (e.g., robotics manipulation), but rather be able to work across both worlds, just like humans.
With this in mind, we developed a new pretraining dataset, which mostly consists of unlabeled videos in the wild plus existing annotated agentic data, and a new pretraining framework, which unifies the training of all three modalities (text, image, and action), to train a new foundation model for multimodal AI agents, named Magma.
## How we pretrain Magma?
<div align="center">
<img src="assets/images/magma_pt_v3.png?raw=true" width="100%">
</div>
We pursue the goal through two dimensions:
* **Large-scale heterogeneous training data**: we curate a large amount of data, including existing multimodal understanding data, UI navigation data, robotics manipulation data, and unlabeled videos in the wild. We also propose a new data collection pipeline for unlabeled in-the-wild videos that is scalable and cost-effective. To obtain useful action supervision from raw videos and robotics trajectories, we meticulously removed the camera motion in the videos and then transformed the remaining motion into "action" supervision for model training. These provide unique signals for the model to learn cross-modal connections and long-horizon action prediction and planning.
* **Universal pretraining objectives**: text and action tokens are inherently different, which creates a large gap between them, while visual tokens are continuous. We propose a universal pretraining framework that unifies the training of all three modalities, and we show that this is crucial for the model to learn cross-modal connections. More specifically, we propose Set-of-Mark and Trace-of-Mark as auxiliary tasks for model pretraining, serving as the bridge between the different output modalities. In this way, we build a strong alignment between the text and action modalities, and also between the image and action modalities (a sketch of the action-to-token side of this idea follows below).
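To make the shared token space more concrete, the following is a minimal sketch of how a normalized 7-DoF action could be discretized into token ids taken from the tail of the text vocabulary. It mirrors the inverse mapping used in `get_magma_action` in `libero_magma_utils.py` later in this repo (256 bins over [-1, 1]); the exact pretraining-time tokenization is an assumption here, not the authors' verbatim recipe.
```python
import numpy as np

def discretize_action(action, vocab_size, n_action_bins=256):
    """Map a normalized action in [-1, 1] to token ids at the end of the vocabulary.

    Sketch only: this is the inverse of the de-tokenization in get_magma_action;
    the real pretraining details may differ."""
    bins = np.linspace(-1, 1, n_action_bins)
    bin_centers = (bins[:-1] + bins[1:]) / 2.0
    # nearest bin center for each action dimension
    bin_ids = np.argmin(np.abs(action[:, None] - bin_centers[None, :]), axis=1)
    # reuse the last bins of the text vocabulary as "action tokens"
    return vocab_size - (bin_ids + 1)

# Example: (dx, dy, dz, droll, dpitch, dyaw, gripper), already normalized to [-1, 1]
action = np.array([0.1, -0.2, 0.05, 0.0, 0.0, 0.3, 1.0])
print(discretize_action(action, vocab_size=128256))
```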
## Installation
1. Clone this repo to your local machine:
```bash
git clone https://github.com/microsoft/Magma
cd Magma
```
2. Install the dependencies:
```bash
conda create -n magma python=3.10 -y
conda activate magma
pip install --upgrade pip
pip install -e .
```
3. Install packages for training:
```bash
pip install -e ".[train]"
```
4. Install packages for agents:
```bash
pip install -e ".[agent]"
```
5. Other probably needed packages:
* Co-tracker
```sh
# Install co-tracker
git clone https://github.com/facebookresearch/co-tracker
cd co-tracker
pip install -e .
pip install imageio[ffmpeg]
cd ../
```
* Kmeans
```sh
# Install kmeans_pytorch; note: installing it directly with pip leads to an error
git clone https://github.com/subhadarship/kmeans_pytorch
cd kmeans_pytorch
pip install -e .
cd ../
```
* Misc
```sh
# Install other packages
pip install ipython
pip install faiss-cpu
pip install decord
```
⚠️ Please make sure you have installed transformers with the correct version (>=4.49.0). If you see any abnormal behavior, check your transformers version, and see the customized transformers below if needed.
<details>
<summary>Click to expand</summary>
### Customized Transformers
⚠️ One important thing to note is that our model uses [ConvNext](https://github.com/huggingface/pytorch-image-models/blob/main/timm/models/convnext.py) as the backbone, which contains a layer-scale parameter [gamma](https://github.com/huggingface/pytorch-image-models/blob/e44f14d7d2f557b9f3add82ee4f1ed2beefbb30d/timm/models/convnext.py#L144). This triggers a bug in the Transformers library, which automatically replaces 'gamma' with 'weight' when loading the model. To fix this, we need to modify the 'transformers/models/auto/modeling_auto.py' file as follows:
```python
if "gamma" in key and "clip_vision_model" not in key:
key = key.replace("gamma", "weight")
```
This bug still exists in the latest transformers version, so please make sure you install the bug-free customized version of transformers listed in [pyproject.toml](./pyproject.toml):
```bash
pip install git+https://github.com/jwyang/transformers.git@dev/jwyang-v4.44.1
```
or the newest version:
```bash
pip install git+https://github.com/jwyang/transformers.git@dev/jwyang-v4.48.2
```
</details>
## Data Preprocessing
### SoM and ToM Generation
As shown in Table 1 of our paper, we apply SoM and ToM to both robotics data and instructional videos. To ensure reproducibility, we provide the code to generate SoM and ToM for instructional videos. The code is located in `tools/som_tom/demo.py`. You can run the following command to generate SoM and ToM for an example video:
```bash
python tools/som_tom/demo.py
```
And then you can find two videos in the `tools/som_tom/videos` folder. The original trace extracted from CoTracker is shown in `orig_trace.mp4`, and the SoM-ToM video is named `som_tom.mp4`.
## Model Training
We provide instructions to pretrain LLaMA-3-8B-Instruct on Open-X-Embodiment and to finetune Magma-8B on different downstream tasks.
### Pretraining on Open-X without SoM/ToM
* Data Preparation
Download Open-X-Embodiment from the official site. Then edit the data config file [openx.yaml](data_configs/openx.yaml) accordingly. The data config file should look like this:
```yaml
# a list of all the data paths
DATA_PATH:
- "/path/to/open-x"
IMAGE_FOLDER:
- "siglip-224px+mx-oxe-magic-soup"
LANGUAGE_PATH:
- ""
```
* Pretrain on OpenX
Once the dataset and config are set up, you can run the following command to pretrain the model:
```bash
sh scripts/pretrain/pretrain_openx.sh
```
*Benefit: we spent tremendous effort to decouple the Open-X dataloader from OpenVLA and make it compatible with the other datasets used in our experiments.*
### Finetuning on Magma-820K
* Data Preparation
Download annotation file from [MagmaAI/Magma-820K](https://huggingface.co/datasets/MagmaAI/Magma-820K). Please prepare the image data according to the dataset list in the dataset page. Once finished, please edit [magma_820k.yaml](data_configs/magma_820k.yaml) file accordingly.
```yaml
# a list of all the data paths
DATA_PATH:
- "/path/to/magma_820k.json"
IMAGE_FOLDER:
- "/root/to/magma_820k/images"
```
* Finetune from Magma-8B
Once the dataset and config are set up, you can run the following command to finetune the model:
```bash
sh scripts/finetune/finetune_magma_820k.sh
```
## Model Usage
### Inference
#### Inference with Huggingface Transformers
We have uploaded the model to Huggingface Hub. You can easily load the model and processor with the following code.
<details>
<summary>Click to expand</summary>
```python
from PIL import Image
import torch
from transformers import AutoModelForCausalLM
from transformers import AutoProcessor
dtype = torch.bfloat16
model = AutoModelForCausalLM.from_pretrained("microsoft/Magma-8B", trust_remote_code=True, torch_dtype=dtype)
processor = AutoProcessor.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)
model.to("cuda")
# Inference
image = Image.open("./assets/images/magma_logo.jpg").convert("RGB")
convs = [
{"role": "system", "content": "You are agent that can see, talk and act."},
{"role": "user", "content": "<image_start><image><image_end>\nWhat is the letter on the robot?"},
]
prompt = processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)
inputs = processor(images=[image], texts=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
inputs = inputs.to("cuda").to(dtype)
generation_args = {
"max_new_tokens": 500,
"temperature": 0.0,
"do_sample": False,
"use_cache": True,
"num_beams": 1,
}
with torch.inference_mode():
generate_ids = model.generate(**inputs, **generation_args)
generate_ids = generate_ids[:, inputs["input_ids"].shape[-1] :]
response = processor.decode(generate_ids[0], skip_special_tokens=True).strip()
print(response)
```
</details>
#### Inference with local Transformers code from this repo
If you want to debug our model, we also provide local code for inference. You can run the following code to load the model.
<details>
<summary>Click to expand</summary>
```python
import torch
from magma.processing_magma import MagmaProcessor
from magma.modeling_magma import MagmaForCausalLM
dtype = torch.bfloat16
model = MagmaForCausalLM.from_pretrained("microsoft/Magma-8B", trust_remote_code=True, torch_dtype=dtype)
processor = MagmaProcessor.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)
model.to("cuda")
```
</details>
#### Inference with bitsandbytes
We also provide a sample code to inference with bitsandbytes. You can run the following code to load the model.
<details>
<summary>Click to expand</summary>
```python
from PIL import Image
import torch
from transformers import AutoModelForCausalLM
from transformers import AutoProcessor
from transformers import BitsAndBytesConfig
# Define quantization configuration
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4"
)
# Load model with quantization config
model = AutoModelForCausalLM.from_pretrained(
"microsoft/Magma-8B",
trust_remote_code=True,
device_map={"": 0}, # force everything onto GPU 0
quantization_config=quantization_config
)
processor = AutoProcessor.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)
# Inference
image = Image.open("assets/images/magma_logo.jpg").convert("RGB")
convs = [
{"role": "system", "content": "You are agent that can see, talk and act."},
{"role": "user", "content": "<image_start><image><image_end>\nWhat is the letter on the robot?"},
]
prompt = processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)
inputs = processor(images=[image], texts=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
# Convert inputs to the correct device and data type
inputs = {k: v.to(device=model.device, dtype=torch.float16 if v.dtype == torch.float32 else v.dtype)
for k, v in inputs.items()}
generation_args = {
"max_new_tokens": 500,
"temperature": 0.0,
"do_sample": False,
"use_cache": True,
"num_beams": 1,
}
with torch.inference_mode():
generate_ids = model.generate(**inputs, **generation_args)
generate_ids = generate_ids[:, inputs["input_ids"].shape[-1] :]
response = processor.decode(generate_ids[0], skip_special_tokens=True).strip()
print(response)
```
</details>
#### Benchmarking
We benchmark the inference time and memory usage of our model with and without bitsandbytes.
| Model | Inference Time | Peak Memory Usage |
|-------|----------------|--------------|
| Magma-8B (bfloat16) | 1.1s | 17GB |
| Magma-8B (4-bit) | 1.1s | 7GB |
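For reference, a rough probe like the one below can reproduce this kind of measurement with plain PyTorch (wall-clock time around `generate` plus `torch.cuda.max_memory_allocated`); the exact protocol behind the table above is not spelled out here, so treat this as a sketch rather than the official benchmark script. Called with the `inputs` and `generation_args` from the snippets above, it returns seconds per call and peak GiB.
```python
import time
import torch

def benchmark_generate(model, inputs, generation_args, warmup=1, iters=5):
    """Rough latency / peak-memory probe around model.generate on a CUDA device."""
    torch.cuda.reset_peak_memory_stats()
    with torch.inference_mode():
        for _ in range(warmup):
            model.generate(**inputs, **generation_args)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model.generate(**inputs, **generation_args)
        torch.cuda.synchronize()
    latency = (time.time() - start) / iters
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    return latency, peak_gb
```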
### Evaluation with lmms-eval
Please refer to [lmms-eval-instruction](tools/lmms-eval-magma) for the detailed instructions to run the evaluation with lmms-eval toolkit.
Once everything is ready, you can run the following code to evaluate our model from the root folder.
```bash
sh scripts/evaluation/lmms-eval/lmms_eval_magma.sh
```
You can evaluate other benchmarks by modifying the variable `eval_tasks`. The list of available `eval_tasks` can be found by running the code below.
```
# lmms-eval --tasks {list_groups,list_subtasks,list_tags,list}
lmms-eval --tasks list_groups
```
### Evaluation with SimplerEnv
Please refer to [SimplerEnv-instruction](tools/simplerenv-magma) for the detailed instructions to run the evaluation with SimplerEnv toolkit.
Once everything is ready, you can run the following code to evaluate our model.
```bash
sh scripts/evaluation/simplerenv/bridge.sh
```
### Multi-images or Video Support
Handling multiple images is extremely simple for our model. Simply duplicate the placeholder in your text prompt and add all the images to the list accordingly. A dummy example is as follows:
```py
convs = [
{"role": "system", "content": "You are agent that can see, talk and act."},
{"role": "user", "content": "<image_start><image><image_end>\n<image_start><image><image_end>\n<image_start><image><image_end>\nWhat is the letter on the robot?"},
]
prompt = processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)
inputs = processor(images=[image1,image2,image3], texts=prompt, return_tensors="pt")
```
Our model will handle the visual token filling for you!
### Agent Demos
#### UI Agent
We built several agent demos on top of our model. The first is the UI Agent Demo. As our model is pretrained with Set-of-Mark and Trace-of-Mark, it naturally synergizes with [OmniParser](https://github.com/microsoft/OmniParser). Combining the two gives you a UI agent right away; run:
```bash
python agents/ui_agent/app.py
```
More importantly, our Magma model has not only action-grounding ability but also multimodal understanding and reasoning ability. You can not only ask the model to predict where to click with a text instruction:
```bash
Go to the top ranked post
```
But you can also ask free-form questions on the fly! Simply add the prefix "Q:" at the beginning of the text prompt, e.g.,
```bash
Q: What is the title of the post?
```
#### Gaming Agent
We also built a gaming agent demo. You can run the following command to start the demo:
```bash
python agents/gaming_agent/app.py
```
Once the demo is run, you can see a robot proactively collecting the green blocks.
<!-- Below are the comparison between Magma and other counterparts VLMs:
<div align="center">
<video width="48%" controls autoplay>
<source src="https://microsoft.github.io/Magma/static/videos/magma_vs_llava.mp4" type="video/mp4">
<p>Magma v.s. LLaVA-OneVision.</p>
</video>
<video width="48%" controls autoplay>
<source src="https://microsoft.github.io/Magma/static/videos/magma_vs_qwen.mp4" type="video/mp4">
<p>Magma v.s. Qwen-2.0.</p>
</video>
</div> -->
#### Robot Visual Planning
We also built a robot visual planning demo. You can run the following command to start the demo:
```bash
python agents/robot_traj/app.py
```
For this demo, you may encounter an error as discussed in this [issue](https://github.com/microsoft/Magma/issues/43); a quick fix is to run the following command:
```sh
pip install imageio[ffmpeg]
```
If it still does not work, please install the older version of transformers:
```sh
pip install git+https://github.com/jwyang/transformers.git@dev/jwyang-v4.44.1
```
<!-- Some example outputs:
<div align="center">
<video width="48%" controls autoplay>
<source src="assets/videos/robot_pick_up_chip_bag.mp4" type="video/mp4">
<p>Task: Pick up chip bag.</p>
</video>
<video width="48%" controls autoplay>
<source src="assets/videos/robot_push_chip_bag_to_left_edge_of_table.mp4" type="video/mp4">
<p>Task: Push chip bag to left edge of the table.</p>
</video>
</div> -->
## User Guidance
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
### Direct use
This model is intended for broad research use in English. It takes images and text as inputs and produces textual outputs for the following uses:
* **Image/Video-Conditioned Text Generation:** The model can generate text (e.g., descriptions, answers) based on the input text and image.
* **Visual Planning Capabilities:** The model can also produce the visual trace as the future planning to accomplish a task (e.g., move object from one place to another).
* **Agentic Capabilities:** The model can also generate UI grounding (e.g., click the "search" button) and robotics manipulations (e.g., 7-DoF actions for the robot gripper).
Our model is designed only for research purposes and is aimed at knowledge-sharing and accelerating research in multimodal AI, in particular multimodal agentic AI.
### Downstream Use
The model can be further finetuned for different downstream tasks, such as:
* **Image Captioning and QA:** We can further finetune this model for image captioning and QA tasks under the pipeline of multimodal LLMs. Based on our experiments, the model can achieve competitive performance, with better spatial understanding and reasoning, on these tasks.
* **Video Captioning and QA:** We can further finetune this model for video captioning and QA tasks under the pipeline of multimodal LLMs. Based on our experiments, the model can achieve competitive performance, with better temporal understanding and reasoning, on these tasks.
* **UI Navigation:** We can finetune this model for specific UI navigation tasks, such as web navigation or mobile navigation. The model can achieve superior performance on these tasks.
* **Robotics Manipulation:** Our model can be further finetuned for robotics tasks given its general agentic capabilities as a vision-language-action model. After finetuning, our model significantly outperforms the state-of-the-art models such as OpenVLA on robotics manipulation tasks.
## Bias, Risks, and Limitations
Please note that this model is not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case.
## Citation
If you use this model in your research, please consider citing:
```bibtex
@misc{yang2025magmafoundationmodelmultimodal,
title={Magma: A Foundation Model for Multimodal AI Agents},
author={Jianwei Yang and Reuben Tan and Qianhui Wu and Ruijie Zheng and Baolin Peng and Yongyuan Liang and Yu Gu and Mu Cai and Seonghyeon Ye and Joel Jang and Yuquan Deng and Lars Liden and Jianfeng Gao},
year={2025},
eprint={2502.13130},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2502.13130},
}
```
## Acknowledgements
Our work is supported by Microsoft Research. We thank all the contributors for their efforts in building this project.
Our work is built on top of some amazing open-source projects, including [Transformers](https://github.com/huggingface/transformers), [LLaVA](https://github.com/haotian-liu/LLaVA), [OpenVLA](https://github.com/openvla/openvla), [SeeClick](https://github.com/njucckevin/SeeClick), [Mind2Web](https://github.com/OSU-NLP-Group/Mind2Web), and also a number of awesome open-source datasets, including [Ego4d](https://ego4d-data.org/), [Epic-Kitchen](https://epic-kitchens.github.io/2025), [Something-Somethingv2](https://www.qualcomm.com/developer/artificial-intelligence/datasets), [Open-X-Embodiment](https://robotics-transformer-x.github.io/), and a number of evaluation benchmarks, including [SimplerEnv](https://github.com/simpler-env/SimplerEnv), [Libero](https://github.com/Lifelong-Robot-Learning/LIBERO).
## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
## Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
## Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos is subject to and must follow
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos are subject to those third-party's policies.
<!-- BEGIN MICROSOFT SECURITY.MD V0.0.9 BLOCK -->
## Security
Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet) and [Xamarin](https://github.com/xamarin).
If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://aka.ms/security.md/definition), please report it to us as described below.
## Reporting Security Issues
**Please do not report security vulnerabilities through public GitHub issues.**
Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://aka.ms/security.md/msrc/create-report).
If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://aka.ms/security.md/msrc/pgp).
You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc).
Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:
* Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
* Full paths of source file(s) related to the manifestation of the issue
* The location of the affected source code (tag/branch/commit or direct URL)
* Any special configuration required to reproduce the issue
* Step-by-step instructions to reproduce the issue
* Proof-of-concept or exploit code (if possible)
* Impact of the issue, including how an attacker might exploit the issue
This information will help us triage your report more quickly.
If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://aka.ms/security.md/msrc/bounty) page for more details about our active programs.
## Preferred Languages
We prefer all communications to be in English.
## Policy
Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://aka.ms/security.md/cvd).
<!-- END MICROSOFT SECURITY.MD BLOCK -->
# TODO: The maintainer of this repo has not yet edited this file
**REPO OWNER**: Do you want Customer Service & Support (CSS) support for this product/project?
- **No CSS support:** Fill out this template with information about how to file issues and get help.
- **Yes CSS support:** Fill out an intake form at [aka.ms/onboardsupport](https://aka.ms/onboardsupport). CSS will work with/help you to determine next steps.
- **Not sure?** Fill out an intake as though the answer were "Yes". CSS will help you decide.
*Then remove this first heading from this SUPPORT.MD file before publishing your repo.*
# Support
## How to file issues and get help
This project uses GitHub Issues to track bugs and feature requests. Please search the existing
issues before filing new issues to avoid duplicates. For new issues, file your bug or
feature request as a new Issue.
For help and questions about using this project, please **REPO MAINTAINER: INSERT INSTRUCTIONS HERE
FOR HOW TO ENGAGE REPO OWNERS OR COMMUNITY FOR HELP. COULD BE A STACK OVERFLOW TAG OR OTHER
CHANNEL. WHERE WILL YOU HELP PEOPLE?**.
## Microsoft Support Policy
Support for this **PROJECT or PRODUCT** is limited to the resources listed above.
# --------------------------------------------------------
# Magma - Multimodal AI Agent at Microsoft Research
# Copyright (c) 2025 Microsoft
# Licensed under The MIT License [see LICENSE for details]
# Jianwei Yang (jianwyan@microsoft.com)
# --------------------------------------------------------
import pygame
import numpy as np
import gradio as gr
import time
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor
import re
import random
pygame.mixer.quit() # Disable sound
# Constants
WIDTH, HEIGHT = 800, 800
GRID_SIZE = 80
WHITE = (255, 255, 255)
GREEN = (34, 139, 34) # Forest green - more like an apple
RED = (200, 50, 50)
BLACK = (0, 0, 0)
GRAY = (128, 128, 128)
YELLOW = (218, 165, 32) # Golden yellow color
# Directions
UP = (0, -1)
DOWN = (0, 1)
LEFT = (-1, 0)
RIGHT = (1, 0)
STATIC = (0, 0)
ACTIONS = ["up", "down", "left", "right", "static"]
# Load AI Model
magma_model_id = "microsoft/Magma-8B"
dtype = torch.bfloat16
magma_model = AutoModelForCausalLM.from_pretrained(magma_model_id, trust_remote_code=True, torch_dtype=dtype)
magma_processor = AutoProcessor.from_pretrained(magma_model_id, trust_remote_code=True, torch_dtype=dtype)
magma_model.to("cuda")
# Load magma image
magma_img = pygame.image.load("./assets/images/magma_game.png")
magma_img = pygame.transform.scale(magma_img, (GRID_SIZE, GRID_SIZE))
class MagmaFindGPU:
def __init__(self):
self.reset()
def reset(self):
self.snake = [(5, 5)]
self.direction = RIGHT
self.score = 0
self.game_over = False
self.place_target()
def place_target(self):
while True:
target_x = np.random.randint(1, WIDTH // GRID_SIZE - 1)
target_y = np.random.randint(1, HEIGHT // GRID_SIZE - 1)
if (target_x, target_y) not in self.snake:
self.target = (target_x, target_y)
break
def step(self, action):
if action == "up":
self.direction = UP
elif action == "down":
self.direction = DOWN
elif action == "left":
self.direction = LEFT
elif action == "right":
self.direction = RIGHT
elif action == "static":
self.direction = STATIC
if self.game_over:
return self.render(), self.score
new_head = (self.snake[0][0] + self.direction[0], self.snake[0][1] + self.direction[1])
if new_head[0] < 0 or new_head[1] < 0 or new_head[0] >= WIDTH // GRID_SIZE or new_head[1] >= HEIGHT // GRID_SIZE:
self.game_over = True
return self.render(), self.score
self.snake = [new_head] # Keep only the head (single block snake)
# Check if the target is covered by four surrounding squares
head_x, head_y = self.snake[0]
neighbors = set([(head_x, head_y - 1), (head_x, head_y + 1), (head_x - 1, head_y), (head_x + 1, head_y)])
if neighbors.issuperset(set([self.target])):
self.score += 1
self.place_target()
return self.render(), self.score
def render(self):
pygame.init()
surface = pygame.Surface((WIDTH, HEIGHT))
surface.fill(BLACK)
head_x, head_y = self.snake[0]
surface.blit(magma_img, (head_x * GRID_SIZE, head_y * GRID_SIZE))
# pygame.draw.rect(surface, RED, (self.snake[0][0] * GRID_SIZE, self.snake[0][1] * GRID_SIZE, GRID_SIZE, GRID_SIZE))
pygame.draw.rect(surface, GREEN, (self.target[0] * GRID_SIZE, self.target[1] * GRID_SIZE, GRID_SIZE, GRID_SIZE))
# Draw four surrounding squares with labels
head_x, head_y = self.snake[0]
neighbors = [(head_x, head_y - 1), (head_x, head_y + 1), (head_x - 1, head_y), (head_x + 1, head_y)]
labels = ["1", "2", "3", "4"]
font = pygame.font.Font(None, 48)
# clone surface
surface_nomark = surface.copy()
for i, (nx, ny) in enumerate(neighbors):
if 0 <= nx < WIDTH // GRID_SIZE and 0 <= ny < HEIGHT // GRID_SIZE:
pygame.draw.rect(surface, RED, (nx * GRID_SIZE, ny * GRID_SIZE, GRID_SIZE, GRID_SIZE), GRID_SIZE)
# pygame.draw.rect(surface_nomark, RED, (nx * GRID_SIZE, ny * GRID_SIZE, GRID_SIZE, GRID_SIZE), GRID_SIZE)
text = font.render(labels[i], True, WHITE)
text_rect = text.get_rect(center=(nx * GRID_SIZE + GRID_SIZE // 2, ny * GRID_SIZE + GRID_SIZE // 2))
surface.blit(text, text_rect)
return np.array(pygame.surfarray.array3d(surface_nomark)).swapaxes(0, 1), np.array(pygame.surfarray.array3d(surface)).swapaxes(0, 1)
def get_state(self):
return self.render()
game = MagmaFindGPU()
def play_game():
state, state_som = game.get_state()
pil_img = Image.fromarray(state_som)
convs = [
{"role": "system", "content": "You are an agent that can see, talk, and act."},
{"role": "user", "content": "<image_start><image><image_end>\nWhich mark is closer to green block? Answer with a single number."},
]
prompt = magma_processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)
inputs = magma_processor(images=[pil_img], texts=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
inputs = inputs.to("cuda").to(dtype)
generation_args = {
"max_new_tokens": 10,
"temperature": 0,
"do_sample": False,
"use_cache": True,
"num_beams": 1,
}
with torch.inference_mode():
generate_ids = magma_model.generate(**inputs, **generation_args)
generate_ids = generate_ids[:, inputs["input_ids"].shape[-1] :]
action = magma_processor.decode(generate_ids[0], skip_special_tokens=True).strip()
    # extract the mark id from the action string using a regex
match = re.search(r'\d+', action)
if match:
action = match.group(0)
if action.isdigit() and 1 <= int(action) <= 4:
# epsilon sampling
if random.random() < 0.1:
action = random.choice(ACTIONS[:-1])
else:
action = ACTIONS[int(action) - 1]
else:
# random choose one from the pool
action = random.choice(ACTIONS[:-1])
else:
action = random.choice(ACTIONS[:-1])
img, score = game.step(action)
img = img[0]
return img, f"Score: {score}"
def reset_game():
game.reset()
return game.render()[0], "Score: 0"
MARKDOWN = """
<div align="center">
<h2>Magma: A Foundation Model for Multimodal AI Agents</h2>
Game: Magma finds the apple by moving up, down, left and right.
\[[arXiv Paper](https://www.arxiv.org/pdf/2502.13130)\] &nbsp; \[[Project Page](https://microsoft.github.io/Magma/)\] &nbsp; \[[Github Repo](https://github.com/microsoft/Magma)\] &nbsp; \[[Hugging Face Model](https://huggingface.co/microsoft/Magma-8B)\] &nbsp;
This demo is powered by [Gradio](https://gradio.app/).
</div>
"""
with gr.Blocks() as interface:
gr.Markdown(MARKDOWN)
with gr.Row():
image_output = gr.Image(label="Game Screen")
score_output = gr.Text(label="Score")
with gr.Row():
start_btn = gr.Button("Start/Reset Game")
interface.load(fn=play_game, every=1, inputs=[], outputs=[image_output, score_output])
start_btn.click(fn=reset_game, inputs=[], outputs=[image_output, score_output])
interface.launch()
import gradio as gr
import numpy as np
import gymnasium as gym
from PIL import Image
import matplotlib.pyplot as plt
# Initialize FrozenLake environment
env = gym.make("FrozenLake-v1", render_mode="rgb_array")
state, _ = env.reset()
action_mapping = {
"Left": 3,
"Down": 1,
"Right": 2,
"Up": 0,
}
def render_env():
"""Render the environment and return as an image."""
frame = env.render()
image = Image.fromarray(frame)
return image
def step(action):
"""Take a step in the environment."""
global state
action_index = action_mapping[action]
state, reward, done, _, _ = env.step(action_index)
image = render_env()
message = f"State: {state}, Reward: {reward}, Done: {done}"
if done:
env.reset()
message += " - Resetting environment"
return image, message
# Create Gradio interface
with gr.Blocks() as demo:
gr.Markdown("# Play Frozen Lake!")
image_display = gr.Image()
action_buttons = gr.Radio(choices=list(action_mapping.keys()), label="Select Action")
submit_button = gr.Button("Step")
output_text = gr.Textbox(label="Game State")
submit_button.click(fn=step, inputs=action_buttons, outputs=[image_display, output_text])
# Show initial state
image_display.update(render_env())
demo.launch()
# Magma: Multimodal Agentic Models
Evaluating Magma on [LIBERO](https://github.com/Lifelong-Robot-Learning/LIBERO).
#### LIBERO Setup
Clone and install LIBERO and other requirements:
```
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
pip install -r agents/libero/requirements.txt
cd LIBERO
pip install -e .
```
#### Quick Evaluation
The following code demonstrates how to run Magma on a single LIBERO task and evaluate its performance:
```
import numpy as np
from libero.libero import benchmark
from libero_env_utils import get_libero_env, get_libero_dummy_action, get_libero_obs, get_max_steps, save_rollout_video
from libero_magma_utils import get_magma_model, get_magma_prompt, get_magma_action
# Set up benchmark and task
benchmark_dict = benchmark.get_benchmark_dict()
task_suite_name = "libero_goal" # or libero_spatial, libero_object, etc.
task_suite = benchmark_dict[task_suite_name]()
task_id = 1
task = task_suite.get_task(task_id)
# Initialize environment
env, task_description = get_libero_env(task, resolution=256)
print(f"Task {task_id} description: {task_description}")
# Load MAGMA model
model_name = "microsoft/magma-8b-libero-goal" # or your local path
processor, magma = get_magma_model(model_name)
prompt = get_magma_prompt(task_description, processor, magma.config)
# Run evaluation
num_steps_wait = 10
max_steps = get_max_steps(task_suite_name)
env.seed(0)
obs = env.reset()
init_states = task_suite.get_task_init_states(task_id)
obs = env.set_init_state(init_states[0])
step = 0
replay_images = []
while step < max_steps + num_steps_wait:
if step < num_steps_wait:
obs, _, done, _ = env.step(get_libero_dummy_action())
step += 1
continue
img = get_libero_obs(obs, resize_size=256)
replay_images.append(img)
action = get_magma_action(magma, processor, img, prompt, task_suite_name)
obs, _, done, _ = env.step(action.tolist())
step += 1
env.close()
save_rollout_video(replay_images, success=done, task_description=task_description)
```
**Notes:** The above script only tests one episode on a single task and visualizes MAGMA's trajectory with saved video. For comprehensive evaluation on each task suite, please use `eval_magma_libero.py`.
```
python eval_magma_libero.py \
  --model_name microsoft/Magma-8B-libero-object \
  --task_suite_name libero_object

python eval_magma_libero.py \
  --model_name microsoft/Magma-8B-libero-spatial \
  --task_suite_name libero_spatial

python eval_magma_libero.py \
  --model_name microsoft/Magma-8B-libero-goal \
  --task_suite_name libero_goal
```
import os
import numpy as np
import draccus
from dataclasses import dataclass
from typing import Optional, Tuple
import tqdm
from libero.libero import benchmark
from libero_env_utils import (
get_libero_env,
get_libero_dummy_action,
get_libero_obs,
get_max_steps,
set_seed_everywhere
)
from libero_magma_utils import (
get_magma_model,
get_magma_prompt,
get_magma_action
)
@dataclass
class LiberoConfig:
# Model parameters
model_name: str = "microsoft/magma-8b-libero-goal" # model_name
task_suite_name: str = "libero_goal" # Task suite name
# Evaluation parameters
num_trials_per_task: int = 50 # Number of rollouts per task
resolution: int = 256 # Image resolution
num_steps_wait: int = 10 # Steps to wait for stabilization
seed: int = 0 # Random seed
save_dir: str = "./libero_eval_log" # Directory for saving logs
@draccus.wrap()
def eval_libero(cfg: LiberoConfig) -> Tuple[int, int]:
"""
Evaluate Libero environment with given configuration.
Args:
cfg: LiberoConfig object containing evaluation parameters
Returns:
Tuple[int, int]: Total episodes and total successful episodes
"""
# Setup logging
os.makedirs(cfg.save_dir, exist_ok=True)
log_filepath = f"{cfg.save_dir}/magma_eval-{cfg.task_suite_name}.log"
log_file = open(log_filepath, "w")
print(f"Logging to local log file: {log_filepath}")
# Write initial log
log_file.write(f"Task suite: {cfg.task_suite_name}\n")
print(f"Task suite: {cfg.task_suite_name}")
# Get benchmark and task suite
benchmark_dict = benchmark.get_benchmark_dict()
task_suite = benchmark_dict[cfg.task_suite_name]()
num_tasks_in_suite = task_suite.n_tasks
# Initialize counters
total_episodes, total_successes = 0, 0
set_seed_everywhere(cfg.seed)
# Load model
processor, magma = get_magma_model(cfg.model_name)
# Iterate through all tasks
for task_id in tqdm.tqdm(range(num_tasks_in_suite)):
# Get task
task = task_suite.get_task(task_id)
task_name = task.name
max_steps = get_max_steps(cfg.task_suite_name)
# Get default LIBERO initial states
initial_states = task_suite.get_task_init_states(task_id)
# Initialize LIBERO environment and task description
env, task_description = get_libero_env(task, resolution=cfg.resolution)
print(f"[info] Evaluating task {task_id} from suite {cfg.task_suite_name}, "
f"the language instruction is {task_description}.")
log_file.write(f"Task {task_id}: {task_description}\n")
log_file.flush()
# Get prompt for current task
prompt = get_magma_prompt(task_description, processor, magma.config)
# Initialize task-specific counters
task_episodes, task_successes = 0, 0
# Run trials for current task
for trial in range(cfg.num_trials_per_task):
env.reset()
obs = env.set_init_state(initial_states[trial])
step = 0
while step < max_steps + cfg.num_steps_wait:
if step < cfg.num_steps_wait:
obs, reward, done, info = env.step(get_libero_dummy_action())
step += 1
continue
img = get_libero_obs(obs, resize_size=cfg.resolution)
action = get_magma_action(magma, processor, img, prompt, cfg.task_suite_name)
obs, reward, done, info = env.step(action.tolist())
step += 1
if done:
task_successes += 1
break
task_episodes += 1
# Update total counters
total_episodes += task_episodes
total_successes += task_successes
# Log task success rate
task_success_rate = float(task_successes) / float(task_episodes)
print(f"Current task ({task_name}) success rate: {task_success_rate}")
log_file.write(f"Current task ({task_name}) success rate: {task_success_rate}\n")
log_file.flush()
# Log final suite success rate
suite_success_rate = float(total_successes) / float(total_episodes)
print(f"Task suite success rate: {suite_success_rate}")
log_file.write(f"\nTask suite {cfg.task_suite_name} success rate: {suite_success_rate}\n")
log_file.flush()
env.close()
log_file.close()
return total_episodes, total_successes
if __name__ == "__main__":
eval_libero()
"""Utils for evaluating policies in LIBERO simulation environments."""
import math
import os
import torch
import random
from PIL import Image
import imageio
import numpy as np
import tensorflow as tf
from libero.libero import get_libero_path
from libero.libero.envs import OffScreenRenderEnv
def resize_image(img, resize_size):
"""
Takes numpy array corresponding to a single image and returns resized image as numpy array.
"""
assert isinstance(resize_size, tuple)
# Resize to image size expected by model
img = tf.image.encode_jpeg(img) # Encode as JPEG, as done in RLDS dataset builder
img = tf.io.decode_image(img, expand_animations=False, dtype=tf.uint8) # Immediately decode back
img = tf.image.resize(img, resize_size, method="lanczos3", antialias=True)
img = tf.cast(tf.clip_by_value(tf.round(img), 0, 255), tf.uint8)
img = img.numpy()
return img
def get_libero_env(task, resolution=256):
"""Initializes and returns the LIBERO environment, along with the task description."""
task_description = task.language
task_bddl_file = os.path.join(get_libero_path("bddl_files"), task.problem_folder, task.bddl_file)
env_args = {"bddl_file_name": task_bddl_file, "camera_heights": resolution, "camera_widths": resolution}
env = OffScreenRenderEnv(**env_args)
env.seed(0) # IMPORTANT: seed seems to affect object positions even when using fixed initial state
return env, task_description
def get_libero_dummy_action():
"""Get dummy/no-op action, used to roll out the simulation while the robot does nothing."""
return [0, 0, 0, 0, 0, 0, -1]
def get_libero_obs(obs, resize_size):
"""Extracts image from observations and preprocesses it."""
assert isinstance(resize_size, int) or isinstance(resize_size, tuple)
if isinstance(resize_size, int):
resize_size = (resize_size, resize_size)
img = obs["agentview_image"]
img = img[::-1, ::-1] # IMPORTANT: rotate 180 degrees to match train preprocessing
image = Image.fromarray(img)
# resize image to 256x256
image = resize_image(img, resize_size)
return image
def get_max_steps(task_suite_name):
if task_suite_name == "libero_spatial":
max_steps = 220
elif task_suite_name == "libero_object":
max_steps = 280
elif task_suite_name == "libero_goal":
max_steps = 300
elif task_suite_name == "libero_10":
max_steps = 520
else:
max_steps = 400
return max_steps
def quat2axisangle(quat):
"""
Copied from robosuite: https://github.com/ARISE-Initiative/robosuite/blob/eafb81f54ffc104f905ee48a16bb15f059176ad3/robosuite/utils/transform_utils.py#L490C1-L512C55
Converts quaternion to axis-angle format.
Returns a unit vector direction scaled by its angle in radians.
Args:
quat (np.array): (x,y,z,w) vec4 float angles
Returns:
np.array: (ax,ay,az) axis-angle exponential coordinates
"""
# clip quaternion
if quat[3] > 1.0:
quat[3] = 1.0
elif quat[3] < -1.0:
quat[3] = -1.0
den = np.sqrt(1.0 - quat[3] * quat[3])
if math.isclose(den, 0.0):
# This is (close to) a zero degree rotation, immediately return
return np.zeros(3)
return (quat[:3] * 2.0 * math.acos(quat[3])) / den
def save_rollout_video(replay_images, success, task_description):
"""Saves a video replay of a rollout in libero."""
save_dir = f"./libero_videos"
os.makedirs(save_dir, exist_ok=True)
processed_task_description = task_description.lower().replace(" ", "_").replace("\n", "_").replace(".", "_")[:50]
video_path = f"{save_dir}/quick_eval-success={success}--task={processed_task_description}.mp4"
video_writer = imageio.get_writer(video_path, fps=30)
for img in replay_images:
video_writer.append_data(img)
video_writer.close()
print(f"Saved libero video at path {video_path}")
return video_path
def set_seed_everywhere(seed: int):
"""Sets the random seed for Python, NumPy, and PyTorch functions."""
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
np.random.seed(seed)
random.seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
os.environ["PYTHONHASHSEED"] = str(seed)
import os
import json
import torch
import numpy as np
from magma.image_processing_magma import MagmaImageProcessor
from magma.processing_magma import MagmaProcessor
from magma.modeling_magma import MagmaForConditionalGeneration
def get_magma_model(model_name):
processor = MagmaProcessor.from_pretrained(model_name, trust_remote_code=True)
magma = MagmaForConditionalGeneration.from_pretrained(model_name,
device_map="cuda",
low_cpu_mem_usage=True,
attn_implementation="flash_attention_2",
torch_dtype=torch.bfloat16,
trust_remote_code=True,
use_cache=True,
)
return processor, magma
def get_magma_prompt(task_description, processor, model_config):
convs = [
{"role": "user", "content": f"<image>\nWhat action should the robot take to {task_description}?"},
]
convs = [
{
"role": "system",
"content": "You are agent that can see, talk and act.",
},
] + convs
prompt = processor.tokenizer.apply_chat_template(
convs,
tokenize=False,
add_generation_prompt=True
)
if model_config.mm_use_image_start_end:
prompt = prompt.replace("<image>", "<image_start><image><image_end>")
return prompt
def get_magma_action(magma, processor, img, prompt, task_suite_name):
dataset_stats = json.load(open(os.path.join(magma.config._name_or_path, "dataset_statistics.json")))
action_norm_stats = dataset_stats[f"{task_suite_name}_no_noops"]['action']
n_action_bins = 256
vocab_size = processor.tokenizer.vocab_size
bins = np.linspace(-1, 1, n_action_bins)
bin_centers = (bins[:-1] + bins[1:]) / 2.0
# process inputs
inputs = processor(images=img, texts=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
inputs = inputs.to("cuda").to(torch.bfloat16)
# predict actions with magma
output_ids = magma.generate(
**inputs,
temperature=0.7,
do_sample=True,
num_beams=1,
max_new_tokens=1000,
use_cache=True,
)
action_ids = output_ids[0, -8:-1].cpu().tolist()
predicted_action_ids = np.array(action_ids).astype(np.int64)
discretized_actions = vocab_size - predicted_action_ids
discretized_actions = np.clip(discretized_actions - 1, a_min=0, a_max=bin_centers.shape[0] - 1)
normalized_actions = bin_centers[discretized_actions]
# unnormalize actions
mask = action_norm_stats.get("mask", np.ones_like(action_norm_stats["q01"], dtype=bool))
action_high, action_low = np.array(action_norm_stats["q99"]), np.array(action_norm_stats["q01"])
raw_action = np.where(
mask,
0.5 * (normalized_actions + 1) * (action_high - action_low) + action_low,
normalized_actions,
)
action = normalize_gripper_action(raw_action, binarize=True)
action = invert_gripper_action(action)
return action
def normalize_gripper_action(action, binarize=True):
"""
Convert gripper action from [0,1] to [-1,+1] range.
y = 2x - 1
"""
orig_low, orig_high = 0.0, 1.0
action[..., -1] = 2 * (action[..., -1] - orig_low) / (orig_high - orig_low) - 1
if binarize:
# Binarize to -1 or +1.
action[..., -1] = np.sign(action[..., -1])
return action
def invert_gripper_action(action):
"""Convert gripper: RLDS(0=close,1=open) -> -1=open,+1=close"""
action[..., -1] = action[..., -1] * -1.0
return action
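# ---------------------------------------------------------------
# Minimal usage sketch (not part of the original utilities): how the helpers
# above are expected to chain for a single LIBERO step. The checkpoint name,
# task-suite key, task string and image path are assumptions; get_magma_action
# additionally expects dataset_statistics.json to sit next to the checkpoint.
# ---------------------------------------------------------------
# if __name__ == "__main__":
#     from PIL import Image
#     processor, magma = get_magma_model("microsoft/Magma-8B")  # assumed checkpoint
#     prompt = get_magma_prompt("pick up the black bowl", processor, magma.config)
#     img = Image.open("frame.png")  # placeholder observation
#     action = get_magma_action(magma, processor, img, prompt, "libero_spatial")
#     print(action)  # 7-D action: position delta, rotation delta, gripper in {-1, +1}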
robosuite==1.4.0
bddl==1.0.1
easydict==1.9
gym==0.25.2
cloudpickle
imageio[ffmpeg]
# --------------------------------------------------------
# Magma - Multimodal AI Agent at Microsoft Research
# Copyright (c) 2025 Microsoft
# Licensed under The MIT License [see LICENSE for details]
# Jianwei Yang (jianwyan@microsoft.com)
# --------------------------------------------------------
import os
import warnings
from utils.visualizer import Visualizer
from typing import TYPE_CHECKING, Any, Callable, Dict, List, Optional, Tuple
import random
import gradio as gr
import ast, re
import torch
import torchvision
from transformers import AutoModelForCausalLM, AutoProcessor
'''
build model
'''
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
random.seed(0)
spatial_quant_size = 256
# Load AI Model
dtype = torch.bfloat16
device = "cuda"
magma_model_id = "microsoft/Magma-8B"
model = AutoModelForCausalLM.from_pretrained(magma_model_id, trust_remote_code=True, torch_dtype=dtype)
processor = AutoProcessor.from_pretrained(magma_model_id, trust_remote_code=True)
model.to(device)
@torch.no_grad()
def inference(image, task, *args, **kwargs):
# image = image['image']
task_description = task
num_marks = args[0]
speed = args[1]
steps = args[2]
mark_ids = [i+1 for i in range(num_marks)]
image_resized = image.resize((256, 256))
magma_template = (
# "<image>\nThe image is labeled with numeric marks {}.\n"
"<image>\nThe image is split into 256x256 grids and is labeled with numeric marks {}.\n"
"The robot is doing: {}. To finish the task, how to move the numerical marks in the image with speed {} for the next {} steps?\n"
)
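# Example (sketch) of the filled-in template for mark_ids=[1, 2, 3],
# task "pick up the chip bag", speed 8 and steps 8 (before the optional
# <image_start>/<image_end> wrapping below):
#   <image>
#   The image is split into 256x256 grids and is labeled with numeric marks [1, 2, 3].
#   The robot is doing: pick up the chip bag. To finish the task, how to move the
#   numerical marks in the image with speed 8 for the next 8 steps?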
"""
Visual Trace Generation
"""
if model.config.mm_use_image_start_end:
magma_template = magma_template.replace("<image>", "<image_start><image><image_end>")
conv_user = magma_template.format(mark_ids, task_description, speed, steps)
print(conv_user)
convs = [
{"role": "user", "content": conv_user},
]
convs = [
{
"role": "system",
"content": "You are agent that can see, talk and act.",
},
] + convs
prompt = processor.tokenizer.apply_chat_template(
convs,
tokenize=False,
add_generation_prompt=True
)
inputs = processor(images=image_resized, texts=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
inputs = inputs.to(dtype).to(device)
with torch.inference_mode():
output_ids = model.generate(
**inputs,
temperature=0.3,
do_sample=True,
num_beams=1,
max_new_tokens=1024,
use_cache=True,
)
response = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
if len(response)==0:
return None
# extract traces from response
if "and their future positions are:" in response:
selected_marks_str, traces_str = response.split("and their future positions are:\n")
else:
selected_marks_str, traces_str = None, response
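# The parsing below assumes the model writes one entry per mark, separated by
# blank lines, roughly of the form
#   1: "[(x1, y1), (x2, y2), ...]"
#   2: "[(x1, y1), (x2, y2), ...]"
# i.e. the mark id as key and a quoted list of (x, y) coordinates on the
# 256x256 grid as value; blank-line separators are turned into commas so the
# whole block parses as a Python dict.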
try:
traces_dict = ast.literal_eval('{' + traces_str.strip().replace('\n\n',',') + '}')
overlay_traces = []
for mark_id, trace in traces_dict.items():
# convert list of tuples to tensor
trace = torch.tensor(ast.literal_eval(trace)).unsqueeze(1)
overlay_traces.append(trace)
# pad all traces to the same length by repeating their last point
max_len = max([trace.shape[0] for trace in overlay_traces])
for i in range(len(overlay_traces)):
if overlay_traces[i].shape[0] < max_len:
overlay_traces[i] = torch.cat([overlay_traces[i], overlay_traces[i][-1].unsqueeze(0).repeat(max_len - overlay_traces[i].shape[0], 1, 1)], dim=0)
overlay_traces = torch.cat(overlay_traces, dim=1).unsqueeze(0)
# if selected_marks_str is not None:
# selected_marks = re.findall(r'\[(.*?)\]', selected_marks_str)
# selected_marks = [torch.tensor(ast.literal_eval(mark)).unsqueeze(0) for mark in selected_marks]
# selected_marks = torch.cat(selected_marks, dim=0).unsqueeze(0)
# overlay_traces = torch.cat([selected_marks.unsqueeze(1), overlay_traces], dim=1)
overlay_traces = overlay_traces.float() / 256
overlay_traces[:,:,:,0] = overlay_traces[:,:,:,0] * image.size[0]
overlay_traces[:,:,:,1] = overlay_traces[:,:,:,1] * image.size[1]
images = [image] * overlay_traces.shape[1]
overlay_visibility = overlay_traces.new(overlay_traces.shape[0], overlay_traces.shape[1], overlay_traces.shape[2]).fill_(True)
video = torch.stack([torchvision.transforms.ToTensor()(img) for img in images])[None].float()*255
vis = Visualizer(save_dir="./saved_videos", pad_value=0, linewidth=2, tracks_leave_trace=-1)
vis.visualize(video, overlay_traces, overlay_visibility)
# return video path
return "./saved_videos/video.mp4"
except Exception as e:
print(e)
return None
class ImageMask(gr.components.Image):
"""
Sets: source="canvas", tool="sketch"
"""
is_template = True
def __init__(self, **kwargs):
super().__init__(source="upload", tool="sketch", interactive=True, **kwargs)
def preprocess(self, x):
return super().preprocess(x)
class Video(gr.components.Video):
"""
Sets: source="canvas", tool="sketch"
"""
is_template = True
def __init__(self, **kwargs):
super().__init__(source="upload", **kwargs)
def preprocess(self, x):
return super().preprocess(x)
'''
launch app
'''
title = "Magma"
description = '''Magma: Multimodal Agent to Act'''
'''Usage
Instructions:
&#x1F388 Try our default examples first (the sketch is not automatically drawn on the input or example image);
&#x1F388 For the video demo, processing takes about 30-60 s; please refresh if you hit an error while uploading;
&#x1F388 Upload an image/video (to use a referred region from another image, check "Example" and upload that image in the referring-image panel);
&#x1F388 Select at least one type of prompt (to use a referred region from another image, check "Example");
&#x1F388 Remember to provide the actual prompt for each prompt type you select, otherwise you will get an error (e.g., remember to draw on the referring image);
&#x1F388 By default the model supports the 133 COCO categories; anything else will be mapped to 'others' or misclassified.
'''
article = "The Demo is Run on Magma-8B."
inputs = [
gr.components.Image(label="Draw on Image",type="pil"),
gr.Textbox(label="Task"),
gr.Slider(1, 50, value=10, label="Number of Marks", info="Choose between 1 and 50"),
gr.Slider(2, 50, value=8, label="Speed", info="Choose between 2 and 50"),
gr.Slider(2, 50, value=8, label="Steps", info="Choose between 2 and 50"),
]
gr.Interface(
fn=inference,
inputs=inputs,
outputs=[
gr.Video(
label="Robot planning trajectory", format="mp4"
),
],
examples=[
["agents/robot_traj/sample.png", "Pick up the chip bag.", 9, 8, 8],
],
title=title,
description=description,
article=article,
allow_flagging='never',
cache_examples=False,
).launch(share=True)
# --------------------------------------------------------
# Magma - Multimodal AI Agent at Microsoft Research
# Copyright (c) 2025 Microsoft
# Licensed under The MIT License [see LICENSE for details]
# Jianwei Yang (jianwyan@microsoft.com)
# --------------------------------------------------------
import os
import warnings
from utils.visualizer import Visualizer
from typing import TYPE_CHECKING, Any, Callable, Dict, List, Optional, Tuple
import random
import gradio as gr
import ast, re
import torch
import torchvision
from transformers import AutoModelForCausalLM, AutoProcessor
'''
build model
'''
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
random.seed(0)
spatial_quant_size = 256
# Load AI Model
dtype = torch.bfloat16
device = "cuda"
magma_model_id = "microsoft/Magma-8B"
model = AutoModelForCausalLM.from_pretrained(magma_model_id, trust_remote_code=True, torch_dtype=dtype)
processor = AutoProcessor.from_pretrained(magma_model_id, trust_remote_code=True)
model.to(device)
@torch.no_grad()
def inference(image, task, *args, **kwargs):
# image = image['image']
task_description = task
num_marks = args[0]
speed = args[1]
steps = args[2]
mark_ids = [i+1 for i in range(num_marks)]
image_resized = image.resize((256, 256))
magma_template = (
# "<image>\nThe image is labeled with numeric marks {}.\n"
"<image>\nThe image is split into 256x256 grids and is labeled with numeric marks {}.\n"
"The robot is doing: {}. To finish the task, how to move the numerical marks in the image with speed {} for the next {} steps?\n"
)
"""
Visual Trace Generation
"""
if model.config.mm_use_image_start_end:
magma_template = magma_template.replace("<image>", "<image_start><image><image_end>")
conv_user = magma_template.format(mark_ids, task_description, speed, steps)
print(conv_user)
convs = [
{"role": "user", "content": conv_user},
]
convs = [
{
"role": "system",
"content": "You are agent that can see, talk and act.",
},
] + convs
prompt = processor.tokenizer.apply_chat_template(
convs,
tokenize=False,
add_generation_prompt=True
)
inputs = processor(images=image_resized, texts=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
inputs = inputs.to(dtype).to(device)
with torch.inference_mode():
output_ids = model.generate(
**inputs,
temperature=0.3,
do_sample=True,
num_beams=1,
max_new_tokens=1024,
use_cache=True,
)
response = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
if len(response)==0:
return None
# extract traces from response
if "and their future positions are:" in response:
selected_marks_str, traces_str = response.split("and their future positions are:\n")
else:
selected_marks_str, traces_str = None, response
try:
traces_dict = ast.literal_eval('{' + traces_str.strip().replace('\n\n',',') + '}')
overlay_traces = []
for mark_id, trace in traces_dict.items():
# convert list of tuples to tensor
trace = torch.tensor(ast.literal_eval(trace)).unsqueeze(1)
overlay_traces.append(trace)
# pad all traces to the same length by repeating their last point
max_len = max([trace.shape[0] for trace in overlay_traces])
for i in range(len(overlay_traces)):
if overlay_traces[i].shape[0] < max_len:
overlay_traces[i] = torch.cat([overlay_traces[i], overlay_traces[i][-1].unsqueeze(0).repeat(max_len - overlay_traces[i].shape[0], 1, 1)], dim=0)
overlay_traces = torch.cat(overlay_traces, dim=1).unsqueeze(0)
# if selected_marks_str is not None:
# selected_marks = re.findall(r'\[(.*?)\]', selected_marks_str)
# selected_marks = [torch.tensor(ast.literal_eval(mark)).unsqueeze(0) for mark in selected_marks]
# selected_marks = torch.cat(selected_marks, dim=0).unsqueeze(0)
# overlay_traces = torch.cat([selected_marks.unsqueeze(1), overlay_traces], dim=1)
overlay_traces = overlay_traces.float() / 256
overlay_traces[:,:,:,0] = overlay_traces[:,:,:,0] * image.size[0]
overlay_traces[:,:,:,1] = overlay_traces[:,:,:,1] * image.size[1]
images = [image] * overlay_traces.shape[1]
overlay_visibility = overlay_traces.new(overlay_traces.shape[0], overlay_traces.shape[1], overlay_traces.shape[2]).fill_(True)
video = torch.stack([torchvision.transforms.ToTensor()(img) for img in images])[None].float()*255
vis = Visualizer(save_dir="./saved_videos", pad_value=0, linewidth=2, tracks_leave_trace=-1)
vis.visualize(video, overlay_traces, overlay_visibility)
# return video path
return "./saved_videos/video.mp4"
except Exception as e:
print(e)
return None
from gradio.events import Dependency
class ImageMask(gr.components.Image):
"""
Sets: source="canvas", tool="sketch"
"""
is_template = True
def __init__(self, **kwargs):
super().__init__(source="upload", tool="sketch", interactive=True, **kwargs)
def preprocess(self, x):
return super().preprocess(x)
from typing import Callable, Literal, Sequence, Any, TYPE_CHECKING
from gradio.blocks import Block
if TYPE_CHECKING:
from gradio.components import Timer
class Video(gr.components.Video):
"""
Sets: source="canvas", tool="sketch"
"""
is_template = True
def __init__(self, **kwargs):
super().__init__(source="upload", **kwargs)
def preprocess(self, x):
return super().preprocess(x)
from typing import Callable, Literal, Sequence, Any, TYPE_CHECKING
from gradio.blocks import Block
if TYPE_CHECKING:
from gradio.components import Timer
'''
launch app
'''
title = "Magma"
description = '''Magma: Multimodal Agent to Act'''
'''Usage
Instructions:
&#x1F388 Try our default examples first (the sketch is not automatically drawn on the input or example image);
&#x1F388 For the video demo, processing takes about 30-60 s; please refresh if you hit an error while uploading;
&#x1F388 Upload an image/video (to use a referred region from another image, check "Example" and upload that image in the referring-image panel);
&#x1F388 Select at least one type of prompt (to use a referred region from another image, check "Example");
&#x1F388 Remember to provide the actual prompt for each prompt type you select, otherwise you will get an error (e.g., remember to draw on the referring image);
&#x1F388 By default the model supports the 133 COCO categories; anything else will be mapped to 'others' or misclassified.
'''
article = "The Demo is Run on Magma-8B."
inputs = [
gr.components.Image(label="Draw on Image",type="pil"),
gr.Textbox(label="Task"),
gr.Slider(1, 50, value=10, label="Number of Marks", info="Choose between 1 and 50"),
gr.Slider(2, 50, value=8, label="Speed", info="Choose between 2 and 50"),
gr.Slider(2, 50, value=8, label="Steps", info="Choose between 2 and 50"),
]
gr.Interface(
fn=inference,
inputs=inputs,
outputs=[
gr.Video(
label="Robot planning trajectory", format="mp4"
),
],
examples=[
["agents/robot_traj/sample.png", "Pick up the chip bag.", 9, 8, 8],
],
title=title,
description=description,
article=article,
allow_flagging='never',
cache_examples=False,
).launch(share=True)
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
import os
import numpy as np
import imageio
import torch
from matplotlib import cm
import torch.nn.functional as F
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
from PIL import Image, ImageDraw
def read_video_from_path(path):
try:
reader = imageio.get_reader(path)
except Exception as e:
print("Error opening video file: ", e)
return None
frames = []
for i, im in enumerate(reader):
frames.append(np.array(im))
return np.stack(frames)
def draw_circle(rgb, coord, radius, color=(255, 0, 0), visible=True):
# Create a draw object
draw = ImageDraw.Draw(rgb)
# Calculate the bounding box of the circle
left_up_point = (coord[0] - radius, coord[1] - radius)
right_down_point = (coord[0] + radius, coord[1] + radius)
# Draw the circle
draw.ellipse(
[left_up_point, right_down_point],
fill=tuple(color) if visible else None,
outline=tuple(color),
)
return rgb
def draw_line(rgb, coord_y, coord_x, color, linewidth):
draw = ImageDraw.Draw(rgb)
draw.line(
(coord_y[0], coord_y[1], coord_x[0], coord_x[1]),
fill=tuple(color),
width=linewidth,
)
return rgb
def add_weighted(rgb, alpha, original, beta, gamma):
return (rgb * alpha + original * beta + gamma).astype("uint8")
class Visualizer:
def __init__(
self,
save_dir: str = "./results",
grayscale: bool = False,
pad_value: int = 0,
fps: int = 10,
mode: str = "rainbow", # 'cool', 'optical_flow'
linewidth: int = 2,
show_first_frame: int = 10,
tracks_leave_trace: int = 0, # -1 for infinite
):
self.mode = mode
self.save_dir = save_dir
if mode == "rainbow":
self.color_map = cm.get_cmap("gist_rainbow")
elif mode == "cool":
self.color_map = cm.get_cmap(mode)
self.show_first_frame = show_first_frame
self.grayscale = grayscale
self.tracks_leave_trace = tracks_leave_trace
self.pad_value = pad_value
self.linewidth = linewidth
self.fps = fps
def visualize(
self,
video: torch.Tensor, # (B,T,C,H,W)
tracks: torch.Tensor, # (B,T,N,2)
visibility: torch.Tensor = None, # (B, T, N, 1) bool
gt_tracks: torch.Tensor = None, # (B,T,N,2)
segm_mask: torch.Tensor = None, # (B,1,H,W)
filename: str = "video",
writer=None, # tensorboard Summary Writer, used for visualization during training
step: int = 0,
query_frame: int = 0,
save_video: bool = True,
compensate_for_camera_motion: bool = False,
):
if compensate_for_camera_motion:
assert segm_mask is not None
if segm_mask is not None:
coords = tracks[0, query_frame].round().long()
segm_mask = segm_mask[0, query_frame][coords[:, 1], coords[:, 0]].long()
video = F.pad(
video,
(self.pad_value, self.pad_value, self.pad_value, self.pad_value),
"constant",
255,
)
tracks = tracks + self.pad_value
if self.grayscale:
transform = transforms.Grayscale()
video = transform(video)
video = video.repeat(1, 1, 3, 1, 1)
res_video = self.draw_tracks_on_video(
video=video,
tracks=tracks,
visibility=visibility,
segm_mask=segm_mask,
gt_tracks=gt_tracks,
query_frame=query_frame,
compensate_for_camera_motion=compensate_for_camera_motion,
)
if save_video:
self.save_video(res_video, filename=filename, writer=writer, step=step)
return res_video
def save_video(self, video, filename, writer=None, step=0):
if writer is not None:
writer.add_video(
filename,
video.to(torch.uint8),
global_step=step,
fps=self.fps,
)
else:
os.makedirs(self.save_dir, exist_ok=True)
wide_list = list(video.unbind(1))
wide_list = [wide[0].permute(1, 2, 0).cpu().numpy() for wide in wide_list]
# Prepare the video file path
save_path = os.path.join(self.save_dir, f"{filename}.mp4")
# Create a writer object
video_writer = imageio.get_writer(save_path, fps=self.fps)
# Write frames to the video file
for frame in wide_list[2:-1]:
video_writer.append_data(frame)
video_writer.close()
print(f"Video saved to {save_path}")
def draw_tracks_on_video(
self,
video: torch.Tensor,
tracks: torch.Tensor,
visibility: torch.Tensor = None,
segm_mask: torch.Tensor = None,
gt_tracks=None,
query_frame: int = 0,
compensate_for_camera_motion=False,
):
B, T, C, H, W = video.shape
_, _, N, D = tracks.shape
assert D == 2
assert C == 3
video = video[0].permute(0, 2, 3, 1).byte().detach().cpu().numpy() # S, H, W, C
tracks = tracks[0].long().detach().cpu().numpy() # S, N, 2
if gt_tracks is not None:
gt_tracks = gt_tracks[0].detach().cpu().numpy()
res_video = []
# process input video
for rgb in video:
res_video.append(rgb.copy())
vector_colors = np.zeros((T, N, 3))
if self.mode == "optical_flow":
import flow_vis
vector_colors = flow_vis.flow_to_color(tracks - tracks[query_frame][None])
elif segm_mask is None:
if self.mode == "rainbow":
y_min, y_max = (
tracks[query_frame, :, 1].min(),
tracks[query_frame, :, 1].max(),
)
norm = plt.Normalize(y_min, y_max)
for n in range(N):
color = self.color_map(norm(tracks[query_frame, n, 1]))
color = np.array(color[:3])[None] * 255
vector_colors[:, n] = np.repeat(color, T, axis=0)
else:
# color changes with time
for t in range(T):
color = np.array(self.color_map(t / T)[:3])[None] * 255
vector_colors[t] = np.repeat(color, N, axis=0)
else:
if self.mode == "rainbow":
vector_colors[:, segm_mask <= 0, :] = 255
y_min, y_max = (
tracks[0, segm_mask > 0, 1].min(),
tracks[0, segm_mask > 0, 1].max(),
)
norm = plt.Normalize(y_min, y_max)
for n in range(N):
if segm_mask[n] > 0:
color = self.color_map(norm(tracks[0, n, 1]))
color = np.array(color[:3])[None] * 255
vector_colors[:, n] = np.repeat(color, T, axis=0)
else:
# color changes with segm class
segm_mask = segm_mask.cpu()
color = np.zeros((segm_mask.shape[0], 3), dtype=np.float32)
color[segm_mask > 0] = np.array(self.color_map(1.0)[:3]) * 255.0
color[segm_mask <= 0] = np.array(self.color_map(0.0)[:3]) * 255.0
vector_colors = np.repeat(color[None], T, axis=0)
# draw tracks
if self.tracks_leave_trace != 0:
for t in range(query_frame + 1, T):
first_ind = (
max(0, t - self.tracks_leave_trace) if self.tracks_leave_trace >= 0 else 0
)
curr_tracks = tracks[first_ind : t + 1]
curr_colors = vector_colors[first_ind : t + 1]
if compensate_for_camera_motion:
diff = (
tracks[first_ind : t + 1, segm_mask <= 0]
- tracks[t : t + 1, segm_mask <= 0]
).mean(1)[:, None]
curr_tracks = curr_tracks - diff
curr_tracks = curr_tracks[:, segm_mask > 0]
curr_colors = curr_colors[:, segm_mask > 0]
res_video[t] = self._draw_pred_tracks(
res_video[t],
curr_tracks,
curr_colors,
)
if gt_tracks is not None:
res_video[t] = self._draw_gt_tracks(res_video[t], gt_tracks[first_ind : t + 1])
# draw points
for t in range(query_frame, T):
img = Image.fromarray(np.uint8(res_video[t]))
for i in range(N):
coord = (tracks[t, i, 0], tracks[t, i, 1])
visible = True
if visibility is not None:
visible = visibility[0, t, i]
if coord[0] != 0 and coord[1] != 0:
if not compensate_for_camera_motion or (
compensate_for_camera_motion and segm_mask[i] > 0
):
img = draw_circle(
img,
coord=coord,
radius=int(self.linewidth * 2),
color=vector_colors[t, i].astype(int),
visible=visible,
)
res_video[t] = np.array(img)
# construct the final rgb sequence
if self.show_first_frame > 0:
res_video = [res_video[0]] * self.show_first_frame + res_video[1:]
return torch.from_numpy(np.stack(res_video)).permute(0, 3, 1, 2)[None].byte()
def _draw_pred_tracks(
self,
rgb: np.ndarray, # H x W x 3
tracks: np.ndarray, # T x 2
vector_colors: np.ndarray,
alpha: float = 0.5,
):
T, N, _ = tracks.shape
rgb = Image.fromarray(np.uint8(rgb))
for s in range(T - 1):
vector_color = vector_colors[s]
original = rgb.copy()
alpha = (s / T) ** 2
for i in range(N):
coord_y = (int(tracks[s, i, 0]), int(tracks[s, i, 1]))
coord_x = (int(tracks[s + 1, i, 0]), int(tracks[s + 1, i, 1]))
if coord_y[0] != 0 and coord_y[1] != 0:
rgb = draw_line(
rgb,
coord_y,
coord_x,
vector_color[i].astype(int),
self.linewidth,
)
if self.tracks_leave_trace > 0:
rgb = Image.fromarray(
np.uint8(add_weighted(np.array(rgb), alpha, np.array(original), 1 - alpha, 0))
)
rgb = np.array(rgb)
return rgb
def _draw_gt_tracks(
self,
rgb: np.ndarray, # H x W x 3,
gt_tracks: np.ndarray, # T x 2
):
T, N, _ = gt_tracks.shape
color = np.array((211, 0, 0))
rgb = Image.fromarray(np.uint8(rgb))
for t in range(T):
for i in range(N):
gt_track = gt_tracks[t][i]
# draw a red cross at the ground-truth point (use a local variable so gt_tracks itself is not overwritten)
if gt_track[0] > 0 and gt_track[1] > 0:
length = self.linewidth * 3
coord_y = (int(gt_track[0]) + length, int(gt_track[1]) + length)
coord_x = (int(gt_track[0]) - length, int(gt_track[1]) - length)
rgb = draw_line(
rgb,
coord_y,
coord_x,
color,
self.linewidth,
)
coord_y = (int(gt_track[0]) - length, int(gt_track[1]) + length)
coord_x = (int(gt_track[0]) + length, int(gt_track[1]) - length)
rgb = draw_line(
rgb,
coord_y,
coord_x,
color,
self.linewidth,
)
rgb = np.array(rgb)
return rgb
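# ---------------------------------------------------------------
# Minimal usage sketch (not part of the original file): the tensor shapes the
# Visualizer expects, with random data standing in for real videos and traces.
# ---------------------------------------------------------------
# if __name__ == "__main__":
#     B, T, N = 1, 8, 3
#     video = torch.randint(0, 255, (B, T, 3, 256, 256)).float()  # (B, T, C, H, W), values in 0-255
#     tracks = torch.rand(B, T, N, 2) * 256                       # (B, T, N, 2) pixel coordinates
#     vis = Visualizer(save_dir="./saved_videos", linewidth=2, tracks_leave_trace=-1)
#     vis.visualize(video, tracks, filename="demo")               # writes ./saved_videos/demo.mp4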
# --------------------------------------------------------
# Magma - Multimodal AI Agent at Microsoft Research
# Copyright (c) 2025 Microsoft
# Licensed under The MIT License [see LICENSE for details]
# Jianwei Yang (jianwyan@microsoft.com)
# --------------------------------------------------------
from typing import Optional
import spaces
import gradio as gr
import numpy as np
import torch
from PIL import Image
import io
import re
import base64, os
from util.utils import check_ocr_box, get_yolo_model, get_caption_model_processor, get_som_labeled_img
from util.som import MarkHelper, plot_boxes_with_marks, plot_circles_with_marks
from util.process_utils import pred_2_point, extract_bbox, extract_mark_id
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoProcessor
# Define repository and local directory
repo_id = "microsoft/OmniParser-v2.0" # HF repo
local_dir = "weights" # Target local directory
dtype = torch.bfloat16
DEVICE = torch.device('cuda')
som_generator = MarkHelper()
magma_som_prompt = "<image>\nIn this view I need to click a button to \"{}\"? Provide the coordinates and the mark index of the containing bounding box if applicable."
magma_qa_prompt = "<image>\n{} Answer the question briefly."
magma_model_id = "microsoft/Magma-8B"
magam_model = AutoModelForCausalLM.from_pretrained(magma_model_id, trust_remote_code=True, torch_dtype=dtype)
magma_processor = AutoProcessor.from_pretrained(magma_model_id, trust_remote_code=True)
magam_model.to(DEVICE)
# Download the entire repository
snapshot_download(repo_id=repo_id, local_dir=local_dir)
print(f"Repository downloaded to: {local_dir}")
yolo_model = get_yolo_model(model_path='weights/icon_detect/model.pt')
caption_model_processor = get_caption_model_processor(model_name="florence2", model_name_or_path="weights/icon_caption")
# caption_model_processor = get_caption_model_processor(model_name="blip2", model_name_or_path="weights/icon_caption_blip2")
MARKDOWN = """
<div align="center">
<h2>Magma: A Foundation Model for Multimodal AI Agents</h2>
\[[arXiv Paper](https://www.arxiv.org/pdf/2502.13130)\] &nbsp; \[[Project Page](https://microsoft.github.io/Magma/)\] &nbsp; \[[Github Repo](https://github.com/microsoft/Magma)\] &nbsp; \[[Hugging Face Model](https://huggingface.co/microsoft/Magma-8B)\] &nbsp;
This demo is powered by [Gradio](https://gradio.app/) and uses [OmniParserv2](https://github.com/microsoft/OmniParser) to generate [Set-of-Mark prompts](https://github.com/microsoft/SoM).
The demo supports three modes:
1. Empty text input: the demo falls back to plain OmniParser parsing.
2. Text input starting with "Q:": it leads to a visual question answering demo.
3. Text input for UI navigation: it leads to a UI navigation demo.
</div>
"""
DEVICE = torch.device('cuda')
@spaces.GPU
@torch.inference_mode()
def get_som_response(instruction, image_som):
prompt = magma_som_prompt.format(instruction)
if magam_model.config.mm_use_image_start_end:
qs = prompt.replace('<image>', '<image_start><image><image_end>')
else:
qs = prompt
convs = [{"role": "user", "content": qs}]
convs = [{"role": "system", "content": "You are agent that can see, talk and act."}] + convs
prompt = magma_processor.tokenizer.apply_chat_template(
convs,
tokenize=False,
add_generation_prompt=True
)
inputs = magma_processor(images=[image_som], texts=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
inputs = inputs.to(dtype).to(DEVICE)
magam_model.generation_config.pad_token_id = magma_processor.tokenizer.pad_token_id
with torch.inference_mode():
output_ids = magam_model.generate(
**inputs,
temperature=0.0,
do_sample=False,
num_beams=1,
max_new_tokens=128,
use_cache=True
)
prompt_decoded = magma_processor.batch_decode(inputs['input_ids'], skip_special_tokens=True)[0]
response = magma_processor.batch_decode(output_ids, skip_special_tokens=True)[0]
response = response.replace(prompt_decoded, '').strip()
return response
@spaces.GPU
@torch.inference_mode()
def get_qa_response(instruction, image):
prompt = magma_qa_prompt.format(instruction)
if magam_model.config.mm_use_image_start_end:
qs = prompt.replace('<image>', '<image_start><image><image_end>')
else:
qs = prompt
convs = [{"role": "user", "content": qs}]
convs = [{"role": "system", "content": "You are agent that can see, talk and act."}] + convs
prompt = magma_processor.tokenizer.apply_chat_template(
convs,
tokenize=False,
add_generation_prompt=True
)
inputs = magma_processor(images=[image], texts=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
inputs = inputs.to(dtype).to(DEVICE)
magam_model.generation_config.pad_token_id = magma_processor.tokenizer.pad_token_id
with torch.inference_mode():
output_ids = magam_model.generate(
**inputs,
temperature=0.0,
do_sample=False,
num_beams=1,
max_new_tokens=128,
use_cache=True
)
prompt_decoded = magma_processor.batch_decode(inputs['input_ids'], skip_special_tokens=True)[0]
response = magma_processor.batch_decode(output_ids, skip_special_tokens=True)[0]
response = response.replace(prompt_decoded, '').strip()
return response
@spaces.GPU
@torch.inference_mode()
# @torch.autocast(device_type="cuda", dtype=torch.bfloat16)
def process(
image_input,
box_threshold,
iou_threshold,
use_paddleocr,
imgsz,
instruction,
) -> Optional[Image.Image]:
# image_save_path = 'imgs/saved_image_demo.png'
# image_input.save(image_save_path)
# image = Image.open(image_save_path)
box_overlay_ratio = image_input.size[0] / 3200
draw_bbox_config = {
'text_scale': 0.8 * box_overlay_ratio,
'text_thickness': max(int(2 * box_overlay_ratio), 1),
'text_padding': max(int(3 * box_overlay_ratio), 1),
'thickness': max(int(3 * box_overlay_ratio), 1),
}
ocr_bbox_rslt, is_goal_filtered = check_ocr_box(image_input, display_img = False, output_bb_format='xyxy', goal_filtering=None, easyocr_args={'paragraph': False, 'text_threshold':0.9}, use_paddleocr=use_paddleocr)
text, ocr_bbox = ocr_bbox_rslt
dino_labled_img, label_coordinates, parsed_content_list = get_som_labeled_img(image_input, yolo_model, BOX_TRESHOLD = box_threshold, output_coord_in_ratio=False, ocr_bbox=ocr_bbox,draw_bbox_config=draw_bbox_config, caption_model_processor=caption_model_processor, ocr_text=text,iou_threshold=iou_threshold, imgsz=imgsz,)
parsed_content_list = '\n'.join([f'icon {i}: ' + str(v) for i,v in enumerate(parsed_content_list)])
if len(instruction) == 0:
print('finish processing')
image = Image.open(io.BytesIO(base64.b64decode(dino_labled_img)))
return image, str(parsed_content_list)
elif instruction.startswith('Q:'):
response = get_qa_response(instruction, image_input)
return image_input, response
# parsed_content_list = str(parsed_content_list)
# convert xywh to yxhw
label_coordinates_yxhw = {}
for key, val in label_coordinates.items():
if val[2] < 0 or val[3] < 0:
continue
label_coordinates_yxhw[key] = [val[1], val[0], val[3], val[2]]
image_som = plot_boxes_with_marks(image_input.copy(), [val for key, val in label_coordinates_yxhw.items()], som_generator, edgecolor=(255,0,0), fn_save=None, normalized_to_pixel=False)
# convert xywh to xyxy
for key, val in label_coordinates.items():
label_coordinates[key] = [val[0], val[1], val[0] + val[2], val[1] + val[3]]
# normalize label_coordinates
for key, val in label_coordinates.items():
label_coordinates[key] = [val[0] / image_input.size[0], val[1] / image_input.size[1], val[2] / image_input.size[0], val[3] / image_input.size[1]]
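# At this point label_coordinates maps each mark id to a normalized
# [x1, y1, x2, y2] box. Illustrative example: an OmniParser xywh box
# [100, 40, 50, 20] on a 1000x500 screenshot becomes xyxy [100, 40, 150, 60]
# and then [0.1, 0.08, 0.15, 0.12] after normalization.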
magma_response = get_som_response(instruction, image_som)
print("magma repsonse: ", magma_response)
# map magma_response into the mark id
mark_id = extract_mark_id(magma_response)
if mark_id is not None:
if str(mark_id) in label_coordinates:
bbox_for_mark = label_coordinates[str(mark_id)]
else:
bbox_for_mark = None
else:
bbox_for_mark = None
if bbox_for_mark:
# draw bbox_for_mark on the image
image_som = plot_boxes_with_marks(
image_input,
[label_coordinates_yxhw[str(mark_id)]],
som_generator,
edgecolor=(255,127,111),
alpha=30,
fn_save=None,
normalized_to_pixel=False,
add_mark=False
)
else:
try:
if 'box' in magma_response:
pred_bbox = extract_bbox(magma_response)
click_point = [(pred_bbox[0][0] + pred_bbox[1][0]) / 2, (pred_bbox[0][1] + pred_bbox[1][1]) / 2]
click_point = [item / 1000 for item in click_point]
else:
click_point = pred_2_point(magma_response)
# de-normalize click_point (width, height)
click_point = [click_point[0] * image_input.size[0], click_point[1] * image_input.size[1]]
image_som = plot_circles_with_marks(
image_input,
[click_point],
som_generator,
edgecolor=(255,127,111),
linewidth=3,
fn_save=None,
normalized_to_pixel=False,
add_mark=False
)
except:
image_som = image_input
return image_som, str(parsed_content_list)
with gr.Blocks() as demo:
gr.Markdown(MARKDOWN)
with gr.Row():
with gr.Column():
image_input_component = gr.Image(
type='pil', label='Upload image')
# set the threshold for removing the bounding boxes with low confidence, default is 0.05
with gr.Accordion("Parameters", open=False) as parameter_row:
box_threshold_component = gr.Slider(
label='Box Threshold', minimum=0.01, maximum=1.0, step=0.01, value=0.05)
# set the threshold for removing the bounding boxes with large overlap, default is 0.1
iou_threshold_component = gr.Slider(
label='IOU Threshold', minimum=0.01, maximum=1.0, step=0.01, value=0.1)
use_paddleocr_component = gr.Checkbox(
label='Use PaddleOCR', value=True)
imgsz_component = gr.Slider(
label='Icon Detect Image Size', minimum=640, maximum=1920, step=32, value=640)
# text box
text_input_component = gr.Textbox(label='Text Input', placeholder='Text Input')
submit_button_component = gr.Button(
value='Submit', variant='primary')
with gr.Column():
image_output_component = gr.Image(type='pil', label='Image Output')
text_output_component = gr.Textbox(label='Parsed screen elements', placeholder='Text Output')
submit_button_component.click(
fn=process,
inputs=[
image_input_component,
box_threshold_component,
iou_threshold_component,
use_paddleocr_component,
imgsz_component,
text_input_component
],
outputs=[image_output_component, text_output_component]
)
# demo.launch(debug=False, show_error=True, share=True)
# demo.launch(share=True, server_port=7861, server_name='0.0.0.0')
demo.queue().launch(share=False)