# Microsoft Open Source Code of Conduct
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
Resources:
- [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/)
- [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)
- Contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with questions or concerns
MIT License
Copyright (c) Microsoft Corporation.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# Magma
A new era for embodied intelligence! VLA gets its strongest foundation model yet, Magma: an all-rounder for UI navigation and robot manipulation.
## Paper
`Magma: A Foundation Model for Multimodal AI Agents`
- https://arxiv.org/pdf/2502.13130
## Model Architecture
A vision encoder V encodes each frame into multiple tokens; all visual tokens are concatenated into one sequence and fed, together with the language tokens that encode the task description, into a decoder-only language model (LLM).
<div align=center>
<img src="./doc/Magma.png"/>
</div>
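As a rough, self-contained illustration of this encode-and-concatenate design, the sketch below uses toy modules; every class name, layer choice, and size is made up for illustration and is not the actual Magma implementation.
```python
import torch
import torch.nn as nn

class ToyMagma(nn.Module):
    """Schematic only: a vision encoder turns each frame into visual tokens, which are
    concatenated with the task-description tokens and fed to a decoder-only LM."""
    def __init__(self, d_model=64, vocab_size=1000):
        super().__init__()
        self.vision_encoder = nn.Conv2d(3, d_model, kernel_size=16, stride=16)  # stand-in for the real encoder
        self.text_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)  # causal masking omitted for brevity
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, frames, text_ids):
        b, t = frames.shape[:2]
        vis = self.vision_encoder(frames.flatten(0, 1))                    # (b*t, d, h', w')
        vis = vis.flatten(2).transpose(1, 2).reshape(b, -1, vis.shape[1])  # all frames' tokens in one sequence
        txt = self.text_embed(text_ids)                                    # language tokens for the task
        seq = torch.cat([vis, txt], dim=1)                                 # single multimodal sequence
        return self.lm_head(self.decoder(seq))

logits = ToyMagma()(torch.randn(1, 2, 3, 64, 64), torch.randint(0, 1000, (1, 8)))
print(logits.shape)  # (1, num_visual_tokens + 8, vocab_size)
```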
## Algorithm
Using Set-of-Mark (SoM) and Trace-of-Mark (ToM), vision-language data is converted into actionable tasks, which markedly improves spatial intelligence and task generalization; the model can understand and execute multimodal tasks in both digital and physical environments.
The researchers propose a simple and effective approach that combines Set-of-Mark (SoM) and Trace-of-Mark (ToM) to extend the model to spatial prediction tasks (e.g., clickable buttons) and to the temporal dimension.
<div align=center>
<img src="./doc/algorithm.png"/>
</div>
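To make SoM and ToM more concrete, here is a hedged sketch of how candidate regions and point tracks could be turned into mark-based prompts and targets. The wording loosely follows the mark-based prompt in `agents/robot_traj/app.py` further down in this repo, but the field names and data format here are illustrative assumptions, not the official pipeline.
```python
# Illustrative only: candidate regions (e.g., clickable buttons) with pixel centers.
regions = {1: (120, 48), 2: (300, 210), 3: (512, 400)}

# Set-of-Mark (SoM): the frame is overlaid with numeric marks, and the model answers
# with a mark id instead of raw coordinates.
som_prompt = (
    f"The image is labeled with numeric marks {sorted(regions)}.\n"
    "Which mark should be clicked to open the settings page?"
)
som_target = "3"  # hypothetical ground truth

# Trace-of-Mark (ToM): for video/robot data, each mark's future positions form the
# target, turning "what moves where next" into ordinary token prediction.
future_positions = {1: [(120, 48), (122, 50), (126, 55), (133, 61)]}
tom_target = "Mark 1 moves along: " + str(future_positions[1])

print(som_prompt)
print(tom_target)
```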
## Environment Setup
```
mv Magma_pytorch Magma # drop the framework-name suffix
```
### Docker (Method 1)
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.4.1-ubuntu22.04-dtk25.04-py3.10-fixpy
# Replace <your IMAGE ID> with the image ID of the Docker image pulled above; for this image it is 6063b673703a
docker run -it --shm-size=64G -v $PWD/Magma:/home/Magma -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name magma <your IMAGE ID> bash
cd /home/Magma
pip install -e . -i https://mirrors.aliyun.com/pypi/simple
pip install https://download.sourcefind.cn:65024/directlink/4/tensorflow/DAS1.5/tensorflow-2.13.1+das.opt1.dtk2504-cp310-cp310-linux_x86_64.whl -i https://mirrors.aliyun.com/pypi/simple # tensorflow=2.13.1
```
### Dockerfile (Method 2)
```
cd /home/Magma/docker
docker build --no-cache -t magma:latest .
docker run --shm-size=64G --name magma -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v $PWD/../../Magma:/home/Magma -it magma bash
# If installing the environment through the Dockerfile takes a long time, comment out the pip installs inside it and install the Python libraries after the container starts: pip install -r requirements.txt.
pip install -e . -i https://mirrors.aliyun.com/pypi/simple
pip install https://download.sourcefind.cn:65024/directlink/4/tensorflow/DAS1.5/tensorflow-2.13.1+das.opt1.dtk2504-cp310-cp310-linux_x86_64.whl -i https://mirrors.aliyun.com/pypi/simple # tensorflow=2.13.1
```
### Anaconda (Method 3)
1. The DCU-specific deep learning libraries required by this project can be downloaded from the 光合 (SourceFind) developer community:
- https://developer.sourcefind.cn/tool/
```
DTK driver: dtk2504
python: 3.10
torch: 2.4.1
torchvision: 0.19.1
triton: 3.0.0
vllm: 0.6.2
flash-attn: 2.6.1
deepspeed: 0.14.2
apex: 1.4.0
transformers: 4.51.3
tensorflow: 2.13.1
```
`Tips: the versions of the DTK driver, python, torch, and the other DCU-related tools above must correspond exactly, one to one.`
2. Install the other, non-special libraries according to requirements.txt:
```
cd /home/Magma
pip install -e . -i https://mirrors.aliyun.com/pypi/simple
```
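After the install, a quick sanity check of the environment (assuming the DCU build of PyTorch exposes the standard `torch.cuda` interface, as the version list above suggests) can look like this:
```python
import torch
import transformers

# Expect versions matching the list above, e.g. torch 2.4.1 and transformers 4.51.3.
print("torch:", torch.__version__, "| transformers:", transformers.__version__)
print("device available:", torch.cuda.is_available(), "| device count:", torch.cuda.device_count())
```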
## Dataset
`None`
## Training
`None`
## Inference
Directory structure of the pretrained weights:
```
/home/Magma
└── microsoft/Magma-8B
```
Set the Hugging Face download mirror:
```
export HF_ENDPOINT=https://hf-mirror.com
```
Then, when the inference command is run, the project automatically downloads the model laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg and caches the encoded result; the upstream author's code does not support loading this model from local weights.
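If the automatic download is inconvenient (for example on a node with restricted network access), one possible workaround, not a documented workflow of this project, is to pre-populate the Hugging Face cache with `huggingface_hub` so the download above hits the local cache:
```python
# Hypothetical pre-download step; the repo ids come from the notes above.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="microsoft/Magma-8B")
snapshot_download(repo_id="laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg")
```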
### Single-node multi-card
```
cd /home/Magma
python infer_transformers.py
```
For more details, see the upstream project's [`README_origin`](./README_origin.md).
## Result
`Input:`
```
prompt: "What is the letter on the robot?"
image: "./assets/images/magma_logo.jpg"
```
`Output:`
```
response: The letter on the robot is "M".
```
Official demo example:
<div align=center>
<img src="./doc/magma_mushroom.gif"/>
</div>
### Accuracy
Accuracy on DCU is consistent with GPU; inference framework: PyTorch.
## Application Scenarios
### Algorithm Category
`Embodied AI`
### Key Application Industries
`Manufacturing, Home, Healthcare, Energy, Education`
## Pretrained Weights
HF/GitHub download: [microsoft/Magma-8B](https://huggingface.co/microsoft/Magma-8B)
## Source Repository and Issue Reporting
- http://developer.sourcefind.cn/codes/modelzoo/InfiniteYou_pytorch.git
## References
- https://github.com/microsoft/Magma.git
<div align="center">
<h2>🤖 Magma: A Foundation Model for Multimodal AI Agents</h2>
[Jianwei Yang](https://jwyang.github.io/)<sup>*</sup><sup>1</sup><sup></sup>&nbsp;
[Reuben Tan](https://cs-people.bu.edu/rxtan/)<sup>1</sup><sup></sup>&nbsp;
[Qianhui Wu](https://qianhuiwu.github.io/)<sup>1</sup><sup></sup>&nbsp;
[Ruijie Zheng](https://ruijiezheng.com/)<sup>2</sup><sup></sup>&nbsp;
[Baolin Peng](https://scholar.google.com/citations?user=u1CNjgwAAAAJ&hl=en&oi=ao)<sup>1</sup><sup></sup>&nbsp;
[Yongyuan Liang](https://cheryyunl.github.io)<sup>2</sup><sup></sup>
[Yu Gu](http://yu-gu.me/)<sup>1</sup>&nbsp;
[Mu Cai](https://pages.cs.wisc.edu/~mucai/)<sup>3</sup>&nbsp;
[Seonghyeon Ye](https://seonghyeonye.github.io/)<sup>4</sup>&nbsp;
[Joel Jang](https://joeljang.github.io/)<sup>5</sup>&nbsp;
[Yuquan Deng](https://scholar.google.com/citations?user=LTC0Q6YAAAAJ&hl=en)<sup>5</sup>&nbsp;
[Lars Liden](https://sites.google.com/site/larsliden)<sup>1</sup>&nbsp;
[Jianfeng Gao](https://www.microsoft.com/en-us/research/people/jfgao/)<sup>1</sup><sup></sup>
<sup>1</sup> Microsoft Research; <sup>2</sup> University of Maryland; <sup>3</sup> University of Wisconsin-Madison
<sup>4</sup> KAIST; <sup>5</sup> University of Washington
<sup>*</sup> Project lead <sup></sup> First authors <sup></sup> Second authors <sup></sup> Leadership
<h3 style="color:#b22222;"> To Appear at CVPR 2025 </h3>
<h4>
<a href="https://www.arxiv.org/pdf/2502.13130">📄 arXiv Paper</a> &nbsp;
<a href="https://microsoft.github.io/Magma/">🌐 Project Page</a> &nbsp;
<a href="https://huggingface.co/microsoft/Magma-8B">🤗 Hugging Face Model</a>
<a href="https://ai.azure.com/explore/models/microsoft-magma-8b/version/1/registry/HuggingFace?tid=72f988bf-86f1-41af-91ab-2d7cd011db47">☁️ Azure AI Foundry</a>
<a href="https://www.youtube.com/watch?v=SbfzvUU5yM8">📺 Video</a>
</h4>
<!-- <h3>
<a href="https://huggingface.co/spaces/microsoft/Magma-UI">🤗 Gradio UI Agent</a>
<a href="https://huggingface.co/spaces/microsoft/Magma-Gaming">🤗 Gradio Gaming Agent</a>
</h3> -->
</div>
<div align="center">
<p2>The Path Towards Multimodal AI Agents</p2>
<img src="assets/images/magma_teaser.png?raw=true" width="100%">
</div>
## :sparkles: Highlights
* **Digital and Physical Worlds:** Magma is the first-ever foundation model for multimodal AI agents, designed to handle complex interactions across both virtual and real environments!
* **Versatile Capabilities:** Magma, as a single model, not only possesses generic image and video understanding ability, but also generates goal-driven visual plans and actions, making it versatile for different agentic tasks!
* **State-of-the-art Performance:** Magma achieves state-of-the-art performance on various multimodal tasks, including UI navigation, robotics manipulation, as well as generic image and video understanding, in particular the spatial understanding and reasoning!
* **Scalable Pretraining Strategy:** Magma is designed to be **learned scalably from unlabeled videos** in the wild in addition to the existing agentic data, giving it strong generalization ability and making it suitable for real-world applications!
## :fire: News
* **[2025.04.12]** 🔥We released the pretraining videos with visual traces on hugging face [Magma-Video-ToM](https://huggingface.co/datasets/MagmaAI/Magma-Video-ToM).
* **[2025.04.06]** Open X-Embodiment pretraining data with visual traces can be downloaded from [Magma-OXE-ToM](https://huggingface.co/datasets/MagmaAI/Magma-OXE-ToM).
* **[2025.03.16]** We released the demo code for generating SoM and ToM for instructional videos (i.e., Alg. 2 in our paper) in [SoM/ToM Generation](#som-and-tom-generation).
* **[2025.03.09]** 🔥 We released the Magma training code, and an example for training Magma-8B on the Magma-820K dataset. Check out [Model Training](#model-training).
* **[2025.03.06]** We released a new demo for showing robot planning capabilities. Run `python agents/robot_traj/app.py` to start the demo!
* **[2025.02.28]** We released two demos, [Magma-UI](https://huggingface.co/spaces/microsoft/Magma-UI) and [Magma-Gaming](https://huggingface.co/spaces/microsoft/Magma-Gaming) on Hugging Face. Check out our model's action grounding and planning capabilities!
* **[2025.02.26]** ⭐ Exciting News! Magma got accepted by CVPR 2025!
* **[2025.02.25]** 🎉 Big News! We are releasing the Magma model on [Hugging Face](https://huggingface.co/microsoft/Magma-8B) and [Azure AI Foundry](https://ai.azure.com/explore/models/microsoft-magma-8b/version/1/registry/HuggingFace?tid=72f988bf-86f1-41af-91ab-2d7cd011db47)!
* **[2025.02.23]** We released the Magma Inference code!
* **[2025.02.20]** Magma has reached the top spot on [Hacker News](https://news.ycombinator.com/front)!
* **[2025.02.19]** We will be releasing our code, model and UI navigation demo by [MSR Forum on 02.25 next Tuesday](https://researchforum.microsoft.com/)!
* **[2025.02.18]** Our Flagship Project Magma at MSR is released on [arXiv](https://www.arxiv.org/pdf/2502.13130)!
## :bookmark_tabs: Todos
We will be releasing all the following contents:
- [x] Model inference code
- [x] Add UI and Gaming agent Demos
- [x] Model checkpoint
- [x] Training code
- [x] Open-XE pretraining data with traces
- [x] Video pretraining data with traces
## :clipboard: Outline
- [What is Magma?](#what-is-magma)
- [How we pretrain Magma?](#how-we-pretrain-magma)
- [Installation](#installation)
- [Data Preprocessing](#data-preprocessing)
- [SoM and ToM Generation](#som-and-tom-generation)
- [Model Training](#model-training)
- [Pretraining on Open-X without SoM/ToM](#pretraining-on-open-x-without-somtom)
- [Finetuning on Magma-820K](#finetuning-on-magma-820k)
- [Model Usage](#model-usage)
- [Inference](#inference)
- [Inference with Huggingface Transformers](#inference-with-huggingface-transformers)
- [Inference with local code from this repo](#inference-with-local-code-from-this-repo)
- [Inference with bitsandbytes](#inference-with-bitsandbytes)
- [Benchmarking](#benchmarking)
- [Evaluation with lmms-eval](#evaluation-with-lmms-eval)
- [Evaluation with SimplerEnv](#evaluation-with-simplerenv)
- [Multi-images or Video](#multi-images-or-video)
- [Agent Demos](#agent-demos)
- [UI Agent](#ui-agent)
- [Gaming Agent](#gaming-agent)
- [Robot Visual Planning](#robot-visual-planning)
- [Citation](#citation)
- [Acknowledgements](#acknowledgements)
## What is Magma?
<div align="center">
<img src="assets/images/magma_intro_fig.png?raw=true" width="50%">
</div>
**Magma is a foundation model for multimodal AI agents**. As the bedrock for multimodal agentic models, it should possess strong capabilities to perceive the multimodal world AND take goal-driven actions precisely (see the figure above). With this in mind, we are striving for the following goals:
* **Verbal and spatial-temporal intelligence:** Magma is supposed to have both strong verbal and spatial-temporal intelligence to understand images and videos, ground its actions on the observations, and further translate an external goal into an action plan and execution.
* **Digital and physical world:** Magma should not be limited to either the digital world (e.g., web navigation) or the physical world (e.g., robotics manipulation), but rather be able to work across both worlds, just like humans.
With this in mind, we developed a new pretraining dataset, which mostly consists of unlabeled videos in the wild plus existing annotated agentic data, and a new pretraining framework, which unifies the training of all three modalities (text, image, and action), to train a new foundation model for multimodal AI agents, named Magma.
## How we pretrain Magma?
<div align="center">
<img src="assets/images/magma_pt_v3.png?raw=true" width="100%">
</div>
We pursue the goal through two dimensions:
* **Large-scale heterogeneous training data**: we curate a large amount of data, including existing multimodal understanding data, UI navigation data, robotics manipulation data, and unlabeled videos in the wild. We also propose a new data collection pipeline for unlabeled in-the-wild videos that is scalable and cost-effective. To obtain useful action supervision from raw videos and robotics trajectories, we meticulously removed the camera motion in the videos and then transformed the remaining motion into "action" supervision for model training. These provide unique signals for the model to learn cross-modal connections and long-horizon action prediction and planning.
* **Universal pretraining objectives**: text and action tokens are inherently different, which creates a large gap between them, while visual tokens are continuous. We propose a universal pretraining framework that unifies the training of all three modalities, and we show that this is crucial for the model to learn cross-modal connections. More specifically, we propose Set-of-Mark and Trace-of-Mark as auxiliary tasks for model pretraining, serving as the bridge between the different output modalities. In this way, we build a strong alignment between the text and action modalities, and also between the image and action modalities (a sketch of the action-to-token side of this idea follows below).
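To make the shared token space more concrete, the following is a minimal sketch of how a normalized 7-DoF action could be discretized into token ids taken from the tail of the text vocabulary. It mirrors the inverse mapping used in `get_magma_action` in `libero_magma_utils.py` later in this repo (256 bins over [-1, 1]); the exact pretraining-time tokenization is an assumption here, not the authors' verbatim recipe.
```python
import numpy as np

def discretize_action(action, vocab_size, n_action_bins=256):
    """Map a normalized action in [-1, 1] to token ids at the end of the vocabulary.

    Sketch only: this is the inverse of the de-tokenization in get_magma_action;
    the real pretraining details may differ."""
    bins = np.linspace(-1, 1, n_action_bins)
    bin_centers = (bins[:-1] + bins[1:]) / 2.0
    # nearest bin center for each action dimension
    bin_ids = np.argmin(np.abs(action[:, None] - bin_centers[None, :]), axis=1)
    # reuse the last bins of the text vocabulary as "action tokens"
    return vocab_size - (bin_ids + 1)

# Example: (dx, dy, dz, droll, dpitch, dyaw, gripper), already normalized to [-1, 1]
action = np.array([0.1, -0.2, 0.05, 0.0, 0.0, 0.3, 1.0])
print(discretize_action(action, vocab_size=128256))
```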
## Installation
1. Clone this repo to your local machine:
```bash
git clone https://github.com/microsoft/Magma
cd Magma
```
2. Install the dependencies:
```bash
conda create -n magma python=3.10 -y
conda activate magma
pip install --upgrade pip
pip install -e .
```
3. Install packages for training:
```bash
pip install -e ".[train]"
```
4. Install packages for agents:
```bash
pip install -e ".[agent]"
```
5. Other probably needed packages:
* Co-tracker
```sh
# Install co-tracker
git clone https://github.com/facebookresearch/co-tracker
cd co-tracker
pip install -e .
pip install imageio[ffmpeg]
cd ../
```
* Kmeans
```sh
# Install kmeans_pytorch; note: installing it directly with pip leads to an error
git clone https://github.com/subhadarship/kmeans_pytorch
cd kmeans_pytorch
pip install -e .
cd ../
```
* Misc
```sh
# Install other packages
pip install ipython
pip install faiss-cpu
pip install decord
```
⚠️ Please make sure you have installed transformers with the correct version (>=4.49.0). If you see any abnormal behavior, check your transformers version, and see the customized transformers below if needed.
<details>
<summary>Click to expand</summary>
### Customized Transformers
⚠️ One important thing to note is that our model uses [ConvNext](https://github.com/huggingface/pytorch-image-models/blob/main/timm/models/convnext.py) as the backbone, which contains a layer-scale parameter [gamma](https://github.com/huggingface/pytorch-image-models/blob/e44f14d7d2f557b9f3add82ee4f1ed2beefbb30d/timm/models/convnext.py#L144). This triggers a bug in the Transformers library, which automatically replaces 'gamma' with 'weight' when loading the model. To fix this, we need to modify the 'transformers/models/auto/modeling_auto.py' file as follows:
```python
if "gamma" in key and "clip_vision_model" not in key:
key = key.replace("gamma", "weight")
```
This bug still exists in the latest transformers version, so please make sure you install the bug-free customized version of transformers listed in [pyproject.toml](./pyproject.toml):
```bash
pip install git+https://github.com/jwyang/transformers.git@dev/jwyang-v4.44.1
```
or the newest version:
```bash
pip install git+https://github.com/jwyang/transformers.git@dev/jwyang-v4.48.2
```
</details>
## Data Preprocessing
### SoM and ToM Generation
As shown in Table 1 of our paper, we apply SoM and ToM to both robotics data and instructional videos. To ensure reproducibility, we provide the code to generate SoM and ToM for instructional videos. The code is located in `tools/som_tom/demo.py`. You can run the following command to generate SoM and ToM for an example video:
```bash
python tools/som_tom/demo.py
```
And then you can find two videos in the `tools/som_tom/videos` folder. The original trace extracted from CoTracker is shown in `orig_trace.mp4`, and the SoM-ToM video is named `som_tom.mp4`.
## Model Training
We provide instructions to pretrain LLaMA-3-8B-Instruct on Open-X-Embodiment and to finetune Magma-8B on different downstream tasks.
### Pretraining on Open-X without SoM/ToM
* Data Preparation
Download Open-X-Embodiment from the official site. Then edit the data config file [openx.yaml](data_configs/openx.yaml) accordingly. The data config file should look like this:
```yaml
# a list of all the data paths
DATA_PATH:
- "/path/to/open-x"
IMAGE_FOLDER:
- "siglip-224px+mx-oxe-magic-soup"
LANGUAGE_PATH:
- ""
```
* Pretrain on OpenX
Once the dataset and config are set up, you can run the following command to pretrain the model:
```bash
sh scripts/pretrain/pretrain_openx.sh
```
*Benefit: we spent tremendous effort to decouple the Open-X dataloader from OpenVLA and make it compatible with the other datasets used in our experiments.*
### Finetuning on Magma-820K
* Data Preparation
Download annotation file from [MagmaAI/Magma-820K](https://huggingface.co/datasets/MagmaAI/Magma-820K). Please prepare the image data according to the dataset list in the dataset page. Once finished, please edit [magma_820k.yaml](data_configs/magma_820k.yaml) file accordingly.
```yaml
# a list of all the data paths
DATA_PATH:
- "/path/to/magma_820k.json"
IMAGE_FOLDER:
- "/root/to/magma_820k/images"
```
* Finetune from Magma-8B
Once the dataset and config are set up, you can run the following command to finetune the model:
```bash
sh scripts/finetune/finetune_magma_820k.sh
```
## Model Usage
### Inference
#### Inference with Huggingface Transformers
We have uploaded the model to Huggingface Hub. You can easily load the model and processor with the following code.
<details>
<summary>Click to expand</summary>
```python
from PIL import Image
import torch
from transformers import AutoModelForCausalLM
from transformers import AutoProcessor
dtype = torch.bfloat16
model = AutoModelForCausalLM.from_pretrained("microsoft/Magma-8B", trust_remote_code=True, torch_dtype=dtype)
processor = AutoProcessor.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)
model.to("cuda")
# Inference
image = Image.open("./assets/images/magma_logo.jpg").convert("RGB")
convs = [
{"role": "system", "content": "You are agent that can see, talk and act."},
{"role": "user", "content": "<image_start><image><image_end>\nWhat is the letter on the robot?"},
]
prompt = processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)
inputs = processor(images=[image], texts=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
inputs = inputs.to("cuda").to(dtype)
generation_args = {
"max_new_tokens": 500,
"temperature": 0.0,
"do_sample": False,
"use_cache": True,
"num_beams": 1,
}
with torch.inference_mode():
generate_ids = model.generate(**inputs, **generation_args)
generate_ids = generate_ids[:, inputs["input_ids"].shape[-1] :]
response = processor.decode(generate_ids[0], skip_special_tokens=True).strip()
print(response)
```
</details>
#### Inference with local Transformers code from this repo
If you want to debug our model, we also provide local code for inference. You can run the following code to load the model.
<details>
<summary>Click to expand</summary>
```python
import torch
from magma.processing_magma import MagmaProcessor
from magma.modeling_magma import MagmaForCausalLM
dtype = torch.bfloat16
model = MagmaForCausalLM.from_pretrained("microsoft/Magma-8B", trust_remote_code=True, torch_dtype=dtype)
processor = MagmaProcessor.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)
model.to("cuda")
```
</details>
#### Inference with bitsandbytes
We also provide a sample code to inference with bitsandbytes. You can run the following code to load the model.
<details>
<summary>Click to expand</summary>
```python
from PIL import Image
import torch
from transformers import AutoModelForCausalLM
from transformers import AutoProcessor
from transformers import BitsAndBytesConfig
# Define quantization configuration
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4"
)
# Load model with quantization config
model = AutoModelForCausalLM.from_pretrained(
"microsoft/Magma-8B",
trust_remote_code=True,
device_map={"": 0}, # force everything onto GPU 0
quantization_config=quantization_config
)
processor = AutoProcessor.from_pretrained("microsoft/Magma-8B", trust_remote_code=True)
# Inference
image = Image.open("assets/images/magma_logo.jpg").convert("RGB")
convs = [
{"role": "system", "content": "You are agent that can see, talk and act."},
{"role": "user", "content": "<image_start><image><image_end>\nWhat is the letter on the robot?"},
]
prompt = processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)
inputs = processor(images=[image], texts=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
# Convert inputs to the correct device and data type
inputs = {k: v.to(device=model.device, dtype=torch.float16 if v.dtype == torch.float32 else v.dtype)
for k, v in inputs.items()}
generation_args = {
"max_new_tokens": 500,
"temperature": 0.0,
"do_sample": False,
"use_cache": True,
"num_beams": 1,
}
with torch.inference_mode():
generate_ids = model.generate(**inputs, **generation_args)
generate_ids = generate_ids[:, inputs["input_ids"].shape[-1] :]
response = processor.decode(generate_ids[0], skip_special_tokens=True).strip()
print(response)
```
</details>
#### Benchmarking
We benchmark the inference time and memory usage of our model with and without bitsandbytes.
| Model | Inference Time | Peak Memory Usage |
|-------|----------------|--------------|
| Magma-8B (bfloat16) | 1.1s | 17GB |
| Magma-8B (4-bit) | 1.1s | 7GB |
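For reference, a rough probe like the one below can reproduce this kind of measurement with plain PyTorch (wall-clock time around `generate` plus `torch.cuda.max_memory_allocated`); the exact protocol behind the table above is not spelled out here, so treat this as a sketch rather than the official benchmark script. Called with the `inputs` and `generation_args` from the snippets above, it returns seconds per call and peak GiB.
```python
import time
import torch

def benchmark_generate(model, inputs, generation_args, warmup=1, iters=5):
    """Rough latency / peak-memory probe around model.generate on a CUDA device."""
    torch.cuda.reset_peak_memory_stats()
    with torch.inference_mode():
        for _ in range(warmup):
            model.generate(**inputs, **generation_args)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model.generate(**inputs, **generation_args)
        torch.cuda.synchronize()
    latency = (time.time() - start) / iters
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    return latency, peak_gb
```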
### Evaluation with lmms-eval
Please refer to [lmms-eval-instruction](tools/lmms-eval-magma) for the detailed instructions to run the evaluation with lmms-eval toolkit.
Once everything is ready, you can run the following code to evaluate our model from the root folder.
```bash
sh scripts/evaluation/lmms-eval/lmms_eval_magma.sh
```
You can evaluate other benchmarks by modifying the variable `eval_tasks`. The list of available `eval_tasks` can be found by running the code below.
```
# lmms-eval --tasks {list_groups,list_subtasks,list_tags,list}
lmms-eval --tasks list_groups
```
### Evaluation with SimplerEnv
Please refer to [SimplerEnv-instruction](tools/simplerenv-magma) for the detailed instructions to run the evaluation with SimplerEnv toolkit.
Once everything is ready, you can run the following code to evaluate our model.
```bash
sh scripts/evaluation/simplerenv/bridge.sh
```
### Multi-images or Video Support
Handling multiple images is extremely simple for our model. Simply duplicate the placeholder in your text prompt and add all the images to the list accordingly. A dummy example is as follows:
```py
convs = [
{"role": "system", "content": "You are agent that can see, talk and act."},
{"role": "user", "content": "<image_start><image><image_end>\n<image_start><image><image_end>\n<image_start><image><image_end>\nWhat is the letter on the robot?"},
]
prompt = processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)
inputs = processor(images=[image1,image2,image3], texts=prompt, return_tensors="pt")
```
Our model will handle the visual token filling for you!
### Agent Demos
#### UI Agent
We built several agent demos on top of our model. The first is the UI Agent Demo. As our model is pretrained with Set-of-Mark and Trace-of-Mark, it naturally synergizes with [OmniParser](https://github.com/microsoft/OmniParser). Combining the two gives you a UI agent right away; run:
```bash
python agents/ui_agent/app.py
```
More importantly, our Magma model has not only action-grounding ability but also multimodal understanding and reasoning ability. You can not only ask the model to predict where to click with a text instruction:
```bash
Go to the top ranked post
```
But you can also ask free-form questions on the fly! Simply add the prefix "Q:" at the beginning of the text prompt, e.g.,
```bash
Q: What is the title of the post?
```
#### Gaming Agent
We also built a gaming agent demo. You can run the following command to start the demo:
```bash
python agents/gaming_agent/app.py
```
Once the demo is run, you can see a robot proactively collecting the green blocks.
<!-- Below are the comparison between Magma and other counterparts VLMs:
<div align="center">
<video width="48%" controls autoplay>
<source src="https://microsoft.github.io/Magma/static/videos/magma_vs_llava.mp4" type="video/mp4">
<p>Magma v.s. LLaVA-OneVision.</p>
</video>
<video width="48%" controls autoplay>
<source src="https://microsoft.github.io/Magma/static/videos/magma_vs_qwen.mp4" type="video/mp4">
<p>Magma v.s. Qwen-2.0.</p>
</video>
</div> -->
#### Robot Visual Planning
We also built a robot visual planning demo. You can run the following command to start the demo:
```bash
python agents/robot_traj/app.py
```
For this demo, you may encounter an error as discussed in this [issue](https://github.com/microsoft/Magma/issues/43); a quick fix is to run the following command:
```sh
pip install imageio[ffmpeg]
```
If it still does not work, please install the older version of transformers:
```sh
pip install git+https://github.com/jwyang/transformers.git@dev/jwyang-v4.44.1
```
<!-- Some example outputs:
<div align="center">
<video width="48%" controls autoplay>
<source src="assets/videos/robot_pick_up_chip_bag.mp4" type="video/mp4">
<p>Task: Pick up chip bag.</p>
</video>
<video width="48%" controls autoplay>
<source src="assets/videos/robot_push_chip_bag_to_left_edge_of_table.mp4" type="video/mp4">
<p>Task: Push chip bag to left edge of the table.</p>
</video>
</div> -->
## User Guidance
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
### Direct use
This model is intended for broad research use in English. It takes images and text as inputs and produces textual outputs for the following uses:
* **Image/Video-Conditioned Text Generation:** The model can generate text (e.g., descriptions, answers) based on the input text and image.
* **Visual Planning Capabilities:** The model can also produce the visual trace as the future planning to accomplish a task (e.g., move object from one place to another).
* **Agentic Capabilities:** The model can also generate UI grounding (e.g., click the "search" button) and robotics manipulations (e.g., 7-DoF actions for the robot gripper).
Our model is designed only for research purposes and is aimed at knowledge-sharing and accelerating research in multimodal AI, in particular multimodal agentic AI.
### Downstream Use
The model can be further finetuned for different downstream tasks, such as:
* **Image Captioning and QA:** We can further finetune this model for image captioning and QA tasks under the pipeline of multimodal LLMs. Based on our experiments, the model can achieve competitive performance, with better spatial understanding and reasoning, on these tasks.
* **Video Captioning and QA:** We can further finetune this model for video captioning and QA tasks under the pipeline of multimodal LLMs. Based on our experiments, the model can achieve competitive performance, with better temporal understanding and reasoning, on these tasks.
* **UI Navigation:** We can finetune this model for specific UI navigation tasks, such as web navigation or mobile navigation. The model can achieve superior performance on these tasks.
* **Robotics Manipulation:** Our model can be further finetuned for robotics tasks given its general agentic capabilities as a vision-language-action model. After finetuning, our model significantly outperforms the state-of-the-art models such as OpenVLA on robotics manipulation tasks.
## Bias, Risks, and Limitations
Please note that this model is not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case.
## Citation
If you use this model in your research, please consider citing:
```bibtex
@misc{yang2025magmafoundationmodelmultimodal,
title={Magma: A Foundation Model for Multimodal AI Agents},
author={Jianwei Yang and Reuben Tan and Qianhui Wu and Ruijie Zheng and Baolin Peng and Yongyuan Liang and Yu Gu and Mu Cai and Seonghyeon Ye and Joel Jang and Yuquan Deng and Lars Liden and Jianfeng Gao},
year={2025},
eprint={2502.13130},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2502.13130},
}
```
## Acknowledgements
Our work is supported by Microsoft Research. We thank all the contributors for their efforts in building this project.
Our work is built on top of some amazing open-source projects, including [Transformers](https://github.com/huggingface/transformers), [LLaVA](https://github.com/haotian-liu/LLaVA), [OpenVLA](https://github.com/openvla/openvla), [SeeClick](https://github.com/njucckevin/SeeClick), [Mind2Web](https://github.com/OSU-NLP-Group/Mind2Web), and also a number of awesome open-source datasets, including [Ego4d](https://ego4d-data.org/), [Epic-Kitchen](https://epic-kitchens.github.io/2025), [Something-Somethingv2](https://www.qualcomm.com/developer/artificial-intelligence/datasets), [Open-X-Embodiment](https://robotics-transformer-x.github.io/), and a number of evaluation benchmarks, including [SimplerEnv](https://github.com/simpler-env/SimplerEnv), [Libero](https://github.com/Lifelong-Robot-Learning/LIBERO).
## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
## Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
## Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos is subject to and must follow
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos are subject to those third-party's policies.
<!-- BEGIN MICROSOFT SECURITY.MD V0.0.9 BLOCK -->
## Security
Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet) and [Xamarin](https://github.com/xamarin).
If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://aka.ms/security.md/definition), please report it to us as described below.
## Reporting Security Issues
**Please do not report security vulnerabilities through public GitHub issues.**
Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://aka.ms/security.md/msrc/create-report).
If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://aka.ms/security.md/msrc/pgp).
You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc).
Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:
* Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
* Full paths of source file(s) related to the manifestation of the issue
* The location of the affected source code (tag/branch/commit or direct URL)
* Any special configuration required to reproduce the issue
* Step-by-step instructions to reproduce the issue
* Proof-of-concept or exploit code (if possible)
* Impact of the issue, including how an attacker might exploit the issue
This information will help us triage your report more quickly.
If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://aka.ms/security.md/msrc/bounty) page for more details about our active programs.
## Preferred Languages
We prefer all communications to be in English.
## Policy
Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://aka.ms/security.md/cvd).
<!-- END MICROSOFT SECURITY.MD BLOCK -->
# TODO: The maintainer of this repo has not yet edited this file
**REPO OWNER**: Do you want Customer Service & Support (CSS) support for this product/project?
- **No CSS support:** Fill out this template with information about how to file issues and get help.
- **Yes CSS support:** Fill out an intake form at [aka.ms/onboardsupport](https://aka.ms/onboardsupport). CSS will work with/help you to determine next steps.
- **Not sure?** Fill out an intake as though the answer were "Yes". CSS will help you decide.
*Then remove this first heading from this SUPPORT.MD file before publishing your repo.*
# Support
## How to file issues and get help
This project uses GitHub Issues to track bugs and feature requests. Please search the existing
issues before filing new issues to avoid duplicates. For new issues, file your bug or
feature request as a new Issue.
For help and questions about using this project, please **REPO MAINTAINER: INSERT INSTRUCTIONS HERE
FOR HOW TO ENGAGE REPO OWNERS OR COMMUNITY FOR HELP. COULD BE A STACK OVERFLOW TAG OR OTHER
CHANNEL. WHERE WILL YOU HELP PEOPLE?**.
## Microsoft Support Policy
Support for this **PROJECT or PRODUCT** is limited to the resources listed above.
# --------------------------------------------------------
# Magma - Multimodal AI Agent at Microsoft Research
# Copyright (c) 2025 Microsoft
# Licensed under The MIT License [see LICENSE for details]
# Jianwei Yang (jianwyan@microsoft.com)
# --------------------------------------------------------
import pygame
import numpy as np
import gradio as gr
import time
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor
import re
import random
pygame.mixer.quit() # Disable sound
# Constants
WIDTH, HEIGHT = 800, 800
GRID_SIZE = 80
WHITE = (255, 255, 255)
GREEN = (34, 139, 34) # Forest green - more like an apple
RED = (200, 50, 50)
BLACK = (0, 0, 0)
GRAY = (128, 128, 128)
YELLOW = (218, 165, 32) # Golden yellow color
# Directions
UP = (0, -1)
DOWN = (0, 1)
LEFT = (-1, 0)
RIGHT = (1, 0)
STATIC = (0, 0)
ACTIONS = ["up", "down", "left", "right", "static"]
# Load AI Model
magma_model_id = "microsoft/Magma-8B"
dtype = torch.bfloat16
magma_model = AutoModelForCausalLM.from_pretrained(magma_model_id, trust_remote_code=True, torch_dtype=dtype)
magma_processor = AutoProcessor.from_pretrained(magma_model_id, trust_remote_code=True, torch_dtype=dtype)
magma_model.to("cuda")
# Load magma image
magma_img = pygame.image.load("./assets/images/magma_game.png")
magma_img = pygame.transform.scale(magma_img, (GRID_SIZE, GRID_SIZE))
class MagmaFindGPU:
def __init__(self):
self.reset()
def reset(self):
self.snake = [(5, 5)]
self.direction = RIGHT
self.score = 0
self.game_over = False
self.place_target()
def place_target(self):
while True:
target_x = np.random.randint(1, WIDTH // GRID_SIZE - 1)
target_y = np.random.randint(1, HEIGHT // GRID_SIZE - 1)
if (target_x, target_y) not in self.snake:
self.target = (target_x, target_y)
break
def step(self, action):
if action == "up":
self.direction = UP
elif action == "down":
self.direction = DOWN
elif action == "left":
self.direction = LEFT
elif action == "right":
self.direction = RIGHT
elif action == "static":
self.direction = STATIC
if self.game_over:
return self.render(), self.score
new_head = (self.snake[0][0] + self.direction[0], self.snake[0][1] + self.direction[1])
if new_head[0] < 0 or new_head[1] < 0 or new_head[0] >= WIDTH // GRID_SIZE or new_head[1] >= HEIGHT // GRID_SIZE:
self.game_over = True
return self.render(), self.score
self.snake = [new_head] # Keep only the head (single block snake)
# Check if the target is covered by four surrounding squares
head_x, head_y = self.snake[0]
neighbors = set([(head_x, head_y - 1), (head_x, head_y + 1), (head_x - 1, head_y), (head_x + 1, head_y)])
if neighbors.issuperset(set([self.target])):
self.score += 1
self.place_target()
return self.render(), self.score
def render(self):
pygame.init()
surface = pygame.Surface((WIDTH, HEIGHT))
surface.fill(BLACK)
head_x, head_y = self.snake[0]
surface.blit(magma_img, (head_x * GRID_SIZE, head_y * GRID_SIZE))
# pygame.draw.rect(surface, RED, (self.snake[0][0] * GRID_SIZE, self.snake[0][1] * GRID_SIZE, GRID_SIZE, GRID_SIZE))
pygame.draw.rect(surface, GREEN, (self.target[0] * GRID_SIZE, self.target[1] * GRID_SIZE, GRID_SIZE, GRID_SIZE))
# Draw four surrounding squares with labels
head_x, head_y = self.snake[0]
neighbors = [(head_x, head_y - 1), (head_x, head_y + 1), (head_x - 1, head_y), (head_x + 1, head_y)]
labels = ["1", "2", "3", "4"]
font = pygame.font.Font(None, 48)
# clone surface
surface_nomark = surface.copy()
for i, (nx, ny) in enumerate(neighbors):
if 0 <= nx < WIDTH // GRID_SIZE and 0 <= ny < HEIGHT // GRID_SIZE:
pygame.draw.rect(surface, RED, (nx * GRID_SIZE, ny * GRID_SIZE, GRID_SIZE, GRID_SIZE), GRID_SIZE)
# pygame.draw.rect(surface_nomark, RED, (nx * GRID_SIZE, ny * GRID_SIZE, GRID_SIZE, GRID_SIZE), GRID_SIZE)
text = font.render(labels[i], True, WHITE)
text_rect = text.get_rect(center=(nx * GRID_SIZE + GRID_SIZE // 2, ny * GRID_SIZE + GRID_SIZE // 2))
surface.blit(text, text_rect)
return np.array(pygame.surfarray.array3d(surface_nomark)).swapaxes(0, 1), np.array(pygame.surfarray.array3d(surface)).swapaxes(0, 1)
def get_state(self):
return self.render()
game = MagmaFindGPU()
def play_game():
state, state_som = game.get_state()
pil_img = Image.fromarray(state_som)
convs = [
{"role": "system", "content": "You are an agent that can see, talk, and act."},
{"role": "user", "content": "<image_start><image><image_end>\nWhich mark is closer to green block? Answer with a single number."},
]
prompt = magma_processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)
inputs = magma_processor(images=[pil_img], texts=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
inputs = inputs.to("cuda").to(dtype)
generation_args = {
"max_new_tokens": 10,
"temperature": 0,
"do_sample": False,
"use_cache": True,
"num_beams": 1,
}
with torch.inference_mode():
generate_ids = magma_model.generate(**inputs, **generation_args)
generate_ids = generate_ids[:, inputs["input_ids"].shape[-1] :]
action = magma_processor.decode(generate_ids[0], skip_special_tokens=True).strip()
    # extract the mark id from the action string using a regex
match = re.search(r'\d+', action)
if match:
action = match.group(0)
if action.isdigit() and 1 <= int(action) <= 4:
# epsilon sampling
if random.random() < 0.1:
action = random.choice(ACTIONS[:-1])
else:
action = ACTIONS[int(action) - 1]
else:
# random choose one from the pool
action = random.choice(ACTIONS[:-1])
else:
action = random.choice(ACTIONS[:-1])
img, score = game.step(action)
img = img[0]
return img, f"Score: {score}"
def reset_game():
game.reset()
return game.render()[0], "Score: 0"
MARKDOWN = """
<div align="center">
<h2>Magma: A Foundation Model for Multimodal AI Agents</h2>
Game: Magma finds the apple by moving up, down, left and right.
\[[arXiv Paper](https://www.arxiv.org/pdf/2502.13130)\] &nbsp; \[[Project Page](https://microsoft.github.io/Magma/)\] &nbsp; \[[Github Repo](https://github.com/microsoft/Magma)\] &nbsp; \[[Hugging Face Model](https://huggingface.co/microsoft/Magma-8B)\] &nbsp;
This demo is powered by [Gradio](https://gradio.app/).
</div>
"""
with gr.Blocks() as interface:
gr.Markdown(MARKDOWN)
with gr.Row():
image_output = gr.Image(label="Game Screen")
score_output = gr.Text(label="Score")
with gr.Row():
start_btn = gr.Button("Start/Reset Game")
interface.load(fn=play_game, every=1, inputs=[], outputs=[image_output, score_output])
start_btn.click(fn=reset_game, inputs=[], outputs=[image_output, score_output])
interface.launch()
import gradio as gr
import numpy as np
import gymnasium as gym
from PIL import Image
import matplotlib.pyplot as plt
# Initialize FrozenLake environment
env = gym.make("FrozenLake-v1", render_mode="rgb_array")
state, _ = env.reset()
action_mapping = {
"Left": 3,
"Down": 1,
"Right": 2,
"Up": 0,
}
def render_env():
"""Render the environment and return as an image."""
frame = env.render()
image = Image.fromarray(frame)
return image
def step(action):
"""Take a step in the environment."""
global state
action_index = action_mapping[action]
state, reward, done, _, _ = env.step(action_index)
image = render_env()
message = f"State: {state}, Reward: {reward}, Done: {done}"
if done:
env.reset()
message += " - Resetting environment"
return image, message
# Create Gradio interface
with gr.Blocks() as demo:
gr.Markdown("# Play Frozen Lake!")
image_display = gr.Image()
action_buttons = gr.Radio(choices=list(action_mapping.keys()), label="Select Action")
submit_button = gr.Button("Step")
output_text = gr.Textbox(label="Game State")
submit_button.click(fn=step, inputs=action_buttons, outputs=[image_display, output_text])
# Show initial state
image_display.update(render_env())
demo.launch()
# Magma: Multimodal Agentic Models
Evaluating Magma on [LIBERO](https://github.com/Lifelong-Robot-Learning/LIBERO).
#### LIBERO Setup
Clone and install LIBERO and other requirements:
```
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
pip install -r agents/libero/requirements.txt
cd LIBERO
pip install -e .
```
#### Quick Evaluation
The following code demonstrates how to run Magma on a single LIBERO task and evaluate its performance:
```
import numpy as np
from libero.libero import benchmark
from libero_env_utils import get_libero_env, get_libero_dummy_action, get_libero_obs, get_max_steps, save_rollout_video
from libero_magma_utils import get_magma_model, get_magma_prompt, get_magma_action
# Set up benchmark and task
benchmark_dict = benchmark.get_benchmark_dict()
task_suite_name = "libero_goal" # or libero_spatial, libero_object, etc.
task_suite = benchmark_dict[task_suite_name]()
task_id = 1
task = task_suite.get_task(task_id)
# Initialize environment
env, task_description = get_libero_env(task, resolution=256)
print(f"Task {task_id} description: {task_description}")
# Load MAGMA model
model_name = "microsoft/magma-8b-libero-goal" # or your local path
processor, magma = get_magma_model(model_name)
prompt = get_magma_prompt(task_description, processor, magma.config)
# Run evaluation
num_steps_wait = 10
max_steps = get_max_steps(task_suite_name)
env.seed(0)
obs = env.reset()
init_states = task_suite.get_task_init_states(task_id)
obs = env.set_init_state(init_states[0])
step = 0
replay_images = []
while step < max_steps + num_steps_wait:
if step < num_steps_wait:
obs, _, done, _ = env.step(get_libero_dummy_action())
step += 1
continue
img = get_libero_obs(obs, resize_size=256)
replay_images.append(img)
action = get_magma_action(magma, processor, img, prompt, task_suite_name)
obs, _, done, _ = env.step(action.tolist())
step += 1
env.close()
save_rollout_video(replay_images, success=done, task_description=task_description)
```
**Notes:** The above script only tests one episode on a single task and visualizes MAGMA's trajectory with saved video. For comprehensive evaluation on each task suite, please use `eval_magma_libero.py`.
```
python eval_magma_libero.py \
  --model_name microsoft/Magma-8B-libero-object \
  --task_suite_name libero_object

python eval_magma_libero.py \
  --model_name microsoft/Magma-8B-libero-spatial \
  --task_suite_name libero_spatial

python eval_magma_libero.py \
  --model_name microsoft/Magma-8B-libero-goal \
  --task_suite_name libero_goal
```
import os
import numpy as np
import draccus
from dataclasses import dataclass
from typing import Optional, Tuple
import tqdm
from libero.libero import benchmark
from libero_env_utils import (
get_libero_env,
get_libero_dummy_action,
get_libero_obs,
get_max_steps,
set_seed_everywhere
)
from libero_magma_utils import (
get_magma_model,
get_magma_prompt,
get_magma_action
)
@dataclass
class LiberoConfig:
# Model parameters
model_name: str = "microsoft/magma-8b-libero-goal" # model_name
task_suite_name: str = "libero_goal" # Task suite name
# Evaluation parameters
num_trials_per_task: int = 50 # Number of rollouts per task
resolution: int = 256 # Image resolution
num_steps_wait: int = 10 # Steps to wait for stabilization
seed: int = 0 # Random seed
save_dir: str = "./libero_eval_log" # Directory for saving logs
@draccus.wrap()
def eval_libero(cfg: LiberoConfig) -> Tuple[int, int]:
"""
Evaluate Libero environment with given configuration.
Args:
cfg: LiberoConfig object containing evaluation parameters
Returns:
Tuple[int, int]: Total episodes and total successful episodes
"""
# Setup logging
os.makedirs(cfg.save_dir, exist_ok=True)
log_filepath = f"{cfg.save_dir}/magma_eval-{cfg.task_suite_name}.log"
log_file = open(log_filepath, "w")
print(f"Logging to local log file: {log_filepath}")
# Write initial log
log_file.write(f"Task suite: {cfg.task_suite_name}\n")
print(f"Task suite: {cfg.task_suite_name}")
# Get benchmark and task suite
benchmark_dict = benchmark.get_benchmark_dict()
task_suite = benchmark_dict[cfg.task_suite_name]()
num_tasks_in_suite = task_suite.n_tasks
# Initialize counters
total_episodes, total_successes = 0, 0
set_seed_everywhere(cfg.seed)
# Load model
processor, magma = get_magma_model(cfg.model_name)
# Iterate through all tasks
for task_id in tqdm.tqdm(range(num_tasks_in_suite)):
# Get task
task = task_suite.get_task(task_id)
task_name = task.name
max_steps = get_max_steps(cfg.task_suite_name)
# Get default LIBERO initial states
initial_states = task_suite.get_task_init_states(task_id)
# Initialize LIBERO environment and task description
env, task_description = get_libero_env(task, resolution=cfg.resolution)
print(f"[info] Evaluating task {task_id} from suite {cfg.task_suite_name}, "
f"the language instruction is {task_description}.")
log_file.write(f"Task {task_id}: {task_description}\n")
log_file.flush()
# Get prompt for current task
prompt = get_magma_prompt(task_description, processor, magma.config)
# Initialize task-specific counters
task_episodes, task_successes = 0, 0
# Run trials for current task
for trial in range(cfg.num_trials_per_task):
env.reset()
obs = env.set_init_state(initial_states[trial])
step = 0
while step < max_steps + cfg.num_steps_wait:
if step < cfg.num_steps_wait:
obs, reward, done, info = env.step(get_libero_dummy_action())
step += 1
continue
img = get_libero_obs(obs, resize_size=cfg.resolution)
action = get_magma_action(magma, processor, img, prompt, cfg.task_suite_name)
obs, reward, done, info = env.step(action.tolist())
step += 1
if done:
task_successes += 1
break
task_episodes += 1
# Update total counters
total_episodes += task_episodes
total_successes += task_successes
# Log task success rate
task_success_rate = float(task_successes) / float(task_episodes)
print(f"Current task ({task_name}) success rate: {task_success_rate}")
log_file.write(f"Current task ({task_name}) success rate: {task_success_rate}\n")
log_file.flush()
# Log final suite success rate
suite_success_rate = float(total_successes) / float(total_episodes)
print(f"Task suite success rate: {suite_success_rate}")
log_file.write(f"\nTask suite {cfg.task_suite_name} success rate: {suite_success_rate}\n")
log_file.flush()
env.close()
log_file.close()
return total_episodes, total_successes
if __name__ == "__main__":
eval_libero()
"""Utils for evaluating policies in LIBERO simulation environments."""
import math
import os
import torch
import random
from PIL import Image
import imageio
import numpy as np
import tensorflow as tf
from libero.libero import get_libero_path
from libero.libero.envs import OffScreenRenderEnv
def resize_image(img, resize_size):
"""
Takes numpy array corresponding to a single image and returns resized image as numpy array.
"""
assert isinstance(resize_size, tuple)
# Resize to image size expected by model
img = tf.image.encode_jpeg(img) # Encode as JPEG, as done in RLDS dataset builder
img = tf.io.decode_image(img, expand_animations=False, dtype=tf.uint8) # Immediately decode back
img = tf.image.resize(img, resize_size, method="lanczos3", antialias=True)
img = tf.cast(tf.clip_by_value(tf.round(img), 0, 255), tf.uint8)
img = img.numpy()
return img
def get_libero_env(task, resolution=256):
"""Initializes and returns the LIBERO environment, along with the task description."""
task_description = task.language
task_bddl_file = os.path.join(get_libero_path("bddl_files"), task.problem_folder, task.bddl_file)
env_args = {"bddl_file_name": task_bddl_file, "camera_heights": resolution, "camera_widths": resolution}
env = OffScreenRenderEnv(**env_args)
env.seed(0) # IMPORTANT: seed seems to affect object positions even when using fixed initial state
return env, task_description
def get_libero_dummy_action():
"""Get dummy/no-op action, used to roll out the simulation while the robot does nothing."""
return [0, 0, 0, 0, 0, 0, -1]
def get_libero_obs(obs, resize_size):
"""Extracts image from observations and preprocesses it."""
assert isinstance(resize_size, int) or isinstance(resize_size, tuple)
if isinstance(resize_size, int):
resize_size = (resize_size, resize_size)
img = obs["agentview_image"]
img = img[::-1, ::-1] # IMPORTANT: rotate 180 degrees to match train preprocessing
image = Image.fromarray(img)
# resize image to 256x256
image = resize_image(img, resize_size)
return image
def get_max_steps(task_suite_name):
if task_suite_name == "libero_spatial":
max_steps = 220
elif task_suite_name == "libero_object":
max_steps = 280
elif task_suite_name == "libero_goal":
max_steps = 300
elif task_suite_name == "libero_10":
max_steps = 520
else:
max_steps = 400
return max_steps
def quat2axisangle(quat):
"""
Copied from robosuite: https://github.com/ARISE-Initiative/robosuite/blob/eafb81f54ffc104f905ee48a16bb15f059176ad3/robosuite/utils/transform_utils.py#L490C1-L512C55
Converts quaternion to axis-angle format.
Returns a unit vector direction scaled by its angle in radians.
Args:
quat (np.array): (x,y,z,w) vec4 float angles
Returns:
np.array: (ax,ay,az) axis-angle exponential coordinates
"""
# clip quaternion
if quat[3] > 1.0:
quat[3] = 1.0
elif quat[3] < -1.0:
quat[3] = -1.0
den = np.sqrt(1.0 - quat[3] * quat[3])
if math.isclose(den, 0.0):
# This is (close to) a zero degree rotation, immediately return
return np.zeros(3)
return (quat[:3] * 2.0 * math.acos(quat[3])) / den
def save_rollout_video(replay_images, success, task_description):
"""Saves a video replay of a rollout in libero."""
save_dir = f"./libero_videos"
os.makedirs(save_dir, exist_ok=True)
processed_task_description = task_description.lower().replace(" ", "_").replace("\n", "_").replace(".", "_")[:50]
video_path = f"{save_dir}/quick_eval-success={success}--task={processed_task_description}.mp4"
video_writer = imageio.get_writer(video_path, fps=30)
for img in replay_images:
video_writer.append_data(img)
video_writer.close()
print(f"Saved libero video at path {video_path}")
return video_path
def set_seed_everywhere(seed: int):
"""Sets the random seed for Python, NumPy, and PyTorch functions."""
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
np.random.seed(seed)
random.seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
os.environ["PYTHONHASHSEED"] = str(seed)
import os
import json
import torch
import numpy as np
from magma.image_processing_magma import MagmaImageProcessor
from magma.processing_magma import MagmaProcessor
from magma.modeling_magma import MagmaForConditionalGeneration
def get_magma_model(model_name):
processor = MagmaProcessor.from_pretrained(model_name, trust_remote_code=True)
magma = MagmaForConditionalGeneration.from_pretrained(model_name,
device_map="cuda",
low_cpu_mem_usage=True,
attn_implementation="flash_attention_2",
torch_dtype=torch.bfloat16,
trust_remote_code=True,
use_cache=True,
)
return processor, magma
def get_magma_prompt(task_description, processor, model_config):
convs = [
{"role": "user", "content": f"<image>\nWhat action should the robot take to {task_description}?"},
]
convs = [
{
"role": "system",
"content": "You are agent that can see, talk and act.",
},
] + convs
prompt = processor.tokenizer.apply_chat_template(
convs,
tokenize=False,
add_generation_prompt=True
)
if model_config.mm_use_image_start_end:
prompt = prompt.replace("<image>", "<image_start><image><image_end>")
return prompt
def get_magma_action(magma, processor, img, prompt, task_suite_name):
dataset_stats = json.load(open(os.path.join(magma.config._name_or_path, "dataset_statistics.json")))
action_norm_stats = dataset_stats[f"{task_suite_name}_no_noops"]['action']
n_action_bins = 256
vocab_size = processor.tokenizer.vocab_size
bins = np.linspace(-1, 1, n_action_bins)
bin_centers = (bins[:-1] + bins[1:]) / 2.0
# process inputs
inputs = processor(images=img, texts=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
inputs = inputs.to("cuda").to(torch.bfloat16)
# predict actions with magma
output_ids = magma.generate(
**inputs,
temperature=0.7,
do_sample=True,
num_beams=1,
max_new_tokens=1000,
use_cache=True,
)
action_ids = output_ids[0, -8:-1].cpu().tolist()
predicted_action_ids = np.array(action_ids).astype(np.int64)
discretized_actions = vocab_size - predicted_action_ids
discretized_actions = np.clip(discretized_actions - 1, a_min=0, a_max=bin_centers.shape[0] - 1)
normalized_actions = bin_centers[discretized_actions]
# unnormalize actions
mask = action_norm_stats.get("mask", np.ones_like(action_norm_stats["q01"], dtype=bool))
action_high, action_low = np.array(action_norm_stats["q99"]), np.array(action_norm_stats["q01"])
raw_action = np.where(
mask,
0.5 * (normalized_actions + 1) * (action_high - action_low) + action_low,
normalized_actions,
)
action = normalize_gripper_action(raw_action, binarize=True)
action = invert_gripper_action(action)
return action
def normalize_gripper_action(action, binarize=True):
"""
Convert gripper action from [0,1] to [-1,+1] range.
y = 2x - 1
"""
orig_low, orig_high = 0.0, 1.0
action[..., -1] = 2 * (action[..., -1] - orig_low) / (orig_high - orig_low) - 1
if binarize:
# Binarize to -1 or +1.
action[..., -1] = np.sign(action[..., -1])
return action
def invert_gripper_action(action):
"""Convert gripper: RLDS(0=close,1=open) -> -1=open,+1=close"""
action[..., -1] = action[..., -1] * -1.0
return action
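# ---------------------------------------------------------------
# Minimal usage sketch (not part of the original utilities): how the helpers
# above are expected to chain for a single LIBERO step. The checkpoint name,
# task-suite key, task string and image path are assumptions; get_magma_action
# additionally expects dataset_statistics.json to sit next to the checkpoint.
# ---------------------------------------------------------------
# if __name__ == "__main__":
#     from PIL import Image
#     processor, magma = get_magma_model("microsoft/Magma-8B")  # assumed checkpoint
#     prompt = get_magma_prompt("pick up the black bowl", processor, magma.config)
#     img = Image.open("frame.png")  # placeholder observation
#     action = get_magma_action(magma, processor, img, prompt, "libero_spatial")
#     print(action)  # 7-D action: position delta, rotation delta, gripper in {-1, +1}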
robosuite==1.4.0
bddl==1.0.1
easydict==1.9
gym==0.25.2
cloudpickle
imageio[ffmpeg]
# --------------------------------------------------------
# Magma - Multimodal AI Agent at Microsoft Research
# Copyright (c) 2025 Microsoft
# Licensed under The MIT License [see LICENSE for details]
# Jianwei Yang (jianwyan@microsoft.com)
# --------------------------------------------------------
import os
import warnings
from utils.visualizer import Visualizer
from typing import TYPE_CHECKING, Any, Callable, Dict, List, Optional, Tuple
import random
import gradio as gr
import ast, re
import torch
import torchvision
from transformers import AutoModelForCausalLM, AutoProcessor
'''
build model
'''
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
random.seed(0)
spatial_quant_size = 256
# Load AI Model
dtype = torch.bfloat16
device = "cuda"
magma_model_id = "microsoft/Magma-8B"
model = AutoModelForCausalLM.from_pretrained(magma_model_id, trust_remote_code=True, torch_dtype=dtype)
processor = AutoProcessor.from_pretrained(magma_model_id, trust_remote_code=True)
model.to(device)
@torch.no_grad()
def inference(image, task, *args, **kwargs):
# image = image['image']
task_description = task
num_marks = args[0]
speed = args[1]
steps = args[2]
mark_ids = [i+1 for i in range(num_marks)]
image_resized = image.resize((256, 256))
magma_template = (
# "<image>\nThe image is labeled with numeric marks {}.\n"
"<image>\nThe image is split into 256x256 grids and is labeled with numeric marks {}.\n"
"The robot is doing: {}. To finish the task, how to move the numerical marks in the image with speed {} for the next {} steps?\n"
)
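# Example (sketch) of the filled-in template for mark_ids=[1, 2, 3],
# task "pick up the chip bag", speed 8 and steps 8 (before the optional
# <image_start>/<image_end> wrapping below):
#   <image>
#   The image is split into 256x256 grids and is labeled with numeric marks [1, 2, 3].
#   The robot is doing: pick up the chip bag. To finish the task, how to move the
#   numerical marks in the image with speed 8 for the next 8 steps?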
"""
Visual Trace Generation
"""
if model.config.mm_use_image_start_end:
magma_template = magma_template.replace("<image>", "<image_start><image><image_end>")
conv_user = magma_template.format(mark_ids, task_description, speed, steps)
print(conv_user)
convs = [
{"role": "user", "content": conv_user},
]
convs = [
{
"role": "system",
"content": "You are agent that can see, talk and act.",
},
] + convs
prompt = processor.tokenizer.apply_chat_template(
convs,
tokenize=False,
add_generation_prompt=True
)
inputs = processor(images=image_resized, texts=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
inputs = inputs.to(dtype).to(device)
with torch.inference_mode():
output_ids = model.generate(
**inputs,
temperature=0.3,
do_sample=True,
num_beams=1,
max_new_tokens=1024,
use_cache=True,
)
response = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
if len(response)==0:
return None
# extract traces from response
if "and their future positions are:" in response:
selected_marks_str, traces_str = response.split("and their future positions are:\n")
else:
selected_marks_str, traces_str = None, response
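# The parsing below assumes the model writes one entry per mark, separated by
# blank lines, roughly of the form
#   1: "[(x1, y1), (x2, y2), ...]"
#   2: "[(x1, y1), (x2, y2), ...]"
# i.e. the mark id as key and a quoted list of (x, y) coordinates on the
# 256x256 grid as value; blank-line separators are turned into commas so the
# whole block parses as a Python dict.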
try:
traces_dict = ast.literal_eval('{' + traces_str.strip().replace('\n\n',',') + '}')
overlay_traces = []
for mark_id, trace in traces_dict.items():
# convert list of tuples to tensor
trace = torch.tensor(ast.literal_eval(trace)).unsqueeze(1)
overlay_traces.append(trace)
# pad all traces to the same length by repeating their last point
max_len = max([trace.shape[0] for trace in overlay_traces])
for i in range(len(overlay_traces)):
if overlay_traces[i].shape[0] < max_len:
overlay_traces[i] = torch.cat([overlay_traces[i], overlay_traces[i][-1].unsqueeze(0).repeat(max_len - overlay_traces[i].shape[0], 1, 1)], dim=0)
overlay_traces = torch.cat(overlay_traces, dim=1).unsqueeze(0)
# if selected_marks_str is not None:
# selected_marks = re.findall(r'\[(.*?)\]', selected_marks_str)
# selected_marks = [torch.tensor(ast.literal_eval(mark)).unsqueeze(0) for mark in selected_marks]
# selected_marks = torch.cat(selected_marks, dim=0).unsqueeze(0)
# overlay_traces = torch.cat([selected_marks.unsqueeze(1), overlay_traces], dim=1)
overlay_traces = overlay_traces.float() / 256
overlay_traces[:,:,:,0] = overlay_traces[:,:,:,0] * image.size[0]
overlay_traces[:,:,:,1] = overlay_traces[:,:,:,1] * image.size[1]
images = [image] * overlay_traces.shape[1]
overlay_visibility = overlay_traces.new(overlay_traces.shape[0], overlay_traces.shape[1], overlay_traces.shape[2]).fill_(True)
video = torch.stack([torchvision.transforms.ToTensor()(img) for img in images])[None].float()*255
vis = Visualizer(save_dir="./saved_videos", pad_value=0, linewidth=2, tracks_leave_trace=-1)
vis.visualize(video, overlay_traces, overlay_visibility)
# return video path
return "./saved_videos/video.mp4"
except Exception as e:
print(e)
return None
class ImageMask(gr.components.Image):
"""
Sets: source="canvas", tool="sketch"
"""
is_template = True
def __init__(self, **kwargs):
super().__init__(source="upload", tool="sketch", interactive=True, **kwargs)
def preprocess(self, x):
return super().preprocess(x)
class Video(gr.components.Video):
"""
Sets: source="canvas", tool="sketch"
"""
is_template = True
def __init__(self, **kwargs):
super().__init__(source="upload", **kwargs)
def preprocess(self, x):
return super().preprocess(x)
'''
launch app
'''
title = "Magma"
description = '''Magma: Multimodal Agent to Act'''
'''Usage
Instructions:
&#x1F388 Try our default examples first (the sketch is not automatically drawn on the input or example image);
&#x1F388 For the video demo, processing takes about 30-60 s; please refresh if you hit an error while uploading;
&#x1F388 Upload an image/video (to use a referred region from another image, check "Example" and upload that image in the referring-image panel);
&#x1F388 Select at least one type of prompt (to use a referred region from another image, check "Example");
&#x1F388 Remember to provide the actual prompt for each prompt type you select, otherwise you will get an error (e.g., remember to draw on the referring image);
&#x1F388 By default the model supports the 133 COCO categories; anything else will be mapped to 'others' or misclassified.
'''
article = "The Demo is Run on Magma-8B."
inputs = [
gr.components.Image(label="Draw on Image",type="pil"),
gr.Textbox(label="Task"),
gr.Slider(1, 50, value=10, label="Number of Marks", info="Choose between 1 and 50"),
gr.Slider(2, 50, value=8, label="Speed", info="Choose between 2 and 50"),
gr.Slider(2, 50, value=8, label="Steps", info="Choose between 2 and 50"),
]
gr.Interface(
fn=inference,
inputs=inputs,
outputs=[
gr.Video(
label="Robot planning trajectory", format="mp4"
),
],
examples=[
["agents/robot_traj/sample.png", "Pick up the chip bag.", 9, 8, 8],
],
title=title,
description=description,
article=article,
allow_flagging='never',
cache_examples=False,
).launch(share=True)
# --------------------------------------------------------
# Magma - Multimodal AI Agent at Microsoft Research
# Copyright (c) 2025 Microsoft
# Licensed under The MIT License [see LICENSE for details]
# Jianwei Yang (jianwyan@microsoft.com)
# --------------------------------------------------------
import os
import warnings
from utils.visualizer import Visualizer
from typing import TYPE_CHECKING, Any, Callable, Dict, List, Optional, Tuple
import random
import gradio as gr
import ast, re
import torch
import torchvision
from transformers import AutoModelForCausalLM, AutoProcessor
'''
build model
'''
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
random.seed(0)
spatial_quant_size = 256
# Load AI Model
dtype = torch.bfloat16
device = "cuda"
magma_model_id = "microsoft/Magma-8B"
model = AutoModelForCausalLM.from_pretrained(magma_model_id, trust_remote_code=True, torch_dtype=dtype)
processor = AutoProcessor.from_pretrained(magma_model_id, trust_remote_code=True)
model.to(device)
@torch.no_grad()
def inference(image, task, *args, **kwargs):
# image = image['image']
task_description = task
num_marks = args[0]
speed = args[1]
steps = args[2]
mark_ids = [i+1 for i in range(num_marks)]
image_resized = image.resize((256, 256))
magma_template = (
# "<image>\nThe image is labeled with numeric marks {}.\n"
"<image>\nThe image is split into 256x256 grids and is labeled with numeric marks {}.\n"
"The robot is doing: {}. To finish the task, how to move the numerical marks in the image with speed {} for the next {} steps?\n"
)
"""
Visual Trace Generation
"""
if model.config.mm_use_image_start_end:
magma_template = magma_template.replace("<image>", "<image_start><image><image_end>")
conv_user = magma_template.format(mark_ids, task_description, speed, steps)
print(conv_user)
convs = [
{"role": "user", "content": conv_user},
]
convs = [
{
"role": "system",
"content": "You are agent that can see, talk and act.",
},
] + convs
prompt = processor.tokenizer.apply_chat_template(
convs,
tokenize=False,
add_generation_prompt=True
)
inputs = processor(images=image_resized, texts=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
inputs = inputs.to(dtype).to(device)
with torch.inference_mode():
output_ids = model.generate(
**inputs,
temperature=0.3,
do_sample=True,
num_beams=1,
max_new_tokens=1024,
use_cache=True,
)
response = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
if len(response)==0:
return None
# extract traces from response
if "and their future positions are:" in response:
selected_marks_str, traces_str = response.split("and their future positions are:\n")
else:
selected_marks_str, traces_str = None, response
try:
traces_dict = ast.literal_eval('{' + traces_str.strip().replace('\n\n',',') + '}')
overlay_traces = []
for mark_id, trace in traces_dict.items():
# convert list of tuples to tensor
trace = torch.tensor(ast.literal_eval(trace)).unsqueeze(1)
overlay_traces.append(trace)
# pad all traces to the same length by repeating their last point
max_len = max([trace.shape[0] for trace in overlay_traces])
for i in range(len(overlay_traces)):
if overlay_traces[i].shape[0] < max_len:
overlay_traces[i] = torch.cat([overlay_traces[i], overlay_traces[i][-1].unsqueeze(0).repeat(max_len - overlay_traces[i].shape[0], 1, 1)], dim=0)
overlay_traces = torch.cat(overlay_traces, dim=1).unsqueeze(0)
# if selected_marks_str is not None:
# selected_marks = re.findall(r'\[(.*?)\]', selected_marks_str)
# selected_marks = [torch.tensor(ast.literal_eval(mark)).unsqueeze(0) for mark in selected_marks]
# selected_marks = torch.cat(selected_marks, dim=0).unsqueeze(0)
# overlay_traces = torch.cat([selected_marks.unsqueeze(1), overlay_traces], dim=1)
overlay_traces = overlay_traces.float() / 256
overlay_traces[:,:,:,0] = overlay_traces[:,:,:,0] * image.size[0]
overlay_traces[:,:,:,1] = overlay_traces[:,:,:,1] * image.size[1]
images = [image] * overlay_traces.shape[1]
overlay_visibility = overlay_traces.new(overlay_traces.shape[0], overlay_traces.shape[1], overlay_traces.shape[2]).fill_(True)
video = torch.stack([torchvision.transforms.ToTensor()(img) for img in images])[None].float()*255
vis = Visualizer(save_dir="./saved_videos", pad_value=0, linewidth=2, tracks_leave_trace=-1)
vis.visualize(video, overlay_traces, overlay_visibility)
# return video path
return "./saved_videos/video.mp4"
except Exception as e:
print(e)
return None
from gradio.events import Dependency
class ImageMask(gr.components.Image):
"""
Sets: source="canvas", tool="sketch"
"""
is_template = True
def __init__(self, **kwargs):
super().__init__(source="upload", tool="sketch", interactive=True, **kwargs)
def preprocess(self, x):
return super().preprocess(x)
from typing import Callable, Literal, Sequence, Any, TYPE_CHECKING
from gradio.blocks import Block
if TYPE_CHECKING:
from gradio.components import Timer
class Video(gr.components.Video):
"""
Sets: source="canvas", tool="sketch"
"""
is_template = True
def __init__(self, **kwargs):
super().__init__(source="upload", **kwargs)
def preprocess(self, x):
return super().preprocess(x)
from typing import Callable, Literal, Sequence, Any, TYPE_CHECKING
from gradio.blocks import Block
if TYPE_CHECKING:
from gradio.components import Timer
'''
launch app
'''
title = "Magma"
description = '''Magma: Multimodal Agent to Act'''
'''Usage
Instructions:
&#x1F388 Try our default examples first (the sketch is not automatically drawn on the input or example image);
&#x1F388 For the video demo, processing takes about 30-60 s; please refresh if you hit an error while uploading;
&#x1F388 Upload an image/video (to use a referred region from another image, check "Example" and upload that image in the referring-image panel);
&#x1F388 Select at least one type of prompt (to use a referred region from another image, check "Example");
&#x1F388 Remember to provide the actual prompt for each prompt type you select, otherwise you will get an error (e.g., remember to draw on the referring image);
&#x1F388 By default the model supports the 133 COCO categories; anything else will be mapped to 'others' or misclassified.
'''
article = "The Demo is Run on Magma-8B."
inputs = [
gr.components.Image(label="Draw on Image",type="pil"),
gr.Textbox(label="Task"),
gr.Slider(1, 50, value=10, label="Number of Marks", info="Choose between 1 and 50"),
gr.Slider(2, 50, value=8, label="Speed", info="Choose between 2 and 50"),
gr.Slider(2, 50, value=8, label="Steps", info="Choose between 2 and 50"),
]
gr.Interface(
fn=inference,
inputs=inputs,
outputs=[
gr.Video(
label="Robot planning trajectory", format="mp4"
),
],
examples=[
["agents/robot_traj/sample.png", "Pick up the chip bag.", 9, 8, 8],
],
title=title,
description=description,
article=article,
allow_flagging='never',
cache_examples=False,
).launch(share=True)
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
import os
import numpy as np
import imageio
import torch
from matplotlib import cm
import torch.nn.functional as F
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
from PIL import Image, ImageDraw
def read_video_from_path(path):
try:
reader = imageio.get_reader(path)
except Exception as e:
print("Error opening video file: ", e)
return None
frames = []
for i, im in enumerate(reader):
frames.append(np.array(im))
return np.stack(frames)
def draw_circle(rgb, coord, radius, color=(255, 0, 0), visible=True):
# Create a draw object
draw = ImageDraw.Draw(rgb)
# Calculate the bounding box of the circle
left_up_point = (coord[0] - radius, coord[1] - radius)
right_down_point = (coord[0] + radius, coord[1] + radius)
# Draw the circle
draw.ellipse(
[left_up_point, right_down_point],
fill=tuple(color) if visible else None,
outline=tuple(color),
)
return rgb
def draw_line(rgb, coord_y, coord_x, color, linewidth):
draw = ImageDraw.Draw(rgb)
draw.line(
(coord_y[0], coord_y[1], coord_x[0], coord_x[1]),
fill=tuple(color),
width=linewidth,
)
return rgb
def add_weighted(rgb, alpha, original, beta, gamma):
return (rgb * alpha + original * beta + gamma).astype("uint8")
class Visualizer:
def __init__(
self,
save_dir: str = "./results",
grayscale: bool = False,
pad_value: int = 0,
fps: int = 10,
mode: str = "rainbow", # 'cool', 'optical_flow'
linewidth: int = 2,
show_first_frame: int = 10,
tracks_leave_trace: int = 0, # -1 for infinite
):
self.mode = mode
self.save_dir = save_dir
if mode == "rainbow":
self.color_map = cm.get_cmap("gist_rainbow")
elif mode == "cool":
self.color_map = cm.get_cmap(mode)
self.show_first_frame = show_first_frame
self.grayscale = grayscale
self.tracks_leave_trace = tracks_leave_trace
self.pad_value = pad_value
self.linewidth = linewidth
self.fps = fps
def visualize(
self,
video: torch.Tensor, # (B,T,C,H,W)
tracks: torch.Tensor, # (B,T,N,2)
visibility: torch.Tensor = None, # (B, T, N, 1) bool
gt_tracks: torch.Tensor = None, # (B,T,N,2)
segm_mask: torch.Tensor = None, # (B,1,H,W)
filename: str = "video",
writer=None, # tensorboard Summary Writer, used for visualization during training
step: int = 0,
query_frame: int = 0,
save_video: bool = True,
compensate_for_camera_motion: bool = False,
):
if compensate_for_camera_motion:
assert segm_mask is not None
if segm_mask is not None:
coords = tracks[0, query_frame].round().long()
segm_mask = segm_mask[0, query_frame][coords[:, 1], coords[:, 0]].long()
video = F.pad(
video,
(self.pad_value, self.pad_value, self.pad_value, self.pad_value),
"constant",
255,
)
tracks = tracks + self.pad_value
if self.grayscale:
transform = transforms.Grayscale()
video = transform(video)
video = video.repeat(1, 1, 3, 1, 1)
res_video = self.draw_tracks_on_video(
video=video,
tracks=tracks,
visibility=visibility,
segm_mask=segm_mask,
gt_tracks=gt_tracks,
query_frame=query_frame,
compensate_for_camera_motion=compensate_for_camera_motion,
)
if save_video:
self.save_video(res_video, filename=filename, writer=writer, step=step)
return res_video
def save_video(self, video, filename, writer=None, step=0):
if writer is not None:
writer.add_video(
filename,
video.to(torch.uint8),
global_step=step,
fps=self.fps,
)
else:
os.makedirs(self.save_dir, exist_ok=True)
wide_list = list(video.unbind(1))
wide_list = [wide[0].permute(1, 2, 0).cpu().numpy() for wide in wide_list]
# Prepare the video file path
save_path = os.path.join(self.save_dir, f"{filename}.mp4")
# Create a writer object
video_writer = imageio.get_writer(save_path, fps=self.fps)
# Write frames to the video file
for frame in wide_list[2:-1]:
video_writer.append_data(frame)
video_writer.close()
print(f"Video saved to {save_path}")
def draw_tracks_on_video(
self,
video: torch.Tensor,
tracks: torch.Tensor,
visibility: torch.Tensor = None,
segm_mask: torch.Tensor = None,
gt_tracks=None,
query_frame: int = 0,
compensate_for_camera_motion=False,
):
B, T, C, H, W = video.shape
_, _, N, D = tracks.shape
assert D == 2
assert C == 3
video = video[0].permute(0, 2, 3, 1).byte().detach().cpu().numpy() # S, H, W, C
tracks = tracks[0].long().detach().cpu().numpy() # S, N, 2
if gt_tracks is not None:
gt_tracks = gt_tracks[0].detach().cpu().numpy()
res_video = []
# process input video
for rgb in video:
res_video.append(rgb.copy())
vector_colors = np.zeros((T, N, 3))
if self.mode == "optical_flow":
import flow_vis
vector_colors = flow_vis.flow_to_color(tracks - tracks[query_frame][None])
elif segm_mask is None:
if self.mode == "rainbow":
y_min, y_max = (
tracks[query_frame, :, 1].min(),
tracks[query_frame, :, 1].max(),
)
norm = plt.Normalize(y_min, y_max)
for n in range(N):
color = self.color_map(norm(tracks[query_frame, n, 1]))
color = np.array(color[:3])[None] * 255
vector_colors[:, n] = np.repeat(color, T, axis=0)
else:
# color changes with time
for t in range(T):
color = np.array(self.color_map(t / T)[:3])[None] * 255
vector_colors[t] = np.repeat(color, N, axis=0)
else:
if self.mode == "rainbow":
vector_colors[:, segm_mask <= 0, :] = 255
y_min, y_max = (
tracks[0, segm_mask > 0, 1].min(),
tracks[0, segm_mask > 0, 1].max(),
)
norm = plt.Normalize(y_min, y_max)
for n in range(N):
if segm_mask[n] > 0:
color = self.color_map(norm(tracks[0, n, 1]))
color = np.array(color[:3])[None] * 255
vector_colors[:, n] = np.repeat(color, T, axis=0)
else:
# color changes with segm class
segm_mask = segm_mask.cpu()
color = np.zeros((segm_mask.shape[0], 3), dtype=np.float32)
color[segm_mask > 0] = np.array(self.color_map(1.0)[:3]) * 255.0
color[segm_mask <= 0] = np.array(self.color_map(0.0)[:3]) * 255.0
vector_colors = np.repeat(color[None], T, axis=0)
# draw tracks
if self.tracks_leave_trace != 0:
for t in range(query_frame + 1, T):
first_ind = (
max(0, t - self.tracks_leave_trace) if self.tracks_leave_trace >= 0 else 0
)
curr_tracks = tracks[first_ind : t + 1]
curr_colors = vector_colors[first_ind : t + 1]
if compensate_for_camera_motion:
diff = (
tracks[first_ind : t + 1, segm_mask <= 0]
- tracks[t : t + 1, segm_mask <= 0]
).mean(1)[:, None]
curr_tracks = curr_tracks - diff
curr_tracks = curr_tracks[:, segm_mask > 0]
curr_colors = curr_colors[:, segm_mask > 0]
res_video[t] = self._draw_pred_tracks(
res_video[t],
curr_tracks,
curr_colors,
)
if gt_tracks is not None:
res_video[t] = self._draw_gt_tracks(res_video[t], gt_tracks[first_ind : t + 1])
# draw points
for t in range(query_frame, T):
img = Image.fromarray(np.uint8(res_video[t]))
for i in range(N):
coord = (tracks[t, i, 0], tracks[t, i, 1])
visible = True
if visibility is not None:
visible = visibility[0, t, i]
if coord[0] != 0 and coord[1] != 0:
if not compensate_for_camera_motion or (
compensate_for_camera_motion and segm_mask[i] > 0
):
img = draw_circle(
img,
coord=coord,
radius=int(self.linewidth * 2),
color=vector_colors[t, i].astype(int),
visible=visible,
)
res_video[t] = np.array(img)
# construct the final rgb sequence
if self.show_first_frame > 0:
res_video = [res_video[0]] * self.show_first_frame + res_video[1:]
return torch.from_numpy(np.stack(res_video)).permute(0, 3, 1, 2)[None].byte()
def _draw_pred_tracks(
self,
rgb: np.ndarray, # H x W x 3
tracks: np.ndarray, # T x 2
vector_colors: np.ndarray,
alpha: float = 0.5,
):
T, N, _ = tracks.shape
rgb = Image.fromarray(np.uint8(rgb))
for s in range(T - 1):
vector_color = vector_colors[s]
original = rgb.copy()
alpha = (s / T) ** 2
for i in range(N):
coord_y = (int(tracks[s, i, 0]), int(tracks[s, i, 1]))
coord_x = (int(tracks[s + 1, i, 0]), int(tracks[s + 1, i, 1]))
if coord_y[0] != 0 and coord_y[1] != 0:
rgb = draw_line(
rgb,
coord_y,
coord_x,
vector_color[i].astype(int),
self.linewidth,
)
if self.tracks_leave_trace > 0:
rgb = Image.fromarray(
np.uint8(add_weighted(np.array(rgb), alpha, np.array(original), 1 - alpha, 0))
)
rgb = np.array(rgb)
return rgb
def _draw_gt_tracks(
self,
rgb: np.ndarray, # H x W x 3,
gt_tracks: np.ndarray, # T x 2
):
T, N, _ = gt_tracks.shape
color = np.array((211, 0, 0))
rgb = Image.fromarray(np.uint8(rgb))
for t in range(T):
for i in range(N):
gt_track = gt_tracks[t][i]
# draw a red cross at the ground-truth point (use a local variable so gt_tracks itself is not overwritten)
if gt_track[0] > 0 and gt_track[1] > 0:
length = self.linewidth * 3
coord_y = (int(gt_track[0]) + length, int(gt_track[1]) + length)
coord_x = (int(gt_track[0]) - length, int(gt_track[1]) - length)
rgb = draw_line(
rgb,
coord_y,
coord_x,
color,
self.linewidth,
)
coord_y = (int(gt_track[0]) - length, int(gt_track[1]) + length)
coord_x = (int(gt_track[0]) + length, int(gt_track[1]) - length)
rgb = draw_line(
rgb,
coord_y,
coord_x,
color,
self.linewidth,
)
rgb = np.array(rgb)
return rgb
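# ---------------------------------------------------------------
# Minimal usage sketch (not part of the original file): the tensor shapes the
# Visualizer expects, with random data standing in for real videos and traces.
# ---------------------------------------------------------------
# if __name__ == "__main__":
#     B, T, N = 1, 8, 3
#     video = torch.randint(0, 255, (B, T, 3, 256, 256)).float()  # (B, T, C, H, W), values in 0-255
#     tracks = torch.rand(B, T, N, 2) * 256                       # (B, T, N, 2) pixel coordinates
#     vis = Visualizer(save_dir="./saved_videos", linewidth=2, tracks_leave_trace=-1)
#     vis.visualize(video, tracks, filename="demo")               # writes ./saved_videos/demo.mp4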
# --------------------------------------------------------
# Magma - Multimodal AI Agent at Microsoft Research
# Copyright (c) 2025 Microsoft
# Licensed under The MIT License [see LICENSE for details]
# Jianwei Yang (jianwyan@microsoft.com)
# --------------------------------------------------------
from typing import Optional
import spaces
import gradio as gr
import numpy as np
import torch
from PIL import Image
import io
import re
import base64, os
from util.utils import check_ocr_box, get_yolo_model, get_caption_model_processor, get_som_labeled_img
from util.som import MarkHelper, plot_boxes_with_marks, plot_circles_with_marks
from util.process_utils import pred_2_point, extract_bbox, extract_mark_id
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoProcessor
# Define repository and local directory
repo_id = "microsoft/OmniParser-v2.0" # HF repo
local_dir = "weights" # Target local directory
dtype = torch.bfloat16
DEVICE = torch.device('cuda')
som_generator = MarkHelper()
magma_som_prompt = "<image>\nIn this view I need to click a button to \"{}\"? Provide the coordinates and the mark index of the containing bounding box if applicable."
magma_qa_prompt = "<image>\n{} Answer the question briefly."
magma_model_id = "microsoft/Magma-8B"
magam_model = AutoModelForCausalLM.from_pretrained(magma_model_id, trust_remote_code=True, torch_dtype=dtype)
magma_processor = AutoProcessor.from_pretrained(magma_model_id, trust_remote_code=True)
magam_model.to(DEVICE)
# Download the entire repository
snapshot_download(repo_id=repo_id, local_dir=local_dir)
print(f"Repository downloaded to: {local_dir}")
yolo_model = get_yolo_model(model_path='weights/icon_detect/model.pt')
caption_model_processor = get_caption_model_processor(model_name="florence2", model_name_or_path="weights/icon_caption")
# caption_model_processor = get_caption_model_processor(model_name="blip2", model_name_or_path="weights/icon_caption_blip2")
MARKDOWN = """
<div align="center">
<h2>Magma: A Foundation Model for Multimodal AI Agents</h2>
\[[arXiv Paper](https://www.arxiv.org/pdf/2502.13130)\] &nbsp; \[[Project Page](https://microsoft.github.io/Magma/)\] &nbsp; \[[Github Repo](https://github.com/microsoft/Magma)\] &nbsp; \[[Hugging Face Model](https://huggingface.co/microsoft/Magma-8B)\] &nbsp;
This demo is powered by [Gradio](https://gradio.app/) and uses [OmniParserv2](https://github.com/microsoft/OmniParser) to generate [Set-of-Mark prompts](https://github.com/microsoft/SoM).
The demo supports three modes:
1. Empty text input: the demo falls back to plain OmniParser parsing.
2. Text input starting with "Q:": it leads to a visual question answering demo.
3. Text input for UI navigation: it leads to a UI navigation demo.
</div>
"""
DEVICE = torch.device('cuda')
@spaces.GPU
@torch.inference_mode()
def get_som_response(instruction, image_som):
prompt = magma_som_prompt.format(instruction)
if magam_model.config.mm_use_image_start_end:
qs = prompt.replace('<image>', '<image_start><image><image_end>')
else:
qs = prompt
convs = [{"role": "user", "content": qs}]
convs = [{"role": "system", "content": "You are agent that can see, talk and act."}] + convs
prompt = magma_processor.tokenizer.apply_chat_template(
convs,
tokenize=False,
add_generation_prompt=True
)
inputs = magma_processor(images=[image_som], texts=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
inputs = inputs.to(dtype).to(DEVICE)
magam_model.generation_config.pad_token_id = magma_processor.tokenizer.pad_token_id
with torch.inference_mode():
output_ids = magam_model.generate(
**inputs,
temperature=0.0,
do_sample=False,
num_beams=1,
max_new_tokens=128,
use_cache=True
)
prompt_decoded = magma_processor.batch_decode(inputs['input_ids'], skip_special_tokens=True)[0]
response = magma_processor.batch_decode(output_ids, skip_special_tokens=True)[0]
response = response.replace(prompt_decoded, '').strip()
return response
@spaces.GPU
@torch.inference_mode()
def get_qa_response(instruction, image):
prompt = magma_qa_prompt.format(instruction)
if magam_model.config.mm_use_image_start_end:
qs = prompt.replace('<image>', '<image_start><image><image_end>')
else:
qs = prompt
convs = [{"role": "user", "content": qs}]
convs = [{"role": "system", "content": "You are agent that can see, talk and act."}] + convs
prompt = magma_processor.tokenizer.apply_chat_template(
convs,
tokenize=False,
add_generation_prompt=True
)
inputs = magma_processor(images=[image], texts=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
inputs = inputs.to(dtype).to(DEVICE)
magam_model.generation_config.pad_token_id = magma_processor.tokenizer.pad_token_id
with torch.inference_mode():
output_ids = magam_model.generate(
**inputs,
temperature=0.0,
do_sample=False,
num_beams=1,
max_new_tokens=128,
use_cache=True
)
prompt_decoded = magma_processor.batch_decode(inputs['input_ids'], skip_special_tokens=True)[0]
response = magma_processor.batch_decode(output_ids, skip_special_tokens=True)[0]
response = response.replace(prompt_decoded, '').strip()
return response
@spaces.GPU
@torch.inference_mode()
# @torch.autocast(device_type="cuda", dtype=torch.bfloat16)
def process(
image_input,
box_threshold,
iou_threshold,
use_paddleocr,
imgsz,
instruction,
) -> Optional[Image.Image]:
# image_save_path = 'imgs/saved_image_demo.png'
# image_input.save(image_save_path)
# image = Image.open(image_save_path)
box_overlay_ratio = image_input.size[0] / 3200
draw_bbox_config = {
'text_scale': 0.8 * box_overlay_ratio,
'text_thickness': max(int(2 * box_overlay_ratio), 1),
'text_padding': max(int(3 * box_overlay_ratio), 1),
'thickness': max(int(3 * box_overlay_ratio), 1),
}
ocr_bbox_rslt, is_goal_filtered = check_ocr_box(image_input, display_img = False, output_bb_format='xyxy', goal_filtering=None, easyocr_args={'paragraph': False, 'text_threshold':0.9}, use_paddleocr=use_paddleocr)
text, ocr_bbox = ocr_bbox_rslt
dino_labled_img, label_coordinates, parsed_content_list = get_som_labeled_img(image_input, yolo_model, BOX_TRESHOLD = box_threshold, output_coord_in_ratio=False, ocr_bbox=ocr_bbox,draw_bbox_config=draw_bbox_config, caption_model_processor=caption_model_processor, ocr_text=text,iou_threshold=iou_threshold, imgsz=imgsz,)
parsed_content_list = '\n'.join([f'icon {i}: ' + str(v) for i,v in enumerate(parsed_content_list)])
if len(instruction) == 0:
print('finish processing')
image = Image.open(io.BytesIO(base64.b64decode(dino_labled_img)))
return image, str(parsed_content_list)
elif instruction.startswith('Q:'):
response = get_qa_response(instruction, image_input)
return image_input, response
# parsed_content_list = str(parsed_content_list)
# convert xywh to yxhw
label_coordinates_yxhw = {}
for key, val in label_coordinates.items():
if val[2] < 0 or val[3] < 0:
continue
label_coordinates_yxhw[key] = [val[1], val[0], val[3], val[2]]
image_som = plot_boxes_with_marks(image_input.copy(), [val for key, val in label_coordinates_yxhw.items()], som_generator, edgecolor=(255,0,0), fn_save=None, normalized_to_pixel=False)
# convert xywh to xyxy
for key, val in label_coordinates.items():
label_coordinates[key] = [val[0], val[1], val[0] + val[2], val[1] + val[3]]
# normalize label_coordinates
for key, val in label_coordinates.items():
label_coordinates[key] = [val[0] / image_input.size[0], val[1] / image_input.size[1], val[2] / image_input.size[0], val[3] / image_input.size[1]]
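# At this point label_coordinates maps each mark id to a normalized
# [x1, y1, x2, y2] box. Illustrative example: an OmniParser xywh box
# [100, 40, 50, 20] on a 1000x500 screenshot becomes xyxy [100, 40, 150, 60]
# and then [0.1, 0.08, 0.15, 0.12] after normalization.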
magma_response = get_som_response(instruction, image_som)
print("magma repsonse: ", magma_response)
# map magma_response into the mark id
mark_id = extract_mark_id(magma_response)
if mark_id is not None:
if str(mark_id) in label_coordinates:
bbox_for_mark = label_coordinates[str(mark_id)]
else:
bbox_for_mark = None
else:
bbox_for_mark = None
if bbox_for_mark:
# draw bbox_for_mark on the image
image_som = plot_boxes_with_marks(
image_input,
[label_coordinates_yxhw[str(mark_id)]],
som_generator,
edgecolor=(255,127,111),
alpha=30,
fn_save=None,
normalized_to_pixel=False,
add_mark=False
)
else:
try:
if 'box' in magma_response:
pred_bbox = extract_bbox(magma_response)
click_point = [(pred_bbox[0][0] + pred_bbox[1][0]) / 2, (pred_bbox[0][1] + pred_bbox[1][1]) / 2]
click_point = [item / 1000 for item in click_point]
else:
click_point = pred_2_point(magma_response)
# de-normalize click_point (width, height)
click_point = [click_point[0] * image_input.size[0], click_point[1] * image_input.size[1]]
image_som = plot_circles_with_marks(
image_input,
[click_point],
som_generator,
edgecolor=(255,127,111),
linewidth=3,
fn_save=None,
normalized_to_pixel=False,
add_mark=False
)
except:
image_som = image_input
return image_som, str(parsed_content_list)
with gr.Blocks() as demo:
gr.Markdown(MARKDOWN)
with gr.Row():
with gr.Column():
image_input_component = gr.Image(
type='pil', label='Upload image')
# set the threshold for removing the bounding boxes with low confidence, default is 0.05
with gr.Accordion("Parameters", open=False) as parameter_row:
box_threshold_component = gr.Slider(
label='Box Threshold', minimum=0.01, maximum=1.0, step=0.01, value=0.05)
# set the threshold for removing the bounding boxes with large overlap, default is 0.1
iou_threshold_component = gr.Slider(
label='IOU Threshold', minimum=0.01, maximum=1.0, step=0.01, value=0.1)
use_paddleocr_component = gr.Checkbox(
label='Use PaddleOCR', value=True)
imgsz_component = gr.Slider(
label='Icon Detect Image Size', minimum=640, maximum=1920, step=32, value=640)
# text box
text_input_component = gr.Textbox(label='Text Input', placeholder='Text Input')
submit_button_component = gr.Button(
value='Submit', variant='primary')
with gr.Column():
image_output_component = gr.Image(type='pil', label='Image Output')
text_output_component = gr.Textbox(label='Parsed screen elements', placeholder='Text Output')
submit_button_component.click(
fn=process,
inputs=[
image_input_component,
box_threshold_component,
iou_threshold_component,
use_paddleocr_component,
imgsz_component,
text_input_component
],
outputs=[image_output_component, text_output_component]
)
# demo.launch(debug=False, show_error=True, share=True)
# demo.launch(share=True, server_port=7861, server_name='0.0.0.0')
demo.queue().launch(share=False)