Commit 0063a668 authored by chenzk

v1.0
# Microsoft Open Source Code of Conduct
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
Resources:
- [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/)
- [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)
- Contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with questions or concerns
MIT License
Copyright (c) Microsoft Corporation.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# Magma
A new era for embodied intelligence! Magma arrives as a strong foundation model for vision-language-action (VLA): an all-rounder for UI navigation and robot manipulation.
## Paper
`Magma: A Foundation Model for Multimodal AI Agents`
- https://arxiv.org/pdf/2502.13130
## Model Architecture
A vision encoder V encodes each input frame into multiple tokens; the tokens from all frames are concatenated into one sequence and fed, together with the language tokens encoding the task description, into a decoder-only language model (LLM).
<div align=center>
<img src="./doc/Magma.png"/>
</div>
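The layout above can be summarized with a minimal, hypothetical sketch (module names and tensor shapes are assumptions for illustration; the actual implementation lives in `magma/modeling_magma.py`):
```
import torch
import torch.nn as nn

class ToyMagma(nn.Module):
    """Rough sketch only: vision encoder -> projector -> decoder-only LLM."""
    def __init__(self, vision_encoder: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # encodes each frame into N visual features
        self.projector = projector            # maps visual features into the LLM embedding space
        self.llm = llm                        # decoder-only language model

    def forward(self, frames: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W); text_embeds: (B, L, D) task-description token embeddings
        b, t = frames.shape[:2]
        feats = self.vision_encoder(frames.flatten(0, 1))      # (B*T, N, D_v)
        vis_tokens = self.projector(feats)                     # (B*T, N, D)
        vis_tokens = vis_tokens.reshape(b, t * vis_tokens.shape[1], -1)  # (B, T*N, D)
        # visual tokens from all frames are concatenated with the language tokens and decoded
        return self.llm(torch.cat([vis_tokens, text_embeds], dim=1))
```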
## Algorithm
Using Set-of-Mark (SoM) and Trace-of-Mark (ToM) labeling, vision-language data is turned into actionable supervision, which markedly improves spatial intelligence and task generalization; the resulting model can understand and execute multimodal tasks in both digital and physical environments.
The authors propose a simple and effective recipe that combines Set-of-Mark (SoM) and Trace-of-Mark (ToM) to extend the model to spatial prediction tasks (e.g., clickable buttons) and to the temporal dimension.
<div align=center>
<img src="./doc/algorithm.png"/>
</div>
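For intuition, Set-of-Mark prompting amounts to overlaying numbered marks on candidate regions and asking the model to answer with a mark index. Below is a minimal, illustrative helper (not the project's own code; the UI-agent demo in this repository builds its marks with OmniParser and `util.som`):
```
from PIL import Image, ImageDraw

def overlay_marks(image, boxes):
    """Draw numbered Set-of-Mark labels on candidate regions (illustration only)."""
    img = image.copy()
    draw = ImageDraw.Draw(img)
    for idx, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        draw.rectangle([x0, y0, x1, y1], outline=(255, 0, 0), width=3)
        draw.text((x0 + 4, y0 + 4), str(idx), fill=(255, 0, 0))
    return img

# marked = overlay_marks(screenshot, [(10, 10, 120, 60), (200, 40, 320, 90)])
# The model is then prompted to reply with the index of the mark to act on;
# Trace-of-Mark extends this by predicting the future positions of the marks over time.
```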
## Environment Setup
```
mv Magma_pytorch Magma # drop the framework-name suffix from the directory
```
### Docker (Option 1)
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.4.1-ubuntu22.04-dtk25.04-py3.10-fixpy
# Replace <your IMAGE ID> with the ID of the image pulled above; for this image it is 6063b673703a
docker run -it --shm-size=64G -v $PWD/Magma:/home/Magma -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name magma <your IMAGE ID> bash
cd /home/Magma
pip install -e . -i https://mirrors.aliyun.com/pypi/simple
pip install https://download.sourcefind.cn:65024/directlink/4/tensorflow/DAS1.5/tensorflow-2.13.1+das.opt1.dtk2504-cp310-cp310-linux_x86_64.whl -i https://mirrors.aliyun.com/pypi/simple # tensorflow=2.13.1
```
### Dockerfile (Option 2)
```
cd /home/Magma/docker
docker build --no-cache -t magma:latest .
docker run --shm-size=64G --name magma -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v $PWD/../../Magma:/home/Magma -it magma bash
# If installing the environment through the Dockerfile takes too long, comment out the pip installs inside it and install the Python packages after the container starts: pip install -r requirements.txt.
pip install -e . -i https://mirrors.aliyun.com/pypi/simple
pip install https://download.sourcefind.cn:65024/directlink/4/tensorflow/DAS1.5/tensorflow-2.13.1+das.opt1.dtk2504-cp310-cp310-linux_x86_64.whl -i https://mirrors.aliyun.com/pypi/simple # tensorflow=2.13.1
```
### Anaconda (Option 3)
1. The DCU-specific deep-learning libraries required by this project can be downloaded from the developer community (光合开发者社区):
- https://developer.sourcefind.cn/tool/
```
DTK driver: dtk2504
python: 3.10
torch: 2.4.1
torchvision: 0.19.1
triton: 3.0.0
vllm: 0.6.2
flash-attn: 2.6.1
deepspeed: 0.14.2
apex: 1.4.0
transformers: 4.51.3
tensorflow: 2.13.1
```
`Tips: The versions of the DTK driver, python, torch, and other DCU-related tools listed above must match each other exactly.`
2. Install the remaining, non-special libraries according to requirements.txt:
```
cd /home/Magma
pip install -e . -i https://mirrors.aliyun.com/pypi/simple
```
## Dataset
`None`
## Training
`None`
## Inference
Pretrained weights directory structure:
```
/home/Magma
└── microsoft/Magma-8B
```
Set the Hugging Face download mirror:
```
export HF_ENDPOINT=https://hf-mirror.com
```
When the inference command is run, the project automatically downloads the model laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg and caches the encoded result once the download completes; the upstream code has no option for loading this model from local weights.
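If the automatic download is slow, the CLIP backbone can be prefetched into the Hugging Face cache through the mirror. A minimal sketch (`snapshot_download` is the same helper used by the UI-agent demo later in this repository):
```
import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"  # must be set before importing huggingface_hub

from huggingface_hub import snapshot_download

# Prefetch the vision backbone that the project downloads automatically at inference time.
snapshot_download(repo_id="laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg")
```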
### Single-Node Multi-GPU
```
cd /home/Magma
python infer_transformers.py
```
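The contents of `infer_transformers.py` are not reproduced here; a minimal transformers-based inference sketch that follows the same calling pattern as the demos later in this repository would look roughly like this:
```
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Magma-8B"
dtype = torch.bfloat16
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, torch_dtype=dtype).to("cuda")
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("./assets/images/magma_logo.jpg").convert("RGB")
convs = [
    {"role": "system", "content": "You are agent that can see, talk and act."},
    {"role": "user", "content": "<image_start><image><image_end>\nWhat is the letter on the robot?"},
]
prompt = processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)
inputs = processor(images=[image], texts=prompt, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].unsqueeze(0)
inputs["image_sizes"] = inputs["image_sizes"].unsqueeze(0)
inputs = inputs.to("cuda").to(dtype)

with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False, use_cache=True)
output_ids = output_ids[:, inputs["input_ids"].shape[-1]:]
print(processor.decode(output_ids[0], skip_special_tokens=True).strip())
```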
For more details, see [`README_origin`](./README_origin.md) from the upstream project.
## Results
`Input:`
```
prompt: "What is the letter on the robot?"
image: "./assets/images/magma_logo.jpg"
```
`Output:`
```
response: The letter on the robot is "M".
```
Official demo example:
<div align=center>
<img src="./doc/magma_mushroom.gif"/>
</div>
### Accuracy
Inference accuracy on DCU matches that on GPU; inference framework: PyTorch.
## Application Scenarios
### Algorithm Category
`Embodied AI`
### Key Application Industries
`Manufacturing, home, healthcare, energy, education`
## Pretrained Weights
Download from Hugging Face: [microsoft/Magma-8B](https://huggingface.co/microsoft/Magma-8B)
## Source Repository and Issue Reporting
- http://developer.sourcefind.cn/codes/modelzoo/InfiniteYou_pytorch.git
## References
- https://github.com/microsoft/Magma.git
<!-- BEGIN MICROSOFT SECURITY.MD V0.0.9 BLOCK -->
## Security
Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet) and [Xamarin](https://github.com/xamarin).
If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://aka.ms/security.md/definition), please report it to us as described below.
## Reporting Security Issues
**Please do not report security vulnerabilities through public GitHub issues.**
Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://aka.ms/security.md/msrc/create-report).
If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://aka.ms/security.md/msrc/pgp).
You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc).
Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:
* Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
* Full paths of source file(s) related to the manifestation of the issue
* The location of the affected source code (tag/branch/commit or direct URL)
* Any special configuration required to reproduce the issue
* Step-by-step instructions to reproduce the issue
* Proof-of-concept or exploit code (if possible)
* Impact of the issue, including how an attacker might exploit the issue
This information will help us triage your report more quickly.
If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://aka.ms/security.md/msrc/bounty) page for more details about our active programs.
## Preferred Languages
We prefer all communications to be in English.
## Policy
Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://aka.ms/security.md/cvd).
<!-- END MICROSOFT SECURITY.MD BLOCK -->
# TODO: The maintainer of this repo has not yet edited this file
**REPO OWNER**: Do you want Customer Service & Support (CSS) support for this product/project?
- **No CSS support:** Fill out this template with information about how to file issues and get help.
- **Yes CSS support:** Fill out an intake form at [aka.ms/onboardsupport](https://aka.ms/onboardsupport). CSS will work with/help you to determine next steps.
- **Not sure?** Fill out an intake as though the answer were "Yes". CSS will help you decide.
*Then remove this first heading from this SUPPORT.MD file before publishing your repo.*
# Support
## How to file issues and get help
This project uses GitHub Issues to track bugs and feature requests. Please search the existing
issues before filing new issues to avoid duplicates. For new issues, file your bug or
feature request as a new Issue.
For help and questions about using this project, please **REPO MAINTAINER: INSERT INSTRUCTIONS HERE
FOR HOW TO ENGAGE REPO OWNERS OR COMMUNITY FOR HELP. COULD BE A STACK OVERFLOW TAG OR OTHER
CHANNEL. WHERE WILL YOU HELP PEOPLE?**.
## Microsoft Support Policy
Support for this **PROJECT or PRODUCT** is limited to the resources listed above.
# --------------------------------------------------------
# Magma - Multimodal AI Agent at Microsoft Research
# Copyright (c) 2025 Microsoft
# Licensed under The MIT License [see LICENSE for details]
# Jianwei Yang (jianwyan@microsoft.com)
# --------------------------------------------------------
import pygame
import numpy as np
import gradio as gr
import time
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor
import re
import random
pygame.mixer.quit() # Disable sound
# Constants
WIDTH, HEIGHT = 800, 800
GRID_SIZE = 80
WHITE = (255, 255, 255)
GREEN = (34, 139, 34) # Forest green - more like an apple
RED = (200, 50, 50)
BLACK = (0, 0, 0)
GRAY = (128, 128, 128)
YELLOW = (218, 165, 32) # Golden yellow color
# Directions
UP = (0, -1)
DOWN = (0, 1)
LEFT = (-1, 0)
RIGHT = (1, 0)
STATIC = (0, 0)
ACTIONS = ["up", "down", "left", "right", "static"]
# Load AI Model
magma_model_id = "microsoft/Magma-8B"
dtype = torch.bfloat16
magma_model = AutoModelForCausalLM.from_pretrained(magma_model_id, trust_remote_code=True, torch_dtype=dtype)
magma_processor = AutoProcessor.from_pretrained(magma_model_id, trust_remote_code=True, torch_dtype=dtype)
magma_model.to("cuda")
# Load magma image
magma_img = pygame.image.load("./assets/images/magma_game.png")
magma_img = pygame.transform.scale(magma_img, (GRID_SIZE, GRID_SIZE))
class MagmaFindGPU:
def __init__(self):
self.reset()
def reset(self):
self.snake = [(5, 5)]
self.direction = RIGHT
self.score = 0
self.game_over = False
self.place_target()
def place_target(self):
while True:
target_x = np.random.randint(1, WIDTH // GRID_SIZE - 1)
target_y = np.random.randint(1, HEIGHT // GRID_SIZE - 1)
if (target_x, target_y) not in self.snake:
self.target = (target_x, target_y)
break
def step(self, action):
if action == "up":
self.direction = UP
elif action == "down":
self.direction = DOWN
elif action == "left":
self.direction = LEFT
elif action == "right":
self.direction = RIGHT
elif action == "static":
self.direction = STATIC
if self.game_over:
return self.render(), self.score
new_head = (self.snake[0][0] + self.direction[0], self.snake[0][1] + self.direction[1])
if new_head[0] < 0 or new_head[1] < 0 or new_head[0] >= WIDTH // GRID_SIZE or new_head[1] >= HEIGHT // GRID_SIZE:
self.game_over = True
return self.render(), self.score
self.snake = [new_head] # Keep only the head (single block snake)
        # Check whether the target is adjacent to the head (covered by one of the four surrounding squares)
        head_x, head_y = self.snake[0]
        neighbors = {(head_x, head_y - 1), (head_x, head_y + 1), (head_x - 1, head_y), (head_x + 1, head_y)}
        if self.target in neighbors:
self.score += 1
self.place_target()
return self.render(), self.score
def render(self):
pygame.init()
surface = pygame.Surface((WIDTH, HEIGHT))
surface.fill(BLACK)
head_x, head_y = self.snake[0]
surface.blit(magma_img, (head_x * GRID_SIZE, head_y * GRID_SIZE))
# pygame.draw.rect(surface, RED, (self.snake[0][0] * GRID_SIZE, self.snake[0][1] * GRID_SIZE, GRID_SIZE, GRID_SIZE))
pygame.draw.rect(surface, GREEN, (self.target[0] * GRID_SIZE, self.target[1] * GRID_SIZE, GRID_SIZE, GRID_SIZE))
# Draw four surrounding squares with labels
head_x, head_y = self.snake[0]
neighbors = [(head_x, head_y - 1), (head_x, head_y + 1), (head_x - 1, head_y), (head_x + 1, head_y)]
labels = ["1", "2", "3", "4"]
font = pygame.font.Font(None, 48)
# clone surface
surface_nomark = surface.copy()
for i, (nx, ny) in enumerate(neighbors):
if 0 <= nx < WIDTH // GRID_SIZE and 0 <= ny < HEIGHT // GRID_SIZE:
pygame.draw.rect(surface, RED, (nx * GRID_SIZE, ny * GRID_SIZE, GRID_SIZE, GRID_SIZE), GRID_SIZE)
# pygame.draw.rect(surface_nomark, RED, (nx * GRID_SIZE, ny * GRID_SIZE, GRID_SIZE, GRID_SIZE), GRID_SIZE)
text = font.render(labels[i], True, WHITE)
text_rect = text.get_rect(center=(nx * GRID_SIZE + GRID_SIZE // 2, ny * GRID_SIZE + GRID_SIZE // 2))
surface.blit(text, text_rect)
return np.array(pygame.surfarray.array3d(surface_nomark)).swapaxes(0, 1), np.array(pygame.surfarray.array3d(surface)).swapaxes(0, 1)
def get_state(self):
return self.render()
game = MagmaFindGPU()
def play_game():
state, state_som = game.get_state()
pil_img = Image.fromarray(state_som)
convs = [
{"role": "system", "content": "You are an agent that can see, talk, and act."},
{"role": "user", "content": "<image_start><image><image_end>\nWhich mark is closer to green block? Answer with a single number."},
]
prompt = magma_processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)
inputs = magma_processor(images=[pil_img], texts=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
inputs = inputs.to("cuda").to(dtype)
generation_args = {
"max_new_tokens": 10,
"temperature": 0,
"do_sample": False,
"use_cache": True,
"num_beams": 1,
}
with torch.inference_mode():
generate_ids = magma_model.generate(**inputs, **generation_args)
generate_ids = generate_ids[:, inputs["input_ids"].shape[-1] :]
action = magma_processor.decode(generate_ids[0], skip_special_tokens=True).strip()
    # extract the mark id from the action string using a regex
match = re.search(r'\d+', action)
if match:
action = match.group(0)
if action.isdigit() and 1 <= int(action) <= 4:
# epsilon sampling
if random.random() < 0.1:
action = random.choice(ACTIONS[:-1])
else:
action = ACTIONS[int(action) - 1]
else:
# random choose one from the pool
action = random.choice(ACTIONS[:-1])
else:
action = random.choice(ACTIONS[:-1])
img, score = game.step(action)
img = img[0]
return img, f"Score: {score}"
def reset_game():
game.reset()
return game.render()[0], "Score: 0"
MARKDOWN = """
<div align="center">
<h2>Magma: A Foundation Model for Multimodal AI Agents</h2>
Game: Magma finds the apple by moving up, down, left and right.
\[[arXiv Paper](https://www.arxiv.org/pdf/2502.13130)\] &nbsp; \[[Project Page](https://microsoft.github.io/Magma/)\] &nbsp; \[[Github Repo](https://github.com/microsoft/Magma)\] &nbsp; \[[Hugging Face Model](https://huggingface.co/microsoft/Magma-8B)\] &nbsp;
This demo is powered by [Gradio](https://gradio.app/).
</div>
"""
with gr.Blocks() as interface:
gr.Markdown(MARKDOWN)
with gr.Row():
image_output = gr.Image(label="Game Screen")
score_output = gr.Text(label="Score")
with gr.Row():
start_btn = gr.Button("Start/Reset Game")
interface.load(fn=play_game, every=1, inputs=[], outputs=[image_output, score_output])
start_btn.click(fn=reset_game, inputs=[], outputs=[image_output, score_output])
interface.launch()
import gradio as gr
import numpy as np
import gymnasium as gym
from PIL import Image
import matplotlib.pyplot as plt
# Initialize FrozenLake environment
env = gym.make("FrozenLake-v1", render_mode="rgb_array")
state, _ = env.reset()
action_mapping = {
"Left": 3,
"Down": 1,
"Right": 2,
"Up": 0,
}
def render_env():
"""Render the environment and return as an image."""
frame = env.render()
image = Image.fromarray(frame)
return image
def step(action):
"""Take a step in the environment."""
global state
action_index = action_mapping[action]
state, reward, done, _, _ = env.step(action_index)
image = render_env()
message = f"State: {state}, Reward: {reward}, Done: {done}"
if done:
env.reset()
message += " - Resetting environment"
return image, message
# Create Gradio interface
with gr.Blocks() as demo:
gr.Markdown("# Play Frozen Lake!")
image_display = gr.Image()
action_buttons = gr.Radio(choices=list(action_mapping.keys()), label="Select Action")
submit_button = gr.Button("Step")
output_text = gr.Textbox(label="Game State")
submit_button.click(fn=step, inputs=action_buttons, outputs=[image_display, output_text])
    # Show the initial state when the app loads
    demo.load(fn=render_env, inputs=[], outputs=image_display)
demo.launch()
# Magma: Multimodal Agentic Models
Evaluating Magma on [LIBERO](https://github.com/Lifelong-Robot-Learning/LIBERO).
#### LIBERO Setup
Clone and install LIBERO and other requirements:
```
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
pip install -r agents/libero/requirements.txt
cd LIBERO
pip install -e .
```
#### Quick Evaluation
The following code demonstrates how to run Magma on a single LIBERO task and evaluate its performance:
```
import numpy as np
from libero.libero import benchmark
from libero_env_utils import get_libero_env, get_libero_dummy_action, get_libero_obs, get_max_steps, save_rollout_video
from libero_magma_utils import get_magma_model, get_magma_prompt, get_magma_action
# Set up benchmark and task
benchmark_dict = benchmark.get_benchmark_dict()
task_suite_name = "libero_goal" # or libero_spatial, libero_object, etc.
task_suite = benchmark_dict[task_suite_name]()
task_id = 1
task = task_suite.get_task(task_id)
# Initialize environment
env, task_description = get_libero_env(task, resolution=256)
print(f"Task {task_id} description: {task_description}")
# Load MAGMA model
model_name = "microsoft/magma-8b-libero-goal" # or your local path
processor, magma = get_magma_model(model_name)
prompt = get_magma_prompt(task_description, processor, magma.config)
# Run evaluation
num_steps_wait = 10
max_steps = get_max_steps(task_suite_name)
env.seed(0)
obs = env.reset()
init_states = task_suite.get_task_init_states(task_id)
obs = env.set_init_state(init_states[0])
step = 0
replay_images = []
while step < max_steps + num_steps_wait:
if step < num_steps_wait:
obs, _, done, _ = env.step(get_libero_dummy_action())
step += 1
continue
img = get_libero_obs(obs, resize_size=256)
replay_images.append(img)
action = get_magma_action(magma, processor, img, prompt, task_suite_name)
obs, _, done, _ = env.step(action.tolist())
step += 1
env.close()
save_rollout_video(replay_images, success=done, task_description=task_description)
```
**Notes:** The script above runs only a single episode of one task and saves a video of MAGMA's trajectory. For a comprehensive evaluation of each task suite, please use `eval_magma_libero.py`:
```
python eval_magma_libero.py \
    --model_name microsoft/Magma-8B-libero-object \
    --task_suite_name libero_object

python eval_magma_libero.py \
    --model_name microsoft/Magma-8B-libero-spatial \
    --task_suite_name libero_spatial

python eval_magma_libero.py \
    --model_name microsoft/Magma-8B-libero-goal \
    --task_suite_name libero_goal
```
import os
import numpy as np
import draccus
from dataclasses import dataclass
from typing import Optional, Tuple
import tqdm
from libero.libero import benchmark
from libero_env_utils import (
get_libero_env,
get_libero_dummy_action,
get_libero_obs,
get_max_steps,
set_seed_everywhere
)
from libero_magma_utils import (
get_magma_model,
get_magma_prompt,
get_magma_action
)
@dataclass
class LiberoConfig:
# Model parameters
model_name: str = "microsoft/magma-8b-libero-goal" # model_name
task_suite_name: str = "libero_goal" # Task suite name
# Evaluation parameters
num_trials_per_task: int = 50 # Number of rollouts per task
resolution: int = 256 # Image resolution
num_steps_wait: int = 10 # Steps to wait for stabilization
seed: int = 0 # Random seed
save_dir: str = "./libero_eval_log" # Directory for saving logs
@draccus.wrap()
def eval_libero(cfg: LiberoConfig) -> Tuple[int, int]:
"""
Evaluate Libero environment with given configuration.
Args:
cfg: LiberoConfig object containing evaluation parameters
Returns:
Tuple[int, int]: Total episodes and total successful episodes
"""
# Setup logging
os.makedirs(cfg.save_dir, exist_ok=True)
log_filepath = f"{cfg.save_dir}/magma_eval-{cfg.task_suite_name}.log"
log_file = open(log_filepath, "w")
print(f"Logging to local log file: {log_filepath}")
# Write initial log
log_file.write(f"Task suite: {cfg.task_suite_name}\n")
print(f"Task suite: {cfg.task_suite_name}")
# Get benchmark and task suite
benchmark_dict = benchmark.get_benchmark_dict()
task_suite = benchmark_dict[cfg.task_suite_name]()
num_tasks_in_suite = task_suite.n_tasks
# Initialize counters
total_episodes, total_successes = 0, 0
set_seed_everywhere(cfg.seed)
# Load model
processor, magma = get_magma_model(cfg.model_name)
# Iterate through all tasks
for task_id in tqdm.tqdm(range(num_tasks_in_suite)):
# Get task
task = task_suite.get_task(task_id)
task_name = task.name
max_steps = get_max_steps(cfg.task_suite_name)
# Get default LIBERO initial states
initial_states = task_suite.get_task_init_states(task_id)
# Initialize LIBERO environment and task description
env, task_description = get_libero_env(task, resolution=cfg.resolution)
print(f"[info] Evaluating task {task_id} from suite {cfg.task_suite_name}, "
f"the language instruction is {task_description}.")
log_file.write(f"Task {task_id}: {task_description}\n")
log_file.flush()
# Get prompt for current task
prompt = get_magma_prompt(task_description, processor, magma.config)
# Initialize task-specific counters
task_episodes, task_successes = 0, 0
# Run trials for current task
for trial in range(cfg.num_trials_per_task):
env.reset()
obs = env.set_init_state(initial_states[trial])
step = 0
while step < max_steps + cfg.num_steps_wait:
if step < cfg.num_steps_wait:
obs, reward, done, info = env.step(get_libero_dummy_action())
step += 1
continue
img = get_libero_obs(obs, resize_size=cfg.resolution)
action = get_magma_action(magma, processor, img, prompt, cfg.task_suite_name)
obs, reward, done, info = env.step(action.tolist())
step += 1
if done:
task_successes += 1
break
task_episodes += 1
# Update total counters
total_episodes += task_episodes
total_successes += task_successes
# Log task success rate
task_success_rate = float(task_successes) / float(task_episodes)
print(f"Current task ({task_name}) success rate: {task_success_rate}")
log_file.write(f"Current task ({task_name}) success rate: {task_success_rate}\n")
log_file.flush()
# Log final suite success rate
suite_success_rate = float(total_successes) / float(total_episodes)
print(f"Task suite success rate: {suite_success_rate}")
log_file.write(f"\nTask suite {cfg.task_suite_name} success rate: {suite_success_rate}\n")
log_file.flush()
env.close()
log_file.close()
return total_episodes, total_successes
if __name__ == "__main__":
eval_libero()
"""Utils for evaluating policies in LIBERO simulation environments."""
import math
import os
import torch
import random
from PIL import Image
import imageio
import numpy as np
import tensorflow as tf
from libero.libero import get_libero_path
from libero.libero.envs import OffScreenRenderEnv
def resize_image(img, resize_size):
"""
Takes numpy array corresponding to a single image and returns resized image as numpy array.
"""
assert isinstance(resize_size, tuple)
# Resize to image size expected by model
img = tf.image.encode_jpeg(img) # Encode as JPEG, as done in RLDS dataset builder
img = tf.io.decode_image(img, expand_animations=False, dtype=tf.uint8) # Immediately decode back
img = tf.image.resize(img, resize_size, method="lanczos3", antialias=True)
img = tf.cast(tf.clip_by_value(tf.round(img), 0, 255), tf.uint8)
img = img.numpy()
return img
def get_libero_env(task, resolution=256):
"""Initializes and returns the LIBERO environment, along with the task description."""
task_description = task.language
task_bddl_file = os.path.join(get_libero_path("bddl_files"), task.problem_folder, task.bddl_file)
env_args = {"bddl_file_name": task_bddl_file, "camera_heights": resolution, "camera_widths": resolution}
env = OffScreenRenderEnv(**env_args)
env.seed(0) # IMPORTANT: seed seems to affect object positions even when using fixed initial state
return env, task_description
def get_libero_dummy_action():
"""Get dummy/no-op action, used to roll out the simulation while the robot does nothing."""
return [0, 0, 0, 0, 0, 0, -1]
def get_libero_obs(obs, resize_size):
"""Extracts image from observations and preprocesses it."""
assert isinstance(resize_size, int) or isinstance(resize_size, tuple)
if isinstance(resize_size, int):
resize_size = (resize_size, resize_size)
img = obs["agentview_image"]
img = img[::-1, ::-1] # IMPORTANT: rotate 180 degrees to match train preprocessing
    # resize to the resolution expected by the model
    image = resize_image(img, resize_size)
    return image
def get_max_steps(task_suite_name):
if task_suite_name == "libero_spatial":
max_steps = 220
elif task_suite_name == "libero_object":
max_steps = 280
elif task_suite_name == "libero_goal":
max_steps = 300
elif task_suite_name == "libero_10":
max_steps = 520
else:
max_steps = 400
return max_steps
def quat2axisangle(quat):
"""
Copied from robosuite: https://github.com/ARISE-Initiative/robosuite/blob/eafb81f54ffc104f905ee48a16bb15f059176ad3/robosuite/utils/transform_utils.py#L490C1-L512C55
Converts quaternion to axis-angle format.
Returns a unit vector direction scaled by its angle in radians.
Args:
quat (np.array): (x,y,z,w) vec4 float angles
Returns:
np.array: (ax,ay,az) axis-angle exponential coordinates
"""
# clip quaternion
if quat[3] > 1.0:
quat[3] = 1.0
elif quat[3] < -1.0:
quat[3] = -1.0
den = np.sqrt(1.0 - quat[3] * quat[3])
if math.isclose(den, 0.0):
# This is (close to) a zero degree rotation, immediately return
return np.zeros(3)
return (quat[:3] * 2.0 * math.acos(quat[3])) / den
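# Quick sanity check for quat2axisangle (illustrative): a 90-degree rotation about z,
# given as an (x, y, z, w) quaternion, maps to roughly [0, 0, pi/2]:
#   quat2axisangle(np.array([0.0, 0.0, math.sin(math.pi / 4), math.cos(math.pi / 4)]))
#   -> array([0., 0., 1.5708])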
def save_rollout_video(replay_images, success, task_description):
"""Saves a video replay of a rollout in libero."""
save_dir = f"./libero_videos"
os.makedirs(save_dir, exist_ok=True)
processed_task_description = task_description.lower().replace(" ", "_").replace("\n", "_").replace(".", "_")[:50]
video_path = f"{save_dir}/quick_eval-success={success}--task={processed_task_description}.mp4"
video_writer = imageio.get_writer(video_path, fps=30)
for img in replay_images:
video_writer.append_data(img)
video_writer.close()
print(f"Saved libero video at path {video_path}")
return video_path
def set_seed_everywhere(seed: int):
"""Sets the random seed for Python, NumPy, and PyTorch functions."""
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
np.random.seed(seed)
random.seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
os.environ["PYTHONHASHSEED"] = str(seed)
import os
import json
import torch
import numpy as np
from magma.image_processing_magma import MagmaImageProcessor
from magma.processing_magma import MagmaProcessor
from magma.modeling_magma import MagmaForConditionalGeneration
def get_magma_model(model_name):
processor = MagmaProcessor.from_pretrained(model_name, trust_remote_code=True)
magma = MagmaForConditionalGeneration.from_pretrained(model_name,
device_map="cuda",
low_cpu_mem_usage=True,
attn_implementation="flash_attention_2",
torch_dtype=torch.bfloat16,
trust_remote_code=True,
use_cache=True,
)
return processor, magma
def get_magma_prompt(task_description, processor, model_config):
convs = [
{"role": "user", "content": f"<image>\nWhat action should the robot take to {task_description}?"},
]
convs = [
{
"role": "system",
"content": "You are agent that can see, talk and act.",
},
] + convs
prompt = processor.tokenizer.apply_chat_template(
convs,
tokenize=False,
add_generation_prompt=True
)
if model_config.mm_use_image_start_end:
prompt = prompt.replace("<image>", "<image_start><image><image_end>")
return prompt
def get_magma_action(magma, processor, img, prompt, task_suite_name):
dataset_stats = json.load(open(os.path.join(magma.config._name_or_path, "dataset_statistics.json")))
action_norm_stats = dataset_stats[f"{task_suite_name}_no_noops"]['action']
n_action_bins = 256
vocab_size = processor.tokenizer.vocab_size
bins = np.linspace(-1, 1, n_action_bins)
bin_centers = (bins[:-1] + bins[1:]) / 2.0
# process inputs
inputs = processor(images=img, texts=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
inputs = inputs.to("cuda").to(torch.bfloat16)
# predict actions with magma
output_ids = magma.generate(
**inputs,
temperature=0.7,
do_sample=True,
num_beams=1,
max_new_tokens=1000,
use_cache=True,
)
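    # The model emits the action as discrete tokens at the end of the generated sequence. Below,
    # the trailing generated tokens are mapped back to bin indices (vocab_size - token id),
    # converted to continuous values via the bin centers, and un-normalized with the q01/q99
    # dataset statistics loaded above.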
action_ids = output_ids[0, -8:-1].cpu().tolist()
predicted_action_ids = np.array(action_ids).astype(np.int64)
discretized_actions = vocab_size - predicted_action_ids
discretized_actions = np.clip(discretized_actions - 1, a_min=0, a_max=bin_centers.shape[0] - 1)
normalized_actions = bin_centers[discretized_actions]
# unnormalize actions
mask = action_norm_stats.get("mask", np.ones_like(action_norm_stats["q01"], dtype=bool))
action_high, action_low = np.array(action_norm_stats["q99"]), np.array(action_norm_stats["q01"])
raw_action = np.where(
mask,
0.5 * (normalized_actions + 1) * (action_high - action_low) + action_low,
normalized_actions,
)
action = normalize_gripper_action(raw_action, binarize=True)
action = invert_gripper_action(action)
return action
def normalize_gripper_action(action, binarize=True):
"""
Convert gripper action from [0,1] to [-1,+1] range.
y = 2x - 1
"""
orig_low, orig_high = 0.0, 1.0
action[..., -1] = 2 * (action[..., -1] - orig_low) / (orig_high - orig_low) - 1
if binarize:
# Binarize to -1 or +1.
action[..., -1] = np.sign(action[..., -1])
return action
def invert_gripper_action(action):
"""Convert gripper: RLDS(0=close,1=open) -> -1=open,+1=close"""
action[..., -1] = action[..., -1] * -1.0
return action
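# Worked example for the two gripper helpers above (illustrative only):
#   RLDS gripper value 0.8 (mostly open)  ->  normalize: 2 * 0.8 - 1 = 0.6  ->  binarize: +1
#   ->  invert: -1, i.e. "open" in the simulator convention described in the docstring.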
robosuite==1.4.0
bddl==1.0.1
easydict==1.9
gym==0.25.2
cloudpickle
imageio[ffmpeg]
# --------------------------------------------------------
# Magma - Multimodal AI Agent at Microsoft Research
# Copyright (c) 2025 Microsoft
# Licensed under The MIT License [see LICENSE for details]
# Jianwei Yang (jianwyan@microsoft.com)
# --------------------------------------------------------
import os
import warnings
from utils.visualizer import Visualizer
from typing import TYPE_CHECKING, Any, Callable, Dict, List, Optional, Tuple
import random
import gradio as gr
import ast, re
import torch
import torchvision
from transformers import AutoModelForCausalLM, AutoProcessor
'''
build model
'''
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
random.seed(0)
spatial_quant_size = 256
# Load AI Model
dtype = torch.bfloat16
device = "cuda"
magma_model_id = "microsoft/Magma-8B"
model = AutoModelForCausalLM.from_pretrained(magma_model_id, trust_remote_code=True, torch_dtype=dtype)
processor = AutoProcessor.from_pretrained(magma_model_id, trust_remote_code=True)
model.to(device)
@torch.no_grad()
def inference(image, task, *args, **kwargs):
# image = image['image']
task_description = task
num_marks = args[0]
speed = args[1]
steps = args[2]
mark_ids = [i+1 for i in range(num_marks)]
image_resized = image.resize((256, 256))
magma_template = (
# "<image>\nThe image is labeled with numeric marks {}.\n"
"<image>\nThe image is split into 256x256 grids and is labeled with numeric marks {}.\n"
"The robot is doing: {}. To finish the task, how to move the numerical marks in the image with speed {} for the next {} steps?\n"
)
"""
Visual Trace Generation
"""
if model.config.mm_use_image_start_end:
magma_template = magma_template.replace("<image>", "<image_start><image><image_end>")
conv_user = magma_template.format(mark_ids, task_description, speed, steps)
print(conv_user)
convs = [
{"role": "user", "content": conv_user},
]
convs = [
{
"role": "system",
"content": "You are agent that can see, talk and act.",
},
] + convs
prompt = processor.tokenizer.apply_chat_template(
convs,
tokenize=False,
add_generation_prompt=True
)
inputs = processor(images=image_resized, texts=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
inputs = inputs.to(dtype).to(device)
with torch.inference_mode():
output_ids = model.generate(
**inputs,
temperature=0.3,
do_sample=True,
num_beams=1,
max_new_tokens=1024,
use_cache=True,
)
response = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
if len(response)==0:
return None
# extract traces from response
if "and their future positions are:" in response:
selected_marks_str, traces_str = response.split("and their future positions are:\n")
else:
selected_marks_str, traces_str = None, response
try:
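        # Parse the textual response into a {mark id: trace} mapping; each trace is a sequence of
        # (x, y) points in the 256x256 grid referenced by the prompt, and is rescaled to the
        # original image size further below before being drawn with the Visualizer.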
traces_dict = ast.literal_eval('{' + traces_str.strip().replace('\n\n',',') + '}')
overlay_traces = []
for mark_id, trace in traces_dict.items():
# convert list of tuples to tensor
trace = torch.tensor(ast.literal_eval(trace)).unsqueeze(1)
overlay_traces.append(trace)
# padded to the same length with the last element
max_len = max([trace.shape[0] for trace in overlay_traces])
for i in range(len(overlay_traces)):
if overlay_traces[i].shape[0] < max_len:
overlay_traces[i] = torch.cat([overlay_traces[i], overlay_traces[i][-1].unsqueeze(0).repeat(max_len - overlay_traces[i].shape[0], 1, 1)], dim=0)
overlay_traces = torch.cat(overlay_traces, dim=1).unsqueeze(0)
# if selected_marks_str is not None:
# selected_marks = re.findall(r'\[(.*?)\]', selected_marks_str)
# selected_marks = [torch.tensor(ast.literal_eval(mark)).unsqueeze(0) for mark in selected_marks]
# selected_marks = torch.cat(selected_marks, dim=0).unsqueeze(0)
# overlay_traces = torch.cat([selected_marks.unsqueeze(1), overlay_traces], dim=1)
overlay_traces = overlay_traces.float() / 256
overlay_traces[:,:,:,0] = overlay_traces[:,:,:,0] * image.size[0]
overlay_traces[:,:,:,1] = overlay_traces[:,:,:,1] * image.size[1]
images = [image] * overlay_traces.shape[1]
overlay_visibility = overlay_traces.new(overlay_traces.shape[0], overlay_traces.shape[1], overlay_traces.shape[2]).fill_(True)
video = torch.stack([torchvision.transforms.ToTensor()(img) for img in images])[None].float()*255
vis = Visualizer(save_dir="./saved_videos", pad_value=0, linewidth=2, tracks_leave_trace=-1)
vis.visualize(video, overlay_traces, overlay_visibility)
# return video path
return "./saved_videos/video.mp4"
except Exception as e:
print(e)
return None
class ImageMask(gr.components.Image):
"""
Sets: source="canvas", tool="sketch"
"""
is_template = True
def __init__(self, **kwargs):
super().__init__(source="upload", tool="sketch", interactive=True, **kwargs)
def preprocess(self, x):
return super().preprocess(x)
class Video(gr.components.Video):
"""
Sets: source="canvas", tool="sketch"
"""
is_template = True
def __init__(self, **kwargs):
super().__init__(source="upload", **kwargs)
def preprocess(self, x):
return super().preprocess(x)
'''
launch app
'''
title = "Magma"
description = '''Magma: Multimodal Agent to Act'''
'''Usage
Instructions:
&#x1F388 Try our default examples first (the sketch is not automatically drawn on the input and example image);
&#x1F388 For the video demo, processing takes about 30-60 s; please refresh if you hit an error on upload;
&#x1F388 Upload an image/video (if you want to use a referred region of another image, check "Example" and upload another image in the referring-image panel);
&#x1F388 Select at least one type of prompt of your choice (if you want to use a referred region of another image, check "Example");
&#x1F388 Remember to provide the actual prompt for each prompt type you select, otherwise you will get an error (e.g., remember to draw on the referring image);
&#x1F388 Our model by default supports the vocabulary of the 133 COCO categories; anything else will be classified as 'others' or misclassified.
'''
article = "The Demo is Run on Magma-8B."
inputs = [
gr.components.Image(label="Draw on Image",type="pil"),
gr.Textbox(label="Task"),
gr.Slider(1, 50, value=10, label="Number of Marks", info="Choose between 1 and 50"),
gr.Slider(2, 50, value=8, label="Speed", info="Choose between 2 and 50"),
gr.Slider(2, 50, value=8, label="Steps", info="Choose between 2 and 50"),
]
gr.Interface(
fn=inference,
inputs=inputs,
outputs=[
gr.Video(
label="Robot planning trajectory", format="mp4"
),
],
examples=[
["agents/robot_traj/sample.png", "Pick up the chip bag.", 9, 8, 8],
],
title=title,
description=description,
article=article,
allow_flagging='never',
cache_examples=False,
).launch(share=True)
# --------------------------------------------------------
# Magma - Multimodal AI Agent at Microsoft Research
# Copyright (c) 2025 Microsoft
# Licensed under The MIT License [see LICENSE for details]
# Jianwei Yang (jianwyan@microsoft.com)
# --------------------------------------------------------
import os
import warnings
from utils.visualizer import Visualizer
from typing import TYPE_CHECKING, Any, Callable, Dict, List, Optional, Tuple
import random
import gradio as gr
import ast, re
import torch
import torchvision
from transformers import AutoModelForCausalLM, AutoProcessor
'''
build model
'''
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
random.seed(0)
spatial_quant_size = 256
# Load AI Model
dtype = torch.bfloat16
device = "cuda"
magma_model_id = "microsoft/Magma-8B"
model = AutoModelForCausalLM.from_pretrained(magma_model_id, trust_remote_code=True, torch_dtype=dtype)
processor = AutoProcessor.from_pretrained(magma_model_id, trust_remote_code=True)
model.to(device)
@torch.no_grad()
def inference(image, task, *args, **kwargs):
# image = image['image']
task_description = task
num_marks = args[0]
speed = args[1]
steps = args[2]
mark_ids = [i+1 for i in range(num_marks)]
image_resized = image.resize((256, 256))
magma_template = (
# "<image>\nThe image is labeled with numeric marks {}.\n"
"<image>\nThe image is split into 256x256 grids and is labeled with numeric marks {}.\n"
"The robot is doing: {}. To finish the task, how to move the numerical marks in the image with speed {} for the next {} steps?\n"
)
"""
Visual Trace Generation
"""
if model.config.mm_use_image_start_end:
magma_template = magma_template.replace("<image>", "<image_start><image><image_end>")
conv_user = magma_template.format(mark_ids, task_description, speed, steps)
print(conv_user)
convs = [
{"role": "user", "content": conv_user},
]
convs = [
{
"role": "system",
"content": "You are agent that can see, talk and act.",
},
] + convs
prompt = processor.tokenizer.apply_chat_template(
convs,
tokenize=False,
add_generation_prompt=True
)
inputs = processor(images=image_resized, texts=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
inputs = inputs.to(dtype).to(device)
with torch.inference_mode():
output_ids = model.generate(
**inputs,
temperature=0.3,
do_sample=True,
num_beams=1,
max_new_tokens=1024,
use_cache=True,
)
response = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
if len(response)==0:
return None
# extract traces from response
if "and their future positions are:" in response:
selected_marks_str, traces_str = response.split("and their future positions are:\n")
else:
selected_marks_str, traces_str = None, response
try:
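        # Parse the textual response into a {mark id: trace} mapping; each trace is a sequence of
        # (x, y) points in the 256x256 grid referenced by the prompt, and is rescaled to the
        # original image size further below before being drawn with the Visualizer.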
traces_dict = ast.literal_eval('{' + traces_str.strip().replace('\n\n',',') + '}')
overlay_traces = []
for mark_id, trace in traces_dict.items():
# convert list of tuples to tensor
trace = torch.tensor(ast.literal_eval(trace)).unsqueeze(1)
overlay_traces.append(trace)
# padded to the same length with the last element
max_len = max([trace.shape[0] for trace in overlay_traces])
for i in range(len(overlay_traces)):
if overlay_traces[i].shape[0] < max_len:
overlay_traces[i] = torch.cat([overlay_traces[i], overlay_traces[i][-1].unsqueeze(0).repeat(max_len - overlay_traces[i].shape[0], 1, 1)], dim=0)
overlay_traces = torch.cat(overlay_traces, dim=1).unsqueeze(0)
# if selected_marks_str is not None:
# selected_marks = re.findall(r'\[(.*?)\]', selected_marks_str)
# selected_marks = [torch.tensor(ast.literal_eval(mark)).unsqueeze(0) for mark in selected_marks]
# selected_marks = torch.cat(selected_marks, dim=0).unsqueeze(0)
# overlay_traces = torch.cat([selected_marks.unsqueeze(1), overlay_traces], dim=1)
overlay_traces = overlay_traces.float() / 256
overlay_traces[:,:,:,0] = overlay_traces[:,:,:,0] * image.size[0]
overlay_traces[:,:,:,1] = overlay_traces[:,:,:,1] * image.size[1]
images = [image] * overlay_traces.shape[1]
overlay_visibility = overlay_traces.new(overlay_traces.shape[0], overlay_traces.shape[1], overlay_traces.shape[2]).fill_(True)
video = torch.stack([torchvision.transforms.ToTensor()(img) for img in images])[None].float()*255
vis = Visualizer(save_dir="./saved_videos", pad_value=0, linewidth=2, tracks_leave_trace=-1)
vis.visualize(video, overlay_traces, overlay_visibility)
# return video path
return "./saved_videos/video.mp4"
except Exception as e:
print(e)
return None
from gradio.events import Dependency
class ImageMask(gr.components.Image):
"""
Sets: source="canvas", tool="sketch"
"""
is_template = True
def __init__(self, **kwargs):
super().__init__(source="upload", tool="sketch", interactive=True, **kwargs)
def preprocess(self, x):
return super().preprocess(x)
from typing import Callable, Literal, Sequence, Any, TYPE_CHECKING
from gradio.blocks import Block
if TYPE_CHECKING:
from gradio.components import Timer
class Video(gr.components.Video):
"""
Sets: source="canvas", tool="sketch"
"""
is_template = True
def __init__(self, **kwargs):
super().__init__(source="upload", **kwargs)
def preprocess(self, x):
return super().preprocess(x)
from typing import Callable, Literal, Sequence, Any, TYPE_CHECKING
from gradio.blocks import Block
if TYPE_CHECKING:
from gradio.components import Timer
'''
launch app
'''
title = "Magma"
description = '''Magma: Multimodal Agent to Act'''
'''Usage
Instructions:
&#x1F388 Try our default examples first (the sketch is not automatically drawn on the input and example image);
&#x1F388 For the video demo, processing takes about 30-60 s; please refresh if you hit an error on upload;
&#x1F388 Upload an image/video (if you want to use a referred region of another image, check "Example" and upload another image in the referring-image panel);
&#x1F388 Select at least one type of prompt of your choice (if you want to use a referred region of another image, check "Example");
&#x1F388 Remember to provide the actual prompt for each prompt type you select, otherwise you will get an error (e.g., remember to draw on the referring image);
&#x1F388 Our model by default supports the vocabulary of the 133 COCO categories; anything else will be classified as 'others' or misclassified.
'''
article = "The Demo is Run on Magma-8B."
inputs = [
gr.components.Image(label="Draw on Image",type="pil"),
gr.Textbox(label="Task"),
gr.Slider(1, 50, value=10, label="Number of Marks", info="Choose between 1 and 50"),
gr.Slider(2, 50, value=8, label="Speed", info="Choose between 2 and 50"),
gr.Slider(2, 50, value=8, label="Steps", info="Choose between 2 and 50"),
]
gr.Interface(
fn=inference,
inputs=inputs,
outputs=[
gr.Video(
label="Robot planning trajectory", format="mp4"
),
],
examples=[
["agents/robot_traj/sample.png", "Pick up the chip bag.", 9, 8, 8],
],
title=title,
description=description,
article=article,
allow_flagging='never',
cache_examples=False,
).launch(share=True)
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
import os
import numpy as np
import imageio
import torch
from matplotlib import cm
import torch.nn.functional as F
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
from PIL import Image, ImageDraw
def read_video_from_path(path):
try:
reader = imageio.get_reader(path)
except Exception as e:
print("Error opening video file: ", e)
return None
frames = []
for i, im in enumerate(reader):
frames.append(np.array(im))
return np.stack(frames)
def draw_circle(rgb, coord, radius, color=(255, 0, 0), visible=True):
# Create a draw object
draw = ImageDraw.Draw(rgb)
# Calculate the bounding box of the circle
left_up_point = (coord[0] - radius, coord[1] - radius)
right_down_point = (coord[0] + radius, coord[1] + radius)
# Draw the circle
draw.ellipse(
[left_up_point, right_down_point],
fill=tuple(color) if visible else None,
outline=tuple(color),
)
return rgb
def draw_line(rgb, coord_y, coord_x, color, linewidth):
draw = ImageDraw.Draw(rgb)
draw.line(
(coord_y[0], coord_y[1], coord_x[0], coord_x[1]),
fill=tuple(color),
width=linewidth,
)
return rgb
def add_weighted(rgb, alpha, original, beta, gamma):
return (rgb * alpha + original * beta + gamma).astype("uint8")
class Visualizer:
def __init__(
self,
save_dir: str = "./results",
grayscale: bool = False,
pad_value: int = 0,
fps: int = 10,
mode: str = "rainbow", # 'cool', 'optical_flow'
linewidth: int = 2,
show_first_frame: int = 10,
tracks_leave_trace: int = 0, # -1 for infinite
):
self.mode = mode
self.save_dir = save_dir
if mode == "rainbow":
self.color_map = cm.get_cmap("gist_rainbow")
elif mode == "cool":
self.color_map = cm.get_cmap(mode)
self.show_first_frame = show_first_frame
self.grayscale = grayscale
self.tracks_leave_trace = tracks_leave_trace
self.pad_value = pad_value
self.linewidth = linewidth
self.fps = fps
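    # Typical usage (see the robot trace-generation demo in this repository):
    #   vis = Visualizer(save_dir="./saved_videos", pad_value=0, linewidth=2, tracks_leave_trace=-1)
    #   vis.visualize(video, tracks, visibility)  # video: (B,T,C,H,W), tracks: (B,T,N,2)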
def visualize(
self,
video: torch.Tensor, # (B,T,C,H,W)
tracks: torch.Tensor, # (B,T,N,2)
visibility: torch.Tensor = None, # (B, T, N, 1) bool
gt_tracks: torch.Tensor = None, # (B,T,N,2)
segm_mask: torch.Tensor = None, # (B,1,H,W)
filename: str = "video",
writer=None, # tensorboard Summary Writer, used for visualization during training
step: int = 0,
query_frame: int = 0,
save_video: bool = True,
compensate_for_camera_motion: bool = False,
):
if compensate_for_camera_motion:
assert segm_mask is not None
if segm_mask is not None:
coords = tracks[0, query_frame].round().long()
segm_mask = segm_mask[0, query_frame][coords[:, 1], coords[:, 0]].long()
video = F.pad(
video,
(self.pad_value, self.pad_value, self.pad_value, self.pad_value),
"constant",
255,
)
tracks = tracks + self.pad_value
if self.grayscale:
transform = transforms.Grayscale()
video = transform(video)
video = video.repeat(1, 1, 3, 1, 1)
res_video = self.draw_tracks_on_video(
video=video,
tracks=tracks,
visibility=visibility,
segm_mask=segm_mask,
gt_tracks=gt_tracks,
query_frame=query_frame,
compensate_for_camera_motion=compensate_for_camera_motion,
)
if save_video:
self.save_video(res_video, filename=filename, writer=writer, step=step)
return res_video
def save_video(self, video, filename, writer=None, step=0):
if writer is not None:
writer.add_video(
filename,
video.to(torch.uint8),
global_step=step,
fps=self.fps,
)
else:
os.makedirs(self.save_dir, exist_ok=True)
wide_list = list(video.unbind(1))
wide_list = [wide[0].permute(1, 2, 0).cpu().numpy() for wide in wide_list]
# Prepare the video file path
save_path = os.path.join(self.save_dir, f"{filename}.mp4")
# Create a writer object
video_writer = imageio.get_writer(save_path, fps=self.fps)
# Write frames to the video file
for frame in wide_list[2:-1]:
video_writer.append_data(frame)
video_writer.close()
print(f"Video saved to {save_path}")
def draw_tracks_on_video(
self,
video: torch.Tensor,
tracks: torch.Tensor,
visibility: torch.Tensor = None,
segm_mask: torch.Tensor = None,
gt_tracks=None,
query_frame: int = 0,
compensate_for_camera_motion=False,
):
B, T, C, H, W = video.shape
_, _, N, D = tracks.shape
assert D == 2
assert C == 3
video = video[0].permute(0, 2, 3, 1).byte().detach().cpu().numpy() # S, H, W, C
tracks = tracks[0].long().detach().cpu().numpy() # S, N, 2
if gt_tracks is not None:
gt_tracks = gt_tracks[0].detach().cpu().numpy()
res_video = []
# process input video
for rgb in video:
res_video.append(rgb.copy())
vector_colors = np.zeros((T, N, 3))
if self.mode == "optical_flow":
import flow_vis
vector_colors = flow_vis.flow_to_color(tracks - tracks[query_frame][None])
elif segm_mask is None:
if self.mode == "rainbow":
y_min, y_max = (
tracks[query_frame, :, 1].min(),
tracks[query_frame, :, 1].max(),
)
norm = plt.Normalize(y_min, y_max)
for n in range(N):
color = self.color_map(norm(tracks[query_frame, n, 1]))
color = np.array(color[:3])[None] * 255
vector_colors[:, n] = np.repeat(color, T, axis=0)
else:
# color changes with time
for t in range(T):
color = np.array(self.color_map(t / T)[:3])[None] * 255
vector_colors[t] = np.repeat(color, N, axis=0)
else:
if self.mode == "rainbow":
vector_colors[:, segm_mask <= 0, :] = 255
y_min, y_max = (
tracks[0, segm_mask > 0, 1].min(),
tracks[0, segm_mask > 0, 1].max(),
)
norm = plt.Normalize(y_min, y_max)
for n in range(N):
if segm_mask[n] > 0:
color = self.color_map(norm(tracks[0, n, 1]))
color = np.array(color[:3])[None] * 255
vector_colors[:, n] = np.repeat(color, T, axis=0)
else:
# color changes with segm class
segm_mask = segm_mask.cpu()
color = np.zeros((segm_mask.shape[0], 3), dtype=np.float32)
color[segm_mask > 0] = np.array(self.color_map(1.0)[:3]) * 255.0
color[segm_mask <= 0] = np.array(self.color_map(0.0)[:3]) * 255.0
vector_colors = np.repeat(color[None], T, axis=0)
# draw tracks
if self.tracks_leave_trace != 0:
for t in range(query_frame + 1, T):
first_ind = (
max(0, t - self.tracks_leave_trace) if self.tracks_leave_trace >= 0 else 0
)
curr_tracks = tracks[first_ind : t + 1]
curr_colors = vector_colors[first_ind : t + 1]
if compensate_for_camera_motion:
diff = (
tracks[first_ind : t + 1, segm_mask <= 0]
- tracks[t : t + 1, segm_mask <= 0]
).mean(1)[:, None]
curr_tracks = curr_tracks - diff
curr_tracks = curr_tracks[:, segm_mask > 0]
curr_colors = curr_colors[:, segm_mask > 0]
res_video[t] = self._draw_pred_tracks(
res_video[t],
curr_tracks,
curr_colors,
)
if gt_tracks is not None:
res_video[t] = self._draw_gt_tracks(res_video[t], gt_tracks[first_ind : t + 1])
# draw points
for t in range(query_frame, T):
img = Image.fromarray(np.uint8(res_video[t]))
for i in range(N):
coord = (tracks[t, i, 0], tracks[t, i, 1])
visibile = True
if visibility is not None:
visibile = visibility[0, t, i]
if coord[0] != 0 and coord[1] != 0:
if not compensate_for_camera_motion or (
compensate_for_camera_motion and segm_mask[i] > 0
):
img = draw_circle(
img,
coord=coord,
radius=int(self.linewidth * 2),
color=vector_colors[t, i].astype(int),
visible=visibile,
)
res_video[t] = np.array(img)
# construct the final rgb sequence
if self.show_first_frame > 0:
res_video = [res_video[0]] * self.show_first_frame + res_video[1:]
return torch.from_numpy(np.stack(res_video)).permute(0, 3, 1, 2)[None].byte()
def _draw_pred_tracks(
self,
rgb: np.ndarray, # H x W x 3
tracks: np.ndarray, # T x 2
vector_colors: np.ndarray,
alpha: float = 0.5,
):
T, N, _ = tracks.shape
rgb = Image.fromarray(np.uint8(rgb))
for s in range(T - 1):
vector_color = vector_colors[s]
original = rgb.copy()
alpha = (s / T) ** 2
for i in range(N):
coord_y = (int(tracks[s, i, 0]), int(tracks[s, i, 1]))
coord_x = (int(tracks[s + 1, i, 0]), int(tracks[s + 1, i, 1]))
if coord_y[0] != 0 and coord_y[1] != 0:
rgb = draw_line(
rgb,
coord_y,
coord_x,
vector_color[i].astype(int),
self.linewidth,
)
if self.tracks_leave_trace > 0:
rgb = Image.fromarray(
np.uint8(add_weighted(np.array(rgb), alpha, np.array(original), 1 - alpha, 0))
)
rgb = np.array(rgb)
return rgb
def _draw_gt_tracks(
self,
rgb: np.ndarray, # H x W x 3,
gt_tracks: np.ndarray, # T x 2
):
T, N, _ = gt_tracks.shape
color = np.array((211, 0, 0))
rgb = Image.fromarray(np.uint8(rgb))
for t in range(T):
for i in range(N):
                gt_track = gt_tracks[t][i]
                # draw a red cross at the ground-truth point
                if gt_track[0] > 0 and gt_track[1] > 0:
                    length = self.linewidth * 3
                    coord_y = (int(gt_track[0]) + length, int(gt_track[1]) + length)
                    coord_x = (int(gt_track[0]) - length, int(gt_track[1]) - length)
                    rgb = draw_line(
                        rgb,
                        coord_y,
                        coord_x,
                        color,
                        self.linewidth,
                    )
                    coord_y = (int(gt_track[0]) - length, int(gt_track[1]) + length)
                    coord_x = (int(gt_track[0]) + length, int(gt_track[1]) - length)
rgb = draw_line(
rgb,
coord_y,
coord_x,
color,
self.linewidth,
)
rgb = np.array(rgb)
return rgb
# --------------------------------------------------------
# Magma - Multimodal AI Agent at Microsoft Research
# Copyright (c) 2025 Microsoft
# Licensed under The MIT License [see LICENSE for details]
# Jianwei Yang (jianwyan@microsoft.com)
# --------------------------------------------------------
from typing import Optional
import spaces
import gradio as gr
import numpy as np
import torch
from PIL import Image
import io
import re
import base64, os
from util.utils import check_ocr_box, get_yolo_model, get_caption_model_processor, get_som_labeled_img
from util.som import MarkHelper, plot_boxes_with_marks, plot_circles_with_marks
from util.process_utils import pred_2_point, extract_bbox, extract_mark_id
import torch
from PIL import Image
from huggingface_hub import snapshot_download
import torch
from transformers import AutoModelForCausalLM
from transformers import AutoProcessor
# Define repository and local directory
repo_id = "microsoft/OmniParser-v2.0" # HF repo
local_dir = "weights" # Target local directory
dtype = torch.bfloat16
DEVICE = torch.device('cuda')
som_generator = MarkHelper()
magma_som_prompt = "<image>\nIn this view I need to click a button to \"{}\"? Provide the coordinates and the mark index of the containing bounding box if applicable."
magma_qa_prompt = "<image>\n{} Answer the question briefly."
magma_model_id = "microsoft/Magma-8B"
magam_model = AutoModelForCausalLM.from_pretrained(magma_model_id, trust_remote_code=True, torch_dtype=dtype)
magma_processor = AutoProcessor.from_pretrained(magma_model_id, trust_remote_code=True)
magam_model.to(DEVICE)
# Download the entire repository
snapshot_download(repo_id=repo_id, local_dir=local_dir)
print(f"Repository downloaded to: {local_dir}")
yolo_model = get_yolo_model(model_path='weights/icon_detect/model.pt')
caption_model_processor = get_caption_model_processor(model_name="florence2", model_name_or_path="weights/icon_caption")
# caption_model_processor = get_caption_model_processor(model_name="blip2", model_name_or_path="weights/icon_caption_blip2")
MARKDOWN = """
<div align="center">
<h2>Magma: A Foundation Model for Multimodal AI Agents</h2>
\[[arXiv Paper](https://www.arxiv.org/pdf/2502.13130)\] &nbsp; \[[Project Page](https://microsoft.github.io/Magma/)\] &nbsp; \[[Github Repo](https://github.com/microsoft/Magma)\] &nbsp; \[[Hugging Face Model](https://huggingface.co/microsoft/Magma-8B)\] &nbsp;
This demo is powered by [Gradio](https://gradio.app/) and uses [OmniParserv2](https://github.com/microsoft/OmniParser) to generate [Set-of-Mark prompts](https://github.com/microsoft/SoM).
The demo supports three modes:
1. Empty text input: it falls back to an OmniParser demo.
2. Text input starting with "Q:": it runs a visual question answering demo.
3. Any other text input: it runs a UI navigation demo.
</div>
"""
DEVICE = torch.device('cuda')
@spaces.GPU
@torch.inference_mode()
def get_som_response(instruction, image_som):
prompt = magma_som_prompt.format(instruction)
if magam_model.config.mm_use_image_start_end:
qs = prompt.replace('<image>', '<image_start><image><image_end>')
else:
qs = prompt
convs = [{"role": "user", "content": qs}]
convs = [{"role": "system", "content": "You are agent that can see, talk and act."}] + convs
prompt = magma_processor.tokenizer.apply_chat_template(
convs,
tokenize=False,
add_generation_prompt=True
)
inputs = magma_processor(images=[image_som], texts=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
inputs = inputs.to(dtype).to(DEVICE)
magam_model.generation_config.pad_token_id = magma_processor.tokenizer.pad_token_id
with torch.inference_mode():
output_ids = magam_model.generate(
**inputs,
temperature=0.0,
do_sample=False,
num_beams=1,
max_new_tokens=128,
use_cache=True
)
prompt_decoded = magma_processor.batch_decode(inputs['input_ids'], skip_special_tokens=True)[0]
response = magma_processor.batch_decode(output_ids, skip_special_tokens=True)[0]
response = response.replace(prompt_decoded, '').strip()
return response
@spaces.GPU
@torch.inference_mode()
def get_qa_response(instruction, image):
prompt = magma_qa_prompt.format(instruction)
if magam_model.config.mm_use_image_start_end:
qs = prompt.replace('<image>', '<image_start><image><image_end>')
else:
qs = prompt
convs = [{"role": "user", "content": qs}]
convs = [{"role": "system", "content": "You are agent that can see, talk and act."}] + convs
prompt = magma_processor.tokenizer.apply_chat_template(
convs,
tokenize=False,
add_generation_prompt=True
)
inputs = magma_processor(images=[image], texts=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
inputs = inputs.to(dtype).to(DEVICE)
magam_model.generation_config.pad_token_id = magma_processor.tokenizer.pad_token_id
with torch.inference_mode():
output_ids = magam_model.generate(
**inputs,
temperature=0.0,
do_sample=False,
num_beams=1,
max_new_tokens=128,
use_cache=True
)
prompt_decoded = magma_processor.batch_decode(inputs['input_ids'], skip_special_tokens=True)[0]
response = magma_processor.batch_decode(output_ids, skip_special_tokens=True)[0]
response = response.replace(prompt_decoded, '').strip()
return response
@spaces.GPU
@torch.inference_mode()
# @torch.autocast(device_type="cuda", dtype=torch.bfloat16)
def process(
image_input,
box_threshold,
iou_threshold,
use_paddleocr,
imgsz,
instruction,
) -> Optional[Image.Image]:
# image_save_path = 'imgs/saved_image_demo.png'
# image_input.save(image_save_path)
# image = Image.open(image_save_path)
box_overlay_ratio = image_input.size[0] / 3200
draw_bbox_config = {
'text_scale': 0.8 * box_overlay_ratio,
'text_thickness': max(int(2 * box_overlay_ratio), 1),
'text_padding': max(int(3 * box_overlay_ratio), 1),
'thickness': max(int(3 * box_overlay_ratio), 1),
}
ocr_bbox_rslt, is_goal_filtered = check_ocr_box(image_input, display_img = False, output_bb_format='xyxy', goal_filtering=None, easyocr_args={'paragraph': False, 'text_threshold':0.9}, use_paddleocr=use_paddleocr)
text, ocr_bbox = ocr_bbox_rslt
dino_labled_img, label_coordinates, parsed_content_list = get_som_labeled_img(image_input, yolo_model, BOX_TRESHOLD = box_threshold, output_coord_in_ratio=False, ocr_bbox=ocr_bbox,draw_bbox_config=draw_bbox_config, caption_model_processor=caption_model_processor, ocr_text=text,iou_threshold=iou_threshold, imgsz=imgsz,)
parsed_content_list = '\n'.join([f'icon {i}: ' + str(v) for i,v in enumerate(parsed_content_list)])
if len(instruction) == 0:
print('finish processing')
image = Image.open(io.BytesIO(base64.b64decode(dino_labled_img)))
return image, str(parsed_content_list)
elif instruction.startswith('Q:'):
response = get_qa_response(instruction, image_input)
return image_input, response
# parsed_content_list = str(parsed_content_list)
# convert xywh to yxhw
label_coordinates_yxhw = {}
for key, val in label_coordinates.items():
if val[2] < 0 or val[3] < 0:
continue
label_coordinates_yxhw[key] = [val[1], val[0], val[3], val[2]]
image_som = plot_boxes_with_marks(image_input.copy(), [val for key, val in label_coordinates_yxhw.items()], som_generator, edgecolor=(255,0,0), fn_save=None, normalized_to_pixel=False)
# convert xywh to xyxy
for key, val in label_coordinates.items():
label_coordinates[key] = [val[0], val[1], val[0] + val[2], val[1] + val[3]]
# normalize label_coordinates
for key, val in label_coordinates.items():
label_coordinates[key] = [val[0] / image_input.size[0], val[1] / image_input.size[1], val[2] / image_input.size[0], val[3] / image_input.size[1]]
magma_response = get_som_response(instruction, image_som)
print("magma repsonse: ", magma_response)
# map magma_response into the mark id
mark_id = extract_mark_id(magma_response)
if mark_id is not None:
if str(mark_id) in label_coordinates:
bbox_for_mark = label_coordinates[str(mark_id)]
else:
bbox_for_mark = None
else:
bbox_for_mark = None
if bbox_for_mark:
# draw bbox_for_mark on the image
image_som = plot_boxes_with_marks(
image_input,
[label_coordinates_yxhw[str(mark_id)]],
som_generator,
edgecolor=(255,127,111),
alpha=30,
fn_save=None,
normalized_to_pixel=False,
add_mark=False
)
else:
try:
if 'box' in magma_response:
pred_bbox = extract_bbox(magma_response)
click_point = [(pred_bbox[0][0] + pred_bbox[1][0]) / 2, (pred_bbox[0][1] + pred_bbox[1][1]) / 2]
click_point = [item / 1000 for item in click_point]
else:
click_point = pred_2_point(magma_response)
# de-normalize click_point (width, height)
click_point = [click_point[0] * image_input.size[0], click_point[1] * image_input.size[1]]
image_som = plot_circles_with_marks(
image_input,
[click_point],
som_generator,
edgecolor=(255,127,111),
linewidth=3,
fn_save=None,
normalized_to_pixel=False,
add_mark=False
)
except:
image_som = image_input
return image_som, str(parsed_content_list)
with gr.Blocks() as demo:
gr.Markdown(MARKDOWN)
with gr.Row():
with gr.Column():
image_input_component = gr.Image(
type='pil', label='Upload image')
# set the threshold for removing the bounding boxes with low confidence, default is 0.05
with gr.Accordion("Parameters", open=False) as parameter_row:
box_threshold_component = gr.Slider(
label='Box Threshold', minimum=0.01, maximum=1.0, step=0.01, value=0.05)
# set the threshold for removing the bounding boxes with large overlap, default is 0.1
iou_threshold_component = gr.Slider(
label='IOU Threshold', minimum=0.01, maximum=1.0, step=0.01, value=0.1)
use_paddleocr_component = gr.Checkbox(
label='Use PaddleOCR', value=True)
imgsz_component = gr.Slider(
label='Icon Detect Image Size', minimum=640, maximum=1920, step=32, value=640)
# text box
text_input_component = gr.Textbox(label='Text Input', placeholder='Text Input')
submit_button_component = gr.Button(
value='Submit', variant='primary')
with gr.Column():
image_output_component = gr.Image(type='pil', label='Image Output')
text_output_component = gr.Textbox(label='Parsed screen elements', placeholder='Text Output')
submit_button_component.click(
fn=process,
inputs=[
image_input_component,
box_threshold_component,
iou_threshold_component,
use_paddleocr_component,
imgsz_component,
text_input_component
],
outputs=[image_output_component, text_output_component]
)
# demo.launch(debug=False, show_error=True, share=True)
# demo.launch(share=True, server_port=7861, server_name='0.0.0.0')
demo.queue().launch(share=False)