Commit 0063a668 authored by chenzk

v1.0
# Microsoft Open Source Code of Conduct
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
Resources:
- [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/)
- [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)
- Contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with questions or concerns
MIT License
Copyright (c) Microsoft Corporation.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# Magma
A new era for embodied intelligence! Magma arrives as a strong foundation model for vision-language-action (VLA): an all-rounder for UI navigation and robot manipulation.
## Paper
`Magma: A Foundation Model for Multimodal AI Agents`
- https://arxiv.org/pdf/2502.13130
## Model Architecture
A vision encoder V encodes each input frame into multiple tokens; the tokens from all frames are concatenated into one sequence and fed, together with the language tokens encoding the task description, into a decoder-only language model (LLM).
<div align=center>
<img src="./doc/Magma.png"/>
</div>
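The layout above can be summarized with a minimal, hypothetical sketch (module names and tensor shapes are assumptions for illustration; the actual implementation lives in `magma/modeling_magma.py`):
```
import torch
import torch.nn as nn

class ToyMagma(nn.Module):
    """Rough sketch only: vision encoder -> projector -> decoder-only LLM."""
    def __init__(self, vision_encoder: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # encodes each frame into N visual features
        self.projector = projector            # maps visual features into the LLM embedding space
        self.llm = llm                        # decoder-only language model

    def forward(self, frames: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W); text_embeds: (B, L, D) task-description token embeddings
        b, t = frames.shape[:2]
        feats = self.vision_encoder(frames.flatten(0, 1))      # (B*T, N, D_v)
        vis_tokens = self.projector(feats)                     # (B*T, N, D)
        vis_tokens = vis_tokens.reshape(b, t * vis_tokens.shape[1], -1)  # (B, T*N, D)
        # visual tokens from all frames are concatenated with the language tokens and decoded
        return self.llm(torch.cat([vis_tokens, text_embeds], dim=1))
```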
## Algorithm
Using Set-of-Mark (SoM) and Trace-of-Mark (ToM) labeling, vision-language data is turned into actionable supervision, which markedly improves spatial intelligence and task generalization; the resulting model can understand and execute multimodal tasks in both digital and physical environments.
The authors propose a simple and effective recipe that combines Set-of-Mark (SoM) and Trace-of-Mark (ToM) to extend the model to spatial prediction tasks (e.g., clickable buttons) and to the temporal dimension.
<div align=center>
<img src="./doc/algorithm.png"/>
</div>
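For intuition, Set-of-Mark prompting amounts to overlaying numbered marks on candidate regions and asking the model to answer with a mark index. Below is a minimal, illustrative helper (not the project's own code; the UI-agent demo in this repository builds its marks with OmniParser and `util.som`):
```
from PIL import Image, ImageDraw

def overlay_marks(image, boxes):
    """Draw numbered Set-of-Mark labels on candidate regions (illustration only)."""
    img = image.copy()
    draw = ImageDraw.Draw(img)
    for idx, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        draw.rectangle([x0, y0, x1, y1], outline=(255, 0, 0), width=3)
        draw.text((x0 + 4, y0 + 4), str(idx), fill=(255, 0, 0))
    return img

# marked = overlay_marks(screenshot, [(10, 10, 120, 60), (200, 40, 320, 90)])
# The model is then prompted to reply with the index of the mark to act on;
# Trace-of-Mark extends this by predicting the future positions of the marks over time.
```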
## Environment Setup
```
mv Magma_pytorch Magma # drop the framework-name suffix from the directory
```
### Docker (Option 1)
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.4.1-ubuntu22.04-dtk25.04-py3.10-fixpy
# Replace <your IMAGE ID> with the ID of the image pulled above; for this image it is 6063b673703a
docker run -it --shm-size=64G -v $PWD/Magma:/home/Magma -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video --name magma <your IMAGE ID> bash
cd /home/Magma
pip install -e . -i https://mirrors.aliyun.com/pypi/simple
pip install https://download.sourcefind.cn:65024/directlink/4/tensorflow/DAS1.5/tensorflow-2.13.1+das.opt1.dtk2504-cp310-cp310-linux_x86_64.whl -i https://mirrors.aliyun.com/pypi/simple # tensorflow=2.13.1
```
### Dockerfile (Option 2)
```
cd /home/Magma/docker
docker build --no-cache -t magma:latest .
docker run --shm-size=64G --name magma -v /opt/hyhal:/opt/hyhal:ro --privileged=true --device=/dev/kfd --device=/dev/dri/ --group-add video -v $PWD/../../Magma:/home/Magma -it magma bash
# If installing the environment through the Dockerfile takes too long, comment out the pip installs inside it and install the Python packages after the container starts: pip install -r requirements.txt.
pip install -e . -i https://mirrors.aliyun.com/pypi/simple
pip install https://download.sourcefind.cn:65024/directlink/4/tensorflow/DAS1.5/tensorflow-2.13.1+das.opt1.dtk2504-cp310-cp310-linux_x86_64.whl -i https://mirrors.aliyun.com/pypi/simple # tensorflow=2.13.1
```
### Anaconda (Option 3)
1. The DCU-specific deep-learning libraries required by this project can be downloaded from the developer community (光合开发者社区):
- https://developer.sourcefind.cn/tool/
```
DTK driver: dtk2504
python: 3.10
torch: 2.4.1
torchvision: 0.19.1
triton: 3.0.0
vllm: 0.6.2
flash-attn: 2.6.1
deepspeed: 0.14.2
apex: 1.4.0
transformers: 4.51.3
tensorflow: 2.13.1
```
`Tips: The versions of the DTK driver, python, torch, and other DCU-related tools listed above must match each other exactly.`
2. Install the remaining, non-special libraries according to requirements.txt:
```
cd /home/Magma
pip install -e . -i https://mirrors.aliyun.com/pypi/simple
```
## Dataset
`None`
## Training
`None`
## Inference
Pretrained weights directory structure:
```
/home/Magma
└── microsoft/Magma-8B
```
Set the Hugging Face download mirror:
```
export HF_ENDPOINT=https://hf-mirror.com
```
When the inference command is run, the project automatically downloads the model laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg and caches the encoded result once the download completes; the upstream code has no option for loading this model from local weights.
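If the automatic download is slow, the CLIP backbone can be prefetched into the Hugging Face cache through the mirror. A minimal sketch (`snapshot_download` is the same helper used by the UI-agent demo later in this repository):
```
import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"  # must be set before importing huggingface_hub

from huggingface_hub import snapshot_download

# Prefetch the vision backbone that the project downloads automatically at inference time.
snapshot_download(repo_id="laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg")
```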
### Single-Node Multi-GPU
```
cd /home/Magma
python infer_transformers.py
```
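The contents of `infer_transformers.py` are not reproduced here; a minimal transformers-based inference sketch that follows the same calling pattern as the demos later in this repository would look roughly like this:
```
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Magma-8B"
dtype = torch.bfloat16
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, torch_dtype=dtype).to("cuda")
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("./assets/images/magma_logo.jpg").convert("RGB")
convs = [
    {"role": "system", "content": "You are agent that can see, talk and act."},
    {"role": "user", "content": "<image_start><image><image_end>\nWhat is the letter on the robot?"},
]
prompt = processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)
inputs = processor(images=[image], texts=prompt, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].unsqueeze(0)
inputs["image_sizes"] = inputs["image_sizes"].unsqueeze(0)
inputs = inputs.to("cuda").to(dtype)

with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False, use_cache=True)
output_ids = output_ids[:, inputs["input_ids"].shape[-1]:]
print(processor.decode(output_ids[0], skip_special_tokens=True).strip())
```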
For more details, see [`README_origin`](./README_origin.md) from the upstream project.
## Results
`Input:`
```
prompt: "What is the letter on the robot?"
image: "./assets/images/magma_logo.jpg"
```
`Output:`
```
response: The letter on the robot is "M".
```
Official demo example:
<div align=center>
<img src="./doc/magma_mushroom.gif"/>
</div>
### Accuracy
Inference accuracy on DCU matches that on GPU; inference framework: PyTorch.
## Application Scenarios
### Algorithm Category
`Embodied AI`
### Key Application Industries
`Manufacturing, home, healthcare, energy, education`
## Pretrained Weights
Download from Hugging Face: [microsoft/Magma-8B](https://huggingface.co/microsoft/Magma-8B)
## Source Repository and Issue Reporting
- http://developer.sourcefind.cn/codes/modelzoo/InfiniteYou_pytorch.git
## References
- https://github.com/microsoft/Magma.git
<!-- BEGIN MICROSOFT SECURITY.MD V0.0.9 BLOCK -->
## Security
Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet) and [Xamarin](https://github.com/xamarin).
If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://aka.ms/security.md/definition), please report it to us as described below.
## Reporting Security Issues
**Please do not report security vulnerabilities through public GitHub issues.**
Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://aka.ms/security.md/msrc/create-report).
If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://aka.ms/security.md/msrc/pgp).
You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc).
Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:
* Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
* Full paths of source file(s) related to the manifestation of the issue
* The location of the affected source code (tag/branch/commit or direct URL)
* Any special configuration required to reproduce the issue
* Step-by-step instructions to reproduce the issue
* Proof-of-concept or exploit code (if possible)
* Impact of the issue, including how an attacker might exploit the issue
This information will help us triage your report more quickly.
If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://aka.ms/security.md/msrc/bounty) page for more details about our active programs.
## Preferred Languages
We prefer all communications to be in English.
## Policy
Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://aka.ms/security.md/cvd).
<!-- END MICROSOFT SECURITY.MD BLOCK -->
# TODO: The maintainer of this repo has not yet edited this file
**REPO OWNER**: Do you want Customer Service & Support (CSS) support for this product/project?
- **No CSS support:** Fill out this template with information about how to file issues and get help.
- **Yes CSS support:** Fill out an intake form at [aka.ms/onboardsupport](https://aka.ms/onboardsupport). CSS will work with/help you to determine next steps.
- **Not sure?** Fill out an intake as though the answer were "Yes". CSS will help you decide.
*Then remove this first heading from this SUPPORT.MD file before publishing your repo.*
# Support
## How to file issues and get help
This project uses GitHub Issues to track bugs and feature requests. Please search the existing
issues before filing new issues to avoid duplicates. For new issues, file your bug or
feature request as a new Issue.
For help and questions about using this project, please **REPO MAINTAINER: INSERT INSTRUCTIONS HERE
FOR HOW TO ENGAGE REPO OWNERS OR COMMUNITY FOR HELP. COULD BE A STACK OVERFLOW TAG OR OTHER
CHANNEL. WHERE WILL YOU HELP PEOPLE?**.
## Microsoft Support Policy
Support for this **PROJECT or PRODUCT** is limited to the resources listed above.
# --------------------------------------------------------
# Magma - Multimodal AI Agent at Microsoft Research
# Copyright (c) 2025 Microsoft
# Licensed under The MIT License [see LICENSE for details]
# Jianwei Yang (jianwyan@microsoft.com)
# --------------------------------------------------------
import pygame
import numpy as np
import gradio as gr
import time
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor
import re
import random
pygame.mixer.quit() # Disable sound
# Constants
WIDTH, HEIGHT = 800, 800
GRID_SIZE = 80
WHITE = (255, 255, 255)
GREEN = (34, 139, 34) # Forest green - more like an apple
RED = (200, 50, 50)
BLACK = (0, 0, 0)
GRAY = (128, 128, 128)
YELLOW = (218, 165, 32) # Golden yellow color
# Directions
UP = (0, -1)
DOWN = (0, 1)
LEFT = (-1, 0)
RIGHT = (1, 0)
STATIC = (0, 0)
ACTIONS = ["up", "down", "left", "right", "static"]
# Load AI Model
magma_model_id = "microsoft/Magma-8B"
dtype = torch.bfloat16
magma_model = AutoModelForCausalLM.from_pretrained(magma_model_id, trust_remote_code=True, torch_dtype=dtype)
magma_processor = AutoProcessor.from_pretrained(magma_model_id, trust_remote_code=True, torch_dtype=dtype)
magma_model.to("cuda")
# Load magma image
magma_img = pygame.image.load("./assets/images/magma_game.png")
magma_img = pygame.transform.scale(magma_img, (GRID_SIZE, GRID_SIZE))
class MagmaFindGPU:
def __init__(self):
self.reset()
def reset(self):
self.snake = [(5, 5)]
self.direction = RIGHT
self.score = 0
self.game_over = False
self.place_target()
def place_target(self):
while True:
target_x = np.random.randint(1, WIDTH // GRID_SIZE - 1)
target_y = np.random.randint(1, HEIGHT // GRID_SIZE - 1)
if (target_x, target_y) not in self.snake:
self.target = (target_x, target_y)
break
def step(self, action):
if action == "up":
self.direction = UP
elif action == "down":
self.direction = DOWN
elif action == "left":
self.direction = LEFT
elif action == "right":
self.direction = RIGHT
elif action == "static":
self.direction = STATIC
if self.game_over:
return self.render(), self.score
new_head = (self.snake[0][0] + self.direction[0], self.snake[0][1] + self.direction[1])
if new_head[0] < 0 or new_head[1] < 0 or new_head[0] >= WIDTH // GRID_SIZE or new_head[1] >= HEIGHT // GRID_SIZE:
self.game_over = True
return self.render(), self.score
self.snake = [new_head] # Keep only the head (single block snake)
        # Check whether the target is adjacent to the head (covered by one of the four surrounding squares)
        head_x, head_y = self.snake[0]
        neighbors = {(head_x, head_y - 1), (head_x, head_y + 1), (head_x - 1, head_y), (head_x + 1, head_y)}
        if self.target in neighbors:
self.score += 1
self.place_target()
return self.render(), self.score
def render(self):
pygame.init()
surface = pygame.Surface((WIDTH, HEIGHT))
surface.fill(BLACK)
head_x, head_y = self.snake[0]
surface.blit(magma_img, (head_x * GRID_SIZE, head_y * GRID_SIZE))
# pygame.draw.rect(surface, RED, (self.snake[0][0] * GRID_SIZE, self.snake[0][1] * GRID_SIZE, GRID_SIZE, GRID_SIZE))
pygame.draw.rect(surface, GREEN, (self.target[0] * GRID_SIZE, self.target[1] * GRID_SIZE, GRID_SIZE, GRID_SIZE))
# Draw four surrounding squares with labels
head_x, head_y = self.snake[0]
neighbors = [(head_x, head_y - 1), (head_x, head_y + 1), (head_x - 1, head_y), (head_x + 1, head_y)]
labels = ["1", "2", "3", "4"]
font = pygame.font.Font(None, 48)
# clone surface
surface_nomark = surface.copy()
for i, (nx, ny) in enumerate(neighbors):
if 0 <= nx < WIDTH // GRID_SIZE and 0 <= ny < HEIGHT // GRID_SIZE:
pygame.draw.rect(surface, RED, (nx * GRID_SIZE, ny * GRID_SIZE, GRID_SIZE, GRID_SIZE), GRID_SIZE)
# pygame.draw.rect(surface_nomark, RED, (nx * GRID_SIZE, ny * GRID_SIZE, GRID_SIZE, GRID_SIZE), GRID_SIZE)
text = font.render(labels[i], True, WHITE)
text_rect = text.get_rect(center=(nx * GRID_SIZE + GRID_SIZE // 2, ny * GRID_SIZE + GRID_SIZE // 2))
surface.blit(text, text_rect)
return np.array(pygame.surfarray.array3d(surface_nomark)).swapaxes(0, 1), np.array(pygame.surfarray.array3d(surface)).swapaxes(0, 1)
def get_state(self):
return self.render()
game = MagmaFindGPU()
def play_game():
state, state_som = game.get_state()
pil_img = Image.fromarray(state_som)
convs = [
{"role": "system", "content": "You are an agent that can see, talk, and act."},
{"role": "user", "content": "<image_start><image><image_end>\nWhich mark is closer to green block? Answer with a single number."},
]
prompt = magma_processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)
inputs = magma_processor(images=[pil_img], texts=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
inputs = inputs.to("cuda").to(dtype)
generation_args = {
"max_new_tokens": 10,
"temperature": 0,
"do_sample": False,
"use_cache": True,
"num_beams": 1,
}
with torch.inference_mode():
generate_ids = magma_model.generate(**inputs, **generation_args)
generate_ids = generate_ids[:, inputs["input_ids"].shape[-1] :]
action = magma_processor.decode(generate_ids[0], skip_special_tokens=True).strip()
    # extract the mark id from the action string using a regex
match = re.search(r'\d+', action)
if match:
action = match.group(0)
if action.isdigit() and 1 <= int(action) <= 4:
# epsilon sampling
if random.random() < 0.1:
action = random.choice(ACTIONS[:-1])
else:
action = ACTIONS[int(action) - 1]
else:
# random choose one from the pool
action = random.choice(ACTIONS[:-1])
else:
action = random.choice(ACTIONS[:-1])
img, score = game.step(action)
img = img[0]
return img, f"Score: {score}"
def reset_game():
game.reset()
return game.render()[0], "Score: 0"
MARKDOWN = """
<div align="center">
<h2>Magma: A Foundation Model for Multimodal AI Agents</h2>
Game: Magma finds the apple by moving up, down, left and right.
\[[arXiv Paper](https://www.arxiv.org/pdf/2502.13130)\] &nbsp; \[[Project Page](https://microsoft.github.io/Magma/)\] &nbsp; \[[Github Repo](https://github.com/microsoft/Magma)\] &nbsp; \[[Hugging Face Model](https://huggingface.co/microsoft/Magma-8B)\] &nbsp;
This demo is powered by [Gradio](https://gradio.app/).
</div>
"""
with gr.Blocks() as interface:
gr.Markdown(MARKDOWN)
with gr.Row():
image_output = gr.Image(label="Game Screen")
score_output = gr.Text(label="Score")
with gr.Row():
start_btn = gr.Button("Start/Reset Game")
interface.load(fn=play_game, every=1, inputs=[], outputs=[image_output, score_output])
start_btn.click(fn=reset_game, inputs=[], outputs=[image_output, score_output])
interface.launch()
import gradio as gr
import numpy as np
import gymnasium as gym
from PIL import Image
import matplotlib.pyplot as plt
# Initialize FrozenLake environment
env = gym.make("FrozenLake-v1", render_mode="rgb_array")
state, _ = env.reset()
action_mapping = {
"Left": 3,
"Down": 1,
"Right": 2,
"Up": 0,
}
def render_env():
"""Render the environment and return as an image."""
frame = env.render()
image = Image.fromarray(frame)
return image
def step(action):
"""Take a step in the environment."""
global state
action_index = action_mapping[action]
state, reward, done, _, _ = env.step(action_index)
image = render_env()
message = f"State: {state}, Reward: {reward}, Done: {done}"
if done:
env.reset()
message += " - Resetting environment"
return image, message
# Create Gradio interface
with gr.Blocks() as demo:
gr.Markdown("# Play Frozen Lake!")
image_display = gr.Image()
action_buttons = gr.Radio(choices=list(action_mapping.keys()), label="Select Action")
submit_button = gr.Button("Step")
output_text = gr.Textbox(label="Game State")
submit_button.click(fn=step, inputs=action_buttons, outputs=[image_display, output_text])
    # Show the initial state when the app loads
    demo.load(fn=render_env, inputs=[], outputs=image_display)
demo.launch()
# Magma: Multimodal Agentic Models
Evaluating Magma on [LIBERO](https://github.com/Lifelong-Robot-Learning/LIBERO).
#### LIBERO Setup
Clone and install LIBERO and other requirements:
```
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
pip install -r agents/libero/requirements.txt
cd LIBERO
pip install -e .
```
#### Quick Evaluation
The following code demonstrates how to run Magma on a single LIBERO task and evaluate its performance:
```
import numpy as np
from libero.libero import benchmark
from libero_env_utils import get_libero_env, get_libero_dummy_action, get_libero_obs, get_max_steps, save_rollout_video
from libero_magma_utils import get_magma_model, get_magma_prompt, get_magma_action
# Set up benchmark and task
benchmark_dict = benchmark.get_benchmark_dict()
task_suite_name = "libero_goal" # or libero_spatial, libero_object, etc.
task_suite = benchmark_dict[task_suite_name]()
task_id = 1
task = task_suite.get_task(task_id)
# Initialize environment
env, task_description = get_libero_env(task, resolution=256)
print(f"Task {task_id} description: {task_description}")
# Load MAGMA model
model_name = "microsoft/magma-8b-libero-goal" # or your local path
processor, magma = get_magma_model(model_name)
prompt = get_magma_prompt(task_description, processor, magma.config)
# Run evaluation
num_steps_wait = 10
max_steps = get_max_steps(task_suite_name)
env.seed(0)
obs = env.reset()
init_states = task_suite.get_task_init_states(task_id)
obs = env.set_init_state(init_states[0])
step = 0
replay_images = []
while step < max_steps + num_steps_wait:
if step < num_steps_wait:
obs, _, done, _ = env.step(get_libero_dummy_action())
step += 1
continue
img = get_libero_obs(obs, resize_size=256)
replay_images.append(img)
action = get_magma_action(magma, processor, img, prompt, task_suite_name)
obs, _, done, _ = env.step(action.tolist())
step += 1
env.close()
save_rollout_video(replay_images, success=done, task_description=task_description)
```
**Notes:** The script above runs only a single episode of one task and saves a video of MAGMA's trajectory. For a comprehensive evaluation of each task suite, please use `eval_magma_libero.py`:
```
python eval_magma_libero.py \
    --model_name microsoft/Magma-8B-libero-object \
    --task_suite_name libero_object

python eval_magma_libero.py \
    --model_name microsoft/Magma-8B-libero-spatial \
    --task_suite_name libero_spatial

python eval_magma_libero.py \
    --model_name microsoft/Magma-8B-libero-goal \
    --task_suite_name libero_goal
```
import os
import numpy as np
import draccus
from dataclasses import dataclass
from typing import Optional, Tuple
import tqdm
from libero.libero import benchmark
from libero_env_utils import (
get_libero_env,
get_libero_dummy_action,
get_libero_obs,
get_max_steps,
set_seed_everywhere
)
from libero_magma_utils import (
get_magma_model,
get_magma_prompt,
get_magma_action
)
@dataclass
class LiberoConfig:
# Model parameters
model_name: str = "microsoft/magma-8b-libero-goal" # model_name
task_suite_name: str = "libero_goal" # Task suite name
# Evaluation parameters
num_trials_per_task: int = 50 # Number of rollouts per task
resolution: int = 256 # Image resolution
num_steps_wait: int = 10 # Steps to wait for stabilization
seed: int = 0 # Random seed
save_dir: str = "./libero_eval_log" # Directory for saving logs
@draccus.wrap()
def eval_libero(cfg: LiberoConfig) -> Tuple[int, int]:
"""
Evaluate Libero environment with given configuration.
Args:
cfg: LiberoConfig object containing evaluation parameters
Returns:
Tuple[int, int]: Total episodes and total successful episodes
"""
# Setup logging
os.makedirs(cfg.save_dir, exist_ok=True)
log_filepath = f"{cfg.save_dir}/magma_eval-{cfg.task_suite_name}.log"
log_file = open(log_filepath, "w")
print(f"Logging to local log file: {log_filepath}")
# Write initial log
log_file.write(f"Task suite: {cfg.task_suite_name}\n")
print(f"Task suite: {cfg.task_suite_name}")
# Get benchmark and task suite
benchmark_dict = benchmark.get_benchmark_dict()
task_suite = benchmark_dict[cfg.task_suite_name]()
num_tasks_in_suite = task_suite.n_tasks
# Initialize counters
total_episodes, total_successes = 0, 0
set_seed_everywhere(cfg.seed)
# Load model
processor, magma = get_magma_model(cfg.model_name)
# Iterate through all tasks
for task_id in tqdm.tqdm(range(num_tasks_in_suite)):
# Get task
task = task_suite.get_task(task_id)
task_name = task.name
max_steps = get_max_steps(cfg.task_suite_name)
# Get default LIBERO initial states
initial_states = task_suite.get_task_init_states(task_id)
# Initialize LIBERO environment and task description
env, task_description = get_libero_env(task, resolution=cfg.resolution)
print(f"[info] Evaluating task {task_id} from suite {cfg.task_suite_name}, "
f"the language instruction is {task_description}.")
log_file.write(f"Task {task_id}: {task_description}\n")
log_file.flush()
# Get prompt for current task
prompt = get_magma_prompt(task_description, processor, magma.config)
# Initialize task-specific counters
task_episodes, task_successes = 0, 0
# Run trials for current task
for trial in range(cfg.num_trials_per_task):
env.reset()
obs = env.set_init_state(initial_states[trial])
step = 0
while step < max_steps + cfg.num_steps_wait:
if step < cfg.num_steps_wait:
obs, reward, done, info = env.step(get_libero_dummy_action())
step += 1
continue
img = get_libero_obs(obs, resize_size=cfg.resolution)
action = get_magma_action(magma, processor, img, prompt, cfg.task_suite_name)
obs, reward, done, info = env.step(action.tolist())
step += 1
if done:
task_successes += 1
break
task_episodes += 1
# Update total counters
total_episodes += task_episodes
total_successes += task_successes
# Log task success rate
task_success_rate = float(task_successes) / float(task_episodes)
print(f"Current task ({task_name}) success rate: {task_success_rate}")
log_file.write(f"Current task ({task_name}) success rate: {task_success_rate}\n")
log_file.flush()
# Log final suite success rate
suite_success_rate = float(total_successes) / float(total_episodes)
print(f"Task suite success rate: {suite_success_rate}")
log_file.write(f"\nTask suite {cfg.task_suite_name} success rate: {suite_success_rate}\n")
log_file.flush()
env.close()
log_file.close()
return total_episodes, total_successes
if __name__ == "__main__":
eval_libero()
"""Utils for evaluating policies in LIBERO simulation environments."""
import math
import os
import torch
import random
from PIL import Image
import imageio
import numpy as np
import tensorflow as tf
from libero.libero import get_libero_path
from libero.libero.envs import OffScreenRenderEnv
def resize_image(img, resize_size):
"""
Takes numpy array corresponding to a single image and returns resized image as numpy array.
"""
assert isinstance(resize_size, tuple)
# Resize to image size expected by model
img = tf.image.encode_jpeg(img) # Encode as JPEG, as done in RLDS dataset builder
img = tf.io.decode_image(img, expand_animations=False, dtype=tf.uint8) # Immediately decode back
img = tf.image.resize(img, resize_size, method="lanczos3", antialias=True)
img = tf.cast(tf.clip_by_value(tf.round(img), 0, 255), tf.uint8)
img = img.numpy()
return img
def get_libero_env(task, resolution=256):
"""Initializes and returns the LIBERO environment, along with the task description."""
task_description = task.language
task_bddl_file = os.path.join(get_libero_path("bddl_files"), task.problem_folder, task.bddl_file)
env_args = {"bddl_file_name": task_bddl_file, "camera_heights": resolution, "camera_widths": resolution}
env = OffScreenRenderEnv(**env_args)
env.seed(0) # IMPORTANT: seed seems to affect object positions even when using fixed initial state
return env, task_description
def get_libero_dummy_action():
"""Get dummy/no-op action, used to roll out the simulation while the robot does nothing."""
return [0, 0, 0, 0, 0, 0, -1]
def get_libero_obs(obs, resize_size):
"""Extracts image from observations and preprocesses it."""
assert isinstance(resize_size, int) or isinstance(resize_size, tuple)
if isinstance(resize_size, int):
resize_size = (resize_size, resize_size)
img = obs["agentview_image"]
img = img[::-1, ::-1] # IMPORTANT: rotate 180 degrees to match train preprocessing
    # resize to the resolution expected by the model
    image = resize_image(img, resize_size)
    return image
def get_max_steps(task_suite_name):
if task_suite_name == "libero_spatial":
max_steps = 220
elif task_suite_name == "libero_object":
max_steps = 280
elif task_suite_name == "libero_goal":
max_steps = 300
elif task_suite_name == "libero_10":
max_steps = 520
else:
max_steps = 400
return max_steps
def quat2axisangle(quat):
"""
Copied from robosuite: https://github.com/ARISE-Initiative/robosuite/blob/eafb81f54ffc104f905ee48a16bb15f059176ad3/robosuite/utils/transform_utils.py#L490C1-L512C55
Converts quaternion to axis-angle format.
Returns a unit vector direction scaled by its angle in radians.
Args:
quat (np.array): (x,y,z,w) vec4 float angles
Returns:
np.array: (ax,ay,az) axis-angle exponential coordinates
"""
# clip quaternion
if quat[3] > 1.0:
quat[3] = 1.0
elif quat[3] < -1.0:
quat[3] = -1.0
den = np.sqrt(1.0 - quat[3] * quat[3])
if math.isclose(den, 0.0):
# This is (close to) a zero degree rotation, immediately return
return np.zeros(3)
return (quat[:3] * 2.0 * math.acos(quat[3])) / den
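# Quick sanity check for quat2axisangle (illustrative): a 90-degree rotation about z,
# given as an (x, y, z, w) quaternion, maps to roughly [0, 0, pi/2]:
#   quat2axisangle(np.array([0.0, 0.0, math.sin(math.pi / 4), math.cos(math.pi / 4)]))
#   -> array([0., 0., 1.5708])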
def save_rollout_video(replay_images, success, task_description):
"""Saves a video replay of a rollout in libero."""
save_dir = f"./libero_videos"
os.makedirs(save_dir, exist_ok=True)
processed_task_description = task_description.lower().replace(" ", "_").replace("\n", "_").replace(".", "_")[:50]
video_path = f"{save_dir}/quick_eval-success={success}--task={processed_task_description}.mp4"
video_writer = imageio.get_writer(video_path, fps=30)
for img in replay_images:
video_writer.append_data(img)
video_writer.close()
print(f"Saved libero video at path {video_path}")
return video_path
def set_seed_everywhere(seed: int):
"""Sets the random seed for Python, NumPy, and PyTorch functions."""
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
np.random.seed(seed)
random.seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
os.environ["PYTHONHASHSEED"] = str(seed)
import os
import json
import torch
import numpy as np
from magma.image_processing_magma import MagmaImageProcessor
from magma.processing_magma import MagmaProcessor
from magma.modeling_magma import MagmaForConditionalGeneration
def get_magma_model(model_name):
processor = MagmaProcessor.from_pretrained(model_name, trust_remote_code=True)
magma = MagmaForConditionalGeneration.from_pretrained(model_name,
device_map="cuda",
low_cpu_mem_usage=True,
attn_implementation="flash_attention_2",
torch_dtype=torch.bfloat16,
trust_remote_code=True,
use_cache=True,
)
return processor, magma
def get_magma_prompt(task_description, processor, model_config):
convs = [
{"role": "user", "content": f"<image>\nWhat action should the robot take to {task_description}?"},
]
convs = [
{
"role": "system",
"content": "You are agent that can see, talk and act.",
},
] + convs
prompt = processor.tokenizer.apply_chat_template(
convs,
tokenize=False,
add_generation_prompt=True
)
if model_config.mm_use_image_start_end:
prompt = prompt.replace("<image>", "<image_start><image><image_end>")
return prompt
def get_magma_action(magma, processor, img, prompt, task_suite_name):
dataset_stats = json.load(open(os.path.join(magma.config._name_or_path, "dataset_statistics.json")))
action_norm_stats = dataset_stats[f"{task_suite_name}_no_noops"]['action']
n_action_bins = 256
vocab_size = processor.tokenizer.vocab_size
bins = np.linspace(-1, 1, n_action_bins)
bin_centers = (bins[:-1] + bins[1:]) / 2.0
# process inputs
inputs = processor(images=img, texts=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
inputs = inputs.to("cuda").to(torch.bfloat16)
# predict actions with magma
output_ids = magma.generate(
**inputs,
temperature=0.7,
do_sample=True,
num_beams=1,
max_new_tokens=1000,
use_cache=True,
)
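    # The model emits the action as discrete tokens at the end of the generated sequence. Below,
    # the trailing generated tokens are mapped back to bin indices (vocab_size - token id),
    # converted to continuous values via the bin centers, and un-normalized with the q01/q99
    # dataset statistics loaded above.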
action_ids = output_ids[0, -8:-1].cpu().tolist()
predicted_action_ids = np.array(action_ids).astype(np.int64)
discretized_actions = vocab_size - predicted_action_ids
discretized_actions = np.clip(discretized_actions - 1, a_min=0, a_max=bin_centers.shape[0] - 1)
normalized_actions = bin_centers[discretized_actions]
# unnormalize actions
mask = action_norm_stats.get("mask", np.ones_like(action_norm_stats["q01"], dtype=bool))
action_high, action_low = np.array(action_norm_stats["q99"]), np.array(action_norm_stats["q01"])
raw_action = np.where(
mask,
0.5 * (normalized_actions + 1) * (action_high - action_low) + action_low,
normalized_actions,
)
action = normalize_gripper_action(raw_action, binarize=True)
action = invert_gripper_action(action)
return action
def normalize_gripper_action(action, binarize=True):
"""
Convert gripper action from [0,1] to [-1,+1] range.
y = 2x - 1
"""
orig_low, orig_high = 0.0, 1.0
action[..., -1] = 2 * (action[..., -1] - orig_low) / (orig_high - orig_low) - 1
if binarize:
# Binarize to -1 or +1.
action[..., -1] = np.sign(action[..., -1])
return action
def invert_gripper_action(action):
"""Convert gripper: RLDS(0=close,1=open) -> -1=open,+1=close"""
action[..., -1] = action[..., -1] * -1.0
return action
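# Worked example for the two gripper helpers above (illustrative only):
#   RLDS gripper value 0.8 (mostly open)  ->  normalize: 2 * 0.8 - 1 = 0.6  ->  binarize: +1
#   ->  invert: -1, i.e. "open" in the simulator convention described in the docstring.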
robosuite==1.4.0
bddl==1.0.1
easydict==1.9
gym==0.25.2
cloudpickle
imageio[ffmpeg]
# --------------------------------------------------------
# Magma - Multimodal AI Agent at Microsoft Research
# Copyright (c) 2025 Microsoft
# Licensed under The MIT License [see LICENSE for details]
# Jianwei Yang (jianwyan@microsoft.com)
# --------------------------------------------------------
import os
import warnings
from utils.visualizer import Visualizer
from typing import TYPE_CHECKING, Any, Callable, Dict, List, Optional, Tuple
import random
import gradio as gr
import ast, re
import torch
import torchvision
from transformers import AutoModelForCausalLM, AutoProcessor
'''
build model
'''
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
random.seed(0)
spatial_quant_size = 256
# Load AI Model
dtype = torch.bfloat16
device = "cuda"
magma_model_id = "microsoft/Magma-8B"
model = AutoModelForCausalLM.from_pretrained(magma_model_id, trust_remote_code=True, torch_dtype=dtype)
processor = AutoProcessor.from_pretrained(magma_model_id, trust_remote_code=True)
model.to(device)
@torch.no_grad()
def inference(image, task, *args, **kwargs):
# image = image['image']
task_description = task
num_marks = args[0]
speed = args[1]
steps = args[2]
mark_ids = [i+1 for i in range(num_marks)]
image_resized = image.resize((256, 256))
magma_template = (
# "<image>\nThe image is labeled with numeric marks {}.\n"
"<image>\nThe image is split into 256x256 grids and is labeled with numeric marks {}.\n"
"The robot is doing: {}. To finish the task, how to move the numerical marks in the image with speed {} for the next {} steps?\n"
)
"""
Visual Trace Generation
"""
if model.config.mm_use_image_start_end:
magma_template = magma_template.replace("<image>", "<image_start><image><image_end>")
conv_user = magma_template.format(mark_ids, task_description, speed, steps)
print(conv_user)
convs = [
{"role": "user", "content": conv_user},
]
convs = [
{
"role": "system",
"content": "You are agent that can see, talk and act.",
},
] + convs
prompt = processor.tokenizer.apply_chat_template(
convs,
tokenize=False,
add_generation_prompt=True
)
inputs = processor(images=image_resized, texts=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
inputs = inputs.to(dtype).to(device)
with torch.inference_mode():
output_ids = model.generate(
**inputs,
temperature=0.3,
do_sample=True,
num_beams=1,
max_new_tokens=1024,
use_cache=True,
)
response = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
if len(response)==0:
return None
# extract traces from response
if "and their future positions are:" in response:
selected_marks_str, traces_str = response.split("and their future positions are:\n")
else:
selected_marks_str, traces_str = None, response
try:
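        # Parse the textual response into a {mark id: trace} mapping; each trace is a sequence of
        # (x, y) points in the 256x256 grid referenced by the prompt, and is rescaled to the
        # original image size further below before being drawn with the Visualizer.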
traces_dict = ast.literal_eval('{' + traces_str.strip().replace('\n\n',',') + '}')
overlay_traces = []
for mark_id, trace in traces_dict.items():
# convert list of tuples to tensor
trace = torch.tensor(ast.literal_eval(trace)).unsqueeze(1)
overlay_traces.append(trace)
# padded to the same length with the last element
max_len = max([trace.shape[0] for trace in overlay_traces])
for i in range(len(overlay_traces)):
if overlay_traces[i].shape[0] < max_len:
overlay_traces[i] = torch.cat([overlay_traces[i], overlay_traces[i][-1].unsqueeze(0).repeat(max_len - overlay_traces[i].shape[0], 1, 1)], dim=0)
overlay_traces = torch.cat(overlay_traces, dim=1).unsqueeze(0)
# if selected_marks_str is not None:
# selected_marks = re.findall(r'\[(.*?)\]', selected_marks_str)
# selected_marks = [torch.tensor(ast.literal_eval(mark)).unsqueeze(0) for mark in selected_marks]
# selected_marks = torch.cat(selected_marks, dim=0).unsqueeze(0)
# overlay_traces = torch.cat([selected_marks.unsqueeze(1), overlay_traces], dim=1)
overlay_traces = overlay_traces.float() / 256
overlay_traces[:,:,:,0] = overlay_traces[:,:,:,0] * image.size[0]
overlay_traces[:,:,:,1] = overlay_traces[:,:,:,1] * image.size[1]
images = [image] * overlay_traces.shape[1]
overlay_visibility = overlay_traces.new(overlay_traces.shape[0], overlay_traces.shape[1], overlay_traces.shape[2]).fill_(True)
video = torch.stack([torchvision.transforms.ToTensor()(img) for img in images])[None].float()*255
vis = Visualizer(save_dir="./saved_videos", pad_value=0, linewidth=2, tracks_leave_trace=-1)
vis.visualize(video, overlay_traces, overlay_visibility)
# return video path
return "./saved_videos/video.mp4"
except Exception as e:
print(e)
return None
class ImageMask(gr.components.Image):
"""
Sets: source="canvas", tool="sketch"
"""
is_template = True
def __init__(self, **kwargs):
super().__init__(source="upload", tool="sketch", interactive=True, **kwargs)
def preprocess(self, x):
return super().preprocess(x)
class Video(gr.components.Video):
"""
Sets: source="canvas", tool="sketch"
"""
is_template = True
def __init__(self, **kwargs):
super().__init__(source="upload", **kwargs)
def preprocess(self, x):
return super().preprocess(x)
'''
launch app
'''
title = "Magma"
description = '''Magma: Multimodal Agent to Act'''
'''Usage
Instructions:
&#x1F388 Try our default examples first (the sketch is not automatically drawn on the input and example image);
&#x1F388 For the video demo, processing takes about 30-60 s; please refresh if you hit an error on upload;
&#x1F388 Upload an image/video (if you want to use a referred region of another image, check "Example" and upload another image in the referring-image panel);
&#x1F388 Select at least one type of prompt of your choice (if you want to use a referred region of another image, check "Example");
&#x1F388 Remember to provide the actual prompt for each prompt type you select, otherwise you will get an error (e.g., remember to draw on the referring image);
&#x1F388 Our model by default supports the vocabulary of the 133 COCO categories; anything else will be classified as 'others' or misclassified.
'''
article = "The Demo is Run on Magma-8B."
inputs = [
gr.components.Image(label="Draw on Image",type="pil"),
gr.Textbox(label="Task"),
gr.Slider(1, 50, value=10, label="Number of Marks", info="Choose between 1 and 50"),
gr.Slider(2, 50, value=8, label="Speed", info="Choose between 2 and 50"),
gr.Slider(2, 50, value=8, label="Steps", info="Choose between 2 and 50"),
]
gr.Interface(
fn=inference,
inputs=inputs,
outputs=[
gr.Video(
label="Robot planning trajectory", format="mp4"
),
],
examples=[
["agents/robot_traj/sample.png", "Pick up the chip bag.", 9, 8, 8],
],
title=title,
description=description,
article=article,
allow_flagging='never',
cache_examples=False,
).launch(share=True)
# --------------------------------------------------------
# Magma - Multimodal AI Agent at Microsoft Research
# Copyright (c) 2025 Microsoft
# Licensed under The MIT License [see LICENSE for details]
# Jianwei Yang (jianwyan@microsoft.com)
# --------------------------------------------------------
import os
import warnings
from utils.visualizer import Visualizer
from typing import TYPE_CHECKING, Any, Callable, Dict, List, Optional, Tuple
import random
import gradio as gr
import ast, re
import torch
import torchvision
from transformers import AutoModelForCausalLM, AutoProcessor
'''
build model
'''
torch.manual_seed(0)
torch.cuda.manual_seed_all(0)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
random.seed(0)
spatial_quant_size = 256
# Load AI Model
dtype = torch.bfloat16
device = "cuda"
magma_model_id = "microsoft/Magma-8B"
model = AutoModelForCausalLM.from_pretrained(magma_model_id, trust_remote_code=True, torch_dtype=dtype)
processor = AutoProcessor.from_pretrained(magma_model_id, trust_remote_code=True)
model.to(device)
@torch.no_grad()
def inference(image, task, *args, **kwargs):
# image = image['image']
task_description = task
num_marks = args[0]
speed = args[1]
steps = args[2]
mark_ids = [i+1 for i in range(num_marks)]
image_resized = image.resize((256, 256))
magma_template = (
# "<image>\nThe image is labeled with numeric marks {}.\n"
"<image>\nThe image is split into 256x256 grids and is labeled with numeric marks {}.\n"
"The robot is doing: {}. To finish the task, how to move the numerical marks in the image with speed {} for the next {} steps?\n"
)
"""
Visual Trace Generation
"""
if model.config.mm_use_image_start_end:
magma_template = magma_template.replace("<image>", "<image_start><image><image_end>")
conv_user = magma_template.format(mark_ids, task_description, speed, steps)
print(conv_user)
convs = [
{"role": "user", "content": conv_user},
]
convs = [
{
"role": "system",
"content": "You are agent that can see, talk and act.",
},
] + convs
prompt = processor.tokenizer.apply_chat_template(
convs,
tokenize=False,
add_generation_prompt=True
)
inputs = processor(images=image_resized, texts=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
inputs = inputs.to(dtype).to(device)
with torch.inference_mode():
output_ids = model.generate(
**inputs,
temperature=0.3,
do_sample=True,
num_beams=1,
max_new_tokens=1024,
use_cache=True,
)
response = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
if len(response)==0:
return None
# extract traces from response
if "and their future positions are:" in response:
selected_marks_str, traces_str = response.split("and their future positions are:\n")
else:
selected_marks_str, traces_str = None, response
try:
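        # Parse the textual response into a {mark id: trace} mapping; each trace is a sequence of
        # (x, y) points in the 256x256 grid referenced by the prompt, and is rescaled to the
        # original image size further below before being drawn with the Visualizer.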
traces_dict = ast.literal_eval('{' + traces_str.strip().replace('\n\n',',') + '}')
overlay_traces = []
for mark_id, trace in traces_dict.items():
# convert list of tuples to tensor
trace = torch.tensor(ast.literal_eval(trace)).unsqueeze(1)
overlay_traces.append(trace)
# padded to the same length with the last element
max_len = max([trace.shape[0] for trace in overlay_traces])
for i in range(len(overlay_traces)):
if overlay_traces[i].shape[0] < max_len:
overlay_traces[i] = torch.cat([overlay_traces[i], overlay_traces[i][-1].unsqueeze(0).repeat(max_len - overlay_traces[i].shape[0], 1, 1)], dim=0)
overlay_traces = torch.cat(overlay_traces, dim=1).unsqueeze(0)
# if selected_marks_str is not None:
# selected_marks = re.findall(r'\[(.*?)\]', selected_marks_str)
# selected_marks = [torch.tensor(ast.literal_eval(mark)).unsqueeze(0) for mark in selected_marks]
# selected_marks = torch.cat(selected_marks, dim=0).unsqueeze(0)
# overlay_traces = torch.cat([selected_marks.unsqueeze(1), overlay_traces], dim=1)
overlay_traces = overlay_traces.float() / 256
overlay_traces[:,:,:,0] = overlay_traces[:,:,:,0] * image.size[0]
overlay_traces[:,:,:,1] = overlay_traces[:,:,:,1] * image.size[1]
images = [image] * overlay_traces.shape[1]
overlay_visibility = overlay_traces.new(overlay_traces.shape[0], overlay_traces.shape[1], overlay_traces.shape[2]).fill_(True)
video = torch.stack([torchvision.transforms.ToTensor()(img) for img in images])[None].float()*255
vis = Visualizer(save_dir="./saved_videos", pad_value=0, linewidth=2, tracks_leave_trace=-1)
vis.visualize(video, overlay_traces, overlay_visibility)
# return video path
return "./saved_videos/video.mp4"
except Exception as e:
print(e)
return None
from gradio.events import Dependency
class ImageMask(gr.components.Image):
"""
Sets: source="canvas", tool="sketch"
"""
is_template = True
def __init__(self, **kwargs):
super().__init__(source="upload", tool="sketch", interactive=True, **kwargs)
def preprocess(self, x):
return super().preprocess(x)
from typing import Callable, Literal, Sequence, Any, TYPE_CHECKING
from gradio.blocks import Block
if TYPE_CHECKING:
from gradio.components import Timer
class Video(gr.components.Video):
"""
Sets: source="canvas", tool="sketch"
"""
is_template = True
def __init__(self, **kwargs):
super().__init__(source="upload", **kwargs)
def preprocess(self, x):
return super().preprocess(x)
from typing import Callable, Literal, Sequence, Any, TYPE_CHECKING
from gradio.blocks import Block
if TYPE_CHECKING:
from gradio.components import Timer
'''
launch app
'''
title = "Magma"
description = '''Magma: Multimodal Agent to Act'''
'''Usage
Instructions:
&#x1F388 Try our default examples first (the sketch is not automatically drawn on the input and example image);
&#x1F388 For the video demo, processing takes about 30-60 s; please refresh if you hit an error on upload;
&#x1F388 Upload an image/video (if you want to use a referred region of another image, check "Example" and upload another image in the referring-image panel);
&#x1F388 Select at least one type of prompt of your choice (if you want to use a referred region of another image, check "Example");
&#x1F388 Remember to provide the actual prompt for each prompt type you select, otherwise you will get an error (e.g., remember to draw on the referring image);
&#x1F388 Our model by default supports the vocabulary of the 133 COCO categories; anything else will be classified as 'others' or misclassified.
'''
article = "The Demo is Run on Magma-8B."
inputs = [
gr.components.Image(label="Draw on Image",type="pil"),
gr.Textbox(label="Task"),
gr.Slider(1, 50, value=10, label="Number of Marks", info="Choose between 1 and 50"),
gr.Slider(2, 50, value=8, label="Speed", info="Choose between 2 and 50"),
gr.Slider(2, 50, value=8, label="Steps", info="Choose between 2 and 50"),
]
gr.Interface(
fn=inference,
inputs=inputs,
outputs=[
gr.Video(
label="Robot planning trajectory", format="mp4"
),
],
examples=[
["agents/robot_traj/sample.png", "Pick up the chip bag.", 9, 8, 8],
],
title=title,
description=description,
article=article,
allow_flagging='never',
cache_examples=False,
).launch(share=True)
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
import os
import numpy as np
import imageio
import torch
from matplotlib import cm
import torch.nn.functional as F
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
from PIL import Image, ImageDraw
def read_video_from_path(path):
try:
reader = imageio.get_reader(path)
except Exception as e:
print("Error opening video file: ", e)
return None
frames = []
for i, im in enumerate(reader):
frames.append(np.array(im))
return np.stack(frames)
def draw_circle(rgb, coord, radius, color=(255, 0, 0), visible=True):
# Create a draw object
draw = ImageDraw.Draw(rgb)
# Calculate the bounding box of the circle
left_up_point = (coord[0] - radius, coord[1] - radius)
right_down_point = (coord[0] + radius, coord[1] + radius)
# Draw the circle
draw.ellipse(
[left_up_point, right_down_point],
fill=tuple(color) if visible else None,
outline=tuple(color),
)
return rgb
def draw_line(rgb, coord_y, coord_x, color, linewidth):
draw = ImageDraw.Draw(rgb)
draw.line(
(coord_y[0], coord_y[1], coord_x[0], coord_x[1]),
fill=tuple(color),
width=linewidth,
)
return rgb
def add_weighted(rgb, alpha, original, beta, gamma):
return (rgb * alpha + original * beta + gamma).astype("uint8")
class Visualizer:
def __init__(
self,
save_dir: str = "./results",
grayscale: bool = False,
pad_value: int = 0,
fps: int = 10,
mode: str = "rainbow", # 'cool', 'optical_flow'
linewidth: int = 2,
show_first_frame: int = 10,
tracks_leave_trace: int = 0, # -1 for infinite
):
self.mode = mode
self.save_dir = save_dir
if mode == "rainbow":
self.color_map = cm.get_cmap("gist_rainbow")
elif mode == "cool":
self.color_map = cm.get_cmap(mode)
self.show_first_frame = show_first_frame
self.grayscale = grayscale
self.tracks_leave_trace = tracks_leave_trace
self.pad_value = pad_value
self.linewidth = linewidth
self.fps = fps
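    # Typical usage (see the robot trace-generation demo in this repository):
    #   vis = Visualizer(save_dir="./saved_videos", pad_value=0, linewidth=2, tracks_leave_trace=-1)
    #   vis.visualize(video, tracks, visibility)  # video: (B,T,C,H,W), tracks: (B,T,N,2)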
def visualize(
self,
video: torch.Tensor, # (B,T,C,H,W)
tracks: torch.Tensor, # (B,T,N,2)
visibility: torch.Tensor = None, # (B, T, N, 1) bool
gt_tracks: torch.Tensor = None, # (B,T,N,2)
segm_mask: torch.Tensor = None, # (B,1,H,W)
filename: str = "video",
writer=None, # tensorboard Summary Writer, used for visualization during training
step: int = 0,
query_frame: int = 0,
save_video: bool = True,
compensate_for_camera_motion: bool = False,
):
if compensate_for_camera_motion:
assert segm_mask is not None
if segm_mask is not None:
coords = tracks[0, query_frame].round().long()
segm_mask = segm_mask[0, query_frame][coords[:, 1], coords[:, 0]].long()
video = F.pad(
video,
(self.pad_value, self.pad_value, self.pad_value, self.pad_value),
"constant",
255,
)
tracks = tracks + self.pad_value
if self.grayscale:
transform = transforms.Grayscale()
video = transform(video)
video = video.repeat(1, 1, 3, 1, 1)
res_video = self.draw_tracks_on_video(
video=video,
tracks=tracks,
visibility=visibility,
segm_mask=segm_mask,
gt_tracks=gt_tracks,
query_frame=query_frame,
compensate_for_camera_motion=compensate_for_camera_motion,
)
if save_video:
self.save_video(res_video, filename=filename, writer=writer, step=step)
return res_video
def save_video(self, video, filename, writer=None, step=0):
if writer is not None:
writer.add_video(
filename,
video.to(torch.uint8),
global_step=step,
fps=self.fps,
)
else:
os.makedirs(self.save_dir, exist_ok=True)
wide_list = list(video.unbind(1))
wide_list = [wide[0].permute(1, 2, 0).cpu().numpy() for wide in wide_list]
# Prepare the video file path
save_path = os.path.join(self.save_dir, f"{filename}.mp4")
# Create a writer object
video_writer = imageio.get_writer(save_path, fps=self.fps)
# Write frames to the video file
for frame in wide_list[2:-1]:
video_writer.append_data(frame)
video_writer.close()
print(f"Video saved to {save_path}")
def draw_tracks_on_video(
self,
video: torch.Tensor,
tracks: torch.Tensor,
visibility: torch.Tensor = None,
segm_mask: torch.Tensor = None,
gt_tracks=None,
query_frame: int = 0,
compensate_for_camera_motion=False,
):
B, T, C, H, W = video.shape
_, _, N, D = tracks.shape
assert D == 2
assert C == 3
video = video[0].permute(0, 2, 3, 1).byte().detach().cpu().numpy() # S, H, W, C
tracks = tracks[0].long().detach().cpu().numpy() # S, N, 2
if gt_tracks is not None:
gt_tracks = gt_tracks[0].detach().cpu().numpy()
res_video = []
# process input video
for rgb in video:
res_video.append(rgb.copy())
vector_colors = np.zeros((T, N, 3))
if self.mode == "optical_flow":
import flow_vis
vector_colors = flow_vis.flow_to_color(tracks - tracks[query_frame][None])
elif segm_mask is None:
if self.mode == "rainbow":
y_min, y_max = (
tracks[query_frame, :, 1].min(),
tracks[query_frame, :, 1].max(),
)
norm = plt.Normalize(y_min, y_max)
for n in range(N):
color = self.color_map(norm(tracks[query_frame, n, 1]))
color = np.array(color[:3])[None] * 255
vector_colors[:, n] = np.repeat(color, T, axis=0)
else:
# color changes with time
for t in range(T):
color = np.array(self.color_map(t / T)[:3])[None] * 255
vector_colors[t] = np.repeat(color, N, axis=0)
else:
if self.mode == "rainbow":
vector_colors[:, segm_mask <= 0, :] = 255
y_min, y_max = (
tracks[0, segm_mask > 0, 1].min(),
tracks[0, segm_mask > 0, 1].max(),
)
norm = plt.Normalize(y_min, y_max)
for n in range(N):
if segm_mask[n] > 0:
color = self.color_map(norm(tracks[0, n, 1]))
color = np.array(color[:3])[None] * 255
vector_colors[:, n] = np.repeat(color, T, axis=0)
else:
# color changes with segm class
segm_mask = segm_mask.cpu()
color = np.zeros((segm_mask.shape[0], 3), dtype=np.float32)
color[segm_mask > 0] = np.array(self.color_map(1.0)[:3]) * 255.0
color[segm_mask <= 0] = np.array(self.color_map(0.0)[:3]) * 255.0
vector_colors = np.repeat(color[None], T, axis=0)
# draw tracks
if self.tracks_leave_trace != 0:
for t in range(query_frame + 1, T):
first_ind = (
max(0, t - self.tracks_leave_trace) if self.tracks_leave_trace >= 0 else 0
)
curr_tracks = tracks[first_ind : t + 1]
curr_colors = vector_colors[first_ind : t + 1]
if compensate_for_camera_motion:
diff = (
tracks[first_ind : t + 1, segm_mask <= 0]
- tracks[t : t + 1, segm_mask <= 0]
).mean(1)[:, None]
curr_tracks = curr_tracks - diff
curr_tracks = curr_tracks[:, segm_mask > 0]
curr_colors = curr_colors[:, segm_mask > 0]
res_video[t] = self._draw_pred_tracks(
res_video[t],
curr_tracks,
curr_colors,
)
if gt_tracks is not None:
res_video[t] = self._draw_gt_tracks(res_video[t], gt_tracks[first_ind : t + 1])
# draw points
for t in range(query_frame, T):
img = Image.fromarray(np.uint8(res_video[t]))
for i in range(N):
coord = (tracks[t, i, 0], tracks[t, i, 1])
visibile = True
if visibility is not None:
visibile = visibility[0, t, i]
if coord[0] != 0 and coord[1] != 0:
if not compensate_for_camera_motion or (
compensate_for_camera_motion and segm_mask[i] > 0
):
img = draw_circle(
img,
coord=coord,
radius=int(self.linewidth * 2),
color=vector_colors[t, i].astype(int),
visible=visibile,
)
res_video[t] = np.array(img)
# construct the final rgb sequence
if self.show_first_frame > 0:
res_video = [res_video[0]] * self.show_first_frame + res_video[1:]
return torch.from_numpy(np.stack(res_video)).permute(0, 3, 1, 2)[None].byte()
def _draw_pred_tracks(
self,
rgb: np.ndarray, # H x W x 3
tracks: np.ndarray, # T x 2
vector_colors: np.ndarray,
alpha: float = 0.5,
):
T, N, _ = tracks.shape
rgb = Image.fromarray(np.uint8(rgb))
for s in range(T - 1):
vector_color = vector_colors[s]
original = rgb.copy()
alpha = (s / T) ** 2
for i in range(N):
coord_y = (int(tracks[s, i, 0]), int(tracks[s, i, 1]))
coord_x = (int(tracks[s + 1, i, 0]), int(tracks[s + 1, i, 1]))
if coord_y[0] != 0 and coord_y[1] != 0:
rgb = draw_line(
rgb,
coord_y,
coord_x,
vector_color[i].astype(int),
self.linewidth,
)
if self.tracks_leave_trace > 0:
rgb = Image.fromarray(
np.uint8(add_weighted(np.array(rgb), alpha, np.array(original), 1 - alpha, 0))
)
rgb = np.array(rgb)
return rgb
def _draw_gt_tracks(
self,
rgb: np.ndarray, # H x W x 3,
gt_tracks: np.ndarray, # T x 2
):
T, N, _ = gt_tracks.shape
color = np.array((211, 0, 0))
rgb = Image.fromarray(np.uint8(rgb))
for t in range(T):
for i in range(N):
                gt_track = gt_tracks[t][i]
                # draw a red cross at the ground-truth point
                if gt_track[0] > 0 and gt_track[1] > 0:
                    length = self.linewidth * 3
                    coord_y = (int(gt_track[0]) + length, int(gt_track[1]) + length)
                    coord_x = (int(gt_track[0]) - length, int(gt_track[1]) - length)
                    rgb = draw_line(
                        rgb,
                        coord_y,
                        coord_x,
                        color,
                        self.linewidth,
                    )
                    coord_y = (int(gt_track[0]) - length, int(gt_track[1]) + length)
                    coord_x = (int(gt_track[0]) + length, int(gt_track[1]) - length)
rgb = draw_line(
rgb,
coord_y,
coord_x,
color,
self.linewidth,
)
rgb = np.array(rgb)
return rgb
# --------------------------------------------------------
# Magma - Multimodal AI Agent at Microsoft Research
# Copyright (c) 2025 Microsoft
# Licensed under The MIT License [see LICENSE for details]
# Jianwei Yang (jianwyan@microsoft.com)
# --------------------------------------------------------
from typing import Optional
import spaces
import gradio as gr
import numpy as np
import torch
from PIL import Image
import io
import re
import base64, os
from util.utils import check_ocr_box, get_yolo_model, get_caption_model_processor, get_som_labeled_img
from util.som import MarkHelper, plot_boxes_with_marks, plot_circles_with_marks
from util.process_utils import pred_2_point, extract_bbox, extract_mark_id
import torch
from PIL import Image
from huggingface_hub import snapshot_download
import torch
from transformers import AutoModelForCausalLM
from transformers import AutoProcessor
# Define repository and local directory
repo_id = "microsoft/OmniParser-v2.0" # HF repo
local_dir = "weights" # Target local directory
dtype = torch.bfloat16
DEVICE = torch.device('cuda')
som_generator = MarkHelper()
magma_som_prompt = "<image>\nIn this view I need to click a button to \"{}\"? Provide the coordinates and the mark index of the containing bounding box if applicable."
magma_qa_prompt = "<image>\n{} Answer the question briefly."
magma_model_id = "microsoft/Magma-8B"
magam_model = AutoModelForCausalLM.from_pretrained(magma_model_id, trust_remote_code=True, torch_dtype=dtype)
magma_processor = AutoProcessor.from_pretrained(magma_model_id, trust_remote_code=True)
magam_model.to(DEVICE)
# Download the entire repository
snapshot_download(repo_id=repo_id, local_dir=local_dir)
print(f"Repository downloaded to: {local_dir}")
yolo_model = get_yolo_model(model_path='weights/icon_detect/model.pt')
caption_model_processor = get_caption_model_processor(model_name="florence2", model_name_or_path="weights/icon_caption")
# caption_model_processor = get_caption_model_processor(model_name="blip2", model_name_or_path="weights/icon_caption_blip2")
MARKDOWN = """
<div align="center">
<h2>Magma: A Foundation Model for Multimodal AI Agents</h2>
\[[arXiv Paper](https://www.arxiv.org/pdf/2502.13130)\] &nbsp; \[[Project Page](https://microsoft.github.io/Magma/)\] &nbsp; \[[Github Repo](https://github.com/microsoft/Magma)\] &nbsp; \[[Hugging Face Model](https://huggingface.co/microsoft/Magma-8B)\] &nbsp;
This demo is powered by [Gradio](https://gradio.app/) and uses [OmniParserv2](https://github.com/microsoft/OmniParser) to generate [Set-of-Mark prompts](https://github.com/microsoft/SoM).
The demo supports three modes:
1. Empty text input: it falls back to an OmniParser demo.
2. Text input starting with "Q:": it runs a visual question answering demo.
3. Any other text input: it runs a UI navigation demo.
</div>
"""
DEVICE = torch.device('cuda')
@spaces.GPU
@torch.inference_mode()
def get_som_response(instruction, image_som):
prompt = magma_som_prompt.format(instruction)
if magam_model.config.mm_use_image_start_end:
qs = prompt.replace('<image>', '<image_start><image><image_end>')
else:
qs = prompt
convs = [{"role": "user", "content": qs}]
convs = [{"role": "system", "content": "You are agent that can see, talk and act."}] + convs
prompt = magma_processor.tokenizer.apply_chat_template(
convs,
tokenize=False,
add_generation_prompt=True
)
inputs = magma_processor(images=[image_som], texts=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
inputs = inputs.to(dtype).to(DEVICE)
magam_model.generation_config.pad_token_id = magma_processor.tokenizer.pad_token_id
with torch.inference_mode():
output_ids = magam_model.generate(
**inputs,
temperature=0.0,
do_sample=False,
num_beams=1,
max_new_tokens=128,
use_cache=True
)
prompt_decoded = magma_processor.batch_decode(inputs['input_ids'], skip_special_tokens=True)[0]
response = magma_processor.batch_decode(output_ids, skip_special_tokens=True)[0]
response = response.replace(prompt_decoded, '').strip()
return response
@spaces.GPU
@torch.inference_mode()
def get_qa_response(instruction, image):
prompt = magma_qa_prompt.format(instruction)
if magam_model.config.mm_use_image_start_end:
qs = prompt.replace('<image>', '<image_start><image><image_end>')
else:
qs = prompt
convs = [{"role": "user", "content": qs}]
convs = [{"role": "system", "content": "You are agent that can see, talk and act."}] + convs
prompt = magma_processor.tokenizer.apply_chat_template(
convs,
tokenize=False,
add_generation_prompt=True
)
inputs = magma_processor(images=[image], texts=prompt, return_tensors="pt")
inputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)
inputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)
inputs = inputs.to(dtype).to(DEVICE)
magam_model.generation_config.pad_token_id = magma_processor.tokenizer.pad_token_id
with torch.inference_mode():
output_ids = magam_model.generate(
**inputs,
temperature=0.0,
do_sample=False,
num_beams=1,
max_new_tokens=128,
use_cache=True
)
prompt_decoded = magma_processor.batch_decode(inputs['input_ids'], skip_special_tokens=True)[0]
response = magma_processor.batch_decode(output_ids, skip_special_tokens=True)[0]
response = response.replace(prompt_decoded, '').strip()
return response
@spaces.GPU
@torch.inference_mode()
# @torch.autocast(device_type="cuda", dtype=torch.bfloat16)
def process(
image_input,
box_threshold,
iou_threshold,
use_paddleocr,
imgsz,
instruction,
) -> Optional[Image.Image]:
# image_save_path = 'imgs/saved_image_demo.png'
# image_input.save(image_save_path)
# image = Image.open(image_save_path)
box_overlay_ratio = image_input.size[0] / 3200
draw_bbox_config = {
'text_scale': 0.8 * box_overlay_ratio,
'text_thickness': max(int(2 * box_overlay_ratio), 1),
'text_padding': max(int(3 * box_overlay_ratio), 1),
'thickness': max(int(3 * box_overlay_ratio), 1),
}
ocr_bbox_rslt, is_goal_filtered = check_ocr_box(image_input, display_img = False, output_bb_format='xyxy', goal_filtering=None, easyocr_args={'paragraph': False, 'text_threshold':0.9}, use_paddleocr=use_paddleocr)
text, ocr_bbox = ocr_bbox_rslt
dino_labled_img, label_coordinates, parsed_content_list = get_som_labeled_img(image_input, yolo_model, BOX_TRESHOLD = box_threshold, output_coord_in_ratio=False, ocr_bbox=ocr_bbox,draw_bbox_config=draw_bbox_config, caption_model_processor=caption_model_processor, ocr_text=text,iou_threshold=iou_threshold, imgsz=imgsz,)
parsed_content_list = '\n'.join([f'icon {i}: ' + str(v) for i,v in enumerate(parsed_content_list)])
if len(instruction) == 0:
print('finish processing')
image = Image.open(io.BytesIO(base64.b64decode(dino_labled_img)))
return image, str(parsed_content_list)
elif instruction.startswith('Q:'):
response = get_qa_response(instruction, image_input)
return image_input, response
# parsed_content_list = str(parsed_content_list)
# convert xywh to yxhw
label_coordinates_yxhw = {}
for key, val in label_coordinates.items():
if val[2] < 0 or val[3] < 0:
continue
label_coordinates_yxhw[key] = [val[1], val[0], val[3], val[2]]
image_som = plot_boxes_with_marks(image_input.copy(), [val for key, val in label_coordinates_yxhw.items()], som_generator, edgecolor=(255,0,0), fn_save=None, normalized_to_pixel=False)
# convert xywh to xyxy
for key, val in label_coordinates.items():
label_coordinates[key] = [val[0], val[1], val[0] + val[2], val[1] + val[3]]
# normalize label_coordinates
for key, val in label_coordinates.items():
label_coordinates[key] = [val[0] / image_input.size[0], val[1] / image_input.size[1], val[2] / image_input.size[0], val[3] / image_input.size[1]]
magma_response = get_som_response(instruction, image_som)
print("magma repsonse: ", magma_response)
# map magma_response into the mark id
mark_id = extract_mark_id(magma_response)
if mark_id is not None:
if str(mark_id) in label_coordinates:
bbox_for_mark = label_coordinates[str(mark_id)]
else:
bbox_for_mark = None
else:
bbox_for_mark = None
if bbox_for_mark:
# draw bbox_for_mark on the image
image_som = plot_boxes_with_marks(
image_input,
[label_coordinates_yxhw[str(mark_id)]],
som_generator,
edgecolor=(255,127,111),
alpha=30,
fn_save=None,
normalized_to_pixel=False,
add_mark=False
)
else:
try:
if 'box' in magma_response:
pred_bbox = extract_bbox(magma_response)
click_point = [(pred_bbox[0][0] + pred_bbox[1][0]) / 2, (pred_bbox[0][1] + pred_bbox[1][1]) / 2]
click_point = [item / 1000 for item in click_point]
else:
click_point = pred_2_point(magma_response)
# de-normalize click_point (width, height)
click_point = [click_point[0] * image_input.size[0], click_point[1] * image_input.size[1]]
image_som = plot_circles_with_marks(
image_input,
[click_point],
som_generator,
edgecolor=(255,127,111),
linewidth=3,
fn_save=None,
normalized_to_pixel=False,
add_mark=False
)
except:
image_som = image_input
return image_som, str(parsed_content_list)
with gr.Blocks() as demo:
gr.Markdown(MARKDOWN)
with gr.Row():
with gr.Column():
image_input_component = gr.Image(
type='pil', label='Upload image')
# set the threshold for removing the bounding boxes with low confidence, default is 0.05
with gr.Accordion("Parameters", open=False) as parameter_row:
box_threshold_component = gr.Slider(
label='Box Threshold', minimum=0.01, maximum=1.0, step=0.01, value=0.05)
# set the threshold for removing the bounding boxes with large overlap, default is 0.1
iou_threshold_component = gr.Slider(
label='IOU Threshold', minimum=0.01, maximum=1.0, step=0.01, value=0.1)
use_paddleocr_component = gr.Checkbox(
label='Use PaddleOCR', value=True)
imgsz_component = gr.Slider(
label='Icon Detect Image Size', minimum=640, maximum=1920, step=32, value=640)
# text box
text_input_component = gr.Textbox(label='Text Input', placeholder='Text Input')
submit_button_component = gr.Button(
value='Submit', variant='primary')
with gr.Column():
image_output_component = gr.Image(type='pil', label='Image Output')
text_output_component = gr.Textbox(label='Parsed screen elements', placeholder='Text Output')
submit_button_component.click(
fn=process,
inputs=[
image_input_component,
box_threshold_component,
iou_threshold_component,
use_paddleocr_component,
imgsz_component,
text_input_component
],
outputs=[image_output_component, text_output_component]
)
# demo.launch(debug=False, show_error=True, share=True)
# demo.launch(share=True, server_port=7861, server_name='0.0.0.0')
demo.queue().launch(share=False)