Unverified Commit 3fab6624 authored by Anton Obukhov's avatar Anton Obukhov Committed by GitHub
Browse files

Marigold Update: v1-1 models, Intrinsic Image Decomposition pipeline, documentation (#10884)



* minor documentation fixes of the depth and normals pipelines

* update license headers

* update model checkpoints in examples
fix missing prediction_type in register_to_config in the normals pipeline

* add initial marigold intrinsics pipeline
update comments about num_inference_steps and ensemble_size
minor fixes in comments of marigold normals and depth pipelines

* update uncertainty visualization to work with intrinsics

* integrate iid


---------
Co-authored-by: default avatarYiYi Xu <yixu310@gmail.com>
Co-authored-by: default avatarSteven Liu <59462357+stevhliu@users.noreply.github.com>
parent f0ac7aaa
<!--Copyright 2024 Marigold authors and The HuggingFace Team. All rights reserved.
<!--
Copyright 2023-2025 Marigold Team, ETH Zürich. All rights reserved.
Copyright 2024-2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
......@@ -10,67 +12,120 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
specific language governing permissions and limitations under the License.
-->
# Marigold Pipelines for Computer Vision Tasks
# Marigold Computer Vision
![marigold](https://marigoldmonodepth.github.io/images/teaser_collage_compressed.jpg)
Marigold was proposed in [Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation](https://huggingface.co/papers/2312.02145), a CVPR 2024 Oral paper by [Bingxin Ke](http://www.kebingxin.com/), [Anton Obukhov](https://www.obukhov.ai/), [Shengyu Huang](https://shengyuh.github.io/), [Nando Metzger](https://nandometzger.github.io/), [Rodrigo Caye Daudt](https://rcdaudt.github.io/), and [Konrad Schindler](https://scholar.google.com/citations?user=FZuNgqIAAAAJ&hl=en).
The idea is to repurpose the rich generative prior of Text-to-Image Latent Diffusion Models (LDMs) for traditional computer vision tasks.
Initially, this idea was explored to fine-tune Stable Diffusion for Monocular Depth Estimation, as shown in the teaser above.
Later,
- [Tianfu Wang](https://tianfwang.github.io/) trained the first Latent Consistency Model (LCM) of Marigold, which unlocked fast single-step inference;
- [Kevin Qu](https://www.linkedin.com/in/kevin-qu-b3417621b/?locale=en_US) extended the approach to Surface Normals Estimation;
- [Anton Obukhov](https://www.obukhov.ai/) contributed the pipelines and documentation into diffusers (enabled and supported by [YiYi Xu](https://yiyixuxu.github.io/) and [Sayak Paul](https://sayak.dev/)).
Marigold was proposed in
[Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation](https://huggingface.co/papers/2312.02145),
a CVPR 2024 Oral paper by
[Bingxin Ke](http://www.kebingxin.com/),
[Anton Obukhov](https://www.obukhov.ai/),
[Shengyu Huang](https://shengyuh.github.io/),
[Nando Metzger](https://nandometzger.github.io/),
[Rodrigo Caye Daudt](https://rcdaudt.github.io/), and
[Konrad Schindler](https://scholar.google.com/citations?user=FZuNgqIAAAAJ&hl=en).
The core idea is to **repurpose the generative prior of Text-to-Image Latent Diffusion Models (LDMs) for traditional
computer vision tasks**.
This approach was explored by fine-tuning Stable Diffusion for **Monocular Depth Estimation**, as demonstrated in the
teaser above.
Marigold was later extended in the follow-up paper,
[Marigold: Affordable Adaptation of Diffusion-Based Image Generators for Image Analysis](https://huggingface.co/papers/2312.02145),
authored by
[Bingxin Ke](http://www.kebingxin.com/),
[Kevin Qu](https://www.linkedin.com/in/kevin-qu-b3417621b/?locale=en_US),
[Tianfu Wang](https://tianfwang.github.io/),
[Nando Metzger](https://nandometzger.github.io/),
[Shengyu Huang](https://shengyuh.github.io/),
[Bo Li](https://www.linkedin.com/in/bobboli0202/),
[Anton Obukhov](https://www.obukhov.ai/), and
[Konrad Schindler](https://scholar.google.com/citations?user=FZuNgqIAAAAJ&hl=en).
This work expanded Marigold to support new modalities such as **Surface Normals** and **Intrinsic Image Decomposition**
(IID), introduced a training protocol for **Latent Consistency Models** (LCM), and demonstrated **High-Resolution** (HR)
processing capability.
The abstract from the paper is:
<Tip>
*Monocular depth estimation is a fundamental computer vision task. Recovering 3D depth from a single image is geometrically ill-posed and requires scene understanding, so it is not surprising that the rise of deep learning has led to a breakthrough. The impressive progress of monocular depth estimators has mirrored the growth in model capacity, from relatively modest CNNs to large Transformer architectures. Still, monocular depth estimators tend to struggle when presented with images with unfamiliar content and layout, since their knowledge of the visual world is restricted by the data seen during training, and challenged by zero-shot generalization to new domains. This motivates us to explore whether the extensive priors captured in recent generative diffusion models can enable better, more generalizable depth estimation. We introduce Marigold, a method for affine-invariant monocular depth estimation that is derived from Stable Diffusion and retains its rich prior knowledge. The estimator can be fine-tuned in a couple of days on a single GPU using only synthetic training data. It delivers state-of-the-art performance across a wide range of datasets, including over 20% performance gains in specific cases. Project page: https://marigoldmonodepth.github.io.*
The early Marigold models (`v1-0` and earlier) were optimized for best results with at least 10 inference steps.
LCM models were later developed to enable high-quality inference in just 1 to 4 steps.
Marigold models `v1-1` and later use the DDIM scheduler to achieve optimal
results in as few as 1 to 4 steps.
## Available Pipelines
</Tip>
Each pipeline supports one Computer Vision task, which takes an input RGB image as input and produces a *prediction* of the modality of interest, such as a depth map of the input image.
Currently, the following tasks are implemented:
## Available Pipelines
| Pipeline | Predicted Modalities | Demos |
|---------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------:|
| [MarigoldDepthPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py) | [Depth](https://en.wikipedia.org/wiki/Depth_map), [Disparity](https://en.wikipedia.org/wiki/Binocular_disparity) | [Fast Demo (LCM)](https://huggingface.co/spaces/prs-eth/marigold-lcm), [Slow Original Demo (DDIM)](https://huggingface.co/spaces/prs-eth/marigold) |
| [MarigoldNormalsPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py) | [Surface normals](https://en.wikipedia.org/wiki/Normal_mapping) | [Fast Demo (LCM)](https://huggingface.co/spaces/prs-eth/marigold-normals-lcm) |
Each pipeline is tailored for a specific computer vision task, processing an input RGB image and generating a
corresponding prediction.
Currently, the following computer vision tasks are implemented:
| Pipeline | Recommended Model Checkpoints | Spaces (Interactive Apps) | Predicted Modalities |
|---------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------:|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [MarigoldDepthPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py) | [prs-eth/marigold-depth-v1-1](https://huggingface.co/prs-eth/marigold-depth-v1-1) | [Depth Estimation](https://huggingface.co/spaces/prs-eth/marigold) | [Depth](https://en.wikipedia.org/wiki/Depth_map), [Disparity](https://en.wikipedia.org/wiki/Binocular_disparity) |
| [MarigoldNormalsPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py) | [prs-eth/marigold-normals-v1-1](https://huggingface.co/prs-eth/marigold-normals-v1-1) | [Surface Normals Estimation](https://huggingface.co/spaces/prs-eth/marigold-normals) | [Surface normals](https://en.wikipedia.org/wiki/Normal_mapping) |
| [MarigoldIntrinsicsPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_intrinsics.py) | [prs-eth/marigold-iid-appearance-v1-1](https://huggingface.co/prs-eth/marigold-iid-appearance-v1-1),<br>[prs-eth/marigold-iid-lighting-v1-1](https://huggingface.co/prs-eth/marigold-iid-lighting-v1-1) | [Intrinsic Image Decomposition](https://huggingface.co/spaces/prs-eth/marigold-iid) | [Albedo](https://en.wikipedia.org/wiki/Albedo), [Materials](https://www.n.aiq3d.com/wiki/roughnessmetalnessao-map), [Lighting](https://en.wikipedia.org/wiki/Diffuse_reflection) |
## Available Checkpoints
The original checkpoints can be found under the [PRS-ETH](https://huggingface.co/prs-eth/) Hugging Face organization.
All original checkpoints are available under the [PRS-ETH](https://huggingface.co/prs-eth/) organization on Hugging Face.
They are designed for use with diffusers pipelines and the [original codebase](https://github.com/prs-eth/marigold), which can also be used to train
new model checkpoints.
The following is a summary of the recommended checkpoints, all of which produce reliable results with 1 to 4 steps.
| Checkpoint | Modality | Comment |
|-----------------------------------------------------------------------------------------------------|--------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [prs-eth/marigold-depth-v1-1](https://huggingface.co/prs-eth/marigold-depth-v1-1) | Depth | Affine-invariant depth prediction assigns each pixel a value between 0 (near plane) and 1 (far plane), with both planes determined by the model during inference. |
| [prs-eth/marigold-normals-v0-1](https://huggingface.co/prs-eth/marigold-normals-v0-1) | Normals | The surface normals predictions are unit-length 3D vectors in the screen space camera, with values in the range from -1 to 1. |
| [prs-eth/marigold-iid-appearance-v1-1](https://huggingface.co/prs-eth/marigold-iid-appearance-v1-1) | Intrinsics | InteriorVerse decomposition is comprised of Albedo and two BRDF material properties: Roughness and Metallicity. |
| [prs-eth/marigold-iid-lighting-v1-1](https://huggingface.co/prs-eth/marigold-iid-lighting-v1-1) | Intrinsics | HyperSim decomposition of an image &nbsp\\(I\\)&nbsp is comprised of Albedo &nbsp\\(A\\), Diffuse shading &nbsp\\(S\\), and Non-diffuse residual &nbsp\\(R\\): &nbsp\\(I = A*S+R\\). |
<Tip>
Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. Also, to know more about reducing the memory usage of this pipeline, refer to the ["Reduce memory usage"] section [here](../../using-diffusers/svd#reduce-memory-usage).
Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff
between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to
efficiently load the same components into multiple pipelines.
Also, to know more about reducing the memory usage of this pipeline, refer to the ["Reduce memory usage"] section
[here](../../using-diffusers/svd#reduce-memory-usage).
</Tip>
<Tip warning={true}>
Marigold pipelines were designed and tested only with `DDIMScheduler` and `LCMScheduler`.
Depending on the scheduler, the number of inference steps required to get reliable predictions varies, and there is no universal value that works best across schedulers.
Because of that, the default value of `num_inference_steps` in the `__call__` method of the pipeline is set to `None` (see the API reference).
Unless set explicitly, its value will be taken from the checkpoint configuration `model_index.json`.
This is done to ensure high-quality predictions when calling the pipeline with just the `image` argument.
Marigold pipelines were designed and tested with the scheduler embedded in the model checkpoint.
The optimal number of inference steps varies by scheduler, with no universal value that works best across all cases.
To accommodate this, the `num_inference_steps` parameter in the pipeline's `__call__` method defaults to `None` (see the
API reference).
Unless set explicitly, it inherits the value from the `default_denoising_steps` field in the checkpoint configuration
file (`model_index.json`).
This ensures high-quality predictions when invoking the pipeline with only the `image` argument.
</Tip>
See also Marigold [usage examples](marigold_usage).
See also Marigold [usage examples](../../using-diffusers/marigold_usage).
## Marigold Depth Prediction API
## MarigoldDepthPipeline
[[autodoc]] MarigoldDepthPipeline
- all
- __call__
## MarigoldNormalsPipeline
[[autodoc]] pipelines.marigold.pipeline_marigold_depth.MarigoldDepthOutput
[[autodoc]] pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_depth
## Marigold Normals Estimation API
[[autodoc]] MarigoldNormalsPipeline
- all
- __call__
## MarigoldDepthOutput
[[autodoc]] pipelines.marigold.pipeline_marigold_depth.MarigoldDepthOutput
## MarigoldNormalsOutput
[[autodoc]] pipelines.marigold.pipeline_marigold_normals.MarigoldNormalsOutput
[[autodoc]] pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_normals
## Marigold Intrinsic Image Decomposition API
[[autodoc]] MarigoldIntrinsicsPipeline
- __call__
[[autodoc]] pipelines.marigold.pipeline_marigold_intrinsics.MarigoldIntrinsicsOutput
[[autodoc]] pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_intrinsics
......@@ -65,7 +65,7 @@ The table below lists all the pipelines currently available in 🤗 Diffusers an
| [Latte](latte) | text2image |
| [LEDITS++](ledits_pp) | image editing |
| [Lumina-T2X](lumina) | text2image |
| [Marigold](marigold) | depth |
| [Marigold](marigold) | depth-estimation, normals-estimation, intrinsic-decomposition |
| [MultiDiffusion](panorama) | text2image |
| [MusicLDM](musicldm) | text2audio |
| [PAG](pag) | text2image |
......
......@@ -345,6 +345,7 @@ else:
"Lumina2Text2ImgPipeline",
"LuminaText2ImgPipeline",
"MarigoldDepthPipeline",
"MarigoldIntrinsicsPipeline",
"MarigoldNormalsPipeline",
"MochiPipeline",
"MusicLDMPipeline",
......@@ -845,6 +846,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
Lumina2Text2ImgPipeline,
LuminaText2ImgPipeline,
MarigoldDepthPipeline,
MarigoldIntrinsicsPipeline,
MarigoldNormalsPipeline,
MochiPipeline,
MusicLDMPipeline,
......
......@@ -261,6 +261,7 @@ else:
_import_structure["marigold"].extend(
[
"MarigoldDepthPipeline",
"MarigoldIntrinsicsPipeline",
"MarigoldNormalsPipeline",
]
)
......@@ -603,6 +604,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
from .lumina2 import Lumina2Text2ImgPipeline
from .marigold import (
MarigoldDepthPipeline,
MarigoldIntrinsicsPipeline,
MarigoldNormalsPipeline,
)
from .mochi import MochiPipeline
......
......@@ -23,6 +23,7 @@ except OptionalDependencyNotAvailable:
else:
_import_structure["marigold_image_processing"] = ["MarigoldImageProcessor"]
_import_structure["pipeline_marigold_depth"] = ["MarigoldDepthOutput", "MarigoldDepthPipeline"]
_import_structure["pipeline_marigold_intrinsics"] = ["MarigoldIntrinsicsOutput", "MarigoldIntrinsicsPipeline"]
_import_structure["pipeline_marigold_normals"] = ["MarigoldNormalsOutput", "MarigoldNormalsPipeline"]
if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
......@@ -35,6 +36,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
else:
from .marigold_image_processing import MarigoldImageProcessor
from .pipeline_marigold_depth import MarigoldDepthOutput, MarigoldDepthPipeline
from .pipeline_marigold_intrinsics import MarigoldIntrinsicsOutput, MarigoldIntrinsicsPipeline
from .pipeline_marigold_normals import MarigoldNormalsOutput, MarigoldNormalsPipeline
else:
......
from typing import List, Optional, Tuple, Union
# Copyright 2023-2025 Marigold Team, ETH Zürich. All rights reserved.
# Copyright 2024-2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# --------------------------------------------------------------------------
# More information and citation instructions are available on the
# Marigold project website: https://marigoldcomputervision.github.io
# --------------------------------------------------------------------------
from typing import Any, Dict, List, Optional, Tuple, Union
import numpy as np
import PIL
......@@ -379,7 +397,7 @@ class MarigoldImageProcessor(ConfigMixin):
val_min: float = 0.0,
val_max: float = 1.0,
color_map: str = "Spectral",
) -> Union[PIL.Image.Image, List[PIL.Image.Image]]:
) -> List[PIL.Image.Image]:
"""
Visualizes depth maps, such as predictions of the `MarigoldDepthPipeline`.
......@@ -391,7 +409,7 @@ class MarigoldImageProcessor(ConfigMixin):
color_map (`str`, *optional*, defaults to `"Spectral"`): Color map used to convert a single-channel
depth prediction into colored representation.
Returns: `PIL.Image.Image` or `List[PIL.Image.Image]` with depth maps visualization.
Returns: `List[PIL.Image.Image]` with depth maps visualization.
"""
if val_max <= val_min:
raise ValueError(f"Invalid values range: [{val_min}, {val_max}].")
......@@ -436,7 +454,7 @@ class MarigoldImageProcessor(ConfigMixin):
depth: Union[np.ndarray, torch.Tensor, List[np.ndarray], List[torch.Tensor]],
val_min: float = 0.0,
val_max: float = 1.0,
) -> Union[PIL.Image.Image, List[PIL.Image.Image]]:
) -> List[PIL.Image.Image]:
def export_depth_to_16bit_png_one(img, idx=None):
prefix = "Depth" + (f"[{idx}]" if idx else "")
if not isinstance(img, np.ndarray) and not torch.is_tensor(img):
......@@ -478,7 +496,7 @@ class MarigoldImageProcessor(ConfigMixin):
flip_x: bool = False,
flip_y: bool = False,
flip_z: bool = False,
) -> Union[PIL.Image.Image, List[PIL.Image.Image]]:
) -> List[PIL.Image.Image]:
"""
Visualizes surface normals, such as predictions of the `MarigoldNormalsPipeline`.
......@@ -492,7 +510,7 @@ class MarigoldImageProcessor(ConfigMixin):
flip_z (`bool`, *optional*, defaults to `False`): Flips the Z axis of the normals frame of reference.
Default direction is facing the observer.
Returns: `PIL.Image.Image` or `List[PIL.Image.Image]` with surface normals visualization.
Returns: `List[PIL.Image.Image]` with surface normals visualization.
"""
flip_vec = None
if any((flip_x, flip_y, flip_z)):
......@@ -528,6 +546,99 @@ class MarigoldImageProcessor(ConfigMixin):
else:
raise ValueError(f"Unexpected input type: {type(normals)}")
@staticmethod
def visualize_intrinsics(
prediction: Union[
np.ndarray,
torch.Tensor,
List[np.ndarray],
List[torch.Tensor],
],
target_properties: Dict[str, Any],
color_map: Union[str, Dict[str, str]] = "binary",
) -> List[Dict[str, PIL.Image.Image]]:
"""
Visualizes intrinsic image decomposition, such as predictions of the `MarigoldIntrinsicsPipeline`.
Args:
prediction (`Union[np.ndarray, torch.Tensor, List[np.ndarray], List[torch.Tensor]]`):
Intrinsic image decomposition.
target_properties (`Dict[str, Any]`):
Decomposition properties. Expected entries: `target_names: List[str]` and a dictionary with keys
`prediction_space: str`, `sub_target_names: List[Union[str, Null]]` (must have 3 entries, null for
missing modalities), `up_to_scale: bool`, one for each target and sub-target.
color_map (`Union[str, Dict[str, str]]`, *optional*, defaults to `"Spectral"`):
Color map used to convert a single-channel predictions into colored representations. When a dictionary
is passed, each modality can be colored with its own color map.
Returns: `List[Dict[str, PIL.Image.Image]]` with intrinsic image decomposition visualization.
"""
if "target_names" not in target_properties:
raise ValueError("Missing `target_names` in target_properties")
if not isinstance(color_map, str) and not (
isinstance(color_map, dict)
and all(isinstance(k, str) and isinstance(v, str) for k, v in color_map.items())
):
raise ValueError("`color_map` must be a string or a dictionary of strings")
n_targets = len(target_properties["target_names"])
def visualize_targets_one(images, idx=None):
# img: [T, 3, H, W]
out = {}
for target_name, img in zip(target_properties["target_names"], images):
img = img.permute(1, 2, 0) # [H, W, 3]
prediction_space = target_properties[target_name].get("prediction_space", "srgb")
if prediction_space == "stack":
sub_target_names = target_properties[target_name]["sub_target_names"]
if len(sub_target_names) != 3 or any(
not (isinstance(s, str) or s is None) for s in sub_target_names
):
raise ValueError(f"Unexpected target sub-names {sub_target_names} in {target_name}")
for i, sub_target_name in enumerate(sub_target_names):
if sub_target_name is None:
continue
sub_img = img[:, :, i]
sub_prediction_space = target_properties[sub_target_name].get("prediction_space", "srgb")
if sub_prediction_space == "linear":
sub_up_to_scale = target_properties[sub_target_name].get("up_to_scale", False)
if sub_up_to_scale:
sub_img = sub_img / max(sub_img.max().item(), 1e-6)
sub_img = sub_img ** (1 / 2.2)
cmap_name = (
color_map if isinstance(color_map, str) else color_map.get(sub_target_name, "binary")
)
sub_img = MarigoldImageProcessor.colormap(sub_img, cmap=cmap_name, bytes=True)
sub_img = PIL.Image.fromarray(sub_img.cpu().numpy())
out[sub_target_name] = sub_img
elif prediction_space == "linear":
up_to_scale = target_properties[target_name].get("up_to_scale", False)
if up_to_scale:
img = img / max(img.max().item(), 1e-6)
img = img ** (1 / 2.2)
elif prediction_space == "srgb":
pass
img = (img * 255).to(dtype=torch.uint8, device="cpu").numpy()
img = PIL.Image.fromarray(img)
out[target_name] = img
return out
if prediction is None or isinstance(prediction, list) and any(o is None for o in prediction):
raise ValueError("Input prediction is `None`")
if isinstance(prediction, (np.ndarray, torch.Tensor)):
prediction = MarigoldImageProcessor.expand_tensor_or_array(prediction)
if isinstance(prediction, np.ndarray):
prediction = MarigoldImageProcessor.numpy_to_pt(prediction) # [N*T,3,H,W]
if not (prediction.ndim == 4 and prediction.shape[1] == 3 and prediction.shape[0] % n_targets == 0):
raise ValueError(f"Unexpected input shape={prediction.shape}, expecting [N*T,3,H,W].")
N_T, _, H, W = prediction.shape
N = N_T // n_targets
prediction = prediction.reshape(N, n_targets, 3, H, W)
return [visualize_targets_one(img, idx) for idx, img in enumerate(prediction)]
elif isinstance(prediction, list):
return [visualize_targets_one(img, idx) for idx, img in enumerate(prediction)]
else:
raise ValueError(f"Unexpected input type: {type(prediction)}")
@staticmethod
def visualize_uncertainty(
uncertainty: Union[
......@@ -537,9 +648,10 @@ class MarigoldImageProcessor(ConfigMixin):
List[torch.Tensor],
],
saturation_percentile=95,
) -> Union[PIL.Image.Image, List[PIL.Image.Image]]:
) -> List[PIL.Image.Image]:
"""
Visualizes dense uncertainties, such as produced by `MarigoldDepthPipeline` or `MarigoldNormalsPipeline`.
Visualizes dense uncertainties, such as produced by `MarigoldDepthPipeline`, `MarigoldNormalsPipeline`, or
`MarigoldIntrinsicsPipeline`.
Args:
uncertainty (`Union[np.ndarray, torch.Tensor, List[np.ndarray], List[torch.Tensor]]`):
......@@ -547,14 +659,15 @@ class MarigoldImageProcessor(ConfigMixin):
saturation_percentile (`int`, *optional*, defaults to `95`):
Specifies the percentile uncertainty value visualized with maximum intensity.
Returns: `PIL.Image.Image` or `List[PIL.Image.Image]` with uncertainty visualization.
Returns: `List[PIL.Image.Image]` with uncertainty visualization.
"""
def visualize_uncertainty_one(img, idx=None):
prefix = "Uncertainty" + (f"[{idx}]" if idx else "")
if img.min() < 0:
raise ValueError(f"{prefix}: unexected data range, min={img.min()}.")
img = img.squeeze(0).cpu().numpy()
raise ValueError(f"{prefix}: unexpected data range, min={img.min()}.")
img = img.permute(1, 2, 0) # [H,W,C]
img = img.squeeze(2).cpu().numpy() # [H,W] or [H,W,3]
saturation_value = np.percentile(img, saturation_percentile)
img = np.clip(img * 255 / saturation_value, 0, 255)
img = img.astype(np.uint8)
......@@ -566,9 +679,9 @@ class MarigoldImageProcessor(ConfigMixin):
if isinstance(uncertainty, (np.ndarray, torch.Tensor)):
uncertainty = MarigoldImageProcessor.expand_tensor_or_array(uncertainty)
if isinstance(uncertainty, np.ndarray):
uncertainty = MarigoldImageProcessor.numpy_to_pt(uncertainty) # [N,1,H,W]
if not (uncertainty.ndim == 4 and uncertainty.shape[1] == 1):
raise ValueError(f"Unexpected input shape={uncertainty.shape}, expecting [N,1,H,W].")
uncertainty = MarigoldImageProcessor.numpy_to_pt(uncertainty) # [N,C,H,W]
if not (uncertainty.ndim == 4 and uncertainty.shape[1] in (1, 3)):
raise ValueError(f"Unexpected input shape={uncertainty.shape}, expecting [N,C,H,W] with C in (1,3).")
return [visualize_uncertainty_one(img, idx) for idx, img in enumerate(uncertainty)]
elif isinstance(uncertainty, list):
return [visualize_uncertainty_one(img, idx) for idx, img in enumerate(uncertainty)]
......
# Copyright 2024 Marigold authors, PRS ETH Zurich. All rights reserved.
# Copyright 2024 The HuggingFace Team. All rights reserved.
# Copyright 2023-2025 Marigold Team, ETH Zürich. All rights reserved.
# Copyright 2024-2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -14,7 +14,7 @@
# limitations under the License.
# --------------------------------------------------------------------------
# More information and citation instructions are available on the
# Marigold project website: https://marigoldmonodepth.github.io
# Marigold project website: https://marigoldcomputervision.github.io
# --------------------------------------------------------------------------
from dataclasses import dataclass
from functools import partial
......@@ -64,7 +64,7 @@ Examples:
>>> import torch
>>> pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
... "prs-eth/marigold-depth-lcm-v1-0", variant="fp16", torch_dtype=torch.float16
... "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16
... ).to("cuda")
>>> image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
......@@ -86,11 +86,12 @@ class MarigoldDepthOutput(BaseOutput):
Args:
prediction (`np.ndarray`, `torch.Tensor`):
Predicted depth maps with values in the range [0, 1]. The shape is always $numimages \times 1 \times height
\times width$, regardless of whether the images were passed as a 4D array or a list.
Predicted depth maps with values in the range [0, 1]. The shape is $numimages \times 1 \times height \times
width$ for `torch.Tensor` or $numimages \times height \times width \times 1$ for `np.ndarray`.
uncertainty (`None`, `np.ndarray`, `torch.Tensor`):
Uncertainty maps computed from the ensemble, with values in the range [0, 1]. The shape is $numimages
\times 1 \times height \times width$.
\times 1 \times height \times width$ for `torch.Tensor` or $numimages \times height \times width \times 1$
for `np.ndarray`.
latent (`None`, `torch.Tensor`):
Latent features corresponding to the predictions, compatible with the `latents` argument of the pipeline.
The shape is $numimages * numensemble \times 4 \times latentheight \times latentwidth$.
......@@ -208,6 +209,11 @@ class MarigoldDepthPipeline(DiffusionPipeline):
output_type: str,
output_uncertainty: bool,
) -> int:
actual_vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1)
if actual_vae_scale_factor != self.vae_scale_factor:
raise ValueError(
f"`vae_scale_factor` computed at initialization ({self.vae_scale_factor}) differs from the actual one ({actual_vae_scale_factor})."
)
if num_inference_steps is None:
raise ValueError("`num_inference_steps` is not specified and could not be resolved from the model config.")
if num_inference_steps < 1:
......@@ -320,6 +326,7 @@ class MarigoldDepthPipeline(DiffusionPipeline):
return num_images
@torch.compiler.disable
def progress_bar(self, iterable=None, total=None, desc=None, leave=True):
if not hasattr(self, "_progress_bar_config"):
self._progress_bar_config = {}
......@@ -370,11 +377,9 @@ class MarigoldDepthPipeline(DiffusionPipeline):
same width and height.
num_inference_steps (`int`, *optional*, defaults to `None`):
Number of denoising diffusion steps during inference. The default value `None` results in automatic
selection. The number of steps should be at least 10 with the full Marigold models, and between 1 and 4
for Marigold-LCM models.
selection.
ensemble_size (`int`, defaults to `1`):
Number of ensemble predictions. Recommended values are 5 and higher for better precision, or 1 for
faster inference.
Number of ensemble predictions. Higher values result in measurable improvements and visual degradation.
processing_resolution (`int`, *optional*, defaults to `None`):
Effective processing resolution. When set to `0`, matches the larger input image dimension. This
produces crisper predictions, but may also lead to the overall loss of global context. The default
......@@ -486,9 +491,7 @@ class MarigoldDepthPipeline(DiffusionPipeline):
# `pred_latent` variable. The variable `image_latent` is of the same shape: it contains each input image encoded
# into latent space and replicated `E` times. The latents can be either generated (see `generator` to ensure
# reproducibility), or passed explicitly via the `latents` argument. The latter can be set outside the pipeline
# code. For example, in the Marigold-LCM video processing demo, the latents initialization of a frame is taken
# as a convex combination of the latents output of the pipeline for the previous frame and a newly-sampled
# noise. This behavior can be achieved by setting the `output_latent` argument to `True`. The latent space
# code. This behavior can be achieved by setting the `output_latent` argument to `True`. The latent space
# dimensions are `(h, w)`. Encoding into latent space happens in batches of size `batch_size`.
# Model invocation: self.vae.encoder.
image_latent, pred_latent = self.prepare_latents(
......@@ -733,6 +736,7 @@ class MarigoldDepthPipeline(DiffusionPipeline):
param = init_s.cpu().numpy()
else:
raise ValueError("Unrecognized alignment.")
param = param.astype(np.float64)
return param
......@@ -775,7 +779,7 @@ class MarigoldDepthPipeline(DiffusionPipeline):
if regularizer_strength > 0:
prediction, _ = ensemble(depth_aligned, return_uncertainty=False)
err_near = (0.0 - prediction.min()).abs().item()
err_near = prediction.min().abs().item()
err_far = (1.0 - prediction.max()).abs().item()
cost += (err_near + err_far) * regularizer_strength
......
# Copyright 2024 Marigold authors, PRS ETH Zurich. All rights reserved.
# Copyright 2024 The HuggingFace Team. All rights reserved.
# Copyright 2023-2025 Marigold Team, ETH Zürich. All rights reserved.
# Copyright 2024-2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -14,7 +14,7 @@
# limitations under the License.
# --------------------------------------------------------------------------
# More information and citation instructions are available on the
# Marigold project website: https://marigoldmonodepth.github.io
# Marigold project website: https://marigoldcomputervision.github.io
# --------------------------------------------------------------------------
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple, Union
......@@ -62,7 +62,7 @@ Examples:
>>> import torch
>>> pipe = diffusers.MarigoldNormalsPipeline.from_pretrained(
... "prs-eth/marigold-normals-lcm-v0-1", variant="fp16", torch_dtype=torch.float16
... "prs-eth/marigold-normals-v1-1", variant="fp16", torch_dtype=torch.float16
... ).to("cuda")
>>> image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
......@@ -81,11 +81,12 @@ class MarigoldNormalsOutput(BaseOutput):
Args:
prediction (`np.ndarray`, `torch.Tensor`):
Predicted normals with values in the range [-1, 1]. The shape is always $numimages \times 3 \times height
\times width$, regardless of whether the images were passed as a 4D array or a list.
Predicted normals with values in the range [-1, 1]. The shape is $numimages \times 3 \times height \times
width$ for `torch.Tensor` or $numimages \times height \times width \times 3$ for `np.ndarray`.
uncertainty (`None`, `np.ndarray`, `torch.Tensor`):
Uncertainty maps computed from the ensemble, with values in the range [0, 1]. The shape is $numimages
\times 1 \times height \times width$.
\times 1 \times height \times width$ for `torch.Tensor` or $numimages \times height \times width \times 1$
for `np.ndarray`.
latent (`None`, `torch.Tensor`):
Latent features corresponding to the predictions, compatible with the `latents` argument of the pipeline.
The shape is $numimages * numensemble \times 4 \times latentheight \times latentwidth$.
......@@ -164,6 +165,7 @@ class MarigoldNormalsPipeline(DiffusionPipeline):
tokenizer=tokenizer,
)
self.register_to_config(
prediction_type=prediction_type,
use_full_z_range=use_full_z_range,
default_denoising_steps=default_denoising_steps,
default_processing_resolution=default_processing_resolution,
......@@ -194,6 +196,11 @@ class MarigoldNormalsPipeline(DiffusionPipeline):
output_type: str,
output_uncertainty: bool,
) -> int:
actual_vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1)
if actual_vae_scale_factor != self.vae_scale_factor:
raise ValueError(
f"`vae_scale_factor` computed at initialization ({self.vae_scale_factor}) differs from the actual one ({actual_vae_scale_factor})."
)
if num_inference_steps is None:
raise ValueError("`num_inference_steps` is not specified and could not be resolved from the model config.")
if num_inference_steps < 1:
......@@ -304,6 +311,7 @@ class MarigoldNormalsPipeline(DiffusionPipeline):
return num_images
@torch.compiler.disable
def progress_bar(self, iterable=None, total=None, desc=None, leave=True):
if not hasattr(self, "_progress_bar_config"):
self._progress_bar_config = {}
......@@ -354,11 +362,9 @@ class MarigoldNormalsPipeline(DiffusionPipeline):
same width and height.
num_inference_steps (`int`, *optional*, defaults to `None`):
Number of denoising diffusion steps during inference. The default value `None` results in automatic
selection. The number of steps should be at least 10 with the full Marigold models, and between 1 and 4
for Marigold-LCM models.
selection.
ensemble_size (`int`, defaults to `1`):
Number of ensemble predictions. Recommended values are 5 and higher for better precision, or 1 for
faster inference.
Number of ensemble predictions. Higher values result in measurable improvements and visual degradation.
processing_resolution (`int`, *optional*, defaults to `None`):
Effective processing resolution. When set to `0`, matches the larger input image dimension. This
produces crisper predictions, but may also lead to the overall loss of global context. The default
......@@ -394,7 +400,7 @@ class MarigoldNormalsPipeline(DiffusionPipeline):
within the ensemble. These codes can be saved, modified, and used for subsequent calls with the
`latents` argument.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~pipelines.marigold.MarigoldDepthOutput`] instead of a plain tuple.
Whether or not to return a [`~pipelines.marigold.MarigoldNormalsOutput`] instead of a plain tuple.
Examples:
......@@ -462,9 +468,7 @@ class MarigoldNormalsPipeline(DiffusionPipeline):
# `pred_latent` variable. The variable `image_latent` is of the same shape: it contains each input image encoded
# into latent space and replicated `E` times. The latents can be either generated (see `generator` to ensure
# reproducibility), or passed explicitly via the `latents` argument. The latter can be set outside the pipeline
# code. For example, in the Marigold-LCM video processing demo, the latents initialization of a frame is taken
# as a convex combination of the latents output of the pipeline for the previous frame and a newly-sampled
# noise. This behavior can be achieved by setting the `output_latent` argument to `True`. The latent space
# code. This behavior can be achieved by setting the `output_latent` argument to `True`. The latent space
# dimensions are `(h, w)`. Encoding into latent space happens in batches of size `batch_size`.
# Model invocation: self.vae.encoder.
image_latent, pred_latent = self.prepare_latents(
......
......@@ -1217,6 +1217,21 @@ class MarigoldDepthPipeline(metaclass=DummyObject):
requires_backends(cls, ["torch", "transformers"])
class MarigoldIntrinsicsPipeline(metaclass=DummyObject):
_backends = ["torch", "transformers"]
def __init__(self, *args, **kwargs):
requires_backends(self, ["torch", "transformers"])
@classmethod
def from_config(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
@classmethod
def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
class MarigoldNormalsPipeline(metaclass=DummyObject):
_backends = ["torch", "transformers"]
......
# Copyright 2024 Marigold authors, PRS ETH Zurich. All rights reserved.
# Copyright 2024 The HuggingFace Team. All rights reserved.
# Copyright 2023-2025 Marigold Team, ETH Zürich. All rights reserved.
# Copyright 2024-2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -14,7 +14,7 @@
# limitations under the License.
# --------------------------------------------------------------------------
# More information and citation instructions are available on the
# Marigold project website: https://marigoldmonodepth.github.io
# Marigold project website: https://marigoldcomputervision.github.io
# --------------------------------------------------------------------------
import gc
import random
......
This diff is collapsed.
# Copyright 2024 Marigold authors, PRS ETH Zurich. All rights reserved.
# Copyright 2024 The HuggingFace Team. All rights reserved.
# Copyright 2023-2025 Marigold Team, ETH Zürich. All rights reserved.
# Copyright 2024-2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -14,7 +14,7 @@
# limitations under the License.
# --------------------------------------------------------------------------
# More information and citation instructions are available on the
# Marigold project website: https://marigoldmonodepth.github.io
# Marigold project website: https://marigoldcomputervision.github.io
# --------------------------------------------------------------------------
import gc
import random
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment