Unverified Commit 2c45a53a authored by Steven Liu, committed by GitHub

[docs] Shap-E guide (#4700)

* first draft

* fixes

* more fixes

* fix toctree
parent 22ea35cf
@@ -66,6 +66,8 @@
title: Stable Diffusion XL
- local: using-diffusers/controlnet
title: ControlNet
- local: using-diffusers/shap-e
title: Shap-E
- local: using-diffusers/diffedit
title: DiffEdit
- local: using-diffusers/distilled_sd
...
@@ -19,163 +19,10 @@ The original codebase can be found at [openai/shap-e](https://github.com/openai/shap-e).
<Tip>
See the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
## Usage Examples
In the following, we will walk you through some examples of how to use Shap-E pipelines to create 3D objects in gif format.
### Text-to-3D image generation
We can use [`ShapEPipeline`] to create a 3D object based on a text prompt. In this example, we will make a birthday cupcake for the :firecracker: diffusers library's 1-year birthday. The workflow for the Shap-E text-to-image pipeline is the same as for other text-to-image pipelines in diffusers.
```python
import torch
from diffusers import DiffusionPipeline
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
repo = "openai/shap-e"
pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16)
pipe = pipe.to(device)
guidance_scale = 15.0
prompt = ["A firecracker", "A birthday cupcake"]
images = pipe(
prompt,
guidance_scale=guidance_scale,
num_inference_steps=64,
frame_size=256,
).images
```
The output of [`ShapEPipeline`] is a list of lists of image frames. Each list of frames can be used to create a 3D object. Let's use the `export_to_gif` utility function in diffusers to make a 3D cupcake!
```python
from diffusers.utils import export_to_gif
export_to_gif(images[0], "firecracker_3d.gif")
export_to_gif(images[1], "cake_3d.gif")
```
![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/firecracker_out.gif)
![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/cake_out.gif)
### Image-to-Image generation
You can use [`ShapEImg2ImgPipeline`] along with other text-to-image pipelines in diffusers and turn your 2D generation into 3D.
In this example, we will first generate a cheeseburger with the simple prompt "A cheeseburger, white background".
```python
from diffusers import DiffusionPipeline
import torch
pipe_prior = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16)
pipe_prior.to("cuda")
t2i_pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
t2i_pipe.to("cuda")
prompt = "A cheeseburger, white background"
image_embeds, negative_image_embeds = pipe_prior(prompt, guidance_scale=1.0).to_tuple()
image = t2i_pipe(
prompt,
image_embeds=image_embeds,
negative_image_embeds=negative_image_embeds,
).images[0]
image.save("burger.png")
```
![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/burger_in.png)
We will then use the Shap-E image-to-image pipeline to turn it into a 3D cheeseburger :)
```python
import torch
from PIL import Image
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_gif
repo = "openai/shap-e-img2img"
pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
guidance_scale = 3.0
image = Image.open("burger.png").resize((256, 256))
images = pipe(
image,
guidance_scale=guidance_scale,
num_inference_steps=64,
frame_size=256,
).images
gif_path = export_to_gif(images[0], "burger_3d.gif")
```
![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/burger_out.gif)
### Generate mesh
For both [`ShapEPipeline`] and [`ShapEImg2ImgPipeline`], you can generate mesh output by passing `output_type` as `mesh` to the pipeline, and then use the [`~utils.export_to_ply`] utility function to save the output as a `ply` file. We also provide a [`~utils.export_to_obj`] function that you can use to save mesh outputs as `obj` files.
```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_ply
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
repo = "openai/shap-e"
pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16, variant="fp16")
pipe = pipe.to(device)
guidance_scale = 15.0
prompt = "A birthday cupcake"
images = pipe(prompt, guidance_scale=guidance_scale, num_inference_steps=64, frame_size=256, output_type="mesh").images
ply_path = export_to_ply(images[0], "3d_cake.ply")
print(f"saved to folder: {ply_path}")
```
Hugging Face Datasets supports mesh visualization for mesh files in `glb` format. Below we will show you how to convert your mesh file into `glb` format so that you can use the Dataset viewer to render 3D objects.
We need to install the `trimesh` library:
```
pip install trimesh
```
To convert the mesh file into the `glb` format:
```python
import trimesh
mesh = trimesh.load("3d_cake.ply")
mesh.export("3d_cake.glb", file_type="glb")
```
By default, the mesh output of Shap-E is rendered from the bottom viewpoint; you can change the default viewpoint by applying a rotation transformation:
```python
import trimesh
import numpy as np
mesh = trimesh.load("3d_cake.ply")
rot = trimesh.transformations.rotation_matrix(-np.pi / 2, [1, 0, 0])
mesh = mesh.apply_transform(rot)
mesh.export("3d_cake.glb", file_type="glb")
```
Now you can upload your mesh file to your dataset and visualize it! Here is a link to the 3D cake we just generated:
https://huggingface.co/datasets/hf-internal-testing/diffusers-images/blob/main/shap_e/3d_cake.glb
## ShapEPipeline
[[autodoc]] ShapEPipeline
- all
...
@@ -18,6 +18,10 @@ Utility and helper functions for working with 🤗 Diffusers.
[[autodoc]] utils.testing_utils.load_image
## export_to_gif
[[autodoc]] utils.testing_utils.export_to_gif
## export_to_video
[[autodoc]] utils.testing_utils.export_to_video
...
# Shap-E
[[open-in-colab]]
Shap-E is a conditional model for generating 3D assets that can be used for video game development, interior design, and architecture. It is trained on a large dataset of 3D assets, and post-processed to render more views of each object and produce 16K instead of 4K point clouds. The Shap-E model is trained in two steps:
1. an encoder accepts the point clouds and rendered views of a 3D asset and outputs the parameters of implicit functions that represent the asset
2. a diffusion model is trained on the latents produced by the encoder to generate either neural radiance fields (NeRFs) or a textured 3D mesh, making it easier to render and use the 3D asset in downstream applications (these two stages map onto the pipeline's components, as sketched after the installation cell below)
This guide will show you how to use Shap-E to start generating your own 3D assets!
Before you begin, make sure you have the following libraries installed:
```py
# uncomment to install the necessary libraries in Colab
#!pip install diffusers transformers accelerate safetensors trimesh
```
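Before diving in, it can help to see how the two training stages described above map onto the pipeline's components: the diffusion `prior` (with its `text_encoder` and `tokenizer`), a `scheduler`, and the `shap_e_renderer` that decodes the generated latents into rendered views or a mesh. A minimal sketch for inspecting them (the checkpoint is downloaded on first use):

```py
from diffusers import ShapEPipeline

pipe = ShapEPipeline.from_pretrained("openai/shap-e")
# expected keys, per the ShapEPipeline API docs:
# prior, text_encoder, tokenizer, scheduler, shap_e_renderer
print(list(pipe.components.keys()))
```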
## Text-to-3D
To generate a gif of a 3D object, pass a text prompt to the [`ShapEPipeline`]. The pipeline generates a list of image frames which are used to create the 3D object.
```py
import torch
from diffusers import ShapEPipeline
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipe = ShapEPipeline.from_pretrained("openai/shap-e", torch_dtype=torch.float16, variant="fp16", use_safetensors=True)
pipe = pipe.to(device)
guidance_scale = 15.0
prompt = ["A firecracker", "A birthday cupcake"]
images = pipe(
prompt,
guidance_scale=guidance_scale,
num_inference_steps=64,
frame_size=256,
).images
```
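Each prompt returns its own list of frames, so the output is a nested list. A quick way to check the structure (a minimal sketch; the exact number of frames depends on the pipeline defaults):

```py
print(len(images))        # 2 entries, one per prompt
print(len(images[0]))     # number of rendered frames for the first prompt
print(images[0][0].size)  # each frame is a PIL image, (256, 256) with frame_size=256
```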
Now use the [`~utils.export_to_gif`] function to turn the list of image frames into a gif of the 3D object.
```py
from diffusers.utils import export_to_gif
export_to_gif(images[0], "firecracker_3d.gif")
export_to_gif(images[1], "cake_3d.gif")
```
<div class="flex gap-4">
<div>
<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/firecracker_out.gif"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">firecracker</figcaption>
</div>
<div>
<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/cake_out.gif"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">cupcake</figcaption>
</div>
</div>
## Image-to-3D
To generate a 3D object from another image, use the [`ShapEImg2ImgPipeline`]. You can use an existing image or generate an entirely new one. Let's use the [Kandinsky 2.1](../api/pipelines/kandinsky) model to generate a new image.
```py
from diffusers import DiffusionPipeline
import torch
prior_pipeline = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
pipeline = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
prompt = "A cheeseburger, white background"
image_embeds, negative_image_embeds = prior_pipeline(prompt, guidance_scale=1.0).to_tuple()
image = pipeline(
prompt,
image_embeds=image_embeds,
negative_image_embeds=negative_image_embeds,
).images[0]
image.save("burger.png")
```
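If you already have a 2D image, you can skip the generation step and load it with the [`~utils.load_image`] helper instead (a minimal sketch that reuses the example cheeseburger image from this guide):

```py
from diffusers.utils import load_image

# load an existing image from a URL (or a local path) instead of generating one
image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/burger_in.png"
)
image.save("burger.png")
```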
Pass the cheeseburger to the [`ShapEImg2ImgPipeline`] to generate a 3D representation of it.
```py
import torch
from PIL import Image
from diffusers import ShapEImg2ImgPipeline
from diffusers.utils import export_to_gif

pipe = ShapEImg2ImgPipeline.from_pretrained("openai/shap-e-img2img", torch_dtype=torch.float16, variant="fp16").to("cuda")
guidance_scale = 3.0
image = Image.open("burger.png").resize((256, 256))
images = pipe(
image,
guidance_scale=guidance_scale,
num_inference_steps=64,
frame_size=256,
).images
gif_path = export_to_gif(images[0], "burger_3d.gif")
```
<div class="flex gap-4">
<div>
<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/burger_in.png"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">cheeseburger</figcaption>
</div>
<div>
<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/burger_out.gif"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">3D cheeseburger</figcaption>
</div>
</div>
## Generate mesh
Shap-E is a flexible model that can also generate textured mesh outputs to be rendered for downstream applications. In this example, you'll convert the output into a `glb` file because the 🤗 Datasets library supports mesh visualization of `glb` files which can be rendered by the [Dataset viewer](https://huggingface.co/docs/hub/datasets-viewer#dataset-preview).
You can generate mesh outputs for both the [`ShapEPipeline`] and [`ShapEImg2ImgPipeline`] by specifying the `output_type` parameter as `"mesh"`:
```py
import torch
from diffusers import ShapEPipeline
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipe = ShapEPipeline.from_pretrained("openai/shap-e", torch_dtype=torch.float16, variant="fp16", use_safetensors=True)
pipe = pipe.to(device)
guidance_scale = 15.0
prompt = "A birthday cupcake"
images = pipe(prompt, guidance_scale=guidance_scale, num_inference_steps=64, frame_size=256, output_type="mesh").images
```
Use the [`~utils.export_to_ply`] function to save the mesh output as a `ply` file:
<Tip>
You can optionally save the mesh output as an `obj` file with the [`~utils.export_to_obj`] function. The ability to save the mesh output in a variety of formats makes it more flexible for downstream usage!
</Tip>
```py
from diffusers.utils import export_to_ply
ply_path = export_to_ply(images[0], "3d_cake.ply")
print(f"saved to folder: {ply_path}")
```
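If you'd rather have an `obj` file, the [`~utils.export_to_obj`] function mentioned in the tip above works the same way (a minimal sketch that reuses the mesh output in `images`):

```py
from diffusers.utils import export_to_obj

# save the same mesh output as an obj file instead of a ply file
obj_path = export_to_obj(images[0], "3d_cake.obj")
```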
Then you can convert the `ply` file to a `glb` file with the trimesh library:
```py
import trimesh
mesh = trimesh.load("3d_cake.ply")
mesh.export("3d_cake.glb", file_type="glb")
```
By default, the mesh output is rendered from the bottom viewpoint, but you can change the viewpoint by applying a rotation transform:
```py
import trimesh
import numpy as np
mesh = trimesh.load("3d_cake.ply")
rot = trimesh.transformations.rotation_matrix(-np.pi / 2, [1, 0, 0])
mesh = mesh.apply_transform(rot)
mesh.export("3d_cake.glb", file_type="glb")
```
Upload the mesh file to your dataset repository to visualize it with the Dataset viewer!
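One way to upload the file programmatically is with the `huggingface_hub` client (a minimal sketch; the dataset repo id is a placeholder and you need write access to it):

```py
from huggingface_hub import upload_file

upload_file(
    path_or_fileobj="3d_cake.glb",
    path_in_repo="3d_cake.glb",
    repo_id="your-username/your-3d-assets",  # hypothetical dataset repo
    repo_type="dataset",
)
```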
<div class="flex justify-center">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/3D-cake.gif"/>
</div>
\ No newline at end of file
@@ -80,23 +80,23 @@ class ShapEPipelineOutput(BaseOutput):
class ShapEPipeline(DiffusionPipeline):
    """
    Pipeline for generating latent representation of a 3D asset and rendering with the NeRF method.

    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
    implemented for all pipelines (downloading, saving, running on a particular device, etc.).

    Args:
        prior ([`PriorTransformer`]):
            The canonical unCLIP prior to approximate the image embedding from the text embedding.
        text_encoder ([`~transformers.CLIPTextModelWithProjection`]):
            Frozen text-encoder.
        tokenizer ([`~transformers.CLIPTokenizer`]):
            A `CLIPTokenizer` to tokenize text.
        scheduler ([`HeunDiscreteScheduler`]):
            A scheduler to be used in combination with the `prior` model to generate image embedding.
        shap_e_renderer ([`ShapERenderer`]):
            Shap-E renderer projects the generated latents into parameters of an MLP to create 3D objects with the
            NeRF rendering method.
    """

    def __init__(
@@ -241,12 +241,11 @@ class ShapEPipeline(DiffusionPipeline):
            guidance_scale (`float`, *optional*, defaults to 4.0):
                A higher guidance scale value encourages the model to generate images closely linked to the text
                `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
            frame_size (`int`, *optional*, defaults to 64):
                The width and height of each image frame of the generated 3D output.
            output_type (`str`, *optional*, defaults to `"pil"`):
                The output format of the generated image. Choose between `"pil"` (`PIL.Image.Image`), `"np"`
                (`np.array`), `"latent"` (`torch.Tensor`), or mesh ([`MeshDecoderOutput`]).
            return_dict (`bool`, *optional*, defaults to `True`):
                Whether or not to return a [`~pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput`] instead of a plain
                tuple.
...
@@ -79,8 +79,7 @@ class ShapEPipelineOutput(BaseOutput):
class ShapEImg2ImgPipeline(DiffusionPipeline):
    """
    Pipeline for generating latent representation of a 3D asset and rendering with the NeRF method from an image.

    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
    implemented for all pipelines (downloading, saving, running on a particular device, etc.).
@@ -88,15 +87,15 @@ class ShapEImg2ImgPipeline(DiffusionPipeline):
    Args:
        prior ([`PriorTransformer`]):
            The canonical unCLIP prior to approximate the image embedding from the text embedding.
        image_encoder ([`~transformers.CLIPVisionModel`]):
            Frozen image-encoder.
        image_processor ([`~transformers.CLIPImageProcessor`]):
            A `CLIPImageProcessor` to process images.
        scheduler ([`HeunDiscreteScheduler`]):
            A scheduler to be used in combination with the `prior` model to generate image embedding.
        shap_e_renderer ([`ShapERenderer`]):
            Shap-E renderer projects the generated latents into parameters of an MLP to create 3D objects with the
            NeRF rendering method.
    """

    def __init__(
@@ -179,10 +178,10 @@ class ShapEImg2ImgPipeline(DiffusionPipeline):
        Args:
            image (`torch.FloatTensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.FloatTensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`):
                `Image` or tensor representing an image batch to be used as the starting point. Can also accept image
                latents as `image`, but if latents are passed directly they are not encoded again.
            num_images_per_prompt (`int`, *optional*, defaults to 1):
                The number of images to generate per prompt.
            num_inference_steps (`int`, *optional*, defaults to 25):
                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                expense of slower inference.
            generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
@@ -197,8 +196,9 @@ class ShapEImg2ImgPipeline(DiffusionPipeline):
                `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
            frame_size (`int`, *optional*, defaults to 64):
                The width and height of each image frame of the generated 3D output.
            output_type (`str`, *optional*, defaults to `"pil"`):
                The output format of the generated image. Choose between `"pil"` (`PIL.Image.Image`), `"np"`
                (`np.array`), `"latent"` (`torch.Tensor`), or mesh ([`MeshDecoderOutput`]).
            return_dict (`bool`, *optional*, defaults to `True`):
                Whether or not to return a [`~pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput`] instead of a plain
                tuple.
...