Unverified Commit 2c45a53a authored by Steven Liu, committed by GitHub

[docs] Shap-E guide (#4700)

* first draft

* fixes

* more fixes

* fix toctree
parent 22ea35cf
@@ -66,6 +66,8 @@
title: Stable Diffusion XL
- local: using-diffusers/controlnet
title: ControlNet
- local: using-diffusers/shap-e
title: Shap-E
- local: using-diffusers/diffedit
title: DiffEdit
- local: using-diffusers/distilled_sd
...
@@ -19,163 +19,10 @@ The original codebase can be found at [openai/shap-e](https://github.com/openai/shap-e).
<Tip>
See the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.
</Tip>
## Usage Examples
In the following, we will walk you through some examples of how to use Shap-E pipelines to create 3D objects in gif format.
### Text-to-3D image generation
We can use [`ShapEPipeline`] to create a 3D object based on a text prompt. In this example, we will make a birthday cupcake for the :firecracker: diffusers library's 1-year birthday. The workflow for the Shap-E text-to-image pipeline is the same as for other text-to-image pipelines in diffusers.
```python
import torch
from diffusers import DiffusionPipeline
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
repo = "openai/shap-e"
pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16)
pipe = pipe.to(device)
guidance_scale = 15.0
prompt = ["A firecracker", "A birthday cupcake"]
images = pipe(
prompt,
guidance_scale=guidance_scale,
num_inference_steps=64,
frame_size=256,
).images
```
The output of [`ShapEPipeline`] is a list of lists of image frames. Each list of frames can be used to create a 3D object. Let's use the `export_to_gif` utility function in diffusers to make a 3D cupcake!
```python
from diffusers.utils import export_to_gif
export_to_gif(images[0], "firecracker_3d.gif")
export_to_gif(images[1], "cake_3d.gif")
```
![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/firecracker_out.gif)
![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/cake_out.gif)
### Image-to-Image generation
You can use [`ShapEImg2ImgPipeline`] along with other text-to-image pipelines in diffusers and turn your 2D generation into 3D.
In this example, we will first generate a cheeseburger with the simple prompt "A cheeseburger, white background".
```python
from diffusers import DiffusionPipeline
import torch
pipe_prior = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16)
pipe_prior.to("cuda")
t2i_pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
t2i_pipe.to("cuda")
prompt = "A cheeseburger, white background"
image_embeds, negative_image_embeds = pipe_prior(prompt, guidance_scale=1.0).to_tuple()
image = t2i_pipe(
prompt,
image_embeds=image_embeds,
negative_image_embeds=negative_image_embeds,
).images[0]
image.save("burger.png")
```
![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/burger_in.png)
We will then use the Shap-E image-to-image pipeline to turn it into a 3D cheeseburger :)
```python
import torch
from PIL import Image
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_gif
repo = "openai/shap-e-img2img"
pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
guidance_scale = 3.0
image = Image.open("burger.png").resize((256, 256))
images = pipe(
image,
guidance_scale=guidance_scale,
num_inference_steps=64,
frame_size=256,
).images
gif_path = export_to_gif(images[0], "burger_3d.gif")
```
![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/burger_out.gif)
### Generate mesh
For both [`ShapEPipeline`] and [`ShapEImg2ImgPipeline`], you can generate mesh output by passing `output_type` as `mesh` to the pipeline, and then use the [`~utils.export_to_ply`] utility function to save the output as a `ply` file. We also provide a [`~utils.export_to_obj`] function that you can use to save mesh outputs as `obj` files.
```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_ply
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
repo = "openai/shap-e"
pipe = DiffusionPipeline.from_pretrained(repo, torch_dtype=torch.float16, variant="fp16")
pipe = pipe.to(device)
guidance_scale = 15.0
prompt = "A birthday cupcake"
images = pipe(prompt, guidance_scale=guidance_scale, num_inference_steps=64, frame_size=256, output_type="mesh").images
ply_path = export_to_ply(images[0], "3d_cake.ply")
print(f"saved to folder: {ply_path}")
```
Hugging Face Datasets supports mesh visualization for mesh files in `glb` format. Below we will show you how to convert your mesh file into `glb` format so that you can use the Dataset viewer to render 3D objects.
We need to install the `trimesh` library:
```
pip install trimesh
```
To convert the mesh file into the `glb` format:
```python
import trimesh
mesh = trimesh.load("3d_cake.ply")
mesh.export("3d_cake.glb", file_type="glb")
```
By default, the mesh output of Shap-E is rendered from the bottom viewpoint; you can change the default viewpoint by applying a rotation transformation:
```python
import trimesh
import numpy as np
mesh = trimesh.load("3d_cake.ply")
rot = trimesh.transformations.rotation_matrix(-np.pi / 2, [1, 0, 0])
mesh = mesh.apply_transform(rot)
mesh.export("3d_cake.glb", file_type="glb")
```
Now you can upload your mesh file to your dataset and visualize it! Here is a link to the 3D cake we just generated:
https://huggingface.co/datasets/hf-internal-testing/diffusers-images/blob/main/shap_e/3d_cake.glb
## ShapEPipeline
[[autodoc]] ShapEPipeline
- all
...
@@ -18,6 +18,10 @@ Utility and helper functions for working with 🤗 Diffusers.
[[autodoc]] utils.testing_utils.load_image
## export_to_gif
[[autodoc]] utils.testing_utils.export_to_gif
## export_to_video
[[autodoc]] utils.testing_utils.export_to_video
...
# Shap-E
[[open-in-colab]]
Shap-E is a conditional model for generating 3D assets that can be used for video game development, interior design, and architecture. It is trained on a large dataset of 3D assets, and post-processed to render more views of each object and produce 16K instead of 4K point clouds. The Shap-E model is trained in two steps:
1. an encoder accepts the point clouds and rendered views of a 3D asset and outputs the parameters of implicit functions that represent the asset
2. a diffusion model is trained on the latents produced by the encoder to generate either neural radiance fields (NeRFs) or a textured 3D mesh, making it easier to render and use the 3D asset in downstream applications (these two stages map onto the pipeline's components, as sketched after the installation cell below)
This guide will show you how to use Shap-E to start generating your own 3D assets!
Before you begin, make sure you have the following libraries installed:
```py
# uncomment to install the necessary libraries in Colab
#!pip install diffusers transformers accelerate safetensors trimesh
```
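Before diving in, it can help to see how the two training stages described above map onto the pipeline's components: the diffusion `prior` (with its `text_encoder` and `tokenizer`), a `scheduler`, and the `shap_e_renderer` that decodes the generated latents into rendered views or a mesh. A minimal sketch for inspecting them (the checkpoint is downloaded on first use):

```py
from diffusers import ShapEPipeline

pipe = ShapEPipeline.from_pretrained("openai/shap-e")
# expected keys, per the ShapEPipeline API docs:
# prior, text_encoder, tokenizer, scheduler, shap_e_renderer
print(list(pipe.components.keys()))
```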
## Text-to-3D
To generate a gif of a 3D object, pass a text prompt to the [`ShapEPipeline`]. The pipeline generates a list of image frames which are used to create the 3D object.
```py
import torch
from diffusers import ShapEPipeline
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipe = ShapEPipeline.from_pretrained("openai/shap-e", torch_dtype=torch.float16, variant="fp16", use_safetensors=True)
pipe = pipe.to(device)
guidance_scale = 15.0
prompt = ["A firecracker", "A birthday cupcake"]
images = pipe(
prompt,
guidance_scale=guidance_scale,
num_inference_steps=64,
frame_size=256,
).images
```
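Each prompt returns its own list of frames, so the output is a nested list. A quick way to check the structure (a minimal sketch; the exact number of frames depends on the pipeline defaults):

```py
print(len(images))        # 2 entries, one per prompt
print(len(images[0]))     # number of rendered frames for the first prompt
print(images[0][0].size)  # each frame is a PIL image, (256, 256) with frame_size=256
```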
Now use the [`~utils.export_to_gif`] function to turn the list of image frames into a gif of the 3D object.
```py
from diffusers.utils import export_to_gif
export_to_gif(images[0], "firecracker_3d.gif")
export_to_gif(images[1], "cake_3d.gif")
```
<div class="flex gap-4">
<div>
<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/firecracker_out.gif"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">firecracker</figcaption>
</div>
<div>
<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/cake_out.gif"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">cupcake</figcaption>
</div>
</div>
## Image-to-3D
To generate a 3D object from another image, use the [`ShapEImg2ImgPipeline`]. You can use an existing image or generate an entirely new one. Let's use the [Kandinsky 2.1](../api/pipelines/kandinsky) model to generate a new image.
```py
from diffusers import DiffusionPipeline
import torch
prior_pipeline = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
pipeline = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
prompt = "A cheeseburger, white background"
image_embeds, negative_image_embeds = prior_pipeline(prompt, guidance_scale=1.0).to_tuple()
image = pipeline(
prompt,
image_embeds=image_embeds,
negative_image_embeds=negative_image_embeds,
).images[0]
image.save("burger.png")
```
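If you already have a 2D image, you can skip the generation step and load it with the [`~utils.load_image`] helper instead (a minimal sketch that reuses the example cheeseburger image from this guide):

```py
from diffusers.utils import load_image

# load an existing image from a URL (or a local path) instead of generating one
image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/burger_in.png"
)
image.save("burger.png")
```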
Pass the cheeseburger to the [`ShapEImg2ImgPipeline`] to generate a 3D representation of it.
```py
import torch
from PIL import Image
from diffusers import ShapEImg2ImgPipeline
from diffusers.utils import export_to_gif

pipe = ShapEImg2ImgPipeline.from_pretrained("openai/shap-e-img2img", torch_dtype=torch.float16, variant="fp16").to("cuda")
guidance_scale = 3.0
image = Image.open("burger.png").resize((256, 256))
images = pipe(
image,
guidance_scale=guidance_scale,
num_inference_steps=64,
frame_size=256,
).images
gif_path = export_to_gif(images[0], "burger_3d.gif")
```
<div class="flex gap-4">
<div>
<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/burger_in.png"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">cheeseburger</figcaption>
</div>
<div>
<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/burger_out.gif"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">3D cheeseburger</figcaption>
</div>
</div>
## Generate mesh
Shap-E is a flexible model that can also generate textured mesh outputs to be rendered for downstream applications. In this example, you'll convert the output into a `glb` file because the 🤗 Datasets library supports mesh visualization of `glb` files which can be rendered by the [Dataset viewer](https://huggingface.co/docs/hub/datasets-viewer#dataset-preview).
You can generate mesh outputs for both the [`ShapEPipeline`] and [`ShapEImg2ImgPipeline`] by specifying the `output_type` parameter as `"mesh"`:
```py
import torch
from diffusers import ShapEPipeline
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipe = ShapEPipeline.from_pretrained("openai/shap-e", torch_dtype=torch.float16, variant="fp16", use_safetensors=True)
pipe = pipe.to(device)
guidance_scale = 15.0
prompt = "A birthday cupcake"
images = pipe(prompt, guidance_scale=guidance_scale, num_inference_steps=64, frame_size=256, output_type="mesh").images
```
Use the [`~utils.export_to_ply`] function to save the mesh output as a `ply` file:
<Tip>
You can optionally save the mesh output as an `obj` file with the [`~utils.export_to_obj`] function. The ability to save the mesh output in a variety of formats makes it more flexible for downstream usage!
</Tip>
```py
from diffusers.utils import export_to_ply
ply_path = export_to_ply(images[0], "3d_cake.ply")
print(f"saved to folder: {ply_path}")
```
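If you'd rather have an `obj` file, the [`~utils.export_to_obj`] function mentioned in the tip above works the same way (a minimal sketch that reuses the mesh output in `images`):

```py
from diffusers.utils import export_to_obj

# save the same mesh output as an obj file instead of a ply file
obj_path = export_to_obj(images[0], "3d_cake.obj")
```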
Then you can convert the `ply` file to a `glb` file with the trimesh library:
```py
import trimesh
mesh = trimesh.load("3d_cake.ply")
mesh.export("3d_cake.glb", file_type="glb")
```
By default, the mesh output is rendered from the bottom viewpoint, but you can change the viewpoint by applying a rotation transform:
```py
import trimesh
import numpy as np
mesh = trimesh.load("3d_cake.ply")
rot = trimesh.transformations.rotation_matrix(-np.pi / 2, [1, 0, 0])
mesh = mesh.apply_transform(rot)
mesh.export("3d_cake.glb", file_type="glb")
```
Upload the mesh file to your dataset repository to visualize it with the Dataset viewer!
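One way to upload the file programmatically is with the `huggingface_hub` client (a minimal sketch; the dataset repo id is a placeholder and you need write access to it):

```py
from huggingface_hub import upload_file

upload_file(
    path_or_fileobj="3d_cake.glb",
    path_in_repo="3d_cake.glb",
    repo_id="your-username/your-3d-assets",  # hypothetical dataset repo
    repo_type="dataset",
)
```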
<div class="flex justify-center">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/3D-cake.gif"/>
</div>
\ No newline at end of file
@@ -80,23 +80,23 @@ class ShapEPipelineOutput(BaseOutput):
class ShapEPipeline(DiffusionPipeline):
    """
    Pipeline for generating latent representation of a 3D asset and rendering with the NeRF method.

    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
    implemented for all pipelines (downloading, saving, running on a particular device, etc.).

    Args:
        prior ([`PriorTransformer`]):
            The canonical unCLIP prior to approximate the image embedding from the text embedding.
        text_encoder ([`~transformers.CLIPTextModelWithProjection`]):
            Frozen text-encoder.
        tokenizer ([`~transformers.CLIPTokenizer`]):
            A `CLIPTokenizer` to tokenize text.
        scheduler ([`HeunDiscreteScheduler`]):
            A scheduler to be used in combination with the `prior` model to generate image embedding.
        shap_e_renderer ([`ShapERenderer`]):
            Shap-E renderer projects the generated latents into parameters of an MLP to create 3D objects with the
            NeRF rendering method.
    """

    def __init__(
@@ -241,12 +241,11 @@ class ShapEPipeline(DiffusionPipeline):
            guidance_scale (`float`, *optional*, defaults to 4.0):
                A higher guidance scale value encourages the model to generate images closely linked to the text
                `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
            frame_size (`int`, *optional*, defaults to 64):
                The width and height of each image frame of the generated 3D output.
            output_type (`str`, *optional*, defaults to `"pil"`):
                The output format of the generated image. Choose between `"pil"` (`PIL.Image.Image`), `"np"`
                (`np.array`), `"latent"` (`torch.Tensor`), or mesh ([`MeshDecoderOutput`]).
            return_dict (`bool`, *optional*, defaults to `True`):
                Whether or not to return a [`~pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput`] instead of a plain
                tuple.
...
@@ -79,8 +79,7 @@ class ShapEPipelineOutput(BaseOutput):
class ShapEImg2ImgPipeline(DiffusionPipeline):
    """
    Pipeline for generating latent representation of a 3D asset and rendering with the NeRF method from an image.

    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
    implemented for all pipelines (downloading, saving, running on a particular device, etc.).
@@ -88,15 +87,15 @@ class ShapEImg2ImgPipeline(DiffusionPipeline):
    Args:
        prior ([`PriorTransformer`]):
            The canonical unCLIP prior to approximate the image embedding from the text embedding.
        image_encoder ([`~transformers.CLIPVisionModel`]):
            Frozen image-encoder.
        image_processor ([`~transformers.CLIPImageProcessor`]):
            A `CLIPImageProcessor` to process images.
        scheduler ([`HeunDiscreteScheduler`]):
            A scheduler to be used in combination with the `prior` model to generate image embedding.
        shap_e_renderer ([`ShapERenderer`]):
            Shap-E renderer projects the generated latents into parameters of an MLP to create 3D objects with the
            NeRF rendering method.
    """

    def __init__(
@@ -179,10 +178,10 @@ class ShapEImg2ImgPipeline(DiffusionPipeline):
        Args:
            image (`torch.FloatTensor`, `PIL.Image.Image`, `np.ndarray`, `List[torch.FloatTensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`):
                `Image` or tensor representing an image batch to be used as the starting point. Can also accept image
                latents as `image`, but if latents are passed directly they are not encoded again.
            num_images_per_prompt (`int`, *optional*, defaults to 1):
                The number of images to generate per prompt.
            num_inference_steps (`int`, *optional*, defaults to 25):
                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                expense of slower inference.
            generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
@@ -197,8 +196,9 @@ class ShapEImg2ImgPipeline(DiffusionPipeline):
                `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
            frame_size (`int`, *optional*, defaults to 64):
                The width and height of each image frame of the generated 3D output.
            output_type (`str`, *optional*, defaults to `"pil"`):
                The output format of the generated image. Choose between `"pil"` (`PIL.Image.Image`), `"np"`
                (`np.array`), `"latent"` (`torch.Tensor`), or mesh ([`MeshDecoderOutput`]).
            return_dict (`bool`, *optional*, defaults to `True`):
                Whether or not to return a [`~pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput`] instead of a plain
                tuple.
...