<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# UnCLIP

[Hierarchical Text-Conditional Image Generation with CLIP Latents](https://huggingface.co/papers/2204.06125) is by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. The UnCLIP model in 🤗 Diffusers comes from kakaobrain's [karlo](https://github.com/kakaobrain/karlo).

The abstract from the paper is:

*Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.*
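
The two stages described in the abstract map directly onto submodules of the pipeline. As a minimal sketch (assuming the [kakaobrain/karlo-v1-alpha](https://huggingface.co/kakaobrain/karlo-v1-alpha) checkpoint), you can load the pipeline and inspect the prior and decoder:

```py
import torch
from diffusers import UnCLIPPipeline

# assumes the kakaobrain/karlo-v1-alpha checkpoint from the Hub
pipe = UnCLIPPipeline.from_pretrained("kakaobrain/karlo-v1-alpha", torch_dtype=torch.float16)

# stage 1: the prior maps a text caption to a CLIP image embedding
print(type(pipe.prior).__name__)    # PriorTransformer
# stage 2: the decoder generates an image conditioned on that embedding
print(type(pipe.decoder).__name__)  # UNet2DConditionModel
```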

You can find lucidrains' DALL-E 2 recreation at [lucidrains/DALLE2-pytorch](https://github.com/lucidrains/DALLE2-pytorch).

<Tip>

Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines.

</Tip>

## UnCLIPPipeline
[[autodoc]] UnCLIPPipeline
	- all
	- __call__
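
A minimal text-to-image sketch, assuming the `kakaobrain/karlo-v1-alpha` checkpoint and a CUDA device:

```py
import torch
from diffusers import UnCLIPPipeline

pipe = UnCLIPPipeline.from_pretrained("kakaobrain/karlo-v1-alpha", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse"
# guidance is applied separately to the prior and the decoder
image = pipe(prompt, prior_guidance_scale=4.0, decoder_guidance_scale=8.0).images[0]
image.save("astronaut.png")
```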

## UnCLIPImageVariationPipeline
[[autodoc]] UnCLIPImageVariationPipeline
	- all
	- __call__
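
A minimal image-variation sketch, assuming the `kakaobrain/karlo-v1-alpha-image-variations` checkpoint and a local input image (`input.png` is a placeholder path):

```py
import torch
from diffusers import UnCLIPImageVariationPipeline
from PIL import Image

pipe = UnCLIPImageVariationPipeline.from_pretrained(
    "kakaobrain/karlo-v1-alpha-image-variations", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# any RGB image works here; "input.png" is a placeholder
init_image = Image.open("input.png").convert("RGB")
variation = pipe(image=init_image).images[0]
variation.save("variation.png")
```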

## ImagePipelineOutput
[[autodoc]] pipelines.ImagePipelineOutput