# Diffusers — Agent Guide
## Coding style
Strive to write code as simple and explicit as possible.
- Minimize small helper/utility functions — inline the logic instead. A reader should be able to follow the full flow without jumping between functions.
- No defensive code or unused code paths — do not add fallback paths, safety checks, or configuration options "just in case". When porting from a research repo, delete training-time code paths, experimental flags, and ablation branches entirely — only keep the inference path you are actually integrating.
- Do not guess user intent and silently correct behavior. Make the expected inputs clear in the docstring, and raise a concise error for unsupported cases rather than adding complex fallback logic.
---
## Code formatting
- `make style` and `make fix-copies` should be run as the final step before opening a PR
### Copied Code
- Many classes are kept in sync with a source via a `# Copied from ...` header comment
- Do not edit a `# Copied from` block directly — run `make fix-copies` to propagate changes from the source
- Remove the header to intentionally break the link
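For illustration, a typical header looks like the sketch below; the source path and class names are examples, not a real pairing in the codebase:

```python
import torch.nn as nn

# Copied from diffusers.models.attention.FeedForward with FeedForward->MyModelFeedForward
class MyModelFeedForward(nn.Module):
    ...
```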
### Models
- See [models.md](models.md) for model conventions, attention pattern, implementation rules, dependencies, and gotchas.
- See the [model-integration](./skills/model-integration/SKILL.md) skill for the full integration workflow, file structure, test setup, and other details.
### Pipelines & Schedulers
- Pipelines inherit from `DiffusionPipeline`
- Schedulers use `SchedulerMixin` with `ConfigMixin`
- Use `@torch.no_grad()` on pipeline `__call__`
- Support `output_type="latent"` for skipping VAE decode
- Support `generator` parameter for reproducibility
- Use `self.progress_bar(timesteps)` for progress tracking
- Don't subclass an existing pipeline for a variant — do not use an existing pipeline class (e.g., `FluxPipeline`) as the base class for another pipeline (e.g., `FluxImg2ImgPipeline`) that will be part of the core codebase (`src`)
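A minimal `__call__` sketch tying the conventions above together; the class and helper names (`MyPipeline`, `MyPipelineOutput`, `encode_prompt`, `prepare_latents`) are illustrative, not an existing diffusers API:

```python
import torch
from diffusers import DiffusionPipeline

class MyPipeline(DiffusionPipeline):
    @torch.no_grad()
    def __call__(self, prompt, num_inference_steps=50, generator=None, output_type="pil"):
        prompt_embeds = self.encode_prompt(prompt)
        # generator makes latent initialization reproducible
        latents = self.prepare_latents(prompt_embeds.shape[0], generator=generator)
        self.scheduler.set_timesteps(num_inference_steps)
        for t in self.progress_bar(self.scheduler.timesteps):
            noise_pred = self.transformer(latents, t, prompt_embeds)
            latents = self.scheduler.step(noise_pred, t, latents).prev_sample
        if output_type == "latent":
            return MyPipelineOutput(images=latents)  # skip the VAE decode
        images = self.image_processor.postprocess(self.vae.decode(latents).sample, output_type=output_type)
        return MyPipelineOutput(images=images)
```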
## Skills
Task-specific guides live in `.ai/skills/` and are loaded on demand by AI agents. Available skills include:
- [model-integration](./skills/model-integration/SKILL.md) (adding/converting pipelines)
- [parity-testing](./skills/parity-testing/SKILL.md) (debugging numerical parity).
# Model conventions and rules
Shared reference for model-related conventions, patterns, and gotchas.
Linked from `AGENTS.md`, `skills/model-integration/SKILL.md`, and `review-rules.md`.
## Coding style
- All layer calls should be visible directly in `forward` — avoid helper functions that hide `nn.Module` calls.
- Avoid graph breaks to stay `torch.compile`-compatible — do not use NumPy operations in forward implementations, or any other pattern that breaks `torch.compile` with `fullgraph=True`.
- No new mandatory dependencies (e.g. `einops`) without discussion. Guard optional deps with `is_X_available()` and add a dummy in `utils/dummy_*.py`.
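A hedged sketch of the optional-dependency guard; `is_scipy_available` is an existing `diffusers.utils` helper, and `scipy` stands in for whatever optional dependency you add:

```python
from diffusers.utils import is_scipy_available

if is_scipy_available():
    import scipy
```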
## Common model conventions
- Models use `ModelMixin` with `register_to_config` for config serialization
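A minimal sketch of this pattern (a hypothetical model, not a real diffusers class):

```python
import torch.nn as nn
from diffusers.configuration_utils import ConfigMixin, register_to_config
from diffusers.models.modeling_utils import ModelMixin

class MyModelTransformer(ModelMixin, ConfigMixin):
    @register_to_config
    def __init__(self, in_channels: int = 4, num_layers: int = 2):
        super().__init__()
        # every __init__ arg above is captured into self.config and serialized by save_pretrained
        self.proj_in = nn.Linear(in_channels, 64)
        self.blocks = nn.ModuleList([nn.Linear(64, 64) for _ in range(num_layers)])
```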
## Attention pattern
Attention must follow the diffusers pattern: both the `Attention` class and its processor are defined in the model file. The processor's `__call__` handles the actual compute and must use `dispatch_attention_fn` rather than calling `F.scaled_dot_product_attention` directly. The attention class inherits `AttentionModuleMixin` and declares `_default_processor_cls` and `_available_processors`.
```python
# transformer_mymodel.py
class MyModelAttnProcessor:
    _attention_backend = None
    _parallel_config = None

    def __call__(self, attn, hidden_states, attention_mask=None, ...):
        query = attn.to_q(hidden_states)
        key = attn.to_k(hidden_states)
        value = attn.to_v(hidden_states)
        # reshape, apply rope, etc.
        hidden_states = dispatch_attention_fn(
            query, key, value,
            attn_mask=attention_mask,
            backend=self._attention_backend,
            parallel_config=self._parallel_config,
        )
        hidden_states = hidden_states.flatten(2, 3)
        return attn.to_out[0](hidden_states)


class MyModelAttention(nn.Module, AttentionModuleMixin):
    _default_processor_cls = MyModelAttnProcessor
    _available_processors = [MyModelAttnProcessor]

    def __init__(self, query_dim, heads=8, dim_head=64, ...):
        super().__init__()
        self.to_q = nn.Linear(query_dim, heads * dim_head, bias=False)
        self.to_k = nn.Linear(query_dim, heads * dim_head, bias=False)
        self.to_v = nn.Linear(query_dim, heads * dim_head, bias=False)
        self.to_out = nn.ModuleList([nn.Linear(heads * dim_head, query_dim), nn.Dropout(0.0)])
        self.set_processor(MyModelAttnProcessor())

    def forward(self, hidden_states, attention_mask=None, **kwargs):
        return self.processor(self, hidden_states, attention_mask, **kwargs)
```
Consult the implementations in `src/diffusers/models/transformers/` if you need further references.
## Gotchas
1. **Forgetting `__init__.py` lazy imports.** Every new class must be registered in the appropriate `__init__.py` with lazy imports. Missing this causes `ImportError` that only shows up when users try `from diffusers import YourNewClass`.
2. **Using `einops` or other non-PyTorch deps.** Reference implementations often use `einops.rearrange`. Always rewrite with native PyTorch (`reshape`, `permute`, `unflatten`). Don't add the dependency. If a dependency is truly unavoidable, guard its import: `if is_my_dependency_available(): import my_dependency`.
3. **Missing `make fix-copies` after `# Copied from`.** If you add `# Copied from` annotations, you must run `make fix-copies` to propagate them. CI will fail otherwise.
4. **Wrong `_supports_cache_class` / `_no_split_modules`.** These class attributes control KV cache and device placement. Copy from a similar model and verify -- wrong values cause silent correctness bugs or OOM errors.
5. **Missing `@torch.no_grad()` on pipeline `__call__`.** Forgetting this causes GPU OOM from gradient accumulation during inference.
6. **Config serialization gaps.** Every `__init__` parameter in a `ModelMixin` subclass must be captured by `register_to_config`. If you add a new param but forget to register it, `from_pretrained` will silently use the default instead of the saved value.
7. **Forgetting to update `_import_structure` and `_lazy_modules`.** The top-level `src/diffusers/__init__.py` has both -- missing either one causes partial import failures.
8. **Hardcoded dtype in model forward.** Don't hardcode `torch.float32` or `torch.bfloat16` in the model's forward pass. Use the dtype of the input tensors or `self.dtype` so the model works with any precision.
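As an illustration of gotcha #2, a typical `einops.rearrange` call and its native-PyTorch rewrite (shapes are illustrative):

```python
# Reference code:
#   x = einops.rearrange(x, "b (h w) c -> b c h w", h=h, w=w)
# Native PyTorch equivalent:
b, hw, c = x.shape
x = x.reshape(b, h, w, c).permute(0, 3, 1, 2)  # (b, h*w, c) -> (b, c, h, w)
```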
# PR Review Rules
Review-specific rules for Claude. Focus on correctness — style is handled by ruff.
Before reviewing, read and apply the guidelines in:
- [AGENTS.md](AGENTS.md) — coding style, copied code
- [models.md](models.md) — model conventions, attention pattern, implementation rules, dependencies, gotchas
- [skills/model-integration/modular-conversion.md](skills/model-integration/modular-conversion.md) — modular pipeline patterns, block structure, key conventions
- [skills/parity-testing/SKILL.md](skills/parity-testing/SKILL.md) — testing rules, comparison utilities
- [skills/parity-testing/pitfalls.md](skills/parity-testing/pitfalls.md) — known pitfalls (dtype mismatches, config assumptions, etc.)
## Common mistakes (add new rules below this line)
---
name: integrating-models
description: >
Use when adding a new model or pipeline to diffusers, setting up file
structure for a new model, converting a pipeline to modular format, or
converting weights for a new version of an already-supported model.
---
## Goal
Integrate a new model into diffusers end-to-end. The overall flow:
1. **Gather info** — ask the user for the reference repo, setup guide, a runnable inference script, and other objectives such as standard vs modular.
2. **Confirm the plan** — once you have everything, tell the user exactly what you'll do: e.g. "I'll integrate model X with pipeline Y into diffusers based on your script. I'll run parity tests (model-level and pipeline-level) using the `parity-testing` skill to verify numerical correctness against the reference."
3. **Implement** — write the diffusers code (model, pipeline, scheduler if needed), convert weights, register in `__init__.py`.
4. **Parity test** — use the `parity-testing` skill to verify component and e2e parity against the reference implementation.
5. **Deliver a unit test** — provide a self-contained test script that runs the diffusers implementation, checks numerical output (np allclose), and saves an image/video for visual verification. This is what the user runs to confirm everything works.
Work one workflow at a time — get it to full parity before moving on.
## Setup — gather before starting
Before writing any code, gather info in this order:
1. **Reference repo** — ask for the github link. If they've already set it up locally, ask for the path. Otherwise, ask what setup steps are needed (install deps, download checkpoints, set env vars, etc.) and run through them before proceeding.
2. **Inference script** — ask for a runnable end-to-end script for a basic workflow first (e.g. T2V). Then ask what other workflows they want to support (I2V, V2V, etc.) and agree on the full implementation order together.
3. **Standard vs modular** — standard pipelines, modular, or both?
Use `AskUserQuestion` with structured choices for step 3 when the options are known.
## Standard Pipeline Integration
### File structure for a new model
```
src/diffusers/
    models/transformers/transformer_<model>.py   # The core model
    schedulers/scheduling_<model>.py             # If model needs a custom scheduler
    pipelines/<model>/
        __init__.py
        pipeline_<model>.py                      # Main pipeline
        pipeline_<model>_<variant>.py            # Variant pipelines (e.g. pyramid, distilled)
        pipeline_output.py                       # Output dataclass
    loaders/lora_pipeline.py                     # LoRA mixin (add to existing file)
tests/
    models/transformers/test_models_transformer_<model>.py
    pipelines/<model>/test_<model>.py
    lora/test_lora_layers_<model>.py
docs/source/en/api/
    pipelines/<model>.md
    models/<model>_transformer3d.md              # or appropriate name
```
### Integration checklist
- [ ] Implement transformer model with `from_pretrained` support
- [ ] Implement or reuse scheduler
- [ ] Implement pipeline(s) with `__call__` method
- [ ] Add LoRA support if applicable
- [ ] Register all classes in `__init__.py` files (lazy imports)
- [ ] Write unit tests (model, pipeline, LoRA)
- [ ] Write docs
- [ ] Run `make style` and `make quality`
- [ ] Test parity with reference implementation (see `parity-testing` skill)
### Model conventions, attention pattern, and implementation rules
See [../../models.md](../../models.md) for the attention pattern, implementation rules, common conventions, dependencies, and gotchas. These apply to all model work.
### Model integration specific rules
**Don't combine structural changes with behavioral changes.** Restructuring code to fit diffusers APIs (ModelMixin, ConfigMixin, etc.) is unavoidable. But don't also "improve" the algorithm, refactor computation order, or rename internal variables for aesthetics. Keep numerical logic as close to the reference as possible, even if it looks unclean. For standard → modular, this is stricter: copy loop logic verbatim and only restructure into blocks. Clean up in a separate commit after parity is confirmed.
### Test setup
- Slow tests gated with `@slow` and `RUN_SLOW=1`
- All model-level tests should initially be written with the `BaseModelTesterConfig`, `ModelTesterMixin`, `MemoryTesterMixin`, `AttentionTesterMixin`, `LoraTesterMixin`, and `TrainingTesterMixin` classes. Add further tests only after discussion with the maintainers. Use `tests/models/transformers/test_models_transformer_flux.py` as a reference.
---
## Modular Pipeline Conversion
See [modular-conversion.md](modular-conversion.md) for the full guide on converting standard pipelines to modular format, including block types, build order, guider abstraction, and conversion checklist.
---
## Weight Conversion Tips
<!-- TODO: Add concrete examples as we encounter them. Common patterns to watch for:
- Fused QKV weights that need splitting into separate Q, K, V
- Scale/shift ordering differences (reference stores [shift, scale], diffusers expects [scale, shift])
- Weight transpositions (linear stored as transposed conv, or vice versa)
- Interleaved head dimensions that need reshaping
- Bias terms absorbed into different layers
Add each with a before/after code snippet showing the conversion. -->
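Until concrete cases are collected above, here is a hedged sketch of the first pattern (fused QKV splitting); all key names are hypothetical:

```python
import torch

def split_fused_qkv(state_dict, prefix="blocks.0.attn"):
    # Hypothetical keys: the reference stores one fused [3*dim, dim] matrix,
    # while the diffusers attention expects separate to_q/to_k/to_v weights.
    fused = state_dict.pop(f"{prefix}.qkv.weight")
    q, k, v = torch.chunk(fused, 3, dim=0)
    state_dict[f"{prefix}.to_q.weight"] = q
    state_dict[f"{prefix}.to_k.weight"] = k
    state_dict[f"{prefix}.to_v.weight"] = v
    return state_dict
```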
# Modular Pipeline Conversion Reference
## When to use
Modular pipelines break a monolithic `__call__` into composable blocks. Convert when:
- The model supports multiple workflows (T2V, I2V, V2V, etc.)
- Users need to swap guidance strategies (CFG, CFG-Zero*, PAG)
- You want to share blocks across pipeline variants
## File structure
```
src/diffusers/modular_pipelines/<model>/
    __init__.py                  # Lazy imports
    modular_pipeline.py          # Pipeline class (tiny, mostly config)
    encoders.py                  # Text encoder + image/video VAE encoder blocks
    before_denoise.py            # Pre-denoise setup blocks
    denoise.py                   # The denoising loop blocks
    decoders.py                  # VAE decode block
    modular_blocks_<model>.py    # Block assembly (AutoBlocks)
```
## Block types decision tree
```
Is this a single operation?
    YES -> ModularPipelineBlocks (leaf block)
Does it run multiple blocks in sequence?
    YES -> SequentialPipelineBlocks
Does it iterate (e.g. chunk loop)?
    YES -> LoopSequentialPipelineBlocks
Does it choose ONE block based on which input is present?
    Is the selection 1:1 with trigger inputs?
        YES -> AutoPipelineBlocks (simple trigger mapping)
        NO  -> ConditionalPipelineBlocks (custom select_block method)
```
## Build order (easiest first)
1. `decoders.py` -- Takes latents, runs VAE decode, returns images/videos
2. `encoders.py` -- Takes prompt, returns prompt_embeds. Add image/video VAE encoder if needed
3. `before_denoise.py` -- Timesteps, latent prep, noise setup. Each logical operation = one block
4. `denoise.py` -- The hardest. Convert guidance to guider abstraction
## Key pattern: Guider abstraction
Original pipeline has guidance baked in:
```python
for i, t in enumerate(timesteps):
    noise_pred = self.transformer(latents, prompt_embeds, ...)
    if self.do_classifier_free_guidance:
        noise_uncond = self.transformer(latents, negative_prompt_embeds, ...)
        noise_pred = noise_uncond + scale * (noise_pred - noise_uncond)
    latents = self.scheduler.step(noise_pred, t, latents).prev_sample
```
Modular pipeline separates concerns:
```python
guider_inputs = {
    "encoder_hidden_states": (prompt_embeds, negative_prompt_embeds),
}
for i, t in enumerate(timesteps):
    components.guider.set_state(step=i, num_inference_steps=num_steps, timestep=t)
    guider_state = components.guider.prepare_inputs(guider_inputs)
    for batch in guider_state:
        components.guider.prepare_models(components.transformer)
        cond_kwargs = {k: getattr(batch, k) for k in guider_inputs}
        context_name = getattr(batch, components.guider._identifier_key)
        with components.transformer.cache_context(context_name):
            batch.noise_pred = components.transformer(
                hidden_states=latents, timestep=timestep,
                return_dict=False, **cond_kwargs, **shared_kwargs,
            )[0]
        components.guider.cleanup_models(components.transformer)
    noise_pred = components.guider(guider_state)[0]
    latents = components.scheduler.step(noise_pred, t, latents, generator=generator)[0]
```
## Key pattern: Chunk loops for video models
Use `LoopSequentialPipelineBlocks` for outer loop:
```python
class ChunkDenoiseStep(LoopSequentialPipelineBlocks):
    block_classes = [PrepareChunkStep, NoiseGenStep, DenoiseInnerStep, UpdateStep]
```
Note: blocks inside `LoopSequentialPipelineBlocks` receive `(components, block_state, k)` where `k` is the loop iteration index.
## Key pattern: Workflow selection
```python
class AutoDenoise(ConditionalPipelineBlocks):
    block_classes = [V2VDenoiseStep, I2VDenoiseStep, T2VDenoiseStep]
    block_trigger_inputs = ["video_latents", "image_latents"]
    default_block_name = "text2video"
```
## Standard InputParam/OutputParam templates
```python
# Inputs
InputParam.template("prompt") # str, required
InputParam.template("negative_prompt") # str, optional
InputParam.template("image") # PIL.Image, optional
InputParam.template("generator") # torch.Generator, optional
InputParam.template("num_inference_steps") # int, default=50
InputParam.template("latents") # torch.Tensor, optional
# Outputs
OutputParam.template("prompt_embeds")
OutputParam.template("negative_prompt_embeds")
OutputParam.template("image_latents")
OutputParam.template("latents")
OutputParam.template("videos")
OutputParam.template("images")
```
## ComponentSpec patterns
```python
# Heavy models - loaded from pretrained
ComponentSpec("transformer", YourTransformerModel)
ComponentSpec("vae", AutoencoderKL)
# Lightweight objects - created inline from config
ComponentSpec(
"guider",
ClassifierFreeGuidance,
config=FrozenDict({"guidance_scale": 7.5}),
default_creation_method="from_config"
)
```
## Conversion checklist
- [ ] Read original pipeline's `__call__` end-to-end, map stages
- [ ] Write test scripts (reference + target) with identical seeds
- [ ] Create file structure under `modular_pipelines/<model>/`
- [ ] Write decoder block (simplest)
- [ ] Write encoder blocks (text, image, video)
- [ ] Write before_denoise blocks (timesteps, latent prep, noise)
- [ ] Write denoise block with guider abstraction (hardest)
- [ ] Create pipeline class with `default_blocks_name`
- [ ] Assemble blocks in `modular_blocks_<model>.py`
- [ ] Wire up `__init__.py` with lazy imports
- [ ] Add `# auto_docstring` above all assembled blocks (SequentialPipelineBlocks, AutoPipelineBlocks, etc.), run `python utils/modular_auto_docstring.py --fix_and_overwrite`, and verify the generated docstrings — all parameters should have proper descriptions with no "TODO" placeholders indicating missing definitions
- [ ] Run `make style` and `make quality`
- [ ] Test all workflows for parity with reference
---
name: testing-parity
description: >
Use when debugging or verifying numerical parity between pipeline
implementations (e.g., research repo vs diffusers, standard vs modular).
Also relevant when outputs look wrong — washed out, pixelated, or have
visual artifacts — as these are usually parity bugs.
---
## Setup — gather before starting
Before writing any test code, gather:
1. **Which two implementations** are being compared (e.g. research repo → diffusers, standard → modular, or research → modular). Use `AskUserQuestion` with structured choices if not already clear.
2. **Two equivalent runnable scripts** — one for each implementation, both expected to produce identical output given the same inputs. These scripts define what "parity" means concretely.
When invoked from the `model-integration` skill, you already have context: the reference script comes from step 2 of setup, and the diffusers script is the one you just wrote. You just need to make sure both scripts are runnable and use the same inputs/seed/params.
## Test strategy
**Component parity (CPU/float32) -- always run, as you build.**
Test each component before assembling the pipeline. This is the foundation -- if individual pieces are wrong, the pipeline can't be right. Each component in isolation, strict max_diff < 1e-3.
Test freshly converted checkpoints and saved checkpoints.
- **Fresh**: convert from checkpoint weights, compare against reference (catches conversion bugs)
- **Saved**: load from saved model on disk, compare against reference (catches stale saves)
Keep component test scripts around -- you will need to re-run them during pipeline debugging with different inputs or config values.
Template -- one self-contained script per component, reference and diffusers side-by-side:
```python
@torch.inference_mode()
def test_my_component(mode="fresh", model_path=None):
    # 1. Deterministic input
    gen = torch.Generator().manual_seed(42)
    x = torch.randn(1, 3, 64, 64, generator=gen, dtype=torch.float32)

    # 2. Reference: load from checkpoint, run, free
    ref_model = ReferenceModel.from_config(config)
    ref_model.load_state_dict(load_weights("prefix"), strict=True)
    ref_model = ref_model.float().eval()
    ref_out = ref_model(x).clone()
    del ref_model

    # 3. Diffusers: fresh (convert weights) or saved (from_pretrained)
    if mode == "fresh":
        diff_model = convert_my_component(load_weights("prefix"))
    else:
        diff_model = DiffusersModel.from_pretrained(model_path, torch_dtype=torch.float32)
    diff_model = diff_model.float().eval()
    diff_out = diff_model(x)
    del diff_model

    # 4. Compare in same script -- no saving to disk
    max_diff = (ref_out - diff_out).abs().max().item()
    assert max_diff < 1e-3, f"FAIL: max_diff={max_diff:.2e}"
```
Key points: (a) both reference and diffusers component in one script -- never split into separate scripts that save/load intermediates, (b) deterministic input via seeded generator, (c) load one model at a time to fit in CPU RAM, (d) `.clone()` the reference output before deleting the model.
**E2E visual (GPU/bfloat16) -- once the pipeline is assembled.**
Both pipelines generate independently with identical seeds/params. Save outputs and compare visually. If outputs look identical, you're done -- no need for deeper testing.
**Pipeline stage tests -- only if E2E fails and you need to isolate the bug.**
If the user already suspects where divergence is, start there. Otherwise, work through stages in order.
First, **match noise generation**: the way initial noise/latents are constructed (seed handling, generator, randn call order) often differs between the two scripts. If the noise doesn't match, nothing downstream will match. Check how noise is initialized in the diffusers script — if it doesn't match the reference, temporarily change it to match. Note what you changed so it can be reverted after parity is confirmed.
For small models, run on CPU/float32 for strict comparison. For large models (e.g. 22B params), CPU/float32 is impractical -- use GPU/bfloat16 with `enable_model_cpu_offload()` and relax tolerances (max_diff < 1e-1 for bfloat16 is typical for passing tests; cosine similarity > 0.9999 is a good secondary check).
Test encode and decode stages first -- they're simpler and bugs there are easier to fix. Only debug the denoising loop if encode and decode both pass.
The challenge: pipelines are monolithic `__call__` methods -- you can't just call "the encode part". See [checkpoint-mechanism.md](checkpoint-mechanism.md) for the checkpoint class that lets you stop, save, or inject tensors at named locations inside the pipeline.
**Stage test order — encode, decode, then denoise:**
- **`encode`** (test first): Stop both pipelines at `"preloop"`. Compare **every single variable** that will be consumed by the denoising loop -- not just latents and sigmas, but also prompt embeddings, attention masks, positional coordinates, connector outputs, and any conditioning inputs.
- **`decode`** (test second, before denoise): Run the reference pipeline fully -- checkpoint the post-loop latents AND let it finish to get the decoded output. Then feed those same post-loop latents through the diffusers pipeline's decode path. Compare both numerically AND visually.
- **`denoise`** (test last): Run both pipelines with realistic `num_steps` (e.g. 30) so the scheduler computes correct sigmas/timesteps, but stop after 2 loop iterations using `after_step_1`. Don't set `num_steps=2` -- that produces unrealistic sigma schedules.
```python
# Encode stage -- stop before the loop, compare ALL inputs:
ref_ckpts = {"preloop": Checkpoint(save=True, stop=True)}
run_reference_pipeline(ref_ckpts)
ref_data = ref_ckpts["preloop"].data
diff_ckpts = {"preloop": Checkpoint(save=True, stop=True)}
run_diffusers_pipeline(diff_ckpts)
diff_data = diff_ckpts["preloop"].data
# Compare EVERY variable consumed by the denoise loop:
compare_tensors("latents", ref_data["latents"], diff_data["latents"])
compare_tensors("sigmas", ref_data["sigmas"], diff_data["sigmas"])
compare_tensors("prompt_embeds", ref_data["prompt_embeds"], diff_data["prompt_embeds"])
# ... every single tensor the transformer forward() will receive
```
**E2E-injected visual test**: Once you've identified a suspected root cause using stage tests, confirm it with an e2e-injected run -- inject the known-good tensor from reference and generate a full video. If the output looks identical to reference, you've confirmed the root cause.
## Debugging technique: Injection for root-cause isolation
When stage tests show divergence, **inject a known-good tensor from one pipeline into the other** to test whether the remaining code is correct.
The principle: if you suspect input X is the root cause of divergence in stage S:
1. Run the reference pipeline and capture X
2. Run the diffusers pipeline but **replace** its X with the reference's X (via checkpoint load)
3. Compare outputs of stage S
If outputs now match: X was the root cause. If they still diverge: the bug is in the stage logic itself, not in X.
| What you're testing | What you inject | Where you inject |
|---|---|---|
| Is the decode stage correct? | Post-loop latents from reference | Before decode |
| Is the denoise loop correct? | Pre-loop latents from reference | Before the loop |
| Is step N correct? | Post-step-(N-1) latents from reference | Before step N |
**Per-step accumulation tracing**: When injection confirms the loop is correct but you want to understand *how* a small initial difference compounds, capture `after_step_{i}` for every step and plot the max_diff curve. A healthy curve stays bounded; an exponential blowup in later steps points to an amplification mechanism (see Pitfall #13 in [pitfalls.md](pitfalls.md)).
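A hedged sketch of per-step accumulation tracing, reusing the `Checkpoint` mechanism and the `run_*` helpers from the encode-stage example above (names and the `"latents"` key are assumptions that may differ per pipeline):

```python
num_steps = 30
ref_ckpts = {f"after_step_{i}": Checkpoint(save=True) for i in range(num_steps)}
diff_ckpts = {f"after_step_{i}": Checkpoint(save=True) for i in range(num_steps)}
run_reference_pipeline(ref_ckpts)
run_diffusers_pipeline(diff_ckpts)

curve = []
for i in range(num_steps):
    a = ref_ckpts[f"after_step_{i}"].data["latents"].float()
    b = diff_ckpts[f"after_step_{i}"].data["latents"].float()
    curve.append((a - b).abs().max().item())
# A bounded curve is healthy; exponential growth in late steps points to an amplification mechanism.
print(curve)
```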
## Debugging technique: Visual comparison via frame extraction
For video pipelines, numerical metrics alone can be misleading. Extract and view individual frames:
```python
import numpy as np
from PIL import Image
def extract_frames(video_np, frame_indices):
    """video_np: (frames, H, W, 3) float array in [0, 1]"""
    for idx in frame_indices:
        frame = (video_np[idx] * 255).clip(0, 255).astype(np.uint8)
        img = Image.fromarray(frame)
        img.save(f"frame_{idx}.png")
# Compare specific frames from both pipelines
extract_frames(ref_video, [0, 60, 120])
extract_frames(diff_video, [0, 60, 120])
```
## Testing rules
1. **Never use reference code in the diffusers test path.** Each side must use only its own code.
2. **Never monkey-patch model internals in tests.** Do not replace `model.forward` or patch internal methods.
3. **Debugging instrumentation must be non-destructive.** Checkpoint captures for debugging are fine, but must not alter control flow or outputs.
4. **Prefer CPU/float32 for numerical comparison when practical.** Float32 avoids bfloat16 precision noise that obscures real bugs. But for large models (22B+), GPU/bfloat16 with `enable_model_cpu_offload()` is necessary -- use relaxed tolerances and cosine similarity as a secondary metric.
5. **Test both fresh conversion AND saved model.** Fresh catches conversion logic bugs; saved catches stale/corrupted weights from previous runs.
6. **Diff configs before debugging.** Before investigating any divergence, dump and compare all config values. A 30-second config diff prevents hours of debugging based on wrong assumptions.
7. **Never modify cached/downloaded model configs directly.** Don't edit files in `~/.cache/huggingface/`. Instead, save to a local directory or open a PR on the upstream repo.
8. **Compare ALL loop inputs in the encode test.** The preloop checkpoint must capture every single tensor the transformer forward() will receive.
## Comparison utilities
```python
def compare_tensors(name: str, a: torch.Tensor, b: torch.Tensor, tol: float = 1e-3) -> bool:
    if a.shape != b.shape:
        print(f"  FAIL {name}: shape mismatch {a.shape} vs {b.shape}")
        return False
    diff = (a.float() - b.float()).abs()
    max_diff = diff.max().item()
    mean_diff = diff.mean().item()
    cos = torch.nn.functional.cosine_similarity(
        a.float().flatten().unsqueeze(0), b.float().flatten().unsqueeze(0)
    ).item()
    passed = max_diff < tol
    print(f"  {'PASS' if passed else 'FAIL'} {name}: max={max_diff:.2e}, mean={mean_diff:.2e}, cos={cos:.5f}")
    return passed
```
Cosine similarity is especially useful for GPU/bfloat16 tests where max_diff can be noisy -- `cos > 0.9999` is a strong signal even when max_diff exceeds tolerance.
## Gotchas
See [pitfalls.md](pitfalls.md) for the full list of gotchas to watch for during parity testing.
# Checkpoint Mechanism for Stage Testing
## Overview
Pipelines are monolithic `__call__` methods -- you can't just call "the encode part". The checkpoint mechanism lets you stop, save, or inject tensors at named locations inside the pipeline.
## The Checkpoint class
Add a `_checkpoints` argument to both the diffusers pipeline and the reference implementation.
```python
from dataclasses import dataclass, field

@dataclass
class Checkpoint:
    save: bool = False   # capture variables into ckpt.data
    stop: bool = False   # halt pipeline after this point
    load: bool = False   # inject ckpt.data into local variables
    data: dict = field(default_factory=dict)
```
## Pipeline instrumentation
The pipeline accepts an optional `dict[str, Checkpoint]`. Place checkpoint calls at boundaries between pipeline stages -- after each encoder, before the denoising loop (capture all loop inputs), after each loop iteration, after the loop (capture final latents before decode).
```python
def __call__(self, prompt, ..., _checkpoints=None):
    # --- text encoding ---
    prompt_embeds = self.text_encoder(prompt)
    _maybe_checkpoint(_checkpoints, "text_encoding", {
        "prompt_embeds": prompt_embeds,
    })

    # --- prepare latents, sigmas, positions ---
    latents = self.prepare_latents(...)
    sigmas = self.scheduler.sigmas
    # ...
    _maybe_checkpoint(_checkpoints, "preloop", {
        "latents": latents,
        "sigmas": sigmas,
        "prompt_embeds": prompt_embeds,
        "prompt_attention_mask": prompt_attention_mask,
        "video_coords": video_coords,
        # capture EVERYTHING the loop needs -- every tensor the transformer
        # forward() receives. Missing even one variable here means you can't
        # tell if it's the source of divergence during denoise debugging.
    })

    # --- denoising loop ---
    for i, t in enumerate(timesteps):
        noise_pred = self.transformer(latents, t, prompt_embeds, ...)
        latents = self.scheduler.step(noise_pred, t, latents)[0]
        _maybe_checkpoint(_checkpoints, f"after_step_{i}", {
            "latents": latents,
        })
    _maybe_checkpoint(_checkpoints, "post_loop", {
        "latents": latents,
    })

    # --- decode ---
    video = self.vae.decode(latents)
    return video
```
## The helper function
Each `_maybe_checkpoint` call acts on the Checkpoint's flags: `save` captures the local variables into `ckpt.data`, and `stop` halts execution (raises an exception caught at the top level). `load` injects pre-populated `ckpt.data` back into local variables and is handled at the call site (see the injection support section below).
```python
def _maybe_checkpoint(checkpoints, name, data):
    if not checkpoints:
        return
    ckpt = checkpoints.get(name)
    if ckpt is None:
        return
    if ckpt.save:
        ckpt.data.update(data)
    if ckpt.stop:
        raise PipelineStop  # caught at __call__ level, returns None
```
## Injection support
Add `load` support at each checkpoint where you might want to inject:
```python
_maybe_checkpoint(_checkpoints, "preloop", {"latents": latents, ...})
# Load support: replace local variables with injected data
if _checkpoints:
    ckpt = _checkpoints.get("preloop")
    if ckpt is not None and ckpt.load:
        latents = ckpt.data["latents"].to(device=device, dtype=latents.dtype)
```
## Key insight
The checkpoint dict is passed into the pipeline and mutated in-place. After the pipeline returns (or stops early), you read back `ckpt.data` to get the captured tensors. Both pipelines save under their own key names, so the test maps between them (e.g. reference `"video_state.latent"` -> diffusers `"latents"`).
## Memory management for large models
For large models, free the source pipeline's GPU memory before loading the target pipeline. Clone injected tensors to CPU, delete everything else, then run the target with `enable_model_cpu_offload()`.
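A hedged sketch of that sequence (the pipeline class, checkpoint keys, and helper names are placeholders):

```python
import gc
import torch

# 1. Move the tensor to inject onto CPU, then free the reference pipeline
ref_latents = ref_ckpts["post_loop"].data["latents"].detach().to("cpu").clone()
del ref_pipeline, ref_ckpts
gc.collect()
torch.cuda.empty_cache()

# 2. Load the target with CPU offload and run with the injected checkpoint
diff_pipe = MyDiffusersPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16)
diff_pipe.enable_model_cpu_offload()
diff_ckpts = {"post_loop": Checkpoint(load=True, data={"latents": ref_latents})}
out = diff_pipe(prompt=prompt, generator=torch.Generator().manual_seed(42), _checkpoints=diff_ckpts)
```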
# Complete Pitfalls Reference
## 1. Global CPU RNG
`MultivariateNormal.sample()` uses the global CPU RNG, not `torch.Generator`. Must call `torch.manual_seed(seed)` before each pipeline run. A `generator=` kwarg won't help.
## 2. Timestep dtype
Many transformers expect `int64` timesteps. `get_timestep_embedding` casts to float, so `745.3` and `745` produce different embeddings. Match the reference's casting.
## 3. Guidance parameter mapping
Parameter names may differ: reference `zero_steps=1` (meaning `i <= 1`, 2 steps) vs target `zero_init_steps=2` (meaning `step < 2`, same thing). Check exact semantics.
## 4. `patch_size` in noise generation
If noise generation depends on `patch_size` (e.g. `sample_block_noise`), it must be passed through. Missing it changes noise spatial structure.
## 5. Variable shadowing in nested loops
Nested loops (stages -> chunks -> timesteps) can shadow variable names. If outer loop uses `latents` and inner loop also assigns to `latents`, scoping must match the reference.
## 6. Float precision differences -- don't dismiss them
Target may compute in float32 where reference used bfloat16. Small per-element diffs (1e-3 to 1e-2) *look* harmless but can compound catastrophically over iterative processes like denoising loops (see Pitfalls #11 and #13). Before dismissing a precision difference: (a) check whether it feeds into an iterative process, (b) if so, trace the accumulation curve over all iterations to see if it stays bounded or grows exponentially. Only truly non-iterative precision diffs (e.g. in a single-pass encoder) are safe to accept.
## 7. Scheduler state reset between stages
Some schedulers accumulate state (e.g. `model_outputs` in UniPC) that must be cleared between stages.
## 8. Component access
Standard: `self.transformer`. Modular: `components.transformer`. Missing this causes AttributeError.
## 9. Guider state across stages
In multi-stage denoising, the guider's internal state (e.g. `zero_init_steps`) may need save/restore between stages.
## 10. Model storage location
NEVER store converted models in `/tmp/` -- temporary directories get wiped on restart. Always save converted checkpoints under a persistent path in the project repo (e.g. `models/ltx23-diffusers/`).
## 11. Noise dtype mismatch (causes washed-out output)
Reference code often generates noise in float32 then casts to model dtype (bfloat16) before storing:
```python
noise = torch.randn(..., dtype=torch.float32, generator=gen)
noise = noise.to(dtype=model_dtype) # bfloat16 -- values get quantized
```
Diffusers pipelines may keep latents in float32 throughout the loop. The per-element difference is only ~1.5e-02, but this compounds over 30 denoising steps via 1/sigma amplification (Pitfall #13) and produces completely washed-out output.
**Fix**: Match the reference -- generate noise in the model's working dtype:
```python
latent_dtype = self.transformer.dtype # e.g. bfloat16
latents = self.prepare_latents(..., dtype=latent_dtype, ...)
```
**Detection**: Encode stage test shows initial latent max_diff of exactly ~1.5e-02. This specific magnitude is the signature of float32->bfloat16 quantization error.
## 12. RoPE position dtype
RoPE cosine/sine values are sensitive to position coordinate dtype. If reference uses bfloat16 positions but diffusers uses float32, the RoPE output diverges significantly (max_diff up to 2.0). Different modalities may use different position dtypes (e.g. video bfloat16, audio float32) -- check the reference carefully.
## 13. 1/sigma error amplification in Euler denoising
In Euler/flow-matching, the velocity formula divides by sigma: `v = (latents - pred_x0) / sigma`. As sigma shrinks from ~1.0 (step 0) to ~0.001 (step 29), errors are amplified up to 1000x. A 1.5e-02 init difference grows linearly through mid-steps, then exponentially in final steps, reaching max_diff ~6.0. This is why dtype mismatches (Pitfalls #11, #12) that seem tiny at init produce visually broken output. Use per-step accumulation tracing to diagnose.
## 14. Config value assumptions -- always diff, never assume
When debugging parity, don't assume config values match code defaults. The published model checkpoint may override defaults with different values. A wrong assumption about a single config field can send you down hours of debugging in the wrong direction.
**The pattern that goes wrong:**
1. You see `param_x` has default `1` in the code
2. The reference code also uses `param_x` with a default of `1`
3. You assume both sides use `1` and apply a "fix" based on that
4. But the actual checkpoint config has `param_x: 1000`, and so does the published diffusers config
5. Your "fix" now *creates* divergence instead of fixing it
**Prevention -- config diff first:**
```python
# Reference: read from checkpoint metadata (no model loading needed)
from safetensors import safe_open
import json
ref_config = json.loads(safe_open(checkpoint_path, framework="pt").metadata()["config"])
# Diffusers: read from model config
from diffusers import MyModel
diff_model = MyModel.from_pretrained(model_path, subfolder="transformer")
diff_config = dict(diff_model.config)
# Compare all values
for key in sorted(set(list(ref_config.get("transformer", {}).keys()) + list(diff_config.keys()))):
    ref_val = ref_config.get("transformer", {}).get(key, "MISSING")
    diff_val = diff_config.get(key, "MISSING")
    if ref_val != diff_val:
        print(f"  DIFF {key}: ref={ref_val}, diff={diff_val}")
```
Run this **before** writing any hooks, analysis code, or fixes. It takes 30 seconds and catches wrong assumptions immediately.
**When debugging divergence -- trace values, don't reason about them:**
If two implementations diverge, hook the actual intermediate values at the point of divergence rather than reading code to figure out what the values "should" be. Code analysis builds on assumptions; value tracing reveals facts.
## 15. Decoder config mismatch (causes pixelated artifacts)
The upstream model config may have wrong values for decoder-specific parameters (e.g. `upsample_residual`, `upsample_type`). These control whether the decoder uses skip connections in upsampling -- getting them wrong produces severe pixelation or blocky artifacts.
**Detection**: Feed identical post-loop latents through both decoders. If max pixel diff is large (PSNR < 40 dB) on CPU/float32, it's a real bug, not precision noise. Trace through decoder blocks (conv_in -> mid_block -> up_blocks) to find where divergence starts.
**Fix**: Correct the config value. Don't edit cached files in `~/.cache/huggingface/` -- either save to a local model directory or open a PR on the upstream repo (see Testing Rule #7).
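A minimal PSNR helper for the detection step above (assumes decoded outputs are float arrays scaled to [0, 1]):

```python
import numpy as np

def psnr(a: np.ndarray, b: np.ndarray) -> float:
    """Peak signal-to-noise ratio in dB between two same-shape arrays in [0, 1]."""
    mse = float(np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2))
    return float("inf") if mse == 0 else 10.0 * np.log10(1.0 / mse)

# With identical post-loop latents on CPU/float32, psnr(ref_img, diff_img) < 40 dB
# indicates a real decoder bug rather than precision noise.
```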
## 16. Incomplete injection tests -- inject ALL variables or the test is invalid
When doing injection tests (feeding reference tensors into the diffusers pipeline), you must inject **every** divergent input, including sigmas/timesteps. A common mistake: the preloop checkpoint saves sigmas but the injection code only loads latents and embeddings. The test then runs with different sigma schedules, making it impossible to isolate the real cause.
**Prevention**: After writing injection code, verify by listing every variable the injected stage consumes and checking each one is either (a) injected from reference, or (b) confirmed identical between pipelines.
## 17. bf16 connector/encoder divergence -- don't chase it
When running on GPU/bfloat16, multi-layer encoders (e.g. 8-layer connector transformers) accumulate bf16 rounding noise that looks alarming (max_diff 0.3-2.7). Before investigating, re-run the component test on CPU/float32. If it passes (max_diff < 1e-4), the divergence is pure precision noise, not a code bug. Don't spend hours tracing through layers -- confirm on CPU/float32 and move on.
## 18. Stale test fixtures
When using saved tensors for cross-pipeline comparison, always ensure both sets of tensors were captured from the same run configuration (same seed, same config, same code version). Mixing fixtures from different runs (e.g. reference tensors from yesterday, diffusers tensors from today after a code change) creates phantom divergence that wastes debugging time. Regenerate both sides in a single test script execution.
cff-version: 1.2.0
title: 'Diffusers: State-of-the-art diffusion models'
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Patrick
    family-names: von Platen
  - given-names: Suraj
    family-names: Patil
  - given-names: Anton
    family-names: Lozhkov
  - given-names: Pedro
    family-names: Cuenca
  - given-names: Nathan
    family-names: Lambert
  - given-names: Kashif
    family-names: Rasul
  - given-names: Mishig
    family-names: Davaadorj
  - given-names: Dhruv
    family-names: Nair
  - given-names: Sayak
    family-names: Paul
  - given-names: Steven
    family-names: Liu
  - given-names: William
    family-names: Berman
  - given-names: Yiyi
    family-names: Xu
  - given-names: Thomas
    family-names: Wolf
repository-code: 'https://github.com/huggingface/diffusers'
abstract: >-
  Diffusers provides pretrained diffusion models across
  multiple modalities, such as vision and audio, and serves
  as a modular toolbox for inference and training of
  diffusion models.
keywords:
- deep-learning
- pytorch
- image-generation
- hacktoberfest
- diffusion
- text2image
- image2image
- score-based-generative-modeling
- stable-diffusion
- stable-diffusion-diffusers
license: Apache-2.0
version: 0.12.1
# Contributor Covenant Code of Conduct
## Our Pledge
We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, caste, color, religion, or sexual identity
and orientation.
We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.
## Our Standards
Examples of behavior that contributes to a positive environment for our
community include:
* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
and learning from the experience
* Focusing on what is best not just for us as individuals, but for the
overall Diffusers community
Examples of unacceptable behavior include:
* The use of sexualized language or imagery, and sexual attention or
advances of any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email
address, without their explicit permission
* Spamming issues or PRs with links to projects unrelated to this library
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Enforcement Responsibilities
Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.
Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.
## Scope
This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official e-mail address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement at
feedback@huggingface.co.
All complaints will be reviewed and investigated promptly and fairly.
All community leaders are obligated to respect the privacy and security of the
reporter of any incident.
## Enforcement Guidelines
Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:
### 1. Correction
**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.
**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.
### 2. Warning
**Community Impact**: A violation through a single incident or series
of actions.
**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or
permanent ban.
### 3. Temporary Ban
**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.
**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.
### 4. Permanent Ban
**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.
**Consequence**: A permanent ban from any sort of public interaction within
the community.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 2.1, available at
https://www.contributor-covenant.org/version/2/1/code_of_conduct.html.
Community Impact Guidelines were inspired by [Mozilla's code of conduct
enforcement ladder](https://github.com/mozilla/diversity).
[homepage]: https://www.contributor-covenant.org
For answers to common questions about this code of conduct, see the FAQ at
https://www.contributor-covenant.org/faq. Translations are available at
https://www.contributor-covenant.org/translations.
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
include LICENSE
include src/diffusers/utils/model_card_template.md
.PHONY: deps_table_update modified_only_fixup extra_style_checks quality style fixup fix-copies test test-examples codex claude clean-ai
# make sure to test the local checkout in scripts and not the pre-installed one (don't use quotes!)
export PYTHONPATH = src
check_dirs := examples scripts src tests utils benchmarks
modified_only_fixup:
$(eval modified_py_files := $(shell python utils/get_modified_files.py $(check_dirs)))
@if test -n "$(modified_py_files)"; then \
echo "Checking/fixing $(modified_py_files)"; \
ruff check $(modified_py_files) --fix; \
ruff format $(modified_py_files);\
else \
echo "No library .py files were modified"; \
fi
# Update src/diffusers/dependency_versions_table.py
deps_table_update:
@python setup.py deps_table_update
deps_table_check_updated:
@md5sum src/diffusers/dependency_versions_table.py > md5sum.saved
@python setup.py deps_table_update
@md5sum -c --quiet md5sum.saved || (printf "\nError: the version dependency table is outdated.\nPlease run 'make fixup' or 'make style' and commit the changes.\n\n" && exit 1)
@rm md5sum.saved
# autogenerating code
autogenerate_code: deps_table_update
# Check that the repo is in a good state
repo-consistency:
python utils/check_dummies.py
python utils/check_repo.py
python utils/check_inits.py
# this target runs checks on all files
quality:
ruff check $(check_dirs) setup.py
ruff format --check $(check_dirs) setup.py
doc-builder style src/diffusers docs/source --max_len 119 --check_only
python utils/check_doc_toc.py
# Format source code automatically and check if there are any problems left that need manual fixing
extra_style_checks:
python utils/custom_init_isort.py
python utils/check_doc_toc.py --fix_and_overwrite
# this target runs checks on all files and potentially modifies some of them
style:
ruff check $(check_dirs) setup.py --fix
ruff format $(check_dirs) setup.py
doc-builder style src/diffusers docs/source --max_len 119
${MAKE} autogenerate_code
${MAKE} extra_style_checks
# Super fast fix and check target that only works on relevant modified files since the branch was made
fixup: modified_only_fixup extra_style_checks autogenerate_code repo-consistency
# Make marked copies of snippets of code conform to the original
fix-copies:
python utils/check_copies.py --fix_and_overwrite
python utils/check_dummies.py --fix_and_overwrite
# Auto docstrings in modular blocks
modular-autodoctrings:
python utils/modular_auto_docstring.py
# Run tests for the library
test:
python -m pytest -n auto --dist=loadfile -s -v ./tests/
# Run tests for examples
test-examples:
python -m pytest -n auto --dist=loadfile -s -v ./examples/
# Release stuff
pre-release:
python utils/release.py
pre-patch:
python utils/release.py --patch
post-release:
python utils/release.py --post_release
post-patch:
python utils/release.py --post_release --patch
# AI agent symlinks
codex:
ln -snf .ai/AGENTS.md AGENTS.md
mkdir -p .agents
rm -rf .agents/skills
ln -snf ../.ai/skills .agents/skills
claude:
ln -snf .ai/AGENTS.md CLAUDE.md
mkdir -p .claude
rm -rf .claude/skills
ln -snf ../.ai/skills .claude/skills
clean-ai:
rm -f AGENTS.md CLAUDE.md
rm -rf .agents/skills .claude/skills
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Philosophy
🧨 Diffusers provides **state-of-the-art** pretrained diffusion models across multiple modalities.
Its purpose is to serve as a **modular toolbox** for both inference and training.
We aim to build a library that stands the test of time and therefore take API design very seriously.
In a nutshell, Diffusers is built to be a natural extension of PyTorch. Therefore, most of our design choices are based on [PyTorch's Design Principles](https://pytorch.org/docs/stable/community/design.html#pytorch-design-philosophy). Let's go over the most important ones:
## Usability over Performance
- While Diffusers has many built-in performance-enhancing features (see [Memory and Speed](https://huggingface.co/docs/diffusers/optimization/fp16)), models are always loaded with the highest precision and lowest optimization. Therefore, by default diffusion pipelines are always instantiated on CPU with float32 precision if not otherwise defined by the user. This ensures usability across different platforms and accelerators and means that no complex installations are required to run the library. A short example of this opt-in behavior is sketched after this list.
- Diffusers aims to be a **light-weight** package and therefore has very few required dependencies, but many soft dependencies that can improve performance (such as `accelerate`, `safetensors`, `onnx`, etc...). We strive to keep the library as lightweight as possible so that it can be added without much concern as a dependency on other packages.
- Diffusers prefers simple, self-explainable code over condensed, magic code. This means that short-hand code syntaxes such as lambda functions and advanced PyTorch operators are often not desired.
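To make the opt-in explicit, here is a short, purely illustrative sketch (the checkpoint name is only an example; any Diffusers checkpoint works the same way):

```python
import torch

from diffusers import DiffusionPipeline

# Default behavior: full float32 weights on CPU, so it runs anywhere with no extra setup.
pipe = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5")

# Performance is an explicit opt-in: the user requests lower precision and an accelerator.
pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
```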
## Simple over easy
As PyTorch states, **explicit is better than implicit** and **simple is better than complex**. This design philosophy is reflected in multiple parts of the library:
- We follow PyTorch's API with methods like [`DiffusionPipeline.to`](https://huggingface.co/docs/diffusers/main/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.to) to let the user handle device management.
- Raising concise error messages is preferred to silently correct erroneous input. Diffusers aims at teaching the user, rather than making the library as easy to use as possible.
- Complex model vs. scheduler logic is exposed instead of magically handled inside. Schedulers/Samplers are separated from diffusion models with minimal dependencies on each other. This forces the user to write the unrolled denoising loop. However, the separation allows for easier debugging and gives the user more control over adapting the denoising process or switching out diffusion models or schedulers.
- Separately trained components of the diffusion pipeline, *e.g.* the text encoder, the UNet, and the variational autoencoder, each have their own model class. This forces the user to handle the interaction between the different model components, and the serialization format separates the model components into different files. However, this allows for easier debugging and customization. DreamBooth or Textual Inversion training
is very simple thanks to Diffusers' ability to separate single components of the diffusion pipeline.
## Tweakable, contributor-friendly over abstraction
For large parts of the library, Diffusers adopts an important design principle of the [Transformers library](https://github.com/huggingface/transformers), which is to prefer copy-pasted code over hasty abstractions. This design principle is very opinionated and stands in stark contrast to popular design principles such as [Don't repeat yourself (DRY)](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself).
In short, just like Transformers does for modeling files, Diffusers prefers to keep an extremely low level of abstraction and very self-contained code for pipelines and schedulers.
Functions, long code blocks, and even classes can be copied across multiple files which at first can look like a bad, sloppy design choice that makes the library unmaintainable.
**However**, this design has proven to be extremely successful for Transformers and makes a lot of sense for community-driven, open-source machine learning libraries because:
- Machine Learning is an extremely fast-moving field in which paradigms, model architectures, and algorithms are changing rapidly, which therefore makes it very difficult to define long-lasting code abstractions.
- Machine Learning practitioners like to be able to quickly tweak existing code for ideation and research and therefore prefer self-contained code over code that contains many abstractions.
- Open-source libraries rely on community contributions and therefore must build a library that is easy to contribute to. The more abstract the code, the more dependencies, the harder to read, and the harder to contribute to. Contributors simply stop contributing to very abstract libraries out of fear of breaking vital functionality. If contributing to a library cannot break other fundamental code, not only is it more inviting for potential new contributors, but it is also easier to review and contribute to multiple parts in parallel.
At Hugging Face, we call this design the **single-file policy** which means that almost all of the code of a certain class should be written in a single, self-contained file. To read more about the philosophy, you can have a look
at [this blog post](https://huggingface.co/blog/transformers-design-philosophy).
In Diffusers, we follow this philosophy for both pipelines and schedulers, but only partly for diffusion models. The reason we don't follow this design fully for diffusion models is because almost all diffusion pipelines, such
as [DDPM](https://huggingface.co/docs/diffusers/api/pipelines/ddpm), [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview#stable-diffusion-pipelines), [unCLIP (DALL·E 2)](https://huggingface.co/docs/diffusers/api/pipelines/unclip) and [Imagen](https://imagen.research.google/) all rely on the same diffusion model, the [UNet](https://huggingface.co/docs/diffusers/api/models/unet2d-cond).
Great, now you should have generally understood why 🧨 Diffusers is designed the way it is 🤗.
We try to apply these design principles consistently across the library. Nevertheless, there are some minor exceptions to the philosophy or some unlucky design choices. If you have feedback regarding the design, we would ❤️ to hear it [directly on GitHub](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feedback.md&title=).
## Design Philosophy in Details
Now, let's look a bit into the nitty-gritty details of the design philosophy. Diffusers essentially consists of three major classes: [pipelines](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines), [models](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models), and [schedulers](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers).
Let's walk through more detailed design decisions for each class.
### Pipelines
Pipelines are designed to be easy to use (therefore do not follow [*Simple over easy*](#simple-over-easy) 100%), are not feature complete, and should loosely be seen as examples of how to use [models](#models) and [schedulers](#schedulers) for inference.
The following design principles are followed:
- Pipelines follow the single-file policy. All pipelines can be found in individual directories under src/diffusers/pipelines. One pipeline folder corresponds to one diffusion paper/project/release. Multiple pipeline files can be gathered in one pipeline folder, as it’s done for [`src/diffusers/pipelines/stable-diffusion`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/stable_diffusion). If pipelines share similar functionality, one can make use of the [# Copied from mechanism](https://github.com/huggingface/diffusers/blob/125d783076e5bd9785beb05367a2d2566843a271/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py#L251).
- Pipelines all inherit from [`DiffusionPipeline`] (a minimal skeleton illustrating these conventions is sketched after this list).
- Every pipeline consists of different model and scheduler components that are documented in the [`model_index.json` file](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/blob/main/model_index.json), are accessible under the same name as attributes of the pipeline, and can be shared between pipelines with the [`DiffusionPipeline.components`](https://huggingface.co/docs/diffusers/main/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.components) function.
- Every pipeline should be loadable via the [`DiffusionPipeline.from_pretrained`](https://huggingface.co/docs/diffusers/main/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained) function.
- Pipelines should be used **only** for inference.
- Pipelines should be very readable, self-explanatory, and easy to tweak.
- Pipelines should be designed to build on top of each other and be easy to integrate into higher-level APIs.
- Pipelines are **not** intended to be feature-complete user interfaces. For feature-complete user interfaces one should rather have a look at [InvokeAI](https://github.com/invoke-ai/InvokeAI), [Diffuzers](https://github.com/abhishekkrthakur/diffuzers), and [lama-cleaner](https://github.com/Sanster/lama-cleaner).
- Every pipeline should have one and only one way to run it via a `__call__` method. The naming of the `__call__` arguments should be shared across all pipelines.
- Pipelines should be named after the task they are intended to solve.
- In almost all cases, novel diffusion pipelines shall be implemented in a new pipeline folder/file.
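To make these conventions concrete, here is a minimal, purely illustrative pipeline skeleton. The class and component names are hypothetical and not part of the library; it only sketches the shape a pipeline typically takes.

```python
import torch

from diffusers import DiffusionPipeline


class ToyPipeline(DiffusionPipeline):
    def __init__(self, unet, scheduler):
        super().__init__()
        # Registered components become pipeline attributes and are recorded in `model_index.json`.
        self.register_modules(unet=unet, scheduler=scheduler)

    @torch.no_grad()
    def __call__(self, batch_size: int = 1, num_inference_steps: int = 50, generator=None):
        # Start from Gaussian noise; `generator` keeps the run reproducible.
        sample_size = self.unet.config.sample_size
        sample = torch.randn(
            (batch_size, self.unet.config.in_channels, sample_size, sample_size), generator=generator
        ).to(self.device)

        # The denoising loop is written out explicitly: model and scheduler stay decoupled.
        self.scheduler.set_timesteps(num_inference_steps)
        for t in self.progress_bar(self.scheduler.timesteps):
            model_output = self.unet(sample, t).sample
            sample = self.scheduler.step(model_output, t, sample).prev_sample
        return sample
```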
### Models
Models are designed as configurable toolboxes that are natural extensions of [PyTorch's Module class](https://pytorch.org/docs/stable/generated/torch.nn.Module.html). They only partly follow the **single-file policy**.
The following design principles are followed:
- Models correspond to **a type of model architecture**. *E.g.* the [`UNet2DConditionModel`] class is used for all UNet variations that expect 2D image inputs and are conditioned on some context.
- All models can be found in [`src/diffusers/models`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/models) and every model architecture shall be defined in its file, e.g. [`unets/unet_2d_condition.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unets/unet_2d_condition.py), [`transformers/transformer_2d.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformers/transformer_2d.py), etc...
- Models **do not** follow the single-file policy and should make use of smaller model building blocks, such as [`attention.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention.py), [`resnet.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/resnet.py), [`embeddings.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/embeddings.py), etc... **Note**: This is in stark contrast to Transformers' modeling files and shows that models do not really follow the single-file policy.
- Models intend to expose complexity, just like PyTorch's `Module` class, and give clear error messages.
- Models all inherit from `ModelMixin` and `ConfigMixin`.
- Models can be optimized for performance when it doesn't demand major code changes, keeps backward compatibility, and gives a significant memory or compute gain.
- Models should by default have the highest precision and lowest performance setting.
- To integrate new model checkpoints whose general architecture can be classified as an architecture that already exists in Diffusers, the existing model architecture shall be adapted to make it work with the new checkpoint. One should only create a new file if the model architecture is fundamentally different.
- Models should be designed to be easily extendable to future changes. This can be achieved by limiting public function arguments and configuration arguments, and by "foreseeing" future changes: *e.g.*, it is usually better to add a `string` "...type" argument that can easily be extended to new future types than a boolean `is_..._type` argument (see the short illustration after this list). Only the minimum amount of changes shall be made to existing architectures to make a new model checkpoint work.
- The model design is a difficult trade-off between keeping code readable and concise and supporting many model checkpoints. For most parts of the modeling code, classes shall be adapted for new model checkpoints, while there are some exceptions where it is preferred to add new classes to make sure the code is kept concise and
readable long-term, such as [UNet blocks](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/unets/unet_2d_blocks.py) and [Attention processors](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
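As a short, purely illustrative example of the `"...type"` convention mentioned above (the block and argument names are hypothetical):

```python
import torch.nn as nn


class ToyBlock(nn.Module):
    def __init__(self, dim: int, norm_type: str = "layer_norm"):
        super().__init__()
        # A string argument can later accept new values without a breaking signature change,
        # whereas a boolean `use_layer_norm` flag cannot be extended.
        if norm_type == "layer_norm":
            self.norm = nn.LayerNorm(dim)
        elif norm_type == "group_norm":
            self.norm = nn.GroupNorm(num_groups=1, num_channels=dim)
        else:
            raise ValueError(f"Unsupported norm_type: {norm_type}")
```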
### Schedulers
Schedulers are responsible for guiding the denoising process for inference as well as for defining a noise schedule for training. They are designed as individual classes with loadable configuration files and strongly follow the **single-file policy**.
The following design principles are followed:
- All schedulers are found in [`src/diffusers/schedulers`](https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers).
- Schedulers are **not** allowed to import from large utils files and shall be kept very self-contained.
- One scheduler Python file corresponds to one scheduler algorithm (as might be defined in a paper).
- If schedulers share similar functionalities, we can make use of the `# Copied from` mechanism.
- Schedulers all inherit from `SchedulerMixin` and `ConfigMixin`.
- Schedulers can be easily swapped out with the [`ConfigMixin.from_config`](https://huggingface.co/docs/diffusers/main/en/api/configuration#diffusers.ConfigMixin.from_config) method as explained in detail [here](./docs/source/en/using-diffusers/schedulers.md).
- Every scheduler has to have a `set_num_inference_steps` and a `step` function. `set_num_inference_steps(...)` has to be called before every denoising process, *i.e.* before `step(...)` is called (a minimal skeleton illustrating this interface is sketched after this list).
- Every scheduler exposes the timesteps to be "looped over" via a `timesteps` attribute, which is an array of timesteps the model will be called upon.
- The `step(...)` function takes a predicted model output and the "current" sample (x_t) and returns the "previous", slightly more denoised sample (x_t-1).
- Given the complexity of diffusion schedulers, the `step` function does not expose all the complexity and can be a bit of a "black box".
- In almost all cases, novel schedulers shall be implemented in a new scheduling file.
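A minimal, purely illustrative skeleton of this interface is sketched below. The class, names, and update rule are placeholders; concrete schedulers in the library expose the step-count setter as `set_timesteps`.

```python
from dataclasses import dataclass

import torch

from diffusers.configuration_utils import ConfigMixin, register_to_config
from diffusers.schedulers.scheduling_utils import SchedulerMixin


@dataclass
class ToySchedulerOutput:
    prev_sample: torch.Tensor


class ToyScheduler(SchedulerMixin, ConfigMixin):
    @register_to_config
    def __init__(self, num_train_timesteps: int = 1000):
        self.timesteps = torch.arange(num_train_timesteps - 1, -1, -1)

    def set_timesteps(self, num_inference_steps: int):
        # Expose the timesteps the model will be "looped over" during denoising.
        step_ratio = self.config.num_train_timesteps // num_inference_steps
        self.timesteps = (torch.arange(num_inference_steps) * step_ratio).flip(0)

    def step(self, model_output: torch.Tensor, timestep: int, sample: torch.Tensor) -> ToySchedulerOutput:
        # Take the model prediction for the "current" sample x_t and return the slightly
        # more denoised "previous" sample x_{t-1}. The update rule here is a placeholder.
        prev_sample = sample - 0.1 * model_output
        return ToySchedulerOutput(prev_sample=prev_sample)
```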
<!---
Copyright 2022 - The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<p align="center">
<br>
<img src="https://raw.githubusercontent.com/huggingface/diffusers/main/docs/source/en/imgs/diffusers_library.jpg" width="400"/>
<br>
</p>
<p align="center">
<a href="https://github.com/huggingface/diffusers/blob/main/LICENSE"><img alt="GitHub" src="https://img.shields.io/github/license/huggingface/datasets.svg?color=blue"></a>
<a href="https://github.com/huggingface/diffusers/releases"><img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/diffusers.svg"></a>
<a href="https://pepy.tech/project/diffusers"><img alt="GitHub release" src="https://static.pepy.tech/badge/diffusers/month"></a>
<a href="CODE_OF_CONDUCT.md"><img alt="Contributor Covenant" src="https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg"></a>
<a href="https://twitter.com/diffuserslib"><img alt="X account" src="https://img.shields.io/twitter/url/https/twitter.com/diffuserslib.svg?style=social&label=Follow%20%40diffuserslib"></a>
</p>
🤗 Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. Whether you're looking for a simple inference solution or training your own diffusion models, 🤗 Diffusers is a modular toolbox that supports both. Our library is designed with a focus on [usability over performance](https://huggingface.co/docs/diffusers/conceptual/philosophy#usability-over-performance), [simple over easy](https://huggingface.co/docs/diffusers/conceptual/philosophy#simple-over-easy), and [customizability over abstractions](https://huggingface.co/docs/diffusers/conceptual/philosophy#tweakable-contributorfriendly-over-abstraction).
🤗 Diffusers offers three core components:
- State-of-the-art [diffusion pipelines](https://huggingface.co/docs/diffusers/api/pipelines/overview) that can be run in inference with just a few lines of code.
- Interchangeable noise [schedulers](https://huggingface.co/docs/diffusers/api/schedulers/overview) for different diffusion speeds and output quality.
- Pretrained [models](https://huggingface.co/docs/diffusers/api/models/overview) that can be used as building blocks, and combined with schedulers, for creating your own end-to-end diffusion systems.
## Installation
We recommend installing 🤗 Diffusers in a virtual environment from PyPI or Conda. For more details about installing [PyTorch](https://pytorch.org/get-started/locally/), please refer to their official documentation.
### PyTorch
With `pip` (official package):
```bash
pip install --upgrade diffusers[torch]
```
With `conda` (maintained by the community):
```sh
conda install -c conda-forge diffusers
```
### Apple Silicon (M1/M2) support
Please refer to the [How to use Stable Diffusion in Apple Silicon](https://huggingface.co/docs/diffusers/optimization/mps) guide.
## Quickstart
Generating outputs is super easy with 🤗 Diffusers. To generate an image from text, use the `from_pretrained` method to load any pretrained diffusion model (browse the [Hub](https://huggingface.co/models?library=diffusers&sort=downloads) for 30,000+ checkpoints):
```python
from diffusers import DiffusionPipeline
import torch
pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipeline.to("cuda")
pipeline("An image of a squirrel in Picasso style").images[0]
```
You can also dig into the models and schedulers toolbox to build your own diffusion system:
```python
from diffusers import DDPMScheduler, UNet2DModel
from PIL import Image
import torch
scheduler = DDPMScheduler.from_pretrained("google/ddpm-cat-256")
model = UNet2DModel.from_pretrained("google/ddpm-cat-256").to("cuda")
scheduler.set_timesteps(50)
sample_size = model.config.sample_size
noise = torch.randn((1, 3, sample_size, sample_size), device="cuda")
input = noise
for t in scheduler.timesteps:
with torch.no_grad():
noisy_residual = model(input, t).sample
prev_noisy_sample = scheduler.step(noisy_residual, t, input).prev_sample
input = prev_noisy_sample
image = (input / 2 + 0.5).clamp(0, 1)
image = image.cpu().permute(0, 2, 3, 1).numpy()[0]
image = Image.fromarray((image * 255).round().astype("uint8"))
image
```
Check out the [Quickstart](https://huggingface.co/docs/diffusers/quicktour) to launch your diffusion journey today!
## How to navigate the documentation
| **Documentation** | **What can I learn?** |
|---------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [Tutorial](https://huggingface.co/docs/diffusers/tutorials/tutorial_overview) | A basic crash course for learning how to use the library's most important features like using models and schedulers to build your own diffusion system, and training your own diffusion model. |
| [Loading](https://huggingface.co/docs/diffusers/using-diffusers/loading) | Guides for how to load and configure all the components (pipelines, models, and schedulers) of the library, as well as how to use different schedulers. |
| [Pipelines for inference](https://huggingface.co/docs/diffusers/using-diffusers/overview_techniques) | Guides for how to use pipelines for different inference tasks, batched generation, controlling generated outputs and randomness, and how to contribute a pipeline to the library. |
| [Optimization](https://huggingface.co/docs/diffusers/optimization/fp16) | Guides for how to optimize your diffusion model to run faster and consume less memory. |
| [Training](https://huggingface.co/docs/diffusers/training/overview) | Guides for how to train a diffusion model for different tasks with different training techniques. |
## Contribution
We ❤️ contributions from the open-source community!
If you want to contribute to this library, please check out our [Contribution guide](https://github.com/huggingface/diffusers/blob/main/CONTRIBUTING.md).
You can look out for [issues](https://github.com/huggingface/diffusers/issues) you'd like to tackle to contribute to the library.
- See [Good first issues](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) for general opportunities to contribute
- See [New model/pipeline](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+pipeline%2Fmodel%22) to contribute exciting new diffusion models / diffusion pipelines
- See [New scheduler](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+scheduler%22)
Also, say 👋 in our public Discord channel <a href="https://discord.gg/G7tWnz98XR"><img alt="Join us on Discord" src="https://img.shields.io/discord/823813159592001537?color=5865F2&logo=discord&logoColor=white"></a>. We discuss the hottest trends about diffusion models, help each other with contributions, personal projects or just hang out ☕.
## Popular Tasks & Pipelines
<table>
<tr>
<th>Task</th>
<th>Pipeline</th>
<th>🤗 Hub</th>
</tr>
<tr style="border-top: 2px solid black">
<td>Unconditional Image Generation</td>
<td><a href="https://huggingface.co/docs/diffusers/api/pipelines/ddpm"> DDPM </a></td>
<td><a href="https://huggingface.co/google/ddpm-ema-church-256"> google/ddpm-ema-church-256 </a></td>
</tr>
<tr style="border-top: 2px solid black">
<td>Text-to-Image</td>
<td><a href="https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/text2img">Stable Diffusion Text-to-Image</a></td>
<td><a href="https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5"> stable-diffusion-v1-5/stable-diffusion-v1-5 </a></td>
</tr>
<tr>
<td>Text-to-Image</td>
<td><a href="https://huggingface.co/docs/diffusers/api/pipelines/unclip">unCLIP</a></td>
<td><a href="https://huggingface.co/kakaobrain/karlo-v1-alpha"> kakaobrain/karlo-v1-alpha </a></td>
</tr>
<tr>
<td>Text-to-Image</td>
<td><a href="https://huggingface.co/docs/diffusers/api/pipelines/deepfloyd_if">DeepFloyd IF</a></td>
<td><a href="https://huggingface.co/DeepFloyd/IF-I-XL-v1.0"> DeepFloyd/IF-I-XL-v1.0 </a></td>
</tr>
<tr>
<td>Text-to-Image</td>
<td><a href="https://huggingface.co/docs/diffusers/api/pipelines/kandinsky">Kandinsky</a></td>
<td><a href="https://huggingface.co/kandinsky-community/kandinsky-2-2-decoder"> kandinsky-community/kandinsky-2-2-decoder </a></td>
</tr>
<tr style="border-top: 2px solid black">
<td>Text-guided Image-to-Image</td>
<td><a href="https://huggingface.co/docs/diffusers/api/pipelines/controlnet">ControlNet</a></td>
<td><a href="https://huggingface.co/lllyasviel/sd-controlnet-canny"> lllyasviel/sd-controlnet-canny </a></td>
</tr>
<tr>
<td>Text-guided Image-to-Image</td>
<td><a href="https://huggingface.co/docs/diffusers/api/pipelines/pix2pix">InstructPix2Pix</a></td>
<td><a href="https://huggingface.co/timbrooks/instruct-pix2pix"> timbrooks/instruct-pix2pix </a></td>
</tr>
<tr>
<td>Text-guided Image-to-Image</td>
<td><a href="https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/img2img">Stable Diffusion Image-to-Image</a></td>
<td><a href="https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5"> stable-diffusion-v1-5/stable-diffusion-v1-5 </a></td>
</tr>
<tr style="border-top: 2px solid black">
<td>Text-guided Image Inpainting</td>
<td><a href="https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/inpaint">Stable Diffusion Inpainting</a></td>
<td><a href="https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-inpainting"> stable-diffusion-v1-5/stable-diffusion-inpainting </a></td>
</tr>
<tr style="border-top: 2px solid black">
<td>Image Variation</td>
<td><a href="https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/image_variation">Stable Diffusion Image Variation</a></td>
<td><a href="https://huggingface.co/lambdalabs/sd-image-variations-diffusers"> lambdalabs/sd-image-variations-diffusers </a></td>
</tr>
<tr style="border-top: 2px solid black">
<td>Super Resolution</td>
<td><a href="https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/upscale">Stable Diffusion Upscale</a></td>
<td><a href="https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler"> stabilityai/stable-diffusion-x4-upscaler </a></td>
</tr>
<tr>
<td>Super Resolution</td>
<td><a href="https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/latent_upscale">Stable Diffusion Latent Upscale</a></td>
<td><a href="https://huggingface.co/stabilityai/sd-x2-latent-upscaler"> stabilityai/sd-x2-latent-upscaler </a></td>
</tr>
</table>
## Popular libraries using 🧨 Diffusers
- https://github.com/microsoft/TaskMatrix
- https://github.com/invoke-ai/InvokeAI
- https://github.com/InstantID/InstantID
- https://github.com/apple/ml-stable-diffusion
- https://github.com/Sanster/lama-cleaner
- https://github.com/IDEA-Research/Grounded-Segment-Anything
- https://github.com/ashawkey/stable-dreamfusion
- https://github.com/deep-floyd/IF
- https://github.com/bentoml/BentoML
- https://github.com/bmaltais/kohya_ss
- +14,000 other amazing GitHub repositories 💪
Thank you for using us ❤️.
## Credits
This library concretizes previous work by many different authors and would not have been possible without their great research and implementations. We'd like to thank, in particular, the following implementations which have helped us in our development and without which the API could not have been as polished today:
- @CompVis' latent diffusion models library, available [here](https://github.com/CompVis/latent-diffusion)
- @hojonathanho original DDPM implementation, available [here](https://github.com/hojonathanho/diffusion) as well as the extremely useful translation into PyTorch by @pesser, available [here](https://github.com/pesser/pytorch_diffusion)
- @ermongroup's DDIM implementation, available [here](https://github.com/ermongroup/ddim)
- @yang-song's Score-VE and Score-VP implementations, available [here](https://github.com/yang-song/score_sde_pytorch)
We also want to thank @heejkoo for the very helpful overview of papers, code and resources on diffusion models, available [here](https://github.com/heejkoo/Awesome-Diffusion-Models) as well as @crowsonkb and @rromb for useful discussions and insights.
## Citation
```bibtex
@misc{von-platen-etal-2022-diffusers,
author = {Patrick von Platen and Suraj Patil and Anton Lozhkov and Pedro Cuenca and Nathan Lambert and Kashif Rasul and Mishig Davaadorj and Dhruv Nair and Sayak Paul and William Berman and Yiyi Xu and Steven Liu and Thomas Wolf},
title = {Diffusers: State-of-the-art diffusion models},
year = {2022},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huggingface/diffusers}}
}
```
# Files for typos
# Instruction: https://github.com/marketplace/actions/typos-action#getting-started
[default.extend-identifiers]
[default.extend-words]
NIN="NIN" # NIN is used in scripts/convert_ncsnpp_original_checkpoint_to_diffusers.py
nd="np" # nd may be np (numpy)
parms="parms" # parms is used in scripts/convert_original_stable_diffusion_to_diffusers.py
[files]
extend-exclude = ["_typos.toml"]
# Diffusers Benchmarks
Welcome to Diffusers Benchmarks. These benchmarks are used to obtain latency and memory information of the most popular models across different scenarios such as:
* Base case i.e., when using `torch.bfloat16` and `torch.nn.functional.scaled_dot_product_attention`.
* Base + `torch.compile()`
* NF4 quantization
* Layerwise upcasting
Instead of full diffusion pipelines, only the forward pass of the respective model classes (such as `FluxTransformer2DModel`) is tested with the real checkpoints (such as `"black-forest-labs/FLUX.1-dev"`).
The entrypoint to running all the currently available benchmarks is in `run_all.py`. However, one can run the individual benchmarks, too, e.g., `python benchmarking_flux.py`. It should produce a CSV file containing various information about the benchmarks run.
The benchmarks are run on a weekly basis and the CI is defined in [benchmark.yml](../.github/workflows/benchmark.yml).
## Running the benchmarks manually
First, set up `torch` and install `diffusers` from the root of the repository:
```sh
pip install -e ".[quality,test]"
```
Then make sure the other dependencies are installed:
```sh
cd benchmarks/
pip install -r requirements.txt
```
We need to be authenticated to access some of the checkpoints used during benchmarking:
```sh
hf auth login
```
We use an L40 GPU with 128GB RAM to run the benchmark CI. As such, the benchmarks are configured to run on NVIDIA GPUs. So, make sure you have access to a similar machine (or modify the benchmarking scripts accordingly).
Then you can either launch the entire benchmarking suite by running:
```sh
python run_all.py
```
Or, you can run the individual benchmarks.
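For example, to run only the Flux benchmark (it writes its results to `flux.csv`):
```sh
python benchmarking_flux.py
```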
## Customizing the benchmarks
We define "scenarios" to cover the most common ways in which these models are used. You can define a new scenario by modifying an existing benchmark file:
```py
BenchmarkScenario(
name=f"{CKPT_ID}-bnb-8bit",
model_cls=FluxTransformer2DModel,
model_init_kwargs={
"pretrained_model_name_or_path": CKPT_ID,
"torch_dtype": torch.bfloat16,
"subfolder": "transformer",
"quantization_config": BitsAndBytesConfig(load_in_8bit=True),
},
get_model_input_dict=partial(get_input_dict, device=torch_device, dtype=torch.bfloat16),
model_init_fn=model_init_fn,
)
```
You can also configure a new model-level benchmark and add it to the existing suite. To do so, it should be enough to define a valid benchmarking file like `benchmarking_flux.py`.
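For instance, a new file might follow the sketch below. The model class, checkpoint ID, input shapes, and filenames are placeholders; `benchmarking_flux.py` in this folder is the canonical reference.

```py
# benchmarking_toy.py: a hypothetical, minimal example of a new benchmarking file
from functools import partial

import torch
from benchmarking_utils import BenchmarkMixin, BenchmarkScenario, model_init_fn

from diffusers import UNet2DConditionModel
from diffusers.utils.testing_utils import torch_device

CKPT_ID = "some-org/some-checkpoint"  # placeholder checkpoint
RESULT_FILENAME = "toy.csv"


def get_input_dict(**device_dtype_kwargs):
    # Placeholder shapes; adapt them to the model being benchmarked.
    return {
        "sample": torch.randn(1, 4, 64, 64, **device_dtype_kwargs),
        "timestep": torch.tensor([1.0], **device_dtype_kwargs),
        "encoder_hidden_states": torch.randn(1, 77, 768, **device_dtype_kwargs),
    }


if __name__ == "__main__":
    scenarios = [
        BenchmarkScenario(
            name=f"{CKPT_ID}-bf16",
            model_cls=UNet2DConditionModel,
            model_init_kwargs={
                "pretrained_model_name_or_path": CKPT_ID,
                "torch_dtype": torch.bfloat16,
                "subfolder": "unet",
            },
            get_model_input_dict=partial(get_input_dict, device=torch_device, dtype=torch.bfloat16),
            model_init_fn=model_init_fn,
        ),
    ]
    # Collate results into a CSV, same helper call as in benchmarking_flux.py.
    BenchmarkMixin().run_bencmarks_and_collate(scenarios, filename=RESULT_FILENAME)
```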
Happy benchmarking 🧨
from functools import partial
import torch
from benchmarking_utils import BenchmarkMixin, BenchmarkScenario, model_init_fn
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel
from diffusers.utils.testing_utils import torch_device
CKPT_ID = "black-forest-labs/FLUX.1-dev"
RESULT_FILENAME = "flux.csv"
def get_input_dict(**device_dtype_kwargs):
# resolution: 1024x1024
# maximum sequence length 512
hidden_states = torch.randn(1, 4096, 64, **device_dtype_kwargs)
encoder_hidden_states = torch.randn(1, 512, 4096, **device_dtype_kwargs)
pooled_prompt_embeds = torch.randn(1, 768, **device_dtype_kwargs)
image_ids = torch.ones(512, 3, **device_dtype_kwargs)
text_ids = torch.ones(4096, 3, **device_dtype_kwargs)
timestep = torch.tensor([1.0], **device_dtype_kwargs)
guidance = torch.tensor([1.0], **device_dtype_kwargs)
return {
"hidden_states": hidden_states,
"encoder_hidden_states": encoder_hidden_states,
"img_ids": image_ids,
"txt_ids": text_ids,
"pooled_projections": pooled_prompt_embeds,
"timestep": timestep,
"guidance": guidance,
}
if __name__ == "__main__":
scenarios = [
BenchmarkScenario(
name=f"{CKPT_ID}-bf16",
model_cls=FluxTransformer2DModel,
model_init_kwargs={
"pretrained_model_name_or_path": CKPT_ID,
"torch_dtype": torch.bfloat16,
"subfolder": "transformer",
},
get_model_input_dict=partial(get_input_dict, device=torch_device, dtype=torch.bfloat16),
model_init_fn=model_init_fn,
compile_kwargs={"fullgraph": True},
),
BenchmarkScenario(
name=f"{CKPT_ID}-bnb-nf4",
model_cls=FluxTransformer2DModel,
model_init_kwargs={
"pretrained_model_name_or_path": CKPT_ID,
"torch_dtype": torch.bfloat16,
"subfolder": "transformer",
"quantization_config": BitsAndBytesConfig(
load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_quant_type="nf4"
),
},
get_model_input_dict=partial(get_input_dict, device=torch_device, dtype=torch.bfloat16),
model_init_fn=model_init_fn,
),
BenchmarkScenario(
name=f"{CKPT_ID}-layerwise-upcasting",
model_cls=FluxTransformer2DModel,
model_init_kwargs={
"pretrained_model_name_or_path": CKPT_ID,
"torch_dtype": torch.bfloat16,
"subfolder": "transformer",
},
get_model_input_dict=partial(get_input_dict, device=torch_device, dtype=torch.bfloat16),
model_init_fn=partial(model_init_fn, layerwise_upcasting=True),
),
BenchmarkScenario(
name=f"{CKPT_ID}-group-offload-leaf",
model_cls=FluxTransformer2DModel,
model_init_kwargs={
"pretrained_model_name_or_path": CKPT_ID,
"torch_dtype": torch.bfloat16,
"subfolder": "transformer",
},
get_model_input_dict=partial(get_input_dict, device=torch_device, dtype=torch.bfloat16),
model_init_fn=partial(
model_init_fn,
group_offload_kwargs={
"onload_device": torch_device,
"offload_device": torch.device("cpu"),
"offload_type": "leaf_level",
"use_stream": True,
"non_blocking": True,
},
),
),
]
runner = BenchmarkMixin()
runner.run_bencmarks_and_collate(scenarios, filename=RESULT_FILENAME)