# Checkpoint Mechanism for Stage Testing

## Overview

Pipelines are monolithic `__call__` methods, so you can't call "the encode part" on its own. The checkpoint mechanism lets you stop, save, or inject tensors at named locations inside the pipeline.

## The Checkpoint class

Add a `_checkpoints` argument to both the diffusers pipeline and the reference implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    save: bool = False   # capture variables into ckpt.data
    stop: bool = False   # halt pipeline after this point
    load: bool = False   # inject ckpt.data into local variables
    data: dict = field(default_factory=dict)
```

## Pipeline instrumentation

The pipeline accepts an optional `dict[str, Checkpoint]`. Place checkpoint calls at the boundaries between pipeline stages: after each encoder, before the denoising loop (capturing all loop inputs), after each loop iteration, and after the loop (capturing the final latents before decode).

```python
def __call__(self, prompt, ..., _checkpoints=None):
    # --- text encoding ---
    prompt_embeds = self.text_encoder(prompt)
    _maybe_checkpoint(_checkpoints, "text_encoding", {
        "prompt_embeds": prompt_embeds,
    })

    # --- prepare latents, sigmas, positions ---
    latents = self.prepare_latents(...)
    sigmas = self.scheduler.sigmas
    # ...

    _maybe_checkpoint(_checkpoints, "preloop", {
        "latents": latents,
        "sigmas": sigmas,
        "prompt_embeds": prompt_embeds,
        "prompt_attention_mask": prompt_attention_mask,
        "video_coords": video_coords,
        # capture EVERYTHING the loop needs -- every tensor the transformer
        # forward() receives. Missing even one variable here means you can't
        # tell if it's the source of divergence during denoise debugging.
    })

    # --- denoising loop ---
    for i, t in enumerate(timesteps):
        noise_pred = self.transformer(latents, t, prompt_embeds, ...)
        latents = self.scheduler.step(noise_pred, t, latents)[0]

        _maybe_checkpoint(_checkpoints, f"after_step_{i}", {
            "latents": latents,
        })

    _maybe_checkpoint(_checkpoints, "post_loop", {
        "latents": latents,
    })

    # --- decode ---
    video = self.vae.decode(latents)
    return video
```

## The helper function

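Each `_maybe_checkpoint` call acts on the flags of the matching `Checkpoint`: `save` captures the given variables into `ckpt.data`, and `stop` halts execution by raising an exception that is caught at the top level of `__call__`. The third flag, `load`, injects pre-populated `ckpt.data` back into local variables. A helper function cannot rebind its caller's locals, so `load` is handled at the call site (see Injection support).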

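```python
class PipelineStop(Exception):
    """Raised by a checkpoint with stop=True to end the pipeline early."""


def _maybe_checkpoint(checkpoints, name, data):
    if not checkpoints:
        return
    ckpt = checkpoints.get(name)
    if ckpt is None:
        return
    if ckpt.save:
        ckpt.data.update(data)  # mutates the caller's Checkpoint in place
    if ckpt.stop:
        raise PipelineStop  # caught at __call__ level, returns None
```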
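To see how the pieces fit together, here is a minimal, self-contained sketch that runs a toy pipeline end to end. `ToyPipeline`, its `_run` method, and its arithmetic "stages" are hypothetical stand-ins (floats play the role of tensors); only `Checkpoint`, `PipelineStop`, and `_maybe_checkpoint` follow the definitions in this document. It assumes the `try`/`except` around the pipeline body lives in `__call__` and returns `None` on an early stop.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    save: bool = False
    stop: bool = False
    load: bool = False
    data: dict = field(default_factory=dict)


class PipelineStop(Exception):
    """Signals that a checkpoint asked the pipeline to halt."""


def _maybe_checkpoint(checkpoints, name, data):
    if not checkpoints:
        return
    ckpt = checkpoints.get(name)
    if ckpt is None:
        return
    if ckpt.save:
        ckpt.data.update(data)
    if ckpt.stop:
        raise PipelineStop


class ToyPipeline:
    """Hypothetical stand-in for a real pipeline; floats replace tensors."""

    def __call__(self, prompt, num_steps=3, _checkpoints=None):
        try:
            return self._run(prompt, num_steps, _checkpoints)
        except PipelineStop:
            return None  # early stop: results live in the checkpoint dict

    def _run(self, prompt, num_steps, _checkpoints):
        prompt_embeds = float(len(prompt))  # fake "text encoder"
        _maybe_checkpoint(_checkpoints, "text_encoding",
                          {"prompt_embeds": prompt_embeds})

        latents = 1.0
        _maybe_checkpoint(_checkpoints, "preloop",
                          {"latents": latents, "prompt_embeds": prompt_embeds})

        for i in range(num_steps):
            latents = latents + prompt_embeds  # fake "denoising step"
            _maybe_checkpoint(_checkpoints, f"after_step_{i}",
                              {"latents": latents})

        _maybe_checkpoint(_checkpoints, "post_loop", {"latents": latents})
        return latents * 2  # fake "decode"


# Capture the loop inputs and halt before any denoising happens.
checkpoints = {"preloop": Checkpoint(save=True, stop=True)}
result = ToyPipeline()("hello", _checkpoints=checkpoints)
captured = checkpoints["preloop"].data
```

After the call, `result` is `None` because the pipeline stopped early, and `captured` holds whatever was passed to the `"preloop"` checkpoint. Calling the pipeline without `_checkpoints` runs it to completion as usual.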

## Injection support

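Add `load` support at each checkpoint where you might want to inject. The override sits directly after the `_maybe_checkpoint` call, so everything downstream sees the injected value: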

```python
_maybe_checkpoint(_checkpoints, "preloop", {"latents": latents, ...})

# Load support: replace local variables with injected data
if _checkpoints:
    ckpt = _checkpoints.get("preloop")
    if ckpt is not None and ckpt.load:
        latents = ckpt.data["latents"].to(device=device, dtype=latents.dtype)
```

## Key insight

The checkpoint dict is passed into the pipeline and mutated in place. After the pipeline returns (or stops early), read back `ckpt.data` to get the captured tensors. Each pipeline saves under its own key names, so the test has to map between them (e.g. reference `"video_state.latent"` -> diffusers `"latents"`).
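The key mapping can be kept as a plain dict in the test. The following is a rough sketch under stated assumptions: `KEY_MAP` and `compare_checkpoints` are hypothetical names, and flat lists of floats stand in for tensors so the example runs without PyTorch. With real tensors the per-key difference would more likely be computed as `(ref - dif).abs().max()`.

```python
# Hypothetical mapping: reference key name -> diffusers key name.
KEY_MAP = {"video_state.latent": "latents"}


def compare_checkpoints(ref_data, diffusers_data, key_map):
    """Return {reference_key: max absolute difference} for each mapped key."""
    report = {}
    for ref_key, dif_key in key_map.items():
        ref, dif = ref_data[ref_key], diffusers_data[dif_key]
        if len(ref) != len(dif):
            raise ValueError(f"shape mismatch for {ref_key!r} vs {dif_key!r}")
        report[ref_key] = max((abs(a - b) for a, b in zip(ref, dif)), default=0.0)
    return report


# Data as it might be read back from each pipeline's ckpt.data.
ref_data = {"video_state.latent": [0.1, 0.2, 0.3]}
diffusers_data = {"latents": [0.1, 0.2, 0.30001]}

report = compare_checkpoints(ref_data, diffusers_data, KEY_MAP)
mismatched = [key for key, diff in report.items() if diff > 1e-3]
```

Reporting the maximum difference per key, rather than a single pass/fail, makes it easier to see which captured variable diverges first.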

## Memory management for large models

For large models, free the source pipeline's GPU memory before loading the target pipeline: clone the tensors you intend to inject to CPU, delete everything else, then run the target with `enable_model_cpu_offload()`.
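A minimal sketch of that sequence follows. The helper names `detach_to_cpu` and `free_gpu_memory` are hypothetical, as are `source_pipe` and `target_pipe` in the trailing comments. The tensor methods (`detach`, `cpu`, `clone`) and `torch.cuda.empty_cache()` are standard PyTorch calls, and `torch` is imported lazily so the first helper can be used on its own.

```python
import gc


def detach_to_cpu(data):
    """Copy every tensor in a checkpoint's data dict to CPU memory.

    Works on any object exposing torch.Tensor's detach()/cpu()/clone().
    """
    return {name: t.detach().cpu().clone() for name, t in data.items()}


def free_gpu_memory():
    """Collect garbage, then release PyTorch's cached CUDA memory."""
    gc.collect()
    import torch  # lazy import: detach_to_cpu() does not need it

    if torch.cuda.is_available():
        torch.cuda.empty_cache()


# Intended order of operations (source_pipe / target_pipe are hypothetical):
#   injected = detach_to_cpu(checkpoints["preloop"].data)
#   del source_pipe            # drop the last reference to the source model
#   free_gpu_memory()
#   target_pipe.enable_model_cpu_offload()
#   target_ckpts = {"preloop": Checkpoint(load=True, data=injected)}
```

Note that `del source_pipe` only frees memory if no other references to the pipeline or its submodules remain, so the copies should be made before anything else is deleted.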