"git@developer.sourcefind.cn:OpenDAS/torchani.git" did not exist on "d7ef8182d8d9015882a2c2952e27e9d8bfa99115"
Unverified commit d7a1a036 authored by Steven Liu, committed by GitHub

[docs] CP (#12331)

* init

* feedback

* feedback

* feedback

* feedback

* feedback

* feedback
parent b5965454
@@ -70,8 +70,6 @@
         title: Reduce memory usage
       - local: optimization/speed-memory-optims
         title: Compiling and offloading quantized models
-      - local: api/parallel
-        title: Parallel inference
   - title: Community optimizations
     sections:
       - local: optimization/pruna
@@ -282,6 +280,8 @@
         title: Outputs
       - local: api/quantization
         title: Quantization
+      - local: api/parallel
+        title: Parallel inference
   - title: Modular
     sections:
       - local: api/modular_diffusers/pipeline
...
@@ -11,7 +11,7 @@ specific language governing permissions and limitations under the License. -->
 # Parallelism
-Parallelism strategies help speed up diffusion transformers by distributing computations across multiple devices, allowing for faster inference/training times.
+Parallelism strategies help speed up diffusion transformers by distributing computations across multiple devices, allowing for faster inference/training times. Refer to the [Distributed inference](../training/distributed_inference) guide to learn more.
 ## ParallelConfig
...
@@ -226,8 +226,64 @@ with torch.no_grad():
    image[0].save("split_transformer.png")
```
-## Resources
-- Take a look at this [script](https://gist.github.com/sayakpaul/cfaebd221820d7b43fae638b4dfa01ba) for a minimal example of distributed inference with Accelerate.
-- For more details, check out Accelerate's [Distributed inference](https://huggingface.co/docs/accelerate/en/usage_guides/distributed_inference#distributed-inference-with-accelerate) guide.
-- The `device_map` argument assign models or an entire pipeline to devices. Refer to the [device placement](../using-diffusers/loading#device-placement) docs for more information.

By selectively loading and unloading the models you need at a given stage and sharding the largest models across multiple GPUs, it is possible to run inference with large models on consumer GPUs.
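One way to make room between stages is to drop references to models you no longer need and flush the CUDA cache. The sketch below shows that pattern with a hypothetical `pipeline` object standing in for whatever the previous stage built; the exact components you delete depend on your own setup.

```py
import gc
import torch

def flush():
    # Release dropped models and return their cached CUDA memory to the pool
    gc.collect()
    torch.cuda.empty_cache()

# Hypothetical staged flow: once the prompt embeddings are computed, the text encoder
# (and the pipeline that held it) can be discarded before loading the transformer.
del pipeline.text_encoder
del pipeline
flush()
```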
## Context parallelism

[Context parallelism](https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=context_parallelism) splits input sequences across multiple GPUs to reduce memory usage. Each GPU processes its own slice of the sequence.
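As a purely illustrative toy (not the library's internal implementation), splitting along the sequence dimension means each rank keeps only `seq_len / world_size` tokens of an activation:

```py
import torch

world_size = 2          # number of GPUs participating in context parallelism
rank = 0                # this device's index; in practice it comes from torch.distributed
hidden_states = torch.randn(1, 4096, 3072)  # (batch, seq_len, hidden)

# Each rank holds only its own slice of the sequence dimension
local_chunk = hidden_states.chunk(world_size, dim=1)[rank]
print(local_chunk.shape)  # torch.Size([1, 2048, 3072])
```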
Use [`~ModelMixin.set_attention_backend`] to switch to a more optimized attention backend. Refer to this [table](../optimization/attention_backends#available-backends) for a complete list of available backends.
### Ring Attention
Key (K) and value (V) representations are exchanged between devices with [Ring Attention](https://huggingface.co/papers/2310.01889) so that each sequence shard eventually attends to every other token's K/V. Each GPU computes attention between its local queries and the K/V block it currently holds, then passes that block to the next GPU in the ring. No single GPU ever holds the full sequence, which keeps memory usage low, and the block transfers overlap with local attention computation to hide communication latency.

Pass a [`ContextParallelConfig`] to the transformer model's `parallel_config` argument. The config's `ring_degree` argument determines how many devices to use for Ring Attention.
```py
import torch
from diffusers import AutoModel, QwenImagePipeline, ContextParallelConfig

# Launch with a distributed launcher, for example: torchrun --nproc_per_node=2 <script>.py
try:
    torch.distributed.init_process_group("nccl")
    rank = torch.distributed.get_rank()
    device = torch.device("cuda", rank % torch.cuda.device_count())
    torch.cuda.set_device(device)

    # ring_degree=2 splits the sequence across 2 devices with Ring Attention
    transformer = AutoModel.from_pretrained(
        "Qwen/Qwen-Image",
        subfolder="transformer",
        torch_dtype=torch.bfloat16,
        parallel_config=ContextParallelConfig(ring_degree=2),
    )
    pipeline = QwenImagePipeline.from_pretrained(
        "Qwen/Qwen-Image", transformer=transformer, torch_dtype=torch.bfloat16, device_map="cuda"
    )
    pipeline.transformer.set_attention_backend("flash")

    prompt = """
    cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
    highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
    """

    # Must specify a seeded generator so all ranks start from the same latents (or pass your own)
    generator = torch.Generator().manual_seed(42)
    image = pipeline(prompt, num_inference_steps=50, generator=generator).images[0]

    if rank == 0:
        image.save("output.png")
except Exception as e:
    print(f"An error occurred: {e}")
    torch.distributed.breakpoint()
    raise
finally:
    if torch.distributed.is_initialized():
        torch.distributed.destroy_process_group()
```
### Ulysses Attention
[Ulysses Attention](https://huggingface.co/papers/2309.14509) splits a sequence across GPUs and performs an *all-to-all* communication (every device sends and receives data from every other device) so that each GPU ends up with the full sequence, but only for a subset of attention heads. Each GPU computes attention over all tokens for its heads, then performs another all-to-all to regroup the results by tokens for the next layer.
[`ContextParallelConfig`] supports Ulysses Attention through the `ulysses_degree` argument, which determines how many devices to use for Ulysses Attention.
Pass the [`ContextParallelConfig`] to [`~ModelMixin.enable_parallelism`].
```py
pipeline.transformer.enable_parallelism(config=ContextParallelConfig(ulysses_degree=2))
```
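The snippet above only shows the call itself. Below is a minimal end-to-end sketch, assuming the same Qwen-Image checkpoint, prompt, and two-process launch (for example with `torchrun --nproc_per_node=2`) as the Ring Attention example; adapt the model and prompt to your own use case.

```py
import torch
from diffusers import AutoModel, QwenImagePipeline, ContextParallelConfig

try:
    torch.distributed.init_process_group("nccl")
    rank = torch.distributed.get_rank()
    device = torch.device("cuda", rank % torch.cuda.device_count())
    torch.cuda.set_device(device)

    transformer = AutoModel.from_pretrained(
        "Qwen/Qwen-Image", subfolder="transformer", torch_dtype=torch.bfloat16
    )
    pipeline = QwenImagePipeline.from_pretrained(
        "Qwen/Qwen-Image", transformer=transformer, torch_dtype=torch.bfloat16, device_map="cuda"
    )

    # ulysses_degree=2 shards attention heads across 2 devices via all-to-all communication
    pipeline.transformer.enable_parallelism(config=ContextParallelConfig(ulysses_degree=2))
    pipeline.transformer.set_attention_backend("flash")

    prompt = "cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California"

    # Seeded generator so all ranks start from the same latents
    generator = torch.Generator().manual_seed(42)
    image = pipeline(prompt, num_inference_steps=50, generator=generator).images[0]

    if rank == 0:
        image.save("output_ulysses.png")
finally:
    if torch.distributed.is_initialized():
        torch.distributed.destroy_process_group()
```

Compared to the Ring Attention example, the config here uses `ulysses_degree` and is applied to the already-loaded transformer through [`~ModelMixin.enable_parallelism`] rather than being passed as `parallel_config` at load time.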