[docs] add details concerning diffusers-specific bits. (#6375)

add details concerning diffusers-specific bits.

Unverified Commit 034b39b8 authored Dec 28, 2023 by Sayak Paul Committed by GitHub Dec 28, 2023

Showing with 23 additions and 1 deletion

docs/source/en/tutorials/fast_diffusion.md +23 -1

--- a/docs/source/en/tutorials/fast_diffusion.md
+++ b/docs/source/en/tutorials/fast_diffusion.md
@@ -315,4 +315,26 @@ Applying dynamic quantization improves the latency from 2.52 seconds to 2.43 sec
 <img src="https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/progressive-acceleration-sdxl/SDXL%2C_Batch_Size%3A_1%2C_Steps%3A_30_5.png" width=500>
 </div>
\ No newline at end of file
+## Misc
+### No graph breaks during `torch.compile`
+Ensuring that the underlying model or method can be compiled as a single graph (`torch.compile` with `fullgraph=True`) is crucial for performance. This means having no graph breaks. We did this for the UNet and VAE by changing how we access the returned variables: instead of reading the `.sample` attribute of the output object, we pass `return_dict=False` and index the returned tuple. Consider the following example:
+```diff
+- latents = unet(
+-	latents, timestep=timestep, encoder_hidden_states=prompt_embeds
+-).sample
++ latents = unet(
++	latents, timestep=timestep, encoder_hidden_states=prompt_embeds, return_dict=False
++)[0]
+```
+### Getting rid of GPU syncs after compilation
+During the iterative reverse diffusion process, we [call](https://github.com/huggingface/diffusers/blob/1d686bac8146037e97f3fd8c56e4063230f71751/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py#L1228) `step()` on the scheduler each time after the denoiser predicts the less noisy latent embeddings. Inside `step()`, the `sigmas` variable is [indexed](https://github.com/huggingface/diffusers/blob/1d686bac8146037e97f3fd8c56e4063230f71751/src/diffusers/schedulers/scheduling_euler_discrete.py#L476). If the `sigmas` array is placed on the GPU, indexing causes a communication sync between the CPU and GPU. This introduces latency, which becomes more evident once the denoiser has been compiled.
+But if the `sigmas` array always stays on the CPU (refer to [this line](https://github.com/huggingface/diffusers/blob/35a969d297cba69110d175ee79c59312b9f49e1e/src/diffusers/schedulers/scheduling_euler_discrete.py#L240)), this sync doesn’t take place, and latency improves. In general, CPU <-> GPU communication syncs should be avoided or kept to a bare minimum, as they can impact inference latency.
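
As a rough illustration of the point about `sigmas` (a sketch, not taken from the diffusers source and not part of this commit), the snippet below keeps a schedule tensor on the CPU while the sample lives on whichever device is available. The names `toy_step`, `num_steps`, and the schedule values are hypothetical. The idea it assumes is that reading a scalar out of a CPU tensor does not make the host wait for the GPU, whereas reading a scalar out of a GPU tensor (for example with `.item()` or in a Python `if`) does.

```python
import torch

# Hypothetical schedule for illustration; real schedulers derive sigmas from their config.
num_steps = 30
sigmas = torch.linspace(14.6, 0.0, num_steps + 1)  # deliberately left on the CPU

device = "cuda" if torch.cuda.is_available() else "cpu"
sample = torch.randn(1, 4, 8, 8, device=device)
initial = sample.clone()


def toy_step(sample, model_output, sigmas, step_index):
    # sigmas is a CPU tensor, so .item() returns a Python float
    # without waiting on any queued GPU work.
    sigma = sigmas[step_index].item()
    sigma_next = sigmas[step_index + 1].item()
    # Euler-style update: the scalars are plain floats, so this is a
    # single device-side op with no host/device round trip.
    return sample + model_output * (sigma_next - sigma)


for i in range(num_steps):
    model_output = torch.zeros_like(sample)  # stand-in for the denoiser output
    sample = toy_step(sample, model_output, sigmas, i)
```

If `sigmas` were moved to the GPU with `sigmas.to(device)`, each `.item()` call in the loop would force the CPU to wait for the GPU to finish its queued work, which is the kind of sync the section above recommends avoiding.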