1. 14 Feb, 2025 1 commit
    • Aryan's avatar
      Module Group Offloading (#10503) · 9a147b82
      Aryan authored
      
      
      * update
      
      * fix
      
      * non_blocking; handle parameters and buffers
      
      * update
      
      * Group offloading with cuda stream prefetching (#10516)
      
      * cuda stream prefetch
      
      * remove breakpoints
      
      * update
      
      * copy model hook implementation from pab
      
      * update; ~very workaround based implementation but it seems to work as expected; needs cleanup and rewrite
      
      * more workarounds to make it actually work
      
      * cleanup
      
      * rewrite
      
      * update
      
      * make sure to sync current stream before overwriting with pinned params
      
      not doing so will lead to erroneous computations on the GPU and cause bad results
      
      * better check
      
      * update
      
      * remove hook implementation to not deal with merge conflict
      
      * re-add hook changes
      
      * why use more memory when less memory do trick
      
      * why still use slightly more memory when less memory do trick
      
      * optimise
      
      * add model tests
      
      * add pipeline tests
      
      * update docs
      
      * add layernorm and groupnorm
      
      * address review comments
      
      * improve tests; add docs
      
      * improve docs
      
      * Apply suggestions from code review
      Co-authored-by: default avatarSteven Liu <59462357+stevhliu@users.noreply.github.com>
      
      * apply suggestions from code review
      
      * update tests
      
      * apply suggestions from review
      
      * enable_group_offloading -> enable_group_offload for naming consistency
      
      * raise errors if multiple offloading strategies used; add relevant tests
      
      * handle .to() when group offload applied
      
      * refactor some repeated code
      
      * remove unintentional change from merge conflict
      
      * handle .cuda()
      
      ---------
      Co-authored-by: default avatarSteven Liu <59462357+stevhliu@users.noreply.github.com>
      9a147b82
  2. 12 Feb, 2025 2 commits
  3. 11 Feb, 2025 2 commits
  4. 27 Jan, 2025 1 commit
  5. 23 Jan, 2025 1 commit
  6. 22 Jan, 2025 1 commit
    • Aryan's avatar
      [core] Layerwise Upcasting (#10347) · beacaa55
      Aryan authored
      
      
      * update
      
      * update
      
      * make style
      
      * remove dynamo disable
      
      * add coauthor
      Co-Authored-By: default avatarDhruv Nair <dhruv.nair@gmail.com>
      
      * update
      
      * update
      
      * update
      
      * update mixin
      
      * add some basic tests
      
      * update
      
      * update
      
      * non_blocking
      
      * improvements
      
      * update
      
      * norm.* -> norm
      
      * apply suggestions from review
      
      * add example
      
      * update hook implementation to the latest changes from pyramid attention broadcast
      
      * deinitialize should raise an error
      
      * update doc page
      
      * Apply suggestions from code review
      Co-authored-by: default avatarSteven Liu <59462357+stevhliu@users.noreply.github.com>
      
      * update docs
      
      * update
      
      * refactor
      
      * fix _always_upcast_modules for asym ae and vq_model
      
      * fix lumina embedding forward to not depend on weight dtype
      
      * refactor tests
      
      * add simple lora inference tests
      
      * _always_upcast_modules -> _precision_sensitive_module_patterns
      
      * remove todo comments about review; revert changes to self.dtype in unets because .dtype on ModelMixin should be able to handle fp8 weight case
      
      * check layer dtypes in lora test
      
      * fix UNet1DModelTests::test_layerwise_upcasting_inference
      
      * _precision_sensitive_module_patterns -> _skip_layerwise_casting_patterns based on feedback
      
      * skip test in NCSNppModelTests
      
      * skip tests for AutoencoderTinyTests
      
      * skip tests for AutoencoderOobleckTests
      
      * skip tests for UNet1DModelTests - unsupported pytorch operations
      
      * layerwise_upcasting -> layerwise_casting
      
      * skip tests for UNetRLModelTests; needs next pytorch release for currently unimplemented operation support
      
      * add layerwise fp8 pipeline test
      
      * use xfail
      
      * Apply suggestions from code review
      Co-authored-by: default avatarDhruv Nair <dhruv.nair@gmail.com>
      
      * add assertion with fp32 comparison; add tolerance to fp8-fp32 vs fp32-fp32 comparison (required for a few models' test to pass)
      
      * add note about memory consumption on tesla CI runner for failing test
      
      ---------
      Co-authored-by: default avatarDhruv Nair <dhruv.nair@gmail.com>
      Co-authored-by: default avatarSteven Liu <59462357+stevhliu@users.noreply.github.com>
      beacaa55
  7. 20 Jan, 2025 1 commit
  8. 19 Jan, 2025 1 commit
  9. 16 Jan, 2025 1 commit
  10. 13 Jan, 2025 2 commits
  11. 09 Jan, 2025 1 commit
  12. 08 Jan, 2025 1 commit
  13. 06 Jan, 2025 3 commits
  14. 31 Dec, 2024 2 commits
  15. 28 Dec, 2024 1 commit
  16. 23 Dec, 2024 4 commits
  17. 20 Dec, 2024 5 commits
  18. 18 Dec, 2024 1 commit
  19. 17 Dec, 2024 3 commits
  20. 16 Dec, 2024 6 commits
    • Steven Liu's avatar
      [docs] Add missing AttnProcessors (#10246) · 7667cfcb
      Steven Liu authored
      * attnprocessors
      
      * lora
      
      * make style
      
      * fix
      
      * fix
      
      * sana
      
      * typo
      7667cfcb
    • Aryan's avatar
      [core] TorchAO Quantizer (#10009) · 9f00c617
      Aryan authored
      
      
      * torchao quantizer
      
      
      ---------
      Co-authored-by: default avatarSayak Paul <spsayakpaul@gmail.com>
      Co-authored-by: default avatarSteven Liu <59462357+stevhliu@users.noreply.github.com>
      9f00c617
    • Sayak Paul's avatar
      [Docs] add rest of the lora loader mixins to the docs. (#10230) · ea893a9a
      Sayak Paul authored
      add rest of the lora loader mixins to the docs.
      ea893a9a
    • Aryan's avatar
      [core] Hunyuan Video (#10136) · aace1f41
      Aryan authored
      
      
      * copy transformer
      
      * copy vae
      
      * copy pipeline
      
      * make fix-copies
      
      * refactor; make original code work with diffusers; test latents for comparison generated with this commit
      
      * move rope into pipeline; remove flash attention; refactor
      
      * begin conversion script
      
      * make style
      
      * refactor attention
      
      * refactor
      
      * refactor final layer
      
      * their mlp -> our feedforward
      
      * make style
      
      * add docs
      
      * refactor layer names
      
      * refactor modulation
      
      * cleanup
      
      * refactor norms
      
      * refactor activations
      
      * refactor single blocks attention
      
      * refactor attention processor
      
      * make style
      
      * cleanup a bit
      
      * refactor double transformer block attention
      
      * update mochi attn proc
      
      * use diffusers attention implementation in all modules; checkpoint for all values matching original
      
      * remove helper functions in vae
      
      * refactor upsample
      
      * refactor causal conv
      
      * refactor resnet
      
      * refactor
      
      * refactor
      
      * refactor
      
      * grad checkpointing
      
      * autoencoder test
      
      * fix scaling factor
      
      * refactor clip
      
      * refactor llama text encoding
      
      * add coauthor
      Co-Authored-By: default avatar"Gregory D. Hunkins" <greg@ollano.com>
      
      * refactor rope; diff: 0.14990234375; reason and fix: create rope grid on cpu and move to device
      
      Note: The following line diverges from original behaviour. We create the grid on the device, whereas
      original implementation creates it on CPU and then moves it to device. This results in numerical
      differences in layerwise debugging outputs, but visually it is the same.
      
      * use diffusers timesteps embedding; diff: 0.10205078125
      
      * rename
      
      * convert
      
      * update
      
      * add tests for transformer
      
      * add pipeline tests; text encoder 2 is not optional
      
      * fix attention implementation for torch
      
      * add example
      
      * update docs
      
      * update docs
      
      * apply suggestions from review
      
      * refactor vae
      
      * update
      
      * Apply suggestions from code review
      Co-authored-by: default avatarhlky <hlky@hlky.ac>
      
      * Update src/diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video.py
      Co-authored-by: default avatarhlky <hlky@hlky.ac>
      
      * Update src/diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video.py
      Co-authored-by: default avatarhlky <hlky@hlky.ac>
      
      * make fix-copies
      
      * update
      
      ---------
      Co-authored-by: default avatar"Gregory D. Hunkins" <greg@ollano.com>
      Co-authored-by: default avatarhlky <hlky@hlky.ac>
      aace1f41
    • Sayak Paul's avatar
      [docs] minor stuff to ltx video docs. (#10229) · e68092a4
      Sayak Paul authored
      minor stuff to ltx video docs.
      e68092a4
    • Sayak Paul's avatar
      3bf5400a