1. 11 Aug, 2023 1 commit
  2. 23 Mar, 2023 1 commit
• Add FusedLARS optimizer (#109) · 7a428776
      luise.chen authored
      * Add fused_lars optimizer
      
      * Update primitive fused_lars optimizer, working for resnet50 with NHWC/NCHW
      
* Add a Nesterov momentum path to FusedLARS (sketched below)
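For context on what FusedLARS computes: LARS rescales each layer's update by a trust ratio built from the parameter and gradient norms, optionally with Nesterov momentum. Below is a minimal unfused sketch of one parameter's update; the fused extension applies the same math to all parameters in a single multi-tensor kernel launch, and the names here (lars_step, trust_coefficient) are illustrative, not the extension's API.

```python
import torch

def lars_step(p, grad, momentum_buf, lr, momentum=0.9,
              weight_decay=1e-4, trust_coefficient=0.001,
              nesterov=False):
    """One unfused LARS update for a single parameter tensor (sketch)."""
    w_norm = p.norm()
    g_norm = grad.norm()
    # Layer-wise trust ratio; fall back to 1.0 if either norm is zero.
    if w_norm > 0 and g_norm > 0:
        trust = trust_coefficient * w_norm / (g_norm + weight_decay * w_norm)
    else:
        trust = 1.0
    update = trust * (grad + weight_decay * p)
    momentum_buf.mul_(momentum).add_(update)
    # Nesterov looks ahead along the momentum direction.
    step = update + momentum * momentum_buf if nesterov else momentum_buf
    p.add_(step, alpha=-lr)
```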
  3. 15 Feb, 2023 1 commit
• Grid optimization - Chunk_Size optimization. (#104) · b047a1f1
      aspanday authored
* Updating BLOCK_SIZE to 1024.
tests/L0/run_optimizers/test_fused_optimizer.py passes except for the bfloat16 Adam case; that test appears to have a bug that still needs to be resolved, so test_bfloat16 for Adam is skipped for now.
The 17 other tests all pass.
More details on the effects of these changes: https://confluence.amd.com/display/MLSE/Apex+Kernel+Optimization.
This commit changes BLOCK_SIZE to 1024 only for the optimizer kernels.
L2norm kernels (part of the LAMB optimizer algorithm) keep BLOCK_SIZE=512; otherwise allclose fails.
      
* Updating tests/L0/run_optimizers/test_fused_optimizer.py with @skipIfRocm to skip test_bfloat16 for Adam.
      
* Updating chunk_size to 256*32 (8K), previously 2048*32 (64K), and depth_to_max_blocks to 2560 (8x the previous 320).
The observed speedup is up to 1.4x for large element counts, up to 5.2x for moderate element counts, and up to 1.44x for small element counts (grid math sketched after this commit).
This change only affects the optimizers, specifically when multi_tensor_apply is enabled by installing apex with the --cuda_ext extension.
The full set of performance numbers, along with a comparison against Torch, is captured here (see sheet chunk_opt):
https://amdcloud.sharepoint.com/…/r/sites/MLSEPerfTeam/Shared%20Documents/Strategic%20Leadership%20Optimizations%20Team%20(SLOT)/Projects/Grid%20Optimization/Elementwise%20Kernel%20-%20Grid%20Optimization%20-%20Benchmark%20sweep.xlsx?d=wa8bacf65a2904002bf3cad4c57769eff&csf=1&web=1&e=JhLVm8
      
* Updating all files related to L2norm, since test_fuzz (test_multi_tensor_l2norm.TestMultiTensorL2Norm) failed with the previous commits.
Changes in chunk_size seem to affect the reduction kernels, so this commit keeps the unoptimized configuration for L2norm while applying the optimization to all other kernels used by the optimizers.
The change introduces multi_tensor_apply_l2norm, which assumes a chunk_size of 64K, as well as multi_tensor_apply_base.cuh, to be used specifically by the l2norm kernels.
      
      ---------
Co-authored-by: aspanday <aspanday@amd.com>
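Back-of-the-envelope grid math shows why shrinking the chunk size helps: with fewer elements per chunk, the same tensor produces more blocks and keeps more compute units busy. The constants below are the ones quoted in the commit; the helper itself is an illustrative sketch of a multi_tensor_apply-style launch calculation, not the apex source.

```python
BLOCK_SIZE = 1024        # threads per block (optimizer kernels; was 512)
CHUNK_SIZE = 256 * 32    # 8K elements per chunk (was 2048 * 32 = 64K)
MAX_BLOCKS = 2560        # depth_to_max_blocks (was 320)

def blocks_for(numel: int) -> int:
    """Blocks a multi_tensor_apply-style launch would use for one tensor."""
    chunks = (numel + CHUNK_SIZE - 1) // CHUNK_SIZE   # ceil-divide into chunks
    return min(chunks, MAX_BLOCKS)

# A 1M-element tensor: 16 blocks with 64K chunks vs. 128 blocks with 8K chunks.
print(blocks_for(1 << 20))   # 128
```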
  4. 25 Jan, 2023 1 commit
• Updating BLOCK_SIZE to 1024 in all optimizers. (#103) · 14db5c27
      aspanday authored
* Updating BLOCK_SIZE to 1024.
tests/L0/run_optimizers/test_fused_optimizer.py passes except for the bfloat16 Adam case; that test appears to have a bug that still needs to be resolved, so test_bfloat16 for Adam is skipped for now.
The 17 other tests all pass.
More details on the effects of these changes: https://confluence.amd.com/display/MLSE/Apex+Kernel+Optimization.
This commit changes BLOCK_SIZE to 1024 only for the optimizer kernels.
L2norm kernels (part of the LAMB optimizer algorithm) keep BLOCK_SIZE=512; otherwise allclose fails.
      
* Updating tests/L0/run_optimizers/test_fused_optimizer.py with @skipIfRocm to skip test_bfloat16 for Adam (see the sketch after this commit).
Co-authored-by: aspanday <aspanday@amd.com>
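The ROCm skip in the second bullet would look roughly like the following. Apex's tests use a skipIfRocm helper for this, but a plain unittest.skipIf on torch.version.hip (which is only set on ROCm builds) expresses the same idea; the test class and body here are placeholders.

```python
import unittest
import torch

IS_ROCM = torch.version.hip is not None  # set only on ROCm builds

class TestFusedAdam(unittest.TestCase):
    @unittest.skipIf(IS_ROCM, "bfloat16 Adam test has a known bug; see commit notes")
    def test_bfloat16(self):
        ...  # exercise FusedAdam with bfloat16 tensors here
```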
  5. 09 Dec, 2022 1 commit
  6. 05 Aug, 2022 1 commit
• Enable FusedRMSNorm (#78) · c97ebfab
      Hubert Lu authored
      
      
      * FusedRMSNorm/"T5LayerNorm" based on FusedLayerNorm (#1274)
      
      * FusedRMSNorm based on FusedLayerNorm
      
      * refactor duplicated kernels
      
      * delete comments
      
      * delete comments
      
      * cleanup
      
      * cleanup
      
      * cleanup, fixed clobbering forward_affine_mixed_dtypes
      
      * fix pybind naming and add MixedFused test
      
      * undo skipping
      
      * check elementwise_affine
      
      * Update tests/L0/run_fused_layer_norm/test_fused_layer_norm.py
      
      Oof, nice catch, thanks
Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
      
      * fix and generate docs for FusedRMSNorm (#1285)
      
      * [FusedRMSNorm doc] document where epsilon is added (#1295)
      
      * [FusedRMSNorm doc] add epsilon to formula
      
      * correct
      
      * better wording
      
      * Fix some bugs
      
      * Optimize HostRMSNormGradient and HostApplyRMSNorm for AMD GPUs
      
      * Fix NaN issues in FusedRMSNorm
      
      * Update test_fused_layer_norm.py
      
      * Skip test_fused_layer_norm.TestAutocastFusedRMSNorm on ROCm
      
      * Use at::cuda::warp_size() instead of at::cuda::getCurrentDeviceProperties()->warpSize
Co-authored-by: eqy <eddiey@nvidia.com>
Co-authored-by: Masaki Kozuki <masaki.kozuki.2014@gmail.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
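The doc commits folded in above pin down where epsilon enters the RMSNorm formula: inside the root mean square, before the square root. A reference (unfused) implementation makes the placement explicit; this is a sketch of the math, not the fused kernel.

```python
import torch

def rms_norm(x, weight, eps=1e-5):
    """Reference RMSNorm: y = x / sqrt(mean(x**2) + eps) * weight.
    eps is added to the mean of squares *before* the square root."""
    rms = x.pow(2).mean(dim=-1, keepdim=True).add(eps).rsqrt()
    return x * rms * weight
```

The fused module this commit enables is a drop-in nn.Module alongside FusedLayerNorm (per the commit, FusedRMSNorm is built on FusedLayerNorm and lives next to it under apex.normalization).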
  7. 29 Jul, 2022 1 commit
  8. 22 Jun, 2022 1 commit
  9. 31 May, 2022 1 commit
  10. 15 Apr, 2022 5 commits
  11. 06 Apr, 2022 1 commit
• Make rocblas_gemm_flags_fp16_alt_impl in MHA and MLP backward compatible with... · 5ecad142
      Hubert Lu authored
      Make rocblas_gemm_flags_fp16_alt_impl in MHA and MLP backward compatible with old PyTorch versions (#74)
      
      * First attempt to make rocblas flag backward compatible
      
      * Fix some bugs
      
      * Fix some bugs
      
      * Make rocblas_gemm_flags_fp16_alt_impl in MHA backward compatible with old PyTorch versions
      
      * Add groupbn extension unit tests for ROCm
      
      * Fix some bugs
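The backward-compatibility strategy in this commit is feature detection: probe whether the running PyTorch exposes the rocBLAS alt-impl flag before using it, instead of requiring a minimum version. The C++ side does this with version guards; the Python sketch below uses a hypothetical attribute name purely to illustrate the pattern.

```python
import torch

# Hypothetical attribute name, for illustration only: probe for the
# flag and fall back to a safe default when this build lacks it.
_get_flag = getattr(torch._C, "_rocblas_fp16_alt_impl", None)

def fp16_alt_impl_enabled() -> bool:
    """True if this PyTorch build exposes and enables the alt impl."""
    return bool(_get_flag()) if callable(_get_flag) else False
```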
  12. 23 Mar, 2022 1 commit
  13. 26 Feb, 2022 1 commit
  14. 15 Feb, 2022 1 commit
  15. 12 Feb, 2022 1 commit
  16. 04 Feb, 2022 1 commit
  17. 25 Jan, 2022 1 commit
  18. 13 Dec, 2021 1 commit
  19. 09 Dec, 2021 2 commits
  20. 17 Nov, 2021 1 commit
  21. 27 Oct, 2021 1 commit
• Pipeline Model Parallel (#1202) · 63d5dd63
      Masaki Kozuki authored
      
      
      * Init apex.ppu (pipeline model parallel utility)
      
      Reference commit:
      
      ```
      commit 5ab646376d67831601d5552c193241d017f1b35c (HEAD -> main, internal/main)
      Merge: 14f2c684 7b293d9b
      Author: Mohammad Shoeybi <mshoeybi@nvidia.com>
      Date:   Wed Sep 22 22:57:54 2021 -0700
      
          Merge branch 'add_BOS' into 'main'
      
          Add Beginning of Sentence token option and adding semaphore while multi-threading to prevent crashes and hangs due to connection keep-alives
      
          See merge request ADLR/megatron-lm!328
      ```
      
      * removing get_args and replace import - phase 1
      
      * removing get_args and replace import - phase 2
      
      * move ppu to apex.transformer.pipeline_parallel
      
      * update two __init__.py
      
      * update READMEs
      
      * mpu -> parallel_state & tensor_parallel
      
      * fix
      
      * remove not pipeline files
      
      * separate schedules.py - phase 1
      
      * dissect schedules.py
      
      * data_iterators -> batch
      
      * remove optimizer from forward_backward_step funcs
      
      * init test
      
      * Apply 2 suggestion(s) to 2 file(s)
      
      * fix cyclic import
      
      * fix syntax of Callable
      
      * fix - 1
      
      * move directory as testing used for pp test as well
      
      * add some functions for num microbatches calculator
      
      * model is a list in pipeline parallel
      
      * skip build num microbatch calculator
      
      * fix test
      
      * assert -> raise
      
      * skip args printing
      
      * specify tensor shape everywhere even if None - phase 1
      
      * private timers
      
      * passing tensor shape & dtype around
      
      * update dtype handling by introducing helper func
      
      * write helper func to reduce cyclomatic complexity
      
      * remove duplicate
      
      * update
      
      * move split_tensor_into_1d_equal_chunks to avoid cyclic import
      
      * tmp
      
      * cosmetic
      
      * move gather_split_1d_tensor to avoid cyclic imports
      
      * remove debug print
      
      * add outer loop
      
      * early return if possible
      
      * cosmetic
      
      * passing around tensor shape
      
      * refactor test
      
      * add script to learn batch sampler behavior
      
      * update
      
      * minibatch splitter
      
      * add minibatch splitter
      
      * split minibatch into microbatches
      
      * minor changes
      
      * uncomment split batch for test sake
      
      * set as attribute
      
      * study the behavior of no pipelining
      
      * debug 1
      
      * reflect test util namespace change
      
      * update readme
      
      * cosmetic in test
      
* add model build helper func for interleaving sched
      
      * adding model builder from megatron
      
* can be cyclic import
      
      * fix
      
      * enable interleaving test, but failing even if forward only
      
      * fix batch preparation
      
      * add explanation
      
      * print data parallel size
      
      * fix typo
      
      * Add Megatron style GPT model by Rishi
Co-authored-by: Rishi Puri <riship@nvidia.com>
      
      * update
      
      * type hint for jit
      
      * fix forward_backward_no_pipelining test
      
      * pipeline forward backward seem to hang if not forward only
      
      * fix typo
      
      * debug
      
      * add p2p test
      
      * simplify
      
      * fix
      
      * tentative
      
      * set both tmp and pmp to 1
      
      * init
      
      * fix typo
      
      * fix
      
      * fix path of divide
      
      * set seed for tmp
      
      * update upon Eddie comment
      
      * fix typo
      
      * adding failing data loader test
      
      * fix
      
      * megatron still failing
      
      * check in
      
      * with the nested loop of new order, interleaving seems fine
      
      * cosmetic change
      
* make `forward_backward_pipelining_with_interleaving` private
      
      * warn users that interleaving sched is unstable
      
      * move noop handler to no pipelining
      
      * comment out rank_print
      
      * make `build_model` more flexible
      
      * skip megatron test tentatively
      
      * correctly comment out rank_print
      
      * correctly comment out rank_print
      
      * correctly comment out rank_print
      
      * skip appropriately
      
      * remove wip p2p comm test
      
      * update type hint of model_provider_func
      
      * disable tf32 in each test script
      
      * skip interleaving w/ backward
      
      * rename as mpu is the old name
      
      * remove broken case
      
      * expose build_model func
      
      * delete `dist.ring_exchange` func call and `use_ring_exchange` argument
      
      * nit fixes
      
      * check in
      
      * remove unused file
      
      * update the list
      
      * update tensor shape
      
      * remove mixed dtype case
      
      * use torch.distributed.run
      
      * 2020 -> 2021
      
      * another 2020 -> 2021
      
      * docstring & type hint
      
      * fix teardown
      
      * update
      
      * change to experimental
      
      * check if warned
Co-authored-by: Rishi Puri <riship@nvidia.com>
Co-authored-by: Eddie Yan <eddiey@nvidia.com>
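Several commits above (the minibatch splitter and forward_backward_no_pipelining) revolve around one idea: split the global minibatch into microbatches and run forward/backward per microbatch, accumulating gradients. A minimal sketch of the simplest schedule, with illustrative names, a single stage, and no p2p communication:

```python
import torch

def split_into_microbatches(t: torch.Tensor, num_microbatches: int):
    """Split a minibatch into microbatches along the batch dimension."""
    return torch.chunk(t, num_microbatches, dim=0)

def forward_backward_no_pipelining(model, loss_fn, batch, targets,
                                   num_microbatches):
    """Run each microbatch's forward and backward back to back on a
    single stage; gradients accumulate into .grad across microbatches."""
    losses = []
    xs = split_into_microbatches(batch, num_microbatches)
    ys = split_into_microbatches(targets, num_microbatches)
    for x, y in zip(xs, ys):
        loss = loss_fn(model(x), y) / num_microbatches
        loss.backward()
        losses.append(loss.detach())
    return losses
```

The pipelined schedules interleave these forward and backward passes across stages and exchange activations via p2p, which is where the tensor-shape and dtype plumbing in the commits above comes in.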
  22. 19 Oct, 2021 1 commit
  23. 08 Oct, 2021 1 commit
  24. 07 Oct, 2021 1 commit
  25. 04 Oct, 2021 1 commit
  26. 02 Oct, 2021 1 commit
  27. 24 Sep, 2021 1 commit
  28. 04 Sep, 2021 1 commit
• fix CUBLAS guards (#1162) · 54b93919
      Burc Eryilmaz authored
      
      
      * support for fused dense layer with cublasLt, fusion in both fprop and bprop
      
      * fix typo causing syntax error
      
* add fused GEMM+gelu+GEMM module (unfused reference sketched below)
      
      * fix typo for workspace size
      
      * update cublas check for 11600
      
      * add tests for fused dense layer
      
      * fix CUDA 10.x path
      
      * safer guard around CUBLAS constants, remove unreferenced variable
      
      * more guard changes
      
      * guard against cublas version instead of cuda
Co-authored-by: Sukru Eryilmaz <seryilmaz@computelab-dgx1v-32.nvidia.com>
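Written out unfused, the module added here computes two GEMMs with a GeLU in between; cublasLt lets the bias adds and activation fuse into the GEMM epilogues. A reference sketch of the math (the fused apex module's exact API is not shown here):

```python
import torch
import torch.nn.functional as F

def dense_gelu_dense(x, w1, b1, w2, b2):
    """Unfused reference for the fused GEMM+GeLU+GEMM module:
    y = gelu(x @ w1.T + b1) @ w2.T + b2."""
    h = F.gelu(F.linear(x, w1, b1))
    return F.linear(h, w2, b2)
```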
  29. 01 Sep, 2021 2 commits
  30. 17 May, 2021 1 commit
  31. 19 Apr, 2021 1 commit
  32. 17 Apr, 2021 1 commit
  33. 15 Apr, 2021 1 commit
• Add unit tests for Fused NovoGrad (#1065) · 59d2f7ac
      Sudhakar Singh authored
      * Add unit tests for fused-novograd
      
      * Fix: tensors should reside on the same device
      
* Fix: the CUDA stream should be queried on the same device the tensors reside on. Found this while debugging the fused NovoGrad multi-device unit test (pattern sketched below)
      
      * fixed issues mentioned in the comments
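The multi-device fix above comes down to querying the CUDA stream on the tensor's own device rather than on the current default device. A hedged sketch of the pattern; the placeholder work stands in for the optimizer's real kernel launches:

```python
import torch

def work_per_device(params):
    """Do per-tensor work on each tensor's own device, using that
    device's current stream; mixing devices and streams is the bug
    class the commit above fixes."""
    for p in params:
        with torch.cuda.device(p.device):
            stream = torch.cuda.current_stream(p.device)
            with torch.cuda.stream(stream):
                p.mul_(1.0)  # placeholder for the real kernel launch
```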
  34. 25 Feb, 2021 1 commit