  1. 13 Sep, 2019 1 commit
  2. 06 Sep, 2019 1 commit
      Fix for #456 (#477) · 325f5a0b
      mcarilli authored
      * Pushing for build tests
      
      * Contrib files
      
      * Removing deprecated checks
  3. 17 Aug, 2019 1 commit
  4. 16 Aug, 2019 1 commit
  5. 13 Aug, 2019 1 commit
  6. 08 Aug, 2019 1 commit
  7. 31 May, 2019 1 commit
      Multi tensor lamb optimizer (#334) · 8be5b6be
      Thor Johnsen authored
      * First draft, for discussion
      
      * Fix mistakes in LAMB equations
      
      * Add loop over chunk
      
      * Bug fix
      
      * Bug fix
      
      * Bug fix
      
      * Undo bug fix
      
      * Bug fix
      
      * Add multi tensor LAMB optimizer to setup.py
      
      * Rename step_size to learning_rate
      
      * Fix compilation errors
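
      For reference, a minimal single-tensor sketch of the LAMB update that the multi-tensor
      optimizer fuses across many parameters (standard LAMB formulation with bias correction;
      function and variable names here are illustrative, not Apex's API):

        # Reference (unfused) LAMB step for one parameter tensor; the multi-tensor
        # kernel applies this math to many tensors in a single launch.
        import torch

        def lamb_step(p, grad, m, v, step, learning_rate=1e-3, beta1=0.9, beta2=0.999,
                      eps=1e-6, weight_decay=0.01):
            # Adam-style first and second moments with bias correction.
            m.mul_(beta1).add_(grad, alpha=1 - beta1)
            v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
            m_hat = m / (1 - beta1 ** step)
            v_hat = v / (1 - beta2 ** step)
            update = m_hat / (v_hat.sqrt() + eps) + weight_decay * p
            # Layer-wise trust ratio: scale the step by ||w|| / ||update||.
            w_norm, u_norm = p.norm(), update.norm()
            trust_ratio = torch.where((w_norm > 0) & (u_norm > 0),
                                      w_norm / u_norm, torch.ones_like(w_norm))
            p.add_(update, alpha=-learning_rate * float(trust_ratio))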
  8. 23 May, 2019 1 commit
  9. 22 May, 2019 1 commit
  10. 09 May, 2019 1 commit
      Add softmax cross entropy loss with label smoothing support. (#295) · 0c74571f
      Wil Kong authored
      * Add softmax cross entropy loss with label smoothing support.
      
      * Fix deprecation of AT_DISPATCH_XXX and several minor issues.
      
      * Fix issues raised by reviewers.
      
      * Add FB license.
      
      * Remove code generation constraints.
      
      * Add a simple unittest for label smoothing.
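
      For reference, the label-smoothing math the fused kernel implements; a plain PyTorch
      sketch (one common formulation, not the Apex kernel itself):

        # Label-smoothed cross entropy: blend the one-hot target with a uniform
        # distribution over classes before taking the negative log-likelihood.
        import torch
        import torch.nn.functional as F

        def label_smoothing_ce(logits, target, smoothing=0.1):
            # logits: (N, C); target: (N,) int64 class indices
            log_probs = F.log_softmax(logits, dim=-1)
            nll = -log_probs.gather(dim=-1, index=target.unsqueeze(-1)).squeeze(-1)
            uniform = -log_probs.mean(dim=-1)            # average -log p over all classes
            return ((1.0 - smoothing) * nll + smoothing * uniform).mean()

        loss = label_smoothing_ce(torch.randn(8, 10), torch.randint(0, 10, (8,)))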
  11. 27 Apr, 2019 1 commit
      Bnp integration pr (#275) · fedfe0d7
      jjsjann123 authored
      * Persistent group batchnorm added
      
      Added persistent grouped batch norm for performance runs in the strong-scaling case.
      Currently it only supports:
      
        1. NHWC layout
        2. fp16
        3. synchronization only within a node
      
      An environment variable is used to tune LAUNCH_MARGIN, which limits the number of
      CTAs used by the persistent kernel.
      
      Documentation and examples will follow.
      
      * updating type().scalarType() to scalar_type()
      
      * moving launch margin to be defined at layer creation, adding a knob to cap max CTAs per SM
      
      * fixing the cta computation
      
      * Review comments:
      
        set device_id through cudaGetDevice()
        change cudaMemset to cudaMemsetAsync
        update __threadfence() to __threadfence_system() for inter-device writes
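
      As a rough illustration of the stated input constraints (NHWC layout, fp16), plain
      PyTorch tensor operations can produce such an input; the persistent kernel itself and
      the LAUNCH_MARGIN environment variable are Apex-internal and not shown:

        # Illustrative only: arrange an activation as NHWC fp16, the layout the
        # persistent grouped batch norm kernel expects per the notes above.
        import torch

        x_nchw = torch.randn(32, 64, 56, 56)                      # typical NCHW activation
        x_nhwc_fp16 = x_nchw.permute(0, 2, 3, 1).contiguous().half()
        print(x_nhwc_fp16.shape, x_nhwc_fp16.dtype)               # [32, 56, 56, 64], torch.float16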
  12. 18 Apr, 2019 1 commit
  13. 09 Apr, 2019 1 commit
  14. 23 Mar, 2019 1 commit
  15. 22 Mar, 2019 1 commit
      Check cuda version (#216) · 5b8faa29
      mcarilli authored
      * Adding Torch + bare-metal nvcc version check and container build tests
      
      * Putting a canary in the coalmine
      
      * canary proved elusive
      
      * Trying direct setup.py install
      
      * this should work
      
      * Removing canary
      
      * hopefully this works
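
      A rough sketch of the kind of check this commit describes: compare the CUDA version
      PyTorch was built against with the bare-metal nvcc found on the PATH (illustrative
      only, not Apex's actual setup.py code; version parsing is simplified):

        # Compare PyTorch's build-time CUDA version with the nvcc on PATH.
        import re
        import subprocess
        import torch

        def check_cuda_version():
            torch_cuda = torch.version.cuda                        # e.g. "10.1" (None on CPU builds)
            out = subprocess.check_output(["nvcc", "--version"]).decode()
            match = re.search(r"release (\d+\.\d+)", out)          # nvcc prints "release 10.1, ..."
            bare_metal = match.group(1) if match else None
            if bare_metal != torch_cuda:
                raise RuntimeError("CUDA version mismatch: PyTorch was built with "
                                   f"{torch_cuda}, but nvcc reports {bare_metal}")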
  16. 19 Mar, 2019 1 commit
  17. 13 Mar, 2019 1 commit
  18. 12 Mar, 2019 1 commit
  19. 10 Mar, 2019 1 commit
  20. 08 Mar, 2019 1 commit
  21. 05 Mar, 2019 1 commit
  22. 04 Mar, 2019 1 commit
  23. 19 Feb, 2019 1 commit
  24. 11 Feb, 2019 1 commit
  25. 04 Feb, 2019 1 commit
  26. 12 Dec, 2018 1 commit
  27. 31 Oct, 2018 1 commit
  28. 30 Oct, 2018 1 commit
  29. 29 Oct, 2018 1 commit
      Merging in fused adam optimizer, additional DDP features tested in 18.10 (#60) · e0bc5d62
      mcarilli authored
      * test passes
      
      * notes
      
      * Using C++-side flatten and unflatten functions
      
      * Adding csrc
      
      * Persistent synchronization event so it doesn't need to be created and destroyed each time
      
      * Interop with parameter flattening in SSD
      
      * Added deterministic option to imagenet main.py
      
      * Adding options to split gradient averaging and allreduce in pure fp32
      
      * Fixing allreduce_maybe_retain call
      
      * Fixing allreduce_fallback
      
      * Also sync active_i_buckets from rank 0
      
      * Making retain_allreduce_buffers compatible with/orthogonal to delay_allreduce=True|False
      
      * Correcting syntax error, now all seems to work with SSD
      
      * Optional cpp extension build
      
      * Add mixed precision adam optimizer (#59)
      
      * Add FusedAdam Optimizer to Apex that places all the math into a cuda kernel.
      
      * Added fixes to fused_adam to get it to work with the network.
      
      * WIP on the Python interface for Adam with options
      
      * fix dispatch for halves, add Python options to handle optional half gradients and params
      
      * cleanup, get rid of grid-stride loop
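
      For context on the flatten/unflatten and fp32-allreduce options mentioned above, a
      minimal sketch of the flattened gradient allreduce idea (illustrative Python only;
      Apex's DDP does this with C++-side helpers, buckets, and streams):

        # Flatten many gradients into one contiguous fp32 buffer, allreduce once,
        # average, then copy the results back into each gradient tensor.
        import torch
        import torch.distributed as dist

        def allreduce_grads_fp32(params, world_size):
            grads = [p.grad for p in params if p.grad is not None]
            flat = torch.cat([g.detach().reshape(-1).float() for g in grads])
            dist.all_reduce(flat)                 # one collective instead of one per tensor
            flat.div_(world_size)                 # gradient averaging in pure fp32
            offset = 0
            for g in grads:
                n = g.numel()
                g.copy_(flat[offset:offset + n].view_as(g))   # copy_ casts back to g's dtype
                offset += n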
  30. 23 Oct, 2018 1 commit
      [syncBN] (#48) · 81eef1ef
      jjsjann123 authored
      * [syncBN]
        added syncBN in native pure-Python Apex
        added fused CUDA kernels used for sync BN, using Welford's algorithm for mean/var;
          optional installation using 'python setup.py install --cuda_ext'
        added unit test with a side-by-side comparison between Apex sync BN and
          PyTorch BN. Note that the PyTorch BN output will be slightly off because of
          numerical issues in its mean/var computation.
      
      * [syncBN PR]
        added fp16 support
        addressing review comments on:
          1. updating last pow 2
          2. check for ImportError when importing the syncBN kernel
      
      * [syncBN PR]
        added convert function to insert SyncBatchNorm
        refactored some kernel code
      
      * fixing type issues (fp16/fp32/fp64)
      added Kahan summation
      edited unit test to use PyTorch primitive ops in double precision; it passes reasonable tests now
      
      * updating tensor creation calls
      
      * fixing the all_reduce contiguous tensor
      
      * transposed all reduce results
      
      * [syncBN]
      support fp16 input & fp32 layers for Apex fp16
      partially fixing launch configs
      enabling the imagenet example to run with --sync_bn
      
      * [syncBN PR]
      Documentation added
      
      * adjusting README
      
      * adjusting again
      
      * added some doc to imagenet example
      
      * [syncBN]
        warp-level reduction
        bug fix: warp reduction logic updated; check for a dummy element to avoid NaN.
        improved launch config for better reduction kernels. A further improvement
        would be to increase the grid size.
      
      * [syncBN]
        fixing undefined behavior in __shfl_down_sync caused by divergent threads in the
        warp reduction
        changing at::native::empty to at::empty (upstream comments)
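
      For reference, Welford's method mentioned above for the mean/variance computation, in
      plain Python (single-pass version; the fused kernels implement a parallel variant in CUDA):

        # Welford's single-pass, numerically stable running mean/variance update.
        def welford(values):
            count, mean, m2 = 0, 0.0, 0.0
            for x in values:
                count += 1
                delta = x - mean
                mean += delta / count
                m2 += delta * (x - mean)          # uses the already-updated mean
            variance = m2 / count if count else float("nan")
            return mean, variance

        print(welford([1.0, 2.0, 3.0, 4.0]))      # (2.5, 1.25): population variance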
  31. 23 Jul, 2018 1 commit
  32. 05 Jul, 2018 1 commit
  33. 04 Jul, 2018 1 commit
  34. 24 Jun, 2018 1 commit
  35. 21 Jun, 2018 1 commit
  36. 14 Jun, 2018 1 commit
  37. 07 Jun, 2018 2 commits
  38. 06 Jun, 2018 1 commit
  39. 26 May, 2018 1 commit