1. 10 May, 2019 1 commit
  2. 03 May, 2019 1 commit
  3. 02 May, 2019 2 commits
  4. 01 May, 2019 3 commits
  5. 27 Apr, 2019 2 commits
    • Bnp integration pr (#275) · fedfe0d7
      jjsjann123 authored
      * Persistent group batchnorm added
      
      Added persistent grouped batch norm for performance runs in the strong-scaling case.
      Currently it supports only:
      
        1. nhwc layout
        2. fp16
        3. synchronization only within a node!
      
      An environment variable is used to tune LAUNCH_MARGIN, which limits the number
      of CTAs used by the persistent kernel.
      
      Documentation and examples will follow.
      
      * updating type().scalarType() to scalar_type()
      
      * moving launch margin to be defined at layer creation, adding a knob to cap max CTAs per SM
      
      * fixing the cta computation
      
      * review comment:
      
      set device_id through cudaGetDevice()
      move cudaMemset to cudaMemsetAsync
      updated __threadfence() to __threadfence_system() for inter-device writes
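The commit message does not give the formula behind the "cta computation", so the following is only an illustrative sketch of how a launch margin and a per-SM cap could bound the grid size of a persistent kernel. The function name `persistent_cta_budget` and all of its parameters are hypothetical and do not come from the repository. The sketch assumes the margin is subtracted from the total number of co-resident CTAs and that at least one CTA is always launched.

```python
from typing import Optional


def persistent_cta_budget(
    sm_count: int,
    occupancy_per_sm: int,
    launch_margin: int = 0,
    max_cta_per_sm: Optional[int] = None,
) -> int:
    """Hypothetical CTA budget for a persistent kernel.

    sm_count:         number of SMs on the device
    occupancy_per_sm: CTAs of this kernel that fit on one SM at once
    launch_margin:    CTA slots left free for other kernels (assumed semantics)
    max_cta_per_sm:   optional cap on CTAs per SM (assumed semantics)
    """
    if sm_count <= 0 or occupancy_per_sm <= 0:
        raise ValueError("sm_count and occupancy_per_sm must be positive")
    if launch_margin < 0:
        raise ValueError("launch_margin must be non-negative")

    # Apply the per-SM cap first, if one was given.
    per_sm = occupancy_per_sm
    if max_cta_per_sm is not None:
        per_sm = min(per_sm, max_cta_per_sm)

    # All CTAs of a persistent kernel must be resident at the same time,
    # so the budget never exceeds sm_count * per_sm.
    budget = sm_count * per_sm - launch_margin

    # Always launch at least one CTA.
    return max(budget, 1)


if __name__ == "__main__":
    # Example: 80 SMs, occupancy 2, capped to 1 CTA per SM, margin of 12.
    print(persistent_cta_budget(80, 2, launch_margin=12, max_cta_per_sm=1))
```

Passing the margin as an argument rather than reading it from the environment mirrors the later bullet that moves the launch margin to layer creation.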
    • syntax · e7beba17
      Michael Carilli authored
  6. 26 Apr, 2019 7 commits
  7. 25 Apr, 2019 3 commits
  8. 24 Apr, 2019 4 commits
  9. 23 Apr, 2019 2 commits
  10. 22 Apr, 2019 1 commit
  11. 18 Apr, 2019 4 commits
  12. 17 Apr, 2019 1 commit
  13. 16 Apr, 2019 5 commits
  14. 15 Apr, 2019 3 commits
  15. 12 Apr, 2019 1 commit