Commits · 15648029e9d44dccba981e4f07846b3acd799393 · OpenDAS / apex

14 Jun, 2019 2 commits
- cleanup · 2b67bc33
  Evgeni Krimer authored Jun 13, 2019
  
- update gbn · 0ef439b6
  Evgeni Krimer authored Jun 13, 2019
  
27 Apr, 2019 1 commit

jjsjann123 authored Apr 26, 2019

* Persistent group batchnorm added

Added persistent grouped batch norm for performance runs in the strong-scaling case.
Currently it supports only:

  1. nhwc layout
  2. fp16
  3. synchronization only within a node!

An environment variable is used to tune LAUNCH_MARGIN, which limits the number of
CTAs used by the persistent kernel.
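
As a rough illustration of how a launch margin could limit CTA usage, the sketch below reads a margin from the environment and subtracts it from the number of CTAs a device could host. Only the name `LAUNCH_MARGIN` comes from the message above; the helper `usable_ctas`, its parameters, the default of 0, and the arithmetic are assumptions made for illustration, not the kernel's actual logic.

```python
import os


def usable_ctas(sm_count, ctas_per_sm, env=None):
    """Hypothetical helper: CTAs left for the persistent kernel after the margin.

    Assumes the margin is expressed in CTAs and defaults to 0 when unset.
    """
    env = os.environ if env is None else env
    margin = int(env.get("LAUNCH_MARGIN", "0"))
    total = sm_count * ctas_per_sm
    # Keep at least one CTA so the kernel can still launch.
    return max(1, total - margin)
```

Under these assumptions, a larger margin leaves more room on the device for other kernels that run concurrently with the persistent one.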

Documentation and examples will follow.

* updating type().scalarType() to scalar_type()

* moving launch margin to be defined at layer creation, adding a knob to cap max CTAs per SM
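
A minimal sketch of what fixing these two knobs at layer creation might look like. The class `PersistentBNConfig`, its field names, and the default values are hypothetical, and the grid-size formula is an assumption rather than the computation this commit implements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersistentBNConfig:
    """Hypothetical per-layer settings, fixed when the layer is created."""

    sm_count: int                # SMs on the device
    occupancy_ctas_per_sm: int   # CTAs per SM the kernel could occupy
    max_cta_per_sm: int = 2      # assumed default for the cap knob
    cta_launch_margin: int = 12  # assumed default, expressed in CTAs

    def grid_size(self):
        # Apply the per-SM cap first, then hold back the launch margin.
        per_sm = min(self.occupancy_ctas_per_sm, self.max_cta_per_sm)
        return max(1, self.sm_count * per_sm - self.cta_launch_margin)
```

Storing the values on the layer rather than reading them at launch time means two layers in the same process can use different margins.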

* fixing the cta computation

* addressing review comments:

set device_id through cudaGetDevice()
move cudaMemset to cudaMemsetAsync
updated __threadfence() to __threadfence_system() for inter-device writes

fedfe0d7