"vscode:/vscode.git/clone" did not exist on "15e91d721b0a0afe0de0706860a03ae810357dd6"
-
jjsjann123 authored
* Persistent group batchnorm added

  Added persistent grouped batch norm for performance runs in the strong-scaling case. Currently only supporting:
  1. nhwc layout
  2. fp16
  3. synchronization only within a node!

  An environment variable is used to tune LAUNCH_MARGIN, which limits CTA usage by the persistent kernel.

  Documentation and examples will follow.

* updating type().scalarType() to scalar_type()

* moving launch margin to be defined at layer creation, adding a knob to cap max CTAs per SM

* fixing the CTA computation

* review comment:
  set device_id through cudaGetDevice()
  move cudaMemset to cudaMemsetAsync
  updated __threadfence() to __threadfence_system() for inter-device writes
fedfe0d7