Bnp integration pr (#275)
* Persistent group batchnorm added

  Added persistent grouped batch norm for performance run on strong scaling case:
  currently only supporting:
  1. nhwc layout
  2. fp16
  3. synchronization only within a node!

  Environment variable is used to tune LAUNCH_MARGIN that limits the CTAs usage by the persistent kernel.

  Documentation and examples will follow.

* updating type().scalarType() to scalar_type()

* moving launch margin to be defined at layer creation, adding a knob cap max ctas per sm

* fixing the cta computation

* review comment:
  set device_id through cudaGetDevice()
  move cudaMemset to cudaMemsetAsync
  updated __threadfence() to __threadfence_system() inter device write
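
The commit message mentions a launch margin that limits how many CTAs the persistent kernel uses, plus a knob that caps CTAs per SM. The sketch below is one plausible reading of how those two settings could combine into a grid size. It is not the code from this commit: the function `persistent_grid_size` and all of its parameter names are hypothetical, and the real computation in the kernel launcher may differ.

```python
def persistent_grid_size(sm_count, occupancy_per_sm, launch_margin=0, max_cta_per_sm=None):
    """Hypothetical helper: how many CTAs a persistent kernel might launch.

    sm_count         -- number of SMs on the device
    occupancy_per_sm -- CTAs of this kernel that can be resident on one SM
    launch_margin    -- CTAs deliberately left free for other kernels (assumed meaning)
    max_cta_per_sm   -- optional cap on CTAs per SM (assumed meaning)
    """
    if sm_count <= 0 or occupancy_per_sm <= 0:
        raise ValueError("sm_count and occupancy_per_sm must be positive")
    if launch_margin < 0:
        raise ValueError("launch_margin must be non-negative")

    # Apply the per-SM cap first, if one was given.
    per_sm = occupancy_per_sm
    if max_cta_per_sm is not None:
        per_sm = min(per_sm, max_cta_per_sm)

    # Reserve the margin, but always launch at least one CTA.
    return max(1, sm_count * per_sm - launch_margin)


if __name__ == "__main__":
    # Illustrative numbers only: 80 SMs, 2 resident CTAs per SM, margin of 12.
    print(persistent_grid_size(80, 2, launch_margin=12))
```

Under this reading, a larger margin leaves more of the device free for kernels that run alongside the persistent one, at the cost of fewer CTAs doing batch norm work.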
apex/contrib/__init__.py
0 → 100644