    Bnp integration pr (#275) · fedfe0d7
    jjsjann123 authored
    * Persistent group batchnorm added
    
    Adds a persistent grouped batch norm kernel for performance runs in the
    strong-scaling case. It currently supports only:
    
      1. nhwc layout
      2. fp16
      3. synchronization only within a node!
    
    An environment variable tunes LAUNCH_MARGIN, which limits the number of CTAs
    used by the persistent kernel.
    
    Documentation and examples will follow.
    
    * updating type().scalarType() to scalar_type()
    
    * moving launch margin to be defined at layer creation, adding a knob to cap max CTAs per SM
    
    * fixing the cta computation
    
    * review comments:
    
    set device_id through cudaGetDevice()
    replace cudaMemset with cudaMemsetAsync
    update __threadfence() to __threadfence_system() for inter-device writes
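
As a rough illustration of how a launch margin and a per-SM CTA cap could bound the grid of a persistent kernel, here is a minimal host-side sketch. It is not the code from this PR: the helper names `read_launch_margin` and `persistent_cta_budget`, the environment variable name passed in by the caller, and the clamp to at least one CTA are all assumptions made for the example.

```cpp
#include <algorithm>
#include <climits>
#include <cstdlib>

// Hypothetical helper: read a non-negative launch margin from the environment
// variable `env_name`. Returns `fallback` if the variable is unset, empty,
// not a plain decimal integer, negative, or too large for an int.
int read_launch_margin(const char* env_name, int fallback) {
    const char* raw = std::getenv(env_name);
    if (raw == nullptr || *raw == '\0') return fallback;
    char* end = nullptr;
    long value = std::strtol(raw, &end, 10);
    if (*end != '\0' || value < 0 || value > INT_MAX) return fallback;
    return static_cast<int>(value);
}

// Hypothetical helper: number of CTAs a persistent kernel may launch.
//   sm_count          - SMs on the device
//   occupancy_per_sm  - CTAs per SM the kernel could fit (assumed known)
//   max_ctas_per_sm   - user-supplied cap on CTAs per SM
//   launch_margin     - CTAs left free for other work
// The result is clamped to at least 1 so the kernel can always launch
// (an assumption of this sketch).
int persistent_cta_budget(int sm_count, int occupancy_per_sm,
                          int max_ctas_per_sm, int launch_margin) {
    int per_sm = std::min(occupancy_per_sm, max_ctas_per_sm);
    int total = sm_count * per_sm - launch_margin;
    return std::max(total, 1);
}
```

For example, with 80 SMs, an occupancy of 2 CTAs per SM, a cap of 1 CTA per SM, and a margin of 4, this sketch yields a budget of 76 CTAs.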