Commits · fedfe0d7159711198a77ca1a6ba8cc20d665ddce · OpenDAS / apex

27 Apr, 2019 2 commits

jjsjann123 authored Apr 26, 2019

* Persistent group batchnorm added

Added persistent grouped batch norm for performance runs in the strong-scaling case.
Currently supports only:

  1. nhwc layout
  2. fp16
  3. synchronization only within a node!

An environment variable is used to tune LAUNCH_MARGIN, which limits CTA usage
by the persistent kernel.

Documentation and examples will follow.

* updating type().scalarType() to scalar_type()

* moving launch margin to be defined at layer creation, adding a knob to cap max CTAs per SM

* fixing the cta computation

* review comment:

set device_id through cudaGetDevice()
move cudaMemset to cudaMemsetAsync
updated __threadfence() to __threadfence_system() for inter-device writes

fedfe0d7
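
The commit message above describes a launch margin and a per-SM cap that limit how many CTAs the persistent kernel occupies. A minimal sketch of how such a grid size might be computed is shown here. The helper names, the `BN_LAUNCH_MARGIN` variable name, and the exact formula are assumptions for illustration and are not taken from the commit.

```python
import os


def persistent_grid_size(sm_count, occupancy, launch_margin=0, max_cta_per_sm=2):
    """Number of CTAs to launch for a persistent kernel (hypothetical helper)."""
    # Cap the kernel's per-SM occupancy with the user-facing knob.
    ctas_per_sm = min(occupancy, max_cta_per_sm)
    # Leave `launch_margin` CTAs free for other kernels, but launch at least one.
    return max(sm_count * ctas_per_sm - launch_margin, 1)


def launch_margin_from_env(default=0, var="BN_LAUNCH_MARGIN"):
    # `BN_LAUNCH_MARGIN` is a made-up name, not the variable apex reads.
    return int(os.environ.get(var, default))


# Example: 80 SMs, occupancy of 2 CTAs per SM, 12 CTAs held back.
print(persistent_grid_size(80, 2, launch_margin=12))  # 148
```

Per the later bullet in the same message, the margin was moved to layer creation, so in practice the value would be passed as an argument rather than read from the environment.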

- syntax · e7beba17
  Michael Carilli authored Apr 27, 2019

e7beba17

26 Apr, 2019 7 commits
- Removing instances of ScalarType, still need to change macros · d175acb0
  Michael Carilli authored Apr 26, 2019
  
  d175acb0
- Merging in master · d900e93c
  Michael Carilli authored Apr 26, 2019
  
  d900e93c
- whitespace · c978bda5
  Michael Carilli authored Apr 26, 2019
  
  c978bda5
- Explicit control over number of allreduce streams for DDP · 73d4212d
  Michael Carilli authored Apr 26, 2019
  
  73d4212d
- Replace type().ScalarType() with scalar_type() (#272) · 855808f3
  ptrblck authored Apr 26, 2019
```
* change .type().ScalarType() to .scalar_type() + at::ScalarType::X to at::kX

* revert scalar_type() to type() for AT_DISPATCH_FLOATING_TYPES_AND_HALF

* revert scalar_type() to type() in AT_DISPATCH_FLOATING_TYPES

* revert scalar_type() to type() for AT_DISPATCH_FLOATING_TYPES_AND_HALF in welford.cu

* revert scalar_type() to type() in layer_norm_cuda_kernel.cu

* revert at::kType to at::ScalarType::Type

* use DISPATCH_FLOAT_AND_HALF to get rid of warnings

* add dispatch mechanisms for double+float and double+float+half
```
  855808f3
- Tested on 1x8x1 · 070c7e96
  Michael Carilli authored Apr 26, 2019
  
  070c7e96
- Fixed bounds checking · 3b32c401
  Michael Carilli authored Apr 26, 2019
  
  3b32c401
25 Apr, 2019 3 commits
- Don't launch for empty sets · 2c63ba91
  Michael Carilli authored Apr 25, 2019
  
  2c63ba91
- syntax · 91362442
  Michael Carilli authored Apr 25, 2019
  
  91362442
- let's see · 75139ca3
  Michael Carilli authored Apr 25, 2019
  
  75139ca3
24 Apr, 2019 4 commits
- Initial organization · e0f2ffa5
  Michael Carilli authored Apr 24, 2019
  
  e0f2ffa5
- Moving sgd to optimizers · bf4aa847
  Michael Carilli authored Apr 23, 2019
  
  bf4aa847
- Merging in FusedAdam treatment · 6af5980e
  Michael Carilli authored Apr 23, 2019
  
  6af5980e
- Updating explanation for record_stream · 7aad54f7
  Michael Carilli authored Apr 24, 2019
  
  7aad54f7
23 Apr, 2019 2 commits
- Moving flat allreduce buffer creation to main stream · 25ac9897
  Michael Carilli authored Apr 23, 2019
  
  25ac9897
- move and fix check_optimizers (#268) · 1c464b48
  ptrblck authored Apr 23, 2019
  
  1c464b48
22 Apr, 2019 1 commit
- Updating TensorList->TensorListMetadata · 16a3bdf3
  Michael Carilli authored Apr 22, 2019
  
  16a3bdf3
18 Apr, 2019 4 commits
- cleanup · 651150cb
  Michael Carilli authored Apr 18, 2019
  
  651150cb
- Merging in master · 843cdbe0
  Michael Carilli authored Apr 18, 2019
  
  843cdbe0
- initial commit, add CUDA warning to check_params_fp32 (#263) · 28097c99
  ptrblck authored Apr 18, 2019
  
  28097c99
- Update README.md (#261) · cd2708cc
  Glenn Jocher authored Apr 18, 2019
  
  cd2708cc
17 Apr, 2019 1 commit
- Option to elide unflattening copy · b8965a78
  Michael Carilli authored Apr 17, 2019
  
  b8965a78
16 Apr, 2019 5 commits
- Better way to expose scale adjustment · 887a50bd
  Michael Carilli authored Apr 16, 2019
  
  887a50bd
- Compatibility between skip_step and FusedAdam.step · 9efb2809
  Michael Carilli authored Apr 16, 2019
  
  9efb2809
- Adding control point for scale adjustment · 111ee132
  Michael Carilli authored Apr 16, 2019
  
  111ee132
- Adding option to ensure that model outputs are a desired type · 0b5dd020
  Michael Carilli authored Apr 16, 2019
  
  0b5dd020
- Adding option to ensure that model outputs are a desired type · d69011de
  Michael Carilli authored Apr 16, 2019
  
  d69011de
15 Apr, 2019 3 commits
- fp16_groups is an attribute of _amp_stash · eea4c0aa
  Michael Carilli authored Apr 15, 2019
  
  eea4c0aa
- Scaler not needed for prepare_backward*fused · e5213b28
  Michael Carilli authored Apr 15, 2019
  
  e5213b28
- For testing purposes, enable the case where FusedAdam is not wrapped by amp · 5ae6008d
  Michael Carilli authored Apr 15, 2019
  
  5ae6008d
12 Apr, 2019 1 commit
- Update Wil's code + typo · 53fd093d
  Michael Carilli authored Apr 12, 2019
  
  53fd093d
11 Apr, 2019 7 commits
- Merge branch 'master' into prepare_fused · 3c53cf81
  Michael Carilli authored Apr 11, 2019
  
  3c53cf81
- typo · b7f10ad0
  Michael Carilli authored Apr 11, 2019
  
  b7f10ad0
- Patching in changes to enable multiple allreduces in flight · 8521bb22
  Michael Carilli authored Apr 11, 2019
  
  8521bb22
- Rough cut, control flow should work for scaleout testing · 61b8a0fd
  Michael Carilli authored Apr 11, 2019
  
  61b8a0fd
- prelu belongs in FP16_CASTS (#257) · 4dc711bc
  henrymai authored Apr 11, 2019
```
The main use of these functions (e.g.: `torch.{conv*, prelu}`) is via their `torch.nn`
wrapping layers.

The `torch.nn` layers are what contain the weights and call into these lower level
functions with the weights as a parameter in their `forward()` method.

The `torch.conv*` functions are already in the `FP16_CASTS` list due to amp's philosophy of
casting the parameters rather than the model/layer weights.

Conceptually `torch.prelu` is the same as the `torch.conv*` case, where its weight parameter
is passed in from its wrapper layer `torch.nn.PReLU`.
```
  4dc711bc
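
The reasoning in the commit message above (cast the arguments of the low-level function, not the weights stored on the layer) can be illustrated with a toy sketch. This is not amp's implementation: `FakeTensor`, `cast_args_to_half`, and `PReLULayer` are hypothetical stand-ins that only track a dtype string.

```python
import functools
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FakeTensor:
    """Stand-in for a tensor; it only tracks a dtype string."""
    name: str
    dtype: str = "float32"

    def half(self):
        return replace(self, dtype="float16")


def cast_args_to_half(fn):
    """Wrap `fn` so FakeTensor arguments are cast to float16 at call time."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        def cast(a):
            return a.half() if isinstance(a, FakeTensor) else a
        return fn(*[cast(a) for a in args],
                  **{k: cast(v) for k, v in kwargs.items()})
    return wrapper


def prelu(x, weight):
    # Hypothetical low-level function: reports the dtypes it received.
    return (x.dtype, weight.dtype)


class PReLULayer:
    def __init__(self):
        self.weight = FakeTensor("weight")  # stored in float32

    def forward(self, x):
        # The layer passes its own weight into the low-level function.
        return prelu(x, self.weight)


prelu = cast_args_to_half(prelu)  # loosely analogous to an FP16_CASTS entry
layer = PReLULayer()
print(layer.forward(FakeTensor("x")))  # ('float16', 'float16')
print(layer.weight.dtype)              # float32
```

The layer's stored weight stays in float32. Only the copy handed to the wrapped function is cast, which is why a function that receives its weight as a parameter fits the same list as the `torch.conv*` functions.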
- Fixing merge conflict in setup.py · dda59354
  Michael Carilli authored Apr 10, 2019
  
  dda59354
- some cleanup · fc6c5a25
  Michael Carilli authored Apr 10, 2019
  
  fc6c5a25