enable ROCm build; add BF16 for ROCm and CUDA (#325)
* first step, everything compiles
* fix rebuilds; skip cuda version check for rocm
* use macro for __shfl_up_sync __shfl_down_sync
* add BFloat16 support for ROCm and CUDA
* add USE_ROCM definition to setup.py
* flake8 fixes