Unverified Commit 733e6ff8 authored by bdf and committed by GitHub

Pick MLU modifications from master (1.x) to main (2.x) (#2704)



* [Feature] Support Voxelization with cambricon MLU device (#2500)

* [Feature] Support hard_voxelize with cambricon MLU backend

* [Feature](bangc-ops): add voxelization op

* [Enhance] Optimize the performance of ms_deform_attn for MLU device (#2510)

* ms_opt

* [Feature] ms_deform_attn performance optimization

* [Feature] Support ball_query with cambricon MLU backend and mlu-ops library. (#2520)

* [Feature] Support ball_query with cambricon MLU backend and mlu-ops library.

* [Fix] update operator data layout setting.

* [Fix] add cxx compile option to avoid symbol conflict.

* [Fix] fix lint errors.

* [Fix] update ops.md with info of ball_query support by MLU backend.

* [Feature] Fix typo.

* [Fix] Remove print.

* [Fix] get mlu-ops from MMCV_MLU_OPS_PATH env.

* [Fix] update MMCV_MLU_OPS_PATH check logic.

* [Fix] update error info when failed to download mlu-ops.

* [Fix] check mlu-ops version matching info in mmcv.

* [Fix] revise wrong filename.

* [Fix] remove f.close and re.

* [Docs] Steps to compile mmcv-full on MLU machine (#2571)

* [Docs] Steps to compile mmcv-full on MLU machine

* [Docs] Adjust paragraph order

* Update docs/zh_cn/get_started/build.md
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>

* Update docs/en/get_started/build.md
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>

* [Docs] Modify the format

---------
Co-authored-by: budefei <budefei@cambricon.com>
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>

* [Fix] Fix tensor descriptor setting in MLU ball_query. (#2579)

* [Feature] Add MLU support for Sparse Convolution op (#2589)

* [Feature] Add sparse convolution MLU API

* [Feature] update cpp code style

* end-of-file

* delete libext.a

* code style

* update ops.md

---------
Co-authored-by: budefei <budefei@cambricon.com>

* [Enhancement] Replace the implementation of deform_roi_pool with mlu-ops (#2598)

* [Feature] Replace the implementation of deform_roi_pool with mlu-ops

* [Feature] Modify code

---------
Co-authored-by: budefei <budefei@cambricon.com>

* [Enhancement] ms_deform_attn performance optimization (#2616)

* ms_opt_v2

* ms_opt_v2_1

* optimize MultiScaleDeformableAttention ops for MLU

* [Feature] ms_deform_attn performance optimization V2

---------
Co-authored-by: dongchengwei <dongchengwei@cambricon.com>

* [Feature] Support NmsRotated with cambricon MLU backend (#2643)

* [Feature] Support NmsRotated with cambricon MLU backend

* [Feature] remove foolproofs in nms_rotated_mlu.cpp

* [Feature] fix lint in test_nms_rotated.py

* [Feature] fix kMLU not found in nms_rotated.cpp

* [Feature] modify mlu support in nms.py

* [Feature] modify nms_rotated support in ops.md

* [Feature] modify ops/nms.py

* [Enhance] Add a default value for MMCV_MLU_ARGS (#2688)

* add mlu_args

* add mlu_args

* Modify the code

---------
Co-authored-by: budefei <budefei@cambricon.com>

* [Enhance] Ignore mlu-ops files (#2691)
Co-authored-by: budefei <budefei@cambricon.com>

---------
Co-authored-by: ZShaopeng <108382403+ZShaopeng@users.noreply.github.com>
Co-authored-by: BinZheng <38182684+Wickyzheng@users.noreply.github.com>
Co-authored-by: liuduanhui <103939338+DanieeelLiu@users.noreply.github.com>
Co-authored-by: budefei <budefei@cambricon.com>
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>
Co-authored-by: duzekun <108381389+duzekunKTH@users.noreply.github.com>
Co-authored-by: dongchengwei <dongchengwei@cambricon.com>
Co-authored-by: liuyuan1-v <125547457+liuyuan1-v@users.noreply.github.com>
parent 1f161f68
......@@ -27,6 +27,8 @@ wheels/
.installed.cfg
*.egg
MANIFEST
mlu-ops/
mlu-ops.*
# PyInstaller
# Usually these files are written by a python script from a template
......
......@@ -290,3 +290,60 @@ If you need to use PyTorch-related modules, make sure PyTorch has been successfu
```bash
python -c 'import mmcv;print(mmcv.__version__)'
```
### Build mmcv-full on Cambricon MLU Devices
#### Install torch_mlu
##### Option 1: Install mmcv-full based on the Cambricon docker image
First, pull the Cambricon docker image (please email service@cambricon.com for the latest release docker):
```bash
docker pull ${docker image}
```
Run and attach to the docker, then [install mmcv-full on the MLU device](#install-mmcv\-full-on-cambricon-mlu-device) and [make sure you've installed mmcv-full on the MLU device successfully](#test-code). A minimal way to run and attach is sketched below.
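The exact `docker run` flags depend on your MLU driver setup and are not covered by this commit; as a minimal sketch (the image name is the placeholder from the pull step):
```bash
# Start an interactive container from the Cambricon image and attach to it
docker run -it ${docker image} /bin/bash
```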
##### Option 2: Install mmcv-full by compiling the Cambricon PyTorch source code
Please email service@cambricon.com or contact Cambricon engineers for a suitable version of the CATCH package. Once you have it, follow the steps in ${CATCH-path}/CONTRIBUTING.md to install Cambricon PyTorch.
#### Install mmcv-full on Cambricon MLU device
Clone the repo
```bash
git clone https://github.com/open-mmlab/mmcv.git
```
The mlu-ops library will be downloaded to the default directory (mmcv/mlu-ops) while building MMCV. You can also set `MMCV_MLU_OPS_PATH` to an existing mlu-ops library before building as follows:
```bash
export MMCV_MLU_OPS_PATH=/xxx/xxx/mlu-ops
```
Install mmcv-full
```bash
cd mmcv
export MMCV_WITH_OPS=1
export FORCE_MLU=1
python setup.py install
```
#### Test Code
After finishing the previous steps, you can run the following Python code to make sure that you've installed mmcv-full on the MLU device successfully:
```python
import torch
import torch_mlu
from mmcv.ops import sigmoid_focal_loss
x = torch.randn(3, 10).mlu()
x.requires_grad = True
y = torch.tensor([1, 5, 3]).mlu()
w = torch.ones(10).float().mlu()
output = sigmoid_focal_loss(x, y, 2.0, 0.25, w, 'none')
print(output)
```
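If the installation succeeded, this should print a 3 x 10 tensor of per-element focal-loss values computed on the MLU device.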
......@@ -6,7 +6,7 @@ We implement common ops used in detection, segmentation, etc.
| ---------------------------- | --- | ---- | --- | --- | ------ |
| ActiveRotatedFilter | √ | √ | | | |
| AssignScoreWithK | | √ | | | |
| BallQuery | | √ | | | |
| BallQuery | | √ | √ | | |
| BBoxOverlaps | | √ | √ | √ | √ |
| BorderAlign | | √ | | | |
| BoxIouRotated | √ | √ | | | |
......@@ -35,7 +35,7 @@ We implement common ops used in detection, segmentation, etc.
| ModulatedDeformConv2d | √ | √ | | | √ |
| MultiScaleDeformableAttn | | √ | √ | | |
| NMS | √ | √ | √ | | √ |
| NMSRotated | √ | √ | | | √ |
| NMSRotated | √ | √ | √ | | √ |
| NMSQuadri | √ | √ | | | |
| PixelGroup | √ | | | | |
| PointsInBoxes | √ | √ | | | |
......@@ -52,13 +52,13 @@ We implement common ops used in detection, segmentation, etc.
| SigmoidFocalLoss | | √ | √ | | √ |
| SoftmaxFocalLoss | | √ | | | √ |
| SoftNMS | | √ | | | |
| Sparse Convolution | | √ | | | |
| Sparse Convolution | | √ | √ | | |
| Synchronized BatchNorm | | √ | | | |
| ThreeInterpolate | | √ | | | |
| ThreeNN | | √ | √ | | |
| TINShift | | √ | √ | | |
| UpFirDn2d | | √ | | | |
| Voxelization | √ | √ | | | √ |
| Voxelization | √ | √ | √ | | √ |
| PrRoIPool | | √ | | | |
| BezierAlign | √ | √ | | | |
| BiasAct | | √ | | | |
......
......@@ -298,3 +298,59 @@ mmcv has two versions:
```bash
python -c 'import mmcv;print(mmcv.__version__)'
```
### Build mmcv-full on a Cambricon MLU Machine
#### Install torch_mlu
##### Option 1: Install based on the Cambricon docker image
First, download and pull the Cambricon docker image (please email service@cambricon.com for the latest Cambricon PyTorch release docker).
```bash
docker pull ${docker image}
```
Enter the docker, then [build MMCV on the MLU](#build-mmcv) and [verify the installation](#verify-the-installation).
##### Option 2: Install by compiling the Cambricon PyTorch source code
Please email service@cambricon.com or contact Cambricon engineers for a suitable version of the CATCH package. Once you have it, follow the steps in ${CATCH-path}/CONTRIBUTING.md to install CATCH.
#### Build MMCV
Clone the repository
```bash
git clone https://github.com/open-mmlab/mmcv.git
```
The mlu-ops library is downloaded automatically to the default path (mmcv/mlu-ops) when building MMCV. You can also set the environment variable MMCV_MLU_OPS_PATH to an existing mlu-ops library before building.
```bash
export MMCV_MLU_OPS_PATH=/xxx/xxx/mlu-ops
```
Start building
```bash
cd mmcv
export MMCV_WITH_OPS=1
export FORCE_MLU=1
python setup.py install
```
#### Verify the Installation
After finishing the steps above, you can try running the following Python code to check whether you have successfully installed mmcv-full on the MLU device:
```python
import torch
import torch_mlu
from mmcv.ops import sigmoid_focal_loss
x = torch.randn(3, 10).mlu()
x.requires_grad = True
y = torch.tensor([1, 5, 3]).mlu()
w = torch.ones(10).float().mlu()
output = sigmoid_focal_loss(x, y, 2.0, 0.25, w, 'none')
```
......@@ -6,7 +6,7 @@ MMCV provides commonly used operators for detection, segmentation, and other tasks
| ---------------------------- | --- | ---- | --- | --- | ------ |
| ActiveRotatedFilter | √ | √ | | | |
| AssignScoreWithK | | √ | | | |
| BallQuery | | √ | | | |
| BallQuery | | √ | √ | | |
| BBoxOverlaps | | √ | √ | √ | √ |
| BorderAlign | | √ | | | |
| BoxIouRotated | √ | √ | | | |
......@@ -35,7 +35,7 @@ MMCV provides commonly used operators for detection, segmentation, and other tasks
| ModulatedDeformConv2d | √ | √ | | | √ |
| MultiScaleDeformableAttn | | √ | √ | | |
| NMS | √ | √ | √ | | √ |
| NMSRotated | √ | √ | | | √ |
| NMSRotated | √ | √ | √ | | √ |
| NMSQuadri | √ | √ | | | |
| PixelGroup | √ | | | | |
| PointsInBoxes | √ | √ | | | |
......@@ -52,13 +52,13 @@ MMCV provides commonly used operators for detection, segmentation, and other tasks
| SigmoidFocalLoss | | √ | √ | | √ |
| SoftmaxFocalLoss | | √ | | | √ |
| SoftNMS | | √ | | | |
| Sparse Convolution | | √ | | | |
| Sparse Convolution | | √ | √ | | |
| Synchronized BatchNorm | | √ | | | |
| ThreeInterpolate | | √ | | | |
| ThreeNN | | √ | √ | | |
| TINShift | | √ | √ | | |
| UpFirDn2d | | √ | | | |
| Voxelization | √ | √ | | | √ |
| Voxelization | √ | √ | √ | | √ |
| PrRoIPool | | √ | | | |
| BezierAlign | √ | √ | | | |
| BiasAct | | √ | | | |
......
/*************************************************************************
* Copyright (C) 2022 Cambricon.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*************************************************************************/
#include "mlu_common_helper.h"
void ball_query_forward_mlu(int b, int n, int m, float min_radius,
float max_radius, int nsample, const Tensor new_xyz,
const Tensor xyz, Tensor idx) {
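// Make the inputs contiguous in new_xyz's suggested memory format, build
// mlu-ops tensor descriptors for them, fetch the raw device pointers, and
// dispatch mluOpBallQuery on the current mlu-ops handle.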
auto new_xyz_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
new_xyz, new_xyz.suggest_memory_format());
auto xyz_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
xyz, new_xyz.suggest_memory_format());
auto idx_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
idx, new_xyz.suggest_memory_format());
MluOpTensorDescriptor new_xyz_desc, xyz_desc, idx_desc;
new_xyz_desc.set(new_xyz_contiguous);
xyz_desc.set(xyz_contiguous);
idx_desc.set(idx_contiguous);
auto new_xyz_impl = torch_mlu::getMluTensorImpl(new_xyz_contiguous);
auto xyz_impl = torch_mlu::getMluTensorImpl(xyz_contiguous);
auto idx_impl = torch_mlu::getMluTensorImpl(idx_contiguous);
auto new_xyz_ptr = new_xyz_impl->cnnlMalloc();
auto xyz_ptr = xyz_impl->cnnlMalloc();
auto idx_ptr = idx_impl->cnnlMalloc();
auto handle = mluOpGetCurrentHandle();
mluOpBallQuery(handle, new_xyz_desc.desc(), new_xyz_ptr, xyz_desc.desc(),
xyz_ptr, min_radius, max_radius, nsample, idx_desc.desc(),
idx_ptr);
}
void ball_query_forward_impl(int b, int n, int m, float min_radius,
float max_radius, int nsample,
const Tensor new_xyz, const Tensor xyz,
Tensor idx);
REGISTER_DEVICE_IMPL(ball_query_forward_impl, MLU, ball_query_forward_mlu);
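// REGISTER_DEVICE_IMPL binds ball_query_forward_mlu to the generic
// ball_query_forward_impl dispatch entry for the MLU device type.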
......@@ -9,254 +9,59 @@
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*************************************************************************/
#include "pytorch_device_registry.hpp"
#include "pytorch_mlu_helper.hpp"
void KernelDeformRoIPoolForward(cnrtDim3_t k_dim, cnrtFunctionType_t k_type,
cnrtQueue_t queue, cnrtDataType_t data_type,
const void *input, const void *rois,
const void *offset, void *output,
const int channels, const int height,
const int width, const int num_rois,
const int pooled_height, const int pooled_width,
const float spatial_scale,
const int sampling_ratio, const float gamma);
void KernelDeformRoIPoolBackward(
cnrtDim3_t k_dim, cnrtFunctionType_t k_type, cnrtQueue_t queue,
cnrtDataType_t data_type, const void *grad_output, const void *input,
const void *rois, const void *offset, void *grad_input, void *grad_offset,
const int channels, const int height, const int width, const int num_rois,
const int pooled_height, const int pooled_width, const float spatial_scale,
const int sampling_ratio, const float gamma);
// policy function for forward and backward
static void policyFunc(const int bin_num, cnrtDim3_t *k_dim,
cnrtFunctionType_t *k_type) {
const size_t cluster_limit = torch_mlu::getDeviceAttr(cnrtAttrClusterCount);
const size_t core_limit = torch_mlu::getDeviceAttr(cnrtAttrMcorePerCluster);
const size_t bin_num_align = CEIL_ALIGN(bin_num, core_limit);
k_dim->x = core_limit;
k_dim->y = (bin_num_align / core_limit) > cluster_limit
? cluster_limit
: (bin_num_align / core_limit);
k_dim->z = 1;
*k_type = CNRT_FUNC_TYPE_UNION1;
}
#include "mlu_common_helper.h"
void DeformRoIPoolForwardMLUKernelLauncher(Tensor input, Tensor rois,
Tensor offset, Tensor output,
int pooled_height, int pooled_width,
float spatial_scale,
int sampling_ratio, float gamma) {
// Check dtype.
TORCH_CHECK(
input.scalar_type() == at::kFloat || input.scalar_type() == at::kHalf,
"input type should be Float or Half, got ", input.scalar_type());
TORCH_CHECK(input.scalar_type() == rois.scalar_type(),
"rois should have the same type as input");
// Check shape.
TORCH_CHECK(input.dim() == 4, "input should be 4d tensor, got ", input.dim(),
"D.");
TORCH_CHECK(rois.dim() == 2, "rois should be 2d tensor, got ", rois.dim(),
"D.");
if (offset.defined() && offset.numel() > 0) {
TORCH_CHECK(input.scalar_type() == offset.scalar_type(),
"offset should have the same type as input");
TORCH_CHECK(offset.dim() == 4, "offset should be 4d tensor, got ",
offset.dim(), "D.");
TORCH_CHECK(
(offset.size(0) == rois.size(0)), "offset.size(0) = ", offset.size(0),
"while rois.size(0)) = ", rois.size(0), ". They should be the same.");
TORCH_CHECK((offset.size(1) == 2), "offset.size(1) should be 2, ",
"but now offset.size(1) = ", offset.size(1), ".");
TORCH_CHECK((offset.size(2) == output.size(2)),
"offset.size(2) = ", offset.size(2),
"while output.size(2)) = ", output.size(2),
". They should be the same.");
TORCH_CHECK((offset.size(3) == output.size(3)),
"offset.size(3) = ", offset.size(3),
"while output.size(3)) = ", output.size(3),
". They should be the same.");
}
TORCH_CHECK(spatial_scale > 0 && spatial_scale <= 1,
"spatial_scale should be within (0, 1], got ", spatial_scale,
".");
// compute kernel params
auto height = input.size(2);
auto width = input.size(3);
auto channels = input.size(1);
auto num_rois = output.size(0);
if (output.numel() == 0) {
output = at::zeros({num_rois, channels, pooled_height, pooled_width},
input.options());
return;
}
// zero element check
TORCH_CHECK(input.size(0) != 0, "input.size(0) should not be zero, got ",
input.size(0));
TORCH_CHECK(rois.numel() != 0, "rois.numel() should not be zero, got ",
rois.numel());
if (input.numel() == 0 || output.numel() == 0) {
return;
}
// large tensor check
const size_t max_input_num = 2147483648; // 2^31, 2G num
TORCH_CHECK(input.numel() < max_input_num,
"input.numel() should be less than 2147483648, got ",
input.numel());
TORCH_CHECK(rois.numel() < max_input_num,
"rois.numel() should be less than 2147483648, got ",
rois.numel());
TORCH_CHECK(output.numel() < max_input_num,
"output.numel() should be less than 2147483648, got ",
output.numel());
TORCH_CHECK(!offset.defined() || offset.numel() < max_input_num,
"offset.numel() should be less than 2147483648, got ",
offset.numel());
auto memory_format =
torch_mlu::cnnl::ops::get_channels_last_memory_format(input.dim());
auto input_ = torch_mlu::cnnl::ops::cnnl_contiguous(input, memory_format);
at::Tensor output_ =
at::empty({num_rois, channels, pooled_height, pooled_width},
input.options(), memory_format);
// calculate task dimension
cnrtDim3_t k_dim;
cnrtFunctionType_t k_type;
policyFunc(num_rois * pooled_height * pooled_width, &k_dim, &k_type);
// get compute queue
auto queue = torch_mlu::getCurQueue();
auto rois_contiguous =
torch_mlu::cnnl::ops::cnnl_contiguous(rois, rois.suggest_memory_format());
auto output_contiguous =
torch_mlu::cnnl::ops::cnnl_contiguous(output, memory_format);
MluOpTensorDescriptor input_desc, rois_desc, offset_desc, output_desc;
input_desc.set_with_layout(input_, MLUOP_LAYOUT_NHWC);
rois_desc.set(rois_contiguous);
output_desc.set_with_layout(output_contiguous, MLUOP_LAYOUT_NHWC);
mluOpTensorDescriptor_t offset_real_desc = NULL;
void *offset_ptr = NULL;
if (offset.defined() && offset.numel() > 0) {
auto offset_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
offset, offset.suggest_memory_format());
offset_desc.set(offset_contiguous);
offset_real_desc = offset_desc.desc();
auto offset_impl = torch_mlu::getMluTensorImpl(offset_contiguous);
offset_ptr = offset_impl->cnnlMalloc();
}
// get ptr of tensors
auto input_impl = torch_mlu::getMluTensorImpl(input_);
auto input_ptr = input_impl->cnnlMalloc();
auto rois_impl = torch_mlu::getMluTensorImpl(rois);
auto rois_impl = torch_mlu::getMluTensorImpl(rois_contiguous);
auto rois_ptr = rois_impl->cnnlMalloc();
auto offset_impl = torch_mlu::getMluTensorImpl(offset);
auto offset_ptr = offset_impl->cnnlMalloc();
auto output_impl = torch_mlu::getMluTensorImpl(output_);
auto output_impl = torch_mlu::getMluTensorImpl(output_contiguous);
auto output_ptr = output_impl->cnnlMalloc();
// get compute dtype of input
cnrtDataType_t data_type = torch_mlu::toCnrtDtype(input_.dtype());
// launch kernel
CNLOG(INFO) << "Launch Kernel MLUKernelDeformRoIPoolForward<<<" << k_dim.x
<< ", " << k_dim.y << ", " << k_dim.z << ">>>";
// get compute handle
auto handle = mluOpGetCurrentHandle();
mluOpDeformRoiPoolForward(
handle, input_desc.desc(), input_ptr, rois_desc.desc(), rois_ptr,
offset_real_desc, offset_ptr, pooled_height, pooled_width, spatial_scale,
sampling_ratio, gamma, output_desc.desc(), output_ptr);
KernelDeformRoIPoolForward(k_dim, k_type, queue, data_type, input_ptr,
rois_ptr, offset_ptr, output_ptr, channels, height,
width, num_rois, pooled_height, pooled_width,
spatial_scale, sampling_ratio, gamma);
output.copy_(output_);
output.copy_(output_contiguous);
}
void DeformRoIPoolBackwardMLUKernelLauncher(
Tensor grad_output, Tensor input, Tensor rois, Tensor offset,
Tensor grad_input, Tensor grad_offset, int pooled_height, int pooled_width,
float spatial_scale, int sampling_ratio, float gamma) {
// Check dtype.
TORCH_CHECK(
input.scalar_type() == at::kFloat || input.scalar_type() == at::kHalf,
"input type should be Float or Half, got ", input.scalar_type());
TORCH_CHECK(input.scalar_type() == grad_output.scalar_type(),
"grad_output should have the same type as input");
TORCH_CHECK(input.scalar_type() == rois.scalar_type(),
"rois should have the same type as input");
TORCH_CHECK(input.scalar_type() == grad_input.scalar_type(),
"grad_input should have the same type as input");
// Check shape.
TORCH_CHECK(grad_output.dim() == 4, "grad_output should be 4d tensor, got ",
grad_output.dim(), "D.");
TORCH_CHECK(input.dim() == 4, "input should be 4d tensor, got ", input.dim(),
"D.");
TORCH_CHECK(rois.dim() == 2, "rois should be 2d tensor, got ", rois.dim(),
"D.");
if (offset.defined() && offset.numel() > 0) {
TORCH_CHECK(input.scalar_type() == offset.scalar_type(),
"offset should have the same type as input");
TORCH_CHECK(offset.dim() == 4, "offset should be 4d tensor, got ",
offset.dim(), "D.");
TORCH_CHECK(
(offset.size(0) == rois.size(0)), "offset.size(0) = ", offset.size(0),
"while rois.size(0)) = ", rois.size(0), ". They should be the same.");
TORCH_CHECK((offset.size(1) == 2), "offset.size(1) should be 2, ",
"but now offset.size(1) = ", offset.size(1), ".");
TORCH_CHECK((offset.size(2) == grad_output.size(2)),
"offset.size(2) = ", offset.size(2),
"while grad_output.size(2)) = ", grad_output.size(2),
". They should be the same.");
TORCH_CHECK((offset.size(3) == grad_output.size(3)),
"offset.size(3) = ", offset.size(3),
"while grad_output.size(3)) = ", grad_output.size(3),
". They should be the same.");
}
TORCH_CHECK(spatial_scale > 0 && spatial_scale <= 1,
"spatial_scale should be within (0, 1], got ", spatial_scale);
// Check relationship between tensor.
TORCH_CHECK((grad_output.size(0) == rois.size(0)),
"grad_output.size(0) = ", grad_output.size(0),
"while rois.size(0)) = ", rois.size(0),
". They should be the same.");
TORCH_CHECK((grad_output.size(1) == input.size(1)),
"grad_output.size(1) = ", grad_output.size(1),
"while input.size(1)) = ", input.size(1),
". They should be the same.");
TORCH_CHECK((grad_output.size(2) == pooled_height),
"grad_output.size(2) = ", grad_output.size(2),
"while pooled_height = ", pooled_height,
". They should be the same.");
TORCH_CHECK((grad_output.size(3) == pooled_width),
"grad_output.size(3) = ", grad_output.size(3),
"while pooled_width = ", pooled_width,
". They should be the same.");
// compute kernel params
auto batch = input.size(0);
auto channels = input.size(1);
auto height = input.size(2);
auto width = input.size(3);
auto num_rois = grad_output.size(0);
// zero element check
TORCH_CHECK(input.size(0) != 0, "input.size(0) should not be zero, got ",
input.size(0));
TORCH_CHECK(rois.numel() != 0, "rois.numel() should not be zero, got ",
rois.numel());
if (input.numel() == 0 || grad_output.numel() == 0) {
return;
}
// large tensor check
const size_t max_input_num = 2147483648; // 2^31, 2G num
TORCH_CHECK(input.numel() < max_input_num,
"input.numel() should be less than 2147483648, got ",
input.numel());
TORCH_CHECK(rois.numel() < max_input_num,
"rois.numel() should be less than 2147483648, got ",
rois.numel());
TORCH_CHECK(grad_output.numel() < max_input_num,
"grad_output.numel() should be less than 2147483648, got ",
grad_output.numel());
TORCH_CHECK(!offset.defined() || offset.numel() < max_input_num,
"offset.numel() should be less than 2147483648, got ",
offset.numel());
auto memory_format =
torch_mlu::cnnl::ops::get_channels_last_memory_format(grad_output.dim());
auto grad_output_ =
......@@ -264,45 +69,56 @@ void DeformRoIPoolBackwardMLUKernelLauncher(
memory_format =
torch_mlu::cnnl::ops::get_channels_last_memory_format(input.dim());
auto input_ = torch_mlu::cnnl::ops::cnnl_contiguous(input, memory_format);
at::Tensor grad_input_ = at::empty({batch, channels, height, width},
input.options(), memory_format)
.zero_();
// calculate task dimension
cnrtDim3_t k_dim;
cnrtFunctionType_t k_type;
policyFunc(num_rois * pooled_height * pooled_width, &k_dim, &k_type);
// get compute queue
auto queue = torch_mlu::getCurQueue();
auto rois_contiguous =
torch_mlu::cnnl::ops::cnnl_contiguous(rois, rois.suggest_memory_format());
auto grad_input_ =
torch_mlu::cnnl::ops::cnnl_contiguous(grad_input, memory_format);
// get ptr of tensors
auto grad_output_impl = torch_mlu::getMluTensorImpl(grad_output_);
auto grad_output_ptr = grad_output_impl->cnnlMalloc();
auto input_impl = torch_mlu::getMluTensorImpl(input_);
auto input_ptr = input_impl->cnnlMalloc();
auto rois_impl = torch_mlu::getMluTensorImpl(rois);
auto rois_impl = torch_mlu::getMluTensorImpl(rois_contiguous);
auto rois_ptr = rois_impl->cnnlMalloc();
auto offset_impl = torch_mlu::getMluTensorImpl(offset);
auto offset_ptr = offset_impl->cnnlMalloc();
auto grad_input_impl = torch_mlu::getMluTensorImpl(grad_input_);
auto grad_input_ptr = grad_input_impl->cnnlMalloc();
auto grad_offset_impl = torch_mlu::getMluTensorImpl(grad_offset);
auto grad_offset_ptr = grad_offset_impl->cnnlMalloc();
// get compute dtype of input
cnrtDataType_t data_type = torch_mlu::toCnrtDtype(input.dtype());
// launch kernel
CNLOG(INFO) << "Launch Kernel KernelDeformRoIPoolBackward<<<" << k_dim.x
<< ", " << k_dim.y << ", " << k_dim.z << ">>>";
KernelDeformRoIPoolBackward(k_dim, k_type, queue, data_type, grad_output_ptr,
input_ptr, rois_ptr, offset_ptr, grad_input_ptr,
grad_offset_ptr, channels, height, width,
num_rois, pooled_height, pooled_width,
spatial_scale, sampling_ratio, gamma);
MluOpTensorDescriptor grad_output_desc, input_desc, rois_desc, offset_desc,
grad_input_desc, grad_offset_desc;
grad_output_desc.set_with_layout(grad_output_, MLUOP_LAYOUT_NHWC);
input_desc.set_with_layout(input_, MLUOP_LAYOUT_NHWC);
rois_desc.set(rois_contiguous);
grad_input_desc.set_with_layout(grad_input_, MLUOP_LAYOUT_NHWC);
mluOpTensorDescriptor_t offset_real_desc = NULL;
void *offset_ptr = NULL;
if (offset.defined() && offset.numel() > 0) {
auto offset_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
offset, offset.suggest_memory_format());
offset_desc.set(offset_contiguous);
offset_real_desc = offset_desc.desc();
auto offset_impl = torch_mlu::getMluTensorImpl(offset_contiguous);
offset_ptr = offset_impl->cnnlMalloc();
}
mluOpTensorDescriptor_t grad_offset_real_desc = NULL;
void *grad_offset_ptr = NULL;
if (grad_offset.defined() && grad_offset.numel() > 0) {
auto grad_offset_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
grad_offset, grad_offset.suggest_memory_format());
grad_offset_desc.set(grad_offset_contiguous);
grad_offset_real_desc = grad_offset_desc.desc();
auto grad_offset_impl = torch_mlu::getMluTensorImpl(grad_offset_contiguous);
grad_offset_ptr = grad_offset_impl->cnnlMalloc();
}
// get compute handle
auto handle = mluOpGetCurrentHandle();
mluOpDeformRoiPoolBackward(
handle, grad_output_desc.desc(), grad_output_ptr, input_desc.desc(),
input_ptr, rois_desc.desc(), rois_ptr, offset_real_desc, offset_ptr,
pooled_height, pooled_width, spatial_scale, sampling_ratio, gamma,
grad_input_desc.desc(), grad_input_ptr, grad_offset_real_desc,
grad_offset_ptr);
grad_input.copy_(grad_input_);
}
......
/*************************************************************************
* Copyright (C) 2022 Cambricon.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*************************************************************************/
#include "mlu_common_helper.h"
// Descriptors
mluOpDataType_t getMluOpDataType(const caffe2::TypeMeta& data_type) {
const std::map<std::string, mluOpDataType_t> mapping_type = {
{std::string("c10::Half"), MLUOP_DTYPE_HALF},
{std::string("float"), MLUOP_DTYPE_FLOAT},
{std::string("double"), MLUOP_DTYPE_DOUBLE},
{std::string("int8"), MLUOP_DTYPE_INT8},
{std::string("signed char"), MLUOP_DTYPE_INT8},
{std::string("short int"), MLUOP_DTYPE_INT16},
{std::string("short"), MLUOP_DTYPE_INT16},
{std::string("int"), MLUOP_DTYPE_INT32},
{std::string("long int"), MLUOP_DTYPE_INT64},
{std::string("long"), MLUOP_DTYPE_INT64},
{std::string("unsigned char"), MLUOP_DTYPE_UINT8},
{std::string("bool"), MLUOP_DTYPE_BOOL},
{std::string("c10::complex<c10::Half>"), MLUOP_DTYPE_COMPLEX_HALF},
{std::string("c10::complex<float>"), MLUOP_DTYPE_COMPLEX_FLOAT}};
if (mapping_type.find(std::string(data_type.name())) != mapping_type.end()) {
return mapping_type.find(std::string(data_type.name()))->second;
}
return MLUOP_DTYPE_INVALID;
}
// layout
mluOpTensorLayout_t getMluOpSuggestLayout(const at::Tensor& input) {
auto suggest_memory_format = input.suggest_memory_format();
mluOpTensorLayout_t layout = MLUOP_LAYOUT_ARRAY;
switch (input.dim()) {
case 4:
layout = (suggest_memory_format == at::MemoryFormat::ChannelsLast)
? MLUOP_LAYOUT_NHWC
: MLUOP_LAYOUT_NCHW;
break;
case 5:
layout = (suggest_memory_format == at::MemoryFormat::ChannelsLast3d)
? MLUOP_LAYOUT_NDHWC
: MLUOP_LAYOUT_NCDHW;
break;
default:
layout = MLUOP_LAYOUT_ARRAY;
}
return layout;
}
void MluOpTensorDescriptor::set(Tensor t) {
mluOpDataType_t data_type = getMluOpDataType(t.dtype());
mluOpTensorLayout_t layout = getMluOpSuggestLayout(t);
int t_dim = t.dim();
std::vector<int> dim_array;
if (t_dim == 0) {
// A scalar tensor (0-dim, 1-item) is viewed as size 1 by default.
dim_array.push_back(1);
} else {
for (int i = 0; i < t_dim; i++) {
dim_array.push_back(static_cast<int>(t.sizes().vec()[i]));
}
}
set_desc(t, layout, data_type, dim_array);
}
void MluOpTensorDescriptor::set_with_layout(Tensor t,
mluOpTensorLayout_t layout) {
mluOpDataType_t data_type = getMluOpDataType(t.dtype());
int t_dim = t.dim();
std::vector<int> shape_info = checkUpperBoundAndCastTo<int>(t.sizes().vec());
std::vector<int> stride_info =
checkUpperBoundAndCastTo<int>(t.strides().vec());
if (layout == MLUOP_LAYOUT_NHWC || layout == MLUOP_LAYOUT_NDHWC ||
layout == MLUOP_LAYOUT_NLC) {
convertShapeAndStride(shape_info, stride_info);
} else if (layout == MLUOP_LAYOUT_HWCN) {
auto convertDepthWiseConvShapeStride = [](const std::vector<int64_t>& vec,
std::vector<int>& target_vec,
std::vector<int>& stride_vec) {
// NCHW --> HWCN
target_vec[0] = static_cast<int>(vec[2]);
target_vec[1] = static_cast<int>(vec[3]);
target_vec[2] = static_cast<int>(vec[1]);
target_vec[3] = static_cast<int>(vec[0]);
// Calculate Stride just like contiguous of HWCN.
stride_vec[3] = 1;
stride_vec[2] = target_vec[3] * stride_vec[3];
stride_vec[1] = target_vec[2] * stride_vec[2];
stride_vec[0] = target_vec[1] * stride_vec[1];
};
convertDepthWiseConvShapeStride(t.sizes().vec(), shape_info, stride_info);
}
TORCH_CHECK(mluOpSetTensorDescriptorEx(
desc_, layout, data_type, t_dim, shape_info.data(),
stride_info.data()) == MLUOP_STATUS_SUCCESS,
"mluOpSetTensorDescriptorEx execution failed.");
}
void MluOpTensorDescriptor::set_desc(const at::Tensor& t,
mluOpTensorLayout_t layout,
mluOpDataType_t dtype,
std::vector<int>& dims) {
int dimNb = dims.size();
mluOpSetTensorDescriptor(desc_, layout, dtype, dimNb, dims.data());
}
// Handles
std::once_flag mmcv_mluop_init_flag;
std::mutex mmcv_mluop_mutex;
static std::vector<MluOpHandle> mmcv_mluop_handles;
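// One cached mluOp handle per device; the current queue is re-bound to the
// handle on every lookup.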
mluOpHandle_t mluOpGetCurrentHandle(c10::DeviceIndex device_index) {
std::call_once(mmcv_mluop_init_flag,
[]() // Init mmcv_mluop_handles 1-device <-> 1-handle
{
c10::DeviceIndex num_devices = torch_mlu::device_count();
mmcv_mluop_handles.resize(num_devices);
});
if (device_index == -1) {
device_index = torch_mlu::current_device();
}
std::lock_guard<std::mutex> mmcv_mluop_guard(mmcv_mluop_mutex);
auto queue = torch_mlu::getCurrentQueue(device_index).queue();
mmcv_mluop_handles[device_index].setQueue(queue);
return mmcv_mluop_handles[device_index].handle;
}
/*************************************************************************
* Copyright (C) 2022 Cambricon.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*************************************************************************/
#pragma once
#include <ATen/ATen.h>
#include <c10/core/ScalarType.h>
#include "aten.h"
#include "mlu_op.h"
#include "pytorch_device_registry.hpp"
#define MLUOP_MAJOR 0
#define MLUOP_MINOR 5
#define MLUOP_PATCHLEVEL 302
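// Editorial note: the mlu-ops version these bindings expect (0.5.302 here);
// per the commit history above, mmcv checks the fetched mlu-ops library
// against this version at build time.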
mluOpDataType_t getMluOpDataType(const caffe2::TypeMeta& data_type);
mluOpTensorLayout_t getMluOpSuggestLayout(const at::Tensor& input);
class MluOpTensorDescriptor {
public:
MluOpTensorDescriptor() { mluOpCreateTensorDescriptor(&desc_); };
~MluOpTensorDescriptor() { mluOpDestroyTensorDescriptor(desc_); }
void set(at::Tensor);
void set_with_layout(at::Tensor, mluOpTensorLayout_t layout);
mluOpTensorDescriptor_t desc() { return desc_; }
private:
mluOpTensorDescriptor_t desc_;
void set_desc(const at::Tensor&, mluOpTensorLayout_t, mluOpDataType_t,
std::vector<int>& dims);
};
mluOpHandle_t mluOpGetCurrentHandle(c10::DeviceIndex device_index = -1);
class MluOpHandle {
public:
MluOpHandle() : handle(nullptr) { mluOpCreate(&handle); }
~MluOpHandle() {
if (handle) {
mluOpDestroy(handle);
handle = nullptr;
}
}
void setQueue(cnrtQueue_t queue) { mluOpSetQueue(handle, queue); }
mluOpHandle_t handle;
};
// Reorder tensor sizes and strides from channels_first to channels_last
// (or channels_last_3d). Note this is not the same as PyTorch's logical
// layout: the resulting order reflects how the data is actually stored.
// example: modify channels_last tensor dim to nhwc tensor desc.
// N C H W --> N H W C
// C*H*W 1 W C --> C*H*W W C 1
template <typename T>
void convertShapeAndStride(std::vector<T>& shape_info,
std::vector<T>& stride_info) {
TORCH_MLU_CHECK(shape_info.size() == stride_info.size(),
"shape size must equal stride size.");
const int dim = shape_info.size();
std::vector<T> temp_shape_info(dim);
std::vector<T> temp_stride_info(dim);
temp_shape_info[0] = shape_info[0];
temp_stride_info[0] = stride_info[0];
for (size_t i = 0; i < dim - 1; ++i) {
const int index = (i + 1) % (dim - 1) + 1;
temp_shape_info[i + 1] = shape_info[index];
temp_stride_info[i + 1] = stride_info[index];
}
shape_info.assign(temp_shape_info.begin(), temp_shape_info.end());
stride_info.assign(temp_stride_info.begin(), temp_stride_info.end());
}
// torch tensors provide int64_t shapes and strides, but the mlu-ops
// descriptor requires int32. Use this function to cast safely, or report
// an error when a value exceeds the int32 upper bound.
template <typename DST_T, typename SRC_T>
std::vector<DST_T> checkUpperBoundAndCastTo(const std::vector<SRC_T>& input) {
std::vector<DST_T> output;
output.reserve(input.size());
for (const auto& val : input) {
if (val > std::numeric_limits<DST_T>::max()) {
TORCH_MLU_CHECK(false, "Requires dim size not greater than ",
std::numeric_limits<DST_T>::max(), ". But got ", val,
".");
}
output.push_back(static_cast<DST_T>(val));
}
return output;
}
......@@ -14,7 +14,15 @@
#define MIN(a, b) (((a) < (b)) ? (a) : (b))
void KernelMsDeformAttnForward(
typedef enum {
MS_DEFORM_ATTN_FORWARD_INVALID = 0, /*!< Index is invalid. */
MS_DEFORM_ATTN_FORWARD_DEFAULT =
1, /*!< MLUKernelMsDeformAttnForwardDefault */
MS_DEFORM_ATTN_FORWARD_SMALL_CHANNEL =
2, /*!< MLUKernelMsDeformAttnForwardSmallChannel */
} MsDeformAttnForwardPolicy;
void KernelMsDeformAttnForwardDefault(
cnrtDim3_t k_dim, cnrtFunctionType_t k_type, cnrtQueue_t queue,
const cnrtDataType_t d_type, const char* data_value_gdram,
const char* data_spatial_shapes_gdram,
......@@ -23,7 +31,37 @@ void KernelMsDeformAttnForward(
const int32_t batch_size, const int32_t num_keys, const int32_t num_heads,
const int32_t channels, const int32_t num_levels, const int32_t num_queries,
const int32_t num_points, char* data_col_gdram);
void KernelMsDeformAttnBackward(
void KernelMsDeformAttnForwardSmallChannel(
cnrtDim3_t k_dim, cnrtFunctionType_t k_type, cnrtQueue_t queue,
const cnrtDataType_t d_type, const char* data_value_gdram,
const char* data_spatial_shapes_gdram,
const char* data_level_start_index_gdram,
const char* data_sampling_loc_gdram, const char* data_attn_weight_gdram,
const int32_t batch_size, const int32_t num_keys, const int32_t num_heads,
const int32_t channels, const int32_t num_levels, const int32_t num_queries,
const int32_t num_points, char* data_col_gdram);
typedef enum {
MS_DEFORM_ATTN_BACKWARD_DEFAULT = 0,
MS_DEFORM_ATTN_BACKWARD_SMALL_CHANNEL = 1,
} MsDeformAttnBackwardKernelPolicy;
MsDeformAttnBackwardKernelPolicy msDeformAttnBackwardPolicyFunc(
const int32_t channels, const int32_t num_levels, const int32_t num_points,
const int32_t num_heads) {
const int32_t nram_size = torch_mlu::getDeviceAttr(cnrtAttrNramSizePerMcore);
const int num_hlp = num_heads * num_levels * num_points;
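// Editorial note: a rough NRAM capacity estimate. After reserving space for
// the per-level shape (float) and index (3 x int32) buffers, compute how
// many padded (head, level, point) groups fit per core; the small-channel
// kernel is chosen only if at least one group fits.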
int num_per_time_theory = (nram_size - num_levels * sizeof(float) -
3 * num_levels * sizeof(int32_t)) /
sizeof(float) / (8 * PAD_UP(channels, 32) + 28) /
PAD_UP((num_hlp), 32);
if (num_per_time_theory >= 1) {
return MS_DEFORM_ATTN_BACKWARD_SMALL_CHANNEL;
}
return MS_DEFORM_ATTN_BACKWARD_DEFAULT;
}
void KernelMsDeformAttnBackwardDefaultKernel(
cnrtDim3_t k_dim, cnrtFunctionType_t k_type, cnrtQueue_t queue,
const cnrtDataType_t d_type, const float* data_value,
const int32_t* spatial_shapes, const int32_t* data_level_start_index,
......@@ -32,10 +70,23 @@ void KernelMsDeformAttnBackward(
const int32_t num_heads, const int32_t channels, const int32_t num_levels,
const int32_t num_queries, const int32_t num_points, float* grad_value,
float* grad_sampling_loc, float* grad_attn_weight);
void KernelMsDeformAttnBackwardSmallChannelsKernel(
cnrtDim3_t k_dim, cnrtFunctionType_t k_type, cnrtQueue_t queue,
const cnrtDataType_t d_type, const float* data_value,
const int32_t* spatial_shapes, const int32_t* data_level_start_index,
const float* data_sampling_loc, const float* data_attn_weight,
const float* grad_output, const int32_t batch, const int32_t spatial_size,
const int32_t num_heads, const int32_t channels, const int32_t num_levels,
const int32_t num_query, const int32_t num_points, float* grad_value,
float* grad_sampling_loc, float* grad_attn_weight);
// policy function
static void policyFuncForward(cnrtDim3_t* k_dim, cnrtFunctionType_t* k_type,
const int batch_size, const int num_queries,
const int num_heads) {
MsDeformAttnForwardPolicy msDeformAttnForwardPolicyFunc(
cnrtDim3_t* k_dim, cnrtFunctionType_t* k_type, const int32_t batch_size,
const int32_t num_keys, const int32_t num_heads, const int32_t channels,
const int32_t num_levels, const int32_t num_queries,
const int32_t num_points) {
k_dim->x = torch_mlu::getDeviceAttr(cnrtAttrMcorePerCluster);
k_dim->y =
MIN((batch_size * num_queries * num_heads + k_dim->x - 1) / k_dim->x,
......@@ -46,6 +97,16 @@ static void policyFuncForward(cnrtDim3_t* k_dim, cnrtFunctionType_t* k_type,
#else
*k_type = CNRT_FUNC_TYPE_UNION1;
#endif
int32_t nram_size = torch_mlu::getDeviceAttr(cnrtAttrNramSizePerMcore);
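// Editorial note: fall back to the default kernel when the per-point level
// metadata cannot fit in NRAM, or when channels is outside the range the
// small-channel kernel handles (roughly 16 to 96, further bounded by NRAM).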
if (num_levels * num_points * 3 * sizeof(int32_t) > nram_size) {
return MS_DEFORM_ATTN_FORWARD_DEFAULT;
} else if (channels > nram_size / 12 / sizeof(float) || channels > 96 ||
channels < 16) {
return MS_DEFORM_ATTN_FORWARD_DEFAULT;
} else {
return MS_DEFORM_ATTN_FORWARD_SMALL_CHANNEL;
}
}
// policy function for backward
......@@ -196,7 +257,9 @@ Tensor ms_deform_attn_mlu_forward(const Tensor& value,
// calculate task dimension
cnrtDim3_t k_dim;
cnrtFunctionType_t k_type;
policyFuncForward(&k_dim, &k_type, batch_size, num_queries, num_heads);
MsDeformAttnForwardPolicy policy = msDeformAttnForwardPolicyFunc(
&k_dim, &k_type, batch_size, num_keys, num_heads, channels, num_levels,
num_queries, num_points);
// get compute queue
auto queue = torch_mlu::getCurQueue();
......@@ -222,15 +285,33 @@ Tensor ms_deform_attn_mlu_forward(const Tensor& value,
cnrtDataType_t data_type = torch_mlu::toCnrtDtype(value.dtype());
// launch kernel
CNLOG(INFO) << "Launch Kernel MLUKernelMsDeformAttnForward<<<" << k_dim.x
<< ", " << k_dim.y << ", " << k_dim.z << ">>>";
KernelMsDeformAttnForward(
switch (policy) {
default: {
VLOG(5) << "MsDeformAttnForward Policy not supported";
} break;
case MS_DEFORM_ATTN_FORWARD_DEFAULT: {
CNLOG(INFO) << "Launch Kernel MLUKernelMsDeformAttnForwardDefault<<<"
<< k_dim.x << ", " << k_dim.y << ", " << k_dim.z << ">>>";
KernelMsDeformAttnForwardDefault(
k_dim, k_type, queue, data_type, (char*)value_ptr,
(char*)spatial_shapes_ptr, (char*)level_start_index_ptr,
(char*)sampling_loc_ptr, (char*)attn_weight_ptr, batch_size, num_keys,
num_heads, channels, num_levels, num_queries, num_points,
(char*)output_ptr);
break;
}
case MS_DEFORM_ATTN_FORWARD_SMALL_CHANNEL: {
CNLOG(INFO) << "Launch Kernel MLUKernelMsDeformAttnForwardSmallChannel<<<"
<< k_dim.x << ", " << k_dim.y << ", " << k_dim.z << ">>>";
KernelMsDeformAttnForwardSmallChannel(
k_dim, k_type, queue, data_type, (char*)value_ptr,
(char*)spatial_shapes_ptr, (char*)level_start_index_ptr,
(char*)sampling_loc_ptr, (char*)attn_weight_ptr, batch_size, num_keys,
num_heads, channels, num_levels, num_queries, num_points,
(char*)output_ptr);
break;
}
}
output = output.view({batch_size, num_queries, num_heads * channels});
return output;
......@@ -391,14 +472,32 @@ void ms_deform_attn_mlu_backward(
// launch kernel
CNLOG(INFO) << "Launch Kernel MLUKernelMsDeformAttnBackward<<<" << k_dim.x
<< ", " << k_dim.y << ", " << k_dim.z << ">>>";
KernelMsDeformAttnBackward(
MsDeformAttnBackwardKernelPolicy kernelPolicy =
msDeformAttnBackwardPolicyFunc(channels, num_levels, num_points,
num_heads);
switch (kernelPolicy) {
default: {
VLOG(5) << "NotImplemented.";
} break;
case MS_DEFORM_ATTN_BACKWARD_DEFAULT: {
KernelMsDeformAttnBackwardDefaultKernel(
k_dim, k_type, queue, data_type, (float*)value_ptr,
(int32_t*)spatial_shapes_ptr, (int32_t*)level_start_index_ptr,
(float*)sampling_loc_ptr, (float*)attn_weight_ptr,
(float*)grad_output_ptr, batch_size, num_keys, num_heads, channels,
num_levels, num_queries, num_points, (float*)grad_value_ptr,
(float*)grad_sampling_loc_ptr, (float*)grad_attn_weight_ptr);
} break;
case MS_DEFORM_ATTN_BACKWARD_SMALL_CHANNEL: {
KernelMsDeformAttnBackwardSmallChannelsKernel(
k_dim, k_type, queue, data_type, (float*)value_ptr,
(int32_t*)spatial_shapes_ptr, (int32_t*)level_start_index_ptr,
(float*)sampling_loc_ptr, (float*)attn_weight_ptr,
(float*)grad_output_ptr, batch_size, num_keys, num_heads, channels,
num_levels, num_queries, num_points, (float*)grad_value_ptr,
(float*)grad_sampling_loc_ptr, (float*)grad_attn_weight_ptr);
} break;
}
}
Tensor ms_deform_attn_impl_forward(const Tensor& value,
......
/*************************************************************************
* Copyright (C) 2021 Cambricon.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*************************************************************************/
#include "mlu_common_helper.h"
Tensor nms_rotated_mlu(Tensor boxes, Tensor scores, float iou_threshold) {
if (boxes.numel() == 0) {
return at::empty({0}, boxes.options().dtype(at::kLong));
}
int boxes_num = boxes.size(0);
auto boxes_ = torch_mlu::cnnl::ops::cnnl_contiguous(boxes);
auto scores_ = torch_mlu::cnnl::ops::cnnl_contiguous(scores);
auto output = at::empty({boxes_num}, boxes.options().dtype(at::kInt));
auto output_size = at::empty({1}, scores.options().dtype(at::kInt));
MluOpTensorDescriptor boxes_desc, scores_desc, output_desc;
boxes_desc.set(boxes_);
scores_desc.set(scores_);
output_desc.set(output);
// workspace
size_t workspace_size = 0;
auto handle = mluOpGetCurrentHandle();
mluOpGetNmsRotatedWorkspaceSize(handle, boxes_desc.desc(), &workspace_size);
auto workspace = at::empty(workspace_size, boxes.options().dtype(at::kByte));
auto boxes_impl = torch_mlu::getMluTensorImpl(boxes_);
auto boxes_ptr = boxes_impl->cnnlMalloc();
auto scores_impl = torch_mlu::getMluTensorImpl(scores_);
auto scores_ptr = scores_impl->cnnlMalloc();
auto workspace_impl = torch_mlu::getMluTensorImpl(workspace);
auto workspace_ptr = workspace_impl->cnnlMalloc();
auto output_impl = torch_mlu::getMluTensorImpl(output);
auto output_ptr = output_impl->cnnlMalloc();
auto output_size_impl = torch_mlu::getMluTensorImpl(output_size);
auto output_size_ptr = output_size_impl->cnnlMalloc();
mluOpNmsRotated(handle, iou_threshold, boxes_desc.desc(), boxes_ptr,
scores_desc.desc(), scores_ptr, workspace_ptr, workspace_size,
output_desc.desc(), output_ptr, (int *)output_size_ptr);
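// Editorial note: reading output_size via .cpu() below implies a blocking
// device-to-host copy, so the kernel result is ready before output_num is
// used.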
int output_num = *static_cast<int *>(output_size.cpu().data_ptr());
auto ret = output.to(boxes.options().dtype(at::kLong));
return ret.slice(0, 0, output_num);
}
/*************************************************************************
* Copyright (C) 2022 Cambricon.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*************************************************************************/
#include <torch/script.h>
#include <vector>
#include "mlu_common_helper.h"
#include "pytorch_device_registry.hpp"
#include "pytorch_mlu_helper.hpp"
template <unsigned NDim>
std::vector<torch::Tensor> GetIndicePairsForwardMLUKernelLauncher(
torch::Tensor indices, int64_t batchSize,
std::vector<int64_t> outSpatialShape, std::vector<int64_t> spatialShape,
std::vector<int64_t> kernelSize, std::vector<int64_t> stride,
std::vector<int64_t> padding, std::vector<int64_t> dilation,
std::vector<int64_t> outPadding, int64_t _subM, int64_t _transpose) {
// The following code is copied from
// mmcv/ops/csrc/pytorch/cuda/spconv_ops_cuda.cu to ensure the output is
// available for network training. The outputs of this function have the
// correct shape but wrong values.
auto numAct = indices.size(0);
auto kernelVolume = kernelSize[0];
int sub_m = (int)_subM;
int transpose = (int)_transpose;
int batch = (int)batchSize;
auto coorDim = indices.size(1) - 1;
for (int i = 1; i < kernelSize.size(); ++i) {
kernelVolume *= kernelSize[i];
}
auto outputVolume = outSpatialShape[0];
for (int i = 1; i < outSpatialShape.size(); ++i) {
outputVolume *= outSpatialShape[i];
}
torch::Tensor indicePairs = at::full({kernelVolume, 2, numAct}, -1,
indices.options().dtype(at::kInt));
torch::Tensor indiceNum =
at::zeros({kernelVolume}, indices.options().dtype(at::kInt));
int out_size = sub_m == 1
? numAct
: std::min(numAct * kernelVolume, batch * outputVolume);
torch::Tensor out_indices =
at::zeros({out_size, coorDim + 1}, indices.options().dtype(at::kInt));
auto indices_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
indices, at::MemoryFormat::Contiguous);
auto indicePairs_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
indicePairs, at::MemoryFormat::Contiguous);
auto indiceNum_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
indiceNum, at::MemoryFormat::Contiguous);
auto out_indices_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
out_indices, at::MemoryFormat::Contiguous);
std::vector<int> input_space;
std::vector<int> filter_space;
std::vector<int> output_space;
std::vector<int> padding32;
std::vector<int> stride32;
std::vector<int> dilation32;
for (int i = 0; i < NDim; i++) {
input_space.push_back(spatialShape[i]);
filter_space.push_back(kernelSize[i]);
output_space.push_back(outSpatialShape[i]);
padding32.push_back(padding[i]);
stride32.push_back(stride[i]);
dilation32.push_back(dilation[i]);
}
MluOpTensorDescriptor indices_desc, out_indices_desc, indicePairs_desc,
indiceNum_desc;
indices_desc.set(indices_contiguous);
indicePairs_desc.set(indicePairs_contiguous);
indiceNum_desc.set(indiceNum_contiguous);
out_indices_desc.set(out_indices_contiguous);
{
mluOpTensorLayout_t layout = MLUOP_LAYOUT_ARRAY;
mluOpDataType_t dtype = MLUOP_DTYPE_INT32;
std::vector<int> dims;
dims = {numAct, coorDim + 1};
mluOpSetTensorDescriptor(indices_desc.desc(), layout, dtype, dims.size(),
dims.data());
dims = {kernelVolume, 2, numAct};
mluOpSetTensorDescriptor(indicePairs_desc.desc(), layout, dtype,
dims.size(), dims.data());
dims = {kernelVolume};
mluOpSetTensorDescriptor(indiceNum_desc.desc(), layout, dtype, dims.size(),
dims.data());
dims = {out_size, coorDim + 1};
mluOpSetTensorDescriptor(out_indices_desc.desc(), layout, dtype,
dims.size(), dims.data());
}
mluOpSparseConvolutionDescriptor_t sparse_conv_desc;
mluOpCreateSparseConvolutionDescriptor(&sparse_conv_desc);
mluOpSetSparseConvolutionDescriptor(
sparse_conv_desc, NDim + 2, batch, padding32.data(), stride32.data(),
dilation32.data(), input_space.data(), filter_space.data(),
output_space.data(), sub_m, transpose, 0);
auto handle = mluOpGetCurrentHandle();
size_t workspace_size = 0;
mluOpGetIndicePairsWorkspaceSize(
handle, sparse_conv_desc, indices_desc.desc(), indicePairs_desc.desc(),
out_indices_desc.desc(), indiceNum_desc.desc(), &workspace_size);
auto indice_workspace_size =
at::empty(workspace_size, indices.options().dtype(at::kByte));
auto indices_impl = torch_mlu::getMluTensorImpl(indices_contiguous);
auto out_indices_impl = torch_mlu::getMluTensorImpl(out_indices_contiguous);
auto indicePairs_impl = torch_mlu::getMluTensorImpl(indicePairs_contiguous);
auto indiceNum_impl = torch_mlu::getMluTensorImpl(indiceNum_contiguous);
auto indice_workspace_impl =
torch_mlu::getMluTensorImpl(indice_workspace_size);
auto indices_ptr = indices_impl->cnnlMalloc();
auto out_indices_ptr = out_indices_impl->cnnlMalloc();
auto indicePairs_ptr = indicePairs_impl->cnnlMalloc();
auto indiceNum_ptr = indiceNum_impl->cnnlMalloc();
auto indice_workspace_ptr = indice_workspace_impl->cnnlMalloc();
mluOpGetIndicePairs(handle, sparse_conv_desc, indices_desc.desc(),
indices_ptr, indice_workspace_ptr, workspace_size,
indicePairs_desc.desc(), indicePairs_ptr,
out_indices_desc.desc(), out_indices_ptr,
indiceNum_desc.desc(), indiceNum_ptr);
int num_act_out = 0;
mluOpGetSparseConvolutionNumActOut(sparse_conv_desc, &num_act_out);
mluOpDestroySparseConvolutionDescriptor(sparse_conv_desc);
if (!sub_m) {
return {out_indices.slice(0, 0, num_act_out), indicePairs, indiceNum};
} else {
return {indices, indicePairs, indiceNum};
}
}
torch::Tensor IndiceConvForwardMLUKernelLauncher(
torch::Tensor features, torch::Tensor filters, torch::Tensor indicePairs,
torch::Tensor indiceNum, int64_t numActOut, int64_t _inverse,
int64_t _subM) {
auto indice_num_cpu = indiceNum.to({torch::kCPU});
auto indice_num_cpu_64 = indice_num_cpu.data_ptr<int>();
int indice_num_len = indiceNum.numel();
int64_t indice_num[indice_num_len];
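// Editorial note: this variable-length array is a compiler extension, not
// standard C++; std::vector<int64_t> would be the portable alternative.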
for (int i = 0; i < indice_num_len; ++i) {
indice_num[i] = (int64_t)(((int *)indice_num_cpu_64)[i]);
}
// generate empty output
int C = filters.dim() == 4 ? filters.size(3) : filters.size(4);
torch::Tensor output =
at::zeros({numActOut, C}, features.options().dtype(at::kFloat));
// generate descriptor
auto features_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
features, at::MemoryFormat::Contiguous);
auto filters_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
filters, at::MemoryFormat::Contiguous);
auto indice_pairs_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
indicePairs, at::MemoryFormat::Contiguous);
auto output_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
output, at::MemoryFormat::Contiguous);
MluOpTensorDescriptor features_desc, filters_desc, indice_pairs_desc,
output_desc;
features_desc.set(features_contiguous);
filters_desc.set(filters_contiguous);
indice_pairs_desc.set(indice_pairs_contiguous);
output_desc.set(output_contiguous);
// set layout
{
mluOpTensorLayout_t layout;
mluOpDataType_t dtype;
int dim;
int dims[8];
// features_desc
mluOpGetTensorDescriptor(features_desc.desc(), &layout, &dtype, &dim, dims);
mluOpSetTensorDescriptor(features_desc.desc(), MLUOP_LAYOUT_ARRAY, dtype,
dim, dims);
// filters_desc
mluOpGetTensorDescriptor(filters_desc.desc(), &layout, &dtype, &dim, dims);
mluOpSetTensorDescriptor(filters_desc.desc(), MLUOP_LAYOUT_ARRAY, dtype,
dim, dims);
// indice_pairs_desc
mluOpGetTensorDescriptor(indice_pairs_desc.desc(), &layout, &dtype, &dim,
dims);
mluOpSetTensorDescriptor(indice_pairs_desc.desc(), MLUOP_LAYOUT_ARRAY,
dtype, dim, dims);
// output_desc
mluOpGetTensorDescriptor(output_desc.desc(), &layout, &dtype, &dim, dims);
mluOpSetTensorDescriptor(output_desc.desc(), MLUOP_LAYOUT_ARRAY, dtype, dim,
dims);
}
auto handle = mluOpGetCurrentHandle();
size_t workspace_size = 0;
mluOpGetIndiceConvolutionForwardWorkspaceSize(
handle, features_desc.desc(), filters_desc.desc(),
indice_pairs_desc.desc(), output_desc.desc(), indice_num, numActOut,
_inverse, _subM, &workspace_size);
auto workspace =
at::empty(workspace_size, features.options().dtype(at::kByte));
auto features_impl = torch_mlu::getMluTensorImpl(features_contiguous);
auto filters_impl = torch_mlu::getMluTensorImpl(filters_contiguous);
auto indice_pairs_impl = torch_mlu::getMluTensorImpl(indice_pairs_contiguous);
auto workspace_impl = torch_mlu::getMluTensorImpl(workspace);
auto features_ptr = features_impl->cnnlMalloc();
auto filters_ptr = filters_impl->cnnlMalloc();
auto indice_pairs_ptr = indice_pairs_impl->cnnlMalloc();
auto workspace_ptr = workspace_impl->cnnlMalloc();
// outputs
auto output_impl = torch_mlu::getMluTensorImpl(output);
auto output_ptr = output_impl->cnnlMalloc();
mluOpIndiceConvolutionForward(
handle, features_desc.desc(), features_ptr, filters_desc.desc(),
filters_ptr, indice_pairs_desc.desc(), indice_pairs_ptr, indice_num,
numActOut, _inverse, _subM, workspace_ptr, workspace_size,
output_desc.desc(), output_ptr);
return output;
}
std::vector<torch::Tensor> IndiceConvBackwardMLUKernelLauncher(
torch::Tensor features, torch::Tensor filters, torch::Tensor outGrad,
torch::Tensor indicePairs, torch::Tensor indiceNum, int64_t _inverse,
int64_t _subM) {
auto indice_num_cpu = indiceNum.to({torch::kCPU});
auto indice_num_cpu_64 = indice_num_cpu.data_ptr<int>();
int indice_num_len = indiceNum.numel();
int64_t indice_num[indice_num_len];
for (int i = 0; i < indice_num_len; ++i) {
indice_num[i] = (int64_t)(((int *)(indice_num_cpu_64))[i]);
}
// generate empty input_grad
torch::Tensor input_grad = at::zeros({features.size(0), features.size(1)},
features.options().dtype(at::kFloat));
torch::Tensor filters_grad;
if (filters.dim() == 4) {
int h = filters.size(0);
int w = filters.size(1);
int c = filters.size(2);
int n = filters.size(3);
filters_grad = at::zeros({h, w, c, n}, filters.options().dtype(at::kFloat));
} else if (filters.dim() == 5) {
int d = filters.size(0);
int h = filters.size(1);
int w = filters.size(2);
int c = filters.size(3);
int n = filters.size(4);
filters_grad =
at::zeros({d, h, w, c, n}, filters.options().dtype(at::kFloat));
}
auto features_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
features, at::MemoryFormat::Contiguous);
auto filters_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
filters, at::MemoryFormat::Contiguous);
auto output_grad_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
outGrad, at::MemoryFormat::Contiguous);
auto indice_pairs_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
indicePairs, at::MemoryFormat::Contiguous);
auto input_grad_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
features, at::MemoryFormat::Contiguous);
auto filters_grad_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
filters, at::MemoryFormat::Contiguous);
MluOpTensorDescriptor features_desc, output_grad_desc, filters_desc,
indice_pairs_desc, input_grad_desc, filters_grad_desc;
features_desc.set(features_contiguous);
filters_desc.set(filters_contiguous);
output_grad_desc.set(output_grad_contiguous);
indice_pairs_desc.set(indice_pairs_contiguous);
input_grad_desc.set(input_grad_contiguous);
filters_grad_desc.set(filters_grad_contiguous);
// need to set desc layout with mluOp functions
{
mluOpTensorLayout_t layout;
mluOpDataType_t dtype;
int dim;
int dims[8];
// features_desc
mluOpGetTensorDescriptor(features_desc.desc(), &layout, &dtype, &dim, dims);
mluOpSetTensorDescriptor(features_desc.desc(), MLUOP_LAYOUT_ARRAY, dtype,
dim, dims);
// filters_desc
mluOpGetTensorDescriptor(filters_desc.desc(), &layout, &dtype, &dim, dims);
if (dim == 4) {
mluOpSetTensorDescriptor(filters_desc.desc(), MLUOP_LAYOUT_HWCN, dtype,
dim, dims);
} else {
mluOpSetTensorDescriptor(filters_desc.desc(), MLUOP_LAYOUT_ARRAY, dtype,
dim, dims);
}
// output_grad_desc
mluOpGetTensorDescriptor(output_grad_desc.desc(), &layout, &dtype, &dim,
dims);
mluOpSetTensorDescriptor(output_grad_desc.desc(), MLUOP_LAYOUT_ARRAY, dtype,
dim, dims);
// indice_pairs_desc
mluOpGetTensorDescriptor(indice_pairs_desc.desc(), &layout, &dtype, &dim,
dims);
mluOpSetTensorDescriptor(indice_pairs_desc.desc(), MLUOP_LAYOUT_ARRAY,
dtype, dim, dims);
// input_grad_desc
mluOpGetTensorDescriptor(input_grad_desc.desc(), &layout, &dtype, &dim,
dims);
mluOpSetTensorDescriptor(input_grad_desc.desc(), MLUOP_LAYOUT_ARRAY, dtype,
dim, dims);
// filters_grad_desc: mirror the filters layout (HWCN for 4-D filters, ARRAY
// otherwise) so the filter-gradient op sees the same data layout
mluOpGetTensorDescriptor(filters_grad_desc.desc(), &layout, &dtype, &dim,
dims);
if (dim == 4) {
mluOpSetTensorDescriptor(filters_grad_desc.desc(), MLUOP_LAYOUT_HWCN, dtype,
dim, dims);
} else {
mluOpSetTensorDescriptor(filters_grad_desc.desc(), MLUOP_LAYOUT_ARRAY, dtype,
dim, dims);
}
}
auto handle = mluOpGetCurrentHandle();
size_t data_workspace_size = 0;
mluOpGetIndiceConvolutionBackwardDataWorkspaceSize(
handle, output_grad_desc.desc(), filters_desc.desc(),
indice_pairs_desc.desc(), input_grad_desc.desc(), indice_num, _inverse,
&data_workspace_size);
size_t filters_workspace_size = 0;
mluOpGetIndiceConvolutionBackwardFilterWorkspaceSize(
handle, features_desc.desc(), output_grad_desc.desc(),
indice_pairs_desc.desc(), filters_grad_desc.desc(), indice_num, _inverse,
_subM, &filters_workspace_size);
auto indice_convbpdata_workspace =
at::empty(data_workspace_size, features.options().dtype(at::kByte));
auto indice_convbpfilter_workspace =
at::empty(filters_workspace_size, filters.options().dtype(at::kByte));
auto features_impl = torch_mlu::getMluTensorImpl(features_contiguous);
auto filters_impl = torch_mlu::getMluTensorImpl(filters_contiguous);
auto output_grad_impl = torch_mlu::getMluTensorImpl(output_grad_contiguous);
auto indice_pairs_impl = torch_mlu::getMluTensorImpl(indice_pairs_contiguous);
auto indice_convbpdata_workspace_impl =
torch_mlu::getMluTensorImpl(indice_convbpdata_workspace);
auto indice_convbpfilter_workspace_impl =
torch_mlu::getMluTensorImpl(indice_convbpfilter_workspace);
auto features_ptr = features_impl->cnnlMalloc();
auto filters_ptr = filters_impl->cnnlMalloc();
auto output_grad_ptr = output_grad_impl->cnnlMalloc();
auto indice_pairs_ptr = indice_pairs_impl->cnnlMalloc();
auto indice_convbpdata_workspace_ptr =
indice_convbpdata_workspace_impl->cnnlMalloc();
auto indice_convbpfilter_workspace_ptr =
indice_convbpfilter_workspace_impl->cnnlMalloc();
// outputs
auto input_grad_impl = torch_mlu::getMluTensorImpl(input_grad);
auto input_grad_ptr = input_grad_impl->cnnlMalloc();
auto filters_grad_impl = torch_mlu::getMluTensorImpl(filters_grad);
auto filters_grad_ptr = filters_grad_impl->cnnlMalloc();
mluOpIndiceConvolutionBackwardData(
handle, output_grad_desc.desc(), output_grad_ptr, filters_desc.desc(),
filters_ptr, indice_pairs_desc.desc(), indice_pairs_ptr, indice_num,
_inverse, _subM, indice_convbpdata_workspace_ptr, data_workspace_size,
input_grad_desc.desc(), input_grad_ptr);
mluOpIndiceConvolutionBackwardFilter(
handle, features_desc.desc(), features_ptr, output_grad_desc.desc(),
output_grad_ptr, indice_pairs_desc.desc(), indice_pairs_ptr, indice_num,
_inverse, _subM, indice_convbpfilter_workspace_ptr,
filters_workspace_size, filters_grad_desc.desc(), filters_grad_ptr);
std::vector<torch::Tensor> result;
result.push_back(input_grad);
result.push_back(filters_grad);
return result;
}
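// Note: the two backward ops above are independent of each other: input_grad
// is computed from (output_grad, filters) and filters_grad from
// (features, output_grad), which is why each call gets its own workspace.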
torch::Tensor indice_conv_forward_mlu(torch::Tensor features,
torch::Tensor filters,
torch::Tensor indicePairs,
torch::Tensor indiceNum,
int64_t numActOut, int64_t _inverse,
int64_t _subM) {
return IndiceConvForwardMLUKernelLauncher(
features, filters, indicePairs, indiceNum, numActOut, _inverse, _subM);
}
std::vector<torch::Tensor> indice_conv_backward_mlu(
torch::Tensor features, torch::Tensor filters, torch::Tensor outGrad,
torch::Tensor indicePairs, torch::Tensor indiceNum, int64_t _inverse,
int64_t _subM) {
return IndiceConvBackwardMLUKernelLauncher(
features, filters, outGrad, indicePairs, indiceNum, _inverse, _subM);
}
torch::Tensor indice_conv_forward_impl(torch::Tensor features,
torch::Tensor filters,
torch::Tensor indicePairs,
torch::Tensor indiceNum,
int64_t numActOut, int64_t _inverse,
int64_t _subM);
std::vector<torch::Tensor> indice_conv_backward_impl(
torch::Tensor features, torch::Tensor filters, torch::Tensor outGrad,
torch::Tensor indicePairs, torch::Tensor indiceNum, int64_t _inverse,
int64_t _subM);
REGISTER_DEVICE_IMPL(indice_conv_forward_impl, MLU, indice_conv_forward_mlu);
REGISTER_DEVICE_IMPL(indice_conv_backward_impl, MLU, indice_conv_backward_mlu);
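// Usage sketch (illustrative, not part of this file; module names assumed
// from mmcv's Python sparse-conv wrappers): once mmcv is built with MLU
// support, the generic entry points dispatch here automatically, e.g.
//   from mmcv.ops import SparseConvTensor, SparseConv3d
//   x = SparseConvTensor(feats.to('mlu'), coords.to('mlu'), spatial_shape, 1)
//   y = SparseConv3d(16, 32, kernel_size=3).to('mlu')(x)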
template std::vector<torch::Tensor> GetIndicePairsForwardMLUKernelLauncher<2>(
torch::Tensor indices, int64_t batchSize,
std::vector<int64_t> outSpatialShape, std::vector<int64_t> spatialShape,
std::vector<int64_t> kernelSize, std::vector<int64_t> stride,
std::vector<int64_t> padding, std::vector<int64_t> dilation,
std::vector<int64_t> outPadding, int64_t _subM, int64_t _transpose);
template std::vector<torch::Tensor> GetIndicePairsForwardMLUKernelLauncher<3>(
torch::Tensor indices, int64_t batchSize,
std::vector<int64_t> outSpatialShape, std::vector<int64_t> spatialShape,
std::vector<int64_t> kernelSize, std::vector<int64_t> stride,
std::vector<int64_t> padding, std::vector<int64_t> dilation,
std::vector<int64_t> outPadding, int64_t _subM, int64_t _transpose);
template std::vector<torch::Tensor> GetIndicePairsForwardMLUKernelLauncher<4>(
torch::Tensor indices, int64_t batchSize,
std::vector<int64_t> outSpatialShape, std::vector<int64_t> spatialShape,
std::vector<int64_t> kernelSize, std::vector<int64_t> stride,
std::vector<int64_t> padding, std::vector<int64_t> dilation,
std::vector<int64_t> outPadding, int64_t _subM, int64_t _transpose);
/*************************************************************************
* Copyright (C) 2022 by Cambricon.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*************************************************************************/
#include "pytorch_device_registry.hpp"
#include "pytorch_mlu_helper.hpp"
#define MIN(a, b) (((a) < (b)) ? (a) : (b))
void KernelDynamicVoxelize(
cnrtDim3_t k_dim, cnrtFunctionType_t k_type, cnrtQueue_t queue,
const void *points, void *coors, const float voxel_x, const float voxel_y,
const float voxel_z, const float coors_x_min, const float coors_y_min,
const float coors_z_min, const float coors_x_max, const float coors_y_max,
const float coors_z_max, const int32_t grid_x, const int32_t grid_y,
const int32_t grid_z, const int32_t num_points, const int32_t num_features);
void KernelPoint2Voxel(cnrtDim3_t k_dim, cnrtFunctionType_t k_type,
cnrtQueue_t queue, void *coors, void *point_to_pointidx,
void *point_to_voxelidx, const int32_t num_points,
const int32_t max_points);
void KernelCalcPointsPerVoxel(cnrtDim3_t k_dim, cnrtFunctionType_t k_type,
cnrtQueue_t queue, void *point_to_pointidx,
void *point_to_voxelidx, void *coor_to_voxelidx,
void *num_points_per_voxel, void *voxel_num,
const int32_t max_voxels,
const int32_t num_points);
void KernelAssignVoxelsCoors(cnrtDim3_t k_dim, cnrtFunctionType_t k_type,
cnrtQueue_t queue, const void *points,
void *temp_coors, void *point_to_voxelidx,
void *coor_to_voxelidx, void *voxels, void *coors,
const int32_t max_points, const int32_t num_points,
const int32_t num_features);
// policy function
static void policyFuncDefault(cnrtDim3_t *k_dim, cnrtFunctionType_t *k_type,
const int num_points) {
k_dim->x = torch_mlu::getDeviceAttr(cnrtAttrMcorePerCluster);
k_dim->y = MIN((num_points + k_dim->x - 1) / k_dim->x,
torch_mlu::getDeviceAttr(cnrtAttrClusterCount));
k_dim->z = 1;
*k_type = CNRT_FUNC_TYPE_UNION1;
}
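// Worked example (illustrative device numbers): with 4 cores per cluster and
// 8 clusters, num_points = 1000 gives
//   k_dim = {4, min((1000 + 3) / 4, 8), 1} = {4, 8, 1},
// i.e. a UNION1 task spread over all 32 cores.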
// policy function
static void policyFuncCalcPointsPerVoxel(cnrtDim3_t *k_dim,
cnrtFunctionType_t *k_type,
const int num_points) {
k_dim->x = 1;
k_dim->y = 1;
k_dim->z = 1;
*k_type = CNRT_FUNC_TYPE_BLOCK;
}
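// The per-voxel bookkeeping in step 3 is launched as a single-core BLOCK
// task (k_dim = {1, 1, 1}), presumably because voxel indices must be
// assigned in point order, which makes the scan inherently sequential.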
int HardVoxelizeForwardMLUKernelLauncher(
const at::Tensor &points, at::Tensor &voxels, at::Tensor &coors,
at::Tensor &num_points_per_voxel, const std::vector<float> voxel_size,
const std::vector<float> coors_range, const int max_points,
const int max_voxels, const int NDim = 3) {
// check datatype
TORCH_CHECK(points.scalar_type() == at::kFloat,
"points type should be Float, got ", points.scalar_type(), ".");
TORCH_CHECK(voxels.scalar_type() == at::kFloat,
"voxels type should be Float, got ", voxels.scalar_type(), ".");
TORCH_CHECK(coors.scalar_type() == at::kInt,
"coors type should be Int, got ", coors.scalar_type(), ".");
TORCH_CHECK(num_points_per_voxel.scalar_type() == at::kInt,
"num_points_per_voxel type should be Int, got ",
num_points_per_voxel.scalar_type(), ".");
// check shape
TORCH_CHECK(points.dim() == 2, "points should be a 2d tensor, got ",
points.dim(), "D.");
TORCH_CHECK(voxels.dim() == 3, "voxels should be a 3d tensor, got ",
voxels.dim(), "D.");
TORCH_CHECK(coors.dim() == 2, "coors should be a 2d tensor, got ",
coors.dim(), "D.");
TORCH_CHECK(num_points_per_voxel.dim() == 1,
"num_points_per_voxel should be a 1d tensor, got ",
num_points_per_voxel.dim(), "D.");
const int num_points = points.size(0);
const int num_features = points.size(1);
TORCH_CHECK(points.size(0) == num_points,
"the 1st dimension of points should be num_points, got ",
points.size(0), ".");
TORCH_CHECK(points.size(1) == num_features,
"the 2nd dimension of points should be num_features, got ",
points.size(1), ".");
TORCH_CHECK(voxels.size(0) == max_voxels,
"the 1st dimension of voxels should be max_voxels, got ",
voxels.size(0), ".");
TORCH_CHECK(voxels.size(1) == max_points,
"the 2nd dimension of voxels should be max_points, got ",
voxels.size(1), ".");
TORCH_CHECK(voxels.size(2) == num_features,
"the 3rd dimension of voxels should be num_features, got ",
voxels.size(2), ".");
TORCH_CHECK(coors.size(0) == max_voxels,
"the 1st dimension of coors should be max_voxels, got ",
coors.size(0), ".");
TORCH_CHECK(coors.size(1) == 3,
"the 2nd dimension of coors should be 3, got ", coors.size(1), ".");
TORCH_CHECK(num_points_per_voxel.size(0) == max_voxels,
"the 1st dimension of num_points_per_voxel should be max_voxels, got ",
num_points_per_voxel.size(0), ".");
// large tensor check
const size_t max_input_size = 2147483648;
TORCH_CHECK(points.numel() < max_input_size,
"points element num should be less than 2^31, got ",
points.numel(), ".");
TORCH_CHECK(voxels.numel() < max_input_size,
"voxels element num should be less than 2^31, got ",
voxels.numel(), ".");
TORCH_CHECK(coors.numel() < max_input_size,
"coors element num should be less than 2^31, got ", coors.numel(),
".");
// check zero element
if (max_points == 0 || max_voxels == 0) {
return 0;
}
// get compute queue
auto queue = torch_mlu::getCurQueue();
// get ptr of tensors
auto points_ = points.contiguous();
auto points_impl = torch_mlu::getMluTensorImpl(points_);
auto points_ptr = points_impl->cnnlMalloc();
auto voxels_ = voxels.contiguous();
auto voxels_impl = torch_mlu::getMluTensorImpl(voxels_);
auto voxels_ptr = voxels_impl->cnnlMalloc();
auto coors_ = coors.contiguous();
auto coors_impl = torch_mlu::getMluTensorImpl(coors_);
auto coors_ptr = coors_impl->cnnlMalloc();
auto num_points_per_voxel_ = num_points_per_voxel.contiguous();
auto num_points_per_voxel_impl =
torch_mlu::getMluTensorImpl(num_points_per_voxel_);
auto num_points_per_voxel_ptr = num_points_per_voxel_impl->cnnlMalloc();
// calculate task dimension
cnrtDim3_t k_dim;
cnrtFunctionType_t k_type;
policyFuncDefault(&k_dim, &k_type, num_points);
// 1. link point to corresponding voxel coors
const float voxel_x = voxel_size[0];
const float voxel_y = voxel_size[1];
const float voxel_z = voxel_size[2];
const float coors_x_min = coors_range[0];
const float coors_y_min = coors_range[1];
const float coors_z_min = coors_range[2];
const float coors_x_max = coors_range[3];
const float coors_y_max = coors_range[4];
const float coors_z_max = coors_range[5];
const int grid_x = round((coors_x_max - coors_x_min) / voxel_x);
const int grid_y = round((coors_y_max - coors_y_min) / voxel_y);
const int grid_z = round((coors_z_max - coors_z_min) / voxel_z);
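// Worked example (illustrative values): voxel_size = [0.2, 0.2, 4] with
// coors_range = [-51.2, -51.2, -5, 51.2, 51.2, 3] gives
// grid_x = grid_y = round(102.4 / 0.2) = 512 and grid_z = round(8 / 4) = 2.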
auto temp_coors =
at::zeros({NDim, num_points}, points.options().dtype(at::kInt))
.contiguous();
auto temp_coors_impl = torch_mlu::getMluTensorImpl(temp_coors);
auto temp_coors_ptr = temp_coors_impl->cnnlMalloc();
KernelDynamicVoxelize(k_dim, k_type, queue, points_ptr, temp_coors_ptr,
voxel_x, voxel_y, voxel_z, coors_x_min, coors_y_min,
coors_z_min, coors_x_max, coors_y_max, coors_z_max,
grid_x, grid_y, grid_z, num_points, num_features);
// 2. map each point to the index of its voxel and detect duplicate coors
auto point_to_pointidx =
    at::zeros({num_points}, points.options().dtype(at::kInt)).contiguous();
auto point_to_pointidx_impl = torch_mlu::getMluTensorImpl(point_to_pointidx);
auto point_to_pointidx_ptr = point_to_pointidx_impl->cnnlMalloc();
auto point_to_voxelidx =
    at::zeros({num_points}, points.options().dtype(at::kInt)).contiguous();
auto point_to_voxelidx_impl = torch_mlu::getMluTensorImpl(point_to_voxelidx);
auto point_to_voxelidx_ptr = point_to_voxelidx_impl->cnnlMalloc();
KernelPoint2Voxel(k_dim, k_type, queue, temp_coors_ptr, point_to_pointidx_ptr,
point_to_voxelidx_ptr, num_points, max_points);
// calculate task dimension
cnrtDim3_t k_dim_calc_points_per_voxel;
cnrtFunctionType_t k_type_calc_points_per_voxel;
policyFuncCalcPointsPerVoxel(&k_dim_calc_points_per_voxel,
&k_type_calc_points_per_voxel, num_points);
// 3. determine voxel num and voxel's coor index
auto coor_to_voxelidx =
    at::zeros({num_points}, points.options().dtype(at::kInt)).contiguous();
auto coor_to_voxelidx_impl = torch_mlu::getMluTensorImpl(coor_to_voxelidx);
auto coor_to_voxelidx_ptr = coor_to_voxelidx_impl->cnnlMalloc();
auto voxel_num =
    at::zeros({1}, points.options().dtype(at::kInt)).contiguous();
auto voxel_num_impl = torch_mlu::getMluTensorImpl(voxel_num);
auto voxel_num_ptr = voxel_num_impl->cnnlMalloc();
KernelCalcPointsPerVoxel(
k_dim_calc_points_per_voxel, k_type_calc_points_per_voxel, queue,
point_to_pointidx_ptr, point_to_voxelidx_ptr, coor_to_voxelidx_ptr,
num_points_per_voxel_ptr, voxel_num_ptr, max_voxels, num_points);
// 4. copy the point features and coors of each voxel into the output tensors
KernelAssignVoxelsCoors(k_dim, k_type, queue, points_ptr, temp_coors_ptr,
point_to_voxelidx_ptr, coor_to_voxelidx_ptr,
voxels_ptr, coors_ptr, max_points, num_points,
num_features);
auto voxel_num_cpu = voxel_num.to(at::kCPU);
int voxel_num_int = voxel_num_cpu.data_ptr<int>()[0];
return voxel_num_int;
}
int hard_voxelize_forward_mlu(const at::Tensor &points, at::Tensor &voxels,
at::Tensor &coors,
at::Tensor &num_points_per_voxel,
const std::vector<float> voxel_size,
const std::vector<float> coors_range,
const int max_points, const int max_voxels,
const int NDim) {
return HardVoxelizeForwardMLUKernelLauncher(
points, voxels, coors, num_points_per_voxel, voxel_size, coors_range,
max_points, max_voxels, NDim);
}
int hard_voxelize_forward_impl(const at::Tensor &points, at::Tensor &voxels,
at::Tensor &coors,
at::Tensor &num_points_per_voxel,
const std::vector<float> voxel_size,
const std::vector<float> coors_range,
const int max_points, const int max_voxels,
const int NDim);
REGISTER_DEVICE_IMPL(hard_voxelize_forward_impl, MLU,
hard_voxelize_forward_mlu);
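// Usage sketch (illustrative; assumes mmcv was built with MLU support):
//   from mmcv.ops import Voxelization
//   voxelize = Voxelization(voxel_size=[0.2, 0.2, 4],
//                           point_cloud_range=[-51.2, -51.2, -5, 51.2, 51.2, 3],
//                           max_num_points=32, max_voxels=20000)
//   voxels, coors, num_points_per_voxel = voxelize(points.to('mlu'))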
@@ -17,6 +17,11 @@ Tensor nms_rotated_npu(const Tensor dets, const Tensor scores,
const Tensor labels, const float iou_threshold);
#endif
#ifdef MMCV_WITH_MLU
Tensor nms_rotated_mlu(const Tensor dets, const Tensor scores,
const float iou_threshold);
#endif
// Interface for Python
// inline is needed to prevent multiple function definitions when this header is
// included by different cpps
@@ -36,6 +41,10 @@ Tensor nms_rotated(const Tensor dets, const Tensor scores, const Tensor order,
return nms_rotated_npu(dets, scores, labels, iou_threshold);
#else
AT_ERROR("Not compiled with NPU support");
#endif
#ifdef MMCV_WITH_MLU
} else if (dets.device().type() == at::kMLU) {
return nms_rotated_mlu(dets, scores, iou_threshold);
#endif
}
@@ -35,6 +35,26 @@ std::vector<torch::Tensor> get_indice_pairs_forward_cuda(
padding, dilation, outPadding, _subM, _transpose);
};
template <unsigned NDim>
std::vector<torch::Tensor> GetIndicePairsForwardMLUKernelLauncher(
torch::Tensor indices, int64_t batchSize,
std::vector<int64_t> outSpatialShape, std::vector<int64_t> spatialShape,
std::vector<int64_t> kernelSize, std::vector<int64_t> stride,
std::vector<int64_t> padding, std::vector<int64_t> dilation,
std::vector<int64_t> outPadding, int64_t _subM, int64_t _transpose);
template <unsigned NDim>
std::vector<torch::Tensor> get_indice_pairs_forward_mlu(
torch::Tensor indices, int64_t batchSize,
std::vector<int64_t> outSpatialShape, std::vector<int64_t> spatialShape,
std::vector<int64_t> kernelSize, std::vector<int64_t> stride,
std::vector<int64_t> padding, std::vector<int64_t> dilation,
std::vector<int64_t> outPadding, int64_t _subM, int64_t _transpose) {
return GetIndicePairsForwardMLUKernelLauncher<NDim>(
indices, batchSize, outSpatialShape, spatialShape, kernelSize, stride,
padding, dilation, outPadding, _subM, _transpose);
}
template <unsigned NDim>
std::vector<torch::Tensor> GetIndicePairsBackwardCUDAKernelLauncher(
torch::Tensor indices, torch::Tensor gridOut, int64_t batchSize,
@@ -71,6 +91,12 @@ std::vector<torch::Tensor> get_indice_pairs_forward(
padding, dilation, outPadding, _subM, _transpose);
#else
AT_ERROR("get_indice_pairs is not compiled with GPU support");
#endif
#ifdef MMCV_WITH_MLU
} else if (indices.device().type() == at::kMLU) {
return get_indice_pairs_forward_mlu<NDim>(
indices, batchSize, outSpatialShape, spatialShape, kernelSize, stride,
padding, dilation, outPadding, _subM, _transpose);
#endif
} else {
AT_ERROR("get_indice_pairs is not implemented on CPU");
@@ -410,8 +410,9 @@ def nms_rotated(dets: Tensor,
input_labels = scores.new_empty(0, dtype=torch.int)
else:
input_labels = labels
if dets.device.type in ('npu', 'mlu'):
    order = scores.new_empty(0, dtype=torch.long)
    if dets.device.type == 'npu':
        coefficient = 57.29578  # 180 / PI
        for i in range(dets.size()[0]):
            dets_cw[i][4] *= coefficient  # radians to angle
@@ -211,6 +211,7 @@ def get_extensions():
include_dirs = []
extra_objects = []
is_rocm_pytorch = False
try:
from torch.utils.cpp_extension import ROCM_HOME
@@ -238,16 +239,98 @@ def get_extensions():
torch.is_mlu_available()) or \
os.getenv('FORCE_MLU', '0') == '1':
from torch_mlu.utils.cpp_extension import MLUExtension
def get_mluops_version(file_path):
    with open(file_path) as f:
        for line in f:
            if re.search('MLUOP_MAJOR', line):
                major = line.strip().split(' ')[2]
            if re.search('MLUOP_MINOR', line):
                minor = line.strip().split(' ')[2]
            if re.search('MLUOP_PATCHLEVEL', line):
                patchlevel = line.strip().split(' ')[2]
    mluops_version = f'v{major}.{minor}.{patchlevel}'
    return mluops_version
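# Illustrative example (hypothetical version numbers): a header containing
#     #define MLUOP_MAJOR 0
#     #define MLUOP_MINOR 5
#     #define MLUOP_PATCHLEVEL 0
# makes get_mluops_version() return 'v0.5.0'. Note the parser assumes
# single-space-separated '#define NAME VALUE' lines.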
mmcv_mluops_version = get_mluops_version(
    './mmcv/ops/csrc/pytorch/mlu/mlu_common_helper.h')
mlu_ops_path = os.getenv('MMCV_MLU_OPS_PATH')
if mlu_ops_path:
    exists_mluops_version = get_mluops_version(
        mlu_ops_path + '/bangc-ops/mlu_op.h')
    if exists_mluops_version != mmcv_mluops_version:
        print('the version of mlu-ops provided is %s,'
              ' while %s is needed.' %
              (exists_mluops_version, mmcv_mluops_version))
        exit()
    try:
        if os.path.exists('mlu-ops'):
            if os.path.islink('mlu-ops'):
                os.remove('mlu-ops')
                os.symlink(mlu_ops_path, 'mlu-ops')
            elif os.path.abspath('mlu-ops') != mlu_ops_path:
                os.symlink(mlu_ops_path, 'mlu-ops')
        else:
            os.symlink(mlu_ops_path, 'mlu-ops')
    except Exception:
        raise FileExistsError('mlu-ops already exists, please move it out, '
                              'or rename or remove it.')
else:
    if not os.path.exists('mlu-ops'):
        import requests
        mluops_url = 'https://github.com/Cambricon/mlu-ops/' + \
            'archive/refs/tags/' + mmcv_mluops_version + '.zip'
        req = requests.get(mluops_url)
        with open('./mlu-ops.zip', 'wb') as f:
            try:
                f.write(req.content)
            except Exception:
                raise ImportError('failed to download mlu-ops')
        from zipfile import BadZipFile, ZipFile
        with ZipFile('./mlu-ops.zip', 'r') as archive:
            try:
                archive.extractall()
                dir_name = archive.namelist()[0].split('/')[0]
                os.rename(dir_name, 'mlu-ops')
            except BadZipFile:
                print('invalid mlu-ops.zip file')
    else:
        exists_mluops_version = get_mluops_version(
            './mlu-ops/bangc-ops/mlu_op.h')
        if exists_mluops_version != mmcv_mluops_version:
            print('the version of provided mlu-ops is %s,'
                  ' while %s is needed.' %
                  (exists_mluops_version, mmcv_mluops_version))
            exit()
define_macros += [('MMCV_WITH_MLU', None)]
mlu_args = os.getenv('MMCV_MLU_ARGS', '-DNDEBUG ')
mluops_includes = []
mluops_includes.append('-I' + os.path.abspath('./mlu-ops/bangc-ops'))
mluops_includes.append(
    '-I' + os.path.abspath('./mlu-ops/bangc-ops/kernels'))
extra_compile_args['cncc'] = [mlu_args] + \
    mluops_includes if mlu_args else mluops_includes
extra_compile_args['cxx'] += ['-fno-gnu-unique']
op_files = glob.glob('./mmcv/ops/csrc/pytorch/*.cpp') + \
glob.glob('./mmcv/ops/csrc/pytorch/cpu/*.cpp') + \
glob.glob('./mmcv/ops/csrc/pytorch/mlu/*.cpp') + \
glob.glob('./mmcv/ops/csrc/common/mlu/*.mlu') + \
glob.glob(
'./mlu-ops/bangc-ops/core/**/*.cpp', recursive=True) + \
glob.glob(
'./mlu-ops/bangc-ops/kernels/**/*.cpp', recursive=True) + \
glob.glob(
'./mlu-ops/bangc-ops/kernels/**/*.mlu', recursive=True)
extra_objects = glob.glob(
'./mlu-ops/bangc-ops/kernels/kernel_wrapper/*.o')
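# Assumption: the kernel_wrapper directory holds precompiled .o objects
# (produced by an mlu-ops build) that are linked into the extension rather
# than compiled here.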
extension = MLUExtension
include_dirs.append(os.path.abspath('./mmcv/ops/csrc/common'))
include_dirs.append(os.path.abspath('./mmcv/ops/csrc/common/mlu'))
include_dirs.append(os.path.abspath('./mlu-ops/bangc-ops'))
elif (hasattr(torch.backends, 'mps')
and torch.backends.mps.is_available()) or os.getenv(
'FORCE_MPS', '0') == '1':
@@ -309,6 +392,7 @@ def get_extensions():
sources=op_files,
include_dirs=include_dirs,
define_macros=define_macros,
extra_objects=extra_objects,
extra_compile_args=extra_compile_args)
extensions.append(ext_ops)
return extensions