Unverified Commit 733e6ff8 authored by bdf, committed by GitHub

Pick MLU modifications from master (1.x) to main (2.x) (#2704)



* [Feature] Support Voxelization with cambricon MLU device (#2500)

* [Feature] Support hard_voxelize with cambricon MLU backend

* [Feature](bangc-ops): add voxelization op

* [Feature](bangc-ops): add voxelization op

* [Feature](bangc-ops): add voxelization op

* [Feature](bangc-ops): add voxelization op

* [Feature](bangc-ops): add voxelization op

* [Feature](bangc-ops): add voxelization op

* [Feature](bangc-ops): add voxelization op

* [Feature](bangc-ops): add voxelization op

* [Enhance] Optimize the performance of ms_deform_attn for MLU device (#2510)

* ms_opt

* ms_opt

* ms_opt

* ms_opt

* ms_opt

* [Feature] ms_deform_attn performance optimization

* [Feature] ms_deform_attn performance optimization

* [Feature] ms_deform_attn performance optimization

* [Feature] Support ball_query with cambricon MLU backend and mlu-ops library. (#2520)

* [Feature] Support ball_query with cambricon MLU backend and mlu-ops library.

* [Fix] update operator data layout setting.

* [Fix] add cxx compile option to avoid symbol conflict.

* [Fix] fix lint errors.

* [Fix] update ops.md with info of ball_query support by MLU backend.

* [Feature] Fix typo.

* [Fix] Remove print.

* [Fix] get mlu-ops from MMCV_MLU_OPS_PATH env.

* [Fix] update MMCV_MLU_OPS_PATH check logic.

* [Fix] update error info when failed to download mlu-ops.

* [Fix] check mlu-ops version matching info in mmcv.

* [Fix] revise wrong filename.

* [Fix] remove f.close and re.

* [Docs] Steps to compile mmcv-full on MLU machine (#2571)

* [Docs] Steps to compile mmcv-full on MLU machine

* [Docs] Adjust paragraph order

* Update docs/zh_cn/get_started/build.md
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>

* Update docs/zh_cn/get_started/build.md
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>

* Update docs/en/get_started/build.md
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>

* Update docs/en/get_started/build.md
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>

* [Docs] Modify the format

---------
Co-authored-by: budefei <budefei@cambricon.com>
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>

* [Fix] Fix tensor descriptor setting in MLU ball_query. (#2579)

* [Feature] Add MLU support for Sparse Convolution op (#2589)

* [Feature] Add sparse convolution MLU API

* [Feature] update cpp code style

* end-of-file

* delete libext.a

* code style

* update ops.md

---------
Co-authored-by: budefei <budefei@cambricon.com>

* [Enhancement] Replace the implementation of deform_roi_pool with mlu-ops (#2598)

* [Feature] Replace the implementation of deform_roi_pool with mlu-ops

* [Feature] Modify code

---------
Co-authored-by: budefei <budefei@cambricon.com>

* [Enhancement] ms_deform_attn performance optimization (#2616)

* ms_opt_v2

* ms_opt_v2_1

* optimize MultiScaleDeformableAttention ops for MLU

* ms_opt_v2_1

* [Feature] ms_deform_attn performance optimization V2

* [Feature] ms_deform_attn performance optimization V2

* [Feature] ms_deform_attn performance optimization V2

* [Feature] ms_deform_attn performance optimization V2

* [Feature] ms_deform_attn performance optimization V2

* [Feature] ms_deform_attn performance optimization V2

* [Feature] ms_deform_attn performance optimization V2

---------
Co-authored-by: dongchengwei <dongchengwei@cambricon.com>

* [Feature] Support NmsRotated with cambricon MLU backend (#2643)

* [Feature] Support NmsRotated with cambricon MLU backend

* [Feature] remove foolproofs in nms_rotated_mlu.cpp

* [Feature] fix lint in test_nms_rotated.py

* [Feature] fix kMLU not found in nms_rotated.cpp

* [Feature] modify mlu support in nms.py

* [Feature] modify nms_rotated support in ops.md

* [Feature] modify ops/nms.py

* [Enhance] Add a default value for MMCV_MLU_ARGS (#2688)

* add mlu_args

* add mlu_args

* Modify the code

---------
Co-authored-by: budefei <budefei@cambricon.com>

* [Enhance] Ignore mlu-ops files (#2691)
Co-authored-by: budefei <budefei@cambricon.com>

---------
Co-authored-by: ZShaopeng <108382403+ZShaopeng@users.noreply.github.com>
Co-authored-by: BinZheng <38182684+Wickyzheng@users.noreply.github.com>
Co-authored-by: liuduanhui <103939338+DanieeelLiu@users.noreply.github.com>
Co-authored-by: budefei <budefei@cambricon.com>
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>
Co-authored-by: duzekun <108381389+duzekunKTH@users.noreply.github.com>
Co-authored-by: dongchengwei <dongchengwei@cambricon.com>
Co-authored-by: liuyuan1-v <125547457+liuyuan1-v@users.noreply.github.com>
parent 1f161f68
@@ -27,6 +27,8 @@ wheels/
.installed.cfg
*.egg
MANIFEST
mlu-ops/
mlu-ops.*
# PyInstaller
# Usually these files are written by a python script from a template
...
@@ -290,3 +290,60 @@ If you need to use PyTorch-related modules, make sure PyTorch has been successfully
```bash
python -c 'import mmcv;print(mmcv.__version__)'
```
### Build mmcv-full on Cambricon MLU Devices
#### Install torch_mlu
##### Option 1: Install mmcv-full based on the Cambricon docker image
First, install Docker and pull the Cambricon docker image (please email service@cambricon.com for the latest release docker):
```bash
docker pull ${docker image}
```
Run and attach to the docker, then [install mmcv-full on the MLU device](#install-mmcv-full-on-cambricon-mlu-device) and [make sure you've installed mmcv-full on the MLU device successfully](#test-code).
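For reference, starting and attaching to the container typically looks like the sketch below. The image name, container name, and device nodes are placeholders that depend on your local Cambricon driver installation, so adjust them to your setup:
```bash
# Hypothetical example: replace the image tag and device nodes with your own.
docker run -it --name mmcv-mlu \
    --device /dev/cambricon_ctl \
    --device /dev/cambricon_dev0 \
    ${docker image} /bin/bash
```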
##### Option 2: Install mmcv-full by compiling Cambricon PyTorch from source
Please email service@cambricon.com or contact Cambricon engineers for a suitable version of the CATCH package. Once you have it, follow the steps in ${CATCH-path}/CONTRIBUTING.md to install Cambricon PyTorch.
#### Install mmcv-full on Cambricon MLU device
Clone the repo
```bash
git clone https://github.com/open-mmlab/mmcv.git
```
The mlu-ops library will be downloaded to the default directory (mmcv/mlu-ops) while building MMCV. You can also set `MMCV_MLU_OPS_PATH` to an existing mlu-ops library before building as follows:
```bash
export MMCV_MLU_OPS_PATH=/xxx/xxx/mlu-ops
```
Install mmcv-full
```bash
cd mmcv
export MMCV_WITH_OPS=1
export FORCE_MLU=1
python setup.py install
```
#### Test Code
After finishing the previous steps, you can run the following Python code to make sure that you've installed mmcv-full on the MLU device successfully:
```python
import torch
import torch_mlu
from mmcv.ops import sigmoid_focal_loss
x = torch.randn(3, 10).mlu()
x.requires_grad = True
y = torch.tensor([1, 5, 3]).mlu()
w = torch.ones(10).float().mlu()
output = sigmoid_focal_loss(x, y, 2.0, 0.25, w, 'none')
print(output)
```
@@ -6,7 +6,7 @@ We implement common ops used in detection, segmentation, etc.
| ---------------------------- | --- | ---- | --- | --- | ------ |
| ActiveRotatedFilter          | √   | √    |     |     |        |
| AssignScoreWithK             |     | √    |     |     |        |
| BallQuery                    |     | √    | √   |     |        |
| BBoxOverlaps                 |     | √    | √   | √   | √      |
| BorderAlign                  |     | √    |     |     |        |
| BoxIouRotated                | √   | √    |     |     |        |
@@ -35,7 +35,7 @@ We implement common ops used in detection, segmentation, etc.
| ModulatedDeformConv2d        | √   | √    |     |     | √      |
| MultiScaleDeformableAttn     |     | √    | √   |     |        |
| NMS                          | √   | √    | √   |     | √      |
| NMSRotated                   | √   | √    | √   |     | √      |
| NMSQuadri                    | √   | √    |     |     |        |
| PixelGroup                   | √   |      |     |     |        |
| PointsInBoxes                | √   | √    |     |     |        |
@@ -52,13 +52,13 @@ We implement common ops used in detection, segmentation, etc.
| SigmoidFocalLoss             |     | √    | √   |     | √      |
| SoftmaxFocalLoss             |     | √    |     |     | √      |
| SoftNMS                      |     | √    |     |     |        |
| Sparse Convolution           |     | √    | √   |     |        |
| Synchronized BatchNorm       |     | √    |     |     |        |
| ThreeInterpolate             |     | √    |     |     |        |
| ThreeNN                      |     | √    | √   |     |        |
| TINShift                     |     | √    | √   |     |        |
| UpFirDn2d                    |     | √    |     |     |        |
| Voxelization                 | √   | √    | √   |     | √      |
| PrRoIPool                    |     | √    |     |     |        |
| BezierAlign                  | √   | √    |     |     |        |
| BiasAct                      |     | √    |     |     |        |
...
@@ -298,3 +298,59 @@ mmcv has two versions:
```bash
python -c 'import mmcv;print(mmcv.__version__)'
```
### Build mmcv-full on a Cambricon MLU machine
#### Install torch_mlu
##### Option 1: Install based on the Cambricon docker image
First, download and pull the Cambricon docker image (please email service@cambricon.com to get the latest Cambricon PyTorch release docker):
```bash
docker pull ${docker image}
```
Enter the docker, then [build MMCV on MLU](#build-mmcv) and [verify the installation](#verify-the-installation).
##### Option 2: Install by compiling Cambricon PyTorch from source
Please email service@cambricon.com or contact Cambricon engineers for a suitable version of the CATCH package. After you get it, follow the steps in ${CATCH-path}/CONTRIBUTING.md to install CATCH.
#### Build MMCV
Clone the repository
```bash
git clone https://github.com/open-mmlab/mmcv.git
```
The mlu-ops operator library is downloaded to the default path (mmcv/mlu-ops) automatically while building MMCV. You can also set the environment variable `MMCV_MLU_OPS_PATH` to an existing mlu-ops library path before building.
```bash
export MMCV_MLU_OPS_PATH=/xxx/xxx/mlu-ops
```
Start the build
```bash
cd mmcv
export MMCV_WITH_OPS=1
export FORCE_MLU=1
python setup.py install
```
#### Verify the installation
After finishing the installation steps above, you can run the following Python code to check whether mmcv-full has been installed successfully on the MLU device:
```python
import torch
import torch_mlu
from mmcv.ops import sigmoid_focal_loss
x = torch.randn(3, 10).mlu()
x.requires_grad = True
y = torch.tensor([1, 5, 3]).mlu()
w = torch.ones(10).float().mlu()
output = sigmoid_focal_loss(x, y, 2.0, 0.25, w, 'none')
```
@@ -6,7 +6,7 @@ MMCV provides common operators used in detection, segmentation, etc.
| ---------------------------- | --- | ---- | --- | --- | ------ |
| ActiveRotatedFilter          | √   | √    |     |     |        |
| AssignScoreWithK             |     | √    |     |     |        |
| BallQuery                    |     | √    | √   |     |        |
| BBoxOverlaps                 |     | √    | √   | √   | √      |
| BorderAlign                  |     | √    |     |     |        |
| BoxIouRotated                | √   | √    |     |     |        |
@@ -35,7 +35,7 @@ MMCV provides common operators used in detection, segmentation, etc.
| ModulatedDeformConv2d        | √   | √    |     |     | √      |
| MultiScaleDeformableAttn     |     | √    | √   |     |        |
| NMS                          | √   | √    | √   |     | √      |
| NMSRotated                   | √   | √    | √   |     | √      |
| NMSQuadri                    | √   | √    |     |     |        |
| PixelGroup                   | √   |      |     |     |        |
| PointsInBoxes                | √   | √    |     |     |        |
@@ -52,13 +52,13 @@ MMCV provides common operators used in detection, segmentation, etc.
| SigmoidFocalLoss             |     | √    | √   |     | √      |
| SoftmaxFocalLoss             |     | √    |     |     | √      |
| SoftNMS                      |     | √    |     |     |        |
| Sparse Convolution           |     | √    | √   |     |        |
| Synchronized BatchNorm       |     | √    |     |     |        |
| ThreeInterpolate             |     | √    |     |     |        |
| ThreeNN                      |     | √    | √   |     |        |
| TINShift                     |     | √    | √   |     |        |
| UpFirDn2d                    |     | √    |     |     |        |
| Voxelization                 | √   | √    | √   |     | √      |
| PrRoIPool                    |     | √    |     |     |        |
| BezierAlign                  | √   | √    |     |     |        |
| BiasAct                      |     | √    |     |     |        |
...
This diff is collapsed.
/*************************************************************************
* Copyright (C) 2022 Cambricon.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*************************************************************************/
#include "mlu_common_helper.h"
void ball_query_forward_mlu(int b, int n, int m, float min_radius,
float max_radius, int nsample, const Tensor new_xyz,
const Tensor xyz, Tensor idx) {
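  // Make the inputs contiguous in the memory format suggested by new_xyz.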
auto new_xyz_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
new_xyz, new_xyz.suggest_memory_format());
auto xyz_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
xyz, new_xyz.suggest_memory_format());
auto idx_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
idx, new_xyz.suggest_memory_format());
MluOpTensorDescriptor new_xyz_desc, xyz_desc, idx_desc;
new_xyz_desc.set(new_xyz_contiguous);
xyz_desc.set(xyz_contiguous);
idx_desc.set(idx_contiguous);
auto new_xyz_impl = torch_mlu::getMluTensorImpl(new_xyz_contiguous);
auto xyz_impl = torch_mlu::getMluTensorImpl(xyz_contiguous);
auto idx_impl = torch_mlu::getMluTensorImpl(idx_contiguous);
auto new_xyz_ptr = new_xyz_impl->cnnlMalloc();
auto xyz_ptr = xyz_impl->cnnlMalloc();
auto idx_ptr = idx_impl->cnnlMalloc();
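  // Get the current mlu-ops handle and launch the ball_query kernel.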
auto handle = mluOpGetCurrentHandle();
mluOpBallQuery(handle, new_xyz_desc.desc(), new_xyz_ptr, xyz_desc.desc(),
xyz_ptr, min_radius, max_radius, nsample, idx_desc.desc(),
idx_ptr);
}
void ball_query_forward_impl(int b, int n, int m, float min_radius,
float max_radius, int nsample,
const Tensor new_xyz, const Tensor xyz,
Tensor idx);
REGISTER_DEVICE_IMPL(ball_query_forward_impl, MLU, ball_query_forward_mlu);
@@ -9,254 +9,59 @@
 * TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
 * SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 *************************************************************************/
#include "mlu_common_helper.h"

void DeformRoIPoolForwardMLUKernelLauncher(Tensor input, Tensor rois,
                                           Tensor offset, Tensor output,
                                           int pooled_height, int pooled_width,
                                           float spatial_scale,
                                           int sampling_ratio, float gamma) {
  auto memory_format =
      torch_mlu::cnnl::ops::get_channels_last_memory_format(input.dim());
  auto input_ = torch_mlu::cnnl::ops::cnnl_contiguous(input, memory_format);
  auto rois_contiguous =
      torch_mlu::cnnl::ops::cnnl_contiguous(rois, rois.suggest_memory_format());
  auto output_contiguous =
      torch_mlu::cnnl::ops::cnnl_contiguous(output, memory_format);

  MluOpTensorDescriptor input_desc, rois_desc, offset_desc, output_desc;
  input_desc.set_with_layout(input_, MLUOP_LAYOUT_NHWC);
  rois_desc.set(rois_contiguous);
  output_desc.set_with_layout(output_contiguous, MLUOP_LAYOUT_NHWC);

  mluOpTensorDescriptor_t offset_real_desc = NULL;
  void *offset_ptr = NULL;
  if (offset.defined() && offset.numel() > 0) {
    auto offset_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
        offset, offset.suggest_memory_format());
    offset_desc.set(offset_contiguous);
    offset_real_desc = offset_desc.desc();
    auto offset_impl = torch_mlu::getMluTensorImpl(offset_contiguous);
    offset_ptr = offset_impl->cnnlMalloc();
  }

  // get ptr of tensors
  auto input_impl = torch_mlu::getMluTensorImpl(input_);
  auto input_ptr = input_impl->cnnlMalloc();
  auto rois_impl = torch_mlu::getMluTensorImpl(rois_contiguous);
  auto rois_ptr = rois_impl->cnnlMalloc();
  auto output_impl = torch_mlu::getMluTensorImpl(output_contiguous);
  auto output_ptr = output_impl->cnnlMalloc();

  // get compute handle
  auto handle = mluOpGetCurrentHandle();
  mluOpDeformRoiPoolForward(
      handle, input_desc.desc(), input_ptr, rois_desc.desc(), rois_ptr,
      offset_real_desc, offset_ptr, pooled_height, pooled_width, spatial_scale,
      sampling_ratio, gamma, output_desc.desc(), output_ptr);

  output.copy_(output_contiguous);
}

void DeformRoIPoolBackwardMLUKernelLauncher(
    Tensor grad_output, Tensor input, Tensor rois, Tensor offset,
    Tensor grad_input, Tensor grad_offset, int pooled_height, int pooled_width,
    float spatial_scale, int sampling_ratio, float gamma) {
  auto memory_format =
      torch_mlu::cnnl::ops::get_channels_last_memory_format(grad_output.dim());
  auto grad_output_ =
      torch_mlu::cnnl::ops::cnnl_contiguous(grad_output, memory_format);
@@ -264,45 +69,56 @@ void DeformRoIPoolBackwardMLUKernelLauncher(
  memory_format =
      torch_mlu::cnnl::ops::get_channels_last_memory_format(input.dim());
  auto input_ = torch_mlu::cnnl::ops::cnnl_contiguous(input, memory_format);
  auto rois_contiguous =
      torch_mlu::cnnl::ops::cnnl_contiguous(rois, rois.suggest_memory_format());
  auto grad_input_ =
      torch_mlu::cnnl::ops::cnnl_contiguous(grad_input, memory_format);

  // get ptr of tensors
  auto grad_output_impl = torch_mlu::getMluTensorImpl(grad_output_);
  auto grad_output_ptr = grad_output_impl->cnnlMalloc();
  auto input_impl = torch_mlu::getMluTensorImpl(input_);
  auto input_ptr = input_impl->cnnlMalloc();
  auto rois_impl = torch_mlu::getMluTensorImpl(rois_contiguous);
  auto rois_ptr = rois_impl->cnnlMalloc();
  auto grad_input_impl = torch_mlu::getMluTensorImpl(grad_input_);
  auto grad_input_ptr = grad_input_impl->cnnlMalloc();

  MluOpTensorDescriptor grad_output_desc, input_desc, rois_desc, offset_desc,
      grad_input_desc, grad_offset_desc;
  grad_output_desc.set_with_layout(grad_output_, MLUOP_LAYOUT_NHWC);
  input_desc.set_with_layout(input_, MLUOP_LAYOUT_NHWC);
  rois_desc.set(rois_contiguous);
  grad_input_desc.set_with_layout(grad_input_, MLUOP_LAYOUT_NHWC);

  mluOpTensorDescriptor_t offset_real_desc = NULL;
  void *offset_ptr = NULL;
  if (offset.defined() && offset.numel() > 0) {
    auto offset_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
        offset, offset.suggest_memory_format());
    offset_desc.set(offset_contiguous);
    offset_real_desc = offset_desc.desc();
    auto offset_impl = torch_mlu::getMluTensorImpl(offset_contiguous);
    offset_ptr = offset_impl->cnnlMalloc();
  }

  mluOpTensorDescriptor_t grad_offset_real_desc = NULL;
  void *grad_offset_ptr = NULL;
  if (grad_offset.defined() && grad_offset.numel() > 0) {
    auto grad_offset_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
        grad_offset, grad_offset.suggest_memory_format());
    grad_offset_desc.set(grad_offset_contiguous);
    grad_offset_real_desc = grad_offset_desc.desc();
    auto grad_offset_impl = torch_mlu::getMluTensorImpl(grad_offset_contiguous);
    grad_offset_ptr = grad_offset_impl->cnnlMalloc();
  }

  // get compute handle
  auto handle = mluOpGetCurrentHandle();
  mluOpDeformRoiPoolBackward(
      handle, grad_output_desc.desc(), grad_output_ptr, input_desc.desc(),
      input_ptr, rois_desc.desc(), rois_ptr, offset_real_desc, offset_ptr,
      pooled_height, pooled_width, spatial_scale, sampling_ratio, gamma,
      grad_input_desc.desc(), grad_input_ptr, grad_offset_real_desc,
      grad_offset_ptr);

  grad_input.copy_(grad_input_);
}
...
/*************************************************************************
* Copyright (C) 2022 Cambricon.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*************************************************************************/
#include "mlu_common_helper.h"
// Descriptors
mluOpDataType_t getMluOpDataType(const caffe2::TypeMeta& data_type) {
const std::map<std::string, mluOpDataType_t> mapping_type = {
{std::string("c10::Half"), MLUOP_DTYPE_HALF},
{std::string("float"), MLUOP_DTYPE_FLOAT},
{std::string("double"), MLUOP_DTYPE_DOUBLE},
{std::string("int8"), MLUOP_DTYPE_INT8},
{std::string("signed char"), MLUOP_DTYPE_INT8},
{std::string("short int"), MLUOP_DTYPE_INT16},
{std::string("short"), MLUOP_DTYPE_INT16},
{std::string("int"), MLUOP_DTYPE_INT32},
{std::string("long int"), MLUOP_DTYPE_INT64},
{std::string("long"), MLUOP_DTYPE_INT64},
{std::string("unsigned char"), MLUOP_DTYPE_UINT8},
{std::string("bool"), MLUOP_DTYPE_BOOL},
{std::string("c10::complex<c10::Half>"), MLUOP_DTYPE_COMPLEX_HALF},
{std::string("c10::complex<float>"), MLUOP_DTYPE_COMPLEX_FLOAT}};
if (mapping_type.find(std::string(data_type.name())) != mapping_type.end()) {
return mapping_type.find(std::string(data_type.name()))->second;
}
return MLUOP_DTYPE_INVALID;
}
// layout
mluOpTensorLayout_t getMluOpSuggestLayout(const at::Tensor& input) {
auto suggest_memory_format = input.suggest_memory_format();
mluOpTensorLayout_t layout = MLUOP_LAYOUT_ARRAY;
switch (input.dim()) {
case 4:
layout = (suggest_memory_format == at::MemoryFormat::ChannelsLast)
? MLUOP_LAYOUT_NHWC
: MLUOP_LAYOUT_NCHW;
break;
case 5:
layout = (suggest_memory_format == at::MemoryFormat::ChannelsLast3d)
? MLUOP_LAYOUT_NDHWC
: MLUOP_LAYOUT_NCDHW;
break;
default:
layout = MLUOP_LAYOUT_ARRAY;
}
return layout;
}
void MluOpTensorDescriptor::set(Tensor t) {
mluOpDataType_t data_type = getMluOpDataType(t.dtype());
mluOpTensorLayout_t layout = getMluOpSuggestLayout(t);
int t_dim = t.dim();
std::vector<int> dim_array;
  if (t_dim == 0) {
    // A scalar tensor (0-dim, single element) is viewed as size 1 by default.
    dim_array.push_back(1);
} else {
for (int i = 0; i < t_dim; i++) {
dim_array.push_back(static_cast<int>(t.sizes().vec()[i]));
}
}
set_desc(t, layout, data_type, dim_array);
}
void MluOpTensorDescriptor::set_with_layout(Tensor t,
mluOpTensorLayout_t layout) {
mluOpDataType_t data_type = getMluOpDataType(t.dtype());
int t_dim = t.dim();
std::vector<int> shape_info = checkUpperBoundAndCastTo<int>(t.sizes().vec());
std::vector<int> stride_info =
checkUpperBoundAndCastTo<int>(t.strides().vec());
if (layout == MLUOP_LAYOUT_NHWC || layout == MLUOP_LAYOUT_NDHWC ||
layout == MLUOP_LAYOUT_NLC) {
convertShapeAndStride(shape_info, stride_info);
} else if (layout == MLUOP_LAYOUT_HWCN) {
auto convertDepthWiseConvShapeStride = [](const std::vector<int64_t>& vec,
std::vector<int>& target_vec,
std::vector<int>& stride_vec) {
// NCHW --> HWCN
target_vec[0] = static_cast<int>(vec[2]);
target_vec[1] = static_cast<int>(vec[3]);
target_vec[2] = static_cast<int>(vec[1]);
target_vec[3] = static_cast<int>(vec[0]);
// Calculate Stride just like contiguous of HWCN.
stride_vec[3] = 1;
stride_vec[2] = target_vec[3] * stride_vec[3];
stride_vec[1] = target_vec[2] * stride_vec[2];
stride_vec[0] = target_vec[1] * stride_vec[1];
};
convertDepthWiseConvShapeStride(t.sizes().vec(), shape_info, stride_info);
}
TORCH_CHECK(mluOpSetTensorDescriptorEx(
desc_, layout, data_type, t_dim, shape_info.data(),
stride_info.data()) == MLUOP_STATUS_SUCCESS,
"mluOpSetTensorDescriptorEx execution failed.");
}
void MluOpTensorDescriptor::set_desc(const at::Tensor& t,
mluOpTensorLayout_t layout,
mluOpDataType_t dtype,
std::vector<int>& dims) {
int dimNb = dims.size();
mluOpSetTensorDescriptor(desc_, layout, dtype, dimNb, dims.data());
}
// Handles
std::once_flag mmcv_mluop_init_flag;
std::mutex mmcv_mluop_mutex;
static std::vector<MluOpHandle> mmcv_mluop_handles;
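// On first use, create one mluOp handle per MLU device; each call binds the
// handle of the requested (or current) device to that device's current queue
// before returning it.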
mluOpHandle_t mluOpGetCurrentHandle(c10::DeviceIndex device_index) {
std::call_once(mmcv_mluop_init_flag,
[]() // Init mmcv_mluop_handles 1-device <-> 1-handle
{
c10::DeviceIndex num_devices = torch_mlu::device_count();
mmcv_mluop_handles.resize(num_devices);
});
if (device_index == -1) {
device_index = torch_mlu::current_device();
}
std::lock_guard<std::mutex> mmcv_mluop_guard(mmcv_mluop_mutex);
auto queue = torch_mlu::getCurrentQueue(device_index).queue();
mmcv_mluop_handles[device_index].setQueue(queue);
return mmcv_mluop_handles[device_index].handle;
}
/*************************************************************************
* Copyright (C) 2022 Cambricon.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*************************************************************************/
#pragma once
#include <ATen/ATen.h>
#include <c10/core/ScalarType.h>
#include "aten.h"
#include "mlu_op.h"
#include "pytorch_device_registry.hpp"
#define MLUOP_MAJOR 0
#define MLUOP_MINOR 5
#define MLUOP_PATCHLEVEL 302
mluOpDataType_t getMluOpDataType(const caffe2::TypeMeta& data_type);
mluOpTensorLayout_t getMluOpSuggestLayout(const at::Tensor& input);
class MluOpTensorDescriptor {
public:
MluOpTensorDescriptor() { mluOpCreateTensorDescriptor(&desc_); };
~MluOpTensorDescriptor() { mluOpDestroyTensorDescriptor(desc_); }
void set(at::Tensor);
void set_with_layout(at::Tensor, mluOpTensorLayout_t layout);
mluOpTensorDescriptor_t desc() { return desc_; }
private:
mluOpTensorDescriptor_t desc_;
void set_desc(const at::Tensor&, mluOpTensorLayout_t, mluOpDataType_t,
std::vector<int>& dims);
};
mluOpHandle_t mluOpGetCurrentHandle(c10::DeviceIndex device_index = -1);
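// Illustrative usage from an op binding (see e.g. ball_query_mlu.cpp):
//   MluOpTensorDescriptor desc;
//   desc.set(torch_mlu::cnnl::ops::cnnl_contiguous(t, t.suggest_memory_format()));
//   auto handle = mluOpGetCurrentHandle();
//   // ... pass handle, desc.desc() and the MLU data pointer to an mlu-ops API.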
class MluOpHandle {
public:
MluOpHandle() : handle(nullptr) { mluOpCreate(&handle); }
~MluOpHandle() {
if (handle) {
mluOpDestroy(handle);
handle = nullptr;
}
}
void setQueue(cnrtQueue_t queue) { mluOpSetQueue(handle, queue); }
mluOpHandle_t handle;
};
// Modify the tensor size and stride order to go from channels_first to
// channels_last (or channels_last_3d). This differs from the original PyTorch
// layout: the resulting order reflects how the data is actually stored.
// Example: describing a channels_last tensor as an NHWC tensor desc:
//   sizes:   N C H W     --> N H W C
//   strides: C*H*W 1 W C --> C*H*W W C 1
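// For instance (illustrative numbers only): sizes {2, 3, 4, 5} with
// channels_last strides {60, 1, 15, 3} become sizes {2, 4, 5, 3} and
// strides {60, 15, 3, 1}.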
template <typename T>
void convertShapeAndStride(std::vector<T>& shape_info,
std::vector<T>& stride_info) {
TORCH_MLU_CHECK(shape_info.size() == stride_info.size(),
"shape size need equal to stride size.");
const int dim = shape_info.size();
std::vector<T> temp_shape_info(dim);
std::vector<T> temp_stride_info(dim);
temp_shape_info[0] = shape_info[0];
temp_stride_info[0] = stride_info[0];
for (size_t i = 0; i < dim - 1; ++i) {
const int index = (i + 1) % (dim - 1) + 1;
temp_shape_info[i + 1] = shape_info[index];
temp_stride_info[i + 1] = stride_info[index];
}
shape_info.assign(temp_shape_info.begin(), temp_shape_info.end());
stride_info.assign(temp_stride_info.begin(), temp_stride_info.end());
}
// Torch tensors provide int64_t shapes and strides, but the mlu-ops
// descriptor requires int32. Use this function to cast safely, or report an
// error if a value exceeds the int32 range.
template <typename DST_T, typename SRC_T>
std::vector<DST_T> checkUpperBoundAndCastTo(const std::vector<SRC_T>& input) {
std::vector<DST_T> output;
output.reserve(input.size());
for (const auto& val : input) {
if (val > std::numeric_limits<DST_T>::max()) {
TORCH_MLU_CHECK(false, "Requires dim size not greater than ",
std::numeric_limits<DST_T>::max(), ". But got ", val,
".");
}
output.push_back(static_cast<DST_T>(val));
}
return output;
}
@@ -14,7 +14,15 @@
#define MIN(a, b) (((a) < (b)) ? (a) : (b))

typedef enum {
  MS_DEFORM_ATTN_FORWARD_INVALID = 0, /*!< Index is invalid. */
  MS_DEFORM_ATTN_FORWARD_DEFAULT =
      1, /*!< MLUKernelMsDeformAttnForwardDefault */
  MS_DEFORM_ATTN_FORWARD_SMALL_CHANNEL =
      2, /*!< MLUKernelMsDeformAttnForwardSmallChannel */
} MsDeformAttnForwardPolicy;

void KernelMsDeformAttnForwardDefault(
    cnrtDim3_t k_dim, cnrtFunctionType_t k_type, cnrtQueue_t queue,
    const cnrtDataType_t d_type, const char* data_value_gdram,
    const char* data_spatial_shapes_gdram,
@@ -23,7 +31,37 @@ void KernelMsDeformAttnForward(
    const int32_t batch_size, const int32_t num_keys, const int32_t num_heads,
    const int32_t channels, const int32_t num_levels, const int32_t num_queries,
    const int32_t num_points, char* data_col_gdram);

void KernelMsDeformAttnForwardSmallChannel(
    cnrtDim3_t k_dim, cnrtFunctionType_t k_type, cnrtQueue_t queue,
    const cnrtDataType_t d_type, const char* data_value_gdram,
    const char* data_spatial_shapes_gdram,
    const char* data_level_start_index_gdram,
    const char* data_sampling_loc_gdram, const char* data_attn_weight_gdram,
    const int32_t batch_size, const int32_t num_keys, const int32_t num_heads,
    const int32_t channels, const int32_t num_levels, const int32_t num_queries,
    const int32_t num_points, char* data_col_gdram);

typedef enum {
  MS_DEFORM_ATTN_BACKWARD_DEFAULT = 0,
  MS_DEFORM_ATTN_BACKWARD_SMALL_CHANNEL = 1,
} MsDeformAttnBackwardKernelPolicy;

MsDeformAttnBackwardKernelPolicy msDeformAttnBackwardPolicyFunc(
    const int32_t channels, const int32_t num_levels, const int32_t num_points,
    const int32_t num_heads) {
  const int32_t nram_size = torch_mlu::getDeviceAttr(cnrtAttrNramSizePerMcore);
  const int num_hlp = num_heads * num_levels * num_points;
  int num_per_time_theory = (nram_size - num_levels * sizeof(float) -
                             3 * num_levels * sizeof(int32_t)) /
                            sizeof(float) / (8 * PAD_UP(channels, 32) + 28) /
                            PAD_UP((num_hlp), 32);
  if (num_per_time_theory >= 1) {
    return MS_DEFORM_ATTN_BACKWARD_SMALL_CHANNEL;
  }
  return MS_DEFORM_ATTN_BACKWARD_DEFAULT;
}

void KernelMsDeformAttnBackwardDefaultKernel(
    cnrtDim3_t k_dim, cnrtFunctionType_t k_type, cnrtQueue_t queue,
    const cnrtDataType_t d_type, const float* data_value,
    const int32_t* spatial_shapes, const int32_t* data_level_start_index,
@@ -32,10 +70,23 @@ void KernelMsDeformAttnBackward(
    const int32_t num_heads, const int32_t channels, const int32_t num_levels,
    const int32_t num_queries, const int32_t num_points, float* grad_value,
    float* grad_sampling_loc, float* grad_attn_weight);

void KernelMsDeformAttnBackwardSmallChannelsKernel(
    cnrtDim3_t k_dim, cnrtFunctionType_t k_type, cnrtQueue_t queue,
    const cnrtDataType_t d_type, const float* data_value,
    const int32_t* spatial_shapes, const int32_t* data_level_start_index,
    const float* data_sampling_loc, const float* data_attn_weight,
    const float* grad_output, const int32_t batch, const int32_t spatial_size,
    const int32_t num_heads, const int32_t channels, const int32_t num_levels,
    const int32_t num_query, const int32_t num_points, float* grad_value,
    float* grad_sampling_loc, float* grad_attn_weight);

// policy function
MsDeformAttnForwardPolicy msDeformAttnForwardPolicyFunc(
    cnrtDim3_t* k_dim, cnrtFunctionType_t* k_type, const int32_t batch_size,
    const int32_t num_keys, const int32_t num_heads, const int32_t channels,
    const int32_t num_levels, const int32_t num_queries,
    const int32_t num_points) {
  k_dim->x = torch_mlu::getDeviceAttr(cnrtAttrMcorePerCluster);
  k_dim->y =
      MIN((batch_size * num_queries * num_heads + k_dim->x - 1) / k_dim->x,
@@ -46,6 +97,16 @@ static void policyFuncForward(cnrtDim3_t* k_dim, cnrtFunctionType_t* k_type,
#else
  *k_type = CNRT_FUNC_TYPE_UNION1;
#endif

  int32_t nram_size = torch_mlu::getDeviceAttr(cnrtAttrNramSizePerMcore);
  if (num_levels * num_points * 3 * sizeof(int32_t) > nram_size) {
    return MS_DEFORM_ATTN_FORWARD_DEFAULT;
  } else if (channels > nram_size / 12 / sizeof(float) || channels > 96 ||
             channels < 16) {
    return MS_DEFORM_ATTN_FORWARD_DEFAULT;
  } else {
    return MS_DEFORM_ATTN_FORWARD_SMALL_CHANNEL;
  }
}

// policy function for backward
@@ -196,7 +257,9 @@ Tensor ms_deform_attn_mlu_forward(const Tensor& value,
  // calculate task dimension
  cnrtDim3_t k_dim;
  cnrtFunctionType_t k_type;
  MsDeformAttnForwardPolicy policy = msDeformAttnForwardPolicyFunc(
      &k_dim, &k_type, batch_size, num_keys, num_heads, channels, num_levels,
      num_queries, num_points);

  // get compute queue
  auto queue = torch_mlu::getCurQueue();
@@ -222,15 +285,33 @@ Tensor ms_deform_attn_mlu_forward(const Tensor& value,
  cnrtDataType_t data_type = torch_mlu::toCnrtDtype(value.dtype());

  // launch kernel
  switch (policy) {
    default: {
      VLOG(5) << "MsDeformAttnForward Policy not supported";
    }; break;
    case MS_DEFORM_ATTN_FORWARD_DEFAULT: {
      CNLOG(INFO) << "Launch Kernel MLUKernelMsDeformAttnForwardDefault<<<"
                  << k_dim.x << ", " << k_dim.y << ", " << k_dim.z << ">>>";
      KernelMsDeformAttnForwardDefault(
          k_dim, k_type, queue, data_type, (char*)value_ptr,
          (char*)spatial_shapes_ptr, (char*)level_start_index_ptr,
          (char*)sampling_loc_ptr, (char*)attn_weight_ptr, batch_size, num_keys,
          num_heads, channels, num_levels, num_queries, num_points,
          (char*)output_ptr);
      break;
    }
    case MS_DEFORM_ATTN_FORWARD_SMALL_CHANNEL: {
      CNLOG(INFO) << "Launch Kernel MLUKernelMsDeformAttnForwardSmallChannel<<<"
                  << k_dim.x << ", " << k_dim.y << ", " << k_dim.z << ">>>";
      KernelMsDeformAttnForwardSmallChannel(
          k_dim, k_type, queue, data_type, (char*)value_ptr,
          (char*)spatial_shapes_ptr, (char*)level_start_index_ptr,
          (char*)sampling_loc_ptr, (char*)attn_weight_ptr, batch_size, num_keys,
          num_heads, channels, num_levels, num_queries, num_points,
          (char*)output_ptr);
      break;
    }
  }

  output = output.view({batch_size, num_queries, num_heads * channels});
  return output;
@@ -391,14 +472,32 @@ void ms_deform_attn_mlu_backward(
  // launch kernel
  CNLOG(INFO) << "Launch Kernel MLUKernelMsDeformAttnBackward<<<" << k_dim.x
              << ", " << k_dim.y << ", " << k_dim.z << ">>>";
  MsDeformAttnBackwardKernelPolicy kernelPolicy =
      msDeformAttnBackwardPolicyFunc(channels, num_levels, num_points,
                                     num_heads);
  switch (kernelPolicy) {
    default: {
      VLOG(5) << "NotImplemented.";
    } break;
    case MS_DEFORM_ATTN_BACKWARD_DEFAULT: {
      KernelMsDeformAttnBackwardDefaultKernel(
          k_dim, k_type, queue, data_type, (float*)value_ptr,
          (int32_t*)spatial_shapes_ptr, (int32_t*)level_start_index_ptr,
          (float*)sampling_loc_ptr, (float*)attn_weight_ptr,
          (float*)grad_output_ptr, batch_size, num_keys, num_heads, channels,
          num_levels, num_queries, num_points, (float*)grad_value_ptr,
          (float*)grad_sampling_loc_ptr, (float*)grad_attn_weight_ptr);
    } break;
    case MS_DEFORM_ATTN_BACKWARD_SMALL_CHANNEL: {
      KernelMsDeformAttnBackwardSmallChannelsKernel(
          k_dim, k_type, queue, data_type, (float*)value_ptr,
          (int32_t*)spatial_shapes_ptr, (int32_t*)level_start_index_ptr,
          (float*)sampling_loc_ptr, (float*)attn_weight_ptr,
          (float*)grad_output_ptr, batch_size, num_keys, num_heads, channels,
          num_levels, num_queries, num_points, (float*)grad_value_ptr,
          (float*)grad_sampling_loc_ptr, (float*)grad_attn_weight_ptr);
    } break;
  }
}

Tensor ms_deform_attn_impl_forward(const Tensor& value,
...
/*************************************************************************
* Copyright (C) 2021 Cambricon.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*************************************************************************/
#include "mlu_common_helper.h"
Tensor nms_rotated_mlu(Tensor boxes, Tensor scores, float iou_threshold) {
if (boxes.numel() == 0) {
return at::empty({0}, boxes.options().dtype(at::kLong));
}
int boxes_num = boxes.size(0);
auto boxes_ = torch_mlu::cnnl::ops::cnnl_contiguous(boxes);
auto scores_ = torch_mlu::cnnl::ops::cnnl_contiguous(scores);
auto output = at::empty({boxes_num}, boxes.options().dtype(at::kInt));
auto output_size = at::empty({1}, scores.options().dtype(at::kInt));
MluOpTensorDescriptor boxes_desc, scores_desc, output_desc;
boxes_desc.set(boxes_);
scores_desc.set(scores_);
output_desc.set(output);
// workspace
size_t workspace_size = 0;
auto handle = mluOpGetCurrentHandle();
mluOpGetNmsRotatedWorkspaceSize(handle, boxes_desc.desc(), &workspace_size);
auto workspace = at::empty(workspace_size, boxes.options().dtype(at::kByte));
auto boxes_impl = torch_mlu::getMluTensorImpl(boxes_);
auto boxes_ptr = boxes_impl->cnnlMalloc();
auto scores_impl = torch_mlu::getMluTensorImpl(scores_);
auto scores_ptr = scores_impl->cnnlMalloc();
auto workspace_impl = torch_mlu::getMluTensorImpl(workspace);
auto workspace_ptr = workspace_impl->cnnlMalloc();
auto output_impl = torch_mlu::getMluTensorImpl(output);
auto output_ptr = output_impl->cnnlMalloc();
auto output_size_impl = torch_mlu::getMluTensorImpl(output_size);
auto output_size_ptr = output_size_impl->cnnlMalloc();
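  // Run mlu-ops rotated NMS; output_size receives the number of kept boxes.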
mluOpNmsRotated(handle, iou_threshold, boxes_desc.desc(), boxes_ptr,
scores_desc.desc(), scores_ptr, workspace_ptr, workspace_size,
output_desc.desc(), output_ptr, (int *)output_size_ptr);
int output_num = *static_cast<int *>(output_size.cpu().data_ptr());
auto ret = output.to(boxes.options().dtype(at::kLong));
return ret.slice(0, 0, output_num);
}
/*************************************************************************
* Copyright (C) 2022 Cambricon.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*************************************************************************/
#include <torch/script.h>
#include <vector>
#include "mlu_common_helper.h"
#include "pytorch_device_registry.hpp"
#include "pytorch_mlu_helper.hpp"
template <unsigned NDim>
std::vector<torch::Tensor> GetIndicePairsForwardMLUKernelLauncher(
torch::Tensor indices, int64_t batchSize,
std::vector<int64_t> outSpatialShape, std::vector<int64_t> spatialShape,
std::vector<int64_t> kernelSize, std::vector<int64_t> stride,
std::vector<int64_t> padding, std::vector<int64_t> dilation,
std::vector<int64_t> outPadding, int64_t _subM, int64_t _transpose) {
  // The following code is copied from
  // mmcv/ops/csrc/pytorch/cuda/spconv_ops_cuda.cu to ensure the output is
  // available for network training. The outputs of this code block have the
  // correct shape but wrong values.
auto numAct = indices.size(0);
auto kernelVolume = kernelSize[0];
int sub_m = (int)_subM;
int transpose = (int)_transpose;
int batch = (int)batchSize;
auto coorDim = indices.size(1) - 1;
for (int i = 1; i < kernelSize.size(); ++i) {
kernelVolume *= kernelSize[i];
}
auto outputVolume = outSpatialShape[0];
for (int i = 1; i < outSpatialShape.size(); ++i) {
outputVolume *= outSpatialShape[i];
}
torch::Tensor indicePairs = at::full({kernelVolume, 2, numAct}, -1,
indices.options().dtype(at::kInt));
torch::Tensor indiceNum =
at::zeros({kernelVolume}, indices.options().dtype(at::kInt));
int out_size = sub_m == 1
? numAct
: std::min(numAct * kernelVolume, batch * outputVolume);
torch::Tensor out_indices =
at::zeros({out_size, coorDim + 1}, indices.options().dtype(at::kInt));
auto indices_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
indices, at::MemoryFormat::Contiguous);
auto indicePairs_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
indicePairs, at::MemoryFormat::Contiguous);
auto indiceNum_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
indiceNum, at::MemoryFormat::Contiguous);
auto out_indices_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
out_indices, at::MemoryFormat::Contiguous);
std::vector<int> input_space;
std::vector<int> filter_space;
std::vector<int> output_space;
std::vector<int> padding32;
std::vector<int> stride32;
std::vector<int> dilation32;
for (int i = 0; i < NDim; i++) {
input_space.push_back(spatialShape[i]);
filter_space.push_back(kernelSize[i]);
output_space.push_back(outSpatialShape[i]);
padding32.push_back(padding[i]);
stride32.push_back(stride[i]);
dilation32.push_back(dilation[i]);
}
MluOpTensorDescriptor indices_desc, out_indices_desc, indicePairs_desc,
indiceNum_desc;
indices_desc.set(indices_contiguous);
indicePairs_desc.set(indicePairs_contiguous);
indiceNum_desc.set(indiceNum_contiguous);
out_indices_desc.set(out_indices_contiguous);
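  // Re-describe all tensors as int32 with ARRAY layout, as expected by
  // mluOpGetIndicePairs.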
{
mluOpTensorLayout_t layout = MLUOP_LAYOUT_ARRAY;
mluOpDataType_t dtype = MLUOP_DTYPE_INT32;
std::vector<int> dims;
dims = {numAct, coorDim + 1};
mluOpSetTensorDescriptor(indices_desc.desc(), layout, dtype, dims.size(),
dims.data());
dims = {kernelVolume, 2, numAct};
mluOpSetTensorDescriptor(indicePairs_desc.desc(), layout, dtype,
dims.size(), dims.data());
dims = {kernelVolume};
mluOpSetTensorDescriptor(indiceNum_desc.desc(), layout, dtype, dims.size(),
dims.data());
dims = {out_size, coorDim + 1};
mluOpSetTensorDescriptor(out_indices_desc.desc(), layout, dtype,
dims.size(), dims.data());
}
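  // Describe the sparse convolution (batch, padding, stride, dilation,
  // input/filter/output spatial shapes, sub_m and transpose flags) for mlu-ops.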
mluOpSparseConvolutionDescriptor_t sparse_conv_desc;
mluOpCreateSparseConvolutionDescriptor(&sparse_conv_desc);
mluOpSetSparseConvolutionDescriptor(
sparse_conv_desc, NDim + 2, batch, padding32.data(), stride32.data(),
dilation32.data(), input_space.data(), filter_space.data(),
output_space.data(), sub_m, transpose, 0);
auto handle = mluOpGetCurrentHandle();
size_t workspace_size = 0;
mluOpGetIndicePairsWorkspaceSize(
handle, sparse_conv_desc, indices_desc.desc(), indicePairs_desc.desc(),
out_indices_desc.desc(), indiceNum_desc.desc(), &workspace_size);
auto indice_workspace_size =
at::empty(workspace_size, indices.options().dtype(at::kByte));
auto indices_impl = torch_mlu::getMluTensorImpl(indices_contiguous);
auto out_indices_impl = torch_mlu::getMluTensorImpl(out_indices_contiguous);
auto indicePairs_impl = torch_mlu::getMluTensorImpl(indicePairs_contiguous);
auto indiceNum_impl = torch_mlu::getMluTensorImpl(indiceNum_contiguous);
auto indice_workspace_impl =
torch_mlu::getMluTensorImpl(indice_workspace_size);
auto indices_ptr = indices_impl->cnnlMalloc();
auto out_indices_ptr = out_indices_impl->cnnlMalloc();
auto indicePairs_ptr = indicePairs_impl->cnnlMalloc();
auto indiceNum_ptr = indiceNum_impl->cnnlMalloc();
auto indice_workspace_ptr = indice_workspace_impl->cnnlMalloc();
mluOpGetIndicePairs(handle, sparse_conv_desc, indices_desc.desc(),
indices_ptr, indice_workspace_ptr, workspace_size,
indicePairs_desc.desc(), indicePairs_ptr,
out_indices_desc.desc(), out_indices_ptr,
indiceNum_desc.desc(), indiceNum_ptr);
int num_act_out = 0;
mluOpGetSparseConvolutionNumActOut(sparse_conv_desc, &num_act_out);
mluOpDestroySparseConvolutionDescriptor(sparse_conv_desc);
if (!sub_m) {
return {out_indices.slice(0, 0, num_act_out), indicePairs, indiceNum};
} else {
return {indices, indicePairs, indiceNum};
}
}
torch::Tensor IndiceConvForwardMLUKernelLauncher(
torch::Tensor features, torch::Tensor filters, torch::Tensor indicePairs,
torch::Tensor indiceNum, int64_t numActOut, int64_t _inverse,
int64_t _subM) {
auto indice_num_cpu = indiceNum.to({torch::kCPU});
auto indice_num_cpu_ptr = indice_num_cpu.data_ptr<int>();
int indice_num_len = indiceNum.numel();
int64_t indice_num[indice_num_len];
for (int i = 0; i < indice_num_len; ++i) {
indice_num[i] = static_cast<int64_t>(indice_num_cpu_ptr[i]);
}
// generate empty output
int C = filters.dim() == 4 ? filters.size(3) : filters.size(4);
torch::Tensor output =
at::zeros({numActOut, C}, features.options().dtype(at::kFloat));
// generate descriptor
auto features_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
features, at::MemoryFormat::Contiguous);
auto filters_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
filters, at::MemoryFormat::Contiguous);
auto indice_pairs_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
indicePairs, at::MemoryFormat::Contiguous);
auto output_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
output, at::MemoryFormat::Contiguous);
MluOpTensorDescriptor features_desc, filters_desc, indice_pairs_desc,
output_desc;
features_desc.set(features_contiguous);
filters_desc.set(filters_contiguous);
indice_pairs_desc.set(indice_pairs_contiguous);
output_desc.set(output_contiguous);
// set layout
{
mluOpTensorLayout_t layout;
mluOpDataType_t dtype;
int dim;
int dims[8];
// features_desc
mluOpGetTensorDescriptor(features_desc.desc(), &layout, &dtype, &dim, dims);
mluOpSetTensorDescriptor(features_desc.desc(), MLUOP_LAYOUT_ARRAY, dtype,
dim, dims);
// filters_desc
mluOpGetTensorDescriptor(filters_desc.desc(), &layout, &dtype, &dim, dims);
mluOpSetTensorDescriptor(filters_desc.desc(), MLUOP_LAYOUT_ARRAY, dtype,
dim, dims);
// indice_pairs_desc
mluOpGetTensorDescriptor(indice_pairs_desc.desc(), &layout, &dtype, &dim,
dims);
mluOpSetTensorDescriptor(indice_pairs_desc.desc(), MLUOP_LAYOUT_ARRAY,
dtype, dim, dims);
// output_desc
mluOpGetTensorDescriptor(output_desc.desc(), &layout, &dtype, &dim, dims);
mluOpSetTensorDescriptor(output_desc.desc(), MLUOP_LAYOUT_ARRAY, dtype, dim,
dims);
}
auto handle = mluOpGetCurrentHandle();
size_t workspace_size = 0;
mluOpGetIndiceConvolutionForwardWorkspaceSize(
handle, features_desc.desc(), filters_desc.desc(),
indice_pairs_desc.desc(), output_desc.desc(), indice_num, numActOut,
_inverse, _subM, &workspace_size);
auto workspace =
at::empty(workspace_size, features.options().dtype(at::kByte));
auto features_impl = torch_mlu::getMluTensorImpl(features_contiguous);
auto filters_impl = torch_mlu::getMluTensorImpl(filters_contiguous);
auto indice_pairs_impl = torch_mlu::getMluTensorImpl(indice_pairs_contiguous);
auto workspace_impl = torch_mlu::getMluTensorImpl(workspace);
auto features_ptr = features_impl->cnnlMalloc();
auto filters_ptr = filters_impl->cnnlMalloc();
auto indice_pairs_ptr = indice_pairs_impl->cnnlMalloc();
auto workspace_ptr = workspace_impl->cnnlMalloc();
// outputs
auto output_impl = torch_mlu::getMluTensorImpl(output);
auto output_ptr = output_impl->cnnlMalloc();
mluOpIndiceConvolutionForward(
handle, features_desc.desc(), features_ptr, filters_desc.desc(),
filters_ptr, indice_pairs_desc.desc(), indice_pairs_ptr, indice_num,
numActOut, _inverse, _subM, workspace_ptr, workspace_size,
output_desc.desc(), output_ptr);
return output;
}
std::vector<torch::Tensor> IndiceConvBackwardMLUKernelLauncher(
torch::Tensor features, torch::Tensor filters, torch::Tensor outGrad,
torch::Tensor indicePairs, torch::Tensor indiceNum, int64_t _inverse,
int64_t _subM) {
auto indice_num_cpu = indiceNum.to({torch::kCPU});
auto indice_num_cpu_ptr = indice_num_cpu.data_ptr<int>();
int indice_num_len = indiceNum.numel();
int64_t indice_num[indice_num_len];
for (int i = 0; i < indice_num_len; ++i) {
indice_num[i] = static_cast<int64_t>(indice_num_cpu_ptr[i]);
}
// generate empty input_grad
torch::Tensor input_grad = at::zeros({features.size(0), features.size(1)},
features.options().dtype(at::kFloat));
torch::Tensor filters_grad;
if (filters.dim() == 4) {
int h = filters.size(0);
int w = filters.size(1);
int c = filters.size(2);
int n = filters.size(3);
filters_grad = at::zeros({h, w, c, n}, filters.options().dtype(at::kFloat));
} else if (filters.dim() == 5) {
int d = filters.size(0);
int h = filters.size(1);
int w = filters.size(2);
int c = filters.size(3);
int n = filters.size(4);
filters_grad =
at::zeros({d, h, w, c, n}, filters.options().dtype(at::kFloat));
}
auto features_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
features, at::MemoryFormat::Contiguous);
auto filters_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
filters, at::MemoryFormat::Contiguous);
auto output_grad_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
outGrad, at::MemoryFormat::Contiguous);
auto indice_pairs_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
indicePairs, at::MemoryFormat::Contiguous);
auto input_grad_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
features, at::MemoryFormat::Contiguous);
auto filters_grad_contiguous = torch_mlu::cnnl::ops::cnnl_contiguous(
filters, at::MemoryFormat::Contiguous);
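// Note: the contiguous tensors used to build the input_grad and filters_grad
// descriptors are created from features and filters, which have the same
// shapes as the corresponding gradient tensors; only the descriptor metadata
// is needed here, while the actual gradient buffers are taken from
// input_grad and filters_grad further below.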
MluOpTensorDescriptor features_desc, output_grad_desc, filters_desc,
indice_pairs_desc, input_grad_desc, filters_grad_desc;
features_desc.set(features_contiguous);
filters_desc.set(filters_contiguous);
output_grad_desc.set(output_grad_contiguous);
indice_pairs_desc.set(indice_pairs_contiguous);
input_grad_desc.set(input_grad_contiguous);
filters_grad_desc.set(filters_grad_contiguous);
// need to set desc layout with mluOp functions
{
mluOpTensorLayout_t layout;
mluOpDataType_t dtype;
int dim;
int dims[8];
// features_desc
mluOpGetTensorDescriptor(features_desc.desc(), &layout, &dtype, &dim, dims);
mluOpSetTensorDescriptor(features_desc.desc(), MLUOP_LAYOUT_ARRAY, dtype,
dim, dims);
// filters_desc
mluOpGetTensorDescriptor(filters_desc.desc(), &layout, &dtype, &dim, dims);
if (dim == 4) {
mluOpSetTensorDescriptor(filters_desc.desc(), MLUOP_LAYOUT_HWCN, dtype,
dim, dims);
} else {
mluOpSetTensorDescriptor(filters_desc.desc(), MLUOP_LAYOUT_ARRAY, dtype,
dim, dims);
}
// output_grad_desc
mluOpGetTensorDescriptor(output_grad_desc.desc(), &layout, &dtype, &dim,
dims);
mluOpSetTensorDescriptor(output_grad_desc.desc(), MLUOP_LAYOUT_ARRAY, dtype,
dim, dims);
// indice_pairs_desc
mluOpGetTensorDescriptor(indice_pairs_desc.desc(), &layout, &dtype, &dim,
dims);
mluOpSetTensorDescriptor(indice_pairs_desc.desc(), MLUOP_LAYOUT_ARRAY,
dtype, dim, dims);
// input_grad_desc
mluOpGetTensorDescriptor(input_grad_desc.desc(), &layout, &dtype, &dim,
dims);
mluOpSetTensorDescriptor(input_grad_desc.desc(), MLUOP_LAYOUT_ARRAY, dtype,
dim, dims);
}
auto handle = mluOpGetCurrentHandle();
size_t data_workspace_size = 0;
mluOpGetIndiceConvolutionBackwardDataWorkspaceSize(
handle, output_grad_desc.desc(), filters_desc.desc(),
indice_pairs_desc.desc(), input_grad_desc.desc(), indice_num, _inverse,
&data_workspace_size);
size_t filters_workspace_size = 0;
mluOpGetIndiceConvolutionBackwardFilterWorkspaceSize(
handle, features_desc.desc(), output_grad_desc.desc(),
indice_pairs_desc.desc(), filters_grad_desc.desc(), indice_num, _inverse,
_subM, &filters_workspace_size);
auto indice_convbpdata_workspace =
at::empty(data_workspace_size, features.options().dtype(at::kByte));
auto indice_convbpfilter_workspace =
at::empty(filters_workspace_size, filters.options().dtype(at::kByte));
auto features_impl = torch_mlu::getMluTensorImpl(features_contiguous);
auto filters_impl = torch_mlu::getMluTensorImpl(filters_contiguous);
auto output_grad_impl = torch_mlu::getMluTensorImpl(output_grad_contiguous);
auto indice_pairs_impl = torch_mlu::getMluTensorImpl(indice_pairs_contiguous);
auto indice_convbpdata_workspace_impl =
torch_mlu::getMluTensorImpl(indice_convbpdata_workspace);
auto indice_convbpfilter_workspace_impl =
torch_mlu::getMluTensorImpl(indice_convbpfilter_workspace);
auto features_ptr = features_impl->cnnlMalloc();
auto filters_ptr = filters_impl->cnnlMalloc();
auto output_grad_ptr = output_grad_impl->cnnlMalloc();
auto indice_pairs_ptr = indice_pairs_impl->cnnlMalloc();
auto indice_convbpdata_workspace_ptr =
indice_convbpdata_workspace_impl->cnnlMalloc();
auto indice_convbpfilter_workspace_ptr =
indice_convbpfilter_workspace_impl->cnnlMalloc();
// outputs
auto input_grad_impl = torch_mlu::getMluTensorImpl(input_grad);
auto input_grad_ptr = input_grad_impl->cnnlMalloc();
auto filters_grad_impl = torch_mlu::getMluTensorImpl(filters_grad);
auto filters_grad_ptr = filters_grad_impl->cnnlMalloc();
mluOpIndiceConvolutionBackwardData(
handle, output_grad_desc.desc(), output_grad_ptr, filters_desc.desc(),
filters_ptr, indice_pairs_desc.desc(), indice_pairs_ptr, indice_num,
_inverse, _subM, indice_convbpdata_workspace_ptr, data_workspace_size,
input_grad_desc.desc(), input_grad_ptr);
mluOpIndiceConvolutionBackwardFilter(
handle, features_desc.desc(), features_ptr, output_grad_desc.desc(),
output_grad_ptr, indice_pairs_desc.desc(), indice_pairs_ptr, indice_num,
_inverse, _subM, indice_convbpfilter_workspace_ptr,
filters_workspace_size, filters_grad_desc.desc(), filters_grad_ptr);
std::vector<torch::Tensor> result;
result.push_back(input_grad);
result.push_back(filters_grad);
return result;
}
torch::Tensor indice_conv_forward_mlu(torch::Tensor features,
torch::Tensor filters,
torch::Tensor indicePairs,
torch::Tensor indiceNum,
int64_t numActOut, int64_t _inverse,
int64_t _subM) {
return IndiceConvForwardMLUKernelLauncher(
features, filters, indicePairs, indiceNum, numActOut, _inverse, _subM);
}
std::vector<torch::Tensor> indice_conv_backward_mlu(
torch::Tensor features, torch::Tensor filters, torch::Tensor outGrad,
torch::Tensor indicePairs, torch::Tensor indiceNum, int64_t _inverse,
int64_t _subM) {
return IndiceConvBackwardMLUKernelLauncher(
features, filters, outGrad, indicePairs, indiceNum, _inverse, _subM);
}
torch::Tensor indice_conv_forward_impl(torch::Tensor features,
torch::Tensor filters,
torch::Tensor indicePairs,
torch::Tensor indiceNum,
int64_t numActOut, int64_t _inverse,
int64_t _subM);
std::vector<torch::Tensor> indice_conv_backward_impl(
torch::Tensor features, torch::Tensor filters, torch::Tensor outGrad,
torch::Tensor indicePairs, torch::Tensor indiceNum, int64_t _inverse,
int64_t _subM);
REGISTER_DEVICE_IMPL(indice_conv_forward_impl, MLU, indice_conv_forward_mlu);
REGISTER_DEVICE_IMPL(indice_conv_backward_impl, MLU, indice_conv_backward_mlu);
template std::vector<torch::Tensor> GetIndicePairsForwardMLUKernelLauncher<2>(
torch::Tensor indices, int64_t batchSize,
std::vector<int64_t> outSpatialShape, std::vector<int64_t> spatialShape,
std::vector<int64_t> kernelSize, std::vector<int64_t> stride,
std::vector<int64_t> padding, std::vector<int64_t> dilation,
std::vector<int64_t> outPadding, int64_t _subM, int64_t _transpose);
template std::vector<torch::Tensor> GetIndicePairsForwardMLUKernelLauncher<3>(
torch::Tensor indices, int64_t batchSize,
std::vector<int64_t> outSpatialShape, std::vector<int64_t> spatialShape,
std::vector<int64_t> kernelSize, std::vector<int64_t> stride,
std::vector<int64_t> padding, std::vector<int64_t> dilation,
std::vector<int64_t> outPadding, int64_t _subM, int64_t _transpose);
template std::vector<torch::Tensor> GetIndicePairsForwardMLUKernelLauncher<4>(
torch::Tensor indices, int64_t batchSize,
std::vector<int64_t> outSpatialShape, std::vector<int64_t> spatialShape,
std::vector<int64_t> kernelSize, std::vector<int64_t> stride,
std::vector<int64_t> padding, std::vector<int64_t> dilation,
std::vector<int64_t> outPadding, int64_t _subM, int64_t _transpose);
/*************************************************************************
* Copyright (C) 2022 by Cambricon.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*************************************************************************/
#include "pytorch_device_registry.hpp"
#include "pytorch_mlu_helper.hpp"
#define MIN(a, b) (((a) < (b)) ? (a) : (b))
void KernelDynamicVoxelize(
cnrtDim3_t k_dim, cnrtFunctionType_t k_type, cnrtQueue_t queue,
const void *points, void *coors, const float voxel_x, const float voxel_y,
const float voxel_z, const float coors_x_min, const float coors_y_min,
const float coors_z_min, const float coors_x_max, const float coors_y_max,
const float coors_z_max, const int32_t grid_x, const int32_t grid_y,
const int32_t grid_z, const int32_t num_points, const int32_t num_features);
void KernelPoint2Voxel(cnrtDim3_t k_dim, cnrtFunctionType_t k_type,
cnrtQueue_t queue, void *coors, void *point_to_pointidx,
void *point_to_voxelidx, const int32_t num_points,
const int32_t max_points);
void KernelCalcPointsPerVoxel(cnrtDim3_t k_dim, cnrtFunctionType_t k_type,
cnrtQueue_t queue, void *point_to_pointidx,
void *point_to_voxelidx, void *coor_to_voxelidx,
void *num_points_per_voxel, void *voxel_num,
const int32_t max_voxels,
const int32_t num_points);
void KernelAssignVoxelsCoors(cnrtDim3_t k_dim, cnrtFunctionType_t k_type,
cnrtQueue_t queue, const void *points,
void *temp_coors, void *point_to_voxelidx,
void *coor_to_voxelidx, void *voxels, void *coors,
const int32_t max_points, const int32_t num_points,
const int32_t num_features);
// policy function
static void policyFuncDefault(cnrtDim3_t *k_dim, cnrtFunctionType_t *k_type,
const int num_points) {
k_dim->x = torch_mlu::getDeviceAttr(cnrtAttrMcorePerCluster);
k_dim->y = MIN((num_points + k_dim->x - 1) / k_dim->x,
torch_mlu::getDeviceAttr(cnrtAttrClusterCount));
k_dim->z = 1;
*k_type = CNRT_FUNC_TYPE_UNION1;
}
// policy function
static void policyFuncCalcPointsPerVoxel(cnrtDim3_t *k_dim,
cnrtFunctionType_t *k_type,
const int num_points) {
k_dim->x = 1;
k_dim->y = 1;
k_dim->z = 1;
*k_type = CNRT_FUNC_TYPE_BLOCK;
}
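As a rough numerical illustration of policyFuncDefault above (the device attribute values here are assumed, not queried from hardware):

# Illustrative arithmetic only: mirrors policyFuncDefault with assumed values
# for cnrtAttrMcorePerCluster (4) and cnrtAttrClusterCount (8).
cores_per_cluster = 4
cluster_count = 8
num_points = 100000

k_dim_x = cores_per_cluster
k_dim_y = min((num_points + k_dim_x - 1) // k_dim_x, cluster_count)  # ceil-divide, capped by cluster count
k_dim_z = 1
print(k_dim_x, k_dim_y, k_dim_z)  # -> 4 8 1, launched as a UNION1 task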
int HardVoxelizeForwardMLUKernelLauncher(
const at::Tensor &points, at::Tensor &voxels, at::Tensor &coors,
at::Tensor &num_points_per_voxel, const std::vector<float> voxel_size,
const std::vector<float> coors_range, const int max_points,
const int max_voxels, const int NDim = 3) {
// check datatype
TORCH_CHECK(points.scalar_type() == at::kFloat,
"points type should be Float, got ", points.scalar_type(), ".");
TORCH_CHECK(voxels.scalar_type() == at::kFloat,
"voxels type should be Float, got ", voxels.scalar_type(), ".");
TORCH_CHECK(coors.scalar_type() == at::kInt,
"coors type should be Int, got ", coors.scalar_type(), ".");
TORCH_CHECK(num_points_per_voxel.scalar_type() == at::kInt,
"num_points_per_voxel type should be Int, got ",
num_points_per_voxel.scalar_type(), ".");
// check shape
TORCH_CHECK(points.dim() == 2, "points should be a 2d tensor, got ",
points.dim(), "D.");
TORCH_CHECK(voxels.dim() == 3, "voxels should be a 3d tensor, got ",
voxels.dim(), "D.");
TORCH_CHECK(coors.dim() == 2, "coors should be a 2d tensor, got ",
coors.dim(), "D.");
TORCH_CHECK(num_points_per_voxel.dim() == 1,
"num_points_per_voxel should be a 1d tensor, got ",
num_points_per_voxel.dim(), "D.");
const int num_points = points.size(0);
const int num_features = points.size(1);
TORCH_CHECK(points.size(0) == num_points,
"the 1st dimension of points should be num_points, got ",
points.size(0), ".");
TORCH_CHECK(points.size(1) == num_features,
"the 2nd dimension of points should be num_features, got ",
points.size(1), ".");
TORCH_CHECK(voxels.size(0) == max_voxels,
"the 1st dimension of voxels should be max_voxels, got ",
voxels.size(0), ".");
TORCH_CHECK(voxels.size(1) == max_points,
"the 2nd dimension of voxels should be max_points, got ",
voxels.size(1), ".");
TORCH_CHECK(voxels.size(2) == num_features,
"the 3rd dimension of voxels should be num_features, got ",
voxels.size(2), ".");
TORCH_CHECK(coors.size(0) == max_voxels,
"the 1st dimension of coors should be max_voxels, got ",
coors.size(0), ".");
TORCH_CHECK(coors.size(1) == 3,
"the 2nd dimension of coors should be 3, got ", coors.size(1),
".");
TORCH_CHECK(num_points_per_voxel.size(0) == max_voxels,
"the 1st dimension of num_points_per_voxel should be max_voxels, got ",
num_points_per_voxel.size(0), ".");
// large tensor check
const size_t max_input_size = 2147483648;
TORCH_CHECK(points.numel() < max_input_size,
"points element num should be less than 2^31, got ",
points.numel(), ".");
TORCH_CHECK(voxels.numel() < max_input_size,
"voxels element num should be less than 2^31, got ",
voxels.numel(), ".");
TORCH_CHECK(coors.numel() < max_input_size,
"coors element num should be less than 2^31, got ", coors.numel(),
".");
// check zero element
if (max_points == 0 || max_voxels == 0) {
return 0;
}
// get compute queue
auto queue = torch_mlu::getCurQueue();
// get ptr of tensors
auto points_ = points.contiguous();
auto points_impl = torch_mlu::getMluTensorImpl(points_);
auto points_ptr = points_impl->cnnlMalloc();
auto voxels_ = voxels.contiguous();
auto voxels_impl = torch_mlu::getMluTensorImpl(voxels_);
auto voxels_ptr = voxels_impl->cnnlMalloc();
auto coors_ = coors.contiguous();
auto coors_impl = torch_mlu::getMluTensorImpl(coors_);
auto coors_ptr = coors_impl->cnnlMalloc();
auto num_points_per_voxel_ = num_points_per_voxel.contiguous();
auto num_points_per_voxel_impl =
torch_mlu::getMluTensorImpl(num_points_per_voxel_);
auto num_points_per_voxel_ptr = num_points_per_voxel_impl->cnnlMalloc();
// calculate task dimension
cnrtDim3_t k_dim;
cnrtFunctionType_t k_type;
policyFuncDefault(&k_dim, &k_type, num_points);
// 1. link point to corresponding voxel coors
const float voxel_x = voxel_size[0];
const float voxel_y = voxel_size[1];
const float voxel_z = voxel_size[2];
const float coors_x_min = coors_range[0];
const float coors_y_min = coors_range[1];
const float coors_z_min = coors_range[2];
const float coors_x_max = coors_range[3];
const float coors_y_max = coors_range[4];
const float coors_z_max = coors_range[5];
const int grid_x = round((coors_x_max - coors_x_min) / voxel_x);
const int grid_y = round((coors_y_max - coors_y_min) / voxel_y);
const int grid_z = round((coors_z_max - coors_z_min) / voxel_z);
auto temp_coors =
at::zeros({NDim, num_points}, points.options().dtype(at::kInt))
.contiguous();
auto temp_coors_impl = torch_mlu::getMluTensorImpl(temp_coors);
auto temp_coors_ptr = temp_coors_impl->cnnlMalloc();
KernelDynamicVoxelize(k_dim, k_type, queue, points_ptr, temp_coors_ptr,
voxel_x, voxel_y, voxel_z, coors_x_min, coors_y_min,
coors_z_min, coors_x_max, coors_y_max, coors_z_max,
grid_x, grid_y, grid_z, num_points, num_features);
// 2. map point to the idx of the corresponding voxel, find duplicate coor
auto point_to_pointidx =
at::zeros({num_points}, points.options().dtype(at::kInt)).contiguous();
auto point_to_pointidx_impl = torch_mlu::getMluTensorImpl(point_to_pointidx);
auto point_to_pointidx_ptr = point_to_pointidx_impl->cnnlMalloc();
auto point_to_voxelidx =
at::zeros({num_points}, points.options().dtype(at::kInt)).contiguous();
auto point_to_voxelidx_impl = torch_mlu::getMluTensorImpl(point_to_voxelidx);
auto point_to_voxelidx_ptr = point_to_voxelidx_impl->cnnlMalloc();
KernelPoint2Voxel(k_dim, k_type, queue, temp_coors_ptr, point_to_pointidx_ptr,
point_to_voxelidx_ptr, num_points, max_points);
// calculate task dimension
cnrtDim3_t k_dim_calc_points_per_voxel;
cnrtFunctionType_t k_type_calc_points_per_voxel;
policyFuncCalcPointsPerVoxel(&k_dim_calc_points_per_voxel,
&k_type_calc_points_per_voxel, num_points);
// 3. determine voxel num and voxel's coor index
auto coor_to_voxelidx =
at::zeros({num_points}, points.options().dtype(at::kInt)).contiguous();
auto coor_to_voxelidx_impl = torch_mlu::getMluTensorImpl(coor_to_voxelidx);
auto coor_to_voxelidx_ptr = coor_to_voxelidx_impl->cnnlMalloc();
auto voxel_num = at::zeros({1}, points.options().dtype(at::kInt)).contiguous();
auto voxel_num_impl = torch_mlu::getMluTensorImpl(voxel_num);
auto voxel_num_ptr = voxel_num_impl->cnnlMalloc();
KernelCalcPointsPerVoxel(
k_dim_calc_points_per_voxel, k_type_calc_points_per_voxel, queue,
point_to_pointidx_ptr, point_to_voxelidx_ptr, coor_to_voxelidx_ptr,
num_points_per_voxel_ptr, voxel_num_ptr, max_voxels, num_points);
// 4. copy point features and coors of each voxels to voxels
KernelAssignVoxelsCoors(k_dim, k_type, queue, points_ptr, temp_coors_ptr,
point_to_voxelidx_ptr, coor_to_voxelidx_ptr,
voxels_ptr, coors_ptr, max_points, num_points,
num_features);
auto voxel_num_cpu = voxel_num.to(at::kCPU);
int voxel_num_int = voxel_num_cpu.data_ptr<int>()[0];
return voxel_num_int;
}
int hard_voxelize_forward_mlu(const at::Tensor &points, at::Tensor &voxels,
at::Tensor &coors,
at::Tensor &num_points_per_voxel,
const std::vector<float> voxel_size,
const std::vector<float> coors_range,
const int max_points, const int max_voxels,
const int NDim) {
return HardVoxelizeForwardMLUKernelLauncher(
points, voxels, coors, num_points_per_voxel, voxel_size, coors_range,
max_points, max_voxels, NDim);
}
int hard_voxelize_forward_impl(const at::Tensor &points, at::Tensor &voxels,
at::Tensor &coors,
at::Tensor &num_points_per_voxel,
const std::vector<float> voxel_size,
const std::vector<float> coors_range,
const int max_points, const int max_voxels,
const int NDim);
REGISTER_DEVICE_IMPL(hard_voxelize_forward_impl, MLU,
hard_voxelize_forward_mlu);
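A minimal Python-level usage sketch for the hard-voxelization path registered above; the parameter values, point-cloud shape, and the 'mlu' device string are illustrative assumptions.

# Illustrative sketch only: hard voxelization on MLU via mmcv.ops.Voxelization.
import torch
import torch_mlu  # noqa: F401, assumed to register the 'mlu' device
from mmcv.ops import Voxelization

voxel_layer = Voxelization(
    voxel_size=[0.5, 0.5, 0.5],
    point_cloud_range=[0, -40, -3, 70.4, 40, 1],
    max_num_points=32,   # max_points per voxel
    max_voxels=16000)

points = torch.rand(10000, 4).to('mlu')  # [num_points, num_features]
voxels, coors, num_points_per_voxel = voxel_layer(points)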
@@ -17,6 +17,11 @@ Tensor nms_rotated_npu(const Tensor dets, const Tensor scores,
                        const Tensor labels, const float iou_threshold);
 #endif
 
+#ifdef MMCV_WITH_MLU
+Tensor nms_rotated_mlu(const Tensor dets, const Tensor scores,
+                       const float iou_threshold);
+#endif
+
 // Interface for Python
 // inline is needed to prevent multiple function definitions when this header is
 // included by different cpps
@@ -36,6 +41,10 @@ Tensor nms_rotated(const Tensor dets, const Tensor scores, const Tensor order,
     return nms_rotated_npu(dets, scores, labels, iou_threshold);
 #else
     AT_ERROR("Not compiled with NPU support");
 #endif
+#ifdef MMCV_WITH_MLU
+  } else if (dets.device().type() == at::kMLU) {
+    return nms_rotated_mlu(dets, scores, iou_threshold);
+#endif
   }
@@ -35,6 +35,26 @@ std::vector<torch::Tensor> get_indice_pairs_forward_cuda(
       padding, dilation, outPadding, _subM, _transpose);
 };
 
+template <unsigned NDim>
+std::vector<torch::Tensor> GetIndicePairsForwardMLUKernelLauncher(
+    torch::Tensor indices, int64_t batchSize,
+    std::vector<int64_t> outSpatialShape, std::vector<int64_t> spatialShape,
+    std::vector<int64_t> kernelSize, std::vector<int64_t> stride,
+    std::vector<int64_t> padding, std::vector<int64_t> dilation,
+    std::vector<int64_t> outPadding, int64_t _subM, int64_t _transpose);
+
+template <unsigned NDim>
+std::vector<torch::Tensor> get_indice_pairs_forward_mlu(
+    torch::Tensor indices, int64_t batchSize,
+    std::vector<int64_t> outSpatialShape, std::vector<int64_t> spatialShape,
+    std::vector<int64_t> kernelSize, std::vector<int64_t> stride,
+    std::vector<int64_t> padding, std::vector<int64_t> dilation,
+    std::vector<int64_t> outPadding, int64_t _subM, int64_t _transpose) {
+  return GetIndicePairsForwardMLUKernelLauncher<NDim>(
+      indices, batchSize, outSpatialShape, spatialShape, kernelSize, stride,
+      padding, dilation, outPadding, _subM, _transpose);
+}
+
 template <unsigned NDim>
 std::vector<torch::Tensor> GetIndicePairsBackwardCUDAKernelLauncher(
     torch::Tensor indices, torch::Tensor gridOut, int64_t batchSize,
@@ -71,6 +91,12 @@ std::vector<torch::Tensor> get_indice_pairs_forward(
         padding, dilation, outPadding, _subM, _transpose);
 #else
     AT_ERROR("get_indice_pairs is not compiled with GPU support");
 #endif
+#ifdef MMCV_WITH_MLU
+  } else if (indices.device().type() == at::kMLU) {
+    return get_indice_pairs_forward_mlu<NDim>(
+        indices, batchSize, outSpatialShape, spatialShape, kernelSize, stride,
+        padding, dilation, outPadding, _subM, _transpose);
+#endif
   } else {
     AT_ERROR("get_indice_pairs is not implemented on CPU");
@@ -410,11 +410,12 @@ def nms_rotated(dets: Tensor,
         input_labels = scores.new_empty(0, dtype=torch.int)
     else:
         input_labels = labels
-    if dets.device.type == 'npu':
+    if dets.device.type in ('npu', 'mlu'):
         order = scores.new_empty(0, dtype=torch.long)
-        coefficient = 57.29578  # 180 / PI
-        for i in range(dets.size()[0]):
-            dets_cw[i][4] *= coefficient  # radians to angle
+        if dets.device.type == 'npu':
+            coefficient = 57.29578  # 180 / PI
+            for i in range(dets.size()[0]):
+                dets_cw[i][4] *= coefficient  # radians to angle
         keep_inds = ext_module.nms_rotated(dets_cw, scores, order, dets_cw,
                                            input_labels, iou_threshold,
                                            multi_label)
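For reference, a short hedged sketch of how the modified nms_rotated wrapper above would be exercised on an MLU device (tensor contents and the 'mlu' device string are assumptions):

# Illustrative sketch only: rotated NMS on MLU through the public wrapper.
# Note that, per the change above, angles stay in radians on MLU.
import torch
import torch_mlu  # noqa: F401, assumed to register the 'mlu' device
from mmcv.ops import nms_rotated

dets = torch.rand(200, 5).to('mlu')   # [x_ctr, y_ctr, w, h, angle (radians)]
scores = torch.rand(200).to('mlu')
dets_kept, keep_inds = nms_rotated(dets, scores, iou_threshold=0.5)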
@@ -211,6 +211,7 @@ def get_extensions():
     include_dirs = []
+    extra_objects = []
     is_rocm_pytorch = False
     try:
         from torch.utils.cpp_extension import ROCM_HOME
@@ -238,16 +239,98 @@ def get_extensions():
             torch.is_mlu_available()) or \
             os.getenv('FORCE_MLU', '0') == '1':
         from torch_mlu.utils.cpp_extension import MLUExtension
+
+        def get_mluops_version(file_path):
+            with open(file_path) as f:
+                for line in f:
+                    if re.search('MLUOP_MAJOR', line):
+                        major = line.strip().split(' ')[2]
+                    if re.search('MLUOP_MINOR', line):
+                        minor = line.strip().split(' ')[2]
+                    if re.search('MLUOP_PATCHLEVEL', line):
+                        patchlevel = line.strip().split(' ')[2]
+            mluops_version = f'v{major}.{minor}.{patchlevel}'
+            return mluops_version
+
+        mmcv_mluops_version = get_mluops_version(
+            './mmcv/ops/csrc/pytorch/mlu/mlu_common_helper.h')
+        mlu_ops_path = os.getenv('MMCV_MLU_OPS_PATH')
+        if mlu_ops_path:
+            exists_mluops_version = get_mluops_version(
+                mlu_ops_path + '/bangc-ops/mlu_op.h')
+            if exists_mluops_version != mmcv_mluops_version:
+                print('the version of mlu-ops provided is %s,'
+                      ' while %s is needed.' %
+                      (exists_mluops_version, mmcv_mluops_version))
+                exit()
+            try:
+                if os.path.exists('mlu-ops'):
+                    if os.path.islink('mlu-ops'):
+                        os.remove('mlu-ops')
+                        os.symlink(mlu_ops_path, 'mlu-ops')
+                    elif os.path.abspath('mlu-ops') != mlu_ops_path:
+                        os.symlink(mlu_ops_path, 'mlu-ops')
+                else:
+                    os.symlink(mlu_ops_path, 'mlu-ops')
+            except Exception:
+                raise FileExistsError(
+                    'mlu-ops already exists, please move it out,'
+                    'or rename or remove it.')
+        else:
+            if not os.path.exists('mlu-ops'):
+                import requests
+                mluops_url = 'https://github.com/Cambricon/mlu-ops/' + \
+                    'archive/refs/tags/' + mmcv_mluops_version + '.zip'
+                req = requests.get(mluops_url)
+                with open('./mlu-ops.zip', 'wb') as f:
+                    try:
+                        f.write(req.content)
+                    except Exception:
+                        raise ImportError('failed to download mlu-ops')
+                from zipfile import BadZipFile, ZipFile
+                with ZipFile('./mlu-ops.zip', 'r') as archive:
+                    try:
+                        archive.extractall()
+                        dir_name = archive.namelist()[0].split('/')[0]
+                        os.rename(dir_name, 'mlu-ops')
+                    except BadZipFile:
+                        print('invalid mlu-ops.zip file')
+            else:
+                exists_mluops_version = get_mluops_version(
+                    './mlu-ops/bangc-ops/mlu_op.h')
+                if exists_mluops_version != mmcv_mluops_version:
+                    print('the version of provided mlu-ops is %s,'
+                          ' while %s is needed.' %
+                          (exists_mluops_version, mmcv_mluops_version))
+                    exit()
+
         define_macros += [('MMCV_WITH_MLU', None)]
-        mlu_args = os.getenv('MMCV_MLU_ARGS')
-        extra_compile_args['cncc'] = [mlu_args] if mlu_args else []
+        mlu_args = os.getenv('MMCV_MLU_ARGS', '-DNDEBUG ')
+        mluops_includes = []
+        mluops_includes.append('-I' +
+                               os.path.abspath('./mlu-ops/bangc-ops'))
+        mluops_includes.append(
+            '-I' + os.path.abspath('./mlu-ops/bangc-ops/kernels'))
+        extra_compile_args['cncc'] = [mlu_args] + \
+            mluops_includes if mlu_args else mluops_includes
+        extra_compile_args['cxx'] += ['-fno-gnu-unique']
         op_files = glob.glob('./mmcv/ops/csrc/pytorch/*.cpp') + \
             glob.glob('./mmcv/ops/csrc/pytorch/cpu/*.cpp') + \
             glob.glob('./mmcv/ops/csrc/pytorch/mlu/*.cpp') + \
-            glob.glob('./mmcv/ops/csrc/common/mlu/*.mlu')
+            glob.glob('./mmcv/ops/csrc/common/mlu/*.mlu') + \
+            glob.glob(
+                './mlu-ops/bangc-ops/core/**/*.cpp', recursive=True) + \
+            glob.glob(
+                './mlu-ops/bangc-ops/kernels/**/*.cpp', recursive=True) + \
+            glob.glob(
+                './mlu-ops/bangc-ops/kernels/**/*.mlu', recursive=True)
+        extra_objects = glob.glob(
+            './mlu-ops/bangc-ops/kernels/kernel_wrapper/*.o')
         extension = MLUExtension
         include_dirs.append(os.path.abspath('./mmcv/ops/csrc/common'))
         include_dirs.append(os.path.abspath('./mmcv/ops/csrc/common/mlu'))
+        include_dirs.append(os.path.abspath('./mlu-ops/bangc-ops'))
     elif (hasattr(torch.backends, 'mps')
           and torch.backends.mps.is_available()) or os.getenv(
               'FORCE_MPS', '0') == '1':
@@ -309,6 +392,7 @@ def get_extensions():
             sources=op_files,
             include_dirs=include_dirs,
             define_macros=define_macros,
+            extra_objects=extra_objects,
             extra_compile_args=extra_compile_args)
         extensions.append(ext_ops)
     return extensions
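The setup.py changes above consult several environment variables; the snippet below is a hedged illustration of setting them before a build (the paths are placeholders, only the variable names come from the diff):

# Illustrative only: environment knobs used by the MLU branch of get_extensions().
import os

os.environ['FORCE_MLU'] = '1'                         # force the MLU build path
os.environ['MMCV_MLU_OPS_PATH'] = '/path/to/mlu-ops'  # reuse a local mlu-ops checkout (placeholder path)
os.environ['MMCV_MLU_ARGS'] = '-DNDEBUG '             # extra cncc flags (the diff's default)
# then build as usual, e.g. `pip install -v -e .`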