"docs/vscode:/vscode.git/clone" did not exist on "a8f4e97bd5356a7a77510cdf6a3a62e25a5dc602"
Unverified Commit 362a90f8 authored by Jiazhen Wang, committed by GitHub

[Feature] Add several MLU ops (#1563)



* [Feature] Add roiaware pool3d ops from mmdet3d (#1382)

* add ops (roiaware pool3d) in mmdet3d

* refactor code

* fix typo
Co-authored-by: zhouzaida <zhouzaida@163.com>

* [Feature] Add iou3d op from mmdet3d (#1356)

* add ops (iou3d) in mmdet3d

* add unit test

* refactor code

* refactor code

* refactor code

* refactor code

* refactor code
Co-authored-by: zhouzaida <zhouzaida@163.com>

* [Fix] Update test data for test_iou3d (#1427)

* Update test data for test_iou3d

* delete blank lines
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>

* [Feature] Add group points ops from mmdet3d (#1415)

* add op (group points) and its related ops (ball query and knn) in mmdet3d

* refactor code

* fix typo

* refactor code

* fix typo

* refactor code

* make input contiguous
Co-authored-by: zhouzaida <zhouzaida@163.com>

* add mmdet3d op (#1425)
Co-authored-by: zhouzaida <zhouzaida@163.com>

* [Feature] Loading objects from different backends and dumping objects to different backends (#1330)

* [Feature] Choose storage backend by the prefix of filepath

* refactor FileClient and add unittest

* support loading from different backends

* polish docstring

* fix unittest

* rename attribute str_like_obj to is_str_like_obj

* add infer_client method

* add check_exist method

* rename var client to file_client

* polish docstring

* add join_paths method

* remove join_paths and add _format_path

* enhance unittest

* refactor unittest

* singleton pattern

* fix test_clientio.py

* deprecate CephBackend

* enhance docstring

* refactor unittest for petrel

* refactor unittest for disk backend

* update io.md

* add concat_paths method

* improve docstring

* improve docstring

* add isdir and copyfile for file backend

* delete copyfile and add get_local_path

* remove isdir method of petrel

* fix typo

* add comment and polish docstring

* polish docstring

* rename _path_mapping to _map_path

* polish docstring and fix typo

* refactor get_local_path

* add list_dir_or_file for FileClient

* add list_dir_or_file for PetrelBackend

* fix windows ci

* Add return docstring

* polish docstring

* fix typo

* fix typo

* deprecate the conversion from Path to str

* add docs for loading checkpoints with FileClient

* refactor map_path

* add _ensure_methods to ensure methods have been implemented

* fix list_dir_or_file

* rename _ensure_method_implemented to has_method

* Add CI for pytorch 1.10 (#1431)

* [Feature] Upload checkpoints and logs to ceph (#1375)

* [Feature] Choose storage backend by the prefix of filepath

* refactor FileClient and add unittest

* support loading from different backends

* polish docstring

* fix unittest

* rename attribute str_like_obj to is_str_like_obj

* [Docs] Upload checkpoint to petrel oss

* add infer_client method

* Support uploading checkpoint to petrel oss

* add check_exist method

* refactor CheckpointHook

* support uploading logs to ceph

* rename var client to file_client

* polish docstring

* enhance load_from_ceph

* refactor load_from_ceph

* refactor TextLoggerHook

* change the meaning of out_dir argument

* fix test_checkpoint_hook.py

* add join_paths method

* remove join_paths and add _format_path

* enhance unittest

* refactor unittest

* add a unittest for EvalHook when file backend is petrel

* singleton pattern

* fix test_clientio.py

* deprecate CephBackend

* add warning in load_from_ceph

* fix type of out_suffix

* enhance docstring

* refactor unittest for petrel

* refactor unittest for disk backend

* update io.md

* add concat_paths method

* fix CI

* mock check_exist

* improve docstring

* improve docstring

* improve docstring

* improve docstring

* add isdir and copyfile for file backend

* delete copyfile and add get_local_path

* remove isdir method of petrel

* fix typo

* rename check_exists to exists

* refactor code and polish docstring

* fix windows ci

* add comment and polish docstring

* polish docstring

* polish docstring

* rename _path_mapping to _map_path

* polish docstring and fix typo

* refactor get_local_path

* add list_dir_or_file for FileClient

* add list_dir_or_file for PetrelBackend

* fix windows ci

* Add return docstring

* polish docstring

* fix typo

* fix typo

* fix typo

* fix error when mocking PetrelBackend

* deprecate the conversion from Path to str

* add docs for loading checkpoints with FileClient

* rename keep_log to keep_local

* refactor map_path

* add _ensure_methods to ensure methods have been implemented

* fix list_dir_or_file

* rename _ensure_method_implemented to has_method

* refactor

* polish information

* format information

* bump version to v1.3.16 (#1430)

* [Fix]: Update test data of test_tin_shift (#1426)

* Update test data of test_tin_shift

* Delete tmp.engine

* add pytest raises AssertionError test

* raise ValueError, update test log

* add more comment

* Apply suggestions from code review
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>

* fix the wrong function reference bug in BaseTransformerLayer when batch_first is True (#1418)

* [Docs] Add mmcv itself in the docs list (#1441)

* Add mmcv itself in the docs list

* modify link of docs

* [Improve] improve checkpoint loading log (#1446)

* [Feature] Support SigmoidFocalLoss with Cambricon MLU backend (#1346)

* [Feature] Support SigmoidFocalLoss with Cambricon MLU backend

* refactor MMCV_WITH_MLU macro define

* refactor NFU_ALIGN_SIZE, PAD_DOWN and split_pipeline_num

* delete extra fool proofing in cpp

* [Feature] Support SigmoidFocalLossBackward with Cambricon MLU backend

* fix macro definition in SigmoidFocalLoss

* refactor mlu files into clang-format

* refactor sigmoid focal loss test

* refactor Sigmoid Focal Loss file structure.

* fix python lint error

* fix import torch_mlu error type

* fix lint

* refactor clang format style to google
Co-authored-by: zhouzaida <zhouzaida@163.com>

* [Feature] Support RoiAlign With Cambricon MLU Backend (#1429)

* [Feature] Support NMS with cambricon MLU backend (#1467)

* [Feature] Support BBoxOverlaps with cambricon MLU backend (#1507)

* [Refactor] Format C++ code

* [Refactor] include common_mlu_helper in pytorch_mlu_helper and refactor build condition

* [Improve] Improve the performance of roialign, nms and focalloss with MLU backend (#1572)

* [Improve] Improve the performance of roialign with MLU backend

* replace CHECK_MLU with CHECK_MLU_INPUT

* [Improve] Improve the perf of nms and focallosssigmoid with MLU backend

* [Improve] Improve the performance of roialign with MLU backend (#1741)

* [Feature] Support tin_shift with cambricon MLU backend (#1696)

* [Feature] Support tin_shift with cambricon MLU backend

* [fix] Add the assertion of batch_size in tin_shift.py

* [fix] fix the param check of tin_shift in cambricon code

* [fix] Fix lint failure.

* [fix] Fix source file lint failure.

* Update mmcv/ops/tin_shift.py

[Refactor] Modify the code in mmcv/ops/tin_shift.py.
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>
Co-authored-by: budefei <budefei@cambricon.com>
Co-authored-by: budefei <budefei@cambricom.com>
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>

* resolve conflicts and fix lint

* fix mmcv.utils.__init__

* fix mmcv.utils.__init__

* Fix lints and change FLAG

* fix setup and refine

* remove a redundant line

* remove an unnecessary 'f'

* fix compilation error
Co-authored-by: dingchang <hudingchang.vendor@sensetime.com>
Co-authored-by: zhouzaida <zhouzaida@163.com>
Co-authored-by: q.yao <yaoqian@sensetime.com>
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com>
Co-authored-by: pc <luopeichao@sensetime.com>
Co-authored-by: Wenwei Zhang <40779233+ZwwWayne@users.noreply.github.com>
Co-authored-by: q.yao <streetyao@live.com>
Co-authored-by: Tong Gao <gaotongxiao@gmail.com>
Co-authored-by: Yuxin Liu <liuyuxin@cambricon.com>
Co-authored-by: zihanchang11 <92860914+zihanchang11@users.noreply.github.com>
Co-authored-by: shlrao <shenglong.rao@gmail.com>
Co-authored-by: zhouchenyang <zcy19950525@gmail.com>
Co-authored-by: Mrxiaofei <36697723+Mrxiaofei@users.noreply.github.com>
Co-authored-by: budefei <budefei@cambricon.com>
Co-authored-by: budefei <budefei@cambricom.com>
parent 95273020
@@ -2,9 +2,9 @@
 from .data_parallel import MLUDataParallel
 from .distributed import MLUDistributedDataParallel
 from .scatter_gather import scatter, scatter_kwargs
-from .utils import IS_MLU
+from .utils import IS_MLU_AVAILABLE

 __all__ = [
     'MLUDataParallel', 'MLUDistributedDataParallel', 'scatter',
-    'scatter_kwargs', 'IS_MLU'
+    'scatter_kwargs', 'IS_MLU_AVAILABLE'
 ]
@@ -8,4 +8,4 @@ def is_mlu_available():
     return False


-IS_MLU = is_mlu_available()
+IS_MLU_AVAILABLE = is_mlu_available()
/*************************************************************************
* Copyright (C) 2021 Cambricon.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*************************************************************************/
#include <float.h>
#include "common_mlu_helper.hpp"
#define COORD_NUM 4
__nram__ char nmem_buf[MAX_NRAM_SIZE];
template <typename T>
__mlu_func__ void computeDiv(void *nram_dst, void *nram_src0, void *nram_src1,
void *nram_addition, const int32_t deal_num) {
__bang_active_reciphp((T *)nram_dst, (T *)nram_src1, deal_num);
__bang_mul((T *)nram_dst, (T *)nram_src0, (T *)nram_dst, deal_num);
}
template <>
__mlu_func__ void computeDiv<half>(void *nram_dst, void *nram_src0,
void *nram_src1, void *nram_addition,
const int32_t deal_num) {
__bang_half2float((float *)nram_addition, (half *)nram_src1, deal_num);
__bang_active_reciphp((float *)nram_addition, (float *)nram_addition,
deal_num);
__bang_float2half_rd((half *)nram_src1, (float *)nram_addition, deal_num);
__bang_mul((half *)nram_dst, (half *)nram_src0, (half *)nram_src1, deal_num);
}
template <typename T>
__mlu_func__ void bboxOverlapsWorkflow(
T *vec_b1_x1, T *vec_b1_y1, T *vec_b1_x2, T *vec_b1_y2, T *vec_b2_x1,
T *vec_b2_y1, T *vec_b2_x2, T *vec_b2_y2, T *vec_left, T *vec_right,
T *vec_top, T *vec_bottom, const T *bbox1, const T *bbox2, void *ious,
const int32_t offset, const int32_t mode, const int32_t batches_stride,
const int32_t num_bbox1, const int32_t num_bbox2, const bool aligned) {
int32_t task_batch_stride = (num_bbox1 + taskDim - 1) / taskDim;
int32_t batch_start = taskId * task_batch_stride;
int32_t batch_per_task = batch_start + task_batch_stride < num_bbox1
? task_batch_stride
: num_bbox1 - batch_start;
batch_per_task = batch_per_task > 0 ? batch_per_task : (0);
if (aligned) {
int32_t num_loop_cpy = batch_per_task / batches_stride;
int32_t num_rem_cpy_batches = batch_per_task % batches_stride;
num_loop_cpy = num_rem_cpy_batches > 0 ? num_loop_cpy + 1 : num_loop_cpy;
for (int32_t i = 0; i < num_loop_cpy; i++) {
int32_t index = batch_start + i * batches_stride;
int32_t handle_batches = index + batches_stride > num_bbox1
? num_rem_cpy_batches
: batches_stride;
int32_t b1 = index;
int32_t b2 = index;
int32_t base1 = b1 * COORD_NUM;
__memcpy(vec_b1_x1, &bbox1[base1], sizeof(T), GDRAM2NRAM, sizeof(T),
COORD_NUM * sizeof(T), handle_batches - 1);
__memcpy(vec_b1_y1, &bbox1[base1 + 1], sizeof(T), GDRAM2NRAM, sizeof(T),
COORD_NUM * sizeof(T), handle_batches - 1);
__memcpy(vec_b1_x2, &bbox1[base1 + 2], sizeof(T), GDRAM2NRAM, sizeof(T),
COORD_NUM * sizeof(T), handle_batches - 1);
__memcpy(vec_b1_y2, &bbox1[base1 + 3], sizeof(T), GDRAM2NRAM, sizeof(T),
COORD_NUM * sizeof(T), handle_batches - 1);
int32_t base2 = b2 * COORD_NUM;
__memcpy(vec_b2_x1, &bbox2[base2], sizeof(T), GDRAM2NRAM, sizeof(T),
COORD_NUM * sizeof(T), handle_batches - 1);
__memcpy(vec_b2_y1, &bbox2[base2 + 1], sizeof(T), GDRAM2NRAM, sizeof(T),
COORD_NUM * sizeof(T), handle_batches - 1);
__memcpy(vec_b2_x2, &bbox2[base2 + 2], sizeof(T), GDRAM2NRAM, sizeof(T),
COORD_NUM * sizeof(T), handle_batches - 1);
__memcpy(vec_b2_y2, &bbox2[base2 + 3], sizeof(T), GDRAM2NRAM, sizeof(T),
COORD_NUM * sizeof(T), handle_batches - 1);
// get the width and height
__bang_maxequal(vec_left, vec_b1_x1, vec_b2_x1, batches_stride);
__bang_minequal(vec_right, vec_b1_x2, vec_b2_x2, batches_stride);
__bang_maxequal(vec_top, vec_b1_y1, vec_b2_y1, batches_stride);
__bang_minequal(vec_bottom, vec_b1_y2, vec_b2_y2, batches_stride);
// right - left + offset ---> left
__bang_sub(vec_left, vec_right, vec_left, batches_stride);
__bang_add_const(vec_left, vec_left, (T)offset, batches_stride);
// bottom - top + offset ---> right
__bang_sub(vec_right, vec_bottom, vec_top, batches_stride);
__bang_add_const(vec_right, vec_right, (T)offset, batches_stride);
// zero vector ---> bottom
__nramset(vec_bottom, batches_stride, 0.f);
// width --> vec_left
__bang_maxequal(vec_left, vec_bottom, vec_left, batches_stride);
T *width = vec_left;
// height --> vec_right
__bang_maxequal(vec_right, vec_bottom, vec_right, batches_stride);
T *height = vec_right;
// get the b1_area
// (b1_x2 - b1_x1 + offset) ---> vec_top
__bang_sub(vec_top, vec_b1_x2, vec_b1_x1, batches_stride);
__bang_add_const(vec_top, vec_top, (T)offset, batches_stride);
// (b1_y2 - b1_y1 + offset) ---> vec_bottom
__bang_sub(vec_bottom, vec_b1_y2, vec_b1_y1, batches_stride);
__bang_add_const(vec_bottom, vec_bottom, (T)offset, batches_stride);
// b1_area = (b1_x2 - b1_x1 + offset) * (b1_y2 - b1_y1 + offset)
// ---> vec_top;
__bang_mul(vec_top, vec_top, vec_bottom, batches_stride);
T *b1_area = vec_top;
// get the b2_area
// (b2_x2 - b2_x1 + offset) ---> b2_x1
__bang_sub(vec_b2_x1, vec_b2_x2, vec_b2_x1, batches_stride);
__bang_add_const(vec_b2_x1, vec_b2_x1, (T)offset, batches_stride);
// (b2_y2 - b2_y1 + offset) ---> b2_y1
__bang_sub(vec_b2_y1, vec_b2_y2, vec_b2_y1, batches_stride);
__bang_add_const(vec_b2_y1, vec_b2_y1, (T)offset, batches_stride);
// b2_area = (b2_x2 - b2_x1 + offset) * (b2_y2 - b2_y1 + offset)
// ---> b2_x1;
__bang_mul(vec_b2_x1, vec_b2_x1, vec_b2_y1, batches_stride);
T *b2_area = vec_b2_x1;
// inter_s = width * height
__bang_mul(height, width, height, batches_stride);
T *inter_s = height;
// offset vector ---> vec_b2_y1
__nramset(vec_b2_y1, batches_stride, T(offset));
T *vec_offset = vec_b2_y1;
if (mode == 0) {
__bang_add(b1_area, b1_area, b2_area, batches_stride);
__bang_sub(b1_area, b1_area, inter_s, batches_stride);
__bang_maxequal(b1_area, vec_offset, b1_area, batches_stride);
} else {
__bang_maxequal(b1_area, vec_offset, b1_area, batches_stride);
}
T *base_s = b1_area;
// ious = inter_s / base_s
computeDiv<T>(width, inter_s, base_s, vec_b2_x2, batches_stride);
__memcpy((T *)ious + index, width, handle_batches * sizeof(T),
NRAM2GDRAM);
}
} else {
int32_t num_loop_cpy = num_bbox2 / batches_stride;
int32_t num_rem_cpy_batches = num_bbox2 % batches_stride;
num_loop_cpy = num_rem_cpy_batches > 0 ? num_loop_cpy + 1 : num_loop_cpy;
for (int32_t i = 0; i < batch_per_task; i++) {
int32_t index1 = batch_start + i;
int32_t b1 = index1;
int32_t base1 = b1 * COORD_NUM;
// set bbox1 and bbox2 to nram
__nramset(vec_b1_x1, batches_stride, bbox1[base1]);
__nramset(vec_b1_y1, batches_stride, bbox1[base1 + 1]);
__nramset(vec_b1_x2, batches_stride, bbox1[base1 + 2]);
__nramset(vec_b1_y2, batches_stride, bbox1[base1 + 3]);
for (int32_t j = 0; j < num_loop_cpy; j++) {
int32_t index2 = j * batches_stride;
int32_t handle_batches = index2 + batches_stride > num_bbox2
? num_rem_cpy_batches
: batches_stride;
int32_t b2 = index2;
int32_t base2 = b2 * COORD_NUM;
// copy bbox2 to nram
__memcpy(vec_b2_x1, &bbox2[base2], sizeof(T), GDRAM2NRAM, sizeof(T),
COORD_NUM * sizeof(T), handle_batches - 1);
__memcpy(vec_b2_y1, &bbox2[base2 + 1], sizeof(T), GDRAM2NRAM, sizeof(T),
COORD_NUM * sizeof(T), handle_batches - 1);
__memcpy(vec_b2_x2, &bbox2[base2 + 2], sizeof(T), GDRAM2NRAM, sizeof(T),
COORD_NUM * sizeof(T), handle_batches - 1);
__memcpy(vec_b2_y2, &bbox2[base2 + 3], sizeof(T), GDRAM2NRAM, sizeof(T),
COORD_NUM * sizeof(T), handle_batches - 1);
// get the width and height
__bang_maxequal(vec_left, vec_b1_x1, vec_b2_x1, batches_stride);
__bang_minequal(vec_right, vec_b1_x2, vec_b2_x2, batches_stride);
__bang_maxequal(vec_top, vec_b1_y1, vec_b2_y1, batches_stride);
__bang_minequal(vec_bottom, vec_b1_y2, vec_b2_y2, batches_stride);
// right - left + offset ---> left
__bang_sub(vec_left, vec_right, vec_left, batches_stride);
__bang_add_const(vec_left, vec_left, (T)offset, batches_stride);
// bottom - top + offset ---> right
__bang_sub(vec_right, vec_bottom, vec_top, batches_stride);
__bang_add_const(vec_right, vec_right, (T)offset, batches_stride);
// zero vector ---> bottom
__nramset(vec_bottom, batches_stride, (T)0);
// width --> vec_left
__bang_maxequal(vec_left, vec_bottom, vec_left, batches_stride);
T *width = vec_left;
// height --> vec_right
__bang_maxequal(vec_right, vec_bottom, vec_right, batches_stride);
T *height = vec_right;
// get the b1_area
// (b1_x2 - b1_x1 + offset) ---> vec_top
__bang_sub(vec_top, vec_b1_x2, vec_b1_x1, batches_stride);
__bang_add_const(vec_top, vec_top, (T)offset, batches_stride);
// (b1_y2 - b1_y1 + offset) ---> vec_bottom
__bang_sub(vec_bottom, vec_b1_y2, vec_b1_y1, batches_stride);
__bang_add_const(vec_bottom, vec_bottom, (T)offset, batches_stride);
// b1_area = (b1_x2 - b1_x1 + offset) * (b1_y2 - b1_y1 + offset)
// ---> vec_top;
__bang_mul(vec_top, vec_top, vec_bottom, batches_stride);
T *b1_area = vec_top;
// get the b2_area
// (b2_x2 - b2_x1 + offset) ---> b2_x1
__bang_sub(vec_b2_x1, vec_b2_x2, vec_b2_x1, batches_stride);
__bang_add_const(vec_b2_x1, vec_b2_x1, (T)offset, batches_stride);
// (b2_y2 - b2_y1 + offset) ---> b2_y1
__bang_sub(vec_b2_y1, vec_b2_y2, vec_b2_y1, batches_stride);
__bang_add_const(vec_b2_y1, vec_b2_y1, (T)offset, batches_stride);
// b2_area = (b2_x2 - b2_x1 + offset) * (b2_y2 - b2_y1 + offset)
// ---> b2_x1;
__bang_mul(vec_b2_x1, vec_b2_x1, vec_b2_y1, batches_stride);
T *b2_area = vec_b2_x1;
// inter_s = width * height
__bang_mul(height, width, height, batches_stride);
T *inter_s = height;
// offset vector ---> vec_b2_y1
__nramset(vec_b2_y1, batches_stride, T(offset));
T *vec_offset = vec_b2_y1;
if (mode == 0) {
__bang_add(b1_area, b1_area, b2_area, batches_stride);
__bang_sub(b1_area, b1_area, inter_s, batches_stride);
__bang_maxequal(b1_area, vec_offset, b1_area, batches_stride);
} else {
__bang_maxequal(b1_area, vec_offset, b1_area, batches_stride);
}
T *base_s = b1_area;
// ious = inter_s / base_s
computeDiv<T>(width, inter_s, base_s, vec_b2_x2, batches_stride);
int32_t gdram_offset = index1 * num_bbox2 + index2;
__memcpy((T *)ious + gdram_offset, width, handle_batches * sizeof(T),
NRAM2GDRAM);
}
}
}
}
template <typename T>
__mlu_global__ void MLUUnion1KernelBBoxOverlaps(
const void *bbox1, const void *bbox2, void *ious, const int32_t num_bbox1,
const int32_t num_bbox2, const int32_t mode, const bool aligned,
const int32_t offset) {
/*
* NRAM partition
* |-------------------------------------------------------------|
* | vec_b1_x1 | vec_b1_y1 | vec_b1_x2 | vec_b1_y2 |
* |-------------------------------------------------------------|
* | vec_b2_x1 | vec_b2_y1 | vec_b2_x2 | vec_b2_y2 |
* |-------------------------------------------------------------|
* | vec_left | vec_right | vec_top | vec_bottom |
* |-------------------------------------------------------------|
*
*/
const int32_t align_bytes = PAD_DOWN(MAX_NRAM_SIZE, NFU_ALIGN_SIZE);
const int32_t split_nram_num = 12;
const int32_t nram_stride =
align_bytes / NFU_ALIGN_SIZE / split_nram_num * NFU_ALIGN_SIZE;
void *vec_b1_x1 = nmem_buf;
void *vec_b1_y1 = nmem_buf + nram_stride;
void *vec_b1_x2 = nmem_buf + 2 * nram_stride;
void *vec_b1_y2 = nmem_buf + 3 * nram_stride;
void *vec_b2_x1 = nmem_buf + 4 * nram_stride;
void *vec_b2_y1 = nmem_buf + 5 * nram_stride;
void *vec_b2_x2 = nmem_buf + 6 * nram_stride;
void *vec_b2_y2 = nmem_buf + 7 * nram_stride;
void *vec_left = nmem_buf + 8 * nram_stride;
void *vec_right = nmem_buf + 9 * nram_stride;
void *vec_top = nmem_buf + 10 * nram_stride;
void *vec_bottom = nmem_buf + 11 * nram_stride;
const int32_t vec_length = nram_stride / sizeof(T);
bboxOverlapsWorkflow((T *)vec_b1_x1, (T *)vec_b1_y1, (T *)vec_b1_x2,
(T *)vec_b1_y2, (T *)vec_b2_x1, (T *)vec_b2_y1,
(T *)vec_b2_x2, (T *)vec_b2_y2, (T *)vec_left,
(T *)vec_right, (T *)vec_top, (T *)vec_bottom,
(T *)bbox1, (T *)bbox2, (T *)ious, offset, mode,
vec_length, num_bbox1, num_bbox2, aligned);
}
void KernelBBoxOverlaps(cnrtDim3_t k_dim, cnrtFunctionType_t k_type,
cnrtQueue_t queue, const cnrtDataType_t d_type,
const void *bbox1, const void *bbox2, void *ious,
const int32_t num_bbox1, const int32_t num_bbox2,
const int32_t mode, const bool aligned,
const int32_t offset) {
if (d_type == CNRT_FLOAT16) {
MLUUnion1KernelBBoxOverlaps<half><<<k_dim, k_type, queue>>>(
bbox1, bbox2, ious, num_bbox1, num_bbox2, mode, aligned, offset);
} else {
MLUUnion1KernelBBoxOverlaps<float><<<k_dim, k_type, queue>>>(
bbox1, bbox2, ious, num_bbox1, num_bbox2, mode, aligned, offset);
}
}
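For reference, the aligned-case math of the kernel above restated in NumPy (a sketch; mode 0 is IoU, mode 1 is IoF, and the unaligned branch instead pairs every box in bbox1 with every box in bbox2):

import numpy as np

def bbox_overlaps_ref(b1, b2, mode=0, offset=0):
    # Aligned case: b1 and b2 are (N, 4) arrays of (x1, y1, x2, y2).
    # Intersection width/height are clamped to zero, as in the kernel.
    w = np.maximum(np.minimum(b1[:, 2], b2[:, 2]) -
                   np.maximum(b1[:, 0], b2[:, 0]) + offset, 0)
    h = np.maximum(np.minimum(b1[:, 3], b2[:, 3]) -
                   np.maximum(b1[:, 1], b2[:, 1]) + offset, 0)
    inter = w * h
    area1 = (b1[:, 2] - b1[:, 0] + offset) * (b1[:, 3] - b1[:, 1] + offset)
    area2 = (b2[:, 2] - b2[:, 0] + offset) * (b2[:, 3] - b2[:, 1] + offset)
    base = area1 + area2 - inter if mode == 0 else area1
    base = np.maximum(base, offset)  # the kernel clamps the denominator to >= offset
    return inter / np.maximum(base, 1e-12)  # extra guard for offset == 0 in this sketch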
/*************************************************************************
* Copyright (C) 2021 Cambricon.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*************************************************************************/
#ifndef UTILS_H_
#define UTILS_H_
#define NFU_ALIGN_SIZE 128 // Byte
#define REM_FOR_STACK (128 * 1024) // 128KB reserved for cncc
#ifdef __BANG_ARCH__
#define MAX_NRAM_SIZE \
(__MLU_NRAM_SIZE__ * 1024 - REM_FOR_STACK) // 128KB reserved for cncc
#define MAX_SRAM_SIZE \
(__MLU_SRAM_SIZE__ * 1024 - REM_FOR_STACK) // 128KB reserved for cncc
#else
#define MAX_NRAM_SIZE (384 * 1024) // 384KB, initialization value
#define MAX_SRAM_SIZE (1920 * 1024) // 1920KB, initialization value
#endif
#ifndef PAD_UP
#define PAD_UP(x, y) (((x) / (y) + (int)((x) % (y) > 0)) * (y))
#endif
#ifndef PAD_DOWN
#define PAD_DOWN(x, y) (((x) / (y)) * (y))
#endif
#endif // UTILS_H_
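The two padding macros round to multiples of y; restated in Python for integer x and positive y:

def pad_up(x, y):
    # round x up to the nearest multiple of y, as PAD_UP does
    return (x // y + int(x % y > 0)) * y

def pad_down(x, y):
    # round x down to the nearest multiple of y, as PAD_DOWN does
    return (x // y) * y

assert pad_up(300, 128) == 384
assert pad_down(300, 128) == 256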
(3 collapsed diffs not shown)
/*************************************************************************
* Copyright (C) 2022 Cambricon.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*************************************************************************/
#include "common_mlu_helper.hpp"
__nram__ char data_nram[MAX_NRAM_SIZE];
template <typename T>
__mlu_func__ void mluMultiKernelTinShift(
const T *input, const int *shifts, T *output, const int batch_size,
const int time_size, const int channel_size, const int hw_size,
const int group_size, const int group_channel) {
for (int cur_channel_index = taskId;
cur_channel_index < batch_size * channel_size;
cur_channel_index += taskDim) {
int n_index = cur_channel_index / channel_size;
int group_id = cur_channel_index % channel_size / group_channel;
int t_shift = shifts[n_index * group_size + group_id];
int index = cur_channel_index % channel_size * hw_size +
n_index * time_size * channel_size * hw_size;
__nramset(data_nram, MAX_NRAM_SIZE, (char)0);
__asm__ volatile("sync;");
if (abs(t_shift) >= time_size) {
__memcpy(output + index, data_nram, hw_size * sizeof(T), NRAM2GDRAM,
channel_size * hw_size * sizeof(T), hw_size * sizeof(T),
time_size - 1);
} else {
if (t_shift > 0) {
__memcpy(data_nram + t_shift * hw_size * sizeof(T), input + index,
hw_size * sizeof(T), GDRAM2NRAM, hw_size * sizeof(T),
channel_size * hw_size * sizeof(T), time_size - 1 - t_shift);
__memcpy(output + index, data_nram, hw_size * sizeof(T), NRAM2GDRAM,
channel_size * hw_size * sizeof(T), hw_size * sizeof(T),
time_size - 1);
} else {
__memcpy(data_nram, input + (index - t_shift * channel_size * hw_size),
hw_size * sizeof(T), GDRAM2NRAM, hw_size * sizeof(T),
channel_size * hw_size * sizeof(T), time_size - 1 + t_shift);
__memcpy(output + index, data_nram, hw_size * sizeof(T), NRAM2GDRAM,
channel_size * hw_size * sizeof(T), hw_size * sizeof(T),
time_size - 1);
}
}
__asm__ volatile("sync;");
}
}
template <typename T>
__mlu_func__ void mluHwSplit(const T *input, const int t_shift,
const int time_size, const int hw_size,
const int channel_size, const int index,
const int cur_sequence_index,
const int max_length_per_core, T *output) {
for (int cur_index = index; cur_index < index + hw_size;
cur_index += max_length_per_core) {
int memcpy_size = max_length_per_core;
if (cur_index + max_length_per_core > index + hw_size) {
memcpy_size = index + hw_size - cur_index;
}
if (cur_sequence_index - t_shift < 0 ||
cur_sequence_index - t_shift >= time_size) {
__memcpy(output + cur_index, data_nram, memcpy_size * sizeof(T),
NRAM2GDRAM);
} else {
__memcpy(data_nram, input + cur_index - t_shift * channel_size * hw_size,
memcpy_size * sizeof(T), GDRAM2NRAM);
__memcpy(output + cur_index, data_nram, memcpy_size * sizeof(T),
NRAM2GDRAM);
}
__asm__ volatile("sync;");
}
}
template <typename T>
__mlu_func__ void mluMultiKernelTinShiftSplitSequence(
const T *input, const int *shifts, T *output, const int batch_size,
const int time_size, const int channel_size, const int hw_size,
const int group_size, const int group_channel,
const int max_number_hw_per_core, const int max_length_per_core) {
const int tmp_max_number_hw_per_core =
max_number_hw_per_core > 0 ? max_number_hw_per_core : 1;
const int loop_time = time_size / tmp_max_number_hw_per_core +
((time_size % tmp_max_number_hw_per_core) > 0 ? 1 : 0);
int segmentime_size = tmp_max_number_hw_per_core;
int res_segment = time_size % tmp_max_number_hw_per_core;
for (int cur_segment_index = taskId;
cur_segment_index < loop_time * batch_size * channel_size;
cur_segment_index += taskDim) {
int n_index = cur_segment_index / loop_time / channel_size;
int group_id = cur_segment_index / loop_time % channel_size / group_channel;
int t_shift = shifts[n_index * group_size + group_id];
int index = n_index * time_size * channel_size * hw_size +
(cur_segment_index / loop_time % channel_size) * hw_size +
cur_segment_index % loop_time * segmentime_size * hw_size *
channel_size;
char *dst_gdram2nram = data_nram;
const T *src_gdram2nram = input + index;
int count_gdram2nram = -1;
int count_nram2gdram = -1;
int next_sequence_index =
index / hw_size / channel_size % time_size + segmentime_size;
int cur_sequence_index = index / hw_size / channel_size % time_size;
__nramset(data_nram, MAX_NRAM_SIZE, (char)0);
__asm__ volatile("sync;");
if (max_number_hw_per_core == 0) {
mluHwSplit(input, t_shift, time_size, hw_size, channel_size, index,
cur_sequence_index, max_length_per_core, output);
continue;
}
if (abs(t_shift) >= time_size) {
if ((cur_segment_index + 1) % loop_time == 0 && res_segment != 0) {
__memcpy(output + index, data_nram, hw_size * sizeof(T), NRAM2GDRAM,
channel_size * hw_size * sizeof(T), hw_size * sizeof(T),
res_segment - 1);
} else {
__memcpy(output + index, data_nram, hw_size * sizeof(T), NRAM2GDRAM,
channel_size * hw_size * sizeof(T), hw_size * sizeof(T),
segmentime_size - 1);
}
continue;
}
if (t_shift == 0) {
if ((cur_segment_index + 1) % loop_time == 0 && res_segment != 0) {
dst_gdram2nram = data_nram;
src_gdram2nram = input + index;
count_gdram2nram = res_segment - 1;
count_nram2gdram = res_segment - 1;
} else {
dst_gdram2nram = data_nram;
src_gdram2nram = input + index;
count_gdram2nram = segmentime_size - 1;
count_nram2gdram = segmentime_size - 1;
}
} else if (t_shift > 0) {
int first_index_cur_channel =
n_index * time_size * channel_size * hw_size +
(cur_segment_index / loop_time % channel_size) * hw_size;
if ((cur_segment_index + 1) % loop_time == 0 && res_segment != 0) {
dst_gdram2nram = data_nram;
src_gdram2nram =
input +
(index - t_shift * channel_size * hw_size < first_index_cur_channel
? first_index_cur_channel
: index - t_shift * channel_size * hw_size);
count_gdram2nram = res_segment - 1;
count_nram2gdram = res_segment - 1;
if (cur_sequence_index < t_shift && t_shift < next_sequence_index) {
dst_gdram2nram =
data_nram + t_shift % segmentime_size * hw_size * sizeof(T);
count_gdram2nram = res_segment - (t_shift - cur_sequence_index) - 1;
}
} else {
if (t_shift >= next_sequence_index) {
__memcpy(output + index, data_nram, hw_size * sizeof(T), NRAM2GDRAM,
channel_size * hw_size * sizeof(T), hw_size * sizeof(T),
segmentime_size - 1);
continue;
} else if (cur_sequence_index < t_shift &&
t_shift < next_sequence_index) {
dst_gdram2nram =
data_nram + t_shift % segmentime_size * hw_size * sizeof(T);
src_gdram2nram = input + first_index_cur_channel;
count_gdram2nram = segmentime_size - (t_shift % segmentime_size) - 1;
count_nram2gdram = segmentime_size - 1;
} else {
dst_gdram2nram = data_nram;
src_gdram2nram = input + index - t_shift * channel_size * hw_size;
count_gdram2nram = segmentime_size - 1;
count_nram2gdram = segmentime_size - 1;
}
}
} else {
int offset_index = time_size + t_shift;
if (cur_sequence_index >= offset_index) {
if ((cur_segment_index + 1) % loop_time == 0 && res_segment != 0) {
__memcpy(output + index, data_nram, hw_size * sizeof(T), NRAM2GDRAM,
channel_size * hw_size * sizeof(T), hw_size * sizeof(T),
res_segment - 1);
continue;
} else {
__memcpy(output + index, data_nram, hw_size * sizeof(T), NRAM2GDRAM,
channel_size * hw_size * sizeof(T), hw_size * sizeof(T),
segmentime_size - 1);
continue;
}
} else {
dst_gdram2nram = data_nram;
src_gdram2nram = input + index - t_shift * channel_size * hw_size;
if (cur_sequence_index - t_shift + segmentime_size < time_size) {
count_gdram2nram = segmentime_size - 1;
count_nram2gdram = segmentime_size - 1;
} else {
count_gdram2nram = time_size - (cur_sequence_index - t_shift) - 1;
count_nram2gdram =
(segmentime_size - 1) < (time_size - cur_sequence_index - 1)
? (segmentime_size - 1)
: (time_size - cur_sequence_index - 1);
}
}
}
__memcpy(dst_gdram2nram, src_gdram2nram, hw_size * sizeof(T), GDRAM2NRAM,
hw_size * sizeof(T), channel_size * hw_size * sizeof(T),
count_gdram2nram);
__memcpy(output + index, data_nram, hw_size * sizeof(T), NRAM2GDRAM,
channel_size * hw_size * sizeof(T), hw_size * sizeof(T),
count_nram2gdram);
__asm__ volatile("sync;");
}
}
__mlu_entry__ void MLUUnion1KernelTinShift(
const void *input, const void *shifts, void *output, const int batch_size,
const int time_size, const int channel_size, const int hw_size,
const int group_size, const int group_channel,
const cnrtDataType_t data_dtype) {
// make sure that memcore is not used
if (coreId == 0x80) {
return;
}
switch (data_dtype) {
case CNRT_FLOAT16: {
mluMultiKernelTinShift((half *)input, (const int *)shifts, (half *)output,
batch_size, time_size, channel_size, hw_size,
group_size, group_channel);
}; break;
case CNRT_FLOAT32: {
mluMultiKernelTinShift((float *)input, (const int *)shifts,
(float *)output, batch_size, time_size,
channel_size, hw_size, group_size, group_channel);
}; break;
default: { return; }
}
}
__mlu_entry__ void MLUUnion1KernelTinShiftSplitSequence(
const void *input, const void *shifts, void *output, const int batch_size,
const int time_size, const int channel_size, const int hw_size,
const int group_size, const int group_channel,
const int max_number_hw_per_core, const int max_length_per_core,
const cnrtDataType_t data_dtype) {
// make sure that memcore is not used
if (coreId == 0x80) {
return;
}
switch (data_dtype) {
case CNRT_FLOAT16: {
mluMultiKernelTinShiftSplitSequence(
(half *)input, (const int *)shifts, (half *)output, batch_size,
time_size, channel_size, hw_size, group_size, group_channel,
max_number_hw_per_core, max_length_per_core);
}; break;
case CNRT_FLOAT32: {
mluMultiKernelTinShiftSplitSequence(
(float *)input, (const int *)shifts, (float *)output, batch_size,
time_size, channel_size, hw_size, group_size, group_channel,
max_number_hw_per_core, max_length_per_core);
}; break;
default: { return; }
}
}
void KernelTinShiftForward(
cnrtDim3_t k_dim, cnrtFunctionType_t k_type, cnrtQueue_t queue,
const void *input, const void *shifts, void *output, const int batch_size,
const int time_size, const int channel_size, const int hw_size,
const int group_size, const int group_channel,
const cnrtDataType_t data_dtype, const int channel_per_core,
const int max_number_hw_per_core, const int max_length_per_core) {
if (channel_per_core >= 1) {
MLUUnion1KernelTinShift<<<k_dim, k_type, queue>>>(
input, shifts, output, batch_size, time_size, channel_size, hw_size,
group_size, group_channel, data_dtype);
} else {
MLUUnion1KernelTinShiftSplitSequence<<<k_dim, k_type, queue>>>(
input, shifts, output, batch_size, time_size, channel_size, hw_size,
group_size, group_channel, max_number_hw_per_core, max_length_per_core,
data_dtype);
}
}
void KernelTinShiftBackward(
cnrtDim3_t k_dim, cnrtFunctionType_t k_type, cnrtQueue_t queue,
const void *grad_output, const void *shifts, void *grad_input,
const int batch_size, const int time_size, const int channel_size,
const int hw_size, const int group_size, const int group_channel,
const cnrtDataType_t data_dtype, const int channel_per_core,
const int max_number_hw_per_core, const int max_length_per_core) {
if (channel_per_core >= 1) {
MLUUnion1KernelTinShift<<<k_dim, k_type, queue>>>(
grad_output, shifts, grad_input, batch_size, time_size, channel_size,
hw_size, group_size, group_channel, data_dtype);
} else {
MLUUnion1KernelTinShiftSplitSequence<<<k_dim, k_type, queue>>>(
grad_output, shifts, grad_input, batch_size, time_size, channel_size,
hw_size, group_size, group_channel, max_number_hw_per_core,
max_length_per_core, data_dtype);
}
}
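The shift semantics implemented by the kernels above, restated as a NumPy sketch (assuming an (N, T, C, HW) input layout, which matches the index arithmetic in mluMultiKernelTinShift):

import numpy as np

def tin_shift_ref(x, shifts):
    # x: (N, T, C, HW) array; shifts: (N, G) integer array. Each channel
    # group g of sample n is shifted along the time axis by shifts[n, g];
    # vacated time steps are zero-filled, matching the kernel's
    # zero-initialized NRAM buffer.
    n, t, c, hw = x.shape
    group_channel = c // shifts.shape[1]
    out = np.zeros_like(x)
    for ni in range(n):
        for ci in range(c):
            s = shifts[ni, ci // group_channel]
            for ti in range(t):
                if 0 <= ti - s < t:
                    out[ni, ti, ci] = x[ni, ti - s, ci]
    return out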
@@ -11,7 +11,7 @@ using namespace at;
 #define CHECK_MLU(x) \
   TORCH_CHECK(x.device().type() == at::kMLU, #x " must be a MLU tensor")
 #define CHECK_CPU(x) \
-  TORCH_CHECK(!x.device().is_cuda(), #x " must be a CPU tensor")
+  TORCH_CHECK(x.device().type() == at::kCPU, #x " must be a CPU tensor")
 #define CHECK_CONTIGUOUS(x) \
   TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
 #define CHECK_CUDA_INPUT(x) \
...
/*************************************************************************
* Copyright (C) 2021 Cambricon.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*************************************************************************/
#include "pytorch_device_registry.hpp"
#include "pytorch_mlu_helper.hpp"
void KernelBBoxOverlaps(cnrtDim3_t k_dim, cnrtFunctionType_t k_type,
cnrtQueue_t queue, const cnrtDataType_t d_type,
const void *bbox1, const void *bbox2, void *ious,
const int32_t num_bbox1, const int32_t num_bbox2,
const int32_t mode, const bool aligned,
const int32_t offset);
static void policyFunc(cnrtDim3_t *k_dim, cnrtFunctionType_t *k_type,
const int32_t batch_num_all) {
auto union_num = torch_mlu::getDeviceAttr(cnrtAttrClusterCount);
auto core_dim = torch_mlu::getDeviceAttr(cnrtAttrMcorePerCluster);
auto core_num = union_num * core_dim;
// Union1 policyFunc
*k_type = CNRT_FUNC_TYPE_UNION1;
k_dim->x = core_dim;
auto need_core_num = PAD_UP(batch_num_all, core_dim);
k_dim->y =
(need_core_num < core_num) ? (need_core_num / core_dim) : union_num;
k_dim->z = 1;
return;
}
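The same dimension policy as a Python sketch, for readability (core_dim and union_num stand in for the two torch_mlu::getDeviceAttr calls):

def policy_func(batch_num_all, core_dim, union_num):
    # One Union1 launch: x = cores per cluster, y = clusters actually
    # needed, capped at the device's cluster count.
    need_core_num = -(-batch_num_all // core_dim) * core_dim  # PAD_UP
    core_num = core_dim * union_num
    k_dim_y = need_core_num // core_dim if need_core_num < core_num else union_num
    return (core_dim, k_dim_y, 1)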
void BBoxOverlapsMLUKernelLauncher(const Tensor bboxes1, const Tensor bboxes2,
Tensor ious, const int32_t mode,
const bool aligned, const int32_t offset) {
// check dtype
TORCH_CHECK(
bboxes1.scalar_type() == at::kFloat || bboxes1.scalar_type() == at::kHalf,
"Data type of input should be Float or Half. But now input type is ",
bboxes1.scalar_type(), ".");
TORCH_CHECK(bboxes1.scalar_type() == bboxes2.scalar_type(),
"bboxes1's dtype should be the same with bboxes2's dtype.");
// params check
TORCH_CHECK(bboxes1.dim() == 2, "bboxes1 should be a 2d tensor, got ",
bboxes1.dim(), "D");
TORCH_CHECK(bboxes2.dim() == 2, "bboxes2 should be a 2d tensor, got ",
bboxes2.dim(), "D");
auto rows = bboxes1.size(0);
auto cols = bboxes2.size(0);
auto batch_num_all = rows;
if (rows * cols == 0) {
// return if zero element
return;
}
// calculate task dimension
cnrtDim3_t k_dim;
cnrtFunctionType_t k_type;
policyFunc(&k_dim, &k_type, batch_num_all);
// get compute queue
cnrtQueue_t queue = torch_mlu::getCurQueue();
// get dtype of input
cnrtDataType_t d_type = torch_mlu::toCnrtDtype(bboxes1.dtype());
// get ptr of tensors
auto bboxes1_impl = torch_mlu::getMluTensorImpl(bboxes1);
auto bboxes1_ptr = bboxes1_impl->cnnlMalloc();
auto bboxes2_impl = torch_mlu::getMluTensorImpl(bboxes2);
auto bboxes2_ptr = bboxes2_impl->cnnlMalloc();
auto ious_impl = torch_mlu::getMluTensorImpl(ious);
auto ious_ptr = ious_impl->cnnlMalloc();
// launch kernel
CNLOG(INFO) << "Launch Kernel MLUUnion1BboxOverlapsKernel";
CNLOG(INFO) << "kDim :[ " << k_dim.x << ", " << k_dim.y << ", " << k_dim.z
<< " ]";
KernelBBoxOverlaps(k_dim, k_type, queue, d_type, bboxes1_ptr, bboxes2_ptr,
ious_ptr, rows, cols, mode, aligned, offset);
}
void bbox_overlaps_mlu(const Tensor bboxes1, const Tensor bboxes2, Tensor ious,
const int mode, const bool aligned, const int offset) {
BBoxOverlapsMLUKernelLauncher(bboxes1, bboxes2, ious, mode, aligned, offset);
}
void bbox_overlaps_impl(const Tensor bboxes1, const Tensor bboxes2, Tensor ious,
const int mode, const bool aligned, const int offset);
REGISTER_DEVICE_IMPL(bbox_overlaps_impl, MLU, bbox_overlaps_mlu);
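A hedged usage sketch of the op this file registers, assuming an MLU-enabled build of mmcv plus torch_mlu (the 'mlu' device string is illustrative):

import torch
from mmcv.ops import bbox_overlaps  # public entry point that dispatches here

b1 = torch.tensor([[0., 0., 10., 10.], [2., 2., 8., 8.]]).to('mlu')
b2 = torch.tensor([[0., 0., 10., 10.], [4., 4., 12., 12.]]).to('mlu')
ious = bbox_overlaps(b1, b2, mode='iou', aligned=True, offset=0)  # shape (2,)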
/*************************************************************************
* Copyright (C) 2021 Cambricon.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*************************************************************************/
#include <string>
#include <vector>
#include "pytorch_device_registry.hpp"
#include "pytorch_mlu_helper.hpp"
void KernelFocalLossSigmoidForward(cnrtDim3_t k_dim, cnrtFunctionType_t k_type,
cnrtQueue_t queue,
const cnrtDataType_t d_type,
const void *input, const void *target,
const void *weight, const int32_t N,
const int32_t C, const float alpha,
const float gamma, void *output);
void KernelFocalLossSigmoidBackward(cnrtDim3_t k_dim, cnrtFunctionType_t k_type,
cnrtQueue_t queue,
const cnrtDataType_t d_type,
const void *input, const void *target,
const void *weight, const float gamma,
const float alpha, const int32_t dim_n,
const int32_t deal_n, const int32_t dim_c,
void *output);
// Policy Function for Forward
static void policyFuncForward(cnrtDim3_t *k_dim, cnrtFunctionType_t *k_type,
const Tensor &input, const Tensor &target,
const Tensor &weight) {
auto N = input.size(0);
auto C = input.size(1);
const size_t nram_size = torch_mlu::getDeviceAttr(cnrtAttrNramSizePerMcore);
const size_t c_align_size = PAD_UP((C * input.itemsize()), NFU_ALIGN_SIZE);
const int split_target_num = 2;
const int split_pipeline_num = 6;
const int has_weight = weight.data_ptr() != nullptr;
const int target_data_width = target.scalar_type() == at::kLong
? target.itemsize() / 2
: target.itemsize();
const int threshold_c =
PAD_DOWN((nram_size - split_target_num * sizeof(int)) /
(split_pipeline_num + has_weight),
NFU_ALIGN_SIZE) /
input.itemsize();
int n_seg = 1;
if (C <= threshold_c) {
int c_size = C * input.itemsize();
    int reserved_align_size =
        (split_target_num + split_pipeline_num) * NFU_ALIGN_SIZE;
    int weight_size = 0;
    if (has_weight) {
      c_size = c_align_size;
      reserved_align_size = split_target_num * NFU_ALIGN_SIZE;
      weight_size = c_align_size;
    }
    // n_seg * c_size * split_pipeline_num + n_seg * target.itemsize() *
    //     split_target_num + weight_size + reserved_align_size <= nram_size
    n_seg = (nram_size - weight_size - reserved_align_size) /
            (split_pipeline_num * c_size + split_target_num * sizeof(int32_t));
}
auto seg_num = n_seg == 0 ? N : (N + n_seg - 1) / n_seg;
auto core_dim = torch_mlu::getDeviceAttr(cnrtAttrMcorePerCluster);
auto cluster_num = torch_mlu::getDeviceAttr(cnrtAttrClusterCount);
auto core_num = core_dim * cluster_num;
k_dim->x = *k_type;
k_dim->y =
seg_num > core_num ? cluster_num : (seg_num + core_dim - 1) / core_dim;
k_dim->z = 1;
}
// Policy Function for Backward
static void policyFuncBackward(cnrtDim3_t *k_dim, cnrtFunctionType_t *k_type) {
// set Union1 Job
*k_type = CNRT_FUNC_TYPE_UNION1;
k_dim->x = torch_mlu::getDeviceAttr(cnrtAttrMcorePerCluster);
k_dim->y = torch_mlu::getDeviceAttr(cnrtAttrClusterCount);
k_dim->z = 1;
}
void SigmoidFocalLossForwardMLUKernelLauncher(Tensor input, Tensor target,
Tensor weight, Tensor output,
const float gamma,
const float alpha) {
// params check
TORCH_CHECK(gamma >= 0, "gamma should be greater than or equal to 0. ",
"But now gamma is ", gamma, ".");
// check dtype
TORCH_CHECK(
input.scalar_type() == at::kFloat || input.scalar_type() == at::kHalf,
"Data type of input should be Float or Half. But now input type is ",
input.scalar_type(), ".");
TORCH_CHECK(
(target.scalar_type() == at::kInt || target.scalar_type() == at::kLong),
"target type should be Int or Long. ", "But now target type is ",
target.scalar_type(), ".");
if (weight.data_ptr() != nullptr) {
TORCH_CHECK(weight.scalar_type() == input.scalar_type(),
"Data types of input and weight should be the same. But now "
"input type is ",
input.scalar_type(), ", weight type is ", weight.scalar_type(),
".");
} else {
CNLOG(INFO) << "weight is a empty tensor.";
}
// return if zero-element
if (input.numel() == 0 || target.numel() == 0 || output.numel() == 0) {
return;
}
// calculate task dimension
cnrtDim3_t k_dim;
cnrtFunctionType_t k_type = CNRT_FUNC_TYPE_UNION1;
policyFuncForward(&k_dim, &k_type, input, target, weight);
auto core_dim = torch_mlu::getDeviceAttr(cnrtAttrMcorePerCluster);
// get compute queue
auto queue = torch_mlu::getCurQueue();
// get ptr of tensors
auto input_impl = torch_mlu::getMluTensorImpl(input);
auto input_ptr = input_impl->cnnlMalloc();
auto target_impl = torch_mlu::getMluTensorImpl(target);
auto target_ptr = target_impl->cnnlMalloc();
auto weight_impl = torch_mlu::getMluTensorImpl(weight);
auto weight_ptr = weight_impl->cnnlMalloc();
auto output_impl = torch_mlu::getMluTensorImpl(output);
auto output_ptr = output_impl->cnnlMalloc();
// get dtype of input
cnrtDataType_t d_type = torch_mlu::toCnrtDtype(input.dtype());
CNLOG(INFO) << "Launch Kernel KernelFocalLossSigmoidForward<<<Union"
<< k_type / core_dim << ", " << k_dim.x << ", " << k_dim.y << ", "
<< k_dim.z << ">>>";
// launch kernel
KernelFocalLossSigmoidForward(k_dim, k_type, queue, d_type, input_ptr,
target_ptr, weight_ptr, input.size(0),
input.size(1), alpha, gamma, output_ptr);
}
void getDealNAndThresholdC(const int compute_data_bytes,
const int target_data_bytes, const int total_c,
int *deal_n_ptr, int *threshold_c_ptr,
const bool has_weight, const bool is_half) {
/* NRAM partition:
*
* |-----------------ping pong--------------------|
* |input | pt | alpha_t | temp | output | target | flt_min | gamma | weight|
*
* split_pipeline_num is 5: including input, pt, alpha_t, temp, output.
*/
const int nram_split_num = 5;
const int nram_split_pingpong = 2;
const int max_nram_size = torch_mlu::getDeviceAttr(cnrtAttrNramSizePerMcore);
int32_t compute_align_size = NFU_ALIGN_SIZE;
if (is_half) {
compute_align_size += NFU_ALIGN_SIZE;
}
const int32_t compute_align_num = compute_align_size / compute_data_bytes;
  // reserved_align_size: including input(ping pong), pt(ping pong),
  //                      alpha_t(ping pong), temp(ping pong),
  //                      output(ping pong), target(ping pong),
  //                      flt_min and gamma.
  const int reserved_align_size =
      ((nram_split_num + 1) * nram_split_pingpong + 2) * compute_align_size;
  int nram_pingpong_size = max_nram_size - reserved_align_size;
int compute_c = total_c;
int threshold_c = 0;
if (has_weight) {
// reserved space for weight to align
nram_pingpong_size -= NFU_ALIGN_SIZE;
// threshold_c * nram_split_pingpong * compute_data_bytes * nram_split_num +
// nram_split_pingpong * target_data_bytes +
// threshold_c * compute_data_bytes <= nram_pingpong_size
threshold_c =
(nram_pingpong_size - nram_split_pingpong * target_data_bytes) /
(compute_data_bytes * (nram_split_num * nram_split_pingpong + 1));
threshold_c = PAD_DOWN(threshold_c, compute_align_num);
int weight_space = PAD_UP(total_c * compute_data_bytes, NFU_ALIGN_SIZE);
// reserved space for weight
nram_pingpong_size -= weight_space;
compute_c = PAD_UP(total_c, compute_align_num);
} else {
// threshold_c * nram_split_pingpong * compute_data_bytes * nram_split_num +
// nram_split_pingpong * target_data_bytes <= nram_pingpong_size
threshold_c =
(nram_pingpong_size / nram_split_pingpong - target_data_bytes) /
(nram_split_num * compute_data_bytes);
}
// deal_n * compute_c * nram_split_pingpong * compute_data_bytes *
// nram_split_num + deal_n * nram_split_pingpong * target_data_bytes <=
// nram_pingpong_size
*deal_n_ptr =
nram_pingpong_size /
((nram_split_num * compute_c * compute_data_bytes + target_data_bytes) *
nram_split_pingpong);
*threshold_c_ptr = threshold_c;
}
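The closing deal_n computation above solves a capacity inequality; a Python restatement of just that step (names mirror the C++ locals):

def deal_n(nram_pingpong_size, compute_c, compute_data_bytes,
           target_data_bytes, nram_split_num=5, nram_split_pingpong=2):
    # Largest n satisfying:
    #   n * nram_split_num * compute_c * compute_data_bytes * pingpong
    #     + n * target_data_bytes * pingpong <= nram_pingpong_size
    return nram_pingpong_size // (
        (nram_split_num * compute_c * compute_data_bytes + target_data_bytes)
        * nram_split_pingpong)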
void SigmoidFocalLossBackwardMLUKernelLauncher(Tensor input, Tensor target,
Tensor weight, Tensor output,
const float gamma,
const float alpha) {
// params check
TORCH_CHECK(gamma >= 0, "gamma should be greater than or equal to 0. ",
"But now gamma is ", gamma, ".");
// check dtype
TORCH_CHECK(
input.scalar_type() == at::kFloat || input.scalar_type() == at::kHalf,
"Data type of input should be Float or Half. But now input type is ",
input.scalar_type(), ".");
TORCH_CHECK(
(target.scalar_type() == at::kInt || target.scalar_type() == at::kLong),
"target type should be Int or Long. ", "But now target type is ",
target.scalar_type(), ".");
bool has_weight = false;
if (weight.data_ptr() != nullptr) {
TORCH_CHECK(weight.scalar_type() == input.scalar_type(),
"Data types of input and weight should be the same. But now "
"input type is ",
input.scalar_type(), ", weight type is ", weight.scalar_type(),
".");
has_weight = true;
} else {
CNLOG(INFO) << "weight is a empty tensor.";
}
auto dim_c = input.size(1);
const int compute_data_bytes = sizeof(float);
// target supports only INT on MLU device while it keeps LONG on host side,
// so target.itemsize() / 2
const int target_data_bytes = target.scalar_type() == at::kLong
? (target.itemsize() / 2)
: target.itemsize();
int deal_n = 0;
int threshold_c = 0;
bool is_half = false;
if (input.scalar_type() == at::kHalf) {
is_half = true;
}
// calculate deal_n and threshold_c
getDealNAndThresholdC(compute_data_bytes, target_data_bytes, dim_c, &deal_n,
&threshold_c, has_weight, is_half);
// check C
TORCH_CHECK(threshold_c >= dim_c,
"input.size(1) should be in the range of [0, ", threshold_c,
"]. ", "But now input.size(1) is ", dim_c, ".");
if (input.numel() == 0 || target.numel() == 0 || output.numel() == 0) {
// return if zero-element
return;
}
// set task dimension
cnrtDim3_t k_dim;
cnrtFunctionType_t k_type;
policyFuncBackward(&k_dim, &k_type);
// get compute queue
auto queue = torch_mlu::getCurQueue();
// get ptr of tensors
auto input_impl = torch_mlu::getMluTensorImpl(input);
auto input_ptr = input_impl->cnnlMalloc();
auto target_impl = torch_mlu::getMluTensorImpl(target);
auto target_ptr = target_impl->cnnlMalloc();
auto weight_impl = torch_mlu::getMluTensorImpl(weight);
auto weight_ptr = weight_impl->cnnlMalloc();
auto output_impl = torch_mlu::getMluTensorImpl(output);
auto output_ptr = output_impl->cnnlMalloc();
// get dtype of input
cnrtDataType_t d_type = torch_mlu::toCnrtDtype(input.dtype());
auto core_dim = torch_mlu::getDeviceAttr(cnrtAttrMcorePerCluster);
auto dim_n = input.size(0);
CNLOG(INFO) << "Launch Kernel KernelFocalLossSigmoidBackward<<<Union"
<< k_type / core_dim << ", " << k_dim.x << ", " << k_dim.y << ", "
<< k_dim.z << ">>>";
// launch kernel
KernelFocalLossSigmoidBackward(k_dim, k_type, queue, d_type, input_ptr,
target_ptr, weight_ptr, gamma, alpha, dim_n,
deal_n, dim_c, output_ptr);
}
void sigmoid_focal_loss_forward_mlu(Tensor input, Tensor target, Tensor weight,
Tensor output, float gamma, float alpha) {
SigmoidFocalLossForwardMLUKernelLauncher(input, target, weight, output, gamma,
alpha);
}
void sigmoid_focal_loss_backward_mlu(Tensor input, Tensor target, Tensor weight,
Tensor grad_input, float gamma,
float alpha) {
SigmoidFocalLossBackwardMLUKernelLauncher(input, target, weight, grad_input,
gamma, alpha);
}
void sigmoid_focal_loss_forward_impl(Tensor input, Tensor target, Tensor weight,
Tensor output, float gamma, float alpha);
void sigmoid_focal_loss_backward_impl(Tensor input, Tensor target,
Tensor weight, Tensor grad_input,
float gamma, float alpha);
REGISTER_DEVICE_IMPL(sigmoid_focal_loss_forward_impl, MLU,
sigmoid_focal_loss_forward_mlu);
REGISTER_DEVICE_IMPL(sigmoid_focal_loss_backward_impl, MLU,
sigmoid_focal_loss_backward_mlu);
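A hedged usage sketch for the registered forward entry point, assuming an MLU-enabled build (shapes and values are illustrative):

import torch
from mmcv.ops import sigmoid_focal_loss  # public entry point

pred = torch.randn(8, 4).to('mlu')           # 8 samples, 4 classes
target = torch.randint(0, 4, (8,)).to('mlu')  # class indices (Long)
loss = sigmoid_focal_loss(pred, target, gamma=2.0, alpha=0.25)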
/*************************************************************************
* Copyright (C) 2021 by Cambricon.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*************************************************************************/
#include "pytorch_device_registry.hpp"
#include "pytorch_mlu_helper.hpp"
void KernelNms(cnrtDim3_t k_dim, cnrtFunctionType_t k_type, cnrtQueue_t queue,
const cnrtDataType_t data_type_input, const void *boxes_ptr,
const void *scores_ptr, const int input_num_boxes,
const int input_stride, const int max_output_boxes,
const float iou_threshold, const float offset,
void *workspace_ptr, void *output_size_ptr, void *output_ptr);
int selectUnionType(uint32_t use_job, int box_num_per_core) {
// the box_num_per_core should be at least 256, otherwise the real IO
// bandwidth would be very low
while (box_num_per_core < 256 && use_job >= 4) {
box_num_per_core *= 2;
use_job /= 2;
}
return use_job;
}
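The same job-selection loop as a Python sketch, for readability:

def select_union_type(use_job, box_num_per_core):
    # Shrink the job type (and double the per-core box count) until each
    # core handles at least 256 boxes, so the real IO bandwidth stays high.
    while box_num_per_core < 256 and use_job >= 4:
        box_num_per_core *= 2
        use_job //= 2
    return use_job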
Tensor NMSMLUKernelLauncher(Tensor boxes, Tensor scores, float iou_threshold,
int offset) {
// dimension parameters check
TORCH_CHECK(boxes.dim() == 2, "boxes should be a 2d tensor, got ",
boxes.dim(), "D");
TORCH_CHECK(boxes.size(1) == 4,
"boxes should have 4 elements in dimension 1, got ",
boxes.size(1));
TORCH_CHECK(scores.dim() == 1, "scores should be a 1d tensor, got ",
scores.dim(), "D");
// data type check
TORCH_CHECK(boxes.scalar_type() == scores.scalar_type(),
"boxes should have the same type as scores");
TORCH_CHECK(
boxes.scalar_type() == at::kFloat || boxes.scalar_type() == at::kHalf,
"data type of boxes should be Float or Half, got ", boxes.scalar_type());
if (boxes.numel() == 0) {
return at::empty({0}, boxes.options().dtype(at::kLong));
}
int input_num_boxes = boxes.size(0);
int input_stride = boxes.size(0);
int max_output_boxes = boxes.size(0);
cnrtDataType_t data_type_input = torch_mlu::toCnrtDtype(boxes.dtype());
cnrtDim3_t k_dim;
cnrtJobType_t k_type;
uint32_t union_number = torch_mlu::getDeviceAttr(cnrtAttrClusterCount);
uint32_t core_dim = torch_mlu::getDeviceAttr(cnrtAttrMcorePerCluster);
uint32_t job_limit = union_number * core_dim;
uint32_t core_number = union_number * core_dim;
int box_num_per_core = (input_num_boxes + core_number - 1) / core_number;
// initiate k_type as Union1
k_dim.x = core_dim;
k_dim.y = 1;
k_dim.z = 1;
k_type = CNRT_FUNC_TYPE_UNION1;
int use_job = selectUnionType(job_limit, box_num_per_core);
if (use_job < 4) {
k_dim.x = 1;
k_type = CNRT_FUNC_TYPE_BLOCK;
} else if (use_job == 4) {
k_dim.x = core_dim;
k_type = CNRT_FUNC_TYPE_UNION1;
} else {
k_dim.x = use_job;
k_type = (cnrtFunctionType_t)use_job;
}
// transpose boxes (n, 4) to (4, n) for better performance
auto boxes_t = boxes.transpose(0, 1);
auto boxes_ = torch_mlu::cnnl::ops::cnnl_contiguous(boxes_t);
auto scores_ = torch_mlu::cnnl::ops::cnnl_contiguous(scores);
auto output = at::empty({max_output_boxes}, boxes.options().dtype(at::kLong));
auto output_size = at::empty({1}, scores.options().dtype(at::kInt));
// workspace
const int info_num = 5; // x1, x2, y1, y2 and score
size_t space_size = 0;
if (boxes.scalar_type() == at::kHalf) {
space_size = input_num_boxes * sizeof(int16_t) * info_num + sizeof(float);
} else {
space_size = input_num_boxes * sizeof(float) * info_num + sizeof(float);
}
auto workspace = at::empty(space_size, boxes.options().dtype(at::kByte));
// get compute queue
auto queue = torch_mlu::getCurQueue();
auto boxes_impl = torch_mlu::getMluTensorImpl(boxes_);
auto boxes_ptr = boxes_impl->cnnlMalloc();
auto scores_impl = torch_mlu::getMluTensorImpl(scores_);
auto scores_ptr = scores_impl->cnnlMalloc();
auto workspace_impl = torch_mlu::getMluTensorImpl(workspace);
auto workspace_ptr = workspace_impl->cnnlMalloc();
auto output_impl = torch_mlu::getMluTensorImpl(output);
auto output_ptr = output_impl->cnnlMalloc();
auto output_size_impl = torch_mlu::getMluTensorImpl(output_size);
auto output_size_ptr = output_size_impl->cnnlMalloc();
CNLOG(INFO) << "Launch Kernel MLUUnionX NMS<<<Union" << k_type / core_dim
<< ", " << k_dim.x << ", " << k_dim.y << ", " << k_dim.z << ">>>";
KernelNms(k_dim, k_type, queue, data_type_input, boxes_ptr, scores_ptr,
input_num_boxes, input_stride, max_output_boxes, iou_threshold,
offset, workspace_ptr, output_size_ptr, output_ptr);
int output_num = *static_cast<int *>(output_size.cpu().data_ptr());
return output.slice(0, 0, output_num);
}
Tensor nms_mlu(Tensor boxes, Tensor scores, float iou_threshold, int offset) {
return NMSMLUKernelLauncher(boxes, scores, iou_threshold, offset);
}
Tensor nms_impl(Tensor boxes, Tensor scores, float iou_threshold, int offset);
REGISTER_DEVICE_IMPL(nms_impl, MLU, nms_mlu);
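A hedged usage sketch of the registered NMS entry point, assuming an MLU-enabled build (values are illustrative):

import torch
from mmcv.ops import nms  # public entry point that dispatches to nms_mlu here

boxes = torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.]]).to('mlu')
scores = torch.tensor([0.9, 0.8]).to('mlu')
dets, keep = nms(boxes, scores, iou_threshold=0.5, offset=0)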
/*************************************************************************
* Copyright (C) 2021 Cambricon.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*************************************************************************/
#include "pytorch_device_registry.hpp"
#include "pytorch_mlu_helper.hpp"
void KernelRoiAlign(cnrtDim3_t k_dim, cnrtFunctionType_t k_type,
cnrtQueue_t queue, const cnrtDataType_t d_type,
const void *input, const void *rois, const int channels,
const bool aligned, const int pooled_height,
const int pooled_width, const int input_height,
const int input_width, const int sampling_ratio,
const float spatial_scale, const int num_rois,
void *output);
void KernelRoiAlignBackward(cnrtDim3_t k_dim, cnrtFunctionType_t k_type,
cnrtQueue_t queue, const cnrtDataType_t dtype,
const void *grads, const void *boxes,
void *grads_image, const int boxes_num,
const int hi, const int wi, const int c,
const int no, const int ho, const int wo,
const float spatial_scale, const int sampling_ratio,
const bool aligned);
void ROIAlignForwardMLUKernelLauncher(Tensor input, Tensor rois, Tensor output,
Tensor argmax_y, Tensor argmax_x,
int aligned_height, int aligned_width,
float spatial_scale, int sampling_ratio,
int pool_mode, bool aligned) {
// params check
TORCH_CHECK(
input.scalar_type() == at::kFloat || input.scalar_type() == at::kHalf,
"input type should be Float or Half, got ", input.scalar_type());
TORCH_CHECK(rois.scalar_type() == input.scalar_type(),
"rois should have the same type as input");
TORCH_CHECK(input.dim() == 4, "input should be a 4d tensor, got ",
input.dim(), "D");
TORCH_CHECK(rois.dim() == 2, "rois should be a 2d tensor, got ", rois.dim(),
"D");
  TORCH_CHECK(pool_mode == 1, "pool_mode only supports 'avg' currently");
auto memory_format =
torch_mlu::cnnl::ops::get_channels_last_memory_format(input.dim());
auto input_tensor =
torch_mlu::cnnl::ops::cnnl_contiguous(input, memory_format);
auto num_rois = rois.size(0);
auto channels = input.size(1);
int height = input.size(2);
int width = input.size(3);
if (output.numel() == 0) {
output = at::zeros({num_rois, channels, aligned_height, aligned_width},
input.options());
return;
}
at::Tensor output_tmp =
at::empty({num_rois, channels, aligned_height, aligned_width},
input.options(), memory_format);
// get tensor impl
auto self_impl = torch_mlu::getMluTensorImpl(input_tensor);
auto rois_impl = torch_mlu::getMluTensorImpl(rois);
auto output_impl = torch_mlu::getMluTensorImpl(output_tmp);
// get compute queue
auto queue = torch_mlu::getCurQueue();
// get the mlu ptr
auto self_ptr = self_impl->cnnlMalloc();
auto rois_ptr = rois_impl->cnnlMalloc();
auto output_ptr = output_impl->cnnlMalloc();
cnrtJobType_t k_type = CNRT_FUNC_TYPE_UNION1;
cnrtDim3_t k_dim;
k_dim.x = torch_mlu::getDeviceAttr(cnrtAttrMcorePerCluster);
k_dim.y = torch_mlu::getDeviceAttr(cnrtAttrClusterCount);
k_dim.z = 1;
cnrtDataType_t data_type = torch_mlu::toCnrtDtype(input.dtype());
KernelRoiAlign(k_dim, k_type, queue, data_type, self_ptr, rois_ptr, channels,
aligned, aligned_height, aligned_width, height, width,
sampling_ratio, spatial_scale, num_rois, output_ptr);
output.copy_(output_tmp);
}
// Round x up to the nearest power of two, e.g. nearestPower2(5) == 8.
// Standard bit trick: propagate the highest set bit into all lower bits,
// then add one.
static int nearestPower2(int x) {
  x--;
  x |= x >> 1;
  x |= x >> 2;
  x |= x >> 4;
  x |= x >> 8;
  x |= x >> 16;
  x++;
  return x;
}
void ROIAlignBackwardMLUKernelLauncher(Tensor grad, Tensor rois,
Tensor argmax_y, Tensor argmax_x,
Tensor grad_input, int aligned_height,
int aligned_width, float spatial_scale,
int sampling_ratio, int pool_mode,
bool aligned) {
// params check
TORCH_CHECK(
grad.scalar_type() == at::kFloat || grad.scalar_type() == at::kHalf,
"grad type should be Float or Half, got ", grad.scalar_type());
TORCH_CHECK(rois.scalar_type() == grad.scalar_type(),
"rois should have the same type as grad");
TORCH_CHECK(grad.dim() == 4, "grad should be a 4d tensor, got ", grad.dim(),
"D");
TORCH_CHECK(rois.dim() == 2, "rois should be a 2d tensor, got ", rois.dim(),
"D");
  TORCH_CHECK(pool_mode == 1, "pool_mode only supports 'avg' currently");
int batch_size = grad_input.size(0);
int channels = grad_input.size(1);
int height = grad_input.size(2);
int width = grad_input.size(3);
auto memory_format =
torch_mlu::cnnl::ops::get_channels_last_memory_format(grad.dim());
auto grad_ = torch_mlu::cnnl::ops::cnnl_contiguous(grad, memory_format);
auto grad_input_ = at::empty({batch_size, channels, height, width},
grad.options(), memory_format)
.zero_();
int boxes_num = rois.size(0);
int hi = grad.size(2);
int wi = grad.size(3);
int c = grad.size(1);
int no = grad_input.size(0);
int ho = grad_input.size(2);
int wo = grad_input.size(3);
// get tensor impl
auto grad_impl = torch_mlu::getMluTensorImpl(grad_);
auto grad_input_impl = torch_mlu::getMluTensorImpl(grad_input_);
auto rois_impl = torch_mlu::getMluTensorImpl(rois);
// get compute queue
auto queue = torch_mlu::getCurQueue();
// get the mlu ptr
auto grad_ptr = grad_impl->cnnlMalloc();
auto rois_ptr = rois_impl->cnnlMalloc();
auto grad_input_ptr = grad_input_impl->cnnlMalloc();
cnrtJobType_t k_type = CNRT_FUNC_TYPE_UNION1;
int need_core = nearestPower2(boxes_num);
int union_number = torch_mlu::getDeviceAttr(cnrtAttrClusterCount);
uint32_t dim_x = torch_mlu::getDeviceAttr(cnrtAttrMcorePerCluster);
uint32_t dim_y = (need_core - 1) / dim_x + 1;
dim_y = (dim_y > union_number) ? union_number : dim_y;
cnrtDim3_t k_dim = {dim_x, dim_y, 1};
cnrtDataType_t k_dtype = torch_mlu::toCnrtDtype(grad.dtype());
KernelRoiAlignBackward(k_dim, k_type, queue, k_dtype, grad_ptr, rois_ptr,
grad_input_ptr, boxes_num, hi, wi, c, no, ho, wo,
spatial_scale, sampling_ratio, aligned);
grad_input.copy_(grad_input_);
}
void roi_align_forward_mlu(Tensor input, Tensor rois, Tensor output,
Tensor argmax_y, Tensor argmax_x, int aligned_height,
int aligned_width, float spatial_scale,
int sampling_ratio, int pool_mode, bool aligned) {
ROIAlignForwardMLUKernelLauncher(input, rois, output, argmax_y, argmax_x,
aligned_height, aligned_width, spatial_scale,
sampling_ratio, pool_mode, aligned);
}
void roi_align_backward_mlu(Tensor grad_output, Tensor rois, Tensor argmax_y,
Tensor argmax_x, Tensor grad_input,
int aligned_height, int aligned_width,
float spatial_scale, int sampling_ratio,
int pool_mode, bool aligned) {
ROIAlignBackwardMLUKernelLauncher(
grad_output, rois, argmax_y, argmax_x, grad_input, aligned_height,
aligned_width, spatial_scale, sampling_ratio, pool_mode, aligned);
}
void roi_align_forward_impl(Tensor input, Tensor rois, Tensor output,
Tensor argmax_y, Tensor argmax_x,
int aligned_height, int aligned_width,
float spatial_scale, int sampling_ratio,
int pool_mode, bool aligned);
void roi_align_backward_impl(Tensor grad_output, Tensor rois, Tensor argmax_y,
Tensor argmax_x, Tensor grad_input,
int aligned_height, int aligned_width,
float spatial_scale, int sampling_ratio,
int pool_mode, bool aligned);
REGISTER_DEVICE_IMPL(roi_align_forward_impl, MLU, roi_align_forward_mlu);
REGISTER_DEVICE_IMPL(roi_align_backward_impl, MLU, roi_align_backward_mlu);
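As with NMS, a hedged usage sketch from the Python side; the launcher above only accepts pool_mode == 1 ('avg'), so the example sticks to average pooling, and the 'mlu' device string is again an assumption about the torch_mlu setup. mmcv exposes roi_align as an autograd Function, so arguments are passed positionally.

import torch
import torch_mlu  # noqa: F401
from mmcv.ops import roi_align

feats = torch.randn(1, 16, 32, 32).to('mlu')
# each roi is (batch_index, x1, y1, x2, y2) in input coordinates
rois = torch.tensor([[0., 4., 4., 20., 20.]]).to('mlu')
# positional args: input, rois, output_size, spatial_scale, sampling_ratio,
# pool_mode, aligned
pooled = roi_align(feats, rois, (7, 7), 1.0, 2, 'avg', True)
print(pooled.shape)  # torch.Size([1, 16, 7, 7])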
/*************************************************************************
* Copyright (C) 2022 Cambricon.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
* OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*************************************************************************/
#include "pytorch_device_registry.hpp"
#include "pytorch_mlu_helper.hpp"
void KernelTinShiftForward(
cnrtDim3_t k_dim, cnrtFunctionType_t k_type, cnrtQueue_t queue,
const void *input, const void *shifts, void *output, const int batch_size,
const int time_size, const int channel_size, const int hw_size,
const int group_size, const int group_channel,
const cnrtDataType_t data_dtype, const int channel_per_core,
const int max_number_hw_per_core, const int max_length_per_core);
void KernelTinShiftBackward(
cnrtDim3_t k_dim, cnrtFunctionType_t k_type, cnrtQueue_t queue,
const void *grad_output, const void *shifts, void *grad_input,
const int batch_size, const int time_size, const int channel_size,
const int hw_size, const int group_size, const int group_channel,
const cnrtDataType_t data_dtype, const int channel_per_core,
const int max_number_hw_per_core, const int max_length_per_core);
// policy function
static void policyFunc(const Tensor &input, cnrtDim3_t *k_dim,
cnrtFunctionType_t *k_type, int *channel_per_core,
int *max_number_hw_per_core, int *max_length_per_core) {
const int32_t cluster_limit = torch_mlu::getDeviceAttr(cnrtAttrClusterCount);
const int32_t core_limit = torch_mlu::getDeviceAttr(cnrtAttrMcorePerCluster);
auto nram_size = torch_mlu::getDeviceAttr(cnrtAttrNramSizePerMcore);
const int core_num = core_limit * cluster_limit;
const int batch_size = input.size(0);
const int time_size = input.size(1);
const int channel_size = input.size(2);
const int hw_size = input.size(3);
const size_t size_per_channel = time_size * hw_size * input.itemsize();
*channel_per_core = nram_size / size_per_channel;
int task_dim = 0;
if (*channel_per_core == 0) {
const size_t size_per_hw = hw_size * input.itemsize();
*max_number_hw_per_core = nram_size / size_per_hw;
if (*max_number_hw_per_core <= 0) {
*max_length_per_core = nram_size / input.itemsize();
}
int tmp_max_number_hw_per_core =
*max_number_hw_per_core > 0 ? *max_number_hw_per_core : 1;
const int loop_time =
(time_size / (tmp_max_number_hw_per_core)) +
((time_size % (tmp_max_number_hw_per_core)) > 0 ? 1 : 0);
task_dim = batch_size * channel_size * loop_time < core_num
? batch_size * channel_size * loop_time
: core_num;
} else {
task_dim = batch_size * channel_size < core_num ? batch_size * channel_size
: core_num;
}
k_dim->x = core_limit;
k_dim->y = (task_dim / core_limit) > 0 ? (task_dim / core_limit) : 1;
k_dim->z = 1;
*k_type = CNRT_FUNC_TYPE_UNION1;
}
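To make the split policy concrete, the following Python re-statement mirrors the arithmetic above under hypothetical device attributes (4 cores per cluster, 8 clusters, 384 KiB of NRAM per core); the real values come from torch_mlu::getDeviceAttr at runtime.

def tin_shift_policy(n, t, c, hw, itemsize=4,
                     core_limit=4, cluster_limit=8, nram_size=384 * 1024):
    core_num = core_limit * cluster_limit
    # how many whole (time, hw) channel slices fit into one core's NRAM
    channel_per_core = nram_size // (t * hw * itemsize)
    if channel_per_core == 0:
        # a whole slice does not fit; split the time axis instead
        hw_per_core = max(nram_size // (hw * itemsize), 1)
        loop_time = -(-t // hw_per_core)  # ceil division
        task_dim = min(n * c * loop_time, core_num)
    else:
        task_dim = min(n * c, core_num)
    # Union1 launch shape: x = cores per cluster, y = clusters used
    return core_limit, max(task_dim // core_limit, 1), 1

print(tin_shift_policy(n=2, t=8, c=16, hw=49))  # (4, 8, 1) with these numbers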
void TINShiftForwardMLUKernelLauncher(Tensor input, Tensor shift,
Tensor output) {
// params check
TORCH_CHECK(
input.scalar_type() == at::kFloat || input.scalar_type() == at::kHalf,
"input type should be Float or Half, got ", input.scalar_type(), ".");
TORCH_CHECK(input.dim() == 4, "input should be a 4d tensor, got ",
input.dim(), "d.");
TORCH_CHECK(shift.dim() == 2, "shift should be a 2d tensor, got ",
shift.dim(), "d.");
TORCH_CHECK(
input.size(0) == shift.size(0),
"input batch size should be the same as shift's, input batch size is ",
input.size(0), " and shift batch size is ", shift.size(0), ".");
TORCH_CHECK(input.size(0) != 0, "Input batch size should not be zero.");
TORCH_CHECK(input.size(3) != 0,
"The last dim size of input should not be zero.");
if (input.size(1) == 0) {
return;
}
cnrtDim3_t k_dim;
cnrtFunctionType_t k_type;
int channel_per_core = 0;
int max_number_hw_per_core = 0;
int max_length_per_core = 0;
policyFunc(input, &k_dim, &k_type, &channel_per_core, &max_number_hw_per_core,
&max_length_per_core);
const int batch_size = input.size(0);
const int time_size = input.size(1);
const int channel_size = input.size(2);
const int hw_size = input.size(3);
const int group_size = shift.size(1);
int group_channel = channel_size / group_size;
// get tensor impl
auto input_impl = torch_mlu::getMluTensorImpl(input);
auto shift_impl = torch_mlu::getMluTensorImpl(shift);
auto output_impl = torch_mlu::getMluTensorImpl(output);
// get compute queue
auto queue = torch_mlu::getCurQueue();
// get the mlu ptr
auto input_ptr = input_impl->cnnlMalloc();
auto shift_ptr = shift_impl->cnnlMalloc();
auto output_ptr = output_impl->cnnlMalloc();
cnrtDataType_t data_dtype = torch_mlu::toCnrtDtype(input.dtype());
KernelTinShiftForward(k_dim, k_type, queue, input_ptr, shift_ptr, output_ptr,
batch_size, time_size, channel_size, hw_size,
group_size, group_channel, data_dtype, channel_per_core,
max_number_hw_per_core, max_length_per_core);
}
void TINShiftBackwardMLUKernelLauncher(Tensor grad_output, Tensor shift,
Tensor grad_input) {
// params check
TORCH_CHECK(grad_output.scalar_type() == at::kFloat ||
grad_output.scalar_type() == at::kHalf,
"grad_output type should be Float or Half, got ",
grad_output.scalar_type(), ".");
TORCH_CHECK(grad_output.dim() == 4, "grad_output should be a 4d tensor, got ",
grad_output.dim(), "d.");
TORCH_CHECK(shift.dim() == 2, "shift should be a 2d tensor, got ",
shift.dim(), "d.");
TORCH_CHECK(grad_output.size(0) == shift.size(0),
"grad_output batch size should be the same as shift's, "
"grad_output batch size is ",
grad_output.size(0), ", shift batch size is ", shift.size(0),
".");
TORCH_CHECK(grad_output.size(0) != 0,
"grad_output batch size should not be zero.");
TORCH_CHECK(grad_output.size(3) != 0,
"The last dim size of grad_output should not be zero.");
if (grad_output.size(1) == 0) {
return;
}
cnrtDim3_t k_dim;
cnrtFunctionType_t k_type;
int channel_per_core = 0;
int max_number_hw_per_core = 0;
int max_length_per_core = 0;
policyFunc(grad_output, &k_dim, &k_type, &channel_per_core,
&max_number_hw_per_core, &max_length_per_core);
const int batch_size = grad_output.size(0);
const int time_size = grad_output.size(1);
const int channel_size = grad_output.size(2);
const int hw_size = grad_output.size(3);
const int group_size = shift.size(1);
int group_channel = channel_size / group_size;
// get tensor impl
auto grad_output_impl = torch_mlu::getMluTensorImpl(grad_output);
auto shift_impl = torch_mlu::getMluTensorImpl(shift);
auto grad_input_impl = torch_mlu::getMluTensorImpl(grad_input);
// get compute queue
auto queue = torch_mlu::getCurQueue();
// get the mlu ptr
auto grad_output_ptr = grad_output_impl->cnnlMalloc();
auto shift_ptr = shift_impl->cnnlMalloc();
auto grad_input_ptr = grad_input_impl->cnnlMalloc();
cnrtDataType_t data_dtype = torch_mlu::toCnrtDtype(grad_output.dtype());
KernelTinShiftBackward(k_dim, k_type, queue, grad_output_ptr, shift_ptr,
grad_input_ptr, batch_size, time_size, channel_size,
hw_size, group_size, group_channel, data_dtype,
channel_per_core, max_number_hw_per_core,
max_length_per_core);
}
void tin_shift_forward_mlu(Tensor input, Tensor shift, Tensor output) {
TINShiftForwardMLUKernelLauncher(input, shift, output);
}
void tin_shift_backward_mlu(Tensor grad_output, Tensor shift,
Tensor grad_input) {
TINShiftBackwardMLUKernelLauncher(grad_output, shift, grad_input);
}
void tin_shift_forward_impl(Tensor input, Tensor shift, Tensor output);
void tin_shift_backward_impl(Tensor grad_output, Tensor shift,
Tensor grad_input);
REGISTER_DEVICE_IMPL(tin_shift_forward_impl, MLU, tin_shift_forward_mlu);
REGISTER_DEVICE_IMPL(tin_shift_backward_impl, MLU, tin_shift_backward_mlu);
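Finally, a minimal usage sketch for tin_shift. The (batch, time, channel, h*w) input layout and the 2-D shift tensor follow the checks in the launchers above; the integer dtype of shift and the 'mlu' device string are assumptions.

import torch
import torch_mlu  # noqa: F401
from mmcv.ops import tin_shift

x = torch.randn(2, 8, 16, 49).to('mlu')  # (N, T, C, H*W)
# one shift per group; C must be divisible by the number of groups
shift = torch.randint(-2, 2, (2, 4), dtype=torch.int32).to('mlu')
out = tin_shift(x, shift)  # temporally shifted copy, same shape as x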
@@ -34,7 +34,8 @@ class SigmoidFocalLossFunction(Function):
                 weight=None,
                 reduction='mean'):
-        assert isinstance(target, (torch.LongTensor, torch.cuda.LongTensor))
+        assert isinstance(
+            target, (torch.Tensor, torch.LongTensor, torch.cuda.LongTensor))
         assert input.dim() == 2
         assert target.dim() == 1
         assert input.size(0) == target.size(0)
@@ -18,6 +18,10 @@ class TINShiftFunction(Function):
     @staticmethod
     def forward(ctx, input, shift):
+        if input.size(0) != shift.size(0):
+            raise ValueError(
+                'The first dim (batch) of `input` and `shift` should be '
+                f'same, but got {input.size(0)} and {shift.size(0)}.')
         C = input.size(2)
         num_segments = shift.size(1)
         if C // num_segments <= 0 or C % num_segments != 0:
@@ -12,7 +12,7 @@ from torch import distributed as dist
 from torch._utils import (_flatten_dense_tensors, _take_tensors,
                           _unflatten_dense_tensors)
-from mmcv.device.mlu import IS_MLU
+from mmcv.device.mlu import IS_MLU_AVAILABLE


 def _find_free_port():
@@ -49,7 +49,7 @@ def init_dist(launcher, backend='nccl', **kwargs):
 def _init_dist_pytorch(backend, **kwargs):
     # TODO: use local_rank instead of rank % num_gpus
     rank = int(os.environ['RANK'])
-    if IS_MLU:
+    if IS_MLU_AVAILABLE:
         import torch_mlu  # noqa: F401
         torch.mlu.set_device(rank)
         dist.init_process_group(
@@ -41,11 +41,12 @@ else:
     from .logging import get_logger, print_log
     from .parrots_jit import jit, skip_no_elena
     # yapf: disable
-    from .parrots_wrapper import (TORCH_VERSION, BuildExtension, CppExtension,
-                                  CUDAExtension, DataLoader, PoolDataLoader,
-                                  SyncBatchNorm, _AdaptiveAvgPoolNd,
-                                  _AdaptiveMaxPoolNd, _AvgPoolNd, _BatchNorm,
-                                  _ConvNd, _ConvTransposeMixin, _get_cuda_home,
+    from .parrots_wrapper import (IS_CUDA_AVAILABLE, TORCH_VERSION,
+                                  BuildExtension, CppExtension, CUDAExtension,
+                                  DataLoader, PoolDataLoader, SyncBatchNorm,
+                                  _AdaptiveAvgPoolNd, _AdaptiveMaxPoolNd,
+                                  _AvgPoolNd, _BatchNorm, _ConvNd,
+                                  _ConvTransposeMixin, _get_cuda_home,
                                   _InstanceNorm, _MaxPoolNd, get_build_config,
                                   is_rocm_pytorch)
     # yapf: enable
@@ -71,5 +72,6 @@ else:
         'assert_dict_has_keys', 'assert_keys_equal', 'assert_is_norm_layer',
         'assert_params_all_zeros', 'check_python_script',
         'is_method_overridden', 'is_jit_tracing', 'is_rocm_pytorch',
-        '_get_cuda_home', 'load_url', 'has_method', 'worker_init_fn'
+        '_get_cuda_home', 'load_url', 'has_method', 'IS_CUDA_AVAILABLE',
+        'worker_init_fn'
     ]
@@ -6,6 +6,13 @@ import torch
 TORCH_VERSION = torch.__version__


+def is_cuda_available() -> bool:
+    return torch.cuda.is_available()
+
+
+IS_CUDA_AVAILABLE = is_cuda_available()
+
+
 def is_rocm_pytorch() -> bool:
     is_rocm = False
     if TORCH_VERSION != 'parrots':
@@ -301,13 +301,12 @@ def get_extensions():
         extra_compile_args['cncc'] = [mlu_args] if mlu_args else []
         op_files = glob.glob('./mmcv/ops/csrc/pytorch/*.cpp') + \
             glob.glob('./mmcv/ops/csrc/pytorch/cpu/*.cpp') + \
-            glob.glob('./mmcv/ops/csrc/pytorch/mlu/*.cpp') + \
-            glob.glob('./mmcv/ops/csrc/pytorch/mlu/*.mlu')
+            glob.glob('./mmcv/ops/csrc/pytorch/mlu/*.cpp')
         extension = MLUExtension
         include_dirs.append(os.path.abspath('./mmcv/ops/csrc/common'))
         include_dirs.append(os.path.abspath('./mmcv/ops/csrc/common/mlu'))
     else:
-        print(f'Compiling {ext_name} without CUDA')
+        print(f'Compiling {ext_name} only with CPU')
         op_files = glob.glob('./mmcv/ops/csrc/pytorch/*.cpp') + \
             glob.glob('./mmcv/ops/csrc/pytorch/cpu/*.cpp')
         extension = CppExtension