Commit b47225fd authored by aiss

push v0.6.18 version

parent 2951b12d
Metadata-Version: 2.1
Name: torch_sparse
Version: 0.6.13
Summary: PyTorch Extension Library of Optimized Autograd Sparse Matrix Operations
Home-page: https://github.com/rusty1s/pytorch_sparse
Author: Matthias Fey
Author-email: matthias.fey@tu-dortmund.de
License: UNKNOWN
Download-URL: https://github.com/rusty1s/pytorch_sparse/archive/0.6.13.tar.gz
Description: [pypi-image]: https://badge.fury.io/py/torch-sparse.svg
[pypi-url]: https://pypi.python.org/pypi/torch-sparse
[testing-image]: https://github.com/rusty1s/pytorch_sparse/actions/workflows/testing.yml/badge.svg
[testing-url]: https://github.com/rusty1s/pytorch_sparse/actions/workflows/testing.yml
[linting-image]: https://github.com/rusty1s/pytorch_sparse/actions/workflows/linting.yml/badge.svg
[linting-url]: https://github.com/rusty1s/pytorch_sparse/actions/workflows/linting.yml
[coverage-image]: https://codecov.io/gh/rusty1s/pytorch_sparse/branch/master/graph/badge.svg
[coverage-url]: https://codecov.io/github/rusty1s/pytorch_sparse?branch=master
# PyTorch Sparse
[![PyPI Version][pypi-image]][pypi-url]
[![Testing Status][testing-image]][testing-url]
[![Linting Status][linting-image]][linting-url]
[![Code Coverage][coverage-image]][coverage-url]
--------------------------------------------------------------------------------
This package consists of a small extension library of optimized sparse matrix operations with autograd support.
It currently provides the following methods:
* **[Coalesce](#coalesce)**
* **[Transpose](#transpose)**
* **[Sparse Dense Matrix Multiplication](#sparse-dense-matrix-multiplication)**
* **[Sparse Sparse Matrix Multiplication](#sparse-sparse-matrix-multiplication)**
All included operations work on varying data types and are implemented both for CPU and GPU.
To avoid the hassle of creating [`torch.sparse_coo_tensor`](https://pytorch.org/docs/stable/torch.html?highlight=sparse_coo_tensor#torch.sparse_coo_tensor), this package defines operations on sparse tensors by simply passing `index` and `value` tensors as arguments ([with the same shapes as defined in PyTorch](https://pytorch.org/docs/stable/sparse.html)).
Note that only `value` comes with autograd support, as `index` is discrete and therefore not differentiable.
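Since the operations take plain tensors, autograd works out of the box on `value`; a minimal sketch of what this means in practice, using the `spmm` call documented below (the variable names here are illustrative):
```python
import torch
from torch_sparse import spmm

index = torch.tensor([[0, 0, 1],
                      [0, 1, 1]])                        # discrete, not differentiable
value = torch.tensor([1., 2., 3.], requires_grad=True)   # differentiable
dense = torch.rand(2, 4)

out = spmm(index, value, 2, 2, dense)  # (2, 4) result of the 2x2 sparse matrix times `dense`
out.sum().backward()
print(value.grad.shape)  # torch.Size([3]); `index` has no gradient
```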
## Installation
### Anaconda
**Update:** You can now install `pytorch-sparse` via [Anaconda](https://anaconda.org/pyg/pytorch-sparse) for all major OS/PyTorch/CUDA combinations 🤗
Given that you have [`pytorch >= 1.8.0` installed](https://pytorch.org/get-started/locally/), simply run
```
conda install pytorch-sparse -c pyg
```
### Binaries
We alternatively provide pip wheels for all major OS/PyTorch/CUDA combinations, see [here](https://data.pyg.org/whl).
#### PyTorch 1.11
To install the binaries for PyTorch 1.11.0, simply run
```
pip install torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-1.11.0+${CUDA}.html
```
where `${CUDA}` should be replaced by either `cpu`, `cu102`, `cu113`, or `cu115` depending on your PyTorch installation.
| | `cpu` | `cu102` | `cu113` | `cu115` |
|-------------|-------|---------|---------|---------|
| **Linux** | ✅ | ✅ | ✅ | ✅ |
| **Windows** | ✅ | | ✅ | ✅ |
| **macOS** | ✅ | | | |
#### PyTorch 1.10
To install the binaries for PyTorch 1.10.0, PyTorch 1.10.1 and PyTorch 1.10.2, simply run
```
pip install torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-1.10.0+${CUDA}.html
```
where `${CUDA}` should be replaced by either `cpu`, `cu102`, `cu111`, or `cu113` depending on your PyTorch installation.
| | `cpu` | `cu102` | `cu111` | `cu113` |
|-------------|-------|---------|---------|---------|
| **Linux** | ✅ | ✅ | ✅ | ✅ |
| **Windows** | ✅ | ✅ | ✅ | ✅ |
| **macOS** | ✅ | | | |
**Note:** Binaries of older versions are also provided for PyTorch 1.4.0, PyTorch 1.5.0, PyTorch 1.6.0, PyTorch 1.7.0/1.7.1, PyTorch 1.8.0/1.8.1 and PyTorch 1.9.0 (following the same procedure).
For older versions, you might need to explicitly specify the latest supported version number in order to prevent a manual installation from source.
You can look up the latest supported version number [here](https://data.pyg.org/whl).
### From source
Ensure that at least PyTorch 1.7.0 is installed and verify that `cuda/bin` and `cuda/include` are in your `$PATH` and `$CPATH` respectively, *e.g.*:
```
$ python -c "import torch; print(torch.__version__)"
>>> 1.7.0
$ echo $PATH
>>> /usr/local/cuda/bin:...
$ echo $CPATH
>>> /usr/local/cuda/include:...
```
If you want to additionally build `torch-sparse` with METIS support, *e.g.* for partitioning, please download and install the [METIS library](http://glaros.dtc.umn.edu/gkhome/metis/metis/download) by following the instructions in the `Install.txt` file.
Note that METIS needs to be installed with 64 bit `IDXTYPEWIDTH` by changing `include/metis.h`.
Afterwards, set the environment variable `WITH_METIS=1`.
Then run:
```
pip install torch-scatter torch-sparse
```
When running in a Docker container without an NVIDIA driver, PyTorch still needs to determine the compute capabilities and may fail.
In this case, ensure that the compute capabilities are set via `TORCH_CUDA_ARCH_LIST`, *e.g.*:
```
export TORCH_CUDA_ARCH_LIST="6.0 6.1 7.2+PTX 7.5+PTX"
```
## Functions
### Coalesce
```
torch_sparse.coalesce(index, value, m, n, op="add") -> (torch.LongTensor, torch.Tensor)
```
Row-wise sorts `index` and removes duplicate entries.
Duplicates are merged by scattering them together; any operation of [`torch_scatter`](https://github.com/rusty1s/pytorch_scatter) can be used for the reduction.
#### Parameters
* **index** *(LongTensor)* - The index tensor of the sparse matrix.
* **value** *(Tensor)* - The value tensor of the sparse matrix.
* **m** *(int)* - The first dimension of the sparse matrix.
* **n** *(int)* - The second dimension of the sparse matrix.
* **op** *(string, optional)* - The scatter operation to use. (default: `"add"`)
#### Returns
* **index** *(LongTensor)* - The coalesced index tensor of the sparse matrix.
* **value** *(Tensor)* - The coalesced value tensor of the sparse matrix.
#### Example
```python
import torch
from torch_sparse import coalesce
index = torch.tensor([[1, 0, 1, 0, 2, 1],
                      [0, 1, 1, 1, 0, 0]])
value = torch.Tensor([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6], [6, 7]])
index, value = coalesce(index, value, m=3, n=2)
```
```
print(index)
tensor([[0, 1, 1, 2],
        [1, 0, 1, 0]])
print(value)
tensor([[6.0, 8.0],
        [7.0, 9.0],
        [3.0, 4.0],
        [5.0, 6.0]])
```
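Since any `torch_scatter` reduction can be used, duplicates can also be merged with, *e.g.*, an element-wise maximum instead of a sum. A small sketch (assuming `op="max"` is forwarded to the corresponding `torch_scatter` reduction; expected output shown as comments):
```python
import torch
from torch_sparse import coalesce

index = torch.tensor([[1, 0, 1, 0, 2, 1],
                      [0, 1, 1, 1, 0, 0]])
value = torch.Tensor([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6], [6, 7]])

index, value = coalesce(index, value, m=3, n=2, op="max")
# index: [[0, 1, 1, 2],
#         [1, 0, 1, 0]]
# value: [[4., 5.],   # element-wise max of [2, 3] and [4, 5]
#         [6., 7.],   # element-wise max of [1, 2] and [6, 7]
#         [3., 4.],
#         [5., 6.]]
```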
### Transpose
```
torch_sparse.transpose(index, value, m, n, coalesced=True) -> (torch.LongTensor, torch.Tensor)
```
Transposes dimensions 0 and 1 of a sparse matrix.
#### Parameters
* **index** *(LongTensor)* - The index tensor of the sparse matrix.
* **value** *(Tensor)* - The value tensor of the sparse matrix.
* **m** *(int)* - The first dimension of the sparse matrix.
* **n** *(int)* - The second dimension of the sparse matrix.
* **coalesced** *(bool, optional)* - If set to `False`, will not coalesce the output. (default: `True`)
#### Returns
* **index** *(LongTensor)* - The transposed index tensor of the sparse matrix.
* **value** *(Tensor)* - The transposed value tensor of the sparse matrix.
#### Example
```python
import torch
from torch_sparse import transpose
index = torch.tensor([[1, 0, 1, 0, 2, 1],
                      [0, 1, 1, 1, 0, 0]])
value = torch.Tensor([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6], [6, 7]])
index, value = transpose(index, value, 3, 2)
```
```
print(index)
tensor([[0, 0, 1, 1],
        [1, 2, 0, 1]])
print(value)
tensor([[7.0, 9.0],
        [5.0, 6.0],
        [6.0, 8.0],
        [3.0, 4.0]])
```
### Sparse Dense Matrix Multiplication
```
torch_sparse.spmm(index, value, m, n, matrix) -> torch.Tensor
```
Matrix product of a sparse matrix with a dense matrix.
#### Parameters
* **index** *(LongTensor)* - The index tensor of the sparse matrix.
* **value** *(Tensor)* - The value tensor of the sparse matrix.
* **m** *(int)* - The first dimension of the sparse matrix.
* **n** *(int)* - The second dimension of the sparse matrix.
* **matrix** *(Tensor)* - The dense matrix.
#### Returns
* **out** *(Tensor)* - The dense output matrix.
#### Example
```python
import torch
from torch_sparse import spmm
index = torch.tensor([[0, 0, 1, 2, 2],
                      [0, 2, 1, 0, 1]])
value = torch.Tensor([1, 2, 4, 1, 3])
matrix = torch.Tensor([[1, 4], [2, 5], [3, 6]])
out = spmm(index, value, 3, 3, matrix)
```
```
print(out)
tensor([[7.0, 16.0],
        [8.0, 20.0],
        [7.0, 19.0]])
```
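For intuition, the result equals an ordinary dense matrix product built from the same `index`/`value` pair; a quick sanity check, continuing the example above:
```python
dense = torch.zeros(3, 3)
dense[index[0], index[1]] = value  # scatter the sparse entries into a dense matrix
assert torch.allclose(out, dense @ matrix)
```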
### Sparse Sparse Matrix Multiplication
```
torch_sparse.spspmm(indexA, valueA, indexB, valueB, m, k, n, coalesced=False) -> (torch.LongTensor, torch.Tensor)
```
Matrix product of two sparse tensors.
Both input sparse matrices need to be **coalesced** (use the `coalesced` argument to force coalescing).
#### Parameters
* **indexA** *(LongTensor)* - The index tensor of the first sparse matrix.
* **valueA** *(Tensor)* - The value tensor of the first sparse matrix.
* **indexB** *(LongTensor)* - The index tensor of the second sparse matrix.
* **valueB** *(Tensor)* - The value tensor of the second sparse matrix.
* **m** *(int)* - The first dimension of the first sparse matrix.
* **k** *(int)* - The second dimension of the first sparse matrix and the first dimension of the second sparse matrix.
* **n** *(int)* - The second dimension of the second sparse matrix.
* **coalesced** *(bool, optional)* - If set to `True`, will coalesce both input sparse matrices. (default: `False`)
#### Returns
* **index** *(LongTensor)* - The output index tensor of the sparse matrix.
* **value** *(Tensor)* - The output value tensor of the sparse matrix.
#### Example
```python
import torch
from torch_sparse import spspmm
indexA = torch.tensor([[0, 0, 1, 2, 2], [1, 2, 0, 0, 1]])
valueA = torch.Tensor([1, 2, 3, 4, 5])
indexB = torch.tensor([[0, 2], [1, 0]])
valueB = torch.Tensor([2, 4])
indexC, valueC = spspmm(indexA, valueA, indexB, valueB, 3, 3, 2)
```
```
print(indexC)
tensor([[0, 1, 2],
        [0, 1, 1]])
print(valueC)
tensor([8.0, 6.0, 8.0])
```
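As with `spmm`, the result can be verified against a dense reference; a short sketch, continuing the example above:
```python
A = torch.zeros(3, 3)
A[indexA[0], indexA[1]] = valueA
B = torch.zeros(3, 2)
B[indexB[0], indexB[1]] = valueB
C = A @ B  # dense product; its nonzeros (8, 6, 8) match (indexC, valueC)
```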
## C++ API
`torch-sparse` also offers a C++ API that contains C++ equivalents of the Python functions.
```
mkdir build
cd build
# Add -DWITH_CUDA=on to build with CUDA support if needed
cmake ..
make
make install
```
## Running tests
```
pytest
```
Keywords: pytorch,sparse,sparse-matrices,autograd
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3 :: Only
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Provides-Extra: test
#include "spspmm_cpu.h"
#include "utils.h"
std::tuple<torch::Tensor, torch::Tensor, torch::optional<torch::Tensor>>
spspmm_cpu(torch::Tensor rowptrA, torch::Tensor colA,
torch::optional<torch::Tensor> optional_valueA,
torch::Tensor rowptrB, torch::Tensor colB,
torch::optional<torch::Tensor> optional_valueB, int64_t K,
std::string reduce) {
CHECK_CPU(rowptrA);
CHECK_CPU(colA);
if (optional_valueA.has_value())
CHECK_CPU(optional_valueA.value());
CHECK_CPU(rowptrB);
CHECK_CPU(colB);
if (optional_valueB.has_value())
CHECK_CPU(optional_valueB.value());
CHECK_INPUT(rowptrA.dim() == 1);
CHECK_INPUT(colA.dim() == 1);
if (optional_valueA.has_value()) {
CHECK_INPUT(optional_valueA.value().dim() == 1);
CHECK_INPUT(optional_valueA.value().size(0) == colA.size(0));
}
CHECK_INPUT(rowptrB.dim() == 1);
CHECK_INPUT(colB.dim() == 1);
if (optional_valueB.has_value()) {
CHECK_INPUT(optional_valueB.value().dim() == 1);
CHECK_INPUT(optional_valueB.value().size(0) == colB.size(0));
}
if (!optional_valueA.has_value() && optional_valueB.has_value())
optional_valueA =
torch::ones(colA.numel(), optional_valueB.value().options());
if (!optional_valueB.has_value() && optional_valueA.has_value())
optional_valueB =
torch::ones(colB.numel(), optional_valueA.value().options());
auto scalar_type = torch::ScalarType::Float;
if (optional_valueA.has_value())
scalar_type = optional_valueA.value().scalar_type();
auto rowptrA_data = rowptrA.data_ptr<int64_t>();
auto colA_data = colA.data_ptr<int64_t>();
auto rowptrB_data = rowptrB.data_ptr<int64_t>();
auto colB_data = colB.data_ptr<int64_t>();
auto rowptrC = torch::empty_like(rowptrA);
auto rowptrC_data = rowptrC.data_ptr<int64_t>();
rowptrC_data[0] = 0;
torch::Tensor colC;
torch::optional<torch::Tensor> optional_valueC = torch::nullopt;
AT_DISPATCH_ALL_TYPES(scalar_type, "spspmm", [&] {
AT_DISPATCH_HAS_VALUE(optional_valueA, [&] {
scalar_t *valA_data = nullptr, *valB_data = nullptr;
if (HAS_VALUE) {
valA_data = optional_valueA.value().data_ptr<scalar_t>();
valB_data = optional_valueB.value().data_ptr<scalar_t>();
}
int64_t nnz = 0, cA, cB;
std::vector<scalar_t> tmp_vals(K, 0);
std::vector<int64_t> cols;
std::vector<scalar_t> vals;
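// Gustavson-style row-by-row SpGEMM: for every row of A, expand its nonzeros
// against the matching rows of B into the dense length-K accumulator
// `tmp_vals`, then compress the nonzero positions into `cols`/`vals`.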
for (auto rA = 0; rA < rowptrA.numel() - 1; rA++) {
for (auto eA = rowptrA_data[rA]; eA < rowptrA_data[rA + 1]; eA++) {
cA = colA_data[eA];
for (auto eB = rowptrB_data[cA]; eB < rowptrB_data[cA + 1]; eB++) {
cB = colB_data[eB];
if (HAS_VALUE)
tmp_vals[cB] += valA_data[eA] * valB_data[eB];
else
tmp_vals[cB]++;
}
}
for (auto k = 0; k < K; k++) {
if (tmp_vals[k] != 0) {
cols.push_back(k);
if (HAS_VALUE)
vals.push_back(tmp_vals[k]);
nnz++;
}
tmp_vals[k] = (scalar_t)0;
}
rowptrC_data[rA + 1] = nnz;
}
colC = torch::from_blob(cols.data(), {nnz}, colA.options()).clone();
if (HAS_VALUE) {
optional_valueC = torch::from_blob(vals.data(), {nnz},
optional_valueA.value().options());
optional_valueC = optional_valueC.value().clone();
}
});
});
return std::make_tuple(rowptrC, colC, optional_valueC);
}
#pragma once
#include "../extensions.h"
std::tuple<torch::Tensor, torch::Tensor, torch::optional<torch::Tensor>>
spspmm_cpu(torch::Tensor rowptrA, torch::Tensor colA,
torch::optional<torch::Tensor> optional_valueA,
torch::Tensor rowptrB, torch::Tensor colB,
torch::optional<torch::Tensor> optional_valueB, int64_t K,
std::string reduce);
#pragma once
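// Atomic add helpers: the double overload below falls back to an atomicCAS
// loop on architectures/toolchains without a native double-precision atomicAdd.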
static inline __device__ void atomAdd(float *address, float val) {
atomicAdd(address, val);
}
static inline __device__ void atomAdd(double *address, double val) {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 600 || TORCH_HIP_VERSION < 8000)
unsigned long long int *address_as_ull = (unsigned long long int *)address;
unsigned long long int old = *address_as_ull;
unsigned long long int assumed;
do {
assumed = old;
old = atomicCAS(address_as_ull, assumed,
__double_as_longlong(val + __longlong_as_double(assumed)));
} while (assumed != old);
#else
atomicAdd(address, val);
#endif
}
#pragma once
#include "../extensions.h"
torch::Tensor ind2ptr_cuda(torch::Tensor ind, int64_t M);
torch::Tensor ptr2ind_cuda(torch::Tensor ptr, int64_t E);
#include "hip/hip_runtime.h"
#include "convert_hip.h"
#include <ATen/hip/HIPContext.h>
#include "utils.cuh"
#define THREADS 256
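// ind2ptr converts a sorted COO row-index vector into a CSR row-pointer array
// of length M + 1; ptr2ind (below) performs the inverse conversion.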
__global__ void ind2ptr_kernel(const int64_t *ind_data, int64_t *out_data,
int64_t M, int64_t numel) {
int64_t thread_idx = blockDim.x * blockIdx.x + threadIdx.x;
if (thread_idx == 0) {
for (int64_t i = 0; i <= ind_data[0]; i++)
out_data[i] = 0;
} else if (thread_idx < numel) {
for (int64_t i = ind_data[thread_idx - 1]; i < ind_data[thread_idx]; i++)
out_data[i + 1] = thread_idx;
} else if (thread_idx == numel) {
for (int64_t i = ind_data[numel - 1] + 1; i < M + 1; i++)
out_data[i] = numel;
}
}
torch::Tensor ind2ptr_cuda(torch::Tensor ind, int64_t M) {
CHECK_CUDA(ind);
hipSetDevice(ind.get_device());
auto out = torch::empty(M + 1, ind.options());
if (ind.numel() == 0)
return out.zero_();
auto ind_data = ind.data_ptr<int64_t>();
auto out_data = out.data_ptr<int64_t>();
auto stream = at::cuda::getCurrentCUDAStream();
ind2ptr_kernel<<<(ind.numel() + 2 + THREADS - 1) / THREADS, THREADS, 0,
stream>>>(ind_data, out_data, M, ind.numel());
return out;
}
__global__ void ptr2ind_kernel(const int64_t *ptr_data, int64_t *out_data,
int64_t E, int64_t numel) {
int64_t thread_idx = blockDim.x * blockIdx.x + threadIdx.x;
if (thread_idx < numel) {
int64_t idx = ptr_data[thread_idx], next_idx = ptr_data[thread_idx + 1];
for (int64_t i = idx; i < next_idx; i++) {
out_data[i] = thread_idx;
}
}
}
torch::Tensor ptr2ind_cuda(torch::Tensor ptr, int64_t E) {
CHECK_CUDA(ptr);
hipSetDevice(ptr.get_device());
auto out = torch::empty(E, ptr.options());
auto ptr_data = ptr.data_ptr<int64_t>();
auto out_data = out.data_ptr<int64_t>();
auto stream = at::cuda::getCurrentCUDAStream();
ptr2ind_kernel<<<(ptr.numel() - 1 + THREADS - 1) / THREADS, THREADS, 0,
stream>>>(ptr_data, out_data, E, ptr.numel() - 1);
return out;
}
#include "hip/hip_runtime.h"
#include "convert_hip.h"
#include <ATen/hip/HIPContext.h>
#include "utils.cuh"
#define THREADS 256
__global__ void ind2ptr_kernel(const int64_t *ind_data, int64_t *out_data,
int64_t M, int64_t numel) {
int64_t thread_idx = blockDim.x * blockIdx.x + threadIdx.x;
if (thread_idx == 0) {
for (int64_t i = 0; i <= ind_data[0]; i++)
out_data[i] = 0;
} else if (thread_idx < numel) {
for (int64_t i = ind_data[thread_idx - 1]; i < ind_data[thread_idx]; i++)
out_data[i + 1] = thread_idx;
} else if (thread_idx == numel) {
for (int64_t i = ind_data[numel - 1] + 1; i < M + 1; i++)
out_data[i] = numel;
}
}
torch::Tensor ind2ptr_cuda(torch::Tensor ind, int64_t M) {
CHECK_CUDA(ind);
hipSetDevice(ind.get_device());
auto out = torch::empty(M + 1, ind.options());
if (ind.numel() == 0)
return out.zero_();
auto ind_data = ind.data_ptr<int64_t>();
auto out_data = out.data_ptr<int64_t>();
auto stream = at::hip::getCurrentHIPStreamMasqueradingAsCUDA();
hipLaunchKernelGGL(( ind2ptr_kernel), dim3((ind.numel() + 2 + THREADS - 1) / THREADS), dim3(THREADS), 0,
stream, ind_data, out_data, M, ind.numel());
return out;
}
__global__ void ptr2ind_kernel(const int64_t *ptr_data, int64_t *out_data,
int64_t E, int64_t numel) {
int64_t thread_idx = blockDim.x * blockIdx.x + threadIdx.x;
if (thread_idx < numel) {
int64_t idx = ptr_data[thread_idx], next_idx = ptr_data[thread_idx + 1];
for (int64_t i = idx; i < next_idx; i++) {
out_data[i] = thread_idx;
}
}
}
torch::Tensor ptr2ind_cuda(torch::Tensor ptr, int64_t E) {
CHECK_CUDA(ptr);
hipSetDevice(ptr.get_device());
auto out = torch::empty(E, ptr.options());
auto ptr_data = ptr.data_ptr<int64_t>();
auto out_data = out.data_ptr<int64_t>();
auto stream = at::hip::getCurrentHIPStreamMasqueradingAsCUDA();
hipLaunchKernelGGL(( ptr2ind_kernel), dim3((ptr.numel() - 1 + THREADS - 1) / THREADS), dim3(THREADS), 0,
stream, ptr_data, out_data, E, ptr.numel() - 1);
return out;
}
#pragma once
#include "../extensions.h"
torch::Tensor non_diag_mask_cuda(torch::Tensor row, torch::Tensor col,
int64_t M, int64_t N, int64_t k);
#include "hip/hip_runtime.h"
#include "diag_hip.h"
#include <ATen/hip/HIPContext.h>
#include "utils.cuh"
#define THREADS 1024
__global__ void non_diag_mask_kernel(const int64_t *row_data,
const int64_t *col_data, bool *out_data,
int64_t N, int64_t k, int64_t num_diag,
int64_t numel) {
int64_t thread_idx = blockDim.x * blockIdx.x + threadIdx.x;
if (thread_idx < numel) {
int64_t r = row_data[thread_idx], c = col_data[thread_idx];
if (k < 0) {
if (r + k < 0) {
out_data[thread_idx] = true;
} else if (r + k >= N) {
out_data[thread_idx + num_diag] = true;
} else if (r + k > c) {
out_data[thread_idx + r + k] = true;
} else if (r + k < c) {
out_data[thread_idx + r + k + 1] = true;
}
} else {
if (r + k >= N) {
out_data[thread_idx + num_diag] = true;
} else if (r + k > c) {
out_data[thread_idx + r] = true;
} else if (r + k < c) {
out_data[thread_idx + r + 1] = true;
}
}
}
}
torch::Tensor non_diag_mask_cuda(torch::Tensor row, torch::Tensor col,
int64_t M, int64_t N, int64_t k) {
CHECK_CUDA(row);
CHECK_CUDA(col);
hipSetDevice(row.get_device());
auto E = row.size(0);
auto num_diag = k < 0 ? std::min(M + k, N) : std::min(M, N - k);
auto row_data = row.data_ptr<int64_t>();
auto col_data = col.data_ptr<int64_t>();
auto mask = torch::zeros(E + num_diag, row.options().dtype(torch::kBool));
auto mask_data = mask.data_ptr<bool>();
if (E == 0)
return mask;
auto stream = at::cuda::getCurrentCUDAStream();
non_diag_mask_kernel<<<(E + THREADS - 1) / THREADS, THREADS, 0, stream>>>(
row_data, col_data, mask_data, N, k, num_diag, E);
return mask;
}
#include "hip/hip_runtime.h"
#include "diag_hip.h"
#include <ATen/hip/HIPContext.h>
#include "utils.cuh"
#define THREADS 1024
__global__ void non_diag_mask_kernel(const int64_t *row_data,
const int64_t *col_data, bool *out_data,
int64_t N, int64_t k, int64_t num_diag,
int64_t numel) {
int64_t thread_idx = blockDim.x * blockIdx.x + threadIdx.x;
if (thread_idx < numel) {
int64_t r = row_data[thread_idx], c = col_data[thread_idx];
if (k < 0) {
if (r + k < 0) {
out_data[thread_idx] = true;
} else if (r + k >= N) {
out_data[thread_idx + num_diag] = true;
} else if (r + k > c) {
out_data[thread_idx + r + k] = true;
} else if (r + k < c) {
out_data[thread_idx + r + k + 1] = true;
}
} else {
if (r + k >= N) {
out_data[thread_idx + num_diag] = true;
} else if (r + k > c) {
out_data[thread_idx + r] = true;
} else if (r + k < c) {
out_data[thread_idx + r + 1] = true;
}
}
}
}
torch::Tensor non_diag_mask_cuda(torch::Tensor row, torch::Tensor col,
int64_t M, int64_t N, int64_t k) {
CHECK_CUDA(row);
CHECK_CUDA(col);
hipSetDevice(row.get_device());
auto E = row.size(0);
auto num_diag = k < 0 ? std::min(M + k, N) : std::min(M, N - k);
auto row_data = row.data_ptr<int64_t>();
auto col_data = col.data_ptr<int64_t>();
auto mask = torch::zeros(E + num_diag, row.options().dtype(torch::kBool));
auto mask_data = mask.data_ptr<bool>();
if (E == 0)
return mask;
auto stream = at::hip::getCurrentHIPStreamMasqueradingAsCUDA();
hipLaunchKernelGGL(( non_diag_mask_kernel), dim3((E + THREADS - 1) / THREADS), dim3(THREADS), 0, stream,
row_data, col_data, mask_data, N, k, num_diag, E);
return mask;
}
#pragma once
#include <limits>
#include <map>
enum ReductionType { SUM, MEAN, MUL, DIV, MIN, MAX };
const std::map<std::string, ReductionType> reduce2REDUCE = {
{"sum", SUM}, {"mean", MEAN}, {"mul", MUL},
{"div", DIV}, {"min", MIN}, {"max", MAX},
};
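// Maps the runtime `reduce` string onto a compile-time ReductionType constant,
// analogous to ATen's AT_DISPATCH_* macros.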
#define AT_DISPATCH_REDUCTION_TYPES(reduce, ...) \
[&] { \
switch (reduce2REDUCE.at(reduce)) { \
case SUM: { \
const ReductionType REDUCE = SUM; \
return __VA_ARGS__(); \
} \
case MEAN: { \
const ReductionType REDUCE = MEAN; \
return __VA_ARGS__(); \
} \
case MUL: { \
const ReductionType REDUCE = MUL; \
return __VA_ARGS__(); \
} \
case DIV: { \
const ReductionType REDUCE = DIV; \
return __VA_ARGS__(); \
} \
case MIN: { \
const ReductionType REDUCE = MIN; \
return __VA_ARGS__(); \
} \
case MAX: { \
const ReductionType REDUCE = MAX; \
return __VA_ARGS__(); \
} \
} \
}()
template <typename scalar_t, ReductionType REDUCE> struct Reducer {
static inline __host__ __device__ scalar_t init() {
if (REDUCE == MUL || REDUCE == DIV)
return (scalar_t)1;
else if (REDUCE == MIN)
return std::numeric_limits<scalar_t>::max();
else if (REDUCE == MAX)
return std::numeric_limits<scalar_t>::lowest();
else
return (scalar_t)0;
}
static inline __host__ __device__ void update(scalar_t *val, scalar_t new_val,
int64_t *arg, int64_t new_arg) {
if (REDUCE == SUM || REDUCE == MEAN)
*val = *val + new_val;
else if (REDUCE == MUL)
*val = *val * new_val;
else if (REDUCE == DIV)
*val = *val / new_val;
else if ((REDUCE == MIN && new_val < *val) ||
(REDUCE == MAX && new_val > *val)) {
*val = new_val;
*arg = new_arg;
}
}
static inline __host__ __device__ void write(scalar_t *address, scalar_t val,
int64_t *arg_address,
int64_t arg, int count) {
if (REDUCE == SUM || REDUCE == MUL || REDUCE == DIV)
*address = val;
else if (REDUCE == MEAN)
*address = val / (scalar_t)(count > 0 ? count : 1);
else if (REDUCE == MIN || REDUCE == MAX) {
if (count > 0) {
*address = val;
*arg_address = arg;
} else
*address = (scalar_t)0;
}
}
};
#pragma once
#include "../extensions.h"
torch::Tensor random_walk_cuda(torch::Tensor rowptr, torch::Tensor col,
torch::Tensor start, int64_t walk_length);
#include "hip/hip_runtime.h"
#include "rw_hip.h"
#include <ATen/hip/HIPContext.h>
#include "utils.cuh"
#define THREADS 1024
#define BLOCKS(N) (N + THREADS - 1) / THREADS
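// One thread per walk: each step draws a uniformly random neighbor of the
// current node using the precomputed `rand` matrix; the output holds one walk
// of length `walk_length + 1` per start node.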
__global__ void uniform_random_walk_kernel(const int64_t *rowptr,
const int64_t *col,
const int64_t *start,
const float *rand, int64_t *out,
int64_t walk_length, int64_t numel) {
const int64_t thread_idx = blockIdx.x * blockDim.x + threadIdx.x;
if (thread_idx < numel) {
int64_t cur = start[thread_idx];
out[thread_idx] = cur;
int64_t row_start, row_end;
for (int64_t l = 0; l < walk_length; l++) {
row_start = rowptr[cur], row_end = rowptr[cur + 1];
cur = col[row_start +
int64_t(rand[l * numel + thread_idx] * (row_end - row_start))];
out[(l + 1) * numel + thread_idx] = cur;
}
}
}
torch::Tensor random_walk_cuda(torch::Tensor rowptr, torch::Tensor col,
torch::Tensor start, int64_t walk_length) {
CHECK_CUDA(rowptr);
CHECK_CUDA(col);
CHECK_CUDA(start);
hipSetDevice(rowptr.get_device());
CHECK_INPUT(rowptr.dim() == 1);
CHECK_INPUT(col.dim() == 1);
CHECK_INPUT(start.dim() == 1);
auto rand = torch::rand({walk_length, start.size(0)},
start.options().dtype(torch::kFloat));
auto out = torch::full({walk_length + 1, start.size(0)}, -1, start.options());
auto stream = at::cuda::getCurrentCUDAStream();
uniform_random_walk_kernel<<<BLOCKS(start.numel()), THREADS, 0, stream>>>(
rowptr.data_ptr<int64_t>(), col.data_ptr<int64_t>(),
start.data_ptr<int64_t>(), rand.data_ptr<float>(),
out.data_ptr<int64_t>(), walk_length, start.numel());
return out.t().contiguous();
}
#include "hip/hip_runtime.h"
#include "rw_hip.h"
#include <ATen/hip/HIPContext.h>
#include "utils.cuh"
#define THREADS 1024
#define BLOCKS(N) (N + THREADS - 1) / THREADS
__global__ void uniform_random_walk_kernel(const int64_t *rowptr,
const int64_t *col,
const int64_t *start,
const float *rand, int64_t *out,
int64_t walk_length, int64_t numel) {
const int64_t thread_idx = blockIdx.x * blockDim.x + threadIdx.x;
if (thread_idx < numel) {
int64_t cur = start[thread_idx];
out[thread_idx] = cur;
int64_t row_start, row_end;
for (int64_t l = 0; l < walk_length; l++) {
row_start = rowptr[cur], row_end = rowptr[cur + 1];
cur = col[row_start +
int64_t(rand[l * numel + thread_idx] * (row_end - row_start))];
out[(l + 1) * numel + thread_idx] = cur;
}
}
}
torch::Tensor random_walk_cuda(torch::Tensor rowptr, torch::Tensor col,
torch::Tensor start, int64_t walk_length) {
CHECK_CUDA(rowptr);
CHECK_CUDA(col);
CHECK_CUDA(start);
hipSetDevice(rowptr.get_device());
CHECK_INPUT(rowptr.dim() == 1);
CHECK_INPUT(col.dim() == 1);
CHECK_INPUT(start.dim() == 1);
auto rand = torch::rand({walk_length, start.size(0)},
start.options().dtype(torch::kFloat));
auto out = torch::full({walk_length + 1, start.size(0)}, -1, start.options());
auto stream = at::hip::getCurrentHIPStreamMasqueradingAsCUDA();
hipLaunchKernelGGL(( uniform_random_walk_kernel), dim3(BLOCKS(start.numel())), dim3(THREADS), 0, stream,
rowptr.data_ptr<int64_t>(), col.data_ptr<int64_t>(),
start.data_ptr<int64_t>(), rand.data_ptr<float>(),
out.data_ptr<int64_t>(), walk_length, start.numel());
return out.t().contiguous();
}
#pragma once
#include "../extensions.h"
std::tuple<torch::Tensor, torch::optional<torch::Tensor>>
spmm_cuda(torch::Tensor rowptr, torch::Tensor col,
torch::optional<torch::Tensor> optional_value, torch::Tensor mat,
std::string reduce);
torch::Tensor spmm_value_bw_cuda(torch::Tensor row, torch::Tensor rowptr,
torch::Tensor col, torch::Tensor mat,
torch::Tensor grad, std::string reduce);
template<typename T>
__device__ T __ldg(const T* ptr) {
return *ptr;
}
#include "hip/hip_runtime.h"
#include "spmm_hip.h"
#include <ATen/hip/HIPContext.h>
#include "reducer.cuh"
#include "utils.cuh"
#define THREADS 256
#define FULL_MASK 0xffffffff
// Paper: Design Principles for Sparse Matrix Multiplication on the GPU
// Code: https://github.com/owensgroup/merge-spmm
template <typename scalar_t, ReductionType REDUCE, bool HAS_VALUE>
__global__ void spmm_kernel(const int64_t *rowptr_data, const int64_t *col_data,
const scalar_t *value_data,
const scalar_t *mat_data, scalar_t *out_data,
int64_t *arg_out_data, int B, int M, int N, int K) {
// We ignore blockIdx.y here, because threads
// across `blockIdx.y` are treated equally.
int thread_idx = blockDim.x * blockIdx.x + threadIdx.x;
int row = thread_idx >> 5; // thread_idx / 32
int lane_idx = thread_idx & (32 - 1); // thread_idx % 32
int batch_idx = row / M;
// Compute the column index of `mat` in which the thread is operating.
int mat_col_idx = lane_idx + (blockIdx.y << 5);
// Compute the output index (row-major order).
int out_idx = row * K + mat_col_idx;
// Helper arrays for warp communication.
int mat_row, mat_rows[32];
scalar_t val, vals[HAS_VALUE ? 32 : 1];
// Do not aggregate/write across the Y-axis (lane_idx < leftover).
int leftover = K - (blockIdx.y << 5);
if (batch_idx < B) {
int row_start = __ldg(rowptr_data + (row % M));
int row_end = __ldg(rowptr_data + (row % M) + 1);
int col_idx = row_start + lane_idx;
scalar_t result = Reducer<scalar_t, REDUCE>::init();
int64_t arg;
// Iterate over all `col` indices in parallel within a warp.
for (int c = row_start; c < row_end; c += 32) {
if (col_idx < row_end) {
// Coalesced memory access into `col` and `val`.
mat_row = __ldg(col_data + col_idx) * K;
if (HAS_VALUE)
val = __ldg(value_data + col_idx);
} else {
mat_row = -1;
if (HAS_VALUE)
val = (scalar_t)0;
}
col_idx += 32;
#pragma unroll
for (int i = 0; i < 32; i++) {
// Communication between all threads in a warp.
mat_rows[i] = __shfl_sync(FULL_MASK, mat_row, i);
if (HAS_VALUE)
vals[i] = __shfl_sync(FULL_MASK, val, i);
}
#pragma unroll
for (int i = 0; i < 32; i++) {
if (lane_idx < leftover && mat_rows[i] != -1) {
// Coalesced memory access into `mat`.
val = __ldg(mat_data + batch_idx * N * K + mat_rows[i] + mat_col_idx);
if (HAS_VALUE)
val = vals[i] * val;
Reducer<scalar_t, REDUCE>::update(&result, val, &arg, c + i);
}
}
}
if (lane_idx < leftover) {
// Coalesced write into `out`.
Reducer<scalar_t, REDUCE>::write(out_data + out_idx, result,
arg_out_data + out_idx, arg,
row_end - row_start);
}
}
}
std::tuple<torch::Tensor, torch::optional<torch::Tensor>>
spmm_cuda(torch::Tensor rowptr, torch::Tensor col,
torch::optional<torch::Tensor> optional_value, torch::Tensor mat,
std::string reduce) {
CHECK_CUDA(rowptr);
CHECK_CUDA(col);
if (optional_value.has_value())
CHECK_CUDA(optional_value.value());
CHECK_CUDA(mat);
hipSetDevice(rowptr.get_device());
CHECK_INPUT(rowptr.dim() == 1);
CHECK_INPUT(col.dim() == 1);
if (optional_value.has_value()) {
CHECK_INPUT(optional_value.value().dim() == 1);
CHECK_INPUT(optional_value.value().size(0) == col.size(0));
}
CHECK_INPUT(mat.dim() >= 2);
mat = mat.contiguous();
auto sizes = mat.sizes().vec();
sizes[mat.dim() - 2] = rowptr.numel() - 1;
auto out = torch::empty(sizes, mat.options());
torch::optional<torch::Tensor> arg_out = torch::nullopt;
int64_t *arg_out_data = nullptr;
if (reduce2REDUCE.at(reduce) == MIN || reduce2REDUCE.at(reduce) == MAX) {
arg_out = torch::full_like(out, col.numel(), rowptr.options());
arg_out_data = arg_out.value().data_ptr<int64_t>();
}
auto rowptr_data = rowptr.data_ptr<int64_t>();
auto col_data = col.data_ptr<int64_t>();
auto M = rowptr.numel() - 1;
auto N = mat.size(-2);
auto K = mat.size(-1);
auto B = mat.numel() / (N * K);
auto BLOCKS = dim3((32 * B * M + THREADS - 1) / THREADS, (K + 31) / 32);
auto stream = at::cuda::getCurrentCUDAStream();
AT_DISPATCH_ALL_TYPES_AND(at::ScalarType::Half, mat.scalar_type(), "_", [&] {
auto mat_data = mat.data_ptr<scalar_t>();
auto out_data = out.data_ptr<scalar_t>();
AT_DISPATCH_REDUCTION_TYPES(reduce, [&] {
if (optional_value.has_value()) {
auto value_data = optional_value.value().data_ptr<scalar_t>();
spmm_kernel<scalar_t, REDUCE, true><<<BLOCKS, THREADS, 0, stream>>>(
rowptr_data, col_data, value_data, mat_data, out_data, arg_out_data,
B, M, N, K);
} else {
spmm_kernel<scalar_t, REDUCE, false><<<BLOCKS, THREADS, 0, stream>>>(
rowptr_data, col_data, nullptr, mat_data, out_data, arg_out_data, B,
M, N, K);
}
});
});
return std::make_tuple(out, arg_out);
}
template <typename scalar_t, ReductionType REDUCE>
__global__ void
spmm_value_bw_kernel(const int64_t *row_data, const int64_t *rowptr_data,
const int64_t *col_data, const scalar_t *mat_data,
const scalar_t *grad_data, scalar_t *out_data, int B,
int M, int N, int E, int K) {
int thread_idx = blockDim.x * blockIdx.x + threadIdx.x;
int index_idx = (thread_idx >> 5); // thread_idx / 32
int lane_idx = thread_idx & (32 - 1); // thread_idx % 32
if (index_idx < E) {
int row = __ldg(row_data + index_idx);
int col = __ldg(col_data + index_idx);
scalar_t val = (scalar_t)0;
for (int b = 0; b < B; b++) {
for (int k = lane_idx; k < K; k += 32) {
val += mat_data[b * N * K + col * K + k] *
grad_data[b * M * K + row * K + k];
}
}
#pragma unroll
for (int i = 32 / 2; i > 0; i /= 2) { // Parallel reduction inside a warp.
val += __shfl_down_sync(FULL_MASK, val, i);
}
if (lane_idx == 0) {
if (REDUCE == MEAN) {
int row_start = __ldg(rowptr_data + row);
int row_end = __ldg(rowptr_data + row + 1);
val /= (scalar_t)max(row_end - row_start, 1);
}
out_data[index_idx] = val;
}
}
}
torch::Tensor spmm_value_bw_cuda(torch::Tensor row, torch::Tensor rowptr,
torch::Tensor col, torch::Tensor mat,
torch::Tensor grad, std::string reduce) {
CHECK_CUDA(row);
CHECK_CUDA(rowptr);
CHECK_CUDA(col);
CHECK_CUDA(mat);
CHECK_CUDA(grad);
hipSetDevice(row.get_device());
mat = mat.contiguous();
grad = grad.contiguous();
auto M = grad.size(-2);
auto N = mat.size(-2);
auto E = row.numel();
auto K = mat.size(-1);
auto B = mat.numel() / (N * K);
auto BLOCKS = dim3((E * 32 + THREADS - 1) / THREADS);
auto out = torch::zeros(row.numel(), grad.options());
auto row_data = row.data_ptr<int64_t>();
auto rowptr_data = rowptr.data_ptr<int64_t>();
auto col_data = col.data_ptr<int64_t>();
auto stream = at::cuda::getCurrentCUDAStream();
AT_DISPATCH_ALL_TYPES_AND(at::ScalarType::Half, mat.scalar_type(), "_", [&] {
auto mat_data = mat.data_ptr<scalar_t>();
auto grad_data = grad.data_ptr<scalar_t>();
auto out_data = out.data_ptr<scalar_t>();
AT_DISPATCH_REDUCTION_TYPES(reduce, [&] {
spmm_value_bw_kernel<scalar_t, REDUCE><<<BLOCKS, THREADS, 0, stream>>>(
row_data, rowptr_data, col_data, mat_data, grad_data, out_data, B, M,
N, E, K);
});
});
return out;
}
#include "hip/hip_runtime.h"
#include "spmm_hip.h"
#include <ATen/hip/HIPContext.h>
#include "reducer.cuh"
#include "utils.cuh"
#define THREADS 256
#define FULL_MASK 0xffffffff
// Paper: Design Principles for Sparse Matrix Multiplication on the GPU
// Code: https://github.com/owensgroup/merge-spmm
template <typename scalar_t, ReductionType REDUCE, bool HAS_VALUE>
__global__ void spmm_kernel(const int64_t *rowptr_data, const int64_t *col_data,
const scalar_t *value_data,
const scalar_t *mat_data, scalar_t *out_data,
int64_t *arg_out_data, int B, int M, int N, int K) {
// We ignore blockIdx.y here, because threads
// across `blockIdx.y` are treated equally.
int thread_idx = blockDim.x * blockIdx.x + threadIdx.x;
int row = thread_idx >> 5; // thread_idx / 32
int lane_idx = thread_idx & (32 - 1); // thread_idx % 32
int batch_idx = row / M;
// Compute the column index of `mat` in which the thread is operating.
int mat_col_idx = lane_idx + (blockIdx.y << 5);
// Compute the output index (row-major order).
int out_idx = row * K + mat_col_idx;
// Helper arrays for warp communication.
int mat_row, mat_rows[32];
scalar_t val, vals[HAS_VALUE ? 32 : 1];
// Do not aggregate/write across the Y-axis (lane_idx < leftover).
int leftover = K - (blockIdx.y << 5);
if (batch_idx < B) {
int row_start = __ldg(rowptr_data + (row % M));
int row_end = __ldg(rowptr_data + (row % M) + 1);
int col_idx = row_start + lane_idx;
scalar_t result = Reducer<scalar_t, REDUCE>::init();
int64_t arg;
// Iterate over all `col` indices in parallel within a warp.
for (int c = row_start; c < row_end; c += 32) {
if (col_idx < row_end) {
// Coalesced memory access into `col` and `val`.
mat_row = __ldg(col_data + col_idx) * K;
if (HAS_VALUE)
val = __ldg(value_data + col_idx);
} else {
mat_row = -1;
if (HAS_VALUE)
val = (scalar_t)0;
}
col_idx += 32;
#pragma unroll
for (int i = 0; i < 32; i++) {
// Communication between all threads in a warp.
mat_rows[i] = __shfl_sync(FULL_MASK, mat_row, i);
if (HAS_VALUE)
vals[i] = __shfl_sync(FULL_MASK, val, i);
}
#pragma unroll
for (int i = 0; i < 32; i++) {
if (lane_idx < leftover && mat_rows[i] != -1) {
// Coalesced memory access into `mat`.
val = __ldg(mat_data + batch_idx * N * K + mat_rows[i] + mat_col_idx);
if (HAS_VALUE)
val = vals[i] * val;
Reducer<scalar_t, REDUCE>::update(&result, val, &arg, c + i);
}
}
}
if (lane_idx < leftover) {
// Coalesced write into `out`.
Reducer<scalar_t, REDUCE>::write(out_data + out_idx, result,
arg_out_data + out_idx, arg,
row_end - row_start);
}
}
}
std::tuple<torch::Tensor, torch::optional<torch::Tensor>>
spmm_cuda(torch::Tensor rowptr, torch::Tensor col,
torch::optional<torch::Tensor> optional_value, torch::Tensor mat,
std::string reduce) {
CHECK_CUDA(rowptr);
CHECK_CUDA(col);
if (optional_value.has_value())
CHECK_CUDA(optional_value.value());
CHECK_CUDA(mat);
hipSetDevice(rowptr.get_device());
CHECK_INPUT(rowptr.dim() == 1);
CHECK_INPUT(col.dim() == 1);
if (optional_value.has_value()) {
CHECK_INPUT(optional_value.value().dim() == 1);
CHECK_INPUT(optional_value.value().size(0) == col.size(0));
}
CHECK_INPUT(mat.dim() >= 2);
mat = mat.contiguous();
auto sizes = mat.sizes().vec();
sizes[mat.dim() - 2] = rowptr.numel() - 1;
auto out = torch::empty(sizes, mat.options());
torch::optional<torch::Tensor> arg_out = torch::nullopt;
int64_t *arg_out_data = nullptr;
if (reduce2REDUCE.at(reduce) == MIN || reduce2REDUCE.at(reduce) == MAX) {
arg_out = torch::full_like(out, col.numel(), rowptr.options());
arg_out_data = arg_out.value().data_ptr<int64_t>();
}
auto rowptr_data = rowptr.data_ptr<int64_t>();
auto col_data = col.data_ptr<int64_t>();
auto M = rowptr.numel() - 1;
auto N = mat.size(-2);
auto K = mat.size(-1);
auto B = mat.numel() / (N * K);
auto BLOCKS = dim3((32 * B * M + THREADS - 1) / THREADS, (K + 31) / 32);
auto stream = at::hip::getCurrentHIPStreamMasqueradingAsCUDA();
AT_DISPATCH_ALL_TYPES_AND(at::ScalarType::Half, mat.scalar_type(), "_", [&] {
auto mat_data = mat.data_ptr<scalar_t>();
auto out_data = out.data_ptr<scalar_t>();
AT_DISPATCH_REDUCTION_TYPES(reduce, [&] {
if (optional_value.has_value()) {
auto value_data = optional_value.value().data_ptr<scalar_t>();
hipLaunchKernelGGL(( spmm_kernel<scalar_t, REDUCE, true>), dim3(BLOCKS), dim3(THREADS), 0, stream,
rowptr_data, col_data, value_data, mat_data, out_data, arg_out_data,
B, M, N, K);
} else {
hipLaunchKernelGGL(( spmm_kernel<scalar_t, REDUCE, false>), dim3(BLOCKS), dim3(THREADS), 0, stream,
rowptr_data, col_data, nullptr, mat_data, out_data, arg_out_data, B,
M, N, K);
}
});
});
return std::make_tuple(out, arg_out);
}
template <typename scalar_t, ReductionType REDUCE>
__global__ void
spmm_value_bw_kernel(const int64_t *row_data, const int64_t *rowptr_data,
const int64_t *col_data, const scalar_t *mat_data,
const scalar_t *grad_data, scalar_t *out_data, int B,
int M, int N, int E, int K) {
int thread_idx = blockDim.x * blockIdx.x + threadIdx.x;
int index_idx = (thread_idx >> 5); // thread_idx / 32
int lane_idx = thread_idx & (32 - 1); // thread_idx % 32
if (index_idx < E) {
int row = __ldg(row_data + index_idx);
int col = __ldg(col_data + index_idx);
scalar_t val = (scalar_t)0;
for (int b = 0; b < B; b++) {
for (int k = lane_idx; k < K; k += 32) {
val += mat_data[b * N * K + col * K + k] *
grad_data[b * M * K + row * K + k];
}
}
#pragma unroll
for (int i = 32 / 2; i > 0; i /= 2) { // Parallel reduction inside a warp.
val += __shfl_down_sync(FULL_MASK, val, i);
}
if (lane_idx == 0) {
if (REDUCE == MEAN) {
int row_start = __ldg(rowptr_data + row);
int row_end = __ldg(rowptr_data + row + 1);
val /= (scalar_t)max(row_end - row_start, 1);
}
out_data[index_idx] = val;
}
}
}
torch::Tensor spmm_value_bw_cuda(torch::Tensor row, torch::Tensor rowptr,
torch::Tensor col, torch::Tensor mat,
torch::Tensor grad, std::string reduce) {
CHECK_CUDA(row);
CHECK_CUDA(rowptr);
CHECK_CUDA(col);
CHECK_CUDA(mat);
CHECK_CUDA(grad);
hipSetDevice(row.get_device());
mat = mat.contiguous();
grad = grad.contiguous();
auto M = grad.size(-2);
auto N = mat.size(-2);
auto E = row.numel();
auto K = mat.size(-1);
auto B = mat.numel() / (N * K);
auto BLOCKS = dim3((E * 32 + THREADS - 1) / THREADS);
auto out = torch::zeros(row.numel(), grad.options());
auto row_data = row.data_ptr<int64_t>();
auto rowptr_data = rowptr.data_ptr<int64_t>();
auto col_data = col.data_ptr<int64_t>();
auto stream = at::hip::getCurrentHIPStreamMasqueradingAsCUDA();
AT_DISPATCH_ALL_TYPES_AND(at::ScalarType::Half, mat.scalar_type(), "_", [&] {
auto mat_data = mat.data_ptr<scalar_t>();
auto grad_data = grad.data_ptr<scalar_t>();
auto out_data = out.data_ptr<scalar_t>();
AT_DISPATCH_REDUCTION_TYPES(reduce, [&] {
hipLaunchKernelGGL(( spmm_value_bw_kernel<scalar_t, REDUCE>), dim3(BLOCKS), dim3(THREADS), 0, stream,
row_data, rowptr_data, col_data, mat_data, grad_data, out_data, B, M,
N, E, K);
});
});
return out;
}
#pragma once
#include "../extensions.h"
std::tuple<torch::Tensor, torch::Tensor, torch::optional<torch::Tensor>>
spspmm_cuda(torch::Tensor rowptrA, torch::Tensor colA,
torch::optional<torch::Tensor> optional_valueA,
torch::Tensor rowptrB, torch::Tensor colB,
torch::optional<torch::Tensor> optional_valueB, int64_t K,
std::string reduce);
#include "spspmm_hip.h"
#include <ATen/hip/HIPContext.h>
#include <hipsparse.h>
#include "utils.cuh"
#define AT_DISPATCH_CUSPARSE_TYPES(TYPE, ...) \
[&] { \
switch (TYPE) { \
case torch::ScalarType::Float: { \
using scalar_t = float; \
const auto &cusparsecsrgemm2_bufferSizeExt = \
hipsparseScsrgemm2_bufferSizeExt; \
const auto &cusparsecsrgemm2 = hipsparseScsrgemm2; \
return __VA_ARGS__(); \
} \
case torch::ScalarType::Double: { \
using scalar_t = double; \
const auto &cusparsecsrgemm2_bufferSizeExt = \
hipsparseDcsrgemm2_bufferSizeExt; \
const auto &cusparsecsrgemm2 = hipsparseDcsrgemm2; \
return __VA_ARGS__(); \
} \
default: \
AT_ERROR("Not implemented for '", toString(TYPE), "'"); \
} \
}()
std::tuple<torch::Tensor, torch::Tensor, torch::optional<torch::Tensor>>
spspmm_cuda(torch::Tensor rowptrA, torch::Tensor colA,
torch::optional<torch::Tensor> optional_valueA,
torch::Tensor rowptrB, torch::Tensor colB,
torch::optional<torch::Tensor> optional_valueB, int64_t K,
std::string reduce) {
CHECK_CUDA(rowptrA);
CHECK_CUDA(colA);
if (optional_valueA.has_value())
CHECK_CUDA(optional_valueA.value());
CHECK_CUDA(rowptrB);
CHECK_CUDA(colB);
if (optional_valueB.has_value())
CHECK_CUDA(optional_valueB.value());
hipSetDevice(rowptrA.get_device());
CHECK_INPUT(rowptrA.dim() == 1);
CHECK_INPUT(colA.dim() == 1);
if (optional_valueA.has_value()) {
CHECK_INPUT(optional_valueA.value().dim() == 1);
CHECK_INPUT(optional_valueA.value().size(0) == colA.size(0));
}
CHECK_INPUT(rowptrB.dim() == 1);
CHECK_INPUT(colB.dim() == 1);
if (optional_valueB.has_value()) {
CHECK_INPUT(optional_valueB.value().dim() == 1);
CHECK_INPUT(optional_valueB.value().size(0) == colB.size(0));
}
if (!optional_valueA.has_value() && optional_valueB.has_value())
optional_valueA =
torch::ones(colA.numel(), optional_valueB.value().options());
if (!optional_valueB.has_value() && optional_valueA.has_value())
optional_valueB =
torch::ones(colB.numel(), optional_valueA.value().options());
auto scalar_type = torch::ScalarType::Float;
if (optional_valueA.has_value())
scalar_type = optional_valueA.value().scalar_type();
auto handle = at::cuda::getCurrentCUDASparseHandle();
hipsparseMatDescr_t descr;
hipsparseCreateMatDescr(&descr);
rowptrA = rowptrA.toType(torch::kInt);
colA = colA.toType(torch::kInt);
rowptrB = rowptrB.toType(torch::kInt);
colB = colB.toType(torch::kInt);
int64_t M = rowptrA.numel() - 1, N = rowptrB.numel() - 1;
auto rowptrA_data = rowptrA.data_ptr<int>();
auto colA_data = colA.data_ptr<int>();
auto rowptrB_data = rowptrB.data_ptr<int>();
auto colB_data = colB.data_ptr<int>();
torch::Tensor rowptrC, colC;
torch::optional<torch::Tensor> optional_valueC = torch::nullopt;
int nnzC;
int *nnzTotalDevHostPtr = &nnzC;
// Step 1: Create an opaque structure.
csrgemm2Info_t info = NULL;
hipsparseCreateCsrgemm2Info(&info);
// Step 2: Allocate buffer for `csrgemm2Nnz` and `csrgemm2`.
size_t bufferSize;
AT_DISPATCH_CUSPARSE_TYPES(scalar_type, [&] {
scalar_t alpha = (scalar_t)1.0;
cusparsecsrgemm2_bufferSizeExt(handle, M, N, K, &alpha, descr, colA.numel(),
rowptrA_data, colA_data, descr, colB.numel(),
rowptrB_data, colB_data, NULL, descr, 0,
NULL, NULL, info, &bufferSize);
void *buffer = NULL;
hipMalloc(&buffer, bufferSize);
// Step 3: Compute CSR row pointer.
rowptrC = torch::empty(M + 1, rowptrA.options());
auto rowptrC_data = rowptrC.data_ptr<int>();
hipsparseXcsrgemm2Nnz(handle, M, N, K, descr, colA.numel(), rowptrA_data,
colA_data, descr, colB.numel(), rowptrB_data,
colB_data, descr, 0, NULL, NULL, descr, rowptrC_data,
nnzTotalDevHostPtr, info, buffer);
// Step 4: Compute CSR entries.
colC = torch::empty(nnzC, rowptrC.options());
auto colC_data = colC.data_ptr<int>();
if (optional_valueA.has_value())
optional_valueC = torch::empty(nnzC, optional_valueA.value().options());
scalar_t *valA_data = NULL, *valB_data = NULL, *valC_data = NULL;
if (optional_valueA.has_value()) {
valA_data = optional_valueA.value().data_ptr<scalar_t>();
valB_data = optional_valueB.value().data_ptr<scalar_t>();
valC_data = optional_valueC.value().data_ptr<scalar_t>();
}
cusparsecsrgemm2(handle, M, N, K, &alpha, descr, colA.numel(), valA_data,
rowptrA_data, colA_data, descr, colB.numel(), valB_data,
rowptrB_data, colB_data, NULL, descr, 0, NULL, NULL, NULL,
descr, valC_data, rowptrC_data, colC_data, info, buffer);
hipFree(buffer);
});
// Step 5: Destroy the opaque structure.
hipsparseDestroyCsrgemm2Info(info);
rowptrC = rowptrC.toType(torch::kLong);
colC = colC.toType(torch::kLong);
return std::make_tuple(rowptrC, colC, optional_valueC);
}
#pragma once
#include "../extensions.h"
#define CHECK_CUDA(x) \
AT_ASSERTM(x.device().is_cuda(), #x " must be CUDA tensor")
#define CHECK_INPUT(x) AT_ASSERTM(x, "Input mismatch")
__device__ __inline__ at::Half
__shfl_sync(const unsigned mask, const at::Half var, const int srcLane) {
return __shfl_sync(mask, (__half)var, srcLane);
}
__device__ __inline__ at::Half __shfl_down_sync(const unsigned mask,
const at::Half var,
const unsigned int delta) {
return __shfl_down_sync(mask, (__half)var, delta);
}