# FBGEMM GPU Build and Installation Guide for DTK 26.04 (Complete Edition)

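**FBGEMM GPU** refers to the `fbgemm_gpu` package in the PyTorch ecosystem. This guide is **self-contained**: environment setup, fetching the source, patching, building, installing, verifying, testing, and troubleshooting are all covered on this page, so **no other documentation is required**.

**Convention**: paths below are relative to the **FBGEMM repository root** (the `FBGEMM/` directory created by the clone). For example, `fbgemm_gpu/setup.py` means `FBGEMM/fbgemm_gpu/setup.py`.

**Important**: **do not modify any files inside the DTK installation directory** (such as `/opt/dtk` or `/opt/dtk-26.04-*`). Modify only the FBGEMM source and, optionally, the system `librt` symlink (see Section 2.4).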

---

## 1. Environment and Versions

| Item | Notes |
|------|-------|
| DTK | **26.04** series; `ROCM_PATH` is usually `/opt/dtk`, but use the value on your machine |
| PyTorch | **HIP/ROCm** build (`torch.version.hip` reports a version; `torch.version.cuda` is usually `None`) |
| FBGEMM | **v1.3.0** (`--branch v1.3.0`) |
| Python | Same version as the installed PyTorch wheel (e.g. 3.10) |
| GPU architecture | Hygon DCUs are commonly **`gfx936`**; set via `PYTORCH_ROCM_ARCH` |

Self-check:

```bash
python3 -c "import torch; print('torch', torch.__version__); print('hip', torch.version.hip); print('cuda', torch.version.cuda)"
ls -d /opt/dtk /opt/dtk-* 2>/dev/null || true
```

`setup.py` / CMake require the environment variable **`BUILD_ROCM_VERSION`** in **`major.minor`** format (matching the major version of the current ROCm/DTK, e.g. `6.3` or `6.4`):

```bash
export BUILD_ROCM_VERSION=6.3
```
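
Because the variable must be exactly `major.minor`, a quick format check before a long build can save a failed run. The following is a minimal sketch; the helper name `is_valid_build_rocm_version` is hypothetical, and it checks only the format, not whether the value matches the installed DTK.

```python
import os
import re

# Hypothetical helper (not part of FBGEMM): accepts exactly "major.minor".
_MAJOR_MINOR = re.compile(r"\d+\.\d+")


def is_valid_build_rocm_version(value):
    """Return True when value is a string such as '6.3' (digits.digits only)."""
    return isinstance(value, str) and _MAJOR_MINOR.fullmatch(value) is not None


if __name__ == "__main__":
    current = os.environ.get("BUILD_ROCM_VERSION")
    print("BUILD_ROCM_VERSION:", current)
    print("format ok:", is_valid_build_rocm_version(current))
```

A value such as `6.3.1` or an unset variable is reported as invalid by this sketch.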

---

## 2. Dependencies, Network Access, Source, and Submodules

### 2.1 Python Packages (Build)

```bash
pip install setuptools_git_versioning scikit-build tabulate ninja cmake patchelf
```

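### 2.2 Accessing GitHub

If a direct connection fails, set a working **`http_proxy` / `https_proxy`** for the current terminal, or use a GitHub mirror prefix (the example below is a placeholder; substitute the domain for your environment):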

```bash
git clone https://github.com/pytorch/FBGEMM.git --branch v1.3.0 --depth 1
# Or: git clone https://<your-mirror-prefix>/https://github.com/pytorch/FBGEMM.git --branch v1.3.0 --depth 1
```

### 2.3 Submodules

```bash
cd FBGEMM
git submodule update --init --recursive --jobs 6
```

The repository is large; `--jobs` fetches the submodules in parallel.

### 2.4 `librt` Symlink (often needed at link time)

```bash
sudo ln -sf /lib/x86_64-linux-gnu/librt.so.1 /usr/lib/x86_64-linux-gnu/librt.so
```

---

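## 3. Why the Source Must Be Patched, and Only on the CUDA Side

With **`FBGEMM_BUILD_VARIANT=rocm`**, CMake runs **hipify** over a large number of **`.cu` / `.cuh`** files under `fbgemm_gpu`, generating **`*.hip`**, **`*_hip.cuh`**, and similar files.

- **Do not** apply patches only to the generated files; they are overwritten by the next `cmake` run or clean rebuild.
- Apply patches to the **CUDA sources that feed hipify** (and to `setup.py`).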
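
To avoid editing a generated file by accident, a filename check along these lines may help. It is only a sketch: `is_hipify_output` is a hypothetical helper, and it recognizes just the two suffixes named above (`.hip` and `_hip.cuh`), so other generated files would need extra patterns.

```python
from pathlib import PurePosixPath


def is_hipify_output(path):
    """Return True when the file name matches one of the two generated-file
    patterns mentioned in this guide: '*.hip' or '*_hip.cuh'."""
    name = PurePosixPath(path).name
    return name.endswith(".hip") or name.endswith("_hip.cuh")


if __name__ == "__main__":
    # Illustrative paths only; real generated names depend on the build.
    for candidate in ("src/example_kernel.cu", "src/example_kernel.hip"):
        print(candidate, "-> generated:", is_hipify_output(candidate))
```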

---

## 4. Patch Overview

All files below live under **`fbgemm_gpu/`** (i.e. `FBGEMM/fbgemm_gpu/...`).

| Purpose | File (relative to `fbgemm_gpu/`) |
|---------|----------------------------------|
| `setuptools_git_versioning` returns a `Version` object rather than a string | `setup.py` |
| `CUDA_KERNEL_ASSERT2` conflicts with DSA/DTK under ROCm | `src/sparse_ops/common.cuh`, `include/fbgemm_gpu/utils/embedding_bounds_check_common.cuh`, `test/utils/kernel_launcher_test.cu` |
| Post-launch kernel error checking is inconsistent with the PyTorch HIP API under ROCm | `include/fbgemm_gpu/utils/kernel_launcher.cuh` |
| `getDeviceProperties` returns `hipDeviceProp_t_v2*`, which does not match the legacy `cudaDeviceProp*` declarations | `include/fbgemm_gpu/utils/kernel_launcher.cuh` (templatize the four `check*` functions), `include/fbgemm_gpu/utils/cuda_prelude.cuh`, `src/split_embeddings_cache/reset_weight_momentum.cu`, `src/intraining_embedding_pruning_ops/intraining_embedding_pruning.cu` |

---

## 5. Patch Steps (with Copy-Paste Code)

### 5.1 `setup.py`

In `package_version()`, change every occurrence of:

```python
gitversion.version_from_git().split("+")[0]
```

to:

```python
str(gitversion.version_from_git()).split("+")[0]
```

(There are typically **two** occurrences: one in the branch that handles `rc` and one in the branch that does not.)
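
The reason for the `str(...)` wrapper can be reproduced without the real library. In the sketch below, `FakeVersion` is a hypothetical stand-in for a version object that has no `.split()` method; it is not the class that `setuptools_git_versioning` actually returns.

```python
class FakeVersion:
    """Hypothetical stand-in for a version object without a .split() method."""

    def __init__(self, text):
        self._text = text

    def __str__(self):
        return self._text


def base_version(version):
    """Drop the local part after '+'. str() makes this work for both plain
    strings and version-like objects."""
    return str(version).split("+")[0]


if __name__ == "__main__":
    v = FakeVersion("1.3.0+git.abc123")
    try:
        v.split("+")  # the unpatched pattern
    except AttributeError as exc:
        print("unpatched:", exc)
    print("patched:", base_version(v))
```

Wrapping in `str()` is harmless when the value is already a string, so the patched line behaves the same with library versions that return either type.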

---

### 5.2 Make `CUDA_KERNEL_ASSERT2` a No-Op under ROCm

In the files listed below, append the following **after** the existing include of `<c10/cuda/CUDADeviceAssertion.h>` (or an equivalent header):

```cpp
#ifdef USE_ROCM
#undef CUDA_KERNEL_ASSERT2
#define CUDA_KERNEL_ASSERT2(condition) ((void)0)
#endif
```

**Files and insertion points:**

1. **`src/sparse_ops/common.cuh`**: after `#include <c10/cuda/CUDADeviceAssertion.h>` and `#include <c10/cuda/CUDADeviceAssertionHost.h>`.
2. **`include/fbgemm_gpu/utils/embedding_bounds_check_common.cuh`**: after `#include <c10/cuda/CUDADeviceAssertion.h>`.
3. **`test/utils/kernel_launcher_test.cu`**: after `#include <c10/cuda/CUDADeviceAssertion.h>`.
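
If you prefer to script this edit rather than apply it by hand, the insertion can be sketched as a small text transformation. The helper `insert_guard_after` is hypothetical; it assumes the anchor include appears on a line of its own and skips files that already contain the `#undef`.

```python
GUARD = """#ifdef USE_ROCM
#undef CUDA_KERNEL_ASSERT2
#define CUDA_KERNEL_ASSERT2(condition) ((void)0)
#endif"""


def insert_guard_after(source, anchor_include, guard=GUARD):
    """Return source with guard inserted right after the first line equal to
    anchor_include. Returns source unchanged if the guard is already there."""
    if "#undef CUDA_KERNEL_ASSERT2" in source:
        return source  # already patched, keep the operation idempotent
    lines = source.split("\n")
    for index, line in enumerate(lines):
        if line.strip() == anchor_include:
            lines[index + 1:index + 1] = guard.split("\n")
            return "\n".join(lines)
    raise ValueError("anchor include not found: " + anchor_include)


if __name__ == "__main__":
    demo = "#include <c10/cuda/CUDADeviceAssertion.h>\nint placeholder;\n"
    print(insert_guard_after(demo, "#include <c10/cuda/CUDADeviceAssertion.h>"))
```

For `src/sparse_ops/common.cuh`, pass whichever of the two listed includes appears last in the file so that the guard lands after both.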

---

### 5.3 `include/fbgemm_gpu/utils/kernel_launcher.cuh`

#### A. `kernelLaunchCheck`

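Replace the long `TORCH_CHECK` in the failure path at the end of the function body with a short branch under ROCm, which **avoids** calling host helper APIs that are missing or incompletely mapped in the current PyTorch HIP build. The overall structure is as follows (keep the preceding `cudaGetLastError` call and `C10_LIKELY` check unchanged; replace only the failure branch):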

```cpp
#ifdef USE_ROCM
    TORCH_CHECK(
        false,
        context.description(),
        " GPU Error: ",
        cudaGetErrorString(cuda_error));
#else
    TORCH_CHECK(
        false,
        context.description(),
        " CUDA Error: ",
        cudaGetErrorString(cuda_error),
        c10::cuda::get_cuda_check_suffix(),
        "\n",
        c10::cuda::c10_retrieve_device_side_assertion_info());
#endif
```

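#### B. Turn the Four `check*` Member Functions into Templates

In the functions below, change the parameter type from **`const cudaDeviceProp& properties`** to a **template parameter** (add a `template <typename DeviceProperties>` line to each function and declare the parameter as **`const DeviceProperties& properties`**):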

- `checkGridSizesInRange`
- `checkBlockSizesInRange`
- `checkThreadCountNotExceeded`
- `checkSharedMemoryPerBlockNotExceeded`

Example (`checkGridSizesInRange`):

```cpp
  template <typename DeviceProperties>
  constexpr inline void checkGridSizesInRange(
      const DeviceProperties& properties,
      const dim3& grid) const {
```

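The logic inside the function body is **unchanged** (it still accesses fields such as `properties.maxGridSize`).

With this change, under ROCm the statement `const auto properties = *at::cuda::getDeviceProperties(device);` in `launch_kernel` deduces `properties` as **`hipDeviceProp_t_v2`**, consistent with the PyTorch declaration.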

---

### 5.4 `include/fbgemm_gpu/utils/cuda_prelude.cuh`

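Change `get_device_sm_cnt_()` in the anonymous namespace from the "pointer variable plus member access" form to a **direct return** (this avoids the mismatch between `cudaDeviceProp*` and `hipDeviceProp_t_v2*` after hipify):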

```cpp
inline int get_device_sm_cnt_() {
  return at::cuda::getDeviceProperties(c10::cuda::current_device())
      ->multiProcessorCount;
}
```

---

### 5.5 `get_sm_count_()` in Two `.cu` Files

In the files below, apply the same single-`return` change to **`int get_sm_count_()`** inside the namespace (delete the original code that stored the pointer in a `cudaDeviceProp*`):

- **`src/split_embeddings_cache/reset_weight_momentum.cu`**
- **`src/intraining_embedding_pruning_ops/intraining_embedding_pruning.cu`**

```cpp
int get_sm_count_() {
  return at::cuda::getDeviceProperties(c10::cuda::current_device())
      ->multiProcessorCount;
}
```

Keep the source files written against the **CUDA-side API**; hipify generates the corresponding HIP code in the ROCm build.

---

## 6. Build the Wheel

From the **FBGEMM repository root**, run:

```bash
cd fbgemm_gpu
rm -rf _skbuild build

export BUILD_ROCM_VERSION=6.3
export PYTORCH_ROCM_ARCH="gfx936"
export ROCM_PATH=/opt/dtk

TORCH_PATH="$(python3 -c "import torch, os; print(os.path.join(os.path.dirname(torch.__file__), 'share/cmake/Torch'))")"

python3 setup.py bdist_wheel \
  --package_channel=release \
  --build-target=default \
  --build-variant=rocm \
  -DCMAKE_PREFIX_PATH="${TORCH_PATH};${ROCM_PATH}" \
  -DAMDGPU_TARGETS="${PYTORCH_ROCM_ARCH}" \
  -DTORCH_DIR="${TORCH_PATH}" \
  -DCMAKE_C_FLAGS="-Wno-return-type -Wno-ignored-attributes -O1" \
  -DCMAKE_CXX_FLAGS="-Wno-return-type -Wno-ignored-attributes -O1" \
  -DCMAKE_HIP_FLAGS="-Wno-return-type -O1" \
  -DUSE_AVX512=on \
  -DCOPY_VISIBLE_LIBRARIES=ON \
  -DCMAKE_SHARED_LINKER_FLAGS="-Wl,--no-as-needed -ltbb -Wl,-rpath,/usr/lib/x86_64-linux-gnu" \
  -DTBB_INCLUDE_DIR=/usr/include
```

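**Requirements:**

1. Use **`--build-variant=rocm`**; do not rely on passing only `-DFBGEMM_BUILD_VARIANT=rocm` without the setup argument.
2. **`BUILD_ROCM_VERSION`** must be exported.
3. **Do not** add **`-DhipDeviceProp_t=hipDeviceProp_t_v2`** to `CMAKE_CXX_FLAGS` / `CMAKE_HIP_FLAGS`: the HIP headers in DTK 26.04 already define both `hipDeviceProp_t` and `hipDeviceProp_t_v2`, so adding the macro tends to cause a **redefinition of `hipDeviceProp_t_v2`**. Resolve the type issues through the source changes in Sections 5.3 B, 5.4, and 5.5.

On success, the wheel path looks like this: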
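
These three rules can be checked mechanically before starting a long build. The sketch below is one possible preflight; `preflight_problems` is a hypothetical helper that inspects an environment mapping and the list of arguments you intend to pass to `setup.py`, and it does not invoke the build itself.

```python
import os

FORBIDDEN_DEFINE = "-DhipDeviceProp_t=hipDeviceProp_t_v2"


def preflight_problems(env, setup_args):
    """Return a list of human-readable problems; an empty list means the
    three rules listed in this section are satisfied."""
    problems = []
    if "--build-variant=rocm" not in setup_args:
        problems.append("missing --build-variant=rocm")
    if not env.get("BUILD_ROCM_VERSION"):
        problems.append("BUILD_ROCM_VERSION is not exported")
    if any(FORBIDDEN_DEFINE in arg for arg in setup_args):
        problems.append("remove " + FORBIDDEN_DEFINE + " from the flags")
    return problems


if __name__ == "__main__":
    # Example argument list; extend it with the flags you actually use.
    args = ["bdist_wheel", "--build-variant=rocm"]
    for problem in preflight_problems(os.environ, args):
        print("preflight:", problem)
```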

```text
fbgemm_gpu/dist/fbgemm_gpu_rocm-1.3.0-cp310-cp310-linux_x86_64.whl
```

(The Python tag and patch version depend on the actual output.)
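
Since the Python tag in the filename has to match the interpreter you install into, it can be useful to compare the two. The sketch below splits a wheel filename on dashes following the standard wheel naming scheme (`name-version[-build]-python-abi-platform`); `parse_wheel_name` and `current_cpython_tag` are hypothetical helper names.

```python
import sys


def parse_wheel_name(filename):
    """Split a wheel filename into its dash-separated fields."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: " + filename)
    parts = filename[: -len(".whl")].split("-")
    # Five fields, or six when an optional build tag is present.
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel name: " + filename)
    return {
        "name": parts[0],
        "version": parts[1],
        "python_tag": parts[-3],
        "abi_tag": parts[-2],
        "platform_tag": parts[-1],
    }


def current_cpython_tag():
    """Tag of the running interpreter, assuming CPython (e.g. 'cp310')."""
    return "cp{}{}".format(sys.version_info.major, sys.version_info.minor)


if __name__ == "__main__":
    info = parse_wheel_name("fbgemm_gpu_rocm-1.3.0-cp310-cp310-linux_x86_64.whl")
    print(info["python_tag"], "vs interpreter", current_cpython_tag())
```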

---

## 7. Installation

From the **FBGEMM repository root** (the directory that contains `fbgemm_gpu/`), run:

```bash
pip install fbgemm_gpu/dist/fbgemm_gpu_rocm-*.whl
```

If you are already inside the **`fbgemm_gpu/`** directory, use this instead:

```bash
pip install dist/fbgemm_gpu_rocm-*.whl
```

If dependency resolution fails during installation (for example, PyPI is unreachable):

```bash
pip install --no-deps fbgemm_gpu/dist/fbgemm_gpu_rocm-*.whl
# If already inside fbgemm_gpu/: pip install --no-deps dist/fbgemm_gpu_rocm-*.whl
```

Make sure matching versions of **`torch`**, **`numpy`**, and the other dependencies are already installed on the machine.

---

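## 8. Verify the Import

**Do not** run the verification with the `fbgemm_gpu` source tree as the current working directory (Python tends to pick up the source package, which has no `.so`, first):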

```bash
cd /tmp
unset PYTHONPATH
python3 -c "import torch; import fbgemm_gpu; print('OK', fbgemm_gpu.__variant__, fbgemm_gpu.__file__)"
```

Expect **`__variant__`** to indicate **rocm** and **`__file__`** to be under `site-packages`.
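
The `__file__` expectation can also be checked programmatically. This is a minimal sketch with a hypothetical helper, `loaded_from_site_packages`; it only looks for a `site-packages` path component and assumes POSIX-style paths.

```python
from pathlib import PurePosixPath


def loaded_from_site_packages(module_file):
    """Return True when 'site-packages' is one of the path components."""
    return "site-packages" in PurePosixPath(module_file).parts


if __name__ == "__main__":
    # Illustrative path; in practice pass fbgemm_gpu.__file__ after importing.
    example = "/usr/lib/python3.10/site-packages/fbgemm_gpu/__init__.py"
    print(loaded_from_site_packages(example))
```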

---

## 9. Optional: Run the Tests

Install **pytest**:

```bash
pip install pytest
```

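From the **FBGEMM repository root**, change into the test directory (this avoids accidentally importing the package from the source tree):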

```bash
cd fbgemm_gpu/test
export FBGEMM_TEST_WITH_ROCM=1
unset PYTHONPATH
# Optional: export HIP_LAUNCH_BLOCKING=1

python3 -m pytest -v -rsx --tb=short \
  config/feature_gate_test.py \
  sparse/pack_segments_test.py \
  sparse/index_select_test.py \
  sparse/cumsum_test.py \
  quantize/fused_8bit_rowwise_test.py
```

A full run can iterate over every `*_test.py` under this directory; it may take a long time and needs ample GPU memory and host memory.
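
One way to enumerate the full set before deciding what to run is sketched below. `collect_test_files` is a hypothetical helper that only lists files; it does not start pytest.

```python
from pathlib import Path


def collect_test_files(test_root):
    """Return sorted paths (relative to test_root, '/'-separated) of every
    file matching '*_test.py' at any depth under test_root."""
    root = Path(test_root)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*_test.py"))


if __name__ == "__main__":
    # Run from fbgemm_gpu/test, as in the commands above.
    for relative_path in collect_test_files("."):
        print(relative_path)
```

The printed paths can be passed to `python3 -m pytest` in batches, which makes it easier to resume after a failure than a single monolithic run.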

---

## 10. Common Problems and Fixes

| Symptom | Fix |
|---------|-----|
| `ModuleNotFoundError: setuptools_git_versioning` | Run the `pip install` from Section 2.1 |
| `BUILD_ROCM_VERSION is not set` | Run `export BUILD_ROCM_VERSION=X.Y`, then rerun `setup.py` |
| `AttributeError: 'Version' object has no attribute 'split'` | Apply the `setup.py` change from Section 5.1 |
| `redefinition of 'hipDeviceProp_t_v2'` | Make sure the build flags do **not** include `-DhipDeviceProp_t=hipDeviceProp_t_v2` |
| Conversion errors between `hipDeviceProp_t` and `hipDeviceProp_t_v2` | Apply Sections 5.3 B, 5.4, and 5.5, then run **`rm -rf _skbuild build`** and rebuild |
| `import fbgemm_gpu` reports a missing `fbgemm_gpu_config.so` | `cd /tmp` and `unset PYTHONPATH`; do not put the `fbgemm_gpu` source directory first in `PYTHONPATH` |
| `git clone` / TLS failures | Configure a proxy or use a GitHub mirror prefix, then retry |
| Slow submodule fetch | `git submodule update --init --recursive --jobs 6` |
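
For long build logs, the table can be turned into a simple lookup. The sketch below covers only the rows whose symptom is a literal message; `suggest_fixes` and the `KNOWN_ISSUES` list are hypothetical, and the matching is plain substring search.

```python
# (substring to look for in the log, suggested fix from the table)
KNOWN_ISSUES = [
    ("BUILD_ROCM_VERSION is not set",
     "export BUILD_ROCM_VERSION=X.Y, then rerun setup.py"),
    ("'Version' object has no attribute 'split'",
     "apply the setup.py change from Section 5.1"),
    ("redefinition of 'hipDeviceProp_t_v2'",
     "remove -DhipDeviceProp_t=hipDeviceProp_t_v2 from the build flags"),
    ("fbgemm_gpu_config.so",
     "cd /tmp and unset PYTHONPATH before importing"),
]


def suggest_fixes(log_text):
    """Return the fixes whose symptom substring occurs in log_text,
    in the order they are listed in KNOWN_ISSUES."""
    return [fix for symptom, fix in KNOWN_ISSUES if symptom in log_text]


if __name__ == "__main__":
    sample = "AttributeError: 'Version' object has no attribute 'split'"
    for fix in suggest_fixes(sample):
        print("hint:", fix)
```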

---

## 11. Version and Maintenance Notes

- This guide was written for **DTK 26.04 + PyTorch HIP + FBGEMM v1.3.0**.
- If you upgrade DTK, PyTorch, or the FBGEMM tag, recheck `BUILD_ROCM_VERSION`, `ROCM_PATH`, `PYTORCH_ROCM_ARCH`, and whether the PyTorch headers related to `getDeviceProperties` have changed.