
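# FBGEMM GPU (ROCm / Hygon DTK): Build, Install, and Test Guide

This document complements [install_fbgemm.md](./install_fbgemm.md) in this repository. `install_fbgemm.md` covers the container image and the overall workflow; this document records the **reproducible commands**, **code-patching principles**, and **test methods** that came out of troubleshooting a real **DTK 26.x + PyTorch 2.5.1 HIP** environment.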

---

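## 1. Environment prerequisites

- PyTorch is a **ROCm/HIP** build (`torch.version.hip` reports a version; `torch.version.cuda` is usually `None`).
- **DTK/ROCm** is installed (commonly under `/opt/dtk`; defer to the local `hipconfig` and `ROCM_PATH`).
- **GPU architecture**: Hygon DCUs are commonly `gfx936` (matching `PYTORCH_ROCM_ARCH` in `install_fbgemm.md`; adjust for the actual driver/device).
- Python build dependencies (example):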

  ```bash
  pip install setuptools_git_versioning scikit-build tabulate ninja cmake patchelf
  ```
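The first prerequisite can be checked before starting a long build. The following is a minimal sketch: `classify_torch_build` is a hypothetical helper name (not part of PyTorch), and it only looks at the two `torch.version` attributes mentioned above.

```python
from typing import Optional


def classify_torch_build(hip: Optional[str], cuda: Optional[str]) -> str:
    """Classify a PyTorch build from torch.version.hip / torch.version.cuda."""
    if hip:
        return "rocm"
    if cuda:
        return "cuda"
    return "cpu"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        # getattr guards against builds that lack one of the attributes.
        hip = getattr(torch.version, "hip", None)
        cuda = getattr(torch.version, "cuda", None)
        kind = classify_torch_build(hip, cuda)
        print(f"torch {torch.__version__}: {kind} build (hip={hip}, cuda={cuda})")
```

If the script reports anything other than a `rocm` build, the ROCm variant of FBGEMM GPU described here is unlikely to build against that PyTorch.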

---

## 2. Source code and network (proxy)

```bash
# Pick a proxy script that can reach GitHub as needed (this repository's ws/proxy is used as an example)
source /workspace/ws/proxy/set_proxy_bj.sh

git clone https://github.com/pytorch/FBGEMM.git --branch v1.3.0 --depth 1
cd FBGEMM
git submodule update --init --recursive --jobs 6
```

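Note: there are many submodules, including large repositories such as cutlass. When only the Beijing proxy is reachable, keeping `--jobs` shortens the fetch. If a proxy is unreachable (`No route to host`), source another `set_proxy_*.sh` from the same directory and rerun the submodule update.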

---

## 3. System-level tweak (same as in install_fbgemm)

```bash
sudo ln -sf /lib/x86_64-linux-gnu/librt.so.1 /usr/lib/x86_64-linux-gnu/librt.so
```

---

## 4. DTK version differences: do not carry over the `hipDeviceProp_t` macro from older documentation

`install_fbgemm.md` adds the following to **CMAKE_CXX_FLAGS / CMAKE_HIP_FLAGS**:

`-DhipDeviceProp_t=hipDeviceProp_t_v2`

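In the HIP headers of **DTK 26.x**, both `hipDeviceProp_t` and `hipDeviceProp_t_v2` are **already defined**, so redefining one as the other through a global macro causes a **duplicate-definition** compile error.

With **DTK 26.x and a PyTorch that returns `hipDeviceProp_t_v2*`**, you should:

- **Remove** the `-DhipDeviceProp_t=hipDeviceProp_t_v2` flag above;
- Make the types compatible on the **CUDA source** side (see the next section) and let **hipify** generate the HIP code.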
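If the flag strings are assembled by a script, the legacy define can be filtered out mechanically. The sketch below is only an illustration: `drop_legacy_define` is a hypothetical helper, and it assumes the flags arrive as a single shell-style string.

```python
import shlex

# The define that older instructions add and that DTK 26.x no longer tolerates.
LEGACY_DEFINE = "-DhipDeviceProp_t=hipDeviceProp_t_v2"


def drop_legacy_define(flags: str) -> str:
    """Return `flags` with every occurrence of the legacy define removed."""
    kept = [token for token in shlex.split(flags) if token != LEGACY_DEFINE]
    return shlex.join(kept)


old_flags = "-Wno-return-type -DhipDeviceProp_t=hipDeviceProp_t_v2 -O1"
print(drop_legacy_define(old_flags))  # -Wno-return-type -O1
```

Flags that do not contain the define pass through unchanged, so the helper is safe to apply on both DTK 25.x-style and DTK 26.x-style flag sets.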

---

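## 5. Patching principle: edit only the CUDA sources that get hipified, not the generated files

When configuring for ROCm, CMake runs **hipify** over the **`.cu` / `.cuh`** files under `fbgemm_gpu` and regenerates `*_hip.cuh`, `*.hip`, and similar files.  
**Direct edits to generated files are overwritten at the next configure/build.**

Changes that have been verified to work in practice include: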

| Purpose | Files to modify (examples) |
|------|------------------|
| `CUDA_KERNEL_ASSERT2` / DSA does not match DTK | `src/sparse_ops/common.cuh`, `include/.../embedding_bounds_check_common.cuh`, `test/utils/kernel_launcher_test.cu`: under `#ifdef USE_ROCM`, `#undef CUDA_KERNEL_ASSERT2` and redefine it as `((void)0)` |
| HIP helper symbols in `kernelLaunchCheck` do not match the current PyTorch | `include/fbgemm_gpu/utils/kernel_launcher.cuh`: under `#ifdef USE_ROCM`, shorten the `TORCH_CHECK` error message so it no longer depends on the missing host API |
| `hipDeviceProp_t` vs `hipDeviceProp_t_v2` | `include/fbgemm_gpu/utils/kernel_launcher.cuh`: change the four `check*` functions to `template <typename DeviceProperties>`; in `include/fbgemm_gpu/utils/cuda_prelude.cuh` and some `.cu` files, do **not** store the result of `getDeviceProperties` in a variable declared as `cudaDeviceProp*`/`hipDeviceProp_t*`; instead `return getDeviceProperties(...)->multiProcessorCount` directly |
| Newer `setuptools_git_versioning` returns a `Version` object | `setup.py`: `str(gitversion.version_from_git()).split("+")[0]` |
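The `setup.py` fix in the last row works because converting to `str` first makes the expression valid for both a plain string and a version object. The sketch below illustrates this with a stand-in class; `FakeVersion` and `public_version` are hypothetical names, not part of `setuptools_git_versioning`.

```python
class FakeVersion:
    """Stand-in for a Version-like object that has no .split() method."""

    def __init__(self, text: str) -> None:
        self._text = text

    def __str__(self) -> str:
        return self._text


def public_version(version: object) -> str:
    """Drop the local segment (everything after '+') from a version."""
    return str(version).split("+")[0]


print(public_version(FakeVersion("1.3.0+git.abc123")))  # 1.3.0
print(public_version("1.3.0"))  # 1.3.0
```

Calling `.split("+")` directly on the object would raise `AttributeError`, which is the failure the `str(...)` wrapper avoids.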

---

## 6. Build command (recommended)

From the **`FBGEMM/fbgemm_gpu`** directory:

```bash
cd FBGEMM/fbgemm_gpu
rm -rf _skbuild build

export BUILD_ROCM_VERSION=6.3          # Must match the current ROCm/DTK major version; format X.Y
export PYTORCH_ROCM_ARCH="gfx936"     # Adjust for the actual GPU
export ROCM_PATH=/opt/dtk             # Adjust for the actual install path

TORCH_PATH="$(python3 -c "import torch, os; print(os.path.join(os.path.dirname(torch.__file__), 'share/cmake/Torch'))")"

python3 setup.py bdist_wheel \
  --package_channel=release \
  --build-target=default \
  --build-variant=rocm \
  -DCMAKE_PREFIX_PATH="${TORCH_PATH};${ROCM_PATH}" \
  -DAMDGPU_TARGETS="${PYTORCH_ROCM_ARCH}" \
  -DTORCH_DIR="${TORCH_PATH}" \
  -DCMAKE_C_FLAGS="-Wno-return-type -Wno-ignored-attributes -O1" \
  -DCMAKE_CXX_FLAGS="-Wno-return-type -Wno-ignored-attributes -O1" \
  -DCMAKE_HIP_FLAGS="-Wno-return-type -O1" \
  -DUSE_AVX512=on \
  -DCOPY_VISIBLE_LIBRARIES=ON \
  -DCMAKE_SHARED_LINKER_FLAGS="-Wl,--no-as-needed -ltbb -Wl,-rpath,/usr/lib/x86_64-linux-gnu" \
  -DTBB_INCLUDE_DIR=/usr/include
```

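Key points:

- You must pass **`--build-variant=rocm`**. If you pass only `-DFBGEMM_BUILD_VARIANT=rocm` without the setup.py argument, `setup.py` may still run some steps with its default CUDA logic.
- **`BUILD_ROCM_VERSION`** is required by the ROCm toolchain logic in `setup.py` / CMake; the build fails with an error if it is unset.

On success, the wheel is usually at `fbgemm_gpu/dist/fbgemm_gpu_rocm-1.3.0-*.whl`, relative to the `FBGEMM` repository root (the exact version comes from the build output).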
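Because an unset `BUILD_ROCM_VERSION` only fails after the build has started, a small pre-flight check can save time. This is a minimal sketch: `missing_build_vars` is a hypothetical helper, and treating `PYTORCH_ROCM_ARCH` and `ROCM_PATH` as required alongside `BUILD_ROCM_VERSION` is an assumption based on the build command in this section, not a rule enforced by `setup.py`.

```python
import os
from typing import List, Mapping

# Variables exported before the build command in this section.
REQUIRED_VARS = ("BUILD_ROCM_VERSION", "PYTORCH_ROCM_ARCH", "ROCM_PATH")


def missing_build_vars(env: Mapping[str, str]) -> List[str]:
    """Return the required variables that are unset or empty in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_build_vars(os.environ)
    if missing:
        print("Export these before running setup.py: " + ", ".join(missing))
    else:
        print("All build variables are set")
```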

---

## 7. Installation and Python verification

```bash
# Run from the FBGEMM repository root
pip install fbgemm_gpu/dist/fbgemm_gpu_rocm-*.whl
# If PyPI is unreachable: pip install --no-deps <path-to-wheel>
```

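Verify the install. Do **not** run this from the `fbgemm_gpu` source root, where the source tree would be imported as the package and the installed `.so` might not be found: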

```bash
cd /tmp
python3 -c "import torch; import fbgemm_gpu; print('OK', fbgemm_gpu.__variant__, fbgemm_gpu.__file__)"
```
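The shadowing problem can also be detected programmatically. The sketch below is an illustration under stated assumptions: `is_inside` and `find_package_origin` are hypothetical helpers, and the check only compares the location Python would import `fbgemm_gpu` from against the current working directory, without importing the package.

```python
import importlib.util
import os
from typing import Optional


def is_inside(path: str, directory: str) -> bool:
    """Return True if `path` lies inside `directory` (symlinks resolved)."""
    path = os.path.realpath(path)
    directory = os.path.realpath(directory)
    return os.path.commonpath([path, directory]) == directory


def find_package_origin(name: str = "fbgemm_gpu") -> Optional[str]:
    """Return the file a top-level package would be imported from, if any."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec else None


if __name__ == "__main__":
    origin = find_package_origin()
    if origin is None:
        print("fbgemm_gpu is not importable from this environment")
    elif is_inside(origin, os.getcwd()):
        print(f"Warning: fbgemm_gpu resolves to the working tree: {origin}")
    else:
        print(f"fbgemm_gpu resolves to: {origin}")
```

A warning here means the import would pick up the source tree rather than the installed wheel; changing to another directory such as `/tmp` avoids that.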

---

## 8. Running tests (example Pytest subset)

From the **`FBGEMM/fbgemm_gpu/test`** directory:

```bash
cd FBGEMM/fbgemm_gpu/test
export FBGEMM_TEST_WITH_ROCM=1
export HIP_LAUNCH_BLOCKING=1    # Optional; helps pinpoint asynchronous errors
unset PYTHONPATH                  # Avoid importing fbgemm_gpu from the source tree first

python3 -m pytest -v -rsx --tb=short \
  config/feature_gate_test.py \
  sparse/pack_segments_test.py \
  sparse/index_select_test.py \
  sparse/cumsum_test.py \
  quantize/fused_8bit_rowwise_test.py
```

For the full test suite, see `test_all_fbgemm_gpu_modules` in the upstream `.github/scripts/fbgemm_gpu_test.bash` (it can take a long time and needs a ROCm environment with enough GPU and host memory).

---

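## 9. Common problems

| Symptom | Fix |
|------|------|
| `gnutls_handshake` / clone failure | Switch to a proxy that can reach GitHub, then retry `git clone` / `submodule update` |
| `hipDeviceProp_t_v2` redefinition | Remove `-DhipDeviceProp_t=hipDeviceProp_t_v2` from the compile flags (DTK 26.x) |
| `hipDeviceProp_t` and `hipDeviceProp_t_v2` are incompatible | Make sure the **CUDA sources** are patched (the templated `check*` functions in `kernel_launcher.cuh`, plus `cuda_prelude.cuh` and related files), then reconfigure so that hipify regenerates the HIP files |
| `import fbgemm_gpu` reports a missing `fbgemm_gpu_config.so` | The current working directory or `PYTHONPATH` points at the unbuilt source package; switch to `/tmp` or remove the conflicting path |
| `BUILD_ROCM_VERSION is not set` | Export `BUILD_ROCM_VERSION=X.Y`, then rerun `setup.py` |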

---

## 10. Related files

- Environment and image: [install_fbgemm.md](./install_fbgemm.md)
- Proxy scripts: `/workspace/ws/proxy/set_proxy_*.sh`
- Upstream ROCm CI: `.github/workflows/fbgemm_gpu_ci_rocm.yml` in the FBGEMM repository

---

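*This document is based on an environment with FBGEMM v1.3.0, PyTorch 2.5.1 HIP, and DTK 26.x. If your image is still on DTK 25.04.2, continue to follow `install_fbgemm.md` and verify separately whether the `hipDeviceProp_t` macro is still needed.*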