# <div align="center"><strong>fast_rnnt</strong></div>
## Introduction
fast_rnnt implements a faster, more memory-efficient method for computing the RNN-T loss.
This project implements a method for faster and more memory-efficient RNN-T loss computation, called `pruned rnnt`.
Note: There is also a fast RNN-T loss implementation in the [k2](https://github.com/k2-fsa/k2) project, which shares the same code as here. We make `fast_rnnt` a stand-alone project in case someone wants only this RNN-T loss.
## How does the pruned-rnnt work ?
We first obtain pruning bounds for the RNN-T recursion using a simple joiner network that is just an addition of the encoder and decoder, then we use those pruning bounds to evaluate the full, non-linear joiner network.
The picture below displays the gradients (obtained by `rnnt_loss_simple` with `return_grad=True`) of lattice nodes. At each time frame, only a small set of nodes has a non-zero gradient, which justifies the pruned RNN-T loss, i.e., putting a limit on the number of symbols per frame.
<img src="https://user-images.githubusercontent.com/5284924/158116784-4dcf1107-2b84-4c0c-90c3-cb4a02f027c9.png" width="900" height="250" />
> This picture is taken from [here](https://github.com/k2-fsa/icefall/pull/251)
## Installation
Supported component combinations:

| PyTorch | fastpt | fast_rnnt | DTK | Python | Recommended build mode |
| ----------- | ----------- | ----------- | ------------------------ | -----------------| ------------ |
| 2.5.1 | 2.1.0 | master | >= 25.04 | 3.8, 3.10, 3.11 | fastpt without transcoding |
| 2.4.1 | 2.0.1 | master | >= 25.04 | 3.8, 3.10, 3.11 | fastpt without transcoding |
| Other | Other | Other | Other | 3.8, 3.10, 3.11 | HIP transcoding |

+ With PyTorch above 2.4.1 and DTK above 25.04, building with fastpt without transcoding is recommended.
### 1. Install via pip
Download the fast_rnnt wheel from the [光合开发者社区](https://download.sourcefind.cn:65024/4/main), choosing the wheel that matches your PyTorch and Python versions:
```shell
pip install torch*              # the downloaded torch wheel
pip install fastpt* --no-deps   # the downloaded fastpt wheel
source /usr/local/bin/fastpt -E
pip install fast_rnnt*          # the downloaded fast_rnnt wheel
```
### 2. Build and install from source
#### Build environment
Builds with fastpt (no transcoding) are supported; prepare the environment in one of two ways:
1. Use a PyTorch base image: download it from the [光合开发者社区](https://sourcefind.cn/#/image/dcu/pytorch), choosing the image that matches your PyTorch, Python, DTK, and OS versions.
2. Use an existing Python environment: install PyTorch and fastpt from wheels downloaded from the [光合开发者社区](https://sourcefind.cn/#/image/dcu/pytorch) for your Python and DTK versions:
```shell
pip install torch*              # the downloaded torch wheel
pip install fastpt* --no-deps   # the downloaded fastpt wheel; install torch first, then fastpt
pip install pytest
pip install wheel
```
#### Build and install from source
- Get the code:
```shell
git clone http://developer.sourcefind.cn/codes/OpenDAS/fast_rnnt.git  # switch branches as needed for your build
```
- Two build options are available (run inside the fast_rnnt directory); set the environment variable first, then use either option:
```shell
# 1. Set the environment variable for a no-transcoding build
source /usr/local/bin/fastpt -C
# 2. Build a wheel and install it
python3 setup.py bdist_wheel
pip install dist/fast_rnnt* --no-deps
# 3. Alternatively, install directly from source
python3 setup.py install --no-deps
```
#### Notes
+ If installation via pip is slow, add the Tsinghua PyPI mirror: `-i https://pypi.tuna.tsinghua.edu.cn/simple/`
+ `ROCM_PATH` is the DTK installation path; it defaults to `/opt/dtk`.
+ Building against PyTorch 2.5.1 requires C++17 support: open `setup.py` and change `-std=c++14` to `-std=c++17`.

You can also install the upstream package via `pip`:
```
pip install fast_rnnt
```
or from the upstream source:
```
git clone https://github.com/danpovey/fast_rnnt.git
cd fast_rnnt
python setup.py install
```
To check that `fast_rnnt` was installed successfully, please run
```
python3 -c "import fast_rnnt; print(fast_rnnt.__version__)"
```
which should print the version of the installed `fast_rnnt`, e.g., `1.0`.
### How to display the installation log ?
Use
```
pip install --verbose fast_rnnt
```
### How to reduce installation time ?
Use
```
export FT_MAKE_ARGS="-j"
pip install --verbose fast_rnnt
```
It will pass `-j` to `make`.
### Which versions of PyTorch are supported ?
It has been tested on PyTorch >= 1.5.0.
Note: the CUDA version of PyTorch must match the CUDA version in your environment, or compilation will fail.
### How to install a CPU version of `fast_rnnt` ?
Use
```
export FT_CMAKE_ARGS="-DCMAKE_BUILD_TYPE=Release -DFT_WITH_CUDA=OFF"
export FT_MAKE_ARGS="-j"
pip install --verbose fast_rnnt
```
It will pass `-DCMAKE_BUILD_TYPE=Release -DFT_WITH_CUDA=OFF` to `cmake`.
### Where to get help if I have problems with the installation ?
Please file an issue at <https://github.com/danpovey/fast_rnnt/issues>
and describe your problem there.
## Verification
```
python3
Python 3.10.12 (main, May 27 2025, 17:12:29) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import fast_rnnt
>>> fast_rnnt.__version__
'1.2'
>>>
```
The version number is kept in sync with the upstream release; the query above prints the installed version of the software.
## Known Issue
-
## Usage
### For rnnt_loss_simple
This is a simple case of the RNN-T loss, where the joiner network is just
addition.
Note: `termination_symbol` plays the role of the blank symbol in other RNN-T loss implementations; we call it `termination_symbol` because it terminates the symbols of the current frame.
```python
import torch
import fast_rnnt

# Example sizes (assumed values): batch, frames, max symbols, vocabulary.
B, T, S, C = 2, 10, 4, 30
target_lengths = torch.full((B,), S, dtype=torch.int64)
num_frames = torch.full((B,), T, dtype=torch.int64)

am = torch.randn((B, T, C), dtype=torch.float32)
lm = torch.randn((B, S + 1, C), dtype=torch.float32)
symbols = torch.randint(0, C, (B, S))
termination_symbol = 0
boundary = torch.zeros((B, 4), dtype=torch.int64)
boundary[:, 2] = target_lengths
boundary[:, 3] = num_frames
loss = fast_rnnt.rnnt_loss_simple(
lm=lm,
am=am,
symbols=symbols,
termination_symbol=termination_symbol,
boundary=boundary,
reduction="sum",
)
```
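Judging from the snippet above (the begin columns are left at zero, column 2 is set to the target length, and column 3 to the frame count), each `boundary` row can be read as `[begin_symbol, begin_frame, end_symbol, end_frame]`. A torch-free sketch of building such rows, using a helper name of our own:

```python
def make_boundary(target_lengths, num_frames):
    # One row per utterance: [begin_symbol, begin_frame, end_symbol, end_frame].
    # Begins are zero here, so column 2 is the target length in symbols
    # and column 3 is the number of acoustic frames.
    return [[0, 0, s, t] for s, t in zip(target_lengths, num_frames)]

rows = make_boundary([3, 5], [10, 12])
# rows == [[0, 0, 3, 10], [0, 0, 5, 12]]
```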
### For rnnt_loss_smoothed
The same as `rnnt_loss_simple`, except that it supports `am_only` & `lm_only` smoothing, which lets you give the loss function the form:
lm_only_scale * lm_probs +
am_only_scale * am_probs +
(1-lm_only_scale-am_only_scale) * combined_probs
where `lm_probs` and `am_probs` are the probabilities given by the LM and the acoustic model independently.
```python
am = torch.randn((B, T, C), dtype=torch.float32)
lm = torch.randn((B, S + 1, C), dtype=torch.float32)
symbols = torch.randint(0, C, (B, S))
termination_symbol = 0
boundary = torch.zeros((B, 4), dtype=torch.int64)
boundary[:, 2] = target_lengths
boundary[:, 3] = num_frames
loss = fast_rnnt.rnnt_loss_smoothed(
lm=lm,
am=am,
symbols=symbols,
termination_symbol=termination_symbol,
lm_only_scale=0.25,
am_only_scale=0.0,
boundary=boundary,
reduction="sum",
)
```
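Read literally, the formula above mixes probabilities rather than log-probabilities. A scalar sketch of that reading, our own illustration rather than the library's actual implementation:

```python
import math

def smoothed_logprob(lm_logp, am_logp, combined_logp,
                     lm_only_scale=0.25, am_only_scale=0.0):
    # Interpolate in probability space, then return to log space:
    #   lm_only_scale * lm_probs + am_only_scale * am_probs
    #     + (1 - lm_only_scale - am_only_scale) * combined_probs
    p = (lm_only_scale * math.exp(lm_logp)
         + am_only_scale * math.exp(am_logp)
         + (1.0 - lm_only_scale - am_only_scale) * math.exp(combined_logp))
    return math.log(p)
```

With the defaults used above (`lm_only_scale=0.25`, `am_only_scale=0.0`), three quarters of the probability mass comes from the combined joiner output and one quarter from the LM alone.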
### For rnnt_loss_pruned
`rnnt_loss_pruned` cannot be used alone; it needs the gradients returned by `rnnt_loss_simple`/`rnnt_loss_smoothed` to get the pruning bounds.
```python
am = torch.randn((B, T, C), dtype=torch.float32)
lm = torch.randn((B, S + 1, C), dtype=torch.float32)
symbols = torch.randint(0, C, (B, S))
termination_symbol = 0
boundary = torch.zeros((B, 4), dtype=torch.int64)
boundary[:, 2] = target_lengths
boundary[:, 3] = num_frames
# rnnt_loss_smoothed may be used here instead of rnnt_loss_simple
simple_loss, (px_grad, py_grad) = fast_rnnt.rnnt_loss_simple(
lm=lm,
am=am,
symbols=symbols,
termination_symbol=termination_symbol,
boundary=boundary,
reduction="sum",
return_grad=True,
)
s_range = 5 # can be other values
ranges = fast_rnnt.get_rnnt_prune_ranges(
px_grad=px_grad,
py_grad=py_grad,
boundary=boundary,
s_range=s_range,
)
am_pruned, lm_pruned = fast_rnnt.do_rnnt_pruning(am=am, lm=lm, ranges=ranges)
logits = model.joiner(am_pruned, lm_pruned)
pruned_loss = fast_rnnt.rnnt_loss_pruned(
logits=logits,
symbols=symbols,
ranges=ranges,
termination_symbol=termination_symbol,
boundary=boundary,
reduction="sum",
)
```
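Conceptually, `get_rnnt_prune_ranges` picks, for each frame, a window of `s_range` consecutive symbol positions that carries most of the gradient mass. A simplified, torch-free sketch of that idea (ours, not the real implementation, which works on batched `px_grad`/`py_grad` tensors and respects `boundary`):

```python
def prune_ranges(grads, s_range):
    """grads: per-frame lists of non-negative gradient magnitudes over
    the S + 1 symbol positions; returns the first kept index per frame."""
    num_positions = len(grads[0])
    starts = []
    prev = 0
    for frame in grads:
        best_mass, best_start = -1.0, prev
        # Windows may only move forward so the pruned lattice stays monotonic.
        for s in range(prev, num_positions - s_range + 1):
            mass = sum(frame[s:s + s_range])
            if mass > best_mass:
                best_mass, best_start = mass, s
        starts.append(best_start)
        prev = best_start
    return starts
```

For a diagonal gradient pattern the windows track the diagonal, which is exactly why keeping only `s_range` symbols per frame loses almost nothing.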
You can also find recipes [here](https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless) that use `rnnt_loss_pruned` to train a model.
### For rnnt_loss
The unpruned `rnnt_loss` is the same as torchaudio's `rnnt_loss`; it produces the same output as torchaudio for the same input.
```python
logits = torch.randn((B, S, T, C), dtype=torch.float32)
symbols = torch.randint(0, C, (B, S))
termination_symbol = 0
boundary = torch.zeros((B, 4), dtype=torch.int64)
boundary[:, 2] = target_lengths
boundary[:, 3] = num_frames
loss = fast_rnnt.rnnt_loss(
logits=logits,
symbols=symbols,
termination_symbol=termination_symbol,
boundary=boundary,
reduction="sum",
)
```
## Benchmarking
The [transducer-loss-benchmarking repo](https://github.com/csukuangfj/transducer-loss-benchmarking) compares the speed and memory usage of several transducer losses. The summary in the following table is taken from there; see the repository for more details.
Note: As mentioned above, `fast_rnnt` is also implemented in the [k2](https://github.com/k2-fsa/k2) project, so `k2` and `fast_rnnt` are equivalent in this benchmark.
|Name                |Average step time (us) | Peak memory usage (MB)|
|--------------------|-----------------------|-----------------------|
|torchaudio          |601447                 |12959.2                |
|fast_rnnt(unpruned) |274407                 |15106.5                |
|fast_rnnt(pruned)   |38112                  |2647.8                 |
|optimized_transducer|567684                 |10903.1                |
|warprnnt_numba      |229340                 |13061.8                |
|warp-transducer     |210772                 |13061.8                |
## References
- [README_ORIGIN](README_ORIGIN.md)
- [https://github.com/k2-fsa/fast_rnnt](https://github.com/k2-fsa/fast_rnnt)