# <div align="center"><strong>fast_rnnt</strong></div>
## Introduction
fast_rnnt implements a faster, more memory-efficient method for computing the RNN-T loss.
This project implements a method for faster and more memory-efficient RNN-T loss computation, called `pruned rnnt`.
Note: There is also a fast RNN-T loss implementation in the [k2](https://github.com/k2-fsa/k2) project, which shares the same code as here. We make `fast_rnnt` a stand-alone project in case someone wants only this RNN-T loss.
## How does the pruned-rnnt work ?
We first obtain pruning bounds for the RNN-T recursion using a simple joiner network that is just an addition of the encoder and decoder, then we use those pruning bounds to evaluate the full, non-linear joiner network.
The picture below displays the gradients (obtained by `rnnt_loss_simple` with `return_grad=True`) of lattice nodes. At each time frame, only a small set of nodes has a non-zero gradient, which justifies the pruned RNN-T loss, i.e., putting a limit on the number of symbols per frame.
<img src="https://user-images.githubusercontent.com/5284924/158116784-4dcf1107-2b84-4c0c-90c3-cb4a02f027c9.png" width="900" height="250" />
> This picture is taken from [here](https://github.com/k2-fsa/icefall/pull/251)
## Installation
Supported component combinations:

| PyTorch | fastpt | fast_rnnt | DTK | Python | Recommended build mode |
| ----------- | ----------- | ----------- | ------------------------ | -----------------| ------------ |
| 2.5.1 | 2.1.0 | master | >= 25.04 | 3.8, 3.10, 3.11 | fastpt without transcoding |
| 2.4.1 | 2.0.1 | master | >= 25.04 | 3.8, 3.10, 3.11 | fastpt without transcoding |
| Other | Other | Other | Other | 3.8, 3.10, 3.11 | HIP transcoding |

+ With PyTorch above 2.4.1 and DTK above 25.04, building with fastpt without transcoding is recommended.
### 1. Install via pip
Download the fast_rnnt wheel from the [光合开发者社区](https://download.sourcefind.cn:65024/4/main), choosing the wheel that matches your PyTorch and Python versions:
```shell
pip install torch*              # the downloaded torch wheel
pip install fastpt* --no-deps   # the downloaded fastpt wheel
source /usr/local/bin/fastpt -E
pip install fast_rnnt*          # the downloaded fast_rnnt wheel
```
### 2. Build and install from source
#### Build environment
Builds with fastpt (no transcoding) are supported; prepare the environment in one of two ways:
1. Use a PyTorch base image: download it from the [光合开发者社区](https://sourcefind.cn/#/image/dcu/pytorch), choosing the image that matches your PyTorch, Python, DTK, and OS versions.
2. Use an existing Python environment: install PyTorch and fastpt from wheels downloaded from the [光合开发者社区](https://sourcefind.cn/#/image/dcu/pytorch) for your Python and DTK versions:
```shell
pip install torch*              # the downloaded torch wheel
pip install fastpt* --no-deps   # the downloaded fastpt wheel; install torch first, then fastpt
pip install pytest
pip install wheel
```
#### Build and install from source
- Get the code:
```shell
git clone http://developer.sourcefind.cn/codes/OpenDAS/fast_rnnt.git  # switch branches as needed for your build
```
- Two build options are available (run inside the fast_rnnt directory); set the environment variable first, then use either option:
```shell
# 1. Set the environment variable for a no-transcoding build
source /usr/local/bin/fastpt -C
# 2. Build a wheel and install it
python3 setup.py bdist_wheel
pip install dist/fast_rnnt* --no-deps
# 3. Alternatively, install directly from source
python3 setup.py install --no-deps
```
#### Notes
+ If installation via pip is slow, add the Tsinghua PyPI mirror: `-i https://pypi.tuna.tsinghua.edu.cn/simple/`
+ `ROCM_PATH` is the DTK installation path; it defaults to `/opt/dtk`.
+ Building against PyTorch 2.5.1 requires C++17 support: open `setup.py` and change `-std=c++14` to `-std=c++17`.

You can also install the upstream package via `pip`:
```
pip install fast_rnnt
```
or from the upstream source:
```
git clone https://github.com/danpovey/fast_rnnt.git
cd fast_rnnt
python setup.py install
```
To check that `fast_rnnt` was installed successfully, please run
```
python3 -c "import fast_rnnt; print(fast_rnnt.__version__)"
```
which should print the version of the installed `fast_rnnt`, e.g., `1.0`.
### How to display the installation log ?
Use
```
pip install --verbose fast_rnnt
```
### How to reduce installation time ?
Use
```
export FT_MAKE_ARGS="-j"
pip install --verbose fast_rnnt
```
It will pass `-j` to `make`.
### Which versions of PyTorch are supported ?
It has been tested on PyTorch >= 1.5.0.
Note: the CUDA version of PyTorch must match the CUDA version in your environment, or compilation will fail.
### How to install a CPU version of `fast_rnnt` ?
Use
```
export FT_CMAKE_ARGS="-DCMAKE_BUILD_TYPE=Release -DFT_WITH_CUDA=OFF"
export FT_MAKE_ARGS="-j"
pip install --verbose fast_rnnt
```
It will pass `-DCMAKE_BUILD_TYPE=Release -DFT_WITH_CUDA=OFF` to `cmake`.
### Where to get help if I have problems with the installation ?
Please file an issue at <https://github.com/danpovey/fast_rnnt/issues>
and describe your problem there.
## Verification
```
python3
Python 3.10.12 (main, May 27 2025, 17:12:29) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import fast_rnnt
>>> fast_rnnt.__version__
'1.2'
>>>
```
The version number is kept in sync with the upstream release; the query above prints the installed version of the software.
## Known Issue
-
## Usage
### For rnnt_loss_simple
This is a simple case of the RNN-T loss, where the joiner network is just
addition.
Note: `termination_symbol` plays the role of the blank symbol in other RNN-T loss implementations; we call it `termination_symbol` because it terminates the symbols of the current frame.
```python
import torch
import fast_rnnt

# Example sizes (assumed values): batch, frames, max symbols, vocabulary.
B, T, S, C = 2, 10, 4, 30
target_lengths = torch.full((B,), S, dtype=torch.int64)
num_frames = torch.full((B,), T, dtype=torch.int64)

am = torch.randn((B, T, C), dtype=torch.float32)
lm = torch.randn((B, S + 1, C), dtype=torch.float32)
symbols = torch.randint(0, C, (B, S))
termination_symbol = 0
boundary = torch.zeros((B, 4), dtype=torch.int64)
boundary[:, 2] = target_lengths
boundary[:, 3] = num_frames
loss = fast_rnnt.rnnt_loss_simple(
lm=lm,
am=am,
symbols=symbols,
termination_symbol=termination_symbol,
boundary=boundary,
reduction="sum",
)
```
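Judging from the snippet above (the begin columns are left at zero, column 2 is set to the target length, and column 3 to the frame count), each `boundary` row can be read as `[begin_symbol, begin_frame, end_symbol, end_frame]`. A torch-free sketch of building such rows, using a helper name of our own:

```python
def make_boundary(target_lengths, num_frames):
    # One row per utterance: [begin_symbol, begin_frame, end_symbol, end_frame].
    # Begins are zero here, so column 2 is the target length in symbols
    # and column 3 is the number of acoustic frames.
    return [[0, 0, s, t] for s, t in zip(target_lengths, num_frames)]

rows = make_boundary([3, 5], [10, 12])
# rows == [[0, 0, 3, 10], [0, 0, 5, 12]]
```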
### For rnnt_loss_smoothed
The same as `rnnt_loss_simple`, except that it supports `am_only` & `lm_only` smoothing, which lets you give the loss function the form:
lm_only_scale * lm_probs +
am_only_scale * am_probs +
(1-lm_only_scale-am_only_scale) * combined_probs
where `lm_probs` and `am_probs` are the probabilities given by the LM and the acoustic model independently.
```python
am = torch.randn((B, T, C), dtype=torch.float32)
lm = torch.randn((B, S + 1, C), dtype=torch.float32)
symbols = torch.randint(0, C, (B, S))
termination_symbol = 0
boundary = torch.zeros((B, 4), dtype=torch.int64)
boundary[:, 2] = target_lengths
boundary[:, 3] = num_frames
loss = fast_rnnt.rnnt_loss_smoothed(
lm=lm,
am=am,
symbols=symbols,
termination_symbol=termination_symbol,
lm_only_scale=0.25,
am_only_scale=0.0,
boundary=boundary,
reduction="sum",
)
```
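Read literally, the formula above mixes probabilities rather than log-probabilities. A scalar sketch of that reading, our own illustration rather than the library's actual implementation:

```python
import math

def smoothed_logprob(lm_logp, am_logp, combined_logp,
                     lm_only_scale=0.25, am_only_scale=0.0):
    # Interpolate in probability space, then return to log space:
    #   lm_only_scale * lm_probs + am_only_scale * am_probs
    #     + (1 - lm_only_scale - am_only_scale) * combined_probs
    p = (lm_only_scale * math.exp(lm_logp)
         + am_only_scale * math.exp(am_logp)
         + (1.0 - lm_only_scale - am_only_scale) * math.exp(combined_logp))
    return math.log(p)
```

With the defaults used above (`lm_only_scale=0.25`, `am_only_scale=0.0`), three quarters of the probability mass comes from the combined joiner output and one quarter from the LM alone.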
### For rnnt_loss_pruned
`rnnt_loss_pruned` cannot be used alone; it needs the gradients returned by `rnnt_loss_simple`/`rnnt_loss_smoothed` to get the pruning bounds.
```python
am = torch.randn((B, T, C), dtype=torch.float32)
lm = torch.randn((B, S + 1, C), dtype=torch.float32)
symbols = torch.randint(0, C, (B, S))
termination_symbol = 0
boundary = torch.zeros((B, 4), dtype=torch.int64)
boundary[:, 2] = target_lengths
boundary[:, 3] = num_frames
# rnnt_loss_smoothed may be used here instead of rnnt_loss_simple
simple_loss, (px_grad, py_grad) = fast_rnnt.rnnt_loss_simple(
lm=lm,
am=am,
symbols=symbols,
termination_symbol=termination_symbol,
boundary=boundary,
reduction="sum",
return_grad=True,
)
s_range = 5 # can be other values
ranges = fast_rnnt.get_rnnt_prune_ranges(
px_grad=px_grad,
py_grad=py_grad,
boundary=boundary,
s_range=s_range,
)
am_pruned, lm_pruned = fast_rnnt.do_rnnt_pruning(am=am, lm=lm, ranges=ranges)
logits = model.joiner(am_pruned, lm_pruned)
pruned_loss = fast_rnnt.rnnt_loss_pruned(
logits=logits,
symbols=symbols,
ranges=ranges,
termination_symbol=termination_symbol,
boundary=boundary,
reduction="sum",
)
```
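Conceptually, `get_rnnt_prune_ranges` picks, for each frame, a window of `s_range` consecutive symbol positions that carries most of the gradient mass. A simplified, torch-free sketch of that idea (ours, not the real implementation, which works on batched `px_grad`/`py_grad` tensors and respects `boundary`):

```python
def prune_ranges(grads, s_range):
    """grads: per-frame lists of non-negative gradient magnitudes over
    the S + 1 symbol positions; returns the first kept index per frame."""
    num_positions = len(grads[0])
    starts = []
    prev = 0
    for frame in grads:
        best_mass, best_start = -1.0, prev
        # Windows may only move forward so the pruned lattice stays monotonic.
        for s in range(prev, num_positions - s_range + 1):
            mass = sum(frame[s:s + s_range])
            if mass > best_mass:
                best_mass, best_start = mass, s
        starts.append(best_start)
        prev = best_start
    return starts
```

For a diagonal gradient pattern the windows track the diagonal, which is exactly why keeping only `s_range` symbols per frame loses almost nothing.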
You can also find recipes [here](https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless) that use `rnnt_loss_pruned` to train a model.
### For rnnt_loss
The unpruned `rnnt_loss` is the same as torchaudio's `rnnt_loss`; it produces the same output as torchaudio for the same input.
```python
logits = torch.randn((B, S, T, C), dtype=torch.float32)
symbols = torch.randint(0, C, (B, S))
termination_symbol = 0
boundary = torch.zeros((B, 4), dtype=torch.int64)
boundary[:, 2] = target_lengths
boundary[:, 3] = num_frames
loss = fast_rnnt.rnnt_loss(
logits=logits,
symbols=symbols,
termination_symbol=termination_symbol,
boundary=boundary,
reduction="sum",
)
```
## Benchmarking
The [transducer-loss-benchmarking repo](https://github.com/csukuangfj/transducer-loss-benchmarking) compares the speed and memory usage of several transducer losses. The summary in the following table is taken from there; see the repository for more details.
Note: As mentioned above, `fast_rnnt` is also implemented in the [k2](https://github.com/k2-fsa/k2) project, so `k2` and `fast_rnnt` are equivalent in this benchmark.
|Name                |Average step time (us) | Peak memory usage (MB)|
|--------------------|-----------------------|-----------------------|
|torchaudio          |601447                 |12959.2                |
|fast_rnnt(unpruned) |274407                 |15106.5                |
|fast_rnnt(pruned)   |38112                  |2647.8                 |
|optimized_transducer|567684                 |10903.1                |
|warprnnt_numba      |229340                 |13061.8                |
|warp-transducer     |210772                 |13061.8                |
## References
- [README_ORIGIN](README_ORIGIN.md)
- [https://github.com/k2-fsa/fast_rnnt](https://github.com/k2-fsa/fast_rnnt)