pipelines.rst 6.66 KB
Newer Older
1
2
.. py:module:: torchaudio.pipelines

3
4
5
6
torchaudio.pipelines
====================

.. currentmodule:: torchaudio.pipelines
moto's avatar
moto committed
7
		   
8
The ``torchaudio.pipelines`` module packages pre-trained models with support functions and meta-data into simple APIs tailored to perform specific tasks.
9

10
When using pre-trained models to perform a task, in addition to instantiating the model with pre-trained weights, the client code also needs to build pipelines for feature extractions and post processing in the same way they were done during the training. This requires to carrying over information used during the training, such as the type of transforms and the their parameters (for example, sampling rate the number of FFT bins).
11

12
To make this information tied to a pre-trained model and easily accessible, ``torchaudio.pipelines`` module uses the concept of a `Bundle` class, which defines a set of APIs to instantiate pipelines, and the interface of the pipelines.
13

14
The following figure illustrates this.
15

16
.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-intro.png
17

18
A pre-trained model and associated pipelines are expressed as an instance of ``Bundle``. Different instances of same ``Bundle`` share the interface, but their implementations are not constrained to be of same types. For example, :class:`SourceSeparationBundle` defines the interface for performing source separation, but its instance :data:`CONVTASNET_BASE_LIBRI2MIX` instantiates a model of :class:`~torchaudio.models.ConvTasNet` while :data:`HDEMUCS_HIGH_MUSDB` instantiates a model of :class:`~torchaudio.models.HDemucs`. Still, because they share the same interface, the usage is the same.
19

20
.. note::
21

22
   Under the hood, the implementations of ``Bundle`` use components from other ``torchaudio`` modules, such as :mod:`torchaudio.models` and :mod:`torchaudio.transforms`, or even third party libraries like `SentencPiece <https://github.com/google/sentencepiece>`__ and `DeepPhonemizer <https://github.com/as-ideas/DeepPhonemizer>`__. But this implementation detail is abstracted away from library users.
23

24
25
RNN-T Streaming/Non-Streaming ASR
---------------------------------
26

27
28
Interface
^^^^^^^^^
29

30
``RNNTBundle`` defines ASR pipelines and consists of three steps: feature extraction, inference, and de-tokenization.
31

32
.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-rnntbundle.png
33

34
35
36
37
.. autosummary::
   :toctree: generated
   :nosignatures:
   :template: autosummary/bundle_class.rst
38

39
40
41
   RNNTBundle
   RNNTBundle.FeatureExtractor
   RNNTBundle.TokenProcessor
42

43
.. rubric:: Tutorials using ``RNNTBundle``
44

45
.. minigallery:: torchaudio.pipelines.RNNTBundle
46

47
48
Pretrained Models
^^^^^^^^^^^^^^^^^
49

50
51
52
53
.. autosummary::
   :toctree: generated
   :nosignatures:
   :template: autosummary/bundle_data.rst
54

55
   EMFORMER_RNNT_BASE_LIBRISPEECH
56
57


Grigory Sizov's avatar
Grigory Sizov committed
58
wav2vec 2.0 / HuBERT / WavLM - SSL
59
----------------------------------
60

61
62
Interface
^^^^^^^^^
63

64
``Wav2Vec2Bundle`` instantiates models that generate acoustic features that can be used for downstream inference and fine-tuning.
65

66
.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-wav2vec2bundle.png
67

68
69
70
71
.. autosummary::
   :toctree: generated
   :nosignatures:
   :template: autosummary/bundle_class.rst
72

73
   Wav2Vec2Bundle
74

75
76
Pretrained Models
^^^^^^^^^^^^^^^^^
77

78
79
80
81
.. autosummary::
   :toctree: generated
   :nosignatures:
   :template: autosummary/bundle_data.rst
82

83
84
85
86
   WAV2VEC2_BASE
   WAV2VEC2_LARGE
   WAV2VEC2_LARGE_LV60K
   WAV2VEC2_XLSR53
87
88
89
   WAV2VEC2_XLSR_300M
   WAV2VEC2_XLSR_1B
   WAV2VEC2_XLSR_2B
90
91
92
   HUBERT_BASE
   HUBERT_LARGE
   HUBERT_XLARGE
Grigory Sizov's avatar
Grigory Sizov committed
93
94
95
   WAVLM_BASE
   WAVLM_BASE_PLUS
   WAVLM_LARGE
96

97
98
wav2vec 2.0 / HuBERT - Fine-tuned ASR
-------------------------------------
99

100
101
Interface
^^^^^^^^^
102

103
``Wav2Vec2ASRBundle`` instantiates models that generate probability distribution over pre-defined labels, that can be used for ASR.
104

105
.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-wav2vec2asrbundle.png
106

107
108
109
110
.. autosummary::
   :toctree: generated
   :nosignatures:
   :template: autosummary/bundle_class.rst
111

112
   Wav2Vec2ASRBundle
113

114
.. rubric:: Tutorials using ``Wav2Vec2ASRBundle``
115

116
.. minigallery:: torchaudio.pipelines.Wav2Vec2ASRBundle
117

118
119
Pretrained Models
^^^^^^^^^^^^^^^^^
120

121
122
123
124
.. autosummary::
   :toctree: generated
   :nosignatures:
   :template: autosummary/bundle_data.rst
125

126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
   WAV2VEC2_ASR_BASE_10M
   WAV2VEC2_ASR_BASE_100H
   WAV2VEC2_ASR_BASE_960H
   WAV2VEC2_ASR_LARGE_10M
   WAV2VEC2_ASR_LARGE_100H
   WAV2VEC2_ASR_LARGE_960H
   WAV2VEC2_ASR_LARGE_LV60K_10M
   WAV2VEC2_ASR_LARGE_LV60K_100H
   WAV2VEC2_ASR_LARGE_LV60K_960H
   VOXPOPULI_ASR_BASE_10K_DE
   VOXPOPULI_ASR_BASE_10K_EN
   VOXPOPULI_ASR_BASE_10K_ES
   VOXPOPULI_ASR_BASE_10K_FR
   VOXPOPULI_ASR_BASE_10K_IT
   HUBERT_ASR_LARGE
   HUBERT_ASR_XLARGE
142

moto's avatar
moto committed
143
144
145
146

Tacotron2 Text-To-Speech
------------------------

147
``Tacotron2TTSBundle`` defines text-to-speech pipelines and consists of three steps: tokenization, spectrogram generation and vocoder. The spectrogram generation is based on :class:`~torchaudio.models.Tacotron2` model.
moto's avatar
moto committed
148

149
.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-tacotron2bundle.png
moto's avatar
moto committed
150

151
``TextProcessor`` can be rule-based tokenization in the case of characters, or it can be a neural-netowrk-based G2P model that generates sequence of phonemes from input text.
moto's avatar
moto committed
152

153
Similarly ``Vocoder`` can be an algorithm without learning parameters, like `Griffin-Lim`, or a neural-network-based model like `Waveglow`.
moto's avatar
moto committed
154

155
156
Interface
^^^^^^^^^
moto's avatar
moto committed
157

158
159
160
161
.. autosummary::
   :toctree: generated
   :nosignatures:
   :template: autosummary/bundle_class.rst
moto's avatar
moto committed
162

163
164
165
   Tacotron2TTSBundle
   Tacotron2TTSBundle.TextProcessor
   Tacotron2TTSBundle.Vocoder
moto's avatar
moto committed
166

167
.. rubric:: Tutorials using ``Tacotron2TTSBundle``
moto's avatar
moto committed
168

169
.. minigallery:: torchaudio.pipelines.Tacotron2TTSBundle
moto's avatar
moto committed
170

171
172
Pretrained Models
^^^^^^^^^^^^^^^^^
moto's avatar
moto committed
173

174
175
176
177
.. autosummary::
   :toctree: generated
   :nosignatures:
   :template: autosummary/bundle_data.rst
moto's avatar
moto committed
178

179
180
181
182
   TACOTRON2_WAVERNN_PHONE_LJSPEECH
   TACOTRON2_WAVERNN_CHAR_LJSPEECH
   TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH
   TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH
183

184
185
186
Source Separation
-----------------

187
188
Interface
^^^^^^^^^
189

190
``SourceSeparationBundle`` instantiates source separation models which take single channel audio and generates multi-channel audio.
191

192
.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-sourceseparationbundle.png
193

194
195
196
197
.. autosummary::
   :toctree: generated
   :nosignatures:
   :template: autosummary/bundle_class.rst
198

199
   SourceSeparationBundle
200

201
.. rubric:: Tutorials using ``SourceSeparationBundle``
202

203
.. minigallery:: torchaudio.pipelines.SourceSeparationBundle
204

205
206
Pretrained Models
^^^^^^^^^^^^^^^^^
207

208
209
210
211
.. autosummary::
   :toctree: generated
   :nosignatures:
   :template: autosummary/bundle_data.rst
212

213
214
215
   CONVTASNET_BASE_LIBRI2MIX
   HDEMUCS_HIGH_MUSDB_PLUS
   HDEMUCS_HIGH_MUSDB