FunASR has open-sourced a large number of pre-trained models on industrial data. You are free to use, copy, modify, and share FunASR models under the [Model License Agreement](https://github.com/alibaba-damo-academy/FunASR/blob/main/MODEL_LICENSE). Below, we list some representative models. For a comprehensive list, please refer to our [Model Zoo](https://github.com/alibaba-damo-academy/FunASR/tree/main/model_zoo).
<div align="center">
<h4>
<a href="#Inference"> Model Inference </a>
| <a href="#Training"> Model Training and Testing </a>
</h4>
</div>
- `model`(str): The model name in the [Model Repository](https://github.com/alibaba-damo-academy/FunASR/tree/main/model_zoo), or a model path on local disk.
- `device`(str): `cuda:0` (default, GPU 0) runs inference on the GPU; set `cpu` to run inference on the CPU.
- `ncpu`(int): `4` (default), the number of threads used for CPU intra-op parallelism.
- `output_dir`(str): `None` (default); set this to specify the output path for the results.
- `batch_size`(int): `1` (default), the number of samples per batch during decoding.
- `hub`(str): `ms` (default) downloads models from ModelScope; use `hf` to download models from Hugging Face.
- `**kwargs`(dict): Any parameter found in config.yaml can be specified directly here, for instance the maximum segmentation length of the VAD model, `max_single_segment_time=6000` (milliseconds).
#### AutoModel Inference
```python
res=model.generate(input=[str],output_dir=[str])
```
- `input`: The input to be decoded, which can be:
  - A wav file path, e.g., `asr_example.wav`.
  - A pcm file path, e.g., `asr_example.pcm`. In this case, specify the audio sampling rate `fs` (default: 16000).
  - An audio byte stream, e.g., byte data from a microphone.
  - A `wav.scp`, a Kaldi-style wav list (`wav_id \t wav_path`), for example:
    ```text
    asr_example1 ./audios/asr_example1.wav
    asr_example2 ./audios/asr_example2.wav
    ```
    When using `wav.scp` as input, you must set `output_dir` to save the output results.
  - Audio samples, e.g., `audio, rate = soundfile.read("asr_example_zh.wav")`, of type `numpy.ndarray`. Batch input is also supported, passed as a list.
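
To make the `wav.scp` layout concrete, here is a minimal sketch of reading such a list into a dictionary. The helper `read_wav_scp` is hypothetical and not part of FunASR; it only assumes one `wav_id` and one path per line, separated by whitespace.

```python
from pathlib import Path
import tempfile


def read_wav_scp(scp_path):
    """Parse a Kaldi-style wav.scp into a {wav_id: wav_path} dict (hypothetical helper)."""
    entries = {}
    lines = Path(scp_path).read_text(encoding="utf-8").splitlines()
    for line_no, line in enumerate(lines, 1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(maxsplit=1)  # split on the first run of whitespace (tab or space)
        if len(parts) != 2:
            raise ValueError(f"line {line_no}: expected '<wav_id> <wav_path>', got {line!r}")
        wav_id, wav_path = parts
        entries[wav_id] = wav_path
    return entries


# Demo with the two entries shown above, written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    scp = Path(tmp) / "wav.scp"
    scp.write_text(
        "asr_example1\t./audios/asr_example1.wav\n"
        "asr_example2\t./audios/asr_example2.wav\n",
        encoding="utf-8",
    )
    wav_table = read_wav_scp(scp)

print(wav_table)
```

A check like this can catch malformed lines before a long decoding run starts.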
```python
# Assumes `model` is an AutoModel built with a `vad_model`, and `wav_file` is an audio path.
res = model.generate(input=wav_file, batch_size_s=300, batch_size_threshold_s=60, hotword='魔搭')
print(res)
```
Notes:
- Model input is typically limited to under 30 seconds. Combined with `vad_model`, audio of any length is supported. This is not limited to the Paraformer model; any audio-input model can be used.
- Model parameters can be specified directly in the `AutoModel` definition. Parameters for `vad_model` are set through `vad_kwargs`, which is a dict; `punc_kwargs` and `spk_kwargs` work the same way.
- `max_single_segment_time`: the maximum length of an audio segment produced by `vad_model`, in milliseconds (ms).
- `batch_size_s`: enables dynamic batching; the value is the total audio duration within a batch, in seconds (s).
- `batch_size_threshold_s`: when a segment produced by VAD is longer than this threshold, the batch size is set to 1; measured in seconds (s).

Recommendations:

When long audio input causes Out Of Memory (OOM) errors, keep in mind that memory usage tends to grow quadratically with audio length, and consider the following three cases:

a) At the beginning of inference, memory usage primarily depends on `batch_size_s`. Reducing this value decreases memory usage.

b) In the middle of inference, if OOM still occurs on long VAD segments whose total duration is less than `batch_size_s`, reduce `batch_size_threshold_s`. Segments that exceed the threshold are decoded with a batch size of 1.

c) Towards the end of inference, if OOM still occurs on long VAD segments that are below `batch_size_s` in total, exceed `batch_size_threshold_s`, and are therefore already decoded with a batch size of 1, reduce `max_single_segment_time` to shorten the VAD segments.

```python
# Assumes `model`, `cache`, `chunk_size`, and the look-back settings are defined earlier; `speech_chunk` is the current audio chunk.
res = model.generate(input=speech_chunk, cache=cache, is_final=is_final, chunk_size=chunk_size, encoder_chunk_look_back=encoder_chunk_look_back, decoder_chunk_look_back=decoder_chunk_look_back)
print(res)
```
Note: `chunk_size` configures the streaming latency. `[0,10,5]` means that the real-time display granularity is `10*60=600ms` and the lookahead is `5*60=300ms`. Each inference call takes `600ms` of audio as input (`16000*0.6=9600` sample points) and outputs the corresponding text. For the last speech chunk, set `is_final=True` to force the output of the last word.
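
The chunk arithmetic in this note can be sketched as follows. This assumes 16 kHz audio and 60 ms per `chunk_size` unit, as stated above. The helpers `chunk_stride_samples` and `iter_chunks` are hypothetical names for illustration, not FunASR functions, and the audio is a dummy list rather than a real recording.

```python
SAMPLE_RATE = 16000  # Hz, the default sampling rate mentioned in this document
FRAME_MS = 60        # one chunk_size unit corresponds to 60 ms, per the note above


def chunk_stride_samples(chunk_size, sample_rate=SAMPLE_RATE):
    """Samples consumed per inference call: chunk_size[1] units of 60 ms each."""
    return chunk_size[1] * FRAME_MS * sample_rate // 1000


def iter_chunks(samples, chunk_size):
    """Yield (chunk, is_final) pairs; is_final is True only for the last chunk."""
    stride = chunk_stride_samples(chunk_size)
    total = max(1, -(-len(samples) // stride))  # ceiling division, at least one chunk
    for i in range(total):
        yield samples[i * stride:(i + 1) * stride], i == total - 1


chunk_size = [0, 10, 5]
stride = chunk_stride_samples(chunk_size)   # 10 units * 60 ms * 16 samples per ms
lookahead_ms = chunk_size[2] * FRAME_MS     # 5 units * 60 ms

audio = [0.0] * 20000                       # stand-in for 1.25 s of 16 kHz samples
chunks = list(iter_chunks(audio, chunk_size))
print(stride, lookahead_ms, [(len(c), final) for c, final in chunks])
```

In a streaming loop, each `(chunk, is_final)` pair would be passed to `model.generate` together with the shared `cache`, so that only the last call sets `is_final=True`.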
Note: The VAD model outputs `[[beg1, end1], [beg2, end2], ..., [begN, endN]]`, where `begN`/`endN` are the start/end points of the `N`-th valid audio segment, measured in milliseconds.
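
As an illustration of this output format, the sketch below converts millisecond segments into sample index ranges and sums the detected speech time. The segment values are made up, the 16 kHz rate is an assumption, and both helpers are hypothetical rather than part of FunASR.

```python
def segments_to_samples(segments_ms, sample_rate=16000):
    """Convert [[beg_ms, end_ms], ...] into [(beg_sample, end_sample), ...]."""
    return [
        (beg * sample_rate // 1000, end * sample_rate // 1000)
        for beg, end in segments_ms
    ]


def total_speech_seconds(segments_ms):
    """Total duration of all valid segments, in seconds."""
    return sum(end - beg for beg, end in segments_ms) / 1000.0


# Hypothetical VAD output in the format described above (values are illustrative only).
vad_segments = [[70, 2340], [2620, 6200]]
sample_spans = segments_to_samples(vad_segments)
speech_s = total_speech_seconds(vad_segments)
print(sample_spans, speech_s)
```

The sample ranges can be used to slice a waveform array, for example `audio[beg:end]` for each pair.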
Execute with Python code (supports multi-node and multi-GPU, recommended):
```shell
cd examples/industrial_data_pretraining/paraformer
bash finetune.sh
# "log_file: ./outputs/log.txt"
```
For the full code, see [finetune.sh](https://github.com/alibaba-damo-academy/FunASR/blob/main/examples/industrial_data_pretraining/paraformer/finetune.sh).
### Detailed Parameter Description:
```shell
funasr/bin/train.py \
++model="${model_name_or_model_dir}" \
++train_data_set_list="${train_data}" \
++valid_data_set_list="${val_data}" \
++dataset_conf.batch_size=20000 \
++dataset_conf.batch_type="token" \
++dataset_conf.num_workers=4 \
++train_conf.max_epoch=50 \
++train_conf.log_interval=1 \
++train_conf.resume=false \
++train_conf.validate_interval=2000 \
++train_conf.save_checkpoint_interval=2000 \
++train_conf.keep_nbest_models=20 \
++train_conf.avg_nbest_model=10 \
++optim_conf.lr=0.0002 \
++output_dir="${output_dir}" &> ${log_file}
```
- `model`(str): The model name (the ID in the model repository), in which case the script automatically downloads the model to local storage; alternatively, the path to a model already downloaded locally.
- `train_data_set_list`(str): The path to the training data, typically in jsonl format. For details, refer to the [examples](https://github.com/alibaba-damo-academy/FunASR/blob/main/data/list).
- `valid_data_set_list`(str): The path to the validation data, also typically in jsonl format. For details, refer to the [examples](https://github.com/alibaba-damo-academy/FunASR/blob/main/data/list).
- `dataset_conf.batch_type`(str): `example` (default), the batch type. `example` forms batches from a fixed number of `batch_size` samples; `length` or `token` enables dynamic batching, where the total length or number of tokens in a batch equals `batch_size`.
- `dataset_conf.batch_size`(int): Used together with `batch_type`. When `batch_type=example`, it is the number of samples; when `batch_type=length`, it is the total sample length, measured in fbank frames (1 frame = 10 ms) or in text tokens.
- `train_conf.max_epoch`(int): The total number of training epochs.
- `train_conf.log_interval`(int): The number of steps between log outputs.
- `train_conf.resume`(bool): Whether to resume training from a checkpoint.
- `train_conf.validate_interval`(int): The interval, in steps, at which validation is run during training.
- `train_conf.save_checkpoint_interval`(int): The interval, in steps, at which the model is saved during training.
- `train_conf.keep_nbest_models`(int): The maximum number of checkpoints to retain, ranked by validation set accuracy from highest to lowest.
- `train_conf.avg_nbest_model`(int): The number of top checkpoints, ranked by accuracy, to average.
- `optim_conf.lr`(float): The learning rate.
- `output_dir`(str): The path for saving the model.
- `**kwargs`(dict): Any parameter in config.yaml can be specified directly here. For example, to filter out audio longer than 20 s, set `dataset_conf.max_token_length=2000`, measured in fbank frames (1 frame = 10 ms) or in text tokens.
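
The difference between the batch types can be sketched as follows. This is a simplified illustration, not FunASR's actual sampler: it assumes greedy packing in input order and treats `batch_size` as an upper bound on the summed length, whereas the real implementation may sort, shuffle, or account for padding differently. The sample lengths are made up.

```python
def batch_by_example(lengths, batch_size):
    """batch_type="example": each batch holds at most batch_size samples."""
    indices = list(range(len(lengths)))
    return [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]


def batch_by_length(lengths, batch_size):
    """batch_type="length"/"token": pack samples until the summed length would exceed batch_size."""
    batches, current, current_total = [], [], 0
    for idx, length in enumerate(lengths):
        if current and current_total + length > batch_size:
            batches.append(current)
            current, current_total = [], 0
        current.append(idx)
        current_total += length
    if current:
        batches.append(current)
    return batches


# Hypothetical sample lengths in fbank frames (1 frame = 10 ms).
lengths = [500, 1200, 800, 2500, 300, 1900]

fixed_batches = batch_by_example(lengths, batch_size=4)
dynamic_batches = batch_by_length(lengths, batch_size=3000)

# Filtering in the spirit of dataset_conf.max_token_length=2000 (2000 frames = 20 s).
max_token_length = 2000
kept = [i for i, n in enumerate(lengths) if n <= max_token_length]
max_seconds = max_token_length * 10 / 1000

print(fixed_batches, dynamic_batches, kept, max_seconds)
```

Dynamic batching keeps the amount of audio per step roughly constant, which is why `batch_size=20000` with `batch_type="token"` in the command above refers to a frame or token budget rather than a sample count.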
On the worker node (assuming the IP is 192.168.1.2), you need to ensure that the `MASTER_ADDR` and `MASTER_PORT` environment variables are set to match those of the master node, and then run the same command:
`--nnodes` is the total number of nodes participating in the training, `--node_rank` is the ID of the current node, and `--nproc_per_node` is the number of processes running on each node (usually the number of GPUs).
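
To show how these three values relate, the sketch below computes the world size and the global rank of each process for a typical static launch. The node and GPU counts are hypothetical, and the rank layout shown (`node_rank * nproc_per_node + local_rank`) is the usual one for a fixed-size job; elastic launches may assign ranks differently.

```python
def world_size(nnodes, nproc_per_node):
    """Total number of training processes across all nodes."""
    return nnodes * nproc_per_node


def global_rank(node_rank, local_rank, nproc_per_node):
    """Global rank of a process, assuming ranks are laid out node by node."""
    return node_rank * nproc_per_node + local_rank


# Hypothetical setup: 2 nodes with 8 GPUs each.
nnodes, nproc_per_node = 2, 8
total = world_size(nnodes, nproc_per_node)

# Ranks of the processes on the worker node (node_rank=1).
worker_ranks = [global_rank(1, local, nproc_per_node) for local in range(nproc_per_node)]
print(total, worker_ranks)
```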
- `epoch`, `step`, `total step`: the current epoch, step, and total number of steps.
- `loss_avg_rank`: the loss averaged across all GPUs for the current step.
- `loss/ppl/acc_avg_epoch`: the running average loss/perplexity/accuracy for the current epoch, up to the current step. At the last step of an epoch, this is the average for the whole epoch. The accuracy metric is recommended.
- `lr`: the learning rate for the current step.
- `[('loss_att', 0.259), ('acc', 0.825), ('loss_pre', 0.04), ('loss', 0.299), ('batch_size', 40)]`: the detailed values for the current GPU.
- `total_time`: the total time taken for a single step.
- `GPU, memory`: the memory used by the model and its peak, and the memory used by the model plus cache and its peak.
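
When post-processing logs, the per-GPU statistics list can be turned into a dictionary. The sketch below assumes the list has already been extracted from the log line as a string. It also checks that, for this sample, `loss` equals `loss_att + loss_pre`; whether that holds in general depends on the model's loss definition.

```python
import ast

# The per-GPU statistics exactly as they appear in the log description above.
stats_text = (
    "[('loss_att', 0.259), ('acc', 0.825), ('loss_pre', 0.04), "
    "('loss', 0.299), ('batch_size', 40)]"
)

# literal_eval safely parses Python literals (lists, tuples, numbers, strings).
stats = dict(ast.literal_eval(stats_text))

# In this sample the total loss matches the sum of its two components.
components_sum = stats["loss_att"] + stats["loss_pre"]
print(stats["acc"], stats["batch_size"], abs(components_sum - stats["loss"]) < 1e-9)
```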
Assume the trained model is saved in `./model_dir`. If a `configuration.json` file has been generated in this directory, you only need to replace the model name with the model path in the model inference method above.
If there is no `configuration.json` in the model path, you need to manually specify the exact configuration file path and the model path.
```shell
python -m funasr.bin.inference \
--config-path "${local_path}" \
--config-name "${config}" \
++init_param="${init_param}" \
++tokenizer_conf.token_list="${tokens}" \
++frontend_conf.cmvn_file="${cmvn_file}" \
++input="${input}" \
++output_dir="${output_dir}" \
++device="${device}"
```
Parameter Introduction
- `config-path`: The path to the config.yaml saved during the experiment, which can be found in the experiment's output directory.
- `config-name`: The name of the configuration file, usually config.yaml. Both YAML and JSON formats are supported, for example config.json.
- `init_param`: The model parameters to be tested, usually model.pt. You can choose a specific model file as needed.
- `tokenizer_conf.token_list`: The path to the vocabulary file, which is normally specified in config.yaml. You only need to specify it here if the path in config.yaml is incorrect.
- `frontend_conf.cmvn_file`: The CMVN (Cepstral Mean and Variance Normalization) file used when extracting fbank features from WAV files, which is normally specified in config.yaml. You only need to specify it here if the path in config.yaml is incorrect.
Other parameters are the same as mentioned above. A complete [example](https://github.com/alibaba-damo-academy/FunASR/blob/main/examples/industrial_data_pretraining/paraformer/infer_from_local.sh) can be found here.