    Improve HuBERT recipe for pre-training and fine-tuning (#2744) · 928248d7
    Zhaoheng Ni authored
    Summary:
    Follow-up to PR https://github.com/pytorch/audio/issues/2716
    - For preprocessing
      - The HuBERT features take a lot of memory and may not fit on some machines. Allow training the k-means model on a subset of the features.
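    The subsetting idea above can be sketched as follows. This is a minimal illustration, not the recipe's actual preprocessing code; the helper name and the `fraction` parameter are assumptions.

```python
import numpy as np

def sample_feature_subset(features, fraction=0.1, seed=0):
    """Randomly keep a fraction of feature frames so the k-means
    training data fits in memory.

    `features` is an (N, D) array of per-frame HuBERT features.
    This helper is illustrative; the recipe's API may differ.
    """
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    # Sample without replacement; keep at least one frame.
    idx = rng.choice(n, size=max(1, int(n * fraction)), replace=False)
    return features[idx]

# Toy example: 1000 frames of 39-dim features, keep 10%.
feats = np.random.rand(1000, 39).astype(np.float32)
subset = sample_feature_subset(feats, fraction=0.1)
print(subset.shape)  # (100, 39)
```

    The subset is then passed to the k-means trainer in place of the full feature matrix, trading a small amount of cluster quality for a large memory saving.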
    
    - For pre-training
      - Normalize the loss by the total number of masked frames across all GPUs.
      - Use mixed precision training, since pure fp16 is not well supported in pytorch_lightning.
      - Log the accuracies on masked and unmasked frames during training.
      - Clip gradients to a maximum norm of `10.0`.
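    The cross-GPU loss normalization can be sketched as below: sum the per-GPU masked-frame counts with `torch.distributed.all_reduce`, then divide the summed loss by that total. The function name is hypothetical, and the fallback for non-distributed runs is an assumption for illustration.

```python
import torch
import torch.distributed as dist

def normalize_masked_loss(loss_sum, num_masked_frames):
    """Divide the summed loss by the total number of masked frames
    across all processes (sketch; the recipe's helper may differ)."""
    total = torch.tensor(float(num_masked_frames))
    if dist.is_available() and dist.is_initialized():
        # Accumulate the masked-frame counts from every GPU.
        dist.all_reduce(total, op=dist.ReduceOp.SUM)
    # Guard against an all-unmasked batch producing a zero divisor.
    return loss_sum / total.clamp(min=1.0)

# Single-process example: 12.0 summed loss over 4 masked frames.
loss = normalize_masked_loss(torch.tensor(12.0), num_masked_frames=4)
print(loss.item())  # 3.0
```

    Normalizing by the global (rather than per-GPU) count keeps the effective loss scale independent of how many masked frames each rank happens to draw.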
    
    - For ASR fine-tuning
      - Normalize the loss by the total number of batches across all GPUs, as in TorchAudio's Conformer recipe.
      - Use mixed precision training.
      - Append "|" to the end of each transcript to capture silence/word termination, as in the fairseq recipe.
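    The transcript change amounts to appending the word-boundary token. A minimal sketch, assuming a hypothetical helper name (the recipe's preprocessing code may differ):

```python
def append_boundary(transcript: str) -> str:
    """Append the word-boundary token "|" so the model can learn
    silence/word termination at the end of an utterance.

    Illustrative helper; idempotent if the token is already present.
    """
    return transcript if transcript.endswith("|") else transcript + "|"

print(append_boundary("HELLO WORLD"))  # HELLO WORLD|
```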
    
    - Update the WER results on LibriSpeech dev and test sets.
    
    |                   | WER% (Viterbi)|  WER% (KenLM) |
    |:-----------------:|--------------:|--------------:|
    | dev-clean         |       10.9    |       4.2     |
    | dev-other         |       17.5    |       9.4     |
    | test-clean        |       10.9    |       4.4     |
    | test-other        |       17.8    |       9.5     |
    
    Pull Request resolved: https://github.com/pytorch/audio/pull/2744
    
    Reviewed By: carolineechen
    
    Differential Revision: D40282322
    
    Pulled By: nateanl
    
    fbshipit-source-id: 4723584c912e70e8970149fe09de005385eaab90