README.md

# NeMo End-to-End Speech Intent Classification and Slot Filling on SLURP Dataset

## Introduction
This example shows how to train an end-to-end model for spoken language understanding on the SLURP dataset [2]. The model is an encoder-decoder framework, where the encoder is a Conformer-large [3] model initialized from [here](https://ngc.nvidia.com/models/nvidia:nemo:stt_en_conformer_ctc_large), while the decoder is a Transformer decoder [4] that is randomly initialized. The model is trained by minimizing the negative log-likelihood (NLL) loss with teacher forcing.

## Results

We present the main results of our models in the following table.
|                                                  |                |                          | **Intent (Scenario_Action)** |               | **Entity** |        |              | **SLURP Metrics** |                     |
|--------------------------------------------------|----------------|--------------------------|------------------------------|---------------|------------|--------|--------------|-------------------|---------------------|
|                     **Model**                    | **Params (M)** |      **Pretrained**      |         **Accuracy**         | **Precision** | **Recall** | **F1** | **Precision** |     **Recall**    |        **F1**       |
| NeMo-Conformer-Transformer-Large                 | 127            | NeMo ASR-Set 3.0         |                        90.14 |         86.46 |      82.29 |  84.33 |        84.31 |             80.33 |               82.27 |
| NeMo-Conformer-Transformer-Large                 | 127            | NeMo SSL-LL60kh          |                        89.04 |         73.19 |       71.8 |  72.49 |         77.9 |             76.65 |               77.22 |
| NeMo-Conformer-Transformer-Large                 | 127            | None                     |                        72.56 |         43.19 |       43.5 |  43.34 |        53.59 |             53.92 |               53.76 |
| NeMo-Conformer-Transformer-XLarge                | 617            | NeMo SSL-LL60kh          |                        91.04 |         76.67 |      74.36 |  75.49 |        82.44 |             80.14 |               81.28 |

Note: LL60kh refers to the Libri-Light dataset [7].  

## Usage
Please install [NeMo](https://github.com/NVIDIA/NeMo) [1] before proceeding. **All following scripts are run under the current directory of this README.md file**.

### Data Preparation
1. Under the current directory, run the following script to download and process data.
```bash
python ../../../scripts/dataset_processing/process_slurp_data.py \
    --data_dir="./slurp_data" \
    --text_key="semantics" \
    --suffix="slu"
```

2. Download evaluation code:
```bash
wget https://github.com/pswietojanski/slurp/raw/master/scripts/evaluation/util.py -P eval_utils/evaluation
wget https://github.com/pswietojanski/slurp/raw/master/scripts/evaluation/metrics/distance.py -P eval_utils/evaluation/metrics
wget https://github.com/pswietojanski/slurp/raw/master/scripts/evaluation/metrics/metrics.py -P eval_utils/evaluation/metrics
```


### Building Tokenizers
1. Build the tokenizer for slu by running:
```bash
DATA_ROOT="./slurp_data"
python ../../../scripts/tokenizers/process_asr_text_tokenizer.py \
  --manifest="${DATA_ROOT}/train_slu.json,${DATA_ROOT}/train_synthetic_slu.json" \
  --data_root="${DATA_ROOT}/tokenizers_slu/" \
  --vocab_size=58 \
  --tokenizer="spe" \
  --spe_type="unigram" \
  --log \
  --spe_bos \
  --spe_eos \
  --spe_pad
```


### Training
Run with the default config that uses ASR-pretrained encoder on NeMo ASR-set 3.0. The default batch size is set to 16 for a GPU with 32GB memory, please adjust it to your own case.

```bash
DATA_DIR="./slurp_data"
EXP_NAME="slurp_conformer_transformer_large"
CUDA_VISIBLE_DEVICES=0 python run_speech_intent_slot_train.py \
    --config-path="./configs" --config-name=conformer_transformer_large_bpe \
    model.train_ds.manifest_filepath="[${DATA_DIR}/train_slu.json,${DATA_DIR}/train_synthetic_slu.json]" \
    model.validation_ds.manifest_filepath="${DATA_DIR}/devel_slu.json" \
    model.test_ds.manifest_filepath="${DATA_DIR}/test_slu.json" \
    model.tokenizer.dir="${DATA_DIR}/tokenizers_slu/tokenizer_spe_unigram_v58_pad_bos_eos" \
    model.train_ds.batch_size=16 \
    model.validation_ds.batch_size=16 \
    model.test_ds.batch_size=16 \
    trainer.devices=1 \
    trainer.max_epochs=100 \
    model.optim.sched.warmup_steps=2000 \
    exp_manager.name=$EXP_NAME \
    exp_manager.resume_if_exists=true \
    exp_manager.resume_ignore_no_checkpoint=true
```


### Evaluation
After training, we can evaluate the model by running the following script, which will first perform checkpoint averaging and then run beam search with the averaged checkpoint on the test set.
```bash
DATA_DIR="./slurp_data"
EXP_NAME="slurp_conformer_transformer_large"
CKPT_DIR="./nemo_experiments/${EXP_NAME}/checkpoints"
CKPT_AVG_DIR="../../../examples/slu/speech_intent_slot/${CKPT_DIR}"

python ../../../scripts/checkpoint_averaging/checkpoint_averaging.py $CKPT_AVG_DIR

NEMO_MODEL="${CKPT_DIR}/${EXP_NAME}-averaged.nemo"
CUDA_VISIBLE_DEVICES=0 python run_speech_intent_slot_eval.py \
    dataset_manifest="${DATA_DIR}/test_slu.json" \
    model_path=$NEMO_MODEL \
    batch_size=32 \
    num_workers=8 \
    sequence_generator.type="beam" \
    sequence_generator.beam_size=32 \
    sequence_generator.temperature=1.25 \
    only_score_manifest=false
```

### Using Encoder Finetuned on SLURP Speech Recognition
To learn how to finetune the Conformer encoder on SLURP ASR, please refer to the tutorials at 
- [Finetuning CTC models on other languages](https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/ASR_CTC_Language_Finetuning.ipynb)
- [Self-Supervised pre-training for ASR](https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Self_Supervised_Pre_Training.ipynb)


## Pretrained Models
The pretrained models and directions on how to use them are available [here](https://ngc.nvidia.com/catalog/models/nvidia:nemo:slu_conformer_transformer_large_slurp).


## Reference
[1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

[2] [SLURP: A Spoken Language Understanding Resource Package](https://arxiv.org/abs/2011.13205)

[3] [Conformer: Convolution-augmented Transformer for Speech Recognition](https://arxiv.org/abs/2005.08100)

[4] [Attention Is All You Need](https://arxiv.org/abs/1706.03762?context=cs)

[5] [Integration of Pre-trained Networks with Continuous Token Interface for End-to-End Spoken Language Understanding](https://arxiv.org/abs/2104.07253)

[6] [SpeechBrain SLURP Recipe](https://github.com/speechbrain/speechbrain/tree/develop/recipes/SLURP)

[7] [Libri-Light: A Benchmark for ASR with Limited or No Supervision](https://arxiv.org/abs/1912.07875)

## Acknowledgments
The evaluation code is borrowed from the official [SLURP package](https://github.com/pswietojanski/slurp/tree/master/scripts/evaluation), and some data processing code is adapted from [SpeechBrain SLURP Recipe](https://github.com/speechbrain/speechbrain/tree/develop/recipes/SLURP).