@@ -27,8 +27,18 @@ this should make a directory called `cnn_dm/` with files like `test.source`.
...
```
WMT16 English-Romanian Translation Data:
This dataset comes in two formats. The "packed" version merges short training examples into examples of fewer than 200 tokens, which increases GPU utilization (and also improves validation performance).
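To make the packing idea concrete, here is a rough sketch (not the repo's actual preprocessing script): it greedily concatenates consecutive source/target pairs until the source side would exceed the token budget, and it approximates token counts by whitespace splitting rather than using the model's tokenizer.

```python
# Illustrative only: greedy packing of parallel source/target lines into
# examples of at most ~max_tokens source tokens (counted by whitespace split).
def pack_examples(sources, targets, max_tokens=200):
    packed_src, packed_tgt = [], []
    cur_src, cur_tgt = [], []
    for src, tgt in zip(sources, targets):
        candidate = " ".join(cur_src + [src])
        if cur_src and len(candidate.split()) > max_tokens:
            # Adding this pair would overflow the budget: flush what we have so far.
            packed_src.append(" ".join(cur_src))
            packed_tgt.append(" ".join(cur_tgt))
            cur_src, cur_tgt = [], []
        cur_src.append(src)
        cur_tgt.append(tgt)
    if cur_src:  # flush the final, partially filled example
        packed_src.append(" ".join(cur_src))
        packed_tgt.append(" ".join(cur_tgt))
    return packed_src, packed_tgt
```

Fewer, longer examples mean fewer padding tokens per batch, which is where the GPU-utilization gain comes from.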
As you train, `output_dir` will be filled with files that look kind of like this (comments are mine).
Some of them are metrics, some are checkpoints, and some are metadata. Below is a quick way to peek at the metrics as they accumulate, followed by a tour of the directory:
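A minimal sketch, assuming only that `metrics.json` contains valid JSON (its exact schema is not spelled out in this excerpt); the `output_dir` path is a placeholder for whatever output directory you configured:

```python
import json
from pathlib import Path

# Placeholder path: substitute the output_dir you actually passed to the trainer.
metrics_path = Path("output_dir") / "metrics.json"

# The README only promises that new validation metrics are appended to this file,
# so make no assumptions about the schema and simply pretty-print whatever is there.
if metrics_path.exists():
    print(json.dumps(json.loads(metrics_path.read_text()), indent=2))
else:
    print(f"No metrics written yet at {metrics_path}")
```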
...
@@ -108,7 +133,7 @@ output_dir
│ ├── tokenizer_config.json
│ └── vocab.json
├── git_log.json # repo, branch, and commit hash
├── val_avg_rouge2=0.1984-step_count=11.ckpt # this is a pytorch lightning checkpoint associated with the best val score. (it will be called BLEU for MT)
├── metrics.json # new validation metrics will continually be appended to this
├── student # this is a huggingface checkpoint generated by SummarizationDistiller. It is the student before it gets finetuned.