Commit 62e65c41 authored by Louis Martin, committed by Facebook Github Bot

Explain the language modelling format in RoBERTa pretraining readme

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1174

Differential Revision: D17627767

Pulled By: myleott

fbshipit-source-id: 7b5f77146b8776a5967699e430136039c066c851
parent 2314979e
@@ -4,7 +4,7 @@ This tutorial will walk you through pretraining RoBERTa over your own data.
 ### 1) Preprocess the data
-Data should be preprocessed following the [language modeling format](/examples/language_model).
+Data should be preprocessed following the [language modeling format](/examples/language_model), i.e. each document should be separated by an empty line (only useful with `--sample-break-mode complete_doc`). Lines will be concatenated as a 1D text stream during training.
 We'll use the [WikiText-103 dataset](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/)
 to demonstrate how to preprocess raw text data with the GPT-2 BPE. Of course
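For illustration, a raw input file in the format described by the new README line might look like the hypothetical sample below (the sentences are made up, not taken from WikiText-103): each blank line marks a document boundary, which only matters when training with `--sample-break-mode complete_doc`, and the lines within the file are otherwise concatenated into a single 1D text stream.

```
First sentence of document one.
Second sentence of document one.

First sentence of document two.
Another line of document two.

Only line of document three.
```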