Commit 62e65c41 authored by Louis Martin, committed by Facebook Github Bot

Explain the language modelling format in RoBERTa pretraining readme

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1174

Differential Revision: D17627767

Pulled By: myleott

fbshipit-source-id: 7b5f77146b8776a5967699e430136039c066c851
parent 2314979e
@@ -4,7 +4,7 @@ This tutorial will walk you through pretraining RoBERTa over your own data.
 ### 1) Preprocess the data
-Data should be preprocessed following the [language modeling format](/examples/language_model).
+Data should be preprocessed following the [language modeling format](/examples/language_model), i.e. each document should be separated by an empty line (only useful with `--sample-break-mode complete_doc`). Lines will be concatenated as a 1D text stream during training.
 We'll use the [WikiText-103 dataset](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/)
 to demonstrate how to preprocess raw text data with the GPT-2 BPE. Of course
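For illustration, a raw input file in the format described by the new README line might look like the hypothetical sample below (the sentences are made up, not taken from WikiText-103): each blank line marks a document boundary, which only matters when training with `--sample-break-mode complete_doc`, and the lines within the file are otherwise concatenated into a single 1D text stream.

```
First sentence of document one.
Second sentence of document one.

First sentence of document two.
Another line of document two.

Only line of document three.
```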