"src/vscode:/vscode.git/clone" did not exist on "0763a7edf4e9f2992f5ec8fb0c9dca8ab3e29f07"
Commit 62e65c41 authored by Louis Martin, committed by Facebook Github Bot

Explain the language modelling format in RoBERTa pretraining readme

Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1174

Differential Revision: D17627767

Pulled By: myleott

fbshipit-source-id: 7b5f77146b8776a5967699e430136039c066c851
parent 2314979e
@@ -4,7 +4,7 @@ This tutorial will walk you through pretraining RoBERTa over your own data.
 ### 1) Preprocess the data
-Data should be preprocessed following the [language modeling format](/examples/language_model).
+Data should be preprocessed following the [language modeling format](/examples/language_model), i.e. each document should be separated by an empty line (only useful with `--sample-break-mode complete_doc`). Lines will be concatenated as a 1D text stream during training.
 We'll use the [WikiText-103 dataset](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/)
 to demonstrate how to preprocess raw text data with the GPT-2 BPE. Of course
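
For reference, a minimal sketch of the input layout the added sentence describes (the sentences below are illustrative placeholders, not part of the commit): each document's raw text is written on consecutive lines, and an empty line separates documents. The empty line only matters when training with `--sample-break-mode complete_doc`; otherwise all lines are simply concatenated into one 1D text stream.

```
First sentence of the first document.
Second sentence of the first document.

First sentence of the second document.
Second sentence of the second document.
```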