Add constraints when checking for multiple consecutive blank lines (#1031)
Summary:
The blank-line assertion causes a runtime error on some standard datasets (e.g. wikitext-103).
Details:
After preprocessing the wikitext-103 folder with the current master branch, I ran fairseq-train and got the following error:
```bash
Traceback (most recent call last):
  File "/home/trinkle/.local/bin/fairseq-train", line 11, in <module>
    load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()
  File "/data/git/Transformer/fairseq/fairseq_cli/train.py", line 321, in cli_main
    main(args)
  File "/data/git/Transformer/fairseq/fairseq_cli/train.py", line 46, in main
    task.load_dataset(valid_sub_split, combine=False, epoch=0)
  File "/data/git/Transformer/fairseq/fairseq/tasks/language_modeling.py", line 167, in load_dataset
    break_mode=self.args.sample_break_mode, include_targets=True,
  File "/data/git/Transformer/fairseq/fairseq/data/token_block_dataset.py", line 54, in __init__
    "Found multiple blank lines in the dataset, please remove them"
AssertionError: Found multiple blank lines in the dataset, please remove them (eg. cat -s raw.txt) and preprocess the data again.
```
The error occurs because these datasets contain multiple consecutive blank lines. The assertion was added in https://github.com/pytorch/fairseq/commit/851c022610b27da3beaa4e40a6834b5fb3b44f44; however, adding this assertion is not a good approach.
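For users who hit the assertion, its message suggests squeezing blank lines with `cat -s raw.txt` before preprocessing. A rough Python equivalent is sketched below. It is not part of fairseq: the helper name `squeeze_blank_lines` is hypothetical, and it assumes whitespace-only lines should count as blank, which is slightly broader than `cat -s`.

```python
from typing import Iterable, Iterator


def squeeze_blank_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield lines, collapsing each run of blank lines into a single one.

    Hypothetical helper, not a fairseq API. A line is treated as blank
    when it contains only whitespace (an assumption; `cat -s` only
    squeezes completely empty lines).
    """
    previous_blank = False
    for line in lines:
        is_blank = line.strip() == ""
        if is_blank and previous_blank:
            continue  # drop the second and later blank lines of a run
        previous_blank = is_blank
        yield line
```

It could be applied to a raw text file before running preprocessing, for example `dst.writelines(squeeze_blank_lines(src))` with `src` and `dst` being open file objects for the raw and cleaned files.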
Pull Request resolved: https://github.com/pytorch/fairseq/pull/1031
Differential Revision: D16892942
Pulled By: myleott
fbshipit-source-id: 90c41b7d98a7b78f506bb57320f9f6b901e05d5b