"tests/nn/vscode:/vscode.git/clone" did not exist on "f80b303ceea2efe225577a61985def72f8a16627"
Commit 79460d34 authored by Trinkle23897, committed by Facebook Github Bot

add constraint when checking for multiple consecutive blank lines (#1031)

Summary:
The unconditional check for multiple consecutive blank lines causes a runtime error on some standard datasets (e.g. wikitext-103).

Details:
After preprocessing the wikitext-103 folder with the current master branch, I ran fairseq-train and got the following error:
```bash
Traceback (most recent call last):
  File "/home/trinkle/.local/bin/fairseq-train", line 11, in <module>
    load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()
  File "/data/git/Transformer/fairseq/fairseq_cli/train.py", line 321, in cli_main
    main(args)
  File "/data/git/Transformer/fairseq/fairseq_cli/train.py", line 46, in main
    task.load_dataset(valid_sub_split, combine=False, epoch=0)
  File "/data/git/Transformer/fairseq/fairseq/tasks/language_modeling.py", line 167, in load_dataset
    break_mode=self.args.sample_break_mode, include_targets=True,
  File "/data/git/Transformer/fairseq/fairseq/data/token_block_dataset.py", line 54, in __init__
    "Found multiple blank lines in the dataset, please remove them"
AssertionError: Found multiple blank lines in the dataset, please remove them (eg. cat -s raw.txt) and preprocess the data again.
```
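
The assertion message suggests `cat -s raw.txt` as a workaround. For readers without that tool, the same idea can be sketched in Python. This is an illustration only: `squeeze_blank_lines` is a hypothetical helper, not part of fairseq, and it treats whitespace-only lines as blank, which is slightly more aggressive than `cat -s`.

```python
def squeeze_blank_lines(lines):
    """Yield lines, collapsing each run of blank lines into a single one.

    Hypothetical helper: a line counts as blank if it is empty or
    contains only whitespace (an assumption, stricter than `cat -s`).
    """
    previous_blank = False
    for line in lines:
        is_blank = line.strip() == ""
        if is_blank and previous_blank:
            # Skip the second and later blank lines of a run.
            continue
        previous_blank = is_blank
        yield line


if __name__ == "__main__":
    sample = ["= Title =\n", "\n", "\n", "Some text.\n", "\n", "More text.\n"]
    for out_line in squeeze_blank_lines(sample):
        print(repr(out_line))
```

The cleaned text would then need to be preprocessed again before training, as the assertion message says.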

This happens because these datasets contain multiple consecutive blank lines. The assertion was added in https://github.com/pytorch/fairseq/commit/851c022610b27da3beaa4e40a6834b5fb3b44f44; however, enforcing it unconditionally is too strict, so this change applies the check only when break_mode is 'complete_doc'.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/1031

Differential Revision: D16892942

Pulled By: myleott

fbshipit-source-id: 90c41b7d98a7b78f506bb57320f9f6b901e05d5b
parent 02cb5a43
@@ -49,7 +49,7 @@ class TokenBlockDataset(FairseqDataset):
         assert len(dataset) > 0
         sizes = np.array(sizes, dtype=int)
-        assert np.all(np.diff((sizes == document_sep_len).nonzero()) != 1),\
+        assert break_mode != 'complete_doc' or np.all(np.diff((sizes == document_sep_len).nonzero()) != 1),\
         (
             "Found multiple blank lines in the dataset, please remove them"
             " (eg. cat -s raw.txt) and preprocess the data again."
......
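
To make the changed assertion easier to follow, here is a minimal standalone sketch of the same logic. The function names are hypothetical and the default `document_sep_len=1` is an assumption for illustration; only the `np.diff` / `nonzero` expression and the `break_mode != 'complete_doc'` guard come from the diff above.

```python
import numpy as np


def has_consecutive_blank_lines(sizes, document_sep_len=1):
    """Return True if two adjacent items both have length `document_sep_len`.

    Hypothetical helper; `document_sep_len=1` is an assumed default.
    """
    sizes = np.asarray(sizes, dtype=int)
    # Indices of items whose size marks them as a document separator.
    (sep_idx,) = (sizes == document_sep_len).nonzero()
    # Two separators are adjacent when their indices differ by exactly 1.
    return bool(np.any(np.diff(sep_idx) == 1))


def check_blank_lines(sizes, break_mode, document_sep_len=1):
    """Enforce the blank-line check only for break_mode == 'complete_doc'."""
    assert break_mode != "complete_doc" or not has_consecutive_blank_lines(
        sizes, document_sep_len
    ), "Found multiple blank lines in the dataset"


if __name__ == "__main__":
    sizes = [5, 1, 1, 7]  # items 1 and 2 are adjacent separators
    check_blank_lines(sizes, break_mode="none")  # passes after the change
    try:
        check_blank_lines(sizes, break_mode="complete_doc")
    except AssertionError as err:
        print("complete_doc still rejects the data:", err)
```

With the guard in place, other break modes no longer fail on data such as wikitext-103, while 'complete_doc' keeps the stricter requirement.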