modified stories readme to include sample preprocessing code to split stories to 1k tokens

Commit 5d00e8ee authored Sep 07, 2018 by Angela Fan

Showing with 13 additions and 0 deletions
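
The trimming snippet added to the README in this commit can also be written as a standalone script. The following is a minimal sketch, not part of the commit: the helper names `trim_story` and `trim_file` are hypothetical, UTF-8 encoding is an assumption, and unlike the README snippet it skips any split file that is missing from the current directory.

```python
from pathlib import Path


def trim_story(line: str, max_words: int = 1000) -> str:
    """Keep the first max_words whitespace-separated tokens of one story."""
    return " ".join(line.split()[:max_words])


def trim_file(path: Path, max_words: int = 1000) -> None:
    """Trim every story (one per line) in path, rewriting the file in place."""
    # Read and trim everything first, so the file is closed before it is reopened for writing.
    with path.open(encoding="utf-8") as f:  # encoding is an assumption
        stories = [trim_story(line, max_words) for line in f]
    with path.open("w", encoding="utf-8") as f:
        for story in stories:
            f.write(story + "\n")


if __name__ == "__main__":
    # Same split names and file suffix as the README snippet.
    for split in ("train", "test", "valid"):
        target = Path(split + ".wp_target")
        if target.exists():  # assumption: silently skip missing splits
            trim_file(target)
```

Like the README snippet, this overwrites the `.wp_target` files in place, so keep a copy of the full dataset if the untrimmed stories are still needed.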

examples/stories/README.md +13 -0

--- a/examples/stories/README.md
+++ b/examples/stories/README.md
@@ -13,6 +13,19 @@ and contains a train, test, and valid split. The dataset is described here: http

 Example usage:
 ```
+# Preprocess the dataset:
+# Note that the dataset release is the full data, but the paper models the first 1000 words of each story
+# Here is some example code that can trim the dataset to the first 1000 words of each story
+$ python
+>>> data = ["train", "test", "valid"]
+>>> for name in data:
+...   with open(name + ".wp_target") as f:
+...     stories = f.readlines()
+...   stories = [" ".join(i.split()[0:1000]) for i in stories]
+...   with open(name + ".wp_target", "w") as o:
+...     for line in stories:
+...       o.write(line.strip() + "\n")
+
 # Binarize the dataset:
 $ TEXT=examples/stories/writingPrompts
 $ python preprocess.py --source-lang wp_source --target-lang wp_target \