README

The included data set contains 137,362 aligned sentences extracted by pairing Simple English Wikipedia with English Wikipedia.  A complete description of the extraction process can be found in "Simple English Wikipedia: A New Simplification Task", William Coster and David Kauchak (2011).  In Proceedings of ACL (short paper).  The data set contains those sentences with a similarity above 0.50.  Higher precision alignments may be obtained by TF-IDF thresholding at higher levels.

Two files are included: wiki.normal and wiki.simple.  Each file contains 137,362 lines and corresponds to a sentence.  The nth line/sentence in wiki.normal corresponds to the nth line/sentence in wiki.simple.  Some minimal tokenization has been done to treat most punctuation characters as separate words/tokens.

For questions regarding the data set set, contact David Kauchak at Pomona College.