{"commonsense":{"description":"The ETHICS dataset is a benchmark that spans concepts in justice, well-being,\nduties, virtues, and commonsense morality. Models predict widespread moral\njudgments about diverse text scenarios. This requires connecting physical and\nsocial world knowledge to value judgements, a capability that may enable us\nto steer chatbot outputs or eventually regularize open-ended reinforcement\nlearning agents.\n\nThe Commonsense subset contains examples focusing on moral standards and principles that most people intuitively accept.","citation":"@article{hendrycks2021ethics\n title={Aligning AI With Shared Human Values},\n author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},\n journal={Proceedings of the International Conference on Learning Representations (ICLR)},\n year={2021}\n}\n","homepage":"https://github.com/hendrycks/ethics","license":"","features":{"label":{"dtype":"int32","id":null,"_type":"Value"},"input":{"dtype":"string","id":null,"_type":"Value"},"is_short":{"dtype":"bool","id":null,"_type":"Value"},"edited":{"dtype":"bool","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"hendrycks_ethics","config_name":"commonsense","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"train":{"name":"train","num_bytes":14435215,"num_examples":13910,"dataset_name":"hendrycks_ethics"},"test":{"name":"test","num_bytes":3150094,"num_examples":3885,"dataset_name":"hendrycks_ethics"}},"download_checksums":{"https://people.eecs.berkeley.edu/~hendrycks/ethics.tar":{"num_bytes":35585024,"checksum":"40acbf1ac0da79a2aabef394d58889136b8d38b05be09482006de2453fb06333"}},"download_size":35585024,"post_processing_size":null,"dataset_size":17585309,"size_in_bytes":53170333},"deontology":{"description":"The ETHICS dataset is a benchmark that spans concepts in justice, well-being,\nduties, 
virtues, and commonsense morality. Models predict widespread moral\njudgments about diverse text scenarios. This requires connecting physical and\nsocial world knowledge to value judgements, a capability that may enable us\nto steer chatbot outputs or eventually regularize open-ended reinforcement\nlearning agents.\n\nThe Deontology subset contains examples focusing on whether an act is required, permitted, or forbidden according to a set of rules or constraints","citation":"@article{hendrycks2021ethics\n title={Aligning AI With Shared Human Values},\n author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},\n journal={Proceedings of the International Conference on Learning Representations (ICLR)},\n year={2021}\n}\n","homepage":"https://github.com/hendrycks/ethics","license":"","features":{"group_id":{"dtype":"int32","id":null,"_type":"Value"},"label":{"dtype":"int32","id":null,"_type":"Value"},"scenario":{"dtype":"string","id":null,"_type":"Value"},"excuse":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"hendrycks_ethics","config_name":"deontology","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"train":{"name":"train","num_bytes":1931475,"num_examples":18164,"dataset_name":"hendrycks_ethics"},"test":{"name":"test","num_bytes":384602,"num_examples":3596,"dataset_name":"hendrycks_ethics"}},"download_checksums":{"https://people.eecs.berkeley.edu/~hendrycks/ethics.tar":{"num_bytes":35585024,"checksum":"40acbf1ac0da79a2aabef394d58889136b8d38b05be09482006de2453fb06333"}},"download_size":35585024,"post_processing_size":null,"dataset_size":2316077,"size_in_bytes":37901101},"justice":{"description":"The ETHICS dataset is a benchmark that spans concepts in justice, well-being,\nduties, virtues, and commonsense morality. 
Models predict widespread moral\njudgments about diverse text scenarios. This requires connecting physical and\nsocial world knowledge to value judgements, a capability that may enable us\nto steer chatbot outputs or eventually regularize open-ended reinforcement\nlearning agents.\n\nThe Justice subset contains examples focusing on how a character treats another person","citation":"@article{hendrycks2021ethics\n title={Aligning AI With Shared Human Values},\n author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},\n journal={Proceedings of the International Conference on Learning Representations (ICLR)},\n year={2021}\n}\n","homepage":"https://github.com/hendrycks/ethics","license":"","features":{"group_id":{"dtype":"int32","id":null,"_type":"Value"},"label":{"dtype":"int32","id":null,"_type":"Value"},"scenario":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"hendrycks_ethics","config_name":"justice","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"train":{"name":"train","num_bytes":2516501,"num_examples":21791,"dataset_name":"hendrycks_ethics"},"test":{"name":"test","num_bytes":309427,"num_examples":2704,"dataset_name":"hendrycks_ethics"}},"download_checksums":{"https://people.eecs.berkeley.edu/~hendrycks/ethics.tar":{"num_bytes":35585024,"checksum":"40acbf1ac0da79a2aabef394d58889136b8d38b05be09482006de2453fb06333"}},"download_size":35585024,"post_processing_size":null,"dataset_size":2825928,"size_in_bytes":38410952},"utilitarianism":{"description":"The ETHICS dataset is a benchmark that spans concepts in justice, well-being,\nduties, virtues, and commonsense morality. Models predict widespread moral\njudgments about diverse text scenarios. 
This requires connecting physical and\nsocial world knowledge to value judgements, a capability that may enable us\nto steer chatbot outputs or eventually regularize open-ended reinforcement\nlearning agents.\n\nThe Utilitarianism subset contains scenarios that should be ranked from most pleasant to least pleasant for the person in the scenario","citation":"@article{hendrycks2021ethics\n title={Aligning AI With Shared Human Values},\n author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},\n journal={Proceedings of the International Conference on Learning Representations (ICLR)},\n year={2021}\n}\n","homepage":"https://github.com/hendrycks/ethics","license":"","features":{"activity":{"dtype":"string","id":null,"_type":"Value"},"baseline":{"dtype":"string","id":null,"_type":"Value"},"rating":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"hendrycks_ethics","config_name":"utilitarianism","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"train":{"name":"train","num_bytes":2241770,"num_examples":13738,"dataset_name":"hendrycks_ethics"},"test":{"name":"test","num_bytes":749768,"num_examples":4808,"dataset_name":"hendrycks_ethics"}},"download_checksums":{"https://people.eecs.berkeley.edu/~hendrycks/ethics.tar":{"num_bytes":35585024,"checksum":"40acbf1ac0da79a2aabef394d58889136b8d38b05be09482006de2453fb06333"}},"download_size":35585024,"post_processing_size":null,"dataset_size":2991538,"size_in_bytes":38576562},"virtue":{"description":"The ETHICS dataset is a benchmark that spans concepts in justice, well-being,\nduties, virtues, and commonsense morality. Models predict widespread moral\njudgments about diverse text scenarios. 
This requires connecting physical and\nsocial world knowledge to value judgements, a capability that may enable us\nto steer chatbot outputs or eventually regularize open-ended reinforcement\nlearning agents.\n\nThe Virtue subset contains scenarios focusing on whether virtues or vices are being exemplified","citation":"@article{hendrycks2021ethics\n title={Aligning AI With Shared Human Values},\n author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},\n journal={Proceedings of the International Conference on Learning Representations (ICLR)},\n year={2021}\n}\n","homepage":"https://github.com/hendrycks/ethics","license":"","features":{"group_id":{"dtype":"int32","id":null,"_type":"Value"},"label":{"dtype":"int32","id":null,"_type":"Value"},"scenario":{"dtype":"string","id":null,"_type":"Value"},"trait":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"hendrycks_ethics","config_name":"virtue","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"train":{"name":"train","num_bytes":2640328,"num_examples":28245,"dataset_name":"hendrycks_ethics"},"test":{"name":"test","num_bytes":473473,"num_examples":4975,"dataset_name":"hendrycks_ethics"}},"download_checksums":{"https://people.eecs.berkeley.edu/~hendrycks/ethics.tar":{"num_bytes":35585024,"checksum":"40acbf1ac0da79a2aabef394d58889136b8d38b05be09482006de2453fb06333"}},"download_size":35585024,"post_processing_size":null,"dataset_size":3113801,"size_in_bytes":38698825}}
\ No newline at end of file
{"commonsense":{"description":"The ETHICS dataset is a benchmark that spans concepts in justice, well-being,\nduties, virtues, and commonsense morality. Models predict widespread moral\njudgments about diverse text scenarios. This requires connecting physical and\nsocial world knowledge to value judgements, a capability that may enable us\nto steer chatbot outputs or eventually regularize open-ended reinforcement\nlearning agents.\n\nThe Commonsense subset contains examples focusing on moral standards and principles that most people intuitively accept.","citation":"@article{hendrycks2021ethics\n title={Aligning AI With Shared Human Values},\n author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},\n journal={Proceedings of the International Conference on Learning Representations (ICLR)},\n year={2021}\n}\n","homepage":"https://github.com/hendrycks/ethics","license":"","features":{"label":{"dtype":"int32","id":null,"_type":"Value"},"input":{"dtype":"string","id":null,"_type":"Value"},"is_short":{"dtype":"bool","id":null,"_type":"Value"},"edited":{"dtype":"bool","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"hendrycks_ethics","config_name":"commonsense","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"train":{"name":"train","num_bytes":14435215,"num_examples":13910,"dataset_name":"hendrycks_ethics"},"test":{"name":"test","num_bytes":3150094,"num_examples":3885,"dataset_name":"hendrycks_ethics"}},"download_checksums":{"https://people.eecs.berkeley.edu/~hendrycks/ethics.tar":{"num_bytes":35585024,"checksum":"40acbf1ac0da79a2aabef394d58889136b8d38b05be09482006de2453fb06333"}},"download_size":35585024,"post_processing_size":null,"dataset_size":17585309,"size_in_bytes":53170333},"deontology":{"description":"The ETHICS dataset is a benchmark that spans concepts in justice, well-being,\nduties, 
virtues, and commonsense morality. Models predict widespread moral\njudgments about diverse text scenarios. This requires connecting physical and\nsocial world knowledge to value judgements, a capability that may enable us\nto steer chatbot outputs or eventually regularize open-ended reinforcement\nlearning agents.\n\nThe Deontology subset contains examples focusing on whether an act is required, permitted, or forbidden according to a set of rules or constraints","citation":"@article{hendrycks2021ethics\n title={Aligning AI With Shared Human Values},\n author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},\n journal={Proceedings of the International Conference on Learning Representations (ICLR)},\n year={2021}\n}\n","homepage":"https://github.com/hendrycks/ethics","license":"","features":{"group_id":{"dtype":"int32","id":null,"_type":"Value"},"label":{"dtype":"int32","id":null,"_type":"Value"},"scenario":{"dtype":"string","id":null,"_type":"Value"},"excuse":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"hendrycks_ethics","config_name":"deontology","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"train":{"name":"train","num_bytes":1931475,"num_examples":18164,"dataset_name":"hendrycks_ethics"},"test":{"name":"test","num_bytes":384602,"num_examples":3596,"dataset_name":"hendrycks_ethics"}},"download_checksums":{"https://people.eecs.berkeley.edu/~hendrycks/ethics.tar":{"num_bytes":35585024,"checksum":"40acbf1ac0da79a2aabef394d58889136b8d38b05be09482006de2453fb06333"}},"download_size":35585024,"post_processing_size":null,"dataset_size":2316077,"size_in_bytes":37901101},"justice":{"description":"The ETHICS dataset is a benchmark that spans concepts in justice, well-being,\nduties, virtues, and commonsense morality. 
Models predict widespread moral\njudgments about diverse text scenarios. This requires connecting physical and\nsocial world knowledge to value judgements, a capability that may enable us\nto steer chatbot outputs or eventually regularize open-ended reinforcement\nlearning agents.\n\nThe Justice subset contains examples focusing on how a character treats another person","citation":"@article{hendrycks2021ethics\n title={Aligning AI With Shared Human Values},\n author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},\n journal={Proceedings of the International Conference on Learning Representations (ICLR)},\n year={2021}\n}\n","homepage":"https://github.com/hendrycks/ethics","license":"","features":{"group_id":{"dtype":"int32","id":null,"_type":"Value"},"label":{"dtype":"int32","id":null,"_type":"Value"},"scenario":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"hendrycks_ethics","config_name":"justice","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"train":{"name":"train","num_bytes":2516501,"num_examples":21791,"dataset_name":"hendrycks_ethics"},"test":{"name":"test","num_bytes":309427,"num_examples":2704,"dataset_name":"hendrycks_ethics"}},"download_checksums":{"https://people.eecs.berkeley.edu/~hendrycks/ethics.tar":{"num_bytes":35585024,"checksum":"40acbf1ac0da79a2aabef394d58889136b8d38b05be09482006de2453fb06333"}},"download_size":35585024,"post_processing_size":null,"dataset_size":2825928,"size_in_bytes":38410952},"utilitarianism":{"description":"The ETHICS dataset is a benchmark that spans concepts in justice, well-being,\nduties, virtues, and commonsense morality. Models predict widespread moral\njudgments about diverse text scenarios. 
This requires connecting physical and\nsocial world knowledge to value judgements, a capability that may enable us\nto steer chatbot outputs or eventually regularize open-ended reinforcement\nlearning agents.\n\nThe Utilitarianism subset contains scenarios that should be ranked from most pleasant to least pleasant for the person in the scenario","citation":"@article{hendrycks2021ethics\n title={Aligning AI With Shared Human Values},\n author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},\n journal={Proceedings of the International Conference on Learning Representations (ICLR)},\n year={2021}\n}\n","homepage":"https://github.com/hendrycks/ethics","license":"","features":{"activity":{"dtype":"string","id":null,"_type":"Value"},"baseline":{"dtype":"string","id":null,"_type":"Value"},"rating":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"hendrycks_ethics","config_name":"utilitarianism","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"train":{"name":"train","num_bytes":2241770,"num_examples":13738,"dataset_name":"hendrycks_ethics"},"test":{"name":"test","num_bytes":749768,"num_examples":4808,"dataset_name":"hendrycks_ethics"}},"download_checksums":{"https://people.eecs.berkeley.edu/~hendrycks/ethics.tar":{"num_bytes":35585024,"checksum":"40acbf1ac0da79a2aabef394d58889136b8d38b05be09482006de2453fb06333"}},"download_size":35585024,"post_processing_size":null,"dataset_size":2991538,"size_in_bytes":38576562},"virtue":{"description":"The ETHICS dataset is a benchmark that spans concepts in justice, well-being,\nduties, virtues, and commonsense morality. Models predict widespread moral\njudgments about diverse text scenarios. 
This requires connecting physical and\nsocial world knowledge to value judgements, a capability that may enable us\nto steer chatbot outputs or eventually regularize open-ended reinforcement\nlearning agents.\n\nThe Virtue subset contains scenarios focusing on whether virtues or vices are being exemplified","citation":"@article{hendrycks2021ethics\n title={Aligning AI With Shared Human Values},\n author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},\n journal={Proceedings of the International Conference on Learning Representations (ICLR)},\n year={2021}\n}\n","homepage":"https://github.com/hendrycks/ethics","license":"","features":{"group_id":{"dtype":"int32","id":null,"_type":"Value"},"label":{"dtype":"int32","id":null,"_type":"Value"},"scenario":{"dtype":"string","id":null,"_type":"Value"},"trait":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"hendrycks_ethics","config_name":"virtue","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"train":{"name":"train","num_bytes":2640328,"num_examples":28245,"dataset_name":"hendrycks_ethics"},"test":{"name":"test","num_bytes":473473,"num_examples":4975,"dataset_name":"hendrycks_ethics"}},"download_checksums":{"https://people.eecs.berkeley.edu/~hendrycks/ethics.tar":{"num_bytes":35585024,"checksum":"40acbf1ac0da79a2aabef394d58889136b8d38b05be09482006de2453fb06333"}},"download_size":35585024,"post_processing_size":null,"dataset_size":3113801,"size_in_bytes":38698825}}
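The size fields in the ETHICS metadata above appear to be related arithmetically: for each config, `dataset_size` equals the sum of the `train` and `test` `num_bytes`, and `size_in_bytes` equals `dataset_size` plus the shared `download_size`. The sketch below checks that relationship using figures copied from the JSON. The relationship is inferred from the numbers shown here, not from a documented guarantee, and the names `CONFIGS` and `check_sizes` are hypothetical helpers, not part of the loader.

```python
# Hypothetical consistency check; figures are copied from the metadata above.
DOWNLOAD_SIZE = 35585024  # num_bytes of the shared ethics.tar archive

# config name -> (train num_bytes, test num_bytes, dataset_size, size_in_bytes)
CONFIGS = {
    "commonsense": (14435215, 3150094, 17585309, 53170333),
    "deontology": (1931475, 384602, 2316077, 37901101),
    "justice": (2516501, 309427, 2825928, 38410952),
    "utilitarianism": (2241770, 749768, 2991538, 38576562),
    "virtue": (2640328, 473473, 3113801, 38698825),
}


def check_sizes(configs, download_size):
    """Return the names of configs whose size fields do not add up."""
    inconsistent = []
    for name, (train, test, dataset_size, size_in_bytes) in configs.items():
        splits_ok = train + test == dataset_size
        total_ok = dataset_size + download_size == size_in_bytes
        if not (splits_ok and total_ok):
            inconsistent.append(name)
    return inconsistent


if __name__ == "__main__":
    bad = check_sizes(CONFIGS, DOWNLOAD_SIZE)
    print("inconsistent configs:", bad)
```

A check like this could be run after regenerating the metadata file to catch a hand-edited or stale entry.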
@@ -71,54 +71,64 @@ class HendrycksEthics(datasets.GeneratorBasedBuilder):
         EthicsConfig(
             name="commonsense",
             prefix="cm",
-            features=datasets.Features({
-                "label":datasets.Value("int32"),
-                "input":datasets.Value("string"),
-                "is_short":datasets.Value("bool"),
-                "edited":datasets.Value("bool"),
-            }),
-            description="The Commonsense subset contains examples focusing on moral standards and principles that most people intuitively accept."
+            features=datasets.Features(
+                {
+                    "label":datasets.Value("int32"),
+                    "input":datasets.Value("string"),
+                    "is_short":datasets.Value("bool"),
+                    "edited":datasets.Value("bool"),
+                }
+            ),
+            description="The Commonsense subset contains examples focusing on moral standards and principles that most people intuitively accept.",
         ),
         EthicsConfig(
             name="deontology",
             prefix="deontology",
-            features=datasets.Features({
-                "group_id":datasets.Value("int32"),
-                "label":datasets.Value("int32"),
-                "scenario":datasets.Value("string"),
-                "excuse":datasets.Value("string"),
-            }),
+            features=datasets.Features(
+                {
+                    "group_id":datasets.Value("int32"),
+                    "label":datasets.Value("int32"),
+                    "scenario":datasets.Value("string"),
+                    "excuse":datasets.Value("string"),
+                }
+            ),
             description="The Deontology subset contains examples focusing on whether an act is required, permitted, or forbidden according to a set of rules or constraints",
         ),
         EthicsConfig(
             name="justice",
             prefix="justice",
-            features=datasets.Features({
-                "group_id":datasets.Value("int32"),
-                "label":datasets.Value("int32"),
-                "scenario":datasets.Value("string"),
-            }),
+            features=datasets.Features(
+                {
+                    "group_id":datasets.Value("int32"),
+                    "label":datasets.Value("int32"),
+                    "scenario":datasets.Value("string"),
+                }
+            ),
             description="The Justice subset contains examples focusing on how a character treats another person",
         ),
         EthicsConfig(
             name="utilitarianism",
             prefix="util",
-            features=datasets.Features({
-                "activity":datasets.Value("string"),
-                "baseline":datasets.Value("string"),
-                "rating":datasets.Value("string"),# Empty rating.
-            }),
+            features=datasets.Features(
+                {
+                    "activity":datasets.Value("string"),
+                    "baseline":datasets.Value("string"),
+                    "rating":datasets.Value("string"),# Empty rating.
+                }
+            ),
             description="The Utilitarianism subset contains scenarios that should be ranked from most pleasant to least pleasant for the person in the scenario",
         ),
         EthicsConfig(
             name="virtue",
             prefix="virtue",
-            features=datasets.Features({
-                "group_id":datasets.Value("int32"),
-                "label":datasets.Value("int32"),
-                "scenario":datasets.Value("string"),
-                "trait":datasets.Value("string"),
-            }),
+            features=datasets.Features(
+                {
+                    "group_id":datasets.Value("int32"),
+                    "label":datasets.Value("int32"),
+                    "scenario":datasets.Value("string"),
+                    "trait":datasets.Value("string"),
+                }
+            ),
             description="The Virtue subset contains scenarios focusing on whether virtues or vices are being exemplified",
         ),
     ]
...
...
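The per-config feature schemas in the hunk above can be read as a column-name to dtype mapping. As a minimal sketch of how such a schema constrains a row, the following plain-Python check mirrors those mappings without importing `datasets`. The names `FEATURES`, `PY_TYPES`, and `validate_row` are hypothetical, and the mapping from dtype strings to Python types is an assumption made for illustration, not the library's own validation logic.

```python
# Hypothetical plain-Python mirror of the feature schemas shown in the diff.
FEATURES = {
    "commonsense": {"label": "int32", "input": "string",
                    "is_short": "bool", "edited": "bool"},
    "deontology": {"group_id": "int32", "label": "int32",
                   "scenario": "string", "excuse": "string"},
    "justice": {"group_id": "int32", "label": "int32", "scenario": "string"},
    "utilitarianism": {"activity": "string", "baseline": "string",
                       "rating": "string"},
    "virtue": {"group_id": "int32", "label": "int32",
               "scenario": "string", "trait": "string"},
}

# Assumed mapping from dtype strings to Python types (illustrative only).
PY_TYPES = {"int32": int, "string": str, "bool": bool}


def validate_row(config_name, row):
    """Return a list of human-readable problems; an empty list means the row fits."""
    schema = FEATURES[config_name]
    problems = []
    for column, dtype in schema.items():
        if column not in row:
            problems.append(f"missing column: {column}")
            continue
        value = row[column]
        type_ok = isinstance(value, PY_TYPES[dtype])
        # bool is a subclass of int in Python, so reject it for int32 columns.
        if dtype == "int32" and isinstance(value, bool):
            type_ok = False
        if not type_ok:
            problems.append(f"{column}: expected {dtype}, got {type(value).__name__}")
    for column in row:
        if column not in schema:
            problems.append(f"unexpected column: {column}")
    return problems


if __name__ == "__main__":
    row = {"group_id": 0, "label": 1, "scenario": "example text"}
    print(validate_row("justice", row))
```

Note that the utilitarianism config stores `rating` as a string, which the inline comment in the diff marks as an empty rating, so a row with `"rating": ""` passes this check.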
@@ -140,7 +150,12 @@ class HendrycksEthics(datasets.GeneratorBasedBuilder):
name=datasets.Split.TRAIN,
# These kwargs will be passed to _generate_examples
{"algebra":{"description":"MATH is a dataset of 12,500 challenging competition mathematics problems. Each\nproblem in Math has a full step-by-step solution which can be used to teach\nmodels to generate answer derivations and explanations.\n","citation":"@article{hendrycksmath2021,\n title={Measuring Mathematical Problem Solving With the Math Dataset},\n author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},\n journal={NeurIPS},\n year={2021}\n}\n","homepage":"https://github.com/hendrycks/math","license":"","features":{"problem":{"dtype":"string","id":null,"_type":"Value"},"level":{"dtype":"string","id":null,"_type":"Value"},"type":{"dtype":"string","id":null,"_type":"Value"},"solution":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"hendrycks_math","config_name":"algebra","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"train":{"name":"train","num_bytes":955021,"num_examples":1744,"dataset_name":"hendrycks_math"},"test":{"name":"test","num_bytes":648291,"num_examples":1187,"dataset_name":"hendrycks_math"}},"download_checksums":{"https://people.eecs.berkeley.edu/~hendrycks/MATH.tar":{"num_bytes":20327936,"checksum":"0fbe4fad0df66942db6c221cdcc95b298cc7f4595a2f0f518360cce84e90d9ac"}},"download_size":20327936,"post_processing_size":null,"dataset_size":1603312,"size_in_bytes":21931248},"counting_and_probability":{"description":"MATH is a dataset of 12,500 challenging competition mathematics problems. 
Each\nproblem in Math has a full step-by-step solution which can be used to teach\nmodels to generate answer derivations and explanations.\n","citation":"@article{hendrycksmath2021,\n title={Measuring Mathematical Problem Solving With the Math Dataset},\n author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},\n journal={NeurIPS},\n year={2021}\n}\n","homepage":"https://github.com/hendrycks/math","license":"","features":{"problem":{"dtype":"string","id":null,"_type":"Value"},"level":{"dtype":"string","id":null,"_type":"Value"},"type":{"dtype":"string","id":null,"_type":"Value"},"solution":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"hendrycks_math","config_name":"counting_and_probability","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"train":{"name":"train","num_bytes":667385,"num_examples":771,"dataset_name":"hendrycks_math"},"test":{"name":"test","num_bytes":353803,"num_examples":474,"dataset_name":"hendrycks_math"}},"download_checksums":{"https://people.eecs.berkeley.edu/~hendrycks/MATH.tar":{"num_bytes":20327936,"checksum":"0fbe4fad0df66942db6c221cdcc95b298cc7f4595a2f0f518360cce84e90d9ac"}},"download_size":20327936,"post_processing_size":null,"dataset_size":1021188,"size_in_bytes":21349124},"geometry":{"description":"MATH is a dataset of 12,500 challenging competition mathematics problems. 
Each\nproblem in Math has a full step-by-step solution which can be used to teach\nmodels to generate answer derivations and explanations.\n","citation":"@article{hendrycksmath2021,\n title={Measuring Mathematical Problem Solving With the Math Dataset},\n author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},\n journal={NeurIPS},\n year={2021}\n}\n","homepage":"https://github.com/hendrycks/math","license":"","features":{"problem":{"dtype":"string","id":null,"_type":"Value"},"level":{"dtype":"string","id":null,"_type":"Value"},"type":{"dtype":"string","id":null,"_type":"Value"},"solution":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"hendrycks_math","config_name":"geometry","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"train":{"name":"train","num_bytes":1077241,"num_examples":870,"dataset_name":"hendrycks_math"},"test":{"name":"test","num_bytes":523126,"num_examples":479,"dataset_name":"hendrycks_math"}},"download_checksums":{"https://people.eecs.berkeley.edu/~hendrycks/MATH.tar":{"num_bytes":20327936,"checksum":"0fbe4fad0df66942db6c221cdcc95b298cc7f4595a2f0f518360cce84e90d9ac"}},"download_size":20327936,"post_processing_size":null,"dataset_size":1600367,"size_in_bytes":21928303},"intermediate_algebra":{"description":"MATH is a dataset of 12,500 challenging competition mathematics problems. 
Each\nproblem in Math has a full step-by-step solution which can be used to teach\nmodels to generate answer derivations and explanations.\n","citation":"@article{hendrycksmath2021,\n title={Measuring Mathematical Problem Solving With the Math Dataset},\n author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},\n journal={NeurIPS},\n year={2021}\n}\n","homepage":"https://github.com/hendrycks/math","license":"","features":{"problem":{"dtype":"string","id":null,"_type":"Value"},"level":{"dtype":"string","id":null,"_type":"Value"},"type":{"dtype":"string","id":null,"_type":"Value"},"solution":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"hendrycks_math","config_name":"intermediate_algebra","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"train":{"name":"train","num_bytes":1157476,"num_examples":1295,"dataset_name":"hendrycks_math"},"test":{"name":"test","num_bytes":795070,"num_examples":903,"dataset_name":"hendrycks_math"}},"download_checksums":{"https://people.eecs.berkeley.edu/~hendrycks/MATH.tar":{"num_bytes":20327936,"checksum":"0fbe4fad0df66942db6c221cdcc95b298cc7f4595a2f0f518360cce84e90d9ac"}},"download_size":20327936,"post_processing_size":null,"dataset_size":1952546,"size_in_bytes":22280482},"number_theory":{"description":"MATH is a dataset of 12,500 challenging competition mathematics problems. 
Each\nproblem in Math has a full step-by-step solution which can be used to teach\nmodels to generate answer derivations and explanations.\n","citation":"@article{hendrycksmath2021,\n title={Measuring Mathematical Problem Solving With the Math Dataset},\n author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},\n journal={NeurIPS},\n year={2021}\n}\n","homepage":"https://github.com/hendrycks/math","license":"","features":{"problem":{"dtype":"string","id":null,"_type":"Value"},"level":{"dtype":"string","id":null,"_type":"Value"},"type":{"dtype":"string","id":null,"_type":"Value"},"solution":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"hendrycks_math","config_name":"number_theory","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"train":{"name":"train","num_bytes":595793,"num_examples":869,"dataset_name":"hendrycks_math"},"test":{"name":"test","num_bytes":349455,"num_examples":540,"dataset_name":"hendrycks_math"}},"download_checksums":{"https://people.eecs.berkeley.edu/~hendrycks/MATH.tar":{"num_bytes":20327936,"checksum":"0fbe4fad0df66942db6c221cdcc95b298cc7f4595a2f0f518360cce84e90d9ac"}},"download_size":20327936,"post_processing_size":null,"dataset_size":945248,"size_in_bytes":21273184},"prealgebra":{"description":"MATH is a dataset of 12,500 challenging competition mathematics problems. 
Each\nproblem in Math has a full step-by-step solution which can be used to teach\nmodels to generate answer derivations and explanations.\n","citation":"@article{hendrycksmath2021,\n title={Measuring Mathematical Problem Solving With the Math Dataset},\n author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},\n journal={NeurIPS},\n year={2021}\n}\n","homepage":"https://github.com/hendrycks/math","license":"","features":{"problem":{"dtype":"string","id":null,"_type":"Value"},"level":{"dtype":"string","id":null,"_type":"Value"},"type":{"dtype":"string","id":null,"_type":"Value"},"solution":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"hendrycks_math","config_name":"prealgebra","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"train":{"name":"train","num_bytes":715611,"num_examples":1205,"dataset_name":"hendrycks_math"},"test":{"name":"test","num_bytes":510195,"num_examples":871,"dataset_name":"hendrycks_math"}},"download_checksums":{"https://people.eecs.berkeley.edu/~hendrycks/MATH.tar":{"num_bytes":20327936,"checksum":"0fbe4fad0df66942db6c221cdcc95b298cc7f4595a2f0f518360cce84e90d9ac"}},"download_size":20327936,"post_processing_size":null,"dataset_size":1225806,"size_in_bytes":21553742},"precalculus":{"description":"MATH is a dataset of 12,500 challenging competition mathematics problems. 
Each\nproblem in Math has a full step-by-step solution which can be used to teach\nmodels to generate answer derivations and explanations.\n","citation":"@article{hendrycksmath2021,\n title={Measuring Mathematical Problem Solving With the Math Dataset},\n author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},\n journal={NeurIPS},\n year={2021}\n}\n","homepage":"https://github.com/hendrycks/math","license":"","features":{"problem":{"dtype":"string","id":null,"_type":"Value"},"level":{"dtype":"string","id":null,"_type":"Value"},"type":{"dtype":"string","id":null,"_type":"Value"},"solution":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"hendrycks_math","config_name":"precalculus","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"train":{"name":"train","num_bytes":816245,"num_examples":746,"dataset_name":"hendrycks_math"},"test":{"name":"test","num_bytes":552893,"num_examples":546,"dataset_name":"hendrycks_math"}},"download_checksums":{"https://people.eecs.berkeley.edu/~hendrycks/MATH.tar":{"num_bytes":20327936,"checksum":"0fbe4fad0df66942db6c221cdcc95b298cc7f4595a2f0f518360cce84e90d9ac"}},"download_size":20327936,"post_processing_size":null,"dataset_size":1369138,"size_in_bytes":21697074}}
\ No newline at end of file
{"algebra":{"description":"MATH is a dataset of 12,500 challenging competition mathematics problems. Each\nproblem in Math has a full step-by-step solution which can be used to teach\nmodels to generate answer derivations and explanations.\n","citation":"@article{hendrycksmath2021,\n title={Measuring Mathematical Problem Solving With the Math Dataset},\n author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},\n journal={NeurIPS},\n year={2021}\n}\n","homepage":"https://github.com/hendrycks/math","license":"","features":{"problem":{"dtype":"string","id":null,"_type":"Value"},"level":{"dtype":"string","id":null,"_type":"Value"},"type":{"dtype":"string","id":null,"_type":"Value"},"solution":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"hendrycks_math","config_name":"algebra","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"train":{"name":"train","num_bytes":955021,"num_examples":1744,"dataset_name":"hendrycks_math"},"test":{"name":"test","num_bytes":648291,"num_examples":1187,"dataset_name":"hendrycks_math"}},"download_checksums":{"https://people.eecs.berkeley.edu/~hendrycks/MATH.tar":{"num_bytes":20327936,"checksum":"0fbe4fad0df66942db6c221cdcc95b298cc7f4595a2f0f518360cce84e90d9ac"}},"download_size":20327936,"post_processing_size":null,"dataset_size":1603312,"size_in_bytes":21931248},"counting_and_probability":{"description":"MATH is a dataset of 12,500 challenging competition mathematics problems. 
Each\nproblem in Math has a full step-by-step solution which can be used to teach\nmodels to generate answer derivations and explanations.\n","citation":"@article{hendrycksmath2021,\n title={Measuring Mathematical Problem Solving With the Math Dataset},\n author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},\n journal={NeurIPS},\n year={2021}\n}\n","homepage":"https://github.com/hendrycks/math","license":"","features":{"problem":{"dtype":"string","id":null,"_type":"Value"},"level":{"dtype":"string","id":null,"_type":"Value"},"type":{"dtype":"string","id":null,"_type":"Value"},"solution":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"hendrycks_math","config_name":"counting_and_probability","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"train":{"name":"train","num_bytes":667385,"num_examples":771,"dataset_name":"hendrycks_math"},"test":{"name":"test","num_bytes":353803,"num_examples":474,"dataset_name":"hendrycks_math"}},"download_checksums":{"https://people.eecs.berkeley.edu/~hendrycks/MATH.tar":{"num_bytes":20327936,"checksum":"0fbe4fad0df66942db6c221cdcc95b298cc7f4595a2f0f518360cce84e90d9ac"}},"download_size":20327936,"post_processing_size":null,"dataset_size":1021188,"size_in_bytes":21349124},"geometry":{"description":"MATH is a dataset of 12,500 challenging competition mathematics problems. 
Each\nproblem in Math has a full step-by-step solution which can be used to teach\nmodels to generate answer derivations and explanations.\n","citation":"@article{hendrycksmath2021,\n title={Measuring Mathematical Problem Solving With the Math Dataset},\n author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},\n journal={NeurIPS},\n year={2021}\n}\n","homepage":"https://github.com/hendrycks/math","license":"","features":{"problem":{"dtype":"string","id":null,"_type":"Value"},"level":{"dtype":"string","id":null,"_type":"Value"},"type":{"dtype":"string","id":null,"_type":"Value"},"solution":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"hendrycks_math","config_name":"geometry","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"train":{"name":"train","num_bytes":1077241,"num_examples":870,"dataset_name":"hendrycks_math"},"test":{"name":"test","num_bytes":523126,"num_examples":479,"dataset_name":"hendrycks_math"}},"download_checksums":{"https://people.eecs.berkeley.edu/~hendrycks/MATH.tar":{"num_bytes":20327936,"checksum":"0fbe4fad0df66942db6c221cdcc95b298cc7f4595a2f0f518360cce84e90d9ac"}},"download_size":20327936,"post_processing_size":null,"dataset_size":1600367,"size_in_bytes":21928303},"intermediate_algebra":{"description":"MATH is a dataset of 12,500 challenging competition mathematics problems. 
Each\nproblem in Math has a full step-by-step solution which can be used to teach\nmodels to generate answer derivations and explanations.\n","citation":"@article{hendrycksmath2021,\n title={Measuring Mathematical Problem Solving With the Math Dataset},\n author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},\n journal={NeurIPS},\n year={2021}\n}\n","homepage":"https://github.com/hendrycks/math","license":"","features":{"problem":{"dtype":"string","id":null,"_type":"Value"},"level":{"dtype":"string","id":null,"_type":"Value"},"type":{"dtype":"string","id":null,"_type":"Value"},"solution":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"hendrycks_math","config_name":"intermediate_algebra","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"train":{"name":"train","num_bytes":1157476,"num_examples":1295,"dataset_name":"hendrycks_math"},"test":{"name":"test","num_bytes":795070,"num_examples":903,"dataset_name":"hendrycks_math"}},"download_checksums":{"https://people.eecs.berkeley.edu/~hendrycks/MATH.tar":{"num_bytes":20327936,"checksum":"0fbe4fad0df66942db6c221cdcc95b298cc7f4595a2f0f518360cce84e90d9ac"}},"download_size":20327936,"post_processing_size":null,"dataset_size":1952546,"size_in_bytes":22280482},"number_theory":{"description":"MATH is a dataset of 12,500 challenging competition mathematics problems. 
Each\nproblem in Math has a full step-by-step solution which can be used to teach\nmodels to generate answer derivations and explanations.\n","citation":"@article{hendrycksmath2021,\n title={Measuring Mathematical Problem Solving With the Math Dataset},\n author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},\n journal={NeurIPS},\n year={2021}\n}\n","homepage":"https://github.com/hendrycks/math","license":"","features":{"problem":{"dtype":"string","id":null,"_type":"Value"},"level":{"dtype":"string","id":null,"_type":"Value"},"type":{"dtype":"string","id":null,"_type":"Value"},"solution":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"hendrycks_math","config_name":"number_theory","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"train":{"name":"train","num_bytes":595793,"num_examples":869,"dataset_name":"hendrycks_math"},"test":{"name":"test","num_bytes":349455,"num_examples":540,"dataset_name":"hendrycks_math"}},"download_checksums":{"https://people.eecs.berkeley.edu/~hendrycks/MATH.tar":{"num_bytes":20327936,"checksum":"0fbe4fad0df66942db6c221cdcc95b298cc7f4595a2f0f518360cce84e90d9ac"}},"download_size":20327936,"post_processing_size":null,"dataset_size":945248,"size_in_bytes":21273184},"prealgebra":{"description":"MATH is a dataset of 12,500 challenging competition mathematics problems. 
Each\nproblem in Math has a full step-by-step solution which can be used to teach\nmodels to generate answer derivations and explanations.\n","citation":"@article{hendrycksmath2021,\n title={Measuring Mathematical Problem Solving With the Math Dataset},\n author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},\n journal={NeurIPS},\n year={2021}\n}\n","homepage":"https://github.com/hendrycks/math","license":"","features":{"problem":{"dtype":"string","id":null,"_type":"Value"},"level":{"dtype":"string","id":null,"_type":"Value"},"type":{"dtype":"string","id":null,"_type":"Value"},"solution":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"hendrycks_math","config_name":"prealgebra","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"train":{"name":"train","num_bytes":715611,"num_examples":1205,"dataset_name":"hendrycks_math"},"test":{"name":"test","num_bytes":510195,"num_examples":871,"dataset_name":"hendrycks_math"}},"download_checksums":{"https://people.eecs.berkeley.edu/~hendrycks/MATH.tar":{"num_bytes":20327936,"checksum":"0fbe4fad0df66942db6c221cdcc95b298cc7f4595a2f0f518360cce84e90d9ac"}},"download_size":20327936,"post_processing_size":null,"dataset_size":1225806,"size_in_bytes":21553742},"precalculus":{"description":"MATH is a dataset of 12,500 challenging competition mathematics problems. 
Each\nproblem in Math has a full step-by-step solution which can be used to teach\nmodels to generate answer derivations and explanations.\n","citation":"@article{hendrycksmath2021,\n title={Measuring Mathematical Problem Solving With the Math Dataset},\n author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},\n journal={NeurIPS},\n year={2021}\n}\n","homepage":"https://github.com/hendrycks/math","license":"","features":{"problem":{"dtype":"string","id":null,"_type":"Value"},"level":{"dtype":"string","id":null,"_type":"Value"},"type":{"dtype":"string","id":null,"_type":"Value"},"solution":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"hendrycks_math","config_name":"precalculus","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"train":{"name":"train","num_bytes":816245,"num_examples":746,"dataset_name":"hendrycks_math"},"test":{"name":"test","num_bytes":552893,"num_examples":546,"dataset_name":"hendrycks_math"}},"download_checksums":{"https://people.eecs.berkeley.edu/~hendrycks/MATH.tar":{"num_bytes":20327936,"checksum":"0fbe4fad0df66942db6c221cdcc95b298cc7f4595a2f0f518360cce84e90d9ac"}},"download_size":20327936,"post_processing_size":null,"dataset_size":1369138,"size_in_bytes":21697074}}
{"original":{"description":"LAMBADA is a dataset to evaluate the capabilities of computational models for text\nunderstanding by means of a word prediction task. LAMBADA is a collection of narrative\ntexts sharing the characteristic that human subjects are able to guess their last\nword if they are exposed to the whole text, but not if they only see the last\nsentence preceding the target word. To succeed on LAMBADA, computational models\ncannot simply rely on local context, but must be able to keep track of information\nin the broader discourse.\n\nThe LAMBADA dataset","citation":"@misc{\n author={Paperno, Denis and Kruszewski, Germ\u00e1n and Lazaridou, Angeliki and Pham, Quan Ngoc and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fern\u00e1ndez, Raquel}, \n title={The LAMBADA dataset},\n DOI={10.5281/zenodo.2630551},\n publisher={Zenodo},\n year={2016},\n month={Aug}\n}\n","homepage":"https://zenodo.org/record/2630551#.X4Xzn5NKjUI","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"lambada","config_name":"original","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"validation":{"name":"validation","num_bytes":1709449,"num_examples":5153,"dataset_name":"lambada"}},"download_checksums":{"http://eaidata.bmk.sh/data/lambada_test.jsonl":{"num_bytes":1819752,"checksum":"4aa8d02cd17c719165fc8a7887fddd641f43fcafa4b1c806ca8abc31fabdb226"}},"download_size":1819752,"post_processing_size":null,"dataset_size":1709449,"size_in_bytes":3529201},"en":{"description":"LAMBADA is a dataset to evaluate the capabilities of computational models for text\nunderstanding by means of a word prediction task. 
LAMBADA is a collection of narrative\ntexts sharing the characteristic that human subjects are able to guess their last\nword if they are exposed to the whole text, but not if they only see the last\nsentence preceding the target word. To succeed on LAMBADA, computational models\ncannot simply rely on local context, but must be able to keep track of information\nin the broader discourse.\n\nThe English translated LAMBADA dataset","citation":"@misc{\n author={Paperno, Denis and Kruszewski, Germ\u00e1n and Lazaridou, Angeliki and Pham, Quan Ngoc and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fern\u00e1ndez, Raquel}, \n title={The LAMBADA dataset},\n DOI={10.5281/zenodo.2630551},\n publisher={Zenodo},\n year={2016},\n month={Aug}\n}\n","homepage":"https://zenodo.org/record/2630551#.X4Xzn5NKjUI","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"lambada","config_name":"en","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"validation":{"name":"validation","num_bytes":1709449,"num_examples":5153,"dataset_name":"lambada"}},"download_checksums":{"http://eaidata.bmk.sh/data/lambada_test_en.jsonl":{"num_bytes":1819752,"checksum":"4aa8d02cd17c719165fc8a7887fddd641f43fcafa4b1c806ca8abc31fabdb226"}},"download_size":1819752,"post_processing_size":null,"dataset_size":1709449,"size_in_bytes":3529201},"fr":{"description":"LAMBADA is a dataset to evaluate the capabilities of computational models for text\nunderstanding by means of a word prediction task. LAMBADA is a collection of narrative\ntexts sharing the characteristic that human subjects are able to guess their last\nword if they are exposed to the whole text, but not if they only see the last\nsentence preceding the target word. 
To succeed on LAMBADA, computational models\ncannot simply rely on local context, but must be able to keep track of information\nin the broader discourse.\n\nThe French translated LAMBADA dataset","citation":"@misc{\n author={Paperno, Denis and Kruszewski, Germ\u00e1n and Lazaridou, Angeliki and Pham, Quan Ngoc and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fern\u00e1ndez, Raquel}, \n title={The LAMBADA dataset},\n DOI={10.5281/zenodo.2630551},\n publisher={Zenodo},\n year={2016},\n month={Aug}\n}\n","homepage":"https://zenodo.org/record/2630551#.X4Xzn5NKjUI","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"lambada","config_name":"fr","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"validation":{"name":"validation","num_bytes":1948795,"num_examples":5153,"dataset_name":"lambada"}},"download_checksums":{"http://eaidata.bmk.sh/data/lambada_test_fr.jsonl":{"num_bytes":2028703,"checksum":"941ec6a73dba7dc91c860bf493eb66a527cd430148827a4753a4535a046bf362"}},"download_size":2028703,"post_processing_size":null,"dataset_size":1948795,"size_in_bytes":3977498},"de":{"description":"LAMBADA is a dataset to evaluate the capabilities of computational models for text\nunderstanding by means of a word prediction task. LAMBADA is a collection of narrative\ntexts sharing the characteristic that human subjects are able to guess their last\nword if they are exposed to the whole text, but not if they only see the last\nsentence preceding the target word. 
To succeed on LAMBADA, computational models\ncannot simply rely on local context, but must be able to keep track of information\nin the broader discourse.\n\nThe German translated LAMBADA dataset","citation":"@misc{\n author={Paperno, Denis and Kruszewski, Germ\u00e1n and Lazaridou, Angeliki and Pham, Quan Ngoc and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fern\u00e1ndez, Raquel}, \n title={The LAMBADA dataset},\n DOI={10.5281/zenodo.2630551},\n publisher={Zenodo},\n year={2016},\n month={Aug}\n}\n","homepage":"https://zenodo.org/record/2630551#.X4Xzn5NKjUI","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"lambada","config_name":"de","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"validation":{"name":"validation","num_bytes":1904576,"num_examples":5153,"dataset_name":"lambada"}},"download_checksums":{"http://eaidata.bmk.sh/data/lambada_test_de.jsonl":{"num_bytes":1985231,"checksum":"51c6c1795894c46e88e4c104b5667f488efe79081fb34d746b82b8caa663865e"}},"download_size":1985231,"post_processing_size":null,"dataset_size":1904576,"size_in_bytes":3889807},"it":{"description":"LAMBADA is a dataset to evaluate the capabilities of computational models for text\nunderstanding by means of a word prediction task. LAMBADA is a collection of narrative\ntexts sharing the characteristic that human subjects are able to guess their last\nword if they are exposed to the whole text, but not if they only see the last\nsentence preceding the target word. 
To succeed on LAMBADA, computational models\ncannot simply rely on local context, but must be able to keep track of information\nin the broader discourse.\n\nThe Italian translated LAMBADA dataset","citation":"@misc{\n author={Paperno, Denis and Kruszewski, Germ\u00e1n and Lazaridou, Angeliki and Pham, Quan Ngoc and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fern\u00e1ndez, Raquel}, \n title={The LAMBADA dataset},\n DOI={10.5281/zenodo.2630551},\n publisher={Zenodo},\n year={2016},\n month={Aug}\n}\n","homepage":"https://zenodo.org/record/2630551#.X4Xzn5NKjUI","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"lambada","config_name":"it","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"validation":{"name":"validation","num_bytes":1813420,"num_examples":5153,"dataset_name":"lambada"}},"download_checksums":{"http://eaidata.bmk.sh/data/lambada_test_it.jsonl":{"num_bytes":1894613,"checksum":"86654237716702ab74f42855ae5a78455c1b0e50054a4593fb9c6fcf7fad0850"}},"download_size":1894613,"post_processing_size":null,"dataset_size":1813420,"size_in_bytes":3708033},"es":{"description":"LAMBADA is a dataset to evaluate the capabilities of computational models for text\nunderstanding by means of a word prediction task. LAMBADA is a collection of narrative\ntexts sharing the characteristic that human subjects are able to guess their last\nword if they are exposed to the whole text, but not if they only see the last\nsentence preceding the target word. 
To succeed on LAMBADA, computational models\ncannot simply rely on local context, but must be able to keep track of information\nin the broader discourse.\n\nThe Spanish translated LAMBADA dataset","citation":"@misc{\n author={Paperno, Denis and Kruszewski, Germ\u00e1n and Lazaridou, Angeliki and Pham, Quan Ngoc and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fern\u00e1ndez, Raquel}, \n title={The LAMBADA dataset},\n DOI={10.5281/zenodo.2630551},\n publisher={Zenodo},\n year={2016},\n month={Aug}\n}\n","homepage":"https://zenodo.org/record/2630551#.X4Xzn5NKjUI","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"lambada","config_name":"es","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"validation":{"name":"validation","num_bytes":1821735,"num_examples":5153,"dataset_name":"lambada"}},"download_checksums":{"http://eaidata.bmk.sh/data/lambada_test_es.jsonl":{"num_bytes":1902349,"checksum":"ffd760026c647fb43c67ce1bc56fd527937304b348712dce33190ea6caba6f9c"}},"download_size":1902349,"post_processing_size":null,"dataset_size":1821735,"size_in_bytes":3724084}}
\ No newline at end of file
{"original":{"description":"LAMBADA is a dataset to evaluate the capabilities of computational models for text\nunderstanding by means of a word prediction task. LAMBADA is a collection of narrative\ntexts sharing the characteristic that human subjects are able to guess their last\nword if they are exposed to the whole text, but not if they only see the last\nsentence preceding the target word. To succeed on LAMBADA, computational models\ncannot simply rely on local context, but must be able to keep track of information\nin the broader discourse.\n\nThe LAMBADA dataset","citation":"@misc{\n author={Paperno, Denis and Kruszewski, Germ\u00e1n and Lazaridou, Angeliki and Pham, Quan Ngoc and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fern\u00e1ndez, Raquel}, \n title={The LAMBADA dataset},\n DOI={10.5281/zenodo.2630551},\n publisher={Zenodo},\n year={2016},\n month={Aug}\n}\n","homepage":"https://zenodo.org/record/2630551#.X4Xzn5NKjUI","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"lambada","config_name":"original","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"validation":{"name":"validation","num_bytes":1709449,"num_examples":5153,"dataset_name":"lambada"}},"download_checksums":{"http://eaidata.bmk.sh/data/lambada_test.jsonl":{"num_bytes":1819752,"checksum":"4aa8d02cd17c719165fc8a7887fddd641f43fcafa4b1c806ca8abc31fabdb226"}},"download_size":1819752,"post_processing_size":null,"dataset_size":1709449,"size_in_bytes":3529201},"en":{"description":"LAMBADA is a dataset to evaluate the capabilities of computational models for text\nunderstanding by means of a word prediction task. 
LAMBADA is a collection of narrative\ntexts sharing the characteristic that human subjects are able to guess their last\nword if they are exposed to the whole text, but not if they only see the last\nsentence preceding the target word. To succeed on LAMBADA, computational models\ncannot simply rely on local context, but must be able to keep track of information\nin the broader discourse.\n\nThe English translated LAMBADA dataset","citation":"@misc{\n author={Paperno, Denis and Kruszewski, Germ\u00e1n and Lazaridou, Angeliki and Pham, Quan Ngoc and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fern\u00e1ndez, Raquel}, \n title={The LAMBADA dataset},\n DOI={10.5281/zenodo.2630551},\n publisher={Zenodo},\n year={2016},\n month={Aug}\n}\n","homepage":"https://zenodo.org/record/2630551#.X4Xzn5NKjUI","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"lambada","config_name":"en","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"validation":{"name":"validation","num_bytes":1709449,"num_examples":5153,"dataset_name":"lambada"}},"download_checksums":{"http://eaidata.bmk.sh/data/lambada_test_en.jsonl":{"num_bytes":1819752,"checksum":"4aa8d02cd17c719165fc8a7887fddd641f43fcafa4b1c806ca8abc31fabdb226"}},"download_size":1819752,"post_processing_size":null,"dataset_size":1709449,"size_in_bytes":3529201},"fr":{"description":"LAMBADA is a dataset to evaluate the capabilities of computational models for text\nunderstanding by means of a word prediction task. LAMBADA is a collection of narrative\ntexts sharing the characteristic that human subjects are able to guess their last\nword if they are exposed to the whole text, but not if they only see the last\nsentence preceding the target word. 
To succeed on LAMBADA, computational models\ncannot simply rely on local context, but must be able to keep track of information\nin the broader discourse.\n\nThe French translated LAMBADA dataset","citation":"@misc{\n author={Paperno, Denis and Kruszewski, Germ\u00e1n and Lazaridou, Angeliki and Pham, Quan Ngoc and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fern\u00e1ndez, Raquel}, \n title={The LAMBADA dataset},\n DOI={10.5281/zenodo.2630551},\n publisher={Zenodo},\n year={2016},\n month={Aug}\n}\n","homepage":"https://zenodo.org/record/2630551#.X4Xzn5NKjUI","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"lambada","config_name":"fr","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"validation":{"name":"validation","num_bytes":1948795,"num_examples":5153,"dataset_name":"lambada"}},"download_checksums":{"http://eaidata.bmk.sh/data/lambada_test_fr.jsonl":{"num_bytes":2028703,"checksum":"941ec6a73dba7dc91c860bf493eb66a527cd430148827a4753a4535a046bf362"}},"download_size":2028703,"post_processing_size":null,"dataset_size":1948795,"size_in_bytes":3977498},"de":{"description":"LAMBADA is a dataset to evaluate the capabilities of computational models for text\nunderstanding by means of a word prediction task. LAMBADA is a collection of narrative\ntexts sharing the characteristic that human subjects are able to guess their last\nword if they are exposed to the whole text, but not if they only see the last\nsentence preceding the target word. 
To succeed on LAMBADA, computational models\ncannot simply rely on local context, but must be able to keep track of information\nin the broader discourse.\n\nThe German translated LAMBADA dataset","citation":"@misc{\n author={Paperno, Denis and Kruszewski, Germ\u00e1n and Lazaridou, Angeliki and Pham, Quan Ngoc and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fern\u00e1ndez, Raquel}, \n title={The LAMBADA dataset},\n DOI={10.5281/zenodo.2630551},\n publisher={Zenodo},\n year={2016},\n month={Aug}\n}\n","homepage":"https://zenodo.org/record/2630551#.X4Xzn5NKjUI","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"lambada","config_name":"de","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"validation":{"name":"validation","num_bytes":1904576,"num_examples":5153,"dataset_name":"lambada"}},"download_checksums":{"http://eaidata.bmk.sh/data/lambada_test_de.jsonl":{"num_bytes":1985231,"checksum":"51c6c1795894c46e88e4c104b5667f488efe79081fb34d746b82b8caa663865e"}},"download_size":1985231,"post_processing_size":null,"dataset_size":1904576,"size_in_bytes":3889807},"it":{"description":"LAMBADA is a dataset to evaluate the capabilities of computational models for text\nunderstanding by means of a word prediction task. LAMBADA is a collection of narrative\ntexts sharing the characteristic that human subjects are able to guess their last\nword if they are exposed to the whole text, but not if they only see the last\nsentence preceding the target word. 
To succeed on LAMBADA, computational models\ncannot simply rely on local context, but must be able to keep track of information\nin the broader discourse.\n\nThe Italian translated LAMBADA dataset","citation":"@misc{\n author={Paperno, Denis and Kruszewski, Germ\u00e1n and Lazaridou, Angeliki and Pham, Quan Ngoc and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fern\u00e1ndez, Raquel}, \n title={The LAMBADA dataset},\n DOI={10.5281/zenodo.2630551},\n publisher={Zenodo},\n year={2016},\n month={Aug}\n}\n","homepage":"https://zenodo.org/record/2630551#.X4Xzn5NKjUI","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"lambada","config_name":"it","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"validation":{"name":"validation","num_bytes":1813420,"num_examples":5153,"dataset_name":"lambada"}},"download_checksums":{"http://eaidata.bmk.sh/data/lambada_test_it.jsonl":{"num_bytes":1894613,"checksum":"86654237716702ab74f42855ae5a78455c1b0e50054a4593fb9c6fcf7fad0850"}},"download_size":1894613,"post_processing_size":null,"dataset_size":1813420,"size_in_bytes":3708033},"es":{"description":"LAMBADA is a dataset to evaluate the capabilities of computational models for text\nunderstanding by means of a word prediction task. LAMBADA is a collection of narrative\ntexts sharing the characteristic that human subjects are able to guess their last\nword if they are exposed to the whole text, but not if they only see the last\nsentence preceding the target word. 
To succeed on LAMBADA, computational models\ncannot simply rely on local context, but must be able to keep track of information\nin the broader discourse.\n\nThe Spanish translated LAMBADA dataset","citation":"@misc{\n author={Paperno, Denis and Kruszewski, Germ\u00e1n and Lazaridou, Angeliki and Pham, Quan Ngoc and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fern\u00e1ndez, Raquel}, \n title={The LAMBADA dataset},\n DOI={10.5281/zenodo.2630551},\n publisher={Zenodo},\n year={2016},\n month={Aug}\n}\n","homepage":"https://zenodo.org/record/2630551#.X4Xzn5NKjUI","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"lambada","config_name":"es","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"validation":{"name":"validation","num_bytes":1821735,"num_examples":5153,"dataset_name":"lambada"}},"download_checksums":{"http://eaidata.bmk.sh/data/lambada_test_es.jsonl":{"num_bytes":1902349,"checksum":"ffd760026c647fb43c67ce1bc56fd527937304b348712dce33190ea6caba6f9c"}},"download_size":1902349,"post_processing_size":null,"dataset_size":1821735,"size_in_bytes":3724084}}
author={Paperno, Denis and Kruszewski, Germán and Lazaridou, Angeliki and Pham, Quan Ngoc and Bernardi, Raffaella and Pezzelle, Sandro and Baroni, Marco and Boleda, Gemma and Fernández, Raquel},
title={The LAMBADA dataset},
DOI={10.5281/zenodo.2630551},
publisher={Zenodo},
...
...
@@ -62,12 +62,34 @@ class Lambada(datasets.GeneratorBasedBuilder):
{"logiqa":{"description":"LogiQA is a dataset for testing human logical reasoning. It consists of 8,678 QA\ninstances, covering multiple types of deductive reasoning. Results show that state-\nof-the-art neural models perform by far worse than human ceiling. The dataset can\nalso serve as a benchmark for reinvestigating logical AI under the deep learning\nNLP setting.\n","citation":"@misc{liu2020logiqa,\n title={LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning}, \n author={Jian Liu and Leyang Cui and Hanmeng Liu and Dandan Huang and Yile Wang and Yue Zhang},\n year={2020},\n eprint={2007.08124},\n archivePrefix={arXiv},\n primaryClass={cs.CL}\n}\n","homepage":"https://github.com/lgw863/LogiQA-dataset","license":"","features":{"label":{"dtype":"string","id":null,"_type":"Value"},"context":{"dtype":"string","id":null,"_type":"Value"},"question":{"dtype":"string","id":null,"_type":"Value"},"options":{"feature":{"dtype":"string","id":null,"_type":"Value"},"length":-1,"id":null,"_type":"Sequence"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"logiqa","config_name":"logiqa","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"train":{"name":"train","num_bytes":6419852,"num_examples":7376,"dataset_name":"logiqa"},"test":{"name":"test","num_bytes":571705,"num_examples":651,"dataset_name":"logiqa"},"validation":{"name":"validation","num_bytes":562437,"num_examples":651,"dataset_name":"logiqa"}},"download_checksums":{"https://raw.githubusercontent.com/lgw863/LogiQA-dataset/master/Train.txt":{"num_bytes":6281272,"checksum":"7d5bb1f58278e33b395744cd2ad8d7600faa0b3c4d615c659a44ec1181d759fa"},"https://raw.githubusercontent.com/lgw863/LogiQA-dataset/master/Test.txt":{"num_bytes":559060,"checksum":"359acb78c37802208f7fde9e2f6574b8526527c63d6a336f90a53f1932cb4701"},"https://raw.githubusercontent.com/lgw863/LogiQA-dataset/master/Eval.txt":{"num_bytes":550021,"checksum":"
4c49e6753b7262c001506b9151135abf722247035ab075dad93acdea5789c01f"}},"download_size":7390353,"post_processing_size":null,"dataset_size":7553994,"size_in_bytes":14944347}}
{"mutual":{"description":"MuTual is a retrieval-based dataset for multi-turn dialogue reasoning, which is\nmodified from Chinese high school English listening comprehension test data.\n\nThe MuTual dataset.","citation":"@inproceedings{mutual,\n title = \"MuTual: A Dataset for Multi-Turn Dialogue Reasoning\",\n author = \"Cui, Leyang and Wu, Yu and Liu, Shujie and Zhang, Yue and Zhou, Ming\" ,\n booktitle = \"Proceedings of the 58th Conference of the Association for Computational Linguistics\",\n year = \"2020\",\n publisher = \"Association for Computational Linguistics\",\n}\n","homepage":"https://github.com/Nealcly/MuTual","license":"","features":{"answers":{"dtype":"string","id":null,"_type":"Value"},"options":{"feature":{"dtype":"string","id":null,"_type":"Value"},"length":-1,"id":null,"_type":"Sequence"},"article":{"dtype":"string","id":null,"_type":"Value"},"id":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"mutual","config_name":"mutual","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"train":{"name":"train","num_bytes":5141602,"num_examples":7088,"dataset_name":"mutual"},"test":{"name":"test","num_bytes":634396,"num_examples":886,"dataset_name":"mutual"},"validation":{"name":"validation","num_bytes":624271,"num_examples":886,"dataset_name":"mutual"}},"download_checksums":{"https://github.com/Nealcly/MuTual/archive/master.zip":{"num_bytes":10997878,"checksum":"bb325cf6c672f0f02699993a37138b0fa0af6fcfc77ec81dfbe46add4d7b29f9"}},"download_size":10997878,"post_processing_size":null,"dataset_size":6400269,"size_in_bytes":17398147},"mutual_plus":{"description":"MuTual is a retrieval-based dataset for multi-turn dialogue reasoning, which is\nmodified from Chinese high school English listening comprehension test data.\n\nMuTualPlus is a more difficult MuTual that replaces positive responses with a safe 
response.","citation":"@inproceedings{mutual,\n title = \"MuTual: A Dataset for Multi-Turn Dialogue Reasoning\",\n author = \"Cui, Leyang and Wu, Yu and Liu, Shujie and Zhang, Yue and Zhou, Ming\" ,\n booktitle = \"Proceedings of the 58th Conference of the Association for Computational Linguistics\",\n year = \"2020\",\n publisher = \"Association for Computational Linguistics\",\n}\n","homepage":"https://github.com/Nealcly/MuTual","license":"","features":{"answers":{"dtype":"string","id":null,"_type":"Value"},"options":{"feature":{"dtype":"string","id":null,"_type":"Value"},"length":-1,"id":null,"_type":"Sequence"},"article":{"dtype":"string","id":null,"_type":"Value"},"id":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"mutual","config_name":"mutual_plus","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"train":{"name":"train","num_bytes":4921179,"num_examples":7088,"dataset_name":"mutual"},"test":{"name":"test","num_bytes":606620,"num_examples":886,"dataset_name":"mutual"},"validation":{"name":"validation","num_bytes":597340,"num_examples":886,"dataset_name":"mutual"}},"download_checksums":{"https://github.com/Nealcly/MuTual/archive/master.zip":{"num_bytes":10997878,"checksum":"bb325cf6c672f0f02699993a37138b0fa0af6fcfc77ec81dfbe46add4d7b29f9"}},"download_size":10997878,"post_processing_size":null,"dataset_size":6125139,"size_in_bytes":17123017}}
datasets.BuilderConfig(name="mutual_plus",version=VERSION,description="MuTualPlus is a more difficult MuTual that replaces positive responses with a safe response."),
{"pile_arxiv":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nArXiv","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_arxiv","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":113218251,"num_examples":2407,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":115653720,"num_examples":2434,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":228871971,"size_in_bytes":1160030307},"pile_books3":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nBooks3","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_books3","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":150095743,"num_examples":269,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":177359876,"num_examples":301,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":327455619,"size_in_bytes":1258613955},"pile_bookcorpus2":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nBookCorpus2","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_bookcorpus2","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":9680652,"num_examples":28,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":9776271,"num_examples":26,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":19456923,"size_in_bytes":950615259},"pile_dm-mathematics":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nDM Mathematics","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_dm-mathematics","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":15756556,"num_examples":1922,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":16453386,"num_examples":2007,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":32209942,"size_in_bytes":963368278},"pile_enron":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nEnron Emails","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_enron","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":1638859,"num_examples":1010,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":1556487,"num_examples":947,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":3195346,"size_in_bytes":934353682},"pile_europarl":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nEuroParl","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_europarl","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":8789652,"num_examples":157,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":9111791,"num_examples":133,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":17901443,"size_in_bytes":949059779},"pile_freelaw":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nFreeLaw","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_freelaw","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":80808693,"num_examples":5101,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":80363814,"num_examples":5094,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":161172507,"size_in_bytes":1092330843},"pile_github":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nGithub","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_github","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":95654706,"num_examples":18195,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":97179576,"num_examples":18337,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":192834282,"size_in_bytes":1123992618},"pile_gutenberg":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nGutenberg (PG-19)","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_gutenberg","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":30243176,"num_examples":80,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":24685980,"num_examples":60,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":54929156,"size_in_bytes":986087492},"pile_hackernews":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nHackerNews","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_hackernews","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":8124255,"num_examples":1632,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":9803822,"num_examples":1619,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":17928077,"size_in_bytes":949086413},"pile_nih-exporter":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nNIH ExPorter","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_nih-exporter","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":3928804,"num_examples":1884,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":3927967,"num_examples":1825,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":7856771,"size_in_bytes":939015107},"pile_opensubtitles":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nOpenSubtitles","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_opensubtitles","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":21008996,"num_examples":642,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":19622904,"num_examples":621,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":40631900,"size_in_bytes":971790236},"pile_openwebtext2":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nOpenWebText2","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_openwebtext2","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":128624303,"num_examples":32925,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":131554302,"num_examples":33400,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":260178605,"size_in_bytes":1191336941},"pile_philpapers":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nPhilPapers","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_philpapers","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":5090158,"num_examples":68,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":6499078,"num_examples":64,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":11589236,"size_in_bytes":942747572},"pile_pile-cc":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nPile-CC","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_pile-cc","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":235004043,"num_examples":52790,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":233535650,"num_examples":52792,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":468539693,"size_in_bytes":1399698029},"pile_pubmed-abstracts":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nPubMed Abstracts","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_pubmed-abstracts","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":39908950,"num_examples":29895,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":40008336,"num_examples":29871,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":79917286,"size_in_bytes":1011075622},"pile_pubmed-central":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nPubMed Central","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_pubmed-central","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":187251519,"num_examples":5911,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":184791818,"num_examples":5977,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":372043337,"size_in_bytes":1303201673},"pile_stackexchange":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nStackExchange","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_stackexchange","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":66441557,"num_examples":30378,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":66011397,"num_examples":29950,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":132452954,"size_in_bytes":1063611290},"pile_upsto":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nUSPTO Backgrounds","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_upsto","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":47345405,"num_examples":11415,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":48122320,"num_examples":11387,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":95467725,"size_in_bytes":1026626061},"pile_ubuntu-irc":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nUbuntu IRC","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_ubuntu-irc","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":5694218,"num_examples":22,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":7410104,"num_examples":21,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":13104322,"size_in_bytes":944262658},"pile_wikipedia":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nWikipedia (en)","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_wikipedia","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":52166968,"num_examples":17511,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":53186137,"num_examples":17478,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":105353105,"size_in_bytes":1036511441},"pile_youtubesubtitles":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nYoutubeSubtitles","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_youtubesubtitles","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":7377448,"num_examples":342,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":8937546,"num_examples":326,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":16314994,"size_in_bytes":947473330}}
{"pile_arxiv":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nArXiv","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_arxiv","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":113218251,"num_examples":2407,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":115653720,"num_examples":2434,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":228871971,"size_in_bytes":1160030307},"pile_books3":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nBooks3","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_books3","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":150095743,"num_examples":269,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":177359876,"num_examples":301,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":327455619,"size_in_bytes":1258613955},"pile_bookcorpus2":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nBookCorpus2","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_bookcorpus2","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":9680652,"num_examples":28,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":9776271,"num_examples":26,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":19456923,"size_in_bytes":950615259},"pile_dm-mathematics":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nDM Mathematics","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_dm-mathematics","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":15756556,"num_examples":1922,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":16453386,"num_examples":2007,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":32209942,"size_in_bytes":963368278},"pile_enron":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nEnron Emails","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_enron","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":1638859,"num_examples":1010,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":1556487,"num_examples":947,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":3195346,"size_in_bytes":934353682},"pile_europarl":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nEuroParl","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_europarl","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":8789652,"num_examples":157,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":9111791,"num_examples":133,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":17901443,"size_in_bytes":949059779},"pile_freelaw":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nFreeLaw","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_freelaw","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":80808693,"num_examples":5101,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":80363814,"num_examples":5094,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":161172507,"size_in_bytes":1092330843},"pile_github":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nGithub","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_github","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":95654706,"num_examples":18195,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":97179576,"num_examples":18337,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":192834282,"size_in_bytes":1123992618},"pile_gutenberg":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nGutenberg (PG-19)","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_gutenberg","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":30243176,"num_examples":80,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":24685980,"num_examples":60,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":54929156,"size_in_bytes":986087492},"pile_hackernews":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nHackerNews","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_hackernews","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":8124255,"num_examples":1632,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":9803822,"num_examples":1619,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":17928077,"size_in_bytes":949086413},"pile_nih-exporter":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nNIH ExPorter","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_nih-exporter","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":3928804,"num_examples":1884,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":3927967,"num_examples":1825,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":7856771,"size_in_bytes":939015107},"pile_opensubtitles":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nOpenSubtitles","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_opensubtitles","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":21008996,"num_examples":642,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":19622904,"num_examples":621,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":40631900,"size_in_bytes":971790236},"pile_openwebtext2":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nOpenWebText2","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_openwebtext2","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":128624303,"num_examples":32925,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":131554302,"num_examples":33400,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":260178605,"size_in_bytes":1191336941},"pile_philpapers":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nPhilPapers","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_philpapers","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":5090158,"num_examples":68,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":6499078,"num_examples":64,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":11589236,"size_in_bytes":942747572},"pile_pile-cc":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nPile-CC","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_pile-cc","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":235004043,"num_examples":52790,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":233535650,"num_examples":52792,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":468539693,"size_in_bytes":1399698029},"pile_pubmed-abstracts":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nPubMed Abstracts","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_pubmed-abstracts","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":39908950,"num_examples":29895,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":40008336,"num_examples":29871,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":79917286,"size_in_bytes":1011075622},"pile_pubmed-central":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nPubMed Central","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_pubmed-central","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":187251519,"num_examples":5911,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":184791818,"num_examples":5977,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":372043337,"size_in_bytes":1303201673},"pile_stackexchange":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nStackExchange","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_stackexchange","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":66441557,"num_examples":30378,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":66011397,"num_examples":29950,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":132452954,"size_in_bytes":1063611290},"pile_upsto":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nUSPTO Backgrounds","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_upsto","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":47345405,"num_examples":11415,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":48122320,"num_examples":11387,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":95467725,"size_in_bytes":1026626061},"pile_ubuntu-irc":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nUbuntu IRC","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_ubuntu-irc","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":5694218,"num_examples":22,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":7410104,"num_examples":21,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":13104322,"size_in_bytes":944262658},"pile_wikipedia":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nWikipedia (en)","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_wikipedia","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":52166968,"num_examples":17511,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":53186137,"num_examples":17478,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":105353105,"size_in_bytes":1036511441},"pile_youtubesubtitles":{"description":"The Pile is a 825 GiB diverse, open source language modeling data set that consists\nof 22 smaller, high-quality datasets combined together. 
To score well on Pile\nBPB (bits per byte), a model must be able to understand many disparate domains\nincluding books, github repositories, webpages, chat logs, and medical, physics,\nmath, computer science, and philosophy papers.\n\nYoutubeSubtitles","citation":"@article{pile,\n title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},\n author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},\n journal={arXiv preprint arXiv:2101.00027},\n year={2020}\n}\n","homepage":"https://pile.eleuther.ai/","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"pile","config_name":"pile_youtubesubtitles","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"test":{"name":"test","num_bytes":7377448,"num_examples":342,"dataset_name":"pile"},"validation":{"name":"validation","num_bytes":8937546,"num_examples":326,"dataset_name":"pile"}},"download_checksums":{"http://eaidata.bmk.sh/data/pile/val.jsonl.zst":{"num_bytes":470907480,"checksum":"264c875d8bbd355d8daa9d032b75fd8fb91606218bb84dd1155b203fcd5fab92"},"http://eaidata.bmk.sh/data/pile/test.jsonl.zst":{"num_bytes":460250856,"checksum":"0bb28c52d0b5596d389bf179ce2d43bf7f7ffae76b0d2d20b180c97f62e0975e"}},"download_size":931158336,"post_processing_size":null,"dataset_size":16314994,"size_in_bytes":947473330}}
{"quac":{"description":"Question Answering in Context (QuAC) is a dataset for modeling, understanding, and \nparticipating in information seeking dialog. Data instances consist of an interactive\ndialog between two crowd workers: (1) a student who poses a sequence of freeform\nquestions to learn as much as possible about a hidden Wikipedia text, and (2)\na teacher who answers the questions by providing short excerpts (spans) from the text.\n","citation":"@article{choi2018quac,\n title={Quac: Question answering in context},\n author={Choi, Eunsol and He, He and Iyyer, Mohit and Yatskar, Mark and Yih, Wen-tau and Choi, Yejin and Liang, Percy and Zettlemoyer, Luke},\n journal={arXiv preprint arXiv:1808.07036},\n year={2018}\n}\n","homepage":"https://quac.ai/","license":"","features":{"title":{"dtype":"string","id":null,"_type":"Value"},"section_title":{"dtype":"string","id":null,"_type":"Value"},"paragraph":{"dtype":"string","id":null,"_type":"Value"},"question":{"dtype":"string","id":null,"_type":"Value"},"answer":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"quac","config_name":"quac","version":{"version_str":"1.1.0","description":null,"major":1,"minor":1,"patch":0},"splits":{"train":{"name":"train","num_bytes":212391958,"num_examples":83568,"dataset_name":"quac"},"validation":{"name":"validation","num_bytes":20678483,"num_examples":7354,"dataset_name":"quac"}},"download_checksums":{"https://s3.amazonaws.com/my89public/quac/train_v0.2.json":{"num_bytes":68114819,"checksum":"ff5cca5a2e4b4d1cb5b5ced68b9fce88394ef6d93117426d6d4baafbcc05c56a"},"https://s3.amazonaws.com/my89public/quac/val_v0.2.json":{"num_bytes":8929167,"checksum":"09e622916280ba04c9352acb1bc5bbe80f11a2598f6f34e934c51d9e6570f378"}},"download_size":77043986,"post_processing_size":null,"dataset_size":233070441,"size_in_bytes":310114427}}
{"triviaqa":{"description":"TriviaQA is a reading comprehension dataset containing over 650K question-answer-evidence\ntriples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts\nand independently gathered evidence documents, six per question on average, that provide\nhigh quality distant supervision for answering the questions.\n","citation":"@InProceedings{JoshiTriviaQA2017,\n author = {Joshi, Mandar and Choi, Eunsol and Weld, Daniel S. and Zettlemoyer, Luke},\n title = {TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension},\n booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics},\n month = {July},\n year = {2017},\n address = {Vancouver, Canada},\n publisher = {Association for Computational Linguistics},\n}\n","homepage":"https://nlp.cs.washington.edu/triviaqa/","license":"Apache License 2.0","features":{"question_id":{"dtype":"string","id":null,"_type":"Value"},"question_source":{"dtype":"string","id":null,"_type":"Value"},"question":{"dtype":"string","id":null,"_type":"Value"},"answer":{"aliases":{"feature":{"dtype":"string","id":null,"_type":"Value"},"length":-1,"id":null,"_type":"Sequence"},"value":{"dtype":"string","id":null,"_type":"Value"}},"search_results":{"feature":{"description":{"dtype":"string","id":null,"_type":"Value"},"filename":{"dtype":"string","id":null,"_type":"Value"},"rank":{"dtype":"int32","id":null,"_type":"Value"},"title":{"dtype":"string","id":null,"_type":"Value"},"url":{"dtype":"string","id":null,"_type":"Value"},"search_context":{"dtype":"string","id":null,"_type":"Value"}},"length":-1,"id":null,"_type":"Sequence"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"triviaqa","config_name":"triviaqa","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"train":{"name":"train","num_bytes":1271393601,"num_examples":87622,"dataset_name":"triviaqa"},"validation
":{"name":"validation","num_bytes":163819509,"num_examples":11313,"dataset_name":"triviaqa"}},"download_checksums":{"http://eaidata.bmk.sh/data/triviaqa-unfiltered.tar.gz":{"num_bytes":546481381,"checksum":"adc19b42769062d241a8fbe834c56e58598d9322eb6c614e9f33a68a2cf5523e"}},"download_size":546481381,"post_processing_size":null,"dataset_size":1435213110,"size_in_bytes":1981694491}}
{"mid_word_1_anagrams":{"description":"Unscramble is a small battery of 5 \u201ccharacter manipulation\u201d tasks. Each task\ninvolves giving the model a word distorted by some combination of scrambling,\naddition, or deletion of characters, and asking it to recover the original word.\n","citation":"@inproceedings{NEURIPS2020_1457c0d6,\n author = {Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winter, Clemens and Hesse, Chris and Chen, Mark and Sigler, Eric and Litwin, Mateusz and Gray, Scott and Chess, Benjamin and Clark, Jack and Berner, Christopher and McCandlish, Sam and Radford, Alec and Sutskever, Ilya and Amodei, Dario},\n booktitle = {Advances in Neural Information Processing Systems},\n editor = {H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. 
Lin},\n pages = {1877--1901},\n publisher = {Curran Associates, Inc.},\n title = {Language Models are Few-Shot Learners},\n url = {https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf},\n volume = {33},\n year = {2020}\n}\n","homepage":"https://github.com/openai/gpt-3/tree/master/data","license":"","features":{"context":{"dtype":"string","id":null,"_type":"Value"},"completion":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"unscramble","config_name":"mid_word_1_anagrams","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"validation":{"name":"validation","num_bytes":271516,"num_examples":10000,"dataset_name":"unscramble"}},"download_checksums":{"https://raw.githubusercontent.com/openai/gpt-3/master/data/mid_word_1_anagrams.jsonl.gz":{"num_bytes":106533,"checksum":"6768a86896083199de4815d4964cb2f6f1046476cfd80c2a562784f182905979"}},"download_size":106533,"post_processing_size":null,"dataset_size":271516,"size_in_bytes":378049},"mid_word_2_anagrams":{"description":"Unscramble is a small battery of 5 \u201ccharacter manipulation\u201d tasks. 
Each task\ninvolves giving the model a word distorted by some combination of scrambling,\naddition, or deletion of characters, and asking it to recover the original word.\n","citation":"@inproceedings{NEURIPS2020_1457c0d6,\n author = {Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winter, Clemens and Hesse, Chris and Chen, Mark and Sigler, Eric and Litwin, Mateusz and Gray, Scott and Chess, Benjamin and Clark, Jack and Berner, Christopher and McCandlish, Sam and Radford, Alec and Sutskever, Ilya and Amodei, Dario},\n booktitle = {Advances in Neural Information Processing Systems},\n editor = {H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. Lin},\n pages = {1877--1901},\n publisher = {Curran Associates, Inc.},\n title = {Language Models are Few-Shot Learners},\n url = {https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf},\n volume = {33},\n year = 
{2020}\n}\n","homepage":"https://github.com/openai/gpt-3/tree/master/data","license":"","features":{"context":{"dtype":"string","id":null,"_type":"Value"},"completion":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"unscramble","config_name":"mid_word_2_anagrams","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"validation":{"name":"validation","num_bytes":282654,"num_examples":10000,"dataset_name":"unscramble"}},"download_checksums":{"https://raw.githubusercontent.com/openai/gpt-3/master/data/mid_word_2_anagrams.jsonl.gz":{"num_bytes":109091,"checksum":"c3d839d09a7954b78a27cd2cd75d4ed0488656c56ef4dbd741a005343826cb01"}},"download_size":109091,"post_processing_size":null,"dataset_size":282654,"size_in_bytes":391745},"cycle_letters_in_word":{"description":"Unscramble is a small battery of 5 \u201ccharacter manipulation\u201d tasks. Each task\ninvolves giving the model a word distorted by some combination of scrambling,\naddition, or deletion of characters, and asking it to recover the original word.\n","citation":"@inproceedings{NEURIPS2020_1457c0d6,\n author = {Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winter, Clemens and Hesse, Chris and Chen, Mark and Sigler, Eric and Litwin, Mateusz and Gray, Scott and Chess, Benjamin and Clark, Jack and Berner, Christopher and McCandlish, Sam and Radford, Alec and Sutskever, Ilya and Amodei, Dario},\n booktitle = {Advances in Neural Information Processing Systems},\n editor = {H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. 
Lin},\n pages = {1877--1901},\n publisher = {Curran Associates, Inc.},\n title = {Language Models are Few-Shot Learners},\n url = {https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf},\n volume = {33},\n year = {2020}\n}\n","homepage":"https://github.com/openai/gpt-3/tree/master/data","license":"","features":{"context":{"dtype":"string","id":null,"_type":"Value"},"completion":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"unscramble","config_name":"cycle_letters_in_word","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"validation":{"name":"validation","num_bytes":282654,"num_examples":10000,"dataset_name":"unscramble"}},"download_checksums":{"https://raw.githubusercontent.com/openai/gpt-3/master/data/cycle_letters_in_word.jsonl.gz":{"num_bytes":98451,"checksum":"1689c9002bb8c5988bf5f05e977c9db92f57932c1b5a38998c29ac0dd71e1d42"}},"download_size":98451,"post_processing_size":null,"dataset_size":282654,"size_in_bytes":381105},"random_insertion_in_word":{"description":"Unscramble is a small battery of 5 \u201ccharacter manipulation\u201d tasks. 
Each task\ninvolves giving the model a word distorted by some combination of scrambling,\naddition, or deletion of characters, and asking it to recover the original word.\n","citation":"@inproceedings{NEURIPS2020_1457c0d6,\n author = {Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winter, Clemens and Hesse, Chris and Chen, Mark and Sigler, Eric and Litwin, Mateusz and Gray, Scott and Chess, Benjamin and Clark, Jack and Berner, Christopher and McCandlish, Sam and Radford, Alec and Sutskever, Ilya and Amodei, Dario},\n booktitle = {Advances in Neural Information Processing Systems},\n editor = {H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. Lin},\n pages = {1877--1901},\n publisher = {Curran Associates, Inc.},\n title = {Language Models are Few-Shot Learners},\n url = {https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf},\n volume = {33},\n year = 
{2020}\n}\n","homepage":"https://github.com/openai/gpt-3/tree/master/data","license":"","features":{"context":{"dtype":"string","id":null,"_type":"Value"},"completion":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"unscramble","config_name":"random_insertion_in_word","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"validation":{"name":"validation","num_bytes":353981,"num_examples":10000,"dataset_name":"unscramble"}},"download_checksums":{"https://raw.githubusercontent.com/openai/gpt-3/master/data/random_insertion_in_word.jsonl.gz":{"num_bytes":143626,"checksum":"72e65d83da53d15752ee0c47379509de149ddbad32d61184e5991df29616b78a"}},"download_size":143626,"post_processing_size":null,"dataset_size":353981,"size_in_bytes":497607},"reversed_words":{"description":"Unscramble is a small battery of 5 \u201ccharacter manipulation\u201d tasks. Each task\ninvolves giving the model a word distorted by some combination of scrambling,\naddition, or deletion of characters, and asking it to recover the original word.\n","citation":"@inproceedings{NEURIPS2020_1457c0d6,\n author = {Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winter, Clemens and Hesse, Chris and Chen, Mark and Sigler, Eric and Litwin, Mateusz and Gray, Scott and Chess, Benjamin and Clark, Jack and Berner, Christopher and McCandlish, Sam and Radford, Alec and Sutskever, Ilya and Amodei, Dario},\n booktitle = {Advances in Neural Information Processing Systems},\n editor = {H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. 
Lin},\n pages = {1877--1901},\n publisher = {Curran Associates, Inc.},\n title = {Language Models are Few-Shot Learners},\n url = {https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf},\n volume = {33},\n year = {2020}\n}\n","homepage":"https://github.com/openai/gpt-3/tree/master/data","license":"","features":{"context":{"dtype":"string","id":null,"_type":"Value"},"completion":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"unscramble","config_name":"reversed_words","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"validation":{"name":"validation","num_bytes":282654,"num_examples":10000,"dataset_name":"unscramble"}},"download_checksums":{"https://raw.githubusercontent.com/openai/gpt-3/master/data/reversed_words.jsonl.gz":{"num_bytes":91917,"checksum":"133a08f875cd6c1ef8608a3233571a773881cc27b1c707de738cc6543439332a"}},"download_size":91917,"post_processing_size":null,"dataset_size":282654,"size_in_bytes":374571}}