Unverified Commit 86a3b270 authored by Yufeng Xu, committed by GitHub

Added C4 Support (#2889)

* added c4 dataset (working)

* fixed bugs in c4

* fixed loading bugs in c4 dataset; using partial loading

* cleaned the code

* added version number for c4

* removed irrelevant files
parent 18297993
@@ -29,6 +29,7 @@
| [bertaqa](bertaqa/README.md) | Local Basque cultural trivia QA tests in English and Basque languages. | English, Basque, Basque (MT) |
| [bigbench](bigbench/README.md) | Broad tasks from the BIG-bench benchmark designed to push the boundaries of large models. | Multiple |
| [blimp](blimp/README.md) | Tasks testing grammatical phenomena to evaluate a language model's linguistic capabilities. | English |
| [c4](c4/README.md) | Tasks based on a colossal, cleaned version of Common Crawl's web crawl corpus to assess models' language modeling capabilities. | English |
| [careqa](careqa/README.md) | Multiple choice and open-ended medical question answering based on the Spanish Specialised Healthcare Training (MIR) exams. | English, Spanish |
| [catalan_bench](catalan_bench/README.md) | Collection of tasks in Catalan encompassing various evaluation areas. | Catalan |
| [ceval](ceval/README.md) | Tasks that evaluate language understanding and reasoning in an educational context. | Chinese |
# Colossal Clean Crawled Corpus (C4)
### Paper
[Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683)
A colossal, cleaned version of Common Crawl's web crawl corpus, based on the [Common Crawl dataset](https://commoncrawl.org).
This is the processed version of Google's C4 dataset.
[Homepage](https://huggingface.co/datasets/allenai/c4)
### Citation
```text
@misc{raffel2023exploringlimitstransferlearning,
title={Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
author={Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
year={2023},
eprint={1910.10683},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/1910.10683},
}
```
### Groups, Tags, and Tasks
#### Groups
* Not part of a group yet.
#### Tasks
* `c4`: measures perplexity on the C4 dataset via rolling log-likelihoods (see the usage sketch below).
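A minimal sketch of invoking the task through the harness' Python API; the model choice and `limit` value below are placeholders, not part of this PR:

```python
# Hypothetical quick run of the c4 task; swap in any model you like.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder model
    tasks=["c4"],
    limit=100,  # score only the first 100 validation documents while testing
)
print(results["results"]["c4"])  # word_perplexity, byte_perplexity, bits_per_byte
```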
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
### Changelog
task: c4
dataset_path: allenai/c4
dataset_name: en
output_type: loglikelihood_rolling
training_split: train
validation_split: validation
doc_to_text: ""
doc_to_target: !function preprocess_c4.c4_detokenizer
process_results: !function preprocess_c4.process_results
should_decontaminate: true
doc_to_decontamination_query: "{{text}}"  # C4 documents expose a "text" field
metric_list:
  - metric: word_perplexity
  - metric: byte_perplexity
  - metric: bits_per_byte
metadata:
  version: 0.0
dataset_kwargs:
  data_files:
    train: en/c4-train.00000-of-01024.json.gz
    validation: en/c4-validation.00000-of-00008.json.gz
    # following the choice of https://arxiv.org/abs/2410.07461
  trust_remote_code: true
  verification_mode: "no_checks"
\ No newline at end of file
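For context, the `dataset_kwargs` above load only a single shard of each split rather than the full corpus. A rough, standalone sketch of the partial load this corresponds to (simplified: the harness also passes the `en` config name and `trust_remote_code`, so treat this as illustrative):

```python
# Illustrative only: roughly the partial load requested by the dataset_kwargs above.
from datasets import load_dataset

ds = load_dataset(
    "allenai/c4",
    data_files={
        "train": "en/c4-train.00000-of-01024.json.gz",
        "validation": "en/c4-validation.00000-of-00008.json.gz",
    },
    verification_mode="no_checks",  # skip checksum / split-size verification
)
print(ds["validation"][0].keys())  # each record carries "text", "timestamp", "url"
```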
import re


def c4_detokenizer(doc):
    """Undo spacing artifacts in a C4 document before scoring.

    The replacement rules mirror the wikitext detokenizer used elsewhere in the harness.
    """
    string = doc["text"]
    # contractions
    string = string.replace("s '", "s'")
    string = re.sub(r"/' [0-9]/", r"/'[0-9]/", string)
    # number separators
    string = string.replace(" @-@ ", "-")
    string = string.replace(" @,@ ", ",")
    string = string.replace(" @.@ ", ".")
    # punctuation
    string = string.replace(" : ", ": ")
    string = string.replace(" ; ", "; ")
    string = string.replace(" . ", ". ")
    string = string.replace(" ! ", "! ")
    string = string.replace(" ? ", "? ")
    string = string.replace(" , ", ", ")
    # double brackets
    string = re.sub(r"\(\s*([^\)]*?)\s*\)", r"(\1)", string)
    string = re.sub(r"\[\s*([^\]]*?)\s*\]", r"[\1]", string)
    string = re.sub(r"{\s*([^}]*?)\s*}", r"{\1}", string)
    string = re.sub(r"\"\s*([^\"]*?)\s*\"", r'"\1"', string)
    string = re.sub(r"'\s*([^']*?)\s*'", r"'\1'", string)
    # miscellaneous
    string = string.replace("= = = =", "====")
    string = string.replace("= = =", "===")
    string = string.replace("= =", "==")
    string = string.replace(" " + chr(176) + " ", chr(176))
    string = string.replace(" \n", "\n")
    string = string.replace("\n ", "\n")
    string = string.replace(" N ", " 1 ")
    string = string.replace(" 's", "'s")
    return string
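
# Illustrative, editorial sanity check of the detokenizer (not part of the original
# file); the sample string is synthetic, showing which spacing artifacts the rules collapse:
# >>> c4_detokenizer({"text": "Hello , world ! ( a note ) It 's 5 @.@ 3 km . The end"})
# "Hello, world! (a note) It's 5.3 km. The end"
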
def process_results(doc, results):
    (loglikelihood,) = results
    # IMPORTANT: following the wikitext convention, words and bytes are counted
    # on the *original* document, before detokenization.
    _words = len(re.split(r"\s+", doc["text"]))
    _bytes = len(doc["text"].encode("utf-8"))
    return {
        "word_perplexity": (loglikelihood, _words),
        "byte_perplexity": (loglikelihood, _bytes),
        "bits_per_byte": (loglikelihood, _bytes),
    }
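For reference, the `(loglikelihood, weight)` pairs returned above are aggregated across documents into the reported metrics. A rough sketch of that arithmetic, assuming the usual weighted-perplexity and bits-per-byte definitions (illustrative, not the harness code itself):

```python
# Illustrative aggregation, assuming the standard definitions:
#   word_perplexity = exp(-sum(loglikelihoods) / total_words)
#   byte_perplexity = exp(-sum(loglikelihoods) / total_bytes)
#   bits_per_byte   = -sum(loglikelihoods) / (total_bytes * ln 2)
import math


def aggregate_perplexity(pairs):
    """pairs: list of (loglikelihood, weight) tuples as emitted by process_results."""
    total_ll = sum(ll for ll, _ in pairs)
    total_weight = sum(w for _, w in pairs)
    return math.exp(-total_ll / total_weight)


def aggregate_bits_per_byte(pairs):
    total_ll = sum(ll for ll, _ in pairs)
    total_bytes = sum(b for _, b in pairs)
    return -total_ll / (total_bytes * math.log(2))


# e.g. word perplexity over two hypothetical documents:
print(aggregate_perplexity([(-1200.0, 350), (-900.0, 280)]))
```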