Commit 4cc57af4 authored by lintangsutawika

merged conflict

parents 9554f8da 9d3247be
@@ -19,7 +19,7 @@ Tasks are configured via the `TaskConfig` object. Below, we describe all fields
- **reference** (`str`, *optional*) —
- **dataset_path** (`str`) — The name of the dataset as listed by HF in the datasets Hub.
- **dataset_name** (`str`, *optional*, defaults to None) — The name of what HF calls a “data instance” or sub-task of the benchmark. If your task does not contain any data instances, just leave this as None. (If you're familiar with the HF `datasets.load_dataset` function, these are just the first two arguments to it.)
- **dataset_kwargs** (`dict`, *optional*) — Auxiliary arguments that `datasets.load_dataset` accepts. This can be used to specify arguments such as `data_files` or `data_dir` if you want to use local data files such as JSON or CSV (see the sketch below this hunk for how these fields map onto `datasets.load_dataset`).
- **training_split** (`str`, *optional*) — Split in the dataset to use as the training split.
- **validation_split** (`str`, *optional*) — Split in the dataset to use as the validation split.
- **test_split** (`str`, *optional*) — Split in the dataset to use as the test split.
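To make the mapping concrete, here is a rough sketch of how these dataset fields line up with the Hugging Face `datasets.load_dataset` call. The dataset and split names below are illustrative examples, not a task shipped with the harness.

```python
import datasets

# Hypothetical values for illustration only.
dataset = datasets.load_dataset(
    path="super_glue",  # dataset_path
    name="boolq",       # dataset_name
    # dataset_kwargs are forwarded as extra keyword arguments, e.g.
    # data_files={"train": "train.json"} or data_dir="data/" for local JSON/CSV files.
)

train_docs = dataset["train"]      # training_split
val_docs = dataset["validation"]   # validation_split
```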
@@ -169,7 +169,7 @@ You can find an example of how to use this feature at [gsm8k-cot-self-consistenc
## Passing Arguments to Metrics
Metrics can be defined in the `metric_list` argument when building the YAML config. Multiple metrics can be listed along with any auxiliary arguments. For example, when using the [`exact_match` metric](https://github.com/huggingface/evaluate/tree/main/metrics/exact_match), auxiliary arguments such as `ignore_case`, `ignore_punctuation`, and `regexes_to_ignore` can be listed as well. They will be passed to the metric function as `kwargs`, as sketched below. Some metrics have predefined values for `aggregation` and `higher_is_better`, so listing only the metric name can be sufficient.
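As a rough illustration of what those `kwargs` do downstream, the snippet below calls the HF `evaluate` implementation of `exact_match` directly with the same auxiliary arguments (the strings being compared are made up for the example); `regexes_to_ignore` works similarly, stripping regex matches from both sides before comparison.

```python
import evaluate

exact_match = evaluate.load("exact_match")
score = exact_match.compute(
    predictions=["The answer is 42."],
    references=["the answer is 42"],
    ignore_case=True,         # case differences no longer count as mismatches
    ignore_punctuation=True,  # the trailing "." is stripped before comparison
)
print(score)  # exact_match of 1.0
```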
```
metric_list:
@@ -225,4 +225,3 @@ Generative tasks:
Tasks using complex filtering:
- GSM8k with CoT (+ with Self-Consistency): (`lm_eval/tasks/gsm8k/gsm8k-cot.yaml` ; `lm_eval/tasks/gsm8k/gsm8k-cot-self-consistency.yaml`)
@@ -21,21 +21,16 @@ As a concrete example, we'll walk through reimplementing the `gsm8k` benchmark (
## Creating a YAML file

Tasks in the eval harness are largely implemented via YAML files. To implement a new standard task, we'll need to write a YAML file which configures our task logic. We start by making a new empty YAML file. This file can have any name, but we recommend placing it in a subfolder of `lm_eval/tasks` titled by the dataset or task's shorthand name: for example,

```sh
touch lm_eval/tasks/<dataset_name>/<my_new_task_name>.yaml
```
Or, copy the template subfolder we provide from `templates/new_yaml_task`:
```sh
cp -r templates/new_yaml_task lm_eval/tasks/
```
and rename the folders and YAML file(s) as desired.
### Selecting and configuring a dataset
@@ -241,16 +236,17 @@ The checklist is the following:
For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
If you have created a new subfolder in `lm_eval/tasks`, it is recommended to include a filled-out copy of this checklist in that subfolder's README.md.
## Submitting your task
You're all set! Now push your work and make a pull request to the `big-refactor` branch! Thanks for the contribution :). If there are any questions, please leave a message in the `#lm-thunderdome` channel on the EAI discord!
@@ -20,6 +20,8 @@ def median(arr):
    return arr[len(arr) // 2]


# Certain metrics must be calculated across all documents in a benchmark.
# We use them as aggregation metrics, paired with no-op passthrough metric fns.
@register_aggregation("perplexity")
def perplexity(items):
    return math.exp(-mean(items))
@@ -35,6 +37,25 @@ def bits_per_byte(items):
    return -weighted_mean(items) / math.log(2)


@register_aggregation("f1")
def f1_score(items):
    unzipped_list = list(zip(*items))
    golds = unzipped_list[0]
    preds = unzipped_list[1]
    fscore = sklearn.metrics.f1_score(golds, preds)
    return np.max(fscore)


@register_aggregation("matthews_corrcoef")
def matthews_corrcoef(items):
    unzipped_list = list(zip(*items))
    golds = unzipped_list[0]
    preds = unzipped_list[1]
    # print(preds)
    return sklearn.metrics.matthews_corrcoef(golds, preds)
@register_metric(
    metric="acc",
    higher_is_better=True,
@@ -119,27 +140,24 @@ def mean_stderr(arr):
    return sample_stddev(arr) / math.sqrt(len(arr))


@register_metric(
    metric="mcc",
    higher_is_better=True,
    output_type="multiple_choice",
    aggregation="matthews_corrcoef",
)
def mcc_fn(items):  # This is a passthrough function
    return items


@register_metric(
    metric="f1",
    higher_is_better=True,
    output_type="multiple_choice",
    aggregation="f1",
)
def f1_fn(items):  # This is a passthrough function
    return items
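For clarity, here is a small self-contained sketch of the pattern introduced above: the per-document "metric" is a no-op that simply returns its `(gold, pred)` pair, and the real score is computed once over the whole collection at aggregation time. The labels below are invented.

```python
import sklearn.metrics


def f1_passthrough(items):
    # per-document step: nothing is scored yet
    return items


def f1_aggregation(items):
    # corpus-level step: unzip all (gold, pred) pairs and score them together
    golds, preds = zip(*items)
    return sklearn.metrics.f1_score(golds, preds)


per_doc_items = [(1, 1), (0, 1), (1, 0), (1, 1)]  # invented (gold, pred) pairs
print(f1_aggregation(f1_passthrough(per_doc_items)))  # one corpus-level F1, not a mean of per-doc scores
```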
@register_metric(
......
@@ -26,12 +26,17 @@ def register_model(*names):
def get_model(model_name):
    try:
        return MODEL_REGISTRY[model_name]
    except KeyError:
        raise ValueError(
            f"Attempted to load model '{model_name}', but no model for this name found! Supported model names: {', '.join(MODEL_REGISTRY.keys())}"
        )
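A quick usage sketch of the stricter lookup above. The import path mirrors the registry module shown elsewhere in this diff, and the model name is one registered further down (`openai-completions`); treat both as assumptions for your checkout.

```python
from lm_eval.api.registry import get_model  # assumed import path

try:
    lm_cls = get_model("openai-completions")
except ValueError as err:
    # unknown names now produce a readable error listing every supported model
    print(err)
```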
TASK_REGISTRY = {}
GROUP_REGISTRY = {}
ALL_TASKS = set()
func2task_index = {}
@@ -42,6 +47,7 @@ def register_task(name):
        ), f"task named '{name}' conflicts with existing registered task!"

        TASK_REGISTRY[name] = fn
        ALL_TASKS.add(name)
        func2task_index[fn.__name__] = name
        return fn
@@ -55,6 +61,7 @@ def register_group(name):
            GROUP_REGISTRY[name].append(func_name)
        else:
            GROUP_REGISTRY[name] = [func_name]
            ALL_TASKS.add(name)
        return fn

    return decorate
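A minimal sketch of how the two registration decorators above are meant to be used. The task name is invented, and a real task would subclass `Task`/`ConfigurableTask` rather than a bare class.

```python
from lm_eval.api.registry import register_task, TASK_REGISTRY, ALL_TASKS  # assumed import path


@register_task("my_demo_task")
class MyDemoTask:  # placeholder stand-in for a real Task subclass
    pass


assert "my_demo_task" in TASK_REGISTRY
assert "my_demo_task" in ALL_TASKS  # group names registered via register_group land here too
```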
@@ -72,10 +79,7 @@ DEFAULT_METRIC_REGISTRY = {
        "acc",
    ],
    "loglikelihood_rolling": ["word_perplexity", "byte_perplexity", "bits_per_byte"],
    "multiple_choice": ["acc", "acc_norm"],
    "greedy_until": ["exact_match"],
}
@@ -133,7 +137,6 @@ searching in HF Evaluate library..."
def register_aggregation(name):
    def decorate(fn):
        assert (
            name not in AGGREGATION_REGISTRY
......
@@ -98,7 +98,9 @@ class TaskConfig(dict):
            self.gold_alias = self.template_aliases + self.doc_to_target

        if self.generation_kwargs or self.output_type == "greedy_until":
            assert (
                self.output_type == "greedy_until"
            ), "passed `generation_kwargs`, but not using a generation request type!"
            # ensure that we greedily generate in absence of explicit arguments otherwise
            self.generation_kwargs = {"do_sample": False, "temperature": 0.0}
@@ -106,7 +108,21 @@ class TaskConfig(dict):
        return getattr(self, item)

    def to_dict(self):
        """dumps the current config as a dictionary object, as a printable format.
        null fields will not be printed.
        Used for dumping results alongside full task configuration

        :return: dict
            A printable dictionary version of the TaskConfig object.

        # TODO: should any default value in the TaskConfig not be printed?
        """
        cfg_dict = asdict(self)
        # remove values that are `None`
        for k, v in list(cfg_dict.items()):
            if v is None:
                cfg_dict.pop(k)
        return cfg_dict
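A hedged usage sketch of `to_dict`; the module path and constructor call are assumptions, and the field names come from the task configuration docs above. Only fields that were actually set survive the dump.

```python
from lm_eval.api.task import TaskConfig  # assumed module path

cfg = TaskConfig(dataset_path="super_glue", dataset_name="boolq")
print(cfg.to_dict())  # None-valued fields such as unset splits are omitted from the output
```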
class Task(abc.ABC):
@@ -419,7 +435,7 @@ class Task(abc.ABC):
        if num_fewshot == 0:
            labeled_examples = ""
        else:
            labeled_examples = self.sampler.get_context(doc, num_fewshot)

        # for sets with no training docs, draw from other set *but ensure no overlap with current doc*
        # if self.has_training_docs():
@@ -532,7 +548,7 @@ class ConfigurableTask(Task):
                }
                try:
                    self._metric_fn_list[metric_name] = METRIC_REGISTRY[metric_name]
                except Exception:
                    eval_logger.warning(
                        f"Metric {metric_name} not found, "
                        "Searching from https://huggingface.co/evaluate-metric"
@@ -550,15 +566,24 @@ class ConfigurableTask(Task):
                if "aggregation" in metric_config:
                    agg_name = metric_config["aggregation"]
                    if type(agg_name) == str:
                        self._aggregation_list[metric_name] = AGGREGATION_REGISTRY[
                            agg_name
                        ]
                    elif callable(agg_name):
                        self._aggregation_list[metric_name] = metric_config[
                            "aggregation"
                        ]
                else:
                    INV_AGG_REGISTRY = {v: k for k, v in AGGREGATION_REGISTRY.items()}
                    metric_agg = DEFAULT_AGGREGATION_REGISTRY[metric_name]
                    eval_logger.warning(
                        f"metric {metric_name} is defined, but aggregation is not. "
                        f"using default "
                        f"aggregation={INV_AGG_REGISTRY[metric_agg]}"
                    )
                    self._aggregation_list[metric_name] = metric_agg

                if "higher_is_better" in metric_config:
                    self._higher_is_better[metric_name] = metric_config[
@@ -566,8 +591,9 @@ class ConfigurableTask(Task):
                    ]
                else:
                    eval_logger.warning(
                        f"metric {metric_name} is defined, but higher_is_better is not. "
                        f"using default "
                        f"higher_is_better={HIGHER_IS_BETTER_REGISTRY[metric_name]}"
                    )
                    self._higher_is_better[metric_name] = HIGHER_IS_BETTER_REGISTRY[
                        metric_name
@@ -592,9 +618,7 @@ class ConfigurableTask(Task):
                filter_pipeline = build_filter_ensemble(filter_name, components)
                self._filters.append(filter_pipeline)
        else:
            self._filters = [build_filter_ensemble("none", [["take_first", None]])]
        if self._config.use_prompt is not None:
            eval_logger.info(f"loading prompt {self._config.use_prompt}")
@@ -653,6 +677,7 @@ class ConfigurableTask(Task):
        else:
            if self._config.num_fewshot > 0:
                eval_logger.warning(
                    f"Task '{self._config.task}': "
                    "num_fewshot > 0 but fewshot_split is None. "
                    "using preconfigured rule."
                )
@@ -842,7 +867,8 @@ class ConfigurableTask(Task):
            result_dict = {
                **({"acc": acc} if "acc" in use_metric else {}),
                **({"f1": (gold, pred)} if "f1" in use_metric else {}),
                **({"mcc": (gold, pred)} if "mcc" in use_metric else {}),
                **({"acc_norm": acc_norm} if "acc_norm" in use_metric else {}),
            }
......
@@ -149,7 +149,7 @@ def evaluate(
    results = collections.defaultdict(dict)
    versions = collections.defaultdict(dict)
    configs = collections.defaultdict(dict)
    samples = collections.defaultdict(list)
    requests = collections.defaultdict(list)
    # docs = {}
@@ -232,7 +232,7 @@ def evaluate(
                enumerate(task.validation_docs()), lm.rank, limit, lm.world_size
            )
        )

        for doc_id, doc in doc_iterator:
            # subset instances to only this document id ; sort by idx
            requests = list(filter(lambda x: x.doc_id == doc_id, task.instances))
@@ -249,7 +249,7 @@ def evaluate(
                "filtered_resps": [req.filtered_resps[key] for req in requests],
            }
            example.update(metrics)
            samples[task_name].append(example)
            for metric, value in metrics.items():
                vals[(task_name, key, metric)].append(value)
@@ -314,6 +314,7 @@ def evaluate(
            "results": dict(results),
            "configs": dict(configs),
            "versions": dict(versions),
            "samples": samples,
        }
    else:
......
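With per-example records now collected in `samples` rather than logged through a separate `examples` logger, callers can inspect them straight from the returned dictionary. The sketch below assumes an `evaluate(lm=..., task_dict=...)` call signature and a task named `gsm8k`; both are illustrative, not guaranteed by this diff.

```python
output = evaluate(lm=lm, task_dict=task_dict)  # assumed call signature
if output is not None:  # the else branch above suggests non-zero ranks may return None
    for example in output["samples"]["gsm8k"][:3]:
        # each entry is the per-document record built above, including "filtered_resps" and metric values
        print(example["filtered_resps"])
```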
@@ -57,7 +57,7 @@ def oa_completion(**kwargs):
            backoff_time *= 1.5


@register_model("openai", "openai-completions", "gooseai")
class GPT3LM(LM):
    REQ_CHUNK_SIZE = 20
......
# v1.0 Tasks
This list keeps track of which tasks' implementations have been ported to YAML / v2.0 of the Eval Harness.

Boxes should be checked iff tasks are implemented in the refactor and tested for regression. Tasks should be struck through if checked *against the original introducing paper's* implementation or a popularizing implementation.

- [ ] Glue (WIP)
- [x] SuperGlue
- [ ] CoQA
- [ ] DROP
- [x] ~~Lambada~~
@@ -31,7 +31,7 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for regress
- [ ] WebQs
- [ ] WSC273
- [ ] Winogrande
- [x] ANLI
- [ ] Hendrycks Ethics
- [ ] TruthfulQA
- [ ] MuTual
......
@@ -3,6 +3,7 @@ from typing import List, Union
from .gsm8k import *
from .triviaqa import *
from .glue import *

from lm_eval import utils
from lm_eval.logger import eval_logger
@@ -12,6 +13,7 @@ from lm_eval.api.registry import (
    register_group,
    TASK_REGISTRY,
    GROUP_REGISTRY,
    ALL_TASKS,
)
@@ -38,6 +40,9 @@ def include_task_folder(task_dir):
                    )
                    if "task" in config:
                        # task_name = "{}:{}".format(
                        #     get_task_name_from_config(config), config["task"]
                        # )
                        task_name = "{}".format(config["task"])
                        register_task(task_name)(SubClass)
@@ -62,7 +67,7 @@ def get_task(task_name, config):
        return TASK_REGISTRY[task_name](config=config)
    except KeyError:
        eval_logger.info("Available tasks:")
        eval_logger.info(list(TASK_REGISTRY) + list(GROUP_REGISTRY))
        raise KeyError(f"Missing task {task_name}")
......
@@ -16,6 +16,6 @@ metric_list:
  - metric: perplexity
    aggregation: perplexity
    higher_is_better: false
  - metric: acc
    aggregation: mean
    higher_is_better: true
include: pile_arxiv.yaml
task: pile_pubmed-abstracts
dataset_name: pile_pubmed-abstracts

include: pile_arxiv.yaml
task: pile_pubmed-central
dataset_name: pile_pubmed-central

include: pile_arxiv.yaml
task: pile_stackexchange
dataset_name: pile_stackexchange

include: pile_arxiv.yaml
task: pile_ubuntu-irc
dataset_name: pile_ubuntu-irc