To save evaluation results provide an `--output_path`. We also support logging model responses with the `--log_samples` flag for post-hoc analysis.
Additionally, one can provide a directory with `--use_cache` to cache the results of prior runs. This allows you to avoid repeated execution of the same (model, task) pairs for re-scoring.
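For example, a run that writes results to disk and reuses cached results across invocations might look like the following sketch (the model, task, and paths are illustrative):

```bash
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks hellaswag \
    --output_path ./results \
    --use_cache ./lm_cache
```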
To push results and samples to the Hugging Face Hub, first ensure an access token with write access is set in the `HF_TOKEN` environment variable. Then, use the `--hf_hub_log_args` flag to specify the organization, repository name, repository visibility, and whether to push results and samples to the Hub (see this [example output](https://huggingface.co/datasets/KonradSzafer/lm-eval-results-demo/tree/main/microsoft__phi-2)). For instance:
```bash
lm_eval --model hf \
...
...
```
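A fuller invocation that pushes both results and samples might look like the sketch below. The model name, task, and repository settings are illustrative, and the `push_results_to_hub` key is assumed to follow the same naming convention as the other documented keys:

```bash
lm_eval --model hf \
    --model_args pretrained=microsoft/phi-2 \
    --tasks hellaswag \
    --log_samples \
    --output_path results \
    --hf_hub_log_args hub_results_org=EleutherAI,hub_repo_name=lm-eval-results,push_results_to_hub=True,push_samples_to_hub=True,public_repo=False
```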
Extras dependencies can be installed via `pip install -e ".[NAME]"`

| Name          | Use                                   |
|---------------|---------------------------------------|
| sentencepiece | For using the sentencepiece tokenizer |
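For example, to install the sentencepiece extra from a source checkout:

```bash
pip install -e ".[sentencepiece]"
```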
This mode supports a number of command-line arguments, the details of which can be found below:
* `--wandb_args`: Tracks logging to Weights and Biases for evaluation runs and includes args passed to `wandb.init`, such as `project` and `job_type`. Full list [here](https://docs.wandb.ai/ref/python/init). e.g., ```--wandb_args project=test-project,name=test-run```
* `--hf_hub_log_args`: To push results and samples to the Hugging Face Hub. First ensure an access token with write access is set in the `HF_TOKEN` environment variable. Then, use this flag to specify the organization, repository name, repository visibility, and whether to push results and samples to the Hub. e.g., ```--hf_hub_log_args hub_results_org=EleutherAI,hub_repo_name=lm-eval-results,public_repo=False,push_samples_to_hub=True```
## External Library Usage
We also support using the library's Python API from within model training loops or other scripts.
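A minimal sketch of programmatic usage via `lm_eval.simple_evaluate` might look like the following; the model and task choices are illustrative placeholders, and further arguments (batch size, device, etc.) follow the function's signature:

```python
import lm_eval

# Evaluate a Hugging Face model on a task programmatically rather than via the CLI.
# The model and task names here are placeholders.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    num_fewshot=0,
)

# Per-task metrics live under the "results" key of the returned dict.
print(results["results"])
```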
Title: `COPAL-ID: Indonesian Language Reasoning with Local Culture and Nuances`
Abstract: `https://arxiv.org/abs/2311.01012`
`COPAL-ID is an Indonesian causal commonsense reasoning dataset that captures local nuances. It provides a more natural portrayal of day-to-day causal reasoning within the Indonesian (especially Jakartan) cultural sphere. Professionally written and validated from scratch by natives, COPAL-ID is more fluent and free from awkward phrases, unlike the translated XCOPA-ID.`
Homepage: `https://github.com/haryoa/copal-id`
### Citation
```
@article{wibowo2023copal,
title={COPAL-ID: Indonesian Language Reasoning with Local Culture and Nuances},
author={Wibowo, Haryo Akbarianto and Fuadi, Erland Hilman and Nityasya, Made Nindyatama and Prasojo, Radityo Eko and Aji, Alham Fikri},
journal={arXiv preprint arXiv:2311.01012},
year={2023}
}
```
### Groups and Tasks
#### Groups
* `copal_id`
#### Tasks
* `copal_id_standard`: `Standard version of the COPAL dataset, using formal language and fewer local nuances`
* `copal_id_colloquial`: `Colloquial version of the COPAL dataset, using informal language and more local nuances`
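Both variants can be run through the standard CLI; a minimal sketch (the model choice is illustrative):

```bash
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks copal_id_standard,copal_id_colloquial
```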
### Checklist
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
Measuring Mathematical Problem Solving With the MATH Dataset
https://arxiv.org/abs/2103.03874
Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations.
NOTE: This task corresponds to the MATH (`hendrycks_math`) implementation at https://github.com/EleutherAI/lm-evaluation-harness/tree/master. For the variant which uses the custom 4-shot prompt in the Minerva paper (https://arxiv.org/abs/2206.14858), and SymPy answer checking as done by Minerva, see `lm_eval/tasks/minerva_math`.
Homepage: https://github.com/hendrycks/math
## Citation
```
@article{hendrycksmath2021,
title={Measuring Mathematical Problem Solving With the MATH Dataset},
author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},
journal={NeurIPS},
year={2021}
}
```
### Groups and Tasks
#### Groups
- `hendrycks_math`: the MATH benchmark from Hendrycks et al., 0- or few-shot.
#### Tasks
- `hendrycks_math_algebra`
- `hendrycks_math_counting_and_prob`
- `hendrycks_math_geometry`
- `hendrycks_math_intermediate_algebra`
- `hendrycks_math_num_theory`
- `hendrycks_math_prealgebra`
- `hendrycks_math_precalc`
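The full benchmark can be run via the `hendrycks_math` group name, or a single subject via its task name; a minimal sketch (the model choice is illustrative):

```bash
# Run all seven MATH subjects as a group (the model is a placeholder).
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks hendrycks_math
```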
### Checklist
The checklist is the following:
For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
* Answer extraction code is taken from the original MATH benchmark paper's repository.
If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?