We have a revamp of the Evaluation Harness library internals staged on the [big-refactor](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor) branch! The refactor is far along, but before we move the repository's `master` branch over to the new design and cut a new version release, we'd like to make sure it has been tested by outside users and has no glaring bugs.
We’d like your help testing it out! You can help by:
1. Trying out your current workloads on the `big-refactor` branch and seeing if anything breaks or is counterintuitive,
2. Porting tasks supported in the previous version of the harness to the new YAML configuration format (a sketch of such a config follows this list). Please check out our [task implementation guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/new_task_guide.md) for more information.
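For illustration, a new-style task config is a YAML file along these lines. This is a hedged sketch modeled on the linked guide; treat the guide as the authoritative reference for field names and semantics:

```yaml
# Hypothetical port of a multiple-choice task to the YAML format.
task: boolq                  # name the task is registered under
dataset_path: super_glue     # HF Hub dataset repository
dataset_name: boolq          # dataset config name
output_type: multiple_choice
training_split: train
validation_split: validation
doc_to_text: "{{passage}}\nQuestion: {{question}}?\nAnswer:"
doc_to_target: label
doc_to_choice: ["no", "yes"]
metric_list:
  - metric: acc
```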
If you choose to port a task that is not yet completed according to [our checklist](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/lm_eval/tasks/README.md), you can contribute it by opening a PR with [Refactor] in its title that includes:
- A shell command to run the task on the `master` branch, and the score it attains
- A shell command to run the task on your PR branch to `big-refactor`, and the resulting score, to show that the two implementations attain equal scores (a sketch of this comparison follows the list).
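Concretely, the pair of commands might look like the following. This is a hedged sketch: the model, task, and branch names are placeholders, and the exact CLI flags on `big-refactor` may differ slightly from `master`, so check that branch's docs:

```bash
# On master: record the baseline score for the task being ported.
git checkout master
python main.py \
    --model hf-causal \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks sciq

# On your PR branch targeting big-refactor: rerun the ported task and
# check that the reported score matches the baseline above.
git checkout my-task-port  # hypothetical branch name
python main.py \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks sciq
```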
Lastly, as we carry out this switch to the new version over the next week, we'll no longer be accepting new feature requests for the `master` branch beyond those already open, though we will continue to accept bugfixes to `master` and PRs to `big-refactor`. Feel free to reach out in the #lm-thunderdome channel of the EAI Discord for more information.
## Overview
This project provides a unified framework to test generative language models on a large number of different evaluation tasks.
...
...
GGUF or GGML quantized models can be loaded by using the `llama-cpp-python` server:
```bash
python main.py \
    --model gguf \
    --model_args base_url=http://localhost:8000 \
    --tasks hellaswag
```
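The server can be started ahead of time with `llama-cpp-python`'s bundled server module, along these lines (a sketch; the model path is a placeholder, and the server listens on port 8000 by default):

```bash
# Install the server extra, then launch it with a local GGUF model file.
pip install 'llama-cpp-python[server]'
python -m llama_cpp.server --model /path/to/model.gguf
```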
We support wildcards in task names; for example, you can run all of the machine-translated LAMBADA tasks via `--tasks "lambada_openai_mt_*"`.
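For instance (a sketch using `gpt2` as a placeholder model; quote the pattern so your shell does not expand the `*` against local filenames):

```bash
python main.py \
    --model hf-causal \
    --model_args pretrained=gpt2 \
    --tasks "lambada_openai_mt_*"
```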
We currently support only one prompt per task, which we strive to make the "standard" prompt as defined by the benchmark's authors. If you would like to study how varying prompts affects evaluation scores, check out the [BigScience fork](https://github.com/bigscience-workshop/lm-evaluation-harness) of this repo. We are currently working on upstreaming this capability to `master`.
"Warning: a primary stop sequence is multi-token! Will default to EOS token for this tokenizer. Consider using `hf-causal-experimental` for multi-token stop sequence support for the time being."
{"triviaqa":{"description":"TriviaQA is a reading comprehension dataset containing over 650K question-answer-evidence\ntriples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts\nand independently gathered evidence documents, six per question on average, that provide\nhigh quality distant supervision for answering the questions.\n","citation":"@InProceedings{JoshiTriviaQA2017,\n author = {Joshi, Mandar and Choi, Eunsol and Weld, Daniel S. and Zettlemoyer, Luke},\n title = {TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension},\n booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics},\n month = {July},\n year = {2017},\n address = {Vancouver, Canada},\n publisher = {Association for Computational Linguistics},\n}\n","homepage":"https://nlp.cs.washington.edu/triviaqa/","license":"Apache License 2.0","features":{"question_id":{"dtype":"string","id":null,"_type":"Value"},"question_source":{"dtype":"string","id":null,"_type":"Value"},"question":{"dtype":"string","id":null,"_type":"Value"},"answer":{"aliases":{"feature":{"dtype":"string","id":null,"_type":"Value"},"length":-1,"id":null,"_type":"Sequence"},"value":{"dtype":"string","id":null,"_type":"Value"}},"search_results":{"feature":{"description":{"dtype":"string","id":null,"_type":"Value"},"filename":{"dtype":"string","id":null,"_type":"Value"},"rank":{"dtype":"int32","id":null,"_type":"Value"},"title":{"dtype":"string","id":null,"_type":"Value"},"url":{"dtype":"string","id":null,"_type":"Value"},"search_context":{"dtype":"string","id":null,"_type":"Value"}},"length":-1,"id":null,"_type":"Sequence"}},"post_processed":null,"supervised_keys":null,"task_templates":null,"builder_name":"triviaqa","config_name":"triviaqa","version":{"version_str":"0.0.1","description":null,"major":0,"minor":0,"patch":1},"splits":{"train":{"name":"train","num_bytes":1271393601,"num_examples":87622,"dataset_name":"triviaqa"},"validation":{"name":"validation","num_bytes":163819509,"num_examples":11313,"dataset_name":"triviaqa"}},"download_checksums":{"http://eaidata.bmk.sh/data/triviaqa-unfiltered.tar.gz":{"num_bytes":546481381,"checksum":"adc19b42769062d241a8fbe834c56e58598d9322eb6c614e9f33a68a2cf5523e"}},"download_size":546481381,"post_processing_size":null,"dataset_size":1435213110,"size_in_bytes":1981694491}}