@@ -9,7 +9,7 @@ This project provides a unified framework to test autoregressive language models
Features:
-100+ tasks implemented
+200+ tasks implemented
- Support for GPT-2, GPT-3, GPT-Neo, GPT-NeoX, and GPT-J, with flexible tokenization-agnostic interface
- Task versioning to ensure reproducibility
...
@@ -51,6 +51,15 @@ python main.py \
--tasks lambada,hellaswag
```
And if you want to verify the data integrity of the tasks you're performing in addition to running the tasks themselves, you can use the `--check_integrity` flag:
```bash
python main.py \
--model gpt3 \
    --model_args engine=davinci \
--tasks lambada,hellaswag \
--check_integrity
```
To evaluate mesh-transformer-jax models that are not available on HF, please invoke eval harness through [this script](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py).
## Implementing new tasks
...
@@ -90,257 +99,269 @@ To implement a new task in eval harness, see [this guide](./docs/task_guide.md).
### Full Task List
| Task Name |Train|Val|Test|Val/Test Docs| Metrics |