*tinyBenchmarks* can evaluate different benchmarks with a fraction of their examples.
To obtain accurate results, this task applies post-processing using the *tinyBenchmarks*-package.
You can install the package by running the following commands on the terminal (for more information see [here](https://github.com/felipemaiapolo/tinyBenchmarks/blob/main/README.md?plain=1)):
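```bash
# Install the tinyBenchmarks package directly from GitHub
pip install git+https://github.com/felipemaiapolo/tinyBenchmarks
```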
Through the package, the ability parameter $\theta$ from the IRT model is estimated using all the available data. For `benchmark='lb'` or `benchmark='helm_lite'`, the dimension of `y` should be 600 and 1000, respectively, where the correctness values must obey the following order:

- For the Open LLM Leaderboard: TruthfulQA, GSM8K, Winogrande, ARC, HellaSwag, and MMLU;

The value that is returned by the task corresponds to the '**IRT++**' method from the [original paper](https://arxiv.org/abs/2402.14992).
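For illustration, a direct call to the package might look like the following minimal sketch, where the correctness vector `y` is random placeholder data rather than real model scores:

```python
import numpy as np
import tinyBenchmarks as tb

# Placeholder correctness vector for benchmark='lb' (Open LLM Leaderboard):
# 600 entries, ordered as TruthfulQA, GSM8K, Winogrande, ARC, HellaSwag, MMLU.
y = np.random.binomial(1, 0.5, size=600).astype(float)

# Estimates the IRT ability parameter theta from all available entries and
# returns performance estimates for the full benchmarks.
results = tb.evaluate(y, 'lb')
print(results)
```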
Evaluate specific tasks individually (e.g. `--tasks tinyHellaswag`) or all [open LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) tasks by specifying `--tasks tinyBenchmarks`.
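For instance, a run over the full group might look like this, where the model name is a placeholder:

```bash
lm_eval --model hf \
    --model_args pretrained=<your-model> \
    --tasks tinyBenchmarks \
    --batch_size 4
```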
### Advanced usage
To obtain the estimated accuracies for all methods from the original paper, the *tinyBenchmarks*-package has to be applied manually.
To do so, run the evaluation with the `--log_samples` and `--output_path` arguments. For example:
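```bash
# Illustrative call; replace <your-model> and <output_path> with real values.
lm_eval --model hf \
    --model_args pretrained=<your-model> \
    --tasks tinyHellaswag \
    --batch_size 4 \
    --output_path <output_path> \
    --log_samples
```

Afterwards, the logged samples can be post-processed with the package. The following is a minimal sketch, assuming the samples were written as a JSON-lines file under `<output_path>` and that the benchmark's metric is `acc_norm` (the metric key and file name depend on the benchmark and harness version):

```python
import json

import numpy as np
import tinyBenchmarks as tb

# Choose the benchmark matching the evaluated task (e.g. hellaswag for tinyHellaswag).
benchmark = 'hellaswag'  # possible benchmarks: ['lb', 'mmlu', 'truthfulqa',
                         # 'gsm8k', 'winogrande', 'arc', 'hellaswag']

# Read the per-example records written by --log_samples (path is a placeholder).
file_path = '<output_path>/<output-file>.jsonl'
with open(file_path, 'r') as f:
    outputs = [json.loads(line) for line in f]

# Restore the original dataset order of the examples.
outputs = sorted(outputs, key=lambda x: x['doc_id'])

# Build the correctness vector; the metric key (here `acc_norm`) depends on the benchmark.
y = np.array([float(item['acc_norm']) for item in outputs])

# Returns the performance estimates from all methods of the paper, including IRT++.
tb.evaluate(y, benchmark)
```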