This document contains additional context on the settings and parameters for how we evaluated the pre-trained and instruct-aligned models.
### Auto-eval benchmark notes
#### MMLU
- We are reporting macro averages for MMLU benchmarks. The micro average numbers for MMLU are 65.4 and 67.4 for the 8B pre-trained and instruct-aligned models, and 78.9 and 82.0 for the 70B pre-trained and instruct-aligned models (a macro vs. micro averaging sketch follows this list).
- The pre-trained models are evaluated in the standard way by calculating the likelihood of each choice character. For the instruct-aligned models, we use a dialogue prompt (*user/assistant*) for the shots and ask the model to generate the best choice character as the answer (see the likelihood-scoring sketch below).
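The difference between the two averages is whether each subtask or each question carries equal weight: macro averaging gives every MMLU subtask the same weight, while micro averaging pools all questions, so large subtasks dominate. A minimal sketch (the subtask names and counts below are illustrative, not the actual MMLU data):

```python
from typing import Dict, Tuple

def mmlu_averages(results: Dict[str, Tuple[int, int]]) -> Tuple[float, float]:
    """Compute macro and micro accuracy from per-subtask (correct, total) counts.

    Macro: mean of per-subtask accuracies (each subtask weighted equally).
    Micro: pooled accuracy over all questions (each question weighted equally).
    """
    per_task_acc = [correct / total for correct, total in results.values()]
    macro = sum(per_task_acc) / len(per_task_acc)
    total_correct = sum(correct for correct, _ in results.values())
    total_questions = sum(total for _, total in results.values())
    micro = total_correct / total_questions
    return macro, micro

# Illustrative counts only; MMLU has 57 subtasks of varying size.
results = {
    "abstract_algebra": (53, 100),
    "professional_law": (1020, 1534),
    "high_school_biology": (248, 310),
}
macro, micro = mmlu_averages(results)
print(f"macro={macro:.3f} micro={micro:.3f}")
```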
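The two protocols differ in how the answer is extracted. Below is a sketch of the likelihood-based scoring used for the pre-trained models, written against the Hugging Face `transformers` API as an assumed harness; the checkpoint (`gpt2`) and the prompt template are placeholders, not the actual setup used for these results:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint for illustration; not the evaluated model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def score_choices(prompt: str, choices=("A", "B", "C", "D")) -> str:
    """Pick the choice character with the highest next-token log-likelihood."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # distribution over the next token
    log_probs = torch.log_softmax(logits, dim=-1)
    scores = {}
    for choice in choices:
        # Leading space so the character tokenizes as it would after "Answer:"
        token_id = tokenizer.encode(" " + choice, add_special_tokens=False)[0]
        scores[choice] = log_probs[token_id].item()
    return max(scores, key=scores.get)

prompt = "Question: 2 + 2 = ?\nA. 3\nB. 4\nC. 5\nD. 6\nAnswer:"
print(score_choices(prompt))
```

For the instruct-aligned models, the shots are instead wrapped in *user/assistant* turns and the choice character the model generates is parsed as its answer, rather than comparing likelihoods.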
#### AGIEval English
- We use the default few-shot and prompt settings as specified [here](https://github.com/ruixiangcui/AGIEval). The score is averaged over the English subtasks.