# eval_mmlu

## Dataset

**`datasets/MMLU/`.** Test set containing 57 subtasks. In every text, the content before `[SEP]` is the question, and the content after `[SEP]` is the reference answer to that question.

## Evaluation

### Introduction

**`examples/eval_mmlu_2x32B.sh`.** Running this script produces the evaluation results for MMLU. Set the variables in the script as follows:

| Variable name          | Description                                   |
| ---------------------- | --------------------------------------------- |
| `CHECKPOINT_PATH`      | Path to the checkpoint to be evaluated.       |
| `TOKENIZER_MODEL_PATH` | Path to the tokenizer.                        |
| `MATH_DATA`            | Path to the evaluation set.                   |
| `OUTPUT_PATH`          | Path where the evaluation results are saved.  |

### Usage

Run the following command to evaluate the model's performance on the test set:

```
bash -x examples/eval_mmlu_2x32B.sh
```

### Result

The evaluation results are saved under `OUTPUT_PATH`. In each output text, the content before `[SEP]` is the question, and the content after `[SEP]` is our model's answer to that question.

## Accuracy

### Introduction

**`tasks/MMLU/score_mmlu.py`.** Running this script computes the accuracy of the MMLU evaluation results. Set the path variables in the script as follows:

| Variable name      | Description                                   |
| ------------------ | --------------------------------------------- |
| `origin_file_path` | Path of the evaluation set file.              |
| `eval_file_path`   | Path of the saved evaluation result file.     |
| `txt_eval_res_dir` | Path for storing the split results. Files ending in `_true` contain correctly answered examples, while those ending in `_false` contain incorrectly answered ones. The CSV and JSONL files store accuracy statistics for each subtask. |

### Usage

Run the following command to score the evaluation results:

```
python score_mmlu.py
```

### Result

"Number of correct answers" and "Number of incorrect answers" give the total counts of correct and incorrect answers across all tasks, while "accuracy" is the overall accuracy, i.e. correct / (correct + incorrect).
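For reference, the sketch below illustrates the kind of `[SEP]`-based exact-match scoring described above. It is a minimal illustration, not the actual `score_mmlu.py` implementation: it assumes one example per line, that the evaluation set and the result file are aligned line by line, and that an answer counts as correct when the text after `[SEP]` exactly matches the reference answer. The file paths are placeholders.

```python
# Minimal sketch of [SEP]-based exact-match scoring.
# Hypothetical stand-in for tasks/MMLU/score_mmlu.py; paths are placeholders.

origin_file_path = "datasets/MMLU/test.txt"      # assumed filename
eval_file_path = "output/mmlu_eval_result.txt"   # assumed filename

def split_sep(line):
    """Split one record into (question, answer) on the [SEP] marker."""
    question, _, answer = line.partition("[SEP]")
    return question.strip(), answer.strip()

correct, incorrect = 0, 0
with open(origin_file_path, encoding="utf-8") as ref_f, \
     open(eval_file_path, encoding="utf-8") as eval_f:
    for ref_line, eval_line in zip(ref_f, eval_f):
        _, ref_answer = split_sep(ref_line)
        _, model_answer = split_sep(eval_line)
        if model_answer == ref_answer:  # exact match; the real script may normalize answers
            correct += 1
        else:
            incorrect += 1

total = correct + incorrect
print(f"Number of correct answers: {correct}")
print(f"Number of incorrect answers: {incorrect}")
if total:
    print(f"accuracy: {correct / total:.4f}")
```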
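Similarly, the per-subtask accuracy statistics described for `txt_eval_res_dir` could be accumulated along the lines of the hypothetical sketch below. It assumes each record's subtask name is recoverable (for example from its source filename); the real file layout and column names may differ.

```python
import csv
from collections import defaultdict

# Hypothetical per-subtask tally: {subtask: [correct, total]}.
stats = defaultdict(lambda: [0, 0])

def record(subtask, is_correct):
    """Count one scored example toward its subtask's statistics."""
    stats[subtask][0] += int(is_correct)
    stats[subtask][1] += 1

# After scoring, dump per-subtask accuracy to a CSV, mirroring the kind of
# accuracy statistics files described for txt_eval_res_dir.
with open("subtask_accuracy.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["subtask", "correct", "total", "accuracy"])
    for subtask, (ok, total) in sorted(stats.items()):
        writer.writerow([subtask, ok, total, ok / total if total else 0.0])
```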