1. 19 Aug, 2024 1 commit
    • Add TMLU Benchmark Dataset (#2093) · ca3d86d6
      Yen-Ting Lin authored
      
      
      * add taiwan truthful qa
      
      * add tmlu
      
      * Add .gitignore entries for evals/ and harness_eval_main_log.txt, and add harness_eval.slurm script
      
      * add pega eval and legal eval
      
      * add ccp eval
      
      * Update .gitignore and harness_eval.slurm
      
      * Add trust_remote_code and wandb_args to harness_eval.slurm, and add run_all.sh script (see the Python sketch below)
      
      * Add Pega MMLU task and configuration files
      
      * Add new models and update parameters in run_all.sh
      
      * Add UMTCEval tasks and configurations
      
      * Update dataset paths and output path
      
      * Update .gitignore and harness_eval.slurm, and modify _generate_configs.py
      
      * Update SLURM script and add new models
      
      * clean up for PR
      
      * Update lm_eval/tasks/tmlu/default/tmlu.yaml
      Co-authored-by: Lintang Sutawika <lintang@sutawika.com>
      
      * adjust tag name
      
      * removed group alias from tasks
      
      * format
      
      ---------
      Co-authored-by: Lintang Sutawika <lintang@sutawika.com>
      Co-authored-by: lintangsutawika <lintang@eleuther.ai>
      Co-authored-by: Yen-Ting Adam, Lin <r08944064@csie.ntu.edu.tw>
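      
      For context, a minimal sketch of running the new TMLU tasks through the harness's Python API. The model id, batch size, and few-shot count are placeholders, the `tmlu` group name is inferred from the task paths above, and exact trust_remote_code handling varies by harness version.
      
      ```python
      # Minimal sketch: evaluate a Hugging Face model on the TMLU tasks added here.
      # Placeholders: model id, batch size, few-shot count.
      import lm_eval
      
      results = lm_eval.simple_evaluate(
          model="hf",
          model_args="pretrained=some-org/some-model,trust_remote_code=True",  # hypothetical model id
          tasks=["tmlu"],  # group name inferred from lm_eval/tasks/tmlu/
          num_fewshot=5,
          batch_size=8,
      )
      print(results["results"])
      ```
      
      The harness_eval.slurm script presumably wraps the equivalent CLI call, where `--trust_remote_code` and `--wandb_args` are the flags this commit mentions.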
  2. 05 Apr, 2024 1 commit
    • TMMLU+ implementation (#1394) · 9ae96cdf
      ZoneTwelve authored
      
      
      * implementation of TMMLU+
      
      * implemented: TMMLU+
      
      **TMMLU+: Large-scale Traditional Chinese Massive Multitask Language Understanding**
      
      - 4 categories
          - STEM
          - Social Science
          - Humanities
          - Other
      
      The TMMLU+ dataset, encompassing over 67 subjects and 20,160 tasks, is six times larger and more balanced than its predecessor, TMMLU. It also includes benchmark results from closed-source models and from 20 open-weight Chinese large language models ranging from 1.8B to 72B parameters. Traditional Chinese variants, however, continue to underperform relative to major Simplified Chinese models.
      
      ```text
      Total number of tasks in the 'test' sets: 20160
      Total number of tasks in the 'validation' sets: 2247
      Total number of tasks in the 'train' sets: 335
      ```
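      
      A quick way to sanity-check those per-split counts for a single subject; the Hugging Face dataset id `ikala/tmmluplus` and the subject name are assumptions about the upstream release:
      
      ```python
      # Sketch: inspect one TMMLU+ subject's splits.
      # "ikala/tmmluplus" and "engineering_math" are assumed names.
      from datasets import load_dataset
      
      subject = load_dataset("ikala/tmmluplus", "engineering_math")
      for split in ("train", "validation", "test"):
          print(split, len(subject[split]))
      ```
      
      Summing the three splits over all subjects should reproduce the totals above.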
      
      * Remove print from __init__.py
      
      I forgot to remove a debug print from the code.
      
      * update: move the TMMLU+ config-generation script into the default directory
      
      * fix: use the training set for few-shot examples (see the config sketch below)
      
      * update: README for TMMLU+
      
      * update: small changes to the TMMLU+ README file
      
      * pre-commit run-through
      
      * Add README for TMMLU+ dataset
      
      * run pre-commit
      
      * trigger pre-commit again
      
      * trigger pre-commit again
      
      * isort is fussy
      
      * isort is fussy
      
      * format, again
      
      * oops
      
      * oops
      
      ---------
      Co-authored-by: lintang <lintang@eleuther.ai>
      Co-authored-by: haileyschoelkopf <hailey@eleuther.ai>
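      
      Per the few-shot fix above, a hypothetical sketch of what a _generate_configs.py-style script might emit for each subject; the subject list, dataset id, and exact key set are illustrative, not the repository's actual schema:
      
      ```python
      # Hypothetical sketch: emit one lm-eval task YAML per TMMLU+ subject,
      # drawing few-shot exemplars from the train split (per the fix above).
      import yaml  # PyYAML
      
      SUBJECTS = ["engineering_math", "taiwanese_hokkien"]  # illustrative subset
      
      for subject in SUBJECTS:
          config = {
              "task": f"tmmluplus_{subject}",
              "dataset_path": "ikala/tmmluplus",  # assumed HF dataset id
              "dataset_name": subject,
              "test_split": "test",
              "validation_split": "validation",
              "fewshot_split": "train",  # few-shot examples from train, not test
          }
          with open(f"tmmluplus_{subject}.yaml", "w") as f:
              yaml.dump(config, f, sort_keys=False)
      ```
      
      The key line is fewshot_split: train, which keeps exemplars out of the evaluation splits.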