# TMLU

### Paper

Title: `Measuring Taiwanese Mandarin Language Understanding`

Abstract: `The evaluation of large language models (LLMs) has drawn substantial attention in the field recently. This work focuses on evaluating LLMs in a Chinese context, specifically for Traditional Chinese, which has been largely underrepresented in existing benchmarks. We present TMLU, a holistic evaluation suite tailored for assessing the advanced knowledge and reasoning capabilities of LLMs in the context of Taiwanese Mandarin. TMLU consists of an array of 37 subjects across social science, STEM, humanities, Taiwan-specific content, and others, ranging from middle school to professional levels. In addition, we curate chain-of-thought-like few-shot explanations for each subject to facilitate the evaluation of complex reasoning skills. To establish a comprehensive baseline, we conduct extensive experiments and analysis on 24 advanced LLMs. The results suggest that Chinese open-weight models demonstrate inferior performance compared to multilingual proprietary ones, and that open-weight models tailored for Taiwanese Mandarin lag behind their Simplified-Chinese counterparts. The findings indicate considerable headroom for improvement and emphasize the goal of TMLU: to foster the development of localized Taiwanese-Mandarin LLMs. We release the benchmark and evaluation scripts for the community to promote future research.`

Homepage: [TMLU Huggingface Dataset](https://huggingface.co/datasets/miulab/tmlu)

### Citation

```
@article{DBLP:journals/corr/abs-2403-20180,
  author     = {Po{-}Heng Chen and Sijia Cheng and Wei{-}Lin Chen and Yen{-}Ting Lin and Yun{-}Nung Chen},
  title      = {Measuring Taiwanese Mandarin Language Understanding},
  journal    = {CoRR},
  volume     = {abs/2403.20180},
  year       = {2024},
  url        = {https://doi.org/10.48550/arXiv.2403.20180},
  doi        = {10.48550/ARXIV.2403.20180},
  eprinttype = {arXiv},
  eprint     = {2403.20180},
  timestamp  = {Wed, 10 Apr 2024 17:37:45 +0200},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2403-20180.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```

### Groups and Tasks

#### Groups

* `tmlu`: The full benchmark, comprising 2,981 multiple-choice questions from 37 subjects.

#### Tasks

The following tasks evaluate individual subjects in the TMLU dataset using loglikelihood-based multiple-choice scoring (a usage sketch follows the checklist below):

* `tmlu_{subject_english}`

### Checklist

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
  * [x] Have you referenced the original paper that introduced the task?
  * [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?

If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
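
### Usage example

As a rough illustration, the sketch below runs the `tmlu` group through the lm-evaluation-harness Python API (`lm_eval.simple_evaluate`). The checkpoint name, few-shot count, batch size, and device are illustrative placeholders, not settings prescribed by this task.

```python
# Minimal sketch, assuming the standard lm-evaluation-harness Python API.
# The checkpoint, few-shot count, batch size, and device are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # HuggingFace transformers backend
    model_args="pretrained=Qwen/Qwen1.5-7B-Chat",  # placeholder checkpoint
    tasks=["tmlu"],                                # group covering all 37 subject tasks
    num_fewshot=5,                                 # few-shot setting; adjust as needed
    batch_size=8,
    device="cuda:0",
)

# Per-subject metrics are keyed by task name, e.g. "tmlu_{subject_english}".
for task_name, metrics in results["results"].items():
    print(task_name, metrics)
```

Equivalently, the same run can be launched from the command line with `lm_eval --model hf --model_args pretrained=<checkpoint> --tasks tmlu --num_fewshot 5`.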