**Ca**usal evaluation of **L**anguage **M**odels (CaLM), to the best of our knowledge, is the first comprehensive benchmark for evaluating the causal reasoning capabilities of language models. The CaLM framework establishes a foundational taxonomy consisting of four modules: causal target (i.e., what to evaluate), adaptation (i.e., how to obtain the results), metric (i.e., how to measure the results), and error (i.e., how to analyze the bad results).
[📃 Report](https://arxiv.org/abs/2405.00622) | [🎆 Github](https://github.com/OpenCausaLab/CaLM) | 📧 Welcome to join us by email at causalai@pjlab.org.cn
If you want detailed information for each task, use:
```
python run.py --models YOUR_MODEL --datasets calm
```
Adding the `--summarizer calm` flag to this command generates a summarized output, while omitting it (as shown above) provides task-specific details.
## Available Causal Tasks
We provide 92 tasks for causal evaluation, stored in the `data/calm` folder. For more information about our causal tasks, refer to [tasks](https://github.com/OpenCausaLab/CaLM/blob/main/documents/tasks.md).
The directory structure is:
```
├── calm
│   ├── association
│   ├── causal_discovery                 # Rung of the causal ladder
│   │   ├── abstract_reasoning           # Causal scenario
│   │   │   ├── AR-B_CaLM-AR_CN.json     # Causal task
│   │   │   └── AR-B_CaLM-AR_EN.json     # Causal task
│   │   └── ...
│   └── ...
└── ...
```
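If you want a quick local overview of the available tasks, a short script can walk this tree. The sketch below relies only on the layout shown above (rung / scenario / task JSON) and does not inspect the contents of the JSON files.

```python
# Minimal sketch: count causal-task files per rung/scenario under data/calm.
# It relies only on the directory layout shown above, not on the JSON schema.
from collections import Counter
from pathlib import Path

root = Path("data/calm")
counts = Counter(path.parent.relative_to(root).as_posix() for path in root.rglob("*.json"))
for scenario, n_tasks in sorted(counts.items()):
    print(f"{scenario}: {n_tasks} task file(s)")
```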
## Dataset
- **Dataset size**: CaLM Lite uses a light dataset of **9,200** samples, while CaLM uses a significantly larger dataset of **126,334**. The table below details the English dataset composition; the Chinese version is structured identically.
- **Dataset configuration**: We prioritize balance in our dataset for **binary classification** and **choice selection** questions. By ensuring an equal number of each ground-truth (GT) label, we minimize the risk of introducing bias into model testing (a quick check of this balance is sketched after the table below). For **probability calculation**, CaLM Lite pays extra attention to balancing the number of problems across different causal reasoning processes. (For more details on how the causal reasoning process is defined, please refer to Section 9.1.6 of the [paper](https://arxiv.org/abs/2405.00622).)
- **Efficient evaluation**: For enhanced evaluation efficiency, OpenCompass offers customizable methods. Refer to the [documentation](https://opencompass.org.cn/doc) for guidance on tailoring these methods to your needs.
| Causal ladder | Causal scenario | Subset | Question type | Mode | CaLM Lite | CaLM |
| --- | --- | --- | --- | --- | --- | --- |
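As a quick illustration of the GT-label balance described above, the hedged sketch below tallies labels in a single binary-classification task file. The label field name (`gt`) and the list-of-objects file layout are assumptions for illustration only; inspect the actual JSON schema before relying on them.

```python
# Hedged sketch: tally ground-truth labels in one binary-classification task.
# Assumes the file holds a list of JSON objects with a label field named "gt";
# check the real schema and adjust before use.
import json
from collections import Counter
from pathlib import Path

task_file = Path("data/calm/causal_discovery/abstract_reasoning/AR-B_CaLM-AR_EN.json")
with task_file.open(encoding="utf-8") as f:
    items = json.load(f)

label_counts = Counter(item.get("gt") for item in items)
print(label_counts)  # a balanced task should show roughly equal counts per label
```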
Basic Prompt is our default setting for efficient evaluation of CaLM Lite, but we provide flexibility for exploring additional prompts through CaLM. If you'd like to explore and compare a wider range of prompts, we encourage you to use CaLM. We provide a comprehensive and easy-to-follow guide to assist you in our [repository](https://github.com/OpenCausaLab/CaLM).
## Citation
```
@misc{chen2024causal,
title={Causal Evaluation of Language Models},
author={Sirui Chen and Bo Peng and Meiqi Chen and Ruiqi Wang and Mengying Xu and Xingyu Zeng and Rui Zhao and Shengjie Zhao and Yu Qiao and Chaochao Lu},
year={2024},
eprint={2405.00622},
archivePrefix={arXiv}
}
```
As a grading expert, please judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either "A" or "B". Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
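To show how a reply that follows this template might be consumed downstream, here is a minimal, hypothetical parsing helper. It assumes the judge returns the bare letter "A" or "B" as instructed above; the function name and fallback behaviour are illustrative and not part of any existing pipeline.

```python
# Hypothetical helper: map a judge reply produced by the template above to a
# boolean verdict. Returns None when the reply does not follow the format.
from typing import Optional

def parse_grade(judge_reply: str) -> Optional[bool]:
    verdict = judge_reply.strip().strip('"').upper()
    if verdict in ("A", "A: CORRECT", "CORRECT"):
        return True   # A: CORRECT
    if verdict in ("B", "B: INCORRECT", "INCORRECT"):
        return False  # B: INCORRECT
    return None       # unparseable reply; handle separately

assert parse_grade('"A"') is True
assert parse_grade("B") is False
```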
verify_prompt_yes_no="Below is a chemistry exam question and a student's answer:\n##Question##\n{prompt}\n\n##Student's Answer##\n{prediction}\n\nThe standard answer for this question is as follows:\n##Standard Answer##\n{output}\n\nNow, based on the standard answer, determine whether the student's answer is correct. (Please note that the same chemical expression may have different formats or equivalent forms). You only need to focus on:\n1. Whether the student's answer matches the result of the standard answer (without focusing too much on the method).\n2. Whether the student's answer seems to be guessed or is a vague answer. If the student's answer is correct (if there are multiple questions, all sub-questions must be answered correctly), please reply directly with:\n**Correct Answer**\nIf the student's answer is incorrect, please reply directly with:\n**Incorrect Answer**"
verify_prompt_score="""Below is a chemistry exam question and a student's answer:
##Question##
{prompt}
##Student's Answer##
{prediction}
##Standard Answer##
{output}
Now, please compare the student's answer with the standard answer. Assume the question consists of multiple sub-questions. For each sub-question, determine if the student's answer is correct by the following criteria:
Evaluation criteria:
1. Only consider whether the final result of each sub-question matches the standard answer. Equivalent chemical expressions or formats should be accepted.
2. Do not focus on the student's method, only the correctness of the final result.
3. If the correct answer is a chemical formula and the student provides a description instead, the description must be specific and fully correspond to the chemical formula. Vague or imprecise descriptions are incorrect.
4. If a student's answer is vague, unclear, or appears to be guessed, mark it as incorrect.
5. If a sub-question contains multiple parts or answers, award partial credit based on how many parts of the answer are correct. Each correct part within a sub-question should be given partial credit.
Return a single score: the proportion of correctly answered sub-questions (number of correct answers (might be float number) divided by the total number of sub-questions).
Format your final answer as: \\boxed{{score}}, where score is a decimal between 0 and 1."""
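Since this scoring prompt asks the judge to report the final score as `\boxed{score}`, the reply needs post-processing. The sketch below is one plausible way to do that; the regex and the clamping to [0, 1] are assumptions rather than the benchmark's actual extraction code.

```python
# Hedged sketch: extract the last \boxed{...} value from a judge reply and
# clamp it to [0, 1]; returns None when no parseable score is found.
import re
from typing import Optional

def extract_boxed_score(judge_reply: str) -> Optional[float]:
    matches = re.findall(r"\\boxed\{([^}]*)\}", judge_reply)
    if not matches:
        return None
    try:
        score = float(matches[-1])
    except ValueError:
        return None
    return min(max(score, 0.0), 1.0)

assert extract_boxed_score(r"2 of 4 sub-questions correct, so \boxed{0.5}") == 0.5
```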
**Chinese SimpleQA** is the first comprehensive Chinese benchmark for evaluating the factuality of language models when answering short questions. Chinese SimpleQA has five main properties (i.e., Chinese, Diverse, High-quality, Static, Easy-to-evaluate). Specifically, our benchmark covers **6 major topics** with **99 diverse subtopics**.
Please visit our [website](https://openstellarteam.github.io/ChineseSimpleQA/) or check our [paper](https://arxiv.org/abs/2411.07140) for more details.
## 💫 Introduction
* How to mitigate the generative hallucination of models remains an unsolved problem in the field of artificial intelligence (AI). To measure the factual correctness of language models, OpenAI recently released and open-sourced a test set called SimpleQA. We have also been paying attention to the field of factuality, which currently suffers from outdated data, inaccurate evaluation, and incomplete coverage. For example, the knowledge evaluation sets still in wide use are CommonSenseQA, CMMLU, and C-Eval, all of which are multiple-choice evaluation sets. **To further promote research in the Chinese community on the factual correctness of models, we propose Chinese SimpleQA**, which consists of 3000 high-quality questions spanning 6 major topics, ranging from the humanities to science and engineering. Specifically, the distinct main features of our proposed Chinese SimpleQA dataset are as follows:
* 🀄**Chinese:** Our Chinese SimpleQA focuses on the Chinese language, which provides a comprehensive evaluation of the factuality abilities of existing LLMs in Chinese.
* 🍀**Diverse:** Chinese SimpleQA covers 6 topics (i.e., “Chinese Culture”, “Humanities”, “Engineering, Technology, and Applied Sciences”, “Life, Art, and Culture”, “Society”, and “Natural Science”), and these topics include 99 fine-grained subtopics in total, which demonstrates the diversity of our Chinese SimpleQA.
* ⚡**High-quality:** We conduct a comprehensive and rigorous quality control process to ensure the quality and accuracy of our Chinese SimpleQA.
* 💡**Static:** Following SimpleQA, to preserve the evergreen property of Chinese SimpleQA, all reference answers will not change over time.
* 🗂️**Easy-to-evaluate:** Following SimpleQA, as the questions and answers are very short, the grading procedure is fast to run via existing LLMs (e.g., OpenAI API).
- Based on Chinese SimpleQA, we have conducted a comprehensive evaluation of the factual capabilities of existing LLMs. We also maintain a comprehensive leaderboard list.
- In short, we hope that Chinese SimpleQA helps developers gain a deeper understanding of the factual correctness of their models in the Chinese domain, provides an important cornerstone for their algorithm research, and jointly promotes the growth of Chinese foundation models.
## 📊 Leaderboard
See the full leaderboard: [📊](http://47.109.32.164/)
## ⚖️ Evals
We provide three evaluation methods.
(1) The first method is based on the simple-evals framework. The startup command is as follows:
```bash
python -m simple-evals.demo
```
This will launch evaluations through the OpenAI API.
(2) The second method is a simple standalone evaluation script that we wrote from scratch. The startup commands are as follows:
- Step 1: set your OpenAI key in `scripts/chinese_simpleqa_easy.py`:
```
os.environ["OPENAI_API_KEY"] = "replace your key here"
```
- Step 2: run the eval script:
```
python scripts/chinese_simpleqa_easy.py
```
- Step 3: we also provide a unified processing script for multiple model results. After running it, you can get a complete leaderboard:
```
python scripts/get_leaderboard.py
```
(3) We also integrated our Chinese SimpleQA benchmark into our forked [OpenCompass](https://github.com/open-compass/opencompass). You can refer to the OpenCompass configuration script for evaluation:
- Step 1: set up our forked OpenCompass by following its installation instructions.
- Step 2: download the Chinese SimpleQA data from [huggingface](https://huggingface.co/datasets/OpenStellarTeam/Chinese-SimpleQA) and put it under `OPENCOMPASS_PATH/data/chinese_simpleqa`, so that the path looks like this (an optional sanity check is shown after these steps):
```
~/opencompass/data/
└── chinese_simpleqa
    └── chinese_simpleqa.jsonl
```
- Step 3: configure your launch in `examples/eval_chinese_simpleqa.py`: set the models to be evaluated, set your judge model (we recommend GPT-4o), and launch it:
```
python run.py examples/eval_chinese_simpleqa.py
```
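Before launching Step 3, you can optionally sanity-check the data placement from Step 2. The snippet below assumes it is run from the OpenCompass repository root; the line count is just a rough confirmation that the JSONL file was downloaded intact.

```python
# Optional sanity check (run from the OpenCompass root): confirm the benchmark
# file is where the config expects it and count its non-empty lines.
from pathlib import Path

data_file = Path("data/chinese_simpleqa/chinese_simpleqa.jsonl")
assert data_file.exists(), f"missing {data_file}; re-check the download step"
with data_file.open(encoding="utf-8") as f:
    n_questions = sum(1 for line in f if line.strip())
print(f"{data_file} contains {n_questions} questions")
```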
## Citation
Please cite our paper if you use our dataset.
```
@misc{he2024chinesesimpleqachinesefactuality,
title={Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models},
author={Yancheng He and Shilong Li and Jiaheng Liu and Yingshui Tan and Weixun Wang and Hui Huang and Xingyuan Bu and Hangyu Guo and Chengwei Hu and Boren Zheng and Zhuoran Lin and Xuepeng Liu and Dekai Sun and Shirong Lin and Zhicheng Zheng and Xiaoyong Zhu and Wenbo Su and Bo Zheng},