Unverified commit 3098d788 authored by Connor-Shen, committed by GitHub

[Bench] Support APPS (#963)

* [Feat] support apps
* update README
# APPS
## Dataset Description
APPS is a benchmark for code generation containing 10,000 problems, split evenly between train and test sets. It can be used to evaluate the ability of language models to generate code from natural-language specifications.
## Dataset Structure
```python
DatasetDict({
    train: Dataset({
        features: ['problem_id', 'question', 'solutions', 'input_output', 'difficulty', 'url', 'starter_code'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['problem_id', 'question', 'solutions', 'input_output', 'difficulty', 'url', 'starter_code'],
        num_rows: 5000
    })
})
```
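Note that the `solutions` and `input_output` columns are stored as JSON-encoded strings rather than nested objects, so they need to be decoded before use. A minimal sketch, assuming the Hugging Face `datasets` library (both fields can be empty for some problems):

```python
import json
from datasets import load_dataset

ds = load_dataset("codeparrot/apps", split="train")
sample = ds[0]

# `solutions` is a JSON-encoded list of reference programs;
# `input_output` is a JSON-encoded dict of test inputs/outputs.
solutions = json.loads(sample["solutions"]) if sample["solutions"] else []
tests = json.loads(sample["input_output"]) if sample["input_output"] else {}

print(sample["difficulty"], len(solutions), len(tests.get("inputs", [])))
```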
## How to Use
The dataset can be filtered by difficulty level: introductory, interview, and competition. Pass a list of difficulty levels to the loader; for example, to select only the most challenging problems, request the competition level:
```python
from datasets import load_dataset

ds = load_dataset("codeparrot/apps", split="train", difficulties=["competition"])
print(next(iter(ds))["question"])
```
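If you prefer not to rely on the loading script's `difficulties` keyword, the same selection can be done after loading with a standard `datasets` filter. A sketch, assuming the `difficulty` column values listed above:

```python
from datasets import load_dataset

ds = load_dataset("codeparrot/apps", split="train")
competition = ds.filter(lambda ex: ex["difficulty"] == "competition")
print(len(competition))
```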
## Evaluation Results

| Dataset | Metric | Baichuan2-7B | Baichuan2-13B | InternLM2-7B | InternLM2-20B |
|-----------------|--------|--------------|---------------|--------------|---------------|
| APPS (test set) | pass@1 | 0.0 | 0.06 | 0.0 | 0.0 |

Please refer to Table 3 of the [Code Llama paper](https://scontent-nrt1-2.xx.fbcdn.net/v/t39.2365-6/369856151_1754812304950972_1159666448927483931_n.pdf?_nc_cat=107&ccb=1-7&_nc_sid=3c67a6&_nc_ohc=TxT1PKkNBZoAX8zMHbm&_nc_ht=scontent-nrt1-2.xx&oh=00_AfDmmQAPzqX1-QOKIDUV5lGKzaZqt0CZUVtxFjHtnh6ycQ&oe=65F5AF8F) for the original results.
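pass@k is conventionally computed with the unbiased estimator from Chen et al., 2021 ("Evaluating Large Language Models Trained on Code"): given n generations per problem, c of which pass all tests, pass@k = 1 - C(n-c, k)/C(n, k), averaged over problems. With a single generation per problem (num_repeats=1 in the config below), pass@1 reduces to the fraction of problems solved. A minimal sketch of the estimator, not OpenCompass's exact implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn from n generations (c of which are correct) passes."""
    if n - c < k:
        return 1.0  # every size-k draw contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=1, k=1))  # 0.1
```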
## Citation
```
@article{hendrycksapps2021,
title={Measuring Coding Challenge Competence With APPS},
author={Dan Hendrycks and Steven Basart and Saurav Kadavath and Mantas Mazeika and Akul Arora and Ethan Guo and Collin Burns and Samir Puranik and Horace He and Dawn Song and Jacob Steinhardt},
journal={NeurIPS},
year={2021}
}
```
The entry config re-exports the concrete dataset config; this commit updates the import from `apps_gen_7fbb95` to `apps_gen_d82929` and renames `apps_datasets` to `APPS_datasets`:

```python
from mmengine.config import read_base

with read_base():
    from .apps_gen_d82929 import APPS_datasets  # noqa: F401, F403
```
`apps_gen_d82929.py`:

```python
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import APPSDataset, APPSEvaluator

# Read `question` and `starter` as inputs; evaluation runs on the test split.
APPS_reader_cfg = dict(
    input_columns=["question", "starter"],
    output_column="problem_id",
    train_split="test")

# Zero-shot generation: no in-context examples, up to 512 output tokens.
APPS_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template="\nQUESTION:\n{question} {starter}\nANSWER:\n"),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer, max_out_len=512),
)

APPS_eval_cfg = dict(evaluator=dict(type=APPSEvaluator), pred_role="BOT")

APPS_datasets = [
    dict(
        type=APPSDataset,
        abbr="apps",
        path="codeparrot/apps",
        num_repeats=1,  # one generation per problem, i.e. pass@1
        reader_cfg=APPS_reader_cfg,
        infer_cfg=APPS_infer_cfg,
        eval_cfg=APPS_eval_cfg,
    )
]
```
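To see exactly what the model receives, the prompt template above can be rendered by hand with plain string formatting. A short sketch; the sample values are invented for illustration:

```python
template = "\nQUESTION:\n{question} {starter}\nANSWER:\n"

# Hypothetical problem and starter code, not taken from the dataset.
prompt = template.format(
    question="Given a list of integers, return the sum of the even ones.",
    starter="def sum_even(nums):",
)
print(prompt)
```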