humaneval instruct (#2650)

* add instruct humaneval
* nit
* add to readme
* nit

Unverified commit c8489857, authored Mar 11, 2025 by Baber Abbasi, committed by GitHub Mar 12, 2025
4 changed files
--- a/lm_eval/tasks/humaneval/README.md
+++ b/lm_eval/tasks/humaneval/README.md
@@ -8,6 +8,7 @@ We introduce Codex, a GPT language model fine-tuned on publicly available code f

 Homepage: https://github.com/openai/human-eval

+Note: For instruct-tuned models, we recommend the instruct variant, which uses a `gen_prefix` to ensure the model completes the partial code snippet (this might not work with all APIs).

 ## Citation
 ```
@@ -31,6 +32,8 @@ Homepage: https://github.com/openai/human-eval

 - `humaneval` pass@1
 - `humaneval_64` pass@64 variant
+- `humaneval_instruct`: pass@1 with a config better suited to instruct models (implementation taken from the Llama [evals](https://huggingface.co/datasets/meta-llama/Llama-3.1-8B-Instruct-evals/viewer/Llama-3.1-8B-Instruct-evals__human_eval__details?row=0))
+- `humaneval_64_instruct`: pass@64 variant of `humaneval_instruct`

 ### Checklist

@@ -44,3 +47,5 @@ If other tasks on this dataset are already supported:
 * [ ] Is the "Main" variant of this task clearly denoted?
 * [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
 * [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
+
+### Changelog
--- a/lm_eval/tasks/humaneval/humaneval_64_instruct.yaml
+++ b/lm_eval/tasks/humaneval/humaneval_64_instruct.yaml
+include: humaneval_64.yaml
+task: humaneval_64_instruct
+doc_to_text: "Write a solution to the following problem and make sure that it passes the tests:\n```{{prompt}}"
+gen_prefix: "Here is the completed function:\n```python\n{{prompt}}\n"
+filter_list:
+  - name: "create_test"
+    filter:
+      - function: "custom"
+        filter_fn: !function utils.build_predictions_instruct
--- a/lm_eval/tasks/humaneval/humaneval_instruct.yaml
+++ b/lm_eval/tasks/humaneval/humaneval_instruct.yaml
+include: humaneval.yaml
+task: humaneval_instruct
+doc_to_text: "Write a solution to the following problem and make sure that it passes the tests:\n```{{prompt}}"
+gen_prefix: "Here is the completed function:\n```python\n{{prompt}}\n"
+filter_list:
+  - name: "create_test"
+    filter:
+      - function: "custom"
+        filter_fn: !function utils.build_predictions_instruct
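
To make the two templates in the instruct configs concrete, here is a minimal sketch of the strings they are presumably meant to produce for one document. It does not use the harness itself: `render`, `DOC_TO_TEXT`, `GEN_PREFIX`, and the sample `doc` are hypothetical names, `{{prompt}}` is replaced with `str.format` instead of Jinja, and the escape sequences in `gen_prefix` are assumed to stand for real newlines.

```python
FENCE = "`" * 3  # a Markdown code fence, built here to keep this example readable

# Hypothetical HumanEval-style document; only the "prompt" field matters here.
doc = {"prompt": 'def add(a: int, b: int) -> int:\n    """Return a + b."""\n'}

# Stand-ins for the YAML templates (assumption: str.format instead of Jinja).
DOC_TO_TEXT = (
    "Write a solution to the following problem and make sure that it passes the tests:\n"
    + FENCE
    + "{prompt}"
)
GEN_PREFIX = "Here is the completed function:\n" + FENCE + "python\n{prompt}\n"


def render(doc: dict) -> tuple[str, str]:
    """Return (user turn, start of the assistant turn) for one document."""
    user_turn = DOC_TO_TEXT.format(prompt=doc["prompt"])
    assistant_prefix = GEN_PREFIX.format(prompt=doc["prompt"])
    return user_turn, assistant_prefix


user_turn, assistant_prefix = render(doc)
print(user_turn)
print("---")
print(assistant_prefix)
```

Because the assistant turn already opens a Python code fence and repeats the function signature, the model is nudged to continue with the function body only, which is what the `build_predictions_instruct` filter expects.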
--- a/lm_eval/tasks/humaneval/utils.py
+++ b/lm_eval/tasks/humaneval/utils.py
@@ -25,3 +25,15 @@ def pass_at_k(references: list[str], predictions: list[list[str]], k: list[int]

 def build_predictions(resps: list[list[str]], docs: list[dict]) -> list[list[str]]:
    return [[doc["prompt"] + r for r in resp] for resp, doc in zip(resps, docs)]
+
+
+def build_predictions_instruct(
+    resps: list[list[str]], docs: list[dict]
+) -> list[list[str]]:
+    return [
+        [
+            doc["prompt"] + (r if r.rfind("```") == -1 else r[: r.rfind("```")])
+            for r in resp
+        ]
+        for resp, doc in zip(resps, docs)
+    ]
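
The effect of `build_predictions_instruct` is easiest to see on a small input. The sketch below copies the function from the diff (with the fence held in a `FENCE` constant for readability) and applies it to two made-up responses. The `docs` and `resps` values are illustrative assumptions, not data from the benchmark.

```python
FENCE = "`" * 3  # the closing code fence the filter looks for


def build_predictions_instruct(
    resps: list[list[str]], docs: list[dict]
) -> list[list[str]]:
    # Keep everything before the last fence (if any) and prepend the prompt.
    return [
        [
            doc["prompt"] + (r if r.rfind(FENCE) == -1 else r[: r.rfind(FENCE)])
            for r in resp
        ]
        for resp, doc in zip(resps, docs)
    ]


# Hypothetical inputs: one document with two sampled responses.
docs = [{"prompt": "def add(a, b):\n"}]
resps = [
    [
        # The model closes the fence and adds commentary, which gets dropped.
        "    return a + b\n" + FENCE + "\nThis function adds two numbers.",
        # The model stops without closing the fence, so the response is kept whole.
        "    return a + b\n",
    ]
]

preds = build_predictions_instruct(resps, docs)
for pred in preds[0]:
    print(repr(pred))
```

Both responses reduce to the same executable snippet. Note that `rfind` cuts at the last fence in the response, so a response that opens a second code block after the first would keep the text between the two blocks as well.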