FixBug: Align the Humaneval with official results for Llama-3.1-70B-Instruct (#3092)

* Fix: Align the HumanEval dataset with official results

  Details:
  (1) Modified "doc_to_text" and "gen_prefix" in the "humaneval_instruct.yaml" file so that they match the prompt in "meta-llama/Llama-3.1-70B-Instruct-evals".
  (2) Changed r.rfind("```") to r.find("```") so that the code is cut at the first "```", not the last one.

  Results: Partially reproduced the official results. LLaMA3.1-8B-Instruct scores 66.5 (the official result is 72.6), and LLaMA3.1-70B-Instruct scores 80.5 (the official result is 80.5).

  Ref: PR#2650

* add changelog and version

* add changelog

Unverified Commit a7ca0435 authored Jul 01, 2025 by jinze Committed by GitHub Jun 30, 2025
3 changed files
--- a/lm_eval/tasks/humaneval/README.md
+++ b/lm_eval/tasks/humaneval/README.md
@@ -50,3 +50,5 @@ If other tasks on this dataset are already supported:

 ### Changelog
 v2 20-MAR-2025: `humaneval_instruct`, `humaneval_instruct_64`: fixed typo in gen_prefix
+
+v3 30-JUN-2025: Updated prompt generation and output parsing to align with the official `Llama-3.1-70B-Instruct-evals`. This corrects the prompt format and fixes a bug in locating the code block. See PR [#3092](https://github.com/EleutherAI/lm-evaluation-harness/pull/3092).
--- a/lm_eval/tasks/humaneval/humaneval_instruct.yaml
+++ b/lm_eval/tasks/humaneval/humaneval_instruct.yaml
 include: humaneval.yaml
 task: humaneval_instruct
-doc_to_text: "Write a solution to the following problem and make sure that it passes the tests:\n```{{prompt}}"
-gen_prefix: "Here is the completed function:\n```python\n{{prompt}}\n"
+doc_to_text: 'Write a solution to the following problem and make sure that it passes the tests:\n```python\n{{ prompt }}\n```\n '
+gen_prefix: 'Here is the completed function:\n```python\n{{ prompt }}\n '
 filter_list:
  - name: "create_test"
    filter:
      - function: "custom"
        filter_fn: !function utils.build_predictions_instruct
 metadata:
-  version: 2.0
+  version: 3.0
--- a/lm_eval/tasks/humaneval/utils.py
+++ b/lm_eval/tasks/humaneval/utils.py
@@ -32,7 +32,7 @@ def build_predictions_instruct(
 ) -> list[list[str]]:
    return [
        [
-            doc["prompt"] + (r if r.rfind("```") == -1 else r[: r.rfind("```")])
+            doc["prompt"] + (r if r.find("```") == -1 else r[: r.find("```")])
            for r in resp
        ]
        for resp, doc in zip(resps, docs)
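
The effect of switching from `rfind` to `find` in `build_predictions_instruct` can be illustrated with a small standalone sketch. The helper names and the sample response below are hypothetical and are not part of the harness; the sketch assumes the model closes its code block and then adds an explanation containing a second fenced block, which is the case where the two calls differ.

```python
# Hypothetical helpers mirroring the old and new truncation expressions.
FENCE = "`" * 3  # the triple-backtick marker, built this way to keep the example readable


def truncate_at_last_fence(response: str) -> str:
    """Old behavior: cut at the LAST fence marker (str.rfind)."""
    idx = response.rfind(FENCE)
    return response if idx == -1 else response[:idx]


def truncate_at_first_fence(response: str) -> str:
    """New behavior: cut at the FIRST fence marker (str.find)."""
    idx = response.find(FENCE)
    return response if idx == -1 else response[:idx]


# Assumed model output: the function body, a closing fence, then extra prose
# with a second fenced usage example.
sample_response = (
    "    return a + b\n"
    + FENCE + "\n"
    + "You can test it like this:\n"
    + FENCE + "python\n"
    + "print(add(1, 2))\n"
    + FENCE
)

old_result = truncate_at_last_fence(sample_response)
new_result = truncate_at_first_fence(sample_response)

# The old result still contains fence markers and prose, so it is not valid Python
# once appended to the prompt; the new result keeps only the function body.
print(repr(new_result))
print(FENCE in old_result)
```

When the response contains no fence marker at all, both helpers return it unchanged, so the change only affects responses with more than one marker.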