Unverified Commit 9cd79897 authored by Hailey Schoelkopf, committed by GitHub

Correctly Print Task Versioning (#1173)

* change version field formatting in metadata

* mention versioning in new task guide

* add instructions for changelog

* run linters
parent a0cfe3f6
@@ -315,6 +315,25 @@ python -m scripts.write_out \
Open the file specified at the `--output_base_path <path>` and ensure it passes
a simple eye test.
## Versioning
One key feature of the LM Evaluation Harness is the ability to version tasks: that is, to mark them with a specific version number that can be bumped whenever a breaking change is made.
This version info can be provided by adding the following to your new task config file:
```
metadata:
version: 0
```
Now, whenever a change needs to be made to your task in the future, please increment this version number by 1 so that users can distinguish between different iterations of the task.
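For instance, assuming a task that started at `version: 0` as in the snippet above, a subsequent breaking fix would bump the config to something like:

```
metadata:
  version: 1
```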
If you are incrementing a task's version, please also consider adding a changelog to the task's README.md noting the date, PR number, what version you have updated to, and a one-liner describing the change.
For example:
* \[Dec 25, 2023\] (PR #999) Version 0.0 -> 1.0: Fixed a bug with answer extraction that led to underestimated performance.
## Checking performance + equivalence
It's now time to check models' performance on your task! In the evaluation harness, we intend to support a wide range of evaluation tasks and setups, but prioritize the inclusion of already-proven benchmarks following the precise evaluation setups in the literature where possible.
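As a quick sanity check, you can then run your new task end to end. The command below is only a sketch: the model, the task name placeholder, and flags such as `--limit` are illustrative and may differ across harness versions.

```
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks <your_task_name> \
    --limit 10 \
    --device cuda:0
```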
......
@@ -23,4 +23,4 @@ metric_list:
 aggregation: mean
 higher_is_better: true
 metadata:
-  - version: 1.0
+  version: 1.0
@@ -20,4 +20,4 @@ metric_list:
 aggregation: mean
 higher_is_better: true
 metadata:
-  - version: 1.0
+  version: 1.0
@@ -13,4 +13,4 @@ metric_list:
 aggregation: mean
 higher_is_better: true
 metadata:
-  - version: 1.0
+  version: 1.0
@@ -11,4 +11,4 @@ metric_list:
 aggregation: mean
 higher_is_better: true
 metadata:
-  - version: 1.0
+  version: 1.0
@@ -17,4 +17,4 @@ metric_list:
 aggregation: mean
 higher_is_better: true
 metadata:
-  - version: 0.0
+  version: 0.0
@@ -27,4 +27,4 @@ filter_list:
 - function: "take_first"
 num_fewshot: 0
 metadata:
-  - version: 1.0
+  version: 1.0
@@ -24,4 +24,4 @@ filter_list:
 - function: "take_first"
 num_fewshot: 0
 metadata:
-  - version: 0
+  version: 0
@@ -18,4 +18,4 @@ generation_kwargs:
 temperature: 0.0
 num_fewshot: 0
 metadata:
-  - version: 0
+  version: 0
@@ -18,4 +18,4 @@ generation_kwargs:
 temperature: 0.0
 num_fewshot: 0
 metadata:
-  - version: 0
+  version: 0
@@ -18,4 +18,4 @@ metric_list:
 aggregation: mean
 higher_is_better: true
 metadata:
-  - version: 0.0
+  version: 0.0
@@ -15,4 +15,4 @@ metric_list:
 higher_is_better: true
 ignore_punctuation: true
 metadata:
-  - version: 0.0
+  version: 0.0
# Generated by utils.py
dataset_name: causal_judgment_zero_shot
include: ../multiple_choice_template_yaml
task: bigbench_causal_judgement_multiple_choice
@@ -12,4 +12,4 @@ metric_list:
 - metric: acc
 # TODO: brier score and other metrics
 metadata:
-  - version: 0.0
+  version: 0.0
@@ -11,4 +11,4 @@ doc_to_decontamination_query: "{{sentence_good}} {{sentence_bad}}"
 metric_list:
 - metric: acc
 metadata:
-  - version: 1.0
+  version: 1.0
@@ -16,4 +16,4 @@ metric_list:
 aggregation: mean
 higher_is_better: true
 metadata:
-  - version: 1.0
+  version: 1.0
@@ -16,4 +16,4 @@ metric_list:
 aggregation: mean
 higher_is_better: true
 metadata:
-  - version: 0.0
+  version: 0.0
@@ -18,4 +18,4 @@ metric_list:
 aggregation: mean
 higher_is_better: True
 metadata:
-  - version: 0.0
+  version: 0.0
@@ -18,4 +18,4 @@ metric_list:
 aggregation: mean
 higher_is_better: True
 metadata:
-  - version: 0.0
+  version: 0.0
@@ -18,4 +18,4 @@ metric_list:
 aggregation: mean
 higher_is_better: True
 metadata:
-  - version: 0.0
+  version: 0.0