"vscode:/vscode.git/clone" did not exist on "7eaec7e354fb89ce4883eb3d09bb2b15cef9cf66"
Commit 25869601 authored by Baber's avatar Baber
Browse files

Merge branch 'main' into mathvista

# Conflicts:
#	lm_eval/models/hf_vlms.py
parents 56f40c53 c1d8795d
.gitignore
@@ -8,6 +8,7 @@ build
 dist
 *.egg-info
 venv
+.venv/
 .vscode/
 temp
 __pycache__
...
.pre-commit-config.yaml
@@ -2,7 +2,7 @@
 exclude: ^tests/testdata/
 repos:
   - repo: https://github.com/pre-commit/pre-commit-hooks
-    rev: v4.5.0
+    rev: v4.6.0
     hooks:
       - id: check-added-large-files
       - id: check-ast
@@ -29,7 +29,7 @@ repos:
       - id: mixed-line-ending
         args: [--fix=lf]
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.4.8
+    rev: v0.6.8
     hooks:
       # Run the linter.
       - id: ruff
...
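For context, hook bumps like the two above can be reproduced and verified locally with the pre-commit CLI itself. A minimal sketch, assuming `pre-commit` is already installed in the environment and run from the repository root:

```bash
# Install the git hooks defined in .pre-commit-config.yaml
pre-commit install

# Bump each hook's `rev` to its latest tagged release
# (the manual equivalent of the version changes in this diff)
pre-commit autoupdate

# Run every hook against the whole repository to confirm nothing breaks
pre-commit run --all-files
```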
README.md
@@ -54,7 +54,7 @@ The Language Model Evaluation Harness is the backend for 🤗 Hugging Face's pop
 To install the `lm-eval` package from the github repository, run:
 
 ```bash
-git clone https://github.com/EleutherAI/lm-evaluation-harness
+git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
 cd lm-evaluation-harness
 pip install -e .
 ```
...
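The added `--depth 1` makes this a shallow clone, fetching only the latest commit so the download stays small. A minimal sketch of the updated install flow plus a quick sanity check (the `lm_eval` console script is the same entry point the notebook below invokes with `!lm_eval`):

```bash
# Shallow-clone and install in editable mode, as in the README
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .

# Confirm the CLI entry point is available
lm_eval --help
```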
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "Qw83KAePAhaS"
},
"source": [
"# Releasing LM-Evaluation-Harness v0.4.0"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Z7k2vq1iAdqr"
},
"source": [
"With the vast amount of work done in the field today, it helps to have a tool that people can easily use to share their results and to check others' results, ensuring that reported numbers are valid. The LM Evaluation Harness is one such tool the community has used extensively. We want to continue to support the community, and with that in mind we're excited to announce a major update to the LM Evaluation Harness to further our goal of open and accessible AI research."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0gDoM0AJAvEc"
},
"source": [
"Our refactor stems from our desire to make the following best practices easier to carry out.\n",
"\n",
"1. Never copy results from other papers\n",
"2. Always share your exact prompts\n",
"3. Always provide model outputs\n",
"4. Qualitatively review a small batch of outputs before running evaluation jobs at scale\n",
"\n",
"We also wanted to make the library a better experience to use, to contribute to, and to design evaluations within. New features in the new release that serve this purpose include:\n",
"\n",
"1. Faster Evaluation Runtimes (accelerated data-parallel inference with HF Transformers + Accelerate, and commonly used or faster inference libraries such as vLLM and Llama-CPP)\n",
"2. Easier addition and sharing of new tasks (YAML-based task config formats, allowing single-file sharing of custom tasks)\n",
"3. More configurability, for more advanced workflows and easier prompt modification\n",
"4. Better logging of data at runtime and post-hoc"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nnwsOpjda_YW"
},
"source": [
"In this notebook we will go through a short tutorial on how things work."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zAov81vTbL2K"
},
"source": [
"## Install LM-Eval"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "8hiosGzq_qZg",
"outputId": "6ab73e5e-1f54-417e-a388-07e0d870b132"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting git+https://github.com/EleutherAI/lm-evaluation-harness.git@big-refactor\n",
" Cloning https://github.com/EleutherAI/lm-evaluation-harness.git (to revision big-refactor) to /tmp/pip-req-build-tnssql5s\n",
" Running command git clone --filter=blob:none --quiet https://github.com/EleutherAI/lm-evaluation-harness.git /tmp/pip-req-build-tnssql5s\n",
" Running command git checkout -b big-refactor --track origin/big-refactor\n",
" Switched to a new branch 'big-refactor'\n",
" Branch 'big-refactor' set up to track remote branch 'big-refactor' from 'origin'.\n",
" Resolved https://github.com/EleutherAI/lm-evaluation-harness.git to commit 42f486ee49b65926a444cb0620870a39a5b4b0a8\n",
" Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n",
" Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n",
" Preparing metadata (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
"Collecting accelerate>=0.21.0 (from lm-eval==1.0.0)\n",
" Downloading accelerate-0.24.1-py3-none-any.whl (261 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m261.4/261.4 kB\u001b[0m \u001b[31m4.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting evaluate (from lm-eval==1.0.0)\n",
" Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m84.1/84.1 kB\u001b[0m \u001b[31m5.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting datasets>=2.0.0 (from lm-eval==1.0.0)\n",
" Downloading datasets-2.15.0-py3-none-any.whl (521 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m521.2/521.2 kB\u001b[0m \u001b[31m9.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting jsonlines (from lm-eval==1.0.0)\n",
" Downloading jsonlines-4.0.0-py3-none-any.whl (8.7 kB)\n",
"Requirement already satisfied: numexpr in /usr/local/lib/python3.10/dist-packages (from lm-eval==1.0.0) (2.8.7)\n",
"Collecting peft>=0.2.0 (from lm-eval==1.0.0)\n",
" Downloading peft-0.6.2-py3-none-any.whl (174 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m174.7/174.7 kB\u001b[0m \u001b[31m7.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting pybind11>=2.6.2 (from lm-eval==1.0.0)\n",
" Downloading pybind11-2.11.1-py3-none-any.whl (227 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m227.7/227.7 kB\u001b[0m \u001b[31m12.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting pytablewriter (from lm-eval==1.0.0)\n",
" Downloading pytablewriter-1.2.0-py3-none-any.whl (111 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m111.1/111.1 kB\u001b[0m \u001b[31m8.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting rouge-score>=0.0.4 (from lm-eval==1.0.0)\n",
" Downloading rouge_score-0.1.2.tar.gz (17 kB)\n",
" Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
"Collecting sacrebleu>=1.5.0 (from lm-eval==1.0.0)\n",
" Downloading sacrebleu-2.3.2-py3-none-any.whl (119 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m119.7/119.7 kB\u001b[0m \u001b[31m8.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hRequirement already satisfied: scikit-learn>=0.24.1 in /usr/local/lib/python3.10/dist-packages (from lm-eval==1.0.0) (1.2.2)\n",
"Collecting sqlitedict (from lm-eval==1.0.0)\n",
" Downloading sqlitedict-2.1.0.tar.gz (21 kB)\n",
" Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
"Requirement already satisfied: torch>=1.8 in /usr/local/lib/python3.10/dist-packages (from lm-eval==1.0.0) (2.1.0+cu118)\n",
"Collecting tqdm-multiprocess (from lm-eval==1.0.0)\n",
" Downloading tqdm_multiprocess-0.0.11-py3-none-any.whl (9.8 kB)\n",
"Requirement already satisfied: transformers>=4.1 in /usr/local/lib/python3.10/dist-packages (from lm-eval==1.0.0) (4.35.2)\n",
"Collecting zstandard (from lm-eval==1.0.0)\n",
" Downloading zstandard-0.22.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.4 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.4/5.4 MB\u001b[0m \u001b[31m29.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hRequirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (1.23.5)\n",
"Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (23.2)\n",
"Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (5.9.5)\n",
"Requirement already satisfied: pyyaml in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (6.0.1)\n",
"Requirement already satisfied: huggingface-hub in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (0.19.4)\n",
"Requirement already satisfied: pyarrow>=8.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (9.0.0)\n",
"Collecting pyarrow-hotfix (from datasets>=2.0.0->lm-eval==1.0.0)\n",
" Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)\n",
"Collecting dill<0.3.8,>=0.3.0 (from datasets>=2.0.0->lm-eval==1.0.0)\n",
" Downloading dill-0.3.7-py3-none-any.whl (115 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m115.3/115.3 kB\u001b[0m \u001b[31m14.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hRequirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (1.5.3)\n",
"Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (2.31.0)\n",
"Requirement already satisfied: tqdm>=4.62.1 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (4.66.1)\n",
"Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (3.4.1)\n",
"Collecting multiprocess (from datasets>=2.0.0->lm-eval==1.0.0)\n",
" Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m134.8/134.8 kB\u001b[0m \u001b[31m19.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hRequirement already satisfied: fsspec[http]<=2023.10.0,>=2023.1.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (2023.6.0)\n",
"Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (3.8.6)\n",
"Collecting responses<0.19 (from evaluate->lm-eval==1.0.0)\n",
" Downloading responses-0.18.0-py3-none-any.whl (38 kB)\n",
"Requirement already satisfied: safetensors in /usr/local/lib/python3.10/dist-packages (from peft>=0.2.0->lm-eval==1.0.0) (0.4.0)\n",
"Requirement already satisfied: absl-py in /usr/local/lib/python3.10/dist-packages (from rouge-score>=0.0.4->lm-eval==1.0.0) (1.4.0)\n",
"Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (from rouge-score>=0.0.4->lm-eval==1.0.0) (3.8.1)\n",
"Requirement already satisfied: six>=1.14.0 in /usr/local/lib/python3.10/dist-packages (from rouge-score>=0.0.4->lm-eval==1.0.0) (1.16.0)\n",
"Collecting portalocker (from sacrebleu>=1.5.0->lm-eval==1.0.0)\n",
" Downloading portalocker-2.8.2-py3-none-any.whl (17 kB)\n",
"Requirement already satisfied: regex in /usr/local/lib/python3.10/dist-packages (from sacrebleu>=1.5.0->lm-eval==1.0.0) (2023.6.3)\n",
"Requirement already satisfied: tabulate>=0.8.9 in /usr/local/lib/python3.10/dist-packages (from sacrebleu>=1.5.0->lm-eval==1.0.0) (0.9.0)\n",
"Collecting colorama (from sacrebleu>=1.5.0->lm-eval==1.0.0)\n",
" Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)\n",
"Requirement already satisfied: lxml in /usr/local/lib/python3.10/dist-packages (from sacrebleu>=1.5.0->lm-eval==1.0.0) (4.9.3)\n",
"Requirement already satisfied: scipy>=1.3.2 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.24.1->lm-eval==1.0.0) (1.11.3)\n",
"Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.24.1->lm-eval==1.0.0) (1.3.2)\n",
"Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.24.1->lm-eval==1.0.0) (3.2.0)\n",
"Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (3.13.1)\n",
"Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (4.5.0)\n",
"Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (1.12)\n",
"Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (3.2.1)\n",
"Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (3.1.2)\n",
"Requirement already satisfied: triton==2.1.0 in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (2.1.0)\n",
"Requirement already satisfied: tokenizers<0.19,>=0.14 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.1->lm-eval==1.0.0) (0.15.0)\n",
"Requirement already satisfied: attrs>=19.2.0 in /usr/local/lib/python3.10/dist-packages (from jsonlines->lm-eval==1.0.0) (23.1.0)\n",
"Requirement already satisfied: setuptools>=38.3.0 in /usr/local/lib/python3.10/dist-packages (from pytablewriter->lm-eval==1.0.0) (67.7.2)\n",
"Collecting DataProperty<2,>=1.0.1 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading DataProperty-1.0.1-py3-none-any.whl (27 kB)\n",
"Collecting mbstrdecoder<2,>=1.0.0 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading mbstrdecoder-1.1.3-py3-none-any.whl (7.8 kB)\n",
"Collecting pathvalidate<4,>=2.3.0 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading pathvalidate-3.2.0-py3-none-any.whl (23 kB)\n",
"Collecting tabledata<2,>=1.3.1 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading tabledata-1.3.3-py3-none-any.whl (11 kB)\n",
"Collecting tcolorpy<1,>=0.0.5 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading tcolorpy-0.1.4-py3-none-any.whl (7.9 kB)\n",
"Collecting typepy[datetime]<2,>=1.3.2 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading typepy-1.3.2-py3-none-any.whl (31 kB)\n",
"Requirement already satisfied: charset-normalizer<4.0,>=2.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (3.3.2)\n",
"Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (6.0.4)\n",
"Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (4.0.3)\n",
"Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (1.9.2)\n",
"Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (1.4.0)\n",
"Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (1.3.1)\n",
"Requirement already satisfied: chardet<6,>=3.0.4 in /usr/local/lib/python3.10/dist-packages (from mbstrdecoder<2,>=1.0.0->pytablewriter->lm-eval==1.0.0) (5.2.0)\n",
"Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.0.0->lm-eval==1.0.0) (3.4)\n",
"Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.0.0->lm-eval==1.0.0) (2.0.7)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.0.0->lm-eval==1.0.0) (2023.7.22)\n",
"Requirement already satisfied: python-dateutil<3.0.0,>=2.8.0 in /usr/local/lib/python3.10/dist-packages (from typepy[datetime]<2,>=1.3.2->pytablewriter->lm-eval==1.0.0) (2.8.2)\n",
"Requirement already satisfied: pytz>=2018.9 in /usr/local/lib/python3.10/dist-packages (from typepy[datetime]<2,>=1.3.2->pytablewriter->lm-eval==1.0.0) (2023.3.post1)\n",
"Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=1.8->lm-eval==1.0.0) (2.1.3)\n",
"Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk->rouge-score>=0.0.4->lm-eval==1.0.0) (8.1.7)\n",
"Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=1.8->lm-eval==1.0.0) (1.3.0)\n",
"Building wheels for collected packages: lm-eval, rouge-score, sqlitedict\n",
" Building wheel for lm-eval (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for lm-eval: filename=lm_eval-1.0.0-py3-none-any.whl size=994254 sha256=88356155b19f2891981ecef948326ad6ce8ca40a6009378410ec20d0e225995a\n",
" Stored in directory: /tmp/pip-ephem-wheel-cache-9v6ye7h3/wheels/17/01/26/599c0779e9858a70a73fa8a306699b5b9a868f820c225457b0\n",
" Building wheel for rouge-score (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24933 sha256=6bb0d44e4881972c43ce194e7cb65233d309758cb15f0dec54590d3d2efcfc36\n",
" Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4\n",
" Building wheel for sqlitedict (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for sqlitedict: filename=sqlitedict-2.1.0-py3-none-any.whl size=16863 sha256=5747f7dd73ddf3d8fbcebf51b5e4f718fabe1e94bccdf16d2f22a2e65ee7fdf4\n",
" Stored in directory: /root/.cache/pip/wheels/79/d6/e7/304e0e6cb2221022c26d8161f7c23cd4f259a9e41e8bbcfabd\n",
"Successfully built lm-eval rouge-score sqlitedict\n",
"Installing collected packages: sqlitedict, zstandard, tcolorpy, pybind11, pyarrow-hotfix, portalocker, pathvalidate, mbstrdecoder, jsonlines, dill, colorama, typepy, tqdm-multiprocess, sacrebleu, rouge-score, responses, multiprocess, accelerate, datasets, DataProperty, tabledata, peft, evaluate, pytablewriter, lm-eval\n",
"Successfully installed DataProperty-1.0.1 accelerate-0.24.1 colorama-0.4.6 datasets-2.15.0 dill-0.3.7 evaluate-0.4.1 jsonlines-4.0.0 lm-eval-1.0.0 mbstrdecoder-1.1.3 multiprocess-0.70.15 pathvalidate-3.2.0 peft-0.6.2 portalocker-2.8.2 pyarrow-hotfix-0.6 pybind11-2.11.1 pytablewriter-1.2.0 responses-0.18.0 rouge-score-0.1.2 sacrebleu-2.3.2 sqlitedict-2.1.0 tabledata-1.3.3 tcolorpy-0.1.4 tqdm-multiprocess-0.0.11 typepy-1.3.2 zstandard-0.22.0\n"
]
}
],
"source": [
"# Install LM-Eval\n",
"!pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 0,
"referenced_widgets": [
"a1d3a8aa016544a78e8821c8f6199e06",
"f61ed33fad754146bdd2ac9db1ba1c48",
"bfa0af6aeff344c6845e1080a878e92e",
"fd1ad9e0367d4004aae853b91c3a7617",
"6b2d90209ec14230b3d58a74ac9b83bf",
"a73f357065d34d7baf0453ae4a8d75e2",
"46f521b73fd943c081c648fd873ebc0a",
"7c5689bc13684db8a22681f41863dddd",
"48763b6233374554ae76035c0483066f",
"4986a21eb560448fa79f4b25cde48951",
"aed3acd2f2d74003b44079c333a0698e"
]
},
"id": "uyO5MaKkZyah",
"outputId": "d46e8096-5086-4e49-967e-ea33d4a2a335"
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "a1d3a8aa016544a78e8821c8f6199e06",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading builder script: 0%| | 0.00/5.67k [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from lm_eval import api"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8rfUeX6n_wkK"
},
"source": [
"## Create new evaluation tasks with config-based tasks\n",
"\n",
"Even within the same task, many works have reported numbers based on different choices of evaluation. Some report on the test sets, validation sets, or even subset of the training sets. Others have specialized prompts and verbalizers. We introduce YAMLs to allow users to easily make different variations. By leveraging the YAML configs to configure evaluations, the refactored LM-Eval takes the methods of the `Task` object and makes them configurable by setting the appropriate attributes in the config file. There, users can set the tasks they want by setting the name of the HF dataset (local tasks are also possible), the dataset splits used, and much more. Key configurations relating to prompting, such as `doc_to_text`, previously implemented as a method of the same name, are now configurable with jinja2 to allow high-level scripting to transform a HF dataset to text string as input to the model.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HYFUhhfOSJKe"
},
"source": [
"A core-feature to LM-Eval is to configure tasks with YAML configs. With configs, you can fill preset fields to easily set up a task.\n",
"\n",
"Here, we write a demo YAML config for a multiple-choice evaluation of BoolQ:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"id": "bg3dGROW-V39"
},
"outputs": [],
"source": [
"YAML_boolq_string = '''\n",
"task: demo_boolq\n",
"dataset_path: super_glue\n",
"dataset_name: boolq\n",
"output_type: multiple_choice\n",
"training_split: train\n",
"validation_split: validation\n",
"doc_to_text: \"{{passage}}\\nQuestion: {{question}}?\\nAnswer:\"\n",
"doc_to_target: label\n",
"doc_to_choice: [\"no\", \"yes\"]\n",
"should_decontaminate: true\n",
"doc_to_decontamination_query: passage\n",
"metric_list:\n",
" - metric: acc\n",
"'''\n",
"with open('boolq.yaml', 'w') as f:\n",
" f.write(YAML_boolq_string)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And we can now run evaluation on this task, by pointing to the config file we've just created:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"id": "LOUHK7PtQfq4"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2023-11-29:11:54:55,156 INFO [utils.py:160] NumExpr defaulting to 2 threads.\n",
"2023-11-29 11:54:55.942051: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"2023-11-29 11:54:55.942108: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
"2023-11-29 11:54:55.942142: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
"2023-11-29 11:54:57.066802: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
"2023-11-29:11:55:00,954 INFO [__main__.py:132] Verbosity set to INFO\n",
"2023-11-29:11:55:11,038 WARNING [__main__.py:138] --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.\n",
"2023-11-29:11:55:11,038 INFO [__main__.py:143] Including path: ./\n",
"2023-11-29:11:55:11,046 INFO [__main__.py:205] Selected Tasks: ['demo_boolq']\n",
"2023-11-29:11:55:11,047 WARNING [evaluator.py:93] generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.\n",
"2023-11-29:11:55:11,110 INFO [huggingface.py:120] Using device 'cuda'\n",
"config.json: 100% 571/571 [00:00<00:00, 2.87MB/s]\n",
"model.safetensors: 100% 5.68G/5.68G [00:32<00:00, 173MB/s]\n",
"tokenizer_config.json: 100% 396/396 [00:00<00:00, 2.06MB/s]\n",
"tokenizer.json: 100% 2.11M/2.11M [00:00<00:00, 11.6MB/s]\n",
"special_tokens_map.json: 100% 99.0/99.0 [00:00<00:00, 555kB/s]\n",
"2023-11-29:11:56:18,658 WARNING [task.py:614] [Task: demo_boolq] metric acc is defined, but aggregation is not. using default aggregation=mean\n",
"2023-11-29:11:56:18,658 WARNING [task.py:626] [Task: demo_boolq] metric acc is defined, but higher_is_better is not. using default higher_is_better=True\n",
"Downloading builder script: 100% 30.7k/30.7k [00:00<00:00, 59.0MB/s]\n",
"Downloading metadata: 100% 38.7k/38.7k [00:00<00:00, 651kB/s]\n",
"Downloading readme: 100% 14.8k/14.8k [00:00<00:00, 37.3MB/s]\n",
"Downloading data: 100% 4.12M/4.12M [00:00<00:00, 55.1MB/s]\n",
"Generating train split: 100% 9427/9427 [00:00<00:00, 15630.89 examples/s]\n",
"Generating validation split: 100% 3270/3270 [00:00<00:00, 20002.56 examples/s]\n",
"Generating test split: 100% 3245/3245 [00:00<00:00, 20866.19 examples/s]\n",
"2023-11-29:11:56:22,315 INFO [task.py:355] Building contexts for task on rank 0...\n",
"2023-11-29:11:56:22,322 INFO [evaluator.py:319] Running loglikelihood requests\n",
"100% 20/20 [00:04<00:00, 4.37it/s]\n",
"fatal: not a git repository (or any of the parent directories): .git\n",
"hf (pretrained=EleutherAI/pythia-2.8b), gen_kwargs: (), limit: 10.0, num_fewshot: None, batch_size: 1\n",
"| Tasks |Version|Filter|n-shot|Metric|Value| |Stderr|\n",
"|----------|-------|------|-----:|------|----:|---|-----:|\n",
"|demo_boolq|Yaml |none | 0|acc | 1|± | 0|\n",
"\n"
]
}
],
"source": [
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --include_path ./ \\\n",
" --tasks demo_boolq \\\n",
" --limit 10\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LOUHK7PtQfq4"
},
"source": [
"Often, tasks are part of a larger group used to measure different capabilities. The dynamism of the field today means new dimensions of evaluation can come about which would mix and match new and older tasks alike. In LM-Eval, We can also group tasks and call that the group name to evaluate on a set of tasks easily. In this instance, let's evaluate the tag `yes_or_no_tasks` which comprise of the tasks `demo_boolq` and `demo_cola`; tasks which are multiple choice tasks with options `yes` and `no` as the name suggests.\n",
"\n",
"<!-- making new groups is easier than ever, allowing user to work bottom-up by makiing individual tasks and linking them to a group or Top-Down, making a new group by listing existing tasks.\n",
"\n",
"We also show the aggregate across samples besides only showing the aggregation between subtasks. This may come in handy when certain groups want to be aggregated as a single task. -->\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"id": "fthNg3ywO-kA"
},
"outputs": [],
"source": [
"YAML_cola_string = '''\n",
"tag: yes_or_no_tasks\n",
"task: demo_cola\n",
"dataset_path: glue\n",
"dataset_name: cola\n",
"output_type: multiple_choice\n",
"training_split: train\n",
"validation_split: validation\n",
"doc_to_text: \"{{sentence}}\\nQuestion: Does this sentence make sense?\\nAnswer:\"\n",
"doc_to_target: label\n",
"doc_to_choice: [\"no\", \"yes\"]\n",
"should_decontaminate: true\n",
"doc_to_decontamination_query: sentence\n",
"metric_list:\n",
" - metric: acc\n",
"'''\n",
"with open('cola.yaml', 'w') as f:\n",
" f.write(YAML_cola_string)"
]
}, },
"id": "8hiosGzq_qZg",
"outputId": "6ab73e5e-1f54-417e-a388-07e0d870b132"
},
"outputs": [
{ {
"cell_type": "code", "name": "stdout",
"execution_count": 6, "output_type": "stream",
"metadata": { "text": [
"id": "XceRKCuuDtbn" "Collecting git+https://github.com/EleutherAI/lm-evaluation-harness.git@big-refactor\n",
}, " Cloning https://github.com/EleutherAI/lm-evaluation-harness.git (to revision big-refactor) to /tmp/pip-req-build-tnssql5s\n",
"outputs": [ " Running command git clone --filter=blob:none --quiet https://github.com/EleutherAI/lm-evaluation-harness.git /tmp/pip-req-build-tnssql5s\n",
{ " Running command git checkout -b big-refactor --track origin/big-refactor\n",
"name": "stdout", " Switched to a new branch 'big-refactor'\n",
"output_type": "stream", " Branch 'big-refactor' set up to track remote branch 'big-refactor' from 'origin'.\n",
"text": [ " Resolved https://github.com/EleutherAI/lm-evaluation-harness.git to commit 42f486ee49b65926a444cb0620870a39a5b4b0a8\n",
"2023-11-29:11:56:33,016 INFO [utils.py:160] NumExpr defaulting to 2 threads.\n", " Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n",
"2023-11-29 11:56:33.852995: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", " Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n",
"2023-11-29 11:56:33.853050: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", " Preparing metadata (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
"2023-11-29 11:56:33.853087: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n", "Collecting accelerate>=0.21.0 (from lm-eval==1.0.0)\n",
"2023-11-29 11:56:35.129047: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n", " Downloading accelerate-0.24.1-py3-none-any.whl (261 kB)\n",
"2023-11-29:11:56:38,546 INFO [__main__.py:132] Verbosity set to INFO\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m261.4/261.4 kB\u001b[0m \u001b[31m4.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"2023-11-29:11:56:47,509 WARNING [__main__.py:138] --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.\n", "\u001b[?25hCollecting evaluate (from lm-eval==1.0.0)\n",
"2023-11-29:11:56:47,509 INFO [__main__.py:143] Including path: ./\n", " Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)\n",
"2023-11-29:11:56:47,517 INFO [__main__.py:205] Selected Tasks: ['yes_or_no_tasks']\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m84.1/84.1 kB\u001b[0m \u001b[31m5.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"2023-11-29:11:56:47,520 WARNING [evaluator.py:93] generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.\n", "\u001b[?25hCollecting datasets>=2.0.0 (from lm-eval==1.0.0)\n",
"2023-11-29:11:56:47,550 INFO [huggingface.py:120] Using device 'cuda'\n", " Downloading datasets-2.15.0-py3-none-any.whl (521 kB)\n",
"2023-11-29:11:57:08,743 WARNING [task.py:614] [Task: demo_cola] metric acc is defined, but aggregation is not. using default aggregation=mean\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m521.2/521.2 kB\u001b[0m \u001b[31m9.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"2023-11-29:11:57:08,743 WARNING [task.py:626] [Task: demo_cola] metric acc is defined, but higher_is_better is not. using default higher_is_better=True\n", "\u001b[?25hCollecting jsonlines (from lm-eval==1.0.0)\n",
"Downloading builder script: 100% 28.8k/28.8k [00:00<00:00, 52.7MB/s]\n", " Downloading jsonlines-4.0.0-py3-none-any.whl (8.7 kB)\n",
"Downloading metadata: 100% 28.7k/28.7k [00:00<00:00, 51.9MB/s]\n", "Requirement already satisfied: numexpr in /usr/local/lib/python3.10/dist-packages (from lm-eval==1.0.0) (2.8.7)\n",
"Downloading readme: 100% 27.9k/27.9k [00:00<00:00, 48.0MB/s]\n", "Collecting peft>=0.2.0 (from lm-eval==1.0.0)\n",
"Downloading data: 100% 377k/377k [00:00<00:00, 12.0MB/s]\n", " Downloading peft-0.6.2-py3-none-any.whl (174 kB)\n",
"Generating train split: 100% 8551/8551 [00:00<00:00, 19744.58 examples/s]\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m174.7/174.7 kB\u001b[0m \u001b[31m7.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"Generating validation split: 100% 1043/1043 [00:00<00:00, 27057.01 examples/s]\n", "\u001b[?25hCollecting pybind11>=2.6.2 (from lm-eval==1.0.0)\n",
"Generating test split: 100% 1063/1063 [00:00<00:00, 22705.17 examples/s]\n", " Downloading pybind11-2.11.1-py3-none-any.whl (227 kB)\n",
"2023-11-29:11:57:11,698 INFO [task.py:355] Building contexts for task on rank 0...\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m227.7/227.7 kB\u001b[0m \u001b[31m12.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"2023-11-29:11:57:11,704 INFO [evaluator.py:319] Running loglikelihood requests\n", "\u001b[?25hCollecting pytablewriter (from lm-eval==1.0.0)\n",
"100% 20/20 [00:03<00:00, 5.15it/s]\n", " Downloading pytablewriter-1.2.0-py3-none-any.whl (111 kB)\n",
"fatal: not a git repository (or any of the parent directories): .git\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m111.1/111.1 kB\u001b[0m \u001b[31m8.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"hf (pretrained=EleutherAI/pythia-2.8b), gen_kwargs: (), limit: 10.0, num_fewshot: None, batch_size: 1\n", "\u001b[?25hCollecting rouge-score>=0.0.4 (from lm-eval==1.0.0)\n",
"| Tasks |Version|Filter|n-shot|Metric|Value| |Stderr|\n", " Downloading rouge_score-0.1.2.tar.gz (17 kB)\n",
"|---------------|-------|------|-----:|------|----:|---|-----:|\n", " Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
"|yes_or_no_tasks|N/A |none | 0|acc | 0.7|± |0.1528|\n", "Collecting sacrebleu>=1.5.0 (from lm-eval==1.0.0)\n",
"| - demo_cola |Yaml |none | 0|acc | 0.7|± |0.1528|\n", " Downloading sacrebleu-2.3.2-py3-none-any.whl (119 kB)\n",
"\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m119.7/119.7 kB\u001b[0m \u001b[31m8.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"| Groups |Version|Filter|n-shot|Metric|Value| |Stderr|\n", "\u001b[?25hRequirement already satisfied: scikit-learn>=0.24.1 in /usr/local/lib/python3.10/dist-packages (from lm-eval==1.0.0) (1.2.2)\n",
"|---------------|-------|------|-----:|------|----:|---|-----:|\n", "Collecting sqlitedict (from lm-eval==1.0.0)\n",
"|yes_or_no_tasks|N/A |none | 0|acc | 0.7|± |0.1528|\n", " Downloading sqlitedict-2.1.0.tar.gz (21 kB)\n",
"\n" " Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
] "Requirement already satisfied: torch>=1.8 in /usr/local/lib/python3.10/dist-packages (from lm-eval==1.0.0) (2.1.0+cu118)\n",
} "Collecting tqdm-multiprocess (from lm-eval==1.0.0)\n",
], " Downloading tqdm_multiprocess-0.0.11-py3-none-any.whl (9.8 kB)\n",
"source": [ "Requirement already satisfied: transformers>=4.1 in /usr/local/lib/python3.10/dist-packages (from lm-eval==1.0.0) (4.35.2)\n",
"# !accelerate launch --no_python\n", "Collecting zstandard (from lm-eval==1.0.0)\n",
"!lm_eval \\\n", " Downloading zstandard-0.22.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.4 MB)\n",
" --model hf \\\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.4/5.4 MB\u001b[0m \u001b[31m29.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n", "\u001b[?25hRequirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (1.23.5)\n",
" --include_path ./ \\\n", "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (23.2)\n",
" --tasks yes_or_no_tasks \\\n", "Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (5.9.5)\n",
" --limit 10 \\\n", "Requirement already satisfied: pyyaml in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (6.0.1)\n",
" --output output/yes_or_no_tasks/ \\\n", "Requirement already satisfied: huggingface-hub in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (0.19.4)\n",
" --log_samples\n" "Requirement already satisfied: pyarrow>=8.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (9.0.0)\n",
] "Collecting pyarrow-hotfix (from datasets>=2.0.0->lm-eval==1.0.0)\n",
" Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)\n",
"Collecting dill<0.3.8,>=0.3.0 (from datasets>=2.0.0->lm-eval==1.0.0)\n",
" Downloading dill-0.3.7-py3-none-any.whl (115 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m115.3/115.3 kB\u001b[0m \u001b[31m14.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hRequirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (1.5.3)\n",
"Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (2.31.0)\n",
"Requirement already satisfied: tqdm>=4.62.1 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (4.66.1)\n",
"Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (3.4.1)\n",
"Collecting multiprocess (from datasets>=2.0.0->lm-eval==1.0.0)\n",
" Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m134.8/134.8 kB\u001b[0m \u001b[31m19.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hRequirement already satisfied: fsspec[http]<=2023.10.0,>=2023.1.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (2023.6.0)\n",
"Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (3.8.6)\n",
"Collecting responses<0.19 (from evaluate->lm-eval==1.0.0)\n",
" Downloading responses-0.18.0-py3-none-any.whl (38 kB)\n",
"Requirement already satisfied: safetensors in /usr/local/lib/python3.10/dist-packages (from peft>=0.2.0->lm-eval==1.0.0) (0.4.0)\n",
"Requirement already satisfied: absl-py in /usr/local/lib/python3.10/dist-packages (from rouge-score>=0.0.4->lm-eval==1.0.0) (1.4.0)\n",
"Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (from rouge-score>=0.0.4->lm-eval==1.0.0) (3.8.1)\n",
"Requirement already satisfied: six>=1.14.0 in /usr/local/lib/python3.10/dist-packages (from rouge-score>=0.0.4->lm-eval==1.0.0) (1.16.0)\n",
"Collecting portalocker (from sacrebleu>=1.5.0->lm-eval==1.0.0)\n",
" Downloading portalocker-2.8.2-py3-none-any.whl (17 kB)\n",
"Requirement already satisfied: regex in /usr/local/lib/python3.10/dist-packages (from sacrebleu>=1.5.0->lm-eval==1.0.0) (2023.6.3)\n",
"Requirement already satisfied: tabulate>=0.8.9 in /usr/local/lib/python3.10/dist-packages (from sacrebleu>=1.5.0->lm-eval==1.0.0) (0.9.0)\n",
"Collecting colorama (from sacrebleu>=1.5.0->lm-eval==1.0.0)\n",
" Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)\n",
"Requirement already satisfied: lxml in /usr/local/lib/python3.10/dist-packages (from sacrebleu>=1.5.0->lm-eval==1.0.0) (4.9.3)\n",
"Requirement already satisfied: scipy>=1.3.2 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.24.1->lm-eval==1.0.0) (1.11.3)\n",
"Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.24.1->lm-eval==1.0.0) (1.3.2)\n",
"Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.24.1->lm-eval==1.0.0) (3.2.0)\n",
"Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (3.13.1)\n",
"Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (4.5.0)\n",
"Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (1.12)\n",
"Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (3.2.1)\n",
"Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (3.1.2)\n",
"Requirement already satisfied: triton==2.1.0 in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (2.1.0)\n",
"Requirement already satisfied: tokenizers<0.19,>=0.14 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.1->lm-eval==1.0.0) (0.15.0)\n",
"Requirement already satisfied: attrs>=19.2.0 in /usr/local/lib/python3.10/dist-packages (from jsonlines->lm-eval==1.0.0) (23.1.0)\n",
"Requirement already satisfied: setuptools>=38.3.0 in /usr/local/lib/python3.10/dist-packages (from pytablewriter->lm-eval==1.0.0) (67.7.2)\n",
"Collecting DataProperty<2,>=1.0.1 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading DataProperty-1.0.1-py3-none-any.whl (27 kB)\n",
"Collecting mbstrdecoder<2,>=1.0.0 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading mbstrdecoder-1.1.3-py3-none-any.whl (7.8 kB)\n",
"Collecting pathvalidate<4,>=2.3.0 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading pathvalidate-3.2.0-py3-none-any.whl (23 kB)\n",
"Collecting tabledata<2,>=1.3.1 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading tabledata-1.3.3-py3-none-any.whl (11 kB)\n",
"Collecting tcolorpy<1,>=0.0.5 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading tcolorpy-0.1.4-py3-none-any.whl (7.9 kB)\n",
"Collecting typepy[datetime]<2,>=1.3.2 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading typepy-1.3.2-py3-none-any.whl (31 kB)\n",
"Requirement already satisfied: charset-normalizer<4.0,>=2.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (3.3.2)\n",
"Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (6.0.4)\n",
"Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (4.0.3)\n",
"Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (1.9.2)\n",
"Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (1.4.0)\n",
"Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (1.3.1)\n",
"Requirement already satisfied: chardet<6,>=3.0.4 in /usr/local/lib/python3.10/dist-packages (from mbstrdecoder<2,>=1.0.0->pytablewriter->lm-eval==1.0.0) (5.2.0)\n",
"Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.0.0->lm-eval==1.0.0) (3.4)\n",
"Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.0.0->lm-eval==1.0.0) (2.0.7)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.0.0->lm-eval==1.0.0) (2023.7.22)\n",
"Requirement already satisfied: python-dateutil<3.0.0,>=2.8.0 in /usr/local/lib/python3.10/dist-packages (from typepy[datetime]<2,>=1.3.2->pytablewriter->lm-eval==1.0.0) (2.8.2)\n",
"Requirement already satisfied: pytz>=2018.9 in /usr/local/lib/python3.10/dist-packages (from typepy[datetime]<2,>=1.3.2->pytablewriter->lm-eval==1.0.0) (2023.3.post1)\n",
"Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=1.8->lm-eval==1.0.0) (2.1.3)\n",
"Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk->rouge-score>=0.0.4->lm-eval==1.0.0) (8.1.7)\n",
"Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=1.8->lm-eval==1.0.0) (1.3.0)\n",
"Building wheels for collected packages: lm-eval, rouge-score, sqlitedict\n",
" Building wheel for lm-eval (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for lm-eval: filename=lm_eval-1.0.0-py3-none-any.whl size=994254 sha256=88356155b19f2891981ecef948326ad6ce8ca40a6009378410ec20d0e225995a\n",
" Stored in directory: /tmp/pip-ephem-wheel-cache-9v6ye7h3/wheels/17/01/26/599c0779e9858a70a73fa8a306699b5b9a868f820c225457b0\n",
" Building wheel for rouge-score (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24933 sha256=6bb0d44e4881972c43ce194e7cb65233d309758cb15f0dec54590d3d2efcfc36\n",
" Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4\n",
" Building wheel for sqlitedict (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for sqlitedict: filename=sqlitedict-2.1.0-py3-none-any.whl size=16863 sha256=5747f7dd73ddf3d8fbcebf51b5e4f718fabe1e94bccdf16d2f22a2e65ee7fdf4\n",
" Stored in directory: /root/.cache/pip/wheels/79/d6/e7/304e0e6cb2221022c26d8161f7c23cd4f259a9e41e8bbcfabd\n",
"Successfully built lm-eval rouge-score sqlitedict\n",
"Installing collected packages: sqlitedict, zstandard, tcolorpy, pybind11, pyarrow-hotfix, portalocker, pathvalidate, mbstrdecoder, jsonlines, dill, colorama, typepy, tqdm-multiprocess, sacrebleu, rouge-score, responses, multiprocess, accelerate, datasets, DataProperty, tabledata, peft, evaluate, pytablewriter, lm-eval\n",
"Successfully installed DataProperty-1.0.1 accelerate-0.24.1 colorama-0.4.6 datasets-2.15.0 dill-0.3.7 evaluate-0.4.1 jsonlines-4.0.0 lm-eval-1.0.0 mbstrdecoder-1.1.3 multiprocess-0.70.15 pathvalidate-3.2.0 peft-0.6.2 portalocker-2.8.2 pyarrow-hotfix-0.6 pybind11-2.11.1 pytablewriter-1.2.0 responses-0.18.0 rouge-score-0.1.2 sacrebleu-2.3.2 sqlitedict-2.1.0 tabledata-1.3.3 tcolorpy-0.1.4 tqdm-multiprocess-0.0.11 typepy-1.3.2 zstandard-0.22.0\n"
]
}
],
"source": [
"# Install LM-Eval\n",
"!pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 0,
"referenced_widgets": [
"a1d3a8aa016544a78e8821c8f6199e06",
"f61ed33fad754146bdd2ac9db1ba1c48",
"bfa0af6aeff344c6845e1080a878e92e",
"fd1ad9e0367d4004aae853b91c3a7617",
"6b2d90209ec14230b3d58a74ac9b83bf",
"a73f357065d34d7baf0453ae4a8d75e2",
"46f521b73fd943c081c648fd873ebc0a",
"7c5689bc13684db8a22681f41863dddd",
"48763b6233374554ae76035c0483066f",
"4986a21eb560448fa79f4b25cde48951",
"aed3acd2f2d74003b44079c333a0698e"
]
}, },
"id": "uyO5MaKkZyah",
"outputId": "d46e8096-5086-4e49-967e-ea33d4a2a335"
},
"outputs": [
{ {
"cell_type": "markdown", "data": {
"metadata": { "application/vnd.jupyter.widget-view+json": {
"id": "XceRKCuuDtbn" "model_id": "a1d3a8aa016544a78e8821c8f6199e06",
"version_major": 2,
"version_minor": 0
}, },
"source": [ "text/plain": [
"## Edit Prompt Templates Quickly\n", "Downloading builder script: 0%| | 0.00/5.67k [00:00<?, ?B/s]"
"\n",
"The following is a yaml made to evaluate the specific subtask of `high_school_geography` from MMLU. It uses the standard prompt where the we choose the letters from the options with most likelihood as the model's prediction."
] ]
}, },
"metadata": {},
"output_type": "display_data"
}
],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "8rfUeX6n_wkK"
},
"source": [
"## Create new evaluation tasks with config-based tasks\n",
"\n",
"Even within the same task, many works have reported numbers based on different choices of evaluation. Some report on the test sets, validation sets, or even subset of the training sets. Others have specialized prompts and verbalizers. We introduce YAMLs to allow users to easily make different variations. By leveraging the YAML configs to configure evaluations, the refactored LM-Eval takes the methods of the `Task` object and makes them configurable by setting the appropriate attributes in the config file. There, users can set the tasks they want by setting the name of the HF dataset (local tasks are also possible), the dataset splits used, and much more. Key configurations relating to prompting, such as `doc_to_text`, previously implemented as a method of the same name, are now configurable with jinja2 to allow high-level scripting to transform a HF dataset to text string as input to the model.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HYFUhhfOSJKe"
},
"source": [
"A core-feature to LM-Eval is to configure tasks with YAML configs. With configs, you can fill preset fields to easily set up a task.\n",
"\n",
"Here, we write a demo YAML config for a multiple-choice evaluation of BoolQ:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"id": "bg3dGROW-V39"
},
"outputs": [],
"source": [
"YAML_boolq_string = \"\"\"\n",
"task: demo_boolq\n",
"dataset_path: super_glue\n",
"dataset_name: boolq\n",
"output_type: multiple_choice\n",
"training_split: train\n",
"validation_split: validation\n",
"doc_to_text: \"{{passage}}\\nQuestion: {{question}}?\\nAnswer:\"\n",
"doc_to_target: label\n",
"doc_to_choice: [\"no\", \"yes\"]\n",
"should_decontaminate: true\n",
"doc_to_decontamination_query: passage\n",
"metric_list:\n",
" - metric: acc\n",
"\"\"\"\n",
"with open(\"boolq.yaml\", \"w\") as f:\n",
" f.write(YAML_boolq_string)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And we can now run evaluation on this task, by pointing to the config file we've just created:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"id": "LOUHK7PtQfq4"
},
"outputs": [
{ {
"cell_type": "code", "name": "stdout",
"execution_count": 7, "output_type": "stream",
"metadata": { "text": [
"id": "GTFvdt9kSlBG" "2023-11-29:11:54:55,156 INFO [utils.py:160] NumExpr defaulting to 2 threads.\n",
}, "2023-11-29 11:54:55.942051: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"outputs": [], "2023-11-29 11:54:55.942108: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
"source": [ "2023-11-29 11:54:55.942142: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
"YAML_mmlu_geo_string = '''\n", "2023-11-29 11:54:57.066802: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
"task: demo_mmlu_high_school_geography\n", "2023-11-29:11:55:00,954 INFO [__main__.py:132] Verbosity set to INFO\n",
"dataset_path: cais/mmlu\n", "2023-11-29:11:55:11,038 WARNING [__main__.py:138] --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.\n",
"dataset_name: high_school_geography\n", "2023-11-29:11:55:11,038 INFO [__main__.py:143] Including path: ./\n",
"description: \"The following are multiple choice questions (with answers) about high school geography.\\n\\n\"\n", "2023-11-29:11:55:11,046 INFO [__main__.py:205] Selected Tasks: ['demo_boolq']\n",
"test_split: test\n", "2023-11-29:11:55:11,047 WARNING [evaluator.py:93] generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.\n",
"fewshot_split: dev\n", "2023-11-29:11:55:11,110 INFO [huggingface.py:120] Using device 'cuda'\n",
"fewshot_config:\n", "config.json: 100% 571/571 [00:00<00:00, 2.87MB/s]\n",
" sampler: first_n\n", "model.safetensors: 100% 5.68G/5.68G [00:32<00:00, 173MB/s]\n",
"output_type: multiple_choice\n", "tokenizer_config.json: 100% 396/396 [00:00<00:00, 2.06MB/s]\n",
"doc_to_text: \"{{question.strip()}}\\nA. {{choices[0]}}\\nB. {{choices[1]}}\\nC. {{choices[2]}}\\nD. {{choices[3]}}\\nAnswer:\"\n", "tokenizer.json: 100% 2.11M/2.11M [00:00<00:00, 11.6MB/s]\n",
"doc_to_choice: [\"A\", \"B\", \"C\", \"D\"]\n", "special_tokens_map.json: 100% 99.0/99.0 [00:00<00:00, 555kB/s]\n",
"doc_to_target: answer\n", "2023-11-29:11:56:18,658 WARNING [task.py:614] [Task: demo_boolq] metric acc is defined, but aggregation is not. using default aggregation=mean\n",
"metric_list:\n", "2023-11-29:11:56:18,658 WARNING [task.py:626] [Task: demo_boolq] metric acc is defined, but higher_is_better is not. using default higher_is_better=True\n",
" - metric: acc\n", "Downloading builder script: 100% 30.7k/30.7k [00:00<00:00, 59.0MB/s]\n",
" aggregation: mean\n", "Downloading metadata: 100% 38.7k/38.7k [00:00<00:00, 651kB/s]\n",
" higher_is_better: true\n", "Downloading readme: 100% 14.8k/14.8k [00:00<00:00, 37.3MB/s]\n",
" - metric: acc_norm\n", "Downloading data: 100% 4.12M/4.12M [00:00<00:00, 55.1MB/s]\n",
" aggregation: mean\n", "Generating train split: 100% 9427/9427 [00:00<00:00, 15630.89 examples/s]\n",
" higher_is_better: true\n", "Generating validation split: 100% 3270/3270 [00:00<00:00, 20002.56 examples/s]\n",
"'''\n", "Generating test split: 100% 3245/3245 [00:00<00:00, 20866.19 examples/s]\n",
"with open('mmlu_high_school_geography.yaml', 'w') as f:\n", "2023-11-29:11:56:22,315 INFO [task.py:355] Building contexts for task on rank 0...\n",
" f.write(YAML_mmlu_geo_string)\n" "2023-11-29:11:56:22,322 INFO [evaluator.py:319] Running loglikelihood requests\n",
] "100% 20/20 [00:04<00:00, 4.37it/s]\n",
}, "fatal: not a git repository (or any of the parent directories): .git\n",
"hf (pretrained=EleutherAI/pythia-2.8b), gen_kwargs: (), limit: 10.0, num_fewshot: None, batch_size: 1\n",
"| Tasks |Version|Filter|n-shot|Metric|Value| |Stderr|\n",
"|----------|-------|------|-----:|------|----:|---|-----:|\n",
"|demo_boolq|Yaml |none | 0|acc | 1|± | 0|\n",
"\n"
]
}
],
"source": [
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --include_path ./ \\\n",
" --tasks demo_boolq \\\n",
" --limit 10"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LOUHK7PtQfq4"
},
"source": [
"Often, tasks are part of a larger group used to measure different capabilities. The dynamism of the field today means new dimensions of evaluation can come about which would mix and match new and older tasks alike. In LM-Eval, We can also group tasks and call that the group name to evaluate on a set of tasks easily. In this instance, let's evaluate the tag `yes_or_no_tasks` which comprise of the tasks `demo_boolq` and `demo_cola`; tasks which are multiple choice tasks with options `yes` and `no` as the name suggests.\n",
"\n",
"<!-- making new groups is easier than ever, allowing user to work bottom-up by makiing individual tasks and linking them to a group or Top-Down, making a new group by listing existing tasks.\n",
"\n",
"We also show the aggregate across samples besides only showing the aggregation between subtasks. This may come in handy when certain groups want to be aggregated as a single task. -->\n",
"\n",
"\n"
]
},
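As a reminder of how the linkage works, any task whose config carries `yes_or_no_tasks` in its `tag` field is pulled in when that tag is passed to `--tasks`. A minimal, hypothetical fragment for the other member of the tag might look like the sketch below (only the linking keys are shown; the rest of `demo_boolq`'s config is as defined earlier in this notebook):

```python
# Hypothetical fragment: how a second task opts into the same tag. Only the
# linking keys are shown; the remaining demo_boolq fields are unchanged.
YAML_boolq_tag_fragment = """
tag: yes_or_no_tasks
task: demo_boolq
# ... dataset, prompt, and metric fields as defined earlier ...
"""
```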
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"id": "fthNg3ywO-kA"
},
"outputs": [],
"source": [
"YAML_cola_string = \"\"\"\n",
"tag: yes_or_no_tasks\n",
"task: demo_cola\n",
"dataset_path: glue\n",
"dataset_name: cola\n",
"output_type: multiple_choice\n",
"training_split: train\n",
"validation_split: validation\n",
"doc_to_text: \"{{sentence}}\\nQuestion: Does this sentence make sense?\\nAnswer:\"\n",
"doc_to_target: label\n",
"doc_to_choice: [\"no\", \"yes\"]\n",
"should_decontaminate: true\n",
"doc_to_decontamination_query: sentence\n",
"metric_list:\n",
" - metric: acc\n",
"\"\"\"\n",
"with open(\"cola.yaml\", \"w\") as f:\n",
" f.write(YAML_cola_string)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"id": "XceRKCuuDtbn"
},
"outputs": [
{ {
"cell_type": "code", "name": "stdout",
"execution_count": 8, "output_type": "stream",
"metadata": { "text": [
"id": "jyKOfCsKb-xy" "2023-11-29:11:56:33,016 INFO [utils.py:160] NumExpr defaulting to 2 threads.\n",
}, "2023-11-29 11:56:33.852995: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"outputs": [ "2023-11-29 11:56:33.853050: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
{ "2023-11-29 11:56:33.853087: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
"name": "stdout", "2023-11-29 11:56:35.129047: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
"output_type": "stream", "2023-11-29:11:56:38,546 INFO [__main__.py:132] Verbosity set to INFO\n",
"text": [ "2023-11-29:11:56:47,509 WARNING [__main__.py:138] --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.\n",
"2023-11-29:11:57:23,598 INFO [utils.py:160] NumExpr defaulting to 2 threads.\n", "2023-11-29:11:56:47,509 INFO [__main__.py:143] Including path: ./\n",
"2023-11-29 11:57:24.719750: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", "2023-11-29:11:56:47,517 INFO [__main__.py:205] Selected Tasks: ['yes_or_no_tasks']\n",
"2023-11-29 11:57:24.719806: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", "2023-11-29:11:56:47,520 WARNING [evaluator.py:93] generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.\n",
"2023-11-29 11:57:24.719847: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n", "2023-11-29:11:56:47,550 INFO [huggingface.py:120] Using device 'cuda'\n",
"2023-11-29 11:57:26.656125: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n", "2023-11-29:11:57:08,743 WARNING [task.py:614] [Task: demo_cola] metric acc is defined, but aggregation is not. using default aggregation=mean\n",
"2023-11-29:11:57:31,563 INFO [__main__.py:132] Verbosity set to INFO\n", "2023-11-29:11:57:08,743 WARNING [task.py:626] [Task: demo_cola] metric acc is defined, but higher_is_better is not. using default higher_is_better=True\n",
"2023-11-29:11:57:40,541 WARNING [__main__.py:138] --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.\n", "Downloading builder script: 100% 28.8k/28.8k [00:00<00:00, 52.7MB/s]\n",
"2023-11-29:11:57:40,541 INFO [__main__.py:143] Including path: ./\n", "Downloading metadata: 100% 28.7k/28.7k [00:00<00:00, 51.9MB/s]\n",
"2023-11-29:11:57:40,558 INFO [__main__.py:205] Selected Tasks: ['demo_mmlu_high_school_geography']\n", "Downloading readme: 100% 27.9k/27.9k [00:00<00:00, 48.0MB/s]\n",
"2023-11-29:11:57:40,559 WARNING [evaluator.py:93] generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.\n", "Downloading data: 100% 377k/377k [00:00<00:00, 12.0MB/s]\n",
"2023-11-29:11:57:40,589 INFO [huggingface.py:120] Using device 'cuda'\n", "Generating train split: 100% 8551/8551 [00:00<00:00, 19744.58 examples/s]\n",
"Downloading builder script: 100% 5.84k/5.84k [00:00<00:00, 17.7MB/s]\n", "Generating validation split: 100% 1043/1043 [00:00<00:00, 27057.01 examples/s]\n",
"Downloading metadata: 100% 106k/106k [00:00<00:00, 892kB/s] \n", "Generating test split: 100% 1063/1063 [00:00<00:00, 22705.17 examples/s]\n",
"Downloading readme: 100% 39.7k/39.7k [00:00<00:00, 631kB/s]\n", "2023-11-29:11:57:11,698 INFO [task.py:355] Building contexts for task on rank 0...\n",
"Downloading data: 100% 166M/166M [00:01<00:00, 89.0MB/s]\n", "2023-11-29:11:57:11,704 INFO [evaluator.py:319] Running loglikelihood requests\n",
"Generating auxiliary_train split: 100% 99842/99842 [00:07<00:00, 12536.83 examples/s]\n", "100% 20/20 [00:03<00:00, 5.15it/s]\n",
"Generating test split: 100% 198/198 [00:00<00:00, 1439.20 examples/s]\n", "fatal: not a git repository (or any of the parent directories): .git\n",
"Generating validation split: 100% 22/22 [00:00<00:00, 4181.76 examples/s]\n", "hf (pretrained=EleutherAI/pythia-2.8b), gen_kwargs: (), limit: 10.0, num_fewshot: None, batch_size: 1\n",
"Generating dev split: 100% 5/5 [00:00<00:00, 36.25 examples/s]\n", "| Tasks |Version|Filter|n-shot|Metric|Value| |Stderr|\n",
"2023-11-29:11:58:09,798 INFO [task.py:355] Building contexts for task on rank 0...\n", "|---------------|-------|------|-----:|------|----:|---|-----:|\n",
"2023-11-29:11:58:09,822 INFO [evaluator.py:319] Running loglikelihood requests\n", "|yes_or_no_tasks|N/A |none | 0|acc | 0.7|± |0.1528|\n",
"100% 40/40 [00:05<00:00, 7.86it/s]\n", "| - demo_cola |Yaml |none | 0|acc | 0.7|± |0.1528|\n",
"fatal: not a git repository (or any of the parent directories): .git\n", "\n",
"hf (pretrained=EleutherAI/pythia-2.8b), gen_kwargs: (), limit: 10.0, num_fewshot: None, batch_size: 1\n", "| Groups |Version|Filter|n-shot|Metric|Value| |Stderr|\n",
"| Tasks |Version|Filter|n-shot| Metric |Value| |Stderr|\n", "|---------------|-------|------|-----:|------|----:|---|-----:|\n",
"|-------------------------------|-------|------|-----:|--------|----:|---|-----:|\n", "|yes_or_no_tasks|N/A |none | 0|acc | 0.7|± |0.1528|\n",
"|demo_mmlu_high_school_geography|Yaml |none | 0|acc | 0.3|± |0.1528|\n", "\n"
"| | |none | 0|acc_norm| 0.3|± |0.1528|\n", ]
"\n" }
] ],
} "source": [
], "# !accelerate launch --no_python\n",
"source": [ "!lm_eval \\\n",
"# !accelerate launch --no_python\n", " --model hf \\\n",
"!lm_eval \\\n", " --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --model hf \\\n", " --include_path ./ \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n", " --tasks yes_or_no_tasks \\\n",
" --include_path ./ \\\n", " --limit 10 \\\n",
" --tasks demo_mmlu_high_school_geography \\\n", " --output output/yes_or_no_tasks/ \\\n",
" --limit 10 \\\n", " --log_samples"
" --output output/mmlu_high_school_geography/ \\\n", ]
" --log_samples" },
] {
}, "cell_type": "markdown",
"metadata": {
"id": "XceRKCuuDtbn"
},
"source": [
"## Edit Prompt Templates Quickly\n",
"\n",
"The following is a yaml made to evaluate the specific subtask of `high_school_geography` from MMLU. It uses the standard prompt where the we choose the letters from the options with most likelihood as the model's prediction."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"id": "GTFvdt9kSlBG"
},
"outputs": [],
"source": [
"YAML_mmlu_geo_string = \"\"\"\n",
"task: demo_mmlu_high_school_geography\n",
"dataset_path: cais/mmlu\n",
"dataset_name: high_school_geography\n",
"description: \"The following are multiple choice questions (with answers) about high school geography.\\n\\n\"\n",
"test_split: test\n",
"fewshot_split: dev\n",
"fewshot_config:\n",
" sampler: first_n\n",
"output_type: multiple_choice\n",
"doc_to_text: \"{{question.strip()}}\\nA. {{choices[0]}}\\nB. {{choices[1]}}\\nC. {{choices[2]}}\\nD. {{choices[3]}}\\nAnswer:\"\n",
"doc_to_choice: [\"A\", \"B\", \"C\", \"D\"]\n",
"doc_to_target: answer\n",
"metric_list:\n",
" - metric: acc\n",
" aggregation: mean\n",
" higher_is_better: true\n",
" - metric: acc_norm\n",
" aggregation: mean\n",
" higher_is_better: true\n",
"\"\"\"\n",
"with open(\"mmlu_high_school_geography.yaml\", \"w\") as f:\n",
" f.write(YAML_mmlu_geo_string)"
]
},
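To make the template concrete, here is a small sketch of the context string this config roughly produces for a single document in the 0-shot setting; the question and options below are invented for illustration and are not taken from MMLU.

```python
# Sketch: what the harness roughly sends to the model for one document under
# this config (0-shot). The `description` string is prepended, `doc_to_text`
# renders the question and lettered options, and each entry of doc_to_choice
# ("A"/"B"/"C"/"D") is scored as a continuation. The example document is invented.
doc = {
    "question": "Which of the following is a renewable energy source?",
    "choices": ["Coal", "Natural gas", "Solar", "Peat"],
    "answer": 2,
}

description = (
    "The following are multiple choice questions (with answers) "
    "about high school geography.\n\n"
)
prompt = description + (
    f"{doc['question'].strip()}\n"
    f"A. {doc['choices'][0]}\n"
    f"B. {doc['choices'][1]}\n"
    f"C. {doc['choices'][2]}\n"
    f"D. {doc['choices'][3]}\n"
    "Answer:"
)
print(prompt)  # the continuations " A", " B", " C", " D" are then scored
```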
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"id": "jyKOfCsKb-xy"
},
"outputs": [
{ {
"cell_type": "markdown", "name": "stdout",
"metadata": { "output_type": "stream",
"id": "jyKOfCsKb-xy" "text": [
}, "2023-11-29:11:57:23,598 INFO [utils.py:160] NumExpr defaulting to 2 threads.\n",
"source": [ "2023-11-29 11:57:24.719750: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"We could also evaluate this task in a different way. For example, instead of observing the loglikelihood of the letters, we can instead evaluate on the choices themselves as the continuation. This is done by simply changing `doc_to_choice` from a list of letters to the corresponding `choices` field from the HF dataset. We write `\"{{choices}}\"` so that the string field is interpreted as jinja string that acquires the list from the HF dataset directly.\n", "2023-11-29 11:57:24.719806: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
"\n", "2023-11-29 11:57:24.719847: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
"Another convenient feature here is since we're only modifying the `doc_to_choice` and the rest of config is the same as the task above, we can use the above configuration as a template by using `include: mmlu_high_school_geography.yaml` to load the config from that file. We'll need to add a unique task name as to not colide with the existing yaml config we're including. For this case we'll simply name this one `mmlu_high_school_geography_continuation`. `doc_to_text` is added here just for sake of clarity." "2023-11-29 11:57:26.656125: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
] "2023-11-29:11:57:31,563 INFO [__main__.py:132] Verbosity set to INFO\n",
}, "2023-11-29:11:57:40,541 WARNING [__main__.py:138] --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.\n",
"2023-11-29:11:57:40,541 INFO [__main__.py:143] Including path: ./\n",
"2023-11-29:11:57:40,558 INFO [__main__.py:205] Selected Tasks: ['demo_mmlu_high_school_geography']\n",
"2023-11-29:11:57:40,559 WARNING [evaluator.py:93] generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.\n",
"2023-11-29:11:57:40,589 INFO [huggingface.py:120] Using device 'cuda'\n",
"Downloading builder script: 100% 5.84k/5.84k [00:00<00:00, 17.7MB/s]\n",
"Downloading metadata: 100% 106k/106k [00:00<00:00, 892kB/s] \n",
"Downloading readme: 100% 39.7k/39.7k [00:00<00:00, 631kB/s]\n",
"Downloading data: 100% 166M/166M [00:01<00:00, 89.0MB/s]\n",
"Generating auxiliary_train split: 100% 99842/99842 [00:07<00:00, 12536.83 examples/s]\n",
"Generating test split: 100% 198/198 [00:00<00:00, 1439.20 examples/s]\n",
"Generating validation split: 100% 22/22 [00:00<00:00, 4181.76 examples/s]\n",
"Generating dev split: 100% 5/5 [00:00<00:00, 36.25 examples/s]\n",
"2023-11-29:11:58:09,798 INFO [task.py:355] Building contexts for task on rank 0...\n",
"2023-11-29:11:58:09,822 INFO [evaluator.py:319] Running loglikelihood requests\n",
"100% 40/40 [00:05<00:00, 7.86it/s]\n",
"fatal: not a git repository (or any of the parent directories): .git\n",
"hf (pretrained=EleutherAI/pythia-2.8b), gen_kwargs: (), limit: 10.0, num_fewshot: None, batch_size: 1\n",
"| Tasks |Version|Filter|n-shot| Metric |Value| |Stderr|\n",
"|-------------------------------|-------|------|-----:|--------|----:|---|-----:|\n",
"|demo_mmlu_high_school_geography|Yaml |none | 0|acc | 0.3|± |0.1528|\n",
"| | |none | 0|acc_norm| 0.3|± |0.1528|\n",
"\n"
]
}
],
"source": [
"# !accelerate launch --no_python\n",
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --include_path ./ \\\n",
" --tasks demo_mmlu_high_school_geography \\\n",
" --limit 10 \\\n",
" --output output/mmlu_high_school_geography/ \\\n",
" --log_samples"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jyKOfCsKb-xy"
},
"source": [
"We could also evaluate this task in a different way. For example, instead of observing the loglikelihood of the letters, we can instead evaluate on the choices themselves as the continuation. This is done by simply changing `doc_to_choice` from a list of letters to the corresponding `choices` field from the HF dataset. We write `\"{{choices}}\"` so that the string field is interpreted as jinja string that acquires the list from the HF dataset directly.\n",
"\n",
"Another convenient feature here is since we're only modifying the `doc_to_choice` and the rest of config is the same as the task above, we can use the above configuration as a template by using `include: mmlu_high_school_geography.yaml` to load the config from that file. We'll need to add a unique task name as to not colide with the existing yaml config we're including. For this case we'll simply name this one `mmlu_high_school_geography_continuation`. `doc_to_text` is added here just for sake of clarity."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"id": "lqElwU54TaK-"
},
"outputs": [],
"source": [
"YAML_mmlu_geo_string = \"\"\"\n",
"include: mmlu_high_school_geography.yaml\n",
"task: demo_mmlu_high_school_geography_continuation\n",
"doc_to_text: \"{{question.strip()}}\\nA. {{choices[0]}}\\nB. {{choices[1]}}\\nC. {{choices[2]}}\\nD. {{choices[3]}}\\nAnswer:\"\n",
"doc_to_choice: \"{{choices}}\"\n",
"\"\"\"\n",
"with open(\"mmlu_high_school_geography_continuation.yaml\", \"w\") as f:\n",
" f.write(YAML_mmlu_geo_string)"
]
},
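Concretely, the change in `doc_to_choice` only changes what gets scored as the continuation; the context string is identical in both configs. A hypothetical illustration (option texts invented, matching the toy example above):

```python
# Illustration with hypothetical values: the continuation strings scored for a
# single document under each config. The prompt/context is the same in both cases.
letter_continuations = [" A", " B", " C", " D"]  # doc_to_choice: ["A", "B", "C", "D"]
full_text_continuations = [                      # doc_to_choice: "{{choices}}"
    " Coal",
    " Natural gas",
    " Solar",
    " Peat",
]
# acc takes the argmax of the raw loglikelihoods; acc_norm length-normalizes each
# continuation's score, which matters more for the full-text variant.
```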
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"id": "-_CVnDirdy7j"
},
"outputs": [
{ {
"cell_type": "code", "name": "stdout",
"execution_count": 9, "output_type": "stream",
"metadata": { "text": [
"id": "lqElwU54TaK-" "2023-11-29:11:58:21,284 INFO [utils.py:160] NumExpr defaulting to 2 threads.\n",
}, "2023-11-29 11:58:22.850159: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"outputs": [], "2023-11-29 11:58:22.850219: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
"source": [ "2023-11-29 11:58:22.850254: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
"YAML_mmlu_geo_string = '''\n", "2023-11-29 11:58:24.948103: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
"include: mmlu_high_school_geography.yaml\n", "2023-11-29:11:58:28,460 INFO [__main__.py:132] Verbosity set to INFO\n",
"task: demo_mmlu_high_school_geography_continuation\n", "2023-11-29:11:58:37,935 WARNING [__main__.py:138] --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.\n",
"doc_to_text: \"{{question.strip()}}\\nA. {{choices[0]}}\\nB. {{choices[1]}}\\nC. {{choices[2]}}\\nD. {{choices[3]}}\\nAnswer:\"\n", "2023-11-29:11:58:37,935 INFO [__main__.py:143] Including path: ./\n",
"doc_to_choice: \"{{choices}}\"\n", "2023-11-29:11:58:37,969 INFO [__main__.py:205] Selected Tasks: ['demo_mmlu_high_school_geography_continuation']\n",
"'''\n", "2023-11-29:11:58:37,972 WARNING [evaluator.py:93] generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.\n",
"with open('mmlu_high_school_geography_continuation.yaml', 'w') as f:\n", "2023-11-29:11:58:38,008 INFO [huggingface.py:120] Using device 'cuda'\n",
" f.write(YAML_mmlu_geo_string)\n" "2023-11-29:11:58:59,758 INFO [task.py:355] Building contexts for task on rank 0...\n",
] "2023-11-29:11:58:59,777 INFO [evaluator.py:319] Running loglikelihood requests\n",
}, "100% 40/40 [00:02<00:00, 16.23it/s]\n",
"fatal: not a git repository (or any of the parent directories): .git\n",
"hf (pretrained=EleutherAI/pythia-2.8b), gen_kwargs: (), limit: 10.0, num_fewshot: None, batch_size: 1\n",
"| Tasks |Version|Filter|n-shot| Metric |Value| |Stderr|\n",
"|--------------------------------------------|-------|------|-----:|--------|----:|---|-----:|\n",
"|demo_mmlu_high_school_geography_continuation|Yaml |none | 0|acc | 0.1|± |0.1000|\n",
"| | |none | 0|acc_norm| 0.2|± |0.1333|\n",
"\n"
]
}
],
"source": [
"# !accelerate launch --no_python\n",
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --include_path ./ \\\n",
" --tasks demo_mmlu_high_school_geography_continuation \\\n",
" --limit 10 \\\n",
" --output output/mmlu_high_school_geography_continuation/ \\\n",
" --log_samples"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-_CVnDirdy7j"
},
"source": [
"If we take a look at the samples, we can see that it is in fact evaluating the continuation based on the choices rather than the letters."
]
},
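Outside of Colab, the same log file can be inspected with a few lines of plain Python instead of `google.colab.files`. The sketch below only prints the keys and the raw first record, since the exact per-sample schema can vary between versions.

```python
import json

# Peek at the first logged sample without Colab. We avoid assuming specific
# field names and just dump the keys plus a truncated view of the record.
path = (
    "output/mmlu_high_school_geography_continuation/"
    "pretrained__EleutherAI__pythia-2.8b_demo_mmlu_high_school_geography_continuation.jsonl"
)
with open(path) as f:
    first = json.loads(f.readline())

print(sorted(first.keys()))
print(json.dumps(first, indent=2)[:2000])  # truncate for readability
```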
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"id": "duBDqC6PAdjL"
},
"outputs": [
{ {
"cell_type": "code", "data": {
"execution_count": 10, "application/javascript": "\n ((filepath) => {{\n if (!google.colab.kernel.accessAllowed) {{\n return;\n }}\n google.colab.files.view(filepath);\n }})(\"/content/output/mmlu_high_school_geography_continuation/pretrained__EleutherAI__pythia-2.8b_demo_mmlu_high_school_geography_continuation.jsonl\")",
"metadata": { "text/plain": [
"id": "-_CVnDirdy7j" "<IPython.core.display.Javascript object>"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2023-11-29:11:58:21,284 INFO [utils.py:160] NumExpr defaulting to 2 threads.\n",
"2023-11-29 11:58:22.850159: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"2023-11-29 11:58:22.850219: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
"2023-11-29 11:58:22.850254: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
"2023-11-29 11:58:24.948103: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
"2023-11-29:11:58:28,460 INFO [__main__.py:132] Verbosity set to INFO\n",
"2023-11-29:11:58:37,935 WARNING [__main__.py:138] --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.\n",
"2023-11-29:11:58:37,935 INFO [__main__.py:143] Including path: ./\n",
"2023-11-29:11:58:37,969 INFO [__main__.py:205] Selected Tasks: ['demo_mmlu_high_school_geography_continuation']\n",
"2023-11-29:11:58:37,972 WARNING [evaluator.py:93] generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.\n",
"2023-11-29:11:58:38,008 INFO [huggingface.py:120] Using device 'cuda'\n",
"2023-11-29:11:58:59,758 INFO [task.py:355] Building contexts for task on rank 0...\n",
"2023-11-29:11:58:59,777 INFO [evaluator.py:319] Running loglikelihood requests\n",
"100% 40/40 [00:02<00:00, 16.23it/s]\n",
"fatal: not a git repository (or any of the parent directories): .git\n",
"hf (pretrained=EleutherAI/pythia-2.8b), gen_kwargs: (), limit: 10.0, num_fewshot: None, batch_size: 1\n",
"| Tasks |Version|Filter|n-shot| Metric |Value| |Stderr|\n",
"|--------------------------------------------|-------|------|-----:|--------|----:|---|-----:|\n",
"|demo_mmlu_high_school_geography_continuation|Yaml |none | 0|acc | 0.1|± |0.1000|\n",
"| | |none | 0|acc_norm| 0.2|± |0.1333|\n",
"\n"
]
}
],
"source": [
"# !accelerate launch --no_python\n",
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --include_path ./ \\\n",
" --tasks demo_mmlu_high_school_geography_continuation \\\n",
" --limit 10 \\\n",
" --output output/mmlu_high_school_geography_continuation/ \\\n",
" --log_samples\n"
] ]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from google.colab import files\n",
"\n",
"\n",
"files.view(\n",
" \"output/mmlu_high_school_geography_continuation/pretrained__EleutherAI__pythia-2.8b_demo_mmlu_high_school_geography_continuation.jsonl\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6p0-KPwAgK5j"
},
"source": [
"## Closer Look at YAML Fields\n",
"\n",
"To prepare a task we can simply fill in a YAML config with the relevant information.\n",
"\n",
"`output_type`\n",
"The current provided evaluation types comprise of the following:\n",
"1. `loglikelihood`: Evaluates the loglikelihood of a continuation, conditioned on some input string.\n",
"2. `loglikelihood_rolling`: evaluate the loglikelihood of producing a string, conditioned on the empty string. (Used for perplexity evaluations)\n",
"3. `multiple_choice`: Evaluates loglikelihood among the a number of choices predicted by the model.\n",
"4. `greedy_until`: Model outputs greedy generation (can be configured to to use beam search and other generation-related parameters)\n",
"\n",
"The core prompt revolves around 3 fields.\n",
"1. `doc_to_text`: Denotes the prompt template that will be used as input to the model.\n",
"2. `doc_to_choice`: Available choices that will be used as continuation for the model. This is used when the `output_type` is `multiple_choice`, and otherwise can be left as `None`.\n",
"3. `doc_to_target`: When `output_type` is `multiple_choice`, this can be an index that corresponds to the correct answer, or the answer string itself (must be a subset of `doc_to_choice`). For other tasks, this is expected to be a string. You can fill this field with a feature name from the HF dataset so long as the resulting feature follows the conditioned described.\n",
"\n",
"These three fields can be expressed as strings, column names from the source dataset, or as Jinja2 templates that can use fields from the source dataset as variables.\n"
]
},
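Putting the pieces above together, a minimal config might look like the following sketch, written in the same write-a-YAML-string style used throughout this notebook. Note that `my_org/my_dataset`, `text`, and `label` are hypothetical placeholder names, not a real dataset.

```python
# A minimal, hypothetical task config tying together output_type and the three
# doc_to_* fields. `my_org/my_dataset`, `text`, and `label` are placeholders.
YAML_minimal_task_string = """
task: demo_minimal_task
dataset_path: my_org/my_dataset
output_type: multiple_choice
test_split: test
doc_to_text: "{{text.strip()}}\\nQuestion: Is the above statement true?\\nAnswer:"
doc_to_choice: ["no", "yes"]
doc_to_target: label
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
"""
with open("demo_minimal_task.yaml", "w") as f:
    f.write(YAML_minimal_task_string)
```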
{
"cell_type": "markdown",
"metadata": {
"id": "6p0-KPwAgK5j"
},
"source": [
"## What if Jinja is not Sufficient?\n",
"\n",
"There can be times where the Jinja2 templating language is not enough to make the prompt we had in mind. There are a few ways to circumvent this limitation:\n",
"\n",
"1. Use `!function` operator for the prompt-related fields to pass a python function that takes as input the dataset row, and will output the prompt template component.\n",
"2. Perform a transformation on the dataset beforehand."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below, we show an example of using `!function` to create `doc_to_text` from a python function:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
}, },
"id": "DYZ5c0JhR1lJ",
"outputId": "ca945235-fb9e-4f17-8bfa-78e7d6ec1490"
},
"outputs": [
{ {
"cell_type": "markdown", "name": "stdout",
"metadata": { "output_type": "stream",
"id": "-_CVnDirdy7j" "text": [
}, "2023-11-29:11:59:08,312 INFO [utils.py:160] NumExpr defaulting to 2 threads.\n",
"source": [ "2023-11-29 11:59:09.348327: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"If we take a look at the samples, we can see that it is in fact evaluating the continuation based on the choices rather than the letters." "2023-11-29 11:59:09.348387: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
] "2023-11-29 11:59:09.348421: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
"2023-11-29 11:59:10.573752: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
"2023-11-29:11:59:14,044 INFO [__main__.py:132] Verbosity set to INFO\n",
"2023-11-29:11:59:23,654 WARNING [__main__.py:138] --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.\n",
"2023-11-29:11:59:23,654 INFO [__main__.py:143] Including path: ./\n",
"2023-11-29:11:59:23,678 INFO [__main__.py:205] Selected Tasks: ['demo_mmlu_high_school_geography_function_prompt']\n",
"2023-11-29:11:59:23,679 WARNING [evaluator.py:93] generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.\n",
"2023-11-29:11:59:23,708 INFO [huggingface.py:120] Using device 'cuda'\n",
"2023-11-29:11:59:44,516 INFO [task.py:355] Building contexts for task on rank 0...\n",
"2023-11-29:11:59:44,524 INFO [evaluator.py:319] Running loglikelihood requests\n",
"100% 40/40 [00:02<00:00, 15.41it/s]\n",
"fatal: not a git repository (or any of the parent directories): .git\n",
"hf (pretrained=EleutherAI/pythia-2.8b), gen_kwargs: (), limit: 10.0, num_fewshot: None, batch_size: 1\n",
"| Tasks |Version|Filter|n-shot| Metric |Value| |Stderr|\n",
"|-----------------------------------------------|-------|------|-----:|--------|----:|---|-----:|\n",
"|demo_mmlu_high_school_geography_function_prompt|Yaml |none | 0|acc | 0.1|± |0.1000|\n",
"| | |none | 0|acc_norm| 0.2|± |0.1333|\n",
"\n"
]
}
],
"source": [
"YAML_mmlu_geo_string = \"\"\"\n",
"include: mmlu_high_school_geography.yaml\n",
"task: demo_mmlu_high_school_geography_function_prompt\n",
"doc_to_text: !function utils.doc_to_text\n",
"doc_to_choice: \"{{choices}}\"\n",
"\"\"\"\n",
"with open(\"demo_mmlu_high_school_geography_function_prompt.yaml\", \"w\") as f:\n",
" f.write(YAML_mmlu_geo_string)\n",
"\n",
"DOC_TO_TEXT = \"\"\"\n",
"def doc_to_text(x):\n",
" question = x[\"question\"].strip()\n",
" choices = x[\"choices\"]\n",
" option_a = choices[0]\n",
" option_b = choices[1]\n",
" option_c = choices[2]\n",
" option_d = choices[3]\n",
" return f\"{question}\\\\nA. {option_a}\\\\nB. {option_b}\\\\nC. {option_c}\\\\nD. {option_d}\\\\nAnswer:\"\n",
"\"\"\"\n",
"with open(\"utils.py\", \"w\") as f:\n",
" f.write(DOC_TO_TEXT)\n",
"\n",
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --include_path ./ \\\n",
" --tasks demo_mmlu_high_school_geography_function_prompt \\\n",
" --limit 10 \\\n",
" --output output/demo_mmlu_high_school_geography_function_prompt/ \\\n",
" --log_samples"
]
},
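Since `!function utils.doc_to_text` simply resolves to a Python callable in the `utils.py` file written above, you can sanity-check the template locally before launching an evaluation. The document below is made up for illustration.

```python
# Quick local sanity check of the !function prompt: import the utils.py we just
# wrote and render the prompt for a made-up document.
import importlib

import utils

importlib.reload(utils)  # pick up the freshly written file if it was imported before

toy_doc = {
    "question": "Which continent is the Sahara located on?",
    "choices": ["Asia", "Africa", "Europe", "South America"],
    "answer": 1,
}
print(utils.doc_to_text(toy_doc))
```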
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we'll also show how to do this via preprocessing the dataset as necessary using the `process_docs` config field:\n",
"\n",
"We will write a function that will modify each document in our evaluation dataset's split to add a field that is suitable for us to use in `doc_to_text`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"YAML_mmlu_geo_string = \"\"\"\n",
"include: mmlu_high_school_geography.yaml\n",
"task: demo_mmlu_high_school_geography_function_prompt_2\n",
"process_docs: !function utils_process_docs.process_docs\n",
"doc_to_text: \"{{input}}\"\n",
"doc_to_choice: \"{{choices}}\"\n",
"\"\"\"\n",
"with open(\"demo_mmlu_high_school_geography_process_docs.yaml\", \"w\") as f:\n",
" f.write(YAML_mmlu_geo_string)\n",
"\n",
"DOC_TO_TEXT = \"\"\"\n",
"def process_docs(dataset):\n",
" def _process_doc(x):\n",
" question = x[\"question\"].strip()\n",
" choices = x[\"choices\"]\n",
" option_a = choices[0]\n",
" option_b = choices[1]\n",
" option_c = choices[2]\n",
" option_d = choices[3]\n",
" doc[\"input\"] = f\"{question}\\\\nA. {option_a}\\\\nB. {option_b}\\\\nC. {option_c}\\\\nD. {option_d}\\\\nAnswer:\"\n",
" return out_doc\n",
"\n",
" return dataset.map(_process_doc)\n",
"\"\"\"\n",
"\n",
"with open(\"utils_process_docs.py\", \"w\") as f:\n",
" f.write(DOC_TO_TEXT)\n",
"\n",
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --include_path ./ \\\n",
" --tasks demo_mmlu_high_school_geography_function_prompt_2 \\\n",
" --limit 10 \\\n",
" --output output/demo_mmlu_high_school_geography_function_prompt_2/ \\\n",
" --log_samples"
]
},
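As with the prompt function, the `process_docs` hook can be exercised locally on a toy `datasets.Dataset` to confirm that the new `input` column comes out as expected; the rows below are invented.

```python
# Quick local check of the process_docs hook on a tiny in-memory dataset.
# The row is invented; we only verify that the added "input" column renders
# the question and lettered options as intended.
import datasets

import utils_process_docs

toy = datasets.Dataset.from_list(
    [
        {
            "question": "Which river flows through Cairo?",
            "choices": ["Nile", "Danube", "Amazon", "Mississippi"],
            "answer": 0,
        }
    ]
)
processed = utils_process_docs.process_docs(toy)
print(processed[0]["input"])
```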
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We hope that this explainer gives you a sense of what can be done with and how to work with LM-Evaluation-Harnes v0.4.0 ! \n",
"\n",
"For more information, check out our documentation pages in the `docs/` folder, and if you have questions, please raise them in GitHub issues, or in #lm-thunderdome or #release-discussion on the EleutherAI discord server."
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"collapsed_sections": [
"zAov81vTbL2K"
],
"gpuType": "T4",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"46f521b73fd943c081c648fd873ebc0a": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "1.5.0",
"model_name": "DescriptionStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
}, },
{ "48763b6233374554ae76035c0483066f": {
"cell_type": "code", "model_module": "@jupyter-widgets/controls",
"execution_count": 11, "model_module_version": "1.5.0",
"metadata": { "model_name": "ProgressStyleModel",
"id": "duBDqC6PAdjL" "state": {
}, "_model_module": "@jupyter-widgets/controls",
"outputs": [ "_model_module_version": "1.5.0",
{ "_model_name": "ProgressStyleModel",
"data": { "_view_count": null,
"application/javascript": "\n ((filepath) => {{\n if (!google.colab.kernel.accessAllowed) {{\n return;\n }}\n google.colab.files.view(filepath);\n }})(\"/content/output/mmlu_high_school_geography_continuation/pretrained__EleutherAI__pythia-2.8b_demo_mmlu_high_school_geography_continuation.jsonl\")", "_view_module": "@jupyter-widgets/base",
"text/plain": [ "_view_module_version": "1.2.0",
"<IPython.core.display.Javascript object>" "_view_name": "StyleView",
] "bar_color": null,
}, "description_width": ""
"metadata": {}, }
"output_type": "display_data"
}
],
"source": [
"from google.colab import files\n",
"files.view(\"output/mmlu_high_school_geography_continuation/pretrained__EleutherAI__pythia-2.8b_demo_mmlu_high_school_geography_continuation.jsonl\")\n"
]
}, },
{ "4986a21eb560448fa79f4b25cde48951": {
"cell_type": "markdown", "model_module": "@jupyter-widgets/base",
"metadata": { "model_module_version": "1.2.0",
"id": "6p0-KPwAgK5j" "model_name": "LayoutModel",
}, "state": {
"source": [ "_model_module": "@jupyter-widgets/base",
"## Closer Look at YAML Fields\n", "_model_module_version": "1.2.0",
"\n", "_model_name": "LayoutModel",
"To prepare a task we can simply fill in a YAML config with the relevant information.\n", "_view_count": null,
"\n", "_view_module": "@jupyter-widgets/base",
"`output_type`\n", "_view_module_version": "1.2.0",
"The current provided evaluation types comprise of the following:\n", "_view_name": "LayoutView",
"1. `loglikelihood`: Evaluates the loglikelihood of a continuation, conditioned on some input string.\n", "align_content": null,
"2. `loglikelihood_rolling`: evaluate the loglikelihood of producing a string, conditioned on the empty string. (Used for perplexity evaluations)\n", "align_items": null,
"3. `multiple_choice`: Evaluates loglikelihood among the a number of choices predicted by the model.\n", "align_self": null,
"4. `greedy_until`: Model outputs greedy generation (can be configured to to use beam search and other generation-related parameters)\n", "border": null,
"\n", "bottom": null,
"The core prompt revolves around 3 fields.\n", "display": null,
"1. `doc_to_text`: Denotes the prompt template that will be used as input to the model.\n", "flex": null,
"2. `doc_to_choice`: Available choices that will be used as continuation for the model. This is used when the `output_type` is `multiple_choice`, and otherwise can be left as `None`.\n", "flex_flow": null,
"3. `doc_to_target`: When `output_type` is `multiple_choice`, this can be an index that corresponds to the correct answer, or the answer string itself (must be a subset of `doc_to_choice`). For other tasks, this is expected to be a string. You can fill this field with a feature name from the HF dataset so long as the resulting feature follows the conditioned described.\n", "grid_area": null,
"\n", "grid_auto_columns": null,
"These three fields can be expressed as strings, column names from the source dataset, or as Jinja2 templates that can use fields from the source dataset as variables.\n" "grid_auto_flow": null,
] "grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
}, },
{ "6b2d90209ec14230b3d58a74ac9b83bf": {
"cell_type": "markdown", "model_module": "@jupyter-widgets/base",
"metadata": { "model_module_version": "1.2.0",
"id": "6p0-KPwAgK5j" "model_name": "LayoutModel",
}, "state": {
"source": [ "_model_module": "@jupyter-widgets/base",
"## What if Jinja is not Sufficient?\n", "_model_module_version": "1.2.0",
"\n", "_model_name": "LayoutModel",
"There can be times where the Jinja2 templating language is not enough to make the prompt we had in mind. There are a few ways to circumvent this limitation:\n", "_view_count": null,
"\n", "_view_module": "@jupyter-widgets/base",
"1. Use `!function` operator for the prompt-related fields to pass a python function that takes as input the dataset row, and will output the prompt template component.\n", "_view_module_version": "1.2.0",
"2. Perform a transformation on the dataset beforehand." "_view_name": "LayoutView",
] "align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
}, },
{ "7c5689bc13684db8a22681f41863dddd": {
"cell_type": "markdown", "model_module": "@jupyter-widgets/base",
"metadata": {}, "model_module_version": "1.2.0",
"source": [ "model_name": "LayoutModel",
"Below, we show an example of using `!function` to create `doc_to_text` from a python function:" "state": {
] "_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
}, },
{ "a1d3a8aa016544a78e8821c8f6199e06": {
"cell_type": "code", "model_module": "@jupyter-widgets/controls",
"execution_count": 12, "model_module_version": "1.5.0",
"metadata": { "model_name": "HBoxModel",
"colab": { "state": {
"base_uri": "https://localhost:8080/" "_dom_classes": [],
}, "_model_module": "@jupyter-widgets/controls",
"id": "DYZ5c0JhR1lJ", "_model_module_version": "1.5.0",
"outputId": "ca945235-fb9e-4f17-8bfa-78e7d6ec1490" "_model_name": "HBoxModel",
}, "_view_count": null,
"outputs": [ "_view_module": "@jupyter-widgets/controls",
{ "_view_module_version": "1.5.0",
"name": "stdout", "_view_name": "HBoxView",
"output_type": "stream", "box_style": "",
"text": [ "children": [
"2023-11-29:11:59:08,312 INFO [utils.py:160] NumExpr defaulting to 2 threads.\n", "IPY_MODEL_f61ed33fad754146bdd2ac9db1ba1c48",
"2023-11-29 11:59:09.348327: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", "IPY_MODEL_bfa0af6aeff344c6845e1080a878e92e",
"2023-11-29 11:59:09.348387: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", "IPY_MODEL_fd1ad9e0367d4004aae853b91c3a7617"
"2023-11-29 11:59:09.348421: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
"2023-11-29 11:59:10.573752: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
"2023-11-29:11:59:14,044 INFO [__main__.py:132] Verbosity set to INFO\n",
"2023-11-29:11:59:23,654 WARNING [__main__.py:138] --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.\n",
"2023-11-29:11:59:23,654 INFO [__main__.py:143] Including path: ./\n",
"2023-11-29:11:59:23,678 INFO [__main__.py:205] Selected Tasks: ['demo_mmlu_high_school_geography_function_prompt']\n",
"2023-11-29:11:59:23,679 WARNING [evaluator.py:93] generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.\n",
"2023-11-29:11:59:23,708 INFO [huggingface.py:120] Using device 'cuda'\n",
"2023-11-29:11:59:44,516 INFO [task.py:355] Building contexts for task on rank 0...\n",
"2023-11-29:11:59:44,524 INFO [evaluator.py:319] Running loglikelihood requests\n",
"100% 40/40 [00:02<00:00, 15.41it/s]\n",
"fatal: not a git repository (or any of the parent directories): .git\n",
"hf (pretrained=EleutherAI/pythia-2.8b), gen_kwargs: (), limit: 10.0, num_fewshot: None, batch_size: 1\n",
"| Tasks |Version|Filter|n-shot| Metric |Value| |Stderr|\n",
"|-----------------------------------------------|-------|------|-----:|--------|----:|---|-----:|\n",
"|demo_mmlu_high_school_geography_function_prompt|Yaml |none | 0|acc | 0.1|± |0.1000|\n",
"| | |none | 0|acc_norm| 0.2|± |0.1333|\n",
"\n"
]
}
], ],
"source": [ "layout": "IPY_MODEL_6b2d90209ec14230b3d58a74ac9b83bf"
"YAML_mmlu_geo_string = '''\n", }
"include: mmlu_high_school_geography.yaml\n",
"task: demo_mmlu_high_school_geography_function_prompt\n",
"doc_to_text: !function utils.doc_to_text\n",
"doc_to_choice: \"{{choices}}\"\n",
"'''\n",
"with open('demo_mmlu_high_school_geography_function_prompt.yaml', 'w') as f:\n",
" f.write(YAML_mmlu_geo_string)\n",
"\n",
"DOC_TO_TEXT = '''\n",
"def doc_to_text(x):\n",
" question = x[\"question\"].strip()\n",
" choices = x[\"choices\"]\n",
" option_a = choices[0]\n",
" option_b = choices[1]\n",
" option_c = choices[2]\n",
" option_d = choices[3]\n",
" return f\"{question}\\\\nA. {option_a}\\\\nB. {option_b}\\\\nC. {option_c}\\\\nD. {option_d}\\\\nAnswer:\"\n",
"'''\n",
"with open('utils.py', 'w') as f:\n",
" f.write(DOC_TO_TEXT)\n",
"\n",
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --include_path ./ \\\n",
" --tasks demo_mmlu_high_school_geography_function_prompt \\\n",
" --limit 10 \\\n",
" --output output/demo_mmlu_high_school_geography_function_prompt/ \\\n",
" --log_samples\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we'll also show how to do this via preprocessing the dataset as necessary using the `process_docs` config field:\n",
"\n",
"We will write a function that will modify each document in our evaluation dataset's split to add a field that is suitable for us to use in `doc_to_text`."
]
}, },
{ "a73f357065d34d7baf0453ae4a8d75e2": {
"cell_type": "code", "model_module": "@jupyter-widgets/base",
"execution_count": null, "model_module_version": "1.2.0",
"metadata": {}, "model_name": "LayoutModel",
"outputs": [], "state": {
"source": [ "_model_module": "@jupyter-widgets/base",
"YAML_mmlu_geo_string = '''\n", "_model_module_version": "1.2.0",
"include: mmlu_high_school_geography.yaml\n", "_model_name": "LayoutModel",
"task: demo_mmlu_high_school_geography_function_prompt_2\n", "_view_count": null,
"process_docs: !function utils_process_docs.process_docs\n", "_view_module": "@jupyter-widgets/base",
"doc_to_text: \"{{input}}\"\n", "_view_module_version": "1.2.0",
"doc_to_choice: \"{{choices}}\"\n", "_view_name": "LayoutView",
"'''\n", "align_content": null,
"with open('demo_mmlu_high_school_geography_process_docs.yaml', 'w') as f:\n", "align_items": null,
" f.write(YAML_mmlu_geo_string)\n", "align_self": null,
"\n", "border": null,
"DOC_TO_TEXT = '''\n", "bottom": null,
"def process_docs(dataset):\n", "display": null,
" def _process_doc(x):\n", "flex": null,
" question = x[\"question\"].strip()\n", "flex_flow": null,
" choices = x[\"choices\"]\n", "grid_area": null,
" option_a = choices[0]\n", "grid_auto_columns": null,
" option_b = choices[1]\n", "grid_auto_flow": null,
" option_c = choices[2]\n", "grid_auto_rows": null,
" option_d = choices[3]\n", "grid_column": null,
" doc[\"input\"] = f\"{question}\\\\nA. {option_a}\\\\nB. {option_b}\\\\nC. {option_c}\\\\nD. {option_d}\\\\nAnswer:\"\n", "grid_gap": null,
" return out_doc\n", "grid_row": null,
"\n", "grid_template_areas": null,
" return dataset.map(_process_doc)\n", "grid_template_columns": null,
"'''\n", "grid_template_rows": null,
"\n", "height": null,
"with open('utils_process_docs.py', 'w') as f:\n", "justify_content": null,
" f.write(DOC_TO_TEXT)\n", "justify_items": null,
"\n", "left": null,
"!lm_eval \\\n", "margin": null,
" --model hf \\\n", "max_height": null,
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n", "max_width": null,
" --include_path ./ \\\n", "min_height": null,
" --tasks demo_mmlu_high_school_geography_function_prompt_2 \\\n", "min_width": null,
" --limit 10 \\\n", "object_fit": null,
" --output output/demo_mmlu_high_school_geography_function_prompt_2/ \\\n", "object_position": null,
" --log_samples\n" "order": null,
] "overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
}, },
{ "aed3acd2f2d74003b44079c333a0698e": {
"cell_type": "markdown", "model_module": "@jupyter-widgets/controls",
"metadata": {}, "model_module_version": "1.5.0",
"source": [ "model_name": "DescriptionStyleModel",
"We hope that this explainer gives you a sense of what can be done with and how to work with LM-Evaluation-Harnes v0.4.0 ! \n", "state": {
"\n", "_model_module": "@jupyter-widgets/controls",
"For more information, check out our documentation pages in the `docs/` folder, and if you have questions, please raise them in GitHub issues, or in #lm-thunderdome or #release-discussion on the EleutherAI discord server." "_model_module_version": "1.5.0",
] "_model_name": "DescriptionStyleModel",
} "_view_count": null,
], "_view_module": "@jupyter-widgets/base",
"metadata": { "_view_module_version": "1.2.0",
"accelerator": "GPU", "_view_name": "StyleView",
"colab": { "description_width": ""
"collapsed_sections": [ }
"zAov81vTbL2K"
],
"gpuType": "T4",
"provenance": []
}, },
"kernelspec": { "bfa0af6aeff344c6845e1080a878e92e": {
"display_name": "Python 3", "model_module": "@jupyter-widgets/controls",
"name": "python3" "model_module_version": "1.5.0",
"model_name": "FloatProgressModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "FloatProgressModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "ProgressView",
"bar_style": "success",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_7c5689bc13684db8a22681f41863dddd",
"max": 5669,
"min": 0,
"orientation": "horizontal",
"style": "IPY_MODEL_48763b6233374554ae76035c0483066f",
"value": 5669
}
}, },
"language_info": { "f61ed33fad754146bdd2ac9db1ba1c48": {
"name": "python" "model_module": "@jupyter-widgets/controls",
"model_module_version": "1.5.0",
"model_name": "HTMLModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_a73f357065d34d7baf0453ae4a8d75e2",
"placeholder": "​",
"style": "IPY_MODEL_46f521b73fd943c081c648fd873ebc0a",
"value": "Downloading builder script: 100%"
}
}, },
"widgets": { "fd1ad9e0367d4004aae853b91c3a7617": {
"application/vnd.jupyter.widget-state+json": { "model_module": "@jupyter-widgets/controls",
"46f521b73fd943c081c648fd873ebc0a": { "model_module_version": "1.5.0",
"model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel",
"model_module_version": "1.5.0", "state": {
"model_name": "DescriptionStyleModel", "_dom_classes": [],
"state": { "_model_module": "@jupyter-widgets/controls",
"_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0",
"_model_module_version": "1.5.0", "_model_name": "HTMLModel",
"_model_name": "DescriptionStyleModel", "_view_count": null,
"_view_count": null, "_view_module": "@jupyter-widgets/controls",
"_view_module": "@jupyter-widgets/base", "_view_module_version": "1.5.0",
"_view_module_version": "1.2.0", "_view_name": "HTMLView",
"_view_name": "StyleView", "description": "",
"description_width": "" "description_tooltip": null,
} "layout": "IPY_MODEL_4986a21eb560448fa79f4b25cde48951",
}, "placeholder": "​",
"48763b6233374554ae76035c0483066f": { "style": "IPY_MODEL_aed3acd2f2d74003b44079c333a0698e",
"model_module": "@jupyter-widgets/controls", "value": " 5.67k/5.67k [00:00&lt;00:00, 205kB/s]"
"model_module_version": "1.5.0", }
"model_name": "ProgressStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "ProgressStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"bar_color": null,
"description_width": ""
}
},
"4986a21eb560448fa79f4b25cde48951": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "1.2.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"6b2d90209ec14230b3d58a74ac9b83bf": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "1.2.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"7c5689bc13684db8a22681f41863dddd": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "1.2.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"a1d3a8aa016544a78e8821c8f6199e06": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "1.5.0",
"model_name": "HBoxModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HBoxModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HBoxView",
"box_style": "",
"children": [
"IPY_MODEL_f61ed33fad754146bdd2ac9db1ba1c48",
"IPY_MODEL_bfa0af6aeff344c6845e1080a878e92e",
"IPY_MODEL_fd1ad9e0367d4004aae853b91c3a7617"
],
"layout": "IPY_MODEL_6b2d90209ec14230b3d58a74ac9b83bf"
}
},
"a73f357065d34d7baf0453ae4a8d75e2": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "1.2.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"aed3acd2f2d74003b44079c333a0698e": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "1.5.0",
"model_name": "DescriptionStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"bfa0af6aeff344c6845e1080a878e92e": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "1.5.0",
"model_name": "FloatProgressModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "FloatProgressModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "ProgressView",
"bar_style": "success",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_7c5689bc13684db8a22681f41863dddd",
"max": 5669,
"min": 0,
"orientation": "horizontal",
"style": "IPY_MODEL_48763b6233374554ae76035c0483066f",
"value": 5669
}
},
"f61ed33fad754146bdd2ac9db1ba1c48": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "1.5.0",
"model_name": "HTMLModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_a73f357065d34d7baf0453ae4a8d75e2",
"placeholder": "​",
"style": "IPY_MODEL_46f521b73fd943c081c648fd873ebc0a",
"value": "Downloading builder script: 100%"
}
},
"fd1ad9e0367d4004aae853b91c3a7617": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "1.5.0",
"model_name": "HTMLModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_4986a21eb560448fa79f4b25cde48951",
"placeholder": "​",
"style": "IPY_MODEL_aed3acd2f2d74003b44079c333a0698e",
"value": " 5.67k/5.67k [00:00&lt;00:00, 205kB/s]"
}
}
}
} }
}, }
"nbformat": 4, }
"nbformat_minor": 0 },
"nbformat": 4,
"nbformat_minor": 0
} }
...@@ -68,6 +68,7 @@ ...@@ -68,6 +68,7 @@
"source": [ "source": [
"import wandb\n", "import wandb\n",
"\n", "\n",
"\n",
"wandb.login()" "wandb.login()"
] ]
}, },
...@@ -130,6 +131,7 @@ ...@@ -130,6 +131,7 @@
"import lm_eval\n", "import lm_eval\n",
"from lm_eval.loggers import WandbLogger\n", "from lm_eval.loggers import WandbLogger\n",
"\n", "\n",
"\n",
"results = lm_eval.simple_evaluate(\n", "results = lm_eval.simple_evaluate(\n",
" model=\"hf\",\n", " model=\"hf\",\n",
" model_args=\"pretrained=microsoft/phi-2,trust_remote_code=True\",\n", " model_args=\"pretrained=microsoft/phi-2,trust_remote_code=True\",\n",
......
...@@ -431,7 +431,12 @@ class TemplateLM(LM): ...@@ -431,7 +431,12 @@ class TemplateLM(LM):
using_default_template = False using_default_template = False
# First, handle the cases when the model has a dict of multiple templates # First, handle the cases when the model has a dict of multiple templates
template = self.tokenizer.chat_template or self.tokenizer.default_chat_template try:
template = (
self.tokenizer.chat_template or self.tokenizer.default_chat_template
)
except AttributeError:
return None
if isinstance(template, dict): if isinstance(template, dict):
using_default_dict = self.tokenizer.chat_template is None using_default_dict = self.tokenizer.chat_template is None
......
...@@ -57,7 +57,6 @@ class TaskConfig(dict): ...@@ -57,7 +57,6 @@ class TaskConfig(dict):
task: Optional[str] = None task: Optional[str] = None
task_alias: Optional[str] = None task_alias: Optional[str] = None
tag: Optional[Union[str, list]] = None tag: Optional[Union[str, list]] = None
group: Optional[Union[str, list]] = None
# HF dataset options. # HF dataset options.
# which dataset to use, # which dataset to use,
# and what splits for what purpose # and what splits for what purpose
...@@ -98,18 +97,6 @@ class TaskConfig(dict): ...@@ -98,18 +97,6 @@ class TaskConfig(dict):
) )
def __post_init__(self) -> None: def __post_init__(self) -> None:
if self.group is not None:
eval_logger.warning(
"A task YAML file was found to contain a `group` key. Groups which provide aggregate scores over several subtasks now require a separate config file--if not aggregating, you may want to use the `tag` config option instead within your config. Setting `group` within a TaskConfig will be deprecated in v0.4.4. Please see https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md for more information."
)
if self.tag is None:
self.tag = self.group
else:
raise ValueError(
"Got both a `group` and `tag` entry within a TaskConfig. Please use one or the other--`group` values will be deprecated in v0.4.4."
)
if self.generation_kwargs is not None: if self.generation_kwargs is not None:
if self.output_type != "generate_until": if self.output_type != "generate_until":
eval_logger.warning( eval_logger.warning(
...@@ -1511,7 +1498,7 @@ class ConfigurableTask(Task): ...@@ -1511,7 +1498,7 @@ class ConfigurableTask(Task):
# we expect multiple_targets to be a list. # we expect multiple_targets to be a list.
elif self.multiple_target: elif self.multiple_target:
gold = list(gold) gold = list(gold)
elif type(gold) != type(result): elif type(gold) is not type(result):
# cast gold to the same type as result # cast gold to the same type as result
gold = type(result)(gold) gold = type(result)(gold)
...@@ -1594,7 +1581,7 @@ class ConfigurableTask(Task): ...@@ -1594,7 +1581,7 @@ class ConfigurableTask(Task):
f"ConfigurableTask(task_name={getattr(self.config, 'task', None)}," f"ConfigurableTask(task_name={getattr(self.config, 'task', None)},"
f"output_type={self.OUTPUT_TYPE}," f"output_type={self.OUTPUT_TYPE},"
f"num_fewshot={getattr(self.config, 'num_fewshot', None)}," f"num_fewshot={getattr(self.config, 'num_fewshot', None)},"
f"num_samples={len(self.eval_docs)})", f"num_samples={len(self.eval_docs)})"
) )
......
...@@ -157,6 +157,9 @@ def simple_evaluate( ...@@ -157,6 +157,9 @@ def simple_evaluate(
seed_message.append(f"Setting torch manual seed to {torch_random_seed}") seed_message.append(f"Setting torch manual seed to {torch_random_seed}")
torch.manual_seed(torch_random_seed) torch.manual_seed(torch_random_seed)
if fewshot_random_seed is not None:
seed_message.append(f"Setting fewshot manual seed to {fewshot_random_seed}")
if seed_message: if seed_message:
eval_logger.info(" | ".join(seed_message)) eval_logger.info(" | ".join(seed_message))
...@@ -276,9 +279,6 @@ def simple_evaluate( ...@@ -276,9 +279,6 @@ def simple_evaluate(
task_obj.set_config(key="num_fewshot", value=0) task_obj.set_config(key="num_fewshot", value=0)
# fewshot_random_seed set for tasks, even with a default num_fewshot (e.g. in the YAML file) # fewshot_random_seed set for tasks, even with a default num_fewshot (e.g. in the YAML file)
task_obj.set_fewshot_seed(seed=fewshot_random_seed) task_obj.set_fewshot_seed(seed=fewshot_random_seed)
eval_logger.info(
f"Setting fewshot random generator seed to {fewshot_random_seed}"
)
adjusted_task_dict[task_name] = task_obj adjusted_task_dict[task_name] = task_obj
...@@ -433,10 +433,14 @@ def evaluate( ...@@ -433,10 +433,14 @@ def evaluate(
) )
# end multimodality validation check # end multimodality validation check
# Cache the limit arg.
limit_arg = limit
limits = []
for task_output in eval_tasks: for task_output in eval_tasks:
task: Task = task_output.task task: Task = task_output.task
limit = get_sample_size(task, limit) limit = get_sample_size(task, limit_arg)
limits.append(limit)
task.build_all_requests( task.build_all_requests(
limit=limit, limit=limit,
rank=lm.rank, rank=lm.rank,
...@@ -506,7 +510,7 @@ def evaluate( ...@@ -506,7 +510,7 @@ def evaluate(
WORLD_SIZE = lm.world_size WORLD_SIZE = lm.world_size
### Postprocess outputs ### ### Postprocess outputs ###
# TODO: del model here, maybe (idea: allow user to specify device of e.g. reward model separately) # TODO: del model here, maybe (idea: allow user to specify device of e.g. reward model separately)
for task_output in eval_tasks: for task_output, limit in zip(eval_tasks, limits):
task = task_output.task task = task_output.task
task.apply_filters() task.apply_filters()
...@@ -655,7 +659,7 @@ def evaluate( ...@@ -655,7 +659,7 @@ def evaluate(
len(task_output.task.eval_docs), len(task_output.task.eval_docs),
), ),
} }
for task_output in eval_tasks for task_output, limit in zip(eval_tasks, limits)
}, },
} }
if log_samples: if log_samples:
......
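For context on the `limit_arg` caching added in the `evaluate()` hunk above: the harness resolves a fractional `limit` into an absolute per-task sample count, so reusing the already-resolved value for the next task would silently change its meaning. A standalone sketch of that resolution logic (the function below is an illustrative stand-in, not the harness's own helper):

```python
import math


def resolve_limit(num_docs: int, limit):
    # Illustrative stand-in: a float < 1.0 is treated as a fraction of the
    # task's documents, anything else as an absolute document count.
    if limit is None:
        return None
    return int(math.ceil(num_docs * limit)) if limit < 1.0 else int(limit)


limit_arg = 0.1                            # user asks for 10% of every task
task_a = resolve_limit(1000, limit_arg)    # 100 docs
task_b = resolve_limit(50, limit_arg)      # 5 docs
stale = resolve_limit(50, task_a)          # 100 docs -- the mistake the cached arg avoids
print(task_a, task_b, stale)
```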
...@@ -73,9 +73,12 @@ class TemplateAPI(TemplateLM): ...@@ -73,9 +73,12 @@ class TemplateAPI(TemplateLM):
seed: int = 1234, seed: int = 1234,
max_length: Optional[int] = 2048, max_length: Optional[int] = 2048,
add_bos_token: bool = False, add_bos_token: bool = False,
custom_prefix_token_id=None, custom_prefix_token_id: int = None,
# send the requests as tokens or strings # send the requests as tokens or strings
tokenized_requests=True, tokenized_requests: bool = True,
trust_remote_code: bool = False,
revision: Optional[str] = "main",
use_fast_tokenizer: bool = True,
**kwargs, **kwargs,
) -> None: ) -> None:
super().__init__() super().__init__()
...@@ -128,7 +131,10 @@ class TemplateAPI(TemplateLM): ...@@ -128,7 +131,10 @@ class TemplateAPI(TemplateLM):
import transformers import transformers
self.tokenizer = transformers.AutoTokenizer.from_pretrained( self.tokenizer = transformers.AutoTokenizer.from_pretrained(
self.tokenizer if self.tokenizer else self.model self.tokenizer if self.tokenizer else self.model,
trust_remote_code=trust_remote_code,
revision=revision,
use_fast=use_fast_tokenizer,
) )
# Not used as the API will handle padding but to mirror the behavior of the HFLM # Not used as the API will handle padding but to mirror the behavior of the HFLM
self.tokenizer = configure_pad_token(self.tokenizer) self.tokenizer = configure_pad_token(self.tokenizer)
...@@ -153,6 +159,9 @@ class TemplateAPI(TemplateLM): ...@@ -153,6 +159,9 @@ class TemplateAPI(TemplateLM):
assert isinstance(tokenizer, str), "tokenizer must be a string" assert isinstance(tokenizer, str), "tokenizer must be a string"
self.tokenizer = transformers.AutoTokenizer.from_pretrained( self.tokenizer = transformers.AutoTokenizer.from_pretrained(
tokenizer, tokenizer,
trust_remote_code=trust_remote_code,
revision=revision,
use_fast=use_fast_tokenizer,
) )
@abc.abstractmethod @abc.abstractmethod
......
...@@ -26,9 +26,9 @@ class DummyLM(LM): ...@@ -26,9 +26,9 @@ class DummyLM(LM):
def generate_until(self, requests, disable_tqdm: bool = False): def generate_until(self, requests, disable_tqdm: bool = False):
res = [] res = []
for ctx, _ in tqdm(requests, disable=disable_tqdm): for request in tqdm(requests, disable=disable_tqdm):
res.append("lol") res.append("lol")
assert ctx.strip() != "" assert request.arguments[0].strip() != ""
return res return res
......
...@@ -13,6 +13,7 @@ from lm_eval.api.registry import register_model ...@@ -13,6 +13,7 @@ from lm_eval.api.registry import register_model
from lm_eval.models.huggingface import HFLM from lm_eval.models.huggingface import HFLM
from lm_eval.models.utils import ( from lm_eval.models.utils import (
Collator, Collator,
flatten_image_list,
pad_and_concat, pad_and_concat,
replace_placeholders, replace_placeholders,
stop_sequences_criteria, stop_sequences_criteria,
...@@ -295,6 +296,11 @@ class HFMultimodalLM(HFLM): ...@@ -295,6 +296,11 @@ class HFMultimodalLM(HFLM):
images = [img[: self.max_images] for img in images] images = [img[: self.max_images] for img in images]
if self.rgb: if self.rgb:
images = [[img.convert("RGB") for img in sublist] for sublist in images] images = [[img.convert("RGB") for img in sublist] for sublist in images]
# certain models like llava expect a single-level image list even for bs>1, multi-image. TODO: port this over to loglikelihoods
if getattr(self.config, "model_type", "") == "llava":
images = flatten_image_list(images)
try: try:
encoding = self.processor( encoding = self.processor(
images=images, images=images,
......
...@@ -55,7 +55,7 @@ class HFLM(TemplateLM): ...@@ -55,7 +55,7 @@ class HFLM(TemplateLM):
def __init__( def __init__(
self, self,
pretrained: Union[str, transformers.PreTrainedModel], pretrained: Union[str, transformers.PreTrainedModel],
backend: Optional[Literal["default", "causal", "seq2seq"]] = "default", backend: Literal["default", "causal", "seq2seq"] = "default",
# override whether the model should be treated as decoder-only (causal) or encoder-decoder (seq2seq) # override whether the model should be treated as decoder-only (causal) or encoder-decoder (seq2seq)
revision: Optional[str] = "main", revision: Optional[str] = "main",
subfolder: Optional[str] = None, subfolder: Optional[str] = None,
...@@ -90,7 +90,6 @@ class HFLM(TemplateLM): ...@@ -90,7 +90,6 @@ class HFLM(TemplateLM):
**kwargs, **kwargs,
) -> None: ) -> None:
super().__init__() super().__init__()
# optionally: take in an already-initialized transformers.PreTrainedModel # optionally: take in an already-initialized transformers.PreTrainedModel
if not isinstance(pretrained, str): if not isinstance(pretrained, str):
eval_logger.warning( eval_logger.warning(
...@@ -164,7 +163,7 @@ class HFLM(TemplateLM): ...@@ -164,7 +163,7 @@ class HFLM(TemplateLM):
trust_remote_code=trust_remote_code, trust_remote_code=trust_remote_code,
) )
# determine which of 'causal' and 'seq2seq' backends to use # determine which of 'causal' and 'seq2seq' backends to use for HF models
self._get_backend( self._get_backend(
config=self.config, backend=backend, trust_remote_code=trust_remote_code config=self.config, backend=backend, trust_remote_code=trust_remote_code
) )
...@@ -287,7 +286,7 @@ class HFLM(TemplateLM): ...@@ -287,7 +286,7 @@ class HFLM(TemplateLM):
def _get_accelerate_args( def _get_accelerate_args(
self, self,
parallelize: bool = None, parallelize: Optional[bool] = None,
device_map: Optional[str] = "auto", device_map: Optional[str] = "auto",
max_memory_per_gpu: Optional[Union[int, str]] = None, max_memory_per_gpu: Optional[Union[int, str]] = None,
max_cpu_memory: Optional[Union[int, str]] = None, max_cpu_memory: Optional[Union[int, str]] = None,
...@@ -441,31 +440,26 @@ class HFLM(TemplateLM): ...@@ -441,31 +440,26 @@ class HFLM(TemplateLM):
def _get_backend( def _get_backend(
self, self,
config: Union[transformers.PretrainedConfig, transformers.AutoConfig], config: Union[transformers.PretrainedConfig, transformers.AutoConfig],
backend: Optional[Literal["default", "causal", "seq2seq"]] = "default", backend: Literal["default", "causal", "seq2seq"] = "default",
trust_remote_code: Optional[bool] = False, trust_remote_code: Optional[bool] = False,
) -> None: ) -> None:
""" """
Helper method during initialization. Helper method during initialization.
Determines the backend ("causal" (decoder-only) or "seq2seq" (encoder-decoder)) Determines the backend ("causal" (decoder-only) or "seq2seq" (encoder-decoder)) model type to be used.
model type to be used.
sets `self.AUTO_MODEL_CLASS` appropriately if not already set. sets `self.AUTO_MODEL_CLASS` appropriately if not already set.
**If not calling HFLM.__init__() or HFLM._get_backend() within a subclass of HFLM,
user must set `self.backend` to be either "causal" or "seq2seq" manually!**
""" """
# escape hatch: if we're using a subclass that shouldn't follow
# the default _get_backend logic,
# then skip over the method.
# TODO: this seems very much undesirable in some cases--our code in HFLM
# references AutoModelForCausalLM at times to check for equality
if self.AUTO_MODEL_CLASS is not None:
return
assert backend in ["default", "causal", "seq2seq"] assert backend in ["default", "causal", "seq2seq"]
if backend != "default": if backend != "default":
# if we've settled on non-default backend, use that manually # if we've settled on non-default backend, use that manually
if backend == "causal": if backend == "causal":
self.AUTO_MODEL_CLASS = transformers.AutoModelForCausalLM self.backend = backend
elif backend == "seq2seq": elif backend == "seq2seq":
self.AUTO_MODEL_CLASS = transformers.AutoModelForSeq2SeqLM self.backend = backend
eval_logger.info( eval_logger.info(
f"Overrode HF model backend type, and using type '{backend}'" f"Overrode HF model backend type, and using type '{backend}'"
) )
...@@ -478,26 +472,32 @@ class HFLM(TemplateLM): ...@@ -478,26 +472,32 @@ class HFLM(TemplateLM):
# first check if model type is listed under seq2seq models, since some # first check if model type is listed under seq2seq models, since some
# models like MBart are listed in both seq2seq and causal mistakenly in HF transformers. # models like MBart are listed in both seq2seq and causal mistakenly in HF transformers.
# these special cases should be treated as seq2seq models. # these special cases should be treated as seq2seq models.
self.AUTO_MODEL_CLASS = transformers.AutoModelForSeq2SeqLM self.backend = "seq2seq"
eval_logger.info(f"Using model type '{backend}'")
elif ( elif (
getattr(self.config, "model_type") in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES getattr(self.config, "model_type") in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES
): ):
self.AUTO_MODEL_CLASS = transformers.AutoModelForCausalLM self.backend = "causal"
eval_logger.info(f"Using model type '{backend}'")
else: else:
if not trust_remote_code: if not trust_remote_code:
eval_logger.warning( eval_logger.warning(
"HF model type is neither marked as CausalLM or Seq2SeqLM. \ "HF model type is neither marked as CausalLM or Seq2SeqLM. \
This is expected if your model requires `trust_remote_code=True` but may be an error otherwise." This is expected if your model requires `trust_remote_code=True` but may be an error otherwise."
"Setting backend to causal"
) )
# if model type is neither in HF transformers causal or seq2seq model registries # if model type is neither in HF transformers causal or seq2seq model registries
# then we default to AutoModelForCausalLM # then we default to assuming AutoModelForCausalLM
self.AUTO_MODEL_CLASS = transformers.AutoModelForCausalLM self.backend = "causal"
eval_logger.info(
f"Model type cannot be determined. Using default model type '{backend}'"
)
assert self.AUTO_MODEL_CLASS in [ if self.AUTO_MODEL_CLASS is None:
transformers.AutoModelForCausalLM, if self.backend == "causal":
transformers.AutoModelForSeq2SeqLM, self.AUTO_MODEL_CLASS = transformers.AutoModelForCausalLM
] elif self.backend == "seq2seq":
return None self.AUTO_MODEL_CLASS = transformers.AutoModelForSeq2SeqLM
def _get_config( def _get_config(
self, self,
...@@ -505,6 +505,7 @@ class HFLM(TemplateLM): ...@@ -505,6 +505,7 @@ class HFLM(TemplateLM):
revision: str = "main", revision: str = "main",
trust_remote_code: bool = False, trust_remote_code: bool = False,
) -> None: ) -> None:
"""Return the model config for HuggingFace models"""
self._config = transformers.AutoConfig.from_pretrained( self._config = transformers.AutoConfig.from_pretrained(
pretrained, pretrained,
revision=revision, revision=revision,
...@@ -703,7 +704,7 @@ class HFLM(TemplateLM): ...@@ -703,7 +704,7 @@ class HFLM(TemplateLM):
# if OOM, then halves batch_size and tries again # if OOM, then halves batch_size and tries again
@find_executable_batch_size(starting_batch_size=self.max_batch_size) @find_executable_batch_size(starting_batch_size=self.max_batch_size)
def forward_batch(batch_size): def forward_batch(batch_size):
if self.AUTO_MODEL_CLASS == transformers.AutoModelForSeq2SeqLM: if self.backend == "seq2seq":
length = max(max_context_enc, max_cont_enc) length = max(max_context_enc, max_cont_enc)
batched_conts = torch.ones( batched_conts = torch.ones(
(batch_size, length), device=self.device (batch_size, length), device=self.device
...@@ -754,7 +755,7 @@ class HFLM(TemplateLM): ...@@ -754,7 +755,7 @@ class HFLM(TemplateLM):
# by default for CausalLM - false or self.add_bos_token is set # by default for CausalLM - false or self.add_bos_token is set
if add_special_tokens is None: if add_special_tokens is None:
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM: if self.backend == "causal":
special_tokens_kwargs = { special_tokens_kwargs = {
"add_special_tokens": False or self.add_bos_token "add_special_tokens": False or self.add_bos_token
} }
...@@ -782,7 +783,7 @@ class HFLM(TemplateLM): ...@@ -782,7 +783,7 @@ class HFLM(TemplateLM):
self.tokenizer.padding_side = padding_side self.tokenizer.padding_side = padding_side
add_special_tokens = {} add_special_tokens = {}
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM: if self.backend == "causal":
add_special_tokens = {"add_special_tokens": False or self.add_bos_token} add_special_tokens = {"add_special_tokens": False or self.add_bos_token}
encoding = self.tokenizer( encoding = self.tokenizer(
...@@ -860,14 +861,14 @@ class HFLM(TemplateLM): ...@@ -860,14 +861,14 @@ class HFLM(TemplateLM):
def _select_cont_toks( def _select_cont_toks(
self, logits: torch.Tensor, contlen: int = None, inplen: int = None self, logits: torch.Tensor, contlen: int = None, inplen: int = None
) -> torch.Tensor: ) -> torch.Tensor:
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM: if self.backend == "causal":
assert ( assert (
contlen and inplen contlen and inplen
), "Must pass input len and cont. len to select scored logits for causal LM" ), "Must pass input len and cont. len to select scored logits for causal LM"
# discard right-padding. # discard right-padding.
# also discard the input/context tokens. we'll only score continuations. # also discard the input/context tokens. we'll only score continuations.
logits = logits[inplen - contlen : inplen] logits = logits[inplen - contlen : inplen]
elif self.AUTO_MODEL_CLASS == transformers.AutoModelForSeq2SeqLM: elif self.backend == "seq2seq":
assert ( assert (
contlen and not inplen contlen and not inplen
), "Selecting scored logits for Seq2SeqLM requires only cont. len" ), "Selecting scored logits for Seq2SeqLM requires only cont. len"
...@@ -990,8 +991,7 @@ class HFLM(TemplateLM): ...@@ -990,8 +991,7 @@ class HFLM(TemplateLM):
requests, requests,
sort_fn=_collate, sort_fn=_collate,
group_by="contexts" group_by="contexts"
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM if self.backend == "causal" and self.logits_cache
and self.logits_cache
else None, else None,
group_fn=_lookup_one_token_cont, group_fn=_lookup_one_token_cont,
) )
...@@ -1048,14 +1048,14 @@ class HFLM(TemplateLM): ...@@ -1048,14 +1048,14 @@ class HFLM(TemplateLM):
# cont_toks 4 5 6 7 8 9 [:, -len(continuation_enc):, :self.vocab_size] slice # cont_toks 4 5 6 7 8 9 [:, -len(continuation_enc):, :self.vocab_size] slice
# when too long to fit in context, truncate from the left # when too long to fit in context, truncate from the left
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM: if self.backend == "causal":
inp = torch.tensor( inp = torch.tensor(
(context_enc + continuation_enc)[-(self.max_length + 1) :][:-1], (context_enc + continuation_enc)[-(self.max_length + 1) :][:-1],
dtype=torch.long, dtype=torch.long,
device=self.device, device=self.device,
) )
(inplen,) = inp.shape (inplen,) = inp.shape
elif self.AUTO_MODEL_CLASS == transformers.AutoModelForSeq2SeqLM: elif self.backend == "seq2seq":
inp = torch.tensor( inp = torch.tensor(
(context_enc)[-self.max_length :], (context_enc)[-self.max_length :],
dtype=torch.long, dtype=torch.long,
...@@ -1095,11 +1095,11 @@ class HFLM(TemplateLM): ...@@ -1095,11 +1095,11 @@ class HFLM(TemplateLM):
# create encoder attn mask and batched conts, if seq2seq # create encoder attn mask and batched conts, if seq2seq
call_kwargs = {} call_kwargs = {}
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM: if self.backend == "causal":
batched_inps = pad_and_concat( batched_inps = pad_and_concat(
padding_len_inp, inps, padding_side="right" padding_len_inp, inps, padding_side="right"
) # [batch, padding_len_inp] ) # [batch, padding_len_inp]
elif self.AUTO_MODEL_CLASS == transformers.AutoModelForSeq2SeqLM: elif self.backend == "seq2seq":
# TODO: left-pad encoder inps and mask? # TODO: left-pad encoder inps and mask?
batched_inps = pad_and_concat( batched_inps = pad_and_concat(
padding_len_inp, inps padding_len_inp, inps
...@@ -1130,7 +1130,7 @@ class HFLM(TemplateLM): ...@@ -1130,7 +1130,7 @@ class HFLM(TemplateLM):
# from prompt/prefix tuning tokens, if applicable # from prompt/prefix tuning tokens, if applicable
ctx_len = ( ctx_len = (
inplen + (logits.shape[0] - padding_len_inp) inplen + (logits.shape[0] - padding_len_inp)
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM if self.backend == "causal"
else None else None
) )
logits = self._select_cont_toks(logits, contlen=contlen, inplen=ctx_len) logits = self._select_cont_toks(logits, contlen=contlen, inplen=ctx_len)
...@@ -1265,10 +1265,10 @@ class HFLM(TemplateLM): ...@@ -1265,10 +1265,10 @@ class HFLM(TemplateLM):
max_gen_toks = self.max_gen_toks max_gen_toks = self.max_gen_toks
# set the max length in tokens of inputs ("context_enc") # set the max length in tokens of inputs ("context_enc")
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM: if self.backend == "causal":
# max len for inputs = max length, minus room to generate the max new tokens # max len for inputs = max length, minus room to generate the max new tokens
max_ctx_len = self.max_length - max_gen_toks max_ctx_len = self.max_length - max_gen_toks
elif self.AUTO_MODEL_CLASS == transformers.AutoModelForSeq2SeqLM: elif self.backend == "seq2seq":
# max len for inputs = encoder's whole max_length # max len for inputs = encoder's whole max_length
max_ctx_len = self.max_length max_ctx_len = self.max_length
...@@ -1295,7 +1295,7 @@ class HFLM(TemplateLM): ...@@ -1295,7 +1295,7 @@ class HFLM(TemplateLM):
cont_toks_list = cont.tolist() cont_toks_list = cont.tolist()
for cont_toks, context in zip(cont_toks_list, contexts): for cont_toks, context in zip(cont_toks_list, contexts):
# discard context + left-padding toks if using causal decoder-only LM # discard context + left-padding toks if using causal decoder-only LM
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM: if self.backend == "causal":
cont_toks = cont_toks[context_enc.shape[1] :] cont_toks = cont_toks[context_enc.shape[1] :]
s = self.tok_decode(cont_toks) s = self.tok_decode(cont_toks)
......
import copy import copy
import json
import logging import logging
import subprocess
from collections import defaultdict from collections import defaultdict
from typing import List, Optional, Union from typing import List, Optional, Union
...@@ -33,54 +31,6 @@ except ImportError: ...@@ -33,54 +31,6 @@ except ImportError:
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
def get_nc_count() -> Union[int, None]:
"""Returns the number of neuron cores on the current instance."""
try:
cmd = "neuron-ls --json-output"
result = subprocess.run(cmd, shell=True, capture_output=True)
print(f"inferring nc_count from `neuron-ls` {result.stdout}")
json_output = json.loads(result.stdout)
count = sum([x["nc_count"] for x in json_output])
print(f"nc_count={count}")
return count
except Exception:
return None
def wrap_constant_batch_size(func):
def _decorator(self, input_ids):
"""input_ids a 2D array with batch_size on dim=0
makes sure the func runs with self.batch_size
"""
# access a from TestSample
batch_size = input_ids.shape[0]
if batch_size < self.batch_size:
# handle the event of input_ids.shape[0] != batch_size
# Neuron cores expect constant batch_size
input_ids = torch.concat(
(
input_ids,
# add missing_batch_size dummy
torch.zeros(
[self.batch_size - batch_size, *input_ids.size()[1:]],
dtype=input_ids.dtype,
device=input_ids.device,
),
),
dim=0,
)
elif batch_size > self.batch_size:
raise ValueError(
f"The specified batch_size ({batch_size}) exceeds the model static batch size ({self.batch_size})"
)
# return the forward pass that requires constant batch size
return func(self, input_ids)[:batch_size]
return _decorator
class CustomNeuronModelForCausalLM(NeuronModelForCausalLM): class CustomNeuronModelForCausalLM(NeuronModelForCausalLM):
"""NeuronModelForCausalLM with `stopping_criteria` in `generate`""" """NeuronModelForCausalLM with `stopping_criteria` in `generate`"""
...@@ -146,7 +96,7 @@ class CustomNeuronModelForCausalLM(NeuronModelForCausalLM): ...@@ -146,7 +96,7 @@ class CustomNeuronModelForCausalLM(NeuronModelForCausalLM):
raise ValueError( raise ValueError(
f"The specified batch_size ({batch_size}) exceeds the model static batch size ({self.batch_size})" f"The specified batch_size ({batch_size}) exceeds the model static batch size ({self.batch_size})"
) )
elif batch_size < self.batch_size: elif batch_size < self.batch_size and not self.continuous_batching:
logger.warning( logger.warning(
"Inputs will be padded to match the model static batch size. This will increase latency." "Inputs will be padded to match the model static batch size. This will increase latency."
) )
...@@ -158,8 +108,6 @@ class CustomNeuronModelForCausalLM(NeuronModelForCausalLM): ...@@ -158,8 +108,6 @@ class CustomNeuronModelForCausalLM(NeuronModelForCausalLM):
if attention_mask is not None: if attention_mask is not None:
padding = torch.zeros(padding_shape, dtype=torch.int64) padding = torch.zeros(padding_shape, dtype=torch.int64)
padded_attention_mask = torch.cat([attention_mask, padding]) padded_attention_mask = torch.cat([attention_mask, padding])
# Drop the current generation context and clear the Key/Value cache
self.reset_generation()
output_ids = self.generate_tokens( output_ids = self.generate_tokens(
padded_input_ids, padded_input_ids,
...@@ -179,8 +127,6 @@ class NEURON_HF(TemplateLM): ...@@ -179,8 +127,6 @@ class NEURON_HF(TemplateLM):
Tested with neuron 2.17.0 Tested with neuron 2.17.0
""" """
_DEFAULT_MAX_LENGTH = 2048
def __init__( def __init__(
self, self,
pretrained: Optional[str] = "TinyLlama/TinyLlama-1.1B-Chat-v1.0", pretrained: Optional[str] = "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
...@@ -203,7 +149,7 @@ class NEURON_HF(TemplateLM): ...@@ -203,7 +149,7 @@ class NEURON_HF(TemplateLM):
"please install neuron via pip install transformers-neuron ", "please install neuron via pip install transformers-neuron ",
"also make sure you are running on an AWS inf2 instance", "also make sure you are running on an AWS inf2 instance",
) )
if version.parse(optimum_neuron_version) != version.parse("0.0.17"): if version.parse(optimum_neuron_version) != version.parse("0.0.24"):
logger.warning( logger.warning(
'`optimum-neuron` model requires `pip install "optimum[neuronx]>=0.0.17" ' '`optimum-neuron` model requires `pip install "optimum[neuronx]>=0.0.17" '
"preferably using the Hugging Face Neuron Deep Learning AMI (Ubuntu 22.04) " "preferably using the Hugging Face Neuron Deep Learning AMI (Ubuntu 22.04) "
...@@ -217,35 +163,16 @@ class NEURON_HF(TemplateLM): ...@@ -217,35 +163,16 @@ class NEURON_HF(TemplateLM):
self.batch_size_per_gpu = int(batch_size) self.batch_size_per_gpu = int(batch_size)
batch_size = int(batch_size) batch_size = int(batch_size)
if tp_degree is None:
# execute `neuron-ls --json-output | jq '.[0].nc_count'``
# to get the number of neuron cores on your instance
tp_degree = get_nc_count()
assert isinstance(tp_degree, int), (
f"model_args must include tp_degree. tp_degree must be set to an integer,"
f" but is tp_degree=`{tp_degree}` with type=`{type(tp_degree)}`."
"Set it to number of neuron cores on your instance."
" For inf2.xlarge and inf2.8xlarge, set it to `2`."
" For inf2.24xlarge, set it to `12`."
" For inf2.48xlarge, set it to `24`."
)
revision = str(revision) # cast to string if not already one
# TODO: update this to be less of a hack once subfolder is fixed in HF
revision = revision + ("/" + subfolder if subfolder is not None else "")
self._config = transformers.AutoConfig.from_pretrained( self._config = transformers.AutoConfig.from_pretrained(
pretrained, pretrained,
revision=revision, revision=revision,
trust_remote_code=trust_remote_code, trust_remote_code=trust_remote_code,
) )
torch_dtype = lm_eval.models.utils.get_dtype(dtype)
assert torch_dtype in [ revision = str(revision) # cast to string if not already one
torch.float16, # TODO: update this to be less of a hack once subfolder is fixed in HF
torch.bfloat16, revision = revision + ("/" + subfolder if subfolder is not None else "")
], "Only float16 and bfloat16 are supported"
self.tokenizer = transformers.AutoTokenizer.from_pretrained( self.tokenizer = transformers.AutoTokenizer.from_pretrained(
pretrained if tokenizer is None else tokenizer, pretrained if tokenizer is None else tokenizer,
...@@ -254,36 +181,58 @@ class NEURON_HF(TemplateLM): ...@@ -254,36 +181,58 @@ class NEURON_HF(TemplateLM):
use_fast=use_fast_tokenizer, use_fast=use_fast_tokenizer,
) )
# Neuron specific code neuron_config = getattr(self._config, "neuron", None)
if torch_dtype == torch.float16: if neuron_config is None:
self.amp_dtype = "f16" # Check export parameters
elif torch_dtype == torch.bfloat16: if tp_degree is not None:
self.amp_dtype = "bf16" assert isinstance(tp_degree, int), (
elif torch_dtype == torch.float32: f"tp_degree must be set to an integer,"
self.amp_dtype = "f32" f" but is tp_degree=`{tp_degree}` with type=`{type(tp_degree)}`."
else: "Set it to a number lower than the number of neuron cores on your instance."
raise NotImplementedError("Only float16 and bfloat16 are implemented.") " For inf2.xlarge and inf2.8xlarge, set it to `2`."
" For inf2.24xlarge, set it <= `12`."
compiler_args = {"num_cores": tp_degree, "auto_cast_type": self.amp_dtype} " For inf2.48xlarge, set it <= `24`."
input_shapes = { )
"batch_size": batch_size, torch_dtype = lm_eval.models.utils.get_dtype(dtype)
"sequence_length": self._DEFAULT_MAX_LENGTH,
} if torch_dtype == torch.float16:
self.amp_dtype = "f16"
elif torch_dtype == torch.bfloat16:
self.amp_dtype = "bf16"
elif torch_dtype == torch.float32:
self.amp_dtype = "f32"
else:
raise NotImplementedError(
"Only float16/bfloat16/float32 are supported."
)
print( print(f"{'='*20} \n exporting model to neuron")
f"{'='*20} \n loading model to neuron with" self.model = CustomNeuronModelForCausalLM.from_pretrained(
f" {compiler_args}, {input_shapes}..." pretrained,
) revision=revision,
self.model = CustomNeuronModelForCausalLM.from_pretrained( trust_remote_code=trust_remote_code,
pretrained, low_cpu_mem_usage=low_cpu_mem_usage,
revision=revision, export=True,
trust_remote_code=trust_remote_code, batch_size=batch_size,
low_cpu_mem_usage=low_cpu_mem_usage, num_cores=tp_degree,
export=True, auto_cast_type=self.amp_dtype,
**compiler_args, sequence_length=max_length,
**input_shapes, )
) neuron_config = self.model.config.neuron
print(f"SUCCESS: neuron model compiled. \n {'='*20}") print(
f"SUCCESS: neuron model exported with config {neuron_config}. \n {'='*20}"
)
else:
print(
f"{'='*20} \n loading neuron model with config" f" {neuron_config}..."
)
self.model = CustomNeuronModelForCausalLM.from_pretrained(
pretrained,
revision=revision,
trust_remote_code=trust_remote_code,
low_cpu_mem_usage=low_cpu_mem_usage,
)
print(f"SUCCESS: neuron model loaded. \n {'='*20}")
self.truncation = truncation self.truncation = truncation
...@@ -291,8 +240,6 @@ class NEURON_HF(TemplateLM): ...@@ -291,8 +240,6 @@ class NEURON_HF(TemplateLM):
self.tokenizer.pad_token_id = self.tokenizer.eos_token_id self.tokenizer.pad_token_id = self.tokenizer.eos_token_id
self.add_bos_token = add_bos_token self.add_bos_token = add_bos_token
self._max_length = max_length
self.batch_schedule = 1 self.batch_schedule = 1
self.batch_sizes = {} self.batch_sizes = {}
...@@ -313,17 +260,7 @@ class NEURON_HF(TemplateLM): ...@@ -313,17 +260,7 @@ class NEURON_HF(TemplateLM):
@property @property
def max_length(self): def max_length(self):
if self._max_length: # if max length manually set, return it return self.model.max_length
return self._max_length
seqlen_config_attrs = ("n_positions", "max_position_embeddings", "n_ctx")
for attr in seqlen_config_attrs:
if hasattr(self.model.config, attr):
return getattr(self.model.config, attr)
if hasattr(self.tokenizer, "model_max_length"):
if self.tokenizer.model_max_length == 1000000000000000019884624838656:
return self._DEFAULT_MAX_LENGTH
return self.tokenizer.model_max_length
return self._DEFAULT_MAX_LENGTH
@property @property
def max_gen_toks(self) -> int: def max_gen_toks(self) -> int:
...@@ -391,34 +328,6 @@ class NEURON_HF(TemplateLM): ...@@ -391,34 +328,6 @@ class NEURON_HF(TemplateLM):
def tok_decode(self, tokens): def tok_decode(self, tokens):
return self.tokenizer.decode(tokens) return self.tokenizer.decode(tokens)
@wrap_constant_batch_size
def _model_call(self, input_ids: torch.Tensor):
"""
get logits for the entire sequence
:param input_ids: torch.Tensor
A torch tensor of shape [batch, sequence_cont]
the size of sequence may vary from call to call
:return
A torch tensor of shape [batch, sequence, vocab] with the
logits returned from the model's decoder-lm head
"""
_, sequence_length = input_ids.shape
with torch.inference_mode():
cache_ids = torch.arange(0, sequence_length, dtype=torch.int32).split(1)
input_ids_split = input_ids.split(1, dim=1)
return torch.concat(
[
self.model.forward(
input_ids=input_id, cache_ids=cache_id, return_dict=False
)[0]
for input_id, cache_id in zip(input_ids_split, cache_ids)
],
dim=1,
)
def _model_generate(self, context, max_length, stop, **generation_kwargs): def _model_generate(self, context, max_length, stop, **generation_kwargs):
# we require users to pass do_sample=True explicitly # we require users to pass do_sample=True explicitly
# for non-greedy gen. This should be reevaluated when considering beam search. # for non-greedy gen. This should be reevaluated when considering beam search.
...@@ -580,15 +489,41 @@ class NEURON_HF(TemplateLM): ...@@ -580,15 +489,41 @@ class NEURON_HF(TemplateLM):
cont_toks_list.append(continuation_enc) cont_toks_list.append(continuation_enc)
inplens.append(inplen) inplens.append(inplen)
# create encoder attn mask and batched conts, if seq2seq # Add dummy inputs up to the model static batch size
call_kwargs = {} if len(inps) < self.batch_size:
inps = inps + [
torch.zeros_like(inps[0]),
] * (self.batch_size - len(inps))
masks = [torch.ones_like(inp) for inp in inps]
batched_inps = lm_eval.models.utils.pad_and_concat( batched_inps = lm_eval.models.utils.pad_and_concat(
padding_len_inp, inps, padding_side="right" padding_len_inp, inps, padding_side="right"
) # [batch, padding_len_inp] ) # [batch, padding_len_inp]
multi_logits = F.log_softmax( batched_masks = lm_eval.models.utils.pad_and_concat(
self._model_call(batched_inps, **call_kwargs), dim=-1 padding_len_inp, masks, padding_side="right"
) # [batch, padding_length (inp or cont), vocab] )
if self.model.model.neuron_config.output_all_logits:
inputs = self.model.prepare_inputs_for_prefill(
batched_inps, batched_masks
)
multi_logits = F.log_softmax(
self.model.forward(**inputs).logits, dim=-1
) # [batch, padding_length (inp or cont), vocab]
else:
# The model will only return the logits for the last input token, so we need
# to iterate over inputs to accumulate logits.
# To speed things up we use the KV cache as we would do when generating.
inputs = self.model.prepare_inputs_for_prefill(
batched_inps[:, :1], batched_masks[:, :1]
)
outputs = [self.model.forward(**inputs).logits]
for i in range(1, padding_len_inp):
inputs = self.model.prepare_inputs_for_decode(
batched_inps[:, : i + 1], batched_masks[:, : i + 1]
)
outputs.append(self.model.forward(**inputs).logits)
multi_logits = F.log_softmax(torch.concat(outputs, dim=1), dim=-1)
for (cache_key, _, _), logits, inplen, cont_toks in zip( for (cache_key, _, _), logits, inplen, cont_toks in zip(
chunk, multi_logits, inplens, cont_toks_list chunk, multi_logits, inplens, cont_toks_list
......
...@@ -69,11 +69,11 @@ class LocalCompletionsAPI(TemplateAPI): ...@@ -69,11 +69,11 @@ class LocalCompletionsAPI(TemplateAPI):
for choice, ctxlen in zip(out["choices"], ctxlens): for choice, ctxlen in zip(out["choices"], ctxlens):
assert ctxlen > 0, "Context length must be greater than 0" assert ctxlen > 0, "Context length must be greater than 0"
logprobs = sum(choice["logprobs"]["token_logprobs"][ctxlen:-1]) logprobs = sum(choice["logprobs"]["token_logprobs"][ctxlen:-1])
tokens = choice["logprobs"]["token_logprobs"][ctxlen:-1] tokens_logprobs = choice["logprobs"]["token_logprobs"][ctxlen:-1]
top_logprobs = choice["logprobs"]["top_logprobs"][ctxlen:-1] top_logprobs = choice["logprobs"]["top_logprobs"][ctxlen:-1]
is_greedy = True is_greedy = True
for tok, top in zip(tokens, top_logprobs): for tok, top in zip(tokens_logprobs, top_logprobs):
if tok != max(top, key=top.get): if tok != max(top.values()):
is_greedy = False is_greedy = False
break break
res.append((logprobs, is_greedy)) res.append((logprobs, is_greedy))
...@@ -190,14 +190,18 @@ class OpenAICompletionsAPI(LocalCompletionsAPI): ...@@ -190,14 +190,18 @@ class OpenAICompletionsAPI(LocalCompletionsAPI):
key = os.environ.get("OPENAI_API_KEY", None) key = os.environ.get("OPENAI_API_KEY", None)
if key is None: if key is None:
raise ValueError( raise ValueError(
"API key not found. Please set the OPENAI_API_KEY environment variable." "API key not found. Please set the `OPENAI_API_KEY` environment variable."
) )
return key return key
def loglikelihood(self, requests, **kwargs): def loglikelihood(self, requests, **kwargs):
assert ( assert (
self.model != "gpt-3.5-turbo" self.model
), "Loglikelihood is not supported for gpt-3.5-turbo" in [
"babbage-002",
"davinci-002",
]
), f"Prompt loglikelihoods are only supported by OpenAI's API for {['babbage-002', 'davinci-002']}."
return super().loglikelihood(requests, **kwargs) return super().loglikelihood(requests, **kwargs)
def chat_template(self, chat_template: Union[bool, str] = False) -> Optional[str]: def chat_template(self, chat_template: Union[bool, str] = False) -> Optional[str]:
...@@ -226,6 +230,11 @@ class OpenAIChatCompletion(LocalChatCompletion): ...@@ -226,6 +230,11 @@ class OpenAIChatCompletion(LocalChatCompletion):
key = os.environ.get("OPENAI_API_KEY", None) key = os.environ.get("OPENAI_API_KEY", None)
if key is None: if key is None:
raise ValueError( raise ValueError(
"API key not found. Please set the OPENAI_API_KEY environment variable." "API key not found. Please set the `OPENAI_API_KEY` environment variable."
) )
return key return key
def loglikelihood(self, requests, **kwargs):
raise NotImplementedError(
"Loglikelihood (and therefore `multiple_choice`-type tasks) is not supported for chat completions as OpenAI does not provide prompt logprobs. See https://github.com/EleutherAI/lm-evaluation-harness/issues/942#issuecomment-1777836312 or https://github.com/EleutherAI/lm-evaluation-harness/issues/1196 for more background on this limitation."
)
...@@ -698,3 +698,14 @@ def replace_placeholders( ...@@ -698,3 +698,14 @@ def replace_placeholders(
# Add the last part of the string # Add the last part of the string
result.append(parts[-1]) result.append(parts[-1])
return "".join(result) return "".join(result)
def flatten_image_list(images: List[List]):
"""
Takes in a list of lists of images, and returns a single list of all images in order.
    Used for some multimodal models, like Llava-1.5, which expect this flattened-list format for their image processors.
:param images: A list of lists of PIL images.
:return: a list of PIL images, via concatenating all the sub-lists in order.
"""
return [image for image_list in images for image in image_list]
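As a quick, standalone illustration of the behaviour described in the docstring above (the string literals stand in for PIL images; this assumes a checkout that includes this change):

```python
from lm_eval.models.utils import flatten_image_list

# Stand-ins for PIL images; flatten_image_list only concatenates the sub-lists in order.
batch = [["img_0a", "img_0b"], ["img_1a"]]
assert flatten_image_list(batch) == ["img_0a", "img_0b", "img_1a"]
```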
...@@ -7,9 +7,9 @@ from tqdm import tqdm ...@@ -7,9 +7,9 @@ from tqdm import tqdm
from lm_eval.api.instance import Instance from lm_eval.api.instance import Instance
from lm_eval.api.registry import register_model from lm_eval.api.registry import register_model
from lm_eval.models.utils import Collator, undistribute from lm_eval.models.utils import Collator, replace_placeholders, undistribute
from lm_eval.models.vllm_causallms import VLLM from lm_eval.models.vllm_causallms import VLLM
from lm_eval.utils import simple_parse_args_string from lm_eval.utils import eval_logger
try: try:
...@@ -36,10 +36,11 @@ class VLLM_VLM(VLLM): ...@@ -36,10 +36,11 @@ class VLLM_VLM(VLLM):
interleave: bool = True, interleave: bool = True,
# TODO<baber>: handle max_images and limit_mm_per_prompt better # TODO<baber>: handle max_images and limit_mm_per_prompt better
max_images: int = 999, max_images: int = 999,
limit_mm_per_prompt: str = "image=1",
**kwargs, **kwargs,
): ):
kwargs["limit_mm_per_prompt"] = simple_parse_args_string(limit_mm_per_prompt) if max_images != 999:
kwargs["limit_mm_per_prompt"] = {"image": max_images}
eval_logger.info(f"Setting limit_mm_per_prompt[image] to {max_images}")
super().__init__( super().__init__(
pretrained=pretrained, pretrained=pretrained,
trust_remote_code=trust_remote_code, trust_remote_code=trust_remote_code,
...@@ -63,6 +64,17 @@ class VLLM_VLM(VLLM): ...@@ -63,6 +64,17 @@ class VLLM_VLM(VLLM):
truncation: bool = False, truncation: bool = False,
): ):
images = [img[: self.max_images] for img in images] images = [img[: self.max_images] for img in images]
# TODO<baber>: is the default placeholder always <image>?
if self.chat_applied is False:
strings = [
replace_placeholders(
string,
DEFAULT_IMAGE_PLACEHOLDER,
DEFAULT_IMAGE_PLACEHOLDER,
self.max_images,
)
for string in strings
]
outputs = [] outputs = []
for x, i in zip(strings, images): for x, i in zip(strings, images):
......
...@@ -18,6 +18,7 @@ ...@@ -18,6 +18,7 @@
| [arithmetic](arithmetic/README.md) | Tasks involving numerical computations and arithmetic reasoning. | English | | [arithmetic](arithmetic/README.md) | Tasks involving numerical computations and arithmetic reasoning. | English |
| [asdiv](asdiv/README.md) | Tasks involving arithmetic and mathematical reasoning challenges. | English | | [asdiv](asdiv/README.md) | Tasks involving arithmetic and mathematical reasoning challenges. | English |
| [babi](babi/README.md) | Tasks designed as question and answering challenges based on simulated stories. | English | | [babi](babi/README.md) | Tasks designed as question and answering challenges based on simulated stories. | English |
| [basque_bench](basque_bench/README.md) | Collection of tasks in Basque encompassing various evaluation areas. | Basque |
| [basqueglue](basqueglue/README.md) | Tasks designed to evaluate language understanding in Basque language. | Basque | | [basqueglue](basqueglue/README.md) | Tasks designed to evaluate language understanding in Basque language. | Basque |
| [bbh](bbh/README.md) | Tasks focused on deep semantic understanding through hypothesization and reasoning. | English, German | | [bbh](bbh/README.md) | Tasks focused on deep semantic understanding through hypothesization and reasoning. | English, German |
| [belebele](belebele/README.md) | Language understanding tasks in a variety of languages and scripts. | Multiple (122 languages) | | [belebele](belebele/README.md) | Language understanding tasks in a variety of languages and scripts. | Multiple (122 languages) |
...@@ -25,6 +26,7 @@ ...@@ -25,6 +26,7 @@
| [bertaqa](bertaqa/README.md) | Local Basque cultural trivia QA tests in English and Basque languages. | English, Basque, Basque (MT) | | [bertaqa](bertaqa/README.md) | Local Basque cultural trivia QA tests in English and Basque languages. | English, Basque, Basque (MT) |
| [bigbench](bigbench/README.md) | Broad tasks from the BIG-bench benchmark designed to push the boundaries of large models. | Multiple | | [bigbench](bigbench/README.md) | Broad tasks from the BIG-bench benchmark designed to push the boundaries of large models. | Multiple |
| [blimp](blimp/README.md) | Tasks testing grammatical phenomena to evaluate language model's linguistic capabilities. | English | | [blimp](blimp/README.md) | Tasks testing grammatical phenomena to evaluate language model's linguistic capabilities. | English |
| [catalan_bench](catalan_bench/README.md) | Collection of tasks in Catalan encompassing various evaluation areas. | Catalan |
| [ceval](ceval/README.md) | Tasks that evaluate language understanding and reasoning in an educational context. | Chinese | | [ceval](ceval/README.md) | Tasks that evaluate language understanding and reasoning in an educational context. | Chinese |
| [cmmlu](cmmlu/README.md) | Multi-subject multiple choice question tasks for comprehensive academic assessment. | Chinese | | [cmmlu](cmmlu/README.md) | Multi-subject multiple choice question tasks for comprehensive academic assessment. | Chinese |
| code_x_glue | Tasks that involve understanding and generating code across multiple programming languages. | Go, Java, JS, PHP, Python, Ruby | | code_x_glue | Tasks that involve understanding and generating code across multiple programming languages. | Go, Java, JS, PHP, Python, Ruby |
...@@ -42,6 +44,7 @@ ...@@ -42,6 +44,7 @@
| [fda](fda/README.md) | Tasks for extracting key-value pairs from FDA documents to test information extraction. | English | | [fda](fda/README.md) | Tasks for extracting key-value pairs from FDA documents to test information extraction. | English |
| [fld](fld/README.md) | Tasks involving free-form and directed dialogue understanding. | English | | [fld](fld/README.md) | Tasks involving free-form and directed dialogue understanding. | English |
| [french_bench](french_bench/README.md) | Set of tasks designed to assess language model performance in French. | French| | [french_bench](french_bench/README.md) | Set of tasks designed to assess language model performance in French. | French|
| [galician_bench](galician_bench/README.md) | Collection of tasks in Galician encompassing various evaluation areas. | Galician |
| [glue](glue/README.md) | General Language Understanding Evaluation benchmark to test broad language abilities. | English | | [glue](glue/README.md) | General Language Understanding Evaluation benchmark to test broad language abilities. | English |
| [gpqa](gpqa/README.md) | Tasks designed for general public question answering and knowledge verification. | English | | [gpqa](gpqa/README.md) | Tasks designed for general public question answering and knowledge verification. | English |
| [gsm8k](gsm8k/README.md) | A benchmark of grade school math problems aimed at evaluating reasoning capabilities. | English | | [gsm8k](gsm8k/README.md) | A benchmark of grade school math problems aimed at evaluating reasoning capabilities. | English |
...@@ -86,6 +89,7 @@ ...@@ -86,6 +89,7 @@
| [pile_10k](pile_10k/README.md) | The first 10K elements of The Pile, useful for debugging models trained on it. | English | | [pile_10k](pile_10k/README.md) | The first 10K elements of The Pile, useful for debugging models trained on it. | English |
| [piqa](piqa/README.md) | Physical Interaction Question Answering tasks to test physical commonsense reasoning. | English | | [piqa](piqa/README.md) | Physical Interaction Question Answering tasks to test physical commonsense reasoning. | English |
| [polemo2](polemo2/README.md) | Sentiment analysis and emotion detection tasks based on Polish language data. | Polish | | [polemo2](polemo2/README.md) | Sentiment analysis and emotion detection tasks based on Polish language data. | Polish |
| [portuguese_bench](portuguese_bench/README.md) | Collection of tasks in European Portuguese encompassing various evaluation areas. | Portuguese |
| [prost](prost/README.md) | Tasks requiring understanding of professional standards and ethics in various domains. | English | | [prost](prost/README.md) | Tasks requiring understanding of professional standards and ethics in various domains. | English |
| [pubmedqa](pubmedqa/README.md) | Question answering tasks based on PubMed research articles for biomedical understanding. | English | | [pubmedqa](pubmedqa/README.md) | Question answering tasks based on PubMed research articles for biomedical understanding. | English |
| [qa4mre](qa4mre/README.md) | Question Answering for Machine Reading Evaluation, assessing comprehension and reasoning. | English | | [qa4mre](qa4mre/README.md) | Question Answering for Machine Reading Evaluation, assessing comprehension and reasoning. | English |
...@@ -95,6 +99,7 @@ ...@@ -95,6 +99,7 @@
| [sciq](sciq/README.md) | Science Question Answering tasks to assess understanding of scientific concepts. | English | | [sciq](sciq/README.md) | Science Question Answering tasks to assess understanding of scientific concepts. | English |
| [scrolls](scrolls/README.md) | Tasks that involve long-form reading comprehension across various domains. | English | | [scrolls](scrolls/README.md) | Tasks that involve long-form reading comprehension across various domains. | English |
| [siqa](siqa/README.md) | Social Interaction Question Answering to evaluate common sense and social reasoning. | English | | [siqa](siqa/README.md) | Social Interaction Question Answering to evaluate common sense and social reasoning. | English |
| [spanish_bench](spanish_bench/README.md) | Collection of tasks in Spanish encompassing various evaluation areas. | Spanish |
| [squad_completion](squad_completion/README.md) | A variant of the SQuAD question answering task designed for zero-shot evaluation of small LMs. | English | | [squad_completion](squad_completion/README.md) | A variant of the SQuAD question answering task designed for zero-shot evaluation of small LMs. | English |
| [squadv2](squadv2/README.md) | Stanford Question Answering Dataset version 2, a reading comprehension benchmark. | English | | [squadv2](squadv2/README.md) | Stanford Question Answering Dataset version 2, a reading comprehension benchmark. | English |
| [storycloze](storycloze/README.md) | Tasks to predict story endings, focusing on narrative logic and coherence. | English | | [storycloze](storycloze/README.md) | Tasks to predict story endings, focusing on narrative logic and coherence. | English |
...@@ -107,6 +112,7 @@ ...@@ -107,6 +112,7 @@
| [translation](translation/README.md) | Tasks focused on evaluating the language translation capabilities of models. | Arabic, English, Spanish, Basque, Hindi, Indonesian, Burmese, Russian, Swahili, Telugu, Chinese | | [translation](translation/README.md) | Tasks focused on evaluating the language translation capabilities of models. | Arabic, English, Spanish, Basque, Hindi, Indonesian, Burmese, Russian, Swahili, Telugu, Chinese |
| [triviaqa](triviaqa/README.md) | A large-scale dataset for trivia question answering to test general knowledge. | English | | [triviaqa](triviaqa/README.md) | A large-scale dataset for trivia question answering to test general knowledge. | English |
| [truthfulqa](truthfulqa/README.md) | A QA task aimed at evaluating the truthfulness and factual accuracy of model responses. | English | | [truthfulqa](truthfulqa/README.md) | A QA task aimed at evaluating the truthfulness and factual accuracy of model responses. | English |
| [turkishmmlu](turkishmmlu/README.md) | A multiple-choice QA test modeled after MMLU, written in Turkish based on Turkish high-school level exams. | Turkish |
| [unitxt](unitxt/README.md) | A number of tasks implemented using the unitxt library for flexible, shareable, and reusable data preparation and evaluation for generative AI. | English | | [unitxt](unitxt/README.md) | A number of tasks implemented using the unitxt library for flexible, shareable, and reusable data preparation and evaluation for generative AI. | English |
| [unscramble](unscramble/README.md) | Tasks involving the rearrangement of scrambled sentences to test syntactic understanding. | English | | [unscramble](unscramble/README.md) | Tasks involving the rearrangement of scrambled sentences to test syntactic understanding. | English |
| [webqs](webqs/README.md) | Web-based question answering tasks designed to evaluate internet search and retrieval. | English | | [webqs](webqs/README.md) | Web-based question answering tasks designed to evaluate internet search and retrieval. | English |
......
...@@ -40,7 +40,11 @@ class TaskManager: ...@@ -40,7 +40,11 @@ class TaskManager:
[x for x in self._all_tasks if self._task_index[x]["type"] == "group"] [x for x in self._all_tasks if self._task_index[x]["type"] == "group"]
) )
self._all_subtasks = sorted( self._all_subtasks = sorted(
[x for x in self._all_tasks if self._task_index[x]["type"] == "task"] [
x
for x in self._all_tasks
if self._task_index[x]["type"] in ["task", "python_task"]
]
) )
self._all_tags = sorted( self._all_tags = sorted(
[x for x in self._all_tasks if self._task_index[x]["type"] == "tag"] [x for x in self._all_tasks if self._task_index[x]["type"] == "tag"]
...@@ -271,7 +275,7 @@ class TaskManager: ...@@ -271,7 +275,7 @@ class TaskManager:
task_object = config["class"]() task_object = config["class"]()
if isinstance(task_object, ConfigurableTask): if isinstance(task_object, ConfigurableTask):
# very scuffed: set task name here. TODO: fixme? # very scuffed: set task name here. TODO: fixme?
task_object.config.task = config["task"] task_object.config.task = task
else: else:
task_object = ConfigurableTask(config=config) task_object = ConfigurableTask(config=config)
...@@ -436,6 +440,30 @@ class TaskManager: ...@@ -436,6 +440,30 @@ class TaskManager:
:return :return
Dictionary of task names as key and task metadata Dictionary of task names as key and task metadata
""" """
def _populate_tags_and_groups(config, task, tasks_and_groups, print_info):
# TODO: remove group in next release
if "tag" in config:
attr_list = config["tag"]
if isinstance(attr_list, str):
attr_list = [attr_list]
for tag in attr_list:
if tag not in tasks_and_groups:
tasks_and_groups[tag] = {
"type": "tag",
"task": [task],
"yaml_path": -1,
}
elif tasks_and_groups[tag]["type"] != "tag":
self.logger.info(
f"The tag '{tag}' is already registered as a group, this tag will not be registered. "
"This may affect tasks you want to call."
)
break
else:
tasks_and_groups[tag]["task"].append(task)
# TODO: remove group in next release # TODO: remove group in next release
print_info = True print_info = True
ignore_dirs = [ ignore_dirs = [
...@@ -451,10 +479,14 @@ class TaskManager: ...@@ -451,10 +479,14 @@ class TaskManager:
config = utils.load_yaml_config(yaml_path, mode="simple") config = utils.load_yaml_config(yaml_path, mode="simple")
if self._config_is_python_task(config): if self._config_is_python_task(config):
# This is a python class config # This is a python class config
tasks_and_groups[config["task"]] = { task = config["task"]
tasks_and_groups[task] = {
"type": "python_task", "type": "python_task",
"yaml_path": yaml_path, "yaml_path": yaml_path,
} }
_populate_tags_and_groups(
config, task, tasks_and_groups, print_info
)
elif self._config_is_group(config): elif self._config_is_group(config):
# This is a group config # This is a group config
tasks_and_groups[config["group"]] = { tasks_and_groups[config["group"]] = {
...@@ -483,41 +515,9 @@ class TaskManager: ...@@ -483,41 +515,9 @@ class TaskManager:
"type": "task", "type": "task",
"yaml_path": yaml_path, "yaml_path": yaml_path,
} }
_populate_tags_and_groups(
# TODO: remove group in next release config, task, tasks_and_groups, print_info
for attr in ["tag", "group"]: )
if attr in config:
if attr == "group" and print_info:
self.logger.info(
"`group` and `group_alias` keys in TaskConfigs are deprecated and will be removed in v0.4.5 of lm_eval. "
"The new `tag` field will be used to allow for a shortcut to a group of tasks one does not wish to aggregate metrics across. "
"`group`s which aggregate across subtasks must be only defined in a separate group config file, "
"which will be the official way to create groups that support cross-task aggregation as in `mmlu`. "
"Please see the v0.4.4 patch notes and our documentation: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#advanced-group-configs "
"for more information."
)
print_info = False
# attr = "tag"
attr_list = config[attr]
if isinstance(attr_list, str):
attr_list = [attr_list]
for tag in attr_list:
if tag not in tasks_and_groups:
tasks_and_groups[tag] = {
"type": "tag",
"task": [task],
"yaml_path": -1,
}
elif tasks_and_groups[tag]["type"] != "tag":
self.logger.info(
f"The tag {tag} is already registered as a group, this tag will not be registered. "
"This may affect tasks you want to call."
)
break
else:
tasks_and_groups[tag]["task"].append(task)
else: else:
self.logger.debug(f"File {f} in {root} could not be loaded") self.logger.debug(f"File {f} in {root} could not be loaded")
......
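To make the refactor above easier to follow, here is a self-contained sketch of the behaviour that the new `_populate_tags_and_groups` helper centralises: a task config's `tag` field (a string or a list of strings) is folded into the shared task index, and any tag name that already belongs to a group is skipped with a warning. Only the index layout (`type`, `task`, `yaml_path`) mirrors the diff; the function and variable names below are simplified for illustration and are not the actual implementation.

```python
import logging

logger = logging.getLogger(__name__)


def populate_tags(config: dict, task: str, index: dict) -> None:
    """Simplified stand-in for the `_populate_tags_and_groups` helper in the diff."""
    tags = config.get("tag", [])
    if isinstance(tags, str):
        tags = [tags]
    for tag in tags:
        if tag not in index:
            # First occurrence of this tag: create a tag entry pointing at the task.
            index[tag] = {"type": "tag", "task": [task], "yaml_path": -1}
        elif index[tag]["type"] != "tag":
            # The name is already taken by a group: warn and stop, as in the diff.
            logger.info("The tag '%s' is already registered as a group; skipping it.", tag)
            break
        else:
            # Known tag: append this task to its member list.
            index[tag]["task"].append(task)


if __name__ == "__main__":
    index = {"basque_bench": {"type": "group", "task": [], "yaml_path": "basque_bench.yaml"}}
    populate_tags({"task": "wnli_eu", "tag": ["nli_tag", "basque_bench"]}, "wnli_eu", index)
    # 'nli_tag' is registered as a tag; 'basque_bench' is left alone because it is a group.
    print(index)
```

The same helper is now invoked for both regular task configs and `python_task` configs, which is why python tasks also show up in `_all_subtasks` in the first hunk of the diff.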
# BasqueBench
### Paper
BasqueBench is a benchmark for evaluating language models on Basque tasks. That is, it evaluates the ability of a language model to understand and generate Basque text. BasqueBench offers a combination of pre-existing open datasets and datasets developed exclusively for this benchmark. All the details of BasqueBench will be published in a paper soon.
The new evaluation datasets included in BasqueBench are:
| Task | Category | Homepage |
|:-------------:|:-----:|:-----:|
| MGSM_eu | Math | https://huggingface.co/datasets/HiTZ/MGSM-eu |
| WNLI_eu | Natural Language Inference | https://huggingface.co/datasets/HiTZ/wnli-eu |
| XCOPA_eu | Commonsense Reasoning | https://huggingface.co/datasets/HiTZ/XCOPA-eu |
The datasets included in BasqueBench that have been made public in previous publications are:
| Task | Category | Paper title | Homepage |
|:-------------:|:-----:|:-------------:|:-----:|
| Belebele_eu | Reading Comprehension | [The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants](https://arxiv.org/abs/2308.16884) | https://huggingface.co/datasets/facebook/belebele |
| EusExams | Question Answering | [Latxa: An Open Language Model and Evaluation Suite for Basque](https://arxiv.org/abs/2403.20266) | https://huggingface.co/datasets/HiTZ/EusExams |
| EusProficiency | Question Answering | [Latxa: An Open Language Model and Evaluation Suite for Basque](https://arxiv.org/abs/2403.20266) | https://huggingface.co/datasets/HiTZ/EusProficiency |
| EusReading | Reading Comprehension | [Latxa: An Open Language Model and Evaluation Suite for Basque](https://arxiv.org/abs/2403.20266) | https://huggingface.co/datasets/HiTZ/EusReading |
| EusTrivia | Question Answering | [Latxa: An Open Language Model and Evaluation Suite for Basque](https://arxiv.org/abs/2403.20266) | https://huggingface.co/datasets/HiTZ/EusTrivia |
| FLORES_eu | Translation | [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) | https://huggingface.co/datasets/facebook/flores |
| QNLIeu | Natural Language Inference | [BasqueGLUE: A Natural Language Understanding Benchmark for Basque](https://aclanthology.org/2022.lrec-1.172/) | https://huggingface.co/datasets/orai-nlp/basqueGLUE |
| XNLIeu | Natural Language Inference | [XNLIeu: a dataset for cross-lingual NLI in Basque](https://arxiv.org/abs/2404.06996) | https://huggingface.co/datasets/HiTZ/xnli-eu |
| XStoryCloze_eu | Commonsense Reasoning | [Few-shot Learning with Multilingual Generative Language Models](https://aclanthology.org/2022.emnlp-main.616/) | https://huggingface.co/datasets/juletxara/xstory_cloze |
### Citation
Paper for BasqueBench coming soon.
### Groups and Tasks
#### Groups
- `basque_bench`: All tasks included in BasqueBench.
- `flores_eu`: All FLORES translation tasks from or to Basque.
#### Tasks
The following tasks evaluate datasets included in BasqueBench using various scoring methods.
- `belebele_eus_Latn`
- `eus_exams_eu`
- `eus_proficiency`
- `eus_reading`
- `eus_trivia`
- `flores_eu`
- `flores_eu-ca`
- `flores_eu-de`
- `flores_eu-en`
- `flores_eu-es`
- `flores_eu-fr`
- `flores_eu-gl`
- `flores_eu-it`
- `flores_eu-pt`
- `flores_ca-eu`
- `flores_de-eu`
- `flores_en-eu`
- `flores_es-eu`
- `flores_fr-eu`
- `flores_gl-eu`
- `flores_it-eu`
- `flores_pt-eu`
- `mgsm_direct_eu`
- `mgsm_native_cot_eu`
- `qnlieu`
- `wnli_eu`
- `xcopa_eu`
- `xnli_eu`
- `xnli_eu_native`
- `xstorycloze_eu`
Some of these tasks are taken from benchmarks already available in LM Evaluation Harness. These are:
- `belebele_eus_Latn`: Belebele Basque
- `qnlieu`: From BasqueGLUE
### Checklist
* [x] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation?
* [ ] Yes, original implementation contributed by author of the benchmark
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
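For illustration only (this is not part of the original README), the whole group can be run through the harness's Python entry point roughly as sketched below. The checkpoint name is a placeholder, and `lm_eval.simple_evaluate` is assumed to expose its usual 0.4.x signature (`model`, `model_args`, `tasks`, `batch_size`).

```python
import lm_eval

# Placeholder checkpoint: substitute any Hugging Face causal LM you want to evaluate.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-org/your-basque-model",
    tasks=["basque_bench"],
    batch_size=8,
)

# Per-task metrics are keyed by task name under "results".
for task_name, metrics in results["results"].items():
    print(task_name, metrics)
```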
group: basque_bench
task:
- belebele_eus_Latn
- xstorycloze_eu
- flores_eu
- eus_reading
- eus_proficiency
- eus_trivia
- eus_exams_eu
- qnlieu
- xnli_eu
- xnli_eu_native
- wnli_eu
- xcopa_eu
- mgsm_direct_eu
- mgsm_native_cot_eu
metadata:
  version: 1.0
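As a final sketch, the group file above can be inspected with the same `load_yaml_config` helper that appears in the TaskManager diff; the path below is an assumed location for the config and may differ in the actual repository.

```python
from lm_eval import utils

# Assumed path to the group config shown above; adjust to wherever it lives in your checkout.
config = utils.load_yaml_config("lm_eval/tasks/basque_bench/basque_bench.yaml", mode="simple")

print(config["group"])       # "basque_bench"
print(len(config["task"]))   # 14 member tasks, matching the list above
print(config["metadata"])    # {"version": 1.0}
```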