"vscode:/vscode.git/clone" did not exist on "7eaec7e354fb89ce4883eb3d09bb2b15cef9cf66"
Commit 25869601 authored by Baber's avatar Baber
Browse files

Merge branch 'main' into mathvista

# Conflicts:
#	lm_eval/models/hf_vlms.py
parents 56f40c53 c1d8795d
.gitignore
@@ -8,6 +8,7 @@ build
 dist
 *.egg-info
 venv
+.venv/
 .vscode/
 temp
 __pycache__
...
.pre-commit-config.yaml
@@ -2,7 +2,7 @@
 exclude: ^tests/testdata/
 repos:
   - repo: https://github.com/pre-commit/pre-commit-hooks
-    rev: v4.5.0
+    rev: v4.6.0
     hooks:
       - id: check-added-large-files
       - id: check-ast
@@ -29,7 +29,7 @@ repos:
       - id: mixed-line-ending
         args: [--fix=lf]
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.4.8
+    rev: v0.6.8
     hooks:
       # Run the linter.
       - id: ruff
...
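For context, hook bumps like the two above can be reproduced and verified locally with the pre-commit CLI itself. A minimal sketch, assuming `pre-commit` is already installed in the environment and run from the repository root:

```bash
# Install the git hooks defined in .pre-commit-config.yaml
pre-commit install

# Bump each hook's `rev` to its latest tagged release
# (the manual equivalent of the version changes in this diff)
pre-commit autoupdate

# Run every hook against the whole repository to confirm nothing breaks
pre-commit run --all-files
```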
README.md
@@ -54,7 +54,7 @@ The Language Model Evaluation Harness is the backend for 🤗 Hugging Face's pop
 To install the `lm-eval` package from the github repository, run:
 
 ```bash
-git clone https://github.com/EleutherAI/lm-evaluation-harness
+git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
 cd lm-evaluation-harness
 pip install -e .
 ```
...
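The added `--depth 1` makes this a shallow clone, fetching only the latest commit so the download stays small. A minimal sketch of the updated install flow plus a quick sanity check (the `lm_eval` console script is the same entry point the notebook below invokes with `!lm_eval`):

```bash
# Shallow-clone and install in editable mode, as in the README
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .

# Confirm the CLI entry point is available
lm_eval --help
```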
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "Qw83KAePAhaS"
},
"source": [
"# Releasing LM-Evaluation-Harness v0.4.0"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Z7k2vq1iAdqr"
},
"source": [
"With the vast amount of work done in the field today, it helps to have a tool that people can easily use to share their results and to check others' results, ensuring that reported numbers are valid. The LM Evaluation Harness is one such tool the community has used extensively. We want to continue to support the community, and with that in mind we're excited to announce a major update to the LM Evaluation Harness to further our goal of open and accessible AI research."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0gDoM0AJAvEc"
},
"source": [
"Our refactor stems from our desire to make the following best practices easier to carry out.\n",
"\n",
"1. Never copy results from other papers\n",
"2. Always share your exact prompts\n",
"3. Always provide model outputs\n",
"4. Qualitatively review a small batch of outputs before running evaluation jobs at scale\n",
"\n",
"We also wanted to make the library a better experience to use, to contribute to, and to design evaluations within. New features in the new release that serve this purpose include:\n",
"\n",
"1. Faster Evaluation Runtimes (accelerated data-parallel inference with HF Transformers + Accelerate, and commonly used or faster inference libraries such as vLLM and Llama-CPP)\n",
"2. Easier addition and sharing of new tasks (YAML-based task config formats, allowing single-file sharing of custom tasks)\n",
"3. More configurability, for more advanced workflows and easier prompt modification\n",
"4. Better logging of data at runtime and post-hoc"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nnwsOpjda_YW"
},
"source": [
"In this notebook we will go through a short tutorial on how things work."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zAov81vTbL2K"
},
"source": [
"## Install LM-Eval"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "8hiosGzq_qZg",
"outputId": "6ab73e5e-1f54-417e-a388-07e0d870b132"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting git+https://github.com/EleutherAI/lm-evaluation-harness.git@big-refactor\n",
" Cloning https://github.com/EleutherAI/lm-evaluation-harness.git (to revision big-refactor) to /tmp/pip-req-build-tnssql5s\n",
" Running command git clone --filter=blob:none --quiet https://github.com/EleutherAI/lm-evaluation-harness.git /tmp/pip-req-build-tnssql5s\n",
" Running command git checkout -b big-refactor --track origin/big-refactor\n",
" Switched to a new branch 'big-refactor'\n",
" Branch 'big-refactor' set up to track remote branch 'big-refactor' from 'origin'.\n",
" Resolved https://github.com/EleutherAI/lm-evaluation-harness.git to commit 42f486ee49b65926a444cb0620870a39a5b4b0a8\n",
" Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n",
" Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n",
" Preparing metadata (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
"Collecting accelerate>=0.21.0 (from lm-eval==1.0.0)\n",
" Downloading accelerate-0.24.1-py3-none-any.whl (261 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m261.4/261.4 kB\u001b[0m \u001b[31m4.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting evaluate (from lm-eval==1.0.0)\n",
" Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m84.1/84.1 kB\u001b[0m \u001b[31m5.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting datasets>=2.0.0 (from lm-eval==1.0.0)\n",
" Downloading datasets-2.15.0-py3-none-any.whl (521 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m521.2/521.2 kB\u001b[0m \u001b[31m9.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting jsonlines (from lm-eval==1.0.0)\n",
" Downloading jsonlines-4.0.0-py3-none-any.whl (8.7 kB)\n",
"Requirement already satisfied: numexpr in /usr/local/lib/python3.10/dist-packages (from lm-eval==1.0.0) (2.8.7)\n",
"Collecting peft>=0.2.0 (from lm-eval==1.0.0)\n",
" Downloading peft-0.6.2-py3-none-any.whl (174 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m174.7/174.7 kB\u001b[0m \u001b[31m7.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting pybind11>=2.6.2 (from lm-eval==1.0.0)\n",
" Downloading pybind11-2.11.1-py3-none-any.whl (227 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m227.7/227.7 kB\u001b[0m \u001b[31m12.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting pytablewriter (from lm-eval==1.0.0)\n",
" Downloading pytablewriter-1.2.0-py3-none-any.whl (111 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m111.1/111.1 kB\u001b[0m \u001b[31m8.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting rouge-score>=0.0.4 (from lm-eval==1.0.0)\n",
" Downloading rouge_score-0.1.2.tar.gz (17 kB)\n",
" Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
"Collecting sacrebleu>=1.5.0 (from lm-eval==1.0.0)\n",
" Downloading sacrebleu-2.3.2-py3-none-any.whl (119 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m119.7/119.7 kB\u001b[0m \u001b[31m8.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hRequirement already satisfied: scikit-learn>=0.24.1 in /usr/local/lib/python3.10/dist-packages (from lm-eval==1.0.0) (1.2.2)\n",
"Collecting sqlitedict (from lm-eval==1.0.0)\n",
" Downloading sqlitedict-2.1.0.tar.gz (21 kB)\n",
" Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
"Requirement already satisfied: torch>=1.8 in /usr/local/lib/python3.10/dist-packages (from lm-eval==1.0.0) (2.1.0+cu118)\n",
"Collecting tqdm-multiprocess (from lm-eval==1.0.0)\n",
" Downloading tqdm_multiprocess-0.0.11-py3-none-any.whl (9.8 kB)\n",
"Requirement already satisfied: transformers>=4.1 in /usr/local/lib/python3.10/dist-packages (from lm-eval==1.0.0) (4.35.2)\n",
"Collecting zstandard (from lm-eval==1.0.0)\n",
" Downloading zstandard-0.22.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.4 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.4/5.4 MB\u001b[0m \u001b[31m29.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hRequirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (1.23.5)\n",
"Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (23.2)\n",
"Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (5.9.5)\n",
"Requirement already satisfied: pyyaml in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (6.0.1)\n",
"Requirement already satisfied: huggingface-hub in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (0.19.4)\n",
"Requirement already satisfied: pyarrow>=8.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (9.0.0)\n",
"Collecting pyarrow-hotfix (from datasets>=2.0.0->lm-eval==1.0.0)\n",
" Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)\n",
"Collecting dill<0.3.8,>=0.3.0 (from datasets>=2.0.0->lm-eval==1.0.0)\n",
" Downloading dill-0.3.7-py3-none-any.whl (115 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m115.3/115.3 kB\u001b[0m \u001b[31m14.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hRequirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (1.5.3)\n",
"Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (2.31.0)\n",
"Requirement already satisfied: tqdm>=4.62.1 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (4.66.1)\n",
"Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (3.4.1)\n",
"Collecting multiprocess (from datasets>=2.0.0->lm-eval==1.0.0)\n",
" Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m134.8/134.8 kB\u001b[0m \u001b[31m19.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hRequirement already satisfied: fsspec[http]<=2023.10.0,>=2023.1.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (2023.6.0)\n",
"Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (3.8.6)\n",
"Collecting responses<0.19 (from evaluate->lm-eval==1.0.0)\n",
" Downloading responses-0.18.0-py3-none-any.whl (38 kB)\n",
"Requirement already satisfied: safetensors in /usr/local/lib/python3.10/dist-packages (from peft>=0.2.0->lm-eval==1.0.0) (0.4.0)\n",
"Requirement already satisfied: absl-py in /usr/local/lib/python3.10/dist-packages (from rouge-score>=0.0.4->lm-eval==1.0.0) (1.4.0)\n",
"Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (from rouge-score>=0.0.4->lm-eval==1.0.0) (3.8.1)\n",
"Requirement already satisfied: six>=1.14.0 in /usr/local/lib/python3.10/dist-packages (from rouge-score>=0.0.4->lm-eval==1.0.0) (1.16.0)\n",
"Collecting portalocker (from sacrebleu>=1.5.0->lm-eval==1.0.0)\n",
" Downloading portalocker-2.8.2-py3-none-any.whl (17 kB)\n",
"Requirement already satisfied: regex in /usr/local/lib/python3.10/dist-packages (from sacrebleu>=1.5.0->lm-eval==1.0.0) (2023.6.3)\n",
"Requirement already satisfied: tabulate>=0.8.9 in /usr/local/lib/python3.10/dist-packages (from sacrebleu>=1.5.0->lm-eval==1.0.0) (0.9.0)\n",
"Collecting colorama (from sacrebleu>=1.5.0->lm-eval==1.0.0)\n",
" Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)\n",
"Requirement already satisfied: lxml in /usr/local/lib/python3.10/dist-packages (from sacrebleu>=1.5.0->lm-eval==1.0.0) (4.9.3)\n",
"Requirement already satisfied: scipy>=1.3.2 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.24.1->lm-eval==1.0.0) (1.11.3)\n",
"Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.24.1->lm-eval==1.0.0) (1.3.2)\n",
"Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.24.1->lm-eval==1.0.0) (3.2.0)\n",
"Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (3.13.1)\n",
"Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (4.5.0)\n",
"Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (1.12)\n",
"Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (3.2.1)\n",
"Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (3.1.2)\n",
"Requirement already satisfied: triton==2.1.0 in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (2.1.0)\n",
"Requirement already satisfied: tokenizers<0.19,>=0.14 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.1->lm-eval==1.0.0) (0.15.0)\n",
"Requirement already satisfied: attrs>=19.2.0 in /usr/local/lib/python3.10/dist-packages (from jsonlines->lm-eval==1.0.0) (23.1.0)\n",
"Requirement already satisfied: setuptools>=38.3.0 in /usr/local/lib/python3.10/dist-packages (from pytablewriter->lm-eval==1.0.0) (67.7.2)\n",
"Collecting DataProperty<2,>=1.0.1 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading DataProperty-1.0.1-py3-none-any.whl (27 kB)\n",
"Collecting mbstrdecoder<2,>=1.0.0 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading mbstrdecoder-1.1.3-py3-none-any.whl (7.8 kB)\n",
"Collecting pathvalidate<4,>=2.3.0 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading pathvalidate-3.2.0-py3-none-any.whl (23 kB)\n",
"Collecting tabledata<2,>=1.3.1 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading tabledata-1.3.3-py3-none-any.whl (11 kB)\n",
"Collecting tcolorpy<1,>=0.0.5 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading tcolorpy-0.1.4-py3-none-any.whl (7.9 kB)\n",
"Collecting typepy[datetime]<2,>=1.3.2 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading typepy-1.3.2-py3-none-any.whl (31 kB)\n",
"Requirement already satisfied: charset-normalizer<4.0,>=2.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (3.3.2)\n",
"Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (6.0.4)\n",
"Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (4.0.3)\n",
"Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (1.9.2)\n",
"Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (1.4.0)\n",
"Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (1.3.1)\n",
"Requirement already satisfied: chardet<6,>=3.0.4 in /usr/local/lib/python3.10/dist-packages (from mbstrdecoder<2,>=1.0.0->pytablewriter->lm-eval==1.0.0) (5.2.0)\n",
"Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.0.0->lm-eval==1.0.0) (3.4)\n",
"Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.0.0->lm-eval==1.0.0) (2.0.7)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.0.0->lm-eval==1.0.0) (2023.7.22)\n",
"Requirement already satisfied: python-dateutil<3.0.0,>=2.8.0 in /usr/local/lib/python3.10/dist-packages (from typepy[datetime]<2,>=1.3.2->pytablewriter->lm-eval==1.0.0) (2.8.2)\n",
"Requirement already satisfied: pytz>=2018.9 in /usr/local/lib/python3.10/dist-packages (from typepy[datetime]<2,>=1.3.2->pytablewriter->lm-eval==1.0.0) (2023.3.post1)\n",
"Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=1.8->lm-eval==1.0.0) (2.1.3)\n",
"Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk->rouge-score>=0.0.4->lm-eval==1.0.0) (8.1.7)\n",
"Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=1.8->lm-eval==1.0.0) (1.3.0)\n",
"Building wheels for collected packages: lm-eval, rouge-score, sqlitedict\n",
" Building wheel for lm-eval (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for lm-eval: filename=lm_eval-1.0.0-py3-none-any.whl size=994254 sha256=88356155b19f2891981ecef948326ad6ce8ca40a6009378410ec20d0e225995a\n",
" Stored in directory: /tmp/pip-ephem-wheel-cache-9v6ye7h3/wheels/17/01/26/599c0779e9858a70a73fa8a306699b5b9a868f820c225457b0\n",
" Building wheel for rouge-score (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24933 sha256=6bb0d44e4881972c43ce194e7cb65233d309758cb15f0dec54590d3d2efcfc36\n",
" Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4\n",
" Building wheel for sqlitedict (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for sqlitedict: filename=sqlitedict-2.1.0-py3-none-any.whl size=16863 sha256=5747f7dd73ddf3d8fbcebf51b5e4f718fabe1e94bccdf16d2f22a2e65ee7fdf4\n",
" Stored in directory: /root/.cache/pip/wheels/79/d6/e7/304e0e6cb2221022c26d8161f7c23cd4f259a9e41e8bbcfabd\n",
"Successfully built lm-eval rouge-score sqlitedict\n",
"Installing collected packages: sqlitedict, zstandard, tcolorpy, pybind11, pyarrow-hotfix, portalocker, pathvalidate, mbstrdecoder, jsonlines, dill, colorama, typepy, tqdm-multiprocess, sacrebleu, rouge-score, responses, multiprocess, accelerate, datasets, DataProperty, tabledata, peft, evaluate, pytablewriter, lm-eval\n",
"Successfully installed DataProperty-1.0.1 accelerate-0.24.1 colorama-0.4.6 datasets-2.15.0 dill-0.3.7 evaluate-0.4.1 jsonlines-4.0.0 lm-eval-1.0.0 mbstrdecoder-1.1.3 multiprocess-0.70.15 pathvalidate-3.2.0 peft-0.6.2 portalocker-2.8.2 pyarrow-hotfix-0.6 pybind11-2.11.1 pytablewriter-1.2.0 responses-0.18.0 rouge-score-0.1.2 sacrebleu-2.3.2 sqlitedict-2.1.0 tabledata-1.3.3 tcolorpy-0.1.4 tqdm-multiprocess-0.0.11 typepy-1.3.2 zstandard-0.22.0\n"
]
}
],
"source": [
"# Install LM-Eval\n",
"!pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 0,
"referenced_widgets": [
"a1d3a8aa016544a78e8821c8f6199e06",
"f61ed33fad754146bdd2ac9db1ba1c48",
"bfa0af6aeff344c6845e1080a878e92e",
"fd1ad9e0367d4004aae853b91c3a7617",
"6b2d90209ec14230b3d58a74ac9b83bf",
"a73f357065d34d7baf0453ae4a8d75e2",
"46f521b73fd943c081c648fd873ebc0a",
"7c5689bc13684db8a22681f41863dddd",
"48763b6233374554ae76035c0483066f",
"4986a21eb560448fa79f4b25cde48951",
"aed3acd2f2d74003b44079c333a0698e"
]
},
"id": "uyO5MaKkZyah",
"outputId": "d46e8096-5086-4e49-967e-ea33d4a2a335"
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "a1d3a8aa016544a78e8821c8f6199e06",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading builder script: 0%| | 0.00/5.67k [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from lm_eval import api"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8rfUeX6n_wkK"
},
"source": [
"## Create new evaluation tasks with config-based tasks\n",
"\n",
"Even within the same task, many works have reported numbers based on different choices of evaluation. Some report on the test sets, validation sets, or even subset of the training sets. Others have specialized prompts and verbalizers. We introduce YAMLs to allow users to easily make different variations. By leveraging the YAML configs to configure evaluations, the refactored LM-Eval takes the methods of the `Task` object and makes them configurable by setting the appropriate attributes in the config file. There, users can set the tasks they want by setting the name of the HF dataset (local tasks are also possible), the dataset splits used, and much more. Key configurations relating to prompting, such as `doc_to_text`, previously implemented as a method of the same name, are now configurable with jinja2 to allow high-level scripting to transform a HF dataset to text string as input to the model.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HYFUhhfOSJKe"
},
"source": [
"A core-feature to LM-Eval is to configure tasks with YAML configs. With configs, you can fill preset fields to easily set up a task.\n",
"\n",
"Here, we write a demo YAML config for a multiple-choice evaluation of BoolQ:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"id": "bg3dGROW-V39"
},
"outputs": [],
"source": [
"YAML_boolq_string = '''\n",
"task: demo_boolq\n",
"dataset_path: super_glue\n",
"dataset_name: boolq\n",
"output_type: multiple_choice\n",
"training_split: train\n",
"validation_split: validation\n",
"doc_to_text: \"{{passage}}\\nQuestion: {{question}}?\\nAnswer:\"\n",
"doc_to_target: label\n",
"doc_to_choice: [\"no\", \"yes\"]\n",
"should_decontaminate: true\n",
"doc_to_decontamination_query: passage\n",
"metric_list:\n",
" - metric: acc\n",
"'''\n",
"with open('boolq.yaml', 'w') as f:\n",
" f.write(YAML_boolq_string)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And we can now run evaluation on this task, by pointing to the config file we've just created:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"id": "LOUHK7PtQfq4"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2023-11-29:11:54:55,156 INFO [utils.py:160] NumExpr defaulting to 2 threads.\n",
"2023-11-29 11:54:55.942051: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"2023-11-29 11:54:55.942108: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
"2023-11-29 11:54:55.942142: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
"2023-11-29 11:54:57.066802: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
"2023-11-29:11:55:00,954 INFO [__main__.py:132] Verbosity set to INFO\n",
"2023-11-29:11:55:11,038 WARNING [__main__.py:138] --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.\n",
"2023-11-29:11:55:11,038 INFO [__main__.py:143] Including path: ./\n",
"2023-11-29:11:55:11,046 INFO [__main__.py:205] Selected Tasks: ['demo_boolq']\n",
"2023-11-29:11:55:11,047 WARNING [evaluator.py:93] generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.\n",
"2023-11-29:11:55:11,110 INFO [huggingface.py:120] Using device 'cuda'\n",
"config.json: 100% 571/571 [00:00<00:00, 2.87MB/s]\n",
"model.safetensors: 100% 5.68G/5.68G [00:32<00:00, 173MB/s]\n",
"tokenizer_config.json: 100% 396/396 [00:00<00:00, 2.06MB/s]\n",
"tokenizer.json: 100% 2.11M/2.11M [00:00<00:00, 11.6MB/s]\n",
"special_tokens_map.json: 100% 99.0/99.0 [00:00<00:00, 555kB/s]\n",
"2023-11-29:11:56:18,658 WARNING [task.py:614] [Task: demo_boolq] metric acc is defined, but aggregation is not. using default aggregation=mean\n",
"2023-11-29:11:56:18,658 WARNING [task.py:626] [Task: demo_boolq] metric acc is defined, but higher_is_better is not. using default higher_is_better=True\n",
"Downloading builder script: 100% 30.7k/30.7k [00:00<00:00, 59.0MB/s]\n",
"Downloading metadata: 100% 38.7k/38.7k [00:00<00:00, 651kB/s]\n",
"Downloading readme: 100% 14.8k/14.8k [00:00<00:00, 37.3MB/s]\n",
"Downloading data: 100% 4.12M/4.12M [00:00<00:00, 55.1MB/s]\n",
"Generating train split: 100% 9427/9427 [00:00<00:00, 15630.89 examples/s]\n",
"Generating validation split: 100% 3270/3270 [00:00<00:00, 20002.56 examples/s]\n",
"Generating test split: 100% 3245/3245 [00:00<00:00, 20866.19 examples/s]\n",
"2023-11-29:11:56:22,315 INFO [task.py:355] Building contexts for task on rank 0...\n",
"2023-11-29:11:56:22,322 INFO [evaluator.py:319] Running loglikelihood requests\n",
"100% 20/20 [00:04<00:00, 4.37it/s]\n",
"fatal: not a git repository (or any of the parent directories): .git\n",
"hf (pretrained=EleutherAI/pythia-2.8b), gen_kwargs: (), limit: 10.0, num_fewshot: None, batch_size: 1\n",
"| Tasks |Version|Filter|n-shot|Metric|Value| |Stderr|\n",
"|----------|-------|------|-----:|------|----:|---|-----:|\n",
"|demo_boolq|Yaml |none | 0|acc | 1|± | 0|\n",
"\n"
]
}
],
"source": [
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --include_path ./ \\\n",
" --tasks demo_boolq \\\n",
" --limit 10\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LOUHK7PtQfq4"
},
"source": [
"Often, tasks are part of a larger group used to measure different capabilities. The dynamism of the field today means new dimensions of evaluation can come about which would mix and match new and older tasks alike. In LM-Eval, We can also group tasks and call that the group name to evaluate on a set of tasks easily. In this instance, let's evaluate the tag `yes_or_no_tasks` which comprise of the tasks `demo_boolq` and `demo_cola`; tasks which are multiple choice tasks with options `yes` and `no` as the name suggests.\n",
"\n",
"<!-- making new groups is easier than ever, allowing user to work bottom-up by makiing individual tasks and linking them to a group or Top-Down, making a new group by listing existing tasks.\n",
"\n",
"We also show the aggregate across samples besides only showing the aggregation between subtasks. This may come in handy when certain groups want to be aggregated as a single task. -->\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"id": "fthNg3ywO-kA"
},
"outputs": [],
"source": [
"YAML_cola_string = '''\n",
"tag: yes_or_no_tasks\n",
"task: demo_cola\n",
"dataset_path: glue\n",
"dataset_name: cola\n",
"output_type: multiple_choice\n",
"training_split: train\n",
"validation_split: validation\n",
"doc_to_text: \"{{sentence}}\\nQuestion: Does this sentence make sense?\\nAnswer:\"\n",
"doc_to_target: label\n",
"doc_to_choice: [\"no\", \"yes\"]\n",
"should_decontaminate: true\n",
"doc_to_decontamination_query: sentence\n",
"metric_list:\n",
" - metric: acc\n",
"'''\n",
"with open('cola.yaml', 'w') as f:\n",
" f.write(YAML_cola_string)"
]
}, },
"id": "8hiosGzq_qZg",
"outputId": "6ab73e5e-1f54-417e-a388-07e0d870b132"
},
"outputs": [
{ {
"cell_type": "code", "name": "stdout",
"execution_count": 6, "output_type": "stream",
"metadata": { "text": [
"id": "XceRKCuuDtbn" "Collecting git+https://github.com/EleutherAI/lm-evaluation-harness.git@big-refactor\n",
}, " Cloning https://github.com/EleutherAI/lm-evaluation-harness.git (to revision big-refactor) to /tmp/pip-req-build-tnssql5s\n",
"outputs": [ " Running command git clone --filter=blob:none --quiet https://github.com/EleutherAI/lm-evaluation-harness.git /tmp/pip-req-build-tnssql5s\n",
{ " Running command git checkout -b big-refactor --track origin/big-refactor\n",
"name": "stdout", " Switched to a new branch 'big-refactor'\n",
"output_type": "stream", " Branch 'big-refactor' set up to track remote branch 'big-refactor' from 'origin'.\n",
"text": [ " Resolved https://github.com/EleutherAI/lm-evaluation-harness.git to commit 42f486ee49b65926a444cb0620870a39a5b4b0a8\n",
"2023-11-29:11:56:33,016 INFO [utils.py:160] NumExpr defaulting to 2 threads.\n", " Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n",
"2023-11-29 11:56:33.852995: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", " Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n",
"2023-11-29 11:56:33.853050: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", " Preparing metadata (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
"2023-11-29 11:56:33.853087: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n", "Collecting accelerate>=0.21.0 (from lm-eval==1.0.0)\n",
"2023-11-29 11:56:35.129047: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n", " Downloading accelerate-0.24.1-py3-none-any.whl (261 kB)\n",
"2023-11-29:11:56:38,546 INFO [__main__.py:132] Verbosity set to INFO\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m261.4/261.4 kB\u001b[0m \u001b[31m4.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"2023-11-29:11:56:47,509 WARNING [__main__.py:138] --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.\n", "\u001b[?25hCollecting evaluate (from lm-eval==1.0.0)\n",
"2023-11-29:11:56:47,509 INFO [__main__.py:143] Including path: ./\n", " Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)\n",
"2023-11-29:11:56:47,517 INFO [__main__.py:205] Selected Tasks: ['yes_or_no_tasks']\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m84.1/84.1 kB\u001b[0m \u001b[31m5.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"2023-11-29:11:56:47,520 WARNING [evaluator.py:93] generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.\n", "\u001b[?25hCollecting datasets>=2.0.0 (from lm-eval==1.0.0)\n",
"2023-11-29:11:56:47,550 INFO [huggingface.py:120] Using device 'cuda'\n", " Downloading datasets-2.15.0-py3-none-any.whl (521 kB)\n",
"2023-11-29:11:57:08,743 WARNING [task.py:614] [Task: demo_cola] metric acc is defined, but aggregation is not. using default aggregation=mean\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m521.2/521.2 kB\u001b[0m \u001b[31m9.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"2023-11-29:11:57:08,743 WARNING [task.py:626] [Task: demo_cola] metric acc is defined, but higher_is_better is not. using default higher_is_better=True\n", "\u001b[?25hCollecting jsonlines (from lm-eval==1.0.0)\n",
"Downloading builder script: 100% 28.8k/28.8k [00:00<00:00, 52.7MB/s]\n", " Downloading jsonlines-4.0.0-py3-none-any.whl (8.7 kB)\n",
"Downloading metadata: 100% 28.7k/28.7k [00:00<00:00, 51.9MB/s]\n", "Requirement already satisfied: numexpr in /usr/local/lib/python3.10/dist-packages (from lm-eval==1.0.0) (2.8.7)\n",
"Downloading readme: 100% 27.9k/27.9k [00:00<00:00, 48.0MB/s]\n", "Collecting peft>=0.2.0 (from lm-eval==1.0.0)\n",
"Downloading data: 100% 377k/377k [00:00<00:00, 12.0MB/s]\n", " Downloading peft-0.6.2-py3-none-any.whl (174 kB)\n",
"Generating train split: 100% 8551/8551 [00:00<00:00, 19744.58 examples/s]\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m174.7/174.7 kB\u001b[0m \u001b[31m7.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"Generating validation split: 100% 1043/1043 [00:00<00:00, 27057.01 examples/s]\n", "\u001b[?25hCollecting pybind11>=2.6.2 (from lm-eval==1.0.0)\n",
"Generating test split: 100% 1063/1063 [00:00<00:00, 22705.17 examples/s]\n", " Downloading pybind11-2.11.1-py3-none-any.whl (227 kB)\n",
"2023-11-29:11:57:11,698 INFO [task.py:355] Building contexts for task on rank 0...\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m227.7/227.7 kB\u001b[0m \u001b[31m12.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"2023-11-29:11:57:11,704 INFO [evaluator.py:319] Running loglikelihood requests\n", "\u001b[?25hCollecting pytablewriter (from lm-eval==1.0.0)\n",
"100% 20/20 [00:03<00:00, 5.15it/s]\n", " Downloading pytablewriter-1.2.0-py3-none-any.whl (111 kB)\n",
"fatal: not a git repository (or any of the parent directories): .git\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m111.1/111.1 kB\u001b[0m \u001b[31m8.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"hf (pretrained=EleutherAI/pythia-2.8b), gen_kwargs: (), limit: 10.0, num_fewshot: None, batch_size: 1\n", "\u001b[?25hCollecting rouge-score>=0.0.4 (from lm-eval==1.0.0)\n",
"| Tasks |Version|Filter|n-shot|Metric|Value| |Stderr|\n", " Downloading rouge_score-0.1.2.tar.gz (17 kB)\n",
"|---------------|-------|------|-----:|------|----:|---|-----:|\n", " Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
"|yes_or_no_tasks|N/A |none | 0|acc | 0.7|± |0.1528|\n", "Collecting sacrebleu>=1.5.0 (from lm-eval==1.0.0)\n",
"| - demo_cola |Yaml |none | 0|acc | 0.7|± |0.1528|\n", " Downloading sacrebleu-2.3.2-py3-none-any.whl (119 kB)\n",
"\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m119.7/119.7 kB\u001b[0m \u001b[31m8.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"| Groups |Version|Filter|n-shot|Metric|Value| |Stderr|\n", "\u001b[?25hRequirement already satisfied: scikit-learn>=0.24.1 in /usr/local/lib/python3.10/dist-packages (from lm-eval==1.0.0) (1.2.2)\n",
"|---------------|-------|------|-----:|------|----:|---|-----:|\n", "Collecting sqlitedict (from lm-eval==1.0.0)\n",
"|yes_or_no_tasks|N/A |none | 0|acc | 0.7|± |0.1528|\n", " Downloading sqlitedict-2.1.0.tar.gz (21 kB)\n",
"\n" " Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
] "Requirement already satisfied: torch>=1.8 in /usr/local/lib/python3.10/dist-packages (from lm-eval==1.0.0) (2.1.0+cu118)\n",
} "Collecting tqdm-multiprocess (from lm-eval==1.0.0)\n",
], " Downloading tqdm_multiprocess-0.0.11-py3-none-any.whl (9.8 kB)\n",
"source": [ "Requirement already satisfied: transformers>=4.1 in /usr/local/lib/python3.10/dist-packages (from lm-eval==1.0.0) (4.35.2)\n",
"# !accelerate launch --no_python\n", "Collecting zstandard (from lm-eval==1.0.0)\n",
"!lm_eval \\\n", " Downloading zstandard-0.22.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.4 MB)\n",
" --model hf \\\n", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.4/5.4 MB\u001b[0m \u001b[31m29.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n", "\u001b[?25hRequirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (1.23.5)\n",
" --include_path ./ \\\n", "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (23.2)\n",
" --tasks yes_or_no_tasks \\\n", "Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (5.9.5)\n",
" --limit 10 \\\n", "Requirement already satisfied: pyyaml in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (6.0.1)\n",
" --output output/yes_or_no_tasks/ \\\n", "Requirement already satisfied: huggingface-hub in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (0.19.4)\n",
" --log_samples\n" "Requirement already satisfied: pyarrow>=8.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (9.0.0)\n",
] "Collecting pyarrow-hotfix (from datasets>=2.0.0->lm-eval==1.0.0)\n",
" Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)\n",
"Collecting dill<0.3.8,>=0.3.0 (from datasets>=2.0.0->lm-eval==1.0.0)\n",
" Downloading dill-0.3.7-py3-none-any.whl (115 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m115.3/115.3 kB\u001b[0m \u001b[31m14.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hRequirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (1.5.3)\n",
"Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (2.31.0)\n",
"Requirement already satisfied: tqdm>=4.62.1 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (4.66.1)\n",
"Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (3.4.1)\n",
"Collecting multiprocess (from datasets>=2.0.0->lm-eval==1.0.0)\n",
" Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m134.8/134.8 kB\u001b[0m \u001b[31m19.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hRequirement already satisfied: fsspec[http]<=2023.10.0,>=2023.1.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (2023.6.0)\n",
"Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (3.8.6)\n",
"Collecting responses<0.19 (from evaluate->lm-eval==1.0.0)\n",
" Downloading responses-0.18.0-py3-none-any.whl (38 kB)\n",
"Requirement already satisfied: safetensors in /usr/local/lib/python3.10/dist-packages (from peft>=0.2.0->lm-eval==1.0.0) (0.4.0)\n",
"Requirement already satisfied: absl-py in /usr/local/lib/python3.10/dist-packages (from rouge-score>=0.0.4->lm-eval==1.0.0) (1.4.0)\n",
"Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (from rouge-score>=0.0.4->lm-eval==1.0.0) (3.8.1)\n",
"Requirement already satisfied: six>=1.14.0 in /usr/local/lib/python3.10/dist-packages (from rouge-score>=0.0.4->lm-eval==1.0.0) (1.16.0)\n",
"Collecting portalocker (from sacrebleu>=1.5.0->lm-eval==1.0.0)\n",
" Downloading portalocker-2.8.2-py3-none-any.whl (17 kB)\n",
"Requirement already satisfied: regex in /usr/local/lib/python3.10/dist-packages (from sacrebleu>=1.5.0->lm-eval==1.0.0) (2023.6.3)\n",
"Requirement already satisfied: tabulate>=0.8.9 in /usr/local/lib/python3.10/dist-packages (from sacrebleu>=1.5.0->lm-eval==1.0.0) (0.9.0)\n",
"Collecting colorama (from sacrebleu>=1.5.0->lm-eval==1.0.0)\n",
" Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)\n",
"Requirement already satisfied: lxml in /usr/local/lib/python3.10/dist-packages (from sacrebleu>=1.5.0->lm-eval==1.0.0) (4.9.3)\n",
"Requirement already satisfied: scipy>=1.3.2 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.24.1->lm-eval==1.0.0) (1.11.3)\n",
"Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.24.1->lm-eval==1.0.0) (1.3.2)\n",
"Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.24.1->lm-eval==1.0.0) (3.2.0)\n",
"Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (3.13.1)\n",
"Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (4.5.0)\n",
"Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (1.12)\n",
"Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (3.2.1)\n",
"Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (3.1.2)\n",
"Requirement already satisfied: triton==2.1.0 in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (2.1.0)\n",
"Requirement already satisfied: tokenizers<0.19,>=0.14 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.1->lm-eval==1.0.0) (0.15.0)\n",
"Requirement already satisfied: attrs>=19.2.0 in /usr/local/lib/python3.10/dist-packages (from jsonlines->lm-eval==1.0.0) (23.1.0)\n",
"Requirement already satisfied: setuptools>=38.3.0 in /usr/local/lib/python3.10/dist-packages (from pytablewriter->lm-eval==1.0.0) (67.7.2)\n",
"Collecting DataProperty<2,>=1.0.1 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading DataProperty-1.0.1-py3-none-any.whl (27 kB)\n",
"Collecting mbstrdecoder<2,>=1.0.0 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading mbstrdecoder-1.1.3-py3-none-any.whl (7.8 kB)\n",
"Collecting pathvalidate<4,>=2.3.0 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading pathvalidate-3.2.0-py3-none-any.whl (23 kB)\n",
"Collecting tabledata<2,>=1.3.1 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading tabledata-1.3.3-py3-none-any.whl (11 kB)\n",
"Collecting tcolorpy<1,>=0.0.5 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading tcolorpy-0.1.4-py3-none-any.whl (7.9 kB)\n",
"Collecting typepy[datetime]<2,>=1.3.2 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading typepy-1.3.2-py3-none-any.whl (31 kB)\n",
"Requirement already satisfied: charset-normalizer<4.0,>=2.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (3.3.2)\n",
"Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (6.0.4)\n",
"Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (4.0.3)\n",
"Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (1.9.2)\n",
"Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (1.4.0)\n",
"Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (1.3.1)\n",
"Requirement already satisfied: chardet<6,>=3.0.4 in /usr/local/lib/python3.10/dist-packages (from mbstrdecoder<2,>=1.0.0->pytablewriter->lm-eval==1.0.0) (5.2.0)\n",
"Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.0.0->lm-eval==1.0.0) (3.4)\n",
"Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.0.0->lm-eval==1.0.0) (2.0.7)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.0.0->lm-eval==1.0.0) (2023.7.22)\n",
"Requirement already satisfied: python-dateutil<3.0.0,>=2.8.0 in /usr/local/lib/python3.10/dist-packages (from typepy[datetime]<2,>=1.3.2->pytablewriter->lm-eval==1.0.0) (2.8.2)\n",
"Requirement already satisfied: pytz>=2018.9 in /usr/local/lib/python3.10/dist-packages (from typepy[datetime]<2,>=1.3.2->pytablewriter->lm-eval==1.0.0) (2023.3.post1)\n",
"Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=1.8->lm-eval==1.0.0) (2.1.3)\n",
"Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk->rouge-score>=0.0.4->lm-eval==1.0.0) (8.1.7)\n",
"Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=1.8->lm-eval==1.0.0) (1.3.0)\n",
"Building wheels for collected packages: lm-eval, rouge-score, sqlitedict\n",
" Building wheel for lm-eval (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for lm-eval: filename=lm_eval-1.0.0-py3-none-any.whl size=994254 sha256=88356155b19f2891981ecef948326ad6ce8ca40a6009378410ec20d0e225995a\n",
" Stored in directory: /tmp/pip-ephem-wheel-cache-9v6ye7h3/wheels/17/01/26/599c0779e9858a70a73fa8a306699b5b9a868f820c225457b0\n",
" Building wheel for rouge-score (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24933 sha256=6bb0d44e4881972c43ce194e7cb65233d309758cb15f0dec54590d3d2efcfc36\n",
" Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4\n",
" Building wheel for sqlitedict (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for sqlitedict: filename=sqlitedict-2.1.0-py3-none-any.whl size=16863 sha256=5747f7dd73ddf3d8fbcebf51b5e4f718fabe1e94bccdf16d2f22a2e65ee7fdf4\n",
" Stored in directory: /root/.cache/pip/wheels/79/d6/e7/304e0e6cb2221022c26d8161f7c23cd4f259a9e41e8bbcfabd\n",
"Successfully built lm-eval rouge-score sqlitedict\n",
"Installing collected packages: sqlitedict, zstandard, tcolorpy, pybind11, pyarrow-hotfix, portalocker, pathvalidate, mbstrdecoder, jsonlines, dill, colorama, typepy, tqdm-multiprocess, sacrebleu, rouge-score, responses, multiprocess, accelerate, datasets, DataProperty, tabledata, peft, evaluate, pytablewriter, lm-eval\n",
"Successfully installed DataProperty-1.0.1 accelerate-0.24.1 colorama-0.4.6 datasets-2.15.0 dill-0.3.7 evaluate-0.4.1 jsonlines-4.0.0 lm-eval-1.0.0 mbstrdecoder-1.1.3 multiprocess-0.70.15 pathvalidate-3.2.0 peft-0.6.2 portalocker-2.8.2 pyarrow-hotfix-0.6 pybind11-2.11.1 pytablewriter-1.2.0 responses-0.18.0 rouge-score-0.1.2 sacrebleu-2.3.2 sqlitedict-2.1.0 tabledata-1.3.3 tcolorpy-0.1.4 tqdm-multiprocess-0.0.11 typepy-1.3.2 zstandard-0.22.0\n"
]
}
],
"source": [
"# Install LM-Eval\n",
"!pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 0,
"referenced_widgets": [
"a1d3a8aa016544a78e8821c8f6199e06",
"f61ed33fad754146bdd2ac9db1ba1c48",
"bfa0af6aeff344c6845e1080a878e92e",
"fd1ad9e0367d4004aae853b91c3a7617",
"6b2d90209ec14230b3d58a74ac9b83bf",
"a73f357065d34d7baf0453ae4a8d75e2",
"46f521b73fd943c081c648fd873ebc0a",
"7c5689bc13684db8a22681f41863dddd",
"48763b6233374554ae76035c0483066f",
"4986a21eb560448fa79f4b25cde48951",
"aed3acd2f2d74003b44079c333a0698e"
]
}, },
"id": "uyO5MaKkZyah",
"outputId": "d46e8096-5086-4e49-967e-ea33d4a2a335"
},
"outputs": [
{ {
"cell_type": "markdown", "data": {
"metadata": { "application/vnd.jupyter.widget-view+json": {
"id": "XceRKCuuDtbn" "model_id": "a1d3a8aa016544a78e8821c8f6199e06",
"version_major": 2,
"version_minor": 0
}, },
"source": [ "text/plain": [
"## Edit Prompt Templates Quickly\n", "Downloading builder script: 0%| | 0.00/5.67k [00:00<?, ?B/s]"
"\n",
"The following is a yaml made to evaluate the specific subtask of `high_school_geography` from MMLU. It uses the standard prompt where the we choose the letters from the options with most likelihood as the model's prediction."
] ]
}, },
"metadata": {},
"output_type": "display_data"
}
],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "8rfUeX6n_wkK"
},
"source": [
"## Create new evaluation tasks with config-based tasks\n",
"\n",
"Even within the same task, many works have reported numbers based on different choices of evaluation. Some report on the test sets, validation sets, or even subset of the training sets. Others have specialized prompts and verbalizers. We introduce YAMLs to allow users to easily make different variations. By leveraging the YAML configs to configure evaluations, the refactored LM-Eval takes the methods of the `Task` object and makes them configurable by setting the appropriate attributes in the config file. There, users can set the tasks they want by setting the name of the HF dataset (local tasks are also possible), the dataset splits used, and much more. Key configurations relating to prompting, such as `doc_to_text`, previously implemented as a method of the same name, are now configurable with jinja2 to allow high-level scripting to transform a HF dataset to text string as input to the model.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HYFUhhfOSJKe"
},
"source": [
"A core-feature to LM-Eval is to configure tasks with YAML configs. With configs, you can fill preset fields to easily set up a task.\n",
"\n",
"Here, we write a demo YAML config for a multiple-choice evaluation of BoolQ:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"id": "bg3dGROW-V39"
},
"outputs": [],
"source": [
"YAML_boolq_string = \"\"\"\n",
"task: demo_boolq\n",
"dataset_path: super_glue\n",
"dataset_name: boolq\n",
"output_type: multiple_choice\n",
"training_split: train\n",
"validation_split: validation\n",
"doc_to_text: \"{{passage}}\\nQuestion: {{question}}?\\nAnswer:\"\n",
"doc_to_target: label\n",
"doc_to_choice: [\"no\", \"yes\"]\n",
"should_decontaminate: true\n",
"doc_to_decontamination_query: passage\n",
"metric_list:\n",
" - metric: acc\n",
"\"\"\"\n",
"with open(\"boolq.yaml\", \"w\") as f:\n",
" f.write(YAML_boolq_string)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And we can now run evaluation on this task, by pointing to the config file we've just created:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"id": "LOUHK7PtQfq4"
},
"outputs": [
{ {
"cell_type": "code", "name": "stdout",
"execution_count": 7, "output_type": "stream",
"metadata": { "text": [
"id": "GTFvdt9kSlBG" "2023-11-29:11:54:55,156 INFO [utils.py:160] NumExpr defaulting to 2 threads.\n",
}, "2023-11-29 11:54:55.942051: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"outputs": [], "2023-11-29 11:54:55.942108: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
"source": [ "2023-11-29 11:54:55.942142: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
"YAML_mmlu_geo_string = '''\n", "2023-11-29 11:54:57.066802: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
"task: demo_mmlu_high_school_geography\n", "2023-11-29:11:55:00,954 INFO [__main__.py:132] Verbosity set to INFO\n",
"dataset_path: cais/mmlu\n", "2023-11-29:11:55:11,038 WARNING [__main__.py:138] --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.\n",
"dataset_name: high_school_geography\n", "2023-11-29:11:55:11,038 INFO [__main__.py:143] Including path: ./\n",
"description: \"The following are multiple choice questions (with answers) about high school geography.\\n\\n\"\n", "2023-11-29:11:55:11,046 INFO [__main__.py:205] Selected Tasks: ['demo_boolq']\n",
"test_split: test\n", "2023-11-29:11:55:11,047 WARNING [evaluator.py:93] generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.\n",
"fewshot_split: dev\n", "2023-11-29:11:55:11,110 INFO [huggingface.py:120] Using device 'cuda'\n",
"fewshot_config:\n", "config.json: 100% 571/571 [00:00<00:00, 2.87MB/s]\n",
" sampler: first_n\n", "model.safetensors: 100% 5.68G/5.68G [00:32<00:00, 173MB/s]\n",
"output_type: multiple_choice\n", "tokenizer_config.json: 100% 396/396 [00:00<00:00, 2.06MB/s]\n",
"doc_to_text: \"{{question.strip()}}\\nA. {{choices[0]}}\\nB. {{choices[1]}}\\nC. {{choices[2]}}\\nD. {{choices[3]}}\\nAnswer:\"\n", "tokenizer.json: 100% 2.11M/2.11M [00:00<00:00, 11.6MB/s]\n",
"doc_to_choice: [\"A\", \"B\", \"C\", \"D\"]\n", "special_tokens_map.json: 100% 99.0/99.0 [00:00<00:00, 555kB/s]\n",
"doc_to_target: answer\n", "2023-11-29:11:56:18,658 WARNING [task.py:614] [Task: demo_boolq] metric acc is defined, but aggregation is not. using default aggregation=mean\n",
"metric_list:\n", "2023-11-29:11:56:18,658 WARNING [task.py:626] [Task: demo_boolq] metric acc is defined, but higher_is_better is not. using default higher_is_better=True\n",
" - metric: acc\n", "Downloading builder script: 100% 30.7k/30.7k [00:00<00:00, 59.0MB/s]\n",
" aggregation: mean\n", "Downloading metadata: 100% 38.7k/38.7k [00:00<00:00, 651kB/s]\n",
" higher_is_better: true\n", "Downloading readme: 100% 14.8k/14.8k [00:00<00:00, 37.3MB/s]\n",
" - metric: acc_norm\n", "Downloading data: 100% 4.12M/4.12M [00:00<00:00, 55.1MB/s]\n",
" aggregation: mean\n", "Generating train split: 100% 9427/9427 [00:00<00:00, 15630.89 examples/s]\n",
" higher_is_better: true\n", "Generating validation split: 100% 3270/3270 [00:00<00:00, 20002.56 examples/s]\n",
"'''\n", "Generating test split: 100% 3245/3245 [00:00<00:00, 20866.19 examples/s]\n",
"with open('mmlu_high_school_geography.yaml', 'w') as f:\n", "2023-11-29:11:56:22,315 INFO [task.py:355] Building contexts for task on rank 0...\n",
" f.write(YAML_mmlu_geo_string)\n" "2023-11-29:11:56:22,322 INFO [evaluator.py:319] Running loglikelihood requests\n",
] "100% 20/20 [00:04<00:00, 4.37it/s]\n",
}, "fatal: not a git repository (or any of the parent directories): .git\n",
"hf (pretrained=EleutherAI/pythia-2.8b), gen_kwargs: (), limit: 10.0, num_fewshot: None, batch_size: 1\n",
"| Tasks |Version|Filter|n-shot|Metric|Value| |Stderr|\n",
"|----------|-------|------|-----:|------|----:|---|-----:|\n",
"|demo_boolq|Yaml |none | 0|acc | 1|± | 0|\n",
"\n"
]
}
],
"source": [
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --include_path ./ \\\n",
" --tasks demo_boolq \\\n",
" --limit 10"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LOUHK7PtQfq4"
},
"source": [
"Often, tasks are part of a larger group used to measure different capabilities. The dynamism of the field today means new dimensions of evaluation can come about which would mix and match new and older tasks alike. In LM-Eval, We can also group tasks and call that the group name to evaluate on a set of tasks easily. In this instance, let's evaluate the tag `yes_or_no_tasks` which comprise of the tasks `demo_boolq` and `demo_cola`; tasks which are multiple choice tasks with options `yes` and `no` as the name suggests.\n",
"\n",
"<!-- making new groups is easier than ever, allowing user to work bottom-up by makiing individual tasks and linking them to a group or Top-Down, making a new group by listing existing tasks.\n",
"\n",
"We also show the aggregate across samples besides only showing the aggregation between subtasks. This may come in handy when certain groups want to be aggregated as a single task. -->\n",
"\n",
"\n"
]
},
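As a reminder of how the linkage works, any task whose config carries `yes_or_no_tasks` in its `tag` field is pulled in when that tag is passed to `--tasks`. A minimal, hypothetical fragment for the other member of the tag might look like the sketch below (only the linking keys are shown; the rest of `demo_boolq`'s config is as defined earlier in this notebook):

```python
# Hypothetical fragment: how a second task opts into the same tag. Only the
# linking keys are shown; the remaining demo_boolq fields are unchanged.
YAML_boolq_tag_fragment = """
tag: yes_or_no_tasks
task: demo_boolq
# ... dataset, prompt, and metric fields as defined earlier ...
"""
```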
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"id": "fthNg3ywO-kA"
},
"outputs": [],
"source": [
"YAML_cola_string = \"\"\"\n",
"tag: yes_or_no_tasks\n",
"task: demo_cola\n",
"dataset_path: glue\n",
"dataset_name: cola\n",
"output_type: multiple_choice\n",
"training_split: train\n",
"validation_split: validation\n",
"doc_to_text: \"{{sentence}}\\nQuestion: Does this sentence make sense?\\nAnswer:\"\n",
"doc_to_target: label\n",
"doc_to_choice: [\"no\", \"yes\"]\n",
"should_decontaminate: true\n",
"doc_to_decontamination_query: sentence\n",
"metric_list:\n",
" - metric: acc\n",
"\"\"\"\n",
"with open(\"cola.yaml\", \"w\") as f:\n",
" f.write(YAML_cola_string)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"id": "XceRKCuuDtbn"
},
"outputs": [
{ {
"cell_type": "code", "name": "stdout",
"execution_count": 8, "output_type": "stream",
"metadata": { "text": [
"id": "jyKOfCsKb-xy" "2023-11-29:11:56:33,016 INFO [utils.py:160] NumExpr defaulting to 2 threads.\n",
}, "2023-11-29 11:56:33.852995: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"outputs": [ "2023-11-29 11:56:33.853050: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
{ "2023-11-29 11:56:33.853087: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
"name": "stdout", "2023-11-29 11:56:35.129047: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
"output_type": "stream", "2023-11-29:11:56:38,546 INFO [__main__.py:132] Verbosity set to INFO\n",
"text": [ "2023-11-29:11:56:47,509 WARNING [__main__.py:138] --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.\n",
"2023-11-29:11:57:23,598 INFO [utils.py:160] NumExpr defaulting to 2 threads.\n", "2023-11-29:11:56:47,509 INFO [__main__.py:143] Including path: ./\n",
"2023-11-29 11:57:24.719750: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", "2023-11-29:11:56:47,517 INFO [__main__.py:205] Selected Tasks: ['yes_or_no_tasks']\n",
"2023-11-29 11:57:24.719806: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", "2023-11-29:11:56:47,520 WARNING [evaluator.py:93] generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.\n",
"2023-11-29 11:57:24.719847: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n", "2023-11-29:11:56:47,550 INFO [huggingface.py:120] Using device 'cuda'\n",
"2023-11-29 11:57:26.656125: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n", "2023-11-29:11:57:08,743 WARNING [task.py:614] [Task: demo_cola] metric acc is defined, but aggregation is not. using default aggregation=mean\n",
"2023-11-29:11:57:31,563 INFO [__main__.py:132] Verbosity set to INFO\n", "2023-11-29:11:57:08,743 WARNING [task.py:626] [Task: demo_cola] metric acc is defined, but higher_is_better is not. using default higher_is_better=True\n",
"2023-11-29:11:57:40,541 WARNING [__main__.py:138] --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.\n", "Downloading builder script: 100% 28.8k/28.8k [00:00<00:00, 52.7MB/s]\n",
"2023-11-29:11:57:40,541 INFO [__main__.py:143] Including path: ./\n", "Downloading metadata: 100% 28.7k/28.7k [00:00<00:00, 51.9MB/s]\n",
"2023-11-29:11:57:40,558 INFO [__main__.py:205] Selected Tasks: ['demo_mmlu_high_school_geography']\n", "Downloading readme: 100% 27.9k/27.9k [00:00<00:00, 48.0MB/s]\n",
"2023-11-29:11:57:40,559 WARNING [evaluator.py:93] generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.\n", "Downloading data: 100% 377k/377k [00:00<00:00, 12.0MB/s]\n",
"2023-11-29:11:57:40,589 INFO [huggingface.py:120] Using device 'cuda'\n", "Generating train split: 100% 8551/8551 [00:00<00:00, 19744.58 examples/s]\n",
"Downloading builder script: 100% 5.84k/5.84k [00:00<00:00, 17.7MB/s]\n", "Generating validation split: 100% 1043/1043 [00:00<00:00, 27057.01 examples/s]\n",
"Downloading metadata: 100% 106k/106k [00:00<00:00, 892kB/s] \n", "Generating test split: 100% 1063/1063 [00:00<00:00, 22705.17 examples/s]\n",
"Downloading readme: 100% 39.7k/39.7k [00:00<00:00, 631kB/s]\n", "2023-11-29:11:57:11,698 INFO [task.py:355] Building contexts for task on rank 0...\n",
"Downloading data: 100% 166M/166M [00:01<00:00, 89.0MB/s]\n", "2023-11-29:11:57:11,704 INFO [evaluator.py:319] Running loglikelihood requests\n",
"Generating auxiliary_train split: 100% 99842/99842 [00:07<00:00, 12536.83 examples/s]\n", "100% 20/20 [00:03<00:00, 5.15it/s]\n",
"Generating test split: 100% 198/198 [00:00<00:00, 1439.20 examples/s]\n", "fatal: not a git repository (or any of the parent directories): .git\n",
"Generating validation split: 100% 22/22 [00:00<00:00, 4181.76 examples/s]\n", "hf (pretrained=EleutherAI/pythia-2.8b), gen_kwargs: (), limit: 10.0, num_fewshot: None, batch_size: 1\n",
"Generating dev split: 100% 5/5 [00:00<00:00, 36.25 examples/s]\n", "| Tasks |Version|Filter|n-shot|Metric|Value| |Stderr|\n",
"2023-11-29:11:58:09,798 INFO [task.py:355] Building contexts for task on rank 0...\n", "|---------------|-------|------|-----:|------|----:|---|-----:|\n",
"2023-11-29:11:58:09,822 INFO [evaluator.py:319] Running loglikelihood requests\n", "|yes_or_no_tasks|N/A |none | 0|acc | 0.7|± |0.1528|\n",
"100% 40/40 [00:05<00:00, 7.86it/s]\n", "| - demo_cola |Yaml |none | 0|acc | 0.7|± |0.1528|\n",
"fatal: not a git repository (or any of the parent directories): .git\n", "\n",
"hf (pretrained=EleutherAI/pythia-2.8b), gen_kwargs: (), limit: 10.0, num_fewshot: None, batch_size: 1\n", "| Groups |Version|Filter|n-shot|Metric|Value| |Stderr|\n",
"| Tasks |Version|Filter|n-shot| Metric |Value| |Stderr|\n", "|---------------|-------|------|-----:|------|----:|---|-----:|\n",
"|-------------------------------|-------|------|-----:|--------|----:|---|-----:|\n", "|yes_or_no_tasks|N/A |none | 0|acc | 0.7|± |0.1528|\n",
"|demo_mmlu_high_school_geography|Yaml |none | 0|acc | 0.3|± |0.1528|\n", "\n"
"| | |none | 0|acc_norm| 0.3|± |0.1528|\n", ]
"\n" }
] ],
} "source": [
], "# !accelerate launch --no_python\n",
"source": [ "!lm_eval \\\n",
"# !accelerate launch --no_python\n", " --model hf \\\n",
"!lm_eval \\\n", " --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --model hf \\\n", " --include_path ./ \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n", " --tasks yes_or_no_tasks \\\n",
" --include_path ./ \\\n", " --limit 10 \\\n",
" --tasks demo_mmlu_high_school_geography \\\n", " --output output/yes_or_no_tasks/ \\\n",
" --limit 10 \\\n", " --log_samples"
" --output output/mmlu_high_school_geography/ \\\n", ]
" --log_samples" },
] {
}, "cell_type": "markdown",
"metadata": {
"id": "XceRKCuuDtbn"
},
"source": [
"## Edit Prompt Templates Quickly\n",
"\n",
"The following is a yaml made to evaluate the specific subtask of `high_school_geography` from MMLU. It uses the standard prompt where the we choose the letters from the options with most likelihood as the model's prediction."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"id": "GTFvdt9kSlBG"
},
"outputs": [],
"source": [
"YAML_mmlu_geo_string = \"\"\"\n",
"task: demo_mmlu_high_school_geography\n",
"dataset_path: cais/mmlu\n",
"dataset_name: high_school_geography\n",
"description: \"The following are multiple choice questions (with answers) about high school geography.\\n\\n\"\n",
"test_split: test\n",
"fewshot_split: dev\n",
"fewshot_config:\n",
" sampler: first_n\n",
"output_type: multiple_choice\n",
"doc_to_text: \"{{question.strip()}}\\nA. {{choices[0]}}\\nB. {{choices[1]}}\\nC. {{choices[2]}}\\nD. {{choices[3]}}\\nAnswer:\"\n",
"doc_to_choice: [\"A\", \"B\", \"C\", \"D\"]\n",
"doc_to_target: answer\n",
"metric_list:\n",
" - metric: acc\n",
" aggregation: mean\n",
" higher_is_better: true\n",
" - metric: acc_norm\n",
" aggregation: mean\n",
" higher_is_better: true\n",
"\"\"\"\n",
"with open(\"mmlu_high_school_geography.yaml\", \"w\") as f:\n",
" f.write(YAML_mmlu_geo_string)"
]
},
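To make the template concrete, here is a small sketch of the context string this config roughly produces for a single document in the 0-shot setting; the question and options below are invented for illustration and are not taken from MMLU.

```python
# Sketch: what the harness roughly sends to the model for one document under
# this config (0-shot). The `description` string is prepended, `doc_to_text`
# renders the question and lettered options, and each entry of doc_to_choice
# ("A"/"B"/"C"/"D") is scored as a continuation. The example document is invented.
doc = {
    "question": "Which of the following is a renewable energy source?",
    "choices": ["Coal", "Natural gas", "Solar", "Peat"],
    "answer": 2,
}

description = (
    "The following are multiple choice questions (with answers) "
    "about high school geography.\n\n"
)
prompt = description + (
    f"{doc['question'].strip()}\n"
    f"A. {doc['choices'][0]}\n"
    f"B. {doc['choices'][1]}\n"
    f"C. {doc['choices'][2]}\n"
    f"D. {doc['choices'][3]}\n"
    "Answer:"
)
print(prompt)  # the continuations " A", " B", " C", " D" are then scored
```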
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"id": "jyKOfCsKb-xy"
},
"outputs": [
{ {
"cell_type": "markdown", "name": "stdout",
"metadata": { "output_type": "stream",
"id": "jyKOfCsKb-xy" "text": [
}, "2023-11-29:11:57:23,598 INFO [utils.py:160] NumExpr defaulting to 2 threads.\n",
"source": [ "2023-11-29 11:57:24.719750: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"We could also evaluate this task in a different way. For example, instead of observing the loglikelihood of the letters, we can instead evaluate on the choices themselves as the continuation. This is done by simply changing `doc_to_choice` from a list of letters to the corresponding `choices` field from the HF dataset. We write `\"{{choices}}\"` so that the string field is interpreted as jinja string that acquires the list from the HF dataset directly.\n", "2023-11-29 11:57:24.719806: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
"\n", "2023-11-29 11:57:24.719847: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
"Another convenient feature here is since we're only modifying the `doc_to_choice` and the rest of config is the same as the task above, we can use the above configuration as a template by using `include: mmlu_high_school_geography.yaml` to load the config from that file. We'll need to add a unique task name as to not colide with the existing yaml config we're including. For this case we'll simply name this one `mmlu_high_school_geography_continuation`. `doc_to_text` is added here just for sake of clarity." "2023-11-29 11:57:26.656125: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
] "2023-11-29:11:57:31,563 INFO [__main__.py:132] Verbosity set to INFO\n",
}, "2023-11-29:11:57:40,541 WARNING [__main__.py:138] --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.\n",
"2023-11-29:11:57:40,541 INFO [__main__.py:143] Including path: ./\n",
"2023-11-29:11:57:40,558 INFO [__main__.py:205] Selected Tasks: ['demo_mmlu_high_school_geography']\n",
"2023-11-29:11:57:40,559 WARNING [evaluator.py:93] generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.\n",
"2023-11-29:11:57:40,589 INFO [huggingface.py:120] Using device 'cuda'\n",
"Downloading builder script: 100% 5.84k/5.84k [00:00<00:00, 17.7MB/s]\n",
"Downloading metadata: 100% 106k/106k [00:00<00:00, 892kB/s] \n",
"Downloading readme: 100% 39.7k/39.7k [00:00<00:00, 631kB/s]\n",
"Downloading data: 100% 166M/166M [00:01<00:00, 89.0MB/s]\n",
"Generating auxiliary_train split: 100% 99842/99842 [00:07<00:00, 12536.83 examples/s]\n",
"Generating test split: 100% 198/198 [00:00<00:00, 1439.20 examples/s]\n",
"Generating validation split: 100% 22/22 [00:00<00:00, 4181.76 examples/s]\n",
"Generating dev split: 100% 5/5 [00:00<00:00, 36.25 examples/s]\n",
"2023-11-29:11:58:09,798 INFO [task.py:355] Building contexts for task on rank 0...\n",
"2023-11-29:11:58:09,822 INFO [evaluator.py:319] Running loglikelihood requests\n",
"100% 40/40 [00:05<00:00, 7.86it/s]\n",
"fatal: not a git repository (or any of the parent directories): .git\n",
"hf (pretrained=EleutherAI/pythia-2.8b), gen_kwargs: (), limit: 10.0, num_fewshot: None, batch_size: 1\n",
"| Tasks |Version|Filter|n-shot| Metric |Value| |Stderr|\n",
"|-------------------------------|-------|------|-----:|--------|----:|---|-----:|\n",
"|demo_mmlu_high_school_geography|Yaml |none | 0|acc | 0.3|± |0.1528|\n",
"| | |none | 0|acc_norm| 0.3|± |0.1528|\n",
"\n"
]
}
],
"source": [
"# !accelerate launch --no_python\n",
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --include_path ./ \\\n",
" --tasks demo_mmlu_high_school_geography \\\n",
" --limit 10 \\\n",
" --output output/mmlu_high_school_geography/ \\\n",
" --log_samples"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jyKOfCsKb-xy"
},
"source": [
"We could also evaluate this task in a different way. For example, instead of observing the loglikelihood of the letters, we can instead evaluate on the choices themselves as the continuation. This is done by simply changing `doc_to_choice` from a list of letters to the corresponding `choices` field from the HF dataset. We write `\"{{choices}}\"` so that the string field is interpreted as jinja string that acquires the list from the HF dataset directly.\n",
"\n",
"Another convenient feature here is since we're only modifying the `doc_to_choice` and the rest of config is the same as the task above, we can use the above configuration as a template by using `include: mmlu_high_school_geography.yaml` to load the config from that file. We'll need to add a unique task name as to not colide with the existing yaml config we're including. For this case we'll simply name this one `mmlu_high_school_geography_continuation`. `doc_to_text` is added here just for sake of clarity."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"id": "lqElwU54TaK-"
},
"outputs": [],
"source": [
"YAML_mmlu_geo_string = \"\"\"\n",
"include: mmlu_high_school_geography.yaml\n",
"task: demo_mmlu_high_school_geography_continuation\n",
"doc_to_text: \"{{question.strip()}}\\nA. {{choices[0]}}\\nB. {{choices[1]}}\\nC. {{choices[2]}}\\nD. {{choices[3]}}\\nAnswer:\"\n",
"doc_to_choice: \"{{choices}}\"\n",
"\"\"\"\n",
"with open(\"mmlu_high_school_geography_continuation.yaml\", \"w\") as f:\n",
" f.write(YAML_mmlu_geo_string)"
]
},
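Concretely, the change in `doc_to_choice` only changes what gets scored as the continuation; the context string is identical in both configs. A hypothetical illustration (option texts invented, matching the toy example above):

```python
# Illustration with hypothetical values: the continuation strings scored for a
# single document under each config. The prompt/context is the same in both cases.
letter_continuations = [" A", " B", " C", " D"]  # doc_to_choice: ["A", "B", "C", "D"]
full_text_continuations = [                      # doc_to_choice: "{{choices}}"
    " Coal",
    " Natural gas",
    " Solar",
    " Peat",
]
# acc takes the argmax of the raw loglikelihoods; acc_norm length-normalizes each
# continuation's score, which matters more for the full-text variant.
```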
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"id": "-_CVnDirdy7j"
},
"outputs": [
{ {
"cell_type": "code", "name": "stdout",
"execution_count": 9, "output_type": "stream",
"metadata": { "text": [
"id": "lqElwU54TaK-" "2023-11-29:11:58:21,284 INFO [utils.py:160] NumExpr defaulting to 2 threads.\n",
}, "2023-11-29 11:58:22.850159: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"outputs": [], "2023-11-29 11:58:22.850219: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
"source": [ "2023-11-29 11:58:22.850254: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
"YAML_mmlu_geo_string = '''\n", "2023-11-29 11:58:24.948103: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
"include: mmlu_high_school_geography.yaml\n", "2023-11-29:11:58:28,460 INFO [__main__.py:132] Verbosity set to INFO\n",
"task: demo_mmlu_high_school_geography_continuation\n", "2023-11-29:11:58:37,935 WARNING [__main__.py:138] --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.\n",
"doc_to_text: \"{{question.strip()}}\\nA. {{choices[0]}}\\nB. {{choices[1]}}\\nC. {{choices[2]}}\\nD. {{choices[3]}}\\nAnswer:\"\n", "2023-11-29:11:58:37,935 INFO [__main__.py:143] Including path: ./\n",
"doc_to_choice: \"{{choices}}\"\n", "2023-11-29:11:58:37,969 INFO [__main__.py:205] Selected Tasks: ['demo_mmlu_high_school_geography_continuation']\n",
"'''\n", "2023-11-29:11:58:37,972 WARNING [evaluator.py:93] generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.\n",
"with open('mmlu_high_school_geography_continuation.yaml', 'w') as f:\n", "2023-11-29:11:58:38,008 INFO [huggingface.py:120] Using device 'cuda'\n",
" f.write(YAML_mmlu_geo_string)\n" "2023-11-29:11:58:59,758 INFO [task.py:355] Building contexts for task on rank 0...\n",
] "2023-11-29:11:58:59,777 INFO [evaluator.py:319] Running loglikelihood requests\n",
}, "100% 40/40 [00:02<00:00, 16.23it/s]\n",
"fatal: not a git repository (or any of the parent directories): .git\n",
"hf (pretrained=EleutherAI/pythia-2.8b), gen_kwargs: (), limit: 10.0, num_fewshot: None, batch_size: 1\n",
"| Tasks |Version|Filter|n-shot| Metric |Value| |Stderr|\n",
"|--------------------------------------------|-------|------|-----:|--------|----:|---|-----:|\n",
"|demo_mmlu_high_school_geography_continuation|Yaml |none | 0|acc | 0.1|± |0.1000|\n",
"| | |none | 0|acc_norm| 0.2|± |0.1333|\n",
"\n"
]
}
],
"source": [
"# !accelerate launch --no_python\n",
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --include_path ./ \\\n",
" --tasks demo_mmlu_high_school_geography_continuation \\\n",
" --limit 10 \\\n",
" --output output/mmlu_high_school_geography_continuation/ \\\n",
" --log_samples"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-_CVnDirdy7j"
},
"source": [
"If we take a look at the samples, we can see that it is in fact evaluating the continuation based on the choices rather than the letters."
]
},
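Outside of Colab, the same log file can be inspected with a few lines of plain Python instead of `google.colab.files`. The sketch below only prints the keys and the raw first record, since the exact per-sample schema can vary between versions.

```python
import json

# Peek at the first logged sample without Colab. We avoid assuming specific
# field names and just dump the keys plus a truncated view of the record.
path = (
    "output/mmlu_high_school_geography_continuation/"
    "pretrained__EleutherAI__pythia-2.8b_demo_mmlu_high_school_geography_continuation.jsonl"
)
with open(path) as f:
    first = json.loads(f.readline())

print(sorted(first.keys()))
print(json.dumps(first, indent=2)[:2000])  # truncate for readability
```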
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"id": "duBDqC6PAdjL"
},
"outputs": [
{ {
"cell_type": "code", "data": {
"execution_count": 10, "application/javascript": "\n ((filepath) => {{\n if (!google.colab.kernel.accessAllowed) {{\n return;\n }}\n google.colab.files.view(filepath);\n }})(\"/content/output/mmlu_high_school_geography_continuation/pretrained__EleutherAI__pythia-2.8b_demo_mmlu_high_school_geography_continuation.jsonl\")",
"metadata": { "text/plain": [
"id": "-_CVnDirdy7j" "<IPython.core.display.Javascript object>"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2023-11-29:11:58:21,284 INFO [utils.py:160] NumExpr defaulting to 2 threads.\n",
"2023-11-29 11:58:22.850159: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"2023-11-29 11:58:22.850219: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
"2023-11-29 11:58:22.850254: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
"2023-11-29 11:58:24.948103: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
"2023-11-29:11:58:28,460 INFO [__main__.py:132] Verbosity set to INFO\n",
"2023-11-29:11:58:37,935 WARNING [__main__.py:138] --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.\n",
"2023-11-29:11:58:37,935 INFO [__main__.py:143] Including path: ./\n",
"2023-11-29:11:58:37,969 INFO [__main__.py:205] Selected Tasks: ['demo_mmlu_high_school_geography_continuation']\n",
"2023-11-29:11:58:37,972 WARNING [evaluator.py:93] generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.\n",
"2023-11-29:11:58:38,008 INFO [huggingface.py:120] Using device 'cuda'\n",
"2023-11-29:11:58:59,758 INFO [task.py:355] Building contexts for task on rank 0...\n",
"2023-11-29:11:58:59,777 INFO [evaluator.py:319] Running loglikelihood requests\n",
"100% 40/40 [00:02<00:00, 16.23it/s]\n",
"fatal: not a git repository (or any of the parent directories): .git\n",
"hf (pretrained=EleutherAI/pythia-2.8b), gen_kwargs: (), limit: 10.0, num_fewshot: None, batch_size: 1\n",
"| Tasks |Version|Filter|n-shot| Metric |Value| |Stderr|\n",
"|--------------------------------------------|-------|------|-----:|--------|----:|---|-----:|\n",
"|demo_mmlu_high_school_geography_continuation|Yaml |none | 0|acc | 0.1|± |0.1000|\n",
"| | |none | 0|acc_norm| 0.2|± |0.1333|\n",
"\n"
]
}
],
"source": [
"# !accelerate launch --no_python\n",
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --include_path ./ \\\n",
" --tasks demo_mmlu_high_school_geography_continuation \\\n",
" --limit 10 \\\n",
" --output output/mmlu_high_school_geography_continuation/ \\\n",
" --log_samples\n"
] ]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from google.colab import files\n",
"\n",
"\n",
"files.view(\n",
" \"output/mmlu_high_school_geography_continuation/pretrained__EleutherAI__pythia-2.8b_demo_mmlu_high_school_geography_continuation.jsonl\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6p0-KPwAgK5j"
},
"source": [
"## Closer Look at YAML Fields\n",
"\n",
"To prepare a task we can simply fill in a YAML config with the relevant information.\n",
"\n",
"`output_type`\n",
"The current provided evaluation types comprise of the following:\n",
"1. `loglikelihood`: Evaluates the loglikelihood of a continuation, conditioned on some input string.\n",
"2. `loglikelihood_rolling`: evaluate the loglikelihood of producing a string, conditioned on the empty string. (Used for perplexity evaluations)\n",
"3. `multiple_choice`: Evaluates loglikelihood among the a number of choices predicted by the model.\n",
"4. `greedy_until`: Model outputs greedy generation (can be configured to to use beam search and other generation-related parameters)\n",
"\n",
"The core prompt revolves around 3 fields.\n",
"1. `doc_to_text`: Denotes the prompt template that will be used as input to the model.\n",
"2. `doc_to_choice`: Available choices that will be used as continuation for the model. This is used when the `output_type` is `multiple_choice`, and otherwise can be left as `None`.\n",
"3. `doc_to_target`: When `output_type` is `multiple_choice`, this can be an index that corresponds to the correct answer, or the answer string itself (must be a subset of `doc_to_choice`). For other tasks, this is expected to be a string. You can fill this field with a feature name from the HF dataset so long as the resulting feature follows the conditioned described.\n",
"\n",
"These three fields can be expressed as strings, column names from the source dataset, or as Jinja2 templates that can use fields from the source dataset as variables.\n"
]
},
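Putting the pieces above together, a minimal config might look like the following sketch, written in the same write-a-YAML-string style used throughout this notebook. Note that `my_org/my_dataset`, `text`, and `label` are hypothetical placeholder names, not a real dataset.

```python
# A minimal, hypothetical task config tying together output_type and the three
# doc_to_* fields. `my_org/my_dataset`, `text`, and `label` are placeholders.
YAML_minimal_task_string = """
task: demo_minimal_task
dataset_path: my_org/my_dataset
output_type: multiple_choice
test_split: test
doc_to_text: "{{text.strip()}}\\nQuestion: Is the above statement true?\\nAnswer:"
doc_to_choice: ["no", "yes"]
doc_to_target: label
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
"""
with open("demo_minimal_task.yaml", "w") as f:
    f.write(YAML_minimal_task_string)
```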
{
"cell_type": "markdown",
"metadata": {
"id": "6p0-KPwAgK5j"
},
"source": [
"## What if Jinja is not Sufficient?\n",
"\n",
"There can be times where the Jinja2 templating language is not enough to make the prompt we had in mind. There are a few ways to circumvent this limitation:\n",
"\n",
"1. Use `!function` operator for the prompt-related fields to pass a python function that takes as input the dataset row, and will output the prompt template component.\n",
"2. Perform a transformation on the dataset beforehand."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below, we show an example of using `!function` to create `doc_to_text` from a python function:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
}, },
"id": "DYZ5c0JhR1lJ",
"outputId": "ca945235-fb9e-4f17-8bfa-78e7d6ec1490"
},
"outputs": [
{ {
"cell_type": "markdown", "name": "stdout",
"metadata": { "output_type": "stream",
"id": "-_CVnDirdy7j" "text": [
}, "2023-11-29:11:59:08,312 INFO [utils.py:160] NumExpr defaulting to 2 threads.\n",
"source": [ "2023-11-29 11:59:09.348327: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"If we take a look at the samples, we can see that it is in fact evaluating the continuation based on the choices rather than the letters." "2023-11-29 11:59:09.348387: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
] "2023-11-29 11:59:09.348421: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
"2023-11-29 11:59:10.573752: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
"2023-11-29:11:59:14,044 INFO [__main__.py:132] Verbosity set to INFO\n",
"2023-11-29:11:59:23,654 WARNING [__main__.py:138] --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.\n",
"2023-11-29:11:59:23,654 INFO [__main__.py:143] Including path: ./\n",
"2023-11-29:11:59:23,678 INFO [__main__.py:205] Selected Tasks: ['demo_mmlu_high_school_geography_function_prompt']\n",
"2023-11-29:11:59:23,679 WARNING [evaluator.py:93] generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.\n",
"2023-11-29:11:59:23,708 INFO [huggingface.py:120] Using device 'cuda'\n",
"2023-11-29:11:59:44,516 INFO [task.py:355] Building contexts for task on rank 0...\n",
"2023-11-29:11:59:44,524 INFO [evaluator.py:319] Running loglikelihood requests\n",
"100% 40/40 [00:02<00:00, 15.41it/s]\n",
"fatal: not a git repository (or any of the parent directories): .git\n",
"hf (pretrained=EleutherAI/pythia-2.8b), gen_kwargs: (), limit: 10.0, num_fewshot: None, batch_size: 1\n",
"| Tasks |Version|Filter|n-shot| Metric |Value| |Stderr|\n",
"|-----------------------------------------------|-------|------|-----:|--------|----:|---|-----:|\n",
"|demo_mmlu_high_school_geography_function_prompt|Yaml |none | 0|acc | 0.1|± |0.1000|\n",
"| | |none | 0|acc_norm| 0.2|± |0.1333|\n",
"\n"
]
}
],
"source": [
"YAML_mmlu_geo_string = \"\"\"\n",
"include: mmlu_high_school_geography.yaml\n",
"task: demo_mmlu_high_school_geography_function_prompt\n",
"doc_to_text: !function utils.doc_to_text\n",
"doc_to_choice: \"{{choices}}\"\n",
"\"\"\"\n",
"with open(\"demo_mmlu_high_school_geography_function_prompt.yaml\", \"w\") as f:\n",
" f.write(YAML_mmlu_geo_string)\n",
"\n",
"DOC_TO_TEXT = \"\"\"\n",
"def doc_to_text(x):\n",
" question = x[\"question\"].strip()\n",
" choices = x[\"choices\"]\n",
" option_a = choices[0]\n",
" option_b = choices[1]\n",
" option_c = choices[2]\n",
" option_d = choices[3]\n",
" return f\"{question}\\\\nA. {option_a}\\\\nB. {option_b}\\\\nC. {option_c}\\\\nD. {option_d}\\\\nAnswer:\"\n",
"\"\"\"\n",
"with open(\"utils.py\", \"w\") as f:\n",
" f.write(DOC_TO_TEXT)\n",
"\n",
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --include_path ./ \\\n",
" --tasks demo_mmlu_high_school_geography_function_prompt \\\n",
" --limit 10 \\\n",
" --output output/demo_mmlu_high_school_geography_function_prompt/ \\\n",
" --log_samples"
]
},
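Since `!function utils.doc_to_text` simply resolves to a Python callable in the `utils.py` file written above, you can sanity-check the template locally before launching an evaluation. The document below is made up for illustration.

```python
# Quick local sanity check of the !function prompt: import the utils.py we just
# wrote and render the prompt for a made-up document.
import importlib

import utils

importlib.reload(utils)  # pick up the freshly written file if it was imported before

toy_doc = {
    "question": "Which continent is the Sahara located on?",
    "choices": ["Asia", "Africa", "Europe", "South America"],
    "answer": 1,
}
print(utils.doc_to_text(toy_doc))
```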
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we'll also show how to do this via preprocessing the dataset as necessary using the `process_docs` config field:\n",
"\n",
"We will write a function that will modify each document in our evaluation dataset's split to add a field that is suitable for us to use in `doc_to_text`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"YAML_mmlu_geo_string = \"\"\"\n",
"include: mmlu_high_school_geography.yaml\n",
"task: demo_mmlu_high_school_geography_function_prompt_2\n",
"process_docs: !function utils_process_docs.process_docs\n",
"doc_to_text: \"{{input}}\"\n",
"doc_to_choice: \"{{choices}}\"\n",
"\"\"\"\n",
"with open(\"demo_mmlu_high_school_geography_process_docs.yaml\", \"w\") as f:\n",
" f.write(YAML_mmlu_geo_string)\n",
"\n",
"DOC_TO_TEXT = \"\"\"\n",
"def process_docs(dataset):\n",
" def _process_doc(x):\n",
" question = x[\"question\"].strip()\n",
" choices = x[\"choices\"]\n",
" option_a = choices[0]\n",
" option_b = choices[1]\n",
" option_c = choices[2]\n",
" option_d = choices[3]\n",
" doc[\"input\"] = f\"{question}\\\\nA. {option_a}\\\\nB. {option_b}\\\\nC. {option_c}\\\\nD. {option_d}\\\\nAnswer:\"\n",
" return out_doc\n",
"\n",
" return dataset.map(_process_doc)\n",
"\"\"\"\n",
"\n",
"with open(\"utils_process_docs.py\", \"w\") as f:\n",
" f.write(DOC_TO_TEXT)\n",
"\n",
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --include_path ./ \\\n",
" --tasks demo_mmlu_high_school_geography_function_prompt_2 \\\n",
" --limit 10 \\\n",
" --output output/demo_mmlu_high_school_geography_function_prompt_2/ \\\n",
" --log_samples"
]
},
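As with the prompt function, the `process_docs` hook can be exercised locally on a toy `datasets.Dataset` to confirm that the new `input` column comes out as expected; the rows below are invented.

```python
# Quick local check of the process_docs hook on a tiny in-memory dataset.
# The row is invented; we only verify that the added "input" column renders
# the question and lettered options as intended.
import datasets

import utils_process_docs

toy = datasets.Dataset.from_list(
    [
        {
            "question": "Which river flows through Cairo?",
            "choices": ["Nile", "Danube", "Amazon", "Mississippi"],
            "answer": 0,
        }
    ]
)
processed = utils_process_docs.process_docs(toy)
print(processed[0]["input"])
```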
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We hope that this explainer gives you a sense of what can be done with and how to work with LM-Evaluation-Harnes v0.4.0 ! \n",
"\n",
"For more information, check out our documentation pages in the `docs/` folder, and if you have questions, please raise them in GitHub issues, or in #lm-thunderdome or #release-discussion on the EleutherAI discord server."
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"collapsed_sections": [
"zAov81vTbL2K"
],
"gpuType": "T4",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"46f521b73fd943c081c648fd873ebc0a": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "1.5.0",
"model_name": "DescriptionStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
}, },
{ "48763b6233374554ae76035c0483066f": {
"cell_type": "code", "model_module": "@jupyter-widgets/controls",
"execution_count": 11, "model_module_version": "1.5.0",
"metadata": { "model_name": "ProgressStyleModel",
"id": "duBDqC6PAdjL" "state": {
}, "_model_module": "@jupyter-widgets/controls",
"outputs": [ "_model_module_version": "1.5.0",
{ "_model_name": "ProgressStyleModel",
"data": { "_view_count": null,
"application/javascript": "\n ((filepath) => {{\n if (!google.colab.kernel.accessAllowed) {{\n return;\n }}\n google.colab.files.view(filepath);\n }})(\"/content/output/mmlu_high_school_geography_continuation/pretrained__EleutherAI__pythia-2.8b_demo_mmlu_high_school_geography_continuation.jsonl\")", "_view_module": "@jupyter-widgets/base",
"text/plain": [ "_view_module_version": "1.2.0",
"<IPython.core.display.Javascript object>" "_view_name": "StyleView",
] "bar_color": null,
}, "description_width": ""
"metadata": {}, }
"output_type": "display_data"
}
],
"source": [
"from google.colab import files\n",
"files.view(\"output/mmlu_high_school_geography_continuation/pretrained__EleutherAI__pythia-2.8b_demo_mmlu_high_school_geography_continuation.jsonl\")\n"
]
}, },
{ "4986a21eb560448fa79f4b25cde48951": {
"cell_type": "markdown", "model_module": "@jupyter-widgets/base",
"metadata": { "model_module_version": "1.2.0",
"id": "6p0-KPwAgK5j" "model_name": "LayoutModel",
}, "state": {
"source": [ "_model_module": "@jupyter-widgets/base",
"## Closer Look at YAML Fields\n", "_model_module_version": "1.2.0",
"\n", "_model_name": "LayoutModel",
"To prepare a task we can simply fill in a YAML config with the relevant information.\n", "_view_count": null,
"\n", "_view_module": "@jupyter-widgets/base",
"`output_type`\n", "_view_module_version": "1.2.0",
"The current provided evaluation types comprise of the following:\n", "_view_name": "LayoutView",
"1. `loglikelihood`: Evaluates the loglikelihood of a continuation, conditioned on some input string.\n", "align_content": null,
"2. `loglikelihood_rolling`: evaluate the loglikelihood of producing a string, conditioned on the empty string. (Used for perplexity evaluations)\n", "align_items": null,
"3. `multiple_choice`: Evaluates loglikelihood among the a number of choices predicted by the model.\n", "align_self": null,
"4. `greedy_until`: Model outputs greedy generation (can be configured to to use beam search and other generation-related parameters)\n", "border": null,
"\n", "bottom": null,
"The core prompt revolves around 3 fields.\n", "display": null,
"1. `doc_to_text`: Denotes the prompt template that will be used as input to the model.\n", "flex": null,
"2. `doc_to_choice`: Available choices that will be used as continuation for the model. This is used when the `output_type` is `multiple_choice`, and otherwise can be left as `None`.\n", "flex_flow": null,
"3. `doc_to_target`: When `output_type` is `multiple_choice`, this can be an index that corresponds to the correct answer, or the answer string itself (must be a subset of `doc_to_choice`). For other tasks, this is expected to be a string. You can fill this field with a feature name from the HF dataset so long as the resulting feature follows the conditioned described.\n", "grid_area": null,
"\n", "grid_auto_columns": null,
"These three fields can be expressed as strings, column names from the source dataset, or as Jinja2 templates that can use fields from the source dataset as variables.\n" "grid_auto_flow": null,
] "grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
}, },
{ "6b2d90209ec14230b3d58a74ac9b83bf": {
"cell_type": "markdown", "model_module": "@jupyter-widgets/base",
"metadata": { "model_module_version": "1.2.0",
"id": "6p0-KPwAgK5j" "model_name": "LayoutModel",
}, "state": {
"source": [ "_model_module": "@jupyter-widgets/base",
"## What if Jinja is not Sufficient?\n", "_model_module_version": "1.2.0",
"\n", "_model_name": "LayoutModel",
"There can be times where the Jinja2 templating language is not enough to make the prompt we had in mind. There are a few ways to circumvent this limitation:\n", "_view_count": null,
"\n", "_view_module": "@jupyter-widgets/base",
"1. Use `!function` operator for the prompt-related fields to pass a python function that takes as input the dataset row, and will output the prompt template component.\n", "_view_module_version": "1.2.0",
"2. Perform a transformation on the dataset beforehand." "_view_name": "LayoutView",
] "align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
}, },
{ "7c5689bc13684db8a22681f41863dddd": {
"cell_type": "markdown", "model_module": "@jupyter-widgets/base",
"metadata": {}, "model_module_version": "1.2.0",
"source": [ "model_name": "LayoutModel",
"Below, we show an example of using `!function` to create `doc_to_text` from a python function:" "state": {
] "_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
}, },
{ "a1d3a8aa016544a78e8821c8f6199e06": {
"cell_type": "code", "model_module": "@jupyter-widgets/controls",
"execution_count": 12, "model_module_version": "1.5.0",
"metadata": { "model_name": "HBoxModel",
"colab": { "state": {
"base_uri": "https://localhost:8080/" "_dom_classes": [],
}, "_model_module": "@jupyter-widgets/controls",
"id": "DYZ5c0JhR1lJ", "_model_module_version": "1.5.0",
"outputId": "ca945235-fb9e-4f17-8bfa-78e7d6ec1490" "_model_name": "HBoxModel",
}, "_view_count": null,
"outputs": [ "_view_module": "@jupyter-widgets/controls",
{ "_view_module_version": "1.5.0",
"name": "stdout", "_view_name": "HBoxView",
"output_type": "stream", "box_style": "",
"text": [ "children": [
"2023-11-29:11:59:08,312 INFO [utils.py:160] NumExpr defaulting to 2 threads.\n", "IPY_MODEL_f61ed33fad754146bdd2ac9db1ba1c48",
"2023-11-29 11:59:09.348327: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", "IPY_MODEL_bfa0af6aeff344c6845e1080a878e92e",
"2023-11-29 11:59:09.348387: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", "IPY_MODEL_fd1ad9e0367d4004aae853b91c3a7617"
"2023-11-29 11:59:09.348421: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
"2023-11-29 11:59:10.573752: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
"2023-11-29:11:59:14,044 INFO [__main__.py:132] Verbosity set to INFO\n",
"2023-11-29:11:59:23,654 WARNING [__main__.py:138] --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.\n",
"2023-11-29:11:59:23,654 INFO [__main__.py:143] Including path: ./\n",
"2023-11-29:11:59:23,678 INFO [__main__.py:205] Selected Tasks: ['demo_mmlu_high_school_geography_function_prompt']\n",
"2023-11-29:11:59:23,679 WARNING [evaluator.py:93] generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.\n",
"2023-11-29:11:59:23,708 INFO [huggingface.py:120] Using device 'cuda'\n",
"2023-11-29:11:59:44,516 INFO [task.py:355] Building contexts for task on rank 0...\n",
"2023-11-29:11:59:44,524 INFO [evaluator.py:319] Running loglikelihood requests\n",
"100% 40/40 [00:02<00:00, 15.41it/s]\n",
"fatal: not a git repository (or any of the parent directories): .git\n",
"hf (pretrained=EleutherAI/pythia-2.8b), gen_kwargs: (), limit: 10.0, num_fewshot: None, batch_size: 1\n",
"| Tasks |Version|Filter|n-shot| Metric |Value| |Stderr|\n",
"|-----------------------------------------------|-------|------|-----:|--------|----:|---|-----:|\n",
"|demo_mmlu_high_school_geography_function_prompt|Yaml |none | 0|acc | 0.1|± |0.1000|\n",
"| | |none | 0|acc_norm| 0.2|± |0.1333|\n",
"\n"
]
}
], ],
"source": [ "layout": "IPY_MODEL_6b2d90209ec14230b3d58a74ac9b83bf"
"YAML_mmlu_geo_string = '''\n", }
"include: mmlu_high_school_geography.yaml\n",
"task: demo_mmlu_high_school_geography_function_prompt\n",
"doc_to_text: !function utils.doc_to_text\n",
"doc_to_choice: \"{{choices}}\"\n",
"'''\n",
"with open('demo_mmlu_high_school_geography_function_prompt.yaml', 'w') as f:\n",
" f.write(YAML_mmlu_geo_string)\n",
"\n",
"DOC_TO_TEXT = '''\n",
"def doc_to_text(x):\n",
" question = x[\"question\"].strip()\n",
" choices = x[\"choices\"]\n",
" option_a = choices[0]\n",
" option_b = choices[1]\n",
" option_c = choices[2]\n",
" option_d = choices[3]\n",
" return f\"{question}\\\\nA. {option_a}\\\\nB. {option_b}\\\\nC. {option_c}\\\\nD. {option_d}\\\\nAnswer:\"\n",
"'''\n",
"with open('utils.py', 'w') as f:\n",
" f.write(DOC_TO_TEXT)\n",
"\n",
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --include_path ./ \\\n",
" --tasks demo_mmlu_high_school_geography_function_prompt \\\n",
" --limit 10 \\\n",
" --output output/demo_mmlu_high_school_geography_function_prompt/ \\\n",
" --log_samples\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we'll also show how to do this via preprocessing the dataset as necessary using the `process_docs` config field:\n",
"\n",
"We will write a function that will modify each document in our evaluation dataset's split to add a field that is suitable for us to use in `doc_to_text`."
]
}, },
{ "a73f357065d34d7baf0453ae4a8d75e2": {
"cell_type": "code", "model_module": "@jupyter-widgets/base",
"execution_count": null, "model_module_version": "1.2.0",
"metadata": {}, "model_name": "LayoutModel",
"outputs": [], "state": {
"source": [ "_model_module": "@jupyter-widgets/base",
"YAML_mmlu_geo_string = '''\n", "_model_module_version": "1.2.0",
"include: mmlu_high_school_geography.yaml\n", "_model_name": "LayoutModel",
"task: demo_mmlu_high_school_geography_function_prompt_2\n", "_view_count": null,
"process_docs: !function utils_process_docs.process_docs\n", "_view_module": "@jupyter-widgets/base",
"doc_to_text: \"{{input}}\"\n", "_view_module_version": "1.2.0",
"doc_to_choice: \"{{choices}}\"\n", "_view_name": "LayoutView",
"'''\n", "align_content": null,
"with open('demo_mmlu_high_school_geography_process_docs.yaml', 'w') as f:\n", "align_items": null,
" f.write(YAML_mmlu_geo_string)\n", "align_self": null,
"\n", "border": null,
"DOC_TO_TEXT = '''\n", "bottom": null,
"def process_docs(dataset):\n", "display": null,
" def _process_doc(x):\n", "flex": null,
" question = x[\"question\"].strip()\n", "flex_flow": null,
" choices = x[\"choices\"]\n", "grid_area": null,
" option_a = choices[0]\n", "grid_auto_columns": null,
" option_b = choices[1]\n", "grid_auto_flow": null,
" option_c = choices[2]\n", "grid_auto_rows": null,
" option_d = choices[3]\n", "grid_column": null,
" doc[\"input\"] = f\"{question}\\\\nA. {option_a}\\\\nB. {option_b}\\\\nC. {option_c}\\\\nD. {option_d}\\\\nAnswer:\"\n", "grid_gap": null,
" return out_doc\n", "grid_row": null,
"\n", "grid_template_areas": null,
" return dataset.map(_process_doc)\n", "grid_template_columns": null,
"'''\n", "grid_template_rows": null,
"\n", "height": null,
"with open('utils_process_docs.py', 'w') as f:\n", "justify_content": null,
" f.write(DOC_TO_TEXT)\n", "justify_items": null,
"\n", "left": null,
"!lm_eval \\\n", "margin": null,
" --model hf \\\n", "max_height": null,
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n", "max_width": null,
" --include_path ./ \\\n", "min_height": null,
" --tasks demo_mmlu_high_school_geography_function_prompt_2 \\\n", "min_width": null,
" --limit 10 \\\n", "object_fit": null,
" --output output/demo_mmlu_high_school_geography_function_prompt_2/ \\\n", "object_position": null,
" --log_samples\n" "order": null,
] "overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
}, },
{ "aed3acd2f2d74003b44079c333a0698e": {
"cell_type": "markdown", "model_module": "@jupyter-widgets/controls",
"metadata": {}, "model_module_version": "1.5.0",
"source": [ "model_name": "DescriptionStyleModel",
"We hope that this explainer gives you a sense of what can be done with and how to work with LM-Evaluation-Harnes v0.4.0 ! \n", "state": {
"\n", "_model_module": "@jupyter-widgets/controls",
"For more information, check out our documentation pages in the `docs/` folder, and if you have questions, please raise them in GitHub issues, or in #lm-thunderdome or #release-discussion on the EleutherAI discord server." "_model_module_version": "1.5.0",
] "_model_name": "DescriptionStyleModel",
} "_view_count": null,
], "_view_module": "@jupyter-widgets/base",
"metadata": { "_view_module_version": "1.2.0",
"accelerator": "GPU", "_view_name": "StyleView",
"colab": { "description_width": ""
"collapsed_sections": [ }
"zAov81vTbL2K"
],
"gpuType": "T4",
"provenance": []
}, },
"kernelspec": { "bfa0af6aeff344c6845e1080a878e92e": {
"display_name": "Python 3", "model_module": "@jupyter-widgets/controls",
"name": "python3" "model_module_version": "1.5.0",
"model_name": "FloatProgressModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "FloatProgressModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "ProgressView",
"bar_style": "success",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_7c5689bc13684db8a22681f41863dddd",
"max": 5669,
"min": 0,
"orientation": "horizontal",
"style": "IPY_MODEL_48763b6233374554ae76035c0483066f",
"value": 5669
}
}, },
"language_info": { "f61ed33fad754146bdd2ac9db1ba1c48": {
"name": "python" "model_module": "@jupyter-widgets/controls",
"model_module_version": "1.5.0",
"model_name": "HTMLModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_a73f357065d34d7baf0453ae4a8d75e2",
"placeholder": "​",
"style": "IPY_MODEL_46f521b73fd943c081c648fd873ebc0a",
"value": "Downloading builder script: 100%"
}
}, },
"widgets": { "fd1ad9e0367d4004aae853b91c3a7617": {
"application/vnd.jupyter.widget-state+json": { "model_module": "@jupyter-widgets/controls",
"46f521b73fd943c081c648fd873ebc0a": { "model_module_version": "1.5.0",
"model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel",
"model_module_version": "1.5.0", "state": {
"model_name": "DescriptionStyleModel", "_dom_classes": [],
"state": { "_model_module": "@jupyter-widgets/controls",
"_model_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0",
"_model_module_version": "1.5.0", "_model_name": "HTMLModel",
"_model_name": "DescriptionStyleModel", "_view_count": null,
"_view_count": null, "_view_module": "@jupyter-widgets/controls",
"_view_module": "@jupyter-widgets/base", "_view_module_version": "1.5.0",
"_view_module_version": "1.2.0", "_view_name": "HTMLView",
"_view_name": "StyleView", "description": "",
"description_width": "" "description_tooltip": null,
} "layout": "IPY_MODEL_4986a21eb560448fa79f4b25cde48951",
}, "placeholder": "​",
"48763b6233374554ae76035c0483066f": { "style": "IPY_MODEL_aed3acd2f2d74003b44079c333a0698e",
"model_module": "@jupyter-widgets/controls", "value": " 5.67k/5.67k [00:00&lt;00:00, 205kB/s]"
"model_module_version": "1.5.0", }
"model_name": "ProgressStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "ProgressStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"bar_color": null,
"description_width": ""
}
},
"4986a21eb560448fa79f4b25cde48951": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "1.2.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"6b2d90209ec14230b3d58a74ac9b83bf": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "1.2.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"7c5689bc13684db8a22681f41863dddd": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "1.2.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"a1d3a8aa016544a78e8821c8f6199e06": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "1.5.0",
"model_name": "HBoxModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HBoxModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HBoxView",
"box_style": "",
"children": [
"IPY_MODEL_f61ed33fad754146bdd2ac9db1ba1c48",
"IPY_MODEL_bfa0af6aeff344c6845e1080a878e92e",
"IPY_MODEL_fd1ad9e0367d4004aae853b91c3a7617"
],
"layout": "IPY_MODEL_6b2d90209ec14230b3d58a74ac9b83bf"
}
},
"a73f357065d34d7baf0453ae4a8d75e2": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "1.2.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"aed3acd2f2d74003b44079c333a0698e": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "1.5.0",
"model_name": "DescriptionStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"bfa0af6aeff344c6845e1080a878e92e": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "1.5.0",
"model_name": "FloatProgressModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "FloatProgressModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "ProgressView",
"bar_style": "success",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_7c5689bc13684db8a22681f41863dddd",
"max": 5669,
"min": 0,
"orientation": "horizontal",
"style": "IPY_MODEL_48763b6233374554ae76035c0483066f",
"value": 5669
}
},
"f61ed33fad754146bdd2ac9db1ba1c48": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "1.5.0",
"model_name": "HTMLModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_a73f357065d34d7baf0453ae4a8d75e2",
"placeholder": "​",
"style": "IPY_MODEL_46f521b73fd943c081c648fd873ebc0a",
"value": "Downloading builder script: 100%"
}
},
"fd1ad9e0367d4004aae853b91c3a7617": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "1.5.0",
"model_name": "HTMLModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_4986a21eb560448fa79f4b25cde48951",
"placeholder": "​",
"style": "IPY_MODEL_aed3acd2f2d74003b44079c333a0698e",
"value": " 5.67k/5.67k [00:00&lt;00:00, 205kB/s]"
}
}
}
} }
}, }
"nbformat": 4, }
"nbformat_minor": 0 },
"nbformat": 4,
"nbformat_minor": 0
} }
...@@ -68,6 +68,7 @@ ...@@ -68,6 +68,7 @@
"source": [ "source": [
"import wandb\n", "import wandb\n",
"\n", "\n",
"\n",
"wandb.login()" "wandb.login()"
] ]
}, },
...@@ -130,6 +131,7 @@ ...@@ -130,6 +131,7 @@
"import lm_eval\n", "import lm_eval\n",
"from lm_eval.loggers import WandbLogger\n", "from lm_eval.loggers import WandbLogger\n",
"\n", "\n",
"\n",
"results = lm_eval.simple_evaluate(\n", "results = lm_eval.simple_evaluate(\n",
" model=\"hf\",\n", " model=\"hf\",\n",
" model_args=\"pretrained=microsoft/phi-2,trust_remote_code=True\",\n", " model_args=\"pretrained=microsoft/phi-2,trust_remote_code=True\",\n",
......
...@@ -431,7 +431,12 @@ class TemplateLM(LM): ...@@ -431,7 +431,12 @@ class TemplateLM(LM):
using_default_template = False using_default_template = False
# First, handle the cases when the model has a dict of multiple templates # First, handle the cases when the model has a dict of multiple templates
template = self.tokenizer.chat_template or self.tokenizer.default_chat_template try:
template = (
self.tokenizer.chat_template or self.tokenizer.default_chat_template
)
except AttributeError:
return None
if isinstance(template, dict): if isinstance(template, dict):
using_default_dict = self.tokenizer.chat_template is None using_default_dict = self.tokenizer.chat_template is None
......
...@@ -57,7 +57,6 @@ class TaskConfig(dict): ...@@ -57,7 +57,6 @@ class TaskConfig(dict):
task: Optional[str] = None task: Optional[str] = None
task_alias: Optional[str] = None task_alias: Optional[str] = None
tag: Optional[Union[str, list]] = None tag: Optional[Union[str, list]] = None
group: Optional[Union[str, list]] = None
# HF dataset options. # HF dataset options.
# which dataset to use, # which dataset to use,
# and what splits for what purpose # and what splits for what purpose
...@@ -98,18 +97,6 @@ class TaskConfig(dict): ...@@ -98,18 +97,6 @@ class TaskConfig(dict):
) )
def __post_init__(self) -> None: def __post_init__(self) -> None:
if self.group is not None:
eval_logger.warning(
"A task YAML file was found to contain a `group` key. Groups which provide aggregate scores over several subtasks now require a separate config file--if not aggregating, you may want to use the `tag` config option instead within your config. Setting `group` within a TaskConfig will be deprecated in v0.4.4. Please see https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md for more information."
)
if self.tag is None:
self.tag = self.group
else:
raise ValueError(
"Got both a `group` and `tag` entry within a TaskConfig. Please use one or the other--`group` values will be deprecated in v0.4.4."
)
if self.generation_kwargs is not None: if self.generation_kwargs is not None:
if self.output_type != "generate_until": if self.output_type != "generate_until":
eval_logger.warning( eval_logger.warning(
...@@ -1511,7 +1498,7 @@ class ConfigurableTask(Task): ...@@ -1511,7 +1498,7 @@ class ConfigurableTask(Task):
# we expect multiple_targets to be a list. # we expect multiple_targets to be a list.
elif self.multiple_target: elif self.multiple_target:
gold = list(gold) gold = list(gold)
elif type(gold) != type(result): elif type(gold) is not type(result):
# cast gold to the same type as result # cast gold to the same type as result
gold = type(result)(gold) gold = type(result)(gold)
...@@ -1594,7 +1581,7 @@ class ConfigurableTask(Task): ...@@ -1594,7 +1581,7 @@ class ConfigurableTask(Task):
f"ConfigurableTask(task_name={getattr(self.config, 'task', None)}," f"ConfigurableTask(task_name={getattr(self.config, 'task', None)},"
f"output_type={self.OUTPUT_TYPE}," f"output_type={self.OUTPUT_TYPE},"
f"num_fewshot={getattr(self.config, 'num_fewshot', None)}," f"num_fewshot={getattr(self.config, 'num_fewshot', None)},"
f"num_samples={len(self.eval_docs)})", f"num_samples={len(self.eval_docs)})"
) )
......
...@@ -157,6 +157,9 @@ def simple_evaluate( ...@@ -157,6 +157,9 @@ def simple_evaluate(
seed_message.append(f"Setting torch manual seed to {torch_random_seed}") seed_message.append(f"Setting torch manual seed to {torch_random_seed}")
torch.manual_seed(torch_random_seed) torch.manual_seed(torch_random_seed)
if fewshot_random_seed is not None:
seed_message.append(f"Setting fewshot manual seed to {fewshot_random_seed}")
if seed_message: if seed_message:
eval_logger.info(" | ".join(seed_message)) eval_logger.info(" | ".join(seed_message))
...@@ -276,9 +279,6 @@ def simple_evaluate( ...@@ -276,9 +279,6 @@ def simple_evaluate(
task_obj.set_config(key="num_fewshot", value=0) task_obj.set_config(key="num_fewshot", value=0)
# fewshot_random_seed set for tasks, even with a default num_fewshot (e.g. in the YAML file) # fewshot_random_seed set for tasks, even with a default num_fewshot (e.g. in the YAML file)
task_obj.set_fewshot_seed(seed=fewshot_random_seed) task_obj.set_fewshot_seed(seed=fewshot_random_seed)
eval_logger.info(
f"Setting fewshot random generator seed to {fewshot_random_seed}"
)
adjusted_task_dict[task_name] = task_obj adjusted_task_dict[task_name] = task_obj
...@@ -433,10 +433,14 @@ def evaluate( ...@@ -433,10 +433,14 @@ def evaluate(
) )
# end multimodality validation check # end multimodality validation check
# Cache the limit arg.
limit_arg = limit
limits = []
for task_output in eval_tasks: for task_output in eval_tasks:
task: Task = task_output.task task: Task = task_output.task
limit = get_sample_size(task, limit) limit = get_sample_size(task, limit_arg)
limits.append(limit)
task.build_all_requests( task.build_all_requests(
limit=limit, limit=limit,
rank=lm.rank, rank=lm.rank,
...@@ -506,7 +510,7 @@ def evaluate( ...@@ -506,7 +510,7 @@ def evaluate(
WORLD_SIZE = lm.world_size WORLD_SIZE = lm.world_size
### Postprocess outputs ### ### Postprocess outputs ###
# TODO: del model here, maybe (idea: allow user to specify device of e.g. reward model separately) # TODO: del model here, maybe (idea: allow user to specify device of e.g. reward model separately)
for task_output in eval_tasks: for task_output, limit in zip(eval_tasks, limits):
task = task_output.task task = task_output.task
task.apply_filters() task.apply_filters()
...@@ -655,7 +659,7 @@ def evaluate( ...@@ -655,7 +659,7 @@ def evaluate(
len(task_output.task.eval_docs), len(task_output.task.eval_docs),
), ),
} }
for task_output in eval_tasks for task_output, limit in zip(eval_tasks, limits)
}, },
} }
if log_samples: if log_samples:
......
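For context on the `limit_arg` caching added in the `evaluate()` hunk above: the harness resolves a fractional `limit` into an absolute per-task sample count, so reusing the already-resolved value for the next task would silently change its meaning. A standalone sketch of that resolution logic (the function below is an illustrative stand-in, not the harness's own helper):

```python
import math


def resolve_limit(num_docs: int, limit):
    # Illustrative stand-in: a float < 1.0 is treated as a fraction of the
    # task's documents, anything else as an absolute document count.
    if limit is None:
        return None
    return int(math.ceil(num_docs * limit)) if limit < 1.0 else int(limit)


limit_arg = 0.1                            # user asks for 10% of every task
task_a = resolve_limit(1000, limit_arg)    # 100 docs
task_b = resolve_limit(50, limit_arg)      # 5 docs
stale = resolve_limit(50, task_a)          # 100 docs -- the mistake the cached arg avoids
print(task_a, task_b, stale)
```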
...@@ -73,9 +73,12 @@ class TemplateAPI(TemplateLM): ...@@ -73,9 +73,12 @@ class TemplateAPI(TemplateLM):
seed: int = 1234, seed: int = 1234,
max_length: Optional[int] = 2048, max_length: Optional[int] = 2048,
add_bos_token: bool = False, add_bos_token: bool = False,
custom_prefix_token_id=None, custom_prefix_token_id: int = None,
# send the requests as tokens or strings # send the requests as tokens or strings
tokenized_requests=True, tokenized_requests: bool = True,
trust_remote_code: bool = False,
revision: Optional[str] = "main",
use_fast_tokenizer: bool = True,
**kwargs, **kwargs,
) -> None: ) -> None:
super().__init__() super().__init__()
...@@ -128,7 +131,10 @@ class TemplateAPI(TemplateLM): ...@@ -128,7 +131,10 @@ class TemplateAPI(TemplateLM):
import transformers import transformers
self.tokenizer = transformers.AutoTokenizer.from_pretrained( self.tokenizer = transformers.AutoTokenizer.from_pretrained(
self.tokenizer if self.tokenizer else self.model self.tokenizer if self.tokenizer else self.model,
trust_remote_code=trust_remote_code,
revision=revision,
use_fast=use_fast_tokenizer,
) )
# Not used as the API will handle padding but to mirror the behavior of the HFLM # Not used as the API will handle padding but to mirror the behavior of the HFLM
self.tokenizer = configure_pad_token(self.tokenizer) self.tokenizer = configure_pad_token(self.tokenizer)
...@@ -153,6 +159,9 @@ class TemplateAPI(TemplateLM): ...@@ -153,6 +159,9 @@ class TemplateAPI(TemplateLM):
assert isinstance(tokenizer, str), "tokenizer must be a string" assert isinstance(tokenizer, str), "tokenizer must be a string"
self.tokenizer = transformers.AutoTokenizer.from_pretrained( self.tokenizer = transformers.AutoTokenizer.from_pretrained(
tokenizer, tokenizer,
trust_remote_code=trust_remote_code,
revision=revision,
use_fast=use_fast_tokenizer,
) )
@abc.abstractmethod @abc.abstractmethod
......
...@@ -26,9 +26,9 @@ class DummyLM(LM): ...@@ -26,9 +26,9 @@ class DummyLM(LM):
def generate_until(self, requests, disable_tqdm: bool = False): def generate_until(self, requests, disable_tqdm: bool = False):
res = [] res = []
for ctx, _ in tqdm(requests, disable=disable_tqdm): for request in tqdm(requests, disable=disable_tqdm):
res.append("lol") res.append("lol")
assert ctx.strip() != "" assert request.arguments[0].strip() != ""
return res return res
......
...@@ -13,6 +13,7 @@ from lm_eval.api.registry import register_model ...@@ -13,6 +13,7 @@ from lm_eval.api.registry import register_model
from lm_eval.models.huggingface import HFLM from lm_eval.models.huggingface import HFLM
from lm_eval.models.utils import ( from lm_eval.models.utils import (
Collator, Collator,
flatten_image_list,
pad_and_concat, pad_and_concat,
replace_placeholders, replace_placeholders,
stop_sequences_criteria, stop_sequences_criteria,
...@@ -295,6 +296,11 @@ class HFMultimodalLM(HFLM): ...@@ -295,6 +296,11 @@ class HFMultimodalLM(HFLM):
images = [img[: self.max_images] for img in images] images = [img[: self.max_images] for img in images]
if self.rgb: if self.rgb:
images = [[img.convert("RGB") for img in sublist] for sublist in images] images = [[img.convert("RGB") for img in sublist] for sublist in images]
# certain models like llava expect a single-level image list even for bs>1, multi-image. TODO: port this over to loglikelihoods
if getattr(self.config, "model_type", "") == "llava":
images = flatten_image_list(images)
try: try:
encoding = self.processor( encoding = self.processor(
images=images, images=images,
......
...@@ -55,7 +55,7 @@ class HFLM(TemplateLM): ...@@ -55,7 +55,7 @@ class HFLM(TemplateLM):
def __init__( def __init__(
self, self,
pretrained: Union[str, transformers.PreTrainedModel], pretrained: Union[str, transformers.PreTrainedModel],
backend: Optional[Literal["default", "causal", "seq2seq"]] = "default", backend: Literal["default", "causal", "seq2seq"] = "default",
# override whether the model should be treated as decoder-only (causal) or encoder-decoder (seq2seq) # override whether the model should be treated as decoder-only (causal) or encoder-decoder (seq2seq)
revision: Optional[str] = "main", revision: Optional[str] = "main",
subfolder: Optional[str] = None, subfolder: Optional[str] = None,
...@@ -90,7 +90,6 @@ class HFLM(TemplateLM): ...@@ -90,7 +90,6 @@ class HFLM(TemplateLM):
**kwargs, **kwargs,
) -> None: ) -> None:
super().__init__() super().__init__()
# optionally: take in an already-initialized transformers.PreTrainedModel # optionally: take in an already-initialized transformers.PreTrainedModel
if not isinstance(pretrained, str): if not isinstance(pretrained, str):
eval_logger.warning( eval_logger.warning(
...@@ -164,7 +163,7 @@ class HFLM(TemplateLM): ...@@ -164,7 +163,7 @@ class HFLM(TemplateLM):
trust_remote_code=trust_remote_code, trust_remote_code=trust_remote_code,
) )
# determine which of 'causal' and 'seq2seq' backends to use # determine which of 'causal' and 'seq2seq' backends to use for HF models
self._get_backend( self._get_backend(
config=self.config, backend=backend, trust_remote_code=trust_remote_code config=self.config, backend=backend, trust_remote_code=trust_remote_code
) )
...@@ -287,7 +286,7 @@ class HFLM(TemplateLM): ...@@ -287,7 +286,7 @@ class HFLM(TemplateLM):
def _get_accelerate_args( def _get_accelerate_args(
self, self,
parallelize: bool = None, parallelize: Optional[bool] = None,
device_map: Optional[str] = "auto", device_map: Optional[str] = "auto",
max_memory_per_gpu: Optional[Union[int, str]] = None, max_memory_per_gpu: Optional[Union[int, str]] = None,
max_cpu_memory: Optional[Union[int, str]] = None, max_cpu_memory: Optional[Union[int, str]] = None,
...@@ -441,31 +440,26 @@ class HFLM(TemplateLM): ...@@ -441,31 +440,26 @@ class HFLM(TemplateLM):
def _get_backend( def _get_backend(
self, self,
config: Union[transformers.PretrainedConfig, transformers.AutoConfig], config: Union[transformers.PretrainedConfig, transformers.AutoConfig],
backend: Optional[Literal["default", "causal", "seq2seq"]] = "default", backend: Literal["default", "causal", "seq2seq"] = "default",
trust_remote_code: Optional[bool] = False, trust_remote_code: Optional[bool] = False,
) -> None: ) -> None:
""" """
Helper method during initialization. Helper method during initialization.
Determines the backend ("causal" (decoder-only) or "seq2seq" (encoder-decoder)) Determines the backend ("causal" (decoder-only) or "seq2seq" (encoder-decoder)) model type to be used.
model type to be used.
sets `self.AUTO_MODEL_CLASS` appropriately if not already set. sets `self.AUTO_MODEL_CLASS` appropriately if not already set.
**If not calling HFLM.__init__() or HFLM._get_backend() within a subclass of HFLM,
user must set `self.backend` to be either "causal" or "seq2seq" manually!**
""" """
# escape hatch: if we're using a subclass that shouldn't follow
# the default _get_backend logic,
# then skip over the method.
# TODO: this seems very much undesirable in some cases--our code in HFLM
# references AutoModelForCausalLM at times to check for equality
if self.AUTO_MODEL_CLASS is not None:
return
assert backend in ["default", "causal", "seq2seq"] assert backend in ["default", "causal", "seq2seq"]
if backend != "default": if backend != "default":
# if we've settled on non-default backend, use that manually # if we've settled on non-default backend, use that manually
if backend == "causal": if backend == "causal":
self.AUTO_MODEL_CLASS = transformers.AutoModelForCausalLM self.backend = backend
elif backend == "seq2seq": elif backend == "seq2seq":
self.AUTO_MODEL_CLASS = transformers.AutoModelForSeq2SeqLM self.backend = backend
eval_logger.info( eval_logger.info(
f"Overrode HF model backend type, and using type '{backend}'" f"Overrode HF model backend type, and using type '{backend}'"
) )
...@@ -478,26 +472,32 @@ class HFLM(TemplateLM): ...@@ -478,26 +472,32 @@ class HFLM(TemplateLM):
# first check if model type is listed under seq2seq models, since some # first check if model type is listed under seq2seq models, since some
# models like MBart are listed in both seq2seq and causal mistakenly in HF transformers. # models like MBart are listed in both seq2seq and causal mistakenly in HF transformers.
# these special cases should be treated as seq2seq models. # these special cases should be treated as seq2seq models.
self.AUTO_MODEL_CLASS = transformers.AutoModelForSeq2SeqLM self.backend = "seq2seq"
eval_logger.info(f"Using model type '{backend}'")
elif ( elif (
getattr(self.config, "model_type") in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES getattr(self.config, "model_type") in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES
): ):
self.AUTO_MODEL_CLASS = transformers.AutoModelForCausalLM self.backend = "causal"
eval_logger.info(f"Using model type '{backend}'")
else: else:
if not trust_remote_code: if not trust_remote_code:
eval_logger.warning( eval_logger.warning(
"HF model type is neither marked as CausalLM or Seq2SeqLM. \ "HF model type is neither marked as CausalLM or Seq2SeqLM. \
This is expected if your model requires `trust_remote_code=True` but may be an error otherwise." This is expected if your model requires `trust_remote_code=True` but may be an error otherwise."
"Setting backend to causal"
) )
# if model type is neither in HF transformers causal or seq2seq model registries # if model type is neither in HF transformers causal or seq2seq model registries
# then we default to AutoModelForCausalLM # then we default to assuming AutoModelForCausalLM
self.AUTO_MODEL_CLASS = transformers.AutoModelForCausalLM self.backend = "causal"
eval_logger.info(
f"Model type cannot be determined. Using default model type '{backend}'"
)
assert self.AUTO_MODEL_CLASS in [ if self.AUTO_MODEL_CLASS is None:
transformers.AutoModelForCausalLM, if self.backend == "causal":
transformers.AutoModelForSeq2SeqLM, self.AUTO_MODEL_CLASS = transformers.AutoModelForCausalLM
] elif self.backend == "seq2seq":
return None self.AUTO_MODEL_CLASS = transformers.AutoModelForSeq2SeqLM
def _get_config( def _get_config(
self, self,
...@@ -505,6 +505,7 @@ class HFLM(TemplateLM): ...@@ -505,6 +505,7 @@ class HFLM(TemplateLM):
revision: str = "main", revision: str = "main",
trust_remote_code: bool = False, trust_remote_code: bool = False,
) -> None: ) -> None:
"""Return the model config for HuggingFace models"""
self._config = transformers.AutoConfig.from_pretrained( self._config = transformers.AutoConfig.from_pretrained(
pretrained, pretrained,
revision=revision, revision=revision,
...@@ -703,7 +704,7 @@ class HFLM(TemplateLM): ...@@ -703,7 +704,7 @@ class HFLM(TemplateLM):
# if OOM, then halves batch_size and tries again # if OOM, then halves batch_size and tries again
@find_executable_batch_size(starting_batch_size=self.max_batch_size) @find_executable_batch_size(starting_batch_size=self.max_batch_size)
def forward_batch(batch_size): def forward_batch(batch_size):
if self.AUTO_MODEL_CLASS == transformers.AutoModelForSeq2SeqLM: if self.backend == "seq2seq":
length = max(max_context_enc, max_cont_enc) length = max(max_context_enc, max_cont_enc)
batched_conts = torch.ones( batched_conts = torch.ones(
(batch_size, length), device=self.device (batch_size, length), device=self.device
...@@ -754,7 +755,7 @@ class HFLM(TemplateLM): ...@@ -754,7 +755,7 @@ class HFLM(TemplateLM):
# by default for CausalLM - false or self.add_bos_token is set # by default for CausalLM - false or self.add_bos_token is set
if add_special_tokens is None: if add_special_tokens is None:
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM: if self.backend == "causal":
special_tokens_kwargs = { special_tokens_kwargs = {
"add_special_tokens": False or self.add_bos_token "add_special_tokens": False or self.add_bos_token
} }
...@@ -782,7 +783,7 @@ class HFLM(TemplateLM): ...@@ -782,7 +783,7 @@ class HFLM(TemplateLM):
self.tokenizer.padding_side = padding_side self.tokenizer.padding_side = padding_side
add_special_tokens = {} add_special_tokens = {}
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM: if self.backend == "causal":
add_special_tokens = {"add_special_tokens": False or self.add_bos_token} add_special_tokens = {"add_special_tokens": False or self.add_bos_token}
encoding = self.tokenizer( encoding = self.tokenizer(
...@@ -860,14 +861,14 @@ class HFLM(TemplateLM): ...@@ -860,14 +861,14 @@ class HFLM(TemplateLM):
def _select_cont_toks( def _select_cont_toks(
self, logits: torch.Tensor, contlen: int = None, inplen: int = None self, logits: torch.Tensor, contlen: int = None, inplen: int = None
) -> torch.Tensor: ) -> torch.Tensor:
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM: if self.backend == "causal":
assert ( assert (
contlen and inplen contlen and inplen
), "Must pass input len and cont. len to select scored logits for causal LM" ), "Must pass input len and cont. len to select scored logits for causal LM"
# discard right-padding. # discard right-padding.
# also discard the input/context tokens. we'll only score continuations. # also discard the input/context tokens. we'll only score continuations.
logits = logits[inplen - contlen : inplen] logits = logits[inplen - contlen : inplen]
elif self.AUTO_MODEL_CLASS == transformers.AutoModelForSeq2SeqLM: elif self.backend == "seq2seq":
assert ( assert (
contlen and not inplen contlen and not inplen
), "Selecting scored logits for Seq2SeqLM requires only cont. len" ), "Selecting scored logits for Seq2SeqLM requires only cont. len"
...@@ -990,8 +991,7 @@ class HFLM(TemplateLM): ...@@ -990,8 +991,7 @@ class HFLM(TemplateLM):
requests, requests,
sort_fn=_collate, sort_fn=_collate,
group_by="contexts" group_by="contexts"
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM if self.backend == "causal" and self.logits_cache
and self.logits_cache
else None, else None,
group_fn=_lookup_one_token_cont, group_fn=_lookup_one_token_cont,
) )
...@@ -1048,14 +1048,14 @@ class HFLM(TemplateLM): ...@@ -1048,14 +1048,14 @@ class HFLM(TemplateLM):
# cont_toks 4 5 6 7 8 9 [:, -len(continuation_enc):, :self.vocab_size] slice # cont_toks 4 5 6 7 8 9 [:, -len(continuation_enc):, :self.vocab_size] slice
# when too long to fit in context, truncate from the left # when too long to fit in context, truncate from the left
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM: if self.backend == "causal":
inp = torch.tensor( inp = torch.tensor(
(context_enc + continuation_enc)[-(self.max_length + 1) :][:-1], (context_enc + continuation_enc)[-(self.max_length + 1) :][:-1],
dtype=torch.long, dtype=torch.long,
device=self.device, device=self.device,
) )
(inplen,) = inp.shape (inplen,) = inp.shape
elif self.AUTO_MODEL_CLASS == transformers.AutoModelForSeq2SeqLM: elif self.backend == "seq2seq":
inp = torch.tensor( inp = torch.tensor(
(context_enc)[-self.max_length :], (context_enc)[-self.max_length :],
dtype=torch.long, dtype=torch.long,
...@@ -1095,11 +1095,11 @@ class HFLM(TemplateLM): ...@@ -1095,11 +1095,11 @@ class HFLM(TemplateLM):
# create encoder attn mask and batched conts, if seq2seq # create encoder attn mask and batched conts, if seq2seq
call_kwargs = {} call_kwargs = {}
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM: if self.backend == "causal":
batched_inps = pad_and_concat( batched_inps = pad_and_concat(
padding_len_inp, inps, padding_side="right" padding_len_inp, inps, padding_side="right"
) # [batch, padding_len_inp] ) # [batch, padding_len_inp]
elif self.AUTO_MODEL_CLASS == transformers.AutoModelForSeq2SeqLM: elif self.backend == "seq2seq":
# TODO: left-pad encoder inps and mask? # TODO: left-pad encoder inps and mask?
batched_inps = pad_and_concat( batched_inps = pad_and_concat(
padding_len_inp, inps padding_len_inp, inps
...@@ -1130,7 +1130,7 @@ class HFLM(TemplateLM): ...@@ -1130,7 +1130,7 @@ class HFLM(TemplateLM):
# from prompt/prefix tuning tokens, if applicable # from prompt/prefix tuning tokens, if applicable
ctx_len = ( ctx_len = (
inplen + (logits.shape[0] - padding_len_inp) inplen + (logits.shape[0] - padding_len_inp)
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM if self.backend == "causal"
else None else None
) )
logits = self._select_cont_toks(logits, contlen=contlen, inplen=ctx_len) logits = self._select_cont_toks(logits, contlen=contlen, inplen=ctx_len)
...@@ -1265,10 +1265,10 @@ class HFLM(TemplateLM): ...@@ -1265,10 +1265,10 @@ class HFLM(TemplateLM):
max_gen_toks = self.max_gen_toks max_gen_toks = self.max_gen_toks
# set the max length in tokens of inputs ("context_enc") # set the max length in tokens of inputs ("context_enc")
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM: if self.backend == "causal":
# max len for inputs = max length, minus room to generate the max new tokens # max len for inputs = max length, minus room to generate the max new tokens
max_ctx_len = self.max_length - max_gen_toks max_ctx_len = self.max_length - max_gen_toks
elif self.AUTO_MODEL_CLASS == transformers.AutoModelForSeq2SeqLM: elif self.backend == "seq2seq":
# max len for inputs = encoder's whole max_length # max len for inputs = encoder's whole max_length
max_ctx_len = self.max_length max_ctx_len = self.max_length
...@@ -1295,7 +1295,7 @@ class HFLM(TemplateLM): ...@@ -1295,7 +1295,7 @@ class HFLM(TemplateLM):
cont_toks_list = cont.tolist() cont_toks_list = cont.tolist()
for cont_toks, context in zip(cont_toks_list, contexts): for cont_toks, context in zip(cont_toks_list, contexts):
# discard context + left-padding toks if using causal decoder-only LM # discard context + left-padding toks if using causal decoder-only LM
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM: if self.backend == "causal":
cont_toks = cont_toks[context_enc.shape[1] :] cont_toks = cont_toks[context_enc.shape[1] :]
s = self.tok_decode(cont_toks) s = self.tok_decode(cont_toks)
......
import copy import copy
import json
import logging import logging
import subprocess
from collections import defaultdict from collections import defaultdict
from typing import List, Optional, Union from typing import List, Optional, Union
...@@ -33,54 +31,6 @@ except ImportError: ...@@ -33,54 +31,6 @@ except ImportError:
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
def get_nc_count() -> Union[int, None]:
"""Returns the number of neuron cores on the current instance."""
try:
cmd = "neuron-ls --json-output"
result = subprocess.run(cmd, shell=True, capture_output=True)
print(f"inferring nc_count from `neuron-ls` {result.stdout}")
json_output = json.loads(result.stdout)
count = sum([x["nc_count"] for x in json_output])
print(f"nc_count={count}")
return count
except Exception:
return None
def wrap_constant_batch_size(func):
def _decorator(self, input_ids):
"""input_ids a 2D array with batch_size on dim=0
makes sure the func runs with self.batch_size
"""
# access a from TestSample
batch_size = input_ids.shape[0]
if batch_size < self.batch_size:
# handle the event of input_ids.shape[0] != batch_size
# Neuron cores expect constant batch_size
input_ids = torch.concat(
(
input_ids,
# add missing_batch_size dummy
torch.zeros(
[self.batch_size - batch_size, *input_ids.size()[1:]],
dtype=input_ids.dtype,
device=input_ids.device,
),
),
dim=0,
)
elif batch_size > self.batch_size:
raise ValueError(
f"The specified batch_size ({batch_size}) exceeds the model static batch size ({self.batch_size})"
)
# return the forward pass that requires constant batch size
return func(self, input_ids)[:batch_size]
return _decorator
class CustomNeuronModelForCausalLM(NeuronModelForCausalLM): class CustomNeuronModelForCausalLM(NeuronModelForCausalLM):
"""NeuronModelForCausalLM with `stopping_criteria` in `generate`""" """NeuronModelForCausalLM with `stopping_criteria` in `generate`"""
...@@ -146,7 +96,7 @@ class CustomNeuronModelForCausalLM(NeuronModelForCausalLM): ...@@ -146,7 +96,7 @@ class CustomNeuronModelForCausalLM(NeuronModelForCausalLM):
raise ValueError( raise ValueError(
f"The specified batch_size ({batch_size}) exceeds the model static batch size ({self.batch_size})" f"The specified batch_size ({batch_size}) exceeds the model static batch size ({self.batch_size})"
) )
elif batch_size < self.batch_size: elif batch_size < self.batch_size and not self.continuous_batching:
logger.warning( logger.warning(
"Inputs will be padded to match the model static batch size. This will increase latency." "Inputs will be padded to match the model static batch size. This will increase latency."
) )
...@@ -158,8 +108,6 @@ class CustomNeuronModelForCausalLM(NeuronModelForCausalLM): ...@@ -158,8 +108,6 @@ class CustomNeuronModelForCausalLM(NeuronModelForCausalLM):
if attention_mask is not None: if attention_mask is not None:
padding = torch.zeros(padding_shape, dtype=torch.int64) padding = torch.zeros(padding_shape, dtype=torch.int64)
padded_attention_mask = torch.cat([attention_mask, padding]) padded_attention_mask = torch.cat([attention_mask, padding])
# Drop the current generation context and clear the Key/Value cache
self.reset_generation()
output_ids = self.generate_tokens( output_ids = self.generate_tokens(
padded_input_ids, padded_input_ids,
...@@ -179,8 +127,6 @@ class NEURON_HF(TemplateLM): ...@@ -179,8 +127,6 @@ class NEURON_HF(TemplateLM):
Tested with neuron 2.17.0 Tested with neuron 2.17.0
""" """
_DEFAULT_MAX_LENGTH = 2048
def __init__( def __init__(
self, self,
pretrained: Optional[str] = "TinyLlama/TinyLlama-1.1B-Chat-v1.0", pretrained: Optional[str] = "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
...@@ -203,7 +149,7 @@ class NEURON_HF(TemplateLM): ...@@ -203,7 +149,7 @@ class NEURON_HF(TemplateLM):
"please install neuron via pip install transformers-neuron ", "please install neuron via pip install transformers-neuron ",
"also make sure you are running on an AWS inf2 instance", "also make sure you are running on an AWS inf2 instance",
) )
if version.parse(optimum_neuron_version) != version.parse("0.0.17"): if version.parse(optimum_neuron_version) != version.parse("0.0.24"):
logger.warning( logger.warning(
'`optimum-neuron` model requires `pip install "optimum[neuronx]>=0.0.17" ' '`optimum-neuron` model requires `pip install "optimum[neuronx]>=0.0.17" '
"preferably using the Hugging Face Neuron Deep Learning AMI (Ubuntu 22.04) " "preferably using the Hugging Face Neuron Deep Learning AMI (Ubuntu 22.04) "
...@@ -217,35 +163,16 @@ class NEURON_HF(TemplateLM): ...@@ -217,35 +163,16 @@ class NEURON_HF(TemplateLM):
self.batch_size_per_gpu = int(batch_size) self.batch_size_per_gpu = int(batch_size)
batch_size = int(batch_size) batch_size = int(batch_size)
if tp_degree is None:
# execute `neuron-ls --json-output | jq '.[0].nc_count'``
# to get the number of neuron cores on your instance
tp_degree = get_nc_count()
assert isinstance(tp_degree, int), (
f"model_args must include tp_degree. tp_degree must be set to an integer,"
f" but is tp_degree=`{tp_degree}` with type=`{type(tp_degree)}`."
"Set it to number of neuron cores on your instance."
" For inf2.xlarge and inf2.8xlarge, set it to `2`."
" For inf2.24xlarge, set it to `12`."
" For inf2.48xlarge, set it to `24`."
)
revision = str(revision) # cast to string if not already one
# TODO: update this to be less of a hack once subfolder is fixed in HF
revision = revision + ("/" + subfolder if subfolder is not None else "")
self._config = transformers.AutoConfig.from_pretrained( self._config = transformers.AutoConfig.from_pretrained(
pretrained, pretrained,
revision=revision, revision=revision,
trust_remote_code=trust_remote_code, trust_remote_code=trust_remote_code,
) )
torch_dtype = lm_eval.models.utils.get_dtype(dtype)
assert torch_dtype in [ revision = str(revision) # cast to string if not already one
torch.float16, # TODO: update this to be less of a hack once subfolder is fixed in HF
torch.bfloat16, revision = revision + ("/" + subfolder if subfolder is not None else "")
], "Only float16 and bfloat16 are supported"
self.tokenizer = transformers.AutoTokenizer.from_pretrained( self.tokenizer = transformers.AutoTokenizer.from_pretrained(
pretrained if tokenizer is None else tokenizer, pretrained if tokenizer is None else tokenizer,
...@@ -254,36 +181,58 @@ class NEURON_HF(TemplateLM): ...@@ -254,36 +181,58 @@ class NEURON_HF(TemplateLM):
use_fast=use_fast_tokenizer, use_fast=use_fast_tokenizer,
) )
# Neuron specific code neuron_config = getattr(self._config, "neuron", None)
if torch_dtype == torch.float16: if neuron_config is None:
self.amp_dtype = "f16" # Check export parameters
elif torch_dtype == torch.bfloat16: if tp_degree is not None:
self.amp_dtype = "bf16" assert isinstance(tp_degree, int), (
elif torch_dtype == torch.float32: f"tp_degree must be set to an integer,"
self.amp_dtype = "f32" f" but is tp_degree=`{tp_degree}` with type=`{type(tp_degree)}`."
else: "Set it to a number lower than the number of neuron cores on your instance."
raise NotImplementedError("Only float16 and bfloat16 are implemented.") " For inf2.xlarge and inf2.8xlarge, set it to `2`."
" For inf2.24xlarge, set it <= `12`."
compiler_args = {"num_cores": tp_degree, "auto_cast_type": self.amp_dtype} " For inf2.48xlarge, set it <= `24`."
input_shapes = { )
"batch_size": batch_size, torch_dtype = lm_eval.models.utils.get_dtype(dtype)
"sequence_length": self._DEFAULT_MAX_LENGTH,
} if torch_dtype == torch.float16:
self.amp_dtype = "f16"
elif torch_dtype == torch.bfloat16:
self.amp_dtype = "bf16"
elif torch_dtype == torch.float32:
self.amp_dtype = "f32"
else:
raise NotImplementedError(
"Only float16/bfloat16/float32 are supported."
)
print( print(f"{'='*20} \n exporting model to neuron")
f"{'='*20} \n loading model to neuron with" self.model = CustomNeuronModelForCausalLM.from_pretrained(
f" {compiler_args}, {input_shapes}..." pretrained,
) revision=revision,
self.model = CustomNeuronModelForCausalLM.from_pretrained( trust_remote_code=trust_remote_code,
pretrained, low_cpu_mem_usage=low_cpu_mem_usage,
revision=revision, export=True,
trust_remote_code=trust_remote_code, batch_size=batch_size,
low_cpu_mem_usage=low_cpu_mem_usage, num_cores=tp_degree,
export=True, auto_cast_type=self.amp_dtype,
**compiler_args, sequence_length=max_length,
**input_shapes, )
) neuron_config = self.model.config.neuron
print(f"SUCCESS: neuron model compiled. \n {'='*20}") print(
f"SUCCESS: neuron model exported with config {neuron_config}. \n {'='*20}"
)
else:
print(
f"{'='*20} \n loading neuron model with config" f" {neuron_config}..."
)
self.model = CustomNeuronModelForCausalLM.from_pretrained(
pretrained,
revision=revision,
trust_remote_code=trust_remote_code,
low_cpu_mem_usage=low_cpu_mem_usage,
)
print(f"SUCCESS: neuron model loaded. \n {'='*20}")
self.truncation = truncation self.truncation = truncation
...@@ -291,8 +240,6 @@ class NEURON_HF(TemplateLM): ...@@ -291,8 +240,6 @@ class NEURON_HF(TemplateLM):
self.tokenizer.pad_token_id = self.tokenizer.eos_token_id self.tokenizer.pad_token_id = self.tokenizer.eos_token_id
self.add_bos_token = add_bos_token self.add_bos_token = add_bos_token
self._max_length = max_length
self.batch_schedule = 1 self.batch_schedule = 1
self.batch_sizes = {} self.batch_sizes = {}
...@@ -313,17 +260,7 @@ class NEURON_HF(TemplateLM): ...@@ -313,17 +260,7 @@ class NEURON_HF(TemplateLM):
@property @property
def max_length(self): def max_length(self):
if self._max_length: # if max length manually set, return it return self.model.max_length
return self._max_length
seqlen_config_attrs = ("n_positions", "max_position_embeddings", "n_ctx")
for attr in seqlen_config_attrs:
if hasattr(self.model.config, attr):
return getattr(self.model.config, attr)
if hasattr(self.tokenizer, "model_max_length"):
if self.tokenizer.model_max_length == 1000000000000000019884624838656:
return self._DEFAULT_MAX_LENGTH
return self.tokenizer.model_max_length
return self._DEFAULT_MAX_LENGTH
@property @property
def max_gen_toks(self) -> int: def max_gen_toks(self) -> int:
...@@ -391,34 +328,6 @@ class NEURON_HF(TemplateLM): ...@@ -391,34 +328,6 @@ class NEURON_HF(TemplateLM):
def tok_decode(self, tokens): def tok_decode(self, tokens):
return self.tokenizer.decode(tokens) return self.tokenizer.decode(tokens)
@wrap_constant_batch_size
def _model_call(self, input_ids: torch.Tensor):
"""
get logits for the entire sequence
:param input_ids: torch.Tensor
A torch tensor of shape [batch, sequence_cont]
the size of sequence may vary from call to call
:return
A torch tensor of shape [batch, sequence, vocab] with the
logits returned from the model's decoder-lm head
"""
_, sequence_length = input_ids.shape
with torch.inference_mode():
cache_ids = torch.arange(0, sequence_length, dtype=torch.int32).split(1)
input_ids_split = input_ids.split(1, dim=1)
return torch.concat(
[
self.model.forward(
input_ids=input_id, cache_ids=cache_id, return_dict=False
)[0]
for input_id, cache_id in zip(input_ids_split, cache_ids)
],
dim=1,
)
def _model_generate(self, context, max_length, stop, **generation_kwargs): def _model_generate(self, context, max_length, stop, **generation_kwargs):
# we require users to pass do_sample=True explicitly # we require users to pass do_sample=True explicitly
# for non-greedy gen. This should be reevaluated when considering beam search. # for non-greedy gen. This should be reevaluated when considering beam search.
...@@ -580,15 +489,41 @@ class NEURON_HF(TemplateLM): ...@@ -580,15 +489,41 @@ class NEURON_HF(TemplateLM):
cont_toks_list.append(continuation_enc) cont_toks_list.append(continuation_enc)
inplens.append(inplen) inplens.append(inplen)
# create encoder attn mask and batched conts, if seq2seq # Add dummy inputs up to the model static batch size
call_kwargs = {} if len(inps) < self.batch_size:
inps = inps + [
torch.zeros_like(inps[0]),
] * (self.batch_size - len(inps))
masks = [torch.ones_like(inp) for inp in inps]
batched_inps = lm_eval.models.utils.pad_and_concat( batched_inps = lm_eval.models.utils.pad_and_concat(
padding_len_inp, inps, padding_side="right" padding_len_inp, inps, padding_side="right"
) # [batch, padding_len_inp] ) # [batch, padding_len_inp]
multi_logits = F.log_softmax( batched_masks = lm_eval.models.utils.pad_and_concat(
self._model_call(batched_inps, **call_kwargs), dim=-1 padding_len_inp, masks, padding_side="right"
) # [batch, padding_length (inp or cont), vocab] )
if self.model.model.neuron_config.output_all_logits:
inputs = self.model.prepare_inputs_for_prefill(
batched_inps, batched_masks
)
multi_logits = F.log_softmax(
self.model.forward(**inputs).logits, dim=-1
) # [batch, padding_length (inp or cont), vocab]
else:
# The model will only return the logits for the last input token, so we need
# to iterate over inputs to accumulate logits.
# To speed things up we use the KV cache as we would do when generating.
inputs = self.model.prepare_inputs_for_prefill(
batched_inps[:, :1], batched_masks[:, :1]
)
outputs = [self.model.forward(**inputs).logits]
for i in range(1, padding_len_inp):
inputs = self.model.prepare_inputs_for_decode(
batched_inps[:, : i + 1], batched_masks[:, : i + 1]
)
outputs.append(self.model.forward(**inputs).logits)
multi_logits = F.log_softmax(torch.concat(outputs, dim=1), dim=-1)
for (cache_key, _, _), logits, inplen, cont_toks in zip( for (cache_key, _, _), logits, inplen, cont_toks in zip(
chunk, multi_logits, inplens, cont_toks_list chunk, multi_logits, inplens, cont_toks_list
......
...@@ -69,11 +69,11 @@ class LocalCompletionsAPI(TemplateAPI): ...@@ -69,11 +69,11 @@ class LocalCompletionsAPI(TemplateAPI):
for choice, ctxlen in zip(out["choices"], ctxlens): for choice, ctxlen in zip(out["choices"], ctxlens):
assert ctxlen > 0, "Context length must be greater than 0" assert ctxlen > 0, "Context length must be greater than 0"
logprobs = sum(choice["logprobs"]["token_logprobs"][ctxlen:-1]) logprobs = sum(choice["logprobs"]["token_logprobs"][ctxlen:-1])
tokens = choice["logprobs"]["token_logprobs"][ctxlen:-1] tokens_logprobs = choice["logprobs"]["token_logprobs"][ctxlen:-1]
top_logprobs = choice["logprobs"]["top_logprobs"][ctxlen:-1] top_logprobs = choice["logprobs"]["top_logprobs"][ctxlen:-1]
is_greedy = True is_greedy = True
for tok, top in zip(tokens, top_logprobs): for tok, top in zip(tokens_logprobs, top_logprobs):
if tok != max(top, key=top.get): if tok != max(top.values()):
is_greedy = False is_greedy = False
break break
res.append((logprobs, is_greedy)) res.append((logprobs, is_greedy))
...@@ -190,14 +190,18 @@ class OpenAICompletionsAPI(LocalCompletionsAPI): ...@@ -190,14 +190,18 @@ class OpenAICompletionsAPI(LocalCompletionsAPI):
key = os.environ.get("OPENAI_API_KEY", None) key = os.environ.get("OPENAI_API_KEY", None)
if key is None: if key is None:
raise ValueError( raise ValueError(
"API key not found. Please set the OPENAI_API_KEY environment variable." "API key not found. Please set the `OPENAI_API_KEY` environment variable."
) )
return key return key
def loglikelihood(self, requests, **kwargs): def loglikelihood(self, requests, **kwargs):
assert ( assert (
self.model != "gpt-3.5-turbo" self.model
), "Loglikelihood is not supported for gpt-3.5-turbo" in [
"babbage-002",
"davinci-002",
]
), f"Prompt loglikelihoods are only supported by OpenAI's API for {['babbage-002', 'davinci-002']}."
return super().loglikelihood(requests, **kwargs) return super().loglikelihood(requests, **kwargs)
def chat_template(self, chat_template: Union[bool, str] = False) -> Optional[str]: def chat_template(self, chat_template: Union[bool, str] = False) -> Optional[str]:
...@@ -226,6 +230,11 @@ class OpenAIChatCompletion(LocalChatCompletion): ...@@ -226,6 +230,11 @@ class OpenAIChatCompletion(LocalChatCompletion):
key = os.environ.get("OPENAI_API_KEY", None) key = os.environ.get("OPENAI_API_KEY", None)
if key is None: if key is None:
raise ValueError( raise ValueError(
"API key not found. Please set the OPENAI_API_KEY environment variable." "API key not found. Please set the `OPENAI_API_KEY` environment variable."
) )
return key return key
def loglikelihood(self, requests, **kwargs):
raise NotImplementedError(
"Loglikelihood (and therefore `multiple_choice`-type tasks) is not supported for chat completions as OpenAI does not provide prompt logprobs. See https://github.com/EleutherAI/lm-evaluation-harness/issues/942#issuecomment-1777836312 or https://github.com/EleutherAI/lm-evaluation-harness/issues/1196 for more background on this limitation."
)
...@@ -698,3 +698,14 @@ def replace_placeholders( ...@@ -698,3 +698,14 @@ def replace_placeholders(
# Add the last part of the string # Add the last part of the string
result.append(parts[-1]) result.append(parts[-1])
return "".join(result) return "".join(result)
def flatten_image_list(images: List[List]):
"""
Takes in a list of lists of images, and returns a single list of all images in order.
    Used for some multimodal models, like Llava-1.5, which expect this flattened-list format for their image processors.
:param images: A list of lists of PIL images.
:return: a list of PIL images, via concatenating all the sub-lists in order.
"""
return [image for image_list in images for image in image_list]
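As a quick, standalone illustration of the behaviour described in the docstring above (the string literals stand in for PIL images; this assumes a checkout that includes this change):

```python
from lm_eval.models.utils import flatten_image_list

# Stand-ins for PIL images; flatten_image_list only concatenates the sub-lists in order.
batch = [["img_0a", "img_0b"], ["img_1a"]]
assert flatten_image_list(batch) == ["img_0a", "img_0b", "img_1a"]
```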
...@@ -7,9 +7,9 @@ from tqdm import tqdm ...@@ -7,9 +7,9 @@ from tqdm import tqdm
from lm_eval.api.instance import Instance from lm_eval.api.instance import Instance
from lm_eval.api.registry import register_model from lm_eval.api.registry import register_model
from lm_eval.models.utils import Collator, undistribute from lm_eval.models.utils import Collator, replace_placeholders, undistribute
from lm_eval.models.vllm_causallms import VLLM from lm_eval.models.vllm_causallms import VLLM
from lm_eval.utils import simple_parse_args_string from lm_eval.utils import eval_logger
try: try:
...@@ -36,10 +36,11 @@ class VLLM_VLM(VLLM): ...@@ -36,10 +36,11 @@ class VLLM_VLM(VLLM):
interleave: bool = True, interleave: bool = True,
# TODO<baber>: handle max_images and limit_mm_per_prompt better # TODO<baber>: handle max_images and limit_mm_per_prompt better
max_images: int = 999, max_images: int = 999,
limit_mm_per_prompt: str = "image=1",
**kwargs, **kwargs,
): ):
kwargs["limit_mm_per_prompt"] = simple_parse_args_string(limit_mm_per_prompt) if max_images != 999:
kwargs["limit_mm_per_prompt"] = {"image": max_images}
eval_logger.info(f"Setting limit_mm_per_prompt[image] to {max_images}")
super().__init__( super().__init__(
pretrained=pretrained, pretrained=pretrained,
trust_remote_code=trust_remote_code, trust_remote_code=trust_remote_code,
...@@ -63,6 +64,17 @@ class VLLM_VLM(VLLM): ...@@ -63,6 +64,17 @@ class VLLM_VLM(VLLM):
truncation: bool = False, truncation: bool = False,
): ):
images = [img[: self.max_images] for img in images] images = [img[: self.max_images] for img in images]
# TODO<baber>: is the default placeholder always <image>?
if self.chat_applied is False:
strings = [
replace_placeholders(
string,
DEFAULT_IMAGE_PLACEHOLDER,
DEFAULT_IMAGE_PLACEHOLDER,
self.max_images,
)
for string in strings
]
outputs = [] outputs = []
for x, i in zip(strings, images): for x, i in zip(strings, images):
......
...@@ -18,6 +18,7 @@ ...@@ -18,6 +18,7 @@
| [arithmetic](arithmetic/README.md) | Tasks involving numerical computations and arithmetic reasoning. | English | | [arithmetic](arithmetic/README.md) | Tasks involving numerical computations and arithmetic reasoning. | English |
| [asdiv](asdiv/README.md) | Tasks involving arithmetic and mathematical reasoning challenges. | English | | [asdiv](asdiv/README.md) | Tasks involving arithmetic and mathematical reasoning challenges. | English |
| [babi](babi/README.md) | Tasks designed as question and answering challenges based on simulated stories. | English | | [babi](babi/README.md) | Tasks designed as question and answering challenges based on simulated stories. | English |
| [basque_bench](basque_bench/README.md) | Collection of tasks in Basque encompassing various evaluation areas. | Basque |
| [basqueglue](basqueglue/README.md) | Tasks designed to evaluate language understanding in Basque language. | Basque | | [basqueglue](basqueglue/README.md) | Tasks designed to evaluate language understanding in Basque language. | Basque |
| [bbh](bbh/README.md) | Tasks focused on deep semantic understanding through hypothesization and reasoning. | English, German | | [bbh](bbh/README.md) | Tasks focused on deep semantic understanding through hypothesization and reasoning. | English, German |
| [belebele](belebele/README.md) | Language understanding tasks in a variety of languages and scripts. | Multiple (122 languages) | | [belebele](belebele/README.md) | Language understanding tasks in a variety of languages and scripts. | Multiple (122 languages) |
...@@ -25,6 +26,7 @@ ...@@ -25,6 +26,7 @@
| [bertaqa](bertaqa/README.md) | Local Basque cultural trivia QA tests in English and Basque languages. | English, Basque, Basque (MT) | | [bertaqa](bertaqa/README.md) | Local Basque cultural trivia QA tests in English and Basque languages. | English, Basque, Basque (MT) |
| [bigbench](bigbench/README.md) | Broad tasks from the BIG-bench benchmark designed to push the boundaries of large models. | Multiple | | [bigbench](bigbench/README.md) | Broad tasks from the BIG-bench benchmark designed to push the boundaries of large models. | Multiple |
| [blimp](blimp/README.md) | Tasks testing grammatical phenomena to evaluate language model's linguistic capabilities. | English | | [blimp](blimp/README.md) | Tasks testing grammatical phenomena to evaluate language model's linguistic capabilities. | English |
| [catalan_bench](catalan_bench/README.md) | Collection of tasks in Catalan encompassing various evaluation areas. | Catalan |
| [ceval](ceval/README.md) | Tasks that evaluate language understanding and reasoning in an educational context. | Chinese | | [ceval](ceval/README.md) | Tasks that evaluate language understanding and reasoning in an educational context. | Chinese |
| [cmmlu](cmmlu/README.md) | Multi-subject multiple choice question tasks for comprehensive academic assessment. | Chinese | | [cmmlu](cmmlu/README.md) | Multi-subject multiple choice question tasks for comprehensive academic assessment. | Chinese |
| code_x_glue | Tasks that involve understanding and generating code across multiple programming languages. | Go, Java, JS, PHP, Python, Ruby | | code_x_glue | Tasks that involve understanding and generating code across multiple programming languages. | Go, Java, JS, PHP, Python, Ruby |
...@@ -42,6 +44,7 @@ ...@@ -42,6 +44,7 @@
| [fda](fda/README.md) | Tasks for extracting key-value pairs from FDA documents to test information extraction. | English | | [fda](fda/README.md) | Tasks for extracting key-value pairs from FDA documents to test information extraction. | English |
| [fld](fld/README.md) | Tasks involving free-form and directed dialogue understanding. | English | | [fld](fld/README.md) | Tasks involving free-form and directed dialogue understanding. | English |
| [french_bench](french_bench/README.md) | Set of tasks designed to assess language model performance in French. | French| | [french_bench](french_bench/README.md) | Set of tasks designed to assess language model performance in French. | French|
| [galician_bench](galician_bench/README.md) | Collection of tasks in Galician encompassing various evaluation areas. | Galician |
| [glue](glue/README.md) | General Language Understanding Evaluation benchmark to test broad language abilities. | English | | [glue](glue/README.md) | General Language Understanding Evaluation benchmark to test broad language abilities. | English |
| [gpqa](gpqa/README.md) | Tasks designed for general public question answering and knowledge verification. | English | | [gpqa](gpqa/README.md) | Tasks designed for general public question answering and knowledge verification. | English |
| [gsm8k](gsm8k/README.md) | A benchmark of grade school math problems aimed at evaluating reasoning capabilities. | English | | [gsm8k](gsm8k/README.md) | A benchmark of grade school math problems aimed at evaluating reasoning capabilities. | English |
...@@ -86,6 +89,7 @@ ...@@ -86,6 +89,7 @@
| [pile_10k](pile_10k/README.md) | The first 10K elements of The Pile, useful for debugging models trained on it. | English | | [pile_10k](pile_10k/README.md) | The first 10K elements of The Pile, useful for debugging models trained on it. | English |
| [piqa](piqa/README.md) | Physical Interaction Question Answering tasks to test physical commonsense reasoning. | English | | [piqa](piqa/README.md) | Physical Interaction Question Answering tasks to test physical commonsense reasoning. | English |
| [polemo2](polemo2/README.md) | Sentiment analysis and emotion detection tasks based on Polish language data. | Polish | | [polemo2](polemo2/README.md) | Sentiment analysis and emotion detection tasks based on Polish language data. | Polish |
| [portuguese_bench](portuguese_bench/README.md) | Collection of tasks in European Portuguese encompassing various evaluation areas. | Portuguese |
| [prost](prost/README.md) | Tasks requiring understanding of professional standards and ethics in various domains. | English | | [prost](prost/README.md) | Tasks requiring understanding of professional standards and ethics in various domains. | English |
| [pubmedqa](pubmedqa/README.md) | Question answering tasks based on PubMed research articles for biomedical understanding. | English | | [pubmedqa](pubmedqa/README.md) | Question answering tasks based on PubMed research articles for biomedical understanding. | English |
| [qa4mre](qa4mre/README.md) | Question Answering for Machine Reading Evaluation, assessing comprehension and reasoning. | English | | [qa4mre](qa4mre/README.md) | Question Answering for Machine Reading Evaluation, assessing comprehension and reasoning. | English |
...@@ -95,6 +99,7 @@ ...@@ -95,6 +99,7 @@
| [sciq](sciq/README.md) | Science Question Answering tasks to assess understanding of scientific concepts. | English | | [sciq](sciq/README.md) | Science Question Answering tasks to assess understanding of scientific concepts. | English |
| [scrolls](scrolls/README.md) | Tasks that involve long-form reading comprehension across various domains. | English | | [scrolls](scrolls/README.md) | Tasks that involve long-form reading comprehension across various domains. | English |
| [siqa](siqa/README.md) | Social Interaction Question Answering to evaluate common sense and social reasoning. | English | | [siqa](siqa/README.md) | Social Interaction Question Answering to evaluate common sense and social reasoning. | English |
| [spanish_bench](spanish_bench/README.md) | Collection of tasks in Spanish encompassing various evaluation areas. | Spanish |
| [squad_completion](squad_completion/README.md) | A variant of the SQuAD question answering task designed for zero-shot evaluation of small LMs. | English | | [squad_completion](squad_completion/README.md) | A variant of the SQuAD question answering task designed for zero-shot evaluation of small LMs. | English |
| [squadv2](squadv2/README.md) | Stanford Question Answering Dataset version 2, a reading comprehension benchmark. | English | | [squadv2](squadv2/README.md) | Stanford Question Answering Dataset version 2, a reading comprehension benchmark. | English |
| [storycloze](storycloze/README.md) | Tasks to predict story endings, focusing on narrative logic and coherence. | English | | [storycloze](storycloze/README.md) | Tasks to predict story endings, focusing on narrative logic and coherence. | English |
...@@ -107,6 +112,7 @@ ...@@ -107,6 +112,7 @@
| [translation](translation/README.md) | Tasks focused on evaluating the language translation capabilities of models. | Arabic, English, Spanish, Basque, Hindi, Indonesian, Burmese, Russian, Swahili, Telugu, Chinese | | [translation](translation/README.md) | Tasks focused on evaluating the language translation capabilities of models. | Arabic, English, Spanish, Basque, Hindi, Indonesian, Burmese, Russian, Swahili, Telugu, Chinese |
| [triviaqa](triviaqa/README.md) | A large-scale dataset for trivia question answering to test general knowledge. | English | | [triviaqa](triviaqa/README.md) | A large-scale dataset for trivia question answering to test general knowledge. | English |
| [truthfulqa](truthfulqa/README.md) | A QA task aimed at evaluating the truthfulness and factual accuracy of model responses. | English | | [truthfulqa](truthfulqa/README.md) | A QA task aimed at evaluating the truthfulness and factual accuracy of model responses. | English |
| [turkishmmlu](turkishmmlu/README.md) | A multiple-choice QA test modeled after MMLU, written in Turkish based on Turkish high-school level exams. | Turkish |
| [unitxt](unitxt/README.md) | A number of tasks implemented using the unitxt library for flexible, shareable, and reusable data preparation and evaluation for generative AI. | English | | [unitxt](unitxt/README.md) | A number of tasks implemented using the unitxt library for flexible, shareable, and reusable data preparation and evaluation for generative AI. | English |
| [unscramble](unscramble/README.md) | Tasks involving the rearrangement of scrambled sentences to test syntactic understanding. | English | | [unscramble](unscramble/README.md) | Tasks involving the rearrangement of scrambled sentences to test syntactic understanding. | English |
| [webqs](webqs/README.md) | Web-based question answering tasks designed to evaluate internet search and retrieval. | English | | [webqs](webqs/README.md) | Web-based question answering tasks designed to evaluate internet search and retrieval. | English |
......
...@@ -40,7 +40,11 @@ class TaskManager: ...@@ -40,7 +40,11 @@ class TaskManager:
[x for x in self._all_tasks if self._task_index[x]["type"] == "group"] [x for x in self._all_tasks if self._task_index[x]["type"] == "group"]
) )
self._all_subtasks = sorted( self._all_subtasks = sorted(
[x for x in self._all_tasks if self._task_index[x]["type"] == "task"] [
x
for x in self._all_tasks
if self._task_index[x]["type"] in ["task", "python_task"]
]
) )
self._all_tags = sorted( self._all_tags = sorted(
[x for x in self._all_tasks if self._task_index[x]["type"] == "tag"] [x for x in self._all_tasks if self._task_index[x]["type"] == "tag"]
...@@ -271,7 +275,7 @@ class TaskManager: ...@@ -271,7 +275,7 @@ class TaskManager:
task_object = config["class"]() task_object = config["class"]()
if isinstance(task_object, ConfigurableTask): if isinstance(task_object, ConfigurableTask):
# very scuffed: set task name here. TODO: fixme? # very scuffed: set task name here. TODO: fixme?
task_object.config.task = config["task"] task_object.config.task = task
else: else:
task_object = ConfigurableTask(config=config) task_object = ConfigurableTask(config=config)
...@@ -436,6 +440,30 @@ class TaskManager: ...@@ -436,6 +440,30 @@ class TaskManager:
:return :return
Dictionary of task names as key and task metadata Dictionary of task names as key and task metadata
""" """
def _populate_tags_and_groups(config, task, tasks_and_groups, print_info):
# TODO: remove group in next release
if "tag" in config:
attr_list = config["tag"]
if isinstance(attr_list, str):
attr_list = [attr_list]
for tag in attr_list:
if tag not in tasks_and_groups:
tasks_and_groups[tag] = {
"type": "tag",
"task": [task],
"yaml_path": -1,
}
elif tasks_and_groups[tag]["type"] != "tag":
self.logger.info(
f"The tag '{tag}' is already registered as a group, this tag will not be registered. "
"This may affect tasks you want to call."
)
break
else:
tasks_and_groups[tag]["task"].append(task)
# TODO: remove group in next release # TODO: remove group in next release
print_info = True print_info = True
ignore_dirs = [ ignore_dirs = [
...@@ -451,10 +479,14 @@ class TaskManager: ...@@ -451,10 +479,14 @@ class TaskManager:
config = utils.load_yaml_config(yaml_path, mode="simple") config = utils.load_yaml_config(yaml_path, mode="simple")
if self._config_is_python_task(config): if self._config_is_python_task(config):
# This is a python class config # This is a python class config
tasks_and_groups[config["task"]] = { task = config["task"]
tasks_and_groups[task] = {
"type": "python_task", "type": "python_task",
"yaml_path": yaml_path, "yaml_path": yaml_path,
} }
_populate_tags_and_groups(
config, task, tasks_and_groups, print_info
)
elif self._config_is_group(config): elif self._config_is_group(config):
# This is a group config # This is a group config
tasks_and_groups[config["group"]] = { tasks_and_groups[config["group"]] = {
...@@ -483,41 +515,9 @@ class TaskManager: ...@@ -483,41 +515,9 @@ class TaskManager:
"type": "task", "type": "task",
"yaml_path": yaml_path, "yaml_path": yaml_path,
} }
_populate_tags_and_groups(
# TODO: remove group in next release config, task, tasks_and_groups, print_info
for attr in ["tag", "group"]: )
if attr in config:
if attr == "group" and print_info:
self.logger.info(
"`group` and `group_alias` keys in TaskConfigs are deprecated and will be removed in v0.4.5 of lm_eval. "
"The new `tag` field will be used to allow for a shortcut to a group of tasks one does not wish to aggregate metrics across. "
"`group`s which aggregate across subtasks must be only defined in a separate group config file, "
"which will be the official way to create groups that support cross-task aggregation as in `mmlu`. "
"Please see the v0.4.4 patch notes and our documentation: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#advanced-group-configs "
"for more information."
)
print_info = False
# attr = "tag"
attr_list = config[attr]
if isinstance(attr_list, str):
attr_list = [attr_list]
for tag in attr_list:
if tag not in tasks_and_groups:
tasks_and_groups[tag] = {
"type": "tag",
"task": [task],
"yaml_path": -1,
}
elif tasks_and_groups[tag]["type"] != "tag":
self.logger.info(
f"The tag {tag} is already registered as a group, this tag will not be registered. "
"This may affect tasks you want to call."
)
break
else:
tasks_and_groups[tag]["task"].append(task)
else: else:
self.logger.debug(f"File {f} in {root} could not be loaded") self.logger.debug(f"File {f} in {root} could not be loaded")
......
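To make the refactor above easier to follow, here is a self-contained sketch of the behaviour that the new `_populate_tags_and_groups` helper centralises: a task config's `tag` field (a string or a list of strings) is folded into the shared task index, and any tag name that already belongs to a group is skipped with a warning. Only the index layout (`type`, `task`, `yaml_path`) mirrors the diff; the function and variable names below are simplified for illustration and are not the actual implementation.

```python
import logging

logger = logging.getLogger(__name__)


def populate_tags(config: dict, task: str, index: dict) -> None:
    """Simplified stand-in for the `_populate_tags_and_groups` helper in the diff."""
    tags = config.get("tag", [])
    if isinstance(tags, str):
        tags = [tags]
    for tag in tags:
        if tag not in index:
            # First occurrence of this tag: create a tag entry pointing at the task.
            index[tag] = {"type": "tag", "task": [task], "yaml_path": -1}
        elif index[tag]["type"] != "tag":
            # The name is already taken by a group: warn and stop, as in the diff.
            logger.info("The tag '%s' is already registered as a group; skipping it.", tag)
            break
        else:
            # Known tag: append this task to its member list.
            index[tag]["task"].append(task)


if __name__ == "__main__":
    index = {"basque_bench": {"type": "group", "task": [], "yaml_path": "basque_bench.yaml"}}
    populate_tags({"task": "wnli_eu", "tag": ["nli_tag", "basque_bench"]}, "wnli_eu", index)
    # 'nli_tag' is registered as a tag; 'basque_bench' is left alone because it is a group.
    print(index)
```

The same helper is now invoked for both regular task configs and `python_task` configs, which is why python tasks also show up in `_all_subtasks` in the first hunk of the diff.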
# BasqueBench
### Paper
BasqueBench is a benchmark for evaluating language models on Basque tasks. That is, it evaluates the ability of a language model to understand and generate Basque text. BasqueBench offers a combination of pre-existing open datasets and datasets developed exclusively for this benchmark. All the details of BasqueBench will be published in a paper soon.
The new evaluation datasets included in BasqueBench are:
| Task | Category | Homepage |
|:-------------:|:-----:|:-----:|
| MGSM_eu | Math | https://huggingface.co/datasets/HiTZ/MGSM-eu |
| WNLI_eu | Natural Language Inference | https://huggingface.co/datasets/HiTZ/wnli-eu |
| XCOPA_eu | Commonsense Reasoning | https://huggingface.co/datasets/HiTZ/XCOPA-eu |
The datasets included in BasqueBench that have been made public in previous publications are:
| Task | Category | Paper title | Homepage |
|:-------------:|:-----:|:-------------:|:-----:|
| Belebele_eu | Reading Comprehension | [The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants](https://arxiv.org/abs/2308.16884) | https://huggingface.co/datasets/facebook/belebele |
| EusExams | Question Answering | [Latxa: An Open Language Model and Evaluation Suite for Basque](https://arxiv.org/abs/2403.20266) | https://huggingface.co/datasets/HiTZ/EusExams |
| EusProficiency | Question Answering | [Latxa: An Open Language Model and Evaluation Suite for Basque](https://arxiv.org/abs/2403.20266) | https://huggingface.co/datasets/HiTZ/EusProficiency |
| EusReading | Reading Comprehension | [Latxa: An Open Language Model and Evaluation Suite for Basque](https://arxiv.org/abs/2403.20266) | https://huggingface.co/datasets/HiTZ/EusReading |
| EusTrivia | Question Answering | [Latxa: An Open Language Model and Evaluation Suite for Basque](https://arxiv.org/abs/2403.20266) | https://huggingface.co/datasets/HiTZ/EusTrivia |
| FLORES_eu | Translation | [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) | https://huggingface.co/datasets/facebook/flores |
| QNLIeu | Natural Language Inference | [BasqueGLUE: A Natural Language Understanding Benchmark for Basque](https://aclanthology.org/2022.lrec-1.172/) | https://huggingface.co/datasets/orai-nlp/basqueGLUE |
| XNLIeu | Natural Language Inference | [XNLIeu: a dataset for cross-lingual NLI in Basque](https://arxiv.org/abs/2404.06996) | https://huggingface.co/datasets/HiTZ/xnli-eu |
| XStoryCloze_eu | Commonsense Reasoning | [Few-shot Learning with Multilingual Generative Language Models](https://aclanthology.org/2022.emnlp-main.616/) | https://huggingface.co/datasets/juletxara/xstory_cloze |
### Citation
Paper for BasqueBench coming soon.
### Groups and Tasks
#### Groups
- `basque_bench`: All tasks included in BasqueBench.
- `flores_eu`: All FLORES translation tasks from or to Basque.
#### Tasks
The following tasks evaluate datasets included in BasqueBench using various scoring methods.
- `belebele_eus_Latn`
- `eus_exams_eu`
- `eus_proficiency`
- `eus_reading`
- `eus_trivia`
- `flores_eu`
- `flores_eu-ca`
- `flores_eu-de`
- `flores_eu-en`
- `flores_eu-es`
- `flores_eu-fr`
- `flores_eu-gl`
- `flores_eu-it`
- `flores_eu-pt`
- `flores_ca-eu`
- `flores_de-eu`
- `flores_en-eu`
- `flores_es-eu`
- `flores_fr-eu`
- `flores_gl-eu`
- `flores_it-eu`
- `flores_pt-eu`
- `mgsm_direct_eu`
- `mgsm_native_cot_eu`
- `qnlieu`
- `wnli_eu`
- `xcopa_eu`
- `xnli_eu`
- `xnli_eu_native`
- `xstorycloze_eu`
Some of these tasks are taken from benchmarks already available in LM Evaluation Harness. These are:
- `belebele_eus_Latn`: Belebele Basque
- `qnlieu`: From BasqueGLUE
### Checklist
* [x] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation?
* [ ] Yes, original implementation contributed by author of the benchmark
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
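For illustration only (this is not part of the original README), the whole group can be run through the harness's Python entry point roughly as sketched below. The checkpoint name is a placeholder, and `lm_eval.simple_evaluate` is assumed to expose its usual 0.4.x signature (`model`, `model_args`, `tasks`, `batch_size`).

```python
import lm_eval

# Placeholder checkpoint: substitute any Hugging Face causal LM you want to evaluate.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-org/your-basque-model",
    tasks=["basque_bench"],
    batch_size=8,
)

# Per-task metrics are keyed by task name under "results".
for task_name, metrics in results["results"].items():
    print(task_name, metrics)
```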
group: basque_bench
task:
- belebele_eus_Latn
- xstorycloze_eu
- flores_eu
- eus_reading
- eus_proficiency
- eus_trivia
- eus_exams_eu
- qnlieu
- xnli_eu
- xnli_eu_native
- wnli_eu
- xcopa_eu
- mgsm_direct_eu
- mgsm_native_cot_eu
metadata:
  version: 1.0
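As a final sketch, the group file above can be inspected with the same `load_yaml_config` helper that appears in the TaskManager diff; the path below is an assumed location for the config and may differ in the actual repository.

```python
from lm_eval import utils

# Assumed path to the group config shown above; adjust to wherever it lives in your checkout.
config = utils.load_yaml_config("lm_eval/tasks/basque_bench/basque_bench.yaml", mode="simple")

print(config["group"])       # "basque_bench"
print(len(config["task"]))   # 14 member tasks, matching the list above
print(config["metadata"])    # {"version": 1.0}
```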