Commit 25869601 authored by Baber

Merge branch 'main' into mathvista

# Conflicts:
#	lm_eval/models/hf_vlms.py
parents 56f40c53 c1d8795d
......@@ -8,6 +8,7 @@ build
dist
*.egg-info
venv
.venv/
.vscode/
temp
__pycache__
......
......@@ -2,7 +2,7 @@
exclude: ^tests/testdata/
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.5.0
rev: v4.6.0
hooks:
- id: check-added-large-files
- id: check-ast
......@@ -29,7 +29,7 @@ repos:
- id: mixed-line-ending
args: [--fix=lf]
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.4.8
rev: v0.6.8
hooks:
# Run the linter.
- id: ruff
......
......@@ -54,7 +54,7 @@ The Language Model Evaluation Harness is the backend for 🤗 Hugging Face's pop
To install the `lm-eval` package from the github repository, run:
```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
```
......
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "Qw83KAePAhaS"
},
"source": [
"# Releasing LM-Evaluation-Harness v0.4.0"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Z7k2vq1iAdqr"
},
"source": [
"With the vast amount of work done in the field today, it helps to have a tool that people can use easily to share their results and use to check others to ensure reported numbers are valid. The LM Evaluation Harness is one such tool the community has used extensively. We want to continue to support the community and with that in mind, we’re excited to announce a major update on the LM Evaluation Harness to further our goal for open and accessible AI research."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0gDoM0AJAvEc"
},
"source": [
"Our refactor stems from our desires to make the following believed best practices easier to carry out. \n",
"\n",
"1. Never copy results from other papers\n",
"2. Always share your exact prompts\n",
"3. Always provide model outputs\n",
"4. Qualitatively review a small batch of outputs before running evaluation jobs at scale\n",
"\n",
"We also wanted to make the library a better experience to use and to contribute or design evaluations within. New features in the new release that serve this purpose include:\n",
"\n",
"1. Faster Evaluation Runtimes (accelerated data-parallel inference with HF Transformers + Accelerate, and commonly used or faster inference libraries such as vLLM and Llama-CPP)\n",
"2. Easier addition and sharing of new tasks (YAML-based task config formats, allowing single-file sharing of custom tasks)\n",
"3. More configurability, for more advanced workflows and easier operation with modifying prompts\n",
"4. Better logging of data at runtime and post-hoc"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nnwsOpjda_YW"
},
"source": [
"In this notebook we will be going through a short tutorial on how things work."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zAov81vTbL2K"
},
"source": [
"## Install LM-Eval"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "8hiosGzq_qZg",
"outputId": "6ab73e5e-1f54-417e-a388-07e0d870b132"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting git+https://github.com/EleutherAI/lm-evaluation-harness.git@big-refactor\n",
" Cloning https://github.com/EleutherAI/lm-evaluation-harness.git (to revision big-refactor) to /tmp/pip-req-build-tnssql5s\n",
" Running command git clone --filter=blob:none --quiet https://github.com/EleutherAI/lm-evaluation-harness.git /tmp/pip-req-build-tnssql5s\n",
" Running command git checkout -b big-refactor --track origin/big-refactor\n",
" Switched to a new branch 'big-refactor'\n",
" Branch 'big-refactor' set up to track remote branch 'big-refactor' from 'origin'.\n",
" Resolved https://github.com/EleutherAI/lm-evaluation-harness.git to commit 42f486ee49b65926a444cb0620870a39a5b4b0a8\n",
" Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n",
" Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n",
" Preparing metadata (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
"Collecting accelerate>=0.21.0 (from lm-eval==1.0.0)\n",
" Downloading accelerate-0.24.1-py3-none-any.whl (261 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m261.4/261.4 kB\u001b[0m \u001b[31m4.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting evaluate (from lm-eval==1.0.0)\n",
" Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m84.1/84.1 kB\u001b[0m \u001b[31m5.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting datasets>=2.0.0 (from lm-eval==1.0.0)\n",
" Downloading datasets-2.15.0-py3-none-any.whl (521 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m521.2/521.2 kB\u001b[0m \u001b[31m9.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting jsonlines (from lm-eval==1.0.0)\n",
" Downloading jsonlines-4.0.0-py3-none-any.whl (8.7 kB)\n",
"Requirement already satisfied: numexpr in /usr/local/lib/python3.10/dist-packages (from lm-eval==1.0.0) (2.8.7)\n",
"Collecting peft>=0.2.0 (from lm-eval==1.0.0)\n",
" Downloading peft-0.6.2-py3-none-any.whl (174 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m174.7/174.7 kB\u001b[0m \u001b[31m7.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting pybind11>=2.6.2 (from lm-eval==1.0.0)\n",
" Downloading pybind11-2.11.1-py3-none-any.whl (227 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m227.7/227.7 kB\u001b[0m \u001b[31m12.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting pytablewriter (from lm-eval==1.0.0)\n",
" Downloading pytablewriter-1.2.0-py3-none-any.whl (111 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m111.1/111.1 kB\u001b[0m \u001b[31m8.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting rouge-score>=0.0.4 (from lm-eval==1.0.0)\n",
" Downloading rouge_score-0.1.2.tar.gz (17 kB)\n",
" Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
"Collecting sacrebleu>=1.5.0 (from lm-eval==1.0.0)\n",
" Downloading sacrebleu-2.3.2-py3-none-any.whl (119 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m119.7/119.7 kB\u001b[0m \u001b[31m8.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hRequirement already satisfied: scikit-learn>=0.24.1 in /usr/local/lib/python3.10/dist-packages (from lm-eval==1.0.0) (1.2.2)\n",
"Collecting sqlitedict (from lm-eval==1.0.0)\n",
" Downloading sqlitedict-2.1.0.tar.gz (21 kB)\n",
" Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
"Requirement already satisfied: torch>=1.8 in /usr/local/lib/python3.10/dist-packages (from lm-eval==1.0.0) (2.1.0+cu118)\n",
"Collecting tqdm-multiprocess (from lm-eval==1.0.0)\n",
" Downloading tqdm_multiprocess-0.0.11-py3-none-any.whl (9.8 kB)\n",
"Requirement already satisfied: transformers>=4.1 in /usr/local/lib/python3.10/dist-packages (from lm-eval==1.0.0) (4.35.2)\n",
"Collecting zstandard (from lm-eval==1.0.0)\n",
" Downloading zstandard-0.22.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.4 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.4/5.4 MB\u001b[0m \u001b[31m29.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hRequirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (1.23.5)\n",
"Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (23.2)\n",
"Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (5.9.5)\n",
"Requirement already satisfied: pyyaml in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (6.0.1)\n",
"Requirement already satisfied: huggingface-hub in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (0.19.4)\n",
"Requirement already satisfied: pyarrow>=8.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (9.0.0)\n",
"Collecting pyarrow-hotfix (from datasets>=2.0.0->lm-eval==1.0.0)\n",
" Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)\n",
"Collecting dill<0.3.8,>=0.3.0 (from datasets>=2.0.0->lm-eval==1.0.0)\n",
" Downloading dill-0.3.7-py3-none-any.whl (115 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m115.3/115.3 kB\u001b[0m \u001b[31m14.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hRequirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (1.5.3)\n",
"Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (2.31.0)\n",
"Requirement already satisfied: tqdm>=4.62.1 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (4.66.1)\n",
"Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (3.4.1)\n",
"Collecting multiprocess (from datasets>=2.0.0->lm-eval==1.0.0)\n",
" Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m134.8/134.8 kB\u001b[0m \u001b[31m19.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hRequirement already satisfied: fsspec[http]<=2023.10.0,>=2023.1.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (2023.6.0)\n",
"Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (3.8.6)\n",
"Collecting responses<0.19 (from evaluate->lm-eval==1.0.0)\n",
" Downloading responses-0.18.0-py3-none-any.whl (38 kB)\n",
"Requirement already satisfied: safetensors in /usr/local/lib/python3.10/dist-packages (from peft>=0.2.0->lm-eval==1.0.0) (0.4.0)\n",
"Requirement already satisfied: absl-py in /usr/local/lib/python3.10/dist-packages (from rouge-score>=0.0.4->lm-eval==1.0.0) (1.4.0)\n",
"Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (from rouge-score>=0.0.4->lm-eval==1.0.0) (3.8.1)\n",
"Requirement already satisfied: six>=1.14.0 in /usr/local/lib/python3.10/dist-packages (from rouge-score>=0.0.4->lm-eval==1.0.0) (1.16.0)\n",
"Collecting portalocker (from sacrebleu>=1.5.0->lm-eval==1.0.0)\n",
" Downloading portalocker-2.8.2-py3-none-any.whl (17 kB)\n",
"Requirement already satisfied: regex in /usr/local/lib/python3.10/dist-packages (from sacrebleu>=1.5.0->lm-eval==1.0.0) (2023.6.3)\n",
"Requirement already satisfied: tabulate>=0.8.9 in /usr/local/lib/python3.10/dist-packages (from sacrebleu>=1.5.0->lm-eval==1.0.0) (0.9.0)\n",
"Collecting colorama (from sacrebleu>=1.5.0->lm-eval==1.0.0)\n",
" Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)\n",
"Requirement already satisfied: lxml in /usr/local/lib/python3.10/dist-packages (from sacrebleu>=1.5.0->lm-eval==1.0.0) (4.9.3)\n",
"Requirement already satisfied: scipy>=1.3.2 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.24.1->lm-eval==1.0.0) (1.11.3)\n",
"Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.24.1->lm-eval==1.0.0) (1.3.2)\n",
"Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.24.1->lm-eval==1.0.0) (3.2.0)\n",
"Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (3.13.1)\n",
"Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (4.5.0)\n",
"Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (1.12)\n",
"Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (3.2.1)\n",
"Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (3.1.2)\n",
"Requirement already satisfied: triton==2.1.0 in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (2.1.0)\n",
"Requirement already satisfied: tokenizers<0.19,>=0.14 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.1->lm-eval==1.0.0) (0.15.0)\n",
"Requirement already satisfied: attrs>=19.2.0 in /usr/local/lib/python3.10/dist-packages (from jsonlines->lm-eval==1.0.0) (23.1.0)\n",
"Requirement already satisfied: setuptools>=38.3.0 in /usr/local/lib/python3.10/dist-packages (from pytablewriter->lm-eval==1.0.0) (67.7.2)\n",
"Collecting DataProperty<2,>=1.0.1 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading DataProperty-1.0.1-py3-none-any.whl (27 kB)\n",
"Collecting mbstrdecoder<2,>=1.0.0 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading mbstrdecoder-1.1.3-py3-none-any.whl (7.8 kB)\n",
"Collecting pathvalidate<4,>=2.3.0 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading pathvalidate-3.2.0-py3-none-any.whl (23 kB)\n",
"Collecting tabledata<2,>=1.3.1 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading tabledata-1.3.3-py3-none-any.whl (11 kB)\n",
"Collecting tcolorpy<1,>=0.0.5 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading tcolorpy-0.1.4-py3-none-any.whl (7.9 kB)\n",
"Collecting typepy[datetime]<2,>=1.3.2 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading typepy-1.3.2-py3-none-any.whl (31 kB)\n",
"Requirement already satisfied: charset-normalizer<4.0,>=2.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (3.3.2)\n",
"Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (6.0.4)\n",
"Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (4.0.3)\n",
"Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (1.9.2)\n",
"Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (1.4.0)\n",
"Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (1.3.1)\n",
"Requirement already satisfied: chardet<6,>=3.0.4 in /usr/local/lib/python3.10/dist-packages (from mbstrdecoder<2,>=1.0.0->pytablewriter->lm-eval==1.0.0) (5.2.0)\n",
"Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.0.0->lm-eval==1.0.0) (3.4)\n",
"Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.0.0->lm-eval==1.0.0) (2.0.7)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.0.0->lm-eval==1.0.0) (2023.7.22)\n",
"Requirement already satisfied: python-dateutil<3.0.0,>=2.8.0 in /usr/local/lib/python3.10/dist-packages (from typepy[datetime]<2,>=1.3.2->pytablewriter->lm-eval==1.0.0) (2.8.2)\n",
"Requirement already satisfied: pytz>=2018.9 in /usr/local/lib/python3.10/dist-packages (from typepy[datetime]<2,>=1.3.2->pytablewriter->lm-eval==1.0.0) (2023.3.post1)\n",
"Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=1.8->lm-eval==1.0.0) (2.1.3)\n",
"Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk->rouge-score>=0.0.4->lm-eval==1.0.0) (8.1.7)\n",
"Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=1.8->lm-eval==1.0.0) (1.3.0)\n",
"Building wheels for collected packages: lm-eval, rouge-score, sqlitedict\n",
" Building wheel for lm-eval (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for lm-eval: filename=lm_eval-1.0.0-py3-none-any.whl size=994254 sha256=88356155b19f2891981ecef948326ad6ce8ca40a6009378410ec20d0e225995a\n",
" Stored in directory: /tmp/pip-ephem-wheel-cache-9v6ye7h3/wheels/17/01/26/599c0779e9858a70a73fa8a306699b5b9a868f820c225457b0\n",
" Building wheel for rouge-score (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24933 sha256=6bb0d44e4881972c43ce194e7cb65233d309758cb15f0dec54590d3d2efcfc36\n",
" Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4\n",
" Building wheel for sqlitedict (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for sqlitedict: filename=sqlitedict-2.1.0-py3-none-any.whl size=16863 sha256=5747f7dd73ddf3d8fbcebf51b5e4f718fabe1e94bccdf16d2f22a2e65ee7fdf4\n",
" Stored in directory: /root/.cache/pip/wheels/79/d6/e7/304e0e6cb2221022c26d8161f7c23cd4f259a9e41e8bbcfabd\n",
"Successfully built lm-eval rouge-score sqlitedict\n",
"Installing collected packages: sqlitedict, zstandard, tcolorpy, pybind11, pyarrow-hotfix, portalocker, pathvalidate, mbstrdecoder, jsonlines, dill, colorama, typepy, tqdm-multiprocess, sacrebleu, rouge-score, responses, multiprocess, accelerate, datasets, DataProperty, tabledata, peft, evaluate, pytablewriter, lm-eval\n",
"Successfully installed DataProperty-1.0.1 accelerate-0.24.1 colorama-0.4.6 datasets-2.15.0 dill-0.3.7 evaluate-0.4.1 jsonlines-4.0.0 lm-eval-1.0.0 mbstrdecoder-1.1.3 multiprocess-0.70.15 pathvalidate-3.2.0 peft-0.6.2 portalocker-2.8.2 pyarrow-hotfix-0.6 pybind11-2.11.1 pytablewriter-1.2.0 responses-0.18.0 rouge-score-0.1.2 sacrebleu-2.3.2 sqlitedict-2.1.0 tabledata-1.3.3 tcolorpy-0.1.4 tqdm-multiprocess-0.0.11 typepy-1.3.2 zstandard-0.22.0\n"
]
}
],
"source": [
"# Install LM-Eval\n",
"!pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 0,
"referenced_widgets": [
"a1d3a8aa016544a78e8821c8f6199e06",
"f61ed33fad754146bdd2ac9db1ba1c48",
"bfa0af6aeff344c6845e1080a878e92e",
"fd1ad9e0367d4004aae853b91c3a7617",
"6b2d90209ec14230b3d58a74ac9b83bf",
"a73f357065d34d7baf0453ae4a8d75e2",
"46f521b73fd943c081c648fd873ebc0a",
"7c5689bc13684db8a22681f41863dddd",
"48763b6233374554ae76035c0483066f",
"4986a21eb560448fa79f4b25cde48951",
"aed3acd2f2d74003b44079c333a0698e"
]
},
"id": "uyO5MaKkZyah",
"outputId": "d46e8096-5086-4e49-967e-ea33d4a2a335"
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "a1d3a8aa016544a78e8821c8f6199e06",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading builder script: 0%| | 0.00/5.67k [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from lm_eval import api"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8rfUeX6n_wkK"
},
"source": [
"## Create new evaluation tasks with config-based tasks\n",
"\n",
"Even within the same task, many works have reported numbers based on different choices of evaluation. Some report on the test sets, validation sets, or even subset of the training sets. Others have specialized prompts and verbalizers. We introduce YAMLs to allow users to easily make different variations. By leveraging the YAML configs to configure evaluations, the refactored LM-Eval takes the methods of the `Task` object and makes them configurable by setting the appropriate attributes in the config file. There, users can set the tasks they want by setting the name of the HF dataset (local tasks are also possible), the dataset splits used, and much more. Key configurations relating to prompting, such as `doc_to_text`, previously implemented as a method of the same name, are now configurable with jinja2 to allow high-level scripting to transform a HF dataset to text string as input to the model.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HYFUhhfOSJKe"
},
"source": [
"A core-feature to LM-Eval is to configure tasks with YAML configs. With configs, you can fill preset fields to easily set up a task.\n",
"\n",
"Here, we write a demo YAML config for a multiple-choice evaluation of BoolQ:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"id": "bg3dGROW-V39"
},
"outputs": [],
"source": [
"YAML_boolq_string = '''\n",
"task: demo_boolq\n",
"dataset_path: super_glue\n",
"dataset_name: boolq\n",
"output_type: multiple_choice\n",
"training_split: train\n",
"validation_split: validation\n",
"doc_to_text: \"{{passage}}\\nQuestion: {{question}}?\\nAnswer:\"\n",
"doc_to_target: label\n",
"doc_to_choice: [\"no\", \"yes\"]\n",
"should_decontaminate: true\n",
"doc_to_decontamination_query: passage\n",
"metric_list:\n",
" - metric: acc\n",
"'''\n",
"with open('boolq.yaml', 'w') as f:\n",
" f.write(YAML_boolq_string)"
]
},
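{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, the cell below is a minimal sketch (assuming the `datasets` and `jinja2` packages, both lm-eval dependencies, are importable): it renders the same `doc_to_text` Jinja2 template against a single BoolQ document, previewing the exact prompt string the model will see. This is a handy way to qualitatively review prompts before running evaluations at scale."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: preview the prompt that the YAML's doc_to_text template produces.\n",
"# Assumes `datasets` and `jinja2` are importable (both ship as lm-eval dependencies).\n",
"from datasets import load_dataset\n",
"from jinja2 import Template\n",
"\n",
"doc = load_dataset(\"super_glue\", \"boolq\", split=\"validation\")[0]\n",
"doc_to_text = Template(\"{{passage}}\\nQuestion: {{question}}?\\nAnswer:\")\n",
"print(doc_to_text.render(**doc))"
]
},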
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And we can now run evaluation on this task, by pointing to the config file we've just created:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"id": "LOUHK7PtQfq4"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2023-11-29:11:54:55,156 INFO [utils.py:160] NumExpr defaulting to 2 threads.\n",
"2023-11-29 11:54:55.942051: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"2023-11-29 11:54:55.942108: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
"2023-11-29 11:54:55.942142: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
"2023-11-29 11:54:57.066802: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
"2023-11-29:11:55:00,954 INFO [__main__.py:132] Verbosity set to INFO\n",
"2023-11-29:11:55:11,038 WARNING [__main__.py:138] --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.\n",
"2023-11-29:11:55:11,038 INFO [__main__.py:143] Including path: ./\n",
"2023-11-29:11:55:11,046 INFO [__main__.py:205] Selected Tasks: ['demo_boolq']\n",
"2023-11-29:11:55:11,047 WARNING [evaluator.py:93] generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.\n",
"2023-11-29:11:55:11,110 INFO [huggingface.py:120] Using device 'cuda'\n",
"config.json: 100% 571/571 [00:00<00:00, 2.87MB/s]\n",
"model.safetensors: 100% 5.68G/5.68G [00:32<00:00, 173MB/s]\n",
"tokenizer_config.json: 100% 396/396 [00:00<00:00, 2.06MB/s]\n",
"tokenizer.json: 100% 2.11M/2.11M [00:00<00:00, 11.6MB/s]\n",
"special_tokens_map.json: 100% 99.0/99.0 [00:00<00:00, 555kB/s]\n",
"2023-11-29:11:56:18,658 WARNING [task.py:614] [Task: demo_boolq] metric acc is defined, but aggregation is not. using default aggregation=mean\n",
"2023-11-29:11:56:18,658 WARNING [task.py:626] [Task: demo_boolq] metric acc is defined, but higher_is_better is not. using default higher_is_better=True\n",
"Downloading builder script: 100% 30.7k/30.7k [00:00<00:00, 59.0MB/s]\n",
"Downloading metadata: 100% 38.7k/38.7k [00:00<00:00, 651kB/s]\n",
"Downloading readme: 100% 14.8k/14.8k [00:00<00:00, 37.3MB/s]\n",
"Downloading data: 100% 4.12M/4.12M [00:00<00:00, 55.1MB/s]\n",
"Generating train split: 100% 9427/9427 [00:00<00:00, 15630.89 examples/s]\n",
"Generating validation split: 100% 3270/3270 [00:00<00:00, 20002.56 examples/s]\n",
"Generating test split: 100% 3245/3245 [00:00<00:00, 20866.19 examples/s]\n",
"2023-11-29:11:56:22,315 INFO [task.py:355] Building contexts for task on rank 0...\n",
"2023-11-29:11:56:22,322 INFO [evaluator.py:319] Running loglikelihood requests\n",
"100% 20/20 [00:04<00:00, 4.37it/s]\n",
"fatal: not a git repository (or any of the parent directories): .git\n",
"hf (pretrained=EleutherAI/pythia-2.8b), gen_kwargs: (), limit: 10.0, num_fewshot: None, batch_size: 1\n",
"| Tasks |Version|Filter|n-shot|Metric|Value| |Stderr|\n",
"|----------|-------|------|-----:|------|----:|---|-----:|\n",
"|demo_boolq|Yaml |none | 0|acc | 1|± | 0|\n",
"\n"
]
}
],
"source": [
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --include_path ./ \\\n",
" --tasks demo_boolq \\\n",
" --limit 10\n"
]
},
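{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same evaluation can also be driven from Python instead of the CLI. The cell below is an illustrative sketch only: it assumes `lm_eval.simple_evaluate` and `lm_eval.tasks.include_path` are available in this version of the harness, and the exact entry points for registering local task directories may differ across releases."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch of the Python entry point; names and arguments may vary by version.\n",
"import lm_eval\n",
"import lm_eval.tasks\n",
"\n",
"# Make local YAML tasks visible to the harness (assumed API for this release;\n",
"# newer versions route this through a TaskManager object instead).\n",
"lm_eval.tasks.include_path(\"./\")\n",
"\n",
"results = lm_eval.simple_evaluate(\n",
"    model=\"hf\",\n",
"    model_args=\"pretrained=EleutherAI/pythia-2.8b\",\n",
"    tasks=[\"demo_boolq\"],\n",
"    limit=10,  # small sample for a quick qualitative check only\n",
")\n",
"print(results[\"results\"])"
]
},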
{
"cell_type": "markdown",
"metadata": {
"id": "LOUHK7PtQfq4"
},
"source": [
"Often, tasks are part of a larger group used to measure different capabilities. The dynamism of the field today means new dimensions of evaluation can come about which would mix and match new and older tasks alike. In LM-Eval, We can also group tasks and call that the group name to evaluate on a set of tasks easily. In this instance, let's evaluate the tag `yes_or_no_tasks` which comprise of the tasks `demo_boolq` and `demo_cola`; tasks which are multiple choice tasks with options `yes` and `no` as the name suggests.\n",
"\n",
"<!-- making new groups is easier than ever, allowing user to work bottom-up by makiing individual tasks and linking them to a group or Top-Down, making a new group by listing existing tasks.\n",
"\n",
"We also show the aggregate across samples besides only showing the aggregation between subtasks. This may come in handy when certain groups want to be aggregated as a single task. -->\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"id": "fthNg3ywO-kA"
},
"outputs": [],
"source": [
"YAML_cola_string = '''\n",
"tag: yes_or_no_tasks\n",
"task: demo_cola\n",
"dataset_path: glue\n",
"dataset_name: cola\n",
"output_type: multiple_choice\n",
"training_split: train\n",
"validation_split: validation\n",
"doc_to_text: \"{{sentence}}\\nQuestion: Does this sentence make sense?\\nAnswer:\"\n",
"doc_to_target: label\n",
"doc_to_choice: [\"no\", \"yes\"]\n",
"should_decontaminate: true\n",
"doc_to_decontamination_query: sentence\n",
"metric_list:\n",
" - metric: acc\n",
"'''\n",
"with open('cola.yaml', 'w') as f:\n",
" f.write(YAML_cola_string)"
]
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "Qw83KAePAhaS"
},
"source": [
"# Releasing LM-Evaluation-Harness v0.4.0"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Z7k2vq1iAdqr"
},
"source": [
"With the vast amount of work done in the field today, it helps to have a tool that people can use easily to share their results and use to check others to ensure reported numbers are valid. The LM Evaluation Harness is one such tool the community has used extensively. We want to continue to support the community and with that in mind, we’re excited to announce a major update on the LM Evaluation Harness to further our goal for open and accessible AI research."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0gDoM0AJAvEc"
},
"source": [
"Our refactor stems from our desires to make the following believed best practices easier to carry out. \n",
"\n",
"1. Never copy results from other papers\n",
"2. Always share your exact prompts\n",
"3. Always provide model outputs\n",
"4. Qualitatively review a small batch of outputs before running evaluation jobs at scale\n",
"\n",
"We also wanted to make the library a better experience to use and to contribute or design evaluations within. New features in the new release that serve this purpose include:\n",
"\n",
"1. Faster Evaluation Runtimes (accelerated data-parallel inference with HF Transformers + Accelerate, and commonly used or faster inference libraries such as vLLM and Llama-CPP)\n",
"2. Easier addition and sharing of new tasks (YAML-based task config formats, allowing single-file sharing of custom tasks)\n",
"3. More configurability, for more advanced workflows and easier operation with modifying prompts\n",
"4. Better logging of data at runtime and post-hoc"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nnwsOpjda_YW"
},
"source": [
"In this notebook we will be going through a short tutorial on how things work."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zAov81vTbL2K"
},
"source": [
"## Install LM-Eval"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "8hiosGzq_qZg",
"outputId": "6ab73e5e-1f54-417e-a388-07e0d870b132"
},
"outputs": [
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"id": "XceRKCuuDtbn"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2023-11-29:11:56:33,016 INFO [utils.py:160] NumExpr defaulting to 2 threads.\n",
"2023-11-29 11:56:33.852995: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"2023-11-29 11:56:33.853050: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
"2023-11-29 11:56:33.853087: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
"2023-11-29 11:56:35.129047: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
"2023-11-29:11:56:38,546 INFO [__main__.py:132] Verbosity set to INFO\n",
"2023-11-29:11:56:47,509 WARNING [__main__.py:138] --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.\n",
"2023-11-29:11:56:47,509 INFO [__main__.py:143] Including path: ./\n",
"2023-11-29:11:56:47,517 INFO [__main__.py:205] Selected Tasks: ['yes_or_no_tasks']\n",
"2023-11-29:11:56:47,520 WARNING [evaluator.py:93] generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.\n",
"2023-11-29:11:56:47,550 INFO [huggingface.py:120] Using device 'cuda'\n",
"2023-11-29:11:57:08,743 WARNING [task.py:614] [Task: demo_cola] metric acc is defined, but aggregation is not. using default aggregation=mean\n",
"2023-11-29:11:57:08,743 WARNING [task.py:626] [Task: demo_cola] metric acc is defined, but higher_is_better is not. using default higher_is_better=True\n",
"Downloading builder script: 100% 28.8k/28.8k [00:00<00:00, 52.7MB/s]\n",
"Downloading metadata: 100% 28.7k/28.7k [00:00<00:00, 51.9MB/s]\n",
"Downloading readme: 100% 27.9k/27.9k [00:00<00:00, 48.0MB/s]\n",
"Downloading data: 100% 377k/377k [00:00<00:00, 12.0MB/s]\n",
"Generating train split: 100% 8551/8551 [00:00<00:00, 19744.58 examples/s]\n",
"Generating validation split: 100% 1043/1043 [00:00<00:00, 27057.01 examples/s]\n",
"Generating test split: 100% 1063/1063 [00:00<00:00, 22705.17 examples/s]\n",
"2023-11-29:11:57:11,698 INFO [task.py:355] Building contexts for task on rank 0...\n",
"2023-11-29:11:57:11,704 INFO [evaluator.py:319] Running loglikelihood requests\n",
"100% 20/20 [00:03<00:00, 5.15it/s]\n",
"fatal: not a git repository (or any of the parent directories): .git\n",
"hf (pretrained=EleutherAI/pythia-2.8b), gen_kwargs: (), limit: 10.0, num_fewshot: None, batch_size: 1\n",
"| Tasks |Version|Filter|n-shot|Metric|Value| |Stderr|\n",
"|---------------|-------|------|-----:|------|----:|---|-----:|\n",
"|yes_or_no_tasks|N/A |none | 0|acc | 0.7|± |0.1528|\n",
"| - demo_cola |Yaml |none | 0|acc | 0.7|± |0.1528|\n",
"\n",
"| Groups |Version|Filter|n-shot|Metric|Value| |Stderr|\n",
"|---------------|-------|------|-----:|------|----:|---|-----:|\n",
"|yes_or_no_tasks|N/A |none | 0|acc | 0.7|± |0.1528|\n",
"\n"
]
}
],
"source": [
"# !accelerate launch --no_python\n",
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --include_path ./ \\\n",
" --tasks yes_or_no_tasks \\\n",
" --limit 10 \\\n",
" --output output/yes_or_no_tasks/ \\\n",
" --log_samples\n"
]
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting git+https://github.com/EleutherAI/lm-evaluation-harness.git@big-refactor\n",
" Cloning https://github.com/EleutherAI/lm-evaluation-harness.git (to revision big-refactor) to /tmp/pip-req-build-tnssql5s\n",
" Running command git clone --filter=blob:none --quiet https://github.com/EleutherAI/lm-evaluation-harness.git /tmp/pip-req-build-tnssql5s\n",
" Running command git checkout -b big-refactor --track origin/big-refactor\n",
" Switched to a new branch 'big-refactor'\n",
" Branch 'big-refactor' set up to track remote branch 'big-refactor' from 'origin'.\n",
" Resolved https://github.com/EleutherAI/lm-evaluation-harness.git to commit 42f486ee49b65926a444cb0620870a39a5b4b0a8\n",
" Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n",
" Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n",
" Preparing metadata (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
"Collecting accelerate>=0.21.0 (from lm-eval==1.0.0)\n",
" Downloading accelerate-0.24.1-py3-none-any.whl (261 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m261.4/261.4 kB\u001b[0m \u001b[31m4.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting evaluate (from lm-eval==1.0.0)\n",
" Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m84.1/84.1 kB\u001b[0m \u001b[31m5.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting datasets>=2.0.0 (from lm-eval==1.0.0)\n",
" Downloading datasets-2.15.0-py3-none-any.whl (521 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m521.2/521.2 kB\u001b[0m \u001b[31m9.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting jsonlines (from lm-eval==1.0.0)\n",
" Downloading jsonlines-4.0.0-py3-none-any.whl (8.7 kB)\n",
"Requirement already satisfied: numexpr in /usr/local/lib/python3.10/dist-packages (from lm-eval==1.0.0) (2.8.7)\n",
"Collecting peft>=0.2.0 (from lm-eval==1.0.0)\n",
" Downloading peft-0.6.2-py3-none-any.whl (174 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m174.7/174.7 kB\u001b[0m \u001b[31m7.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting pybind11>=2.6.2 (from lm-eval==1.0.0)\n",
" Downloading pybind11-2.11.1-py3-none-any.whl (227 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m227.7/227.7 kB\u001b[0m \u001b[31m12.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting pytablewriter (from lm-eval==1.0.0)\n",
" Downloading pytablewriter-1.2.0-py3-none-any.whl (111 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m111.1/111.1 kB\u001b[0m \u001b[31m8.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hCollecting rouge-score>=0.0.4 (from lm-eval==1.0.0)\n",
" Downloading rouge_score-0.1.2.tar.gz (17 kB)\n",
" Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
"Collecting sacrebleu>=1.5.0 (from lm-eval==1.0.0)\n",
" Downloading sacrebleu-2.3.2-py3-none-any.whl (119 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m119.7/119.7 kB\u001b[0m \u001b[31m8.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hRequirement already satisfied: scikit-learn>=0.24.1 in /usr/local/lib/python3.10/dist-packages (from lm-eval==1.0.0) (1.2.2)\n",
"Collecting sqlitedict (from lm-eval==1.0.0)\n",
" Downloading sqlitedict-2.1.0.tar.gz (21 kB)\n",
" Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
"Requirement already satisfied: torch>=1.8 in /usr/local/lib/python3.10/dist-packages (from lm-eval==1.0.0) (2.1.0+cu118)\n",
"Collecting tqdm-multiprocess (from lm-eval==1.0.0)\n",
" Downloading tqdm_multiprocess-0.0.11-py3-none-any.whl (9.8 kB)\n",
"Requirement already satisfied: transformers>=4.1 in /usr/local/lib/python3.10/dist-packages (from lm-eval==1.0.0) (4.35.2)\n",
"Collecting zstandard (from lm-eval==1.0.0)\n",
" Downloading zstandard-0.22.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.4 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.4/5.4 MB\u001b[0m \u001b[31m29.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hRequirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (1.23.5)\n",
"Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (23.2)\n",
"Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (5.9.5)\n",
"Requirement already satisfied: pyyaml in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (6.0.1)\n",
"Requirement already satisfied: huggingface-hub in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (0.19.4)\n",
"Requirement already satisfied: pyarrow>=8.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (9.0.0)\n",
"Collecting pyarrow-hotfix (from datasets>=2.0.0->lm-eval==1.0.0)\n",
" Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)\n",
"Collecting dill<0.3.8,>=0.3.0 (from datasets>=2.0.0->lm-eval==1.0.0)\n",
" Downloading dill-0.3.7-py3-none-any.whl (115 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m115.3/115.3 kB\u001b[0m \u001b[31m14.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hRequirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (1.5.3)\n",
"Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (2.31.0)\n",
"Requirement already satisfied: tqdm>=4.62.1 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (4.66.1)\n",
"Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (3.4.1)\n",
"Collecting multiprocess (from datasets>=2.0.0->lm-eval==1.0.0)\n",
" Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m134.8/134.8 kB\u001b[0m \u001b[31m19.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hRequirement already satisfied: fsspec[http]<=2023.10.0,>=2023.1.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (2023.6.0)\n",
"Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (3.8.6)\n",
"Collecting responses<0.19 (from evaluate->lm-eval==1.0.0)\n",
" Downloading responses-0.18.0-py3-none-any.whl (38 kB)\n",
"Requirement already satisfied: safetensors in /usr/local/lib/python3.10/dist-packages (from peft>=0.2.0->lm-eval==1.0.0) (0.4.0)\n",
"Requirement already satisfied: absl-py in /usr/local/lib/python3.10/dist-packages (from rouge-score>=0.0.4->lm-eval==1.0.0) (1.4.0)\n",
"Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (from rouge-score>=0.0.4->lm-eval==1.0.0) (3.8.1)\n",
"Requirement already satisfied: six>=1.14.0 in /usr/local/lib/python3.10/dist-packages (from rouge-score>=0.0.4->lm-eval==1.0.0) (1.16.0)\n",
"Collecting portalocker (from sacrebleu>=1.5.0->lm-eval==1.0.0)\n",
" Downloading portalocker-2.8.2-py3-none-any.whl (17 kB)\n",
"Requirement already satisfied: regex in /usr/local/lib/python3.10/dist-packages (from sacrebleu>=1.5.0->lm-eval==1.0.0) (2023.6.3)\n",
"Requirement already satisfied: tabulate>=0.8.9 in /usr/local/lib/python3.10/dist-packages (from sacrebleu>=1.5.0->lm-eval==1.0.0) (0.9.0)\n",
"Collecting colorama (from sacrebleu>=1.5.0->lm-eval==1.0.0)\n",
" Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)\n",
"Requirement already satisfied: lxml in /usr/local/lib/python3.10/dist-packages (from sacrebleu>=1.5.0->lm-eval==1.0.0) (4.9.3)\n",
"Requirement already satisfied: scipy>=1.3.2 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.24.1->lm-eval==1.0.0) (1.11.3)\n",
"Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.24.1->lm-eval==1.0.0) (1.3.2)\n",
"Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.24.1->lm-eval==1.0.0) (3.2.0)\n",
"Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (3.13.1)\n",
"Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (4.5.0)\n",
"Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (1.12)\n",
"Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (3.2.1)\n",
"Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (3.1.2)\n",
"Requirement already satisfied: triton==2.1.0 in /usr/local/lib/python3.10/dist-packages (from torch>=1.8->lm-eval==1.0.0) (2.1.0)\n",
"Requirement already satisfied: tokenizers<0.19,>=0.14 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.1->lm-eval==1.0.0) (0.15.0)\n",
"Requirement already satisfied: attrs>=19.2.0 in /usr/local/lib/python3.10/dist-packages (from jsonlines->lm-eval==1.0.0) (23.1.0)\n",
"Requirement already satisfied: setuptools>=38.3.0 in /usr/local/lib/python3.10/dist-packages (from pytablewriter->lm-eval==1.0.0) (67.7.2)\n",
"Collecting DataProperty<2,>=1.0.1 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading DataProperty-1.0.1-py3-none-any.whl (27 kB)\n",
"Collecting mbstrdecoder<2,>=1.0.0 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading mbstrdecoder-1.1.3-py3-none-any.whl (7.8 kB)\n",
"Collecting pathvalidate<4,>=2.3.0 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading pathvalidate-3.2.0-py3-none-any.whl (23 kB)\n",
"Collecting tabledata<2,>=1.3.1 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading tabledata-1.3.3-py3-none-any.whl (11 kB)\n",
"Collecting tcolorpy<1,>=0.0.5 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading tcolorpy-0.1.4-py3-none-any.whl (7.9 kB)\n",
"Collecting typepy[datetime]<2,>=1.3.2 (from pytablewriter->lm-eval==1.0.0)\n",
" Downloading typepy-1.3.2-py3-none-any.whl (31 kB)\n",
"Requirement already satisfied: charset-normalizer<4.0,>=2.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (3.3.2)\n",
"Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (6.0.4)\n",
"Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (4.0.3)\n",
"Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (1.9.2)\n",
"Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (1.4.0)\n",
"Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.0.0->lm-eval==1.0.0) (1.3.1)\n",
"Requirement already satisfied: chardet<6,>=3.0.4 in /usr/local/lib/python3.10/dist-packages (from mbstrdecoder<2,>=1.0.0->pytablewriter->lm-eval==1.0.0) (5.2.0)\n",
"Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.0.0->lm-eval==1.0.0) (3.4)\n",
"Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.0.0->lm-eval==1.0.0) (2.0.7)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.0.0->lm-eval==1.0.0) (2023.7.22)\n",
"Requirement already satisfied: python-dateutil<3.0.0,>=2.8.0 in /usr/local/lib/python3.10/dist-packages (from typepy[datetime]<2,>=1.3.2->pytablewriter->lm-eval==1.0.0) (2.8.2)\n",
"Requirement already satisfied: pytz>=2018.9 in /usr/local/lib/python3.10/dist-packages (from typepy[datetime]<2,>=1.3.2->pytablewriter->lm-eval==1.0.0) (2023.3.post1)\n",
"Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=1.8->lm-eval==1.0.0) (2.1.3)\n",
"Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk->rouge-score>=0.0.4->lm-eval==1.0.0) (8.1.7)\n",
"Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=1.8->lm-eval==1.0.0) (1.3.0)\n",
"Building wheels for collected packages: lm-eval, rouge-score, sqlitedict\n",
" Building wheel for lm-eval (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for lm-eval: filename=lm_eval-1.0.0-py3-none-any.whl size=994254 sha256=88356155b19f2891981ecef948326ad6ce8ca40a6009378410ec20d0e225995a\n",
" Stored in directory: /tmp/pip-ephem-wheel-cache-9v6ye7h3/wheels/17/01/26/599c0779e9858a70a73fa8a306699b5b9a868f820c225457b0\n",
" Building wheel for rouge-score (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24933 sha256=6bb0d44e4881972c43ce194e7cb65233d309758cb15f0dec54590d3d2efcfc36\n",
" Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4\n",
" Building wheel for sqlitedict (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for sqlitedict: filename=sqlitedict-2.1.0-py3-none-any.whl size=16863 sha256=5747f7dd73ddf3d8fbcebf51b5e4f718fabe1e94bccdf16d2f22a2e65ee7fdf4\n",
" Stored in directory: /root/.cache/pip/wheels/79/d6/e7/304e0e6cb2221022c26d8161f7c23cd4f259a9e41e8bbcfabd\n",
"Successfully built lm-eval rouge-score sqlitedict\n",
"Installing collected packages: sqlitedict, zstandard, tcolorpy, pybind11, pyarrow-hotfix, portalocker, pathvalidate, mbstrdecoder, jsonlines, dill, colorama, typepy, tqdm-multiprocess, sacrebleu, rouge-score, responses, multiprocess, accelerate, datasets, DataProperty, tabledata, peft, evaluate, pytablewriter, lm-eval\n",
"Successfully installed DataProperty-1.0.1 accelerate-0.24.1 colorama-0.4.6 datasets-2.15.0 dill-0.3.7 evaluate-0.4.1 jsonlines-4.0.0 lm-eval-1.0.0 mbstrdecoder-1.1.3 multiprocess-0.70.15 pathvalidate-3.2.0 peft-0.6.2 portalocker-2.8.2 pyarrow-hotfix-0.6 pybind11-2.11.1 pytablewriter-1.2.0 responses-0.18.0 rouge-score-0.1.2 sacrebleu-2.3.2 sqlitedict-2.1.0 tabledata-1.3.3 tcolorpy-0.1.4 tqdm-multiprocess-0.0.11 typepy-1.3.2 zstandard-0.22.0\n"
]
}
],
"source": [
"# Install LM-Eval\n",
"!pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 0,
"referenced_widgets": [
"a1d3a8aa016544a78e8821c8f6199e06",
"f61ed33fad754146bdd2ac9db1ba1c48",
"bfa0af6aeff344c6845e1080a878e92e",
"fd1ad9e0367d4004aae853b91c3a7617",
"6b2d90209ec14230b3d58a74ac9b83bf",
"a73f357065d34d7baf0453ae4a8d75e2",
"46f521b73fd943c081c648fd873ebc0a",
"7c5689bc13684db8a22681f41863dddd",
"48763b6233374554ae76035c0483066f",
"4986a21eb560448fa79f4b25cde48951",
"aed3acd2f2d74003b44079c333a0698e"
]
},
"id": "uyO5MaKkZyah",
"outputId": "d46e8096-5086-4e49-967e-ea33d4a2a335"
},
"outputs": [
{
"cell_type": "markdown",
"metadata": {
"id": "XceRKCuuDtbn"
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "a1d3a8aa016544a78e8821c8f6199e06",
"version_major": 2,
"version_minor": 0
},
"source": [
"## Edit Prompt Templates Quickly\n",
"\n",
"The following is a yaml made to evaluate the specific subtask of `high_school_geography` from MMLU. It uses the standard prompt where the we choose the letters from the options with most likelihood as the model's prediction."
"text/plain": [
"Downloading builder script: 0%| | 0.00/5.67k [00:00<?, ?B/s]"
]
},
},
"metadata": {},
"output_type": "display_data"
}
],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "8rfUeX6n_wkK"
},
"source": [
"## Create new evaluation tasks with config-based tasks\n",
"\n",
"Even within the same task, many works have reported numbers based on different choices of evaluation. Some report on the test sets, validation sets, or even subset of the training sets. Others have specialized prompts and verbalizers. We introduce YAMLs to allow users to easily make different variations. By leveraging the YAML configs to configure evaluations, the refactored LM-Eval takes the methods of the `Task` object and makes them configurable by setting the appropriate attributes in the config file. There, users can set the tasks they want by setting the name of the HF dataset (local tasks are also possible), the dataset splits used, and much more. Key configurations relating to prompting, such as `doc_to_text`, previously implemented as a method of the same name, are now configurable with jinja2 to allow high-level scripting to transform a HF dataset to text string as input to the model.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HYFUhhfOSJKe"
},
"source": [
"A core-feature to LM-Eval is to configure tasks with YAML configs. With configs, you can fill preset fields to easily set up a task.\n",
"\n",
"Here, we write a demo YAML config for a multiple-choice evaluation of BoolQ:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"id": "bg3dGROW-V39"
},
"outputs": [],
"source": [
"YAML_boolq_string = \"\"\"\n",
"task: demo_boolq\n",
"dataset_path: super_glue\n",
"dataset_name: boolq\n",
"output_type: multiple_choice\n",
"training_split: train\n",
"validation_split: validation\n",
"doc_to_text: \"{{passage}}\\nQuestion: {{question}}?\\nAnswer:\"\n",
"doc_to_target: label\n",
"doc_to_choice: [\"no\", \"yes\"]\n",
"should_decontaminate: true\n",
"doc_to_decontamination_query: passage\n",
"metric_list:\n",
" - metric: acc\n",
"\"\"\"\n",
"with open(\"boolq.yaml\", \"w\") as f:\n",
" f.write(YAML_boolq_string)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And we can now run evaluation on this task, by pointing to the config file we've just created:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"id": "LOUHK7PtQfq4"
},
"outputs": [
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"id": "GTFvdt9kSlBG"
},
"outputs": [],
"source": [
"YAML_mmlu_geo_string = '''\n",
"task: demo_mmlu_high_school_geography\n",
"dataset_path: cais/mmlu\n",
"dataset_name: high_school_geography\n",
"description: \"The following are multiple choice questions (with answers) about high school geography.\\n\\n\"\n",
"test_split: test\n",
"fewshot_split: dev\n",
"fewshot_config:\n",
" sampler: first_n\n",
"output_type: multiple_choice\n",
"doc_to_text: \"{{question.strip()}}\\nA. {{choices[0]}}\\nB. {{choices[1]}}\\nC. {{choices[2]}}\\nD. {{choices[3]}}\\nAnswer:\"\n",
"doc_to_choice: [\"A\", \"B\", \"C\", \"D\"]\n",
"doc_to_target: answer\n",
"metric_list:\n",
" - metric: acc\n",
" aggregation: mean\n",
" higher_is_better: true\n",
" - metric: acc_norm\n",
" aggregation: mean\n",
" higher_is_better: true\n",
"'''\n",
"with open('mmlu_high_school_geography.yaml', 'w') as f:\n",
" f.write(YAML_mmlu_geo_string)\n"
]
},
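{
"cell_type": "markdown",
"metadata": {},
"source": [
"To show how quickly a prompt template can be edited, here is a hypothetical variant of the config above (an illustrative sketch, not a released task): it scores the full answer strings as continuations rather than the letters A, B, C, D, so only the task name, `doc_to_text`, and `doc_to_choice` change. Whether `doc_to_choice` accepts a Jinja2 expression that returns a list may depend on the harness version, so treat this purely as a sketch."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical prompt variant (sketch): score full answer strings instead of letters.\n",
"# Whether doc_to_choice may be a Jinja2 expression returning a list can vary by version.\n",
"YAML_mmlu_geo_continuation_string = \"\"\"\n",
"task: demo_mmlu_high_school_geography_continuation\n",
"dataset_path: cais/mmlu\n",
"dataset_name: high_school_geography\n",
"description: \"The following are multiple choice questions (with answers) about high school geography.\\n\\n\"\n",
"test_split: test\n",
"fewshot_split: dev\n",
"fewshot_config:\n",
"  sampler: first_n\n",
"output_type: multiple_choice\n",
"doc_to_text: \"Question: {{question.strip()}}\\nAnswer:\"\n",
"doc_to_choice: \"{{choices}}\"\n",
"doc_to_target: answer\n",
"metric_list:\n",
"  - metric: acc\n",
"    aggregation: mean\n",
"    higher_is_better: true\n",
"\"\"\"\n",
"with open(\"mmlu_high_school_geography_continuation.yaml\", \"w\") as f:\n",
"    f.write(YAML_mmlu_geo_continuation_string)"
]
},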
"name": "stdout",
"output_type": "stream",
"text": [
"2023-11-29:11:54:55,156 INFO [utils.py:160] NumExpr defaulting to 2 threads.\n",
"2023-11-29 11:54:55.942051: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"2023-11-29 11:54:55.942108: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
"2023-11-29 11:54:55.942142: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
"2023-11-29 11:54:57.066802: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
"2023-11-29:11:55:00,954 INFO [__main__.py:132] Verbosity set to INFO\n",
"2023-11-29:11:55:11,038 WARNING [__main__.py:138] --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.\n",
"2023-11-29:11:55:11,038 INFO [__main__.py:143] Including path: ./\n",
"2023-11-29:11:55:11,046 INFO [__main__.py:205] Selected Tasks: ['demo_boolq']\n",
"2023-11-29:11:55:11,047 WARNING [evaluator.py:93] generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.\n",
"2023-11-29:11:55:11,110 INFO [huggingface.py:120] Using device 'cuda'\n",
"config.json: 100% 571/571 [00:00<00:00, 2.87MB/s]\n",
"model.safetensors: 100% 5.68G/5.68G [00:32<00:00, 173MB/s]\n",
"tokenizer_config.json: 100% 396/396 [00:00<00:00, 2.06MB/s]\n",
"tokenizer.json: 100% 2.11M/2.11M [00:00<00:00, 11.6MB/s]\n",
"special_tokens_map.json: 100% 99.0/99.0 [00:00<00:00, 555kB/s]\n",
"2023-11-29:11:56:18,658 WARNING [task.py:614] [Task: demo_boolq] metric acc is defined, but aggregation is not. using default aggregation=mean\n",
"2023-11-29:11:56:18,658 WARNING [task.py:626] [Task: demo_boolq] metric acc is defined, but higher_is_better is not. using default higher_is_better=True\n",
"Downloading builder script: 100% 30.7k/30.7k [00:00<00:00, 59.0MB/s]\n",
"Downloading metadata: 100% 38.7k/38.7k [00:00<00:00, 651kB/s]\n",
"Downloading readme: 100% 14.8k/14.8k [00:00<00:00, 37.3MB/s]\n",
"Downloading data: 100% 4.12M/4.12M [00:00<00:00, 55.1MB/s]\n",
"Generating train split: 100% 9427/9427 [00:00<00:00, 15630.89 examples/s]\n",
"Generating validation split: 100% 3270/3270 [00:00<00:00, 20002.56 examples/s]\n",
"Generating test split: 100% 3245/3245 [00:00<00:00, 20866.19 examples/s]\n",
"2023-11-29:11:56:22,315 INFO [task.py:355] Building contexts for task on rank 0...\n",
"2023-11-29:11:56:22,322 INFO [evaluator.py:319] Running loglikelihood requests\n",
"100% 20/20 [00:04<00:00, 4.37it/s]\n",
"fatal: not a git repository (or any of the parent directories): .git\n",
"hf (pretrained=EleutherAI/pythia-2.8b), gen_kwargs: (), limit: 10.0, num_fewshot: None, batch_size: 1\n",
"| Tasks |Version|Filter|n-shot|Metric|Value| |Stderr|\n",
"|----------|-------|------|-----:|------|----:|---|-----:|\n",
"|demo_boolq|Yaml |none | 0|acc | 1|± | 0|\n",
"\n"
]
}
],
"source": [
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --include_path ./ \\\n",
" --tasks demo_boolq \\\n",
" --limit 10"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LOUHK7PtQfq4"
},
"source": [
"Often, tasks are part of a larger group used to measure different capabilities. The dynamism of the field today means new dimensions of evaluation can come about which would mix and match new and older tasks alike. In LM-Eval, We can also group tasks and call that the group name to evaluate on a set of tasks easily. In this instance, let's evaluate the tag `yes_or_no_tasks` which comprise of the tasks `demo_boolq` and `demo_cola`; tasks which are multiple choice tasks with options `yes` and `no` as the name suggests.\n",
"\n",
"<!-- making new groups is easier than ever, allowing user to work bottom-up by makiing individual tasks and linking them to a group or Top-Down, making a new group by listing existing tasks.\n",
"\n",
"We also show the aggregate across samples besides only showing the aggregation between subtasks. This may come in handy when certain groups want to be aggregated as a single task. -->\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"id": "fthNg3ywO-kA"
},
"outputs": [],
"source": [
"YAML_cola_string = \"\"\"\n",
"tag: yes_or_no_tasks\n",
"task: demo_cola\n",
"dataset_path: glue\n",
"dataset_name: cola\n",
"output_type: multiple_choice\n",
"training_split: train\n",
"validation_split: validation\n",
"doc_to_text: \"{{sentence}}\\nQuestion: Does this sentence make sense?\\nAnswer:\"\n",
"doc_to_target: label\n",
"doc_to_choice: [\"no\", \"yes\"]\n",
"should_decontaminate: true\n",
"doc_to_decontamination_query: sentence\n",
"metric_list:\n",
" - metric: acc\n",
"\"\"\"\n",
"with open(\"cola.yaml\", \"w\") as f:\n",
" f.write(YAML_cola_string)"
]
},
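{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check before running the tag, we can parse the file we just wrote and confirm that it carries the `tag` key linking it to `yes_or_no_tasks`. This is only an illustrative sketch (it assumes PyYAML is available, which the harness already depends on); the harness itself handles this registration when it loads the config:\n",
"\n",
"```python\n",
"# Illustrative check only: parse the task config we just wrote and\n",
"# confirm the fields that register it under the yes_or_no_tasks tag.\n",
"import yaml\n",
"\n",
"with open(\"cola.yaml\") as f:\n",
"    cfg = yaml.safe_load(f)\n",
"\n",
"print(cfg[\"task\"], \"->\", cfg[\"tag\"])  # demo_cola -> yes_or_no_tasks\n",
"print(cfg[\"output_type\"], cfg[\"doc_to_choice\"])  # multiple_choice ['no', 'yes']\n",
"```"
]
},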
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"id": "XceRKCuuDtbn"
},
"outputs": [
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"id": "jyKOfCsKb-xy"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2023-11-29:11:57:23,598 INFO [utils.py:160] NumExpr defaulting to 2 threads.\n",
"2023-11-29 11:57:24.719750: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"2023-11-29 11:57:24.719806: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
"2023-11-29 11:57:24.719847: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
"2023-11-29 11:57:26.656125: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
"2023-11-29:11:57:31,563 INFO [__main__.py:132] Verbosity set to INFO\n",
"2023-11-29:11:57:40,541 WARNING [__main__.py:138] --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.\n",
"2023-11-29:11:57:40,541 INFO [__main__.py:143] Including path: ./\n",
"2023-11-29:11:57:40,558 INFO [__main__.py:205] Selected Tasks: ['demo_mmlu_high_school_geography']\n",
"2023-11-29:11:57:40,559 WARNING [evaluator.py:93] generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.\n",
"2023-11-29:11:57:40,589 INFO [huggingface.py:120] Using device 'cuda'\n",
"Downloading builder script: 100% 5.84k/5.84k [00:00<00:00, 17.7MB/s]\n",
"Downloading metadata: 100% 106k/106k [00:00<00:00, 892kB/s] \n",
"Downloading readme: 100% 39.7k/39.7k [00:00<00:00, 631kB/s]\n",
"Downloading data: 100% 166M/166M [00:01<00:00, 89.0MB/s]\n",
"Generating auxiliary_train split: 100% 99842/99842 [00:07<00:00, 12536.83 examples/s]\n",
"Generating test split: 100% 198/198 [00:00<00:00, 1439.20 examples/s]\n",
"Generating validation split: 100% 22/22 [00:00<00:00, 4181.76 examples/s]\n",
"Generating dev split: 100% 5/5 [00:00<00:00, 36.25 examples/s]\n",
"2023-11-29:11:58:09,798 INFO [task.py:355] Building contexts for task on rank 0...\n",
"2023-11-29:11:58:09,822 INFO [evaluator.py:319] Running loglikelihood requests\n",
"100% 40/40 [00:05<00:00, 7.86it/s]\n",
"fatal: not a git repository (or any of the parent directories): .git\n",
"hf (pretrained=EleutherAI/pythia-2.8b), gen_kwargs: (), limit: 10.0, num_fewshot: None, batch_size: 1\n",
"| Tasks |Version|Filter|n-shot| Metric |Value| |Stderr|\n",
"|-------------------------------|-------|------|-----:|--------|----:|---|-----:|\n",
"|demo_mmlu_high_school_geography|Yaml |none | 0|acc | 0.3|± |0.1528|\n",
"| | |none | 0|acc_norm| 0.3|± |0.1528|\n",
"\n"
]
}
],
"source": [
"# !accelerate launch --no_python\n",
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --include_path ./ \\\n",
" --tasks demo_mmlu_high_school_geography \\\n",
" --limit 10 \\\n",
" --output output/mmlu_high_school_geography/ \\\n",
" --log_samples"
]
},
"name": "stdout",
"output_type": "stream",
"text": [
"2023-11-29:11:56:33,016 INFO [utils.py:160] NumExpr defaulting to 2 threads.\n",
"2023-11-29 11:56:33.852995: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"2023-11-29 11:56:33.853050: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
"2023-11-29 11:56:33.853087: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
"2023-11-29 11:56:35.129047: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
"2023-11-29:11:56:38,546 INFO [__main__.py:132] Verbosity set to INFO\n",
"2023-11-29:11:56:47,509 WARNING [__main__.py:138] --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.\n",
"2023-11-29:11:56:47,509 INFO [__main__.py:143] Including path: ./\n",
"2023-11-29:11:56:47,517 INFO [__main__.py:205] Selected Tasks: ['yes_or_no_tasks']\n",
"2023-11-29:11:56:47,520 WARNING [evaluator.py:93] generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.\n",
"2023-11-29:11:56:47,550 INFO [huggingface.py:120] Using device 'cuda'\n",
"2023-11-29:11:57:08,743 WARNING [task.py:614] [Task: demo_cola] metric acc is defined, but aggregation is not. using default aggregation=mean\n",
"2023-11-29:11:57:08,743 WARNING [task.py:626] [Task: demo_cola] metric acc is defined, but higher_is_better is not. using default higher_is_better=True\n",
"Downloading builder script: 100% 28.8k/28.8k [00:00<00:00, 52.7MB/s]\n",
"Downloading metadata: 100% 28.7k/28.7k [00:00<00:00, 51.9MB/s]\n",
"Downloading readme: 100% 27.9k/27.9k [00:00<00:00, 48.0MB/s]\n",
"Downloading data: 100% 377k/377k [00:00<00:00, 12.0MB/s]\n",
"Generating train split: 100% 8551/8551 [00:00<00:00, 19744.58 examples/s]\n",
"Generating validation split: 100% 1043/1043 [00:00<00:00, 27057.01 examples/s]\n",
"Generating test split: 100% 1063/1063 [00:00<00:00, 22705.17 examples/s]\n",
"2023-11-29:11:57:11,698 INFO [task.py:355] Building contexts for task on rank 0...\n",
"2023-11-29:11:57:11,704 INFO [evaluator.py:319] Running loglikelihood requests\n",
"100% 20/20 [00:03<00:00, 5.15it/s]\n",
"fatal: not a git repository (or any of the parent directories): .git\n",
"hf (pretrained=EleutherAI/pythia-2.8b), gen_kwargs: (), limit: 10.0, num_fewshot: None, batch_size: 1\n",
"| Tasks |Version|Filter|n-shot|Metric|Value| |Stderr|\n",
"|---------------|-------|------|-----:|------|----:|---|-----:|\n",
"|yes_or_no_tasks|N/A |none | 0|acc | 0.7|± |0.1528|\n",
"| - demo_cola |Yaml |none | 0|acc | 0.7|± |0.1528|\n",
"\n",
"| Groups |Version|Filter|n-shot|Metric|Value| |Stderr|\n",
"|---------------|-------|------|-----:|------|----:|---|-----:|\n",
"|yes_or_no_tasks|N/A |none | 0|acc | 0.7|± |0.1528|\n",
"\n"
]
}
],
"source": [
"# !accelerate launch --no_python\n",
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --include_path ./ \\\n",
" --tasks yes_or_no_tasks \\\n",
" --limit 10 \\\n",
" --output output/yes_or_no_tasks/ \\\n",
" --log_samples"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XceRKCuuDtbn"
},
"source": [
"## Edit Prompt Templates Quickly\n",
"\n",
"The following is a yaml made to evaluate the specific subtask of `high_school_geography` from MMLU. It uses the standard prompt where the we choose the letters from the options with most likelihood as the model's prediction."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"id": "GTFvdt9kSlBG"
},
"outputs": [],
"source": [
"YAML_mmlu_geo_string = \"\"\"\n",
"task: demo_mmlu_high_school_geography\n",
"dataset_path: cais/mmlu\n",
"dataset_name: high_school_geography\n",
"description: \"The following are multiple choice questions (with answers) about high school geography.\\n\\n\"\n",
"test_split: test\n",
"fewshot_split: dev\n",
"fewshot_config:\n",
" sampler: first_n\n",
"output_type: multiple_choice\n",
"doc_to_text: \"{{question.strip()}}\\nA. {{choices[0]}}\\nB. {{choices[1]}}\\nC. {{choices[2]}}\\nD. {{choices[3]}}\\nAnswer:\"\n",
"doc_to_choice: [\"A\", \"B\", \"C\", \"D\"]\n",
"doc_to_target: answer\n",
"metric_list:\n",
" - metric: acc\n",
" aggregation: mean\n",
" higher_is_better: true\n",
" - metric: acc_norm\n",
" aggregation: mean\n",
" higher_is_better: true\n",
"\"\"\"\n",
"with open(\"mmlu_high_school_geography.yaml\", \"w\") as f:\n",
" f.write(YAML_mmlu_geo_string)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"id": "jyKOfCsKb-xy"
},
"outputs": [
{
"cell_type": "markdown",
"metadata": {
"id": "jyKOfCsKb-xy"
},
"source": [
"We could also evaluate this task in a different way. For example, instead of observing the loglikelihood of the letters, we can instead evaluate on the choices themselves as the continuation. This is done by simply changing `doc_to_choice` from a list of letters to the corresponding `choices` field from the HF dataset. We write `\"{{choices}}\"` so that the string field is interpreted as jinja string that acquires the list from the HF dataset directly.\n",
"\n",
"Another convenient feature here is since we're only modifying the `doc_to_choice` and the rest of config is the same as the task above, we can use the above configuration as a template by using `include: mmlu_high_school_geography.yaml` to load the config from that file. We'll need to add a unique task name as to not colide with the existing yaml config we're including. For this case we'll simply name this one `mmlu_high_school_geography_continuation`. `doc_to_text` is added here just for sake of clarity."
]
},
"name": "stdout",
"output_type": "stream",
"text": [
"2023-11-29:11:57:23,598 INFO [utils.py:160] NumExpr defaulting to 2 threads.\n",
"2023-11-29 11:57:24.719750: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"2023-11-29 11:57:24.719806: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
"2023-11-29 11:57:24.719847: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
"2023-11-29 11:57:26.656125: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
"2023-11-29:11:57:31,563 INFO [__main__.py:132] Verbosity set to INFO\n",
"2023-11-29:11:57:40,541 WARNING [__main__.py:138] --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.\n",
"2023-11-29:11:57:40,541 INFO [__main__.py:143] Including path: ./\n",
"2023-11-29:11:57:40,558 INFO [__main__.py:205] Selected Tasks: ['demo_mmlu_high_school_geography']\n",
"2023-11-29:11:57:40,559 WARNING [evaluator.py:93] generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.\n",
"2023-11-29:11:57:40,589 INFO [huggingface.py:120] Using device 'cuda'\n",
"Downloading builder script: 100% 5.84k/5.84k [00:00<00:00, 17.7MB/s]\n",
"Downloading metadata: 100% 106k/106k [00:00<00:00, 892kB/s] \n",
"Downloading readme: 100% 39.7k/39.7k [00:00<00:00, 631kB/s]\n",
"Downloading data: 100% 166M/166M [00:01<00:00, 89.0MB/s]\n",
"Generating auxiliary_train split: 100% 99842/99842 [00:07<00:00, 12536.83 examples/s]\n",
"Generating test split: 100% 198/198 [00:00<00:00, 1439.20 examples/s]\n",
"Generating validation split: 100% 22/22 [00:00<00:00, 4181.76 examples/s]\n",
"Generating dev split: 100% 5/5 [00:00<00:00, 36.25 examples/s]\n",
"2023-11-29:11:58:09,798 INFO [task.py:355] Building contexts for task on rank 0...\n",
"2023-11-29:11:58:09,822 INFO [evaluator.py:319] Running loglikelihood requests\n",
"100% 40/40 [00:05<00:00, 7.86it/s]\n",
"fatal: not a git repository (or any of the parent directories): .git\n",
"hf (pretrained=EleutherAI/pythia-2.8b), gen_kwargs: (), limit: 10.0, num_fewshot: None, batch_size: 1\n",
"| Tasks |Version|Filter|n-shot| Metric |Value| |Stderr|\n",
"|-------------------------------|-------|------|-----:|--------|----:|---|-----:|\n",
"|demo_mmlu_high_school_geography|Yaml |none | 0|acc | 0.3|± |0.1528|\n",
"| | |none | 0|acc_norm| 0.3|± |0.1528|\n",
"\n"
]
}
],
"source": [
"# !accelerate launch --no_python\n",
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --include_path ./ \\\n",
" --tasks demo_mmlu_high_school_geography \\\n",
" --limit 10 \\\n",
" --output output/mmlu_high_school_geography/ \\\n",
" --log_samples"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jyKOfCsKb-xy"
},
"source": [
"We could also evaluate this task in a different way. For example, instead of observing the loglikelihood of the letters, we can instead evaluate on the choices themselves as the continuation. This is done by simply changing `doc_to_choice` from a list of letters to the corresponding `choices` field from the HF dataset. We write `\"{{choices}}\"` so that the string field is interpreted as jinja string that acquires the list from the HF dataset directly.\n",
"\n",
"Another convenient feature here is since we're only modifying the `doc_to_choice` and the rest of config is the same as the task above, we can use the above configuration as a template by using `include: mmlu_high_school_geography.yaml` to load the config from that file. We'll need to add a unique task name as to not colide with the existing yaml config we're including. For this case we'll simply name this one `mmlu_high_school_geography_continuation`. `doc_to_text` is added here just for sake of clarity."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"id": "lqElwU54TaK-"
},
"outputs": [],
"source": [
"YAML_mmlu_geo_string = \"\"\"\n",
"include: mmlu_high_school_geography.yaml\n",
"task: demo_mmlu_high_school_geography_continuation\n",
"doc_to_text: \"{{question.strip()}}\\nA. {{choices[0]}}\\nB. {{choices[1]}}\\nC. {{choices[2]}}\\nD. {{choices[3]}}\\nAnswer:\"\n",
"doc_to_choice: \"{{choices}}\"\n",
"\"\"\"\n",
"with open(\"mmlu_high_school_geography_continuation.yaml\", \"w\") as f:\n",
" f.write(YAML_mmlu_geo_string)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"id": "-_CVnDirdy7j"
},
"outputs": [
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"id": "lqElwU54TaK-"
},
"outputs": [],
"source": [
"YAML_mmlu_geo_string = '''\n",
"include: mmlu_high_school_geography.yaml\n",
"task: demo_mmlu_high_school_geography_continuation\n",
"doc_to_text: \"{{question.strip()}}\\nA. {{choices[0]}}\\nB. {{choices[1]}}\\nC. {{choices[2]}}\\nD. {{choices[3]}}\\nAnswer:\"\n",
"doc_to_choice: \"{{choices}}\"\n",
"'''\n",
"with open('mmlu_high_school_geography_continuation.yaml', 'w') as f:\n",
" f.write(YAML_mmlu_geo_string)\n"
]
},
"name": "stdout",
"output_type": "stream",
"text": [
"2023-11-29:11:58:21,284 INFO [utils.py:160] NumExpr defaulting to 2 threads.\n",
"2023-11-29 11:58:22.850159: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"2023-11-29 11:58:22.850219: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
"2023-11-29 11:58:22.850254: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
"2023-11-29 11:58:24.948103: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
"2023-11-29:11:58:28,460 INFO [__main__.py:132] Verbosity set to INFO\n",
"2023-11-29:11:58:37,935 WARNING [__main__.py:138] --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.\n",
"2023-11-29:11:58:37,935 INFO [__main__.py:143] Including path: ./\n",
"2023-11-29:11:58:37,969 INFO [__main__.py:205] Selected Tasks: ['demo_mmlu_high_school_geography_continuation']\n",
"2023-11-29:11:58:37,972 WARNING [evaluator.py:93] generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.\n",
"2023-11-29:11:58:38,008 INFO [huggingface.py:120] Using device 'cuda'\n",
"2023-11-29:11:58:59,758 INFO [task.py:355] Building contexts for task on rank 0...\n",
"2023-11-29:11:58:59,777 INFO [evaluator.py:319] Running loglikelihood requests\n",
"100% 40/40 [00:02<00:00, 16.23it/s]\n",
"fatal: not a git repository (or any of the parent directories): .git\n",
"hf (pretrained=EleutherAI/pythia-2.8b), gen_kwargs: (), limit: 10.0, num_fewshot: None, batch_size: 1\n",
"| Tasks |Version|Filter|n-shot| Metric |Value| |Stderr|\n",
"|--------------------------------------------|-------|------|-----:|--------|----:|---|-----:|\n",
"|demo_mmlu_high_school_geography_continuation|Yaml |none | 0|acc | 0.1|± |0.1000|\n",
"| | |none | 0|acc_norm| 0.2|± |0.1333|\n",
"\n"
]
}
],
"source": [
"# !accelerate launch --no_python\n",
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --include_path ./ \\\n",
" --tasks demo_mmlu_high_school_geography_continuation \\\n",
" --limit 10 \\\n",
" --output output/mmlu_high_school_geography_continuation/ \\\n",
" --log_samples"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-_CVnDirdy7j"
},
"source": [
"If we take a look at the samples, we can see that it is in fact evaluating the continuation based on the choices rather than the letters."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"id": "duBDqC6PAdjL"
},
"outputs": [
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"id": "-_CVnDirdy7j"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2023-11-29:11:58:21,284 INFO [utils.py:160] NumExpr defaulting to 2 threads.\n",
"2023-11-29 11:58:22.850159: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"2023-11-29 11:58:22.850219: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
"2023-11-29 11:58:22.850254: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
"2023-11-29 11:58:24.948103: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
"2023-11-29:11:58:28,460 INFO [__main__.py:132] Verbosity set to INFO\n",
"2023-11-29:11:58:37,935 WARNING [__main__.py:138] --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.\n",
"2023-11-29:11:58:37,935 INFO [__main__.py:143] Including path: ./\n",
"2023-11-29:11:58:37,969 INFO [__main__.py:205] Selected Tasks: ['demo_mmlu_high_school_geography_continuation']\n",
"2023-11-29:11:58:37,972 WARNING [evaluator.py:93] generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.\n",
"2023-11-29:11:58:38,008 INFO [huggingface.py:120] Using device 'cuda'\n",
"2023-11-29:11:58:59,758 INFO [task.py:355] Building contexts for task on rank 0...\n",
"2023-11-29:11:58:59,777 INFO [evaluator.py:319] Running loglikelihood requests\n",
"100% 40/40 [00:02<00:00, 16.23it/s]\n",
"fatal: not a git repository (or any of the parent directories): .git\n",
"hf (pretrained=EleutherAI/pythia-2.8b), gen_kwargs: (), limit: 10.0, num_fewshot: None, batch_size: 1\n",
"| Tasks |Version|Filter|n-shot| Metric |Value| |Stderr|\n",
"|--------------------------------------------|-------|------|-----:|--------|----:|---|-----:|\n",
"|demo_mmlu_high_school_geography_continuation|Yaml |none | 0|acc | 0.1|± |0.1000|\n",
"| | |none | 0|acc_norm| 0.2|± |0.1333|\n",
"\n"
]
}
],
"source": [
"# !accelerate launch --no_python\n",
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --include_path ./ \\\n",
" --tasks demo_mmlu_high_school_geography_continuation \\\n",
" --limit 10 \\\n",
" --output output/mmlu_high_school_geography_continuation/ \\\n",
" --log_samples\n"
"data": {
"application/javascript": "\n ((filepath) => {{\n if (!google.colab.kernel.accessAllowed) {{\n return;\n }}\n google.colab.files.view(filepath);\n }})(\"/content/output/mmlu_high_school_geography_continuation/pretrained__EleutherAI__pythia-2.8b_demo_mmlu_high_school_geography_continuation.jsonl\")",
"text/plain": [
"<IPython.core.display.Javascript object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from google.colab import files\n",
"\n",
"\n",
"files.view(\n",
" \"output/mmlu_high_school_geography_continuation/pretrained__EleutherAI__pythia-2.8b_demo_mmlu_high_school_geography_continuation.jsonl\"\n",
")"
]
},
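{
"cell_type": "markdown",
"metadata": {},
"source": [
"`files.view` is specific to Colab. Outside of Colab, the logged samples can be inspected with just the standard library. The sketch below assumes the run above wrote its samples to the same path passed to `files.view`; the exact key names inside each record are an assumption and may vary between harness versions:\n",
"\n",
"```python\n",
"# Minimal, Colab-free way to peek at the logged samples.\n",
"# Path taken from the files.view call above; record key names are assumptions.\n",
"import json\n",
"\n",
"path = (\n",
"    \"output/mmlu_high_school_geography_continuation/\"\n",
"    \"pretrained__EleutherAI__pythia-2.8b_demo_mmlu_high_school_geography_continuation.jsonl\"\n",
")\n",
"\n",
"with open(path) as f:\n",
"    first = json.loads(f.readline())\n",
"\n",
"print(sorted(first.keys()))\n",
"# If present, the request arguments should show the full answer strings\n",
"# (e.g. the text of each choice) rather than the letters A-D.\n",
"print(first.get(\"arguments\"))\n",
"```"
]
},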
{
"cell_type": "markdown",
"metadata": {
"id": "6p0-KPwAgK5j"
},
"source": [
"## Closer Look at YAML Fields\n",
"\n",
"To prepare a task we can simply fill in a YAML config with the relevant information.\n",
"\n",
"`output_type`\n",
"The current provided evaluation types comprise of the following:\n",
"1. `loglikelihood`: Evaluates the loglikelihood of a continuation, conditioned on some input string.\n",
"2. `loglikelihood_rolling`: evaluate the loglikelihood of producing a string, conditioned on the empty string. (Used for perplexity evaluations)\n",
"3. `multiple_choice`: Evaluates loglikelihood among the a number of choices predicted by the model.\n",
"4. `greedy_until`: Model outputs greedy generation (can be configured to to use beam search and other generation-related parameters)\n",
"\n",
"The core prompt revolves around 3 fields.\n",
"1. `doc_to_text`: Denotes the prompt template that will be used as input to the model.\n",
"2. `doc_to_choice`: Available choices that will be used as continuation for the model. This is used when the `output_type` is `multiple_choice`, and otherwise can be left as `None`.\n",
"3. `doc_to_target`: When `output_type` is `multiple_choice`, this can be an index that corresponds to the correct answer, or the answer string itself (must be a subset of `doc_to_choice`). For other tasks, this is expected to be a string. You can fill this field with a feature name from the HF dataset so long as the resulting feature follows the conditioned described.\n",
"\n",
"These three fields can be expressed as strings, column names from the source dataset, or as Jinja2 templates that can use fields from the source dataset as variables.\n"
]
},
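{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make these three fields concrete, here is a small standalone sketch of how one dataset row becomes a context string and a set of scored continuations for a `multiple_choice` task. It uses Jinja2 directly (already a dependency of the harness) rather than the harness's internal code, and the example row is invented purely for illustration:\n",
"\n",
"```python\n",
"# Standalone illustration of doc_to_text / doc_to_choice / doc_to_target.\n",
"# Not the harness's internals; the example row below is made up.\n",
"from jinja2 import Template\n",
"\n",
"doc = {\n",
"    \"question\": \"Which of these is a renewable energy source?\",\n",
"    \"choices\": [\"Coal\", \"Natural gas\", \"Solar\", \"Diesel\"],\n",
"    \"answer\": 2,\n",
"}\n",
"\n",
"doc_to_text = (\n",
"    \"{{question.strip()}}\\nA. {{choices[0]}}\\nB. {{choices[1]}}\"\n",
"    \"\\nC. {{choices[2]}}\\nD. {{choices[3]}}\\nAnswer:\"\n",
")\n",
"doc_to_choice = [\"A\", \"B\", \"C\", \"D\"]\n",
"\n",
"context = Template(doc_to_text).render(**doc)\n",
"gold = doc_to_choice[doc[\"answer\"]]\n",
"\n",
"# Each choice is scored as a continuation of the context; the harness joins\n",
"# context and continuation with a target delimiter (a single space by default).\n",
"for choice in doc_to_choice:\n",
"    marker = \"  <- gold\" if choice == gold else \"\"\n",
"    print(repr(context), \"+\", repr(\" \" + choice) + marker)\n",
"```"
]
},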
{
"cell_type": "markdown",
"metadata": {
"id": "6p0-KPwAgK5j"
},
"source": [
"## What if Jinja is not Sufficient?\n",
"\n",
"There can be times where the Jinja2 templating language is not enough to make the prompt we had in mind. There are a few ways to circumvent this limitation:\n",
"\n",
"1. Use `!function` operator for the prompt-related fields to pass a python function that takes as input the dataset row, and will output the prompt template component.\n",
"2. Perform a transformation on the dataset beforehand."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below, we show an example of using `!function` to create `doc_to_text` from a python function:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "DYZ5c0JhR1lJ",
"outputId": "ca945235-fb9e-4f17-8bfa-78e7d6ec1490"
},
"outputs": [
{
"cell_type": "markdown",
"metadata": {
"id": "-_CVnDirdy7j"
},
"source": [
"If we take a look at the samples, we can see that it is in fact evaluating the continuation based on the choices rather than the letters."
]
"name": "stdout",
"output_type": "stream",
"text": [
"2023-11-29:11:59:08,312 INFO [utils.py:160] NumExpr defaulting to 2 threads.\n",
"2023-11-29 11:59:09.348327: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"2023-11-29 11:59:09.348387: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
"2023-11-29 11:59:09.348421: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
"2023-11-29 11:59:10.573752: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
"2023-11-29:11:59:14,044 INFO [__main__.py:132] Verbosity set to INFO\n",
"2023-11-29:11:59:23,654 WARNING [__main__.py:138] --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.\n",
"2023-11-29:11:59:23,654 INFO [__main__.py:143] Including path: ./\n",
"2023-11-29:11:59:23,678 INFO [__main__.py:205] Selected Tasks: ['demo_mmlu_high_school_geography_function_prompt']\n",
"2023-11-29:11:59:23,679 WARNING [evaluator.py:93] generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.\n",
"2023-11-29:11:59:23,708 INFO [huggingface.py:120] Using device 'cuda'\n",
"2023-11-29:11:59:44,516 INFO [task.py:355] Building contexts for task on rank 0...\n",
"2023-11-29:11:59:44,524 INFO [evaluator.py:319] Running loglikelihood requests\n",
"100% 40/40 [00:02<00:00, 15.41it/s]\n",
"fatal: not a git repository (or any of the parent directories): .git\n",
"hf (pretrained=EleutherAI/pythia-2.8b), gen_kwargs: (), limit: 10.0, num_fewshot: None, batch_size: 1\n",
"| Tasks |Version|Filter|n-shot| Metric |Value| |Stderr|\n",
"|-----------------------------------------------|-------|------|-----:|--------|----:|---|-----:|\n",
"|demo_mmlu_high_school_geography_function_prompt|Yaml |none | 0|acc | 0.1|± |0.1000|\n",
"| | |none | 0|acc_norm| 0.2|± |0.1333|\n",
"\n"
]
}
],
"source": [
"YAML_mmlu_geo_string = \"\"\"\n",
"include: mmlu_high_school_geography.yaml\n",
"task: demo_mmlu_high_school_geography_function_prompt\n",
"doc_to_text: !function utils.doc_to_text\n",
"doc_to_choice: \"{{choices}}\"\n",
"\"\"\"\n",
"with open(\"demo_mmlu_high_school_geography_function_prompt.yaml\", \"w\") as f:\n",
" f.write(YAML_mmlu_geo_string)\n",
"\n",
"DOC_TO_TEXT = \"\"\"\n",
"def doc_to_text(x):\n",
" question = x[\"question\"].strip()\n",
" choices = x[\"choices\"]\n",
" option_a = choices[0]\n",
" option_b = choices[1]\n",
" option_c = choices[2]\n",
" option_d = choices[3]\n",
" return f\"{question}\\\\nA. {option_a}\\\\nB. {option_b}\\\\nC. {option_c}\\\\nD. {option_d}\\\\nAnswer:\"\n",
"\"\"\"\n",
"with open(\"utils.py\", \"w\") as f:\n",
" f.write(DOC_TO_TEXT)\n",
"\n",
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --include_path ./ \\\n",
" --tasks demo_mmlu_high_school_geography_function_prompt \\\n",
" --limit 10 \\\n",
" --output output/demo_mmlu_high_school_geography_function_prompt/ \\\n",
" --log_samples"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we'll also show how to do this via preprocessing the dataset as necessary using the `process_docs` config field:\n",
"\n",
"We will write a function that will modify each document in our evaluation dataset's split to add a field that is suitable for us to use in `doc_to_text`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"YAML_mmlu_geo_string = \"\"\"\n",
"include: mmlu_high_school_geography.yaml\n",
"task: demo_mmlu_high_school_geography_function_prompt_2\n",
"process_docs: !function utils_process_docs.process_docs\n",
"doc_to_text: \"{{input}}\"\n",
"doc_to_choice: \"{{choices}}\"\n",
"\"\"\"\n",
"with open(\"demo_mmlu_high_school_geography_process_docs.yaml\", \"w\") as f:\n",
" f.write(YAML_mmlu_geo_string)\n",
"\n",
"DOC_TO_TEXT = \"\"\"\n",
"def process_docs(dataset):\n",
" def _process_doc(x):\n",
" question = x[\"question\"].strip()\n",
" choices = x[\"choices\"]\n",
" option_a = choices[0]\n",
" option_b = choices[1]\n",
" option_c = choices[2]\n",
" option_d = choices[3]\n",
" doc[\"input\"] = f\"{question}\\\\nA. {option_a}\\\\nB. {option_b}\\\\nC. {option_c}\\\\nD. {option_d}\\\\nAnswer:\"\n",
" return out_doc\n",
"\n",
" return dataset.map(_process_doc)\n",
"\"\"\"\n",
"\n",
"with open(\"utils_process_docs.py\", \"w\") as f:\n",
" f.write(DOC_TO_TEXT)\n",
"\n",
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --include_path ./ \\\n",
" --tasks demo_mmlu_high_school_geography_function_prompt_2 \\\n",
" --limit 10 \\\n",
" --output output/demo_mmlu_high_school_geography_function_prompt_2/ \\\n",
" --log_samples"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We hope that this explainer gives you a sense of what can be done with and how to work with LM-Evaluation-Harnes v0.4.0 ! \n",
"\n",
"For more information, check out our documentation pages in the `docs/` folder, and if you have questions, please raise them in GitHub issues, or in #lm-thunderdome or #release-discussion on the EleutherAI discord server."
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"collapsed_sections": [
"zAov81vTbL2K"
],
"gpuType": "T4",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"46f521b73fd943c081c648fd873ebc0a": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "1.5.0",
"model_name": "DescriptionStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"id": "duBDqC6PAdjL"
},
"outputs": [
{
"data": {
"application/javascript": "\n ((filepath) => {{\n if (!google.colab.kernel.accessAllowed) {{\n return;\n }}\n google.colab.files.view(filepath);\n }})(\"/content/output/mmlu_high_school_geography_continuation/pretrained__EleutherAI__pythia-2.8b_demo_mmlu_high_school_geography_continuation.jsonl\")",
"text/plain": [
"<IPython.core.display.Javascript object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from google.colab import files\n",
"files.view(\"output/mmlu_high_school_geography_continuation/pretrained__EleutherAI__pythia-2.8b_demo_mmlu_high_school_geography_continuation.jsonl\")\n"
]
"48763b6233374554ae76035c0483066f": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "1.5.0",
"model_name": "ProgressStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "ProgressStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"bar_color": null,
"description_width": ""
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "6p0-KPwAgK5j"
},
"source": [
"## Closer Look at YAML Fields\n",
"\n",
"To prepare a task we can simply fill in a YAML config with the relevant information.\n",
"\n",
"`output_type`\n",
"The current provided evaluation types comprise of the following:\n",
"1. `loglikelihood`: Evaluates the loglikelihood of a continuation, conditioned on some input string.\n",
"2. `loglikelihood_rolling`: evaluate the loglikelihood of producing a string, conditioned on the empty string. (Used for perplexity evaluations)\n",
"3. `multiple_choice`: Evaluates loglikelihood among the a number of choices predicted by the model.\n",
"4. `greedy_until`: Model outputs greedy generation (can be configured to to use beam search and other generation-related parameters)\n",
"\n",
"The core prompt revolves around 3 fields.\n",
"1. `doc_to_text`: Denotes the prompt template that will be used as input to the model.\n",
"2. `doc_to_choice`: Available choices that will be used as continuation for the model. This is used when the `output_type` is `multiple_choice`, and otherwise can be left as `None`.\n",
"3. `doc_to_target`: When `output_type` is `multiple_choice`, this can be an index that corresponds to the correct answer, or the answer string itself (must be a subset of `doc_to_choice`). For other tasks, this is expected to be a string. You can fill this field with a feature name from the HF dataset so long as the resulting feature follows the conditioned described.\n",
"\n",
"These three fields can be expressed as strings, column names from the source dataset, or as Jinja2 templates that can use fields from the source dataset as variables.\n"
]
"4986a21eb560448fa79f4b25cde48951": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "1.2.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "6p0-KPwAgK5j"
},
"source": [
"## What if Jinja is not Sufficient?\n",
"\n",
"There can be times where the Jinja2 templating language is not enough to make the prompt we had in mind. There are a few ways to circumvent this limitation:\n",
"\n",
"1. Use `!function` operator for the prompt-related fields to pass a python function that takes as input the dataset row, and will output the prompt template component.\n",
"2. Perform a transformation on the dataset beforehand."
]
"6b2d90209ec14230b3d58a74ac9b83bf": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "1.2.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below, we show an example of using `!function` to create `doc_to_text` from a python function:"
]
"7c5689bc13684db8a22681f41863dddd": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "1.2.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "DYZ5c0JhR1lJ",
"outputId": "ca945235-fb9e-4f17-8bfa-78e7d6ec1490"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2023-11-29:11:59:08,312 INFO [utils.py:160] NumExpr defaulting to 2 threads.\n",
"2023-11-29 11:59:09.348327: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"2023-11-29 11:59:09.348387: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
"2023-11-29 11:59:09.348421: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
"2023-11-29 11:59:10.573752: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
"2023-11-29:11:59:14,044 INFO [__main__.py:132] Verbosity set to INFO\n",
"2023-11-29:11:59:23,654 WARNING [__main__.py:138] --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.\n",
"2023-11-29:11:59:23,654 INFO [__main__.py:143] Including path: ./\n",
"2023-11-29:11:59:23,678 INFO [__main__.py:205] Selected Tasks: ['demo_mmlu_high_school_geography_function_prompt']\n",
"2023-11-29:11:59:23,679 WARNING [evaluator.py:93] generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.\n",
"2023-11-29:11:59:23,708 INFO [huggingface.py:120] Using device 'cuda'\n",
"2023-11-29:11:59:44,516 INFO [task.py:355] Building contexts for task on rank 0...\n",
"2023-11-29:11:59:44,524 INFO [evaluator.py:319] Running loglikelihood requests\n",
"100% 40/40 [00:02<00:00, 15.41it/s]\n",
"fatal: not a git repository (or any of the parent directories): .git\n",
"hf (pretrained=EleutherAI/pythia-2.8b), gen_kwargs: (), limit: 10.0, num_fewshot: None, batch_size: 1\n",
"| Tasks |Version|Filter|n-shot| Metric |Value| |Stderr|\n",
"|-----------------------------------------------|-------|------|-----:|--------|----:|---|-----:|\n",
"|demo_mmlu_high_school_geography_function_prompt|Yaml |none | 0|acc | 0.1|± |0.1000|\n",
"| | |none | 0|acc_norm| 0.2|± |0.1333|\n",
"\n"
]
}
"a1d3a8aa016544a78e8821c8f6199e06": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "1.5.0",
"model_name": "HBoxModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HBoxModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HBoxView",
"box_style": "",
"children": [
"IPY_MODEL_f61ed33fad754146bdd2ac9db1ba1c48",
"IPY_MODEL_bfa0af6aeff344c6845e1080a878e92e",
"IPY_MODEL_fd1ad9e0367d4004aae853b91c3a7617"
],
"source": [
"YAML_mmlu_geo_string = '''\n",
"include: mmlu_high_school_geography.yaml\n",
"task: demo_mmlu_high_school_geography_function_prompt\n",
"doc_to_text: !function utils.doc_to_text\n",
"doc_to_choice: \"{{choices}}\"\n",
"'''\n",
"with open('demo_mmlu_high_school_geography_function_prompt.yaml', 'w') as f:\n",
" f.write(YAML_mmlu_geo_string)\n",
"\n",
"DOC_TO_TEXT = '''\n",
"def doc_to_text(x):\n",
" question = x[\"question\"].strip()\n",
" choices = x[\"choices\"]\n",
" option_a = choices[0]\n",
" option_b = choices[1]\n",
" option_c = choices[2]\n",
" option_d = choices[3]\n",
" return f\"{question}\\\\nA. {option_a}\\\\nB. {option_b}\\\\nC. {option_c}\\\\nD. {option_d}\\\\nAnswer:\"\n",
"'''\n",
"with open('utils.py', 'w') as f:\n",
" f.write(DOC_TO_TEXT)\n",
"\n",
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --include_path ./ \\\n",
" --tasks demo_mmlu_high_school_geography_function_prompt \\\n",
" --limit 10 \\\n",
" --output output/demo_mmlu_high_school_geography_function_prompt/ \\\n",
" --log_samples\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we'll also show how to do this via preprocessing the dataset as necessary using the `process_docs` config field:\n",
"\n",
"We will write a function that will modify each document in our evaluation dataset's split to add a field that is suitable for us to use in `doc_to_text`."
]
"layout": "IPY_MODEL_6b2d90209ec14230b3d58a74ac9b83bf"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"YAML_mmlu_geo_string = '''\n",
"include: mmlu_high_school_geography.yaml\n",
"task: demo_mmlu_high_school_geography_function_prompt_2\n",
"process_docs: !function utils_process_docs.process_docs\n",
"doc_to_text: \"{{input}}\"\n",
"doc_to_choice: \"{{choices}}\"\n",
"'''\n",
"with open('demo_mmlu_high_school_geography_process_docs.yaml', 'w') as f:\n",
" f.write(YAML_mmlu_geo_string)\n",
"\n",
"DOC_TO_TEXT = '''\n",
"def process_docs(dataset):\n",
" def _process_doc(x):\n",
" question = x[\"question\"].strip()\n",
" choices = x[\"choices\"]\n",
" option_a = choices[0]\n",
" option_b = choices[1]\n",
" option_c = choices[2]\n",
" option_d = choices[3]\n",
" doc[\"input\"] = f\"{question}\\\\nA. {option_a}\\\\nB. {option_b}\\\\nC. {option_c}\\\\nD. {option_d}\\\\nAnswer:\"\n",
" return out_doc\n",
"\n",
" return dataset.map(_process_doc)\n",
"'''\n",
"\n",
"with open('utils_process_docs.py', 'w') as f:\n",
" f.write(DOC_TO_TEXT)\n",
"\n",
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
" --include_path ./ \\\n",
" --tasks demo_mmlu_high_school_geography_function_prompt_2 \\\n",
" --limit 10 \\\n",
" --output output/demo_mmlu_high_school_geography_function_prompt_2/ \\\n",
" --log_samples\n"
]
"a73f357065d34d7baf0453ae4a8d75e2": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "1.2.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We hope that this explainer gives you a sense of what can be done with and how to work with LM-Evaluation-Harnes v0.4.0 ! \n",
"\n",
"For more information, check out our documentation pages in the `docs/` folder, and if you have questions, please raise them in GitHub issues, or in #lm-thunderdome or #release-discussion on the EleutherAI discord server."
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"collapsed_sections": [
"zAov81vTbL2K"
],
"gpuType": "T4",
"provenance": []
"aed3acd2f2d74003b44079c333a0698e": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "1.5.0",
"model_name": "DescriptionStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
"bfa0af6aeff344c6845e1080a878e92e": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "1.5.0",
"model_name": "FloatProgressModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "FloatProgressModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "ProgressView",
"bar_style": "success",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_7c5689bc13684db8a22681f41863dddd",
"max": 5669,
"min": 0,
"orientation": "horizontal",
"style": "IPY_MODEL_48763b6233374554ae76035c0483066f",
"value": 5669
}
},
"language_info": {
"name": "python"
"f61ed33fad754146bdd2ac9db1ba1c48": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "1.5.0",
"model_name": "HTMLModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_a73f357065d34d7baf0453ae4a8d75e2",
"placeholder": "​",
"style": "IPY_MODEL_46f521b73fd943c081c648fd873ebc0a",
"value": "Downloading builder script: 100%"
}
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"46f521b73fd943c081c648fd873ebc0a": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "1.5.0",
"model_name": "DescriptionStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"48763b6233374554ae76035c0483066f": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "1.5.0",
"model_name": "ProgressStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "ProgressStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"bar_color": null,
"description_width": ""
}
},
"4986a21eb560448fa79f4b25cde48951": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "1.2.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"6b2d90209ec14230b3d58a74ac9b83bf": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "1.2.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"7c5689bc13684db8a22681f41863dddd": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "1.2.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"a1d3a8aa016544a78e8821c8f6199e06": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "1.5.0",
"model_name": "HBoxModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HBoxModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HBoxView",
"box_style": "",
"children": [
"IPY_MODEL_f61ed33fad754146bdd2ac9db1ba1c48",
"IPY_MODEL_bfa0af6aeff344c6845e1080a878e92e",
"IPY_MODEL_fd1ad9e0367d4004aae853b91c3a7617"
],
"layout": "IPY_MODEL_6b2d90209ec14230b3d58a74ac9b83bf"
}
},
"a73f357065d34d7baf0453ae4a8d75e2": {
"model_module": "@jupyter-widgets/base",
"model_module_version": "1.2.0",
"model_name": "LayoutModel",
"state": {
"_model_module": "@jupyter-widgets/base",
"_model_module_version": "1.2.0",
"_model_name": "LayoutModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "LayoutView",
"align_content": null,
"align_items": null,
"align_self": null,
"border": null,
"bottom": null,
"display": null,
"flex": null,
"flex_flow": null,
"grid_area": null,
"grid_auto_columns": null,
"grid_auto_flow": null,
"grid_auto_rows": null,
"grid_column": null,
"grid_gap": null,
"grid_row": null,
"grid_template_areas": null,
"grid_template_columns": null,
"grid_template_rows": null,
"height": null,
"justify_content": null,
"justify_items": null,
"left": null,
"margin": null,
"max_height": null,
"max_width": null,
"min_height": null,
"min_width": null,
"object_fit": null,
"object_position": null,
"order": null,
"overflow": null,
"overflow_x": null,
"overflow_y": null,
"padding": null,
"right": null,
"top": null,
"visibility": null,
"width": null
}
},
"aed3acd2f2d74003b44079c333a0698e": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "1.5.0",
"model_name": "DescriptionStyleModel",
"state": {
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "DescriptionStyleModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/base",
"_view_module_version": "1.2.0",
"_view_name": "StyleView",
"description_width": ""
}
},
"bfa0af6aeff344c6845e1080a878e92e": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "1.5.0",
"model_name": "FloatProgressModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "FloatProgressModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "ProgressView",
"bar_style": "success",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_7c5689bc13684db8a22681f41863dddd",
"max": 5669,
"min": 0,
"orientation": "horizontal",
"style": "IPY_MODEL_48763b6233374554ae76035c0483066f",
"value": 5669
}
},
"f61ed33fad754146bdd2ac9db1ba1c48": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "1.5.0",
"model_name": "HTMLModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_a73f357065d34d7baf0453ae4a8d75e2",
"placeholder": "​",
"style": "IPY_MODEL_46f521b73fd943c081c648fd873ebc0a",
"value": "Downloading builder script: 100%"
}
},
"fd1ad9e0367d4004aae853b91c3a7617": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "1.5.0",
"model_name": "HTMLModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_4986a21eb560448fa79f4b25cde48951",
"placeholder": "​",
"style": "IPY_MODEL_aed3acd2f2d74003b44079c333a0698e",
"value": " 5.67k/5.67k [00:00&lt;00:00, 205kB/s]"
}
}
}
"fd1ad9e0367d4004aae853b91c3a7617": {
"model_module": "@jupyter-widgets/controls",
"model_module_version": "1.5.0",
"model_name": "HTMLModel",
"state": {
"_dom_classes": [],
"_model_module": "@jupyter-widgets/controls",
"_model_module_version": "1.5.0",
"_model_name": "HTMLModel",
"_view_count": null,
"_view_module": "@jupyter-widgets/controls",
"_view_module_version": "1.5.0",
"_view_name": "HTMLView",
"description": "",
"description_tooltip": null,
"layout": "IPY_MODEL_4986a21eb560448fa79f4b25cde48951",
"placeholder": "​",
"style": "IPY_MODEL_aed3acd2f2d74003b44079c333a0698e",
"value": " 5.67k/5.67k [00:00&lt;00:00, 205kB/s]"
}
}
}
},
"nbformat": 4,
"nbformat_minor": 0
}
......@@ -68,6 +68,7 @@
"source": [
"import wandb\n",
"\n",
"\n",
"wandb.login()"
]
},
......@@ -130,6 +131,7 @@
"import lm_eval\n",
"from lm_eval.loggers import WandbLogger\n",
"\n",
"\n",
"results = lm_eval.simple_evaluate(\n",
" model=\"hf\",\n",
" model_args=\"pretrained=microsoft/phi-2,trust_remote_code=True\",\n",
......
......@@ -431,7 +431,12 @@ class TemplateLM(LM):
using_default_template = False
# First, handle the cases when the model has a dict of multiple templates
template = self.tokenizer.chat_template or self.tokenizer.default_chat_template
try:
template = (
self.tokenizer.chat_template or self.tokenizer.default_chat_template
)
except AttributeError:
return None
if isinstance(template, dict):
using_default_dict = self.tokenizer.chat_template is None
......
......@@ -57,7 +57,6 @@ class TaskConfig(dict):
task: Optional[str] = None
task_alias: Optional[str] = None
tag: Optional[Union[str, list]] = None
group: Optional[Union[str, list]] = None
# HF dataset options.
# which dataset to use,
# and what splits for what purpose
......@@ -98,18 +97,6 @@ class TaskConfig(dict):
)
def __post_init__(self) -> None:
if self.group is not None:
eval_logger.warning(
"A task YAML file was found to contain a `group` key. Groups which provide aggregate scores over several subtasks now require a separate config file--if not aggregating, you may want to use the `tag` config option instead within your config. Setting `group` within a TaskConfig will be deprecated in v0.4.4. Please see https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md for more information."
)
if self.tag is None:
self.tag = self.group
else:
raise ValueError(
"Got both a `group` and `tag` entry within a TaskConfig. Please use one or the other--`group` values will be deprecated in v0.4.4."
)
if self.generation_kwargs is not None:
if self.output_type != "generate_until":
eval_logger.warning(
......@@ -1511,7 +1498,7 @@ class ConfigurableTask(Task):
# we expect multiple_targets to be a list.
elif self.multiple_target:
gold = list(gold)
elif type(gold) != type(result):
elif type(gold) is not type(result):
# cast gold to the same type as result
gold = type(result)(gold)
......@@ -1594,7 +1581,7 @@ class ConfigurableTask(Task):
f"ConfigurableTask(task_name={getattr(self.config, 'task', None)},"
f"output_type={self.OUTPUT_TYPE},"
f"num_fewshot={getattr(self.config, 'num_fewshot', None)},"
f"num_samples={len(self.eval_docs)})",
f"num_samples={len(self.eval_docs)})"
)
......
......@@ -157,6 +157,9 @@ def simple_evaluate(
seed_message.append(f"Setting torch manual seed to {torch_random_seed}")
torch.manual_seed(torch_random_seed)
if fewshot_random_seed is not None:
seed_message.append(f"Setting fewshot manual seed to {fewshot_random_seed}")
if seed_message:
eval_logger.info(" | ".join(seed_message))
......@@ -276,9 +279,6 @@ def simple_evaluate(
task_obj.set_config(key="num_fewshot", value=0)
# fewshot_random_seed set for tasks, even with a default num_fewshot (e.g. in the YAML file)
task_obj.set_fewshot_seed(seed=fewshot_random_seed)
eval_logger.info(
f"Setting fewshot random generator seed to {fewshot_random_seed}"
)
adjusted_task_dict[task_name] = task_obj
......@@ -433,10 +433,14 @@ def evaluate(
)
# end multimodality validation check
# Cache the limit arg.
limit_arg = limit
limits = []
for task_output in eval_tasks:
task: Task = task_output.task
limit = get_sample_size(task, limit)
limit = get_sample_size(task, limit_arg)
limits.append(limit)
task.build_all_requests(
limit=limit,
rank=lm.rank,
......@@ -506,7 +510,7 @@ def evaluate(
WORLD_SIZE = lm.world_size
### Postprocess outputs ###
# TODO: del model here, maybe (idea: allow user to specify device of e.g. reward model separately)
for task_output in eval_tasks:
for task_output, limit in zip(eval_tasks, limits):
task = task_output.task
task.apply_filters()
......@@ -655,7 +659,7 @@ def evaluate(
len(task_output.task.eval_docs),
),
}
for task_output in eval_tasks
for task_output, limit in zip(eval_tasks, limits)
},
}
if log_samples:
......
......@@ -73,9 +73,12 @@ class TemplateAPI(TemplateLM):
seed: int = 1234,
max_length: Optional[int] = 2048,
add_bos_token: bool = False,
custom_prefix_token_id=None,
custom_prefix_token_id: int = None,
# send the requests as tokens or strings
tokenized_requests=True,
tokenized_requests: bool = True,
trust_remote_code: bool = False,
revision: Optional[str] = "main",
use_fast_tokenizer: bool = True,
**kwargs,
) -> None:
super().__init__()
......@@ -128,7 +131,10 @@ class TemplateAPI(TemplateLM):
import transformers
self.tokenizer = transformers.AutoTokenizer.from_pretrained(
self.tokenizer if self.tokenizer else self.model
self.tokenizer if self.tokenizer else self.model,
trust_remote_code=trust_remote_code,
revision=revision,
use_fast=use_fast_tokenizer,
)
# Not used as the API will handle padding but to mirror the behavior of the HFLM
self.tokenizer = configure_pad_token(self.tokenizer)
......@@ -153,6 +159,9 @@ class TemplateAPI(TemplateLM):
assert isinstance(tokenizer, str), "tokenizer must be a string"
self.tokenizer = transformers.AutoTokenizer.from_pretrained(
tokenizer,
trust_remote_code=trust_remote_code,
revision=revision,
use_fast=use_fast_tokenizer,
)
@abc.abstractmethod
......
......@@ -26,9 +26,9 @@ class DummyLM(LM):
def generate_until(self, requests, disable_tqdm: bool = False):
res = []
for ctx, _ in tqdm(requests, disable=disable_tqdm):
for request in tqdm(requests, disable=disable_tqdm):
res.append("lol")
assert ctx.strip() != ""
assert request.arguments[0].strip() != ""
return res
......
......@@ -13,6 +13,7 @@ from lm_eval.api.registry import register_model
from lm_eval.models.huggingface import HFLM
from lm_eval.models.utils import (
Collator,
flatten_image_list,
pad_and_concat,
replace_placeholders,
stop_sequences_criteria,
......@@ -295,6 +296,11 @@ class HFMultimodalLM(HFLM):
images = [img[: self.max_images] for img in images]
if self.rgb:
images = [[img.convert("RGB") for img in sublist] for sublist in images]
# certain models like llava expect a single-level image list even for bs>1, multi-image. TODO: port this over to loglikelihoods
if getattr(self.config, "model_type", "") == "llava":
images = flatten_image_list(images)
try:
encoding = self.processor(
images=images,
......
......@@ -55,7 +55,7 @@ class HFLM(TemplateLM):
def __init__(
self,
pretrained: Union[str, transformers.PreTrainedModel],
backend: Optional[Literal["default", "causal", "seq2seq"]] = "default",
backend: Literal["default", "causal", "seq2seq"] = "default",
# override whether the model should be treated as decoder-only (causal) or encoder-decoder (seq2seq)
revision: Optional[str] = "main",
subfolder: Optional[str] = None,
......@@ -90,7 +90,6 @@ class HFLM(TemplateLM):
**kwargs,
) -> None:
super().__init__()
# optionally: take in an already-initialized transformers.PreTrainedModel
if not isinstance(pretrained, str):
eval_logger.warning(
......@@ -164,7 +163,7 @@ class HFLM(TemplateLM):
trust_remote_code=trust_remote_code,
)
# determine which of 'causal' and 'seq2seq' backends to use
# determine which of 'causal' and 'seq2seq' backends to use for HF models
self._get_backend(
config=self.config, backend=backend, trust_remote_code=trust_remote_code
)
......@@ -287,7 +286,7 @@ class HFLM(TemplateLM):
def _get_accelerate_args(
self,
parallelize: bool = None,
parallelize: Optional[bool] = None,
device_map: Optional[str] = "auto",
max_memory_per_gpu: Optional[Union[int, str]] = None,
max_cpu_memory: Optional[Union[int, str]] = None,
......@@ -441,31 +440,26 @@ class HFLM(TemplateLM):
def _get_backend(
self,
config: Union[transformers.PretrainedConfig, transformers.AutoConfig],
backend: Optional[Literal["default", "causal", "seq2seq"]] = "default",
backend: Literal["default", "causal", "seq2seq"] = "default",
trust_remote_code: Optional[bool] = False,
) -> None:
"""
Helper method during initialization.
Determines the backend ("causal" (decoder-only) or "seq2seq" (encoder-decoder))
model type to be used.
Determines the backend ("causal" (decoder-only) or "seq2seq" (encoder-decoder)) model type to be used.
sets `self.AUTO_MODEL_CLASS` appropriately if not already set.
**If not calling HFLM.__init__() or HFLM._get_backend() within a subclass of HFLM,
user must set `self.backend` to be either "causal" or "seq2seq" manually!**
"""
# escape hatch: if we're using a subclass that shouldn't follow
# the default _get_backend logic,
# then skip over the method.
# TODO: this seems very much undesirable in some cases--our code in HFLM
# references AutoModelForCausalLM at times to check for equality
if self.AUTO_MODEL_CLASS is not None:
return
assert backend in ["default", "causal", "seq2seq"]
if backend != "default":
# if we've settled on non-default backend, use that manually
if backend == "causal":
self.AUTO_MODEL_CLASS = transformers.AutoModelForCausalLM
self.backend = backend
elif backend == "seq2seq":
self.AUTO_MODEL_CLASS = transformers.AutoModelForSeq2SeqLM
self.backend = backend
eval_logger.info(
f"Overrode HF model backend type, and using type '{backend}'"
)
......@@ -478,26 +472,32 @@ class HFLM(TemplateLM):
# first check if model type is listed under seq2seq models, since some
# models like MBart are listed in both seq2seq and causal mistakenly in HF transformers.
# these special cases should be treated as seq2seq models.
self.AUTO_MODEL_CLASS = transformers.AutoModelForSeq2SeqLM
self.backend = "seq2seq"
eval_logger.info(f"Using model type '{backend}'")
elif (
getattr(self.config, "model_type") in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES
):
self.AUTO_MODEL_CLASS = transformers.AutoModelForCausalLM
self.backend = "causal"
eval_logger.info(f"Using model type '{backend}'")
else:
if not trust_remote_code:
eval_logger.warning(
"HF model type is neither marked as CausalLM or Seq2SeqLM. \
This is expected if your model requires `trust_remote_code=True` but may be an error otherwise."
"Setting backend to causal"
)
# if model type is neither in HF transformers causal or seq2seq model registries
# then we default to AutoModelForCausalLM
self.AUTO_MODEL_CLASS = transformers.AutoModelForCausalLM
# then we default to assuming AutoModelForCausalLM
self.backend = "causal"
eval_logger.info(
f"Model type cannot be determined. Using default model type '{backend}'"
)
assert self.AUTO_MODEL_CLASS in [
transformers.AutoModelForCausalLM,
transformers.AutoModelForSeq2SeqLM,
]
return None
if self.AUTO_MODEL_CLASS is None:
if self.backend == "causal":
self.AUTO_MODEL_CLASS = transformers.AutoModelForCausalLM
elif self.backend == "seq2seq":
self.AUTO_MODEL_CLASS = transformers.AutoModelForSeq2SeqLM
def _get_config(
self,
......@@ -505,6 +505,7 @@ class HFLM(TemplateLM):
revision: str = "main",
trust_remote_code: bool = False,
) -> None:
"""Return the model config for HuggingFace models"""
self._config = transformers.AutoConfig.from_pretrained(
pretrained,
revision=revision,
......@@ -703,7 +704,7 @@ class HFLM(TemplateLM):
# if OOM, then halves batch_size and tries again
@find_executable_batch_size(starting_batch_size=self.max_batch_size)
def forward_batch(batch_size):
if self.AUTO_MODEL_CLASS == transformers.AutoModelForSeq2SeqLM:
if self.backend == "seq2seq":
length = max(max_context_enc, max_cont_enc)
batched_conts = torch.ones(
(batch_size, length), device=self.device
......@@ -754,7 +755,7 @@ class HFLM(TemplateLM):
# by default for CausalLM - false or self.add_bos_token is set
if add_special_tokens is None:
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM:
if self.backend == "causal":
special_tokens_kwargs = {
"add_special_tokens": False or self.add_bos_token
}
......@@ -782,7 +783,7 @@ class HFLM(TemplateLM):
self.tokenizer.padding_side = padding_side
add_special_tokens = {}
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM:
if self.backend == "causal":
add_special_tokens = {"add_special_tokens": False or self.add_bos_token}
encoding = self.tokenizer(
......@@ -860,14 +861,14 @@ class HFLM(TemplateLM):
def _select_cont_toks(
self, logits: torch.Tensor, contlen: int = None, inplen: int = None
) -> torch.Tensor:
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM:
if self.backend == "causal":
assert (
contlen and inplen
), "Must pass input len and cont. len to select scored logits for causal LM"
# discard right-padding.
# also discard the input/context tokens. we'll only score continuations.
logits = logits[inplen - contlen : inplen]
elif self.AUTO_MODEL_CLASS == transformers.AutoModelForSeq2SeqLM:
elif self.backend == "seq2seq":
assert (
contlen and not inplen
), "Selecting scored logits for Seq2SeqLM requires only cont. len"
......@@ -990,8 +991,7 @@ class HFLM(TemplateLM):
requests,
sort_fn=_collate,
group_by="contexts"
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM
and self.logits_cache
if self.backend == "causal" and self.logits_cache
else None,
group_fn=_lookup_one_token_cont,
)
......@@ -1048,14 +1048,14 @@ class HFLM(TemplateLM):
# cont_toks 4 5 6 7 8 9 [:, -len(continuation_enc):, :self.vocab_size] slice
# when too long to fit in context, truncate from the left
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM:
if self.backend == "causal":
inp = torch.tensor(
(context_enc + continuation_enc)[-(self.max_length + 1) :][:-1],
dtype=torch.long,
device=self.device,
)
(inplen,) = inp.shape
elif self.AUTO_MODEL_CLASS == transformers.AutoModelForSeq2SeqLM:
elif self.backend == "seq2seq":
inp = torch.tensor(
(context_enc)[-self.max_length :],
dtype=torch.long,
......@@ -1095,11 +1095,11 @@ class HFLM(TemplateLM):
# create encoder attn mask and batched conts, if seq2seq
call_kwargs = {}
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM:
if self.backend == "causal":
batched_inps = pad_and_concat(
padding_len_inp, inps, padding_side="right"
) # [batch, padding_len_inp]
elif self.AUTO_MODEL_CLASS == transformers.AutoModelForSeq2SeqLM:
elif self.backend == "seq2seq":
# TODO: left-pad encoder inps and mask?
batched_inps = pad_and_concat(
padding_len_inp, inps
......@@ -1130,7 +1130,7 @@ class HFLM(TemplateLM):
# from prompt/prefix tuning tokens, if applicable
ctx_len = (
inplen + (logits.shape[0] - padding_len_inp)
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM
if self.backend == "causal"
else None
)
logits = self._select_cont_toks(logits, contlen=contlen, inplen=ctx_len)
......@@ -1265,10 +1265,10 @@ class HFLM(TemplateLM):
max_gen_toks = self.max_gen_toks
# set the max length in tokens of inputs ("context_enc")
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM:
if self.backend == "causal":
# max len for inputs = max length, minus room to generate the max new tokens
max_ctx_len = self.max_length - max_gen_toks
elif self.AUTO_MODEL_CLASS == transformers.AutoModelForSeq2SeqLM:
elif self.backend == "seq2seq":
# max len for inputs = encoder's whole max_length
max_ctx_len = self.max_length
......@@ -1295,7 +1295,7 @@ class HFLM(TemplateLM):
cont_toks_list = cont.tolist()
for cont_toks, context in zip(cont_toks_list, contexts):
# discard context + left-padding toks if using causal decoder-only LM
if self.AUTO_MODEL_CLASS == transformers.AutoModelForCausalLM:
if self.backend == "causal":
cont_toks = cont_toks[context_enc.shape[1] :]
s = self.tok_decode(cont_toks)
......
import copy
import json
import logging
import subprocess
from collections import defaultdict
from typing import List, Optional, Union
......@@ -33,54 +31,6 @@ except ImportError:
logger = logging.getLogger(__name__)
def get_nc_count() -> Union[int, None]:
"""Returns the number of neuron cores on the current instance."""
try:
cmd = "neuron-ls --json-output"
result = subprocess.run(cmd, shell=True, capture_output=True)
print(f"inferring nc_count from `neuron-ls` {result.stdout}")
json_output = json.loads(result.stdout)
count = sum([x["nc_count"] for x in json_output])
print(f"nc_count={count}")
return count
except Exception:
return None
def wrap_constant_batch_size(func):
def _decorator(self, input_ids):
"""input_ids a 2D array with batch_size on dim=0
makes sure the func runs with self.batch_size
"""
# access a from TestSample
batch_size = input_ids.shape[0]
if batch_size < self.batch_size:
# handle the event of input_ids.shape[0] != batch_size
# Neuron cores expect constant batch_size
input_ids = torch.concat(
(
input_ids,
# add missing_batch_size dummy
torch.zeros(
[self.batch_size - batch_size, *input_ids.size()[1:]],
dtype=input_ids.dtype,
device=input_ids.device,
),
),
dim=0,
)
elif batch_size > self.batch_size:
raise ValueError(
f"The specified batch_size ({batch_size}) exceeds the model static batch size ({self.batch_size})"
)
# return the forward pass that requires constant batch size
return func(self, input_ids)[:batch_size]
return _decorator
class CustomNeuronModelForCausalLM(NeuronModelForCausalLM):
"""NeuronModelForCausalLM with `stopping_criteria` in `generate`"""
......@@ -146,7 +96,7 @@ class CustomNeuronModelForCausalLM(NeuronModelForCausalLM):
raise ValueError(
f"The specified batch_size ({batch_size}) exceeds the model static batch size ({self.batch_size})"
)
elif batch_size < self.batch_size:
elif batch_size < self.batch_size and not self.continuous_batching:
logger.warning(
"Inputs will be padded to match the model static batch size. This will increase latency."
)
......@@ -158,8 +108,6 @@ class CustomNeuronModelForCausalLM(NeuronModelForCausalLM):
if attention_mask is not None:
padding = torch.zeros(padding_shape, dtype=torch.int64)
padded_attention_mask = torch.cat([attention_mask, padding])
# Drop the current generation context and clear the Key/Value cache
self.reset_generation()
output_ids = self.generate_tokens(
padded_input_ids,
......@@ -179,8 +127,6 @@ class NEURON_HF(TemplateLM):
Tested with neuron 2.17.0
"""
_DEFAULT_MAX_LENGTH = 2048
def __init__(
self,
pretrained: Optional[str] = "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
......@@ -203,7 +149,7 @@ class NEURON_HF(TemplateLM):
"please install neuron via pip install transformers-neuron ",
"also make sure you are running on an AWS inf2 instance",
)
if version.parse(optimum_neuron_version) != version.parse("0.0.17"):
if version.parse(optimum_neuron_version) != version.parse("0.0.24"):
logger.warning(
'`optimum-neuron` model requires `pip install "optimum[neuronx]>=0.0.24" '
"preferably using the Hugging Face Neuron Deep Learning AMI (Ubuntu 22.04) "
......@@ -217,35 +163,16 @@ class NEURON_HF(TemplateLM):
self.batch_size_per_gpu = int(batch_size)
batch_size = int(batch_size)
if tp_degree is None:
# execute `neuron-ls --json-output | jq '.[0].nc_count'``
# to get the number of neuron cores on your instance
tp_degree = get_nc_count()
assert isinstance(tp_degree, int), (
f"model_args must include tp_degree. tp_degree must be set to an integer,"
f" but is tp_degree=`{tp_degree}` with type=`{type(tp_degree)}`."
"Set it to number of neuron cores on your instance."
" For inf2.xlarge and inf2.8xlarge, set it to `2`."
" For inf2.24xlarge, set it to `12`."
" For inf2.48xlarge, set it to `24`."
)
revision = str(revision) # cast to string if not already one
# TODO: update this to be less of a hack once subfolder is fixed in HF
revision = revision + ("/" + subfolder if subfolder is not None else "")
self._config = transformers.AutoConfig.from_pretrained(
pretrained,
revision=revision,
trust_remote_code=trust_remote_code,
)
torch_dtype = lm_eval.models.utils.get_dtype(dtype)
assert torch_dtype in [
torch.float16,
torch.bfloat16,
], "Only float16 and bfloat16 are supported"
revision = str(revision) # cast to string if not already one
# TODO: update this to be less of a hack once subfolder is fixed in HF
revision = revision + ("/" + subfolder if subfolder is not None else "")
self.tokenizer = transformers.AutoTokenizer.from_pretrained(
pretrained if tokenizer is None else tokenizer,
......@@ -254,36 +181,58 @@ class NEURON_HF(TemplateLM):
use_fast=use_fast_tokenizer,
)
# Neuron specific code
if torch_dtype == torch.float16:
self.amp_dtype = "f16"
elif torch_dtype == torch.bfloat16:
self.amp_dtype = "bf16"
elif torch_dtype == torch.float32:
self.amp_dtype = "f32"
else:
raise NotImplementedError("Only float16 and bfloat16 are implemented.")
compiler_args = {"num_cores": tp_degree, "auto_cast_type": self.amp_dtype}
input_shapes = {
"batch_size": batch_size,
"sequence_length": self._DEFAULT_MAX_LENGTH,
}
neuron_config = getattr(self._config, "neuron", None)
if neuron_config is None:
# Check export parameters
if tp_degree is not None:
assert isinstance(tp_degree, int), (
f"tp_degree must be set to an integer,"
f" but is tp_degree=`{tp_degree}` with type=`{type(tp_degree)}`."
"Set it to a number lower than the number of neuron cores on your instance."
" For inf2.xlarge and inf2.8xlarge, set it to `2`."
" For inf2.24xlarge, set it <= `12`."
" For inf2.48xlarge, set it <= `24`."
)
torch_dtype = lm_eval.models.utils.get_dtype(dtype)
if torch_dtype == torch.float16:
self.amp_dtype = "f16"
elif torch_dtype == torch.bfloat16:
self.amp_dtype = "bf16"
elif torch_dtype == torch.float32:
self.amp_dtype = "f32"
else:
raise NotImplementedError(
"Only float16/bfloat16/float32 are supported."
)
print(
f"{'='*20} \n loading model to neuron with"
f" {compiler_args}, {input_shapes}..."
)
self.model = CustomNeuronModelForCausalLM.from_pretrained(
pretrained,
revision=revision,
trust_remote_code=trust_remote_code,
low_cpu_mem_usage=low_cpu_mem_usage,
export=True,
**compiler_args,
**input_shapes,
)
print(f"SUCCESS: neuron model compiled. \n {'='*20}")
print(f"{'='*20} \n exporting model to neuron")
self.model = CustomNeuronModelForCausalLM.from_pretrained(
pretrained,
revision=revision,
trust_remote_code=trust_remote_code,
low_cpu_mem_usage=low_cpu_mem_usage,
export=True,
batch_size=batch_size,
num_cores=tp_degree,
auto_cast_type=self.amp_dtype,
sequence_length=max_length,
)
neuron_config = self.model.config.neuron
print(
f"SUCCESS: neuron model exported with config {neuron_config}. \n {'='*20}"
)
else:
print(
f"{'='*20} \n loading neuron model with config" f" {neuron_config}..."
)
self.model = CustomNeuronModelForCausalLM.from_pretrained(
pretrained,
revision=revision,
trust_remote_code=trust_remote_code,
low_cpu_mem_usage=low_cpu_mem_usage,
)
print(f"SUCCESS: neuron model loaded. \n {'='*20}")
self.truncation = truncation
......@@ -291,8 +240,6 @@ class NEURON_HF(TemplateLM):
self.tokenizer.pad_token_id = self.tokenizer.eos_token_id
self.add_bos_token = add_bos_token
self._max_length = max_length
self.batch_schedule = 1
self.batch_sizes = {}
......@@ -313,17 +260,7 @@ class NEURON_HF(TemplateLM):
@property
def max_length(self):
if self._max_length: # if max length manually set, return it
return self._max_length
seqlen_config_attrs = ("n_positions", "max_position_embeddings", "n_ctx")
for attr in seqlen_config_attrs:
if hasattr(self.model.config, attr):
return getattr(self.model.config, attr)
if hasattr(self.tokenizer, "model_max_length"):
if self.tokenizer.model_max_length == 1000000000000000019884624838656:
return self._DEFAULT_MAX_LENGTH
return self.tokenizer.model_max_length
return self._DEFAULT_MAX_LENGTH
return self.model.max_length
@property
def max_gen_toks(self) -> int:
......@@ -391,34 +328,6 @@ class NEURON_HF(TemplateLM):
def tok_decode(self, tokens):
return self.tokenizer.decode(tokens)
@wrap_constant_batch_size
def _model_call(self, input_ids: torch.Tensor):
"""
get logits for the entire sequence
:param input_ids: torch.Tensor
A torch tensor of shape [batch, sequence_cont]
the size of sequence may vary from call to call
:return
A torch tensor of shape [batch, sequence, vocab] with the
logits returned from the model's decoder-lm head
"""
_, sequence_length = input_ids.shape
with torch.inference_mode():
cache_ids = torch.arange(0, sequence_length, dtype=torch.int32).split(1)
input_ids_split = input_ids.split(1, dim=1)
return torch.concat(
[
self.model.forward(
input_ids=input_id, cache_ids=cache_id, return_dict=False
)[0]
for input_id, cache_id in zip(input_ids_split, cache_ids)
],
dim=1,
)
def _model_generate(self, context, max_length, stop, **generation_kwargs):
# we require users to pass do_sample=True explicitly
# for non-greedy gen. This should be reevaluated when considering beam search.
......@@ -580,15 +489,41 @@ class NEURON_HF(TemplateLM):
cont_toks_list.append(continuation_enc)
inplens.append(inplen)
# create encoder attn mask and batched conts, if seq2seq
call_kwargs = {}
# Add dummy inputs up to the model static batch size
if len(inps) < self.batch_size:
inps = inps + [
torch.zeros_like(inps[0]),
] * (self.batch_size - len(inps))
masks = [torch.ones_like(inp) for inp in inps]
batched_inps = lm_eval.models.utils.pad_and_concat(
padding_len_inp, inps, padding_side="right"
) # [batch, padding_len_inp]
multi_logits = F.log_softmax(
self._model_call(batched_inps, **call_kwargs), dim=-1
) # [batch, padding_length (inp or cont), vocab]
batched_masks = lm_eval.models.utils.pad_and_concat(
padding_len_inp, masks, padding_side="right"
)
if self.model.model.neuron_config.output_all_logits:
inputs = self.model.prepare_inputs_for_prefill(
batched_inps, batched_masks
)
multi_logits = F.log_softmax(
self.model.forward(**inputs).logits, dim=-1
) # [batch, padding_length (inp or cont), vocab]
else:
# The model will only return the logits for the last input token, so we need
# to iterate over inputs to accumulate logits.
# To speed things up we use the KV cache as we would do when generating.
inputs = self.model.prepare_inputs_for_prefill(
batched_inps[:, :1], batched_masks[:, :1]
)
outputs = [self.model.forward(**inputs).logits]
for i in range(1, padding_len_inp):
inputs = self.model.prepare_inputs_for_decode(
batched_inps[:, : i + 1], batched_masks[:, : i + 1]
)
outputs.append(self.model.forward(**inputs).logits)
multi_logits = F.log_softmax(torch.concat(outputs, dim=1), dim=-1)
for (cache_key, _, _), logits, inplen, cont_toks in zip(
chunk, multi_logits, inplens, cont_toks_list
......
......@@ -69,11 +69,11 @@ class LocalCompletionsAPI(TemplateAPI):
for choice, ctxlen in zip(out["choices"], ctxlens):
assert ctxlen > 0, "Context length must be greater than 0"
logprobs = sum(choice["logprobs"]["token_logprobs"][ctxlen:-1])
tokens = choice["logprobs"]["token_logprobs"][ctxlen:-1]
tokens_logprobs = choice["logprobs"]["token_logprobs"][ctxlen:-1]
top_logprobs = choice["logprobs"]["top_logprobs"][ctxlen:-1]
is_greedy = True
for tok, top in zip(tokens, top_logprobs):
if tok != max(top, key=top.get):
for tok, top in zip(tokens_logprobs, top_logprobs):
if tok != max(top.values()):
is_greedy = False
break
res.append((logprobs, is_greedy))
......@@ -190,14 +190,18 @@ class OpenAICompletionsAPI(LocalCompletionsAPI):
key = os.environ.get("OPENAI_API_KEY", None)
if key is None:
raise ValueError(
"API key not found. Please set the OPENAI_API_KEY environment variable."
"API key not found. Please set the `OPENAI_API_KEY` environment variable."
)
return key
def loglikelihood(self, requests, **kwargs):
assert (
self.model != "gpt-3.5-turbo"
), "Loglikelihood is not supported for gpt-3.5-turbo"
self.model
in [
"babbage-002",
"davinci-002",
]
), f"Prompt loglikelihoods are only supported by OpenAI's API for {['babbage-002', 'davinci-002']}."
return super().loglikelihood(requests, **kwargs)
def chat_template(self, chat_template: Union[bool, str] = False) -> Optional[str]:
......@@ -226,6 +230,11 @@ class OpenAIChatCompletion(LocalChatCompletion):
key = os.environ.get("OPENAI_API_KEY", None)
if key is None:
raise ValueError(
"API key not found. Please set the OPENAI_API_KEY environment variable."
"API key not found. Please set the `OPENAI_API_KEY` environment variable."
)
return key
def loglikelihood(self, requests, **kwargs):
raise NotImplementedError(
"Loglikelihood (and therefore `multiple_choice`-type tasks) is not supported for chat completions as OpenAI does not provide prompt logprobs. See https://github.com/EleutherAI/lm-evaluation-harness/issues/942#issuecomment-1777836312 or https://github.com/EleutherAI/lm-evaluation-harness/issues/1196 for more background on this limitation."
)
......@@ -698,3 +698,14 @@ def replace_placeholders(
# Add the last part of the string
result.append(parts[-1])
return "".join(result)
def flatten_image_list(images: List[List]):
"""
Takes in a list of lists of images, and returns a single list of all images in order.
Used for some multimodal models, such as Llava-1.5, which expect this flattened-list format for their image processors.
:param images: A list of lists of PIL images.
:return: a list of PIL images, via concatenating all the sub-lists in order.
"""
return [image for image_list in images for image in image_list]
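As a quick illustration of the flattening described in the docstring above, here is a minimal sketch using plain strings as stand-ins for PIL images (the variable names are hypothetical, not taken from this diff):

```python
# Minimal sketch of flatten_image_list's behavior: a batch of two prompts,
# the first with two images and the second with one, becomes a single flat
# list that preserves order. Strings stand in for PIL images here.
batched_images = [["img_a", "img_b"], ["img_c"]]
flattened = [image for image_list in batched_images for image in image_list]
assert flattened == ["img_a", "img_b", "img_c"]
```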
......@@ -7,9 +7,9 @@ from tqdm import tqdm
from lm_eval.api.instance import Instance
from lm_eval.api.registry import register_model
from lm_eval.models.utils import Collator, undistribute
from lm_eval.models.utils import Collator, replace_placeholders, undistribute
from lm_eval.models.vllm_causallms import VLLM
from lm_eval.utils import simple_parse_args_string
from lm_eval.utils import eval_logger
try:
......@@ -36,10 +36,11 @@ class VLLM_VLM(VLLM):
interleave: bool = True,
# TODO<baber>: handle max_images and limit_mm_per_prompt better
max_images: int = 999,
limit_mm_per_prompt: str = "image=1",
**kwargs,
):
kwargs["limit_mm_per_prompt"] = simple_parse_args_string(limit_mm_per_prompt)
if max_images != 999:
kwargs["limit_mm_per_prompt"] = {"image": max_images}
eval_logger.info(f"Setting limit_mm_per_prompt[image] to {max_images}")
super().__init__(
pretrained=pretrained,
trust_remote_code=trust_remote_code,
......@@ -63,6 +64,17 @@ class VLLM_VLM(VLLM):
truncation: bool = False,
):
images = [img[: self.max_images] for img in images]
# TODO<baber>: is the default placeholder always <image>?
if self.chat_applied is False:
strings = [
replace_placeholders(
string,
DEFAULT_IMAGE_PLACEHOLDER,
DEFAULT_IMAGE_PLACEHOLDER,
self.max_images,
)
for string in strings
]
outputs = []
for x, i in zip(strings, images):
......
......@@ -18,6 +18,7 @@
| [arithmetic](arithmetic/README.md) | Tasks involving numerical computations and arithmetic reasoning. | English |
| [asdiv](asdiv/README.md) | Tasks involving arithmetic and mathematical reasoning challenges. | English |
| [babi](babi/README.md) | Tasks designed as question and answering challenges based on simulated stories. | English |
| [basque_bench](basque_bench/README.md) | Collection of tasks in Basque encompassing various evaluation areas. | Basque |
| [basqueglue](basqueglue/README.md) | Tasks designed to evaluate language understanding in Basque language. | Basque |
| [bbh](bbh/README.md) | Tasks focused on deep semantic understanding through hypothesization and reasoning. | English, German |
| [belebele](belebele/README.md) | Language understanding tasks in a variety of languages and scripts. | Multiple (122 languages) |
......@@ -25,6 +26,7 @@
| [bertaqa](bertaqa/README.md) | Local Basque cultural trivia QA tests in English and Basque languages. | English, Basque, Basque (MT) |
| [bigbench](bigbench/README.md) | Broad tasks from the BIG-bench benchmark designed to push the boundaries of large models. | Multiple |
| [blimp](blimp/README.md) | Tasks testing grammatical phenomena to evaluate language model's linguistic capabilities. | English |
| [catalan_bench](catalan_bench/README.md) | Collection of tasks in Catalan encompassing various evaluation areas. | Catalan |
| [ceval](ceval/README.md) | Tasks that evaluate language understanding and reasoning in an educational context. | Chinese |
| [cmmlu](cmmlu/README.md) | Multi-subject multiple choice question tasks for comprehensive academic assessment. | Chinese |
| code_x_glue | Tasks that involve understanding and generating code across multiple programming languages. | Go, Java, JS, PHP, Python, Ruby |
......@@ -42,6 +44,7 @@
| [fda](fda/README.md) | Tasks for extracting key-value pairs from FDA documents to test information extraction. | English |
| [fld](fld/README.md) | Tasks involving free-form and directed dialogue understanding. | English |
| [french_bench](french_bench/README.md) | Set of tasks designed to assess language model performance in French. | French|
| [galician_bench](galician_bench/README.md) | Collection of tasks in Galician encompassing various evaluation areas. | Galician |
| [glue](glue/README.md) | General Language Understanding Evaluation benchmark to test broad language abilities. | English |
| [gpqa](gpqa/README.md) | Tasks designed for general public question answering and knowledge verification. | English |
| [gsm8k](gsm8k/README.md) | A benchmark of grade school math problems aimed at evaluating reasoning capabilities. | English |
......@@ -86,6 +89,7 @@
| [pile_10k](pile_10k/README.md) | The first 10K elements of The Pile, useful for debugging models trained on it. | English |
| [piqa](piqa/README.md) | Physical Interaction Question Answering tasks to test physical commonsense reasoning. | English |
| [polemo2](polemo2/README.md) | Sentiment analysis and emotion detection tasks based on Polish language data. | Polish |
| [portuguese_bench](portuguese_bench/README.md) | Collection of tasks in European Portuguese encompassing various evaluation areas. | Portuguese |
| [prost](prost/README.md) | Tasks requiring understanding of professional standards and ethics in various domains. | English |
| [pubmedqa](pubmedqa/README.md) | Question answering tasks based on PubMed research articles for biomedical understanding. | English |
| [qa4mre](qa4mre/README.md) | Question Answering for Machine Reading Evaluation, assessing comprehension and reasoning. | English |
......@@ -95,6 +99,7 @@
| [sciq](sciq/README.md) | Science Question Answering tasks to assess understanding of scientific concepts. | English |
| [scrolls](scrolls/README.md) | Tasks that involve long-form reading comprehension across various domains. | English |
| [siqa](siqa/README.md) | Social Interaction Question Answering to evaluate common sense and social reasoning. | English |
| [spanish_bench](spanish_bench/README.md) | Collection of tasks in Spanish encompassing various evaluation areas. | Spanish |
| [squad_completion](squad_completion/README.md) | A variant of the SQuAD question answering task designed for zero-shot evaluation of small LMs. | English |
| [squadv2](squadv2/README.md) | Stanford Question Answering Dataset version 2, a reading comprehension benchmark. | English |
| [storycloze](storycloze/README.md) | Tasks to predict story endings, focusing on narrative logic and coherence. | English |
......@@ -107,6 +112,7 @@
| [translation](translation/README.md) | Tasks focused on evaluating the language translation capabilities of models. | Arabic, English, Spanish, Basque, Hindi, Indonesian, Burmese, Russian, Swahili, Telugu, Chinese |
| [triviaqa](triviaqa/README.md) | A large-scale dataset for trivia question answering to test general knowledge. | English |
| [truthfulqa](truthfulqa/README.md) | A QA task aimed at evaluating the truthfulness and factual accuracy of model responses. | English |
| [turkishmmlu](turkishmmlu/README.md) | A multiple-choice QA test modeled after MMLU, written in Turkish based on Turkish high-school level exams. | Turkish |
| [unitxt](unitxt/README.md) | A number of tasks implemented using the unitxt library for flexible, shareable, and reusable data preparation and evaluation for generative AI. | English |
| [unscramble](unscramble/README.md) | Tasks involving the rearrangement of scrambled sentences to test syntactic understanding. | English |
| [webqs](webqs/README.md) | Web-based question answering tasks designed to evaluate internet search and retrieval. | English |
......
......@@ -40,7 +40,11 @@ class TaskManager:
[x for x in self._all_tasks if self._task_index[x]["type"] == "group"]
)
self._all_subtasks = sorted(
[x for x in self._all_tasks if self._task_index[x]["type"] == "task"]
[
x
for x in self._all_tasks
if self._task_index[x]["type"] in ["task", "python_task"]
]
)
self._all_tags = sorted(
[x for x in self._all_tasks if self._task_index[x]["type"] == "tag"]
......@@ -271,7 +275,7 @@ class TaskManager:
task_object = config["class"]()
if isinstance(task_object, ConfigurableTask):
# very scuffed: set task name here. TODO: fixme?
task_object.config.task = config["task"]
task_object.config.task = task
else:
task_object = ConfigurableTask(config=config)
......@@ -436,6 +440,30 @@ class TaskManager:
:return
Dictionary of task names as key and task metadata
"""
def _populate_tags_and_groups(config, task, tasks_and_groups, print_info):
# TODO: remove group in next release
if "tag" in config:
attr_list = config["tag"]
if isinstance(attr_list, str):
attr_list = [attr_list]
for tag in attr_list:
if tag not in tasks_and_groups:
tasks_and_groups[tag] = {
"type": "tag",
"task": [task],
"yaml_path": -1,
}
elif tasks_and_groups[tag]["type"] != "tag":
self.logger.info(
f"The tag '{tag}' is already registered as a group, this tag will not be registered. "
"This may affect tasks you want to call."
)
break
else:
tasks_and_groups[tag]["task"].append(task)
# TODO: remove group in next release
print_info = True
ignore_dirs = [
......@@ -451,10 +479,14 @@ class TaskManager:
config = utils.load_yaml_config(yaml_path, mode="simple")
if self._config_is_python_task(config):
# This is a python class config
tasks_and_groups[config["task"]] = {
task = config["task"]
tasks_and_groups[task] = {
"type": "python_task",
"yaml_path": yaml_path,
}
_populate_tags_and_groups(
config, task, tasks_and_groups, print_info
)
elif self._config_is_group(config):
# This is a group config
tasks_and_groups[config["group"]] = {
......@@ -483,41 +515,9 @@ class TaskManager:
"type": "task",
"yaml_path": yaml_path,
}
# TODO: remove group in next release
for attr in ["tag", "group"]:
if attr in config:
if attr == "group" and print_info:
self.logger.info(
"`group` and `group_alias` keys in TaskConfigs are deprecated and will be removed in v0.4.5 of lm_eval. "
"The new `tag` field will be used to allow for a shortcut to a group of tasks one does not wish to aggregate metrics across. "
"`group`s which aggregate across subtasks must be only defined in a separate group config file, "
"which will be the official way to create groups that support cross-task aggregation as in `mmlu`. "
"Please see the v0.4.4 patch notes and our documentation: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#advanced-group-configs "
"for more information."
)
print_info = False
# attr = "tag"
attr_list = config[attr]
if isinstance(attr_list, str):
attr_list = [attr_list]
for tag in attr_list:
if tag not in tasks_and_groups:
tasks_and_groups[tag] = {
"type": "tag",
"task": [task],
"yaml_path": -1,
}
elif tasks_and_groups[tag]["type"] != "tag":
self.logger.info(
f"The tag {tag} is already registered as a group, this tag will not be registered. "
"This may affect tasks you want to call."
)
break
else:
tasks_and_groups[tag]["task"].append(task)
_populate_tags_and_groups(
config, task, tasks_and_groups, print_info
)
else:
self.logger.debug(f"File {f} in {root} could not be loaded")
......
# BasqueBench
### Paper
BasqueBench is a benchmark for evaluating language models on Basque tasks. That is, it evaluates the ability of a language model to understand and generate Basque text. BasqueBench offers a combination of pre-existing, open datasets and datasets developed exclusively for this benchmark. All the details of BasqueBench will be published in a paper soon.
The new evaluation datasets included in BasqueBench are:
| Task | Category | Homepage |
|:-------------:|:-----:|:-----:|
| MGSM_eu | Math | https://huggingface.co/datasets/HiTZ/MGSM-eu |
| WNLI_eu | Natural Language Inference | https://huggingface.co/datasets/HiTZ/wnli-eu |
| XCOPA_eu | Commonsense Reasoning | https://huggingface.co/datasets/HiTZ/XCOPA-eu |
The datasets included in BasqueBench that have been made public in previous publications are:
| Task | Category | Paper title | Homepage |
|:-------------:|:-----:|:-------------:|:-----:|
| Belebele_eu | Reading Comprehension | [The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants](https://arxiv.org/abs/2308.16884) | https://huggingface.co/datasets/facebook/belebele |
| EusExams | Question Answering | [Latxa: An Open Language Model and Evaluation Suite for Basque](https://arxiv.org/abs/2403.20266) | https://huggingface.co/datasets/HiTZ/EusExams |
| EusProficiency | Question Answering | [Latxa: An Open Language Model and Evaluation Suite for Basque](https://arxiv.org/abs/2403.20266) | https://huggingface.co/datasets/HiTZ/EusProficiency |
| EusReading | Reading Comprehension | [Latxa: An Open Language Model and Evaluation Suite for Basque](https://arxiv.org/abs/2403.20266) | https://huggingface.co/datasets/HiTZ/EusReading |
| EusTrivia | Question Answering | [Latxa: An Open Language Model and Evaluation Suite for Basque](https://arxiv.org/abs/2403.20266) | https://huggingface.co/datasets/HiTZ/EusTrivia |
| FLORES_eu | Translation | [No Language Left Behind: Scaling Human-Centered Machine Translation](https://arxiv.org/abs/2207.04672) | https://huggingface.co/datasets/facebook/flores |
| QNLIeu | Natural Language Inference | [BasqueGLUE: A Natural Language Understanding Benchmark for Basque](https://aclanthology.org/2022.lrec-1.172/) | https://huggingface.co/datasets/orai-nlp/basqueGLUE |
| XNLIeu | Natural Language Inference | [XNLIeu: a dataset for cross-lingual NLI in Basque](https://arxiv.org/abs/2404.06996) | https://huggingface.co/datasets/HiTZ/xnli-eu |
| XStoryCloze_eu | Commonsense Reasoning | [Few-shot Learning with Multilingual Generative Language Models](https://aclanthology.org/2022.emnlp-main.616/) | https://huggingface.co/datasets/juletxara/xstory_cloze |
### Citation
Paper for BasqueBench coming soon.
### Groups and Tasks
#### Groups
- `basque_bench`: All tasks included in BasqueBench.
- `flores_eu`: All FLORES translation tasks from or to Basque.
#### Tasks
The following tasks evaluate models on the BasqueBench datasets using various scoring methods; a minimal example invocation is sketched after the lists below.
- `belebele_eus_Latn`
- `eus_exams_eu`
- `eus_proficiency`
- `eus_reading`
- `eus_trivia`
- `flores_eu`
- `flores_eu-ca`
- `flores_eu-de`
- `flores_eu-en`
- `flores_eu-es`
- `flores_eu-fr`
- `flores_eu-gl`
- `flores_eu-it`
- `flores_eu-pt`
- `flores_ca-eu`
- `flores_de-eu`
- `flores_en-eu`
- `flores_es-eu`
- `flores_fr-eu`
- `flores_gl-eu`
- `flores_it-eu`
- `flores_pt-eu`
- `mgsm_direct_eu`
- `mgsm_native_cot_eu`
- `qnlieu`
- `wnli_eu`
- `xcopa_eu`
- `xnli_eu`
- `xnli_eu_native`
- `xstorycloze_eu`
Some of these tasks are taken from benchmarks already available in LM Evaluation Harness. These are:
- `belebele_eus_Latn`: Belebele Basque
- `qnlieu`: From BasqueGLUE
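The group can be run like any other harness group. Below is a minimal sketch using the `lm_eval.simple_evaluate` API shown earlier in this commit; `microsoft/phi-2` is only a stand-in checkpoint, and any Hugging Face model identifier (or another backend) can be substituted:

```python
# Minimal sketch: evaluate a Hugging Face model on the full basque_bench group.
# The checkpoint below is a placeholder taken from elsewhere in this commit.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=microsoft/phi-2,trust_remote_code=True",
    tasks=["basque_bench"],
)
print(results["results"])  # per-task and aggregated scores
```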
### Checklist
* [x] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation?
* [ ] Yes, original implementation contributed by author of the benchmark
If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
group: basque_bench
task:
- belebele_eus_Latn
- xstorycloze_eu
- flores_eu
- eus_reading
- eus_proficiency
- eus_trivia
- eus_exams_eu
- qnlieu
- xnli_eu
- xnli_eu_native
- wnli_eu
- xcopa_eu
- mgsm_direct_eu
- mgsm_native_cot_eu
metadata:
version: 1.0