Unverified commit 21d4ae98 authored by Alex Bäuerle, committed by GitHub

feat: add option to upload results to Zeno (#990)



* feat: add option to upload results to Zeno

* config-based upload supporting different task types and metrics

* upload tasks as individual projects

* wording

* readme

* add example notebook

* Update documentation for Zeno integration

* Make zeno deps an extra

* Update README.md

* Document extra deps installation

* Update zeno_visualize.py

* fix: balance parens

* fix typo

* fix merge commit I botched

* Update zeno_visualize.py

* Update logger warning stmt

* fix whitespace

---------
Co-authored-by: Hailey Schoelkopf <65563625+haileyschoelkopf@users.noreply.github.com>
parent 9e03d9d0
@@ -59,6 +59,7 @@ We also provide a number of optional dependencies for extended functionality. Ex
| promptsource | For using PromptSource prompts |
| sentencepiece | For using the sentencepiece tokenizer |
| vllm | For loading models with vLLM |
| zeno | For visualizing results with Zeno |
| all | Loads all extras |
## Basic Usage
@@ -225,6 +226,45 @@ Additionally, one can provide a directory with `--use_cache` to cache the result
For a full list of supported arguments, check out the [interface](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md) guide in our documentation!
## Visualizing Results
You can use [Zeno](https://zenoml.com) to visualize the results of your eval harness runs.
First, head to [hub.zenoml.com](https://hub.zenoml.com) to create an account and get an API key [on your account page](https://hub.zenoml.com/account).
Add this key as an environment variable:
```bash
export ZENO_API_KEY=[your api key]
```
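The `zeno_visualize` script authenticates by reading this variable from the environment, and a missing key produces a `KeyError` at startup. A minimal up-front check could look like this (a hypothetical snippet for illustration, not part of the harness; `check_zeno_key` is an invented name):

```python
import os

# scripts/zeno_visualize.py reads the key with os.environ["ZENO_API_KEY"],
# which raises KeyError if the variable is missing, so check it up front.
def check_zeno_key(env=os.environ) -> bool:
    return bool(env.get("ZENO_API_KEY"))

if not check_zeno_key():
    print("ZENO_API_KEY is not set; export it before uploading.")
```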
You'll also need to install the `lm_eval[zeno]` package extra.
To visualize the results, run the eval harness with the `log_samples` and `output_path` flags.
We expect `output_path` to contain multiple folders that represent individual model names.
You can thus run your evaluation on any number of tasks and models and upload all of the results as projects on Zeno.
```bash
lm_eval \
--model hf \
--model_args pretrained=EleutherAI/gpt-j-6B \
--tasks hellaswag \
--device cuda:0 \
--batch_size 8 \
--log_samples \
--output_path output/gpt-j-6B
```
Then, you can upload the resulting data using the `zeno_visualize` script:
```bash
python scripts/zeno_visualize.py \
--data_path output \
--project_name "Eleuther Project"
```
This will use all subfolders in `data_path` as different models and upload all tasks within these model folders to Zeno.
If you run the eval harness on multiple tasks, the `project_name` will be used as a prefix and one project will be created per task.
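The per-task naming rule can be sketched as a small helper (hypothetical function name; the logic mirrors what `scripts/zeno_visualize.py` does when creating projects):

```python
def zeno_project_name(base: str, task: str, num_tasks: int) -> str:
    # One Zeno project is created per task; the base project name is used
    # as a prefix only when more than one task was evaluated.
    return base + (f"_{task}" if num_tasks > 1 else "")

print(zeno_project_name("Eleuther Project", "hellaswag", 2))  # Eleuther Project_hellaswag
print(zeno_project_name("Eleuther Project", "hellaswag", 1))  # Eleuther Project
```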
## How to Contribute or Learn More?
For more information on the library and how everything fits together, check out all of our [documentation pages](https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor/docs)! We plan to post a larger roadmap of desired + planned library improvements soon, with more information on how contributors can help.
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Visualizing Results in Zeno\n",
"\n",
"Benchmarking your models is the first step towards making sure your model performs well.\n",
"However, looking at the data behind the benchmark, slicing the data into subsets, and comparing models on individual instances can help you even more in evaluating and quantifying the behavior of your AI system.\n",
"\n",
"All of this can be done in [Zeno](https://zenoml.com)!\n",
"Zeno is easy to use with the eval harness; let's explore how you can upload and visualize your eval results.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Install this project if you haven't already. This is all you need to visualize your data in Zeno!\n",
"!pip install .."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Run the Eval Harness\n",
"\n",
"To visualize the results, run the eval harness with the `log_samples` and `output_path` flags. We expect `output_path` to contain multiple folders that represent individual model names. You can thus run your evaluation on any number of tasks and models and upload all of the results as projects on Zeno.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=EleutherAI/gpt-neo-2.7B \\\n",
" --tasks hellaswag,wikitext \\\n",
" --batch_size 8 \\\n",
" --device mps \\\n",
" --log_samples \\\n",
" --output_path output/gpt-neo-2.7B \\\n",
" --limit 10"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Set your API Key\n",
"\n",
"This is so you can be authenticated with Zeno.\n",
"If you don't already have a Zeno account, first create an account on [Zeno Hub](https://hub.zenoml.com).\n",
"After logging in to Zeno Hub, generate your API key by clicking on your profile at the bottom left to navigate to your account page.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%env ZENO_API_KEY=YOUR_API_KEY"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Visualize Eval Results\n",
"\n",
"You can now use the `zeno_visualize` script to upload the results to Zeno.\n",
"\n",
"This will use all subfolders in `data_path` as different models and upload all tasks within these model folders to Zeno. If you run the eval harness on multiple tasks, the `project_name` will be used as a prefix and one project will be created per task.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python ../scripts/zeno_visualize.py --data_path output --project_name \"Zeno Upload Test\""
]
}
],
"metadata": {
"kernelspec": {
"display_name": "zeno_projects",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -73,6 +73,7 @@ anthropic = ["anthropic"]
openai = ["openai==1.3.9", "tiktoken"]
vllm = ["vllm"]
ifeval = ["langdetect", "immutabledict"]
zeno = ["pandas", "zeno-client"]
all = [
"lm_eval[dev]",
"lm_eval[testing]",
@@ -85,4 +86,5 @@ all = [
"lm_eval[openai]",
"lm_eval[vllm]",
"lm_eval[ifeval]",
"lm_eval[zeno]",
]
import argparse
import json
import os
import re
from pathlib import Path
import pandas as pd
from zeno_client import ZenoClient, ZenoMetric
from lm_eval.utils import eval_logger
def parse_args():
parser = argparse.ArgumentParser(
        description="Upload your data to the Zeno AI evaluation platform to visualize results. This requires a ZENO_API_KEY in your environment variables. The Eleuther eval harness must be run with log_samples=True and an output_path set for data to be written to disk."
)
parser.add_argument(
"--data_path",
required=True,
help="Where to find the results of the benchmarks that have been run. Uses the name of each subfolder as the model name.",
)
parser.add_argument(
"--project_name",
required=True,
help="The name of the generated Zeno project.",
)
return parser.parse_args()
def main():
"""Upload the results of your benchmark tasks to the Zeno AI evaluation platform.
    This script expects your results to live in a data folder where subfolders contain results of individual models.
"""
args = parse_args()
client = ZenoClient(os.environ["ZENO_API_KEY"])
# Get all model subfolders from the parent data folder.
models = [
os.path.basename(os.path.normpath(f))
for f in os.scandir(Path(args.data_path))
if f.is_dir()
]
assert len(models) > 0, "No model directories found in the data_path."
tasks = set(tasks_for_model(models[0], args.data_path))
for model in models: # Make sure that all models have the same tasks.
old_tasks = tasks.copy()
task_count = len(tasks)
model_tasks = tasks_for_model(model, args.data_path)
        tasks = tasks.intersection(set(model_tasks))
if task_count != len(tasks):
eval_logger.warning(
                f"All models must have the same tasks. {model} has tasks: {model_tasks}, but previously recorded tasks were: {old_tasks}. Taking the intersection: {tasks}."
)
assert (
len(tasks) > 0
), "Must provide at least one task in common amongst models to compare."
for task in tasks:
# Upload data for all models
for model_index, model in enumerate(models):
model_args = re.sub(
"/|=",
"__",
json.load(open(Path(args.data_path, model, "results.json")))["config"][
"model_args"
],
)
with open(
Path(args.data_path, model, f"{model_args}_{task}.jsonl"), "r"
) as file:
data = json.loads(file.read())
configs = json.load(open(Path(args.data_path, model, "results.json")))[
"configs"
]
config = configs[task]
if model_index == 0: # Only need to assemble data for the first model
metrics = []
for metric in config["metric_list"]:
metrics.append(
ZenoMetric(
name=metric["metric"],
type="mean",
columns=[metric["metric"]],
)
)
project = client.create_project(
name=args.project_name + (f"_{task}" if len(tasks) > 1 else ""),
view="text-classification",
metrics=metrics,
)
project.upload_dataset(
generate_dataset(data, config),
id_column="id",
data_column="data",
label_column="labels",
)
project.upload_system(
generate_system_df(data, config),
name=model,
id_column="id",
output_column="output",
)
def tasks_for_model(model: str, data_path: str):
"""Get the tasks for a specific model.
Args:
model (str): The name of the model.
data_path (str): The path to the data.
Returns:
list: A list of tasks for the model.
"""
dir_path = Path(data_path, model)
    configs = json.load(open(Path(dir_path, "results.json")))["configs"]
    return list(configs.keys())
def generate_dataset(
data,
config,
):
"""Generate a Zeno dataset from evaluation data.
Args:
data: The data to generate a dataset for.
config: The configuration of the task.
Returns:
pd.Dataframe: A dataframe that is ready to be uploaded to Zeno.
"""
ids = [x["doc_id"] for x in data]
labels = [x["target"] for x in data]
instance = [""] * len(ids)
if config["output_type"] == "loglikelihood":
instance = [x["arguments"][0][0] for x in data]
labels = [x["arguments"][0][1] for x in data]
elif config["output_type"] == "multiple_choice":
instance = [
x["arguments"][0][0]
+ "\n\n"
+ "\n".join([f"- {y[1]}" for y in x["arguments"]])
for x in data
]
elif config["output_type"] == "loglikelihood_rolling":
instance = [x["arguments"][0][0] for x in data]
elif config["output_type"] == "generate_until":
instance = [x["arguments"][0][0] for x in data]
return pd.DataFrame(
{
"id": ids,
"data": instance,
"labels": labels,
"output_type": config["output_type"],
}
)
def generate_system_df(data, config):
"""Generate a dataframe for a specific system to be uploaded to Zeno.
Args:
data: The data to generate a dataframe from.
config: The configuration of the task.
Returns:
pd.Dataframe: A dataframe that is ready to be uploaded to Zeno as a system.
"""
ids = [x["doc_id"] for x in data]
answers = [""] * len(ids)
if config["output_type"] == "loglikelihood":
answers = [
"correct" if x["filtered_resps"][0][1] is True else "incorrect"
for x in data
]
elif config["output_type"] == "multiple_choice":
answers = [", ".join([str(y[0]) for y in x["filtered_resps"]]) for x in data]
elif config["output_type"] == "loglikelihood_rolling":
answers = [str(x["filtered_resps"][0]) for x in data]
elif config["output_type"] == "generate_until":
answers = [str(x["filtered_resps"][0]) for x in data]
metrics = {}
for metric in config["metric_list"]:
if "aggregation" in metric and metric["aggregation"] == "mean":
metrics[metric["metric"]] = [x[metric["metric"]] for x in data]
system_dict = {"id": ids, "output": answers}
system_dict.update(metrics)
system_df = pd.DataFrame(system_dict)
return system_df
if __name__ == "__main__":
main()