Commit e495e3a0 authored by gk

Merge branch 'master' into big-refactor-test

parents 6d355b85 9d06c953
......@@ -9,5 +9,5 @@ jobs:
- uses: actions/checkout@v3
- uses: actions/setup-python@v4
with:
python-version: 3.8
python-version: 3.9
- uses: pre-commit/action@v2.0.3
......@@ -12,7 +12,7 @@ repos:
- id: check-merge-conflict
- id: check-symlinks
- id: check-yaml
args: ['--unsafe']
args: ["--unsafe"]
- id: destroyed-symlinks
- id: detect-private-key
- id: end-of-file-fixer
......@@ -33,7 +33,7 @@ repos:
rev: 22.3.0
hooks:
- id: black
language_version: python3.8
language_version: python3.9
- repo: https://github.com/codespell-project/codespell
rev: v2.1.0
hooks:
......
* @jon-tow @StellaAthena
* @jon-tow @StellaAthena @haileyschoelkopf @lintangsutawika
......@@ -7,14 +7,17 @@
This project provides a unified framework to test generative language models on a large number of different evaluation tasks.
**Features:**
### Features
- 200+ tasks implemented. See the [task-table](./docs/task_table.md) for a complete list.
- Support for the Hugging Face `transformers` library, GPT-NeoX, Megatron-DeepSpeed, and the OpenAI API, with flexible tokenization-agnostic interface.
- Support for evaluation on adapters (e.g. LoRa) supported in [HuggingFace's PEFT library](https://github.com/huggingface/peft).
- Task versioning to ensure reproducibility.
- Support for the Hugging Face [transformers](https://github.com/huggingface/transformers) library, [GPT-NeoX](https://github.com/EleutherAI/gpt-neox), and [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed), with a flexible, tokenization-agnostic interface.
- Support for commercial APIs including [OpenAI](https://openai.com/), [goose.ai](https://goose.ai/), [Anthropic](https://www.anthropic.com/), and [TextSynth](https://textsynth.com/).
- Support for evaluation on adapters (e.g. LoRA) supported in [HuggingFace's PEFT library](https://github.com/huggingface/peft).
- Support for GPTQ quantized models via [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ).
- Evaluating with publicly available prompts ensures reproducibility and comparability between papers.
- Task versioning to ensure reproducibility when tasks are updated.
**Evaluation Overview**
### Evaluation Overview
`Task` and `Prompt` classes contain information that, when combined, produces the input to the language model. The language model is then queried to obtain an output. One or more `Filters` can then be applied to perform arbitrary operations on the model's raw output, such as selecting the final answer (for chain of thought) or calling an external API. This final output is then evaluated using a `Metric` to obtain the final result.
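The toy sketch below mirrors that flow end to end. Every class and function in it is an illustrative stand-in invented for this example, not the harness's actual API.
```python
from dataclasses import dataclass


@dataclass
class ToyTask:
    question: str
    gold: str


def build_prompt(task: ToyTask) -> str:
    # Task + Prompt: turn a document into the text fed to the model.
    return f"Q: {task.question}\nA:"


def toy_model(prompt: str) -> str:
    # Stand-in for querying the language model.
    return " The answer is 4 because 2 + 2 = 4."


def final_answer_filter(raw_output: str) -> str:
    # Filter: pull the final answer out of a chain-of-thought style response.
    return raw_output.strip().split()[3].rstrip(".")


def exact_match_metric(prediction: str, gold: str) -> float:
    # Metric: score the filtered output against the reference.
    return float(prediction == gold)


task = ToyTask(question="What is 2 + 2?", gold="4")
prediction = final_answer_filter(toy_model(build_prompt(task)))
print(exact_match_metric(prediction, task.gold))  # -> 1.0
```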
......@@ -37,7 +40,7 @@ graph LR;
O --> F
Me --> R:::empty
F --> R
```
```
## Install
......@@ -55,12 +58,19 @@ To install additional multilingual tokenization and text segmentation packages,
pip install -e ".[multilingual]"
```
To support loading GPTQ quantized models, install the package with the `auto-gptq` extra:
```bash
pip install -e ".[auto-gptq]"
```
## Basic Usage
> **Note**: When reporting results from eval harness, please include the task versions (shown in `results["versions"]`) for reproducibility. This allows bug fixes to tasks while also ensuring that previously reported scores are reproducible. See the [Task Versioning](#task-versioning) section for more info.
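For example, assuming you have saved the harness output to a JSON file (the `results.json` path below is hypothetical), the versions to report can be read straight from that dictionary:
```python
import json

# Hypothetical path: wherever you saved the harness's JSON output.
with open("results.json") as f:
    results = json.load(f)

for task, version in results["versions"].items():
    print(f"{task}: version {version}")
```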
To evaluate a model hosted on the [HuggingFace Hub](https://huggingface.co/models) (e.g. GPT-J-6B) you can use the following command:
### Hugging Face `transformers`
To evaluate a model hosted on the [HuggingFace Hub](https://huggingface.co/models) (e.g. GPT-J-6B) on `lambada_openai` and `hellaswag`, you can use the following command:
```bash
python main.py \
......@@ -70,21 +80,24 @@ python main.py \
--device cuda:0
```
Additional arguments can be provided to the model constructor using the `--model_args` flag. Most notably, this supports the common practice of using the `revisions` feature on the Hub to store partially trained checkpoints:
Additional arguments can be provided to the model constructor using the `--model_args` flag. Most notably, this supports the common practice of using the `revisions` feature on the Hub to store partially trained checkpoints, or specifying the datatype for running a model:
```bash
python main.py \
--model hf-causal \
--model_args pretrained=EleutherAI/pythia-160m,revision=step100000 \
--model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype="float" \
--tasks lambada_openai,hellaswag \
--device cuda:0
```
To evaluate models that are called via `AutoSeq2SeqLM`, you instead use `hf-seq2seq`.
To evaluate models that are loaded via `AutoSeq2SeqLM`, you instead use `hf-seq2seq`.
> **Warning**: Choosing the wrong model type may produce erroneous outputs without raising an error.
Arguments provided via `--model_args` get passed to the relevant constructor directly. This means that anything you can do with `AutoModel` can be done with our library.
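Conceptually, the `--model_args` string is just a comma-separated list of `key=value` pairs that becomes constructor keyword arguments. The toy parser below is only a sketch of that idea, not the harness's own parsing code, which may differ in details such as type coercion:
```python
def parse_model_args(arg_string: str) -> dict:
    # Split "k1=v1,k2=v2" into {"k1": "v1", "k2": "v2"}; values stay strings in this sketch.
    kwargs = {}
    for pair in filter(None, arg_string.split(",")):
        key, _, value = pair.partition("=")
        kwargs[key.strip()] = value.strip()
    return kwargs


print(parse_model_args("pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=float"))
# {'pretrained': 'EleutherAI/pythia-160m', 'revision': 'step100000', 'dtype': 'float'}
```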
To use with [PEFT](https://github.com/huggingface/peft), take the call you would run to evaluate the base model and add `,peft=PATH` to the `model_args` argument as shown below:
```bash
python main.py \
--model hf-causal \
......@@ -93,7 +106,18 @@ python main.py \
--device cuda:0
```
Our library also supports the OpenAI API:
GPTQ quantized models can be loaded by adding `,quantized=NAME` (or `,quantized=True` for the default file name) to the `model_args` argument:
```bash
python main.py \
--model hf-causal \
--model_args pretrained=model-name-or-path,quantized=model.safetensors,gptq_use_triton=True \
--tasks hellaswag
```
### Commercial APIs
Our library also supports language models served via the OpenAI API:
```bash
export OPENAI_API_SECRET_KEY=YOUR_KEY_HERE
......@@ -115,7 +139,9 @@ python main.py \
--check_integrity
```
To evaluate mesh-transformer-jax models that are not available on HF, please invoke eval harness through [this script](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py).
### Other Frameworks
A number of other libraries contain scripts for calling the eval harness from within their own codebases. These include [GPT-NeoX](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py), [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples/MoE/readme_evalharness.md), and [mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/blob/master/eval_harness.py).
💡 **Tip**: You can inspect what the LM inputs look like by running the following command:
......@@ -131,16 +157,17 @@ This will write out one text file for each task.
## Multi-GPU Evaluation
Multi-GPU evaluation is supported through [accelerate](https://github.com/huggingface/accelerate). To initialize the distributed environment, run ```accelerate config``` in terminal and follow the prompts. Once the environment is configured, evaluations can be launched with:
Multi-GPU evaluation is supported through [accelerate](https://github.com/huggingface/accelerate). To initialize the distributed environment, run `accelerate config` in your terminal and follow the prompts. Once the environment is configured, evaluations can be launched with:
```bash
accelerate launch main.py \
--model hf-causal \
--model_args pretrained=EleutherAI/pythia-12b \
--tasks lambada_openai,arc_easy \
--batch_size 16 \
--batch_size 16
```
**Warning**: Distributed evaluation requires launching multiple processes of the evaluation script. Running ```python main.py *args*``` instead of ```accelerate launch main.py *args*``` on machine with multiple GPUs will only run the evaluations on a single device.
**Warning**: Distributed evaluation requires launching multiple processes of the evaluation script. Running `python main.py *args*` instead of `accelerate launch main.py *args*` on a machine with multiple GPUs will only run the evaluations on a single device (unless you instead use `use_accelerate=True` in `--model_args`).
## Implementing new tasks
......@@ -154,7 +181,7 @@ When reporting eval harness results, please also report the version of each task
## Test Set Decontamination
To address concerns about train / test contamination, we provide utilities for comparing results on a benchmark using only the data points nto found in the model trainign set. Unfortunately, outside of models trained on the Pile and C4, its very rare that people who train models disclose the contents of the training data. However this utility can be useful to evaluate models you have trained on private data, provided you are willing to pre-compute the necessary indices. We provide computed indices for 13-gram exact match deduplication against the Pile, and plan to add additional precomputed dataset indices in the future (including C4 and min-hash LSH deduplication).
To address concerns about train / test contamination, we provide utilities for comparing results on a benchmark using only the data points not found in the model training set. Unfortunately, outside of models trained on the Pile and C4, it's very rare that people who train models disclose the contents of the training data. However, this utility can be useful to evaluate models you have trained on private data, provided you are willing to pre-compute the necessary indices. We provide computed indices for 13-gram exact match deduplication against the Pile, and plan to add additional precomputed dataset indices in the future (including C4 and min-hash LSH deduplication).
For details on text decontamination, see the [decontamination guide](./docs/decontamination.md).
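For intuition, here is a minimal, self-contained sketch of 13-gram exact-match decontamination. It assumes direct access to the training text and is not the harness's own index format or implementation:
```python
from typing import Iterable, Iterator, List, Set


def ngrams(tokens: List[str], n: int = 13) -> Iterator[tuple]:
    # Yield every contiguous n-gram in a token list.
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i : i + n])


def build_index(training_texts: Iterable[str], n: int = 13) -> Set[int]:
    # Hash every 13-gram of the training data into a set (a stand-in for a precomputed index).
    return {hash(gram) for text in training_texts for gram in ngrams(text.split(), n)}


def is_contaminated(test_text: str, index: Set[int], n: int = 13) -> bool:
    # Flag a test document if any of its 13-grams appears in the training index.
    return any(hash(gram) in index for gram in ngrams(test_text.split(), n))


train = ["the quick brown fox jumps over the lazy dog and then runs far away home"]
index = build_index(train)
print(is_contaminated("brown fox jumps over the lazy dog and then runs far away home today", index))  # True
```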
......
......@@ -202,7 +202,7 @@ def construct_requests(self, doc, ctx):
#### What's a `Request`? What's a `doc`?
To reiterate, a `doc` is just a `Dict` object that contains information about a document from your corpus. It can contain things like a prompt, question type information, answers, and anything else you think will be needed in order to assess your model for a given task. Keep in mind that the fields of a `doc` can be basically whatever you want (you can sort this out in `training_docs` / `validation_docs` / `test_docs` if you need to customise things - see above); just remember to be consistent with them throughout the rest of the `Task` you write up.
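For example, a `doc` for a toy multiple-choice task might look like this (the field names are entirely your choice):
```python
doc = {
    "question": "What is the capital of France?",
    "choices": ["London", "Paris", "Berlin"],
    "gold": 1,  # index of the correct choice
}
```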
A `Request` is an object that takes the text prompt you want to present to a model and computes one of a few different types of response. These are evaluated lazily (meaning, only when the result is actually needed). If your task requires generating text you'll need to return an `rf.greedy_until` request; otherwise, an `rf.loglikelihood` across all labels in a classification task will do.
The function `construct_requests` can return a list of `Request`s or an iterable; it's perfectly fine to `yield` them from something or other. This is particularly handy if you are creating more than one request per `doc` (usually because you're up to something like multi-task learning). The objects this function returns then get consumed one by one and turned into result objects.
The function `construct_requests` returns a list or tuple of `Request`s, or a single `Request`. This is particularly handy if you are creating more than one request per `doc` (usually because you're up to something like multi-task learning). The objects this function returns then get consumed one by one and turned into result objects.
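As a hedged illustration, `construct_requests` for a classification-style task and for a generation-style task might look roughly like the following. The `doc` field names and the exact stop-sequence argument of `rf.greedy_until` are assumptions for this sketch, not canonical signatures:
```python
# `rf` is the harness's request factory mentioned above.
def construct_requests(self, doc, ctx):
    # Classification-style: one loglikelihood request per candidate label.
    return [rf.loglikelihood(ctx, " " + choice) for choice in doc["choices"]]


def construct_requests_for_generation(self, doc, ctx):
    # Generation-style: a single greedy_until request; check the task guide for
    # the exact stop-sequence format expected by your harness version.
    return rf.greedy_until(ctx, ["\n"])
```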
```python
......@@ -271,6 +271,19 @@ python main.py \
--num_fewshot K
```
### Checking the Model Outputs
The `write_out.py` script mentioned previously can be used to verify that the prompts look as intended. If you also want to save model outputs, you can use the `--write_out` parameter in `main.py` to dump JSON with prompts and completions. The output path can be chosen with `--output_base_path`. This is helpful for debugging and for exploring model outputs.
```sh
python main.py \
--model gpt2 \
--model_args device=<device-name> \
--tasks <task-name> \
--num_fewshot K \
--write_out \
--output_base_path <path>
```
### Running Unit Tests
To run the entire test suite, use:
......
......@@ -17,6 +17,27 @@
|arithmetic_4ds | |✓ | | 2000|acc |
|arithmetic_5da | |✓ | | 2000|acc |
|arithmetic_5ds | |✓ | | 2000|acc |
|bigbench_causal_judgement | | |✓ | 190|multiple_choice_grade, exact_str_match |
|bigbench_date_understanding | | |✓ | 369|multiple_choice_grade, exact_str_match |
|bigbench_disambiguation_qa | | |✓ | 258|multiple_choice_grade, exact_str_match |
|bigbench_dyck_languages | | |✓ | 1000|multiple_choice_grade, exact_str_match |
|bigbench_formal_fallacies_syllogisms_negation | | |✓ | 14200|multiple_choice_grade, exact_str_match |
|bigbench_geometric_shapes | | |✓ | 359|multiple_choice_grade, exact_str_match |
|bigbench_hyperbaton | | |✓ | 50000|multiple_choice_grade, exact_str_match |
|bigbench_logical_deduction_five_objects | | |✓ | 500|multiple_choice_grade, exact_str_match |
|bigbench_logical_deduction_seven_objects | | |✓ | 700|multiple_choice_grade, exact_str_match |
|bigbench_logical_deduction_three_objects | | |✓ | 300|multiple_choice_grade, exact_str_match |
|bigbench_movie_recommendation | | |✓ | 500|multiple_choice_grade, exact_str_match |
|bigbench_navigate | | |✓ | 1000|multiple_choice_grade, exact_str_match |
|bigbench_reasoning_about_colored_objects | | |✓ | 2000|multiple_choice_grade, exact_str_match |
|bigbench_ruin_names | | |✓ | 448|multiple_choice_grade, exact_str_match |
|bigbench_salient_translation_error_detection | | |✓ | 998|multiple_choice_grade, exact_str_match |
|bigbench_snarks | | |✓ | 181|multiple_choice_grade, exact_str_match |
|bigbench_sports_understanding | | |✓ | 986|multiple_choice_grade, exact_str_match |
|bigbench_temporal_sequences | | |✓ | 1000|multiple_choice_grade, exact_str_match |
|bigbench_tracking_shuffled_objects_five_objects | | |✓ | 1250|multiple_choice_grade, exact_str_match |
|bigbench_tracking_shuffled_objects_seven_objects | | |✓ | 1750|multiple_choice_grade, exact_str_match |
|bigbench_tracking_shuffled_objects_three_objects | | |✓ | 300|multiple_choice_grade, exact_str_match |
|blimp_adjunct_island | |✓ | | 1000|acc |
|blimp_anaphor_gender_agreement | |✓ | | 1000|acc |
|blimp_anaphor_number_agreement | |✓ | | 1000|acc |
......@@ -89,6 +110,28 @@
|cola |✓ |✓ | | 1043|mcc |
|copa |✓ |✓ | | 100|acc |
|coqa |✓ |✓ | | 500|f1, em |
|crows_pairs_english | |✓ | | 1677|likelihood_difference, pct_stereotype |
|crows_pairs_english_age | |✓ | | 91|likelihood_difference, pct_stereotype |
|crows_pairs_english_autre | |✓ | | 11|likelihood_difference, pct_stereotype |
|crows_pairs_english_disability | |✓ | | 65|likelihood_difference, pct_stereotype |
|crows_pairs_english_gender | |✓ | | 320|likelihood_difference, pct_stereotype |
|crows_pairs_english_nationality | |✓ | | 216|likelihood_difference, pct_stereotype |
|crows_pairs_english_physical_appearance | |✓ | | 72|likelihood_difference, pct_stereotype |
|crows_pairs_english_race_color | |✓ | | 508|likelihood_difference, pct_stereotype |
|crows_pairs_english_religion | |✓ | | 111|likelihood_difference, pct_stereotype |
|crows_pairs_english_sexual_orientation | |✓ | | 93|likelihood_difference, pct_stereotype |
|crows_pairs_english_socioeconomic | |✓ | | 190|likelihood_difference, pct_stereotype |
|crows_pairs_french | |✓ | | 1677|likelihood_difference, pct_stereotype |
|crows_pairs_french_age | |✓ | | 90|likelihood_difference, pct_stereotype |
|crows_pairs_french_autre | |✓ | | 13|likelihood_difference, pct_stereotype |
|crows_pairs_french_disability | |✓ | | 66|likelihood_difference, pct_stereotype |
|crows_pairs_french_gender | |✓ | | 321|likelihood_difference, pct_stereotype |
|crows_pairs_french_nationality | |✓ | | 253|likelihood_difference, pct_stereotype |
|crows_pairs_french_physical_appearance | |✓ | | 72|likelihood_difference, pct_stereotype |
|crows_pairs_french_race_color | |✓ | | 460|likelihood_difference, pct_stereotype |
|crows_pairs_french_religion | |✓ | | 115|likelihood_difference, pct_stereotype |
|crows_pairs_french_sexual_orientation | |✓ | | 91|likelihood_difference, pct_stereotype |
|crows_pairs_french_socioeconomic | |✓ | | 196|likelihood_difference, pct_stereotype |
|cycle_letters | |✓ | | 10000|acc |
|drop |✓ |✓ | | 9536|em, f1 |
|ethics_cm |✓ | |✓ | 3885|acc |
......@@ -161,13 +204,13 @@
|hendrycksTest-world_religions | |✓ |✓ | 171|acc, acc_norm |
|iwslt17-ar-en | | |✓ | 1460|bleu, chrf, ter |
|iwslt17-en-ar | | |✓ | 1460|bleu, chrf, ter |
|lambada_openai | | | | 5153|ppl, acc |
|lambada_openai_cloze | | | | 5153|ppl, acc |
|lambada_openai_mt_de | | | | 5153|ppl, acc |
|lambada_openai_mt_en | | | | 5153|ppl, acc |
|lambada_openai_mt_es | | | | 5153|ppl, acc |
|lambada_openai_mt_fr | | | | 5153|ppl, acc |
|lambada_openai_mt_it | | | | 5153|ppl, acc |
|lambada_openai | | | | 5153|ppl, acc |
|lambada_openai_cloze | | | | 5153|ppl, acc |
|lambada_openai_mt_de | | | | 5153|ppl, acc |
|lambada_openai_mt_en | | | | 5153|ppl, acc |
|lambada_openai_mt_es | | | | 5153|ppl, acc |
|lambada_openai_mt_fr | | | | 5153|ppl, acc |
|lambada_openai_mt_it | | | | 5153|ppl, acc |
|lambada_standard | |✓ |✓ | 5153|ppl, acc |
|lambada_standard_cloze | |✓ |✓ | 5153|ppl, acc |
|logiqa |✓ |✓ |✓ | 651|acc, acc_norm |
......@@ -181,6 +224,17 @@
|math_precalc |✓ | |✓ | 546|acc |
|mathqa |✓ |✓ |✓ | 2985|acc, acc_norm |
|mc_taco | |✓ |✓ | 9442|f1, em |
|mgsm_bn |✓ | |✓ | 250|acc |
|mgsm_de |✓ | |✓ | 250|acc |
|mgsm_en |✓ | |✓ | 250|acc |
|mgsm_es |✓ | |✓ | 250|acc |
|mgsm_fr |✓ | |✓ | 250|acc |
|mgsm_ja |✓ | |✓ | 250|acc |
|mgsm_ru |✓ | |✓ | 250|acc |
|mgsm_sw |✓ | |✓ | 250|acc |
|mgsm_te |✓ | |✓ | 250|acc |
|mgsm_th |✓ | |✓ | 250|acc |
|mgsm_zh |✓ | |✓ | 250|acc |
|mnli |✓ |✓ | | 9815|acc |
|mnli_mismatched |✓ |✓ | | 9832|acc |
|mrpc |✓ |✓ | | 408|acc, f1 |
......@@ -188,6 +242,13 @@
|mutual |✓ |✓ | | 886|r@1, r@2, mrr |
|mutual_plus |✓ |✓ | | 886|r@1, r@2, mrr |
|openbookqa |✓ |✓ |✓ | 500|acc, acc_norm |
|pawsx_de |✓ |✓ |✓ | 2000|acc |
|pawsx_en |✓ |✓ |✓ | 2000|acc |
|pawsx_es |✓ |✓ |✓ | 2000|acc |
|pawsx_fr |✓ |✓ |✓ | 2000|acc |
|pawsx_ja |✓ |✓ |✓ | 2000|acc |
|pawsx_ko |✓ |✓ |✓ | 2000|acc |
|pawsx_zh |✓ |✓ |✓ | 2000|acc |
|pile_arxiv | |✓ |✓ | 2407|word_perplexity, byte_perplexity, bits_per_byte |
|pile_bookcorpus2 | |✓ |✓ | 28|word_perplexity, byte_perplexity, bits_per_byte |
|pile_books3 | |✓ |✓ | 269|word_perplexity, byte_perplexity, bits_per_byte |
......@@ -228,6 +289,7 @@
|squad2 |✓ |✓ | | 11873|exact, f1, HasAns_exact, HasAns_f1, NoAns_exact, NoAns_f1, best_exact, best_f1 |
|sst |✓ |✓ | | 872|acc |
|swag |✓ |✓ | | 20006|acc, acc_norm |
|toxigen |✓ | |✓ | 940|acc, acc_norm |
|triviaqa |✓ |✓ | | 11313|acc |
|truthfulqa_gen | |✓ | | 817|bleurt_max, bleurt_acc, bleurt_diff, bleu_max, bleu_acc, bleu_diff, rouge1_max, rouge1_acc, rouge1_diff, rouge2_max, rouge2_acc, rouge2_diff, rougeL_max, rougeL_acc, rougeL_diff|
|truthfulqa_mc | |✓ | | 817|mc1, mc2 |
......@@ -266,3 +328,46 @@
|wnli |✓ |✓ | | 71|acc |
|wsc |✓ |✓ | | 104|acc |
|wsc273 | | |✓ | 273|acc |
|xcopa_et | |✓ |✓ | 500|acc |
|xcopa_ht | |✓ |✓ | 500|acc |
|xcopa_id | |✓ |✓ | 500|acc |
|xcopa_it | |✓ |✓ | 500|acc |
|xcopa_qu | |✓ |✓ | 500|acc |
|xcopa_sw | |✓ |✓ | 500|acc |
|xcopa_ta | |✓ |✓ | 500|acc |
|xcopa_th | |✓ |✓ | 500|acc |
|xcopa_tr | |✓ |✓ | 500|acc |
|xcopa_vi | |✓ |✓ | 500|acc |
|xcopa_zh | |✓ |✓ | 500|acc |
|xnli_ar |✓ |✓ |✓ | 5010|acc |
|xnli_bg |✓ |✓ |✓ | 5010|acc |
|xnli_de |✓ |✓ |✓ | 5010|acc |
|xnli_el |✓ |✓ |✓ | 5010|acc |
|xnli_en |✓ |✓ |✓ | 5010|acc |
|xnli_es |✓ |✓ |✓ | 5010|acc |
|xnli_fr |✓ |✓ |✓ | 5010|acc |
|xnli_hi |✓ |✓ |✓ | 5010|acc |
|xnli_ru |✓ |✓ |✓ | 5010|acc |
|xnli_sw |✓ |✓ |✓ | 5010|acc |
|xnli_th |✓ |✓ |✓ | 5010|acc |
|xnli_tr |✓ |✓ |✓ | 5010|acc |
|xnli_ur |✓ |✓ |✓ | 5010|acc |
|xnli_vi |✓ |✓ |✓ | 5010|acc |
|xnli_zh |✓ |✓ |✓ | 5010|acc |
|xstory_cloze_ar |✓ |✓ | | 1511|acc |
|xstory_cloze_en |✓ |✓ | | 1511|acc |
|xstory_cloze_es |✓ |✓ | | 1511|acc |
|xstory_cloze_eu |✓ |✓ | | 1511|acc |
|xstory_cloze_hi |✓ |✓ | | 1511|acc |
|xstory_cloze_id |✓ |✓ | | 1511|acc |
|xstory_cloze_my |✓ |✓ | | 1511|acc |
|xstory_cloze_ru |✓ |✓ | | 1511|acc |
|xstory_cloze_sw |✓ |✓ | | 1511|acc |
|xstory_cloze_te |✓ |✓ | | 1511|acc |
|xstory_cloze_zh |✓ |✓ | | 1511|acc |
|xwinograd_en | | |✓ | 2325|acc |
|xwinograd_fr | | |✓ | 83|acc |
|xwinograd_jp | | |✓ | 959|acc |
|xwinograd_pt | | |✓ | 263|acc |
|xwinograd_ru | | |✓ | 315|acc |
|xwinograd_zh | | |✓ | 504|acc |
ROUGE
rouge
nin
maka
mor
te
import abc
from typing import Union
from lm_eval import utils
......
......@@ -460,7 +460,7 @@ class Task(abc.ABC):
return self._instances
def dump_config(self):
"""Returns a dictionary representing the task's config.
"""Returns a dictionary representing the task's config.
:returns: dict
The task's configuration as a dictionary.
......
......@@ -30,14 +30,16 @@ def simple_evaluate(
tasks=[],
num_fewshot=0,
batch_size=None,
max_batch_size=None,
device=None,
no_cache=False,
limit=None,
bootstrap_iters=100000,
check_integrity=False,
decontamination_ngrams_path=None,
write_out=False,
output_base_path=None,
):
"""Instantiate and evaluate a model on a list of tasks.
:param model: Union[str, LM]
......@@ -49,18 +51,24 @@ def simple_evaluate(
List of task names or Task objects. Task objects will be taken to have name task.EVAL_HARNESS_NAME if defined and type(task).__name__ otherwise.
:param num_fewshot: int
Number of examples in few-shot context
:param batch_size: int, optional
:param batch_size: int or str, optional
Batch size for model
:param max_batch_size: int, optional
Maximal batch size to try with automatic batch size detection
:param device: str, optional
PyTorch device (e.g. "cpu" or "cuda:0") for running models
:param no_cache: bool
Whether or not to cache
:param limit: int, optional
Limit the number of examples per task (only use this for testing)
:param limit: int or float, optional
Limit the number of examples per task (only use this for testing). If <1, limit is interpreted as a fraction of the total number of examples.
:param bootstrap_iters:
Number of iterations for bootstrap statistics
:param check_integrity: bool
Whether to run the relevant part of the test suite for the tasks
:param write_out: bool
If True, write details about prompts and logits to json for all tasks
:param output_base_path: str, optional
Directory to which detailed eval info will be written. Defaults to the current working directory.
:return
Dictionary of results
"""
......@@ -73,7 +81,7 @@ def simple_evaluate(
if model_args is None:
model_args = ""
lm = lm_eval.api.registry.get_model(model).create_from_arg_string(
model_args, {"batch_size": batch_size, "device": device}
model_args, {"batch_size": batch_size, "max_batch_size": max_batch_size, "device": device}
)
else:
assert isinstance(model, lm_eval.api.model.LM)
......@@ -90,15 +98,18 @@ def simple_evaluate(
limit=limit,
bootstrap_iters=bootstrap_iters,
decontamination_ngrams_path=decontamination_ngrams_path,
write_out=write_out,
output_base_path=output_base_path,
)
if lm.rank == 0:
# add info about the model and few shot config
results["config"] = {
"model": model,
"model": model if isinstance(model, str) else model.model.config._name_or_path,
"model_args": model_args,
"num_fewshot": num_fewshot,
"batch_size": batch_size,
"batch_sizes": list(lm.batch_sizes.values()) if hasattr(lm, "batch_sizes") else [],
"device": device,
"no_cache": no_cache,
"limit": limit,
......@@ -120,6 +131,8 @@ def evaluate(
limit=None,
bootstrap_iters=100000,
decontamination_ngrams_path=None,
write_out=False,
output_base_path=None,
):
"""Instantiate and evaluate a model on a list of tasks.
......@@ -133,6 +146,10 @@ def evaluate(
Limit the number of examples per task (only use this for testing)
:param bootstrap_iters:
Number of iterations for bootstrap statistics
:param write_out: bool
If True, write all prompts, logits and metrics to json for offline analysis
:param output_base_path: str, optional
Directory to which detailed eval info will be written. Defaults to the current working directory.
:return
Dictionary of results
"""
......
import os
from lm_eval.base import BaseLM
from tqdm import tqdm
import time
def anthropic_completion(client, model, prompt, max_tokens_to_sample, temperature, stop):
"""Query Anthropic API for completion.
Retry with back-off until they respond
"""
import anthropic
backoff_time = 3
while True:
try:
response = client.completion(
prompt=f"{anthropic.HUMAN_PROMPT} {prompt}{anthropic.AI_PROMPT}",
model=model,
# NOTE: Claude really likes to do CoT, and overly aggressive stop sequences
# (e.g. gsm8k's ":") may truncate a lot of the input.
stop_sequences=[anthropic.HUMAN_PROMPT] + stop,
max_tokens_to_sample=max_tokens_to_sample,
temperature=temperature,
)
print(response)
return response["completion"]
except RuntimeError:
# TODO: I don't actually know what error Anthropic raises when it times out
# so update this error handling when we find out.
import traceback
traceback.print_exc()
time.sleep(backoff_time)
backoff_time *= 1.5
class AnthropicLM(BaseLM):
REQ_CHUNK_SIZE = 20
def __init__(self, model):
"""
:param model: str
Anthropic model e.g. claude-instant-v1
"""
super().__init__()
import anthropic
self.model = model
self.client = anthropic.Client(os.environ['ANTHROPIC_API_KEY'])
@property
def eot_token_id(self):
raise NotImplementedError("No idea about anthropic tokenization.")
@property
def max_length(self):
return 2048
@property
def max_gen_toks(self):
return 256
@property
def batch_size(self):
# Isn't used because we override _loglikelihood_tokens
raise NotImplementedError()
@property
def device(self):
# Isn't used because we override _loglikelihood_tokens
raise NotImplementedError()
def tok_encode(self, string: str):
raise NotImplementedError("No idea about anthropic tokenization.")
def tok_decode(self, tokens):
raise NotImplementedError("No idea about anthropic tokenization.")
def _loglikelihood_tokens(self, requests, disable_tqdm=False):
raise NotImplementedError("No support for logits.")
def greedy_until(self, requests):
if not requests:
return []
res = []
for request in tqdm(requests):
inp = request[0]
request_args = request[1]
until = request_args["until"]
response = anthropic_completion(
client=self.client,
model=self.model,
prompt=inp,
max_tokens_to_sample=self.max_gen_toks,
temperature=0.0,
stop=until,
)
res.append(response)
return res
def _model_call(self, inps):
# Isn't used because we override _loglikelihood_tokens
raise NotImplementedError()
def _model_generate(self, context, max_length, eos_token_id):
# Isn't used because we override greedy_until
raise NotImplementedError()
from typing import Iterable
from tqdm import tqdm
from accelerate import find_executable_batch_size
import math
import peft
from peft import __version__ as PEFT_VERSION
from pathlib import Path
from typing import List, Mapping, NewType, Optional, Tuple, Union
from tqdm import tqdm
import torch
import torch.nn.functional as F
import transformers
from typing import Optional, Union
from transformers import BatchEncoding
from lm_eval.api.model import LM
from lm_eval import utils
from abc import abstractmethod
class BaseLM(LM):
def __init__(self):
super().__init__()
self.batch_schedule = 1
self.batch_sizes = {}
self.max_batch_size = 512
@property
@abstractmethod
def eot_token_id(self):
pass
@property
@abstractmethod
def max_length(self):
pass
@property
@abstractmethod
def max_gen_toks(self):
pass
@property
@abstractmethod
def batch_size(self):
pass
@property
@abstractmethod
def device(self):
pass
@abstractmethod
def tok_encode(self, string: str):
pass
@abstractmethod
def tok_decode(self, tokens: Iterable[int]):
pass
@abstractmethod
def _model_generate(self, context, max_length, eos_token_id):
pass
@abstractmethod
def _model_call(self, inps):
"""
inps: a torch tensor of shape [batch, sequence]
the size of sequence may vary from call to call
returns: a torch tensor of shape [batch, sequence, vocab] with the
logits returned from the model
"""
pass
def _detect_batch_size(self, requests=None, pos=0):
if requests:
_, context_enc, continuation_enc = requests[pos]
max_length = len((context_enc + continuation_enc)[-(self.max_length + 1) :][:-1])
else:
max_length = self.max_length
# if OOM, then halves batch_size and tries again
@find_executable_batch_size(starting_batch_size=self.max_batch_size)
def forward_batch(batch_size):
test_batch = torch.ones((batch_size, max_length), device=self.device).long()
for _ in range(5):
_ = F.log_softmax(self._model_call(test_batch), dim=-1).cpu()
return batch_size
batch_size = forward_batch()
utils.clear_torch_cache()
return batch_size
# subclass must implement properties vocab_size, eot_token_id, max_gen_toks, batch_size, device, max_length.
# TODO: enforce this somehow
def _encode_pair(self, context, continuation):
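# Move any trailing whitespace on the context over to the continuation before
# tokenizing: BPE tokenizers usually fold a leading space into the following token,
# so leaving the space on the context would make the separately-encoded pieces
# disagree with the encoding of the concatenated string below.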
n_spaces = len(context) - len(context.rstrip())
if n_spaces > 0:
continuation = context[-n_spaces:] + continuation
context = context[:-n_spaces]
whole_enc = self.tok_encode(context + continuation)
context_enc = self.tok_encode(context)
context_enc_len = len(context_enc)
continuation_enc = whole_enc[context_enc_len:]
return context_enc, continuation_enc
def loglikelihood(self, requests):
new_reqs = []
for context, continuation in requests:
if context == "":
# end of text as context
context_enc, continuation_enc = [self.eot_token_id], self.tok_encode(continuation)
else:
context_enc, continuation_enc = self._encode_pair(context, continuation)
new_reqs.append(((context, continuation), context_enc, continuation_enc))
return self._loglikelihood_tokens(new_reqs)
def loglikelihood_rolling(self, requests):
# TODO: Implement caching once we've confirmed the perplexity implementation
# automatic batch size detection for vectorization
adaptive_batch_size = None
if self.batch_size == "auto":
# using rolling window with maximum context
print("Passed argument batch_size = auto. Detecting largest batch size")
batch_size = self._detect_batch_size()
print(f"Determined Largest batch size: {batch_size}")
adaptive_batch_size = batch_size
loglikelihoods = []
for (string,) in tqdm(requests):
rolling_token_windows = list(
map(
utils.make_disjoint_window,
utils.get_rolling_token_windows(
token_list=self.tok_encode(string),
prefix_token=self.eot_token_id,
max_seq_len=self.max_length,
context_len=1,
),
)
)
rolling_token_windows = [(None,) + x for x in rolling_token_windows]
# TODO: extract out this call so it only gets called once and also somehow figure out partial caching for
# that
string_nll = self._loglikelihood_tokens(
rolling_token_windows,
disable_tqdm=True,
override_bs=adaptive_batch_size,
)
# discard is_greedy
string_nll = [x[0] for x in string_nll]
string_nll = sum(string_nll)
loglikelihoods.append(string_nll)
return loglikelihoods
def _loglikelihood_tokens(self, requests, disable_tqdm=False, override_bs=None):
# TODO: implement some kind of efficient-request-middleware that lumps together requests with the same context
res = []
def _collate(x):
# the negative sign on len(toks) sorts descending - this has a few advantages:
# - time estimates will always be over not underestimates, which is more useful for planning
# - to know the size of a batch when going through the list, you know the first one is always the batch
# padded context length. this is useful to simplify the batching logic and more importantly to make
# automatic adaptive batches much much easier to implement
# - any OOMs will happen right away rather than near the end
toks = x[1] + x[2]
return -len(toks), tuple(toks)
re_ord = utils.Reorderer(requests, _collate)
reordered_requests = re_ord.get_reordered()
n_reordered_requests = len(reordered_requests)
# automatic (variable) batch size detection for vectorization
# pull longest context sample from request
def _batch_scheduler(pos):
sched = pos // int(n_reordered_requests / self.batch_schedule)
if sched in self.batch_sizes:
return self.batch_sizes[sched]
print(f"Passed argument batch_size = auto:{self.batch_schedule}. Detecting largest batch size")
self.batch_sizes[sched] = self._detect_batch_size(reordered_requests, pos)
print(f"Determined largest batch size: {self.batch_sizes[sched]}")
return self.batch_sizes[sched]
for chunk in utils.chunks(
tqdm(reordered_requests, disable=disable_tqdm),
n=self.batch_size if self.batch_size != "auto" else override_bs if override_bs is not None else 0,
fn=_batch_scheduler if self.batch_size == "auto" and n_reordered_requests > 0 else None,
):
inps = []
cont_toks_list = []
inplens = []
padding_length = None
# because vectorizing is annoying, we first convert each (context, continuation) pair to padded
# tensors, then we pack them together into a batch, call the model, and then pick it all apart
# again because vectorizing is annoying
for _, context_enc, continuation_enc in chunk:
# sanity check
assert len(context_enc) > 0
assert len(continuation_enc) > 0
assert len(continuation_enc) <= self.max_length
# how this all works:
# CTX CONT
# inp 0 1 2 3|4 5 6 7 8 9 <- last token is deleted by inp[:, :-1]
# gpt2 \ \
# logits 1 2 3|4 5 6 7 8 9 <- the ctx half gets tossed out by the
# cont_toks 4 5 6 7 8 9 [:, -len(continuation_enc):, :self.vocab_size] slice
# when too long to fit in context, truncate from the left
inp = torch.tensor(
(context_enc + continuation_enc)[-(self.max_length + 1) :][:-1],
dtype=torch.long,
).to(self.device)
(inplen,) = inp.shape
cont = continuation_enc
# since in _collate we make sure length is descending, the longest is always the first one.
padding_length = (
padding_length if padding_length is not None else inplen
)
# pad length from seq to padding_length
inp = torch.cat(
[
inp, # [seq]
torch.zeros(padding_length - inplen, dtype=torch.long).to(
inp.device
), # [padding_length - seq]
],
dim=0,
)
inps.append(inp.unsqueeze(0)) # [1, padding_length]
cont_toks_list.append(cont)
inplens.append(inplen)
batched_inps = torch.cat(inps, dim=0) # [batch, padding_length]
multi_logits = F.log_softmax(
self._model_call(batched_inps), dim=-1
).cpu() # [batch, padding_length, vocab]
for (cache_key, _, _), logits, inp, inplen, cont_toks in zip(
chunk, multi_logits, inps, inplens, cont_toks_list
):
# Slice to original seq length
contlen = len(cont_toks)
logits = logits[inplen - contlen : inplen].unsqueeze(
0
) # [1, seq, vocab]
# Check if per-token argmax is exactly equal to continuation
greedy_tokens = logits.argmax(dim=-1)
cont_toks = torch.tensor(cont_toks, dtype=torch.long).unsqueeze(
0
) # [1, seq]
max_equal = (greedy_tokens == cont_toks).all()
# Obtain log-probs at the corresponding continuation token indices
# last_token_slice = logits[:, -1, :].squeeze(0).tolist()
logits = torch.gather(logits, 2, cont_toks.unsqueeze(-1)).squeeze(
-1
) # [1, seq]
# Answer: (log prob, is-exact-match)
answer = (float(logits.sum()), bool(max_equal))
# partial caching
if cache_key is not None:
self.cache_hook.add_partial("loglikelihood", cache_key, answer)
res.append(answer)
return re_ord.get_original(res)
def greedy_until(self, requests):
# TODO: implement fully general `until` that handles until that are
# multiple tokens or that span multiple tokens correctly
# TODO: extract to TokenizedLM?
res = []
def _collate(x):
toks = self.tok_encode(x[0])
return len(toks), x[0]
re_ord = utils.Reorderer(requests, _collate)
for context, request_args in tqdm(re_ord.get_reordered()):
until = request_args["until"]
if isinstance(until, str):
until = [until]
if until:
(primary_until,) = self.tok_encode(until[0])
else:
primary_until = None
context_enc = torch.tensor(
[self.tok_encode(context)[self.max_gen_toks - self.max_length :]]
).to(self.device)
max_gen_tokens = min(
self.max_gen_toks, request_args.get("max_length", self.max_gen_toks)
)
cont = self._model_generate(
context_enc, context_enc.shape[1] + max_gen_tokens, primary_until
)
s = self.tok_decode(cont[0].tolist()[context_enc.shape[1] :])
for term in until:
s = s.split(term)[0]
# partial caching
self.cache_hook.add_partial("greedy_until", (context, until), s)
res.append(s)
return re_ord.get_original(res)
def _get_dtype(
dtype: Union[str, torch.dtype]
) -> torch.dtype:
"""Converts `dtype` from `str` to torch.dtype when possible. Does not use an instantiated HF AutoConfig"""
if isinstance(dtype, str) and dtype != "auto":
# Convert `str` args torch dtype: `float16` -> `torch.float16`
_torch_dtype = getattr(torch, dtype)
else:
_torch_dtype = dtype
return _torch_dtype
class HFLM(BaseLM):
_DEFAULT_MAX_LENGTH = 2048
def __init__(
self,
device="cuda",
pretrained="gpt2",
revision="main",
low_cpu_mem_usage=None,
subfolder=None,
tokenizer=None,
batch_size=1,
max_length=None,
load_in_8bit: Optional[bool] = False,
trust_remote_code: Optional[bool] = False,
dtype: Optional[Union[str, torch.dtype]]="auto",
):
super().__init__()
assert isinstance(device, str)
assert isinstance(pretrained, str)
assert isinstance(batch_size, (int, str))
device_list = set(
["cuda", "cpu"] + [f"cuda:{i}" for i in range(torch.cuda.device_count())]
)
if device and device in device_list:
self._device = torch.device(device)
print(f"Using device '{device}'")
else:
print("Device not specified")
print(f"Cuda Available? {torch.cuda.is_available()}")
self._device = (
torch.device("cuda")
if torch.cuda.is_available()
else torch.device("cpu")
)
# TODO: update this to be less of a hack once subfolder is fixed in HF
revision = revision + ("/" + subfolder if subfolder is not None else "")
self.gpt2 = transformers.AutoModelForCausalLM.from_pretrained(
pretrained,
load_in_8bit=load_in_8bit,
low_cpu_mem_usage=low_cpu_mem_usage,
revision=revision,
torch_dtype=_get_dtype(dtype),
trust_remote_code=trust_remote_code,
).eval()
if not load_in_8bit:
try:
self.gpt2.to(self.device)
except:
print("Failed to place model onto specified device. This may be because the model is quantized via `bitsandbytes`. If the desired GPU is being used, this message is safe to ignore.")
self.tokenizer = transformers.AutoTokenizer.from_pretrained(
pretrained if tokenizer is None else tokenizer,
revision=revision,
trust_remote_code=trust_remote_code,
)
self.vocab_size = self.tokenizer.vocab_size
# setup for automatic batch size detection
if batch_size == "auto":
self.batch_size_per_gpu = batch_size
else:
self.batch_size_per_gpu = int(batch_size)
self._max_length = max_length
@property
def eot_token_id(self):
# we use EOT because end of *text* is more accurate for what we're doing than end of *sentence*
return self.tokenizer.eos_token_id
@property
def max_length(self):
if self._max_length: # if max length manually set, return it
return self._max_length
seqlen_config_attrs = ("n_positions", "max_position_embeddings", "n_ctx")
for attr in seqlen_config_attrs:
if hasattr(self.gpt2.config, attr):
return getattr(self.gpt2.config, attr)
if hasattr(self.tokenizer, "model_max_length"):
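# transformers reports int(1e30) (1000000000000000019884624838656) as model_max_length
# when the tokenizer config does not specify one; treat that sentinel as "unset".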
if self.tokenizer.model_max_length == 1000000000000000019884624838656:
return self._DEFAULT_MAX_LENGTH
return self.tokenizer.model_max_length
return self._DEFAULT_MAX_LENGTH
@property
def max_gen_toks(self):
return 256
@property
def batch_size(self):
# TODO: fix multi-gpu
return self.batch_size_per_gpu # * gpus
@property
def device(self):
# TODO: fix multi-gpu
return self._device
def tok_encode(self, string: str):
return self.tokenizer.encode(string, add_special_tokens=False)
def tok_decode(self, tokens):
return self.tokenizer.decode(tokens)
def _model_call(self, inps):
"""
inps: a torch tensor of shape [batch, sequence]
the size of sequence may vary from call to call
returns: a torch tensor of shape [batch, sequence, vocab] with the
logits returned from the model
"""
with torch.no_grad():
return self.gpt2(inps)[0]
def _model_generate(self, context, max_length, eos_token_id):
generation_kwargs = {"do_sample": False, "max_length": max_length}
if eos_token_id is not None:
generation_kwargs['eos_token_id'] = eos_token_id
generation_kwargs['pad_token_id'] = eos_token_id # setting eos_token_id as pad token
return self.gpt2.generate(context, **generation_kwargs)
TokenSequence = Union[List[int], torch.LongTensor, torch.Tensor, BatchEncoding]
_DeviceMapping = NewType("DeviceMapping", Mapping[str, Union[int, str, torch.device]])
def _get_accelerate_args(
device_map_option: Optional[str] = "auto",
max_memory_per_gpu: Optional[Union[int, str]] = None,
max_cpu_memory: Optional[Union[int, str]] = None,
offload_folder: Optional[str] = "./offload",
) -> dict:
"""Returns the kwargs needed to apply `accelerate` in `AutoModel.from_pretrained`."""
max_memory = {}
if max_memory_per_gpu is not None:
max_memory_per_gpu_map = {
device_idx: max_memory_per_gpu
for device_idx in range(torch.cuda.device_count())
}
max_memory.update(max_memory_per_gpu_map)
if max_cpu_memory is not None:
max_memory["cpu"] = max_cpu_memory
args = {}
if max_memory:
args["max_memory"] = max_memory
args["device_map"] = device_map_option
args["offload_folder"] = offload_folder
return args
def _get_dtype(
dtype: Union[str, torch.dtype], config: Optional[transformers.AutoConfig] = None
) -> torch.dtype:
"""Converts `dtype` from `str` to torch.dtype when possible."""
if dtype is None and config is not None:
_torch_dtype = config.torch_dtype
elif isinstance(dtype, str) and dtype != "auto":
# Convert `str` args torch dtype: `float16` -> `torch.float16`
_torch_dtype = getattr(torch, dtype)
else:
_torch_dtype = dtype
return _torch_dtype
class HuggingFaceAutoLM(BaseLM):
AUTO_CONFIG_CLASS: transformers.AutoConfig = transformers.AutoConfig
AUTO_TOKENIZER_CLASS: transformers.AutoTokenizer = transformers.AutoTokenizer
AUTO_MODEL_CLASS: transformers.AutoModel = None
AUTO_PEFT_CLASS: peft.PeftModel = None
# Default max sequence length setting for when no `max_length` is provided
# or no max length config setting is found in the model or tokenizer.
_DEFAULT_MAX_LENGTH: int = 2048
def __init__(
self,
pretrained: str,
quantized: Optional[Union[bool, str]] = False,
tokenizer: Optional[str] = None,
subfolder: Optional[str] = None,
revision: Optional[str] = "main",
batch_size: Optional[Union[int, str]] = 1,
max_batch_size: Optional[int] = 512,
max_gen_toks: Optional[int] = 256,
max_length: Optional[int] = None,
add_special_tokens: Optional[bool] = None,
use_accelerate: Optional[bool] = False,
device_map_option: Optional[str] = "auto",
max_memory_per_gpu: Optional[Union[int, str]] = None,
max_cpu_memory: Optional[Union[int, str]] = None,
offload_folder: Optional[str] = "./offload",
dtype: Optional[Union[str, torch.dtype]] = None,
device: Optional[Union[int, str]] = "cuda",
peft: str = None,
load_in_8bit: Optional[bool] = False,
load_in_4bit: Optional[bool] = False,
trust_remote_code: Optional[bool] = False,
gptq_use_triton: Optional[bool] = False,
):
"""Initializes a HuggingFace `AutoModel` and `AutoTokenizer` for evaluation.
Args:
pretrained (str):
The HuggingFace Hub model ID name or the path to a pre-trained
model to load. This is effectively the `pretrained_model_name_or_path`
argument of `from_pretrained` in the HuggingFace `transformers` API.
quantized (str or bool, optional, defaults to False):
File name of a GPTQ quantized model to load. Set to `True` to use the
default name of the quantized model.
add_special_tokens (bool, optional, defaults to None):
Whether to add special tokens to the input sequences. If `None`, the
default value will be set to `True` for seq2seq models (e.g. T5) and
`False` for causal models.
WARNING: Evaluating causal models with `add_special_tokens=True` is
currently __not__ supported.
> Large model loading `accelerate` arguments
use_accelerate (bool, optional, defaults to False):
If True, uses the `accelerate` library to load a large model across
multiple devices.
device_map_option (str, optional, defaults to "auto"):
The device map option to use when loading the model with
`accelerate`.
Options:
"auto", "balanced", "balanced_low_0", "sequential"
See the `accelerate` docs for more details on these options:
https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.from_pretrained.device_map
max_memory_per_gpu (Union[int, str], optional, defaults to None):
The maximum memory available for each GPU in bytes as `int` or in
the format f"{significand}{unit_symbol}" where {unit_symbol} is
any of ["GB", "MB", "GIB", "MIB"]. Refer to the `max_memory` arg in
the "Parameters for big model inference" section of the following
docs:
https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.from_pretrained.max_memory
max_cpu_memory (Union[int, str], optional, defaults to None):
The maximum available CPU RAM in bytes as `int` or in the format
f"{significand}{unit_symbol}" where {unit_symbol} is any of
["GB", "MB", "GIB", "MIB"]. Refer to the `max_memory` arg in the
"Parameters for big model inference" section of the following docs:
https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel.from_pretrained.max_memory
offload_folder (str, optional, defaults to "./offload"):
The folder to offload weights into if `device_map` contains any
"disk" value.
dtype (Union[str, torch.dtype], optional, defaults to None):
Converts the model weights to `dtype`, if specified. Strings get
converted to `torch.dtype` objects (e.g. `float16` -> `torch.float16`).
Use `dtype="auto"` to derive the type from the model’s weights.
peft (str, optional, defaults to None):
Path of the adapter weights to load from Huggingface. This will usually
include a directory that includes the files `adapter_config.json` and
`adapter_model.bin`. Compatible with [PEFT](https://github.com/huggingface/peft)
load_in_8bit (bool, optional, defaults to False):
If True, will convert the loaded model into mixed-8bit quantized model. See:
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#load-a-large-model-in-8bit
load_in_4bit (bool, optional, defaults to False):
If True, will convert the loaded model into mixed-4bit quantized model. See:
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#load-a-large-model-in-4bit
trust_remote_code (bool, optional, defaults to False):
If True, will trust the remote code when loading the model.
gptq_use_triton (bool, optional, defaults to False):
Use Triton for GPTQ inference.
"""
super().__init__()
assert isinstance(pretrained, str)
assert isinstance(device, str)
assert isinstance(batch_size, (int, str))
if (
add_special_tokens is not None
and self.AUTO_MODEL_CLASS is transformers.AutoModelForCausalLM
):
# TODO: Support evaluating causal models with special tokens. Currently,
# this is not possible because the `_loglikelihood_tokens()` method for
# causal LMs makes a no-special-tokens assumption given that contexts
# and labels/continuations are tokenized separately without special
# tokens, concatenated, and then processed as inputs.
assert (
not add_special_tokens
), "Evaluating causal models with `add_special_tokens=True` is currently not supported."
# setup for automatic batch size detection
if str(batch_size).startswith("auto"):
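# batch_size may be "auto" or "auto:N"; N sets batch_schedule, i.e. how many times
# the largest workable batch size is re-detected over the course of a run
# (see BaseLM._batch_scheduler).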
batch_size = batch_size.split(":")
self._batch_size = batch_size[0]
self.batch_schedule = float(batch_size[1]) if len(batch_size) > 1 else 1
else:
self._batch_size = int(batch_size)
self.max_batch_size = max_batch_size
self._max_gen_toks = max_gen_toks
self._max_length = max_length
self._config = self.AUTO_CONFIG_CLASS.from_pretrained(
pretrained,
trust_remote_code=trust_remote_code,
revision=revision + ("/" + subfolder if subfolder is not None else ""),
)
self._add_special_tokens = add_special_tokens
self.tokenizer = self._create_auto_tokenizer(
pretrained=pretrained,
revision=revision,
subfolder=subfolder,
tokenizer=tokenizer,
)
self.tokenizer.model_max_length = self.max_length
model_kwargs = {}
if use_accelerate:
model_kwargs = _get_accelerate_args(
device_map_option,
max_memory_per_gpu,
max_cpu_memory,
offload_folder,
)
self.model = self._create_auto_model(
pretrained=pretrained,
quantized=quantized,
trust_remote_code=trust_remote_code,
revision=revision,
subfolder=subfolder,
torch_dtype=_get_dtype(dtype, self._config),
gptq_use_triton=gptq_use_triton,
load_in_8bit=load_in_8bit,
load_in_4bit=load_in_4bit,
**model_kwargs,
)
# note: peft_path can be different than pretrained model path
if peft is not None:
self.model = self._create_auto_model_peft(
model=self.model,
peft=peft,
revision=revision,
subfolder=subfolder,
load_in_4bit=load_in_4bit,
)
self.model.eval()
torch.set_grad_enabled(False)
self._device = device
if use_accelerate and "lm_head" in self.model.hf_device_map:
# `accelerate` can place `lm_head` weights on a different device than
# the user specified one so we force `self._device` to be the same as
# `lm_head`'s.
self._device = self.model.hf_device_map["lm_head"]
if not use_accelerate and not (load_in_4bit or load_in_8bit):
try:
self.model.to(self._device)
except:
print("Failed to place model onto specified device. This may be because the model is quantized via `bitsandbytes`. If the desired GPU is being used, this message is safe to ignore.")
def _create_auto_model(
self,
*,
pretrained: str,
quantized: Optional[Union[bool, str]] = False,
revision: str,
subfolder: str,
device_map: Optional[Union[str, _DeviceMapping]] = None,
max_memory: Optional[dict] = None,
offload_folder: Optional[str] = None,
load_in_8bit: Optional[bool] = False,
load_in_4bit: Optional[bool] = False,
trust_remote_code: Optional[bool] = False,
torch_dtype: Optional[Union[str, torch.dtype]] = None,
gptq_use_triton: Optional[bool] = False,
) -> transformers.AutoModel:
"""Returns a pre-trained pytorch model from a pre-trained model configuration."""
if not quantized:
if load_in_4bit:
assert transformers.__version__ >= "4.30.0", "load_in_4bit requires transformers >= 4.30.0"
model_kwargs = {}
if transformers.__version__ >= "4.30.0":
model_kwargs["load_in_4bit"] = load_in_4bit
model = self.AUTO_MODEL_CLASS.from_pretrained(
pretrained,
revision=revision + ("/" + subfolder if subfolder is not None else ""),
device_map=device_map,
max_memory=max_memory,
offload_folder=offload_folder,
load_in_8bit=load_in_8bit,
trust_remote_code=trust_remote_code,
torch_dtype=torch_dtype,
**model_kwargs,
)
else:
from auto_gptq import AutoGPTQForCausalLM
model = AutoGPTQForCausalLM.from_quantized(
pretrained,
model_basename=None if quantized == True else Path(quantized).stem,
device_map=device_map,
max_memory=max_memory,
trust_remote_code=trust_remote_code,
use_safetensors=True if quantized == True else quantized.endswith('.safetensors'),
use_triton=gptq_use_triton,
warmup_triton=gptq_use_triton,
)
return model
def _create_auto_model_peft(
self,
*,
model: transformers.PreTrainedModel,
peft: str,
revision: str,
subfolder: str,
load_in_4bit: Optional[bool] = False,
):
if load_in_4bit:
assert PEFT_VERSION >= "0.4.0", "load_in_4bit requires peft >= 0.4.0"
model = self.AUTO_PEFT_CLASS.from_pretrained(
model,
peft,
revision=revision + ("/" + subfolder if subfolder is not None else ""),
)
return model
def _create_auto_tokenizer(
self,
*,
pretrained: str,
revision: str,
subfolder: str,
tokenizer: Optional[str] = None,
) -> transformers.PreTrainedTokenizer:
"""Returns a pre-trained tokenizer from a pre-trained tokenizer configuration."""
tokenizer = self.AUTO_TOKENIZER_CLASS.from_pretrained(
pretrained if tokenizer is None else tokenizer,
revision=revision + ("/" + subfolder if subfolder is not None else ""),
)
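# Many causal LM tokenizers (e.g. GPT-2) ship without a pad token; reuse the EOS
# token so that batched encoding with padding works.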
tokenizer.pad_token = tokenizer.eos_token
return tokenizer
@property
def add_special_tokens(self) -> bool:
"""Whether to include special tokens in encoded text. This should be
determined by whether or not the model was trained with special tokens.
TODO: Remove these conditionals once HuggingFace supports a way to
check whether or not an arbitrary model was trained with special tokens.
"""
if self._add_special_tokens is not None:
return self._add_special_tokens
elif self.AUTO_MODEL_CLASS is transformers.AutoModelForCausalLM:
return False
elif self.AUTO_MODEL_CLASS is transformers.AutoModelForSeq2SeqLM:
return True
else:
raise ValueError(
"Could not determine `add_special_tokens` value from the model "
"class. Set to `True` or `False` depending on whether the model "
"was pre-trained with special tokens."
)
@property
def eot_token(self) -> str:
return self.tokenizer.eos_token
@property
def eot_token_id(self) -> int:
return self.tokenizer.eos_token_id
@property
def max_gen_toks(self) -> int:
return self._max_gen_toks
@property
def max_length(self) -> int:
"""Return the maximum sequence length of the model.
NOTE: Different model configurations have different max sequence length
attribute names.
- n_positions: (CTRLConfig, T5Config)
- max_position_embeddings: (BartConfig, RoFormerConfig)
- n_ctx: (GPT2Config)
NOTE: For relative position encoded models you should specify the max
sequence length of the model in the constructor via `max_length`.
"""
if self._max_length is not None:
return self._max_length
# Try to get the sequence length from the model config.
seqlen_config_attrs = ("n_positions", "max_position_embeddings", "n_ctx")
for attr in seqlen_config_attrs:
if hasattr(self._config, attr):
return getattr(self._config, attr)
if hasattr(self.tokenizer, "model_max_length"):
if self.tokenizer.model_max_length == 1000000000000000019884624838656:
return self._DEFAULT_MAX_LENGTH
return self.tokenizer.model_max_length
return self._DEFAULT_MAX_LENGTH
@property
def batch_size(self) -> int:
# TODO: Add adaptive batch size.
return self._batch_size # * gpus
@property
def device(self) -> Union[int, str, torch.device]:
return self._device
def tok_encode(self, string: str) -> TokenSequence:
# TODO: Merge `tok_encode_batch` here.
return self.tokenizer.encode(string, add_special_tokens=self.add_special_tokens)
def tok_encode_batch(self, strings: List[str]) -> TokenSequence:
return self.tokenizer(
strings,
padding=True,
add_special_tokens=self.add_special_tokens,
return_tensors="pt",
)
def tok_decode(self, tokens: torch.LongTensor) -> List[str]:
return self.tokenizer.batch_decode(tokens, skip_special_tokens=True)
def greedy_until(
self, requests: List[Tuple[str, Union[List[str], str]]]
) -> List[str]:
def _collate(x):
tokens = self.tok_encode(x[0])
return len(tokens), x[0]
results = []
reorder = utils.Reorderer(requests, _collate)
adaptive_batch_size = None
if self.batch_size == "auto":
# using rolling window with maximum context
print("Passed argument batch_size = auto. Detecting largest batch size")
batch_size = self._detect_batch_size()
print(f"Determined Largest batch size: {batch_size}")
adaptive_batch_size = batch_size
for chunk in utils.chunks(
tqdm(reorder.get_reordered(), disable=False),
self.batch_size if self.batch_size != "auto" else adaptive_batch_size,
):
context = [c[0] for c in chunk]
request_args = chunk[0][1]
stop = request_args.get("until", None)
stop_sequences = stop if isinstance(stop, list) else [stop]
max_generation_length = request_args.get("max_length", None)
assert (
isinstance(max_generation_length, int) or max_generation_length is None
)
assert isinstance(stop_sequences, list) or stop_sequences is None
# TODO: Find a better way to handle stop sequences for 0-shot.
if stop_sequences is None:
until = [self.eot_token]
else:
until = stop_sequences + [self.eot_token]
if max_generation_length is None:
max_tokens = self.max_gen_toks
else:
max_tokens = max_generation_length
token_context = self.tok_encode_batch(context)
responses = self._model_generate(
inputs=token_context,
max_tokens=max_tokens,
stop=until,
)
responses = self.tok_decode(responses.tolist())
for response in responses:
# Ensure the generated responses do not contain the stop sequences.
for term in until:
response = response.split(term)[0]
# partial caching
self.cache_hook.add_partial("greedy_until", (context, until), response)
results.append(response)
return reorder.get_original(results)
class AutoCausalLM(HuggingFaceAutoLM):
"""Causal language modeling.
You can find a set of supported models in the HF documentation:
https://huggingface.co/docs/transformers/main/model_doc/auto#transformers.AutoModelForCausalLM
"""
AUTO_MODEL_CLASS = transformers.AutoModelForCausalLM
AUTO_PEFT_CLASS = peft.PeftModel
def _create_auto_tokenizer(
self,
*,
pretrained: str,
revision: str,
subfolder: str,
tokenizer: Optional[str] = None,
) -> transformers.PreTrainedTokenizer:
tokenizer = super()._create_auto_tokenizer(
pretrained=pretrained,
revision=revision,
subfolder=subfolder,
tokenizer=tokenizer,
)
tokenizer.padding_side = "left"
return tokenizer
def _model_call(
self, inputs: TokenSequence, labels: Optional[TokenSequence] = None
) -> TokenSequence:
return self.model(inputs)["logits"]
def _model_generate(
self,
inputs: transformers.BatchEncoding,
max_tokens: int,
stop: Optional[List[str]] = None,
) -> TokenSequence:
# Ensure that the context does not encroach into the `space`
# for the generation.
input_ids = inputs["input_ids"][:, self.max_gen_toks - self.max_length :]
attention_mask = inputs["attention_mask"][
:, self.max_gen_toks - self.max_length :
]
input_ids = input_ids.to(self.device)
attention_mask = attention_mask.to(self.device)
stopping_criteria = stop_sequences_criteria(
self.tokenizer, stop, input_ids.shape[1], input_ids.shape[0]
)
generations = self.model.generate(
input_ids=input_ids,
attention_mask=attention_mask,
# GPT style models require the `generate` `max_length` arg to include the
# context length, so we instead set `max_new_tokens` which is the number
# of new tokens to generate, excluding the current number of tokens.
max_new_tokens=max_tokens,
stopping_criteria=stopping_criteria,
do_sample=False,
)
return utils.select_continuation_from_batch_left_padding(
generations, max_context_size=inputs["input_ids"].size(1)
)
class AutoSeq2SeqLM(HuggingFaceAutoLM):
"""Seq2Seq language modeling.
You can find a set of supported models in the following documentation:
https://huggingface.co/docs/transformers/main/model_doc/auto#transformers.AutoModelForSeq2SeqLM
"""
AUTO_MODEL_CLASS = transformers.AutoModelForSeq2SeqLM
AUTO_PEFT_CLASS = peft.PeftModel
def loglikelihood(
self, requests: List[Tuple[str, str]]
) -> List[Tuple[float, bool]]:
new_requests = []
for chunk in utils.chunks(requests, self.batch_size):
context, continuation = zip(*chunk)
# Fill empty contexts with the EOT token.
context = [
f"{self.eot_token}" if len(text) == 0 else text for text in context
]
context_enc = self.tok_encode_batch(context)
for key in context_enc:
context_enc[key] = context_enc[key][:, -self.max_length :]
# Remove leading whitespace introduced by the default
# `text_target_separator` since the context and continuation
# will not be concatenated as a single (decoder) input.
continuation = [text.lstrip() for text in continuation]
continuation_enc = self.tok_encode_batch(list(continuation))
for key in continuation_enc:
continuation_enc[key] = continuation_enc[key][:, -self.max_length :]
new_requests.append(
((context, continuation), context_enc, continuation_enc)
)
return self._loglikelihood_tokens(new_requests)
def loglikelihood_rolling(self, requests: List[Tuple[str, str]]) -> List[float]:
loglikelihoods = []
for (string,) in tqdm(requests):
rolling_token_windows = list(
map(
utils.make_disjoint_window,
utils.get_rolling_token_windows(
token_list=self.tok_encode(string),
prefix_token=self.eot_token_id,
max_seq_len=self.max_length,
context_len=1,
),
)
)
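            # `rolling_token_windows` is a list of (context_tokens, continuation_tokens)
            # pairs whose continuations jointly cover the full string; each window is
            # scored below and the per-window log-likelihoods are summed.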
contexts, conts = utils.split_and_pad_windows(
rolling_token_windows,
pad_token_id=self.eot_token_id,
max_seq_len=self.max_length,
)
# Manually create BatchEncoding tensors with attention masks as
# expected by `self._model_call` in `self._loglikelihood_tokens`.
contexts_enc = torch.Tensor(contexts).long()
contexts_enc = transformers.tokenization_utils_base.BatchEncoding(
{
"input_ids": contexts_enc,
"attention_mask": (contexts_enc != self.eot_token_id).long(),
}
)
conts_enc = torch.Tensor(conts).long()
conts_enc = transformers.tokenization_utils_base.BatchEncoding(
{
"input_ids": conts_enc,
"attention_mask": (conts_enc != self.eot_token_id).long(),
}
)
            # TODO: Extract out this call so it only gets called once and also
            # somehow figure out partial caching for these rolling-window requests.
rolling_token_windows_request = [
((contexts, conts), contexts_enc, conts_enc)
]
string_nll = self._loglikelihood_tokens(
rolling_token_windows_request, disable_tqdm=True
)
string_nll = [x[0] for x in string_nll] # discard is_greedy
string_nll = sum(string_nll)
loglikelihoods.append(string_nll)
return loglikelihoods
def _loglikelihood_tokens(
self,
requests: List[Tuple[Tuple[str, str], TokenSequence, TokenSequence]],
disable_tqdm: Optional[bool] = False,
) -> List[Tuple[float, bool]]:
results = []
for chunk in tqdm(
requests, total=math.ceil(len(requests)), disable=disable_tqdm
):
cache_keys, inputs_tokens, targets_tokens = chunk
inputs_tokens = inputs_tokens.to(self.device)
targets_tokens = targets_tokens.to(self.device)
outputs = self._model_call(inputs=inputs_tokens, labels=targets_tokens)
log_softmaxes = F.log_softmax(outputs.logits, dim=-1)
output_iterator = zip(
zip(cache_keys[0], cache_keys[1]),
log_softmaxes,
targets_tokens["input_ids"],
targets_tokens["attention_mask"],
)
for cache_key, log_softmax, target_tokens, target_mask in output_iterator:
length = target_mask.sum()
log_softmax = log_softmax[:length]
target_tokens = target_tokens[:length]
greedy_tokens = log_softmax.argmax(dim=-1)
max_equal = (greedy_tokens == target_tokens).all()
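                # Gather the log-probability assigned to each gold target token.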
target_logits = torch.gather(
log_softmax, 1, target_tokens.unsqueeze(-1)
).squeeze(-1)
answer = (float(target_logits.sum()), bool(max_equal))
results.append(answer)
if cache_key is not None:
self.cache_hook.add_partial("loglikelihood", cache_key, answer)
return results
def _model_call(
self, inputs: TokenSequence, labels: Optional[TokenSequence] = None
) -> TokenSequence:
return self.model(**inputs, labels=labels["input_ids"])
def _model_generate(
self,
inputs: transformers.BatchEncoding,
max_tokens: int,
stop: Optional[List[str]] = None,
) -> TokenSequence:
input_ids = inputs["input_ids"][:, -self.max_length :].to(self.device)
attention_mask = inputs["attention_mask"][:, -self.max_length :].to(self.device)
# Generate one token to calculate the number of start tokens prepended to decoder_input_ids
# (leaving this here in case the below assumption is violated in the future)
# one_tok_gen = self.model.generate(
# input_ids=torch.zeros((1, 1), dtype=torch.int),
# min_length=2,
# max_new_tokens=1,
# ).squeeze()
# initial_decoder_input_length = len(one_tok_gen) - 1
        # Assume that there will always be only one start token in the decoder inputs;
        # this assumption holds for existing HF models.
stopping_criteria = stop_sequences_criteria(
self.tokenizer, stop, 1, input_ids.shape[0]
)
generations = self.model.generate(
input_ids=input_ids,
attention_mask=attention_mask,
max_new_tokens=max_tokens,
stopping_criteria=stopping_criteria,
do_sample=False,
)
return generations
class MultiTokenEOSCriteria(transformers.StoppingCriteria):
"""Criteria to stop on the specified multi-token sequence."""
def __init__(
self,
sequence: str,
tokenizer: transformers.PreTrainedTokenizer,
initial_decoder_input_length: int,
batch_size: int,
):
self.initial_decoder_input_length = initial_decoder_input_length
self.done_tracker = [False] * batch_size
self.sequence = sequence
self.sequence_ids = tokenizer.encode(sequence, add_special_tokens=False)
self.sequence_id_len = len(self.sequence_ids)
self.tokenizer = tokenizer
def __call__(self, input_ids, scores, **kwargs) -> bool:
# For efficiency, we compare the last n tokens where n is the number of tokens in the stop_sequence
lookback_ids_batch = input_ids[:, self.initial_decoder_input_length :][
:, -self.sequence_id_len :
]
lookback_tokens_batch = self.tokenizer.batch_decode(lookback_ids_batch)
for i, done in enumerate(self.done_tracker):
if not done:
self.done_tracker[i] = self.sequence in lookback_tokens_batch[i]
return False not in self.done_tracker
def stop_sequences_criteria(
tokenizer: transformers.PreTrainedTokenizer,
stop_sequences: List[str],
initial_decoder_input_length: int,
batch_size: int,
) -> transformers.StoppingCriteriaList:
return transformers.StoppingCriteriaList(
[
*[
MultiTokenEOSCriteria(
sequence, tokenizer, initial_decoder_input_length, batch_size
)
for sequence in stop_sequences
],
]
)
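# Illustrative usage (a sketch mirroring `AutoCausalLM._model_generate` above; the
# `tokenizer`, `model`, `input_ids`, and `attention_mask` names are assumed to exist):
#
#     criteria = stop_sequences_criteria(
#         tokenizer, ["\n\n", tokenizer.eos_token], input_ids.shape[1], input_ids.shape[0]
#     )
#     generations = model.generate(
#         input_ids=input_ids,
#         attention_mask=attention_mask,
#         max_new_tokens=64,
#         stopping_criteria=criteria,
#         do_sample=False,
#     )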
......@@ -125,7 +125,8 @@ class TextSynthLM(LM):
res = []
for request in tqdm(requests):
inp = request[0]
until = request[1]
request_args = request[1]
until = request_args["until"]
response = textsynth_completion(
url=self.api_url + "/v1/engines/" + self.engine + "/completions",
headers={"Authorization": "Bearer " + self.api_key},
......
......@@ -15,6 +15,9 @@ from lm_eval.api.registry import (
)
ALL_TASKS = sorted(list(TASK_REGISTRY.keys()) + list(GROUP_REGISTRY.keys()))
def get_task_name_from_config(task_config):
return "{dataset_path}_{dataset_name}".format(**task_config)
......
......@@ -8,13 +8,20 @@ import functools
import subprocess
import collections
import importlib.util
import fnmatch
from typing import List
from typing import List, Union
import gc
import torch
from omegaconf import OmegaConf
from jinja2 import BaseLoader, Environment, StrictUndefined
from itertools import islice
from lm_eval import tasks
from lm_eval.logger import eval_logger
class ExitCodeError(Exception):
pass
......@@ -25,6 +32,29 @@ def sh(x):
raise ExitCodeError()
def escaped_split(text, sep_char, maxsplit=-1):
"""Split text into a list on occurrences of the given separation
character `sep_char`. The separation character may be escaped by a
backslash to avoid splitting at that location.
The separation character must be a string of size 1.
If `maxsplit` is given, at most `maxsplit` splits are done (thus,
the list will have at most `maxsplit + 1` elements). If `maxsplit`
is not specified or less than 0, then there is no limit on the
number of splits (all possible splits are made).
"""
assert (
len(sep_char) == 1
), "separation string must be a single character for escaped splitting"
    if maxsplit == 0:
        return [text]
    maxsplit = max(0, maxsplit)
    # Escape `sep_char` so that regex metacharacters (e.g. ".") are treated literally.
    return re.split(r"(?<!\\)" + re.escape(sep_char), text, maxsplit)
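# Illustrative example: a backslash-escaped separator is not split on.
#   escaped_split(r"arc_easy,desc=foo\,bar,num_fewshot=5", ",")
#   -> ["arc_easy", r"desc=foo\,bar", "num_fewshot=5"]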
def simple_parse_args_string(args_string):
"""
Parses something like
......@@ -44,11 +74,11 @@ def join_iters(iters):
yield from iter
def chunks(iter, n):
def chunks(iter, n=0, fn=None):
arr = []
for x in iter:
for i, x in enumerate(iter):
arr.append(x)
if len(arr) == n:
if len(arr) == (fn(i) if fn else n):
yield arr
arr = []
......@@ -65,6 +95,35 @@ def group(arr, fn):
return list(res.values())
class MultiChoice:
def __init__(self, choices):
self.choices = choices
# Simple wildcard support (linux filename patterns)
def __contains__(self, values):
for value in values.split(","):
if len(fnmatch.filter(self.choices, value)) == 0:
eval_logger.warning("{} is not in task list.".format(value))
eval_logger.info(f"Available tasks to choose:")
for choice in self.choices:
eval_logger.info(f" - {choice}")
return True
def __iter__(self):
for choice in self.choices:
yield choice
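# Illustrative argparse usage (mirrors the `--tasks` flag in __main__.py below):
#   parser.add_argument("--tasks", choices=MultiChoice(tasks.ALL_TASKS))
#   A value such as "hellaswag,arc_*" is accepted because each comma-separated entry
#   is matched against the available choices with fnmatch-style wildcards.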
# Returns a list containing all values of the source_list that
# match at least one of the patterns
def pattern_match(patterns, source_list):
task_names = set()
for pattern in patterns:
for matching in fnmatch.filter(source_list, pattern):
task_names.add(matching)
return sorted(list(task_names))
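# Illustrative example (hypothetical task list):
#   pattern_match(["arc_*"], ["arc_easy", "arc_challenge", "boolq"])
#   -> ["arc_challenge", "arc_easy"]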
def general_detokenize(string):
string = string.replace(" n't", "n't")
string = string.replace(" )", ")")
......@@ -110,8 +169,8 @@ def get_rolling_token_windows(token_list, prefix_token, max_seq_len, context_len
window_end = predicted + window_pred_len
yield (
token_list[window_end - max_seq_len - 1 : window_end - 1],
token_list[window_end - window_pred_len : window_end],
token_list[window_end - max_seq_len - 1: window_end - 1],
token_list[window_end - window_pred_len: window_end],
)
predicted += window_pred_len
......@@ -122,6 +181,26 @@ def make_disjoint_window(pair):
return a[: len(a) - (len(b) - 1)], b
def select_continuation_from_batch_left_padding(
generations: Union[List[List[int]], torch.Tensor], max_context_size: int
):
"""Select the continuation from the batch, removing prompts of different lengths.
Args:
generations (Union[List[List[int]], torch.Tensor]):
A tensor or list-of-lists of shape [batch_size, sequence length].
max_context_size (int):
The size of the biggest context; generations will proceed from that
index.
Example:
PAD PAD Continue : The dog chased the cat [every day of the week]
Riddle me this : The dog chased the cat [yesterday] PAD PAD PAD PAD
Output:
[every day of the week]
[yesterday] PAD PAD PAD PAD
"""
return generations[:, max_context_size:]
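# Illustrative example: for a left-padded batch of shape [2, 7] whose longest prompt
# is 4 tokens (max_context_size=4), the call returns generations[:, 4:], i.e. only
# the newly generated continuation tokens (plus any trailing padding, as in the
# docstring example above).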
class Reorderer:
def __init__(self, arr, fn):
self.size = len(arr)
......@@ -336,3 +415,8 @@ def create_iterator(raw_iterator, rank, world_size, limit=None):
among ranks in multigpu setting or only pulling a sample of documents
"""
return islice(raw_iterator, rank, limit, world_size)
def clear_torch_cache():
gc.collect()
torch.cuda.empty_cache()
import os
import json
import fnmatch
import argparse
from lm_eval import evaluator, utils
from lm_eval.api.registry import GROUP_REGISTRY, TASK_REGISTRY
from lm_eval import tasks, evaluator, utils
from lm_eval.logger import eval_logger
os.environ["TOKENIZERS_PARALLELISM"] = "false"
ALL_TASKS = sorted(list(TASK_REGISTRY.keys()) + list(GROUP_REGISTRY.keys()))
class MultiChoice:
def __init__(self, choices):
self.choices = choices
# Simple wildcard support (linux filename patterns)
def __contains__(self, values):
for value in values.split(","):
if len(fnmatch.filter(self.choices, value)) == 0:
eval_logger.warning("{} is not in task list.".format(value))
eval_logger.info(f"Available tasks to choose:")
# for choice in self.choices:
# eval_logger.info(f" {choice}")
eval_logger.info(ALL_TASKS)
return True
def __iter__(self):
for choice in self.choices:
yield choice
os.environ["TOKENIZERS_PARALLELISM"] = "false"
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument("--model", required=True)
parser.add_argument("--model_args", default="")
parser.add_argument("--tasks", default=None, choices=MultiChoice(ALL_TASKS))
parser.add_argument("--tasks", default=None, choices=utils.MultiChoice(tasks.ALL_TASKS))
parser.add_argument("--config", default=None)
parser.add_argument("--provide_description", action="store_true")
parser.add_argument("--num_fewshot", type=int, default=0)
parser.add_argument("--batch_size", type=int, default=1)
parser.add_argument("--batch_size", type=str, default=None)
parser.add_argument("--max_batch_size", type=int, default=None,
help="Maximal batch size to try with --batch_size auto")
parser.add_argument("--device", type=str, default=None)
parser.add_argument("--output_path", default=None)
parser.add_argument("--limit", type=int, default=None)
parser.add_argument("--limit", type=float, default=None,
help="Limit the number of examples per task. "
"If <1, limit is a percentage of the total number of examples.")
parser.add_argument("--data_sampling", type=float, default=None)
parser.add_argument("--no_cache", action="store_true")
parser.add_argument("--decontamination_ngrams_path", default=None)
parser.add_argument("--description_dict_path", default=None)
parser.add_argument("--check_integrity", action="store_true")
parser.add_argument("--write_out", action="store_true", default=False)
parser.add_argument("--output_base_path", type=str, default=None)
return parser.parse_args()
# Returns a list containing all values of the source_list that
# match at least one of the patterns
def pattern_match(patterns, source_list):
task_names = set()
for pattern in patterns:
for matching in fnmatch.filter(source_list, pattern):
task_names.add(matching)
return sorted(list(task_names))
def main():
args = parse_args()
......@@ -68,7 +43,9 @@ def main():
"REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT."
)
if args.tasks is not None:
if args.tasks is None:
task_names = tasks.ALL_TASKS
else:
if os.path.isdir(args.tasks):
import glob
......@@ -79,7 +56,7 @@ def main():
task_names.append(config)
else:
tasks_list = args.tasks.split(",")
task_names = pattern_match(tasks_list, ALL_TASKS)
task_names = utils.pattern_match(tasks_list, tasks.ALL_TASKS)
for task in [task for task in tasks_list if task not in task_names]:
if os.path.isfile(task):
config = utils.load_yaml_config(task)
......@@ -87,28 +64,42 @@ def main():
eval_logger.info(f"Selected Tasks: {task_names}")
# TODO: description_dict?
# description_dict = {}
# if args.description_dict_path:
# with open(args.description_dict_path, "r") as f:
# description_dict = json.load(f)
results = evaluator.simple_evaluate(
model=args.model,
model_args=args.model_args,
tasks=task_names,
num_fewshot=args.num_fewshot,
batch_size=args.batch_size,
max_batch_size=args.max_batch_size,
device=args.device,
no_cache=args.no_cache,
limit=args.limit,
# description_dict=description_dict,
decontamination_ngrams_path=args.decontamination_ngrams_path,
check_integrity=args.check_integrity,
write_out=args.write_out,
output_base_path=args.output_base_path,
)
if results is not None:
dumped = json.dumps(results, indent=2)
print(dumped)
if args.output_path:
os.makedirs(os.path.dirname(args.output_path), exist_ok=True)
with open(args.output_path, "w") as f:
f.write(dumped)
batch_sizes = ",".join(map(str, results["config"]["batch_sizes"]))
print(
f"{args.model} ({args.model_args}), limit: {args.limit}, provide_description: {args.provide_description}, "
f"num_fewshot: {args.num_fewshot}, batch_size: {args.batch_size}"
f"{args.model} ({args.model_args}), limit: {args.limit}, num_fewshot: {args.num_fewshot}, "
f"batch_size: {args.batch_size}{f' ({batch_sizes})' if batch_sizes else ''}"
)
print(evaluator.make_table(results))
......
# bloom-1b1
## bloom-1b1_common_sense_reasoning_0-shot.json
| Task |Version| Metric |Value| |Stderr|
|-------------|------:|--------|----:|---|-----:|
|arc_challenge| 0|acc |23.63|± | 1.24|
| | |acc_norm|25.68|± | 1.28|
|arc_easy | 0|acc |51.47|± | 1.03|
| | |acc_norm|45.45|± | 1.02|
|boolq | 1|acc |59.08|± | 0.86|
|copa | 0|acc |68.00|± | 4.69|
|hellaswag | 0|acc |34.63|± | 0.47|
| | |acc_norm|41.77|± | 0.49|
|mc_taco | 0|em |14.49| | |
| | |f1 |32.43| | |
|openbookqa | 0|acc |19.60|± | 1.78|
| | |acc_norm|29.40|± | 2.04|
|piqa | 0|acc |67.14|± | 1.10|
| | |acc_norm|67.14|± | 1.10|
|prost | 0|acc |23.41|± | 0.31|
| | |acc_norm|30.50|± | 0.34|
|swag | 0|acc |43.43|± | 0.35|
| | |acc_norm|58.28|± | 0.35|
|winogrande | 0|acc |54.93|± | 1.40|
|wsc273 | 0|acc |68.50|± | 2.82|
## bloom-1b1_gsm8k_8-shot.json
|Task |Version|Metric|Value| |Stderr|
|-----|------:|------|----:|---|-----:|
|gsm8k| 0|acc | 0.83|± | 0.25|
## bloom-1b1_mathematical_reasoning_few_shot_5-shot.json
| Task |Version| Metric |Value| |Stderr|
|-------------------------|------:|--------|----:|---|-----:|
|drop | 1|em | 1.38|± | 0.12|
| | |f1 | 4.01|± | 0.15|
|gsm8k | 0|acc | 0.00|± | 0.00|
|math_algebra | 1|acc | 0.00|± | 0.00|
|math_counting_and_prob | 1|acc | 0.21|± | 0.21|
|math_geometry | 1|acc | 0.21|± | 0.21|
|math_intermediate_algebra| 1|acc | 0.00|± | 0.00|
|math_num_theory | 1|acc | 0.19|± | 0.19|
|math_prealgebra | 1|acc | 0.11|± | 0.11|
|math_precalc | 1|acc | 0.00|± | 0.00|
|mathqa | 0|acc |23.55|± | 0.78|
| | |acc_norm|23.62|± | 0.78|
## bloom-1b1_pawsx_0-shot.json
| Task |Version|Metric|Value| |Stderr|
|--------|------:|------|----:|---|-----:|
|pawsx_de| 0|acc |46.95|± | 1.12|
|pawsx_en| 0|acc |52.45|± | 1.12|
|pawsx_es| 0|acc |51.50|± | 1.12|
|pawsx_fr| 0|acc |46.15|± | 1.11|
|pawsx_ja| 0|acc |48.40|± | 1.12|
|pawsx_ko| 0|acc |49.90|± | 1.12|
|pawsx_zh| 0|acc |48.95|± | 1.12|
## bloom-1b1_question_answering_0-shot.json
| Task |Version| Metric |Value| |Stderr|
|-------------|------:|------------|----:|---|-----:|
|headqa_en | 0|acc |26.44|± | 0.84|
| | |acc_norm |30.49|± | 0.88|
|headqa_es | 0|acc |24.43|± | 0.82|
| | |acc_norm |28.30|± | 0.86|
|logiqa | 0|acc |18.89|± | 1.54|
| | |acc_norm |25.65|± | 1.71|
|squad2 | 1|exact | 4.17| | |
| | |f1 | 6.60| | |
| | |HasAns_exact| 2.19| | |
| | |HasAns_f1 | 7.05| | |
| | |NoAns_exact | 6.14| | |
| | |NoAns_f1 | 6.14| | |
| | |best_exact |50.07| | |
| | |best_f1 |50.07| | |
|triviaqa | 1|acc | 2.68|± | 0.15|
|truthfulqa_mc| 1|mc1 |25.34|± | 1.52|
| | |mc2 |41.80|± | 1.46|
|webqs | 0|acc | 1.38|± | 0.26|
## bloom-1b1_reading_comprehension_0-shot.json
|Task|Version|Metric|Value| |Stderr|
|----|------:|------|----:|---|-----:|
|coqa| 1|f1 |45.57|± | 1.88|
| | |em |32.98|± | 1.95|
|drop| 1|em | 3.31|± | 0.18|
| | |f1 | 8.63|± | 0.22|
|race| 1|acc |32.63|± | 1.45|
## bloom-1b1_xcopa_0-shot.json
| Task |Version|Metric|Value| |Stderr|
|--------|------:|------|----:|---|-----:|
|xcopa_et| 0|acc | 50.6|± | 2.24|
|xcopa_ht| 0|acc | 53.0|± | 2.23|
|xcopa_id| 0|acc | 64.8|± | 2.14|
|xcopa_it| 0|acc | 50.8|± | 2.24|
|xcopa_qu| 0|acc | 51.2|± | 2.24|
|xcopa_sw| 0|acc | 54.4|± | 2.23|
|xcopa_ta| 0|acc | 57.0|± | 2.22|
|xcopa_th| 0|acc | 53.2|± | 2.23|
|xcopa_tr| 0|acc | 53.0|± | 2.23|
|xcopa_vi| 0|acc | 62.4|± | 2.17|
|xcopa_zh| 0|acc | 59.4|± | 2.20|
## bloom-1b1_xnli_0-shot.json
| Task |Version|Metric|Value| |Stderr|
|-------|------:|------|----:|---|-----:|
|xnli_ar| 0|acc |33.93|± | 0.67|
|xnli_bg| 0|acc |34.13|± | 0.67|
|xnli_de| 0|acc |39.64|± | 0.69|
|xnli_el| 0|acc |34.03|± | 0.67|
|xnli_en| 0|acc |51.48|± | 0.71|
|xnli_es| 0|acc |47.98|± | 0.71|
|xnli_fr| 0|acc |47.15|± | 0.71|
|xnli_hi| 0|acc |42.32|± | 0.70|
|xnli_ru| 0|acc |40.46|± | 0.69|
|xnli_sw| 0|acc |35.29|± | 0.68|
|xnli_th| 0|acc |33.75|± | 0.67|
|xnli_tr| 0|acc |34.79|± | 0.67|
|xnli_ur| 0|acc |37.33|± | 0.68|
|xnli_vi| 0|acc |44.45|± | 0.70|
|xnli_zh| 0|acc |36.23|± | 0.68|
## bloom-1b1_xstory_cloze_0-shot.json
| Task |Version|Metric|Value| |Stderr|
|---------------|------:|------|----:|---|-----:|
|xstory_cloze_ar| 0|acc |52.88|± | 1.28|
|xstory_cloze_en| 0|acc |62.54|± | 1.25|
|xstory_cloze_es| 0|acc |58.31|± | 1.27|
|xstory_cloze_eu| 0|acc |54.33|± | 1.28|
|xstory_cloze_hi| 0|acc |55.53|± | 1.28|
|xstory_cloze_id| 0|acc |57.91|± | 1.27|
|xstory_cloze_my| 0|acc |46.19|± | 1.28|
|xstory_cloze_ru| 0|acc |48.25|± | 1.29|
|xstory_cloze_sw| 0|acc |50.56|± | 1.29|
|xstory_cloze_te| 0|acc |56.39|± | 1.28|
|xstory_cloze_zh| 0|acc |58.04|± | 1.27|
## bloom-1b1_xwinograd_0-shot.json
| Task |Version|Metric|Value| |Stderr|
|------------|------:|------|----:|---|-----:|
|xwinograd_en| 0|acc |69.98|± | 0.95|
|xwinograd_fr| 0|acc |66.27|± | 5.22|
|xwinograd_jp| 0|acc |52.87|± | 1.61|
|xwinograd_pt| 0|acc |63.12|± | 2.98|
|xwinograd_ru| 0|acc |54.29|± | 2.81|
|xwinograd_zh| 0|acc |69.25|± | 2.06|
{
"results": {
"boolq": {
"acc": 0.5908256880733945,
"acc_stderr": 0.008599563442397352
},
"arc_easy": {
"acc": 0.5147306397306397,
"acc_stderr": 0.010255329977562096,
"acc_norm": 0.45454545454545453,
"acc_norm_stderr": 0.010217299762709435
},
"openbookqa": {
"acc": 0.196,
"acc_stderr": 0.017770751227744862,
"acc_norm": 0.294,
"acc_norm_stderr": 0.020395095484936614
},
"hellaswag": {
"acc": 0.3463453495319657,
"acc_stderr": 0.004748324319714264,
"acc_norm": 0.4177454690300737,
"acc_norm_stderr": 0.004921798492608764
},
"swag": {
"acc": 0.43431970408877335,
"acc_stderr": 0.0035044592489844794,
"acc_norm": 0.5828251524542637,
"acc_norm_stderr": 0.0034862531772295617
},
"arc_challenge": {
"acc": 0.2363481228668942,
"acc_stderr": 0.012414960524301834,
"acc_norm": 0.2568259385665529,
"acc_norm_stderr": 0.0127669237941168
},
"mc_taco": {
"em": 0.1448948948948949,
"f1": 0.32425976796237205
},
"wsc273": {
"acc": 0.684981684981685,
"acc_stderr": 0.028165854394193602
},
"winogrande": {
"acc": 0.5493291239147593,
"acc_stderr": 0.013983928869040239
},
"prost": {
"acc": 0.23409479077711356,
"acc_stderr": 0.003093545711826552,
"acc_norm": 0.3049743808710504,
"acc_norm_stderr": 0.003363606918420179
},
"copa": {
"acc": 0.68,
"acc_stderr": 0.04688261722621504
},
"piqa": {
"acc": 0.6713819368879217,
"acc_stderr": 0.010959127105167048,
"acc_norm": 0.6713819368879217,
"acc_norm_stderr": 0.010959127105167044
}
},
"versions": {
"boolq": 1,
"arc_easy": 0,
"openbookqa": 0,
"hellaswag": 0,
"swag": 0,
"arc_challenge": 0,
"mc_taco": 0,
"wsc273": 0,
"winogrande": 0,
"prost": 0,
"copa": 0,
"piqa": 0
},
"config": {
"model": "hf-causal-experimental",
"model_args": "pretrained=bigscience/bloom-1b1,use_accelerate=True",
"num_fewshot": 0,
"batch_size": "auto",
"device": "cuda:0",
"no_cache": true,
"limit": null,
"bootstrap_iters": 100000,
"description_dict": {}
}
}
{
"results": {
"gsm8k": {
"acc": 0.008339651250947688,
"acc_stderr": 0.002504942226860508
}
},
"versions": {
"gsm8k": 0
},
"config": {
"model": "hf-causal-experimental",
"model_args": "pretrained=bigscience/bloom-1b1,use_accelerate=True",
"num_fewshot": 8,
"batch_size": "auto",
"device": "cuda",
"no_cache": true,
"limit": null,
"bootstrap_iters": 100000,
"description_dict": {}
}
}
{
"results": {
"mathqa": {
"acc": 0.2355108877721943,
"acc_stderr": 0.007767687364650971,
"acc_norm": 0.23618090452261306,
"acc_norm_stderr": 0.0077753193787470495
},
"gsm8k": {
"acc": 0.0,
"acc_stderr": 0.0
},
"drop": {
"em": 0.013842281879194632,
"em_stderr": 0.001196510970060749,
"f1": 0.040085989932885986,
"f1_stderr": 0.0014841664758736023
},
"math_geometry": {
"acc": 0.0020876826722338203,
"acc_stderr": 0.0020876826722338315
},
"math_counting_and_prob": {
"acc": 0.002109704641350211,
"acc_stderr": 0.002109704641350211
},
"math_prealgebra": {
"acc": 0.001148105625717566,
"acc_stderr": 0.0011481056257175708
},
"math_num_theory": {
"acc": 0.001851851851851852,
"acc_stderr": 0.0018518518518518448
},
"math_precalc": {
"acc": 0.0,
"acc_stderr": 0.0
},
"math_algebra": {
"acc": 0.0,
"acc_stderr": 0.0
},
"math_intermediate_algebra": {
"acc": 0.0,
"acc_stderr": 0.0
}
},
"versions": {
"mathqa": 0,
"gsm8k": 0,
"drop": 1,
"math_geometry": 1,
"math_counting_and_prob": 1,
"math_prealgebra": 1,
"math_num_theory": 1,
"math_precalc": 1,
"math_algebra": 1,
"math_intermediate_algebra": 1
},
"config": {
"model": "hf-causal-experimental",
"model_args": "pretrained=bigscience/bloom-1b1,use_accelerate=True",
"num_fewshot": 5,
"batch_size": "auto",
"device": "cuda:0",
"no_cache": true,
"limit": null,
"bootstrap_iters": 100000,
"description_dict": {}
}
}