Merge branch 'main' into llama

bf11ac93 · Baber · 83b1c564 · ade01428 · bf11ac93 · bf11ac93
Commit bf11ac93 authored Mar 03, 2025 by Baber
20 changed files
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -29,7 +29,7 @@ repos:
      - id: mixed-line-ending
        args: [--fix=lf]
  - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.9.2
+    rev: v0.9.3
    hooks:
      # Run the linter.
      - id: ruff
@@ -38,7 +38,7 @@ repos:
        # Run the formatter.
      - id: ruff-format
  - repo: https://github.com/codespell-project/codespell
-    rev: v2.3.0
+    rev: v2.4.1
    hooks:
      - id: codespell
        exclude: >

--- a/README.md
+++ b/README.md
@@ -5,7 +5,7 @@
 ---

 *Latest News 📣*
-
+- [2025/02] Added [SGLang](https://docs.sglang.ai/) support!
 - [2024/09] We are prototyping allowing users of LM Evaluation Harness to create and evaluate on text+image multimodal input, text output tasks, and have just added the `hf-multimodal` and `vllm-vlm` model types and `mmmu` task as a prototype feature. We welcome users to try out this in-progress feature and stress-test it for themselves, and suggest they check out [`lmms-eval`](https://github.com/EvolvingLMMs-Lab/lmms-eval), a wonderful project originally forking off of the lm-evaluation-harness, for a broader range of multimodal tasks, models, and features.
 - [2024/07] [API model](docs/API_guide.md) support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. **To run Llama 405B, we recommend using VLLM's OpenAI-compliant API to host the model, and use the `local-completions` model type to evaluate the model.**
 - [2024/07] New Open LLM Leaderboard tasks have been added ! You can find them under the [leaderboard](lm_eval/tasks/leaderboard/README.md) task group.
@@ -238,6 +238,28 @@ vLLM occasionally differs in output from Huggingface. We treat Huggingface as th
 > [!Tip]
 > Passing `max_model_len=4096` or some other reasonable default to vLLM through model args may cause speedups or prevent out-of-memory errors when trying to use auto batch size, such as for Mistral-7B-v0.1 which defaults to a maximum length of 32k.

+### Tensor + Data Parallel and Fast Offline Batching Inference with `SGLang`
+
+We support SGLang for efficient offline batch inference. Its **[Fast Backend Runtime](https://docs.sglang.ai/index.html)** delivers high performance through optimized memory management and parallel processing techniques. Key features include tensor parallelism, continuous batching, and support for various quantization methods (FP8/INT4/AWQ/GPTQ).
+
+To use SGLang as the evaluation backend, please **install it in advance** by following the [SGLang installation guide](https://docs.sglang.ai/start/install.html#install-sglang).
+
+> [!Tip]
+> Because of how [`Flashinfer`](https://docs.flashinfer.ai/), a fast attention kernel library, is installed, we don't include the dependencies of `SGLang` in [pyproject.toml](pyproject.toml). Note that `Flashinfer` also places requirements on the `torch` version.
+
+SGLang's server arguments differ slightly from those of other backends; see the [server arguments documentation](https://docs.sglang.ai/backend/server_arguments.html) for more information. We provide an example of the usage here:
+```bash
+lm_eval --model sglang \
+    --model_args pretrained={model_name},dp_size={data_parallel_size},tp_size={tensor_parallel_size},dtype=auto \
+    --tasks gsm8k_cot \
+    --batch_size auto
+```
+> [!Tip]
+> When encountering out of memory (OOM) errors (especially for multiple-choice tasks), try these solutions:
+> 1. Use a manual `batch_size`, rather than `auto`.
+> 2. Lower KV cache pool memory usage by adjusting `mem_fraction_static`, for example by adding it to your model arguments: `--model_args pretrained=...,mem_fraction_static=0.7`.
+> 3. Increase tensor parallel size `tp_size` (if using multiple GPUs).
+
 ### Model APIs and Inference Servers

 Our library also supports the evaluation of models served via several commercial APIs, and we hope to implement support for the most commonly used performant local/self-hosted inference servers.
@@ -489,7 +511,8 @@ Extras dependencies can be installed via `pip install -e ".[NAME]"`
 | api             | For using api models (Anthropic, OpenAI API) |
 | deepsparse      | For running NM's DeepSparse models           |
 | dev             | For linting PRs and contributions            |
-| gptq            | For loading models with GPTQ                 |
+| gptq            | For loading models with AutoGPTQ             |
+| gptqmodel       | For loading models with GPTQModel            |
 | hf_transfer     | For speeding up HF Hub file downloads        |
 | ifeval          | For running the IFEval task                  |
 | ibm_watsonx_ai  | For using IBM watsonx.ai model apis          |

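For the SGLang hunk above, the `--model_args` value is a comma-separated list of `key=value` pairs. A minimal Python sketch of assembling such a string when scripting runs follows; `build_model_args` is a hypothetical helper (not part of `lm_eval`), and the model name and numeric values are placeholders chosen only for illustration:

```python
def build_model_args(**kwargs):
    """Join keyword arguments into a comma-separated key=value string.

    Hypothetical helper for illustration; not part of lm_eval.
    """
    return ",".join(f"{key}={value}" for key, value in kwargs.items())


# Placeholder values: a lower mem_fraction_static and a larger tp_size
# are the knobs the OOM tip in the README hunk suggests adjusting.
model_args = build_model_args(
    pretrained="my-org/my-model",
    tp_size=2,
    dp_size=1,
    dtype="auto",
    mem_fraction_static=0.7,
)
print(model_args)
```

The resulting string could then be passed as `--model_args` to `lm_eval --model sglang`, together with a manual `--batch_size` if `auto` runs out of memory.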
--- a/docs/interface.md
+++ b/docs/interface.md
@@ -8,7 +8,7 @@ A majority of users run the library by cloning it from Github, installing the pa

 Equivalently, running the library can be done via the `lm-eval` entrypoint at the command line.

-This mode supports a number of command-line arguments, the details of which can be also be seen via running with `-h` or `--help`:
+This mode supports a number of command-line arguments, the details of which can also be seen via running with `-h` or `--help`:

 - `--model` : Selects which model type or provider is evaluated. Must be a string corresponding to the name of the model type/provider being used. See [the main README](https://github.com/EleutherAI/lm-evaluation-harness/tree/main#model-apis-and-inference-servers) for a full list of enabled model names and supported libraries or APIs.

@@ -82,8 +82,10 @@ We also support using the library's external API for use within model training l

 ```python
 import lm_eval
+from lm_eval.utils import setup_logging
 ...
-
+# initialize logging
+setup_logging("DEBUG") # optional, but recommended; or you can set up logging yourself
 my_model = initialize_my_model() # create your model (could be running finetuning with some custom modeling code)
 ...
 # instantiate an LM subclass that takes your initialized model and can run

--- a/docs/new_task_guide.md
+++ b/docs/new_task_guide.md
@@ -37,7 +37,8 @@ and rename the folders and YAML file(s) as desired.

 All data downloading and management is handled through the HuggingFace (**HF**) [`datasets`](https://github.com/huggingface/datasets) API. So, the first thing you should do is check to see if your task's dataset is already provided in their catalog [here](https://huggingface.co/datasets). If it's not in there, please consider adding it to their Hub to make it accessible to a wider user base by following their [new dataset guide](https://github.com/huggingface/datasets/blob/main/ADD_NEW_DATASET.md)
 .
-
+> [!TIP]
+> To test your task, we recommend enabling verbose logging by running `export LOGLEVEL=DEBUG` in your shell before launching the evaluation script. This will help you debug any issues that may arise.
 Once you have a HuggingFace dataset prepared for your task, we want to assign our new YAML to use this dataset:

 ```yaml
@@ -143,7 +144,7 @@ The next thing we need to do is decide what format to use when presenting the da

 To write a prompt, users will use `doc_to_text`, `doc_to_target`, and `doc_to_choice` (Optional when certain conditions are met).

-`doc_to_text` defines the input string a model will be given while `doc_to_target` and `doc_to_choice` will be used to generate the target text. `doc_to_target` can be either a text string that refers to the target string or an integer that refers to the index of the correct label. When it is set as an index, `doc_to_choice` must be also be set with the appropriate list of possible choice strings.
+`doc_to_text` defines the input string a model will be given while `doc_to_target` and `doc_to_choice` will be used to generate the target text. `doc_to_target` can be either a text string that refers to the target string or an integer that refers to the index of the correct label. When it is set as an index, `doc_to_choice` must also be set with the appropriate list of possible choice strings.

 ### Basic prompts

@@ -172,7 +173,7 @@ doc_to_choice: choices

 We support the [Jinja 2](https://jinja.palletsprojects.com/en/3.1.x/) templating language for writing prompts. In practice, this means you can take your dataset's columns and do many basic string manipulations to place each document into prompted format.

-Take for example the dataset `super_glue/boolq`. As input, we'd like to use the features `passage` and `question` and string them together so that for a a sample line `doc`, the model sees something the format of:
+Take for example the dataset `super_glue/boolq`. As input, we'd like to use the features `passage` and `question` and string them together so that for a sample line `doc`, the model sees something in the format of:
 ```
 doc["passage"]
 Question: doc["question"]?
@@ -284,7 +285,7 @@ As a heuristic check:
 * Do you expect to compute metrics after applying multiple such processing steps on your model outputs?
 * Does your task rely on metrics that need a custom implementation?

-For more detail on the task system and advanced features, see [`docs/task_guide.md`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md) . If none of the above sound like they apply to your task, it's time to continue onto checking your task performance!
+For more detail on the task system and advanced features, see [`docs/task_guide.md`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md). If none of the above sounds like they apply to your task, it's time to continue onto checking your task performance!

 ### Task name + tags (registering a task)

@@ -383,7 +384,7 @@ task:

 ### Configuring python classes

-There can occasions when yaml-based tasks cannot accommodate how a task is handled. LM-Eval supports the manually implementing tasks as was previously done before `0.4.x`. To register the task, you can simply make a yaml with the name of the task in `task` and the class object in `class` using the `!function` prefix.
+There can be occasions when yaml-based tasks cannot accommodate how a task is handled. LM-Eval supports manually implementing tasks, as was done before `0.4.x`. To register the task, you can simply make a yaml with the name of the task in `task` and the class object in `class` using the `!function` prefix.

 ```yaml
 task: squadv2
@@ -486,7 +487,7 @@ If other tasks on this dataset are already supported:

 It is recommended to include a filled-out copy of this checklist in the README.md for the subfolder you are creating, if you have created a new subfolder in `lm_eval/tasks`.

-**Finally, please add a short description of your task(s), along with a link to its subfolder in lm_eval/tasks , to [`lm_eval/tasks/README.md`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/README.md) so that users can discover your task in the library, and follow the link to your README for more information about the variants supported, their task names, and the original source of the dataset and/or evaluation setup.**
+**Finally, please add a short description of your task(s), along with a link to its subfolder in lm_eval/tasks, to [`lm_eval/tasks/README.md`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/README.md) so that users can discover your task in the library, and follow the link to your README for more information about the variants supported, their task names, and the original source of the dataset and/or evaluation setup.**

 ## Submitting your task


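The tip added to the new task guide above relies on the `LOGLEVEL` environment variable, and this commit also introduces a `setup_logging` helper imported from `lm_eval.utils`. As a rough sketch of how a level name might be resolved using only the standard library, consider the following; `resolve_log_level` and its precedence order (explicit argument, then `LOGLEVEL`, then `INFO`) are assumptions for illustration and may not match the library's actual behaviour:

```python
import logging
import os


def resolve_log_level(verbosity=None, env=None):
    """Pick a logging level from an explicit name, else LOGLEVEL, else INFO.

    Hypothetical helper; the precedence order here is an assumption.
    """
    env = os.environ if env is None else env
    name = verbosity or env.get("LOGLEVEL") or "INFO"
    level = getattr(logging, str(name).upper(), None)
    # Fall back to INFO for unknown names instead of raising.
    return level if isinstance(level, int) else logging.INFO


logging.basicConfig(level=resolve_log_level())
logging.getLogger(__name__).debug("visible only when the level resolves to DEBUG")
```

Passing `env` explicitly keeps the helper easy to exercise without touching the real process environment.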
--- a/docs/task_guide.md
+++ b/docs/task_guide.md
@@ -6,7 +6,7 @@ These YAML configuration files, along with the current codebase commit hash, are

 While adding a standard evaluation task on a new dataset can be occasionally as simple as swapping out a Hugging Face dataset path in an existing file, more specialized evaluation setups also exist. Here we'll provide a crash course on the more advanced logic implementable in YAML form available to users.

-If your intended task relies on features beyond what are described in this guide, we'd love to hear about it! Feel free to open an issue describing the scenario on Github, create a PR to the project with a proposed implementation, or ask in the `#lm-thunderdome` channel on the EleutherAI discord.
+If your intended task relies on features beyond what is described in this guide, we'd love to hear about it! Feel free to open an issue describing the scenario on Github, create a PR to the project with a proposed implementation, or ask in the `#lm-thunderdome` channel on the EleutherAI discord.

 ## Configurations

@@ -37,7 +37,7 @@ Prompting / in-context formatting options:
 - **doc_to_choice** (`Union[Callable, str]`, *optional*) — Jinja2 template, string, or function to process a sample into a list of possible string choices for `multiple_choice` tasks. Left undefined for `generate_until` tasks.
 - **fewshot_delimiter** (`str`, *optional*, defaults to "\n\n") — String to insert between few-shot examples.
 - **target_delimiter** (`str`, *optional*, defaults to `" "`) — String to insert between input and target output for the datapoint being tested.
-- **assistant_prefill** (`str`, *optional*) — String to append after the <|assistant|> token. For example, if the task is to generate a question, the assistant_prefill could be "The answer is: " to prompt the model to generate an answer to the question. If not using a chat template then this string will be appended to the end of the prompt.
+- **gen_prefix** (`str`, *optional*) — String to append after the <|assistant|> token. For example, if the task is to generate a question, the gen_prefix could be "The answer is: " to prompt the model to generate an answer to the question. If not using a chat template then this string will be appended to the end of the prompt.

 Runtime configuration options:
 - **num_fewshot** (`int`, *optional*, defaults to 0) — Number of few-shot examples before the input.
@@ -47,7 +47,7 @@ Scoring details:
 - **metric_list** (`str`, *optional*, defaults to None) — A list of metrics to use for evaluation. See docs for expected format.
 - **output_type** (`str`, *optional*, defaults to "generate_until") — Selects the type of model output for the given task. Options are `generate_until`, `loglikelihood`, `loglikelihood_rolling`, and `multiple_choice`.
 - **generation_kwargs** (`dict`, *optional*) — Auxiliary arguments for the `generate` function from HF transformers library. Advanced keyword arguments may not be supported for non-HF LM classes.
-- **repeats** (`int`, *optional*, defaults to 1) — Number of repeated runs through model for each sample. can be used for cases such as self-consistency.
+- **repeats** (`int`, *optional*, defaults to 1) — Number of repeated runs through model for each sample. Can be used for cases such as self-consistency.
 - **filter_list** (`Union[str, list]`, *optional*) — List of filters to postprocess model outputs. See below for further detail on the filter API.
 - **should_decontaminate** (`bool`, *optional*, defaults to False) - Whether to decontaminate or not.
 - **doc_to_decontamination_query** (`str`, *optional*) — Query for decontamination if `should_decontaminate` is True. If `should_decontaminate` is True but `doc_to_decontamination_query` is `None`, `doc_to_decontamination_query` will follow `doc_to_text`.
@@ -185,7 +185,7 @@ The prior implementation method of new tasks was to subclass `Task`. While we in

 ## Including a Base YAML

-You can base a YAML on another YAML file as a template. This can be handy when you need to just change the prompt for `doc_to_text` but keep the rest the same or change `filters` to compare which is better. Simply use `include` in the YAML file and write the name of the template you want to base from. This assumes that the base temeplate is in the same directory. Otherwise, You will need to define the full path.
+You can base a YAML on another YAML file as a template. This can be handy when you need to just change the prompt for `doc_to_text` but keep the rest the same or change `filters` to compare which is better. Simply use `include` in the YAML file and write the name of the template you want to base it on. This assumes that the base template is in the same directory. Otherwise, you will need to define the full path.
 ```
 include: <YAML filename or with full path>
 ...
@@ -297,7 +297,7 @@ Tasks using complex filtering:

 # Group Configuration

-When evaluating a language model, it's is not unusual to test across a number of tasks that may not be related to one another in order to assess a variety of capabilities. To this end, it may be combursome to have to list the set of tasks or add a new group name to each yaml of each individual task.
+When evaluating a language model, it is not unusual to test across a number of tasks that may not be related to one another in order to assess a variety of capabilities. To this end, it may be cumbersome to have to list the set of tasks or add a new group name to each yaml of each individual task.

 To solve this, we can create a **group** yaml config. This is a config that contains the names of the tasks that should be included in a particular group. The config consists of two main keys: a `group` key which denotes the name of the group (as it would be called from the command line, e.g. `mmlu`) and a `task` key which is where we can list the tasks. The tasks listed in `task` are the task names that have been registered. A good example of a group yaml config can be found at [../lm_eval/tasks/mmlu/default/_mmlu.yaml]. See also the [New Task Guide](./new_task_guide.md) for a more in-depth and tutorial-esque explanation of how to write complex GroupConfigs.

@@ -312,7 +312,7 @@ Groups are configured via the `GroupConfig` object. Below, we describe all field
 - **task** (`Union[str, list]`, defaults to `None`) - List of tasks that constitute the group.
 - **aggregate_metric_list** (`list`, defaults to `None`) - similar to `metric_list` in TaskConfigs, provide a list of configurations for metrics that should be aggregated across subtasks. Leaving empty will result in no aggregation being performed for this group. Keys for each list entry are:
  - `metric: str` - the name of the metric to aggregate over (all subtasks must report a metric holding this name.)
-  - `aggregation: str` - what aggregation function to apply to aggregate these per-subtask metrics.  **currently, only `mean` is supported.**
+  - `aggregation: str` - what aggregation function to apply to aggregate these per-subtask metrics. **currently, only `mean` is supported.**
  - `weight_by_size: bool = True` whether to perform micro- averaging (`True`) or macro- (`False`) averaging of subtasks' accuracy scores when reporting the group's metric. MMLU, for example, averages over per-document accuracies (the *micro average*), resulting in the same accuracy as if one simply concatenated all 57 subjects into a single dataset and evaluated accuracy on that dataset.
  - `filter_list: Union[str, List[str]] = "none"` - what filter keys one should match on to aggregate results. For example, if trying to aggregate over the `exact_match` metric using `strict-match` filter for `bbh_cot_zeroshot`, then set this to be `filter_list: "strict-match"`.  
 - **metadata** (`dict`, *optional*) - As with TaskConfigs, a field where extra config metadata can be passed. set the `num_fewshot` key within this to override the printed n_shot value in a results table for your group, for example.
--- a/examples/lm-eval-overview.ipynb
+++ b/examples/lm-eval-overview.ipynb
@@ -79,48 +79,48 @@
      "  Switched to a new branch 'big-refactor'\n",
      "  Branch 'big-refactor' set up to track remote branch 'big-refactor' from 'origin'.\n",
      "  Resolved https://github.com/EleutherAI/lm-evaluation-harness.git to commit 42f486ee49b65926a444cb0620870a39a5b4b0a8\n",
-      "  Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n",
-      "  Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n",
-      "  Preparing metadata (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
+      "  Installing build dependencies ... \u001B[?25l\u001B[?25hdone\n",
+      "  Getting requirements to build wheel ... \u001B[?25l\u001B[?25hdone\n",
+      "  Preparing metadata (pyproject.toml) ... \u001B[?25l\u001B[?25hdone\n",
      "Collecting accelerate>=0.21.0 (from lm-eval==1.0.0)\n",
      "  Downloading accelerate-0.24.1-py3-none-any.whl (261 kB)\n",
-      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m261.4/261.4 kB\u001b[0m \u001b[31m4.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
-      "\u001b[?25hCollecting evaluate (from lm-eval==1.0.0)\n",
+      "\u001B[2K     \u001B[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001B[0m \u001B[32m261.4/261.4 kB\u001B[0m \u001B[31m4.1 MB/s\u001B[0m eta \u001B[36m0:00:00\u001B[0m\n",
+      "\u001B[?25hCollecting evaluate (from lm-eval==1.0.0)\n",
      "  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)\n",
-      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m84.1/84.1 kB\u001b[0m \u001b[31m5.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
-      "\u001b[?25hCollecting datasets>=2.0.0 (from lm-eval==1.0.0)\n",
+      "\u001B[2K     \u001B[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001B[0m \u001B[32m84.1/84.1 kB\u001B[0m \u001B[31m5.9 MB/s\u001B[0m eta \u001B[36m0:00:00\u001B[0m\n",
+      "\u001B[?25hCollecting datasets>=2.0.0 (from lm-eval==1.0.0)\n",
      "  Downloading datasets-2.15.0-py3-none-any.whl (521 kB)\n",
-      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m521.2/521.2 kB\u001b[0m \u001b[31m9.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
-      "\u001b[?25hCollecting jsonlines (from lm-eval==1.0.0)\n",
+      "\u001B[2K     \u001B[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001B[0m \u001B[32m521.2/521.2 kB\u001B[0m \u001B[31m9.5 MB/s\u001B[0m eta \u001B[36m0:00:00\u001B[0m\n",
+      "\u001B[?25hCollecting jsonlines (from lm-eval==1.0.0)\n",
      "  Downloading jsonlines-4.0.0-py3-none-any.whl (8.7 kB)\n",
      "Requirement already satisfied: numexpr in /usr/local/lib/python3.10/dist-packages (from lm-eval==1.0.0) (2.8.7)\n",
      "Collecting peft>=0.2.0 (from lm-eval==1.0.0)\n",
      "  Downloading peft-0.6.2-py3-none-any.whl (174 kB)\n",
-      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m174.7/174.7 kB\u001b[0m \u001b[31m7.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
-      "\u001b[?25hCollecting pybind11>=2.6.2 (from lm-eval==1.0.0)\n",
+      "\u001B[2K     \u001B[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001B[0m \u001B[32m174.7/174.7 kB\u001B[0m \u001B[31m7.2 MB/s\u001B[0m eta \u001B[36m0:00:00\u001B[0m\n",
+      "\u001B[?25hCollecting pybind11>=2.6.2 (from lm-eval==1.0.0)\n",
      "  Downloading pybind11-2.11.1-py3-none-any.whl (227 kB)\n",
-      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m227.7/227.7 kB\u001b[0m \u001b[31m12.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
-      "\u001b[?25hCollecting pytablewriter (from lm-eval==1.0.0)\n",
+      "\u001B[2K     \u001B[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001B[0m \u001B[32m227.7/227.7 kB\u001B[0m \u001B[31m12.9 MB/s\u001B[0m eta \u001B[36m0:00:00\u001B[0m\n",
+      "\u001B[?25hCollecting pytablewriter (from lm-eval==1.0.0)\n",
      "  Downloading pytablewriter-1.2.0-py3-none-any.whl (111 kB)\n",
-      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m111.1/111.1 kB\u001b[0m \u001b[31m8.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
-      "\u001b[?25hCollecting rouge-score>=0.0.4 (from lm-eval==1.0.0)\n",
+      "\u001B[2K     \u001B[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001B[0m \u001B[32m111.1/111.1 kB\u001B[0m \u001B[31m8.3 MB/s\u001B[0m eta \u001B[36m0:00:00\u001B[0m\n",
+      "\u001B[?25hCollecting rouge-score>=0.0.4 (from lm-eval==1.0.0)\n",
      "  Downloading rouge_score-0.1.2.tar.gz (17 kB)\n",
-      "  Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
+      "  Preparing metadata (setup.py) ... \u001B[?25l\u001B[?25hdone\n",
      "Collecting sacrebleu>=1.5.0 (from lm-eval==1.0.0)\n",
      "  Downloading sacrebleu-2.3.2-py3-none-any.whl (119 kB)\n",
-      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m119.7/119.7 kB\u001b[0m \u001b[31m8.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
-      "\u001b[?25hRequirement already satisfied: scikit-learn>=0.24.1 in /usr/local/lib/python3.10/dist-packages (from lm-eval==1.0.0) (1.2.2)\n",
+      "\u001B[2K     \u001B[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001B[0m \u001B[32m119.7/119.7 kB\u001B[0m \u001B[31m8.7 MB/s\u001B[0m eta \u001B[36m0:00:00\u001B[0m\n",
+      "\u001B[?25hRequirement already satisfied: scikit-learn>=0.24.1 in /usr/local/lib/python3.10/dist-packages (from lm-eval==1.0.0) (1.2.2)\n",
      "Collecting sqlitedict (from lm-eval==1.0.0)\n",
      "  Downloading sqlitedict-2.1.0.tar.gz (21 kB)\n",
-      "  Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
+      "  Preparing metadata (setup.py) ... \u001B[?25l\u001B[?25hdone\n",
      "Requirement already satisfied: torch>=1.8 in /usr/local/lib/python3.10/dist-packages (from lm-eval==1.0.0) (2.1.0+cu118)\n",
      "Collecting tqdm-multiprocess (from lm-eval==1.0.0)\n",
      "  Downloading tqdm_multiprocess-0.0.11-py3-none-any.whl (9.8 kB)\n",
      "Requirement already satisfied: transformers>=4.1 in /usr/local/lib/python3.10/dist-packages (from lm-eval==1.0.0) (4.35.2)\n",
      "Collecting zstandard (from lm-eval==1.0.0)\n",
      "  Downloading zstandard-0.22.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.4 MB)\n",
-      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m5.4/5.4 MB\u001b[0m \u001b[31m29.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
-      "\u001b[?25hRequirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (1.23.5)\n",
+      "\u001B[2K     \u001B[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001B[0m \u001B[32m5.4/5.4 MB\u001B[0m \u001B[31m29.2 MB/s\u001B[0m eta \u001B[36m0:00:00\u001B[0m\n",
+      "\u001B[?25hRequirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (1.23.5)\n",
      "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (23.2)\n",
      "Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (5.9.5)\n",
      "Requirement already satisfied: pyyaml in /usr/local/lib/python3.10/dist-packages (from accelerate>=0.21.0->lm-eval==1.0.0) (6.0.1)\n",
@@ -130,15 +130,15 @@
      "  Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)\n",
      "Collecting dill<0.3.8,>=0.3.0 (from datasets>=2.0.0->lm-eval==1.0.0)\n",
      "  Downloading dill-0.3.7-py3-none-any.whl (115 kB)\n",
-      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m115.3/115.3 kB\u001b[0m \u001b[31m14.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
-      "\u001b[?25hRequirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (1.5.3)\n",
+      "\u001B[2K     \u001B[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001B[0m \u001B[32m115.3/115.3 kB\u001B[0m \u001B[31m14.4 MB/s\u001B[0m eta \u001B[36m0:00:00\u001B[0m\n",
+      "\u001B[?25hRequirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (1.5.3)\n",
      "Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (2.31.0)\n",
      "Requirement already satisfied: tqdm>=4.62.1 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (4.66.1)\n",
      "Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (3.4.1)\n",
      "Collecting multiprocess (from datasets>=2.0.0->lm-eval==1.0.0)\n",
      "  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)\n",
-      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m134.8/134.8 kB\u001b[0m \u001b[31m19.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
-      "\u001b[?25hRequirement already satisfied: fsspec[http]<=2023.10.0,>=2023.1.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (2023.6.0)\n",
+      "\u001B[2K     \u001B[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001B[0m \u001B[32m134.8/134.8 kB\u001B[0m \u001B[31m19.9 MB/s\u001B[0m eta \u001B[36m0:00:00\u001B[0m\n",
+      "\u001B[?25hRequirement already satisfied: fsspec[http]<=2023.10.0,>=2023.1.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (2023.6.0)\n",
      "Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets>=2.0.0->lm-eval==1.0.0) (3.8.6)\n",
      "Collecting responses<0.19 (from evaluate->lm-eval==1.0.0)\n",
      "  Downloading responses-0.18.0-py3-none-any.whl (38 kB)\n",
@@ -193,13 +193,13 @@
      "Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk->rouge-score>=0.0.4->lm-eval==1.0.0) (8.1.7)\n",
      "Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=1.8->lm-eval==1.0.0) (1.3.0)\n",
      "Building wheels for collected packages: lm-eval, rouge-score, sqlitedict\n",
-      "  Building wheel for lm-eval (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n",
+      "  Building wheel for lm-eval (pyproject.toml) ... \u001B[?25l\u001B[?25hdone\n",
      "  Created wheel for lm-eval: filename=lm_eval-1.0.0-py3-none-any.whl size=994254 sha256=88356155b19f2891981ecef948326ad6ce8ca40a6009378410ec20d0e225995a\n",
      "  Stored in directory: /tmp/pip-ephem-wheel-cache-9v6ye7h3/wheels/17/01/26/599c0779e9858a70a73fa8a306699b5b9a868f820c225457b0\n",
-      "  Building wheel for rouge-score (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
+      "  Building wheel for rouge-score (setup.py) ... \u001B[?25l\u001B[?25hdone\n",
      "  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24933 sha256=6bb0d44e4881972c43ce194e7cb65233d309758cb15f0dec54590d3d2efcfc36\n",
      "  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4\n",
-      "  Building wheel for sqlitedict (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
+      "  Building wheel for sqlitedict (setup.py) ... \u001B[?25l\u001B[?25hdone\n",
      "  Created wheel for sqlitedict: filename=sqlitedict-2.1.0-py3-none-any.whl size=16863 sha256=5747f7dd73ddf3d8fbcebf51b5e4f718fabe1e94bccdf16d2f22a2e65ee7fdf4\n",
      "  Stored in directory: /root/.cache/pip/wheels/79/d6/e7/304e0e6cb2221022c26d8161f7c23cd4f259a9e41e8bbcfabd\n",
      "Successfully built lm-eval rouge-score sqlitedict\n",
@@ -361,6 +361,7 @@
    }
   ],
   "source": [
+    "%env LOGLEVEL=DEBUG\n",
    "!lm_eval \\\n",
    "    --model hf \\\n",
    "    --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
@@ -462,6 +463,7 @@
   ],
   "source": [
    "# !accelerate launch --no_python\n",
+    "%env LOGLEVEL=DEBUG\n",
    "!lm_eval \\\n",
    "    --model hf \\\n",
    "    --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
@@ -561,6 +563,7 @@
   ],
   "source": [
    "# !accelerate launch --no_python\n",
+    "%env LOGLEVEL=DEBUG\n",
    "!lm_eval \\\n",
    "    --model hf \\\n",
    "    --model_args pretrained=EleutherAI/pythia-2.8b \\\n",
@@ -637,6 +640,7 @@
   ],
   "source": [
    "# !accelerate launch --no_python\n",
+    "%env LOGLEVEL=DEBUG\n",
    "!lm_eval \\\n",
    "    --model hf \\\n",
    "    --model_args pretrained=EleutherAI/pythia-2.8b \\\n",

--- a/lm_eval/__init__.py
+++ b/lm_eval/__init__.py
+import logging
+import os
+
 from .evaluator import evaluate, simple_evaluate
--- a/lm_eval/__main__.py
+++ b/lm_eval/__main__.py
@@ -213,9 +213,9 @@ def setup_parser() -> argparse.ArgumentParser:
        "--verbosity",
        "-v",
        type=str.upper,
-        default="INFO",
+        default=None,
        metavar="CRITICAL|ERROR|WARNING|INFO|DEBUG",
-        help="Controls the reported logging error level. Set to DEBUG when testing + adding new task configurations for comprehensive log output.",
+        help="(Deprecated) Controls logging verbosity level. Use the `LOGLEVEL` environment variable instead. Set to DEBUG for detailed output when testing or adding new task configurations.",
    )
    parser.add_argument(
        "--wandb_args",
@@ -279,9 +279,8 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
    if args.wandb_args:
        wandb_logger = WandbLogger(**simple_parse_args_string(args.wandb_args))

-    eval_logger = utils.eval_logger
-    eval_logger.setLevel(getattr(logging, f"{args.verbosity}"))
-    eval_logger.info(f"Verbosity set to {args.verbosity}")
+    utils.setup_logging(args.verbosity)
+    eval_logger = logging.getLogger(__name__)
    os.environ["TOKENIZERS_PARALLELISM"] = "false"

    # update the evaluation tracker args with the output path and the HF token
@@ -306,7 +305,7 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:

    if args.include_path is not None:
        eval_logger.info(f"Including path: {args.include_path}")
-    task_manager = TaskManager(args.verbosity, include_path=args.include_path)
+    task_manager = TaskManager(include_path=args.include_path)

    if "push_samples_to_hub" in evaluation_tracker_args and not args.log_samples:
        eval_logger.warning(
@@ -377,8 +376,11 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
        datasets.config.HF_DATASETS_TRUST_REMOTE_CODE = True

        args.model_args = args.model_args + ",trust_remote_code=True"
-
-    eval_logger.info(f"Selected Tasks: {task_names}")
+    eval_logger.info(
+        f"Selected Tasks: {task_names}"
+    ) if eval_logger.getEffectiveLevel() >= logging.INFO else print(
+        f"Selected Tasks: {task_names}"
+    )

    request_caching_args = request_caching_arg_to_dict(
        cache_requests=args.cache_requests
@@ -403,7 +405,6 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
        fewshot_as_multiturn=args.fewshot_as_multiturn,
        gen_kwargs=args.gen_kwargs,
        task_manager=task_manager,
-        verbosity=args.verbosity,
        predict_only=args.predict_only,
        random_seed=args.seed[0],
        numpy_random_seed=args.seed[1],
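
Reviewer note: the hunks above replace the shared `utils.eval_logger` with per-module loggers and route configuration through `utils.setup_logging`, whose body is not part of this diff. The following is only a minimal sketch of how such a helper could behave. The function name `setup_logging_sketch`, the log format, and the precedence (explicit argument, then `LOGLEVEL`, then `INFO`) are all assumptions, not the project's implementation.

```python
import logging
import os
from typing import Optional


def setup_logging_sketch(verbosity: Optional[str] = None) -> int:
    """Hypothetical stand-in for `utils.setup_logging`; not the project's code."""
    # Assumed precedence: explicit argument, then LOGLEVEL, then INFO.
    level_name = (verbosity or os.environ.get("LOGLEVEL") or "INFO").upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        raise ValueError(f"Unknown log level: {level_name!r}")
    # Configure the root logger; force=True replaces handlers installed earlier.
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
        force=True,
    )
    return level


# Loggers obtained by module name have no level of their own,
# so they inherit the effective level from the root logger.
eval_logger = logging.getLogger("lm_eval.api.metrics")
```

With this pattern, a module that calls `logging.getLogger(__name__)` needs no level handling of its own; setting `LOGLEVEL=DEBUG` before the run (as the notebook cells above do with `%env`) is enough.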

--- a/lm_eval/api/metrics.py
+++ b/lm_eval/api/metrics.py
@@ -12,7 +12,7 @@ import sacrebleu
 from lm_eval.api.registry import register_aggregation, register_metric


-eval_logger = logging.getLogger("lm-eval")
+eval_logger = logging.getLogger(__name__)


 # Register Aggregations First

--- a/lm_eval/api/model.py
+++ b/lm_eval/api/model.py
@@ -12,7 +12,7 @@ from tqdm import tqdm
 from lm_eval import utils


-eval_logger = logging.getLogger("lm-eval")
+eval_logger = logging.getLogger(__name__)

 T = TypeVar("T", bound="LM")


--- a/lm_eval/api/registry.py
+++ b/lm_eval/api/registry.py
@@ -6,7 +6,7 @@ import evaluate as hf_evaluate
 from lm_eval.api.model import LM


-eval_logger = logging.getLogger("lm-eval")
+eval_logger = logging.getLogger(__name__)

 MODEL_REGISTRY = {}


--- a/lm_eval/api/samplers.py
+++ b/lm_eval/api/samplers.py
+import logging
+import warnings
 from functools import partial
 from typing import TYPE_CHECKING, Iterable, Optional, Union

@@ -9,6 +11,8 @@ if TYPE_CHECKING:

    from lm_eval.api.task import ConfigurableTask, Task

+eval_logger = logging.getLogger(__name__)
+

 class ContextSampler:
    def __init__(
@@ -97,6 +101,13 @@ class ContextSampler:
                labeled_examples += self.doc_to_choice(doc)[doc_content]

            if doc_target != "":
+                if self.target_delimiter.isspace() and str(doc_target)[0].isspace():
+                    # TODO: add logger warn once here.
+                    warnings.warn(
+                        "Both target_delimiter and target start with a space. This may cause issues.",
+                        Warning,
+                        stacklevel=2,
+                    )
                labeled_examples += self.target_delimiter
                labeled_examples += prefix
                labeled_examples += (

--- a/lm_eval/api/task.py
+++ b/lm_eval/api/task.py
@@ -48,7 +48,7 @@ ALL_OUTPUT_TYPES = [
    "generate_until",
 ]

-eval_logger = logging.getLogger("lm-eval")
+eval_logger = logging.getLogger(__name__)


 @dataclass
@@ -1049,6 +1049,8 @@ class ConfigurableTask(Task):
            Whether to provide the fewshot examples as a multiturn conversation or a single user turn.
        :param chat_template:
            callable (from lm.apply_chat_template) that takes in a list[Dict] chat transcript and renders it into a string.
+        :param gen_prefix:
+            String to append after the <|assistant|> token.
        :returns: str
            The fewshot context.
        """
@@ -1621,13 +1623,13 @@ class ConfigurableTask(Task):
                        )
                    except TypeError:  # needed for now in order to use a different interface between our own metrics and HF Evaluate metrics
                        result_score = self._metric_fn_list[metric]([gold, result])
-                    if isinstance(result_score, dict):
-                        # TODO: this handles the case where HF evaluate returns a dict.
-                        # This allows for multiple metrics to be returned from the same function
-                        for k, v in result_score.items():
-                            result_dict[k] = v
-                        return result_dict
-                result_dict[metric] = result_score
+                if isinstance(result_score, dict):
+                    # TODO: this handles the case where HF evaluate returns a dict.
+                    # This allows for multiple metrics to be returned from the same function
+                    for k, v in result_score.items():
+                        result_dict[k] = v
+                else:
+                    result_dict[metric] = result_score
        else:
            raise ValueError(
                f"Passed invalid output_type '{self.OUTPUT_TYPE}' ! Please use one of ",
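
Reviewer note: as far as this hunk shows, the dict check used to sit one level deeper and returned `result_dict` immediately, while the new version runs the check after the `try`/`except` and keeps going. The merge step can be sketched in isolation as follows. `merge_metric_result` is a hypothetical helper written for illustration and the metric names are made up; neither exists in the codebase.

```python
from typing import Any, Dict


def merge_metric_result(
    result_dict: Dict[str, Any], metric: str, result_score: Any
) -> Dict[str, Any]:
    """Hypothetical helper mirroring the updated branch in the hunk above."""
    if isinstance(result_score, dict):
        # A dict-returning metric contributes one entry per key.
        for key, value in result_score.items():
            result_dict[key] = value
    else:
        # A scalar score is stored under the metric's own name.
        result_dict[metric] = result_score
    return result_dict


results: Dict[str, Any] = {}
merge_metric_result(results, "exact_match", 1.0)
merge_metric_result(results, "rouge", {"rouge1": 0.5, "rouge2": 0.25})
print(results)  # {'exact_match': 1.0, 'rouge1': 0.5, 'rouge2': 0.25}
```

Because there is no early return, scalar and dict-valued metrics can be mixed in one pass without dropping any that come later.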

--- a/lm_eval/caching/cache.py
+++ b/lm_eval/caching/cache.py
 import hashlib
+import logging
 import os

 import dill

-from lm_eval.utils import eval_logger
+
+eval_logger = logging.getLogger(__name__)


 MODULE_DIR = os.path.dirname(os.path.realpath(__file__))

--- a/lm_eval/evaluator.py
+++ b/lm_eval/evaluator.py
@@ -31,7 +31,6 @@ from lm_eval.tasks import (
    get_task_dict,
 )
 from lm_eval.utils import (
-    eval_logger,
    handle_non_serializable,
    hash_string,
    positional_deprecated,
@@ -43,6 +42,8 @@ if TYPE_CHECKING:
    from lm_eval.api.model import LM
    from lm_eval.api.task import Task

+eval_logger = logging.getLogger(__name__)
+

 @positional_deprecated
 def simple_evaluate(
@@ -68,7 +69,7 @@ def simple_evaluate(
    fewshot_as_multiturn: bool = False,
    gen_kwargs: Optional[str] = None,
    task_manager: Optional[TaskManager] = None,
-    verbosity: str = "INFO",
+    verbosity: Optional[str] = None,
    predict_only: bool = False,
    random_seed: int = 0,
    numpy_random_seed: int = 1234,
@@ -123,6 +124,8 @@ def simple_evaluate(
    :param gen_kwargs: str
        String arguments for model generation
        Ignored for all tasks with loglikelihood output_type
+    :param verbosity: str
+        Verbosity level for logging
    :param predict_only: bool
        If true only model outputs will be generated and returned. Metrics will not be evaluated
    :param random_seed: int
@@ -137,7 +140,8 @@ def simple_evaluate(
    :return
        Dictionary of results
    """
-    eval_logger.setLevel(getattr(logging, f"{verbosity}"))
+    if verbosity is not None:
+        lm_eval.setup_logging(verbosity=verbosity)
    start_date = time.time()

    if delete_requests_cache:
@@ -231,7 +235,7 @@ def simple_evaluate(
        )

    if task_manager is None:
-        task_manager = TaskManager(verbosity)
+        task_manager = TaskManager()

    task_dict = get_task_dict(tasks, task_manager)

@@ -313,9 +317,11 @@ def simple_evaluate(
        system_instruction=system_instruction,
        apply_chat_template=apply_chat_template,
        fewshot_as_multiturn=fewshot_as_multiturn,
        verbosity=verbosity,
        confirm_run_unsafe_code=confirm_run_unsafe_code,
    )
+    if verbosity is not None:
+        lm_eval.setup_logging(verbosity=verbosity)

    if lm.rank == 0:
        if isinstance(model, str):
@@ -411,8 +417,6 @@ def evaluate(
        Dictionary of results
    """

-    eval_logger.setLevel(getattr(logging, f"{verbosity}"))
-
    if apply_chat_template:
        eval_logger.warning(
            "Chat template formatting change affects loglikelihood and multiple-choice tasks. See docs/chat-template-readme.md for details."

--- a/lm_eval/evaluator_utils.py
+++ b/lm_eval/evaluator_utils.py
 import collections
+import logging
 import math
 import pathlib
 import sys
@@ -12,7 +13,10 @@ from lm_eval.api.metrics import (
    stderr_for_metric,
 )
 from lm_eval.api.task import Task
-from lm_eval.utils import eval_logger, positional_deprecated
+from lm_eval.utils import positional_deprecated
+
+
+eval_logger = logging.getLogger(__name__)


 class TaskOutput:

--- a/lm_eval/loggers/evaluation_tracker.py
+++ b/lm_eval/loggers/evaluation_tracker.py
 import json
+import logging
 import os
 import re
 import time
@@ -18,7 +19,6 @@ from huggingface_hub import (
 from huggingface_hub.utils import build_hf_headers, get_session, hf_raise_for_status

 from lm_eval.utils import (
-    eval_logger,
    get_file_datetime,
    get_file_task_name,
    get_results_filenames,
@@ -31,6 +31,9 @@ from lm_eval.utils import (
 )


+eval_logger = logging.getLogger(__name__)
+
+
 @dataclass(init=False)
 class GeneralConfigTracker:
    """

--- a/lm_eval/models/__init__.py
+++ b/lm_eval/models/__init__.py
@@ -13,6 +13,7 @@ from . import (
    openai_completions,
    optimum_ipex,
    optimum_lm,
+    sglang_causallms,
    textsynth,
    vllm_causallms,
    vllm_vlms,

--- a/lm_eval/models/anthropic_llms.py
+++ b/lm_eval/models/anthropic_llms.py
+import logging
 import os
 from functools import cached_property
 from typing import Any, Dict, List, Tuple, Union

 from tqdm import tqdm

-from lm_eval import utils
 from lm_eval.api.model import LM
 from lm_eval.api.registry import register_model
 from lm_eval.models.openai_completions import LocalCompletionsAPI
 from lm_eval.models.utils import handle_stop_sequences, retry_on_specific_exceptions


-eval_logger = utils.eval_logger
+eval_logger = logging.getLogger(__name__)


 def anthropic_completion(

--- a/lm_eval/models/api_models.py
+++ b/lm_eval/models/api_models.py
@@ -3,6 +3,7 @@ import asyncio
 import copy
 import itertools
 import json
+import logging
 from functools import cached_property
 from typing import (
    Any,
@@ -37,6 +38,8 @@ from lm_eval.api.model import TemplateLM
 from lm_eval.models.utils import Collator, chunks, configure_pad_token


+eval_logger = logging.getLogger(__name__)
+
 LogLikelihoodInputs = Tuple[Tuple[str, str], List[int], List[int]]


@@ -48,9 +51,6 @@ class JsonChatStr(NamedTuple):
        return self.prompt.encode(encoding)


-eval_logger = utils.eval_logger
-
-
 class TemplateAPI(TemplateLM):
    def __init__(
        self,
@@ -265,7 +265,7 @@ class TemplateAPI(TemplateLM):
            )
        else:
            # bit of a hack. We'll load back before sending to the API
-            return JsonChatStr(json.dumps(chat_history))
+            return JsonChatStr(json.dumps(chat_history, ensure_ascii=False))

    @cached_property
    def eot_token_id(self) -> Optional[int]: