# User Guide

This document details the interface exposed by `lm-eval` and describes the flags available to users.

## Command-line Interface

Most users run the library by cloning it from GitHub, installing the package as editable, and running the `python -m lm_eval` script.

Equivalently, the library can be run via the `lm-eval` entrypoint at the command line.
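
For example, the following two invocations are equivalent:

```bash
python -m lm_eval --model hf --tasks hellaswag --device cuda:0 --batch_size 8
lm-eval --model hf --tasks hellaswag --device cuda:0 --batch_size 8
```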

### Subcommand Structure

The CLI now uses a subcommand structure for better organization:

- `lm-eval run` - Execute evaluations (default behavior)
- `lm-eval ls` - List available tasks, models, etc.
- `lm-eval validate` - Validate task configurations

For backward compatibility, if no subcommand is specified, `run` is automatically inserted. So `lm-eval --model hf --tasks hellaswag` is equivalent to `lm-eval run --model hf --tasks hellaswag`.

### Run Command Arguments

The `run` command supports a number of command-line arguments. Details can also be seen by running with `-h` or `--help`:

#### Configuration

- `--config` **[path: str]** : Set initial arguments from a YAML configuration file. Takes a path to a YAML file that contains argument values. This allows you to specify complex configurations in a file rather than on the command line. Further CLI arguments can override values from the configuration file.

  For the complete list of available configuration fields and their types, see [`EvaluatorConfig` in the source code](../lm_eval/config/evaluate_config.py).
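
  As a minimal sketch, assuming the YAML keys mirror the `run` flag names (see `EvaluatorConfig` for the authoritative field list), a configuration file and an overriding CLI flag might look like:

  ```bash
  # hypothetical configuration file; field names assumed to mirror the run-command flags
  cat > eval_config.yaml << 'EOF'
  model: hf
  model_args: pretrained=EleutherAI/pythia-160m,dtype=float32
  tasks: hellaswag,arc_easy
  num_fewshot: 5
  batch_size: auto
  EOF

  # values from the file seed the run; flags passed on the CLI override them
  lm-eval run --config eval_config.yaml --batch_size 8
  ```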

#### Model and Tasks

- `--model` **[str, default: "hf"]** : Selects which model type or provider is evaluated. Must be a string corresponding to the name of the model type/provider being used. See [the main README](https://github.com/EleutherAI/lm-evaluation-harness/tree/main#model-apis-and-inference-servers) for a full list of enabled model names and supported libraries or APIs.

- `--model_args` **[comma-sep str | json str → dict]** : Controls parameters passed to the model constructor. Can be provided as:
  - Comma-separated string: `pretrained=EleutherAI/pythia-160m,dtype=float32`
  - JSON string: `'{"pretrained": "EleutherAI/pythia-160m", "dtype": "float32"}'`

  For a full list of supported arguments, see the initialization of the `lm_eval.api.model.LM` subclass, e.g. [`HFLM`](https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/models/huggingface.py#L66)

- `--tasks` **[comma-sep str → list[str]]** : Determines which tasks or task groups are evaluated. Accepts a comma-separated list of task names or task group names. Must be solely comprised of valid tasks/groups. A list of supported tasks can be viewed with `lm-eval list tasks`.
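
The example below puts these flags together; the two `--model_args` formats are interchangeable:

```bash
# comma-separated key=value pairs
lm-eval run --model hf \
    --model_args pretrained=EleutherAI/pythia-160m,dtype=float32 \
    --tasks hellaswag,arc_easy

# equivalent JSON form
lm-eval run --model hf \
    --model_args '{"pretrained": "EleutherAI/pythia-160m", "dtype": "float32"}' \
    --tasks hellaswag,arc_easy
```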

#### Evaluation Settings

- `--num_fewshot` **[int]** : Sets the number of few-shot examples to place in context. Must be an integer.

- `--batch_size` **[int | "auto" | "auto:N", default: 1]** : Sets the batch size used for evaluation. Options:
  - Integer: Fixed batch size (e.g., `8`)
  - `"auto"`: Automatically select the largest batch size that fits in memory
  - `"auto:N"`: Re-select maximum batch size N times during evaluation

  Auto mode is useful because `lm-eval` sorts documents in descending order of context length, so the largest batch size that fits in memory tends to grow as evaluation proceeds; `auto:N` re-selects the batch size N times to take advantage of this.

- `--max_batch_size` **[int]** : Sets the maximum batch size to try when using `--batch_size auto`.

- `--device` **[str]** : Sets which device to place the model onto. Examples: `"cuda"`, `"cuda:0"`, `"cpu"`, `"mps"`. Can be ignored if running multi-GPU or non-local model types.

- `--gen_kwargs` **[comma-sep str | json str → dict]** : Generation arguments for `generate_until` tasks. Same format as `--model_args`:
  - Comma-separated: `temperature=0.8,top_p=0.95`
  - JSON: `'{"temperature": 0.8, "top_p": 0.95}'`

  See model documentation (e.g., `transformers.AutoModelForCausalLM.generate()`) for supported arguments. These settings apply to all generation tasks; use task YAML files for per-task control.
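
For example, a generative run combining these settings might look like the following (the task name is illustrative; any `generate_until` task works):

```bash
lm-eval run --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks gsm8k \
    --num_fewshot 5 \
    --batch_size auto:4 \
    --max_batch_size 64 \
    --device cuda:0 \
    --gen_kwargs temperature=0.8,top_p=0.95
```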

#### Data and Output

- `--output_path` **[path: str]** : Output location for results. Format options:
  - Directory: `results/` - saves as `results/<model_name>_<timestamp>.json`
  - File: `results/output.jsonl` - saves to specific file

  When used with `--log_samples`, per-document outputs are saved in the directory.

- `--log_samples` **[flag, default: False]** : Save model outputs and inputs at per-document granularity. Requires `--output_path`. Automatically enabled when using `--predict_only`.

- `--limit` **[int | float]** : Limit evaluation examples per task. **WARNING: Only for testing!**
  - Integer: First N documents (e.g., `100`)
  - Float (0.0-1.0): Percentage of documents (e.g., `0.1` for 10%)

- `--samples` **[path | json str | dict → dict]** : Evaluate specific sample indices only. Input formats:
  - JSON file path: `samples.json`
  - JSON string: `'{"hellaswag": [0, 1, 2], "arc_easy": [10, 20]}'`
  - Dictionary (programmatic use)

  Format: `{"task_name": [indices], ...}`. Incompatible with `--limit`.

#### Caching and Performance

- `--use_cache` **[path: str]** : SQLite cache database path prefix. Creates per-process cache files:
  - Single GPU: `/path/to/cache.db`
  - Multi-GPU: `/path/to/cache_rank0.db`, `/path/to/cache_rank1.db`, etc.

  Caches model outputs to avoid re-running the same (model, task) evaluations.

- `--cache_requests` **["true" | "refresh" | "delete"]** : Dataset request caching control:
  - `"true"`: Use existing cache
  - `"refresh"`: Regenerate cache (use after changing task configs)
  - `"delete"`: Delete cache

  Cache location: `lm_eval/cache/.cache` or `$LM_HARNESS_CACHE_PATH` if set.

- `--check_integrity` **[flag, default: False]** : Run task integrity tests to validate configurations.
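
For example, to cache model outputs under a path prefix and reuse cached dataset requests on subsequent runs (the cache path here is illustrative):

```bash
lm-eval run --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks hellaswag \
    --use_cache ./lm_cache \
    --cache_requests true
```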

#### Instruct Formatting

- `--system_instruction` **[str]** : Custom system instruction to prepend to prompts. Used with instruction-following models.

- `--apply_chat_template` **[bool | str, default: False]** : Apply chat template formatting. Usage:
  - No argument: Apply default/only available template
  - Template name: Apply specific template (e.g., `"chatml"`)

  For HuggingFace models, uses the tokenizer's chat template. Default template defined in [`transformers` documentation](https://github.com/huggingface/transformers/blob/fc35907f95459d7a6c5281dfadd680b6f7b620e3/src/transformers/tokenization_utils_base.py#L1912).

- `--fewshot_as_multiturn` **[flag, default: False]** : Format few-shot examples as multi-turn conversation:
  - Questions → User messages
  - Answers → Assistant responses

  Requires: `--num_fewshot > 0` and `--apply_chat_template` enabled.
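
For example, to evaluate an instruction-tuned checkpoint with its tokenizer's chat template and multi-turn few-shot formatting (the model name is only illustrative; any checkpoint whose tokenizer defines a chat template works):

```bash
lm-eval run --model hf \
    --model_args pretrained=Qwen/Qwen2.5-0.5B-Instruct \
    --tasks gsm8k \
    --num_fewshot 5 \
    --apply_chat_template \
    --fewshot_as_multiturn \
    --system_instruction "You are a careful math assistant."
```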

#### Task Management

- `--include_path` **[path: str]** : Directory containing custom task YAML files. All `.yaml` files in this directory will be registered as available tasks. Use for custom tasks outside of `lm_eval/tasks/`.
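
For instance, assuming a directory `./my_tasks/` containing a `my_custom_task.yaml` (both names are illustrative), the task can be registered and run with:

```bash
lm-eval run --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks my_custom_task \
    --include_path ./my_tasks
```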

#### Logging and Tracking

- `--verbosity` **[str]** : **DEPRECATED** - Use `LOGLEVEL` environment variable instead.

- `--write_out` **[flag, default: False]** : Print first document's prompt and target for each task. Useful for debugging prompt formatting.

- `--show_config` **[flag, default: False]** : Display full task configurations after evaluation. Shows all non-default settings from task YAML files.

- `--wandb_args` **[comma-sep str → dict]** : Weights & Biases integration. Arguments for `wandb.init()`:
  - Example: `project=my-project,name=run-1,tags=test`
  - Special: `step=123` sets logging step
  - See [W&B docs](https://docs.wandb.ai/ref/python/init) for all options

- `--wandb_config_args` **[comma-sep str → dict]** : Additional W&B config arguments, same format as `--wandb_args`.

- `--hf_hub_log_args` **[comma-sep str → dict]** : Hugging Face Hub logging configuration. Format: `key1=value1,key2=value2`. Options:
  - `hub_results_org`: Organization name (default: token owner)
  - `details_repo_name`: Repository for detailed results
  - `results_repo_name`: Repository for aggregated results
  - `push_results_to_hub`: Enable pushing (`True`/`False`)
  - `push_samples_to_hub`: Push samples (`True`/`False`, requires `--log_samples`)
  - `public_repo`: Make repo public (`True`/`False`)
  - `leaderboard_url`: Associated leaderboard URL
  - `point_of_contact`: Contact email
  - `gated`: Gate the dataset (`True`/`False`)
  - ~~`hub_repo_name`~~: Deprecated, use `details_repo_name` and `results_repo_name`
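
For example, a run that logs to Weights & Biases and pushes results to the Hugging Face Hub might look like this (the project, run, and organization names are placeholders):

```bash
lm-eval run --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks hellaswag \
    --output_path results/ \
    --log_samples \
    --wandb_args project=lm-eval,name=pythia-160m-hellaswag \
    --hf_hub_log_args hub_results_org=my-org,push_results_to_hub=True,push_samples_to_hub=True,public_repo=False
```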

#### Advanced Options

- `--predict_only` **[flag, default: False]** : Generate outputs without computing metrics. Automatically enables `--log_samples`. Use to get raw model outputs.

- `--seed` **[int | comma-sep str → list[int], default: [0,1234,1234,1234]]** : Set random seeds for reproducibility:
  - Single integer: Same seed for all (e.g., `42`)
  - Four values: `python,numpy,torch,fewshot` seeds (e.g., `0,1234,8,52`)
  - Use `None` to skip setting a seed (e.g., `0,None,8,52`)

  Default preserves backward compatibility.

- `--trust_remote_code` **[flag, default: False]** : Allow executing remote code from Hugging Face Hub. **Security Risk**: Required for some models with custom code.

- `--confirm_run_unsafe_code` **[flag, default: False]** : Acknowledge risks when running tasks that execute arbitrary Python code (e.g., code generation tasks).

- `--metadata` **[json str → dict]** : Additional metadata for specific tasks. Format: `'{"key": "value"}'`. Required by tasks like RULER that need extra configuration.
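
For example, to collect raw model outputs with fixed seeds and no metric computation:

```bash
lm-eval run --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks hellaswag \
    --output_path results/ \
    --predict_only \
    --seed 0,1234,8,52
```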

## External Library Usage

We also support using the library's external API within model training loops or other scripts.

`lm_eval` supplies two functions for external import and use: `lm_eval.evaluate()` and `lm_eval.simple_evaluate()`.

`simple_evaluate()` can be used by simply creating an `lm_eval.api.model.LM` subclass that implements the methods described in the [Model Guide](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs/model_guide.md), and wrapping your custom model in that class as follows:

```python
import lm_eval
from lm_eval.utils import setup_logging
...

# initialize logging
setup_logging("DEBUG")  # optional, but recommended; or you can set up logging yourself

my_model = initialize_my_model()  # create your model (could be running finetuning with some custom modeling code)
...

# instantiate an LM subclass that takes your initialized model and can run
# - `Your_LM.loglikelihood()`
# - `Your_LM.loglikelihood_rolling()`
# - `Your_LM.generate_until()`
lm_obj = Your_LM(model=my_model, batch_size=16)

# indexes all tasks from the `lm_eval/tasks` subdirectory.
# Alternatively, you can set `TaskManager(include_path="path/to/my/custom/task/configs")`
# to include a set of tasks in a separate directory.
task_manager = lm_eval.tasks.TaskManager()

# Setting `task_manager` to the one above is optional and should generally be done
# if you want to include tasks from paths other than ones in `lm_eval/tasks`.
# `simple_evaluate` will instantiate its own task_manager if it is set to None here.
results = lm_eval.simple_evaluate( # call simple_evaluate
    model=lm_obj,
    tasks=["taskname1", "taskname2"],
    num_fewshot=0,
    task_manager=task_manager,
    ...
)
```

See the `simple_evaluate()` and `evaluate()` functions in [lm_eval/evaluator.py](../lm_eval/evaluator.py#:~:text=simple_evaluate) for a full description of all available arguments. All keyword arguments to `simple_evaluate()` share the same role as the command-line flags described previously.

Additionally, the `evaluate()` function offers the core evaluation functionality provided by the library, but without some of the special handling, simplification, and abstraction provided by `simple_evaluate()`.

As a brief example usage of `evaluate()`:

```python
import lm_eval

# suppose you've defined a custom lm_eval.api.task.Task subclass in your own external codebase
from my_tasks import MyTask1
...

# create your model (could be running finetuning with some custom modeling code)
my_model = initialize_my_model()
...

# instantiate an LM subclass that takes your initialized model and can run
# - `Your_LM.loglikelihood()`
# - `Your_LM.loglikelihood_rolling()`
# - `Your_LM.generate_until()`
lm_obj = Your_LM(model=my_model, batch_size=16)

# optional: the task_manager indexes tasks including ones
# specified by the user through `include_path`.
task_manager = lm_eval.tasks.TaskManager(
    include_path="/path/to/custom/yaml"
)

# To get a task dict for `evaluate`
task_dict = lm_eval.tasks.get_task_dict(
    [
        "mmlu",            # A stock task
        "my_custom_task",  # A custom task
        {
            "task": ...,   # A dict that configures a task
            "doc_to_text": ...,
        },
        MyTask1,           # A task object subclassing `lm_eval.api.task.Task`
    ],
    task_manager,  # A task manager that allows lm_eval to
                   # load the task during evaluation.
                   # If none is provided, `get_task_dict`
                   # will instantiate one itself, but this
                   # only includes the stock tasks, so users
                   # will need to set this if including
                   # custom paths is required.
)

results = lm_eval.evaluate(
    lm=lm_obj,
    task_dict=task_dict,
    ...
)
```