# User Guide

This document details the interface exposed by `lm-eval` and provides details on what flags are available to users.

## Command-line Interface

Most users run the library by cloning it from GitHub, installing the package in editable mode, and running the `python -m lm_eval` script.

Equivalently, the library can be run via the `lm-eval` entrypoint at the command line.

### Subcommand Structure

The CLI now uses a subcommand structure for better organization:

- `lm-eval run` - Execute evaluations (default behavior)
- `lm-eval list` - List available tasks, models, etc.
- `lm-eval validate` - Validate task configurations

For backward compatibility, if no subcommand is specified, `run` is automatically inserted. So `lm-eval --model hf --tasks hellaswag` is equivalent to `lm-eval run --model hf --tasks hellaswag`.
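
For example, a representative invocation, using flags documented in the sections below, might look like:

```bash
lm-eval run \
  --model hf \
  --model_args pretrained=EleutherAI/pythia-160m \
  --tasks hellaswag \
  --batch_size 8 \
  --output_path results/
```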

### Run Command Arguments

The `run` command supports a number of command-line arguments. Details can also be seen by running with `-h` or `--help`:

#### Configuration

- `--config` : Set initial arguments from a YAML configuration file. Takes a path to a YAML file that contains argument values. This allows you to specify complex configurations in a file rather than on the command line.
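
As a minimal sketch, assuming the YAML keys mirror the `run` argument names (the file name and values here are illustrative):

```bash
# Write a config file, then point --config at it.
cat > eval_config.yaml <<'EOF'
model: hf
model_args: pretrained=EleutherAI/pythia-160m
tasks: hellaswag
batch_size: 8
EOF

lm-eval run --config eval_config.yaml
```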

#### Model and Tasks

- `--model` : Selects which model type or provider is evaluated. Must be a string corresponding to the name of the model type/provider being used. See [the main README](https://github.com/EleutherAI/lm-evaluation-harness/tree/main#model-apis-and-inference-servers) for a full list of enabled model names and supported libraries or APIs.

- `--model_args` : Controls parameters passed to the model constructor. Accepts a string containing comma-separated keyword arguments to the model class of the format `"arg1=val1,arg2=val2,..."`, for example `--model_args pretrained=EleutherAI/pythia-160m,dtype=float32`. For a full list of supported keyword arguments, see the initialization of the relevant `lm_eval.api.model.LM` subclass, e.g. [`HFLM`](https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/models/huggingface.py#L66).

- `--tasks` : Determines which tasks or task groups are evaluated. Accepts a comma-separated list of task names or task group names. Must be solely comprised of valid tasks/groups. A list of supported tasks can be viewed with `lm-eval list tasks`.

#### Evaluation Settings

- `--num_fewshot` : Sets the number of few-shot examples to place in context. Must be an integer.

- `--batch_size` : Sets the batch size used for evaluation. Can be a positive integer or `"auto"` to automatically select the largest batch size that will fit in memory, speeding up evaluation. One can pass `--batch_size auto:N` to re-select the maximum batch size `N` times during evaluation. This can help accelerate evaluation further, since `lm-eval` sorts documents in descending order of context length.

- `--max_batch_size` : Sets the maximum batch size to try to fit in memory, if `--batch_size auto` is passed.
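
For example, to let the harness re-select the largest workable batch size up to 4 times over the course of a run, capped at 64:

```bash
lm-eval run --model hf \
  --model_args pretrained=EleutherAI/pythia-160m \
  --tasks hellaswag \
  --batch_size auto:4 \
  --max_batch_size 64
```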

- `--device` : Sets which device to place the model onto. Must be a string, for example, `"cuda", "cuda:0", "cpu", "mps"`. Defaults to "cuda", and can be ignored when running multi-GPU or using a non-local model type.

- `--gen_kwargs` : Takes an argument string in the same format as `--model_args` and creates a dictionary of keyword arguments. These are passed to the model for all `generate_until` (free-form or greedy generation) tasks, to set options such as the sampling temperature or `top_p` / `top_k`. For a list of which arguments are supported for each model type, see the respective library's documentation (for example, the documentation for `transformers.AutoModelForCausalLM.generate()`). These kwargs apply to every `generate_until` task in the run; we do not currently support per-task `gen_kwargs` or `batch_size` values in a single run of the library. To control these on a per-task level, set them in that task's YAML file.
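
For example, to enable sampling on a generation task (the task name and parameter values here are illustrative; `do_sample`, `temperature`, and `top_p` are `transformers` generation arguments):

```bash
lm-eval run --model hf \
  --model_args pretrained=EleutherAI/pythia-160m \
  --tasks gsm8k \
  --gen_kwargs do_sample=True,temperature=0.7,top_p=0.9
```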

#### Data and Output

- `--output_path` : A string of the form `dir/file.jsonl` or `dir/`. Provides a path where high-level results will be saved, either into the file named or into the directory named. If `--log_samples` is passed as well, then per-document outputs and metrics will be saved into the directory as well.

- `--log_samples` : If this flag is passed, then the model's outputs and the text fed into the model will be saved at per-document granularity. Must be used with `--output_path`.
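
For example, to save aggregate results along with per-document samples:

```bash
lm-eval run --model hf \
  --model_args pretrained=EleutherAI/pythia-160m \
  --tasks hellaswag \
  --output_path results/ \
  --log_samples
```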

- `--limit` : Accepts an integer, or a float between 0.0 and 1.0. If passed, limits the number of documents evaluated to the first X documents per task (if an integer) or to the first X% of documents per task (if a float). Useful for debugging, especially on costly API models.

- `--samples` : JSON file with specific sample indices for inputs in the format `{"task_name":[indices],...}`. This allows you to evaluate only specific samples from each task. Incompatible with `--limit`.
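
As a minimal sketch (the indices here are illustrative):

```bash
# Evaluate only documents 0, 1, and 5 of hellaswag.
echo '{"hellaswag": [0, 1, 5]}' > samples.json

lm-eval run --model hf \
  --model_args pretrained=EleutherAI/pythia-160m \
  --tasks hellaswag \
  --samples samples.json
```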

#### Caching and Performance

- `--use_cache` : Should be a path where a sqlite db file can be written to. Takes a string of format `/path/to/sqlite_cache_` in order to create a cache db at `/path/to/sqlite_cache_rank{i}.db` for each process (0-NUM_GPUS). This allows the results of prior runs to be cached, so that a given (model, task) pair need not be re-run in order to be re-scored or re-evaluated.

- `--cache_requests` : Can be "true", "refresh", or "delete". "true" means that the cache should be used. "refresh" means that you wish to regenerate the cache, which you should do if you change your dataset configuration for a given task. "delete" will delete the cache. Cached files are stored under `lm_eval/cache/.cache` unless you specify a different path via the environment variable `LM_HARNESS_CACHE_PATH`, e.g. `LM_HARNESS_CACHE_PATH=~/Documents/cache_for_lm_harness`.
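
For example, to cache both model responses and constructed requests across runs (the paths here are illustrative):

```bash
LM_HARNESS_CACHE_PATH=~/.cache/lm_harness \
lm-eval run --model hf \
  --model_args pretrained=EleutherAI/pythia-160m \
  --tasks hellaswag \
  --use_cache /tmp/lm_eval_sqlite_cache_ \
  --cache_requests true
```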

- `--check_integrity` : If this flag is used, the library's tests for each selected task are run to confirm task integrity.

#### Instruct Formatting

- `--system_instruction`: Specifies a system instruction string to prepend to the prompt.

- `--apply_chat_template` : This flag specifies whether to apply a chat template to the prompt. It can be used in the following ways:
  - `--apply_chat_template` : When used without an argument, applies the only available chat template to the prompt. For Hugging Face models, if no dedicated chat template exists, the default chat template will be applied.
  - `--apply_chat_template template_name` : If the model has multiple chat templates, apply the specified template to the prompt.

    For Hugging Face models, the default chat template can be found in the [`default_chat_template`](https://github.com/huggingface/transformers/blob/fc35907f95459d7a6c5281dfadd680b6f7b620e3/src/transformers/tokenization_utils_base.py#L1912) property of the Transformers Tokenizer.

- `--fewshot_as_multiturn` : If this flag is on, the fewshot examples are treated as a multi-turn conversation. Questions are provided as user content and answers are provided as assistant responses. Requires `--num_fewshot` to be set to a value greater than 0, and `--apply_chat_template` to be on.
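
For example, to run 5-shot evaluation with the model's chat template applied and the fewshot examples formatted as conversation turns (the model name here is illustrative; any chat-tuned Hugging Face model would do):

```bash
lm-eval run --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-chat-hf \
  --tasks hellaswag \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn
```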

#### Task Management

- `--include_path` : Accepts a path to a folder. If passed, then all YAML files within it containing `lm-eval`-compatible task configurations will be added to the task registry as available tasks. Useful when writing config files for your own tasks in a folder other than `lm_eval/tasks/`.

#### Logging and Tracking

- `--verbosity` : (Deprecated) Sets the log level. Use the `LOGLEVEL` environment variable instead.

- `--write_out` : Used for diagnostic purposes to observe the format of task documents passed to a model. If this flag is used, the prompt and gold target string for the first document of each task are printed.

- `--show_config` : If used, prints the full `lm_eval.api.task.TaskConfig` contents (non-default settings from the task's YAML file) for each task that was run, at the completion of an evaluation. Useful when one is modifying a task's configuration YAML locally and wants to record the exact configuration used, for debugging or reproducibility purposes.

- `--wandb_args` : Enables logging evaluation runs to Weights & Biases. Accepts arguments passed to `wandb.init`, such as `project` and `job_type`; the full list is [here](https://docs.wandb.ai/ref/python/init). E.g., `--wandb_args project=test-project,name=test-run`. Also allows passing the step to log at (passed to `wandb.run.log`), e.g., `--wandb_args step=123`.

- `--wandb_config_args`: Additional Weights & Biases config arguments passed separately from the init arguments. Format is the same as `--wandb_args` with comma-separated key=value pairs.

- `--hf_hub_log_args` : Logs evaluation results to the Hugging Face Hub. Accepts a string with the arguments separated by commas; an example invocation follows this list. Available arguments:
  - `hub_results_org` - organization name on the Hugging Face Hub, e.g., `EleutherAI`. If not provided, the results will be pushed to the owner of the Hugging Face token.
  - `hub_repo_name` - repository name on the Hugging Face Hub (deprecated; `details_repo_name` and `results_repo_name` should be used instead), e.g., `lm-eval-results`.
  - `details_repo_name` - repository name on the Hugging Face Hub to store details, e.g., `lm-eval-results`.
  - `results_repo_name` - repository name on the Hugging Face Hub to store results, e.g., `lm-eval-results`.
  - `push_results_to_hub` - whether to push results to the Hugging Face Hub; can be `True` or `False`.
  - `push_samples_to_hub` - whether to push sample results to the Hugging Face Hub; can be `True` or `False`. Requires `--log_samples` to be set.
  - `public_repo` - whether the repository is public; can be `True` or `False`.
  - `leaderboard_url` - URL to the leaderboard, e.g., `https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard`.
  - `point_of_contact` - point of contact for the results dataset, e.g., `yourname@example.com`.
  - `gated` - whether to gate the details dataset; can be `True` or `False`.
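
For example (the organization and repository names here are illustrative):

```bash
lm-eval run --model hf \
  --model_args pretrained=EleutherAI/pythia-160m \
  --tasks hellaswag \
  --output_path results/ \
  --log_samples \
  --hf_hub_log_args hub_results_org=EleutherAI,details_repo_name=lm-eval-results,push_results_to_hub=True,push_samples_to_hub=True,public_repo=False
```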

#### Advanced Options

- `--predict_only`: Generates the model outputs without computing metrics. Use with `--log_samples` to retrieve decoded results.

- `--seed` : Sets the seeds for Python's `random`, `numpy`, `torch`, and fewshot sampling. Accepts a comma-separated list of 4 values for these four seeds, respectively, or a single integer to set the same seed for all four. Each value is either an integer or `None` to leave that seed unset. The default is `0,1234,1234,1234` (for backward compatibility). E.g., `--seed 0,None,8,52` sets `random.seed(0)`, `torch.manual_seed(8)`, and the fewshot seed to 52, while numpy's seed is not set since the second value is `None`. E.g., `--seed 42` sets all four seeds to 42.

- `--trust_remote_code`: Allow executing remote code from Hugging Face Hub. This flag enables the execution of custom code from model repositories, which can be necessary for some models but introduces security risks.

- `--confirm_run_unsafe_code`: Confirm understanding of unsafe code execution risks. This flag is used to acknowledge that you understand the risks associated with executing potentially unsafe code.

- `--metadata`: JSON string to pass to TaskConfig. Used for some tasks which require additional metadata to be passed for processing. E.g., `--metadata '{"key": "value"}'`.

## External Library Usage

We also support using the library's external API, e.g. within model training loops or other scripts.

`lm_eval` supplies two functions for external import and use: `lm_eval.evaluate()` and `lm_eval.simple_evaluate()`.

`simple_evaluate()` can be used by simply creating an `lm_eval.api.model.LM` subclass that implements the methods described in the [Model Guide](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs/model_guide.md), and wrapping your custom model in that class as follows:

```python
import lm_eval
from lm_eval.utils import setup_logging
...
# initialize logging
setup_logging("DEBUG") # optional, but recommended; or you can set up logging yourself
my_model = initialize_my_model() # create your model (could be running finetuning with some custom modeling code)
...
# instantiate an LM subclass that takes your initialized model and can run
# - `Your_LM.loglikelihood()`
# - `Your_LM.loglikelihood_rolling()`
# - `Your_LM.generate_until()`
lm_obj = Your_LM(model=my_model, batch_size=16)

# indexes all tasks from the `lm_eval/tasks` subdirectory.
# Alternatively, you can set `TaskManager(include_path="path/to/my/custom/task/configs")`
# to include a set of tasks in a separate directory.
task_manager = lm_eval.tasks.TaskManager()

# Setting `task_manager` to the one above is optional and should generally be done
# if you want to include tasks from paths other than ones in `lm_eval/tasks`.
# `simple_evaluate` will instantiate its own task_manager if it is set to None here.
results = lm_eval.simple_evaluate( # call simple_evaluate
    model=lm_obj,
    tasks=["taskname1", "taskname2"],
    num_fewshot=0,
    task_manager=task_manager,
    ...
)
```

See the `simple_evaluate()` and `evaluate()` functions in [lm_eval/evaluator.py](../lm_eval/evaluator.py#:~:text=simple_evaluate) for a full description of all available arguments. All keyword arguments to `simple_evaluate()` share the same role as the command-line flags described previously.

Additionally, the `evaluate()` function offers the core evaluation functionality provided by the library, but without some of the special handling, simplification, and abstraction provided by `simple_evaluate()`.

As a brief example usage of `evaluate()`:

```python
import lm_eval

# suppose you've defined a custom lm_eval.api.Task subclass in your own external codebase
from my_tasks import MyTask1
...

# create your model (could be running finetuning with some custom modeling code)
my_model = initialize_my_model()
...

# instantiate an LM subclass that takes your initialized model and can run
# - `Your_LM.loglikelihood()`
# - `Your_LM.loglikelihood_rolling()`
# - `Your_LM.generate_until()`
lm_obj = Your_LM(model=my_model, batch_size=16)

# optional: the task_manager indexes tasks including ones
# specified by the user through `include_path`.
task_manager = lm_eval.tasks.TaskManager(
    include_path="/path/to/custom/yaml"
)

# To get a task dict for `evaluate`
task_dict = lm_eval.tasks.get_task_dict(
    [
        "mmlu", # A stock task
        "my_custom_task", # A custom task
        {
            "task": ...,  # A dict that configures a task
            "doc_to_text": ...,
        },
        MyTask1 # A task object from `lm_eval.task.Task`
        ],
    task_manager # A task manager that allows lm_eval to
                 # load the task during evaluation.
                 # If none is provided, `get_task_dict`
                 # will instantiate one itself, but this
                 # only includes the stock tasks so users
                 # will need to set this if including
                 # custom paths is required.
    )

results = lm_eval.evaluate(
    lm=lm_obj,
    task_dict=task_dict,
    ...
)
```