Commit 7a8203fa authored by Baber

Merge branch 'feature/eval_from_config' into metrics

# Conflicts:
#	lm_eval/__main__.py
#	lm_eval/utils.py
parents e6b798f9 91e49e23
......@@ -8,71 +8,160 @@ A majority of users run the library by cloning it from Github, installing the pa
Equivalently, running the library can be done via the `lm-eval` entrypoint at the command line.
### Subcommand Structure

The CLI now uses a subcommand structure for better organization:

- `lm-eval run` - Execute evaluations (default behavior)
- `lm-eval ls` - List available tasks, models, etc.
- `lm-eval validate` - Validate task configurations

For backward compatibility, if no subcommand is specified, `run` is automatically inserted, so `lm-eval --model hf --tasks hellaswag` is equivalent to `lm-eval run --model hf --tasks hellaswag`.

### Run Command Arguments

The `run` command supports a number of command-line arguments. Details can also be seen by running with `-h` or `--help`:

#### Configuration

- `--config` **[path: str]** : Sets initial arguments from a YAML configuration file. Takes a path to a YAML file that contains argument values, which allows complex configurations to be specified in a file rather than on the command line. Further CLI arguments override values from the configuration file.
For the complete list of available configuration fields and their types, see [`EvaluatorConfig` in the source code](../lm_eval/config/evaluate_config.py).

#### Model and Tasks

- `--model` **[str, default: "hf"]** : Selects which model type or provider is evaluated. Must be a string corresponding to the name of the model type/provider being used. See [the main README](https://github.com/EleutherAI/lm-evaluation-harness/tree/main#model-apis-and-inference-servers) for a full list of enabled model names and supported libraries or APIs.
- `--model_args` **[comma-sep str | json str → dict]** : Controls parameters passed to the model constructor. Can be provided as:
  - Comma-separated string: `pretrained=EleutherAI/pythia-160m,dtype=float32`
  - JSON string: `'{"pretrained": "EleutherAI/pythia-160m", "dtype": "float32"}'`

  For a full list of supported arguments, see the initialization of the `lm_eval.api.model.LM` subclass, e.g. [`HFLM`](https://github.com/EleutherAI/lm-evaluation-harness/blob/365fcda9b85bbb6e0572d91976b8daf409164500/lm_eval/models/huggingface.py#L66).
- `--tasks` **[comma-sep str → list[str]]** : Determines which tasks or task groups are evaluated. Accepts a comma-separated list of task names or task group names, which must consist solely of valid tasks/groups. A list of supported tasks can be viewed with `lm-eval ls tasks`.
#### Evaluation Settings

- `--num_fewshot` **[int]** : Sets the number of few-shot examples to place in context. Must be an integer.
- `--batch_size` **[int | "auto" | "auto:N", default: 1]** : Sets the batch size used for evaluation. Options:
  - Integer: Fixed batch size (e.g., `8`)
  - `"auto"`: Automatically select the largest batch size that fits in memory
  - `"auto:N"`: Re-select the maximum batch size N times during evaluation

  Auto mode is useful since `lm-eval` sorts documents in descending order of context length.
- `--max_batch_size` **[int]** : Sets the maximum batch size to try when using `--batch_size auto`.
- `--device` **[str]** : Sets which device to place the model onto. Examples: `"cuda"`, `"cuda:0"`, `"cpu"`, `"mps"`. Can be ignored if running multi-GPU or non-local model types.
- `--gen_kwargs` **[comma-sep str | json str → dict]** : Generation arguments for `generate_until` tasks. Same format as `--model_args`:
  - Comma-separated: `temperature=0.8,top_p=0.95`
  - JSON: `'{"temperature": 0.8, "top_p": 0.95}'`

  See the model documentation (e.g., `transformers.AutoModelForCausalLM.generate()`) for supported arguments. These settings are applied to all generation tasks; use task YAML files for per-task control.
#### Data and Output

- `--output_path` **[path: str]** : Output location for results. Format options:
  - Directory: `results/` - saves as `results/<model_name>_<timestamp>.json`
  - File: `results/output.jsonl` - saves to a specific file

  When used with `--log_samples`, per-document outputs are saved in the directory.
- `--log_samples` **[flag, default: False]** : Save model outputs and inputs at per-document granularity. Requires `--output_path`. Automatically enabled when using `--predict_only`.
- `--limit` **[int | float]** : Limit evaluation examples per task. **WARNING: Only for testing!**
  - Integer: First N documents (e.g., `100`)
  - Float (0.0-1.0): Percentage of documents (e.g., `0.1` for 10%)
- `--samples` **[path | json str | dict → dict]** : Evaluate specific sample indices only. Input formats:
  - JSON file path: `samples.json`
  - JSON string: `'{"hellaswag": [0, 1, 2], "arc_easy": [10, 20]}'`
  - Dictionary (programmatic use)

  Format: `{"task_name": [indices], ...}`. Incompatible with `--limit`. A samples file can also be generated programmatically, as in the sketch below.
#### Caching and Performance

- `--use_cache` **[path: str]** : SQLite cache database path prefix. Creates per-process cache files:
  - Single GPU: `/path/to/cache.db`
  - Multi-GPU: `/path/to/cache_rank0.db`, `/path/to/cache_rank1.db`, etc.

  Caches model outputs to avoid re-running the same (model, task) evaluations.
- `--cache_requests` **["true" | "refresh" | "delete"]** : Dataset request caching control:
  - `"true"`: Use existing cache
  - `"refresh"`: Regenerate cache (use after changing task configs)
  - `"delete"`: Delete cache

  Cache location: `lm_eval/cache/.cache`, or `$LM_HARNESS_CACHE_PATH` if set.
- `--check_integrity` **[flag, default: False]** : Run task integrity tests to validate configurations.

#### Instruct Formatting

- `--system_instruction` **[str]** : Custom system instruction to prepend to prompts. Used with instruction-following models.
- `--apply_chat_template` **[bool | str, default: False]** : Apply chat template formatting. Usage:
  - No argument: Apply the default/only available template
  - Template name: Apply a specific template (e.g., `"chatml"`)

  For Hugging Face models, this uses the tokenizer's chat template. The default template is defined in the [`transformers` source](https://github.com/huggingface/transformers/blob/fc35907f95459d7a6c5281dfadd680b6f7b620e3/src/transformers/tokenization_utils_base.py#L1912).
- `--fewshot_as_multiturn` **[flag, default: False]** : Format few-shot examples as a multi-turn conversation:
  - Questions → User messages
  - Answers → Assistant responses

  Requires `--num_fewshot > 0` and `--apply_chat_template` enabled.
#### Task Management

- `--include_path` **[path: str]** : Directory containing custom task YAML files. All `.yaml` files in this directory will be registered as available tasks. Use for custom tasks outside of `lm_eval/tasks/`.

#### Logging and Tracking

- `--verbosity` **[str]** : **DEPRECATED** - Use the `LOGLEVEL` environment variable instead.
- `--write_out` **[flag, default: False]** : Print the first document's prompt and target for each task. Useful for debugging prompt formatting.
- `--show_config` **[flag, default: False]** : Display full task configurations after evaluation. Shows all non-default settings from task YAML files.
- `--wandb_args` **[comma-sep str → dict]** : Weights & Biases integration. Arguments for `wandb.init()`:
  - Example: `project=my-project,name=run-1,tags=test`
  - Special: `step=123` sets the logging step
  - See the [W&B docs](https://docs.wandb.ai/ref/python/init) for all options
- `--wandb_config_args` **[comma-sep str → dict]** : Additional W&B config arguments, same format as `--wandb_args`.
- `--hf_hub_log_args` **[comma-sep str → dict]** : Hugging Face Hub logging configuration. Format: `key1=value1,key2=value2`. Options:
  - `hub_results_org`: Organization name (default: token owner)
  - `details_repo_name`: Repository for detailed results
  - `results_repo_name`: Repository for aggregated results
  - `push_results_to_hub`: Enable pushing (`True`/`False`)
  - `push_samples_to_hub`: Push samples (`True`/`False`, requires `--log_samples`)
  - `public_repo`: Make the repo public (`True`/`False`)
  - `leaderboard_url`: Associated leaderboard URL
  - `point_of_contact`: Contact email
  - `gated`: Gate the dataset (`True`/`False`)
  - ~~`hub_repo_name`~~: Deprecated, use `details_repo_name` and `results_repo_name`

#### Advanced Options

- `--predict_only` **[flag, default: False]** : Generate outputs without computing metrics. Automatically enables `--log_samples`. Use to get raw model outputs.
- `--seed` **[int | comma-sep str → list[int], default: [0,1234,1234,1234]]** : Set random seeds for reproducibility:
  - Single integer: Same seed for all (e.g., `42`)
  - Four values: `python,numpy,torch,fewshot` seeds (e.g., `0,1234,8,52`)
  - Use `None` to skip setting a seed (e.g., `0,None,8,52`)

  The default preserves backward compatibility.
- `--trust_remote_code` **[flag, default: False]** : Allow executing remote code from the Hugging Face Hub. **Security risk**: required for some models with custom code.
- `--confirm_run_unsafe_code` **[flag, default: False]** : Acknowledge the risks of running tasks that execute arbitrary Python code (e.g., code generation tasks).
- `--metadata` **[json str → dict]** : Additional metadata for specific tasks. Format: `'{"key": "value"}'`. Required by tasks like RULER that need extra configuration.
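As a rough, non-authoritative sketch, a typical `run` invocation corresponds to a call to `lm_eval.simple_evaluate` with the same options passed as keyword arguments:

```python
# Roughly equivalent to:
#   lm-eval run --model hf \
#     --model_args pretrained=EleutherAI/pythia-160m,dtype=float32 \
#     --tasks hellaswag,arc_easy --num_fewshot 5 --batch_size 8
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args={"pretrained": "EleutherAI/pythia-160m", "dtype": "float32"},
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])  # aggregated per-task metrics
```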
## External Library Usage
......
import logging
import os
__version__ = "0.4.9"
......
"""
CLI subcommands to run from terminal.
"""
import argparse
import sys
import textwrap
from lm_eval._cli.ls import List
from lm_eval._cli.run import Run
from lm_eval._cli.validate import Validate
class HarnessCLI:
"""Main CLI parser that manages all subcommands."""
def __init__(self):
self._parser = argparse.ArgumentParser(
prog="lm-eval",
description="Language Model Evaluation Harness",
epilog=textwrap.dedent("""
quick start:
# Basic evaluation
lm-eval run --model hf --model_args pretrained=gpt2 --tasks hellaswag
# List available tasks
lm-eval ls tasks
# Validate task configurations
lm-eval validate --tasks hellaswag,arc_easy
legacy compatibility:
The harness maintains backward compatibility with the original interface.
If no command is specified, 'run' is automatically inserted:
lm-eval --model hf --tasks hellaswag # Equivalent to 'lm-eval run --model hf --tasks hellaswag'
For documentation, visit: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md
"""),
formatter_class=argparse.RawDescriptionHelpFormatter,
)
self._parser.set_defaults(func=lambda args: self._parser.print_help())
self._subparsers = self._parser.add_subparsers(
dest="command", help="Available commands", metavar="COMMAND"
)
Run.create(self._subparsers)
List.create(self._subparsers)
Validate.create(self._subparsers)
def parse_args(self) -> argparse.Namespace:
"""Parse arguments using the main parser."""
if len(sys.argv) > 2 and sys.argv[1] not in self._subparsers.choices:
# Backward compatibility: arguments provided but no valid subcommand - insert 'run'
sys.argv.insert(1, "run")
elif len(sys.argv) == 2 and "run" in sys.argv:
# if only 'run' is specified, ensure it is treated as a subcommand
self._subparsers.choices["run"].print_help()
sys.exit(0)
return self._parser.parse_args()
def execute(self, args: argparse.Namespace) -> None:
"""Main execution method that handles subcommands and legacy support."""
args.func(args)
import argparse
import textwrap
from lm_eval._cli.subcommand import SubCommand
class List(SubCommand):
"""Command for listing available tasks."""
def __init__(self, subparsers: argparse._SubParsersAction, *args, **kwargs):
# Create and configure the parser
super().__init__(*args, **kwargs)
self._parser = subparsers.add_parser(
"ls",
help="List available tasks, groups, subtasks, or tags",
description="List available tasks, groups, subtasks, or tags from the evaluation harness.",
usage="lm-eval list [tasks|groups|subtasks|tags] [--include_path DIR]",
epilog=textwrap.dedent("""
examples:
# List all available tasks (includes groups, subtasks, and tags)
$ lm-eval ls tasks
# List only task groups (like 'mmlu', 'glue', 'superglue')
$ lm-eval ls groups
# List only individual subtasks (like 'mmlu_abstract_algebra')
$ lm-eval ls subtasks
# Include external task definitions
$ lm-eval ls tasks --include_path /path/to/external/tasks
# List tasks from multiple external paths
$ lm-eval ls tasks --include_path "/path/to/tasks1:/path/to/tasks2"
organization:
• Groups: Collections of tasks with aggregated metric across subtasks (e.g., 'mmlu')
• Subtasks: Individual evaluation tasks (e.g., 'mmlu_anatomy', 'hellaswag')
• Tags: Similar to groups but no aggregate metric (e.g., 'reasoning', 'knowledge', 'language')
• External Tasks: Custom tasks defined in external directories
evaluation usage:
After listing tasks, use them with the run command!
For more information, task configs are defined in https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks
"""),
formatter_class=argparse.RawDescriptionHelpFormatter,
)
self._add_args()
self._parser.set_defaults(func=self._execute)
def _add_args(self) -> None:
self._parser.add_argument(
"what",
choices=["tasks", "groups", "subtasks", "tags"],
nargs="?",
help="What to list: tasks (all), groups, subtasks, or tags",
)
self._parser.add_argument(
"--include_path",
type=str,
default=None,
metavar="DIR",
help="Additional path to include if there are external tasks.",
)
def _execute(self, args: argparse.Namespace) -> None:
"""Execute the list command."""
from lm_eval.tasks import TaskManager
task_manager = TaskManager(include_path=args.include_path)
if args.what == "tasks":
print(task_manager.list_all_tasks())
elif args.what == "groups":
print(task_manager.list_all_tasks(list_subtasks=False, list_tags=False))
elif args.what == "subtasks":
print(task_manager.list_all_tasks(list_groups=False, list_tags=False))
elif args.what == "tags":
print(task_manager.list_all_tasks(list_groups=False, list_subtasks=False))
elif args.what is None:
self._parser.print_help()
import argparse
import json
import logging
import os
import textwrap
from functools import partial
from lm_eval._cli.subcommand import SubCommand
from lm_eval._cli.utils import (
_int_or_none_list_arg_type,
key_val_to_dict,
merge_dicts,
request_caching_arg_to_dict,
try_parse_json,
)
class Run(SubCommand):
"""Command for running language model evaluation."""
def __init__(self, subparsers: argparse._SubParsersAction, *args, **kwargs):
super().__init__(*args, **kwargs)
self._parser = subparsers.add_parser(
"run",
help="Run the evaluation harness on specified tasks",
description="Evaluate language models on various benchmarks and tasks.",
usage="lm-eval run --model <model> --tasks <task> <task> --model_args <arg=value> <arg=value> [options]",
epilog=textwrap.dedent("""
examples:
# Basic evaluation with HuggingFace model
$ lm-eval run --model hf --model_args pretrained=gpt2 dtype=float32 --tasks hellaswag
# Evaluate on multiple tasks with few-shot examples
$ lm-eval run --model vllm --model_args pretrained=EleutherAI/gpt-j-6B --tasks arc_easy arc_challenge --num_fewshot 5
# Evaluation with custom generation parameters
$ lm-eval run --model hf --model_args pretrained=gpt2 --tasks lambada --gen_kwargs temperature=0.8 top_p=0.95 'stop=["\\n\\n"]'
# Use configuration file
$ lm-eval run --config my_config.yaml --tasks mmlu
For more information, see: https://github.com/EleutherAI/lm-evaluation-harness
"""),
formatter_class=argparse.RawDescriptionHelpFormatter,
)
self._add_args()
self._parser.set_defaults(func=self._execute)
def _add_args(self) -> None:
# Defaults are set in config/evaluate_config.py
config_group = self._parser.add_argument_group("configuration")
config_group.add_argument(
"--config",
"-C",
default=None,
type=str,
metavar="YAML_PATH",
help="Set initial arguments from YAML config",
)
# Model and Tasks
model_group = self._parser.add_argument_group("model and tasks")
model_group.add_argument(
"--model",
"-m",
type=str,
default=None,
metavar="MODEL_NAME",
help="Model name (default: hf)",
)
model_group.add_argument(
"--tasks",
"-t",
default=None,
type=str,
nargs="*",
metavar="TASK1 TASK2",
help=textwrap.dedent("""
Space or Comma-separated list of task names or groupings.
Use 'lm-eval list tasks' to see all available tasks.
""").strip(),
)
model_group.add_argument(
"--model_args",
"-a",
default=None,
nargs="*",
type=key_val_to_dict,
metavar="ARGS",
help="Model arguments as 'key=val,key2=val2' or `key=val` `key2=val2`",
)
# Evaluation Settings
eval_group = self._parser.add_argument_group("evaluation settings")
eval_group.add_argument(
"--num_fewshot",
"-f",
type=int,
default=None,
metavar="N",
help="Number of examples in few-shot context",
)
eval_group.add_argument(
"--batch_size",
"-b",
type=str,
default=argparse.SUPPRESS,
metavar="auto|auto:N|N",
help=textwrap.dedent(
"Batch size: 'auto', 'auto:N' (auto-tune N times), or integer (default: 1)"
),
)
eval_group.add_argument(
"--max_batch_size",
type=int,
default=None,
metavar="N",
help="Maximum batch size when using --batch_size auto",
)
eval_group.add_argument(
"--device",
type=str,
default=None,
metavar="DEVICE",
help="Device to use (e.g. cuda, cuda:0, cpu, mps)",
)
eval_group.add_argument(
"--gen_kwargs",
type=key_val_to_dict,
default=None,
nargs="*",
metavar="KWARGS",
help=textwrap.dedent(
'Generation arguments as `temperature=0,stop=["stop"]` or `key=val` `key2=val2`. '
"Values should be parsable with ast.literal_eval."
),
)
# Data and Output
data_group = self._parser.add_argument_group("data and output")
data_group.add_argument(
"--output_path",
"-o",
default=None,
type=str,
metavar="OUTPUT_PATH",
help="Output dir or json file for results (and samples)",
)
data_group.add_argument(
"--log_samples",
"-s",
action="store_true",
default=argparse.SUPPRESS,
help="Save all model outputs and documents for post-hoc analysis",
)
data_group.add_argument(
"--limit",
"-L",
type=float,
default=None,
metavar="N|0.0-1.0",
help="Limit examples per task (integer count or fraction)",
)
data_group.add_argument(
"--samples",
"-E",
default=None,
type=try_parse_json,
metavar='{"task1": [0,1,2], ...} | PATH',
help=textwrap.dedent(
"Per-task sample indices, given as a JSON string or a path to a JSON file."
" Incompatible with --limit."
),
)
# Caching and Performance
cache_group = self._parser.add_argument_group("caching and performance")
cache_group.add_argument(
"--use_cache",
"-c",
type=str,
default=None,
metavar="CACHE_DIR",
help="SQLite database path for caching model outputs.",
)
cache_group.add_argument(
"--cache_requests",
type=request_caching_arg_to_dict,
default=None,
metavar="true|refresh|delete",
help="Cache dataset request building (true|refresh|delete)",
)
cache_group.add_argument(
"--check_integrity",
action="store_true",
default=argparse.SUPPRESS,
help="Run task test suite validation",
)
# Prompt Formatting
template_group = self._parser.add_argument_group("instruct formatting")
template_group.add_argument(
"--system_instruction",
type=str,
default=None,
metavar="INSTRUCTION",
help="Add custom system instruction.",
)
template_group.add_argument(
"--apply_chat_template",
type=str,
nargs="?",
const=True,
default=argparse.SUPPRESS,
metavar="TEMPLATE",
help="Apply chat template to prompts (optional template name)",
)
template_group.add_argument(
"--fewshot_as_multiturn",
action="store_true",
default=argparse.SUPPRESS,
help="Use fewshot examples as multi-turn conversation",
)
# Task Management
task_group = self._parser.add_argument_group("task management")
task_group.add_argument(
"--include_path",
type=str,
default=None,
metavar="TASK_DIR",
help="Additional directory for external tasks",
)
# Logging and Tracking
logging_group = self._parser.add_argument_group("logging and tracking")
logging_group.add_argument(
"--verbosity",
"-v",
type=str.upper,
default=None,
metavar="LEVEL",
help="(Deprecated) Log level. Use LOGLEVEL env var instead",
)
logging_group.add_argument(
"--write_out",
"-w",
action="store_true",
default=argparse.SUPPRESS,
help="Print prompts for first few documents",
)
logging_group.add_argument(
"--show_config",
action="store_true",
default=argparse.SUPPRESS,
help="Display full task configuration after evaluation",
)
logging_group.add_argument(
"--wandb_args",
type=key_val_to_dict,
default=argparse.SUPPRESS,
metavar="ARGS",
help="Weights & Biases init arguments key=val key2=val2",
)
logging_group.add_argument(
"--wandb_config_args",
type=key_val_to_dict,
default=argparse.SUPPRESS,
metavar="ARGS",
help="Weights & Biases config arguments key=val key2=val2",
)
logging_group.add_argument(
"--hf_hub_log_args",
type=key_val_to_dict,
default=argparse.SUPPRESS,
metavar="ARGS",
help="Hugging Face Hub logging arguments key=val key2=val2",
)
# Advanced Options
advanced_group = self._parser.add_argument_group("advanced options")
advanced_group.add_argument(
"--predict_only",
"-x",
action="store_true",
default=argparse.SUPPRESS,
help="Save predictions only, skip metric computation",
)
default_seed_string = "0,1234,1234,1234"
advanced_group.add_argument(
"--seed",
type=partial(_int_or_none_list_arg_type, 3, 4, default_seed_string),
default=None,
metavar="SEED|S1,S2,S3,S4",
help=textwrap.dedent(f"""
Random seeds for python,numpy,torch,fewshot (default: {default_seed_string}).
Use single integer for all, or comma-separated list of 4 values.
Use 'None' to skip setting a seed. Example: --seed 42 or --seed 0,None,8,52
""").strip(),
)
advanced_group.add_argument(
"--trust_remote_code",
action="store_true",
default=argparse.SUPPRESS,
help="Allow executing remote code from Hugging Face Hub",
)
advanced_group.add_argument(
"--confirm_run_unsafe_code",
action="store_true",
default=argparse.SUPPRESS,
help="Confirm understanding of unsafe code execution risks",
)
advanced_group.add_argument(
"--metadata",
type=json.loads,
default=None,
metavar='{"key": "value"}',
help=textwrap.dedent(
"JSON string with additional metadata (merged with model_args),"
" required for some tasks such as RULER"
),
)
@staticmethod
def _execute(args: argparse.Namespace) -> None:
"""Runs the evaluation harness with the provided arguments."""
os.environ["TOKENIZERS_PARALLELISM"] = "false"
MERGE_ARGS_DICTS = [
"model_args",
"gen_kwargs",
"wandb_args",
"wandb_config_args",
"hf_hub_log_args",
]
for arg_name in MERGE_ARGS_DICTS:
# nargs="*" options arrive as a list of dicts; merge them into a single dict
if (current_value := getattr(args, arg_name, None)) and isinstance(
current_value, list
):
setattr(args, arg_name, merge_dicts(*current_value))
from lm_eval.config.evaluate_config import EvaluatorConfig
eval_logger = logging.getLogger(__name__)
# Create and validate config (most validation now occurs in EvaluationConfig)
cfg = EvaluatorConfig.from_cli(args)
from lm_eval import simple_evaluate
from lm_eval.loggers import EvaluationTracker, WandbLogger
from lm_eval.utils import handle_non_serializable, make_table
# Set up logging
if cfg.wandb_args:
wandb_logger = WandbLogger(cfg.wandb_args, cfg.wandb_config_args)
# Set up evaluation tracker
if cfg.output_path:
cfg.hf_hub_log_args["output_path"] = cfg.output_path
if os.environ.get("HF_TOKEN", None):
cfg.hf_hub_log_args["token"] = os.environ.get("HF_TOKEN")
evaluation_tracker = EvaluationTracker(**cfg.hf_hub_log_args)
# Create task manager (metadata already set up in config validation)
task_manager = cfg.process_tasks(cfg.metadata)
# Validation warnings (keep these in CLI as they're logging-specific)
if "push_samples_to_hub" in cfg.hf_hub_log_args and not cfg.log_samples:
eval_logger.warning(
"Pushing samples to the Hub requires --log_samples to be set."
)
# Log task selection (tasks already processed in config)
if cfg.include_path is not None:
eval_logger.info(f"Including path: {cfg.include_path}")
eval_logger.info(f"Selected Tasks: {cfg.tasks}")
# Run evaluation
results = simple_evaluate(
model=cfg.model,
model_args=cfg.model_args,
tasks=cfg.tasks,
num_fewshot=cfg.num_fewshot,
batch_size=cfg.batch_size,
max_batch_size=cfg.max_batch_size,
device=cfg.device,
use_cache=cfg.use_cache,
cache_requests=cfg.cache_requests.get("cache_requests", False),
rewrite_requests_cache=cfg.cache_requests.get(
"rewrite_requests_cache", False
),
delete_requests_cache=cfg.cache_requests.get(
"delete_requests_cache", False
),
limit=cfg.limit,
samples=cfg.samples,
check_integrity=cfg.check_integrity,
write_out=cfg.write_out,
log_samples=cfg.log_samples,
evaluation_tracker=evaluation_tracker,
system_instruction=cfg.system_instruction,
apply_chat_template=cfg.apply_chat_template,
fewshot_as_multiturn=cfg.fewshot_as_multiturn,
gen_kwargs=cfg.gen_kwargs,
task_manager=task_manager,
verbosity=cfg.verbosity,
predict_only=cfg.predict_only,
random_seed=cfg.seed[0] if cfg.seed else None,
numpy_random_seed=cfg.seed[1] if cfg.seed else None,
torch_random_seed=cfg.seed[2] if cfg.seed else None,
fewshot_random_seed=cfg.seed[3] if cfg.seed else None,
confirm_run_unsafe_code=cfg.confirm_run_unsafe_code,
metadata=cfg.metadata,
)
# Process results
if results is not None:
if cfg.log_samples:
samples = results.pop("samples")
dumped = json.dumps(
results, indent=2, default=handle_non_serializable, ensure_ascii=False
)
if cfg.show_config:
print(dumped)
batch_sizes = ",".join(map(str, results["config"]["batch_sizes"]))
# W&B logging
if cfg.wandb_args:
try:
wandb_logger.post_init(results)
wandb_logger.log_eval_result()
if cfg.log_samples:
wandb_logger.log_eval_samples(samples)
except Exception as e:
eval_logger.info(f"Logging to W&B failed: {e}")
# Save results
evaluation_tracker.save_results_aggregated(
results=results, samples=samples if cfg.log_samples else None
)
if cfg.log_samples:
for task_name, _ in results["configs"].items():
evaluation_tracker.save_results_samples(
task_name=task_name, samples=samples[task_name]
)
if (
evaluation_tracker.push_results_to_hub
or evaluation_tracker.push_samples_to_hub
):
evaluation_tracker.recreate_metadata_card()
# Print results
cfg.model_args.pop("trust_remote_code", None)
print(
f"{cfg.model} ({cfg.model_args}), gen_kwargs: ({cfg.gen_kwargs}), "
f"limit: {cfg.limit}, num_fewshot: {cfg.num_fewshot}, "
f"batch_size: {cfg.batch_size}{f' ({batch_sizes})' if batch_sizes else ''}"
)
print(make_table(results))
if "groups" in results:
print(make_table(results, "groups"))
if cfg.wandb_args:
wandb_logger.run.finish()
import argparse
from abc import ABC, abstractmethod
class SubCommand(ABC):
"""Base class for all subcommands."""
def __init__(self, *args, **kwargs):
pass
@classmethod
def create(cls, subparsers: argparse._SubParsersAction):
"""Factory method to create and register a command instance."""
return cls(subparsers)
@abstractmethod
def _add_args(self) -> None:
"""Add arguments specific to this subcommand."""
pass
import argparse
import ast
import json
import logging
from typing import Any, Optional, Union
def try_parse_json(value: Union[str, dict, None]) -> Union[str, dict, None]:
"""Try to parse a string as JSON. If it fails, return the original string."""
if value is None:
return None
if isinstance(value, dict):
return value
try:
return json.loads(value)
except json.JSONDecodeError:
if "{" in value:
raise ValueError(
f"Invalid JSON: {value}. Hint: Use double quotes for JSON strings."
)
return value
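# Illustrative behavior:
#   try_parse_json('{"hellaswag": [0, 1, 2]}') -> {"hellaswag": [0, 1, 2]}
#   try_parse_json("samples.json")             -> "samples.json" (left as a path string)
#   try_parse_json('{"bad json"')              -> ValueError (hint: use double quotes)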
def _int_or_none_list_arg_type(
min_len: int, max_len: int, defaults: str, value: str, split_char: str = ","
) -> list[Union[int, None]]:
"""Parses a string of integers or 'None' values separated by a specified character into a list.
Validates the number of items against specified minimum and maximum lengths and fills missing values with defaults."""
def parse_value(item):
"""Parses an individual item, converting it to an integer or `None`."""
item = item.strip().lower()
if item == "none":
return None
try:
return int(item)
except ValueError:
raise ValueError(f"{item} is not an integer or None")
items = [parse_value(v) for v in value.split(split_char)]
num_items = len(items)
if num_items == 1:
items = items * max_len
elif num_items < min_len or num_items > max_len:
raise ValueError(
f"Argument requires between {min_len} and {max_len} integers or 'None', separated by '{split_char}'"
)
elif num_items != max_len:
logging.warning(
f"Argument requires {max_len} integers or 'None', separated by '{split_char}'. "
"Missing values will be filled with defaults."
)
default_items = [parse_value(v) for v in defaults.split(split_char)]
items.extend(default_items[num_items:])
return items
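# Illustrative behavior with min_len=3, max_len=4, defaults="0,1234,1234,1234":
#   "42"          -> [42, 42, 42, 42]    (single value broadcast to all seeds)
#   "0,None,8"    -> [0, None, 8, 1234]  (missing fourth value filled from defaults)
#   "0,None,8,52" -> [0, None, 8, 52]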
def request_caching_arg_to_dict(cache_requests: Optional[str]) -> dict[str, bool]:
"""Convert a request caching argument to a dictionary."""
if cache_requests is None:
return {}
request_caching_args = {
"cache_requests": cache_requests in {"true", "refresh"},
"rewrite_requests_cache": cache_requests == "refresh",
"delete_requests_cache": cache_requests == "delete",
}
return request_caching_args
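# Illustrative mapping:
#   "true"    -> {"cache_requests": True,  "rewrite_requests_cache": False, "delete_requests_cache": False}
#   "refresh" -> {"cache_requests": True,  "rewrite_requests_cache": True,  "delete_requests_cache": False}
#   "delete"  -> {"cache_requests": False, "rewrite_requests_cache": False, "delete_requests_cache": True}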
def check_argument_types(parser: argparse.ArgumentParser) -> None:
"""
Check to make sure all CLI args are typed, raises error if not
"""
for action in parser._actions:
# Skip help, subcommands, and const actions
if action.dest in ["help", "command"] or action.const is not None:
continue
if action.type is None:
raise ValueError(f"Argument '{action.dest}' doesn't have a type specified.")
else:
continue
def handle_cli_value_string(arg: str) -> Any:
if arg.lower() == "true":
return True
elif arg.lower() == "false":
return False
elif arg.isnumeric():
return int(arg)
try:
return float(arg)
except ValueError:
try:
return ast.literal_eval(arg)
except (ValueError, SyntaxError):
return arg
def key_val_to_dict(args: str) -> dict:
"""Parse model arguments from a string into a dictionary."""
return (
{
k: handle_cli_value_string(v)
for k, v in (item.split("=") for item in args.split(","))
}
if args
else {}
)
def merge_dicts(*dicts):
return {k: v for d in dicts for k, v in d.items()}
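# Illustrative behavior:
#   key_val_to_dict("pretrained=gpt2,dtype=float32,trust_remote_code=True")
#     -> {"pretrained": "gpt2", "dtype": "float32", "trust_remote_code": True}
#   merge_dicts({"a": 1}, {"b": 2}, {"a": 3}) -> {"a": 3, "b": 2}  (later values win)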
import argparse
import sys
import textwrap
from lm_eval._cli.subcommand import SubCommand
class Validate(SubCommand):
"""Command for validating tasks."""
def __init__(self, subparsers: argparse._SubParsersAction, *args, **kwargs):
# Create and configure the self._parser
super().__init__(*args, **kwargs)
self._parser = subparsers.add_parser(
"validate",
help="Validate task configurations",
description="Validate task configurations and check for errors.",
usage="lm-eval validate --tasks <task1,task2> [--include_path DIR]",
epilog=textwrap.dedent("""
examples:
# Validate a single task
lm-eval validate --tasks hellaswag
# Validate multiple tasks
lm-eval validate --tasks arc_easy,arc_challenge,hellaswag
# Validate a task group
lm-eval validate --tasks mmlu
# Validate tasks with external definitions
lm-eval validate --tasks my_custom_task --include_path ./custom_tasks
# Validate tasks from multiple external paths
lm-eval validate --tasks custom_task1,custom_task2 --include_path "/path/to/tasks1:/path/to/tasks2"
validation checks:
The validate command performs several checks:
• Task existence: Verifies all specified tasks are available
• Configuration syntax: Checks YAML/JSON configuration files
• Dataset access: Validates dataset paths and configurations
• Required fields: Ensures all mandatory task parameters are present
• Metric definitions: Verifies metric functions and aggregation methods
• Filter pipelines: Validates filter chains and their parameters
• Template rendering: Tests prompt templates with sample data
task config files:
Tasks are defined using YAML configuration files with these key sections:
• task: Task name and metadata
• dataset_path: HuggingFace dataset identifier
• doc_to_text: Template for converting documents to prompts
• doc_to_target: Template for extracting target answers
• metric_list: List of evaluation metrics to compute
• output_type: Type of model output (loglikelihood, generate_until, etc.)
• filter_list: Post-processing filters for model outputs
common errors:
• Missing required fields in YAML configuration
• Invalid dataset paths or missing dataset splits
• Malformed Jinja2 templates in doc_to_text/doc_to_target
• Undefined metrics or aggregation functions
• Invalid filter names or parameters
• Circular dependencies in task inheritance
• Missing external task files when using --include_path
debugging tips:
• Use --include_path to test external task definitions
• Check task configuration files for syntax errors
• Verify dataset access and authentication if needed
• Use 'lm-eval ls tasks' to see available tasks
For task configuration guide, see: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/task_guide.md
"""),
formatter_class=argparse.RawDescriptionHelpFormatter,
)
self._add_args()
self._parser.set_defaults(func=self._execute)
def _add_args(self) -> None:
self._parser.add_argument(
"--tasks",
"-t",
required=True,
type=str,
metavar="TASK1,TASK2",
help="Comma-separated list of task names to validate",
)
self._parser.add_argument(
"--include_path",
type=str,
default=None,
metavar="DIR",
help="Additional path to include if there are external tasks.",
)
def _execute(self, args: argparse.Namespace) -> None:
"""Execute the validate command."""
from lm_eval.tasks import TaskManager
task_manager = TaskManager(include_path=args.include_path)
task_list = args.tasks.split(",")
print(f"Validating tasks: {task_list}")
# For now, just validate that tasks exist
task_names = task_manager.match_tasks(task_list)
task_missing = [task for task in task_list if task not in task_names]
if task_missing:
missing = ", ".join(task_missing)
print(f"Tasks not found: {missing}")
sys.exit(1)
else:
print("All tasks found and valid")
from .evaluate_config import EvaluatorConfig
__all__ = [
"EvaluatorConfig",
]
import json
import logging
import textwrap
from argparse import Namespace
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import TYPE_CHECKING, Any, Dict, Optional, Union
import yaml
from lm_eval.utils import simple_parse_args_string
if TYPE_CHECKING:
from lm_eval.tasks import TaskManager
eval_logger = logging.getLogger(__name__)
DICT_KEYS = [
"wandb_args",
"wandb_config_args",
"hf_hub_log_args",
"metadata",
"model_args",
"gen_kwargs",
]
@dataclass
class EvaluatorConfig:
"""Configuration for language model evaluation runs.
This dataclass contains all parameters for configuring model evaluations via
`simple_evaluate()` or the CLI. It supports initialization from:
- CLI arguments (via `from_cli()`)
- YAML configuration files (via `from_config()`)
- Direct instantiation with keyword arguments
The configuration handles argument parsing, validation, and preprocessing
to ensure the configuration is properly structured and validated.
Example:
# From CLI arguments
config = EvaluatorConfig.from_cli(args)
# From YAML file
config = EvaluatorConfig.from_config("eval_config.yaml")
# Direct instantiation
config = EvaluatorConfig(
model="hf",
model_args={"pretrained": "gpt2"},
tasks=["hellaswag", "arc_easy"],
num_fewshot=5
)
See individual field documentation for detailed parameter descriptions.
"""
# Core evaluation parameters
config: Optional[str] = field(
default=None, metadata={"help": "Path to YAML config file"}
)
model: str = field(default="hf", metadata={"help": "Name of model e.g. 'hf'"})
model_args: dict = field(
default_factory=dict, metadata={"help": "Arguments for model initialization"}
)
tasks: list[str] = field(
default_factory=list,
metadata={"help": "List of task names to evaluate"},
)
# Few-shot and batching
num_fewshot: Optional[int] = field(
default=None, metadata={"help": "Number of examples in few-shot context"}
)
batch_size: int = field(default=1, metadata={"help": "Batch size for evaluation"})
max_batch_size: Optional[int] = field(
default=None, metadata={"help": "Maximum batch size for auto batching"}
)
# Device
device: Optional[str] = field(
default="cuda:0", metadata={"help": "Device to use (e.g. cuda, cuda:0, cpu)"}
)
# Data sampling and limiting
limit: Optional[float] = field(
default=None, metadata={"help": "Limit number of examples per task"}
)
samples: Union[str, dict, None] = field(
default=None,
metadata={"help": "dict, JSON string or path to JSON file with doc indices"},
)
# Caching
use_cache: Optional[str] = field(
default=None,
metadata={"help": "Path to sqlite db file for caching model outputs"},
)
cache_requests: dict = field(
default_factory=dict,
metadata={"help": "Cache dataset requests: true/refresh/delete"},
)
# Output and logging flags
check_integrity: bool = field(
default=False, metadata={"help": "Run test suite for tasks"}
)
write_out: bool = field(
default=False, metadata={"help": "Print prompts for first few documents"}
)
log_samples: bool = field(
default=False, metadata={"help": "Save model outputs and inputs"}
)
output_path: Optional[str] = field(
default=None, metadata={"help": "Dir path where result metrics will be saved"}
)
predict_only: bool = field(
default=False,
metadata={
"help": "Only save model outputs, don't evaluate metrics. Use with log_samples."
},
)
# Chat and instruction handling
system_instruction: Optional[str] = field(
default=None, metadata={"help": "Custom System instruction to add"}
)
apply_chat_template: Union[bool, str] = field(
default=False,
metadata={
"help": "Apply chat template to prompt. Either True, or a string identifying the tokenizer template."
},
)
fewshot_as_multiturn: bool = field(
default=False,
metadata={
"help": "Use fewshot as multi-turn conversation. Requires apply_chat_template=True."
},
)
# Configuration display
show_config: bool = field(
default=False, metadata={"help": "Show full config at end of evaluation"}
)
# External tasks and generation
include_path: Optional[str] = field(
default=None, metadata={"help": "Additional dir path for external tasks"}
)
gen_kwargs: Optional[dict] = field(
default=None, metadata={"help": "Arguments for model generation"}
)
# Logging and verbosity
verbosity: Optional[str] = field(
default=None, metadata={"help": "Logging verbosity level"}
)
# External integrations
wandb_args: dict = field(
default_factory=dict, metadata={"help": "Arguments for wandb.init"}
)
wandb_config_args: dict = field(
default_factory=dict, metadata={"help": "Arguments for wandb.config.update"}
)
hf_hub_log_args: dict = field(
default_factory=dict, metadata={"help": "Arguments for HF Hub logging"}
)
# Reproducibility
seed: list = field(
default_factory=lambda: [0, 1234, 1234, 1234],
metadata={"help": "Seeds for random, numpy, torch, fewshot (random)"},
)
# Security
trust_remote_code: bool = field(
default=False, metadata={"help": "Trust remote code for HF datasets"}
)
confirm_run_unsafe_code: bool = field(
default=False,
metadata={
"help": "Confirm understanding of unsafe code risks (for code tasks that executes arbitrary Python)"
},
)
# Internal metadata
metadata: dict = field(
default_factory=dict,
metadata={"help": "Additional metadata for tasks that require it"},
)
@classmethod
def from_cli(cls, namespace: Namespace) -> "EvaluatorConfig":
"""
Build an EvaluatorConfig by merging with simple precedence:
CLI args > YAML config > built-in defaults
"""
# Start with built-in defaults
config = asdict(cls())
# Load and merge YAML config if provided
if used_config := hasattr(namespace, "config") and namespace.config:
config.update(cls.load_yaml_config(namespace.config))
# Override with CLI args (only truthy values, exclude non-config args)
excluded_args = {"command", "func"} # argparse internal args
cli_args = {
k: v for k, v in vars(namespace).items() if v and k not in excluded_args
}
config.update(cli_args)
# Parse string arguments that should be dictionaries
config = cls._parse_dict_args(config)
# Create instance and validate
instance = cls(**config)
instance.configure()
if used_config:
print(textwrap.dedent(f"""{instance}"""))
return instance
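# Illustrative precedence (CLI > YAML > defaults): if the YAML config sets
# batch_size: 8 but the command line passes --batch_size 16, the resulting
# EvaluatorConfig has batch_size == 16; fields set in neither place keep their defaults.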
@classmethod
def from_config(cls, config_path: Union[str, Path]) -> "EvaluatorConfig":
"""
Build an EvaluatorConfig from a YAML config file.
Merges with built-in defaults and validates.
"""
# Load YAML config
yaml_config = cls.load_yaml_config(config_path)
# Parse string arguments that should be dictionaries
yaml_config = cls._parse_dict_args(yaml_config)
instance = cls(**yaml_config)
instance.configure()
return instance
@staticmethod
def _parse_dict_args(config: Dict[str, Any]) -> Dict[str, Any]:
"""Parse string arguments that should be dictionaries."""
for key in config:
if key in DICT_KEYS and isinstance(config[key], str):
config[key] = simple_parse_args_string(config[key])
return config
@staticmethod
def load_yaml_config(config_path: Union[str, Path]) -> Dict[str, Any]:
"""Load and validate YAML config file."""
config_file = (
Path(config_path) if not isinstance(config_path, Path) else config_path
)
if not config_file.is_file():
raise FileNotFoundError(f"Config file not found: {config_path}")
try:
yaml_data = yaml.safe_load(config_file.read_text())
except yaml.YAMLError as e:
raise ValueError(f"Invalid YAML in {config_path}: {e}")
except (OSError, UnicodeDecodeError) as e:
raise ValueError(f"Could not read config file {config_path}: {e}")
if not isinstance(yaml_data, dict):
raise ValueError(
f"YAML root must be a mapping, got {type(yaml_data).__name__}"
)
return yaml_data
def configure(self) -> None:
"""Validate configuration and preprocess fields after creation."""
self._validate_arguments()
self._process_arguments()
self._set_trust_remote_code()
def _validate_arguments(self) -> None:
"""Validate configuration arguments and cross-field constraints."""
if self.limit:
eval_logger.warning(
"--limit SHOULD ONLY BE USED FOR TESTING. "
"REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT."
)
# predict_only implies log_samples
if self.predict_only:
self.log_samples = True
# log_samples or predict_only requires output_path
if (self.log_samples or self.predict_only) and not self.output_path:
raise ValueError(
"Specify --output_path if providing --log_samples or --predict_only"
)
# fewshot_as_multiturn requires apply_chat_template
if self.fewshot_as_multiturn and self.apply_chat_template is False:
raise ValueError(
"When `fewshot_as_multiturn` is selected, `apply_chat_template` must be set."
)
# samples and limit are mutually exclusive
if self.samples and self.limit is not None:
raise ValueError("If --samples is not None, then --limit must be None.")
# tasks is required
if not self.tasks:
raise ValueError("Need to specify task to evaluate.")
def _process_arguments(self) -> None:
"""Process samples argument - load from file if needed."""
if self.samples:
if isinstance(self.samples, str):
try:
self.samples = json.loads(self.samples)
except json.JSONDecodeError:
if (samples_path := Path(self.samples)).is_file():
self.samples = json.loads(samples_path.read_text())
# Set up metadata by merging model_args and metadata.
if self.model_args is None:
self.model_args = {}
if self.metadata is None:
self.metadata = {}
self.metadata = self.model_args | self.metadata
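# Illustrative merge: model_args={"pretrained": "gpt2"} and metadata={"max_seq_lengths": [4096]}
# yield metadata == {"pretrained": "gpt2", "max_seq_lengths": [4096]}; on duplicate keys
# the explicitly passed metadata value takes precedence over model_args.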
def process_tasks(self, metadata: Optional[dict] = None) -> "TaskManager":
"""Process and validate tasks, return resolved task names."""
from lm_eval import utils
from lm_eval.tasks import TaskManager
# if metadata manually passed use that:
self.metadata = metadata if metadata else self.metadata
# Create task manager with metadata
task_manager = TaskManager(
include_path=self.include_path,
metadata=self.metadata if self.metadata else {},
)
task_names = task_manager.match_tasks(self.tasks)
# Check for any individual task files in the list
for task in [task for task in self.tasks if task not in task_names]:
task_path = Path(task)
if task_path.is_file():
config = utils.load_yaml_config(str(task_path))
task_names.append(config)
# Check for missing tasks
task_missing = [
task for task in self.tasks if task not in task_names and "*" not in task
]
if task_missing:
missing = ", ".join(task_missing)
raise ValueError(f"Tasks not found: {missing}")
# Update tasks with resolved names
self.tasks = task_names
return task_manager
def _set_trust_remote_code(self) -> None:
"""Apply trust_remote_code setting if enabled."""
if self.trust_remote_code:
# HACK: import datasets and override its HF_DATASETS_TRUST_REMOTE_CODE value internally,
# because it's already been determined based on the prior env var before launching our
# script--`datasets` gets imported by lm_eval internally before these lines can update the env.
import datasets
datasets.config.HF_DATASETS_TRUST_REMOTE_CODE = True
# Add to model_args for the actual model initialization
if self.model_args is None:
self.model_args = {}
self.model_args["trust_remote_code"] = True
......@@ -31,11 +31,11 @@ from lm_eval.loggers import EvaluationTracker
from lm_eval.loggers.utils import add_env_info, add_tokenizer_info, get_git_commit_hash
from lm_eval.tasks import TaskManager, get_task_dict
from lm_eval.utils import (
get_logger,
handle_non_serializable,
hash_dict_images,
hash_string,
positional_deprecated,
simple_parse_args_string,
wrap_text,
)
......@@ -149,7 +149,7 @@ def simple_evaluate(
Dictionary of results
"""
if verbosity is not None:
get_logger(verbosity)
start_date = time.time()
if limit is not None and samples is not None:
......@@ -372,8 +372,6 @@ def simple_evaluate(
verbosity=verbosity,
confirm_run_unsafe_code=confirm_run_unsafe_code,
)
if lm.rank == 0:
if isinstance(model, str):
......@@ -477,7 +475,9 @@ def evaluate(
"Either 'limit' or 'samples' must be None, but both are not None."
)
if samples is not None:
eval_logger.info(f"Evaluating examples for tasks {list(samples.keys())}")
eval_logger.info(
f"Evaluating examples for tasks {[x for x in list(samples.keys()) if x in task_dict]}"
)
if apply_chat_template:
eval_logger.warning(
"Chat template formatting change affects loglikelihood and multiple-choice tasks. See docs/chat-template-readme.md for details."
......@@ -770,13 +770,3 @@ def evaluate(
else:
return None
......@@ -30,7 +30,7 @@ class TaskManager:
metadata: Optional[dict] = None,
) -> None:
if verbosity is not None:
utils.get_logger(verbosity)
self.include_path = include_path
self.metadata = metadata
self._task_index = self.initialize_tasks(
......
......@@ -46,8 +46,75 @@ def wrap_text(string: str, width: int = 140, **kwargs) -> str | None:
)
def get_logger(level: str | None = None) -> logging.Logger:
"""
Get a logger with a stream handler that captures all lm_eval logs.
Args:
level (Optional[str]): The logging level.
Example:
>>> logger = get_logger("INFO")
>>> logger.info("Log this!")
INFO:lm_eval:Log this!
Returns:
logging.Logger: The logger.
"""
logger = logging.getLogger("lm_eval")
if not logger.hasHandlers():
logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.INFO)
if level is not None:
level = getattr(logging, level.upper())
logger.setLevel(level)
return logger
def setup_logging(verbosity=logging.INFO, suppress_third_party=True):
"""
Configure logging for the lm_eval CLI application.
WARNING: This function is intended for CLI use only. Library users should
use get_logger() instead to avoid interfering with their application's
logging configuration.
Args:
verbosity: Log level (int) or string name. Can be overridden by LOGLEVEL env var.
suppress_third_party: Whether to suppress verbose third-party library logs.
Returns:
logging.Logger: The configured lm_eval logger instance.
"""
# Validate verbosity parameter
if isinstance(verbosity, str):
level_map = {
"DEBUG": logging.DEBUG,
"INFO": logging.INFO,
"WARNING": logging.WARNING,
"ERROR": logging.ERROR,
"CRITICAL": logging.CRITICAL,
}
verbosity = level_map.get(verbosity.upper(), logging.INFO)
elif not isinstance(verbosity, int):
verbosity = logging.INFO
# Get log level from environment or use default
if log_level_env := os.environ.get("LOGLEVEL", None):
level_map = {
"DEBUG": logging.DEBUG,
"INFO": logging.INFO,
"WARNING": logging.WARNING,
"ERROR": logging.ERROR,
"CRITICAL": logging.CRITICAL,
}
log_level = level_map.get(log_level_env.upper(), verbosity)
else:
log_level = verbosity
# Get the lm_eval logger directly
logger = logging.getLogger("lm_eval")
# Configure custom formatter
class CustomFormatter(logging.Formatter):
def format(self, record):
            record.name = record.name.removeprefix("lm_eval.")
......@@ -58,32 +125,27 @@ def setup_logging(verbosity=logging.INFO):
datefmt="%Y-%m-%d:%H:%M:%S",
)
log_level = os.environ.get("LOGLEVEL", verbosity) or verbosity
level_map = {
"DEBUG": logging.DEBUG,
"INFO": logging.INFO,
"WARNING": logging.WARNING,
"ERROR": logging.ERROR,
"CRITICAL": logging.CRITICAL,
}
log_level = level_map.get(str(log_level).upper(), logging.INFO)
if not logging.root.handlers:
# Check if handler already exists to prevent duplicates
has_stream_handler = any(
isinstance(h, logging.StreamHandler) for h in logger.handlers
)
if not has_stream_handler:
handler = logging.StreamHandler()
handler.setFormatter(formatter)
logger.addHandler(handler)
# For CLI use, we disable propagation to avoid duplicate messages
logger.propagate = False
root_logger = logging.getLogger()
root_logger.addHandler(handler)
root_logger.setLevel(log_level)
# Set the logger level
logger.setLevel(log_level)
if log_level == logging.DEBUG:
third_party_loggers = ["urllib3", "filelock", "fsspec"]
for logger_name in third_party_loggers:
logging.getLogger(logger_name).setLevel(logging.INFO)
else:
logging.getLogger().setLevel(log_level)
# Optionally suppress verbose third-party library logs
if suppress_third_party and log_level == logging.DEBUG:
third_party_loggers = ["urllib3", "filelock", "fsspec"]
for logger_name in third_party_loggers:
logging.getLogger(logger_name).setLevel(logging.INFO)
return logger
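The two functions above split responsibilities: `get_logger` is the library-facing helper that only touches the `lm_eval` logger, while `setup_logging` attaches a formatted handler and honours the `LOGLEVEL` environment variable for the CLI. A minimal usage sketch of that split (not part of the diff; it only uses the names and behaviour shown above) might look like:

```python
import logging
import os

from lm_eval.utils import get_logger, setup_logging

# Library users: fetch the "lm_eval" logger and set a level without touching
# the root logger or any application-wide handlers.
lib_logger = get_logger("DEBUG")
lib_logger.debug("library-side message")

# CLI entry points call setup_logging, which attaches a formatted stream
# handler (unless one already exists) and lets the LOGLEVEL environment
# variable take precedence over the `verbosity` argument.
os.environ["LOGLEVEL"] = "WARNING"
cli_logger = setup_logging(verbosity=logging.INFO, suppress_third_party=True)
cli_logger.warning("cli-side message")
```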
def hash_string(string: str) -> str:
......@@ -578,6 +640,7 @@ def create_iterator(
return islice(raw_iterator, rank, limit, world_size)
# TODO: why func for metric calc is here in eval utils?
def weighted_f1_score(items):
from sklearn.metrics import f1_score
......
# Language Model Evaluation Harness Configuration File
#
# This YAML configuration file allows you to specify evaluation parameters
# instead of passing them as command-line arguments.
#
# Usage:
# $ lm_eval --config templates/example_ci_config.yaml
#
# You can override any values in this config with further command-line arguments:
# $ lm_eval --config templates/example_ci_config.yaml --model_args pretrained=gpt2 --tasks mmlu
#
# For expected types and values, refer to EvaluatorConfig in lm_eval/config/evaluate_config.py
# All parameters are optional and have the same meaning as their CLI counterparts.
model: hf
model_args:
pretrained: EleutherAI/pythia-14m
dtype: float16
tasks:
- hellaswag
- arc_easy
batch_size: 1
trust_remote_code: true
log_samples: true
output_path: ./test
gen_kwargs:
do_sample: true
temperature: 0.7
stop: ["\n", "<|endoftext|>"]
samples:
hellaswag: [1,2,3,4,5,6,7,8,9,10]
arc_easy: [10,20,30,40,50,60,70,80,90,100]
metadata:
name: Example CI Config
description: This is an example configuration file for testing purposes.
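As the header comments note, values in this file act as defaults and explicit command-line flags win. A rough illustration of that precedence (plain PyYAML plus a dict merge; this is an assumption-laden sketch, not the library's actual `EvaluatorConfig` loader):

```python
# Illustration only: mimics "config file provides defaults, CLI flags override".
import yaml  # requires PyYAML

with open("templates/example_ci_config.yaml") as f:
    file_config = yaml.safe_load(f)

# Hypothetical overrides corresponding to the usage comment at the top:
#   lm_eval --config templates/example_ci_config.yaml --model_args pretrained=gpt2 --tasks mmlu
cli_overrides = {"model_args": "pretrained=gpt2", "tasks": ["mmlu"]}

effective = {**file_config, **cli_overrides}
print(effective["model"])       # hf               (from the config file)
print(effective["model_args"])  # pretrained=gpt2  (overridden on the CLI)
print(effective["tasks"])       # ['mmlu']         (overridden on the CLI)
```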
import argparse
import sys
from unittest.mock import MagicMock, patch
import pytest
from lm_eval._cli.harness import HarnessCLI
from lm_eval._cli.ls import List
from lm_eval._cli.run import Run
from lm_eval._cli.utils import (
_int_or_none_list_arg_type,
check_argument_types,
request_caching_arg_to_dict,
try_parse_json,
)
from lm_eval._cli.validate import Validate
class TestHarnessCLI:
"""Test the main HarnessCLI class."""
def test_harness_cli_init(self):
"""Test HarnessCLI initialization."""
cli = HarnessCLI()
assert cli._parser is not None
assert cli._subparsers is not None
def test_harness_cli_has_subcommands(self):
"""Test that HarnessCLI has all expected subcommands."""
cli = HarnessCLI()
subcommands = cli._subparsers.choices
assert "run" in subcommands
assert "ls" in subcommands
assert "validate" in subcommands
def test_harness_cli_backward_compatibility(self):
"""Test backward compatibility: inserting 'run' when no subcommand is provided."""
cli = HarnessCLI()
test_args = ["lm-eval", "--model", "hf", "--tasks", "hellaswag"]
with patch.object(sys, "argv", test_args):
args = cli.parse_args()
assert args.command == "run"
assert args.model == "hf"
assert args.tasks == "hellaswag"
def test_harness_cli_help_default(self):
"""Test that help is printed when no arguments are provided."""
cli = HarnessCLI()
with patch.object(sys, "argv", ["lm-eval"]):
args = cli.parse_args()
            # args.func is a lambda wrapping the parser's print_help; verify it is invoked
with patch.object(cli._parser, "print_help") as mock_help:
args.func(args)
mock_help.assert_called_once()
def test_harness_cli_run_help_only(self):
"""Test that 'lm-eval run' shows help."""
cli = HarnessCLI()
with patch.object(sys, "argv", ["lm-eval", "run"]):
with pytest.raises(SystemExit):
cli.parse_args()
class TestListCommand:
"""Test the List subcommand."""
def test_list_command_creation(self):
"""Test List command creation."""
parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers()
list_cmd = List.create(subparsers)
assert isinstance(list_cmd, List)
def test_list_command_arguments(self):
"""Test List command arguments."""
parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers()
List.create(subparsers)
# Test valid arguments
args = parser.parse_args(["ls", "tasks"])
assert args.what == "tasks"
assert args.include_path is None
args = parser.parse_args(["ls", "groups", "--include_path", "/path/to/tasks"])
assert args.what == "groups"
assert args.include_path == "/path/to/tasks"
def test_list_command_choices(self):
"""Test List command only accepts valid choices."""
parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers()
List.create(subparsers)
# Valid choices should work
for choice in ["tasks", "groups", "subtasks", "tags"]:
args = parser.parse_args(["ls", choice])
assert args.what == choice
# Invalid choice should fail
with pytest.raises(SystemExit):
parser.parse_args(["ls", "invalid"])
@patch("lm_eval.tasks.TaskManager")
def test_list_command_execute_tasks(self, mock_task_manager):
"""Test List command execution for tasks."""
parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers()
list_cmd = List.create(subparsers)
mock_tm_instance = MagicMock()
mock_tm_instance.list_all_tasks.return_value = "task1\ntask2\ntask3"
mock_task_manager.return_value = mock_tm_instance
args = parser.parse_args(["ls", "tasks"])
with patch("builtins.print") as mock_print:
list_cmd._execute(args)
mock_print.assert_called_once_with("task1\ntask2\ntask3")
mock_tm_instance.list_all_tasks.assert_called_once_with()
@patch("lm_eval.tasks.TaskManager")
def test_list_command_execute_groups(self, mock_task_manager):
"""Test List command execution for groups."""
parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers()
list_cmd = List.create(subparsers)
mock_tm_instance = MagicMock()
mock_tm_instance.list_all_tasks.return_value = "group1\ngroup2"
mock_task_manager.return_value = mock_tm_instance
args = parser.parse_args(["ls", "groups"])
with patch("builtins.print") as mock_print:
list_cmd._execute(args)
mock_print.assert_called_once_with("group1\ngroup2")
mock_tm_instance.list_all_tasks.assert_called_once_with(
list_subtasks=False, list_tags=False
)
class TestRunCommand:
"""Test the Run subcommand."""
def test_run_command_creation(self):
"""Test Run command creation."""
parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers()
run_cmd = Run.create(subparsers)
assert isinstance(run_cmd, Run)
def test_run_command_basic_arguments(self):
"""Test Run command basic arguments."""
parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers()
Run.create(subparsers)
args = parser.parse_args(
["run", "--model", "hf", "--tasks", "hellaswag,arc_easy"]
)
assert args.model == "hf"
assert args.tasks == "hellaswag,arc_easy"
def test_run_command_model_args(self):
"""Test Run command model arguments parsing."""
parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers()
Run.create(subparsers)
# Test key=value format
args = parser.parse_args(["run", "--model_args", "pretrained=gpt2,device=cuda"])
assert args.model_args == "pretrained=gpt2,device=cuda"
# Test JSON format
args = parser.parse_args(
["run", "--model_args", '{"pretrained": "gpt2", "device": "cuda"}']
)
assert args.model_args == {"pretrained": "gpt2", "device": "cuda"}
def test_run_command_batch_size(self):
"""Test Run command batch size arguments."""
parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers()
Run.create(subparsers)
# Test integer batch size
args = parser.parse_args(["run", "--batch_size", "32"])
assert args.batch_size == "32"
# Test auto batch size
args = parser.parse_args(["run", "--batch_size", "auto"])
assert args.batch_size == "auto"
# Test auto with repetitions
args = parser.parse_args(["run", "--batch_size", "auto:5"])
assert args.batch_size == "auto:5"
def test_run_command_seed_parsing(self):
"""Test Run command seed parsing."""
parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers()
Run.create(subparsers)
# Test single seed
args = parser.parse_args(["run", "--seed", "42"])
assert args.seed == [42, 42, 42, 42]
# Test multiple seeds
args = parser.parse_args(["run", "--seed", "0,1234,5678,9999"])
assert args.seed == [0, 1234, 5678, 9999]
# Test with None values
args = parser.parse_args(["run", "--seed", "0,None,1234,None"])
assert args.seed == [0, None, 1234, None]
@patch("lm_eval.simple_evaluate")
@patch("lm_eval.config.evaluate_config.EvaluatorConfig")
@patch("lm_eval.loggers.EvaluationTracker")
@patch("lm_eval.utils.make_table")
def test_run_command_execute_basic(
self, mock_make_table, mock_tracker, mock_config, mock_simple_evaluate
):
"""Test Run command basic execution."""
parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers()
run_cmd = Run.create(subparsers)
# Mock configuration
mock_cfg_instance = MagicMock()
mock_cfg_instance.wandb_args = None
mock_cfg_instance.output_path = None
mock_cfg_instance.hf_hub_log_args = {}
mock_cfg_instance.include_path = None
mock_cfg_instance.tasks = ["hellaswag"]
mock_cfg_instance.model = "hf"
mock_cfg_instance.model_args = {"pretrained": "gpt2"}
mock_cfg_instance.gen_kwargs = {}
mock_cfg_instance.limit = None
mock_cfg_instance.num_fewshot = 0
mock_cfg_instance.batch_size = 1
mock_cfg_instance.log_samples = False
mock_cfg_instance.process_tasks.return_value = MagicMock()
mock_config.from_cli.return_value = mock_cfg_instance
# Mock evaluation results
mock_simple_evaluate.return_value = {
"results": {"hellaswag": {"acc": 0.75}},
"config": {"batch_sizes": [1]},
"configs": {"hellaswag": {}},
"versions": {"hellaswag": "1.0"},
"n-shot": {"hellaswag": 0},
}
# Mock make_table to avoid complex table rendering
mock_make_table.return_value = (
"| Task | Result |\n|------|--------|\n| hellaswag | 0.75 |"
)
args = parser.parse_args(["run", "--model", "hf", "--tasks", "hellaswag"])
with patch("builtins.print"):
run_cmd._execute(args)
mock_config.from_cli.assert_called_once()
mock_simple_evaluate.assert_called_once()
mock_make_table.assert_called_once()
class TestValidateCommand:
"""Test the Validate subcommand."""
def test_validate_command_creation(self):
"""Test Validate command creation."""
parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers()
validate_cmd = Validate.create(subparsers)
assert isinstance(validate_cmd, Validate)
def test_validate_command_arguments(self):
"""Test Validate command arguments."""
parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers()
Validate.create(subparsers)
args = parser.parse_args(["validate", "--tasks", "hellaswag,arc_easy"])
assert args.tasks == "hellaswag,arc_easy"
assert args.include_path is None
args = parser.parse_args(
["validate", "--tasks", "custom_task", "--include_path", "/path/to/tasks"]
)
assert args.tasks == "custom_task"
assert args.include_path == "/path/to/tasks"
def test_validate_command_requires_tasks(self):
"""Test Validate command requires tasks argument."""
parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers()
Validate.create(subparsers)
with pytest.raises(SystemExit):
parser.parse_args(["validate"])
@patch("lm_eval.tasks.TaskManager")
def test_validate_command_execute_success(self, mock_task_manager):
"""Test Validate command execution with valid tasks."""
parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers()
validate_cmd = Validate.create(subparsers)
mock_tm_instance = MagicMock()
mock_tm_instance.match_tasks.return_value = ["hellaswag", "arc_easy"]
mock_task_manager.return_value = mock_tm_instance
args = parser.parse_args(["validate", "--tasks", "hellaswag,arc_easy"])
with patch("builtins.print") as mock_print:
validate_cmd._execute(args)
mock_print.assert_any_call("Validating tasks: ['hellaswag', 'arc_easy']")
mock_print.assert_any_call("All tasks found and valid")
@patch("lm_eval.tasks.TaskManager")
def test_validate_command_execute_missing_tasks(self, mock_task_manager):
"""Test Validate command execution with missing tasks."""
parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers()
validate_cmd = Validate.create(subparsers)
mock_tm_instance = MagicMock()
mock_tm_instance.match_tasks.return_value = ["hellaswag"]
mock_task_manager.return_value = mock_tm_instance
args = parser.parse_args(["validate", "--tasks", "hellaswag,nonexistent"])
with patch("builtins.print") as mock_print:
with pytest.raises(SystemExit) as exc_info:
validate_cmd._execute(args)
assert exc_info.value.code == 1
mock_print.assert_any_call("Tasks not found: nonexistent")
class TestCLIUtils:
"""Test CLI utility functions."""
def test_try_parse_json_with_json_string(self):
"""Test try_parse_json with valid JSON string."""
result = try_parse_json('{"key": "value", "num": 42}')
assert result == {"key": "value", "num": 42}
def test_try_parse_json_with_dict(self):
"""Test try_parse_json with dict input."""
input_dict = {"key": "value"}
result = try_parse_json(input_dict)
assert result is input_dict
def test_try_parse_json_with_none(self):
"""Test try_parse_json with None."""
result = try_parse_json(None)
assert result is None
def test_try_parse_json_with_plain_string(self):
"""Test try_parse_json with plain string."""
result = try_parse_json("key=value,key2=value2")
assert result == "key=value,key2=value2"
def test_try_parse_json_with_invalid_json(self):
"""Test try_parse_json with invalid JSON."""
with pytest.raises(ValueError) as exc_info:
try_parse_json('{key: "value"}') # Invalid JSON (unquoted key)
assert "Invalid JSON" in str(exc_info.value)
assert "double quotes" in str(exc_info.value)
def test_int_or_none_list_single_value(self):
"""Test _int_or_none_list_arg_type with single value."""
result = _int_or_none_list_arg_type(3, 4, "0,1,2,3", "42")
assert result == [42, 42, 42, 42]
def test_int_or_none_list_multiple_values(self):
"""Test _int_or_none_list_arg_type with multiple values."""
result = _int_or_none_list_arg_type(3, 4, "0,1,2,3", "10,20,30,40")
assert result == [10, 20, 30, 40]
def test_int_or_none_list_with_none(self):
"""Test _int_or_none_list_arg_type with None values."""
result = _int_or_none_list_arg_type(3, 4, "0,1,2,3", "10,None,30,None")
assert result == [10, None, 30, None]
def test_int_or_none_list_invalid_value(self):
"""Test _int_or_none_list_arg_type with invalid value."""
with pytest.raises(ValueError):
_int_or_none_list_arg_type(3, 4, "0,1,2,3", "10,invalid,30,40")
def test_int_or_none_list_too_few_values(self):
"""Test _int_or_none_list_arg_type with too few values."""
with pytest.raises(ValueError):
_int_or_none_list_arg_type(3, 4, "0,1,2,3", "10,20")
def test_int_or_none_list_too_many_values(self):
"""Test _int_or_none_list_arg_type with too many values."""
with pytest.raises(ValueError):
_int_or_none_list_arg_type(3, 4, "0,1,2,3", "10,20,30,40,50")
def test_request_caching_arg_to_dict_none(self):
"""Test request_caching_arg_to_dict with None."""
result = request_caching_arg_to_dict(None)
assert result == {}
def test_request_caching_arg_to_dict_true(self):
"""Test request_caching_arg_to_dict with 'true'."""
result = request_caching_arg_to_dict("true")
assert result == {
"cache_requests": True,
"rewrite_requests_cache": False,
"delete_requests_cache": False,
}
def test_request_caching_arg_to_dict_refresh(self):
"""Test request_caching_arg_to_dict with 'refresh'."""
result = request_caching_arg_to_dict("refresh")
assert result == {
"cache_requests": True,
"rewrite_requests_cache": True,
"delete_requests_cache": False,
}
def test_request_caching_arg_to_dict_delete(self):
"""Test request_caching_arg_to_dict with 'delete'."""
result = request_caching_arg_to_dict("delete")
assert result == {
"cache_requests": False,
"rewrite_requests_cache": False,
"delete_requests_cache": True,
}
def test_check_argument_types_raises_on_untyped(self):
"""Test check_argument_types raises error for untyped arguments."""
parser = argparse.ArgumentParser()
parser.add_argument("--untyped") # No type specified
with pytest.raises(ValueError) as exc_info:
check_argument_types(parser)
assert "untyped" in str(exc_info.value)
assert "doesn't have a type specified" in str(exc_info.value)
def test_check_argument_types_passes_on_typed(self):
"""Test check_argument_types passes for typed arguments."""
parser = argparse.ArgumentParser()
parser.add_argument("--typed", type=str)
# Should not raise
check_argument_types(parser)
def test_check_argument_types_skips_const_actions(self):
"""Test check_argument_types skips const actions."""
parser = argparse.ArgumentParser()
parser.add_argument("--flag", action="store_const", const=True)
# Should not raise
check_argument_types(parser)