@@ -44,10 +44,10 @@ To install additional multilingual tokenization and text segmentation packages,
```bash
pip install -e ".[multilingual]"
```
To support loading GPTQ quantized models, install the package with the `gptq` extra:
```bash
pip install-e".[auto-gptq]"
pip install-e".[gptq]"
```
## Basic Usage
...
...
@@ -94,6 +94,25 @@ accelerate launch main.py \
This will perform *data-parallel evaluation*: that is, placing a **single full copy** of your model onto each available GPU and *splitting batches across GPUs* to evaluate on K GPUs K times faster than on one.
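For example, a data-parallel run across all visible GPUs might look like the sketch below; the model type name (`hf`), the checkpoint, and the task are illustrative placeholders rather than required values:
```bash
# Sketch: one full model copy per GPU, with batches split across processes.
# Model and task names are examples only; substitute your own.
accelerate launch main.py \
    --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6b \
    --tasks lambada_openai \
    --batch_size 16
```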
However, if your model *is too large to be run on a single one of your GPUs*, then we provide an alternative method to run these large models: use of the `parallelize` argument.
To pass even more advanced keyword arguments to `accelerate`, we allow for the following arguments as well:
-`device_map_option`: How to split model weights across available GPUs. defaults to "auto".
-`max_memory_per_gpu`: the max GPU memory to use per GPU in loading the model.
-`max_cpu_memory`: the max amount of CPU memory to use when offloading the model weights to RAM.
-`offload_folder`: a folder where model weights will be offloaded to disk if needed.
Using this setting helps with massive models like BLOOM that require more memory than a single GPU provides, and also helps avoid exceeding your total system RAM: by default, with `accelerate launch`, one copy of the model is initialized in RAM for each GPU before being moved to GPU, resulting in large RAM usage spikes around the start of the script that may cause errors such as `Killed`. However, this setting naively splits the model across GPUs, so only a single GPU performs work at any point in time; it is therefore much slower than launching with `accelerate launch`, possibly by a factor of the total number of GPUs.
**Note that this option requires launching evaluation via `python main.py` rather than `accelerate launch main.py`.**
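A sketch of such a launch is shown below. It assumes these options are forwarded through `--model_args`, and the checkpoint and task names are again illustrative:
```bash
# Sketch: shard one model copy across all GPUs via HF Accelerate's device_map.
# Note the plain `python` launch rather than `accelerate launch`.
python main.py \
    --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6b,parallelize=True,device_map_option=auto \
    --tasks lambada_openai \
    --batch_size 8
```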
### Commercial APIs
...
...
@@ -141,17 +160,17 @@ For models loaded with the HuggingFace `transformers` library, any arguments pr
[GPTQ](https://github.com/PanQiWei/AutoGPTQ) quantized models can be loaded by specifying their file names in `,gptq=NAME` (or `,gptq=True` for default names) in the `model_args` argument:
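For instance, a GPTQ checkpoint might be loaded as in the sketch below; the model path is a placeholder and the surrounding flags follow the usage shown above:
```bash
# Sketch: load a GPTQ-quantized checkpoint. `gptq=True` uses the default quantized
# file name; pass an explicit file name instead if yours differs.
python main.py \
    --model hf \
    --model_args pretrained=/path/to/gptq-model,gptq=True \
    --tasks lambada_openai
```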
```python
# multigpu data-parallel support when launched with accelerate
if gpus > 1:
    accelerator = Accelerator()
    if parallelize:
        if accelerator.num_processes > 1:
            raise RuntimeError(
                "Attempted to use both a HF Accelerate `device_map` and to launch via `accelerate launch`. If this is the case, please either remove `parallelize=True` from --model_args or launch outside of the Accelerate launcher."
            )
        else:
            pass
    elif gpus > accelerator.num_processes:
        # TODO: make sure there's still never an edge case where we unintentionally default to CPU
        eval_logger.warning(
            "WARNING: The number of total system GPUs does not match the number of spawned processes. "
        )
```
This list keeps track of which tasks' implementations have been ported to YAML / v2.0 of the Eval Harness.
Boxes should be checked iff tasks are implemented in the refactor and tested for regression. Tasks should be struck through if checked *against original introducing paper* implementation or popularizing implementation.
Boxes should be checked iff tasks are implemented in the refactor and tested for regression. Tasks should be struck through if checked *against the original introducing paper's implementation* or a popularizing implementation. (WIP) denotes that a PR or person is already working on this task.
- [ ] Glue (WIP)
- [x] SuperGlue
...
...
@@ -12,7 +12,7 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
- [ ] Lambada (Multilingual)
- [x] Wikitext
- [x] PiQA
- [ ] PROST (WIP)
- [ ] MCTACO
- [x] Pubmed QA
- [x] SciQ
...
...
@@ -21,11 +21,16 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
- [ ] TriviaQA
- [x] AI2 ARC
- [ ] LogiQA (WIP)
- [x] HellaSwag
- [ ] SWAG (WIP)
- [x] OpenBookQA
- [ ] SQuADv2 (WIP)
- [x] RACE
- [ ] HeadQA
- [ ] MathQA
- [ ] WebQs
...
...
@@ -35,7 +40,7 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
- [ ] Hendrycks Ethics
- [ ] TruthfulQA
- [ ] MuTual
- [ ] Hendrycks Math (WIP)
- [ ] Asdiv
- [ ] GSM8k
- [x] Arithmetic
...
...
@@ -45,6 +50,8 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
- [x] ~~Pile (perplexity)~~
- [ ] BLiMP
- [ ] ToxiGen
- [ ] StoryCloze
- [ ] NaturalQs
- [ ] CrowS-Pairs
- [ ] XCopa
- [ ] BIG-Bench
...
...
@@ -53,6 +60,9 @@ Boxes should be checked iff tasks are implemented in the refactor and tested for
Tasks added in the revamped harness that were not previously available. Again, a strikethrough denotes checking performed *against the original task's implementation or published results introducing the task*.