Merge pull request #812 from fattorib/bump-accelerate

[Refactor] Bump min accelerate version and update documentation

Commit cc7828dd, authored Sep 04, 2023 by Hailey Schoelkopf; committed by GitHub Sep 04, 2023

Showing with 16 additions and 7 deletions

README.md +4 -2

lm_eval/models/huggingface.py +11 -4

setup.py +1 -1

--- a/README.md
+++ b/README.md
@@ -116,8 +116,10 @@ accelerate launch main.py \

 This will perform *data-parallel evaluation*: that is, placing a **single full copy** of your model onto each available GPU and *splitting batches across GPUs* to evaluate on K GPUs K times faster than on one.

-However, if your model *is too large to be run on a single one of your GPUs*, then we provide an alternative method to run these large models: use of the `parallelize` argument.
+If your model *is too large to be run on a single one of your GPUs*, then you can use `accelerate` with Fully Sharded Data Parallel (FSDP), which splits the weights of the model across your data parallel ranks. To enable this, ensure you select `YES` when asked ```Do you want to use FullyShardedDataParallel?``` when running `accelerate config`. To enable memory-efficient loading, select `YES` when asked `Do you want each individually wrapped FSDP unit to broadcast module parameters from rank 0 at the start?`. This ensures that only the rank 0 process loads the model and then broadcasts the parameters to the other ranks, instead of having each rank load all parameters, which can lead to large RAM usage spikes around the start of the script that may cause errors.

+
+We also provide a second method to run these large models: use of the `parallelize` argument.
 ```
 python main.py \
    --model hf \
@@ -132,7 +134,7 @@ To pass even more advanced keyword arguments to `accelerate`, we allow for the f
 - `max_cpu_memory`: the max amount of CPU memory to use when offloading the model weights to RAM.
 - `offload_folder`: a folder where model weights will be offloaded to disk if needed.

-Using this setting helps for massive models like BLOOM which require, or to avoid exceeding your total system RAM (by default, with `accelerate launch` one copy of the model for each GPU is initialized in RAM before moving it to GPU, resulting in large RAM usage spikes around the start of the script that may cause errors such as `Killed`.) However, it naively splits models across GPUs, resulting in only a single GPU performing work at any point in time, and so is much slower than launching with `accelerate launch`, possibly by a factor of the total # of GPUs.
+Note that this method naively splits models across GPUs, resulting in only a single GPU performing work at any point in time, and so is much slower than launching with `accelerate launch`, possibly by a factor of the total # of GPUs.

 **Note that this option requires launching evaluation via `python main.py` rather than `accelerate launch main.py`.**


--- a/lm_eval/models/huggingface.py
+++ b/lm_eval/models/huggingface.py
@@ -22,7 +22,7 @@ from lm_eval.api.registry import register_model

 from lm_eval.utils import MultiTokenEOSCriteria, stop_sequences_criteria

-from accelerate import Accelerator, find_executable_batch_size
+from accelerate import Accelerator, find_executable_batch_size, DistributedType
 from typing import List, Optional, Union


@@ -295,9 +295,16 @@ class HFLM(LM):
                        "Failed to place model onto specified device. This may be because the model is quantized via `bitsandbytes`. If the desired GPU is being used, this message is safe to ignore."
                    )
            else:
-                self._model = accelerator.prepare_model(
-                    self.model, evaluation_mode=True
-                )
+                assert accelerator.distributed_type in [
+                    DistributedType.FSDP, 
+                    DistributedType.MULTI_GPU
+                ], "Unsupported distributed type provided. Only DDP and FSDP are supported."
+                if accelerator.distributed_type == DistributedType.FSDP:
+                    self._model = accelerator.prepare(self.model)
+                else:
+                    self._model = accelerator.prepare_model(
+                        self.model, evaluation_mode=True 
+                    )
                self._device = torch.device(f"cuda:{accelerator.local_process_index}")
                self.accelerator = accelerator
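
The branch added in this hunk can be read as a small dispatch on the launcher's distributed type. The sketch below is a standalone illustration, not the harness code: `DistributedType` here is a local stand-in enum with only a subset of members (the real one is imported from `accelerate`), and `select_prepare_call` is a hypothetical helper that merely returns the name of the call the diff would make.

```python
from enum import Enum


class DistributedType(str, Enum):
    """Local stand-in for accelerate.DistributedType (illustrative subset only)."""

    NO = "NO"
    MULTI_GPU = "MULTI_GPU"
    FSDP = "FSDP"
    DEEPSPEED = "DEEPSPEED"


# The two modes the diff accepts: FSDP and plain multi-GPU data parallelism (DDP).
SUPPORTED_TYPES = (DistributedType.FSDP, DistributedType.MULTI_GPU)


def select_prepare_call(distributed_type: DistributedType) -> str:
    """Return which Accelerator call the new branch would use (hypothetical helper)."""
    if distributed_type not in SUPPORTED_TYPES:
        raise ValueError(
            "Unsupported distributed type provided. Only DDP and FSDP are supported."
        )
    if distributed_type == DistributedType.FSDP:
        # FSDP path in the diff: accelerator.prepare(self.model)
        return "prepare"
    # DDP path in the diff: accelerator.prepare_model(self.model, evaluation_mode=True)
    return "prepare_model(evaluation_mode=True)"
```

Unlike the `assert` in the diff, this sketch raises `ValueError`, since assertions are skipped when Python runs with `-O`. That substitution is a choice made for the sketch and is not part of the PR.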


--- a/setup.py
+++ b/setup.py
@@ -53,7 +53,7 @@ setuptools.setup(
    ],
    python_requires=">=3.9",
    install_requires=[
-        "accelerate>=0.18.0",
+        "accelerate>=0.21.0",
        "evaluate",
        "datasets>=2.0.0",
        "evaluate>=0.4.0",
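
To check whether an environment already satisfies the new `accelerate>=0.21.0` floor, a comparison along the following lines may help. This is a minimal sketch under stated assumptions: `release_tuple` and `meets_minimum` are hypothetical helpers, and the parsing is deliberately simplified (it looks only at the leading numeric release segment and ignores pre-release suffixes). For anything beyond a quick check, `packaging.version.Version` handles the full version grammar.

```python
import re


def release_tuple(version: str) -> tuple:
    """Extract the leading numeric release segment, e.g. '0.21.0.dev0' -> (0, 21, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed: str, minimum: str = "0.21.0") -> bool:
    """Return True if `installed` is at least `minimum` (release segment only)."""
    have = release_tuple(installed)
    need = release_tuple(minimum)
    # Pad the shorter tuple with zeros so that '0.21' compares equal to '0.21.0'.
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need
```

The installed version string can be obtained with `importlib.metadata.version("accelerate")` from the standard library and passed to `meets_minimum`.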