Upstream Mamba Support (`mamba_ssm`) (#1110)

* modularize HFLM code
* pass through extra kwargs to AutoModel.from_pretrained call
* remove explicit model_kwargs
* rename gptq -> autogptq
* fix tokenizer pad token errors
* ensure model always respects device_map and autogptq's selected devices
* add a _get_config helper fn
* add mambaLMWrapper
* add mamba extra
* add mamba extra
* fix conditional import
* Fix botched merge commit
* Remove beginning-of-file comment for consistency
* Add docstring for mambaLM re: supported kwargs
* Alphabetize extras
* Update extras table
* appease precommit
* run precommit on mamba_lm
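
The `MambaLMWrapper` added in this commit supplies Mamba-specific defaults (`backend`, `tokenizer`, `max_length`) while forwarding the remaining keyword arguments to its parent class. When doing this, a name must not be passed both explicitly and again through `**kwargs`. The sketch below illustrates the pattern with hypothetical stand-in classes (`Base`, `Wrapper`, `BrokenWrapper`); none of these are part of lm_eval.

```python
class Base:
    """Hypothetical stand-in for a parent class such as HFLM."""

    def __init__(self, pretrained, backend="default", max_length=None, **kwargs):
        self.pretrained = pretrained
        self.backend = backend
        self.max_length = max_length
        self.extra = kwargs  # whatever was forwarded untouched


class Wrapper(Base):
    def __init__(self, pretrained="some/model", **kwargs):
        super().__init__(
            pretrained=pretrained,
            # pop() removes the key before ** unpacking happens,
            # so each name reaches the parent exactly once
            backend=kwargs.pop("backend", "causal"),
            max_length=kwargs.pop("max_length", 2048),
            **kwargs,
        )


class BrokenWrapper(Base):
    def __init__(self, pretrained="some/model", **kwargs):
        super().__init__(
            pretrained=pretrained,
            # get() leaves the key inside kwargs ...
            backend=kwargs.get("backend", "causal"),
            # ... so this raises TypeError when the caller passed backend=
            **kwargs,
        )
```

With `Wrapper`, a caller can override a default or add unrelated arguments freely. `BrokenWrapper` only works as long as the caller never passes `backend` explicitly.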

Commit 5503b274, authored Dec 22, 2023 by Hailey Schoelkopf; committed by GitHub Dec 22, 2023
Showing with 152 additions and 17 deletions

README.md +10 -5
lm_eval/models/__init__.py +1 -0
lm_eval/models/mamba_lm.py +125 -0
pyproject.toml +16 -12

--- a/README.md
+++ b/README.md
@@ -51,15 +51,20 @@ We also provide a number of optional dependencies for extended functionality. Ex
 | Name          | Use                                   |
 |---------------|---------------------------------------|
 | anthropic     | For using Anthropic's models          |
+| dev           | For linting PRs and contributions     |
 | gptq          | For loading models with GPTQ          |
-| dev           | You probably don't want to use this   |
+| ifeval        | For running the IFEval task           |
+| mamba         | For loading Mamba SSM models          |
+| math          | For running math task answer checking |
 | multilingual  | For multilingual tokenizers           |
 | openai        | For using OpenAI's models             |
-| promptsource  | For using PromtSource prompts         |
+| promptsource  | For using PromptSource prompts        |
 | sentencepiece | For using the sentencepiece tokenizer |
+| testing       | For running library test suite        |
 | vllm          | For loading models with vLLM          |
 | zeno          | For visualizing results with Zeno     |
-| all           | Loads all extras                      |
+|---------------|---------------------------------------|
+| all           | Loads all extras (not recommended)    |

 ## Basic Usage

@@ -162,7 +167,6 @@ lm_eval --model local-chat-completions --tasks gsm8k --model_args model=facebook
 ```
 Note that for externally hosted models, configs such as `--device` and `--batch_size` should not be used and do not function. Just like you can use `--model_args` to pass arbitrary arguments to the model constructor for local models, you can use it to pass arbitrary arguments to the model API for hosted models. See the documentation of the hosting service for information on what arguments they support.

-
 | API or Inference Server                                                                                                   | Implemented?                    | `--model <xxx>` name                                                | Models supported:                                                                             | Request Types:                                             |
 |---------------------------------------------------------------------------------------------------------------------------|---------------------------------|---------------------------------------------------------------------|-----------------------------------------------------------------------------------------------|------------------------------------------------------------|
 | OpenAI Completions                                                                                                        | :heavy_check_mark:              | `openai-completions` | up to `code-davinci-002`                                                                      | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
@@ -172,7 +176,8 @@ Note that for externally hosted models, configs such as `--device` and `--batch_
 | Cohere                                                                                                                    | [:hourglass: - blocked on Cohere API bug](https://github.com/EleutherAI/lm-evaluation-harness/pull/395) | N/A                                                                 | [All `cohere.generate()` engines](https://docs.cohere.com/docs/models)                        | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
 | [Llama.cpp](https://github.com/ggerganov/llama.cpp) (via [llama-cpp-python](https://github.com/abetlen/llama-cpp-python)) | :heavy_check_mark:              | `gguf`, `ggml`                                                      | [All models supported by llama.cpp](https://github.com/ggerganov/llama.cpp)                   | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
 | vLLM                                                                                                                      | :heavy_check_mark:       | `vllm`                                                              | [Most HF Causal Language Models](https://docs.vllm.ai/en/latest/models/supported_models.html) | `generate_until`, `loglikelihood`, `loglikelihood_rolling` |
-| Your local inference server!                                                                                              | :heavy_check_mark:                             | `local-chat-completions` (using `openai-completions` model type)    | Any server address that accepts GET requests using HF models and mirror's OpenAI's ChatCompletions interface                                  | `generate_until`                                           |                                | ...                                                      |
+| Mamba                       | :heavy_check_mark:       | `mamba_ssm`                                                                      | [Mamba architecture Language Models via the `mamba_ssm` package](https://huggingface.co/state-spaces) | `generate_until`, `loglikelihood`, `loglikelihood_rolling`                             |
+| Your local inference server!                                                                                              | :heavy_check_mark:                             | `local-chat-completions` (using `openai-chat-completions` model type)    | Any server address that accepts GET requests using HF models and mirror's OpenAI's ChatCompletions interface                                  | `generate_until`                                           |                                | ...                                                      |

 It is on our roadmap to create task variants designed to enable models which do not serve logprobs/loglikelihoods to be compared with generation performance of open-source models.


--- a/lm_eval/models/__init__.py
+++ b/lm_eval/models/__init__.py
@@ -5,5 +5,6 @@ from . import dummy
 from . import anthropic_llms
 from . import gguf
 from . import vllm_causallms
+from . import mamba_lm

 # TODO: implement __all__
--- a/lm_eval/models/mamba_lm.py
+++ b/lm_eval/models/mamba_lm.py
+from typing import Optional, Union
+
+import torch
+
+from lm_eval import utils
+from lm_eval.api.registry import register_model
+from lm_eval.models.huggingface import HFLM
+
+
+@register_model("mamba_ssm")
+class MambaLMWrapper(HFLM):
+    def __init__(
+        self,
+        pretrained="state-spaces/mamba-130m",
+        **kwargs,
+    ) -> None:
+        """
+        Mamba (via the `mamba_ssm` package) supports the following args:
+        ```
+        d_model: int,
+        n_layer: int,
+        vocab_size: int,
+        initializer_cfg=None,
+        pad_vocab_size_multiple: int = 1,
+        ssm_cfg=None,
+        norm_epsilon: float = 1e-5,
+        rms_norm: bool = False,
+        initializer_cfg=None,
+        fused_add_norm=False,
+        residual_in_fp32=False,
+        ```
+
+        See https://github.com/state-spaces/mamba/blob/main/mamba_ssm/models/mixer_seq_simple.py#L175 for more info.
+        The above can all be passed via `--model_args` or to this __init__() directly
+        but we recommend placing many of these within the config.json file uploaded alongside your
+        Mamba model to the HF Hub instead.
+        All other HuggingFace from_pretrained() kwargs
+        such as those related to
+        `parallelize=True`, PEFT, autoGPTQ,
+        or any sub-configurations of these advanced args,
+        are unsupported by the `mamba_ssm` package.
+
+        The HFLM arguments
+
+        `backend`, `revision`, `subfolder`, `tokenizer`, `truncation`, `max_length`,
+        `device`, `dtype`, `batch_size`, `max_batch_size`, `trust_remote_code`, `use_fast_tokenizer`
+
+        are all supported by Mamba where they do not conflict
+        with Mamba-specific restrictions such as causal LMs only.
+        """
+
+        if "backend" in kwargs:
+            # mamba currently only supports causal models
+            assert kwargs["backend"] == "causal"
+
+        super().__init__(
+            pretrained=pretrained,
+            # set appropriate defaults for tokenizer, max length, etc
+            backend=kwargs.pop("backend", "causal"),
+            tokenizer=kwargs.pop("tokenizer", "EleutherAI/gpt-neox-20b"),
+            max_length=kwargs.pop("max_length", 2048),
+            **kwargs,
+        )
+
+    def _get_config(
+        self,
+        pretrained: str,
+        **kwargs,
+    ) -> None:
+        try:
+            from mamba_ssm.utils.hf import load_config_hf  # noqa: F811
+        except ModuleNotFoundError:
+            raise Exception(
+                "attempted to use 'mamba_ssm' LM type, but package `mamba_ssm` is not installed. \
+please install mamba via `pip install lm-eval[mamba]` or `pip install -e .[mamba]`",
+            )
+
+        self._config = load_config_hf(pretrained)
+
+    def _create_model(
+        self,
+        pretrained: str,
+        dtype: Optional[Union[str, torch.dtype]] = "float16",
+        # no `parallelize=True` options
+        # no PEFT and quantization options
+        # Mamba does not support arbitrary HF from_pretrained() args
+        **kwargs,
+    ) -> None:
+        try:
+            from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel  # noqa: F811
+        except ModuleNotFoundError:
+            raise Exception(
+                "attempted to use 'mamba_ssm' LM type, but package `mamba_ssm` is not installed. \
+please install mamba via `pip install lm-eval[mamba]` or `pip install -e .[mamba]`",
+            )
+
+        self._model = MambaLMHeadModel.from_pretrained(
+            pretrained,
+            device=self._device,
+            dtype=torch.float16 if dtype == "auto" else utils.get_dtype(dtype),
+            **kwargs,
+        )
+
+    def _model_generate(self, context, max_length, stop, **generation_kwargs):
+        for key in ("do_sample", "attention_mask"):
+            if key in generation_kwargs:
+                generation_kwargs.pop(key)
+
+        # mamba's custom GenerationMixin currently does not support
+        # passing stopping criteria.
+        # for the time being, we simply generate to max length,
+        # then truncate (equivalent result)
+        # -- this should be revisited to speed up generation
+        # stopping_criteria = stop_sequences_criteria(
+        #     self.tokenizer, stop, 1, context.shape[0]
+        # )
+
+        return self.model.generate(
+            input_ids=context,
+            max_length=max_length,
+            # stopping_criteria=stopping_criteria,
+            # pad_token_id=self.tokenizer.pad_token_id,
+            # use_cache=True,
+            **generation_kwargs,
+        )
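
The comment in `_model_generate` notes that stopping criteria cannot be passed to Mamba's generation routine, so output is generated up to `max_length` and truncated afterward. A minimal sketch of such post-hoc truncation on decoded text follows, assuming a hypothetical helper named `truncate_at_stop` (not a function from lm_eval or `mamba_ssm`):

```python
from typing import Iterable


def truncate_at_stop(text: str, stop: Iterable[str]) -> str:
    """Cut `text` at the earliest occurrence of any stop sequence.

    Empty stop strings are ignored. If no stop sequence occurs,
    the text is returned unchanged.
    """
    cut = len(text)
    for term in stop:
        if not term:
            continue
        idx = text.find(term)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]
```

The returned text matches what early stopping would have produced, but the model still spends compute on tokens that are later discarded, which is why the comment flags this for a future speed-up.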
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -54,31 +54,35 @@ Homepage = "https://github.com/EleutherAI/lm-evaluation-harness"
 Repository = "https://github.com/EleutherAI/lm-evaluation-harness"

 [project.optional-dependencies]
+anthropic = ["anthropic"]
 dev = ["pytest", "pytest-cov", "pytest-xdist", "pre-commit", "mypy"]
-multilingual = ["nagisa>=0.2.7", "jieba>=0.42.1", "pycountry"]
+gptq = ["auto-gptq[triton] @ git+https://github.com/PanQiWei/AutoGPTQ"]
+ifeval = ["langdetect", "immutabledict"]
+mamba = ["mamba_ssm", "causal-conv1d==1.0.2"]
 math = ["sympy>=1.12", "antlr4-python3-runtime==4.11"]
-sentencepiece = ["sentencepiece>=0.1.98", "protobuf>=4.22.1"]
+multilingual = ["nagisa>=0.2.7", "jieba>=0.42.1", "pycountry"]
+openai = ["openai==1.3.9", "tiktoken"]
 promptsource = [
    "promptsource @ git+https://github.com/bigscience-workshop/promptsource.git#egg=promptsource"
 ]
-gptq = ["auto-gptq[triton] @ git+https://github.com/PanQiWei/AutoGPTQ"]
-anthropic = ["anthropic"]
-openai = ["openai==1.3.9", "tiktoken"]
+sentencepiece = ["sentencepiece>=0.1.98", "protobuf>=4.22.1"]
+testing = ["pytest", "pytest-cov", "pytest-xdist"]
 vllm = ["vllm"]
-ifeval = ["langdetect", "immutabledict"]
 zeno = ["pandas", "zeno-client"]
 all = [
+    "lm_eval[anthropic]",
    "lm_eval[dev]",
-    "lm_eval[testing]",
+    "lm_eval[gptq]",
+    "lm_eval[ifeval]",
    "lm_eval[linting]",
+    "lm_eval[mamba]",
+    "lm_eval[math]",
    "lm_eval[multilingual]",
-    "lm_eval[sentencepiece]",
-    "lm_eval[promptsource]",
-    "lm_eval[gptq]",
-    "lm_eval[anthropic]",
    "lm_eval[openai]",
+    "lm_eval[promptsource]",
+    "lm_eval[sentencepiece]",
+    "lm_eval[testing]",
    "lm_eval[vllm]",
-    "lm_eval[ifeval]",
    "lm_eval[zeno]",
 ]
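
The `mamba` extra above pairs with the conditional imports in `mamba_lm.py`: the package is imported only when the model type is actually used, and a missing install produces a message pointing at the extra. A generic sketch of that guard is shown below. The helper name `require_extra` is hypothetical and not part of lm_eval.

```python
import importlib
from types import ModuleType


def require_extra(module_name: str, extra: str) -> ModuleType:
    """Import `module_name`, or raise an error naming the pip extra to install."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as err:
        raise ModuleNotFoundError(
            f"package `{module_name}` is not installed. "
            f"Install it via `pip install lm-eval[{extra}]` "
            f"or `pip install -e .[{extra}]`"
        ) from err
```

Deferring the import this way keeps the base install usable without the optional dependency, and the error surfaces only for users who select the corresponding model type.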