Unverified Commit d35008f1 authored by Lucia Quirke, committed by GitHub

Enable steering HF models (#2749)



* Enable steering HF models
Co-authored-by: Matthew Khoriaty <matthewkhoriaty2026@u.northwestern.edu>

* increase HF download timeout

* Update readme; improve steering vector device handling

* Update latest news

* remove HF timeout increase

* fix tests

* ignore sae lens test

* fix accidental force push

---------
Co-authored-by: Matthew Khoriaty <matthewkhoriaty2026@u.northwestern.edu>
parent 14b0bd26
......@@ -61,7 +61,7 @@ jobs:
# pip install bleurt@https://github.com/google-research/bleurt/archive/b610120347ef22b494b6d69b4316e303f5932516.zip#egg=bleurt
# if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
- name: Test with pytest
run: python -m pytest --showlocals -s -vv -n=auto --ignore=tests/models/test_neuralmagic.py --ignore=tests/models/test_openvino.py
run: python -m pytest --showlocals -s -vv -n=auto --ignore=tests/models/test_neuralmagic.py --ignore=tests/models/test_openvino.py --ignore=tests/models/test_hf_steered.py
- name: Archive artifacts
uses: actions/upload-artifact@v4
with:
......
......@@ -5,6 +5,7 @@
---
*Latest News 📣*
- [2025/03] Added support for steering HF models!
- [2025/02] Added [SGLang](https://docs.sglang.ai/) support!
- [2024/09] We are prototyping allowing users of LM Evaluation Harness to create and evaluate on text+image multimodal input, text output tasks, and have just added the `hf-multimodal` and `vllm-vlm` model types and `mmmu` task as a prototype feature. We welcome users to try out this in-progress feature and stress-test it for themselves, and suggest they check out [`lmms-eval`](https://github.com/EvolvingLMMs-Lab/lmms-eval), a wonderful project originally forking off of the lm-evaluation-harness, for a broader range of multimodal tasks, models, and features.
- [2024/07] [API model](docs/API_guide.md) support has been updated and refactored, introducing support for batched and async requests, and making it significantly easier to customize and use for your own purposes. **To run Llama 405B, we recommend using VLLM's OpenAI-compliant API to host the model, and use the `local-completions` model type to evaluate the model.**
......@@ -157,6 +158,50 @@ To learn more about model parallelism and how to use it with the `accelerate` li
**Note: we do not currently support multi-node evaluations natively, and advise using either an externally hosted server to run inference requests against, or creating a custom integration with your distributed framework [as is done for the GPT-NeoX library](https://github.com/EleutherAI/gpt-neox/blob/main/eval_tasks/eval_adapter.py).**
### Steered Hugging Face `transformers` models
To evaluate a Hugging Face `transformers` model with steering vectors applied, specify the model type as `steered` and provide a `steer_path` pointing to either a PyTorch file containing pre-defined steering vectors, or a CSV file that specifies how to derive steering vectors from pretrained `sparsify` or `sae_lens` models. The CSV route requires installing the corresponding optional dependency (`pip install lm_eval[sparsify]` or `pip install lm_eval[sae_lens]`).
Specify pre-defined steering vectors:
```python
import torch
steer_config = {
"layers.3": {
"steering_vector": torch.randn(1, 768),
"bias": torch.randn(1, 768),
"steering_coefficient": 1,
"action": "add"
},
}
torch.save(steer_config, "steer_config.pt")
```
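In addition to `add`, the model docstring documents a `clamp` action, which fixes the component of the activations along the steering direction to a set value rather than adding to it. A hypothetical config in the same file format (the random vector here is purely illustrative):

```python
import torch

# Sketch of a clamp-action steering config; same schema as above,
# with "action" switched to "clamp". Bias is optional and may be None.
steer_config = {
    "layers.3": {
        "steering_vector": torch.randn(1, 768),
        "bias": None,
        "steering_coefficient": 8.0,  # value the direction is clamped to
        "action": "clamp",
    },
}
torch.save(steer_config, "steer_config_clamp.pt")
```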
Specify derived steering vectors:
```python
import pandas as pd
pd.DataFrame({
"loader": ["sparsify"],
"action": ["add"],
"sparse_model": ["EleutherAI/sae-pythia-70m-32k"],
"hookpoint": ["layers.3"],
"feature_index": [30],
"steering_coefficient": [10.0],
}).to_csv("steer_config.csv", index=False)
```
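Derivation from `sae_lens` models works the same way; the extra `sae_id` column selects an SAE within a release. A sketch using the values from the model's docstring example:

```python
import pandas as pd

# Sketch of a sae_lens-based steering config CSV (values mirror the
# docstring example in lm_eval.models.hf_steered).
pd.DataFrame({
    "loader": ["sae_lens"],
    "action": ["add"],
    "sparse_model": ["gemma-scope-2b-pt-res-canonical"],
    "hookpoint": ["layers.20"],
    "feature_index": [12082],
    "steering_coefficient": [240.0],
    "sae_id": ["layer_20/width_16k/canonical"],
}).to_csv("steer_config_sae_lens.csv", index=False)
```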
Run the evaluation harness with steering vectors applied:
```bash
lm_eval --model steered \
--model_args pretrained=EleutherAI/pythia-160m,steer_path=steer_config.pt \
--tasks lambada_openai,hellaswag \
--device cuda:0 \
--batch_size 8
```
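Under the hood, the `clamp` action (defined on the steered model class) normalizes the steering vector, projects out the activation component along that direction, and replaces it with `direction * value`. A minimal numeric sketch of that math, independent of any model:

```python
import torch

acts = torch.tensor([[3.0, 4.0]])            # toy activations [batch, features]
steering_vector = torch.tensor([0.0, 2.0])   # unnormalized steering direction

direction = steering_vector / torch.norm(steering_vector)  # unit vector [0., 1.]
proj = torch.sum(acts * direction, dim=-1, keepdim=True)   # component along direction
orthogonal = acts - proj * direction                       # remaining component
clamped = orthogonal + direction * 10.0                    # clamp direction to 10.0
```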
### NVIDIA `nemo` models
[NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo) is a generative AI framework built for researchers and PyTorch developers working on language models.
......@@ -523,8 +568,10 @@ Extras dependencies can be installed via `pip install -e ".[NAME]"`
| multilingual | For multilingual tokenizers |
| optimum | For running Intel OpenVINO models |
| promptsource | For using PromptSource prompts |
| sae_lens | For using SAELens to steer models |
| sentencepiece | For using the sentencepiece tokenizer |
| sparseml | For using NM's SparseML models |
| sparsify | For using Sparsify to steer models |
| testing | For running library test suite |
| vllm | For loading models with vLLM |
| zeno | For visualizing results with Zeno |
......
......@@ -3,6 +3,7 @@ from . import (
api_models,
dummy,
gguf,
hf_steered,
hf_vlms,
huggingface,
ibm_watsonx_ai,
......
from contextlib import contextmanager
from functools import partial
from pathlib import Path
from typing import Any, Callable, Generator, Optional, Union
import torch
from peft.peft_model import PeftModel
from torch import Tensor, nn
from transformers import PreTrainedModel
from lm_eval.api.registry import register_model
from lm_eval.models.huggingface import HFLM
@contextmanager
def steer(
model: Union[PreTrainedModel, PeftModel], hook_to_steer: dict[str, Callable]
) -> Generator[None, Any, None]:
"""
Context manager that temporarily hooks models and steers them.
Args:
model: The transformer model to hook
hook_to_steer: Dictionary mapping hookpoints to steering functions
Yields:
None
"""
def create_hook(hookpoint: str):
def hook_fn(module: nn.Module, input: Any, output: Tensor):
# If output is a tuple (like in some transformer layers), take first element
if isinstance(output, tuple):
output = (hook_to_steer[hookpoint](output[0]), *output[1:]) # type: ignore
else:
output = hook_to_steer[hookpoint](output)
return output
return hook_fn
handles = []
hookpoints = list(hook_to_steer.keys())
for name, module in model.base_model.named_modules():
if name in hookpoints:
handle = module.register_forward_hook(create_hook(name))
handles.append(handle)
if len(handles) != len(hookpoints):
raise ValueError(f"Not all hookpoints could be resolved: {hookpoints}")
try:
yield None
finally:
for handle in handles:
handle.remove()
@register_model("steered")
class SteeredModel(HFLM):
hook_to_steer: dict[str, Callable]
def __init__(
self,
pretrained: str,
steer_path: str,
device: Optional[str] = None,
**kwargs,
):
"""
HFLM with a steered forward pass.
To derive steering vectors from a sparse model loadable with sparsify or sae_lens,
provide the path to a CSV file with the following columns (example rows are provided below):
loader,action,sparse_model,hookpoint,feature_index,steering_coefficient,sae_id,description,
sparsify,add,EleutherAI/sae-pythia-70m-32k,layers.3,30,10.0,,,
sae_lens,add,gemma-scope-2b-pt-res-canonical,layers.20,12082,240.0,layer_20/width_16k/canonical,increase dogs,
To load steering vectors directly, provide the path to a pytorch (.pt) file with content in the following format:
{
hookpoint: {
"steering_vector": <torch.Tensor>,
"steering_coefficient": <float>,
"action": <Literal["add", "clamp"]>,
"bias": <torch.Tensor | None>,
},
...
}
"""
super().__init__(pretrained=pretrained, device=device, **kwargs)
if steer_path.endswith(".pt") or steer_path.endswith(".pth"):
with open(steer_path, "rb") as f:
steer_config: dict[str, dict[str, Any]] = torch.load(
f, weights_only=True
)
elif steer_path.endswith(".csv"):
steer_config = self.derive_steer_config(steer_path)
else:
raise ValueError(f"Unknown steer file type: {steer_path}")
hook_to_steer = {}
for hookpoint, steer_info in steer_config.items():
action = steer_info["action"]
steering_coefficient = steer_info["steering_coefficient"]
steering_vector = (
steer_info["steering_vector"].to(self.device).to(self.model.dtype)
)
bias = (
steer_info["bias"].to(self.device).to(self.model.dtype)
if steer_info["bias"] is not None
else None
)
if action == "add":
# Steers the model by adding some multiple of a steering vector to all sequence positions.
# Bind the loop variables as lambda defaults so each hookpoint keeps its own
# vector and coefficient (a plain closure would capture the last loop values).
hook_to_steer[hookpoint] = (
lambda acts, vector=steering_vector, coef=steering_coefficient: acts
+ coef * vector
)
elif action == "clamp":
hook_to_steer[hookpoint] = partial(
self.clamp,
steering_vector=steering_vector,
value=steering_coefficient,
bias=bias,
)
else:
raise ValueError(f"Unknown action: {action}")
self.hook_to_steer = hook_to_steer
@classmethod
def derive_steer_config(cls, steer_path: str):
"""Derive a dictionary of steering vectors from the sparse model(s) specified in a CSV file."""
import pandas as pd
df = pd.read_csv(steer_path)
steer_data: dict[str, dict[str, Any]] = {}
if any(df["loader"] == "sparsify"):
from sparsify import SparseCoder
if any(df["loader"] == "sae_lens"):
from sae_lens import SAE
sae_cache = {}
def load_from_sae_lens(sae_release: str, sae_id: str):
cache_key = (sae_release, sae_id)
if cache_key not in sae_cache:
sae_cache[cache_key] = SAE.from_pretrained(sae_release, sae_id)[0]
return sae_cache[cache_key]
for _, row in df.iterrows():
action = row.get("action", "add")
sparse_name = row["sparse_model"]
hookpoint = row["hookpoint"]
feature_index = int(row["feature_index"])
steering_coefficient = float(row["steering_coefficient"])
loader = row.get("loader", "sparsify")
if loader == "sparsify":
name_path = Path(sparse_name)
sparse_coder = (
SparseCoder.load_from_disk(name_path / hookpoint)
if name_path.exists()
else SparseCoder.load_from_hub(sparse_name, hookpoint)
)
assert sparse_coder.W_dec is not None
steering_vector = sparse_coder.W_dec[feature_index]
bias = sparse_coder.b_dec
elif loader == "sae_lens":
sparse_coder = load_from_sae_lens(
sae_release=sparse_name, sae_id=row["sae_id"]
)
steering_vector = sparse_coder.W_dec[feature_index]
bias = sparse_coder.b_dec
if hookpoint == "" or pd.isna(hookpoint):
hookpoint = sparse_coder.cfg.hook_name
else:
raise ValueError(f"Unknown loader: {loader}")
steer_data[hookpoint] = {
"action": action,
"steering_coefficient": steering_coefficient,
"steering_vector": steering_vector,
"bias": bias,
}
return steer_data
@classmethod
def clamp(
cls,
acts: Tensor,
steering_vector: Tensor,
value: float,
bias: Optional[Tensor] = None,
):
"""Clamps a direction of the activations to be the steering vector * the value.
Args:
acts (Tensor): The activations tensor to edit of shape [batch, pos, features]
steering_vector (Tensor): A direction to clamp of shape [features]
value (float): Value to clamp the direction to
bias (Tensor | None): Optional bias to add to the activations
Returns:
Tensor: The modified activations with the specified direction clamped
"""
if bias is not None:
acts = acts - bias
direction = steering_vector / torch.norm(steering_vector)
proj_magnitude = torch.sum(acts * direction, dim=-1, keepdim=True)
orthogonal_component = acts - proj_magnitude * direction
clamped = orthogonal_component + direction * value
if bias is not None:
return clamped + bias
return clamped
def forward(self, *args, **kwargs):
with torch.no_grad():
with steer(self.model, self.hook_to_steer):
return self.model.forward(*args, **kwargs)
def _model_call(self, *args, **kwargs):
with steer(self.model, self.hook_to_steer):
return super()._model_call(*args, **kwargs)
def _model_generate(self, *args, **kwargs):
with steer(self.model, self.hook_to_steer):
return super()._model_generate(*args, **kwargs)
......@@ -70,7 +70,9 @@ math = ["sympy>=1.12", "antlr4-python3-runtime==4.11", "math_verify[antlr4_11_0]
multilingual = ["nagisa>=0.2.7", "jieba>=0.42.1", "pycountry"]
optimum = ["optimum[openvino]"]
promptsource = ["promptsource>=0.2.3"]
sae_lens = ["sae_lens"]
sentencepiece = ["sentencepiece>=0.1.98"]
sparsify = ["sparsify"]
sparseml = ["sparseml-nightly[llm]>=1.8.0.20240404"]
testing = ["pytest", "pytest-cov", "pytest-xdist"]
vllm = ["vllm>=0.4.2"]
......@@ -91,7 +93,9 @@ all = [
"lm_eval[multilingual]",
"lm_eval[openai]",
"lm_eval[promptsource]",
"lm_eval[sae_lens]",
"lm_eval[sentencepiece]",
"lm_eval[sparsify]",
"lm_eval[sparseml]",
"lm_eval[testing]",
"lm_eval[vllm]",
......
# ruff: noqa
from __future__ import annotations
import os
import sys
from pathlib import Path
import numpy as np
import pytest
import torch
from lm_eval import tasks
from lm_eval.api.instance import Instance
pytest.skip("dependency conflict on CI", allow_module_level=True)
os.environ["TOKENIZERS_PARALLELISM"] = "false"
task_manager = tasks.TaskManager()
TEST_STRING = "foo bar"
class Test_SteeredModel:
from lm_eval.models.hf_steered import SteeredModel
torch.use_deterministic_algorithms(True)
task_list = task_manager.load_task_or_group(["arc_easy", "gsm8k", "wikitext"])
version_minor = sys.version_info.minor
multiple_choice_task = task_list["arc_easy"] # type: ignore
multiple_choice_task.build_all_requests(limit=10, rank=0, world_size=1)
MULTIPLE_CH: list[Instance] = multiple_choice_task.instances
generate_until_task = task_list["gsm8k"] # type: ignore
generate_until_task._config.generation_kwargs["max_gen_toks"] = 10
generate_until_task.set_fewshot_seed(1234) # fewshot random generator seed
generate_until_task.build_all_requests(limit=10, rank=0, world_size=1)
generate_until: list[Instance] = generate_until_task.instances
rolling_task = task_list["wikitext"] # type: ignore
rolling_task.build_all_requests(limit=10, rank=0, world_size=1)
ROLLING: list[Instance] = rolling_task.instances
MULTIPLE_CH_RES = [
-41.79737854003906,
-42.964412689208984,
-33.909732818603516,
-37.055198669433594,
-22.980390548706055,
-20.268718719482422,
-14.76205062866211,
-27.887500762939453,
-15.797225952148438,
-15.914306640625,
-13.01901626586914,
-18.053699493408203,
-13.33236312866211,
-13.35921859741211,
-12.12301254272461,
-11.86703109741211,
-47.02234649658203,
-47.69982147216797,
-36.420310974121094,
-50.065345764160156,
-16.742475509643555,
-18.542402267456055,
-26.460208892822266,
-20.307228088378906,
-17.686725616455078,
-21.752883911132812,
-33.17183303833008,
-39.21712112426758,
-14.78198528289795,
-16.775150299072266,
-11.49817180633545,
-15.404842376708984,
-13.141255378723145,
-15.870940208435059,
-15.29050064086914,
-12.36030387878418,
-44.557891845703125,
-55.43851089477539,
-52.66646194458008,
-56.289222717285156,
]
generate_until_RES = [
" The average of $2.50 each is $",
" A robe takes 2 bolts of blue fiber and half",
" $50,000 in repairs.\n\nQuestion",
" He runs 1 sprint 3 times a week.",
" They feed each of her chickens three cups of mixed",
" The price of the glasses is $5, but",
" The total percentage of students who said they like to",
" Carla is downloading a 200 GB file. Normally",
" John drives for 3 hours at a speed of 60",
" Eliza sells 4 tickets to 5 friends so she",
]
ROLLING_RES = [
-3604.61328125,
-19778.67626953125,
-8835.119384765625,
-27963.37841796875,
-7636.4351806640625,
-9491.43603515625,
-41047.35205078125,
-8396.804443359375,
-45966.24645996094,
-7159.05322265625,
]
LM = SteeredModel(
pretrained="EleutherAI/pythia-70m",
device="cpu",
dtype="float32",
steer_path="tests/testconfigs/sparsify_intervention.csv",
)
def test_load_with_sae_lens(self) -> None:
from lm_eval.models.hf_steered import SteeredModel
SteeredModel(
pretrained="EleutherAI/pythia-70m",
device="cpu",
dtype="float32",
steer_path="tests/testconfigs/sae_lens_intervention.csv",
)
assert True
def test_loglikelihood(self) -> None:
res = self.LM.loglikelihood(self.MULTIPLE_CH)
_RES, _res = self.MULTIPLE_CH_RES, [r[0] for r in res]
# log samples to CI
dir_path = Path("test_logs")
dir_path.mkdir(parents=True, exist_ok=True)
file_path = dir_path / f"outputs_log_{self.version_minor}.txt"
file_path = file_path.resolve()
with open(file_path, "w", encoding="utf-8") as f:
f.write("\n".join(str(x) for x in _res))
assert np.allclose(_res, _RES, atol=1e-2)
# check indices for Multiple Choice
argmax_RES, argmax_res = (
np.argmax(np.array(_RES).reshape(-1, 4), axis=1),
np.argmax(np.array(_res).reshape(-1, 4), axis=1),
)
assert (argmax_RES == argmax_res).all()
def test_generate_until(self) -> None:
res = self.LM.generate_until(self.generate_until)
assert res == self.generate_until_RES
def test_loglikelihood_rolling(self) -> None:
res = self.LM.loglikelihood_rolling(self.ROLLING)
assert np.allclose(res, self.ROLLING_RES, atol=1e-1)
def test_tok_encode(self) -> None:
res = self.LM.tok_encode(TEST_STRING)
assert res == [12110, 2534]
def test_tok_decode(self) -> None:
res = self.LM.tok_decode([12110, 2534])
assert res == TEST_STRING
def test_batch_encode(self) -> None:
res = self.LM.tok_batch_encode([TEST_STRING, "bar foo"])[0].tolist()
assert res == [[12110, 2534], [2009, 17374]]
def test_model_generate(self) -> None:
context = self.LM.tok_batch_encode([TEST_STRING])[0]
res = self.LM._model_generate(context, max_length=10, stop=["\n\n"])
res = self.LM.tok_decode(res[0])
assert res == "foo bar\n<bazhang> !info bar"
loader,action,sparse_model,hookpoint,feature_index,steering_coefficient,sae_id,description,
sae_lens,add,gemma-scope-2b-pt-res-canonical,layers.20,12082,10.0,layer_20/width_16k/canonical,increase dogs,
loader,action,sparse_model,hookpoint,feature_index,steering_coefficient,sae_id,description,
sparsify,add,EleutherAI/sae-pythia-70m-32k,layers.3,30,0.1,,,