[Docs] Add some details about what the MoE block needs for the Transformers backend (#28588)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

Commit 32262834 (unverified) · authored Nov 13, 2025 by Harry Mellor · committed by GitHub Nov 13, 2025

Showing with 23 additions and 1 deletion

docs/models/supported_models.md +23 -1

--- a/docs/models/supported_models.md
+++ b/docs/models/supported_models.md
@@ -75,7 +75,12 @@ This section details the necessary modifications to make to a Transformers compa
 To make your model compatible with the Transformers backend, it needs:

 1. `kwargs` passed down through all modules from `MyModel` to `MyAttention`.
-    1. If your model is encoder-only, you must also add `is_causal = False` to `MyAttention`.
+    - If your model is encoder-only:
+        1. Add `is_causal = False` to `MyAttention`.
+    - If your model is mixture-of-experts (MoE):
+        1. Your sparse MoE block must have an attribute called `experts`.
+        2. The class of `experts` (`MyExperts`) must inherit from `nn.ModuleList`.
+        3. `MyExperts.forward` must accept `hidden_states`, `top_k_index`, `top_k_weights`.
 2. `MyAttention` must use `ALL_ATTENTION_FUNCTIONS` to call attention.
 3. `MyModel` must contain `_supports_attention_backend = True`.

@@ -102,6 +107,23 @@ class MyAttention(nn.Module):
        )
        ...

+# Only do this for mixture-of-experts models
+class MyExperts(nn.ModuleList):
+    def forward(self, hidden_states, top_k_index, top_k_weights):
+        ...
+
+# Only do this for mixture-of-experts models
+class MySparseMoEBlock(nn.Module):
+    def __init__(self, config):
+        ...
+        self.experts = MyExperts(config)
+        ...
+
+    def forward(self, hidden_states: torch.Tensor):
+        ...
+        hidden_states = self.experts(hidden_states, top_k_index, top_k_weights)
+        ...
+
 class MyModel(PreTrainedModel):
    _supports_attention_backend = True
 ```
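
The three MoE requirements added in this diff are structural, so they can be checked mechanically. The following is a minimal sketch of such a check, not anything shipped by vLLM or Transformers: `check_moe_block`, `StandInModuleList`, `GoodExperts`, `BadExperts`, and `Block` are hypothetical names introduced for illustration. The stand-in base class exists only so the sketch runs without PyTorch; with a real model you would pass `torch.nn.ModuleList` as `module_list_cls`.

```python
import inspect

# Parameter names that the documentation change requires `MyExperts.forward` to accept.
REQUIRED_FORWARD_PARAMS = ("hidden_states", "top_k_index", "top_k_weights")


def check_moe_block(block, module_list_cls):
    """Return a list of unmet MoE requirements (empty if all are met).

    Hypothetical helper for illustration only. `module_list_cls` is the
    required base class; pass `torch.nn.ModuleList` for a real model.
    """
    # Requirement 1: the sparse MoE block has an attribute called `experts`.
    experts = getattr(block, "experts", None)
    if experts is None:
        return ["sparse MoE block has no `experts` attribute"]

    problems = []
    # Requirement 2: the class of `experts` inherits from the module-list class.
    if not isinstance(experts, module_list_cls):
        problems.append(
            f"`experts` is a {type(experts).__name__}, "
            f"which does not inherit from {module_list_cls.__name__}"
        )

    # Requirement 3: `forward` accepts the three named parameters.
    # Assumption: parameters are declared by name, not hidden behind **kwargs.
    forward = getattr(experts, "forward", None)
    if forward is None:
        problems.append("`experts` has no `forward` method")
    else:
        params = inspect.signature(forward).parameters
        missing = [name for name in REQUIRED_FORWARD_PARAMS if name not in params]
        if missing:
            problems.append("`forward` is missing parameters: " + ", ".join(missing))
    return problems


class StandInModuleList(list):
    """Stand-in for torch.nn.ModuleList so this sketch runs without PyTorch."""


class GoodExperts(StandInModuleList):
    def forward(self, hidden_states, top_k_index, top_k_weights):
        return hidden_states


class BadExperts(StandInModuleList):
    def forward(self, hidden_states):
        return hidden_states


class Block:
    """Stand-in for a sparse MoE block that stores its experts on `self.experts`."""

    def __init__(self, experts):
        self.experts = experts


good_problems = check_moe_block(Block(GoodExperts()), StandInModuleList)
bad_problems = check_moe_block(Block(BadExperts()), StandInModuleList)
no_experts_problems = check_moe_block(object(), StandInModuleList)
```

Here `good_problems` is empty, while `bad_problems` reports the two missing `forward` parameters. The check only inspects names and class hierarchy; it says nothing about tensor shapes or whether the backend will actually accept the model.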