Fix incorrect cache allocation with multi-query (#2203)

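Previously, no KV cache memory was allocated for multi-query models (1 KV head) when sharded across more than one process: integer-dividing the single head by the process group size gave zero heads per shard. Fixes Starcoder et al.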

Commit 153fcf77, authored Jul 08, 2024 by Daniël de Kok, committed by GitHub Jul 08, 2024

Showing with 6 additions and 1 deletion

server/text_generation_server/models/flash_causal_lm.py +6 -1

--- a/server/text_generation_server/models/flash_causal_lm.py
+++ b/server/text_generation_server/models/flash_causal_lm.py
@@ -912,7 +912,12 @@ class FlashCausalLM(Model):
                    break
            if num_kv_heads is None:
                raise ValueError("Cannot get the number of key/value heads")
-        self.num_kv_heads = num_kv_heads // self.process_group.size()
+        self.num_kv_heads = (
+            num_kv_heads // self.process_group.size()
+            if num_kv_heads > 1
+            else num_kv_heads
+        )
+        assert self.num_kv_heads > 0
        self.head_size = config.hidden_size // config.num_attention_heads

        self.cuda_graphs = {}
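
The effect of the change can be reproduced outside the server with plain integer arithmetic. The sketch below is illustrative only: `kv_heads_per_shard_old` and `kv_heads_per_shard_new` are hypothetical helper names, and `world_size` stands in for `self.process_group.size()`; neither function exists in the repository.

```python
def kv_heads_per_shard_old(num_kv_heads: int, world_size: int) -> int:
    # Pre-patch expression: always divide the KV heads across the shards.
    return num_kv_heads // world_size


def kv_heads_per_shard_new(num_kv_heads: int, world_size: int) -> int:
    # Patched expression: a single KV head (multi-query) is kept whole on
    # every shard instead of being divided.
    per_shard = num_kv_heads // world_size if num_kv_heads > 1 else num_kv_heads
    # Same guard as the patch: a shard must end up with at least one KV head.
    assert per_shard > 0
    return per_shard


if __name__ == "__main__":
    # Multi-query model (1 KV head) sharded across 2 processes.
    print(kv_heads_per_shard_old(1, 2))  # 0 heads, so no cache is allocated
    print(kv_heads_per_shard_new(1, 2))  # 1 head on every shard
    # Models with several KV heads are still split as before.
    print(kv_heads_per_shard_new(8, 2))  # 4
```

Note that the added assertion still fires when there is more than one KV head but fewer heads than shards (for example 2 heads across 4 processes), since the division is only skipped for the single-head case.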