faster startup of vLLM (#982)

* update

---------

Co-authored-by: Robert Irvine <robert@seamlessml.com>

Commit 4b5bcf89 (unverified), authored Sep 08, 2023 by Robert Irvine; committed by GitHub Sep 08, 2023

Showing 1 changed file with 3 additions and 2 deletions

vllm/model_executor/layers/attention.py +3 -2

--- a/vllm/model_executor/layers/attention.py
+++ b/vllm/model_executor/layers/attention.py
@@ -259,8 +259,9 @@ class PagedAttentionWithRoPE(PagedAttention):
        self.is_neox_style = is_neox_style

        # Create the cos and sin cache.
-        inv_freq = 1.0 / (base**(torch.arange(0, rotary_dim, 2) / rotary_dim))
-        t = torch.arange(max_position).float()
+        inv_freq = 1.0 / (base**(
+            torch.arange(0, rotary_dim, 2, device="cuda") / rotary_dim))
+        t = torch.arange(max_position, device="cuda").float()
        freqs = torch.einsum("i,j -> ij", t, inv_freq.float())
        cos = freqs.cos()
        sin = freqs.sin()
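
For readers who want to try the change outside vLLM, here is a minimal standalone sketch of the same cache construction. The helper name `build_rope_cache`, its default `base`, and the CPU fallback are assumptions made for illustration; they are not part of this commit, which hard-codes `device="cuda"`. The idea is that both `torch.arange` calls allocate directly on the target device, so the einsum and the `cos`/`sin` evaluation presumably run there as well instead of on the CPU.

```python
from typing import Tuple

import torch


def build_rope_cache(rotary_dim: int,
                     max_position: int,
                     base: float = 10000.0,
                     device: str = "cpu") -> Tuple[torch.Tensor, torch.Tensor]:
    """Hypothetical helper mirroring the cos/sin cache code in the diff."""
    # Allocate the index tensors on the target device from the start, so
    # every later operation stays on that device.
    inv_freq = 1.0 / (base**(
        torch.arange(0, rotary_dim, 2, device=device) / rotary_dim))
    t = torch.arange(max_position, device=device).float()
    # Outer product: freqs[i, j] = t[i] * inv_freq[j]
    freqs = torch.einsum("i,j -> ij", t, inv_freq.float())
    return freqs.cos(), freqs.sin()


# Assumption for portability: fall back to CPU when no GPU is present.
device = "cuda" if torch.cuda.is_available() else "cpu"
cos, sin = build_rope_cache(rotary_dim=128, max_position=2048, device=device)
```

Each returned tensor has shape `(max_position, rotary_dim // 2)`. Any speedup depends on the hardware and on `max_position`, so it is worth timing both devices before relying on it.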