  1. 16 Jan, 2026 1 commit
  2. 05 Aug, 2025 1 commit
  3. 04 Aug, 2025 15 commits
  4. 31 Jul, 2025 3 commits
    • tests · f1c73840
      Michael Yang authored
    • bf16 · 4a8fc3f9
      Michael Yang authored
    • kvcache: Enable SWA to retain additional entries · 4183bb05
      Jesse Gross authored
      Models that use sliding window attention can only resume a sequence
      from the cache if it falls within the saved windows. This works well
      if the next message picks up where the old one left off. However, it
      generally prevents a partial prefix match unless the entire conversation
      falls within the sliding window.
      
      This can be a problem with reasoning models where the traces are
      supposed to be removed from future messages, forcing the entire
      history to be re-evaluated.
      
      This change allows models to specify that a larger portion of the
      history be retained in memory, making partial resumption more likely
      (see the sketch below). Token generation still respects the window
      that the model was trained on.
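A minimal sketch of the retention idea described in commit 4183bb05, assuming a simplified cache that tracks only token positions. The swaCache type, its fields, and the canResume helper are hypothetical names for illustration, not the actual Ollama kvcache API; they only show why keeping more history than the attention window enables prefix resumption while masking still uses the trained window.

```go
// Hypothetical sketch, not the actual Ollama kvcache code: retain more
// cached history than the attention window so a partial prefix match can
// still resume, while the window used for generation stays at the trained
// size.
package main

import "fmt"

type swaCache struct {
	windowSize int   // attention window the model was trained on
	retention  int   // how much history to keep cached (>= windowSize)
	positions  []int // token positions currently held, oldest first
}

// add caches one position and evicts anything older than the retention
// range (plain SWA would evict based on windowSize instead).
func (c *swaCache) add(pos int) {
	c.positions = append(c.positions, pos)
	for len(c.positions) > 0 && c.positions[0] <= pos-c.retention {
		c.positions = c.positions[1:]
	}
}

// canResume reports whether the cache still holds every position the
// attention window needs to continue from a prefix of prefixLen tokens,
// i.e. positions [prefixLen-windowSize, prefixLen).
func (c *swaCache) canResume(prefixLen int) bool {
	if len(c.positions) == 0 {
		return prefixLen == 0
	}
	needFrom := prefixLen - c.windowSize
	if needFrom < 0 {
		needFrom = 0
	}
	oldest := c.positions[0]
	newest := c.positions[len(c.positions)-1]
	return oldest <= needFrom && newest >= prefixLen-1
}

func main() {
	swaOnly := &swaCache{windowSize: 512, retention: 512}
	extra := &swaCache{windowSize: 512, retention: 2048}
	for pos := 0; pos < 3000; pos++ {
		swaOnly.add(pos)
		extra.add(pos)
	}
	// Resuming from a 2500-token prefix (e.g. after stripping reasoning
	// traces) fails with window-only eviction but works with retention.
	fmt.Println("window-only:", swaOnly.canResume(2500)) // false
	fmt.Println("retention:  ", extra.canResume(2500))   // true
}
```

With window-only eviction the 2500-token prefix can no longer be resumed and the whole history would need to be re-evaluated; the larger retention keeps enough entries to continue from it.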
  5. 30 Jul, 2025 3 commits
  6. 29 Jul, 2025 3 commits
  7. 28 Jul, 2025 1 commit
  8. 27 Jul, 2025 1 commit
  9. 25 Jul, 2025 2 commits
    • kvcache: Group shift operations into batches · 764be748
      Jesse Gross authored
      Currently, when we need to do a shift on the cache, it is one
      RoPE operation on the entire size of the cache (per layer). In
      some cases, this can create a compute graph that is larger than
      the forward pass since the forward pass is working in batches.
      Since we don't consider shifting in our memory estimates, it's
      possible for this to cause a crash if we run out of memory.
      
      By limiting the RoPE calls to batch-size chunks, we ensure that the
      shift will never exceed the size of the forward pass, since the
      forward pass will also contain a RoPE of the same size (see the
      sketch after this group). This does not have a significant impact on
      performance, since RoPE is a math operation whose cost is mostly
      proportional to the size of its inputs.
      
      In theory, defrag could have the same issue, since it also creates a
      compute graph outside of the forward pass; however, since it only
      performs copies, it does not require any working space.
    • b72e5adb
      Ruyut authored
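A minimal sketch of the chunked shift described in commit 764be748, assuming the cache is just a slice of positions. ropeShift and shiftInBatches are illustrative stand-ins, not the actual implementation; the point is only that each chunk is bounded by the batch size, so no single shift is larger than what a forward pass already builds.

```go
// Hypothetical sketch, not the actual Ollama implementation: apply a
// position shift to cached entries in batch-size chunks instead of one
// operation over the entire cache.
package main

import "fmt"

// ropeShift stands in for the per-layer RoPE re-rotation; here it simply
// adds the shift to each cached position.
func ropeShift(positions []int32, shift int32) {
	for i := range positions {
		positions[i] += shift
	}
}

// shiftInBatches walks the cache in chunks of at most batchSize entries,
// issuing one bounded shift per chunk.
func shiftInBatches(cache []int32, shift int32, batchSize int) {
	for start := 0; start < len(cache); start += batchSize {
		end := start + batchSize
		if end > len(cache) {
			end = len(cache)
		}
		ropeShift(cache[start:end], shift)
	}
}

func main() {
	cache := make([]int32, 10)
	for i := range cache {
		cache[i] = int32(i)
	}
	// Shift everything back by 3 positions (e.g. after trimming the start
	// of a sequence), in chunks of at most 4 entries.
	shiftInBatches(cache, -3, 4)
	fmt.Println(cache)
}
```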
  10. 24 Jul, 2025 2 commits
  11. 23 Jul, 2025 2 commits
  12. 22 Jul, 2025 2 commits
  13. 20 Jul, 2025 2 commits
  14. 19 Jul, 2025 1 commit
  15. 17 Jul, 2025 1 commit