1. 12 Aug, 2024 4 commits
    • feat: validate template variables before apply and improve sliding wi… (#2403) · 155f9c98
      drbh authored
      * feat: validate template variables before apply and improve sliding window check
      
      * fix: improve missing template var test
    • Add support for prefix caching to the v3 router (#2392) · 8deeaca4
      Daniël de Kok authored
      This change adds support for prefix caching to the v3 router. This
      is broken up from the backend support to ease reviewing.
      
      For now, prefix caching is only enabled with `USE_PREFIX_CACHING=1`.
      In this case, the router switches to `RadixAllocator`. This
      allocator uses a radix trie to keep track of prefills that were
      seen previously. If a new prefill is a prefix of a previously seen
      prefill, the router sends a request with `prefix_len>0`, which
      the backend can use to reuse KV blocks from the cache rather than
      recomputing them.
      
      Even though backend support is not added in this PR, the backend
      will still work with prefix caching enabled. The prefix lengths
      are just ignored and not used.
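
The lookup described above can be illustrated with a minimal sketch. It is not the router's `RadixAllocator`: it assumes a plain, uncompressed token trie (a radix trie would additionally merge single-child chains), it reports the longest previously seen prefix of each new prefill, and it ignores block sizes, eviction, and allocation. The names `PrefixTrie`, `longest_prefix_len`, and `route` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TrieNode:
    # Maps a token id to the child node reached by that token.
    children: Dict[int, "TrieNode"] = field(default_factory=dict)


class PrefixTrie:
    """Uncompressed token trie (hypothetical stand-in for a radix trie)."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, tokens: List[int]) -> None:
        # Walk the trie, creating nodes for tokens not seen on this path.
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())

    def longest_prefix_len(self, tokens: List[int]) -> int:
        # Count how many leading tokens match a path already in the trie.
        node = self.root
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node = child
            matched += 1
        return matched


def route(trie: PrefixTrie, tokens: List[int]) -> int:
    """Return an illustrative prefix_len for a prefill, then record it."""
    prefix_len = trie.longest_prefix_len(tokens)
    trie.insert(tokens)
    return prefix_len
```

In this sketch, a backend that ignores the returned `prefix_len` still behaves correctly; it simply recomputes everything, which matches the note above that prefix lengths are ignored until backend support lands.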
    • Fixing import exl2 (#2399) · 84bc3d7b
      Nicolas Patry authored
    • Upgrade fbgemm (#2398) · 9c739651
      Nicolas Patry authored
      * Upgrade fbgemm
      
      * Fix fbgemm version
  2. 09 Aug, 2024 3 commits
    • Using an enum for flash backends (paged/flashdecoding/flashinfer) (#2385) · 7a48a847
      Nicolas Patry authored
      * Using an enum for flash backends (paged/flashdecoding/flashinfer)
      
      * Early exit on server too.
      
      * Clippy.
      
      * Fix clippy and fmt.
    • Update documentation for Supported models (#2386) · b2b9c427
      Vaibhav Srivastav authored
      * Minor doc fixes
      
      * up.
      
      * Other minor updates.
    • Add FlashInfer support (#2354) · 7830de15
      Daniël de Kok authored
      This change adds support for FlashInfer. FlashInfer can be enabled using
      `FLASH_INFER=1` and is currently only implemented in `FlashCausalLM`.
      Since this functionality is currently only for testing, FlashInfer is
      not installed anywhere yet.
      
      The FlashInfer API is quite different from FlashAttention/vLLM in that
      it requires more global bookkeeping:
      
      * A wrapper class needs to be constructed (which we just call *state*).
        Since this is fairly expensive (due to pinned host memory allocation),
        we only do this once in a FlashCausalLM instance or for each CUDA
        Graph size.
      * Each model forward call needs to be wrapped in `begin_forward` and
        `end_forward`. This sets up data structures that can be reused for all
        calls to attention for that forward call.
      
      When calling attention, we need access to the state object. To avoid
      passing an argument down the call chain (which would require changes to
      all models), we use a context variable.
      
      Each model forward call is wrapped using a context manager that does all
      the bookkeeping for such a call:
      
      * Set the context variable to the forward call's state.
      * Call `begin_forward` on the state.
      * Yield.
      * Call `end_forward` on the state.
      * Reset the context variable.
      
      We cannot use a single shared global variable for this, since e.g. CUDA
      Graphs of different sizes each have their own state.
  3. 08 Aug, 2024 6 commits
  4. 07 Aug, 2024 1 commit
  5. 06 Aug, 2024 3 commits
  6. 05 Aug, 2024 1 commit
    • fix: attempt forward on flash attn2 to check hardware support (#2335) · 215ed3ad
      drbh authored
      * fix: attempt forward on flash attn2 to check hardware support
      
      * fix: warn window_size_left when using flash attn 1
      
      * fix: prefer version check over test op and avoid window_size_left if not flash attn2
      
      * fix: improve conditional and error message
      
      * fix: update sliding window conditional
      
      * fix: simplify changes and revert model changes
      
      * fix: avoid changing conditional
      
      * fix: typo tweak
  7. 01 Aug, 2024 2 commits
  8. 31 Jul, 2024 2 commits
  9. 30 Jul, 2024 1 commit
  10. 29 Jul, 2024 2 commits
  11. 26 Jul, 2024 2 commits
    • feat: add ruff and resolve issue (#2262) · bab02ff2
      drbh authored
      * feat: add ruff and resolve issue
      
      * fix: update client exports and adjust after rebase
      
      * fix: adjust syntax to avoid circular import
      
      * fix: adjust client ruff settings
      
      * fix: lint and refactor import check and avoid model enum as global names
      
      * fix: improve fbgemm_gpu check and lints
      
      * fix: update lints
      
      * fix: prefer comparing model enum over str
      
      * fix: adjust lints and ignore specific rules
      
      * fix: avoid unneeded quantize check
  12. 25 Jul, 2024 1 commit
  13. 24 Jul, 2024 4 commits
  14. 23 Jul, 2024 8 commits