1. 17 Oct, 2024 2 commits
    • Simplify the `attention` function (#2609) · 59ea38cb
      Daniël de Kok authored
      * Simplify the `attention` function
      
      - Use one definition rather than multiple.
      - Add `key`/`value` arguments, so that we don't need the
        `PREFILL_IN_KVCACHE` constant.
      - Make it kwargs-only (to avoid mixing up the various `Tensor` args).
      
      * Fixup flashinfer support
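      
      A minimal sketch of the idea, not the repository's actual code: the
      argument names and the plain scaled-dot-product body below are
      assumptions; the real function dispatches to backends such as
      flash-attn or flashinfer.
      
      import torch
      
      def attention(
          *,  # keyword-only: the Tensor arguments cannot be swapped positionally
          query: torch.Tensor,   # [q_len, num_heads, head_dim]
          key: torch.Tensor,     # [kv_len, num_heads, head_dim]
          value: torch.Tensor,   # [kv_len, num_heads, head_dim]
          softmax_scale: float,
      ) -> torch.Tensor:
          # One definition for both prefill and decode: `key`/`value` are
          # passed in explicitly, so no PREFILL_IN_KVCACHE constant is needed.
          scores = torch.einsum("qhd,khd->hqk", query, key) * softmax_scale
          weights = torch.softmax(scores, dim=-1)
          return torch.einsum("hqk,khd->qhd", weights, value)
      
      Callers must name every tensor, e.g.
      attention(query=q, key=k, value=v, softmax_scale=head_dim ** -0.5).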
    • Support `e4m3fn` KV cache (#2655) · 5bbe1ce0
      Daniël de Kok authored
      * Support `e4m3fn` KV cache
      
      * Make check more obvious
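      
      A minimal sketch (hypothetical mapping and function names, not TGI's
      actual code) of an explicit dtype check that now accepts `fp8_e4m3fn`
      alongside `fp8_e5m2`:
      
      import torch
      
      # Assumed option strings; torch.float8_e4m3fn/_e5m2 need PyTorch >= 2.1.
      KV_CACHE_DTYPES = {
          "fp8_e5m2": torch.float8_e5m2,
          "fp8_e4m3fn": torch.float8_e4m3fn,
      }
      
      def resolve_kv_cache_dtype(name: str) -> torch.dtype:
          # An explicit membership check keeps the unsupported case obvious.
          if name not in KV_CACHE_DTYPES:
              raise ValueError(f"unsupported KV cache dtype: {name!r}")
          return KV_CACHE_DTYPES[name]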
  2. 08 Oct, 2024 1 commit
  3. 07 Oct, 2024 1 commit
  4. 04 Oct, 2024 1 commit
    • Add basic FP8 KV cache support (#2603) · 2358c2bb
      Daniël de Kok authored
      * Add basic FP8 KV cache support
      
      This change adds rudimentary FP8 KV cache support, enabled by passing
      `--kv-cache-dtype fp8_e5m2` to the launcher; the KV cache is then
      stored in that type (a sketch follows after this message). However,
      support is still limited:
      
      * Only the `fp8_e5m2` type is supported.
      * The KV cache layout is the same as `float16`/`bfloat16` (HND).
      * The FP8 KV cache is only supported for FlashInfer.
      * Loading of scales is not yet supported.
      
      * Fix Cargo.toml
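      
      A minimal sketch (hypothetical function and parameter names, not the
      launcher's actual allocation code) of allocating paged key/value caches
      in `fp8_e5m2` with the same HND layout used for `float16`/`bfloat16`:
      
      import torch
      
      def allocate_kv_cache(
          num_blocks: int,
          num_heads: int,
          block_size: int,
          head_dim: int,
          dtype: torch.dtype = torch.float8_e5m2,
          device: str = "cpu",  # a real server would allocate on "cuda"
      ) -> tuple[torch.Tensor, torch.Tensor]:
          # HND layout: heads come before the per-block token positions,
          # matching the float16/bfloat16 cache layout mentioned above.
          shape = (num_blocks, num_heads, block_size, head_dim)
          key_cache = torch.empty(shape, dtype=dtype, device=device)
          value_cache = torch.empty(shape, dtype=dtype, device=device)
          return key_cache, value_cache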