1. 24 Oct, 2024 1 commit
    • Add support for FP8 KV cache scales (#2628) · eab07f74
      Daniël de Kok authored
      * Add support for FP8 KV cache scales
      
      Since FP8 only has limited dynamic range, we can scale keys/values
      before storing them into the cache (and unscale them in attention). To
      avoid rescaling the cache as the absmax values change, good scales are
      usually determined per layer using calibration data and stored
      in the checkpoint.
      
      This change adds support for using key-value scales and loading them
      from checkpoints in the two most common formats:
      
      - Separate per-layer `k_scale` and `v_scale` scalars.
      - Per-layer `kv_scale` scalar (older format).
      
      Currently, scales are only used with a `float8_e4m3fn` cache.
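
      As a rough illustration of the two formats, loading the scales might look
      like the sketch below. `load_kv_scales`, the `weights` dict of checkpoint
      tensors, and the `prefix` naming are illustrative assumptions, not TGI's
      actual loading code.

      ```python
      def load_kv_scales(weights: dict, prefix: str) -> tuple[float, float]:
          """Read per-layer KV cache scales, supporting both checkpoint formats."""
          if f"{prefix}.k_scale" in weights and f"{prefix}.v_scale" in weights:
              # Newer format: separate scalars for keys and values.
              k_scale = weights[f"{prefix}.k_scale"].item()
              v_scale = weights[f"{prefix}.v_scale"].item()
          elif f"{prefix}.kv_scale" in weights:
              # Older format: a single scalar shared by keys and values.
              k_scale = v_scale = weights[f"{prefix}.kv_scale"].item()
          else:
              # No calibrated scales in the checkpoint: fall back to identity.
              k_scale = v_scale = 1.0
          return k_scale, v_scale
      ```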
      
      Besides adding support for key/value scales, the `fp8_quantize` function
      is also extended to support quantization with a kernel vendored from
      vLLM. This is slightly faster than the PyTorch implementation and also
      performs the scaling in FP32, potentially improving accuracy.
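
      For reference, plain per-tensor FP8 quantization amounts to roughly the
      following. This is a simplified sketch of the idea, not the vendored vLLM
      kernel, and `fp8_quantize_sketch` is an illustrative name.

      ```python
      import torch

      FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

      def fp8_quantize_sketch(x: torch.Tensor, scale: torch.Tensor | None = None):
          # Work in FP32 so the scaling itself does not lose precision.
          x_f32 = x.to(torch.float32)
          if scale is None:
              # Dynamic per-tensor scale derived from the absolute maximum.
              scale = x_f32.abs().max().clamp(min=1e-12) / FP8_MAX
          q = (x_f32 / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
          # Dequantize later as q.to(torch.float32) * scale.
          return q, scale
      ```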
      
      * Update FP8 KV cache test to use checkpoint with scales
      
      * `can_scale`: check that the attention is flashinfer
  2. 28 Sep, 2024 1 commit
  3. 24 Sep, 2024 1 commit
  4. 20 Aug, 2024 1 commit
    • Prefix caching (#2402) · b70ae096
      Nicolas Patry authored
      * Prefix caching WIP
      
      * Fixing prefix attention.
      
      * Fixing flashinfer import.
      
      * Fixing black.
      
      * Fixing medusa (still wrong outputs, but functional).
      
      * Just medusa values now.
      
      * Fixing medusa without prefix caching.
      
      * Fixing prefix caching.
      
      * Medusa requires reshaping.
      
      * Removing the logs.
      
      * Remove router.nix
      
      * Fixup:
      
      - Remove logs
      - Disable VLMs (they do not work)
      - Disable prefix caching when user wants prefill logprobs.
      
      * Update flake.lock
      
      ---------
      Co-authored-by: Daniël de Kok <me@danieldk.eu>
  5. 09 Aug, 2024 1 commit
    • Add FlashInfer support (#2354) · 7830de15
      Daniël de Kok authored
      This change adds support for FlashInfer. FlashInfer can be enabled using
      `FLASH_INFER=1` and is currently only implemented in `FlashCausalLM`.
      Since this functionality is for testing only, FlashInfer is not
      installed anywhere yet.
      
      The FlashInfer API is quite different from FlashAttention/vLLM in that
      it requires more global bookkeeping:
      
      * A wrapper class needs to be constructed (which we just call *state*).
        Since this is fairly expensive (due to pinned host memory allocation),
        we only do this once in a FlashCausalLM instance or for each CUDA
        Graph size.
      * Each model forward call needs to be wrapped in `begin_forward` and
        `end_forward`. This sets up data structures that can be reused for all
        calls to attention for that forward call.
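
      A minimal sketch of that one-time construction, caching one state per CUDA
      Graph size plus one for eager execution; `create_state` stands in for
      whatever builds the FlashInfer wrapper and is not the real API:

      ```python
      class FlashInferStateCache:
          """Illustrative cache of FlashInfer states, keyed by CUDA Graph size."""

          def __init__(self, create_state):
              # `create_state` is a hypothetical factory for the (expensive,
              # pinned-memory backed) FlashInfer wrapper.
              self._create_state = create_state
              self._states = {}  # graph size (or None for eager) -> state

          def get(self, graph_size=None):
              if graph_size not in self._states:
                  self._states[graph_size] = self._create_state(graph_size)
              return self._states[graph_size]
      ```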
      
      When calling attention, we need access to the state object. To avoid
      passing an argument down the call chain (which would require changes to
      all models), we use a context variable.
      
      Each model forward call is wrapped using a context manager that does all
      the bookkeeping for such a call:
      
      * Set the context variable to the forward call's state.
      * Call `begin_forward` on the state.
      * Yield.
      * Call `end_forward` on the state.
      * Reset the context variable.
      
      We cannot use a single shared global variable for this, since e.g. CUDA
      Graphs of different sizes each have their own state.
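
      A minimal sketch of that bookkeeping, assuming a state object that exposes
      `begin_forward` and `end_forward` as described; the names
      `_forward_state`, `use_forward_state`, and `attention_state` are
      illustrative, not the actual implementation:

      ```python
      from contextlib import contextmanager
      from contextvars import ContextVar

      _forward_state: ContextVar = ContextVar("flashinfer_forward_state")

      @contextmanager
      def use_forward_state(state, **forward_kwargs):
          # Make the state visible to attention layers without threading it
          # through every model's call signature.
          token = _forward_state.set(state)
          state.begin_forward(**forward_kwargs)  # per-forward-call setup
          try:
              yield
          finally:
              state.end_forward()          # tear down per-forward data structures
              _forward_state.reset(token)  # restore the previous context value

      def attention_state():
          # Called from the attention implementation to fetch the current state.
          return _forward_state.get()
      ```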