1. 26 Nov, 2024 1 commit
  2. 25 Nov, 2024 1 commit
  3. 22 Nov, 2024 1 commit
  4. 21 Nov, 2024 1 commit
  5. 20 Nov, 2024 2 commits
  6. 19 Nov, 2024 3 commits
  7. 18 Nov, 2024 4 commits
  8. 17 Nov, 2024 1 commit
    • Remove vLLM dependency for CUDA (#2751) · 52e48739
      Daniël de Kok authored
      * Remove vLLM dependency for CUDA
      
      This change adds `attention-kernels` as a dependency for paged
      attention and cache reshaping. With that, we don't use vLLM
      anywhere for CUDA.
      
      Tested run (since we don't have paged attention in CI):
      
      ```
      ❯ ATTENTION=paged python -m pytest integration-tests -k "llama and awq" --release
      [...]
      5 snapshots passed.
      ```
      
      * Fix clippy warning
      52e48739
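      For context: a minimal pure-PyTorch sketch of what the "cache reshaping" step of paged attention does, i.e. scattering new key/value vectors into block-structured cache tensors via a slot mapping. The cache layout and function name below are assumptions for illustration; the real work is done by fused CUDA kernels in `attention-kernels`.

      ```
      import torch

      # Illustrative paged KV cache write ("reshape_and_cache"-style), assuming a
      # [num_blocks, block_size, num_kv_heads, head_dim] cache layout. The fused
      # kernel does this scatter in one launch; this sketch uses plain indexing.
      def write_to_paged_cache(key, value, key_cache, value_cache, slot_mapping):
          # key/value: [num_tokens, num_kv_heads, head_dim]
          # slot_mapping: [num_tokens], flat slot = block_id * block_size + offset
          block_size = key_cache.shape[1]
          block_ids = slot_mapping // block_size
          offsets = slot_mapping % block_size
          key_cache[block_ids, offsets] = key
          value_cache[block_ids, offsets] = value

      num_blocks, block_size, num_kv_heads, head_dim = 16, 16, 2, 64
      key_cache = torch.zeros(num_blocks, block_size, num_kv_heads, head_dim)
      value_cache = torch.zeros_like(key_cache)
      key = torch.randn(3, num_kv_heads, head_dim)
      value = torch.randn(3, num_kv_heads, head_dim)
      slots = torch.tensor([0, 1, 17])  # third token lands in block 1, offset 1
      write_to_paged_cache(key, value, key_cache, value_cache, slots)
      ```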
  9. 15 Nov, 2024 4 commits
  10. 10 Nov, 2024 1 commit
    • Add initial support for compressed-tensors checkpoints (#2732) · a7850008
      Daniël de Kok authored
      compressed-tensors is a safetensors extension for sparse, quantized
      tensors. The format is more powerful than earlier AWQ/GPTQ/FP8
      quantization, because
      
      - Different quantizer configurations can be used for different targets.
      - The format can specify input/output quantizers in addition to weight
        quantizers.
      - Configurable exclusions for quantization.
      
      This change adds a dependency on the `compressed-tensors` package for
      its configuration parsing and layer matching functionality.
      
      The following types of quantization are supported in this PR:
      
      - W8A16 and W4A16 INT using GPTQ-Marlin kernels.
      - W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels.
      
      Support for other quantization types will be added in subsequent PRs.
      a7850008
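      As a hedged illustration of the configuration this parses: roughly what a W4A16 INT `config_groups` entry in a checkpoint's `quantization_config` can look like. Field names follow my reading of the compressed-tensors schema, and the concrete values are made up.

      ```
      # Sketch of a compressed-tensors style quantization config (illustrative values).
      quantization_config = {
          "quant_method": "compressed-tensors",
          "config_groups": {
              "group_0": {
                  # Different targets can get different quantizer configurations.
                  "targets": ["Linear"],
                  # Weight quantizer: 4-bit symmetric INT with per-group scales.
                  "weights": {
                      "num_bits": 4,
                      "type": "int",
                      "symmetric": True,
                      "strategy": "group",
                      "group_size": 128,
                  },
                  # Input/output quantizers can be specified too; None keeps
                  # activations in 16-bit (W4A16).
                  "input_activations": None,
              }
          },
          # Configurable exclusions: modules left unquantized.
          "ignore": ["lm_head"],
      }
      ```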
  11. 04 Nov, 2024 4 commits
  12. 02 Nov, 2024 1 commit
  13. 01 Nov, 2024 1 commit
    • fix cuda graphs for qwen2-vl (#2708) · 01dacf8e
      drbh authored
      * feat: support multidimensional position ids on batch to enable cuda graphs on qwen2-vl
      
      * fix: only check model type if config exists
      
      * fix: adjust sharding and lm head logic
      
      * fix qwen2 failure in intel cpu
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      
      * fix: return correct shape logits and add streaming test
      
      * fix: remove unused import and refactor test
      
      ---------
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      01dacf8e
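      For background, Qwen2-VL uses multi-dimensional (temporal/height/width) rotary position ids rather than a single 1-D sequence, which is why the batch needs to carry a fixed-shape multidimensional tensor for CUDA graphs. A minimal sketch of the shapes involved; the text-token convention and padding scheme here are assumptions for illustration, not TGI's exact code.

      ```
      import torch

      # Text-only tokens get the same index along all three rotary axes
      # (temporal, height, width), so a plain arange is repeated.
      def text_position_ids(seq_len: int) -> torch.Tensor:
          ids = torch.arange(seq_len)
          return ids.unsqueeze(0).expand(3, -1)  # shape: [3, seq_len]

      # For CUDA graphs the batch needs a fixed-shape position id tensor, so
      # per-request [3, seq_len] ids are padded and stacked into [batch, 3, max_len].
      def batch_position_ids(lengths, max_len):
          batch = torch.zeros(len(lengths), 3, max_len, dtype=torch.long)
          for i, n in enumerate(lengths):
              batch[i, :, :n] = text_position_ids(n)
          return batch

      print(batch_position_ids([4, 2], max_len=4).shape)  # torch.Size([2, 3, 4])
      ```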
  14. 30 Oct, 2024 1 commit
    • Support qwen2 vl (#2689) · befd9f67
      drbh authored
      * feat: add support for qwen2 vl model
      
      * feat: fix token padding, enable warmup and process basic request
      
      * fix: improve get_position_ids, add lift embed_tokens
      
      * fix: remove get_cos_sin_hack dev function
      
      * feat: add simple test chat with message and text
      
      * fix: lint test
      
      * fix: adjust positional embeddings for multi dimensional position ids
      
      * fix: update docs and lint unused vars
      
      * fix: include linted file
      
      * fix: add norm after text output
      
      * fix: format model file
      
      * fix: adjust for ruff lints
      
      * fix: remove unused rotate_half
      
      * feat: refactors and calc num features
      
      * fix: prefer position_ids passed from vlm causal lm and reset ids on batch
      
      * fix: adjust get_position_ids if not available and add required args to signatures
      
      * fix: adjust resize case for qwen2_vl warmup
      
      * fix: avoid qwen2 vl specific paths with qwen2
      befd9f67
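      As a hedged note on the "calc num features" step above: the number of placeholder tokens an image contributes is derived from the vision patch grid. A back-of-the-envelope sketch; `patch_size=14` and `spatial_merge_size=2` are the usual Qwen2-VL defaults and are assumed here rather than taken from TGI's code.

      ```
      # Rough sketch: placeholder tokens per image for a Qwen2-VL-style vision tower.
      def num_image_features(height: int, width: int,
                             patch_size: int = 14,
                             spatial_merge_size: int = 2) -> int:
          grid_h = height // patch_size
          grid_w = width // patch_size
          # Adjacent patches are merged spatially before entering the language model.
          return (grid_h * grid_w) // (spatial_merge_size ** 2)

      print(num_image_features(448, 448))  # (32 * 32) / 4 = 256 tokens
      ```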
  15. 28 Oct, 2024 4 commits
  16. 25 Oct, 2024 3 commits
  17. 24 Oct, 2024 2 commits
    • Add support for FP8 KV cache scales (#2628) · eab07f74
      Daniël de Kok authored
      * Add support for FP8 KV cache scales
      
      Since FP8 only has limited dynamic range, we can scale keys/values
      before storing them into the cache (and unscale them in attention). To
      avoid rescaling the cache as the absmax values change, good scales are
      usually determined per layer using calibration data and stored
      in the checkpoint.
      
      This change adds support for using key-value scales and loading them
      from checkpoints in the two most common formats:
      
      - Separate per-layer `k_scale` and `v_scale` scalars.
      - Per-layer `kv_scale` scalar (older format).
      
      Currently, scales are only used with a `float8_e4m3fn` cache.
      
      Besides adding support for key/value scales, the `fp8_quantize` function
      is also extended to support quantization with a kernel vendored from
      vLLM. This is slightly faster than the PyTorch implementation, but also
      scales in FP32, potentially improving accuracy.
      
      * Update FP8 KV cache test to use checkpoint with scales
      
      * `can_scale`: check that the attention is flashinfer
      eab07f74
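      A minimal sketch of how per-layer key/value scales interact with a `float8_e4m3fn` cache; the convention assumed here (checkpoint scales are dequantization multipliers, so quantization divides by them) is an illustration, not the exact `fp8_quantize` code.

      ```
      import torch

      # Assumed convention: key ≈ key_fp8.float() * k_scale, so the scale from the
      # checkpoint is applied as a divisor when writing into the FP8 cache.
      def quantize_kv(key, value, k_scale, v_scale):
          key_fp8 = (key / k_scale).to(torch.float8_e4m3fn)
          value_fp8 = (value / v_scale).to(torch.float8_e4m3fn)
          return key_fp8, value_fp8

      def dequantize_kv(key_fp8, value_fp8, k_scale, v_scale):
          return key_fp8.float() * k_scale, value_fp8.float() * v_scale

      # Older checkpoints ship a single per-layer kv_scale; it is used for both.
      k_scale = v_scale = 0.02
      key, value = torch.randn(4, 2, 64), torch.randn(4, 2, 64)
      k8, v8 = quantize_kv(key, value, k_scale, v_scale)
      k_back, _ = dequantize_kv(k8, v8, k_scale, v_scale)
      print((key - k_back).abs().max())  # small quantization error
      ```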
  18. 23 Oct, 2024 3 commits
  19. 19 Oct, 2024 1 commit
    • Make handling of FP8 scales more consistent (#2666) · 5e0fb468
      Daniël de Kok authored
      Change `fp8_quantize` so that we can pass around reciprocals everywhere,
      so scales are always passed around in the checkpoint format.
      
      I also noticed that we ignore any input scales that we might have when
      fbgemm is available. Skip this path if we already have a scale.
      5e0fb468
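      To illustrate the convention being unified here (checkpoint-style dequantization scales vs. the reciprocals applied at quantization time), a small sketch; the function name and signature are illustrative, not TGI's actual `fp8_quantize`.

      ```
      import torch

      # Checkpoint convention: `scale` is the dequantization multiplier, so
      # quantization multiplies by its reciprocal. Sticking to one convention
      # avoids flipping between scale and 1/scale at every call site.
      def fp8_quantize_sketch(weight, scale=None):
          if scale is None:
              # Dynamic scaling: map the tensor's absmax onto the FP8 E4M3 max (448).
              scale = weight.abs().max().item() / 448.0
          qweight = (weight / scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)
          return qweight, scale  # the returned scale stays in checkpoint format

      w = torch.randn(128, 128)
      qw, s = fp8_quantize_sketch(w)
      w_back = qw.float() * s
      print((w - w_back).abs().max())  # small reconstruction error
      ```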
  20. 18 Oct, 2024 1 commit