    Add basic FP8 KV cache support (#2603) · 2358c2bb
    Daniël de Kok authored
    * Add basic FP8 KV cache support
    
    This change adds rudimentary FP8 KV cache support. To enable it, pass
    `--kv-cache-dtype fp8_e5m2` to the launcher; the KV cache is then stored
    in that type. Support is still limited, however:
    
    * Only the `fp8_e5m2` type is supported.
    * The KV cache layout is the same as for `float16`/`bfloat16` (HND).
    * The FP8 KV cache is only supported for FlashInfer.
    * Loading of scales is not yet supported.
    
    * Fix Cargo.toml
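
The restrictions listed in the commit message can be pictured as a small validation step. The following is a minimal sketch, not the code from this commit: `SUPPORTED_KV_CACHE_DTYPES`, `build_parser`, `check_kv_cache_dtype`, and the backend string `"flashinfer"` are hypothetical names chosen for illustration, and the assumption that omitting the flag leaves the cache in the model's own dtype is inferred from the message rather than stated in it.

```python
import argparse

# Hypothetical names for illustration; not the project's actual code.
# Per the commit message, only `fp8_e5m2` is accepted for now.
SUPPORTED_KV_CACHE_DTYPES = ("fp8_e5m2",)


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing a `--kv-cache-dtype` option."""
    parser = argparse.ArgumentParser(description="KV cache dtype sketch")
    parser.add_argument(
        "--kv-cache-dtype",
        choices=SUPPORTED_KV_CACHE_DTYPES,
        default=None,  # Assumption: None means "use the model's dtype".
    )
    return parser


def check_kv_cache_dtype(kv_cache_dtype, attention_backend):
    """Return the dtype if the combination is allowed, else raise ValueError."""
    if kv_cache_dtype is None:
        # No FP8 cache requested, so nothing to validate.
        return None
    if kv_cache_dtype not in SUPPORTED_KV_CACHE_DTYPES:
        raise ValueError(f"Unsupported KV cache dtype: {kv_cache_dtype}")
    if attention_backend != "flashinfer":
        # Per the commit message, the FP8 KV cache requires FlashInfer.
        raise ValueError("FP8 KV cache is only supported with FlashInfer")
    return kv_cache_dtype
```

With this sketch, `build_parser().parse_args(["--kv-cache-dtype", "fp8_e5m2"])` yields `kv_cache_dtype == "fp8_e5m2"`, and `check_kv_cache_dtype("fp8_e5m2", "paged")` raises `ValueError`.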
cli.py 12.1 KB