"src/vscode:/vscode.git/clone" did not exist on "6b727842d7fd370ac057c092d913bf8557dd32c2"
  1. 09 Dec, 2024 1 commit
  2. 06 Dec, 2024 1 commit
    • Auto max prefill (#2797) · 5df80590
      Nicolas Patry authored
      * Attempt at automatic max batch prefill.
      
      * Taking into account number of shards.
      
      * Adding more cards.
      
      * Adding A100 + H100
      
      * Adding a few more cards.
      
      * Logprobs cost too much.
      
      * h100 better name, and keep factor of 2
      
      * Damn inflated sparse tflops.
      
      * Typo in h100.
      
      * Updated the flops calculation (checked with fvcore).
      
      * chunking by default.
      
      * Fix prefix caching for chat completion since we removed logprobs.
      
      * More tests.
      
      * Dropping all the prefill logprobs.
      
      * Add a flag that enables users to get logprobs back.
      
      * Repairing prompt token counting.
      
      * Fixing a few tests.
      
      * Remove some scaffolding.
      
      * Attempting to reduce the issues (workarounds for now).
      5df80590
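The commit above derives a default max-batch-prefill-tokens value from the detected card's compute throughput and the number of shards. A minimal sketch of that idea, assuming a rough 2-FLOPs-per-parameter-per-token forward cost and illustrative dense TFLOPS figures; the constants, function name, and one-second latency budget are assumptions, not TGI's actual heuristic:

```python
# Hypothetical sketch: size a prefill token budget from card throughput.
# Values and the latency budget are illustrative assumptions.

CARD_TFLOPS = {      # dense half-precision tensor-core TFLOPS, illustrative
    "a100": 312.0,
    "h100": 990.0,
}

def estimate_max_prefill_tokens(
    num_params: float,              # model parameters, e.g. 7e9
    card: str,                      # GPU name, e.g. "h100"
    num_shards: int = 1,            # tensor-parallel shards
    latency_budget_s: float = 1.0,  # target time for one prefill pass (assumption)
) -> int:
    # Rough transformer forward cost: ~2 FLOPs per parameter per token.
    flops_per_token = 2.0 * num_params
    available_flops = CARD_TFLOPS[card] * 1e12 * num_shards * latency_budget_s
    tokens = int(available_flops / flops_per_token)
    if tokens < 2:
        return 1
    # Round down to a power of two for a tidy default (an assumption).
    return 1 << (tokens.bit_length() - 1)

print(estimate_max_prefill_tokens(7e9, "h100", num_shards=2))
```

Per the commit message, prefill logprobs proved too expensive and are dropped by default (with a flag to opt back in); the sketch above ignores that cost entirely.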
  3. 22 Nov, 2024 1 commit
  4. 21 Nov, 2024 1 commit
  5. 10 Nov, 2024 1 commit
    • Add initial support for compressed-tensors checkpoints (#2732) · a7850008
      Daniël de Kok authored
      compressed-tensors is a safetensors extension for sparse, quantized
      tensors. The format is more powerful than the earlier AWQ/GPTQ/FP8
      quantization formats because:
      
      - Different quantizer configurations can be used for different targets.
      - The format can specify input/output quantizers in addition to weight
        quantizers.
      - Configurable exclusions for quantization.
      
      This change adds a dependency on the `compressed-tensors` package for
      its configuration parsing and layer matching functionality.
      
      The following types of quantization are supported in this PR:
      
      - W8A16 and W4A16 INT using GPTQ-Marlin kernels.
      - W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels.
      
      Support for other quantization types will be added in subsequent PRs.
      a7850008
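The compressed-tensors format described above pairs per-target quantizer groups with configurable exclusions. A hedged illustration of that matching idea, assuming a simplified config layout and glob-style target patterns; the dict structure and function below are stand-ins, not the actual `compressed-tensors` API this change depends on:

```python
# Illustrative per-target quantizer matching with exclusions. The config
# layout and helper are assumptions for demonstration only.
import fnmatch

quant_config = {
    "config_groups": {
        "group_0": {  # W4A16-style weight-only INT quantization
            "targets": ["*q_proj", "*k_proj", "*v_proj", "*o_proj"],
            "weights": {"num_bits": 4, "type": "int"},
        },
        "group_1": {  # W8A8-style FP quantization with input activations
            "targets": ["*gate_proj", "*up_proj", "*down_proj"],
            "weights": {"num_bits": 8, "type": "float"},
            "input_activations": {"num_bits": 8, "type": "float"},
        },
    },
    "ignore": ["lm_head"],  # configurable exclusions from quantization
}

def quantizer_for(layer_name: str, config: dict):
    """Return the matching quantizer group, or None if excluded/unmatched."""
    if any(fnmatch.fnmatch(layer_name, pat) for pat in config["ignore"]):
        return None
    for group_name, group in config["config_groups"].items():
        if any(fnmatch.fnmatch(layer_name, pat) for pat in group["targets"]):
            return group_name
    return None

print(quantizer_for("model.layers.0.self_attn.q_proj", quant_config))  # group_0
print(quantizer_for("lm_head", quant_config))                          # None
```

In the real format, matched groups can also carry input/output activation quantizers in addition to weight quantizers, which is what allows the activation-quantized variants listed above.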
  6. 28 Oct, 2024 1 commit
    • Choosing input/total tokens automatically based on available VRAM? (#2673) · 0c9b6cdd
      Nicolas Patry authored
      * Choosing input/total tokens automatically based on available VRAM?
      
      * Update doc.
      
      * Remove generated files.
      
      * Trying to fix non-chunking targets.
      
      * Attempt #2
      
      * fix.
      
      * QuantLinear is rocm compatible.
      
      * Much simpler logic after the overhead.
      
      * Updating logic + non flash.
      
      * Revert doc text.
      
      * Simple updates.
      
      * Fix integration mt0 (transformers update).
      0c9b6cdd
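Choosing input/total tokens from available VRAM comes down to how many KV-cache slots fit once the weights are loaded. A rough sketch under that assumption; the 80% safety margin, fp16 cache, and example model dimensions are illustrative, not TGI's sizing logic:

```python
# Hypothetical sketch: bound max-total-tokens by the KV cache that fits in
# the VRAM left after model loading. Margins and dims are assumptions.
import torch

def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # Key + value, for every layer, in fp16/bf16 by default.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def max_total_tokens(device: int = 0, margin: float = 0.8, **model_dims) -> int:
    free_bytes, _total_bytes = torch.cuda.mem_get_info(device)
    return int(free_bytes * margin) // kv_bytes_per_token(**model_dims)

# Example with Llama-2-7B-like dimensions (32 layers, 32 KV heads, head_dim 128).
if torch.cuda.is_available():
    print(max_total_tokens(num_layers=32, num_kv_heads=32, head_dim=128))
```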
  7. 25 Oct, 2024 1 commit
  8. 23 Oct, 2024 1 commit
  9. 17 Oct, 2024 1 commit
  10. 04 Oct, 2024 1 commit
    • Add basic FP8 KV cache support (#2603) · 2358c2bb
      Daniël de Kok authored
      * Add basic FP8 KV cache support
      
      This change adds rudimentary FP8 KV cache support. It is enabled by
      passing `--kv-cache-dtype fp8_e5m2` to the launcher, which stores the KV
      cache in that type. However, support is still limited:
      
      * Only the `fp8_e5m2` type is supported.
      * The KV cache layout is the same as `float16`/`bfloat16` (HND).
      * The FP8 KV cache is only supported for FlashInfer.
      * Loading of scales is not yet supported.
      
      * Fix Cargo.toml
      2358c2bb
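Storing the KV cache as `fp8_e5m2` halves its memory relative to `float16`. A minimal sketch of the data type itself, not of FlashInfer's cache layout or kernels:

```python
# Minimal fp8_e5m2 round-trip for one K (or V) vector; dims are illustrative.
import torch

head_dim = 128
k = torch.randn(head_dim, dtype=torch.float16)

# Write path: store in fp8 e5m2 (1 byte per element instead of 2).
k_cache = k.to(torch.float8_e5m2)

# Read path: upcast back to fp16 before the attention matmul.
k_restored = k_cache.to(torch.float16)

print(k_cache.element_size())            # 1 byte per element
print((k_restored - k).abs().max())      # small quantization error
```

e5m2 keeps float16's exponent range with only two mantissa bits, which is why it can be used before scale loading is supported, at the price of coarser precision.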
  11. 19 Sep, 2024 1 commit
  12. 16 Aug, 2024 1 commit