1. 09 Dec, 2024 1 commit
  2. 06 Dec, 2024 6 commits
  3. 04 Dec, 2024 1 commit
  4. 03 Dec, 2024 2 commits
    • Saving some VRAM. (#2790) · b57f3703
      Nicolas Patry authored
      * Saving some VRAM.
      
      - 8B on 4xL4 with attention=flashdecoding. Before: 4.28GB free; after:
        4.32GB free, so about 40MB saved.
      
      - The effect is less visible with attention=flashinfer and n_shard=1; I
        suspect this is linked to the torch allocator.
      
      * Adding assertion.
    • Sync (most) server dependencies with Nix (#2782) · 2003d8be
      Daniël de Kok authored

      * Sync (most) server dependencies with Nix
      
      Skipped most grpcio packages, because of protobuf version
      incompatibility with the opentelemetry packages.
      
      * Add a primitive script to generate Poetry commands to sync with Nix
      
      This is not fully automated, since the Nix versions cannot always be
      resolved automatically. However, it takes most of the manual work out
      of the process.
      
      * Upgrade eetq?
      
      * Fmt.
      
      ---------
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
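      The sync script itself is not shown in this log; a hypothetical sketch of
      such a helper, assuming it receives a mapping of package names to the
      versions pinned in Nix and emits `poetry add` commands (all names and
      versions below are illustrative):

      ```python
      # Illustrative sketch: emit Poetry commands that pin server dependencies
      # to the versions already packaged in Nix. The nix_versions mapping is a
      # stand-in for whatever the real script extracts from the Nix expressions.

      def poetry_sync_commands(nix_versions, skip=()):
          """Build one `poetry add` command per package not listed in `skip`."""
          commands = []
          for name, version in sorted(nix_versions.items()):
              if name in skip:
                  continue  # e.g. grpcio packages with protobuf conflicts
              commands.append(f"poetry add {name}=={version}")
          return commands

      if __name__ == "__main__":
          versions = {"transformers": "4.46.3", "grpcio": "1.68.0"}
          for cmd in poetry_sync_commands(versions, skip={"grpcio"}):
              print(cmd)
      ```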
  5. 02 Dec, 2024 4 commits
  6. 28 Nov, 2024 1 commit
    • Support continue final message (#2733) · d4718051
      drbh authored
      * feat: support continue_final_message param in chat request
      
      * feat: add test for continue final message
      
      * fix: bump openapi docs
      
      * fix: remove continue_final_message chat request param
      
      * fix: remove unneeded launcher args in continue test
      
      * fix: bump test output
      
      * fix: remove accidentally included guideline from rebase
      
      * fix: remove guideline tests
      
      * fix: adjust continuation tests expected text
      
      * fix: replace expected output for continue test
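      The feature lets generation continue a partially written final assistant
      message instead of opening a new turn. A minimal sketch of what such a
      chat payload might look like (the endpoint shape and field names are
      assumptions based on the common OpenAI-style chat schema, not taken from
      this log):

      ```python
      # Sketch of a chat request whose final message is an incomplete assistant
      # turn, which the server is expected to continue rather than answer anew.
      # Treat the payload shape as illustrative, not as TGI's exact API.
      import json

      def build_continue_payload(user_text, partial_reply):
          return {
              "model": "tgi",
              "messages": [
                  {"role": "user", "content": user_text},
                  # The final message is from the assistant: the server should
                  # continue this text instead of starting a fresh reply.
                  {"role": "assistant", "content": partial_reply},
              ],
              "max_tokens": 64,
          }

      payload = build_continue_payload("List three primes.", "The first prime is")
      print(json.dumps(payload, indent=2))
      ```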
  7. 26 Nov, 2024 3 commits
  8. 25 Nov, 2024 2 commits
  9. 22 Nov, 2024 2 commits
  10. 21 Nov, 2024 7 commits
  11. 20 Nov, 2024 5 commits
  12. 19 Nov, 2024 4 commits
  13. 18 Nov, 2024 2 commits
    • feat: support flash attention 2 in qwen2 vl vision blocks (#2721) · 38cff84a
      drbh authored
      * feat: support flash attention 2 in qwen2 vl vision blocks
      
      * fix: calc max_seqlen once and small refactors
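      Flash attention's varlen kernels take packed sequences described by
      cumulative lengths plus a `max_seqlen`; the refactor above computes that
      maximum once instead of per block. A small pure-Python sketch of the
      computation (the `cu_seqlens` values are illustrative):

      ```python
      # Sketch: derive max_seqlen from cumulative sequence lengths, as used by
      # flash-attn's varlen interface. cu_seqlens has one more entry than there
      # are sequences; consecutive differences are the individual lengths.

      def max_seqlen_from_cu(cu_seqlens):
          lengths = [b - a for a, b in zip(cu_seqlens, cu_seqlens[1:])]
          return max(lengths)

      # Three packed sequences of lengths 5, 3 and 9.
      print(max_seqlen_from_cu([0, 5, 8, 17]))  # -> 9
      ```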
    • Add support for compressed-tensors w8a8 int checkpoints (#2745) · 3c9df21f
      Daniël de Kok authored

      * Add support for compressed-tensors w8a8 int checkpoints
      
      This change adds a loader for w8a8 int checkpoints. One large benefit of
      int8 support is that the corresponding cutlass matmul kernels also work on
      compute capability 7.5.
      
      Evaluation on neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8:
      
      |     Tasks     |Version|     Filter     |n-shot|        Metric         |   |Value |   |Stderr|
      |---------------|------:|----------------|-----:|-----------------------|---|-----:|---|------|
      |gsm8k_cot_llama|      3|flexible-extract|     8|exact_match            |↑  |0.8431|±  |0.0100|
      |               |       |strict-match    |     8|exact_match            |↑  |0.8393|±  |0.0101|
      |ifeval         |      4|none            |     0|inst_level_loose_acc   |↑  |0.8597|±  |   N/A|
      |               |       |none            |     0|inst_level_strict_acc  |↑  |0.8201|±  |   N/A|
      |               |       |none            |     0|prompt_level_loose_acc |↑  |0.7967|±  |0.0173|
      |               |       |none            |     0|prompt_level_strict_acc|↑  |0.7468|±  |0.0187|
      
      This is in the same ballpark as vLLM.
      
      As usual, lots of thanks to Neural Magic/vLLM for the kernels.
      
      * Always use dynamic input quantization for w8a8 int
      
      It's far less flaky and gives better output.
      
      * Use marlin-kernels 0.3.5
      
      * Fix a typo
      Co-authored-by: drbh <david.richard.holtz@gmail.com>
      
      * Small fixes
      
      ---------
      Co-authored-by: drbh <david.richard.holtz@gmail.com>
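      "Dynamic input quantization" here means the int8 activation scales are
      computed at runtime, per row of input, rather than calibrated ahead of
      time. A minimal pure-Python sketch of symmetric per-row int8 quantization
      (illustrative only; the real kernels come from marlin-kernels and the
      Neural Magic/vLLM cutlass kernels):

      ```python
      # Sketch of symmetric per-row dynamic int8 quantization: each row of
      # activations gets its own scale, computed on the fly from its max |x|.

      def quantize_per_row(rows):
          quantized, scales = [], []
          for row in rows:
              scale = max(abs(x) for x in row) / 127.0 or 1.0
              quantized.append([round(x / scale) for x in row])
              scales.append(scale)
          return quantized, scales

      def dequantize(quantized, scales):
          return [[q * s for q in row] for row, s in zip(quantized, scales)]

      x = [[0.5, -1.0, 0.25], [2.0, 0.0, -4.0]]
      q, s = quantize_per_row(x)
      x_hat = dequantize(q, s)  # close to x, up to rounding error
      ```

      Computing the scale per input row at runtime is what makes it "dynamic";
      a static scheme would ship fixed activation scales with the checkpoint,
      which the commit notes was flakier.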