1. 25 Oct, 2024 1 commit
  2. 24 Oct, 2024 3 commits
    • Add support for FP8 KV cache scales (#2628) · eab07f74
      Daniël de Kok authored
      * Add support for FP8 KV cache scales
      
      Since FP8 only has limited dynamic range, we can scale keys/values
      before storing them into the cache (and unscale them in attention). To
      avoid rescaling the cache as the absmax values change, good scales are
      usually determined per layer using calibration data and stored
      in the checkpoint.
      
      This change adds support for using key-value scales and loading them
      from checkpoints in the two most common formats:
      
      - Separate per-layer `k_scale` and `v_scale` scalars.
      - Per-layer `kv_scale` scalar (older format).
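
The two formats listed above could be resolved per layer roughly as follows. This is only a sketch: the checkpoint is modeled as a plain dict, the `<prefix>.k_scale` naming and the helper `get_kv_scales` are hypothetical, and falling back to a scale of 1.0 when nothing is stored is an assumption, not something the commit states.

```python
from typing import Dict, Tuple


def get_kv_scales(weights: Dict[str, float], prefix: str) -> Tuple[float, float]:
    """Return (key_scale, value_scale) for one attention layer.

    `weights` stands in for a loaded checkpoint; the key names used
    here are illustrative assumptions.
    """
    k_name = f"{prefix}.k_scale"
    v_name = f"{prefix}.v_scale"
    kv_name = f"{prefix}.kv_scale"

    # Newer format: separate scalars for keys and values.
    if k_name in weights and v_name in weights:
        return float(weights[k_name]), float(weights[v_name])

    # Older format: one scalar shared by keys and values.
    if kv_name in weights:
        shared = float(weights[kv_name])
        return shared, shared

    # Assumed fallback: no scaling.
    return 1.0, 1.0


new_style = get_kv_scales({"layer0.k_scale": 0.5, "layer0.v_scale": 0.25}, "layer0")
old_style = get_kv_scales({"layer1.kv_scale": 0.125}, "layer1")
```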
      
      Currently, scales are only used with a `float8_e4m3fn` cache.
      
      Besides adding support for key/value scales, the `fp8_quantize` function
      is also extended to support quantization with a kernel vendored from
      vLLM. This is slightly faster than the PyTorch implementation, and it also
      scales in FP32, potentially improving accuracy.
      
      * Update FP8 KV cache test to use checkpoint with scales
      
      * `can_scale`: check that the attention is flashinfer
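
To illustrate why the commit above scales keys/values before storing them, here is a minimal pure-Python sketch. It does not use the TGI or vLLM kernels: `round_to_e4m3`, `store`, and `load` are hypothetical helpers that simulate `float8_e4m3fn` rounding (3 mantissa bits, largest finite value 448), saturation on overflow is an assumption of the simulation, and the scale is derived from the absmax of a toy list rather than from calibration data.

```python
import math

E4M3_MAX = 448.0  # largest finite float8_e4m3fn value


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest e4m3fn-representable value (simulated).

    Saturates at +/-448; this saturation is an assumption of the sketch.
    """
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)
    _, e = math.frexp(mag)        # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)          # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)       # spacing given 3 mantissa bits
    return math.copysign(round(mag / step) * step, x)


def store(x: float, scale: float) -> float:
    """Divide by the scale, then quantize (what goes into the cache)."""
    return round_to_e4m3(x / scale)


def load(q: float, scale: float) -> float:
    """Unscale a cached value (what attention would see)."""
    return q * scale


keys = [0.5, -3.0, 1200.0]

# Toy stand-in for a calibrated per-layer scale: map the absmax to 448.
k_scale = max(abs(k) for k in keys) / E4M3_MAX

unscaled = [load(store(k, 1.0), 1.0) for k in keys]
scaled = [load(store(k, k_scale), k_scale) for k in keys]

print("unscaled:", unscaled)  # the outlier is clipped to 448
print("scaled:  ", scaled)    # the outlier survives; small values round slightly
```

In this sketch the outlier is preserved at the cost of slightly coarser rounding for the small values, which is the trade-off a per-layer scale makes.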
    • Fix Phi 3.5 MoE tests (#2684) · 14a0df3a
      Daniël de Kok authored
      PR #2682 also fixed an issue in Phi MoE, but it changes the test
      outputs a bit. Fix this.
  3. 23 Oct, 2024 4 commits
  4. 22 Oct, 2024 1 commit
    • Add `impureWithCuda` dev shell (#2677) · 9c9ef37c
      Daniël de Kok authored
      * Add `impureWithCuda` dev shell
      
      This shell is handy when developing some kernels jointly with TGI - it
      adds nvcc and a bunch of commonly-used CUDA libraries to the environment.
      
      We don't add this to the normal impure shell to keep the development
      environment as clean as possible (avoid accidental dependencies, etc.).
      
      * Add cuDNN
  5. 21 Oct, 2024 2 commits
  6. 19 Oct, 2024 1 commit
    • Make handling of FP8 scales more consistent (#2666) · 5e0fb468
      Daniël de Kok authored
      Change `fp8_quantize` so that we can pass around reciprocals everywhere,
      so scales are always passed around in the checkpoint format.
      
      I also noticed that we ignore any input scales that we might have when
      fbgemm is available. Skip this path if we already have a scale.
  7. 18 Oct, 2024 1 commit
  8. 17 Oct, 2024 5 commits
  9. 16 Oct, 2024 2 commits
    • feat: prefill chunking (#2600) · a6a0c97e
      OlivierDehaene authored
      
      
      * wip
      
      * rollback
      
      * refactor to use prefix/postfix naming + fix all_input_ids_tensor
      
      * maybe patching vlms?
      
      * fix filter and concat
      
      * wip, no filter, no concat
      
      * current
      
      * add prepare_for_prefill
      
      * working
      
      * load tested
      
      * re-create slots
      
      * re-create slots
      
      * fix slot_filtering_indices
      
      * feedback loop
      
      * remove log
      
      * fix benchmarker
      
      * fix vlm and seq2seq
      
      * rename to cache and input lengths
      
      * fix prefill logprobs
      
      * fix launcher
      
      * fix logprobs?
      
      * idk at this point
      
      * max input length
      
      * omfg
      
      * remove debugging lines
      
      * fix tests
      
      * fix mllama
      
      * fix cargo tests
      
      * remove support chunking for paged
      
      * Fixing non blocked attentions
      
      * Fixing dtype + AMD, Ipex targets.
      
      * lint fix.
      
      * rename
      
      * Fix prefix_caching variable, remove defaults in server (confusing a lot
      of the time).
      
      * Add simple resolution when user specifies ATTENTION=paged.
      
      * Put back non default simple tests.
      
      * Fix env name
      
      ---------
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
    • Fp8 e4m3_fnuz support for rocm (#2588) · 704a58c8
      Mohit Sharma authored
      * (feat) fp8 fnuz support for rocm
      
      * (review comments) Fix compression_config load, type hints
      
      * (bug) update all has_tensor
      
      * (review_comments) fix typo and added comments
      
      * (nit) improved comment
  10. 15 Oct, 2024 3 commits
  11. 14 Oct, 2024 5 commits
  12. 11 Oct, 2024 1 commit
  13. 10 Oct, 2024 3 commits
  14. 09 Oct, 2024 2 commits
    • AMD CI (#2589) · 43f39f68
      Nicolas Patry authored
      * Only run 1 valid test.
      
      * Trying the tailscale action quickly.
      
      * ?
      
      * bash spaces.
      
      * Remove tailscale.
      
      * More quotes.
      
      * mnt2 ?
      
      * Other name to avoid recursive directories.
      
      * Good old tmate.
      
      * Remove tmate.
      
      * Trying a few things.
      
      * Remove some stuff.
      
      * Sleep ?
      
      * Tmp
      
      * busybox
      
      * Launcher tgi
      
      * Starting hello
      
      * Busybox in python
      
      * No device.
      
      * Removing all variables ?
      
      * At some point.
      
      * Tmp
      
      * Tmp2
      
      * Device request, no container name
      
      * No device requests
      
      * Without pytest.
      
      * No pytest.
      
      * from env
      
      * Start with devices
      
      * Attempt #1
      
      * Remove stdin messing
      
      * Only 1 test, no container name
      
      * Raw tgi
      
      * Sending args.
      
      * Show pip freeze.
      
      * Start downloading with token
      
      * Giving HIP devices.
      
      * Mount volume + port forward
      
      * Without pytest.
      
      * No token
      
      * Repeated arguments
      
      * Wrong kwarg.
      
      * On 2 GPUs
      
      * Fallback to single shard CI test.
      
      * Testing
      
      * yaml
      
      * Common cache ?
      
      * Trailing slash ?
      
      * Docker volume split.
      
      * Fix docker volume
      
      * Fixing ?
      
      * ?
      
      * Try no devices ?
      
      * Flash llama on intel CPU ?
      
      * Fix nvidia ?
      
      * Temp deactivate intel, activate nvidia ?
    • nix: add black and isort to the closure (#2619) · 9ed0c85f
      Daniël de Kok authored
      To make sure that everything is formatted with the same black version
      as CI.
      
      I sometimes use isort for new files to get nicely ordered imports,
      so add it as well. Also set the isort configuration to format in a
      way that is compatible with black.
  15. 08 Oct, 2024 4 commits
  16. 07 Oct, 2024 2 commits