1. 23 Oct, 2024 2 commits
  2. 22 Oct, 2024 1 commit
    • Add `impureWithCuda` dev shell (#2677) · 9c9ef37c
      Daniël de Kok authored
      * Add `impureWithCuda` dev shell
      
      This shell is handy when developing some kernels jointly with TGI - it
      adds nvcc and a bunch of commonly-used CUDA libraries to the environment.
      
      We don't add this to the normal impure shell to keep the development
      environment as clean as possible (avoid accidental dependencies, etc.).
      
      * Add cuDNN
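A quick way to confirm that a shell exposes the CUDA compiler is to look it up on `PATH`. The following is a minimal sketch, not part of the commit: the helper name `cuda_tools_on_path` is hypothetical, and it only checks for executables, not for the CUDA libraries.

```python
import shutil


def cuda_tools_on_path(tools=("nvcc",)):
    """Map each tool name to its resolved path, or None when it is missing.

    Hypothetical helper for illustration; it is not part of TGI.
    """
    return {tool: shutil.which(tool) for tool in tools}
```

Calling `cuda_tools_on_path()` inside the shell should report a path for `nvcc`; a value of `None` suggests the tool is not on `PATH` in the current environment.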
  3. 21 Oct, 2024 2 commits
  4. 19 Oct, 2024 1 commit
    • Make handling of FP8 scales more consistent (#2666) · 5e0fb468
      Daniël de Kok authored
      Change `fp8_quantize` so that we can pass around reciprocals everywhere,
      so scales are always passed around in the checkpoint format.
      
      I also noticed that we ignore any input scales that we might have when
      fbgemm is available. Skip this path if we already have a scale.
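The scale convention described above can be sketched in plain Python. This is an illustration under stated assumptions, not TGI's `fp8_quantize`: the function `quantize_sketch` is hypothetical, it assumes the checkpoint format stores the dequantization scale (so that `original ≈ quantized * scale`), and it only clamps values to the e4m3 range without rounding them to FP8 precision.

```python
FP8_E4M3_MAX = 448.0  # largest finite value of the float8 e4m3fn format


def quantize_sketch(values, scale=None):
    """Return (scaled_values, scale) with original ≈ scaled * scale.

    Hypothetical sketch: `scale` is assumed to be the dequantization
    scale, i.e. the reciprocal of the factor applied when quantizing.
    """
    if scale is None:
        # No scale supplied: derive one from the largest magnitude.
        amax = max((abs(v) for v in values), default=0.0) or 1.0
        scale = amax / FP8_E4M3_MAX
    # An existing scale is used as-is and never recomputed.
    scaled = [
        max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values
    ]
    return scaled, scale
```

Passing `scale=...` skips the dynamic computation entirely, which mirrors the second fix in the commit: a scale that already exists takes precedence.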
  5. 18 Oct, 2024 1 commit
  6. 17 Oct, 2024 5 commits
  7. 16 Oct, 2024 2 commits
    • feat: prefill chunking (#2600) · a6a0c97e
      OlivierDehaene authored
      
      
      * wip
      
      * rollback
      
      * refactor to use prefix/postfix naming + fix all_input_ids_tensor
      
      * maybe patching vlms?
      
      * fix filter and concat
      
      * wip, no filter, no concat
      
      * current
      
      * add prepare_for_prefill
      
      * working
      
      * load tested
      
      * re-create slots
      
      * re-create slots
      
      * fix slot_filtering_indices
      
      * feedback loop
      
      * remove log
      
      * fix benchmarker
      
      * fix vlm and seq2seq
      
      * rename to cache and input lengths
      
      * fix prefill logprobs
      
      * fix launcher
      
      * fix logprobs?
      
      * idk at this point
      
      * max input length
      
      * omfg
      
      * remove debugging lines
      
      * fix tests
      
      * fix mllama
      
      * fix cargo tests
      
      * remove support chunking for paged
      
      * Fixing non blocked attentions
      
      * Fixing dtype + AMD, Ipex targets.
      
      * lint fix.
      
      * rename
      
      * Fix prefix_caching variable, remove defaults in server (confusing a lot
      of the time).
      
      * Add simple resolution when user specifies ATTENTION=paged.
      
      * Put back non default simple tests.
      
      * Fix env name
      
      ---------
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
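The commit log does not describe the mechanism, so the following is only a generic sketch of the idea behind prefill chunking: a long prompt is processed in bounded pieces, and each piece records how many tokens are already cached. The function `prefill_chunks` and its parameters are hypothetical and do not reflect TGI's implementation; the "cache length" and "input" terms echo the rename mentioned in the log.

```python
def prefill_chunks(prompt_token_ids, max_chunk_tokens):
    """Split a prompt into (cache_length, input_ids) pairs.

    Hypothetical sketch: cache_length counts tokens handled by earlier
    chunks, input_ids holds the tokens processed in this chunk.
    """
    if max_chunk_tokens <= 0:
        raise ValueError("max_chunk_tokens must be positive")
    chunks = []
    cache_length = 0
    while cache_length < len(prompt_token_ids):
        end = cache_length + max_chunk_tokens
        input_ids = prompt_token_ids[cache_length:end]
        chunks.append((cache_length, input_ids))
        cache_length += len(input_ids)
    return chunks
```

For a 10-token prompt and a 4-token budget, this yields chunks starting at cache lengths 0, 4, and 8.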
    • Fp8 e4m3_fnuz support for rocm (#2588) · 704a58c8
      Mohit Sharma authored
      * (feat) fp8 fnuz support for rocm
      
      * (review comments) Fix compression_config load, type hints
      
      * (bug) update all has_tensor
      
      * (review_comments) fix typo and added comments
      
      * (nit) improved comment
  8. 15 Oct, 2024 3 commits
  9. 14 Oct, 2024 5 commits
  10. 11 Oct, 2024 1 commit
  11. 10 Oct, 2024 3 commits
  12. 09 Oct, 2024 2 commits
    • AMD CI (#2589) · 43f39f68
      Nicolas Patry authored
      * Only run 1 valid test.
      
      * Trying the tailscale action quickly.
      
      * ?
      
      * bash spaces.
      
      * Remove tailscale.
      
      * More quotes.
      
      * mnt2 ?
      
      * Other name to avoid recursive directories.
      
      * Good old tmate.
      
      * Remove tmate.
      
      * Trying a few things.
      
      * Remove some stuff.
      
      * Sleep ?
      
      * Tmp
      
      * busybox
      
      * Launcher tgi
      
      * Starting hello
      
      * Busybox in python
      
      * No device.
      
      * Removing all variables ?
      
      * At some point.
      
      * Tmp
      
      * Tmp2
      
      * Device request, no container name
      
      * No device requests
      
      * Without pytest.
      
      * No pytest.
      
      * from env
      
      * Start with devices
      
      * Attempt #1
      
      * Remove stdin messing
      
      * Only 1 test, no container name
      
      * Raw tgi
      
      * Sending args.
      
      * Show pip freeze.
      
      * Start downloading with token
      
      * Giving HIP devices.
      
      * Mount volume + port forward
      
      * Without pytest.
      
      * No token
      
      * Repeated arguments
      
      * Wrong kwarg.
      
      * On 2 GPUs
      
      * Fallback to single shard CI test.
      
      * Testing
      
      * yaml
      
      * Common cache ?
      
      * Trailing slash ?
      
      * Docker volume split.
      
      * Fix docker volume
      
      * Fixing ?
      
      * ?
      
      * Try no devices ?
      
      * Flash llama on intel CPU ?
      
      * Fix nvidia ?
      
      * Temp deactivate intel, activate nvidia ?
    • nix: add black and isort to the closure (#2619) · 9ed0c85f
      Daniël de Kok authored
      To make sure that everything is formatted with the same black version
      as CI.
      
      I sometimes use isort for new files to get nicely ordered imports,
      so add it as well. Also set the isort configuration to format in a
      way that is compatible with black.
  13. 08 Oct, 2024 4 commits
  14. 07 Oct, 2024 2 commits
  15. 04 Oct, 2024 2 commits
    • Add basic FP8 KV cache support (#2603) · 2358c2bb
      Daniël de Kok authored
      * Add basic FP8 KV cache support
      
      This change adds rudimentary FP8 KV cache support. Pass
      `--kv-cache-dtype fp8_e5m2` to the launcher to store the KV cache in
      this type. However, support is still limited:
      
      * Only the `fp8_e5m2` type is supported.
      * The KV cache layout is the same as `float16`/`bfloat16` (HND).
      * The FP8 KV cache is only supported for FlashInfer.
      * Loading of scales is not yet supported.
      
      * Fix Cargo.toml
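To see what a one-byte cache type buys, the arithmetic can be sketched as follows. This is a rough estimate under stated assumptions, not a measurement: `kv_cache_bytes` is a hypothetical helper, the model dimensions in the usage note are made up for illustration, and allocator or block overhead is ignored.

```python
# Bytes needed to store one element of each cache type.
BYTES_PER_ELEMENT = {"float16": 2, "bfloat16": 2, "fp8_e5m2": 1}


def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, dtype):
    """Estimate KV cache size in bytes for a given number of cached tokens.

    Hypothetical helper: one key and one value vector are stored per
    token, per layer, per KV head.
    """
    elements_per_token = 2 * num_layers * num_kv_heads * head_dim
    return num_tokens * elements_per_token * BYTES_PER_ELEMENT[dtype]
```

With illustrative dimensions of 32 layers, 8 KV heads, and a head size of 128, caching 4096 tokens takes 512 MiB in `float16` and 256 MiB in `fp8_e5m2`, since every element shrinks from two bytes to one.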
  16. 03 Oct, 2024 2 commits
  17. 02 Oct, 2024 2 commits