1. 28 Oct, 2024 1 commit
    • Choosing input/total tokens automatically based on available VRAM? (#2673) · 0c9b6cdd
      Nicolas Patry authored
      * Choosing input/total tokens automatically based on available VRAM?
      
      * Update doc.
      
      * Remove generated files.
      
      * Trying to fix non chunking targets.
      
      * Attempt #2
      
      * fix.
      
      * QuantLinear is rocm compatible.
      
      * Much simpler logic after the overhead.
      
      * Updating logic + non flash.
      
      * Revert doc text.
      
      * Simple updates.
      
      * Fix integration mt0 (transformers update).
      0c9b6cdd
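      The bullets above are terse, so here is a minimal sketch of the idea in the title: derive a token budget from free VRAM given the per-token KV-cache footprint. The helper name and the 0.9 safety margin are illustrative assumptions, not the launcher's actual heuristic.

      ```python
      import torch

      def max_total_tokens_from_vram(num_layers: int, num_kv_heads: int, head_dim: int,
                                     dtype=torch.float16, safety_margin: float = 0.9) -> int:
          """Estimate how many KV-cache tokens fit into the currently free VRAM."""
          free_bytes, _total_bytes = torch.cuda.mem_get_info()
          # K and V each store num_layers * num_kv_heads * head_dim elements per token.
          bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype.itemsize
          return int(free_bytes * safety_margin / bytes_per_token)

      # Example: a Llama-7B-like configuration.
      print(max_total_tokens_from_vram(num_layers=32, num_kv_heads=32, head_dim=128))
      ```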
  2. 25 Oct, 2024 2 commits
  3. 24 Oct, 2024 1 commit
    • Add support for FP8 KV cache scales (#2628) · eab07f74
      Daniël de Kok authored
      * Add support for FP8 KV cache scales
      
      Since FP8 only has limited dynamic range, we can scale keys/values
      before storing them into the cache (and unscale them in attention). To
      avoid rescaling the cache as the absmax values change, good scales are
      usually determined per layer using calibration data and stored
      in the checkpoint.
      
      This change adds support for using key-value scales and loading them
      from checkpoints in the two most common formats:
      
      - Separate per-layer `k_scale` and `v_scale` scalars.
      - Per-layer `kv_scale` scalar (older format).
      
      Currently, scales are only used with a `float8_e4m3fn` cache (see the sketch after this entry).
      
      Besides adding support for key/value scales, the `fp8_quantize` function
      is also extended to support quantization with a kernel vendored from
      vLLM. This is slightly faster than the PyTorch implementation, but also
      scales in FP32, potentially improving accuracy.
      
      * Update FP8 KV cache test to use checkpoint with scales
      
      * `can_scale`: check that the attention is flashinfer
      eab07f74
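      A minimal sketch of the scaling described above, assuming per-layer `k_scale`/`v_scale` scalars (older checkpoints provide a single `kv_scale` used for both); it illustrates the idea, not TGI's actual `fp8_quantize` code path:

      ```python
      import torch

      FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # ~448 for e4m3fn

      def quantize_kv(key, value, k_scale: float, v_scale: float):
          """Scale keys/values into the FP8 dynamic range before writing the cache."""
          k_fp8 = (key / k_scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
          v_fp8 = (value / v_scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
          return k_fp8, v_fp8

      def dequantize_kv(k_fp8, v_fp8, k_scale: float, v_scale: float, dtype=torch.float16):
          """Undo the scaling when the cache is read back for attention."""
          return k_fp8.to(dtype) * k_scale, v_fp8.to(dtype) * v_scale
      ```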
  4. 23 Oct, 2024 2 commits
  5. 18 Oct, 2024 1 commit
  6. 17 Oct, 2024 3 commits
  7. 16 Oct, 2024 2 commits
    • feat: prefill chunking (#2600) · a6a0c97e
      OlivierDehaene authored
      
      
      * wip
      
      * rollback
      
      * refactor to use prefix/postfix naming + fix all_input_ids_tensor
      
      * maybe patching vlms?
      
      * fix filter and concat
      
      * wip, no filter, no concat
      
      * current
      
      * add prepare_for_prefill
      
      * working
      
      * load tested
      
      * re-create slots
      
      * re-create slots
      
      * fix slot_filtering_indices
      
      * feedback loop
      
      * remove log
      
      * fix benchmarker
      
      * fix vlm and seq2seq
      
      * rename to cache and input lengths
      
      * fix prefill logprobs
      
      * fix launcher
      
      * fix logprobs?
      
      * idk at this point
      
      * max input length
      
      * omfg
      
      * remove debugging lines
      
      * fix tests
      
      * fix mllama
      
      * fix cargo tests
      
      * remove support chunking for paged
      
      * Fixing non blocked attentions
      
      * Fixing dtype + AMD, Ipex targets.
      
      * lint fix.
      
      * rename
      
      * Fix prefix_caching variable, remove defaults in server (confusing a lot
      of the time).
      
      * Add simple resolution when user specifies ATTENTION=paged.
      
      * Put back non default simple tests.
      
      * Fix env name
      
      ---------
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
      a6a0c97e
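      The commit messages above are terse; conceptually, prefill chunking splits a long prompt into fixed-size chunks that are prefilled one step at a time instead of in a single pass. A rough, hypothetical sketch (the `prefill_chunk` callback is an assumption for illustration, not TGI's API):

      ```python
      def chunked_prefill(input_ids: list[int], chunk_size: int, prefill_chunk):
          """Prefill a long prompt in chunks, reusing the KV cache built so far.

          `prefill_chunk(tokens, cache_len)` is assumed to append `tokens` to the
          KV cache starting at position `cache_len` and return the logits of the
          chunk's last token.
          """
          logits = None
          for start in range(0, len(input_ids), chunk_size):
              chunk = input_ids[start:start + chunk_size]
              logits = prefill_chunk(chunk, cache_len=start)
          return logits  # logits for the final prompt token, used to start decoding
      ```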
    • Fp8 e4m3_fnuz support for rocm (#2588) · 704a58c8
      Mohit Sharma authored
      * (feat) fp8 fnuz support for rocm
      
      * (review comments) Fix compression_config load, type hints
      
      * (bug) update all has_tensor
      
      * (review_comments) fix typo and added comments
      
      * (nit) improved comment
      704a58c8
  8. 15 Oct, 2024 1 commit
  9. 14 Oct, 2024 1 commit
    • feat: enable pytorch xpu support for non-attention models (#2561) · 58848cb4
      Dmitry Rogozhkin authored
      
      
      The XPU backend is available natively (without IPEX) in PyTorch starting
      from PyTorch 2.4. This commit extends TGI to cover the case where a user
      has XPU support through PyTorch 2.4 but does not have IPEX installed.
      Models which don't require attention can work; for attention-based models,
      more work is needed to provide an attention implementation (see the sketch
      after this entry).
      
      Tested with the following models:
      * teknium/OpenHermes-2.5-Mistral-7B
      * bigscience/bloom-560m
      * google/gemma-7b
      * google/flan-t5-xxl
      Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
      58848cb4
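      A minimal sketch of the device selection this enables, assuming PyTorch >= 2.4 with native XPU support and no IPEX installed:

      ```python
      import torch

      def pick_device() -> torch.device:
          """Prefer CUDA, then native XPU (PyTorch >= 2.4), then CPU."""
          if torch.cuda.is_available():
              return torch.device("cuda")
          if hasattr(torch, "xpu") and torch.xpu.is_available():
              return torch.device("xpu")
          return torch.device("cpu")

      device = pick_device()
      x = torch.randn(2, 3, device=device)  # enough for models that don't need attention kernels
      ```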
  10. 07 Oct, 2024 1 commit
  11. 04 Oct, 2024 1 commit
    • Add basic FP8 KV cache support (#2603) · 2358c2bb
      Daniël de Kok authored
      * Add basic FP8 KV cache support
      
      This change adds rudimentary FP8 KV cache support. The support is
      enabled by passing `--kv-cache-dtype fp8_e5m2` to the launcher, which
      uses this type for the KV cache. However, support is still limited:
      
      * Only the `fp8_e5m2` type is supported.
      * The KV cache layout is the same as `float16`/`bfloat16` (HND).
      * The FP8 KV cache is only supported for FlashInfer.
      * Loading of scales is not yet supported.
      
      * Fix Cargo.toml
      2358c2bb
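      A rough sketch of what allocating the cache in `fp8_e5m2` looks like; the paged-block shape below is an assumption for illustration, the point being only that the layout matches the `float16`/`bfloat16` (HND) case with a narrower dtype:

      ```python
      import torch

      def allocate_kv_cache(num_blocks: int, num_heads: int, block_size: int, head_dim: int,
                            dtype=torch.float8_e5m2, device="cuda"):
          """Allocate paged K/V caches in FP8, keeping the HND-style layout of fp16/bf16."""
          shape = (num_blocks, num_heads, block_size, head_dim)
          key_cache = torch.zeros(shape, dtype=dtype, device=device)
          value_cache = torch.zeros(shape, dtype=dtype, device=device)
          return key_cache, value_cache
      ```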
  12. 02 Oct, 2024 1 commit
    • Mllama flash version (#2585) · d18ed5cf
      Nicolas Patry authored
      * Working loading state.
      
      * Preprocessing.
      
      * Working state ? (Broke idefics1 temporarily).
      
      * Cleaner condition.
      
      * Fix idefics.
      
      * Updating config, removing TODO
      
      * Mllama
      
      * Upgrade transformers 4.45
      
      * Flashing mllama.
      
      * Starting to get there.
      
      * Working state.
      
      * Integration tests for mllama (cutting to 10 tokens because there seems
      to be instability afterwards, meaning the size of the batch matters).
      
      * Updating model link.
      
      * Earlier assert.
      
      * Fix vlm ?
      
      * remove log.
      
      * Force ignore all images but last.
      
      * Default dtype bfloat16.
      
      * Update integration test after switch to bf16.
      
      * Remove dead code.
      
      * Removed dead code.
      
      * Upgrade the flake to latest transformers/tokenizers
      
      * Move to hf tgi-nix
      
      * Upgrade to 0.5.0
      d18ed5cf
  13. 30 Sep, 2024 2 commits
    • feat: support phi3.5 moe (#2479) · 93a7042d
      drbh authored
      
      
      * feat: support phi3.5 moe model loading
      
      * fix: prefer llama base model and improve rotary logic
      
      * feat: return reasonable generation and add integration test
      
      * fix: run lint and update docs
      
      * fix: rerun lint for openapi docs
      
      * fix: prefer do_sample false unless temp is set by user, and update chat tests
      
      * fix: small typo adjustments
      
      * fix: consolidate long rope paths
      
      * fix: revert greedy by default and test changes
      
      * Vendor configuration so that we don't have to `trust_remote_code`
      
      * Use SparseMoELayer
      
      * Add support for dense MoE
      
      * Some type annotations
      
      * Add the usual model tests
      
      * Ruff.
      
      ---------
      Co-authored-by: Daniël de Kok <me@danieldk.eu>
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
      93a7042d
    • Update ROCM libs and improvements (#2579) · f9e561ec
      Mohit Sharma authored
      * style
      
      * update torch
      
      * fix issues
      
      * fix clone
      
      * revert mkl
      
      * added custom PA
      
      * style
      
      * fix style
      
      * style
      
      * hide env var
      
      * fix mixtral model
      
      * add skinny kernel and merge fixes
      
      * fixed style
      
      * fix issue for sliding window models
      
      * addressed review comments
      
      * fix import
      
      * improved error message
      
      * updated default value
      
      * remove import
      
      * fix imports after rebase
      
      * float16 dep
      
      * improve dockerfile
      
      * cleaned dockerfile
      f9e561ec
  14. 28 Sep, 2024 1 commit
  15. 27 Sep, 2024 1 commit
    • Improve support for GPUs with capability < 8 (#2575) · 5b6b74e2
      Daniël de Kok authored
      * Improve support for GPUs with capability < 8
      
      - For models that cannot use flashinfer, use flash-attn v1 + paged
        attention for models with a compute capability older than 8.
      - Disable prefix caching when using paged attention.
      - When using flash-attn v1, pass the key/value, rather than the
        cache, since v1 cannot use block tables.
      
      * nix: add flash-attn-v1 to the server environment
      
      * Move disabling prefix caching into the block of exceptions
      
      * Capability as `usize`s
      5b6b74e2
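      A minimal sketch of the selection logic described above; the backend names are placeholders and the real dispatch in TGI is more involved:

      ```python
      import torch

      def select_attention_backend() -> tuple[str, bool]:
          """Return (attention_backend, prefix_caching_enabled) from the compute capability."""
          major, _minor = torch.cuda.get_device_capability()
          if major >= 8:
              # Ampere and newer can use flashinfer, which supports prefix caching.
              return "flashinfer", True
          # Older GPUs: flash-attn v1 + paged attention; v1 has no block tables,
          # so prefix caching is disabled and keys/values are passed directly.
          return "flash-attn-v1+paged", False
      ```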
  16. 26 Sep, 2024 1 commit
  17. 24 Sep, 2024 2 commits
  18. 20 Sep, 2024 1 commit
  19. 17 Sep, 2024 1 commit
    • Move to moe-kernels package and switch to common MoE layer (#2511) · ce85efa9
      Daniël de Kok authored
      * Move to moe-kernels package and switch to common MoE layer
      
      This change introduces the new `moe-kernels` package:
      
      - Add `moe-kernels` as a dependency.
      - Introduce a `SparseMoELayer` module that can be used by MoE
        models.
      - Port over Mixtral and Deepseek.
      
      * Make `cargo check` pass
      
      * Update runner
      ce85efa9
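      A toy sketch of the top-k routing a sparse MoE layer performs; it is illustrative only and does not mirror the `SparseMoELayer` or `moe-kernels` API:

      ```python
      import torch
      import torch.nn.functional as F

      def sparse_moe(x, gate_weight, experts, top_k: int = 2):
          """x: [tokens, hidden]; gate_weight: [num_experts, hidden]; experts: list of modules."""
          scores = F.softmax(x @ gate_weight.t(), dim=-1)         # [tokens, num_experts]
          weights, expert_idx = torch.topk(scores, top_k, dim=-1)
          weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over the top-k
          out = torch.zeros_like(x)
          for e, expert in enumerate(experts):
              rows, slots = (expert_idx == e).nonzero(as_tuple=True)
              if rows.numel():
                  out[rows] += weights[rows, slots].unsqueeze(1) * expert(x[rows])
          return out
      ```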
  20. 12 Sep, 2024 1 commit
  21. 11 Sep, 2024 2 commits
    • Fix tokenization yi (#2507) · dae3bf1d
      Nicolas Patry authored
      * Fixing odd tokenization self-modifications on the Rust side (load and
      resave in Python).
      
      * Fixing the builds ?
      
      * Fix the gh action?
      
      * Fixing the location ?
      
      * Validation is odd.
      
      * Try a faster runner
      
      * Upgrade python version.
      
      * Remove sccache
      
      * No sccache.
      
      * Getting libpython maybe ?
      
      * List stuff.
      
      * Monkey it up.
      
      * have no idea at this point
      
      * Tmp.
      
      * Shot in the dark.
      
      * Tmate the hell out of this.
      
      * Desperation.
      
      * WTF.
      
      * -y.
      
      * Apparently 3.10 is not available anymore.
      
      * Updating the dockerfile to make libpython discoverable at runtime too.
      
      * Put back rust tests.
      
      * Why do we want mkl on AMD ?
      
      * Forcing 3.11 ?
      dae3bf1d
    • Prefix test - Different kind of load test to trigger prefix test bugs. (#2490) · a4e3e8c6
      Nicolas Patry authored
      
      
      * Adding prefix test.
      
      * [WIP] tmp dump of integration load tests.
      
      * Remove other tensor creation.
      
      * Fixed the radix tree.
      
      Used a slice everywhere in radix.rs to keep the cheap Arc cloning
      instead of recomputing the input_ids.
      
      * Fix parsing
      
      * Is it really flashinfer version ?
      
      * Remove some comments.
      
      * Revert the max prefix hit.
      
      * Adding numpy to diff.
      
      * Upgraded flashinfer.
      
      * Upgrading some stuff.
      
      * Are we done yet ?
      
      * Minor fixup
      
      * Remove 1 log and put back the other.
      
      * Add comment for why slot 0 is OK.
      
      * Mounting on the job.
      
      * Get me a debug branch
      
      * Debugging CIs is fun.
      
      * Attempt #28
      
      * wip
      
      * Tmate.
      
      * Praying.
      
      * Updating VLM causal model with updated context.
      
      * Important line got squashed.
      
      * Tmate again.
      
      * Fingers crossed.
      
      * We want only 1 run of integration tests.....
      
      ---------
      Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>
      a4e3e8c6
  22. 05 Sep, 2024 1 commit
  23. 02 Sep, 2024 1 commit
  24. 29 Aug, 2024 2 commits
    • Tied embeddings in MLP speculator. (#2473) · d9fbbaaf
      Nicolas Patry authored
      * Tied embeddings in MLP speculator.
      
      * Fixing the scale_weight when users decide to not use the speculation as
      much as defined in the config.
      
      * Adding scaling support + optimize some ops.
      d9fbbaaf
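      "Tied embeddings" here means the speculator head reuses the base model's embedding matrix instead of carrying its own copy; a generic illustration (the module names are hypothetical, not the speculator's actual attributes):

      ```python
      import torch.nn as nn

      vocab_size, hidden_size = 32000, 4096
      embedding = nn.Embedding(vocab_size, hidden_size)
      speculator_head = nn.Linear(hidden_size, vocab_size, bias=False)

      # Tie the output projection to the input embedding: one shared parameter tensor,
      # so the speculator adds no extra vocab-sized weight matrix.
      speculator_head.weight = embedding.weight
      assert speculator_head.weight.data_ptr() == embedding.weight.data_ptr()
      ```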
    • Lots of improvements (Still 2 allocators) (#2449) · e415b690
      Nicolas Patry authored
      
      
      * Making prefix/flashinfer the default and testing the full release tests.
      
      * Include flashinfer in the docker.
      
      * Using prebuilt.
      
      * Allowing window_left_size (dummy version).
      
      * Disabling flashinfer/prefix caching on odd head_dim
      
      * Disable prefix caching for lora.
      
      * More specific codes.
      
      * Update lock
      
      * Updating integration tests with new values with FI/FD.
      
      Remove paged as a default too, and using FD everywhere.
      
      * Update cargo lock ?
      
      * Upgrade to 1.80 because of bitstream...
      
      * Everywhere 1.80
      
      * Forgot last default place.
      
      * Apply suggestions from code review
      Co-authored-by: drbh <david.richard.holtz@gmail.com>
      
      * Updated flake lock
      
      * Tmp
      
      * Upgrade resolution system for less errors in resolution.
      
      * Remove lambda for cleaner function.
      
      * Handling debugger.
      
      * Override the env in server tests.
      
      * Is this enough to make it work ?
      
      * This seems to be working.
      
      * Downgrade some logs.
      
      * Fixing the default for vlm.
      
      * Don't enable prefix caching on VLM just yet.
      
      * Change `add_special_tokens` in order to have the correct tokens for chat
      input (since it's super important with the prefixing now); see the
      illustration after this entry.
      
      * Fixing prefix caching for flashdecoding.
      
      * Update all models.
      
      * Fixed flashinfer version.
      
      * add_special_tokens is internal only
      
      * Fixing seqlen with the new vlms.
      
      * Fixing the issue with `add_special_tokens` not being passed around.
      
      * Fixing the test.
      
      * Removing encoder_decoder (seq2seq).
      
      * Update the chat test.
      
      * Fixing the batching tokenization in flash causal lm.
      
      * Truncating left for radix purposes.
      
      * Oops this doesn't belong here.
      
      * Put back default pure shell.
      
      * Update server tests
      
      - Default to throughput test in k6
      - Use TGI_WIGGLE_ROOM to adjust wiggle room
      
      * Only n_heads / process_group.size() are necessary.
      
      * Revert the integration tests change (seems linked to the head_size
      modification).
      
      * Adding error message when assert is violated.
      
      * Fixing the free algorithm to handle times where the common prefix is
      smaller.
      
      * Apply suggestions from code review
      Co-authored-by: OlivierDehaene <olivier@huggingface.co>
      
      * Update server/text_generation_server/layers/attention/common.py
      Co-authored-by: OlivierDehaene <olivier@huggingface.co>
      
      * Fix disabling prefix caching - Fix windowing checks.
      
      * Revert the Cohere tokenizer change (for now using a revision instead).
      
      * Fmt.
      
      ---------
      Co-authored-by: drbh <david.richard.holtz@gmail.com>
      Co-authored-by: OlivierDehaene <olivier@huggingface.co>
      e415b690
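      On the `add_special_tokens` bullet above: chat-templated input already contains whatever special tokens the template emits, so tokenizing it again with the tokenizer's own special tokens would duplicate them, which matters once prefixes are cached. A small illustration with the transformers API (the model name is just an example):

      ```python
      from transformers import AutoTokenizer

      tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")  # example model
      prompt = tok.apply_chat_template(
          [{"role": "user", "content": "Hello"}],
          tokenize=False,
          add_generation_prompt=True,
      )
      # The template already emitted any BOS/special markers, so don't add them again.
      ids = tok(prompt, add_special_tokens=False).input_ids
      ```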
  25. 26 Aug, 2024 1 commit
  26. 20 Aug, 2024 1 commit
    • Prefix caching (#2402) · b70ae096
      Nicolas Patry authored
      
      
      * Prefix caching WIP
      
      * Fixing prefix attention.
      
      * Fixing flashinfer import.
      
      * Fixing black.
      
      * Fixing medusa (still wrong outputs, but functional).
      
      * Just medusa values now.
      
      * Fixing medusa without prefix caching.
      
      * Fixing prefix caching.
      
      * Medusa requires reshaping.
      
      * Removing the logs.
      
      * Remove router.nix
      
      * Fixup:
      
      - Remove logs
      - Disable VLMs (they do not work)
      - Disable prefix caching when user wants prefill logprobs.
      
      * Update flake.lock
      
      ---------
      Co-authored-by: Daniël de Kok <me@danieldk.eu>
      b70ae096
  27. 15 Aug, 2024 1 commit
    • Fixing exl2 and other quantize tests again. (#2419) · 57b34958
      Nicolas Patry authored
      * Fixing exl2 and other quantize tests again.
      
      * Mark exl2 as non-release (so CI tests them; needs to be removed later).
      
      * Fixing exl2 (by disabling cuda graphs)
      
      * Fix quantization defaults without cuda graphs on exl2 (linked to new
      issues with it).
      
      * Removing serde override.
      
      * Go back to released exl2 and remove log.
      
      * Adding warnings for deprecated bitsandbytes + upgrade info to warn.
      57b34958
  28. 14 Aug, 2024 1 commit
  29. 13 Aug, 2024 1 commit
  30. 12 Aug, 2024 2 commits
    • feat: validate template variables before apply and improve sliding wi… (#2403) · 155f9c98
      drbh authored
      * feat: validate template variables before apply and improve sliding window check
      
      * fix: improve missing template var test
      155f9c98
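      One way to validate template variables before applying a chat template is to list the template's undeclared variables with `jinja2.meta` and compare them against what the request provides; this is a generic illustration, not necessarily the exact check this PR adds:

      ```python
      from jinja2 import Environment, meta

      env = Environment()
      template_src = "{{ bos_token }}{% for m in messages %}{{ m['content'] }}{% endfor %}"

      required = meta.find_undeclared_variables(env.parse(template_src))  # {'bos_token', 'messages'}
      provided = {"messages": [{"role": "user", "content": "hi"}]}

      missing = required - provided.keys()
      if missing:
          raise ValueError(f"Missing chat template variables: {sorted(missing)}")
      ```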
    • Add support for prefix caching to the v3 router (#2392) · 8deeaca4
      Daniël de Kok authored
      This change adds support for prefix caching to the v3 router. This
      is broken up from the backend support to ease reviewing.
      
      For now, prefix caching is only enabled with `USE_PREFIX_CACHING=1`;
      in this case, the router will switch to `RadixAllocator`. This
      allocator uses a radix trie to keep track of prefills that were
      seen previously. If a new prefill is a prefix of a previously-seen
      prefill, the router will send a request with `prefix_len>0`, which
      the backend can use to decide to reuse KV blocks from the cache
      rather than recomputing them (see the sketch after this entry).
      
      Even though backend support is not added in this PR, the backend
      will still work with prefix caching enabled. The prefix lengths
      are just ignored and not used.
      8deeaca4
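      A toy sketch of the prefix lookup described above, using a plain trie over token ids rather than a compressed radix trie; the real `RadixAllocator` tracks KV blocks, but the matching idea is the same:

      ```python
      class TrieNode:
          def __init__(self):
              self.children: dict[int, "TrieNode"] = {}

      class PrefixIndex:
          """Record seen prefills and report how many leading tokens a new prefill shares."""

          def __init__(self):
              self.root = TrieNode()

          def insert(self, token_ids: list[int]) -> None:
              node = self.root
              for t in token_ids:
                  node = node.children.setdefault(t, TrieNode())

          def prefix_len(self, token_ids: list[int]) -> int:
              node, matched = self.root, 0
              for t in token_ids:
                  if t not in node.children:
                      break
                  node, matched = node.children[t], matched + 1
              return matched

      index = PrefixIndex()
      index.insert([1, 2, 3, 4])
      assert index.prefix_len([1, 2, 3, 9]) == 3  # the backend could reuse 3 cached tokens
      ```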