1. 15 Nov, 2024 3 commits
  2. 10 Nov, 2024 1 commit
      Add initial support for compressed-tensors checkpoints (#2732) · a7850008
      Daniël de Kok authored
      compressed-tensors is a safetensors extension for sparse, quantized
      tensors. The format is more powerful than earlier AWQ/GPTQ/FP8
      quantization because:
      
      - Different quantizer configurations can be used for different targets.
      - The format can specify input/output quantizers in addition to weight
        quantizers.
      - Exclusions from quantization can be configured.
      
      This change adds a dependency on the `compressed-tensors` package for
      its configuration parsing and layer matching functionality.
      
      The following types of quantization are supported in this PR:
      
      - W8A16 and W4A16 INT using GPTQ-Marlin kernels.
      - W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels.
      
      Support for other quantization types will be added in subsequent PRs.
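      As a rough illustration of what the new dependency parses, a compressed-tensors
      checkpoint carries a quantization config along these lines; the exact schema comes
      from the `compressed-tensors` package, so treat the field names below as a sketch
      rather than an authoritative reference:

      ```python
      # Hedged sketch of a compressed-tensors style quantization config: per-target
      # quantizer groups, an optional activation quantizer, and an ignore list for
      # exclusions. Field names are illustrative, not a definitive schema.
      quantization_config = {
          "quant_method": "compressed-tensors",
          "config_groups": {
              "group_0": {
                  "targets": ["Linear"],      # different targets can use different groups
                  "weights": {                # weight quantizer, e.g. W8A16 INT
                      "num_bits": 8,
                      "type": "int",
                      "symmetric": True,
                      "strategy": "channel",
                  },
                  "input_activations": None,  # could hold an input quantizer, e.g. W8A8
              },
          },
          "ignore": ["lm_head"],              # configurable exclusions from quantization
      }
      ```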
  3. 04 Nov, 2024 4 commits
  4. 02 Nov, 2024 1 commit
  5. 01 Nov, 2024 1 commit
      fix cuda graphs for qwen2-vl (#2708) · 01dacf8e
      drbh authored
      
      
      * feat: support multidimensional position ids on batch to enable cuda graphs on qwen2-vl
      
      * fix: only check model type if config exists
      
      * fix: adjust sharding and lm head logic
      
      * fix qwen2 failure in intel cpu
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      
      * fix: return correct shape logits and add streaming test
      
      * fix: remove unused import and refactor test
      
      ---------
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
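      A minimal sketch of the "multidimensional position ids" idea (names and shapes are
      illustrative, not TGI's actual API): Qwen2-VL-style rotary embeddings use several
      position components per token, so the batch keeps a 2-D position-id tensor with a
      fixed trailing dimension, which keeps shapes static enough for CUDA graph capture:

      ```python
      import torch

      # Illustrative only: one position per rotary "section" (e.g. temporal/height/width)
      # instead of a flat 1-D position-id tensor.
      def make_position_ids(num_tokens: int, n_sections: int = 3) -> torch.Tensor:
          # Text-only tokens repeat the same position across all sections; image tokens
          # would instead get distinct temporal/height/width coordinates.
          pos = torch.arange(num_tokens, dtype=torch.int64)
          return pos.unsqueeze(-1).expand(num_tokens, n_sections).contiguous()

      position_ids = make_position_ids(8)  # shape (8, 3) rather than (8,)
      ```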
  6. 30 Oct, 2024 1 commit
      Support qwen2 vl (#2689) · befd9f67
      drbh authored
      * feat: add support for qwen2 vl model
      
      * feat: fix token padding, enable warmup and process basic request
      
      * fix: improve get_position_ids, add lift embed_tokens
      
      * fix: remove get_cos_sin_hack dev function
      
      * feat: add simple test chat with message and text
      
      * fix: lint test
      
      * fix: adjust positional embeddings for multi dimensional position ids
      
      * fix: update docs and lint unused vars
      
      * fix: include linted file
      
      * fix: add norm after text output
      
      * fix: format model file
      
      * fix: adjust for ruff lints
      
      * fix: remove unused rotate_half
      
      * feat: refactors and calc num features
      
      * fix: prefer position_ids passed from vlm causal lm and reset ids on batch
      
      * fix: adjust get_position_ids if not available and add required args to signatures
      
      * fix: adjust resize case for qwen2_vl warmup
      
      * fix: avoid qwen2 vl specific paths with qwen2
  7. 28 Oct, 2024 3 commits
      Fixing auto bloom test. (#2699) · 3a9cdc32
      Nicolas Patry authored
      We can have a tokenizer anywhere. (#2527) · 90b226db
      Nicolas Patry authored
      * We can have a tokenizer anywhere.
      
      * Handling potential lack of offsets (python tokenizer)
      
      * Remove redundancy.
      
      * Fixing the tests.
      
      * Flake.lock update ?
      
      * Fixing the GIL locking.
      
      * Fixing mamba by using the transformers version.
      
      * Adding the legacy handle.
      
      * Elide lifetime.
      
      * Lint.
      
      * Deprecation message.
      
      * Fixing bad rebase.
      Choosing input/total tokens automatically based on available VRAM? (#2673) · 0c9b6cdd
      Nicolas Patry authored
      * Choosing input/total tokens automatically based on available VRAM?
      
      * Update doc.
      
      * Remove generated files.
      
      * Trying to fix non chunking targets.
      
      * Attempt #2
      
      * fix.
      
      * QuantLinear is rocm compatible.
      
      * Much simpler logic after the overhead.
      
      * Updating logic + non flash.
      
      * Revert doc text.
      
      * Simple updates.
      
      * Fix integration mt0 (transformers update).
  8. 25 Oct, 2024 3 commits
  9. 24 Oct, 2024 2 commits
      Add support for FP8 KV cache scales (#2628) · eab07f74
      Daniël de Kok authored
      * Add support for FP8 KV cache scales
      
      Since FP8 only has limited dynamic range, we can scale keys/values
      before storing them into the cache (and unscale them in attention). To
      avoid rescaling the cache as the absmax values change, good scales are
      usually determined per layer using calibration data and stored
      in the checkpoint.
      
      This change adds support for using key-value scales and loading them
      from checkpoints in the two most common formats:
      
      - Separate per-layer `k_scale` and `v_scale` scalars.
      - Per-layer `kv_scale` scalar (older format).
      
      Currently, scales are only used with a `float8_e4m3fn` cache.
      
      Besides adding support for key/value scales, the `fp8_quantize` function
      is also extended to support quantization with a kernel vendored from
      vLLM. This is slightly faster than the PyTorch implementation, but also
      scales in FP32, potentially improving accuracy.
      
      * Update FP8 KV cache test to use checkpoint with scales
      
      * `can_scale`: check that the attention is flashinfer
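      A minimal sketch of the scale/unscale idea described above, assuming PyTorch's
      `float8_e4m3fn` dtype and the checkpoint convention where dequantization multiplies
      by the stored scale (identifiers are illustrative, not TGI's API):

      ```python
      import torch

      def quantize_to_fp8(x: torch.Tensor, scale: float) -> torch.Tensor:
          # Scale into FP8's limited dynamic range before writing to the cache.
          finfo = torch.finfo(torch.float8_e4m3fn)
          return (x / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)

      def dequantize_from_fp8(q: torch.Tensor, scale: float) -> torch.Tensor:
          # Undo the scaling when the cached keys/values are read in attention.
          return q.to(torch.float32) * scale

      k = torch.randn(4, 8)
      k_scale = 0.05                      # calibrated per layer, loaded from the checkpoint
      k_cached = quantize_to_fp8(k, k_scale)
      k_restored = dequantize_from_fp8(k_cached, k_scale)
      ```

      In the older `kv_scale` format, the same sketch applies with a single scale shared by
      keys and values.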
  10. 23 Oct, 2024 2 commits
  11. 19 Oct, 2024 1 commit
      Make handling of FP8 scales more consistent (#2666) · 5e0fb468
      Daniël de Kok authored
      Change `fp8_quantize` so that we can pass around reciprocals everywhere,
      so scales are always passed around in the checkpoint format.
      
      I also noticed that we ignore any input scales that we might have when
      fbgemm is available. Skip this path if we already have a scale.
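      A hedged illustration of keeping a single convention (identifiers made up for the
      example): if checkpoints store `scale` such that `dequantized = fp8_value * scale`
      while a kernel expects the reciprocal, converting in exactly one place keeps every
      other call site in the checkpoint convention:

      ```python
      import torch

      def to_kernel_scale(checkpoint_scale: torch.Tensor) -> torch.Tensor:
          # Convert from the checkpoint convention to a reciprocal at a single boundary.
          return checkpoint_scale.reciprocal()

      scale = torch.tensor(0.05)          # as stored in the checkpoint
      kernel_scale = to_kernel_scale(scale)
      ```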
  12. 18 Oct, 2024 1 commit
  13. 17 Oct, 2024 4 commits
  14. 16 Oct, 2024 2 commits
      feat: prefill chunking (#2600) · a6a0c97e
      OlivierDehaene authored
      
      
      * wip
      
      * rollback
      
      * refactor to use prefix/postfix naming + fix all_input_ids_tensor
      
      * maybe patching vlms?
      
      * fix filter and concat
      
      * wip, no filter, no concat
      
      * current
      
      * add prepare_for_prefill
      
      * working
      
      * load tested
      
      * re-create slots
      
      * re-create slots
      
      * fix slot_filtering_indices
      
      * feedback loop
      
      * remove log
      
      * fix benchmarker
      
      * fix vlm and seq2seq
      
      * rename to cache and input lengths
      
      * fix prefill logprobs
      
      * fix launcher
      
      * fix logprobs?
      
      * idk at this point
      
      * max input length
      
      * omfg
      
      * remove debugging lines
      
      * fix tests
      
      * fix mllama
      
      * fix cargo tests
      
      * remove support chunking for paged
      
      * Fixing non blocked attentions
      
      * Fixing dtype + AMD, Ipex targets.
      
      * lint fix.
      
      * rename
      
      * Fix prefix_caching variable, remove defaults in server (confusing a lot
      of the time).
      
      * Add simple resolution when user specifies ATTENTION=paged.
      
      * Put back non default simple tests.
      
      * Fix env name
      
      ---------
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
      Fp8 e4m3_fnuz support for rocm (#2588) · 704a58c8
      Mohit Sharma authored
      * (feat) fp8 fnuz support for rocm
      
      * (review comments) Fix compression_config load, type hints
      
      * (bug) update all has_tensor
      
      * (review_comments) fix typo and added comments
      
      * (nit) improved comment
  15. 15 Oct, 2024 1 commit
  16. 14 Oct, 2024 1 commit
      feat: enable pytorch xpu support for non-attention models (#2561) · 58848cb4
      Dmitry Rogozhkin authored
      
      
      The XPU backend is available natively (without IPEX) in PyTorch starting
      from PyTorch 2.4. This commit extends TGI to cover the case where a user
      has XPU support through PyTorch 2.4 but does not have IPEX installed.
      Models which don't require attention can work; for models that require
      attention, more work is needed to provide an attention implementation.
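      A minimal sketch of what native XPU support means in practice, assuming
      PyTorch >= 2.4 (no IPEX import involved):

      ```python
      import torch

      # Models that don't need custom attention kernels can simply be placed on
      # the XPU device exposed by stock PyTorch 2.4+.
      if hasattr(torch, "xpu") and torch.xpu.is_available():
          device = torch.device("xpu")
      else:
          device = torch.device("cpu")

      x = torch.randn(2, 3, device=device)
      print(device, x.sum().item())
      ```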
      
      Tested with the following models:
      * teknium/OpenHermes-2.5-Mistral-7B
      * bigscience/bloom-560m
      * google/gemma-7b
      * google/flan-t5-xxl
      Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
  17. 11 Oct, 2024 1 commit
  18. 08 Oct, 2024 2 commits
  19. 07 Oct, 2024 2 commits
  20. 04 Oct, 2024 1 commit
      Add basic FP8 KV cache support (#2603) · 2358c2bb
      Daniël de Kok authored
      * Add basic FP8 KV cache support
      
      This change adds rudimentary FP8 KV cache support. The support is
      enabled by passing `--kv-cache-dtype fp8_e5m2` to the launcher. Doing so
      uses this type for the KV cache. However, support is still limited:
      
      * Only the `fp8_e5m2` type is supported.
      * The KV cache layout is the same as `float16`/`bfloat16` (HND).
      * The FP8 KV cache is only supported for FlashInfer.
      * Loading of scales is not yet supported.
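      A small sketch of what the `fp8_e5m2` cache amounts to, assuming PyTorch's
      `torch.float8_e5m2` dtype (shapes and layout below are illustrative):

      ```python
      import torch

      # The KV cache tensors are allocated with float8_e5m2 instead of
      # float16/bfloat16, keeping an HND-style (heads, tokens, dim) block layout.
      num_blocks, num_heads, block_size, head_dim = 128, 8, 16, 64
      key_cache = torch.empty(
          num_blocks, num_heads, block_size, head_dim, dtype=torch.float8_e5m2
      )
      value_cache = torch.empty_like(key_cache)
      ```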
      
      * Fix Cargo.toml
  21. 02 Oct, 2024 1 commit
      Mllama flash version (#2585) · d18ed5cf
      Nicolas Patry authored
      * Working loading state.
      
      * Preprocessing.
      
      * Working state ? (Broke idefics1 temporarily).
      
      * Cleaner condition.
      
      * Fix idefics.
      
      * Updating config, removing TODO
      
      * Mllama
      
      * Upgrade transformers 4.45
      
      * Flashing mllama.
      
      * Starting to get there.
      
      * Working state.
      
      * Integration tests for mllama (cutting to 10 tokens because there seems
      to be instability after, meaning the size of the batch matters).
      
      * Updating model link.
      
      * Earlier assert.
      
      * Fix vlm ?
      
      * remove log.
      
      * Force ignore all images but last.
      
      * Default dtype bfloat16.
      
      * Update integration test after switch to bf16.
      
      * Remove dead code.
      
      * Removed dead code.
      
      * Upgrade the flake to latest transformers/tokenizers
      
      * Move to hf tgi-nix
      
      * Upgrade to 0.5.0
  22. 30 Sep, 2024 2 commits
      MoE Marlin: support `desc_act` for `groupsize != -1` (#2590) · 1c84a30f
      Daniël de Kok authored
      This change uses the updated Marlin MoE kernel from vLLM to support
      MoE with activation sorting and groups.
      feat: support phi3.5 moe (#2479) · 93a7042d
      drbh authored
      
      
      * feat: support phi3.5 moe model loading
      
      * fix: prefer llama base model and improve rotary logic
      
      * feat: return reasonable generation and add integration test
      
      * fix: run lint and update docs
      
      * fix: rerun lint for openapi docs
      
      * fix: prefer do_sample false unless temp is set by user, and update chat tests
      
      * fix: small typo adjustments
      
      * fix: consolidate long rope paths
      
      * fix: revert greedy by default and test changes
      
      * Vendor configuration so that we don't have to `trust_remote_code`
      
      * Use SparseMoELayer
      
      * Add support for dense MoE
      
      * Some type annotations
      
      * Add the usual model tests
      
      * Ruff.
      
      ---------
      Co-authored-by: Daniël de Kok <me@danieldk.eu>
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>