1. 25 Nov, 2024 1 commit
  2. 22 Nov, 2024 1 commit
  3. 21 Nov, 2024 1 commit
  4. 20 Nov, 2024 1 commit
  5. 19 Nov, 2024 1 commit
    • drbh's avatar
      PR 2634 CI - Fix the tool_choice format for named choice by adapting OpenAIs scheme (#2645) · 5489406c
      drbh authored
      
      
      * add OpenAI like tool_choice for named choice
      
      * add tests
      
      * fix: run linter and bump api docs
      
      * fix: consolidate changes and remove old tool type
      
      * feat: improve, simplify and rename tool choice struct add required support and refactor
      
      * fix: simplify tool choice logic, improve tests, openapi and rust docs
      
      * fix: refactor away prepare_chat_input and improve tool grammar apply control flow
      
      * feat: update docs and add tool choice configuration section
      
      * fix: simplify naming, tool choice default and improve test
      
      * fix: adjust tool choice none logic, add test and small refactors
      
      * fix: add missing snapshot file
      
      * fix: adjust tool choice type in test
      
      * fix: adjust default when json tool choice is
      
      * fix: remove trailing space lint after rebase
      
      * fix: remove mostly mocked unit test
      
      ---------
      Co-authored-by: default avatarLinus Bierhoff <linus.bierhoff@icloud.com>
      5489406c
  6. 18 Nov, 2024 1 commit
    • Daniël de Kok's avatar
      Add support for compressed-tensors w8a8 int checkpoints (#2745) · 3c9df21f
      Daniël de Kok authored
      
      
      * Add support for compressed-tensors w8a8 int checkpoints
      
      This change adds a loader for w8a8 int checkpoints. One large benefit of
      int8 support is that the corresponding cutlass matmul kernels also work on
      compute capability 7.5.
      
      Evaluation on neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8:
      
      |     Tasks     |Version|     Filter     |n-shot|        Metric         |   |Value |   |Stderr|
      |---------------|------:|----------------|-----:|-----------------------|---|-----:|---|------|
      |gsm8k_cot_llama|      3|flexible-extract|     8|exact_match            |↑  |0.8431|±  |0.0100|
      |               |       |strict-match    |     8|exact_match            |↑  |0.8393|±  |0.0101|
      |ifeval         |      4|none            |     0|inst_level_loose_acc   |↑  |0.8597|±  |   N/A|
      |               |       |none            |     0|inst_level_strict_acc  |↑  |0.8201|±  |   N/A|
      |               |       |none            |     0|prompt_level_loose_acc |↑  |0.7967|±  |0.0173|
      |               |       |none            |     0|prompt_level_strict_acc|↑  |0.7468|±  |0.0187|
      
      Which is the same ballpark as vLLM.
      
      As usual, lots of thanks to Neural Magic/vLLM for the kernels.
      
      * Always use dynamic input quantization for w8a8 int
      
      It's far less flaky and gives better output.
      
      * Use marlin-kernels 0.3.5
      
      * Fix a typo
      Co-authored-by: default avatardrbh <david.richard.holtz@gmail.com>
      
      * Small fixes
      
      ---------
      Co-authored-by: default avatardrbh <david.richard.holtz@gmail.com>
      3c9df21f
  7. 10 Nov, 2024 1 commit
    • Daniël de Kok's avatar
      Add initial support for compressed-tensors checkpoints (#2732) · a7850008
      Daniël de Kok authored
      compressed-tensors is a safetensors extension for sparse, quantized
      tensors. The format is more powerful than earlier AWQ/GPTQ/FP8
      quantization, because
      
      - Different quantizer configurations can be used for different targets.
      - The format can specify input/output quantizers in addition to weight
        quantizers.
      - Configurable exclusions for quantization.
      
      This change adds a dependency on the `compressed-tensors` package for
      its configuration parsing and layer matching functionality.
      
      The following types of quantization are supported in this PR:
      
      - W8A16 and W4A16 INT using GPTQ-Marlin kernels.
      - W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels.
      
      Support for other quantization types will be added in subsequent PRs.
      a7850008
  8. 01 Nov, 2024 1 commit
    • drbh's avatar
      fix cuda graphs for qwen2-vl (#2708) · 01dacf8e
      drbh authored
      
      
      * feat: support multidimensional position ids on batch to enable cuda graphs on qwen2-vl
      
      * fix: only check model type if config exists
      
      * fix: adjust sharding and lm head logic
      
      * fix qwen2 failure in intel cpu
      Signed-off-by: default avatarWang, Yi A <yi.a.wang@intel.com>
      
      * fix: return correct shape logits and add streaming test
      
      * fix: remove unused import and refactor test
      
      ---------
      Signed-off-by: default avatarWang, Yi A <yi.a.wang@intel.com>
      01dacf8e
  9. 30 Oct, 2024 1 commit
    • drbh's avatar
      Support qwen2 vl (#2689) · befd9f67
      drbh authored
      * feat: add support for qwen2 vl model
      
      * feat: fix token padding, enable warmup and process basic request
      
      * fix: improve get_position_ids, add lift embed_tokens
      
      * fix: remove get_cos_sin_hack dev function
      
      * feat: add simple test chat with meesage and text
      
      * fix: lint test
      
      * fix: adjust positional embeddings for multi dimensional position ids
      
      * fix: update docs and lint unused vars
      
      * fix: include linted file
      
      * fix: add norm after text output
      
      * fix: format model file
      
      * fix: adjust for ruff lints
      
      * fix: remove unused rotate_half
      
      * feat: refactors and calc num features
      
      * fix: prefer position_ids passed from vlm causal lm and reset ids on batch
      
      * fix: adjust get_position_ids if not available and add required args to signatures
      
      * fix: adjust resize case for qwen2_vl warmup
      
      * fix: avoid qwen2 vl specific paths with qwen2
      befd9f67
  10. 28 Oct, 2024 4 commits
  11. 26 Oct, 2024 1 commit
  12. 25 Oct, 2024 3 commits
  13. 24 Oct, 2024 2 commits
    • Daniël de Kok's avatar
      Add support for FP8 KV cache scales (#2628) · eab07f74
      Daniël de Kok authored
      * Add support for FP8 KV cache scales
      
      Since FP8 only has limited dynamic range, we can scale keys/values
      before storing them into the cache (and unscale them in attention). To
      avoid rescaling the cache as the absmax values change, good scales are
      usually determined per layer using calibration calibration data and stored
      in the checkpoint.
      
      This change adds support for for using key-value scales and loading them
      from checkpoints in the two most common formats:
      
      - Separate per-layer `k_scale` and `v_scale` scalars.
      - Per-layer `kv_scale` scalar (older format).
      
      Currently, scales are only used with an `float8_e4m3fn` cache.
      
      Besides adding support for key/value scales, the `fp8_quantize` function
      is also extended to support quantization with a kernel vendored from
      vLLM. This is slightly faster than the PyTorch implementation, but also
      scales in FP32, potentially improving accuracy.
      
      * Update FP8 KV cache test to use checkpoint with scales
      
      * `can_scale`: check that the attention is flashinfer
      eab07f74
    • Daniël de Kok's avatar
      Fix Phi 3.5 MoE tests (#2684) · 14a0df3a
      Daniël de Kok authored
      PR #2682 also fixed in issue in Phi MoE, but it changes the test
      outputs a bit. Fix this.
      14a0df3a
  14. 21 Oct, 2024 1 commit
    • Daniël de Kok's avatar
      Test Marlin MoE with `desc_act=true` (#2622) · 7f54b733
      Daniël de Kok authored
      Update the Mixtral GPTQ test to use a model with `desc_act=true` and
      `group_size!=-1` to ensure that we are checking activation
      sorting/non-full K (with tensor parallelism). The `desc_act=false` case
      is already checked by the Mixtral AWQ test.
      7f54b733
  15. 18 Oct, 2024 1 commit
  16. 16 Oct, 2024 1 commit
    • OlivierDehaene's avatar
      feat: prefill chunking (#2600) · a6a0c97e
      OlivierDehaene authored
      
      
      * wip
      
      * rollback
      
      * refactor to use prefix/postfix namming + fix all_input_ids_tensor
      
      * maybe patching vlms?
      
      * fix filter and concat
      
      * wip, no filter, no concat
      
      * current
      
      * add prepare_for_prefill
      
      * working
      
      * load tested
      
      * re-create slots
      
      * re-create slots
      
      * fix slot_filtering_indices
      
      * feedback loop
      
      * remove log
      
      * fix benchmarker
      
      * fix vlm and seq2seq
      
      * rename to cache and input lengths
      
      * fix prefill logprobs
      
      * fix launcher
      
      * fix logprobs?
      
      * idk at this point
      
      * max input length
      
      * omfg
      
      * remove debugging lines
      
      * fix tests
      
      * fix mllama
      
      * fix cargo tests
      
      * remove support chunking for paged
      
      * Fixing non blocked attentions
      
      * Fixing dtype + AMD, Ipex targets.
      
      * lint fix.
      
      * rename
      
      * Fix prefix_caching variable, remove defaults in server (confusing a lot
      of the times).
      
      * Add simple resolution when user specifies ATTENTION=paged.
      
      * Put back non default simple tests.
      
      * Fix env name
      
      ---------
      Co-authored-by: default avatarNicolas Patry <patry.nicolas@protonmail.com>
      a6a0c97e
  17. 10 Oct, 2024 2 commits
    • Nicolas Patry's avatar
      Intel ci (#2630) · 3dbdf63e
      Nicolas Patry authored
      * Intel CI ?
      
      * Let's try non sharded gemma.
      
      * Snapshot rename
      
      * Apparently container can be gone already.
      3dbdf63e
    • drbh's avatar
      feat: allow tool calling to respond without a tool (#2614) · e36dfaa8
      drbh authored
      
      
      * feat: process token stream before returning to client
      
      * fix: expect content in test
      
      * fix: improve comparison via ruff lint
      
      * fix: return event in all cases
      
      * fix: always send event on error, avoid unwraps, refactor and improve tests
      
      * fix: prefer no_tool over notify_error to improve reponse
      
      * fix: adjust chat input test for no_tool
      
      * fix: adjust test expected content
      
      ---------
      Co-authored-by: default avatarSystem administrator <root@ip-10-90-0-186.ec2.internal>
      e36dfaa8
  18. 08 Oct, 2024 1 commit
  19. 04 Oct, 2024 1 commit
    • Daniël de Kok's avatar
      Add basic FP8 KV cache support (#2603) · 2358c2bb
      Daniël de Kok authored
      * Add basic FP8 KV cache support
      
      This change adds rudimentary FP8 KV cache support. The support is
      enabled by passing `--kv-cache-dtype fp8_e5m2` to the launcher. Doing so
      uses this type for the KV cache. However support is still limited:
      
      * Only the `fp8_e5m2` type is supported.
      * The KV cache layout is the same as `float16`/`bfloat16` (HND).
      * The FP8 KV cache is only supported for FlashInfer.
      * Loading of scales is not yet supported.
      
      * Fix Cargo.toml
      2358c2bb
  20. 03 Oct, 2024 1 commit
  21. 02 Oct, 2024 2 commits
    • drbh's avatar
      Unroll notify error into generate response (#2597) · d22b0c1f
      drbh authored
      * feat: unroll notify_error if no tool is choosen
      
      * fix: expect simple message when no tool is selected
      
      * fix: improve test to avoid notify_error
      
      * fix: improve docs and indicate change in expected response
      
      * fix: adjust linting in test file
      d22b0c1f
    • Nicolas Patry's avatar
      Mllama flash version (#2585) · d18ed5cf
      Nicolas Patry authored
      * Working loading state.
      
      * Preprocessing.
      
      * Working state ? (Broke idefics1 temporarily).
      
      * Cleaner condition.
      
      * Fix idefics.
      
      * Updating config, removing TODO
      
      * Mllama
      
      * Ugrade transformers 4.45
      
      * Flashing mllama.
      
      * Starting to get there.
      
      * Working state.
      
      * Integrations tests for mllama (cutting to 10 tokens because there seems'
      to be instability after (meaning size of the batch matters.
      
      * Updating model link.
      
      * Earlier assert.
      
      * Fix vlm ?
      
      * remove log.
      
      * Force ignore all images but last.
      
      * Default dtype bfloat16.
      
      * Update integration test after switch to bf16.
      
      * Remove dead code.
      
      * Removed dead code.
      
      * Upgrade the flake to latest transformers/tokenizers
      
      * Move to hf tgi-nix
      
      * Upgrade to 0.5.0
      d18ed5cf
  22. 30 Sep, 2024 2 commits
    • drbh's avatar
      feat: support phi3.5 moe (#2479) · 93a7042d
      drbh authored
      
      
      * feat: support phi3.5 moe model loading
      
      * fix: prefer llama base model and improve rotary logic
      
      * feat: return reasonable generation and add integration test
      
      * fix: run lint and update docs
      
      * fix: rerun lint for openapi docs
      
      * fix: prefer do_sample false unless temp is set by user, and update chat tests
      
      * fix: small typo adjustments
      
      * fix: consolidate long rope paths
      
      * fix: revert greedy by default and test changes
      
      * Vendor configuration so that we don't have to `trust_remote_code`
      
      * Use SparseMoELayer
      
      * Add support for dense MoE
      
      * Some type annotations
      
      * Add the usual model tests
      
      * Ruff.
      
      ---------
      Co-authored-by: default avatarDaniël de Kok <me@danieldk.eu>
      Co-authored-by: default avatarNicolas Patry <patry.nicolas@protonmail.com>
      93a7042d
    • Daniël de Kok's avatar
      Add support for GPTQ-quantized MoE models using MoE Marlin (#2557) · 90a1d04a
      Daniël de Kok authored
      This change add support for MoE models that use GPTQ quantization.
      Currently only models with the following properties are supported:
      
      - No `desc_act` with tensor parallelism, unless `group_size=-1`.
      - No asymmetric quantization.
      - No AWQ.
      90a1d04a
  23. 24 Sep, 2024 1 commit
  24. 19 Sep, 2024 1 commit
    • Nicolas Patry's avatar
      Stream options. (#2533) · f512021e
      Nicolas Patry authored
      * Stream options.
      
      * Fetch stuff from nix integration test for easier testing.
      
      * Adding the assert.
      
      * Only send the usage when asked for.
      
      * Update the docs.
      
      * Impure test because we need network.
      
      * develop.
      
      * Optional usage.
      
      * Fixes.
      
      * Workflow
      f512021e
  25. 17 Sep, 2024 1 commit
    • Daniël de Kok's avatar
      Move to moe-kernels package and switch to common MoE layer (#2511) · ce85efa9
      Daniël de Kok authored
      * Move to moe-kernels package and switch to common MoE layer
      
      This change introduces the new `moe-kernels` package:
      
      - Add `moe-kernels` as a dependency.
      - Introduce a `SparseMoELayer` module that can be used by MoE
        models.
      - Port over Mixtral and Deepseek.
      
      * Make `cargo check` pass
      
      * Update runner
      ce85efa9
  26. 16 Sep, 2024 2 commits
    • Nicolas Patry's avatar
      Adding a test for FD. (#2516) · 38fcafcf
      Nicolas Patry authored
      * Adding a test for FD.
      
      * Fixing flashdecoding (empty batch doesn't work).
      
      * Fixing the invalid popping.
      
      * Fixing radix with block_size > 1
      
      * Last reference.
      
      * Use an actual hash.
      
      * Update hash for slice.len() == 1
      
      * Update the locks.
      
      * Increasing docker timeout.
      38fcafcf
    • Daniël de Kok's avatar
      Add tests for Mixtral (#2520) · 77746552
      Daniël de Kok authored
      Disable by default because CI runners do not have enough GPUs.
      77746552
  27. 11 Sep, 2024 2 commits
    • Nicolas Patry's avatar
      Fix truffle (#2514) · 69e3be20
      Nicolas Patry authored
      * Attempting to discard the trufflehog warning.
      
      * Attempt to fix trufflehog.
      69e3be20
    • Nicolas Patry's avatar
      Prefix test - Different kind of load test to trigger prefix test bugs. (#2490) · a4e3e8c6
      Nicolas Patry authored
      
      
      * Adding prefix test.
      
      * [WIP] tmp dump of integration load tests.
      
      * Remove other tensor creation.
      
      * Fixed the radix tree.
      
      Used a slice everywhere in radix.rs to keep the cheap Arc cloning
      instead of recomputing the input_ids.
      
      * Fix parsing
      
      * Is it really flashinfer version ?
      
      * Remove some comments.
      
      * Revert the max prefix hit.
      
      * Adding numpy to diff.
      
      * Upgraded flashinfer.
      
      * Upgrading some stuff.
      
      * Are we done yet ?
      
      * Minor fixup
      
      * Remove 1 log and put back the other.
      
      * Add comment for why slot 0 is OK.
      
      * Mounting on the job.
      
      * Get me a debug branch
      
      * Debugging CIs is fun.
      
      * Attempt #28
      
      * wip
      
      * Tmate.
      
      * Praying.
      
      * Updating VLM causal model with updated context.
      
      * Important line got squashed.
      
      * Tmate again.
      
      * Fingers crossed.
      
      * We want only 1 run of integration tests.....
      
      ---------
      Co-authored-by: default avatarGuillaume LEGENDRE <glegendre01@gmail.com>
      a4e3e8c6
  28. 29 Aug, 2024 1 commit
    • Nicolas Patry's avatar
      Lots of improvements (Still 2 allocators) (#2449) · e415b690
      Nicolas Patry authored
      
      
      * Making prefix/flashinfer the default and testing the full release tests.
      
      * Include flashinfer in the docker.
      
      * Using prebuilt.
      
      * Allowing window_left_size (dummy version).
      
      * Disabling flashinfer/prefix caching on odd head_dim
      
      * Disable prefix caching for lora.
      
      * More specific codes.
      
      * Update lock
      
      * Updating integration tests with new values with FI/FD.
      
      Remove paged as a default too, and using FD everywhere.
      
      * Update cargo lock ?
      
      * Upgrade to 1.80 because of bitstream...
      
      * Everywhere 1.80
      
      * Forgot last default place.
      
      * Apply suggestions from code review
      Co-authored-by: default avatardrbh <david.richard.holtz@gmail.com>
      
      * Updated flake lock
      
      * Tmp
      
      * Upgrade resolution system for less errors in resolution.
      
      * Remove lambda for cleaner function.
      
      * Handling debugger.
      
      * OVerride the env in server tests.
      
      * Is this enough to make it work ?
      
      * This seems to be working.
      
      * Downgrade some logs.
      
      * Fixing the default for vlm.
      
      * Don't enable prefix caching on VLM just yet.
      
      * Change `add_special_tokens` in order to have the correct tokens for chat
      input and not (since it's super important with the prefixing now)
      
      * Fixing prefix caching for flashdecoding.
      
      * Update all models.
      
      * Fixed flashinfer version.
      
      * add_special_tokens is internal only
      
      * Fixing seqlen with the new vlms.
      
      * Fixing the issue with `add_special_tokens` not being passed around.
      
      * Fixing the test.
      
      * Removing encoder_decoder (seq2seq).
      
      * Update the chat test.
      
      * Fixing the batching tokenization in flash causal lm.
      
      * Truncating left for radix purposes.
      
      * Oops this doesn't belong here.
      
      * Put back default pure shell.
      
      * Update server tests
      
      - Default to throughput test in k6
      - Use TGI_WIGGLE_ROOM to adjust wiggle room
      
      * Only n_heads / process_group.size() are necessary.
      
      * Revert the integrationt tests change (seem linked to head_size
      modification).
      
      * Adding error message when assert is violated.
      
      * Fixing the free algorithm to handle times where the common prefix is
      smaller.
      
      * Apply suggestions from code review
      Co-authored-by: default avatarOlivierDehaene <olivier@huggingface.co>
      
      * Update server/text_generation_server/layers/attention/common.py
      Co-authored-by: default avatarOlivierDehaene <olivier@huggingface.co>
      
      * Fix disabling prefix caching - Fix windowing checks.
      
      * Revert the Cohere tokenizer change (for now using a revision instead).
      
      * Fmt.
      
      ---------
      Co-authored-by: default avatardrbh <david.richard.holtz@gmail.com>
      Co-authored-by: default avatarOlivierDehaene <olivier@huggingface.co>
      e415b690
  29. 27 Aug, 2024 1 commit
    • drbh's avatar
      Pr 2451 ci branch (#2454) · cfa73b5c
      drbh authored
      
      
      * fix[router]: Fix tools not passed in chat template
      Signed-off-by: default avatarGitHub <noreply@github.com>
      
      * feat: improve default tool serialization and lints
      
      * feat: refactor tool logic to include notify_error in prompt and adjust typing
      
      * fix: adjust non tool template apply
      
      * fix: simplify tool grammar logic and improve schema
      
      * feat: avoid skip tool test and avoid empty tool prompts
      
      * fix: increase test client timeout for grammar compilation tests
      
      ---------
      Signed-off-by: default avatarGitHub <noreply@github.com>
      Co-authored-by: default avatarSimone Rossi <simone.rossi.93@gmail.com>
      cfa73b5c