1. 21 Nov, 2024 2 commits
  2. 19 Nov, 2024 1 commit
    • PR 2634 CI - Fix the tool_choice format for named choice by adapting OpenAI's scheme (#2645) · 5489406c
      drbh authored
      
      
      * add OpenAI like tool_choice for named choice
      
      * add tests
      
      * fix: run linter and bump api docs
      
      * fix: consolidate changes and remove old tool type
      
      * feat: improve, simplify and rename tool choice struct; add required support and refactor
      
      * fix: simplify tool choice logic, improve tests, openapi and rust docs
      
      * fix: refactor away prepare_chat_input and improve tool grammar apply control flow
      
      * feat: update docs and add tool choice configuration section
      
      * fix: simplify naming, tool choice default and improve test
      
      * fix: adjust tool choice none logic, add test and small refactors
      
      * fix: add missing snapshot file
      
      * fix: adjust tool choice type in test
      
      * fix: adjust default when json tool choice is
      
      * fix: remove trailing space lint after rebase
      
      * fix: remove mostly mocked unit test
      
      ---------
      Co-authored-by: Linus Bierhoff <linus.bierhoff@icloud.com>
      5489406c
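The OpenAI-style `tool_choice` values this PR aligns with can be sketched as follows. This is a minimal illustration of the scheme, not TGI's exact internal types; the tool name `get_weather` and the `normalize_tool_choice` helper are made up for the example.

```python
# Sketch of the OpenAI-compatible tool_choice forms. The named form
# forces a specific tool; the string shorthands cover the other modes.
named_choice = {"type": "function", "function": {"name": "get_weather"}}

auto_choice = "auto"          # model decides whether to call a tool
none_choice = "none"          # model must not call a tool
required_choice = "required"  # model must call some tool


def normalize_tool_choice(choice):
    """Toy normalizer: map any accepted form to a (mode, tool_name) pair."""
    if choice in ("auto", "none", "required"):
        return (choice, None)
    if isinstance(choice, dict) and choice.get("type") == "function":
        return ("named", choice["function"]["name"])
    raise ValueError(f"unsupported tool_choice: {choice!r}")
```

A client would then pass one of these values alongside the `tools` list in a chat request.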
  3. 17 Nov, 2024 1 commit
    • Remove vLLM dependency for CUDA (#2751) · 52e48739
      Daniël de Kok authored
      * Remove vLLM dependency for CUDA
      
      This change adds `attention-kernels` as a dependency for paged
      attention and cache reshaping. With that, we don't use vLLM
      anywhere for CUDA.
      
      Tested run (since we don't have paged attention in CI):
      
      ```
      ❯ ATTENTION=paged python -m pytest integration-tests -k "llama and awq" --release
      [...]
      5 snapshots passed.
      ```
      
      * Fix clippy warning
      52e48739
  4. 15 Nov, 2024 2 commits
  5. 07 Nov, 2024 1 commit
  6. 04 Nov, 2024 1 commit
  7. 30 Oct, 2024 1 commit
    • Support qwen2 vl (#2689) · befd9f67
      drbh authored
      * feat: add support for qwen2 vl model
      
      * feat: fix token padding, enable warmup and process basic request
      
      * fix: improve get_position_ids, add lift embed_tokens
      
      * fix: remove get_cos_sin_hack dev function
      
      * feat: add simple test chat with message and text
      
      * fix: lint test
      
      * fix: adjust positional embeddings for multi dimensional position ids
      
      * fix: update docs and lint unused vars
      
      * fix: include linted file
      
      * fix: add norm after text output
      
      * fix: format model file
      
      * fix: adjust for ruff lints
      
      * fix: remove unused rotate_half
      
      * feat: refactors and calc num features
      
      * fix: prefer position_ids passed from vlm causal lm and reset ids on batch
      
      * fix: adjust get_position_ids if not available and add required args to signatures
      
      * fix: adjust resize case for qwen2_vl warmup
      
      * fix: avoid qwen2 vl specific paths with qwen2
      befd9f67
  8. 28 Oct, 2024 1 commit
    • We can have a tokenizer anywhere. (#2527) · 90b226db
      Nicolas Patry authored
      * We can have a tokenizer anywhere.
      
      * Handling potential lack of offsets (python tokenizer)
      
      * Remove redundancy.
      
      * Fixing the tests.
      
      * Flake.lock update ?
      
      * Fixing the GIL locking.
      
      * Fixing mamba by using the transformers version.
      
      * Adding the legacy handle.
      
      * Elide lifetime.
      
      * Lint.
      
      * Deprecation message.
      
      * Fixing bad rebase.
      90b226db
  9. 25 Oct, 2024 1 commit
  10. 23 Oct, 2024 2 commits
  11. 16 Oct, 2024 1 commit
    • feat: prefill chunking (#2600) · a6a0c97e
      OlivierDehaene authored
      
      
      * wip
      
      * rollback
      
      * refactor to use prefix/postfix naming + fix all_input_ids_tensor
      
      * maybe patching vlms?
      
      * fix filter and concat
      
      * wip, no filter, no concat
      
      * current
      
      * add prepare_for_prefill
      
      * working
      
      * load tested
      
      * re-create slots
      
      * re-create slots
      
      * fix slot_filtering_indices
      
      * feedback loop
      
      * remove log
      
      * fix benchmarker
      
      * fix vlm and seq2seq
      
      * rename to cache and input lengths
      
      * fix prefill logprobs
      
      * fix launcher
      
      * fix logprobs?
      
      * idk at this point
      
      * max input length
      
      * omfg
      
      * remove debugging lines
      
      * fix tests
      
      * fix mllama
      
      * fix cargo tests
      
      * remove support chunking for paged
      
      * Fixing non blocked attentions
      
      * Fixing dtype + AMD, Ipex targets.
      
      * lint fix.
      
      * rename
      
      * Fix prefix_caching variable, remove defaults in server (confusing a lot
      of the times).
      
      * Add simple resolution when user specifies ATTENTION=paged.
      
      * Put back non default simple tests.
      
      * Fix env name
      
      ---------
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
      a6a0c97e
  12. 15 Oct, 2024 1 commit
    • Rollback to `ChatRequest` for Vertex AI Chat instead of `VertexChat` (#2651) · ffe05ccd
      Alvaro Bartolome authored
      As spotted by @philschmid, the payload was only partially compliant with
      Vertex AI. The most compliant version flattens the generation kwargs to
      the same level as the `messages`: Vertex AI still expects a list of
      instances, but each instance is an OpenAI-compatible request. This is
      clearer and also more aligned with the SageMaker integration. Kudos to
      him for spotting that, and sorry from my end for any inconvenience
      @Narsil.
      ffe05ccd
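The payload shape described above can be sketched as follows. This is an illustrative example only; the message content and parameter values are placeholders, not taken from the PR.

```python
# Sketch of the Vertex AI payload: a list of "instances", each an
# OpenAI-style chat request with generation kwargs flattened next to
# "messages" rather than nested under a separate key.
vertex_payload = {
    "instances": [
        {
            "messages": [{"role": "user", "content": "What is Deep Learning?"}],
            "max_tokens": 128,
            "temperature": 0.7,
        }
    ]
}
```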
  13. 10 Oct, 2024 1 commit
  14. 08 Oct, 2024 1 commit
  15. 03 Oct, 2024 1 commit
  16. 02 Oct, 2024 3 commits
    • Unroll notify error into generate response (#2597) · d22b0c1f
      drbh authored
      * feat: unroll notify_error if no tool is chosen
      
      * fix: expect simple message when no tool is selected
      
      * fix: improve test to avoid notify_error
      
      * fix: improve docs and indicate change in expected response
      
      * fix: adjust linting in test file
      d22b0c1f
    • Max token capacity metric (#2595) · 0204946d
      Nicolas Patry authored
      
      
      * adding max_token_capacity_metric
      
      * added tgi to name of metric
      
      * Adding max capacity metric.
      
      * Add description for the metrics
      
      ---------
      Co-authored-by: Edwinhr716 <Edandres249@gmail.com>
      0204946d
    • Mllama flash version (#2585) · d18ed5cf
      Nicolas Patry authored
      * Working loading state.
      
      * Preprocessing.
      
      * Working state ? (Broke idefics1 temporarily).
      
      * Cleaner condition.
      
      * Fix idefics.
      
      * Updating config, removing TODO
      
      * Mllama
      
      * Upgrade transformers 4.45
      
      * Flashing mllama.
      
      * Starting to get there.
      
      * Working state.
      
      * Integration tests for mllama (cutting to 10 tokens because there seems
      to be instability afterwards, meaning the size of the batch matters).
      
      * Updating model link.
      
      * Earlier assert.
      
      * Fix vlm ?
      
      * remove log.
      
      * Force ignore all images but last.
      
      * Default dtype bfloat16.
      
      * Update integration test after switch to bf16.
      
      * Remove dead code.
      
      * Removed dead code.
      
      * Upgrade the flake to latest transformers/tokenizers
      
      * Move to hf tgi-nix
      
      * Upgrade to 0.5.0
      d18ed5cf
  17. 30 Sep, 2024 1 commit
    • feat: support phi3.5 moe (#2479) · 93a7042d
      drbh authored
      
      
      * feat: support phi3.5 moe model loading
      
      * fix: prefer llama base model and improve rotary logic
      
      * feat: return reasonable generation and add integration test
      
      * fix: run lint and update docs
      
      * fix: rerun lint for openapi docs
      
      * fix: prefer do_sample false unless temp is set by user, and update chat tests
      
      * fix: small typo adjustments
      
      * fix: consolidate long rope paths
      
      * fix: revert greedy by default and test changes
      
      * Vendor configuration so that we don't have to `trust_remote_code`
      
      * Use SparseMoELayer
      
      * Add support for dense MoE
      
      * Some type annotations
      
      * Add the usual model tests
      
      * Ruff.
      
      ---------
      Co-authored-by: Daniël de Kok <me@danieldk.eu>
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
      93a7042d
  18. 26 Sep, 2024 1 commit
  19. 24 Sep, 2024 2 commits
    • Cleanup Vertex + Chat (#2553) · c032280b
      Nicolas Patry authored
      * Cleanup Vertex + Chat
      
      * logprobs defaults to false.
      
      * Parameters are optional
      
      * Fix  docs.
      
      * Changing back this logprobs default.
      
      * Fixup doc.
      
      * Let's debug that.
      
      * Not unstable.
      
      * Updating Cargo ?
      
      * Wat?
      
      * Dummy change.
      
      * Trying some other install.
      
      * Trying something.
      
      * Revert everything.
      
      * Update Cargo lock.
      
      * Fixing the pre-commit after rebase.
      c032280b
    • chore: Add old V2 backend (#2551) · 10e6f292
      OlivierDehaene authored
      * wip
      
      * added v2
      10e6f292
  20. 19 Sep, 2024 1 commit
    • Stream options. (#2533) · f512021e
      Nicolas Patry authored
      * Stream options.
      
      * Fetch stuff from nix integration test for easier testing.
      
      * Adding the assert.
      
      * Only send the usage when asked for.
      
      * Update the docs.
      
      * Impure test because we need network.
      
      * develop.
      
      * Optional usage.
      
      * Fixes.
      
      * Workflow
      f512021e
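The "only send the usage when asked for" behavior follows the OpenAI `stream_options` convention, which can be sketched as a request body like the one below. The model name and message are placeholders; this is an assumption about the wire format based on the OpenAI scheme, not a captured TGI request.

```python
# Sketch of an OpenAI-style streaming chat request. Without
# stream_options, no usage chunk is sent; with include_usage=True the
# final streamed chunk carries the token counts.
request_body = {
    "model": "tgi",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": True,
    "stream_options": {"include_usage": True},
}
```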
  21. 17 Sep, 2024 1 commit
  22. 11 Sep, 2024 2 commits
    • Fix tokenization yi (#2507) · dae3bf1d
      Nicolas Patry authored
      * Fixing odd tokenization self modifications on the Rust side (load and
      resave in Python).
      
      * Fixing the builds ?
      
      * Fix the gh action?
      
      * Fixing the location ?
      
      * Validation is odd.
      
      * Try a faster runner
      
      * Upgrade python version.
      
      * Remove sccache
      
      * No sccache.
      
      * Getting libpython maybe ?
      
      * List stuff.
      
      * Monkey it up.
      
      * have no idea at this point
      
      * Tmp.
      
      * Shot in the dark.
      
      * Tmate the hell out of this.
      
      * Desperation.
      
      * WTF.
      
      * -y.
      
      * Apparently 3.10 is not available anymore.
      
      * Updating the dockerfile to make libpython discoverable at runtime too.
      
      * Put back rust tests.
      
      * Why do we want mkl on AMD ?
      
      * Forcing 3.11 ?
      dae3bf1d
    • Prefix test - Different kind of load test to trigger prefix test bugs. (#2490) · a4e3e8c6
      Nicolas Patry authored
      
      
      * Adding prefix test.
      
      * [WIP] tmp dump of integration load tests.
      
      * Remove other tensor creation.
      
      * Fixed the radix tree.
      
      Used a slice everywhere in radix.rs to keep the cheap Arc cloning
      instead of recomputing the input_ids.
      
      * Fix parsing
      
      * Is it really flashinfer version ?
      
      * Remove some comments.
      
      * Revert the max prefix hit.
      
      * Adding numpy to diff.
      
      * Upgraded flashinfer.
      
      * Upgrading some stuff.
      
      * Are we done yet ?
      
      * Minor fixup
      
      * Remove 1 log and put back the other.
      
      * Add comment for why slot 0 is OK.
      
      * Mounting on the job.
      
      * Get me a debug branch
      
      * Debugging CIs is fun.
      
      * Attempt #28
      
      * wip
      
      * Tmate.
      
      * Praying.
      
      * Updating VLM causal model with updated context.
      
      * Important line got squashed.
      
      * Tmate again.
      
      * Fingers crossed.
      
      * We want only 1 run of integration tests.....
      
      ---------
      Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>
      a4e3e8c6
  23. 02 Sep, 2024 1 commit
  24. 29 Aug, 2024 2 commits
    • feat: add /v1/models endpoint (#2433) · d5202c46
      drbh authored
      * feat: add /v1/models endpoint
      
      * feat: add /v1/models endpoint
      
      * fix: remove unused type import
      
      * fix: revert route typo
      
      * fix: update docs with new endpoint
      
      * fix: add to redocly ignore and lint
      d5202c46
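The new endpoint follows the OpenAI models-list convention, which a client might consume as sketched below. The sample payload and model id are illustrative placeholders, not output captured from a live server.

```python
# Sketch of consuming an OpenAI-style /v1/models response: the body is
# an object of type "list" whose "data" array holds model entries.
def list_model_ids(response_json):
    """Extract model ids from an OpenAI-style /v1/models response."""
    assert response_json.get("object") == "list"
    return [m["id"] for m in response_json.get("data", [])]


# Illustrative response shape (placeholder model id):
sample = {
    "object": "list",
    "data": [{"id": "meta-llama/Llama-3.1-8B-Instruct", "object": "model"}],
}
```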
    • Lots of improvements (Still 2 allocators) (#2449) · e415b690
      Nicolas Patry authored
      
      
      * Making prefix/flashinfer the default and testing the full release tests.
      
      * Include flashinfer in the docker.
      
      * Using prebuilt.
      
      * Allowing window_left_size (dummy version).
      
      * Disabling flashinfer/prefix caching on odd head_dim
      
      * Disable prefix caching for lora.
      
      * More specific codes.
      
      * Update lock
      
      * Updating integration tests with new values with FI/FD.
      
      Remove paged as a default too, and using FD everywhere.
      
      * Update cargo lock ?
      
      * Upgrade to 1.80 because of bitstream...
      
      * Everywhere 1.80
      
      * Forgot last default place.
      
      * Apply suggestions from code review
      Co-authored-by: drbh <david.richard.holtz@gmail.com>
      
      * Updated flake lock
      
      * Tmp
      
      * Upgrade resolution system for less errors in resolution.
      
      * Remove lambda for cleaner function.
      
      * Handling debugger.
      
      * Override the env in server tests.
      
      * Is this enough to make it work ?
      
      * This seems to be working.
      
      * Downgrade some logs.
      
      * Fixing the default for vlm.
      
      * Don't enable prefix caching on VLM just yet.
      
      * Change `add_special_tokens` in order to have the correct tokens for chat
      input and not (since it's super important with the prefixing now)
      
      * Fixing prefix caching for flashdecoding.
      
      * Update all models.
      
      * Fixed flashinfer version.
      
      * add_special_tokens is internal only
      
      * Fixing seqlen with the new vlms.
      
      * Fixing the issue with `add_special_tokens` not being passed around.
      
      * Fixing the test.
      
      * Removing encoder_decoder (seq2seq).
      
      * Update the chat test.
      
      * Fixing the batching tokenization in flash causal lm.
      
      * Truncating left for radix purposes.
      
      * Oops this doesn't belong here.
      
      * Put back default pure shell.
      
      * Update server tests
      
      - Default to throughput test in k6
      - Use TGI_WIGGLE_ROOM to adjust wiggle room
      
      * Only n_heads / process_group.size() are necessary.
      
      * Revert the integration tests change (seems linked to head_size
      modification).
      
      * Adding error message when assert is violated.
      
      * Fixing the free algorithm to handle times where the common prefix is
      smaller.
      
      * Apply suggestions from code review
      Co-authored-by: OlivierDehaene <olivier@huggingface.co>
      
      * Update server/text_generation_server/layers/attention/common.py
      Co-authored-by: OlivierDehaene <olivier@huggingface.co>
      
      * Fix disabling prefix caching - Fix windowing checks.
      
      * Revert the Cohere tokenizer change (for now using a revision instead).
      
      * Fmt.
      
      ---------
      Co-authored-by: drbh <david.richard.holtz@gmail.com>
      Co-authored-by: OlivierDehaene <olivier@huggingface.co>
      e415b690
  25. 27 Aug, 2024 2 commits
  26. 16 Aug, 2024 1 commit
  27. 12 Aug, 2024 4 commits
    • Pr 2395 ci run (#2406) · 9a7830bd
      drbh authored
      
      
      * fix(router): Fix appending to message content
      
      * feat: add message and chat template test
      
      ---------
      Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>
      9a7830bd
    • fix: improve completions to send a final chunk with usage details (#2336) · 30395b09
      drbh authored
      * fix: improve completions to send a final chunk with usage details
      
      * fix: include finish reason string
      
      * fix: remove dev debug trait and unneeded mut
      
      * fix: update openapi schema
      30395b09
    • feat: validate template variables before apply and improve sliding wi… (#2403) · 155f9c98
      drbh authored
      * feat: validate template variables before apply and improve sliding window check
      
      * fix: improve missing template var test
      155f9c98
    • Add support for prefix caching to the v3 router (#2392) · 8deeaca4
      Daniël de Kok authored
      This change adds support for prefix caching to the v3 router. It is
      broken out from the backend support to ease reviewing.
      
      For now, prefix caching is only enabled with `USE_PREFIX_CACHING=1`;
      in this case, the router switches to `RadixAllocator`. This allocator
      uses a radix trie to keep track of prefills that were seen previously.
      If a new prefill is a prefix of a previously-seen prefill, the router
      sends a request with `prefix_len>0`, which the backend can use to
      reuse KV blocks from the cache rather than recompute them.
      
      Even though backend support is not added in this PR, the backend
      still works with prefix caching enabled; the prefix lengths are
      simply ignored.
      8deeaca4
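The lookup a `RadixAllocator`-style trie performs can be sketched as follows. This is a toy illustration of the idea only: it stores sequences in a trie keyed on token ids and reports the length of the longest cached prefix (the router's `prefix_len`), whereas the real allocator also tracks KV block references and eviction.

```python
# Toy radix/prefix trie over token ids: insert previously-seen prefills,
# then ask for the longest cached prefix of a new prefill.
class RadixNode:
    def __init__(self):
        self.children = {}  # token id -> RadixNode


class RadixTrie:
    def __init__(self):
        self.root = RadixNode()

    def insert(self, token_ids):
        """Record a prefill that has been seen."""
        node = self.root
        for t in token_ids:
            node = node.children.setdefault(t, RadixNode())

    def longest_prefix_len(self, token_ids):
        """Length of the longest previously-seen prefix of token_ids."""
        node, n = self.root, 0
        for t in token_ids:
            if t not in node.children:
                break
            node = node.children[t]
            n += 1
        return n
```

With this sketch, a prefill sharing its first three tokens with a cached one would yield `prefix_len == 3`, letting a backend skip recomputing those KV blocks.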
  28. 09 Aug, 2024 1 commit