1. 06 Sep, 2024 5 commits
  2. 05 Sep, 2024 4 commits
  3. 02 Sep, 2024 4 commits
  4. 29 Aug, 2024 5 commits
    • Nicolas Patry's avatar
      Tied embeddings in MLP speculator. (#2473) · d9fbbaaf
      Nicolas Patry authored
      * Tied embeddings in MLP speculator.
      
      * Fixing the scale_weight when users decide to not use the speculation as
      much as defined in the config.
      
      * Adding scaling support + optimize some ops.
      d9fbbaaf
    • Wang, Yi's avatar
      update doc with intel cpu part (#2420) · 9883f3b4
      Wang, Yi authored
      
      
      * update doc with intel cpu part
      Signed-off-by: default avatarWang, Yi A <yi.a.wang@intel.com>
      
      * Apply suggestions from code review
      
      we do not use latest ever in documentation, it causes too many issues for users. Release number get update on every release.
      
      ---------
      Signed-off-by: default avatarWang, Yi A <yi.a.wang@intel.com>
      Co-authored-by: default avatarNicolas Patry <patry.nicolas@protonmail.com>
      9883f3b4
    • drbh's avatar
      feat: add /v1/models endpoint (#2433) · d5202c46
      drbh authored
      * feat: add /v1/models endpoint
      
      * feat: add /v1/models endpoint
      
      * fix: remove unused type import
      
      * fix: revert route typo
      
      * fix: update docs with new endpoint
      
      * fix: add to redocly ignore and lint
      d5202c46
    • Nicolas Patry's avatar
      Lots of improvements (Still 2 allocators) (#2449) · e415b690
      Nicolas Patry authored
      
      
      * Making prefix/flashinfer the default and testing the full release tests.
      
      * Include flashinfer in the docker.
      
      * Using prebuilt.
      
      * Allowing window_left_size (dummy version).
      
      * Disabling flashinfer/prefix caching on odd head_dim
      
      * Disable prefix caching for lora.
      
      * More specific codes.
      
      * Update lock
      
      * Updating integration tests with new values with FI/FD.
      
      Remove paged as a default too, and using FD everywhere.
      
      * Update cargo lock ?
      
      * Upgrade to 1.80 because of bitstream...
      
      * Everywhere 1.80
      
      * Forgot last default place.
      
      * Apply suggestions from code review
      Co-authored-by: default avatardrbh <david.richard.holtz@gmail.com>
      
      * Updated flake lock
      
      * Tmp
      
      * Upgrade resolution system for less errors in resolution.
      
      * Remove lambda for cleaner function.
      
      * Handling debugger.
      
      * OVerride the env in server tests.
      
      * Is this enough to make it work ?
      
      * This seems to be working.
      
      * Downgrade some logs.
      
      * Fixing the default for vlm.
      
      * Don't enable prefix caching on VLM just yet.
      
      * Change `add_special_tokens` in order to have the correct tokens for chat
      input and not (since it's super important with the prefixing now)
      
      * Fixing prefix caching for flashdecoding.
      
      * Update all models.
      
      * Fixed flashinfer version.
      
      * add_special_tokens is internal only
      
      * Fixing seqlen with the new vlms.
      
      * Fixing the issue with `add_special_tokens` not being passed around.
      
      * Fixing the test.
      
      * Removing encoder_decoder (seq2seq).
      
      * Update the chat test.
      
      * Fixing the batching tokenization in flash causal lm.
      
      * Truncating left for radix purposes.
      
      * Oops this doesn't belong here.
      
      * Put back default pure shell.
      
      * Update server tests
      
      - Default to throughput test in k6
      - Use TGI_WIGGLE_ROOM to adjust wiggle room
      
      * Only n_heads / process_group.size() are necessary.
      
      * Revert the integrationt tests change (seem linked to head_size
      modification).
      
      * Adding error message when assert is violated.
      
      * Fixing the free algorithm to handle times where the common prefix is
      smaller.
      
      * Apply suggestions from code review
      Co-authored-by: default avatarOlivierDehaene <olivier@huggingface.co>
      
      * Update server/text_generation_server/layers/attention/common.py
      Co-authored-by: default avatarOlivierDehaene <olivier@huggingface.co>
      
      * Fix disabling prefix caching - Fix windowing checks.
      
      * Revert the Cohere tokenizer change (for now using a revision instead).
      
      * Fmt.
      
      ---------
      Co-authored-by: default avatardrbh <david.richard.holtz@gmail.com>
      Co-authored-by: default avatarOlivierDehaene <olivier@huggingface.co>
      e415b690
    • Daniël de Kok's avatar
      nix: build Torch against MKL and various other improvements (#2469) · 4e821c00
      Daniël de Kok authored
      Updates tgi-nix input:
      
      - Move Torch closer to upstream by building against MKL.
      - Remove compute capability 8.7 from Torch (Jetson).
      - Sync nixpkgs cumpute capabilities with Torch (avoids
        compiling too mana capabilities for MAGMA).
      - Use nixpkgs configuration passed through by `tgi-nix`.
      4e821c00
  5. 28 Aug, 2024 1 commit
  6. 27 Aug, 2024 3 commits
  7. 26 Aug, 2024 1 commit
  8. 23 Aug, 2024 1 commit
    • Daniël de Kok's avatar
      nix: add default package (#2453) · f3c5d7d9
      Daniël de Kok authored
      The default package wraps the launcher and puts the server/router in the
      path.
      
      As a result, TGI can be started using something like:
      
      ```
      nix run .# -- \
        --model-id hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
        --port 8080
      ```
      f3c5d7d9
  9. 21 Aug, 2024 3 commits
  10. 20 Aug, 2024 2 commits
    • Daniël de Kok's avatar
      nix: add pure server to flake, add both pure and impure devshells (#2430) · f5f11b79
      Daniël de Kok authored
      * nix: pure server and support both pure and impure devShells
      
      * nix: remove unused poetry2nix input
      
      It is not wired up and we now have a pure server.
      
      * nix: add ipdb to impure devshell
      f5f11b79
    • Nicolas Patry's avatar
      Prefix caching (#2402) · b70ae096
      Nicolas Patry authored
      
      
      * Prefix caching WIP
      
      * Fixing prefix attention.
      
      * Fixing flashinfer import.
      
      * Fixing black.
      
      * Fixing medusa (still wrong outputs, but functional).
      
      * Just medusa values now.
      
      * Fixing medusa without prefix caching.
      
      * Fixing prefix caching.
      
      * Medusa requires reshaping.
      
      * Removing the logs.
      
      * Remove router.nix
      
      * Fixup:
      
      - Remove logs
      - Disable VLMs (they do not work)
      - Disable prefix caching when user wants prefill logprobs.
      
      * Update flake.lock
      
      ---------
      Co-authored-by: default avatarDaniël de Kok <me@danieldk.eu>
      b70ae096
  11. 19 Aug, 2024 1 commit
  12. 16 Aug, 2024 6 commits
  13. 15 Aug, 2024 3 commits
  14. 14 Aug, 2024 1 commit
    • Funtowicz Morgan's avatar
      More fixes trtllm (#2342) · 3f385991
      Funtowicz Morgan authored
      * (backend) use parking_lot crate for RwLock fairness
      
      * (docker) let's put rust in the TRTLLM folder when building
      
      * (docker) build ompi with SLURM support
      
      * (launcher) default new server::run parameters to false for now
      
      * (chore) fmt ... why?
      3f385991