1. 30 Oct, 2024 1 commit
    • Support qwen2 vl (#2689) · befd9f67
      drbh authored
      * feat: add support for qwen2 vl model
      
      * feat: fix token padding, enable warmup and process basic request
      
      * fix: improve get_position_ids and lift embed_tokens
      
      * fix: remove get_cos_sin_hack dev function
      
      * feat: add simple test chat with message and text
      
      * fix: lint test
      
      * fix: adjust positional embeddings for multi dimensional position ids
      
      * fix: update docs and lint unused vars
      
      * fix: include linted file
      
      * fix: add norm after text output
      
      * fix: format model file
      
      * fix: adjust for ruff lints
      
      * fix: remove unused rotate_half
      
      * feat: refactors and calc num features
      
      * fix: prefer position_ids passed from vlm causal lm and reset ids on batch
      
      * fix: adjust get_position_ids if not available and add required args to signatures
      
      * fix: adjust resize case for qwen2_vl warmup
      
      * fix: avoid qwen2 vl specific paths with qwen2
      befd9f67
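      A rough, hypothetical sketch of the multi-dimensional position ids mentioned above (the helper name and section sizes are illustrative, not the model's actual configuration): each token carries one position id per axis, and each axis drives its own slice of the rotary dimensions.

      ```python
      import torch

      def multi_dim_cos_sin(inv_freq, position_ids, sections=(16, 24, 24)):
          """Hypothetical helper: position_ids is [n_axes, seq_len] (one row per
          axis) and `sections` splits the rotary frequencies between axes."""
          freqs = position_ids[:, :, None].float() * inv_freq  # [n_axes, seq, dim/2]
          cos_parts = torch.split(freqs.cos(), list(sections), dim=-1)
          sin_parts = torch.split(freqs.sin(), list(sections), dim=-1)
          # Axis i contributes its own section of the rotary dimensions.
          cos = torch.cat([part[i] for i, part in enumerate(cos_parts)], dim=-1)
          sin = torch.cat([part[i] for i, part in enumerate(sin_parts)], dim=-1)
          return cos, sin  # [seq_len, dim/2] each

      # Example: 3 axes, 64 rotary frequencies, 5 tokens (text-only, so axes match).
      inv_freq = 1.0 / (10000 ** (torch.arange(0, 64).float() / 64))
      position_ids = torch.arange(5).repeat(3, 1)
      cos, sin = multi_dim_cos_sin(inv_freq, position_ids)
      ```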
  2. 28 Oct, 2024 2 commits
    • Fixing auto bloom test. (#2699) · 3a9cdc32
      Nicolas Patry authored
      3a9cdc32
    • We can have a tokenizer anywhere. (#2527) · 90b226db
      Nicolas Patry authored
      * We can have a tokenizer anywhere.
      
      * Handling potential lack of offsets (python tokenizer)
      
      * Remove redundancy.
      
      * Fixing the tests.
      
      * Flake.lock update ?
      
      * Fixing the GIL locking.
      
      * Fixing mamba by using the transformers version.
      
      * Adding the legacy handle.
      
      * Elide lifetime.
      
      * Lint.
      
      * Deprecation message.
      
      * Fixing bad rebase.
      90b226db
  3. 24 Oct, 2024 1 commit
    • Add support for FP8 KV cache scales (#2628) · eab07f74
      Daniël de Kok authored
      * Add support for FP8 KV cache scales
      
      Since FP8 only has limited dynamic range, we can scale keys/values
      before storing them into the cache (and unscale them in attention). To
      avoid rescaling the cache as the absmax values change, good scales are
      usually determined per layer using calibration data and stored in the
      checkpoint.
      
      This change adds support for using key-value scales and loading them
      from checkpoints in the two most common formats:
      
      - Separate per-layer `k_scale` and `v_scale` scalars.
      - Per-layer `kv_scale` scalar (older format).
      
      Currently, scales are only used with a `float8_e4m3fn` cache (a sketch of
      the scaling follows this entry).
      
      Besides adding support for key/value scales, the `fp8_quantize` function
      is also extended to support quantization with a kernel vendored from
      vLLM. This is slightly faster than the PyTorch implementation, but also
      scales in FP32, potentially improving accuracy.
      
      * Update FP8 KV cache test to use checkpoint with scales
      
      * `can_scale`: check that the attention is flashinfer
      eab07f74
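      A rough sketch of the scaling described above (function and argument names are illustrative, not the actual TGI internals): keys and values are divided by their per-layer scale before the cast to `float8_e4m3fn`, and multiplied back when read for attention.

      ```python
      import torch

      def store_kv_fp8(key, value, k_cache, v_cache, slots, k_scale, v_scale):
          # Compress into FP8's limited dynamic range before casting down.
          k_cache[slots] = (key / k_scale).to(torch.float8_e4m3fn)
          v_cache[slots] = (value / v_scale).to(torch.float8_e4m3fn)

      def load_kv_fp8(k_cache, v_cache, slots, k_scale, v_scale, dtype=torch.float16):
          # Undo the scaling when reading the cache back for attention.
          return k_cache[slots].to(dtype) * k_scale, v_cache[slots].to(dtype) * v_scale

      # Older checkpoints ship a single per-layer `kv_scale`; in that case the
      # same scalar would be used for both k_scale and v_scale.
      ```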
  4. 23 Oct, 2024 2 commits
  5. 17 Oct, 2024 1 commit
    • Simplify the `attention` function (#2609) · 59ea38cb
      Daniël de Kok authored
      * Simplify the `attention` function
      
      - Use one definition rather than multiple.
      - Add `key`/`value` arguments, so that we don't need the
        `PREFILL_IN_KVCACHE` constant.
      - Make it kwargs-only (to avoid mixing up the various `Tensor` args).
      
      * Fixup flashinfer support
      59ea38cb
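      A minimal illustration of the resulting shape of the API (parameter names are approximate, not the exact TGI signature): one definition, explicit `key`/`value` arguments instead of the `PREFILL_IN_KVCACHE` constant, and keyword-only parameters so the many tensors cannot be passed in the wrong order.

      ```python
      def attention(
          *,                 # keyword-only: callers must name every tensor
          query,
          key,               # passed explicitly, so no PREFILL_IN_KVCACHE flag
          value,
          kv_cache,
          seqlen,
          block_tables=None,
          softmax_scale=None,
      ):
          ...
      ```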
  6. 16 Oct, 2024 1 commit
    • feat: prefill chunking (#2600) · a6a0c97e
      OlivierDehaene authored
      
      
      * wip
      
      * rollback
      
      * refactor to use prefix/postfix naming + fix all_input_ids_tensor
      
      * maybe patching vlms?
      
      * fix filter and concat
      
      * wip, no filter, no concat
      
      * current
      
      * add prepare_for_prefill
      
      * working
      
      * load tested
      
      * re-create slots
      
      * re-create slots
      
      * fix slot_filtering_indices
      
      * feedback loop
      
      * remove log
      
      * fix benchmarker
      
      * fix vlm and seq2seq
      
      * rename to cache and input lengths
      
      * fix prefill logprobs
      
      * fix launcher
      
      * fix logprobs?
      
      * idk at this point
      
      * max input length
      
      * omfg
      
      * remove debugging lines
      
      * fix tests
      
      * fix mllama
      
      * fix cargo tests
      
      * remove support chunking for paged
      
      * Fixing non blocked attentions
      
      * Fixing dtype + AMD, Ipex targets.
      
      * lint fix.
      
      * rename
      
      * Fix prefix_caching variable, remove defaults in server (confusing a lot
      of the time).
      
      * Add simple resolution when user specifies ATTENTION=paged.
      
      * Put back non default simple tests.
      
      * Fix env name
      
      ---------
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
      a6a0c97e
  7. 07 Oct, 2024 1 commit
  8. 04 Oct, 2024 1 commit
    • Add basic FP8 KV cache support (#2603) · 2358c2bb
      Daniël de Kok authored
      * Add basic FP8 KV cache support
      
      This change adds rudimentary FP8 KV cache support. The support is
      enabled by passing `--kv-cache-dtype fp8_e5m2` to the launcher. Doing so
      uses this type for the KV cache. However, support is still limited:
      
      * Only the `fp8_e5m2` type is supported.
      * The KV cache layout is the same as `float16`/`bfloat16` (HND).
      * The FP8 KV cache is only supported for FlashInfer.
      * Loading of scales is not yet supported.
      
      * Fix Cargo.toml
      2358c2bb
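      In practical terms (a simplified sketch, assuming flashinfer's HND-style paged layout of `[num_blocks, num_heads, block_size, head_dim]`), the cache keeps the same layout as the `float16` path and only the element type changes; values are cast straight to `fp8_e5m2` with no scaling at this stage. Scale loading for the `e4m3` cache arrived later in #2628 (24 Oct entry above).

      ```python
      import torch

      num_blocks, num_heads, block_size, head_dim = 128, 8, 16, 64
      kv_cache = torch.empty(
          num_blocks, num_heads, block_size, head_dim,  # HND-style layout
          dtype=torch.float8_e5m2,
      )

      key = torch.randn(num_heads, block_size, head_dim, dtype=torch.float16)
      kv_cache[0] = key.to(torch.float8_e5m2)    # store: plain downcast, no scales
      restored = kv_cache[0].to(torch.float16)   # read back for attention (lossy)
      ```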
  9. 02 Oct, 2024 1 commit
    • Mllama flash version (#2585) · d18ed5cf
      Nicolas Patry authored
      * Working loading state.
      
      * Preprocessing.
      
      * Working state ? (Broke idefics1 temporarily).
      
      * Cleaner condition.
      
      * Fix idefics.
      
      * Updating config, removing TODO
      
      * Mllama
      
      * Upgrade transformers 4.45
      
      * Flashing mllama.
      
      * Starting to get there.
      
      * Working state.
      
      * Integration tests for mllama (cutting to 10 tokens because there seems
      to be instability afterwards, meaning the size of the batch matters).
      
      * Updating model link.
      
      * Earlier assert.
      
      * Fix vlm ?
      
      * remove log.
      
      * Force ignore all images but last.
      
      * Default dtype bfloat16.
      
      * Update integration test after switch to bf16.
      
      * Remove dead code.
      
      * Removed dead code.
      
      * Upgrade the flake to latest transformers/tokenizers
      
      * Move to hf tgi-nix
      
      * Upgrade to 0.5.0
      d18ed5cf
  10. 30 Sep, 2024 2 commits
    • feat: support phi3.5 moe (#2479) · 93a7042d
      drbh authored
      
      
      * feat: support phi3.5 moe model loading
      
      * fix: prefer llama base model and improve rotary logic
      
      * feat: return reasonable generation and add integration test
      
      * fix: run lint and update docs
      
      * fix: rerun lint for openapi docs
      
      * fix: prefer do_sample false unless temp is set by user, and update chat tests
      
      * fix: small typo adjustments
      
      * fix: consolidate long rope paths
      
      * fix: revert greedy by default and test changes
      
      * Vendor configuration so that we don't have to `trust_remote_code`
      
      * Use SparseMoELayer
      
      * Add support for dense MoE
      
      * Some type annotations
      
      * Add the usual model tests
      
      * Ruff.
      
      ---------
      Co-authored-by: Daniël de Kok <me@danieldk.eu>
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
      93a7042d
    • Update ROCM libs and improvements (#2579) · f9e561ec
      Mohit Sharma authored
      * style
      
      * update torch
      
      * fix issues
      
      * fix clone
      
      * revert mkl
      
      * added custom PA
      
      * style
      
      * fix style
      
      * style
      
      * hide env var
      
      * fix mixtral model
      
      * add skinny kernel and merge fixes
      
      * fixed style
      
      * fix issue for sliding window models
      
      * addressed review comments
      
      * fix import
      
      * improved error message
      
      * updated default value
      
      * remove import
      
      * fix imports after rebase
      
      * float16 dep
      
      * improve dockerfile
      
      * cleaned dockerfile
      f9e561ec
  11. 27 Sep, 2024 1 commit
    • Improve support for GPUs with capability < 8 (#2575) · 5b6b74e2
      Daniël de Kok authored
      * Improve support for GPUs with capability < 8
      
      - For models that cannot use flashinfer, use flash-attn v1 + paged
        attention for models with a compute capability older than 8.
      - Disable prefix caching when using paged attention.
      - When using flash-attn v1, pass the key/value, rather than the
        cache, since v1 cannot use block tables.
      
      * nix: add flash-attn-v1 to the server environment
      
      * Move disabling prefix caching into the block of exceptions
      
      * Capability as `usize`s
      5b6b74e2
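      Schematically (a simplified sketch; the real selection logic in the server is more involved), the backend choice described above boils down to a compute-capability check.

      ```python
      import torch

      major, _minor = torch.cuda.get_device_capability()

      if major >= 8:
          use_flashinfer = True
          prefix_caching = True
      else:
          # Older GPUs: flash-attn v1 + paged attention.
          use_flashinfer = False
          prefix_caching = False  # the paged-attention path cannot reuse cached prefixes
          # flash-attn v1 has no block-table support, so prefill passes the
          # key/value tensors directly instead of reading them back from the cache.
      ```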
  12. 26 Sep, 2024 1 commit
  13. 24 Sep, 2024 1 commit
  14. 20 Sep, 2024 1 commit
  15. 17 Sep, 2024 1 commit
    • Move to moe-kernels package and switch to common MoE layer (#2511) · ce85efa9
      Daniël de Kok authored
      * Move to moe-kernels package and switch to common MoE layer
      
      This change introduces the new `moe-kernels` package:
      
      - Add `moe-kernels` as a dependency.
      - Introduce a `SparseMoELayer` module that can be used by MoE
        models.
      - Port over Mixtral and Deepseek.
      
      * Make `cargo check` pass
      
      * Update runner
      ce85efa9
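      For context, this is roughly the computation a sparse MoE layer performs (a naive reference sketch; the actual `SparseMoELayer` dispatches to fused `moe-kernels` routines, and the gated activation is simplified to ReLU here).

      ```python
      import torch

      def sparse_moe(hidden, gate_w, w_up, w_down, topk=2):
          # hidden: [tokens, d_model]; gate_w: [d_model, n_experts]
          # w_up: [n_experts, d_model, d_ff]; w_down: [n_experts, d_ff, d_model]
          probs = torch.softmax(hidden @ gate_w, dim=-1)
          weights, experts = torch.topk(probs, topk, dim=-1)
          weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize top-k
          out = torch.zeros_like(hidden)
          for k in range(topk):
              for e in range(gate_w.shape[-1]):
                  mask = experts[:, k] == e
                  if mask.any():
                      x = hidden[mask]
                      out[mask] += weights[mask, k][:, None] * (torch.relu(x @ w_up[e]) @ w_down[e])
          return out
      ```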
  16. 05 Sep, 2024 1 commit
  17. 02 Sep, 2024 1 commit
  18. 29 Aug, 2024 1 commit
    • Lots of improvements (Still 2 allocators) (#2449) · e415b690
      Nicolas Patry authored
      
      
      * Making prefix/flashinfer the default and running the full release tests.
      
      * Include flashinfer in the docker.
      
      * Using prebuilt.
      
      * Allowing window_left_size (dummy version).
      
      * Disabling flashinfer/prefix caching on odd head_dim
      
      * Disable prefix caching for lora.
      
      * More specific codes.
      
      * Update lock
      
      * Updating integration tests with new values with FI/FD.
      
      Remove paged as a default too, and using FD everywhere.
      
      * Update cargo lock ?
      
      * Upgrade to 1.80 because of bitstream...
      
      * Everywhere 1.80
      
      * Forgot last default place.
      
      * Apply suggestions from code review
      Co-authored-by: drbh <david.richard.holtz@gmail.com>
      
      * Updated flake lock
      
      * Tmp
      
      * Upgrade resolution system for fewer errors in resolution.
      
      * Remove lambda for cleaner function.
      
      * Handling debugger.
      
      * Override the env in server tests.
      
      * Is this enough to make it work ?
      
      * This seems to be working.
      
      * Downgrade some logs.
      
      * Fixing the default for vlm.
      
      * Don't enable prefix caching on VLM just yet.
      
      * Change `add_special_tokens` in order to have the correct tokens for chat
      input (since it's super important with the prefixing now); see the sketch
      after this entry.
      
      * Fixing prefix caching for flashdecoding.
      
      * Update all models.
      
      * Fixed flashinfer version.
      
      * add_special_tokens is internal only
      
      * Fixing seqlen with the new vlms.
      
      * Fixing the issue with `add_special_tokens` not being passed around.
      
      * Fixing the test.
      
      * Removing encoder_decoder (seq2seq).
      
      * Update the chat test.
      
      * Fixing the batching tokenization in flash causal lm.
      
      * Truncating left for radix purposes.
      
      * Oops this doesn't belong here.
      
      * Put back default pure shell.
      
      * Update server tests
      
      - Default to throughput test in k6
      - Use TGI_WIGGLE_ROOM to adjust wiggle room
      
      * Only n_heads / process_group.size() are necessary.
      
      * Revert the integration tests change (seems linked to head_size
      modification).
      
      * Adding error message when assert is violated.
      
      * Fixing the free algorithm to handle times where the common prefix is
      smaller.
      
      * Apply suggestions from code review
      Co-authored-by: OlivierDehaene <olivier@huggingface.co>
      
      * Update server/text_generation_server/layers/attention/common.py
      Co-authored-by: OlivierDehaene <olivier@huggingface.co>
      
      * Fix disabling prefix caching - Fix windowing checks.
      
      * Revert the Cohere tokenizer change (for now using a revision instead).
      
      * Fmt.
      
      ---------
      Co-authored-by: drbh <david.richard.holtz@gmail.com>
      Co-authored-by: OlivierDehaene <olivier@huggingface.co>
      e415b690
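      Regarding the `add_special_tokens` change above, a toy illustration of why it matters (generic, not TGI-specific): chat templates typically already emit the BOS/special tokens, so tokenizing the rendered chat text with `add_special_tokens=True` would duplicate them and change the prompt prefix, which in turn breaks prefix/radix cache matching.

      ```python
      # Toy illustration: if the chat template already inserted the BOS token,
      # tokenizing with add_special_tokens=True would prepend a second one.
      BOS = "<s>"

      def render_chat(messages):
          # Stand-in for a chat template that already includes the BOS token.
          return BOS + " " + " ".join(f"[{m['role']}] {m['content']}" for m in messages)

      def tokenize(text, add_special_tokens):
          return ([BOS] if add_special_tokens else []) + text.split()

      prompt = render_chat([{"role": "user", "content": "Hello"}])
      assert tokenize(prompt, add_special_tokens=True)[:2] == [BOS, BOS]  # duplicated
      assert tokenize(prompt, add_special_tokens=False)[0] == BOS         # correct
      ```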
  19. 26 Aug, 2024 1 commit
  20. 20 Aug, 2024 1 commit
    • Prefix caching (#2402) · b70ae096
      Nicolas Patry authored
      
      
      * Prefix caching WIP
      
      * Fixing prefix attention.
      
      * Fixing flashinfer import.
      
      * Fixing black.
      
      * Fixing medusa (still wrong outputs, but functional).
      
      * Just medusa values now.
      
      * Fixing medusa without prefix caching.
      
      * Fixing prefix caching.
      
      * Medusa requires reshaping.
      
      * Removing the logs.
      
      * Remove router.nix
      
      * Fixup:
      
      - Remove logs
      - Disable VLMs (they do not work)
      - Disable prefix caching when user wants prefill logprobs.
      
      * Update flake.lock
      
      ---------
      Co-authored-by: Daniël de Kok <me@danieldk.eu>
      b70ae096
  21. 08 Aug, 2024 4 commits
  22. 07 Aug, 2024 1 commit
  23. 06 Aug, 2024 2 commits
  24. 01 Aug, 2024 2 commits
  25. 26 Jul, 2024 2 commits
    • feat: add ruff and resolve issue (#2262) · bab02ff2
      drbh authored
      * feat: add ruff and resolve issue
      
      * fix: update client exports and adjust after rebase
      
      * fix: adjust syntax to avoid circular import
      
      * fix: adjust client ruff settings
      
      * fix: lint and refactor import check and avoid model enum as global names
      
      * fix: improve fbgemm_gpu check and lints
      
      * fix: update lints
      
      * fix: prefer comparing model enum over str
      
      * fix: adjust lints and ignore specific rules
      
      * fix: avoid unneeded quantize check
      bab02ff2
  26. 24 Jul, 2024 2 commits
  27. 23 Jul, 2024 2 commits
  28. 22 Jul, 2024 2 commits