1. 09 Dec, 2024 1 commit
  2. 06 Dec, 2024 2 commits
    • Adding A100 compute. (#2806) · d96dcb17
      Nicolas Patry authored
      d96dcb17
    • Auto max prefill (#2797) · 5df80590
      Nicolas Patry authored
      * Attempt at automatic max batch prefill.
      
      * Taking into account number of shards.
      
      * Adding more cards.
      
      * Adding A100 + H100
      
      * Adding a few more cards.
      
      * Logprobs cost too much.
      
      * h100 better name, and keep factor of 2
      
      * Damn inflated sparse tflops.
      
      * Typo in h100.
      
      * Updated the flops calculation (checked with fvcore).
      
      * chunking by default.
      
      * Fix prefix caching for chat completion since we removed logprobs.
      
      * More tests.
      
      * Dropping all the prefill logprobs.
      
      * Add a flag that enables users to get logprobs back.
      
      * Repairing prompt token counting.
      
      * Fixing a few tests.
      
      * Remove some scaffolding.
      
      * Attempting to reduce the issues (workarounds for now).
      5df80590
  3. 02 Dec, 2024 1 commit
  4. 21 Nov, 2024 1 commit
  5. 10 Nov, 2024 1 commit
    • Add initial support for compressed-tensors checkpoints (#2732) · a7850008
      Daniël de Kok authored
      compressed-tensors is a safetensors extension for sparse, quantized
      tensors. The format is more powerful than earlier AWQ/GPTQ/FP8
      quantization, because:
      
      - Different quantizer configurations can be used for different targets.
      - The format can specify input/output quantizers in addition to weight
        quantizers.
      - Configurable exclusions for quantization.
      
      This change adds a dependency on the `compressed-tensors` package for
      its configuration parsing and layer matching functionality.
      
      The following types of quantization are supported in this PR:
      
      - W8A16 and W4A16 INT using GPTQ-Marlin kernels.
      - W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels.
      
      Support for other quantization types will be added in subsequent PRs.
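
To illustrate the per-target configuration and exclusions described above, here is a minimal sketch. It does not use the `compressed-tensors` package; `QuantScheme`, `ConfigGroup`, `scheme_for_layer`, and the `re:` pattern prefix are hypothetical names chosen for this example only.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class QuantScheme:
    """Hypothetical description of one quantizer configuration."""
    weight_bits: int
    weight_type: str                  # "int" or "float"
    input_bits: Optional[int] = None  # None: activations are left unquantized


@dataclass
class ConfigGroup:
    """Hypothetical group: a scheme applied to a set of layer-name patterns."""
    targets: List[str]
    scheme: QuantScheme


def _matches(pattern: str, layer_name: str) -> bool:
    # Assumption for this sketch: a "re:" prefix marks a regular expression,
    # anything else must match the layer name exactly.
    if pattern.startswith("re:"):
        return re.fullmatch(pattern[3:], layer_name) is not None
    return pattern == layer_name


def scheme_for_layer(
    layer_name: str, groups: List[ConfigGroup], ignore: List[str]
) -> Optional[QuantScheme]:
    """Return the scheme for a layer, or None if it stays unquantized."""
    if any(_matches(p, layer_name) for p in ignore):
        return None  # configurable exclusion
    for group in groups:
        if any(_matches(p, layer_name) for p in group.targets):
            return group.scheme
    return None
```

With one group for attention layers (W4A16 INT) and one for MLP layers (W8A8 FP), each layer resolves to its own scheme, while an excluded layer such as `lm_head` resolves to `None`.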
      a7850008
  6. 04 Nov, 2024 1 commit
  7. 28 Oct, 2024 1 commit
    • Choosing input/total tokens automatically based on available VRAM? (#2673) · 0c9b6cdd
      Nicolas Patry authored
      * Choosing input/total tokens automatically based on available VRAM?
      
      * Update doc.
      
      * Remove generated files.
      
      * Trying to fix non chunking targets.
      
      * Attempt #2
      
      * fix.
      
      * QuantLinear is rocm compatible.
      
      * Much simpler logic after the overhead.
      
      * Updating logic + non flash.
      
      * Revert doc text.
      
      * Simple updates.
      
      * Fix integration mt0 (transformers update).
      0c9b6cdd
  8. 25 Oct, 2024 1 commit
  9. 21 Oct, 2024 1 commit
  10. 17 Oct, 2024 1 commit
  11. 16 Oct, 2024 1 commit
    • feat: prefill chunking (#2600) · a6a0c97e
      OlivierDehaene authored
      
      
      * wip
      
      * rollback
      
      * refactor to use prefix/postfix naming + fix all_input_ids_tensor
      
      * maybe patching vlms?
      
      * fix filter and concat
      
      * wip, no filter, no concat
      
      * current
      
      * add prepare_for_prefill
      
      * working
      
      * load tested
      
      * re-create slots
      
      * re-create slots
      
      * fix slot_filtering_indices
      
      * feedback loop
      
      * remove log
      
      * fix benchmarker
      
      * fix vlm and seq2seq
      
      * rename to cache and input lengths
      
      * fix prefill logprobs
      
      * fix launcher
      
      * fix logprobs?
      
      * idk at this point
      
      * max input length
      
      * omfg
      
      * remove debugging lines
      
      * fix tests
      
      * fix mllama
      
      * fix cargo tests
      
      * remove support chunking for paged
      
      * Fixing non blocked attentions
      
      * Fixing dtype + AMD, Ipex targets.
      
      * lint fix.
      
      * rename
      
      * Fix prefix_caching variable, remove defaults in server (confusing a lot
      of the time).
      
      * Add simple resolution when user specifies ATTENTION=paged.
      
      * Put back non default simple tests.
      
      * Fix env name
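
The general idea of prefill chunking can be sketched as follows: a long prompt is processed in several passes, each bounded by a token budget, with the tokens handled so far already sitting in the KV cache. This is only an illustration of the technique, not the code from this commit; `prefill_chunks` and `max_chunk_tokens` are hypothetical names, and the pair it yields borrows the "cache length" and "input length" terms from the commit message.

```python
from typing import Iterator, Tuple


def prefill_chunks(prompt_length: int, max_chunk_tokens: int) -> Iterator[Tuple[int, int]]:
    """Yield (cache_length, input_length) for each prefill pass.

    cache_length: prompt tokens already processed in earlier passes.
    input_length: prompt tokens processed in this pass.
    """
    if max_chunk_tokens <= 0:
        raise ValueError("max_chunk_tokens must be positive")
    cache_length = 0
    while cache_length < prompt_length:
        input_length = min(max_chunk_tokens, prompt_length - cache_length)
        yield cache_length, input_length
        cache_length += input_length
```

For a 10-token prompt and a budget of 4 tokens per pass, this yields three passes covering 4, 4, and 2 tokens.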
      
      ---------
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
      a6a0c97e
  12. 15 Oct, 2024 1 commit
  13. 14 Oct, 2024 1 commit
  14. 04 Oct, 2024 1 commit
    • Add basic FP8 KV cache support (#2603) · 2358c2bb
      Daniël de Kok authored
      * Add basic FP8 KV cache support
      
      This change adds rudimentary FP8 KV cache support. The support is
      enabled by passing `--kv-cache-dtype fp8_e5m2` to the launcher. Doing so
      uses this type for the KV cache. However, support is still limited:
      
      * Only the `fp8_e5m2` type is supported.
      * The KV cache layout is the same as `float16`/`bfloat16` (HND).
      * The FP8 KV cache is only supported for FlashInfer.
      * Loading of scales is not yet supported.
      
      * Fix Cargo.toml
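
For readers unfamiliar with the `fp8_e5m2` type: it has 1 sign bit, 5 exponent bits, and 2 mantissa bits, so it shares its exponent layout with `float16` and corresponds to the upper byte of a `float16` value. The sketch below shows that relationship using only the Python standard library. It truncates the mantissa for simplicity, whereas real kernels typically round, and it is not how the KV cache in this commit is implemented; both function names are hypothetical.

```python
import struct


def to_fp8_e5m2_bits(value: float) -> int:
    """Encode a float as e5m2 by keeping the top 8 bits of its float16 form.

    Simplification: the low 8 mantissa bits are truncated, not rounded.
    """
    (half_bits,) = struct.unpack("<H", struct.pack("<e", value))
    return half_bits >> 8


def from_fp8_e5m2_bits(bits: int) -> float:
    """Decode e5m2 bits by placing them in the upper byte of a float16."""
    (value,) = struct.unpack("<e", struct.pack("<H", (bits & 0xFF) << 8))
    return value
```

Each cached element takes one byte instead of the two bytes of `float16`/`bfloat16`, at the cost of keeping only two mantissa bits.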
      2358c2bb
  15. 02 Oct, 2024 1 commit
  16. 30 Sep, 2024 1 commit
  17. 27 Sep, 2024 1 commit
    • Improve support for GPUs with capability < 8 (#2575) · 5b6b74e2
      Daniël de Kok authored
      * Improve support for GPUs with capability < 8
      
      - For models that cannot use flashinfer, use flash-attn v1 + paged
        attention for GPUs with a compute capability lower than 8.
      - Disable prefix caching when using paged attention.
      - When using flash-attn v1, pass the key/value, rather than the
        cache, since v1 cannot use block tables.
      
      * nix: add flash-attn-v1 to the server environment
      
      * Move disabling prefix caching into the block of exceptions
      
      * Capability as `usize`s
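
The selection rule from the first bullet can be summarized in a small sketch. It only encodes the case the commit message describes and returns `None` for everything else; `AttentionPlan` and `plan_for_older_gpu` are hypothetical names, not identifiers from the code base.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AttentionPlan:
    prefill_kernel: str
    decode_kernel: str
    prefix_caching: bool
    prefill_input: str  # what the prefill kernel reads: "key/value" or "cache"


def plan_for_older_gpu(
    capability_major: int, model_can_use_flashinfer: bool
) -> Optional[AttentionPlan]:
    """Return the plan described in the commit, or None for other cases."""
    if model_can_use_flashinfer or capability_major >= 8:
        return None  # not covered by this commit message
    return AttentionPlan(
        prefill_kernel="flash-attn-v1",
        decode_kernel="paged",
        prefix_caching=False,         # disabled when using paged attention
        prefill_input="key/value",    # v1 cannot use block tables
    )
```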
      5b6b74e2
  18. 20 Sep, 2024 1 commit
  19. 19 Sep, 2024 1 commit
  20. 11 Sep, 2024 1 commit
    • Prefix test - Different kind of load test to trigger prefix test bugs. (#2490) · a4e3e8c6
      Nicolas Patry authored
      
      
      * Adding prefix test.
      
      * [WIP] tmp dump of integration load tests.
      
      * Remove other tensor creation.
      
      * Fixed the radix tree.
      
      Used a slice everywhere in radix.rs to keep the cheap Arc cloning
      instead of recomputing the input_ids.
      
      * Fix parsing
      
      * Is it really flashinfer version ?
      
      * Remove some comments.
      
      * Revert the max prefix hit.
      
      * Adding numpy to diff.
      
      * Upgraded flashinfer.
      
      * Upgrading some stuff.
      
      * Are we done yet ?
      
      * Minor fixup
      
      * Remove 1 log and put back the other.
      
      * Add comment for why slot 0 is OK.
      
      * Mounting on the job.
      
      * Get me a debug branch
      
      * Debugging CIs is fun.
      
      * Attempt #28
      
      * wip
      
      * Tmate.
      
      * Praying.
      
      * Updating VLM causal model with updated context.
      
      * Important line got squashed.
      
      * Tmate again.
      
      * Fingers crossed.
      
      * We want only 1 run of integration tests.....
      
      ---------
      Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>
      a4e3e8c6
  21. 29 Aug, 2024 1 commit
    • Lots of improvements (Still 2 allocators) (#2449) · e415b690
      Nicolas Patry authored
      
      
      * Making prefix/flashinfer the default and testing the full release tests.
      
      * Include flashinfer in the docker.
      
      * Using prebuilt.
      
      * Allowing window_left_size (dummy version).
      
      * Disabling flashinfer/prefix caching on odd head_dim
      
      * Disable prefix caching for lora.
      
      * More specific codes.
      
      * Update lock
      
      * Updating integration tests with new values with FI/FD.
      
      Remove paged as a default too, and using FD everywhere.
      
      * Update cargo lock ?
      
      * Upgrade to 1.80 because of bitstream...
      
      * Everywhere 1.80
      
      * Forgot last default place.
      
      * Apply suggestions from code review
      Co-authored-by: drbh <david.richard.holtz@gmail.com>
      
      * Updated flake lock
      
      * Tmp
      
      * Upgrade resolution system for fewer errors in resolution.
      
      * Remove lambda for cleaner function.
      
      * Handling debugger.
      
      * Override the env in server tests.
      
      * Is this enough to make it work ?
      
      * This seems to be working.
      
      * Downgrade some logs.
      
      * Fixing the default for vlm.
      
      * Don't enable prefix caching on VLM just yet.
      
      * Change `add_special_tokens` in order to have the correct tokens for both
      chat and non-chat input (since it's super important with the prefixing now)
      
      * Fixing prefix caching for flashdecoding.
      
      * Update all models.
      
      * Fixed flashinfer version.
      
      * add_special_tokens is internal only
      
      * Fixing seqlen with the new vlms.
      
      * Fixing the issue with `add_special_tokens` not being passed around.
      
      * Fixing the test.
      
      * Removing encoder_decoder (seq2seq).
      
      * Update the chat test.
      
      * Fixing the batching tokenization in flash causal lm.
      
      * Truncating left for radix purposes.
      
      * Oops this doesn't belong here.
      
      * Put back default pure shell.
      
      * Update server tests
      
      - Default to throughput test in k6
      - Use TGI_WIGGLE_ROOM to adjust wiggle room
      
      * Only n_heads / process_group.size() are necessary.
      
      * Revert the integration tests change (seems linked to head_size
      modification).
      
      * Adding error message when assert is violated.
      
      * Fixing the free algorithm to handle times where the common prefix is
      smaller.
      
      * Apply suggestions from code review
      Co-authored-by: OlivierDehaene <olivier@huggingface.co>
      
      * Update server/text_generation_server/layers/attention/common.py
      Co-authored-by: OlivierDehaene <olivier@huggingface.co>
      
      * Fix disabling prefix caching - Fix windowing checks.
      
      * Revert the Cohere tokenizer change (for now using a revision instead).
      
      * Fmt.
      
      ---------
      Co-authored-by: drbh <david.richard.holtz@gmail.com>
      Co-authored-by: OlivierDehaene <olivier@huggingface.co>
      e415b690
  22. 15 Aug, 2024 1 commit
    • Fixing exl2 and other quantize tests again. (#2419) · 57b34958
      Nicolas Patry authored
      * Fixing exl2 and other quantize tests again.
      
      * Mark exl2 as non release (so CI tests them, needs to be removed later).
      
      * Fixing exl2 (by disabling cuda graphs)
      
      * Fix quantization defaults without cuda graphs on exl2 (linked to new
      issues with it).
      
      * Removing serde override.
      
      * Go back to released exl2 and remove log.
      
      * Adding warnings for deprecated bitsandbytes + upgrade info to warn.
      57b34958
  23. 09 Aug, 2024 1 commit
  24. 31 Jul, 2024 1 commit
  25. 29 Jul, 2024 1 commit
    • Run ci api key (#2315) · 583d37a2
      Erik Kaunismäki authored
      
      
      * Add API_Key for Auth and conditionally add authorisation for non info/health endpoints.
      
      * change name to info routes
      
      * Fix comment
      
      * convert strings to lowercase for case insensitive comparison
      
      * convert header to string
      
      * fixes and update docs
      
      * update docs again
      
      * revert wrong update
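
The behaviour outlined in these bullets (an API key that protects every route except the info/health ones, with a case-insensitive comparison) might look roughly like the sketch below. The route names in `INFO_ROUTES`, the `Bearer` scheme, and the function name are assumptions made for illustration; lowercasing both sides is one reading of the "convert strings to lowercase" bullet and also makes the key itself case-insensitive.

```python
from typing import Mapping, Optional

# Hypothetical set of routes that stay open without a key.
INFO_ROUTES = frozenset({"/info", "/health"})


def is_authorized(path: str, headers: Mapping[str, str], api_key: Optional[str]) -> bool:
    """Return True if the request may proceed."""
    if api_key is None or path in INFO_ROUTES:
        return True  # no key configured, or an info route
    # Header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    token = normalized.get("authorization")
    if token is None:
        return False
    return token.lower() == f"bearer {api_key}".lower()
```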
      
      ---------
      Co-authored-by: Kevin Duffy <kevin.duffy94@gmail.com>
      583d37a2
  26. 24 Jul, 2024 1 commit
    • fix: refactor adapter weight loading and mapping (#2193) · 5d85a958
      drbh authored
      * fix: refactor adapter weight loading and mapping
      
      * feat: enable lora load from directory
      
      * fix: adjust launcher for local lora adapters
      
      * feat: improve weight loading and add tests
      
      * fix: improve logging and rebase syntax issue
      
      * fix: improve adapter merge comments and remove unused conditional
      
      * fix: improve get_model_with_lora_adapters naming
      
      * fix: comment typo
      5d85a958
  27. 23 Jul, 2024 1 commit
  28. 22 Jul, 2024 1 commit
    • Softcapping for gemma2. (#2273) · 6aeb6690
      Nicolas Patry authored
      * Softcapping for gemma2.
      
      * Less clutter.
      
      * No access to transformers config, only config_dict here.
      
      * 0.0 is the null value in the C++ API.
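
As a reminder of what softcapping does: a score is squashed into the range `(-cap, cap)` with `cap * tanh(score / cap)`. The sketch below applies that formula to a single value and treats `0.0` as "disabled", mirroring the null-value convention mentioned in the last bullet. The function name is hypothetical, and the real computation happens inside the attention kernels rather than in Python.

```python
import math


def softcap(score: float, cap: float) -> float:
    """Apply tanh softcapping; a cap of 0.0 means softcapping is disabled."""
    if cap == 0.0:
        return score
    return cap * math.tanh(score / cap)
```

Small scores pass through almost unchanged, while very large scores saturate near `cap`.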
      6aeb6690
  29. 19 Jul, 2024 1 commit
  30. 01 Jul, 2024 1 commit
  31. 25 Jun, 2024 5 commits
    • Enable multiple LoRA adapters (#2010) · 04e1af94
      drbh authored
      
      
      * feat: first draft load multiple lora
      
      * feat: load weights within layer and refactor lora pass
      
      * fix: refactor and reduce lora math
      
      * feat: baseline impl single request multi lora support
      
      * feat: prefer lorax implementation and port loading logic
      
      * fix: prefer adapter_data and refactors
      
      * feat: prefer lorax's custom punica kernels and add mlp loras
      
      * fix: adjust batch for bgmv
      
      * fix: adjust adapter_segments logic when in batch
      
      * fix: refactor and move changes to v3 proto
      
      * fix: pass model_id for all flash causal lms
      
      * fix: pass model_id for all causal and seq2seq lms
      
      * fix: add model_id to model test
      
      * feat: add lora support to mistral and refactors
      
      * feat: prefer model id in request
      
      * fix: include rust code for adapter id
      
      * feat: bump launcher and add new lora docs
      
      * feat: support base model generation and refactors
      
      * fix: rename doc to retry ci build
      
      * feat: support if vlm models
      
      * fix: add adapter_data param and avoid missing layers
      
      * fix: add adapter_data param to phi and neox
      
      * fix: update all models forwards to include adapter_data
      
      * fix: add model_id to IdeficsCausalLM
      
      * Update lora.md
      
      Fixed a typo
      
      * Update lora.md
      
      Fixing spam image
      
      * fix: add lora kernel to dockerfile, support running without kernels and refactors
      
      * fix: avoid dockerfile conflict
      
      * fix: refactors and adjust flash llama lora logic
      
      * fix: skip llama test due to CI issue (temp)
      
      * fix: skip llama test CI (temp) 2
      
      * fix: revert skips and prefer updated ci token for tests
      
      * fix: refactors and helpful comments
      
      * fix: add noop in TensorParallelAdapterRowLinear too
      
      * fix: refactor and move shard_lora_weights logic
      
      * fix: exit early if no adapter_data
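
Several bullets above mention adapter segments within a batch. One plausible reading is that consecutive requests using the same adapter are grouped so each group can be handled in one pass. The sketch below shows that grouping under this assumption; `find_segments` and its return shape are chosen for illustration and are not taken from the commit.

```python
from typing import List, Sequence, Tuple


def find_segments(adapter_indices: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Group consecutive equal adapter indices.

    Returns (boundaries, segment_adapters): segment i covers positions
    boundaries[i] .. boundaries[i + 1] - 1 and uses segment_adapters[i].
    """
    boundaries: List[int] = []
    segment_adapters: List[int] = []
    for position, adapter in enumerate(adapter_indices):
        if position == 0 or adapter != adapter_indices[position - 1]:
            boundaries.append(position)
            segment_adapters.append(adapter)
    boundaries.append(len(adapter_indices))
    return boundaries, segment_adapters
```

A batch whose requests use adapters `[0, 0, 1, 1, 1, 0]` splits into three segments, because the last request does not sit next to the first two.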
      
      ---------
      Co-authored-by: Derek <datavistics@gmail.com>
      04e1af94
    • Fix CI. (#2118) · a2a97b05
      Nicolas Patry authored
      Fix clippy.
      a2a97b05
    • use xpu-smi to dump used memory (#2047) · 83634dc1
      Wang, Yi authored
      
      
      * use xpu-smi to dump used memory
      XPU uses "ZE_AFFINITY_MASK" to select the card; usage is like CUDA_VISIBLE_DEVICES
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      
      * Update server/text_generation_server/utils/import_utils.py
      Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
      
      ---------
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
      83634dc1
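
The parallel between `ZE_AFFINITY_MASK` and `CUDA_VISIBLE_DEVICES` drawn in the commit above can be shown with a small helper that reads whichever variable applies to the device type. This is a sketch with a hypothetical function name; it only splits the comma-separated value and does not call `xpu-smi` or interpret the entries further.

```python
from typing import List, Mapping, Optional


def visible_devices(device_type: str, env: Mapping[str, str]) -> Optional[List[str]]:
    """Return the device entries selected through the environment, if any."""
    variable = "ZE_AFFINITY_MASK" if device_type == "xpu" else "CUDA_VISIBLE_DEVICES"
    raw = env.get(variable)
    if not raw:
        return None  # variable unset or empty: no restriction expressed
    return [entry.strip() for entry in raw.split(",") if entry.strip()]
```

Passing `os.environ` as `env` reads the real environment; a plain dict works for testing.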
    • Add OTLP Service Name Environment Variable (#2076) · 1869ee2f
      KevinDuffy94 authored
      * Adding Service Name Environment variable for https://github.com/huggingface/text-generation-inference/issues/2069
      
      * Update Docs
      
      * Update README.md
      
      * Update Launcher Docs
      
      * Update Launcher Docs
      Removing Option
      1869ee2f
    • Support `HF_TOKEN` environment variable (#2066) · 3447c722
      Lucain authored
      
      
      * Support HF_TOKEN environment variable
      
      * Load test.
      
      ---------
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
      3447c722
  32. 10 Jun, 2024 1 commit
    • ROCm and sliding windows fixes (#2033) · 9b3674d9
      fxmarty authored
      * update vllm commit & fix models using sliding window
      
      * update
      
      * update commit
      
      * fix bug where tunableop is bound to cuda graph even when cuda graphs are disabled
      
      * enable tunableop by default
      
      * fix sliding window
      
      * address review
      
      * dead code
      
      * precise comment
      
      * is it flaky?
      9b3674d9
  33. 06 Jun, 2024 1 commit
    • Add support for Marlin-quantized models · 4594e6fa
      Daniël de Kok authored
      This change adds support for Marlin-quantized models. Marlin is an
      FP16xINT4 matmul kernel, which provides good speedups when decoding
      batches of 16-32 tokens. It supports quantized models with symmetric
      quantization, a group size of -1 or 128, and 4-bit weights.
      
      Tested with:
      
      - Llama 2
      - Llama 3
      - Phi 3
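
To make the terms "symmetric quantization" and "group size" concrete, the sketch below quantizes a flat list of weights to 4-bit integers with one scale per group, where a group size of -1 means a single group spanning all the weights. It illustrates the numeric scheme only: it is not the Marlin kernel, does not pack values into the layout Marlin expects, and uses hypothetical function names.

```python
from typing import List, Sequence, Tuple


def quantize_symmetric_int4(
    weights: Sequence[float], group_size: int = 128
) -> Tuple[List[int], List[float]]:
    """Quantize to integers in [-8, 7] with one scale per group, no zero point."""
    if not weights:
        return [], []
    if group_size == -1:
        group_size = len(weights)  # one group for everything
    quantized: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        quantized.extend(max(-8, min(7, round(w / scale))) for w in group)
    return quantized, scales


def dequantize(quantized: Sequence[int], scales: Sequence[float], group_size: int) -> List[float]:
    """Invert quantize_symmetric_int4 up to rounding error."""
    if group_size == -1:
        group_size = max(len(quantized), 1)
    return [q * scales[i // group_size] for i, q in enumerate(quantized)]
```

Because the scheme is symmetric, zero always maps to zero and only a scale needs to be stored per group.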
      4594e6fa
  34. 31 May, 2024 1 commit
  35. 30 May, 2024 1 commit
    • Add support for exl2 quantization · 36dd1601
      Daniël de Kok authored
      Mostly straightforward; changes to existing code:
      
      * Wrap quantizer parameters in a small wrapper to avoid passing
        around untyped tuples and needing to repack them as a dict.
      * Move scratch space computation to warmup, because we need the
        maximum input sequence length to avoid allocating huge
        scratch buffers that OOM.
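
Both changes can be pictured with a short sketch. The first part replaces an untyped tuple with a small typed wrapper; the second sizes a scratch buffer from the maximum input sequence length, which is only known at warmup. The field names, `scratch_elements`, and the assumption that the buffer scales with sequence length times the widest layer are all illustrative and not taken from the commit.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class QuantizerParams:
    """Hypothetical typed wrapper replacing an untyped tuple of tensors."""
    q_weight: Any
    q_scale: Any
    group_size: int


def scratch_elements(max_input_length: int, widest_layer: int) -> int:
    """Assumed sizing rule: one scratch element per (token, output feature)."""
    return max_input_length * widest_layer
```

Callers then read `params.group_size` by name rather than remembering a tuple position, and the scratch size follows the real maximum input length instead of a worst-case guess.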
      36dd1601