1. 29 Aug, 2024 2 commits
    • Tied embeddings in MLP speculator. (#2473) · d9fbbaaf
      Nicolas Patry authored
      * Tied embeddings in MLP speculator.
      
      * Fixing the scale_weight when users decide to not use the speculation as
      much as defined in the config.
      
      * Adding scaling support + optimize some ops.
      d9fbbaaf
    • Lots of improvements (Still 2 allocators) (#2449) · e415b690
      Nicolas Patry authored
      
      
      * Making prefix/flashinfer the default and testing the full release tests.
      
      * Include flashinfer in the docker.
      
      * Using prebuilt.
      
      * Allowing window_left_size (dummy version).
      
      * Disabling flashinfer/prefix caching on odd head_dim
      
      * Disable prefix caching for lora.
      
      * More specific codes.
      
      * Update lock
      
      * Updating integration tests with new values with FI/FD.
      
      Remove paged as a default too, and using FD everywhere.
      
      * Update cargo lock ?
      
      * Upgrade to 1.80 because of bitstream...
      
      * Everywhere 1.80
      
      * Forgot last default place.
      
      * Apply suggestions from code review
      Co-authored-by: drbh <david.richard.holtz@gmail.com>
      
      * Updated flake lock
      
      * Tmp
      
      * Upgrade resolution system for fewer errors in resolution.
      
      * Remove lambda for cleaner function.
      
      * Handling debugger.
      
      * Override the env in server tests.
      
      * Is this enough to make it work ?
      
      * This seems to be working.
      
      * Downgrade some logs.
      
      * Fixing the default for vlm.
      
      * Don't enable prefix caching on VLM just yet.
      
      * Change `add_special_tokens` in order to have the correct tokens for both
      chat and non-chat input (since it's super important with the prefixing now)
      
      * Fixing prefix caching for flashdecoding.
      
      * Update all models.
      
      * Fixed flashinfer version.
      
      * add_special_tokens is internal only
      
      * Fixing seqlen with the new vlms.
      
      * Fixing the issue with `add_special_tokens` not being passed around.
      
      * Fixing the test.
      
      * Removing encoder_decoder (seq2seq).
      
      * Update the chat test.
      
      * Fixing the batching tokenization in flash causal lm.
      
      * Truncating left for radix purposes.
      
      * Oops this doesn't belong here.
      
      * Put back default pure shell.
      
      * Update server tests
      
      - Default to throughput test in k6
      - Use TGI_WIGGLE_ROOM to adjust wiggle room
      
      * Only n_heads / process_group.size() are necessary.
      
      * Revert the integration tests change (seems linked to head_size
      modification).
      
      * Adding error message when assert is violated.
      
      * Fixing the free algorithm to handle times where the common prefix is
      smaller.
      
      * Apply suggestions from code review
      Co-authored-by: OlivierDehaene <olivier@huggingface.co>
      
      * Update server/text_generation_server/layers/attention/common.py
      Co-authored-by: OlivierDehaene <olivier@huggingface.co>
      
      * Fix disabling prefix caching - Fix windowing checks.
      
      * Revert the Cohere tokenizer change (for now using a revision instead).
      
      * Fmt.
      
      ---------
      Co-authored-by: drbh <david.richard.holtz@gmail.com>
      Co-authored-by: OlivierDehaene <olivier@huggingface.co>
      e415b690
  2. 20 Aug, 2024 1 commit
    • Prefix caching (#2402) · b70ae096
      Nicolas Patry authored
      
      
      * Prefix caching WIP
      
      * Fixing prefix attention.
      
      * Fixing flashinfer import.
      
      * Fixing black.
      
      * Fixing medusa (still wrong outputs, but functional).
      
      * Just medusa values now.
      
      * Fixing medusa without prefix caching.
      
      * Fixing prefix caching.
      
      * Medusa requires reshaping.
      
      * Removing the logs.
      
      * Remove router.nix
      
      * Fixup:
      
      - Remove logs
      - Disable VLMs (they do not work)
      - Disable prefix caching when user wants prefill logprobs.
      
      * Update flake.lock
      
      ---------
      Co-authored-by: Daniël de Kok <me@danieldk.eu>
      b70ae096
  3. 14 Aug, 2024 1 commit
  4. 13 Aug, 2024 1 commit
  5. 12 Aug, 2024 2 commits
  6. 09 Aug, 2024 2 commits
    • Using an enum for flash backends (paged/flashdecoding/flashinfer) (#2385) · 7a48a847
      Nicolas Patry authored
      * Using an enum for flash backends (paged/flashdecoding/flashinfer)
      
      * Early exit on server too.
      
      * Clippy.
      
      * Fix clippy and fmt.
      7a48a847
    • Add FlashInfer support (#2354) · 7830de15
      Daniël de Kok authored
      This change adds support for FlashInfer. FlashInfer can be enabled using
      `FLASH_INFER=1` and is currently only implemented in `FlashCausalLM`.
      Since this functionality is currently only for testing, FlashInfer is
      not installed anywhere yet.
      
      The FlashInfer API is quite different from FlashAttention/vLLM in that
      it requires more global bookkeeping:
      
      * A wrapper class needs to be constructed (which we just call *state*).
        Since this is fairly expensive (due to pinned host memory allocation),
        we only do this once in a FlashCausalLM instance or for each CUDA
        Graph size.
      * Each model forward call needs to be wrapped in `begin_forward` and
        `end_forward`. This sets up data structures that can be reused for all
        calls to attention for that forward call.
      
      When calling attention, we need access to the state object. To avoid
      passing an argument down the call chain (which would require changes to
      all models), we use a context variable.
      
      Each model forward call is wrapped using a context manager that does all
      the bookkeeping for such a call:
      
      * Set the context variable to the forward call's state.
      * Call `begin_forward` on the state.
      * Yield.
      * Call `end_forward` on the state.
      * Reset the context variable.
      
      We cannot use a single shared global variable for this, since e.g. CUDA
      Graphs of different sizes each have their own state.
      7830de15
  7. 08 Aug, 2024 1 commit
  8. 06 Aug, 2024 1 commit
  9. 05 Aug, 2024 1 commit
    • fix: attempt forward on flash attn2 to check hardware support (#2335) · 215ed3ad
      drbh authored
      * fix: attempt forward on flash attn2 to check hardware support
      
      * fix: warn window_size_left when using flash attn 1
      
      * fix: prefer version check over test op and avoid window_size_left if not flash attn2
      
      * fix: improve conditional and error message
      
      * fix: update sliding window conditional
      
      * fix: simplify changes and revert model changes
      
      * fix: avoid changing conditional
      
      * fix: typo tweak
      215ed3ad
  10. 01 Aug, 2024 1 commit
    • Unify attention output handling (#2343) · 47447ef0
      Daniël de Kok authored
      - Always return the hidden states.
      - Create the output tensor inside the `attention` and `paged_attention`
        functions.
      
      This removes the difference between how the output is handled between
      attention (output parameter) and paged attention (return value). This
      also removes the assumption that the attention implementation can
      write to an output tensor (in preparation of FlashInfer).
      47447ef0
  11. 31 Jul, 2024 1 commit
    • Handle GPTQ-Marlin loading in `GPTQMarlinWeightsLoader` (#2300) · 34f7dcfd
      Daniël de Kok authored
      The `GPTQWeightsLoader` was structured like this in pseudocode:
      
      if marlin:
        Set up tensors in a way that GPTQ-Marlin expects
      else:
        Set up tensors in a way that ExLlama/GPTQ/AWQ expect
      
      However, the GPTQ-Marlin implementation details should really be in the
      `marlin` module. So move the former part out to a separate
      `GPTQMarlinWeightsLoader`.
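
A minimal sketch of the resulting shape, assuming a simplified `get_weights` signature. `LoadedWeight` and `get_loader` are hypothetical helpers for illustration and the real loaders work on tensors rather than strings.

```python
from dataclasses import dataclass


@dataclass
class LoadedWeight:
    """Hypothetical result type that records which layout was prepared."""

    layout: str
    prefix: str


class GPTQWeightsLoader:
    """Only sets up tensors the way ExLlama/GPTQ/AWQ expect."""

    def get_weights(self, prefix: str) -> LoadedWeight:
        return LoadedWeight(layout="gptq", prefix=prefix)


class GPTQMarlinWeightsLoader:
    """Only sets up tensors the way GPTQ-Marlin expects."""

    def get_weights(self, prefix: str) -> LoadedWeight:
        return LoadedWeight(layout="gptq-marlin", prefix=prefix)


def get_loader(use_marlin: bool):
    # The marlin/non-marlin decision is made once, when the loader is chosen,
    # instead of inside every loading method.
    return GPTQMarlinWeightsLoader() if use_marlin else GPTQWeightsLoader()
```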
      34f7dcfd
  12. 30 Jul, 2024 1 commit
  13. 29 Jul, 2024 1 commit
  14. 26 Jul, 2024 1 commit
    • feat: add ruff and resolve issue (#2262) · bab02ff2
      drbh authored
      * feat: add ruff and resolve issue
      
      * fix: update client exports and adjust after rebase
      
      * fix: adjust syntax to avoid circular import
      
      * fix: adjust client ruff settings
      
      * fix: lint and refactor import check and avoid model enum as global names
      
      * fix: improve fbgemm_gpu check and lints
      
      * fix: update lints
      
      * fix: prefer comparing model enum over str
      
      * fix: adjust lints and ignore specific rules
      
      * fix: avoid unneeded quantize check
      bab02ff2
  15. 25 Jul, 2024 1 commit
  16. 24 Jul, 2024 2 commits
    • fix: refactor adapter weight loading and mapping (#2193) · 5d85a958
      drbh authored
      * fix: refactor adapter weight loading and mapping
      
      * feat: enable lora load from directory
      
      * fix: adjust launcher for local lora adapters
      
      * feat: improve weight loading and add tests
      
      * fix: improve logging and rebase syntax issue
      
      * fix: improve adapter merge comments and remove unused conditional
      
      * fix: improve get_model_with_lora_adapters naming
      
      * fix: comment typo
      5d85a958
    • Split up `layers.marlin` into several files (#2292) · 93d2b9fe
      Daniël de Kok authored
      The marlin.py file was getting large, split it up.
      93d2b9fe
  17. 23 Jul, 2024 3 commits
    • Add support for Llama 3 rotary embeddings (#2286) · 4ab41737
      Daniël de Kok authored
      * Add support for Llama 3 rotary embeddings
      
      * Update transformers to 4.43
      4ab41737
    • Add support for repacking AWQ weights for GPTQ-Marlin (#2278) · 9935720c
      Daniël de Kok authored
      * Add support for repacking AWQ weights for GPTQ-Marlin
      
      So far we couldn't support AWQ because virtually all AWQ models use
      asymmetric quantization, which GPTQ-Marlin did not support. GPTQ-Marlin
      has recently added support for AWQ repacking and AWQ asymmetric
      quantization (zero_point=True).
      
      This change updates all GPTQ-Marlin kernels from upstream and wires up
      AWQ support. For now enabling AWQ using Marlin requires running TGI with
      `--quantize gptq`.
      
      * Enable Marlin for supported AWQ configurations by default
      
      This makes the AWQ -> GPTQ repack test redundant, since we are now
      testing this with the regular AWQ test.
      9935720c
    • fix(l4): fix fp8 logic on l4 (#2277) · 5fca30ee
      OlivierDehaene authored
      * fix(l4): fix fp8 logic on l4
      
      * also quant weights with single scale
      
      * use marlin even on 89
      5fca30ee
  18. 22 Jul, 2024 2 commits
  19. 20 Jul, 2024 1 commit
    • feat(fp8): use fbgemm kernels and load fp8 weights directly (#2248) · 53ec0b79
      OlivierDehaene authored
      * feat(fp8): add support for fbgemm
      
      * allow loading fp8 weights directly
      
      * update outlines
      
      * fix makefile
      
      * build fbgemm
      
      * avoid circular import and fix dockerfile
      
      * add default dtype
      
      * refactored weights loader
      
      * fix auto conversion
      
      * fix quantization config parsing
      
      * force new nccl on install
      
      * missing get_weights implementation
      
      * increase timeout
      53ec0b79
  20. 19 Jul, 2024 2 commits
    • Add support for Deepseek V2 (#2224) · e52be9bb
      Daniël de Kok authored
      Deepseek V2 is a MoE model from Deepseek. Relevant variations
      compared to other models:
      
      - Grouped top-K in expert selection.
      - mscale in yarn is calculated using the `mscale` and `mscale_all_dim`
        configuration options.
      - `mscale_all_dim` is also used in scaling attention softmax.
      - Permuting of the query/key representations before applying rotary
        embeddings.
      - Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`).
        So, we need weight loading that supports quantized weights. To this
        end `{Weights,WeightLoader}.get_weight` was added.
      - The query/key head dimensionality differs from that of the value,
        so we need to pad during attention.
      - Heads of size 192 need an extension to our paged attention
        fork and we need to ensure that the KV cache is allocated with the
        correct size.
      - Shared experts.
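
To illustrate the first item, here is a rough pure-Python sketch of grouped top-K expert selection. It assumes that each group is scored by its best expert and that experts are split into equally sized contiguous groups. The function and parameter names are hypothetical and this is not the code added by this change.

```python
from typing import List


def grouped_top_k(
    scores: List[float], n_group: int, topk_group: int, top_k: int
) -> List[int]:
    """Pick `top_k` experts, but only from the `topk_group` best groups."""
    if len(scores) % n_group != 0:
        raise ValueError("number of experts must be divisible by n_group")
    group_size = len(scores) // n_group

    # Assumption: a group's score is the score of its best expert.
    group_scores = [
        max(scores[g * group_size : (g + 1) * group_size]) for g in range(n_group)
    ]

    # Keep the highest-scoring groups.
    kept_groups = sorted(
        range(n_group), key=lambda g: group_scores[g], reverse=True
    )[:topk_group]

    # Ordinary top-K, restricted to the experts of the kept groups.
    candidates = [
        e for g in kept_groups for e in range(g * group_size, (g + 1) * group_size)
    ]
    return sorted(candidates, key=lambda e: scores[e], reverse=True)[:top_k]
```

With scores `[0.9, 0.1, 0.2, 0.8, 0.7, 0.6, 0.05, 0.3]`, four groups, two kept groups and `top_k=3`, the sketch selects experts 0, 3 and 2. Expert 4 has a higher score than expert 2, but its group is not among the two kept groups.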
      e52be9bb
    • Improve the handling of quantized weights (#2250) · ba291dad
      Daniël de Kok authored
      * Improve the handling of quantized weights
      
      Handling of quantized weights was split between two mechanisms:
      
      - For quantized checkpoints, we used the new weight loader
        infrastructure.
      - For quantization while loading (EETQ, FP8, bitsandbytes) we
        instead relied on a conditional in `get_linear`.
      
      Weight loaders support context managers to selectively load
      particular layers with different weight loaders, which is useful
      for models like Idefics2 AWQ, which uses a quantized text model,
      but unquantized vision and connector models. However, the context
      manager would be overridden by `get_linear`, which string-checks
      `quantizer`. Also, the context manager would not work with
      EETQ, FP8, and bitsandbytes.
      
      This change migrates all quantizers to the weight loader infrastructure.
      This has several benefits:
      
      - We can use context managers with all quantizers.
      - All the implementation details move down to the quantizer layers,
        `get_linear` does not need to know how to handle quantizer linear
        layers.
      - All quantizer weights are strongly typed, we don't pass around
        raw tensors.
      - We don't have to pass around the `quantizer` string everywhere.
      
      * Exclude non-MLP layers when using FP8 quantization with Llama
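
The context-manager mechanism described in the first item can be sketched as follows. This is a simplified illustration: `use_loader`, `AWQLoader`, and `UnquantizedLoader` are assumed names and the string results stand in for typed weight objects.

```python
from contextlib import contextmanager
from typing import Iterator


class UnquantizedLoader:
    def get_weights(self, prefix: str) -> str:
        return f"unquantized:{prefix}"


class AWQLoader:
    def get_weights(self, prefix: str) -> str:
        return f"awq:{prefix}"


class Weights:
    def __init__(self, loader) -> None:
        self.loader = loader

    @contextmanager
    def use_loader(self, loader) -> Iterator[None]:
        """Temporarily load with a different weight loader."""
        previous = self.loader
        self.loader = loader
        try:
            yield
        finally:
            self.loader = previous  # always restore the default loader

    def get_weights(self, prefix: str) -> str:
        # No string check on a `quantizer` name: the active loader decides.
        return self.loader.get_weights(prefix)
```

A model with a quantized text part and an unquantized vision part could then load the text layers with the default loader and wrap only the vision layers in `with weights.use_loader(UnquantizedLoader()):`.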
      ba291dad
  21. 12 Jul, 2024 2 commits
  22. 11 Jul, 2024 1 commit
  23. 09 Jul, 2024 1 commit
    • Move quantized weight handling out of the `Weights` class (#2194) · 8511669c
      Daniël de Kok authored
      Quantized weights were loaded in the `Weights` class, but this was
      getting quite unwieldy, where every higher level method to load weights
      was a long conditional to cover all the different quantizers.
      
      This change moves loading of quantized weights out of the `Weights`
      class. This is done by defining a simple `WeightsLoader` interface
      that is implemented by `Exl2WeightsLoader`, `GPTQWeightsLoader`,
      and `MarlinWeightsLoader`. These implementations are in the quantizers'
      respective modules. The `Weights` class provides the low-level load
      operations (such as loading tensors or sharded tensors), but delegates
      loads that need quantizer-specific weight processing to a loader. The
      loaders still use the low-level functionality provided by `Weights`.
      
      I initially tried making a hierarchy where a class like `GPTQWeights`
      would inherit from `Weights`. But it is not very flexible (e.g. does
      not work well with the new weight storage mock used in tests) and
      the implicit indirections made the code harder to follow.
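
The delegation described above might look roughly like this. It is a sketch under assumptions: the method names, the `DefaultLoader` and `ToyQuantLoader` classes, and the list-based "tensors" are hypothetical and much simpler than the real interface.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class WeightsLoader(ABC):
    """Interface for quantizer-specific weight processing."""

    @abstractmethod
    def get_weights(self, weights: "Weights", prefix: str) -> List[float]:
        ...


class Weights:
    """Provides low-level loads and delegates the rest to a loader."""

    def __init__(self, tensors: Dict[str, List[float]], loader: WeightsLoader) -> None:
        self._tensors = tensors
        self.loader = loader

    def get_tensor(self, name: str) -> List[float]:
        # Low-level operation that loaders build on.
        return self._tensors[name]

    def get_weights(self, prefix: str) -> List[float]:
        # Composition instead of a `GPTQWeights(Weights)` subclass.
        return self.loader.get_weights(self, prefix)


class DefaultLoader(WeightsLoader):
    def get_weights(self, weights: Weights, prefix: str) -> List[float]:
        return weights.get_tensor(f"{prefix}.weight")


class ToyQuantLoader(WeightsLoader):
    """Toy quantizer: stored values are multiplied by a single scale."""

    def get_weights(self, weights: Weights, prefix: str) -> List[float]:
        qweight = weights.get_tensor(f"{prefix}.qweight")
        scale = weights.get_tensor(f"{prefix}.scales")[0]
        return [value * scale for value in qweight]
```

Because the loader receives the `Weights` object, a test can swap in a mock weight store without touching the loader classes.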
      8511669c
  24. 05 Jul, 2024 1 commit
  25. 02 Jul, 2024 2 commits
  26. 01 Jul, 2024 3 commits
    • [Major Change][Undecided yet] Move to FlashDecoding instead of PagedAttention kernel. (#1940) · 4327210e
      Nicolas Patry authored
      * Using flash decoding
      
      Conditional flashdecoding.
      
      Fix max_q.
      
      Working kvcache
      
      Working version with flash decoding.
      
      Make it work for mistral.
      
      Fix after rebase..
      
      Less intrusive.
      
      Revert changes in modeling.
      
      Speedup flashdecoding.
      
      Hack to make other models work.
      
      Fixing non flash decoding llama path.
      
      Router logic knows about page size.
      
      Missing 2 models.
      
      Missing cohere.
      
      Fixing cohere flash decoding.
      
      Revamped all this architecture.
      
      Fix cohere.
      
      Fixing falcon.
      
      Enabling custom block size schedule.
      
      Update router/src/infer.rs
      
      Not sending preallocated output.
      
      * Making it work on non flash decoding.
      
      * Fix Cohere.
      
      * Fix non decoding paths.
      
      * Rebased.
      
      * No need for cache_manager anymore.
      
      * Update?
      
      * "ipex" -> "cpu"
      
      * These do not belong.
      
      * Factoring cu_seqlen_qk for better abstracting over every model.
      
      * Fixing non flash tests/imports.
      
      * Changing return everywhere.
      
      * Update mistral past.
      
      * Fixing Mi{s,x}tral (non functional in Flash Decoding mode though).
      
      * Fixup mistral clamping (had issues with cuda graphs).
      
      * No need to recreate anything actually.
      4327210e
    • refine get xpu free memory/enable Qwen2/gemma2/gemma/phi in intel platform (#2132) · 5da4cfab
      Wang, Yi authored
      
      
      * refine get xpu free memory
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      
      * enable qwen2 in xpu
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      
      * enable gemma/gemma2/phi in intel platform
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      
      ---------
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      5da4cfab
    • Use GPTQ-Marlin for supported GPTQ configurations (#2111) · 2ce80194
      Daniël de Kok authored
      GPTQ-Marlin is currently the best-performing kernel for GPTQ models. So
      let's use it by default if the kernels are installed, the GPU supports
      it, and the kernels support the configuration.
      
      For models generated by `text-generation-server quantize`, use
      `sym=False`. This subcommand has used asymmetric quantization since the
      beginning, and incorrectly reporting the model as symmetric would select
      GPTQ-Marlin (which does not support asymmetric quantization).
      2ce80194
  27. 25 Jun, 2024 2 commits
    • Add support for Marlin 2:4 sparsity (#2102) · f1f98e36
      Daniël de Kok authored
      This change adds support for 2:4 sparsity when using Marlin
      quantization. The 2:4 kernel is used when:
      
      * The quantizer is `marlin`;
      * the quantizer checkpoint format is `marlin_24`.
      
      Fixes #2098.
      f1f98e36
    • Support AWQ quantization with bias (#2117) · 14980df2
      Daniël de Kok authored
      When the AWQ quantizer was used with a layer that uses a bias,
      the bias tensor was not correctly passed/used. Instead, the
      value `true`/`1.0` was added to the linear transformation.
      
      Correctly pass through the bias when it is not `None`.
      
      Fixes #2106.
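
The effect of the bug can be reproduced with a toy linear layer. This sketch uses plain Python lists rather than the AWQ kernels, and both functions are hypothetical illustrations of the behaviour described above.

```python
from typing import List, Optional


def _matmul(x: List[float], weight: List[List[float]]) -> List[float]:
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in weight]


def buggy_linear(
    x: List[float], weight: List[List[float]], bias: Optional[List[float]]
) -> List[float]:
    # Bug: a "has bias" flag is added instead of the bias values.
    # In Python `True` behaves like 1, so every output is shifted by 1.0.
    has_bias = bias is not None
    return [out + has_bias for out in _matmul(x, weight)]


def fixed_linear(
    x: List[float], weight: List[List[float]], bias: Optional[List[float]]
) -> List[float]:
    out = _matmul(x, weight)
    if bias is not None:
        # Fix: pass the actual bias through when it is not `None`.
        out = [o + b for o, b in zip(out, bias)]
    return out
```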
      14980df2