  1. 30 Sep, 2024 2 commits
    • feat: support phi3.5 moe (#2479) · 93a7042d
      drbh authored
      
      
      * feat: support phi3.5 moe model loading
      
      * fix: prefer llama base model and improve rotary logic
      
      * feat: return reasonable generation and add integration test
      
      * fix: run lint and update docs
      
      * fix: rerun lint for openapi docs
      
      * fix: prefer do_sample false unless temp is set by user, and update chat tests
      
      * fix: small typo adjustments
      
      * fix: consolidate long rope paths
      
      * fix: revert greedy by default and test changes
      
      * Vendor configuration so that we don't have to `trust_remote_code`
      
      * Use SparseMoELayer
      
      * Add support for dense MoE
      
      * Some type annotations
      
      * Add the usual model tests
      
      * Ruff.
      
      ---------
      Co-authored-by: Daniël de Kok <me@danieldk.eu>
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
    • Add support for GPTQ-quantized MoE models using MoE Marlin (#2557) · 90a1d04a
      Daniël de Kok authored
      This change adds support for MoE models that use GPTQ quantization.
      Currently only models with the following properties are supported:
      
      - No `desc_act` with tensor parallelism, unless `group_size=-1`.
      - No asymmetric quantization.
      - No AWQ.
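
The three restrictions above can be read as a single validity check. The sketch below is only an illustration: `GPTQMoEConfig`, its field names, and `unsupported_reason` are hypothetical and are not taken from the change itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GPTQMoEConfig:
    """Hypothetical container for the settings named in the list above."""

    quant_method: str  # "gptq" or "awq"
    sym: bool  # True for symmetric quantization
    desc_act: bool  # activation reordering
    group_size: int  # -1 disables grouping
    tp_size: int = 1  # tensor-parallel world size (assumed field)


def unsupported_reason(cfg: GPTQMoEConfig) -> Optional[str]:
    """Return why a config is rejected, or None if all three rules pass."""
    if cfg.quant_method == "awq":
        return "AWQ is not supported"
    if not cfg.sym:
        return "asymmetric quantization is not supported"
    # desc_act is only a problem when sharding across devices,
    # and even then group_size=-1 is allowed.
    if cfg.desc_act and cfg.tp_size > 1 and cfg.group_size != -1:
        return "desc_act with tensor parallelism requires group_size=-1"
    return None
```

For example, a symmetric GPTQ config with `desc_act=True` and `group_size=128` passes on a single device but is rejected once `tp_size` is greater than 1.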
  2. 24 Sep, 2024 1 commit
  3. 19 Sep, 2024 1 commit
    • Stream options. (#2533) · f512021e
      Nicolas Patry authored
      * Stream options.
      
      * Fetch stuff from nix integration test for easier testing.
      
      * Adding the assert.
      
      * Only send the usage when asked for.
      
      * Update the docs.
      
      * Impure test because we need network.
      
      * develop.
      
      * Optional usage.
      
      * Fixes.
      
      * Workflow
  4. 17 Sep, 2024 1 commit
    • Move to moe-kernels package and switch to common MoE layer (#2511) · ce85efa9
      Daniël de Kok authored
      * Move to moe-kernels package and switch to common MoE layer
      
      This change introduces the new `moe-kernels` package:
      
      - Add `moe-kernels` as a dependency.
      - Introduce a `SparseMoELayer` module that can be used by MoE
        models.
      - Port over Mixtral and Deepseek.
      
      * Make `cargo check` pass
      
      * Update runner
  5. 16 Sep, 2024 2 commits
    • Adding a test for FD. (#2516) · 38fcafcf
      Nicolas Patry authored
      * Adding a test for FD.
      
      * Fixing flashdecoding (empty batch doesn't work).
      
      * Fixing the invalid popping.
      
      * Fixing radix with block_size > 1
      
      * Last reference.
      
      * Use an actual hash.
      
      * Update hash for slice.len() == 1
      
      * Update the locks.
      
      * Increasing docker timeout.
    • Add tests for Mixtral (#2520) · 77746552
      Daniël de Kok authored
      Disable by default because CI runners do not have enough GPUs.
  6. 11 Sep, 2024 2 commits
    • Fix truffle (#2514) · 69e3be20
      Nicolas Patry authored
      * Attempting to discard the trufflehog warning.
      
      * Attempt to fix trufflehog.
    • Prefix test - Different kind of load test to trigger prefix test bugs. (#2490) · a4e3e8c6
      Nicolas Patry authored
      
      
      * Adding prefix test.
      
      * [WIP] tmp dump of integration load tests.
      
      * Remove other tensor creation.
      
      * Fixed the radix tree.
      
      Used a slice everywhere in radix.rs to keep the cheap Arc cloning
      instead of recomputing the input_ids.
      
      * Fix parsing
      
      * Is it really flashinfer version ?
      
      * Remove some comments.
      
      * Revert the max prefix hit.
      
      * Adding numpy to diff.
      
      * Upgraded flashinfer.
      
      * Upgrading some stuff.
      
      * Are we done yet ?
      
      * Minor fixup
      
      * Remove 1 log and put back the other.
      
      * Add comment for why slot 0 is OK.
      
      * Mounting on the job.
      
      * Get me a debug branch
      
      * Debugging CIs is fun.
      
      * Attempt #28
      
      * wip
      
      * Tmate.
      
      * Praying.
      
      * Updating VLM causal model with updated context.
      
      * Important line got squashed.
      
      * Tmate again.
      
      * Fingers crossed.
      
      * We want only 1 run of integration tests.....
      
      ---------
      Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>
  7. 06 Sep, 2024 2 commits
  8. 29 Aug, 2024 1 commit
    • Lots of improvements (Still 2 allocators) (#2449) · e415b690
      Nicolas Patry authored
      
      
      * Making prefix/flashinfer the default and testing the full release tests.
      
      * Include flashinfer in the docker.
      
      * Using prebuilt.
      
      * Allowing window_left_size (dummy version).
      
      * Disabling flashinfer/prefix caching on odd head_dim
      
      * Disable prefix caching for lora.
      
      * More specific codes.
      
      * Update lock
      
      * Updating integration tests with new values with FI/FD.
      
      Remove paged as a default too, and using FD everywhere.
      
      * Update cargo lock ?
      
      * Upgrade to 1.80 because of bitstream...
      
      * Everywhere 1.80
      
      * Forgot last default place.
      
      * Apply suggestions from code review
      Co-authored-by: drbh <david.richard.holtz@gmail.com>
      
      * Updated flake lock
      
      * Tmp
      
      * Upgrade resolution system for fewer errors in resolution.
      
      * Remove lambda for cleaner function.
      
      * Handling debugger.
      
      * Override the env in server tests.
      
      * Is this enough to make it work ?
      
      * This seems to be working.
      
      * Downgrade some logs.
      
      * Fixing the default for vlm.
      
      * Don't enable prefix caching on VLM just yet.
      
      * Change `add_special_tokens` so that both chat and non-chat input get the
correct tokens (this matters a lot with the prefixing now)
      
      * Fixing prefix caching for flashdecoding.
      
      * Update all models.
      
      * Fixed flashinfer version.
      
      * add_special_tokens is internal only
      
      * Fixing seqlen with the new vlms.
      
      * Fixing the issue with `add_special_tokens` not being passed around.
      
      * Fixing the test.
      
      * Removing encoder_decoder (seq2seq).
      
      * Update the chat test.
      
      * Fixing the batching tokenization in flash causal lm.
      
      * Truncating left for radix purposes.
      
      * Oops this doesn't belong here.
      
      * Put back default pure shell.
      
      * Update server tests
      
      - Default to throughput test in k6
      - Use TGI_WIGGLE_ROOM to adjust wiggle room
      
      * Only n_heads / process_group.size() are necessary.
      
      * Revert the integration tests change (seems linked to the head_size
modification).
      
      * Adding error message when assert is violated.
      
      * Fixing the free algorithm to handle times where the common prefix is
      smaller.
      
      * Apply suggestions from code review
      Co-authored-by: OlivierDehaene <olivier@huggingface.co>
      
      * Update server/text_generation_server/layers/attention/common.py
      Co-authored-by: OlivierDehaene <olivier@huggingface.co>
      
      * Fix disabling prefix caching - Fix windowing checks.
      
      * Revert the Cohere tokenizer change (for now using a revision instead).
      
      * Fmt.
      
      ---------
      Co-authored-by: drbh <david.richard.holtz@gmail.com>
      Co-authored-by: OlivierDehaene <olivier@huggingface.co>
  9. 27 Aug, 2024 1 commit
    • Pr 2451 ci branch (#2454) · cfa73b5c
      drbh authored
      
      
      * fix[router]: Fix tools not passed in chat template
      Signed-off-by: GitHub <noreply@github.com>
      
      * feat: improve default tool serialization and lints
      
      * feat: refactor tool logic to include notify_error in prompt and adjust typing
      
      * fix: adjust non tool template apply
      
      * fix: simplify tool grammar logic and improve schema
      
      * feat: avoid skip tool test and avoid empty tool prompts
      
      * fix: increase test client timeout for grammar compilation tests
      
      ---------
      Signed-off-by: GitHub <noreply@github.com>
      Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>
  10. 16 Aug, 2024 2 commits
  11. 15 Aug, 2024 2 commits
  12. 12 Aug, 2024 1 commit
  13. 08 Aug, 2024 1 commit
  14. 29 Jul, 2024 1 commit
  15. 26 Jul, 2024 1 commit
    • feat: add ruff and resolve issue (#2262) · bab02ff2
      drbh authored
      * feat: add ruff and resolve issue
      
      * fix: update client exports and adjust after rebase
      
      * fix: adjust syntax to avoid circular import
      
      * fix: adjust client ruff settings
      
      * fix: lint and refactor import check and avoid model enum as global names
      
      * fix: improve fbgemm_gpu check and lints
      
      * fix: update lints
      
      * fix: prefer comparing model enum over str
      
      * fix: adjust lints and ignore specific rules
      
      * fix: avoid unneeded quantize check
  16. 25 Jul, 2024 3 commits
  17. 22 Jul, 2024 2 commits
  18. 20 Jul, 2024 1 commit
  19. 19 Jul, 2024 2 commits
    • Add support for Deepseek V2 (#2224) · e52be9bb
      Daniël de Kok authored
      Deepseek V2 is a MoE model from Deepseek. Relevant variations
      compared to other models:
      
      - Grouped top-K in expert selection.
      - mscale in yarn is calculated using the `mscale` and `mscale_all_dim`
        configuration options.
      - `mscale_all_dim` is also used in scaling attention softmax.
      - Permuting of the query/key representations before applying rotary
        embeddings.
      - Some projections cannot be sharded (`q_a_proj`, `kv_a_proj_with_mqa`).
        So we need weight loading that supports quantized weights. To this
        end, `{Weights,WeightLoader}.get_weight` was added.
      - The query/key head dimensionality differs from that of the value,
        so we need to pad during attention.
      - Heads of size 192 need an extension to our paged attention
        fork, and we need to ensure that the KV cache is allocated with the
        correct size.
      - Shared experts.
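
As an illustration of the first item, grouped top-K restricts expert selection to the best-scoring groups before picking individual experts. The sketch below handles one token in plain Python; the function name, the parameter names, and the choice to score each group by its best expert are assumptions made for the example, not a description of the code in this commit.

```python
def grouped_top_k(scores, n_group, topk_group, top_k):
    """Pick top_k experts for one token, restricted to the best topk_group groups.

    scores: per-expert router scores; len(scores) must split evenly into
    n_group contiguous groups.
    """
    group_size = len(scores) // n_group
    # Score each group by its best expert (assumed scoring rule).
    group_best = [
        max(scores[g * group_size : (g + 1) * group_size]) for g in range(n_group)
    ]
    # Keep the topk_group highest-scoring groups.
    kept = set(
        sorted(range(n_group), key=lambda g: group_best[g], reverse=True)[:topk_group]
    )
    # Rank only the experts that live in a kept group.
    candidates = [e for e in range(len(scores)) if e // group_size in kept]
    return sorted(candidates, key=lambda e: scores[e], reverse=True)[:top_k]


scores = [0.9, 0.1, 0.8, 0.7, 0.2, 0.3, 0.6, 0.5]
selected = grouped_top_k(scores, n_group=4, topk_group=2, top_k=4)
# selected == [0, 2, 3, 1]; a plain top-4 would have picked expert 6 instead of 1
```

Expert 6 has a higher score than expert 1, but its group is not among the two best groups, so it is never considered.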
    • Improve the handling of quantized weights (#2250) · ba291dad
      Daniël de Kok authored
      * Improve the handling of quantized weights
      
      Handling of quantized weights was split between two mechanisms:
      
      - For quantized checkpoints, we used the new weight loader
        infrastructure.
      - For quantization while loading (EETQ, FP8, bitsandbytes) we
        instead relied on a conditional in `get_linear`.
      
      Weight loaders support context managers to selectively load
      particular layers with different weight loaders, which is useful
      for models like Idefics2 AWQ, which uses a quantized text model,
      but unquantized vision and connector models. However, the context
      manager would be overridden by `get_linear`, which string-checks
      `quantizer`. Also, the context manager would not work with
      EETQ, FP8, and bitsandbytes.
      
      This change migrates all quantizers to the weight loader infrastructure.
      This has several benefits:
      
      - We can use context managers with all quantizers.
      - All the implementation details move down to the quantizer layers;
        `get_linear` does not need to know how to handle quantizer linear
        layers.
      - All quantizer weights are strongly typed; we don't pass around
        raw tensors.
      - We don't have to pass around the `quantizer` string everywhere.
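
A minimal sketch of the context-manager idea, assuming a toy `Weights` class: the class, `use_loader`, and both loader functions are hypothetical stand-ins written for this example and do not reproduce the server's actual classes or signatures.

```python
from contextlib import contextmanager


class Weights:
    """Toy stand-in for a checkpoint wrapper that delegates to a loader."""

    def __init__(self, loader):
        self.loader = loader

    @contextmanager
    def use_loader(self, loader):
        # Swap the active loader and restore the previous one on exit,
        # even if loading raises.
        previous = self.loader
        self.loader = loader
        try:
            yield
        finally:
            self.loader = previous

    def get_weight(self, name):
        return self.loader(name)


def quantized_loader(name):
    # Placeholder: a real loader would return a typed quantized weight.
    return ("quantized", name)


def unquantized_loader(name):
    # Placeholder: a real loader would return a plain tensor wrapper.
    return ("unquantized", name)


weights = Weights(quantized_loader)
with weights.use_loader(unquantized_loader):
    vision = weights.get_weight("vision_model.proj")  # loaded unquantized
text = weights.get_weight("text_model.proj")  # back to the quantized loader
```

Because the loader decides how a layer is loaded, nothing downstream has to inspect a `quantizer` string to choose a code path.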
      
      * Exclude non-MLP layers when using FP8 quantization with Llama
  20. 15 Jul, 2024 1 commit
    • feat: simple mistral lora integration tests (#2180) · 5a650669
      drbh authored
      * feat: simple mistral lora integration tests
      
      * fix: include args in docker launcher
      
      * fix: disable cuda graphs with lora and warn
      
      * fix: adjust docs and precommit issues
      
      * fix: re update docs
  21. 05 Jul, 2024 2 commits
    • GPTQ CI improvements (#2151) · 67ef0649
      Daniël de Kok authored
      * Add more representative Llama GPTQ test
      
      The Llama GPTQ test is updated to use a model with the commonly-used
      quantizer config format and activation sorting. The old test is
      kept around (but renamed) since it tests the format produced by
      `text-generation-server quantize`.
      
      * Add support for manually triggering a release build
    • Refactor dead code - Removing all `flash_xxx.py` files. (#2166) · fb2f74e2
      Nicolas Patry authored
      * Refactor dead code.
      
      * First working step.
      
      * Remove a lot of duplicated code.
      
      * More dead code.
      
      * More cleanup.
      
      * Fix Santacoder test.
      
      * Fixing the simple tests.
      
      * Fixing sharding.
      
      * Fixes for VLM.
      
      * Fixing santacoder (num_kv_heads hardcoded).
      
      * Removing more dead code.
      
      * Fixing `config.n_head`.
      
      * Stopping earlier because of `<end_of_utterance>` in idefics2.
      
      * Addresses comments.
      
      * Removing the dead code.
      
      * Fuse back mistral into FlashCausalLM.
      
      * Finish removal.
      
      * Fixing docs + causal_lm `batch_class`.
      
      * Fixing docs + causal.lm.
      
      * Add default to Gemma Causality.
      
      * Default value for gemma/gemma2.
      
      * Wrong default.
  22. 01 Jul, 2024 1 commit
    • Use GPTQ-Marlin for supported GPTQ configurations (#2111) · 2ce80194
      Daniël de Kok authored
      GPTQ-Marlin is currently the best-performing kernel for GPTQ models. So
      let's use it by default if the kernels are installed, the GPU supports
      it, and the kernels support the configuration.
      
      For models generated by `text-generation-server quantize`, use
      `sym=False`. This subcommand has used asymmetric quantization since the
      beginning, and incorrectly reporting the model as symmetric would select
      GPTQ-Marlin (which does not support asymmetric quantization).
  23. 27 Jun, 2024 1 commit
  24. 25 Jun, 2024 2 commits
  25. 24 Jun, 2024 1 commit
    • New runner. Manual squash. (#2110) · 480d3b33
      Nicolas Patry authored
      * New runner. Manual squash.
      
      * Network host.
      
      * Put back trufflehog with proper extension.
      
      * No network host ?
      
      * Moving buildx install after tailscale ?
      
      * 1.79
  26. 17 Jun, 2024 1 commit
    • Support different image sizes in prefill in VLMs (#2065) · e9037708
      Daniël de Kok authored
      When a batch contained images of different sizes during prefill, the
      server would fail (see e.g. #2056). Images were processed separately and
      then concatenated. However, this can fail for images with different sizes.
      
      Fix this by preprocessing all images in the batch together, so that the
      image processor can ensure that all image tensors have compatible sizes.
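
To see why the whole batch has to be processed together, consider a toy version in which an image is a list of pixel rows. A common shape can only be chosen once every image in the batch is known. `pad_batch` is a hypothetical helper for this illustration; a real image processor may resize or pad in a different way.

```python
def pad_batch(images, fill=0):
    """Pad every 2-D image (a list of rows) to the largest height and width in the batch."""
    height = max(len(img) for img in images)
    width = max(len(row) for img in images for row in img)
    padded = []
    for img in images:
        # Extend each row to the common width, then add blank rows at the bottom.
        rows = [list(row) + [fill] * (width - len(row)) for row in img]
        rows += [[fill] * width for _ in range(height - len(rows))]
        padded.append(rows)
    return padded


a = [[1, 2], [3, 4]]  # 2x2 image
b = [[5, 6, 7]]  # 1x3 image
batch = pad_batch([a, b])
# batch == [[[1, 2, 0], [3, 4, 0]], [[5, 6, 7], [0, 0, 0]]]
```

Processing `a` and `b` separately would leave a 2x2 and a 1x3 result, which cannot be stacked into one tensor.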
  27. 14 Jun, 2024 1 commit
    • Add support for GPTQ Marlin (#2052) · 093a27c5
      Daniël de Kok authored
      Add support for GPTQ Marlin kernels
      
      GPTQ Marlin extends the Marlin kernels to support common GPTQ
      configurations:
      
      - bits: 4 or 8
      - groupsize: -1, 32, 64, or 128
      - desc_act: true/false
      
      Using the GPTQ Marlin kernels requires repacking the parameters in the
      Marlin quantizer format.
      
      The kernels were contributed by Neural Magic to vLLM. We vendor them
      here for convenience.
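
The configurations listed in this message can be written down as a small lookup. This is only a sketch: `marlin_supports` and the two constants are hypothetical names, and the check covers just the three options named above, not any other condition the kernels may impose.

```python
SUPPORTED_BITS = {4, 8}
SUPPORTED_GROUPSIZES = {-1, 32, 64, 128}


def marlin_supports(bits: int, groupsize: int, desc_act: bool) -> bool:
    """Check a GPTQ config against the combinations listed in the message."""
    # desc_act may be either True or False, so it does not restrict the result.
    return bits in SUPPORTED_BITS and groupsize in SUPPORTED_GROUPSIZES
```

For example, `marlin_supports(4, 128, True)` is accepted, while a 3-bit model or a group size of 16 is not.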
  28. 11 Jun, 2024 1 commit