1. 17 Oct, 2024 1 commit
  2. 16 Oct, 2024 2 commits
    • feat: prefill chunking (#2600) · a6a0c97e
      OlivierDehaene authored
      
      
      * wip
      
      * rollback
      
      * refactor to use prefix/postfix naming + fix all_input_ids_tensor
      
      * maybe patching vlms?
      
      * fix filter and concat
      
      * wip, no filter, no concat
      
      * current
      
      * add prepare_for_prefill
      
      * working
      
      * load tested
      
      * re-create slots
      
      * re-create slots
      
      * fix slot_filtering_indices
      
      * feedback loop
      
      * remove log
      
      * fix benchmarker
      
      * fix vlm and seq2seq
      
      * rename to cache and input lengths
      
      * fix prefill logprobs
      
      * fix launcher
      
      * fix logprobs?
      
      * idk at this point
      
      * max input length
      
      * omfg
      
      * remove debugging lines
      
      * fix tests
      
      * fix mllama
      
      * fix cargo tests
      
      * remove chunking support for paged
      
      * Fixing non blocked attentions
      
      * Fixing dtype + AMD, Ipex targets.
      
      * lint fix.
      
      * rename
      
      * Fix prefix_caching variable, remove defaults in server (confusing a lot
      of the time).
      
      * Add simple resolution when user specifies ATTENTION=paged.
      
      * Put back non default simple tests.
      
      * Fix env name
      
      ---------
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
    • Fp8 e4m3_fnuz support for rocm (#2588) · 704a58c8
      Mohit Sharma authored
      * (feat) fp8 fnuz support for rocm
      
      * (review comments) Fix compression_config load, type hints
      
      * (bug) update all has_tensor
      
      * (review_comments) fix typo and added comments
      
      * (nit) improved comment
  3. 15 Oct, 2024 3 commits
  4. 14 Oct, 2024 5 commits
  5. 11 Oct, 2024 1 commit
  6. 10 Oct, 2024 3 commits
  7. 09 Oct, 2024 2 commits
    • AMD CI (#2589) · 43f39f68
      Nicolas Patry authored
      * Only run 1 valid test.
      
      * Trying the tailscale action quickly.
      
      * ?
      
      * bash spaces.
      
      * Remove tailscale.
      
      * More quotes.
      
      * mnt2 ?
      
      * Othername to avoid recursive directories.
      
      * Good old tmate.
      
      * Remove tmate.
      
      * Trying a few things.
      
      * Remove some stuff.
      
      * Sleep ?
      
      * Tmp
      
      * busybox
      
      * Launcher tgi
      
      * Starting hello
      
      * Busybox in python
      
      * No device.
      
      * Removing all variables ?
      
      * At some point.
      
      * Tmp
      
      * Tmp2
      
      * Device request, no container name
      
      * No device requests
      
      * Without pytest.
      
      * No pytest.
      
      * from env
      
      * Start with devices
      
      * Attempt #1
      
      * Remove stdin messing
      
      * Only 1 test, no container name
      
      * Raw tgi
      
      * Sending args.
      
      * Show pip freeze.
      
      * Start downloading with token
      
      * Giving HIP devices.
      
      * Mount volume + port forward
      
      * Without pytest.
      
      * No token
      
      * Repeated arguments
      
      * Wrong kwarg.
      
      * On 2 GPUs
      
      * Fallback to single shard CI test.
      
      * Testing
      
      * yaml
      
      * Common cache ?
      
      * Trailing slash ?
      
      * Docker volume split.
      
      * Fix docker volume
      
      * Fixing ?
      
      * ?
      
      * Try no devices ?
      
      * Flash llama on intel CPU ?
      
      * Fix nvidia ?
      
      * Temp deactivate intel, activate nvidia ?
    • nix: add black and isort to the closure (#2619) · 9ed0c85f
      Daniël de Kok authored
      To make sure that everything is formatted with the same black version
      as CI.
      
      I sometimes use isort for new files to get nicely ordered imports,
      so add it as well. Also set the isort configuration to format in a
      way that is compatible with black.
  8. 08 Oct, 2024 4 commits
  9. 07 Oct, 2024 2 commits
  10. 04 Oct, 2024 2 commits
    • Add basic FP8 KV cache support (#2603) · 2358c2bb
      Daniël de Kok authored
      * Add basic FP8 KV cache support
      
      This change adds rudimentary FP8 KV cache support. The support is
      enabled by passing `--kv-cache-dtype fp8_e5m2` to the launcher. Doing so
      uses this type for the KV cache. However, support is still limited:
      
      * Only the `fp8_e5m2` type is supported.
      * The KV cache layout is the same as `float16`/`bfloat16` (HND).
      * The FP8 KV cache is only supported for FlashInfer.
      * Loading of scales is not yet supported.
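To make the `fp8_e5m2` type concrete: E5M2 uses 1 sign bit, 5 exponent bits, and 2 mantissa bits, so it has the same exponent range as `float16` but one byte per value instead of two. The following is a minimal sketch of that relationship, not the conversion TGI or FlashInfer actually performs. It truncates toward zero for simplicity (real kernels usually round to nearest), and all function names are illustrative.

```python
import struct


def fp16_bits(value: float) -> int:
    """Return the 16-bit IEEE 754 half-precision pattern for value."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def to_fp8_e5m2(value: float) -> int:
    """Encode value as one E5M2 byte by truncating its float16 pattern.

    E5M2 shares float16's sign bit and 5 exponent bits and keeps only the
    top 2 mantissa bits, so the upper byte of the float16 pattern is a
    valid (round-toward-zero) E5M2 encoding.
    """
    return fp16_bits(value) >> 8


def from_fp8_e5m2(byte: int) -> float:
    """Decode an E5M2 byte by padding the dropped mantissa bits with zeros."""
    return struct.unpack("<e", struct.pack("<H", (byte & 0xFF) << 8))[0]


def roundtrip(value: float) -> float:
    """Show the value that survives storage in a single E5M2 byte."""
    return from_fp8_e5m2(to_fp8_e5m2(value))
```

Values with at most two mantissa bits (such as 1.0, 1.75, or -2.5) survive the round trip exactly, while a value like 1.1 collapses to 1.0. This loss of precision is one reason scale factors matter for FP8 caches.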
      
      * Fix Cargo.toml
    • Daniël de Kok's avatar
  11. 03 Oct, 2024 2 commits
  12. 02 Oct, 2024 4 commits
    • Unroll notify error into generate response (#2597) · d22b0c1f
      drbh authored
      * feat: unroll notify_error if no tool is chosen
      
      * fix: expect simple message when no tool is selected
      
      * fix: improve test to avoid notify_error
      
      * fix: improve docs and indicate change in expected response
      
      * fix: adjust linting in test file
    • CI (2592): Allow LoRA adapter revision in server launcher (#2602) · 23354595
      drbh authored
      
      
      allow revision for lora adapters from launcher
      Co-authored-by: Sida <sida@kulamind.com>
      Co-authored-by: teamclouday <teamclouday@gmail.com>
    • Max token capacity metric (#2595) · 0204946d
      Nicolas Patry authored
      
      
      * adding max_token_capacity_metric
      
      * added tgi to name of metric
      
      * Adding max capacity metric.
      
      * Add description for the metrics
      
      ---------
      Co-authored-by: Edwinhr716 <Edandres249@gmail.com>
    • Mllama flash version (#2585) · d18ed5cf
      Nicolas Patry authored
      * Working loading state.
      
      * Preprocessing.
      
      * Working state ? (Broke idefics1 temporarily).
      
      * Cleaner condition.
      
      * Fix idefics.
      
      * Updating config, removing TODO
      
      * Mllama
      
      * Upgrade transformers 4.45
      
      * Flashing mllama.
      
      * Starting to get there.
      
      * Working state.
      
      * Integration tests for mllama (cutting to 10 tokens because there seems
      to be instability after that, meaning the size of the batch matters).
      
      * Updating model link.
      
      * Earlier assert.
      
      * Fix vlm ?
      
      * remove log.
      
      * Force ignore all images but last.
      
      * Default dtype bfloat16.
      
      * Update integration test after switch to bf16.
      
      * Remove dead code.
      
      * Removed dead code.
      
      * Upgrade the flake to latest transformers/tokenizers
      
      * Move to hf tgi-nix
      
      * Upgrade to 0.5.0
  13. 01 Oct, 2024 1 commit
    • nix: experimental support for building a Docker container (#2470) · 584b4d7a
      Daniël de Kok authored
      
      
      * nix: experimental support for building a Docker image
      
      Run using something like:
      
      ```
      docker run \
        --device nvidia.com/gpu=all \
        -it --rm -p 8080:80 \
        -v $PWD/data:/data \
        -v $PWD/tmp:/tmp \
        tgi-docker:latest \
        --model-id <model_id>
      ```
      
      * Example of building the Docker image using Nix inside Docker
      
      * Stream to make the builder image smaller
      
      This avoids storing a Docker image tarball in the image. Instead,
      stream the layers while doing `docker run`.
      
      * Don't spam journalctl on Linux
      
      * Other dockerfile.
      
      ---------
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
  14. 30 Sep, 2024 7 commits
    • MoE Marlin: support `desc_act` for `groupsize != -1` (#2590) · 1c84a30f
      Daniël de Kok authored
      This change uses the updated Marlin MoE kernel from vLLM to support
      MoE with activation sorting and groups.
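For readers unfamiliar with the terms: in GPTQ checkpoints, `desc_act` (activation ordering) processes input channels from most to least important, and the group size sets how many channels share one scale. The sketch below shows how the two interact, following the group-index convention commonly used by GPTQ tooling. The function name and the toy importance scores are illustrative assumptions, not code from TGI or vLLM.

```python
from __future__ import annotations


def group_index(importance: list[float], group_size: int, desc_act: bool) -> list[int]:
    """Map each input channel to the quantization group that holds its scale."""
    n = len(importance)
    if group_size == -1:
        # A single group spans every channel, so ordering cannot matter.
        return [0] * n
    if not desc_act:
        # Channels are grouped in their natural order: 0, 0, ..., 1, 1, ...
        return [i // group_size for i in range(n)]
    # With desc_act, channels are processed from most to least important,
    # and groups are formed over that processing order.
    order = sorted(range(n), key=lambda i: importance[i], reverse=True)
    g_idx = [0] * n
    for position, channel in enumerate(order):
        g_idx[channel] = position // group_size
    return g_idx
```

With `desc_act` enabled and a finite group size, channels in the same group are no longer contiguous, so a kernel has to sort or permute activations to line them up with their scales. With a group size of -1 the distinction disappears.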
    • Move flake back to tgi-nix `main` (#2586) · d1f257ac
      Daniël de Kok authored
    • feat: support phi3.5 moe (#2479) · 93a7042d
      drbh authored
      
      
      * feat: support phi3.5 moe model loading
      
      * fix: prefer llama base model and improve rotary logic
      
      * feat: return reasonable generation and add integration test
      
      * fix: run lint and update docs
      
      * fix: rerun lint for openapi docs
      
      * fix: prefer do_sample false unless temp is set by user, and update chat tests
      
      * fix: small typo adjustments
      
      * fix: consolidate long rope paths
      
      * fix: revert greedy by default and test changes
      
      * Vendor configuration so that we don't have to `trust_remote_code`
      
      * Use SparseMoELayer
      
      * Add support for dense MoE
      
      * Some type annotations
      
      * Add the usual model tests
      
      * Ruff.
      
      ---------
      Co-authored-by: Daniël de Kok <me@danieldk.eu>
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
    • Add support for GPTQ-quantized MoE models using MoE Marlin (#2557) · 90a1d04a
      Daniël de Kok authored
      This change adds support for MoE models that use GPTQ quantization.
      Currently only models with the following properties are supported:
      
      - No `desc_act` with tensor parallelism, unless `group_size=-1`.
      - No asymmetric quantization.
      - No AWQ.
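The three restrictions above can be read as a single predicate over a model's quantization settings. The sketch below is a hypothetical helper written only to restate the list; the function name, parameter names, and messages are assumptions and do not come from the TGI code base.

```python
from __future__ import annotations


def unsupported_reason(
    quant_method: str,
    sym: bool,
    desc_act: bool,
    group_size: int,
    world_size: int,
) -> str | None:
    """Return why a quantized MoE model is rejected, or None if it is accepted.

    Hypothetical restatement of the restrictions listed in this commit.
    """
    if quant_method != "gptq":
        # AWQ (and anything else that is not GPTQ) is out of scope here.
        return "only GPTQ-quantized MoE models are supported"
    if not sym:
        return "asymmetric quantization is not supported"
    if desc_act and world_size > 1 and group_size != -1:
        # desc_act under tensor parallelism is only allowed with one group.
        return "desc_act with tensor parallelism requires group_size=-1"
    return None
```

For example, a symmetric GPTQ model with `desc_act` and a group size of 128 passes on a single shard but is rejected once `world_size` is 2, unless its group size is -1.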
    • Update ROCM libs and improvements (#2579) · f9e561ec
      Mohit Sharma authored
      * style
      
      * update torch
      
      * fix issues
      
      * fix clone
      
      * revert mkl
      
      * added custom PA
      
      * style
      
      * fix style
      
      * style
      
      * hide env var
      
      * fix mixtral model
      
      * add skinny kernel and merge fixes
      
      * fixed style
      
      * fix issue for sliding window models
      
      * addressed review comments
      
      * fix import
      
      * improved error message
      
      * updated default value
      
      * remove import
      
      * fix imports after rebase
      
      * float16 dep
      
      * improve dockerfile
      
      * cleaned dockerfile
    • Update architecture.md (#2577) · e790cfc0
      Ikram Ul Haq authored
    • Remove compute capability lazy cell (#2580) · afc7ded8
      Daniël de Kok authored
      Remove compute capability lock
      
      We are only calling the `get_cuda_capability` function once, so avoiding
      the cost of multiple calls is not really necessary yet.
  15. 28 Sep, 2024 1 commit