1. 03 Dec, 2024 1 commit
    • Sync (most) server dependencies with Nix (#2782) · 2003d8be
      Daniël de Kok authored
      
      
      * Sync (most) server dependencies with Nix
      
      Skipped most grpcio packages, because of protobuf version
      incompatibility with the opentelemetry packages.
      
      * Add a primitive script to generate Poetry commands to sync with Nix
      
      This is not fully automated, since the versions from Nix may not always
      be resolvable. However, it does take most of the work out of doing this
      manually.
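      
      As a rough illustration (not the actual script added in this commit; the
      input file name and format are assumptions), such a generator could look
      like:
      
      ```
      import json
      
      def main() -> None:
          # Assumed input format: {"package-name": "1.2.3", ...}
          with open("nix-versions.json") as f:
              versions = json.load(f)
      
          # Skipped because of the protobuf/opentelemetry conflict noted above.
          skip = {"grpcio", "grpcio-status", "grpcio-tools"}
          for name, version in sorted(versions.items()):
              if name not in skip:
                  print(f"poetry add {name}=={version}")
      
      if __name__ == "__main__":
          main()
      ```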
      
      * Upgrade eetq ?
      
      * Fmt.
      
      ---------
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
  2. 19 Nov, 2024 1 commit
  3. 18 Nov, 2024 1 commit
    • Add support for compressed-tensors w8a8 int checkpoints (#2745) · 3c9df21f
      Daniël de Kok authored
      
      
      * Add support for compressed-tensors w8a8 int checkpoints
      
      This change adds a loader for w8a8 int checkpoints. One large benefit of
      int8 support is that the corresponding cutlass matmul kernels also work on
      compute capability 7.5.
      
      Evaluation on neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8:
      
      |     Tasks     |Version|     Filter     |n-shot|        Metric         |   |Value |   |Stderr|
      |---------------|------:|----------------|-----:|-----------------------|---|-----:|---|------|
      |gsm8k_cot_llama|      3|flexible-extract|     8|exact_match            |↑  |0.8431|±  |0.0100|
      |               |       |strict-match    |     8|exact_match            |↑  |0.8393|±  |0.0101|
      |ifeval         |      4|none            |     0|inst_level_loose_acc   |↑  |0.8597|±  |   N/A|
      |               |       |none            |     0|inst_level_strict_acc  |↑  |0.8201|±  |   N/A|
      |               |       |none            |     0|prompt_level_loose_acc |↑  |0.7967|±  |0.0173|
      |               |       |none            |     0|prompt_level_strict_acc|↑  |0.7468|±  |0.0187|
      
      This is in the same ballpark as vLLM.
      
      As usual, lots of thanks to Neural Magic/vLLM for the kernels.
      
      * Always use dynamic input quantization for w8a8 int
      
      It's far less flaky and gives better output.
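      
      For context, dynamic input quantization means computing activation scales
      at runtime, per token, instead of using calibrated static scales. A
      minimal sketch of the idea (illustrative only, not the marlin/cutlass
      kernel path used here):
      
      ```
      import torch
      
      def dynamic_int8_quantize(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
          """Per-token symmetric int8 quantization with runtime scales."""
          # One scale per token (row), derived from that row's absolute maximum.
          scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
          q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
          # The int8 matmul result is later rescaled with `scale` (and the
          # weight scales) to recover the original dynamic range.
          return q, scale
      ```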
      
      * Use marlin-kernels 0.3.5
      
      * Fix a typo
      Co-authored-by: drbh <david.richard.holtz@gmail.com>
      
      * Small fixes
      
      ---------
      Co-authored-by: drbh <david.richard.holtz@gmail.com>
  4. 17 Nov, 2024 1 commit
    • Remove vLLM dependency for CUDA (#2751) · 52e48739
      Daniël de Kok authored
      * Remove vLLM dependency for CUDA
      
      This change adds `attention-kernels` as a dependency for paged
      attention and cache reshaping. With that, we don't use vLLM
      anywhere for CUDA.
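      
      For illustration, "cache reshaping" refers to scattering freshly computed
      keys/values into the slots of the paged KV cache. A simplified sketch of
      the idea (not the `attention-kernels` API; the layout and names below are
      assumptions):
      
      ```
      import torch
      
      def reshape_and_cache(
          key: torch.Tensor,           # [num_tokens, num_heads, head_size]
          value: torch.Tensor,         # [num_tokens, num_heads, head_size]
          key_cache: torch.Tensor,     # [num_blocks * block_size, num_heads, head_size]
          value_cache: torch.Tensor,   # same layout as key_cache
          slot_mapping: torch.Tensor,  # [num_tokens], flat cache slot per token
      ) -> None:
          # Scatter each token's key/value into the cache slot assigned to it.
          key_cache[slot_mapping] = key
          value_cache[slot_mapping] = value
      ```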
      
      Tested run (since we don't have paged attention in CI):
      
      ```
      ❯ ATTENTION=paged python -m pytest integration-tests -k "llama and awq" --release
      [...]
      5 snapshots passed.
      ```
      
      * Fix clippy warning
  5. 14 Nov, 2024 1 commit
  6. 10 Nov, 2024 1 commit
    • Add initial support for compressed-tensors checkpoints (#2732) · a7850008
      Daniël de Kok authored
      compressed-tensors is a safetensors extension for sparse, quantized
      tensors. The format is more powerful than earlier AWQ/GPTQ/FP8
      quantization, because
      
      - Different quantizer configurations can be used for different targets.
      - The format can specify input/output quantizers in addition to weight
        quantizers.
      - Configurable exclusions for quantization.
      
      This change adds a dependency on the `compressed-tensors` package for
      its configuration parsing and layer matching functionality.
      
      The following types of quantization are supported in this PR:
      
      - W8A16 and W4A16 INT using GPTQ-Marlin kernels.
      - W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels.
      
      Support for other quantization types will be added in subsequent PRs.
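      
      As a rough sketch of the kind of configuration such checkpoints carry (a
      hypothetical example, not copied from a real checkpoint), the quantization
      section might look something like:
      
      ```
      # Hypothetical shape of a compressed-tensors `quantization_config` as it
      # might appear in a checkpoint's config.json; field values are made up.
      quantization_config = {
          "config_groups": {
              "group_0": {
                  "targets": ["Linear"],   # which modules this group applies to
                  "weights": {"num_bits": 8, "type": "int", "symmetric": True},
                  # no input_activations -> weight-only (W8A16) quantization
              }
          },
          "ignore": ["lm_head"],           # modules excluded from quantization
      }
      ```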
  7. 04 Nov, 2024 1 commit
  8. 25 Oct, 2024 1 commit
  9. 24 Oct, 2024 1 commit
    • Add support for FP8 KV cache scales (#2628) · eab07f74
      Daniël de Kok authored
      * Add support for FP8 KV cache scales
      
      Since FP8 only has limited dynamic range, we can scale keys/values
      before storing them into the cache (and unscale them in attention). To
      avoid rescaling the cache as the absmax values change, good scales are
      usually determined per layer using calibration data and stored
      in the checkpoint.
      
      This change adds support for using key-value scales and loading them
      from checkpoints in the two most common formats:
      
      - Separate per-layer `k_scale` and `v_scale` scalars.
      - Per-layer `kv_scale` scalar (older format).
      
      Currently, scales are only used with a `float8_e4m3fn` cache.
      
      Besides adding support for key/value scales, the `fp8_quantize` function
      is also extended to support quantization with a kernel vendored from
      vLLM. This is slightly faster than the PyTorch implementation, but also
      scales in FP32, potentially improving accuracy.
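      
      A minimal sketch of the key/value scaling idea (illustrative only; the
      actual code uses the vendored kernel and a different cache layout):
      
      ```
      import torch
      
      def store_fp8_key(key: torch.Tensor, k_scale: float) -> torch.Tensor:
          # Divide by the calibrated scale so values fit the narrow FP8 range,
          # computing in FP32 before casting to float8_e4m3fn.
          return (key.to(torch.float32) / k_scale).to(torch.float8_e4m3fn)
      
      def load_fp8_key(key_fp8: torch.Tensor, k_scale: float, dtype: torch.dtype) -> torch.Tensor:
          # Multiply the calibrated scale back in when reading for attention.
          return (key_fp8.to(torch.float32) * k_scale).to(dtype)
      ```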
      
      * Update FP8 KV cache test to use checkpoint with scales
      
      * `can_scale`: check that the attention is flashinfer
  10. 22 Oct, 2024 1 commit
    • Add `impureWithCuda` dev shell (#2677) · 9c9ef37c
      Daniël de Kok authored
      * Add `impureWithCuda` dev shell
      
      This shell is handy when developing some kernels jointly with TGI - it
      adds nvcc and a bunch of commonly-used CUDA libraries to the environment.
      
      We don't add this to the normal impure shell to keep the development
      environment as clean as possible (avoid accidental dependencies, etc.).
      
      * Add cuDNN
  11. 08 Oct, 2024 2 commits
  12. 04 Oct, 2024 1 commit
  13. 02 Oct, 2024 1 commit
    • Mllama flash version (#2585) · d18ed5cf
      Nicolas Patry authored
      * Working loading state.
      
      * Preprocessing.
      
      * Working state ? (Broke idefics1 temporarily).
      
      * Cleaner condition.
      
      * Fix idefics.
      
      * Updating config, removing TODO
      
      * Mllama
      
      * Upgrade transformers to 4.45
      
      * Flashing mllama.
      
      * Starting to get there.
      
      * Working state.
      
      * Integration tests for mllama (cutting to 10 tokens because there seems
      to be instability afterwards, meaning the batch size matters).
      
      * Updating model link.
      
      * Earlier assert.
      
      * Fix vlm ?
      
      * remove log.
      
      * Force ignore all images but last.
      
      * Default dtype bfloat16.
      
      * Update integration test after switch to bf16.
      
      * Remove dead code.
      
      * Removed dead code.
      
      * Upgrade the flake to latest transformers/tokenizers
      
      * Move to hf tgi-nix
      
      * Upgrade to 0.5.0
  14. 01 Oct, 2024 1 commit
    • nix: experimental support for building a Docker container (#2470) · 584b4d7a
      Daniël de Kok authored
      
      
      * nix: experimental support for building a Docker image
      
      Run using something like:
      
      ```
      docker run \
        --device nvidia.com/gpu=all \
        -it --rm -p 8080:80 \
        -v $PWD/data:/data \
        -v $PWD/tmp:/tmp \
        tgi-docker:latest \
        --model-id <model_id>
      ```
      
      * Example of building the Docker image using Nix inside Docker
      
      * Stream to make the builder image smaller
      
      This avoids storing a Docker image tarball in the image. Instead,
      stream the layers while doing `docker run`.
      
      * Don't spam journalctl on Linux
      
      * Other dockerfile.
      
      ---------
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
  15. 30 Sep, 2024 3 commits
  16. 27 Sep, 2024 1 commit
    • Improve support for GPUs with capability < 8 (#2575) · 5b6b74e2
      Daniël de Kok authored
      * Improve support for GPUs with capability < 8
      
      - For models that cannot use flashinfer, use flash-attn v1 + paged
        attention for models with a compute capability older than 8.
      - Disable prefix caching when using paged attention.
      - When using flash-attn v1, pass the key/value, rather than the
        cache, since v1 cannot use block tables.
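      
      A condensed sketch of the dispatch described in the list above
      (illustrative only, not the actual TGI code):
      
      ```
      import torch
      
      def select_attention_backend() -> str:
          major, _minor = torch.cuda.get_device_capability()
          if major >= 8:
              return "flashinfer"
          # Older GPUs: flash-attn v1 for prefill plus paged attention for
          # decode; prefix caching is disabled because v1 has no block tables.
          return "paged"
      ```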
      
      * nix: add flash-attn-v1 to the server environment
      
      * Move disabling prefix caching into the block of exceptions
      
      * Capability as `usize`s
  17. 19 Sep, 2024 2 commits
  18. 17 Sep, 2024 1 commit
  19. 12 Sep, 2024 2 commits
    • Add nix test. (#2513) · d95c670a
      Nicolas Patry authored
      * Add nix test.
      
      * Modifying yourself means you need to rerun.
      
      * Fixing the test + adding click (needed for pre-commit hooks).
      
      * Try this.
      
      * Our runner + pure test (not written)
      
      * Remove server.
      
      * Root user.
      
      * Different user ?
      
      * Add the actual test target.
      
      * Forgot this modification.
      
      * Add a formatter.
      
      * Add the secrets.
      
      * Fixed the auth token ?
      
      * Adding the other tests.
      
      * Missing pre-commit.
      
      * Test requires cargo for cargo fmt.
      
      * Update it a bit.
      
      * Up.
      
      * Attempting to use a cache location for the models.
      
      * Ignore the cache for now.
    • nix: support Python tokenizer conversion in the router (#2515) · 94304649
      Daniël de Kok authored
      Ideally we wouldn't have the router wrapper that this change adds,
      but when I give PyO3 a Python interpreter with packages, it ends
      up linking libpython from the Python interpreter rather than the
      constructed environment and cannot pick up the Python modules as
      a result.
  20. 06 Sep, 2024 1 commit
  21. 02 Sep, 2024 1 commit
  22. 29 Aug, 2024 1 commit
    • nix: build Torch against MKL and various other improvements (#2469) · 4e821c00
      Daniël de Kok authored
      Updates tgi-nix input:
      
      - Move Torch closer to upstream by building against MKL.
      - Remove compute capability 8.7 from Torch (Jetson).
      - Sync nixpkgs compute capabilities with Torch (avoids
        compiling too many capabilities for MAGMA).
      - Use nixpkgs configuration passed through by `tgi-nix`.
  23. 23 Aug, 2024 1 commit
    • nix: add default package (#2453) · f3c5d7d9
      Daniël de Kok authored
      The default package wraps the launcher and puts the server/router in the
      path.
      
      As a result, TGI can be started using something like:
      
      ```
      nix run .# -- \
        --model-id hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
        --port 8080
      ```
  24. 21 Aug, 2024 1 commit
  25. 20 Aug, 2024 2 commits
    • nix: add pure server to flake, add both pure and impure devshells (#2430) · f5f11b79
      Daniël de Kok authored
      * nix: pure server and support both pure and impure devShells
      
      * nix: remove unused poetry2nix input
      
      It is not wired up and we now have a pure server.
      
      * nix: add ipdb to impure devshell
    • Prefix caching (#2402) · b70ae096
      Nicolas Patry authored
      
      
      * Prefix caching WIP
      
      * Fixing prefix attention.
      
      * Fixing flashinfer import.
      
      * Fixing black.
      
      * Fixing medusa (still wrong outputs, but functional).
      
      * Just medusa values now.
      
      * Fixing medusa without prefix caching.
      
      * Fixing prefix caching.
      
      * Medusa requires reshaping.
      
      * Removing the logs.
      
      * Remove router.nix
      
      * Fixup:
      
      - Remove logs
      - Disable VLMs (they do not work)
      - Disable prefix caching when user wants prefill logprobs.
      
      * Update flake.lock
      
      ---------
      Co-authored-by: Daniël de Kok <me@danieldk.eu>
  26. 19 Aug, 2024 1 commit
  27. 16 Aug, 2024 1 commit
  28. 15 Aug, 2024 1 commit
  29. 14 Aug, 2024 2 commits
  30. 13 Aug, 2024 2 commits
  31. 12 Aug, 2024 2 commits