Commits · 2003d8be0c02a59cecafc035090f8bcdd64b85d9 · OpenDAS / text-generation-inference

03 Dec, 2024 1 commit

Sync (most) server dependencies with Nix (#2782) · 2003d8be

Daniël de Kok authored Dec 03, 2024



* Sync (most) server dependencies with Nix

Skipped most grpcio packages, because of protobuf version
incompatibility with the opentelemetry packages.

* Add a primitive script to generate Poetry commands to sync with Nix

This is not fully automated, since the Nix versions may not always be
resolvable. However, it does take most of the work out of doing
this manually.
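
A minimal sketch of what such a generator could look like; it is not the script added in this commit. It assumes the Nix and Poetry versions have already been collected into two dictionaries. The names `poetry_sync_commands`, `nix_versions`, and `poetry_versions`, as well as the version numbers, are hypothetical.

```python
def poetry_sync_commands(nix_versions, poetry_versions, skip=()):
    """Return `poetry add` commands that pin Poetry packages to the Nix versions."""
    commands = []
    for name in sorted(poetry_versions):
        if name in skip:
            continue  # deliberately left out of the sync (e.g. known incompatibilities)
        nix_version = nix_versions.get(name)
        if nix_version is None:
            continue  # version could not be resolved from Nix: handle manually
        if nix_version != poetry_versions[name]:
            commands.append(f"poetry add {name}@{nix_version}")
    return commands


# Hypothetical inputs, for illustration only.
nix_versions = {"numpy": "1.26.4", "grpcio": "1.66.1", "safetensors": "0.4.5"}
poetry_versions = {
    "numpy": "1.26.3",
    "grpcio": "1.62.2",
    "safetensors": "0.4.5",
    "loguru": "0.6.0",
}

for command in poetry_sync_commands(nix_versions, poetry_versions, skip={"grpcio"}):
    print(command)
```

Packages that are skipped or cannot be resolved produce no command, which mirrors the "not fully automated" caveat above: those still need a manual decision.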

* Upgrade eetq ?

* Fmt.

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

2003d8be

19 Nov, 2024 1 commit
- Update to moe-kernels 0.7.0 (#2720) · 2007a947
  Daniël de Kok authored Nov 19, 2024
```
This version syncs with the vLLM kernels and brings some performance
improvements.
```
  2007a947
18 Nov, 2024 1 commit

Add support for compressed-tensors w8a8 int checkpoints (#2745) · 3c9df21f

Daniël de Kok authored Nov 18, 2024



* Add support for compressed-tensors w8a8 int checkpoints

This change adds a loader for w8a8 int checkpoints. One large benefit of
int8 support is that the corresponding cutlass matmul kernels also work on
compute capability 7.5.

Evaluation on neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8:

|     Tasks     |Version|     Filter     |n-shot|        Metric         |   |Value |   |Stderr|
|---------------|------:|----------------|-----:|-----------------------|---|-----:|---|------|
|gsm8k_cot_llama|      3|flexible-extract|     8|exact_match            |↑  |0.8431|±  |0.0100|
|               |       |strict-match    |     8|exact_match            |↑  |0.8393|±  |0.0101|
|ifeval         |      4|none            |     0|inst_level_loose_acc   |↑  |0.8597|±  |   N/A|
|               |       |none            |     0|inst_level_strict_acc  |↑  |0.8201|±  |   N/A|
|               |       |none            |     0|prompt_level_loose_acc |↑  |0.7967|±  |0.0173|
|               |       |none            |     0|prompt_level_strict_acc|↑  |0.7468|±  |0.0187|

This is in the same ballpark as vLLM.

As usual, lots of thanks to Neural Magic/vLLM for the kernels.

* Always use dynamic input quantization for w8a8 int

It's far less flaky and gives better output.
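
To illustrate what dynamic input quantization means here, the following is a simplified pure-Python sketch, not the kernel used by TGI. It assumes symmetric int8 quantization with one scale per activation row, computed from that row at run time instead of being read from the checkpoint. The function names are hypothetical.

```python
def quantize_dynamic_int8(row):
    """Quantize one activation row to int8 using a scale derived from the row itself."""
    absmax = max(abs(x) for x in row)
    # Map the largest magnitude in this row onto the int8 limit (127).
    scale = absmax / 127.0 if absmax > 0 else 1.0
    quantized = [max(-127, min(127, round(x / scale))) for x in row]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from int8 values and their scale."""
    return [q * scale for q in quantized]


row = [0.02, -1.5, 3.0, 0.75]
quantized, scale = quantize_dynamic_int8(row)
restored = dequantize(quantized, scale)
```

Because the scale follows the actual input, no row can exceed the int8 range, whereas a static scale fixed ahead of time may clip inputs that are larger than those it was derived from.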

* Use marlin-kernels 0.3.5

* Fix a typo
Co-authored-by: drbh <david.richard.holtz@gmail.com>

* Small fixes

---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>

3c9df21f

17 Nov, 2024 1 commit

Remove vLLM dependency for CUDA (#2751) · 52e48739

Daniël de Kok authored Nov 17, 2024

* Remove vLLM dependency for CUDA

This change adds `attention-kernels` as a dependency for paged
attention and cache reshaping. With that, we don't use vLLM
anywhere for CUDA.

Tested run (since we don't have paged attention in CI):

```
❯ ATTENTION=paged python -m pytest integration-tests -k "llama and awq" --release
[...]
5 snapshots passed.
```

* Fix clippy warning

52e48739

14 Nov, 2024 1 commit
- nix: update nixpkgs (#2746) · ca4f46dd
  Daniël de Kok authored Nov 14, 2024
```
Updates from Triton 2.1.0 to 3.1.0 (among other things).
```
  ca4f46dd
10 Nov, 2024 1 commit

Add initial support for compressed-tensors checkpoints (#2732) · a7850008

Daniël de Kok authored Nov 10, 2024

compressed-tensors is a safetensors extension for sparse, quantized
tensors. The format is more powerful than earlier AWQ/GPTQ/FP8
quantization, because

- Different quantizer configurations can be used for different targets.
- The format can specify input/output quantizers in addition to weight
  quantizers.
- Configurable exclusions for quantization.
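
The three points above can be sketched with a small layer-matching routine. The configuration layout, the `re:` prefix convention, and the names `config`, `_matches`, and `quantizers_for` are assumptions made for this illustration; they are not the actual `compressed-tensors` schema or API.

```python
import re

# Hypothetical, simplified configuration; not the real compressed-tensors schema.
config = {
    "ignore": ["lm_head"],
    "groups": [
        {
            "targets": ["re:.*self_attn\\..*_proj"],
            "weights": "int4",
            "input_activations": None,
        },
        {
            "targets": ["re:.*mlp\\..*_proj"],
            "weights": "fp8",
            "input_activations": "fp8",
        },
    ],
}


def _matches(pattern, layer_name):
    """Match either an exact name or, with a `re:` prefix, a regular expression."""
    if pattern.startswith("re:"):
        return re.fullmatch(pattern[3:], layer_name) is not None
    return pattern == layer_name


def quantizers_for(layer_name, config):
    """Return the quantizer group for a layer, or None if the layer is not quantized."""
    if any(_matches(p, layer_name) for p in config["ignore"]):
        return None  # explicitly excluded from quantization
    for group in config["groups"]:
        if any(_matches(p, layer_name) for p in group["targets"]):
            return group
    return None  # no group targets this layer


attn = quantizers_for("model.layers.0.self_attn.q_proj", config)
mlp = quantizers_for("model.layers.0.mlp.down_proj", config)
head = quantizers_for("lm_head", config)
```
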

This change adds a dependency on the `compressed-tensors` package for
its configuration parsing and layer matching functionality.

The following types of quantization are supported in this PR:

- W8A16 and W4A16 INT using GPTQ-Marlin kernels.
- W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels.

Support for other quantization types will be added in subsequent PRs.

a7850008

04 Nov, 2024 1 commit
- nix: move to tgi-nix `main` (#2718) · 5eedb2ec
  Daniël de Kok authored Nov 04, 2024
  
  5eedb2ec
25 Oct, 2024 1 commit

Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels (#2688) · 0f346a32

Daniël de Kok authored Oct 25, 2024

* Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels

Performance and accuracy of these kernels are on par (tested with Llama
70B and 405B). Removes a dependency and resolves some stability issues
we have been seeing.

* Update test snapshots

0f346a32

24 Oct, 2024 1 commit

Add support for FP8 KV cache scales (#2628) · eab07f74

Daniël de Kok authored Oct 24, 2024

* Add support for FP8 KV cache scales

Since FP8 only has limited dynamic range, we can scale keys/values
before storing them into the cache (and unscale them in attention). To
avoid rescaling the cache as the absmax values change, good scales are
usually determined per layer using calibration data and stored
in the checkpoint.
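
The effect of scaling can be shown with a crude pure-Python emulation. This is a sketch under stated assumptions, not TGI's implementation: `fake_fp8_e4m3` only approximates `float8_e4m3fn` (it saturates at ±448, keeps four significant bits, and ignores subnormals), and the scale is derived from the example data as a stand-in for a calibrated per-layer scale.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value of float8_e4m3fn


def fake_fp8_e4m3(x):
    """Crude float8_e4m3fn emulation: saturate, then keep 4 significant bits."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    mantissa, exponent = math.frexp(x)  # x == mantissa * 2**exponent
    mantissa = round(mantissa * 16) / 16  # 1 implicit + 3 explicit mantissa bits
    return math.ldexp(mantissa, exponent)


def store_kv(values, scale):
    """Divide by the scale before storing values in the (emulated) FP8 cache."""
    return [fake_fp8_e4m3(v / scale) for v in values]


def load_kv(stored, scale):
    """Multiply by the scale again when the values are read in attention."""
    return [s * scale for s in stored]


keys = [1000.0, -3.0, 0.5]

# Without a scale, the large key saturates at the FP8 maximum.
unscaled = [fake_fp8_e4m3(k) for k in keys]

# Stand-in for a calibrated scale: map the largest magnitude onto the FP8 maximum.
scale = max(abs(k) for k in keys) / FP8_E4M3_MAX
restored = load_kv(store_kv(keys, scale), scale)
```

In this sketch the large key survives the round trip once it is scaled, at the cost of coarser resolution for the small values, which is why a well-chosen scale matters.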

This change adds support for using key-value scales and loading them
from checkpoints in the two most common formats:

- Separate per-layer `k_scale` and `v_scale` scalars.
- Per-layer `kv_scale` scalar (older format).
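
A minimal sketch of how the two formats could be resolved for one layer. The function name `load_kv_scales`, the tensor-name prefix, and the fallback to a scale of 1.0 when no scale is stored are assumptions of this example, not TGI's loader.

```python
def load_kv_scales(tensors, prefix):
    """Return (k_scale, v_scale) for one layer from a mapping of tensor names."""
    k_name = f"{prefix}.k_scale"
    v_name = f"{prefix}.v_scale"
    kv_name = f"{prefix}.kv_scale"

    # Newer format: separate scalars for keys and values.
    if k_name in tensors and v_name in tensors:
        return float(tensors[k_name]), float(tensors[v_name])

    # Older format: one scalar shared by keys and values.
    if kv_name in tensors:
        shared = float(tensors[kv_name])
        return shared, shared

    # Assumed fallback: no scaling when the checkpoint has no scales.
    return 1.0, 1.0


# Hypothetical checkpoint contents.
new_style = {"layers.0.attn.k_scale": 0.5, "layers.0.attn.v_scale": 0.25}
old_style = {"layers.0.attn.kv_scale": 0.125}

print(load_kv_scales(new_style, "layers.0.attn"))
print(load_kv_scales(old_style, "layers.0.attn"))
```
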

Currently, scales are only used with a `float8_e4m3fn` cache.

Besides adding support for key/value scales, the `fp8_quantize` function
is also extended to support quantization with a kernel vendored from
vLLM. This is slightly faster than the PyTorch implementation and also
scales in FP32, potentially improving accuracy.

* Update FP8 KV cache test to use checkpoint with scales

* `can_scale`: check that the attention is flashinfer

eab07f74

22 Oct, 2024 1 commit

Add `impureWithCuda` dev shell (#2677) · 9c9ef37c

Daniël de Kok authored Oct 22, 2024

* Add `impureWithCuda` dev shell

This shell is handy when developing some kernels jointly with TGI - it
adds nvcc and a bunch of commonly-used CUDA libraries to the environment.

We don't add this to the normal impure shell to keep the development
environment as clean as possible (avoid accidental dependencies, etc.).

* Add cuDNN

9c9ef37c

08 Oct, 2024 2 commits
- nix: move back to the tgi-nix main branch (#2620) · 6db3bcb7
  Daniël de Kok authored Oct 08, 2024
  
  6db3bcb7
- Add support for fused MoE Marlin for AWQ (#2616) · 64142489
  Daniël de Kok authored Oct 08, 2024
```
* Add support for fused MoE Marlin for AWQ

This uses the updated MoE Marlin kernels from vLLM.

* Add integration test for AWQ MoE
```
  64142489
04 Oct, 2024 1 commit
- nix: example of local package overrides during development (#2607) · 68103079
  Daniël de Kok authored Oct 04, 2024
  
  68103079
02 Oct, 2024 1 commit

Mllama flash version (#2585) · d18ed5cf

Nicolas Patry authored Oct 02, 2024

* Working loading state.

* Preprocessing.

* Working state ? (Broke idefics1 temporarily).

* Cleaner condition.

* Fix idefics.

* Updating config, removing TODO

* Mllama

* Upgrade transformers 4.45

* Flashing mllama.

* Starting to get there.

* Working state.

* Integration tests for mllama (cutting to 10 tokens because there seems
to be instability after that, meaning the size of the batch matters).

* Updating model link.

* Earlier assert.

* Fix vlm ?

* remove log.

* Force ignore all images but last.

* Default dtype bfloat16.

* Update integration test after switch to bf16.

* Remove dead code.

* Removed dead code.

* Upgrade the flake to latest transformers/tokenizers

* Move to hf tgi-nix

* Upgrade to 0.5.0

d18ed5cf

01 Oct, 2024 1 commit

nix: experimental support for building a Docker container (#2470) · 584b4d7a

Daniël de Kok authored Oct 01, 2024



* nix: experimental support for building a Docker image

Run using something like:

```
docker run \
  --device nvidia.com/gpu=all \
  -it --rm -p 8080:80 \
  -v $PWD/data:/data \
  -v $PWD/tmp:/tmp \
  tgi-docker:latest \
  --model-id <model_id>
```

* Example of building the Docker image using Nix inside Docker

* Stream to make the builder image smaller

This avoids storing a Docker image tarball in the image. Instead,
stream the layers while doing `docker run`.

* Don't spam journalctl on Linux

* Other dockerfile.

---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

584b4d7a

30 Sep, 2024 3 commits

MoE Marlin: support `desc_act` for `groupsize != -1` (#2590) · 1c84a30f
Daniël de Kok authored Sep 30, 2024
```
This change uses the updated Marlin MoE kernel from vLLM to support
MoE with activation sorting and groups.
```
1c84a30f
Move flake back to tgi-nix `main` (#2586) · d1f257ac
Daniël de Kok authored Sep 30, 2024

d1f257ac

Add support for GPTQ-quantized MoE models using MoE Marlin (#2557) · 90a1d04a

Daniël de Kok authored Sep 30, 2024

This change adds support for MoE models that use GPTQ quantization.
Currently only models with the following properties are supported:

- No `desc_act` with tensor parallelism, unless `group_size=-1`.
- No asymmetric quantization.
- No AWQ.
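
The constraints above can be read as a simple eligibility check. This is a sketch only: the function and parameter names are hypothetical and do not correspond to TGI's code, and it reflects the restrictions as of this commit.

```python
def can_use_moe_marlin(quant_method, sym, desc_act, group_size, tensor_parallel):
    """Check the restrictions listed above for GPTQ MoE models."""
    if quant_method != "gptq":
        return False  # AWQ (and anything else) is not covered by this change
    if not sym:
        return False  # asymmetric quantization is not supported
    if desc_act and tensor_parallel and group_size != -1:
        return False  # desc_act with tensor parallelism needs group_size == -1
    return True


print(can_use_moe_marlin("gptq", True, False, 128, True))
print(can_use_moe_marlin("gptq", True, True, 128, True))
```
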

90a1d04a

27 Sep, 2024 1 commit

Improve support for GPUs with capability < 8 (#2575) · 5b6b74e2

Daniël de Kok authored Sep 27, 2024

* Improve support for GPUs with capability < 8

- For models that cannot use flashinfer, use flash-attn v1 + paged
  attention on GPUs with a compute capability lower than 8.
- Disable prefix caching when using paged attention.
- When using flash-attn v1, pass the key/value, rather than the
  cache, since v1 cannot use block tables.

* nix: add flash-attn-v1 to the server environment

* Move disabling prefix caching into the block of exceptions

* Capability as `usize`s

5b6b74e2

19 Sep, 2024 2 commits

doc: clarify that `--quantize` is not needed for pre-quantized models (#2536) · abd24dd3
Daniël de Kok authored Sep 19, 2024

abd24dd3

Stream options. (#2533) · f512021e

Nicolas Patry authored Sep 19, 2024

* Stream options.

* Fetch stuff from nix integration test for easier testing.

* Adding the assert.

* Only send the usage when asked for.

* Update the docs.

* Impure test because we need network.

* develop.

* Optional usage.

* Fixes.

* Workflow

f512021e

17 Sep, 2024 1 commit
- nix: pure Rust check/fmt/clippy/test (#2525) · 71e42686
  Daniël de Kok authored Sep 17, 2024
```
Runs the tests in a Nix build sandbox.
```
  71e42686
12 Sep, 2024 2 commits

Add nix test. (#2513) · d95c670a

Nicolas Patry authored Sep 12, 2024

* Add nix test.

* Modifying yourself means you need to rerun.

* Fixing the test + adding click (needed for pre-commit hooks).

* Try this.

* Our runner + pure test (not written)

* Remove server.

* Root user.

* Different user ?

* Add the actual test target.

* Forgot this modification.

* Add a formatter.

* Add the secrets.

* Fixed the auth token ?

* Adding the other tests.

* Missing pre-commit.

* Test requires cargo for cargo fmt.

* Update it a bit.

* Up.

* Attempting to use a cache location for the models.

* Ignore the cache for now.

d95c670a

nix: support Python tokenizer conversion in the router (#2515) · 94304649

Daniël de Kok authored Sep 12, 2024

Ideally we wouldn't have the router wrapper that this change adds,
but when I give PyO3 a Python interpreter with packages, it ends
up linking libpython from the Python interpreter rather than the
constructed environment and cannot pick up the Python modules as
a result.

94304649

06 Sep, 2024 1 commit
- nix: add pyright/ruff for proper LSP in the impure devshell (#2496) · 0424e27f
  Daniël de Kok authored Sep 06, 2024
```
We need this to ensure that pyright/ruff are part of the same
interpreter/venv.
```
  0424e27f
02 Sep, 2024 1 commit
- nix: improve impure devshell (#2478) · e4ab8554
  Daniël de Kok authored Sep 02, 2024
```
- Add some test dependencies.
- Install server in venv.
- Install Python client in venv.
```
  e4ab8554
29 Aug, 2024 1 commit

nix: build Torch against MKL and various other improvements (#2469) · 4e821c00

Daniël de Kok authored Aug 29, 2024

Updates tgi-nix input:

- Move Torch closer to upstream by building against MKL.
- Remove compute capability 8.7 from Torch (Jetson).
- Sync nixpkgs compute capabilities with Torch (avoids
  compiling too many capabilities for MAGMA).
- Use nixpkgs configuration passed through by `tgi-nix`.

4e821c00

23 Aug, 2024 1 commit

nix: add default package (#2453) · f3c5d7d9

Daniël de Kok authored Aug 23, 2024

The default package wraps the launcher and puts the server/router in the
path.

As a result, TGI can be started using something like:

```
nix run .# -- \
  --model-id hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --port 8080
```

f3c5d7d9

21 Aug, 2024 1 commit
- nix: add `text-generation-benchmark` to pure devshell (#2431) · 94744150
  Daniël de Kok authored Aug 21, 2024
  94744150
20 Aug, 2024 2 commits

nix: add pure server to flake, add both pure and impure devshells (#2430) · f5f11b79

Daniël de Kok authored Aug 20, 2024

* nix: pure server and support both pure and impure devShells

* nix: remove unused poetry2nix input

It is not wired up and we now have a pure server.

* nix: add ipdb to impure devshell

f5f11b79

Prefix caching (#2402) · b70ae096

Nicolas Patry authored Aug 20, 2024



* Prefix caching WIP

* Fixing prefix attention.

* Fixing flashinfer import.

* Fixing black.

* Fixing medusa (still wrong outputs, but functional).

* Just medusa values now.

* Fixing medusa without prefix caching.

* Fixing prefix caching.

* Medusa requires reshaping.

* Removing the logs.

* Remove router.nix

* Fixup:

- Remove logs
- Disable VLMs (they do not work)
- Disable prefix caching when user wants prefill logprobs.

* Update flake.lock

---------
Co-authored-by: Daniël de Kok <me@danieldk.eu>

b70ae096

19 Aug, 2024 1 commit
- nix: update to CUDA 12.4 (#2429) · 38773453
  Daniël de Kok authored Aug 19, 2024
```
* Update to CUDA 12.4

* poetry2nix: follow tgi-nix nixpkgs
```
  38773453
16 Aug, 2024 1 commit

nix: try to reduce the number of Rust rebuilds (#2424) · 1411bfb9

Daniël de Kok authored Aug 16, 2024

Try to reduce the number of router/launcher rebuilds by filtering
sources. In this way, recompiles should only be triggered by changes
in Cargo or Rust files.
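
The actual filtering is done in Nix; the idea can be illustrated in Python as a predicate over source paths. The function name `is_rust_build_input` and the exact set of files kept are assumptions of this sketch, and the real filter may include other inputs.

```python
from pathlib import PurePosixPath


def is_rust_build_input(path):
    """Keep only files whose changes should trigger a router/launcher rebuild."""
    p = PurePosixPath(path)
    return p.suffix == ".rs" or p.name in {"Cargo.toml", "Cargo.lock"}


files = [
    "router/src/main.rs",
    "Cargo.lock",
    "README.md",
    "server/pyproject.toml",
    "launcher/Cargo.toml",
]

# Only these paths would become part of the build input.
rebuild_inputs = [f for f in files if is_rust_build_input(f)]
print(rebuild_inputs)
```

Since files such as `README.md` never enter the build input, editing them leaves the input unchanged and the previous build result can be reused.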

1411bfb9

15 Aug, 2024 1 commit
- nix: build router incrementally (#2422) · 9aaa12e7
  Daniël de Kok authored Aug 15, 2024
  
  9aaa12e7
14 Aug, 2024 2 commits
- Upgrading exl2. (#2415) · f3b5c694
  Nicolas Patry authored Aug 14, 2024
```
* Upgrading exl2.

* Fixing the other pathways.

* Fix idefics.
```
  f3b5c694
- nix: partial incremental build of the router (#2416) · c5fff92b
  Daniël de Kok authored Aug 14, 2024
```
This is less incremental than crate2nix, but does build all dependencies
separately, so avoids full rebuilds.
```
  c5fff92b
13 Aug, 2024 2 commits
- Adding more kernels to flake. (#2411) · cd9b15d1
  Nicolas Patry authored Aug 13, 2024
  
  cd9b15d1
- nix: incremental build of the launcher (#2410) · 6f4bb4f2
  Daniël de Kok authored Aug 13, 2024
  
  6f4bb4f2
12 Aug, 2024 2 commits
- Updating the flake. (#2404) · 19ea85f8
  Nicolas Patry authored Aug 12, 2024
  
  19ea85f8
- Adding launcher to build. (#2397) · 730fa00e
  Nicolas Patry authored Aug 12, 2024
  
  730fa00e