Commits · a7850008429c4c1c4a2ded7bbed4c1b12d22d287 · OpenDAS / text-generation-inference

10 Nov, 2024 1 commit

Add initial support for compressed-tensors checkpoints (#2732) · a7850008

Daniël de Kok authored Nov 10, 2024

compressed-tensors is a safetensors extension for sparse, quantized
tensors. The format is more powerful than earlier AWQ/GPTQ/FP8
quantization, because

- Different quantizer configurations can be used for different targets.
- The format can specify input/output quantizers in addition to weight
  quantizers.
- Configurable exclusions for quantization.
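
Per-target configurations and exclusions can be illustrated with a minimal Python sketch. It is not the `compressed-tensors` API: `QuantScheme`, `ConfigGroup`, `matches`, and `scheme_for` are hypothetical names, and treating a `re:` prefix as a regular expression on the layer name is an assumption made for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QuantScheme:
    """Hypothetical scheme description, not a compressed-tensors class."""
    weight_bits: int
    input_bits: Optional[int] = None  # None: inputs are left unquantized


@dataclass
class ConfigGroup:
    """Hypothetical group: one scheme applied to a set of targets."""
    targets: List[str]
    scheme: QuantScheme


def matches(pattern: str, name: str, module_type: str) -> bool:
    # Assumed convention: "re:" marks a regex on the layer name; any other
    # pattern must equal the layer name or the module type.
    if pattern.startswith("re:"):
        return re.match(pattern[3:], name) is not None
    return pattern == name or pattern == module_type


def scheme_for(
    name: str,
    module_type: str,
    groups: List[ConfigGroup],
    ignore: List[str],
) -> Optional[QuantScheme]:
    # Exclusions take priority over every group.
    if any(matches(p, name, module_type) for p in ignore):
        return None
    # The first group with a matching target decides the scheme.
    for group in groups:
        if any(matches(p, name, module_type) for p in group.targets):
            return group.scheme
    return None


groups = [
    ConfigGroup(["re:.*self_attn.*"], QuantScheme(weight_bits=8, input_bits=8)),
    ConfigGroup(["Linear"], QuantScheme(weight_bits=4)),
]
ignore = ["lm_head"]

attn = scheme_for("model.layers.0.self_attn.q_proj", "Linear", groups, ignore)
mlp = scheme_for("model.layers.0.mlp.up_proj", "Linear", groups, ignore)
head = scheme_for("lm_head", "Linear", groups, ignore)
```

In this sketch the attention projection gets 8-bit weights and inputs, the MLP projection gets 4-bit weights with unquantized inputs, and `lm_head` is excluded, so `head` is `None`.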

This change adds a dependency on the `compressed-tensors` package for
its configuration parsing and layer matching functionality.

The following types of quantization are supported in this PR:

- W8A16 and W4A16 INT using GPTQ-Marlin kernels.
- W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels.

Support for other quantization types will be added in subsequent PRs.

25 Oct, 2024 1 commit

Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels (#2688) · 0f346a32

Daniël de Kok authored Oct 25, 2024

* Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels

Performance and accuracy of these kernels are on par (tested with Llama
70B and 405B). Removes a dependency and resolves some stability issues
we have been seeing.

* Update test snapshots

27 Sep, 2024 1 commit

Improve support for GPUs with capability < 8 (#2575) · 5b6b74e2

Daniël de Kok authored Sep 27, 2024

* Improve support for GPUs with capability < 8

- For models that cannot use flashinfer, use flash-attn v1 + paged
  attention on GPUs with a compute capability older than 8.
- Disable prefix caching when using paged attention.
- When using flash-attn v1, pass the key/value, rather than the
  cache, since v1 cannot use block tables.
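
The selection rules above can be sketched as a small decision function. This is a hypothetical Python illustration, not the code from the commit: `AttentionPlan`, `plan_attention`, and the backend labels are invented names, and the two branches marked as assumptions are not described in the commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionPlan:
    backend: str          # hypothetical label for the chosen kernels
    prefix_caching: bool  # whether prefix caching stays enabled
    prefill_input: str    # "key/value" tensors or the paged "cache"


def plan_attention(capability_major: int, can_use_flashinfer: bool) -> AttentionPlan:
    if can_use_flashinfer:
        # Assumption: models that can use flashinfer keep it, with prefix
        # caching enabled. The commit message does not cover this case.
        return AttentionPlan("flashinfer", True, "cache")
    if capability_major < 8:
        # Case from the commit message: flash-attn v1 + paged attention.
        # Paged attention disables prefix caching, and v1 receives the
        # key/value tensors because it cannot use block tables.
        return AttentionPlan("flash-attn-v1+paged", False, "key/value")
    # Assumption: newer GPUs without flashinfer use paged attention, which
    # also disables prefix caching.
    return AttentionPlan("paged", False, "cache")


old_gpu = plan_attention(capability_major=7, can_use_flashinfer=False)
```

Here `old_gpu` describes a GPU with compute capability 7.x: prefix caching is off and prefill receives the key/value tensors directly.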

* nix: add flash-attn-v1 to the server environment

* Move disabling prefix caching into the block of exceptions

* Capability as `usize`s

17 Sep, 2024 1 commit

Move to moe-kernels package and switch to common MoE layer (#2511) · ce85efa9

Daniël de Kok authored Sep 17, 2024

* Move to moe-kernels package and switch to common MoE layer

This change introduces the new `moe-kernels` package:

- Add `moe-kernels` as a dependency.
- Introduce a `SparseMoELayer` module that can be used by MoE
  models.
- Port over Mixtral and Deepseek.
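
For readers unfamiliar with sparse mixture-of-experts layers, the following pure-Python sketch shows the general top-k routing technique such a layer implements. It does not reproduce the `SparseMoELayer` interface or the `moe-kernels` API; `sparse_moe_forward` and its parameters are hypothetical, and renormalizing the selected expert weights is an assumption that varies between models.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(logits: Sequence[float]) -> List[float]:
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_moe_forward(
    token: Vector,
    router_logits: Sequence[float],
    experts: Sequence[Expert],
    top_k: int,
) -> Vector:
    probs = softmax(router_logits)
    # Keep only the top_k experts with the highest routing probability.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Assumption: renormalize the selected probabilities so they sum to 1.
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(token)
    for i in chosen:
        weight = probs[i] / norm
        expert_out = experts[i](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out


# Three toy experts that scale the input by 1, 2, and 100.
experts = [
    lambda v: [1.0 * x for x in v],
    lambda v: [2.0 * x for x in v],
    lambda v: [100.0 * x for x in v],
]
result = sparse_moe_forward([1.0, 2.0], [0.0, 0.0, -50.0], experts, top_k=2)
```

With these router logits the first two experts are selected with equal weight and the third is never evaluated, so `result` is the average of the first two expert outputs.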

* Make `cargo check` pass

* Update runner

02 Sep, 2024 1 commit

nix: add punica-kernels (#2477) · de2cdeca

Daniël de Kok authored Sep 02, 2024

Enables LoRA support.

29 Aug, 2024 1 commit

nix: build Torch against MKL and various other improvements (#2469) · 4e821c00

Daniël de Kok authored Aug 29, 2024

Updates tgi-nix input:

- Move Torch closer to upstream by building against MKL.
- Remove compute capability 8.7 from Torch (Jetson).
- Sync nixpkgs compute capabilities with Torch (avoids
  compiling too many capabilities for MAGMA).
- Use nixpkgs configuration passed through by `tgi-nix`.

21 Aug, 2024 2 commits

nix: add awq-inference-engine as server dependency (#2442) · 358ceb67

Daniël de Kok authored Aug 21, 2024

Adding eetq to flake. (#2438) · 310778e0

Nicolas Patry authored Aug 21, 2024

20 Aug, 2024 1 commit

nix: add pure server to flake, add both pure and impure devshells (#2430) · f5f11b79

Daniël de Kok authored Aug 20, 2024

* nix: pure server and support both pure and impure devShells

* nix: remove unused poetry2nix input

It is not wired up and we now have a pure server.

* nix: add ipdb to impure devshell
