- 25 Oct, 2024 1 commit
Daniël de Kok authored
* Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels

  Performance and accuracy of these kernels are on par (tested with Llama 70B and 405B). Removes a dependency and resolves some stability issues we have been seeing.

* Update test snapshots
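For context, a w8a8 scaled matmul multiplies int8 activations against int8 weights with int32 accumulation, then rescales the result with per-token and per-channel scales. Below is a minimal PyTorch sketch of those semantics only; the fused marlin/fbgemm kernels do all of this in a single pass, and every name here is hypothetical rather than the actual kernel API:

```python
import torch

def w8a8_scaled_mm_reference(
    a_q: torch.Tensor,      # [M, K] int8 quantized activations
    b_q: torch.Tensor,      # [K, N] int8 quantized weights
    a_scale: torch.Tensor,  # [M, 1] per-token activation scales
    b_scale: torch.Tensor,  # [1, N] per-channel weight scales
) -> torch.Tensor:
    # Accumulate in int32 so the int8 products cannot overflow; a fused
    # kernel rescales in the same pass instead of materializing `acc`.
    acc = a_q.to(torch.int32) @ b_q.to(torch.int32)
    # Dequantize: each output element is scaled by its row (token) scale
    # and its column (channel) scale.
    return acc.to(torch.float32) * a_scale * b_scale
```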
- 27 Sep, 2024 1 commit
Daniël de Kok authored
* Improve support for GPUs with capability < 8

  - For models that cannot use flashinfer, use flash-attn v1 + paged attention when the compute capability is older than 8.
  - Disable prefix caching when using paged attention.
  - When using flash-attn v1, pass the key/value tensors rather than the cache, since v1 cannot use block tables.

* nix: add flash-attn-v1 to the server environment
* Move disabling prefix caching into the block of exceptions
* Capability as `usize`s
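The backend selection described above boils down to a capability check. A hedged sketch of that dispatch rule (the helper name and return values are hypothetical, not TGI's actual API):

```python
import torch

def select_attention_backend() -> tuple[str, bool]:
    """Return (backend, prefix_caching_enabled) for the current GPU."""
    major, _minor = torch.cuda.get_device_capability()
    if major < 8:
        # Pre-Ampere GPUs cannot use flashinfer: fall back to flash-attn
        # v1 for prefill plus paged attention for decode. flash-attn v1
        # cannot read block tables, so it is fed the raw key/value tensors
        # instead of the KV cache, and prefix caching is disabled.
        return "flash-attn-v1+paged-attention", False
    return "flashinfer", True
```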
- 17 Sep, 2024 1 commit
Daniël de Kok authored
* Move to moe-kernels package and switch to common MoE layer

  This change introduces the new `moe-kernels` package:

  - Add `moe-kernels` as a dependency.
  - Introduce a `SparseMoELayer` module that can be used by MoE models.
  - Port over Mixtral and Deepseek.

* Make `cargo check` pass
* Update runner
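To make the common layer concrete, here is a toy top-k sparse MoE forward pass in plain PyTorch. It only illustrates the routing semantics: the real `SparseMoELayer` dispatches to fused `moe-kernels` rather than a Python loop, and every name below is hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoESketch(nn.Module):
    """Toy top-k sparse MoE layer (reference semantics only)."""

    def __init__(self, hidden: int, n_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden, n_experts, bias=False)
        self.experts = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Linear(hidden, 4 * hidden),
                    nn.SiLU(),
                    nn.Linear(4 * hidden, hidden),
                )
                for _ in range(n_experts)
            ]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [tokens, hidden]; pick the top-k experts per token and
        # renormalize their router logits (as Mixtral does).
        logits = self.router(x)
        weights, idx = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # A fused kernel would group tokens per expert; the loop keeps the
        # reference readable.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k : k + 1] * expert(x[mask])
        return out
```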
- 02 Sep, 2024 1 commit
Daniël de Kok authored
Enables LoRA support.
- 29 Aug, 2024 1 commit
Daniël de Kok authored
Updates tgi-nix input:

- Move Torch closer to upstream by building against MKL.
- Remove compute capability 8.7 from Torch (Jetson).
- Sync nixpkgs compute capabilities with Torch (avoids compiling too many capabilities for MAGMA).
- Use nixpkgs configuration passed through by `tgi-nix`.
- 21 Aug, 2024 2 commits
Daniël de Kok authored
Nicolas Patry authored
- 20 Aug, 2024 1 commit
Daniël de Kok authored
* nix: pure server and support both pure and impure devShells

* nix: remove unused poetry2nix input

  It is not wired up, and we now have a pure server.

* nix: add ipdb to impure devShell