- 21 Nov, 2024 1 commit
-
-
OlivierDehaene authored
* feat: add payload limit * update launcher
-
- 10 Nov, 2024 1 commit
-
-
Daniël de Kok authored
compressed-tensors is a safetensors extension for sparse, quantized tensors. The format is more powerful than earlier AWQ/GPTQ/FP8 quantization, because - Different quantizer configurations can be used for different targets. - The format can specify input/output quantizers in addition to weight quantizers. - Configurable exclusions for quantization. This change adds a dependency on the `compressed-tensors` package for its configuration parsing and layer matching functionality. The following types of quantization are supported in this PR: - W8A16 and W4A16 INT using GPTQ-Marlin kernels. - W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels. Support for other quantization types will be added in subsequent PRs.
-
- 28 Oct, 2024 1 commit
-
-
Nicolas Patry authored
* Choosing input/total tokens automatically based on available VRAM? * Update doc. * Remove generated files. * Trying to fix non chunking targets. * Attempt #2 * fix. * QuantLinear is rocm compatible. * Much simpler logic after the overhead. * Updating logic + non flash. * Revert doc text. * Simple updates. * Fix integration mt0 (transformers update).
-
- 25 Oct, 2024 1 commit
-
-
OlivierDehaene authored
-
- 23 Oct, 2024 1 commit
-
-
OlivierDehaene authored
* feat: allow any supported payload on /invocations * update openAPI * update doc
-
- 17 Oct, 2024 1 commit
-
-
Daniël de Kok authored
* Support `e4m3fn` KV cache * Make check more obvious
-
- 04 Oct, 2024 1 commit
-
-
Daniël de Kok authored
* Add basic FP8 KV cache support This change adds rudimentary FP8 KV cache support. The support is enabled by passing `--kv-cache-dtype fp8_e5m2` to the launcher. Doing so uses this type for the KV cache. However support is still limited: * Only the `fp8_e5m2` type is supported. * The KV cache layout is the same as `float16`/`bfloat16` (HND). * The FP8 KV cache is only supported for FlashInfer. * Loading of scales is not yet supported. * Fix Cargo.toml
-
- 19 Sep, 2024 1 commit
-
-
Daniël de Kok authored
-
- 16 Aug, 2024 1 commit
-
-
Hugo Larcher authored
* doc: Add metrics documentation and add a 'Reference' section * doc: Add API reference * doc: Refactor API reference * fix: Message API link * Bad rebase * Moving the docs. --------- Co-authored-by:Nicolas Patry <patry.nicolas@protonmail.com>
-