- 22 Nov, 2024 2 commits
-
-
OlivierDehaene authored
* chore: prepare 2.4.1 release
* fix tests
* fmt
-
Daniël de Kok authored
This fixes a bug in 2:4 Marlin: https://github.com/vllm-project/vllm/pull/10464
-
- 21 Nov, 2024 7 commits
-
-
OlivierDehaene authored
* feat: add payload limit
* update launcher
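A minimal sketch of what the payload limit means for clients, assuming a launcher flag named `--payload-limit` (per this commit) and that oversized request bodies are rejected with an HTTP 4xx (likely 413) before reaching the model:

```python
import requests

# Hedged sketch: assumes a TGI instance started with a payload size cap,
# e.g. `text-generation-launcher --payload-limit 2000000` (flag name per
# this commit; treat it as an assumption).
resp = requests.post(
    "http://localhost:8080/generate",
    json={"inputs": "x" * 3_000_000, "parameters": {"max_new_tokens": 1}},
)
print(resp.status_code)  # expect 4xx (likely 413 Payload Too Large) over the cap
```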
-
Hugo Larcher authored
* feat: Add automatic nightly benchmarks
* fix: Update runners group
* fix: add created_at field to results
* fix: Add variable results file location
-
Lucain authored
-
Daniël de Kok authored
-
drbh authored
-
OlivierDehaene authored
fix: incomplete generations with single-token generations and models that did not support chunking (#2770)
* Incomplete generation stream fix (#2754): entries.len() could be greater than batch.size in prefill, so entries need to be filtered as well.
* entries was wrongly extended for models that did not support chunking
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi <yi.a.wang@intel.com>
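A minimal sketch of the filtering idea behind this fix; the names are assumptions, not the actual router code. During prefill the in-flight entries map can hold more requests than the batch actually contains, so responses should only be streamed for requests still in the batch:

```python
# Hedged sketch (hypothetical names) of filtering entries down to the batch.
def filter_entries(entries: dict[int, object], batch_request_ids: set[int]) -> dict[int, object]:
    """Keep only entries whose request id is still part of the batch."""
    return {rid: e for rid, e in entries.items() if rid in batch_request_ids}

entries = {1: "req-1", 2: "req-2", 3: "req-3"}
print(filter_entries(entries, {1, 3}))  # {1: 'req-1', 3: 'req-3'}
```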
-
Daniël de Kok authored
-
- 20 Nov, 2024 5 commits
-
-
drbh authored
fix: set outlines version to 0.1.3 to avoid bug
-
Daniël de Kok authored
* nix: build and cache all devshells
* nix: add poetry to the impure shell. This shouldn't be used to manage dependencies in a Nix devshell, but can be handy to update `poetry.lock`.
* Fix Nix build, disable pure shell (covered by Nix tests)
-
Daniël de Kok authored
This change adds support for wNa16 int checkpoints with 2:4 sparsity using Marlin 2:4 kernels.
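A small illustration of what 2:4 ("two of four") structured sparsity means, which is the pattern the Marlin 2:4 kernels exploit; this is a hedged sketch of the concept, not the kernel itself:

```python
import numpy as np

# In every contiguous group of four weights, at most two are non-zero,
# so the kernel only stores the surviving pair plus small metadata indices.
w = np.random.randn(1, 8)
groups = w.reshape(-1, 4)
# Zero out the two smallest-magnitude weights in each group of four.
idx = np.argsort(np.abs(groups), axis=1)[:, :2]
np.put_along_axis(groups, idx, 0.0, axis=1)
print(groups.reshape(w.shape))  # exactly two non-zeros per group of four
```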
-
Daniël de Kok authored
-
Daniël de Kok authored
-
- 19 Nov, 2024 4 commits
-
-
drbh authored
-
drbh authored
* add OpenAI-like tool_choice for named choice
* add tests
* fix: run linter and bump api docs
* fix: consolidate changes and remove old tool type
* feat: improve, simplify and rename tool choice struct; add required support and refactor
* fix: simplify tool choice logic, improve tests, openapi and rust docs
* fix: refactor away prepare_chat_input and improve tool grammar apply control flow
* feat: update docs and add tool choice configuration section
* fix: simplify naming, tool choice default and improve test
* fix: adjust tool choice none logic, add test and small refactors
* fix: add missing snapshot file
* fix: adjust tool choice type in test
* fix: adjust default when json tool choice is
* fix: remove trailing space lint after rebase
* fix: remove mostly mocked unit test
Co-authored-by: Linus Bierhoff <linus.bierhoff@icloud.com>
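A hedged sketch of the OpenAI-style named `tool_choice` this commit adds: naming a specific function forces the model to call that tool. The endpoint, model name, and tool schema below are illustrative assumptions:

```python
import requests

payload = {
    "model": "tgi",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for illustration
            "description": "Look up the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    # Named choice, as in OpenAI's API; the commit also mentions
    # "required" support alongside the auto/none variants.
    "tool_choice": {"type": "function", "function": {"name": "get_weather"}},
}
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"].get("tool_calls"))
```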
-
Daniël de Kok authored
This version syncs with the vLLM kernels and brings some performance improvements.
-
Daniël de Kok authored
-
- 18 Nov, 2024 4 commits
-
-
drbh authored
* feat: support flash attention 2 in qwen2 vl vision blocks
* fix: calc max_seqlen once and small refactors
-
Daniël de Kok authored
* Add support for compressed-tensors w8a8 int checkpoints

  This change adds a loader for w8a8 int checkpoints. One large benefit of int8 support is that the corresponding cutlass matmul kernels also work on compute capability 7.5.

  Evaluation on neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8:

  | Tasks           | Version | Filter           | n-shot | Metric                  |   | Value  |   | Stderr |
  |-----------------|--------:|------------------|-------:|-------------------------|---|-------:|---|--------|
  | gsm8k_cot_llama |       3 | flexible-extract |      8 | exact_match             | ↑ | 0.8431 | ± | 0.0100 |
  |                 |         | strict-match     |      8 | exact_match             | ↑ | 0.8393 | ± | 0.0101 |
  | ifeval          |       4 | none             |      0 | inst_level_loose_acc    | ↑ | 0.8597 | ± | N/A    |
  |                 |         | none             |      0 | inst_level_strict_acc   | ↑ | 0.8201 | ± | N/A    |
  |                 |         | none             |      0 | prompt_level_loose_acc  | ↑ | 0.7967 | ± | 0.0173 |
  |                 |         | none             |      0 | prompt_level_strict_acc | ↑ | 0.7468 | ± | 0.0187 |

  Which is in the same ballpark as vLLM. As usual, lots of thanks to Neural Magic/vLLM for the kernels.

* Always use dynamic input quantization for w8a8 int: it's far less flaky and gives better output.
* Use marlin-kernels 0.3.5
* Fix a typo
* Small fixes
Co-authored-by: drbh <david.richard.holtz@gmail.com>
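A minimal sketch of the "dynamic input quantization" idea for w8a8 int: activations are quantized to int8 on the fly, with a scale recomputed from each input instead of fixed at calibration time. The function name and the symmetric per-row scheme are assumptions for illustration:

```python
import torch

def dynamic_quantize_per_row(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # Per-row scale derived from the actual input, not a calibration constant.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

x = torch.randn(4, 16)
q, scale = dynamic_quantize_per_row(x)
print((q.float() * scale - x).abs().max())  # small quantization error
```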
-
Wang, Yi authored
* add ipex moe implementation to support Mixtral and PhiMoe
* update to ipex xpu 2.5
* torch has xpu support in 2.5
* fix oneapi basekit version
* Apply suggestions from code review
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
-
drbh authored
-
- 17 Nov, 2024 1 commit
-
-
Daniël de Kok authored
* Remove vLLM dependency for CUDA

  This change adds `attention-kernels` as a dependency for paged attention and cache reshaping. With that, we don't use vLLM anywhere for CUDA.

  Tested run (since we don't have paged attention in CI):

  ```
  ❯ ATTENTION=paged python -m pytest integration-tests -k "llama and awq" --release
  [...]
  5 snapshots passed.
  ```

* Fix clippy warning
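A hedged sketch of the bookkeeping that paged attention implies (the kernels themselves live in `attention-kernels`): the KV cache is split into fixed-size blocks and each sequence keeps a block table mapping logical token positions to physical blocks, so memory is allocated block-by-block rather than as one contiguous slab. Names and block size are assumptions:

```python
BLOCK_SIZE = 16

def slot_for(block_table: list[int], position: int) -> int:
    """Translate a token position into a flat slot index in the cache."""
    block = block_table[position // BLOCK_SIZE]
    return block * BLOCK_SIZE + position % BLOCK_SIZE

# A sequence owning physical blocks 7 and 2 (illustrative values):
print(slot_for([7, 2], 5))   # 7*16 + 5 = 117
print(slot_for([7, 2], 20))  # 2*16 + 4 = 36
```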
-
- 15 Nov, 2024 7 commits
-
-
drbh authored
* feat: return streaming errors as an event formatted for openai's client
* fix: propagate completions error events to stream
* fix: improve stream api error format and add status code
* fix: improve streaming error to include error_type
* Revert "fix: improve streaming error to include error_type" (reverts commit 2b1a360b1511d94ea9a24e5432e498e67939506a)
* Reworked the implementation.
* Revert "Reworked the implementation." (reverts commit 7c3f29777f17411ae4ade57e2f88e73cde704ee5)
* Small lifting.
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
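A hedged sketch of consuming the stream this commit changes: errors now arrive as a data event shaped for OpenAI-style clients instead of abruptly closing the connection. The exact error payload fields are assumptions for illustration:

```python
import json
import requests

with requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"model": "tgi", "stream": True,
          "messages": [{"role": "user", "content": "Hi"}]},
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data:"):
            continue
        data = line[len(b"data:"):].strip()
        if data == b"[DONE]":
            break
        event = json.loads(data)
        if "error" in event:  # assumed error-event shape
            print("stream error:", event["error"])
            break
        print(event["choices"][0]["delta"].get("content", ""), end="")
```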
-
Nicolas Patry authored
* Upgrading our deps.
* fixup.
* Fixup.
-
Alex Weston authored
* Upgrade outlines to 0.1.1
* Update for new API
* Check if allowed tokens is None
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
jito authored
Signed-off-by: jitokim <pigberger70@gmail.com>
-
Billel Mokeddem authored
fix: change embeddings to embedding
Co-authored-by: Ubuntu <ubuntu@ip-172-31-28-135.us-west-2.compute.internal>
-
Billel Mokeddem authored
Co-authored-by: Ubuntu <ubuntu@ip-172-31-28-135.us-west-2.compute.internal>
-
Daniël de Kok authored
-
- 14 Nov, 2024 1 commit
-
-
Daniël de Kok authored
Updates from Triton 2.1.0 to 3.1.0 (among other things).
-
- 10 Nov, 2024 1 commit
-
-
Daniël de Kok authored
compressed-tensors is a safetensors extension for sparse, quantized tensors. The format is more powerful than earlier AWQ/GPTQ/FP8 quantization because:

- Different quantizer configurations can be used for different targets.
- The format can specify input/output quantizers in addition to weight quantizers.
- Configurable exclusions for quantization.

This change adds a dependency on the `compressed-tensors` package for its configuration parsing and layer matching functionality. The following types of quantization are supported in this PR:

- W8A16 and W4A16 INT using GPTQ-Marlin kernels.
- W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels.

Support for other quantization types will be added in subsequent PRs. A sketch of the configuration shape follows below.
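A hedged sketch of what a compressed-tensors quantization config looks like inside a checkpoint's config.json; field names follow the compressed-tensors format as commonly published, and the exact values are illustrative assumptions, not taken from this commit:

```python
quantization_config = {
    "quant_method": "compressed-tensors",
    "config_groups": {
        "group_0": {
            # Per-target configuration: apply these quantizers to Linear layers.
            "targets": ["Linear"],
            "weights": {"num_bits": 8, "type": "int",
                        "symmetric": True, "strategy": "channel"},
            # Input quantizer alongside the weight quantizer (w8a8).
            "input_activations": {"num_bits": 8, "type": "int",
                                  "symmetric": True, "dynamic": True},
        }
    },
    # Configurable exclusions: leave the output head unquantized.
    "ignore": ["lm_head"],
}
```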
-
- 07 Nov, 2024 1 commit
-
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
- 04 Nov, 2024 6 commits
-
-
Wang, Yi authored
Fix incorrect output of Qwen2-7B-Instruct-GPTQ-Int4 and Qwen2-7B-Instruct-AWQ. The ipex kernel provides functions like add_bias, so there is no need to add the bias again outside the kernel.
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Daniël de Kok authored
-
Nicolas Patry authored
-
Travis Addair authored
-
Nicolas Patry authored
-
drbh authored
-
- 02 Nov, 2024 1 commit
-
-
drbh authored
* fix: create position ids for text-only input
* fix: prefer repeat over expand to avoid a clone
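A small illustration of the repeat-vs-expand choice in this fix: `expand` returns a non-contiguous view that shares storage, so writing to it (or making it contiguous) later forces a clone anyway, while `repeat` materializes an independent contiguous tensor up front. A minimal sketch, not the model code itself:

```python
import torch

position_ids = torch.arange(8).unsqueeze(0)  # shape (1, 8)

expanded = position_ids.expand(4, -1)  # view, shared storage along dim 0
repeated = position_ids.repeat(4, 1)   # real copy, contiguous, safe to modify

print(expanded.is_contiguous(), repeated.is_contiguous())  # False True
repeated[0, 0] = 99  # fine; in-place writes to the expanded view would error
```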
-