- 06 Dec, 2024 2 commits
-
-
OlivierDehaene authored
* feat: auto max_new_tokens
* update default
* Fixing the tests.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
- 04 Dec, 2024 1 commit
-
-
drbh authored
-
- 03 Dec, 2024 2 commits
-
-
Nicolas Patry authored
* Saving some VRAM.
  - 8B on 4xL4, attention=flashdecoding. Before 4.28GB left, after 4.32GB left, so 400MB saved.
  - Effect not as visible on attention=flashinfer and n_shard=1. I suspect it's linked to the torch allocator.
* Adding assertion.
-
Daniël de Kok authored
* Sync (most) server dependencies with Nix
  Skipped most grpcio packages because of a protobuf version incompatibility with the opentelemetry packages.
* Add a primitive script to generate Poetry commands to sync with Nix
  This is not fully automated, since getting the Nix versions may be unresolvable. However, it does take most of the work out of doing this manually.
* Upgrade eetq?
* Fmt.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
- 02 Dec, 2024 4 commits
-
-
Dmitry Rogozhkin authored
Llama 3 has a list of values as eos_token_id: "['<|end_of_text|>', '<|eom_id|>', '<|eot_id|>']". This breaks the tokenizer, which expects a single value, so this commit uses tokenizer.eos_token_id instead in that case.
Fixes: #2440
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
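A minimal sketch of the fallback described above, assuming a transformers-style config and tokenizer; the helper name is hypothetical and this is not the actual TGI code:

```python
# Hedged sketch of the fallback described above; helper name and structure
# are illustrative, not the actual TGI implementation.
from transformers import AutoConfig, AutoTokenizer

def resolve_eos_token_id(model_id: str) -> int:
    config = AutoConfig.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    eos = config.eos_token_id
    if isinstance(eos, (list, tuple)):
        # Llama 3 style configs expose several end tokens; code paths that
        # expect a single id fall back to the tokenizer's eos_token_id.
        return tokenizer.eos_token_id
    return eos
```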
-
drbh authored
-
Torsten Raudssus authored
-
Nicolas Patry authored
-
- 28 Nov, 2024 1 commit
-
-
drbh authored
* feat: support continue_final_message param in chat request
* feat: add test for continue final message
* fix: bump openapi docs
* fix: remove continue_final_message chat request param
* fix: remove unneeded launcher args in continue test
* fix: bump test output
* fix: remove accidentally included guideline from rebase
* fix: remove guideline tests
* fix: adjust continuation tests expected text
* fix: replace expected output for continue test
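For illustration, a hedged sketch of the kind of request this feature targets: the conversation ends with a partial assistant message that should be continued rather than answered as a new turn. The endpoint URL, the placeholder model name, and whether continuation is triggered by an explicit flag or inferred from the trailing assistant role are all assumptions, not the final API.

```python
# Hedged illustration of continuing a final assistant message; the exact
# trigger (explicit flag vs. inferred from the trailing assistant role)
# and the local endpoint URL are assumptions.
import requests

payload = {
    "model": "tgi",
    "messages": [
        {"role": "user", "content": "List three prime numbers."},
        # The final message is an unfinished assistant turn to be continued,
        # not answered as a fresh user prompt.
        {"role": "assistant", "content": "The first three primes are 2, 3, and"},
    ],
    "max_tokens": 16,
}

resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=60)
print(resp.json()["choices"][0]["message"]["content"])
```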
-
- 26 Nov, 2024 3 commits
-
-
jp authored
Fix typo in model loading code
-
Wang, Yi authored
upgrade ipex cpu to fix coredump in tiiuae/falcon-7b-instruct (pageattention)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Daniël de Kok authored
The compressed-tensors configuration can also specify the KV cache configuration. Use an FP8 KV cache when the configuration tells us to do so (all other options and types are ignored for now).
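A minimal sketch of that decision, assuming the checkpoint's config.json carries a compressed-tensors quantization_config with a kv_cache_scheme entry; the field names are assumptions, not the TGI loader code:

```python
# Hedged sketch: decide whether to enable an FP8 KV cache from a
# compressed-tensors style config. Field names are assumptions.
import json

def wants_fp8_kv_cache(config_path: str) -> bool:
    with open(config_path) as f:
        config = json.load(f)
    quant = config.get("quantization_config") or {}
    kv_scheme = quant.get("kv_cache_scheme") or {}
    # Only an 8-bit float KV cache scheme is honoured; all other options
    # and types are ignored for now, as the commit notes.
    return kv_scheme.get("type") == "float" and kv_scheme.get("num_bits") == 8
```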
-
- 25 Nov, 2024 2 commits
-
-
Daniël de Kok authored
* Move JSON grammar -> regex grammar conversion to the router
  This change moves the JSON grammar -> regex grammar conversion to the router by adding a dependency on the `outlines-core` Rust crate. In contrast to the Python implementation, the conversions are not LRU-cached since they seem to be fast enough:

  simple schema    time: [5.8293 µs 5.8307 µs 5.8320 µs]
                   change: [-13.166% -12.884% -12.641%] (p = 0.00 < 0.05)
                   Performance has improved.
  complex schema   time: [14.875 µs 14.881 µs 14.887 µs]
                   change: [-2.1637% -1.9914% -1.7852%] (p = 0.00 < 0.05)
                   Performance has improved.

  Using the schemas from: https://github.com/dottxt-ai/outlines-core/blob/main/benchmarks/bench_json_schema.py
-
drbh authored
* feat: concat the adapter id to the model id in chat response * fix: updated to include only the adapter id in chat response
-
- 22 Nov, 2024 2 commits
-
-
OlivierDehaene authored
* chore: prepare 2.4.1 release * fix tests * fmt
-
Daniël de Kok authored
This fixes a bug in 2:4 Marlin: https://github.com/vllm-project/vllm/pull/10464
-
- 21 Nov, 2024 7 commits
-
-
OlivierDehaene authored
* feat: add payload limit * update launcher
-
Hugo Larcher authored
* feat: Add automatic nightly benchmarks
* fix: Update runners group
* fix: add created_at field to results
* fix: Add variable results file location
-
Lucain authored
-
Daniël de Kok authored
-
drbh authored
-
OlivierDehaene authored
fix: incomplete generations w/ single-token generations and models that did not support chunking (#2770)
* Incomplete generation stream fix (#2754): entries.len() could be > batch.size in prefill, so need to filter as well.
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* entries was wrongly extended for models that did not support chunking
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi <yi.a.wang@intel.com>
-
Daniël de Kok authored
-
- 20 Nov, 2024 5 commits
-
-
drbh authored
fix: set outlines version to 0.1.3 to avoid bug
-
Daniël de Kok authored
* nix: build and cache all devshells
* nix: add poetry to the impure shell
  This shouldn't be used to manage dependencies in a Nix devshell, but can be handy to update `poetry.lock`.
* Fix Nix build, disable pure shell (covered by Nix tests)
-
Daniël de Kok authored
This change adds support for wNa16 int checkpoints with 2:4 sparsity using Marlin 2:4 kernels.
-
Daniël de Kok authored
-
Daniël de Kok authored
-
- 19 Nov, 2024 4 commits
-
-
drbh authored
-
drbh authored
* add OpenAI like tool_choice for named choice
* add tests
* fix: run linter and bump api docs
* fix: consolidate changes and remove old tool type
* feat: improve, simplify and rename tool choice struct; add required support and refactor
* fix: simplify tool choice logic, improve tests, openapi and rust docs
* fix: refactor away prepare_chat_input and improve tool grammar apply control flow
* feat: update docs and add tool choice configuration section
* fix: simplify naming, tool choice default and improve test
* fix: adjust tool choice none logic, add test and small refactors
* fix: add missing snapshot file
* fix: adjust tool choice type in test
* fix: adjust default when json tool choice is
* fix: remove trailing space lint after rebase
* fix: remove mostly mocked unit test
---------
Co-authored-by: Linus Bierhoff <linus.bierhoff@icloud.com>
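Since the named choice follows OpenAI's tool_choice shape, a request pinning the model to one tool might look like the sketch below; the endpoint URL, placeholder model name, and the get_weather tool are illustrative assumptions:

```python
# Hedged example of an OpenAI-style named tool_choice against a local
# TGI chat endpoint; the URL and the tool itself are made up for illustration.
import requests

payload = {
    "model": "tgi",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    # Named choice: force the model to call this specific tool.
    "tool_choice": {"type": "function", "function": {"name": "get_weather"}},
}

resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=60)
print(resp.json()["choices"][0]["message"])
```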
-
Daniël de Kok authored
This version syncs with the vLLM kernels and brings some performance improvements.
-
Daniël de Kok authored
-
- 18 Nov, 2024 4 commits
-
-
drbh authored
* feat: support flash attention 2 in qwen2 vl vision blocks * fix: calc max_seqlen once and small refactors
-
Daniël de Kok authored
* Add support for compressed-tensors w8a8 int checkpoints
  This change adds a loader for w8a8 int checkpoints. One large benefit of int8 support is that the corresponding cutlass matmul kernels also work on compute capability 7.5.
  Evaluation on neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8:

  | Tasks          |Version| Filter          |n-shot| Metric                 |   |Value |   |Stderr|
  |----------------|------:|-----------------|-----:|------------------------|---|-----:|---|------|
  |gsm8k_cot_llama |      3|flexible-extract |     8|exact_match             |↑  |0.8431|±  |0.0100|
  |                |       |strict-match     |     8|exact_match             |↑  |0.8393|±  |0.0101|
  |ifeval          |      4|none             |     0|inst_level_loose_acc    |↑  |0.8597|±  |   N/A|
  |                |       |none             |     0|inst_level_strict_acc   |↑  |0.8201|±  |   N/A|
  |                |       |none             |     0|prompt_level_loose_acc  |↑  |0.7967|±  |0.0173|
  |                |       |none             |     0|prompt_level_strict_acc |↑  |0.7468|±  |0.0187|

  Which is in the same ballpark as vLLM. As usual, lots of thanks to Neural Magic/vLLM for the kernels.
* Always use dynamic input quantization for w8a8 int: it's far less flaky and gives better output.
* Use marlin-kernels 0.3.5
* Fix a typo
  Co-authored-by: drbh <david.richard.holtz@gmail.com>
* Small fixes
---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
-
Wang, Yi authored
* add ipex moe implementation to support Mixtral and PhiMoe
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* update to ipex xpu 2.5
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* torch has xpu support in 2.5
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix oneapi basekit version
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Apply suggestions from code review
  Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
-
drbh authored
-
- 17 Nov, 2024 1 commit
-
-
Daniël de Kok authored
* Remove vLLM dependency for CUDA
  This change adds `attention-kernels` as a dependency for paged attention and cache reshaping. With that, we don't use vLLM anywhere for CUDA.
  Tested run (since we don't have paged attention in CI):
  ```
  ❯ ATTENTION=paged python -m pytest integration-tests -k "llama and awq" --release
  [...]
  5 snapshots passed.
  ```
* Fix clippy warning
-
- 15 Nov, 2024 2 commits
-
-
drbh authored
* feat: return streaming errors as an event formatted for openai's client
* fix: propagate completions error events to stream
* fix: improve stream api error format and add status code
* fix: improve streaming error to include error_type
* Revert "fix: improve streaming error to include error_type"
  This reverts commit 2b1a360b1511d94ea9a24e5432e498e67939506a.
* Reworked the implementation.
* Revert "Reworked the implementation."
  This reverts commit 7c3f29777f17411ae4ade57e2f88e73cde704ee5.
* Small lifting.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
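With errors now surfaced as events on the stream instead of a dropped connection, a client can watch for them while iterating the SSE lines. The sketch below is a hedged illustration; the endpoint URL and the exact shape of the error payload (an "error" key per event) are assumptions:

```python
# Hedged sketch of consuming a streaming completion and surfacing error
# events; the endpoint URL and the "error" payload key are assumptions.
import json
import requests

payload = {
    "model": "tgi",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "stream": True,
}

with requests.post("http://localhost:8080/v1/chat/completions", json=payload, stream=True, timeout=60) as resp:
    for raw in resp.iter_lines():
        if not raw or not raw.startswith(b"data:"):
            continue
        data = raw[len(b"data:"):].strip()
        if data == b"[DONE]":
            break
        event = json.loads(data)
        if "error" in event:
            # Error events are formatted so OpenAI-style clients can parse them.
            print("stream error:", event["error"])
            break
        print(event["choices"][0]["delta"].get("content", ""), end="")
```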
-
Nicolas Patry authored
* Upgrading our deps. * fixup. * Fixup.
-