- 24 Jan, 2025 (1 commit)
-
xuxzh1 authored
-
- 20 Jan, 2025 (1 commit)
-
xuxzh1 authored
-
- 27 Dec, 2024 (1 commit)
-
xuxzh1 authored
-
- 24 Dec, 2024 (1 commit)
-
xuxzh1 authored
-
- 23 Dec, 2024 (1 commit)
-
xuxzh1 authored
-
- 09 Dec, 2024 (6 commits)
-
Nicolas Patry authored
-
Nicolas Patry authored
-
Nicolas Patry authored
-
Nicolas Patry authored
* New version.
* Link fixup.
* Update docs.
* Fixup.
-
Nicolas Patry authored
* V3 document.
* Updating asset.
-
Nicolas Patry authored
* Attempt for cleverer auto batch_prefill values (some simplifications).
* Less flaky tests.
* Fixing typo insertion.
* Update launcher/src/main.rs
* Adding small comment for source of calculation.
* Adding L40.
* Adding L40s.
Co-authored-by: Daniël de Kok <me@danieldk.eu>
-
- 06 Dec, 2024 (6 commits)
-
drbh authored
* feat: support loading gemma2 as vlm text model
* feat: add test for paligemma2
-
Nicolas Patry authored
-
Nicolas Patry authored
-
Nicolas Patry authored
* Attempt at automatic max batch prefill.
* Taking into account number of shards.
* Adding more cards.
* Adding A100 + H100
* Adding a few more cards.
* Logprobs cost too much.
* h100 better name, and keep factor of 2
* Damn inflated sparse tflops.
* Typo in h100.
* Updated the flops calculation (checked with fvcore).
* chunking by default.
* Fix prefix caching for chat completion since we removed logprobs.
* More tests.
* Dropping all the prefill logprobs.
* Add a flag that enables users to get logprobs back.
* Repairing prompt token counting.
* Fixing a few tests.
* Remove some scaffolding.
* Attempting to reduce the issues (workarounds for now).
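For illustration, a rough sketch of how such an automatic prefill budget could be derived from card flops, shard count, and model size. The card names, peak-TFLOPS figures, and latency/efficiency constants below are assumptions made for this sketch, not the values the launcher actually uses.

```python
# Assumed dense (non-sparse) half-precision peak TFLOPS per card; illustrative only.
PEAK_TFLOPS = {
    "nvidia-l4": 121,
    "nvidia-l40": 181,
    "nvidia-a100": 312,
    "nvidia-h100": 989,
}

def auto_max_batch_prefill_tokens(
    card: str,
    num_shards: int,
    model_params: float,          # parameter count, e.g. 8e9 for an 8B model
    target_seconds: float = 1.0,  # assumed latency budget for one prefill chunk
    efficiency: float = 0.5,      # assumed fraction of peak flops actually reached
) -> int:
    """Pick a prefill token budget so one chunk roughly fits the latency target."""
    flops_per_token = 2.0 * model_params  # forward pass costs ~2 * params FLOPs per token
    available = PEAK_TFLOPS[card] * 1e12 * num_shards * efficiency
    return int(available * target_seconds / flops_per_token)

# Example: an 8B model sharded over 4 L4s.
print(auto_max_batch_prefill_tokens("nvidia-l4", num_shards=4, model_params=8e9))
```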
-
OlivierDehaene authored
* feat: auto max_new_tokens
* update default
* Fixing the tests.
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
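A minimal sketch of the idea, assuming the default is simply "whatever room is left in the context window"; the field names are illustrative, not the router's actual ones.

```python
def resolve_max_new_tokens(requested: int | None, input_length: int, max_total_tokens: int) -> int:
    if requested is not None:
        return requested
    # Default: generate until the total token budget is exhausted.
    return max(1, max_total_tokens - input_length)

print(resolve_max_new_tokens(None, input_length=1200, max_total_tokens=4096))  # 2896
```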
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
- 04 Dec, 2024 (1 commit)
-
drbh authored
-
- 03 Dec, 2024 (2 commits)
-
Nicolas Patry authored
* Saving some VRAM.
  - 8B on 4xL4 attention=flashdecoding. Before 4.28GB left, after 4.32GB left, so 400MB saved.
  - Effect not as visible on attention=flashinfer and n_shard=1. I suspect it's linked to the torch allocator.
* Adding assertion.
-
Daniël de Kok authored
* Sync (most) server dependencies with Nix. Skipped most grpcio packages because of a protobuf version incompatibility with the opentelemetry packages.
* Add a primitive script to generate Poetry commands to sync with Nix. This is not fully automated, since getting the Nix versions may be unresolvable. However, it does take most of the work out of doing this manually.
* Upgrade eetq?
* Fmt.
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
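A hedged sketch of what such a helper could look like; the input file name and format are hypothetical, and obtaining the Nix-pinned versions remains the manual step the message mentions.

```python
import json

with open("nix-pins.json") as f:  # hypothetical {"package": "version"} export of the Nix pins
    pins: dict[str, str] = json.load(f)

for package, version in sorted(pins.items()):
    # Print the Poetry commands instead of running them, so they can be reviewed first.
    print(f"poetry add {package}=={version}")
```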
-
- 02 Dec, 2024 (4 commits)
-
Dmitry Rogozhkin authored
Llama 3 has a list of values as eos_token_id: "['<|end_of_text|>', '<|eom_id|>', '<|eot_id|>']". This breaks the tokenizer since it expects a single value. This commit uses tokenizer.eos_token_id instead in such a case. Fixes: #2440
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
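A sketch of the described fallback using the transformers API; the model id is only an example and the surrounding plumbing is simplified.

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative model id
config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

eos_token_id = config.eos_token_id
if isinstance(eos_token_id, (list, tuple)):
    # The config carries several stop ids; code that expects a single value
    # falls back to the tokenizer's canonical eos_token_id instead.
    eos_token_id = tokenizer.eos_token_id
```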
-
drbh authored
-
Torsten Raudssus authored
-
Nicolas Patry authored
-
- 28 Nov, 2024 (1 commit)
-
drbh authored
* feat: support continue_final_message param in chat request
* feat: add test for continue final message
* fix: bump openapi docs
* fix: remove continue_final_message chat request param
* fix: remove unneeded launcher args in continue test
* fix: bump test output
* fix: remove accidentally included guideline from rebase
* fix: remove guideline tests
* fix: adjust continuation tests expected text
* fix: replace expected output for continue test
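For illustration, a hedged sketch of a continuation request against the OpenAI-compatible /v1/chat/completions endpoint. Since the list above both adds and later removes an explicit continue_final_message parameter, the exact request shape here is an assumption: the sketch simply ends the conversation with a partial assistant message to be continued.

```python
import requests

payload = {
    "model": "tgi",
    "messages": [
        {"role": "user", "content": "Write a haiku about rust."},
        # Final message comes from the assistant and is intentionally unfinished.
        {"role": "assistant", "content": "Metal quietly"},
    ],
    "max_tokens": 64,
}

resp = requests.post("http://localhost:3000/v1/chat/completions", json=payload, timeout=60)
print(resp.json()["choices"][0]["message"]["content"])
```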
-
- 26 Nov, 2024 (3 commits)
-
jp authored
Fix: typo in model loading code
-
Wang, Yi authored
upgrade ipex cpu to fix coredump in tiiuae/falcon-7b-instruct (pageattention)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Daniël de Kok authored
The compressed-tensors configuration can also specify the KV cache configuration. Use an FP8 KV cache when the configuration tells us to do so (all other options and types are ignored for now).
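A sketch of that decision; the key names assumed for the compressed-tensors quantization config (kv_cache_scheme, type, num_bits) are illustrative, and only the rule matters: use an fp8 KV cache when the checkpoint asks for an 8-bit float scheme, otherwise keep the default dtype.

```python
import torch

def kv_cache_dtype(quantization_config: dict | None, default: torch.dtype) -> torch.dtype:
    scheme = (quantization_config or {}).get("kv_cache_scheme")  # assumed key name
    if scheme and scheme.get("type") == "float" and scheme.get("num_bits") == 8:
        return torch.float8_e4m3fn
    # All other options and types are ignored for now, as the commit notes.
    return default

print(kv_cache_dtype({"kv_cache_scheme": {"type": "float", "num_bits": 8}}, torch.bfloat16))
```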
-
- 25 Nov, 2024 (2 commits)
-
Daniël de Kok authored
* Move JSON grammar -> regex grammar conversion to the router
This change moves the JSON grammar -> regex grammar conversion to the router by adding a dependency on the `outlines-core` Rust crate. In contrast to the Python implementation, the conversions are not LRU-cached since they seem to be fast enough:
simple schema: time [5.8293 µs 5.8307 µs 5.8320 µs], change [-13.166% -12.884% -12.641%] (p = 0.00 < 0.05), performance has improved.
complex schema: time [14.875 µs 14.881 µs 14.887 µs], change [-2.1637% -1.9914% -1.7852%] (p = 0.00 < 0.05), performance has improved.
Using the schemas from: https://github.com/dottxt-ai/outlines-core/blob/main/benchmarks/bench_json_schema.py
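For illustration, a toy version of the conversion (JSON Schema to a regular expression). The real work is done by outlines-core and covers far more of the spec; this sketch only handles flat objects with required primitive properties, and the primitive patterns are assumptions.

```python
import re

# Assumed primitive patterns for the toy converter.
PRIMITIVES = {
    "integer": r"-?\d+",
    "number": r"-?\d+(\.\d+)?",
    "boolean": r"(true|false)",
    "string": r'"[^"]*"',
}

def schema_to_regex(schema: dict) -> str:
    """Build a regex matching a flat JSON object described by `schema`."""
    parts = [
        rf'"{re.escape(name)}"\s*:\s*{PRIMITIVES[prop["type"]]}'
        for name, prop in schema["properties"].items()
    ]
    return r"\{\s*" + r"\s*,\s*".join(parts) + r"\s*\}"

schema = {"properties": {"name": {"type": "string"}, "age": {"type": "integer"}}}
pattern = re.compile(schema_to_regex(schema))
print(bool(pattern.fullmatch('{"name": "Ada", "age": 36}')))  # True
```
-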
drbh authored
* feat: concat the adapter id to the model id in chat response
* fix: updated to include only the adapter id in chat response
-
- 22 Nov, 2024 (2 commits)
-
OlivierDehaene authored
* chore: prepare 2.4.1 release
* fix tests
* fmt
-
Daniël de Kok authored
This fixes a bug in 2:4 Marlin: https://github.com/vllm-project/vllm/pull/10464
-
- 21 Nov, 2024 (7 commits)
-
OlivierDehaene authored
* feat: add payload limit
* update launcher
-
Hugo Larcher authored
* feat: Add automatic nightly benchmarks
* fix: Update runners group
* fix: add created_at field to results
* fix: Add variable results file location
-
Lucain authored
-
Daniël de Kok authored
-
drbh authored
-
OlivierDehaene authored
fix: incomplete generations w/ single-token generations and models that did not support chunking (#2770)
* Incomplete generation stream fix (#2754): entries.len() could be > batch.size in prefill, so need to filter as well.
* entries was wrongly extended for models that did not support chunking
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi <yi.a.wang@intel.com>
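A minimal, language-agnostic sketch (written in Python) of the filtering described in the first bullet above: keep only the in-flight entries whose request ids actually appear in the returned batch, instead of assuming the two sets always match.

```python
def filter_entries(entries: dict[int, object], batch_request_ids: set[int]) -> dict[int, object]:
    # entries may hold more requests than the batch covers, so intersect explicitly.
    return {rid: entry for rid, entry in entries.items() if rid in batch_request_ids}

entries = {1: "req-1", 2: "req-2", 3: "req-3"}
print(filter_entries(entries, {1, 3}))  # {1: 'req-1', 3: 'req-3'}
```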
-
Daniël de Kok authored
-
- 20 Nov, 2024 (1 commit)
-
drbh authored
fix: set outlines version to 0.1.3 to avoid bug
-