- 20 Jan, 2025 1 commit
xuxzh1 authored
-
- 23 Dec, 2024 1 commit
xuxzh1 authored
-
- 09 Dec, 2024 1 commit
Nicolas Patry authored
* Attempt at cleverer auto batch_prefill values (some simplifications).
* Less flaky tests.
* Fixing typo insertion.
* Update launcher/src/main.rs
* Adding a small comment for the source of the calculation.
* Adding L40.
* Adding L40s.

Co-authored-by: Daniël de Kok <me@danieldk.eu>
-
- 06 Dec, 2024 3 commits
drbh authored
* feat: support loading gemma2 as vlm text model
* feat: add test for paligemma2
-
Nicolas Patry authored
-
Nicolas Patry authored
* Attempt at automatic max batch prefill.
* Taking into account the number of shards.
* Adding more cards.
* Adding A100 + H100.
* Adding a few more cards.
* Logprobs cost too much.
* H100: better name, and keep the factor of 2.
* Damn inflated sparse tflops.
* Typo in H100.
* Updated the flops calculation (checked with fvcore); a rough sketch of the arithmetic follows this entry.
* Chunking by default.
* Fix prefix caching for chat completion since we removed logprobs.
* More tests.
* Dropping all the prefill logprobs.
* Add a flag that enables users to get logprobs back.
* Repairing prompt token counting.
* Fixing a few tests.
* Remove some scaffolding.
* Attempting to reduce the issues (workarounds for now).
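For context, a rough sketch of the flops-based arithmetic behind such a heuristic. The 2-FLOPs-per-parameter rule of thumb, the TFLOPS figure, and the target latency are illustrative assumptions, not the launcher's exact formula.

```python
# Hedged sketch: derive a prefill token budget from per-card compute.
# The constants below are assumptions for illustration only.

def estimate_max_batch_prefill_tokens(
    num_params: float,       # model parameters, e.g. 8e9 for an 8B model
    dense_tflops: float,     # per-card dense (non-sparse) TFLOPS for the dtype
    num_shards: int = 1,     # tensor-parallel shards share the prefill work
    target_seconds: float = 1.0,
) -> int:
    flops_per_token = 2.0 * num_params          # ~2 FLOPs per parameter per token
    budget = dense_tflops * 1e12 * num_shards * target_seconds
    return int(budget // flops_per_token)

# Example: an 8B model on one card with ~60 dense TFLOPS.
print(estimate_max_batch_prefill_tokens(8e9, 60.0))  # -> 3750
```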
-
- 04 Dec, 2024 1 commit
drbh authored
-
- 03 Dec, 2024 2 commits
Nicolas Patry authored
* Saving some VRAM.
  - 8B on 4x L4 with attention=flashdecoding: before 4.28 GB left, after 4.32 GB left, so roughly 40 MB saved.
  - Effect not as visible with attention=flashinfer and n_shard=1; I suspect it's linked to the torch allocator.
* Adding an assertion.
-
Daniël de Kok authored
* Sync (most) server dependencies with Nix. Skipped most grpcio packages because of a protobuf version incompatibility with the opentelemetry packages.
* Add a primitive script to generate Poetry commands to sync with Nix. This is not fully automated, since getting the Nix versions may be unresolvable; however, it does take most of the work out of doing this manually.
* Upgrade eetq?
* Fmt.

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
- 02 Dec, 2024 1 commit
Dmitry Rogozhkin authored
Llama 3 has a list of values as `eos_token_id`: "['<|end_of_text|>', '<|eom_id|>', '<|eot_id|>']". This breaks the tokenizer, since it expects a single value. This commit uses `tokenizer.eos_token_id` instead in such a case. Fixes: #2440

Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
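A minimal sketch of that fallback; `config` and `tokenizer` stand in for the model config and Hugging Face tokenizer, and the helper name is hypothetical rather than the actual TGI code path.

```python
def resolve_eos_token_id(config, tokenizer):
    # Hypothetical helper: when the config exposes a list of stop ids
    # (as Llama 3 does), fall back to the tokenizer's single eos_token_id.
    eos = getattr(config, "eos_token_id", None)
    if isinstance(eos, (list, tuple)):
        return tokenizer.eos_token_id
    return eos if eos is not None else tokenizer.eos_token_id
```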
-
- 26 Nov, 2024 1 commit
Daniël de Kok authored
The compressed-tensors configuration can specify the configuration of the KV cache as well. Use an FP8 KV cache when the configuration tells us to do so (all other options and types are ignored for now).
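A hedged sketch of that decision; the `kv_cache_scheme` field name and its layout are assumptions about the checkpoint configuration, not a guaranteed schema.

```python
import torch

def kv_cache_dtype(quantization_config: dict, default=torch.float16):
    # Assumed config shape: {"kv_cache_scheme": {"type": "float", "num_bits": 8}, ...}
    scheme = (quantization_config or {}).get("kv_cache_scheme")
    if scheme and scheme.get("type") == "float" and scheme.get("num_bits") == 8:
        return torch.float8_e4m3fn  # the checkpoint asks for an FP8 KV cache
    return default                  # all other options/types are ignored for now
```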
-
- 25 Nov, 2024 1 commit
Daniël de Kok authored
* Move JSON grammar -> regex grammar conversion to the router

  This change moves the JSON grammar -> regex grammar conversion to the router by adding a dependency on the `outlines-core` Rust crate. In contrast to the Python implementation, the conversions are not LRU-cached, since they seem to be fast enough:

  simple schema   time: [5.8293 µs 5.8307 µs 5.8320 µs]
                  change: [-13.166% -12.884% -12.641%] (p = 0.00 < 0.05)
                  Performance has improved.

  complex schema  time: [14.875 µs 14.881 µs 14.887 µs]
                  change: [-2.1637% -1.9914% -1.7852%] (p = 0.00 < 0.05)
                  Performance has improved.

  Using the schemas from: https://github.com/dottxt-ai/outlines-core/blob/main/benchmarks/bench_json_schema.py
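To illustrate the idea of compiling a JSON schema into a regular expression that constrains generation, here is a toy, hand-simplified example; the regexes produced by `outlines-core` are far more elaborate.

```python
import re

# Toy schema: an object with one required enum field.
schema = {
    "type": "object",
    "properties": {"unit": {"enum": ["celsius", "fahrenheit"]}},
    "required": ["unit"],
}

# Hand-simplified regex standing in for the compiled grammar.
pattern = re.compile(r'\{\s*"unit"\s*:\s*"(celsius|fahrenheit)"\s*\}')

assert pattern.fullmatch('{"unit": "celsius"}')
assert pattern.fullmatch('{ "unit" : "fahrenheit" }')
assert not pattern.fullmatch('{"unit": "kelvin"}')
```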
-
- 21 Nov, 2024 1 commit
OlivierDehaene authored
* feat: add payload limit
* update launcher
-
- 20 Nov, 2024 1 commit
Daniël de Kok authored
This change adds support for wNa16 int checkpoints with 2:4 sparsity using Marlin 2:4 kernels.
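For reference, 2:4 (semi-structured) sparsity means at most two non-zero values in every contiguous group of four along the reduction dimension; a small illustrative check (not the loader's actual validation):

```python
import torch

def satisfies_2_4_sparsity(weight: torch.Tensor) -> bool:
    # At most 2 non-zeros per contiguous group of 4 along the last dimension.
    # Assumes the last dimension is a multiple of 4.
    groups = weight.reshape(*weight.shape[:-1], -1, 4)
    return bool((groups.ne(0).sum(dim=-1) <= 2).all())

w = torch.tensor([[1.0, 0.0, 0.0, 2.0, 0.0, 3.0, 4.0, 0.0]])
print(satisfies_2_4_sparsity(w))  # True
```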
-
- 19 Nov, 2024 2 commits
drbh authored
-
Daniël de Kok authored
-
- 18 Nov, 2024 4 commits
drbh authored
* feat: support flash attention 2 in qwen2 vl vision blocks
* fix: calc max_seqlen once and small refactors
-
Daniël de Kok authored
* Add support for compressed-tensors w8a8 int checkpoints

  This change adds a loader for w8a8 int checkpoints. One large benefit of int8 support is that the corresponding cutlass matmul kernels also work on compute capability 7.5.

  Evaluation on neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8:

  | Tasks           | Version | Filter           | n-shot | Metric                  |   | Value  |   | Stderr |
  |-----------------|--------:|------------------|-------:|-------------------------|---|-------:|---|--------|
  | gsm8k_cot_llama |       3 | flexible-extract |      8 | exact_match             | ↑ | 0.8431 | ± | 0.0100 |
  |                 |         | strict-match     |      8 | exact_match             | ↑ | 0.8393 | ± | 0.0101 |
  | ifeval          |       4 | none             |      0 | inst_level_loose_acc    | ↑ | 0.8597 | ± | N/A    |
  |                 |         | none             |      0 | inst_level_strict_acc   | ↑ | 0.8201 | ± | N/A    |
  |                 |         | none             |      0 | prompt_level_loose_acc  | ↑ | 0.7967 | ± | 0.0173 |
  |                 |         | none             |      0 | prompt_level_strict_acc | ↑ | 0.7468 | ± | 0.0187 |

  This is in the same ballpark as vLLM. As usual, lots of thanks to Neural Magic/vLLM for the kernels.

* Always use dynamic input quantization for w8a8 int. It's far less flaky and gives better output (see the sketch after this entry).
* Use marlin-kernels 0.3.5
* Fix a typo
* Small fixes

Co-authored-by: drbh <david.richard.holtz@gmail.com>
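A minimal sketch of per-token dynamic int8 activation quantization, the scheme preferred in the bullet above; the rounding and clamping details are assumptions, and the real path runs inside the cutlass/marlin kernels.

```python
import torch

def dynamic_per_token_int8(x: torch.Tensor):
    # Per-row (per-token) scales are computed from the live activations,
    # so no static input scale from the checkpoint is required.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale  # the int8 matmul result is rescaled by `scale` afterwards

x = torch.randn(4, 16)
q, s = dynamic_per_token_int8(x)
print((q.float() * s - x).abs().max())  # small quantization error
```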
-
Wang, Yi authored
* add ipex moe implementation to support Mixtral and PhiMoe
* update to ipex xpu 2.5
* torch has xpu support in 2.5
* fix oneapi basekit version
* Apply suggestions from code review

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
-
drbh authored
-
- 17 Nov, 2024 1 commit
Daniël de Kok authored
* Remove vLLM dependency for CUDA

  This change adds `attention-kernels` as a dependency for paged attention and cache reshaping. With that, we don't use vLLM anywhere for CUDA. Tested run (since we don't have paged attention in CI):

  ```
  ❯ ATTENTION=paged python -m pytest integration-tests -k "llama and awq" --release
  [...]
  5 snapshots passed.
  ```

* Fix clippy warning
-
- 15 Nov, 2024 4 commits
Nicolas Patry authored
* Upgrading our deps.
* fixup.
* Fixup.
-
Alex Weston authored
* Upgrade outlines to 0.1.1
* Update for new API
* Check if allowed tokens is None

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
Billel Mokeddem authored
fix: change embeddings to embedding

Co-authored-by: Ubuntu <ubuntu@ip-172-31-28-135.us-west-2.compute.internal>
-
Billel Mokeddem authored
Co-authored-by: Ubuntu <ubuntu@ip-172-31-28-135.us-west-2.compute.internal>
-
- 10 Nov, 2024 1 commit
Daniël de Kok authored
compressed-tensors is a safetensors extension for sparse, quantized tensors. The format is more powerful than earlier AWQ/GPTQ/FP8 quantization because:

- Different quantizer configurations can be used for different targets.
- The format can specify input/output quantizers in addition to weight quantizers.
- Exclusions from quantization are configurable.

This change adds a dependency on the `compressed-tensors` package for its configuration parsing and layer matching functionality. The following types of quantization are supported in this PR (an illustrative dispatch sketch follows this entry):

- W8A16 and W4A16 INT using GPTQ-Marlin kernels.
- W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels.

Support for other quantization types will be added in subsequent PRs.
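As an illustration only, a toy dispatch from a simplified scheme description to the kernel families named above; the parameter names are assumptions, not the compressed-tensors schema.

```python
def pick_kernels(weight_bits: int, weight_type: str, activation_bits=None) -> str:
    # Toy mapping; real layer matching is done by the compressed-tensors package.
    if weight_type == "int" and weight_bits in (4, 8) and activation_bits is None:
        return "GPTQ-Marlin"             # W4A16 / W8A16 INT
    if weight_type == "float" and weight_bits == 8:
        return "FP8-Marlin / cutlass"    # W8A8 / W8A16 FP
    raise ValueError("quantization scheme not supported in this PR")

print(pick_kernels(4, "int"))        # GPTQ-Marlin
print(pick_kernels(8, "float", 8))   # FP8-Marlin / cutlass
```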
-
- 04 Nov, 2024 4 commits
Wang, Yi authored
Fix incorrect output of Qwen2-7B-Instruct-GPTQ-Int4 and Qwen2-7B-Instruct-AWQ: the ipex kernel provides functions like add_bias, so there is no need to add the bias outside the kernel.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Nicolas Patry authored
-
Travis Addair authored
-
Nicolas Patry authored
-
- 02 Nov, 2024 1 commit
drbh authored
* fix: create position ids for text only input
* fix: prefer repeat over expand to avoid a clone (see the demonstration below)
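On the second bullet: `expand` only broadcasts strides and returns a non-contiguous view that typically needs a later clone, while `repeat` materializes the copy up front. A quick illustrative demonstration (not the model code):

```python
import torch

position_ids = torch.arange(6).unsqueeze(0)  # shape (1, seq_len)

expanded = position_ids.expand(4, -1)  # broadcast view, non-contiguous
repeated = position_ids.repeat(4, 1)   # materialized copy, contiguous

print(expanded.is_contiguous(), repeated.is_contiguous())  # False True
```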
-
- 01 Nov, 2024 1 commit
drbh authored
* feat: support multidimensional position ids on batch to enable cuda graphs on qwen2-vl
* fix: only check model type if config exists
* fix: adjust sharding and lm head logic
* fix qwen2 failure in intel cpu
* fix: return correct shape logits and add streaming test
* fix: remove unused import and refactor test

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
- 30 Oct, 2024 1 commit
drbh authored
* feat: add support for qwen2 vl model
* feat: fix token padding, enable warmup and process basic request
* fix: improve get_position_ids, add lift embed_tokens
* fix: remove get_cos_sin_hack dev function
* feat: add simple test chat with message and text
* fix: lint test
* fix: adjust positional embeddings for multi dimensional position ids
* fix: update docs and lint unused vars
* fix: include linted file
* fix: add norm after text output
* fix: format model file
* fix: adjust for ruff lints
* fix: remove unused rotate_half
* feat: refactors and calc num features
* fix: prefer position_ids passed from vlm causal lm and reset ids on batch
* fix: adjust get_position_ids if not available and add required args to signatures
* fix: adjust resize case for qwen2_vl warmup
* fix: avoid qwen2 vl specific paths with qwen2
-
- 28 Oct, 2024 3 commits
Nicolas Patry authored
-
Nicolas Patry authored
* We can have a tokenizer anywhere.
* Handling potential lack of offsets (python tokenizer).
* Remove redundancy.
* Fixing the tests.
* Flake.lock update?
* Fixing the GIL locking.
* Fixing mamba by using the transformers version.
* Adding the legacy handle.
* Elide lifetime.
* Lint.
* Deprecation message.
* Fixing bad rebase.
-
Nicolas Patry authored
* Choosing input/total tokens automatically based on available VRAM (a simplified sketch of the arithmetic follows this entry).
* Update doc.
* Remove generated files.
* Trying to fix non chunking targets.
* Attempt #2
* fix.
* QuantLinear is rocm compatible.
* Much simpler logic after the overhead.
* Updating logic + non flash.
* Revert doc text.
* Simple updates.
* Fix integration mt0 (transformers update).
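On the first bullet, the core of such a heuristic is KV-cache arithmetic; a simplified sketch follows, where the dtype size and the "free VRAM" figure are assumptions rather than the launcher's exact logic.

```python
def max_total_tokens(free_vram_bytes: int, num_layers: int,
                     num_kv_heads: int, head_dim: int,
                     dtype_bytes: int = 2) -> int:
    # Each token stores one key and one value vector per layer in the cache.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return free_vram_bytes // bytes_per_token

# Example: ~10 GiB free with Llama-3-8B-like shapes (32 layers, 8 KV heads, head dim 128).
print(max_total_tokens(10 * 1024**3, 32, 8, 128))  # -> 81920
```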
-
- 25 Oct, 2024 3 commits
OlivierDehaene authored
* feat: add triton kernels to decrease latency of large batches
* cast to int32
* fix kernel
* fix kernel
* disable triton on rocm
* fix speculation
* add slots filtering kernel
-
Daniël de Kok authored
* Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels

  Performance and accuracy of these kernels are on par (tested with Llama 70B and 405B). Removes a dependency and resolves some stability issues we have been seeing.

* Update test snapshots
-
Nicolas Patry authored
-
- 24 Oct, 2024 1 commit
Daniël de Kok authored
* Add support for FP8 KV cache scales

  Since FP8 only has limited dynamic range, we can scale keys/values before storing them into the cache (and unscale them in attention); a minimal sketch follows this entry. To avoid rescaling the cache as the absmax values change, good scales are usually determined per layer using calibration data and stored in the checkpoint. This change adds support for using key-value scales and loading them from checkpoints in the two most common formats:

  - Separate per-layer `k_scale` and `v_scale` scalars.
  - A per-layer `kv_scale` scalar (older format).

  Currently, scales are only used with a `float8_e4m3fn` cache.

  Besides adding support for key/value scales, the `fp8_quantize` function is also extended to support quantization with a kernel vendored from vLLM. This is slightly faster than the PyTorch implementation, but it also scales in FP32, potentially improving accuracy.

* Update FP8 KV cache test to use checkpoint with scales
* `can_scale`: check that the attention is flashinfer
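A minimal sketch of the scale/unscale step described above; the divide-by-scale convention and dtypes are assumptions, and the real implementation uses the kernel vendored from vLLM.

```python
import torch

def quantize_kv(key: torch.Tensor, k_scale: float) -> torch.Tensor:
    # Divide by the per-layer scale before casting so the values fit
    # float8_e4m3fn's limited dynamic range.
    return (key / k_scale).to(torch.float8_e4m3fn)

def dequantize_kv(key_fp8: torch.Tensor, k_scale: float) -> torch.Tensor:
    # Attention multiplies the same per-layer scale back in.
    return key_fp8.to(torch.float16) * k_scale

k = torch.randn(4, 128, dtype=torch.float16)
k_scale = (k.abs().max() / torch.finfo(torch.float8_e4m3fn).max).item()
print((dequantize_kv(quantize_kv(k, k_scale), k_scale) - k).abs().max())
```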
-