- 09 Dec, 2024 1 commit
-
-
Nicolas Patry authored
* Attempt for cleverer auto batch_prefill values (some simplifications). * Less flaky tests. * Fixing typo insertion. * Update launcher/src/main.rs Co-authored-by:
Daniël de Kok <me@danieldk.eu> * Adding small comment for source of calculation. * Adding L40. * Adding L40s. --------- Co-authored-by:
Daniël de Kok <me@danieldk.eu>
-
- 06 Dec, 2024 2 commits
-
-
drbh authored
* feat: support loading gemma2 as vlm text model * feat: add test for paligemma2
-
Nicolas Patry authored
* Attempt at automatic max batch prefill. * Taking into account number of shards. * Adding more cards. * Adding A100 + H100 * Adding a few more cards. * Logprobs cost too much. * h100 better name, and keep factor of 2 * Damn inflated sparse tflops. * Typo in h100. * Updated the flops calculation (checked with fvcore). * chunking by default. * Fix prefix caching for chat completion since we removed logprobs. * More tests. * Dropping all the prefill logprobs. * Add a flag that enables users to get logprobs back. * Repairing prompt token counting. * Fixing a few tests. * Remove some scaffolding. * Attempting to reduces the issues (workarounds for now).
-
- 28 Nov, 2024 1 commit
-
-
drbh authored
* feat: support continue_final_message param in chat request * feat: add test for continue final message * fix: bump openapi docs * fix: remove continue_final_message chat request param * fix: remove unneeded launcher args in continue test * fix: bump test output * fix: remove accidentally included guideline from rebase * fix: remove guideline tests * fix: adjust continuation tests expected text * fix: replace expected output for continue test
-
- 25 Nov, 2024 1 commit
-
-
Daniël de Kok authored
* Move JSON grammar -> regex grammar conversion to the router This change moves the JSON grammar -> regex grammar conversion to the router by adding a dependency on the `outlines-core` Rust crate. In contrast to the Python implementation, the conversions are not LRU-cached since they seem to be fast enough: simple schema time: [5.8293 µs 5.8307 µs 5.8320 µs] change: [-13.166% -12.884% -12.641%] (p = 0.00 < 0.05) Performance has improved. complex schema time: [14.875 µs 14.881 µs 14.887 µs] change: [-2.1637% -1.9914% -1.7852%] (p = 0.00 < 0.05) Performance has improved. Using the schemas from: https://github.com/dottxt-ai/outlines-core/blob/main/benchmarks/bench_json_schema.py
-
- 22 Nov, 2024 1 commit
-
-
OlivierDehaene authored
* chore: prepare 2.4.1 release * fix tests * fmt
-
- 21 Nov, 2024 1 commit
-
-
drbh authored
-
- 20 Nov, 2024 1 commit
-
-
Daniël de Kok authored
This change adds support for wNa16 int checkpoints with 2:4 sparsity using Marlin 2:4 kernels.
-
- 19 Nov, 2024 1 commit
-
-
drbh authored
* add OpenAI like tool_choice for named choice * add tests * fix: run linter and bump api docs * fix: consolidate changes and remove old tool type * feat: improve, simplify and rename tool choice struct add required support and refactor * fix: simplify tool choice logic, improve tests, openapi and rust docs * fix: refactor away prepare_chat_input and improve tool grammar apply control flow * feat: update docs and add tool choice configuration section * fix: simplify naming, tool choice default and improve test * fix: adjust tool choice none logic, add test and small refactors * fix: add missing snapshot file * fix: adjust tool choice type in test * fix: adjust default when json tool choice is * fix: remove trailing space lint after rebase * fix: remove mostly mocked unit test --------- Co-authored-by:Linus Bierhoff <linus.bierhoff@icloud.com>
-
- 18 Nov, 2024 1 commit
-
-
Daniël de Kok authored
* Add support for compressed-tensors w8a8 int checkpoints This change adds a loader for w8a8 int checkpoints. One large benefit of int8 support is that the corresponding cutlass matmul kernels also work on compute capability 7.5. Evaluation on neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8: | Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr| |---------------|------:|----------------|-----:|-----------------------|---|-----:|---|------| |gsm8k_cot_llama| 3|flexible-extract| 8|exact_match |↑ |0.8431|± |0.0100| | | |strict-match | 8|exact_match |↑ |0.8393|± |0.0101| |ifeval | 4|none | 0|inst_level_loose_acc |↑ |0.8597|± | N/A| | | |none | 0|inst_level_strict_acc |↑ |0.8201|± | N/A| | | |none | 0|prompt_level_loose_acc |↑ |0.7967|± |0.0173| | | |none | 0|prompt_level_strict_acc|↑ |0.7468|± |0.0187| Which is the same ballpark as vLLM. As usual, lots of thanks to Neural Magic/vLLM for the kernels. * Always use dynamic input quantization for w8a8 int It's far less flaky and gives better output. * Use marlin-kernels 0.3.5 * Fix a typo Co-authored-by:
drbh <david.richard.holtz@gmail.com> * Small fixes --------- Co-authored-by:
drbh <david.richard.holtz@gmail.com>
-
- 10 Nov, 2024 1 commit
-
-
Daniël de Kok authored
compressed-tensors is a safetensors extension for sparse, quantized tensors. The format is more powerful than earlier AWQ/GPTQ/FP8 quantization, because - Different quantizer configurations can be used for different targets. - The format can specify input/output quantizers in addition to weight quantizers. - Configurable exclusions for quantization. This change adds a dependency on the `compressed-tensors` package for its configuration parsing and layer matching functionality. The following types of quantization are supported in this PR: - W8A16 and W4A16 INT using GPTQ-Marlin kernels. - W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels. Support for other quantization types will be added in subsequent PRs.
-
- 01 Nov, 2024 1 commit
-
-
drbh authored
* feat: support multidimensional position ids on batch to enable cuda graphs on qwen2-vl * fix: only check model type if config exists * fix: adjust sharding and lm head logic * fix qwen2 failure in intel cpu Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> * fix: return correct shape logits and add streaming test * fix: remove unused import and refactor test --------- Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com>
-
- 30 Oct, 2024 1 commit
-
-
drbh authored
* feat: add support for qwen2 vl model * feat: fix token padding, enable warmup and process basic request * fix: improve get_position_ids, add lift embed_tokens * fix: remove get_cos_sin_hack dev function * feat: add simple test chat with meesage and text * fix: lint test * fix: adjust positional embeddings for multi dimensional position ids * fix: update docs and lint unused vars * fix: include linted file * fix: add norm after text output * fix: format model file * fix: adjust for ruff lints * fix: remove unused rotate_half * feat: refactors and calc num features * fix: prefer position_ids passed from vlm causal lm and reset ids on batch * fix: adjust get_position_ids if not available and add required args to signatures * fix: adjust resize case for qwen2_vl warmup * fix: avoid qwen2 vl specific paths with qwen2
-
- 28 Oct, 2024 4 commits
-
-
Nicolas Patry authored
* Monkey patching as a desperate measure. * New snapshot ?
-
Nicolas Patry authored
* More timeout on docker start ? * Latest upgrade.
-
Nicolas Patry authored
* We can have a tokenizer anywhere. * Handling potential lack of offsets (python tokenizer) * Remove redundancy. * Fixing the tests. * Flake.lock update ? * Fixing the GIL locking. * Fixing mamba by using the transformers version. * Adding the legacy handle. * Ellide lifetime. * Lint. * Deprecation message. * Fixing bad rebase.
-
Nicolas Patry authored
-
- 26 Oct, 2024 1 commit
-
-
Nicolas Patry authored
* Avoiding timeout for bloom tests. * Skip the test let's see if it's always the first tests that fails. * Fail early. * Pulling ? * No early exit.
-
- 25 Oct, 2024 3 commits
-
-
OlivierDehaene authored
-
Daniël de Kok authored
* Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels Performance and accuracy of these kernels are on par (tested with Llama 70B and 405B). Removes a dependency and resolves some stability issues we have been seeing. * Update test snapshots
-
Nicolas Patry authored
-
- 24 Oct, 2024 2 commits
-
-
Daniël de Kok authored
* Add support for FP8 KV cache scales Since FP8 only has limited dynamic range, we can scale keys/values before storing them into the cache (and unscale them in attention). To avoid rescaling the cache as the absmax values change, good scales are usually determined per layer using calibration calibration data and stored in the checkpoint. This change adds support for for using key-value scales and loading them from checkpoints in the two most common formats: - Separate per-layer `k_scale` and `v_scale` scalars. - Per-layer `kv_scale` scalar (older format). Currently, scales are only used with an `float8_e4m3fn` cache. Besides adding support for key/value scales, the `fp8_quantize` function is also extended to support quantization with a kernel vendored from vLLM. This is slightly faster than the PyTorch implementation, but also scales in FP32, potentially improving accuracy. * Update FP8 KV cache test to use checkpoint with scales * `can_scale`: check that the attention is flashinfer
-
Daniël de Kok authored
PR #2682 also fixed in issue in Phi MoE, but it changes the test outputs a bit. Fix this.
-
- 21 Oct, 2024 1 commit
-
-
Daniël de Kok authored
Update the Mixtral GPTQ test to use a model with `desc_act=true` and `group_size!=-1` to ensure that we are checking activation sorting/non-full K (with tensor parallelism). The `desc_act=false` case is already checked by the Mixtral AWQ test.
-
- 18 Oct, 2024 1 commit
-
-
Nicolas Patry authored
* add gptq and awq int4 support in intel platform Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> * fix ci failure Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> * set kv cache dtype Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> * refine the code according to the review command Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> * Simplifying conditionals + reverting integration tests values. * Unused import * Fix redundant import. * Revert change after rebase. * Upgrading the tests (TP>1 fix changes to use different kernels.) * Update server/text_generation_server/layers/gptq/__init__.py --------- Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> Co-authored-by:
Wang, Yi A <yi.a.wang@intel.com>
-
- 16 Oct, 2024 1 commit
-
-
OlivierDehaene authored
* wip * rollback * refactor to use prefix/postfix namming + fix all_input_ids_tensor * maybe patching vlms? * fix filter and concat * wip, no filter, no concat * current * add prepare_for_prefill * working * load tested * re-create slots * re-create slots * fix slot_filtering_indices * feedback loop * remove log * fix benchmarker * fix vlm and seq2seq * rename to cache and input lengths * fix prefill logprobs * fix launcher * fix logprobs? * idk at this point * max input length * omfg * remove debugging lines * fix tests * fix mllama * fix cargo tests * remove support chunking for paged * Fixing non blocked attentions * Fixing dtype + AMD, Ipex targets. * lint fix. * rename * Fix prefix_caching variable, remove defaults in server (confusing a lot of the times). * Add simple resolution when user specifies ATTENTION=paged. * Put back non default simple tests. * Fix env name --------- Co-authored-by:Nicolas Patry <patry.nicolas@protonmail.com>
-
- 10 Oct, 2024 2 commits
-
-
Nicolas Patry authored
* Intel CI ? * Let's try non sharded gemma. * Snapshot rename * Apparently container can be gone already.
-
drbh authored
* feat: process token stream before returning to client * fix: expect content in test * fix: improve comparison via ruff lint * fix: return event in all cases * fix: always send event on error, avoid unwraps, refactor and improve tests * fix: prefer no_tool over notify_error to improve reponse * fix: adjust chat input test for no_tool * fix: adjust test expected content --------- Co-authored-by:System administrator <root@ip-10-90-0-186.ec2.internal>
-
- 08 Oct, 2024 1 commit
-
-
Daniël de Kok authored
* Add support for fused MoE Marlin for AWQ This uses the updated MoE Marlin kernels from vLLM. * Add integration test for AWQ MoE
-
- 04 Oct, 2024 1 commit
-
-
Daniël de Kok authored
* Add basic FP8 KV cache support This change adds rudimentary FP8 KV cache support. The support is enabled by passing `--kv-cache-dtype fp8_e5m2` to the launcher. Doing so uses this type for the KV cache. However support is still limited: * Only the `fp8_e5m2` type is supported. * The KV cache layout is the same as `float16`/`bfloat16` (HND). * The FP8 KV cache is only supported for FlashInfer. * Loading of scales is not yet supported. * Fix Cargo.toml
-
- 03 Oct, 2024 1 commit
-
- 02 Oct, 2024 2 commits
-
-
drbh authored
* feat: unroll notify_error if no tool is choosen * fix: expect simple message when no tool is selected * fix: improve test to avoid notify_error * fix: improve docs and indicate change in expected response * fix: adjust linting in test file
-
Nicolas Patry authored
* Working loading state. * Preprocessing. * Working state ? (Broke idefics1 temporarily). * Cleaner condition. * Fix idefics. * Updating config, removing TODO * Mllama * Ugrade transformers 4.45 * Flashing mllama. * Starting to get there. * Working state. * Integrations tests for mllama (cutting to 10 tokens because there seems' to be instability after (meaning size of the batch matters. * Updating model link. * Earlier assert. * Fix vlm ? * remove log. * Force ignore all images but last. * Default dtype bfloat16. * Update integration test after switch to bf16. * Remove dead code. * Removed dead code. * Upgrade the flake to latest transformers/tokenizers * Move to hf tgi-nix * Upgrade to 0.5.0
-
- 30 Sep, 2024 2 commits
-
-
drbh authored
* feat: support phi3.5 moe model loading * fix: prefer llama base model and improve rotary logic * feat: return reasonable generation and add integration test * fix: run lint and update docs * fix: rerun lint for openapi docs * fix: prefer do_sample false unless temp is set by user, and update chat tests * fix: small typo adjustments * fix: consolidate long rope paths * fix: revert greedy by default and test changes * Vendor configuration so that we don't have to `trust_remote_code` * Use SparseMoELayer * Add support for dense MoE * Some type annotations * Add the usual model tests * Ruff. --------- Co-authored-by:
Daniël de Kok <me@danieldk.eu> Co-authored-by:
Nicolas Patry <patry.nicolas@protonmail.com>
-
Daniël de Kok authored
This change add support for MoE models that use GPTQ quantization. Currently only models with the following properties are supported: - No `desc_act` with tensor parallelism, unless `group_size=-1`. - No asymmetric quantization. - No AWQ.
-
- 24 Sep, 2024 1 commit
-
-
Nicolas Patry authored
* More tensor cores. * Fixing the logic. * Gemma is modified by this.
-
- 19 Sep, 2024 1 commit
-
-
Nicolas Patry authored
* Stream options. * Fetch stuff from nix integration test for easier testing. * Adding the assert. * Only send the usage when asked for. * Update the docs. * Impure test because we need network. * develop. * Optional usage. * Fixes. * Workflow
-
- 17 Sep, 2024 1 commit
-
-
Daniël de Kok authored
* Move to moe-kernels package and switch to common MoE layer This change introduces the new `moe-kernels` package: - Add `moe-kernels` as a dependency. - Introduce a `SparseMoELayer` module that can be used by MoE models. - Port over Mixtral and Deepseek. * Make `cargo check` pass * Update runner
-
- 16 Sep, 2024 2 commits
-
-
Nicolas Patry authored
* Adding a test for FD. * Fixing flashdecoding (empty batch doesn't work). * Fixing the invalid popping. * Fixing radix with block_size > 1 * Last reference. * Use an actual hash. * Update hash for slice.len() == 1 * Update the locks. * Increasing docker timeout.
-
Daniël de Kok authored
Disable by default because CI runners do not have enough GPUs.
-