- 30 Sep, 2024 7 commits
-
-
Daniël de Kok authored
This change uses the updated Marlin MoE kernel from vLLM to support MoE with activation sorting and groups.
-
Daniël de Kok authored
-
drbh authored
* feat: support phi3.5 moe model loading
* fix: prefer llama base model and improve rotary logic
* feat: return reasonable generation and add integration test
* fix: run lint and update docs
* fix: rerun lint for openapi docs
* fix: prefer do_sample false unless temp is set by user, and update chat tests
* fix: small typo adjustments
* fix: consolidate long rope paths
* fix: revert greedy by default and test changes
* Vendor configuration so that we don't have to `trust_remote_code`
* Use SparseMoELayer
* Add support for dense MoE
* Some type annotations
* Add the usual model tests
* Ruff.
---------
Co-authored-by: Daniël de Kok <me@danieldk.eu>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
Daniël de Kok authored
This change adds support for MoE models that use GPTQ quantization. Currently only models with the following properties are supported:
- No `desc_act` with tensor parallelism, unless `group_size=-1`.
- No asymmetric quantization.
- No AWQ.
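As a minimal sketch of how these restrictions could be checked up front, assuming a Hugging Face-style `quantization_config` dict (the field names `quant_method`, `sym`, `desc_act`, and `group_size` follow the usual GPTQ config layout; the helper itself is illustrative, not TGI's actual code):

```python
# Illustrative check for the GPTQ MoE restrictions listed above.
# Assumes a Hugging Face-style `quantization_config` dict; not TGI's actual code.

def check_gptq_moe_support(quant_config: dict, tp_size: int) -> None:
    if quant_config.get("quant_method") == "awq":
        raise ValueError("AWQ-quantized MoE models are not supported")
    if not quant_config.get("sym", True):
        raise ValueError("Asymmetric GPTQ quantization is not supported for MoE")
    desc_act = quant_config.get("desc_act", False)
    group_size = quant_config.get("group_size", -1)
    if desc_act and tp_size > 1 and group_size != -1:
        raise ValueError("`desc_act` with tensor parallelism requires `group_size=-1`")

# Example: a symmetric config with `group_size=-1` passes even with `desc_act` and TP.
check_gptq_moe_support(
    {"quant_method": "gptq", "sym": True, "desc_act": True, "group_size": -1},
    tp_size=2,
)
```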
-
Mohit Sharma authored
* style
* update torch
* fix issues
* fix clone
* revert mkl
* added custom PA
* style
* fix style
* style
* hide env var
* fix mixtral model
* add skinny kernel and merge fixes
* fixed style
* fix issue for sliding window models
* addressed review comments
* fix import
* improved error message
* updated default value
* remove import
* fix imports after rebase
* float16 dep
* improve dockerfile
* cleaned dockerfile
-
Ikram Ul Haq authored
-
Daniël de Kok authored
Remove the compute capability lock. We are only calling the `get_cuda_capability` function once, so avoiding the cost of multiple calls is not really necessary yet.
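For context, reading the capability is a single PyTorch call, so a lock or cached value only pays off if it were queried repeatedly. A minimal sketch (not TGI's actual code), assuming a CUDA device is present:

```python
# Minimal sketch: reading the CUDA compute capability once via PyTorch.
# `torch.cuda.get_device_capability` returns e.g. (8, 0) on A100.
import torch

def get_cuda_capability(device_index: int = 0) -> tuple[int, int]:
    major, minor = torch.cuda.get_device_capability(device_index)
    return major, minor

if torch.cuda.is_available():
    # Called once at startup, so no caching or locking is needed.
    print(get_cuda_capability())
```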
-
- 28 Sep, 2024 1 commit
-
-
Daniël de Kok authored
-
- 27 Sep, 2024 1 commit
-
-
Daniël de Kok authored
* Improve support for GPUs with capability < 8
  - For models that cannot use flashinfer, use flash-attn v1 + paged attention for models with a compute capability older than 8.
  - Disable prefix caching when using paged attention.
  - When using flash-attn v1, pass the key/value, rather than the cache, since v1 cannot use block tables.
* nix: add flash-attn-v1 to the server environment
* Move disabling prefix caching into the block of exceptions
* Capability as `usize`s
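A hedged sketch of the kind of backend selection this describes; the backend names and the returned prefix-caching flag are placeholders for illustration, not TGI's real API:

```python
# Illustrative attention-backend selection by compute capability.
# Backend names and the prefix-caching flag are placeholders, not TGI's real code.
import torch

def select_attention_backend() -> tuple[str, bool]:
    major, _minor = torch.cuda.get_device_capability()
    if major >= 8:
        # Ampere and newer: flashinfer with prefix caching enabled.
        return "flashinfer", True
    # Older GPUs: flash-attn v1 + paged attention. v1 has no block tables,
    # so key/value tensors are passed directly and prefix caching is disabled.
    return "flash-attn-v1-paged", False

if torch.cuda.is_available():
    backend, prefix_caching = select_attention_backend()
    print(backend, prefix_caching)
```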
-
- 26 Sep, 2024 2 commits
-
-
Alvaro Bartolome authored
* Fix `cargo build --features google`
* Add `cargo test --features google`
-
Alvaro Bartolome authored
* Add LoRA adapters support for Gemma2
* Make `black` formatting happy
-
- 24 Sep, 2024 13 commits
-
-
Nicholas Broad authored
specify how to call local adapters
-
Nicolas Patry authored
* More tensor cores.
* Fixing the logic.
* Gemma is modified by this.
-
Nicolas Patry authored
* Cleanup Vertex + Chat
* logprobs defaults to false.
* Parameters are optional
* Fix docs.
* Changing back this logprobs default.
* Fixup doc.
* Let's debug that.
* Not unstable.
* Updating Cargo ?
* Wat?
* Dummy change.
* Trying some other install.
* Trying something.
* Revert everything.
* Update Cargo lock.
* Fixing the pre-commit after rebase.
-
Nicolas Patry authored
-
Aritra Roy Gosthipaty authored
* chore: adding note for private models in quicktour doc
* Update docs/source/quicktour.md
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
* Update docs/source/quicktour.md
Co-authored-by: vb <vaibhavs10@gmail.com>
* Update docs/source/quicktour.md
Co-authored-by: vb <vaibhavs10@gmail.com>
---------
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
Co-authored-by: vb <vaibhavs10@gmail.com>
-
Orhun Parmaksız authored
-
Orhun Parmaksız authored
-
Daniël de Kok authored
This replaces the custom layers in both models.
-
Daniël de Kok authored
* Add support for scalar FP8 weight scales
* Support LLM compressor FP8 checkpoints on H100
  On H100, we use fbgemm-gpu, which requires bfloat16 as the input dtype. However, we wouldn't pick up fp8 quantization for models quantized with LLM compressor. This change adds enough parsing to detect if models have FP8-quantized weights.
* Remove stray debug print
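A rough sketch of the kind of config parsing this implies; the `quantization_config` keys below are assumptions about fbgemm/compressed-tensors style checkpoints, not a definitive schema:

```python
# Illustrative detection of FP8-quantized checkpoints from a model's config.json.
# The config keys below are assumptions for the sketch, not a definitive schema.

def is_fp8_checkpoint(config: dict) -> bool:
    quant = config.get("quantization_config")
    if quant is None:
        return False
    if quant.get("quant_method") == "fbgemm_fp8":
        return True
    # compressed-tensors style: look for an 8-bit float weight group.
    for group in quant.get("config_groups", {}).values():
        weights = group.get("weights", {})
        if weights.get("type") == "float" and weights.get("num_bits") == 8:
            return True
    return False

print(is_fp8_checkpoint({"quantization_config": {"quant_method": "fbgemm_fp8"}}))  # True
```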
-
Nicolas Patry authored
-
Nicolas Patry authored
-
Alvaro Bartolome authored
-
OlivierDehaene authored
* wip
* added v2
-
- 23 Sep, 2024 1 commit
-
-
Daniël de Kok authored
-
- 20 Sep, 2024 3 commits
-
-
Nicolas Patry authored
* Preparing for release.
* Upgrade version in docs.
-
OlivierDehaene authored
* fix: wrap python basic logs in debug assertion in launcher
* use level filters instead
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
- 19 Sep, 2024 3 commits
-
-
Daniël de Kok authored
-
Daniël de Kok authored
* Update to moe-kernels 0.3.1
* Attempt to fix apt failure
-
Nicolas Patry authored
* Stream options.
* Fetch stuff from nix integration test for easier testing.
* Adding the assert.
* Only send the usage when asked for.
* Update the docs.
* Impure test because we need network.
* develop.
* Optional usage.
* Fixes.
* Workflow
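For reference, a client-side sketch of asking for usage in a streamed response; the payload follows the OpenAI `stream_options` convention and assumes an instance serving the OpenAI-compatible route on localhost:8080, so adjust as needed:

```python
# Sketch: request token usage in a streamed chat completion.
# Assumes an OpenAI-compatible /v1/chat/completions route at localhost:8080.
import json
import requests

payload = {
    "model": "tgi",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": True,
    # Usage is only sent when explicitly asked for:
    "stream_options": {"include_usage": True},
}

with requests.post(
    "http://localhost:8080/v1/chat/completions", json=payload, stream=True
) as resp:
    for line in resp.iter_lines():
        if line.startswith(b"data:") and line != b"data: [DONE]":
            chunk = json.loads(line[len(b"data:"):])
            if chunk.get("usage"):
                print(chunk["usage"])
```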
-
- 17 Sep, 2024 3 commits
-
-
Daniël de Kok authored
* Move to moe-kernels package and switch to common MoE layer
  This change introduces the new `moe-kernels` package:
  - Add `moe-kernels` as a dependency.
  - Introduce a `SparseMoELayer` module that can be used by MoE models.
  - Port over Mixtral and Deepseek.
* Make `cargo check` pass
* Update runner
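The `SparseMoELayer` interface itself isn't shown in this log; as a rough illustration of what a sparse MoE layer does, here is a minimal top-k routed forward pass in plain PyTorch (purely a sketch, not the `moe-kernels` API):

```python
# Minimal top-k sparse MoE forward pass, for illustration only.
# Plain PyTorch, not the fused `moe-kernels` implementation.
import torch
import torch.nn as nn

class TinySparseMoE(nn.Module):
    def __init__(self, hidden: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden, n_experts, bias=False)
        self.experts = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(n_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Route each token to its top-k experts and mix their outputs.
        weights = torch.softmax(self.router(x), dim=-1)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[..., k] == e
                if mask.any():
                    out[mask] += topk_w[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

moe = TinySparseMoE(hidden=16, n_experts=4)
print(moe(torch.randn(3, 16)).shape)  # torch.Size([3, 16])
```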
-
OlivierDehaene authored
-
Daniël de Kok authored
Runs the tests in a Nix build sandbox.
-
- 16 Sep, 2024 2 commits
-
-
Nicolas Patry authored
* Adding a test for FD.
* Fixing flashdecoding (empty batch doesn't work).
* Fixing the invalid popping.
* Fixing radix with block_size > 1
* Last reference.
* Use an actual hash.
* Update hash for slice.len() == 1
* Update the locks.
* Increasing docker timeout.
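As a purely illustrative sketch of the idea behind hashing with `block_size > 1`: fixed-size blocks of token ids are hashed, and each block's hash is chained with its predecessor so identical prefixes map to identical keys. The helper below is hypothetical, not the actual radix-cache code:

```python
# Illustrative prefix-cache keying over fixed-size token blocks (hypothetical helper).
import hashlib

def block_hashes(token_ids: list[int], block_size: int) -> list[str]:
    hashes, prev = [], ""
    usable = len(token_ids) - len(token_ids) % block_size  # only full blocks are keyed
    for start in range(0, usable, block_size):
        block = token_ids[start:start + block_size]
        prev = hashlib.sha256((prev + ",".join(map(str, block))).encode()).hexdigest()
        hashes.append(prev)
    return hashes

# Two sequences sharing a full first block share the first key.
print(block_hashes([1, 2, 3, 4, 5], 2)[0] == block_hashes([1, 2, 3, 9], 2)[0])  # True
```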
-
Daniël de Kok authored
Disable by default because CI runners do not have enough GPUs.
-
- 13 Sep, 2024 1 commit
-
-
Alex Strick van Linschoten authored
* use ratatui not archived tui
* bump ratatui all the way with options
-
- 12 Sep, 2024 3 commits
-
-
Wang, Yi authored
enable intel ipex cpu and xpu in python3.11
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
drbh authored
fix: pass missing revision arg for lora adapter when loading multiple adapters
-
Nicolas Patry authored
* Add nix test.
* Modifying yourself means you need to rerun.
* Fixing the test + adding click (needed for pre-commit hooks).
* Try this.
* Our runner + pure test (not written)
* Remove server.
* Root user.
* Different user ?
* Add the actual test target.
* Forgot this modification.
* Add a formatter.
* Add the secrets.
* Fixed the auth token ?
* Adding the other tests.
* Missing pre-commit.
* Test requires cargo for cargo fmt.
* Update it a bit.
* Up.
* Attempting to use a cache location for the models.
* Ignore the cache for now.
-