- 19 Sep, 2024 1 commit
-
Nicolas Patry authored
* Stream options.
* Fetch stuff from nix integration test for easier testing.
* Adding the assert.
* Only send the usage when asked for.
* Update the docs.
* Impure test because we need network.
* develop.
* Optional usage.
* Fixes.
* Workflow
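For context, a hedged sketch of how a client would opt into the usage payload described above, assuming the OpenAI-compatible streaming route; the host, port, and model name are placeholders:

```
import json
import requests

# Ask the server to append a final usage chunk to the stream
# (hypothetical local endpoint; adjust host/port for your deployment).
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "tgi",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": True,
        "stream_options": {"include_usage": True},
    },
    stream=True,
)

for line in resp.iter_lines():
    if not line or not line.startswith(b"data:"):
        continue
    payload = line[len(b"data:"):].strip()
    if payload == b"[DONE]":
        break
    chunk = json.loads(payload)
    # When include_usage is set, the last chunk carries token counts.
    if chunk.get("usage"):
        print("usage:", chunk["usage"])
```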
-
- 17 Sep, 2024 3 commits
-
Daniël de Kok authored
* Move to moe-kernels package and switch to common MoE layer

  This change introduces the new `moe-kernels` package:
  - Add `moe-kernels` as a dependency.
  - Introduce a `SparseMoELayer` module that can be used by MoE models.
  - Port over Mixtral and Deepseek.

* Make `cargo check` pass
* Update runner
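The commit doesn't show `SparseMoELayer`'s actual interface, so the following is only an illustrative sketch of what a common top-k-routed sparse MoE layer does; every name in it is hypothetical and not the `moe-kernels` API:

```
import torch
import torch.nn as nn

class SparseMoESketch(nn.Module):
    """Toy top-k routed MoE layer (illustrative only, not moe-kernels)."""

    def __init__(self, hidden: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Route each token to its top-k experts, mix outputs by softmax weight.
        logits = self.router(x)
        weights, idx = torch.topk(logits, self.top_k, dim=-1)
        weights = torch.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[..., k] == e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * self.experts[e](x[mask])
        return out
```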
-
OlivierDehaene authored
-
Daniël de Kok authored
Runs the tests in a Nix build sandbox.
-
- 16 Sep, 2024 2 commits
-
Nicolas Patry authored
* Adding a test for FD.
* Fixing flashdecoding (empty batch doesn't work).
* Fixing the invalid popping.
* Fixing radix with block_size > 1.
* Last reference.
* Use an actual hash.
* Update hash for slice.len() == 1.
* Update the locks.
* Increasing docker timeout.
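As a rough mental model (assumed, not the real `radix.rs` logic) of why "use an actual hash" and the `slice.len() == 1` case matter: each cached block's key is a hash chained from its parent, so equal prefixes collide deliberately, regardless of slice length or `block_size`:

```
import hashlib

BLOCK_SIZE = 2  # block_size > 1 was one of the broken cases

def block_hashes(token_ids: list[int], parent: str = "root") -> list[str]:
    """Chain-hash fixed-size token blocks so equal prefixes share keys."""
    hashes = []
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK_SIZE, BLOCK_SIZE):
        block = token_ids[i : i + BLOCK_SIZE]
        h = hashlib.sha256(f"{parent}:{block}".encode()).hexdigest()
        hashes.append(h)
        parent = h  # the next block's key depends on everything before it
    return hashes

# Two prompts sharing a 4-token prefix share their first two block keys.
a = block_hashes([1, 2, 3, 4, 5, 6])
b = block_hashes([1, 2, 3, 4, 9, 9])
assert a[:2] == b[:2] and a[2] != b[2]
```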
-
Daniël de Kok authored
Disable by default because CI runners do not have enough GPUs.
-
- 13 Sep, 2024 1 commit
-
Alex Strick van Linschoten authored
* Use ratatui, not the archived tui
* Bump ratatui all the way, with options
-
- 12 Sep, 2024 4 commits
-
Wang, Yi authored
Enable Intel IPEX CPU and XPU in Python 3.11.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
drbh authored
fix: pass missing revision arg for lora adapter when loading multiple adapters
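For context, a hedged sketch of the kind of call the revision argument needs to reach when several adapters are loaded; it uses PEFT directly rather than TGI's internal loader, and the model and adapter IDs are placeholders:

```
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("base-model-id")

# Each adapter can pin its own revision; dropping it silently falls back
# to the default branch, which is the bug being fixed here.
model = PeftModel.from_pretrained(base, "org/adapter-one", revision="v1")
model.load_adapter("org/adapter-two", adapter_name="two", revision="v2")
```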
-
Nicolas Patry authored
* Add nix test.
* Modifying yourself means you need to rerun.
* Fixing the test + adding click (needed for pre-commit hooks).
* Try this.
* Our runner + pure test (not written).
* Remove server.
* Root user.
* Different user?
* Add the actual test target.
* Forgot this modification.
* Add a formatter.
* Add the secrets.
* Fixed the auth token?
* Adding the other tests.
* Missing pre-commit.
* Test requires cargo for cargo fmt.
* Update it a bit.
* Up.
* Attempting to use a cache location for the models.
* Ignore the cache for now.
-
Daniël de Kok authored
Ideally we wouldn't have the router wrapper that this change adds, but when I give PyO3 a Python interpreter with packages, it ends up linking libpython from the Python interpreter rather than the constructed environment and cannot pick up the Python modules as a result.
-
- 11 Sep, 2024 3 commits
-
Nicolas Patry authored
* Attempting to discard the trufflehog warning.
* Attempt to fix trufflehog.
-
Nicolas Patry authored
* Fixing odd tokenization self-modifications on the Rust side (load and resave in Python).
* Fixing the builds?
* Fix the gh action?
* Fixing the location?
* Validation is odd.
* Try a faster runner.
* Upgrade python version.
* Remove sccache.
* No sccache.
* Getting libpython maybe?
* List stuff.
* Monkey it up.
* Have no idea at this point.
* Tmp.
* Shot in the dark.
* Tmate the hell out of this.
* Desperation.
* WTF.
* -y.
* Apparently 3.10 is not available anymore.
* Updating the dockerfile to make libpython discoverable at runtime too.
* Put back rust tests.
* Why do we want mkl on AMD?
* Forcing 3.11?
-
Nicolas Patry authored
* Adding prefix test.
* [WIP] tmp dump of integration load tests.
* Remove other tensor creation.
* Fixed the radix tree. Used a slice everywhere in radix.rs to keep the cheap Arc cloning instead of recomputing the input_ids.
* Fix parsing.
* Is it really the flashinfer version?
* Remove some comments.
* Revert the max prefix hit.
* Adding numpy to diff.
* Upgraded flashinfer.
* Upgrading some stuff.
* Are we done yet?
* Minor fixup.
* Remove 1 log and put back the other.
* Add comment for why slot 0 is OK.
* Mounting on the job.
* Get me a debug branch.
* Debugging CIs is fun.
* Attempt #28.
* wip.
* Tmate.
* Praying.
* Updating VLM causal model with updated context.
* Important line got squashed.
* Tmate again.
* Fingers crossed.
* We want only 1 run of integration tests.

Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>
-
- 07 Sep, 2024 1 commit
-
Vallepu Vamsi Krishna authored
Update Makefile-fbgemm: add a directory check before cloning the FBGEMM repository.
-
- 06 Sep, 2024 6 commits
-
Nicolas Patry authored
-
Martin Iglesias Goyanes authored
* Add links to Adyen blogpost
* Adding to toctree.
* Update external.md
* Update _toctree.yml

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
Daniël de Kok authored
-
Daniël de Kok authored
These should all be cheap assertions. Also:

* Fixup some comments.
* Delete a `remove` that was done unnecessarily twice.
-
Daniël de Kok authored
-
Daniël de Kok authored
We need this to ensure that pyright/ruff are part of the same interpreter/venv.
-
- 05 Sep, 2024 4 commits
-
Wang, Yi authored
Fix a regression caused by the attention API change: `ipex.varlen_attention` does not currently support paged-cache-format KV input.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Daniël de Kok authored
-
Nicolas Patry authored
-
Daniël de Kok authored
The minimum batch size logic could cause prefix blocks to be deallocated without prefill. The next allocation of the same prefix would then use garbage blocks.
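A toy model of that failure mode (assumed, not the actual allocator code): if a block goes back on the free list while its prefix stays in the cache index, the next hit on that prefix serves whatever was written to the block in the meantime:

```
cache = {}           # prefix -> physical block index
blocks = [None] * 4  # physical block contents
free = [0, 1, 2, 3]

def allocate(prefix, tokens):
    if prefix in cache:        # "cache hit": reuse without prefill
        return cache[prefix]
    i = free.pop()
    blocks[i] = tokens         # prefill writes the block
    cache[prefix] = i
    return i

def deallocate(prefix):
    # Bug shape: the block returns to the free list but the cache entry
    # survives, so a later hit serves whatever was written there since.
    free.append(cache[prefix])

i = allocate("hello", ["h", "e"])
deallocate("hello")
j = allocate("other", ["x", "y"])  # reuses physical block i
k = allocate("hello", None)        # stale hit -> garbage block
assert k == i and blocks[k] == ["x", "y"]
```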
-
- 02 Sep, 2024 4 commits
-
drbh authored
* feat: support lora revisions and qkv_proj weights
* fix: add qkv_proj weights to weight test
-
drbh authored
* fix: enable chat requests in vertex endpoint
* feat: avoid unwrap and pre-allocate future vec
-
Daniël de Kok authored
Enables LoRA support.
-
Daniël de Kok authored
- Add some test dependencies.
- Install server in venv.
- Install Python client in venv.
-
- 29 Aug, 2024 5 commits
-
Nicolas Patry authored
* Tied embeddings in MLP speculator.
* Fixing the scale_weight when users decide not to use the speculation as much as defined in the config.
* Adding scaling support + optimize some ops.
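Weight tying itself is a standard trick; a minimal sketch of what "tied embeddings" means in general (not the MLP speculator's actual code):

```
import torch.nn as nn

class TiedLMHead(nn.Module):
    """Share one weight matrix between the input embedding and the output head."""

    def __init__(self, vocab: int, hidden: int):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.head = nn.Linear(hidden, vocab, bias=False)
        self.head.weight = self.embed.weight  # tying: one tensor, two uses

    def forward(self, token_ids):
        return self.head(self.embed(token_ids))
```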
-
Wang, Yi authored
* Update doc with Intel CPU part.
* Apply suggestions from code review: we never use `latest` in documentation since it causes too many issues for users; the release number gets updated on every release.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
drbh authored
* feat: add /v1/models endpoint
* fix: remove unused type import
* fix: revert route typo
* fix: update docs with new endpoint
* fix: add to redocly ignore and lint
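A quick hedged example of querying the new endpoint; the local URL is a placeholder and the response is assumed to follow the OpenAI models-list shape:

```
import requests

resp = requests.get("http://localhost:8080/v1/models")
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])
```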
-
Nicolas Patry authored
* Making prefix/flashinfer the default and testing the full release tests.
* Include flashinfer in the docker.
* Using prebuilt.
* Allowing window_left_size (dummy version).
* Disabling flashinfer/prefix caching on odd head_dim.
* Disable prefix caching for lora.
* More specific codes.
* Update lock.
* Updating integration tests with new values with FI/FD. Remove paged as a default too, and using FD everywhere.
* Update cargo lock?
* Upgrade to 1.80 because of bitstream...
* Everywhere 1.80.
* Forgot last default place.
* Apply suggestions from code review.
* Updated flake lock.
* Tmp.
* Upgrade resolution system for fewer errors in resolution.
* Remove lambda for cleaner function.
* Handling debugger.
* Override the env in server tests.
* Is this enough to make it work?
* This seems to be working.
* Downgrade some logs.
* Fixing the default for vlm.
* Don't enable prefix caching on VLM just yet.
* Change `add_special_tokens` in order to have the correct tokens for chat input (super important with the prefixing now).
* Fixing prefix caching for flashdecoding.
* Update all models.
* Fixed flashinfer version.
* add_special_tokens is internal only.
* Fixing seqlen with the new vlms.
* Fixing the issue with `add_special_tokens` not being passed around.
* Fixing the test.
* Removing encoder_decoder (seq2seq).
* Update the chat test.
* Fixing the batching tokenization in flash causal lm.
* Truncating left for radix purposes.
* Oops, this doesn't belong here.
* Put back default pure shell.
* Update server tests:
  - Default to throughput test in k6.
  - Use TGI_WIGGLE_ROOM to adjust wiggle room.
* Only n_heads / process_group.size() are necessary.
* Revert the integration tests change (seems linked to the head_size modification).
* Adding error message when the assert is violated.
* Fixing the free algorithm to handle times where the common prefix is smaller.
* Apply suggestions from code review.
* Update server/text_generation_server/layers/attention/common.py.
* Fix disabling prefix caching - fix windowing checks.
* Revert the Cohere tokenizer change (for now using a revision instead).
* Fmt.

Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
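The `add_special_tokens` point above is easy to see with a plain tokenizer call; a hedged illustration (the tokenizer repo is just a convenient Llama-style example):

```
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

text = "[INST] Hello [/INST]"
# With add_special_tokens=True the tokenizer prepends its own BOS,
# duplicating markers a chat template may already have rendered into text.
with_specials = tok(text, add_special_tokens=True)["input_ids"]
without = tok(text, add_special_tokens=False)["input_ids"]
print(len(with_specials) - len(without))  # the extra special token(s)
```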
-
Daniël de Kok authored
Updates tgi-nix input:

- Move Torch closer to upstream by building against MKL.
- Remove compute capability 8.7 from Torch (Jetson).
- Sync nixpkgs compute capabilities with Torch (avoids compiling too many capabilities for MAGMA).
- Use nixpkgs configuration passed through by `tgi-nix`.
-
- 28 Aug, 2024 1 commit
-
drbh authored
-
- 27 Aug, 2024 3 commits
-
drbh authored
* fix: support tojson and avoid message indexing issue in template
* fix: prefer minijinja native methods and prefer workspace level dependency
* fix: adjust comment typo
-
Nicolas Patry authored
-
drbh authored
* fix[router]: Fix tools not passed in chat template
* feat: improve default tool serialization and lints
* feat: refactor tool logic to include notify_error in prompt and adjust typing
* fix: adjust non tool template apply
* fix: simplify tool grammar logic and improve schema
* feat: avoid skip tool test and avoid empty tool prompts
* fix: increase test client timeout for grammar compilation tests

Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: Simone Rossi <simone.rossi.93@gmail.com>
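For reference, the request shape whose tools now reach the chat template; a minimal sketch against the OpenAI-compatible route (URL, model name, and the weather function are placeholders):

```
import requests

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "tgi",
        "messages": [{"role": "user", "content": "Weather in Paris?"}],
        "tools": tools,
        "tool_choice": "auto",
    },
)
print(resp.json()["choices"][0]["message"])
```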
-
- 26 Aug, 2024 1 commit
-
drbh authored
* Fix: don't apply post layernorm in SiglipVisionTransformer

  This fixes a bug with LLaVA Next when using Siglip as the vision model. LLaVA Next expects the output of the vision model to be the encoder outputs before layernorm (see the original transformers implementation here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llava_next/modeling_llava_next.py#L813). This also makes Siglip consistent with the existing Clip implementation: https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/models/custom_modeling/clip.py#L613

* fix: adjust pali gemma for post layer norm and small refactors

Co-authored-by: Travis Addair <tgaddair@gmail.com>
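In code terms the Siglip fix amounts to returning the encoder output without applying the final layernorm; a hedged sketch of the shape of the change (names are illustrative, not the file's exact code):

```
import torch.nn as nn

class VisionTransformerSketch(nn.Module):
    def __init__(self, encoder: nn.Module, hidden: int):
        super().__init__()
        self.encoder = encoder
        self.post_layernorm = nn.LayerNorm(hidden)

    def forward(self, pixel_values):
        hidden_states = self.encoder(pixel_values)
        # LLaVA Next consumes the pre-layernorm encoder states, so the
        # fix is to return them as-is instead of self.post_layernorm(...).
        return hidden_states
```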
-
- 23 Aug, 2024 1 commit
-
Daniël de Kok authored
The default package wraps the launcher and puts the server/router in the path. As a result, TGI can be started using something like:

```
nix run .# -- \
  --model-id hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --port 8080
```
-