- 29 Aug, 2024 1 commit
-
-
Nicolas Patry authored
* Making prefix/flashinfer the default and testing the full release tests.
* Include flashinfer in the docker.
* Using prebuilt.
* Allowing window_left_size (dummy version).
* Disabling flashinfer/prefix caching on odd head_dim
* Disable prefix caching for lora.
* More specific codes.
* Update lock
* Updating integration tests with new values with FI/FD. Remove paged as a default too, and using FD everywhere.
* Update cargo lock?
* Upgrade to 1.80 because of bitstream...
* Everywhere 1.80
* Forgot last default place.
* Apply suggestions from code review. Co-authored-by: drbh <david.richard.holtz@gmail.com>
* Updated flake lock
* Tmp
* Upgrade resolution system for fewer errors in resolution.
* Remove lambda for cleaner function.
* Handling debugger.
* Override the env in server tests.
* Is this enough to make it work?
* This seems to be working.
* Downgrade some logs.
* Fixing the default for vlm.
* Don't enable prefix caching on VLM just yet.
* Change `add_special_tokens` in order to have the correct tokens for chat input (since it's super important with the prefixing now)
* Fixing prefix caching for flashdecoding.
* Update all models.
* Fixed flashinfer version.
* add_special_tokens is internal only
* Fixing seqlen with the new vlms.
* Fixing the issue with `add_special_tokens` not being passed around.
* Fixing the test.
* Removing encoder_decoder (seq2seq).
* Update the chat test.
* Fixing the batching tokenization in flash causal lm.
* Truncating left for radix purposes.
* Oops, this doesn't belong here.
* Put back default pure shell.
* Update server tests: default to throughput test in k6; use TGI_WIGGLE_ROOM to adjust wiggle room.
* Only n_heads / process_group.size() are necessary.
* Revert the integration tests change (seems linked to head_size modification).
* Adding error message when assert is violated.
* Fixing the free algorithm to handle times where the common prefix is smaller.
* Apply suggestions from code review. Co-authored-by: OlivierDehaene <olivier@huggingface.co>
* Update server/text_generation_server/layers/attention/common.py. Co-authored-by: OlivierDehaene <olivier@huggingface.co>
* Fix disabling prefix caching: fix windowing checks.
* Revert the Cohere tokenizer change (for now using a revision instead).
* Fmt.
---------
Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: OlivierDehaene <olivier@huggingface.co>
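For context, a minimal sketch of the kind of backend-selection guard this commit describes: disable prefix caching when `head_dim` is odd or LoRA adapters are loaded, falling back to flashdecoding. All names here (`ATTENTION`, `USE_PREFIX_CACHING`, `resolve_attention`) are illustrative assumptions, not the actual TGI code:

```python
import os

def resolve_attention(head_dim: int, lora_adapters: list, requested: str | None = None):
    """Pick an attention backend and decide whether prefix caching is safe."""
    attention = requested or os.environ.get("ATTENTION", "flashinfer")
    prefix_caching = os.environ.get("USE_PREFIX_CACHING", "1") == "1"

    # flashinfer/prefix caching is assumed to require an even head_dim here.
    if head_dim % 2 != 0:
        attention, prefix_caching = "flashdecoding", False

    # Prefix caching is not yet validated together with LoRA adapters.
    if lora_adapters:
        prefix_caching = False

    return attention, prefix_caching

print(resolve_attention(128, []))  # ('flashinfer', True)
print(resolve_attention(111, []))  # ('flashdecoding', False)
```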
-
- 27 Aug, 2024 1 commit
-
-
Nicolas Patry authored
-
- 16 Aug, 2024 2 commits
-
-
Nicolas Patry authored
* All integration tests back everywhere (too many failed CI).
* Upgrade integration tests after 12.4.
* Attempt to remove the specified compute cap.
* Common arch list.
* Punica uses raw ASM, which is not valid on 9.0 apparently.
-
Hugo Larcher authored
* doc: Add metrics documentation and add a 'Reference' section
* doc: Add API reference
* doc: Refactor API reference
* fix: Message API link
* Bad rebase
* Moving the docs.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
- 12 Aug, 2024 1 commit
-
-
Wang, Yi authored
add intel-cpu docker image

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
- 01 Aug, 2024 1 commit
-
-
Daniël de Kok authored
* Fix cache block size for flash decoding. This seems to have been accidentally dropped during the TRT-LLM PR rebase.
* Also run CI on changes to `backends`
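As a reading aid, a sketch of why this matters: the KV-cache allocator must use the block size the attention kernel expects, and flash decoding uses much larger blocks than paged attention. The 256/16 values below are assumptions for illustration, not a statement of TGI's actual constants:

```python
def cache_block_size(attention: str) -> int:
    # Flashdecoding kernels are assumed to consume large KV-cache blocks,
    # while paged attention uses small ones; the allocator must match.
    return 256 if attention == "flashdecoding" else 16

def num_blocks(total_tokens: int, attention: str) -> int:
    block = cache_block_size(attention)
    return -(-total_tokens // block)  # ceiling division

print(num_blocks(4096, "flashdecoding"))  # 16 blocks of 256 tokens
print(num_blocks(4096, "paged"))          # 256 blocks of 16 tokens
```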
-
- 31 Jul, 2024 1 commit
-
-
Nicolas Patry authored
* Squashed TRT-LLM backend history:
wip
wip
refacto
refacto
Initial setup for CXX binding to TRTLLM
Working FFI call for TGI and TRTLLM backend
Remove unused parameters and force tokenizer name to be set
Overall build TRTLLM and deps through CMake build system
Enable end-to-end CMake build
First version loading engines and making it ready for inference
Remembering to check how we can detect support for chunked context
Move to latest TensorRT-LLM version
Specify which default log level to use depending on CMake build type
make leader executor mode working
unconditionally call InitializeBackend on the FFI layer
bind to CUDA::nvml to retrieve compute capabilities at runtime
updated logic and comment to detect cuda compute capabilities
implement the Stream method to send new tokens through a callback
use spdlog release 1.14.1 moving forward
update trtllm to latest version a96cccafcf6365c128f004f779160951f8c0801c
correctly tell cmake to build dependent tensorrt-llm required libraries
create cmake install target to put everything relevant in installation folder
add auth_token CLI argument to provide hf hub authentication token
allow converting huggingface::tokenizers error to TensorRtLlmBackendError
use correct include for spdlog
include guard to build example in cmakelists
working setup of the ffi layer
remove fmt import
use external fmt lib
end-to-end ffi flow working
make sure to track include/ffi.h to trigger rebuild from cargo
impl the rust backend which currently cannot move the actual computation in background thread
expose shutdown function at ffi layer
impl RwLock scenario for TensorRtLllmBackend
oops, missing c++ backend definitions
compute the number of maximum new tokens for each request independently
make sure the context is not dropped in the middle of the async decoding
remove unnecessary log
add all the necessary plumbing to return the generated content
update invalid doc in cpp file
correctly forward back the log probabilities
remove unneeded scope variable for now
refactor Stream impl for Generation to factorise code
expose the internal missing start/queue timestamp
forward tgi parameters rep/freq penalty
add some more validation about grammar not supported
define a shared struct to hold the result of a decoding step
expose information about potential error happening while decoding
remove logging
add logging in case of decoding error
make sure executor_worker is provided
add initial Dockerfile for TRTLLM backend
add some more information in CMakeLists.txt to correctly install executorWorker
add some more information in CMakeLists.txt to correctly find and install nvrtc wrapper
simplify prebuilt trtllm libraries name definition
do the same name definition stuff for tensorrt_llm_executor_static
leverage pkg-config to probe libraries paths and reuse new install structure from cmake
fix bad copy/paste: missing nvinfer linkage direction
align all the linker search dependencies
add missing pkgconfig folder for MPI in Dockerfile
correctly set up linking search path for runtime layer
fix missing / before tgi lib path
adding missing ld_library_path for cuda stubs in Dockerfile
update tgi entrypoint
commenting out Python part for TensorRT installation
refactored docker image
move to TensorRT-LLM v0.11.0
make docker linter happy with same capitalization rule
fix typo
refactor the compute capabilities detection along with num gpus
update TensorRT-LLM to latest version
update TensorRT install script to latest
update build.rs to link to cuda 12.5
add missing dependent libraries for linking
clean up a bit
install to decoder_attention target
add some custom stuff for nccl linkage
fix envvar CARGO_CFG_TARGET_ARCH set at runtime vs compile time
use std::env::const::ARCH
make sure variable lives long enough...
look for cuda 12.5
add some more basic info in README.md
* Rebase.
* Fix autodocs.
* Let's try to enable trtllm backend.
* Ignore backends/v3 by default.
* Fixing client.
* Fix makefile + autodocs.
* Updating the schema thing + redocly.
* Fix trtllm lint.
* Adding pb files?
* Remove cargo fmt temporarily.
* ?
* Tmp.
* Remove both check + clippy?
* Backporting telemetry.
* Backporting 457fb0a1
* Remove PB from git.
* Fixing PB with default member backends/client
* update TensorRT-LLM to latest version
* provided None for api_key
* link against libtensorrt_llm and not libtensorrt-llm
---------
Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
Co-authored-by: Morgan Funtowicz <morgan@huggingface.co>
-
- 29 Jul, 2024 1 commit
-
-
Adrien authored
Signed-off-by: Adrien <adrien@huggingface.co>
-
- 25 Jul, 2024 2 commits
-
-
Adrien authored
-
Nicolas Patry authored
* Using g6 instead of g5.
* Update the idefics2 snapshot.
-
- 22 Jul, 2024 2 commits
-
-
Adrien authored
-
Adrien authored
* test new instances
* improve build ci
---------
Signed-off-by: Adrien <adrien@huggingface.co>
-
- 20 Jul, 2024 2 commits
-
-
OlivierDehaene authored
* feat(fp8): add support for fbgemm
* allow loading fp8 weights directly
* update outlines
* fix makefile
* build fbgemm
* avoid circular import and fix dockerfile
* add default dtype
* refactored weights loader
* fix auto conversion
* fix quantization config parsing
* force new nccl on install
* missing get_weights implementation
* increase timeout
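To illustrate what "loading fp8 weights directly" involves, here is a minimal per-tensor fp8 (e4m3) quantization sketch in plain PyTorch. It only shows the weight/scale pair that fbgemm-style kernels consume; it is not the fbgemm implementation:

```python
import torch

def quantize_fp8(weight: torch.Tensor):
    finfo = torch.finfo(torch.float8_e4m3fn)
    # Scale so the largest magnitude maps onto the fp8 dynamic range.
    scale = finfo.max / weight.abs().max().clamp(min=1e-12)
    qweight = (weight * scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    return qweight, scale.reciprocal()  # keep the dequantization scale

w = torch.randn(4096, 4096)
qw, s = quantize_fp8(w)
print(qw.dtype, s.item())  # torch.float8_e4m3fn plus a per-tensor scale
```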
-
Adrien authored
* re-push to internal registry
* fix name
* debug
* debug
* wip
* wip
* wip debug
* add debug
* should
* wip
* ww
* wip
* wip
* ww
* wip
* wip
* debug
* w
* revert tests
* last reverts
* another one
---------
Signed-off-by: Adrien <adrien@huggingface.co>
-
- 09 Jul, 2024 2 commits
-
-
Nicolas Patry authored
* Updating the self check
* Fix.
* Revert the CLI.
* cli.
* Space.
* Revert cargo update.
-
Nicolas Patry authored
-
- 08 Jul, 2024 1 commit
-
-
Guillaume LEGENDRE authored
* Update build.yaml
* Update build.yaml
* change to S3 cache
* change to CPU Runners
* remove comments
-
- 05 Jul, 2024 2 commits
-
-
Daniël de Kok authored
* Consistently take `prefix` in model constructors
* Release test check fix
* Misc refactor-related fixes
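A small sketch of the refactor's idea: the caller passes the weight-name `prefix` down, and each module only appends its own suffixes instead of hardcoding a checkpoint layout. The classes below are hypothetical stand-ins, not TGI's loader:

```python
class Weights:
    def __init__(self, tensors: dict):
        self.tensors = tensors

    def get(self, name: str):
        return self.tensors[name]

class Attention:
    def __init__(self, prefix: str, weights: Weights):
        # The caller decides whether this lives under "model.layers.0.self_attn"
        # or "transformer.h.0.attn"; the module just appends its own suffix.
        self.q_weight = weights.get(f"{prefix}.q_proj.weight")

layer = Attention(
    "model.layers.0.self_attn",
    Weights({"model.layers.0.self_attn.q_proj.weight": [0.1]}),
)
```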
-
Daniël de Kok authored
* Add more representative Llama GPTQ test. The Llama GPTQ test is updated to use a model with the commonly-used quantizer config format and activation sorting. The old test is kept around (but renamed) since it tests the format produced by `text-generation-server quantize`.
* Add support for manually triggering a release build
-
- 03 Jul, 2024 1 commit
-
-
drbh authored
* feat: add pre commit step to force schema update when router changes
* fix: prefer improved update_doc and start server and compare
* fix: adjust typo
* fix: adjust revert typo
* fix: update workflow to use update_doc md command
* feat: improve workflow to check openapi schema too
* fix: adjust timeout for CI
* fix: adjust raise condition and install server in ci
* fix: install protoc before server
* feat: improve update doc and add command to print router schema
* fix: adjust autodoc workflow
* fix: explicitly install protoc and python
* fix: allow trailing space in openapi schema diff
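The gist of such a check, as a sketch: regenerate the schema and fail CI when it drifts from the committed file. The subcommand name and path below are assumptions based on the commit message ("command to print router schema"), not the exact CLI:

```python
import pathlib
import subprocess
import sys

# Hypothetical subcommand; the commit only says a schema-printing command exists.
generated = subprocess.run(
    ["text-generation-router", "print-schema"],
    capture_output=True, text=True, check=True,
).stdout

committed = pathlib.Path("docs/openapi.json").read_text()  # assumed path
if generated.strip() != committed.strip():
    sys.exit("OpenAPI schema is out of date; regenerate it with update_doc.")
```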
-
- 02 Jul, 2024 1 commit
-
-
Guillaume LEGENDRE authored
* first test with registry mirror
* change push registry
* remove comments
* Move cache to push registry
* fix registry url
* Update .github/workflows/ci_build.yaml
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
- 01 Jul, 2024 1 commit
-
-
Nicolas Patry authored
-
- 28 Jun, 2024 1 commit
-
-
Nicolas Patry authored
-
- 25 Jun, 2024 3 commits
-
-
Daniël de Kok authored
* Add pytest release marker. Annotate a test with `@pytest.mark.release` and it only gets run with `pytest integration-tests --release`.
* Mark many models as `release` to speed up CI
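A plausible `conftest.py` wiring for such a marker (an assumed implementation consistent with the commit message, not necessarily TGI's exact one):

```python
import pytest

def pytest_addoption(parser):
    parser.addoption("--release", action="store_true", default=False,
                     help="also run tests marked as release")

def pytest_collection_modifyitems(config, items):
    if config.getoption("--release"):
        return  # run everything, including release-marked tests
    skip = pytest.mark.skip(reason="needs --release to run")
    for item in items:
        if "release" in item.keywords:
            item.add_marker(skip)
```

With this in place, `@pytest.mark.release` tests are skipped by default and only run under `pytest integration-tests --release`.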
-
Nicolas Patry authored
* Removing IPEX_AVAIL. Chose to unify CPU and XPU under `ipex`. Most of the code is nearly identical except for a very few spots, chiefly the kv-cache layout and the flash_xxx.py files. Since those files should be removed soon and factored away, we should not need them.
* Forgot a few places.
* Unrelated change.
* Fixing HF_TOKEN.
* HF_TOKEN
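A sketch of what unifying CPU and XPU under one system type might look like; `detect_system` is an illustrative helper, not the TGI function:

```python
import importlib.util
import torch

def detect_system() -> str:
    if torch.cuda.is_available():
        return "cuda"
    # Intel CPU and XPU share the same ipex code path after this refactor.
    if importlib.util.find_spec("intel_extension_for_pytorch") is not None:
        return "ipex"
    return "cpu"
```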
-
Lucain authored
* Support HF_TOKEN environment variable
* Load test.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
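The fallback itself is tiny; a sketch of the intended precedence (prefer the newer `HF_TOKEN`, keep accepting the legacy variable):

```python
import os

def get_hf_token():
    # HF_TOKEN wins; HUGGING_FACE_HUB_TOKEN is kept for backwards compatibility.
    return os.environ.get("HF_TOKEN") or os.environ.get("HUGGING_FACE_HUB_TOKEN")
```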
-
- 24 Jun, 2024 1 commit
-
-
Nicolas Patry authored
* New runner. Manual squash.
* Network host.
* Put back trufflehog with proper extension.
* No network host?
* Moving buildx install after tailscale?
* 1.79
-
- 19 Jun, 2024 1 commit
-
-
drbh authored
-
- 18 Jun, 2024 2 commits
-
-
Daniël de Kok authored
-
Guillaume LEGENDRE authored
* test local tailscale
* Update build.yaml
* Update build.yaml
* Update build.yaml
* Update build.yaml
* wait for ssh
* network host
* change step order
-
- 17 Jun, 2024 1 commit
-
-
Daniël de Kok authored
* Set maximum grpc message receive size to 2GiB. The previous default was 4MiB, which doesn't really work well for multi-modal models.
* Update to Rust 1.79.0
* Fixup formatting to make PR pass
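For reference, a sketch of bumping the limit on a Python gRPC server; 2**31 - 1 bytes (just under 2 GiB) is the largest value the int32-valued option accepts:

```python
from concurrent import futures
import grpc

MAX_MESSAGE = 2**31 - 1  # ~2 GiB instead of the 4 MiB default
server = grpc.server(
    futures.ThreadPoolExecutor(max_workers=1),
    options=[("grpc.max_receive_message_length", MAX_MESSAGE)],
)
```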
-
- 11 Jun, 2024 1 commit
-
-
drbh authored
* feat: support response_format in chat
* fix: adjust typos
* fix: add trufflehog lint
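A hypothetical request showing the shape of such a feature; the exact field names under `response_format` are an assumption, not the documented API:

```python
import json
import urllib.request

payload = {
    "model": "tgi",
    "messages": [{"role": "user", "content": "Name a color as JSON."}],
    # Assumed schema-constrained format; check the docs for the real contract.
    "response_format": {
        "type": "json_object",
        "value": {"type": "object", "properties": {"color": {"type": "string"}}},
    },
}
req = urllib.request.Request(
    "http://localhost:3000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # requires a running server
```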
-
- 10 Jun, 2024 2 commits
-
-
Luc Georges authored
-
Luc Georges authored
-
- 07 Jun, 2024 1 commit
-
-
Daniël de Kok authored
The router now sends the input as chunks in addition to a single string. This change modifies the server to process chunked input rather than strings. This also allows us to remove the image extraction code from the server.
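A sketch of what chunked processing buys: the server can dispatch on typed chunks instead of parsing images out of a single string. The chunk types and processor callables here are illustrative:

```python
from dataclasses import dataclass

@dataclass
class TextChunk:
    text: str

@dataclass
class ImageChunk:
    data: bytes
    mimetype: str

def tokenize_chunks(chunks, tokenize_text, process_image):
    ids = []
    for chunk in chunks:
        if isinstance(chunk, TextChunk):
            ids.extend(tokenize_text(chunk.text))
        elif isinstance(chunk, ImageChunk):
            # No more regex-extracting images from the prompt string.
            ids.extend(process_image(chunk.data, chunk.mimetype))
    return ids
```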
-
- 06 Jun, 2024 3 commits
-
-
Nicolas Patry authored
This reverts commit 101ac9a7.
-
Nicolas Patry authored
-
Nicolas Patry authored
-
- 04 Jun, 2024 1 commit
-
-
Nicolas Patry authored
Making `make install` a much saner default for starting local dev environments.
-
- 22 May, 2024 1 commit
-
-
Nicolas Patry authored
-