- 25 Jun, 2024 10 commits
-
-
Daniël de Kok authored
* Add pytest release marker. Annotate a test with `@pytest.mark.release` and it only gets run with `pytest integration-tests --release`.
* Mark many models as `release` to speed up CI.
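The marker setup described in this commit might look roughly like the following `conftest.py` sketch. Only the `--release` flag and the `release` marker name come from the commit message; the hook wiring is an illustrative assumption.

```python
# Sketch of an integration-tests/conftest.py supporting a `--release` flag;
# illustrative, not the repository's actual conftest.
def pytest_addoption(parser):
    parser.addoption(
        "--release", action="store_true", default=False,
        help="also run tests marked with @pytest.mark.release",
    )

def pytest_configure(config):
    config.addinivalue_line(
        "markers", "release: slow tests that only run for releases"
    )

def select_outcome(has_release_marker: bool, release_flag: bool) -> str:
    """Decide whether a collected test runs or is skipped."""
    return "skip" if has_release_marker and not release_flag else "run"

def pytest_collection_modifyitems(config, items):
    import pytest  # deferred so this sketch imports even without pytest
    if config.getoption("--release"):
        return
    skip = pytest.mark.skip(reason="release test: pass --release to run")
    for item in items:
        if select_outcome("release" in item.keywords, False) == "skip":
            item.add_marker(skip)
```

With this in place, `pytest integration-tests` skips every `release`-marked test, while `pytest integration-tests --release` runs the full suite.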
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Nicolas Patry authored
* Removing IPEX_AVAIL. Chose to unify CPU and XPU under `ipex`. Most of the code is identical except for a very few spots; most of those are the kv-cache layout and the flash_xxx.py files. Since those files should be removed soon and factored away, we should not need them.
* Forgot a few places.
* Unrelated change.
* Fixing HF_TOKEN.
* HF_TOKEN
-
drbh authored
* feat: add simple tests for weights
* fix: adjust types and add tests
* fix: adjust so all tests pass
* feat: improve weight tests
* fix: add missing tests and renames
* fix: tweak shapes
-
Wang, Yi authored
* add CPU tgi support
* ipex distributed ops support

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Funtowicz Morgan <mfuntowicz@users.noreply.github.com>
-
sunxichen authored
Fix ChatCompletion and ChatCompletionChunk object strings not being compatible with the standard OpenAI API (#2089).

Co-authored-by: sunxichen <sun.xc@digitalcnzz.com>
-
Wang, Yi authored
* Use xpu-smi to dump used memory. XPU uses "ZE_AFFINITY_MASK" to control card selection; usage is like CUDA_VISIBLE_DEVICES.
* Update server/text_generation_server/utils/import_utils.py

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
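The CUDA/XPU parallel this commit draws can be captured in a tiny helper. The variable names are real; the helper itself is illustrative, not TGI code.

```python
def visible_devices_env(backend: str) -> str:
    """Return the environment variable that restricts which accelerator
    cards a process sees. As the commit notes, XPU's ZE_AFFINITY_MASK
    plays the same role as CUDA_VISIBLE_DEVICES."""
    names = {
        "cuda": "CUDA_VISIBLE_DEVICES",
        "xpu": "ZE_AFFINITY_MASK",
    }
    return names[backend]
```

For example, launching with `ZE_AFFINITY_MASK=0,1` restricts an XPU process to cards 0 and 1, just as `CUDA_VISIBLE_DEVICES=0,1` would on NVIDIA hardware.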
-
Jeff authored
* Corrected Pydantic warning.
* Update clients/python/text_generation/types.py

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
-
KevinDuffy94 authored
* Adding Service Name environment variable for https://github.com/huggingface/text-generation-inference/issues/2069
* Update Docs
* Update README.md
* Update Launcher Docs
* Update Launcher Docs: removing option
-
Lucain authored
* Support HF_TOKEN environment variable.
* Load test.

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
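Token resolution along the lines of this commit could look like the sketch below. The two variable names are the real Hugging Face ones; the helper and its precedence order are assumptions, not TGI's actual code.

```python
import os

def resolve_hf_token(env=None):
    """Sketch: prefer the newer HF_TOKEN variable, falling back to the
    legacy HUGGING_FACE_HUB_TOKEN name; returns None if neither is set."""
    env = os.environ if env is None else env
    return env.get("HF_TOKEN") or env.get("HUGGING_FACE_HUB_TOKEN")
```

Supporting both names lets existing deployments keep their old configuration while new ones use the shorter `HF_TOKEN`.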
-
- 24 Jun, 2024 2 commits
-
-
ur4t authored
* Fix cargo-chef prepare. In the prepare stage, cargo-chef reads Cargo.lock and transforms it accordingly. If Cargo.lock is not present, cargo-chef will generate a new one first, which might vary a lot and invalidate Docker build caches.
* Fix Dockerfile_amd and Dockerfile_intel
-
Nicolas Patry authored
* New runner. Manual squash.
* Network host.
* Put back trufflehog with proper extension.
* No network host?
* Moving buildx install after tailscale?
* 1.79
-
- 21 Jun, 2024 2 commits
-
-
drbh authored
-
Daniël de Kok authored
The subcommand did not work due to some broken imports.
-
- 20 Jun, 2024 2 commits
-
-
Daniël de Kok authored
For Phi-3-Small I need to shard a packed QKV bias tensor, for which I implemented the `Weights.get_packed_sharded` method. However, this method can also replace the `Weights._get_qweight` method and the custom sharding code from `Weights.get_weights_col_packed`.
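A toy version of that packed sharding, with a flat list of rows standing in for the tensor: each of the Q, K and V blocks is sharded independently so every tensor-parallel rank gets a contiguous slice of each block. The function name and signature are assumptions; the real `Weights.get_packed_sharded` operates on torch tensors.

```python
def shard_packed_qkv(packed, q_size, kv_size, rank, world_size):
    """Split a packed [Q; K; V] weight (here: a flat list of rows) into
    the rows belonging to one tensor-parallel shard. Q has q_size rows,
    K and V have kv_size rows each."""
    q = packed[:q_size]
    k = packed[q_size:q_size + kv_size]
    v = packed[q_size + kv_size:q_size + 2 * kv_size]

    def slice_block(block):
        # Each block is sharded evenly across ranks.
        assert len(block) % world_size == 0
        step = len(block) // world_size
        return block[rank * step:(rank + 1) * step]

    # Re-pack the per-rank slices in the same Q/K/V order.
    return slice_block(q) + slice_block(k) + slice_block(v)
```

Naively slicing the packed tensor as one block would mix Q rows from one shard with K/V rows from another, which is exactly what sharding each block separately avoids.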
-
Daniël de Kok authored
Fixes #2081.
-
- 19 Jun, 2024 1 commit
-
-
drbh authored
-
- 18 Jun, 2024 2 commits
-
-
Daniël de Kok authored
-
Guillaume LEGENDRE authored
* test local tailscale
* Update build.yaml
* Update build.yaml
* Update build.yaml
* Update build.yaml
* wait for ssh
* network host
* change step order
-
- 17 Jun, 2024 4 commits
-
-
Daniël de Kok authored
* Set maximum gRPC message receive size to 2 GiB. The previous default was 4 MiB, which doesn't really work well for multi-modal models.
* Update to Rust 1.79.0
* Fixup formatting to make PR pass
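In grpc-python terms, raising the limit looks like the sketch below. The option keys are real gRPC channel arguments; how TGI wires them into its server is simplified away, and since message lengths are 32-bit integers, "2 GiB" in practice means the int32 maximum.

```python
# Hedged sketch: lifting gRPC's receive limit from the 4 MiB default.
MAX_MESSAGE_SIZE = (1 << 31) - 1  # int32 max, i.e. just under 2 GiB

GRPC_OPTIONS = [
    ("grpc.max_receive_message_length", MAX_MESSAGE_SIZE),
    ("grpc.max_send_message_length", MAX_MESSAGE_SIZE),
]

# Usage (assumes grpcio is installed):
#   server = grpc.aio.server(options=GRPC_OPTIONS)
```

Large image tensors in multi-modal prefill requests easily exceed 4 MiB, which is why the default was too small.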
-
Ziru Niu authored
-
Lysandre Debut authored
* Contributing guide & Code of Conduct * Redirect to GitHub's tutorial on PRs
-
Daniël de Kok authored
When a batch contained images of different sizes during prefill, the server would fail (see e.g. #2056). Images were processed separately and then concatenated. However, this can fail for images with different sizes. Fix this by preprocessing all images in the batch together, so that the image processor can ensure that all image tensors have compatible sizes.
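The idea behind the fix can be sketched with nested lists standing in for pixel tensors: pad every image in the batch to the largest height and width so the results stack cleanly. This is illustrative only; the real code delegates the padding to the model's image processor.

```python
def batch_images(images, pad_value=0):
    """Pad a batch of 2-D "images" (lists of row lists) to a common
    height and width so they can be stacked into one batch tensor."""
    max_h = max(len(img) for img in images)
    max_w = max(len(row) for img in images for row in img)
    batch = []
    for img in images:
        # Pad each row to max_w, then add all-pad rows up to max_h.
        padded = [row + [pad_value] * (max_w - len(row)) for row in img]
        padded += [[pad_value] * max_w for _ in range(max_h - len(padded))]
        batch.append(padded)
    return batch
```

Processing images one at a time and concatenating afterwards has no opportunity to do this joint padding, which is why mixed-size batches failed.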
-
- 14 Jun, 2024 3 commits
-
-
Alvaro Moran authored
* doc: adding architecture document
* doc: add architecture to toctree
* fix: avoid cargo lock changes
* fix: avoid cargo lock tweak

Co-authored-by: drbh <david.richard.holtz@gmail.com>
-
Tiezhen WANG authored
* Update the link for qwen2
* Fix Qwen2 model URL in model table
* Fix too eager staging

Co-authored-by: Daniël de Kok <me@danieldk.eu>
-
Daniël de Kok authored
Add support for GPTQ Marlin kernels. GPTQ Marlin extends the Marlin kernels to support common GPTQ configurations:
- bits: 4 or 8
- groupsize: -1, 32, 64, or 128
- desc_act: true/false

Using the GPTQ Marlin kernels requires repacking the parameters in the Marlin quantizer format. The kernels were contributed by Neural Magic to vLLM. We vendor them here for convenience.
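The supported combinations can be captured in a tiny guard. This helper is illustrative, not TGI's actual gating code; only the listed values come from the commit message.

```python
def gptq_marlin_compatible(bits: int, groupsize: int, desc_act: bool) -> bool:
    """True if a GPTQ checkpoint config matches a combination the Marlin
    kernels support. desc_act is accepted either way, so it does not
    restrict anything here."""
    del desc_act  # both True and False are supported
    return bits in (4, 8) and groupsize in (-1, 32, 64, 128)
```

A check like this lets the loader fall back to the plain GPTQ path for configurations (e.g. 3-bit, or unusual group sizes) that Marlin cannot handle.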
-
- 13 Jun, 2024 2 commits
-
-
drbh authored
* feat: add kserve feature and basic routes
* feat: implement infer endpoint wrapper around generate
* fix: refactor and improve types
* fix: improve infer and simplify
* fix: cleanup and improve api docs
* fix: refactor and encapsulate kserve feat in file
* fix: remove typos after rebase
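A client request to an infer endpoint of this kind might look like the sketch below. The envelope shape (`inputs`, `name`, `shape`, `datatype`, `data`) follows the KServe v2 "Open Inference Protocol"; the input name `text_input` and how TGI maps it onto generate are assumptions.

```python
import json

def build_v2_infer_request(prompt: str) -> str:
    """Sketch of a KServe v2 infer request body wrapping a text prompt.
    Illustrative only; not taken from TGI's kserve module."""
    body = {
        "inputs": [
            {
                "name": "text_input",   # assumed input name
                "shape": [1],
                "datatype": "BYTES",    # v2 datatype for string payloads
                "data": [prompt],
            }
        ]
    }
    return json.dumps(body)
```

Such a body would typically be POSTed to a path like `/v2/models/{model}/infer`, per the v2 protocol.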
-
drbh authored
* Use minijinja's pycompat mode for python methods
* fix: cargo fmt lint for pre-commit

Co-authored-by: Armin Ronacher <armin.ronacher@active-4.com>
-
- 12 Jun, 2024 2 commits
-
-
OlivierDehaene authored
* fix(layers): fix SuRotaryEmbedding
* change arange
* remove logs
-
OlivierDehaene authored
-
- 11 Jun, 2024 2 commits
- 10 Jun, 2024 4 commits
-
-
Luc Georges authored
-
Luc Georges authored
-
Daniël de Kok authored
Add support for Phi-3-medium. The main difference between the medium and mini models is that medium uses grouped-query attention with a packed QKV matrix. This change adds support for GQA with packed matrices to `Weights.get_weights_col_packed` and uses it for Phi-3. This also allows us to remove the custom implementation of GQA from dbrx attention loading.
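The arithmetic behind sharding a GQA-packed QKV matrix is simple: query heads and the (fewer) key/value heads are divided across ranks independently. A sketch of that arithmetic, with an assumed function name, not the actual implementation:

```python
def gqa_packed_shard_rows(num_heads, num_kv_heads, head_dim, world_size):
    """Rows of the packed [Q; K; V] matrix that each tensor-parallel
    rank owns under grouped-query attention."""
    assert num_heads % world_size == 0
    assert num_kv_heads % world_size == 0
    q_rows = (num_heads // world_size) * head_dim
    kv_rows = (num_kv_heads // world_size) * head_dim
    # Each rank holds its Q slice plus one K and one V slice.
    return q_rows, kv_rows, q_rows + 2 * kv_rows
```

For a hypothetical model with 32 query heads, 8 KV heads, head dimension 128, and 4-way tensor parallelism, each rank would own 1024 Q rows and 256 rows each of K and V.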
-
fxmarty authored
* update vllm commit & fix models using sliding window
* update
* update commit
* fix bug where tunableop is bound to cuda graph even when cuda graphs are disabled
* enable tunableop by default
* fix sliding window
* address review
* dead code
* precise comment
* is it flaky?
-
- 07 Jun, 2024 1 commit
-
-
Daniël de Kok authored
The router now sends the input as chunks rather than as a single string. This change modifies the server to process chunked input rather than strings. This also allows us to remove the image extraction code from the server.
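The server-side handling can be sketched as follows. The chunk representation (`(kind, payload)` tuples) is an assumption for illustration; the real chunks are protobuf messages.

```python
def flatten_chunks(chunks):
    """Split typed input chunks into text (for tokenization) and images
    (for the image processor). Sketch only, not TGI's actual code."""
    text_parts, images = [], []
    for kind, payload in chunks:
        if kind == "text":
            text_parts.append(payload)
        elif kind == "image":
            images.append(payload)
        else:
            raise ValueError(f"unknown chunk type: {kind}")
    return "".join(text_parts), images
```

With the router already splitting text from images, the server no longer needs to parse image references out of a single prompt string, which is the extraction code this commit removes.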
-
- 06 Jun, 2024 3 commits
-
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Nicolas Patry authored
This reverts commit 101ac9a7.
-
Nicolas Patry authored
-