1. 03 Dec, 2024 1 commit
    • Daniël de Kok's avatar
      Sync (most) server dependencies with Nix (#2782) · 2003d8be
      Daniël de Kok authored
      
      
      * Sync (most) server dependencies with Nix
      
      Skipped most grpcio packages, because of protobuf version
      incompatibility with the opentelemetry packages.
      
      * Add a primitive script to generate Poetry commands to sync with Nix
      
      This is not fully automated, since getting the Nix versions may be
      unresolvable. However, it does take most of the work out of doing
      this manually.
      
      * Upgrade eetq ?
      
      * Fmt.
      
      ---------
      Co-authored-by: default avatarNicolas Patry <patry.nicolas@protonmail.com>
      2003d8be
  2. 22 Nov, 2024 1 commit
  3. 20 Nov, 2024 1 commit
  4. 19 Nov, 2024 1 commit
  5. 18 Nov, 2024 1 commit
    • Daniël de Kok's avatar
      Add support for compressed-tensors w8a8 int checkpoints (#2745) · 3c9df21f
      Daniël de Kok authored
      
      
      * Add support for compressed-tensors w8a8 int checkpoints
      
      This change adds a loader for w8a8 int checkpoints. One large benefit of
      int8 support is that the corresponding cutlass matmul kernels also work on
      compute capability 7.5.
      
      Evaluation on neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a8:
      
      |     Tasks     |Version|     Filter     |n-shot|        Metric         |   |Value |   |Stderr|
      |---------------|------:|----------------|-----:|-----------------------|---|-----:|---|------|
      |gsm8k_cot_llama|      3|flexible-extract|     8|exact_match            |↑  |0.8431|±  |0.0100|
      |               |       |strict-match    |     8|exact_match            |↑  |0.8393|±  |0.0101|
      |ifeval         |      4|none            |     0|inst_level_loose_acc   |↑  |0.8597|±  |   N/A|
      |               |       |none            |     0|inst_level_strict_acc  |↑  |0.8201|±  |   N/A|
      |               |       |none            |     0|prompt_level_loose_acc |↑  |0.7967|±  |0.0173|
      |               |       |none            |     0|prompt_level_strict_acc|↑  |0.7468|±  |0.0187|
      
      Which is the same ballpark as vLLM.
      
      As usual, lots of thanks to Neural Magic/vLLM for the kernels.
      
      * Always use dynamic input quantization for w8a8 int
      
      It's far less flaky and gives better output.
      
      * Use marlin-kernels 0.3.5
      
      * Fix a typo
      Co-authored-by: default avatardrbh <david.richard.holtz@gmail.com>
      
      * Small fixes
      
      ---------
      Co-authored-by: default avatardrbh <david.richard.holtz@gmail.com>
      3c9df21f
  6. 17 Nov, 2024 1 commit
    • Daniël de Kok's avatar
      Remove vLLM dependency for CUDA (#2751) · 52e48739
      Daniël de Kok authored
      * Remove vLLM dependency for CUDA
      
      This change adds `attention-kernels` as a dependency for paged
      attention and cache reshaping. With that, we don't use vLLM
      anywhere for CUDA.
      
      Tested run (since we don't have paged attention in CI):
      
      ```
      ❯ ATTENTION=paged python -m pytest integration-tests -k "llama and awq" --release
      [...]
      5 snapshots passed.
      ```
      
      * Fix clippy warning
      52e48739
  7. 15 Nov, 2024 1 commit
  8. 10 Nov, 2024 1 commit
    • Daniël de Kok's avatar
      Add initial support for compressed-tensors checkpoints (#2732) · a7850008
      Daniël de Kok authored
      compressed-tensors is a safetensors extension for sparse, quantized
      tensors. The format is more powerful than earlier AWQ/GPTQ/FP8
      quantization, because
      
      - Different quantizer configurations can be used for different targets.
      - The format can specify input/output quantizers in addition to weight
        quantizers.
      - Configurable exclusions for quantization.
      
      This change adds a dependency on the `compressed-tensors` package for
      its configuration parsing and layer matching functionality.
      
      The following types of quantization are supported in this PR:
      
      - W8A16 and W4A16 INT using GPTQ-Marlin kernels.
      - W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels.
      
      Support for other quantization types will be added in subsequent PRs.
      a7850008
  9. 25 Oct, 2024 1 commit
  10. 24 Oct, 2024 1 commit
    • Daniël de Kok's avatar
      Add support for FP8 KV cache scales (#2628) · eab07f74
      Daniël de Kok authored
      * Add support for FP8 KV cache scales
      
      Since FP8 only has limited dynamic range, we can scale keys/values
      before storing them into the cache (and unscale them in attention). To
      avoid rescaling the cache as the absmax values change, good scales are
      usually determined per layer using calibration calibration data and stored
      in the checkpoint.
      
      This change adds support for for using key-value scales and loading them
      from checkpoints in the two most common formats:
      
      - Separate per-layer `k_scale` and `v_scale` scalars.
      - Per-layer `kv_scale` scalar (older format).
      
      Currently, scales are only used with an `float8_e4m3fn` cache.
      
      Besides adding support for key/value scales, the `fp8_quantize` function
      is also extended to support quantization with a kernel vendored from
      vLLM. This is slightly faster than the PyTorch implementation, but also
      scales in FP32, potentially improving accuracy.
      
      * Update FP8 KV cache test to use checkpoint with scales
      
      * `can_scale`: check that the attention is flashinfer
      eab07f74
  11. 09 Oct, 2024 1 commit
    • Daniël de Kok's avatar
      nix: add black and isort to the closure (#2619) · 9ed0c85f
      Daniël de Kok authored
      To make sure that everything is formatted with the same black version
      as CI.
      
      I sometimes use isort for new files to get nicely ordered imports,
      so add it as well. Also set the isort configuration to format in a
      way that is compatible with black.
      9ed0c85f
  12. 08 Oct, 2024 1 commit
  13. 02 Oct, 2024 1 commit
    • Nicolas Patry's avatar
      Mllama flash version (#2585) · d18ed5cf
      Nicolas Patry authored
      * Working loading state.
      
      * Preprocessing.
      
      * Working state ? (Broke idefics1 temporarily).
      
      * Cleaner condition.
      
      * Fix idefics.
      
      * Updating config, removing TODO
      
      * Mllama
      
      * Ugrade transformers 4.45
      
      * Flashing mllama.
      
      * Starting to get there.
      
      * Working state.
      
      * Integrations tests for mllama (cutting to 10 tokens because there seems'
      to be instability after (meaning size of the batch matters.
      
      * Updating model link.
      
      * Earlier assert.
      
      * Fix vlm ?
      
      * remove log.
      
      * Force ignore all images but last.
      
      * Default dtype bfloat16.
      
      * Update integration test after switch to bf16.
      
      * Remove dead code.
      
      * Removed dead code.
      
      * Upgrade the flake to latest transformers/tokenizers
      
      * Move to hf tgi-nix
      
      * Upgrade to 0.5.0
      d18ed5cf
  14. 30 Sep, 2024 1 commit
  15. 19 Sep, 2024 1 commit
  16. 17 Sep, 2024 1 commit
    • Daniël de Kok's avatar
      Move to moe-kernels package and switch to common MoE layer (#2511) · ce85efa9
      Daniël de Kok authored
      * Move to moe-kernels package and switch to common MoE layer
      
      This change introduces the new `moe-kernels` package:
      
      - Add `moe-kernels` as a dependency.
      - Introduce a `SparseMoELayer` module that can be used by MoE
        models.
      - Port over Mixtral and Deepseek.
      
      * Make `cargo check` pass
      
      * Update runner
      ce85efa9
  17. 06 Sep, 2024 2 commits
  18. 15 Aug, 2024 1 commit
    • Nicolas Patry's avatar
      Fixing exl2 and other quanize tests again. (#2419) · 57b34958
      Nicolas Patry authored
      * Fixing exl2 and other quanize tests again.
      
      * Mark exl2 as non release (so CI tests them, needs to be removed latet).
      
      * Fixing exl2 (by disabling cuda graphs)
      
      * Fix quantization defaults without cuda graphs on exl2 (linked to new
      issues with it).
      
      * Removing serde override.
      
      * Go back to released exl2 and remove log.
      
      * Adding warnings for deprecated bitsandbytes + upgrade info to warn.
      57b34958
  19. 29 Jul, 2024 1 commit
  20. 23 Jul, 2024 3 commits
  21. 22 Jul, 2024 1 commit
    • Nicolas Patry's avatar
      Softcapping for gemma2. (#2273) · 6aeb6690
      Nicolas Patry authored
      * Softcapping for gemma2.
      
      * Less clutter.
      
      * No access to transformers config, only config_dict here.
      
      * 0.0 is the null value in the C++ API.
      6aeb6690
  22. 04 Jun, 2024 1 commit
    • Nicolas Patry's avatar
      Making `make install` work better by default. (#2004) · 8390e251
      Nicolas Patry authored
      # What does this PR do?
      
      Making `make install` a much better sane default to start local dev
      environments.
      
      <!--
      Congratulations! You've made it this far! You're not quite done yet
      though.
      
      Once merged, your PR is going to appear in the release notes with the
      title you set, so make sure it's a great title that fully reflects the
      extent of your awesome contribution.
      
      Then, please replace this with a description of the change and which
      issue is fixed (if applicable). Please also include relevant motivation
      and context. List any dependencies (if any) that are required for this
      change.
      
      Once you're done, someone will review your PR shortly (see the section
      "Who can review?" below to tag some potential reviewers). They may
      suggest changes to make the code even better. If no one reviewed your PR
      after a week has passed, don't hesitate to post a new comment
      @-mentioning the same persons---sometimes notifications get lost.
      -->
      
      <!-- Remove if not applicable -->
      
      Fixes # (issue)
      
      
      ## Before submitting
      - [ ] This PR fixes a typo or improves the docs (you can dismiss the
      other checks if that's the case).
      - [ ] Did you read the [contributor
      guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
            Pull Request section?
      - [ ] Was this discussed/approved via a Github issue or the
      [forum](https://discuss.huggingface.co/)? Please add a link
            to it if that's the case.
      - [ ] Did you make sure to update the documentation with your changes?
      Here are the
      [documentation
      guidelines](https://github.com/huggingface/transformers/tree/main/docs),
      and
      [here are tips on formatting
      docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
      - [ ] Did you write any new necessary tests?
      
      
      ## Who can review?
      
      Anyone in the community is free to review the PR once the tests have
      passed. Feel free to tag
      members/contributors who may be interested in your PR.
      
      <!-- Your PR will be replied to more quickly if you can figure out the
      right person to tag with @
      
      
      @OlivierDehaene OR @Narsil
      
       -->
      8390e251
  23. 24 May, 2024 1 commit
    • Nicolas Patry's avatar
      Fix seeded output. (#1949) · d32e33bd
      Nicolas Patry authored
      # What does this PR do?
      
      <!--
      Congratulations! You've made it this far! You're not quite done yet
      though.
      
      Once merged, your PR is going to appear in the release notes with the
      title you set, so make sure it's a great title that fully reflects the
      extent of your awesome contribution.
      
      Then, please replace this with a description of the change and which
      issue is fixed (if applicable). Please also include relevant motivation
      and context. List any dependencies (if any) that are required for this
      change.
      
      Once you're done, someone will review your PR shortly (see the section
      "Who can review?" below to tag some potential reviewers). They may
      suggest changes to make the code even better. If no one reviewed your PR
      after a week has passed, don't hesitate to post a new comment
      @-mentioning the same persons---sometimes notifications get lost.
      -->
      
      <!-- Remove if not applicable -->
      
      Fixes # (issue)
      
      
      ## Before submitting
      - [ ] This PR fixes a typo or improves the docs (you can dismiss the
      other checks if that's the case).
      - [ ] Did you read the [contributor
      guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
            Pull Request section?
      - [ ] Was this discussed/approved via a Github issue or the
      [forum](https://discuss.huggingface.co/)? Please add a link
            to it if that's the case.
      - [ ] Did you make sure to update the documentation with your changes?
      Here are the
      [documentation
      guidelines](https://github.com/huggingface/transformers/tree/main/docs),
      and
      [here are tips on formatting
      docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
      - [ ] Did you write any new necessary tests?
      
      
      ## Who can review?
      
      Anyone in the community is free to review the PR once the tests have
      passed. Feel free to tag
      members/contributors who may be interested in your PR.
      
      <!-- Your PR will be replied to more quickly if you can figure out the
      right person to tag with @
      
      
      @OlivierDehaene OR @Narsil
      
       -->
      d32e33bd
  24. 16 May, 2024 1 commit
  25. 30 Apr, 2024 1 commit
    • Nicolas Patry's avatar
      (chore): torch 2.3.0 (#1833) · dccab725
      Nicolas Patry authored
      # What does this PR do?
      
      <!--
      Congratulations! You've made it this far! You're not quite done yet
      though.
      
      Once merged, your PR is going to appear in the release notes with the
      title you set, so make sure it's a great title that fully reflects the
      extent of your awesome contribution.
      
      Then, please replace this with a description of the change and which
      issue is fixed (if applicable). Please also include relevant motivation
      and context. List any dependencies (if any) that are required for this
      change.
      
      Once you're done, someone will review your PR shortly (see the section
      "Who can review?" below to tag some potential reviewers). They may
      suggest changes to make the code even better. If no one reviewed your PR
      after a week has passed, don't hesitate to post a new comment
      @-mentioning the same persons---sometimes notifications get lost.
      -->
      
      <!-- Remove if not applicable -->
      
      Fixes # (issue)
      
      
      ## Before submitting
      - [ ] This PR fixes a typo or improves the docs (you can dismiss the
      other checks if that's the case).
      - [ ] Did you read the [contributor
      guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests),
            Pull Request section?
      - [ ] Was this discussed/approved via a Github issue or the
      [forum](https://discuss.huggingface.co/)? Please add a link
            to it if that's the case.
      - [ ] Did you make sure to update the documentation with your changes?
      Here are the
      [documentation
      guidelines](https://github.com/huggingface/transformers/tree/main/docs),
      and
      [here are tips on formatting
      docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).
      - [ ] Did you write any new necessary tests?
      
      
      ## Who can review?
      
      Anyone in the community is free to review the PR once the tests have
      passed. Feel free to tag
      members/contributors who may be interested in your PR.
      
      <!-- Your PR will be replied to more quickly if you can figure out the
      right person to tag with @
      
      
      @OlivierDehaene OR @Narsil
      
       -->
      dccab725
  26. 18 Apr, 2024 2 commits
  27. 12 Apr, 2024 1 commit
  28. 11 Apr, 2024 1 commit
  29. 29 Mar, 2024 1 commit
  30. 22 Mar, 2024 2 commits
  31. 21 Mar, 2024 1 commit
    • drbh's avatar
      fix: improve tool type, bump pydantic and outlines (#1650) · de6cb15f
      drbh authored
      This PR resolves a couple 
      
      - [X] adjusts the tool response to align with openai's tools response
      type
      - [X] bumps pydantic to `2.6.4` in all apps (resolves dependency issue
      when running tests)
      - [X] bump `outlines` version and fix import for new name
      de6cb15f
  32. 15 Mar, 2024 1 commit
  33. 28 Feb, 2024 1 commit
  34. 21 Feb, 2024 1 commit
  35. 16 Feb, 2024 1 commit