1. 28 Oct, 2024 1 commit
    • We can have a tokenizer anywhere. (#2527) · 90b226db
      Nicolas Patry authored
      * We can have a tokenizer anywhere.
      
      * Handling potential lack of offsets (python tokenizer)
      
      * Remove redundancy.
      
      * Fixing the tests.
      
      * Flake.lock update ?
      
      * Fixing the GIL locking.
      
      * Fixing mamba by using the transformers version.
      
      * Adding the legacy handle.
      
      * Elide lifetime.
      
      * Lint.
      
      * Deprecation message.
      
      * Fixing bad rebase.
  2. 16 Aug, 2024 1 commit
  3. 15 Aug, 2024 1 commit
  4. 25 Jun, 2024 1 commit
    • Add pytest release marker (#2114) · fc9c3153
      Daniël de Kok authored
      * Add pytest release marker
      
      Annotate a test with `@pytest.mark.release` and it only gets run
      with `pytest integration-tests --release`.
      
      * Mark many models as `release` to speed up CI
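A marker gated by a command-line flag, as described above, is commonly wired up through `conftest.py` hooks. The following is a minimal sketch of that pattern and not the code from this commit: the helper name `release_only_items` and the skip reason are hypothetical, while `pytest_addoption`, `pytest_configure`, and `pytest_collection_modifyitems` are standard pytest hooks.

```python
# conftest.py (illustrative sketch; not the repository's actual file)


def pytest_addoption(parser):
    # Register the --release flag; it is off by default.
    parser.addoption(
        "--release",
        action="store_true",
        default=False,
        help="also run tests marked with @pytest.mark.release",
    )


def pytest_configure(config):
    # Declare the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line("markers", "release: test only runs with --release")


def release_only_items(items, release_enabled):
    """Hypothetical helper: return the collected items that should be skipped."""
    if release_enabled:
        return []
    return [item for item in items if "release" in item.keywords]


def pytest_collection_modifyitems(config, items):
    # Imported here so the helper above stays usable without pytest installed.
    import pytest

    skip_release = pytest.mark.skip(reason="needs --release to run")
    for item in release_only_items(items, config.getoption("--release")):
        item.add_marker(skip_release)
```

Under this sketch, `pytest integration-tests` skips every test decorated with `@pytest.mark.release`, and `pytest integration-tests --release` runs them along with the rest.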
  5. 21 Feb, 2024 1 commit
  6. 16 Feb, 2024 1 commit
  7. 14 Feb, 2024 1 commit
    • Improving mamba runtime by using updates (#1552) · d6b0fb9e
      Nicolas Patry authored
      - Move float16 to bfloat16, which shows fewer precision problems here
        (load tests fail with the update kernels + f16; everything works
        under bf16).

        Note also that we are not respecting the f32 layer norm defined in
        the configuration (this is OK in my book, but it could impact f16
        precision).

      - Moved to the update kernels. Triton overhead is very high; switching
        to CUDA graphs removes it and works great (an update CUDA graph is
        available in TRT-LLM if needed; it seems *exactly* like the regular
        SSM kernel).

      - Reworked the inference_params struct so that it holds only 2 tensors,
        to reduce the overhead of copying back and forth to the CUDA graphs.

      - The leftover overhead seems to be entirely in the tokenization bit
        (4 copies are still paid before launching the graph).
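The commit does not say why f16 fails where bf16 works. As general background only, the sketch below shows the numeric trade-off between the two formats using the standard library: float16 has more mantissa bits but overflows above 65504, while bfloat16 keeps the float32 exponent range with a coarser mantissa. The helper names are hypothetical, and `to_bfloat16` truncates rather than rounds to nearest, which is a simplification.

```python
import struct


def to_float16(x: float) -> float:
    """Round-trip x through IEEE half precision; return inf when out of range."""
    try:
        return struct.unpack(">e", struct.pack(">e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the float16 range.
        return float("inf")


def to_bfloat16(x: float) -> float:
    """Round-trip x through bfloat16 by keeping the top 16 bits of its
    float32 encoding (truncation, not round-to-nearest)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


# Large values: float16 overflows, bfloat16 stays finite but coarse.
large_f16 = to_float16(70000.0)
large_bf16 = to_bfloat16(70000.0)

# Values near 1: float16 resolves finer steps than bfloat16.
small_f16 = to_float16(1.001)
small_bf16 = to_bfloat16(1.001)
```

Whether range or mantissa width is what matters for the update kernels is an open question here; the sketch only illustrates that the two formats fail in different ways.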
      
      
  8. 08 Feb, 2024 2 commits
    • OlivierDehaene's avatar
      09b7c26b
    • Impl simple mamba model (#1480) · bd405e03
      drbh authored
      This draft PR is a work-in-progress implementation of the Mamba model.
      It currently loads weights and produces correct logits after a single
      pass.

      The model still needs to be integrated correctly so that it produces
      tokens as expected, and optimized to avoid copies and unnecessary
      operations at runtime.
      
      #### Helpful resources
      [Mamba: Linear-Time Sequence Modeling with Selective State Spaces
      (Albert Gu and Tri Dao)](https://arxiv.org/abs/2312.00752)
      https://github.com/johnma2006/mamba-minimal
      
      https://github.com/huggingface/candle/blob/main/candle-examples/examples/mamba-minimal/model.rs
      https://github.com/huggingface/transformers/pull/28094
      
      
      
      Notes: this dev work currently targets `state-spaces/mamba-130m`, so
      please use that model if you want to test. Additionally, when starting
      the router, the prefill needs to be limited:
      `cargo run -- --max-batch-prefill-tokens 768 --max-input-length 768`
      
      
      ## Update / Current State
      
      Integration tests have been added and basic functionality such as model
      loading is supported.
      
      ```bash
      cd integration-tests
      pytest -vv models/test_fused_kernel_mamba.py
      ```
      - [x] add tests
      - [x] load model
      - [x] make simple request 
      - [ ] resolve warmup issue
      - [ ] resolve output issues
      
      
      fetching models tested during dev
      ```bash
      text-generation-server download-weights state-spaces/mamba-130m
      text-generation-server download-weights state-spaces/mamba-1.4b
      text-generation-server download-weights state-spaces/mamba-2.8b
      ```
      
      The server can be run 
      ```bash
      cd server
      MASTER_ADDR=127.0.0.1 MASTER_PORT=5555 python text_generation_server/cli.py serve state-spaces/mamba-2.8b
      ```
      
      router
      ```bash
      cargo run
      ```
      
      make a request
      ```bash
      curl -s localhost:3000/generate \
          -X POST \
          -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
          -H 'Content-Type: application/json' | jq
      ```
      
      response
      ```json
      {
        "generated_text": "\n\nDeep learning is a machine learning technique that uses a deep neural network to learn from data."
      }
      ```
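The same request can be sketched in Python with only the standard library. This assumes the router is listening on `localhost:3000` as in the curl example; the function names `build_generate_request` and `generate` are hypothetical and not part of the project.

```python
import json
import urllib.request

# Assumption: the router listens on localhost:3000, as in the curl example.
ROUTER_URL = "http://localhost:3000/generate"


def build_generate_request(prompt: str, max_new_tokens: int = 20) -> urllib.request.Request:
    """Build the POST request without sending it (hypothetical helper)."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        ROUTER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, max_new_tokens: int = 20) -> str:
    """Send the request and return the `generated_text` field of the response."""
    request = build_generate_request(prompt, max_new_tokens)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["generated_text"]
```

With the server and router running, `generate("What is Deep Learning?")` should return the text shown in the response above.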
      
      ---------
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>