1. 25 Oct, 2024 2 commits
    • [TENSORRT-LLM] - Implement new looper thread based backend (#2357) · 43df056e
      Funtowicz Morgan authored
      
      
      * (backend) use parking_lot crate for RwLock fairness
      
      * (launcher) default new server::run parameters to false for now
      
      * (chore) fmt ... why?
      
      * (ffi) use const for GetSamplingConfig
      
      * (server) expose new SchedulingError
      
      * (trt)
      
      * (build) setup ccache if available
      
      * (ffi) add max_new_tokens parameters
      
      * (backend) cleanup a bit
      
      * (backend) expose PullNewTokens
      
      * (ffi) cleanup again
      
      * (ffi) add missing headers imports
      
      * (ffi) add template specialization to catch and convert to Rust Result<T, tensorrt_llm::common::TllmException>
      
      * (looper) new looper initial implementation
      
      * (ffi) remove narrowing type warning
      
      * (ffi) encode the provided user prompt within each request thread
      
      * (misc) change scope identifiers
      
      * (backend) implement the post_processor background thread
      
      * (misc) missing Result types for Rust
      
      * use blocking_recv in the looper to drain awaiting_requests as much as possible before pulling in a single step (see the sketch at the end of this entry)
      
      * (server) forward auth_token to server::run
      
      * (build) fetchcontent use archives instead of git
      
      * (ffi) fix usage of wrong vector constructor making a capacity fill call
      
      * (ffi) missing namespace for tle::Response
      
      * (ffi) do not use reference capture in lambda as we are not capturing anything
      
      * (backend) refactor & cleanup
      
      * (Dockerfile.trtllm) delete for now
      
      * (misc) simplify [make_]move_iterator by using c++20 type inference
      
      * (misc) no need to move for uint32_t items
      
      * (scheduler) rework submit/pull logic
      
      * (post) impl postprocessing
      
      * (misc) delete backend.rs
      
      * (misc) rerun-if-changed all the cmake modules
      
      * (misc) move to latest trtllm
      
      * (fix): HOPPER_SM_MAJOR is 9 not 8
      
      * (misc): build for sm_{75,80,86,89,90} by default
      
      * (misc): build with trtllm 0.13.0
      
      * (misc): increase verbosity of spdlog
      
      * (fix): do not recreate the stateful hashmap at every iteration
      
      * (misc): update dependency in trtllm dockerfile
      
      * (misc): update dependency in trtllm dockerfile
      
      * (misc): disable logging in release mode
      
      * (misc): improve trtllm download script robustness
      
      * (fix): more fixes for Dockerfile
      
      * misc(cuda): require 12.6
      
      * chore(cmake): use correct policy for download_timestamp
      
      * feat(looper): check engine and executorWorker paths exist before creating the backend
      
      * chore(cmake): download timestamp should be before URL
      
      * feat(looper): minor optimizations to avoid growing the containers too much
      
      * chore(trtllm): move dockerfile to right place
      
      * chore(trtllm): disable tokenizer parallelism by default
      
      * chore(trtllm): fmt
      
      * chore(trtllm): post-rebase commit
      
      * chore(trtllm): remove unused method
      
      * feat(trtllm): cache maxNumTokens to avoid calling JSON every time
      
      * misc(router): remove SchedulingError
      
      * feat(trtllm): do not tokenize twice
      
      * Revert "chore(trtllm): remove unused method"
      
      This reverts commit 31747163
      
      * chore(rebase): fix invalid references
      
      * chore(router): add python dependency
      
      * Lint.
      
      * Fix bad rebase
      
      ---------
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
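      A minimal sketch of the intake pattern named in the blocking_recv bullet above, assuming a tokio mpsc channel as a dependency; the `Request` type and function names are hypothetical, not the backend's actual API. The idea: block until at least one request is waiting, then opportunistically drain everything already queued so a single scheduling step submits as large a batch as possible.
      
      ```
      use tokio::sync::mpsc::{unbounded_channel, UnboundedReceiver};
      
      // Hypothetical stand-in for a queued generation request.
      struct Request(u32);
      
      fn looper(mut awaiting_requests: UnboundedReceiver<Request>) {
          // Block until at least one request arrives...
          while let Some(first) = awaiting_requests.blocking_recv() {
              let mut batch = vec![first];
              // ...then drain whatever else is already queued, without blocking.
              while let Ok(req) = awaiting_requests.try_recv() {
                  batch.push(req);
              }
              step(&batch); // one executor step for the whole drained batch
          }
      }
      
      fn step(batch: &[Request]) {
          println!("submitting {} request(s) in a single step", batch.len());
      }
      
      fn main() {
          let (tx, rx) = unbounded_channel();
          for i in 0..3 {
              tx.send(Request(i)).unwrap();
          }
          drop(tx); // close the channel so the looper loop exits
          looper(rx);
      }
      ```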
    • Nicolas Patry authored · ed87b464
  2. 23 Oct, 2024 1 commit
  3. 16 Oct, 2024 1 commit
    • feat: prefill chunking (#2600) · a6a0c97e
      OlivierDehaene authored
      
      
      * wip
      
      * rollback
      
      * refactor to use prefix/postfix naming + fix all_input_ids_tensor
      
      * maybe patching vlms?
      
      * fix filter and concat
      
      * wip, no filter, no concat
      
      * current
      
      * add prepare_for_prefill
      
      * working
      
      * load tested
      
      * re-create slots
      
      * re-create slots
      
      * fix slot_filtering_indices
      
      * feedback loop
      
      * remove log
      
      * fix benchmarker
      
      * fix vlm and seq2seq
      
      * rename to cache and input lengths
      
      * fix prefill logprobs
      
      * fix launcher
      
      * fix logprobs?
      
      * idk at this point
      
      * max input length
      
      * omfg
      
      * remove debugging lines
      
      * fix tests
      
      * fix mllama
      
      * fix cargo tests
      
      * remove support chunking for paged
      
      * Fixing non blocked attentions
      
      * Fixing dtype + AMD, Ipex targets.
      
      * lint fix.
      
      * rename
      
      * Fix prefix_caching variable, remove defaults in server (confusing a lot
      of the time).
      
      * Add simple resolution when user specifies ATTENTION=paged.
      
      * Put back non default simple tests.
      
      * Fix env name
      
      ---------
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
  4. 02 Oct, 2024 1 commit
  5. 24 Sep, 2024 2 commits
  6. 16 Sep, 2024 1 commit
    • Adding a test for FD. (#2516) · 38fcafcf
      Nicolas Patry authored
      * Adding a test for FD.
      
      * Fixing flashdecoding (empty batch doesn't work).
      
      * Fixing the invalid popping.
      
      * Fixing radix with block_size > 1
      
      * Last reference.
      
      * Use an actual hash.
      
      * Update hash for slice.len() == 1
      
      * Update the locks.
      
      * Increasing docker timeout.
  7. 11 Sep, 2024 2 commits
    • Fix tokenization yi (#2507) · dae3bf1d
      Nicolas Patry authored
      * Fixing odd tokenization self modifications on the Rust side (load and
      resave in Python).
      
      * Fixing the builds ?
      
      * Fix the gh action?
      
      * Fixing the location ?
      
      * Validation is odd.
      
      * Try a faster runner
      
      * Upgrade python version.
      
      * Remove sccache
      
      * No sccache.
      
      * Getting libpython maybe ?
      
      * List stuff.
      
      * Monkey it up.
      
      * have no idea at this point
      
      * Tmp.
      
      * Shot in the dark.
      
      * Tmate the hell out of this.
      
      * Desperation.
      
      * WTF.
      
      * -y.
      
      * Apparently 3.10 is not available anymore.
      
      * Updating the dockerfile to make libpython discoverable at runtime too.
      
      * Put back rust tests.
      
      * Why do we want mkl on AMD ?
      
      * Forcing 3.11 ?
    • Prefix test - Different kind of load test to trigger prefix test bugs. (#2490) · a4e3e8c6
      Nicolas Patry authored
      
      
      * Adding prefix test.
      
      * [WIP] tmp dump of integration load tests.
      
      * Remove other tensor creation.
      
      * Fixed the radix tree.
      
      Used a slice everywhere in radix.rs to keep the cheap Arc cloning
      instead of recomputing the input_ids (see the sketch at the end of
      this entry).
      
      * Fix parsing
      
      * Is it really flashinfer version ?
      
      * Remove some comments.
      
      * Revert the max prefix hit.
      
      * Adding numpy to diff.
      
      * Upgraded flashinfer.
      
      * Upgrading some stuff.
      
      * Are we done yet ?
      
      * Minor fixup
      
      * Remove 1 log and put back the other.
      
      * Add comment for why slot 0 is OK.
      
      * Mounting on the job.
      
      * Get me a debug branch
      
      * Debugging CIs is fun.
      
      * Attempt #28
      
      * wip
      
      * Tmate.
      
      * Praying.
      
      * Updating VLM causal model with updated context.
      
      * Important line got squashed.
      
      * Tmate again.
      
      * Fingers crossed.
      
      * We want only 1 run of integration tests.....
      
      ---------
      Co-authored-by: Guillaume LEGENDRE <glegendre01@gmail.com>
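      A hedged sketch of the Arc-slicing idea from the radix.rs bullet above: keep one shared, immutable token buffer and represent any prefix as a range into it, so "slicing" is an Arc clone (a refcount bump) rather than a copy. The `TokenSlice` type is hypothetical, not the real radix.rs structure.
      
      ```
      use std::sync::Arc;
      
      /// Hypothetical view into a shared token buffer: cloning this only
      /// bumps the Arc refcount instead of copying the input_ids.
      #[derive(Clone)]
      struct TokenSlice {
          tokens: Arc<[u32]>,
          start: usize,
          end: usize,
      }
      
      impl TokenSlice {
          fn new(tokens: Arc<[u32]>) -> Self {
              let end = tokens.len();
              Self { tokens, start: 0, end }
          }
      
          /// A sub-slice shares the same allocation; no tokens are copied.
          fn slice(&self, start: usize, end: usize) -> Self {
              assert!(start <= end && self.start + end <= self.end);
              Self {
                  tokens: Arc::clone(&self.tokens),
                  start: self.start + start,
                  end: self.start + end,
              }
          }
      
          fn as_slice(&self) -> &[u32] {
              &self.tokens[self.start..self.end]
          }
      }
      
      fn main() {
          let input_ids: Arc<[u32]> = Arc::from(vec![10, 11, 12, 13, 14]);
          let all = TokenSlice::new(input_ids);
          let prefix = all.slice(0, 3); // cheap: shares the allocation
          assert_eq!(prefix.as_slice(), &[10, 11, 12]);
      }
      ```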
  8. 06 Sep, 2024 2 commits
  9. 05 Sep, 2024 1 commit
  10. 29 Aug, 2024 1 commit
    • Lots of improvements (Still 2 allocators) (#2449) · e415b690
      Nicolas Patry authored
      
      
      * Making prefix/flashinfer the default and testing the full release tests.
      
      * Include flashinfer in the docker.
      
      * Using prebuilt.
      
      * Allowing window_left_size (dummy version).
      
      * Disabling flashinfer/prefix caching on odd head_dim
      
      * Disable prefix caching for lora.
      
      * More specific codes.
      
      * Update lock
      
      * Updating integration tests with new values with FI/FD.
      
      Remove paged as a default too, and using FD everywhere.
      
      * Update cargo lock ?
      
      * Upgrade to 1.80 because of bitstream...
      
      * Everywhere 1.80
      
      * Forgot last default place.
      
      * Apply suggestions from code review
      Co-authored-by: drbh <david.richard.holtz@gmail.com>
      
      * Updated flake lock
      
      * Tmp
      
      * Upgrade resolution system for fewer errors in resolution.
      
      * Remove lambda for cleaner function.
      
      * Handling debugger.
      
      * Override the env in server tests.
      
      * Is this enough to make it work ?
      
      * This seems to be working.
      
      * Downgrade some logs.
      
      * Fixing the default for vlm.
      
      * Don't enable prefix caching on VLM just yet.
      
      * Change `add_special_tokens` in order to have the correct tokens for chat
      input (since it's super important with the prefixing now)
      
      * Fixing prefix caching for flashdecoding.
      
      * Update all models.
      
      * Fixed flashinfer version.
      
      * add_special_tokens is internal only
      
      * Fixing seqlen with the new vlms.
      
      * Fixing the issue with `add_special_tokens` not being passed around.
      
      * Fixing the test.
      
      * Removing encoder_decoder (seq2seq).
      
      * Update the chat test.
      
      * Fixing the batching tokenization in flash causal lm.
      
      * Truncating left for radix purposes.
      
      * Oops this doesn't belong here.
      
      * Put back default pure shell.
      
      * Update server tests
      
      - Default to throughput test in k6
      - Use TGI_WIGGLE_ROOM to adjust wiggle room
      
      * Only n_heads / process_group.size() are necessary.
      
      * Revert the integration tests change (seems linked to head_size
      modification).
      
      * Adding error message when assert is violated.
      
      * Fixing the free algorithm to handle cases where the common prefix is
      smaller.
      
      * Apply suggestions from code review
      Co-authored-by: OlivierDehaene <olivier@huggingface.co>
      
      * Update server/text_generation_server/layers/attention/common.py
      Co-authored-by: OlivierDehaene <olivier@huggingface.co>
      
      * Fix disabling prefix caching - Fix windowing checks.
      
      * Revert the Cohere tokenizer change (for now using a revision instead).
      
      * Fmt.
      
      ---------
      Co-authored-by: drbh <david.richard.holtz@gmail.com>
      Co-authored-by: OlivierDehaene <olivier@huggingface.co>
  11. 27 Aug, 2024 1 commit
  12. 20 Aug, 2024 1 commit
    • Prefix caching (#2402) · b70ae096
      Nicolas Patry authored
      
      
      * Prefix caching WIP
      
      * Fixing prefix attention.
      
      * Fixing flashinfer import.
      
      * Fixing black.
      
      * Fixing medusa (still wrong outputs, but functional).
      
      * Just medusa values now.
      
      * Fixing medusa without prefix caching.
      
      * Fixing prefix caching.
      
      * Medusa requires reshaping.
      
      * Removing the logs.
      
      * Remove router.nix
      
      * Fixup:
      
      - Remove logs
      - Disable VLMs (they do not work)
      - Disable prefix caching when user wants prefill logprobs.
      
      * Update flake.lock
      
      ---------
      Co-authored-by: Daniël de Kok <me@danieldk.eu>
  13. 14 Aug, 2024 1 commit
    • More fixes trtllm (#2342) · 3f385991
      Funtowicz Morgan authored
      * (backend) use parking_lot crate for RwLock fairness (see the sketch at the end of this entry)
      
      * (docker) let's put rust in the TRTLLM folder when building
      
      * (docker) build ompi with SLURM support
      
      * (launcher) default new server::run parameters to false for now
      
      * (chore) fmt ... why?
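      A minimal sketch of that swap, assuming the parking_lot crate as a dependency: its RwLock uses a fair locking policy that avoids reader and writer starvation (unlike some platform implementations behind std::sync::RwLock), and its guards are infallible, so no poisoning and no unwrap().
      
      ```
      use parking_lot::RwLock;
      
      fn main() {
          // Lock acquisition cannot fail (no poisoning), and writers are
          // not starved indefinitely by a continuous stream of readers.
          let state = RwLock::new(Vec::<u32>::new());
      
          state.write().push(42); // no .unwrap() needed
          println!("len = {}", state.read().len());
      }
      ```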
  14. 12 Aug, 2024 2 commits
    • Keeping the benchmark somewhere (#2401) · 136bcc81
      Nicolas Patry authored
      
      Co-authored-by: Daniël de Kok <me@danieldk.eu>
    • Add support for prefix caching to the v3 router (#2392) · 8deeaca4
      Daniël de Kok authored
      This change adds support for prefix caching to the v3 router. This
      is broken up from the backend support to ease reviewing.
      
      For now prefix caching is only enabled with `USE_PREFIX_CACHING=1`;
      in this case, the router switches to `RadixAllocator`. This
      allocator uses a radix trie to keep track of prefills that were
      seen previously. If a new prefill is a prefix of a previously-seen
      prefill, the router will send a request with `prefix_len>0`, which
      can be used by the backend to decide to reuse KV blocks from the
      cache, rather than recomputing them. A simplified sketch of the
      prefix lookup follows this entry.
      
      Even though backend support is not added in this PR, the backend
      will still work with prefix caching enabled. The prefix lengths
      are just ignored and not used.
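      A minimal sketch of the longest-prefix lookup described above, assuming whole-token granularity; the real `RadixAllocator` works over KV blocks and also handles allocation and eviction, and `PrefixIndex` here is a hypothetical name.
      
      ```
      use std::collections::HashMap;
      
      /// Trie over token ids; every node on a recorded path corresponds to
      /// a token whose KV entries were computed by some earlier prefill.
      #[derive(Default)]
      struct Node {
          children: HashMap<u32, Node>,
      }
      
      #[derive(Default)]
      struct PrefixIndex {
          root: Node,
      }
      
      impl PrefixIndex {
          /// Record the token ids of a prefill that was just computed.
          fn insert(&mut self, tokens: &[u32]) {
              let mut node = &mut self.root;
              for &t in tokens {
                  node = node.children.entry(t).or_default();
              }
          }
      
          /// Length of the longest previously-seen prefix of `tokens`; the
          /// router would send this as `prefix_len` so the backend can
          /// reuse the matching KV blocks instead of recomputing them.
          fn prefix_len(&self, tokens: &[u32]) -> usize {
              let mut node = &self.root;
              let mut len = 0;
              for &t in tokens {
                  match node.children.get(&t) {
                      Some(next) => node = next,
                      None => break,
                  }
                  len += 1;
              }
              len
          }
      }
      
      fn main() {
          let mut index = PrefixIndex::default();
          index.insert(&[1, 2, 3, 4]);
          // A new prefill sharing the [1, 2, 3] prefix gets prefix_len = 3.
          assert_eq!(index.prefix_len(&[1, 2, 3, 9]), 3);
          println!("ok");
      }
      ```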
  15. 09 Aug, 2024 2 commits
    • Using an enum for flash backends (paged/flashdecoding/flashinfer) (#2385) · 7a48a847
      Nicolas Patry authored
      * Using an enum for flash backends (paged/flashdecoding/flashinfer), see the sketch at the end of this entry
      
      * Early exit on server too.
      
      * Clippy.
      
      * Fix clippy and fmt.
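      A hypothetical sketch of the idea, not the actual TGI type: one enum instead of scattered boolean flags, parsed from the `ATTENTION` environment variable (the `flashinfer` default below is illustrative only).
      
      ```
      #[derive(Debug, Clone, Copy, PartialEq, Eq)]
      enum Attention {
          Paged,
          FlashDecoding,
          FlashInfer,
      }
      
      impl std::str::FromStr for Attention {
          type Err = String;
          fn from_str(s: &str) -> Result<Self, Self::Err> {
              // Invalid values fail loudly instead of silently falling back.
              match s {
                  "paged" => Ok(Attention::Paged),
                  "flashdecoding" => Ok(Attention::FlashDecoding),
                  "flashinfer" => Ok(Attention::FlashInfer),
                  other => Err(format!("unknown attention backend: {other}")),
              }
          }
      }
      
      fn main() {
          let backend: Attention = std::env::var("ATTENTION")
              .unwrap_or_else(|_| "flashinfer".into())
              .parse()
              .expect("valid attention backend");
          println!("{backend:?}");
      }
      ```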
    • Pr 2352 ci branch (#2382) · 6d06473c
      drbh authored
      
      
      * Fix unsigned integer underflow
      
      Passing --max-batch-size to the launcher actually had no effect
      because after a few requests the max_size passed to State::next_batch
      would underflow, becoming a large positive number.
      
      In the scheduler, as soon as the cached batch size reached
      max_batch_size, the max_size passed to next_batch became 0.
      Since the only check in that function is
      ```
      if Some(batch_requests.len()) == max_size {
          break;
      }
      ```
      and it is evaluated only after `batch_requests.len()` has
      already become 1, it does nothing to prevent more than 0
      requests from being batched.
      
      We then end up with a cached batch in the server that is larger
      than max_batch_size, and `max_size - batch_size as usize`
      underflows (see the sketch at the end of this entry).
      Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
      
      * fix: update v3 scheduler and ensure max_batch_size > 0
      
      ---------
      Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
      Co-authored-by: Max de Bayser <mbayser@br.ibm.com>
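      A self-contained sketch of the failure mode and the usual fix, with illustrative values rather than the scheduler's real code: on unsigned types the subtraction wraps, so saturate at zero and bail out when the cached batch is already full.
      
      ```
      fn main() {
          let max_batch_size: usize = 4;
          let cached_batch_size: usize = 5; // can legitimately exceed max_batch_size
      
          // Buggy version: panics in debug builds, wraps to a huge
          // positive number in release builds.
          // let max_size = max_batch_size - cached_batch_size;
      
          // Fix: saturate at zero and skip scheduling entirely when no
          // room is left in the batch.
          let max_size = max_batch_size.saturating_sub(cached_batch_size);
          if max_size == 0 {
              println!("batch full, wait for the cached batch to shrink");
              return;
          }
          println!("schedule up to {max_size} new requests");
      }
      ```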
  16. 01 Aug, 2024 1 commit
  17. 31 Jul, 2024 2 commits
    • refactor usage stats (#2339) · 7451041e
      Erik Kaunismäki authored
      
      
      * refactor usage stats
      
      * Update docs/source/usage_statistics.md
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
      
      * Update router/src/server.rs
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
      
      * changes based on feedback
      
      * run python3 update_doc.py
      
      * fix pre-commit
      
      * Update router/src/server.rs
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
      
      * delete option around usage stats arg
      
      ---------
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
    • Rebase TRT-llm (#2331) · 2b19d671
      Nicolas Patry authored
      * wip
      
      wip
      
      refacto
      
      refacto
      
      Initial setup for CXX binding to TRTLLM
      
      Working FFI call for TGI and TRTLLM backend
      
      Remove unused parameters and force tokenizer name to be set
      
      Overall build TRTLLM and deps through CMake build system
      
      Enable end to end CMake build
      
      First version loading engines and making it ready for inference
      
      Remembering to check how we can detect support for chunked context
      
      Move to latest TensorRT-LLM version
      
      Specify which default log level to use depending on CMake build type
      
      make leader executor mode working
      
      unconditionally call InitializeBackend on the FFI layer
      
      bind to CUDA::nvml to retrieve compute capabilities at runtime
      
      updated logic and comment to detect cuda compute capabilities
      
      implement the Stream method to send new tokens through a callback
      
      use spdlog release 1.14.1 moving forward
      
      update trtllm to latest version a96cccafcf6365c128f004f779160951f8c0801c
      
      correctly tell cmake to build dependent tensorrt-llm required libraries
      
      create cmake install target to put everything relevant in installation folder
      
      add auth_token CLI argument to provide hf hub authentication token
      
      allow converting huggingface::tokenizers error to TensorRtLlmBackendError
      
      use correct include for spdlog
      
      include guard to build example in cmakelists
      
      working setup of the ffi layer
      
      remove fmt import
      
      use external fmt lib
      
      end to end ffi flow working
      
      make sure to track include/ffi.h to trigger rebuild from cargo
      
      impl the rust backend which currently cannot move the actual computation in background thread
      
      expose shutdown function at ffi layer
      
      impl RwLock scenario for TensorRtLlmBackend
      
      oops missing c++ backend definitions
      
      compute the number of maximum new tokens for each request independently
      
      make sure the context is not dropped in the middle of the async decoding.
      
      remove unnecessary log
      
      add all the necessary plumbing to return the generated content
      
      update invalid doc in cpp file
      
      correctly forward back the log probabilities
      
      remove unneeded scope variable for now
      
      refactor Stream impl for Generation to factorise code
      
      expose the internal missing start/queue timestamp
      
      forward tgi parameters rep/freq penalty
      
      add some more validation about grammar not being supported
      
      define a shared struct to hold the result of a decoding step
      
      expose information about potential error happening while decoding
      
      remove logging
      
      add logging in case of decoding error
      
      make sure executor_worker is provided
      
      add initial Dockerfile for TRTLLM backend
      
      add some more information in CMakeLists.txt to correctly install executorWorker
      
      add some more information in CMakeLists.txt to correctly find and install nvrtc wrapper
      
      simplify prebuilt trtllm libraries name definition
      
      do the same name definition stuff for tensorrt_llm_executor_static
      
      leverage pkg-config to probe libraries paths and reuse new install structure from cmake
      
      fix bad copy/paste missing nvinfer linkage direction
      
      align all the linker search dependency
      
      add missing pkgconfig folder for MPI in Dockerfile
      
      correctly setup linking search path for runtime layer
      
      fix missing / before tgi lib path
      
      adding missing ld_library_path for cuda stubs in Dockerfile
      
      update tgi entrypoint
      
      commenting out Python part for TensorRT installation
      
      refactored docker image
      
      move to TensorRT-LLM v0.11.0
      
      make docker linter happy with same capitalization rule
      
      fix typo
      
      refactor the compute capabilities detection along with num gpus
      
      update TensorRT-LLM to latest version
      
      update TensorRT install script to latest
      
      update build.rs to link to cuda 12.5
      
      add missing dependant libraries for linking
      
      clean up a bit
      
      install to decoder_attention target
      
      add some custom stuff for nccl linkage
      
      fix envvar CARGO_CFG_TARGET_ARCH being set at compile time, not at runtime
      
      use std::env::consts::ARCH (see the sketch at the end of this entry)
      
      make sure the variable lives long enough...
      
      look for cuda 12.5
      
      add some more basic info in README.md
      
      * Rebase.
      
      * Fix autodocs.
      
      * Let's try to enable trtllm backend.
      
      * Ignore backends/v3 by default.
      
      * Fixing client.
      
      * Fix makefile + autodocs.
      
      * Updating the schema thing + redocly.
      
      * Fix trtllm lint.
      
      * Adding pb files ?
      
      * Remove cargo fmt temporarily.
      
      * ?
      
      * Tmp.
      
      * Remove both check + clippy  ?
      
      * Backporting telemetry.
      
      * Backporting 457fb0a1
      
      
      
      * Remove PB from git.
      
      * Fixing PB with default member backends/client
      
      * update TensorRT-LLM to latest version
      
      * provided None for api_key
      
      * link against libtensorrt_llm and not libtensorrt-llm
      
      ---------
      Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
      Co-authored-by: Morgan Funtowicz <morgan@huggingface.co>
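      A small sketch of the CARGO_CFG_TARGET_ARCH gotcha referenced above: Cargo sets that variable only in the environment of build scripts (i.e. at compile time of the crate), so reading it from the compiled binary at runtime fails; the compile-time constant std::env::consts::ARCH is the portable alternative.
      
      ```
      fn main() {
          // Works anywhere: baked in at compile time by the standard library.
          let arch = std::env::consts::ARCH;
          println!("running on {arch}");
      
          // Set by Cargo for build.rs processes only; at runtime in the
          // compiled binary this is typically None.
          let from_env = std::env::var("CARGO_CFG_TARGET_ARCH").ok();
          println!("CARGO_CFG_TARGET_ARCH = {from_env:?}");
      }
      ```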