1. 25 Oct, 2024 1 commit
    • [TENSORRT-LLM] - Implement new looper thread based backend (#2357) · 43df056e
      Funtowicz Morgan authored
      
      
      * (backend) use parking_lot crate for RwLock fairness
      
      # Conflicts:
      #	backends/trtllm/src/backend.rs
      
      * (launcher) default new server::run parameters to false for now
      
      * (chore) fmt ... why?
      
      * (ffi) use const for GetSamplingConfig
      
      * (server) expose new SchedulingError
      
      * (trt)
      
      * (build) setup ccache if available
      
      * (ffi) add max_new_tokens parameters
      
      * (backend) cleanup a bit
      
      * (backend) expose PullNewTokens
      
      * (ffi) cleanup again
      
      * (ffi) add missing headers imports
      
      * (ffi) add template specialization to catch and convert to Rust Result<T, tensorrt_llm::common::TllmException>
      
      * (looper) new looper initial implementation
      
      * (ffi) remove narrowing type warning
      
      * (ffi) encode the provided user prompt within each request thread
      
      * (misc) change scope identifiers
      
      * (backend) implement the post_processor background thread
      
      * (misc) missing Result types for Rust
      
      * use blocking_recv in looper to consume awaiting_requests at max before pulling in a single step
      
      * (server) forward auth_token to server::run
      
      * (build) fetchcontent use archives instead of git
      
      * (ffi) fix usage of wrong vector constructor making a capacity fill call
      
      * (ffi) missing namespace for tle::Response
      
      * (ffi) do not use reference capture in lambda as we are not capturing anything
      
      * (backend) refactor & cleanup
      
      * (Dockerfile.trtllm) delete for now
      
      * (misc) simplify [make_]move_iterator by using c++20 type inference
      
      * (misc) no need to move for uint32_t items
      
      * (scheduler) rework submit/pull logic
      
      * (post) impl postprocessing
      
      * (misc) delete backend.rs
      
      * (misc) rerun-if-changed all the cmake modules
      
      * (misc) move to latest trtllm
      
      * (fix): HOPPER_SM_MAJOR is 9 not 8
      
      * (misc): build for sm_{75,80,86,89,90} by default
      
      * (misc): build with trtllm 0.13.0
      
      * (misc): increase verbosity of spdlog
      
      * (fix): do not recreate the stateful hashmap at every it
      
      * (misc): update dependency in trtllm dockerfile
      
      * (misc): update dependency in trtllm dockerfile
      
      * (misc): disable logging in release mode
      
      * (misc): improve trtllm download script robustness
      
      * (fix): more fixes for Dockerfile
      
      * misc(cuda): require 12.6
      
      * chore(cmake): use correct policy for download_timestamp
      
      * feat(looper): check engine and executorWorker paths exist before creating the backend
      
      * chore(cmake): download timestamp should be before URL
      
      * feat(looper): minor optimizations to avoid growing the containers too much
      
      * chore(trtllm): move dockerfile to right place
      
      * chore(trtllm): disable tokenizer parallelism by default
      
      * chore(trtllm): fmt
      
      * chore(trtllm): post-rebase commit
      
      * chore(trtllm): remove unused method
      
      * feat(trtllm): cache maxNumTokens to avoid calling JSON every time
      
      * misc(router): remove SchedulingError
      
      * feat(trtllm): do not tokenize twice
      
      * Revert "chore(trtllm): remove unused method"
      
      This reverts commit 31747163
      
      * chore(rebase): fix invalid references
      
      * chore(router): add python dependency
      
      * Lint.
      
      * Fix bad rebase
      
      ---------
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
  2. 31 Jul, 2024 1 commit
    • Rebase TRT-llm (#2331) · 2b19d671
      Nicolas Patry authored
      * wip
      
      wip
      
      refacto
      
      refacto
      
      Initial setup for CXX binding to TRTLLM
      
      Working FFI call for TGI and TRTLLM backend
      
      Remove unused parameters and force tokenizer name to be set
      
      Overall build TRTLLM and deps through CMake build system
      
      Enable end to end CMake build
      
      First version loading engines and making it ready for inference
      
      Remembering to check how we can detect support for chunked context
      
      Move to latest TensorRT-LLM version
      
      Specify which default log level to use depending on CMake build type
      
      make leader executor mode working
      
      unconditionally call InitializeBackend on the FFI layer
      
      bind to CUDA::nvml to retrieve compute capabilities at runtime
      
      updated logic and comment to detect cuda compute capabilities
      
      implement the Stream method to send new tokens through a callback
      
      use spdlog release 1.14.1 moving forward
      
      update trtllm to latest version a96cccafcf6365c128f004f779160951f8c0801c
      
      correctly tell cmake to build dependent tensorrt-llm required libraries
      
      create cmake install target to put everything relevant in installation folder
      
      add auth_token CLI argument to provide hf hub authentication token
      
      allow converting huggingface::tokenizers error to TensorRtLlmBackendError
      
      use correct include for spdlog
      
      include guard to build example in cmakelists
      
      working setup of the ffi layer
      
      remove fmt import
      
      use external fmt lib
      
      end to end ffi flow working
      
      make sure to track include/ffi.h to trigger rebuild from cargo
      
      impl the rust backend which currently cannot move the actual computation in background thread
      
      expose shutdown function at ffi layer
      
      impl RwLock scenario for TensorRtLllmBackend
      
      oops missing c++ backend definitions
      
      compute the number of maximum new tokens for each request independently
      
      make sure the context is not dropped in the middle of the async decoding.
      
      remove unnecessary log
      
      add all the necessary plumbing to return the generated content
      
      update invalid doc in cpp file
      
      correctly forward back the log probabilities
      
      remove unneeded scope variable for now
      
      refactor Stream impl for Generation to factorise code
      
      expose the internal missing start/queue timestamp
      
      forward tgi parameters rep/freq penalty
      
      add some more validation about grammar not supported
      
      define a shared struct to hold the result of a decoding step
      
      expose information about potential error happening while decoding
      
      remove logging
      
      add logging in case of decoding error
      
      make sure executor_worker is provided
      
      add initial Dockerfile for TRTLLM backend
      
      add some more information in CMakeLists.txt to correctly install executorWorker
      
      add some more information in CMakeLists.txt to correctly find and install nvrtc wrapper
      
      simplify prebuilt trtllm libraries name definition
      
      do the same name definition stuff for tensorrt_llm_executor_static
      
      leverage pkg-config to probe libraries paths and reuse new install structure from cmake
      
      fix bad copy/paste missing nvinfer linkage direction
      
      align all the linker search dependency
      
      add missing pkgconfig folder for MPI in Dockerfile
      
      correctly setup linking search path for runtime layer
      
      fix missing / before tgi lib path
      
      adding missing ld_library_path for cuda stubs in Dockerfile
      
      update tgi entrypoint
      
      commenting out Python part for TensorRT installation
      
      refactored docker image
      
      move to TensorRT-LLM v0.11.0
      
      make docker linter happy with same capitalization rule
      
      fix typo
      
      refactor the compute capabilities detection along with num gpus
      
      update TensorRT-LLM to latest version
      
      update TensorRT install script to latest
      
      update build.rs to link to cuda 12.5
      
      add missing dependent libraries for linking
      
      clean up a bit
      
      install to decoder_attention target
      
      add some custom stuff for nccl linkage
      
      fix envvar CARGO_CFG_TARGET_ARCH set at runtime vs compile time
      
      use std::env::consts::ARCH
      
      make sure variable live long enough...
      
      look for cuda 12.5
      
      add some more basic info in README.md
      
      * Rebase.
      
      * Fix autodocs.
      
      * Let's try to enable trtllm backend.
      
      * Ignore backends/v3 by default.
      
      * Fixing client.
      
      * Fix makefile + autodocs.
      
      * Updating the schema thing + redocly.
      
      * Fix trtllm lint.
      
      * Adding pb files ?
      
      * Remove cargo fmt temporarily.
      
      * ?
      
      * Tmp.
      
      * Remove both check + clippy  ?
      
      * Backporting telemetry.
      
      * Backporting 457fb0a1
      
      
      
      * Remove PB from git.
      
      * Fixing PB with default member backends/client
      
      * update TensorRT-LLM to latest version
      
      * provided None for api_key
      
      * link against libtensorrt_llm and not libtensorrt-llm
      
      ---------
      Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
      Co-authored-by: Morgan Funtowicz <morgan@huggingface.co>