    Rebase TRT-llm (#2331) · 2b19d671
    Nicolas Patry authored
    * wip
    
    wip
    
    refactor

    refactor
    
    Initial setup for CXX binding to TRTLLM
    
    Working FFI call for TGI and TRTLLM backend
    
    Remove unused parameters and force tokenizer name to be set
    
    Overall build TRTLLM and deps through CMake build system
    
    Enable end to end CMake build
    
    First version loading engines and making it ready for inference
    
    Remembering to check how we can detect support for chunked context
    
    Move to latest TensorRT-LLM version
    
    Specify which default log level to use depending on CMake build type
    
    make leader executor mode work
    
    unconditionally call InitializeBackend on the FFI layer
    
    bind to CUDA::nvml to retrieve compute capabilities at runtime
    
    updated logic and comment to detect cuda compute capabilities
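    A minimal Rust sketch of that runtime probe, assuming the nvml-wrapper crate
    (the backend itself binds CUDA::nvml on the C++ side, so names and error
    handling here are illustrative only):

        use nvml_wrapper::Nvml;

        /// Query the (major, minor) CUDA compute capability of every visible GPU.
        fn probe_compute_capabilities() -> Result<Vec<(i32, i32)>, nvml_wrapper::error::NvmlError> {
            let nvml = Nvml::init()?;
            let count = nvml.device_count()?;
            let mut capabilities = Vec::with_capacity(count as usize);
            for index in 0..count {
                let device = nvml.device_by_index(index)?;
                // e.g. (9, 0) on H100, (8, 0) on A100.
                let cc = device.cuda_compute_capability()?;
                capabilities.push((cc.major, cc.minor));
            }
            Ok(capabilities)
        }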
    
    implement the Stream method to send new tokens through a callback
    
    use spdlog release 1.14.1 moving forward
    
    update trtllm to latest version a96cccafcf6365c128f004f779160951f8c0801c
    
    correctly tell cmake to build dependent tensorrt-llm required libraries
    
    create cmake install target to put everything relevant in installation folder
    
    add auth_token CLI argument to provide hf hub authentication token
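    A hedged sketch of what such a flag can look like with clap's derive API
    (only the auth_token name comes from this change; the env variable and the
    surrounding struct are assumptions):

        use clap::Parser;

        #[derive(Parser, Debug)]
        struct Args {
            /// Hugging Face Hub token used to download gated models and tokenizers.
            #[clap(long, env = "HUGGING_FACE_HUB_TOKEN")]
            auth_token: Option<String>,
        }

        fn main() {
            let args = Args::parse();
            if args.auth_token.is_some() {
                println!("using the provided Hugging Face Hub token");
            }
        }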
    
    allow converting huggingface::tokenizers error to TensorRtLlmBackendError
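    The conversion is essentially a From impl; a minimal sketch, assuming a
    thiserror-style TensorRtLlmBackendError enum (the variant is hypothetical):

        use thiserror::Error;

        #[derive(Debug, Error)]
        pub enum TensorRtLlmBackendError {
            /// Hypothetical variant wrapping tokenizer failures.
            #[error("tokenizer error: {0}")]
            Tokenizer(String),
        }

        impl From<tokenizers::Error> for TensorRtLlmBackendError {
            fn from(error: tokenizers::Error) -> Self {
                // tokenizers::Error is a boxed dyn Error, keep only its message.
                TensorRtLlmBackendError::Tokenizer(error.to_string())
            }
        }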
    
    use correct include for spdlog
    
    include guard to build example in cmakelists
    
    working setup of the ffi layer
    
    remove fmt import
    
    use external fmt lib
    
    end to end ffi flow working
    
    make sure to track include/ffi.h to trigger rebuild from cargo
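    Concretely, the build script only needs to emit a rerun-if-changed directive
    for that header, which is a standard Cargo mechanism:

        // build.rs
        fn main() {
            // Re-run the build script whenever the C++ FFI header changes,
            // so the bridge code gets regenerated.
            println!("cargo:rerun-if-changed=include/ffi.h");
            // Once rerun-if-changed is used, Cargo only watches the listed
            // paths, so list build.rs itself as well.
            println!("cargo:rerun-if-changed=build.rs");
        }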
    
    impl the Rust backend, which currently cannot move the actual computation to a background thread
    
    expose shutdown function at ffi layer
    
    impl RwLock scenario for TensorRtLlmBackend
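    A small sketch of that pattern, assuming a tokio RwLock guarding the backend
    state (everything except the TensorRtLlmBackend name is an assumption):

        use std::sync::Arc;
        use tokio::sync::RwLock;

        /// Hypothetical inner state driving the TensorRT-LLM executor.
        struct BackendState;

        #[derive(Clone)]
        pub struct TensorRtLlmBackend {
            // Concurrent readers for status queries, exclusive writer when
            // submitting work or shutting the executor down.
            inner: Arc<RwLock<BackendState>>,
        }

        impl TensorRtLlmBackend {
            pub fn new() -> Self {
                Self { inner: Arc::new(RwLock::new(BackendState)) }
            }

            pub async fn with_write<R>(&self, f: impl FnOnce(&mut BackendState) -> R) -> R {
                let mut guard = self.inner.write().await;
                f(&mut guard)
            }
        }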
    
    oops missing c++ backend definitions
    
    compute the number of maximum new tokens for each request independently
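    The per-request budget amounts to clamping the requested amount against the
    remaining context window; a hedged sketch, parameter names assumed:

        /// How many new tokens a single request may generate, given the engine's
        /// maximum total sequence length and the length of its prompt.
        fn max_new_tokens_for_request(
            requested_max_new_tokens: u32,
            input_length: u32,
            max_total_tokens: u32,
        ) -> u32 {
            let remaining = max_total_tokens.saturating_sub(input_length);
            requested_max_new_tokens.min(remaining)
        }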
    
    make sure the context is not dropped in the middle of the async decoding.
    
    remove unnecessary log
    
    add all the necessary plumbing to return the generated content
    
    update invalid doc in cpp file
    
    correctly forward back the log probabilities
    
    remove unneeded scope variable for now
    
    refactor Stream impl for Generation to factorise code
    
    expose the internal missing start/queue timestamp
    
    forward TGI rep/freq penalty parameters
    
    add some more validation that grammar is not supported
    
    define a shared struct to hold the result of a decoding step
    
    expose information about potential error happening while decoding
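    Together with the shared struct defined a couple of entries above, this ends
    up as a plain value type in the cxx bridge; a sketch with illustrative field
    names only:

        #[cxx::bridge]
        mod ffi {
            /// Result of one decoding step, passed by value between C++ and Rust.
            struct GenerationStep {
                token_id: u32,
                log_prob: f32,
                is_final: bool,
                // Error reporting for the step: flag plus human-readable message.
                has_error: bool,
                error_msg: String,
            }
        }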
    
    remove logging
    
    add logging in case of decoding error
    
    make sure executor_worker is provided
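    A hedged sketch of that guard, checking the worker binary up front instead
    of failing deep inside the C++ layer (names assumed):

        use std::path::Path;

        fn ensure_executor_worker(path: &Path) -> Result<(), String> {
            if !path.exists() {
                return Err(format!(
                    "executor_worker not found at {}",
                    path.display()
                ));
            }
            Ok(())
        }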
    
    add initial Dockerfile for TRTLLM backend
    
    add some more information in CMakeLists.txt to correctly install executorWorker
    
    add some more information in CMakeLists.txt to correctly find and install nvrtc wrapper
    
    simplify prebuilt trtllm libraries name definition
    
    do the same name definition stuff for tensorrt_llm_executor_static
    
    leverage pkg-config to probe libraries paths and reuse new install structure from cmake
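    On the Rust side this probing typically happens in build.rs through the
    pkg-config crate; a sketch where the probed package name is an assumption:

        // build.rs (excerpt)
        fn main() {
            // probe_library() emits the cargo:rustc-link-search / rustc-link-lib
            // directives itself; the returned Library also exposes the paths.
            let mpi = pkg_config::probe_library("ompi")
                .expect("pkg-config could not locate MPI");
            for path in &mpi.link_paths {
                println!("cargo:warning=MPI library path: {}", path.display());
            }
        }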
    
    fix bad copy/paste missing nvinfer linkage directive
    
    align all the linker search dependencies
    
    add missing pkgconfig folder for MPI in Dockerfile
    
    correctly setup linking search path for runtime layer
    
    fix missing / before tgi lib path
    
    add missing LD_LIBRARY_PATH for CUDA stubs in Dockerfile
    
    update tgi entrypoint
    
    commenting out Python part for TensorRT installation
    
    refactored docker image
    
    move to TensorRT-LLM v0.11.0
    
    make docker linter happy with same capitalization rule
    
    fix typo
    
    refactor the compute capabilities detection along with num gpus
    
    update TensorRT-LLM to latest version
    
    update TensorRT install script to latest
    
    update build.rs to link to cuda 12.5
    
    add missing dependent libraries for linking
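    Both of these boil down to extra search paths and link-lib directives emitted
    from build.rs; a sketch where the toolkit path and the library list are
    assumptions:

        // build.rs (excerpt)
        fn main() {
            // Point the linker at the CUDA 12.5 toolkit and its stubs.
            println!("cargo:rustc-link-search=native=/usr/local/cuda-12.5/lib64");
            println!("cargo:rustc-link-search=native=/usr/local/cuda-12.5/lib64/stubs");

            // Libraries the backend transitively needs (illustrative list).
            for lib in ["cudart", "nvinfer", "nvinfer_plugin_tensorrt_llm"] {
                println!("cargo:rustc-link-lib=dylib={lib}");
            }
        }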
    
    clean up a bit
    
    install to decoder_attention target
    
    add some custom stuff for nccl linkage
    
    fix envvar CARGO_CFG_TARGET_ARCH set at runtime vs compile time
    
    use std::env::consts::ARCH
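    The distinction these two commits deal with: CARGO_CFG_TARGET_ARCH is an
    environment variable Cargo sets only while running build scripts, whereas
    std::env::consts::ARCH is a compile-time constant available to the final
    binary. A small sketch of each side:

        // build.rs: available as an environment variable during the build only.
        fn main() {
            let target_arch = std::env::var("CARGO_CFG_TARGET_ARCH").unwrap_or_default();
            println!("cargo:warning=building for target arch {target_arch}");
        }

        // Library/binary code: the same information baked in at compile time,
        // no environment variable needed at runtime.
        pub fn runtime_arch() -> &'static str {
            std::env::consts::ARCH
        }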
    
    make sure the variable lives long enough...
    
    look for cuda 12.5
    
    add some more basic info in README.md
    
    * Rebase.
    
    * Fix autodocs.
    
    * Let's try to enable trtllm backend.
    
    * Ignore backends/v3 by default.
    
    * Fixing client.
    
    * Fix makefile + autodocs.
    
    * Updating the schema thing + redocly.
    
    * Fix trtllm lint.
    
    * Adding pb files ?
    
    * Remove cargo fmt temporarily.
    
    * ?
    
    * Tmp.
    
    * Remove both check + clippy  ?
    
    * Backporting telemetry.
    
    * Backporting 457fb0a1

    * Remove PB from git.
    
    * Fixing PB with default member backends/client
    
    * update TensorRT-LLM to latest version
    
    * provided None for api_key
    
    * link against libtensorrt_llm and not libtensorrt-llm
    
    ---------
    Co-authored-by: OlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
    Co-authored-by: Morgan Funtowicz <morgan@huggingface.co>