1. 12 Aug, 2024 7 commits
  2. 09 Aug, 2024 10 commits
    • Daniël de Kok's avatar
      8dcc7d3f
    • drbh's avatar
      feat: add guideline to chat request and template (#2391) · 0d06aed0
      drbh authored
      * feat: add guideline to chat request and template
      
      * fix: add template test and update docs
      0d06aed0
    • Nicolas Patry's avatar
      Using an enum for flash backens (paged/flashdecoding/flashinfer) (#2385) · 7a48a847
      Nicolas Patry authored
      * Using an enum for flash backens (paged/flashdecoding/flashinfer)
      
      * Early exit on server too.
      
      * Clippy.
      
      * Fix clippy and fmt.
      7a48a847
    • Daniël de Kok's avatar
      flake: use rust-overlay (#2390) · 6e127dcc
      Daniël de Kok authored
      6e127dcc
    • Vaibhav Srivastav's avatar
      Update documentation for Supported models (#2386) · b2b9c427
      Vaibhav Srivastav authored
      * Minor doc fixes
      
      * up.
      
      * Other minor updates.
      b2b9c427
    • Daniël de Kok's avatar
      flake: add fmt and clippy (#2389) · 977534bc
      Daniël de Kok authored
      977534bc
    • Nicolas Patry's avatar
    • Daniël de Kok's avatar
      Add experimental flake (#2384) · c6d5039c
      Daniël de Kok authored
      Add flake.nix
      c6d5039c
    • Daniël de Kok's avatar
      Add FlashInfer support (#2354) · 7830de15
      Daniël de Kok authored
      This change adds support for FlashInfer. FlashInfer can be enabled using
      `FLASH_INFER=1` and is currently only implemented in `FlashCausalLM`.
      Since this functionality is currently only for testing, FlashInfer is
      not installed anywhere yet.
      
      The FlashInfer API is quite different from FlashAttention/vLLM in that
      it requires more global bookkeeping:
      
      * A wrapper class needs to be contstructed (which we just call *state*).
        Since this is fairly expensive (due to pinned host memory allocation),
        we only do this once in a FlashCausalLM instance or for each CUDA
        Graph size.
      * Each model forward call needs to be wrapped in `begin_forward` and
        `end_forward`. This sets up data structures that can be reused for all
        calls to attention for that forward call.
      
      When calling attention, we need access to the state object. To avoid
      passing an argument down the call chain (which would require changes to
      all models), we use a context variable.
      
      Each model forward call is wrapped using a context manager that does all
      the bookkeeping for such a call:
      
      * Set the context variable to the forward call's state.
      * Call `begin_forward` on the state.
      * Yield.
      * Call `end_forward` on the state.
      * Reset the context variable.
      
      We cannot use a single shared global variable for this, since e.g. CUDA
      Graphs of different sizes each have their own state.
      7830de15
    • drbh's avatar
      Pr 2352 ci branch (#2382) · 6d06473c
      drbh authored
      
      
      * Fix unsigned integer underflow
      
      Passing --max-batch-size to the launcher actually had no effect
      because after a few requests the max_size passed to State::next_batch
      would underflow becoming a largo positive number.
      
      In the scheduler, as soon as the cached batch size reached the
      max_batch_size the max_size passed to next_batch becomes 0.
      Since the only check in that funcion is
      ```
      if Some(batch_requests.len()) == max_size {
          break;
      }
      ```
      and it's called after the `batch_requests.len()` has
      become 1, it doesn't do anything to prevent more than 0
      requests from being batched.
      
      Now we have cached batch in the server that is large than
      max_batch_size and `max_size - batch_size as usize`
      underflows.
      Signed-off-by: default avatarMax de Bayser <mbayser@br.ibm.com>
      
      * fix: update v3 scheduler and ensure max_batch_size > 0
      
      ---------
      Signed-off-by: default avatarMax de Bayser <mbayser@br.ibm.com>
      Co-authored-by: default avatarMax de Bayser <mbayser@br.ibm.com>
      6d06473c
  3. 08 Aug, 2024 7 commits
  4. 07 Aug, 2024 1 commit
  5. 06 Aug, 2024 6 commits
  6. 05 Aug, 2024 2 commits
  7. 01 Aug, 2024 3 commits
  8. 31 Jul, 2024 4 commits
    • Erik Kaunismäki's avatar
      refactor usage stats (#2339) · 7451041e
      Erik Kaunismäki authored
      
      
      * refactor usage stats
      
      * Update docs/source/usage_statistics.md
      Co-authored-by: default avatarNicolas Patry <patry.nicolas@protonmail.com>
      
      * Update router/src/server.rs
      Co-authored-by: default avatarNicolas Patry <patry.nicolas@protonmail.com>
      
      * changes based on feedback
      
      * run python3 udpate_doc.py
      
      * fix pre-commit
      
      * Update router/src/server.rs
      Co-authored-by: default avatarNicolas Patry <patry.nicolas@protonmail.com>
      
      * delete option around usage stats arg
      
      ---------
      Co-authored-by: default avatarNicolas Patry <patry.nicolas@protonmail.com>
      7451041e
    • drbh's avatar
      Pr 2290 ci run (#2329) · f7f61876
      drbh authored
      
      
      * MODEL_ID propagation fix
      
      * fix: remove global model id
      
      ---------
      Co-authored-by: default avatarroot <root@tw031.pit.tensorwave.lan>
      f7f61876
    • Daniël de Kok's avatar
      Handle GPTQ-Marlin loading in `GPTQMarlinWeightLoader` (#2300) · 34f7dcfd
      Daniël de Kok authored
      The `GPTWeightLoader` was structured like this in pseudocode:
      
      if marlin:
        Set up tensors in a way that GPTQ-Marlin expects
      else:
        Set up tensors in a way that ExLlama/GPTQ/AWQ expect
      
      However, the GPT-Marlin implementation details should really be in the
      `marlin` module. So move the former part out to a separate
      `GPTQMarlinWeightsLoader`.
      34f7dcfd
    • Nicolas Patry's avatar
      Rebase TRT-llm (#2331) · 2b19d671
      Nicolas Patry authored
      * wip
      
      wip
      
      refacto
      
      refacto
      
      Initial setup for CXX binding to TRTLLM
      
      Working FFI call for TGI and TRTLLM backend
      
      Remove unused parameters annd force tokenizer name to be set
      
      Overall build TRTLLM and deps through CMake build system
      
      Enable end to end CMake build
      
      First version loading engines and making it ready for inference
      
      Remembering to check how we can detect support for chunked context
      
      Move to latest TensorRT-LLM version
      
      Specify which default log level to use depending on CMake build type
      
      make leader executor mode working
      
      unconditionally call InitializeBackend on the FFI layer
      
      bind to CUDA::nvml to retrieve compute capabilities at runtime
      
      updated logic and comment to detect cuda compute capabilities
      
      implement the Stream method to send new tokens through a callback
      
      use spdlog release 1.14.1 moving forward
      
      update trtllm to latest version a96cccafcf6365c128f004f779160951f8c0801c
      
      correctly tell cmake to build dependent tensorrt-llm required libraries
      
      create cmake install target to put everything relevant in installation folder
      
      add auth_token CLI argument to provide hf hub authentification token
      
      allow converting huggingface::tokenizers error to TensorRtLlmBackendError
      
      use correct include for spdlog
      
      include guard to build example in cmakelists
      
      working setup of the ffi layer
      
      remove fmt import
      
      use external fmt lib
      
      end to end ffi flow working
      
      make sure to track include/ffi.h to trigger rebuild from cargo
      
      impl the rust backend which currently cannot move the actual computation in background thread
      
      expose shutdown function at ffi layer
      
      impl RwLock scenario for TensorRtLllmBackend
      
      oops missing c++ backend definitions
      
      compute the number of maximum new tokens for each request independently
      
      make sure the context is not dropped in the middle of the async decoding.
      
      remove unnecessary log
      
      add all the necessary plumbery to return the generated content
      
      update invalid doc in cpp file
      
      correctly forward back the log probabilities
      
      remove unneeded scope variable for now
      
      refactor Stream impl for Generation to factorise code
      
      expose the internal missing start/queue timestamp
      
      forward tgi parameters rep/freq penalty
      
      add some more validation about grammar not supported
      
      define a shared struct to hold the result of a decoding step
      
      expose information about potential error happening while decoding
      
      remove logging
      
      add logging in case of decoding error
      
      make sure executor_worker is provided
      
      add initial Dockerfile for TRTLLM backend
      
      add some more information in CMakeLists.txt to correctly install executorWorker
      
      add some more information in CMakeLists.txt to correctly find and install nvrtc wrapper
      
      simplify prebuilt trtllm libraries name definition
      
      do the same name definition stuff for tensorrt_llm_executor_static
      
      leverage pkg-config to probe libraries paths and reuse new install structure from cmake
      
      fix bad copy/past missing nvinfer linkage direction
      
      align all the linker search dependency
      
      add missing pkgconfig folder for MPI in Dockerfile
      
      correctly setup linking search path for runtime layer
      
      fix missing / before tgi lib path
      
      adding missing ld_library_path for cuda stubs in Dockerfile
      
      update tgi entrypoint
      
      commenting out Python part for TensorRT installation
      
      refactored docker image
      
      move to TensorRT-LLM v0.11.0
      
      make docker linter happy with same capitalization rule
      
      fix typo
      
      refactor the compute capabilities detection along with num gpus
      
      update TensorRT-LLM to latest version
      
      update TensorRT install script to latest
      
      update build.rs to link to cuda 12.5
      
      add missing dependant libraries for linking
      
      clean up a bit
      
      install to decoder_attention target
      
      add some custom stuff for nccl linkage
      
      fix envvar CARGO_CFG_TARGET_ARCH set at runtime vs compile time
      
      use std::env::const::ARCH
      
      make sure variable live long enough...
      
      look for cuda 12.5
      
      add some more basic info in README.md
      
      * Rebase.
      
      * Fix autodocs.
      
      * Let's try to enable trtllm backend.
      
      * Ignore backends/v3 by default.
      
      * Fixing client.
      
      * Fix makefile + autodocs.
      
      * Updating the schema thing + redocly.
      
      * Fix trtllm lint.
      
      * Adding pb files ?
      
      * Remove cargo fmt temporarily.
      
      * ?
      
      * Tmp.
      
      * Remove both check + clippy  ?
      
      * Backporting telemetry.
      
      * Backporting 457fb0a1
      
      
      
      * Remove PB from git.
      
      * Fixing PB with default member backends/client
      
      * update TensorRT-LLM to latest version
      
      * provided None for api_key
      
      * link against libtensorrt_llm and not libtensorrt-llm
      
      ---------
      Co-authored-by: default avatarOlivierDehaene <23298448+OlivierDehaene@users.noreply.github.com>
      Co-authored-by: default avatarMorgan Funtowicz <morgan@huggingface.co>
      2b19d671