1. 12 Aug, 2024 2 commits
    • Keeping the benchmark somewhere (#2401) · 136bcc81
      Nicolas Patry authored
      
      Co-authored-by: Daniël de Kok <me@danieldk.eu>
    • Add support for prefix caching to the v3 router (#2392) · 8deeaca4
      Daniël de Kok authored
      This change adds support for prefix caching to the v3 router. This
      is broken up from the backend support to ease reviewing.
      
      For now, prefix caching is only enabled with `USE_PREFIX_CACHING=1`;
      in this case, the router switches to `RadixAllocator`. This
      allocator uses a radix trie to keep track of previously-seen
      prefills. If a new prefill is a prefix of a previously-seen
      prefill, the router will send a request with `prefix_len>0`, which
      can be used by the backend to decide to reuse KV blocks from the
      cache, rather than recomputing them.
      
      Even though backend support is not added in this PR, the backend
      will still work with prefix caching enabled. The prefix lengths
      are just ignored and not used.
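      
      The matching idea can be sketched as follows (a minimal sketch: a
      plain token trie is used for brevity where the real `RadixAllocator`
      uses a radix trie and also tracks KV block allocation; all names
      below are illustrative, not the actual router types):
      ```
      use std::collections::HashMap;
      
      #[derive(Default)]
      struct TrieNode {
          children: HashMap<u32, TrieNode>,
      }
      
      #[derive(Default)]
      struct PrefixIndex {
          root: TrieNode,
      }
      
      impl PrefixIndex {
          /// Record the token ids of a prefill whose KV blocks are now cached.
          fn insert(&mut self, tokens: &[u32]) {
              let mut node = &mut self.root;
              for &t in tokens {
                  node = node.children.entry(t).or_default();
              }
          }
      
          /// Length of the longest previously-seen prefix of `tokens`; this
          /// is what would be sent to the backend as `prefix_len`.
          fn prefix_len(&self, tokens: &[u32]) -> usize {
              let mut node = &self.root;
              let mut len = 0;
              for &t in tokens {
                  match node.children.get(&t) {
                      Some(child) => {
                          node = child;
                          len += 1;
                      }
                      None => break,
                  }
              }
              len
          }
      }
      
      fn main() {
          let mut index = PrefixIndex::default();
          index.insert(&[1, 2, 3, 4]);
          // Shares its first three tokens with the prefill above, so the
          // backend could reuse the KV blocks covering that prefix.
          assert_eq!(index.prefix_len(&[1, 2, 3, 9]), 3);
          // No cached prefix: the backend recomputes everything.
          assert_eq!(index.prefix_len(&[7, 8]), 0);
      }
      ```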
  2. 09 Aug, 2024 2 commits
    • Using an enum for flash backends (paged/flashdecoding/flashinfer) (#2385) · 7a48a847
      Nicolas Patry authored
      * Using an enum for flash backends (paged/flashdecoding/flashinfer)
      
      * Early exit on server too.
      
      * Clippy.
      
      * Fix clippy and fmt.
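      
      A minimal sketch of the pattern (the enum and variant names here are
      assumptions, not the exact TGI types): parsing the backend choice
      into an enum lets the server reject an unknown value early, instead
      of failing later on a string comparison.
      ```
      use std::str::FromStr;
      
      // Illustrative stand-in for the flash attention backend selector.
      #[derive(Debug, Clone, Copy, PartialEq, Eq)]
      enum AttentionBackend {
          Paged,
          FlashDecoding,
          FlashInfer,
      }
      
      impl FromStr for AttentionBackend {
          type Err = String;
      
          fn from_str(s: &str) -> Result<Self, Self::Err> {
              match s.to_lowercase().as_str() {
                  "paged" => Ok(Self::Paged),
                  "flashdecoding" => Ok(Self::FlashDecoding),
                  "flashinfer" => Ok(Self::FlashInfer),
                  other => Err(format!("unknown flash backend: {other}")),
              }
          }
      }
      
      fn main() {
          // Early exit on an invalid value instead of failing deep in the
          // server: `parse` surfaces the error at startup.
          let backend: AttentionBackend =
              "flashinfer".parse().expect("valid backend");
          assert_eq!(backend, AttentionBackend::FlashInfer);
      }
      ```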
    • Pr 2352 ci branch (#2382) · 6d06473c
      drbh authored
      
      
      * Fix unsigned integer underflow
      
      Passing --max-batch-size to the launcher actually had no effect
      because after a few requests the max_size passed to State::next_batch
      would underflow, becoming a large positive number.
      
      In the scheduler, as soon as the cached batch size reaches
      max_batch_size, the max_size passed to next_batch becomes 0.
      Since the only check in that function is
      ```
      if Some(batch_requests.len()) == max_size {
          break;
      }
      ```
      and it is only evaluated after `batch_requests.len()` has
      already become at least 1, it does nothing to prevent more
      than 0 requests from being batched.
      
      Now the server holds a cached batch that is larger than
      max_batch_size, and `max_size - batch_size as usize`
      underflows (a minimal reproduction follows below).
      Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
      
      * fix: update v3 scheduler and ensure max_batch_size > 0
      
      ---------
      Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
      Co-authored-by: Max de Bayser <mbayser@br.ibm.com>
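      
      A minimal reproduction of the arithmetic (illustrative, not the
      scheduler code itself):
      ```
      fn main() {
          let max_batch_size: usize = 4;
          let cached_batch_size: usize = 6;
      
          // Buggy shape: plain `-` on usize panics in debug builds and
          // wraps in release builds; wrapping_sub shows the wrapped value
          // deterministically.
          let remaining = max_batch_size.wrapping_sub(cached_batch_size);
          println!("wrapped remaining budget: {remaining}"); // huge positive number
      
          // Fixed shape: saturate at zero so no further requests are admitted.
          let remaining = max_batch_size.saturating_sub(cached_batch_size);
          assert_eq!(remaining, 0);
      }
      ```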
  3. 31 Jul, 2024 2 commits
    • refactor usage stats (#2339) · 7451041e
      Erik Kaunismäki authored
      
      
      * refactor usage stats
      
      * Update docs/source/usage_statistics.md
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
      
      * Update router/src/server.rs
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
      
      * changes based on feedback
      
      * run python3 update_doc.py
      
      * fix pre-commit
      
      * Update router/src/server.rs
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
      
      * delete option around usage stats arg
      
      ---------
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
    • Rebase TRT-llm (#2331) · 2b19d671
      Nicolas Patry authored
      * wip
      
      wip
      
      refacto
      
      refacto
      
      Initial setup for CXX binding to TRTLLM
      
      Working FFI call for TGI and TRTLLM backend
      
      Remove unused parameters and force tokenizer name to be set
      
      Overall build TRTLLM and deps through CMake build system
      
      Enable end to end CMake build
      
      First version loading engines and making it ready for inference
      
      Remembering to check how we can detect support for chunked context
      
      Move to latest TensorRT-LLM version
      
      Specify which default log level to use depending on CMake build type
      
      make leader executor mode work
      
      unconditionally call InitializeBackend on the FFI layer
      
      bind to CUDA::nvml to retrieve compute capabilities at runtime
      
      updated logic and comment to detect cuda compute capabilities
      
      implement the Stream method to send new tokens through a callback
      
      use spdlog release 1.14.1 moving forward
      
      update trtllm to latest version a96cccafcf6365c128f004f779160951f8c0801c
      
      correctly tell cmake to build dependent tensorrt...
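      
      The "Stream method sending new tokens through a callback" idea can
      be sketched as follows (a hypothetical shape only; the callback
      type, `on_token`, and the fake backend driver below are assumptions,
      not TGI's actual TRT-LLM FFI surface):
      ```
      use std::os::raw::c_void;
      
      /// C-compatible callback type (assumed shape): (user_data, token_id, is_final).
      type TokenCallback =
          extern "C" fn(user_data: *mut c_void, token_id: u32, is_final: bool);
      
      extern "C" fn on_token(user_data: *mut c_void, token_id: u32, is_final: bool) {
          // Recover the Rust-side sink passed across the FFI boundary as
          // an opaque pointer.
          let tokens = unsafe { &mut *(user_data as *mut Vec<u32>) };
          tokens.push(token_id);
          if is_final {
              println!("generation finished: {tokens:?}");
          }
      }
      
      // Stand-in for the C++ backend driving generation and invoking the
      // callback once per generated token.
      fn fake_backend_stream(cb: TokenCallback, user_data: *mut c_void) {
          for (i, tok) in [11u32, 42, 7].iter().enumerate() {
              cb(user_data, *tok, i == 2);
          }
      }
      
      fn main() {
          let mut tokens: Vec<u32> = Vec::new();
          fake_backend_stream(on_token, &mut tokens as *mut _ as *mut c_void);
          assert_eq!(tokens, vec![11, 42, 7]);
      }
      ```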