1. 28 Oct, 2024 2 commits
  2. 26 Oct, 2024 1 commit
  3. 25 Oct, 2024 8 commits
    • chore: prepare 2.4.0 release (#2695) · a6b02da9
      OlivierDehaene authored
    • feat: add triton kernels to decrease latency of large batches (#2687) · 6f88bd93
      OlivierDehaene authored
      * feat: add triton kernels to decrease latency of large batches
      
      * cast to int32
      
      * fix kernel
      
      * fix kernel
      
      * disable triton on rocm
      
      * fix speculation
      
      * add slots filtering kernel
    • Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels (#2688) · 0f346a32
      Daniël de Kok authored
      * Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels
      
      Performance and accuracy of these kernels are on par (tested with Llama
      70B and 405B). Removes a dependency and resolves some stability issues
      we have been seeing.
      
      * Update test snapshots
    • Add support for stop words in TRTLLM (#2678) · ba5fc7d9
      Funtowicz Morgan authored
      * feat(trtllm): rewrite health to not account for current state
      
      * chore(looper): cleanup a bit more
      
      * feat(post_processing): max_new_tokens is const evaluated now
      
      * chore(ffi):formatting
      
      * feat(trtllm): add stop words handling
      
      # Conflicts:
      #	backends/trtllm/lib/backend.cpp
      
      * chore(trtllm): create specific parallelconfig factory and logging init methods
      
      * chore(trtllm): define a macro for SizeType cast
      
      * chore(trtllm): use GetParallelConfig
      
      * chore(trtllm): minor refactoring
      
      * chore(trtllm): validate there are enough GPUs on the system for the desired model
      
      * chore(trtllm): ensure max throughput scheduling policy is selected
      
      * chore(trtllm): minor fix
      
      * chore(router): minor refactorings
      
      * feat(docker): build with-slurm ompi
      
      * feat(docker): add python3.10 dev to runtime deps
      
      * chore(docker): add mpi to ld_library_path
      
      * chore(docker): install transformers
      
      * feat(trtllm): detect stop_words from generation_config.json
    • Fixing mt0 test. (#2692) · db68bd05
      Nicolas Patry authored
    • [TENSORRT-LLM] - Implement new looper thread based backend (#2357) · 43df056e
      Funtowicz Morgan authored
      
      
      * (backend) use parking_lot crate for RwLock fairness
      
      # Conflicts:
      #	backends/trtllm/src/backend.rs
      
      * (launcher) default new server::run parameters to false for now
      
      * (chore) fmt ... why?
      
      * (ffi) use const for GetSamplingConfig
      
      * (server) expose new SchedulingError
      
      * (trt)
      
      * (build) setup ccache if available
      
      * (ffi) add max_new_tokens parameters
      
      * (backend) cleanup a bit
      
      * (backend) expose PullNewTokens
      
      * (ffi) cleanup again
      
      * (ffi) add missing headers imports
      
      * (ffi) add template specialization to catch and convert to Rust Result<T, tensorrt_llm::common::TllmException>
      
      * (looper) new looper initial implementation
      
      * (ffi) remove narrowing type warning
      
      * (ffi) encode the provided user prompt within each request thread
      
      * (misc) change scope identifiers
      
      * (backend) implement the post_processor background thread
      
      * (misc) missing Result types for Rust
      
      * use blocking_recv in looper to consume awaiting_requests at max before pulling in a single step
      
      * (server) forward auth_token to server::run
      
      * (build) fetchcontent use archives instead of git
      
      * (ffi) fix usage of wrong vector constructor making a capacity fill call
      
      * (ffi) missing namespace for tle::Response
      
      * (ffi) do not use reference capture in lambda as we are not capturing anything
      
      * (backend) refactor & cleanup
      
      * (Dockerfile.trtllm) delete for now
      
      * (misc) simplify [make_]move_iterator by using c++20 type inference
      
      * (misc) no need to move for uint32_t items
      
      * (scheduler) rework submit/pull logic
      
      * (post) impl postprocessing
      
      * (misc) delete backend.rs
      
      * (misc) rerun-if-changed all the cmake modules
      
      * (misc) move to latest trtllm
      
      * (fix): HOPPER_SM_MAJOR is 9 not 8
      
      * (misc): build for sm_{75,80,86,89,90} by default
      
      * (misc): build with trtllm 0.13.0
      
      * (misc): increase verbosity of spdlog
      
      * (fix): do not recreate the stateful hashmap at every it
      
      * (misc): update dependency in trtllm dockerfile
      
      * (misc): update dependency in trtllm dockerfile
      
      * (misc): disable logging in release mode
      
      * (misc): improve trtllm download script robustness
      
      * (fix): more fixes for Dockerfile
      
      * misc(cuda): require 12.6
      
      * chore(cmake): use correct policy for download_timestamp
      
      * feat(looper): check engine and executorWorker paths exist before creating the backend
      
      * chore(cmake): download timestamp should be before URL
      
      * feat(looper): minor optimizations to avoid growing the containers too much
      
      * chore(trtllm): move dockerfile to right place
      
      * chore(trtllm): disable tokenizer parallelism by default
      
      * chore(trtllm): fmt
      
      * chore(trtllm): post-rebase commit
      
      * chore(trtllm): remove unused method
      
      * feat(trtllm): cache maxNumTokens to avoid calling JSON every time
      
      * misc(router): remove SchedulingError
      
      * feat(trtllm): do not tokenize twice
      
      * Revert "chore(trtllm): remove unused method"
      
      This reverts commit 31747163
      
      * chore(rebase): fix invalid references
      
      * chore(router): add python dependency
      
      * Lint.
      
      * Fix bad rebase
      
      ---------
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
    • Nicolas Patry's avatar
      ed87b464
  4. 24 Oct, 2024 3 commits
    • Add support for FP8 KV cache scales (#2628) · eab07f74
      Daniël de Kok authored
      * Add support for FP8 KV cache scales
      
      Since FP8 has only a limited dynamic range, we can scale keys/values
      before storing them into the cache (and unscale them in attention). To
      avoid rescaling the cache as the absmax values change, good scales are
      usually determined per layer using calibration data and stored
      in the checkpoint.
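
A minimal sketch of the scaling idea described above, in plain Python with no GPU dependencies. The names `round_to_e4m3`, `calibrate_scale`, `quantize_kv`, and `dequantize_kv` are hypothetical and are not TGI's API. The divide-on-store convention and the absmax calibration rule are assumptions made for illustration. The sketch relies on the largest finite `float8_e4m3fn` value being 448.

```python
import math

FP8_E4M3FN_MAX = 448.0  # largest finite float8_e4m3fn value


def round_to_e4m3(x: float) -> float:
    """Simulate saturating round-to-nearest into float8_e4m3fn."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3FN_MAX)  # saturate instead of overflowing
    if mag < 2.0 ** -6:
        # Subnormal range: values are multiples of 2**-9.
        return sign * round(mag / 2.0 ** -9) * 2.0 ** -9
    m, e = math.frexp(mag)  # mag == m * 2**e with 0.5 <= m < 1
    # Keep 4 significant bits (1 implicit + 3 mantissa bits).
    return sign * math.ldexp(round(m * 16) / 16, e)


def calibrate_scale(samples):
    """Hypothetical calibration: map the observed absmax onto the FP8 maximum."""
    absmax = max(abs(v) for v in samples)
    return absmax / FP8_E4M3FN_MAX if absmax > 0 else 1.0


def quantize_kv(values, scale):
    """Scale before storing into the (simulated) FP8 cache."""
    return [round_to_e4m3(v / scale) for v in values]


def dequantize_kv(quantized, scale):
    """Unscale when reading the cache back, as attention would."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    keys = [1000.0, -250.0, 0.5]
    scale = calibrate_scale(keys)
    stored = quantize_kv(keys, scale)
    print("stored:", stored)
    print("restored:", dequantize_kv(stored, scale))
```

Without a scale, the value 1000.0 would saturate at 448; with a calibrated scale it survives the round trip, at the cost of coarser resolution for small values.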
      
      This change adds support for using key-value scales and loading them
      from checkpoints in the two most common formats:
      
      - Separate per-layer `k_scale` and `v_scale` scalars.
      - Per-layer `kv_scale` scalar (older format).
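
The two formats listed above could be resolved roughly as follows. This is a sketch under stated assumptions, not the loader from the change: `resolve_kv_scales` is a hypothetical helper, the checkpoint is modelled as a plain mapping from tensor name to scalar, the `<prefix>.k_scale` naming layout is assumed, and both the preference for the separate format and the fallback to `1.0` are illustrative choices.

```python
from typing import Mapping, Tuple


def resolve_kv_scales(tensors: Mapping[str, float], prefix: str) -> Tuple[float, float]:
    """Return (k_scale, v_scale) for one layer of a stand-in checkpoint."""
    k_name = f"{prefix}.k_scale"
    v_name = f"{prefix}.v_scale"
    if k_name in tensors and v_name in tensors:
        # Newer format: separate per-layer scalars for keys and values.
        return float(tensors[k_name]), float(tensors[v_name])

    kv_name = f"{prefix}.kv_scale"
    if kv_name in tensors:
        # Older format: one per-layer scalar shared by keys and values.
        shared = float(tensors[kv_name])
        return shared, shared

    # Assumption: no scales in the checkpoint means no scaling.
    return 1.0, 1.0


if __name__ == "__main__":
    new_style = {"layers.0.attn.k_scale": 0.5, "layers.0.attn.v_scale": 0.25}
    old_style = {"layers.0.attn.kv_scale": 0.125}
    print(resolve_kv_scales(new_style, "layers.0.attn"))
    print(resolve_kv_scales(old_style, "layers.0.attn"))
    print(resolve_kv_scales({}, "layers.0.attn"))
```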
      
      Currently, scales are only used with a `float8_e4m3fn` cache.
      
      Besides adding support for key/value scales, the `fp8_quantize` function
      is also extended to support quantization with a kernel vendored from
      vLLM. This kernel is slightly faster than the PyTorch implementation and
      also scales in FP32, potentially improving accuracy.
      
      * Update FP8 KV cache test to use checkpoint with scales
      
      * `can_scale`: check that the attention is flashinfer
    • Fix Phi 3.5 MoE tests (#2684) · 14a0df3a
      Daniël de Kok authored
      PR #2682 also fixed an issue in Phi MoE, but it changes the test
      outputs a bit. Fix this.
  5. 23 Oct, 2024 4 commits
  6. 22 Oct, 2024 1 commit
    • Add `impureWithCuda` dev shell (#2677) · 9c9ef37c
      Daniël de Kok authored
      * Add `impureWithCuda` dev shell
      
      This shell is handy when developing some kernels jointly with TGI - it
      adds nvcc and a bunch of commonly-used CUDA libraries to the environment.
      
      We don't add this to the normal impure shell to keep the development
      environment as clean as possible (avoid accidental dependencies, etc.).
      
      * Add cuDNN
  7. 21 Oct, 2024 2 commits
  8. 19 Oct, 2024 1 commit
    • Make handling of FP8 scales more consistent (#2666) · 5e0fb468
      Daniël de Kok authored
      Change `fp8_quantize` so that we can pass around reciprocals everywhere,
      so scales are always passed around in the checkpoint format.
      
      I also noticed that we ignore any input scales that we might have when
      fbgemm is available. Skip this path if we already have a scale.
  9. 18 Oct, 2024 1 commit
  10. 17 Oct, 2024 5 commits
  11. 16 Oct, 2024 2 commits
    • feat: prefill chunking (#2600) · a6a0c97e
      OlivierDehaene authored
      
      
      * wip
      
      * rollback
      
      * refactor to use prefix/postfix naming + fix all_input_ids_tensor
      
      * maybe patching vlms?
      
      * fix filter and concat
      
      * wip, no filter, no concat
      
      * current
      
      * add prepare_for_prefill
      
      * working
      
      * load tested
      
      * re-create slots
      
      * re-create slots
      
      * fix slot_filtering_indices
      
      * feedback loop
      
      * remove log
      
      * fix benchmarker
      
      * fix vlm and seq2seq
      
      * rename to cache and input lengths
      
      * fix prefill logprobs
      
      * fix launcher
      
      * fix logprobs?
      
      * idk at this point
      
      * max input length
      
      * omfg
      
      * remove debugging lines
      
      * fix tests
      
      * fix mllama
      
      * fix cargo tests
      
      * remove support chunking for paged
      
      * Fixing non blocked attentions
      
      * Fixing dtype + AMD, Ipex targets.
      
      * lint fix.
      
      * rename
      
      * Fix prefix_caching variable, remove defaults in server (confusing a lot
      of the time).
      
      * Add simple resolution when user specifies ATTENTION=paged.
      
      * Put back non default simple tests.
      
      * Fix env name
      
      ---------
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
    • Fp8 e4m3_fnuz support for rocm (#2588) · 704a58c8
      Mohit Sharma authored
      * (feat) fp8 fnuz support for rocm
      
      * (review comments) Fix compression_config load, type hints
      
      * (bug) update all has_tensor
      
      * (review_comments) fix typo and added comments
      
      * (nit) improved comment
  12. 15 Oct, 2024 3 commits
  13. 14 Oct, 2024 5 commits
  14. 11 Oct, 2024 1 commit
  15. 10 Oct, 2024 1 commit
    • Intel ci (#2630) · 3dbdf63e
      Nicolas Patry authored
      * Intel CI ?
      
      * Let's try non sharded gemma.
      
      * Snapshot rename
      
      * Apparently container can be gone already.