1. 18 Nov, 2024 3 commits
  2. 17 Nov, 2024 1 commit
    • Remove vLLM dependency for CUDA (#2751) · 52e48739
      Daniël de Kok authored
      * Remove vLLM dependency for CUDA
      
      This change adds `attention-kernels` as a dependency for paged
      attention and cache reshaping. With that, we don't use vLLM
      anywhere for CUDA.
      
      Tested run (since we don't have paged attention in CI):
      
      ```
      ❯ ATTENTION=paged python -m pytest integration-tests -k "llama and awq" --release
      [...]
      5 snapshots passed.
      ```
      
      * Fix clippy warning
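The paged-attention cache layout this entry relies on can be sketched as follows. This is an illustrative model of block-based KV caching only, not the `attention-kernels` API; `BLOCK_SIZE`, `slot_to_block`, and `reshape_and_cache` are hypothetical names.

```python
# Illustrative sketch: a paged KV cache stores key/value vectors in
# fixed-size blocks, and each token's flat "slot" maps to a
# (block, offset) pair, so a sequence's cache need not be contiguous.

BLOCK_SIZE = 4  # entries per cache block (illustrative value)

def slot_to_block(slot: int) -> tuple[int, int]:
    """Map a flat cache slot to (block index, offset within block)."""
    return slot // BLOCK_SIZE, slot % BLOCK_SIZE

def reshape_and_cache(key, cache, slots):
    """Write per-token key vectors into the paged cache at the given slots."""
    for vec, slot in zip(key, slots):
        block, offset = slot_to_block(slot)
        cache[block][offset] = vec

# A cache with 2 blocks of BLOCK_SIZE entries each.
cache = [[None] * BLOCK_SIZE for _ in range(2)]
reshape_and_cache(key=["k0", "k1", "k2"], cache=cache, slots=[0, 1, 5])
```

Slot 5 lands in block 1 at offset 1, so tokens of one sequence can be scattered across non-adjacent blocks.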
  3. 15 Nov, 2024 7 commits
  4. 14 Nov, 2024 1 commit
  5. 10 Nov, 2024 1 commit
    • Add initial support for compressed-tensors checkpoints (#2732) · a7850008
      Daniël de Kok authored
      compressed-tensors is a safetensors extension for sparse, quantized
      tensors. The format is more powerful than the earlier AWQ/GPTQ/FP8
      quantization formats, because:
      
      - Different quantizer configurations can be used for different targets.
      - The format can specify input/output quantizers in addition to weight
        quantizers.
      - Exclusions from quantization are configurable.
      
      This change adds a dependency on the `compressed-tensors` package for
      its configuration parsing and layer matching functionality.
      
      The following types of quantization are supported in this PR:
      
      - W8A16 and W4A16 INT using GPTQ-Marlin kernels.
      - W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels.
      
      Support for other quantization types will be added in subsequent PRs.
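The per-target configuration and exclusions described above can be sketched roughly as below. The field names (`config_groups`, `targets`, `ignore`) follow the general shape of a compressed-tensors config but are not guaranteed to match the real schema; `quantizer_for` is a hypothetical helper.

```python
# Hedged sketch: map layer names to quantizer settings via glob-style
# target patterns, with a configurable ignore list for exclusions.
import fnmatch

config = {
    "config_groups": {
        "group_0": {
            # Attention projections get 4-bit integer weights (W4A16-style).
            "targets": ["*.q_proj", "*.k_proj", "*.v_proj", "*.o_proj"],
            "weights": {"num_bits": 4, "type": "int"},
        },
    },
    "ignore": ["lm_head"],  # excluded from quantization
}

def quantizer_for(layer_name: str):
    """Return the weight-quantizer config for a layer, or None if excluded/unmatched."""
    if any(fnmatch.fnmatch(layer_name, pat) for pat in config["ignore"]):
        return None
    for group in config["config_groups"].values():
        if any(fnmatch.fnmatch(layer_name, pat) for pat in group["targets"]):
            return group["weights"]
    return None
```

A q_proj layer resolves to the 4-bit group, while `lm_head` and unmatched MLP layers stay unquantized.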
  6. 07 Nov, 2024 1 commit
  7. 04 Nov, 2024 6 commits
  8. 02 Nov, 2024 1 commit
  9. 01 Nov, 2024 1 commit
    • fix cuda graphs for qwen2-vl (#2708) · 01dacf8e
      drbh authored
      * feat: support multidimensional position ids on batch to enable cuda graphs on qwen2-vl
      
      * fix: only check model type if config exists
      
      * fix: adjust sharding and lm head logic
      
      * fix qwen2 failure in intel cpu
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
      
      * fix: return correct shape logits and add streaming test
      
      * fix: remove unused import and refactor test
      
      ---------
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
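The "multidimensional position ids" from the first bullet can be illustrated as follows. This is a simplified, hypothetical sketch of the mrope-style scheme Qwen2-VL uses, not the model's actual code: text tokens advance identically across three rotary dimensions, while image tokens index a (temporal, height, width) patch grid.

```python
# Sketch: position ids of shape (3, seq_len) instead of (seq_len,).
# Keeping the shape fixed per batch is what makes the batch compatible
# with CUDA graph capture.

def text_position_ids(seq_len: int, offset: int = 0):
    """(3, seq_len) ids for plain text: all three dims advance together."""
    row = list(range(offset, offset + seq_len))
    return [row, row, row]

def image_position_ids(t: int, h: int, w: int, offset: int = 0):
    """(3, t*h*w) ids indexing the patch grid per dimension."""
    temporal, height, width = [], [], []
    for ti in range(t):
        for hi in range(h):
            for wi in range(w):
                temporal.append(offset + ti)
                height.append(offset + hi)
                width.append(offset + wi)
    return [temporal, height, width]
```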
  10. 30 Oct, 2024 2 commits
    • Support qwen2 vl (#2689) · befd9f67
      drbh authored
      * feat: add support for qwen2 vl model
      
      * feat: fix token padding, enable warmup and process basic request
      
      * fix: improve get_position_ids, add lift embed_tokens
      
      * fix: remove get_cos_sin_hack dev function
      
      * feat: add simple test chat with message and text
      
      * fix: lint test
      
      * fix: adjust positional embeddings for multi dimensional position ids
      
      * fix: update docs and lint unused vars
      
      * fix: include linted file
      
      * fix: add norm after text output
      
      * fix: format model file
      
      * fix: adjust for ruff lints
      
      * fix: remove unused rotate_half
      
      * feat: refactors and calc num features
      
      * fix: prefer position_ids passed from vlm causal lm and reset ids on batch
      
      * fix: adjust get_position_ids if not available and add required args to signatures
      
      * fix: adjust resize case for qwen2_vl warmup
      
      * fix: avoid qwen2 vl specific paths with qwen2
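The "calc num features" bullet refers to computing how many image tokens the vision tower emits for a given image. A hedged sketch, assuming a patch size and a 2x2 spatial merge (the constants and the `num_image_features` helper are illustrative, not necessarily Qwen2-VL's exact values):

```python
# Sketch: image tokens = patches in each dimension, reduced by the
# spatial merge factor that fuses adjacent patches into one token.

PATCH_SIZE = 14   # pixels per patch side (assumed)
MERGE_SIZE = 2    # merge 2x2 patches into one token (assumed)

def num_image_features(height: int, width: int) -> int:
    grid_h = height // PATCH_SIZE
    grid_w = width // PATCH_SIZE
    return (grid_h // MERGE_SIZE) * (grid_w // MERGE_SIZE)
```

Under these assumptions a 224x224 image yields a 16x16 patch grid and 64 tokens after merging.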
    • Add xpu triton in Dockerfile, or it will show "Could not import Flash Attention enabled models: No module named 'triton'" (#2702) · 46aeb086
      Wang, Yi authored
      Add xpu triton in the Dockerfile; otherwise Flash Attention models fail with "Could not import Flash Attention enabled models: No module named 'triton'".
      Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
  11. 28 Oct, 2024 7 commits
  12. 26 Oct, 2024 1 commit
  13. 25 Oct, 2024 8 commits
    • chore: prepare 2.4.0 release (#2695) · a6b02da9
      OlivierDehaene authored
    • feat: add triton kernels to decrease latency of large batches (#2687) · 6f88bd93
      OlivierDehaene authored
      * feat: add triton kernels to decrease latency of large batches
      
      * cast to int32
      
      * fix kernel
      
      * fix kernel
      
      * disable triton on rocm
      
      * fix speculation
      
      * add slots filtering kernel
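The "slots filtering kernel" bullet can be illustrated in plain Python: when some requests in a batch finish, only the cache slots of surviving requests are kept. The real change does this on-GPU in Triton; `filter_slots` and its argument layout are hypothetical here.

```python
# Sketch: slots is a flat list of cache slots for all requests;
# cu_slots holds cumulative per-request offsets (like cu_seqlens);
# keep_indices names the requests that survive the filter.

def filter_slots(slots, cu_slots, keep_indices):
    """Return (new_slots, new_cu_slots) containing only kept requests."""
    new_slots, new_cu = [], [0]
    for i in keep_indices:
        new_slots.extend(slots[cu_slots[i]:cu_slots[i + 1]])
        new_cu.append(len(new_slots))
    return new_slots, new_cu
```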
    • Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels (#2688) · 0f346a32
      Daniël de Kok authored
      * Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels
      
      Performance and accuracy of these kernels are on par (tested with Llama
      70B and 405B). Removes a dependency and resolves some stability issues
      we have been seeing.
      
      * Update test snapshots
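The w8a8 scaled matmul being swapped here can be sketched numerically: activations and weights are both symmetric int8 with per-tensor scales, and the integer accumulator is rescaled back to float. This mirrors the idea only, not the fbgemm-gpu or vLLM/marlin kernel APIs.

```python
# Minimal w8a8 sketch: quantize both operands to int8, accumulate in
# integer arithmetic, dequantize with the product of the two scales.

def quantize(x, scale):
    """Symmetric per-tensor int8 quantization with clamping."""
    return [max(-128, min(127, round(v / scale))) for v in x]

def w8a8_dot(a, b, scale_a, scale_b):
    """Dot product computed on int8 values, rescaled to float."""
    qa, qb = quantize(a, scale_a), quantize(b, scale_b)
    acc = sum(x * y for x, y in zip(qa, qb))  # int32-style accumulate
    return acc * scale_a * scale_b
```

With well-chosen scales the rescaled result closely matches the float dot product.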
    • Add support for stop words in TRTLLM (#2678) · ba5fc7d9
      Funtowicz Morgan authored
      * feat(trtllm): rewrite health to not account for current state
      
      * chore(looper): cleanup a bit more
      
      * feat(post_processing): max_new_tokens is const evaluated now
      
      * chore(ffi):formatting
      
      * feat(trtllm): add stop words handling
      
      # Conflicts:
      #	backends/trtllm/lib/backend.cpp
      
      * chore(trtllm): create specific parallelconfig factory and logging init methods
      
      * chore(trtllm): define a macro for SizeType cast
      
      * chore(trtllm): use GetParallelConfig
      
      * chore(trtllm): minor refactoring
      
      * chore(trtllm): validate there are enough GPUs on the system for the desired model
      
      * chore(trtllm): ensure max throughput scheduling policy is selected
      
      * chore(trtllm): minor fix
      
      * chore(router): minor refactorings
      
      * feat(docker): build with-slurm ompi
      
      * feat(docker): add python3.10 dev to runtime deps
      
      * chore(docker): add mpi to ld_library_path
      
      * chore(docker): install transformers
      
      * feat(trtllm): detect stop_words from generation_config.json
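Stop-word handling of the kind described above can be sketched as a check on the decoded stream, with stop sequences taken from a generation_config.json-style dict. The `stop_strings` field name and `should_stop` helper are illustrative, not the TRT-LLM backend's actual API.

```python
# Sketch: generation halts once the decoded text ends with any
# configured stop sequence.

generation_config = {"stop_strings": ["</s>", "\nUser:"]}  # assumed shape

def should_stop(decoded_text: str, stop_words=None) -> bool:
    """True if the decoded text ends with one of the stop sequences."""
    stop_words = stop_words or generation_config["stop_strings"]
    return any(decoded_text.endswith(s) for s in stop_words)
```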
    • Fixing mt0 test. (#2692) · db68bd05
      Nicolas Patry authored
    • [TENSORRT-LLM] - Implement new looper thread based backend (#2357) · 43df056e
      Funtowicz Morgan authored
      * (backend) use parking_lot crate for RwLock fairness
      
      # Conflicts:
      #	backends/trtllm/src/backend.rs
      
      * (launcher) default new server::run parameters to false for now
      
      * (chore) fmt ... why?
      
      * (ffi) use const for GetSamplingConfig
      
      * (server) expose new SchedulingError
      
      * (trt)
      
      * (build) setup ccache if available
      
      * (ffi) add max_new_tokens parameters
      
      * (backend) cleanup a bit
      
      * (backend) expose PullNewTokens
      
      * (ffi) cleanup again
      
      * (ffi) add missing headers imports
      
      * (ffi) add template specialization to catch and convert to Rust Result<T, tensorrt_llm::common::TllmException>
      
      * (looper) new looper initial implementation
      
      * (ffi) remove narrowing type warning
      
      * (ffi) encode the provided user prompt within each request thread
      
      * (misc) change scope identifiers
      
      * (backend) implement the post_processor background thread
      
      * (misc) missing Result types for Rust
      
      * use blocking_recv in the looper to consume as many awaiting_requests as possible before pulling in a single step
      
      * (server) forward auth_token to server::run
      
      * (build) fetchcontent use archives instead of git
      
      * (ffi) fix usage of wrong vector constructor making a capacity fill call
      
      * (ffi) missing namespace for tle::Response
      
      * (ffi) do not use reference capture in lambda as we are not capturing anything
      
      * (backend) refactor & cleanup
      
      * (Dockerfile.trtllm) delete for now
      
      * (misc) simplify [make_]move_iterator by using c++20 type inference
      
      * (misc) no need to move for uint32_t items
      
      * (scheduler) rework submit/pull logic
      
      * (post) impl postprocessing
      
      * (misc) delete backend.rs
      
      * (misc) rerun-if-changed all the cmake modules
      
      * (misc) move to latest trtllm
      
      * (fix): HOPPER_SM_MAJOR is 9 not 8
      
      * (misc): build for sm_{75,80,86,89,90} by default
      
      * (misc): build with trtllm 0.13.0
      
      * (misc): increase verbosity of spdlog
      
      * (fix): do not recreate the stateful hashmap at every iteration
      
      * (misc): update dependency in trtllm dockerfile
      
      * (misc): update dependency in trtllm dockerfile
      
      * (misc): disable logging in release mode
      
      * (misc): improve trtllm download script robustness
      
      * (fix): more fixes for Dockerfile
      
      * misc(cuda): require 12.6
      
      * chore(cmake): use correct policy for download_timestamp
      
      * feat(looper): check engine and executorWorker paths exist before creating the backend
      
      * chore(cmake): download timestamp should be before URL
      
      * feat(looper): minor optimizations to avoid growing the containers too much
      
      * chore(trtllm): move dockerfile to right place
      
      * chore(trtllm): disable tokenizer parallelism by default
      
      * chore(trtllm): fmt
      
      * chore(trtllm): post-rebase commit
      
      * chore(trtllm): remove unused method
      
      * feat(trtllm): cache maxNumTokens to avoid calling JSON every time
      
      * misc(router): remove SchedulingError
      
      * feat(trtllm): do not tokenize twice
      
      * Revert "chore(trtllm): remove unused method"
      
      This reverts commit 31747163
      
      * chore(rebase): fix invalid references
      
      * chore(router): add python dependency
      
      * Lint.
      
      * Fix bad rebase
      
      ---------
      Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
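The looper pattern this backend describes (block on the first request, then drain everything already queued before a single engine step) can be sketched in a few lines. This is a hedged illustration of the pattern only; the `looper` function and its arguments are hypothetical, and the real backend implements this in Rust/C++ against the TRT-LLM executor.

```python
# Sketch: a background thread blocks for one request, then drains the
# queue without blocking, so each engine step sees all pending work.
import queue
import threading
import time

def looper(requests: queue.Queue, steps: list, stop: threading.Event):
    while not stop.is_set():
        try:
            batch = [requests.get(timeout=0.1)]  # block for the first item
        except queue.Empty:
            continue
        while True:  # drain whatever else is already waiting
            try:
                batch.append(requests.get_nowait())
            except queue.Empty:
                break
        steps.append(batch)  # one engine step handles the whole batch

requests, steps, stop = queue.Queue(), [], threading.Event()
for i in range(3):
    requests.put(i)  # three requests queued before the looper wakes up
t = threading.Thread(target=looper, args=(requests, steps, stop))
t.start()
time.sleep(0.3)  # let the looper run one step
stop.set()
t.join()
```

Because all three requests are queued before the thread runs, they are consumed in a single step rather than three.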
    • Nicolas Patry authored · ed87b464