- 04 Nov, 2024 3 commits
-
-
Travis Addair authored
-
Nicolas Patry authored
-
drbh authored
-
- 02 Nov, 2024 1 commit
-
-
drbh authored
* fix: create position ids for text only input * fix: prefer repeat over expand to avoid clone
-
- 01 Nov, 2024 1 commit
-
-
drbh authored
* feat: support multidimensional position ids on batch to enable cuda graphs on qwen2-vl * fix: only check model type if config exists * fix: adjust sharding and lm head logic * fix qwen2 failure in intel cpu Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> * fix: return correct shape logits and add streaming test * fix: remove unused import and refactor test --------- Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com>
-
- 30 Oct, 2024 2 commits
-
-
drbh authored
* feat: add support for qwen2 vl model * feat: fix token padding, enable warmup and process basic request * fix: improve get_position_ids, add lift embed_tokens * fix: remove get_cos_sin_hack dev function * feat: add simple test chat with meesage and text * fix: lint test * fix: adjust positional embeddings for multi dimensional position ids * fix: update docs and lint unused vars * fix: include linted file * fix: add norm after text output * fix: format model file * fix: adjust for ruff lints * fix: remove unused rotate_half * feat: refactors and calc num features * fix: prefer position_ids passed from vlm causal lm and reset ids on batch * fix: adjust get_position_ids if not available and add required args to signatures * fix: adjust resize case for qwen2_vl warmup * fix: avoid qwen2 vl specific paths with qwen2
-
Wang, Yi authored
add xpu triton in dockerfile, or will show "Could not import Flash Attention enabled models: No module named 'triton'" Signed-off-by:Wang, Yi A <yi.a.wang@intel.com>
-
- 28 Oct, 2024 7 commits
-
-
Nicolas Patry authored
* Monkey patching as a desperate measure. * New snapshot ?
-
Nicolas Patry authored
* More timeout on docker start ? * Latest upgrade.
-
Nicolas Patry authored
-
Nicolas Patry authored
-
Nicolas Patry authored
* We can have a tokenizer anywhere. * Handling potential lack of offsets (python tokenizer) * Remove redundancy. * Fixing the tests. * Flake.lock update ? * Fixing the GIL locking. * Fixing mamba by using the transformers version. * Adding the legacy handle. * Ellide lifetime. * Lint. * Deprecation message. * Fixing bad rebase.
-
Nicolas Patry authored
* Choosing input/total tokens automatically based on available VRAM? * Update doc. * Remove generated files. * Trying to fix non chunking targets. * Attempt #2 * fix. * QuantLinear is rocm compatible. * Much simpler logic after the overhead. * Updating logic + non flash. * Revert doc text. * Simple updates. * Fix integration mt0 (transformers update).
-
Nicolas Patry authored
-
- 26 Oct, 2024 1 commit
-
-
Nicolas Patry authored
* Avoiding timeout for bloom tests. * Skip the test let's see if it's always the first tests that fails. * Fail early. * Pulling ? * No early exit.
-
- 25 Oct, 2024 8 commits
-
-
OlivierDehaene authored
-
OlivierDehaene authored
* feat: add triton kernels to decrease latency of large batches * cast to int32 * fix kernel * fix kernel * disable triton on rocm * fix speculation * add slots filtering kernel
-
Daniël de Kok authored
* Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels Performance and accuracy of these kernels are on par (tested with Llama 70B and 405B). Removes a dependency and resolves some stability issues we have been seeing. * Update test snapshots
-
Funtowicz Morgan authored
* feat(trtllm): rewrite health to not account for current state * chore(looper): cleanup a bit more * feat(post_processing): max_new_tokens is const evaluated now * chore(ffi):formatting * feat(trtllm): add stop words handling # Conflicts: # backends/trtllm/lib/backend.cpp * chore(trtllm): create specific parallelconfig factory and logging init methods * chore(trtllm): define a macro for SizeType cast * chore(trtllm): use GetParallelConfig * chore(trtllm): minor refactoring * chore(trtllm): validate there are enough GPus on the system for the desired model * chore(trtllm): ensure max throughput scheduling policy is selected * chore(trtllm): minor fix * chore(router): minor refactorings * feat(docker): build with-slurm ompi * feat(docker): add python3.10 dev to runtime deps * chore(docker): add mpi to ld_library_path * chore(docker): install transformers * feat(trtllm): detect stop_words from generation_config.json
-
Nicolas Patry authored
-
Nicolas Patry authored
-
Funtowicz Morgan authored
* (backend) use parking_lot crate for RwLock fairness # Conflicts: # backends/trtllm/src/backend.rs * (launcher) default new server::run parameters to false for now * (chore) fmt ... why? * (ffi) use const for GetSamplingConfig * (server) expose new SchedulingError * (trt) * (build) setup ccache if available * (ffi) add max_new_tokens parameters * (backend) cleanup a bit * (backend) expose PullNewTokens * (ffi) cleanup again * (ffi) add missing headers imports * (ffi) add template specialization to catch and convert to Rust Result<T, tensorrt_llm::common::TllmException> * (looper) new looper initial implementation * (ffi) remove narrowing type warning * (ffi) encode the provided user prompt within each request thread * (misc) change scope identifiers * (backend) implement the post_processor background thread * (misc) missing Result types for Rust * use blocking_recv in looper to consume awaiting_requests at max before pulling in a single step * (server) forward auth_token to server::run * (build) fetchcontent use archives instead of git * (ffi) fix usage of wrong vector constructor making a capacity fill call * (ffi) missing namespace for tle::Response * (ffi) do not use reference capture in lambda as we are not capturing anything * (backend) refactor & cleanup * (Dockerfile.trtllm) delete for now * (misc) simplify [make_]move_iterator by using c++20 type inference * (misc) no need to move for uint32_t items * (scheduler) rework submit/pull logic * (post) impl postprocessing * (misc) delete backend.rs * (misc) rerun-if-changed all the cmake modules * (misc) move to latest trtllm * (fix): HOPPER_SM_MAJOR is 9 not 8 * (misc: build for sm_{75,80,86,89,90} by default * (misc): build with trtllm 0.13.0 * (misc): increase verbosity of spdlog * (fix): do not recreate the stateful hashmap at every it * (misc): update dependency in trtllm dockerfile * (misc): update dependency in trtllm dockerfile * (misc): disable logging in release mode * (misc): improve trtllm download script robustness * (fix): ore fixes for Dockerfile * misc(cuda): require 12.6 * chore(cmake): use correct policy for download_timestamp * feat(looper): check engine and executorWorker paths exist before creating the backend * chore(cmake): download timestamp should be before URL * feat(looper): minor optimizations to avoid growing too much the containers * chore(trtllm): move dockerfile to right place * chore(trtllm): disable tokenizer parallelism by default * chore(trtllm): fmt * chore(trtllm): post-rebase commit * chore(trtllm): remove unused method * feat(trtllm): cache maxNumTokens to avoid calling JSON everytime * misc(router): remove SchedulingError * feat(trtllm): do not tokenize twice * Revert "chore(trtllm): remove unused method" This reverts commit 31747163 * chore(rebase): fix invalid references * chore(router): add python dependency * Lint. * Fix bad rebase --------- Co-authored-by:Nicolas Patry <patry.nicolas@protonmail.com>
-
Nicolas Patry authored
specifiying a value.
-
- 24 Oct, 2024 3 commits
-
-
Daniël de Kok authored
* Add support for FP8 KV cache scales Since FP8 only has limited dynamic range, we can scale keys/values before storing them into the cache (and unscale them in attention). To avoid rescaling the cache as the absmax values change, good scales are usually determined per layer using calibration calibration data and stored in the checkpoint. This change adds support for for using key-value scales and loading them from checkpoints in the two most common formats: - Separate per-layer `k_scale` and `v_scale` scalars. - Per-layer `kv_scale` scalar (older format). Currently, scales are only used with an `float8_e4m3fn` cache. Besides adding support for key/value scales, the `fp8_quantize` function is also extended to support quantization with a kernel vendored from vLLM. This is slightly faster than the PyTorch implementation, but also scales in FP32, potentially improving accuracy. * Update FP8 KV cache test to use checkpoint with scales * `can_scale`: check that the attention is flashinfer
-
Daniël de Kok authored
PR #2682 also fixed in issue in Phi MoE, but it changes the test outputs a bit. Fix this.
-
Daniël de Kok authored
-
- 23 Oct, 2024 4 commits
-
-
OlivierDehaene authored
* feat: allow any supported payload on /invocations * update openAPI * update doc
-
OlivierDehaene authored
-
OlivierDehaene authored
* feat: natively support Granite models * Update doc
-
Daniël de Kok authored
-
- 22 Oct, 2024 1 commit
-
-
Daniël de Kok authored
* Add `impureWithCuda` dev shell This shell is handy when developing some kernels jointly with TGI - it adds nvcc and a bunch of commonly-used CUDA libraries to the environment. We don't add this to the normal impure shell to keep the development environment as clean as possible (avoid accidental dependencies, etc.). * Add cuDNN
-
- 21 Oct, 2024 2 commits
-
-
Wang, Yi authored
Signed-off-by:Wang, Yi A <yi.a.wang@intel.com>
-
Daniël de Kok authored
Update the Mixtral GPTQ test to use a model with `desc_act=true` and `group_size!=-1` to ensure that we are checking activation sorting/non-full K (with tensor parallelism). The `desc_act=false` case is already checked by the Mixtral AWQ test.
-
- 19 Oct, 2024 1 commit
-
-
Daniël de Kok authored
Change `fp8_quantize` so that we can pass around reciprocals everywhere, so scales are always passed around in the checkpoint format. I also noticed that we ignore any input scales that we might have when fbgemm is available. Skip this path if we already have a scale.
-
- 18 Oct, 2024 1 commit
-
-
Nicolas Patry authored
* add gptq and awq int4 support in intel platform Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> * fix ci failure Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> * set kv cache dtype Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> * refine the code according to the review command Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> * Simplifying conditionals + reverting integration tests values. * Unused import * Fix redundant import. * Revert change after rebase. * Upgrading the tests (TP>1 fix changes to use different kernels.) * Update server/text_generation_server/layers/gptq/__init__.py --------- Signed-off-by:
Wang, Yi A <yi.a.wang@intel.com> Co-authored-by:
Wang, Yi A <yi.a.wang@intel.com>
-
- 17 Oct, 2024 5 commits
-
-
Daniël de Kok authored
-
drbh authored
* fix: prefer inplace softmax to avoid copy * Update server/text_generation_server/models/flash_causal_lm.py Co-authored-by:
Nicolas Patry <patry.nicolas@protonmail.com> --------- Co-authored-by:
Nicolas Patry <patry.nicolas@protonmail.com>
-
oOraph authored
tgi-entrypoint: exec instead of spawning a child process reason: otherwise parent will receive the signals when we'd like tgi to receive them keeping the parent/child is not necessary and would require the parent to handle signals to forward them properly to the child Signed-off-by:
Raphael Glon <oOraph@users.noreply.github.com> Co-authored-by:
Raphael Glon <oOraph@users.noreply.github.com>
-
Daniël de Kok authored
* Simplify the `attention` function - Use one definition rather than multiple. - Add `key`/`value` arguments, so that we don't need the `PREFILL_IN_KVCACHE` constant. - Make it kwargs-only (to avoid mixing up the various `Tensor` args). * Fixup flashinfer support
-
Daniël de Kok authored
* Support `e4m3fn` KV cache * Make check more obvious
-