- 15 Nov, 2024 4 commits
-
-
jito authored
Signed-off-by: jitokim <pigberger70@gmail.com>
-
Billel Mokeddem authored
fix: change embeddings to embedding
Co-authored-by: Ubuntu <ubuntu@ip-172-31-28-135.us-west-2.compute.internal>
-
Billel Mokeddem authored
Co-authored-by: Ubuntu <ubuntu@ip-172-31-28-135.us-west-2.compute.internal>
-
Daniël de Kok authored
-
- 14 Nov, 2024 1 commit
-
-
Daniël de Kok authored
Updates Triton from 2.1.0 to 3.1.0 (among other things).
-
- 10 Nov, 2024 1 commit
-
-
Daniël de Kok authored
compressed-tensors is a safetensors extension for sparse, quantized tensors. The format is more powerful than earlier AWQ/GPTQ/FP8 quantization, because:
- Different quantizer configurations can be used for different targets.
- The format can specify input/output quantizers in addition to weight quantizers.
- Exclusions from quantization are configurable.

This change adds a dependency on the `compressed-tensors` package for its configuration parsing and layer matching functionality. The following types of quantization are supported in this PR:
- W8A16 and W4A16 INT using GPTQ-Marlin kernels.
- W8A8 and W8A16 FP using FP8-Marlin and cutlass kernels.

Support for other quantization types will be added in subsequent PRs.
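As a rough illustration of the per-target configuration the message describes, here is a hedged sketch of what such a config can look like, expressed as a Python dict. The field names and values are assumptions for illustration, not the exact schema the `compressed-tensors` package parses.

```python
# Hedged sketch: the rough shape of a per-target quantization config.
# Keys and values are illustrative assumptions, not the exact
# compressed-tensors schema.
quantization_config = {
    "config_groups": {
        "group_0": {
            # W4A16 INT for ordinary linear layers (GPTQ-Marlin path).
            "targets": ["Linear"],
            "weights": {"type": "int", "num_bits": 4, "group_size": 128},
            "input_activations": None,  # weights-only: no input quantizer
        },
    },
    # Configurable exclusions: layers that are left unquantized.
    "ignore": ["lm_head"],
}
```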
-
- 07 Nov, 2024 1 commit
-
-
Wang, Yi authored
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
- 04 Nov, 2024 6 commits
-
-
Wang, Yi authored
Fix incorrect output of Qwen2-7B-Instruct-GPTQ-Int4 and Qwen2-7B-Instruct-AWQ. The ipex kernel provides functions like add_bias, so there is no need to add the bias outside.
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
Daniël de Kok authored
-
Nicolas Patry authored
-
Travis Addair authored
-
Nicolas Patry authored
-
drbh authored
-
- 02 Nov, 2024 1 commit
-
-
drbh authored
* fix: create position ids for text only input
* fix: prefer repeat over expand to avoid clone
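As a minimal PyTorch sketch of the repeat-vs-expand trade-off (shapes and names here are illustrative, not the model code): `expand` returns a non-contiguous view that still needs a copy before it can be mutated, while `repeat` materializes the final tensor in one step.

```python
import torch

seq_len, n_dims = 8, 3  # hypothetical: 3 position dimensions
base = torch.arange(seq_len).unsqueeze(0)          # (1, seq_len)

# expand gives a view, so a clone is needed before in-place edits...
via_expand = base.expand(n_dims, seq_len).clone()
# ...whereas repeat allocates the writable tensor directly.
via_repeat = base.repeat(n_dims, 1)

assert torch.equal(via_expand, via_repeat)
```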
-
- 01 Nov, 2024 1 commit
-
-
drbh authored
* feat: support multidimensional position ids on batch to enable cuda graphs on qwen2-vl
* fix: only check model type if config exists
* fix: adjust sharding and lm head logic
* fix qwen2 failure in intel cpu
  Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix: return correct shape logits and add streaming test
* fix: remove unused import and refactor test
---------
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
- 30 Oct, 2024 2 commits
-
-
drbh authored
* feat: add support for qwen2 vl model
* feat: fix token padding, enable warmup and process basic request
* fix: improve get_position_ids, add lift embed_tokens
* fix: remove get_cos_sin_hack dev function
* feat: add simple test chat with message and text
* fix: lint test
* fix: adjust positional embeddings for multi dimensional position ids
* fix: update docs and lint unused vars
* fix: include linted file
* fix: add norm after text output
* fix: format model file
* fix: adjust for ruff lints
* fix: remove unused rotate_half
* feat: refactors and calc num features
* fix: prefer position_ids passed from vlm causal lm and reset ids on batch
* fix: adjust get_position_ids if not available and add required args to signatures
* fix: adjust resize case for qwen2_vl warmup
* fix: avoid qwen2 vl specific paths with qwen2
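For a sense of what the multi-dimensional position ids look like, below is a hedged sketch of M-RoPE-style ids for a single image patch grid: one row per rotary section (temporal, height, width). The grid layout and function name are illustrative assumptions, not the actual Qwen2-VL implementation.

```python
import torch

def image_position_ids(grid_h: int, grid_w: int, t: int = 0) -> torch.Tensor:
    """Hypothetical helper: (3, grid_h * grid_w) position ids."""
    hs = torch.arange(grid_h).repeat_interleave(grid_w)  # row index per patch
    ws = torch.arange(grid_w).repeat(grid_h)             # column index per patch
    ts = torch.full_like(hs, t)                          # constant time index
    return torch.stack([ts, hs, ws])

print(image_position_ids(2, 3))
# tensor([[0, 0, 0, 0, 0, 0],
#         [0, 0, 0, 1, 1, 1],
#         [0, 1, 2, 0, 1, 2]])
```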
-
Wang, Yi authored
Add xpu triton in the Dockerfile, otherwise it will show "Could not import Flash Attention enabled models: No module named 'triton'".
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
-
- 28 Oct, 2024 7 commits
-
-
Nicolas Patry authored
* Monkey patching as a desperate measure.
* New snapshot?
-
Nicolas Patry authored
* More timeout on docker start?
* Latest upgrade.
-
Nicolas Patry authored
-
Nicolas Patry authored
-
Nicolas Patry authored
* We can have a tokenizer anywhere.
* Handling potential lack of offsets (python tokenizer)
* Remove redundancy.
* Fixing the tests.
* Flake.lock update?
* Fixing the GIL locking.
* Fixing mamba by using the transformers version.
* Adding the legacy handle.
* Elide lifetime.
* Lint.
* Deprecation message.
* Fixing bad rebase.
-
Nicolas Patry authored
* Choosing input/total tokens automatically based on available VRAM?
* Update doc.
* Remove generated files.
* Trying to fix non chunking targets.
* Attempt #2
* fix.
* QuantLinear is rocm compatible.
* Much simpler logic after the overhead.
* Updating logic + non flash.
* Revert doc text.
* Simple updates.
* Fix integration mt0 (transformers update).
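A minimal sketch of the idea behind the first item: derive the token budget from free VRAM. The bytes-per-token figure and safety margin are assumptions for illustration; the real logic would derive them from the model's KV-cache geometry.

```python
import torch

def auto_max_total_tokens(bytes_per_token: int, margin: float = 0.9) -> int:
    # mem_get_info returns (free, total) bytes for the current CUDA device.
    free_bytes, _total = torch.cuda.mem_get_info()
    # Keep a safety margin, then spend the rest on KV-cache tokens.
    return int(free_bytes * margin) // bytes_per_token
```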
-
Nicolas Patry authored
-
- 26 Oct, 2024 1 commit
-
-
Nicolas Patry authored
* Avoiding timeout for bloom tests.
* Skip the test, let's see if it's always the first tests that fail.
* Fail early.
* Pulling?
* No early exit.
-
- 25 Oct, 2024 8 commits
-
-
OlivierDehaene authored
-
OlivierDehaene authored
* feat: add triton kernels to decrease latency of large batches
* cast to int32
* fix kernel
* fix kernel
* disable triton on rocm
* fix speculation
* add slots filtering kernel
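A hedged sketch of what a slots-filtering style kernel can look like in Triton: a masked gather that copies only the surviving slots (note the int32 indices, echoing the "cast to int32" item). Kernel shape and names are illustrative assumptions, not the kernels added in this commit.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def gather_slots_kernel(slots_ptr, idx_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    idx = tl.load(idx_ptr + offs, mask=mask)    # indices of kept slots
    vals = tl.load(slots_ptr + idx, mask=mask)  # gather surviving slots
    tl.store(out_ptr + offs, vals, mask=mask)

def gather_slots(slots: torch.Tensor, keep: torch.Tensor) -> torch.Tensor:
    # Assumes both tensors live on the same CUDA device.
    out = torch.empty(keep.numel(), dtype=slots.dtype, device=slots.device)
    n = keep.numel()
    gather_slots_kernel[(triton.cdiv(n, 128),)](
        slots, keep.to(torch.int32), out, n, BLOCK=128
    )
    return out
```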
-
Daniël de Kok authored
* Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels
  Performance and accuracy of these kernels are on par (tested with Llama 70B and 405B). Removes a dependency and resolves some stability issues we have been seeing.
* Update test snapshots
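For reference, the scaled-matmul contract both kernel stacks implement, as a pure-PyTorch numerical sketch under assumed per-tensor scales (a reference for the math, not either kernel):

```python
import torch

def w8a8_reference(a_q: torch.Tensor, b_q: torch.Tensor,
                   a_scale: float, b_scale: float) -> torch.Tensor:
    # Accumulate at higher precision, then fold both scales back in:
    # out ≈ (a_q @ b_q) * a_scale * b_scale.
    acc = a_q.to(torch.float32) @ b_q.to(torch.float32)
    return acc * (a_scale * b_scale)
```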
-
Funtowicz Morgan authored
* feat(trtllm): rewrite health to not account for current state
* chore(looper): cleanup a bit more
* feat(post_processing): max_new_tokens is const evaluated now
* chore(ffi): formatting
* feat(trtllm): add stop words handling
  # Conflicts:
  # backends/trtllm/lib/backend.cpp
* chore(trtllm): create specific parallelconfig factory and logging init methods
* chore(trtllm): define a macro for SizeType cast
* chore(trtllm): use GetParallelConfig
* chore(trtllm): minor refactoring
* chore(trtllm): validate there are enough GPUs on the system for the desired model
* chore(trtllm): ensure max throughput scheduling policy is selected
* chore(trtllm): minor fix
* chore(router): minor refactorings
* feat(docker): build with-slurm ompi
* feat(docker): add python3.10 dev to runtime deps
* chore(docker): add mpi to ld_library_path
* chore(docker): install transformers
* feat(trtllm): detect stop_words from generation_config.json
-
Nicolas Patry authored
-
Nicolas Patry authored
-
Funtowicz Morgan authored
* (backend) use parking_lot crate for RwLock fairness
  # Conflicts:
  # backends/trtllm/src/backend.rs
* (launcher) default new server::run parameters to false for now
* (chore) fmt ... why?
* (ffi) use const for GetSamplingConfig
* (server) expose new SchedulingError
* (trt)
* (build) setup ccache if available
* (ffi) add max_new_tokens parameters
* (backend) cleanup a bit
* (backend) expose PullNewTokens
* (ffi) cleanup again
* (ffi) add missing headers imports
* (ffi) add template specialization to catch and convert to Rust Result<T, tensorrt_llm::common::TllmException>
* (looper) new looper initial implementation
* (ffi) remove narrowing type warning
* (ffi) encode the provided user prompt within each request thread
* (misc) change scope identifiers
* (backend) implement the post_processor background thread
* (misc) missing Result types for Rust
* use blocking_recv in looper to consume awaiting_requests at max before pulling in a single step
* (server) forward auth_token to server::run
* (build) fetchcontent use archives instead of git
* (ffi) fix usage of wrong vector constructor making a capacity fill call
* (ffi) missing namespace for tle::Response
* (ffi) do not use reference capture in lambda as we are not capturing anything
* (backend) refactor & cleanup
* (Dockerfile.trtllm) delete for now
* (misc) simplify [make_]move_iterator by using c++20 type inference
* (misc) no need to move for uint32_t items
* (scheduler) rework submit/pull logic
* (post) impl postprocessing
* (misc) delete backend.rs
* (misc) rerun-if-changed all the cmake modules
* (misc) move to latest trtllm
* (fix): HOPPER_SM_MAJOR is 9 not 8
* (misc): build for sm_{75,80,86,89,90} by default
* (misc): build with trtllm 0.13.0
* (misc): increase verbosity of spdlog
* (fix): do not recreate the stateful hashmap at every it
* (misc): update dependency in trtllm dockerfile
* (misc): update dependency in trtllm dockerfile
* (misc): disable logging in release mode
* (misc): improve trtllm download script robustness
* (fix): more fixes for Dockerfile
* misc(cuda): require 12.6
* chore(cmake): use correct policy for download_timestamp
* feat(looper): check engine and executorWorker paths exist before creating the backend
* chore(cmake): download timestamp should be before URL
* feat(looper): minor optimizations to avoid growing too much the containers
* chore(trtllm): move dockerfile to right place
* chore(trtllm): disable tokenizer parallelism by default
* chore(trtllm): fmt
* chore(trtllm): post-rebase commit
* chore(trtllm): remove unused method
* feat(trtllm): cache maxNumTokens to avoid calling JSON every time
* misc(router): remove SchedulingError
* feat(trtllm): do not tokenize twice
* Revert "chore(trtllm): remove unused method"
  This reverts commit 31747163
* chore(rebase): fix invalid references
* chore(router): add python dependency
* Lint.
* Fix bad rebase
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
-
Nicolas Patry authored
specifying a value.
-
- 24 Oct, 2024 3 commits
-
-
Daniël de Kok authored
* Add support for FP8 KV cache scales
  Since FP8 only has limited dynamic range, we can scale keys/values before storing them into the cache (and unscale them in attention). To avoid rescaling the cache as the absmax values change, good scales are usually determined per layer using calibration data and stored in the checkpoint.
  This change adds support for using key-value scales and loading them from checkpoints in the two most common formats:
  - Separate per-layer `k_scale` and `v_scale` scalars.
  - Per-layer `kv_scale` scalar (older format).
  Currently, scales are only used with a `float8_e4m3fn` cache.
  Besides adding support for key/value scales, the `fp8_quantize` function is also extended to support quantization with a kernel vendored from vLLM. This is slightly faster than the PyTorch implementation, but also scales in FP32, potentially improving accuracy.
* Update FP8 KV cache test to use checkpoint with scales
* `can_scale`: check that the attention is flashinfer
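A minimal sketch of the scale/unscale round trip the message describes, assuming a per-layer `k_scale` calibrated offline; pure PyTorch for illustration, not the vendored vLLM kernel.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # finite max of the format

def quantize_to_cache(key: torch.Tensor, k_scale: float) -> torch.Tensor:
    # Scale in FP32, clamp into FP8's narrow dynamic range, then store.
    scaled = (key.float() / k_scale).clamp(-FP8_MAX, FP8_MAX)
    return scaled.to(torch.float8_e4m3fn)

def dequantize_from_cache(cached: torch.Tensor, k_scale: float) -> torch.Tensor:
    # Unscale before attention.
    return cached.to(torch.float32) * k_scale
```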
-
Daniël de Kok authored
PR #2682 also fixed an issue in Phi MoE, but it changes the test outputs a bit. Fix this.
-
Daniël de Kok authored
-
- 23 Oct, 2024 4 commits
-
-
OlivierDehaene authored
* feat: allow any supported payload on /invocations
* update openAPI
* update doc
-
OlivierDehaene authored
-
OlivierDehaene authored
* feat: natively support Granite models
* Update doc
-
Daniël de Kok authored
-