Commits · 4fa3e3334978dce74eba296ee8cc2e970ed20e5e · OpenDAS / vllm_cscc

20 Oct, 2024 1 commit
- [Kernel] Support sliding window in flash attention backend (#9403) · 4fa3e333
  Chen Zhang authored Oct 20, 2024
  
  4fa3e333
19 Oct, 2024 1 commit

- [Kernel] Add env variable to force flashinfer backend to enable tensor cores (#9497) · 0c9a5258
  Thomas Parnell authored Oct 19, 2024
  
  Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
  Co-authored-by: Chih-Chieh Yang <chih.chieh.yang@ibm.com>
  Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
  
  0c9a5258
17 Oct, 2024 2 commits

- Support `BERTModel` (first `encoder-only` embedding model) (#9056) · 343f8e09
  Robert Shaw authored Oct 17, 2024
  
  Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
  Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
  Co-authored-by: Andrew Feldman <afeldman@neuralmagic.com>
  Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com>
  Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
  Co-authored-by: laishzh <laishengzhang@gmail.com>
  Co-authored-by: Max de Bayser <maxdebayser@gmail.com>
  Co-authored-by: Max de Bayser <mbayser@br.ibm.com>
  Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
  
  343f8e09
- [Core] Deprecating block manager v1 and make block manager v2 default (#8704) · 81ede99c
  Kuntai Du authored Oct 17, 2024
  
  Removes block manager v1. This is the first piece of the prefix-caching-centric design: to get there, the code path has to be simplified so that only the v2 block manager is used, which performs much better on prefix caching.
  
  81ede99c
16 Oct, 2024 1 commit
- [CI/Build] mypy: Resolve some errors from checking vllm/engine (#9267) · 776dbd74
  Russell Bryant authored Oct 16, 2024
```
Signed-off-by: Russell Bryant <rbryant@redhat.com>
```
  776dbd74
14 Oct, 2024 1 commit
- [TPU] Fix TPU SMEM OOM by Pallas paged attention kernel (#9350) · 473e7b36
  Woosuk Kwon authored Oct 14, 2024
  
  473e7b36
13 Oct, 2024 1 commit
- [CI] Fix merge conflict (#9317) · f519902c
  Lily Liu authored Oct 12, 2024
  
  f519902c
12 Oct, 2024 2 commits
- [Bugfix] Fix bug of xformer prefill for encoder-decoder (#9026) · 00298e09
  Xiang Xu authored Oct 12, 2024
  
  00298e09
- [SpecDec] Remove Batch Expansion (2/3) (#9298) · 89feb4c8
  Lily Liu authored Oct 11, 2024
  
  89feb4c8
11 Oct, 2024 2 commits
- [Doc] Compatibility matrix for mutual exclusive features (#8512) · 8baf85e4
  Wallas Henrique authored Oct 11, 2024
```
Signed-off-by: Wallas Santos <wallashss@ibm.com>
```
  8baf85e4
- [Model] Support Mamba (#6484) · 7342a7d7
  Tyler Michael Smith authored Oct 11, 2024
  
  7342a7d7
07 Oct, 2024 1 commit
- [Hardware][CPU] Cross-attention and Encoder-Decoder models support on CPU backend (#9089) · 4f95ffee
  Isotr0py authored Oct 07, 2024
  
  4f95ffee
06 Oct, 2024 2 commits
- [Bugfix] Fix incorrect updates to num_computed_tokens in multi-step scheduling (#9038) · cb3b2b9b
  Varun Sundar Rabindranath authored Oct 06, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  cb3b2b9b
- [core] use forward context for flash infer (#9097) · f4dd830e
  youkaichao authored Oct 05, 2024
  
  f4dd830e
03 Oct, 2024 1 commit
- [misc] add forward context for attention (#9029) · 9aaf14c6
  youkaichao authored Oct 03, 2024
  
  9aaf14c6
02 Oct, 2024 2 commits
- [OpenVINO] Enable GPU support for OpenVINO vLLM backend (#8192) · f58d4fcc
  Sergey Shlyapnikov authored Oct 03, 2024
  
  f58d4fcc
- [Core] CUDA Graphs for Multi-Step + Chunked-Prefill (#8645) · afb050b2
  Varun Sundar Rabindranath authored Oct 02, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  afb050b2
01 Oct, 2024 1 commit
- [Spec Decode] (1/2) Remove batch expansion (#8839) · 15702038
  Lily Liu authored Oct 01, 2024
  
  15702038
27 Sep, 2024 3 commits
- [Core] Multi-Step + Single Step Prefills via Chunked Prefill code path (#8378) · c2ec430a
  Varun Sundar Rabindranath authored Sep 27, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  c2ec430a
- [torch.compile] use empty tensor instead of None for profiling (#8875) · a9b15c60
  youkaichao authored Sep 27, 2024
  
  a9b15c60
- [TPU] Update pallas.py to support trillium (#8871) · 8df2dc3c
  Brittany authored Sep 27, 2024
  
  8df2dc3c
21 Sep, 2024 1 commit
- [Kernel] Build flash-attn from source (#8245) · 71c60491
  Luka Govedič authored Sep 21, 2024
  
  71c60491
20 Sep, 2024 1 commit
- [bugfix] [AMD] add multi-step advance_step to ROCmFlashAttentionMetadata (#8474) · 9e5ec35b
  William Lin authored Sep 19, 2024
  
  9e5ec35b
19 Sep, 2024 1 commit
- [Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention (#8577) · 9cc373f3
  Charlie Fu authored Sep 19, 2024
  
  9cc373f3
18 Sep, 2024 2 commits
- [CI/Build] Update Ruff version (#8469) · 9d104b5b
  Aaron Pham authored Sep 18, 2024
```
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
```
  9d104b5b
- [CI/Build] Avoid CUDA initialization (#8534) · 6ffa3f31
  Cyrus Leung authored Sep 18, 2024
  
  6ffa3f31
17 Sep, 2024 1 commit
- [Encoder decoder] Add cuda graph support during decoding for encoder-decoder models (#7631) · 1009e93c
  sroy745 authored Sep 17, 2024
  
  1009e93c
14 Sep, 2024 1 commit
- [Kernel][Hardware][Amd]Custom paged attention kernel for rocm (#8310) · 1ef0d2ef
  Charlie Fu authored Sep 13, 2024
  
  1ef0d2ef
13 Sep, 2024 1 commit
- [Hardware][intel GPU] bump up ipex version to 2.3 (#8365) · 85172520
  Kunshang Ji authored Sep 14, 2024
```
Co-authored-by: Yan Ma <yan.ma@intel.com>
```
  85172520
12 Sep, 2024 3 commits
- [Bugfix] multi-step + flashinfer: ensure cuda graph compatible (#8427) · 01987725
  Alexander Matveev authored Sep 12, 2024
  
  01987725
- [multi-step] add flashinfer backend (#7928) · a6c0f365
  William Lin authored Sep 12, 2024
  
  a6c0f365
- [torch.compile] hide slicing under custom op for inductor (#8384) · 7de49aa8
  youkaichao authored Sep 12, 2024
  
  7de49aa8
10 Sep, 2024 2 commits
- [Bugfix] lookahead block table with cuda graph max capture (#8340) · 22f3a4bc
  Alexander Matveev authored Sep 10, 2024
```
[Bugfix] Ensure multistep lookahead allocation is compatible with cuda graph max capture (#8340)
```
  22f3a4bc
- [Spec Decode] Move ops.advance_step to flash attn advance_step (#8224) · 5faedf1b
  Kevin Lin authored Sep 10, 2024
  
  5faedf1b
05 Sep, 2024 1 commit
- [Core/Bugfix] Add query dtype as per FlashInfer API requirements. (#8173) · e39ebf5c
  Elfie Guo authored Sep 04, 2024
  
  e39ebf5c
31 Aug, 2024 1 commit
- [Bugfix] bugfix and add model test for flashinfer fp8 kv cache. (#8013) · 622f8abf
  Pavani Majety authored Aug 30, 2024
  
  622f8abf
30 Aug, 2024 2 commits
- [TPU][Bugfix] Fix tpu type api (#8035) · 2684efc4
  Woosuk Kwon authored Aug 30, 2024
  
  2684efc4
- [TPU] Support single and multi-host TPUs on GKE (#7613) · 2148441f
  Richard Liu authored Aug 30, 2024
  
  2148441f
29 Aug, 2024 2 commits
- [Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for... · 6b342156
  Pavani Majety authored Aug 29, 2024
```
[Core][Kernels] Enable FP8 KV Cache with Flashinfer backend.  + BugFix for kv_cache_dtype=auto (#7985)
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
  6b342156
- Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." (#7982) · ef99a787
  youkaichao authored Aug 28, 2024
  
  ef99a787