Commits · df3dcdf49dccfa4914d825fa08b74de8ae050e1e · OpenDAS / vllm_cscc · GitLab

07 Oct, 2024 1 commit
- [Hardware][CPU] Cross-attention and Encoder-Decoder models support on CPU backend (#9089) · 4f95ffee
  Isotr0py authored Oct 07, 2024
  
06 Oct, 2024 2 commits
- [Bugfix] Fix incorrect updates to num_computed_tokens in multi-step scheduling (#9038) · cb3b2b9b
  Varun Sundar Rabindranath authored Oct 06, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
- [core] use forward context for flash infer (#9097) · f4dd830e
  youkaichao authored Oct 05, 2024
  
03 Oct, 2024 1 commit
- [misc] add forward context for attention (#9029) · 9aaf14c6
  youkaichao authored Oct 03, 2024
  
02 Oct, 2024 2 commits
- [OpenVINO] Enable GPU support for OpenVINO vLLM backend (#8192) · f58d4fcc
  Sergey Shlyapnikov authored Oct 03, 2024
  
- [Core] CUDA Graphs for Multi-Step + Chunked-Prefill (#8645) · afb050b2
  Varun Sundar Rabindranath authored Oct 02, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
01 Oct, 2024 1 commit
- [Spec Decode] (1/2) Remove batch expansion (#8839) · 15702038
  Lily Liu authored Oct 01, 2024
  
27 Sep, 2024 3 commits
- [Core] Multi-Step + Single Step Prefills via Chunked Prefill code path (#8378) · c2ec430a
  Varun Sundar Rabindranath authored Sep 27, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
- [torch.compile] use empty tensor instead of None for profiling (#8875) · a9b15c60
  youkaichao authored Sep 27, 2024
  
- [TPU] Update pallas.py to support trillium (#8871) · 8df2dc3c
  Brittany authored Sep 27, 2024
  
21 Sep, 2024 1 commit
- [Kernel] Build flash-attn from source (#8245) · 71c60491
  Luka Govedič authored Sep 21, 2024
  
20 Sep, 2024 1 commit
- [bugfix] [AMD] add multi-step advance_step to ROCmFlashAttentionMetadata (#8474) · 9e5ec35b
  William Lin authored Sep 19, 2024
  
19 Sep, 2024 1 commit
- [Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention (#8577) · 9cc373f3
  Charlie Fu authored Sep 19, 2024
  
18 Sep, 2024 2 commits
- [CI/Build] Update Ruff version (#8469) · 9d104b5b
  Aaron Pham authored Sep 18, 2024
```
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
```
- [CI/Build] Avoid CUDA initialization (#8534) · 6ffa3f31
  Cyrus Leung authored Sep 18, 2024
  
17 Sep, 2024 1 commit
- [Encoder decoder] Add cuda graph support during decoding for encoder-decoder models (#7631) · 1009e93c
  sroy745 authored Sep 17, 2024
  
14 Sep, 2024 1 commit
- [Kernel][Hardware][Amd]Custom paged attention kernel for rocm (#8310) · 1ef0d2ef
  Charlie Fu authored Sep 13, 2024
  
13 Sep, 2024 1 commit
- [Hardware][intel GPU] bump up ipex version to 2.3 (#8365) · 85172520
  Kunshang Ji authored Sep 14, 2024
```
Co-authored-by: Yan Ma <yan.ma@intel.com>
```
12 Sep, 2024 3 commits
- [Bugfix] multi-step + flashinfer: ensure cuda graph compatible (#8427) · 01987725
  Alexander Matveev authored Sep 12, 2024
  
- [multi-step] add flashinfer backend (#7928) · a6c0f365
  William Lin authored Sep 12, 2024
  
- [torch.compile] hide slicing under custom op for inductor (#8384) · 7de49aa8
  youkaichao authored Sep 12, 2024
  
10 Sep, 2024 2 commits
- [Bugfix] lookahead block table with cuda graph max capture (#8340) · 22f3a4bc
  Alexander Matveev authored Sep 10, 2024
```
[Bugfix] Ensure multistep lookahead allocation is compatible with cuda graph max capture (#8340)
```
- [Spec Decode] Move ops.advance_step to flash attn advance_step (#8224) · 5faedf1b
  Kevin Lin authored Sep 10, 2024
  
05 Sep, 2024 1 commit
- [Core/Bugfix] Add query dtype as per FlashInfer API requirements. (#8173) · e39ebf5c
  Elfie Guo authored Sep 04, 2024
  
31 Aug, 2024 1 commit
- [Bugfix] bugfix and add model test for flashinfer fp8 kv cache. (#8013) · 622f8abf
  Pavani Majety authored Aug 30, 2024
  
30 Aug, 2024 2 commits
- [TPU][Bugfix] Fix tpu type api (#8035) · 2684efc4
  Woosuk Kwon authored Aug 30, 2024
  
- [TPU] Support single and multi-host TPUs on GKE (#7613) · 2148441f
  Richard Liu authored Aug 30, 2024
  
29 Aug, 2024 2 commits
- [Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for... · 6b342156
  Pavani Majety authored Aug 29, 2024
```
[Core][Kernels] Enable FP8 KV Cache with Flashinfer backend.  + BugFix for kv_cache_dtype=auto (#7985)
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
- Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." (#7982) · ef99a787
  youkaichao authored Aug 28, 2024
  
28 Aug, 2024 1 commit
- [Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. (#7798) · b98cc28f
  Pavani Majety authored Aug 28, 2024
```
Co-authored-by: Simon Mo <simon.mo@hey.com>
```
27 Aug, 2024 1 commit
- Revert #7509 (#7887) · 9606c719
  Cody Yu authored Aug 27, 2024
  
21 Aug, 2024 1 commit
- [BUG] fix crash on flashinfer backend with cudagraph disabled, when attention... · 53328d75
  LI MOU authored Aug 21, 2024
```
[BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] (#7509)
```
20 Aug, 2024 1 commit
- [Core] Add `AttentionState` abstraction (#7663) · 3b682179
  Antoni Baum authored Aug 20, 2024
  
16 Aug, 2024 2 commits
- [spec decode] [4/N] Move update_flash_attn_metadata to attn backend (#7571) · f366f633
  William Lin authored Aug 16, 2024
```
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
- register custom op for flash attn and use from torch.ops (#7536) · 54bd9a03
  youkaichao authored Aug 15, 2024
  
13 Aug, 2024 1 commit
- [hardware] unify usage of is_tpu to current_platform.is_tpu() (#7102) · 4d2dc507
  youkaichao authored Aug 13, 2024
  
12 Aug, 2024 3 commits
- [Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel (#7208) · a046f863
  jon-chuang authored Aug 12, 2024
```
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
- [Bugfix] Fix logit soft cap in flash-attn backend (#7425) · cfba4def
  Woosuk Kwon authored Aug 12, 2024
  
- [Kernel] Flashinfer correctness fix for v0.1.3 (#7319) · ec2affa8
  Lily Liu authored Aug 12, 2024
  
09 Aug, 2024 1 commit
- [Misc] Add numpy implementation of `compute_slot_mapping` (#7377) · 999ef0b9
  Antoni Baum authored Aug 09, 2024
  