Commits · 1009e93c5d634c724eeff3d4e453369337f502d4 · OpenDAS / vllm_cscc

17 Sep, 2024 1 commit
- [Encoder decoder] Add cuda graph support during decoding for encoder-decoder models (#7631) · 1009e93c
  sroy745 authored Sep 17, 2024
14 Sep, 2024 1 commit
- [Kernel][Hardware][Amd]Custom paged attention kernel for rocm (#8310) · 1ef0d2ef
  Charlie Fu authored Sep 13, 2024
13 Sep, 2024 1 commit
- [Hardware][intel GPU] bump up ipex version to 2.3 (#8365) · 85172520
  Kunshang Ji authored Sep 14, 2024
```
Co-authored-by: Yan Ma <yan.ma@intel.com>
```
12 Sep, 2024 3 commits
- [Bugfix] multi-step + flashinfer: ensure cuda graph compatible (#8427) · 01987725
  Alexander Matveev authored Sep 12, 2024
- [multi-step] add flashinfer backend (#7928) · a6c0f365
  William Lin authored Sep 12, 2024
- [torch.compile] hide slicing under custom op for inductor (#8384) · 7de49aa8
  youkaichao authored Sep 12, 2024
10 Sep, 2024 2 commits
- [Bugfix] lookahead block table with cuda graph max capture (#8340) · 22f3a4bc
  Alexander Matveev authored Sep 10, 2024
```
[Bugfix] Ensure multistep lookahead allocation is compatible with cuda graph max capture (#8340)
```
- [Spec Decode] Move ops.advance_step to flash attn advance_step (#8224) · 5faedf1b
  Kevin Lin authored Sep 10, 2024
05 Sep, 2024 1 commit
- [Core/Bugfix] Add query dtype as per FlashInfer API requirements. (#8173) · e39ebf5c
  Elfie Guo authored Sep 04, 2024
31 Aug, 2024 1 commit
- [Bugfix] bugfix and add model test for flashinfer fp8 kv cache. (#8013) · 622f8abf
  Pavani Majety authored Aug 30, 2024
30 Aug, 2024 2 commits
- [TPU][Bugfix] Fix tpu type api (#8035) · 2684efc4
  Woosuk Kwon authored Aug 30, 2024
- [TPU] Support single and multi-host TPUs on GKE (#7613) · 2148441f
  Richard Liu authored Aug 30, 2024
29 Aug, 2024 2 commits
- [Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto (#7985) · 6b342156
  Pavani Majety authored Aug 29, 2024
```
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
- Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." (#7982) · ef99a787
  youkaichao authored Aug 28, 2024
28 Aug, 2024 1 commit
- [Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. (#7798) · b98cc28f
  Pavani Majety authored Aug 28, 2024
```
Co-authored-by: Simon Mo <simon.mo@hey.com>
```
27 Aug, 2024 1 commit
- Revert #7509 (#7887) · 9606c719
  Cody Yu authored Aug 27, 2024
21 Aug, 2024 1 commit
- [BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] (#7509) · 53328d75
  LI MOU authored Aug 21, 2024
20 Aug, 2024 1 commit
- [Core] Add `AttentionState` abstraction (#7663) · 3b682179
  Antoni Baum authored Aug 20, 2024
16 Aug, 2024 2 commits
- [spec decode] [4/N] Move update_flash_attn_metadata to attn backend (#7571) · f366f633
  William Lin authored Aug 16, 2024
```
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
- register custom op for flash attn and use from torch.ops (#7536) · 54bd9a03
  youkaichao authored Aug 15, 2024
12 Aug, 2024 3 commits
- [Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel (#7208) · a046f863
  jon-chuang authored Aug 12, 2024
```
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
- [Bugfix] Fix logit soft cap in flash-attn backend (#7425) · cfba4def
  Woosuk Kwon authored Aug 12, 2024
- [Kernel] Flashinfer correctness fix for v0.1.3 (#7319) · ec2affa8
  Lily Liu authored Aug 12, 2024
09 Aug, 2024 2 commits
- [Misc] Add numpy implementation of `compute_slot_mapping` (#7377) · 999ef0b9
  Antoni Baum authored Aug 09, 2024
- [Performance] Optimize e2e overheads: Reduce python allocations (#7162) · e02ac556
  Alexander Matveev authored Aug 09, 2024
07 Aug, 2024 1 commit
- [Kernel] Fix Flashinfer Correctness (#7284) · e53dfd3e
  Lily Liu authored Aug 07, 2024
05 Aug, 2024 1 commit
- [MISC] Use non-blocking transfer in prepare_input (#7172) · ef527be0
  Cody Yu authored Aug 05, 2024
03 Aug, 2024 1 commit
- [Bugfix] Fix block table for seqs that have prefix cache hits (#7018) · fb2c1c86
  Zach Zheng authored Aug 02, 2024
02 Aug, 2024 1 commit
- [Kernel] Fix input for flashinfer prefill wrapper. (#7008) · 954f7305
  Lily Liu authored Aug 01, 2024
01 Aug, 2024 1 commit
- [Misc] Support attention logits soft-capping with flash-attn (#7022) · 805a8a75
  Woosuk Kwon authored Aug 01, 2024
27 Jul, 2024 2 commits
- [TPU] Reduce compilation time & Upgrade PyTorch XLA version (#6856) · fad5576c
  Woosuk Kwon authored Jul 27, 2024
- [Hardware][TPU] Implement tensor parallelism with Ray (#5871) · 52f07e3d
  Woosuk Kwon authored Jul 26, 2024
25 Jul, 2024 1 commit
- [Bugfix] Fix decode tokens w. CUDA graph (#6757) · 309aaef8
  Cody Yu authored Jul 24, 2024
24 Jul, 2024 2 commits
- [Core] Tweaks to model runner/input builder developer APIs (#6712) · 5448f676
  Antoni Baum authored Jul 24, 2024
- Add fp8 support to `reshape_and_cache_flash` (#6667) · 0e63494c
  Antoni Baum authored Jul 24, 2024
23 Jul, 2024 1 commit
- [Core] Modulize prepare input and attention metadata builder (#6596) · e0c15758
  Cody Yu authored Jul 22, 2024
20 Jul, 2024 2 commits
- [Bugfix][CI/Build][Hardware][AMD] Fix AMD tests, add HF cache, update CK FA, add partially supported model notes (#6543) · 06d6c5fe
  Matt Wong authored Jul 20, 2024
- [Misc] Consolidate and optimize logic for building padded tensors (#6541) · 9042d683
  Cyrus Leung authored Jul 20, 2024
18 Jul, 2024 1 commit
- [Bugfix] Update flashinfer.py with PagedAttention forwards - Fixes Gemma2 OpenAI Server Crash (#6501) · c8a7d51c
  Noam Gat authored Jul 18, 2024
17 Jul, 2024 1 commit
- [Core] Refactor _prepare_model_input_tensors - take 2 (#6164) · 2fa4623d
  Cody Yu authored Jul 17, 2024