Commits · e97f802b2d74861af77997691a7d1c36498f6dca · OpenDAS / vllm_cscc

23 Jan, 2025 1 commit

[FP8][Kernel] Dynamic kv cache scaling factors computation (#11906) · e97f802b

Gregory Shtrasberg authored Jan 23, 2025


Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Micah Williamson <micah.williamson@amd.com>

e97f802b

22 Jan, 2025 1 commit
- [core] separate builder init and builder prepare for each batch (#12253) · 66818e5b
  youkaichao authored Jan 22, 2025
```
Signed-off-by: youkaichao <youkaichao@gmail.com>
```
  66818e5b
22 Nov, 2024 1 commit
- [torch.compile] support all attention backends (#10558) · eebad39f
  youkaichao authored Nov 22, 2024
```
Signed-off-by: youkaichao <youkaichao@gmail.com>
```
  eebad39f
04 Nov, 2024 1 commit
- [Misc] Compute query_start_loc/seq_start_loc on CPU (#9447) · 4dbcbbeb
  Yang Zheng authored Nov 04, 2024
```
Co-authored-by: Yang Zheng(SW)(Alex) <you@example.com>
```
  4dbcbbeb
02 Nov, 2024 1 commit
- [Encoder Decoder] Add flash_attn kernel support for encoder-decoder models (#9559) · a78dd330
  sroy745 authored Nov 01, 2024
  
  a78dd330
01 Nov, 2024 1 commit
- [Core][VLM] Add precise multi-modal placeholder tracking (#8346) · 6c0b7f54
  Peter Salas authored Nov 01, 2024
```
Signed-off-by: Peter Salas <peter@fixie.ai>
```
  6c0b7f54
31 Oct, 2024 1 commit

[Bugfix] Fix `illegal memory access` error with chunked prefill, prefix caching, block manager v2 and xformers enabled together (#9532) · 55650c83

sasha0552 authored Oct 31, 2024

Signed-off-by: sasha0552 <admin@sasha0552.org>

55650c83

21 Oct, 2024 1 commit
- [Doc] Consistent naming of attention backends (#9498) · 496e991d
  Thomas Parnell authored Oct 21, 2024
```
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
```
  496e991d
17 Oct, 2024 1 commit

[Core] Deprecating block manager v1 and make block manager v2 default (#8704) · 81ede99c

Kuntai Du authored Oct 17, 2024

Remove block manager v1. This is the first piece of a prefix-caching-centric design: to get there, the code path must be simplified so that only the v2 block manager is used, which performs much better with prefix caching.

81ede99c

12 Oct, 2024 1 commit
- [SpecDec] Remove Batch Expansion (2/3) (#9298) · 89feb4c8
  Lily Liu authored Oct 11, 2024
  
  89feb4c8
01 Oct, 2024 1 commit
- [Spec Decode] (1/2) Remove batch expansion (#8839) · 15702038
  Lily Liu authored Oct 01, 2024
  
  15702038
18 Sep, 2024 1 commit

[CI/Build] Update Ruff version (#8469) · 9d104b5b

Aaron Pham authored Sep 18, 2024


Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

9d104b5b

17 Sep, 2024 1 commit
- [Encoder decoder] Add cuda graph support during decoding for encoder-decoder models (#7631) · 1009e93c
  sroy745 authored Sep 17, 2024
  
  1009e93c
20 Aug, 2024 1 commit
- [Core] Add `AttentionState` abstraction (#7663) · 3b682179
  Antoni Baum authored Aug 20, 2024
  
  3b682179
09 Aug, 2024 2 commits
- [Misc] Add numpy implementation of `compute_slot_mapping` (#7377) · 999ef0b9
  Antoni Baum authored Aug 09, 2024
  
  999ef0b9
- [Performance] Optimize e2e overheads: Reduce python allocations (#7162) · e02ac556
  Alexander Matveev authored Aug 09, 2024
  
  e02ac556
05 Aug, 2024 1 commit
- [MISC] Use non-blocking transfer in prepare_input (#7172) · ef527be0
  Cody Yu authored Aug 05, 2024
  
  ef527be0
01 Aug, 2024 1 commit
- [Misc] Support attention logits soft-capping with flash-attn (#7022) · 805a8a75
  Woosuk Kwon authored Aug 01, 2024
  
  805a8a75
25 Jul, 2024 1 commit
- [Bugfix] Fix decode tokens w. CUDA graph (#6757) · 309aaef8
  Cody Yu authored Jul 24, 2024
  
  309aaef8
23 Jul, 2024 1 commit
- [Core] Modulize prepare input and attention metadata builder (#6596) · e0c15758
  Cody Yu authored Jul 22, 2024
  
  e0c15758
20 Jul, 2024 1 commit
- [Misc] Consolidate and optimize logic for building padded tensors (#6541) · 9042d683
  Cyrus Leung authored Jul 20, 2024
  
  9042d683
17 Jul, 2024 1 commit
- [Core] Refactor _prepare_model_input_tensors - take 2 (#6164) · 2fa4623d
  Cody Yu authored Jul 17, 2024
  
  2fa4623d
08 Jul, 2024 1 commit

[Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) (#4888) · 543aa485

afeldman-nm authored Jul 08, 2024

Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

543aa485