Commits · 04668ebe7a35b69f1d2f8b04ef255bb16c8d2a01 · OpenDAS / vllm_cscc

23 Nov, 2024 2 commits
- [Bugfix] Avoid import AttentionMetadata explicitly in Mllama (#10593) · 04668ebe
  Isotr0py authored Nov 24, 2024
```
Signed-off-by: Isotr0py <2037008807@qq.com>
```
  04668ebe
- [core] gemma2 full context length support (#10584) · 4aba6e3d
  youkaichao authored Nov 22, 2024
```
Signed-off-by: youkaichao <youkaichao@gmail.com>
```
  4aba6e3d
22 Nov, 2024 1 commit
- [torch.compile] support all attention backends (#10558) · eebad39f
  youkaichao authored Nov 22, 2024
```
Signed-off-by: youkaichao <youkaichao@gmail.com>
```
  eebad39f
20 Oct, 2024 1 commit
- [Kernel] Support sliding window in flash attention backend (#9403) · 4fa3e333
  Chen Zhang authored Oct 20, 2024
  
  4fa3e333
16 Oct, 2024 1 commit
- [CI/Build] mypy: Resolve some errors from checking vllm/engine (#9267) · 776dbd74
  Russell Bryant authored Oct 16, 2024
```
Signed-off-by: Russell Bryant <rbryant@redhat.com>
```
  776dbd74
11 Oct, 2024 1 commit
- [Model] Support Mamba (#6484) · 7342a7d7
  Tyler Michael Smith authored Oct 11, 2024
  
  7342a7d7
06 Aug, 2024 1 commit

[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942) · fd95e026

afeldman-nm authored Aug 06, 2024

Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>

fd95e026

01 Aug, 2024 1 commit
- [Misc] Support attention logits soft-capping with flash-attn (#7022) · 805a8a75
  Woosuk Kwon authored Aug 01, 2024
  
  805a8a75
23 Jul, 2024 1 commit
- [Misc] Support FP8 kv cache scales from compressed-tensors (#6528) · 9e0b558a
  Michael Goin authored Jul 23, 2024
  
  9e0b558a
20 Jul, 2024 1 commit
- [ Misc ] `fbgemm` checkpoints (#6559) · 683e3cb9
  Robert Shaw authored Jul 20, 2024
  
  683e3cb9
16 Jul, 2024 1 commit
- [Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` (#6081) · 978aed53
  Michael Goin authored Jul 16, 2024
  
  978aed53
08 Jul, 2024 1 commit

[Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) (#4888) · 543aa485

afeldman-nm authored Jul 08, 2024

Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

543aa485

28 Jun, 2024 1 commit
- [Bugfix] Only add `Attention.kv_scale` if kv cache quantization is enabled (#5936) · 4bf35ed9
  Michael Goin authored Jun 28, 2024
  
  4bf35ed9
27 May, 2024 1 commit

[Bugfix / Core] Prefix Caching Guards (merged with main) (#4846) · 1102bef2

Zhuohan Li authored May 27, 2024


Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>

1102bef2

25 May, 2024 1 commit

[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model (#4799) · 8e192ff9

Eric Xihui Lin authored May 25, 2024


Co-authored-by: beagleski <yunanzhang@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>

8e192ff9

22 May, 2024 1 commit

[Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893) · a3a73ab0

Cody Yu authored May 22, 2024

The 2nd PR for #4532.

This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (via the `.kv_scale` parameter).

a3a73ab0
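
As a rough illustration of the idea in this commit message (not vLLM's actual loader), the sketch below reads one scaling factor per layer from a checkpoint-like dictionary and falls back to 1.0 when no entry is present. The `<layer>.kv_scale` key layout, the helper name `load_kv_scales`, and the fallback value are all assumptions made for this example.

```python
from typing import Dict, List

# Assumed fallback: a scale of 1.0 means "no rescaling".
DEFAULT_KV_SCALE = 1.0


def load_kv_scales(state_dict: Dict[str, float],
                   layer_names: List[str]) -> Dict[str, float]:
    """Hypothetical helper: collect one KV-cache scale per attention layer."""
    scales: Dict[str, float] = {}
    for name in layer_names:
        key = f"{name}.kv_scale"  # assumed checkpoint key layout
        scale = float(state_dict.get(key, DEFAULT_KV_SCALE))
        if scale <= 0.0:
            raise ValueError(f"{key} must be positive, got {scale}")
        scales[name] = scale
    return scales


# A toy "checkpoint" in which only the first layer carries a scale.
checkpoint = {"layers.0.attn.kv_scale": 0.5}
kv_scales = load_kv_scales(checkpoint, ["layers.0.attn", "layers.1.attn"])
```

Layers without a stored factor keep the default here, so a checkpoint that carries no scales would behave as if no rescaling were applied.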

16 May, 2024 1 commit
- [Bugfix] Fix FP8 KV cache support (#4869) · 9a31a817
  Woosuk Kwon authored May 16, 2024
  
  9a31a817
15 May, 2024 1 commit

[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681) · 65bf2ac1

SangBin Cho authored May 15, 2024

This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attention metadata for prefill and decode into a single class, which can be sliced when running the attention backend.

It also refactors subquery_start_loc, which was not refactored in the previous PR.

65bf2ac1
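
The "single metadata class that can be sliced" idea from this commit message can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the PR: the class `BatchAttnMetadata`, its fields, and the convention that prefill tokens come before decode tokens in the batch are hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class BatchAttnMetadata:
    """Hypothetical combined metadata for one batch (prefill first, then decode)."""
    num_prefill_tokens: int
    num_decode_tokens: int
    num_prefills: int     # how many entries of seq_lens belong to prefill
    seq_lens: List[int]   # one entry per sequence, prefill sequences first

    def prefill_view(self) -> Optional["BatchAttnMetadata"]:
        """Return only the prefill part, or None if the batch has no prefill."""
        if self.num_prefills == 0:
            return None
        return BatchAttnMetadata(self.num_prefill_tokens, 0, self.num_prefills,
                                 self.seq_lens[:self.num_prefills])

    def decode_view(self) -> Optional["BatchAttnMetadata"]:
        """Return only the decode part, or None if the batch has no decode."""
        if self.num_decode_tokens == 0:
            return None
        return BatchAttnMetadata(0, self.num_decode_tokens, 0,
                                 self.seq_lens[self.num_prefills:])


def split_tokens(tokens: List[int],
                 meta: BatchAttnMetadata) -> Tuple[List[int], List[int]]:
    """Slice a flat token list into its prefill and decode parts."""
    n = meta.num_prefill_tokens
    return tokens[:n], tokens[n:n + meta.num_decode_tokens]


# Two prefill sequences (3 + 2 tokens) and two decode sequences (1 token each).
meta = BatchAttnMetadata(num_prefill_tokens=5, num_decode_tokens=2,
                         num_prefills=2, seq_lens=[3, 2, 7, 9])
prefill_tokens, decode_tokens = split_tokens(list(range(7)), meta)
```

With one object describing the whole batch, a backend can ask for the prefill or decode slice only when it needs to run the corresponding kernel.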

13 May, 2024 1 commit
- [Misc] Enhance attention selector (#4751) · 0fca3cdc
  Woosuk Kwon authored May 13, 2024
  
  0fca3cdc
01 May, 2024 1 commit
- [Misc]Add customized information for models (#4132) · d6f4bd7c
  Jee Li authored May 01, 2024
  
  d6f4bd7c
11 Apr, 2024 1 commit
- [Core][5/N] Fully working chunked prefill e2e (#3884) · 67b4221a
  SangBin Cho authored Apr 11, 2024
  
  67b4221a
03 Apr, 2024 1 commit

Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290) · 2ff767b5

Adrian Abeyta authored Apr 03, 2024


Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

2ff767b5

25 Mar, 2024 1 commit
- [Core] Refactor Attention Take 2 (#3462) · 925f3332
  Woosuk Kwon authored Mar 24, 2024
  
  925f3332