Commits · 6cd5e5b07e4415d064d93b8a66331a097bd9287e · OpenDAS / vllm_cscc · GitLab

10 Sep, 2024 1 commit
- [Misc] Fused MoE Marlin support for GPTQ (#8217) · 6cd5e5b0
  Dipika Sikka authored Sep 09, 2024
  
  6cd5e5b0
05 Sep, 2024 1 commit
- [Core/Bugfix] Add query dtype as per FlashInfer API requirements. (#8173) · e39ebf5c
  Elfie Guo authored Sep 04, 2024
  
  e39ebf5c
29 Aug, 2024 2 commits
- [Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for... · 6b342156
  Pavani Majety authored Aug 29, 2024
```
[Core][Kernels] Enable FP8 KV Cache with Flashinfer backend.  + BugFix for kv_cache_dtype=auto (#7985)
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
  6b342156
- Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." (#7982) · ef99a787
  youkaichao authored Aug 28, 2024
  
  ef99a787
28 Aug, 2024 3 commits
- [Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM (#7651) · fdd9daaf
  Mor Zusman authored Aug 29, 2024
  
  fdd9daaf
- [Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and... · e5697d16
  rasmith authored Aug 28, 2024
```
[Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ (#7386)
```
  e5697d16
- [Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. (#7798) · b98cc28f
  Pavani Majety authored Aug 28, 2024
```
Co-authored-by: Simon Mo <simon.mo@hey.com>
```
  b98cc28f
21 Aug, 2024 1 commit
- [BUG] fix crash on flashinfer backend with cudagraph disabled, when attention... · 53328d75
  LI MOU authored Aug 21, 2024
```
[BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] (#7509)
```
  53328d75
20 Aug, 2024 1 commit
- [Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel (#7174) · 5288c06a
  Lucas Wilkinson authored Aug 20, 2024
  
  5288c06a
16 Aug, 2024 3 commits
- [Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm (#7210) · e837b624
  Charlie Fu authored Aug 16, 2024
  
  e837b624
- register custom op for flash attn and use from torch.ops (#7536) · 54bd9a03
  youkaichao authored Aug 15, 2024
  
  54bd9a03
- [Misc/Testing] Use `torch.testing.assert_close` (#7324) · 50b8d08d
  jon-chuang authored Aug 15, 2024
  
  50b8d08d
12 Aug, 2024 1 commit
- [Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel (#7208) · a046f863
  jon-chuang authored Aug 12, 2024
```
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
  a046f863
08 Aug, 2024 1 commit
- [Bugfix][Kernel] Increased atol to fix failing tests (#7305) · 5fb4a3f6
  Luka Govedič authored Aug 08, 2024
  
  5fb4a3f6
06 Aug, 2024 2 commits
- [Core] Subclass ModelRunner to support cross-attention & encoder sequences... · fd95e026
  afeldman-nm authored Aug 06, 2024
```
[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942)
Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
```
  fd95e026
- [Kernel] Add per-tensor and per-token AZP epilogues (#5941) · 8d59dbb0
  Luka Govedič authored Aug 06, 2024
```
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
```
  8d59dbb0
02 Aug, 2024 1 commit
- [Misc] Disambiguate quantized types via a new ScalarType (#6396) · a8d604ca
  Lucas Wilkinson authored Aug 02, 2024
  
  a8d604ca
01 Aug, 2024 2 commits
- [Misc] Support attention logits soft-capping with flash-attn (#7022) · 805a8a75
  Woosuk Kwon authored Aug 01, 2024
  
  805a8a75
- [Kernel][RFC] Refactor the punica kernel based on Triton (#5036) · 7ecee343
  Jee Jee Li authored Aug 01, 2024
  
  7ecee343
31 Jul, 2024 1 commit
- Support W4A8 quantization for vllm (#5218) · 6512937d
  HandH1998 authored Jul 31, 2024
  
  6512937d
30 Jul, 2024 1 commit
- [Kernel] Tuned int8 kernels for Ada Lovelace (#6848) · af647fb8
  Varun Sundar Rabindranath authored Jul 29, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  af647fb8
29 Jul, 2024 2 commits
- [Bugfix] Allow vllm to still work if triton is not installed. (#6786) · 9a7e2d05
  Thomas Parnell authored Jul 29, 2024
```
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
```
  9a7e2d05
- [Kernel] Tuned FP8 Kernels for Ada Lovelace (#6677) · 766435e6
  Varun Sundar Rabindranath authored Jul 29, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  766435e6
27 Jul, 2024 2 commits
- [Kernel] Increase precision of GPTQ/AWQ Marlin kernel (#6795) · 75acdaa4
  Alexander Matveev authored Jul 27, 2024
  
  75acdaa4
- [Model] H2O Danube3-4b (#6451) · 14dbd5a7
  Joe authored Jul 26, 2024
  
  14dbd5a7
24 Jul, 2024 1 commit
- Add fp8 support to `reshape_and_cache_flash` (#6667) · 0e63494c
  Antoni Baum authored Jul 24, 2024
  
  0e63494c
22 Jul, 2024 1 commit
- [Bugfix][Kernel] Use int64_t for indices in fp8 quant kernels (#6649) · fea59c77
  Tyler Michael Smith authored Jul 22, 2024
  
  fea59c77
21 Jul, 2024 1 commit
- [Kernel][Core] Add AWQ support to the Marlin kernel (#6612) · 396d92d5
  Alexander Matveev authored Jul 21, 2024
  
  396d92d5
20 Jul, 2024 1 commit
- [ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub (#6593) · 2e265642
  Varun Sundar Rabindranath authored Jul 19, 2024
```
Co-authored-by: Varun Sundar Rabindranth <varun@neuralmagic.com>
```
  2e265642
19 Jul, 2024 1 commit
- [ Kernel ] Enable Dynamic Per Token `fp8` (#6547) · 4cc24f01
  Robert Shaw authored Jul 19, 2024
  
  4cc24f01
18 Jul, 2024 1 commit
- [ Kernel ] FP8 Dynamic-Per-Token Quant Kernel (#6511) · b5241e41
  Varun Sundar Rabindranath authored Jul 17, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  b5241e41
16 Jul, 2024 1 commit
- [Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` (#6081) · 978aed53
  Michael Goin authored Jul 16, 2024
  
  978aed53
11 Jul, 2024 1 commit
- [ Misc ] Refactor Marlin Python Utilities (#6082) · b675069d
  Robert Shaw authored Jul 11, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
```
  b675069d
08 Jul, 2024 1 commit
- [Kernel] Correctly invoke prefill & decode kernels for cross-attention... · 543aa485
  afeldman-nm authored Jul 08, 2024
```
[Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) (#4888)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  543aa485
04 Jul, 2024 1 commit
- [Kernel][Model] logits_soft_cap for Gemma2 with flashinfer (#6051) · 69ec3ca1
  Lily Liu authored Jul 04, 2024
```
Co-authored-by: Simon Mo <simon.mo@hey.com>
```
  69ec3ca1
03 Jul, 2024 2 commits
- [Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin (#5975) · 47f0954a
  Michael Goin authored Jul 03, 2024
  
  47f0954a
- [hardware][misc] introduce platform abstraction (#6080) · 482045ee
  youkaichao authored Jul 02, 2024
  
  482045ee
02 Jul, 2024 1 commit
- [ Misc ] Refactor MoE to isolate Fp8 From Mixtral (#5970) · 7c008c51
  Robert Shaw authored Jul 02, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
  7c008c51
01 Jul, 2024 2 commits
- [Bugfix] adding chunking mechanism to fused_moe to handle large inputs (#6029) · 12a59959
  Avshalom Manevich authored Jul 02, 2024
  
  12a59959
- [misc][cuda] use nvml to avoid accidentally cuda initialization (#6007) · 614aa512
  youkaichao authored Jun 30, 2024
  
  614aa512