Commits · 7015417fd4910a47263ea34c79c2cdb2ff314fdf · OpenDAS / vllm_cscc

06 Sep, 2024 1 commit
- [Misc] Remove `SqueezeLLM` (#8220) · 23f32229
  Dipika Sikka authored Sep 06, 2024
28 Aug, 2024 2 commits
- [Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM (#7651) · fdd9daaf
  Mor Zusman authored Aug 29, 2024
- [Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ (#7386) · e5697d16
  rasmith authored Aug 28, 2024
27 Aug, 2024 1 commit
- [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7766) · fc911880
  Dipika Sikka authored Aug 27, 2024
```
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
```
22 Aug, 2024 1 commit
- Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)" (#7764) · aae74ef9
  Michael Goin authored Aug 21, 2024
21 Aug, 2024 1 commit
- [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527) · 8678a69a
  Dipika Sikka authored Aug 21, 2024
```
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
```
20 Aug, 2024 1 commit
- [Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel (#7174) · 5288c06a
  Lucas Wilkinson authored Aug 20, 2024
16 Aug, 2024 2 commits
- [Kernel] fix types used in aqlm and ggml kernels to support dynamo (#7596) · 37fd47e7
  bnellnm authored Aug 16, 2024
- [Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm (#7210) · e837b624
  Charlie Fu authored Aug 16, 2024
13 Aug, 2024 1 commit
- [TPU] Suppress import custom_ops warning (#7458) · d6e634f3
  Woosuk Kwon authored Aug 13, 2024
06 Aug, 2024 1 commit
- [Kernel] Add per-tensor and per-token AZP epilogues (#5941) · 8d59dbb0
  Luka Govedič authored Aug 06, 2024
```
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
```
05 Aug, 2024 1 commit
- [Core] Support loading GGUF model (#5191) · 360bd67c
  Isotr0py authored Aug 06, 2024
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
02 Aug, 2024 1 commit
- [Misc] Disambiguate quantized types via a new ScalarType (#6396) · a8d604ca
  Lucas Wilkinson authored Aug 02, 2024
01 Aug, 2024 1 commit
- [Kernel][RFC] Refactor the punica kernel based on Triton (#5036) · 7ecee343
  Jee Jee Li authored Aug 01, 2024
31 Jul, 2024 3 commits
- Support W4A8 quantization for vllm (#5218) · 6512937d
  HandH1998 authored Jul 31, 2024
- [CI/Build] Fix mypy errors (#6968) · 9f0e69b6
  Cyrus Leung authored Jul 31, 2024
- [mypy] Enable following imports for some directories (#6681) · da1f7cc1
  Cyrus Leung authored Jul 31, 2024
30 Jul, 2024 1 commit
- [Kernel] Remove scaled_fp8_quant kernel padding footgun (#6842) · d7a299ed
  Tyler Michael Smith authored Jul 30, 2024
27 Jul, 2024 1 commit
- [Kernel] Increase precision of GPTQ/AWQ Marlin kernel (#6795) · 75acdaa4
  Alexander Matveev authored Jul 27, 2024
24 Jul, 2024 1 commit
- Add fp8 support to `reshape_and_cache_flash` (#6667) · 0e63494c
  Antoni Baum authored Jul 24, 2024
21 Jul, 2024 1 commit
- [Kernel][Core] Add AWQ support to the Marlin kernel (#6612) · 396d92d5
  Alexander Matveev authored Jul 21, 2024
20 Jul, 2024 2 commits
- [ Misc ] `fbgemm` checkpoints (#6559) · 683e3cb9
  Robert Shaw authored Jul 20, 2024
- [ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub (#6593) · 2e265642
  Varun Sundar Rabindranath authored Jul 19, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
19 Jul, 2024 1 commit
- [ Kernel ] Enable Dynamic Per Token `fp8` (#6547) · 4cc24f01
  Robert Shaw authored Jul 19, 2024
18 Jul, 2024 1 commit
- [ Kernel ] FP8 Dynamic-Per-Token Quant Kernel (#6511) · b5241e41
  Varun Sundar Rabindranath authored Jul 17, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
17 Jul, 2024 1 commit
- [Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step (#6338) · e76466dd
  Alexander Matveev authored Jul 17, 2024
16 Jul, 2024 1 commit
- [Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` (#6081) · 978aed53
  Michael Goin authored Jul 16, 2024
03 Jul, 2024 1 commit
- [Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin (#5975) · 47f0954a
  Michael Goin authored Jul 03, 2024
26 Jun, 2024 1 commit
- [Kernel] Adding bias epilogue support for `cutlass_scaled_mm` (#5560) · 5bfd1bbc
  Luka Govedič authored Jun 26, 2024

  Co-authored-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com>
  Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
20 Jun, 2024 2 commits
- [Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels (#5715) · 3f3b6b21
  Tyler Michael Smith authored Jun 20, 2024
- [Model] Port over CLIPVisionModel for VLMs (#5591) · ad137cd1
  Roger Wang authored Jun 20, 2024
17 Jun, 2024 1 commit
- [Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (#3814) · 728c4c8a
  Kunshang Ji authored Jun 18, 2024

  Co-authored-by: Jiang Li <jiang1.li@intel.com>
  Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com>
  Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
13 Jun, 2024 1 commit
- [Kernel] Factor out epilogues from cutlass kernels (#5391) · 85657b56
  Tyler Michael Smith authored Jun 13, 2024

  Co-authored-by: Michael Goin <michael@neuralmagic.com>
  Co-authored-by: youkaichao <youkaichao@gmail.com>
  Co-authored-by: zifeitong <zifei.tong@parasail.io>
  Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
12 Jun, 2024 1 commit
- [misc] add hint for AttributeError (#5462) · 622d4512
  youkaichao authored Jun 12, 2024
09 Jun, 2024 1 commit
- [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047) · 5467ac31
  bnellnm authored Jun 09, 2024
07 Jun, 2024 3 commits
- [Kernel] Dynamic Per-Token Activation Quantization (#5037) · ca3ea51b
  Dipika Sikka authored Jun 07, 2024

  Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
  Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
- [Kernel] Switch fp8 layers to use the CUTLASS kernels (#5183) · 8d75fe48
  Tyler Michael Smith authored Jun 07, 2024

  Switch from `torch._scaled_mm` to vLLM's CUTLASS fp8 kernels when supported; we are seeing a 5-15% improvement in end-to-end performance on neuralmagic/Meta-Llama-3-8B-Instruct-FP8.

  See https://docs.google.com/spreadsheets/d/1GiAnmzyGHgZ6zL_LDSTm35Bdrt4A8AaFEurDlISYYA4/ for some quick end-to-end benchmarks and #5144 for comparisons across different GEMM sizes.
- [Misc] Missing error message for custom ops import (#5282) · 15063741
  Jie Fu (傅杰) authored Jun 07, 2024
03 Jun, 2024 1 commit
- [Kernel] Pass a device pointer into the quantize kernel for the scales (#5159) · cbb2f59c
  Tyler Michael Smith authored Jun 03, 2024
25 May, 2024 1 commit
- [Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model (#4799) · 8e192ff9
  Eric Xihui Lin authored May 25, 2024

  Co-authored-by: beagleski <yunanzhang@microsoft.com>
  Co-authored-by: bapatra <bapatra@microsoft.com>
  Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
  Co-authored-by: Michael Goin <michael@neuralmagic.com>