Commits · ad60a973fbee9102ca542c7eaa388c02fd8581ce · OpenDAS / vllm_cscc · GitLab

01 Oct, 2025 1 commit

[New Model] DeepSeek-V3.2 (Rebased to Main) (#25896) · b3230e1a

Yongye Zhu authored Sep 30, 2025


Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Signed-off-by: Lucia Fang <fanglu@meta.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Lucia Fang <116399278+luccafong@users.noreply.github.com>
Co-authored-by: Lucia Fang <fanglu@meta.com>
Co-authored-by: NickLucche <nlucches@redhat.com>
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Xiaozhu Meng <mxz297@gmail.com>
Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Signed-off-by: simon-mo <simon.mo@hey.com>


23 Sep, 2025 1 commit
- [BugFix] Fix UB in per_token_group_quant.cu (#24913) · 2357480b
  rivos-shreeasish authored Sep 23, 2025
```
Signed-off-by: Shreeasish Kumar <shreeasish@rivosinc.com>
```
17 Sep, 2025 1 commit
- Apply fixes for CUDA 13 (#24599) · bfe93801
  Aidyn-A authored Sep 17, 2025
```
Signed-off-by: Aidyn-A <aidyn.b.aitzhan@gmail.com>
```
13 Sep, 2025 1 commit
- [Perf] Use NVIDIA hardware-accelerated instruction for float to fp8_e4m3 quantization (#24757) · dbeee384
  elvischenv authored Sep 13, 2025
```
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
```
05 Aug, 2025 1 commit
- [Feature] Non-contiguous Support for FP8 Quantization (#21961) · 4771df7b
  Wentao Ye authored Aug 05, 2025
```
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
```
30 Jul, 2025 1 commit
- [Perf] Using `__nv_fp8_e4m3` instead of `c10::e4m3` for `per_token_group_quant` (#21867) · 1b0a1555
  Wentao Ye authored Jul 29, 2025
```
Signed-off-by: yewentao256 <zhyanwentao@126.com>
```
26 Jul, 2025 1 commit
- [Perf] Cuda Kernel for Int8 Per Token Group Quant (#21476) · 75d29cf4
  Wentao Ye authored Jul 25, 2025
```
Signed-off-by: yewentao256 <zhyanwentao@126.com>
```
22 Jul, 2025 2 commits
- [Perf] Cuda Kernel for Per Token Group Quant (#21083) · 774d0c01
  Wentao Ye authored Jul 22, 2025
```
Signed-off-by: yewentao256 <zhyanwentao@126.com>
```
- [perf] Add fused MLA QKV + strided layernorm (#21116) · 4fb56914
  Mickaël Seznec authored Jul 22, 2025
```
Signed-off-by: Mickael Seznec <mickael@mistral.ai>
Co-authored-by: mgoin <mgoin64@gmail.com>
```
16 Jun, 2025 1 commit
- [MISC] Remove unused variableds in C++ (#19609) · c6703d1e
  Lu Fang authored Jun 16, 2025
```
Signed-off-by: Lu Fang <lufang@fb.com>
```
12 Jun, 2025 1 commit
- add kvcache fp8 · 504a12b8
  zhuwenwen authored Jun 12, 2025
  
03 Jun, 2025 1 commit
- [Perf] Tune `scaled_fp8_quant` by increasing vectorization (#18844) · e31446b6
  Michael Goin authored Jun 03, 2025
```
Signed-off-by: mgoin <mgoin64@gmail.com>
```
14 May, 2025 1 commit
- add kvint8 · 45273722
  xiabo authored May 14, 2025
  
07 May, 2025 1 commit
- Removed unused marlin cuda code (#17684) · a17cef70
  Michael Goin authored May 06, 2025
```
Signed-off-by: mgoin <mgoin64@gmail.com>
```
31 Mar, 2025 2 commits
- [Feature][ROCm]Enable fusion pass for torch.compile on ROCm (#15050) · e8582945
  Charlie Fu authored Mar 31, 2025
```
Signed-off-by: charlifu <charlifu@amd.com>
```
- skip fp8 kernels and paged_attention_rocm · 52675626
  zhuwenwen authored Mar 31, 2025
  
15 Mar, 2025 1 commit
- [Misc][Easy] Annotate unused vars in the csrc files (#14798) · 8c0d15d5
  Lu Fang authored Mar 14, 2025
```
Signed-off-by: Lu Fang <lufang@fb.com>
```
14 Mar, 2025 1 commit
- forward fix PR 14245, restore build on ROCm 6.2 (#14709) · 2a602b05
  Jeff Daily authored Mar 13, 2025
```
Signed-off-by: Jeff Daily <jeff.daily@amd.com>
```
11 Mar, 2025 1 commit
- dynamic distpatch of fp8 kernels (#14245) · a1c8f379
  Jeff Daily authored Mar 11, 2025
```
Signed-off-by: Jeff Daily <jeff.daily@amd.com>
```
27 Feb, 2025 1 commit
- [ROCm][Quantization][Kernel] Use FP8 FNUZ when OCP flag is 0 or undefined (#13851) · a31614e3
  ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 authored Feb 27, 2025
```
Signed-off-by: Hollow Man <hollowman@opensuse.org>
```
25 Feb, 2025 1 commit
- [ROCm][Quantization][Kernel] Using HIP FP8 header (#12593) · aabeb268
  Gregory Shtrasberg authored Feb 25, 2025
  
20 Feb, 2025 1 commit
- [ROCm] MI300A compile targets deprecation (#13560) · 0023cd2b
  Gregory Shtrasberg authored Feb 20, 2025
  
13 Dec, 2024 1 commit

[torch.compile] Dynamic fp8 + rms_norm fusion (#10906) · 30870b4f

Luka Govedič authored Dec 12, 2024


Signed-off-by: luka <luka@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>


08 Nov, 2024 1 commit
- [torch.compile] Fuse RMSNorm with quant (#9138) · 4f93dfe9
  Luka Govedič authored Nov 08, 2024
```
Signed-off-by: luka <luka@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@126.com>
```
16 Oct, 2024 1 commit
- [Bugfix][Kernel] Prevent integer overflow in fp8 dynamic per-token quantize kernel (#9425) · c3fab5f7
  Tyler Michael Smith authored Oct 16, 2024
  
04 Oct, 2024 1 commit
- [CI/Build] Per file CUDA Archs (improve wheel size and dev build times) (#8845) · aeb37c2a
  Lucas Wilkinson authored Oct 03, 2024
  
22 Aug, 2024 1 commit
- [Kernel] Replaced `blockReduce[...]` functions with `cub::BlockReduce` (#7233) · 7937009a
  Luka Govedič authored Aug 21, 2024
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
16 Aug, 2024 1 commit
- [Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm (#7210) · e837b624
  Charlie Fu authored Aug 16, 2024
  
05 Aug, 2024 1 commit
- [CI/Build] Suppress divide-by-zero and missing return statement warnings (#7001) · 6e4852ce
  Tyler Michael Smith authored Aug 05, 2024
  
30 Jul, 2024 1 commit
- [Kernel] Squash a few more warnings (#6914) · cbbc9044
  Tyler Michael Smith authored Jul 30, 2024
  
26 Jul, 2024 1 commit
- [Bugfix][Kernel] Promote another index to int64_t (#6838) · 50704f52
  Tyler Michael Smith authored Jul 26, 2024
  
22 Jul, 2024 1 commit
- [Bugfix][Kernel] Use int64_t for indices in fp8 quant kernels (#6649) · fea59c77
  Tyler Michael Smith authored Jul 22, 2024
  
21 Jul, 2024 1 commit
- [Kernel][Core] Add AWQ support to the Marlin kernel (#6612) · 396d92d5
  Alexander Matveev authored Jul 21, 2024
  
20 Jul, 2024 1 commit
- [ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub (#6593) · 2e265642
  Varun Sundar Rabindranath authored Jul 19, 2024
```
Co-authored-by: Varun Sundar Rabindranth <varun@neuralmagic.com>
```
18 Jul, 2024 1 commit
- [ Kernel ] FP8 Dynamic-Per-Token Quant Kernel (#6511) · b5241e41
  Varun Sundar Rabindranath authored Jul 17, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
03 Jul, 2024 1 commit
- [Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin (#5975) · 47f0954a
  Michael Goin authored Jul 03, 2024
  
12 Jun, 2024 1 commit

[Kernel] Vectorized FP8 quantize kernel (#5396) · 5985e342

Cody Yu authored Jun 12, 2024

Inspired by #5146, this PR improves the FP8 quantize kernel by vectorizing data transfers to better utilize memory bandwidth. Microbenchmarks show that the improved kernel achieves a 1.0x-1.5x speedup, especially when the hidden size is large.

In detail, we applied three optimizations:

- Use an inverted scale so that most divisions become multiplications.
- Unroll the loop 4 times to improve instruction-level parallelism (ILP).
- Use 4-element vectorized accesses to transfer data between HBM and SRAM.
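
The three optimizations above can be illustrated with a rough plain-Python sketch. This is not the CUDA kernel from the PR: the function name `quantize_scaled` and the constant `FP8_E4M3_MAX` are hypothetical names chosen for illustration, the clamp bound of 448 assumes the E4M3 format with no infinities, and the final rounding to an 8-bit encoding is deliberately left out. The 4-element chunks only mimic what a GPU kernel would do with a single 4-wide load and store.

```python
# Illustrative sketch only; names are hypothetical and FP8 rounding is omitted.
FP8_E4M3_MAX = 448.0  # assumed largest finite magnitude of the E4M3 format


def _clamp(x):
    """Clamp a scaled value into the representable FP8 E4M3 range."""
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))


def quantize_scaled(values, scale, vec=4):
    """Scale and clamp `values`, processing `vec` elements per iteration."""
    # Inverted scale: one division up front, multiplications in the loop.
    inv_scale = 1.0 / scale
    out = [0.0] * len(values)
    n_vec = (len(values) // vec) * vec

    # Main loop: each iteration handles `vec` elements, standing in for a
    # 4-wide vectorized load, four unrolled multiplies, and a 4-wide store.
    for i in range(0, n_vec, vec):
        chunk = values[i:i + vec]
        out[i:i + vec] = [_clamp(x * inv_scale) for x in chunk]

    # Tail loop: leftover elements when the length is not a multiple of `vec`.
    for i in range(n_vec, len(values)):
        out[i] = _clamp(values[i] * inv_scale)
    return out
```

For example, `quantize_scaled([1.0, -2.0, 3.0, 1000.0, 0.5], scale=0.5)` processes the first four values as one chunk and the last value in the tail loop; the value 1000.0 scales to 2000.0 and is clamped to 448.0.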


09 Jun, 2024 1 commit
- [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047) · 5467ac31
  bnellnm authored Jun 09, 2024
  
22 May, 2024 1 commit
- [CI/Build] Enforce style for C++ and CUDA code with `clang-format` (#4722) · 5f6d10c1
  Michael Goin authored May 22, 2024
  
10 May, 2024 1 commit
- [Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535) · c8331017
  Cody Yu authored May 09, 2024
  