Commits · d6953beb91da4e9c99be4c0a1304a2d24189535c · OpenDAS / vllm_cscc

05 Oct, 2025 1 commit
- Convert formatting to use `ruff` instead of `yapf` + `isort` (#26247) · d6953beb
  Harry Mellor authored Oct 05, 2025
```
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
```
  d6953beb
20 Sep, 2025 1 commit
- [V1] Support `LLM.apply_model` (#18465) · 3d9a1d2d
  Cyrus Leung authored Sep 20, 2025
```
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
  3d9a1d2d
22 Aug, 2025 1 commit
- [Deprecation] Remove `prompt_token_ids` arg fallback in `LLM.generate` and `LLM.embed` (#18800) · 8896eb72
  Cyrus Leung authored Aug 22, 2025
```
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
  8896eb72
05 Aug, 2025 1 commit
- [Feature] Non-contiguous Support for FP8 Quantization (#21961) · 4771df7b
  Wentao Ye authored Aug 05, 2025
```
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
```
  4771df7b
03 Jun, 2025 1 commit
- [Misc] Add SPDX-FileCopyrightText (#19100) · 02f0c7b2
  Simon Mo authored Jun 03, 2025
```
Signed-off-by: simon-mo <simon.mo@hey.com>
```
  02f0c7b2
26 Mar, 2025 1 commit

[FEAT][ROCm] Integrate Fused MoE Kernels from AITER (#14967) · 5ebf6674

vllmellm authored Mar 26, 2025


Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>

5ebf6674

15 Mar, 2025 1 commit

[V1] V1 Enablement Oracle (#13726) · d4d93db2

Robert Shaw authored Mar 15, 2025


Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>

d4d93db2

11 Mar, 2025 1 commit
- dynamic dispatch of fp8 kernels (#14245) · a1c8f379
  Jeff Daily authored Mar 11, 2025
```
Signed-off-by: Jeff Daily <jeff.daily@amd.com>
```
  a1c8f379
07 Feb, 2025 1 commit

[ROCm] [Feature] [Doc] [Dockerfile] [BugFix] Support Per-Token-Activation Per-Channel-Weight FP8 Quantization Inferencing (#12501) · eaa92d44

TJian authored Feb 08, 2025

eaa92d44

02 Feb, 2025 1 commit

[Misc] Add SPDX-License-Identifier headers to python source files (#12628) · e489ad7a

Russell Bryant authored Feb 02, 2025

- **Add SPDX license headers to python source files**
- **Check for SPDX headers using pre-commit**

commit 9d7ef44c3cfb72ca4c32e1c677d99259d10d4745
Author: Russell Bryant <rbryant@redhat.com>
Date:   Fri Jan 31 14:18:24 2025 -0500

    Add SPDX license headers to python source files

    This commit adds SPDX license headers to python source files as
    recommended to the project by the Linux Foundation. These headers
    provide a concise way that is both human and machine readable for
    communicating license information for each source file. It helps
    avoid any ambiguity about the license of the code and can also be
    easily used by tools to help manage license compliance.

    The Linux Foundation runs license scans against the codebase to help
    ensure we are in compliance with the licenses of the code we use,
    including dependencies. Having these headers in place helps that tool
    do its job.

    More information can be found on the SPDX site:

    - https://spdx.dev/learn/handling-license-info/

    Signed-off-by: Russell Bryant <rbryant@redhat.com>

commit 5a1cf1cb3b80759131c73f6a9dddebccac039dea
Author: Russell Bryant <rbryant@redhat.com>
Date:   Fri Jan 31 14:36:32 2025 -0500

    Check for SPDX headers using pre-commit

    Signed-off-by: Russell Bryant <rbryant@redhat.com>

---------
Signed-off-by: Russell Bryant <rbryant@redhat.com>

e489ad7a

20 Jan, 2025 1 commit
- [Core] Interface for accessing model from `VllmRunner` (#10353) · 59a0192f
  Cyrus Leung authored Jan 20, 2025
```
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
  59a0192f
18 Sep, 2024 1 commit
- [CI/Build] Avoid CUDA initialization (#8534) · 6ffa3f31
  Cyrus Leung authored Sep 18, 2024
  
  6ffa3f31
16 Aug, 2024 1 commit
- [Misc/Testing] Use `torch.testing.assert_close` (#7324) · 50b8d08d
  jon-chuang authored Aug 15, 2024
  
  50b8d08d
07 Aug, 2024 1 commit
- [Bugfix][FP8] Fix dynamic FP8 Marlin quantization (#7219) · 5223199e
  Michael Goin authored Aug 07, 2024
  
  5223199e
30 Jul, 2024 1 commit
- [Kernel] Remove scaled_fp8_quant kernel padding footgun (#6842) · d7a299ed
  Tyler Michael Smith authored Jul 30, 2024
  
  d7a299ed
25 Jul, 2024 1 commit
- [Bugfix] Fix `kv_cache_dtype=fp8` without scales for FP8 checkpoints (#6761) · 65b1f121
  Michael Goin authored Jul 25, 2024
  
  65b1f121
23 Jul, 2024 1 commit
- [CI] Add smoke test for non-uniform AutoFP8 quantization (#6702) · 01c16ede
  Michael Goin authored Jul 23, 2024
  
  01c16ede
16 Jul, 2024 1 commit
- [Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` (#6081) · 978aed53
  Michael Goin authored Jul 16, 2024
  
  978aed53
03 Jul, 2024 1 commit
- [Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin (#5975) · 47f0954a
  Michael Goin authored Jul 03, 2024
  
  47f0954a
30 Jun, 2024 1 commit
- [ Misc ] Refactor w8a8 to use `process_weights_after_load` (Simplify Weight Loading) (#5940) · af9ad46f
  Robert Shaw authored Jun 30, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
```
  af9ad46f
13 Jun, 2024 1 commit
- [CI/Build][REDO] Add is_quant_method_supported to control quantization test configurations (#5466) · 23ec72fa
  Michael Goin authored Jun 13, 2024
  
  23ec72fa
12 Jun, 2024 3 commits

[Kernel] Vectorized FP8 quantize kernel (#5396) · 5985e342

Cody Yu authored Jun 12, 2024

Inspired by #5146, this PR improves the FP8 quantize kernel by vectorizing data transfer to better utilize memory bandwidth. A microbenchmark shows that the improved kernel can achieve a 1.0x-1.5x speedup (especially when the hidden size is large).

In detail, we applied three optimizations:

- Use the inverted scale so that most divisions become multiplications.
- Unroll the loop 4 times to improve instruction-level parallelism (ILP).
- Use 4-element vectorized accesses to transfer data between HBM and SRAM.
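
The loop structure behind these optimizations can be sketched in plain Python. This is only an illustration, not the CUDA kernel from the PR: the function names, the `VEC` constant, and the e4m3 clamp range of 448 are assumptions, and Python cannot reproduce the memory-bandwidth or ILP gains. It shows the reciprocal scale being computed once, the main loop handling four elements per iteration, and a scalar tail loop for leftover elements.

```python
FP8_E4M3_MAX = 448.0  # assumed clamp range (largest finite e4m3fn value)
VEC = 4               # elements handled per main-loop iteration (assumed width)


def clamp(v: float) -> float:
    """Saturate v to the assumed FP8 range."""
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v))


def scale_and_clamp(values: list[float], scale: float) -> list[float]:
    """Scale and saturate values; rounding to FP8 is omitted in this sketch."""
    inv_scale = 1.0 / scale  # one division up front, multiplications afterwards
    out = [0.0] * len(values)
    main = len(values) - len(values) % VEC

    # Main loop: load four elements at once, process them in an unrolled body.
    for i in range(0, main, VEC):
        x0, x1, x2, x3 = values[i:i + VEC]
        out[i] = clamp(x0 * inv_scale)
        out[i + 1] = clamp(x1 * inv_scale)
        out[i + 2] = clamp(x2 * inv_scale)
        out[i + 3] = clamp(x3 * inv_scale)

    # Tail loop: elements that do not fill a whole group of four.
    for i in range(main, len(values)):
        out[i] = clamp(values[i] * inv_scale)
    return out
```

Note that multiplying by `1.0 / scale` can differ from dividing by `scale` in the last bit for some inputs, which is usually acceptable when the result is rounded to 8 bits anyway.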

5985e342

Revert "[CI/Build] Add `is_quant_method_supported` to control quantization... · e3c12bf6
Simon Mo authored Jun 12, 2024
```
Revert "[CI/Build] Add `is_quant_method_supported` to control quantization test configurations" (#5463)
```
e3c12bf6
[CI/Build] Add `is_quant_method_supported` to control quantization test configurations (#5253) · 3dd6853b
Michael Goin authored Jun 12, 2024

3dd6853b

08 Jun, 2024 1 commit
- [CI/Test] improve robustness of test (vllm_runner) (#5357) · 8ea5e44a
  youkaichao authored Jun 08, 2024
```
[CI/Test] improve robustness of test by replacing del with context manager (vllm_runner) (#5357)
```
  8ea5e44a
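
As a rough illustration of why a context manager is more robust than `del` for test cleanup: `del` only removes a name, and cleanup then depends on when the object is garbage collected, whereas `__exit__` runs deterministically, even if the test body raises. The `FakeRunner` class below is hypothetical and is not the project's actual test runner.

```python
class FakeRunner:
    """Hypothetical stand-in for a test runner that holds expensive resources."""

    def __init__(self) -> None:
        self.released = False

    def __enter__(self) -> "FakeRunner":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Cleanup runs here on normal exit and on exceptions alike.
        self.released = True
        return False  # do not swallow exceptions from the with-body

    def generate(self, prompt: str) -> str:
        return prompt.upper()


with FakeRunner() as runner:
    output = runner.generate("hello")
# At this point the runner's resources have been released.
```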
26 Apr, 2024 1 commit
- [Misc][Refactor] Generalize linear_method to be quant_method (#4373) · a62aaf1d
  Cody Yu authored Apr 26, 2024
  
  a62aaf1d
20 Apr, 2024 1 commit

[Kernel][FP8] Initial support with dynamic per-tensor scaling (#4118) · a22cdea3

Cody Yu authored Apr 19, 2024

Provides initial support for FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726

This feature can be enabled with `--quantization fp8` or `-q fp8` when launching an engine.

Algorithm:
We still load a model checkpoint in FP16/BF16. After the weights are loaded, `Fp8LinearMethod` calculates the per-tensor scaling factor of the weights and quantizes the weights accordingly. The scaling factor is then stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated in every forward pass.
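
The algorithm can be sketched in pure Python as follows. This is a minimal illustration under stated assumptions, not the code in `Fp8LinearMethod`: all function names are hypothetical, the target format is assumed to be e4m3fn (largest finite value 448), FP8 rounding is simulated in software, and a dot product stands in for the linear layer.

```python
import math

FP8_E4M3_MAX = 448.0  # assumed format: e4m3fn, largest finite value


def round_to_e4m3(v: float) -> float:
    """Round v to the nearest e4m3fn-representable value, saturating at +/-448."""
    if v == 0.0:
        return 0.0
    a = min(abs(v), FP8_E4M3_MAX)
    _, ex = math.frexp(a)    # a = m * 2**ex with 0.5 <= m < 1
    e = max(ex - 1, -6)      # exponent of a; -6 is the smallest normal exponent
    step = 2.0 ** (e - 3)    # spacing with 3 mantissa bits
    return math.copysign(round(a / step) * step, v)  # round() breaks ties to even


def per_tensor_scale(values: list[float]) -> float:
    """One scale for the whole tensor, mapping its max magnitude to FP8 max."""
    amax = max((abs(v) for v in values), default=0.0)
    return max(amax, 1e-12) / FP8_E4M3_MAX  # small floor avoids a zero scale


def quantize(values: list[float], scale: float) -> list[float]:
    return [round_to_e4m3(v / scale) for v in values]


# Weights: the scale is computed once after loading and stored.
weights = [0.5, -2.0, 1.0, 4.0]
w_scale = per_tensor_scale(weights)
w_q = quantize(weights, w_scale)


def forward(activations: list[float]) -> float:
    # Activations: the scale is recomputed on every forward pass (dynamic).
    a_scale = per_tensor_scale(activations)
    a_q = quantize(activations, a_scale)
    acc = sum(a * w for a, w in zip(a_q, w_q))
    return acc * a_scale * w_scale  # undo both scales on the accumulated result
```

Because the activation scale is derived from the current input rather than from calibration data, no extra checkpoint metadata is needed, at the cost of one reduction over the activations per forward pass.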

Initial Results:
Tested so far with Mistral-7B on 1xH100, with prompt length ~5 and decoding length 128:

- BF16: 1.47s
- FP8: 1.66s

I'll try larger models and look for more performance bottlenecks. Meanwhile, you're welcome to try this code.

a22cdea3