Commits · 1b8a0d71cf5aa1a43c14478ec90538c3fbe1b315 · OpenDAS / vllm_cscc

13 Jun, 2024 1 commit

[Kernel] Factor out epilogues from cutlass kernels (#5391) · 85657b56

Tyler Michael Smith authored Jun 13, 2024


Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: zifeitong <zifei.tong@parasail.io>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>

85657b56

09 Jun, 2024 1 commit
- [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047) · 5467ac31
  bnellnm authored Jun 09, 2024
  
  5467ac31
07 Jun, 2024 1 commit

[Kernel] Dynamic Per-Token Activation Quantization (#5037) · ca3ea51b

Dipika Sikka authored Jun 07, 2024


Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

ca3ea51b

05 Jun, 2024 1 commit
- [Misc] Add CustomOp interface for device portability (#5255) · 41ca62cf
  Woosuk Kwon authored Jun 05, 2024
  
  41ca62cf
04 Jun, 2024 1 commit
- [Bugfix]: During testing, use pytest monkeypatch for safely overriding the env... · f42a006b
  afeldman-nm authored Jun 03, 2024
```
[Bugfix]: During testing, use pytest monkeypatch for safely overriding the env var that indicates the vLLM backend (#5210)
```
  f42a006b
03 Jun, 2024 1 commit
- [Kernel] Pass a device pointer into the quantize kernel for the scales (#5159) · cbb2f59c
  Tyler Michael Smith authored Jun 03, 2024
  
  cbb2f59c
01 Jun, 2024 2 commits
- [Kernel] Update Cutlass fp8 configs (#5144) · f081c3ce
  Varun Sundar Rabindranath authored Jun 01, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
  f081c3ce
- [Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU (#5137) · 260d119e
  Tyler Michael Smith authored Jun 01, 2024
  
  260d119e
31 May, 2024 1 commit
- [Model] Support MAP-NEO model (#5081) · a22dea54
  SnowDist authored May 31, 2024
```
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
```
  a22dea54
25 May, 2024 1 commit

[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model (#4799) · 8e192ff9

Eric Xihui Lin authored May 25, 2024


Co-authored-by: beagleski <yunanzhang@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>

8e192ff9

23 May, 2024 2 commits
- [Kernel] Initial Activation Quantization Support (#4525) · a1242324
  Dipika Sikka authored May 23, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  a1242324
- Marlin 24 prefill performance improvement (about 25% better on average) (#4983) · 60662532
  Alexander Matveev authored May 23, 2024
  
  60662532
22 May, 2024 2 commits
- [Misc] Take user preference in attention selector (#4960) · ee3eea0a
  Cody Yu authored May 22, 2024
  
  ee3eea0a
- [Kernel] Fixup for CUTLASS kernels in CUDA graphs (#4954) · 8674f988
  Tyler Michael Smith authored May 22, 2024
```
Pass the CUDA stream into the CUTLASS GEMMs, to avoid future issues with CUDA graphs
```
  8674f988
20 May, 2024 1 commit
- [Kernel] Add flash-attn back (#4907) · b57e6c59
  Woosuk Kwon authored May 19, 2024
  
  b57e6c59
19 May, 2024 1 commit
- [Kernel] Add marlin_24 unit tests (#4901) · 27ce8547
  Alexander Matveev authored May 19, 2024
  
  27ce8547
17 May, 2024 1 commit
- [Bugfix] fix rope error when load models with different dtypes (#4835) · 33e0823d
  Jinzhen Lin authored May 17, 2024
  
  33e0823d
16 May, 2024 2 commits
- [Kernel] Add w8a8 CUTLASS kernels (#4749) · 2060e936
  Tyler Michael Smith authored May 16, 2024
  
  2060e936
- Add marlin unit tests and marlin benchmark script (#4815) · 5c342570
  alexm-nm authored May 16, 2024
  
  5c342570
15 May, 2024 1 commit

Revert "[Kernel] Use flash-attn for decoding (#3648)" (#4820) · 8a7cc254

SangBin Cho authored May 15, 2024

The LoRA 3 and 4 tests seem to fail with an illegal memory access after this commit:

[2024-05-14 23:51:18,182 E 22 22] logging.cc:101: Unhandled exception: N3c105ErrorE. what(): CUDA error: an illegal memory access was encountered
Example: https://buildkite.com/vllm/ci/builds/7382#018f793d-1527-4e1c-ab59-c3a34ec55241

This reverts commit 1356df53.

8a7cc254

13 May, 2024 2 commits

[Kernel] Use flash-attn for decoding (#3648) · 1356df53

Stephen Krider authored May 13, 2024


Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>

1356df53

[CI/Build] Move `test_utils.py` to `tests/utils.py` (#4425) · 350f9e10

Cyrus Leung authored May 13, 2024

Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as the one in the test for the OpenAI API. I have moved the class to the test utilities to avoid code duplication. (Although it has only been repeated twice so far, I will add another similar test suite in #4200, which would duplicate the code a third time.)

I have also moved the test utilities file (test_utils.py) into the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file so that tests/utils.py can be imported with a relative import.

350f9e10

10 May, 2024 1 commit
- [Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535) · c8331017
  Cody Yu authored May 09, 2024
  
  c8331017
08 May, 2024 3 commits
- [CI/Test] fix swap test for multi gpu (#4689) · 230c4b38
  youkaichao authored May 08, 2024
  
  230c4b38
- [Core][Optimization] change python dict to pytorch tensor for blocks to swap (#4659) · 20cfcdec
  youkaichao authored May 08, 2024
  
  20cfcdec
- [Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi (#4573) · 0f9a6e3d
  DefTruth authored May 09, 2024
  
  0f9a6e3d
07 May, 2024 1 commit
- [Core][Optimization] change python dict to pytorch tensor (#4607) · 63575bc2
  youkaichao authored May 06, 2024
  
  63575bc2
04 May, 2024 1 commit

[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with... · 2a052011

Michael Goin authored May 04, 2024

[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527)

Follow-on to #4332 that enables FP8 checkpoint loading for Mixtral; supersedes #4436.

This PR enables the following checkpoint loading features for Mixtral:

- Supports loading FP8 checkpoints for Mixtral, such as the "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
- Supports static or dynamic activation quantization with static weight quantization (all per tensor)
- Supports different scales for each expert weight
- Supports FP8 in the QKV layer

Notes:

- The Expert Gate/Router always runs at half/full precision for now.
- If the weight scales differ within the QKV layer (for separate QKV weights), the weights are re-quantized using layer.weight_scale.max() so that a single GEMM can be used for performance.
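
The re-quantization mentioned in the last note can be sketched roughly as follows. This is a minimal pure-Python illustration, not the code from the PR: it assumes per-tensor scales, skips FP8 mantissa rounding (it only clamps to the e4m3 range), and every function and variable name here is hypothetical.

```python
# Hypothetical illustration: none of these names come from vLLM.
FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn


def quantize(values, scale):
    """Divide by the scale, then clamp to the FP8 range (mantissa rounding omitted)."""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]


def dequantize(qvalues, scale):
    """Recover approximate real values from quantized values and their scale."""
    return [q * scale for q in qvalues]


def requantize_to_shared_scale(shards, scales):
    """Re-express every shard against max(scales) so one scale covers all shards."""
    shared = max(scales)
    merged = []
    for qvalues, scale in zip(shards, scales):
        merged.extend(quantize(dequantize(qvalues, scale), shared))
    return merged, shared


# Three per-tensor-quantized shards (standing in for Q, K, V) with different scales.
scales = [0.5, 2.0, 1.0]
shards = [[448.0, -100.0], [448.0, 3.0], [-448.0, 7.0]]

merged, shared = requantize_to_shared_scale(shards, scales)
print(shared)  # the single scale used for the fused weight
print(merged)  # all values re-expressed against that scale
```

Choosing the maximum scale means every re-quantized value has a magnitude no larger than its original quantized value, so nothing is clamped; the trade-off is that shards with smaller original scales lose some resolution.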

2a052011

03 May, 2024 2 commits
- [Kernel] Use flashinfer for decoding (#4353) · 43c413ec
  Lily Liu authored May 03, 2024
```
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
```
  43c413ec
- [Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) · 3521ba4f
  SangBin Cho authored May 04, 2024
  
  3521ba4f
02 May, 2024 1 commit
- [kernel] fix sliding window in prefix prefill Triton kernel (#4405) · 32881f3f
  Michał Moskal authored May 02, 2024
```
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
```
  32881f3f
18 Apr, 2024 1 commit
- [Bugfix][Kernel] allow non-power-of-two head sizes in prefix prefill (#4128) · e8cc7967
  Michał Moskal authored Apr 18, 2024
  
  e8cc7967
11 Apr, 2024 2 commits
- [Core] Set `linear_weights` directly on the layer (#3977) · a10d3056
  Antoni Baum authored Apr 11, 2024
  
  a10d3056
- [Misc] Add indirection layer for custom ops (#3913) · e9da5a40
  Kunshang Ji authored Apr 11, 2024
  
  e9da5a40
03 Apr, 2024 1 commit

Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290) · 2ff767b5

Adrian Abeyta authored Apr 03, 2024


Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

2ff767b5

30 Mar, 2024 1 commit
- [Kernel] Layernorm performance optimization (#3662) · b6d10354
  mawong-amd authored Mar 30, 2024
  
  b6d10354
27 Mar, 2024 1 commit
- feat(benchmarks): Add Prefix Caching Benchmark to Serving Benchmark (#3277) · 45b6ef65
  Roger Wang authored Mar 27, 2024
  
  45b6ef65
25 Mar, 2024 2 commits
- [CI] Try introducing isort. (#3495) · 01bfb22b
  SangBin Cho authored Mar 25, 2024
  
  01bfb22b
- [Core] Refactor Attention Take 2 (#3462) · 925f3332
  Woosuk Kwon authored Mar 24, 2024
  
  925f3332
24 Mar, 2024 1 commit
- [CI] typo fix: is_hip --> is_hip() (#3595) · 8b268a46
  youkaichao authored Mar 24, 2024
  
  8b268a46