Commits · e351572900f7d87e14fe203ea3a49c1c7ddae0d6 · OpenDAS / vllm_cscc

17 Sep, 2024 1 commit
- [Feature][kernel] tensor parallelism with bitsandbytes quantization (#8434) · 9855b995
  chenqianfzh authored Sep 17, 2024
  
13 Sep, 2024 1 commit
- [misc][ci] fix quant test (#8449) · a2469127
  youkaichao authored Sep 13, 2024
  
11 Sep, 2024 1 commit
- [Hardware][Intel] Support compressed-tensor W8A8 for CPU backend (#7257) · 0b952af4
  Li, Jiang authored Sep 12, 2024
  
29 Aug, 2024 1 commit
- support bitsandbytes 8-bit and FP4 quantized models (#7445) · 4664ceaa
  chenqianfzh authored Aug 29, 2024
  
27 Aug, 2024 1 commit
- [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7766) · fc911880
  Dipika Sikka authored Aug 27, 2024
```
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
```
21 Aug, 2024 1 commit
- [ci][test] adjust max wait time for cpu offloading test (#7709) · 9e51b6a6
  youkaichao authored Aug 20, 2024
  
16 Aug, 2024 3 commits
- [Kernel] W8A16 Int8 inside FusedMoE (#7415) · 7fc23be8
  Mor Zusman authored Aug 16, 2024
  
- [Misc/Testing] Use `torch.testing.assert_close` (#7324) · 50b8d08d
  jon-chuang authored Aug 15, 2024
  
- [CI] Move quantization cpu offload tests out of fastcheck (#7574) · e1655287
  Michael Goin authored Aug 16, 2024
  
14 Aug, 2024 1 commit
- [Misc] Revert `compressed-tensors` code reuse (#7521) · f55a9aea
  Kyle Sayers authored Aug 14, 2024
  
13 Aug, 2024 1 commit
- [Misc] `compressed-tensors` code reuse (#7277) · 373538f9
  Kyle Sayers authored Aug 13, 2024
  
07 Aug, 2024 2 commits
- [Bugfix][FP8] Fix dynamic FP8 Marlin quantization (#7219) · 5223199e
  Michael Goin authored Aug 07, 2024
  
- [Misc] Refactor linear layer weight loading; introduce `BasevLLMParameter` and... · 0f7052bc
  Dipika Sikka authored Aug 07, 2024
```
[Misc] Refactor linear layer weight loading; introduce `BasevLLMParameter` and `weight_loader_v2` (#5874)
```
05 Aug, 2024 1 commit
- [Core] Support loading GGUF model (#5191) · 360bd67c
  Isotr0py authored Aug 06, 2024
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
01 Aug, 2024 1 commit
- [CI/Build] Remove sparseml requirement from testing (#7037) · fb3db616
  Michael Goin authored Aug 01, 2024
  
30 Jul, 2024 1 commit
- [Kernel] Remove scaled_fp8_quant kernel padding footgun (#6842) · d7a299ed
  Tyler Michael Smith authored Jul 30, 2024
  
25 Jul, 2024 1 commit
- [Bugfix] Fix `kv_cache_dtype=fp8` without scales for FP8 checkpoints (#6761) · 65b1f121
  Michael Goin authored Jul 25, 2024
  
23 Jul, 2024 3 commits
- [bitsandbytes]: support read bnb pre-quantized model (#5753) · 87525fab
  dongmao zhang authored Jul 23, 2024
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
- [CI] Add smoke test for non-uniform AutoFP8 quantization (#6702) · 01c16ede
  Michael Goin authored Jul 23, 2024
  
- [Misc] Support FP8 kv cache scales from compressed-tensors (#6528) · 9e0b558a
  Michael Goin authored Jul 23, 2024
  
21 Jul, 2024 1 commit
- [Kernel][Core] Add AWQ support to the Marlin kernel (#6612) · 396d92d5
  Alexander Matveev authored Jul 21, 2024
  
16 Jul, 2024 1 commit
- [Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` (#6081) · 978aed53
  Michael Goin authored Jul 16, 2024
  
11 Jul, 2024 1 commit
- [ Misc ] Refactor Marlin Python Utilities (#6082) · b675069d
  Robert Shaw authored Jul 11, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
```
07 Jul, 2024 1 commit
- [ Misc ] Support Fp8 via `llm-compressor` (#6110) · abfe705a
  Robert Shaw authored Jul 07, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
```
03 Jul, 2024 3 commits
- [ Misc ] Clean Up `CompressedTensorsW8A8` (#6113) · 62963d12
  Robert Shaw authored Jul 03, 2024
  
- [Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin (#5975) · 47f0954a
  Michael Goin authored Jul 03, 2024
  
- [hardware][misc] introduce platform abstraction (#6080) · 482045ee
  youkaichao authored Jul 02, 2024
  
02 Jul, 2024 1 commit
- [CORE] Quantized lm-head Framework (#4442) · ee93f4f9
  Qubitium-ModelCloud authored Jul 03, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
Co-authored-by: ZX <zx@lbx.dev>
```
01 Jul, 2024 1 commit
- [misc][cuda] use nvml to avoid accidentally cuda initialization (#6007) · 614aa512
  youkaichao authored Jun 30, 2024
  
30 Jun, 2024 1 commit
- [ Misc ] Refactor w8a8 to use `process_weights_after_load` (Simplify Weight Loading) (#5940) · af9ad46f
  Robert Shaw authored Jun 30, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
```
25 Jun, 2024 1 commit
- [Misc] Update `w4a16` `compressed-tensors` support to include `w8a16` (#5794) · dd248f76
  Dipika Sikka authored Jun 25, 2024
  
19 Jun, 2024 1 commit
- [Misc] Add per channel support for static activation quantization; update w8a8... · 4a30d7e3
  Dipika Sikka authored Jun 19, 2024
```
[Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes (#5650)
```
18 Jun, 2024 1 commit
- [Misc] Add channel-wise quantization support for w8a8 dynamic per token... · 95db455e
  Dipika Sikka authored Jun 18, 2024
```
[Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization (#5542)
```
17 Jun, 2024 1 commit
- [Kernel] `compressed-tensors` marlin 24 support (#5435) · 890d8d96
  Dipika Sikka authored Jun 17, 2024
  
16 Jun, 2024 1 commit
- [CI][BugFix] Flip is_quant_method_supported condition (#5577) · 4a676905
  Michael Goin authored Jun 16, 2024
  
15 Jun, 2024 1 commit
- [mypy] Enable type checking for test directory (#5017) · 0e9164b4
  Cyrus Leung authored Jun 15, 2024
  
13 Jun, 2024 2 commits
- [CI/Build][REDO] Add is_quant_method_supported to control quantization test configurations (#5466) · 23ec72fa
  Michael Goin authored Jun 13, 2024
  
- [Kernel] `w4a16` support for `compressed-tensors` (#5385) · c2637a61
  Dipika Sikka authored Jun 13, 2024
```
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
12 Jun, 2024 2 commits

- [Kernel] Vectorized FP8 quantize kernel (#5396) · 5985e342
  Cody Yu authored Jun 12, 2024

Inspired by #5146, this PR improves the FP8 quantize kernel by vectorizing data transfer to make better use of memory bandwidth. Microbenchmarks show that the improved kernel achieves a 1.0x-1.5x speedup, with the largest gains when the hidden size is large.

In detail, we applied three optimizations:

- Use an inverted scale so that most divisions become multiplications.
- Unroll the loop 4 times to improve instruction-level parallelism (ILP).
- Use 4-element vectors to transfer data between HBM and SRAM.
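
The first and third ideas can be illustrated with a rough Python sketch. This is not the kernel from the PR, which is CUDA code. The names `quantize_scaled` and `_clamp` are hypothetical, the 4-element chunks only stand in for vectorized loads and stores, and rounding to the FP8 grid is omitted. The only outside fact assumed is that the largest finite FP8 E4M3 value is 448.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def _clamp(x):
    """Clamp x into the finite FP8 E4M3 range (hypothetical helper)."""
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))


def quantize_scaled(values, scale, vec=4):
    """Scale and clamp `values`; illustrative only, no FP8 rounding.

    Assumption: `scale` is a positive per-tensor scale factor.
    """
    # Inverted scale: one division up front, multiplications afterwards.
    inv_scale = 1.0 / scale
    out = [0.0] * len(values)

    # Main loop: handle `vec` elements per iteration, mimicking a
    # vectorized load/store of 4 elements at a time.
    n_vec = len(values) // vec * vec
    for i in range(0, n_vec, vec):
        chunk = values[i:i + vec]
        out[i:i + vec] = [_clamp(x * inv_scale) for x in chunk]

    # Tail loop: leftover elements that do not fill a whole chunk.
    for i in range(n_vec, len(values)):
        out[i] = _clamp(values[i] * inv_scale)
    return out


print(quantize_scaled([1.0, 2.0, -4.0, 1000.0, 0.5], 0.5))
# → [2.0, 4.0, -8.0, 448.0, 1.0]
```

In an actual GPU kernel the chunked loop would map to wide memory transactions per thread, which is where the bandwidth benefit described above would come from; the Python version only shows the control flow.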


- Revert "[CI/Build] Add `is_quant_method_supported` to control quantization... · e3c12bf6
  Simon Mo authored Jun 12, 2024
```
Revert "[CI/Build] Add `is_quant_method_supported` to control quantization test configurations" (#5463)
```