Commits · 1fabf3e1ccacae309ca09c7413db86c68d9dcde3 · OpenDAS / vllm_cscc

24 Jul, 2024 1 commit
- add gemm pad and fa pad for 7b model · 0b5e4e11
  zhuwenwen authored Jul 24, 2024
  
  0b5e4e11
20 Jul, 2024 5 commits
- use two methods to add bias · 4caf1539
  zhuwenwen authored Jul 20, 2024
  
  4caf1539
- support nn layout · 1e0cb1f4
  zhuwenwen authored Jul 20, 2024
  
  1e0cb1f4
- Remove debugging information · 9653385f
  gaoqiong authored Jul 20, 2024
  
  9653385f
- Modify how nn layout is supported · 835bd9fc
  gaoqiong authored Jul 20, 2024
  
  835bd9fc
- modify gemm pad strategy · 7fe40ced
  zhuwenwen authored Jul 20, 2024
  
  7fe40ced
12 Jul, 2024 1 commit
- [ Misc ] Remove separate bias add (#6353) · 6047187c
  Robert Shaw authored Jul 12, 2024
  
  6047187c
11 Jul, 2024 1 commit
- [Doc] Remove comments incorrectly copied from another project (#6286) · 99ded1e1
  daquexian authored Jul 11, 2024
  
  99ded1e1
09 Jul, 2024 1 commit
- [Bugfix]fix and needs_scalar_to_array logic check (#6238) · d3a24513
  Baoyuan Qi authored Jul 10, 2024
```
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
  d3a24513
08 Jul, 2024 1 commit
- add 7b pad dim · 5cdabd7b
  zhuwenwen authored Jul 08, 2024
  
  5cdabd7b
06 Jul, 2024 1 commit
- add fa pad · 371b1251
  zhuwenwen authored Jul 06, 2024
  
  371b1251
30 Jun, 2024 1 commit
- [ Misc ] Refactor w8a8 to use `process_weights_after_load` (Simplify Weight Loading) (#5940) · af9ad46f
  Robert Shaw authored Jun 30, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
```
  af9ad46f
28 Jun, 2024 3 commits
- [ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 (#5921) · 2cd402e1
  Robert Shaw authored Jun 28, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
```
  2cd402e1
- [ Misc ] Remove `fp8_shard_indexer` from Col/Row Parallel Linear (Simplify Weight Loading) (#5928) · b1852307
  Robert Shaw authored Jun 28, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
```
  b1852307
- add gemm padding · e58014d7
  zhuwenwen authored Jun 28, 2024
  
  e58014d7
18 Jun, 2024 1 commit
- [Misc] Add channel-wise quantization support for w8a8 dynamic per token... · 95db455e
  Dipika Sikka authored Jun 18, 2024
```
[Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization (#5542)
```
  95db455e
15 Jun, 2024 1 commit
- [mypy] Enable type checking for test directory (#5017) · 0e9164b4
  Cyrus Leung authored Jun 15, 2024
  
  0e9164b4
01 Jun, 2024 1 commit
- [Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776) · b9c0605a
  chenqianfzh authored Jun 01, 2024
  
  b9c0605a
25 May, 2024 1 commit
- fix merge · 145787ae
  zhuwenwen authored May 25, 2024
  
  145787ae
23 May, 2024 1 commit

[Kernel] Initial Activation Quantization Support (#4525) · a1242324

Dipika Sikka authored May 23, 2024


Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

a1242324

12 May, 2024 2 commits
- fix qkv linear · 47c04371
  zhuwenwen authored May 12, 2024
  
  47c04371
- add linear bias · 0d27f0c7
  zhuwenwen authored May 12, 2024
  
  0d27f0c7
07 May, 2024 1 commit
- add llama_nn support · f26ecef8
  zhuwenwen authored May 07, 2024
  
  f26ecef8
01 May, 2024 1 commit
- [Misc]Add customized information for models (#4132) · d6f4bd7c
  Jee Li authored May 01, 2024
  
  d6f4bd7c
30 Apr, 2024 1 commit

[Kernel] Support Fp8 Checkpoints (Dynamic + Static) (#4332) · 111815d4

Robert Shaw authored Apr 30, 2024


Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>

111815d4

29 Apr, 2024 1 commit
- [mypy][5/N] Support all typing on model executor (#4427) · df29793d
  SangBin Cho authored Apr 29, 2024
  
  df29793d
26 Apr, 2024 1 commit
- [Misc][Refactor] Generalize linear_method to be quant_method (#4373) · a62aaf1d
  Cody Yu authored Apr 26, 2024
  
  a62aaf1d
25 Apr, 2024 1 commit
- add support of llama_nn · 8aa30111
  zhuwenwen authored Apr 25, 2024
  
  8aa30111
24 Apr, 2024 1 commit
- [BUG] fixed fp8 conflict with aqlm (#4307) · 79a268c4
  Robert Shaw authored Apr 23, 2024
```
Fixes the fp8 interface, which broke in the AQLM merge.
```
  79a268c4
23 Apr, 2024 1 commit
- AQLM CUDA support (#3287) · 2b7949c1
  James Fleming authored Apr 23, 2024
```
Co-authored-by: mgoin <michael@neuralmagic.com>
```
  2b7949c1
20 Apr, 2024 1 commit

[Kernel][FP8] Initial support with dynamic per-tensor scaling (#4118) · a22cdea3

Cody Yu authored Apr 19, 2024

Provides initial support for FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726

This feature can be enabled with --quantization fp8 or -q fp8 when launching an engine.

Algorithm:
We still load a model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates the per-tensor scaling factor of weights and quantizes the weights accordingly. The scaling factor will then be stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated in every forward pass.
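
A minimal sketch of the scheme described above, in plain Python so it runs without a GPU. It is not the code from this PR: `Fp8LinearSketch` and the helper names are hypothetical, the constant 448 is assumed to be the target range (the largest finite FP8 E4M3 value), and the actual rounding to FP8 is left out, so only the scaling and clamping steps are shown.

```python
FP8_E4M3_MAX = 448.0  # assumed target range: largest finite FP8 E4M3 value


def per_tensor_scale(values):
    """Return one scale for the whole tensor so that max|v| maps to FP8_E4M3_MAX."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def quantize(values, scale):
    """Divide by the scale and clamp to the FP8 range (FP8 rounding is omitted)."""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]


class Fp8LinearSketch:
    """Hypothetical stand-in for the linear method described in the commit message."""

    def __init__(self, weight_rows):
        # Weight scale: computed once after loading the FP16/BF16 weights, then stored.
        flat = [w for row in weight_rows for w in row]
        self.weight_scale = per_tensor_scale(flat)
        self.qweight = [quantize(row, self.weight_scale) for row in weight_rows]

    def forward(self, x):
        # Activation scale: recomputed on every forward pass (dynamic scaling).
        act_scale = per_tensor_scale(x)
        qx = quantize(x, act_scale)
        # Multiply in the scaled domain, then undo both scales on the result.
        out_scale = self.weight_scale * act_scale
        return [sum(qw * qa for qw, qa in zip(row, qx)) * out_scale
                for row in self.qweight]


layer = Fp8LinearSketch([[1.0, -2.0], [0.5, 4.0]])
output = layer.forward([1.0, 2.0])
```

Because the weight scale is fixed at load time while the activation scale is recomputed per call, the extra reduction over the activations is paid on every forward pass.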

Initial Results:
Tested so far with Mistral-7B on 1xH100, with prompt length ~5 and decoding length 128:

BF16: 1.47s
FP8: 1.66s

I'll try larger models and look for further performance bottlenecks. Meanwhile, you're welcome to try this code.

a22cdea3

11 Apr, 2024 1 commit
- [Core] Set `linear_weights` directly on the layer (#3977) · a10d3056
  Antoni Baum authored Apr 11, 2024
  
  a10d3056
10 Apr, 2024 1 commit

[Core][Refactor] move parallel_utils into vllm/distributed (#3950) · 63e7176f

youkaichao authored Apr 10, 2024

[WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950)

63e7176f

25 Mar, 2024 1 commit
- [CI] Try introducing isort. (#3495) · 01bfb22b
  SangBin Cho authored Mar 25, 2024
  
  01bfb22b
13 Mar, 2024 1 commit
- [Minor] Fix bias in if to remove ambiguity (#3259) · ba8dc958
  Hui Liu authored Mar 13, 2024
  
  ba8dc958
11 Mar, 2024 1 commit
- Re-enable the 80 char line width limit (#3305) · 2f8844ba
  Zhuohan Li authored Mar 10, 2024
  
  2f8844ba
01 Mar, 2024 1 commit

Integrate Marlin Kernels for Int4 GPTQ inference (#2497) · c0c2335c

Robert Shaw authored Mar 01, 2024


Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>

c0c2335c

01 Feb, 2024 1 commit
- Remove hardcoded `device="cuda" ` to support more devices (#2503) · 96b6f475
  Kunshang Ji authored Feb 02, 2024
```
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
```
  96b6f475
15 Jan, 2024 1 commit
- fix weight loading for GQA with TP (#2379) · f780504d
  Chenhui Zhang authored Jan 16, 2024
  
  f780504d
15 Dec, 2023 1 commit
- Add GPTQ support (#916) · 0fbfc4b8
  CHU Tianxiang authored Dec 15, 2023
  
  0fbfc4b8