Commits · f1df5dbfd6782408228f39bdc0722fa465629f0f · OpenDAS / vllm_cscc

23 Aug, 2024 1 commit
- [Misc] Update `marlin` to use vLLMParameters (#7803) · f1df5dbf
  Dipika Sikka authored Aug 23, 2024
  
22 Aug, 2024 1 commit
- [Misc] update fp8 to use `vLLMParameter` (#7437) · 955b5191
  Dipika Sikka authored Aug 22, 2024
  
21 Aug, 2024 1 commit
- [Model] Add AWQ quantization support for InternVL2 model (#7187) · 12e1c65b
  Isotr0py authored Aug 21, 2024
  
19 Aug, 2024 1 commit
- [Core] Support tensor parallelism for GGUF quantization (#7520) · 7601cb04
  Isotr0py authored Aug 20, 2024
  
13 Aug, 2024 2 commits
- [Misc] Update `awq` and `awq_marlin` to use `vLLMParameters` (#7422) · b1e5afc3
  Dipika Sikka authored Aug 13, 2024
  
- [Misc] Update `gptq_marlin` to use new vLLMParameters (#7281) · fb377d7e
  Dipika Sikka authored Aug 13, 2024
  
09 Aug, 2024 1 commit
- [Bugfix] Fix `PerTensorScaleParameter` weight loading for fused models (#7376) · 5c6c54d6
  Dipika Sikka authored Aug 09, 2024
  
07 Aug, 2024 1 commit
- [Misc] Refactor linear layer weight loading; introduce `BasevLLMParameter` and `weight_loader_v2` (#5874) · 0f7052bc
  Dipika Sikka authored Aug 07, 2024
05 Aug, 2024 1 commit
- [Core] Support loading GGUF model (#5191) · 360bd67c
  Isotr0py authored Aug 06, 2024
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
26 Jul, 2024 1 commit
- Fix ReplicatedLinear weight loading (#6793) · 062a1d0f
  QQSong authored Jul 25, 2024
  
20 Jul, 2024 1 commit
- [ Misc ] `fbgemm` checkpoints (#6559) · 683e3cb9
  Robert Shaw authored Jul 20, 2024
  
19 Jul, 2024 2 commits
- [Model] RowParallelLinear: pass bias to quant_method.apply (#6327) · a5314e86
  Thomas Parnell authored Jul 19, 2024
```
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
```
- [ Misc ] non-uniform quantization via `compressed-tensors` for `Llama` (#6515) · dbe55885
  Robert Shaw authored Jul 18, 2024
  
16 Jul, 2024 1 commit
- [Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` (#6081) · 978aed53
  Michael Goin authored Jul 16, 2024
  
12 Jul, 2024 1 commit
- [ Misc ] Remove separate bias add (#6353) · 6047187c
  Robert Shaw authored Jul 12, 2024
  
11 Jul, 2024 1 commit
- [Doc] Remove comments incorrectly copied from another project (#6286) · 99ded1e1
  daquexian authored Jul 11, 2024
  
09 Jul, 2024 1 commit
- [Bugfix]fix and needs_scalar_to_array logic check (#6238) · d3a24513
  Baoyuan Qi authored Jul 10, 2024
```
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
30 Jun, 2024 1 commit
- [ Misc ] Refactor w8a8 to use `process_weights_after_load` (Simplify Weight Loading) (#5940) · af9ad46f
  Robert Shaw authored Jun 30, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
```
28 Jun, 2024 2 commits
- [ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 (#5921) · 2cd402e1
  Robert Shaw authored Jun 28, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
```
- [ Misc ] Remove `fp8_shard_indexer` from Col/Row Parallel Linear (Simplify Weight Loading) (#5928) · b1852307
  Robert Shaw authored Jun 28, 2024
```
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
```
18 Jun, 2024 1 commit
- [Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization (#5542) · 95db455e
  Dipika Sikka authored Jun 18, 2024
15 Jun, 2024 1 commit
- [mypy] Enable type checking for test directory (#5017) · 0e9164b4
  Cyrus Leung authored Jun 15, 2024
  
01 Jun, 2024 1 commit
- [Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776) · b9c0605a
  chenqianfzh authored Jun 01, 2024
  
23 May, 2024 1 commit
- [Kernel] Initial Activation Quantization Support (#4525) · a1242324
  Dipika Sikka authored May 23, 2024
  Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
  Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
01 May, 2024 1 commit
- [Misc]Add customized information for models (#4132) · d6f4bd7c
  Jee Li authored May 01, 2024
  
30 Apr, 2024 1 commit
- [Kernel] Support Fp8 Checkpoints (Dynamic + Static) (#4332) · 111815d4
  Robert Shaw authored Apr 30, 2024
  Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
  Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
  Co-authored-by: mgoin <michael@neuralmagic.com>
  Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
  Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
29 Apr, 2024 1 commit
- [mypy][5/N] Support all typing on model executor (#4427) · df29793d
  SangBin Cho authored Apr 29, 2024
  
26 Apr, 2024 1 commit
- [Misc][Refactor] Generalize linear_method to be quant_method (#4373) · a62aaf1d
  Cody Yu authored Apr 26, 2024
  
24 Apr, 2024 1 commit
- [BUG] fixed fp8 conflict with aqlm (#4307) · 79a268c4
  Robert Shaw authored Apr 23, 2024
```
Fixes the fp8 interface, which broke in the AQLM merge.
```
23 Apr, 2024 1 commit
- AQLM CUDA support (#3287) · 2b7949c1
  James Fleming authored Apr 23, 2024
```
Co-authored-by: mgoin <michael@neuralmagic.com>
```
20 Apr, 2024 1 commit
- [Kernel][FP8] Initial support with dynamic per-tensor scaling (#4118) · a22cdea3
  Cody Yu authored Apr 19, 2024

Provides initial support for FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726

This feature can be enabled with `--quantization fp8` or `-q fp8` when launching an engine.

Algorithm:
We still load a model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates the per-tensor scaling factor of weights and quantizes the weights accordingly. The scaling factor will then be stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated in every forward pass.
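
The flow described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code of `Fp8LinearMethod`: the helper names (`round_to_e4m3`, `per_tensor_scale`, `quantize`, `forward`) are hypothetical, the FP8 format is assumed to be E4M3 (largest finite value 448), and FP8 rounding is simulated with ordinary floats instead of a real FP8 dtype or kernel.

```python
import math

# Assumption: E4M3 format, whose largest finite value is 448.
FP8_E4M3_MAX = 448.0


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value on a simulated E4M3 grid (3 mantissa bits)."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), FP8_E4M3_MAX)      # saturate instead of overflowing
    _, exp = math.frexp(mag)             # mag = m * 2**exp with 0.5 <= m < 1
    exp = max(exp - 1, -6)               # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)              # spacing of representable values here
    q = min(round(mag / step) * step, FP8_E4M3_MAX)
    return math.copysign(q, x)


def per_tensor_scale(values):
    """One scale for the whole tensor: map the largest magnitude to FP8 max."""
    amax = max((abs(v) for v in values), default=0.0)
    return max(amax, 1e-12) / FP8_E4M3_MAX   # guard against an all-zero tensor


def quantize(values, scale):
    return [round_to_e4m3(v / scale) for v in values]


# Weights: the scale is computed once after loading and kept for later use.
weights = [0.5, -1.0, 2.0, -4.0]
w_scale = per_tensor_scale(weights)
w_q = quantize(weights, w_scale)


def forward(activations):
    """Activations: the scale is recomputed on every forward pass."""
    a_scale = per_tensor_scale(activations)
    a_q = quantize(activations, a_scale)
    acc = sum(a * w for a, w in zip(a_q, w_q))   # product of quantized values
    return acc * a_scale * w_scale               # undo both scales


result = forward([1.0, 1.0, 1.0, 1.0])
```

The weight scale is paid for once, while the activation scale costs an extra reduction over the tensor in every forward pass.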

Initial Results:
Currently tested with Mistral-7B on 1xH100, with prompt length ~5 and decoding length 128:

BF16: 1.47s
FP8: 1.66s

I'll try larger models and look for more performance bottlenecks. Meanwhile, you're welcome to try this code.


11 Apr, 2024 1 commit
- [Core] Set `linear_weights` directly on the layer (#3977) · a10d3056
  Antoni Baum authored Apr 11, 2024
  
10 Apr, 2024 1 commit
- [Core][Refactor] move parallel_utils into vllm/distributed (#3950) · 63e7176f
  youkaichao authored Apr 10, 2024
  [WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950)
25 Mar, 2024 1 commit
- [CI] Try introducing isort. (#3495) · 01bfb22b
  SangBin Cho authored Mar 25, 2024
  
13 Mar, 2024 1 commit
- [Minor] Fix bias in if to remove ambiguity (#3259) · ba8dc958
  Hui Liu authored Mar 13, 2024
  
11 Mar, 2024 1 commit
- Re-enable the 80 char line width limit (#3305) · 2f8844ba
  Zhuohan Li authored Mar 10, 2024
  
01 Mar, 2024 1 commit
- Integrate Marlin Kernels for Int4 GPTQ inference (#2497) · c0c2335c
  Robert Shaw authored Mar 01, 2024
  Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
  Co-authored-by: alexm <alexm@neuralmagic.com>
01 Feb, 2024 1 commit
- Remove hardcoded `device="cuda"` to support more devices (#2503) · 96b6f475
  Kunshang Ji authored Feb 02, 2024
```
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
```
15 Jan, 2024 1 commit
- fix weight loading for GQA with TP (#2379) · f780504d
  Chenhui Zhang authored Jan 16, 2024
  
15 Dec, 2023 1 commit
- Add GPTQ support (#916) · 0fbfc4b8
  CHU Tianxiang authored Dec 15, 2023
  