Commits · 835bd9fcc011487bacffb0e0a08337573fd264ac · OpenDAS / vllm_cscc

20 Jul, 2024 2 commits
- 修改nn支持方式 · 835bd9fc
  gaoqiong authored Jul 20, 2024
  
  835bd9fc
- modify gemm pad strategy · 7fe40ced
  zhuwenwen authored Jul 20, 2024
  
  7fe40ced
08 Jul, 2024 1 commit
- add 7b pad dim · 5cdabd7b
  zhuwenwen authored Jul 08, 2024
  
  5cdabd7b
06 Jul, 2024 1 commit
- add fa pad · 371b1251
  zhuwenwen authored Jul 06, 2024
  
  371b1251
28 Jun, 2024 1 commit
- add gemm paddig · e58014d7
  zhuwenwen authored Jun 28, 2024
  
  e58014d7
01 Jun, 2024 1 commit
- [Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776) · b9c0605a
  chenqianfzh authored Jun 01, 2024
  
  b9c0605a
25 May, 2024 1 commit
- fix merge · 145787ae
  zhuwenwen authored May 25, 2024
  
  145787ae
23 May, 2024 1 commit

[Kernel] Initial Activation Quantization Support (#4525) · a1242324

Dipika Sikka authored May 23, 2024


Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

a1242324

12 May, 2024 2 commits
- fix qkv linear · 47c04371
  zhuwenwen authored May 12, 2024
  
  47c04371
- add linear bias · 0d27f0c7
  zhuwenwen authored May 12, 2024
  
  0d27f0c7
07 May, 2024 1 commit
- add llama_nn support · f26ecef8
  zhuwenwen authored May 07, 2024
  
  f26ecef8
01 May, 2024 1 commit
- [Misc]Add customized information for models (#4132) · d6f4bd7c
  Jee Li authored May 01, 2024
  
  d6f4bd7c
30 Apr, 2024 1 commit

[Kernel] Support Fp8 Checkpoints (Dynamic + Static) (#4332) · 111815d4

Robert Shaw authored Apr 30, 2024


Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>

111815d4

29 Apr, 2024 1 commit
- [mypy][5/N] Support all typing on model executor (#4427) · df29793d
  SangBin Cho authored Apr 29, 2024
  
  df29793d
26 Apr, 2024 1 commit
- [Misc][Refactor] Generalize linear_method to be quant_method (#4373) · a62aaf1d
  Cody Yu authored Apr 26, 2024
  
  a62aaf1d
25 Apr, 2024 1 commit
- add support of llama_nn · 8aa30111
  zhuwenwen authored Apr 25, 2024
  
  8aa30111
24 Apr, 2024 1 commit
- [BUG] fixed fp8 conflict with aqlm (#4307) · 79a268c4
  Robert Shaw authored Apr 23, 2024
```
Fixes fp8 iterface which broke in AQLM merge.
```
  79a268c4
23 Apr, 2024 1 commit
- AQLM CUDA support (#3287) · 2b7949c1
  James Fleming authored Apr 23, 2024
```
Co-authored-by: mgoin <michael@neuralmagic.com>
```
  2b7949c1
20 Apr, 2024 1 commit

[Kernel][FP8] Initial support with dynamic per-tensor scaling (#4118) · a22cdea3

Cody Yu authored Apr 19, 2024

Provide an initial support to FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726

This feature can be enabled with --quantization fp8 or -q fp8 when launching an engine.

Algorithm:
We still load a model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates the per-tensor scaling factor of weights and quantizes the weights accordingly. The scaling factor will then be stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated in every forward pass.

Initial Results:
Currently tested Mistral-7B on 1xH100. With prompt length ~5 and decoding length 128:

BF16: 1.47s
FP8: 1.66s
I'll try to use larger models and try to find more performance bottleneck. Meanwhile, you're welcome to try this code.

a22cdea3

11 Apr, 2024 1 commit
- [Core] Set `linear_weights` directly on the layer (#3977) · a10d3056
  Antoni Baum authored Apr 11, 2024
  
  a10d3056
10 Apr, 2024 1 commit

[Core][Refactor] move parallel_utils into vllm/distributed (#3950) · 63e7176f

youkaichao authored Apr 10, 2024

[WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950)

63e7176f

25 Mar, 2024 1 commit
- [CI] Try introducing isort. (#3495) · 01bfb22b
  SangBin Cho authored Mar 25, 2024
  
  01bfb22b
13 Mar, 2024 1 commit
- [Minor] Fix bias in if to remove ambiguity (#3259) · ba8dc958
  Hui Liu authored Mar 13, 2024
  
  ba8dc958
11 Mar, 2024 1 commit
- Re-enable the 80 char line width limit (#3305) · 2f8844ba
  Zhuohan Li authored Mar 10, 2024
  
  2f8844ba
01 Mar, 2024 1 commit

Integrate Marlin Kernels for Int4 GPTQ inference (#2497) · c0c2335c

Robert Shaw authored Mar 01, 2024


Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>

c0c2335c

01 Feb, 2024 1 commit
- Remove hardcoded `device="cuda" ` to support more devices (#2503) · 96b6f475
  Kunshang Ji authored Feb 02, 2024
```
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
```
  96b6f475
15 Jan, 2024 1 commit
- fix weigit loading for GQA with TP (#2379) · f780504d
  Chenhui Zhang authored Jan 16, 2024
  
  f780504d
15 Dec, 2023 1 commit
- Add GPTQ support (#916) · 0fbfc4b8
  CHU Tianxiang authored Dec 15, 2023
  
  0fbfc4b8
16 Nov, 2023 1 commit

TP/quantization/weight loading refactor part 2 - Refactor quantized linear... · 7076fa1c

Zhuohan Li authored Nov 15, 2023

TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models (#1622)

Refactor the tensor parallelism, quantization, and weight-loading codes.

Summary of the new features enabled by this PR:
- **All models** are able to be quantized with AWQ and SqueezeLLM, and [soon GPTQ](https://github.com/vllm-project/vllm/pull/1580).
- Model loading code became much simpler.
- Support model parallelism for all MQA/GQA models when the number of key/value heads is smaller than the tensor parallel size.

7076fa1c