Commits · ffbef65cc2473c2dbac12de1deb808e2450e389f · OpenDAS / vllm_cscc

01 Aug, 2024 1 commit
- support fa pad · ffbef65c
  zhuwenwen authored Aug 01, 2024
  
  ffbef65c
24 Jul, 2024 1 commit
- add gemm pad and fa pad for 7b model · 0b5e4e11
  zhuwenwen authored Jul 24, 2024
  
  0b5e4e11
22 Jul, 2024 1 commit
- fix index error of baichuan2 · 795ce518
  zhuwenwen authored Jul 22, 2024
  
  795ce518
20 Jul, 2024 3 commits
- support nn layout · 1e0cb1f4
  zhuwenwen authored Jul 20, 2024
  
  1e0cb1f4
- Remove debug output · 9653385f
  gaoqiong authored Jul 20, 2024
  
  9653385f
- Change how nn is supported · 835bd9fc
  gaoqiong authored Jul 20, 2024
  
  835bd9fc
09 Jul, 2024 1 commit
- Support Deepseek-V2 (#4650) · b1b95055
  huangwb authored Jul 09, 2024
  
  b1b95055
08 Jul, 2024 1 commit
- add 7b pad dim · 5cdabd7b
  zhuwenwen authored Jul 08, 2024
  
  5cdabd7b
06 Jul, 2024 1 commit
- add fa pad · 371b1251
  zhuwenwen authored Jul 06, 2024
  
  371b1251
10 Jun, 2024 2 commits
- [Bugfix] Fix LLaVA-NeXT (#5380) · 2c0d9335
  Cyrus Leung authored Jun 10, 2024
  
  2c0d9335
- [Model] Initial support for LLaVA-NeXT (#4199) · 6b29d6fe
  Cyrus Leung authored Jun 10, 2024
```
Co-authored-by: Roger Wang <ywang@roblox.com>
```
  6b29d6fe
08 Jun, 2024 1 commit
- [Misc][Breaking] Change FP8 checkpoint format from act_scale -> input_scale (#5353) · c09dade2
  Michael Goin authored Jun 08, 2024
  
  c09dade2
07 Jun, 2024 1 commit
- fix DbrxFusedNormAttention missing cache_config (#5340) · 767c727a
  Calvinn Ng authored Jun 08, 2024
```
Co-authored-by: team <calvinn.ng@ahrefs.com>
```
  767c727a
05 Jun, 2024 1 commit
- [Model] Correct Mixtral FP8 checkpoint loading (#5231) · 5563a4de
  Cody Yu authored Jun 05, 2024
  
  5563a4de
03 Jun, 2024 1 commit
- [Core] Support image processor (#4197) · 7a64d24a
  Cyrus Leung authored Jun 03, 2024
  
  7a64d24a
01 Jun, 2024 1 commit
- [Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776) · b9c0605a
  chenqianfzh authored Jun 01, 2024
  
  b9c0605a
31 May, 2024 1 commit
- [Model] Enable FP8 QKV in MoE and refine kernel tuning script (#5039) · e9899fb7
  Cody Yu authored May 31, 2024
  
  e9899fb7
27 May, 2024 2 commits
- [Model] Add support for falcon-11B (#5069) · 890aa93d
  Isotr0py authored May 28, 2024
  
  890aa93d
- [Bugfix / Core] Prefix Caching Guards (merged with main) (#4846) · 1102bef2
  Zhuohan Li authored May 27, 2024
```
Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
  1102bef2
25 May, 2024 1 commit

- [Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model (#4799) · 8e192ff9
  Eric Xihui Lin authored May 25, 2024
  
  Co-authored-by: beagleski <yunanzhang@microsoft.com>
  Co-authored-by: bapatra <bapatra@microsoft.com>
  Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
  Co-authored-by: Michael Goin <michael@neuralmagic.com>
  
  8e192ff9

23 May, 2024 1 commit

- [Kernel] Initial Activation Quantization Support (#4525) · a1242324
  Dipika Sikka authored May 23, 2024
  
  Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
  Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
  
  a1242324

22 May, 2024 3 commits
- [Minor] Fix small typo in llama.py: QKVParallelLinear -> QuantizationConfig (#4991) · a36de682
  Philipp Moritz authored May 22, 2024
  
  a36de682
- [Model] LoRA gptbigcode implementation (#3949) · 97b03000
  raywanb authored May 23, 2024
  
  97b03000
- [Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893) · a3a73ab0
  Cody Yu authored May 22, 2024
```
The 2nd PR for #4532.

This PR supports loading FP8 kv-cache scaling factors from a FP8 checkpoint (with .kv_scale parameter).
```
  a3a73ab0
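The checkpoint-loading step described in the commit above can be pictured with a small sketch. It is not the vLLM implementation: the parameter names (`model.layers.N.self_attn.kv_scale`), the helper `collect_kv_scales`, and the default of 1.0 for layers without a stored factor are all assumptions made for illustration. The commit itself only states that the checkpoint carries a `.kv_scale` parameter.

```python
import re

# Hypothetical naming scheme: the commit only says the checkpoint has a
# ".kv_scale" parameter; the "layers.<index>." prefix is an assumption.
KV_SCALE_RE = re.compile(r"layers\.(\d+)\..*\.kv_scale$")


def collect_kv_scales(checkpoint, num_layers, default=1.0):
    """Return one kv-cache scaling factor per layer.

    Layers without a ".kv_scale" entry keep `default` (assumed here to be 1.0).
    """
    scales = [default] * num_layers
    for name, value in checkpoint.items():
        match = KV_SCALE_RE.search(name)
        if match is None:
            continue  # not a kv-cache scaling factor
        layer = int(match.group(1))
        if not 0 <= layer < num_layers:
            raise ValueError(f"layer index out of range in {name!r}")
        scales[layer] = float(value)
    return scales


# Toy checkpoint: a plain dict standing in for the loaded tensors.
checkpoint = {
    "model.layers.0.self_attn.kv_scale": 0.5,
    "model.layers.1.self_attn.kv_scale": 0.25,
    "model.layers.1.self_attn.qkv_proj.weight": None,  # ignored by the regex
}
kv_scales = collect_kv_scales(checkpoint, num_layers=3)
print(kv_scales)
```

Layer 2 has no entry in the toy checkpoint, so it falls back to the assumed default.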
21 May, 2024 2 commits
- [Model] Add Phi-2 LoRA support (#4886) · f12c3b5b
  Isotr0py authored May 21, 2024
  
  f12c3b5b
- [Model] add rope_scaling support for qwen2 (#4930) · d130b573
  HUANG Fei authored May 21, 2024
  
  d130b573
20 May, 2024 1 commit
- [Model] LLaVA model refactor (#4910) · 6287537a
  Cyrus Leung authored May 20, 2024
  
  6287537a
19 May, 2024 1 commit
- [Bugfix][Model] Add base class for vision-language models (#4809) · f68470e8
  Cyrus Leung authored May 19, 2024
  
  f68470e8
18 May, 2024 1 commit

- [Lora] Support long context lora (#4787) · 2e9a2227
  SangBin Cho authored May 18, 2024
  
  Currently the rotary embedding kernel must be called once per LoRA, which makes it hard to serve multiple long-context LoRAs. Add a batched rotary embedding kernel and pipe it through.
  
  It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.
  
  Follow-up to https://github.com/vllm-project/vllm/pull/3095/files
  
  2e9a2227
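The batching idea in the commit above can be sketched in plain Python. This is a rough illustration, not the kernel added in the PR. It assumes linear RoPE scaling (angle = position / scale × inverse frequency), a NeoX-style rotation of the two vector halves, and hypothetical helper names (`build_batched_cache`, `batched_rotary`). The point it shows is that the per-scaling-factor cos-sin caches are concatenated, and each token carries an offset into the combined cache, so a single pass serves tokens from different LoRAs.

```python
import math


def build_cos_sin_cache(max_pos, rot_dim, scale, base=10000.0):
    """cos/sin table for one scaling factor (linear scaling is an assumption)."""
    inv_freq = [base ** (-2.0 * i / rot_dim) for i in range(rot_dim // 2)]
    return [
        ([math.cos(pos / scale * f) for f in inv_freq],
         [math.sin(pos / scale * f) for f in inv_freq])
        for pos in range(max_pos)
    ]


def build_batched_cache(max_pos, rot_dim, scales):
    """Concatenate one cache per scaling factor and record where each starts."""
    cache, offsets = [], {}
    for scale in scales:
        offsets[scale] = len(cache)
        cache.extend(build_cos_sin_cache(int(max_pos * scale), rot_dim, scale))
    return cache, offsets


def apply_rotary(vec, cos, sin):
    """Rotate the (first half, second half) pairs of `vec` (NeoX-style)."""
    half = len(vec) // 2
    x1, x2 = vec[:half], vec[half:]
    first = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
    second = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
    return first + second


def batched_rotary(vectors, positions, token_offsets, cache):
    """One pass over all tokens; each token picks its cache via its offset."""
    out = []
    for vec, pos, off in zip(vectors, positions, token_offsets):
        cos, sin = cache[off + pos]
        out.append(apply_rotary(vec, cos, sin))
    return out


cache, offsets = build_batched_cache(max_pos=8, rot_dim=4, scales=[1.0, 4.0])
vectors = [[1.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 1.0]]
positions = [1, 4]  # position 4 at scale 4.0 maps to the same angle as position 1 at scale 1.0
token_offsets = [offsets[1.0], offsets[4.0]]
rotated = batched_rotary(vectors, positions, token_offsets, cache)
```

In this toy setup the two tokens use different scaling factors but end up with the same rotation, because position 4 divided by scale 4.0 equals position 1 divided by scale 1.0.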

17 May, 2024 1 commit
- Sync huggingface modifications of qwen Moe model (#4774) · 48d5985a
  eigenLiu authored May 18, 2024
  
  48d5985a
13 May, 2024 2 commits
- [Bugfix] Fix dynamic FP8 quantization for Mixtral (#4793) · 33d3914b
  Philipp Moritz authored May 13, 2024
  
  33d3914b
- [Misc] Enhance attention selector (#4751) · 0fca3cdc
  Woosuk Kwon authored May 13, 2024
  
  0fca3cdc
12 May, 2024 1 commit
- [Model] Add support for IBM Granite Code models (#4636) · 6eaccb73
  Yikang Shen authored May 12, 2024
  
  6eaccb73
11 May, 2024 1 commit
- [Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734) · e254497b
  Chang Su authored May 11, 2024
  
  e254497b
09 May, 2024 1 commit

- [Model] Snowflake arctic model implementation (#4652) · ebce310b
  Hao Zhang authored May 09, 2024
  
  Co-authored-by: Dash Desai <1723932+iamontheinet@users.noreply.github.com>
  Co-authored-by: Aurick Qiao <qiao@aurick.net>
  Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>
  Co-authored-by: Aurick Qiao <aurickq@users.noreply.github.com>
  Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
  
  ebce310b

04 May, 2024 1 commit

- [Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with... · 2a052011
  Michael Goin authored May 04, 2024
  
  [Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527)
  
  Follow-on to #4332 to enable FP8 checkpoint loading for Mixtral; supersedes #4436.
  
  This PR enables the following checkpoint loading features for Mixtral:
  
  - Supports loading FP8 checkpoints for Mixtral, such as the "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
  - Supports static or dynamic activation quantization with static weight quantization (all per tensor)
  - Supports different scales for each expert weight
  - Supports FP8 in the QKV layer
  
  Notes:
  
  - The Expert Gate/Router always runs at half / full precision for now.
  - If the QKV layer has different weight scales (for separate QKV weights), the weights are re-quantized using layer.weight_scale.max() so that a single GEMM can be used for performance.
  
  2a052011
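The re-quantization note in the commit above can be illustrated with a minimal sketch. It is not the code from the PR: weights are plain Python lists, the helper names are hypothetical, and the rounding to representable FP8 values that a real FP8 cast performs is deliberately left out. It shows only the scale arithmetic: each shard is dequantized with its own scale and then re-quantized with the largest scale, so the merged tensor needs a single per-tensor scale.

```python
FP8_E4M3_MAX = 448.0  # largest finite value of the float8 e4m3fn format


def quantize(values, scale):
    """Per-tensor quantization stand-in: divide by the scale and clamp.

    A real FP8 cast also rounds to representable values; omitted here.
    """
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]


def dequantize(quantized, scale):
    """Recover approximate real values from quantized values and their scale."""
    return [q * scale for q in quantized]


def requantize_with_max_scale(shards, scales):
    """Merge shards that have separate scales into one tensor with one scale."""
    max_scale = max(scales)
    merged = []
    for quantized, scale in zip(shards, scales):
        merged.extend(quantize(dequantize(quantized, scale), max_scale))
    return merged, max_scale


# Toy Q, K, V shards, each quantized with its own per-tensor scale.
q_shard, k_shard, v_shard = [100.0, -200.0], [400.0], [10.0]
shard_scales = [0.5, 0.25, 1.0]
merged, merged_scale = requantize_with_max_scale(
    [q_shard, k_shard, v_shard], shard_scales
)
print(merged, merged_scale)
```

Using the maximum scale means no re-quantized value can grow in magnitude, so nothing that fit in the FP8 range before falls outside it afterwards. Shards with smaller original scales lose some resolution in exchange.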

01 May, 2024 1 commit
- [Misc] Remove Mixtral device="cuda" declarations (#4543) · c9d852d6
  Philipp Moritz authored May 01, 2024
```
Remove the device="cuda" declarations in mixtral as promised in #4343
```
  c9d852d6
27 Apr, 2024 2 commits
- [BugFix] Resolved Issues For LinearMethod --> QuantConfig (#4418) · 4ea1f967
  Robert Shaw authored Apr 27, 2024
  
  4ea1f967
- [Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales (#4343) · 12628d3c
  Philipp Moritz authored Apr 26, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
  12628d3c
26 Apr, 2024 1 commit
- [Misc][Refactor] Generalize linear_method to be quant_method (#4373) · a62aaf1d
  Cody Yu authored Apr 26, 2024
  
  a62aaf1d