Commits · a7dcc62086ea751b46b4821c2811cf8ac83711bf · OpenDAS / vllm_cscc

20 Jun, 2024 1 commit
- [Model] Port over CLIPVisionModel for VLMs (#5591) · ad137cd1
  Roger Wang authored Jun 20, 2024
  
  ad137cd1
19 Jun, 2024 3 commits
- [Misc] Add per channel support for static activation quantization; update w8a8... · 4a30d7e3
  Dipika Sikka authored Jun 19, 2024
```
[Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes (#5650)
```
  4a30d7e3
- [Model] Add FP8 kv cache for Qwen2 (#5656) · da971ec7
  Michael Goin authored Jun 19, 2024
  
  da971ec7
- [Bugfix] Fix Phi-3 Long RoPE scaling implementation (#5628) · 59a1eb59
  Shukant Pal authored Jun 18, 2024
  
  59a1eb59
18 Jun, 2024 6 commits
- [Bugfix] Fix for inconsistent behaviour related to sampling and repetition penalties (#5639) · 8a173382
  Thomas Parnell authored Jun 18, 2024
```
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
```
  8a173382
- [Model] LoRA support added for command-r (#5178) · 07feecde
  sergey-tinkoff authored Jun 18, 2024
  
  07feecde
- [Misc] Add channel-wise quantization support for w8a8 dynamic per token... · 95db455e
  Dipika Sikka authored Jun 18, 2024
```
[Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization (#5542)
```
  95db455e
- [Misc] Remove import from transformers logging (#5625) · f0cc0e68
  Chang Su authored Jun 18, 2024
  
  f0cc0e68
- [Model] Initialize Phi-3-vision support (#4986) · daef218b
  Isotr0py authored Jun 18, 2024
  
  daef218b
- [Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the... · fa9e3852
  sroy745 authored Jun 17, 2024
```
[Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier (#5131)
```
  fa9e3852
17 Jun, 2024 3 commits
- [Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (#3814) · 728c4c8a
  Kunshang Ji authored Jun 18, 2024
```
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com>
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
```
  728c4c8a
- [Kernel] `compressed-tensors` marlin 24 support (#5435) · 890d8d96
  Dipika Sikka authored Jun 17, 2024
  
  890d8d96
- [Model] Rename Phi3 rope scaling type (#5595) · 9333fb8e
  Amit Garg authored Jun 17, 2024
  
  9333fb8e
15 Jun, 2024 1 commit
- [mypy] Enable type checking for test directory (#5017) · 0e9164b4
  Cyrus Leung authored Jun 15, 2024
  
  0e9164b4
14 Jun, 2024 3 commits
- [Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models (#5460) · e2afb03c
  Thomas Parnell authored Jun 14, 2024
```
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
```
  e2afb03c
- [ Misc ] Rs/compressed tensors cleanup (#5432) · 15985680
  Robert Shaw authored Jun 14, 2024
```
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
```
  15985680
- [Kernel] Fix CUTLASS 3.x custom broadcast load epilogue (#5516) · 703475f6
  Tyler Michael Smith authored Jun 14, 2024
  
  703475f6
13 Jun, 2024 5 commits

- [Kernel] Disable CUTLASS kernels for fp8 (#5505) · e38042d4
  Tyler Michael Smith authored Jun 13, 2024
  
  e38042d4
- [Kernel] Factor out epilogues from cutlass kernels (#5391) · 85657b56
  Tyler Michael Smith authored Jun 13, 2024
  
  Co-authored-by: Michael Goin <michael@neuralmagic.com>
  Co-authored-by: youkaichao <youkaichao@gmail.com>
  Co-authored-by: zifeitong <zifei.tong@parasail.io>
  Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
  
  85657b56
- [Doc] Update LLaVA docs (#5437) · 0ce7b952
  Cyrus Leung authored Jun 14, 2024
```
Co-authored-by: Roger Wang <ywang@roblox.com>
```
  0ce7b952
- [Kernel] Tune Qwen2MoE kernel configurations with tp2,4 (#5497) · bd439735
  wenyujin333 authored Jun 14, 2024

  Tune Qwen2-57B-A14B configs based on #4921

  Throughput Performance

  command: `python benchmarks/benchmark_throughput.py --model=Qwen/Qwen2-57B-A14B-Instruct --input-len 1000 --output-len 50 -tp 2`

  A100 GPU

  | benchmark | no config | w/ PR |
  | --- | --- | --- |
  | tp=2 | 10.53 requests/s, 11058.17 tokens/s | 12.47 requests/s, 13088.57 tokens/s |
  | tp=4 | 17.77 requests/s, 18662.95 tokens/s | 20.20 requests/s, 21212.32 tokens/s |

  bd439735
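
The throughput table in bd439735 gives absolute figures only. As a quick check of what they imply, the sketch below computes the relative gain of the tuned configs over the untuned baseline. The numbers are copied from the table; the dictionary layout and the helper name `relative_gain` are illustrative and not part of vLLM. By this calculation the gain is roughly 18% at tp=2 and roughly 14% at tp=4.

```python
# Figures copied from the commit message table (A100 GPU).
# Each tuple is (requests/s, tokens/s).
results = {
    "tp=2": {"no config": (10.53, 11058.17), "w/ PR": (12.47, 13088.57)},
    "tp=4": {"no config": (17.77, 18662.95), "w/ PR": (20.20, 21212.32)},
}


def relative_gain(before: float, after: float) -> float:
    """Return the improvement of `after` over `before` as a percentage."""
    return (after / before - 1.0) * 100.0


for setting, runs in results.items():
    req_before, tok_before = runs["no config"]
    req_after, tok_after = runs["w/ PR"]
    req_gain = relative_gain(req_before, req_after)
    tok_gain = relative_gain(tok_before, tok_after)
    print(f"{setting}: requests/s +{req_gain:.1f}%, tokens/s +{tok_gain:.1f}%")
```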

- [Kernel] `w4a16` support for `compressed-tensors` (#5385) · c2637a61
  Dipika Sikka authored Jun 13, 2024
```
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
  c2637a61
12 Jun, 2024 2 commits
- [Frontend] [Core] Support for sharded tensorized models (#4990) · 51602eef
  Travis Johnson authored Jun 12, 2024
```
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Sanger Steel <sangersteel@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
```
  51602eef
- [Hardware] Initial TPU integration (#5292) · 1a8bfd92
  Woosuk Kwon authored Jun 12, 2024
  
  1a8bfd92
11 Jun, 2024 1 commit
- [Misc] Various simplifications and typing fixes (#5368) · a0086298
  Nick Hill authored Jun 10, 2024
  
  a0086298
10 Jun, 2024 3 commits
- [Bugfix] Fix LLaVA-NeXT (#5380) · 2c0d9335
  Cyrus Leung authored Jun 10, 2024
  
  2c0d9335
- [Model] Initial support for LLaVA-NeXT (#4199) · 6b29d6fe
  Cyrus Leung authored Jun 10, 2024
```
Co-authored-by: Roger Wang <ywang@roblox.com>
```
  6b29d6fe
- [Misc] Update to comply with the new `compressed-tensors` config (#5350) · 5884c2b4
  Dipika Sikka authored Jun 09, 2024
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
  5884c2b4
09 Jun, 2024 1 commit
- [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047) · 5467ac31
  bnellnm authored Jun 09, 2024
  
  5467ac31
08 Jun, 2024 2 commits

- [Misc][Breaking] Change FP8 checkpoint format from act_scale -> input_scale (#5353) · c09dade2
  Michael Goin authored Jun 08, 2024
  
  c09dade2
- [Bug Fix] Fix the support check for FP8 CUTLASS (#5352) · e69ded7d
  Cheng Li authored Jun 07, 2024

  Bug description:
  With torch 2.4.0.dev20240603+cu121, `cutlass_fp8_supported` outputs False, and the (capability, version) pair before the comparison is (90, 11111111112).

  This PR fixes the support check for FP8 CUTLASS (`cutlass_fp8_supported`), which was introduced in https://github.com/vllm-project/vllm/pull/5183.

  e69ded7d
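
The version value `11111111112` reported in e69ded7d is what Python produces when arithmetic is applied to the characters of a version string rather than to parsed integers. The sketch below reproduces that value, assuming the CUDA version arrives as a string such as `"12.1"` (consistent with the `cu121` build tag). Both function names are hypothetical, and this is a guess at the cause, not the code from the PR.

```python
def buggy_cuda_version(cuda: str) -> str:
    # Hypothetical reconstruction: indexing a string yields characters,
    # so "*" repeats text and "+" concatenates instead of doing arithmetic.
    return cuda[0] * 10 + cuda[1]


def parsed_cuda_version(cuda: str) -> int:
    # Split on "." and convert to integers before combining major and minor.
    major, minor = cuda.split(".")[:2]
    return int(major) * 10 + int(minor)


print(buggy_cuda_version("12.1"))   # 11111111112
print(parsed_cuda_version("12.1"))  # 121
```

Parsing the components first gives a number that can be compared meaningfully against a minimum supported version.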

07 Jun, 2024 5 commits
- fix DbrxFusedNormAttention missing cache_config (#5340) · 767c727a
  Calvinn Ng authored Jun 08, 2024
```
Co-authored-by: team <calvinn.ng@ahrefs.com>
```
  767c727a
- [Kernel] Dynamic Per-Token Activation Quantization (#5037) · ca3ea51b
  Dipika Sikka authored Jun 07, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  ca3ea51b
- [Kernel] Switch fp8 layers to use the CUTLASS kernels (#5183) · 8d75fe48
  Tyler Michael Smith authored Jun 07, 2024
```
Switching from torch._scaled_mm to vLLM's cutlass fp8 kernels when supported as we are seeing 5-15% improvement in e2e performance on neuralmagic/Meta-Llama-3-8B-Instruct-FP8

see https://docs.google.com/spreadsheets/d/1GiAnmzyGHgZ6zL_LDSTm35Bdrt4A8AaFEurDlISYYA4/ for some quick e2e benchmarks and #5144 for comparisons across different GEMM sizes.
```
  8d75fe48
- [Core] Change LoRA embedding sharding to support loading methods (#5038) · ccdc490d
  Antoni Baum authored Jun 06, 2024
  
  ccdc490d
- [Core] Avoid copying prompt/output tokens if no penalties are used (#5289) · a31cab75
  Antoni Baum authored Jun 06, 2024
  
  a31cab75
06 Jun, 2024 1 commit
- [Kernel] Retune Mixtral 8x22b configs for FP8 on H100 (#5294) · abe855d6
  Philipp Moritz authored Jun 06, 2024
  
  abe855d6
05 Jun, 2024 3 commits
- [Frontend][Core] Update Outlines Integration from `FSM` to `Guide` (#4109) · 7b0a0dfb
  Breno Faria authored Jun 06, 2024
```
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Breno Faria <breno.faria@intrafind.com>
```
  7b0a0dfb
- [Misc] Skip for logits_scale == 1.0 (#5291) · 6a7c7711
  Woosuk Kwon authored Jun 05, 2024
  
  6a7c7711
- [Kernel] Re-tune Mixtral MoE configurations for FP8 on H100 (#5238) · 51a08e7d
  Philipp Moritz authored Jun 05, 2024
  
  51a08e7d