Commits · 281ca6c196a350cb4c9a9a430d23456764dd31cf · OpenDAS / vllm_cscc

20 Jul, 2024 3 commits
- Remove debug output · 9653385f
  gaoqiong authored Jul 20, 2024
- Change how nn is supported · 835bd9fc
  gaoqiong authored Jul 20, 2024
- modify gemm pad strategy · 7fe40ced
  zhuwenwen authored Jul 20, 2024
17 Jul, 2024 1 commit
- set default layout · 22839191
  zhuwenwen authored Jul 17, 2024
09 Jul, 2024 1 commit
- Support Deepseek-V2 (#4650) · b1b95055
  huangwb authored Jul 09, 2024
08 Jul, 2024 1 commit
- add 7b pad dim · 5cdabd7b
  zhuwenwen authored Jul 08, 2024
06 Jul, 2024 1 commit
- add fa pad · 371b1251
  zhuwenwen authored Jul 06, 2024
01 Jul, 2024 1 commit
- add qwen2 arch · e9aa4ff0
  zhuwenwen authored Jul 01, 2024
28 Jun, 2024 1 commit
- add gemm padding · e58014d7
  zhuwenwen authored Jun 28, 2024
11 Jun, 2024 1 commit
- [Misc] Various simplifications and typing fixes (#5368) · a0086298
  Nick Hill authored Jun 10, 2024
10 Jun, 2024 3 commits
- [Bugfix] Fix LLaVA-NeXT (#5380) · 2c0d9335
  Cyrus Leung authored Jun 10, 2024
- [Model] Initial support for LLaVA-NeXT (#4199) · 6b29d6fe
  Cyrus Leung authored Jun 10, 2024
```
Co-authored-by: Roger Wang <ywang@roblox.com>
```
- [Misc] Update to comply with the new `compressed-tensors` config (#5350) · 5884c2b4
  Dipika Sikka authored Jun 09, 2024
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
09 Jun, 2024 1 commit
- [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047) · 5467ac31
  bnellnm authored Jun 09, 2024
08 Jun, 2024 2 commits
- [Misc][Breaking] Change FP8 checkpoint format from act_scale -> input_scale (#5353) · c09dade2
  Michael Goin authored Jun 08, 2024
- [Bug Fix] Fix the support check for FP8 CUTLASS (#5352) · e69ded7d
  Cheng Li authored Jun 07, 2024

  Bug description: with torch 2.4.0.dev20240603+cu121, `cutlass_fp8_supported` returns False, and the (capability, version) pair before the comparison is (90, 11111111112).

  This PR fixes the support check for FP8 CUTLASS (`cutlass_fp8_supported`), which was introduced in https://github.com/vllm-project/vllm/pull/5183.
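
Note on #5352: the commit message does not show the faulty code, so the following is only a sketch of one way an implausibly large version number (a long run of 1s followed by a 2) could arise. It assumes the CUDA version arrives as a string such as "12.1", which is the form `torch.version.cuda` takes, and that it is folded into an integer as major * 10 + minor. The function names are hypothetical and are not taken from vLLM.

```python
def encode_version_buggy(cuda_version: str) -> int:
    """Hypothetical faulty encoding: index the raw string instead of splitting it."""
    # Indexing a string yields single characters: "12.1"[0] == "1", "12.1"[1] == "2".
    major, minor = cuda_version[0], cuda_version[1]
    # str * int repeats the string, so "1" * 10 is ten 1s, and "+" concatenates.
    return int(major * 10 + minor)


def encode_version(cuda_version: str) -> int:
    """Hypothetical corrected encoding: split on ".", convert to int, then combine."""
    major, minor = (int(part) for part in cuda_version.split(".")[:2])
    return major * 10 + minor


if __name__ == "__main__":
    print(encode_version_buggy("12.1"))  # a run of 1s followed by 2
    print(encode_version("12.1"))        # 121
```

With integer arithmetic, CUDA 12.1 encodes as 121, so a threshold comparison on the (capability, version) pair behaves as intended.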

07 Jun, 2024 5 commits
- fix DbrxFusedNormAttention missing cache_config (#5340) · 767c727a
  Calvinn Ng authored Jun 08, 2024
```
Co-authored-by: team <calvinn.ng@ahrefs.com>
```
- [Kernel] Dynamic Per-Token Activation Quantization (#5037) · ca3ea51b
  Dipika Sikka authored Jun 07, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
- [Kernel] Switch fp8 layers to use the CUTLASS kernels (#5183) · 8d75fe48
  Tyler Michael Smith authored Jun 07, 2024
```
Switching from torch._scaled_mm to vLLM's cutlass fp8 kernels when supported as we are seeing 5-15% improvement in e2e performance on neuralmagic/Meta-Llama-3-8B-Instruct-FP8

see https://docs.google.com/spreadsheets/d/1GiAnmzyGHgZ6zL_LDSTm35Bdrt4A8AaFEurDlISYYA4/ for some quick e2e benchmarks and #5144 for comparisons across different GEMM sizes.
```
- [Core] Change LoRA embedding sharding to support loading methods (#5038) · ccdc490d
  Antoni Baum authored Jun 06, 2024
- [Core] Avoid copying prompt/output tokens if no penalties are used (#5289) · a31cab75
  Antoni Baum authored Jun 06, 2024
06 Jun, 2024 1 commit
- [Kernel] Retune Mixtral 8x22b configs for FP8 on H100 (#5294) · abe855d6
  Philipp Moritz authored Jun 06, 2024
05 Jun, 2024 5 commits
- [Frontend][Core] Update Outlines Integration from `FSM` to `Guide` (#4109) · 7b0a0dfb
  Breno Faria authored Jun 06, 2024
```
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Breno Faria <breno.faria@intrafind.com>
```
- [Misc] Skip for logits_scale == 1.0 (#5291) · 6a7c7711
  Woosuk Kwon authored Jun 05, 2024
- [Kernel] Re-tune Mixtral MoE configurations for FP8 on H100 (#5238) · 51a08e7d
  Philipp Moritz authored Jun 05, 2024
- [Model] Correct Mixtral FP8 checkpoint loading (#5231) · 5563a4de
  Cody Yu authored Jun 05, 2024
- [Misc] Add CustomOp interface for device portability (#5255) · 41ca62cf
  Woosuk Kwon authored Jun 05, 2024
04 Jun, 2024 2 commits
- [Kernel] Enhance MoE benchmarking & tuning script (#4921) · 3a434b07
  Woosuk Kwon authored Jun 03, 2024
- [Bugfix] Support `prompt_logprobs==0` (#5217) · 06b2550c
  Toshiki Kataoka authored Jun 04, 2024
03 Jun, 2024 3 commits
- [FRONTEND] OpenAI `tools` support named functions (#5032) · f775a07e
  Breno Faria authored Jun 04, 2024
- [Kernel] Pass a device pointer into the quantize kernel for the scales (#5159) · cbb2f59c
  Tyler Michael Smith authored Jun 03, 2024
- [Core] Support image processor (#4197) · 7a64d24a
  Cyrus Leung authored Jun 03, 2024
02 Jun, 2024 1 commit
- [Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer (#4927) · a66cf40b
  Divakar Verma authored Jun 02, 2024
```
This PR enables the fused topk_softmax kernel used in moe layer for HIP
```
01 Jun, 2024 3 commits
- [Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776) · b9c0605a
  chenqianfzh authored Jun 01, 2024
- [Minor] Fix the path typo in loader.py: save_sharded_states.py -> save_sharded_state.py (#5151) · c3540728
  Ye Cao authored Jun 02, 2024
```
Signed-off-by: Ye Cao <caoye.cao@alibaba-inc.com>
```
- [Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU (#5137) · 260d119e
  Tyler Michael Smith authored Jun 01, 2024
31 May, 2024 2 commits
- [Model] Enable FP8 QKV in MoE and refine kernel tuning script (#5039) · e9899fb7
  Cody Yu authored May 31, 2024
- [Bugfix] Avoid Warnings in SparseML Activation Quantization (#5120) · b35be540
  Robert Shaw authored May 30, 2024
30 May, 2024 2 commits
- support tn/nn · e5d707db
  zhuwenwen authored May 30, 2024
- [Bugfix] gptq_marlin: Ensure g_idx_sort_indices is not a Parameter (#5108) · 5bf185a1
  Alexander Matveev authored May 29, 2024