Commits · 7462218e1a04631e4d59dab0cb62e1c12b719f1e · OpenDAS / vllm_cscc

22 Aug, 2024 1 commit
- Update layernorm_kernels_opt.cu · d8ae62c7
  zhangshao authored Aug 22, 2024
  
  d8ae62c7
21 Aug, 2024 1 commit
- Update refactoring operation · bd93e661
  zhuwenwen authored Aug 21, 2024
  
  bd93e661
20 Aug, 2024 4 commits
- Refactoring the optimized kernel · 4405f82c
  zhuwenwen authored Aug 20, 2024
  
  4405f82c
- Update layernorm_kernels.cu · 2dbefd03
  zhangshao authored Aug 20, 2024
  
  2dbefd03
- Fix rmsnorm bug; add USE_VLLM_OLD_OP flag to use the original rmsnorm · 785f450d
  zhangshao authored Aug 20, 2024
  
  785f450d
- Update layernorm_kernels.cu · 1c5e7720
  zhangshao authored Aug 20, 2024
  
  1c5e7720
17 Aug, 2024 1 commit
- [Kernel] Revert layernorm_kernels · f82f451f
  zhuwenwen authored Aug 17, 2024
  
  f82f451f
15 Aug, 2024 1 commit
- Fix int32 index out-of-bounds error caused by excessively large data sizes · dfdc05ae
  zhangshao authored Aug 15, 2024
  
  dfdc05ae
13 Aug, 2024 1 commit
- Restore support for blocksize 8 and 32 · f99a8d1c
  zhangshao authored Aug 13, 2024
  
  f99a8d1c
12 Aug, 2024 1 commit
- fix:act_and_mul_kernel core dump bug · b13506a5
  bianch authored Aug 12, 2024
  
  b13506a5
10 Aug, 2024 1 commit
- Revert feat:optimize act_and_mul_kernel · 880b2e41
  zhuwenwen authored Aug 10, 2024
  
  880b2e41
09 Aug, 2024 2 commits
- Revert "pa add v prefetch for gemm1" · 749242a0
  flyingdown authored Aug 09, 2024
```
This reverts commit f38bd872.
```
  749242a0
- feat:optimize act_and_mul_kernel · b8c88ed3
  bianch authored Aug 09, 2024
  
  b8c88ed3
06 Aug, 2024 2 commits
- fix deepseek_v2 236b fused_moe_kernel expert_ids_ptr value error · b2068609
  wangmin6 authored Aug 06, 2024
  
  b2068609
- Restore bf16 support · 9f9f3796
  zhangshao authored Aug 06, 2024
  
  9f9f3796
05 Aug, 2024 1 commit
- pa add v prefetch for gemm1 · f38bd872
  flyingdown authored Aug 05, 2024
  
  f38bd872
29 Jul, 2024 1 commit
- Add support for head size 64-256 · 69185c0b
  zhangshao authored Jul 29, 2024
  
  69185c0b
24 Jul, 2024 1 commit
- update gptq relative path · 2d0a73a3
  zhuwenwen authored Jul 24, 2024
  
  2d0a73a3
23 Jul, 2024 2 commits
- optimize rmsnorm kernel · c62f8e9a
  zhuwenwen authored Jul 23, 2024
  
  c62f8e9a
- refactoring the transpose kernel and update supported model · b2dd1743
  zhuwenwen authored Jul 23, 2024
  
  b2dd1743
22 Jul, 2024 2 commits
- Optimize pa; optimize compile options · 1be9a629
  zhangshao authored Jul 22, 2024
  
  1be9a629
- fix gptq performance degradation when batch size>4 issue · f423ad60
  huangwb authored Jul 22, 2024
  
  f423ad60
20 Jul, 2024 2 commits
- support nn layout · 1e0cb1f4
  zhuwenwen authored Jul 20, 2024
  
  1e0cb1f4
- Change the way nn is supported · 835bd9fc
  gaoqiong authored Jul 20, 2024
  
  835bd9fc
18 Jul, 2024 1 commit
- back to pa and rn · 71b1be50
  zhuwenwen authored Jul 18, 2024
  
  71b1be50
17 Jul, 2024 1 commit
- add rotary_embedding for tgi · e499f96c
  huangwb authored Jul 17, 2024
  
  e499f96c
10 Jul, 2024 2 commits
- pa_v1 uses the original code; pa_v2 uses the new code · deeb9cb8
  zhangshao authored Jul 10, 2024
  
  deeb9cb8
- Optimize rmsnorm and page_attn · 9e10e8f7
  zhangshao authored Jul 10, 2024
  
  9e10e8f7
02 Jul, 2024 2 commits
- change pa v1 to 128 · bbf9488b
  zhuwenwen authored Jul 02, 2024
  
  bbf9488b
- change num_thread to 256 · 8ee4ae1f
  zhuwenwen authored Jul 02, 2024
  
  8ee4ae1f
13 Jun, 2024 2 commits

- [Hardware][Intel] Support CPU inference with AVX2 ISA (#5452) · cd9c0d65
  Jie Fu (傅杰) authored Jun 14, 2024
  
  cd9c0d65
- [Kernel] Factor out epilogues from cutlass kernels (#5391) · 85657b56
  Tyler Michael Smith authored Jun 13, 2024
  
  Co-authored-by: Michael Goin <michael@neuralmagic.com>
  Co-authored-by: youkaichao <youkaichao@gmail.com>
  Co-authored-by: zifeitong <zifei.tong@parasail.io>
  Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
  
  85657b56

12 Jun, 2024 2 commits

- [Kernel] Vectorized FP8 quantize kernel (#5396) · 5985e342
  Cody Yu authored Jun 12, 2024

Inspired by #5146, this PR improves the FP8 quantize kernel by vectorizing data transfers to make better use of memory bandwidth. Microbenchmarks show that the improved kernel achieves a 1.0x-1.5x speedup (especially when the hidden size is large).

In detail, we applied three optimizations:

- Use an inverted scale so that most divisions become multiplications.
- Unroll the loop 4 times to improve ILP.
- Use 4-wide vectorized accesses to transfer data between HBM and SRAM.

5985e342
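
The three optimizations listed in the description above can be illustrated with a small CPU-side sketch. This is not the kernel from #5396: the names `Float4`, `kFp8Max` and `scaled_quantize_sketch` are hypothetical, the value 448 assumes the FP8 E4M3 finite range, and the output stays in `float` (scaled and clamped only) rather than being converted to a real FP8 type.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Assumption: clamp to the FP8 E4M3 finite range. A real kernel would also
// convert to an actual FP8 type; this sketch keeps the result as float.
constexpr float kFp8Max = 448.0f;

// Hypothetical 4-wide pack: four contiguous floats moved as one unit,
// standing in for a single vectorized load or store.
struct alignas(16) Float4 {
    float x, y, z, w;
};

inline float scale_and_clamp(float v, float inv_scale) {
    // Multiply by the precomputed inverse scale instead of dividing.
    return std::max(-kFp8Max, std::min(kFp8Max, v * inv_scale));
}

std::vector<float> scaled_quantize_sketch(const std::vector<float>& in, float scale) {
    std::vector<float> out(in.size());
    const float inv_scale = 1.0f / scale;  // the only division in the routine
    const std::size_t n4 = in.size() / 4;  // number of complete 4-element packs

    for (std::size_t i = 0; i < n4; ++i) {
        Float4 v;
        std::memcpy(&v, in.data() + 4 * i, sizeof(Float4));   // one 16-byte load
        // Four independent operations per iteration (manual unroll by 4).
        v.x = scale_and_clamp(v.x, inv_scale);
        v.y = scale_and_clamp(v.y, inv_scale);
        v.z = scale_and_clamp(v.z, inv_scale);
        v.w = scale_and_clamp(v.w, inv_scale);
        std::memcpy(out.data() + 4 * i, &v, sizeof(Float4));  // one 16-byte store
    }

    // Scalar tail for sizes that are not a multiple of 4.
    for (std::size_t i = 4 * n4; i < in.size(); ++i) {
        out[i] = scale_and_clamp(in[i], inv_scale);
    }
    return out;
}
```

On a GPU the same pattern would typically have each thread handle one 4-element pack, so that the memory system sees fewer and wider transactions; the scalar tail loop covers lengths that do not divide evenly by 4.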

- skip fp8 · 103f3110
  zhuwenwen authored Jun 12, 2024
  
  103f3110

09 Jun, 2024 1 commit
- [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047) · 5467ac31
  bnellnm authored Jun 09, 2024
  
  5467ac31
07 Jun, 2024 2 commits
- [Misc] Remove unused cuda_utils.h in CPU backend (#5345) · 6840a716
  Jie Fu (傅杰) authored Jun 08, 2024
  
  6840a716
- [Kernel] Dynamic Per-Token Activation Quantization (#5037) · ca3ea51b
  Dipika Sikka authored Jun 07, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
  ca3ea51b
05 Jun, 2024 1 commit
- [Kernel] Add GPU architecture guards to the CUTLASS w8a8 kernels to reduce binary size (#5157) · ccd4f129
  Tyler Michael Smith authored Jun 05, 2024
```
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
  ccd4f129
03 Jun, 2024 2 commits
- [CI/BUILD] enable intel queue for longer CPU tests (#4113) · cafb8e06
  Yuan authored Jun 04, 2024
  
  cafb8e06
- [Kernel] Pass a device pointer into the quantize kernel for the scales (#5159) · cbb2f59c
  Tyler Michael Smith authored Jun 03, 2024
  
  cbb2f59c