Commits · e373853e12c890964e21fa0ac9b46fee48fa7c76 · OpenDAS / vllm_cscc

01 Jul, 2024 2 commits
- [Frontend] Relax api url assertion for openai benchmarking (#6046) · e373853e
  James Whedbee authored Jul 01, 2024
  
  e373853e
- [Misc] update benchmark backend for scalellm (#6018) · bb603268
  zhyncs authored Jul 02, 2024
  
  bb603268
29 Jun, 2024 1 commit
- [Bugfix] fix missing last itl in openai completions benchmark (#5926) · c4bca740
  mcalman authored Jun 28, 2024
  
  c4bca740
28 Jun, 2024 1 commit
- [Hardware][Intel] OpenVINO vLLM backend (#5379) · 57f09a41
  Ilya Lavrenov authored Jun 28, 2024
  
  57f09a41
25 Jun, 2024 1 commit
- [Speculative Decoding] Support draft model on different tensor-parallel size... · 2ce5d668
  Woo-Yeon Lee authored Jun 25, 2024
```
 [Speculative Decoding] Support draft model on different tensor-parallel size than target model (#5414)
```
  2ce5d668
20 Jun, 2024 1 commit
- [Frontend] Add FlexibleArgumentParser to support both underscore and dash in names (#5718) · 8065a7e2
  Michael Goin authored Jun 20, 2024
  
  8065a7e2
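
The idea in the title of 8065a7e2 (accepting both `--max_model_len` and `--max-model-len`) can be illustrated with a small `argparse` subclass. This is a minimal sketch under assumptions, not the code from the commit: the class name `DashOrUnderscoreParser` and the example flag are hypothetical, and the actual `FlexibleArgumentParser` may differ in its details.

```python
import argparse
import sys


class DashOrUnderscoreParser(argparse.ArgumentParser):
    """Hypothetical parser that treats --max_model_len like --max-model-len."""

    def parse_args(self, args=None, namespace=None):
        if args is None:
            args = sys.argv[1:]
        normalized = []
        for arg in args:
            if arg.startswith("--"):
                # Rewrite only the option name, never the value after "=".
                name, sep, value = arg.partition("=")
                normalized.append(name.replace("_", "-") + sep + value)
            else:
                normalized.append(arg)
        return super().parse_args(normalized, namespace)


parser = DashOrUnderscoreParser()
parser.add_argument("--max-model-len", type=int, default=None)

# Both spellings resolve to the same destination attribute.
dash = parser.parse_args(["--max-model-len", "4096"])
underscore = parser.parse_args(["--max_model_len=2048"])
print(dash.max_model_len, underscore.max_model_len)
```

Normalizing the raw argument strings before parsing keeps every `add_argument` call unchanged, which is why this sketch overrides `parse_args` rather than registering each flag twice.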
19 Jun, 2024 2 commits
- [Misc]Add param max-model-len in benchmark_latency.py (#5629) · d8714530
  DearPlanet authored Jun 19, 2024
  
  d8714530
- [Bugfix] Fix w8a8 benchmarks for int8 case (#5643) · 6820724e
  Tyler Michael Smith authored Jun 18, 2024
  
  6820724e
18 Jun, 2024 1 commit
- [Misc] Add OpenTelemetry support (#4687) · 7879f24d
  Ronen Schaffer authored Jun 18, 2024

  This PR adds basic support for OpenTelemetry distributed tracing.
  It includes changes to enable tracing functionality and improve monitoring capabilities.

  It also adds a Markdown guide with screenshots that shows users how to use this feature.

  7879f24d
17 Jun, 2024 4 commits
- [CI] the readability of benchmarking and prepare for dashboard (#5571) · 9e4e6fe2
  Kuntai Du authored Jun 17, 2024
```
[CI] Improve the readability of performance benchmarking results and prepare for upcoming performance dashboard (#5571)
```
  9e4e6fe2
- [Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (#3814) · 728c4c8a
  Kunshang Ji authored Jun 18, 2024
```
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com>
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
```
  728c4c8a
- [Misc] use AutoTokenizer for benchmark serving when vLLM not installed (#5588) · 1f12122b
  zhyncs authored Jun 18, 2024
  
  1f12122b
- Fix w8a8 benchmark and add Llama-3-8B (#5562) · e2b85cf8
  Cody Yu authored Jun 16, 2024
  
  e2b85cf8
15 Jun, 2024 1 commit
- [mypy] Enable type checking for test directory (#5017) · 0e9164b4
  Cyrus Leung authored Jun 15, 2024
  
  0e9164b4
14 Jun, 2024 2 commits
- [Misc] Fix arg names (#5524) · d74674bb
  Allen.Dou authored Jun 15, 2024
  
  d74674bb
- [CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs with... · 319ad7f1
  Kuntai Du authored Jun 13, 2024
```
[CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs with `perf-benchmarks` label (#5073)
Co-authored-by: simon-mo <simon.mo@hey.com>
```
  319ad7f1
13 Jun, 2024 2 commits
- [Kernel] Factor out epilogues from cutlass kernels (#5391) · 85657b56
  Tyler Michael Smith authored Jun 13, 2024

  Co-authored-by: Michael Goin <michael@neuralmagic.com>
  Co-authored-by: youkaichao <youkaichao@gmail.com>
  Co-authored-by: zifeitong <zifei.tong@parasail.io>
  Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>

  85657b56
- [Bugfix]if the content is started with ":"(response of ping), client should i… (#5303) · 88407532
  Wang, Yi authored Jun 13, 2024
```
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
```
  88407532
12 Jun, 2024 1 commit
- [Hardware] Initial TPU integration (#5292) · 1a8bfd92
  Woosuk Kwon authored Jun 12, 2024
  
  1a8bfd92
08 Jun, 2024 1 commit
- [Misc] Add args for selecting distributed executor to benchmarks (#5335) · b3376e5c
  Benjamin Kitor authored Jun 07, 2024
  
  b3376e5c
05 Jun, 2024 2 commits
- [Kernel] Re-tune Mixtral MoE configurations for FP8 on H100 (#5238) · 51a08e7d
  Philipp Moritz authored Jun 05, 2024
  
  51a08e7d
- [misc] benchmark_serving.py -- add ITL results and tweak TPOT results (#5263) · 02cc3b51
  Tyler Michael Smith authored Jun 05, 2024
  
  02cc3b51
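
For readers unfamiliar with the metrics named in 02cc3b51, the sketch below shows one conventional way to derive ITL (inter-token latency) and TPOT (time per output token) from per-token arrival times. It is an illustration under assumptions, not the code in `benchmark_serving.py`: the function name `latency_metrics`, its arguments, and the exact definitions used here are hypothetical.

```python
def latency_metrics(request_start, token_times):
    """Derive latency metrics from token arrival timestamps (in seconds).

    Assumed definitions (not taken from the commit):
      TTFT = time from request start to the first token
      ITL  = gap between each pair of consecutive tokens
      TPOT = (end-to-end latency - TTFT) / (number of tokens - 1)
    """
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    ttft = token_times[0] - request_start
    # One gap per consecutive pair, including the gap before the final token.
    itls = [later - earlier
            for earlier, later in zip(token_times, token_times[1:])]
    e2e = token_times[-1] - request_start
    n_gaps = len(token_times) - 1
    tpot = (e2e - ttft) / n_gaps if n_gaps > 0 else 0.0
    return {"ttft": ttft, "itls": itls, "e2e": e2e, "tpot": tpot}


# Four tokens arriving 0.25 s apart after a 0.5 s wait for the first one.
metrics = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
print(metrics)
```

Under these definitions TPOT is a single per-request average, while the ITL list keeps every individual gap, so percentiles over ITL can expose stalls that the average hides.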
04 Jun, 2024 2 commits
- [Kernel] Add back batch size 1536 and 3072 to MoE tuning (#5242) · 27208be6
  Woosuk Kwon authored Jun 04, 2024
  
  27208be6
- [Kernel] Enhance MoE benchmarking & tuning script (#4921) · 3a434b07
  Woosuk Kwon authored Jun 03, 2024
  
  3a434b07
01 Jun, 2024 1 commit
- [Kernel] Update Cutlass fp8 configs (#5144) · f081c3ce
  Varun Sundar Rabindranath authored Jun 01, 2024

  Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
  Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>

  f081c3ce
31 May, 2024 2 commits
- [Model] Enable FP8 QKV in MoE and refine kernel tuning script (#5039) · e9899fb7
  Cody Yu authored May 31, 2024
  
  e9899fb7
- [Model] Support MAP-NEO model (#5081) · a22dea54
  SnowDist authored May 31, 2024
```
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
```
  a22dea54
29 May, 2024 1 commit
- [Misc] add gpu_memory_utilization arg (#5079) · 616e600e
  Marut Pandya authored May 28, 2024
```
Signed-off-by: pandyamarut <pandyamarut@gmail.com>
```
  616e600e
28 May, 2024 1 commit
- [Core] Consolidate prompt arguments to LLM engines (#4328) · 5ae5ed1e
  Cyrus Leung authored May 29, 2024
```
Co-authored-by: Roger Wang <ywang@roblox.com>
```
  5ae5ed1e
25 May, 2024 1 commit
- [Misc] Make Serving Benchmark More User-friendly (#5044) · f17a1a8f
  Roger Wang authored May 25, 2024
  
  f17a1a8f
23 May, 2024 1 commit
- Marlin 24 prefill performance improvement (about 25% better on average) (#4983) · 60662532
  Alexander Matveev authored May 23, 2024
  
  60662532
22 May, 2024 1 commit
- [Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893) · a3a73ab0
  Cody Yu authored May 22, 2024

  The 2nd PR for #4532.

  This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (with the .kv_scale parameter).

  a3a73ab0
20 May, 2024 1 commit
- [Doc]Add documentation to benchmarking script when running TGI (#4920) · c3af4472
  Kuntai Du authored May 20, 2024
  
  c3af4472
16 May, 2024 3 commits
- Add JSON output support for benchmark_latency and benchmark_throughput (#4848) · f09edd8a
  Simon Mo authored May 16, 2024
  
  f09edd8a
- Add marlin unit tests and marlin benchmark script (#4815) · 5c342570
  alexm-nm authored May 16, 2024
  
  5c342570
- [Speculative decoding][Re-take] Enable TP>1 speculative decoding (#4840) · 973617ae
  Cody Yu authored May 16, 2024
```
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cade Daniel <cade@anyscale.com>
```
  973617ae
14 May, 2024 1 commit
- [Core][Hash][Automatic Prefix caching] Accelerating the hashing function by... · ccb63a82
  Kuntai Du authored May 14, 2024
```
[Core][Hash][Automatic Prefix caching] Accelerating the hashing function by avoiding deep copies (#4696)
```
  ccb63a82
03 May, 2024 1 commit
- [Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) · 3521ba4f
  SangBin Cho authored May 04, 2024
  
  3521ba4f
01 May, 2024 2 commits
- [Kernel] Update fused_moe tuning script for FP8 (#4457) · 24bb4fe4
  Philipp Moritz authored May 01, 2024

  This PR updates the tuning script for the fused_moe kernel to support FP8 and adds configurations for TP4. For small batch sizes, I removed num_warps and num_stages from the configuration: doing so improved performance and brought the benchmarks in that regime back on par with the earlier numbers, so that this PR is a strict improvement over the status quo.

  All the numbers below are for mistralai/Mixtral-8x7B-Instruct-v0.1, 1000 input and 50 output tokens.

  Before this PR (with static activation scaling):

  qps = 1: 9.8 ms ITL, 0.49s e2e latency
  qps = 2: 9.7 ms ITL, 0.49s e2e latency
  qps = 4: 10.1 ms ITL, 0.52s e2e latency
  qps = 6: 11.9 ms ITL, 0.59s e2e latency
  qps = 8: 14.0 ms ITL, 0.70s e2e latency
  qps = 10: 15.7 ms ITL, 0.79s e2e latency

  After this PR (with static activation scaling):

  qps = 1: 9.8 ms ITL, 0.49s e2e latency
  qps = 2: 9.7 ms ITL, 0.49s e2e latency
  qps = 4: 10.2 ms ITL, 0.53s e2e latency
  qps = 6: 11.9 ms ITL, 0.59s e2e latency
  qps = 8: 11.9 ms ITL, 0.59s e2e latency
  qps = 10: 12.1 ms ITL, 0.61s e2e latency

  24bb4fe4
- [Core] Enable prefix caching with block manager v2 enabled (#4142) · 24750f4c
  leiwen83 authored May 02, 2024
```
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Sage Moore <sagemoore@utexas.edu>
```
  24750f4c
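
The before/after figures quoted in the message of 24bb4fe4 are easier to compare as relative changes. The short script below is only a worked calculation on those quoted numbers: the ITL and e2e values are copied from the commit message, while the variable names and the `percent_change` helper are introduced here for illustration.

```python
# qps -> (ITL in ms, e2e latency in s), copied from the 24bb4fe4 message.
before = {1: (9.8, 0.49), 2: (9.7, 0.49), 4: (10.1, 0.52),
          6: (11.9, 0.59), 8: (14.0, 0.70), 10: (15.7, 0.79)}
after = {1: (9.8, 0.49), 2: (9.7, 0.49), 4: (10.2, 0.53),
         6: (11.9, 0.59), 8: (11.9, 0.59), 10: (12.1, 0.61)}


def percent_change(old, new):
    """Relative change from old to new, in percent (negative means faster)."""
    return (new - old) / old * 100.0


itl_change = {qps: percent_change(before[qps][0], after[qps][0])
              for qps in before}
e2e_change = {qps: percent_change(before[qps][1], after[qps][1])
              for qps in before}

for qps in sorted(before):
    print(f"qps = {qps}: ITL {itl_change[qps]:+.1f}%, "
          f"e2e {e2e_change[qps]:+.1f}%")
```

On these numbers the reduction is concentrated at qps 8 and 10 (ITL down by roughly 15% and 23%), while the lower request rates are unchanged or within about 1% of the earlier values.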