Commits · 95db455e7f337e99ffafd0b14367a7cbc11dca43 · OpenDAS / vllm_cscc

18 Jun, 2024 7 commits
- [Misc] Add channel-wise quantization support for w8a8 dynamic per token... · 95db455e
  Dipika Sikka authored Jun 18, 2024
```
[Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization (#5542)
```
- [Misc] Add OpenTelemetry support (#4687) · 7879f24d
  Ronen Schaffer authored Jun 18, 2024
```
This PR adds basic support for OpenTelemetry distributed tracing.
It includes changes to enable tracing functionality and improve monitoring capabilities.

I've also added a Markdown guide with screenshots that shows users how to use this feature.
```
- [Misc] Remove import from transformers logging (#5625) · f0cc0e68
  Chang Su authored Jun 18, 2024
  
- [bugfix][distributed] improve p2p capability test (#5612) · db5ec52a
  youkaichao authored Jun 18, 2024
```
[bugfix][distributed] do not error if two processes do not agree on p2p capability (#5612)
```
- [misc][typo] fix typo (#5620) · 8eadcf0b
  youkaichao authored Jun 17, 2024
  
- [Model] Initialize Phi-3-vision support (#4986) · daef218b
  Isotr0py authored Jun 18, 2024
  
- [Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the... · fa9e3852
  sroy745 authored Jun 17, 2024
```
[Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier (#5131)
```
17 Jun, 2024 8 commits
- [Fix] Use utf-8 encoding in entrypoints/openai/run_batch.py (#5606) · 26e1188e
  zifeitong authored Jun 17, 2024
  
- [Bugfix] Fix KV head calculation for MPT models when using GQA (#5142) · a3e8a05d
  Bruce Fontaine authored Jun 17, 2024
  
- [Optimization] use a pool to reuse LogicalTokenBlock.token_ids (#5584) · e441bad6
  youkaichao authored Jun 17, 2024
  
- [bugfix][distributed] fix 16 gpus local rank arrangement (#5604) · 1b44aaf4
  youkaichao authored Jun 17, 2024
  
- [Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (#3814) · 728c4c8a
  Kunshang Ji authored Jun 18, 2024
```
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com>
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
```
- [Kernel] `compressed-tensors` marlin 24 support (#5435) · 890d8d96
  Dipika Sikka authored Jun 17, 2024
  
- Correct alignment in the seq_len diagram. (#5592) · 9e74d9d0
  Charles Riggins authored Jun 18, 2024
```
Co-authored-by: Liqian Chen <liqian.chen@deeplang.ai>
```
- [Model] Rename Phi3 rope scaling type (#5595) · 9333fb8e
  Amit Garg authored Jun 17, 2024
  
15 Jun, 2024 5 commits
- [Fix] Correct OpenAI batch response format (#5554) · 3ce2c050
  zifeitong authored Jun 15, 2024
  
- [BugFix] Don't start a Ray cluster when not using Ray (#5570) · 1c0afa13
  Nick Hill authored Jun 15, 2024
  
- [misc] Do not allow to use lora with chunked prefill. (#5538) · e691918e
  SangBin Cho authored Jun 15, 2024
```
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
```
- [mypy] Enable type checking for test directory (#5017) · 0e9164b4
  Cyrus Leung authored Jun 15, 2024
  
- [Core][Bugfix]: fix prefix caching for blockv2 (#5364) · 1b8a0d71
  leiwen83 authored Jun 15, 2024
```
Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
```
14 Jun, 2024 9 commits
- [Core][Distributed] improve p2p cache generation (#5528) · f5bb85b4
  youkaichao authored Jun 14, 2024
  
- [Bugfix] Fix typo in Pallas backend (#5558) · 28c145eb
  Woosuk Kwon authored Jun 14, 2024
  
- [Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models (#5460) · e2afb03c
  Thomas Parnell authored Jun 14, 2024
```
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
```
- [Doc] Update documentation on Tensorizer (#5471) · 6e2527a7
  Sanger Steel authored Jun 14, 2024
  
- [misc][distributed] fix benign error in `is_in_the_same_node` (#5512) · d1c3d7d1
  youkaichao authored Jun 14, 2024
  
- [Core] Remove duplicate processing in async engine (#5525) · 77490c6f
  Cyrus Leung authored Jun 15, 2024
  
- [ Misc ] Rs/compressed tensors cleanup (#5432) · 15985680
  Robert Shaw authored Jun 14, 2024
```
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
```
- [Kernel] Fix CUTLASS 3.x custom broadcast load epilogue (#5516) · 703475f6
  Tyler Michael Smith authored Jun 14, 2024
  
- bump version to v0.5.0.post1 (#5522) · 0f0d8bc0
  Simon Mo authored Jun 13, 2024
  
13 Jun, 2024 11 commits
- Add `cuda_device_count_stateless` (#5473) · 50eed24d
  Antoni Baum authored Jun 13, 2024
  
- [Kernel] Disable CUTLASS kernels for fp8 (#5505) · e38042d4
  Tyler Michael Smith authored Jun 13, 2024
  
- Revert "[Core] Remove unnecessary copies in flash attn backend" (#5478) · 6b0511a5
  Antoni Baum authored Jun 13, 2024
  
- [MISC] Remove FP8 warning (#5472) · 30299a41
  Cody Yu authored Jun 13, 2024
```
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
```
- [Kernel] Factor out epilogues from cutlass kernels (#5391) · 85657b56
  Tyler Michael Smith authored Jun 13, 2024
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: zifeitong <zifei.tong@parasail.io>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
- [Doc] Update LLaVA docs (#5437) · 0ce7b952
  Cyrus Leung authored Jun 14, 2024
```
Co-authored-by: Roger Wang <ywang@roblox.com>
```
- [Misc] Add vLLM version getter to utils (#5098) · 03dccc88
  Cyrus Leung authored Jun 14, 2024
  
- [Hardware][Intel] Optimize CPU backend and add more performance tips (#4971) · 80aa7e91
  Li, Jiang authored Jun 14, 2024
```
Co-authored-by: Jianan Gu <jianan.gu@intel.com>
```
- [Kernel] Tune Qwen2MoE kernel configurations with tp2,4 (#5497) · bd439735
  wenyujin333 authored Jun 14, 2024
```
Tune Qwen2-57B-A14B configs based on #4921

Throughput Performance
command: python benchmarks/benchmark_throughput.py --model=Qwen/Qwen2-57B-A14B-Instruct --input-len 1000 --output-len 50 -tp 2

A100 GPU

benchmark  no config                             w/ PR
tp=2       10.53 requests/s, 11058.17 tokens/s   12.47 requests/s, 13088.57 tokens/s
tp=4       17.77 requests/s, 18662.95 tokens/s   20.20 requests/s, 21212.32 tokens/s
```
- [Kernel] `w4a16` support for `compressed-tensors` (#5385) · c2637a61
  Dipika Sikka authored Jun 13, 2024
```
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
- [Core][Distributed] code deduplication in tp&pp with coordinator(#5293) · ea3890a5
  youkaichao authored Jun 12, 2024
```
[Core][Distributed] add coordinator to reduce code duplication in tp and pp (#5293)
```