Commits · c35e4a3dd74fa5952b04354a3c7cfd0ed09e2eb0 · OpenDAS / vllm_cscc

21 Jun, 2024 3 commits
- [BugFix] Fix test_phi3v.py (#5725) · c35e4a3d
  Chang Su authored Jun 20, 2024
  
  c35e4a3d
- [Kernel] Add punica dimension for Qwen2 LoRA (#5441) · 1f567421
  Jinzhen Lin authored Jun 21, 2024
  
  1f567421
- [Model] MLPSpeculator speculative decoding support (#4947) · b12518d3
  Joshua Rosenkranz authored Jun 20, 2024
```
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Davis Wertheimer <Davis.Wertheimer@ibm.com>
```
  b12518d3
20 Jun, 2024 2 commits
- [Frontend] Add FlexibleArgumentParser to support both underscore and dash in names (#5718) · 8065a7e2
  Michael Goin authored Jun 20, 2024
  
  8065a7e2
- [Misc] Improve conftest (#5681) · 3730a1c8
  Cyrus Leung authored Jun 20, 2024
  
  3730a1c8
19 Jun, 2024 4 commits
- [Misc] Add per channel support for static activation quantization; update w8a8... · 4a30d7e3
  Dipika Sikka authored Jun 19, 2024
```
[Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes (#5650)
```
  4a30d7e3
- [Bugfix] AsyncLLMEngine hangs with asyncio.run (#5654) · 78687504
  zifeitong authored Jun 19, 2024
  
  78687504
- [ci][distributed] add tests for custom allreduce (#5689) · d571ca01
  youkaichao authored Jun 19, 2024
  
  d571ca01
- [Bugfix] Added test for sampling repetition penalty bug. (#5659) · e5150f2c
  Thomas Parnell authored Jun 19, 2024
```
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
```
  e5150f2c
18 Jun, 2024 7 commits
- [Model] LoRA support added for command-r (#5178) · 07feecde
  sergey-tinkoff authored Jun 18, 2024
  
  07feecde
- [Misc] Add channel-wise quantization support for w8a8 dynamic per token... · 95db455e
  Dipika Sikka authored Jun 18, 2024
```
[Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization (#5542)
```
  95db455e
- [Misc] Add OpenTelemetry support (#4687) · 7879f24d
  Ronen Schaffer authored Jun 18, 2024
```
This PR adds basic support for OpenTelemetry distributed tracing.
It includes changes to enable tracing functionality and improve monitoring capabilities.

I've also added a markdown with print-screens to guide users how to use this feature. You can find it here
```
  7879f24d
- [CI/Build][Misc] Update Pytest Marker for VLMs (#5623) · 4ad7b53e
  Roger Wang authored Jun 18, 2024
  
  4ad7b53e
- [Kernel] Add punica dimensions for Granite 13b (#5559) · 5002175e
  Joe Runde authored Jun 17, 2024
```
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
```
  5002175e
- [Model] Initialize Phi-3-vision support (#4986) · daef218b
  Isotr0py authored Jun 18, 2024
  
  daef218b
- [Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the... · fa9e3852
  sroy745 authored Jun 17, 2024
```
[Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier (#5131)
```
  fa9e3852
17 Jun, 2024 1 commit
- [Kernel] `compressed-tensors` marlin 24 support (#5435) · 890d8d96
  Dipika Sikka authored Jun 17, 2024
  
  890d8d96
16 Jun, 2024 1 commit
- [CI][BugFix] Flip is_quant_method_supported condition (#5577) · 4a676905
  Michael Goin authored Jun 16, 2024
  
  4a676905
15 Jun, 2024 4 commits
- add gptq_marlin test for bug report https://github.com/vllm-project/vllm/issues/5088 (#5145) · d919ecc7
  Alexander Matveev authored Jun 15, 2024
  
  d919ecc7
- [CI/Build] Test both text and token IDs in batched OpenAI Completions API (#5568) · 81fbb365
  Cyrus Leung authored Jun 15, 2024
  
  81fbb365
- [mypy] Enable type checking for test directory (#5017) · 0e9164b4
  Cyrus Leung authored Jun 15, 2024
  
  0e9164b4
- [Core][Bugfix]: fix prefix caching for blockv2 (#5364) · 1b8a0d71
  leiwen83 authored Jun 15, 2024
```
Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
```
  1b8a0d71
14 Jun, 2024 1 commit
- [mis] fix flaky test of test_cuda_device_count_stateless (#5546) · 48f589e1
  youkaichao authored Jun 14, 2024
  
  48f589e1
13 Jun, 2024 7 commits
- Add `cuda_device_count_stateless` (#5473) · 50eed24d
  Antoni Baum authored Jun 13, 2024
  
  50eed24d
- [CI/Build] Disable test_fp8.py (#5508) · 33e3b372
  Tyler Michael Smith authored Jun 13, 2024
  
  33e3b372
- [Kernel] Factor out epilogues from cutlass kernels (#5391) · 85657b56
  Tyler Michael Smith authored Jun 13, 2024
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: zifeitong <zifei.tong@parasail.io>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
  85657b56
- [CI/Build] Simplify OpenAI server setup in tests (#5100) · 39873476
  Cyrus Leung authored Jun 14, 2024
  
  39873476
- [CI/Build][REDO] Add is_quant_method_supported to control quantization test configurations (#5466) · 23ec72fa
  Michael Goin authored Jun 13, 2024
  
  23ec72fa
- [Kernel] `w4a16` support for `compressed-tensors` (#5385) · c2637a61
  Dipika Sikka authored Jun 13, 2024
```
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
```
  c2637a61
- [Core][Distributed] code deduplication in tp&pp with coordinator(#5293) · ea3890a5
  youkaichao authored Jun 12, 2024
```
[Core][Distributed] add coordinator to reduce code duplication in tp and pp (#5293)
```
  ea3890a5
12 Jun, 2024 5 commits

- [Frontend] [Core] Support for sharded tensorized models (#4990) · 51602eef
  Travis Johnson authored Jun 12, 2024

  Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
  Co-authored-by: Sanger Steel <sangersteel@gmail.com>
  Co-authored-by: Roger Wang <ywang@roblox.com>

  51602eef

- [Kernel] Vectorized FP8 quantize kernel (#5396) · 5985e342
  Cody Yu authored Jun 12, 2024

  Inspired by #5146, this PR improves the FP8 quantize kernel by vectorizing data transfer to better utilize memory bandwidth. Microbenchmark shows that this improved kernel can achieve 1.0x-1.5x speedup (especially when hidden size is large).

  In detail, we applied 3 optimizations:

  - Use inverted scale so that most divisions are changed to multiplications.
  - Unroll the loop by 4 times to improve ILP.
  - Use vectorized 4 to transfer data between HBM and SRAM.

  5985e342
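
The three optimizations listed in the commit message above can be illustrated with a small pure-Python sketch. This is not the kernel from #5396: the function names, the `clamp` helper, and the use of list slices to stand in for 4-wide loads and stores are all assumptions made for illustration. The final cast to an FP8 bit pattern is left out, so the sketch only scales and clamps to the finite range of the E4M3 format.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def clamp(x):
    """Saturate x to the finite FP8 E4M3 range (hypothetical helper)."""
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))


def scale_reference(values, scale):
    """Baseline: one division per element."""
    return [clamp(v / scale) for v in values]


def scale_inverted_unrolled(values, scale):
    """Same computation with the three ideas from the commit message."""
    inv_scale = 1.0 / scale  # 1) invert once, then multiply per element
    out = [0.0] * len(values)
    full = len(values) - len(values) % 4

    for i in range(0, full, 4):
        # 3) one slice stands in for a single 4-wide load
        v0, v1, v2, v3 = values[i:i + 4]
        # 2) loop body unrolled by 4: four independent multiplies
        q0 = clamp(v0 * inv_scale)
        q1 = clamp(v1 * inv_scale)
        q2 = clamp(v2 * inv_scale)
        q3 = clamp(v3 * inv_scale)
        # 3) one slice assignment stands in for a single 4-wide store
        out[i:i + 4] = (q0, q1, q2, q3)

    # Scalar tail for lengths that are not a multiple of 4
    for i in range(full, len(values)):
        out[i] = clamp(values[i] * inv_scale)
    return out
```

Multiplying by a precomputed reciprocal can differ from a true division in the last bit of the result, so the two functions should be compared with a small tolerance unless the scale is a power of two.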

- [CI] Upgrade codespell version. (#5381) · 847cdcca
  SangBin Cho authored Jun 13, 2024
  
  847cdcca
- Revert "[CI/Build] Add `is_quant_method_supported` to control quantization... · e3c12bf6
  Simon Mo authored Jun 12, 2024
```
Revert "[CI/Build] Add `is_quant_method_supported` to control quantization test configurations" (#5463)
```
  e3c12bf6
- [CI/Build] Add `is_quant_method_supported` to control quantization test configurations (#5253) · 3dd6853b
  Michael Goin authored Jun 12, 2024
  
  3dd6853b

11 Jun, 2024 5 commits
- [Core][Doc] Default to multiprocessing for single-node distributed case (#5230) · 99dac099
  Nick Hill authored Jun 11, 2024
```
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
```
  99dac099
- [Core][Distributed] add same-node detection (#5369) · c4bd03c7
  youkaichao authored Jun 11, 2024
  
  c4bd03c7
- [Frontend] Customizable RoPE theta (#5197) · dcbf4286
  sasha0552 authored Jun 11, 2024
  
  dcbf4286
- [Bugfix][Frontend] Cleanup "fix chat logprobs" (#5026) · 640052b0
  Cyrus Leung authored Jun 11, 2024
  
  640052b0
- [Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs (#5312) · 351d5e7b
  maor-ps authored Jun 11, 2024
```
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
  351d5e7b