Commits · cb77ad836f0ee8572c0f3d6f08fa993b2565a55b · OpenDAS / vllm_cscc · GitLab

10 Jun, 2024 11 commits
- [Docs] Alphabetically sort sponsors (#5386) · cb77ad83
  Woosuk Kwon authored Jun 10, 2024
  
- [Docs] Add Docs on Limitations of VLM Support (#5383) · 856c9900
  Roger Wang authored Jun 10, 2024
  
- [ci] Mount buildkite agent on Docker container to upload benchmark results (#5330) · c5602f0b
  Kevin H. Luu authored Jun 10, 2024
```
Signed-off-by: kevin <kevin@anyscale.com>
```
- [ci] Use small_cpu_queue for doc build (#5331) · f7f9c5f9
  Kevin H. Luu authored Jun 10, 2024
```
Signed-off-by: kevin <kevin@anyscale.com>
```
- [Bugfix] Fix LLaVA-NeXT (#5380) · 2c0d9335
  Cyrus Leung authored Jun 10, 2024
  
- [Feature][Frontend]: Continued `stream_options` implementation also in CompletionRequest (#5319) · 774d1035
  Itay Etelis authored Jun 10, 2024
  
- [Model] Initial support for LLaVA-NeXT (#4199) · 6b29d6fe
  Cyrus Leung authored Jun 10, 2024
```
Co-authored-by: Roger Wang <ywang@roblox.com>
```
- [Misc] Improve error message when LoRA parsing fails (#5194) · 0bfa1c4f
  Cyrus Leung authored Jun 10, 2024
  
- [misc][typo] fix typo (#5372) · c81da5f5
  youkaichao authored Jun 10, 2024
  
- [Frontend][Misc] Enforce Pixel Values as Input Type for VLMs in API Server (#5374) · 68bc8170
  Roger Wang authored Jun 10, 2024
  
- [Misc] Update to comply with the new `compressed-tensors` config (#5350) · 5884c2b4
  Dipika Sikka authored Jun 09, 2024
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
09 Jun, 2024 4 commits
- [Bugfix] Fix KeyError: 1 When Using LoRA adapters (#5164) · 45f92c00
  Bla_ckB authored Jun 10, 2024
  
- [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047) · 5467ac31
  bnellnm authored Jun 09, 2024
  
- [mis][ci/test] fix flaky test in test_sharded_state_loader.py (#5361) · 5d7e3d01
  youkaichao authored Jun 08, 2024
```
[mis][ci/test] fix flaky test in tests/test_sharded_state_loader.py (#5361)
```
- [Core][CUDA Graph] add output buffer for cudagraph (#5074) · 0373e183
  youkaichao authored Jun 08, 2024
```
[Core][CUDA Graph] add output buffer for cudagraph to reduce memory footprint (#5074)
```
08 Jun, 2024 6 commits
- [Misc][Breaking] Change FP8 checkpoint format from act_scale -> input_scale (#5353) · c09dade2
  Michael Goin authored Jun 08, 2024
  
- [CI/Test] improve robustness of test (vllm_runner) (#5357) · 8ea5e44a
  youkaichao authored Jun 08, 2024
```
[CI/Test] improve robustness of test by replacing del with context manager (vllm_runner) (#5357)
```
- [CI/Test] improve robustness of test (hf_runner) (#5347) · 9fb900f9
  youkaichao authored Jun 07, 2024
```
[CI/Test] improve robustness of test by replacing del with context manager (hf_runner) (#5347)
```
- [ROCm][AMD] Use pytorch sdpa math backend to do naive attention (#4965) · c96fc067
  Hongxia Yang authored Jun 07, 2024
  
- [Misc] Add args for selecting distributed executor to benchmarks (#5335) · b3376e5c
  Benjamin Kitor authored Jun 07, 2024
  
- [Bug Fix] Fix the support check for FP8 CUTLASS (#5352) · e69ded7d
  Cheng Li authored Jun 07, 2024
```
Bug description:
With torch 2.4.0.dev20240603+cu121,
cutlass_fp8_supported outputs False, and the (capability, version) before the comparison is (90, 11111111112)

This PR fixes the support check for FP8 CUTLASS (cutlass_fp8_supported), which was introduced in https://github.com/vllm-project/vllm/pull/5183.
```
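
The bug description for #5352 above reports an implausibly large version value in the `(capability, version)` pair. The commit log does not show the faulty code, so the following is only a guess at how a value of that shape can arise: `torch.version.cuda` is a string such as `"12.1"`, and applying integer-style arithmetic to its characters repeats them instead of scaling a number. The function names below are hypothetical and are not taken from vLLM.

```python
def encode_version_buggy(version: str) -> str:
    """Hypothetical faulty encoding: arithmetic applied to string characters.

    version[0] is a one-character string, so `* 10` repeats it ten times and
    `+` concatenates rather than adds.
    """
    return version[0] * 10 + version[1]


def encode_version(version: str) -> int:
    """Parse "major.minor" into integers first, then combine as major*10+minor."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return major * 10 + minor


# The string "12.1" stands in for torch.version.cuda on a CUDA 12.1 build.
buggy = encode_version_buggy("12.1")   # "1" repeated ten times, then "2"
fixed = encode_version("12.1")         # the integer 121
```

Parsing the components to integers before combining them keeps the result comparable against numeric thresholds, which is presumably what a support check of this kind needs.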
07 Jun, 2024 12 commits
- fix DbrxFusedNormAttention missing cache_config (#5340) · 767c727a
  Calvinn Ng authored Jun 08, 2024
```
Co-authored-by: team <calvinn.ng@ahrefs.com>
```
- [Misc] Remove unused cuda_utils.h in CPU backend (#5345) · 6840a716
  Jie Fu (傅杰) authored Jun 08, 2024
  
- [Frontend] Add OpenAI Vision API Support (#5237) · 7a9cb294
  Roger Wang authored Jun 07, 2024
```
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
- [Kernel] Dynamic Per-Token Activation Quantization (#5037) · ca3ea51b
  Dipika Sikka authored Jun 07, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
- Addition of lacked ignored_seq_groups in _schedule_chunked_prefill (#5296) · dc49fb89
  limingshu authored Jun 07, 2024
  
- Remove Ray health check (#4693) · 18a277b5
  Antoni Baum authored Jun 07, 2024
  
- [Kernel] Switch fp8 layers to use the CUTLASS kernels (#5183) · 8d75fe48
  Tyler Michael Smith authored Jun 07, 2024
```
Switching from torch._scaled_mm to vLLM's cutlass fp8 kernels when supported as we are seeing 5-15% improvement in e2e performance on neuralmagic/Meta-Llama-3-8B-Instruct-FP8

see https://docs.google.com/spreadsheets/d/1GiAnmzyGHgZ6zL_LDSTm35Bdrt4A8AaFEurDlISYYA4/ for some quick e2e benchmarks and #5144 for comparisons across different GEMM sizes.
```
- [Misc][Utils] allow get_open_port to be called for multiple times (#5333) · 388596c9
  youkaichao authored Jun 06, 2024
  
- [Feature][Frontend]: Add support for `stream_options` in `ChatCompletionRequest` (#5135) · baa15a9e
  Itay Etelis authored Jun 07, 2024
  
- [Misc] Missing error message for custom ops import (#5282) · 15063741
  Jie Fu (傅杰) authored Jun 07, 2024
  
- [Core] Change LoRA embedding sharding to support loading methods (#5038) · ccdc490d
  Antoni Baum authored Jun 06, 2024
  
- [Core] Avoid copying prompt/output tokens if no penalties are used (#5289) · a31cab75
  Antoni Baum authored Jun 06, 2024
  
06 Jun, 2024 4 commits
- [Frontend] enable passing multiple LoRA adapters at once to generate() (#5300) · 828da0d4
  Matthew Goldey authored Jun 06, 2024
  
- [Kernel] Retune Mixtral 8x22b configs for FP8 on H100 (#5294) · abe855d6
  Philipp Moritz authored Jun 06, 2024
  
- Bugfix: fix broken of download models from modelscope (#5233) · 4efff036
  liuyhwangyh authored Jun 07, 2024
```
Co-authored-by: mulin.lyh <mulin.lyh@taobao.com>
```
- [CI/Build] Update vision tests (#5307) · 89c92078
  Cyrus Leung authored Jun 06, 2024
  
05 Jun, 2024 3 commits
- [Frontend][Core] Update Outlines Integration from `FSM` to `Guide` (#4109) · 7b0a0dfb
  Breno Faria authored Jun 06, 2024
```
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Breno Faria <breno.faria@intrafind.com>
```
- [CI] Disable flash_attn backend for spec decode (#5286) · 3a6ae1d3
  Simon Mo authored Jun 05, 2024
  
- [Docs] Add Ray Summit CFP (#5295) · 8f1729b8
  Simon Mo authored Jun 05, 2024
  