Commits · e69ded7d1c8a4f6ed26e64090bdc050c06cde3b9 · OpenDAS / vllm_cscc

08 Jun, 2024 1 commit

- [Bug Fix] Fix the support check for FP8 CUTLASS (#5352) · e69ded7d
  Cheng Li authored Jun 07, 2024

  Bug description: With torch 2.4.0.dev20240603+cu121, cutlass_fp8_supported outputs False, and the (capability, version) before the comparison is (90, 11111111112).

  This PR fixes the support check for FP8 CUTLASS (cutlass_fp8_supported), which was introduced in https://github.com/vllm-project/vllm/pull/5183.
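
The version value 11111111112 in the bug description above is consistent with a CUDA version string such as "12.1" being indexed and multiplied as text instead of as numbers. The sketch below reproduces that effect in plain Python. The helper names are hypothetical, and this is an illustration under that assumption, not the actual vLLM code or the patch from #5352.

```python
def concatenated_version(cuda_version: str) -> str:
    """Hypothetical buggy helper: treats string characters as if they were numbers."""
    # Indexing a string yields one-character strings, so "* 10" repeats the
    # character ten times and "+" concatenates instead of adding.
    return cuda_version[0] * 10 + cuda_version[1]


def parsed_version(cuda_version: str) -> int:
    """Hypothetical corrected helper: parse major and minor before combining them."""
    major, minor = cuda_version.split(".")[:2]
    return int(major) * 10 + int(minor)


if __name__ == "__main__":
    version = "12.1"  # assumed form of the CUDA version string for a cu121 build
    print(concatenated_version(version))
    print(parsed_version(version))
```

Parsing the components as integers keeps the result numeric, so it can be compared against a numeric threshold without surprises.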

07 Jun, 2024 11 commits
- fix DbrxFusedNormAttention missing cache_config (#5340) · 767c727a
  Calvinn Ng authored Jun 08, 2024
```
Co-authored-by: team <calvinn.ng@ahrefs.com>
```
- [Frontend] Add OpenAI Vision API Support (#5237) · 7a9cb294
  Roger Wang authored Jun 07, 2024
```
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
```
- [Kernel] Dynamic Per-Token Activation Quantization (#5037) · ca3ea51b
  Dipika Sikka authored Jun 07, 2024
```
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
```
- Addition of lacked ignored_seq_groups in _schedule_chunked_prefill (#5296) · dc49fb89
  limingshu authored Jun 07, 2024
  
- Remove Ray health check (#4693) · 18a277b5
  Antoni Baum authored Jun 07, 2024
  
- [Kernel] Switch fp8 layers to use the CUTLASS kernels (#5183) · 8d75fe48
  Tyler Michael Smith authored Jun 07, 2024
```
Switching from torch._scaled_mm to vLLM's cutlass fp8 kernels when supported as we are seeing 5-15% improvement in e2e performance on neuralmagic/Meta-Llama-3-8B-Instruct-FP8

see https://docs.google.com/spreadsheets/d/1GiAnmzyGHgZ6zL_LDSTm35Bdrt4A8AaFEurDlISYYA4/ for some quick e2e benchmarks and #5144 for comparisons across different GEMM sizes.
```
- [Misc][Utils] allow get_open_port to be called for multiple times (#5333) · 388596c9
  youkaichao authored Jun 06, 2024
  
- [Feature][Frontend]: Add support for `stream_options` in `ChatCompletionRequest` (#5135) · baa15a9e
  Itay Etelis authored Jun 07, 2024
  
- [Misc] Missing error message for custom ops import (#5282) · 15063741
  Jie Fu (傅杰) authored Jun 07, 2024
  
- [Core] Change LoRA embedding sharding to support loading methods (#5038) · ccdc490d
  Antoni Baum authored Jun 06, 2024
  
- [Core] Avoid copying prompt/output tokens if no penalties are used (#5289) · a31cab75
  Antoni Baum authored Jun 06, 2024
  
06 Jun, 2024 4 commits
- [Frontend] enable passing multiple LoRA adapters at once to generate() (#5300) · 828da0d4
  Matthew Goldey authored Jun 06, 2024
  
- [Kernel] Retune Mixtral 8x22b configs for FP8 on H100 (#5294) · abe855d6
  Philipp Moritz authored Jun 06, 2024
  
- Bugfix: fix broken of download models from modelscope (#5233) · 4efff036
  liuyhwangyh authored Jun 07, 2024
```
Co-authored-by: mulin.lyh <mulin.lyh@taobao.com>
```
- [CI/Build] Update vision tests (#5307) · 89c92078
  Cyrus Leung authored Jun 06, 2024
  
05 Jun, 2024 13 commits
- [Frontend][Core] Update Outlines Integration from `FSM` to `Guide` (#4109) · 7b0a0dfb
  Breno Faria authored Jun 06, 2024
```
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Breno Faria <breno.faria@intrafind.com>
```
- [Misc] Skip for logits_scale == 1.0 (#5291) · 6a7c7711
  Woosuk Kwon authored Jun 05, 2024
  
- [Bugfix][Frontend/Core] Don't log exception when AsyncLLMEngine gracefully shuts down. (#5290) · 0f83ddd4
  Alex Wu authored Jun 05, 2024
  
- [Bugfix] Make EngineArgs use named arguments for config construction (#5285) · 065aff6c
  Michael Goin authored Jun 05, 2024
  
- [BugFix] Fix log message about default max model length (#5284) · 3d33e372
  Nick Hill authored Jun 05, 2024
  
- [Speculative Decoding] Add `ProposerWorkerBase` abstract class (#5252) · faf71bcd
  Nick Hill authored Jun 05, 2024
  
- [Kernel] Re-tune Mixtral MoE configurations for FP8 on H100 (#5238) · 51a08e7d
  Philipp Moritz authored Jun 05, 2024
  
- [BugFix] Apply get_cached_tokenizer to the tokenizer setter of LLM (#5207) · eb8fcd26
  DriverSong authored Jun 06, 2024
```
Co-authored-by: qiujiawei9 <qiujiawei9@jd.com>
```
- [Model] Correct Mixtral FP8 checkpoint loading (#5231) · 5563a4de
  Cody Yu authored Jun 05, 2024
  
- [Frontend] OpenAI API server: Add `add_special_tokens` to ChatCompletionRequest (default False) (#5278) · f0a50054
  tomeras91 authored Jun 05, 2024
- [Misc] Fix docstring of get_attn_backend (#5271) · c65146e7
  Woosuk Kwon authored Jun 05, 2024
  
- [Misc] Add CustomOp interface for device portability (#5255) · 41ca62cf
  Woosuk Kwon authored Jun 05, 2024
  
- [Bugfix] Fix prompt_logprobs when SamplingParams.detokenize is set to True (#5226) · 974fc9b8
  zifeitong authored Jun 04, 2024
  
04 Jun, 2024 3 commits
- [Bugfix] Fix torch.compile() error when using MultiprocessingGPUExecutor (#5229) · a58f24e5
  zifeitong authored Jun 03, 2024
  
- [Kernel] Enhance MoE benchmarking & tuning script (#4921) · 3a434b07
  Woosuk Kwon authored Jun 03, 2024
  
- [Bugfix] Support `prompt_logprobs==0` (#5217) · 06b2550c
  Toshiki Kataoka authored Jun 04, 2024
  
03 Jun, 2024 5 commits
- [FRONTEND] OpenAI `tools` support named functions (#5032) · f775a07e
  Breno Faria authored Jun 04, 2024
  
- [Misc]: Implement CPU/GPU swapping in BlockManagerV2 (#3834) · 10c38e3e
  Kaiyang Chen authored Jun 04, 2024
  
- [Kernel] Pass a device pointer into the quantize kernel for the scales (#5159) · cbb2f59c
  Tyler Michael Smith authored Jun 03, 2024
  
- [Core] Remove unnecessary copies in flash attn backend (#5138) · 0ab278ca
  Antoni Baum authored Jun 03, 2024
  
- [Core] Support image processor (#4197) · 7a64d24a
  Cyrus Leung authored Jun 03, 2024
  
02 Jun, 2024 2 commits
- [Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer (#4927) · a66cf40b
  Divakar Verma authored Jun 02, 2024
```
This PR enables the fused topk_softmax kernel used in moe layer for HIP
```
- [Frontend][OpenAI] Support for returning max_model_len on /v1/models response (#4643) · f790ad3c
  Avinash Raj authored Jun 02, 2024
  
01 Jun, 2024 1 commit
- [BugFix] Prevent `LLM.encode` for non-generation Models (#5184) · 044793d8
  Robert Shaw authored Jun 01, 2024
```
Co-authored-by: mgoin <michael@neuralmagic.com>
```