Commits · 18a277b52dd2a64ee4c0111fc8cda126031e5889 · OpenDAS / vllm_cscc · GitLab

07 Jun, 2024 7 commits
- Remove Ray health check (#4693) · 18a277b5
  Antoni Baum authored Jun 07, 2024
  
- [Kernel] Switch fp8 layers to use the CUTLASS kernels (#5183) · 8d75fe48
  Tyler Michael Smith authored Jun 07, 2024
```
Switch from torch._scaled_mm to vLLM's CUTLASS fp8 kernels when supported; we are seeing a 5-15% improvement in end-to-end (e2e) performance on neuralmagic/Meta-Llama-3-8B-Instruct-FP8.

See https://docs.google.com/spreadsheets/d/1GiAnmzyGHgZ6zL_LDSTm35Bdrt4A8AaFEurDlISYYA4/ for some quick e2e benchmarks and #5144 for comparisons across different GEMM sizes.
```
- [Misc][Utils] allow get_open_port to be called multiple times (#5333) · 388596c9
  youkaichao authored Jun 06, 2024
  
- [Feature][Frontend]: Add support for `stream_options` in `ChatCompletionRequest` (#5135) · baa15a9e
  Itay Etelis authored Jun 07, 2024
  
- [Misc] Missing error message for custom ops import (#5282) · 15063741
  Jie Fu (傅杰) authored Jun 07, 2024
  
- [Core] Change LoRA embedding sharding to support loading methods (#5038) · ccdc490d
  Antoni Baum authored Jun 06, 2024
  
- [Core] Avoid copying prompt/output tokens if no penalties are used (#5289) · a31cab75
  Antoni Baum authored Jun 06, 2024
  
06 Jun, 2024 4 commits
- [Frontend] enable passing multiple LoRA adapters at once to generate() (#5300) · 828da0d4
  Matthew Goldey authored Jun 06, 2024
  
- [Kernel] Retune Mixtral 8x22b configs for FP8 on H100 (#5294) · abe855d6
  Philipp Moritz authored Jun 06, 2024
  
- Bugfix: fix broken model downloads from modelscope (#5233) · 4efff036
  liuyhwangyh authored Jun 07, 2024
```
Co-authored-by: mulin.lyh <mulin.lyh@taobao.com>
```
- [CI/Build] Update vision tests (#5307) · 89c92078
  Cyrus Leung authored Jun 06, 2024
  
05 Jun, 2024 19 commits
- [Frontend][Core] Update Outlines Integration from `FSM` to `Guide` (#4109) · 7b0a0dfb
  Breno Faria authored Jun 06, 2024
```
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Breno Faria <breno.faria@intrafind.com>
```
- [CI] Disable flash_attn backend for spec decode (#5286) · 3a6ae1d3
  Simon Mo authored Jun 05, 2024
  
- [Docs] Add Ray Summit CFP (#5295) · 8f1729b8
  Simon Mo authored Jun 05, 2024
  
- [Misc] Skip for logits_scale == 1.0 (#5291) · 6a7c7711
  Woosuk Kwon authored Jun 05, 2024
  
- [Bugfix][Frontend/Core] Don't log exception when AsyncLLMEngine gracefully shuts down. (#5290) · 0f83ddd4
  Alex Wu authored Jun 05, 2024
  
- [Bugfix] Make EngineArgs use named arguments for config construction (#5285) · 065aff6c
  Michael Goin authored Jun 05, 2024
  
- [BugFix] Fix log message about default max model length (#5284) · 3d33e372
  Nick Hill authored Jun 05, 2024
  
- [Speculative Decoding] Add `ProposerWorkerBase` abstract class (#5252) · faf71bcd
  Nick Hill authored Jun 05, 2024
  
- [Docs] Add Sequoia as sponsors (#5287) · f270a395
  Simon Mo authored Jun 05, 2024
  
- [Kernel] Re-tune Mixtral MoE configurations for FP8 on H100 (#5238) · 51a08e7d
  Philipp Moritz authored Jun 05, 2024
  
- [BugFix] Apply get_cached_tokenizer to the tokenizer setter of LLM (#5207) · eb8fcd26
  DriverSong authored Jun 06, 2024
```
Co-authored-by: qiujiawei9 <qiujiawei9@jd.com>
```
- [Model] Correct Mixtral FP8 checkpoint loading (#5231) · 5563a4de
  Cody Yu authored Jun 05, 2024
  
- [Kernel] Add GPU architecture guards to the CUTLASS w8a8 kernels to reduce binary size (#5157) · ccd4f129
  Tyler Michael Smith authored Jun 05, 2024
```
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
- [misc] benchmark_serving.py -- add ITL results and tweak TPOT results (#5263) · 02cc3b51
  Tyler Michael Smith authored Jun 05, 2024
  
- [CI] Add nightly benchmarks (#5260) · d5b1eb08
  Simon Mo authored Jun 05, 2024
  
- [Frontend] OpenAI API server: Add `add_special_tokens` to ChatCompletionRequest (default False) (#5278) · f0a50054
  tomeras91 authored Jun 05, 2024
- [Misc] Fix docstring of get_attn_backend (#5271) · c65146e7
  Woosuk Kwon authored Jun 05, 2024
  
- [Misc] Add CustomOp interface for device portability (#5255) · 41ca62cf
  Woosuk Kwon authored Jun 05, 2024
  
- [Bugfix] Fix prompt_logprobs when SamplingParams.detokenize is set to True (#5226) · 974fc9b8
  zifeitong authored Jun 04, 2024
  
04 Jun, 2024 10 commits
- [Misc] update collect env (#5261) · fee4dcc3
  youkaichao authored Jun 04, 2024
  
- [Misc] Add transformers version to collect_env.py (#5259) · 650a4cc5
  Michael Goin authored Jun 04, 2024
  
- [CI] mark AMD test as softfail to prevent blockage (#5256) · 9ca62d86
  Simon Mo authored Jun 04, 2024
  
- [CI/Build] Reducing CPU CI execution time (#5241) · 45c35f0d
  Li, Jiang authored Jun 05, 2024
  
- [CI/Build] Simplify model loading for `HfRunner` (#5251) · 9ba093b4
  Cyrus Leung authored Jun 05, 2024
  
- [Kernel] Add back batch size 1536 and 3072 to MoE tuning (#5242) · 27208be6
  Woosuk Kwon authored Jun 04, 2024
  
- [Bugfix] Fix a bug caused by pip install setuptools>=49.4.0 for CPU backend (#5249) · 87d5abef
  Jie Fu (傅杰) authored Jun 05, 2024
  
- [CI/Build] Add inputs tests (#5215) · ec784b25
  Cyrus Leung authored Jun 04, 2024
  
- [Bugfix] Fix torch.compile() error when using MultiprocessingGPUExecutor (#5229) · a58f24e5
  zifeitong authored Jun 03, 2024
  
- [Bugfix]: During testing, use pytest monkeypatch for safely overriding the env var that indicates the vLLM backend (#5210) · f42a006b
  afeldman-nm authored Jun 03, 2024