Commits · 8fe838659164b415d7f3044ec6b7e5bc52c6b6a5 · OpenDAS / vllm_cscc · GitLab

14 Mar, 2024 3 commits
- [Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 (#3389) · 8fe83865
  youkaichao authored Mar 14, 2024
- allow user to choose which vllm's metrics to display in grafana (#3393) · a37415c3
  Allen.Dou authored Mar 14, 2024
- [Hotfix] [Debug] test_openai_server.py::test_guided_regex_completion (#3383) · 81653d96
  Simon Mo authored Mar 13, 2024
13 Mar, 2024 10 commits
- [FIX] Simpler fix for async engine running on ray (#3371) · eeab52a4
  Zhuohan Li authored Mar 13, 2024
- Fix lint (#3388) · c33afd89
  Antoni Baum authored Mar 13, 2024
- Add batched RoPE kernel (#3095) · 7e9bd08f
  Terry authored Mar 13, 2024
- Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. (#3350) · ae0ccb40
  Or Sharir authored Mar 13, 2024
- [Minor Fix] Use cupy-cuda11x in CUDA 11.8 build (#3256) · 739c350c
  陈序 authored Mar 14, 2024
- [Minor] Fix bias in if to remove ambiguity (#3259) · ba8dc958
  Hui Liu authored Mar 13, 2024
- add hf_transfer to requirements.txt (#3031) · e221910e
  Ronan McGovern authored Mar 13, 2024
- [Fix] Fix quantization="gptq" when using Marlin (#3319) · b167109b
  Bo-Wen Wang authored Mar 13, 2024
```
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
```
- Add kernel for GeGLU with approximate GELU (#3337) · 602358f8
  Woosuk Kwon authored Mar 12, 2024
- Fixes #1556 double free (#3347) · 49a3c866
  Breno Faria authored Mar 13, 2024
12 Mar, 2024 1 commit
- docs: Add BentoML deployment doc (#3336) · b0925b38
  Sherlock Xu authored Mar 13, 2024
```
Signed-off-by: Sherlock113 <sherlockxu07@gmail.com>
```
11 Mar, 2024 7 commits
- Support Mistral Model Inference with transformers-neuronx (#3153) · 654865e2
  DAIZHENWEI authored Mar 11, 2024
- [ROCm] Fix warp and lane calculation in blockReduceSum (#3321) · c9415c19
  kliuae authored Mar 12, 2024
- Add distributed model executor abstraction (#3191) · 4c922709
  Zhuohan Li authored Mar 11, 2024
- [docs] Add LoRA support information for models (#3299) · 657061fd
  Philipp Moritz authored Mar 11, 2024
- Re-enable the 80 char line width limit (#3305) · 2f8844ba
  Zhuohan Li authored Mar 10, 2024
- [Fix] Fix best_of behavior when n=1 (#3298) · 4b59f00e
  Nick Hill authored Mar 10, 2024
- [BugFix] Fix get tokenizer when using ray (#3301) · 9e8744a5
  Roy authored Mar 11, 2024
10 Mar, 2024 2 commits
- [ROCM] Fix blockReduceSum to use correct warp counts for ROCm and CUDA (#3262) · e4a28e53
  Douglas Lehr authored Mar 10, 2024
- Enhance lora tests with more layer and rank variations (#3243) · 0bba88df
  Terry authored Mar 09, 2024
09 Mar, 2024 2 commits
- [Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling (#3103) · 8437bae6
  Cade Daniel authored Mar 08, 2024
- [FIX] Fix prefix test error on main (#3286) · f48c6791
  Zhuohan Li authored Mar 08, 2024
08 Mar, 2024 7 commits
- Move model filelocks from `/tmp/` to `~/.cache/vllm/locks/` dir (#3241) · c2c5e090
  Michael Goin authored Mar 08, 2024
- [FIX] Make `flash_attn` optional (#3269) · 1cb0cc29
  Woosuk Kwon authored Mar 08, 2024
- [Docs] Fix Unmocked Imports (#3275) · 99c3cfb8
  Roger Wang authored Mar 08, 2024
- [Minor Fix] Fix comments in benchmark_serving (#3252) · 1ece1ae8
  TianYu GUO authored Mar 08, 2024
- Feature add lora support for Qwen2 (#3177) · c59e120c
  whyiug authored Mar 08, 2024
- Connect engine healthcheck to openai server (#3260) · d2339d68
  Nick Hill authored Mar 07, 2024
- Fix auto prefix bug (#3239) · b35cc934
  ElizaWszola authored Mar 08, 2024
07 Mar, 2024 5 commits
- Possible fix for conflict between Automated Prefix Caching (#2762) and multi-LoRA support (#1804) (#3263) · 8cbba462
  jacobthebanana authored Mar 07, 2024
- Measure model memory usage (#3120) · 385da2da
  Michael Goin authored Mar 07, 2024
- Separate attention backends (#3005) · 2daf23ab
  Woosuk Kwon authored Mar 07, 2024
- Update requirements-dev.txt to include package for benchmarking scripts. (#3181) · cbf4c05b
  Chen Wang authored Mar 07, 2024
```
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
```
- Add GPTQ support for Gemma (#3200) · d3c04b6a
  TechxGenus authored Mar 07, 2024
06 Mar, 2024 3 commits
- Add tqdm `dynamic_ncols=True` (#3242) · 4cb3b924
  Chujie Zheng authored Mar 06, 2024
- [Testing] Fix core tests (#3224) · a33ce60c
  Cade Daniel authored Mar 06, 2024
- [Tests] Add block manager and scheduler tests (#3108) · 24aecf42
  SangBin Cho authored Mar 06, 2024