- 25 Feb, 2024 1 commit
  - Harry Mellor authored
- 22 Feb, 2024 3 commits
  - Ronen Schaffer authored
  - Woosuk Kwon authored
  - Massimiliano Pronesti authored
- 21 Feb, 2024 2 commits
  - Nick Hill authored
  - Antoni Baum authored
- 20 Feb, 2024 1 commit
  - Zhuohan Li authored
- 19 Feb, 2024 3 commits
  - Ronen Schaffer authored
  - Isotr0py authored
  - Zhuohan Li authored
- 17 Feb, 2024 1 commit
  - jvmncs authored

    How to serve the LoRAs (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)):

    ```terminal
    $ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
    $ python -m vllm.entrypoints.api_server \
        --model meta-llama/Llama-2-7b-hf \
        --enable-lora \
        --lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
    ```

    The above server will list 3 separate values if the user queries `/models`: one for the base served model, and one each for the specified LoRA modules. In this case `sql-lora` and `sql-lora2` point to the same underlying LoRA, but this need not be the case. LoRA config values take the same values they do in `EngineArgs`. No work has been done here to scope client permissions to specific models.
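For illustration only (not part of the commit): a minimal client-side sketch of the `/models` check described above. The base URL and the OpenAI-style response shape are assumptions; adjust them to match your deployment.

```python
# Minimal sketch of verifying that the served LoRA modules show up in the
# model listing described above. The host/port and the OpenAI-style response
# shape ({"data": [{"id": ...}, ...]}) are assumptions, not from the commit.
import requests

BASE_URL = "http://localhost:8000"  # assumed default host/port

response = requests.get(f"{BASE_URL}/models")
response.raise_for_status()
model_ids = [entry["id"] for entry in response.json().get("data", [])]

# Expect the base model plus each --lora-modules entry to be listed.
print(model_ids)  # e.g. ["meta-llama/Llama-2-7b-hf", "sql-lora", "sql-lora2"]
```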
- 15 Feb, 2024 1 commit
  - Woosuk Kwon authored
- 13 Feb, 2024 1 commit
  - Terry authored

    * add mixtral lora support (see the usage sketch below)
    * formatting
    * fix incorrectly ported logic
    * polish tests
    * minor fixes and refactoring
    * minor fixes
    * formatting
    * rename and remove redundant logic
    * refactoring
    * refactoring
    * minor fix
    * minor refactoring
    * fix code smell
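As a rough usage sketch of what this commit enables (not code from the PR): serving a LoRA adapter on top of a Mixtral base model through vLLM's offline `LLM` API, following the pattern of the multilora inference example. The checkpoint name, adapter path, and prompt are placeholders.

```python
# Rough sketch of using a LoRA adapter with a Mixtral base model via vLLM's
# offline API, mirroring examples/multilora_inference.py. The checkpoint name,
# adapter path, and prompt below are placeholders, not values from the PR.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="mistralai/Mixtral-8x7B-v0.1",  # Mixtral base model (placeholder)
    enable_lora=True,
)

outputs = llm.generate(
    ["Write a SQL query that counts users per country."],
    SamplingParams(temperature=0.0, max_tokens=128),
    # LoRARequest(adapter name, unique integer id, local path to the adapter)
    lora_request=LoRARequest("sql-adapter", 1, "/path/to/lora/adapter"),
)
print(outputs[0].outputs[0].text)
```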
- 06 Feb, 2024 2 commits
  - Lily Liu authored
  - Woosuk Kwon authored
- 05 Feb, 2024 1 commit
  - Hongxia Yang authored
- 01 Feb, 2024 1 commit
  - Kunshang Ji authored
    Co-authored-by: Jiang Li <jiang1.li@intel.com>
    Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
- 31 Jan, 2024 2 commits
  - Philipp Moritz authored
  - Philipp Moritz authored
- 30 Jan, 2024 2 commits
  - Vladimir authored
  - wangding zeng authored
    Co-authored-by: roy <jasonailu87@gmail.com>
- 29 Jan, 2024 1 commit
  - zhaoyang-star authored
    Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
    Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
- 27 Jan, 2024 1 commit
  - Hanzhi Zhou authored
- 25 Jan, 2024 1 commit
  - Simon Mo authored
- 24 Jan, 2024 1 commit
  - Nikola Borisov authored
- 23 Jan, 2024 1 commit
  - Antoni Baum authored
    Co-authored-by: Chen Shen <scv119@gmail.com>
    Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
    Co-authored-by: Avnish Narayan <avnish@anyscale.com>
- 22 Jan, 2024 2 commits
  - Jason Zhu authored
    Add a 1-line docstring to explain why context_attention_fwd is called twice in test_prefix_prefill.py (#2553)
  - Cade Daniel authored
- 19 Jan, 2024 2 commits
  - Zhuohan Li authored
  - Simon Mo authored
- 18 Jan, 2024 1 commit
  - shiyi.c_98 authored
    Co-authored-by: DouHappy <2278958187@qq.com>
    Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
- 17 Jan, 2024 2 commits
  - FlorianJoncour authored
  - Hyunsung Lee authored
- 14 Jan, 2024 1 commit
  - Simon Mo authored
- 12 Jan, 2024 1 commit
  - 陈序 authored

    * Align top_p and top_k with huggingface (see the sketch below)
    * remove _get_prompt_and_output_tokens
    * rename _apply_top_p_top_k
    * compare top_p top_k with hf
    * fix test errors
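For context, a minimal, self-contained sketch of HF-style top-k/top-p (nucleus) filtering on a single logits vector, illustrating the semantics the first bullet aligns with. This is an illustration, not vLLM's `_apply_top_p_top_k` implementation, and the helper name is made up.

```python
# Illustrative top-k / top-p filtering on a 1-D logits tensor, following the
# usual HuggingFace semantics (the first token crossing the top_p threshold is
# kept). Not the vLLM implementation; the function name is made up.
import torch

def top_k_top_p_filter(logits: torch.Tensor, top_k: int = 0, top_p: float = 1.0) -> torch.Tensor:
    logits = logits.clone()
    if top_k > 0:
        # Keep only the k largest logits; mask the rest to -inf.
        kth_value = torch.topk(logits, top_k).values[-1]
        logits[logits < kth_value] = float("-inf")
    if top_p < 1.0:
        # Keep the smallest prefix of the descending-sorted distribution whose
        # cumulative probability exceeds top_p (the crossing token is kept).
        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
        cumulative = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        to_remove = cumulative > top_p
        to_remove[1:] = to_remove[:-1].clone()
        to_remove[0] = False
        logits[sorted_indices[to_remove]] = float("-inf")
    return logits

# Example: filtered logits can then be sampled with torch.multinomial.
probs = torch.softmax(top_k_top_p_filter(torch.randn(32000), top_k=50, top_p=0.9), dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
```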
- 09 Jan, 2024 1 commit
  - Cade Daniel authored
- 04 Jan, 2024 1 commit
  - Woosuk Kwon authored
- 03 Jan, 2024 2 commits
  - Zhuohan Li authored
  - Jee Li authored
- 27 Dec, 2023 1 commit
  - Zhuohan Li authored