- 27 Feb, 2024 2 commits
- 26 Feb, 2024 3 commits
Woosuk Kwon authored
Philipp Moritz authored
Co-authored-by: Cade Daniel <edacih@gmail.com>
Jared Moore authored
- 25 Feb, 2024 1 commit
Harry Mellor authored
- 23 Feb, 2024 1 commit
Woosuk Kwon authored
- 22 Feb, 2024 10 commits
zhaoyang-star authored
Ronen Schaffer authored
Woosuk Kwon authored
44670 authored
Woosuk Kwon authored
Massimiliano Pronesti authored
Woosuk Kwon authored
Roy authored
Mustafa Eyceoz authored
Ronen Schaffer authored
- 21 Feb, 2024 7 commits
Zhuohan Li authored
This version adds support for more models, including Gemma models (#2964) and OLMo models (#2832).
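For context, running one of the newly supported models follows vLLM's usual offline-inference pattern. The sketch below is a minimal illustration and not part of the original release note; the model id `google/gemma-2b` and the prompt are assumptions, and an OLMo checkpoint would be loaded the same way.

```python
# Minimal offline-inference sketch with a newly supported Gemma checkpoint.
# Assumes vLLM is installed and the weights are reachable on the Hugging Face Hub.
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-2b")                      # assumed model id, for illustration only
params = SamplingParams(temperature=0.8, max_tokens=32)

outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```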
Nick Hill authored
Woosuk Kwon authored
Zhuohan Li authored
Woosuk Kwon authored
Xiang Xu authored
Antoni Baum authored
- 20 Feb, 2024 3 commits
Antoni Baum authored
Zhuohan Li authored
James Whedbee authored
- 19 Feb, 2024 4 commits
Ronen Schaffer authored
Simon Mo authored
Isotr0py authored
Zhuohan Li authored
- 18 Feb, 2024 2 commits
Zhuohan Li authored
Mark Mozolewski authored
- 17 Feb, 2024 2 commits
jvmncs authored
How to serve the LoRAs (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)):

```terminal
$ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
$ python -m vllm.entrypoints.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --enable-lora \
    --lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
```

The above server will list 3 separate values if the user queries `/models`: one for the base served model, and one each for the specified LoRA modules. In this case sql-lora and sql-lora2 point to the same underlying LoRA, but this need not be the case. The LoRA config options take the same values they do in EngineArgs. No work has been done here to scope client permissions to specific models.
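For illustration, a client could then discover the served names and target a specific LoRA by passing its registered name as the model. The snippet below is a hedged sketch, not part of the original commit: it assumes the server above is reachable on localhost:8000 and exposes OpenAI-style `/v1/models` and `/v1/completions` routes; adjust the paths if your deployment differs.

```python
# Hypothetical client sketch: list served models, then route a request to a LoRA by name.
import requests

BASE_URL = "http://localhost:8000"  # assumed default host/port

# Expect one entry for the base model plus one per name passed to --lora-modules.
models = requests.get(f"{BASE_URL}/v1/models").json()
print([m["id"] for m in models["data"]])

# Selecting "sql-lora" (or "sql-lora2") routes the request through that LoRA adapter.
resp = requests.post(
    f"{BASE_URL}/v1/completions",
    json={
        "model": "sql-lora",
        "prompt": "Write a SQL query that counts users per country.",
        "max_tokens": 64,
    },
)
print(resp.json())
```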
Nick Hill authored
If the SamplingParams object passed to LLMEngine.add_request() is mutated after the call returns, it could affect the async sampling process for that request. Suggested by @Yard1 in https://github.com/vllm-project/vllm/pull/2514#discussion_r1490106059.
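The defensive-copy idea behind this is straightforward; the sketch below is a simplified stand-in (the class and method names mirror vLLM's, but the bodies are illustrative, not the actual diff): the engine snapshots the parameters at add_request() time, so later mutation by the caller cannot leak into in-flight async sampling.

```python
import copy
from dataclasses import dataclass

@dataclass
class SamplingParams:      # stand-in for vllm.SamplingParams
    temperature: float = 1.0
    max_tokens: int = 16

class LLMEngine:           # illustrative stand-in, not the real engine
    def __init__(self):
        self._requests = {}

    def add_request(self, request_id: str, prompt: str, params: SamplingParams) -> None:
        # Snapshot the caller's params so later mutation cannot affect this request.
        self._requests[request_id] = (prompt, copy.deepcopy(params))

engine = LLMEngine()
params = SamplingParams(temperature=0.7)
engine.add_request("req-1", "Hello", params)
params.temperature = 0.0   # mutated after add_request(); the engine's stored copy is unchanged
```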
- 16 Feb, 2024 2 commits
Woosuk Kwon authored
shiyi.c_98 authored
- 15 Feb, 2024 3 commits
Hongxia Yang authored
Philipp Moritz authored
Woosuk Kwon authored