- 22 Feb, 2024 9 commits
  - Ronen Schaffer authored
  - Woosuk Kwon authored
  - 44670 authored
  - Woosuk Kwon authored
  - Massimiliano Pronesti authored
  - Woosuk Kwon authored
  - Roy authored
  - Mustafa Eyceoz authored
  - Ronen Schaffer authored
- 21 Feb, 2024 7 commits
  - Zhuohan Li authored
    This version focuses on broader model support, adding support for Gemma models (#2964) and OLMo models (#2832).
  - Nick Hill authored
  - Woosuk Kwon authored
  - Zhuohan Li authored
  - Woosuk Kwon authored
  - Xiang Xu authored
  - Antoni Baum authored
- 20 Feb, 2024 3 commits
  - Antoni Baum authored
  - Zhuohan Li authored
  - James Whedbee authored
- 19 Feb, 2024 4 commits
  - Ronen Schaffer authored
  - Simon Mo authored
  - Isotr0py authored
  - Zhuohan Li authored
- 18 Feb, 2024 2 commits
  - Zhuohan Li authored
  - Mark Mozolewski authored
- 17 Feb, 2024 2 commits
  - jvmncs authored
    How to serve the LoRAs (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)):

    ```terminal
    $ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
    $ python -m vllm.entrypoints.api_server \
        --model meta-llama/Llama-2-7b-hf \
        --enable-lora \
        --lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
    ```

    The above server will list 3 separate values if the user queries `/models`: one for the base served model, and one each for the specified LoRA modules. In this case `sql-lora` and `sql-lora2` point to the same underlying LoRA, but this need not be the case. LoRA config options take the same values they do in EngineArgs. No work has been done here to scope client permissions to specific models.
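As an aside, `--lora-modules` takes space-separated `name=path` pairs, and distinct names may resolve to the same path (as `sql-lora` and `sql-lora2` do above). A minimal sketch of how such pairs could map to a module table; this is not vLLM's actual parser, and the function name is an assumption:

```python
def parse_lora_modules(pairs):
    """Parse `name=path` pairs (as passed to --lora-modules) into a dict.

    Several names may point at the same underlying path, which is why
    sql-lora and sql-lora2 above both resolve to $LORA_PATH.
    """
    modules = {}
    for pair in pairs:
        name, _, path = pair.partition("=")
        if not name or not path:
            raise ValueError(f"expected name=path, got {pair!r}")
        modules[name] = path
    return modules

lora_path = "~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/"
table = parse_lora_modules([f"sql-lora={lora_path}", f"sql-lora2={lora_path}"])
assert table["sql-lora"] == table["sql-lora2"]  # two names, one shared path
```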
  - Nick Hill authored
    If the SamplingParams object passed to LLMEngine.add_request() is mutated after it returns, it could affect the async sampling process for that request. Suggested by @Yard1: https://github.com/vllm-project/vllm/pull/2514#discussion_r1490106059
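The hazard described here can be sketched generically. The classes below are stand-ins, not vLLM's real SamplingParams or LLMEngine; the fix shown (a defensive deep copy at the add_request boundary) is the usual pattern for this class of bug:

```python
import copy
from dataclasses import dataclass

@dataclass
class Params:  # stand-in for SamplingParams
    temperature: float = 1.0

class Engine:  # stand-in for LLMEngine
    def __init__(self):
        self.requests = {}

    def add_request(self, request_id, params):
        # Defensive copy: later caller-side mutation of `params` can no
        # longer leak into the in-flight async sampling state.
        self.requests[request_id] = copy.deepcopy(params)

engine = Engine()
params = Params(temperature=0.7)
engine.add_request("req-1", params)
params.temperature = 2.0  # caller mutates after add_request returns
assert engine.requests["req-1"].temperature == 0.7  # request unaffected
```

Without the `deepcopy`, the stored reference and the caller's object would be the same, and the late mutation would silently change the sampling parameters of a request already in flight.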
- 16 Feb, 2024 2 commits
  - Woosuk Kwon authored
  - shiyi.c_98 authored
- 15 Feb, 2024 4 commits
  - Hongxia Yang authored
  - Philipp Moritz authored
  - Woosuk Kwon authored
  - Philipp Moritz authored
    * Fix `AttributeError: MixtralModel object has no attribute org_vocab_size`.
    * Make the LoRA logic for Mistral and Mixtral the same.

    Co-authored-by: Pernekhan Utemuratov <pernekhan@deepinfra.com>
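The class of bug fixed in the 15 Feb commit above can be illustrated generically: shared code reads an attribute that only one of two sibling model classes sets, so the other raises `AttributeError` until the attribute is set in both. The class and attribute names below only mirror the commit message; this is a sketch, not vLLM's code:

```python
class MistralModel:
    def __init__(self, vocab_size):
        self.org_vocab_size = vocab_size  # original (pre-padding) vocab size
        self.vocab_size = vocab_size

class MixtralModelBroken:
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size  # org_vocab_size never set

class MixtralModelFixed:
    def __init__(self, vocab_size):
        self.org_vocab_size = vocab_size  # now matches MistralModel
        self.vocab_size = vocab_size

def shared_lora_logic(model):
    # Shared code path that assumes org_vocab_size exists on every model.
    return model.org_vocab_size

try:
    shared_lora_logic(MixtralModelBroken(32000))
    raised = False
except AttributeError:
    raised = True
assert raised  # the broken class lacks org_vocab_size
assert shared_lora_logic(MixtralModelFixed(32000)) == 32000
```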
- 14 Feb, 2024 6 commits
  - Woosuk Kwon authored
  - Roy authored
  - Nikola Borisov authored
  - Woosuk Kwon authored
  - Philipp Moritz authored
    Co-authored-by: Roy <jasonailu87@gmail.com>
- 13 Feb, 2024 1 commit
  - Terry authored
    * add mixtral lora support
    * formatting
    * fix incorrectly ported logic
    * polish tests
    * minor fixes and refactoring
    * minor fixes
    * formatting
    * rename and remove redundant logic
    * refactoring
    * refactoring
    * minor fix
    * minor refactoring
    * fix code smell