Commits · f366f6339b1dbd9176470489b99dae4fb454f5ce · OpenDAS / vllm_cscc

16 Aug, 2024 1 commit
- [spec decode] [4/N] Move update_flash_attn_metadata to attn backend (#7571) · f366f633
  William Lin authored Aug 16, 2024
```
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
  f366f633
09 Aug, 2024 2 commits
- [Core] Add span metrics for model_forward, scheduler and sampler time (#7089) · 933790c2
  Mahesh Keralapura authored Aug 09, 2024
  
  933790c2
- [Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace (#6971) · 57b7be0e
  William Lin authored Aug 08, 2024
  
  57b7be0e
05 Aug, 2024 2 commits
- [SpecDecode] Support FlashInfer in DraftModelRunner (#6926) · e9630458
  Bongwon Jang authored Aug 06, 2024
  
  e9630458
- [Speculative decoding] Add periodic log with time spent in proposal/scoring/verification (#6963) · 82a1b1a8
  Cade Daniel authored Aug 05, 2024
  
  82a1b1a8
31 Jul, 2024 1 commit
- [Bugfix] Fix broadcasting logic for `multi_modal_kwargs` (#6836) · f230cc2c
  Cyrus Leung authored Jul 31, 2024
  
  f230cc2c
30 Jul, 2024 1 commit
- [BugFix] Fix use of per-request seed with pipeline parallel (#6698) · 5cf9254a
  Nick Hill authored Jul 30, 2024
  
  5cf9254a
24 Jul, 2024 1 commit
- [Bugfix] Miscalculated latency lead to time_to_first_token_seconds inaccurate. (#6686) · 40468b13
  Allen.Dou authored Jul 24, 2024
  
  40468b13
21 Jul, 2024 1 commit
- [Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. (#6485) · 14f91fe6
  sroy745 authored Jul 20, 2024
  
  14f91fe6
20 Jul, 2024 1 commit
- [Bugfix][CI/Build][Hardware][AMD] Fix AMD tests, add HF cache, update CK FA, add partially supported model notes (#6543) · 06d6c5fe
  Matt Wong authored Jul 20, 2024
  
  06d6c5fe
19 Jul, 2024 3 commits
- [Bugfix] [SpecDecode] AsyncMetricsCollector: update time since last collection (#6578) · f0bbfaf9
  Thomas Parnell authored Jul 19, 2024
  
  f0bbfaf9
- [BUGFIX] Raise an error for no draft token case when draft_tp>1 (#6369) · a921e863
  Woo-Yeon Lee authored Jul 19, 2024
  
  a921e863
- [Bugfix] Make spec. decode respect per-request seed. (#6034) · d4201e06
  Thomas Parnell authored Jul 19, 2024
```
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
```
  d4201e06
18 Jul, 2024 1 commit
- [Misc] Minor patch for draft model runner (#6523) · 8a74c68b
  Cody Yu authored Jul 17, 2024
  
  8a74c68b
17 Jul, 2024 2 commits
- [Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step (#6338) · e76466dd
  Alexander Matveev authored Jul 17, 2024
  
  e76466dd
- [Misc][Speculative decoding] Typos and typing fixes (#6467) · a19e8d37
  shangmingc authored Jul 17, 2024
```
Co-authored-by: caishangming.csm <caishangming.csm@alibaba-inc.com>
```
  a19e8d37
10 Jul, 2024 2 commits
- [Speculative Decoding] Enabling bonus token in speculative decoding for KV cache based models (#5765) · ae151d73
  sroy745 authored Jul 10, 2024
  
  ae151d73
- [Speculative Decoding] Medusa Implementation with Top-1 proposer (#4978) · 2416b26e
  Abhinav Goyal authored Jul 10, 2024
  
  2416b26e
09 Jul, 2024 1 commit
- [CORE] Adding support for insertion of soft-tuned prompts (#4645) · 4d6ada94
  Swapnil Parekh authored Jul 09, 2024

  Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
  Co-authored-by: Joe G <joseph.granados@h2o.ai>
  Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>

  4d6ada94
03 Jul, 2024 1 commit
- [vlm] Remove vision language config. (#6089) · d9e98f42
  xwjiang2010 authored Jul 03, 2024

  Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
  Co-authored-by: Roger Wang <ywang@roblox.com>

  d9e98f42
02 Jul, 2024 3 commits
- [Model] Jamba support (#4115) · 9d6a8daa
  Mor Zusman authored Jul 03, 2024

  Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
  Co-authored-by: Erez Schwartz <erezs@ai21.com>
  Co-authored-by: Mor Zusman <morz@ai21.com>
  Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
  Co-authored-by: Tomer Asida <tomera@ai21.com>
  Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
  Co-authored-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>

  9d6a8daa
- [Core] Pipeline Parallel Support (#4412) · c5832d2a
  Murali Andoorveedu authored Jul 02, 2024
```
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
```
  c5832d2a
- [Speculative Decoding] MLPSpeculator Tensor Parallel support (1/2) (#6050) · 15aba081
  Sirej Dua authored Jul 02, 2024
```
Co-authored-by: Sirej Dua <sirej.dua@databricks.com>
Co-authored-by: Sirej Dua <Sirej Dua>
```
  15aba081
01 Jul, 2024 1 commit
- [Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker (#5348) · 80ca1e6a
  sroy745 authored Jul 01, 2024
  
  80ca1e6a
28 Jun, 2024 1 commit
- [Spec Decode] Introduce DraftModelRunner (#5799) · b2c62023
  Cody Yu authored Jun 28, 2024
  
  b2c62023
26 Jun, 2024 1 commit
- [Core] Refactor Worker and ModelRunner to consolidate control plane communication (#5408) · dda48115
  Stephanie Wang authored Jun 25, 2024

  Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
  Signed-off-by: Stephanie <swang@anyscale.com>
  Co-authored-by: Stephanie <swang@anyscale.com>

  dda48115
25 Jun, 2024 1 commit
- [Speculative Decoding] Support draft model on different tensor-parallel size than target model (#5414) · 2ce5d668
  Woo-Yeon Lee authored Jun 25, 2024
  
  2ce5d668
21 Jun, 2024 1 commit
- [Model] MLPSpeculator speculative decoding support (#4947) · b12518d3
  Joshua Rosenkranz authored Jun 20, 2024

  Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
  Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
  Co-authored-by: Nick Hill <nickhill@us.ibm.com>
  Co-authored-by: Davis Wertheimer <Davis.Wertheimer@ibm.com>

  b12518d3
15 Jun, 2024 1 commit
- [mypy] Enable type checking for test directory (#5017) · 0e9164b4
  Cyrus Leung authored Jun 15, 2024
  
  0e9164b4
11 Jun, 2024 1 commit
- [Misc] Various simplifications and typing fixes (#5368) · a0086298
  Nick Hill authored Jun 10, 2024
  
  a0086298
05 Jun, 2024 1 commit
- [Speculative Decoding] Add `ProposerWorkerBase` abstract class (#5252) · faf71bcd
  Nick Hill authored Jun 05, 2024
  
  faf71bcd
25 May, 2024 1 commit
- [Dynamic Spec Decoding] Minor fix for disabling speculative decoding (#5000) · d5a16977
  Lily Liu authored May 25, 2024
  
  d5a16977
22 May, 2024 1 commit
- [Core] Eliminate parallel worker per-step task scheduling overhead (#4894) · eb6d3c26
  Nick Hill authored May 22, 2024
  
  eb6d3c26
16 May, 2024 1 commit
- [Speculative decoding][Re-take] Enable TP>1 speculative decoding (#4840) · 973617ae
  Cody Yu authored May 16, 2024
```
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cade Daniel <cade@anyscale.com>
```
  973617ae
15 May, 2024 1 commit
- [Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681) · 65bf2ac1
  SangBin Cho authored May 15, 2024

  This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attn metadata for prefill/decode into a single class and allows slicing it when running the attn backend.

  It also refactors subquery_start_loc, which was not refactored in the previous PR.

  65bf2ac1
13 May, 2024 1 commit
- [Speculative decoding] Improve n-gram efficiency (#4724) · ce532ff4
  Cody Yu authored May 13, 2024
  
  ce532ff4
11 May, 2024 1 commit
- [Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734) · e254497b
  Chang Su authored May 11, 2024
  
  e254497b
10 May, 2024 1 commit
- [Core] Fix circular reference which leaked llm instance in local dev env (#4737) · 6a0f6172
  SangBin Cho authored May 10, 2024

  Storing an exception frame is extremely prone to circular references because the frame holds references to objects.

  When tensorizer is not installed, the llm instance leaks because the error frame holds references to various modules, which creates a circular reference.

  I also found that spec decoding has a circular reference issue, and I solved it using weakref.proxy.

  6a0f6172
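
The `weakref.proxy` approach mentioned in 6a0f6172 can be illustrated with a minimal sketch. `Engine` and `Worker` below are hypothetical stand-ins, not vLLM classes, and the immediate release shown relies on CPython reference counting. The child holds its owner through a weak proxy, so no owner/child cycle forms and the owner is freed as soon as its last strong reference is dropped.

```python
import gc
import weakref


class Worker:
    """Hypothetical child object that needs to call back into its owner."""

    def __init__(self, owner):
        # weakref.proxy forwards attribute access to `owner` without
        # holding a strong reference, so no owner <-> worker cycle forms.
        self.owner = weakref.proxy(owner)

    def owner_name(self):
        return self.owner.name


class Engine:
    """Hypothetical owner that holds a strong reference to its worker."""

    def __init__(self, name):
        self.name = name
        self.worker = Worker(self)


gc.disable()  # rule out the cycle collector; only refcounting can free objects
engine = Engine("demo")
name_seen = engine.worker.owner_name()  # the proxy behaves like the owner
probe = weakref.ref(engine)             # observes whether the engine is alive
del engine                              # drop the last strong reference
freed_without_gc = probe() is None      # freed without a cycle-collector pass
gc.enable()
```

Had `Worker` stored `owner` directly, the two objects would reference each other and would only be reclaimed by a later cycle-collector pass.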

08 May, 2024 2 commits
- [Speculative decoding] [Bugfix] Fix overallocation in ngram + spec logprobs (#4672) · 8b9241be
  Cade Daniel authored May 08, 2024
  
  8b9241be
- [Dynamic Spec Decoding] Auto-disable by the running queue size (#4592) · f942efb5
  Cody Yu authored May 08, 2024
```
Co-authored-by: Cade Daniel <edacih@gmail.com>
```
  f942efb5