Commits · bbf55c4805efba5f1d7094f5e2888b3ef26c0fd7 · OpenDAS / vllm_cscc · GitLab

17 Aug, 2024 2 commits
- [VLM] Refactor `MultiModalConfig` initialization and profiling (#7530) · bbf55c48
  Roger Wang authored Aug 17, 2024
- [Model] Pipeline parallel support for JAIS (#7603) · e73f76ee
  Besher Alkurdi authored Aug 17, 2024
16 Aug, 2024 3 commits
- [Kernel] W8A16 Int8 inside FusedMoE (#7415) · 7fc23be8
  Mor Zusman authored Aug 16, 2024
- [Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm (#7210) · e837b624
  Charlie Fu authored Aug 16, 2024
- [Misc] Add quantization config support for speculative model. (#7343) · b67ae00c
  shangmingc authored Aug 16, 2024
14 Aug, 2024 2 commits
- [core] [3/N] multi-step args and sequence.py (#7452) · 2ecf7b17
  William Lin authored Aug 14, 2024
- [VLM][Core] Support profiling with multiple multi-modal inputs per prompt (#7126) · 3f674a49
  Cyrus Leung authored Aug 15, 2024
13 Aug, 2024 1 commit
- [hardware] unify usage of is_tpu to current_platform.is_tpu() (#7102) · 4d2dc507
  youkaichao authored Aug 13, 2024
12 Aug, 2024 2 commits
- [Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel (#7208) · a046f863
  jon-chuang authored Aug 12, 2024
```
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
```
- [Core] Consolidate `GB` constant and enable float GB arguments (#7416) · 4ddc4743
  Cyrus Leung authored Aug 13, 2024
09 Aug, 2024 3 commits
- [Core] Add span metrics for model_forward, scheduler and sampler time (#7089) · 933790c2
  Mahesh Keralapura authored Aug 09, 2024
- [Core] Support serving encoder/decoder models (#7258) · 7eb4a51c
  Cyrus Leung authored Aug 09, 2024
- [TPU] Add Load-time W8A16 quantization for TPU Backend (#7005) · 0fa14907
  Siyuan Liu authored Aug 08, 2024
08 Aug, 2024 2 commits
- [Misc] Temporarily resolve the error of BitAndBytes (#7308) · a049b107
  Jee Jee Li authored Aug 09, 2024
- [Frontend] remove max_num_batched_tokens limit for lora (#7288) · 48abee9e
  Cherilyn Buren authored Aug 08, 2024
06 Aug, 2024 2 commits
- [Core] Subclass ModelRunner to support cross-attention & encoder sequences... · fd95e026
  afeldman-nm authored Aug 06, 2024
  [Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942)
  Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
  Co-authored-by: Nick Hill <nickhill@us.ibm.com>
- [LoRA] Relax LoRA condition (#7146) · 9118217f
  Jee Jee Li authored Aug 06, 2024
05 Aug, 2024 2 commits
- [Core] Support loading GGUF model (#5191) · 360bd67c
  Isotr0py authored Aug 06, 2024
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
- [Speculative decoding] Add periodic log with time spent in proposal/scoring/verification (#6963) · 82a1b1a8
  Cade Daniel authored Aug 05, 2024
04 Aug, 2024 2 commits
- Clean up remaining Punica C information (#7027) · f80ab352
  Jee Jee Li authored Aug 05, 2024
- [Bugfix] [SpecDecode] Default speculative_draft_tensor_parallel_size to 1 when... · b1c9aa3d
  Thomas Parnell authored Aug 04, 2024
```
[Bugfix] [SpecDecode] Default speculative_draft_tensor_parallel_size to 1 when using MLPSpeculator (#7105)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
```
03 Aug, 2024 1 commit
- [Frontend] Warn if user `max_model_len` is greater than derived `max_model_len` (#7080) · 825b0448
  Jeff Fialho authored Aug 03, 2024
```
Signed-off-by: Jefferson Fialho <jfialho@ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
```
01 Aug, 2024 2 commits
- [Models] Support Qwen model with PP (#6974) · fc912e08
  Murali Andoorveedu authored Aug 01, 2024
```
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
```
- [Model] Pipeline parallel support for Qwen2 (#6924) · 1d2e7fb7
  xuyi authored Aug 01, 2024
31 Jul, 2024 2 commits
- [Misc] Add compressed-tensors to optimized quant list (#7006) · a0dce938
  Michael Goin authored Jul 31, 2024
- [mypy] Enable following imports for some directories (#6681) · da1f7cc1
  Cyrus Leung authored Jul 31, 2024
27 Jul, 2024 2 commits
- Add Nemotron to PP_SUPPORTED_MODELS (#6863) · b1366a95
  Michael Goin authored Jul 27, 2024
- enforce eager mode with bnb quantization temporarily (#6846) · bb549467
  chenqianfzh authored Jul 26, 2024
23 Jul, 2024 5 commits
- [bitsandbytes]: support read bnb pre-quantized model (#5753) · 87525fab
  dongmao zhang authored Jul 23, 2024
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```
- [Model] Pipeline Parallel Support for DeepSeek v2 (#6519) · 507ef787
  Travis Johnson authored Jul 23, 2024
```
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
```
- [BugFix] Fix RoPE error in Llama 3.1 (#6693) · a112a84a
  Woosuk Kwon authored Jul 23, 2024
- [Bugfix] Fix a log error in chunked prefill (#6694) · 461089a2
  Woosuk Kwon authored Jul 23, 2024
- support ignore patterns in model loader (#6673) · 3eda4ec7
  Simon Mo authored Jul 22, 2024
21 Jul, 2024 2 commits
- [Kernel][Core] Add AWQ support to the Marlin kernel (#6612) · 396d92d5
  Alexander Matveev authored Jul 21, 2024
- [Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both... · 14f91fe6
  sroy745 authored Jul 20, 2024
```
[Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. (#6485)
```
20 Jul, 2024 2 commits
- [ Misc ] `fbgemm` checkpoints (#6559) · 683e3cb9
  Robert Shaw authored Jul 20, 2024
- [Core] Allow specifying custom Executor (#6557) · 7bd82002
  Antoni Baum authored Jul 19, 2024
19 Jul, 2024 2 commits
- [Core] Multiprocessing Pipeline Parallel support (#6130) · b5672a11
  Nick Hill authored Jul 18, 2024
```
Co-authored-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai>
```
- Add support for a rope extension method (#6553) · c5df56f8
  Simon Mo authored Jul 18, 2024
18 Jul, 2024 1 commit
- [core][model] yet another cpu offload implementation (#6496) · 1c27d25f
  youkaichao authored Jul 17, 2024
```
Co-authored-by: Michael Goin <michael@neuralmagic.com>
```