Commits · 73c8d677e57e42374bcfb2271b8f1cf7f2c0a486 · OpenDAS / vllm_cscc

29 Apr, 2024 1 commit
- [Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin (#3922) · 73c8d677
  Robert Shaw authored Apr 29, 2024
```
Co-authored-by: alexm <alexm@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
```
  73c8d677
27 Apr, 2024 6 commits
- [Core] Support offline use of local cache for models (#4374) · d6e520e1
  Prashant Gupta authored Apr 27, 2024
```
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Travis Johnson <tjohnson31415@gmail.com>
```
  d6e520e1
- [BugFix] Fix `min_tokens` when `eos_token_id` is None (#4389) · 81661da7
  Nick Hill authored Apr 27, 2024
```
Co-authored-by: DefTruth <31974251+deftruth@users.noreply.github.com>
```
  81661da7
- [Bugfix] Abort requests when the connection to /v1/completions is interrupted (#4363) · dfea1731
  Ruoyu Qin authored Apr 28, 2024
  
  dfea1731
- [Bugfix][Core] Fix get decoding config from ray (#4335) · 7134303c
  Roy authored Apr 27, 2024
  
  7134303c
- [Kernel] Full Tensor Parallelism for LoRA Layers (#3524) · eefeb164
  Austin Veselka authored Apr 27, 2024
```
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
```
  eefeb164
- [Frontend][Bugfix] Disallow extra fields in OpenAI API (#4355) · 8947bc3c
  Cyrus Leung authored Apr 27, 2024
  
  8947bc3c
26 Apr, 2024 3 commits
- [Misc][Refactor] Generalize linear_method to be quant_method (#4373) · a62aaf1d
  Cody Yu authored Apr 26, 2024
  
  a62aaf1d
- [Core] Refactoring sampler and support prompt logprob for chunked prefill (#4309) · 603ad848
  SangBin Cho authored Apr 26, 2024
  
  603ad848
- [Bugfix] Fix parameter name in `get_tokenizer` (#4107) · a74dee9b
  Cyrus Leung authored Apr 26, 2024
  
  a74dee9b
24 Apr, 2024 2 commits
- [Misc] Reduce supported Punica dtypes (#4304) · 468d761b
  Woosuk Kwon authored Apr 23, 2024
  
  468d761b
- [Core][Distributed] use cpu/gloo to initialize pynccl (#4248) · 91f50a6f
  youkaichao authored Apr 23, 2024
  
  91f50a6f
23 Apr, 2024 4 commits
- [Bugfix][Frontend] Raise exception when file-like chat template fails to be opened (#4292) · 1e8f4252
  Cyrus Leung authored Apr 24, 2024
  
  1e8f4252
- AQLM CUDA support (#3287) · 2b7949c1
  James Fleming authored Apr 23, 2024
```
Co-authored-by: mgoin <michael@neuralmagic.com>
```
  2b7949c1
- [Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. (#3951) · 62b8aebc
  Cade Daniel authored Apr 23, 2024
  
  62b8aebc
- [Core] Scheduling optimization 2 (#4280) · 050f285f
  SangBin Cho authored Apr 23, 2024
  
  050f285f
22 Apr, 2024 1 commit
- [Core] Scheduler perf fix (#4270) · ad8d696a
  SangBin Cho authored Apr 23, 2024
  
  ad8d696a
21 Apr, 2024 1 commit
- Make initialization of tokenizer and detokenizer optional (#3748) · a37d815b
  GeauxEric authored Apr 21, 2024
```
Co-authored-by: Yun Ding <yunding@nvidia.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
```
  a37d815b
20 Apr, 2024 3 commits

- [Frontend] multiple sampling params support (#3570) · 91528575
  nunjunj authored Apr 20, 2024
  
  91528575
- [Kernel][FP8] Initial support with dynamic per-tensor scaling (#4118) · a22cdea3
  Cody Yu authored Apr 19, 2024

Provides initial support for FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726

Enable this feature with `--quantization fp8` or `-q fp8` when launching an engine.

Algorithm:
The model checkpoint is still loaded in FP16/BF16. After the weights are loaded, `Fp8LinearMethod` calculates the per-tensor scaling factor of the weights, quantizes the weights accordingly, and stores the scaling factor for future use. The per-tensor scaling factor for activations, by contrast, is calculated in every forward pass.

Initial results:
Tested so far with Mistral-7B on 1xH100, with prompt length ~5 and decoding length 128:

- BF16: 1.47s
- FP8: 1.66s

I'll try larger models and look for more performance bottlenecks. Meanwhile, you're welcome to try this code.
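
The algorithm described above can be illustrated with a small pure-Python simulation. This is a minimal sketch, not the code from the PR: the helper names (`round_to_e4m3`, `per_tensor_scale`, `fp8_linear`) are hypothetical, the choice of the E4M3 format with a maximum finite value of 448 is an assumption, and so is the convention `scale = amax / 448`. A real implementation would use GPU FP8 dtypes and kernels rather than Python lists.

```python
import math

# Assumption: the E4M3 "fn" FP8 format, whose largest finite value is 448.
FP8_E4M3_MAX = 448.0


def round_to_e4m3(x: float) -> float:
    """Simulate rounding a float to the nearest E4M3 value (saturating)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), FP8_E4M3_MAX)      # saturate instead of overflowing
    _, ex = math.frexp(a)              # a = m * 2**ex with 0.5 <= m < 1
    e = max(ex - 1, -6)                # exponents below -6 are subnormal
    step = 2.0 ** (e - 3)              # 3 mantissa bits per binade
    return sign * round(a / step) * step


def per_tensor_scale(values):
    """One scale for the whole tensor, mapping its max magnitude to 448."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def quantize(values, scale):
    return [round_to_e4m3(v / scale) for v in values]


def fp8_linear(x, weight_q, w_scale):
    """Compute y = W @ x with a static weight scale and a dynamic activation scale."""
    x_scale = per_tensor_scale(x)      # recomputed on every forward pass
    x_q = quantize(x, x_scale)
    # Accumulate in higher precision, then undo both scales.
    return [sum(a * b for a, b in zip(x_q, row)) * x_scale * w_scale
            for row in weight_q]


# Done once after loading: quantize the weights and keep the scale.
weight = [[0.5, -1.0], [2.0, 0.25]]
w_scale = per_tensor_scale([v for row in weight for v in row])
weight_q = [quantize(row, w_scale) for row in weight]

# Done per request: the activation scale comes from the input itself.
y = fp8_linear([1.0, 3.0], weight_q, w_scale)
```

In this sketch the weight scale is computed once and reused, while the activation scale is derived from each input, which mirrors the static-weight and dynamic-activation split in the description. The result `y` only approximates the full-precision product because each value keeps just three mantissa bits.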

  a22cdea3
- [Bugfix] Add fix for JSON whitespace (#4189) · 138485a8
  Ayush Rautwar authored Apr 19, 2024
```
Co-authored-by: Ubuntu <ubuntu@ip-172-31-13-147.ec2.internal>
```
  138485a8
19 Apr, 2024 1 commit
- [Bugfix] Fix LoRA loading check (#4138) · d17c8477
  Jee Li authored Apr 19, 2024
```
Co-authored-by: simon-mo <simon.mo@hey.com>
```
  d17c8477
18 Apr, 2024 5 commits
- [Core] add an option to log every function call to for debugging hang/crash in... · 8a7a3e44
  youkaichao authored Apr 18, 2024
```
[Core] add an option to log every function call to for debugging hang/crash in distributed inference (#4079)
Co-authored-by: Simon Mo <simon.mo@hey.com>
```
  8a7a3e44
- [Bugfix] Support logprobs when using guided_json and other constrained decoding fields (#4149) · e1bb2fd5
  James Whedbee authored Apr 18, 2024
  
  e1bb2fd5
- [Bugfix][Kernel] allow non-power-of-two head sizes in prefix prefill (#4128) · e8cc7967
  Michał Moskal authored Apr 18, 2024
  
  e8cc7967
- [Bugfix] Get available quantization methods from quantization registry (#4098) · 53b018ed
  Michael Goin authored Apr 18, 2024
  
  53b018ed
- [Core] nccl integrity check and test (#4155) · 6dc1fc9c
  youkaichao authored Apr 17, 2024
```
[Core] Add integrity check during initialization; add test for it (#4155)
```
  6dc1fc9c
17 Apr, 2024 2 commits
- [Kernel] Add punica dimension for Swallow-MS-7B LoRA (#4134) · a5322254
  Shoichi Uchinami authored Apr 18, 2024
  
  a5322254
- [Core] RayWorkerVllm --> WorkerWrapper to reduce duplication (#4024) · 8438e056
  youkaichao authored Apr 17, 2024
```
[Core] replace narrow-usage RayWorkerVllm to general WorkerWrapper to reduce code duplication (#4024)
```
  8438e056
16 Apr, 2024 4 commits
- [Speculative decoding 6/9] Integrate speculative decoding with LLMEngine (#3894) · e95cd879
  Cade Daniel authored Apr 16, 2024
  
  e95cd879
- [Core] Refactor model loading code (#4097) · 69e1d2fb
  Antoni Baum authored Apr 16, 2024
  
  69e1d2fb
- LM Format Enforcer Guided Decoding Support (#3868) · 05434764
  Noam Gat authored Apr 16, 2024
```
Co-authored-by: Simon Mo <simon.mo@hey.com>
```
  05434764
- [Core] Fix engine-use-ray broken (#4105) · 4e7ee664
  SangBin Cho authored Apr 16, 2024
  
  4e7ee664
14 Apr, 2024 1 commit
- [Frontend] [Core] feat: Add model loading using `tensorizer` (#3476) · 711a0002
  Sanger Steel authored Apr 13, 2024
  
  711a0002
13 Apr, 2024 1 commit
- [Kernel] Add punica dimension for Baichuan-13B (#4053) · 989ae253
  Jee Li authored Apr 13, 2024
  
  989ae253
12 Apr, 2024 2 commits
- [Test] Test multiple attn backend for chunked prefill. (#4023) · 36729bac
  SangBin Cho authored Apr 13, 2024
  
  36729bac
- [Core] Support LoRA on quantized models (#4012) · 1096717a
  Jee Li authored Apr 12, 2024
  
  1096717a
11 Apr, 2024 3 commits
- [BugFix] Fix handling of stop strings and stop token ids (#3672) · e46a60aa
  Nick Hill authored Apr 11, 2024
  
  e46a60aa
- Add extra punica sizes to support bigger vocabs (#4015) · 1e96c334
  Antoni Baum authored Apr 11, 2024
  
  1e96c334
- Fix echo/logprob OpenAI completion bug (#3441) · 95e7d4a9
  Dylan Hawk authored Apr 11, 2024
```
Co-authored-by: Dylan Hawk <dylanwawk@gmail.com>
```
  95e7d4a9