Commits · 62b5166bd4c458c3a8f6eda89d3ef9db14a4c2c8 · OpenDAS / vllm_cscc

23 Apr, 2024 8 commits
- [Core][Logging] Add last frame information for better debugging (#4278) · d86285a4
  youkaichao authored Apr 23, 2024
  
- [Bugfix] Add init_cached_hf_modules to RayWorkerWrapper (#4286) · d87f39e9
  DefTruth authored Apr 24, 2024
  
- [Bugfix] Fixing max token error message for openai compatible server (#4016) · d3c8180a
  Jack Gordley authored Apr 23, 2024
  
- [Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. (#3951) · 62b8aebc
  Cade Daniel authored Apr 23, 2024
  
- [Core] Scheduling optimization 2 (#4280) · 050f285f
  SangBin Cho authored Apr 23, 2024
  
- [Core] Some simplification of WorkerWrapper changes (#4183) · 8f2ea22b
  Nick Hill authored Apr 23, 2024
  
- [Mypy] Part 3 fix typing for nested directories for most of directory (#4161) · 0ae11f78
  SangBin Cho authored Apr 23, 2024
  
- [Core][Distributed] use absolute path for library file (#4271) · c1b4e415
  youkaichao authored Apr 22, 2024
  
22 Apr, 2024 7 commits
- [Core] Scheduler perf fix (#4270) · ad8d696a
  SangBin Cho authored Apr 23, 2024
  
- [Bugfix] Ensure download_weights_from_hf(..) inside loader is using the revision parameter (#4217) · 15436806
  alexm-nm authored Apr 22, 2024
  
- [Frontend] Enable support for CPU backend in AsyncLLMEngine. (#3993) · 077f0a2e
  Tao He authored Apr 22, 2024
```
Signed-off-by: Tao He <sighingnow@gmail.com>
```
- [Bugfix] Fix type annotations in CPU model runner (#4256) · e73ed0f1
  Woosuk Kwon authored Apr 22, 2024
  
- [Misc] Add vision language model support to CPU backend (#3968) · 296cdf8a
  Isotr0py authored Apr 22, 2024
  
- [Core][Distributed] fix _is_full_nvlink detection (#4233) · 747b1a71
  youkaichao authored Apr 21, 2024
  
- [AMD][Hardware][Misc][Bugfix] xformer cleanup and light navi logic and CI... · 95e5b087
  Hongxia Yang authored Apr 22, 2024
```
[AMD][Hardware][Misc][Bugfix] xformer cleanup and light navi logic and CI fixes and refactoring (#4129)
```
21 Apr, 2024 1 commit
- Make initialization of tokenizer and detokenizer optional (#3748) · a37d815b
  GeauxEric authored Apr 21, 2024
```
Co-authored-by: Yun Ding <yunding@nvidia.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
```
20 Apr, 2024 6 commits

- Updating lm-format-enforcer version and adding links to decoding libraries in docs (#4222) · cc74b2b2
  Noam Gat authored Apr 20, 2024
- [Frontend] multiple sampling params support (#3570) · 91528575
  nunjunj authored Apr 20, 2024

- [Kernel][FP8] Initial support with dynamic per-tensor scaling (#4118) · a22cdea3
  Cody Yu authored Apr 19, 2024

Provides initial support for FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726

This feature can be enabled with `--quantization fp8` or `-q fp8` when launching an engine.

Algorithm:
We still load a model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates the per-tensor scaling factor of weights and quantizes the weights accordingly. The scaling factor will then be stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated in every forward pass.
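
The scheme described above can be illustrated with a minimal pure-Python sketch. It is not the code from this PR: `Fp8LinearSketch`, `per_tensor_scale`, `quantize`, and `round_to_e4m3` are hypothetical names, the E4M3 variant of FP8 (largest finite value 448) is an assumption, and FP8 rounding is only simulated on ordinary floats.

```python
import math

# Assumption: the E4M3 flavour of FP8, whose largest finite value is 448.
FP8_E4M3_MAX = 448.0


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value on a simulated E4M3 grid (3 mantissa bits)."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))  # saturate to the FP8 range
    # frexp gives |x| = m * 2**e with m in [0.5, 1), so floor(log2|x|) = e - 1.
    exponent = max(math.frexp(abs(x))[1] - 1, -6)  # -6: smallest normal exponent
    step = 2.0 ** (exponent - 3)  # spacing of representable values in this binade
    return round(x / step) * step


def per_tensor_scale(values) -> float:
    """One scaling factor for the whole tensor: max |v| maps to the FP8 maximum."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0.0 else 1.0


def quantize(values, scale: float):
    """Divide by the scale and snap every element to the simulated FP8 grid."""
    return [round_to_e4m3(v / scale) for v in values]


class Fp8LinearSketch:
    """Hypothetical stand-in for a linear layer with per-tensor FP8 scaling."""

    def __init__(self, weight_rows):
        # Weights arrive in higher precision; they are quantized once after
        # loading, and the weight scale is stored for later forward passes.
        flat = [w for row in weight_rows for w in row]
        self.weight_scale = per_tensor_scale(flat)
        self.qweight = [quantize(row, self.weight_scale) for row in weight_rows]

    def forward(self, x):
        # The activation scale is recomputed on every forward pass (dynamic scaling).
        act_scale = per_tensor_scale(x)
        qx = quantize(x, act_scale)
        # Multiply in the quantized domain, then undo both scales at once.
        out_scale = act_scale * self.weight_scale
        return [sum(a * b for a, b in zip(qx, row)) * out_scale
                for row in self.qweight]


layer = Fp8LinearSketch([[0.5, -1.0], [2.0, 0.25]])
output = layer.forward([1.0, 3.0])
```

Here the exact products would be -2.5 and 2.75; the sketch returns nearby values because the activation 1.0 does not land on the simulated FP8 grid after scaling.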

Initial Results:
Tested so far on Mistral-7B with 1xH100, using a prompt length of ~5 and a decoding length of 128:

- BF16: 1.47s
- FP8: 1.66s

I'll try larger models and look for more performance bottlenecks. Meanwhile, you're welcome to try this code.


- Fix missing docs and out of sync `EngineArgs` (#4219) · 682789d4
  Harry Mellor authored Apr 20, 2024
```
Co-authored-by: Harry Mellor <hmellor@oxts.com>
```
- [Bugfix] Add fix for JSON whitespace (#4189) · 138485a8
  Ayush Rautwar authored Apr 19, 2024
```
Co-authored-by: Ubuntu <ubuntu@ip-172-31-13-147.ec2.internal>
```
- Pass `tokenizer_revision` when getting tokenizer in openai serving (#4214) · bc9df157
  Chirag Jain authored Apr 20, 2024

19 Apr, 2024 5 commits
- [Bugfix][Core] Restore logging of stats in the async engine (#4150) · 7be4f562
  Ronen Schaffer authored Apr 19, 2024
  
- [Misc] fix docstrings (#4191) · 8f20fc04
  Uranus authored Apr 19, 2024
```
Co-authored-by: Zhong Wang <wangzhong@infini-ai.com>
```
- Bump version of 0.4.1 (#4177) · 221d93ec
  Simon Mo authored Apr 19, 2024
  
- [Bugfix] Fix LoRA loading check (#4138) · d17c8477
  Jee Li authored Apr 19, 2024
```
Co-authored-by: simon-mo <simon.mo@hey.com>
```
- Support eos_token_id from generation_config.json (#4182) · a134ef6f
  Simon Mo authored Apr 18, 2024
  
18 Apr, 2024 9 commits
- [Core] add an option to log every function call to for debugging hang/crash in... · 8a7a3e44
  youkaichao authored Apr 18, 2024
```
[Core] add an option to log every function call to for debugging hang/crash in distributed inference (#4079)
Co-authored-by: Simon Mo <simon.mo@hey.com>
```
- [Bugfix] Fix CustomAllreduce nvlink topology detection (#3974) · 8f9c28fd
  Adam Tilghman authored Apr 18, 2024
```
[Bugfix] Fix CustomAllreduce pcie nvlink topology detection (#3974) (#4159)
```
- [CI/CD] add neuron docker and ci test scripts (#3571) · cd2f63fb
  Liangfu Chen authored Apr 18, 2024
  
- [Bugfix] Support logprobs when using guided_json and other constrained decoding fields (#4149) · e1bb2fd5
  James Whedbee authored Apr 18, 2024
  
- [Bugfix][Kernel] allow non-power-of-two head sizes in prefix prefill (#4128) · e8cc7967
  Michał Moskal authored Apr 18, 2024
  
- [Bugfix] Get available quantization methods from quantization registry (#4098) · 53b018ed
  Michael Goin authored Apr 18, 2024
  
- Allow model to be served under multiple names (#2894) · 66ded030
  Harry Mellor authored Apr 18, 2024
```
Co-authored-by: Alexandre Payot <alexandrep@graphcore.ai>
```
- [Core] nccl integrity check and test (#4155) · 6dc1fc9c
  youkaichao authored Apr 17, 2024
```
[Core] Add integrity check during initialization; add test for it (#4155)
```
- [Typing] Mypy typing part 2 (#4043) · 533d2a1f
  SangBin Cho authored Apr 18, 2024
```
Co-authored-by: SangBin Cho <sangcho@sangcho-LT93GQWG9C.local>
```
17 Apr, 2024 2 commits
- [Core] RayWorkerVllm --> WorkerWrapper to reduce duplication (#4024) · 8438e056
  youkaichao authored Apr 17, 2024
```
[Core] replace narrow-usage RayWorkerVllm to general WorkerWrapper to reduce code duplication (#4024)
```
- [Misc] [CI] Fix CI failure caught after merge (#4126) · d150e4f8
  Cade Daniel authored Apr 16, 2024
  
16 Apr, 2024 2 commits
- [Speculative decoding 6/9] Integrate speculative decoding with LLMEngine (#3894) · e95cd879
  Cade Daniel authored Apr 16, 2024
  
- [Core] Refactor model loading code (#4097) · 69e1d2fb
  Antoni Baum authored Apr 16, 2024
  