Commits · 1591c68fdea97a213d5564f687009c4fd1b44608 · OpenDAS / vllm_cscc

25 May, 2024 1 commit
- merge v0.4.2 · 1591c68f
  zhuwenwen authored May 25, 2024
  
  1591c68f
22 May, 2024 2 commits
- fix make error · 09bcf00b
  zhuwenwen authored May 22, 2024
  
  09bcf00b
- skip fp8 · fd6bc480
  zhuwenwen authored May 22, 2024
  
  fd6bc480
21 May, 2024 1 commit
- merge v0.4.1 · 99b471c2
  zhuwenwen authored May 21, 2024
  
  99b471c2
18 May, 2024 1 commit
- Merge branch 'vllm-v0.3.3-dtk24.04' · 1925d2e9
  zhuwenwen authored May 18, 2024
  
  1925d2e9
16 May, 2024 5 commits
- modify version · 5cef4904
  zhuwenwen authored May 16, 2024
  
  5cef4904
- update whl · 71014793
  zhuwenwen authored May 16, 2024
  
  71014793
- update readme · cc4a8e6f
  zhuwenwen authored May 16, 2024
  
  cc4a8e6f
- Merge branch 'vllm-v0.3.3-dtk24.04' · e9c204db
  zhuwenwen authored May 16, 2024
  
  e9c204db
- update benchmark of tht · a6950aea
  zhuwenwen authored May 16, 2024
  
  a6950aea
15 May, 2024 2 commits
- Merge branch 'vllm-v0.3.3-dtk24.04' · fca7ef19
  zhuwenwen authored May 15, 2024
  
  fca7ef19
- ignore linear_method layout · 5c4471ef
  zhuwenwen authored May 15, 2024
  
  5c4471ef
12 May, 2024 5 commits
- Merge branch 'vllm-v0.3.3-dtk24.04' · be0a159d
  zhuwenwen authored May 12, 2024
  
  be0a159d
- fix qkv linear · 47c04371
  zhuwenwen authored May 12, 2024
  
  47c04371
- Merge branch 'vllm-v0.3.3-dtk24.04' · d329f8d6
  zhuwenwen authored May 12, 2024
  
  d329f8d6
- add linear bias · 0d27f0c7
  zhuwenwen authored May 12, 2024
  
  0d27f0c7
- update cmake args · dbb2e382
  zhuwenwen authored May 12, 2024
  
  dbb2e382
09 May, 2024 1 commit
- merge dtk24.04-v0.3.3 · 35393439
  zhuwenwen authored May 09, 2024
  
  35393439
07 May, 2024 1 commit
- add llama_nn support · f26ecef8
  zhuwenwen authored May 07, 2024
  
  f26ecef8
06 May, 2024 1 commit
- update ray==2.9.1 · 96012705
  zhuwenwen authored May 06, 2024
  
  96012705
05 May, 2024 2 commits
- [CI] Reduce wheel size by not shipping debug symbols (#4602) · c7f2cf2b
  Simon Mo authored May 04, 2024
  
  c7f2cf2b
- bump version to v0.4.2 (#4600) · 8d8357c8
  Simon Mo authored May 04, 2024
  
  8d8357c8
04 May, 2024 5 commits
- [Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics (#3937) · 43029870
  DearPlanet authored May 05, 2024
  
  43029870
- [CI] check size of the wheels (#4319) · 021b1a2a
  Simon Mo authored May 04, 2024
  
  021b1a2a
- [Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527) · 2a052011
  Michael Goin authored May 04, 2024
  
  Follow-on to #4332 to enable FP8 checkpoint loading for Mixtral; supersedes #4436.
  
  This PR enables the following checkpoint loading features for Mixtral:
  
  - Supports loading FP8 checkpoints for Mixtral, such as the "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
  - Supports static or dynamic activation quantization with static weight quantization (all per tensor)
  - Supports different scales for each expert weight
  - Supports FP8 in the QKV layer
  
  Notes:
  
  - The Expert Gate/Router always runs at half / full precision for now.
  - If the QKV layer has different weight scales (for separate QKV weights), the weights are re-quantized using layer.weight_scale.max() so we can have a single gemm for performance.
  
  2a052011
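The re-quantization note in commit 2a052011 can be illustrated with a minimal sketch. This is not the code from that PR: the function name `requantize_to_shared_scale` and the sample numbers are hypothetical, and plain Python floats stand in for FP8 values, so no FP8 rounding or clamping is modelled. It only shows the scale arithmetic, assuming the convention `real_weight = q * scale`.

```python
# Illustrative sketch only (hypothetical helper, not vLLM code).
# Plain floats stand in for FP8 values; FP8 rounding is not modelled.

def requantize_to_shared_scale(q_shards, scales):
    """Re-express per-shard quantized weights under one shared (max) scale.

    q_shards: one list of quantized values per shard (e.g. Q, K, V)
    scales:   per-shard scales, with real_weight = q * scale
    """
    shared = max(scales)
    merged = []
    for q, s in zip(q_shards, scales):
        # Dequantize with the shard's own scale, then re-quantize
        # with the shared scale so all shards use the same one.
        merged.append([v * s / shared for v in q])
    return merged, shared


q_shards = [[8.0, -4.0], [2.0, 6.0], [-1.0, 3.0]]
scales = [0.5, 0.25, 1.0]
merged, shared = requantize_to_shared_scale(q_shards, scales)
```

Because every shard's scale is at most the shared maximum, the factor `s / shared` never exceeds 1, so re-quantized magnitudes cannot grow beyond their original quantized range. With real FP8 storage the re-quantized values would be cast back to FP8, which can lose some precision for shards whose scale is much smaller than the maximum.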

- [Doc] Chunked Prefill Documentation (#4580) · 36fb68f9
  SangBin Cho authored May 04, 2024
  
  36fb68f9
- [Misc][Refactor] Introduce ExecuteModelData (#4540) · bc8ad684
  Cody Yu authored May 03, 2024
  
  bc8ad684
03 May, 2024 10 commits
- [Misc] add installation time env vars (#4574) · 344bf7cd
  youkaichao authored May 03, 2024
  
  344bf7cd
- [Speculative decoding] Support target-model logprobs (#4378) · ab502751
  Cade Daniel authored May 03, 2024
  
  ab502751
- [Kernel] Use flashinfer for decoding (#4353) · 43c413ec
  Lily Liu authored May 03, 2024
  
  Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
  
  43c413ec
- Fix/async chat serving (#2727) · f8e7adda
  Sebastian Schoennenbeck authored May 03, 2024
  
  f8e7adda
- [Bugfix] Allow "None" or "" to be passed to CLI for string args that default to None (#4586) · 7e65477e
  Michael Goin authored May 03, 2024
  
  7e65477e
- [Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) · 3521ba4f
  SangBin Cho authored May 04, 2024
  
  3521ba4f
- [Doc] add env vars to the doc (#4572) · 2d7bce9c
  youkaichao authored May 02, 2024
  
  2d7bce9c
- [Misc] remove chunk detected debug logs (#4571) · ce3f1eed
  DefTruth authored May 03, 2024
  
  ce3f1eed
- [BugFix] Prevent the task of `_force_log` from being garbage collected (#4567) · 808632d3
  Yang, Bo authored May 02, 2024
  
  808632d3
- [Core][Distributed] enable allreduce for multiple tp groups (#4566) · 344a5d0c
  youkaichao authored May 02, 2024
  
  344a5d0c
02 May, 2024 3 commits
- [Core] Ignore infeasible swap requests. (#4557) · 0f8a9140
  SangBin Cho authored May 03, 2024
  
  0f8a9140
- [CI/Build] AMD CI pipeline with extended set of tests. (#4267) · 9b5c9f94
  Alexei-V-Ivanov-AMD authored May 02, 2024
  
  Co-authored-by: simon-mo <simon.mo@hey.com>
  
  9b5c9f94
- [kernel] fix sliding window in prefix prefill Triton kernel (#4405) · 32881f3f
  Michał Moskal authored May 02, 2024
  
  Co-authored-by: SangBin Cho <rkooo567@gmail.com>
  
  32881f3f