1. 23 May, 2024 1 commit
  2. 22 May, 2024 1 commit
  3. 19 May, 2024 1 commit
  4. 18 May, 2024 1 commit
  5. 17 May, 2024 1 commit
  6. 16 May, 2024 3 commits
  7. 15 May, 2024 1 commit
    • [Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681) · 65bf2ac1
      SangBin Cho authored
      
      This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attention metadata for prefill and decode into a single class, which can be sliced when running the attention backend.
      
      It also refactors subquery_start_loc, which was not refactored in the previous PR.
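
The idea of one metadata object that is sliced per phase can be sketched as follows. This is a minimal illustration, not the code from the PR: the class `CombinedAttnMetadata`, its fields, the helper `prepare_batch`, and the assumption that prefill tokens precede decode tokens in the batch are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CombinedAttnMetadata:
    """Hypothetical single metadata class covering prefill and decode.

    Assumed layout: all prefill tokens come first in the batch,
    followed by one token per decode request.
    """
    num_prefill_tokens: int
    num_decode_tokens: int
    slot_mapping: List[int]  # one placeholder KV-cache slot per token

    def prefill_slice(self) -> Optional["CombinedAttnMetadata"]:
        # View restricted to the prefill tokens, or None if there are none.
        if self.num_prefill_tokens == 0:
            return None
        return CombinedAttnMetadata(
            num_prefill_tokens=self.num_prefill_tokens,
            num_decode_tokens=0,
            slot_mapping=self.slot_mapping[:self.num_prefill_tokens],
        )

    def decode_slice(self) -> Optional["CombinedAttnMetadata"]:
        # View restricted to the decode tokens, or None if there are none.
        if self.num_decode_tokens == 0:
            return None
        return CombinedAttnMetadata(
            num_prefill_tokens=0,
            num_decode_tokens=self.num_decode_tokens,
            slot_mapping=self.slot_mapping[self.num_prefill_tokens:],
        )


def prepare_batch(prompt_lens: List[int], num_decodes: int) -> CombinedAttnMetadata:
    """Hypothetical single entry point replacing separate prefill/decode prep."""
    num_prefill = sum(prompt_lens)
    total = num_prefill + num_decodes
    # Slot ids are dummy values; a real runner would take them from block tables.
    return CombinedAttnMetadata(num_prefill, num_decodes, list(range(total)))


meta = prepare_batch(prompt_lens=[3, 2], num_decodes=2)
prefill_meta = meta.prefill_slice()
decode_meta = meta.decode_slice()
```

An attention backend written against such a class could run its prefill path on `prefill_meta` and its decode path on `decode_meta`, skipping whichever slice is `None`.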
  8. 13 May, 2024 1 commit
  9. 11 May, 2024 1 commit
  10. 09 May, 2024 2 commits
  11. 08 May, 2024 3 commits
  12. 03 May, 2024 2 commits
  13. 02 May, 2024 1 commit
  14. 01 May, 2024 3 commits
    • [Misc] Fix expert_ids shape in MoE (#4517) · 826b82a2
      Woosuk Kwon authored
    • [Kernel] Update fused_moe tuning script for FP8 (#4457) · 24bb4fe4
      Philipp Moritz authored
      This PR updates the tuning script for the fused_moe kernel to support FP8 and adds configurations for TP4. For small batch sizes, I removed num_warps and num_stages from the configuration: this improved performance and brought the benchmarks in that regime on par with the previous numbers, so that this PR is a strict improvement over the status quo.
      
      All numbers below are for mistralai/Mixtral-8x7B-Instruct-v0.1 with 1000 input and 50 output tokens (ITL = inter-token latency).
      
      Before this PR (with static activation scaling):
      
      qps = 1: 9.8 ms ITL, 0.49s e2e latency
      qps = 2: 9.7 ms ITL, 0.49s e2e latency 
      qps = 4: 10.1 ms ITL, 0.52s e2e latency
      qps = 6: 11.9 ms ITL, 0.59s e2e latency
      qps = 8: 14.0 ms ITL, 0.70s e2e latency
      qps = 10: 15.7 ms ITL, 0.79s e2e latency
      
      After this PR (with static activation scaling):
      
      qps = 1: 9.8 ms ITL, 0.49s e2e latency
      qps = 2: 9.7 ms ITL, 0.49s e2e latency
      qps = 4: 10.2 ms ITL, 0.53s e2e latency
      qps = 6: 11.9 ms ITL, 0.59s e2e latency
      qps = 8: 11.9 ms ITL, 0.59s e2e latency
      qps = 10: 12.1 ms ITL, 0.61s e2e latency
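
To make the comparison concrete, the relative ITL change per load level can be computed directly from the two lists above. The snippet below is only arithmetic on the reported numbers; the helper name `pct_reduction` is made up for illustration.

```python
# (ITL in ms, end-to-end latency in s) per QPS, copied from the lists above.
before = {1: (9.8, 0.49), 2: (9.7, 0.49), 4: (10.1, 0.52),
          6: (11.9, 0.59), 8: (14.0, 0.70), 10: (15.7, 0.79)}
after = {1: (9.8, 0.49), 2: (9.7, 0.49), 4: (10.2, 0.53),
         6: (11.9, 0.59), 8: (11.9, 0.59), 10: (12.1, 0.61)}


def pct_reduction(old: float, new: float) -> float:
    """Relative reduction in percent; negative values mean a regression."""
    return (old - new) / old * 100.0


# Index 0 of each tuple is the ITL.
itl_reduction = {
    qps: round(pct_reduction(before[qps][0], after[qps][0]), 1)
    for qps in before
}

for qps, pct in itl_reduction.items():
    print(f"qps = {qps}: ITL reduced by {pct:.1f}%")
```

The ITL is essentially unchanged up to qps = 6, drops by about 15% at qps = 8, and by about 23% at qps = 10.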
    • Jee Li
      d6f4bd7c
  15. 30 Apr, 2024 3 commits
  16. 29 Apr, 2024 2 commits
  17. 27 Apr, 2024 2 commits
  18. 26 Apr, 2024 3 commits
  19. 25 Apr, 2024 2 commits
  20. 24 Apr, 2024 2 commits
    • [BUG] fixed fp8 conflict with aqlm (#4307) · 79a268c4
      Robert Shaw authored
      Fixes the fp8 interface, which broke in the AQLM merge.
    • [Kernel] FP8 support for MoE kernel / Mixtral (#4244) · eace8bf0
      Philipp Moritz authored
      This PR is the first step towards fixing https://github.com/vllm-project/vllm/pull/3208
      
      It implements dynamic per-tensor scaling (see https://github.com/vllm-project/vllm/pull/4118), so users need neither compute activation scales on a calibration dataset nor convert their model checkpoints. Specifying the `quantization="fp8"` argument is enough. You can try out the PR like this:
      
      ```python
      from vllm import LLM, SamplingParams
      
      prompts = [
          "Hello, my name is",
          "The president of the United States is",
          "The capital of France is",
          "The future of AI is",
      ]
      sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
      
      llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2, quantization="fp8")
      
      outputs = llm.generate(prompts, sampling_params)
      
      # Print the outputs.
      for output in outputs:
          prompt = output.prompt
          generated_text = output.outputs[0].text
          print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
      ```
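
For readers unfamiliar with dynamic per-tensor scaling, the following pure-Python sketch illustrates the general idea: the scale is derived from the tensor's own maximum magnitude at run time, so no calibration dataset is needed. It is not the kernel from this PR. The function names are hypothetical, and the FP8 rounding is a simplified simulation of the E4M3 format (maximum finite value 448, 3 mantissa bits) without NaN handling.

```python
import math
from typing import List, Tuple

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def round_to_e4m3(v: float) -> float:
    """Simplified simulation of rounding a float to the FP8 E4M3 grid."""
    if v == 0.0:
        return 0.0
    sign = -1.0 if v < 0 else 1.0
    mag = min(abs(v), FP8_E4M3_MAX)
    # floor(log2(mag)), clamped to the smallest normal exponent (-6).
    exponent = max(math.frexp(mag)[1] - 1, -6)
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits: 8 steps per power of two
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def quantize_per_tensor(x: List[float]) -> Tuple[List[float], float]:
    """Dynamic per-tensor scaling: one scale, computed from x itself."""
    amax = max(abs(v) for v in x)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    q = [round_to_e4m3(v / scale) for v in x]
    return q, scale


def dequantize(q: List[float], scale: float) -> List[float]:
    """Map simulated FP8 values back to the original range."""
    return [v * scale for v in q]


activations = [0.02, -1.5, 3.0, 0.75]
q, scale = quantize_per_tensor(activations)
restored = dequantize(q, scale)
```

Because the largest magnitude is mapped onto the top of the FP8 range, the full dynamic range of the format is used for every tensor, at the cost of computing the maximum on each call.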
      
      **Performance**: This PR focuses on keeping the code clean while still achieving reasonable performance. A number of optimizations that significantly improve performance (similar to the numbers in https://github.com/vllm-project/vllm/pull/3954) will be submitted in a follow-up PR. With this PR, the results are as follows:
      
      <img width="725" alt="Screenshot 2024-04-21 at 1 31 50 PM" src="https://github.com/vllm-project/vllm/assets/113316/d8fe1118-07a0-4d4e-8530-37a77d465a03">
      
      
      **Accuracy**: The accuracy with this PR on MMLU on `mistralai/Mixtral-8x7B-v0.1` is as follows:
      
      ```
      |      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
      |------------------|-------|------|-----:|------|-----:|---|-----:|
      |mmlu              |N/A    |none  |     0|acc   |0.7018|±  |0.0036|
      | - humanities     |N/A    |none  |     5|acc   |0.6472|±  |0.0065|
      | - other          |N/A    |none  |     5|acc   |0.7673|±  |0.0072|
      | - social_sciences|N/A    |none  |     5|acc   |0.8099|±  |0.0070|
      | - stem           |N/A    |none  |     5|acc   |0.6131|±  |0.0083|
      ```
      This compares favorably with the fp16 results, which are:
      ```
      |      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
      |------------------|-------|------|-----:|------|-----:|---|-----:|
      |mmlu              |N/A    |none  |     0|acc   |0.7020|±  |0.1313|
      | - humanities     |N/A    |none  |     5|acc   |0.6425|±  |0.1349|
      | - other          |N/A    |none  |     5|acc   |0.7744|±  |0.1038|
      | - social_sciences|N/A    |none  |     5|acc   |0.8131|±  |0.0695|
      | - stem           |N/A    |none  |     5|acc   |0.6108|±  |0.1383|
      ```
      
      Happy hacking!
  21. 23 Apr, 2024 3 commits
  22. 20 Apr, 2024 1 commit