1. 09 Jun, 2024 1 commit
  2. 07 Jun, 2024 1 commit
  3. 05 Jun, 2024 1 commit
  4. 03 Jun, 2024 1 commit
  5. 01 Jun, 2024 3 commits
  6. 31 May, 2024 2 commits
  7. 23 May, 2024 2 commits
  8. 22 May, 2024 2 commits
  9. 20 May, 2024 1 commit
  10. 16 May, 2024 3 commits
  11. 10 May, 2024 1 commit
  12. 09 May, 2024 1 commit
  13. 07 May, 2024 1 commit
    • [Kernel] Make static FP8 scaling more robust (#4570) · a98187cf
      Philipp Moritz authored
      Previously, FP8 static scaling only worked if the scales overestimated the maxima of all activation tensors during computation. However, this is not always the case, even if the scales were calibrated very carefully. For example, with the activations in my checkpoint
      
      https://huggingface.co/pcmoritz/Mixtral-8x7B-v0.1-fp8-act-scale
      
      (which was calibrated on https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k), I get the following essentially random accuracy on MMLU:
      
      |      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
      |------------------|-------|------|-----:|------|-----:|---|-----:|
      |mmlu              |N/A    |none  |     0|acc   |0.2295|±  |0.0035|
      | - humanities     |N/A    |none  |     5|acc   |0.2421|±  |0.0062|
      | - other          |N/A    |none  |     5|acc   |0.2398|±  |0.0076|
      | - social_sciences|N/A    |none  |     5|acc   |0.2171|±  |0.0074|
      | - stem           |N/A    |none  |     5|acc   |0.2125|±  |0.0073|

      With the fix in this PR, the scaled activations are clamped to [`-std::numeric_limits<c10::Float8_e4m3fn>::max()`, `std::numeric_limits<c10::Float8_e4m3fn>::max()`] to make sure there are no NaNs. The performance is then:
      
      |      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
      |------------------|-------|------|-----:|------|-----:|---|-----:|
      |mmlu              |N/A    |none  |     0|acc   |0.7008|±  |0.0036|
      | - humanities     |N/A    |none  |     5|acc   |0.6453|±  |0.0065|
      | - other          |N/A    |none  |     5|acc   |0.7692|±  |0.0072|
      | - social_sciences|N/A    |none  |     5|acc   |0.8083|±  |0.0070|
      | - stem           |N/A    |none  |     5|acc   |0.6115|±  |0.0083|

      This is not perfect yet, but it is getting very close to the FP16 / dynamic activation scale performance.
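
The clamping step described above can be illustrated with a minimal pure-Python sketch. It is not the kernel code from this PR: the helper name `scale_and_clamp` is hypothetical, the sketch assumes that quantization divides by the static scale, and it stops before the actual FP8 cast. The constant 448 is the largest finite value of the FP8 E4M3 ("fn") format.

```python
# Largest finite value of FP8 E4M3 ("fn" variant); the format has no infinity.
FP8_E4M3_MAX = 448.0


def scale_and_clamp(activations, scale, fp8_max=FP8_E4M3_MAX):
    """Divide by a static scale, then clamp into the FP8 range.

    Hypothetical helper for illustration only: a real kernel would cast
    the clamped values to FP8 afterwards, which this sketch omits.
    """
    scaled = [x / scale for x in activations]
    # Without this clamp, values beyond +/- fp8_max would be out of range
    # for the subsequent FP8 conversion.
    return [max(-fp8_max, min(fp8_max, v)) for v in scaled]


# A static scale of 0.5 covers |x| <= 224; the last two inputs exceed that.
activations = [1.0, -100.0, 300.0, -1000.0]
clamped = scale_and_clamp(activations, scale=0.5)
print(clamped)
```

In this sketch the in-range inputs pass through unchanged after scaling, while the two inputs that the static scale underestimates saturate at the format's maximum magnitude instead of overflowing.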
  14. 02 May, 2024 1 commit
  15. 29 Apr, 2024 1 commit
  16. 27 Apr, 2024 1 commit
  17. 24 Apr, 2024 2 commits
    • [Bugfix] Fix marlin kernel crash on H100 (#4218) · aae08249
      alexm-nm authored
      This PR addresses the Marlin kernel H100 crash that was reported here: neuralmagic#187.
      The crash was caused by the inline PTX assembly that introduced async_copy with streaming behavior. The fix is to use the more standard PTX for async_copy (without the fractional L2 policy for "evict_first"). There is no performance difference between the standard async_copy PTX and the previous version.
    • [Kernel] FP8 support for MoE kernel / Mixtral (#4244) · eace8bf0
      Philipp Moritz authored
      This PR is the first step towards fixing https://github.com/vllm-project/vllm/pull/3208
      
      It implements dynamic per-tensor scaling (see https://github.com/vllm-project/vllm/pull/4118), so users do not need to compute activation scales on a calibration dataset and they also don't need to convert their model checkpoints. It is enough to specify the `quantization="fp8"` argument. You can try out the PR like this:
      
      ```python
      from vllm import LLM, SamplingParams
      
      prompts = [
          "Hello, my name is",
          "The president of the United States is",
          "The capital of France is",
          "The future of AI is",
      ]
      sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
      
      llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2, quantization="fp8")
      
      outputs = llm.generate(prompts, sampling_params)
      
      # Print the outputs.
      for output in outputs:
          prompt = output.prompt
          generated_text = output.outputs[0].text
          print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
      ```
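
For readers unfamiliar with dynamic per-tensor scaling, the idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the implementation in this PR: the function names are hypothetical, the scale is assumed to be the tensor's maximum absolute value divided by the FP8 E4M3 maximum of 448, and the actual FP8 cast is omitted.

```python
# Largest finite value of FP8 E4M3 ("fn" variant).
FP8_E4M3_MAX = 448.0


def dynamic_per_tensor_scale(tensor, fp8_max=FP8_E4M3_MAX):
    """Compute one scale for the whole tensor from its current values."""
    amax = max(abs(x) for x in tensor)
    # Fall back to 1.0 for an all-zero tensor to avoid dividing by zero.
    return amax / fp8_max if amax > 0 else 1.0


def quantize_dynamic(tensor):
    """Scale the tensor so its largest magnitude maps onto fp8_max.

    Hypothetical helper: returns the scaled values and the scale needed
    to undo the scaling later (value * scale).
    """
    scale = dynamic_per_tensor_scale(tensor)
    return [x / scale for x in tensor], scale


activations = [0.5, -3.0, 10.0, -25.0]
scaled, scale = quantize_dynamic(activations)
restored = [v * scale for v in scaled]
print(scale, scaled)
```

Because the scale is recomputed from each tensor at run time, no calibration dataset is needed, which is the property the description above relies on.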
      
      **Performance**: For this PR, the focus is on keeping the code clean while still getting reasonable performance. There are a number of optimizations that we will submit in a follow-up PR and that significantly improve performance (similar to the numbers in https://github.com/vllm-project/vllm/pull/3954). With this PR, the results are as follows:
      
      <img width="725" alt="Screenshot 2024-04-21 at 1 31 50 PM" src="https://github.com/vllm-project/vllm/assets/113316/d8fe1118-07a0-4d4e-8530-37a77d465a03">
      
      
      **Accuracy**: The accuracy with this PR on MMLU on `mistralai/Mixtral-8x7B-v0.1` is as follows:
      
      ```
      |      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
      |------------------|-------|------|-----:|------|-----:|---|-----:|
      |mmlu              |N/A    |none  |     0|acc   |0.7018|±  |0.0036|
      | - humanities     |N/A    |none  |     5|acc   |0.6472|±  |0.0065|
      | - other          |N/A    |none  |     5|acc   |0.7673|±  |0.0072|
      | - social_sciences|N/A    |none  |     5|acc   |0.8099|±  |0.0070|
      | - stem           |N/A    |none  |     5|acc   |0.6131|±  |0.0083|
      ```
      This compares favorably with the FP16 results, which are:
      ```
      |      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
      |------------------|-------|------|-----:|------|-----:|---|-----:|
      |mmlu              |N/A    |none  |     0|acc   |0.7020|±  |0.1313|
      | - humanities     |N/A    |none  |     5|acc   |0.6425|±  |0.1349|
      | - other          |N/A    |none  |     5|acc   |0.7744|±  |0.1038|
      | - social_sciences|N/A    |none  |     5|acc   |0.8131|±  |0.0695|
      | - stem           |N/A    |none  |     5|acc   |0.6108|±  |0.1383|
      ```
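
To make the comparison concrete, the per-group differences between the two tables above can be computed directly. The snippet below only copies the accuracy values from those tables; the variable names are illustrative.

```python
# Accuracy values copied from the FP8 and FP16 tables above.
fp8 = {
    "mmlu": 0.7018,
    "humanities": 0.6472,
    "other": 0.7673,
    "social_sciences": 0.8099,
    "stem": 0.6131,
}
fp16 = {
    "mmlu": 0.7020,
    "humanities": 0.6425,
    "other": 0.7744,
    "social_sciences": 0.8131,
    "stem": 0.6108,
}

# Positive delta: FP8 scored higher; negative delta: FP16 scored higher.
deltas = {group: round(fp8[group] - fp16[group], 4) for group in fp8}

for group, delta in deltas.items():
    print(f"{group:16s} {delta:+.4f}")
```

Every group differs by less than 0.01 in accuracy, and the sign of the difference varies between groups.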
      
      Happy hacking!
  18. 23 Apr, 2024 1 commit
  19. 11 Apr, 2024 1 commit
  20. 03 Apr, 2024 1 commit
  21. 23 Mar, 2024 1 commit
  22. 01 Mar, 2024 1 commit
  23. 29 Feb, 2024 1 commit
  24. 12 Feb, 2024 1 commit
  25. 01 Feb, 2024 2 commits
  26. 29 Jan, 2024 1 commit
  27. 27 Jan, 2024 1 commit
  28. 03 Jan, 2024 2 commits
  29. 18 Dec, 2023 1 commit
  30. 15 Dec, 2023 1 commit