1. 21 Jul, 2024 1 commit
  2. 20 Jul, 2024 1 commit
  3. 18 Jul, 2024 1 commit
  4. 14 Jul, 2024 1 commit
  5. 03 Jul, 2024 1 commit
  6. 28 Jun, 2024 1 commit
  7. 26 Jun, 2024 1 commit
  8. 23 Jun, 2024 1 commit
  9. 20 Jun, 2024 3 commits
  10. 18 Jun, 2024 1 commit
  11. 14 Jun, 2024 2 commits
  12. 13 Jun, 2024 1 commit
  13. 12 Jun, 2024 1 commit
    • [Kernel] Vectorized FP8 quantize kernel (#5396) · 5985e342
      Cody Yu authored
      Inspired by #5146, this PR improves the FP8 quantize kernel by vectorizing data transfers to better utilize memory bandwidth. Microbenchmarks show that the improved kernel achieves a 1.0x-1.5x speedup (especially when the hidden size is large).
      
      In detail, we applied three optimizations:
      
      - Use an inverted scale so that most divisions become multiplications.
      - Unroll the loop 4 times to improve ILP.
      - Use 4-element vectorized loads and stores to transfer data between HBM and SRAM.
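
The three optimizations can be illustrated with a CPU-side sketch. This is not the kernel from the PR: `Vec4`, `scaled_clamp`, and `quantize_vec4` are hypothetical names, the real FP8 cast is replaced by a clamp to the e4m3fn range (±448) with the result kept as `float`, and the 16-byte `memcpy` only stands in for a vectorized HBM-to-SRAM transfer on the GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Largest finite magnitude representable in FP8 e4m3fn.
constexpr float kFp8E4m3Max = 448.0f;

// Hypothetical 4-wide pack, standing in for a GPU vector type.
struct alignas(16) Vec4 {
    float x, y, z, w;
};

// Stand-in for the FP8 conversion (assumption): multiply by the
// inverted scale and clamp to the e4m3fn range. No rounding to 8 bits.
inline float scaled_clamp(float v, float inv_scale) {
    const float scaled = v * inv_scale;
    return std::max(-kFp8E4m3Max, std::min(scaled, kFp8E4m3Max));
}

void quantize_vec4(const float* in, float* out, std::size_t n, float scale) {
    assert(scale > 0.0f);
    // Optimization 1: divide once, multiply everywhere else.
    const float inv_scale = 1.0f / scale;

    const std::size_t n_vec = n / 4;
    for (std::size_t i = 0; i < n_vec; ++i) {
        // Optimization 3: move 4 elements per load and per store.
        Vec4 v;
        std::memcpy(&v, in + 4 * i, sizeof(Vec4));

        // Optimization 2: four independent conversions per iteration,
        // which gives the hardware more instruction-level parallelism.
        const Vec4 r{scaled_clamp(v.x, inv_scale), scaled_clamp(v.y, inv_scale),
                     scaled_clamp(v.z, inv_scale), scaled_clamp(v.w, inv_scale)};
        std::memcpy(out + 4 * i, &r, sizeof(Vec4));
    }

    // Scalar tail for sizes that are not a multiple of 4.
    for (std::size_t i = n_vec * 4; i < n; ++i) {
        out[i] = scaled_clamp(in[i], inv_scale);
    }
}
```

The scalar tail keeps the sketch correct for any `n`; on a GPU the same split is typically made per thread so that only the remainder falls back to element-wise access.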
  14. 09 Jun, 2024 1 commit
  15. 07 Jun, 2024 1 commit
  16. 05 Jun, 2024 1 commit
  17. 03 Jun, 2024 1 commit
  18. 01 Jun, 2024 3 commits
  19. 31 May, 2024 2 commits
  20. 23 May, 2024 2 commits
  21. 22 May, 2024 2 commits
  22. 20 May, 2024 1 commit
  23. 16 May, 2024 3 commits
  24. 10 May, 2024 1 commit
  25. 09 May, 2024 1 commit
  26. 07 May, 2024 1 commit
    • [Kernel] Make static FP8 scaling more robust (#4570) · a98187cf
      Philipp Moritz authored
      Previously, FP8 static scaling worked only if the scales overestimated the maxima of all activation tensors during computation. However, this will not always be the case, even if the scales were calibrated very carefully. For example, with the activations in my checkpoint
      
      https://huggingface.co/pcmoritz/Mixtral-8x7B-v0.1-fp8-act-scale
      
      (which was calibrated on https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k), I'm getting the following mostly random performance on MMLU:
      
      |      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
      |------------------|-------|------|-----:|------|-----:|---|-----:|
      |mmlu              |N/A    |none  |     0|acc   |0.2295|±  |0.0035|
      | - humanities     |N/A    |none  |     5|acc   |0.2421|±  |0.0062|
      | - other          |N/A    |none  |     5|acc   |0.2398|±  |0.0076|
      | - social_sciences|N/A    |none  |     5|acc   |0.2171|±  |0.0074|
      | - stem           |N/A    |none  |     5|acc   |0.2125|±  |0.0073|
      
      With the fix in this PR, which clamps the scaled activations to [-std::numeric_limits<c10::Float8_e4m3fn>::max(), std::numeric_limits<c10::Float8_e4m3fn>::max()] to make sure there are no NaNs, the performance is:
      
      |      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
      |------------------|-------|------|-----:|------|-----:|---|-----:|
      |mmlu              |N/A    |none  |     0|acc   |0.7008|±  |0.0036|
      | - humanities     |N/A    |none  |     5|acc   |0.6453|±  |0.0065|
      | - other          |N/A    |none  |     5|acc   |0.7692|±  |0.0072|
      | - social_sciences|N/A    |none  |     5|acc   |0.8083|±  |0.0070|
      | - stem           |N/A    |none  |     5|acc   |0.6115|±  |0.0083|
      
      This is not perfect yet, but it is getting very close to the FP16 / dynamic activation scale performance.
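
A minimal sketch of why the clamp matters, under stated assumptions: `toy_fp8_cast`, `quantize_unclamped`, and `quantize_clamped` are hypothetical helpers, and the toy cast only models a non-saturating conversion in which any magnitude above the e4m3fn maximum (448) becomes NaN. Rounding of in-range values to 8 bits is omitted.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Largest finite magnitude representable in FP8 e4m3fn.
constexpr float kFp8E4m3Max = 448.0f;

// Toy model (assumption): e4m3fn has no infinity, so a non-saturating
// cast of an out-of-range value is modeled here as producing NaN.
inline float toy_fp8_cast(float v) {
    if (std::fabs(v) > kFp8E4m3Max) {
        return std::numeric_limits<float>::quiet_NaN();
    }
    return v;
}

// Static scaling without the fix: an activation larger than the
// calibrated maximum overflows the FP8 range.
inline float quantize_unclamped(float x, float scale) {
    return toy_fp8_cast(x / scale);
}

// Static scaling with the clamp: out-of-range values saturate at the
// largest finite FP8 value instead of turning into NaN.
inline float quantize_clamped(float x, float scale) {
    const float scaled = x / scale;
    const float clamped = std::max(-kFp8E4m3Max, std::min(scaled, kFp8E4m3Max));
    return toy_fp8_cast(clamped);
}
```

With a scale of 1.0 (calibrated for a maximum of 448), an activation of 600 produces NaN in the unclamped path and 448 in the clamped path, so a single outlier costs some precision rather than corrupting the whole output.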
  27. 02 May, 2024 1 commit
  28. 29 Apr, 2024 1 commit
  29. 27 Apr, 2024 1 commit
  30. 24 Apr, 2024 1 commit
    • [Bugfix] Fix marlin kernel crash on H100 (#4218) · aae08249
      alexm-nm authored
      This PR addresses the Marlin kernel crash on H100 that was reported in neuralmagic#187.
      The crash was caused by the inline PTX assembly that introduced async_copy with streaming behavior. The fix is to use the more standard PTX for async_copy (without the fractional L2 policy for "evict_first"). There is no performance difference between the standard async_copy PTX and the previous version.