1. 01 Jul, 2024 1 commit
  2. 30 Jun, 2024 1 commit
  3. 28 Jun, 2024 1 commit
  4. 20 Jun, 2024 1 commit
  5. 14 Jun, 2024 1 commit
  6. 13 Jun, 2024 2 commits
  7. 08 Jun, 2024 2 commits
  8. 07 Jun, 2024 1 commit
  9. 05 Jun, 2024 1 commit
  10. 22 May, 2024 1 commit
  11. 09 May, 2024 1 commit
    • [Kernel] [FP8] Improve FP8 linear layer performance (#4691) · 379da6dc
      Philipp Moritz authored
      This PR improves the FP8 performance of linear layers, which had been lacking before (see the comments on #4118).
      
      We noticed that cuBLASLt can find a better algorithm if the first dimension of the matrix is greater than 16, so this PR enlarges matrices accordingly during quantization. This improves FP8 performance and removes the performance regression vs. FP16; in many cases FP8 now exceeds FP16 performance.
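The padding idea can be illustrated with a minimal pure-Python sketch. It is not the code from this PR: `MIN_ROWS`, `pad_rows`, `matmul`, and `padded_matmul` are hypothetical names, and the value 17 is only an assumption derived from the "greater than 16" remark above. The sketch pads the first dimension with zero rows before the matrix multiply and slices the padding off the result afterwards.

```python
# Assumption: 17 is the smallest row count that is "greater than 16".
MIN_ROWS = 17


def pad_rows(matrix, min_rows=MIN_ROWS):
    """Pad a list-of-lists matrix with zero rows up to min_rows.

    Returns the (possibly padded) matrix and the original row count.
    """
    rows = len(matrix)
    if rows >= min_rows:
        return matrix, rows
    cols = len(matrix[0]) if matrix else 0
    padding = [[0.0] * cols for _ in range(min_rows - rows)]
    return matrix + padding, rows


def matmul(a, b):
    """Plain matrix multiply; stands in for the real FP8 GEMM kernel."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def padded_matmul(activations, weights):
    """Multiply with a padded first dimension, then drop the padded rows."""
    padded, rows = pad_rows(activations)
    return matmul(padded, weights)[:rows]
```

Because the extra rows are all zeros and are discarded after the multiply, the visible result is unchanged; only the shape seen by the GEMM backend differs.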
      
      Here are benchmarks on Llama 3 70B (ITL, i.e. inter-token latency, for 1000 input and 50 output tokens at fixed qps and at TP 4). All FP8 measurements use dynamic quantization:
      
      qps = 1: 24 ms (FP8, this PR), 32 ms (FP8, previous main), 26 ms (FP16)
      qps = 2: 26 ms (FP8, this PR), 34 ms (FP8, previous main), 28 ms (FP16)
      qps = 4: 33 ms (FP8, this PR), 44 ms (FP8, previous main), 36 ms (FP16)
      qps = 6: 46 ms (FP8, this PR), 56 ms (FP8, previous main), 54 ms (FP16)
      qps = 8: 85 ms (FP8, this PR), 85 ms (FP8, previous main), 138 ms (FP16)
  12. 30 Apr, 2024 1 commit
  13. 27 Apr, 2024 1 commit
  14. 26 Apr, 2024 1 commit
  15. 24 Apr, 2024 1 commit
  16. 20 Apr, 2024 2 commits
    • [Kernel][FP8] Initial support with dynamic per-tensor scaling (#4118) · a22cdea3
      Cody Yu authored
      This PR provides initial support for FP8 computation. It is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726
      
      This feature can be enabled with --quantization fp8 or -q fp8 when launching an engine.
      
      Algorithm:
      We still load a model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates the per-tensor scaling factor of weights and quantizes the weights accordingly. The scaling factor will then be stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated in every forward pass.
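The per-tensor scaling described above can be sketched in pure Python. This is an illustration under stated assumptions, not the `Fp8LinearMethod` implementation: all function names are hypothetical, the target format is assumed to be FP8 E4M3 (maximum finite value 448), and the rounding helper only mimics the 3-bit mantissa while ignoring subnormals and the minimum exponent.

```python
import math

# Assumption: FP8 E4M3 format, whose largest finite value is 448.
FP8_E4M3_MAX = 448.0


def round_to_e4m3_grid(value):
    """Approximate FP8 rounding by keeping 4 significant bits (1 implicit + 3 mantissa)."""
    if value == 0.0:
        return 0.0
    mantissa, exponent = math.frexp(value)  # value == mantissa * 2**exponent
    return math.ldexp(round(mantissa * 16) / 16, exponent)


def per_tensor_scale(values):
    """One scale for the whole tensor: max |x| mapped onto the FP8 maximum."""
    amax = max((abs(v) for v in values), default=0.0)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def quantize(values, scale):
    """Divide by the scale, clamp to the FP8 range, and round to the FP8 grid."""
    quantized = []
    for v in values:
        scaled = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale))
        quantized.append(round_to_e4m3_grid(scaled))
    return quantized


def dequantize(quantized, scale):
    """Recover approximate original values by multiplying with the stored scale."""
    return [q * scale for q in quantized]
```

In this sketch, the weight scale would be computed once after loading and stored, while `per_tensor_scale` would be called on the activations in every forward pass, matching the static-weight and dynamic-activation split described in the algorithm.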
      
      Initial Results:
      So far this has been tested with Mistral-7B on 1xH100, with a prompt length of ~5 and a decoding length of 128:
      
      BF16: 1.47s
      FP8: 1.66s
      I'll try larger models and look for further performance bottlenecks. Meanwhile, you're welcome to try this code.