    [Enhancement] Enhance FP8/FP4 type handling in CUDA codegen (#323) · 89725f7f
    Lei Wang authored
    
    
    * [Enhancement] Introduce CUDA driver module and refactor CUDA device handling
    
    - Added a new `cuda_driver` module to encapsulate CUDA device properties and functionalities.
    - Updated `CUDA` class in `cuda.py` to utilize the new driver for fetching device name and shared memory capabilities.
    - Introduced `get_device_name` and `get_shared_memory_per_block` functions in the `cuda_driver` for improved device property management.
    - This refactor moves CUDA device-attribute queries out of `cuda.py` and into a single module, which keeps device handling in one place.
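
To illustrate the kind of helper the `cuda_driver` bullets describe, here is a minimal sketch that queries the device name and shared memory per block through the CUDA driver API via `ctypes`. It is not the module's actual code: the choice of the driver API (rather than the runtime API), the `device_id` parameter, the `_load_driver` helper, and the `None` fallback when no driver is present are all assumptions made for illustration.

```python
import ctypes

# Value from the CUDA driver API's CUdevice_attribute enum.
CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK = 8


def _load_driver():
    """Hypothetical helper: return an initialized driver handle, or None."""
    for name in ("libcuda.so.1", "libcuda.so", "nvcuda.dll"):
        try:
            lib = ctypes.CDLL(name)
        except OSError:
            continue  # library not present under this name
        # cuInit must return 0 (CUDA_SUCCESS) before any other driver call.
        if lib.cuInit(0) == 0:
            return lib
    return None


def _get_device(lib, device_id):
    """Resolve a device ordinal to a CUdevice handle, or None on failure."""
    dev = ctypes.c_int()
    if lib.cuDeviceGet(ctypes.byref(dev), device_id) != 0:
        return None
    return dev


def get_device_name(device_id=0):
    """Return the device name as a string, or None if it cannot be queried."""
    lib = _load_driver()
    if lib is None:
        return None
    dev = _get_device(lib, device_id)
    if dev is None:
        return None
    buf = ctypes.create_string_buffer(256)
    if lib.cuDeviceGetName(buf, len(buf), dev) != 0:
        return None
    return buf.value.decode()


def get_shared_memory_per_block(device_id=0):
    """Return max shared memory per block in bytes, or None on failure."""
    lib = _load_driver()
    if lib is None:
        return None
    dev = _get_device(lib, device_id)
    if dev is None:
        return None
    value = ctypes.c_int()
    status = lib.cuDeviceGetAttribute(
        ctypes.byref(value), CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK, dev)
    return value.value if status == 0 else None
```

Returning `None` instead of raising lets a caller such as the `CUDA` class fall back to defaults on machines without a GPU; whether the real module does this is not stated in the commit message.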
    
    * [Refactor] Clean up whitespace in CUDA-related files
    
    - Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
    - No functional changes.
    
    * [Benchmark] Add FP8 Matrix Multiplication Benchmark Script
    
    - Introduced a new benchmark script for FP8 matrix multiplication in `benchmark/matmul_fp8/benchmark_matmul.py`.
    - The script includes functions for reference matrix multiplication, configuration generation for autotuning, and an autotuned kernel for performance measurement.
    - Added command-line argument parsing for matrix dimensions and the option to enable BitBLAS roller for search space exploration.
    - The benchmark computes and prints the best latency and the corresponding performance metrics for the FP8 kernel.
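
The pieces listed above (configuration generation, command-line parsing, and a throughput metric) can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of `benchmark_matmul.py`: the flag names, default sizes, tile-size candidates, and the `tflops` helper are hypothetical, and the kernel launch and autotuning are omitted.

```python
import argparse
import itertools


def get_configs():
    """Build a search space as the Cartesian product of candidate values.

    The keys and candidate values here are hypothetical examples.
    """
    space = {
        "block_M": [64, 128],
        "block_N": [64, 128],
        "block_K": [32, 64],
        "num_stages": [2, 3],
    }
    keys = list(space)
    return [dict(zip(keys, values))
            for values in itertools.product(*space.values())]


def tflops(m, n, k, latency_ms):
    """Throughput of an M x N x K matmul: 2*M*N*K floating-point operations."""
    total_flops = 2 * m * n * k
    return total_flops / (latency_ms * 1e-3) / 1e12


def parse_args(argv=None):
    """Parse matrix dimensions and the roller switch (flag names assumed)."""
    parser = argparse.ArgumentParser(description="FP8 matmul benchmark sketch")
    parser.add_argument("--m", type=int, default=4096, help="rows of A and C")
    parser.add_argument("--n", type=int, default=4096, help="columns of B and C")
    parser.add_argument("--k", type=int, default=4096, help="reduction dimension")
    parser.add_argument("--with_roller", action="store_true",
                        help="use BitBLAS roller to derive the search space")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    configs = get_configs()
    print(f"{len(configs)} candidate configs for "
          f"M={args.m}, N={args.n}, K={args.k}, roller={args.with_roller}")
```

Given a measured best latency in milliseconds, `tflops(args.m, args.n, args.k, best_latency_ms)` converts it into the throughput figure a benchmark of this kind would typically print.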
    
    * lint fix
    
    * Update submodule and enhance FP8 type handling in CUDA codegen
    
    - Updated the TVM submodule to the latest commit.
    - Modified FP8 type handling in `codegen_cuda.cc` to use more descriptive type codes.
    - Improved constant printing for FP8 and bfloat16 types, ensuring correct representation in generated code.
    - Added error handling for missing configuration keys in the AutoTuner class.
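
As a rough illustration of the last bullet, the sketch below validates a tuning configuration before it is turned into kernel arguments, raising an error that names the missing keys instead of failing with a bare `KeyError`. The function name `select_config_args`, its parameters, and the choice of `ValueError` are assumptions; the commit message does not describe how the `AutoTuner` class actually reports the problem.

```python
def select_config_args(config, required_keys):
    """Return config values in the order of required_keys.

    Raises ValueError listing every required key absent from config.
    Hypothetical helper, not part of the AutoTuner API.
    """
    missing = [key for key in required_keys if key not in config]
    if missing:
        raise ValueError(
            f"Configuration {config} is missing required keys: {missing}")
    return [config[key] for key in required_keys]


if __name__ == "__main__":
    config = {"block_M": 128, "block_N": 128}
    try:
        select_config_args(config, ["block_M", "block_N", "block_K"])
    except ValueError as err:
        print(err)
```

Collecting all missing keys before raising means a malformed configuration is reported once, rather than one key at a time across repeated runs.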
    
    * lint fix
    
    * Remove print statement from example script
    
    * lint fix
    
    * fix
    
    ---------
    Co-authored-by: LeiWang1999 <wyatuestc@gmail.com>
codegen_cuda.cc 61.1 KB