    [Math] Dispatch `T.rsqrt(x)` to a CUDA intrinsic instead of `1 / T.sqrt(x)` (#781) · b66f9aae
    Lei Wang authored
    * Fix type hint for target_host parameter in compile function to allow None value
    
    * Refactor target handling in compile function to utilize determine_target for improved clarity and consistency
    
    * Update PrintConst function in codegen_cuda.cc to use hexfloat format for bfloat16 and float8/float4 types, while adding scientific notation comments for clarity. This change enhances the representation of floating-point constants in the generated code.
    
    * Refactor PrintType function in codegen_cuda.cc to remove unnecessary failure conditions for floating-point types with lane counts greater than 4. This change simplifies the logic and improves code clarity.
    
    * Enhance benchmark_matmul.py to conditionally print Reference TFlops only if ref_latency is not None. Update param.py to ensure target is converted to string for consistency. Refactor tuner.py to utilize determine_target for improved clarity in target handling.
    
    * Remove automatic commit and push step from AMD and NVIDIA CI workflows to streamline the process and avoid unnecessary commits.
    
    * Add intrin_rule source files to CMakeLists.txt and implement hrsqrt function for half_t in common.h
    
    * Lint fix
    
    * Remove the cmake dependency from pyproject.toml, as it may lead to different cmake paths in different stages
    
    * Lint fix
    
    * Add cmake dependency to pyproject.toml and improve build logging in setup.py
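
The PrintConst change above (hexfloat constants with a scientific-notation comment) can be illustrated with a minimal standalone sketch. This is not the code from codegen_cuda.cc: the helper name `FormatFloatConst`, the `f` suffix, and the exact comment layout are assumptions made for illustration only.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper, not taken from codegen_cuda.cc.
// Emits a float constant as a hexfloat literal (an exact bit-level
// representation) followed by a human-readable scientific-notation comment.
std::string FormatFloatConst(float value) {
  std::ostringstream os;
  // Hexfloat avoids decimal rounding, so the literal parses back exactly.
  os << std::hexfloat << value << "f";
  // The comment is only for readers of the generated source.
  os << " /*" << std::scientific << std::setprecision(6) << value << "*/";
  return os.str();
}
```

The point of the hexfloat form is that parsing the emitted literal recovers exactly the same float, which a short decimal literal does not guarantee; the trailing comment keeps the generated code readable.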
CMakeLists.txt 7.38 KB