1. 27 Feb, 2025 1 commit
    • [JIT] Enhance cython/ctypes wrapper for tma descriptor (#126) · 7b74bb01
      Lei Wang authored
      
      
      * refactor code
      
      * enhance tutorial
      
      * Enhance error handling and code generation in CUDA and TileLang components
      
      This commit introduces several improvements across multiple files:
      - Added more informative error messages in GEMM layout checks
      - Updated CUDA codegen to support more flexible function signature generation
      - Improved TMA descriptor initialization and kernel dispatch logic
      - Refined library generation and source code parsing utilities
      - Enhanced error handling in various adapter and wrapper classes
      
      * Add thread tag validation for warp specialization
      
      Introduce a ThreadTagChecker to validate that a PrimFunc only uses threadIdx.x before applying warp specialization. This prevents unintended transformations on kernels with complex thread binding and provides a clear warning to users about potential issues with warp specialization.
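
The check described above can be illustrated with a minimal Python sketch. It is not the actual ThreadTagChecker, which operates on a PrimFunc inside the compiler; here the kernel is reduced to a plain list of thread tag strings, and the names `uses_only_threadidx_x` and `should_warp_specialize` are hypothetical. Ignoring `blockIdx.*` tags is an assumption of this sketch.

```python
import warnings
from typing import Iterable

ALLOWED_THREAD_TAG = "threadIdx.x"


def uses_only_threadidx_x(thread_tags: Iterable[str]) -> bool:
    """Return True when every thread-level binding is threadIdx.x.

    Assumption: blockIdx.* tags are not thread-level bindings and are ignored.
    """
    for tag in thread_tags:
        if tag.startswith("threadIdx.") and tag != ALLOWED_THREAD_TAG:
            return False
    return True


def should_warp_specialize(thread_tags: Iterable[str]) -> bool:
    """Decide whether to apply the transformation, warning when it is skipped."""
    if uses_only_threadidx_x(thread_tags):
        return True
    # Hypothetical warning text; the real message may differ.
    warnings.warn(
        "Warp specialization skipped: kernel binds threads other than threadIdx.x"
    )
    return False
```

Keeping the validation separate from the decision lets the transformation pass stay unchanged and simply be gated by the check.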
      
      * Update TileLang Profiling and Compilation in Flash Decoding Examples
      
      Refactor the profiling and compilation workflow in two flash decoding example scripts:
      - Replace `tilelang.lower()` and `tilelang.Profiler()` with `tilelang.compile()`
      - Simplify profiler initialization using `get_profiler()`
      - Update method calls to use the new profiler and compiled kernel objects
      - Maintain existing performance benchmarking and validation logic
      
      * Refactor and clean up code formatting in TileLang testing and adapter modules
      
      This commit includes several code style and formatting improvements:
      - Adjust whitespace and line breaks in test files
      - Improve code formatting in CUDA source wrapper and adapter utilities
      - Enhance readability of function calls and argument handling
      - Remove unnecessary whitespace and standardize indentation
      - Simplify function signatures and argument parsing
      
      * Refactor CUDA codegen and improve code formatting
      
      This commit includes several improvements to CUDA code generation and formatting:
      - Enhance function signature generation in CodeGenTileLangCUDA
      - Improve code formatting and readability in CUDA-related files
      - Simplify parameter handling and type annotations
      - Clean up whitespace and line breaks in codegen and layout files
      
      ---------
      Co-authored-by: Ubuntu <dlisuser@h100testl730RPS.xu5snccwrbtejcqqalluoku5hb.xx.internal.cloudapp.net>
  2. 21 Feb, 2025 1 commit
    • [JIT] Support Cython jit and make cython a default execution backend (#102) · 3471904f
      Lei Wang authored
      * [Feature] Add CTypes JIT kernel support for dynamic shapes and multi-stream execution
      
      - Enhance CtypesKernelAdapter to handle dynamic symbolic shapes
      - Add support for multi-stream kernel execution in CTypes backend
      - Implement dynamic shape handling in test_tilelang_jit_gemm_ctypes.py
      - Add symbolic shape utility function in tilelang.language
      - Update profiler to improve flexibility in benchmark selection
      
      * Remove redundant thread binding in GEMM kernel implementations
      
      - Remove unnecessary `thread_binding` line in GEMM kernel functions
      - Clean up code in `examples/gemm/README.md` and `testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py`
      - Enhance code readability by removing redundant thread binding annotation
      
      * Fix indentation in int4 GEMM kernel test file
      
      - Correct indentation for function calls in `test_tilelang_kernel_int4_gemm_mma.py`
      - Remove extra indentation in `mma_emitter.ldmatrix_a()` and `mma_emitter.ldmatrix_b()` calls
      - Improve code formatting for better readability
      
      * [Feature] Add Cython JIT kernel support for dynamic shapes and multi-stream execution
      
      - Implement CythonKernelAdapter to handle dynamic symbolic shapes
      - Add support for multi-stream kernel execution in Cython backend
      - Create comprehensive test suite for Cython GEMM kernel in test_tilelang_jit_gemm_cython.py
      - Update JITKernel to include "cython" as a valid execution backend
      - Add Cython-specific wrapper and library generation modules
      - Update .gitignore to exclude Cython cache directory
      - Modify setup.py to include Cython source files in package data
      
      * lint fix
      
      * [Refactor] Replace JITKernel with compile() function for kernel compilation
      
      - Add new `compile()` function in tilelang/jit/__init__.py as a wrapper for JITKernel
      - Update multiple test files and examples to use `tilelang.compile()` instead of `tilelang.JITKernel()`
      - Modify kernel adapters to support optional kernel-only source retrieval
      - Update `__init__.py` to import the new `compile()` function
      - Improve kernel source retrieval for different execution backends
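
The wrapper pattern described above can be sketched with stand-in classes. This does not reproduce the TileLang API: `KernelStub` and `compile_stub` are hypothetical names, the set of valid backends is an assumption based only on the backends named in this log, and the "kernel" is an ordinary Python callable.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Assumption: only the two backends mentioned in this log; the real set may differ.
VALID_BACKENDS = ("ctypes", "cython")


@dataclass
class KernelStub:
    """Stand-in for a JIT kernel object that validates its execution backend."""

    func: Callable[..., Any]
    execution_backend: str = "cython"

    def __post_init__(self) -> None:
        if self.execution_backend not in VALID_BACKENDS:
            raise ValueError(
                f"Invalid execution backend {self.execution_backend!r}; "
                f"expected one of {VALID_BACKENDS}"
            )

    def __call__(self, *args: Any) -> Any:
        return self.func(*args)


def compile_stub(func: Callable[..., Any], execution_backend: str = "cython") -> KernelStub:
    """Thin functional wrapper so callers do not construct the class directly."""
    return KernelStub(func, execution_backend)
```

A function-style entry point like this keeps call sites short and leaves room to change the underlying class without touching examples and tests.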
      
      * lint fix
      
      * remove debug print
      
      * Add C/C++ compiler utility module and update Cython JIT kernel support
      
      - Introduce new `tilelang/contrib/cc.py` module with cross-platform C/C++ compiler utilities
      - Add functions to detect and retrieve system C/C++ compilers
      - Implement cross-compilation and shared library creation support
      - Update Cython JIT kernel to validate C++ compiler availability
      - Modify Cython adapter to use detected C++ compiler for library generation
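
Compiler detection of the kind listed above can be sketched with the standard library alone. This is not the code in `tilelang/contrib/cc.py`; `find_cxx_compiler`, its candidate list, and the `honor_env` flag are assumptions made for illustration.

```python
import os
import shutil
from typing import Optional, Sequence


def find_cxx_compiler(
    candidates: Sequence[str] = ("g++", "clang++", "c++"),
    honor_env: bool = True,
) -> Optional[str]:
    """Return the path of the first usable C++ compiler, or None if none is found."""
    if honor_env:
        # Prefer an explicit CXX setting when it resolves to an executable.
        env_cxx = os.environ.get("CXX")
        if env_cxx:
            resolved = shutil.which(env_cxx)
            if resolved is not None:
                return resolved
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None
```

A caller that requires a compiler, such as a JIT adapter, can treat a `None` result as a configuration error and report it before attempting to build a shared library.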
      
      * Refactor float8 dtype mapping in tensor utility module
      
      - Move float8_dtype_map inside adapt_torch2tvm function
      - Simplify global scope by localizing the dtype mapping
      - Maintain existing functionality for converting torch float8 tensors to TVM ndarray
      
      * revert
      
      * Enhance Cython JIT adapter with Cython compiler detection
      
      - Add `get_cython_compiler()` function to dynamically locate Cython executable
      - Update Cython adapter to use detected Cython compiler instead of hardcoded command
      - Raise an exception if no Cython compiler is found
      - Update requirements.txt to specify minimum PyTorch version (>=2.2.0)
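
Locating the Cython executable and failing loudly when it is missing can be sketched as follows. This is not the actual `get_cython_compiler()` implementation; `find_cython_executable` and its candidate names are hypothetical.

```python
import shutil
from typing import Sequence


def find_cython_executable(candidates: Sequence[str] = ("cython", "cython3")) -> str:
    """Return the path of the first Cython executable found on PATH.

    Raises RuntimeError when none of the candidate names resolve.
    """
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    raise RuntimeError(
        f"No Cython compiler found on PATH (tried: {', '.join(candidates)})"
    )
```

Raising at lookup time, rather than letting a hardcoded command fail later in a subprocess, gives the user an error that names the missing tool directly.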
      
      * Fix Cython kernel wrapper stream handling and type annotations
      
      - Update stream parameter type to int64_t for better compatibility
      - Directly use torch.cuda.current_stream().cuda_stream instead of casting
      - Improve type safety and precision in Cython kernel wrapper
  3. 25 Jan, 2025 1 commit
  4. 20 Jan, 2025 2 commits