    [Feature] Support Async Pipeline inference within if scope (#198) · 7ccec53b
    Lei Wang authored
    * Optimize CMake build process with dynamic job count calculation
    
    - Modify build_csrc function to use 90% of available CPU cores
    - Ensure at least one job is used during compilation
    - Improve build performance by dynamically adjusting parallel job count
    
    * Optimize build_csrc function with multiprocessing module
    
    - Replace os.cpu_count() with multiprocessing.cpu_count()
    - Maintain existing 90% CPU utilization logic
    - Improve CPU core count calculation for build process
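
    A minimal sketch of the job-count logic described above (the real `build_csrc` lives in the
    tilelang build script; the cmake invocation shown here is illustrative only):

    ```python
    import multiprocessing
    import subprocess

    def build_csrc(build_dir: str) -> None:
        # Use 90% of the available cores, but always at least one job.
        num_jobs = max(1, int(multiprocessing.cpu_count() * 0.9))
        subprocess.check_call(["cmake", "--build", ".", f"-j{num_jobs}"], cwd=build_dir)
    ```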
    
    * Add dynamic shape support with out_idx in Cython JIT kernel compilation
    
    - Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
    - Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
    - Add support for resolving dynamic shape dimensions using input tensor references
    - Enhance flexibility of JIT kernel compilation with symbolic shape handling
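
    A hypothetical sketch of the shape-resolution idea: when an output buffer declared via
    `out_idx` has a symbolic dimension, the wrapper looks the symbol up in a
    `dynamic_symbolic_map` that maps it to an (input index, dimension index) pair and reads the
    concrete size from the corresponding input tensor. Names below are illustrative, not the
    actual Cython wrapper code:

    ```python
    import torch

    def resolve_out_shape(symbolic_shape, inputs, dynamic_symbolic_map):
        """Resolve symbolic output dims from already-bound input tensors."""
        shape = []
        for dim in symbolic_shape:
            if isinstance(dim, int):
                shape.append(dim)  # static dimension, use as-is
            else:
                buf_idx, dim_idx = dynamic_symbolic_map[dim]
                shape.append(inputs[buf_idx].shape[dim_idx])  # read size from the input tensor
        return shape

    # e.g. allocate the output with a dynamic leading dimension "m":
    # out = torch.empty(resolve_out_shape(("m", 1024), (a, b), {"m": (0, 0)}),
    #                   dtype=a.dtype, device=a.device)
    ```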
    
    * Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel
    
    - Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
    - Improve debugging by providing context about missing symbolic dimensions
    - Maintain existing dynamic shape resolution logic
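
    The improved diagnostic amounts to failing loudly when a symbol is missing instead of
    surfacing a bare lookup error; roughly (illustrative, not the exact wrapper code):

    ```python
    def lookup_symbol(dim, dynamic_symbolic_map):
        if dim not in dynamic_symbolic_map:
            raise KeyError(
                f"Dynamic symbolic dimension '{dim}' not found in dynamic_symbolic_map "
                f"(known symbols: {list(dynamic_symbolic_map)}); cannot allocate the output tensor."
            )
        return dynamic_symbolic_map[dim]
    ```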
    
    * Fix Copy operation handling for scalar and multi-dimensional tensors
    
    - Add special handling for scalar tensor copy operations
    - Enhance error reporting in MakeIndices method with more detailed diagnostic information
    - Improve SIMT loop generation to support zero-dimensional tensors
    - Add explicit check and handling for scalar tensor scenarios
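
    Conceptually (the actual change is in the C++ `Copy` lowering; this is Python pseudocode of
    the same logic), a zero-dimensional tensor has no loop axes, so the copy degenerates to a
    single element assignment instead of a SIMT loop nest:

    ```python
    import itertools
    import numpy as np

    def copy_elements(src: np.ndarray, dst: np.ndarray) -> None:
        # Scalar (0-d) tensors have no indices to iterate over: copy the single element directly.
        if src.ndim == 0:
            dst[()] = src[()]
            return
        # Regular case: iterate over every index of the multi-dimensional extent.
        for idx in itertools.product(*(range(s) for s in src.shape)):
            dst[idx] = src[idx]
    ```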
    
    * Refactor Copy operation code formatting and improve readability
    
    - Improve code formatting in MakeIndices and MakeSIMTLoop methods
    - Add line breaks to enhance readability of complex ICHECK statements
    - Simplify code structure in scalar tensor handling
    - Remove unnecessary whitespace and improve code alignment
    
    * Simplify GEMM example with direct kernel compilation
    
    - Update copyright header to Tile-AI Corporation
    - Remove Profiler import and usage
    - Replace tilelang.lower() with tilelang.compile()
    - Simplify kernel execution workflow
    - Update kernel source retrieval method
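
    A condensed sketch of the resulting workflow, modeled on the tilelang GEMM example; exact
    signatures (`T.Buffer` annotations, the `out_idx` form, `get_kernel_source`) may differ
    between tilelang versions, so treat this as illustrative:

    ```python
    import tilelang
    import tilelang.language as T
    import torch

    def matmul(M, N, K, block_M=128, block_N=128, block_K=32,
               dtype="float16", accum_dtype="float"):
        @T.prim_func
        def main(A: T.Buffer((M, K), dtype),
                 B: T.Buffer((K, N), dtype),
                 C: T.Buffer((M, N), dtype)):
            with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
                A_shared = T.alloc_shared((block_M, block_K), dtype)
                B_shared = T.alloc_shared((block_K, block_N), dtype)
                C_local = T.alloc_fragment((block_M, block_N), accum_dtype)
                T.clear(C_local)
                for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                    T.copy(A[by * block_M, ko * block_K], A_shared)
                    T.copy(B[ko * block_K, bx * block_N], B_shared)
                    T.gemm(A_shared, B_shared, C_local)
                T.copy(C_local, C[by * block_M, bx * block_N])
        return main

    # Direct compilation replaces the old tilelang.lower() + Profiler round trip.
    kernel = tilelang.compile(matmul(1024, 1024, 1024), out_idx=[2])
    a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    c = kernel(a, b)                   # output buffer (index 2) is allocated automatically
    print(kernel.get_kernel_source())  # retrieve the generated CUDA source
    ```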
    
    * Enhance block sparse attention implementation
    
    - Update `blocksparse_flashattn` to use 2 stages for improved performance.
    - Change `block_mask_dtype` from `int8` to `bool` for better memory efficiency.
    - Modify condition checks in the kernel to utilize boolean values.
    - Introduce a new example for top-k sparse attention and a benchmark for native sparse attention.
    - Add support for asynchronous copy in PTX and improve pipeline planning with condition handling.
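
    One way to read the dtype change: the mask the example feeds the kernel is now a boolean
    tensor, e.g. built from per-block top-k scores, so the kernel's condition checks test a
    `bool` directly instead of comparing an `int8` against zero. A hypothetical host-side
    sketch (names are illustrative):

    ```python
    import torch

    def topk_block_mask(block_scores: torch.Tensor, k: int) -> torch.Tensor:
        """block_scores: [batch, heads, num_q_blocks, num_kv_blocks] -> bool mask, same shape."""
        idx = block_scores.topk(k, dim=-1).indices
        mask = torch.zeros_like(block_scores, dtype=torch.bool)  # bool, not int8
        mask.scatter_(-1, idx, True)  # keep only the top-k KV blocks per query block
        return mask

    # The kernel then skips K/V blocks whose mask entry is False inside its 2-stage pipelined loop.
    ```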
    
    * Refactor and clean up code formatting across multiple files
    
    - Added whitespace for improved readability in `example_blocksparse_gemm.py`, `example_tilelang_nsa_fwd.py`, and `benchmark_nsa_fwd.py`.
    - Enhanced code structure and alignment in `inject_ptx_async_copy.cc` and `pipeline_planning.cc`.
    - Updated comments and documentation for clarity in `__init__.py` and `phase.py`.
    - Ensured consistent formatting and style across the codebase.