1. 12 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Feature] Support Async Pipeline inference within if scope (#198) · 7ccec53b
      Lei Wang authored
      * Optimize CMake build process with dynamic job count calculation
      
      - Modify build_csrc function to use 90% of available CPU cores
      - Ensure at least one job is used during compilation
      - Improve build performance by dynamically adjusting parallel job count
      
      * Optimize build_csrc function with multiprocessing module
      
      - Replace os.cpu_count() with multiprocessing.cpu_count()
      - Maintain existing 90% CPU utilization logic
      - Improve CPU core count calculation for build process
      
      * Add dynamic shape support with out_idx in Cython JIT kernel compilation
      
      - Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
      - Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
      - Add support for resolving dynamic shape dimensions using input tensor references
      - Enhance flexibility of JIT kernel compilation with symbolic shape handling
      
      * Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel
      
      - Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
      - Improve debugging by providing context about missing symbolic dimensions
      - Maintain existing dynamic shape resolution logic
      
      * Fix Copy operation handling for scalar and multi-dimensional tensors
      
      - Add special handling for scalar tensor copy operations
      - Enhance error reporting in MakeIndices method with more detailed diagnostic information
      - Improve SIMT loop generation to support zero-dimensional tensors
      - Add explicit check and handling for scalar tensor scenarios
      
      * Refactor Copy operation code formatting and improve readability
      
      - Improve code formatting in MakeIndices and MakeSIMTLoop methods
      - Add line breaks to enhance readability of complex ICHECK statements
      - Simplify code structure in scalar tensor handling
      - Remove unnecessary whitespace and improve code alignment
      
      * Simplify GEMM example with direct kernel compilation
      
      - Update copyright header to Tile-AI Corporation
      - Remove Profiler import and usage
      - Replace tilelang.lower() with tilelang.compile()
      - Simplify kernel execution workflow
      - Update kernel source retrieval method
      
      * Enhance block sparse attention implementation
      
      - Update `blocksparse_flashattn` to use 2 stages for improved performance.
      - Change `block_mask_dtype` from `int8` to `bool` for better memory efficiency.
      - Modify condition checks in the kernel to utilize boolean values.
      - Introduce a new example for top-k sparse attention and a benchmark for native sparse attention.
      - Add support for asynchronous copy in PTX and improve pipeline planning with condition handling.
      
      * Refactor and clean up code formatting across multiple files
      
      - Added whitespace for improved readability in `example_blocksparse_gemm.py`, `example_tilelang_nsa_fwd.py`, and `benchmark_nsa_fwd.py`.
      - Enhanced code structure and alignment in `inject_ptx_async_copy.cc` and `pipeline_planning.cc`.
      - Updated comments and documentation for clarity in `__init__.py` and `phase.py`.
      - Ensured consistent formatting and style across the codebase.
      7ccec53b