  1. 09 Oct, 2025 8 commits
  2. 07 Oct, 2025 3 commits
  3. 06 Oct, 2025 3 commits
  4. 05 Oct, 2025 3 commits
  5. 04 Oct, 2025 3 commits
  6. 02 Oct, 2025 2 commits
    • [Bugfix] Fix tensor memory copy layout (#933) · 5ccac4fa
      Zhiwen Mo authored
      * Implements tcgen05.ld instruction support for copying from shared.tmem
        to local.fragment on SM100/Blackwell architecture. Adds layout inference
        and lowering logic for tensor memory operations with proper physical
        coordinate range analysis and warpgroup alignment checks.
      
        Changes:
        - Add kTMemLoad and kTMemStore to CopyInst enumeration
        - Implement CheckTMemLoad() and CheckTMemStore() validation functions
        - Add LowerTmemCopy() to generate tcgen05.ld/st/cp PTX intrinsics
        - Add tmem layout inference in InferLayout() using expandTcgen05Layout
        - Support multiple instruction variants (32dp32b/64b/128b/256b)
        - Add physical layout bounds analysis for tmem coordinates
        - Change clear_accum from bool to PrimExpr in GEMM operations
        - Fix std::optional access checks in layout_inference.cc
        - Add tmem_allocate/deallocate PTX intrinsic support
        - Fix cooperative_groups grid.sync() code generation
      
      * fix
      
      * pipeline fix
      
      * bug fix
      
      * bool fix
    • [Layout] Strict annotate completed replicated layout for fragment with constant index (#929) · fc4bd452
      Lei Wang authored
      * [Layout] Add IsCompletedReplicated method and enhance layout inference in ParallelOpNode
      
      - Introduced IsCompletedReplicated method in FragmentNode to check if a buffer is fully replicated.
      - Enhanced InferLayout in ParallelOpNode to handle layout inference for replicated buffers, ensuring only fragment[0] access is allowed.
      - Updated error handling for non-zero index access in fragment buffers to improve robustness.
      
      * [Layout] Improve code formatting and readability in layout.cc and parallel.cc
      
      - Enhanced formatting in FragmentNode's IsCompletedReplicated method for better clarity.
      - Updated InferLayout method in ParallelOpNode to improve code readability by adjusting line breaks and indentation.
      - Ensured consistent formatting across conditional statements and comments for improved maintainability.
      
      * update
      
      * optimize const index related op
      
      * bug fix
      
      * reduce gdn test
      
      * test fix
      
      * lint fix
      
      * lint fix
      
      * test fix
  7. 01 Oct, 2025 5 commits
  8. 30 Sep, 2025 3 commits
  9. 29 Sep, 2025 8 commits
    • [Example] Add topk into sparse mla example and append some docs (#901) · 6021ef32
      Lei Wang authored
      * Remove unused `fp8_mqa_logits.py` file and update README.md to reflect new directory structure and file descriptions for deepseek_v32 example. Added sections for architecture overview, Lightning Indexer, Top-k Selector, and Sparse MLA Forward implementations.
      
      * Update linting configurations and improve code formatting in deepseek_v32 example scripts
      
      - Added per-file ignores for the inference directory in `pyproject.toml`.
      - Refactored code in `topk_selector.py`, `convert.py`, `generate.py`, `kernel.py`, and `model.py` to enhance readability by adjusting spacing and line breaks.
      - Ensured consistent formatting across function definitions and assertions for better clarity.
      
      * Refactor test functions in deepseek_v32 example scripts for improved clarity and consistency
      
      - Updated `fp8_lighting_indexer.py` to define a dedicated test function for the Lightning Indexer.
      - Refactored `sparse_mla_fwd_pipelined.py` and `sparse_mla_fwd.py` to standardize test function parameters and improve readability.
      - Enhanced `topk_selector.py` by introducing a test function with parameters for batch size and sequence length.
      - Ensured all test functions are invoked correctly in the main execution block.
      
      * Enhance test functions in deepseek_v32 example scripts with CUDA requirements and parameterization
      
      - Added CUDA requirements decorators to `test_example_sparse_mla_fwd` and `test_example_sparse_mla_fwd_pipelined`.
      - Parameterized test functions to use specific small shapes for testing, improving test coverage and clarity.
      
      * lint fix
      
      * Update README.md to correct image path for DeepSeek V3.2 architecture diagram
    • [Bugfix] Fix flops comp and softmax scale in mla (#900) · 16561159
      Wenxuan Tan authored
      * fix flops comp and softmax scale
      
      * format
    • [CI] Legalize math related test (#899) · 54fc6ba0
      Lei Wang authored
    • [Typo] Fix backend name for Huawei Ascend (#898) · d19fe1ae
      Wenhao Xie authored
      * [Typo] Fix backend name for Huawei Ascend chips
      
      * update
    • [Example] Add sparse mla examples (#896) · 65ac7454
      Lei Wang authored
      * Update README.md to include directory structure and file descriptions for deepseek_v32 example
      
      * Refactor and clean up deepseek_v32 example scripts
      
      - Removed unused imports and functions from `fp8_mqa_logits.py` to streamline the code.
      - Improved formatting and readability in `sparse_mla_fwd_pipelined.py` and `sparse_mla_fwd.py` by adjusting function signatures and indentation.
      - Added `# ruff: noqa` comments to suppress linting warnings in multiple files.
      - Enhanced the `generate_random_cu_seqlens` function in `utils.py` for better clarity and organization.
      - Updated print statements for consistency in output formatting.
    • [Example] Add example (#894) · 4424fa9a
      Lei Wang authored
      * [Refactor] Enhance CopyNode Lower method to support disable_tma flag and improve flash attention implementation
      
      - Updated the CopyNode Lower method to correctly include the disable_tma flag in the GetCopyInst call.
      - Refactored the flash attention implementation to selectively disable TMA for specific copy operations while allowing it for others.
      - Addressed linting issues for improved code quality.
      
      * sparse mla kernels
      
      * Remove deprecated sparse MLA and utility files to streamline the codebase.
    • [Layout] fix plot layout (#890) · 6c67a77f
      Jiaxing Ding authored
  10. 28 Sep, 2025 2 commits
    • [Bugfix] Fix CopyNode Lower method to include disable_tma flag in GetCopyInst (#888) · 599264ca
      Tong WU authored
      * Fix CopyNode Lower method to include disable_tma flag in GetCopyInst call
      
      * Refactor flash attention implementation to disable TMA for specific copy and allow TMA for other operations
      
      * attempt to fix lint
    • [SM100] Add sm100 GEMM layouts and tcgen05 support (#887) · f58bcd43
      Zhiwen Mo authored
      * update sm100 related utcmma, tmem, ld/st256 in src
      * update sm100 related utcmma, tmem, ld/st256 in tilelang
      * Remove deprecated GEMM examples and related README documentation for SM100 architecture support
      * Update GEMM implementation to replace UTCMMA with TCGEN5MMA across relevant files
      * Remove gemm_umma.py example and update README to reflect TCGEN5MMA terminology changes
      * Update README.md for gemm_sm100 example by removing outdated API sections and streamlining documentation
      * Update README and source files to reflect TCGEN5.MMA terminology changes
      * Refactor CUDA GEMM header for improved readability