  1. 12 Oct, 2025 1 commit
    • [Bugfix] Add NVIDIA HPC SDK support in CUDA detection (#974) (#976) · b0b5347a
      Degeneracy-Evil authored
      
      
      * [Bugfix] Add NVIDIA HPC SDK support in CUDA detection (#974)
      
      Enhanced CUDA detection to recognize NVIDIA HPC SDK installations:
      - Added path check for nvhpc in nvcc binary path
      - Added fallback scan for default nvhpc paths:
        /opt/nvidia/hpc_sdk/Linux_x86_64
      - Maintained backward compatibility with standard CUDA installations
      
      Verification:
      - Tested on Ubuntu 24.04 with NVIDIA HPC SDK 25.7
      - Confirmed detection works without manual CUDA_HOME or CUDA_PATH setting
      
      Fixes #974
      
      * [Bugfix] Fix CUDA home detection logic
      
      * [Bugfix] Safely handle None cuda_home during CUDA detection
      
      Adds a check for None before validating the CUDA home path to prevent errors when the path is not set.
      
      * [Bugfix] Fix CUDA detection edge cases in nvhpc support (#974)
      
      - Improved nvhpc path detection logic
      - Added None check for cuda_home to avoid crashes
      - Maintained existing CUDA installation compatibility
      
      Fixes #974
      
      * chore: rerun CI
      
      ---------
      Co-authored-by: NaNExist <138002947+NaNExist@users.noreply.github.com>
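The detection order described in the commit above (explicit environment variables, then the `nvcc` location, then a scan of the default HPC SDK root, with a `None` check before validation) can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: `find_cuda_home`, `_home_from_nvcc`, and `_scan_nvhpc` are hypothetical names, and the `<version>/compilers/bin/nvcc` next to `<version>/cuda` layout is an assumption about how an HPC SDK install is arranged.

```python
import os
import shutil
from typing import Mapping, Optional

# Default HPC SDK root named in the commit message.
NVHPC_DEFAULT_ROOT = "/opt/nvidia/hpc_sdk/Linux_x86_64"


def _home_from_nvcc(nvcc_path: str) -> str:
    """Derive a candidate CUDA home from the location of the nvcc binary."""
    bin_parent = os.path.dirname(os.path.dirname(nvcc_path))
    if "nvhpc" in nvcc_path or "hpc_sdk" in nvcc_path:
        # Assumed layout: <root>/<version>/compilers/bin/nvcc with the
        # toolkit itself under <root>/<version>/cuda.
        return os.path.join(os.path.dirname(bin_parent), "cuda")
    # Standard toolkit layout: <cuda_home>/bin/nvcc.
    return bin_parent


def _scan_nvhpc(root: str) -> Optional[str]:
    """Fallback: look for <root>/<version>/cuda under the HPC SDK root."""
    if not os.path.isdir(root):
        return None
    # Lexicographic ordering is a simplification, not real version sorting.
    for version in sorted(os.listdir(root), reverse=True):
        candidate = os.path.join(root, version, "cuda")
        if os.path.isdir(candidate):
            return candidate
    return None


def find_cuda_home(
    env: Mapping[str, str],
    nvcc_path: Optional[str],
    nvhpc_root: str = NVHPC_DEFAULT_ROOT,
) -> Optional[str]:
    """Return a usable CUDA home directory, or None if nothing was found."""
    cuda_home = env.get("CUDA_HOME") or env.get("CUDA_PATH") or None
    if cuda_home is None and nvcc_path is not None:
        cuda_home = _home_from_nvcc(nvcc_path)
    if cuda_home is None:
        cuda_home = _scan_nvhpc(nvhpc_root)
    # Check for None before validating the path, so an unset value
    # never reaches os.path.isdir.
    if cuda_home is None or not os.path.isdir(cuda_home):
        return None
    return cuda_home
```

A caller would typically pass the live environment and the `nvcc` found on `PATH`, for example `find_cuda_home(os.environ, shutil.which("nvcc"))`. Taking both as arguments keeps the function easy to test without a GPU toolchain installed.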
  2. 11 Oct, 2025 7 commits
  3. 10 Oct, 2025 6 commits
    • [Bugfix] Fix dummy kernel compilation (#962) · 7913fb1d
      Chaofan Lin authored
      
      
      * [Bugfix] Fix visit EvaluateNode in BufferGemmCollector
      
      * address comment
      
      * lint
      
      * fix
      
      * Add TileLang SplitHostDevice pass and tighten issue 830 test names
      
      * lint fix
      
      * enhance for kernel value unpacking.
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
    • Xiaoyu Zhang · 6031416f
    • [CI] add `pre-commit` integration (#955) · 8fe35402
      Xuehai Pan authored
      
      
      * chore: misc cleanup
      
      * feat: add pre-commit config
      
      * chore: update lint dependencies
      
      * style: fix lint issues
      
      * feat: add pre-commit hooks
      
      * fix: fix typos
      
      * chore: update .gitattributes
      
      * [Lint]: [pre-commit.ci] auto fixes [...]
      
      * docs: update CONTRIBUTING.md
      
      * chore: update default venv name
      
      * chore: revert and exclude CUDA files
      
      ---------
      Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
    • [Bugfix] Do not force inline let stmt (#947) · f8ae600c
      Lei Wang authored
      * remove debug print
      
      * Remove inline let expressions from the LowerAndLegalize function in phase.py
      
      * add test
      
      * Update sparse MLA examples to support SKV adjustment and correctness checks
      
      - Changed SKV parameter from 32768 to 8192 in sparse MLA backward and forward tests.
      - Added check_correctness parameter to test functions for validation of outputs.
      - Updated test cases to reflect new SKV values and correctness checks.
      
      * reduce test shape
      
      * Update documentation structure and refactor main function parameters in example_fusedmoe_tilelang.py
      
      - Added a new section for compiler internals in the documentation.
      - Refactored the main function in example_fusedmoe_tilelang.py to accept parameters for hidden dimensions, expert configurations, and batch/sequence sizes, improving flexibility and readability.
      
      * Update buffer access checks in merge_shared_memory_allocations.cc
      
      - Changed the condition for buffer access from less than (<) to less than or equal to (<=) to allow access at the same scope level.
      - Adjusted the logic for determining the access level when touching buffers to ensure correct handling of scope levels.
      
      * lint fix
      
      * Support pipeline with LetStmt
      
      * lint fix
      
      * Fix LowerTileOp let handling to avoid LetInline dependency
      
        - inline let-bound BufferLoad nodes via resolver helpers and structured return
        - remap layouts/buffers using original data vars and only rewrite when needed
        - update pipeline planner to understand let-bound address_of buffers
        - document the new inline behaviour in docs/let_inline_fix.md
      
      * fix for wgmma pipeline with let binding
      
      * lint fix
      
      * test fix
      
      * reduce smem usage.
      
      * let binding enhancement
      
      * fix for dpgm
      
      * fix simplify
      
      * lint fix
      
      * use tilelang.Simplify instead of tir.Simplify
      
      * Add TL_FORCE_LET_INLINE pass config and gate eager LetInline usage
      
        - register the new config in builtin headers/registration
        - add helper to pipeline enabling LetInline based on pass context
        - document LetStmt inlining controls and usage
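The gating described in the last item of the commit above (LetInline runs eagerly only when a pass config asks for it) can be illustrated with a toy pipeline. This is a sketch under stated assumptions and uses no TileLang or TVM API: passes are modelled as plain string-to-string functions, the config is an ordinary dict, and the key string `"TL_FORCE_LET_INLINE"` is taken from the commit message, so the key the project actually registers may be spelled differently.

```python
from typing import Callable, Dict, List

# Toy stand-in for an IR-to-IR transform: each pass appends its name.
Pass = Callable[[str], str]

# Name taken from the commit message; the real registered key may differ.
FORCE_LET_INLINE_KEY = "TL_FORCE_LET_INLINE"


def let_inline(ir: str) -> str:
    return ir + " |let_inline"


def simplify(ir: str) -> str:
    return ir + " |simplify"


def should_force_let_inline(config: Dict[str, object]) -> bool:
    """Hypothetical helper: eager inlining is off unless explicitly enabled."""
    return bool(config.get(FORCE_LET_INLINE_KEY, False))


def build_pipeline(config: Dict[str, object]) -> List[Pass]:
    """Assemble the pass list, adding LetInline only when the config asks."""
    passes: List[Pass] = []
    if should_force_let_inline(config):
        passes.append(let_inline)
    passes.append(simplify)
    return passes


def run(ir: str, config: Dict[str, object]) -> str:
    for p in build_pipeline(config):
        ir = p(ir)
    return ir
```

With an empty config, `run("mod", {})` applies only the simplify step. Passing `{FORCE_LET_INLINE_KEY: True}` restores the eager inlining step ahead of it.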
    • [Example] Add support for `bfloat16` and user-defined `sm_scale` in attention sink examples (#924) · 7cd0da99
      Tong WU authored
      
      
      * revert split+sum template for MHA backward
      
      * lint
      
      * Update example_mha_bwd.py
      
      * Update example_mha_bwd_wgmma_pipelined.py
      
      * Refactor attention sink examples to support bf16 and user-defined softmax scale
      
      * fix typos
      
      * Adding compile flags for fast math optimizations and enabling BF16 support in both GQA and MHA backward implementations.
      
      * Update backward configuration for GQA and MHA examples to align with flash attention
      
      * Refactor GQA backward implementation to improve atomic add performance
      
      * Allow for slightly larger numerical error for bf16
      
      * upd readme to show bf16 benchmark results
      
      * lint
      
      * fix ci and lint
      
      * fix comments and lint
      
      * refactor atomic add
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
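Two ideas from the commit above lend themselves to a small illustration: an optional user-supplied softmax scale, and a looser comparison tolerance for bf16. The sketch below is not the example code itself. `resolve_sm_scale` and `allclose` are hypothetical helpers, the fallback of `1 / sqrt(head_dim)` is the conventional attention default and is assumed here, and the tolerance values are made up for illustration. The looser bf16 bound reflects that bfloat16 carries 8 bits of significand precision against 11 for float16.

```python
import math
from typing import Optional, Sequence

# Illustrative tolerances only; not the values used in the examples.
TOLERANCES = {"float16": 1e-2, "bfloat16": 2e-2}


def resolve_sm_scale(head_dim: int, sm_scale: Optional[float] = None) -> float:
    """Use the caller's scale if given, else the conventional 1/sqrt(d)."""
    return 1.0 / math.sqrt(head_dim) if sm_scale is None else sm_scale


def allclose(actual: Sequence[float], expected: Sequence[float], dtype: str) -> bool:
    """Elementwise check with a combined absolute and relative tolerance."""
    tol = TOLERANCES[dtype]
    return all(abs(a - e) <= tol + tol * abs(e) for a, e in zip(actual, expected))
```

For a head dimension of 64 the default scale resolves to 0.125, and a caller can override it by passing `sm_scale` explicitly.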
    • [Docs] add CODE_OF_CONDUCT.md (#965) · 8f07b9b0
      Xuehai Pan authored
      
      
      * [Docs] add CODE_OF_CONDUCT.md
      
      * Update CODE_OF_CONDUCT.md
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
  4. 09 Oct, 2025 10 commits
  5. 07 Oct, 2025 3 commits
  6. 06 Oct, 2025 3 commits
  7. 05 Oct, 2025 3 commits
  8. 04 Oct, 2025 3 commits
  9. 02 Oct, 2025 2 commits
    • [Bugfix] Fix tensor memory copy layout (#933) · 5ccac4fa
      Zhiwen Mo authored
      * Implements tcgen05.ld instruction support for copying from shared.tmem
        to local.fragment on SM100/Blackwell architecture. Adds layout inference
        and lowering logic for tensor memory operations with proper physical
        coordinate range analysis and warpgroup alignment checks.
      
        Changes:
        - Add kTMemLoad and kTMemStore to CopyInst enumeration
        - Implement CheckTMemLoad() and CheckTMemStore() validation functions
        - Add LowerTmemCopy() to generate tcgen05.ld/st/cp PTX intrinsics
        - Add tmem layout inference in InferLayout() using expandTcgen05Layout
        - Support multiple instruction variants (32dp32b/64b/128b/256b)
        - Add physical layout bounds analysis for tmem coordinates
        - Change clear_accum from bool to PrimExpr in GEMM operations
        - Fix std::optional access checks in layout_inference.cc
        - Add tmem_allocate/deallocate PTX intrinsic support
        - Fix cooperative_groups grid.sync() code generation
      
      * fix
      
      * pipeline fix
      
      * bug fix
      
      * bool fix
    • [Layout] Strict annotate completed replicated layout for fragment with constant index (#929) · fc4bd452
      Lei Wang authored
      * [Layout] Add IsCompletedReplicated method and enhance layout inference in ParallelOpNode
      
      - Introduced IsCompletedReplicated method in FragmentNode to check if a buffer is fully replicated.
      - Enhanced InferLayout in ParallelOpNode to handle layout inference for replicated buffers, ensuring only fragment[0] access is allowed.
      - Updated error handling for non-zero index access in fragment buffers to improve robustness.
      
      * [Layout] Improve code formatting and readability in layout.cc and parallel.cc
      
      - Enhanced formatting in FragmentNode's IsCompletedReplicated method for better clarity.
      - Updated InferLayout method in ParallelOpNode to improve code readability by adjusting line breaks and indentation.
      - Ensured consistent formatting across conditional statements and comments for improved maintainability.
      
      * updt
      
      * optimize const index related op
      
      * bug fix
      
      * reduce gdn test
      
      * test fix
      
      * lintfix
      
      * lint fix
      
      * test fix
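The rule stated in the commit above (a fragment accessed with a constant index is treated as fully replicated, and only `fragment[0]` is allowed) can be sketched as a small check. This is a toy model and not the C++ logic in `ParallelOpNode::InferLayout`: the function name `infer_fragment_layout` is hypothetical, constant indices are encoded as `int`, and symbolic loop variables as `str`.

```python
from typing import Optional, Sequence, Union

# Toy encoding: int is a constant index, str is a symbolic loop variable.
Index = Union[int, str]


def infer_fragment_layout(name: str, indices: Sequence[Index]) -> Optional[str]:
    """Return "replicated" for an all-constant, all-zero access.

    Symbolic accesses return None and are left to regular layout inference.
    A non-zero constant index is rejected.
    """
    if not indices or any(isinstance(i, str) for i in indices):
        return None
    if any(i != 0 for i in indices):
        raise ValueError(
            f"{name}{list(indices)}: constant access to a replicated "
            "fragment must use index 0"
        )
    return "replicated"
```

Under this model `infer_fragment_layout("acc", [0])` yields the replicated annotation, while an access such as `acc[1]` raises instead of silently picking a layout.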
  10. 01 Oct, 2025 2 commits