- 11 Dec, 2025 1 commit
-
-
Lei Wang authored
-
- 01 Dec, 2025 1 commit
-
-
botbw authored
* [misc] add a cpp side wrapper for gemm_sp_py * [misc] typing * [IR] bind GemmSPWarpPolicy * [chore] add wrapper code * [IR] fix GemmSPWarpPolicy * [codegen] apply ptxas instructions * [intrinsic] add typical (unused) mma layout * [template] add uint16 debug func * [intrinsic] add b matrix layout * [gemm_sp] enable fp16/bf16 on sm8x * [layout] refactor fp16/bf16 layout * [gemm_sp] enable int8 * [chore] update test case dtype * [gemm_sp] enable fp32 * [layout] refactor layouts * [intrinsic] enable ldmatrix for mat A * [layout] enable ldsm for matrix b * [layout] add ldmatrix for fp32 and fp8 * [chore] refine * [chore] refactor * [chore] add fp8 efactor * [chore] refactor * [chore] add remove negative zero util * [example] add a custom compress kernel * [chore] minor update * [test] refactor gemm_sp test * [refactor] make metadata layout func * [example] add option for using cutlass layout * [doc] add a gemm_sp doc * [doc] minor polish * [chore] remove unused * [bugfix] fix non replicate b case * [test] refactor * [chore] add a check * [bugfix] fix util bug * [wip] init a new test case for v2 * [chore] minor refactor * [chore] minor update * [bugfix] enable 16bit rs * [language] enable rs * [language] enable gemm_sp_sr * [language] enable gemm_sp_rr * [test] enable more tests * [tvm] update ffi binding * [chore] remove print * [chore] fix benchmark script * [lint] precommit lint * [chore] apply feedback * [test] use arch 8.0 * [chore] rollback ::ordered_metadata for backward compatibility * [bugfix] fix captialized * [example] keep gemm_sp on hopper * [test] fix no fp8 normal kernel * [test] reduce matmul size to satisfy accum error * [test] use cal_diff for assertion * [bugfix] expand float8 type * [lib] add make_int4 for short type * [language] add transpose E * [bugfix] fix wrong var * [format] format * [chore] refactor binding * [chore] fix wrong passing var
-
- 12 Apr, 2025 1 commit
-
-
Lei Wang authored
* [Add] Introduce deepseek_mla documentation for high-performance FlashMLA with TileLang - Added a comprehensive guide on writing high-performance kernels using TileLang, focusing on the Multi-Head Latent Attention (MLA) mechanism. - Included benchmark results comparing FlashMLA, TileLang, Torch, Triton, and FlashInfer, highlighting TileLang's efficiency and ease of use. - Detailed implementation strategies, including layout inference, threadblock swizzling, shared memory swizzling, and warp specialization. - Provided examples and explanations of optimization techniques to enhance performance in GPU kernel programming. * doc update * [Add] Enhance AMD FlashMLA implementation and documentation - Refactored variable names in `benchmark_mla_decode_amd_tilelang.py` for clarity, changing `Q_shared` and `Q_pe_shared` to `Q_local` and `Q_pe_local` to reflect their usage in register allocation. - Added a new `README.md` detailing the high-performance FlashMLA implementation on AMD MI300X accelerators, including architectural considerations, optimization strategies, and performance evaluation. - Introduced a performance comparison figure to illustrate the efficiency of the TileLang implementation against other frameworks. * lint fix * [Add] Expand deepseek_mla documentation for AMD MI300X optimization strategies - Introduced a new section detailing architectural differences and optimization strategies for implementing FlashMLA on AMD MI300X accelerators. - Highlighted key considerations such as instruction set variations, shared memory constraints, tile size flexibility, and memory bank conflict swizzling. - Included performance evaluation results demonstrating TileLang's efficiency compared to other frameworks. - Discussed future optimization opportunities for memory bank conflict mitigation and dimension parallelization.
-
- 02 Feb, 2025 1 commit
-
-
Lei Wang authored
* implement jit test case * [Dev] implement auto tune test case for matrix multiplication * Implement test for legalize memory access and vectorized loop * lint fix * introduce run_once * Refactor callback function names for consistency and improve code readability * enhance documentations * lint fix * lint fix * lint fix * lint fix * fix formatting issues in rt_mod_hip.cc * add random seed initialization for deterministic testing * Add documentation images and comprehensive GEMM tutorial for TileLang * Update MATMUL documentation title to highlight Tile Library
-
- 26 Jan, 2025 1 commit
-
-
Lei Wang authored
* implement jit test case * [Dev] implement auto tune test case for matrix multiplication * Implement test for legalize memory access and vectorized loop * lint fix * introduce run_once * Refactor callback function names for consistency and improve code readability * enhance documentations * lint fix * lint fix * lint fix * lint fix * fix formatting issues in rt_mod_hip.cc * add random seed initialization for deterministic testing
-
- 24 Jan, 2025 1 commit
-
-
Lei Wang authored
[Doc] Update documentation structure and content: add overview section, revise project name, and change theme to Furo (#39)
-
- 21 Jan, 2025 1 commit
-
-
Wenhao Xie authored
* [Doc] Use sphinx to generate docs. * [Doc] Fix a bug on tlcpack_sphinx_addon. * [Doc] Fix linting issues.
-