Commit d9ae74c6 authored by Lei Wang, committed by LeiWang1999

[Doc] Phaseout Legacy documentations (#610)

- Added a new entry in the README for the introduction of `T.gemm_sp` supporting 2:4 sparse tensor core.
- Removed several outdated documentation files related to convolution, flash attention, and other tutorials to streamline the documentation structure.
parent be44758c
......@@ -13,6 +13,7 @@ Tile Language (**tile-lang**) is a concise domain-specific language designed to
<img src=./images/MatmulExample.png />
## Latest News
- 07/04/2025 🚀: Introduced `T.gemm_sp` for 2:4 sparse tensor core support, check out [Pull Request #526](https://github.com/tile-ai/tilelang/pull/526) for details.
- 06/05/2025 ✨: Added [NVRTC Backend](https://github.com/tile-ai/tilelang/pull/461) to significantly reduce compilation time for CuTe templates!
- 04/14/2025 🚀: Added high-performance FlashMLA implementation for AMD MI300X, achieving performance parity with hand-optimized assembly kernels of Aiter! See [example_mla_amd](./examples/deepseek_mla/amd/README.md) for details.
- 03/03/2025 🚀: Added high-performance MLA Decoding support using only 80 lines of Python code, achieving performance on par with FlashMLA on H100 (see [example_mla_decode.py](./examples/deepseek_mla/example_mla_decode.py))! We also provide [documentation](./examples/deepseek_mla/README.md) explaining how TileLang achieves this.
......
Flash Linear Attention
======================
General Matrix-Matrix Multiplication with Dequantization
=========================================================
TMAC: Look Up Table Based Mixed Precision Computing
====================================================
......@@ -21,13 +21,8 @@ get_started/overview
:maxdepth: 1
:caption: TUTORIALS
tutorials/writing_kernels_with_tilelibrary
tutorials/writing_kernels_with_thread_primitives
tutorials/annotate_memory_layout
tutorials/debug_tools_for_tilelang
tutorials/auto_tuning
tutorials/jit_compilation
tutorials/pipelining_computations_and_data_movements
:::
:::{toctree}
......@@ -37,23 +32,9 @@ tutorials/pipelining_computations_and_data_movements
deeplearning_operators/elementwise
deeplearning_operators/gemv
deeplearning_operators/matmul
deeplearning_operators/matmul_dequant
deeplearning_operators/flash_attention
deeplearning_operators/flash_linear_attention
deeplearning_operators/deepseek_mla
deeplearning_operators/convolution
deeplearning_operators/tmac_gpu
:::
:::{toctree}
:maxdepth: 2
:caption: LANGUAGE REFERENCE
language_ref/ast
language_ref/primitives
language_ref/tilelibrary
:::
:::{toctree}
:maxdepth: 1
:caption: API Reference
......
Tile Language AST
==================
Tile Language: Primitives
=========================
Tile Language: TileLibrary
==========================
T.Kernel
--------
args: the grid size (0 to 3 dimensions) and the number of threads.
returns: the blockIdx variables
Launches a kernel; it must be used in a with statement. Multiple
kernels can be launched sequentially inside a prim function.
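A minimal sketch of a kernel launch, assuming the Python frontend (`import tilelang.language as T`) and the `threads=` keyword used in the repository examples; the function name and tile size are illustrative.

```python
import tilelang.language as T

M, N = 1024, 1024   # hypothetical problem size
block = 128

@T.prim_func
def fill_ones(A: T.Buffer((M, N), "float32")):
    # Grid of ceildiv(N, 128) x ceildiv(M, 128) blocks, 128 threads per block;
    # bx and by are the blockIdx variables returned by T.Kernel.
    with T.Kernel(T.ceildiv(N, block), T.ceildiv(M, block), threads=128) as (bx, by):
        for i, j in T.Parallel(block, block):
            A[by * block + i, bx * block + j] = 1.0
```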
T.alloc_shared
--------------
args: shape, dtype
returns: Buffer
Allocates a buffer in shared memory. It must be used within a T.Kernel
scope and should be allocated at the top of that scope.
Dynamic shared memory is used. A combined allocation sketch follows the
T.alloc_fragment entry below.
T.alloc_fragment
----------------
args: shape, dtype
returns: Buffer
Allocates a buffer in register memory. It must be used within a
T.Kernel scope and should be allocated at the top of that scope.
The shape is the whole shape of the buffer. Its elements are
distributed across the threads' registers; this storage partition is
inferred by the compiler.
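A minimal allocation sketch covering both T.alloc_shared and T.alloc_fragment; the names and tile sizes are illustrative, and the dtypes follow the common fp16-input / fp32-accumulator pattern from the examples.

```python
import tilelang.language as T

M, N, K = 1024, 1024, 1024            # hypothetical sizes
block_M, block_N, block_K = 128, 128, 32

@T.prim_func
def alloc_example(A: T.Buffer((M, K), "float16"), B: T.Buffer((K, N), "float16")):
    with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
        # Tiles staged in (dynamic) shared memory, declared at the top of the scope.
        A_shared = T.alloc_shared((block_M, block_K), "float16")
        B_shared = T.alloc_shared((block_K, block_N), "float16")
        # Accumulator held in registers; how its elements are partitioned
        # across the 128 threads is inferred by the compiler.
        C_local = T.alloc_fragment((block_M, block_N), "float32")
        T.clear(C_local)
```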
T.copy
------
args: src, dst
Copies data from src to dst. Each of src and dst can be a Buffer,
BufferLoad, or BufferRegion. A BufferLoad only marks a single starting
point, so if one argument is a BufferLoad the other must not be one,
since the copy region is taken from the other argument.
Loads detected to be out of bounds are zero-padded.
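A sketch of the common pattern: the destination is a full Buffer, so its shape defines the copy region, while the BufferLoad argument only marks the starting point in global memory. Names and sizes are illustrative.

```python
import tilelang.language as T

M, N = 1024, 1024
block_M, block_N = 128, 128

@T.prim_func
def copy_tile(A: T.Buffer((M, N), "float16"), B: T.Buffer((M, N), "float16")):
    with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
        A_shared = T.alloc_shared((block_M, block_N), "float16")
        # A[...] is a BufferLoad (a starting point); the copy extent comes from
        # the shape of A_shared. Out-of-bounds elements are zero-padded.
        T.copy(A[by * block_M, bx * block_N], A_shared)
        # Write the staged tile back out to global memory.
        T.copy(A_shared, B[by * block_M, bx * block_N])
```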
T.gemm
------
args: A, B, C, transpose_A, transpose_B, policy
Performs a GEMM on A, B, and C. C must be a fragment, B must be in
shared memory, and A can be either a fragment or in shared memory.
Note that the current implementation has some shape and dtype
constraints; for example, the length of the reduction axis must be a
multiple of 32 for fp16 multiplicands. We will relax this later.
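A single-tile sketch: with K equal to one reduction tile, a single T.gemm covers the whole reduction (K = 32 also satisfies the fp16 multiple-of-32 constraint above). Names and sizes are illustrative.

```python
import tilelang.language as T

M, N, K = 1024, 1024, 32              # K == one reduction tile
block_M, block_N = 128, 128

@T.prim_func
def single_tile_gemm(A: T.Buffer((M, K), "float16"),
                     B: T.Buffer((K, N), "float16"),
                     C: T.Buffer((M, N), "float32")):
    with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
        A_shared = T.alloc_shared((block_M, K), "float16")
        B_shared = T.alloc_shared((K, block_N), "float16")
        C_local = T.alloc_fragment((block_M, block_N), "float32")
        T.copy(A[by * block_M, 0], A_shared)
        T.copy(B[0, bx * block_N], B_shared)
        T.clear(C_local)
        # C_local is a fragment, B_shared is in shared memory;
        # A_shared could also have been a fragment.
        T.gemm(A_shared, B_shared, C_local)
        T.copy(C_local, C[by * block_M, bx * block_N])
```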
T.reduce_max T.reduce_sum
-------------------------
args: src, dst, dim
Performs a reduction from src to dst along dimension dim. Currently
both src and dst must be fragments.
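A sketch computing per-row maxima of a tile held in a fragment; the 1-D grid and the dim=1 reduction follow the pattern used in the flash-attention examples, but the names and sizes here are illustrative.

```python
import tilelang.language as T

M, N = 8192, 128
block_M = 128

@T.prim_func
def row_max(A: T.Buffer((M, N), "float32"), Out: T.Buffer((M,), "float32")):
    with T.Kernel(T.ceildiv(M, block_M), threads=128) as bx:
        A_local = T.alloc_fragment((block_M, N), "float32")
        max_local = T.alloc_fragment((block_M,), "float32")
        T.copy(A[bx * block_M, 0], A_local)
        # Reduce along dim=1: each of the block_M rows collapses to one value.
        T.reduce_max(A_local, max_local, dim=1)
        T.copy(max_local, Out[bx * block_M])
```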
T.Parallel
----------
You can use T.Parallel to write a loop. The compiler partitions the
loop across all the threads, taking the vectorization width, the
fragments' thread mappings, and so on into account. Note that this is
the only way to perform arbitrary operations on fragments.
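A sketch of an elementwise update on a fragment; the compiler decides how the (block_M, block_N) iteration space is split over the threads. Names and sizes are illustrative.

```python
import tilelang.language as T

M, N = 1024, 1024
block_M, block_N = 128, 128

@T.prim_func
def add_one(A: T.Buffer((M, N), "float32"), B: T.Buffer((M, N), "float32")):
    with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
        A_local = T.alloc_fragment((block_M, block_N), "float32")
        T.copy(A[by * block_M, bx * block_N], A_local)
        # The loop is partitioned across the 128 threads by the compiler,
        # which also chooses the vectorization width and thread mapping.
        for i, j in T.Parallel(block_M, block_N):
            A_local[i, j] = A_local[i, j] + 1.0
        T.copy(A_local, B[by * block_M, bx * block_N])
```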
T.Pipelined
-----------
args: start, stop, num_stages
Pipelines the loop: copies from global memory are converted to
asynchronous operations and reordered to after the point where the
copied data is consumed. num_stages is the number of buffers between
producer and consumer (e.g. double buffering when num_stages=2).
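A sketch of the canonical pipelined matmul K-loop (close to the README example): num_stages=2 double-buffers the shared-memory tiles, so the copies for the next stage overlap with the T.gemm consuming the current one. Names and sizes are illustrative.

```python
import tilelang.language as T

M, N, K = 1024, 1024, 1024
block_M, block_N, block_K = 128, 128, 32

@T.prim_func
def matmul_pipelined(A: T.Buffer((M, K), "float16"),
                     B: T.Buffer((K, N), "float16"),
                     C: T.Buffer((M, N), "float32")):
    with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
        A_shared = T.alloc_shared((block_M, block_K), "float16")
        B_shared = T.alloc_shared((block_K, block_N), "float16")
        C_local = T.alloc_fragment((block_M, block_N), "float32")
        T.clear(C_local)
        # The global->shared copies become async and are overlapped with the
        # T.gemm that consumes the previous stage (double buffering).
        for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=2):
            T.copy(A[by * block_M, ko * block_K], A_shared)
            T.copy(B[ko * block_K, bx * block_N], B_shared)
            T.gemm(A_shared, B_shared, C_local)
        T.copy(C_local, C[by * block_M, bx * block_N])
```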
T.clear T.fill
--------------
Nothing special; they are lowered to T.Parallel loops.
T.use_swizzle
-------------
An optimization for the L2 cache: the launch order of blockIdx.x and
blockIdx.y is serpentined.
Add it inside a kernel after all buffers have been allocated.
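A sketch placing T.use_swizzle after the buffer allocations; the panel_size keyword follows the repository examples and controls how many rows of blocks form one serpentine panel. Names and sizes are illustrative.

```python
import tilelang.language as T

M, N = 8192, 8192
block_M, block_N = 128, 128

@T.prim_func
def copy_swizzled(A: T.Buffer((M, N), "float16"), B: T.Buffer((M, N), "float16")):
    with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
        A_shared = T.alloc_shared((block_M, block_N), "float16")
        # Serpentine the blockIdx.x / blockIdx.y launch order so that
        # neighbouring blocks reuse data resident in L2.
        T.use_swizzle(panel_size=8)
        T.copy(A[by * block_M, bx * block_N], A_shared)
        T.copy(A_shared, B[by * block_M, bx * block_N])
```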
Annotate Memory Layout
=======================
Just In Time Compilation
=========================
Pipelining Computation and Data Movement
========================================
Writing High-Performance Kernels with Thread Primitives
=======================================================
Writing High-Performance Kernels with the Tile Library
======================================================