Commit d9ae74c6 authored by Lei Wang, committed by LeiWang1999

[Doc] Phaseout Legacy documentations (#610)

- Added a new entry in the README for the introduction of `T.gemm_sp` supporting 2:4 sparse tensor core.
- Removed several outdated documentation files related to convolution, flash attention, and other tutorials to streamline the documentation structure.
parent be44758c
......@@ -13,6 +13,7 @@ Tile Language (**tile-lang**) is a concise domain-specific language designed to
<img src=./images/MatmulExample.png />
## Latest News
- 07/04/2025 🚀: Introduced `T.gemm_sp` for 2:4 sparse tensor core support, check out [Pull Request #526](https://github.com/tile-ai/tilelang/pull/526) for details.
- 06/05/2025 ✨: Added [NVRTC Backend](https://github.com/tile-ai/tilelang/pull/461) to significantly reduce compilation time for CuTe templates!
- 04/14/2025 🚀: Added high-performance FlashMLA implementation for AMD MI300X, achieving performance parity with hand-optimized assembly kernels of Aiter! See [example_mla_amd](./examples/deepseek_mla/amd/README.md) for details.
- 03/03/2025 🚀: Added high-performance MLA Decoding support using only 80 lines of Python code, achieving performance on par with FlashMLA on H100 (see [example_mla_decode.py](./examples/deepseek_mla/example_mla_decode.py))! We also provide [documentation](./examples/deepseek_mla/README.md) explaining how TileLang achieves this.
......
Flash Linear Attention
======================
General Matrix-Matrix Multiplication with Dequantization
=========================================================
TMAC: Look Up Table Based Mixed Precision Computing
====================================================
......@@ -21,13 +21,8 @@ get_started/overview
:maxdepth: 1
:caption: TUTORIALS
tutorials/writing_kernels_with_tilelibrary
tutorials/writing_kernels_with_thread_primitives
tutorials/annotate_memory_layout
tutorials/debug_tools_for_tilelang
tutorials/auto_tuning
tutorials/jit_compilation
tutorials/pipelining_computations_and_data_movements
:::
:::{toctree}
......@@ -37,23 +32,9 @@ tutorials/pipelining_computations_and_data_movements
deeplearning_operators/elementwise
deeplearning_operators/gemv
deeplearning_operators/matmul
deeplearning_operators/matmul_dequant
deeplearning_operators/flash_attention
deeplearning_operators/flash_linear_attention
deeplearning_operators/deepseek_mla
deeplearning_operators/convolution
deeplearning_operators/tmac_gpu
:::
:::{toctree}
:maxdepth: 2
:caption: LANGUAGE REFERENCE
language_ref/ast
language_ref/primitives
language_ref/tilelibrary
:::
:::{toctree}
:maxdepth: 1
:caption: API Reference
......
Tile Language AST
==================
Tile Language: Primitives
=========================
Tile Language: TileLibrary
==========================
T.Kernel
--------
args: the grid size (0 to 3 dimensions) and the number of threads.
returns: the blockIdx variables
Launches a kernel; it must be used in a with statement. Multiple
kernels can be launched sequentially inside a prim function.
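A minimal sketch of a kernel launch, assuming the Python frontend (`import tilelang.language as T`) and the `threads=` keyword used in the repository examples; the function name and tile size are illustrative.

```python
import tilelang.language as T

M, N = 1024, 1024   # hypothetical problem size
block = 128

@T.prim_func
def fill_ones(A: T.Buffer((M, N), "float32")):
    # Grid of ceildiv(N, 128) x ceildiv(M, 128) blocks, 128 threads per block;
    # bx and by are the blockIdx variables returned by T.Kernel.
    with T.Kernel(T.ceildiv(N, block), T.ceildiv(M, block), threads=128) as (bx, by):
        for i, j in T.Parallel(block, block):
            A[by * block + i, bx * block + j] = 1.0
```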
T.alloc_shared
--------------
args: shape, dtype
returns: Buffer
Allocates a buffer in shared memory. It must be used within a T.Kernel
scope and should be allocated at the top of that scope.
Dynamic shared memory is used. A combined allocation sketch follows the
T.alloc_fragment entry below.
T.alloc_fragment
----------------
args: shape, dtype
returns: Buffer
Allocates a buffer in register memory. It must be used within a
T.Kernel scope and should be allocated at the top of that scope.
The shape is the whole shape of the buffer. Its elements are
distributed across the threads' registers; this storage partition is
inferred by the compiler.
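A minimal allocation sketch covering both T.alloc_shared and T.alloc_fragment; the names and tile sizes are illustrative, and the dtypes follow the common fp16-input / fp32-accumulator pattern from the examples.

```python
import tilelang.language as T

M, N, K = 1024, 1024, 1024            # hypothetical sizes
block_M, block_N, block_K = 128, 128, 32

@T.prim_func
def alloc_example(A: T.Buffer((M, K), "float16"), B: T.Buffer((K, N), "float16")):
    with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
        # Tiles staged in (dynamic) shared memory, declared at the top of the scope.
        A_shared = T.alloc_shared((block_M, block_K), "float16")
        B_shared = T.alloc_shared((block_K, block_N), "float16")
        # Accumulator held in registers; how its elements are partitioned
        # across the 128 threads is inferred by the compiler.
        C_local = T.alloc_fragment((block_M, block_N), "float32")
        T.clear(C_local)
```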
T.copy
------
args: src, dst
Copies data from src to dst. Each of src and dst can be a Buffer,
BufferLoad, or BufferRegion. A BufferLoad only marks a single starting
point, so if one argument is a BufferLoad the other must not be one,
since the copy region is taken from the other argument.
Loads detected to be out of bounds are zero-padded.
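A sketch of the common pattern: the destination is a full Buffer, so its shape defines the copy region, while the BufferLoad argument only marks the starting point in global memory. Names and sizes are illustrative.

```python
import tilelang.language as T

M, N = 1024, 1024
block_M, block_N = 128, 128

@T.prim_func
def copy_tile(A: T.Buffer((M, N), "float16"), B: T.Buffer((M, N), "float16")):
    with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
        A_shared = T.alloc_shared((block_M, block_N), "float16")
        # A[...] is a BufferLoad (a starting point); the copy extent comes from
        # the shape of A_shared. Out-of-bounds elements are zero-padded.
        T.copy(A[by * block_M, bx * block_N], A_shared)
        # Write the staged tile back out to global memory.
        T.copy(A_shared, B[by * block_M, bx * block_N])
```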
T.gemm
------
args: A, B, C, transpose_A, transpose_B, policy
Performs a GEMM on A, B, and C. C must be a fragment, B must be in
shared memory, and A can be either a fragment or in shared memory.
Note that the current implementation has some shape and dtype
constraints; for example, the length of the reduction axis must be a
multiple of 32 for fp16 multiplicands. We will relax this later.
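A single-tile sketch: with K equal to one reduction tile, a single T.gemm covers the whole reduction (K = 32 also satisfies the fp16 multiple-of-32 constraint above). Names and sizes are illustrative.

```python
import tilelang.language as T

M, N, K = 1024, 1024, 32              # K == one reduction tile
block_M, block_N = 128, 128

@T.prim_func
def single_tile_gemm(A: T.Buffer((M, K), "float16"),
                     B: T.Buffer((K, N), "float16"),
                     C: T.Buffer((M, N), "float32")):
    with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
        A_shared = T.alloc_shared((block_M, K), "float16")
        B_shared = T.alloc_shared((K, block_N), "float16")
        C_local = T.alloc_fragment((block_M, block_N), "float32")
        T.copy(A[by * block_M, 0], A_shared)
        T.copy(B[0, bx * block_N], B_shared)
        T.clear(C_local)
        # C_local is a fragment, B_shared is in shared memory;
        # A_shared could also have been a fragment.
        T.gemm(A_shared, B_shared, C_local)
        T.copy(C_local, C[by * block_M, bx * block_N])
```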
T.reduce_max T.reduce_sum
-------------------------
args: src, dst, dim
Performs a reduction from src to dst along dimension dim. Currently
both src and dst must be fragments.
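A sketch computing per-row maxima of a tile held in a fragment; the 1-D grid and the dim=1 reduction follow the pattern used in the flash-attention examples, but the names and sizes here are illustrative.

```python
import tilelang.language as T

M, N = 8192, 128
block_M = 128

@T.prim_func
def row_max(A: T.Buffer((M, N), "float32"), Out: T.Buffer((M,), "float32")):
    with T.Kernel(T.ceildiv(M, block_M), threads=128) as bx:
        A_local = T.alloc_fragment((block_M, N), "float32")
        max_local = T.alloc_fragment((block_M,), "float32")
        T.copy(A[bx * block_M, 0], A_local)
        # Reduce along dim=1: each of the block_M rows collapses to one value.
        T.reduce_max(A_local, max_local, dim=1)
        T.copy(max_local, Out[bx * block_M])
```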
T.Parallel
----------
You can use T.Parallel to write a loop. The compiler partitions the
loop across all the threads, taking the vectorization width, the
fragments' thread mappings, and so on into account. Note that this is
the only way to perform arbitrary operations on fragments.
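A sketch of an elementwise update on a fragment; the compiler decides how the (block_M, block_N) iteration space is split over the threads. Names and sizes are illustrative.

```python
import tilelang.language as T

M, N = 1024, 1024
block_M, block_N = 128, 128

@T.prim_func
def add_one(A: T.Buffer((M, N), "float32"), B: T.Buffer((M, N), "float32")):
    with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
        A_local = T.alloc_fragment((block_M, block_N), "float32")
        T.copy(A[by * block_M, bx * block_N], A_local)
        # The loop is partitioned across the 128 threads by the compiler,
        # which also chooses the vectorization width and thread mapping.
        for i, j in T.Parallel(block_M, block_N):
            A_local[i, j] = A_local[i, j] + 1.0
        T.copy(A_local, B[by * block_M, bx * block_N])
```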
T.Pipelined
-----------
args: start, stop, num_stages
Pipelines the loop: copies from global memory are converted to
asynchronous operations and reordered to after the point where the
copied data is consumed. num_stages is the number of buffers between
producer and consumer (e.g. double buffering when num_stages=2).
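A sketch of the canonical pipelined matmul K-loop (close to the README example): num_stages=2 double-buffers the shared-memory tiles, so the copies for the next stage overlap with the T.gemm consuming the current one. Names and sizes are illustrative.

```python
import tilelang.language as T

M, N, K = 1024, 1024, 1024
block_M, block_N, block_K = 128, 128, 32

@T.prim_func
def matmul_pipelined(A: T.Buffer((M, K), "float16"),
                     B: T.Buffer((K, N), "float16"),
                     C: T.Buffer((M, N), "float32")):
    with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
        A_shared = T.alloc_shared((block_M, block_K), "float16")
        B_shared = T.alloc_shared((block_K, block_N), "float16")
        C_local = T.alloc_fragment((block_M, block_N), "float32")
        T.clear(C_local)
        # The global->shared copies become async and are overlapped with the
        # T.gemm that consumes the previous stage (double buffering).
        for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=2):
            T.copy(A[by * block_M, ko * block_K], A_shared)
            T.copy(B[ko * block_K, bx * block_N], B_shared)
            T.gemm(A_shared, B_shared, C_local)
        T.copy(C_local, C[by * block_M, bx * block_N])
```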
T.clear T.fill
--------------
Nothing special; they are lowered to T.Parallel loops.
T.use_swizzle
-------------
An optimization for the L2 cache: the launch order of blockIdx.x and
blockIdx.y is serpentined.
Add it inside a kernel after all buffers have been allocated.
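A sketch placing T.use_swizzle after the buffer allocations; the panel_size keyword follows the repository examples and controls how many rows of blocks form one serpentine panel. Names and sizes are illustrative.

```python
import tilelang.language as T

M, N = 8192, 8192
block_M, block_N = 128, 128

@T.prim_func
def copy_swizzled(A: T.Buffer((M, N), "float16"), B: T.Buffer((M, N), "float16")):
    with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
        A_shared = T.alloc_shared((block_M, block_N), "float16")
        # Serpentine the blockIdx.x / blockIdx.y launch order so that
        # neighbouring blocks reuse data resident in L2.
        T.use_swizzle(panel_size=8)
        T.copy(A[by * block_M, bx * block_N], A_shared)
        T.copy(A_shared, B[by * block_M, bx * block_N])
```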
Annotate Memory Layout
=======================
Just In Time Compilation
=========================
Pipelining Computation and Data Movement
========================================
Writing High-Performance Kernels with Thread Primitives
=======================================================
Writing High-Performance Kernels with the Tile Library
======================================================