---
id: ref-triton-distributed
repo: ByteDance-Seed/Triton-distributed
title: Triton-distributed
url: https://github.com/ByteDance-Seed/Triton-distributed
source_type: source-reference
source_category: open-triton-kernel-library
architectures:
- amd
- nvidia
- rocm
- dcu
tags:
- triton
- distributed
- communication-overlap
- allreduce
- allgather
- reduce-scatter
- gemm
- moe
- flash-decode
- amd
- nvidia
techniques:
- compute-communication-overlap
- gemm-allreduce
- allgather-gemm
- reduce-scatter-overlap
- distributed-kernel
hardware_features:
- wavefront
- lds
- mfma
- interconnect
kernel_types:
- gemm
- attention
- moe
- communication
languages:
- python
- triton
- cpp
captured_at: '2026-05-26'
license: not-captured
source_paths:
- python
- lib
- include
- csrc
- docs
- tests
- README.md
---
# Triton-distributed

- Repository: `ByteDance-Seed/Triton-distributed`
- Source: [ByteDance-Seed/Triton-distributed](https://github.com/ByteDance-Seed/Triton-distributed)
- Docs: [Triton-distributed kernels](https://triton-distributed.readthedocs.io/en/latest/kernels/index.html)

## Route Fit

Use Triton-distributed when the optimization touches tensor parallelism, expert
parallelism, MoE communication, GEMM + all-reduce, all-gather + GEMM,
reduce-scatter overlap, or distributed flash decode. It is not the first source
for single-kernel tuning, but it is a strong reference for overlap-aware Triton
design.
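
As a rough illustration of the all-gather + GEMM pattern named above, here is a minimal pure-Python sketch. It is not Triton-distributed's implementation. It assumes `A` is row-sharded across ranks and `B` is replicated; `matmul`, `allgather_gemm_order`, and `allgather_gemm` are hypothetical helpers, and the data is toy data. Each rank multiplies its local shard first and then consumes remote shards in ring order, which is the ordering that gives a real kernel room to hide transfers behind compute.

```python
def matmul(a, b):
    """Plain row-by-column product on nested lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def allgather_gemm_order(rank, world_size):
    """Shard order for one rank: local shard first, then ring neighbours."""
    return [(rank + step) % world_size for step in range(world_size)]


def allgather_gemm(shards, b, rank):
    """Compute the full A @ B on `rank`, one row shard at a time.

    In an overlapped kernel the tile for one shard would be computed while
    the next shard is still in flight; here the "transfer" is a list lookup.
    """
    world_size = len(shards)
    tiles = [None] * world_size
    for src in allgather_gemm_order(rank, world_size):
        tiles[src] = matmul(shards[src], b)  # output rows owned by `src`
    return [row for tile in tiles for row in tile]


# Two ranks, each owning one row of A (hypothetical toy data).
shards = [[[1, 2]], [[3, 4]]]
b = [[5, 6], [7, 8]]
full_a = [row for shard in shards for row in shard]
result = allgather_gemm(shards, b, rank=1)
```

The result matches a single GEMM over the gathered `A` regardless of which rank runs it; only the order in which tiles are produced changes.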

## What To Inspect

- Distributed kernel examples and docs for communication overlap patterns.
- Tests for shape and process-group assumptions.
- Backend support notes; separate AMD-compatible ideas from NVIDIA-only paths.
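
One way to start separating backend-specific code is a coarse text scan, sketched below. The marker lists are assumptions chosen for illustration, not taken from the repository, and `classify_backend` is a hypothetical helper; treat its output as a triage hint and confirm against the backend support notes.

```python
import re

# Hypothetical marker prefixes; extend them after reading the backend notes.
NVIDIA_MARKERS = ("nvshmem", "cuda", "nvlink")
AMD_MARKERS = ("rocshmem", "hip", "rocm")


def classify_backend(text):
    """Label a snippet as 'nvidia', 'amd', 'both', or 'neutral'.

    Matching is by token prefix, so 'hipMalloc' counts as an AMD marker
    while 'chip' does not.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())

    def hit(markers):
        return any(tok.startswith(m) for tok in tokens for m in markers)

    nvidia, amd = hit(NVIDIA_MARKERS), hit(AMD_MARKERS)
    if nvidia and amd:
        return "both"
    if nvidia:
        return "nvidia"
    if amd:
        return "amd"
    return "neutral"
```

Snippets labelled `neutral` are the first candidates for reuse; `nvidia` and `both` need a manual read before any porting decision.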

## DCU Use Notes

For DCU, confirm the communication backend in use, the process topology, and
that the expected kernels appear in a profiler trace before reusing overlap
patterns. Treat NVIDIA-specific launch or interconnect assumptions as
cross-platform inspiration only.
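
The three checks can be recorded as an explicit gate, as in the sketch below. `DcuPreflight`, its fields, and the kernel name in the usage example are all hypothetical; how each value is collected depends on the communication layer and profiler actually in use.

```python
from dataclasses import dataclass, field


@dataclass
class DcuPreflight:
    """Evidence gathered before reusing an overlap pattern on DCU."""

    backend: str = ""          # name reported by the communication layer
    world_size: int = 0        # total number of ranks
    ranks_per_node: int = 0    # ranks sharing one node
    profiled_kernels: list = field(default_factory=list)  # names from a trace

    def missing(self, expected_kernel):
        """Return the checks that still lack evidence."""
        gaps = []
        if not self.backend:
            gaps.append("communication backend")
        if self.world_size < 2 or self.ranks_per_node < 1:
            gaps.append("process topology")
        if not any(expected_kernel in name for name in self.profiled_kernels):
            gaps.append("profiler kernel presence")
        return gaps


# Hypothetical values for illustration only.
check = DcuPreflight(
    backend="example-backend",
    world_size=8,
    ranks_per_node=8,
    profiled_kernels=["ag_gemm_kernel_0"],
)
gaps = check.missing("ag_gemm")
```

An empty `gaps` list means all three items have some evidence attached; it does not by itself show that the overlap is effective.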

## Query Hooks

```bash
python3 scripts/query.py "triton distributed allreduce gemm" --type source-reference --compact
python3 scripts/query.py "triton distributed moe reduce scatter" --type source-reference --compact
python3 scripts/get_page.py ref-triton-distributed
```