@@ -6,7 +6,7 @@ Machete is a spiritual successor to the Marlin kernel but optimized for Hopper a
Machete effectively performs
```
```python
scale_type=w_s.dtype
compute_type=a.dtype
out=(w_q.to(scale_type)*w_s-w_z.to(scale_type))@a
...
...
@@ -24,7 +24,7 @@ applied.
The main optimization within Machete is prepacking the weight matrix to more closely match the tensor core layouts, allowing for wider shared memory loads when loading the weight matrix. This means that the weight matrix must be prepacked before calling `machete_gemm`. The flow looks something like: