| `cudaErrorInvalidValue` on launch | Pointer counts mismatch (`nb`, `nl`, `no`) or non-contiguous input |
| Wrong values when using HND layout | Inner tensors not permuted to `[nh, nt, hd]` before passing in |
| Python bindings complain about dtype | Mixed precision in a batch; convert tensors to a common dtype |
| Kernels take unexpected time | Verify that `CUDA_ARCHS` matches your GPU to avoid JIT at runtime |
-`backend="auto"` defaults to the fused kernel, then `cudaMemcpyBatchAsync`, then `cudaMemcpyAsync`. Override if you want to benchmark a specific path.