[Bugfix] Fix layout conflict issue for gqa decoding examples (#314)
* Remove logging statement from LoopVectorizerDynamic Substitute method for cleaner output. * Refactor flashattn example to improve CUDA configuration handling - Updated the `flashattn` function in `example_gqa_decode.py` to utilize a heuristic configuration based on CUDA device capabilities, enhancing compatibility with different architectures. - Replaced local variable allocations with more efficient constructs and removed unnecessary logging statements for cleaner output. - Adjusted the `do_bench` method call to streamline performance profiling. * lint fix
Showing
Please register or sign in to comment