-
Lei Wang authored
* Remove logging statement from LoopVectorizerDynamic Substitute method for cleaner output. * Refactor flashattn example to improve CUDA configuration handling - Updated the `flashattn` function in `example_gqa_decode.py` to utilize a heuristic configuration based on CUDA device capabilities, enhancing compatibility with different architectures. - Replaced local variable allocations with more efficient constructs and removed unnecessary logging statements for cleaner output. - Adjusted the `do_bench` method call to streamline performance profiling. * lint fix
0fd82ed5