* fix bug for NVIDIA V100
* hard-code the supported dict for different architectures.
* add CUDA FLOPS performance benchmark.