Support flash attention 2 with causal masking when KV's seq length is longer than Q's seq length. (#436)
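
For illustration, a minimal sketch of what a causal mask can look like when the KV sequence is longer than the Q sequence. It assumes bottom-right alignment (the last query lines up with the last key), which is one common convention; the title does not say which alignment this change uses. The helper name `causal_mask` is hypothetical and not part of any library.

```python
def causal_mask(q_len: int, kv_len: int) -> list[list[bool]]:
    """Return mask[i][j] == True where query i may attend to key j.

    Assumption: bottom-right alignment, so the last query row lines up with
    the last key column and query i sees keys 0 .. i + (kv_len - q_len).
    """
    if kv_len < q_len:
        raise ValueError("this sketch only covers kv_len >= q_len")
    offset = kv_len - q_len  # extra keys that every query is allowed to see
    return [[j <= i + offset for j in range(kv_len)] for i in range(q_len)]


# Two new queries attending over five keys (e.g. three cached plus two new).
for row in causal_mask(q_len=2, kv_len=5):
    print("".join("1" if allowed else "0" for allowed in row))
# prints:
# 11110
# 11111
```

With `q_len == kv_len` the same function reduces to the usual lower-triangular causal mask.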