- 05 Mar, 2026 1 commit
wooway777 authored
- 27 Jan, 2026 1 commit
gongchensu authored
- Ensure embedding tensors are on the same device. Change format.
  - Optimize embedding kernel with vectorized memory access and __ldg
    - Add vectorized memory access using float4/float2, half2, and bfloat162
    - Use __ldg instruction for read-only weight and indices access
    - Add memory alignment checks to enable vectorized paths
    - Add __restrict__ keywords for better compiler optimization
    - Implement dynamic block size selection based on embedding_dim
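The alignment checks and the block size selection listed in this commit are host-side decisions, so they can be sketched in plain C++ without the CUDA toolkit. The sketch below is an illustration only, not the code from this commit: the helper names `pick_vector_width` and `pick_block_size`, the 32..512 block range, and the power-of-two growth rule are all assumptions. The underlying rule is standard, though: a float4 load needs a 16-byte aligned address, a float2 load needs 8 bytes, and a half2 or bfloat162 load needs 4 bytes, and every row must start on such a boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the commit): returns how many elements one
// vectorized access may cover. Pass max_lanes = 4 for float (float4, then
// float2) and max_lanes = 2 for 16-bit types (half2 / bfloat162).
inline int pick_vector_width(const void* weight, const void* out,
                             std::size_t embedding_dim, std::size_t elem_size,
                             int max_lanes) {
    const auto w = reinterpret_cast<std::uintptr_t>(weight);
    const auto o = reinterpret_cast<std::uintptr_t>(out);
    for (int lanes = max_lanes; lanes > 1; lanes /= 2) {
        const std::size_t n = static_cast<std::size_t>(lanes);
        const std::size_t bytes = n * elem_size;
        // Both base pointers must be aligned to the vector size, and the row
        // length must be a multiple of the lane count so that every row
        // (base + row * embedding_dim) starts on an aligned address too.
        if (w % bytes == 0 && o % bytes == 0 && embedding_dim % n == 0) {
            return lanes;
        }
    }
    return 1;  // scalar fallback path
}

// Hypothetical helper (not from the commit): chooses a power-of-two block
// size so that one block covers a row in roughly one pass. The limits of
// 32 and 512 threads are assumed values for illustration.
inline int pick_block_size(std::size_t embedding_dim, int lanes) {
    assert(lanes >= 1);
    const std::size_t n = static_cast<std::size_t>(lanes);
    const std::size_t work = (embedding_dim + n - 1) / n;  // vector accesses per row
    int block = 32;  // at least one warp
    while (block < 512 && static_cast<std::size_t>(block) < work) {
        block *= 2;
    }
    return block;
}
```

Under these assumptions, a float table with embedding_dim = 768 and 16-byte aligned buffers would take the 4-lane path and launch 256 threads per block, while a misaligned or odd-length row falls back to scalar accesses.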