    issue/846 - Refactor embedding to support device-side input and CUDA graph recording · cc2cc3a1
    gongchensu authored
    - Ensure embedding tensors are on the same device; reformat code.
    - Optimize embedding kernel with vectorized memory access and __ldg
    - Add vectorized memory access using float4/float2, half2, and bfloat162
    - Use the __ldg intrinsic for read-only weight and index loads
    - Add memory alignment checks to enable vectorized paths
    - Add __restrict__ keywords for better compiler optimization
    - Implement dynamic block size selection based on embedding_dim
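The alignment checks and block size selection listed above could look roughly like the host-side sketch below. It is not taken from the commit: the helper names (`is_aligned`, `pick_float_vector_width`, `pick_block_size`), the 32 to 512 thread range, and the power-of-two rounding are all assumptions made for illustration, and only the float path is shown.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if ptr is a multiple of `bytes` (a power of two).
inline bool is_aligned(const void *ptr, std::size_t bytes) {
    return reinterpret_cast<std::uintptr_t>(ptr) % bytes == 0;
}

// Hypothetical helper: number of floats moved per memory access.
// 4 -> float4 (16-byte) path, 2 -> float2 (8-byte) path, 1 -> scalar fallback.
// If the base pointer is aligned and embedding_dim is a multiple of the
// vector width, every row start (base + row * embedding_dim) is aligned too.
inline int pick_float_vector_width(const void *weight, const void *output,
                                   std::size_t embedding_dim) {
    if (embedding_dim % 4 == 0 && is_aligned(weight, 16) && is_aligned(output, 16)) {
        return 4;
    }
    if (embedding_dim % 2 == 0 && is_aligned(weight, 8) && is_aligned(output, 8)) {
        return 2;
    }
    return 1;
}

// Hypothetical helper: smallest power-of-two block size in [32, 512] that
// covers one row's vector elements. The bounds are assumed, not from the commit.
inline int pick_block_size(std::size_t embedding_dim, int vector_width) {
    std::size_t work = embedding_dim / static_cast<std::size_t>(vector_width);
    int block = 32;  // at least one warp
    while (block < 512 && static_cast<std::size_t>(block) < work) {
        block *= 2;
    }
    return block;
}
```

Under these assumptions, a 1024-wide float embedding with 16-byte-aligned buffers would take the float4 path with 256 threads per block, while an odd embedding_dim falls back to scalar loads.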
embedding.h 2.77 KB