1. 05 Mar, 2026 1 commit
  2. 27 Jan, 2026 1 commit
      issue/846 - Refactor embedding to support device-side input and CUDA graph recording · cc2cc3a1
      gongchensu authored
      - Ensure embedding tensors are on the same device. Change format.
      - Optimize embedding kernel with vectorized memory access and __ldg
      - Add vectorized memory access using float4/float2, half2, and bfloat162
      - Use the __ldg intrinsic for read-only weight and index loads
      - Add memory alignment checks to enable vectorized paths
      - Add __restrict__ qualifiers to pointer parameters for better compiler optimization
      - Implement dynamic block size selection based on embedding_dim
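The alignment check and the block size selection named in the commit message can be illustrated with a minimal host-side sketch. This is not the code from the commit: the function names `pick_vector_width` and `pick_block_size`, the 32 to 256 thread range, and the power-of-two growth rule are all assumptions made for illustration. The sketch only decides which path a launcher could take. The device-side vectorized loads (float4/float2, half2, bfloat162 through `__ldg`) are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the commit): returns how many elements can be
// moved per vectorized load/store, or 1 if only the scalar path is safe.
// elem_size is the element size in bytes: 4 for float, 2 for half/bfloat16.
inline int pick_vector_width(const void* weight, const void* out,
                             std::size_t embedding_dim, std::size_t elem_size) {
    assert(elem_size == 2 || elem_size == 4);
    const auto w = reinterpret_cast<std::uintptr_t>(weight);
    const auto o = reinterpret_cast<std::uintptr_t>(out);
    // float can use float4 (16 bytes) or float2 (8 bytes);
    // 16-bit types use half2 / bfloat162 (4 bytes).
    const int max_width = (elem_size == 4) ? 4 : 2;
    for (int width = max_width; width > 1; width /= 2) {
        const std::size_t vec_bytes = static_cast<std::size_t>(width) * elem_size;
        // Both base pointers must be aligned to the vector size.
        const bool ptrs_ok = (w % vec_bytes == 0) && (o % vec_bytes == 0);
        // Every row must hold a whole number of vectors, which also keeps
        // each row start aligned when rows are stored contiguously.
        const bool rows_ok = (embedding_dim % static_cast<std::size_t>(width) == 0);
        if (ptrs_ok && rows_ok) return width;
    }
    return 1;  // fall back to scalar loads
}

// Hypothetical helper (not from the commit): picks a power-of-two block size
// between one warp (32) and an assumed cap of 256 threads, large enough to
// cover the vectorized work items of one embedding row where possible.
inline int pick_block_size(std::size_t embedding_dim, int vector_width) {
    assert(vector_width >= 1);
    const std::size_t work_items =
        embedding_dim / static_cast<std::size_t>(vector_width);
    std::size_t block = 32;
    while (block < 256 && block < work_items) block *= 2;
    return static_cast<int>(block);
}
```

Under these assumptions, a launcher would call `pick_vector_width` once per invocation, dispatch to the matching kernel variant, and pass the result of `pick_block_size` as the thread count per block. Small embedding dimensions then avoid launching mostly idle threads.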