1. 27 Jan, 2026 1 commit
      issue/846 - Refactor embedding to support device-side input and CUDA graph recording · cc2cc3a1
      gongchensu authored
      - Ensure embedding tensors are on the same device. Change format.
      - Optimize embedding kernel with vectorized memory access and __ldg
      - Add vectorized memory access using float4/float2, half2, and bfloat162
      - Use __ldg instruction for read-only weight and indices access
      - Add memory alignment checks to enable vectorized paths
      - Add __restrict__ keywords for better compiler optimization
      - Implement dynamic block size selection based on embedding_dim
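The alignment checks and block-size selection listed in this commit message could look roughly like the host-side sketch below. It is not taken from cc2cc3a1: the helper names (`is_aligned`, `pick_vector_width_f32`, `pick_block_size`), the 16-byte and 8-byte requirements for the float4 and float2 paths, and the 32 to 256 thread range are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if ptr sits on a `bytes`-byte boundary.
inline bool is_aligned(const void *ptr, std::size_t bytes) {
    return reinterpret_cast<std::uintptr_t>(ptr) % bytes == 0;
}

// Hypothetical helper: pick how many floats one thread moves per access.
// 4 -> float4 (16-byte) path, 2 -> float2 (8-byte) path, 1 -> scalar fallback.
// If the base pointer is aligned and embedding_dim is a multiple of the
// vector width, every row start is aligned as well.
inline int pick_vector_width_f32(const void *weight, const void *out,
                                 std::size_t embedding_dim) {
    if (embedding_dim % 4 == 0 && is_aligned(weight, 16) && is_aligned(out, 16)) {
        return 4;
    }
    if (embedding_dim % 2 == 0 && is_aligned(weight, 8) && is_aligned(out, 8)) {
        return 2;
    }
    return 1;
}

// Hypothetical helper: choose a power-of-two block size between one warp (32)
// and 256 threads, large enough to cover one row's vector accesses if possible.
inline int pick_block_size(std::size_t embedding_dim, int vector_width) {
    const std::size_t width = static_cast<std::size_t>(vector_width);
    const std::size_t accesses_per_row = (embedding_dim + width - 1) / width;
    int block = 32;
    while (block < 256 && static_cast<std::size_t>(block) < accesses_per_row) {
        block *= 2;
    }
    return block;
}
```

Under these assumptions, a launcher would call `pick_vector_width_f32` once per embedding call, dispatch to the matching kernel variant, and pass the result to `pick_block_size` so that short rows do not launch mostly idle blocks.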
  2. 30 Dec, 2025 1 commit
  3. 29 Dec, 2025 1 commit
  4. 24 Dec, 2025 1 commit
  5. 21 Nov, 2025 1 commit
  6. 28 Oct, 2025 1 commit
  7. 23 Oct, 2025 1 commit
  8. 16 Oct, 2025 1 commit
  9. 29 Sep, 2025 1 commit
  10. 23 Sep, 2025 1 commit
  11. 16 Sep, 2025 1 commit
  12. 10 Sep, 2025 1 commit
  13. 02 Sep, 2025 1 commit
  14. 07 Jul, 2025 1 commit
  15. 27 Jun, 2025 1 commit
      issue/205 - Add Sub operator · 2ccf1d9d
      Pepe authored
      issue/205 - Add the Sub operator's header file, CPU implementation, CUDA implementation, and Python tests
  16. 06 May, 2025 1 commit
  17. 28 Apr, 2025 1 commit
  18. 25 Apr, 2025 2 commits
  19. 08 Apr, 2025 1 commit
  20. 21 Mar, 2025 1 commit
  21. 05 Mar, 2025 2 commits
  22. 21 Feb, 2025 1 commit
  23. 11 Feb, 2025 1 commit