- 27 Jan, 2026 2 commits
-
-
gongchensu authored
  - Ensure embedding tensors are on the same device. Change format.
  - Optimize embedding kernel with vectorized memory access and __ldg
  - Add vectorized memory access using float4/float2, half2, and bfloat162
  - Use __ldg instruction for read-only weight and indices access
  - Add memory alignment checks to enable vectorized paths
  - Add __restrict__ keywords for better compiler optimization
  - Implement dynamic block size selection based on embedding_dim
-
wooway777 authored
-
- 21 Jan, 2026 1 commit
-
-
PanZezhong authored
-
- 12 Jan, 2026 2 commits
-
-
PanZezhong authored
-
PanZezhong authored
-
- 09 Jan, 2026 1 commit
-
-
PanZezhong authored
-
- 08 Jan, 2026 1 commit
-
-
zhushuang authored
-
- 30 Dec, 2025 2 commits
-
-
PanZezhong authored
-
zhushuang authored
-
- 29 Dec, 2025 1 commit
-
-
zhushuang authored
-
- 26 Dec, 2025 2 commits
-
-
qinyiqun authored
* can commit * can exec sm_90a * can exec < sm_90 * fix format * fix format * add tests, benchmarked against the sglang test * fix format 1 * fix format 2 * add compile option to disable cutlass
-
PanZezhong1725 authored
This reverts commit 25258029.
-
- 25 Dec, 2025 2 commits
- 24 Dec, 2025 2 commits
- 19 Dec, 2025 1 commit
-
-
pengcheng888 authored
-
- 11 Dec, 2025 2 commits
- 10 Dec, 2025 2 commits
-
-
Ceng23333 authored
Signed-off-by: Ceng23333 <441651826@qq.com>
-
thatPepe authored
* issue/739 - support batched RoPE on Nvidia and CPU * issue/739 - metax, moore batched rope * issue/739 - adjust metax flags * issue/739 - added a rope module interface to forward inplace in output tensor
-
- 08 Dec, 2025 1 commit
-
-
wooway777 authored
-
- 04 Dec, 2025 2 commits
- 29 Nov, 2025 1 commit
-
-
Zhao Shijie authored
-
- 28 Nov, 2025 4 commits
- 26 Nov, 2025 1 commit
-
-
zhuyue authored
-
- 22 Nov, 2025 1 commit
-
-
zhuyue authored
- Implement Moore backend for add, mul, and silu elementwise operations - Filter unsupported dtypes (BF16, F64) for Moore platform in tests
-
- 21 Nov, 2025 5 commits
-
-
zhuyue authored
-
zhuyue authored
-
zhuyue authored
-
zhuyue authored
-
qinyiqun authored
* ISSUE/628 Adapt to the QY C610 GPU, add compile options, and adapt the existing operators. Add the operators required by bge-type models, including gelu, layer_norm, lp_norm (supporting l1 and l2 norm), relu, softmax, and tanh. --------- Co-authored-by: xgqdut2016 <kenan_gewei@163.com> Co-authored-by: xgqdut2016 <140036308+xgqdut2016@users.noreply.github.com>
-
- 20 Nov, 2025 2 commits
-
-
crapromer authored
* issue/636 - add support for fp8 with maca sdk * issue/636 - add functional header to support Fn * issue/636 - format code with clang
-
crapromer authored
* initial add mc support for meta * add command description for maca compilation * rebase metax maca support to main * issue/445 - clang format code on ubuntu * issue/445 - change config from use_mc to use-mc and format code
-
- 19 Nov, 2025 1 commit
-
-
zhuyue authored
-
- 07 Nov, 2025 1 commit
-
-
crapromer authored
* fix compile bug on cuda 13.0 * issue/367 - clang format code on ubuntu --------- Co-authored-by: root <root@Crapromer>
-