Commit 2298a1a4 authored by illsilin

sync from public

parents 965b7ba4 2f088b87
# fused-moe
Implementation of the fused-moe block operator using ck-tile. This is a scatter/gather, group-gemm based solution, similar to the [vllm moe](https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/benchmark_moe.py) kernels, but with more kernel fusion to boost performance.
![](misc/moe-0.png)
The benefits of this fused-moe:
* 1.5~2x performance boost compared with the current vllm solution
* zero workspace, reducing the memory footprint
* far fewer kernel instances, easier to maintain
# Implementation and feature support
## moe-sorting
This is a common pre-processing step before the actual moe-gemm. Its purpose is to transform the moe loop from token-by-token to expert-by-expert, making sure every workgroup works on a single expert (B matrix). In addition, we extend this op to zero the output buffer (which is later used as a reduction buffer with atomics).
## moe-gemm
`moe-gemm` is a group-gemm based back-to-back gemm, where the row id of each input token comes from another buffer. The naive way to understand fused-moe is the token-by-token view in the picture below:
![](misc/moe-1.png)
After `moe-sorting`, we can view this algorithm as expert-by-expert, as below:
![](misc/moe-2.png)
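To make this expert-by-expert view concrete, below is a minimal scalar reference (illustration only, not the ck-tile kernel): the function name `ref_fused_moe_gate_only`, the tanh-gelu activation and the plain `[e, n, k]` weight layout are assumptions chosen for readability. It gathers the rows routed to each expert, runs the gate gemm plus activation, then accumulates the `topk-weight`-scaled down gemm into the shared output, the step the real kernel performs with atomics.
```
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// tanh-based gelu approximation, standing in for the kernel's fast-gelu activation
inline float ref_gelu(float x)
{
    return 0.5f * x * (1.f + std::tanh(0.7978845608f * (x + 0.044715f * x * x * x)));
}

// a: [tokens, hidden], g: [experts, interm, hidden] (gate), d: [experts, hidden, interm] (down)
// topk_ids / topk_w: [tokens, topk], o: [tokens, hidden]
void ref_fused_moe_gate_only(const std::vector<float>& a, const std::vector<float>& g,
                             const std::vector<float>& d, const std::vector<int32_t>& topk_ids,
                             const std::vector<float>& topk_w, std::vector<float>& o,
                             int tokens, int hidden, int interm, int experts, int topk)
{
    std::fill(o.begin(), o.end(), 0.f);  // fused into moe-sorting in the real operator
    for(int e = 0; e < experts; ++e)     // expert-by-expert: one expert's weight per group of workgroups
        for(int t = 0; t < tokens; ++t)
            for(int k = 0; k < topk; ++k)
            {
                if(topk_ids[t * topk + k] != e)
                    continue;                         // gather: row t is routed to expert e
                const float w = topk_w[t * topk + k]; // per-token topk weight
                std::vector<float> y(interm);         // gemm-0 (gate) + activation
                for(int i = 0; i < interm; ++i)
                {
                    float acc = 0.f;
                    for(int x = 0; x < hidden; ++x)
                        acc += a[t * hidden + x] * g[(e * interm + i) * hidden + x];
                    y[i] = ref_gelu(acc);
                }
                for(int h = 0; h < hidden; ++h)       // gemm-1 (down), weighted accumulate into o
                {
                    float acc = 0.f;
                    for(int i = 0; i < interm; ++i)
                        acc += y[i] * d[(e * hidden + h) * interm + i];
                    o[t * hidden + h] += w * acc;     // the kernel does this with atomics
                }
            }
}
```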
## optimization
Summary of the key design points of this fused-moe operator:
* fuse the 2 group-gemms + activation + `topk-weight` multiply into a single kernel, using atomics for the 2nd gemm accumulation
* fuse buffer zeroing into `moe-sorting`, so users no longer need an extra zero-fill (e.g. torch.zeros()) of the output buffer
* fused scatter-gather for row indices (same as vllm)
* pre-shuffle the B matrix (weight) to maximize memory throughput; the input (activation) keeps its original `[batch, hidden]` layout
* extremely optimized pipeline using block inline-asm (we call it `micro-kernel` or `uk`), without breaking the *composable* design of ck
## indexing
```
// [indexing implementation-1]
// using M_a as constexpr block_size to partition all tokens into different slices
// each slice map to one expert, and one expert can have multiple slices
// e.g. num_experts = 6, topk=3, M_a = 4, input_tokens = 5
// before sort, topk_ids is : [[0, 3, 5], [2, 3, 5], [1, 3, 5], [1, 2, 3], [1, 3, 5]]
// tok-0 tok-1 tok-2 tok-3 tok-4
// topk_weight is : [[a, b, c], [d, e, f], [g, h, i], [j, k, l], [m, n, o]] (some float number)
//
// token_id_per_expert is : [[0], [2, 3, 4], [1, 3], [0, 1, 2, 3, 4], [], [0, 1, 2, 4]]
// (only for reference) exp-0 exp-1 exp-2 exp-3 exp-4 exp-5
// weight_id_per_expert is: [[a], [g, j, m], [d, k], [b, e, h, l, n], [], [c, f, i, o]]
//
// max_num_tokens_padded : topk * input_tokens + num_experts * (M_a - 1)
// * this is an upper bound and can be larger than the actual padded size, since the per-expert routing is only known on the GPU
//
// sorted_token_ids_ptr : [0, 6, 6, 6, 2, 3, 4, 6, 1, 3, 6, 6, 0, 1, 2, 3, 4, 6, 6, 6, 6, 6, 6, 6, 0, 1, 2, 4]
// |- exp-0 -|- exp-1 -|- exp-2 -|- exp-3 -|- exp-4 -|- exp-5 -|
// sorted_weight_ptr : [a, *, *, *, g, j, m, *, d, k, *, *, b, e, h, l, n, *, *, *, *, *, *, *, c, f, i, o]
//
// * length is max_num_tokens_padded, actual size is num_tokens_post_padded_ptr
//
// sorted_expert_ids_ptr : [0, 1, 2, 3, 3, 4, 5]
// * length is (max_num_tokens_padded + block_size - 1) / block_size
//
// num_tokens_post_padded_ptr : [28]
// num_sorted_tiles_ptr : [7]
//
// * different from vLLM
// 1) token_id stored in sorted_token_ids_ptr is actual token_id, not token_id*top_K expanded id
// 2) need sorted_weight_ptr
// 3) use num_sorted_tiles_ptr, already divided by M_a
//
// * below used for indexing
// 1) sorted_token_ids_ptr [max_num_tokens_padded]
// 2) sorted_weight_ptr
// 3) sorted_expert_ids_ptr
// 4) num_tokens_post_padded_ptr / num_sorted_tiles_ptr (select one)
//
// max_num_tokens_padded: topk_ids.numel() + num_experts * (block_size - 1)
```
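For reference, here is a host-side sketch of this index construction (not the ck-tile kernel, which runs on the GPU). Following the worked example above, an expert with no routed tokens still occupies one padded tile, and the padding sentinel is assumed to be any out-of-range token id (`tokens` is used here).
```
#include <cstdint>
#include <vector>

struct MoeSortRef
{
    std::vector<int32_t> sorted_token_ids;  // per-tile token ids, padded to multiples of M_a
    std::vector<float>   sorted_weight;     // matching topk weights (padding slots are don't-care)
    std::vector<int32_t> sorted_expert_ids; // one expert id per M_a-sized tile
    int32_t num_tokens_post_padded = 0;     // == sorted_token_ids.size()
    int32_t num_sorted_tiles       = 0;     // == sorted_expert_ids.size()
};

inline MoeSortRef moe_sorting_ref(const std::vector<int32_t>& topk_ids,  // [tokens, topk]
                                  const std::vector<float>& topk_weight, // [tokens, topk]
                                  int tokens, int num_experts, int topk, int M_a)
{
    MoeSortRef r;
    const int32_t pad_id = tokens; // any out-of-range id marks a padded slot
    for(int e = 0; e < num_experts; ++e)
    {
        int cnt = 0;
        for(int t = 0; t < tokens; ++t)
            for(int k = 0; k < topk; ++k)
                if(topk_ids[t * topk + k] == e)
                {
                    r.sorted_token_ids.push_back(t); // actual token id, not the t*topk+k expanded id
                    r.sorted_weight.push_back(topk_weight[t * topk + k]);
                    ++cnt;
                }
        // pad this expert's slice up to a multiple of M_a; an empty expert still gets one tile
        const int tiles = (cnt == 0) ? 1 : (cnt + M_a - 1) / M_a;
        for(int p = cnt; p < tiles * M_a; ++p)
        {
            r.sorted_token_ids.push_back(pad_id);
            r.sorted_weight.push_back(0.f);
        }
        for(int p = 0; p < tiles; ++p)
            r.sorted_expert_ids.push_back(e);
    }
    r.num_tokens_post_padded = static_cast<int32_t>(r.sorted_token_ids.size());
    r.num_sorted_tiles       = static_cast<int32_t>(r.sorted_expert_ids.size());
    return r;
}
```
On the worked example (tokens = 5, num_experts = 6, topk = 3, M_a = 4) this reproduces the 28-entry `sorted_token_ids_ptr`, the 7-entry `sorted_expert_ids_ptr` and `num_tokens_post_padded_ptr = 28`, up to the choice of padding value.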
// SPDX-License-Identifier: MIT
// Copyright (c) 2018-2024, Advanced Micro Devices, Inc. All rights reserved.
#pragma once
#include "fused_moesorting.hpp"
#include "fused_moegemm.hpp"
struct fused_moe_args
{
const void* a_ptr; // [m, k], input token
const void* a_scale_ptr; // [m, 1], token scale
const void* g_ptr; // [e, n, k]/[e, 2*n, k], pre-shuffle([e, nr, kr, w])
const void* d_ptr; // [e, n, k], pre-shuffle([e, nr, kr, w])
const void* g_scale_ptr; // [e, 1, n], gate(up) scale
const void* d_scale_ptr; // [e, 1, k], down scale
const void* y_smooth_scale_ptr; // [e, 1, n], smooth-quant-scale for 2nd gemm input
void* o_ptr; // [m, k], output token (no need to do zeroing)
const void* topk_ids_ptr; // [tokens, topk]
const void* topk_weight_ptr; // [tokens, topk]
void* sorted_token_ids_ptr; // [max_num_tokens_padded]
void* sorted_weight_ptr; // [max_num_tokens_padded]
void* sorted_expert_ids_ptr; // [(max_num_tokens_padded + block_size - 1) / block_size]
void* num_sorted_tiles_ptr; // [1]
ck_tile::index_t block_m; // block_m, used to divide the input
ck_tile::index_t hidden_size; // k
ck_tile::index_t intermediate_size; // n / TP, for Gate. if Gate+Up, Down need divide by 2
ck_tile::index_t num_tokens; // input number of tokens for current iteration
ck_tile::index_t num_experts; // number of groups
ck_tile::index_t topk; // need this?
ck_tile::index_t stride_token; // for input/output, stride for each row, should >= hidden_size
};
// This is the public API, will be generated by script
struct fused_moe_traits
{
std::string prec_i; // input precision
std::string prec_w; // weight precision
std::string prec_o; // output precision
std::string prec_st; // token scale data type
std::string prec_sw; // weight scale data type
std::string prec_sq; // smooth quant scale
std::string prec_kw; // topk-weight data type
int block_m;
int gate_only;
int fused_quant; // 0:no-sweep, 1:smooth-dynamic-quant, 2:dynamic-quant
};
float fused_moe(fused_moe_traits, fused_moe_args, const ck_tile::stream_config&);
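For orientation, a minimal host-side call sequence might look like the sketch below. It is illustrative only: it assumes the bf16, gate-only, `block_m = 32` configuration dispatched by the example api, passes nullptr for the quantization scales (fused_quant = 0), and expects the caller to have already allocated and filled every device buffer, including the `sorted_*` workspaces sized per the comments above.
```
// sketch only: all pointers are caller-allocated device buffers
#include "fused_moe.hpp"

float run_fused_moe_bf16(const void* a, const void* g, const void* d, void* o,
                         const void* topk_ids, const void* topk_weight,
                         void* sorted_ids, void* sorted_w, void* sorted_eids, void* num_tiles,
                         ck_tile::index_t tokens, ck_tile::index_t hidden,
                         ck_tile::index_t interm, ck_tile::index_t experts, ck_tile::index_t topk,
                         const ck_tile::stream_config& s)
{
    fused_moe_traits t{"bf16", "bf16", "bf16", "fp32", "fp32", "fp32", "fp32",
                       /*block_m=*/32, /*gate_only=*/1, /*fused_quant=*/0};
    fused_moe_args args{};
    args.a_ptr                 = a;
    args.a_scale_ptr           = nullptr; // scales assumed unused when fused_quant == 0
    args.g_ptr                 = g;       // pre-shuffled gate(+up) weight
    args.d_ptr                 = d;       // pre-shuffled down weight
    args.g_scale_ptr           = nullptr;
    args.d_scale_ptr           = nullptr;
    args.y_smooth_scale_ptr    = nullptr;
    args.o_ptr                 = o;       // no explicit zeroing needed, it is fused into moe-sorting
    args.topk_ids_ptr          = topk_ids;
    args.topk_weight_ptr       = topk_weight;
    args.sorted_token_ids_ptr  = sorted_ids;
    args.sorted_weight_ptr     = sorted_w;
    args.sorted_expert_ids_ptr = sorted_eids;
    args.num_sorted_tiles_ptr  = num_tiles;
    args.block_m               = 32;
    args.hidden_size           = hidden;
    args.intermediate_size     = interm;
    args.num_tokens            = tokens;
    args.num_experts           = experts;
    args.topk                  = topk;
    args.stride_token          = hidden;  // assumes densely packed rows
    return fused_moe(t, args, s);
}
```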
// SPDX-License-Identifier: MIT
// Copyright (c) 2018-2024, Advanced Micro Devices, Inc. All rights reserved.
#pragma once
#include "ck_tile/core.hpp"
#include "ck_tile/host/kernel_launch.hpp"
#include "ck_tile/ops/fused_moe.hpp"
#include <string>
// this is only a convenience structure for creating an example
// this is not part of the host API
template <typename I, typename W, typename O, typename ST, typename SW, typename SQ, typename KW>
struct FusedMoeGemmTypeConfig;
template <typename ST, typename SW, typename SQ, typename KW>
struct FusedMoeGemmTypeConfig<ck_tile::bf16_t, ck_tile::bf16_t, ck_tile::bf16_t, ST, SW, SQ, KW>
{
using ADataType = ck_tile::bf16_t;
using GDataType = ck_tile::bf16_t;
using DDataType = ck_tile::bf16_t;
using AccDataType = float;
using ODataType = ck_tile::bf16_t;
using AScaleDataType = ck_tile::remove_cvref_t<ST>;
using GScaleDataType = ck_tile::remove_cvref_t<SW>;
using DScaleDataType = ck_tile::remove_cvref_t<SW>;
using YSmoothScaleDataType = ck_tile::remove_cvref_t<SQ>;
using TopkWeightDataType = ck_tile::remove_cvref_t<KW>;
using IndexDataType = ck_tile::index_t;
};
template <typename ST, typename SW, typename SQ, typename KW>
struct FusedMoeGemmTypeConfig<ck_tile::fp16_t, ck_tile::fp16_t, ck_tile::fp16_t, ST, SW, SQ, KW>
{
using ADataType = ck_tile::fp16_t;
using GDataType = ck_tile::fp16_t;
using DDataType = ck_tile::fp16_t;
using AccDataType = float;
using ODataType = ck_tile::fp16_t;
using AScaleDataType = ck_tile::remove_cvref_t<ST>;
using GScaleDataType = ck_tile::remove_cvref_t<SW>;
using DScaleDataType = ck_tile::remove_cvref_t<SW>;
using YSmoothScaleDataType = ck_tile::remove_cvref_t<SQ>;
using TopkWeightDataType = ck_tile::remove_cvref_t<KW>;
using IndexDataType = ck_tile::index_t;
};
template <typename ST, typename SW, typename SQ, typename KW>
struct FusedMoeGemmTypeConfig<ck_tile::int8_t, ck_tile::int8_t, ck_tile::bf16_t, ST, SW, SQ, KW>
{
using ADataType = ck_tile::int8_t;
using GDataType = ck_tile::int8_t;
using DDataType = ck_tile::int8_t;
using AccDataType = int32_t;
using ODataType = ck_tile::bf16_t;
using AScaleDataType = ck_tile::remove_cvref_t<ST>;
using GScaleDataType = ck_tile::remove_cvref_t<SW>;
using DScaleDataType = ck_tile::remove_cvref_t<SW>;
using YSmoothScaleDataType = ck_tile::remove_cvref_t<SQ>;
using TopkWeightDataType = ck_tile::remove_cvref_t<KW>;
using IndexDataType = ck_tile::index_t;
};
// runtime args
struct fused_moegemm_args : public ck_tile::FusedMoeGemmHostArgs
{
};
// This is the public API, will be generated by script
struct fused_moegemm_traits
{
std::string prec_i; // input precision
std::string prec_w; // weight precision
std::string prec_o; // output precision
std::string prec_st; // token scale data type
std::string prec_sw; // weight scale data type
std::string prec_sq; // smooth quant scale
std::string prec_kw; // topk-weight data type
int block_m;
int gate_only;
int fused_quant; // 0:no-sweep, 1:smooth-dynamic-quant, 2:dynamic-quant
};
float fused_moegemm(fused_moegemm_traits, fused_moegemm_args, const ck_tile::stream_config&);
// SPDX-License-Identifier: MIT
// Copyright (c) 2024, Advanced Micro Devices, Inc. All rights reserved.
#pragma once
#include <string>
#include "ck_tile/core.hpp"
#include "ck_tile/host.hpp"
#include "ck_tile/ops/fused_moe.hpp"
struct fused_moesorting_trait
{
std::string index_type;
std::string weight_type; // currently always float
};
struct fused_moesorting_args : public ck_tile::MoeSortingHostArgs
{
};
float fused_moesorting(fused_moesorting_trait t, fused_moesorting_args a, ck_tile::stream_config s);
// SPDX-License-Identifier: MIT
// Copyright (c) 2024, Advanced Micro Devices, Inc. All rights reserved.
#include "fused_moe.hpp"
float fused_moe(fused_moe_traits t, fused_moe_args a, const ck_tile::stream_config& s)
{
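// sub-config for the two internal kernels: same stream as the caller, but without standalone timing;
// the combined time is measured by the launch_kernel(s, ...) call further below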
auto s_sub = ck_tile::stream_config{s.stream_id_, false, s.log_level_, 0, 1};
auto o_data_bytes = [&]() {
if(t.prec_o == "fp32")
return 4;
else if(t.prec_o == "fp16" || t.prec_o == "bf16")
return 2;
else if(t.prec_o == "int8" || t.prec_o == "fp8")
return 1;
return 1;
}();
auto t0 = fused_moesorting_trait{"int32", "fp32"};
auto a0 = fused_moesorting_args{
a.topk_ids_ptr, // const void* p_topk_ids;
a.topk_weight_ptr, // const void* p_weights;
a.sorted_token_ids_ptr, // void* p_sorted_token_ids;
a.sorted_weight_ptr, // void* p_sorted_weights;
a.sorted_expert_ids_ptr, // void* p_sorted_expert_ids;
a.num_sorted_tiles_ptr, // void* p_total_tokens_post_pad;
a.o_ptr, // void* p_moe_buf;
a.num_tokens, // index_t tokens;
a.block_m, // index_t unit_size;
a.num_experts, // index_t num_experts;
a.topk, // index_t topk;
a.num_tokens * a.stride_token * o_data_bytes // index_t moe_buf_bytes;
};
auto t1 = fused_moegemm_traits{t.prec_i,
t.prec_w,
t.prec_o,
t.prec_st,
t.prec_sw,
t.prec_sq,
t.prec_kw,
t.block_m,
t.gate_only,
t.fused_quant};
auto a1 = fused_moegemm_args{
a.a_ptr, // const void* a_ptr;
a.a_scale_ptr, // const void* a_scale_ptr;
a.g_ptr, // const void* g_ptr;
a.d_ptr, // const void* d_ptr;
a.g_scale_ptr, // const void* g_scale_ptr;
a.d_scale_ptr, // const void* d_scale_ptr;
a.y_smooth_scale_ptr, // const void* y_smooth_scale_ptr;
a.o_ptr, // void* o_ptr;
a.sorted_token_ids_ptr, // const void* sorted_token_ids_ptr;
a.sorted_weight_ptr, // const void* sorted_weight_ptr;
a.sorted_expert_ids_ptr, // const void* sorted_expert_ids_ptr;
a.num_sorted_tiles_ptr, // const void* num_sorted_tiles_ptr;
a.hidden_size, // index_t hidden_size;
a.intermediate_size, // index_t intermediate_size;
a.num_tokens, // index_t num_tokens;
a.num_experts, // index_t num_experts;
a.topk, // index_t topk;
a.stride_token // index_t stride_token;
};
float r0 = -1;
float r1 = -1;
float r = ck_tile::launch_kernel(
s,
[=, &r0](const ck_tile::stream_config&) { r0 = fused_moesorting(t0, a0, s_sub); },
[=, &r1](const ck_tile::stream_config&) { r1 = fused_moegemm(t1, a1, s_sub); });
// keep returning negative for the unsupported case
if(r0 < 0 || r1 < 0)
return -1;
return r;
}
// SPDX-License-Identifier: MIT
// Copyright (c) 2024, Advanced Micro Devices, Inc. All rights reserved.
#include <ck_tile/core.hpp>
#include "fused_moegemm.hpp"
#include "fused_moegemm_api_traits.hpp"
// Note: this internal API is only declared here, not defined; otherwise it would block `make -j`
template <typename Traits_>
float fused_moegemm_(const ck_tile::stream_config& s, fused_moegemm_args a);
template <ck_tile::index_t... Is>
using S = ck_tile::sequence<Is...>;
float fused_moegemm(fused_moegemm_traits t, fused_moegemm_args a, const ck_tile::stream_config& s)
{
// clang-format off
float r = -1;
if(t.prec_i == "bf16" && t.prec_w == "bf16" && t.prec_o == "bf16" && t.prec_st == "fp32" &&
t.prec_sw == "fp32" && t.prec_sq == "fp32" && t.prec_kw == "fp32" && t.block_m == 32 && t.gate_only == 1)
{
using t_ = fmoe_<ck_tile::bf16_t, ck_tile::bf16_t, ck_tile::bf16_t, float, float, float, float, S<32, 512, 128, 128>, S<1, 4, 1>, S<16, 16, 32>, 1, 0>;
r = fused_moegemm_<t_>(s, a);
}
else if(t.prec_i == "fp16" && t.prec_w == "fp16" && t.prec_o == "fp16" && t.prec_st == "fp32" &&
t.prec_sw == "fp32" && t.prec_sq == "fp32" && t.prec_kw == "fp32" && t.block_m == 32 && t.gate_only == 1)
{
using t_ = fmoe_<ck_tile::fp16_t, ck_tile::fp16_t, ck_tile::fp16_t, float, float, float, float, S<32, 512, 128, 128>, S<1, 4, 1>, S<16, 16, 32>, 1, 0>;
r = fused_moegemm_<t_>(s, a);
}
// clang-format on
return r;
}
// SPDX-License-Identifier: MIT
// Copyright (c) 2024, Advanced Micro Devices, Inc. All rights reserved.
#pragma once
#include "fused_moegemm_api_traits.hpp"
#include "ck_tile/ops/fused_moe.hpp"
#include <iostream>
template <ck_tile::index_t... Is>
using S = ck_tile::sequence<Is...>;
// do not put the definition of this template function inside the _api.cpp, otherwise it will block make -j
template <typename Ts_>
float fused_moegemm_(const ck_tile::stream_config& s, fused_moegemm_args a)
{
using f_traits = ck_tile::FusedMoeGemmTraits<Ts_::GateOnly, Ts_::FusedQuant == 1, 1 /*atomic*/>;
using f_shape = ck_tile::FusedMoeGemmShape<typename Ts_::BlockTile_0,
typename Ts_::WarpPerBlock_0,
typename Ts_::WarpTile_0,
typename Ts_::BlockTile_1,
typename Ts_::WarpPerBlock_0,
typename Ts_::WarpTile_0>;
using f_problem =
ck_tile::FusedMoeGemmPipelineProblem<typename Ts_::ADataType,
typename Ts_::GDataType,
typename Ts_::DDataType,
typename Ts_::AccDataType,
typename Ts_::ODataType,
typename Ts_::AScaleDataType,
typename Ts_::GScaleDataType,
typename Ts_::DScaleDataType,
typename Ts_::YSmoothScaleDataType,
typename Ts_::TopkWeightDataType,
typename Ts_::IndexDataType,
ck_tile::element_wise::FastGeluAsm, // TODO: hardcoded
f_shape,
f_traits>;
// using f_pipeline = ck_tile::FusedMoeGemmPipeline_FlatmmEx<f_problem>;
using f_pipeline = ck_tile::FusedMoeGemmPipeline_FlatmmUk<f_problem>;
using f_partitioner = ck_tile::FusedMoeGemmTilePartitioner_Linear<f_shape>;
using f_kernel = ck_tile::FusedMoeGemmKernel<f_partitioner, f_pipeline, void>;
const dim3 grids = f_kernel::GridSize(a);
constexpr dim3 blocks = f_kernel::BlockSize();
constexpr ck_tile::index_t kBlockPerCu = 1;
static int printed = 0;
auto kargs = f_kernel::MakeKargs(a);
if(s.log_level_ > 0 && printed == 0)
{
std::cout << ", " << f_kernel::GetName() << std::flush;
printed = 1;
}
return ck_tile::launch_kernel(
s, ck_tile::make_kernel<blocks.x, kBlockPerCu>(f_kernel{}, grids, blocks, 0, kargs));
}
// SPDX-License-Identifier: MIT
// Copyright (c) 2024, Advanced Micro Devices, Inc. All rights reserved.
#pragma once
#include <ck_tile/core.hpp>
// this is used to pattern-match the internal kernel implementation, not to instantiate the kernel
template <typename I,
typename W,
typename O,
typename ST,
typename SW,
typename SQ,
typename KW,
typename BlockTIle_, // seq<b_token, b_interm, b_hidden, b_down>
typename WarpPerBlock_,
typename WarpTile_, // seq<*,*,*>, used to select mfma
ck_tile::index_t GateOnly_ = 0,
ck_tile::index_t FusedQuant_ = 0>
struct fmoe_ // traits, ugly name, only used internally
{
using TypeConfig = FusedMoeGemmTypeConfig<I, W, O, ST, SW, SQ, KW>;
using ADataType = ck_tile::remove_cvref_t<typename TypeConfig::ADataType>;
using GDataType = ck_tile::remove_cvref_t<typename TypeConfig::GDataType>;
using DDataType = ck_tile::remove_cvref_t<typename TypeConfig::DDataType>;
using AccDataType = ck_tile::remove_cvref_t<typename TypeConfig::AccDataType>;
using ODataType = ck_tile::remove_cvref_t<typename TypeConfig::ODataType>;
using AScaleDataType = ck_tile::remove_cvref_t<typename TypeConfig::AScaleDataType>;
using GScaleDataType = ck_tile::remove_cvref_t<typename TypeConfig::GScaleDataType>;
using DScaleDataType = ck_tile::remove_cvref_t<typename TypeConfig::DScaleDataType>;
using YSmoothScaleDataType = ck_tile::remove_cvref_t<typename TypeConfig::YSmoothScaleDataType>;
using TopkWeightDataType = ck_tile::remove_cvref_t<typename TypeConfig::TopkWeightDataType>;
using IndexDataType = ck_tile::remove_cvref_t<typename TypeConfig::IndexDataType>;
static constexpr ck_tile::index_t BT_ = BlockTIle_::at(ck_tile::number<0>{}); // block token
static constexpr ck_tile::index_t BI_ =
BlockTIle_::at(ck_tile::number<1>{}); // block intermediate
static constexpr ck_tile::index_t BH_ = BlockTIle_::at(ck_tile::number<2>{}); // block hidden
static constexpr ck_tile::index_t BD_ = BlockTIle_::at(ck_tile::number<3>{}); // block down
using BlockTile_0 = ck_tile::sequence<BT_, BI_, BH_>;
using WarpPerBlock_0 = ck_tile::remove_cvref_t<WarpPerBlock_>;
using WarpTile_0 = ck_tile::remove_cvref_t<WarpTile_>;
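// 2nd gemm block tile: M = block token, N = block down; K is the block intermediate, halved when
// gate+up are packed into the 1st gemm (only the activated/gated half feeds the down gemm)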
using BlockTile_1 = ck_tile::sequence<BT_, BD_, BI_ / (GateOnly_ ? 1 : 2)>;
using WarpPerBlock_1 = ck_tile::remove_cvref_t<WarpPerBlock_>;
using WarpTile_1 = ck_tile::remove_cvref_t<WarpTile_>;
static constexpr ck_tile::index_t GateOnly = GateOnly_;
static constexpr ck_tile::index_t FusedQuant = FusedQuant_;
};
// SPDX-License-Identifier: MIT
// Copyright (c) 2024, Advanced Micro Devices, Inc. All rights reserved.
#include <ck_tile/core.hpp>
#include "fused_moegemm.hpp"
#include "fused_moegemm_api_traits.hpp"
#include "fused_moegemm_api_internal.hpp"
// clang-format off
template float fused_moegemm_<
fmoe_<ck_tile::bf16_t, ck_tile::bf16_t, ck_tile::bf16_t, float, float, float, float, S<32, 512, 128, 128>, S<1, 4, 1>, S<16, 16, 32>, 1, 0>
>(const ck_tile::stream_config& s, fused_moegemm_args a);
// clang-format on
// SPDX-License-Identifier: MIT
// Copyright (c) 2024, Advanced Micro Devices, Inc. All rights reserved.
#include <ck_tile/core.hpp>
#include "fused_moegemm.hpp"
#include "fused_moegemm_api_traits.hpp"
#include "fused_moegemm_api_internal.hpp"
// clang-format off
template float fused_moegemm_<
fmoe_<ck_tile::fp16_t, ck_tile::fp16_t, ck_tile::fp16_t, float, float, float, float, S<32, 512, 128, 128>, S<1, 4, 1>, S<16, 16, 32>, 1, 0>
>(const ck_tile::stream_config& s, fused_moegemm_args a);
// clang-format on
// SPDX-License-Identifier: MIT
// Copyright (c) 2024, Advanced Micro Devices, Inc. All rights reserved.
#include "fused_moesorting.hpp"
#define MOE_SORTING_DISPATCH(unroll_num_) \
constexpr ck_tile::index_t unroll_num = unroll_num_; \
using ms_problem = ck_tile::MoeSortingProblem<index_t, ms_weight_type, unroll_num>; \
using kernel = ck_tile::MoeSortingKernel<ms_problem>; \
auto kargs = kernel::MakeKargs(a); \
const dim3 grids = kernel::GridSize(a); \
const dim3 blocks = kernel::BlockSize(a); \
const auto lds_bytes = kernel::GetSmemSize(a); \
float ave_time = ck_tile::launch_kernel( \
s, ck_tile::make_kernel(kernel{}, grids, blocks, lds_bytes, kargs)); \
return ave_time;
float fused_moesorting(fused_moesorting_trait t, fused_moesorting_args a, ck_tile::stream_config s)
{
if(t.weight_type == "fp32" && t.index_type == "int32")
{
if(a.num_experts > 127)
{
printf("lds size exceeded, only support num_experts <= 127\n");
return -1;
}
if(a.moe_buf_bytes % 16)
{
printf("buf size %d unaligned, must be a multiple of 16\n", a.moe_buf_bytes);
return -1;
}
using index_t = ck_tile::index_t;
using ms_weight_type = float;
index_t smem_io_unroll_num = ck_tile::integer_divide_ceil(a.tokens * a.topk, 64);
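// dispatch on ceil(tokens * topk / 64): select a kernel specialized on this compile-time smem I/O
// unroll factor; problem sizes without a dedicated case fall back to an unroll of 4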
switch(smem_io_unroll_num)
{
case(1): {
MOE_SORTING_DISPATCH(1);
}
case(2): {
MOE_SORTING_DISPATCH(2);
}
case(3): {
MOE_SORTING_DISPATCH(3);
}
case(5): {
MOE_SORTING_DISPATCH(5);
}
case(6): {
MOE_SORTING_DISPATCH(6);
}
case(7): {
MOE_SORTING_DISPATCH(7);
}
case(8): {
MOE_SORTING_DISPATCH(8);
}
case(9): {
MOE_SORTING_DISPATCH(9);
}
case(10): {
MOE_SORTING_DISPATCH(10);
}
case(11): {
MOE_SORTING_DISPATCH(11);
}
default: {
MOE_SORTING_DISPATCH(4);
}
}
}
return -1;
}
add_executable(tile_example_batched_gemm EXCLUDE_FROM_ALL batched_gemm.cpp)
# Batched GEMM
This folder contains an example of batched GEMM using the ck_tile tile-programming implementation.
## build
```
# in the root of ck_tile
mkdir build && cd build
# you can replace <arch> with the appropriate architecture (for example gfx90a or gfx942) or leave it blank
sh ../script/cmake-ck-dev.sh ../ <arch>
make tile_example_batched_gemm -j
```
This will result in an executable `build/bin/tile_example_batched_gemm`
## example
```
args:
-m m dimension (default:256)
-n n dimension (default:128)
-k k dimension (default:128)
-a_layout A tensor data layout (default:R) (R for Row, C for Col)
-b_layout B tensor data layout (default:R) (R for Row, C for Col)
-c_layout C tensor data layout (default:R) (R for Row, C for Col)
-stride_a Tensor A stride (default:128)
-stride_b Tensor B stride (default:128)
-stride_c Tensor C stride (default:128)
-batch_stride_a Batch A stride (default:32768)
-batch_stride_b Batch B stride (default:16384)
-batch_stride_c Batch C stride (default:32768)
-batch_count Batch count (default:16)
-v 0. No validation, 1. Validation on CPU, 2. Validation on GPU (default:2)
-e Absolute error tolerance (default:1e-5)
-prec data type. fp16/bf16/fp8/bf8 (default:fp16)
-warmup number of iterations before benchmarking the kernel (default:10)
-repeat number of iterations to benchmark the kernel (default:100)
-timer gpu:gpu timer, cpu:cpu timer (default:gpu)
```
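For example, to run the default problem size in bf16 with CPU validation (flag names are taken from the list above; the `-key=value` argument style used by other ck_tile examples is assumed):
```
./bin/tile_example_batched_gemm -prec=bf16 -v=1
```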
// SPDX-License-Identifier: MIT
// Copyright (c) 2024, Advanced Micro Devices, Inc. All rights reserved.
#include <hip/hip_runtime.h>
#include <cstring>
#include <iostream>
#include <ostream>
#include <string>
#include <tuple>
#include "ck_tile/core.hpp"
#include "ck_tile/ops/epilogue.hpp"
#include "ck_tile/ops/gemm.hpp"
#include "ck_tile/host.hpp"
#include "batched_gemm.hpp"
template <typename ALayout, typename BLayout, typename CLayout>
float batched_gemm(const batched_gemm_kargs& args, const ck_tile::stream_config& s)
{
// The kPadM, kPadN, kPadK & kBlockPerCu should also come from the Codegen part.
constexpr bool kPadM = false;
constexpr bool kPadN = false;
constexpr bool kPadK = false;
constexpr bool kTilePermute = false;
// The rank and permutation will also be generated by the CodeGen part.
constexpr ck_tile::index_t kOutputRank = 2;
constexpr int kBlockPerCu = 1;
// This part comes from the Codegen
constexpr ck_tile::index_t M_Tile = 128;
constexpr ck_tile::index_t N_Tile = 128;
constexpr ck_tile::index_t K_Tile = 32;
constexpr ck_tile::index_t M_Warp = 2;
constexpr ck_tile::index_t N_Warp = 2;
constexpr ck_tile::index_t K_Warp = 1;
constexpr ck_tile::index_t M_Warp_Tile = 32;
constexpr ck_tile::index_t N_Warp_Tile = 32;
constexpr ck_tile::index_t K_Warp_Tile = 8;
// Whether to do the CShuffle (transpose before writing to global memory), depending on the output
// layout.
constexpr bool CShuffleEpilogue =
std::is_same_v<CLayout, ck_tile::tensor_layout::gemm::ColumnMajor>;
using CodegenGemmShape =
ck_tile::TileGemmShape<ck_tile::sequence<M_Tile, N_Tile, K_Tile>,
ck_tile::sequence<M_Warp, N_Warp, K_Warp>,
ck_tile::sequence<M_Warp_Tile, N_Warp_Tile, K_Warp_Tile>>;
using TilePartitioner = ck_tile::GemmTilePartitioner<CodegenGemmShape>;
using GemmEpilogue = std::conditional_t<
CShuffleEpilogue,
ck_tile::CShuffleEpilogue<ck_tile::CShuffleEpilogueProblem<AccDataType,
CDataType,
kPadM,
kPadN,
kTilePermute,
kOutputRank,
1,
0,
TilePartitioner::kM,
TilePartitioner::kN>>,
ck_tile::Default2DEpilogue<
ck_tile::Default2DEpilogueProblem<AccDataType, CDataType, kPadM, kPadN>>>;
using CodegenGemmTraits =
ck_tile::TileGemmTraits<kPadM, kPadN, kPadK, ALayout, BLayout, CLayout>;
using CodegenPipelineProblem = ck_tile::
GemmPipelineProblem<ADataType, BDataType, AccDataType, CodegenGemmShape, CodegenGemmTraits>;
using CodegenGemmPipeline = ck_tile::GemmPipelineAGmemBGmemCRegV1<CodegenPipelineProblem>;
// ToDo: Will add the codegen part to test different pipeline policies in GEMM.
// Now we only use the BlockGemmASmemBSmemCRegV1DefaultPolicy.
using Kernel = ck_tile::BatchedGemmKernel<TilePartitioner, CodegenGemmPipeline, GemmEpilogue>;
auto kargs = Kernel::MakeKargs(args);
const dim3 grids = Kernel::GridSize(args);
constexpr dim3 blocks = Kernel::BlockSize();
if(s.log_level_ > 0)
{
std::cout << "Launching kernel with args:"
<< " grid: {" << grids.x << ", " << grids.y << ", " << grids.z << "}"
<< ", blocks: {" << blocks.x << ", " << blocks.y << ", " << blocks.z << "}"
<< std::endl;
}
float ave_time = ck_tile::launch_kernel(
s, ck_tile::make_kernel<blocks.x, kBlockPerCu>(Kernel{}, grids, blocks, 0, kargs));
return ave_time;
}
#include "run_batched_gemm_example.inc"
int main(int argc, char* argv[]) { return !run_batched_gemm_example(argc, argv); }