"vscode:/vscode.git/clone" did not exist on "02ce139bc6231315c6efb81ddf7b6d427b892dd5"
Unverified commit db376dd8, authored by carlushuang, committed by GitHub

introducing ck_tile! (#1216)

* enable gfx940

* switch between intrinsic mfma routines on mi100/200 and mi300

* fix mfma_int8 on MI300

* disable 2 int8 examples on MI300

* Update cmake-ck-dev.sh

* restore gitignore file

* modify Jenkinsfile to the internal repo

* Bump rocm-docs-core from 0.24.0 to 0.29.0 in /docs/sphinx

Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.24.0 to 0.29.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.24.0...v0.29.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>

* initial enablement of gfx950

* fix clang format

* disable examples 31 and 41 int8 on gfx950

* add code

* fix build wip

* fix xx

* now can build

* naming

* minor fix

* wip fix

* fix macro for exp2; fix warpgemm a/b in transposedC

* unify as tuple_array

* Update the required Python version to 3.9

* Update executable name in test scripts

* re-structure tuple/array to avoid spill

* Merge function templates

* Fix format

* Add constraint to array<> ctor

* Re-use function

* Some minor changes

* remove wrong code in store_raw()

* fix compile issue in transpose

* Rename enum
Rename 'cood_transform_enum' to 'coord_transform_enum'

* convert more integral_constant -> constant, and fix formatting

* make sure thread_buffer can be tuple/array

* temp fix buffer_store spill

* do not use custom data type by default; now we can generate ISA-level identical code to opt_padding

* fix compile error, fp8 not ready now

* fix fp8 duplicated move/shift/and/or problem

* Default use CK_TILE_FLOAT_TO_FP8_STOCHASTIC rounding mode

* fix scratch in fp8 kernel

* update some readme

* fix merge from upstream

* sync with upstream

* sync upstream again

* sync 22

* remove unused

* fix clang-format

* update README of ck_tile example

* fix several issues

* lower the minimum Python version to 3.8

* remove ck_tile example from default cmake target like all/install/check

* remove mistake

* 1) support recipe in generate.py; 2) use simplified mask type; 3) change left/right to be passed via kernel args

* fix some bugs in group-mode masking and codegen; update README

* F8 quantization for FMHA forward (#1224)

* Add SAccElementFunction, PComputeElementFunction, OAccElementFunction in pipeline

* Add element function to fmha api

* Adjust P elementwise function

* Fix bug in elementwise op; our elementwise op is not in-out

* Add some elementwise ops in preparation for quantization

* Let generate.py generate different elementwise functions

* To prevent compiler issues, remove the elementwise functions we have not used yet.

* Remove f8 pipeline; we should share the same pipeline even for f8

* Remove remove_cvref_t

* Avoid warning

* Fix wrong fp8 QK/KV block gemm setting

* Check fp8 rounding error in check_err()

* Set fp8 rounding error for check_err()

* Use CK_TILE_FLOAT_TO_FP8_STANDARD as default fp8 rounding mode

* 1. codegen the f8 api and kernel
2. f8 host code

* prevent warning in filter mode

* Remove not-in-use elementwise function kargs

* Remove more not-in-use elementwise function kargs

* Small refinements in C++ source files

* Use conditional_t<> to simplify code

* Support heterogeneous argument for binary function types

* Re-use already-existing scales<> functor template

* Fix wrong value produced by saturating

* Generalize the composes<> template

* Unify saturates<> implementation

* Fix type errors in composes<>

* Extend less_equal<>

* Reuse the existing template less_equal<> in check_err()

* Add equal<float> & equal<double>

* Rename check_err() parameter

* Rename check_err() parameter

* Add FIXME comment for adding new macro in future

* Remove unnecessary cast to void

* Eliminate duplicated code

* Avoid dividing api pool into more than 2 groups

* Use clearer variable names

* Use affirmative condition in if stmt

* Remove blank lines

* Do not use perfect forwarding in composes<>

* To fix compile error, revert generate.py back to 4439cc107dd90302d68a6494bdd33113318709f8

* Fix bug in p element function

* Add compute element op to host softmax

* Remove element function in api interface

* Extract user parameter

* Rename pscale and oscale variable

* rename f8 to fp8

* rename more f8 to fp8

* Add pipeline::operator() without element_functor

* 1. Remove deprecated pipeline enum
2. Refine host code parameter

* Use quantization range as input

* 1. Rename max_dtype to dtype_max.
2. Rename scale to scale_s.
3. Add init description.

* Refine description

* prevent early return

* unify _squant kernel name in cpp, update README

* Adjust the default range.

* Refine error message and bias range

* Add fp8 benchmark and smoke test

* fix fp8 swizzle_factor=4 case

---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
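
For context, the scale handling described in the bullets above reduces to something like the following host-side sketch (a hypothetical helper; the names follow the bullets, and the actual generated API lives in the diff):

// hypothetical sketch of the per-tensor fp8 scale described above: the user
// supplies a quantization range, and scale_s maps it onto dtype_max, the
// largest finite value of the fp8 format; inputs are divided by scale_s
// before the fp8 cast
float compute_scale_s(float q_range, float dtype_max)
{
    return q_range / dtype_max;
}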

---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Parent commit: dd34ab6e
// SPDX-License-Identifier: MIT
// Copyright (c) 2018-2023, Advanced Micro Devices, Inc. All rights reserved.
#pragma once
#include "ck_tile/core/config.hpp"
#include "ck_tile/core/container/multi_index.hpp"
#include "ck_tile/core/container/container_helper.hpp"
#include "ck_tile/core/container/statically_indexed_array.hpp"
#include "ck_tile/core/utility/functional.hpp"
#include "ck_tile/core/utility/type_traits.hpp"
namespace ck_tile {
template <typename TensorLengths,
typename DimAccessOrder,
typename ScalarsPerAccess, // # of scalars per access in each dimension
bool SnakeCurved = true>
struct space_filling_curve
{
static constexpr index_t TensorSize =
reduce_on_sequence(TensorLengths{}, multiplies{}, number<1>{});
static_assert(0 < TensorSize,
"space_filling_curve should be used to access a non-empty tensor");
static constexpr index_t nDim = TensorLengths::size();
using Index = multi_index<nDim>;
static constexpr index_t ScalarPerVector =
reduce_on_sequence(ScalarsPerAccess{}, multiplies{}, number<1>{});
static constexpr auto access_lengths = TensorLengths{} / ScalarsPerAccess{};
static constexpr auto dim_access_order = DimAccessOrder{};
static constexpr auto ordered_access_lengths =
container_reorder_given_new2old(access_lengths, dim_access_order);
static constexpr auto to_index_adaptor = make_single_stage_tensor_adaptor(
make_tuple(make_merge_transform(ordered_access_lengths)),
make_tuple(typename arithmetic_sequence_gen<0, nDim, 1>::type{}),
make_tuple(sequence<0>{}));
static constexpr auto I0 = number<0>{};
static constexpr auto I1 = number<1>{};
CK_TILE_HOST_DEVICE static constexpr index_t get_num_of_access()
{
static_assert(TensorLengths::size() == ScalarsPerAccess::size());
static_assert(TensorLengths{} % ScalarsPerAccess{} ==
typename uniform_sequence_gen<TensorLengths::size(), 0>::type{});
return reduce_on_sequence(TensorLengths{}, multiplies{}, number<1>{}) / ScalarPerVector;
}
template <index_t AccessIdx1dHead, index_t AccessIdx1dTail>
static CK_TILE_HOST_DEVICE constexpr auto get_step_between(number<AccessIdx1dHead>,
number<AccessIdx1dTail>)
{
static_assert(AccessIdx1dHead >= 0 && AccessIdx1dHead < get_num_of_access(),
"1D index out of range");
static_assert(AccessIdx1dTail >= 0 && AccessIdx1dTail < get_num_of_access(),
"1D index out of range");
constexpr auto idx_head = get_index(number<AccessIdx1dHead>{});
constexpr auto idx_tail = get_index(number<AccessIdx1dTail>{});
return idx_tail - idx_head;
}
template <index_t AccessIdx1d>
static CK_TILE_HOST_DEVICE constexpr auto get_forward_step(number<AccessIdx1d>)
{
static_assert(AccessIdx1d < get_num_of_access(), "1D index out of range");
return get_step_between(number<AccessIdx1d>{}, number<AccessIdx1d + 1>{});
}
template <index_t AccessIdx1d>
static CK_TILE_HOST_DEVICE constexpr auto get_backward_step(number<AccessIdx1d>)
{
static_assert(AccessIdx1d > 0, "1D index should be larger than 0");
return get_step_between(number<AccessIdx1d>{}, number<AccessIdx1d - 1>{});
}
template <index_t AccessIdx1d>
static CK_TILE_HOST_DEVICE constexpr Index get_index(number<AccessIdx1d>)
{
#if 0
/*
* \todo: tensor_adaptor::calculate_bottom_index does NOT return constexpr as expected.
*/
constexpr auto ordered_access_idx = to_index_adaptor.calculate_bottom_index(make_multi_index(number<AccessIdx1d>{}));
#else
constexpr auto access_strides =
container_reverse_exclusive_scan(ordered_access_lengths, multiplies{}, number<1>{});
constexpr auto idx_1d = number<AccessIdx1d>{};
// Given the tensor strides \p access_strides and the 1D index along the
// space-filling curve, compute the idim-th element of the multidimensional index.
// All constexpr variables have to be captured by VALUE.
constexpr auto compute_index = [ idx_1d, access_strides ](auto idim) constexpr
{
constexpr auto compute_index_impl = [ idx_1d, access_strides ](auto jdim) constexpr
{
auto res = idx_1d.value;
auto id = 0;
static_for<0, jdim.value + 1, 1>{}([&](auto kdim) {
id = res / access_strides[kdim].value;
res -= id * access_strides[kdim].value;
});
return id;
};
constexpr auto id = compute_index_impl(idim);
return number<id>{};
};
constexpr auto ordered_access_idx = generate_tuple(compute_index, number<nDim>{});
#endif
constexpr auto forward_sweep = [&]() {
statically_indexed_array<bool, nDim> forward_sweep_;
forward_sweep_(I0) = true;
static_for<1, nDim, 1>{}([&](auto idim) {
index_t tmp = ordered_access_idx[I0];
static_for<1, idim, 1>{}(
[&](auto j) { tmp = tmp * ordered_access_lengths[j] + ordered_access_idx[j]; });
forward_sweep_(idim) = tmp % 2 == 0;
});
return forward_sweep_;
}();
// calculate multi-dim tensor index
auto idx_md = [&]() {
Index ordered_idx;
static_for<0, nDim, 1>{}([&](auto idim) {
ordered_idx(idim) =
!SnakeCurved || forward_sweep[idim]
? ordered_access_idx[idim]
: ordered_access_lengths[idim] - 1 - ordered_access_idx[idim];
});
return container_reorder_given_old2new(ordered_idx, dim_access_order) *
ScalarsPerAccess{};
}();
return idx_md;
}
// FIXME: rename this function
template <index_t AccessIdx1d>
static CK_TILE_HOST_DEVICE constexpr auto get_index_tuple_of_number(number<AccessIdx1d>)
{
constexpr auto idx = get_index(number<AccessIdx1d>{});
return generate_tuple([&](auto i) { return number<idx[i]>{}; }, number<nDim>{});
}
};
} // namespace ck_tile
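
To make the intended use concrete, here is a minimal sketch (the shape and access pattern are invented for illustration, not taken from this diff):

// illustration only: traverse an 8x16 tile with 1x4 vector accesses,
// snaking along the fastest-varying dimension between rows
using curve = ck_tile::space_filling_curve<ck_tile::sequence<8, 16>, // tensor lengths
                                           ck_tile::sequence<0, 1>,  // dim access order
                                           ck_tile::sequence<1, 4>>; // scalars per access
static_assert(curve::get_num_of_access() == 32); // (8 * 16) / (1 * 4)
// multi-dim index of the first access, and the step from access 0 to access 1
constexpr auto idx0 = curve::get_index(ck_tile::number<0>{});
constexpr auto step01 = curve::get_forward_step(ck_tile::number<0>{});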
// SPDX-License-Identifier: MIT
// Copyright (c) 2018-2023, Advanced Micro Devices, Inc. All rights reserved.
#pragma once
// Address Space for AMDGCN
// https://llvm.org/docs/AMDGPUUsage.html#address-space
#include "ck_tile/core/config.hpp"
#include "ck_tile/core/numeric/integer.hpp"
#include "ck_tile/core/numeric/integral_constant.hpp"
namespace ck_tile {
enum struct address_space_enum
{
generic,
global,
lds,
sgpr,
vgpr,
};
enum struct memory_operation_enum
{
set,
atomic_add,
atomic_max,
add
};
CK_TILE_HOST_DEVICE constexpr index_t get_warp_size()
{
// warpSize is defined by HIP
return warpSize;
}
CK_TILE_DEVICE index_t get_grid_size() { return gridDim.x; }
CK_TILE_DEVICE index_t get_block_size() { return blockDim.x; }
// TODO: deprecate these
CK_TILE_DEVICE index_t get_thread_local_1d_id() { return threadIdx.x; }
CK_TILE_DEVICE index_t get_thread_global_1d_id() { return blockIdx.x * blockDim.x + threadIdx.x; }
CK_TILE_DEVICE index_t get_block_1d_id() { return blockIdx.x; }
// Use these instead
CK_TILE_DEVICE index_t get_lane_id() { return __lane_id(); }
CK_TILE_DEVICE index_t get_warp_id()
{
return __builtin_amdgcn_readfirstlane(threadIdx.x / get_warp_size());
}
CK_TILE_DEVICE index_t get_thread_id() { return threadIdx.x; }
CK_TILE_DEVICE index_t get_block_id() { return blockIdx.x; }
CK_TILE_DEVICE void block_sync_lds()
{
#if CK_TILE_EXPERIMENTAL_BLOCK_SYNC_LDS_WITHOUT_SYNC_VMEM
asm volatile("\
s_waitcnt lgkmcnt(0) \n \
s_barrier \
" ::);
#else
__syncthreads();
#endif
}
CK_TILE_DEVICE void block_sync_lds_direct_load()
{
asm volatile("\
s_waitcnt vmcnt(0) \n \
s_waitcnt lgkmcnt(0) \n \
s_barrier \
" ::);
}
CK_TILE_DEVICE void s_nop()
{
#if 1
asm volatile("\
s_nop 0 \n \
" ::);
#else
__builtin_amdgcn_sched_barrier(0);
#endif
}
} // namespace ck_tile
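
A small usage sketch of the preferred helpers (a hypothetical device function, not part of this diff):

// illustration: address a per-warp row of LDS with get_warp_id()/get_lane_id(),
// then make the writes visible to the whole block
CK_TILE_DEVICE void example_store_per_warp(float* p_lds, float v)
{
    const ck_tile::index_t warp = ck_tile::get_warp_id(); // wave-uniform
    const ck_tile::index_t lane = ck_tile::get_lane_id();
    p_lds[warp * ck_tile::get_warp_size() + lane] = v;
    ck_tile::block_sync_lds();
}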
// SPDX-License-Identifier: MIT
// Copyright (c) 2018-2023, Advanced Micro Devices, Inc. All rights reserved.
#pragma once
// Address Space for AMDGCN
// https://llvm.org/docs/AMDGPUUsage.html#address-space
#include "ck_tile/core/config.hpp"
#include "ck_tile/core/numeric/integer.hpp"
#include "ck_tile/core/numeric/integral_constant.hpp"
#include "ck_tile/core/utility/bit_cast.hpp"
#include <stdint.h>
namespace ck_tile {
// TODO: we have "memory" clobber here because this inline asm is used for async copy
CK_TILE_DEVICE void m0_set_with_memory(index_t v)
{
asm volatile("s_mov_b32 m0, %0" : : "s"(v) : "memory");
}
// NOTE: this is an immediate value
CK_TILE_DEVICE void m0_inc_with_memory(index_t v)
{
asm volatile("s_add_u32 m0, %0, m0" : : "n"(v) : "memory");
}
template <typename T>
CK_TILE_DEVICE T warp_shuffle_up(const T& v_local, uint32_t lane_delta)
{
#if 0
return __shfl_up(v_local, lane_delta);
#elif 1
static_assert(sizeof(T) == sizeof(int32_t), "only 4-byte types are supported");
const uint32_t wrap_around_lane_delta = warpSize - lane_delta;
const int32_t v_remote_tmp = __builtin_amdgcn_ds_bpermute(
(__lane_id() << 2) + (wrap_around_lane_delta << 2), bit_cast<int32_t>(v_local));
return bit_cast<T>(v_remote_tmp);
#endif
}
template <typename T>
CK_TILE_DEVICE T warp_shuffle_down(const T& v_local, uint32_t lane_delta)
{
#if 0
return __shfl_down(v_local, lane_delta);
#elif 1
static_assert(sizeof(T) == sizeof(int32_t), "only 4-byte types are supported");
const int32_t v_remote_tmp = __builtin_amdgcn_ds_bpermute(
(__lane_id() << 2) + (lane_delta << 2), bit_cast<int32_t>(v_local));
return bit_cast<T>(v_remote_tmp);
#endif
}
} // namespace ck_tile
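
For reference, a typical warp reduction built on these shuffles (an illustrative sketch; assumes warpSize == 64 and a 4-byte element type):

// because the ds_bpermute-based shuffle above wraps around the wave,
// every lane ends up holding the full sum after the last step
CK_TILE_DEVICE float example_warp_reduce_sum(float v)
{
    for(uint32_t delta = 32; delta > 0; delta >>= 1)
        v += ck_tile::warp_shuffle_down(v, delta);
    return v;
}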
// SPDX-License-Identifier: MIT
// Copyright (c) 2018-2024, Advanced Micro Devices, Inc. All rights reserved.
#pragma once
#ifndef CK_TILE_DONT_USE_HIP_RUNTIME_HEADERS
#include "hip/hip_runtime.h"
#include "hip/hip_fp16.h"
#endif
#ifdef __HIPCC__
#define CK_TILE_HOST inline __host__
#define CK_TILE_DEVICE inline __device__
#define CK_TILE_HOST_DEVICE inline __host__ __device__
#define CK_TILE_DEVICE_EXTERN __device__
#else
#define CK_TILE_HOST inline
#define CK_TILE_DEVICE inline
#define CK_TILE_HOST_DEVICE inline
#define CK_TILE_DEVICE_EXTERN
#endif
#ifndef CK_TILE_USE_CUSTOM_DATA_TYPE
#define CK_TILE_USE_CUSTOM_DATA_TYPE 0 // custom data type will generate extra move/bfi code
#endif
#define CK_TILE_FLOAT_TO_BFLOAT16_STANDARD 0
#define CK_TILE_FLOAT_TO_BFLOAT16_TRUNCATE_WITH_NAN 1
#define CK_TILE_FLOAT_TO_BFLOAT16_TRUNCATE 2
#ifndef CK_TILE_FLOAT_TO_BFLOAT16_DEFAULT
#define CK_TILE_FLOAT_TO_BFLOAT16_DEFAULT CK_TILE_FLOAT_TO_BFLOAT16_TRUNCATE
#endif
#define CK_TILE_FLOAT_TO_FP8_STANDARD 0
#define CK_TILE_FLOAT_TO_FP8_STOCHASTIC 1
#ifndef CK_TILE_FLOAT_TO_FP8_DEFAULT
#define CK_TILE_FLOAT_TO_FP8_DEFAULT CK_TILE_FLOAT_TO_FP8_STANDARD
#endif
// with older ROCm releases the tuple-based implementation was required here,
// so turn on _USE_TUPLE if you hit a compiler error with _USE_ARRAY (the default is set below).
#define CK_TILE_STATICALLY_INDEXED_ARRAY_USE_ARRAY 0
#define CK_TILE_STATICALLY_INDEXED_ARRAY_USE_TUPLE 1
#ifndef CK_TILE_STATICALLY_INDEXED_ARRAY_DEFAULT
#define CK_TILE_STATICALLY_INDEXED_ARRAY_DEFAULT CK_TILE_STATICALLY_INDEXED_ARRAY_USE_TUPLE
#endif
#define CK_TILE_THREAD_BUFFER_USE_ARRAY 0
#define CK_TILE_THREAD_BUFFER_USE_TUPLE 1
#ifndef CK_TILE_THREAD_BUFFER_DEFAULT
#define CK_TILE_THREAD_BUFFER_DEFAULT CK_TILE_THREAD_BUFFER_USE_ARRAY
#endif
#ifndef CK_TILE_TUPLE_CTOR_WITH_INITIALIZER_LIST
#if CK_TILE_THREAD_BUFFER_DEFAULT == CK_TILE_THREAD_BUFFER_USE_TUPLE
// if using tuple-array as the thread_buffer implementation, we need to support {} brace init
// ... with behavior similar to array
#define CK_TILE_TUPLE_CTOR_WITH_INITIALIZER_LIST 1
#else
#define CK_TILE_TUPLE_CTOR_WITH_INITIALIZER_LIST 0
#endif
#endif
#ifndef CK_TILE_USE_LAUNCH_BOUNDS
#define CK_TILE_USE_LAUNCH_BOUNDS 1
#endif
#ifndef CK_TILE_TIME_KERNEL
#define CK_TILE_TIME_KERNEL 1
#endif
#define CK_TILE_MAX_THREAD_PER_BLOCK 256
#define CK_TILE_MIN_BLOCK_PER_CU 2
#ifndef CK_TILE_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK
#define CK_TILE_EXPERIMENTAL_USE_BUFFER_LOAD_OOB_CHECK_OFFSET_TRICK 0
#endif
#ifndef CK_TILE_EXPERIMENTAL_USE_BUFFER_STORE_OOB_CHECK_OFFSET_TRICK
#define CK_TILE_EXPERIMENTAL_USE_BUFFER_STORE_OOB_CHECK_OFFSET_TRICK 1
#endif
#ifndef CK_TILE_EXPERIMENTAL_USE_BUFFER_ATOMIC_ADD_OOB_CHECK_OFFSET_TRICK
#define CK_TILE_EXPERIMENTAL_USE_BUFFER_ATOMIC_ADD_OOB_CHECK_OFFSET_TRICK 1
#endif
#ifndef CK_TILE_EXPERIMENTAL_USE_BUFFER_ATOMIC_MAX_OOB_CHECK_OFFSET_TRICK
#define CK_TILE_EXPERIMENTAL_USE_BUFFER_ATOMIC_MAX_OOB_CHECK_OFFSET_TRICK 1
#endif
#ifndef CK_TILE_USE_AMD_LDS_DIRECT_LOAD_INLINE_ASM
#define CK_TILE_USE_AMD_LDS_DIRECT_LOAD_INLINE_ASM 1
#endif
#ifndef CK_TILE_USE_AMD_BUFFER_LOAD
#define CK_TILE_USE_AMD_BUFFER_LOAD 1
#endif
#ifndef CK_TILE_USE_AMD_BUFFER_STORE
#define CK_TILE_USE_AMD_BUFFER_STORE 1
#endif
#ifndef CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_INTEGER
#define CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_INTEGER 1
#endif
// buffer atomic add: floating point
#ifndef __HIP_DEVICE_COMPILE__ // for host code
#define CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT 1
#elif defined(__gfx908__) || defined(__gfx90a__) || defined(__gfx940__) || defined(__gfx941__) || \
defined(__gfx942__) // for GPU code
#define CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT 1
#else // for GPU code
#define CK_TILE_USE_AMD_BUFFER_ATOMIC_ADD_FLOAT 0
#endif
#if(defined(__gfx90a__) || defined(__gfx940__) || defined(__gfx941__) || \
defined(__gfx942__)) // for GPU code
#define CK_TILE_USE_AMD_BUFFER_ATOMIC_MAX_FLOAT64 1
#else
#define CK_TILE_USE_AMD_BUFFER_ATOMIC_MAX_FLOAT64 0
#endif
#ifndef CK_TILE_EXPERIMENTAL_USE_MEMCPY_FOR_VECTOR_ACCESS
#define CK_TILE_EXPERIMENTAL_USE_MEMCPY_FOR_VECTOR_ACCESS 0
#endif
#ifndef CK_TILE_WORKAROUND_SWDEV_XXXXXX_INT8_DS_WRITE_ISSUE
#define CK_TILE_WORKAROUND_SWDEV_XXXXXX_INT8_DS_WRITE_ISSUE 1
#endif
#ifndef CK_TILE_DEBUG_LOG
#define CK_TILE_DEBUG_LOG 0
#endif
#ifndef __HIP_DEVICE_COMPILE__ // for host code
#define CK_TILE_BUFFER_RESOURCE_3RD_DWORD 0xffffffff
#elif defined(__gfx803__) || defined(__gfx900__) || defined(__gfx906__) || defined(__gfx908__) || \
defined(__gfx90a__) || defined(__gfx940__) || defined(__gfx941__) || \
defined(__gfx942__) // for GPU code
#define CK_TILE_BUFFER_RESOURCE_3RD_DWORD 0x00020000
#elif defined(__gfx1030__) // for GPU code
#define CK_TILE_BUFFER_RESOURCE_3RD_DWORD 0x31014000
#elif defined(__gfx1100__) || defined(__gfx1101__) || defined(__gfx1102__) // for GPU code
#define CK_TILE_BUFFER_RESOURCE_3RD_DWORD 0x31004000
#endif
#ifndef CK_TILE_EXPERIMENTAL_BLOCK_SYNC_LDS_WITHOUT_SYNC_VMEM
#define CK_TILE_EXPERIMENTAL_BLOCK_SYNC_LDS_WITHOUT_SYNC_VMEM 1
#endif
#ifndef CK_TILE_USE_SUBDWORD_TILE_CAST
#define CK_TILE_USE_SUBDWORD_TILE_CAST 0
#endif
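
Since every switch above is #ifndef-guarded, users can override the defaults per translation unit (or with -D on the compile line), e.g.:

// hypothetical override: pick stochastic fp8 rounding and disable kernel timing
// before any ck_tile header is included; numeric values are used because the
// *_STANDARD/*_STOCHASTIC value macros are themselves defined in config.hpp
#define CK_TILE_FLOAT_TO_FP8_DEFAULT 1 // == CK_TILE_FLOAT_TO_FP8_STOCHASTIC
#define CK_TILE_TIME_KERNEL 0
#include "ck_tile/core/config.hpp"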
// SPDX-License-Identifier: MIT
// Copyright (c) 2018-2023, Advanced Micro Devices, Inc. All rights reserved.
#pragma once
#include <initializer_list>
#include "ck_tile/core/config.hpp"
#include "ck_tile/core/numeric/integer.hpp"
#include "ck_tile/core/numeric/integral_constant.hpp"
#include "ck_tile/core/utility/type_traits.hpp"
#include "ck_tile/core/utility/functional.hpp"
namespace ck_tile {
// use aggregate-style initialization for this type
// e.g. array<index_t, 4> buf {0};    => {0, 0, 0, 0}
//      array<index_t, 4> buf {3, 2}; => {3, 2, 2, 2} (not {3, 2, 0, 0})
// use make_array_with({...}) to construct an array with behavior compatible with old CK
// TODO: manually add a constructor same as in old CK
template <typename T_, index_t N_>
struct array
{
using value_type = T_;
static constexpr index_t N = N_;
// TODO: do we need this?
// using bulk_type = uint8_t __attribute__((ext_vector_type(N * sizeof(value_type))));
// union {
value_type data[N];
// bulk_type __content;
//};
CK_TILE_HOST_DEVICE constexpr array() : data{} {}
// NOTE: unlike std::array, a shorter initializer list repeats its last
// value into the remaining elements (behavior differs from std)
CK_TILE_HOST_DEVICE constexpr array(std::initializer_list<value_type> ilist)
{
// NOTE: ilist.size() is not a constant expression here, so an oversized
// initializer list cannot be rejected at compile time
index_t i = 0;
value_type vlast = value_type{};
for(const value_type& val : ilist)
{
data[i] = val;
vlast = val;
++i;
}
for(; i < N; ++i)
{
data[i] = vlast;
}
}
template <typename Y,
typename = std::enable_if_t<std::is_convertible_v<Y, value_type> ||
std::is_constructible_v<Y, value_type>>>
CK_TILE_HOST_DEVICE explicit constexpr array(Y c)
{
for(auto i = 0; i < size(); i++)
data[i] = static_cast<value_type>(c);
}
// template <typename Y>
// CK_TILE_HOST_DEVICE constexpr array(const array& o)
// {
// // static_assert(ArrayType::size() == size(), "wrong! size not the same");
// __content = o.__content;
// }
// CK_TILE_HOST_DEVICE constexpr array& operator=(const array& o)
// {
// // static_assert(ArrayType::size() == size(), "wrong! size not the same");
// __content = o.__content;
// return *this;
// }
CK_TILE_HOST_DEVICE static constexpr auto size() { return N; }
CK_TILE_HOST_DEVICE static constexpr bool is_static() { return is_static_v<value_type>; }
// clang-format off
CK_TILE_HOST_DEVICE constexpr auto& get() { return data; }
CK_TILE_HOST_DEVICE constexpr const auto& get() const { return data; }
CK_TILE_HOST_DEVICE constexpr auto& get(index_t i) { return data[i]; }
CK_TILE_HOST_DEVICE constexpr const auto& get(index_t i) const { return data[i]; }
template <index_t I> CK_TILE_HOST_DEVICE constexpr auto& get() { return data[I]; }
template <index_t I> CK_TILE_HOST_DEVICE constexpr const auto& get() const { return data[I]; }
template <index_t I> CK_TILE_HOST_DEVICE constexpr auto& get(number<I>) { return data[I]; }
template <index_t I> CK_TILE_HOST_DEVICE constexpr const auto& get(number<I>) const { return data[I]; }
CK_TILE_HOST_DEVICE constexpr auto& at(index_t i) { return get(i); }
CK_TILE_HOST_DEVICE constexpr const auto& at(index_t i) const { return get(i); }
template <index_t I> CK_TILE_HOST_DEVICE constexpr auto& at() { return get(I); }
template <index_t I> CK_TILE_HOST_DEVICE constexpr const auto& at() const { return get(I); }
template <index_t I> CK_TILE_HOST_DEVICE constexpr auto& at(number<I>) { return get(I); }
template <index_t I> CK_TILE_HOST_DEVICE constexpr const auto& at(number<I>) const { return get(I); }
CK_TILE_HOST_DEVICE constexpr const value_type& operator[](index_t i) const { return get(i); }
CK_TILE_HOST_DEVICE constexpr value_type& operator[](index_t i) { return get(i); }
CK_TILE_HOST_DEVICE constexpr value_type& operator()(index_t i) { return get(i); } // TODO: compatible
#if 0
template <typename ArrayLike>
CK_TILE_HOST_DEVICE constexpr auto operator=(const ArrayLike& arr)
{
static_assert(ArrayLike::size() == size(), "wrong! size not the same");
for(index_t i = 0; i < size(); ++i)
{
data[i] = arr[i];
}
return *this;
}
#endif
// type punning (strict aliasing) member functions for read/write
// aliasing this array of type "T", "N" elements
// as array of type "Tx", sizeof(T)*N/sizeof(Tx) elements
#define AR_AS_COM_() \
static_assert(sizeof(value_type) * N % sizeof(Tx) == 0); \
constexpr int vx = sizeof(value_type) * N / sizeof(Tx)
template <typename Tx> CK_TILE_HOST_DEVICE constexpr auto& get_as()
{ AR_AS_COM_(); return reinterpret_cast<array<Tx, vx>&>(data); }
template <typename Tx> CK_TILE_HOST_DEVICE constexpr const auto& get_as() const
{ AR_AS_COM_(); return reinterpret_cast<const array<Tx, vx>&>(data); }
// below index is for index *AFTER* type convert, not before
template <typename Tx> CK_TILE_HOST_DEVICE constexpr auto& get_as(index_t i)
{ AR_AS_COM_(); return reinterpret_cast<array<Tx, vx>&>(data).at(i); }
template <typename Tx> CK_TILE_HOST_DEVICE constexpr const auto& get_as(index_t i) const
{ AR_AS_COM_(); return reinterpret_cast<const array<Tx, vx>&>(data).at(i); }
template <typename Tx, index_t I> CK_TILE_HOST_DEVICE constexpr auto& get_as(number<I>)
{ AR_AS_COM_(); return reinterpret_cast<array<Tx, vx>&>(data).at(number<I>{}); }
template <typename Tx, index_t I> CK_TILE_HOST_DEVICE constexpr const auto& get_as(number<I>) const
{ AR_AS_COM_(); return reinterpret_cast<const array<Tx, vx>&>(data).at(number<I>{}); }
template <typename Tx> CK_TILE_HOST_DEVICE constexpr void set_as(index_t i, const Tx & x)
{ AR_AS_COM_(); reinterpret_cast<array<Tx, vx>&>(data).at(i) = x; }
template <typename Tx, index_t I> CK_TILE_HOST_DEVICE constexpr void set_as(number<I>, const Tx & x)
{ AR_AS_COM_(); reinterpret_cast<array<Tx, vx>&>(data).at(number<I>{}) = x; }
#undef AR_AS_COM_
// clang-format on
};
// empty Array
template <typename T>
struct array<T, 0>
{
using value_type = T;
CK_TILE_HOST_DEVICE constexpr array() {}
CK_TILE_HOST_DEVICE static constexpr index_t size() { return 0; }
CK_TILE_HOST_DEVICE static constexpr bool is_static() { return is_static_v<T>; };
CK_TILE_HOST_DEVICE void print() const { printf("array{size: 0, data: []}"); }
};
template <typename>
struct vector_traits;
// specialization for array
template <typename T, index_t N>
struct vector_traits<array<T, N>>
{
using scalar_type = T;
static constexpr index_t vector_size = N;
};
namespace details {
template <class>
struct is_ref_wrapper : std::false_type
{
};
template <class T>
struct is_ref_wrapper<std::reference_wrapper<T>> : std::true_type
{
};
template <class T>
using not_ref_wrapper = std::negation<is_ref_wrapper<std::decay_t<T>>>;
template <class D, class...>
struct return_type_helper
{
using type = D;
};
template <class... Ts>
struct return_type_helper<void, Ts...> : std::common_type<Ts...>
{
static_assert(std::conjunction_v<not_ref_wrapper<Ts>...>,
"Ts cannot contain reference_wrappers when D is void");
};
template <class D, class... Ts>
using return_type = array<typename return_type_helper<D, Ts...>::type, sizeof...(Ts)>;
} // namespace details
template <typename D = void, typename... Ts>
CK_TILE_HOST_DEVICE constexpr details::return_type<D, Ts...> make_array(Ts&&... ts)
{
return {std::forward<Ts>(ts)...};
}
// // make empty array
// template <typename T>
// CK_TILE_HOST_DEVICE constexpr auto make_array()
// {
// return array<T, 0>{};
// }
// compatible with old ck's initializer: make an array and fill it with the last element from the
// initializer_list
template <typename T, index_t Size>
CK_TILE_HOST_DEVICE constexpr auto make_array_with(std::initializer_list<T> ilist)
{
return array<T, Size>(ilist);
}
template <typename T, index_t Size>
CK_TILE_HOST_DEVICE constexpr bool operator==(const array<T, Size>& a, const array<T, Size>& b)
{
bool same = true;
for(index_t i = 0; i < Size; ++i)
{
if(a[i] != b[i])
{
same = false;
break;
}
}
return same;
}
template <typename T, index_t Size>
CK_TILE_HOST_DEVICE constexpr bool operator!=(const array<T, Size>& a, const array<T, Size>& b)
{
return !(a == b);
}
template <typename T, index_t N, typename X>
CK_TILE_HOST_DEVICE constexpr auto to_array(const X& x)
{
static_assert(N <= X::size(), "");
array<T, N> arr;
static_for<0, N, 1>{}([&x, &arr](auto i) { arr(i) = x[i]; });
return arr;
}
} // namespace ck_tile
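
A small usage sketch of the container and its helpers (values are arbitrary, not from this diff):

// illustration of aggregate-style init, make_array(_with), and get_as punning
CK_TILE_HOST_DEVICE void array_example()
{
    ck_tile::array<ck_tile::index_t, 4> a{3, 2}; // -> {3, 2, 2, 2} (last value repeats)
    auto b = ck_tile::make_array<ck_tile::index_t>(0, 1, 2, 3);     // array<index_t, 4>
    auto c = ck_tile::make_array_with<ck_tile::index_t, 4>({7, 8}); // -> {7, 8, 8, 8}
    int64_t lo = a.get_as<int64_t>(0); // view 4 x 32-bit storage as 2 x 64-bit
    (void)b; (void)c; (void)lo;
}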
// SPDX-License-Identifier: MIT
// Copyright (c) 2018-2024, Advanced Micro Devices, Inc. All rights reserved.
#pragma once
#include "ck_tile/core/config.hpp"
#include "ck_tile/core/container/array.hpp"
#include "ck_tile/core/container/sequence.hpp"
#include "ck_tile/core/container/tuple.hpp"
namespace ck_tile {
// naive map
template <typename key, typename data, index_t max_size = 128>
struct map
{
using pair_type = tuple<key, data>;
using impl_type = array<pair_type, max_size>;
impl_type impl_;
index_t size_;
struct iterator
{
impl_type& impl_;
index_t pos_;
CK_TILE_HOST_DEVICE constexpr iterator(impl_type& impl, index_t pos)
: impl_{impl}, pos_{pos}
{
}
CK_TILE_HOST_DEVICE constexpr iterator& operator++()
{
pos_++;
return *this;
}
CK_TILE_HOST_DEVICE constexpr bool operator!=(const iterator& other) const
{
return other.pos_ != pos_;
}
CK_TILE_HOST_DEVICE constexpr pair_type& operator*() { return impl_.at(pos_); }
};
struct const_iterator
{
const impl_type& impl_;
index_t pos_;
CK_TILE_HOST_DEVICE constexpr const_iterator(const impl_type& impl, index_t pos)
: impl_{impl}, pos_{pos}
{
}
CK_TILE_HOST_DEVICE constexpr const_iterator& operator++()
{
pos_++;
return *this;
}
CK_TILE_HOST_DEVICE constexpr bool operator!=(const const_iterator& other) const
{
return other.pos_ != pos_;
}
CK_TILE_HOST_DEVICE constexpr const pair_type& operator*() const { return impl_.at(pos_); }
};
CK_TILE_HOST_DEVICE constexpr map() : impl_{}, size_{0} {}
CK_TILE_HOST_DEVICE constexpr index_t size() const { return size_; }
CK_TILE_HOST_DEVICE void clear() { size_ = 0; }
CK_TILE_HOST_DEVICE constexpr index_t find_position(const key& k) const
{
for(index_t i = 0; i < size(); i++)
{
if(impl_[i].template at<0>() == k)
{
return i;
}
}
return size_;
}
CK_TILE_HOST_DEVICE constexpr const_iterator find(const key& k) const
{
return const_iterator{impl_, find_position(k)};
}
CK_TILE_HOST_DEVICE constexpr iterator find(const key& k)
{
return iterator{impl_, find_position(k)};
}
CK_TILE_HOST_DEVICE constexpr const data& operator[](const key& k) const
{
const auto it = find(k);
// FIXME
// assert(it.pos_ < size());
return impl_[it.pos_].template at<1>();
}
CK_TILE_HOST_DEVICE constexpr data& operator()(const key& k)
{
auto it = find(k);
// if entry not found
if(it.pos_ == size())
{
impl_(it.pos_).template at<0>() = k;
size_++;
}
// FIXME
// assert(size_ <= max_size);
return impl_(it.pos_).template at<1>();
}
// WARNING: needed by compiler for C++ range-based for loop only, don't use this function!
CK_TILE_HOST_DEVICE constexpr const_iterator begin() const { return const_iterator{impl_, 0}; }
// WARNING: needed by compiler for C++ range-based for loop only, don't use this function!
CK_TILE_HOST_DEVICE constexpr const_iterator end() const
{
return const_iterator{impl_, size_};
}
// WARNING: needed by compiler for C++ range-based for loop only, don't use this function!
CK_TILE_HOST_DEVICE constexpr iterator begin() { return iterator{impl_, 0}; }
// WARNING: needed by compiler for C++ range-based for loop only, don't use this function!
CK_TILE_HOST_DEVICE constexpr iterator end() { return iterator{impl_, size_}; }
CK_TILE_HOST_DEVICE void print() const
{
printf("map{size_: %d, ", size_);
//
printf("impl_: [");
//
for(const auto& [k, d] : *this)
{
printf("{key: ");
print(k);
printf(", data: ");
print(d);
printf("}, ");
}
//
printf("]");
//
printf("}");
}
};
} // namespace ck_tile
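
A short usage sketch of this naive map (illustrative only):

// operator() inserts the key when absent; find() is a linear O(size) scan
CK_TILE_HOST_DEVICE void map_example()
{
    ck_tile::map<ck_tile::index_t, ck_tile::index_t> m;
    m(1) = 10;
    m(2) = 20;
    const auto it = m.find(1);
    const bool found = it != m.end(); // true
    (void)found;
}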
// SPDX-License-Identifier: MIT
// Copyright (c) 2018-2023, Advanced Micro Devices, Inc. All rights reserved.
#pragma once
#include "ck_tile/core/config.hpp"
#include "ck_tile/core/container/array.hpp"
#include "ck_tile/core/utility/bit_cast.hpp"
#include <cstddef>
namespace ck_tile {
// TODO: this structure is not intended to be used by users directly
template <index_t MaxSize>
struct meta_data_buffer
{
CK_TILE_HOST_DEVICE constexpr meta_data_buffer() : buffer_{}, size_{0} {}
template <typename X, typename... Xs>
CK_TILE_HOST_DEVICE constexpr meta_data_buffer(const X& x, const Xs&... xs)
: buffer_{}, size_{0}
{
push(x, xs...);
}
template <typename T>
CK_TILE_HOST_DEVICE constexpr void push(const T& data)
{
if constexpr(!std::is_empty_v<T>)
{
constexpr index_t size = sizeof(T);
auto tmp = bit_cast<array<std::byte, size>>(data);
for(int i = 0; i < size; i++)
{
buffer_(size_) = tmp[i];
size_++;
}
}
}
template <typename X, typename... Xs>
CK_TILE_HOST_DEVICE constexpr void push(const X& x, const Xs&... xs)
{
push(x);
push(xs...);
}
template <typename T>
CK_TILE_HOST_DEVICE constexpr T pop(index_t& pos) const
{
T data;
if constexpr(!std::is_empty_v<T>)
{
constexpr index_t size = sizeof(T);
array<std::byte, size> tmp;
for(int i = 0; i < size; i++)
{
tmp(i) = buffer_[pos];
pos++;
}
data = bit_cast<T>(tmp);
}
return data;
}
template <typename T>
CK_TILE_HOST_DEVICE constexpr T get(index_t pos) const
{
constexpr index_t size = sizeof(T);
array<std::byte, size> tmp;
for(int i = 0; i < size; i++)
{
tmp(i) = buffer_[pos];
pos++;
}
auto data = bit_cast<T>(tmp);
return data;
}
//
array<std::byte, MaxSize> buffer_;
index_t size_ = 0;
};
} // namespace ck_tile
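
A round-trip sketch (types and values arbitrary, for illustration):

// push() serializes trivially-copyable values into the byte buffer;
// pop() reads them back in the same order, advancing the caller's cursor
CK_TILE_HOST_DEVICE void meta_data_buffer_example()
{
    ck_tile::meta_data_buffer<64> buf(ck_tile::index_t{42}, 3.14f);
    ck_tile::index_t pos = 0;
    const auto i = buf.pop<ck_tile::index_t>(pos); // 42
    const auto f = buf.pop<float>(pos);            // 3.14f
    (void)i; (void)f;
}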
// SPDX-License-Identifier: MIT
// Copyright (c) 2018-2023, Advanced Micro Devices, Inc. All rights reserved.
#pragma once
#include "ck_tile/core/config.hpp"
#include "ck_tile/core/container/array.hpp"
#include "ck_tile/core/container/tuple.hpp"
#include "ck_tile/core/numeric/integer.hpp"
namespace ck_tile {
#if CK_TILE_STATICALLY_INDEXED_ARRAY_DEFAULT == CK_TILE_STATICALLY_INDEXED_ARRAY_USE_TUPLE
template <typename T, index_t N>
using statically_indexed_array = tuple_array<T, N>;
#else
// consider marking this alias as deprecated
template <typename T, index_t N>
using statically_indexed_array = array<T, N>;
#endif
// consider always use ck_tile::array for this purpose
#if 0
template <typename X, typename... Xs>
CK_TILE_HOST_DEVICE constexpr auto make_statically_indexed_array(const X& x, const Xs&... xs)
{
return statically_indexed_array<X, sizeof...(Xs) + 1>(x, static_cast<X>(xs)...);
}
// make empty statically_indexed_array
template <typename X>
CK_TILE_HOST_DEVICE constexpr auto make_statically_indexed_array()
{
return statically_indexed_array<X, 0>();
}
#endif
} // namespace ck_tile