Reduction for int8 and bfloat16 (#125)

* Use thread cluster descriptor and explicit M_K 2d descriptor to simply Blockwise Reduction * Change by replacing ReduceDims by NumReduceDims as Device Reduce interface template parameter * Rename the folder name for the pool2d and reduce examples * Update to reduction test scripts * Add Readme for pool2d_fwd and reduce_blockwise examples * Add support for int8_t reduction (ADD/AVG, MIN/MAX/AMAX) * Tiny fix in reduce profiler and tiny update in reduce testing scripts * Tiny fix in testing script profile_reduce_no_index.sh * Tiny fix in testing script profile_reduce_no_index.sh * Add support for bfp16 reduction (using bhalf_t = ushort) * Tiny fix in amd_buffer_addressing.hpp * Tiny change in script/profile_reduce_with_index.sh * Use AccDataType for Beta value and use element_wise::PassThrough * Use type_convert for type converting in host layer reduction * Renaming and refining in Reduction profiler/device layer/examples * Renaming and refining in Reduction profiler/device layer/examples * Renaming all NumReduceDims to NumReduceDim * Fix the leaked type_convert in ThreadwiseTensorSliceTransfer_v2 * Update to testing scripts to add bf16 support * added more static_assert * Remove buggy tunable configurations defined in device_reduce_instance_xxx.hpp * Add static_assert to give compile-time warning for incorrect thread slice-size/vector-size configurations * minor change * Refine and fix (in GetWorkspaceSizeInBytes of MultiBlockPartialReduce) to make int8 completely pass * Tiny renaming in gridwise_2d_reduction_multiblock_partial_reduce.hpp * Tiny fix in script/profile_reduce_no_index.sh * Refine in DeviceReduce layer with regard to using NumInvariantDim/NumReduceDim or InvariantDims/ReduceDims * Generic renaming in host reduction and DeviceReduce layer * Add support for 4-d all dimension reduction in the profiler and add_device_reduce_xxx instances * Use multi-thread and simplification for host Reduction implementation * Add ctest for reduction * Update to clarify the using of data init method in produce_reduce/example_reduce/test_reduce/ * Update to the reduce CTest executables to enable default testing behavior when no command argument * Renaming Co-authored-by: Jianfeng yan <jfyan008@gmail.com>

Reduction for int8 and bfloat16 (#125)
* Use thread cluster descriptor and explicit M_K 2d descriptor to simply Blockwise Reduction * Change by replacing ReduceDims by NumReduceDims as Device Reduce interface template parameter * Rename the folder name for the pool2d and reduce examples * Update to reduction test scripts * Add Readme for pool2d_fwd and reduce_blockwise examples * Add support for int8_t reduction (ADD/AVG, MIN/MAX/AMAX) * Tiny fix in reduce profiler and tiny update in reduce testing scripts * Tiny fix in testing script profile_reduce_no_index.sh * Tiny fix in testing script profile_reduce_no_index.sh * Add support for bfp16 reduction (using bhalf_t = ushort) * Tiny fix in amd_buffer_addressing.hpp * Tiny change in script/profile_reduce_with_index.sh * Use AccDataType for Beta value and use element_wise::PassThrough * Use type_convert for type converting in host layer reduction * Renaming and refining in Reduction profiler/device layer/examples * Renaming and refining in Reduction profiler/device layer/examples * Renaming all NumReduceDims to NumReduceDim * Fix the leaked type_convert in ThreadwiseTensorSliceTransfer_v2 * Update to testing scripts to add bf16 support * added more static_assert * Remove buggy tunable configurations defined in device_reduce_instance_xxx.hpp * Add static_assert to give compile-time warning for incorrect thread slice-size/vector-size configurations * minor change * Refine and fix (in GetWorkspaceSizeInBytes of MultiBlockPartialReduce) to make int8 completely pass * Tiny renaming in gridwise_2d_reduction_multiblock_partial_reduce.hpp * Tiny fix in script/profile_reduce_no_index.sh * Refine in DeviceReduce layer with regard to using NumInvariantDim/NumReduceDim or InvariantDims/ReduceDims * Generic renaming in host reduction and DeviceReduce layer * Add support for 4-d all dimension reduction in the profiler and add_device_reduce_xxx instances * Use multi-thread and simplification for host Reduction implementation * Add ctest for reduction * Update to clarify the using of data init method in produce_reduce/example_reduce/test_reduce/ * Update to the reduce CTest executables to enable default testing behavior when no command argument * Renaming Co-authored-by: Jianfeng yan <jfyan008@gmail.com>
9a8ee8a3 · Qianfeng · GitHub · cb87b049 · cb87b049 · 9a8ee8a3
Unverified Commit 9a8ee8a3 authored Mar 23, 2022 by Qianfeng Committed by GitHub Mar 22, 2022
20 changed files
--- a/library/include/ck/library/host_tensor/host_generic_reduction.hpp
+++ b/library/include/ck/library/host_tensor/host_generic_reduction.hpp
-/*******************************************************************************
- *
- * MIT License
- *
- * Copyright (c) 2020 Advanced Micro Devices, Inc.
- *
- * Permission is hereby granted, free of charge, to any person obtaining a copy
- * of this software and associated documentation files (the "Software"), to deal
- * in the Software without restriction, including without limitation the rights
- * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
- * copies of the Software, and to permit persons to whom the Software is
- * furnished to do so, subject to the following conditions:
- *
- * The above copyright notice and this permission notice shall be included in all
- * copies or substantial portions of the Software.
- *
- * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
- * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
- * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
- * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
- * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
- * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- * SOFTWARE.
- *
- *******************************************************************************/
-#ifndef HOST_GENERIC_REDUCTION_HPP_
-#define HOST_GENERIC_REDUCTION_HPP_
-#include <vector>
-#include <functional>
-#include <limits>
-#include <type_traits>
-#include <cassert>
-#include <cmath>
-#include "reduction_enums.hpp"
-#include "host_reduce_util.hpp"
-using float16 = half_float::half;
-namespace ck {
-namespace host_reduce {
-template <typename T>
-static void
-get_all_indexes(const std::vector<T>& dimLengths, int dim, std::vector<std::vector<T>>& indexes)
-{
-    if(dim < dimLengths.size())
-    {
-        std::vector<std::vector<T>> updated_indexes;
-        if(dim == 0)
-        {
-            assert(indexes.size() == 0);
-            assert(dimLengths[dim] > 0);
-            for(T i = 0; i < dimLengths[dim]; i++)
-            {
-                std::vector<T> index = {i};
-                updated_indexes.push_back(index);
-            };
-        }
-        else
-        {
-            // go through all the current indexes
-            for(const auto& index : indexes)
-                for(T i = 0; i < dimLengths[dim]; i++)
-                {
-                    auto index_new = index;
-                    index_new.push_back(i);
-                    updated_indexes.push_back(index_new);
-                };
-        };
-        // update to the indexes (output)
-        indexes = updated_indexes;
-        // further to construct the indexes from the updated status
-        get_all_indexes(dimLengths, dim + 1, indexes);
-    };
-};
-template <typename T>
-static T get_offset_from_index(const std::vector<T>& strides, const std::vector<T>& index)
-{
-    T offset = 0;
-    assert(strides.size() == index.size());
-    for(int i = 0; i < index.size(); i++)
-        offset += strides[i] * static_cast<T>(index[i]);
-    return (offset);
-};
-template <typename T>
-static inline T get_flatten_offset(const std::vector<T>& lengths, const std::vector<T>& index)
-{
-    T offset = 0;
-    assert(lengths.size() == index.size() && lengths.size() > 0);
-    int len  = lengths.size();
-    T stride = 1;
-    // for len==1, the loop is not executed
-    for(int i = len - 1; i > 0; i--)
-    {
-        offset += stride * static_cast<T>(index[i]);
-        stride *= lengths[i];
-    };
-    offset += stride * static_cast<T>(index[0]);
-    return (offset);
-};
-template <typename InDataType,
-          typename AccDataType,
-          typename OutDataType,
-          ck::ReduceTensorOp_t ReduceOpId,
-          bool PropagateNan,
-          bool NeedIndices>
-class ReductionHost
-{
-    public:
-    ReductionHost() = default;
-    ReductionHost(HostTensorDescriptor& inDesc,
-                  HostTensorDescriptor& outDesc,
-                  const std::vector<int>& invariantDims_,
-                  const std::vector<int>& toReduceDims_)
-    {
-        this->inLengths  = to_int_vector(inDesc.GetLengths());
-        this->outLengths = to_int_vector(outDesc.GetLengths());
-        this->inStrides  = to_int_vector(inDesc.GetStrides());
-        this->outStrides = to_int_vector(outDesc.GetStrides());
-        this->invariantDims = invariantDims_;
-        this->toReduceDims  = toReduceDims_;
-        assert(this->inLengths.size() == this->outLengths.size());
-        assert(!this->toReduceDims.empty());
-        for(const auto dim : this->invariantDims)
-            this->invariantLengths.push_back(this->inLengths[dim]);
-        for(const auto dim : this->toReduceDims)
-            toReduceLengths.push_back(this->inLengths[dim]);
-        this->reduceAllDims = this->invariantDims.empty();
-    };
-    ~ReductionHost(){};
-    void
-    Run(float alpha, const InDataType* in_data, float beta, OutDataType* out_data, int* indices)
-    {
-        if constexpr(NeedIndices)
-            RunImpl_with_indices(alpha, in_data, beta, out_data, indices);
-        else
-            RunImpl_no_indices(alpha, in_data, beta, out_data);
-    };
-    private:
-    std::vector<int> inLengths;
-    std::vector<int> outLengths;
-    std::vector<int> inStrides;
-    std::vector<int> outStrides;
-    std::vector<int> invariantLengths;
-    std::vector<int> toReduceLengths;
-    std::vector<int> invariantDims;
-    std::vector<int> toReduceDims;
-    bool reduceAllDims;
-    void RunImpl_with_indices(
-        float alpha, const InDataType* in_data, float beta, OutDataType* out_data, int* indices)
-    {
-        using ck::host_reduce::binop_with_nan_check;
-        using ck::host_reduce::binop_with_nan_check2;
-        using ck::host_reduce::float_equal_one;
-        using ck::host_reduce::float_equal_zero;
-        using ck::host_reduce::PosUnaryOpFn;
-        using ck::host_reduce::PreUnaryOpFn;
-        using ck::host_reduce::ReduceOpFn2;
-        using ck::host_reduce::ReduceOpZeroVal;
-        auto opReduce = ReduceOpFn2<AccDataType, ReduceOpId>();
-        int divider = 1;
-        for(int i = 0; i < toReduceLengths.size(); i++)
-            divider *= toReduceLengths[i];
-        auto PreUnaryOp = PreUnaryOpFn<AccDataType, ReduceOpId>(divider);
-        auto PosUnaryOp = PosUnaryOpFn<AccDataType, ReduceOpId>(divider);
-        if(reduceAllDims)
-        {
-            std::vector<std::vector<int>> indexes_1;
-            get_all_indexes(inLengths, 0, indexes_1); // generate the input indexes space
-            auto accuVal  = ReduceOpZeroVal<AccDataType, ReduceOpId>();
-            int accuIndex = 0;
-            // go through indexes of the invariant dimensions
-            for(const auto& src_index : indexes_1)
-            {
-                auto src_offset = get_offset_from_index(this->inStrides, src_index);
-                auto currVal = static_cast<AccDataType>(in_data[src_offset]);
-                // unary operation before reducing, needed by AMAX. For MIN/MAX, nothing is actually
-                // done
-                PreUnaryOp(currVal);
-                auto currIndex = get_flatten_offset(inLengths, src_index);
-                binop_with_nan_check2<AccDataType, PropagateNan>(
-                    opReduce, accuVal, currVal, accuIndex, currIndex);
-            };
-            // scale the accumulated value
-            if(!float_equal_one(alpha))
-                accuVal *= static_cast<AccDataType>(alpha);
-            // scale the prior dst value and add it to the accumulated value
-            if(!float_equal_zero(beta))
-                accuVal += static_cast<AccDataType>(out_data[0]) * static_cast<AccDataType>(beta);
-            // store the reduced value to dst location
-            out_data[0] = static_cast<OutDataType>(accuVal);
-            indices[0]  = accuIndex;
-        }
-        else
-        {
-            std::vector<std::vector<int>> indexes_1, indexes_2;
-            get_all_indexes(
-                this->invariantLengths, 0, indexes_1); // generate the invariant indexes space
-            get_all_indexes(
-                this->toReduceLengths, 0, indexes_2); // generate the toReduce indexes space
-            // go through indexes of the invariant dimensions
-            for(const auto& index_1 : indexes_1)
-            {
-                std::vector<int> src_index;
-                std::vector<int> dst_index;
-                src_index.resize(this->inLengths.size());
-                // generate the part of src index belonging to invariant dims
-                for(int k = 0; k < invariantDims.size(); k++)
-                    src_index[invariantDims[k]] = index_1[k];
-                for(int k = 0; k < invariantDims.size(); k++)
-                    dst_index.push_back(index_1[k]);
-                int dst_offset = get_offset_from_index(this->outStrides, dst_index);
-                AccDataType accuVal = ReduceOpZeroVal<AccDataType, ReduceOpId>();
-                int accuIndex       = 0;
-                // go through indexes of the toReduce dimensions
-                for(const auto& index_2 : indexes_2)
-                {
-                    // generate the part of src index belonging to toReduce dims
-                    for(int k = 0; k < toReduceDims.size(); k++)
-                        src_index[toReduceDims[k]] = index_2[k];
-                    auto src_offset = get_offset_from_index(this->inStrides, src_index);
-                    auto currVal = static_cast<AccDataType>(in_data[src_offset]);
-                    // unary operation before reducing, needed by AMAX. For MIN/MAX, nothing is
-                    // actually done
-                    PreUnaryOp(currVal);
-                    auto currIndex = get_flatten_offset(toReduceLengths, index_2);
-                    binop_with_nan_check2<AccDataType, PropagateNan>(
-                        opReduce, accuVal, currVal, accuIndex, currIndex);
-                };
-                // scale the accumulated value
-                if(!float_equal_one(alpha))
-                    accuVal *= static_cast<AccDataType>(alpha);
-                // scale the prior dst value and add it to the accumulated value
-                if(!float_equal_zero(beta))
-                    accuVal += static_cast<AccDataType>(out_data[dst_offset]) *
-                               static_cast<AccDataType>(beta);
-                // store the reduced value to dst location
-                out_data[dst_offset] = static_cast<OutDataType>(accuVal);
-                indices[dst_offset]  = accuIndex;
-            };
-        };
-    }; // end of RunImpl_with_indices()
-    void
-    RunImpl_no_indices(float alpha, const InDataType* in_data, float beta, OutDataType* out_data)
-    {
-        using ck::host_reduce::binop_with_nan_check;
-        using ck::host_reduce::binop_with_nan_check2;
-        using ck::host_reduce::float_equal_one;
-        using ck::host_reduce::float_equal_zero;
-        using ck::host_reduce::PosUnaryOpFn;
-        using ck::host_reduce::PreUnaryOpFn;
-        using ck::host_reduce::ReduceOpFn;
-        using ck::host_reduce::ReduceOpZeroVal;
-        auto opReduce = ReduceOpFn<AccDataType, ReduceOpId>();
-        int divider = 1;
-        for(int i = 0; i < toReduceLengths.size(); i++)
-            divider *= toReduceLengths[i];
-        auto PreUnaryOp = PreUnaryOpFn<AccDataType, ReduceOpId>(divider);
-        auto PosUnaryOp = PosUnaryOpFn<AccDataType, ReduceOpId>(divider);
-        if(reduceAllDims)
-        {
-            std::vector<std::vector<int>> indexes_1;
-            get_all_indexes(inLengths, 0, indexes_1); // generate the input indexes space
-            auto accuVal = ReduceOpZeroVal<AccDataType, ReduceOpId>();
-            // go through indexes of the invariant dimensions
-            for(const auto& src_index : indexes_1)
-            {
-                auto src_offset = get_offset_from_index(this->inStrides, src_index);
-                auto currVal = static_cast<AccDataType>(in_data[src_offset]);
-                PreUnaryOp(currVal);
-                binop_with_nan_check<AccDataType, PropagateNan>(opReduce, accuVal, currVal);
-            };
-            PosUnaryOp(accuVal);
-            // scale the accumulated value
-            if(!float_equal_one(alpha))
-                accuVal *= static_cast<AccDataType>(alpha);
-            // scale the prior dst value and add it to the accumulated value
-            if(!float_equal_zero(beta))
-                accuVal += static_cast<AccDataType>(out_data[0]) * static_cast<AccDataType>(beta);
-            // store the reduced value to dst location
-            out_data[0] = static_cast<OutDataType>(accuVal);
-        }
-        else
-        {
-            std::vector<std::vector<int>> indexes_1, indexes_2;
-            get_all_indexes(
-                this->invariantLengths, 0, indexes_1); // generate the invariant indexes space
-            get_all_indexes(
-                this->toReduceLengths, 0, indexes_2); // generate the toReduce indexes space
-            // go through indexes of the invariant dimensions
-            for(const auto& index_1 : indexes_1)
-            {
-                std::vector<int> src_index;
-                std::vector<int> dst_index;
-                src_index.resize(this->inLengths.size());
-                for(int k = 0; k < invariantDims.size(); k++)
-                    dst_index.push_back(index_1[k]);
-                int dst_offset = get_offset_from_index(this->outStrides, dst_index);
-                // generate the part of src index belonging to invariant dims
-                for(int k = 0; k < invariantDims.size(); k++)
-                    src_index[invariantDims[k]] = index_1[k];
-                AccDataType accuVal = ReduceOpZeroVal<AccDataType, ReduceOpId>();
-                // go through indexes of the toReduce dimensions
-                for(const auto& index_2 : indexes_2)
-                {
-                    // generate the part of src index belonging to toReduce dims
-                    for(int k = 0; k < toReduceDims.size(); k++)
-                        src_index[toReduceDims[k]] = index_2[k];
-                    auto src_offset = get_offset_from_index(this->inStrides, src_index);
-                    auto currVal = static_cast<AccDataType>(in_data[src_offset]);
-                    PreUnaryOp(currVal);
-                    binop_with_nan_check<AccDataType, PropagateNan>(opReduce, accuVal, currVal);
-                };
-                PosUnaryOp(accuVal);
-                // scale the accumulated value
-                if(!float_equal_one(alpha))
-                    accuVal *= static_cast<AccDataType>(alpha);
-                // scale the prior dst value and add it to the accumulated value
-                if(!float_equal_zero(beta))
-                    accuVal += static_cast<AccDataType>(out_data[dst_offset]) *
-                               static_cast<AccDataType>(beta);
-                // store the reduced value to dst location
-                out_data[dst_offset] = static_cast<OutDataType>(accuVal);
-            };
-        };
-    }; // end of RunImpl_no_indices()
-};
-}; // end of namespace host_reduce
-}; // end of namespace ck
-#endif
--- a/library/include/ck/library/host_tensor/host_reduce_util.hpp
+++ b/library/include/ck/library/host_tensor/host_reduce_util.hpp
@@ -66,22 +66,22 @@ static inline bool float_equal_zero(half_float::half x)
    return x == static_cast<half_float::half>(0.0f);
 };
-template <typename compType, ReduceTensorOp_t ReduceOpId>
+template <typename AccDataType, ReduceTensorOp_t ReduceOpId>
-__host__ static inline std::function<void(compType&)> PreUnaryOpFn(int)
+__host__ static inline std::function<void(AccDataType&)> PreUnaryOpFn(int)
 {
    using std::abs;
    if constexpr(ReduceOpId == ReduceTensorOp_t::NORM1)
    {
-        return ([&](compType& a_) { a_ = abs(a_); });
+        return ([&](AccDataType& a_) { a_ = abs(a_); });
    }
    else if constexpr(ReduceOpId == ReduceTensorOp_t::NORM2)
    {
-        return ([&](compType& a_) { a_ = a_ * a_; });
+        return ([&](AccDataType& a_) { a_ = a_ * a_; });
    }
    else if constexpr(ReduceOpId == ReduceTensorOp_t::AMAX)
    {
-        return ([&](compType& a_) { a_ = abs(a_); });
+        return ([&](AccDataType& a_) { a_ = abs(a_); });
    }
    else
    {
@@ -90,23 +90,23 @@ __host__ static inline std::function<void(compType&)> PreUnaryOpFn(int)
        // ReduceTensorOp_t::MUL:
        // ReduceTensorOp_t::MIN:
        // ReduceTensorOp_t::MAX:
-        return ([&](compType&) {});
+        return ([&](AccDataType&) {});
    };
 };
-template <typename compType, ReduceTensorOp_t ReduceOpId>
+template <typename AccDataType, ReduceTensorOp_t ReduceOpId>
-__host__ static inline std::function<void(compType&)> PosUnaryOpFn(int divider)
+__host__ static inline std::function<void(AccDataType&)> PosUnaryOpFn(int32_t divider)
 {
    using std::sqrt;
    if constexpr(ReduceOpId == ReduceTensorOp_t::NORM2)
    {
-        return ([&](compType& a_) { a_ = sqrt(a_); });
+        return ([&](AccDataType& a_) { a_ = sqrt(a_); });
    }
    else if constexpr(ReduceOpId == ReduceTensorOp_t::AVG)
    {
-        return ([&, divider](compType& a_) {
+        return ([&, divider](AccDataType& a_) {
-            a_ = a_ / static_cast<compType>(static_cast<float>(divider));
+            a_ = a_ / static_cast<AccDataType>(static_cast<float>(divider));
        });
    }
    else
@@ -117,44 +117,44 @@ __host__ static inline std::function<void(compType&)> PosUnaryOpFn(int divider)
        // ReduceTensorOp_t::MIN:
        // ReduceTensorOp_t::MAX:
        // ReduceTensorOp_t::AMAX:
-        return ([&](compType&) {});
+        return ([&](AccDataType&) {});
    }
 };
-template <typename compType, ReduceTensorOp_t ReduceOpId>
+template <typename AccDataType, ReduceTensorOp_t ReduceOpId>
-__host__ static inline std::function<void(compType&, compType)> ReduceOpFn()
+__host__ static inline std::function<void(AccDataType&, AccDataType)> ReduceOpFn()
 {
    if constexpr(ReduceOpId == ReduceTensorOp_t::ADD || ReduceOpId == ReduceTensorOp_t::AVG ||
                 ReduceOpId == ReduceTensorOp_t::NORM1 || ReduceOpId == ReduceTensorOp_t::NORM2)
    {
-        return ([&](compType& a_, compType b_) { a_ = a_ + b_; });
+        return ([&](AccDataType& a_, AccDataType b_) { a_ = a_ + b_; });
    }
    else if constexpr(ReduceOpId == ReduceTensorOp_t::MUL)
    {
-        return ([&](compType& a_, compType b_) { a_ = a_ * b_; });
+        return ([&](AccDataType& a_, AccDataType b_) { a_ = a_ * b_; });
    }
    else if constexpr(ReduceOpId == ReduceTensorOp_t::MIN)
    {
-        return ([&](compType& a_, compType b_) {
+        return ([&](AccDataType& a_, AccDataType b_) {
            if(a_ > b_)
                a_ = b_;
        });
    }
    else if constexpr(ReduceOpId == ReduceTensorOp_t::MAX || ReduceOpId == ReduceTensorOp_t::AMAX)
    {
-        return ([&](compType& a_, compType b_) {
+        return ([&](AccDataType& a_, AccDataType b_) {
            if(a_ < b_)
                a_ = b_;
        });
    }
 };
-template <typename compType, ReduceTensorOp_t ReduceOpId>
+template <typename AccDataType, ReduceTensorOp_t ReduceOpId>
-__host__ static inline std::function<void(compType&, compType, bool& changed)> ReduceOpFn2()
+__host__ static inline std::function<void(AccDataType&, AccDataType, bool& changed)> ReduceOpFn2()
 {
    if constexpr(ReduceOpId == ReduceTensorOp_t::MIN)
    {
-        return ([&](compType& a_, compType b_, bool& changed) {
+        return ([&](AccDataType& a_, AccDataType b_, bool& changed) {
            if(a_ > b_)
            {
                a_      = b_;
@@ -166,7 +166,7 @@ __host__ static inline std::function<void(compType&, compType, bool& changed)> R
    }
    else if constexpr(ReduceOpId == ReduceTensorOp_t::MAX || ReduceOpId == ReduceTensorOp_t::AMAX)
    {
-        return ([&](compType& a_, compType b_, bool& changed) {
+        return ([&](AccDataType& a_, AccDataType b_, bool& changed) {
            if(a_ < b_)
            {
                a_      = b_;
@@ -183,28 +183,28 @@ __host__ static inline std::function<void(compType&, compType, bool& changed)> R
        // ReduceTensorOp_t::AVG:
        // ReduceTensorOp_t::NORM1:
        // ReduceTensorOp_t::NORM2:
-        return (std::function<void(compType&, compType, bool&)>{});
+        return (std::function<void(AccDataType&, AccDataType, bool&)>{});
    };
 };
-template <typename compType, ReduceTensorOp_t ReduceOpId>
+template <typename AccDataType, ReduceTensorOp_t ReduceOpId>
-__host__ static inline compType ReduceOpZeroVal()
+__host__ static inline AccDataType ReduceOpZeroVal()
 {
    if constexpr(ReduceOpId == ReduceTensorOp_t::MUL)
    {
-        return (static_cast<compType>(1.0f));
+        return (static_cast<AccDataType>(1.0f));
    }
    else if constexpr(ReduceOpId == ReduceTensorOp_t::MIN)
    {
-        return (std::numeric_limits<compType>::max());
+        return (std::numeric_limits<AccDataType>::max());
    }
    else if constexpr(ReduceOpId == ReduceTensorOp_t::MAX)
    {
-        return (std::numeric_limits<compType>::lowest());
+        return (std::numeric_limits<AccDataType>::lowest());
    }
    else if constexpr(ReduceOpId == ReduceTensorOp_t::AMAX)
    {
-        return (static_cast<compType>(0.0f));
+        return (static_cast<AccDataType>(0.0f));
    }
    else
    {
@@ -212,14 +212,15 @@ __host__ static inline compType ReduceOpZeroVal()
        // ReduceTensorOp_t::AVG
        // ReduceTensorOp_t::NORM1
        // ReduceTensorOp_t::NORM2
-        return (static_cast<compType>(0.0f));
+        return (static_cast<AccDataType>(0.0f));
    };
 };
-template <typename compType, bool PropagateNan>
+template <typename AccDataType, bool PropagateNan>
-__host__ static inline void binop_with_nan_check(std::function<void(compType&, compType)> opReduce,
+__host__ static inline void
-                                                 compType& accuVal,
+binop_with_nan_check(std::function<void(AccDataType&, AccDataType)> opReduce,
-                                                 compType currVal)
+                     AccDataType& accuVal,
+                     AccDataType currVal)
 {
    using std::isnan;
@@ -236,11 +237,11 @@ __host__ static inline void binop_with_nan_check(std::function<void(compType&, c
    };
 };
-template <typename compType, bool PropagateNan>
+template <typename AccDataType, bool PropagateNan>
 __host__ static inline void
-binop_with_nan_check2(std::function<void(compType&, compType, bool&)> opReduce,
+binop_with_nan_check2(std::function<void(AccDataType&, AccDataType, bool&)> opReduce,
-                      compType& accuVal,
+                      AccDataType& accuVal,
-                      compType currVal,
+                      AccDataType currVal,
                      int& accuIndex,
                      int currIndex)
 {

--- a/library/include/ck/library/host_tensor/host_reduction.hpp
+++ b/library/include/ck/library/host_tensor/host_reduction.hpp
+/*******************************************************************************
+ *
+ * MIT License
+ *
+ * Copyright (c) 2020 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in all
+ * copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ *******************************************************************************/
+#ifndef HOST_REDUCTION_HPP_
+#define HOST_REDUCTION_HPP_
+#include <vector>
+#include <array>
+#include <functional>
+#include "reduction_enums.hpp"
+#include "host_reduce_util.hpp"
+#include "host_tensor.hpp"
+#include "data_type.hpp"
+template <int NDim>
+static void get_all_indexes(const std::array<size_t, NDim>& dimLengths,
+                            std::vector<std::array<size_t, NDim>>& indexes)
+{
+    static_assert(NDim >= 1, "NDim >= 1 is required to use this function!");
+    if constexpr(NDim == 1)
+    {
+        for(size_t i = 0; i < dimLengths[0]; i++)
+        {
+            std::array<size_t, 1> index{i};
+            indexes.push_back(index);
+        };
+    }
+    else
+    {
+        std::array<size_t, NDim - 1> partial_dim_lengths;
+        for(int i = 0; i < NDim - 1; i++)
+            partial_dim_lengths[i] = dimLengths[i + 1];
+        std::vector<std::array<size_t, NDim - 1>> partial_indexes;
+        get_all_indexes<NDim - 1>(partial_dim_lengths, partial_indexes);
+        for(size_t i = 0; i < dimLengths[0]; i++)
+            for(const auto& index : partial_indexes)
+            {
+                std::array<size_t, NDim> extIndex;
+                extIndex[0] = i;
+                for(int k = 0; k < NDim - 1; k++)
+                    extIndex[k + 1] = index[k];
+                indexes.push_back(extIndex);
+            };
+    };
+};
+template <int NDim>
+static size_t get_offset_from_index(const std::array<size_t, NDim>& strides,
+                                    const std::array<size_t, NDim>& index)
+{
+    size_t offset = 0;
+    for(int i = 0; i < NDim; i++)
+        offset += strides[i] * index[i];
+    return (offset);
+};
+template <int NDim>
+static size_t get_offset_from_index(const std::vector<size_t>& strides,
+                                    const std::array<size_t, NDim>& index)
+{
+    size_t offset = 0;
+    for(int i = 0; i < NDim; i++)
+        offset += strides[i] * index[i];
+    return (offset);
+};
+template <typename InDataType,
+          typename AccDataType,
+          typename OutDataType,
+          ck::ReduceTensorOp_t ReduceOpId,
+          int Rank,
+          int NumReduceDim,
+          bool PropagateNan,
+          bool NeedIndices>
+struct ReductionHost
+{
+    using IndexDataType = int32_t;
+    static constexpr int NumInvariantDim = Rank - NumReduceDim;
+    std::vector<size_t> outStrides;
+    std::vector<int> invariantDims;
+    std::vector<int> reduceDims;
+    IndexDataType divider;
+    std::function<void(AccDataType&)> preUnaryOp;
+    std::function<void(AccDataType&)> posUnaryOp;
+    std::array<size_t, NumReduceDim> reduceLengths;
+    std::array<size_t, NumReduceDim> reduceStrides;
+    std::array<size_t, NumInvariantDim> invariantLengths;
+    std::array<size_t, NumInvariantDim> invariantStrides;
+    std::vector<std::array<size_t, NumReduceDim>> reduce_dim_indexes;
+    std::vector<std::array<size_t, NumInvariantDim>> invariant_dim_indexes;
+    ReductionHost(HostTensorDescriptor& inDesc,
+                  HostTensorDescriptor& outDesc,
+                  const std::vector<int>& invariantDims_,
+                  const std::vector<int>& reduceDims_)
+    {
+        using ck::host_reduce::PosUnaryOpFn;
+        using ck::host_reduce::PreUnaryOpFn;
+        // this->outLengths = to_int_vector(outDesc.GetLengths());
+        this->outStrides = outDesc.GetStrides();
+        this->invariantDims = invariantDims_;
+        this->reduceDims    = reduceDims_;
+        int product = 1;
+        for(int i = 0; i < NumReduceDim; i++)
+        {
+            reduceLengths[i] = inDesc.GetLengths()[reduceDims[i]];
+            reduceStrides[i] = inDesc.GetStrides()[reduceDims[i]];
+            product *= inDesc.GetLengths()[reduceDims[i]];
+        };
+        divider = product;
+        for(int i = 0; i < NumInvariantDim; i++)
+        {
+            invariantLengths[i] = inDesc.GetLengths()[invariantDims[i]];
+            invariantStrides[i] = inDesc.GetStrides()[invariantDims[i]];
+        };
+        reduce_dim_indexes.clear();
+        get_all_indexes<NumReduceDim>(reduceLengths, reduce_dim_indexes);
+        if constexpr(NumInvariantDim > 0)
+        {
+            invariant_dim_indexes.clear();
+            get_all_indexes<NumInvariantDim>(invariantLengths, invariant_dim_indexes);
+        };
+        preUnaryOp = PreUnaryOpFn<AccDataType, ReduceOpId>(divider);
+        posUnaryOp = PosUnaryOpFn<AccDataType, ReduceOpId>(divider);
+    };
+    void Run(float alpha,
+             const InDataType* in_data,
+             float beta,
+             OutDataType* out_data,
+             IndexDataType* out_indices)
+    {
+        if constexpr(NeedIndices)
+        {
+            RunImpl_with_index(alpha, in_data, beta, out_data, out_indices);
+        }
+        else
+        {
+            RunImpl_no_index(alpha, in_data, beta, out_data);
+        };
+    };
+    void RunImpl_with_index(float alpha,
+                            const InDataType* in_data,
+                            float beta,
+                            OutDataType* out_data,
+                            IndexDataType* out_indices)
+    {
+        using ck::type_convert;
+        using ck::host_reduce::binop_with_nan_check2;
+        using ck::host_reduce::float_equal_one;
+        using ck::host_reduce::float_equal_zero;
+        using ck::host_reduce::ReduceOpFn2;
+        using ck::host_reduce::ReduceOpZeroVal;
+        auto opReduce2 = ReduceOpFn2<AccDataType, ReduceOpId>();
+        if constexpr(NumInvariantDim == 0)
+        {
+            AccDataType accuVal     = ReduceOpZeroVal<AccDataType, ReduceOpId>();
+            IndexDataType accuIndex = 0;
+            for(IndexDataType i = 0; i < reduce_dim_indexes.size(); i++)
+            {
+                auto offset_reduce =
+                    get_offset_from_index<NumReduceDim>(reduceStrides, reduce_dim_indexes[i]);
+                auto currVal = type_convert<AccDataType>(in_data[offset_reduce]);
+                preUnaryOp(currVal);
+                auto currIndex = i;
+                binop_with_nan_check2<AccDataType, PropagateNan>(
+                    opReduce2, accuVal, currVal, accuIndex, currIndex);
+            };
+            posUnaryOp(accuVal);
+            if(!float_equal_one(alpha))
+                accuVal *= type_convert<AccDataType>(alpha);
+            if(!float_equal_zero(beta))
+                accuVal += type_convert<AccDataType>(out_data[0]) * type_convert<AccDataType>(beta);
+            out_data[0]    = type_convert<OutDataType>(accuVal);
+            out_indices[0] = accuIndex;
+        }
+        else
+        {
+            auto thread_reduce_func = [&](auto invariant_index) {
+                AccDataType accuVal     = ReduceOpZeroVal<AccDataType, ReduceOpId>();
+                IndexDataType accuIndex = 0;
+                auto offset_invariant =
+                    get_offset_from_index<NumInvariantDim>(invariantStrides, invariant_index);
+                for(IndexDataType i = 0; i < reduce_dim_indexes.size(); i++)
+                {
+                    auto offset_reduce =
+                        get_offset_from_index<NumReduceDim>(reduceStrides, reduce_dim_indexes[i]);
+                    auto currVal =
+                        type_convert<AccDataType>(in_data[offset_invariant + offset_reduce]);
+                    preUnaryOp(currVal);
+                    auto currIndex = i;
+                    binop_with_nan_check2<AccDataType, PropagateNan>(
+                        opReduce2, accuVal, currVal, accuIndex, currIndex);
+                };
+                posUnaryOp(accuVal);
+                if(!float_equal_one(alpha))
+                    accuVal *= type_convert<AccDataType>(alpha);
+                auto dst_offset =
+                    get_offset_from_index<NumInvariantDim>(outStrides, invariant_index);
+                if(!float_equal_zero(beta))
+                    accuVal += type_convert<AccDataType>(out_data[dst_offset]) *
+                               type_convert<AccDataType>(beta);
+                out_data[dst_offset]    = type_convert<OutDataType>(accuVal);
+                out_indices[dst_offset] = accuIndex;
+            };
+            std::size_t num_thread = std::thread::hardware_concurrency();
+            std::size_t work_per_thread =
+                (invariant_dim_indexes.size() + num_thread - 1) / num_thread;
+            std::vector<joinable_thread> threads(num_thread);
+            for(std::size_t it = 0; it < num_thread; ++it)
+            {
+                std::size_t iw_begin = it * work_per_thread;
+                std::size_t iw_end =
+                    std::min((it + 1) * work_per_thread, invariant_dim_indexes.size());
+                auto f = [=] {
+                    for(std::size_t iw = iw_begin; iw < iw_end; ++iw)
+                    {
+                        thread_reduce_func(invariant_dim_indexes[iw]);
+                    }
+                };
+                threads[it] = joinable_thread(f);
+            }
+        };
+    };
+    void RunImpl_no_index(float alpha, const InDataType* in_data, float beta, OutDataType* out_data)
+    {
+        using ck::type_convert;
+        using ck::host_reduce::binop_with_nan_check;
+        using ck::host_reduce::float_equal_one;
+        using ck::host_reduce::float_equal_zero;
+        using ck::host_reduce::ReduceOpFn;
+        using ck::host_reduce::ReduceOpZeroVal;
+        auto opReduce = ReduceOpFn<AccDataType, ReduceOpId>();
+        if constexpr(NumInvariantDim == 0)
+        {
+            AccDataType accuVal = ReduceOpZeroVal<AccDataType, ReduceOpId>();
+            for(const auto& reduce_index : reduce_dim_indexes)
+            {
+                auto offset_reduce =
+                    get_offset_from_index<NumReduceDim>(reduceStrides, reduce_index);
+                auto currVal = type_convert<AccDataType>(in_data[offset_reduce]);
+                preUnaryOp(currVal);
+                binop_with_nan_check<AccDataType, PropagateNan>(opReduce, accuVal, currVal);
+            };
+            posUnaryOp(accuVal);
+            if(!float_equal_one(alpha))
+                accuVal *= type_convert<AccDataType>(alpha);
+            if(!float_equal_zero(beta))
+                accuVal += type_convert<AccDataType>(out_data[0]) * type_convert<AccDataType>(beta);
+            out_data[0] = type_convert<OutDataType>(accuVal);
+        }
+        else
+        {
+            auto thread_reduce_func = [&](auto invariant_index) {
+                AccDataType accuVal = ReduceOpZeroVal<AccDataType, ReduceOpId>();
+                auto offset_invariant =
+                    get_offset_from_index<NumInvariantDim>(invariantStrides, invariant_index);
+                for(const auto& reduce_index : reduce_dim_indexes)
+                {
+                    auto offset_reduce =
+                        get_offset_from_index<NumReduceDim>(reduceStrides, reduce_index);
+                    auto currVal =
+                        type_convert<AccDataType>(in_data[offset_invariant + offset_reduce]);
+                    preUnaryOp(currVal);
+                    binop_with_nan_check<AccDataType, PropagateNan>(opReduce, accuVal, currVal);
+                };
+                posUnaryOp(accuVal);
+                if(!float_equal_one(alpha))
+                    accuVal *= type_convert<AccDataType>(alpha);
+                auto dst_offset =
+                    get_offset_from_index<NumInvariantDim>(outStrides, invariant_index);
+                if(!float_equal_zero(beta))
+                    accuVal += type_convert<AccDataType>(out_data[dst_offset]) *
+                               type_convert<AccDataType>(beta);
+                out_data[dst_offset] = type_convert<OutDataType>(accuVal);
+            };
+            std::size_t num_thread = std::thread::hardware_concurrency();
+            std::size_t work_per_thread =
+                (invariant_dim_indexes.size() + num_thread - 1) / num_thread;
+            std::vector<joinable_thread> threads(num_thread);
+            for(std::size_t it = 0; it < num_thread; ++it)
+            {
+                std::size_t iw_begin = it * work_per_thread;
+                std::size_t iw_end =
+                    std::min((it + 1) * work_per_thread, invariant_dim_indexes.size());
+                auto f = [=] {
+                    for(std::size_t iw = iw_begin; iw < iw_end; ++iw)
+                    {
+                        thread_reduce_func(invariant_dim_indexes[iw]);
+                    }
+                };
+                threads[it] = joinable_thread(f);
+            }
+        };
+    };
+};
+#endif
--- a/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance.hpp
+++ b/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance.hpp
@@ -6,23 +6,36 @@
 #include "device_reduce_instance_blockwise_f32_f32_f32.hpp"
 #include "device_reduce_instance_blockwise_f32_f64_f32.hpp"
 #include "device_reduce_instance_blockwise_f64_f64_f64.hpp"
+#include "device_reduce_instance_blockwise_i8_i8_i8.hpp"
+#include "device_reduce_instance_blockwise_i8_i32_i8.hpp"
+#include "device_reduce_instance_blockwise_b16_f32_b16.hpp"
 #include "device_reduce_instance_blockwise_second_call_f16_f16_f16.hpp"
 #include "device_reduce_instance_blockwise_second_call_f32_f32_f16.hpp"
 #include "device_reduce_instance_blockwise_second_call_f32_f32_f32.hpp"
 #include "device_reduce_instance_blockwise_second_call_f64_f64_f32.hpp"
 #include "device_reduce_instance_blockwise_second_call_f64_f64_f64.hpp"
+#include "device_reduce_instance_blockwise_second_call_i8_i8_i8.hpp"
+#include "device_reduce_instance_blockwise_second_call_i32_i32_i8.hpp"
+#include "device_reduce_instance_blockwise_second_call_f32_f32_b16.hpp"
 #include "device_reduce_instance_multiblock_atomic_add_f16_f32_f32.hpp"
 #include "device_reduce_instance_multiblock_atomic_add_f32_f32_f32.hpp"
 #include "device_reduce_instance_multiblock_atomic_add_f32_f64_f32.hpp"
+#include "device_reduce_instance_multiblock_atomic_add_b16_f32_f32.hpp"
 #include "device_reduce_instance_multiblock_partial_reduce_f16_f16_f16.hpp"
 #include "device_reduce_instance_multiblock_partial_reduce_f16_f32_f16.hpp"
 #include "device_reduce_instance_multiblock_partial_reduce_f32_f32_f32.hpp"
 #include "device_reduce_instance_multiblock_partial_reduce_f32_f64_f32.hpp"
 #include "device_reduce_instance_multiblock_partial_reduce_f64_f64_f64.hpp"
+#include "device_reduce_instance_multiblock_partial_reduce_i8_i8_i8.hpp"
+#include "device_reduce_instance_multiblock_partial_reduce_i8_i32_i8.hpp"
+#include "device_reduce_instance_multiblock_partial_reduce_b16_f32_b16.hpp"
 #include "device_reduce_instance_threadwise_f16_f16_f16.hpp"
 #include "device_reduce_instance_threadwise_f16_f32_f16.hpp"
 #include "device_reduce_instance_threadwise_f32_f32_f32.hpp"
 #include "device_reduce_instance_threadwise_f32_f64_f32.hpp"
 #include "device_reduce_instance_threadwise_f64_f64_f64.hpp"
+#include "device_reduce_instance_threadwise_i8_i8_i8.hpp"
+#include "device_reduce_instance_threadwise_i8_i32_i8.hpp"
+#include "device_reduce_instance_threadwise_b16_f32_b16.hpp"
 #endif
--- a/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise.hpp
+++ b/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise.hpp
@@ -17,7 +17,6 @@ using reduce_configuration_2_instances_blockwise = std::tuple<
    ReductionConfiguration_2<0, 2, 2, 2, 1>,
    ReductionConfiguration_2<0, 1, 1, 2, 1>,
    ReductionConfiguration_2<1, 2, 1, 1, 2>,
-    ReductionConfiguration_2<1, 2, 2, 1, 2>,
    ReductionConfiguration_2<0, 1, 1, 3, 1>,
    ReductionConfiguration_2<1, 1, 1, 1, 3>
    // clang-format on

--- a/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise_b16_f32_b16.hpp
+++ b/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise_b16_f32_b16.hpp
+#ifndef DEVICE_REDUCE_INSTANCE_BLOCKWISE_B16_F32_B16_HPP
+#define DEVICE_REDUCE_INSTANCE_BLOCKWISE_B16_F32_B16_HPP
+#include "reduction_enums.hpp"
+#include "reduction_operator_mapping.hpp"
+#include "device_reduce_instance_blockwise.hpp"
+namespace ck {
+namespace tensor_operation {
+namespace device {
+namespace device_reduce_instance {
+// clang-format off
+// InDataType | AccDataType | OutDataType | ReduceOpId | NanPropaOpt | IndicesOpt | Rank | NumReduceDim 
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 0, 0, 0, 4, 3); // for ADD
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 0, 0, 0, 4, 4);
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 0, 0, 0, 4, 1);
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 0, 0, 0, 2, 1);
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 5, 0, 0, 4, 3); // for AVG
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 5, 0, 0, 4, 4);       
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 5, 0, 0, 4, 1);       
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 5, 0, 0, 2, 1);       
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 7, 0, 0, 4, 3); // for NORM2
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 7, 0, 0, 4, 4);       
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 7, 0, 0, 4, 1);       
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 7, 0, 0, 2, 1);
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 2, 0, 0, 4, 3); // for MIN
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 2, 0, 0, 4, 4);
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 2, 0, 0, 4, 1);
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 2, 0, 0, 2, 1);
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 3, 0, 0, 4, 3); // for MAX
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 3, 0, 0, 4, 4);
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 3, 0, 0, 4, 1);
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 3, 0, 0, 2, 1);
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 4, 0, 0, 4, 3); // for AMAX
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 4, 0, 0, 4, 4);
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 4, 0, 0, 4, 1);
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 4, 0, 0, 2, 1);
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 2, 0, 1, 4, 3); // for MIN
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 2, 0, 1, 4, 4);
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 2, 0, 1, 4, 1);
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 2, 0, 1, 2, 1);
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 3, 0, 1, 4, 3); // for MAX
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 3, 0, 1, 4, 4);
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 3, 0, 1, 4, 1);
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 3, 0, 1, 2, 1);
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 4, 0, 1, 4, 3); // for AMAX
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 4, 0, 1, 4, 4);
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 4, 0, 1, 4, 1);
+ADD_BLOCKWISE_INST_REF_BY_ID(bhalf_t, float, bhalf_t, 4, 0, 1, 2, 1);
+// clang-format on
+} // namespace device_reduce_instance
+} // namespace device
+} // namespace tensor_operation
+} // namespace ck
+#endif
--- a/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise_f16_f16_f16.hpp
+++ b/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise_f16_f16_f16.hpp
@@ -13,21 +13,27 @@ namespace device_reduce_instance {
 // clang-format off
 // InDataType | AccDataType | OutDataType | ReduceOpId | NanPropaOpt | IndicesOpt | Rank | NumReduceDim 
 ADD_BLOCKWISE_INST_REF_BY_ID(half_t, half_t, half_t, 2, 0, 0, 4, 3); // for MIN
+ADD_BLOCKWISE_INST_REF_BY_ID(half_t, half_t, half_t, 2, 0, 0, 4, 4);       
 ADD_BLOCKWISE_INST_REF_BY_ID(half_t, half_t, half_t, 2, 0, 0, 4, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(half_t, half_t, half_t, 2, 0, 0, 2, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(half_t, half_t, half_t, 3, 0, 0, 4, 3); // for MAX
+ADD_BLOCKWISE_INST_REF_BY_ID(half_t, half_t, half_t, 3, 0, 0, 4, 4);       
 ADD_BLOCKWISE_INST_REF_BY_ID(half_t, half_t, half_t, 3, 0, 0, 4, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(half_t, half_t, half_t, 3, 0, 0, 2, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(half_t, half_t, half_t, 4, 0, 0, 4, 3); // for AMAX
+ADD_BLOCKWISE_INST_REF_BY_ID(half_t, half_t, half_t, 4, 0, 0, 4, 4);       
 ADD_BLOCKWISE_INST_REF_BY_ID(half_t, half_t, half_t, 4, 0, 0, 4, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(half_t, half_t, half_t, 4, 0, 0, 2, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(half_t, half_t, half_t, 2, 0, 1, 4, 3); // for MIN
+ADD_BLOCKWISE_INST_REF_BY_ID(half_t, half_t, half_t, 2, 0, 1, 4, 4);       
 ADD_BLOCKWISE_INST_REF_BY_ID(half_t, half_t, half_t, 2, 0, 1, 4, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(half_t, half_t, half_t, 2, 0, 1, 2, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(half_t, half_t, half_t, 3, 0, 1, 4, 3); // for MAX
+ADD_BLOCKWISE_INST_REF_BY_ID(half_t, half_t, half_t, 3, 0, 1, 4, 4);       
 ADD_BLOCKWISE_INST_REF_BY_ID(half_t, half_t, half_t, 3, 0, 1, 4, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(half_t, half_t, half_t, 3, 0, 1, 2, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(half_t, half_t, half_t, 4, 0, 1, 4, 3); // for AMAX
+ADD_BLOCKWISE_INST_REF_BY_ID(half_t, half_t, half_t, 4, 0, 1, 4, 4);       
 ADD_BLOCKWISE_INST_REF_BY_ID(half_t, half_t, half_t, 4, 0, 1, 4, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(half_t, half_t, half_t, 4, 0, 1, 2, 1);
 // clang-format on

--- a/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise_f16_f32_f16.hpp
+++ b/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise_f16_f32_f16.hpp
@@ -13,12 +13,15 @@ namespace device_reduce_instance {
 // clang-format off
 // InDataType | AccDataType | OutDataType | ReduceOpId | NanPropaOpt | IndicesOpt | Rank | NumReduceDim 
 ADD_BLOCKWISE_INST_REF_BY_ID(half_t, float, half_t, 0, 0, 0, 4, 3); // for ADD
+ADD_BLOCKWISE_INST_REF_BY_ID(half_t, float, half_t, 0, 0, 0, 4, 4);
 ADD_BLOCKWISE_INST_REF_BY_ID(half_t, float, half_t, 0, 0, 0, 4, 1);
 ADD_BLOCKWISE_INST_REF_BY_ID(half_t, float, half_t, 0, 0, 0, 2, 1);
 ADD_BLOCKWISE_INST_REF_BY_ID(half_t, float, half_t, 5, 0, 0, 4, 3); // for AVG
+ADD_BLOCKWISE_INST_REF_BY_ID(half_t, float, half_t, 5, 0, 0, 4, 4);       
 ADD_BLOCKWISE_INST_REF_BY_ID(half_t, float, half_t, 5, 0, 0, 4, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(half_t, float, half_t, 5, 0, 0, 2, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(half_t, float, half_t, 7, 0, 0, 4, 3); // for NORM2
+ADD_BLOCKWISE_INST_REF_BY_ID(half_t, float, half_t, 7, 0, 0, 4, 4);       
 ADD_BLOCKWISE_INST_REF_BY_ID(half_t, float, half_t, 7, 0, 0, 4, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(half_t, float, half_t, 7, 0, 0, 2, 1);
 // clang-format on

--- a/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise_f32_f32_f32.hpp
+++ b/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise_f32_f32_f32.hpp
@@ -13,30 +13,39 @@ namespace device_reduce_instance {
 // clang-format off
 // InDataType | AccDataType | OutDataType | ReduceOpId | NanPropaOpt | IndicesOpt | Rank | NumReduceDim 
 ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 0, 0, 0, 4, 3); // for ADD
+ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 0, 0, 0, 4, 4);
 ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 0, 0, 0, 4, 1);
 ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 0, 0, 0, 2, 1);
 ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 5, 0, 0, 4, 3); // for AVG
+ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 5, 0, 0, 4, 4);       
 ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 5, 0, 0, 4, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 5, 0, 0, 2, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 7, 0, 0, 4, 3); // for NORM2
+ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 7, 0, 0, 4, 4);       
 ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 7, 0, 0, 4, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 7, 0, 0, 2, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 2, 0, 0, 4, 3); // for MIN
+ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 2, 0, 0, 4, 4);       
 ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 2, 0, 0, 4, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 2, 0, 0, 2, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 3, 0, 0, 4, 3); // for MAX
+ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 3, 0, 0, 4, 4);       
 ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 3, 0, 0, 4, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 3, 0, 0, 2, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 4, 0, 0, 4, 3); // for AMAX
+ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 4, 0, 0, 4, 4);       
 ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 4, 0, 0, 4, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 4, 0, 0, 2, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 2, 0, 1, 4, 3); // for MIN
+ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 2, 0, 1, 4, 4);       
 ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 2, 0, 1, 4, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 2, 0, 1, 2, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 3, 0, 1, 4, 3); // for MAX
+ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 3, 0, 1, 4, 4);      
 ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 3, 0, 1, 4, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 3, 0, 1, 2, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 4, 0, 1, 4, 3); // for AMAX
+ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 4, 0, 1, 4, 4);       
 ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 4, 0, 1, 4, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(float, float, float, 4, 0, 1, 2, 1);
 // clang-format on

--- a/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise_f32_f64_f32.hpp
+++ b/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise_f32_f64_f32.hpp
@@ -13,12 +13,15 @@ namespace device_reduce_instance {
 // clang-format off
 // InDataType | AccDataType | OutDataType | ReduceOpId | NanPropaOpt | IndicesOpt | Rank | NumReduceDim 
 ADD_BLOCKWISE_INST_REF_BY_ID(float, double, float, 0, 0, 0, 4, 3); // for ADD
+ADD_BLOCKWISE_INST_REF_BY_ID(float, double, float, 0, 0, 0, 4, 4);
 ADD_BLOCKWISE_INST_REF_BY_ID(float, double, float, 0, 0, 0, 4, 1);
 ADD_BLOCKWISE_INST_REF_BY_ID(float, double, float, 0, 0, 0, 2, 1);
 ADD_BLOCKWISE_INST_REF_BY_ID(float, double, float, 5, 0, 0, 4, 3); // for AVG
+ADD_BLOCKWISE_INST_REF_BY_ID(float, double, float, 5, 0, 0, 4, 4);       
 ADD_BLOCKWISE_INST_REF_BY_ID(float, double, float, 5, 0, 0, 4, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(float, double, float, 5, 0, 0, 2, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(float, double, float, 7, 0, 0, 4, 3); // for NORM2
+ADD_BLOCKWISE_INST_REF_BY_ID(float, double, float, 7, 0, 0, 4, 4);       
 ADD_BLOCKWISE_INST_REF_BY_ID(float, double, float, 7, 0, 0, 4, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(float, double, float, 7, 0, 0, 2, 1);
 // clang-format on

--- a/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise_f64_f64_f64.hpp
+++ b/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise_f64_f64_f64.hpp
@@ -13,30 +13,39 @@ namespace device_reduce_instance {
 // clang-format off
 // InDataType | AccDataType | OutDataType | ReduceOpId | NanPropaOpt | IndicesOpt | Rank | NumReduceDim 
 ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 0, 0, 0, 4, 3); // for ADD
+ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 0, 0, 0, 4, 4);
 ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 0, 0, 0, 4, 1);
 ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 0, 0, 0, 2, 1);
 ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 5, 0, 0, 4, 3); // for AVG
+ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 5, 0, 0, 4, 4);       
 ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 5, 0, 0, 4, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 5, 0, 0, 2, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 7, 0, 0, 4, 3); // for NORM2
+ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 7, 0, 0, 4, 4);       
 ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 7, 0, 0, 4, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 7, 0, 0, 2, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 2, 0, 0, 4, 3); // for MIN
+ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 2, 0, 0, 4, 4);       
 ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 2, 0, 0, 4, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 2, 0, 0, 2, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 3, 0, 0, 4, 3); // for MAX
+ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 3, 0, 0, 4, 4);       
 ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 3, 0, 0, 4, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 3, 0, 0, 2, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 4, 0, 0, 4, 3); // for AMAX
+ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 4, 0, 0, 4, 4);       
 ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 4, 0, 0, 4, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 4, 0, 0, 2, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 2, 0, 1, 4, 3); // for MIN
+ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 2, 0, 1, 4, 4);       
 ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 2, 0, 1, 4, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 2, 0, 1, 2, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 3, 0, 1, 4, 3); // for MAX
+ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 3, 0, 1, 4, 4);       
 ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 3, 0, 1, 4, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 3, 0, 1, 2, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 4, 0, 1, 4, 3); // for AMAX
+ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 4, 0, 1, 4, 4);       
 ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 4, 0, 1, 4, 1);       
 ADD_BLOCKWISE_INST_REF_BY_ID(double, double, double, 4, 0, 1, 2, 1);
 // clang-format on

--- a/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise_i8_i32_i8.hpp
+++ b/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise_i8_i32_i8.hpp
+#ifndef DEVICE_REDUCE_INSTANCE_BLOCKWISE_I8_I32_I8_HPP
+#define DEVICE_REDUCE_INSTANCE_BLOCKWISE_I8_I32_I8_HPP
+#include "reduction_enums.hpp"
+#include "reduction_operator_mapping.hpp"
+#include "device_reduce_instance_blockwise.hpp"
+namespace ck {
+namespace tensor_operation {
+namespace device {
+namespace device_reduce_instance {
+// clang-format off
+// InDataType | AccDataType | OutDataType | ReduceOpId | NanPropaOpt | IndicesOpt | Rank | NumReduceDim 
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int32_t, int8_t, 0, 0, 0, 4, 3); // for ADD
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int32_t, int8_t, 0, 0, 0, 4, 4);
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int32_t, int8_t, 0, 0, 0, 4, 1);
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int32_t, int8_t, 0, 0, 0, 2, 1);
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int32_t, int8_t, 5, 0, 0, 4, 3); // for AVG
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int32_t, int8_t, 5, 0, 0, 4, 4);       
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int32_t, int8_t, 5, 0, 0, 4, 1);       
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int32_t, int8_t, 5, 0, 0, 2, 1);       
+// clang-format on
+} // namespace device_reduce_instance
+} // namespace device
+} // namespace tensor_operation
+} // namespace ck
+#endif
--- a/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise_i8_i8_i8.hpp
+++ b/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise_i8_i8_i8.hpp
+#ifndef DEVICE_REDUCE_INSTANCE_BLOCKWISE_I8_I8_I8_HPP
+#define DEVICE_REDUCE_INSTANCE_BLOCKWISE_I8_I8_I8_HPP
+#include "reduction_enums.hpp"
+#include "reduction_operator_mapping.hpp"
+#include "device_reduce_instance_blockwise.hpp"
+namespace ck {
+namespace tensor_operation {
+namespace device {
+namespace device_reduce_instance {
+// clang-format off
+// InDataType | AccDataType | OutDataType | ReduceOpId | NanPropaOpt | IndicesOpt | Rank | NumReduceDim 
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int8_t, int8_t, 2, 0, 0, 4, 3); // for MIN
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int8_t, int8_t, 2, 0, 0, 4, 4);       
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int8_t, int8_t, 2, 0, 0, 4, 1);       
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int8_t, int8_t, 2, 0, 0, 2, 1);       
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int8_t, int8_t, 3, 0, 0, 4, 3); // for MAX
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int8_t, int8_t, 3, 0, 0, 4, 4);       
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int8_t, int8_t, 3, 0, 0, 4, 1);       
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int8_t, int8_t, 3, 0, 0, 2, 1);       
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int8_t, int8_t, 4, 0, 0, 4, 3); // for AMAX
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int8_t, int8_t, 4, 0, 0, 4, 4);       
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int8_t, int8_t, 4, 0, 0, 4, 1);       
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int8_t, int8_t, 4, 0, 0, 2, 1);       
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int8_t, int8_t, 2, 0, 1, 4, 3); // for MIN
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int8_t, int8_t, 2, 0, 1, 4, 4);       
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int8_t, int8_t, 2, 0, 1, 4, 1);       
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int8_t, int8_t, 2, 0, 1, 2, 1);       
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int8_t, int8_t, 3, 0, 1, 4, 3); // for MAX
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int8_t, int8_t, 3, 0, 1, 4, 4);       
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int8_t, int8_t, 3, 0, 1, 4, 1);       
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int8_t, int8_t, 3, 0, 1, 2, 1);       
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int8_t, int8_t, 4, 0, 1, 4, 3); // for AMAX
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int8_t, int8_t, 4, 0, 1, 4, 4);       
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int8_t, int8_t, 4, 0, 1, 4, 1);       
+ADD_BLOCKWISE_INST_REF_BY_ID(int8_t, int8_t, int8_t, 4, 0, 1, 2, 1);
+// clang-format on
+} // namespace device_reduce_instance
+} // namespace device
+} // namespace tensor_operation
+} // namespace ck
+#endif
--- a/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise_second_call.hpp
+++ b/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise_second_call.hpp
@@ -15,9 +15,7 @@ using reduce_configuration_2_instances_blockwise_second_call = std::tuple<
    // clang-format off
    // InSrcVectorDim | InSrcVectorSize | OutDstVectorSize | MThreadSliceSize | KThreadSliceSize
    ReductionConfiguration_2<1, 2, 1, 1, 2>,
-    ReductionConfiguration_2<1, 2, 2, 1, 2>,
+    ReductionConfiguration_2<1, 1, 1, 1, 3>
-    ReductionConfiguration_2<1, 1, 1, 1, 3>,
-    ReductionConfiguration_2<1, 1, 2, 1, 3>
    // clang-format on
    >;
 #else

--- a/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise_second_call_f16_f16_f16.hpp
+++ b/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise_second_call_f16_f16_f16.hpp
@@ -13,21 +13,27 @@ namespace device_reduce_instance {
 // clang-format off
 // InDataType | AccDataType | OutDataType | ReduceOpId | NanPropaOpt | IndicesOpt | Rank | NumReduceDim 
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(half_t, half_t, half_t, 2, 0, 0, 4, 3); // for MIN
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(half_t, half_t, half_t, 2, 0, 0, 4, 4);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(half_t, half_t, half_t, 2, 0, 0, 4, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(half_t, half_t, half_t, 2, 0, 0, 2, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(half_t, half_t, half_t, 3, 0, 0, 4, 3); // for MAX
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(half_t, half_t, half_t, 3, 0, 0, 4, 4);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(half_t, half_t, half_t, 3, 0, 0, 4, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(half_t, half_t, half_t, 3, 0, 0, 2, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(half_t, half_t, half_t, 4, 0, 0, 4, 3); // for AMAX
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(half_t, half_t, half_t, 4, 0, 0, 4, 4);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(half_t, half_t, half_t, 4, 0, 0, 4, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(half_t, half_t, half_t, 4, 0, 0, 2, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(half_t, half_t, half_t, 2, 0, 1, 4, 3); // for MIN
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(half_t, half_t, half_t, 2, 0, 1, 4, 4);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(half_t, half_t, half_t, 2, 0, 1, 4, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(half_t, half_t, half_t, 2, 0, 1, 2, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(half_t, half_t, half_t, 3, 0, 1, 4, 3); // for MAX
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(half_t, half_t, half_t, 3, 0, 1, 4, 4);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(half_t, half_t, half_t, 3, 0, 1, 4, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(half_t, half_t, half_t, 3, 0, 1, 2, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(half_t, half_t, half_t, 4, 0, 1, 4, 3); // for AMAX
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(half_t, half_t, half_t, 4, 0, 1, 4, 4);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(half_t, half_t, half_t, 4, 0, 1, 4, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(half_t, half_t, half_t, 4, 0, 1, 2, 1);
 // clang-format on

--- a/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise_second_call_f32_f32_b16.hpp
+++ b/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise_second_call_f32_f32_b16.hpp
+#ifndef DEVICE_REDUCE_INSTANCE_BLOCKWISE_SECOND_CALL_F32_F32_B16_HPP
+#define DEVICE_REDUCE_INSTANCE_BLOCKWISE_SECOND_CALL_F32_F32_B16_HPP
+#include "reduction_enums.hpp"
+#include "reduction_operator_mapping.hpp"
+#include "device_reduce_instance_blockwise_second_call.hpp"
+namespace ck {
+namespace tensor_operation {
+namespace device {
+namespace device_reduce_instance {
+// clang-format off
+// InDataType | AccDataType | OutDataType | ReduceOpId | NanPropaOpt | IndicesOpt | Rank | NumReduceDim 
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 0, 0, 0, 4, 3); // for ADD
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 0, 0, 0, 4, 4);
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 0, 0, 0, 4, 1);
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 0, 0, 0, 2, 1);
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 5, 0, 0, 4, 3); // for AVG
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 5, 0, 0, 4, 4);       
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 5, 0, 0, 4, 1);       
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 5, 0, 0, 2, 1);       
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 7, 0, 0, 4, 3); // for NORM2
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 7, 0, 0, 4, 4);       
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 7, 0, 0, 4, 1);       
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 7, 0, 0, 2, 1);
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 2, 0, 0, 4, 3); // for MIN
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 2, 0, 0, 4, 4);
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 2, 0, 0, 4, 1);
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 2, 0, 0, 2, 1);
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 3, 0, 0, 4, 3); // for MAX
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 3, 0, 0, 4, 4);
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 3, 0, 0, 4, 1);
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 3, 0, 0, 2, 1);
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 4, 0, 0, 4, 3); // for AMAX
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 4, 0, 0, 4, 4);
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 4, 0, 0, 4, 1);
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 4, 0, 0, 2, 1);
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 2, 0, 1, 4, 3); // for MIN
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 2, 0, 1, 4, 4);
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 2, 0, 1, 4, 1);
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 2, 0, 1, 2, 1);
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 3, 0, 1, 4, 3); // for MAX
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 3, 0, 1, 4, 4);
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 3, 0, 1, 4, 1);
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 3, 0, 1, 2, 1);
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 4, 0, 1, 4, 3); // for AMAX
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 4, 0, 1, 4, 4);
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 4, 0, 1, 4, 1);
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, bhalf_t, 4, 0, 1, 2, 1);
+// clang-format on
+} // namespace device_reduce_instance
+} // namespace device
+} // namespace tensor_operation
+} // namespace ck
+#endif
--- a/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise_second_call_f32_f32_f16.hpp
+++ b/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise_second_call_f32_f32_f16.hpp
@@ -13,12 +13,15 @@ namespace device_reduce_instance {
 // clang-format off
 // InDataType | AccDataType | OutDataType | ReduceOpId | NanPropaOpt | IndicesOpt | Rank | NumReduceDim 
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, half_t, 0, 0, 0, 4, 3); // for ADD
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, half_t, 0, 0, 0, 4, 4);
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, half_t, 0, 0, 0, 4, 1);
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, half_t, 0, 0, 0, 2, 1);
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, half_t, 5, 0, 0, 4, 3); // for AVG
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, half_t, 5, 0, 0, 4, 4);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, half_t, 5, 0, 0, 4, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, half_t, 5, 0, 0, 2, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, half_t, 7, 0, 0, 4, 3); // for NORM2
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, half_t, 7, 0, 0, 4, 4);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, half_t, 7, 0, 0, 4, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, half_t, 7, 0, 0, 2, 1);
 // clang-format on

--- a/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise_second_call_f32_f32_f32.hpp
+++ b/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise_second_call_f32_f32_f32.hpp
@@ -13,30 +13,39 @@ namespace device_reduce_instance {
 // clang-format off
 // InDataType | AccDataType | OutDataType | ReduceOpId | NanPropaOpt | IndicesOpt | Rank | NumReduceDim 
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 0, 0, 0, 4, 3); // for ADD
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 0, 0, 0, 4, 4);
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 0, 0, 0, 4, 1);
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 0, 0, 0, 2, 1);
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 5, 0, 0, 4, 3); // for AVG
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 5, 0, 0, 4, 4);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 5, 0, 0, 4, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 5, 0, 0, 2, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 7, 0, 0, 4, 3); // for NORM2
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 7, 0, 0, 4, 4);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 7, 0, 0, 4, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 7, 0, 0, 2, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 2, 0, 0, 4, 3); // for MIN
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 2, 0, 0, 4, 4);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 2, 0, 0, 4, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 2, 0, 0, 2, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 3, 0, 0, 4, 3); // for MAX
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 3, 0, 0, 4, 4);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 3, 0, 0, 4, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 3, 0, 0, 2, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 4, 0, 0, 4, 3); // for AMAX
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 4, 0, 0, 4, 4);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 4, 0, 0, 4, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 4, 0, 0, 2, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 2, 0, 1, 4, 3); // for MIN
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 2, 0, 1, 4, 4);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 2, 0, 1, 4, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 2, 0, 1, 2, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 3, 0, 1, 4, 3); // for MAX
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 3, 0, 1, 4, 4);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 3, 0, 1, 4, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 3, 0, 1, 2, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 4, 0, 1, 4, 3); // for AMAX
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 4, 0, 1, 4, 4);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 4, 0, 1, 4, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(float, float, float, 4, 0, 1, 2, 1);
 // clang-format on

--- a/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise_second_call_f64_f64_f32.hpp
+++ b/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise_second_call_f64_f64_f32.hpp
@@ -13,12 +13,15 @@ namespace device_reduce_instance {
 // clang-format off
 // InDataType | AccDataType | OutDataType | ReduceOpId | NanPropaOpt | IndicesOpt | Rank | NumReduceDim 
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, float, 0, 0, 0, 4, 3); // for ADD
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, float, 0, 0, 0, 4, 4);
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, float, 0, 0, 0, 4, 1);
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, float, 0, 0, 0, 2, 1);
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, float, 5, 0, 0, 4, 3); // for AVG
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, float, 5, 0, 0, 4, 4);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, float, 5, 0, 0, 4, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, float, 5, 0, 0, 2, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, float, 7, 0, 0, 4, 3); // for NORM2
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, float, 7, 0, 0, 4, 4);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, float, 7, 0, 0, 4, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, float, 7, 0, 0, 2, 1);
 // clang-format on

--- a/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise_second_call_f64_f64_f64.hpp
+++ b/library/include/ck/library/tensor_operation_instance/gpu/reduce/device_reduce_instance_blockwise_second_call_f64_f64_f64.hpp
@@ -13,30 +13,39 @@ namespace device_reduce_instance {
 // clang-format off
 // InDataType | AccDataType | OutDataType | ReduceOpId | NanPropaOpt | IndicesOpt | Rank | NumReduceDim 
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 0, 0, 0, 4, 3); // for ADD
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 0, 0, 0, 4, 4);
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 0, 0, 0, 4, 1);
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 0, 0, 0, 2, 1);
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 5, 0, 0, 4, 3); // for AVG
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 5, 0, 0, 4, 4);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 5, 0, 0, 4, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 5, 0, 0, 2, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 7, 0, 0, 4, 3); // for NORM2
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 7, 0, 0, 4, 4);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 7, 0, 0, 4, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 7, 0, 0, 2, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 2, 0, 0, 4, 3); // for MIN
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 2, 0, 0, 4, 4);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 2, 0, 0, 4, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 2, 0, 0, 2, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 3, 0, 0, 4, 3); // for MAX
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 3, 0, 0, 4, 4);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 3, 0, 0, 4, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 3, 0, 0, 2, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 4, 0, 0, 4, 3); // for AMAX
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 4, 0, 0, 4, 4);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 4, 0, 0, 4, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 4, 0, 0, 2, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 2, 0, 1, 4, 3); // for MIN
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 2, 0, 1, 4, 4);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 2, 0, 1, 4, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 2, 0, 1, 2, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 3, 0, 1, 4, 3); // for MAX
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 3, 0, 1, 4, 4);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 3, 0, 1, 4, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 3, 0, 1, 2, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 4, 0, 1, 4, 3); // for AMAX
+ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 4, 0, 1, 4, 4);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 4, 0, 1, 4, 1);       
 ADD_BLOCKWISE_SECOND_CALL_INST_REF_BY_ID(double, double, double, 4, 0, 1, 2, 1);
 // clang-format on