"git@developer.sourcefind.cn:xdb4_94051/vllm.git" did not exist on "655a5e48df3937bf793add53aa95ce0c992a24c6"
Commit 7d458df4 authored by charlie

Merge branch 'develop' of github.com:ROCmSoftwarePlatform/AMDMIGraphX into simplify_var_input_slices
parents 86e5d33c 0039b11a
......@@ -14,3 +14,4 @@ Contributor Guide
dev/pass
dev/matchers
dev/tools
dev/env_vars
Environment Variables
=====================
For parsing
---------------
**MIGRAPHX_TRACE_ONNX_PARSER**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Print debugging traces for the onnx parser.
Prints: initializers (if used), ONNX node operators, added MIGraphX instructions
**MIGRAPHX_DISABLE_FP16_INSTANCENORM_CONVERT**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Disables the fp16-to-fp32 conversion that MIGraphX applies to the InstanceNormalization ONNX operator as a workaround for accuracy issues with reduce_mean/variance.
See ``parse_instancenorm.cpp`` for more details.
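MIGraphX reads these flags through the helpers declared in ``migraphx/env.hpp``. A minimal sketch of how a boolean flag such as ``MIGRAPHX_TRACE_ONNX_PARSER`` is typically declared and checked (the function name and call site are illustrative, not taken from this commit)::

    #include <migraphx/env.hpp>

    namespace migraphx {

    // Declares a tag type whose name doubles as the environment variable name.
    MIGRAPHX_DECLARE_ENV_VAR(MIGRAPHX_TRACE_ONNX_PARSER)

    void maybe_trace_parser()
    {
        // enabled() accepts "1", "enable", "enabled", "yes", or "true".
        if(enabled(MIGRAPHX_TRACE_ONNX_PARSER{}))
        {
            // print initializers, ONNX node operators, and added instructions here
        }
    }

    } // namespace migraphx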
Matchers
------------
**MIGRAPHX_TRACE_MATCHES**
Set to "1" to print the matcher that matches an instruction and the matched instruction.
Set to "2" and use the ``MIGRAPHX_TRACE_MATHCES_FOR`` flag to filter out results.
**MIGRAPHX_TRACE_MATCHES_FOR**
Set to the name of a matcher; only traces for that matcher will be printed.
**MIGRAPHX_VALIDATE_MATCHES**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Validate the module after finding the matches (runs ``module.validate()``).
Program Execution
---------------------
**MIGRAPHX_TRACE_EVAL**
Set to "1", "2", or "3" to use.
"1" prints the instruction run and the time taken.
"2" prints everything in "1" and a snippet of the output argument and some statistics (ex. min, max, mean) of the output.
"3" prints everything in "1" and the full output buffers.
Program Verification
------------------------
**MIGRAPHX_VERIFY_ENABLE_ALLCLOSE**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Uses ``allclose`` with the given ``atol`` and ``rtol`` for verifying ranges with ``driver verify`` or the tests that use ``migraphx/verify.hpp``.
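``allclose`` here refers to the usual elementwise tolerance check; a self-contained sketch of the criterion (the exact implementation behind ``migraphx/verify.hpp`` may differ in details such as NaN handling)::

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Passes when every element satisfies |actual - expected| <= atol + rtol * |expected|.
    bool allclose_sketch(const std::vector<float>& actual,
                         const std::vector<float>& expected,
                         double atol,
                         double rtol)
    {
        return actual.size() == expected.size() and
               std::equal(actual.begin(), actual.end(), expected.begin(), [&](float a, float e) {
                   return std::fabs(a - e) <= atol + rtol * std::fabs(e);
               });
    }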
Pass debugging or Pass controls
-----------------------------------
**MIGRAPHX_TRACE_ELIMINATE_CONTIGUOUS**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Print debugging output for the instructions whose input ``contiguous`` instructions are removed.
**MIGRAPHX_DISABLE_POINTWISE_FUSION**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Disables the ``fuse_pointwise`` compile pass.
**MIGRAPHX_DEBUG_MEMORY_COLORING**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Print debug statements for the ``memory_coloring`` pass.
**MIGRAPHX_TRACE_SCHEDULE**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Print debug statements for the ``schedule`` pass.
**MIGRAPHX_TRACE_PROPAGATE_CONSTANT**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Traces instructions replaced with a constant.
**MIGRAPHX_INT8_QUANTIZATION_PARAMS**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Print the quantization parameters in only the main module.
**MIGRAPHX_DISABLE_DNNL_POST_OPS_WORKAROUND**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Disable the DNNL post ops workaround.
**MIGRAPHX_DISABLE_MIOPEN_FUSION**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Disable MIOpen fusions.
**MIGRAPHX_DISABLE_SCHEDULE_PASS**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Disable the ``schedule`` pass.
**MIGRAPHX_DISABLE_REDUCE_FUSION**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Disable the ``fuse_reduce`` pass.
**MIGRAPHX_ENABLE_NHWC**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Enable the ``layout_nhwc`` pass.
**MIGRAPHX_ENABLE_CK**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Enable using the Composable Kernels library.
Should be used in conjunction with ``MIGRAPHX_DISABLE_MLIR=1``.
**MIGRAPHX_DISABLE_MLIR**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Disable using the rocMLIR library.
**MIGRAPHX_ENABLE_EXTRA_MLIR**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Enables additional opportunities to use MLIR that may improve performance.
**MIGRAPHX_COPY_LITERALS**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Use ``hip_copy_to_gpu`` with a new ``literal`` instruction instead of ``hip_copy_literal{}``.
Compilation traces
----------------------
**MIGRAPHX_TRACE_FINALIZE**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Debug print instructions during the ``module.finalize()`` step.
**MIGRAPHX_TRACE_COMPILE**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Print trace information for the graph compilation process.
**MIGRAPHX_TRACE_PASSES**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Print the compile pass and the program after the pass.
**MIGRAPHX_TIME_PASSES**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Time the compile passes.
GPU kernel JIT compilation debugging (applicable to both HIPRTC and hip-clang)
--------------------------------------------------------------------------------
**MIGRAPHX_TRACE_CMD_EXECUTE**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Print commands executed by the MIGraphX ``process``.
**MIGRAPHX_TRACE_HIPRTC**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Print HIPRTC options and C++ file executed.
**MIGRAPHX_DEBUG_SAVE_TEMP_DIR**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Keep the created temporary directories instead of deleting them.
**MIGRAPHX_GPU_DEBUG**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Internally, this adds the ``-DMIGRAPHX_DEBUG`` option when compiling GPU kernels, which enables assertions and captures source locations for errors.
**MIGRAPHX_GPU_DEBUG_SYM**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Adds the option ``-g`` when compiling HIPRTC.
**MIGRAPHX_GPU_DUMP_SRC**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Dump the HIPRTC source files compiled.
**MIGRAPHX_GPU_DUMP_ASM**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Dump the hip-clang assembly.
**MIGRAPHX_GPU_OPTIMIZE**
Set the optimization mode for GPU compile (``-O`` option).
Defaults to ``-O3``.
**MIGRAPHX_GPU_COMPILE_PARALLEL**
Set to the number of threads to use.
Compile GPU code in parallel with the given number of threads.
**MIGRAPHX_TRACE_NARY**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Print the ``nary`` device functions used.
**MIGRAPHX_ENABLE_HIPRTC_WORKAROUNDS**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Enable HIPRTC workarounds for bugs in HIPRTC.
**MIGRAPHX_USE_FAST_SOFTMAX**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Use the fast softmax optimization.
**MIGRAPHX_ENABLE_NULL_STREAM**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Allow using a null stream for MIOpen and hipStream calls.
**MIGRAPHX_NSTREAMS**
Set to the number of streams to use.
Defaults to 1.
**MIGRAPHX_TRACE_BENCHMARKING**
Set to "1" to print benchmarching trace.
Set to "2" to print benchmarching trace with more detail.
MLIR vars
-------------
**MIGRAPHX_TRACE_MLIR**
Set to "1" to trace MLIR and print any failures.
Set to "2" to additionally print all MLIR operations.
**MIGRAPHX_MLIR_USE_SPECIFIC_OPS**
Set to the names of operations that should always use MLIR, regardless of GPU architecture.
Accepts a comma-separated list of operators (ex. "fused", "convolution", "dot").
**MIGRAPHX_MLIR_TUNING_DB**
Set to the path of the MLIR tuning database to load.
**MIGRAPHX_MLIR_TUNING_CFG**
Set to the path of the tuning configuration.
Appends to the tuning configuration file, which can be used with the rocMLIR tuning scripts.
**MIGRAPHX_MLIR_TUNE_EXHAUSTIVE**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Do exhaustive tuning for MLIR.
CK vars
-----------
**MIGRAPHX_LOG_CK_GEMM**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Print Composable Kernels GEMM traces.
**MIGRAPHX_CK_DEBUG**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Always add the ``-DMIGRAPHX_CK_CHECK=1`` for compiling Composable Kernels operators.
**MIGRAPHX_TUNE_CK**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Use tuning for Composable Kernels.
Testing
------------
**MIGRAPHX_TRACE_TEST_COMPILE**
Set to the target that you want to trace the compilation of (ex. "gpu", "cpu").
Prints the compile trace for the given target for the verify tests.
This flag should not be used in conjunction with ``MIGRAPHX_TRACE_COMPILE``.
For the verify tests, use ``MIGRAPHX_TRACE_TEST_COMPILE`` only.
**MIGRAPHX_TRACE_TEST**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Prints the reference and target programs even if verification passed successfully.
**MIGRAPHX_DUMP_TEST**
Set to "1", "enable", "enabled", "yes", or "true" to use.
Dumps verify tests to ``.mxr`` files.
......@@ -119,6 +119,7 @@ void verify_program(const std::string& name,
auto target_outs = run_target(p, t, options, quantize, inputs);
std::size_t output_num = ref_outs.size();
bool passed = true;
for(std::size_t i = 0; i < output_num; ++i)
{
if(ref_outs[i].get_shape().type() != target_outs[i].get_shape().type() or
......@@ -130,9 +131,11 @@ void verify_program(const std::string& name,
}
else
{
verify_args(name, target_outs[i], verify::expected{ref_outs[i]}, tols);
passed &= verify_args(name, target_outs[i], verify::expected{ref_outs[i]}, tols);
}
}
if(passed)
std::cout << "MIGraphX verification passed successfully." << std::endl;
}
void verify_instructions(const program& prog,
......
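In the hunk above, ``verify_program`` now accumulates the result of every ``verify_args`` call into ``passed`` and prints "MIGraphX verification passed successfully." once after all outputs have been checked; the old unconditional success print inside ``verify_args`` is removed in a later hunk of this commit.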
......@@ -219,9 +219,8 @@ struct find_pointwise_reshape_pointwise
auto reshape_input = [&](const auto& ins_to_insert) {
return [&](auto input) {
auto c = m.insert_instruction(ins_to_insert, make_op("contiguous"), input);
return m.insert_instruction(
ins_to_insert, make_op("reshape", {{"dims", cd.dims}}), c);
ins_to_insert, make_op("reshape", {{"dims", cd.dims}}), input);
};
};
auto x_inputs = x_ins->inputs();
......
......@@ -591,6 +591,19 @@ MIGRAPHX_PRED_MATCHER(same_input_shapes, instruction_ref ins)
ins->inputs().begin(), ins->inputs().end(), [&](auto x) { return x->get_shape() == s; });
}
MIGRAPHX_PRED_MATCHER(has_same_value, instruction_ref ins)
{
if(ins->name() != "@literal")
return false;
bool all_same = false;
ins->get_literal().visit([&](auto s) {
all_same = std::all_of(s.begin() + 1, s.end(), [&](const auto& scale) {
return float_equal(scale, s.front());
});
});
return all_same;
}
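// Usage sketch (not from this commit): restrict a match to literals whose elements are
// all equal, e.g. skip_broadcasts(has_same_value().bind("scale")).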
MIGRAPHX_BASIC_MATCHER(output, const matcher_context&, instruction_ref ins)
{
if(ins->outputs().size() == 1)
......@@ -844,6 +857,12 @@ auto skip_broadcasts_converts(Ms... ms)
return skip(name("broadcast", "multibroadcast", "contiguous", "convert"))(ms...);
}
template <class... Ms>
auto skip_broadcasts_transposes_contiguous(Ms... ms)
{
return skip(name("broadcast", "multibroadcast", "contiguous", "transpose"))(ms...);
}
template <class T>
inline auto has_value(T x, float tolerance = 1e-6)
{
......
......@@ -42,6 +42,7 @@
namespace migraphx {
inline namespace MIGRAPHX_INLINE_NS {
MIGRAPHX_EXPORT
const operation& get_operation(instruction_ref ins);
struct module_impl;
......
......@@ -35,7 +35,7 @@ struct isinf : unary<isinf>
{
auto apply() const
{
return [&](auto x) { return std::isinf(x); };
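// Casting to double ensures std::isinf resolves cleanly for non-float inputs such as half
// (inferred rationale; not stated in the commit).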
return [&](auto x) { return std::isinf(static_cast<double>(x)); };
}
std::string name() const { return "isinf"; }
......
......@@ -31,6 +31,7 @@
#include <migraphx/dyn_output.hpp>
#include <migraphx/op/normalize_attribute.hpp>
#include <migraphx/normalize_attributes.hpp>
#include <array>
namespace migraphx {
inline namespace MIGRAPHX_INLINE_NS {
......
......@@ -116,6 +116,37 @@ void lstm_actv_functions(op::rnn_direction dirct, std::vector<std::string>& actv
}
}
void lstm_transpose_inputs(onnx_parser::node_info& info, std::vector<instruction_ref>& args)
{
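// With layout == 1, the ONNX inputs are batch-major:
//   X            : [batch_size, seq_length, input_size] -> [seq_length, batch_size, input_size]
//   initial_h/_c : [batch_size, num_directions, hidden_size] -> [num_directions, batch_size, hidden_size]
// so the existing layout-0 lstm lowering can be reused.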
std::vector<int64_t> perm{1, 0, 2};
args[0] = info.add_instruction(make_op("transpose", {{"permutation", perm}}), args[0]);
if(args.size() >= 6 and not args[5]->is_undefined())
{
args[5] = info.add_instruction(make_op("transpose", {{"permutation", perm}}), args[5]);
}
if(args.size() >= 7 and not args[6]->is_undefined())
{
args[6] = info.add_instruction(make_op("transpose", {{"permutation", perm}}), args[6]);
}
}
void lstm_transpose_outputs(onnx_parser::node_info& info,
instruction_ref& hidden_states,
instruction_ref& last_output,
instruction_ref& last_cell_output)
{
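// Convert the layout-0 outputs back to layout == 1 (batch-major) ordering:
//   Y (hidden_states): [seq_length, num_directions, batch_size, hidden_size]
//                      -> [batch_size, seq_length, num_directions, hidden_size]
//   Y_h / Y_c        : [num_directions, batch_size, hidden_size] -> [batch_size, num_directions, hidden_size]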
std::vector<int64_t> perm_hs{2, 0, 1, 3};
hidden_states =
info.add_instruction(make_op("transpose", {{"permutation", perm_hs}}), hidden_states);
std::vector<int64_t> perm_last{1, 0, 2};
last_output =
info.add_instruction(make_op("transpose", {{"permutation", perm_last}}), last_output);
last_cell_output =
info.add_instruction(make_op("transpose", {{"permutation", perm_last}}), last_cell_output);
}
struct parse_lstm : op_parser<parse_lstm>
{
std::vector<op_desc> operators() const { return {{"LSTM"}}; }
......@@ -202,6 +233,12 @@ struct parse_lstm : op_parser<parse_lstm>
input_forget = parser.parse_value(info.attributes.at("input_forget")).at<int>();
}
int layout = 0;
if(contains(info.attributes, "layout"))
{
layout = parser.parse_value(info.attributes.at("layout")).at<int>();
}
// append undefined operators to pad the argument list to 8
if(args.size() < 8)
{
......@@ -209,6 +246,11 @@ struct parse_lstm : op_parser<parse_lstm>
args.insert(args.end(), 8 - args.size(), ins);
}
if(layout != 0)
{
lstm_transpose_inputs(info, args);
}
// first output for concatenation of hidden states
auto hidden_states = info.add_instruction(make_op("lstm",
{{"hidden_size", hidden_size},
......@@ -224,6 +266,11 @@ struct parse_lstm : op_parser<parse_lstm>
auto last_cell_output =
info.add_instruction(make_op("rnn_last_cell_output"), hidden_states);
if(layout != 0)
{
lstm_transpose_outputs(info, hidden_states, last_output, last_cell_output);
}
return {hidden_states, last_output, last_cell_output};
}
};
......
......@@ -941,15 +941,6 @@ struct find_splits
{
auto split = i->inputs()[split_idx];
assert(split->name() == "slice");
// Insert contiguous for reshapes
auto outputs = i->outputs();
for(auto output : outputs)
{
if(output->name() != "reshape")
continue;
auto x = m.insert_instruction(output, make_op("contiguous"), i);
m.replace_instruction(output, output->get_operator(), x);
}
m.replace_instruction(i, split->get_operator(), c);
}
......@@ -1181,13 +1172,6 @@ struct find_conv_dot_horiz_fusion
for(auto arg : range(start, last))
{
auto outputs = arg->outputs();
for(auto output : outputs)
{
if(output->name() != "reshape")
continue;
auto x = m.insert_instruction(output, make_op("contiguous"), arg);
m.replace_instruction(output, output->get_operator(), x);
}
int64_t len = arg->get_shape().lens()[axis];
m.replace_instruction(
......@@ -1487,11 +1471,6 @@ struct find_split_reshape
slc_axis_len;
});
// insert the reshape instruction and add contiguous if needed
if(not input->get_shape().standard())
{
input = m.insert_instruction(std::next(input), make_op("contiguous"), input);
}
auto rsp_ins = m.insert_instruction(
std::next(input), make_op("reshape", {{"dims", rsp_out_lens}}), input);
......
......@@ -45,77 +45,145 @@ std::unordered_set<std::string> get_quantizable_op_names()
return s;
}
MIGRAPHX_PRED_MATCHER(has_same_value, instruction_ref ins)
{
if(ins->name() != "@literal")
return false;
bool all_same = false;
ins->get_literal().visit([&](auto s) {
all_same = std::all_of(s.begin() + 1, s.end(), [&](const auto& scale) {
return float_equal(scale, s.front());
});
});
return all_same;
}
struct match_find_quantizable_ops
{
static bool
is_valid_scale(instruction_ref scale, std::vector<std::size_t> lens, std::size_t axis)
{
return scale->get_shape().scalar() or scale->get_shape().elements() == lens.at(axis);
}
static bool is_valid_zero_point(instruction_ref zp)
{
if(not zp->can_eval())
return false;
bool all_zeros = false;
zp->eval().visit([&](auto z) {
all_zeros =
std::all_of(z.begin(), z.end(), [&](auto val) { return float_equal(val, 0); });
});
return all_zeros;
}
static auto
scale_broadcast_op(instruction_ref scale, std::vector<std::size_t> lens, std::size_t axis)
{
if(scale->get_shape().scalar())
{
return migraphx::make_op("multibroadcast", {{"out_lens", lens}});
}
else
{
return migraphx::make_op("broadcast", {{"out_lens", lens}, {"axis", axis}});
}
}
static auto dequantizelinear_op(const std::string& name, const std::string& scale)
// Helper function to insert quantized versions of any broadcasts and transpose ops that
// occur between dequantizelinear and the quantized op
static auto
propagate_quantized_ins(module& m, const instruction_ref dqins, const instruction_ref qop)
{
auto qinp = dqins->inputs().front();
auto next_ins = dqins;
while(next_ins != qop)
{
if(next_ins->name() != "dequantizelinear")
{
qinp = m.insert_instruction(qop, next_ins->get_operator(), qinp);
}
next_ins = next_ins->outputs().front();
}
return qinp;
}
static auto dequantizelinear_op(const std::string& scale, const std::string& zp)
{
return match::name("dequantizelinear")(
match::arg(0)(match::skip(match::name("quantizelinear"))(match::any().bind(name))),
match::arg(1)(match::skip_broadcasts(has_same_value().bind(scale))),
match::arg(2)(match::skip_broadcasts(match::all_of(match::has_value(0)))));
match::arg(0)(match::skip(match::name("quantizelinear"))(match::any())),
match::arg(1)(match::skip_broadcasts(match::is_constant().bind(scale))),
match::arg(2)(match::skip_broadcasts(match::is_constant().bind(zp))));
}
auto matcher() const
{
return match::name(get_quantizable_op_names())(
match::arg(0)(dequantizelinear_op("x1", "scale1")),
match::arg(1)(dequantizelinear_op("x2", "scale2")));
match::arg(0)(match::skip_broadcasts_transposes_contiguous(
dequantizelinear_op("scale1", "zp1").bind("dq1"))),
match::arg(1)(match::skip_broadcasts_transposes_contiguous(
dequantizelinear_op("scale2", "zp2").bind("dq2"))));
}
void apply(module& m, const match::matcher_result& r) const
{
auto qop = r.result;
auto q1 = r.instructions["x1"];
auto q2 = r.instructions["x2"];
auto dq1 = r.instructions["dq1"];
auto dq2 = r.instructions["dq2"];
auto scale1 = r.instructions["scale1"];
auto scale2 = r.instructions["scale2"];
auto zp1 = r.instructions["zp1"];
auto zp2 = r.instructions["zp2"];
// Only INT8 type currently supported
if(q1->get_shape().type() != migraphx::shape::int8_type or
q2->get_shape().type() != migraphx::shape::int8_type)
if(dq1->inputs().front()->get_shape().type() != migraphx::shape::int8_type or
dq2->inputs().front()->get_shape().type() != migraphx::shape::int8_type)
return;
double scale;
visit_all(scale1->get_literal(), scale2->get_literal())(
[&](const auto s1, const auto s2) { scale = s1.front() * s2.front(); });
// Only symmetric quantization is supported (i.e. non-zero zero points are not allowed)
if(not(is_valid_zero_point(zp1) and is_valid_zero_point(zp2)))
return;
// Only support scalar and 1D scales
if(scale1->get_shape().lens().size() != 1 or scale2->get_shape().lens().size() != 1)
return;
// Propagate q1 and q2 through any broadcasts and transposes before qop
auto qop_args = qop->inputs();
qop_args.at(0) = q1;
qop_args.at(1) = q2;
qop_args.at(0) = propagate_quantized_ins(m, dq1, qop);
qop_args.at(1) = propagate_quantized_ins(m, dq2, qop);
instruction_ref dq;
instruction_ref dq_scale;
instruction_ref out_scale;
instruction_ref zero_point;
if(qop->name() == "convolution")
{
auto conv_val = qop->get_operator().to_value();
dq = m.insert_instruction(
qop, migraphx::make_op("quant_convolution", conv_val), qop_args);
auto out_lens = dq->get_shape().lens();
// The input scale should always be scalar, and the weight scale can be scalar or 1D
// with length equal to the output channel dim (dim 1 in the output)
if(not(is_valid_scale(scale1, out_lens, 1) and is_valid_scale(scale2, out_lens, 1)))
return;
auto s1_bcast =
m.insert_instruction(qop, scale_broadcast_op(scale1, out_lens, 1), scale1);
auto s2_bcast =
m.insert_instruction(qop, scale_broadcast_op(scale2, out_lens, 1), scale2);
out_scale = m.insert_instruction(qop, migraphx::make_op("mul"), s1_bcast, s2_bcast);
}
else if(qop->name() == "dot")
{
dq = m.insert_instruction(qop, migraphx::make_op("quant_dot"), qop_args);
dq = m.insert_instruction(qop, migraphx::make_op("quant_dot"), qop_args);
auto out_lens = dq->get_shape().lens();
// For (..., M, N) x (..., N, K) dot, only support cases where quantization axis is M
// for input1 and K for input 2
if(not(is_valid_scale(scale1, out_lens, out_lens.size() - 2) and
is_valid_scale(scale2, out_lens, out_lens.size() - 1)))
return;
auto s1_bcast = m.insert_instruction(
qop, scale_broadcast_op(scale1, out_lens, out_lens.size() - 2), scale1);
auto s2_bcast = m.insert_instruction(
qop, scale_broadcast_op(scale2, out_lens, out_lens.size() - 1), scale2);
out_scale = m.insert_instruction(qop, migraphx::make_op("mul"), s1_bcast, s2_bcast);
}
auto ins_type = qop->get_shape().type();
dq_scale = m.add_literal(literal({ins_type}, {scale}));
auto lens = dq->get_shape().lens();
auto scale_mb =
m.insert_instruction(qop, make_op("multibroadcast", {{"out_lens", lens}}), dq_scale);
dq = m.insert_instruction(qop, make_op("dequantizelinear"), dq, scale_mb);
dq = m.insert_instruction(qop, make_op("dequantizelinear"), dq, out_scale);
m.replace_instruction(qop, dq);
}
};
......
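For the symmetric case the rewrite above is exact: with zero points of zero, ``dequantizelinear`` is just a multiply by the scale, so ``dot(s1 * a_int8, s2 * b_int8) == (s1 * s2) * dot(a_int8, b_int8)``. The pass therefore feeds the int8 tensors directly into ``quant_dot``/``quant_convolution`` and applies a single ``dequantizelinear`` whose scale is the broadcast product of the two input scales (the ``out_scale`` computed above).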
......@@ -103,8 +103,6 @@ struct find_reshaper
auto input = mr.instructions["x"];
auto dims = ins->get_shape().lens();
if(not input->get_shape().standard())
input = m.insert_instruction(ins, make_op("contiguous"), input);
m.replace_instruction(ins, make_op("reshape", {{"dims", dims}}), input);
}
};
......@@ -475,9 +473,8 @@ struct find_resize
ins_rsp, migraphx::make_op("reshape", {{"dims", in_dims}}), in_rsp);
auto mb_rsp = m.insert_instruction(
ins_rsp, migraphx::make_op("multibroadcast", {{"out_lens", out_dims}}), rsp_data);
auto std_mb = m.insert_instruction(ins, migraphx::make_op("contiguous"), mb_rsp);
std::vector<int64_t> rsp_dims(out_lens.begin(), out_lens.end());
m.replace_instruction(ins, migraphx::make_op("reshape", {{"dims", rsp_dims}}), std_mb);
m.replace_instruction(ins, migraphx::make_op("reshape", {{"dims", rsp_dims}}), mb_rsp);
}
};
......@@ -626,9 +623,8 @@ struct find_transpose_contiguous_reshaper_unary
auto cont_ins = r.instructions["cont_ins"];
auto unary_op_name = ins->get_operator().name();
auto unary_ins = m.insert_instruction(cont_ins, make_op(unary_op_name), trans_ins);
auto new_cont_ins = m.insert_instruction(cont_ins, make_op("contiguous"), unary_ins);
// older cont and reshape are removed by deadcode elimination
m.replace_instruction(ins, reshaper_ins->get_operator(), new_cont_ins);
m.replace_instruction(ins, reshaper_ins->get_operator(), unary_ins);
}
};
......
......@@ -88,7 +88,6 @@ bool verify_args(const std::string& name,
if(target_nan_idx >= 0)
std::cout << "Non finite number found in target at " << target_nan_idx << ": "
<< target[target_nan_idx] << std::endl;
std::cout << "MIGraphX verification passed successfully." << std::endl;
}
});
return passed;
......
......@@ -414,8 +414,8 @@ TEST_CASE(add_reshape_add_nonstandard)
auto y = mm->add_parameter("y", s1);
auto z = mm->add_parameter("z", s2);
auto add1 = mm->add_instruction(migraphx::make_op("add"), x, y);
auto c = mm->add_instruction(migraphx::make_op("contiguous"), add1);
auto reshape = mm->add_instruction(migraphx::make_op("reshape", {{"dims", s2.lens()}}), c);
auto reshape =
mm->add_instruction(migraphx::make_op("reshape", {{"dims", s2.lens()}}), add1);
auto add2 = mm->add_instruction(migraphx::make_op("add"), reshape, z);
mm->add_return({add2});
}
......@@ -426,10 +426,8 @@ TEST_CASE(add_reshape_add_nonstandard)
auto x = mm->add_parameter("x", s1);
auto y = mm->add_parameter("y", s1);
auto z = mm->add_parameter("z", s2);
auto cx = mm->add_instruction(migraphx::make_op("contiguous"), x);
auto cy = mm->add_instruction(migraphx::make_op("contiguous"), y);
auto x2 = mm->add_instruction(migraphx::make_op("reshape", {{"dims", s3.lens()}}), cx);
auto y2 = mm->add_instruction(migraphx::make_op("reshape", {{"dims", s3.lens()}}), cy);
auto x2 = mm->add_instruction(migraphx::make_op("reshape", {{"dims", s3.lens()}}), x);
auto y2 = mm->add_instruction(migraphx::make_op("reshape", {{"dims", s3.lens()}}), y);
auto z2 = mm->add_instruction(migraphx::make_op("reshape", {{"dims", s3.lens()}}), z);
auto fadd =
add_pointwise(p2, "main:pointwise0", {x2, y2, z2}, [=](auto* pm, const auto& inputs) {
......@@ -466,10 +464,8 @@ TEST_CASE(add_unsqueeze_add_nonstandard)
auto x = mm->add_parameter("x", s1);
auto y = mm->add_parameter("y", s1);
auto z = mm->add_parameter("z", s2);
auto cx = mm->add_instruction(migraphx::make_op("contiguous"), x);
auto cy = mm->add_instruction(migraphx::make_op("contiguous"), y);
auto x2 = mm->add_instruction(migraphx::make_op("reshape", {{"dims", s2.lens()}}), cx);
auto y2 = mm->add_instruction(migraphx::make_op("reshape", {{"dims", s2.lens()}}), cy);
auto x2 = mm->add_instruction(migraphx::make_op("reshape", {{"dims", s2.lens()}}), x);
auto y2 = mm->add_instruction(migraphx::make_op("reshape", {{"dims", s2.lens()}}), y);
auto fadd =
add_pointwise(p2, "main:pointwise0", {x2, y2, z}, [=](auto* pm, const auto& inputs) {
auto add1 = pm->add_instruction(migraphx::make_op("add"), inputs[0], inputs[1]);
......
......@@ -4484,6 +4484,177 @@ def lrn_test():
return ([node], [x], [y])
@onnx_test()
def lstm_bi_layout_cell_test():
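# layout=1 (batch-major) shapes: seq is [batch, seq_len, input_size] = [3, 5, 10],
# h0/c0 are [batch, num_directions, hidden_size] = [3, 2, 20], and cellout is
# [batch, num_directions, hidden_size] = [3, 2, 20].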
seq = helper.make_tensor_value_info('seq', TensorProto.FLOAT, [3, 5, 10])
w = helper.make_tensor_value_info('w', TensorProto.FLOAT, [2, 80, 10])
r = helper.make_tensor_value_info('r', TensorProto.FLOAT, [2, 80, 20])
bias = helper.make_tensor_value_info('bias', TensorProto.FLOAT, [2, 160])
seq_len = helper.make_tensor_value_info('seq_len', TensorProto.INT32, [3])
h0 = helper.make_tensor_value_info('h0', TensorProto.FLOAT, [3, 2, 20])
c0 = helper.make_tensor_value_info('c0', TensorProto.FLOAT, [3, 2, 20])
pph = helper.make_tensor_value_info('pph', TensorProto.FLOAT, [2, 60])
cellout = helper.make_tensor_value_info('cellout', TensorProto.FLOAT,
[3, 2, 20])
node = onnx.helper.make_node(
'LSTM',
inputs=['seq', 'w', 'r', 'bias', 'seq_len', 'h0', 'c0', 'pph'],
outputs=['', '', 'cellout'],
activations=['sigmoid', 'tanh', 'tanh'],
clip=0,
direction='bidirectional',
hidden_size=20,
input_forget=1,
layout=1)
return ([node], [seq, w, r, bias, seq_len, h0, c0, pph], [cellout])
@onnx_test()
def lstm_bi_layout_last_test():
seq = helper.make_tensor_value_info('seq', TensorProto.FLOAT, [3, 5, 10])
w = helper.make_tensor_value_info('w', TensorProto.FLOAT, [2, 80, 10])
r = helper.make_tensor_value_info('r', TensorProto.FLOAT, [2, 80, 20])
bias = helper.make_tensor_value_info('bias', TensorProto.FLOAT, [2, 160])
seq_len = helper.make_tensor_value_info('seq_len', TensorProto.INT32, [3])
h0 = helper.make_tensor_value_info('h0', TensorProto.FLOAT, [3, 2, 20])
c0 = helper.make_tensor_value_info('c0', TensorProto.FLOAT, [3, 2, 20])
pph = helper.make_tensor_value_info('pph', TensorProto.FLOAT, [2, 60])
hs = helper.make_tensor_value_info('hs', TensorProto.FLOAT, [3, 5, 2, 20])
output = helper.make_tensor_value_info('output', TensorProto.FLOAT,
[3, 2, 20])
node = onnx.helper.make_node(
'LSTM',
inputs=['seq', 'w', 'r', 'bias', 'seq_len', 'h0', 'c0', 'pph'],
outputs=['hs', 'output'],
activations=['sigmoid', 'tanh', 'tanh'],
clip=0,
direction='bidirectional',
hidden_size=20,
input_forget=1,
layout=1)
return ([node], [seq, w, r, bias, seq_len, h0, c0, pph], [hs, output])
@onnx_test()
def lstm_f_layout_hs_test():
seq = helper.make_tensor_value_info('seq', TensorProto.FLOAT, [3, 5, 10])
w = helper.make_tensor_value_info('w', TensorProto.FLOAT, [1, 80, 10])
r = helper.make_tensor_value_info('r', TensorProto.FLOAT, [1, 80, 20])
bias = helper.make_tensor_value_info('bias', TensorProto.FLOAT, [1, 160])
seq_len = helper.make_tensor_value_info('seq_len', TensorProto.INT32, [3])
h0 = helper.make_tensor_value_info('h0', TensorProto.FLOAT, [3, 1, 20])
c0 = helper.make_tensor_value_info('c0', TensorProto.FLOAT, [3, 1, 20])
pph = helper.make_tensor_value_info('pph', TensorProto.FLOAT, [1, 60])
hs = helper.make_tensor_value_info('hs', TensorProto.FLOAT, [3, 5, 1, 20])
output = helper.make_tensor_value_info('output', TensorProto.FLOAT,
[3, 1, 20])
node = onnx.helper.make_node(
'LSTM',
inputs=['seq', 'w', 'r', 'bias', 'seq_len', 'h0', 'c0', 'pph'],
outputs=['hs', 'output'],
activations=['sigmoid', 'tanh', 'tanh'],
clip=0,
direction='forward',
hidden_size=20,
input_forget=1,
layout=1)
return ([node], [seq, w, r, bias, seq_len, h0, c0, pph], [hs, output])
@onnx_test()
def lstm_f_layout_cell_test():
seq = helper.make_tensor_value_info('seq', TensorProto.FLOAT, [3, 5, 10])
w = helper.make_tensor_value_info('w', TensorProto.FLOAT, [1, 80, 10])
r = helper.make_tensor_value_info('r', TensorProto.FLOAT, [1, 80, 20])
bias = helper.make_tensor_value_info('bias', TensorProto.FLOAT, [1, 160])
seq_len = helper.make_tensor_value_info('seq_len', TensorProto.INT32, [3])
h0 = helper.make_tensor_value_info('h0', TensorProto.FLOAT, [3, 1, 20])
c0 = helper.make_tensor_value_info('c0', TensorProto.FLOAT, [3, 1, 20])
pph = helper.make_tensor_value_info('pph', TensorProto.FLOAT, [1, 60])
cellout = helper.make_tensor_value_info('cellout', TensorProto.FLOAT,
[3, 1, 20])
node = onnx.helper.make_node(
'LSTM',
inputs=['seq', 'w', 'r', 'bias', 'seq_len', 'h0', 'c0', 'pph'],
outputs=['', '', 'cellout'],
activations=['sigmoid', 'tanh', 'tanh'],
clip=0,
direction='forward',
hidden_size=20,
input_forget=1,
layout=1)
return ([node], [seq, w, r, bias, seq_len, h0, c0, pph], [cellout])
@onnx_test()
def lstm_r_layout_test():
seq = helper.make_tensor_value_info('seq', TensorProto.FLOAT, [3, 5, 10])
w = helper.make_tensor_value_info('w', TensorProto.FLOAT, [1, 80, 10])
r = helper.make_tensor_value_info('r', TensorProto.FLOAT, [1, 80, 20])
bias = helper.make_tensor_value_info('bias', TensorProto.FLOAT, [1, 160])
seq_len = helper.make_tensor_value_info('seq_len', TensorProto.INT32, [3])
h0 = helper.make_tensor_value_info('h0', TensorProto.FLOAT, [3, 1, 20])
c0 = helper.make_tensor_value_info('c0', TensorProto.FLOAT, [3, 1, 20])
pph = helper.make_tensor_value_info('pph', TensorProto.FLOAT, [1, 60])
hs = helper.make_tensor_value_info('hs', TensorProto.FLOAT, [3, 5, 1, 20])
node = onnx.helper.make_node(
'LSTM',
inputs=['seq', 'w', 'r', 'bias', 'seq_len', 'h0', 'c0', 'pph'],
outputs=['hs'],
activations=['sigmoid', 'tanh', 'tanh'],
clip=0,
direction='reverse',
hidden_size=20,
input_forget=1,
layout=1)
return ([node], [seq, w, r, bias, seq_len, h0, c0, pph], [hs])
@onnx_test()
def lstm_r_layout_hs_cell_test():
seq = helper.make_tensor_value_info('seq', TensorProto.FLOAT, [3, 5, 10])
w = helper.make_tensor_value_info('w', TensorProto.FLOAT, [1, 80, 10])
r = helper.make_tensor_value_info('r', TensorProto.FLOAT, [1, 80, 20])
bias = helper.make_tensor_value_info('bias', TensorProto.FLOAT, [1, 160])
seq_len = helper.make_tensor_value_info('seq_len', TensorProto.INT32, [3])
h0 = helper.make_tensor_value_info('h0', TensorProto.FLOAT, [3, 1, 20])
c0 = helper.make_tensor_value_info('c0', TensorProto.FLOAT, [3, 1, 20])
pph = helper.make_tensor_value_info('pph', TensorProto.FLOAT, [1, 60])
output = helper.make_tensor_value_info('output', TensorProto.FLOAT,
[3, 1, 20])
cellout = helper.make_tensor_value_info('cellout', TensorProto.FLOAT,
[3, 1, 20])
node = onnx.helper.make_node(
'LSTM',
inputs=['seq', 'w', 'r', 'bias', 'seq_len', 'h0', 'c0', 'pph'],
outputs=['', 'output', 'cellout'],
activations=['sigmoid', 'tanh', 'tanh'],
clip=0,
direction='reverse',
hidden_size=20,
input_forget=1,
layout=1)
return ([node], [seq, w, r, bias, seq_len, h0, c0, pph], [output, cellout])
@onnx_test()
def matmul_bmbm_test():
m1 = helper.make_tensor_value_info('1', TensorProto.FLOAT, [3, 6, 7])
......