<!--- Licensed to the Apache Software Foundation (ASF) under one -->
<!--- or more contributor license agreements. See the NOTICE file -->
<!--- distributed with this work for additional information -->
<!--- regarding copyright ownership. The ASF licenses this file -->
<!--- to you under the Apache License, Version 2.0 (the -->
<!--- "License"); you may not use this file except in compliance -->
<!--- with the License. You may obtain a copy of the License at -->
<!--- http://www.apache.org/licenses/LICENSE-2.0 -->
<!--- Unless required by applicable law or agreed to in writing, -->
<!--- software distributed under the License is distributed on an -->
<!--- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->
<!--- KIND, either express or implied. See the License for the -->
<!--- specific language governing permissions and limitations -->
<!--- under the License. -->
Running PP-PicoDet via TVM on bare metal Arm(R) Cortex(R)-M55 CPU and CMSIS-NN
===============================================================
This folder contains an example of how to use TVM to run a PP-PicoDet model
on bare metal Cortex(R)-M55 CPU and CMSIS-NN.
Prerequisites
-------------
If the demo is run in the ci_cpu Docker container provided with TVM, then the following
software will already be installed.
If the demo is not run in the ci_cpu Docker container, then you will need the following:
- Software required to build and run the demo (These can all be installed by running
tvm/docker/install/ubuntu_install_ethosu_driver_stack.sh.)
- [Fixed Virtual Platform (FVP) based on Arm(R) Corstone(TM)-300 software](https://developer.arm.com/tools-and-software/open-source-software/arm-platforms-software/arm-ecosystem-fvps)
- [cmake 3.19.5](https://github.com/Kitware/CMake/releases/)
- [GCC toolchain from Arm(R)](https://developer.arm.com/-/media/Files/downloads/gnu-rm/10-2020q4/gcc-arm-none-eabi-10-2020-q4-major-x86_64-linux.tar.bz2)
- [Arm(R) Ethos(TM)-U NPU driver stack](https://review.mlplatform.org)
- [CMSIS](https://github.com/ARM-software/CMSIS_5)
- The python libraries listed in the requirements.txt of this directory
- These can be installed by running the following from the current directory:
```bash
pip install -r ./requirements.txt
```
You will also need TVM, which can either be:
- Built from source (see [Install from Source](https://tvm.apache.org/docs/install/from_source.html))
- When building from source, the following need to be set in config.cmake:
- set(USE_CMSISNN ON)
- set(USE_MICRO ON)
- set(USE_LLVM ON)
- Installed from TLCPack (see [TLCPack](https://tlcpack.ai/))
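For example, a nightly TVM build can be installed from TLCPack with the same command used later by configure_avh.sh:
```bash
pip install tlcpack-nightly -f https://tlcpack.ai/wheels
```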
You will need to update your PATH environment variable to include the path to cmake 3.19.5 and the FVP.
For example, if you've installed these in ```/opt/arm```, then you would do the following:
```bash
export PATH=/opt/arm/FVP_Corstone_SSE-300/models/Linux64_GCC-6.4:/opt/arm/cmake/bin:$PATH
```
Running the demo application
----------------------------
Type the following command to run the bare metal object detection application ([src/demo_bare_metal.c](./src/demo_bare_metal.c)):
```bash
./run_demo.sh
```
If the Ethos(TM)-U platform and/or CMSIS have not been installed in /opt/arm/ethosu then
the locations for these can be specified as arguments to run_demo.sh, for example:
```bash
./run_demo.sh --cmsis_path /home/tvm-user/cmsis \
--ethosu_platform_path /home/tvm-user/ethosu/core_platform
```
This will:
- Download a PP-PicoDet object detection model
- Use tvmc to compile the object detection model for Cortex(R)-M55 CPU and CMSIS-NN
- Create a C header file inputs.h containing the image data as a C array
- Create a C header file outputs.h containing a C array where the output of inference will be stored
- Build the demo application
- Run the demo application on a Fixed Virtual Platform (FVP) based on Arm(R) Corstone(TM)-300 software
- The application will report the detected objects, their classes, and the corresponding scores.
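On a successful run, the post-processing loop in demo_bare_metal.c prints one line per detection in the following format (the numbers shown here are purely hypothetical):
```
box: 135.000000, 68.000000, 390.000000, 340.000000, class: 17, score: 0.650000
```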
Using your own image
--------------------
The convert_image.py script takes a single argument on the command line which is the path of the
image to be converted into an array of bytes for consumption by the model.
The demo can be modified to use an image of your choice by changing the following line in run_demo.sh
```bash
python3 ./convert_image.py ../../demo/000000014439_640x640.jpg
```
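For example, to use a hypothetical image of your own named my_image.jpg in the current directory, the line would become:
```bash
python3 ./convert_image.py ./my_image.jpg
```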
Model description
-----------------
The demo uses PP-PicoDet, a lightweight object detection model from PaddleDetection. run_demo.sh downloads the picodet_s_320_coco_lcnet_no_nms variant, which takes a 320x320 input and excludes NMS from the exported graph, and compiles it for the Cortex(R)-M55 with tvmc.
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
if (__TOOLCHAIN_LOADED)
return()
endif()
set(__TOOLCHAIN_LOADED TRUE)
set(CMAKE_SYSTEM_NAME Generic)
set(CMAKE_C_COMPILER "arm-none-eabi-gcc")
set(CMAKE_CXX_COMPILER "arm-none-eabi-g++")
set(CMAKE_SYSTEM_PROCESSOR "cortex-m55" CACHE STRING "Select Arm(R) Cortex(R)-M architecture. (cortex-m0, cortex-m3, cortex-m33, cortex-m4, cortex-m55, cortex-m7, etc)")
set(CMAKE_TRY_COMPILE_TARGET_TYPE STATIC_LIBRARY)
set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)
set(CMAKE_C_STANDARD 99)
set(CMAKE_CXX_STANDARD 14)
# The system processor could for example be set to cortex-m33+nodsp+nofp.
set(__CPU_COMPILE_TARGET ${CMAKE_SYSTEM_PROCESSOR})
string(REPLACE "+" ";" __CPU_FEATURES ${__CPU_COMPILE_TARGET})
list(POP_FRONT __CPU_FEATURES CMAKE_SYSTEM_PROCESSOR)
string(FIND ${__CPU_COMPILE_TARGET} "+" __OFFSET)
if(__OFFSET GREATER_EQUAL 0)
string(SUBSTRING ${__CPU_COMPILE_TARGET} ${__OFFSET} -1 CPU_FEATURES)
endif()
# Add -mcpu to the compile options to override the -mcpu the CMake toolchain adds
add_compile_options(-mcpu=${__CPU_COMPILE_TARGET})
# Set floating point unit
if("${__CPU_COMPILE_TARGET}" MATCHES "\\+fp")
set(FLOAT hard)
elseif("${__CPU_COMPILE_TARGET}" MATCHES "\\+nofp")
set(FLOAT soft)
elseif("${CMAKE_SYSTEM_PROCESSOR}" STREQUAL "cortex-m33" OR
"${CMAKE_SYSTEM_PROCESSOR}" STREQUAL "cortex-m55")
set(FLOAT hard)
else()
set(FLOAT soft)
endif()
add_compile_options(-mfloat-abi=${FLOAT})
add_link_options(-mfloat-abi=${FLOAT})
# Link target
add_link_options(-mcpu=${__CPU_COMPILE_TARGET})
add_link_options(-Xlinker -Map=output.map)
#
# Compile options
#
set(cxx_flags "-fno-unwind-tables;-fno-rtti;-fno-exceptions")
add_compile_options("-Wall;-Wextra;-Wsign-compare;-Wunused;-Wswitch-default;\
-Wdouble-promotion;-Wredundant-decls;-Wshadow;-Wnull-dereference;\
-Wno-format-extra-args;-Wno-unused-function;-Wno-unused-label;\
-Wno-missing-field-initializers;-Wno-return-type;-Wno-format;-Wno-int-conversion"
"$<$<COMPILE_LANGUAGE:CXX>:${cxx_flags}>"
)
#!/bin/bash
# Copyright (c) 2022 Arm Limited and Contributors. All rights reserved.
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
set -e
set -u
set -o pipefail
# Show usage
function show_usage() {
cat <<EOF
Usage: Set up running environment by installing the required prerequisites.
-h, --help
Display this help message.
EOF
}
if [ "$#" -eq 1 ] && [ "$1" == "--help" -o "$1" == "-h" ]; then
show_usage
exit 0
elif [ "$#" -ge 1 ]; then
show_usage
exit 1
fi
echo -e "\e[36mStart setting up running environment\e[0m"
# Install CMSIS
echo -e "\e[36mStart installing CMSIS\e[0m"
CMSIS_PATH="/opt/arm/ethosu/cmsis"
mkdir -p "${CMSIS_PATH}"
CMSIS_SHA="977abe9849781a2e788b02282986480ff4e25ea6"
CMSIS_SHASUM="86c88d9341439fbb78664f11f3f25bc9fda3cd7de89359324019a4d87d169939eea85b7fdbfa6ad03aa428c6b515ef2f8cd52299ce1959a5444d4ac305f934cc"
CMSIS_URL="https://github.com/ARM-software/CMSIS_5/archive/${CMSIS_SHA}.tar.gz"
DOWNLOAD_PATH="/tmp/${CMSIS_SHA}.tar.gz"
wget ${CMSIS_URL} -O "${DOWNLOAD_PATH}"
echo "$CMSIS_SHASUM" ${DOWNLOAD_PATH} | sha512sum -c
tar -xf "${DOWNLOAD_PATH}" -C "${CMSIS_PATH}" --strip-components=1
touch "${CMSIS_PATH}"/"${CMSIS_SHA}".sha
echo -e "\e[36mCMSIS Installation SUCCESS\e[0m"
# Install Arm(R) Ethos(TM)-U NPU driver stack
echo -e "\e[36mStart installing Arm(R) Ethos(TM)-U NPU driver stack\e[0m"
git clone "https://review.mlplatform.org/ml/ethos-u/ethos-u-core-platform" /opt/arm/ethosu/core_platform
cd /opt/arm/ethosu/core_platform
git checkout tags/"21.11"
echo -e "\e[36mArm(R) Ethos(TM)-U Core Platform Installation SUCCESS\e[0m"
# Install Arm(R) GNU Toolchain
echo -e "\e[36mStart installing Arm(R) GNU Toolchain\e[0m"
mkdir -p /opt/arm/gcc-arm-none-eabi
export gcc_arm_url='https://developer.arm.com/-/media/Files/downloads/gnu-rm/10-2020q4/gcc-arm-none-eabi-10-2020-q4-major-x86_64-linux.tar.bz2?revision=ca0cbf9c-9de2-491c-ac48-898b5bbc0443&la=en&hash=68760A8AE66026BCF99F05AC017A6A50C6FD832A'
curl --retry 64 -sSL ${gcc_arm_url} | tar -C /opt/arm/gcc-arm-none-eabi --strip-components=1 -jx
export PATH=/opt/arm/gcc-arm-none-eabi/bin:$PATH
arm-none-eabi-gcc --version
arm-none-eabi-g++ --version
echo -e "\e[36mArm(R) Arm(R) GNU Toolchain Installation SUCCESS\e[0m"
# Install TVM from TLCPack
echo -e "\e[36mStart installing TVM\e[0m"
pip install tlcpack-nightly -f https://tlcpack.ai/wheels
echo -e "\e[36mTVM Installation SUCCESS\e[0m"
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
import os
import pathlib
import sys
import cv2
import numpy as np
def resize_norm_img(img, image_shape, padding=True):
imgC, imgH, imgW = image_shape
img = cv2.resize(img, (imgW, imgH), interpolation=cv2.INTER_LINEAR)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
img = np.transpose(img, [2, 0, 1]) / 255
img = np.expand_dims(img, 0)
img_mean = np.array([0.485, 0.456, 0.406]).reshape((3, 1, 1))
img_std = np.array([0.229, 0.224, 0.225]).reshape((3, 1, 1))
img -= img_mean
img /= img_std
return img.astype(np.float32)
def create_header_file(name, tensor_name, tensor_data, output_path):
"""
This function generates a header file containing the data from the numpy array provided.
"""
file_path = pathlib.Path(f"{output_path}/" + name).resolve()
# Create header file with npy_data as a C array
raw_path = file_path.with_suffix(".h").resolve()
with open(raw_path, "a") as header_file:
header_file.write(
"\n" + f"const size_t {tensor_name}_len = {tensor_data.size};\n" +
f'__attribute__((section(".data.tvm"), aligned(16))) float {tensor_name}[] = '
)
header_file.write("{")
for i in np.ndindex(tensor_data.shape):
header_file.write(f"{tensor_data[i]}, ")
header_file.write("};\n\n")
def create_headers(image_name):
"""
This function generates C header files for the input and output arrays required to run inferences
"""
img_path = os.path.join("./", f"{image_name}")
# Resize image to 320x320 to match the model input shape (1, 3, 320, 320)
img = cv2.imread(img_path)
img = resize_norm_img(img, [3, 320, 320])
# resize_norm_img already adds the batch dimension, so the data is 4-D NCHW
img_data = img.astype("float32")
if os.path.exists("./include/inputs.h"):
os.remove("./include/inputs.h")
if os.path.exists("./include/outputs.h"):
os.remove("./include/outputs.h")
# Create input header file
create_header_file("inputs", "input", img_data, "./include")
# Create output header file
output_data = np.zeros([8500], np.float32)
create_header_file(
"outputs",
"output0",
output_data,
"./include", )
output_data = np.zeros([170000], np.float32)
create_header_file(
"outputs",
"output1",
output_data,
"./include", )
if __name__ == "__main__":
create_headers(sys.argv[1])
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
/*------------------ Reference System Memories -------------
+===================+============+=======+============+============+
| Memory | Address | Size | CPU Access | NPU Access |
+===================+============+=======+============+============+
| ITCM | 0x00000000 | 512KB | Yes (RO) | No |
+-------------------+------------+-------+------------+------------+
| DTCM | 0x20000000 | 512KB | Yes (R/W) | No |
+-------------------+------------+-------+------------+------------+
| SSE-300 SRAM | 0x21000000 | 2MB | Yes (R/W) | Yes (R/W) |
+-------------------+------------+-------+------------+------------+
| Data SRAM | 0x01000000 | 2MB | Yes (R/W) | Yes (R/W) |
+-------------------+------------+-------+------------+------------+
| DDR | 0x60000000 | 32MB | Yes (R/W) | Yes (R/W) |
+-------------------+------------+-------+------------+------------+ */
/*---------------------- ITCM Configuration ----------------------------------
<h> Flash Configuration
<o0> Flash Base Address <0x0-0xFFFFFFFF:8>
<o1> Flash Size (in Bytes) <0x0-0xFFFFFFFF:8>
</h>
-----------------------------------------------------------------------------*/
__ROM_BASE = 0x00000000;
__ROM_SIZE = 0x00080000;
/*--------------------- DTCM RAM Configuration ----------------------------
<h> RAM Configuration
<o0> RAM Base Address <0x0-0xFFFFFFFF:8>
<o1> RAM Size (in Bytes) <0x0-0xFFFFFFFF:8>
</h>
-----------------------------------------------------------------------------*/
__RAM_BASE = 0x20000000;
__RAM_SIZE = 0x00080000;
/*----------------------- Data SRAM Configuration ------------------------------
<h> Data SRAM Configuration
<o0> DATA_SRAM Base Address <0x0-0xFFFFFFFF:8>
<o1> DATA_SRAM Size (in Bytes) <0x0-0xFFFFFFFF:8>
</h>
-----------------------------------------------------------------------------*/
__DATA_SRAM_BASE = 0x01000000;
__DATA_SRAM_SIZE = 0x00200000;
/*--------------------- Embedded SRAM Configuration ----------------------------
<h> SRAM Configuration
<o0> SRAM Base Address <0x0-0xFFFFFFFF:8>
<o1> SRAM Size (in Bytes) <0x0-0xFFFFFFFF:8>
</h>
-----------------------------------------------------------------------------*/
__SRAM_BASE = 0x21000000;
__SRAM_SIZE = 0x00200000;
/*--------------------- Stack / Heap Configuration ----------------------------
<h> Stack / Heap Configuration
<o0> Stack Size (in Bytes) <0x0-0xFFFFFFFF:8>
<o1> Heap Size (in Bytes) <0x0-0xFFFFFFFF:8>
</h>
-----------------------------------------------------------------------------*/
__STACK_SIZE = 0x00008000;
__HEAP_SIZE = 0x00008000;
/*--------------------- Embedded RAM Configuration ----------------------------
<h> DDR Configuration
<o0> DDR Base Address <0x0-0xFFFFFFFF:8>
<o1> DDR Size (in Bytes) <0x0-0xFFFFFFFF:8>
</h>
-----------------------------------------------------------------------------*/
__DDR_BASE = 0x60000000;
__DDR_SIZE = 0x02000000;
/*
*-------------------- <<< end of configuration section >>> -------------------
*/
MEMORY
{
ITCM (rx) : ORIGIN = __ROM_BASE, LENGTH = __ROM_SIZE
DTCM (rwx) : ORIGIN = __RAM_BASE, LENGTH = __RAM_SIZE
DATA_SRAM (rwx) : ORIGIN = __DATA_SRAM_BASE, LENGTH = __DATA_SRAM_SIZE
SRAM (rwx) : ORIGIN = __SRAM_BASE, LENGTH = __SRAM_SIZE
DDR (rwx) : ORIGIN = __DDR_BASE, LENGTH = __DDR_SIZE
}
/* Linker script to place sections and symbol values. Should be used together
* with other linker script that defines memory regions ITCM and RAM.
* It references following symbols, which must be defined in code:
* Reset_Handler : Entry of reset handler
*
* It defines following symbols, which code can use without definition:
* __exidx_start
* __exidx_end
* __copy_table_start__
* __copy_table_end__
* __zero_table_start__
* __zero_table_end__
* __etext
* __data_start__
* __preinit_array_start
* __preinit_array_end
* __init_array_start
* __init_array_end
* __fini_array_start
* __fini_array_end
* __data_end__
* __bss_start__
* __bss_end__
* __end__
* end
* __HeapLimit
* __StackLimit
* __StackTop
* __stack
*/
ENTRY(Reset_Handler)
SECTIONS
{
/* .ddr is placed before .text so that .rodata.tvm is encountered before .rodata* */
.ddr :
{
. = ALIGN (16);
*(.rodata.tvm)
. = ALIGN (16);
*(.data.tvm);
. = ALIGN(16);
} > DDR
.text :
{
KEEP(*(.vectors))
*(.text*)
KEEP(*(.init))
KEEP(*(.fini))
/* .ctors */
*crtbegin.o(.ctors)
*crtbegin?.o(.ctors)
*(EXCLUDE_FILE(*crtend?.o *crtend.o) .ctors)
*(SORT(.ctors.*))
*(.ctors)
/* .dtors */
*crtbegin.o(.dtors)
*crtbegin?.o(.dtors)
*(EXCLUDE_FILE(*crtend?.o *crtend.o) .dtors)
*(SORT(.dtors.*))
*(.dtors)
*(.rodata*)
KEEP(*(.eh_frame*))
} > ITCM
.ARM.extab :
{
*(.ARM.extab* .gnu.linkonce.armextab.*)
} > ITCM
__exidx_start = .;
.ARM.exidx :
{
*(.ARM.exidx* .gnu.linkonce.armexidx.*)
} > ITCM
__exidx_end = .;
.copy.table :
{
. = ALIGN(4);
__copy_table_start__ = .;
LONG (__etext)
LONG (__data_start__)
LONG (__data_end__ - __data_start__)
/* Add each additional data section here */
__copy_table_end__ = .;
} > ITCM
.zero.table :
{
. = ALIGN(4);
__zero_table_start__ = .;
__zero_table_end__ = .;
} > ITCM
/**
* Location counter can end up 2byte aligned with narrow Thumb code but
* __etext is assumed by startup code to be the LMA of a section in DTCM
* which must be 4byte aligned
*/
__etext = ALIGN (4);
.sram :
{
. = ALIGN(16);
} > SRAM AT > SRAM
.data : AT (__etext)
{
__data_start__ = .;
*(vtable)
*(.data)
*(.data.*)
. = ALIGN(4);
/* preinit data */
PROVIDE_HIDDEN (__preinit_array_start = .);
KEEP(*(.preinit_array))
PROVIDE_HIDDEN (__preinit_array_end = .);
. = ALIGN(4);
/* init data */
PROVIDE_HIDDEN (__init_array_start = .);
KEEP(*(SORT(.init_array.*)))
KEEP(*(.init_array))
PROVIDE_HIDDEN (__init_array_end = .);
. = ALIGN(4);
/* finit data */
PROVIDE_HIDDEN (__fini_array_start = .);
KEEP(*(SORT(.fini_array.*)))
KEEP(*(.fini_array))
PROVIDE_HIDDEN (__fini_array_end = .);
KEEP(*(.jcr*))
. = ALIGN(4);
/* All data end */
__data_end__ = .;
} > DTCM
.bss.noinit (NOLOAD):
{
. = ALIGN(16);
*(.bss.noinit.*)
. = ALIGN(16);
} > DDR AT > DDR
.bss :
{
. = ALIGN(4);
__bss_start__ = .;
*(.bss)
*(.bss.*)
*(COMMON)
. = ALIGN(4);
__bss_end__ = .;
} > DTCM AT > DTCM
.data_sram :
{
. = ALIGN(16);
} > DATA_SRAM
.heap (COPY) :
{
. = ALIGN(8);
__end__ = .;
PROVIDE(end = .);
. = . + __HEAP_SIZE;
. = ALIGN(8);
__HeapLimit = .;
} > DTCM
.stack (ORIGIN(DTCM) + LENGTH(DTCM) - __STACK_SIZE) (COPY) :
{
. = ALIGN(8);
__StackLimit = .;
. = . + __STACK_SIZE;
. = ALIGN(8);
__StackTop = .;
} > DTCM
PROVIDE(__stack = __StackTop);
/* Check if data + stack exceeds DTCM limit */
ASSERT(__StackLimit >= __bss_end__, "region DTCM overflowed with stack")
}
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
#ifndef TVM_RUNTIME_CRT_CONFIG_H_
#define TVM_RUNTIME_CRT_CONFIG_H_
/*! Log level of the CRT runtime */
#define TVM_CRT_LOG_LEVEL TVM_CRT_LOG_LEVEL_DEBUG
#endif // TVM_RUNTIME_CRT_CONFIG_H_
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <tvm/runtime/c_runtime_api.h>
#include <tvm/runtime/crt/stack_allocator.h>
#ifdef __cplusplus
extern "C" {
#endif
void __attribute__((noreturn)) TVMPlatformAbort(tvm_crt_error_t error_code) {
printf("TVMPlatformAbort: %d\n", error_code);
printf("EXITTHESIM\n");
exit(-1);
}
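// With USMP enabled (tir.usmp.enable=1 passed to tvmc in run_demo.sh), tensor
// memory is planned statically at compile time, so dynamic allocation is not
// required and the hooks below can simply report "not implemented".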
tvm_crt_error_t TVMPlatformMemoryAllocate(size_t num_bytes, DLDevice dev,
void **out_ptr) {
return kTvmErrorFunctionCallNotImplemented;
}
tvm_crt_error_t TVMPlatformMemoryFree(void *ptr, DLDevice dev) {
return kTvmErrorFunctionCallNotImplemented;
}
void TVMLogf(const char *msg, ...) {
va_list args;
va_start(args, msg);
vfprintf(stdout, msg, args);
va_end(args);
}
TVM_DLL int TVMFuncRegisterGlobal(const char *name, TVMFunctionHandle f,
int override) {
return 0;
}
#ifdef __cplusplus
}
#endif
paddlepaddle
numpy
opencv-python
#!/bin/bash
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
set -e
set -u
set -o pipefail
# Show usage
function show_usage() {
cat <<EOF
Usage: run_demo.sh
-h, --help
Display this help message.
--cmsis_path CMSIS_PATH
Set path to CMSIS.
--ethosu_platform_path ETHOSU_PLATFORM_PATH
Set path to Arm(R) Ethos(TM)-U core platform.
--fvp_path FVP_PATH
Set path to FVP.
--cmake_path
Set path to cmake.
--enable_FVP
Set 1 to run application on local Fixed Virtual Platforms (FVPs) executables.
EOF
}
# Configure environment variables
FVP_enable=0
export PATH=/opt/arm/gcc-arm-none-eabi/bin:$PATH
# Install python libraries
echo -e "\e[36mInstall python libraries\e[0m"
sudo pip install -r ./requirements.txt
# Parse arguments
while (( $# )); do
case "$1" in
-h|--help)
show_usage
exit 0
;;
--cmsis_path)
if [ $# -gt 1 ]
then
export CMSIS_PATH="$2"
shift 2
else
echo 'ERROR: --cmsis_path requires a non-empty argument' >&2
show_usage >&2
exit 1
fi
;;
--ethosu_platform_path)
if [ $# -gt 1 ]
then
export ETHOSU_PLATFORM_PATH="$2"
shift 2
else
echo 'ERROR: --ethosu_platform_path requires a non-empty argument' >&2
show_usage >&2
exit 1
fi
;;
--fvp_path)
if [ $# -gt 1 ]
then
export PATH="$2/models/Linux64_GCC-6.4:$PATH"
shift 2
else
echo 'ERROR: --fvp_path requires a non-empty argument' >&2
show_usage >&2
exit 1
fi
;;
--cmake_path)
if [ $# -gt 1 ]
then
export CMAKE="$2"
shift 2
else
echo 'ERROR: --cmake_path requires a non-empty argument' >&2
show_usage >&2
exit 1
fi
;;
--enable_FVP)
if [ $# -gt 1 ] && [ "$2" == "1" -o "$2" == "0" ];
then
FVP_enable="$2"
shift 2
else
echo 'ERROR: --enable_FVP requires an argument of 1 or 0' >&2
show_usage >&2
exit 1
fi
;;
-*|--*)
echo "Error: Unknown flag: $1" >&2
show_usage >&2
exit 1
;;
esac
done
# Choose running environment: cloud (default) or local environment
Platform="VHT_Corstone_SSE-300_Ethos-U55"
if [ $FVP_enable == "1" ]; then
Platform="FVP_Corstone_SSE-300_Ethos-U55"
echo -e "\e[36mRun application on local Fixed Virtual Platforms (FVPs)\e[0m"
else
if [ ! -d "/opt/arm/" ]; then
sudo ./configure_avh.sh
fi
fi
# Directories
script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
# Make build directory
make cleanall
mkdir -p build
cd build
# Get PaddlePaddle inference model
echo -e "\e[36mDownload PaddlePaddle inference model\e[0m"
wget https://bj.bcebos.com/v1/paddledet/deploy/Inference/picodet_s_320_coco_lcnet_no_nms.tar
tar -xf picodet_s_320_coco_lcnet_no_nms.tar
# Compile model for Arm(R) Cortex(R)-M55 CPU and CMSIS-NN
# An alternative to using "python3 -m tvm.driver.tvmc" is to call
# "tvmc" directly once TVM has been pip installed.
python3 -m tvm.driver.tvmc compile --target=cmsis-nn,c \
--target-cmsis-nn-mcpu=cortex-m55 \
--target-c-mcpu=cortex-m55 \
--runtime=crt \
--executor=aot \
--executor-aot-interface-api=c \
--executor-aot-unpacked-api=1 \
--pass-config tir.usmp.enable=1 \
--pass-config tir.usmp.algorithm=hill_climb \
--pass-config tir.disable_storage_rewrite=1 \
--pass-config tir.disable_vectorize=1 picodet_s_320_coco_lcnet_no_nms/model.pdmodel \
--output-format=mlf \
--model-format=paddle \
--module-name=picodet \
--input-shapes image:[1,3,320,320] \
--output=picodet.tar
tar -xf picodet.tar
# Create C header files
cd ..
python3 ./convert_image.py ./image/000000014439_640x640.jpg
# Build demo executable
cd ${script_dir}
echo ${script_dir}
make
# Run demo executable on the AVH
$Platform -C cpu0.CFGDTCMSZ=15 \
-C cpu0.CFGITCMSZ=15 -C mps3_board.uart0.out_file=\"-\" -C mps3_board.uart0.shutdown_tag=\"EXITTHESIM\" \
-C mps3_board.visualisation.disable-visualisation=1 -C mps3_board.telnetterminal0.start_telnet=0 \
-C mps3_board.telnetterminal1.start_telnet=0 -C mps3_board.telnetterminal2.start_telnet=0 -C mps3_board.telnetterminal5.start_telnet=0 \
./build/demo --stat
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
#include <stdio.h>
#include <tvm_runtime.h>
#include <tvmgen_picodet.h>
#include "uart.h"
// Header files generated by convert_image.py
#include "inputs.h"
#include "outputs.h"
int main(int argc, char **argv) {
uart_init();
printf("Starting PicoDet inference:\n");
struct tvmgen_picodet_outputs rec_outputs = {
.output0 = output0, .output1 = output1,
};
struct tvmgen_picodet_inputs rec_inputs = {
.image = input,
};
tvmgen_picodet_run(&rec_inputs, &rec_outputs);
// post process
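// output0 holds 2125 candidate boxes as (x1, y1, x2, y2) quadruples and
// output1 holds 80 class scores per candidate, stored class-major (80 x 2125).
// Coordinates are scaled by 2 below because the 640x640 source image was
// resized to the 320x320 model input.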
for (int i = 0; i < output0_len / 4; i++) {
float score = 0;
int32_t class = 0;
for (int j = 0; j < 80; j++) {
if (output1[i + j * 2125] > score) {
score = output1[i + j * 2125];
class = j;
}
}
if (score > 0.1 && output0[i * 4] > 0 && output0[i * 4 + 1] > 0) {
printf("box: %f, %f, %f, %f, class: %d, score: %f\n", output0[i * 4] * 2,
output0[i * 4 + 1] * 2, output0[i * 4 + 2] * 2,
output0[i * 4 + 3] * 2, class, score);
}
}
return 0;
}
cmake_minimum_required(VERSION 3.9)
project(picodet-mnn)
set(CMAKE_CXX_STANDARD 17)
set(MNN_DIR "./mnn")
# find_package(OpenCV REQUIRED PATHS "/work/dependence/opencv/opencv-3.4.3/build")
find_package(OpenCV REQUIRED)
include_directories(
${MNN_DIR}/include
${MNN_DIR}/include/MNN
${CMAKE_SOURCE_DIR}
)
link_directories(mnn/lib)
add_library(libMNN SHARED IMPORTED)
set_target_properties(
libMNN
PROPERTIES IMPORTED_LOCATION
${CMAKE_SOURCE_DIR}/mnn/lib/libMNN.so
)
add_executable(picodet-mnn main.cpp picodet_mnn.cpp)
target_link_libraries(picodet-mnn MNN ${OpenCV_LIBS} libMNN.so)
# PicoDet MNN Demo
This demo's inference code runs on [Alibaba's MNN framework](https://github.com/alibaba/MNN).
## C++ Demo
- Step 1: Build the MNN inference library by following the [official MNN build guide](https://www.yuque.com/mnn/en/build_linux).
- Step 2: Build or download the OpenCV library (see the OpenCV website for details). For convenience, if your environment is gcc 8.2 on x86, you can download the following prebuilt package directly:
```shell
wget https://paddledet.bj.bcebos.com/data/opencv-3.4.16_gcc8.2_ffmpeg.tar.gz
tar -xf opencv-3.4.16_gcc8.2_ffmpeg.tar.gz
```
- Step 3: Prepare the models
```shell
modelName=picodet_s_320_coco_lcnet
# Export the inference model
python tools/export_model.py \
-c configs/picodet/${modelName}.yml \
-o weights=${modelName}.pdparams \
--output_dir=inference_model
# Convert to ONNX
paddle2onnx --model_dir inference_model/${modelName} \
--model_filename model.pdmodel \
--params_filename model.pdiparams \
--opset_version 11 \
--save_file ${modelName}.onnx
# Simplify the model
python -m onnxsim ${modelName}.onnx ${modelName}_processed.onnx
# Convert the model to MNN format
python -m MNN.tools.mnnconvert -f ONNX --modelFile picodet_s_320_lcnet_processed.onnx --MNNModel picodet_s_320_lcnet.mnn
```
For a quick test, you can directly download [picodet_s_320_lcnet.mnn](https://paddledet.bj.bcebos.com/deploy/third_engine/picodet_s_320_lcnet.mnn) (without post-processing).
**Note:** Because MNN's MatMul operator computes incorrectly when its input shapes are inconsistent, the demo with post-processing is still being upgraded and will be released soon.
## Build the executable
- Step 1: Import the library files
```
mkdir mnn && cd mnn && mkdir lib
cp /path/to/MNN/build/libMNN.so ./lib
cd ..
cp -r /path/to/MNN/include ./mnn
```
- Step 2: Update the OpenCV and MNN paths in CMakeLists.txt
- Step 3: Build
``` shell
mkdir build && cd build
cmake ..
make
```
If the `picodet-mnn` executable is generated in the build directory, the build succeeded.
## Run
First, create a directory to store the prediction results:
```shell
cp -r ../demo_onnxruntime/imgs .
cd build
mkdir ../results
```
- Run inference on a single image
``` shell
./picodet-mnn 0 ../picodet_s_320_lcnet_3.mnn 320 320 ../imgs/dog.jpg
```
- Speed benchmark
``` shell
./picodet-mnn 1 ../picodet_s_320_lcnet.mnn 320 320
```
## FAQ
- Prediction results are inaccurate:
First confirm that the model input shape matches and that the model output names match. For the enhanced PicoDet model without post-processing, the output names are as follows:
```shell
# classification branch | detection branch
{"transpose_0.tmp_0", "transpose_1.tmp_0"},
{"transpose_2.tmp_0", "transpose_3.tmp_0"},
{"transpose_4.tmp_0", "transpose_5.tmp_0"},
{"transpose_6.tmp_0", "transpose_7.tmp_0"},
```
You can use [netron](https://netron.app) to check the exact names, and modify the corresponding `non_postprocess_heads_info` array in `picodet_mnn.hpp`.
## Reference
[MNN](https://github.com/alibaba/MNN)
// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include "picodet_mnn.hpp"
#include <iostream>
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#define __SAVE_RESULT__ // if defined, save drawn results to ../results;
// otherwise show them in a window
struct object_rect {
int x;
int y;
int width;
int height;
};
std::vector<int> GenerateColorMap(int num_class) {
auto colormap = std::vector<int>(3 * num_class, 0);
for (int i = 0; i < num_class; ++i) {
int j = 0;
int lab = i;
while (lab) {
colormap[i * 3] |= (((lab >> 0) & 1) << (7 - j));
colormap[i * 3 + 1] |= (((lab >> 1) & 1) << (7 - j));
colormap[i * 3 + 2] |= (((lab >> 2) & 1) << (7 - j));
++j;
lab >>= 3;
}
}
return colormap;
}
void draw_bboxes(const cv::Mat &im, const std::vector<BoxInfo> &bboxes,
std::string save_path = "None") {
static const char *class_names[] = {
"person", "bicycle", "car",
"motorcycle", "airplane", "bus",
"train", "truck", "boat",
"traffic light", "fire hydrant", "stop sign",
"parking meter", "bench", "bird",
"cat", "dog", "horse",
"sheep", "cow", "elephant",
"bear", "zebra", "giraffe",
"backpack", "umbrella", "handbag",
"tie", "suitcase", "frisbee",
"skis", "snowboard", "sports ball",
"kite", "baseball bat", "baseball glove",
"skateboard", "surfboard", "tennis racket",
"bottle", "wine glass", "cup",
"fork", "knife", "spoon",
"bowl", "banana", "apple",
"sandwich", "orange", "broccoli",
"carrot", "hot dog", "pizza",
"donut", "cake", "chair",
"couch", "potted plant", "bed",
"dining table", "toilet", "tv",
"laptop", "mouse", "remote",
"keyboard", "cell phone", "microwave",
"oven", "toaster", "sink",
"refrigerator", "book", "clock",
"vase", "scissors", "teddy bear",
"hair drier", "toothbrush"};
cv::Mat image = im.clone();
int src_w = image.cols;
int src_h = image.rows;
int thickness = 2;
auto colormap = GenerateColorMap(sizeof(class_names) / sizeof(class_names[0]));
for (size_t i = 0; i < bboxes.size(); i++) {
const BoxInfo &bbox = bboxes[i];
std::cout << bbox.x1 << ", " << bbox.y1 << ", " << bbox.x2 << ", "
<< bbox.y2 << std::endl;
int c1 = colormap[3 * bbox.label + 0];
int c2 = colormap[3 * bbox.label + 1];
int c3 = colormap[3 * bbox.label + 2];
cv::Scalar color = cv::Scalar(c1, c2, c3);
// cv::Scalar color = cv::Scalar(0, 0, 255);
cv::rectangle(image, cv::Rect(cv::Point(bbox.x1, bbox.y1),
cv::Point(bbox.x2, bbox.y2)),
color, 1, cv::LINE_AA);
char text[256];
snprintf(text, sizeof(text), "%s %.1f%%", class_names[bbox.label],
bbox.score * 100);
int baseLine = 0;
cv::Size label_size =
cv::getTextSize(text, cv::FONT_HERSHEY_SIMPLEX, 0.4, 1, &baseLine);
int x = bbox.x1;
int y = bbox.y1 - label_size.height - baseLine;
if (y < 0)
y = 0;
if (x + label_size.width > image.cols)
x = image.cols - label_size.width;
cv::rectangle(image, cv::Rect(cv::Point(x, y),
cv::Size(label_size.width,
label_size.height + baseLine)),
color, -1);
cv::putText(image, text, cv::Point(x, y + label_size.height),
cv::FONT_HERSHEY_SIMPLEX, 0.4, cv::Scalar(255, 255, 255), 1,
cv::LINE_AA);
}
if (save_path == "None") {
cv::imshow("image", image);
} else {
cv::imwrite(save_path, image);
std::cout << save_path << std::endl;
}
}
int image_demo(PicoDet &detector, const char *imagepath) {
std::vector<cv::String> filenames;
cv::glob(imagepath, filenames, false);
for (auto img_name : filenames) {
cv::Mat image = cv::imread(img_name, cv::IMREAD_COLOR);
if (image.empty()) {
fprintf(stderr, "cv::imread %s failed\n", img_name.c_str());
return -1;
}
std::vector<BoxInfo> results;
detector.detect(image, results, false);
std::cout << "detect done." << std::endl;
#ifdef __SAVE_RESULT__
std::string save_path = img_name;
draw_bboxes(image, results, save_path.replace(3, 4, "results"));
#else
draw_bboxes(image, results);
cv::waitKey(0);
#endif
}
return 0;
}
int benchmark(PicoDet &detector, int width, int height) {
int loop_num = 100;
int warm_up = 8;
double time_min = DBL_MAX;
double time_max = -DBL_MAX;
double time_avg = 0;
cv::Mat image(height, width, CV_8UC3, cv::Scalar(1, 1, 1)); // cv::Mat takes (rows, cols)
for (int i = 0; i < warm_up + loop_num; i++) {
auto start = std::chrono::steady_clock::now();
std::vector<BoxInfo> results;
detector.detect(image, results, false);
auto end = std::chrono::steady_clock::now();
std::chrono::duration<double> elapsed = end - start;
double time = elapsed.count();
if (i >= warm_up) {
time_min = (std::min)(time_min, time);
time_max = (std::max)(time_max, time);
time_avg += time;
}
}
time_avg /= loop_num;
fprintf(stderr, "%20s min = %7.2f max = %7.2f avg = %7.2f\n", "picodet",
time_min, time_max, time_avg);
return 0;
}
int main(int argc, char **argv) {
if (argc < 3) {
std::cout << "Usage: ./picodet-mnn [mode] [model_path] ([height] [width] "
"[image_file])"
<< std::endl;
return -1;
}
int mode = atoi(argv[1]);
std::string model_path = argv[2];
int height = 320;
int width = 320;
// Reading both height and width requires at least five arguments
if (argc >= 5) {
height = atoi(argv[3]);
width = atoi(argv[4]);
}
PicoDet detector = PicoDet(model_path, width, height, 4, 0.45, 0.3);
if (mode == 1) {
benchmark(detector, width, height);
} else {
if (argc != 6) {
std::cout << "Must set image file, such as ./picodet-mnn 0 "
"../picodet_s_320_lcnet.mnn 320 320 img.jpg"
<< std::endl;
return -1;
}
const char *images = argv[5];
image_demo(detector, images);
}
}
// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
// reference from https://github.com/RangiLyu/nanodet/tree/main/demo_mnn
#include "picodet_mnn.hpp"
using namespace std;
PicoDet::PicoDet(const std::string &mnn_path, int input_width, int input_length,
int num_thread_, float score_threshold_,
float nms_threshold_) {
num_thread = num_thread_;
in_w = input_width;
in_h = input_length;
score_threshold = score_threshold_;
nms_threshold = nms_threshold_;
PicoDet_interpreter = std::shared_ptr<MNN::Interpreter>(
MNN::Interpreter::createFromFile(mnn_path.c_str()));
MNN::ScheduleConfig config;
config.numThread = num_thread;
MNN::BackendConfig backendConfig;
backendConfig.precision = (MNN::BackendConfig::PrecisionMode)2;
config.backendConfig = &backendConfig;
PicoDet_session = PicoDet_interpreter->createSession(config);
input_tensor = PicoDet_interpreter->getSessionInput(PicoDet_session, nullptr);
}
PicoDet::~PicoDet() {
PicoDet_interpreter->releaseModel();
PicoDet_interpreter->releaseSession(PicoDet_session);
}
int PicoDet::detect(cv::Mat &raw_image, std::vector<BoxInfo> &result_list,
bool has_postprocess) {
if (raw_image.empty()) {
std::cout << "image is empty ,please check!" << std::endl;
return -1;
}
image_h = raw_image.rows;
image_w = raw_image.cols;
cv::Mat image;
cv::resize(raw_image, image, cv::Size(in_w, in_h));
PicoDet_interpreter->resizeTensor(input_tensor, {1, 3, in_h, in_w});
PicoDet_interpreter->resizeSession(PicoDet_session);
std::shared_ptr<MNN::CV::ImageProcess> pretreat(MNN::CV::ImageProcess::create(
MNN::CV::BGR, MNN::CV::BGR, mean_vals, 3, norm_vals, 3));
pretreat->convert(image.data, in_w, in_h, image.step[0], input_tensor);
auto start = chrono::steady_clock::now();
// run network
PicoDet_interpreter->runSession(PicoDet_session);
// get output data
std::vector<std::vector<BoxInfo>> results;
results.resize(num_class);
if (has_postprocess) {
auto bbox_out_tensor = PicoDet_interpreter->getSessionOutput(
PicoDet_session, nms_heads_info[0].c_str());
auto class_out_tensor = PicoDet_interpreter->getSessionOutput(
PicoDet_session, nms_heads_info[1].c_str());
// bbox branch
auto tensor_bbox_host =
new MNN::Tensor(bbox_out_tensor, MNN::Tensor::CAFFE);
bbox_out_tensor->copyToHostTensor(tensor_bbox_host);
auto bbox_output_shape = tensor_bbox_host->shape();
int output_size = 1;
for (int j = 0; j < bbox_output_shape.size(); ++j) {
output_size *= bbox_output_shape[j];
}
std::cout << "output_size:" << output_size << std::endl;
bbox_output_data_.resize(output_size);
std::copy_n(tensor_bbox_host->host<float>(), output_size,
bbox_output_data_.data());
delete tensor_bbox_host;
// class branch
auto tensor_class_host =
new MNN::Tensor(class_out_tensor, MNN::Tensor::CAFFE);
class_out_tensor->copyToHostTensor(tensor_class_host);
auto class_output_shape = tensor_class_host->shape();
output_size = 1;
for (int j = 0; j < class_output_shape.size(); ++j) {
output_size *= class_output_shape[j];
}
std::cout << "output_size:" << output_size << std::endl;
class_output_data_.resize(output_size);
std::copy_n(tensor_class_host->host<float>(), output_size,
class_output_data_.data());
delete tensor_class_host;
} else {
for (const auto &head_info : non_postprocess_heads_info) {
MNN::Tensor *tensor_scores = PicoDet_interpreter->getSessionOutput(
PicoDet_session, head_info.cls_layer.c_str());
MNN::Tensor *tensor_boxes = PicoDet_interpreter->getSessionOutput(
PicoDet_session, head_info.dis_layer.c_str());
MNN::Tensor tensor_scores_host(tensor_scores,
tensor_scores->getDimensionType());
tensor_scores->copyToHostTensor(&tensor_scores_host);
MNN::Tensor tensor_boxes_host(tensor_boxes,
tensor_boxes->getDimensionType());
tensor_boxes->copyToHostTensor(&tensor_boxes_host);
decode_infer(&tensor_scores_host, &tensor_boxes_host, head_info.stride,
score_threshold, results);
}
}
auto end = chrono::steady_clock::now();
chrono::duration<double> elapsed = end - start;
cout << "inference time:" << elapsed.count() << " s, ";
for (int i = 0; i < (int)results.size(); i++) {
nms(results[i], nms_threshold);
for (auto box : results[i]) {
box.x1 = box.x1 / in_w * image_w;
box.x2 = box.x2 / in_w * image_w;
box.y1 = box.y1 / in_h * image_h;
box.y2 = box.y2 / in_h * image_h;
result_list.push_back(box);
}
}
cout << "detect " << result_list.size() << " objects" << endl;
return 0;
}
void PicoDet::decode_infer(MNN::Tensor *cls_pred, MNN::Tensor *dis_pred,
int stride, float threshold,
std::vector<std::vector<BoxInfo>> &results) {
int feature_h = ceil((float)in_h / stride);
int feature_w = ceil((float)in_w / stride);
for (int idx = 0; idx < feature_h * feature_w; idx++) {
const float *scores = cls_pred->host<float>() + (idx * num_class);
int row = idx / feature_w;
int col = idx % feature_w;
float score = 0;
int cur_label = 0;
for (int label = 0; label < num_class; label++) {
if (scores[label] > score) {
score = scores[label];
cur_label = label;
}
}
if (score > threshold) {
const float *bbox_pred =
dis_pred->host<float>() + (idx * 4 * (reg_max + 1));
results[cur_label].push_back(
disPred2Bbox(bbox_pred, cur_label, score, col, row, stride));
}
}
}
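// Decode one Distribution Focal Loss (DFL) box head: each of the four
// distances (left, top, right, bottom) is predicted as a discrete distribution
// over reg_max+1 bins; softmax the bins, take the expectation, and scale by
// the stride to get pixel offsets from the anchor-point centre.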
BoxInfo PicoDet::disPred2Bbox(const float *&dfl_det, int label, float score,
int x, int y, int stride) {
float ct_x = (x + 0.5) * stride;
float ct_y = (y + 0.5) * stride;
std::vector<float> dis_pred;
dis_pred.resize(4);
for (int i = 0; i < 4; i++) {
float dis = 0;
float *dis_after_sm = new float[reg_max + 1];
activation_function_softmax(dfl_det + i * (reg_max + 1), dis_after_sm,
reg_max + 1);
for (int j = 0; j < reg_max + 1; j++) {
dis += j * dis_after_sm[j];
}
dis *= stride;
dis_pred[i] = dis;
delete[] dis_after_sm;
}
float xmin = (std::max)(ct_x - dis_pred[0], .0f);
float ymin = (std::max)(ct_y - dis_pred[1], .0f);
float xmax = (std::min)(ct_x + dis_pred[2], (float)in_w);
float ymax = (std::min)(ct_y + dis_pred[3], (float)in_h);
return BoxInfo{xmin, ymin, xmax, ymax, score, label};
}
void PicoDet::nms(std::vector<BoxInfo> &input_boxes, float NMS_THRESH) {
std::sort(input_boxes.begin(), input_boxes.end(),
[](BoxInfo a, BoxInfo b) { return a.score > b.score; });
std::vector<float> vArea(input_boxes.size());
for (int i = 0; i < int(input_boxes.size()); ++i) {
vArea[i] = (input_boxes.at(i).x2 - input_boxes.at(i).x1 + 1) *
(input_boxes.at(i).y2 - input_boxes.at(i).y1 + 1);
}
for (int i = 0; i < int(input_boxes.size()); ++i) {
for (int j = i + 1; j < int(input_boxes.size());) {
float xx1 = (std::max)(input_boxes[i].x1, input_boxes[j].x1);
float yy1 = (std::max)(input_boxes[i].y1, input_boxes[j].y1);
float xx2 = (std::min)(input_boxes[i].x2, input_boxes[j].x2);
float yy2 = (std::min)(input_boxes[i].y2, input_boxes[j].y2);
float w = (std::max)(float(0), xx2 - xx1 + 1);
float h = (std::max)(float(0), yy2 - yy1 + 1);
float inter = w * h;
float ovr = inter / (vArea[i] + vArea[j] - inter);
if (ovr >= NMS_THRESH) {
input_boxes.erase(input_boxes.begin() + j);
vArea.erase(vArea.begin() + j);
} else {
j++;
}
}
}
}
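// Schraudolph-style fast exponential: writes an IEEE-754 bit pattern directly,
// using 1/ln(2) = 1.4426950409 to turn e^x into 2^y and a shifted exponent
// bias (~126.935) as the correction term; accurate to a few percent, which is
// sufficient for the softmax scoring here.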
inline float fast_exp(float x) {
union {
uint32_t i;
float f;
} v{};
v.i = (1 << 23) * (1.4426950409 * x + 126.93490512f);
return v.f;
}
inline float sigmoid(float x) { return 1.0f / (1.0f + fast_exp(-x)); }
template <typename _Tp>
int activation_function_softmax(const _Tp *src, _Tp *dst, int length) {
const _Tp alpha = *std::max_element(src, src + length);
_Tp denominator{0};
for (int i = 0; i < length; ++i) {
dst[i] = fast_exp(src[i] - alpha);
denominator += dst[i];
}
for (int i = 0; i < length; ++i) {
dst[i] /= denominator;
}
return 0;
}
// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#ifndef __PicoDet_H__
#define __PicoDet_H__
#pragma once
#include "Interpreter.hpp"
#include "ImageProcess.hpp"
#include "MNNDefine.h"
#include "Tensor.hpp"
#include <algorithm>
#include <chrono>
#include <iostream>
#include <memory>
#include <opencv2/opencv.hpp>
#include <string>
#include <vector>
typedef struct NonPostProcessHeadInfo_ {
std::string cls_layer;
std::string dis_layer;
int stride;
} NonPostProcessHeadInfo;
typedef struct BoxInfo_ {
float x1;
float y1;
float x2;
float y2;
float score;
int label;
} BoxInfo;
class PicoDet {
public:
PicoDet(const std::string &mnn_path, int input_width, int input_length,
int num_thread_ = 4, float score_threshold_ = 0.5,
float nms_threshold_ = 0.3);
~PicoDet();
int detect(cv::Mat &img, std::vector<BoxInfo> &result_list,
bool has_postprocess);
private:
void decode_infer(MNN::Tensor *cls_pred, MNN::Tensor *dis_pred, int stride,
float threshold,
std::vector<std::vector<BoxInfo>> &results);
BoxInfo disPred2Bbox(const float *&dfl_det, int label, float score, int x,
int y, int stride);
void nms(std::vector<BoxInfo> &input_boxes, float NMS_THRESH);
private:
std::shared_ptr<MNN::Interpreter> PicoDet_interpreter;
MNN::Session *PicoDet_session = nullptr;
MNN::Tensor *input_tensor = nullptr;
int num_thread;
int image_w;
int image_h;
int in_w = 320;
int in_h = 320;
float score_threshold;
float nms_threshold;
const float mean_vals[3] = {103.53f, 116.28f, 123.675f};
const float norm_vals[3] = {0.017429f, 0.017507f, 0.017125f};
const int num_class = 80;
const int reg_max = 7;
std::vector<float> bbox_output_data_;
std::vector<float> class_output_data_;
std::vector<std::string> nms_heads_info{"tmp_16", "concat_4.tmp_0"};
// If not export post-process, will use non_postprocess_heads_info
std::vector<NonPostProcessHeadInfo> non_postprocess_heads_info{
// cls_pred|dis_pred|stride
{"transpose_0.tmp_0", "transpose_1.tmp_0", 8},
{"transpose_2.tmp_0", "transpose_3.tmp_0", 16},
{"transpose_4.tmp_0", "transpose_5.tmp_0", 32},
{"transpose_6.tmp_0", "transpose_7.tmp_0", 64},
};
};
template <typename _Tp>
int activation_function_softmax(const _Tp *src, _Tp *dst, int length);
inline float fast_exp(float x);
inline float sigmoid(float x);
#endif
cmake_minimum_required(VERSION 3.9)
project(tinypose-mnn)
set(CMAKE_CXX_STANDARD 17)
set(MNN_DIR {YOUR_MNN_DIR})
find_package(OpenCV REQUIRED)
include_directories(
${MNN_DIR}/include
${MNN_DIR}/include/MNN
${CMAKE_SOURCE_DIR}
)
link_directories(mnn/lib)
add_library(libMNN SHARED IMPORTED)
set_target_properties(
libMNN
PROPERTIES IMPORTED_LOCATION
${CMAKE_SOURCE_DIR}/mnn/lib/libMNN.so
)
add_executable(tinypose-mnn main.cpp picodet_mnn.cpp keypoint_detector.cpp keypoint_postprocess.cpp)
target_link_libraries(tinypose-mnn MNN ${OpenCV_LIBS} libMNN.so)
cmake_minimum_required(VERSION 3.9)
project(tinypose-mnn)
set(CMAKE_CXX_STANDARD 17)
set(MNN_DIR {YOUR_MNN_DIR})
set(NDK_ROOT {YOUR_ANDROID_NDK_PATH})
set(LDFLAGS -latomic -pthread -ldl -llog -lz -static-libstdc++)
set(OpenCV_DIR ${CMAKE_SOURCE_DIR}/third/opencv4.1.0/arm64-v8a)
set(OpenCV_DEPS ${OpenCV_DIR}/libs/libopencv_imgcodecs.a
${OpenCV_DIR}/libs/libopencv_imgproc.a
${OpenCV_DIR}/libs/libopencv_core.a
${OpenCV_DIR}/3rdparty/libs/libtegra_hal.a
${OpenCV_DIR}/3rdparty/libs/liblibjpeg-turbo.a
${OpenCV_DIR}/3rdparty/libs/liblibwebp.a
${OpenCV_DIR}/3rdparty/libs/liblibpng.a
${OpenCV_DIR}/3rdparty/libs/liblibjasper.a
${OpenCV_DIR}/3rdparty/libs/liblibtiff.a
${OpenCV_DIR}/3rdparty/libs/libIlmImf.a
${OpenCV_DIR}/3rdparty/libs/libtbb.a
${OpenCV_DIR}/3rdparty/libs/libcpufeatures.a)
set(FLAGS "-pie -Wl,--gc-sections -funwind-tables -no-canonical-prefixes -D__ANDROID_API__=21 -fexceptions -frtti -std=c++11 -O3 -DNDEBUG -fPIE -fopenmp")
set(CMAKE_CXX_FLAGS "--sysroot=${NDK_ROOT}/sysroot ${FLAGS}")
set(STDCXX ${NDK_ROOT}/sources/cxx-stl/llvm-libc++/libs/arm64-v8a/libc++_static.a
${NDK_ROOT}/sources/cxx-stl/llvm-libc++/libs/arm64-v8a/libc++abi.a
${NDK_ROOT}/platforms/android-21/arch-arm64/usr/lib/libstdc++.a)
set(SYS_INCS ${NDK_ROOT}/sysroot/usr/include/aarch64-linux-android/ ${NDK_ROOT}/sources/cxx-stl/llvm-libc++/include/ ${NDK_ROOT}/sources/cxx-stl/llvm-libc++abi/include/ ${NDK_ROOT}/sources/android/support/include/ ${NDK_ROOT}/sysroot/usr/include/)
include_directories(
${SYS_INCS}
${OpenCV_DIR}/include
${MNN_DIR}/include
${MNN_DIR}/include/MNN
${CMAKE_SOURCE_DIR}
)
link_directories(${NDK_ROOT}/platforms/android-21/arch-arm64)
link_directories(${MNN_DIR}/project/android/build_64)
add_executable(tinypose-mnn picodet_mnn.cpp keypoint_postprocess.cpp keypoint_detector.cpp main.cpp)
target_link_libraries(tinypose-mnn -lMNN ${OpenCV_DEPS} ${STDCXX} ${LDFLAGS})
# TinyPose MNN Demo
This folder provides PicoDet+TinyPose inference code using
[Alibaba's MNN framework](https://github.com/alibaba/MNN). Most of the implementation in
this folder is the same as *demo_ncnn*.
## Install MNN
### Python library
Just run:
``` shell
pip install MNN
```
### C++ library
Please follow the [official document](https://www.yuque.com/mnn/en/build_linux) to build MNN engine.
- Create picodet_m_416_coco.onnx and tinypose256.onnx. For example:
```shell
modelName=picodet_m_416_coco
# export model
python tools/export_model.py \
-c configs/picodet/${modelName}.yml \
-o weights=${modelName}.pdparams \
--output_dir=inference_model
# convert to onnx
paddle2onnx --model_dir inference_model/${modelName} \
--model_filename model.pdmodel \
--params_filename model.pdiparams \
--opset_version 11 \
--save_file ${modelName}.onnx
# onnxsim
python -m onnxsim ${modelName}.onnx ${modelName}_processed.onnx
```
- Convert the models. For example:
``` shell
python -m MNN.tools.mnnconvert -f ONNX --modelFile picodet-416.onnx --MNNModel picodet-416.mnn
```
Here are the converted models:
[picodet_m_416](https://paddledet.bj.bcebos.com/deploy/third_engine/picodet_m_416.mnn),
[tinypose256](https://paddledet.bj.bcebos.com/deploy/third_engine/tinypose256.mnn).
## Build
For the C++ code, replace `libMNN.so` under *./mnn/lib* with the one you just compiled, modify the OpenCV and MNN paths in the CMake file,
and run
``` shell
mkdir build && cd build
cmake ..
make
```
Note that a flag in `main.cpp` is used to control whether to show the detection result or save it into a folder.
``` c++
#define __SAVE_RESULT__ // if defined save drawed results to ../results, else show it in windows
```
#### ARM Build
Prepare OpenCV library [OpenCV_4_1](https://paddle-inference-dist.bj.bcebos.com/opencv4.1.0.tar.gz).
``` shell
mkdir third && cd third
wget https://paddle-inference-dist.bj.bcebos.com/opencv4.1.0.tar.gz
tar -zxvf opencv4.1.0.tar.gz
cd ..
mkdir build && cd build
cmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI="arm64-v8a" -DANDROID_PLATFORM=android-21 -DANDROID_TOOLCHAIN=gcc ..
make
```
## Run
To detect images in a folder, run:
``` shell
./tinypose-mnn [mode] [image_file]
```
| param | detail |
| ---- | ---- |
| --mode | input mode: 0: camera; 1: image; 2: video; 3: benchmark |
| --image_file | input image path |
For example:
``` shell
./tinypose-mnn "1" "../imgs/test.jpg"
```
For speed benchmark:
``` shell
./tinypose-mnn "3" "0"
```
## Benchmark
Platform: Kirin 980
Model: [tinypose256](https://paddledet.bj.bcebos.com/deploy/third_engine/tinypose256.mnn)
| param | Min(s) | Max(s) | Avg(s) |
| -------- | ------ | ------ | ------ |
| Thread=4 | 0.018 | 0.021 | 0.019 |
| Thread=1 | 0.031 | 0.041 | 0.032 |
## Reference
[MNN](https://github.com/alibaba/MNN)
// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include <chrono>
#include <iomanip> // for setprecision
#include <sstream>
#include "keypoint_detector.h"
namespace PaddleDetection {
// Visualize keypoint detection results
cv::Mat VisualizeKptsResult(const cv::Mat& img,
const std::vector<KeyPointResult>& results,
const std::vector<int>& colormap,
float threshold) {
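// COCO 17-keypoint skeleton: each pair is an edge between two joint indices.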
const int edge[][2] = {{0, 1},
{0, 2},
{1, 3},
{2, 4},
{3, 5},
{4, 6},
{5, 7},
{6, 8},
{7, 9},
{8, 10},
{5, 11},
{6, 12},
{11, 13},
{12, 14},
{13, 15},
{14, 16},
{11, 12}};
cv::Mat vis_img = img.clone();
for (int batchid = 0; batchid < results.size(); batchid++) {
for (int i = 0; i < results[batchid].num_joints; i++) {
if (results[batchid].keypoints[i * 3] > threshold) {
int x_coord = int(results[batchid].keypoints[i * 3 + 1]);
int y_coord = int(results[batchid].keypoints[i * 3 + 2]);
cv::circle(vis_img,
cv::Point2d(x_coord, y_coord),
1,
cv::Scalar(0, 0, 255),
2);
}
}
for (int i = 0; i < results[batchid].num_joints; i++) {
if (results[batchid].keypoints[edge[i][0] * 3] > threshold &&
results[batchid].keypoints[edge[i][1] * 3] > threshold) {
int x_start = int(results[batchid].keypoints[edge[i][0] * 3 + 1]);
int y_start = int(results[batchid].keypoints[edge[i][0] * 3 + 2]);
int x_end = int(results[batchid].keypoints[edge[i][1] * 3 + 1]);
int y_end = int(results[batchid].keypoints[edge[i][1] * 3 + 2]);
cv::line(vis_img,
cv::Point2d(x_start, y_start),
cv::Point2d(x_end, y_end),
colormap[i],
1);
}
}
}
return vis_img;
}
void KeyPointDetector::Postprocess(std::vector<float>& output,
std::vector<int>& output_shape,
std::vector<int>& idxout,
std::vector<int>& idx_shape,
std::vector<KeyPointResult>* result,
std::vector<std::vector<float>>& center_bs,
std::vector<std::vector<float>>& scale_bs) {
std::vector<float> preds(output_shape[1] * 3, 0);
for (int batchid = 0; batchid < output_shape[0]; batchid++) {
get_final_preds(output,
output_shape,
idxout,
idx_shape,
center_bs[batchid],
scale_bs[batchid],
preds,
batchid,
this->use_dark());
KeyPointResult result_item;
result_item.num_joints = output_shape[1];
result_item.keypoints.clear();
for (int i = 0; i < output_shape[1]; i++) {
result_item.keypoints.emplace_back(preds[i * 3]);
result_item.keypoints.emplace_back(preds[i * 3 + 1]);
result_item.keypoints.emplace_back(preds[i * 3 + 2]);
}
result->push_back(result_item);
}
}
void KeyPointDetector::Predict(const std::vector<cv::Mat> imgs,
std::vector<std::vector<float>>& center_bs,
std::vector<std::vector<float>>& scale_bs,
std::vector<KeyPointResult>* result) {
int batch_size = imgs.size();
KeyPointDet_interpreter->resizeTensor(input_tensor,
{batch_size, 3, in_h, in_w});
KeyPointDet_interpreter->resizeSession(KeyPointDet_session);
auto insize = 3 * in_h * in_w;
// Preprocess image
cv::Mat resized_im;
for (int bs_idx = 0; bs_idx < batch_size; bs_idx++) {
cv::Mat im = imgs.at(bs_idx);
cv::resize(im, resized_im, cv::Size(in_w, in_h));
std::shared_ptr<MNN::CV::ImageProcess> pretreat(
MNN::CV::ImageProcess::create(
MNN::CV::BGR, MNN::CV::RGB, mean_vals, 3, norm_vals, 3));
pretreat->convert(
resized_im.data, in_w, in_h, resized_im.step[0], input_tensor);
}
// Run predictor
auto inference_start = std::chrono::steady_clock::now();
KeyPointDet_interpreter->runSession(KeyPointDet_session);
// Get output tensor
auto out_tensor = KeyPointDet_interpreter->getSessionOutput(
KeyPointDet_session, "conv2d_441.tmp_1");
auto nchwoutTensor = new Tensor(out_tensor, Tensor::CAFFE);
out_tensor->copyToHostTensor(nchwoutTensor);
auto output_shape = nchwoutTensor->shape();
// Calculate output length
int output_size = 1;
for (int j = 0; j < output_shape.size(); ++j) {
output_size *= output_shape[j];
}
output_data_.resize(output_size);
std::copy_n(nchwoutTensor->host<float>(), output_size, output_data_.data());
delete nchwoutTensor;
auto idx_tensor = KeyPointDet_interpreter->getSessionOutput(
KeyPointDet_session, "argmax_0.tmp_0");
auto idxhostTensor = new Tensor(idx_tensor, Tensor::CAFFE);
idx_tensor->copyToHostTensor(idxhostTensor);
auto idx_shape = idxhostTensor->shape();
// Calculate output length
output_size = 1;
for (int j = 0; j < idx_shape.size(); ++j) {
output_size *= idx_shape[j];
}
idx_data_.resize(output_size);
std::copy_n(idxhostTensor->host<int>(), output_size, idx_data_.data());
delete idxhostTensor;
auto inference_end = std::chrono::steady_clock::now();
std::chrono::duration<double> elapsed = inference_end - inference_start;
printf("keypoint inference time: %f s\n", elapsed.count());
// Postprocessing result
Postprocess(output_data_,
output_shape,
idx_data_,
idx_shape,
result,
center_bs,
scale_bs);
}
} // namespace PaddleDetection