    Add 'Permute' device op & example (#408) · f584ab0c
    Po Yen Chen authored
    * Add example folder for 'DeviceElementwise'
    
    * Re-structure example files
    
    * Move common parts into common.hpp
    
    * Use stricter input
    
    * Add more helper methods in 'DeviceElementwise'
    
    * Use more specific method to write example
    
    * Allow specifying problem through command-line argument
    
    * Allow specifying problem 'axes' through command-line argument
    
    * Add check to template type argument
    
    * Add transpose_shape() to generalize shape permute
    
    * Generalize transpose utility functions
    
    * Use better name for tensor indices
    
    * Add checks in helper functions
    
    * Remove debug messages
    
    * Refine error message for check_err()
    
    * Generalize variable naming in example code
    
    * Add device op 'DevicePermute'
    
    This device op is a clone of 'DeviceElementwise'
    
    * Use 'DevicePermute' device op in example
    
    * Remove 'elementwise' from identifiers
    
    * Remove 'elementwise' from file paths
    
    * Remove base class of 'DevicePermute'
    
    * Let 'DevicePermute' inherit from 'BaseOperator'
    
    * Add simple type traits to validate device op type
    
    * Add static_assert() to check type constraints
    
    * Create 'DevicePermuteBase' to generate methods
    
    * Use indirect base type to generate methods
    
    * Remove 'is_device_op<>' type traits
    
    * Only accept single-input-single-output for 'DevicePermute'
    
    * Simplify 'DevicePermute' interface
    
    * Re-format 'DeviceElementwise'
    
    * Use CRTP to generate overridden virtual method
    
    * Remove unnecessary include directives
    
    * Distinguish input & output shape in 'DevicePermute'
    
    * Pass 'axes' to 'DevicePermute'
    
    * Use more reasonable return value for Invoker::Run()
    
    * Add 'GridwisePermute' kernel
    
    This kernel is a clone of 'GridwiseElementwise_1D'
    
    * Remove no-longer used type argument
    
    * Check if input/output shape meet the requirement
    
    * Remove no-longer used method
    
    * Remove never-entered-if-clause
    
    * Change problem description for 'DevicePermute'
    
    * Transform descriptor into 3 dimensions
    
    * Add debug code to verify result
    
    * Add comment to indicate template argument location
    
    * Add N/H/WPerBlock template parameter to 'DevicePermute'
    
    * Rename 'GridwisePermute' to 'GridwiseCopy'
    
    * Check tensor descriptor dimensions in 'GridwiseElementwise_1D'
    
    * Add missing include directive
    
    * Add 'BlockSize' parameter to 'DevicePermute'
    
    * Remove no-longer used method
    
    * Add 'BlockToTileMap' for 'GridwiseCopy'
    
    * Use the normal Block2TileMap convention
    
    * Rename 'BlockToTileMap' as 'Block2TileMap'
    
    * Fix most compilation errors
    
    * Let 'Block2TileMap' map block to 2d coordinate
    
    * Allow data transfer in 'GridwiseCopy'
    
    * Fix wrong output descriptor for 2nd blockwise copy
    
    * Rename 'GridwiseCopy' as 'GridwisePermute'
    
    * Remove '1d' in identifiers
    
    * Remove commented-out code
    
    * Remove 'MPerThread' template parameter
    
    * Separate template parameters
    
    * Unify variable naming convention
    
    * Use more verbose way to create expressions
    
    * Add template parameter 'InBlockLdsExtraW'
    
    * Release the constraint on In/OutGridDesc
    
    * Use data type directly as template argument
    
    * Re-arrange template arguments for blockwise copy
    
    * Remove no-longer used template parameters
    
    * Embed layout in the variable names
    
    * Add GridwisePermute::CheckValidity()
    
    * Extract local types as template parameters
    
    * Rename local type alias
    
    * Add more template parameters (vector width related)
    
    * Calculate new SrcVectorDim/DstVectorDim after merge descriptor dimensions
    
    * Fill tensor values start from 1
    
    * Re-format example code
    
    * Avoid too-large block id
    
    * Add comment
    
    * Make sure 'SrcVectorDim' is not same as 'DstVectorDim'
    
    * Add check for the 'VectorDim' & 'ScalarPerVector' template params
    
    * Let 'DstVectorDim' equal 'SrcVectorDim' after transposing out grid desc
    
    * Remove no-longer used template parameter 'NPerBlock'
    
    * Fix wrong descriptor creation logic
    
    * Specify problem in each example
    
    * Use better example name
    
    * Add new example 'example_permute_NxHxW_fp32'
    
    * Add example demonstrating bundling multiple elements in a tensor
    
    * Add support to permute multiple elements together
    
    * Change the default problem size
    
    * Add span<> class template
    
    * Use span<> to generalize check_err() interface
    
    * Fix ambiguous ctor call
    
    * Avoid creating unnecessary objects
    
    * Use helper functions to simplify example code
    
    * Add example for 4xfp16 permute
    
    * Disable failed-to-compile example
    
    * Add check for the NUM_ELEMS_IN_BUNDLE
    
    * Remove redundant parameter in helper lambda function
    
    * Add check for the input tensor type's byte-size
    
    * Check scalar-per-vector with padded length
    
    * Use more verbose name to avoid name collision
    
    * Use fixed 'VectorDim' & 'ScalarPerVector' for LDS
    
    * Embed shape info in name of descriptor constructor
    
    * Rename example folder '36_permute' into '37_permute'
    
    * Avoid using too-large LDS in kernel code
    
    * Remove redundant example
    
    * Use switch() to group similar code
    
    * Add const to the span<> type argument
    
    * Simply initialize tensor with floating point values
    
    * Use fp16 as data type in all examples
    
    * Enlarge tensor size in example
    
    * Enlarge N-dim in example
    
    * Add check for the bundled type in example
    
    * Use stricter error threshold
    
    * Remove global load/store loop in kernel code
    
    * Measure execution time by default
    
    * Use faster device op config for example 'NxHxW_fp16'
    
    * Use faster device op config for example '1xHxW_fp16'
    
    * Use faster device op config for example 'HxWx4_fp16'
    
    * Remove cmd arg parsing logic
    
    * Rename functions
    
    * Extract bundle permutation logic out
    
    * Simplify permute bundle example
    
    * Add Tensor<>::GetElementSpaceSizeInBytes()
    
    * Add Tensor<>::data()
    
    * Use new methods to simplify code
    
    * Use type alias to replace duplicated code
    
    * Use existing method to shorten code
    
    * Allow FillUniformDistribution to accept range argument
    
    * Initialize random values in range
    
    * Add Tensor<>::size()
    
    * Use more meaningful names in permute bundle example
    
    * Use more meaningful names in permute element examples
    
    * Use rangified copy() to copy elements
    
    * Use function return value directly to eliminate variables
    
    * Add to_array() conversion tool to eliminate more variables
    
    * Add Tensor<>::AsSpan<>() to create view of tensor values
    
    * Use AsSpan() to shorten check_err() calls
    
    * Remove no-longer-used 'using' directives
    
    * Move 'using' directive to proper code position
    
    * Remove redundant variables
    
    * Remove useless static_assert()
    
    * Add check for range types
    
    * Declare variable right before first use
    
    * Move long return type as trailing return type
    
    * Add BaseInvokerCRTP<> class template to generate method
    
    * Create new base type for 'DevicePermute' implementations
    
    * Move 'NumDim' template param to the first
    
    * Rename 'DevicePermute' to 'DevicePermuteImpl'
    
    * Add 'noexcept' specifier to CRTP generated method
    
    * Move 'Block2TileMap' definition into 'GridwisePermute'
    
    * Use type alias to reduce code
    
    * Unify naming style in 'DevicePermute'
    
    * Add comments in 'GridwisePermute'
    
    * Rename permute example folder
    
    * Use std::cerr to report error
    
    * Use larger shape in examples
    
    * Rename '38_permute' to '39_permute'
    
    * Make sure we use unsigned type for shape & indices
    
    * Remove opt-ed out assertion
    
    * Remove template BaseInvokerCRTP<>
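
    For readers unfamiliar with the shape-permute step mentioned above
    ("Add transpose_shape() to generalize shape permute"), the idea can be
    sketched as follows. This is an illustration only, not the repository's
    code: the name `transpose_shape_sketch`, its signature, and the convention
    that output dimension i takes its extent from input dimension axes[i]
    are all assumptions made for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not the repository's implementation).
// Assumed convention: result[i] = shape[axes[i]].
template <std::size_t N>
std::array<std::size_t, N> transpose_shape_sketch(const std::array<std::size_t, N>& shape,
                                                  const std::array<std::size_t, N>& axes)
{
    std::array<std::size_t, N> result{};
    std::array<bool, N> seen{}; // tracks which input dimensions were already used

    for(std::size_t i = 0; i < N; ++i)
    {
        // 'axes' must be a permutation of 0..N-1: in range and without repeats
        assert(axes[i] < N && !seen[axes[i]]);
        seen[axes[i]] = true;
        result[i]     = shape[axes[i]];
    }
    return result;
}
```

    Under that assumed convention, a shape of {2, 3, 4} with axes {0, 2, 1}
    yields {2, 4, 3}, i.e. the last two dimensions are swapped.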
permute_1xHxW_fp16.cpp 1.4 KB