* Perform a sequential inclusive postfix reverse scan over the statically-sized \p input array, seeded with the specified \p postfix. The aggregate is returned.
* Perform a sequential exclusive postfix reverse scan over the statically-sized \p input array, seeded with the specified \p postfix. The aggregate is returned.
* \brief Computes both inclusive and exclusive reverse scans using the specified binary scan functor across the calling warp. Because no initial value is supplied, the \p exclusive_output computed for the last <em>warp-lane</em> is undefined.
* \brief BlockReverseScan provides variants of raking-based parallel postfix scan across a CUDA thread block.
*/
template<
    typename    T,                  ///< Data type being scanned
    int         BLOCK_DIM_X,        ///< The thread block length in threads along the X dimension
    bool        MEMOIZE = false     ///< Whether or not to buffer outer raking scan partials to incur fewer shared memory reads at the expense of higher register pressure
/// Computes an exclusive thread block-wide postfix scan using the specified binary \p scan_op functor. Each thread contributes one input element. The call-back functor \p block_postfix_callback_op is invoked by the first warp in the block, and the value returned by <em>lane</em><sub>0</sub> in that warp is used as the "seed" value that logically postfixes the thread block's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs.
    T                       &exclusive_output,              ///< [out] Calling thread's output item (may be aliased to \p input)
    ScanOp                  scan_op,                        ///< [in] Binary scan operator
    BlockPostfixCallbackOp  &block_postfix_callback_op)     ///< [in-out] <b>[<em>warp</em><sub>0</sub> only]</b> Call-back functor for specifying a thread block-wide postfix to be applied to all inputs.
{
    if (WARP_SYNCHRONOUS) {
// Short-circuit directly to warp-synchronous scan
* \brief Computes an inclusive block-wide postfix scan using the specified binary \p scan_op functor. Each thread contributes an array of consecutive input elements. The call-back functor \p block_postfix_callback_op is invoked by the first warp in the block, and the value returned by <em>lane</em><sub>0</sub> in that warp is used as the "seed" value that logically postfixes the thread block's scan inputs. Also provides every thread with the block-wide \p block_aggregate of all inputs.
    T                       (&output)[ITEMS_PER_THREAD],    ///< [out] Calling thread's output items (may be aliased to \p input)
    ScanOp                  scan_op,                        ///< [in] Binary scan functor
    BlockPostfixCallbackOp  &block_postfix_callback_op)     ///< [in-out] <b>[<em>warp</em><sub>0</sub> only]</b> Call-back functor for specifying a block-wide postfix to be applied to the logical input sequence.
std::cerr << "Warning (selective_scan_bwd_kernel): attempting to set maxDynamicSharedMemorySize on an AMD GPU, which is currently a no-op (in ROCm versions <= 6.1). This might lead to undefined behavior. \n" << std::endl;
std::cerr << "Warning (selective_scan_fwd_kernel): attempting to set maxDynamicSharedMemorySize on an AMD GPU, which is currently a no-op (in ROCm versions <= 6.1). This might lead to undefined behavior. \n" << std::endl;
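// Illustrative sketch only, not the source's implementation: a plain host-side
// C++ rendering of the sequential inclusive and exclusive postfix reverse scans
// described in the comments at the top. The names reverse_scan_inclusive and
// reverse_scan_exclusive are hypothetical, and the operand order
// scan_op(running, input[i]) is an assumption; a device version would
// presumably add the usual CUDA qualifiers and loop unrolling.

```cpp
#include <cstddef>

// Hypothetical helper: inclusive postfix reverse scan over a statically-sized
// array. Walks from the last element to the first, seeded with `postfix`.
// output[i] includes input[i]. Returns the aggregate (postfix folded with all inputs).
template <std::size_t LENGTH, typename T, typename ScanOp>
T reverse_scan_inclusive(const T (&input)[LENGTH], T (&output)[LENGTH],
                         ScanOp scan_op, T postfix)
{
    T inclusive = postfix;
    for (std::size_t i = LENGTH; i-- > 0;) {
        inclusive = scan_op(inclusive, input[i]);  // assumed operand order
        output[i] = inclusive;
    }
    return inclusive;
}

// Hypothetical helper: exclusive postfix reverse scan. output[i] holds the
// running value before input[i] is folded in, so the last element receives
// the seed itself. Returns the same aggregate as the inclusive variant.
template <std::size_t LENGTH, typename T, typename ScanOp>
T reverse_scan_exclusive(const T (&input)[LENGTH], T (&output)[LENGTH],
                         ScanOp scan_op, T postfix)
{
    T exclusive = postfix;
    for (std::size_t i = LENGTH; i-- > 0;) {
        T inclusive = scan_op(exclusive, input[i]);  // assumed operand order
        output[i] = exclusive;
        exclusive = inclusive;
    }
    return exclusive;
}
```

// With input {1, 2, 3, 4}, addition as scan_op, and a postfix of 10, the
// inclusive variant writes {20, 19, 17, 14} and the exclusive variant writes
// {19, 17, 14, 10}; both return the aggregate 20.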