[PyTorch][MoE] Reduce CPU Overhead By Fuse Torch Empty Calls (#1793)
* finish python ref impl for bulk alloc

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* c++ bulk alloc worked, still draft version

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* clean up

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* resolve rebase conflict

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add license

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* use shared_ptr to auto manage reference count

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* attempt to fix misc training error

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* attempt to handle case where experts get zero token

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* updated with fused C++ function calls

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* clean up

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* experiment with reducing py object construction time

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* fix seg fault bug in inference mode

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* fix lint

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* fuse torch split into bulk alloc

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* clean up

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* rebase to latest main

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* fix unit test failure

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* fix lint error

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* refactor create_tensor to use get_scale_shape

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* refactor quantize to call quantize_cpp

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* Implement separate functions for multi-tensor quantize and split + multi-tensor quantize

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Update grouped linear module with fused split+quantize func

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Move multi-tensor quantize func to cast.cpp

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Do not expose quantizer helper function externally

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix linter warnings

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Revert cuDNN frontend commit

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* fix corner cases with zero tokens

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* add comments

Signed-off-by: zhongboz <zhongboz@nvidia.com>

---------

Signed-off-by: zhongboz <zhongboz@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>