# Table of Contents
<span style="font-size: 12px;">[Installation Guide](安装手册)<br></span>

# Installation Guide
## 1. Pulling and Starting the Image
<span style="font-size: 12px;">
(1) docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu22.04-dtk23.10.1-py310 <br>
(2) docker run -itd --name deepspeed -p 40136:22 --privileged -v /soft:/soft -v /data:/test_data --device=/dev/kfd --device=/dev/dri --shm-size=16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu22.04-dtk23.10.1-py310 /bin/bash <br>
(3) docker exec -it deepspeed /bin/bash <br>
(4) CentOS image: image.sourcefind.cn:5000/dcu/admin/base/custom:vllm0.3.3-dtk23.10-py38<br>
</span>

## 2. Dependencies
<span style="font-size: 12px;">
   pip install wheel <br>
   -------------------------------------------------------<br>
   ubuntu:<br>
   apt-get install libaio-dev<br>
   -------------------------------------------------------<br>
   centos:<br>
   git clone https://pagure.io/libaio.git<br>
   cd libaio<br>
   make prefix=/usr install<br>
   pip install py-cpuinfo<br>
   -------------------------------------------------------<br>
   git clone --recursive https://github.com/microsoft/DeepSpeed.git<br>
</span>
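
Before building, it can help to confirm that the prerequisites listed above are visible from Python. The following is only a minimal sketch: the function name `check_build_dependencies` is made up for illustration, and it assumes that py-cpuinfo installs a module named `cpuinfo` and that libaio is discoverable through the system's shared-library lookup.

```python
import ctypes.util
import importlib.util


def check_build_dependencies():
    """Report which prerequisites from this section can be located."""
    return {
        # `wheel` is needed by `python setup.py bdist_wheel`.
        "wheel": importlib.util.find_spec("wheel") is not None,
        # py-cpuinfo is assumed to install an importable module named `cpuinfo`.
        "py-cpuinfo": importlib.util.find_spec("cpuinfo") is not None,
        # libaio is a system library; find_library returns None when it is not found.
        "libaio": ctypes.util.find_library("aio") is not None,
    }


for name, found in check_build_dependencies().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```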

## 3. Detailed Explanation of the Modifications
<span style="font-size: 12px;">
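The build command is: DS_BUILD_OPS=1 DS_BUILD_EVOFORMER_ATTN=0 DS_BUILD_CUTLASS_OPS=0 DS_BUILD_FP_QUANTIZER=0 python setup.py bdist_wheel<br>
The three ops are excluded from the build for the following reasons.<br>
DS_BUILD_EVOFORMER_ATTN and DS_BUILD_CUTLASS_OPS both build ops that depend on CUTLASS, and CUTLASS currently requires compiling PTX code. DS_BUILD_FP_QUANTIZER requires DeepSpeed-Kernels and additionally needs an overload of `__shfl_xor` that supports `__half`. DeepSpeed 0.12.3 also avoided these three ops. The figure below shows the difference between the library I built and 0.12.3. Building through the CUDA path is not recommended: many of the header files in torch.cuda are incomplete, and the required changes are no smaller than the ones described below.
</span>

![image](a.JPG)
<span style="font-size: 12px;">
The build described below removes the `__HIP_PLATFORM_AMD__` compile macro because that macro routes the code to AMD rocBLAS, which would require larger changes; we replace it with our hipBLAS instead.<br>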
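
The build command quoted at the start of this section can also be assembled from a script. This is a sketch under assumptions: `make_build_env`, `DISABLED_OPS`, and `BUILD_COMMAND` are illustrative names, and only the environment variables come from the command in this section.

```python
import os

# Ops that this section says are skipped in the build.
DISABLED_OPS = ("DS_BUILD_EVOFORMER_ATTN", "DS_BUILD_CUTLASS_OPS", "DS_BUILD_FP_QUANTIZER")

BUILD_COMMAND = ["python", "setup.py", "bdist_wheel"]


def make_build_env(base_env=None):
    """Return a copy of the environment with the DeepSpeed build flags set."""
    env = dict(os.environ if base_env is None else base_env)
    env["DS_BUILD_OPS"] = "1"   # request pre-building of the ops in general
    for flag in DISABLED_OPS:
        env[flag] = "0"         # switch off the three excluded ops
    return env


# Usage (from the root of the DeepSpeed checkout):
#   import subprocess
#   subprocess.run(BUILD_COMMAND, env=make_build_env(), check=True)
```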
The HIP code is compiled with torch's default hipcc.<br>
<u>/usr/local/lib/python3.10/site-packages/torch/utils/hipify/cuda_to_hip_mappings.py</u><br>
Add a mapping from cublasGemmAlgo_t to hipblasGemmAlgo_t:

```python
("cublasHandle_t", ("hipblasHandle_t", CONV_TYPE, API_BLAS)),
("cublasGemmAlgo_t", ("hipblasGemmAlgo_t", CONV_TYPE, API_BLAS)),    ###cvchange
("cublasOperation_t", ("hipblasOperation_t", CONV_TYPE, API_BLAS)),
```
Add a mapping from CUBLAS_GEMM_ALGO15_TENSOR_OP to HIPBLAS_GEMM_DEFAULT:

```python
("CUBLAS_GEMM_DEFAULT_TENSOR_OP", ("HIPBLAS_GEMM_DEFAULT", CONV_NUMERIC_LITERAL, API_BLAS)),
("CUBLAS_GEMM_ALGO15_TENSOR_OP", ("HIPBLAS_GEMM_DEFAULT", CONV_NUMERIC_LITERAL, API_BLAS)),     ###cvchange
("cublasCreate", ("hipblasCreate", CONV_MATH_FUNC, API_BLAS))
```
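To illustrate what such mapping entries achieve, here is a simplified, self-contained sketch of an identifier-by-identifier rewrite. It is not torch's hipify implementation: the `CUDA_TO_HIP` dictionary and the `translate` function are hypothetical, and the conversion-type and API tags of the real entries are left out.

```python
import re

# Hypothetical, trimmed-down stand-in for the mapping table shown above.
CUDA_TO_HIP = {
    "cublasHandle_t": "hipblasHandle_t",
    "cublasGemmAlgo_t": "hipblasGemmAlgo_t",
    "cublasOperation_t": "hipblasOperation_t",
    "CUBLAS_GEMM_DEFAULT_TENSOR_OP": "HIPBLAS_GEMM_DEFAULT",
    "CUBLAS_GEMM_ALGO15_TENSOR_OP": "HIPBLAS_GEMM_DEFAULT",
    "cublasCreate": "hipblasCreate",
}

# Match whole identifiers only, so names embedded in longer identifiers are left alone.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, CUDA_TO_HIP)) + r")\b")


def translate(source: str) -> str:
    """Replace every known CUDA identifier in `source` with its HIP counterpart."""
    return _PATTERN.sub(lambda match: CUDA_TO_HIP[match.group(1)], source)


print(translate("cublasGemmAlgo_t algo = CUBLAS_GEMM_ALGO15_TENSOR_OP;"))
# prints: hipblasGemmAlgo_t algo = HIPBLAS_GEMM_DEFAULT;
```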
<u>csrc/includes/cublas_wrappers.h</u><br>
mma.h could not be found, and the code that follows does not need the mma.h header anyway:

```c++
#ifndef __HIP_PLATFORM_AMD__
//#include <mma.h>  //###cvchange
#endif
```
<u>csrc/transformer/ds_transformer_cuda.cpp</u><br>
Because `__HIP_PLATFORM_AMD__` is removed, the cublasSetMathMode call has to be removed as well:

```c++
#ifndef __HIP_PLATFORM_AMD__
    //if (std::is_same<T, __half>::value) cublasSetMathMode(_cublasHandle, CUBLAS_TENSOR_OP_MATH);   //###cvchange
#endif
```
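The reason the call must be commented out can be seen with a toy model of the preprocessor guard. The sketch below is an illustration only: `active_lines` is a hypothetical helper that handles just non-nested `#ifdef` / `#ifndef` / `#else` / `#endif`, which is enough to show that the guarded call becomes active once the macro is no longer defined.

```python
def active_lines(source: str, defined: set) -> list:
    """Return the non-blank lines a C preprocessor would keep for simple,
    non-nested #ifdef / #ifndef / #else / #endif blocks."""
    kept, keep = [], True
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("#ifndef"):
            keep = stripped.split()[1] not in defined
        elif stripped.startswith("#ifdef"):
            keep = stripped.split()[1] in defined
        elif stripped.startswith("#else"):
            keep = not keep
        elif stripped.startswith("#endif"):
            keep = True
        elif keep and stripped:
            kept.append(stripped)
    return kept


GUARDED = """
#ifndef __HIP_PLATFORM_AMD__
cublasSetMathMode(_cublasHandle, CUBLAS_TENSOR_OP_MATH);
#endif
"""

# With the macro defined, the guard hides the call.
print(active_lines(GUARDED, {"__HIP_PLATFORM_AMD__"}))
# Without the macro, the call would be compiled unless it is commented out.
print(active_lines(GUARDED, set()))
```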
<u>csrc/transformer/inference/includes/inference_context.h</u><br>
Because `__HIP_PLATFORM_AMD__` is removed, the cublasSetMathMode call has to be removed here as well:

```c++
#ifndef __HIP_PLATFORM_AMD__
        //cublasSetMathMode(_cublasHandle, CUBLAS_TENSOR_OP_MATH);  //###cvchange
#endif
```
<u>csrc/transformer/inference/includes/inference_cublas_wrappers.h</u><br>
It is recommended to replace the whole file:

```c++
// Copyright (c) Microsoft Corporation.
// SPDX-License-Identifier: Apache-2.0

// DeepSpeed Team

#pragma once

#include <assert.h>
#include <cublas_v2.h>
#include <cuda.h>
#ifdef BF16_AVAILABLE
#include <cuda_bf16.h>
#endif
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#ifndef __HIP_PLATFORM_AMD__
//#include <mma.h>   //###cvchange
#endif
#include <stdio.h>

#ifdef __HIP_PLATFORM_AMD__
int cublas_gemm_ex(rocblas_handle handle,
                   rocblas_operation transa,
                   rocblas_operation transb,
                   int m,
                   int n,
                   int k,
                   const float* alpha,
                   const float* beta,
                   const float* A,
                   const float* B,
                   float* C,
                   rocblas_gemm_algo algo,
                   int b_stride = -1)
#else
//@func ###cvchange
int cublas_gemm_ex(cublasHandle_t handle,
                   cublasOperation_t transa,
                   cublasOperation_t transb,
                   int m,
                   int n,
                   int k,
                   const float* alpha,
                   const float* beta,
                   const float* A,
                   const float* B,
                   float* C,
                   hipblasGemmAlgo_t algo,
                   int b_stride = -1)
#endif
{
    const int ldb = (b_stride == -1) ? ((transb == CUBLAS_OP_N) ? k : n) : b_stride;
#ifdef __HIP_PLATFORM_AMD__
    rocblas_status status = rocblas_gemm_ex(handle,
                                            transa,
                                            transb,
                                            m,
                                            n,
                                            k,
                                            (const void*)alpha,
                                            (const void*)A,
                                            rocblas_datatype_f32_r,
                                            (transa == rocblas_operation_none) ? m : k,
                                            (const void*)B,
                                            rocblas_datatype_f32_r,
                                            ldb,
                                            (const void*)beta,
                                            C,
                                            rocblas_datatype_f32_r,
                                            m,
                                            C,
                                            rocblas_datatype_f32_r,
                                            m,
                                            rocblas_datatype_f32_r,
                                            algo,
                                            0,
                                            0);
#else
    hipblasStatus_t status = hipblasGemmEx(handle,
                                         transa,
                                         transb,
                                         m,
                                         n,
                                         k,
                                         (const void*)alpha,
                                         (const void*)A,
                                         HIPBLAS_R_32F,
                                         (transa == HIPBLAS_OP_N) ? m : k,
                                         (const void*)B,
                                         HIPBLAS_R_32F,
                                         ldb,
                                         (const void*)beta,
                                         C,
                                         HIPBLAS_R_32F,
                                         m,
                                         HIPBLAS_R_32F,
                                         algo);
#endif

#ifdef __HIP_PLATFORM_AMD__
    if (status != rocblas_status_success) {
#else
    if (status != CUBLAS_STATUS_SUCCESS) {
#endif
        fprintf(stderr,
                "!!!! kernel execution error. (m: %d, n: %d, k: %d, error: %d) \n",
                m,
                n,
                k,
                (int)status);
        return EXIT_FAILURE;
    }
    return 0;
}

template <typename T>
#ifdef __HIP_PLATFORM_AMD__
int cublas_gemm_ex(rocblas_handle handle,
                   rocblas_operation transa,
                   rocblas_operation transb,
                   int m,
                   int n,
                   int k,
                   const float* alpha,
                   const float* beta,
                   const T* A,
                   const T* B,
                   T* C,
                   rocblas_gemm_algo algo,
                   int b_stride = -1)
#else
//@func ###cvchange
int cublas_gemm_ex(cublasHandle_t handle,
                   cublasOperation_t transa,
                   cublasOperation_t transb,
                   int m,
                   int n,
                   int k,
                   const float* alpha,
                   const float* beta,
                   const T* A,
                   const T* B,
                   T* C,
                   hipblasGemmAlgo_t algo,
                   int b_stride = -1)
#endif
{
    const int ldb = (b_stride == -1) ? ((transb == CUBLAS_OP_N) ? k : n) : b_stride;
#ifdef __HIP_PLATFORM_AMD__
    constexpr auto rocblas_dtype_16 = std::is_same<T, __half>::value ? rocblas_datatype_f16_r
                                                                     : rocblas_datatype_bf16_r;
    rocblas_status status = rocblas_gemm_ex(handle,
                                            transa,
                                            transb,
                                            m,
                                            n,
                                            k,
                                            (const void*)alpha,
                                            (const void*)A,
                                            rocblas_dtype_16,
                                            (transa == rocblas_operation_none) ? m : k,
                                            (const void*)B,
                                            rocblas_dtype_16,
                                            ldb,
                                            (const void*)beta,
                                            (void*)C,
                                            rocblas_dtype_16,
                                            m,
                                            (void*)C,
                                            rocblas_dtype_16,
                                            m,
                                            rocblas_datatype_f32_r,
                                            algo,
                                            0,
                                            0);
#else
    constexpr auto cublas_dtype_16 = HIPBLAS_R_16F;
    hipblasStatus_t status = hipblasGemmEx(handle,
                                         transa,
                                         transb,
                                         m,
                                         n,
                                         k,
                                         (const void*)alpha,
                                         (const void*)A,
                                         cublas_dtype_16,
                                         (transa == HIPBLAS_OP_N) ? m : k,
                                         (const void*)B,
                                         cublas_dtype_16,
                                         ldb,
                                         (const void*)beta,
                                         (void*)C,
                                         cublas_dtype_16,
                                         m,
                                         HIPBLAS_R_32F,
                                         algo);
#endif

#ifdef __HIP_PLATFORM_AMD__
    if (status != rocblas_status_success) {
#else
    if (status != CUBLAS_STATUS_SUCCESS) {
#endif
        fprintf(stderr,
                "!!!! kernel execution error. (m: %d, n: %d, k: %d, error: %d) \n",
                m,
                n,
                k,
                (int)status);
        return EXIT_FAILURE;
    }
    return 0;
}

#ifdef __HIP_PLATFORM_AMD__
int cublas_strided_batched_gemm(rocblas_handle handle,
                                int m,
                                int n,
                                int k,
                                const float* alpha,
                                const float* beta,
                                const float* A,
                                const float* B,
                                float* C,
                                rocblas_operation op_A,
                                rocblas_operation op_B,
                                int stride_A,
                                int stride_B,
                                int stride_C,
                                int batch,
                                rocblas_gemm_algo algo)
#else
//@func ###cvchange
int cublas_strided_batched_gemm(cublasHandle_t handle,
                                int m,
                                int n,
                                int k,
                                const float* alpha,
                                const float* beta,
                                const float* A,
                                const float* B,
                                float* C,
                                cublasOperation_t op_A,
                                cublasOperation_t op_B,
                                int stride_A,
                                int stride_B,
                                int stride_C,
                                int batch,
                                hipblasGemmAlgo_t algo)
#endif
{
#ifdef __HIP_PLATFORM_AMD__
    rocblas_status status =
        rocblas_gemm_strided_batched_ex(handle,
                                        op_A,
                                        op_B,
                                        m,
                                        n,
                                        k,
                                        alpha,
                                        A,
                                        rocblas_datatype_f32_r,
                                        (op_A == rocblas_operation_none) ? m : k,
                                        stride_A,
                                        B,
                                        rocblas_datatype_f32_r,
                                        (op_B == rocblas_operation_none) ? k : n,
                                        stride_B,
                                        beta,
                                        C,
                                        rocblas_datatype_f32_r,
                                        m,
                                        stride_C,
                                        C,
                                        rocblas_datatype_f32_r,
                                        m,
                                        stride_C,
                                        batch,
                                        rocblas_datatype_f32_r,
                                        algo,
                                        0,
                                        0);
#else
    cublasStatus_t status = cublasGemmStridedBatchedEx(handle,
                                                       op_A,
                                                       op_B,
                                                       m,
                                                       n,
                                                       k,
                                                       alpha,
                                                       A,
                                                       HIPBLAS_R_32F,
                                                       (op_A == CUBLAS_OP_N) ? m : k,
                                                       stride_A,
                                                       B,
                                                       HIPBLAS_R_32F,
                                                       (op_B == CUBLAS_OP_N) ? k : n,
                                                       stride_B,
                                                       beta,
                                                       C,
                                                       HIPBLAS_R_32F,
                                                       m,
                                                       stride_C,
                                                       batch,
                                                       HIPBLAS_R_32F,
                                                       algo);
#endif

#ifdef __HIP_PLATFORM_AMD__
    if (status != rocblas_status_success) {
#else
    if (status != CUBLAS_STATUS_SUCCESS) {
#endif
        fprintf(stderr,
                "!!!! kernel execution error. (batch: %d, m: %d, n: %d, k: %d, error: %d) \n",
                batch,
                m,
                n,
                k,
                (int)status);
        return EXIT_FAILURE;
    }
    return 0;
}

template <typename T>
#ifdef __HIP_PLATFORM_AMD__
int cublas_strided_batched_gemm(rocblas_handle handle,
                                int m,
                                int n,
                                int k,
                                const float* alpha,
                                const float* beta,
                                const T* A,
                                const T* B,
                                T* C,
                                rocblas_operation op_A,
                                rocblas_operation op_B,
                                int stride_A,
                                int stride_B,
                                int stride_C,
                                int batch,
                                rocblas_gemm_algo algo)
#else
//@func ###cvchange
int cublas_strided_batched_gemm(cublasHandle_t handle,
                                int m,
                                int n,
                                int k,
                                const float* alpha,
                                const float* beta,
                                const T* A,
                                const T* B,
                                T* C,
                                cublasOperation_t op_A,
                                cublasOperation_t op_B,
                                int stride_A,
                                int stride_B,
                                int stride_C,
                                int batch,
                                hipblasGemmAlgo_t algo)
#endif
{
#ifdef __HIP_PLATFORM_AMD__
    constexpr auto rocblas_dtype_16 = std::is_same<T, __half>::value ? rocblas_datatype_f16_r
                                                                     : rocblas_datatype_bf16_r;
    rocblas_status status =
        rocblas_gemm_strided_batched_ex(handle,
                                        op_A,
                                        op_B,
                                        m,
                                        n,
                                        k,
                                        alpha,
                                        A,
                                        rocblas_dtype_16,
                                        (op_A == rocblas_operation_none) ? m : k,
                                        stride_A,
                                        B,
                                        rocblas_dtype_16,
                                        (op_B == rocblas_operation_none) ? k : n,
                                        stride_B,
                                        beta,
                                        C,
                                        rocblas_dtype_16,
                                        m,
                                        stride_C,
                                        C,
                                        rocblas_dtype_16,
                                        m,
                                        stride_C,
                                        batch,
                                        rocblas_datatype_f32_r,
                                        algo,
                                        0,
                                        0);
#else
    constexpr auto cublas_dtype_16 = HIPBLAS_R_16F;
    hipblasStatus_t status = hipblasGemmStridedBatchedEx(handle,
                                                       op_A,
                                                       op_B,
                                                       m,
                                                       n,
                                                       k,
                                                       alpha,
                                                       A,
                                                       cublas_dtype_16,
                                                       (op_A == HIPBLAS_OP_N) ? m : k,
                                                       stride_A,
                                                       B,
                                                       cublas_dtype_16,
                                                       (op_B == HIPBLAS_OP_N) ? k : n,
                                                       stride_B,
                                                       beta,
                                                       C,
                                                       cublas_dtype_16,
                                                       m,
                                                       stride_C,
                                                       batch,
                                                       HIPBLAS_R_32F,
                                                       algo);
#endif

#ifdef __HIP_PLATFORM_AMD__
    if (status != rocblas_status_success) {
#else
    if (status != CUBLAS_STATUS_SUCCESS) {
#endif
        fprintf(stderr,
                "!!!! kernel execution error. (m: %d, n: %d, k: %d, error: %d) \n",
                m,
                n,
                k,
                (int)status);
        return EXIT_FAILURE;
    }

    return 0;
}
```
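The leading dimensions that `cublas_gemm_ex` passes to the GEMM call follow a small rule that is easy to miss inside the long argument lists. The sketch below restates that rule in Python; `leading_dims` is an illustrative name, and the booleans stand in for the "operation is a transpose" checks in the C++ code.

```python
def leading_dims(transa: bool, transb: bool, m: int, n: int, k: int, b_stride: int = -1):
    """Leading dimensions (lda, ldb, ldc) as computed in cublas_gemm_ex above.

    transa / transb are True when the corresponding operand is transposed.
    """
    lda = k if transa else m
    # An explicit b_stride overrides the default leading dimension of B.
    ldb = b_stride if b_stride != -1 else (n if transb else k)
    ldc = m
    return lda, ldb, ldc


print(leading_dims(False, False, m=4, n=3, k=2))  # (4, 2, 4)
print(leading_dims(True, True, m=4, n=3, k=2))    # (2, 3, 4)
```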
<u>deepspeed/inference/v2/kernels/core_ops/blas_kernels/blas_utils.h</u><br>
It is recommended to replace the whole file:

```c++
// Copyright (c) Microsoft Corporation.
// SPDX-License-Identifier: Apache-2.0

// DeepSpeed Team

#pragma once

#include <assert.h>
#include <cublas_v2.h>
#include <cuda.h>
#ifdef BF16_AVAILABLE
#include <cuda_bf16.h>
#endif
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#ifndef __HIP_PLATFORM_AMD__
//#include <mma.h>   //###cvchange
#endif
#include <stdio.h>
#include <iostream>
#include <stdexcept>

class BlasContext {
    /*
    Slim wrapper for managing the lifetime of the platform's BLAS handle. This should
    be hipified for ROCm.
    */
public:
    BlasContext()
    {
        if (cublasCreate(&_handle) != CUBLAS_STATUS_SUCCESS) {
            auto message = std::string("Fail to create cublas handle.");
            std::cerr << message << std::endl;
            throw std::runtime_error(message);
        }
#ifndef __HIP_PLATFORM_AMD__
        //cublasSetMathMode(_handle, CUBLAS_TENSOR_OP_MATH); //###cvchange
#endif
    }

    virtual ~BlasContext() { cublasDestroy(_handle); }

    static BlasContext& getInstance()
    {
        // Should always access the singleton through this function.
        static BlasContext _instance;
        return _instance;
    }

    cublasHandle_t get_handle() const { return _handle; }

private:
    cublasHandle_t _handle;
};

enum class BlasType { FP32, FP16, BF16 };

#ifdef __HIP_PLATFORM_AMD__
rocblas_operation get_trans_op(bool do_trans)
{
    return (do_trans) ? rocblas_operation_transpose : rocblas_operation_none;
}

rocblas_datatype get_datatype(BlasType type)
{
    switch (type) {
        case BlasType::FP32: return rocblas_datatype_f32_r;
        case BlasType::FP16: return rocblas_datatype_f16_r;
        case BlasType::BF16: return rocblas_datatype_bf16_r;
        default: throw std::runtime_error("Unsupported BlasType");
    }
}
#else
cublasOperation_t get_trans_op(bool do_trans) { return (do_trans) ? CUBLAS_OP_T : CUBLAS_OP_N; }
//@func ###cvchange
cublasDataType_t get_datatype(BlasType type)
{
    switch (type) {
        case BlasType::FP32: return HIPBLAS_R_32F;
        case BlasType::FP16: return HIPBLAS_R_16F;
        //case BlasType::BF16: return CUDA_R_16BF;
        default: throw std::runtime_error("Unsupported BlasType");
    }
}
#endif
//@func ###cvchange
int blas_gemm_ex(void* C,
                 const void* A,
                 const void* B,
                 int m,
                 int n,
                 int k,
                 int lda,
                 int ldb,
                 int ldc,
                 bool transa,
                 bool transb,
                 const float* alpha,
                 const float* beta,
                 BlasType type)
{
#ifdef __HIP_PLATFORM_AMD__
    rocblas_operation_t transa_op = get_trans_op(transa);
    rocblas_operation_t transb_op = get_trans_op(transb);

    rocblas_datatype_t abc_type = get_datatype(type);

    rocblas_status status = rocblas_gemm_ex(BlasContext::getInstance().get_handle(),
                                            transa_op,
                                            transb_op,
                                            m,
                                            n,
                                            k,
                                            (const void*)alpha,
                                            A,
                                            abc_type,
                                            lda,
                                            B,
                                            abc_type,
                                            ldb,
                                            (const void*)beta,
                                            C,
                                            abc_type,
                                            ldc,
                                            C,
                                            abc_type,
                                            ldc,
                                            rocblas_datatype_f32_r,
                                            rocblas_gemm_algo_standard,
                                            0,
                                            0);
#else
    hipblasOperation_t transa_op = get_trans_op(transa);
    hipblasOperation_t transb_op = get_trans_op(transb);

    hipblasDatatype_t abc_type = get_datatype(type);
    hipblasStatus_t status = hipblasGemmEx(BlasContext::getInstance().get_handle(),
                                         transa_op,
                                         transb_op,
                                         m,
                                         n,
                                         k,
                                         (const void*)alpha,
                                         A,
                                         abc_type,
                                         lda,
                                         B,
                                         abc_type,
                                         ldb,
                                         (const void*)beta,
                                         C,
                                         abc_type,
                                         ldc,
                                         HIPBLAS_R_32F,
                                         HIPBLAS_GEMM_DEFAULT);
#endif

#ifdef __HIP_PLATFORM_AMD__
    if (status != rocblas_status_success) {
#else
    if (status != CUBLAS_STATUS_SUCCESS) {
#endif
        fprintf(stderr,
                "!!!! kernel execution error. (m: %d, n: %d, k: %d, error: %d) \n",
                m,
                n,
                k,
                (int)status);
        return EXIT_FAILURE;
    }
    return 0;
}
//@func ###cvchange
int blas_strided_batched_gemm(void* C,
                              const void* A,
                              const void* B,
                              int m,
                              int n,
                              int k,
                              int lda,
                              int ldb,
                              int ldc,
                              bool transa,
                              bool transb,
                              const float* alpha,
                              const float* beta,
                              int stride_A,
                              int stride_B,
                              int stride_C,
                              int batch,
                              BlasType type)
{
#ifdef __HIP_PLATFORM_AMD__
    rocblas_operation_t transa_op = get_trans_op(transa);
    rocblas_operation_t transb_op = get_trans_op(transb);

    rocblas_datatype_t abc_type = get_datatype(type);

    rocblas_status status =
        rocblas_gemm_strided_batched_ex(BlasContext::getInstance().get_handle(),
                                        transa_op,
                                        transb_op,
                                        m,
                                        n,
                                        k,
                                        (const void*)alpha,
                                        A,
                                        abc_type,
                                        lda,
                                        stride_A,
                                        B,
                                        abc_type,
                                        ldb,
                                        stride_B,
                                        (const void*)beta,
                                        C,
                                        abc_type,
                                        ldc,
                                        stride_C,
                                        C,
                                        abc_type,
                                        ldc,
                                        stride_C,
                                        batch,
                                        rocblas_datatype_f32_r,
                                        rocblas_gemm_algo_standard,
                                        0,
                                        0);
#else
    hipblasOperation_t transa_op = get_trans_op(transa);
    hipblasOperation_t transb_op = get_trans_op(transb);

    hipblasDatatype_t abc_type = get_datatype(type);

    hipblasStatus_t status = hipblasGemmStridedBatchedEx(BlasContext::getInstance().get_handle(),
                                                       transa_op,
                                                       transb_op,
                                                       m,
                                                       n,
                                                       k,
                                                       (const void*)alpha,
                                                       A,
                                                       abc_type,
                                                       lda,
                                                       stride_A,
                                                       B,
                                                       abc_type,
                                                       ldb,
                                                       stride_B,
                                                       (const void*)beta,
                                                       C,
                                                       abc_type,
                                                       ldc,
                                                       stride_C,
                                                       batch,
                                                       HIPBLAS_R_32F,
                                                       HIPBLAS_GEMM_DEFAULT);
#endif

#ifdef __HIP_PLATFORM_AMD__
    if (status != rocblas_status_success) {
#else
    if (status != CUBLAS_STATUS_SUCCESS) {
#endif
        fprintf(stderr,
                "!!!! kernel execution error. (batch: %d, m: %d, n: %d, k: %d, error: %d) \n",
                batch,
                m,
                n,
                k,
                (int)status);
        return EXIT_FAILURE;
    }
    return 0;
}
```
<u>deepspeed/inference/v2/kernels/core_ops/gated_activations/gated_activation_kernels_cuda.cu</u><br>
The extra `DS_D_INLINE` qualifier is required here; without it the build fails with `error: no matching function for call to 'gated_act_fn'`.

```c++
template <ActivationType ActType>
DS_D_INLINE float gated_act_fn(float x, float y); //###cvchange

template <>
DS_D_INLINE float gated_act_fn<ActivationType::GEGLU>(float x, float y)
{
    constexpr float sqrt_param = 0.79788456080286535587989211986876f;
    constexpr float mul_param = 0.044715;
    return y * x * 0.5f * (1.0f + tanhf(sqrt_param * (x + mul_param * x * x * x)));
}
```
<u>deepspeed/ops/op_builder/builder.py</u><br>
Enabling `__HIP_PLATFORM_AMD__` is not recommended: it requires the rocm_blas library, which is outdated.

```python
#if self.is_rocm_pytorch():
#    cxx_args.append("-D__HIP_PLATFORM_AMD__=1")  ###cvchange

op_module = load(name=self.name,
                    sources=self.strip_empty_entries(sources),
```
```python
#if self.is_rocm_pytorch():    ###cvchange
#    compile_args['cxx'].append("-D__HIP_PLATFORM_AMD__=1")   ###cvchange

cuda_ext = ExtensionBuilder(name=self.absolute_name(),
                            sources=self.strip_empty_entries(self.sources()),
                            include_dirs=include_dirs,
```
<u>deepspeed/ops/op_builder/evoformer_attn.py</u><br>
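Hard-code the compatibility check (to `False` here, because this op is excluded from the build) so that setup.py can proceed.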

```python
args.append(f"-DGPU_ARCH={major}{minor}")
    return args

def is_compatible(self, verbose=True):
    return False ###cvchange
    try:
        import torch
```
<u>deepspeed/ops/op_builder/fp_quantizer.py</u><br>
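Hard-code the compatibility check (to `False` here, because this op is excluded from the build) so that setup.py can proceed.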

```python
args.append(f"-DGPU_ARCH={major}{minor}")
    return args

def is_compatible(self, verbose=True):
    return False ###cvchange
    try:
        import torch
```
<u>deepspeed/ops/op_builder/inference_core_ops.py</u><br>
Force the compatibility check to pass so that setup.py can proceed.

```python
def absolute_name(self):
        return f'deepspeed.inference.v2.kernels{self.NAME}'

def is_compatible(self, verbose=True):
    return True ###cvchange
    try:
        import torch
    except ImportError:
```
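`linear_kernels_cuda.cu` uses PTX assembly code (`ptx_cp.async.cuh`, `ptx_mma.cuh`), and no corresponding function was found in kernelsinference_core_ops.so from DeepSpeed 0.12.3 either, so the file is commented out of the source list.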

```python
    "inference/v2/kernels/core_ops/gated_activations/gated_activation_kernels_cuda.cu",
    "inference/v2/kernels/core_ops/cuda_linear/linear_kernels.cpp",
    #"inference/v2/kernels/core_ops/cuda_linear/linear_kernels_cuda.cu",  ###cvchange
]

prefix = self.get_prefix()
sources = [os.path.join(prefix, src) for src in sources]
return sources
```
Rename the following .cpp files to their `_cpu` variants; otherwise they will not be compiled.<br>
"inference/v2/kernels/core_ops/bias_activations/bias_activation.cpp"->"inference/v2/kernels/core_ops/bias_activations/bias_activation_cpu.cpp"<br>
"inference/v2/kernels/core_ops/gated_activations/gated_activation_kernels.cpp"->
"inference/v2/kernels/core_ops/gated_activations/gated_activation_kernels_cpu.cpp"<br>
Then update the contents of `sources` to match:
```python
def sources(self):
    sources = [
        "inference/v2/kernels/core_ops/core_ops.cpp",
        "inference/v2/kernels/core_ops/bias_activations/bias_activation_cpu.cpp",     #file###cvchange
        "inference/v2/kernels/core_ops/bias_activations/bias_activation_cuda.cu",
        "inference/v2/kernels/core_ops/cuda_layer_norm/layer_norm.cpp",
        "inference/v2/kernels/core_ops/cuda_layer_norm/layer_norm_cuda.cu",
        "inference/v2/kernels/core_ops/cuda_rms_norm/rms_norm.cpp",
        "inference/v2/kernels/core_ops/cuda_rms_norm/rms_norm_cuda.cu",
        "inference/v2/kernels/core_ops/gated_activations/gated_activation_kernels_cpu.cpp",  #file###cvchange
        "inference/v2/kernels/core_ops/gated_activations/gated_activation_kernels_cuda.cu",   
        "inference/v2/kernels/core_ops/cuda_linear/linear_kernels.cpp",
        #"inference/v2/kernels/core_ops/cuda_linear/linear_kernels_cuda.cu",
    ]
```
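The renames above can be scripted. The following is a minimal sketch, assuming the relative paths resolve against a `prefix` directory of your checkout; `cpu_variant` and `rename_to_cpu` are hypothetical helper names and are not part of DeepSpeed.

```python
import os


def cpu_variant(rel_path):
    """Return the path with `_cpu` inserted before the extension."""
    stem, ext = os.path.splitext(rel_path)
    return f"{stem}_cpu{ext}"


def rename_to_cpu(prefix, rel_paths, dry_run=True):
    """Build (and optionally apply) the rename plan for the given files.

    `prefix` is assumed to be the directory the relative paths resolve
    against. With dry_run=True nothing on disk is touched.
    """
    plan = []
    for rel in rel_paths:
        src = os.path.join(prefix, rel)
        dst = os.path.join(prefix, cpu_variant(rel))
        plan.append((src, dst))
        if not dry_run and os.path.exists(src):
            os.rename(src, dst)
    return plan
```

Running it first with `dry_run=True` and printing the returned plan is a cheap way to confirm the paths before anything is renamed.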
<u>deepspeed/ops/op_builder/inference_cutlass_builder.py</u><br>
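Hard-code the compatibility check (to `False` here, because this op is excluded from the build) so that setup.py can proceed.<br>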

```python
def absolute_name(self):
    return f'deepspeed.inference.v2.kernels.cutlass_ops.{self.NAME}'

def is_compatible(self, verbose=True):
    return False ###cvchange
    try:
        import torch
```
This feature depends on dskernels (DeepSpeed-Kernels), which in turn requires cutlass and is not supported here, so the reference to dskernels must be disabled.
```python
def extra_ldflags(self):
    return []        ###cvchange
    import dskernels
    lib_path = dskernels.library_path()
    prefix = self.get_prefix()
```
<u>deepspeed/ops/op_builder/ragged_ops.py</u><br>
Force the compatibility check to pass so that setup.py can proceed.<br>

```python
def absolute_name(self):
    return f'deepspeed.inference.v2.kernels.ragged_ops.{self.NAME}'

def is_compatible(self, verbose=True):
    return True ###cvchange
    try:
        import torch
```
This feature depends on dskernels (DeepSpeed-Kernels), which in turn requires cutlass and is not supported here, so the reference to dskernels must be disabled.
```python
prefix = self.get_prefix()
    sources = [os.path.join(prefix, src) for src in sources]
    return sources

def extra_ldflags(self):
    return []    ###cvchange
    import dskernels
    lib_path = dskernels.library_path()
```
Rename the following .cpp files to their `_cpu` variants; otherwise they will not be compiled.<br>
"inference/v2/kernels/ragged_ops/embed/embed.cpp"->"inference/v2/kernels/ragged_ops/embed/embed_cpu.cpp"<br>
"inference/v2/kernels/ragged_ops/linear_blocked_kv_rotary/blocked_kv_rotary.cpp"->"inference/v2/kernels/ragged_ops/linear_blocked_kv_rotary/blocked_kv_rotary_cpu.cpp"<br>
"inference/v2/kernels/ragged_ops/logits_gather/logits_gather.cpp"->"inference/v2/kernels/ragged_ops/logits_gather/logits_gather_cpu.cpp"<br>
"inference/v2/kernels/ragged_ops/moe_scatter/moe_scatter.cpp"->"inference/v2/kernels/ragged_ops/moe_scatter/moe_scatter_cpu.cpp"<br>
"inference/v2/kernels/ragged_ops/moe_gather/moe_gather.cpp"->"inference/v2/kernels/ragged_ops/moe_gather/moe_gather_cpu.cpp"<br>
"inference/v2/kernels/ragged_ops/top_k_gating/top_k_gating.cpp"->"inference/v2/kernels/ragged_ops/top_k_gating/top_k_gating_cpu.cpp"<br>
Change `sources` to the following:
```python
def sources(self):
    sources = [
        "inference/v2/kernels/ragged_ops/ragged_ops.cpp",
        "inference/v2/kernels/ragged_ops/atom_builder/atom_builder.cpp",
        "inference/v2/kernels/ragged_ops/blocked_flash/blocked_flash.cpp",
        "inference/v2/kernels/ragged_ops/embed/embed_cpu.cpp",     #file###cvchange
        "inference/v2/kernels/ragged_ops/embed/embed_cuda.cu",
        "inference/v2/kernels/ragged_ops/linear_blocked_kv_rotary/blocked_kv_rotary_cpu.cpp",    #file###cvchange
        "inference/v2/kernels/ragged_ops/linear_blocked_kv_rotary/blocked_kv_rotary_cuda.cu",
        "inference/v2/kernels/ragged_ops/logits_gather/logits_gather_cpu.cpp",   #file###cvchange
        "inference/v2/kernels/ragged_ops/logits_gather/logits_gather_cuda.cu",
        "inference/v2/kernels/ragged_ops/moe_scatter/moe_scatter_cpu.cpp",     #file###cvchange
        "inference/v2/kernels/ragged_ops/moe_scatter/moe_scatter_cuda.cu",
        "inference/v2/kernels/ragged_ops/moe_gather/moe_gather_cpu.cpp",       #file###cvchange
        "inference/v2/kernels/ragged_ops/moe_gather/moe_gather_cuda.cu",
        "inference/v2/kernels/ragged_ops/ragged_helpers/ragged_kernel_helpers.cpp",
        "inference/v2/kernels/ragged_ops/top_k_gating/top_k_gating_cpu.cpp",     #file###cvchange
        "inference/v2/kernels/ragged_ops/top_k_gating/top_k_gating_cuda.cu",
    ]
```
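After renaming the files and editing the list, it can help to confirm that every entry in `sources` still points at a real file. The following is a small sketch of such a check; `missing_sources` is a hypothetical helper, and `prefix` is assumed to be the directory the relative paths are joined with.

```python
import os


def missing_sources(prefix, sources):
    """Return the entries of `sources` that do not exist under `prefix`."""
    return [src for src in sources
            if not os.path.isfile(os.path.join(prefix, src))]
```

An empty result means the renamed `_cpu.cpp` files and the list agree; any path it returns was either not renamed or is misspelled in the list.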
<u>deepspeed/ops/op_builder/ragged_utils.py</u><br>
Force the compatibility check to pass so that setup.py can proceed.<br>

```python
def absolute_name(self):
    return f'deepspeed.inference.v2.{self.NAME}'

def is_compatible(self, verbose=True):
    return True ###cvchange
    try:
        import torch
    except ImportError:
```
<u>deepspeed/ops/op_builder/sparse_attn.py</u><br>
Force the compatibility check to pass so that setup.py can proceed.<br>

```python
def cxx_args(self):
    return ['-O2', '-fopenmp']

def is_compatible(self, verbose=True):
    return True ###cvchange
    # Check to see if llvm and cmake are installed since they are dependencies
    #required_commands = ['llvm-config|llvm-config-9', 'cmake']
    #command_status = list(map(self.command_exists, required_commands))
    #deps_compatible = all(command_status)

    if self.is_rocm_pytorch():
```
<u>deepspeed/ops/op_builder/spatial_inference.py</u><br>
Force the compatibility check to pass so that setup.py can proceed.<br>

```python
def absolute_name(self):
    return f'deepspeed.ops.spatial.{self.NAME}_op'

def is_compatible(self, verbose=True):
    return True ###cvchange
    try:
        import torch
    except ImportError:
        self.warning("Please install torch if trying to pre-compile inference kernels")
        return False
```
<u>deepspeed/ops/op_builder/transformer_inference.py</u><br>
Force the compatibility check to pass so that setup.py can proceed.<br>

```python
def absolute_name(self):
    return f'deepspeed.ops.transformer.inference.{self.NAME}_op'

def is_compatible(self, verbose=True):
    return True ###cvchange
    try:
        import torch
    except ImportError:
```
**When debugging in VS Code, it is recommended to define every `def is_compatible(self, verbose=True):` check as shown below**
```python
def is_compatible(self, verbose=True):
    return True
```
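The effect of the early `return` can be illustrated with a simplified model. The classes below are hypothetical stand-ins and not DeepSpeed's real builder API. They only show that a hard-coded result skips the original environment probing and so decides whether an op is selected for the build.

```python
class OpBuilderSketch:
    """Hypothetical stand-in for an op builder (not DeepSpeed's class)."""

    def __init__(self, name, forced=None):
        self.name = name
        self.forced = forced  # True/False mimics the hard-coded early return

    def probe_environment(self):
        # Placeholder for the original checks (import torch, GPU arch, ...).
        # Returns False here to model an environment the checks reject.
        return False

    def is_compatible(self, verbose=True):
        if self.forced is not None:
            return self.forced  # the probing below is never reached
        return self.probe_environment()


def select_ops(builders):
    """Keep only the ops whose builder reports itself compatible."""
    return [b.name for b in builders if b.is_compatible()]
```

For example, `select_ops([OpBuilderSketch("ragged_ops", forced=True), OpBuilderSketch("evoformer_attn", forced=False)])` keeps only `ragged_ops`.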
**For a debug build, replace every `-O3` in the folder with `-O0` and add `-fno-limit-debug-info`**<br>
'-O3'->'-O0'<br>
deepspeed/ops/op_builder/fused_adam.py: in nvcc_args, change '-O0' to '-O0','-fno-limit-debug-info'<br>
deepspeed/ops/op_builder/fused_lamb.py: in nvcc_args, change '-O0' to '-O0','-fno-limit-debug-info'<br>
deepspeed/ops/op_builder/fused_lion.py: in nvcc_args, change '-O0' to '-O0','-fno-limit-debug-info'<br>
/data/DeepSpeed/deepspeed/ops/op_builder/builder.py: in the nvcc section, change '-O0' to '-O0','-fno-limit-debug-info'<br>
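The flag substitution can also be scripted rather than done by hand. The following is a rough sketch with hypothetical helper names. It assumes the flags appear as quoted string literals in the op_builder `.py` files. Note that the steps above add `-fno-limit-debug-info` only to the nvcc argument lists, so `add_debug_info` defaults to off here and the result should be reviewed before building.

```python
import re
from pathlib import Path


def debugify_flags(text, add_debug_info=False):
    """Rewrite quoted '-O3' flags to '-O0' in a source string.

    With add_debug_info=True, '-fno-limit-debug-info' is appended after
    every rewritten flag (the manual steps do this only for nvcc args).
    """
    replacement = r"\1-O0\1"
    if add_debug_info:
        replacement += r", \1-fno-limit-debug-info\1"
    return re.sub(r"(['\"])-O3\1", replacement, text)


def debugify_tree(root, add_debug_info=False):
    """Apply debugify_flags to every .py file under root; return changed paths."""
    changed = []
    for path in sorted(Path(root).rglob("*.py")):
        old = path.read_text(encoding="utf-8")
        new = debugify_flags(old, add_debug_info)
        if new != old:
            path.write_text(new, encoding="utf-8")
            changed.append(str(path))
    return changed
```

Calling `debugify_tree("deepspeed/ops/op_builder")` on a clean checkout and inspecting the diff afterwards keeps the change easy to revert.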
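The CUDA path is compiled with CUDA and nvcc.<br>
**Using CUDA is not recommended: I did not get it to work, and the changes required are very large.**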
</span>

## 4. Package build command
<span style="font-size: 12px;">
DS_BUILD_OPS=1 DS_BUILD_EVOFORMER_ATTN=0 DS_BUILD_CUTLASS_OPS=0 DS_BUILD_FP_QUANTIZER=0 python setup.py bdist_wheel<br>
If the build stops with `AssertionError: Unable to pre-compile async_io`, locate libaio.so and add its directory to LD_LIBRARY_PATH.
</span>
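Locating libaio.so and building the new LD_LIBRARY_PATH value can be sketched as follows. The helper names and the default search roots are assumptions for illustration; adjust the roots to wherever libaio was installed on your system.

```python
import os


def find_library_dirs(name="libaio.so",
                      roots=("/usr/lib", "/usr/lib64", "/usr/local/lib",
                             "/lib", "/lib64")):
    """Return directories under `roots` containing a file starting with `name`."""
    found = []
    for root in roots:
        # os.walk yields nothing for a root that does not exist.
        for dirpath, _dirnames, filenames in os.walk(root):
            if any(f.startswith(name) for f in filenames) and dirpath not in found:
                found.append(dirpath)
    return found


def with_library_dirs(dirs, current=""):
    """Prepend `dirs` to a colon-separated path string, dropping duplicates."""
    kept = [p for p in current.split(":") if p and p not in dirs]
    return ":".join(list(dirs) + kept)
```

For example, `with_library_dirs(find_library_dirs(), os.environ.get("LD_LIBRARY_PATH", ""))` gives a value that can be exported in the shell before rerunning the build command.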
