Unverified commit 079f29bc authored by Chen Xin, committed by GitHub

auto upload cuda12.1 python pkg to release when create new tag (#784)

* add cuda12-whl-release ci

* enable environment

* test py310-311 windows wheel

* fix py310, py311 setup.py error on windows

* fix lint
parent 7990d252
name: cuda12.1-whl-release

on:
  push:
    tags:
      - '*'
  workflow_dispatch:

permissions:
  contents: write

jobs:
  linux-build:
    strategy:
      matrix:
        pyver: [py38, py39, py310, py311]
    runs-on: ubuntu-latest
    env:
      PYTHON_VERSION: ${{ matrix.pyver }}
      PLAT_NAME: manylinux2014_x86_64
      DOCKER_TAG: cuda12.1
      OUTPUT_FOLDER: cuda12.1_dist
      CUDA_VER: 12.1
    steps:
      - name: Free disk space
        uses: jlumbroso/free-disk-space@main
        with:
          # This might remove tools that are actually needed, if set to "true", but frees about 6 GB
          tool-cache: false
          docker-images: false
          # All of these default to true, but feel free to set to "false" if necessary for your workflow
          android: true
          dotnet: true
          haskell: true
          large-packages: true
          swap-storage: false
      - name: Checkout repository
        uses: actions/checkout@v3
      - name: Build
        run: |
          echo ${PYTHON_VERSION}
          echo ${PLAT_NAME}
          echo ${DOCKER_TAG}
          echo ${OUTPUT_FOLDER}
          # remove -it so docker run works in the non-interactive CI shell
          sed -i 's/docker run --rm -it/docker run --rm/g' builder/manywheel/build_wheel.sh
          bash builder/manywheel/build_wheel.sh ${PYTHON_VERSION} ${PLAT_NAME} ${DOCKER_TAG} ${OUTPUT_FOLDER}
      - name: Upload Artifacts
        uses: actions/upload-artifact@v3
        with:
          if-no-files-found: error
          path: builder/manywheel/${{ env.OUTPUT_FOLDER }}/*
          retention-days: 1
  windows-build:
    strategy:
      matrix:
        pyver: ['3.8', '3.9', '3.10', '3.11']
    runs-on: windows-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
      - name: Set up python
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.pyver }}
      - name: Install python packages
        run: |
          pip install pybind11 wheel
      - uses: Jimver/cuda-toolkit@v0.2.11
        id: cuda-toolkit
        with:
          cuda: '12.1.0'
          use-github-cache: false
      - name: Build wheel
        run: |
          mkdir build
          cd build
          pip install -U setuptools
          ..\builder\windows\generate.ps1
          cmake --build . --config Release -- /m > build.log.txt
          cmake --install . --config Release
          cd ..
          rm build -Force -Recurse
          python setup.py bdist_wheel -d build/wheel
      - name: Upload Artifacts
        uses: actions/upload-artifact@v3
        with:
          if-no-files-found: error
          path: build/wheel/*
          retention-days: 1
  publish:
    runs-on: ubuntu-latest
    environment: 'prod'
    needs:
      - linux-build
      - windows-build
    steps:
      - name: Download artifacts
        uses: actions/download-artifact@v3
      - name: Display artifacts
        run: ls artifact/ -lh
      - name: Publish
        uses: softprops/action-gh-release@v1
        if: startsWith(github.ref, 'refs/tags/')
        with:
          files: artifact/*
@@ -75,6 +75,8 @@ jobs:
       run: |
         mkdir build
         cd build
+        # https://github.com/pypa/setuptools/issues/1631
+        pip install -U setuptools
         ..\builder\windows\generate.ps1
         cmake --build . --config Release -- /m > build.log.txt
         cmake --install . --config Release
@@ -4,8 +4,10 @@ set -eou pipefail

 TOPDIR=$(git rev-parse --show-toplevel)/builder
+CUDA_VER=${CUDA_VER:-11.8}
 PLAT_NAME=manylinux2014_x86_64

-for cuver in 11.8; do
+for cuver in ${CUDA_VER}; do
   DOCKER_TAG=cuda${cuver}
   OUTPUT_FOLDER=cuda${cuver}_dist
   for pyver in py38 py39 py310 py311; do
@@ -35,7 +35,7 @@ You may recognize this feature as "continuous batching" in other repos. But duri

 ## KV Cache Manager

-The [KV cache manager](https://github.com/InternLM/lmdeploy/blob/main/src/turbomind/models/llama/LlamaCacheManager.h) of TurboMind is a memory-pool-like object that also implements an LRU policy, so it can be viewed as a form of __cache of KV caches__. It works in the following way:
+The [KV cache manager](https://github.com/InternLM/lmdeploy/blob/main/src/turbomind/models/llama/SequenceManager.h) of TurboMind is a memory-pool-like object that also implements an LRU policy, so it can be viewed as a form of __cache of KV caches__. It works in the following way:

 - All device memory required for KV cache is allocated by the manager. A fixed number of slots is pre-configured to match the memory size of the system. Each slot corresponds to the memory required by the KV cache of a single sequence. The allocation chunk size can be configured to implement a pre-allocate or on-demand allocation policy (or something in between).
 - When space for the KV cache of a new sequence is requested but no free slots are left in the pool, the least recently used sequence is evicted from the cache and its device memory is directly reused by the new sequence. However, this is not the end of the story.
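
The slot-pool-with-LRU-eviction scheme described in the doc above can be sketched in a few lines. This is an illustrative model only, not TurboMind's implementation (the real `SequenceManager` is C++ and manages device memory); the class and method names here are hypothetical, and a plain `bytearray` stands in for a device allocation.

```python
from collections import OrderedDict

class KVCacheSlotPool:
    """Toy model of a fixed-slot KV-cache pool with LRU eviction."""

    def __init__(self, num_slots: int, slot_bytes: int = 16):
        self.num_slots = num_slots
        self.slot_bytes = slot_bytes
        self.slots = OrderedDict()  # seq_id -> buffer, least recently used first

    def acquire(self, seq_id: str) -> bytearray:
        if seq_id in self.slots:
            # Cache hit: refresh recency and return the existing slot.
            self.slots.move_to_end(seq_id)
            return self.slots[seq_id]
        if len(self.slots) >= self.num_slots:
            # Pool full: evict the least recently used sequence and
            # reuse its buffer for the new sequence.
            _evicted_id, buf = self.slots.popitem(last=False)
        else:
            buf = bytearray(self.slot_bytes)  # stand-in for a device allocation
        self.slots[seq_id] = buf
        return buf
```

With a 2-slot pool, acquiring sequences `a`, `b`, then `a` again (making `b` the LRU entry) and finally `c` evicts `b`, leaving `a` and `c` resident.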
@@ -35,7 +35,7 @@ TurboMind is a highly efficient inference engine for LLMs, built on NVIDIA's [

 ## KV Cache Manager

-TurboMind's [KV cache manager](https://github.com/InternLM/lmdeploy/blob/main/src/turbomind/models/llama/LlamaCacheManager.h) is a memory-pool-like object with an LRU implementation built in, so the whole manager can be viewed as a **cache of KV caches**. It roughly works as follows:
+TurboMind's [KV cache manager](https://github.com/InternLM/lmdeploy/blob/main/src/turbomind/models/llama/SequenceManager.h) is a memory-pool-like object with an LRU implementation built in, so the whole manager can be viewed as a **cache of KV caches**. It roughly works as follows:

 - KV cache memory is allocated by the manager, which reserves space for a pre-configured number of slots. Each slot corresponds to the KV cache required by a single sequence. The allocation chunk size is configurable, enabling pre-allocation, on-demand allocation, or something in between.
 - When a new request arrives but no free slot is left in the pool, the manager evicts the least recently used sequence per the LRU policy and gives its slot to the new request. But that is not the whole story.