Unverified commit 079f29bc authored by Chen Xin, committed by GitHub

auto upload cuda12.1 python pkg to release when create new tag (#784)

* add cuda12-whl-release ci

* enable environment

* test py310-311 windows wheel

* fix py310, py311 setup.py error on windows

* fix lint
parent 7990d252
name: cuda12.1-whl-release

on:
  push:
    tags:
      - '*'
  workflow_dispatch:

permissions:
  contents: write

jobs:
  linux-build:
    strategy:
      matrix:
        pyver: [py38, py39, py310, py311]
    runs-on: ubuntu-latest
    env:
      PYTHON_VERSION: ${{ matrix.pyver }}
      PLAT_NAME: manylinux2014_x86_64
      DOCKER_TAG: cuda12.1
      OUTPUT_FOLDER: cuda12.1_dist
      CUDA_VER: 12.1
    steps:
      - name: Free disk space
        uses: jlumbroso/free-disk-space@main
        with:
          # This might remove tools that are actually needed, if set to "true", but frees about 6 GB
          tool-cache: false
          docker-images: false
          # All of these default to true, but feel free to set to "false" if necessary for your workflow
          android: true
          dotnet: true
          haskell: true
          large-packages: true
          swap-storage: false
      - name: Checkout repository
        uses: actions/checkout@v3
      - name: Build
        run: |
          echo ${PYTHON_VERSION}
          echo ${PLAT_NAME}
          echo ${DOCKER_TAG}
          echo ${OUTPUT_FOLDER}
          # remove -it so docker run works in the non-interactive CI shell
          sed -i 's/docker run --rm -it/docker run --rm/g' builder/manywheel/build_wheel.sh
          bash builder/manywheel/build_wheel.sh ${PYTHON_VERSION} ${PLAT_NAME} ${DOCKER_TAG} ${OUTPUT_FOLDER}
      - name: Upload Artifacts
        uses: actions/upload-artifact@v3
        with:
          if-no-files-found: error
          path: builder/manywheel/${{ env.OUTPUT_FOLDER }}/*
          retention-days: 1
  windows-build:
    strategy:
      matrix:
        pyver: ['3.8', '3.9', '3.10', '3.11']
    runs-on: windows-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
      - name: Set up python
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.pyver }}
      - name: Install python packages
        run: |
          pip install pybind11 wheel
      - uses: Jimver/cuda-toolkit@v0.2.11
        id: cuda-toolkit
        with:
          cuda: '12.1.0'
          use-github-cache: false
      - name: Build wheel
        run: |
          mkdir build
          cd build
          pip install -U setuptools
          ..\builder\windows\generate.ps1
          cmake --build . --config Release -- /m > build.log.txt
          cmake --install . --config Release
          cd ..
          rm build -Force -Recurse
          python setup.py bdist_wheel -d build/wheel
      - name: Upload Artifacts
        uses: actions/upload-artifact@v3
        with:
          if-no-files-found: error
          path: build/wheel/*
          retention-days: 1
  publish:
    runs-on: ubuntu-latest
    environment: 'prod'
    needs:
      - linux-build
      - windows-build
    steps:
      - name: Download artifacts
        uses: actions/download-artifact@v3
      - name: Display artifacts
        run: ls artifact/ -lh
      - name: Publish
        uses: softprops/action-gh-release@v1
        if: startsWith(github.ref, 'refs/tags/')
        with:
          files: artifact/*
@@ -75,6 +75,8 @@ jobs:
       run: |
         mkdir build
         cd build
+        # https://github.com/pypa/setuptools/issues/1631
+        pip install -U setuptools
         ..\builder\windows\generate.ps1
         cmake --build . --config Release -- /m > build.log.txt
         cmake --install . --config Release
@@ -4,8 +4,10 @@ set -eou pipefail

 TOPDIR=$(git rev-parse --show-toplevel)/builder
+CUDA_VER=${CUDA_VER:-11.8}
 PLAT_NAME=manylinux2014_x86_64

-for cuver in 11.8; do
+for cuver in ${CUDA_VER}; do
   DOCKER_TAG=cuda${cuver}
   OUTPUT_FOLDER=cuda${cuver}_dist
   for pyver in py38 py39 py310 py311; do
@@ -35,7 +35,7 @@ You may recognize this feature as "continuous batching" in other repos. But duri

 ## KV Cache Manager

-The [KV cache manager](https://github.com/InternLM/lmdeploy/blob/main/src/turbomind/models/llama/LlamaCacheManager.h) of TurboMind is a memory-pool-like object that also implements an LRU policy, so it can be viewed as a form of __cache of KV caches__. It works in the following way:
+The [KV cache manager](https://github.com/InternLM/lmdeploy/blob/main/src/turbomind/models/llama/SequenceManager.h) of TurboMind is a memory-pool-like object that also implements an LRU policy, so it can be viewed as a form of __cache of KV caches__. It works in the following way:

 - All device memory required for KV cache is allocated by the manager. A fixed number of slots is pre-configured to match the memory size of the system. Each slot corresponds to the memory required by the KV cache of a single sequence. The allocation chunk size can be configured to implement a pre-allocate or on-demand allocation policy (or something in between).
 - When space for the KV cache of a new sequence is requested but no free slots are left in the pool, the least recently used sequence is evicted from the cache and its device memory is directly reused by the new sequence. However, this is not the end of the story.
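
The slot-pool-with-LRU-eviction scheme described in the doc above can be sketched in a few lines. This is an illustrative model only, not TurboMind's implementation (the real `SequenceManager` is C++ and manages device memory); the class and method names here are hypothetical, and a plain `bytearray` stands in for a device allocation.

```python
from collections import OrderedDict

class KVCacheSlotPool:
    """Toy model of a fixed-slot KV-cache pool with LRU eviction."""

    def __init__(self, num_slots: int, slot_bytes: int = 16):
        self.num_slots = num_slots
        self.slot_bytes = slot_bytes
        self.slots = OrderedDict()  # seq_id -> buffer, least recently used first

    def acquire(self, seq_id: str) -> bytearray:
        if seq_id in self.slots:
            # Cache hit: refresh recency and return the existing slot.
            self.slots.move_to_end(seq_id)
            return self.slots[seq_id]
        if len(self.slots) >= self.num_slots:
            # Pool full: evict the least recently used sequence and
            # reuse its buffer for the new sequence.
            _evicted_id, buf = self.slots.popitem(last=False)
        else:
            buf = bytearray(self.slot_bytes)  # stand-in for a device allocation
        self.slots[seq_id] = buf
        return buf
```

With a 2-slot pool, acquiring sequences `a`, `b`, then `a` again (making `b` the LRU entry) and finally `c` evicts `b`, leaving `a` and `c` resident.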
@@ -35,7 +35,7 @@ TurboMind is a highly efficient inference engine for LLMs, built on NVIDIA's [

 ## KV Cache Manager

-TurboMind's [KV cache manager](https://github.com/InternLM/lmdeploy/blob/main/src/turbomind/models/llama/LlamaCacheManager.h) is a memory-pool-like object with an LRU implementation built in, so the whole manager can be viewed as a **cache of KV caches**. It roughly works as follows:
+TurboMind's [KV cache manager](https://github.com/InternLM/lmdeploy/blob/main/src/turbomind/models/llama/SequenceManager.h) is a memory-pool-like object with an LRU implementation built in, so the whole manager can be viewed as a **cache of KV caches**. It roughly works as follows:

 - KV cache memory is allocated by the manager, which reserves space for a pre-configured number of slots. Each slot corresponds to the KV cache required by a single sequence. The allocation chunk size is configurable, enabling pre-allocation, on-demand allocation, or something in between.
 - When a new request arrives but no free slot is left in the pool, the manager evicts the least recently used sequence per the LRU policy and gives its slot to the new request. But that is not the whole story.