- 11 Jan, 2024 1 commit
-
-
binmakeswell authored
* [doc] fix annotation display
* [doc] fix llama2 doc
-
- 09 Jan, 2024 1 commit
-
-
Hongxin Liu authored
* update accelerator
* fix timer
* fix amp
* update
* fix
* update bug
* add error raise
* fix autocast
* fix set device
* remove doc accelerator
* update doc
* update doc
* update doc
* use nullcontext
* update cpu
* update null context
* change time limit for example
* udpate
* update
* update
* update
* [npu] polish accelerator code
---------
Co-authored-by: Xuanlei Zhao <xuanlei.zhao@gmail.com>
Co-authored-by: zxl <43881818+oahzxl@users.noreply.github.com>
-
- 08 Jan, 2024 1 commit
-
-
Xuanlei Zhao authored
* update extension
* update cpu adam
* update is
* add doc for cpu adam
* update kernel
* update commit
* update flash
* update memory efficient
* update flash attn
* update flash attention loader
* update api
* fix
* update doc
* update example time limit
* reverse change
* fix doc
* remove useless kernel
* fix
* not use warning
* update
* update
-
- 04 Jan, 2024 1 commit
-
-
digger yu authored
-
- 03 Jan, 2024 1 commit
-
-
flybird11111 authored
-
- 29 Dec, 2023 1 commit
-
-
flybird11111 authored
support linear accumulation fusion support linear accumulation fusion fix
-
- 22 Dec, 2023 1 commit
-
-
Wenhao Chen authored
* test: add more p2p tests
* fix: remove send_forward_recv_forward as p2p op list need to use the same group
* fix: make send and receive atomic
* feat: update P2PComm fn
* feat: add metadata cache in 1f1b
* feat: add metadata cache in interleaved pp
* feat: modify is_xx_stage fn
* revert: add _broadcast_object_list
* feat: add interleaved pp in llama policy
* feat: set NCCL_BUFFSIZE in HybridParallelPlugin
-
- 12 Dec, 2023 1 commit
-
-
flybird11111 authored
* fix aaa fix fix fix
* fix
* fix
* test ci
* fix ci fix
* llama support dist-cross fix fix fix fix fix fix fix fix
* fix
* fix
* fix fix
* test ci
* test ci
* fix
* [Colossal-Llama-2] Add finetuning Colossal-Llama-2 example (#4878)
* Add finetuning Colossal-Llama-2 example
* Add finetuning Colossal-Llama-2 example 2
* Add finetuning Colossal-Llama-2 example and support NEFTuning
* Add inference example and refine neftune
* Modify readme file
* update the imports
---------
Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com>
Co-authored-by: Camille Zhong <44392324+Camille7777@users.noreply.github.com>
* llama support dist-cross fix fix fix fix fix fix fix fix
* fix
* fix
* fix fix
* test ci
* test ci
* fix
* fix ci
* fix ci
---------
Co-authored-by: Yuanchen <70520919+chengeharrison@users.noreply.github.com>
Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com>
Co-authored-by: Camille Zhong <44392324+Camille7777@users.noreply.github.com>
-
- 30 Nov, 2023 1 commit
-
-
Xuanlei Zhao authored
* update fused attn
* update spda
* tri attn
* update triangle
* import
* fix
* fix
-
- 28 Nov, 2023 2 commits
-
-
Wenhao Chen authored
* [shardformer] implement policy for all GPT-J models and test
* [shardformer] support interleaved pipeline parallel for bert finetune
* [shardformer] shardformer support falcon (#4883)
* [shardformer]: fix interleaved pipeline for bert model (#5048)
* [hotfix]: disable seq parallel for gptj and falcon, and polish code (#5093)
* Add Mistral support for Shardformer (#5103)
* [shardformer] add tests to mistral (#5105)
---------
Co-authored-by: Pengtai Xu <henryxu880@gmail.com>
Co-authored-by: ppt0011 <143150326+ppt0011@users.noreply.github.com>
Co-authored-by: flybird11111 <1829166702@qq.com>
Co-authored-by: eric8607242 <e0928021388@gmail.com>
-
アマデウス authored
-
- 23 Nov, 2023 1 commit
-
-
Xuanlei Zhao authored
-
- 22 Nov, 2023 3 commits
-
-
Xuanlei Zhao authored
* llama 3d
* update
* fix autocast
-
flybird11111 authored
* fix flash attn
* fix fix
-
Zhongkai Zhao authored
* hotfix/Fix get model policy strategy in ShardFormer
* fix bug in auto policy
-
- 20 Nov, 2023 2 commits
-
-
Bin Jia authored
* fix init model with random parameters
* fix example
-
Hongxin Liu authored
* [npu] setup device utils (#5047)
* [npu] add npu device support
* [npu] support low level zero
* [test] update npu zero plugin test
* [hotfix] fix import
* [test] recover tests
* [npu] gemini support npu (#5052)
* [npu] refactor device utils
* [gemini] support npu
* [example] llama2+gemini support npu
* [kernel] add arm cpu adam kernel (#5065)
* [kernel] add arm cpu adam
* [optim] update adam optimizer
* [kernel] arm cpu adam remove bf16 support
-
- 19 Nov, 2023 1 commit
-
-
Xu Kai authored
* [inference] support only TP (#4998)
* support only tp
* enable tp
* add support for bloom (#5008)
* [refactor] refactor gptq and smoothquant llama (#5012)
* refactor gptq and smoothquant llama
* fix import error
* fix linear import torch-int
* fix smoothquant llama import error
* fix import accelerate error
* fix bug
* fix import smooth cuda
* fix smoothcuda
* [Inference Refactor] Merge chatglm2 with pp and tp (#5023) merge chatglm with pp and tp
* [Refactor] remove useless inference code (#5022)
* remove useless code
* fix quant model
* fix test import bug
* mv original inference legacy
* fix chatglm2
* [Refactor] refactor policy search and quant type controlling in inference (#5035)
* [Refactor] refactor policy search and quant type controling in inference
* [inference] update readme (#5051)
* update readme
* update readme
* fix architecture
* fix table
* fix table
* [inference] udpate example (#5053)
* udpate example
* fix run.sh
* fix rebase bug
* fix some errors
* update readme
* add some features
* update interface
* update readme
* update benchmark
* add requirements-infer
---------
Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com>
Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
-
- 16 Nov, 2023 2 commits
-
-
flybird11111 authored
* fix-llama * Update llama.py
-
Elsa Granger authored
[pipeline,shardformer] Fix p2p efficiency in pipeline, allow skipping loading weight not in weight_map when `strict=False`, fix llama flash attention forward, add flop estimation by megatron in llama benchmark (#5017)
* Use p2p
* Cannot bidirectonal send p2p
* Refactor tensor creation and serialization in P2P communication
* Fix llama forward args in flash attention
* Add flop estimate from megatron
* Support loading weight not in weight_map when strict=False in hybrid_parallel
* Use send_forward_recv_backward, etc in 1f1b
* Use dataclass for metdata Remove torch.cuda.synchronize() as suggested
* Add comment about the torch.cuda.synchronize for potential error
* Typo
* Update hybrid_parallel_checkpoint_io.py
* Update p2p.py
* Update one_f_one_b.py
* Update p2p.py
---------
Co-authored-by: flybird11111 <1829166702@qq.com>
-
- 10 Nov, 2023 2 commits
-
-
Zhongkai Zhao authored
* [refactor]: replace inference args with extra_kwargs in ShardConfig
* modify shardconfig
* polish code
* fix policy bug in llama
* fix bug in auto policy
* remove setattr in ShardConfig
-
flybird11111 authored
* [colossalai]fix typo * [inference] Add smmoothquant for llama (#4904) * [inference] add int8 rotary embedding kernel for smoothquant (#4843) * [inference] add smoothquant llama attention (#4850) * add smoothquant llama attention * remove uselss code * remove useless code * fix import error * rename file name * [inference] add silu linear fusion for smoothquant llama mlp (#4853) * add silu linear * update skip condition * catch smoothquant cuda lib exception * prcocess exception for tests * [inference] add llama mlp for smoothquant (#4854) * add llama mlp for smoothquant * fix down out scale * remove duplicate lines * add llama mlp check * delete useless code * [inference] add smoothquant llama (#4861) * add smoothquant llama * fix attention accuracy * fix accuracy * add kv cache and save pretrained * refactor example * delete smooth * refactor code * [inference] add smooth function and delete useless code for smoothquant (#4895) * add smooth function and delete useless code * update datasets * remove duplicate import * delete useless file * refactor codes (#4902) * rafactor code * add license * add torch-int and smoothquant license * Update flash_attention_patch.py To be compatible with the new change in the Transformers library, where a new argument 'padding_mask' was added to forward function of attention layer. https://github.com/huggingface/transformers/pull/25598 * [kernel] support pure fp16 for cpu adam and update gemini optim tests (#4921) * [kernel] support pure fp16 for cpu adam (#4896) * [kernel] fix cpu adam kernel for pure fp16 and update tests (#4919) * [kernel] fix cpu adam * [test] update gemini optim test * [format] applied code formatting on changed files in pull request 4908 (#4918) Co-authored-by:
github-actions <github-actions@github.com> * [gemini] support gradient accumulation (#4869) * add test * fix no_sync bug in low level zero plugin * fix test * add argument for grad accum * add grad accum in backward hook for gemini * finish implementation, rewrite tests * fix test * skip stuck model in low level zero test * update doc * optimize communication & fix gradient checkpoint * modify doc * cleaning codes * update cpu adam fp16 case * [hotfix] fix torch 2.0 compatibility (#4936) * [hotfix] fix launch * [test] fix test gemini optim * [shardformer] fix vit * [test] add no master test for low level zero plugin (#4934) * [format] applied code formatting on changed files in pull request 4820 (#4886) Co-authored-by:
github-actions <github-actions@github.com> * [nfc] fix some typo with colossalai/ docs/ etc. (#4920) * [Refactor] Integrated some lightllm kernels into token-attention (#4946) * add some req for inference * clean codes * add codes * add some lightllm deps * clean codes * hello * delete rms files * add some comments * add comments * add doc * add lightllm deps * add lightllm cahtglm2 kernels * add lightllm cahtglm2 kernels * replace rotary embedding with lightllm kernel * add some commnets * add some comments * add some comments * add * replace fwd kernel att1 * fix a arg * add * add * fix token attention * add some comments * clean codes * modify comments * fix readme * fix bug * fix bug --------- Co-authored-by:
cuiqing.li <lixx336@gmail.com> Co-authored-by:
CjhHa1 <cjh18671720497@outlook.com> * [test] merge old components to test to model zoo (#4945) * [test] add custom models in model zoo * [test] update legacy test * [test] update model zoo * [test] update gemini test * [test] remove components to test * [inference] add reference and fix some bugs (#4937) * add reference and fix some bugs * update gptq init --------- Co-authored-by:
Xu Kai <xukai16@foxamil.com> * [Inference]ADD Bench Chatglm2 script (#4963) * add bench chatglm * fix bug and make utils --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com> * [Pipeline inference] Combine kvcache with pipeline inference (#4938) * merge kvcache with pipeline inference and refactor the code structure * support ppsize > 2 * refactor pipeline code * do pre-commit * modify benchmark * fix bench mark * polish code * add docstring and update readme * refactor the code * fix some logic bug of ppinfer * polish readme * fix typo * skip infer test * updated c++17 compiler flags (#4983) * [Inference] Dynamic Batching Inference, online and offline (#4953) * [inference] Dynamic Batching for Single and Multiple GPUs (#4831) * finish batch manager * 1 * first * fix * fix dynamic batching * llama infer * finish test * support different lengths generating * del prints * del prints * fix * fix bug --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com> * [inference] Async dynamic batching (#4894) * finish input and output logic * add generate * test forward * 1 * [inference]Re push async dynamic batching (#4901) * adapt to ray server * finish async * finish test * del test --------- Co-authored-by:
yuehuayingxueluo <867460659@qq.com> * Revert "[inference]Re push async dynamic batching (#4901)" (#4905) This reverts commit fbf3c09e673794ed18c91d4bab1a7dfea052e95a. * Revert "[inference] Async dynamic batching (#4894)" This reverts commit fced14025043e29ce816b315f440601188f7f79f. * Revert "[inference] Async dynamic batching (#4894)" (#4909) This reverts commit fced14025043e29ce816b315f440601188f7f79f. * Add Ray Distributed Environment Init Scripts * support DynamicBatchManager base function * revert _set_tokenizer version * add driver async generate * add async test * fix bugs in test_ray_dist.py * add get_tokenizer.py * fix code style * fix bugs about No module named 'pydantic' in ci test * fix bugs in ci test * fix bugs in ci test * fix bugs in ci test * [infer]Add Ray Distributed Environment Init Scripts (#4911) * Revert "[inference] Async dynamic batching (#4894)" This reverts commit fced14025043e29ce816b315f440601188f7f79f. * Add Ray Distributed Environment Init Scripts * support DynamicBatchManager base function * revert _set_tokenizer version * add driver async generate * add async test * fix bugs in test_ray_dist.py * add get_tokenizer.py * fix code style * fix bugs about No module named 'pydantic' in ci test * fix bugs in ci test * fix bugs in ci test * fix bugs in ci test * support dynamic batch for bloom model and is_running function * [Inference]Test for new Async engine (#4935) * infer engine * infer engine * test engine * test engine * new manager * change step * add * test * fix * fix * finish test * finish test * finish test * finish test * add license --------- Co-authored-by:
yuehuayingxueluo <867460659@qq.com> * add assertion for config (#4947) * [Inference] Finish dynamic batching offline test (#4948) * test * fix test * fix quant * add default * fix * fix some bugs * fix some bugs * fix * fix bug * fix bugs * reset param --------- Co-authored-by:
yuehuayingxueluo <867460659@qq.com> Co-authored-by:
Cuiqing Li <lixx3527@gmail.com> Co-authored-by: CjhHa1 <cjh18671720497outlook.com> * [Kernels]Updated Triton kernels into 2.1.0 and adding flash-decoding for llama token attention (#4965) * adding flash-decoding * clean * adding kernel * adding flash-decoding * add integration * add * adding kernel * adding kernel * adding triton 2.1.0 features for inference * update bloom triton kernel * remove useless vllm kernels * clean codes * fix * adding files * fix readme * update llama flash-decoding --------- Co-authored-by:
cuiqing.li <lixx336@gmail.com> * fix ColossalEval (#4992) Co-authored-by:
Xu Yuanchen <yuanchen.xu00@gmail.com> * [doc]Update doc for colossal-inference (#4989) * update doc * Update README.md --------- Co-authored-by:
cuiqing.li <lixx336@gmail.com> * [hotfix] Fix the bug where process groups were not being properly released. (#4940) * Fix the bug where process groups were not being properly released. * test * Revert "test" This reverts commit 479900c1398637310abf92eefa3cd168038ea02f. * [hotfix] fix the bug of repeatedly storing param group (#4951) * [doc] add supported feature diagram for hybrid parallel plugin (#4996) * [Pipeline Inference] Merge pp with tp (#4993) * refactor pipeline into new CaiInferEngine * updata llama modeling forward * merge tp with pp * update docstring * optimize test workflow and example * fix typo * add assert and todo * [release] update version (#4995) * [release] update version * [hotfix] fix ci * [gemini] gemini support tp [gemini] gemini support tp [gemini] gemini support tp [gemini] gemini support tp [gemini] gemini support tp * fix fix fix * update checkpointIO update checkpointIO update checkpointIO update checkpointIO update checkpointIO update checkpointIO update checkpointIO update checkpointIO update checkpointIO * support fused layernorm support fused layernorm support fused layernorm * update fusedlayernorm update fusedlayernorm update fusedlayernorm * add sequence parallel to gemini add sequence parallel to gemini * fix * fix comments fix comments fix comments * fix * fix t5 * clear cache * fix * activate ci * activate ci * fix * fix * fix * fix * revert * modify tp gather method modify tp gather method modify tp gather method modify tp gather method * fix test --------- Co-authored-by:
Xu Kai <xukai16@foxmail.com>
Co-authored-by: Zian(Andy) Zheng <62330719+Orion-Zheng@users.noreply.github.com>
Co-authored-by: Hongxin Liu <lhx0217@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Baizhou Zhang <eddiezhang@pku.edu.cn>
Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
Co-authored-by: digger yu <digger-yu@outlook.com>
Co-authored-by: Cuiqing Li <lixx3527@gmail.com>
Co-authored-by: cuiqing.li <lixx336@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
Co-authored-by: Xu Kai <xukai16@foxamil.com>
Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com>
Co-authored-by: アマデウス <kurisusnowdeng@users.noreply.github.com>
Co-authored-by: yuehuayingxueluo <867460659@qq.com>
Co-authored-by: Yuanchen <70520919+chengeharrison@users.noreply.github.com>
Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com>
Co-authored-by: littsk <1214689160@qq.com>
Co-authored-by: ppt0011 <143150326+ppt0011@users.noreply.github.com>
-
- 07 Nov, 2023 1 commit
-
-
Jianghai authored
* fix bug
* fix
* fix multiquery
* fix multiquery
---------
Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
-
- 03 Nov, 2023 1 commit
-
-
littsk authored
* [hotfix] Add layer norm gradients all-reduce for sequence parallel. (#4915)
* Add layer norm gradients all-reduce for sequence parallel.
* skip pipeline inference test
* [hotfix] fixing polices of sequence parallel (#4922)
* Add layer norm gradients all-reduce for sequence parallel.
* fix parameter passing when calling get_autopolicy
---------
Co-authored-by: littsk <1214689160@qq.com>
* Hotfix/add grad all reduce for sequence parallel (#4927)
* Add layer norm gradients all-reduce for sequence parallel.
* fix parameter passing when calling get_autopolicy
* fix bug using wrong variables
---------
Co-authored-by: littsk <1214689160@qq.com>
* fix policy initialization
* fix bloom and chatglm policices
* polish code of handling layernorm
* fix moe module
* polish code of class initializing
---------
Co-authored-by: Zhongkai Zhao <kanezz620@gmail.com>
-
- 27 Oct, 2023 1 commit
-
-
Bin Jia authored
* merge kvcache with pipeline inference and refactor the code structure
* support ppsize > 2
* refactor pipeline code
* do pre-commit
* modify benchmark
* fix bench mark
* polish code
* add docstring and update readme
* refactor the code
* fix some logic bug of ppinfer
* polish readme
* fix typo
* skip infer test
-
- 18 Oct, 2023 2 commits
-
-
digger yu authored
-
Hongxin Liu authored
* [hotfix] fix launch
* [test] fix test gemini optim
* [shardformer] fix vit
-
- 04 Oct, 2023 1 commit
-
-
Xu Kai authored
* fix test bug
* delete useless code
* fix typo
-
- 27 Sep, 2023 1 commit
-
-
littsk authored
-
- 22 Sep, 2023 2 commits
-
-
Jianghai authored
* add chatglm2
* add
* gather needed kernels
* fix some bugs
* finish context forward
* finish context stage
* fix
* add
* pause
* add
* fix bugs
* finish chatglm
* fix bug
* change some logic
* fix bugs
* change some logics
* add
* add
* add
* fix
* fix tests
* fix
-
Xu Kai authored
* [gptq] add gptq kernel (#4416)
* add gptq
* refactor code
* fix tests
* replace auto-gptq
* rname inferance/quant
* refactor test
* add auto-gptq as an option
* reset requirements
* change assert and check auto-gptq
* add import warnings
* change test flash attn version
* remove example
* change requirements of flash_attn
* modify tests
* [skip ci] change requirements-test
* [gptq] faster gptq cuda kernel (#4494)
* [skip ci] add cuda kernels
* add license
* [skip ci] fix max_input_len
* format files & change test size
* [skip ci]
* [gptq] add gptq tensor parallel (#4538)
* add gptq tensor parallel
* add gptq tp
* delete print
* add test gptq check
* add test auto gptq check
* [gptq] combine gptq and kv cache manager (#4706)
* combine gptq and kv cache manager
* add init bits
* delete useless code
* add model path
* delete usless print and update test
* delete usless import
* move option gptq to shard config
* change replace linear to shardformer
* update bloom policy
* delete useless code
* fix import bug and delete uselss code
* change colossalai/gptq to colossalai/quant/gptq
* update import linear for tests
* delete useless code and mv gptq_kernel to kernel directory
* fix triton kernel
* add triton import
-
- 19 Sep, 2023 1 commit
-
-
Hongxin Liu authored
* [misc] update pre-commit
* [misc] run pre-commit
* [misc] remove useless configuration files
* [misc] ignore cuda for clang-format
-
- 15 Sep, 2023 1 commit
-
-
Baizhou Zhang authored
* create shardformer doc files
* add docstring for seq-parallel
* update ShardConfig docstring
* add links to llama example
* add outdated massage
* finish introduction & supporting information
* finish 'how shardformer works'
* finish shardformer.md English doc
* fix doctest fail
* add Chinese document
-
- 14 Sep, 2023 1 commit
-
-
flybird11111 authored
* [shardformer] fix whisper test failed
* [shardformer] fix whisper test failed
* [shardformer] fix whisper test failed
* [shardformer] fix whisper test failed
-
- 13 Sep, 2023 1 commit
-
-
flybird11111 authored
-
- 12 Sep, 2023 1 commit
-
-
flybird11111 authored
* [shardformer] update shardformer readme
* [shardformer] update shardformer readme
* [shardformer] update shardformer readme
* [shardformer] update shardformer readme
* [shardformer] update shardformer readme
-
- 11 Sep, 2023 2 commits
-
-
Cuiqing Li authored
[Feature] The first PR to Add TP inference engine, kv-cache manager and related kernels for our inference system (#4577) * [infer] Infer/llama demo (#4503) * add * add infer example * finish * finish * stash * fix * [Kernels] add inference token attention kernel (#4505) * add token forward * fix tests * fix comments * add try import triton * add adapted license * add tests check * [Kernels] add necessary kernels (llama & bloom) for attention forward and kv-cache manager (#4485) * added _vllm_rms_norm * change place * added tests * added tests * modify * adding kernels * added tests: * adding kernels * modify * added * updating kernels * adding tests * added tests * kernel change * submit * modify * added * edit comments * change name * change commnets and fix import * add * added * combine codes (#4509) * [feature] add KV cache manager for llama & bloom inference (#4495) * add kv cache memory manager * add stateinfo during inference * format * format * rename file * add kv cache test * revise on BatchInferState * file dir change * [Bug FIx] import llama context ops fix (#4524) * added _vllm_rms_norm * change place * added tests * added tests * modify * adding kernels * added tests: * adding kernels * modify * added * updating kernels * adding tests * added tests * kernel change * submit * modify * added * edit comments * change name * change commnets and fix import * add * added * fix * add ops into init.py * add * [Infer] Add TPInferEngine and fix file path (#4532) * add engine for TP inference * move file path * update path * fix TPInferEngine * remove unused file * add engine test demo * revise TPInferEngine * fix TPInferEngine, add test * fix * Add Inference test for llama (#4508) * add kv cache memory manager * add stateinfo during inference * add * add infer example * finish * finish * format * format * rename file * add kv cache test * revise on BatchInferState * add inference test for llama * fix conflict * feature: add some new features for llama engine * 
adapt colossalai triton interface * Change the parent class of llama policy * add nvtx * move llama inference code to tensor_parallel * fix __init__.py * rm tensor_parallel * fix: fix bugs in auto_policy.py * fix:rm some unused codes * mv colossalai/tpinference to colossalai/inference/tensor_parallel * change __init__.py * save change * fix engine * Bug fix: Fix hang * remove llama_infer_engine.py --------- Co-authored-by:
yuanheng-zhao <jonathan.zhaoyh@gmail.com> Co-authored-by:
CjhHa1 <cjh18671720497@outlook.com> * [infer] Add Bloom inference policy and replaced methods (#4512) * add bloom inference methods and policy * enable pass BatchInferState from model forward * revise bloom infer layers/policies * add engine for inference (draft) * add test for bloom infer * fix bloom infer policy and flow * revise bloom test * fix bloom file path * remove unused codes * fix bloom modeling * fix dir typo * fix trivial * fix policy * clean pr * trivial fix * Revert "[infer] Add Bloom inference policy and replaced methods (#4512)" (#4552) This reverts commit 17cfa5714083a81a505c097f1c411cd28162d922. * [Doc] Add colossal inference doc (#4549) * create readme * add readme.md * fix typos * [infer] Add Bloom inference policy and replaced methods (#4553) * add bloom inference methods and policy * enable pass BatchInferState from model forward * revise bloom infer layers/policies * add engine for inference (draft) * add test for bloom infer * fix bloom infer policy and flow * revise bloom test * fix bloom file path * remove unused codes * fix bloom modeling * fix dir typo * fix trivial * fix policy * clean pr * trivial fix * trivial * Fix Bugs In Llama Model Forward (#4550) * add kv cache memory manager * add stateinfo during inference * add * add infer example * finish * finish * format * format * rename file * add kv cache test * revise on BatchInferState * add inference test for llama * fix conflict * feature: add some new features for llama engine * adapt colossalai triton interface * Change the parent class of llama policy * add nvtx * move llama inference code to tensor_parallel * fix __init__.py * rm tensor_parallel * fix: fix bugs in auto_policy.py * fix:rm some unused codes * mv colossalai/tpinference to colossalai/inference/tensor_parallel * change __init__.py * save change * fix engine * Bug fix: Fix hang * remove llama_infer_engine.py * bug fix: fix bugs about infer_state.is_context_stage * remove pollcies * fix: delete unused code * fix: 
delete unused code * remove unused coda * fix conflict --------- Co-authored-by:
yuanheng-zhao <jonathan.zhaoyh@gmail.com> Co-authored-by:
CjhHa1 <cjh18671720497@outlook.com> * [doc] add colossal inference fig (#4554) * create readme * add readme.md * fix typos * upload fig * [NFC] fix docstring for colossal inference (#4555) Fix docstring and comments in kv cache manager and bloom modeling * fix docstring in llama modeling (#4557) * [Infer] check import vllm (#4559) * change import vllm * import apply_rotary_pos_emb * change import location * [DOC] add installation req (#4561) * add installation req * fix * slight change * remove empty * [Feature] rms-norm transfer into inference llama.py (#4563) * add installation req * fix * slight change * remove empty * add rmsnorm polciy * add * clean codes * [infer] Fix tp inference engine (#4564) * fix engine prepare data * add engine test * use bloom for testing * revise on test * revise on test * reset shardformer llama (#4569) * [infer] Fix engine - tensors on different devices (#4570) * fix diff device in engine * [codefactor] Feature/colossal inference (#4579) * code factors * remove * change coding (#4581) * [doc] complete README of colossal inference (#4585) * complete fig * Update README.md * [doc]update readme (#4586) * update readme * Update README.md * bug fix: fix bus in llama and bloom (#4588) * [BUG FIX]Fix test engine in CI and non-vllm kernels llama forward (#4592) * fix tests * clean * clean * fix bugs * add * fix llama non-vllm kernels bug * modify * clean codes * [Kernel]Rmsnorm fix (#4598) * fix tests * clean * clean * fix bugs * add * fix llama non-vllm kernels bug * modify * clean codes * add triton rmsnorm * delete vllm kernel flag * [Bug Fix]Fix bugs in llama (#4601) * fix tests * clean * clean * fix bugs * add * fix llama non-vllm kernels bug * modify * clean codes * bug fix: remove rotary_positions_ids --------- Co-authored-by:
cuiqing.li <lixx3527@gmail.com> * [kernel] Add triton layer norm & replace norm for bloom (#4609) * add layernorm for inference * add test for layernorm kernel * add bloom layernorm replacement policy * trivial: path * [Infer] Bug fix rotary embedding in llama (#4608) * fix rotary embedding * delete print * fix init seq len bug * rename pytest * add benchmark for llama * refactor codes * delete useless code * [bench] Add bloom inference benchmark (#4621) * add bloom benchmark * readme - update benchmark res * trivial - uncomment for testing (#4622) * [Infer] add check triton and cuda version for tests (#4627) * fix rotary embedding * delete print * fix init seq len bug * rename pytest * add benchmark for llama * refactor codes * delete useless code * add check triton and cuda * Update sharder.py (#4629) * [Inference] Hot fix some bugs and typos (#4632) * fix * fix test * fix conflicts * [typo]Comments fix (#4633) * fallback * fix commnets * bug fix: fix some bugs in test_llama and test_bloom (#4635) * [Infer] delete benchmark in tests and fix bug for llama and bloom (#4636) * fix rotary embedding * delete print * fix init seq len bug * rename pytest * add benchmark for llama * refactor codes * delete useless code * add check triton and cuda * delete benchmark and fix infer bugs * delete benchmark for tests * delete useless code * delete bechmark function in utils * [Fix] Revise TPInferEngine, inference tests and benchmarks (#4642) * [Fix] revise TPInferEngine methods and inference tests * fix llama/bloom infer benchmarks * fix infer tests * trivial fix: benchmakrs * trivial * trivial: rm print * modify utils filename for infer ops test (#4657) * [Infer] Fix TPInferEngine init & inference tests, benchmarks (#4670) * fix engine funcs * TPInferEngine: receive shard config in init * benchmarks: revise TPInferEngine init * benchmarks: remove pytest decorator * trivial fix * use small model for tests * [NFC] use args for infer benchmarks (#4674) * revise infer default 
(#4683) * [Fix] optimize/shard model in TPInferEngine init (#4684) * remove using orig model in engine * revise inference tests * trivial: rename --------- Co-authored-by:
Jianghai <72591262+CjhHa1@users.noreply.github.com>
Co-authored-by: Xu Kai <xukai16@foxmail.com>
Co-authored-by: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Co-authored-by: yuehuayingxueluo <867460659@qq.com>
Co-authored-by: yuanheng-zhao <jonathan.zhaoyh@gmail.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
-
flybird11111 authored
* [shardformer]fix gpt2 test [shardformer]fix gpt2 test [shardformer]fix gpt2 test
* fix
* [shardformer] add todo
* [shardformer] add todo
-
- 09 Sep, 2023 1 commit
-
-
flybird11111 authored
* [shardformer] update shardformer readme [shardformer] update shardformer readme [shardformer] update shardformer readme
* [shardformer] update llama2/opt finetune example and shardformer update to llama2
* [shardformer] update llama2/opt finetune example and shardformer update to llama2
* [shardformer] update llama2/opt finetune example and shardformer update to llama2
* [shardformer] change dataset
* [shardformer] change dataset
* [shardformer] fix CI
* [shardformer] fix
* [shardformer] fix
* [shardformer] fix
* [shardformer] fix
* [shardformer] fix [example] update opt example [example] resolve comments fix fix
-
- 07 Sep, 2023 1 commit
-
-
Baizhou Zhang authored
* update vit example for hybrid plugin
* reset tp/pp size
* fix dataloader iteration bug
* update optimizer passing in evaluation/add grad_accum
* change criterion
* wrap tqdm
* change grad_accum to grad_checkpoint
* fix pbar
-