"git@developer.sourcefind.cn:orangecat/ollama.git" did not exist on "1358e27b7799c108f83bf75e4d1295851f9667ed"
- 25 Feb, 2022 1 commit
-
-
Masaki Kozuki authored
-
- 23 Feb, 2022 1 commit
-
-
Masaki Kozuki authored
-
- 04 Feb, 2022 1 commit
-
-
eqy authored
* FusedRMSNorm based on FusedLayerNorm * refactor duplicated kernels * delete comments * delete comments * cleanup * cleanup * cleanup, fixed clobbering forward_affine_mixed_dtypes * fix pybind naming and add MixedFused test * undo skipping * check elementwise_affine * Update tests/L0/run_fused_layer_norm/test_fused_layer_norm.py Oof, nice catch, thanks Co-authored-by:
Masaki Kozuki <masaki.kozuki.2014@gmail.com> Co-authored-by:
Masaki Kozuki <masaki.kozuki.2014@gmail.com>
-
- 31 Jan, 2022 1 commit
-
-
Masaki Kozuki authored
* Free output tensor on each pipeline stage for smaller memory footprint see: https://github.com/NVIDIA/Megatron-LM/commit/057b086c689b164864455430c223ab52fd86bbcb * ref: https://github.com/NVIDIA/Megatron-LM/commit/945ece943149b63511e9d0ec3df8effe7f3c13ff * ref: https://github.com/NVIDIA/Megatron-LM/commit/9a8b89acd8f6ba096860170d0e30ddc0bc2bacd4 * remove position embedding group in destroy * pass deallocate_pipeline_outputs to backward_step * fix typo * missing deallocate_pipeline_outputs * fix typo: grad_ouptut -> grad_output * update tests * remove accessed todo * test with data parallel size of 2 if there's equal to or more than 8 gpus
-
- 28 Jan, 2022 2 commits
-
-
Masaki Kozuki authored
* cosmetic refactor in test * log with PID * log more info: rank, pid, filename, lineNo
-
Masaki Kozuki authored
* have get_kth_microbatch deal with None batch * broadcast based on tensor parallel rank * dtype * remove unnecessary .cuda() Processes of tensor parallel rank != 0 doesn't need to prepare one or more `torch.utils.data.DataLoader` instances, which means the argument of `batch` of `get_kth_microbatch` function can be `None` but the current function implementation doesn't allow for it.
-
- 21 Jan, 2022 1 commit
-
-
Masaki Kozuki authored
* add keyword argument of `grad_scaler` * update test * pass dtype to fwd_step_func * add log * calc loss in autocast as per https://pytorch.org/docs/stable/amp.html#autocasting * add keyword argument of `grad_scaler` * update test * pass dtype to fwd_step_func * add log * calc loss in autocast as per https://pytorch.org/docs/stable/amp.html#autocasting * option to turn off autocast inside forward_step function As there's some users who activate `autocast` outside fwd/bwd functions. * add missing arg of disable_autocast * reorder args of no pipeline
-
- 17 Dec, 2021 1 commit
-
-
Masaki Kozuki authored
Add an argument of `dtype` to forward_backward functions to specify the dtype used in p2p comm (#1249) * let users sepcify dtype for p2p comm taking the possibility of O2 style AMP into account * add `dtype` argument to forward_backward functions * fix * better message * add docstring of dtype * add a link to dtype logic of p2p comm
-
- 16 Dec, 2021 1 commit
-
-
eqy authored
* reduce bert memory usage, placeholder data for gpt * update gpt test * fix * Update tests/L0/run_transformer/run_bert_minimal_test.py remove debugging indexing Co-authored-by:
Masaki Kozuki <masaki.kozuki.2014@gmail.com> * Update tests/L0/run_transformer/run_bert_minimal_test.py cleanup Co-authored-by:
Masaki Kozuki <masaki.kozuki.2014@gmail.com> Co-authored-by:
Masaki Kozuki <masaki.kozuki.2014@gmail.com>
-
- 14 Dec, 2021 1 commit
-
-
eqy authored
-
- 10 Dec, 2021 2 commits
-
-
Masaki Kozuki authored
* update parallel_state * update pipeline common funcs - forward_step and backward_step * update pipelining w/o interleaving * type hint * merge utils into without_interleaving Motivation: functions in utils are only used by forward_backward_pipelining_without_interleaving * fix handling of `model_type` * fix import of DDP * update set_input_tensor method * fix * cosmetic * update model * refactor pipeline test scripts
-
Rishi Puri authored
Minimal gpt pipeline parallel (builds off of minimal_bert_pipeline_parallel) including cpu-offloading (#1222) * minimal bert pipeline parallel test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * first draft of gpt minimal test * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * framework to scale up the gpt2 test for variety of distributed setups * adding gpt_minimal_test to list of multigpu tests Co-authored-by:
Eddie Yan <eddiey@nvidia.com> Co-authored-by:
riship <riship@nvidia.com>
-
- 09 Dec, 2021 1 commit
-
-
Kevin Stephano authored
* Add fused mixed precision lamb optimizer. * Fix device usage in constructor. * Fix sending param_group tensor state to device. * Remove unneeded device set.
-
- 19 Nov, 2021 2 commits
-
-
eqy authored
* minimal bert pipeline parallel test * fix global and cleanup * use get_forward_backward_func * cleanup and fix some tests
-
Masaki Kozuki authored
* init logging use * fix * clean up * fp32 p2p comm * init * Dynamic global batch size with `MegatronPretrainingSampler` I couldn't make this script work with `MegatronPretrainingRandomSampler` because the random sampler seems to have some requirement for global batch size, total number of samples, local minibatch size, etc. which I'm not familiar with for now * revive original pipeline parallel test * update MULTIGPU_TEST: add dynamic batchsize test * run MegatronPretrainingRandomSampler * fix comment * fix * update * cosmetic * add note * Apply 2 suggestion(s) to 2 file(s) * change following https://github.com/NVIDIA/apex/pull/1210 * fix
-
- 10 Nov, 2021 1 commit
-
-
eqy authored
-
- 27 Oct, 2021 1 commit
-
-
Masaki Kozuki authored
* Init apex.ppu (pipeline model parallel utility) Reference commit: ``` commit 5ab646376d67831601d5552c193241d017f1b35c (HEAD -> main, internal/main) Merge: 14f2c684 7b293d9b Author: Mohammad Shoeybi <mshoeybi@nvidia.com> Date: Wed Sep 22 22:57:54 2021 -0700 Merge branch 'add_BOS' into 'main' Add Beginning of Sentence token option and adding semaphore while multi-threading to prevent crashes and hangs due to connection keep-alives See merge request ADLR/megatron-lm!328 ``` * removing get_args and replace import - phase 1 * removing get_args and replace import - phase 2 * move ppu to apex.transformer.pipeline_parallel * update two __init__.py * update READMEs * mpu -> parallel_state & tensor_parallel * fix * remove not pipeline files * separate schedules.py - phase 1 * dissect schedules.py * data_iterators -> batch * remove optimizer from forward_backward_step funcs * init test * Apply 2 suggestion(s) to 2 file(s) * fix cyclic import * fix syntax of Callable * fix - 1 * move directory as testing used for pp test as well * add some functions for num microbatches calculator * model is a list in pipeline parallel * skip build num microbatch calculator * fix test * assert -> raise * skip args printing * specify tensor shape everywhere even if None - phase 1 * private timers * passing tensor shape & dtype around * update dtype handling by introducing helper func * write helper func to reduce cyclomatic complexity * remove duplicate * update * move split_tensor_into_1d_equal_chunks to avoid cyclic import * tmp * cosmetic * move gather_split_1d_tensor to avoid cyclic imports * remove debug print * add outer loop * early return if possible * cosmetic * passing around tensor shape * refactor test * add script to learn batch sampler behavior * update * minibatch splitter * add minibatch splitter * split minibatch into microbatches * minor changes * uncomment split batch for test sake * set as attribute * study the behavior of no pipelining * debug 1 * reflect test util namespace change * update readme * cosmetic in test * add model build helper func for interleaving shced * adding model builder from megatron * canbe cyclic import * fix * enable interleaving test, but failing even if forward only * fix batch preparation * add explanation * print data parallel size * fix typo * Add Megatron style GPT model by Rishi Co-authored-by:Rishi Puri <riship@nvidia.com> * update * type hint for jit * fix forward_backward_no_pipelining test * pipeline forward backward seem to hang if not forward only * fix typo * debug * add p2p test * simplify * fix * tentative * set both tmp and pmp to 1 * init * fix typo * fix * fix path of divide * set seed for tmp * update upon Eddie comment * fix typo * adding failing data loader test * fix * megatron still failing * check in * with the nested loop of new order, interleaving seems fine * cosmetic change * make `forward_backward_pipelining_with_interleaving private * warn users that interleaving sched is unstable * move noop handler to no pipelining * comment out rank_print * make `build_model` more flexible * skip megatron test tentatively * correctly comment out rank_print * correctly comment out rank_print * correctly comment out rank_print * skip appropriately * remove wip p2p comm test * update type hint of model_provider_func * disable tf32 in each test script * skip interleaving w/ backward * rename as mpu is the old name * remove broken case * expose build_model func * delete `dist.ring_exchange` func call and `use_ring_exchange` argument * nit fixes * check in * remove unused file * update the list * update tensor shape * remove mixed dtype case * use torch.distributed.run * 2020 -> 2021 * another 2020 -> 2021 * docstring & type hint * fix teardown * update * change to experimental * check if warned Co-authored-by:
Rishi Puri <riship@nvidia.com> Co-authored-by:
Eddie Yan <eddiey@nvidia.com>
-
- 23 Oct, 2021 1 commit
-
-
Masaki Kozuki authored
* switch from clone to out-of-place subtract * Update apex/mpu/cross_entropy.py * Apply 1 suggestion(s) to 1 file(s) Co-authored-by:Eddie Yan <eddiey@nvidia.com>
-
- 08 Oct, 2021 1 commit
-
-
Masaki Kozuki authored
* run backward * remove custom_fwd/custom_bwd
-
- 06 Oct, 2021 1 commit
-
-
Masaki Kozuki authored
* [ColumnParallelLinear] Test behavior in autocast * fix test * casts manually to autocast dtype
-
- 02 Oct, 2021 1 commit
-
-
Masaki Kozuki authored
Co-authored-by:
Piotr Bialecki <pbialecki@nvidia.com> Co-authored-by:
Eddie Yan <eddiey@nvidia.com> Co-authored-by:
Rishi Puri <riship@nvidia.com> Co-authored-by:
Sangkug Lym <slym@nvidia.com>
-
- 15 Apr, 2021 1 commit
-
-
Sudhakar Singh authored
* Add unit tests for fused-novograd * Fix: tensors should reside on the same device * Fix: Cudastream should be called on the same device on which the tensors reside on. Found this during debugging fused novograd multi-device unit test * fixed issues mentioned in the comments
-
- 01 Dec, 2020 1 commit
-
-
Kexin Yu authored
DistributedFusedAdam Model Parallelism Support (Megatron) Co-authored-by:
Kexin Yu <kexiny@nvidia.com> Co-authored-by:
Kexin Yu <kexinznzn@gmail.com>
-
- 05 Aug, 2020 1 commit
-
-
ngimel authored
* add device guards to the optimizers * add untracked file * set deviceGuard in multi_tensor_apply * address review comments; fix lamb * indent * typo
-
- 06 Jul, 2020 1 commit
-
-
jjsjann123 authored
* [sync BN] support non-uniform batch size across process group. TODO: test should be added once cleaned up. * updating unit tests * new unit tests for different inputs * cleaning
-
- 23 Jun, 2020 3 commits
- 14 May, 2020 1 commit
-
-
Andrew Tulloch authored
-
- 30 Apr, 2020 1 commit
-
-
Deyu Fu authored
* update fused bias relu backward kernel * adding support for not require first layer dgrad * fix bug: wrong layer in requires grad * add infrastructure for optional bias and activation, currently only support no bias and no relu * make bias and relu optional separately * add sigmoid activation option
-
- 22 Apr, 2020 2 commits
-
-
Deyu Fu authored
-
Vinicius Reis authored
The LARC optimizer wraps an underlying optimizer and then needs to be passed to amp.initialize for mixed precision. There were 3 different crashes happening in this situation, fix all of them and add a unit test. I don't know if the 'LARC' in sys.modules check ever worked. In my setup, the entry in sys.modules is 'apex.parallel.LARC'. Checking if the variable is defined seems more reliable though.
-
- 31 Mar, 2020 1 commit
-
-
Jeff Bowles authored
-
- 27 Feb, 2020 1 commit
-
-
mcarilli authored
* NHWC support for multi tensor apply * compilation fix for version<=1.4
-
- 06 Nov, 2019 1 commit
-
-
jjsjann123 authored
-
- 03 Oct, 2019 1 commit
-
-
ptrblck authored
* increase atol for Half-Float comparison to 1.5e-4 * disable tests for different opt_levels * reset atol * add bitwise accurate comparison
-
- 06 Sep, 2019 1 commit
-
-
mcarilli authored
* Pushing for build tests * Contrib files * Removing deprecated checks
-
- 03 Sep, 2019 1 commit
-
-
Deyu Fu authored
* move import of amp_C to __init__() * make fp16/32 separate lists to support mixed param types, disable double test * make zero_grad consistent between adam/novograd/lamb
-
- 27 Aug, 2019 1 commit
-
-
ptrblck authored
* add state_dict, load_state_dict * add test_restoring, test_loss_scale_decrease * disable amp outputs for checkpoint tests * add test for amp.state_dict, cleanup * add state_dict patch, add test * fixed testing, cleanup * add readme for checkpointing * add docs to source/amp * add review changes to doc
-
- 17 Aug, 2019 1 commit
-
-
Deyu Fu authored
-