2. Install system-wide `cuda` if you don't have it already ([NVIDIA instructions](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html)). Ideally use [the premade packages for your distro](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#package-manager-installation).
Use the same major version as PyTorch's CUDA build. To check both versions:
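Something like the following works (assuming the system-wide `nvcc` is on your `PATH`; otherwise call it via its full path, e.g. `/usr/local/cuda/bin/nvcc`):

```bash
# CUDA version the PyTorch binary was built with
python -c "import torch; print(torch.version.cuda)"
# system-wide CUDA version
nvcc --version
```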
If the PyTorch and system-wide CUDA minor versions mismatch, it's not a problem; you just need to work around `apex`'s build-time version check by applying this patch first and then building it.
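The patch itself is not reproduced here; the sketch below only illustrates the idea. The exact file and line depend on the apex revision you check out (recent revisions do the check in `check_cuda_torch_binary_vs_bare_metal()` in `setup.py`), and the build flags are the ones from apex's README at the time, so double-check them against the current README:

```bash
git clone https://github.com/NVIDIA/apex
cd apex
# edit setup.py: in check_cuda_torch_binary_vs_bare_metal(), keep the
# major-version comparison but drop (or downgrade to a warning) the minor-version one
pip install -v --disable-pip-version-check --no-cache-dir \
    --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```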
You can, of course, run this as a slurm script; here is [a full slurm script example](https://github.com/bigscience-workshop/bigscience/blob/d57b76bb592832bb4d2054cd5cbf132796be2d83/train/tr11-176B-ml/setup-test-n2.slurm), which includes tweaks to get `MASTER_ADDR` and a few other bits right under the SLURM environment on JeanZay. These tweaks may or may not be needed if you run it elsewhere.
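If you adapt it to another cluster, the usual pattern for deriving `MASTER_ADDR` under SLURM looks roughly like this (a generic sketch, not the JeanZay-specific logic; the port is just an arbitrary free one):

```bash
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=6000
export MASTER_ADDR MASTER_PORT
```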
Remember to wipe out `$CHECKPOINT_PATH` if you change the model shape and a checkpoint with the old shapes has already been saved there.
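For example (assuming `$CHECKPOINT_PATH` is the same variable your launch script points at):

```bash
rm -rf "$CHECKPOINT_PATH"
```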
                       help='Should the sequence length be adapted to the batch during evaluation, if in fp16 the results will be slightly different due to numerical errors but greatly speed up evaluation.')
    group.add_argument('--eval_fp32', default=False, action='store_true', help='Should the evaluation run in fp32')
    group.add_argument('--intermed_results', default=False, action='store_true', help='Whether to print & write intermediate results for each task')
    group.add_argument('--bootstrap_iters', type=int, default=100000, help='How many iterations to use for stderr estimation')
    group.add_argument('--micro_bs_multiplier', type=int, default=1, help='Increase the global batch size to remove bubble when pipeline parallel')
    return parser
from megatron.global_vars import _parse_args

def main():
    # parse the megatron args, but wait with initializing megatron;
    # avoid printing the arguments, since they will later be overridden.
    args = _parse_args(tasks_args)
    load_path = args.load
    model = load_ds_checkpoint_and_setup_megatron(args)

    args = get_args()
    if args.deepspeed and args.adaptive_seq_len:
        # adaptive_seq_len hack #1:
        # CL (curriculum learning) automatically enables reset_activation_shape(), which allows us to change input shapes,
        # and it also reshapes the attention scores in attention_mask_func