## Pipeline parallelism

We choose to mimic "torch" eager semantics:
- Declare modules/blocks at init time
- Declare edges during forward

# Scheduling

# All forward, all backward (easy, but memory expensive)

# 1f1b (much nicer)

We're going to assume that all Pipeline blocks are assigned to ranks contiguously.

Warmup:
```
Rank 1: [forward(), forward(), forward(), forward(), backward()]
Rank 2: [forward(), forward(), forward(), forward(), backward(), backward()]
Rank 3: [forward(), forward(), forward(), forward(), backward(), backward(), backward()]
Rank 4: [forward(), backward(), forward(), backward(), forward(), backward()]
```

// TODO @thomasw21: How do we extrapolate this notion to a tree? Not sure exactly, but a topological ordering should be fine.

# TODOs:
- [ ] Passing activations that don't require a backward breaks the schedule: 1f1b only works because each stage runs the same number of forwards and backwards.
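
To make the difference between the two schedules concrete, here is a minimal sketch of how per-rank action lists could be generated. It assumes the common 1f1b formulation in which rank `r` (0-indexed) out of `p` ranks runs `p - r - 1` warmup forwards, then alternates one forward with one backward, then drains the remaining backwards. The function names, the 0-indexed ranks and the string actions are all hypothetical and chosen for illustration; this is not the scheduler used in this codebase, and its ordering for the early ranks may differ from the listing above.

```python
from __future__ import annotations


def all_forward_all_backward(n_micro_batches: int) -> list[str]:
    """Every forward first, then every backward: simple, but all activations stay alive."""
    return ["forward"] * n_micro_batches + ["backward"] * n_micro_batches


def one_f_one_b(rank: int, n_ranks: int, n_micro_batches: int) -> list[str]:
    """Action list for one rank (0-indexed, assumption) under a common 1f1b formulation."""
    # Warmup: later ranks need fewer forwards before their first backward can run.
    n_warmup = min(n_ranks - rank - 1, n_micro_batches)
    n_steady = n_micro_batches - n_warmup
    actions = ["forward"] * n_warmup
    # Steady state: alternate one forward with one backward.
    for _ in range(n_steady):
        actions += ["forward", "backward"]
    # Cooldown: run the backwards still owed for the warmup forwards.
    actions += ["backward"] * n_warmup
    return actions


def peak_in_flight(actions: list[str]) -> int:
    """Largest number of forwards whose backward has not run yet (a rough proxy for activation memory)."""
    in_flight = peak = 0
    for action in actions:
        in_flight += 1 if action == "forward" else -1
        peak = max(peak, in_flight)
    return peak


if __name__ == "__main__":
    n_ranks, n_micro_batches = 4, 8
    for rank in range(n_ranks):
        schedule = one_f_one_b(rank, n_ranks, n_micro_batches)
        print(f"Rank {rank + 1}: peak in flight = {peak_in_flight(schedule)}")
    baseline = all_forward_all_backward(n_micro_batches)
    print(f"All forward, all backward: peak in flight = {peak_in_flight(baseline)}")
```

With 4 ranks and 3 micro-batches, the last rank needs no warmup, so `one_f_one_b(3, 4, 3)` strictly alternates forward and backward, which matches the `Rank 4` row above. Every list this sketch produces contains the same number of forwards and backwards, which is exactly the invariant the TODO says breaks once some activations don't require a backward.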