"superbench/common/vscode:/vscode.git/clone" did not exist on "91b44bc5a170ae6715dc02ed84a8b9bb8022f562"
Unverified commit 03d1fcac authored by TobeyQin, committed by GitHub

Docs - Add design document (#125)

**Description**
Add Executor and Benchmarks design doc

**Major Revision**
- Add Executor design doc
- Add Benchmarks design doc
---
id: benchmarks
---
# Benchmarks Design
## Goals
The design of the `benchmarks` package aims to achieve the following goals:
**High Code Quality**
* Extract common code into base classes to reduce the effort of maintaining different benchmarks.
**Good Extensibility**
* Avoid modifying existing code when adding new benchmarks, by using a registration mechanism.
* Support predefined benchmark settings, and registration of the same benchmark with different settings.
**Good Usability**
* Provide a unified entry point to launch benchmarks.
* Unify the output format for all micro-benchmarks and E2E benchmarks, containing return_code, metrics, raw output, etc.
## Overall System Design
![Structure of `benchmarks` Package](../assets/benchmark-structure.png)
The structure of the `benchmarks` package can be divided into layers from the bottom up:
1. Abstract base classes for all kinds of benchmarks, including `Benchmark`, `ModelBenchmark`, `Microbenchmark` and `DockerBenchmark`.
   1. `Benchmark` is the base class for all benchmarks. It defines common interfaces such as `run()`, `_preprocess()`, `_postprocess()`, `_benchmark()`, `add_parser_arguments()` and so on.
   2. `ModelBenchmark` is the base class for all E2E models. It defines the abstract interfaces that need to be implemented by subclasses for different frameworks, such as `PytorchBase`, `TFBase` and `ONNXBase`. Each subclass realizes the part of the abstract interfaces that is common across models, such as `_init_distributed_setting()`, `_init_dataloader()`, `_create_optimizer()`.
   3. `Microbenchmark` is the base class for all micro benchmarks. It defines the abstract interfaces that need to be implemented by subclasses, such as `_process_raw_result()`, `_process_numeric_result()`.
   4. `DockerBenchmark` is the base class for real workloads based on Docker. It also defines the abstract interfaces that need to be implemented by subclasses.
2. Derived classes for all implemented benchmarks, which need to realize all the abstract interfaces. These benchmarks are registered into `BenchmarkRegistry`.
3. `BenchmarkRegistry` provides a way to register benchmarks, maintains all registered benchmarks, and supports launching benchmarks by `BenchmarkContext`.
4. `BenchmarkContext` provides the context to launch one benchmark, including name, parameters, platform (CPU, GPU, etc.), and framework (PyTorch, TF, ONNX, etc.).
5. `BenchmarkResult` defines the structured result for each benchmark in JSON format, including name, return_code, start_time, end_time, raw_data, summarized metrics, etc.
The `Executor` at the uppermost layer is the entry point for all benchmarks. It launches each benchmark through `BenchmarkRegistry` and fetches the `BenchmarkResult`.
## Detailed Component Design
This chapter describes the design details of all components in the `benchmarks` package.
### E2E Model Benchmarks
The E2E model benchmarks have a 4-layer inheritance relationship.
#### Training
The general process of model training is:
> init_distributed_setting -> generate_dataset -> init_dataloader -> create_model -> create_optimizer -> train
These functions are executed in the order shown in the following figure. The functions that exist in a derived class but not in the base class are abstract functions.
![Function Execution Order in Training Process](../assets/model-training-process.svg)
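To make the hook order concrete, below is a minimal sketch of a hypothetical PyTorch model benchmark. Only the underscore-prefixed method names follow the hooks listed above; the class name, the toy MLP model, and the synthetic data are made up purely for illustration.
```py
import torch


class HypotheticalPytorchMLP:
    """Illustrative only: shows the training hooks in the order described above."""

    def _init_distributed_setting(self):
        # A real benchmark would initialize torch.distributed here when
        # distributed training is requested; this sketch is single-process.
        return True

    def _generate_dataset(self):
        self._samples = torch.randn(1024, 32)
        self._labels = torch.randint(0, 2, (1024,))

    def _init_dataloader(self):
        self._dataloader = torch.utils.data.DataLoader(
            list(zip(self._samples, self._labels)), batch_size=64
        )

    def _create_model(self):
        self._model = torch.nn.Sequential(
            torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2)
        )

    def _create_optimizer(self):
        self._optimizer = torch.optim.SGD(self._model.parameters(), lr=0.01)

    def _train(self):
        loss_fn = torch.nn.CrossEntropyLoss()
        for samples, labels in self._dataloader:
            self._optimizer.zero_grad()
            loss = loss_fn(self._model(samples), labels)
            loss.backward()
            self._optimizer.step()
```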
#### Inference
The general process of model inference is:
> init_distributed_setting -> generate_dataset -> init_dataloader -> create_model -> inference
Compared with training, it simply omits the create_optimizer step.
![Function Execution Order in Inference Process](../assets/model-inference-process.svg)
### Micro Benchmarks
The micro-benchmarks have a 3-layer inheritance relationship. There are two base classes for micro-benchmarks:
* `MicroBenchmark` is a pure-Python benchmark.
* `MicroBenchmarkWithInvoke` is a benchmark that depends on a third-party executable program.
![Function Execution Order in `MicroBenchmark`](../assets/micro-benchmark-process-python.svg)
![Function Execution Order in `MicroBenchmarkWithInvoke`](../assets/micro-benchmark-process-native.svg)
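As an illustration of the `MicroBenchmarkWithInvoke` flow, here is a minimal sketch of a hypothetical micro-benchmark that wraps a third-party executable. The command and the metric name are invented for this example; only `_process_raw_result()` comes from the abstract interfaces listed above.
```py
import re
import subprocess


class HypotheticalLatencyBenchmark:
    """Illustrative only: runs an external tool and parses its raw output."""

    def __init__(self):
        # The executable and its arguments are assumptions for this sketch.
        self._commands = ['hypothetical_latency_tool --iterations 100']
        self._result = {}

    def _benchmark(self):
        for command in self._commands:
            completed = subprocess.run(
                command, shell=True, capture_output=True, text=True
            )
            if not self._process_raw_result(completed.stdout):
                return False
        return True

    def _process_raw_result(self, raw_output):
        # Parse a line such as "latency: 1.23 us" from the tool's output.
        match = re.search(r'latency:\s*([\d.]+)\s*us', raw_output)
        if match is None:
            return False
        self._result['latency'] = float(match.group(1))
        return True
```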
### Docker Benchmarks
The Docker benchmarks have a 3-layer inheritance relationship. Docker benchmarks require the Docker environment to be ready.
![Function Execution Order in DockerBenchmark Process](../assets/docker-benchmark-process.svg)
### BenchmarkRegistry
`BenchmarkRegistry` is designed to
1. Provide a way to register the benchmark.
2. Avoid modifying existing code when adding new benchmarks.
3. Reduce the redundant code for benchmarks with different configurations.
4. Support benchmark selection by platform and framework, which can be used to select the desired benchmark automatically.
#### Design
Interfaces are designed as:
```py
class BenchmarkRegistry:
    benchmarks = dict()

    @classmethod
    def register_benchmark(cls, name, class_def, parameters='', platform=None):
        """Register new benchmark, key is the benchmark name.

        Args:
            name (str): internal name of benchmark.
            class_def (Benchmark): class object of benchmark.
            parameters (str): predefined parameters of benchmark.
            platform (Platform): Platform types like CUDA, ROCM.
        """
        pass

    @classmethod
    def create_benchmark_context(cls, name, platform=Platform.CPU, parameters='', framework=Framework.NONE):
        """Create the benchmark context.

        Args:
            name (str): name of benchmark in config file.
            platform (Platform): Platform types like Platform.CPU, Platform.CUDA, Platform.ROCM.
            parameters (str): predefined parameters of benchmark.
            framework (Framework): Framework types like Framework.PYTORCH, Framework.ONNX.

        Return:
            benchmark_context (BenchmarkContext): the benchmark context.
        """
        pass

    @classmethod
    def get_all_benchmark_predefine_settings(cls):
        """Get all registered benchmarks' predefined settings.

        Return:
            benchmark_params (dict[str, dict]): key is benchmark name,
                value is a dict with structure {'parameter': default_value}.
        """
        pass

    @classmethod
    def launch_benchmark(cls, benchmark_context):
        """Select and launch benchmark.

        Args:
            benchmark_context (BenchmarkContext): the benchmark context.

        Return:
            benchmark (Benchmark): the benchmark instance containing all results,
                None means the context is invalid or no benchmark is found.
        """
        pass
```
The structure of `BenchmarkRegistry.benchmarks` is designed as:
```py
dictionary = {
    'benchmark1': {
        'tag1': (benchmark1_tag1_class, predefined_arguments),
        'tag2': (benchmark1_tag2_class, predefined_arguments),
    },
    'benchmark2': {
        'tag1': (benchmark2_tag1_class, predefined_arguments),
        'tag2': (benchmark2_tag2_class, predefined_arguments),
    },
    ...
}
```
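As a rough illustration (not the actual implementation), `register_benchmark` could populate this structure as sketched below; using the benchmark's framework name as the tag is an assumption made only for this sketch.
```py
class BenchmarkRegistrySketch:
    """Illustrative only: how register_benchmark could fill the dictionary above."""
    benchmarks = dict()

    @classmethod
    def register_benchmark(cls, name, class_def, parameters='', platform=None):
        # The tag choice below (framework name) is an assumption for illustration.
        tag = getattr(class_def, 'framework', 'default')
        cls.benchmarks.setdefault(name, {})[tag] = (class_def, parameters)
```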
#### Examples
For E2E model benchmarks:
```py
BenchmarkRegistry.register_benchmark('bert-large', PytorchBERT, parameters='--hidden_size=1024 --num_hidden_layers=24 --num_attention_heads=16 --intermediate_size=4096')
BenchmarkRegistry.register_benchmark('bert-base', PytorchBERT, parameters='--hidden_size=768 --num_hidden_layers=12 --num_attention_heads=12 --intermediate_size=3072')
```
For micro-benchmarks:
```py
BenchmarkRegistry.register_benchmark('kernel-launch', KernelLaunch)
```
## Interfaces
This chapter describes the interfaces with the caller (SuperBench Executor), including the input/output format and the invocation method.
### Inputs
The input needed by the `benchmarks` package is simple: just the context object of the benchmark to run.
### Invoke
```py
context = BenchmarkRegistry.create_benchmark_context(
    benchmark_name, parameters=xxx, framework=xxx, platform=xxx
)
benchmark = BenchmarkRegistry.launch_benchmark(context)
if benchmark:
    logger.info(
        'benchmark: {}, return code: {}, result: {}'.format(
            benchmark.name, benchmark.return_code, benchmark.result
        )
    )
```
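For instance, launching the `kernel-launch` micro-benchmark registered in the earlier example might look like the following sketch; the platform value here is illustrative.
```py
# Launch the 'kernel-launch' micro-benchmark registered earlier (illustrative).
context = BenchmarkRegistry.create_benchmark_context(
    'kernel-launch', platform=Platform.CUDA, framework=Framework.NONE
)
benchmark = BenchmarkRegistry.launch_benchmark(context)
if benchmark:
    logger.info('overhead: {}'.format(benchmark.result))
```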
### Outputs
#### Design
```py
result = {
    'name': 'benchmark_name',
    'type': BenchmarkType,
    'run_count': N,
    'return_code': ReturnCode,
    'start_time': date,
    'end_time': date,
    'raw_data': {  # Key is metric name, array for N runs.
        'metrics1': List[List[Number]] or List[str],
        ...
        'metricsM': List[List[Number]] or List[str],
    },
    'result': {  # Key is metric name, array for N runs.
        'metrics1': List[Number],
        ...
        'metricsM': List[Number],
    },
}
```
#### Example
Model Benchmarks:
```py
result = {
    'name': 'bert-large',
    'type': 'model',
    'run_count': N,
    'return_code': 0,
    'raw_data': {
        'throughput-train-float32': [[step1_time, ..., stepK_time], ..., []],
        'throughput-train-float16': [[step1_time, ..., stepK_time], ..., []],
        'throughput-inference-float32': [[step1_time, ..., stepK_time], ..., []],
        'throughput-inference-float16': [[step1_time, ..., stepK_time], ..., []],
    },
    'result': {
        'throughput-train-float32': [avg_throughput1, ..., avg_throughputN],
        'throughput-train-float16': [avg_throughput1, ..., avg_throughputN],
        'throughput-inference-float32': [avg_throughput1, ..., avg_throughputN],
        'throughput-inference-float16': [avg_throughput1, ..., avg_throughputN],
    },
}
```
Micro Benchmarks:
```py
result = {
    'name': 'kernel_launch',
    'type': 'micro',
    'run_count': N,
    'return_code': 0,
    'raw_data': {
        'raw_output': [raw_output1, ..., raw_outputN],
    },
    'result': {  # Key is metric name
        'overhead': [overhead1, ..., overheadN],
    },
}
```
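As an illustration of how a caller such as the Executor might consume this structure, here is a minimal sketch; the helper function is hypothetical, while the `results.json` file name matches the output directory layout described later in this document.
```py
import json


def save_benchmark_result(result, path='results.json'):
    # Hypothetical helper: persist the structured result only when the
    # benchmark succeeded, so it can be collected per rank later.
    if result['return_code'] == 0:
        with open(path, 'w') as output_file:
            json.dump(result, output_file, indent=2)
```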
---
id: overview
---
# Superbench Design
## Goals
SuperBench aims to provide distributed testing for clusters with 100 ~ 1000 nodes,
and to make it modular, easy to use, and easy to scale up.
Therefore, SuperBench would like to provide a simpler and more convenient way to:
1. Install and deploy SuperBench in a raw cluster, including bare-metal, on-premises, and cloud environments.
2. Configure SuperBench configurations, by specifying config file or using command line arguments.
3. Execute the command locally to launch SuperBench on selected devices/nodes or on all nodes in the cluster. Use a single device, selected devices, or all devices, including GPUs and InfiniBand.
4. Support distributed benchmarks in SuperBench, including NCCL tests using MPI, model performance tests using torch distributed, etc.
5. Collect logs during running, save results after running, and merge all nodes' results into one summary report.
6. Provide a unified interface for all benchmarks, including how to run on different device vendors (NVIDIA/AMD), how to run in different modes (local/mpi), how to pass configurations, and how to save results.
## Architecture
![SuperBench Workflow](../assets/executor_workflow.png)
## Pipeline
![Pipeline](../assets/executor-pipeline.png)
1. User prepares the config file and host file. Both files can be omitted by using the default config or specifying command line arguments.
2. User runs SuperBench CLI on the head node. The command line interface accepts a set of arguments and provides help information to the user.
3. SuperBench CLI parses the input config file, host file, and arguments, loads them into one config object, and calls SuperBench Runner to start the test.
4. SuperBench Runner parses the config and executes the following steps on all nodes specified in the config:
   1. Check the connection.
   2. Check the Docker environment and the SuperBench Docker image.
   3. Start a SuperBench Docker container, and mount necessary paths.
   4. Prepare the running context, including local package code, config file, SSH key pairs, and SSH config for passwordless use, and distribute it to all nodes.
   5. Prepare the output path.
   6. Start the SSH service and check inter-connections.
5. SuperBench Runner loops over all benchmarks and all modes in each benchmark. For each mode in each benchmark, SuperBench Runner calls SuperBench Executor inside the Docker container on the corresponding nodes to start an execution.
6. SuperBench Executor parses the config object and starts benchmarks inside the Docker container one by one.
7. SuperBench Executor gets the return code and results of each benchmark once the benchmark finishes.
8. SuperBench Executor sends the return code and results back to SuperBench Runner. SuperBench Runner checks the return code and moves to the next execution.
9. Once all benchmarks have finished, SuperBench Runner reduces results from all compute nodes, saves log and result files, then summarizes running results to SuperBench CLI.
10. SuperBench CLI returns the results to the user.
## Components
### SuperBench CLI
SuperBench CLI provides the command-line interface for users to use SuperBench and run related benchmarks.
The CLI provides a set of commands and corresponding help information to users.
### SuperBench Runner
SuperBench Runner is the component that configures environments, prepares context, runs benchmarks, collects logs from nodes, and summarizes results.
It controls the running logic, including when to start, which node or set of nodes to run on, whether a barrier is needed, etc.
SuperBench Runner either communicates with hosts through SSH to configure running environments and prepare context,
or talks with SuperBench Executor inside the Docker container to execute benchmarks and collect logs and results on each node.
Here are the details of the working directory structure for SuperBench Runner.
```bash
/path/to/working/directory
└── outputs/                              # output root directory
    └── datetime                          # output directory name in %Y-%m-%d_%H-%M-%S format
        ├── nodes                         # nodes directory
        │   └── node-0                    # output collected from each node
        │       ├── benchmarks            # benchmarks directory
        │       │   └── benchmark-0       # output for each benchmark
        │       │       └── rank-0        # output for each rank in each benchmark
        │       │           └── results.json   # raw results
        │       └── sb-exec.log           # collected SuperBench Executor log
        ├── sb-run.log                    # SuperBench Runner log
        ├── sb.config.yaml                # SuperBench configuration snapshot
        ├── ssh_config                    # generated SSH config file
        ├── id_ed25519                    # generated SSH private key for each run
        └── id_ed25519.pub                # generated SSH public key for each run
```
### SuperBench Executor
SuperBench Executor is the component that runs benchmarks inside the Docker container.
It executes each benchmark and handles all pre- and post-processing, including health check, result validation, result processing, etc.
Here is the SuperBench Executor's working directory structure inside the Docker container.
The `/root` directory is mounted from `$HOME/sb-workspace` on the host path.
```bash
/root
├── .ssh                                  # SSH directory
│   ├── config                            # generated SSH config file
│   ├── id_ed25519                        # generated SSH private key from Runner
│   └── id_ed25519.pub                    # generated SSH public key from Runner
└── outputs/                              # output root directory
    └── datetime                          # output directory name in %Y-%m-%d_%H-%M-%S format
        ├── benchmarks                    # benchmarks directory
        │   └── benchmark-0               # output for each benchmark
        │       └── rank-0                # output for each rank in each benchmark
        │           └── results.json      # raw results
        ├── sb.config.yaml                # SuperBench configuration snapshot
        └── sb.env                        # SuperBench runtime environment variables
```
### [Benchmarks](benchmarks.md)
Benchmarks are a set of tests that actually run on the nodes to measure hardware performance.
Here are the related concepts; the NCCL benchmark is used as an example, and a sketch of the resulting hierarchy follows the list.
1. Module

   One benchmark is one module; it has a set of abstract methods that should be implemented, e.g., pre-check, measure, post-check, save result, etc. The whole NCCL benchmark is one module.
2. Mode

   One module may have several modes when running, and each mode has a corresponding method to run. For example, local mode runs inside one node; mpi mode runs on all nodes but the command is executed on only one node; pair mode runs between two nodes for every combination; etc. The NCCL module may have a local mode and an mpi mode.
3. Task Group

   One mode may have several task groups to run, and each task group has a barrier when running, so the next task group will not start until the current one has finished on all nodes. The mpi mode of the NCCL module will run all-reduce, all-gather, etc., each of which should be treated as one task group.
4. Task

   One task group may contain several tasks to run; no barrier is needed among the tasks inside one task group.
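To make the hierarchy concrete, the NCCL example above could be pictured as the following nested data; the structure and field names are illustrative only and do not represent the actual configuration schema.
```py
# Illustrative only: one module (nccl), its modes, task groups, and tasks.
nccl_benchmark = {
    'module': 'nccl',
    'modes': {
        'local': {  # runs inside one node
            'task_groups': [
                {'name': 'all-reduce', 'tasks': ['all-reduce on local GPUs']},
            ],
        },
        'mpi': {  # command issued on one node, runs on all nodes
            'task_groups': [  # barrier between task groups
                {'name': 'all-reduce', 'tasks': ['mpirun all-reduce']},
                {'name': 'all-gather', 'tasks': ['mpirun all-gather']},
            ],
        },
    },
}
```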