---
id: configuration
---

# Configuration

## SuperBench config

SuperBench uses a [YAML](https://yaml.org/spec/1.2/spec.html) config file to configure the details of benchmarking,
including which benchmarks to run, which distributed mode to choose, which parameters to use, etc.

Here's what the default config file looks like.

```yaml title="superbench/config/default.yaml"
# SuperBench Config
superbench:
  enable: null
  var:
    default_local_mode: &default_local_mode
      enable: true
      modes:
        - name: local
          proc_num: 8
          prefix: CUDA_VISIBLE_DEVICES={proc_rank}
          parallel: yes
    default_pytorch_mode: &default_pytorch_mode
      enable: true
      modes:
        - name: torch.distributed
          proc_num: 8
          node_num: 1
      frameworks:
        - pytorch
    common_model_config: &common_model_config
      duration: 0
      num_warmup: 16
      num_steps: 128
      precision:
        - float32
        - float16
      model_action:
        - train
  benchmarks:
    kernel-launch:
      <<: *default_local_mode
    gemm-flops:
      <<: *default_local_mode
    cudnn-function:
      <<: *default_local_mode
    cublas-function:
      <<: *default_local_mode
    matmul:
      <<: *default_local_mode
      frameworks:
        - pytorch
    sharding-matmul:
      <<: *default_pytorch_mode
    computation-communication-overlap:
      <<: *default_pytorch_mode
    gpt_models:
      <<: *default_pytorch_mode
      models:
        - gpt2-small
        - gpt2-large
      parameters:
        <<: *common_model_config
        batch_size: 4
    bert_models:
      <<: *default_pytorch_mode
      models:
        - bert-base
        - bert-large
      parameters:
        <<: *common_model_config
        batch_size: 8
    lstm_models:
      <<: *default_pytorch_mode
      models:
        - lstm
      parameters:
        <<: *common_model_config
        batch_size: 128
    cnn_models:
      <<: *default_pytorch_mode
      models:
        - resnet50
        - resnet101
        - resnet152
        - densenet169
        - densenet201
        - vgg11
        - vgg13
        - vgg16
        - vgg19
      parameters:
        <<: *common_model_config
        batch_size: 128
```
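The `&`, `*`, and `<<:` syntax above uses YAML anchors, aliases, and merge keys to avoid repeating shared settings across benchmarks. A minimal sketch of how a merge key resolves, using PyYAML (whose `SafeLoader` resolves YAML 1.1 merge keys); the snippet below is a trimmed-down fragment for illustration, not the full default config:

```python
import yaml  # PyYAML; its SafeLoader resolves YAML 1.1 merge keys ("<<")

doc = """
var:
  default_local_mode: &default_local_mode
    enable: true
    modes:
      - name: local
        proc_num: 8
benchmarks:
  kernel-launch:
    <<: *default_local_mode   # inherit all keys from the anchor
  matmul:
    <<: *default_local_mode
    frameworks:               # and add benchmark-specific keys
      - pytorch
"""

config = yaml.safe_load(doc)
print(config["benchmarks"]["kernel-launch"]["enable"])  # True
print(config["benchmarks"]["matmul"]["frameworks"])     # ['pytorch']
```

Because the merge happens at load time, every benchmark that references `*default_local_mode` sees the same `enable`, `modes`, and `proc_num` values without duplicating them in the file.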

By default, all benchmarks in the default configuration will be run if you don't specify a customized configuration.

If you want a quick try, you can modify this config a little. For example, to run only the ResNet models:
1. Copy the default config to a file named `resnet.yaml` in the current path.
  ```bash
  cp superbench/config/default.yaml resnet.yaml
  ```
2. Enable only `cnn_models` in the config and remove all models except ResNet under `benchmarks.cnn_models.models`.
  ```yaml {3,10-13} title="resnet.yaml"
  # SuperBench Config
  superbench:
    enable: ['cnn_models']
    var:
  # ...
  # omit the middle part
  # ...
      cnn_models:
        <<: *default_pytorch_mode
        models:
          - resnet50
          - resnet101
          - resnet152
        parameters:
          <<: *common_model_config
          batch_size: 128
  ```
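The `enable` field controls which benchmarks run: `null` means all benchmarks, while a list restricts execution to the named ones. The hypothetical helper below illustrates this selection logic (the function name and the exact semantics are an assumption for illustration; SuperBench's real runner implements this internally):

```python
def select_benchmarks(config):
    """Return benchmark names to run, honoring `superbench.enable`.

    Hypothetical helper for illustration only -- not SuperBench's API.
    """
    sb = config["superbench"]
    all_names = list(sb["benchmarks"])
    enabled = sb.get("enable")
    if enabled is None:           # null -> run everything
        return all_names
    if isinstance(enabled, str):  # tolerate a single name as a string
        enabled = [enabled]
    return [name for name in all_names if name in enabled]

default_cfg = {"superbench": {"enable": None,
                              "benchmarks": {"kernel-launch": {}, "cnn_models": {}}}}
resnet_cfg = {"superbench": {"enable": ["cnn_models"],
                             "benchmarks": {"kernel-launch": {}, "cnn_models": {}}}}
print(select_benchmarks(default_cfg))  # ['kernel-launch', 'cnn_models']
print(select_benchmarks(resnet_cfg))   # ['cnn_models']
```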

## Ansible Inventory

SuperBench leverages [Ansible](https://docs.ansible.com/ansible/latest/) to run benchmarking workloads on managed nodes.
You need to provide an [inventory](https://docs.ansible.com/ansible/latest/user_guide/intro_inventory.html) file
to configure the host list for managed nodes.

Here are some basic examples to use as a starting point.
* One managed node, which is the same node as the control node.
  ```ini title="local.ini"
  [all]
  localhost ansible_connection=local
  ```
* Two managed nodes: one is the control node and the other is accessed remotely.
  ```ini title="mix.ini"
  [all]
  localhost ansible_connection=local
  10.0.0.100 ansible_user=username ansible_ssh_private_key_file=id_rsa
  ```
* Eight managed nodes, all can be accessed remotely.
  ```ini title="remote.ini"
  [all]
  10.0.0.[100:103]
  10.0.0.[200:203]

  [all:vars]
  ansible_user=username
  ansible_ssh_private_key_file=id_rsa
  ```
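The `[100:103]` patterns in the last example are Ansible's inclusive numeric host ranges, so `10.0.0.[100:103]` covers four hosts. A small sketch of the expansion (illustrative only; Ansible's own inventory parser handles this, and also supports steps and alphabetic ranges not shown here):

```python
import re

def expand_host_range(pattern):
    """Expand an Ansible-style inclusive numeric range like '10.0.0.[100:103]'."""
    m = re.search(r"\[(\d+):(\d+)\]", pattern)
    if not m:
        return [pattern]  # no range: a single host
    start, end = int(m.group(1)), int(m.group(2))
    # Ansible ranges are inclusive on both ends
    return [pattern[:m.start()] + str(i) + pattern[m.end():]
            for i in range(start, end + 1)]

print(expand_host_range("10.0.0.[100:103]"))
# ['10.0.0.100', '10.0.0.101', '10.0.0.102', '10.0.0.103']
```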