# Variational Autoencoder for Collaborative Filtering in TensorFlow



## Table of Contents

  * [Model Overview](#model-overview)
     * [Model Architecture](#model-architecture)
  * [Testing Guide](#testing-guide)
  * [Parameter Description](#parameter-description)
     * [Main Parameters](#main-parameters)
     * [Other Parameters](#other-parameters)
  * [Inference Testing](#inference-testing)
  * [Reference Results](#reference-results)

## Model Overview

This model follows the paper "Variational Autoencoders for Collaborative Filtering" (https://arxiv.org/abs/1802.05814).


### Model Architecture

<p align="center">
   <img width="70%" src="images/autoencoder.png" />
   <br>
   Figure 1. The architecture of the VAE-CF model </p>
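The structure in Figure 1 — an MLP encoder producing a Gaussian over a latent code, and an MLP decoder producing a multinomial distribution over all items — can be sketched in plain numpy. This is a minimal illustration only: the layer sizes (200 hidden, 64 latent) are made up for the example, and the repository's actual TensorFlow implementation may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n_items, hidden, latent = 1000, 200, 64  # illustrative sizes only

# Randomly initialised weights for a one-hidden-layer encoder/decoder.
W_enc = rng.normal(0, 0.01, (n_items, hidden))
W_mu = rng.normal(0, 0.01, (hidden, latent))
W_logvar = rng.normal(0, 0.01, (hidden, latent))
W_dec = rng.normal(0, 0.01, (latent, n_items))

def forward(x):
    """x: binary user-item interaction vectors, shape (batch, n_items)."""
    h = relu(x @ W_enc)
    mu, logvar = h @ W_mu, h @ W_logvar
    # Reparameterisation trick: sample z ~ N(mu, sigma^2).
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)
    # Decoder outputs a multinomial distribution over all items.
    return softmax(z @ W_dec)

x = (rng.random((2, n_items)) < 0.01).astype(float)
probs = forward(x)
print(probs.shape)  # (2, 1000); each row sums to 1
```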


## Testing Guide
1. Data preparation
* Dataset: [MovieLens 20m dataset](https://grouplens.org/datasets/movielens/20m/).

* Processing: download and extract the archive into the `/data/ml-20m/extracted/` directory

       ```bash
       cd /data
       mkdir -p ml-20m/extracted
       cd ml-20m/extracted
       wget http://files.grouplens.org/datasets/movielens/ml-20m.zip
       unzip ml-20m.zip
       ```
   

2. Data preprocessing

   ```bash
   python prepare_dataset.py
   # or point it at the extracted data: --data_dir=/data/ml-20m/extracted
   ```

3. Single-GPU training test
   ```bash
   export MIOPEN_USE_APPROXIMATE_PERFORMANCE=0
   export MIOPEN_FIND_MODE=1
   python main.py --train --checkpoint_dir ./checkpoints
   ```
4. Multi-GPU training test

   ```bash
   mpirun --bind-to numa --allow-run-as-root -np 8 -H localhost:8 python main.py --train --amp --checkpoint_dir ./checkpoints
   ```
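Training optimizes the objective from the paper: a multinomial reconstruction log-likelihood plus a KL term weighted by an annealed coefficient. Below is a minimal numpy sketch of the reconstruction term only; it is an illustration, not the repository's TensorFlow code.

```python
import numpy as np

def log_softmax(logits):
    m = logits.max(axis=-1, keepdims=True)
    return logits - m - np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))

def multinomial_log_likelihood(x, logits):
    # Each user's click vector x is scored under the decoder's
    # multinomial distribution; higher is better.
    return (x * log_softmax(logits)).sum(axis=-1).mean()

x = np.array([[1.0, 0.0, 1.0, 0.0]])
good = np.array([[5.0, -5.0, 5.0, -5.0]])   # mass on the clicked items
bad = np.array([[-5.0, 5.0, -5.0, 5.0]])    # mass on the wrong items
print(multinomial_log_likelihood(x, good) > multinomial_log_likelihood(x, bad))  # True
```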


## Parameter Description

### Main Parameters

Common runtime parameters:
* `--data_dir`: path to the dataset; defaults to `/data`
* `--checkpoint_dir`: directory where trained model parameters are saved


### Other Parameters

```bash
python main.py --help

usage: main.py [-h] [--train] [--test] [--inference_benchmark]
               [--amp] [--epochs EPOCHS]
               [--batch_size_train BATCH_SIZE_TRAIN]
               [--batch_size_validation BATCH_SIZE_VALIDATION]
               [--validation_step VALIDATION_STEP]
               [--warm_up_epochs WARM_UP_EPOCHS]
               [--total_anneal_steps TOTAL_ANNEAL_STEPS]
               [--anneal_cap ANNEAL_CAP] [--lam LAM] [--lr LR] [--beta1 BETA1]
               [--beta2 BETA2] [--top_results TOP_RESULTS] [--xla] [--trace]
               [--activation ACTIVATION] [--log_path LOG_PATH] [--seed SEED]
               [--data_dir DATA_DIR] [--checkpoint_dir CHECKPOINT_DIR]

Train a Variational Autoencoder for Collaborative Filtering in TensorFlow

optional arguments:
  -h, --help            show this help message and exit
  --train               Run training of VAE
  --test                Run validation of VAE
  --inference_benchmark
                        Benchmark the inference throughput and latency
  --amp                 Enable Automatic Mixed Precision
  --epochs EPOCHS       Number of epochs to train
  --batch_size_train BATCH_SIZE_TRAIN
                        Global batch size for training
  --batch_size_validation BATCH_SIZE_VALIDATION
                        Used both for validation and testing
  --validation_step VALIDATION_STEP
                        Train epochs for one validation
  --warm_up_epochs WARM_UP_EPOCHS
                        Number of epochs to omit during benchmark
  --total_anneal_steps TOTAL_ANNEAL_STEPS
                        Number of annealing steps
  --anneal_cap ANNEAL_CAP
                        Annealing cap
  --lam LAM             Regularization parameter
  --lr LR               Learning rate
  --beta1 BETA1         Adam beta1
  --beta2 BETA2         Adam beta2
  --top_results TOP_RESULTS
                        Number of results to be recommended
  --xla                 Enable XLA
  --trace               Save profiling traces
  --activation ACTIVATION
                        Activation function
  --log_path LOG_PATH   Path to the detailed JSON log to be created
  --seed SEED           Random seed for TensorFlow and numpy
  --data_dir DATA_DIR   Directory for storing the training data
  --checkpoint_dir CHECKPOINT_DIR
                        Path for saving a checkpoint after the training

```
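`--total_anneal_steps` and `--anneal_cap` control the KL-annealing schedule described in the paper: the KL weight grows linearly with the global training step and is clipped at the cap. A sketch under that assumption (the exact logic in `main.py` may differ):

```python
def kl_anneal_beta(step, total_anneal_steps, anneal_cap):
    """Linear KL annealing: the KL weight grows with the global step
    and is clipped at anneal_cap once it reaches that value."""
    if total_anneal_steps > 0:
        return min(anneal_cap, step / total_anneal_steps)
    return anneal_cap

print(kl_anneal_beta(20000, 200000, 0.2))   # 0.1 (still warming up)
print(kl_anneal_beta(500000, 200000, 0.2))  # 0.2 (clipped at the cap)
```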


## Inference Testing

The inference benchmark is enabled with the `--inference_benchmark` flag:

```bash
python main.py --inference_benchmark --checkpoint_dir ./checkpoints
```
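The Latency 90%/95%/99% columns reported in the tables below are percentiles over many timed batches. A minimal sketch of how such a summary can be computed from raw per-batch timings (the benchmark's own bookkeeping in `main.py` may differ; `summarize_latencies` is a hypothetical helper):

```python
import statistics

def summarize_latencies(latencies, batch_size):
    """Reduce per-batch wall-clock timings (in seconds) to average
    throughput in users/s plus average and tail latency."""
    s = sorted(latencies)
    pct = lambda p: s[min(len(s) - 1, int(p / 100 * len(s)))]
    return {
        "throughput_avg": batch_size / statistics.mean(s),
        "latency_avg": statistics.mean(s),
        "latency_90": pct(90),
        "latency_95": pct(95),
        "latency_99": pct(99),
    }

# Synthetic timings around 1-2 ms, batch size 1 as in the tables below:
fake = [0.001 + 0.0001 * (i % 10) for i in range(1000)]
stats = summarize_latencies(fake, batch_size=1)
print(round(stats["latency_avg"], 5))  # 0.00145
```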

## Reference Results



### Training accuracy results

#### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)

| GPUs    | Batch size / GPU    | Accuracy - TF32  | Accuracy - mixed precision  |   Time to train - TF32 [s] |  Time to train - mixed precision [s] | Time to train speedup (TF32 to mixed precision) |
|-------:|-----------------:|-------------:|-----------:|----------------:|--------------:|---------------:|
|      1 |       24,576 |         0.430298 |       0.430398 |     112.8  |    109.4 |           1.03 |
|      8 |        3,072 |         0.430897 |       0.430353 |      25.9 |     30.4 |           0.85 |

#### Training accuracy: NVIDIA DGX-1 (8x V100 32GB)

| GPUs | Batch size / GPU | Accuracy - FP32 | Accuracy - mixed precision  | Time to train - FP32 [s] |  Time to train - mixed precision [s] | Time to train speedup (FP32 to mixed precision) |
|-------:|-----------------:|-------------:|-----------:|----------------:|--------------:|---------------:|
|      1 |       24,576 |         0.430592 |       0.430525 |     346.5 |   186.5  |           1.86 |
|      8 |        3,072 |         0.430753 |       0.431202 |      59.1 |    42.2 |           1.40  |


### Training performance results

Performance numbers below show throughput in users processed per second. They were averaged over an entire training run.

#### Training performance: NVIDIA DGX A100 (8x A100 40GB)

| GPUs   | Batch size / GPU | Throughput - TF32  | Throughput - mixed precision    | Throughput speedup (TF32 - mixed precision)   | Strong scaling - TF32    | Strong scaling - mixed precision |
|-------:|------------:|-------------------:|-----------------:|---------------------:|---:|---:|
|      1 |       24,576 |    354,032   |         365,474   |                 1.03 | 1    | 1    |
|      8 |        3,072 |    1,660,700 |         1,409,770 |                 0.85 | 4.69 | 3.86 |
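The strong-scaling column is derived from the throughput columns: multi-GPU throughput divided by single-GPU throughput (the ideal value equals the GPU count). Using the TF32 numbers from the table above:

```python
# Throughput numbers (users/s) for TF32 taken from the table above.
throughput_1gpu = 354_032
throughput_8gpu = 1_660_700

strong_scaling = throughput_8gpu / throughput_1gpu   # speedup over 1 GPU
efficiency = strong_scaling / 8                      # fraction of ideal 8x
print(round(strong_scaling, 2))  # 4.69
```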

#### Training performance: NVIDIA DGX-1 (8x V100 32GB)

| GPUs   | Batch size / GPU   | Throughput - FP32    | Throughput - mixed precision    | Throughput speedup (FP32 - mixed precision)   | Strong scaling - FP32    | Strong scaling - mixed precision |
|-------:|------------:|-------------------:|-----------------:|---------------------:|---:|---:|
|      1 |       24,576 |             114,125 | 213,283        |                 1.87 | 1 | 1 |
|      8 |        3,072 |             697,628 |      1,001,210 |                 1.44 | 6.11 | 4.69 |

### Inference performance results

Our results were obtained by running:
```bash
python main.py  --inference_benchmark [--amp]
```
in the TensorFlow 20.06 NGC container.

We use users processed per second as a throughput metric for measuring inference performance.
All latency numbers are in seconds.

#### Inference performance: NVIDIA DGX A100 (1x A100 40GB)
TF32

|   Batch size |   Throughput Avg |   Latency Avg |   Latency 90% |   Latency 95% |   Latency 99%  |
|-------------:|-----------------:|--------------:|--------------:|--------------:|---------------:|
| 1 | 1181 | 0.000847 | 0.000863 | 0.000871 | 0.000901 |

FP16

|   Batch size |   Throughput Avg |   Latency Avg |   Latency 90% |   Latency 95% |   Latency 99%  |
|-------------:|-----------------:|--------------:|--------------:|--------------:|---------------:|
| 1 | 1215 | 0.000823 | 0.000858 | 0.000864 | 0.000877 |

#### Inference performance: NVIDIA DGX-1 (1x V100 16GB)

FP32

|   Batch size |   Throughput Avg |   Latency Avg |   Latency 90% |   Latency 95% |   Latency 99%  |
|-------------:|-----------------:|--------------:|--------------:|--------------:|---------------:|
|   1 | 718 |  0.001392 |   0.001443 | 0.001458 | 0.001499 |


FP16

|   Batch size |   Throughput Avg |   Latency Avg |   Latency 90% |   Latency 95% |   Latency 99%  |
|-------------:|-----------------:|--------------:|--------------:|--------------:|---------------:|
| 1 |   707 | 0.001413 | 0.001511 | 0.001543 | 0.001622 |