# Variational Autoencoder for Collaborative Filtering in TensorFlow
## Table of Contents
* [Model Overview](#model-overview)
  * [Model Architecture](#model-architecture)
* [Testing Guide](#testing-guide)
* [Parameters](#parameters)
  * [Main Parameters](#main-parameters)
  * [Other Parameters](#other-parameters)
* [Inference Testing](#inference-testing)
* [Reference Results](#reference-results)
## Model Overview
This model follows the paper ["Variational Autoencoders for Collaborative Filtering"](https://arxiv.org/abs/1802.05814).
### Model Architecture
Figure 1. The architecture of the VAE-CF model
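To make the architecture concrete, the sketch below shows the variational encoder's reparameterization trick in plain NumPy; the layer shapes, weights, and toy input are illustrative and are not taken from this repository:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, w_mu, w_logvar):
    """Toy linear encoder: map a user's interaction vector to a mean and log-variance."""
    return x @ w_mu, x @ w_logvar

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps, so gradients can flow through mu and sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

# One user's binarized interaction vector over 6 items, projected to a 2-d latent space.
x = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])
w_mu = rng.standard_normal((6, 2))
w_logvar = rng.standard_normal((6, 2))

mu, logvar = encode(x, w_mu, w_logvar)
z = reparameterize(mu, logvar, rng)
print(z.shape)  # (2,)
```

The decoder then maps `z` back to a score over all items, from which the top-k recommendations are taken.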
## Testing Guide
1. Data preparation
* Dataset: [MovieLens 20m dataset](https://grouplens.org/datasets/movielens/20m/).
* Processing: download and extract the archive into the `/data/ml-20m/extracted/` directory:
```bash
cd /data
mkdir -p ml-20m/extracted
cd ml-20m/extracted
wget http://files.grouplens.org/datasets/movielens/ml-20m.zip
unzip ml-20m.zip
```
2. Data preprocessing
```bash
python prepare_dataset.py
# or pass --data_dir=/data/ml-20m/extracted
```
3. Single-GPU training test
```bash
export MIOPEN_USE_APPROXIMATE_PERFORMANCE=0
export MIOPEN_FIND_MODE=1
python main.py --train --checkpoint_dir ./checkpoints
```
4. Multi-GPU training test
```bash
mpirun --bind-to numa --allow-run-as-root -np 8 -H localhost:8 python main.py --train --amp --checkpoint_dir ./checkpoints
```
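Step 2 above (data preprocessing) turns explicit star ratings into implicit feedback, as in the VAE-CF paper. A minimal sketch of that idea is shown below; the toy data and the 4.0 rating threshold follow the paper's convention and are assumptions, not values read from `prepare_dataset.py`:

```python
import numpy as np

# Toy (user_id, movie_id, rating) triples standing in for rows of ratings.csv.
ratings = [
    (0, 0, 5.0), (0, 2, 4.0), (0, 3, 2.0),
    (1, 1, 3.0), (1, 2, 4.5),
]
n_users, n_items = 2, 4

# Keep only "positive" interactions: ratings of 4.0 and above become a 1
# in a binary user-item matrix; everything else stays 0.
matrix = np.zeros((n_users, n_items), dtype=np.float32)
for user, item, rating in ratings:
    if rating >= 4.0:
        matrix[user, item] = 1.0

print(matrix)
```

Each row of this binary matrix is one training example for the autoencoder.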
## Parameters
### Main Parameters
Common runtime parameters:
* `--data_dir`: path to the dataset; defaults to `/data`
* `--checkpoint_dir`: directory where trained model checkpoints are saved
### Other Parameters
```bash
python main.py --help
usage: main.py [-h] [--train] [--test] [--inference_benchmark]
[--amp] [--epochs EPOCHS]
[--batch_size_train BATCH_SIZE_TRAIN]
[--batch_size_validation BATCH_SIZE_VALIDATION]
[--validation_step VALIDATION_STEP]
[--warm_up_epochs WARM_UP_EPOCHS]
[--total_anneal_steps TOTAL_ANNEAL_STEPS]
[--anneal_cap ANNEAL_CAP] [--lam LAM] [--lr LR] [--beta1 BETA1]
[--beta2 BETA2] [--top_results TOP_RESULTS] [--xla] [--trace]
[--activation ACTIVATION] [--log_path LOG_PATH] [--seed SEED]
[--data_dir DATA_DIR] [--checkpoint_dir CHECKPOINT_DIR]
Train a Variational Autoencoder for Collaborative Filtering in TensorFlow
optional arguments:
-h, --help show this help message and exit
--train Run training of VAE
--test Run validation of VAE
--inference_benchmark
Benchmark the inference throughput and latency
--amp Enable Automatic Mixed Precision
--epochs EPOCHS Number of epochs to train
--batch_size_train BATCH_SIZE_TRAIN
Global batch size for training
--batch_size_validation BATCH_SIZE_VALIDATION
Used both for validation and testing
--validation_step VALIDATION_STEP
Train epochs for one validation
--warm_up_epochs WARM_UP_EPOCHS
Number of epochs to omit during benchmark
--total_anneal_steps TOTAL_ANNEAL_STEPS
Number of annealing steps
--anneal_cap ANNEAL_CAP
Annealing cap
--lam LAM Regularization parameter
--lr LR Learning rate
--beta1 BETA1 Adam beta1
--beta2 BETA2 Adam beta2
--top_results TOP_RESULTS
Number of results to be recommended
--xla Enable XLA
--trace Save profiling traces
--activation ACTIVATION
Activation function
--log_path LOG_PATH Path to the detailed JSON log to be created
--seed SEED Random seed for TensorFlow and numpy
--data_dir DATA_DIR Directory for storing the training data
--checkpoint_dir CHECKPOINT_DIR
Path for saving a checkpoint after the training
```
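The `--total_anneal_steps` and `--anneal_cap` flags implement the KL-cost annealing described in the reference paper: the weight on the KL term grows linearly with the training step until it reaches the cap. A minimal sketch of that schedule (the function name is illustrative, not taken from `main.py`):

```python
def kl_weight(step, total_anneal_steps, anneal_cap):
    """Linearly anneal the KL weight, capped at anneal_cap."""
    if total_anneal_steps > 0:
        return min(anneal_cap, step / total_anneal_steps)
    return anneal_cap

print(kl_weight(5000, 10000, 1.0))   # 0.5 -- halfway through annealing
print(kl_weight(50000, 10000, 0.2))  # 0.2 -- clipped at the cap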
## 推理测试
推理测试可以通过参数:`--inference_benchmark`
```
python main.py --inference_benchmark --checkpoint_dir ./checkpoints
```
## Reference Results
### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)
| GPUs | Batch size / GPU | Accuracy - TF32 | Accuracy - mixed precision | Time to train - TF32 [s] | Time to train - mixed precision [s] | Time to train speedup (TF32 to mixed precision) |
|-------:|-----------------:|-------------:|-----------:|----------------:|--------------:|---------------:|
| 1 | 24,576 | 0.430298 | 0.430398 | 112.8 | 109.4 | 1.03 |
| 8 | 3,072 | 0.430897 | 0.430353 | 25.9 | 30.4 | 0.85 |
### Training accuracy: NVIDIA DGX-1 (8x V100 32GB)
| GPUs | Batch size / GPU | Accuracy - FP32 | Accuracy - mixed precision | Time to train - FP32 [s] | Time to train - mixed precision [s] | Time to train speedup (FP32 to mixed precision) |
|-------:|-----------------:|-------------:|-----------:|----------------:|--------------:|---------------:|
| 1 | 24,576 | 0.430592 | 0.430525 | 346.5 | 186.5 | 1.86 |
| 8 | 3,072 | 0.430753 | 0.431202 | 59.1 | 42.2 | 1.40 |
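The accuracy metric is not named in these tables; the reference paper evaluates with ranked-retrieval metrics such as NDCG@k and Recall@k. As an illustration, a minimal NDCG@k sketch (the function and toy data are my own, not from this repository):

```python
import numpy as np

def ndcg_at_k(ranked_items, relevant_items, k):
    """NDCG@k: discounted gain of the top-k ranking relative to the ideal ranking."""
    gains = [1.0 if item in relevant_items else 0.0 for item in ranked_items[:k]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant_items), k)))
    return dcg / ideal if ideal > 0 else 0.0

# Recommend 4 items; the held-out user actually interacted with items 2 and 7.
score = ndcg_at_k([2, 5, 7, 1], {2, 7}, k=4)
print(round(score, 4))
```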
### Training performance results
Performance numbers below show throughput in users processed per second. They were averaged over an entire training run.
#### Training performance: NVIDIA DGX A100 (8x A100 40GB)
| GPUs | Batch size / GPU | Throughput - TF32 | Throughput - mixed precision | Throughput speedup (TF32 - mixed precision) | Strong scaling - TF32 | Strong scaling - mixed precision |
|-------:|------------:|-------------------:|-----------------:|---------------------:|---:|---:|
| 1 | 24,576 | 354,032 | 365,474 | 1.03 | 1 | 1 |
| 8 | 3,072 | 1,660,700 | 1,409,770 | 0.85 | 4.69 | 3.86 |
#### Training performance: NVIDIA DGX-1 (8x V100 32GB)
| GPUs | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 - mixed precision) | Strong scaling - FP32 | Strong scaling - mixed precision |
|-------:|------------:|-------------------:|-----------------:|---------------------:|---:|---:|
| 1 | 24,576 | 114,125 | 213,283 | 1.87 | 1 | 1 |
| 8 | 3,072 | 697,628 | 1,001,210 | 1.44 | 6.11 | 4.69 |
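The strong-scaling columns above are simply the multi-GPU throughput divided by the single-GPU throughput at the same precision. For example, for the DGX A100 TF32 rows:

```python
single_gpu = 354_032   # users/s, 1x A100, TF32
eight_gpu = 1_660_700  # users/s, 8x A100, TF32

scaling = eight_gpu / single_gpu
print(round(scaling, 2))  # 4.69
```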
### Inference performance results
Our results were obtained by running:
```bash
python main.py --inference_benchmark [--amp]
```
in the TensorFlow 20.06 NGC container.
We use users processed per second as a throughput metric for measuring inference performance.
All latency numbers are in seconds.
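The average and percentile latencies reported below can be computed from a list of per-batch measurements; a minimal sketch using NumPy (the sample latencies are invented for illustration):

```python
import numpy as np

# Hypothetical per-batch inference latencies in seconds.
latencies = np.array([0.00082, 0.00085, 0.00083, 0.00090, 0.00084, 0.00110])

print(f"avg: {latencies.mean():.5f}")
for q in (90, 95, 99):
    print(f"p{q}: {np.percentile(latencies, q):.5f}")
```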
#### Inference performance: NVIDIA DGX A100 (1x A100 40GB)
TF32
| Batch size | Throughput Avg | Latency Avg | Latency 90% | Latency 95% | Latency 99% |
|-------------:|-----------------:|--------------:|--------------:|--------------:|---------------:|
| 1 | 1181 | 0.000847 | 0.000863 | 0.000871 | 0.000901 |
FP16
| Batch size | Throughput Avg | Latency Avg | Latency 90% | Latency 95% | Latency 99% |
|-------------:|-----------------:|--------------:|--------------:|--------------:|---------------:|
| 1 | 1215 | 0.000823 | 0.000858 | 0.000864 | 0.000877 |
#### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
FP32
| Batch size | Throughput Avg | Latency Avg | Latency 90% | Latency 95% | Latency 99% |
|-------------:|-----------------:|--------------:|--------------:|--------------:|---------------:|
| 1 | 718 | 0.001392 | 0.001443 | 0.001458 | 0.001499 |
FP16
| Batch size | Throughput Avg | Latency Avg | Latency 90% | Latency 95% | Latency 99% |
|-------------:|-----------------:|--------------:|--------------:|--------------:|---------------:|
| 1 | 707 | 0.001413 | 0.001511 | 0.001543 | 0.001622 |