---
id: model-benchmarks
---
# Model Benchmarks
## PyTorch Model Benchmarks
### `model-benchmarks`
#### Introduction
Run training or inference tasks with single or half precision for deep learning models,
including the following categories:
* GPT: gpt2-small, gpt2-medium, gpt2-large and gpt2-xl
* LLAMA: llama2-7b, llama2-13b, llama2-70b
* MoE: mixtral-8x7b, mixtral-8x22b
* BERT: bert-base and bert-large
* LSTM
* CNN, listed in [`torchvision.models`](https://pytorch.org/vision/0.8/models.html), including:
* resnet: resnet18, resnet34, resnet50, resnet101, resnet152
* resnext: resnext50_32x4d, resnext101_32x8d
* wide_resnet: wide_resnet50_2, wide_resnet101_2
* densenet: densenet121, densenet169, densenet201, densenet161
* vgg: vgg11, vgg11_bn, vgg13, vgg13_bn, vgg16, vgg16_bn, vgg19_bn, vgg19
* mnasnet: mnasnet0_5, mnasnet0_75, mnasnet1_0, mnasnet1_3
* mobilenet: mobilenet_v2
* shufflenet: shufflenet_v2_x0_5, shufflenet_v2_x1_0, shufflenet_v2_x1_5, shufflenet_v2_x2_0
* squeezenet: squeezenet1_0, squeezenet1_1
* others: alexnet, googlenet, inception_v3
For inference, supported percentiles include
50th, 90th, 95th, 99th, and 99.9th.
**New: Support fp8_hybrid and fp8_e4m3 precision for BERT models.**
**New: Deterministic Training Support**
SuperBench now supports deterministic training to ensure reproducibility across runs. This includes fixed seeds and deterministic algorithms. To enable deterministic training, use the following flags:
- **Flags:**
- `--enable_determinism`: Enables deterministic computation for reproducible results.
- `--deterministic_seed `: Sets the seed for reproducibility (default: 42).
- `--check_frequency `: How often to record deterministic metrics (default: 100).
- **Environment Variables (set automatically by SuperBench when `--enable_determinism` is used):**
- `CUBLAS_WORKSPACE_CONFIG=:4096:8`: Ensures deterministic behavior in cuBLAS. This can be overridden by setting it manually before running SuperBench.
**Comparing Deterministic Results**
To compare deterministic results between runs, use the standard result analysis workflow:
1. Run benchmark with `--enable_determinism` flag
2. Generate baseline: `sb result generate-baseline --data-file results.jsonl --summary-rule-file rules.yaml`
3. Compare future runs: `sb result diagnosis --data-file new-results.jsonl --rule-file diagnosis-rule.yaml --baseline-file baseline.json`
This allows configurable tolerance for floating-point differences via YAML rules.
**Configuration Parameter Validation**
When determinism is enabled, benchmark configuration parameters (batch_size, num_steps, deterministic_seed, etc.) are automatically recorded in the results file as `deterministic_config_*` metrics. The diagnosis rules enforce exact matching of these parameters between runs to ensure valid comparisons:
If any configuration parameter differs between runs, the diagnosis will flag it as a failure, ensuring you only compare runs with identical configurations.
**Summary Rule Snippet for Determinism**
Include the following rule in your summary rule file (used with `sb result summary` or `sb result generate-baseline --summary-rule-file`) to surface deterministic metrics in the results summary:
```yaml
superbench:
rules:
model-benchmarks-deterministic:
statistics:
- mean
categories: Deterministic
metrics:
- model-benchmarks:.*/deterministic_loss.*
- model-benchmarks:.*/deterministic_act_mean.*
- model-benchmarks:.*/deterministic_check_count.*
- model-benchmarks:.*/deterministic_step.*
- model-benchmarks:.*/deterministic_config_.*
- model-benchmarks:.*/return_code.*
```
This groups all deterministic outputs — loss fingerprints, activation means, check counts, step numbers, configuration parameters, and return codes — under the **Deterministic** category.
**Diagnosis Rule Snippet for Determinism**
Include the following rules in your diagnosis rule file (used with `sb result diagnosis` or `sb result generate-baseline --diagnosis-rule-file`) to detect Silent Data Corruption (SDC) and validate configuration consistency:
```yaml
superbench:
rules:
deterministic_rule:
function: variance
criteria: "lambda x: x != 0"
categories: SDC-Fingerprint
metrics:
- model-benchmarks:.*/deterministic_loss.*
- model-benchmarks:.*/deterministic_act_mean.*
- model-benchmarks:.*/deterministic_check_count.*
deterministic_config_rule:
function: variance
criteria: "lambda x: x != 0"
categories: SDC-Config
metrics:
- model-benchmarks:.*/deterministic_config_batch_size.*
- model-benchmarks:.*/deterministic_config_num_steps.*
- model-benchmarks:.*/deterministic_config_num_warmup.*
- model-benchmarks:.*/deterministic_config_deterministic_seed.*
- model-benchmarks:.*/deterministic_config_check_frequency.*
- model-benchmarks:.*/deterministic_config_seq_len.*
- model-benchmarks:.*/deterministic_config_hidden_size.*
- model-benchmarks:.*/deterministic_config_num_classes.*
- model-benchmarks:.*/deterministic_config_input_size.*
- model-benchmarks:.*/deterministic_config_num_layers.*
- model-benchmarks:.*/deterministic_config_num_hidden_layers.*
- model-benchmarks:.*/deterministic_config_num_attention_heads.*
- model-benchmarks:.*/deterministic_config_intermediate_size.*
deterministic_failure_rule:
function: failure_check
criteria: "lambda x: x != 0"
categories: SDC-Failed
metrics:
- model-benchmarks:.*/return_code
```
- **SDC-Fingerprint** (`deterministic_rule`): Flags any node where loss, activation mean, or check count has *any* variance from baseline (`x != 0`), indicating a potential SDC issue.
- **SDC-Config** (`deterministic_config_rule`): Ensures all determinism configuration parameters (seed, batch size, sequence length, hidden size, etc.) are identical across nodes — any mismatch means the comparison is invalid.
- **SDC-Failed** (`deterministic_failure_rule`): Uses `failure_check` to catch nodes where the determinism benchmark failed to run or returned a non-zero exit code.
For complete rule files covering all benchmark categories (micro-benchmarks, NCCL, GPU copy bandwidth, NVBandwidth, etc.), refer to the rule file documentation in [Result Summary](../result-summary.md) and [Data Diagnosis](../data-diagnosis.md).
#### Metrics
| Name | Unit | Description |
|-----------------------------------------------------------------------------------------|------------------------|------------------------------------------------------------------------------|
| model-benchmarks/pytorch-${model_name}/${precision}_train_step_time | time (ms) | The average training step time with fp32/fp16 precision. |
| model-benchmarks/pytorch-${model_name}/${precision}_train_throughput | throughput (samples/s) | The average training throughput with fp32/fp16 precision per GPU. |
| model-benchmarks/pytorch-${model_name}/${precision}_inference_step_time | time (ms) | The average inference step time with fp32/fp16 precision. |
| model-benchmarks/pytorch-${model_name}/${precision}_inference_throughput | throughput (samples/s) | The average inference throughput with fp32/fp16 precision. |
| model-benchmarks/pytorch-${model_name}/${precision}_inference_step_time\_${percentile} | time (ms) | The nth percentile inference step time with fp32/fp16 precision. |
| model-benchmarks/pytorch-${model_name}/${precision}_inference_throughput\_${percentile} | throughput (samples/s) | The nth percentile inference throughput with fp32/fp16 precision. |
## Megatron Model benchmarks
### `megatron-gpt`
#### Introduction
Run GPT pretrain tasks with float32, float16, bfloat16 precisions with [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) or [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed).
`tips: batch_size in this benchmark represents global batch size, the batch size on each GPU instance is micro_batch_size.`
#### Metrics
| Name | Unit | Description |
|---------------------------------------------------|------------------------|---------------------------------------------------------|
| megatron-gpt/${precision}_train_step_time | time (ms) | The average training step time per iteration. |
| megatron-gpt/${precision}_train_throughput | throughput (samples/s) | The average training throughput per iteration. |
| megatron-gpt/${precision}_train_tflops | tflops/s | The average training tflops per second per iteration. |
| megatron-gpt/${precision}_train_mem_allocated | GB | The average GPU memory allocated per iteration. |
| megatron-gpt/${precision}_train_max_mem_allocated | GB | The average maximum GPU memory allocated per iteration. |