    Benchmark: Model benchmark - deterministic training support (#731) (#2) · 47d4a79d
    one authored
    
    
    Adds an opt-in deterministic training mode to SuperBench's PyTorch model
    benchmarks. When enabled via --enable-determinism, PyTorch deterministic
    algorithms are enforced, and per-step numerical fingerprints (loss,
    activation means) are recorded as metrics. These can be compared across
    runs using the existing sb result diagnosis pipeline to verify bit-exact
    reproducibility, which is useful for hardware validation and platform
    comparison.
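    In PyTorch, deterministic training comes down to seeding every RNG and
    forcing deterministic kernel implementations. A minimal sketch of what
    the mode enables (the helper name here is illustrative, not the
    benchmark's actual API):

    ```python
    import os
    import random

    import numpy as np
    import torch


    def enable_determinism(seed: int = 42) -> None:
        """Illustrative sketch: make PyTorch training runs reproducible."""
        # Seed every RNG that can influence training.
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)  # also seeds all CUDA devices
        # Raise an error if an op has no deterministic implementation.
        torch.use_deterministic_algorithms(True)
        # cuBLAS requires this env var for deterministic matmuls on CUDA.
        os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'
        if torch.backends.cudnn.is_available():
            torch.backends.cudnn.deterministic = True
            torch.backends.cudnn.benchmark = False
    ```

    With the same seed, two invocations should produce bit-identical
    random tensors and, on deterministic kernels, bit-identical training
    trajectories.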
     
    Flags added -
    
    --enable-determinism: enable deterministic training mode
    --check-frequency: number of steps between fingerprint recordings
    --deterministic-seed: seed used to initialize the random number generators
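    The check-frequency flag controls how often a fingerprint is taken. A
    pure-Python illustration of that sampling logic (the actual metric names
    in pytorch_base.py may differ):

    ```python
    def record_fingerprints(losses, activation_means, check_frequency=10):
        """Illustrative: collect loss/activation fingerprints every N steps.

        `losses` and `activation_means` are per-step floats; the real code
        pulls them from the training loop rather than taking lists.
        """
        metrics = {}
        for step, (loss, act_mean) in enumerate(zip(losses, activation_means)):
            if step % check_frequency == 0:
                # Store the raw float values so two runs can be compared
                # for exact equality, not approximate closeness.
                metrics[f'fingerprint_loss_step{step}'] = loss
                metrics[f'fingerprint_activation_mean_step{step}'] = act_mean
        return metrics
    ```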
    
    Changes - 
    
    Updated pytorch_base.py to handle deterministic settings and logging.
    Added a new example script: pytorch_deterministic_example.py
    Added a test file: test_pytorch_determinism_all.py to verify the feature
    works as expected.
    
    Usage - 
    
    Step 1: Run 1 - run with --enable-determinism; the fingerprint metrics
    are recorded in the results-summary.jsonl file
    Step 2: Generate the baseline file from the Run 1 results using sb
    result generate-baseline
    Step 3: Run 2 - run with --enable-determinism on a different machine (or
    the same machine); the metrics are again recorded in the
    results-summary.jsonl file
    Step 4: Run diagnosis on the results generated from the 2 runs using the
    sb result diagnosis command
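    Conceptually, Step 4 checks that every fingerprint metric matches
    exactly between the two runs. A standalone sketch of that check (sb
    result diagnosis does this through its rules; the JSONL layout here is
    simplified to one flat object per line):

    ```python
    import json


    def compare_fingerprints(baseline_path, candidate_path):
        """Illustrative: return fingerprint metrics that differ between two
        results-summary.jsonl files (simplified one-object-per-line layout)."""
        def load(path):
            metrics = {}
            with open(path) as f:
                for line in f:
                    record = json.loads(line)
                    metrics.update({k: v for k, v in record.items()
                                    if k.startswith('fingerprint_')})
            return metrics

        baseline, candidate = load(baseline_path), load(candidate_path)
        # Bit-exact reproducibility means every baseline metric is identical.
        return {k: (baseline[k], candidate.get(k))
                for k in baseline if candidate.get(k) != baseline[k]}
    ```

    An empty result means the two runs are bit-exact on the recorded
    fingerprints; any entry pinpoints the first diverging step.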
    
    Note - 
    1. Make sure all the parameters are constant between the 2 runs
    2. Running the diagnosis command requires a rules.yaml file
    
    ---------
    Co-authored-by: Aishwarya Tonpe <aishwarya.tonpe25@gmail.com>
    Co-authored-by: Ubuntu <rdadmin@HPCPLTNODE0.n3kgq4m0lhoednrx3hxtad2nha.cdmx.internal.cloudapp.net>