# Mixtral reference standalone inference script

The reference output and accuracy can be checked with the standalone Hugging Face inference script by following the instructions below:

```
cd language/mixtral-8x7b
docker build -t mlc-ngc .
nvidia-docker run -it --rm --net=host --runtime=nvidia --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --cap-add=SYS_PTRACE --cap-add=SYS_ADMIN --cap-add=DAC_READ_SEARCH --security-opt seccomp=unconfined -w $PWD -v $PWD:$PWD -t mlc-ngc

pip install -r requirements.txt
cd standalone_infer
# Make sure the checkpoint and the reference pickle file are already downloaded
python3 hf_eval_all.py --input_pkl=09292024_mixtral_15k_mintoken2_v1.pkl --checkpoint_path=/raid/data/mlperf-llm/Mixtral-8x7B-Instruct-v0.1 --output_pkl=mixtral_8x7b_15000_greedy_reference_fp16_mintoken2.pkl --batch_size=64

# Exit the container and enter the evaluation container
exit
docker build . -f Dockerfile.eval -t evaluation
docker run -it --rm --net=host --runtime=nvidia --ipc=host -v $PWD:$PWD -w $PWD evaluation
cd standalone_infer
python3 run_accuracy.py --results_path=mixtral_8x7b_15000_greedy_reference_fp16_mintoken2.pkl
```

Expected output:

```
EM: 0.7366, correct: 3683 / 5000, gen_token_per_sample: 129.9604
Evaluating OpenOrca score...
OpenOrca score: {'rouge1': np.float64(45.5989), 'rouge2': np.float64(23.3526), 'rougeL': np.float64(30.4608), 'rougeLsum': np.float64(42.5396)}, gen_token_per_sample: 205.8656
Evaluating MBXP score...
100%|| 5000/5000 [02:33<00:00, 32.50it/s]
Processed 5000 in 153.89411109898356s
 60.16% pass@1
{'cpp': 381, 'typescript': 438, 'ruby': 419, 'python': 492, 'php': 809, 'javascript': 469} out of {'cpp': 743, 'typescript': 868, 'ruby': 846, 'python': 863, 'php': 846, 'javascript': 834}
gen_tokens_per_sample: 98.7026
```
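
The summary figures in the expected output can be cross-checked from the raw counts it prints. The sketch below is not part of `run_accuracy.py`. It copies the counts from the log and assumes that EM is `correct / total` and that pass@1 is the number of passed MBXP problems divided by the total number of MBXP problems, which is consistent with the printed values. All variable names are hypothetical.

```python
# Sanity check of the summary metrics in the expected output.
# Assumption: EM = correct / total, pass@1 = passed / total over all languages.
# The counts below are copied from the log; variable names are illustrative only.

em_correct, em_total = 3683, 5000

mbxp_passed = {
    "cpp": 381, "typescript": 438, "ruby": 419,
    "python": 492, "php": 809, "javascript": 469,
}
mbxp_total = {
    "cpp": 743, "typescript": 868, "ruby": 846,
    "python": 863, "php": 846, "javascript": 834,
}

# Exact-match ratio, as reported on the "EM:" line.
em = em_correct / em_total

# Aggregate pass@1 across all MBXP languages.
passed_sum = sum(mbxp_passed.values())
total_sum = sum(mbxp_total.values())
pass_at_1 = passed_sum / total_sum

print(f"EM: {em:.4f}")
print(f"MBXP: {passed_sum} / {total_sum} = {100 * pass_at_1:.2f}% pass@1")
```

If your own run prints different counts, substituting them here gives a quick way to confirm that the reported percentages match the per-language breakdown.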