[ROCm][CI] Fix failure in Language Models Tests (Extra Standard) by reducing agent pool size (#31553)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

Unverified commit 5cc48766, authored Jan 01, 2026 by Andreas Karatzas; committed by GitHub Jan 01, 2026

Showing 1 changed file with 2 additions and 1 deletion

.buildkite/test-amd.yaml +2 -1

--- a/.buildkite/test-amd.yaml
+++ b/.buildkite/test-amd.yaml
@@ -859,7 +859,7 @@ steps:
 - label: Language Models Tests (Extra Standard) %N
   timeout_in_minutes: 45
   mirror_hardwares: [amdexperimental]
-  agent_pool: mi325_8
+  agent_pool: mi325_2
   # grade: Blocking
   torch_nightly: true
   source_file_dependencies:
@@ -871,6 +871,7 @@ steps:
     # Shard slow subset of standard language models tests. Only run when model
     # source is modified, or when specified test files are modified
     - pip freeze | grep -E 'torch'
+    - export TORCH_NCCL_BLOCKING_WAIT=1
     - pytest -v -s models/language -m 'core_model and slow_test' \
              --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT \
              --shard-id=$$BUILDKITE_PARALLEL_JOB
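
To reproduce one shard of this step outside Buildkite, the following is a minimal sketch rather than part of the commit. It rebuilds the environment and pytest invocation from the hunk above. The helper name `build_shard_command` is hypothetical, and the sketch assumes it is run from the directory where `models/language` resolves and that the `--num-shards` / `--shard-id` options are available in the local pytest setup. In the pipeline, `$$` defers expansion of `BUILDKITE_PARALLEL_JOB_COUNT` and `BUILDKITE_PARALLEL_JOB` to job runtime; here they are passed in explicitly.

```python
import os
import shlex
from typing import Dict, List, Tuple


def build_shard_command(shard_id: int, num_shards: int) -> Tuple[Dict[str, str], List[str]]:
    """Return (env, argv) mirroring the CI step for a single shard.

    Hypothetical helper for local reproduction; not part of the repository.
    """
    if num_shards < 1 or not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must satisfy 0 <= shard_id < num_shards")

    # Copy the current environment and add the variable exported by the commit.
    env = dict(os.environ)
    env["TORCH_NCCL_BLOCKING_WAIT"] = "1"

    # Same arguments as the pytest line in the diff, with the Buildkite
    # variables replaced by explicit values.
    argv = [
        "pytest", "-v", "-s", "models/language",
        "-m", "core_model and slow_test",
        f"--num-shards={num_shards}",
        f"--shard-id={shard_id}",
    ]
    return env, argv


if __name__ == "__main__":
    env, argv = build_shard_command(shard_id=0, num_shards=2)
    # Print the equivalent shell command instead of running the slow tests.
    print("TORCH_NCCL_BLOCKING_WAIT=" + env["TORCH_NCCL_BLOCKING_WAIT"], shlex.join(argv))
```

Passing `env` and `argv` to `subprocess.run(argv, env=env)` would launch the shard; the sketch only prints the command so it can be inspected first.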