Commit 983aa496 (change / sglang)
Authored Aug 16, 2025 by kk. Committed by GitHub, Aug 15, 2025.
Fix nan value generated after custom all reduce (#8663)
Co-authored-by: wunhuang <wunhuang@amd.com>
parent 9c3e95d9
Showing 2 changed files with 20 additions and 1 deletion:
.github/workflows/pr-test-amd.yml (+19 -0)
python/sglang/srt/distributed/device_communicators/custom_all_reduce.py (+1 -1)
.github/workflows/pr-test-amd.yml
...
@@ -290,6 +290,25 @@ jobs:
         run: |
           bash scripts/ci/amd_ci_exec.sh python3 run_suite.py --suite per-commit-8-gpu-amd --timeout-per-file 3600
+  unit-test-backend-8-gpu-CAR-amd:
+    if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
+      github.event.pull_request.draft == false
+    strategy:
+      matrix:
+        runner: [linux-mi300-gpu-8]
+    runs-on: ${{matrix.runner}}
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+      - name: Start CI container
+        run: bash scripts/amd_ci_start_container.sh
+        env:
+          GITHUB_WORKSPACE: ${{ github.workspace }}
+      - name: Install dependencies
+        run: bash scripts/amd_ci_install_dependency.sh
       - name: Run CustomAllReduce test
         timeout-minutes: 20
         run: |
...
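The `if:` guard on the new `unit-test-backend-8-gpu-CAR-amd` job can be easier to reason about when written out as ordinary boolean logic. The following is a minimal Python sketch of that condition, not part of the commit. The function name `should_run_car_job` and its parameters are hypothetical, and the handling of a missing `draft` field assumes GitHub's documented loose-equality rule (a null operand is coerced to 0, so `null == false` evaluates to true).

```python
from typing import Optional


def should_run_car_job(repository: str, event_name: str, draft: Optional[bool]) -> bool:
    """Model of the workflow guard (hypothetical helper, for illustration only).

    (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request')
        && github.event.pull_request.draft == false
    """
    in_scope = repository == "sgl-project/sglang" or event_name == "pull_request"
    # Assumption: outside pull_request events the draft field is null, and
    # GitHub expressions coerce null and false to 0 before comparing them.
    draft_as_number = 0 if draft is None else int(draft)
    not_draft = draft_as_number == 0
    return in_scope and not_draft


# A ready-for-review PR from a fork runs the job; a draft PR does not.
fork_ready = should_run_car_job("someone/sglang", "pull_request", draft=False)
fork_draft = should_run_car_job("someone/sglang", "pull_request", draft=True)
# A push to the upstream repository has no pull_request payload (draft is None).
upstream_push = should_run_car_job("sgl-project/sglang", "push", draft=None)
# A push to a fork matches neither side of the "or".
fork_push = should_run_car_job("someone/sglang", "push", draft=None)
```

Under these assumptions, the job is skipped for draft pull requests and for non-PR events in forks, and runs otherwise.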
python/sglang/srt/distributed/device_communicators/custom_all_reduce.py
...
@@ -398,7 +398,7 @@ class CustomAllreduce:
             else:
                 # If warm up, mimic the allocation pattern since custom
                 # allreduce is out-of-place.
-                return torch.empty_like(input)
+                return torch.zeros_like(input)
         else:
             if _is_hip:
                 # note: outside of cuda graph context,
...
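The commit does not spell out why the warm-up placeholder produced NaN values. One plausible reading is that `torch.empty_like` returns uninitialized memory, so whatever bits a previous allocation left behind get interpreted as floats, whereas `torch.zeros_like` always yields 0.0. The sketch below illustrates that mechanism with the standard library only. The byte patterns and the helper names `decode_float32` and `has_nan` are hypothetical stand-ins, not code from sglang or PyTorch.

```python
import math
import struct
from typing import List


def decode_float32(raw: bytes) -> List[float]:
    """Interpret a little-endian byte buffer as consecutive float32 values."""
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}f", raw[: count * 4]))


def has_nan(values: List[float]) -> bool:
    return any(math.isnan(v) for v in values)


# Hypothetical stale contents of a reused buffer. 0x7FC00000 and 0xFFFFFFFF
# are NaN bit patterns in IEEE 754 float32; 0x3F800000 is 1.0.
stale_bytes = struct.pack("<3I", 0x7FC00000, 0x3F800000, 0xFFFFFFFF)
zero_bytes = bytes(len(stale_bytes))

uninitialized_like = decode_float32(stale_bytes)  # stands in for torch.empty_like
zero_filled_like = decode_float32(zero_bytes)     # stands in for torch.zeros_like

# Any arithmetic on the placeholder carries a NaN forward.
downstream_from_stale = [v * 0.5 + 1.0 for v in uninitialized_like]
downstream_from_zero = [v * 0.5 + 1.0 for v in zero_filled_like]

print(has_nan(downstream_from_stale), has_nan(downstream_from_zero))
```

The zero-filled placeholder costs one extra fill per warm-up call, which seems a reasonable trade for deterministic values in a path that only runs during warm-up.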