Commit b0ceb4d3, authored by Keiven C, committed by GitHub

fix: use port 9401 inside and outside of container (#1838)


Co-authored-by: Keiven Chang <keivenchang@users.noreply.github.com>
parent 7dea77c3
@@ -13,19 +13,17 @@ Default Service Relationship Diagram:
 ```mermaid
 graph TD
     BROWSER[Browser] -->|:3001| GRAFANA[Grafana :3001]
-    BROWSER[Browser] -->|:3001| DCGM_EXPORTER2["external dcgm_exporter 0.0.0.0:9400"]
     subgraph DockerComposeNetwork [Network inside Docker Compose]
         NATS_PROM_EXP[nats-prom-exp :7777 /metrics] -->|:8222/varz| NATS_SERVER[nats-server :4222, :6222, :8222]
         PROMETHEUS[Prometheus server :9090] -->|:2379/metrics| ETCD_SERVER[etcd-server :2379, :2380]
-        PROMETHEUS -->|:9400/metrics| DCGM_EXPORTER[dcgm-exporter :9400]
+        PROMETHEUS -->|:9401/metrics| DCGM_EXPORTER[dcgm-exporter :9401]
         PROMETHEUS -->|:7777/metrics| NATS_PROM_EXP
         PROMETHEUS -->|:8000/metrics| DYNAMOFE[Dynamo HTTP FE :8000]
         GRAFANA -->|:9090/query API| PROMETHEUS
     end
-    BROWSER -->|:9401/metrics| DCGM_EXPORTER
 ```
-The dcgm-exporter within the Docker Compose network is configured to bind to port 9400 internally, but it is exposed externally on port 9401. This setup helps prevent conflicts with other dcgm-exporters that might be running concurrently, such as in distributed environments like SLURM.
+The dcgm-exporter service in the Docker Compose network is configured to use port 9401 instead of the default port 9400. This adjustment is made to avoid port conflicts with other dcgm-exporter instances that may be running simultaneously. Such a configuration is typical in distributed systems like SLURM.
 As of Q2 2025, Dynamo HTTP Frontend metrics are exposed when you build containers with `--framework VLLM_V1` or `--framework TENSORRTLLM`.
......
@@ -63,11 +63,12 @@ services:
   dcgm-exporter:
     image: nvidia/dcgm-exporter:4.2.3-4.1.3-ubi9
     ports:
-      # Remap from 9400 to 9401 (public port) to avoid conflict with an existing dcgm-exporter
-      # on dlcluster. To access dcgm:
-      # Outside the container: curl http://localhost:9401/metrics
-      # Inside the container (container-to-container): curl http://dcgm-exporter:9400/metrics
-      - 9401:9400
+      # Expose dcgm-exporter on port 9401 both inside and outside the container
+      # to avoid conflicts with other dcgm-exporter instances in distributed environments.
+      # To access DCGM metrics:
+      # Outside the container: curl http://localhost:9401/metrics (or the host IP)
+      # Inside the container (container-to-container): curl http://dcgm-exporter:9401/metrics
+      - 9401:9401
     cap_add:
       - SYS_ADMIN
     deploy:
@@ -80,6 +81,7 @@ services:
     environment:
       # dcgm uses NVIDIA_VISIBLE_DEVICES variable but normally it is CUDA_VISIBLE_DEVICES
       - NVIDIA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-all}
+      - DCGM_EXPORTER_LISTEN=:9401
     runtime: nvidia # Specify the NVIDIA runtime
     networks:
       - monitoring
......
@@ -31,7 +31,7 @@ scrape_configs:
   - job_name: 'dcgm-exporter'
     scrape_interval: 5s
     static_configs:
-      - targets: ['dcgm-exporter:9400'] # on the "monitoring" network
+      - targets: ['dcgm-exporter:9401'] # on the "monitoring" network
   # This is a demo service that needs to be launched manually. See components/metrics/README.md
   # Note that you may need to disable the firewall on your host. On Ubuntu: sudo ufw allow 8000/tcp
......
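One way to sanity-check this change after the stack is up is to fetch the metrics endpoint on port 9401 from both sides and confirm that DCGM samples come back. The helpers below are only a sketch and not part of the commit: the function names and the `DCGM_PORT` variable are hypothetical, and the check assumes `curl` and `grep` are available wherever it runs.

```shell
#!/bin/sh
# Sketch: helpers for checking that dcgm-exporter answers on port 9401.
# DCGM_PORT and both function names are illustrative, not part of the commit.

DCGM_PORT="${DCGM_PORT:-9401}"

# Build the metrics URL for a given host: "localhost" from the host machine,
# "dcgm-exporter" from another container on the "monitoring" network.
dcgm_metrics_url() {
  printf 'http://%s:%s/metrics\n' "${1:-localhost}" "$DCGM_PORT"
}

# Fetch the endpoint and succeed only if at least one sample line
# starting with "DCGM_" is present in the response.
check_dcgm_metrics() {
  curl --silent --fail --max-time 5 "$(dcgm_metrics_url "$1")" | grep -q '^DCGM_'
}
```

From the host, `check_dcgm_metrics localhost && echo ok` exercises the published port. Running `check_dcgm_metrics dcgm-exporter` from a container attached to the `monitoring` network exercises the container-to-container path, provided that container ships `curl`.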