@@ -161,9 +161,7 @@ The chart includes built-in validation to prevent all operator conflicts:
| dynamo-operator.webhook.certManager.certificate.rootCA.duration | string | `"87600h"` | Duration for the root CA certificate (e.g., "87600h" for 10 years). The root CA typically has a much longer lifetime than the leaf certificates it signs. |
| dynamo-operator.webhook.certManager.certificate.rootCA.renewBefore | string | `"720h"` | Time before root CA expiration to trigger renewal (e.g., "720h" for 30 days). Renewing a CA can be disruptive as all signed certificates must be reissued. |
| dynamo-operator.checkpoint.initContainerImage | string | `"busybox:latest"` | Image used for init containers in checkpoint jobs (e.g., signal file cleanup) |
| dynamo-operator.checkpoint.readyForCheckpointFilePath | string | `"/tmp/ready-for-checkpoint"` | Path written by worker when model is loaded and ready for checkpointing |
| dynamo-operator.checkpoint.restoreMarkerFilePath | string | `"/tmp/dynamo-restored"` | Path written by restore-entrypoint after successful CRIU restore |
print("Ready for checkpoint. Waiting for watcher signal...")
# Wait for whichever signal comes first
# Wait for whichever signal comes first (SIGKILL on failure kills us
# immediately, so only success/restore signals reach this point)
done, pending = await asyncio.wait(
[asyncio.create_task(checkpoint_done.wait()),
asyncio.create_task(restore_done.wait())],
...
...
@@ -390,11 +394,14 @@ async def main():
- Pod has `nvidia.com/chrek-is-checkpoint-source: "true"` label
- Pod status is `Ready` (readiness probe passes = ready file exists)
2. **Signal-based coordination**: The DaemonSet sends `SIGUSR1` after checkpoint completes and `SIGCONT` after restore completes. Your application must handle these signals (not poll for files).
2. **Signal handler ordering**: Install signal handlers **before** writing the ready file. Otherwise there is a race window where the DaemonSet sends a signal while the default disposition is still in effect (for `SIGUSR1`, the default terminates the process).
3. **Signal-based coordination**: The DaemonSet sends `SIGUSR1` after checkpoint completes, `SIGCONT` after restore completes, and `SIGKILL` if checkpoint fails. Your application must handle `SIGUSR1` and `SIGCONT` (not poll for files). `SIGKILL` cannot be caught; the kernel terminates the process immediately.
- **SIGCONT received**: Process was restored, wake model and continue
- **SIGKILL received**: Checkpoint failed, process terminated immediately (no handler needed)
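
The ordering requirement above can be sketched as follows. This is a minimal illustration, not the project's actual worker code: the helper name `wait_for_watcher_signal` and the demo that signals its own process are assumptions made for the example, while `checkpoint_done` and `restore_done` mirror the event names used in the snippet earlier in this diff. Handlers are registered on the running event loop first, and only then is the ready file written.

```python
import asyncio
import os
import signal
import tempfile
from pathlib import Path


async def wait_for_watcher_signal(ready_file: Path) -> str:
    """Hypothetical helper: returns "checkpoint" on SIGUSR1, "restore" on SIGCONT."""
    loop = asyncio.get_running_loop()
    checkpoint_done = asyncio.Event()
    restore_done = asyncio.Event()

    # Step 1: install handlers first, so no signal can arrive while the
    # default disposition is still in effect.
    loop.add_signal_handler(signal.SIGUSR1, checkpoint_done.set)
    loop.add_signal_handler(signal.SIGCONT, restore_done.set)
    try:
        # Step 2: only now advertise readiness by creating the ready file.
        ready_file.touch()

        # Step 3: wait for whichever signal comes first. SIGKILL never
        # reaches this point because it cannot be handled.
        tasks = [
            asyncio.create_task(checkpoint_done.wait()),
            asyncio.create_task(restore_done.wait()),
        ]
        _, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()
        return "restore" if restore_done.is_set() else "checkpoint"
    finally:
        loop.remove_signal_handler(signal.SIGUSR1)
        loop.remove_signal_handler(signal.SIGCONT)


async def _demo() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        loop = asyncio.get_running_loop()
        # Stand-in for the DaemonSet: signal this process after a short delay.
        loop.call_later(0.1, os.kill, os.getpid(), signal.SIGUSR1)
        outcome = await wait_for_watcher_signal(Path(tmp) / "ready-for-checkpoint")
        print(f"watcher reported: {outcome}")


if __name__ == "__main__":
    asyncio.run(_demo())
```

In a real worker the ready file would be the configured `readyForCheckpointFilePath` rather than a temporary path, and `loop.add_signal_handler` must be called from the main thread on a Unix event loop.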
---
...
...
@@ -490,7 +497,7 @@ The DaemonSet communicates checkpoint/restore completion via Unix signals, not f
|--------|-----------|---------|
| `SIGUSR1` | DaemonSet → checkpoint pod | Checkpoint completed, process should exit |
| `SIGCONT` | DaemonSet → restored pod | Restore completed, process should wake up |
| `SIGUSR2` | DaemonSet → checkpoint pod | Checkpoint failed (wake process to continue) |
| `SIGKILL` | DaemonSet → checkpoint pod | Checkpoint failed — process terminated immediately |
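
As a quick check of why the table lists no handler for `SIGKILL`, the sketch below tries to register a handler for each signal. The helper `can_install_handler` is a hypothetical name introduced for this example; it relies only on the standard `signal` module and assumes a Unix host and the main thread.

```python
import signal


def can_install_handler(signum: signal.Signals) -> bool:
    """Try to install a no-op handler, then restore the previous disposition."""
    try:
        previous = signal.signal(signum, lambda received, frame: None)
    except (OSError, ValueError):
        # The OS rejects handlers for SIGKILL, so registration fails here.
        return False
    # A previous handler of None means it was not installed from Python;
    # fall back to the default disposition in that case.
    signal.signal(signum, previous if previous is not None else signal.SIG_DFL)
    return True


for sig in (signal.SIGUSR1, signal.SIGCONT, signal.SIGKILL):
    state = "catchable" if can_install_handler(sig) else "not catchable"
    print(f"{sig.name}: {state}")
```
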
CRIU tuning options are configured via the ChReK Helm chart's `config.checkpoint.criu` values, not environment variables. See the [Helm Chart Values](https://github.com/ai-dynamo/dynamo/tree/main/deploy/helm/charts/chrek/values.yaml) for available options.
...
...
@@ -660,7 +667,7 @@ CRIU tuning options are configured via the ChReK Helm chart's `config.checkpoint