Commit d381e6ff (unverified), authored by Schwinn Saereesitthipitak, committed by GitHub

feat(chrek): config refactor, /dev/shm support, and mount-policy rewrite (#5946)

parent b6824ae0
@@ -151,8 +151,9 @@ spec:
  - --checkpoint-enabled=true
  - --checkpoint-storage-type={{ .Values.checkpoint.storage.type }}
  - --checkpoint-signal-host-path={{ .Values.checkpoint.storage.signalHostPath }}
- - --checkpoint-criu-timeout={{ .Values.checkpoint.criu.timeout }}
  - --checkpoint-init-container-image={{ .Values.checkpoint.initContainerImage }}
+ - --checkpoint-ready-for-checkpoint-file-path={{ .Values.checkpoint.readyForCheckpointFilePath }}
+ - --checkpoint-restore-marker-file-path={{ .Values.checkpoint.restoreMarkerFilePath }}
  {{- if eq .Values.checkpoint.storage.type "pvc" }}
  - --checkpoint-pvc-name={{ .Values.checkpoint.storage.pvc.pvcName }}
  - --checkpoint-pvc-base-path={{ .Values.checkpoint.storage.pvc.basePath }}
......
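Review note: a rough sketch of how the changed flags would render, assuming the default values that appear in the values hunk of this commit. Only flags whose values are visible in this diff are listed; the rest of the container spec is omitted rather than guessed. The `--checkpoint-criu-timeout` flag no longer appears because the template line that emitted it is removed here.

```yaml
# Rendered operator args (excerpt), assuming the chart defaults shown in this commit.
# Flags whose values are not visible in this diff are intentionally left out.
- --checkpoint-enabled=true
- --checkpoint-init-container-image=busybox:latest
- --checkpoint-ready-for-checkpoint-file-path=/tmp/ready-for-checkpoint
- --checkpoint-restore-marker-file-path=/tmp/dynamo-restored
```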
@@ -157,6 +157,14 @@ checkpoint:
  # Defaults to busybox:latest if not specified
  initContainerImage: "busybox:latest"
+ # Path written by worker when model is loaded and ready for checkpointing
+ # Must match the path expected by checkpoint-enabled runtime images
+ readyForCheckpointFilePath: "/tmp/ready-for-checkpoint"
+ # Path written by restore-entrypoint after successful CRIU restore
+ # Must match the path expected by checkpoint-enabled runtime images
+ restoreMarkerFilePath: "/tmp/dynamo-restored"
  # Storage configuration
  # These settings tell the operator where to find checkpoint storage
  # Must match the configuration in the chrek chart
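Review note: a minimal sketch of a values override for the two new keys. The key names come from this diff; the paths below are placeholders invented for illustration. Per the comments in the hunk, whatever is set here has to match the paths the checkpoint-enabled runtime images expect.

```yaml
# Hypothetical override file; key names are from this commit, paths are assumed placeholders.
checkpoint:
  # Written by the worker once the model is loaded and ready for checkpointing
  readyForCheckpointFilePath: "/run/example/ready-for-checkpoint"  # assumed path
  # Written by restore-entrypoint after a successful CRIU restore
  restoreMarkerFilePath: "/run/example/restored"                   # assumed path
```

Leaving both keys unset keeps the defaults shown in the hunk (`/tmp/ready-for-checkpoint` and `/tmp/dynamo-restored`).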
@@ -196,14 +204,6 @@ checkpoint:
  # Reference to a docker config secret for registry authentication
  credentialsSecretRef: ""
- # CRIU timeout configuration (shared across checkpoint and restore)
- criu:
- # CRIU operation timeout in seconds
- # Maximum time to wait for checkpoint/restore to complete
- # Increase for models with very large memory footprints
- # 21600s (6 hours) is recommended for large LLMs (70B+)
- timeout: "21600"
  # Webhook configuration
  webhook:
  # Enable admission webhooks for validation
@@ -280,4 +280,3 @@ webhook:
  # Time before expiration to renew root CA (e.g., "720h" for 30 days)
  renewBefore: "720h"