v1.5.0-rc.17
[Unreleased]
Added
- Fleet-aggregate stats subsystem (commits `feature/v1.5-rc17`). New `ContainerStatsAggregator` polls each locally-monitored container once per tick (default 10 s) and computes a fleet-wide `ContainerStatsSummary` (total CPU%, total memory, top-N rows). Two new endpoints, `GET /api/v1/stats/summary` and `GET /api/v1/stats/summary/stream`, expose the current snapshot and a live SSE feed. The dashboard Resource Usage widget now consumes the SSE stream directly, fixing the regression (introduced in rc.13 by the `?touch=false` workaround) where the widget showed zeros because the per-container cache was never warmed. The legacy `GET /api/v1/containers/stats` endpoint and the client-side `summarizeContainerResourceUsage` rollup have been removed.
- Per-container update locks (commit `761fb834`). New keyed `LockManager` primitive in `app/updates/lock-primitives.ts` replaces the module-level `pLimit(1)` that was serialising every container update across the entire process. Lock keys are derived per container (and per compose project for `Dockercompose`), so two unrelated containers can now pull and recreate concurrently while two services in the same compose project still serialise correctly. The lock primitive is its own pure-logic file with full unit tests; the docker trigger and compose subclass derive the lock key set via a new `getUpdateLockKeys(container)` method.
- Restart recovery for queued and pulling updates (commit `00788b13`). Startup reconciliation in `app/store/update-operation.ts` is now selective: `status=queued` operations stay queued for the recovery dispatcher to pick up, and `phase=pulling` rows are reset to `queued` (pull is idempotent). All other in-progress phases (`prepare`, `renamed`, `new-created`, `old-stopped`, `new-started`, `health-gate`, `rollback-*`) remain marked failed because they leave inconsistent state that an operator should review. A new `app/updates/recovery.ts` module runs once after `registry.init()`, re-resolves trigger and container for each queued operation, and dispatches them through the existing fire-and-forget pipeline. Operations whose container or trigger no longer exists are marked failed with an explanatory `lastError` so they don't sit in the queue forever.
- Notification outbox with retry and dead-letter queue (commits `a9561d93`, `7d2ef6eb`, `b215d295`, `ce26bece`). New `notificationOutbox` LokiJS collection (`app/store/notification-outbox.ts`) and matching `app/notifications/outbox-worker.ts` background worker provide durable retry semantics for notification dispatch. `Trigger.dispatchContainerForEvent` now optimistically calls `this.trigger(container)` directly; on failure, the delivery intent is persisted to the outbox and the worker retries on a periodic drain with exponential backoff + jitter. After a configurable number of failed attempts (default 5) entries transition to the dead-letter queue; delivered and dead-letter entries are auto-purged past TTL (default 30 days). New `/api/notifications/outbox` REST surface lets operators list entries (`?status=` filter), retry from the DLQ (`POST /:id/retry`), or discard (`DELETE /:id`). New base method `Trigger.dispatchOutboxEntry(entry)` is the worker's delivery hook; subclasses can override.
- Notification outbox UI (commit `feature/v1.5-rc17`). New Notification outbox page (route `/notifications/outbox`, nav under Settings) consumes the existing `/api/notifications/outbox` REST surface so operators can review the dead-letter queue, retry stuck deliveries, or discard dead entries from the UI. Status tabs (Dead-letter / Pending / Delivered) keep the same query-param convention (`?status=`) used by the rest of the list views; counts per bucket render as inline badges. `Retry` is shown only on dead-letter rows; `Discard` is available everywhere. New `ui/src/services/notification-outbox.ts` mirrors the API exactly.
- Cancel queued or in-flight updates (commits `4b79e3ac`, `79487115`). `POST /api/operations/:id/cancel` now accepts both queued and in-progress operations. Queued ops are marked failed immediately with `lastError: 'Cancelled by operator'` (200). In-progress ops are flagged via a new `cancelRequested` field on the operation row and the endpoint returns `202 Accepted`. The lifecycle observes the flag at three safe checkpoints: after pull and before rename (clean abort, no rollback needed), before creating the replacement container, and before stopping the old container. Cancellations therefore either short-circuit cleanly or fall through the existing rollback path that renames the container back. The rollback path tags the rollback reason as `cancelled` so the audit trail distinguishes operator cancellations from real failures. Already-terminal ops still return `409 Conflict`. The container row's Cancel action is now visible for both queued and in-progress operations; the toast says "Cancelled" for the immediate path and "Cancellation requested" for the in-progress path.
- Global concurrent-update cap (`DD_UPDATE_MAX_CONCURRENT`). New counting semaphore (`Semaphore` class in `app/updates/lock-primitives.ts`) provides a configurable global gate on how many update lifecycles run simultaneously across the entire controller instance. The default `0` means unlimited, so there is no behavior change on upgrade. A positive integer `N` means at most N updates run concurrently. Negative or non-integer values fail fast at startup with a descriptive error. The cap layers on top of the existing per-container and per-compose-project locks; it does not replace them. Operations waiting on the cap remain in `queued` status. Scope is per controller instance; distributed agent hosts have independent counters by design. Self-update operations bypass the global cap: they take per-container locks but never wait on the global semaphore, which prevents a full update queue from starving an admin-triggered self-update.
- Health-gate SSE heartbeat (`DD_UPDATE_HEALTH_GATE_HEARTBEAT_MS`). Previously, while drydock waited for a new container to pass its health gate, the SSE pipeline was silent for the entire wait: the UI received no events between `phase: 'health-gate'` and `phase: 'health-gate-passed'`. For images with long healthcheck intervals (e.g. vaultwarden's 60 s check) this meant the UI had to rely on the REST reconciliation poll if the SSE connection was interrupted during that window. A periodic heartbeat now re-emits `phase: 'health-gate'` at a configurable interval (default 10 s). `DD_UPDATE_HEALTH_GATE_HEARTBEAT_MS=0` disables heartbeats entirely; other values below 1000 ms and non-integer values fail fast at startup. The heartbeat cancels immediately when the wait resolves in any direction (success, timeout, or unhealthy), ensuring the terminal event is never preempted. No new phases are introduced; existing UI consumers accept the re-emitted event unchanged.
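
The per-container locking described in the list above can be pictured with a small keyed-lock sketch. This is an illustration only, not drydock's `LockManager`: the class name `KeyedLocks`, the key strings, and the rule of acquiring keys in sorted order are all assumptions made for the example.

```typescript
// Hypothetical keyed lock: one promise chain per key. Callers that share a
// key run one after another; callers with disjoint keys run concurrently.
type Release = () => void;

class KeyedLocks {
  // Tail of the wait chain for each key; a missing entry means the key is free.
  private tails = new Map<string, Promise<void>>();

  // Acquire every key in sorted order so that two callers sharing several
  // keys can never take them in opposite orders and deadlock.
  async acquire(keys: string[]): Promise<Release> {
    const sorted = [...new Set(keys)].sort();
    const releases: Release[] = [];
    for (const key of sorted) {
      releases.push(await this.acquireOne(key));
    }
    return () => releases.reverse().forEach((release) => release());
  }

  private async acquireOne(key: string): Promise<Release> {
    const previous = this.tails.get(key) ?? Promise.resolve();
    let unlock: Release = () => undefined;
    const held = new Promise<void>((resolve) => {
      unlock = () => resolve();
    });
    // The next caller for this key waits until our `held` promise resolves.
    const tail = previous.then(() => held);
    this.tails.set(key, tail);
    await previous;
    return () => {
      unlock();
      // Nobody queued behind us: drop the entry so the map does not grow.
      if (this.tails.get(key) === tail) this.tails.delete(key);
    };
  }
}
```

With a key set such as `['compose:shop', 'container:web']` (illustrative names), two services of the same compose project share the project key and serialise, while an unrelated container holding only its own key proceeds immediately.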
Changed
- Watcher dispatch is fully fire-and-forget (commit `5cfa2286`). `Trigger.runUpdateAvailableSimpleTrigger` and `runAcceptedUpdateBatch` previously awaited `runAcceptedContainerUpdates`, so a slow update lifecycle stalled the next watcher tick. The API path was already fire-and-forget; the watcher path now matches. New `dispatchAccepted(accepted)` helper centralises the `void runAcceptedContainerUpdates(...).catch(() => undefined)` pattern across all four call sites. Per-operation failures are still terminalised inside the lifecycle handler, so swallowing the dispatch chain's rejection loses no observable information.
- Security alert emit is non-blocking inside the update lifecycle (commit `6c5198dd`). `SecurityGate.maybeEmitHighSeverityAlert` was awaited inside `evaluateScanOutcome`, which itself runs inside the update lifecycle's critical path. With multiple notifiers registered for security alerts, the await chained sequential provider calls (SMTP, Slack, HTTP, MQTT, webhook) into the lifecycle, multiplying latency before pull/recreate could even start. The function now returns synchronously after firing the emit. Notification dispatch semantics from the caller's perspective are unchanged (the same handlers run in the same order via `emitOrderedHandlers`); the lifecycle just no longer waits.
- "Update started" toasts renamed to "Update queued" (commit `79487115`). Dispatch is fire-and-forget: by the time the toast renders, the lifecycle hasn't started and the operation is just queued. The text now matches what actually happened: `"Update queued: {name}"`, `"Force update queued: {name}"`, `"Queued update(s) for N container(s)"`. Function names in `ui/src/utils/container-update.ts` are unchanged so call-site churn is zero.
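
The fire-and-forget pattern named in the watcher-dispatch entry above can be sketched as follows. The function names come from that entry, but the signatures, the `Accepted` type, and the `log` parameter are assumptions for illustration; the real lifecycle is considerably more involved.

```typescript
// Hypothetical shape of an accepted update; not drydock's real type.
type Accepted = { name: string };

// Stand-in lifecycle: records each outcome itself, then rejects on failure.
async function runAcceptedContainerUpdates(accepted: Accepted[], log: string[]): Promise<void> {
  for (const container of accepted) {
    await Promise.resolve(); // stands in for pull / recreate work
    if (container.name === 'broken') {
      log.push(`failed ${container.name}`); // failure is terminalised here
      throw new Error(`update failed for ${container.name}`);
    }
    log.push(`updated ${container.name}`);
  }
}

// Fire-and-forget: start the work, do not await it, and swallow the
// rejection so it cannot surface as an unhandled promise rejection.
function dispatchAccepted(accepted: Accepted[], log: string[]): void {
  void runAcceptedContainerUpdates(accepted, log).catch(() => undefined);
}
```

Because the lifecycle already records the failure before rejecting, the caller loses nothing by discarding the rejected promise; it simply returns before any update work has happened.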
Fixed
- One slow notifier no longer stalls every container update (commit `761fb834`). The module-level `pLimit(1)` introduced in v1.5 to serialise concurrent updates was the root cause behind reports of stuck queues whenever a single notifier hung: every update on every container was waiting for the same single slot. Per-container locks remove the global bottleneck while still preventing a container from being updated twice in parallel.
- Process restart no longer wipes the queued update list (commit `00788b13`). Previously every active operation was force-failed on startup. Queued and pulling-phase operations now resume; only operations mid-destructive-step (`renamed`/`new-created`/`old-stopped`/etc.) are surfaced for operator review. See the matching addition above.
- Transient notifier outages no longer drop alerts (commit `b215d295`). Direct dispatch failures land in the outbox and are retried with exponential backoff + jitter; only persistently failing entries (default: 5 failed attempts) move to the dead-letter queue. Crash-during-dispatch is the only remaining loss window.
- `dd.registry.lookup.image` label no longer corrupts deploy identity (commit `594a07e8`, fixes #336). The lookup label is intended to redirect tag/manifest queries to a different image (e.g. a private mirror running `myreg/nextcloud` looking up tags from Docker Hub's `library/nextcloud`), but `normalizeContainer` was assigning the substituted view back onto the container record, so the deploy identity (image name and registry URL) was silently rewritten to the lookup target. Compose-file rewrites and container recreates then deployed the wrong image. `normalizeContainer` no longer overwrites `image.name` / `image.registry.url`; a new `getImageForRegistryQuery` helper applies the substitution + provider URL normalisation only at each query boundary (`getTags`, `getImageManifestDigest`, `getImagePublishedAt`). Un-prefixed images (`nginx:1.0`) now default to `docker.io` for the registry URL; `Hub.getImageFullName` strips the prefix for clean display.
- Password-manager autofill restored on login form (commit `3abe2fa6`, fixes #335). Username and password inputs lost their `name` and `id` attributes during the v1.5 plain-HTML rewrite. Browser-native autofill kept working via `autocomplete=`, but credential managers that rely on `name`/`id` heuristics (Dashlane in Chrome, among others) could no longer identify the username field. Both attributes are restored.
- `security-scan-skipped` audit row now fires when the gate is disabled globally (commit `ae24e0a9`). Previously `recordSecurityAudit('security-scan-skipped', …)` only executed when the per-container label `dd.security.gate=off` was set. With `DD_SECURITY_GATE_MODE=off` configured globally, scans were silently skipped with no audit trail, so an operator reading the audit log had no indication that the gate was suppressed. `getGateDisabledAuditDetails` now selects the appropriate human-readable reason from whichever off-state is in effect and the audit call is unconditional.
- Registry URL normalization restored on container record after regression in `594a07e8`. Removing the `normalizeImage` call in `normalizeContainer` to fix deploy-identity corruption (issue #336) inadvertently left `image.registry.url` in its raw user-config form (`docker.io`) instead of the API base URL form (`https://registry-1.docker.io/v2`). All registry HTTP callers, `getImageFullName`, the Prometheus `image_registry_url` label, and the Docker trigger's self-update helper expect the normalized form. The URL rewrite is now restored for containers where the deploy image itself matches the provider; harbor-mirror containers (where a lookup label diverts to a different registry) correctly retain their deploy URL unchanged.
- `image.name` canonicalization also restored after partial fix in `4e06329b`. The prior fix only restored `image.registry.url`; `image.name` was still not rewritten through the provider's `normalizeImage`, so Docker Hub containers with un-prefixed names (e.g. `nginx`) kept `image.name = "nginx"` instead of `library/nginx`. This caused the Prometheus `image_name` label to emit the bare name, breaking e2e scenarios that assert `image_name="library/nginx"`. The `normalizeImage` result now also assigns `image.name` in the deploy-match branch; the cross-registry mirror branch (harbor → Hub lookup) is unaffected and still preserves the deploy name.
- Stack/group view no longer collapses to ungrouped mid-update when containers are recreated. When a Docker action recreates a container it receives a new container ID; the group-membership map was keyed only by the original ID, so the post-recreate lookup missed and every container fell into `__ungrouped__`. With a two-container stack the single-member-flatten rule then removed both group buckets entirely. `loadGroups()` now indexes the map under id, name, AND displayName, so the existing `map[container.name]` fallback in the lookup actually resolves after a recreate.
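
The group-membership fix in the last entry above comes down to indexing each member under every stable identifier. A minimal sketch, assuming simplified `Group` and `GroupMember` shapes and hypothetical helper names (`buildGroupIndex`, `resolveGroup`) that are not taken from the drydock UI code:

```typescript
// Simplified, assumed shapes for illustration only.
interface GroupMember { id: string; name: string; displayName?: string }
interface Group { name: string; containers: GroupMember[] }

// Index every member under id, name, and displayName so a lookup still
// succeeds after a recreate has given the container a fresh id.
function buildGroupIndex(groups: Group[]): Map<string, string> {
  const index = new Map<string, string>();
  for (const group of groups) {
    for (const member of group.containers) {
      for (const key of [member.id, member.name, member.displayName]) {
        if (key) index.set(key, group.name);
      }
    }
  }
  return index;
}

// Try the id first, then fall back to the name, then give up.
function resolveGroup(index: Map<string, string>, container: { id: string; name: string }): string {
  return index.get(container.id) ?? index.get(container.name) ?? '__ungrouped__';
}
```

With an id-only index the name fallback can never hit, which matches the symptom described above: after a recreate every container resolves to `__ungrouped__`.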
Added
- Chinese (Simplified) UI (PR #331 by TianMiao, commits `8f3286b7`, `b97944dc`). Chinese is the first non-English locale to ship in drydock. Fourteen namespace JSON files under `ui/src/locales/zh/` cover the full UI surface: dashboard, containers, agents, config, list views, container components, app shell, auth, logs, and shared components (~1,100+ strings). A latent bootstrap bug (`buildMessages` map initialized only for `en`, causing `Object.assign(undefined)` crashes for any second locale) was fixed as part of this work, along with 112 translation gaps that arose because the locale files were authored before several new UI strings landed in rc.17. The i18n framework picks up the new locale through the existing `import.meta.glob` auto-discovery; no additional wiring was needed. The Crowdin integration planned for v1.8.0 will pick up the `zh` catalog automatically.
- Multi-select event-type filter in audit log (commit `5e2d0c70`, discussion #332). The audit log's event-type filter was a single-value `<select>`, so operators wanting to view both `update-applied` and `update-failed` in the same session had to query them separately and mentally merge the results. The filter is now a checkbox dropdown supporting any combination of event categories simultaneously. The backend already accepted `?actions=` (plural, comma-separated); this change wires up the missing UI half. Back-compat: existing `?action=foo` bookmark URLs parse as a single-element selection without requiring a migration.
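
The back-compat behaviour of the audit-log filter can be sketched as a small query parser. The helper name `parseActionFilter` is hypothetical, and letting the plural `actions` parameter take precedence over the legacy `action` parameter is an assumption, not something the entry above states.

```typescript
// Hypothetical parser: accepts the plural, comma-separated `?actions=` form
// and falls back to the legacy single-value `?action=` form.
function parseActionFilter(query: string): string[] {
  const params = new URLSearchParams(query);
  const plural = params.get('actions');
  // Assumption: when both forms are present, the plural one wins.
  const raw = plural !== null ? plural.split(',') : params.getAll('action');
  const cleaned = raw.map((value) => value.trim()).filter((value) => value.length > 0);
  return [...new Set(cleaned)]; // drop duplicates, keep first-seen order
}
```

Under these assumptions an old bookmark such as `?action=foo` yields a one-element selection, and a query with neither parameter yields an empty selection (no filter).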
Changed
- `app/updates/locks.ts` renamed to `app/updates/lock-primitives.ts` (commit `4c506d21`). `locks.ts` was a misleading filename for a module that contains general-purpose synchronisation primitives (`Semaphore`, `LockManager`) not tied to the updates subsystem. Existing CHANGELOG entries above and the `app/updates/update-locks.ts` consumer have been updated to the new path.
- `HookExecutor` and `RollbackMonitor` now delegate label-to-integer parsing to the project-wide `parseEnvNonNegativeInteger` helper instead of inline NaN/zero guards; `getErrorMessage` inline copies in `post-start-liveness` and `request-update` are consolidated to the shared `util/error` import.
Security
- Credential redaction expanded to `x-registry-auth`, `*-token`, and `api-key` fields (commit `4417ce25`). The existing `scrubAuthorizationHeaderValues` helper only redacted `Authorization` header values. Structured error payloads in `update-failed` SSE events could still leak registry auth tokens, API keys, and OAuth bearer strings embedded under other field names. A second regex pass now redacts `x-registry-auth`, any field matching `*-token`, and `api-key`/`api_key` values before the payload leaves the server. The `update-failed` SSE path was the primary exposure vector; operator-visible diagnostic strings no longer leak registry credentials in production environments.
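
To make the field patterns above concrete, here is a minimal sketch of key-based redaction. Note that it walks a structured payload and matches on field names, which may differ from the regex pass the actual helper runs; the function name `redactSensitiveFields` and the `[REDACTED]` placeholder are assumptions.

```typescript
// Field names treated as sensitive, matched case-insensitively. The list
// mirrors the entry above: Authorization, x-registry-auth, api-key/api_key,
// and anything ending in "-token".
const SENSITIVE_KEY = /^(authorization|x-registry-auth|api[-_]key|.+-token)$/i;

// Recursively copy a payload, replacing the value of every sensitive field.
function redactSensitiveFields(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redactSensitiveFields(item));
  }
  if (value !== null && typeof value === 'object') {
    const out: Record<string, unknown> = {};
    for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
      out[key] = SENSITIVE_KEY.test(key) ? '[REDACTED]' : redactSensitiveFields(child);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}
```

A key-based walk like this only catches credentials stored under a matching field name; secrets embedded inside free-form message strings would still need a text-level pass.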
Performance
- Binary indices and drain concurrency cap for notification outbox (commit `9393253e`). `findReadyForDelivery`, the hot path that runs on every outbox drain cycle, queried `data.status` and `data.nextAttemptAt` with standard LokiJS indices, causing full-collection scans as the outbox grew. Switching those two fields to binary indices gives O(log n) binary-search lookups. `OutboxWorker` gains a `maxDrainConcurrency` option (default 10) backed by a `DrainSemaphore` so a burst of ready entries cannot flood the trigger pipeline with unbounded parallel deliveries. `store/util` normalises a `binaryIndices` option on `initCollection` so collections receive correct field registration at creation time.
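
The drain concurrency cap can be illustrated with a generic counting semaphore. This is a sketch under assumptions, not the project's `DrainSemaphore`: the class `CountingSemaphore`, the helper `drainWithLimit`, and the `deliver` callback are all hypothetical names.

```typescript
// Generic counting semaphore: at most `permits` holders at any one time.
class CountingSemaphore {
  private available: number;
  private waiters: Array<() => void> = [];

  constructor(permits: number) {
    this.available = permits;
  }

  async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available -= 1;
      return;
    }
    // No permit free: park until release() hands one over.
    await new Promise<void>((resolve) => {
      this.waiters.push(() => resolve());
    });
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) next(); // hand the permit straight to the oldest waiter
    else this.available += 1;
  }
}

// Deliver every ready entry, never running more than `limit` at once.
// A real worker would record each failure per entry instead of letting
// one rejection fail the whole drain.
async function drainWithLimit<T>(
  entries: T[],
  limit: number,
  deliver: (entry: T) => Promise<void>,
): Promise<void> {
  const semaphore = new CountingSemaphore(limit);
  await Promise.all(
    entries.map(async (entry) => {
      await semaphore.acquire();
      try {
        await deliver(entry);
      } finally {
        semaphore.release();
      }
    }),
  );
}
```

The same shape applies to a global update cap: waiters stay parked (queued) until a running holder releases its permit.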
Tests / CI
- Reconciliation terminal-hold toast assertions use `maxIdBefore` pattern. Two tests in `ContainersView.spec.ts` were flaking on CI because `vi.advanceTimersByTime(1500)` expired pre-existing toasts, lowering `toasts.value.length` even though no new toasts were added. Replaced `countBefore = toasts.value.length` / `toBe(countBefore)` with the `maxIdBefore` / `filter(t.id > maxIdBefore)` pattern already used elsewhere in the file.
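
The difference between the two assertion styles above is easiest to see in isolation. The sketch below assumes toast ids increase monotonically and uses a simplified `Toast` shape with hypothetical helper names; it is not code from `ContainersView.spec.ts`.

```typescript
// Simplified, assumed toast shape.
interface Toast { id: number; message: string }

// Highest id currently present (0 when the list is empty).
function maxToastId(toasts: Toast[]): number {
  return toasts.reduce((max, toast) => Math.max(max, toast.id), 0);
}

// Toasts created after the snapshot. Unlike comparing list lengths, this
// is unaffected by older toasts expiring in the meantime.
function toastsAddedSince(toasts: Toast[], maxIdBefore: number): Toast[] {
  return toasts.filter((toast) => toast.id > maxIdBefore);
}
```

If an older toast expires while timers advance, the list length drops even though nothing new was shown, so a length comparison fails spuriously; filtering by id still reports zero new toasts.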