v1.5.0-rc.19

[Unreleased]

Added

Fleet-aggregate stats subsystem (commits feature/v1.5-rc17). New ContainerStatsAggregator polls each locally-monitored container once per tick (default 10 s) and computes a fleet-wide ContainerStatsSummary (total CPU%, total memory, top-N rows). Two new endpoints — GET /api/v1/stats/summary and GET /api/v1/stats/summary/stream — expose the current snapshot and a live SSE feed; the dashboard Resource Usage widget now consumes the SSE stream directly, fixing the regression (introduced in rc.13 by the ?touch=false workaround) where the widget showed zeros because the per-container cache was never warmed. The legacy GET /api/v1/containers/stats endpoint and the client-side summarizeContainerResourceUsage rollup have been removed.
Per-container update locks (commit 761fb834). New keyed LockManager primitive in app/updates/lock-primitives.ts replaces the module-level pLimit(1) that was serialising every container update across the entire process. Lock keys are derived per container (and per compose project for Dockercompose), so two unrelated containers can now pull and recreate concurrently while two services in the same compose project still serialise correctly. The lock primitive is its own pure-logic file with full unit tests; the docker trigger and compose subclass derive the lock key set via a new getUpdateLockKeys(container) method.
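As an illustration of the key derivation described above, here is a minimal sketch. The LockTarget shape, the container:/compose: key prefixes, and both function names are assumptions made for this example; it is not the actual getUpdateLockKeys implementation.

```typescript
// Hypothetical container shape: only the fields needed to derive lock keys.
type LockTarget = { id: string; composeProject?: string };

// Derive the lock keys an update would hold (the key format is an assumption).
function deriveLockKeys(target: LockTarget): string[] {
  const keys = [`container:${target.id}`];
  if (target.composeProject) {
    keys.push(`compose:${target.composeProject}`);
  }
  return keys;
}

// Two updates need to serialise only when their key sets intersect.
function mustSerialise(a: LockTarget, b: LockTarget): boolean {
  const held = new Set(deriveLockKeys(a));
  return deriveLockKeys(b).some((key) => held.has(key));
}
```

Under these assumptions, two standalone containers share no key and can run concurrently, while two services in the same compose project share the project key and serialise.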
Restart recovery for queued and pulling updates (commit 00788b13). Startup reconciliation in app/store/update-operation.ts is now selective: status=queued operations stay queued for the recovery dispatcher to pick up, and phase=pulling rows are reset to queued (pull is idempotent). All other in-progress phases — prepare, renamed, new-created, old-stopped, new-started, health-gate, rollback-* — remain marked failed because they leave inconsistent state that an operator should review. A new app/updates/recovery.ts module runs once after registry.init(), re-resolves trigger and container for each queued operation, and dispatches them through the existing fire-and-forget pipeline. Operations whose container or trigger no longer exists are marked failed with an explanatory lastError so they don't sit in the queue forever.
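The selective reconciliation rule can be sketched as a small decision function. The OperationRow shape and the function name are hypothetical; only the status and phase values come from the entry above.

```typescript
// Hypothetical operation row: just the fields the startup rule inspects.
type OperationRow = { status: "queued" | "in-progress"; phase?: string };
type StartupOutcome = "queued" | "failed";

// Decide what an unfinished operation becomes after a process restart.
function reconcileOnStartup(op: OperationRow): StartupOutcome {
  // Queued operations stay queued for the recovery dispatcher.
  if (op.status === "queued") return "queued";
  // A pull is idempotent, so it is safe to start it again.
  if (op.phase === "pulling") return "queued";
  // Every other in-progress phase may have left inconsistent state behind.
  return "failed";
}
```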
Notification outbox with retry and dead-letter queue (commits a9561d93, 7d2ef6eb, b215d295, ce26bece). New notificationOutbox LokiJS collection (app/store/notification-outbox.ts) and matching app/notifications/outbox-worker.ts background worker provide durable retry semantics for notification dispatch. Trigger.dispatchContainerForEvent now optimistically calls this.trigger(container) directly; on failure, the delivery intent is persisted to the outbox and the worker retries on a periodic drain with exponential backoff + jitter. After a configurable number of failed attempts (default 5) entries transition to the dead-letter queue; delivered and dead-letter entries are auto-purged past TTL (default 30 days). New /api/notifications/outbox REST surface lets operators list entries (?status= filter), retry from the DLQ (POST /:id/retry), or discard (DELETE /:id). New base method Trigger.dispatchOutboxEntry(entry) is the worker's delivery hook; subclasses can override.
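For readers unfamiliar with the retry policy, the sketch below shows one common form of exponential backoff with jitter ("full jitter") together with the dead-letter threshold. The base delay, the cap, the jitter variant, and both function names are assumptions for illustration; only the default of 5 attempts comes from the entry above.

```typescript
// Full-jitter backoff: a random delay in [0, min(capMs, baseMs * 2^attempt)).
// baseMs and capMs are illustrative values, not drydock defaults.
function nextDelayMs(
  attempt: number,
  baseMs: number = 1000,
  capMs: number = 60000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// After maxAttempts failed deliveries the entry moves to the dead-letter queue.
function statusAfterFailure(
  failedAttempts: number,
  maxAttempts: number = 5,
): "pending" | "dead-letter" {
  return failedAttempts >= maxAttempts ? "dead-letter" : "pending";
}
```

Passing the random source as a parameter keeps the delay calculation deterministic in tests.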
Notification outbox UI (commit feature/v1.5-rc17). New Notification outbox page (route /notifications/outbox, nav under Settings) consumes the existing /api/notifications/outbox REST surface so operators can review the dead-letter queue, retry stuck deliveries, or discard dead entries from the UI. Status tabs (Dead-letter / Pending / Delivered) keep the same query-param convention (?status=) used by the rest of the list views; counts per bucket render as inline badges. Retry is shown only on dead-letter rows; Discard is available everywhere. New ui/src/services/notification-outbox.ts mirrors the API exactly.
Cancel queued or in-flight updates (commits 4b79e3ac, 79487115). POST /api/operations/:id/cancel now accepts both queued and in-progress operations. Queued ops are marked failed immediately with lastError: 'Cancelled by operator' (200). In-progress ops are flagged via a new cancelRequested field on the operation row and the endpoint returns 202 Accepted; the lifecycle observes the flag at three safe checkpoints — after pull and before rename (clean abort, no rollback needed), before creating the replacement container, and before stopping the old container — so cancellations either short-circuit cleanly or fall through the existing rollback path that renames the container back. The rollback path tags the rollback reason as cancelled so the audit trail distinguishes operator cancellations from real failures. Already-terminal ops still return 409 Conflict. The container row's Cancel action is now visible for both queued and in-progress operations; the toast says "Cancelled" for the immediate path and "Cancellation requested" for the in-progress path.
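The three response paths can be summarised in a small mapping. This is a sketch only: the OperationState values, the effect labels, and the function name are hypothetical, while the HTTP status codes are the ones listed above.

```typescript
// Hypothetical operation states, used only for this illustration.
type OperationState = "queued" | "in-progress" | "succeeded" | "failed";

type CancelOutcome = {
  httpStatus: 200 | 202 | 409;
  effect: "mark-failed" | "set-cancel-requested" | "reject";
};

// Map the current operation state to the cancel endpoint's response.
function cancelOutcome(state: OperationState): CancelOutcome {
  switch (state) {
    case "queued":
      // Nothing has started yet, so the operation can be failed immediately.
      return { httpStatus: 200, effect: "mark-failed" };
    case "in-progress":
      // The lifecycle observes the flag at its next safe checkpoint.
      return { httpStatus: 202, effect: "set-cancel-requested" };
    default:
      // Already-terminal operations cannot be cancelled.
      return { httpStatus: 409, effect: "reject" };
  }
}
```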
Global concurrent-update cap (DD_UPDATE_MAX_CONCURRENT). New counting semaphore (Semaphore class in app/updates/lock-primitives.ts) provides a configurable global gate on how many update lifecycles run simultaneously across the entire controller instance. Default 0 = unlimited — no behavior change on upgrade. Positive integer N means at most N updates run concurrently. Negative or non-integer values fail fast at startup with a descriptive error. The cap layers on top of the existing per-container and per-compose-project locks; it does not replace them. Operations waiting on the cap remain in queued status. Scope is per controller instance; distributed agent hosts have independent counters by design. Self-update operations bypass the global cap — they take per-container locks but never wait on the global semaphore, preventing a full update queue from starving an admin-triggered self-update.
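A counting semaphore with the "0 means unlimited" convention might look like the sketch below. The class name CountingSemaphore and its members are assumptions for this example and do not reproduce the Semaphore class in app/updates/lock-primitives.ts.

```typescript
// Minimal counting semaphore; a limit of 0 means "no cap".
class CountingSemaphore {
  private active = 0;
  private readonly waiters: Array<() => void> = [];
  private readonly limit: number;

  constructor(limit: number) {
    if (!Number.isInteger(limit) || limit < 0) {
      throw new RangeError(`limit must be a non-negative integer, got ${limit}`);
    }
    this.limit = limit;
  }

  get running(): number {
    return this.active;
  }

  get waiting(): number {
    return this.waiters.length;
  }

  // Resolves immediately when a slot is free, otherwise when one is released.
  acquire(): Promise<void> {
    if (this.limit === 0 || this.active < this.limit) {
      this.active += 1;
      return Promise.resolve();
    }
    return new Promise<void>((resolve) => {
      this.waiters.push(resolve);
    });
  }

  // Hands the slot to the oldest waiter, or frees it when nobody is waiting.
  release(): void {
    const next = this.waiters.shift();
    if (next) {
      next(); // the slot changes owner, so the active count stays the same
      return;
    }
    this.active = Math.max(0, this.active - 1);
  }
}
```

In this sketch a caller that bypasses the cap, as self-update does, would simply never call acquire().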
Health-gate SSE heartbeat (DD_UPDATE_HEALTH_GATE_HEARTBEAT_MS). While drydock waits for a new container to pass its health gate, the SSE pipeline used to be silent for the entire wait: the UI received no events between phase: 'health-gate' and phase: 'health-gate-passed'. For images with long healthcheck intervals (e.g. vaultwarden's 60 s check), a UI whose SSE connection was interrupted during that window had to fall back on the REST reconciliation poll to catch up. A periodic heartbeat now re-emits phase: 'health-gate' at a configurable interval (default 10 s). DD_UPDATE_HEALTH_GATE_HEARTBEAT_MS=0 disables heartbeats entirely; non-zero values below 1000 ms and non-integer values fail fast at startup. The heartbeat cancels immediately when the wait resolves in any direction (success, timeout, or unhealthy), ensuring the terminal event is never preempted. No new phases are introduced; existing UI consumers accept the re-emitted event unchanged.
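The validation rules for the heartbeat interval can be sketched as follows. The function name, the null return for "disabled", and the error messages are assumptions for this example; the 10 s default, the 0 sentinel, and the 1000 ms floor come from the entry above.

```typescript
// Parse a heartbeat interval in milliseconds.
// Returns null when heartbeats are disabled (value 0).
function parseHeartbeatMs(
  raw: string | undefined,
  defaultMs: number = 10000,
): number | null {
  if (raw === undefined || raw.trim() === "") {
    return defaultMs;
  }
  const value = Number(raw);
  if (!Number.isInteger(value)) {
    throw new Error(`heartbeat interval must be an integer, got "${raw}"`);
  }
  if (value === 0) {
    return null; // heartbeats disabled
  }
  if (value < 1000) {
    throw new Error(`heartbeat interval must be 0 or at least 1000 ms, got ${value}`);
  }
  return value;
}
```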

Changed

Crowdin export configuration aligned with app locale folders. Crowdin now maps language codes such as es-ES into the locale folder IDs the UI actually loads (for example es) and only downloads languages exposed in the locale picker. A new config guard test prevents future sync PRs from adding ignored region-coded folders, and the new auto-hidden-columns tooltip source avoids English-only column(s) punctuation that triggered Crowdin QA warnings for translated strings.
Shared DataTable column sizing overhaul (commit 596adcd2). All first-party table surfaces now route through the shared DataTable component with numeric sizing metadata (size, minSize, maxSize, flex, priority, overflow, autoSize) instead of ad-hoc string widths. Tables render a stable <colgroup>, keep actions in an independent sticky/fixed managed column, support pointer and keyboard column resizing, double-click autosize visible content, and persist manual/autosized widths per table via browser preferences. Containers uses the sizing data for responsive auto-hide math so narrow widths hide lower-priority metadata instead of shrinking columns below readable minimums. The config webhook endpoint list was migrated too, and a new architecture test fails if raw <table> markup or string column widths reappear in ui/src.
Watcher dispatch is fully fire-and-forget (commit 5cfa2286). Trigger.runUpdateAvailableSimpleTrigger and runAcceptedUpdateBatch previously awaited runAcceptedContainerUpdates, so a slow update lifecycle stalled the next watcher tick. The API path was already fire-and-forget; the watcher path now matches. New dispatchAccepted(accepted) helper centralises the void runAcceptedContainerUpdates(...).catch(() => undefined) pattern across all four call sites. Per-operation failures are still terminalised inside the lifecycle handler, so swallowing the dispatch chain's rejection loses no observable information.
Security alert emit is non-blocking inside the update lifecycle (commit 6c5198dd). SecurityGate.maybeEmitHighSeverityAlert was awaited inside evaluateScanOutcome, which itself runs inside the update lifecycle's critical path. With multiple notifiers registered for security alerts, the await chained sequential provider calls (SMTP, Slack, HTTP, MQTT, webhook) into the lifecycle, multiplying latency before pull/recreate could even start. The function now returns synchronously after firing the emit; notification dispatch semantics from the caller's perspective are unchanged (the same handlers run in the same order via emitOrderedHandlers), the lifecycle just no longer waits.
"Update started" toasts renamed to "Update queued" (commit 79487115). Dispatch is fire-and-forget — by the time the toast renders, the lifecycle hasn't started, the operation is just queued. The text now matches what actually happened: "Update queued: {name}", "Force update queued: {name}", "Queued update(s) for N container(s)". Function names in ui/src/utils/container-update.ts are unchanged so call-site churn is zero.

Fixed

#356 — Containers list Version column no longer hides the human-readable tag for floating-tag + digest-watch images. The rc.18 fix for #342 addressed digest-pinned containers, which rendered currentTag → newTag as two identical truncated sha256: strings because both fields came from the same pinned digest reference; it replaced that output with a real formatShortDigest(localValue) → formatShortDigest(remoteValue) pair whenever the update kind was 'digest'. That was correct for digest-pinned containers but cast too wide a net: containers that pull a floating tag (:latest, :v8.13.2, :compose-X-version-9.0.1) with image.digest.watch enabled also surface as kind === 'digest' whenever the registry rebuilds the image. There the tag is meaningful and the user expects to see it, so replacing it with two sha256:… hashes obscured every linuxserver/* and similar GHCR-hosted row on the Containers table for affected users such as the reporter. A new derived isDigestPinned: boolean (added to the UI Container type, mapped from image.tag.value.startsWith('sha256:'), the same heuristic the watcher uses at app/watchers/providers/docker/image-comparison.ts:240) now gates the digest-pair render: digest-pinned containers continue to show the sha256:abc… → sha256:def… pair the #342 fix intended, while floating-tag + digest-watch containers render the tag once (no arrow, since currentTag === newTag for digest-only updates) with the digest pair surfaced on the cell tooltip. The two container-detail panels gain a small muted "Digest:" subline showing the actual digest transition so the underlying change is still visible without dominating the version row. Applies symmetrically to all five UI sites that switched in rc.18: the Containers table version cell, card body, and list-accordion image subtitle, plus the side and full-page detail panels.
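The gating logic can be illustrated with a short sketch. The VersionInfo shape, the shortDigest helper (a stand-in for formatShortDigest, whose real output format is not shown here), and the versionCell function are hypothetical; only the startsWith('sha256:') heuristic comes from the entry above.

```typescript
// Hypothetical view-model for a digest-only update.
type VersionInfo = { tag: string; localDigest: string; remoteDigest: string };

// Heuristic from the entry: a tag that is itself a digest means "digest-pinned".
function isDigestPinned(tag: string): boolean {
  return tag.startsWith("sha256:");
}

// Illustrative shortener; the truncation length is an assumption.
function shortDigest(digest: string, hexChars: number = 12): string {
  const hex = digest.replace(/^sha256:/, "");
  return `sha256:${hex.slice(0, hexChars)}`;
}

// Pinned containers show the digest pair; floating tags show the tag once
// and move the digest pair into the tooltip.
function versionCell(info: VersionInfo): { text: string; tooltip?: string } {
  const pair = `${shortDigest(info.localDigest)} → ${shortDigest(info.remoteDigest)}`;
  return isDigestPinned(info.tag)
    ? { text: pair }
    : { text: info.tag, tooltip: pair };
}
```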
#357 — Transient Trivy failures no longer wipe previously-stored scan history. The scheduler used to overwrite container.security.scan unconditionally; when Trivy hit a hiccup (daemon timeout, registry blip, missing socket) mapToErrorResult returned an empty status:'error' record and that result silently replaced every prior passed/blocked entry on the next cycle. The scheduler now keeps the existing record when the new result is an error and there is something to preserve, capped at a 7-day max-staleness window so a persistently broken pipeline eventually surfaces in stored state instead of locking in a stale passed indefinitely; the UI still sees the live error via SSE so operators are not left in the dark either way. Error results are also no longer indefinitely re-spawning fresh Trivy invocations — scanImageWithDedup now uses a 15-minute error retry floor so under aggressive cron and a registry outage, retries are bounded to once per 15 minutes per digest instead of once per scheduler cycle.
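The retention rule reads naturally as a pure function. This is a sketch under assumptions: the ScanRecord shape, the scannedAt field, and the function name are hypothetical, while the 7-day window is the one stated above.

```typescript
// Hypothetical stored scan record; scannedAt is epoch milliseconds.
type ScanRecord = { status: "passed" | "blocked" | "error"; scannedAt: number };

const MAX_STALENESS_MS = 7 * 24 * 60 * 60 * 1000; // 7 days

// Decide which record should be stored after a scheduler cycle.
function chooseStoredScan(
  existing: ScanRecord | undefined,
  incoming: ScanRecord,
  now: number,
): ScanRecord {
  if (
    incoming.status === "error" &&
    existing !== undefined &&
    existing.status !== "error" &&
    now - existing.scannedAt < MAX_STALENESS_MS
  ) {
    // Transient failure: keep the last meaningful result.
    return existing;
  }
  // Fresh result, nothing worth preserving, or the old result is too stale.
  return incoming;
}
```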
#355 — update-failed notifications no longer drop silently when the controller's container store races against post-failure prune. UpdateLifecycleExecutor now carries the failing container on the update-failed payload, and Trigger.handleContainerUpdateFailedEvent accepts payload.container as the primary source with the store lookup as fallback — mirroring the existing update-applied symmetry. Previously, when the store lookup missed (post-failure prune timing, agent push race, watcher/name re-key) the trigger silently debug-logged "No container found for update-failed event => ignore" and the user got no out-of-band signal that the update had failed. The event payload types are now strictly typed (container?: Container only — no Record<string, unknown> escape hatch), the three duck-typing payload-extraction blocks across the trigger handlers collapsed into a single direct payload.container || lookup(...) pattern, and the agent SSE relay strips container from dd:update-applied / dd:update-failed events before transmit (mirroring the controller-side sanitizer at app/api/sse.ts) so the full container blob — vulnerabilities, env entries, labels — no longer goes over the wire on every event.
#355 / #357 — Trivy scan and SBOM no longer require /var/run/docker.sock inside the drydock container. Regression introduced in rc.17 forced Trivy to use only the local Docker daemon as image source. Operators running the tecnativa/docker-socket-proxy topology (documented in README.md), rootless Docker, podman, or remote watchers saw every gated update fail post-pull with dial unix /var/run/docker.sock: connect: no such file or directory, and previously-stored scan results were also overwritten with empty error records when the scheduler fired. The forced --image-src docker flag is removed; Trivy now uses its default source order (docker, containerd, podman, remote) and falls back to a registry pull when the local daemon isn't reachable. Operators who know their topology is socket-less and want to skip the docker/containerd/podman probe attempts can set DD_SECURITY_TRIVY_IMAGE_SRC=remote (any value Trivy accepts works, including comma-separated lists like remote,docker); when unset Trivy auto-detects. Pre-rc.17 behaviour is fully restored.
#290 — "Updated Successfully" toast no longer drops intermittently after a container update. Terminal-update toasts previously fired from three independent handlers (ContainerUpdateDialog, useContainerSsePatchPipeline, ContainersGroupedViews), each gated on different state — any one of operationId missing on the wire, the view being unmounted, or the per-batch dependency on ContainersGroupedViews being mounted would silently swallow the toast. A new useGlobalUpdateToast composable mounted once at App.vue is the single source of truth: listens for dd:sse-update-applied / dd:sse-update-failed / dd:sse-batch-update-completed (via globalThis events), survives route navigation, dedupes by operationId over a 5-minute window matched to the SSE replay buffer, and waits for the matching dd:sse-container-added/updated/removed event before firing so the toast appears the moment the row's "Updating" badge clears (not on a hardcoded delay). A 5s safety fallback fires the toast for cases where no row event arrives (remote agents, deleted containers). Backend stops coercing missing operationId/containerId to '' so the wire format is honest about what's optional. Browser EventSource cannot set custom headers on reconnect, so Last-Event-ID is now also accepted via query param (?last-event-id=) and validated against the canonical <bootId>:<counter> shape at the request boundary. Defensive hardening: module-level singleton guard so a stray child-component install can't double-register listeners, FIFO-bounded dedup map (cap 500) defends against runaway operation throughput, and HTML angle brackets are stripped from raw error text before i18n interpolation.
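A FIFO-bounded dedup map with a time window, like the one described above, could be sketched as follows. The class name, its method names, and the injected clock are assumptions for this example; the 5-minute window and the cap of 500 are the values from the entry.

```typescript
// Dedup by key within a TTL window, with a FIFO size cap.
class BoundedDedup {
  private readonly seen = new Map<string, number>();
  private readonly ttlMs: number;
  private readonly cap: number;

  constructor(ttlMs: number = 5 * 60 * 1000, cap: number = 500) {
    this.ttlMs = ttlMs;
    this.cap = cap;
  }

  get size(): number {
    return this.seen.size;
  }

  // Returns true the first time a key is seen inside the TTL window.
  firstSeen(key: string, now: number = Date.now()): boolean {
    const seenAt = this.seen.get(key);
    if (seenAt !== undefined && now - seenAt < this.ttlMs) {
      return false;
    }
    // Delete before set so the key moves to the newest insertion position.
    this.seen.delete(key);
    this.seen.set(key, now);
    if (this.seen.size > this.cap) {
      // Map iterates in insertion order, so the first key is the oldest.
      const oldest = this.seen.keys().next().value;
      if (oldest !== undefined) {
        this.seen.delete(oldest);
      }
    }
    return true;
  }
}
```

A toast handler built on this sketch would fire only when firstSeen(operationId) returns true.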
#289 — Container row state regression after recreate. Same root cause as #290: per-view SSE handlers dropped events when the view was unmounted or the payload omitted operationId. The row-state pipeline (useContainerSsePatchPipeline) is now decoupled from toast emission so it can focus solely on patch application; toast firing lives exclusively in useGlobalUpdateToast at App.vue.
#291 — Dashboard fired "updated" toast while the "updating" toast was missed. The dashboard had its own duplicate SSE-terminal-toast handler that competed with (and sometimes pre-empted) the global one. The dashboard SSE handler now does row-state hold/ghost management only; toast emission is owned exclusively by the global handler at App.vue.
Release security gate restored before rc.18. Patched transitive npm dependencies flagged by OSV during the post-merge main CI run: fast-uri now resolves to 3.1.2 in app/UI lock domains, and fast-xml-builder now resolves to 1.2.0 through the app/e2e XML parser override path. This clears the Qlty security gate without changing runtime behavior.
#345 — Host names with numeric suffixes no longer lose the differentiating character in the Containers table. The rc.18 table pass already replaced the old host badge with plain text, and the host column now has a wider default/readable floor so names like servicevault and servicevault2 remain distinguishable at desktop widths. Narrow layouts still auto-hide the host column into secondary metadata instead of shrinking it below readability.
#340 - Self-update no longer preserves stale Drydock version metadata. The self-update clone path now drops image-inherited environment variables and labels from the old image when the target image changed them, so replacement containers inherit the new image's DD_VERSION and org.opencontainers.image.version instead of reporting the previous release after an automatic update. Operator-supplied environment variables and labels remain preserved.
One slow notifier no longer stalls every container update (commit 761fb834). The module-level pLimit(1) introduced in v1.5 to serialise concurrent updates was the root cause behind reports of stuck queues whenever a single notifier hung — every update on every container was waiting for the same single slot. Per-container locks remove the global bottleneck while still preventing a container from being updated twice in parallel.
Process restart no longer wipes the queued update list (commit 00788b13). Previously every active operation was force-failed on startup. Queued and pulling-phase operations now resume; only operations mid-destructive-step (renamed/new-created/old-stopped/etc.) are surfaced for operator review. See the matching addition above.
Transient notifier outages no longer drop alerts (commit b215d295). Direct dispatch failures land in the outbox and are retried with exponential backoff + jitter; only persistently failing entries (default: 5 failed attempts) move to the dead-letter queue. Crash-during-dispatch is the only remaining loss window.
dd.registry.lookup.image label no longer corrupts deploy identity (commit 594a07e8, fixes #336). The lookup label is intended to redirect tag/manifest queries to a different image (e.g. a private mirror running myreg/nextcloud looking up tags from Docker Hub's library/nextcloud), but normalizeContainer was assigning the substituted view back onto the container record so the deploy identity — image name and registry URL — was silently rewritten to the lookup target. Compose-file rewrites and container recreates then deployed the wrong image. normalizeContainer no longer overwrites image.name / image.registry.url; a new getImageForRegistryQuery helper applies the substitution + provider URL normalisation only at each query boundary (getTags, getImageManifestDigest, getImagePublishedAt). Un-prefixed images (nginx:1.0) now default to docker.io for the registry URL; Hub.getImageFullName strips the prefix for clean display.
Password-manager autofill restored on login form (commit 3abe2fa6, fixes #335). Username and password inputs lost their name and id attributes during the v1.5 plain-HTML rewrite. Browser-native autofill kept working via autocomplete=, but credential managers that rely on name/id heuristics (Dashlane in Chrome, among others) could no longer identify the username field. Both attributes are restored.
security-scan-skipped audit row now fires when the gate is disabled globally (commit ae24e0a9). Previously recordSecurityAudit('security-scan-skipped', …) only executed when the per-container label dd.security.gate=off was set. With DD_SECURITY_GATE_MODE=off configured globally, scans were silently skipped with no audit trail — an operator reading the audit log had no indication that the gate was suppressed. getGateDisabledAuditDetails now selects the appropriate human-readable reason from whichever off-state is in effect and the audit call is unconditional.
Registry URL normalization restored on container record after regression in 594a07e8. Removing the normalizeImage call in normalizeContainer to fix deploy-identity corruption (issue #336) inadvertently left image.registry.url in its raw user-config form (docker.io) instead of the API base URL form (https://registry-1.docker.io/v2). All registry HTTP callers, getImageFullName, the Prometheus image_registry_url label, and the Docker trigger's self-update helper expect the normalized form. The URL rewrite is now restored for containers where the deploy image itself matches the provider; harbor-mirror containers (where a lookup label diverts to a different registry) correctly retain their deploy URL unchanged.
image.name canonicalization also restored after partial fix in 4e06329b. The prior fix only restored image.registry.url; image.name was still not rewritten through the provider's normalizeImage, so Docker Hub containers with un-prefixed names (e.g. nginx) kept image.name = "nginx" instead of library/nginx. This caused the Prometheus image_name label to emit the bare name, breaking e2e scenarios that assert image_name="library/nginx". The normalizeImage result now also assigns image.name in the deploy-match branch; the cross-registry mirror branch (harbor → Hub lookup) is unaffected and still preserves the deploy name.
Stack/group view no longer collapses to ungrouped mid-update when containers are recreated. When a Docker action recreates a container it receives a new container ID; the group-membership map was keyed only by the original ID, so the post-recreate lookup missed and every container fell into __ungrouped__. With a two-container stack the single-member-flatten rule then removed both group buckets entirely. loadGroups() now indexes the map under id, name, AND displayName, so the existing map[container.name] fallback in the lookup actually resolves after a recreate.
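To show why indexing under several keys survives a recreate, here is a minimal sketch. The Group and GroupMember shapes and both function names are hypothetical; the __ungrouped__ bucket and the id/name/displayName keys come from the entry above.

```typescript
// Hypothetical shapes for group membership data.
type GroupMember = { id: string; name: string; displayName?: string };
type Group = { name: string; containers: GroupMember[] };

// Index every member under id, name, and displayName.
function buildGroupIndex(groups: Group[]): Map<string, string> {
  const index = new Map<string, string>();
  for (const group of groups) {
    for (const member of group.containers) {
      for (const key of [member.id, member.name, member.displayName]) {
        if (key) {
          index.set(key, group.name);
        }
      }
    }
  }
  return index;
}

// Look up by id first, then fall back to the name, which survives a recreate.
function groupOf(index: Map<string, string>, container: GroupMember): string {
  return index.get(container.id) ?? index.get(container.name) ?? "__ungrouped__";
}
```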

Added

Chinese (Simplified) UI (PR #331 by TianMiao, commits 8f3286b7, b97944dc). Chinese is the first non-English locale to ship in drydock. 14 namespace JSON files under ui/src/locales/zh-CN/ cover the full UI surface: dashboard, containers, agents, config, list views, container components, app shell, auth, logs, and shared components (more than 1,100 strings). A latent bootstrap bug (buildMessages map initialized only for en, causing Object.assign(undefined) crashes for any second locale) was fixed as part of this work, along with 112 translation gaps that arose because the locale files were authored before several new UI strings landed in rc.17. The i18n framework picks up the new locale through the existing import.meta.glob auto-discovery; no additional wiring was needed.
Chinese (Traditional) UI (PR #344 by TianMiao, commit 2e60f1e7). The Chinese catalog is now split into BCP-47 locale folders (zh-CN and zh-TW) so operators can choose Simplified or Traditional Chinese from Config > Appearance. The Traditional catalog ships with the same namespace coverage as Simplified Chinese, including the rc.18 appearance, outbox, table, and preference strings.
Multi-select event-type filter in audit log (commit 5e2d0c70, discussion #332). The audit log's event-type filter was a single-value <select>, so operators wanting to view both update-applied and update-failed in the same session had to query them separately and mentally merge the results. The filter is now a checkbox dropdown supporting any combination of event categories simultaneously. The backend already accepted ?actions= (plural, comma-separated) — this wires the missing UI half. Back-compat: existing ?action=foo bookmark URLs parse as a single-element selection without requiring a migration.
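The back-compat parsing can be sketched with the standard URLSearchParams API. The function name and the de-duplication step are assumptions for this example; the ?actions= and ?action= parameter names come from the entry above.

```typescript
// Parse the selected audit event types from a query string.
// Prefers the plural, comma-separated ?actions= and falls back to legacy ?action=.
function parseSelectedActions(query: string): string[] {
  const params = new URLSearchParams(query);
  const plural = params.get("actions");
  const raw = plural !== null ? plural.split(",") : params.getAll("action");
  const cleaned = raw.map((value) => value.trim()).filter((value) => value.length > 0);
  // Drop duplicates while keeping the original order.
  return Array.from(new Set(cleaned));
}
```

With this sketch an old bookmark such as ?action=foo yields a one-element selection, and an empty query yields an empty selection.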

Changed

app/updates/locks.ts renamed to app/updates/lock-primitives.ts (commit 4c506d21). locks.ts was a misleading filename for a module that contains general-purpose synchronisation primitives (Semaphore, LockManager) not tied to the updates subsystem. Existing CHANGELOG entries above and the app/updates/update-locks.ts consumer have been updated to the new path. HookExecutor and RollbackMonitor now delegate label-to-integer parsing to the project-wide parseEnvNonNegativeInteger helper instead of inline NaN/zero guards; getErrorMessage inline copies in post-start-liveness and request-update are consolidated to the shared util/error import.

Security

Credential redaction expanded to x-registry-auth, *-token, and api-key fields (commit 4417ce25). The existing scrubAuthorizationHeaderValues helper only redacted Authorization header values. Structured error payloads in update-failed SSE events could still leak registry auth tokens, API keys, and OAuth bearer strings embedded under other field names. A second regex pass now redacts x-registry-auth, any field matching *-token, and api-key / api_key values before the payload leaves the server. The update-failed SSE path was the primary exposure vector; operator-visible diagnostic strings no longer leak registry credentials in production environments.
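The field-name rules can be illustrated with a key-based walk over a structured payload. This is a sketch under assumptions: the entry describes a regex pass, whereas this example matches object keys; the redactSensitive name, the [REDACTED] placeholder, and the exact pattern are hypothetical.

```typescript
// Keys treated as sensitive: authorization, x-registry-auth,
// api-key / api_key, and anything ending in -token (case-insensitive).
const SENSITIVE_KEY = /^(authorization|x-registry-auth|api[-_]key)$|-token$/i;

// Return a deep copy of the payload with sensitive values replaced.
function redactSensitive(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map(redactSensitive);
  }
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      out[key] = SENSITIVE_KEY.test(key) ? "[REDACTED]" : redactSensitive(inner);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}
```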

Performance

Binary indices and drain concurrency cap for notification outbox (commit 9393253e). findReadyForDelivery, the hot path that runs on every outbox drain cycle, queried data.status and data.nextAttemptAt with standard LokiJS indices, causing full-collection scans as the outbox grew. Switching those two fields to binary indices gives O(log n) binary-search lookups. OutboxWorker gains a maxDrainConcurrency option (default 10) backed by a DrainSemaphore so a burst of ready entries cannot flood the trigger pipeline with unbounded parallel deliveries. store/util normalises a binaryIndices option on initCollection so collections receive correct field registration at creation time.

Tests / CI

Reconciliation terminal-hold toast assertions use maxIdBefore pattern. Two tests in ContainersView.spec.ts were flaking on CI because vi.advanceTimersByTime(1500) expired pre-existing toasts, lowering toasts.value.length even though no new toasts were added. Replaced countBefore = toasts.value.length / toBe(countBefore) with the maxIdBefore / filter(t.id > maxIdBefore) pattern already used elsewhere in the file.
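The difference between the two assertion styles can be shown with plain data. The Toast shape and helper names are hypothetical, and the sketch assumes toast ids increase monotonically.

```typescript
// Hypothetical toast shape; ids are assumed to increase monotonically.
type Toast = { id: number; message: string };

// Highest id currently present (0 when the list is empty).
function maxToastId(toasts: Toast[]): number {
  return toasts.reduce((max, toast) => Math.max(max, toast.id), 0);
}

// Toasts created after the snapshot, regardless of how many older ones expired.
function toastsAddedSince(toasts: Toast[], maxIdBefore: number): Toast[] {
  return toasts.filter((toast) => toast.id > maxIdBefore);
}
```

A length comparison changes when an old toast expires during the timer advance, whereas filtering by id only reacts to toasts that were actually added.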

CodesWhat/drydock v1.5.0-rc.19 on GitHub
