github PegaProx/project-pegaprox v0.9.11
v0.9.11 — PVE 9.2 coexistence + dead-node-perf + capacity-outlook rebuild

3 hours ago

Highlights

PVE 9.2 GA coexistence (HA CRS, custom CPU models, SDN fabrics, removed scheduling= key), runtime hardening (per-node + auth circuit breakers), Capacity Outlook rebuilt off the WMA-current endpoint onto true linear-regression forecasting, 10 customer-reported fixes batched, Docker plugin loading restored. 41 commits since v0.9.10.3.

⚖️ PVE 9.2 coexistence

  • HA CRS auto-rebalance — two-layer skip logic so PegaProx LB doesn't ping-pong PVE-CRS-managed VMs. Cluster-wide gate on /cluster/options.crs.ha-auto-rebalance means pre-9.2 clusters behave exactly as before.
  • scheduling= payload key stripped on PVE 9.2+ (key was removed in 9.2; older PVE versions still accept it).
  • Surface coverage: HA resource auto-rebalance + disarmed flag, SDN fabrics with wireguard/bgp protocol + dryrun apply, SDN route-maps + prefix-lists CRUD passthrough, API-token regenerate endpoint, LXC mountpoint idmap + keepattrs fields, storage approximate-size badge, PBS /storage/{id}/identity passthrough, migration downtime-limit cap to 2000s, OIDC audiences field.
  • Dynamic per-arch CPU types pulled live from /nodes/{n}/capabilities/qemu/cpu (replaces the hardcoded list — picks up custom CPU models on PVE 9.2+).
  • New CRS settings UI panel (HA + percent thresholds + auto-rebalance tuning) — also mirrored into the Native HA panel so both views stay in sync.
  • VM CPU sub-options: reported-model + level inputs surfaced in the VM config dialog.
  • QEMU 11 (PVE 9.2) HMP rename: V2P now falls back from block_job_complete/block_job_cancel to job_* so cross-hypervisor copies stop wedging on QEMU 11 hosts.

⚡ Dead-node performance

  • Per-node circuit breaker stops the dead-node UI hang (~30s/poll → ~4s). Breaker opens after consecutive failures, half-open probe re-arms cleanly.
  • SSH connect timeout capped at 8s via socket pre-create (was effectively 30s through the paramiko command-timeout path).
  • connect_to_proxmox now keeps a host-skip cache so it doesn't re-iterate a host that just timed out within the same request.
  • Per-node fan-outs parallelised via gevent.spawn — 6-node clusters no longer serialise 4s timeouts.
  • Auto-recovery via PVE /nodes aggregate (Corosync's view) — when PVE reports the node back online, breaker resets without needing a manual nudge.
  • #444 — Auth circuit-breaker stops stale-cred storms from DoS'ing pveproxy after a root-password rotation on the PVE side.

📊 Capacity Outlook rebuild

The corporate-dashboard Capacity Outlook card was presenting a WMA-current smoothed value as a forecast — a steady 60% CPU cluster was constantly flagged as "rising" because the rising threshold (70% of critical) tripped on normal mid-load. Rewired to /api/clusters/<id>/insights/forecast which runs proper linear regression with R²-gating against 30 days of metric history. New glance-card UI: one number (closest-ETA days, or stable) and one subtitle (worst metric + target threshold). Per-metric drilldown moved to the Insights page where it belongs. Duplicate /predictive-analysis route in reports.py removed — clusters.py always won the URL match anyway, the reports.py copy was dead code.

🛠️ Customer-reported fixes

  • #448 @DarmokNoob — Cross-cluster LXC replication clone-call sent name= (LXC schema rejects this; QEMU accepts it). Three call sites (xcrepl, intra-cluster repl, manual clone_vm) now branch on vm_type and emit hostname= for LXC, name= for QEMU.
  • #446 @hugobugomugo — Encryption key rotation: column-name mismatches between the rotate transaction and the schema broke the whole pass. Aligned.
  • #438 @crcro — ESXi → PVE migration: drive-mirror task logging, dd progress PID detection, dd-progress regex hardened. Plus the same diagnostic pattern applied to _ssh_pipe_transfer + monitor.
  • #436 @aalandez — Snapshot-schedule race: the retention-delete pass kicked off before the create-task finished, hitting locked-snapshot errors on busy clusters. Schedule now waits for create-task completion before retention-delete.
  • #422xcrepl snapshot cleanup now uses force=1 and auto-unlocks the VM on ESTALE (target unreachable mid-cleanup).
  • #419 follow-ups — NodeCard netin/netout were treated as counter values (rendered absurd spikes); now interpreted as the rate the API returns. XCP-ng counterpart unified on the same rate contract.
  • #417 follow-up (from @tfoks) — Keystore resolver now propagates EACCES instead of swallowing it; bundle support-log diagnostics also detects permission-block on journalctl.
  • #412 follow-up — oidc_allow_private_ip flag now covers token + userinfo endpoints, not only discovery (private-IP IdPs failed mid-flow before).
  • #357 follow-up — PEGAPROX_LOG_LEVEL env var now also caps the per-cluster StreamHandler, not just the root logger.
  • Datacenter status / options endpoints now return 504 instead of 500 when a host times out, and immediately pivot to a warm host instead of looping on the cold one.

🤝 Contributor PRs

  • #445 @gyptazy / credativ — VM API route auto-resolves the node from /cluster/resources?type=vm, so callers no longer need to look up the node first. Backward-compatible with explicit-node paths.
  • #421 @gyptazy — ISO sync now refuses to run on shared cluster storage (would result in duplicate copies; PVE handles cluster-wide sync natively for shared storage).
  • #439 — Aikido container-autofix landed (Dockerfile hardening).

🐳 Docker / packaging

  • Dockerfile was missing COPY --chown=pegaprox:pegaprox plugins/ plugins/ — the plugin loader (pegaprox/api/plugins.py) scanned the plugins/ directory inside the container, found it empty, and quietly registered zero plugins. All five bundled plugins (client_portal, hello_world, notifications, proxmox-ha, status_page) were invisible to anyone running PegaProx on Docker. Single-line fix.
  • Sponsor asset updated: images/sponsors/sponsor3.png swapped from the wide banner to the square Banner Oranje logo and padded square so it renders at the same footprint as the Netwolk logo (200×200) in the README and the in-app sponsor card.

🌐 i18n

29 new translation keys across 7 languages (DE / EN / FR / ES / PT / KO / IT) covering the CRS settings panel, per-VM auto-rebalance, API-token rotate, OIDC audiences, plus a fresh Capacity-Outlook block (capacityCollecting, metricAtRisk, metricsAtRisk, allMetricsStable, overThreshold, etaIn, forecastEngine, stable, metric).


💎 Platinum Sponsors

Sponsor PegaProx → opencollective.com/pegaprox | pegaprox.com/#sponsor


Upgrade: in-app updater, bash update.sh, or docker compose pull && docker compose up -d.
Docker: ghcr.io/pegaprox/pegaprox:v0.9.11 (linux/amd64 + linux/arm64).

Don't miss a new project-pegaprox release

NewReleases is sending notifications on new releases.