Release Notes & Audit Summary
Twenty atomic commits over a single branch, the result of running every finding from a multi-source audit (security CRITICAL/HIGH/MEDIUM/LOW + a performance audit + a UI audit) through empirical verification before acting.
~62 audit findings were checked; only the ones that survived verification turned into commits.
Honest Summary:
- 8 real bugs fixed
- 10 partial / defense-in-depth landed
- 35 false positives documented
- The rest pre-existing-and-deferred
The release also nearly doubles unit test coverage on previously-uncovered critical-path modules: +160 new tests across 5 new test files (570 unit tests total, up from 410).
Real Bug Fixes
- fix(certificates): race condition on metadata RMW
  `record_backend_deployment_status` and `record_browser_deployment_status` did load → mutate → save without holding the per-domain lock that already exists in the class. Two concurrent deployment-status updates lost one of the writes silently.
- fix(deployer): deploy-hook parameter-expansion bypass
  The safe-vars regex `\$\{?CERTMATE_[A-Z_]+\}?` accepted partial brace forms, letting `${CERTMATE_FOO:-/etc/passwd}` and the other bash expansion operators smuggle arbitrary paths past the validator. Closing-brace is now required immediately after the var name.
- fix(file_operations): UnboundLocalError in safe_file_write
  If `mkstemp()` raised before `temp_file` was bound, the except handlers referenced an unbound local, masking the actual `OSError`. Operators saw "no local variable temp_file" instead of "No space left on device".
- fix(certificates): corrupt metadata.json silently clobbered
  `_load_metadata` swallowed `JSONDecodeError` along with everything else and returned `{}`; the next save would overwrite the only copy. Now JSON corruption is quarantined to `metadata.json.corrupt-<utc>` and logged at `ERROR`, separately from IO errors which still get the empty-dict fallback.
- fix(health): scheduler-setup failure now surfaces on /health
  If APScheduler setup raised, the only signal was a single `ERROR` log line. `/health` now reports `scheduler: failed` with the exception message and timestamp; admins can detect a broken scheduler without grepping logs.
- fix(tests): stale UI test assertions rewritten against v2.5.x
  Four `tests/test_ui.py` assertions had been failing on main since v2.5.0 rewrote the help page and the dashboard create-form toggle. Updated them to current selectors + handle the setup-wizard overlay during Playwright clicks. e2e suite: 112 passed, 0 failed.
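To make the deploy-hook bypass concrete, here is a minimal sketch in Python. It assumes the validator strips every match of the safe-vars regex and then rejects any command that still contains a `$`; that strip-and-check flow, the `is_safe` helper, the strict pattern, and the variable names `CERTMATE_CERT_PATH` and `CERTMATE_DOMAIN` are illustrative assumptions, not the shipped code. Only the loose pattern is taken from the notes above.

```python
import re

# Pattern quoted in the release notes: both braces are optional, so the
# match can stop right after the variable name inside "${...}".
LOOSE = re.compile(r"\$\{?CERTMATE_[A-Z_]+\}?")

# Hypothetical stricter form: either "${NAME}" with the closing brace
# directly after the name, or the bare "$NAME" form.
STRICT = re.compile(r"\$\{CERTMATE_[A-Z_]+\}|\$CERTMATE_[A-Z_]+")


def is_safe(command: str, pattern: re.Pattern) -> bool:
    """Assumed validator shape: remove allowed references, then refuse
    anything that still contains a '$'."""
    residue = pattern.sub("", command)
    return "$" not in residue


payload = "cat ${CERTMATE_FOO:-/etc/passwd}"

# LOOSE consumes "${CERTMATE_FOO" and leaves ":-/etc/passwd}" behind,
# which has no '$' left, so the payload passes.
loose_ok = is_safe(payload, LOOSE)

# STRICT finds no match at all, the '$' survives, and the payload fails.
strict_ok = is_safe(payload, STRICT)

# An ordinary hook using only plain references still passes.
plain_ok = is_safe("cp ${CERTMATE_CERT_PATH} /srv/$CERTMATE_DOMAIN.pem", STRICT)
```

The point of the sketch is that an optional closing brace lets the match end in the middle of an expansion, so the operator and its argument are never inspected.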
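The `safe_file_write` fix follows a common pattern: bind the temp-file name before the call that can fail, so the cleanup path never touches an unbound local. The sketch below is an assumption about the general shape (write to a temp file, then rename into place); the real function's signature and cleanup logic are not given in these notes.

```python
import os
import tempfile


def safe_file_write(path: str, data: str) -> None:
    """Atomic-write sketch: temp file in the target directory, then rename."""
    temp_file = None  # bound up front so the handler below can test it
    try:
        directory = os.path.dirname(path) or "."
        fd, temp_file = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(data)
        os.replace(temp_file, path)
    except OSError:
        # If mkstemp() itself failed, temp_file is still None and there is
        # nothing to clean up; the original OSError propagates unchanged.
        if temp_file is not None and os.path.exists(temp_file):
            os.unlink(temp_file)
        raise
```

With this shape, a failure inside `mkstemp()` (full disk, missing directory) reaches the caller as the original `OSError` instead of being replaced by an `UnboundLocalError` raised from the handler.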
Performance Fixes
- perf(settings): request-scoped cache for load_settings
  Typical `/api/certificates` requests called `settings_manager.load_settings()` ~100 times when listing 50 certs. Now cached on `flask.g` for the request's lifetime; first call hits disk, subsequent calls return a deepcopy. ~15-30ms saved per typical request; more at scale.
- perf(renewal): thread settings through check_renewals
  The request-scoped cache doesn't fire in background threads. Pass the once-loaded settings down to `get_certificate_info` so the hourly renewal job hits disk once, not N times.
- perf(probe): TLS probe timeout 5s → 3s + slow-probe warning
  Unreachable hosts block a Flask worker for the full timeout. Tightened default + added `CERTMATE_TLS_PROBE_TIMEOUT_SECONDS` env var (clamped to `[1, 30]`) + `WARN` log when a probe takes more than 1s.
- perf(rate-limit): bound login-attempt dicts
  Botnet IP rotation could grow `_login_attempts_by_ip` unbounded. Sweep empty buckets when either dict crosses the 10K soft cap.
- perf(backup): single iterdir pass in create_unified_backup
  `cert_dir.iterdir()` was called twice. Now once.
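The clamping behaviour described for `CERTMATE_TLS_PROBE_TIMEOUT_SECONDS` can be sketched as follows. The env var name, the 3s default, and the `[1, 30]` range come from the notes above; the helper name `probe_timeout`, the use of floats, and the fallback to the default on unparsable input are assumptions.

```python
import os
from typing import Mapping, Optional

DEFAULT_TIMEOUT = 3.0                   # new default from the notes
MIN_TIMEOUT, MAX_TIMEOUT = 1.0, 30.0    # documented clamp range


def probe_timeout(environ: Optional[Mapping[str, str]] = None) -> float:
    """Return the TLS probe timeout in seconds, clamped to [1, 30]."""
    env = os.environ if environ is None else environ
    raw = env.get("CERTMATE_TLS_PROBE_TIMEOUT_SECONDS")
    if raw is None:
        return DEFAULT_TIMEOUT
    try:
        value = float(raw)
    except ValueError:
        # Assumed behaviour: ignore garbage rather than crash the probe.
        return DEFAULT_TIMEOUT
    return max(MIN_TIMEOUT, min(MAX_TIMEOUT, value))
```

Passing the environment as a parameter keeps the helper easy to test: `probe_timeout({"CERTMATE_TLS_PROBE_TIMEOUT_SECONDS": "120"})` is clamped to the upper bound without touching the real process environment.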
Hardening (Defense-in-Depth)
- chore(hardening): SQLite WAL fallback detection
  `PRAGMA journal_mode=WAL` silently falls back on filesystems that don't support WAL (NFS, network mounts). Now logs a warning at startup if the effective mode is anything but WAL.
- chore(hardening): deploy-hook timeout int coercion
  `_run_hook` read `hook.get('timeout')` without `int()`; a string timeout from a hand-edited `settings.json` crashed the renewal worker. Coerce defensively.
- chore(ux): SSE retry give-up after 10 failures
  Logged-out tabs produced a 401-every-30s loop indefinitely. Now gives up after ~3 minutes of exponential retries.
- chore(ux): MutationObserver readyState guard
  Modal focus-trap observer was attached in `DOMContentLoaded` only; if `certmate.js` loaded later the listener never ran. Mirror the readyState pattern used by `CM.refreshRole`.
- chore(ux): confirm dialog before clear-cache
  `settings.js`'s `clearDeploymentCache` now matches the dashboard's `invalidateAllCache` with a `CertMate.confirm()` step.
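The WAL check relies on a documented SQLite behaviour: `PRAGMA journal_mode=WAL` returns the journal mode that is actually in effect, so a fallback is visible in the result row. A minimal sketch, with the function name `enable_wal` and the logger name as assumptions:

```python
import logging
import sqlite3

logger = logging.getLogger("certmate.sketch")  # hypothetical logger name


def enable_wal(conn: sqlite3.Connection) -> str:
    """Request WAL and return the mode SQLite actually settled on."""
    row = conn.execute("PRAGMA journal_mode=WAL").fetchone()
    mode = str(row[0]).lower()
    if mode != "wal":
        logger.warning("SQLite journal_mode is %r, not WAL", mode)
    return mode


# An in-memory database cannot use WAL, which makes it a convenient
# stand-in for a filesystem that refuses the request.
conn = sqlite3.connect(":memory:")
mode = enable_wal(conn)
conn.close()
```

Here `mode` is `"memory"`, so the warning branch runs. On a local disk the same call would normally return `"wal"`.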
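Defensive coercion of the deploy-hook timeout might look like the sketch below. The helper name `coerce_timeout`, the 30-second default, and the choice to fall back on non-positive values are all assumptions; the notes only state that `hook.get('timeout')` is now passed through `int()` defensively.

```python
from typing import Any, Mapping

DEFAULT_HOOK_TIMEOUT = 30  # hypothetical default, not taken from the source


def coerce_timeout(hook: Mapping[str, Any], default: int = DEFAULT_HOOK_TIMEOUT) -> int:
    """Return hook['timeout'] as a positive int, or the default."""
    raw = hook.get("timeout", default)
    try:
        value = int(raw)
    except (TypeError, ValueError):
        # None, "abc", "30.5", lists, ...: treat as missing.
        return default
    return value if value > 0 else default
```

A hand-edited `settings.json` with `"timeout": "45"` then yields `45` instead of raising when the value reaches the subprocess call.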
Documentation
- docs(installation): document `BEHIND_PROXY=true`
  Undocumented before; without it, per-client rate limiting collapses to per-proxy when CertMate sits behind Nginx / Traefik / Cloudflare.
- docs(installation): NFS guidance
  Python blocking I/O semantics + recommended `soft,timeo=30,retrans=3` mount options.
- docs: neutralize DNS provider counts
  README and `docs/` cited 22/23/24 inconsistently. Switched prose to neutral wording; canonical number lives only in the table at `docs/dns-providers.md`. Same change pushed to the GitHub Wiki.
Test Coverage Push
Five new test files, +160 unit tests on previously-uncovered modules:
| Module | Before | New Tests | Focus |
|---|---|---|---|
| `modules/core/private_ca.py` | 0% | 34 | CA shape (RSA-4096, BC=CA-true, KU.keyCertSign), CSR signing, signature verification, CRL generation |
| `modules/core/csr_handler.py` | 0% | 38 | Validator entry-point: empty/garbage/truncated PEM, no-CN, control-char CN attacks (NUL, newline, CR), SAN ceiling at 100 |
| `modules/core/ocsp_crl.py` | 0% | 20 | Status branches (good/revoked/unknown), CRL signature verification, manager-failure → 'unknown' not 'good' |
| `modules/core/storage_backends.py` | ~25% | 56 | `_is_transient` heuristic, `_with_retry` decorator, `_validate_storage_domain`, Azure secret-name collision avoidance, StorageManager dispatch + fallback |
| `modules/core/certificates.py` (gaps) | ~40% | 12 | Concurrent-issuance non-blocking lock, DNS alias status surfacing (ok/missing/mismatch/error), trailing-dot normalisation |
Tests use real cryptography primitives (no mocked crypto operations); cloud-SDK request paths deliberately out of scope (they're covered by e2e). Total unit suite: 570 passed in ~12s.
Audit Precision Summary (Transparency)
Out of ~62 audit findings across 7 lists (CRITICAL/HIGH/MEDIUM/LOW for security + perf-CRITICAL/HIGH + perf-MEDIUM/LOW + UI CRITICAL/HIGH):
- 8 true positives → fixes shipped
- 10 partial / defense-in-depth → hardening shipped
- 35 false positives → documented in commit messages why they were skipped
- 2 already fixed incidentally during earlier waves
- 7 YAGNI / over-engineering → deferred
Each audit list was verified empirically (test scripts in Python where applicable) before deciding whether to commit. The audit author appears to have pattern-matched on code shapes (`innerHTML`, no `.catch`, no debounce, `except Exception`, `mkstemp`, etc.) without verifying the actual behaviour; the most alarming claim ("validator allows backticks") was falsifiable in two lines of Python.
Backward Compatibility
- No API breakage. No data migration. No new required env vars.
- New optional env vars: `BEHIND_PROXY`, `CERTMATE_TLS_PROBE_TIMEOUT_SECONDS`.
- `/health` adds two new fields when the scheduler is in failure state (`checks.scheduler == "failed"` plus `scheduler_error` + `scheduler_failed_at`). Existing consumers that only read `status` and `checks.scheduler` see no contract change for the success path.
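For monitoring scripts that want to pick up the new scheduler fields without breaking on older servers, a tolerant reader could look like this sketch. The field names come from the notes above; whether `scheduler_error` and `scheduler_failed_at` sit at the top level or under `checks` is not stated here, so the helper (a hypothetical name) tries both, and the sample payloads in the usage note are invented for illustration.

```python
from typing import Any, Mapping, Optional, Tuple


def scheduler_health(
    payload: Mapping[str, Any],
) -> Tuple[Optional[str], Optional[str], Optional[str]]:
    """Return (state, error, failed_at) from a parsed /health response."""
    checks = payload.get("checks") or {}
    state = checks.get("scheduler")
    if state != "failed":
        # Success path: the extra fields are absent and not needed.
        return state, None, None

    def find(key: str) -> Optional[str]:
        # Placement is an assumption: top level first, then under checks.
        return payload.get(key, checks.get(key))

    return state, find("scheduler_error"), find("scheduler_failed_at")
```

With a hypothetical failing payload such as `{"status": "degraded", "checks": {"scheduler": "failed"}, "scheduler_error": "boom", "scheduler_failed_at": "2025-01-01T00:00:00Z"}` the helper returns all three values; with a payload that lacks the new fields it returns the state and two `None`s.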
Test Results
- 570 unit tests pass in ~12s
- 112 e2e tests pass (real Cloudflare DNS-01 issuance + Playwright UI), 0 failures