semihalev/sdns v1.6.6


Security release. Closes a cache-poisoning vulnerability in both forwarder and resolver paths (issue #469). Operators on 1.6.5 should upgrade.

CVE / advisory: the issue was reported and disclosed publicly via the issue tracker. A GHSA entry will follow.

What's Changed

Security

  • Drop upstream responses with mismatched question section (#470, #471). Both the forwarder (middleware/forwarder/forwarder.go) and the resolver wire layer (middleware/resolver/client.go:Conn.Exchange) used to accept an upstream reply as long as the DNS transaction ID matched. A malicious or misbehaving upstream could answer a query for attacker.example. with a message whose question section was victim.example. — and because the cache is keyed on the response's question, the unrelated answer was stored under victim.example. and served from cache to later clients.

    Both paths now require the response to contain exactly one question whose Name (case-insensitively, per DNS wire rules), Qtype, and Qclass match the outstanding request. Mismatches drop the response and fall through to the next upstream, with the existing retry path covering transient cases. New regression tests pin the contract at both layers.

    Closes #469.
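    The contract described above can be sketched as a small predicate. This is an illustrative sketch, not the sdns code: the real checks operate on `*dns.Msg` values in `forwarder.go` and `client.go`, and the `question` type here is a stand-in for the (Name, Qtype, Qclass) triple.

    ```go
    package main

    import "strings"

    // question mirrors the (Name, Qtype, Qclass) triple of a DNS question
    // section entry. Illustrative type, not the sdns struct.
    type question struct {
    	Name   string
    	Qtype  uint16
    	Qclass uint16
    }

    // questionMatches reports whether the response carries exactly one
    // question matching the outstanding request: the name compared
    // case-insensitively (per DNS wire rules), Qtype and Qclass exactly.
    // Responses failing this check are dropped rather than cached.
    func questionMatches(req, resp []question) bool {
    	if len(req) != 1 || len(resp) != 1 {
    		return false
    	}
    	return strings.EqualFold(req[0].Name, resp[0].Name) &&
    		req[0].Qtype == resp[0].Qtype &&
    		req[0].Qclass == resp[0].Qclass
    }
    ```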

Features

  • Per-client static-answer middleware ("views", #360). New [[views]] config block returns different DNS answers based on the originating client's source IP — split-horizon resolution where *.example.lan. can resolve to one address for LAN clients and a different one for VPN clients without disturbing recursion for everyone else. Each view declares a zone label, a list of networks (CIDR), and a list of answers (zone-file format, wildcards allowed).

    Match precedence follows RFC 4592: exact owners override a covering wildcard (§3.2); among wildcards, the longest matching suffix (closest encloser, §2.2.1) wins. Views are evaluated in declaration order; the first whose networks contain the client IP wins. A matched-but-no-answer view falls through (CoreDNS-style "fallthrough" semantics). Internal sub-pipelines skip views entirely. Position in the chain: between hostsfile and blocklist, so a view-curated answer wins over a global blocklist rule for that name. See the example block in contrib/linux/sdns.conf and the README for usage.
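    A hypothetical `[[views]]` block illustrating the shape described above. The exact key names are assumptions drawn from the field descriptions (zone label, CIDR networks, zone-file answers); see contrib/linux/sdns.conf for the authoritative example.

    ```toml
    # Split-horizon: LAN and VPN clients get different answers for the
    # same wildcard name. Views are evaluated in declaration order; the
    # first view whose networks contain the client IP wins.
    [[views]]
    zone = "example.lan."
    networks = ["192.168.1.0/24"]   # LAN clients
    answers = [
      "*.example.lan. 300 IN A 192.168.1.10",
    ]

    [[views]]
    zone = "example.lan."
    networks = ["10.8.0.0/24"]      # VPN clients
    answers = [
      "*.example.lan. 300 IN A 10.8.0.10",
    ]
    ```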

  • Non-blocking blocklist persistence + bulk import API. Reported issue: blocklist mutations via the HTTP API caused DNS to temporarily stop responding while changes were applied. Root cause: Set / Remove held b.mu (mutually exclusive with the RLock that ServeDNS takes on every query) for the full duration of the synchronous disk write in save(). Large blocklists turned that into multi-millisecond stalls of every in-flight query.

    Fixes:

    • Mutate maps under b.mu, snapshot, release b.mu, then persist outside the lock. ServeDNS readers no longer wait on disk I/O.
    • A new saveMu serializes concurrent persists; the os.Rename of a temp file (CreateTemp + Sync + Rename) is the linearisation point, so the on-disk file always matches some in-memory state and never a half-written intermediate.
    • New SetBatch / RemoveBatch perform one map lock + one disk write for an entire batch instead of one disk write per entry.

    Two new HTTP endpoints accept {"keys":[...]} JSON bodies (8 MiB cap, unknown fields rejected), returning {requested, added/removed, skipped/missing}:

    • POST /api/v1/block/set/batch
    • POST /api/v1/block/remove/batch

    A new contract test (Test_BlockList_NoStallDuringSave) holds saveMu from a goroutine and asserts that a concurrent ServeDNS-style RLock returns within 2s, so a future regression that re-introduces disk I/O inside the map lock fails loudly.
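    The locking scheme above can be sketched as follows. This is a minimal illustration, not the sdns types: the field names mirror the description (`mu`, `saveMu`), and the JSON snapshot format is an assumption for the sketch only, since the real on-disk format is unchanged.

    ```go
    package main

    import (
    	"encoding/json"
    	"os"
    	"path/filepath"
    	"sync"
    )

    // blockList sketches the split: mu guards the in-memory map (query
    // handling takes RLock), saveMu serializes persists, and disk I/O
    // happens entirely outside mu so readers never wait on it.
    type blockList struct {
    	mu     sync.RWMutex
    	saveMu sync.Mutex
    	keys   map[string]struct{}
    	path   string
    }

    // SetBatch mutates the map once under the lock, snapshots, releases
    // the lock, then persists -- one disk write for the whole batch.
    func (b *blockList) SetBatch(keys []string) error {
    	b.mu.Lock()
    	for _, k := range keys {
    		b.keys[k] = struct{}{}
    	}
    	// Snapshot while still holding the lock, then release before I/O.
    	snap := make([]string, 0, len(b.keys))
    	for k := range b.keys {
    		snap = append(snap, k)
    	}
    	b.mu.Unlock()

    	return b.persist(snap)
    }

    // persist writes to a temp file and renames it into place; the rename
    // is the linearisation point, so the on-disk file is always some
    // complete in-memory state, never a half-written intermediate.
    func (b *blockList) persist(snap []string) error {
    	b.saveMu.Lock()
    	defer b.saveMu.Unlock()

    	tmp, err := os.CreateTemp(filepath.Dir(b.path), "blocklist-*")
    	if err != nil {
    		return err
    	}
    	defer os.Remove(tmp.Name()) // cleanup if the rename never happens

    	if err := json.NewEncoder(tmp).Encode(snap); err != nil {
    		tmp.Close()
    		return err
    	}
    	if err := tmp.Sync(); err != nil {
    		tmp.Close()
    		return err
    	}
    	if err := tmp.Close(); err != nil {
    		return err
    	}
    	return os.Rename(tmp.Name(), b.path)
    }
    ```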

Kubernetes Middleware Refactor

Collapses the dual-mode (killer/boring) implementation into one sharded registry with per-headless-service incremental state. Slice events go through ApplyEndpointSlice / RemoveEndpointSlice plus a worker-coalesced MaterialiseHeadless, so a one-pod change in a 1000-pod headless service costs O(slice size) for state work and O(delta) RR allocations.

Correctness fixes that came along with the refactor:

  • SERVFAIL for cluster-domain queries when not synced — forward queries no longer leak to public DNS during initial informer warmup; reverse queries still fall through.
  • UID guard rejects late EndpointSlice events from a deleted Service via tombstone tracking and ownerRef.UID matching, plus dirty-replay on AddService so the synthetic seed handover doesn't drop other slices.
  • onEndpointSliceUpdate retracts the slice from the old service on a service-name relabel.
  • cluster_domain is normalised (trailing dot, mixed case) at construction and at Registry.SetClusterDomain.
  • Anonymous headless endpoints get distinct dashed-IP SRV targets (10-0-0-1.svc...) instead of collapsing to one record.
  • buildConfig defers to clientcmd's default loading rules so multi-file KUBECONFIG entries merge correctly.
  • Skip-if-equal guard in applyEndpointSlice eliminates rebuilds for resourceVersion-only update events.
  • SRV port-number edits invalidate the cached *dns.SRV pointer; SRV glue refresh allocates a new answerSet rather than mutating the published one in place.
  • Run waits on per-handler HasSynced (not just informer.HasSynced) and flushes pending rebuilds before publishing synced=true.
  • DeleteService order is now tombstone → flush → DeleteService, preventing a worker rebuild from re-populating the registry after wipe.

config.KubernetesConfig.killer_mode is dropped from the live API; existing configs still parse (the field is retained but ignored), but new configs should omit it.

Resolver / DNSSEC Refactor

Pure DNSSEC verify functions (RRSIG, DS, NSEC, NSEC3 denial-of-existence proofs) and the EDE-coded sentinel errors that go with them moved into a new middleware/resolver/dnssec subpackage. The generic DNS RR helpers (ExtractRRSet, FilterRRsToZone, NameInZone, DnameTarget) and the EDEError type moved into util/, where both resolver and dnssec can share them without a circular import. Resolver-side network errors keep their identities but now use *util.EDEError instead of the resolver-local ValidationError type.

(*Resolver).lookup() was split in place: the per-server query goroutine moved to a queryServer method, the adaptive RTT-based timeout became adaptiveServerTimeout, and the trailing fallback-picker became pickFallbackResponse. Behaviour is unchanged; lookup() drops from ~250 lines to ~140 and the goroutine entry no longer captures state via closure. Net diff: −2016 / +265 in middleware/resolver/, ~1100 lines under middleware/resolver/dnssec/.

Config

  • Update B-root to current IANA addresses. USC-ISI re-numbered B-root in late 2023 (IPv4 199.9.14.201 → 170.247.170.2; IPv6 2001:500:200::b → 2801:1b8:10::b). The old addresses still answer during the transition, and priming discovers the new ones at runtime anyway, but the embedded default config, the Linux packaging config, the benchmark fixtures, and the fuzz seed corpus now match the canonical named.root list.

Dependencies

  • github.com/semihalev/zlog/v2 → v2.0.8 (v2.0.7 broke the variadic-KV signature; v2.0.8 restores it, so this is a no-op upgrade).
  • github.com/fsnotify/fsnotify → v1.10.0.
  • goreleaser/goreleaser-action → v7.2.1.

Upgrade Notes

  • Recommended for everyone on 1.6.5. The cache-poisoning fix is the headline reason for this release.
  • Config compatibility: configver bumps to 1.6.6; existing configs continue to parse, but you'll see a one-line "Config file is out of version" log warning until you regenerate. The deprecated kubernetes.killer_mode key is now ignored.
  • No on-disk format changes to trust-anchor.db / trust-anchor-tombstones.db / blocklist persistence — the new blocklist save path is a strict superset of the old format.

Full Changelog: v1.6.5...v1.6.6
