Summary
Three reproducibility / kwargs-forwarding bug fixes from the PyOD 3 paper (KDD 2027 ADS Cycle 1) §5 evidence work, plus partial progress on a long-standing open issue. Reviewed via four rounds of implement-review with Codex (rounds 1 through 3 each surfaced real findings that were addressed before merging; round 4 cleared with no new findings).
- Closes #685: ABOD/KNN/LUNAR/SOD over-forwarded **kwargs to sklearn's NearestNeighbors, crashing on any kwarg outside NearestNeighbors's signature (the sklearn-convention random_state, a typo like n_neighbours, etc.).
- Closes #686: ADEngine.investigate was non-deterministic on byte-identical input because no public API pinned random_state.
- Closes #469: LODA results were not reproducible because the constructor did not accept random_state and the inner np.random.* calls fell back to numpy's module-level state.
- Partial progress on #599 (sklearn-style random_state across pyod): ADEngine, LUNAR, LODA, and EmbeddingOD now accept random_state. Deep-learning detectors (DIF, AutoEncoder, DeepSVDD, ...) remain follow-up work tracked under #599.
What changed
#685 ABOD / KNN / LUNAR / SOD kwargs leak
Removed **kwargs from each __init__ and stopped forwarding **self.kwargs / **kwargs to NearestNeighbors. The six named forwarding parameters added in b8f6c81 (algorithm, leaf_size, metric, p, metric_params, n_jobs) still cover the use case #654 originally asked for. Unknown kwargs on ABOD / KNN / SOD now raise a clean TypeError at construction that names the detector class and does NOT leak NearestNeighbors.
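The fail-fast behavior can be illustrated with a minimal sketch. SketchDetector, neighbor_params, and try_construct are hypothetical names used only for illustration, and the default values are assumptions; this is not the PyOD source. It shows that once a constructor lists its forwarding parameters by name and drops **kwargs, Python itself rejects unknown keywords at construction time.

```python
class SketchDetector:
    """Hypothetical stand-in for a neighbor-based detector (not PyOD code)."""

    def __init__(self, n_neighbors=5, algorithm="auto", leaf_size=30,
                 metric="minkowski", p=2, metric_params=None, n_jobs=1):
        # No **kwargs in the signature: any unknown keyword raises TypeError
        # here, before a neighbor-search object is ever constructed.
        self.n_neighbors = n_neighbors
        self.algorithm = algorithm
        self.leaf_size = leaf_size
        self.metric = metric
        self.p = p
        self.metric_params = metric_params
        self.n_jobs = n_jobs

    def neighbor_params(self):
        # Only the six named parameters are forwarded to the neighbor search.
        return {
            "algorithm": self.algorithm,
            "leaf_size": self.leaf_size,
            "metric": self.metric,
            "p": self.p,
            "metric_params": self.metric_params,
            "n_jobs": self.n_jobs,
        }


def try_construct(**kwargs):
    """Return 'ok' on success, or the TypeError message on failure."""
    try:
        SketchDetector(**kwargs)
        return "ok"
    except TypeError as exc:
        return str(exc)


print(try_construct(n_neighbors=10))   # accepted
print(try_construct(n_neighbours=10))  # typo is rejected at construction
print(try_construct(random_state=0))   # undeclared parameter is rejected
```

The exact wording of the TypeError depends on the Python version, so the sketch only relies on the offending keyword appearing in the message.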
LUNAR is the one #685 detector that is actually stochastic. Instead of rejecting random_state, LUNAR.__init__ declares an explicit random_state parameter (accepts int or numpy.random.RandomState) that threads through:
- torch.manual_seed (and torch.cuda.manual_seed_all when CUDA is available), before SCORE_MODEL / WEIGHT_MODEL construction and again in fit().
- The numpy RandomState returned by sklearn.utils.check_random_state.
- train_test_split(..., random_state=rng) for the validation split.
- generate_negative_samples(..., random_state=rng) for the synthetic anomaly generator (new signature).
#686 ADEngine non-determinism
Added random_state to ADEngine.__init__. Plumbed through ADEngine.build_detector -> build_detector_from_plan -> build_from_preset. The factory injects random_state into plan['params'] only for detector classes whose __init__ declares an explicit random_state parameter (verified via inspect.signature); detectors that do not declare it are instantiated unchanged. Plan-level random_state in params wins over the engine default. The factory copies plan['params'] before injecting so the caller's plan dict is not mutated.
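The injection rule described above can be sketched roughly as follows. build_with_seed, SeededDetector, and UnseededDetector are hypothetical names chosen for illustration; the real factory functions may be organized differently. The sketch assumes a plan is a dict with a 'params' entry, as the description suggests.

```python
import inspect


class SeededDetector:
    """Hypothetical detector that declares random_state explicitly."""

    def __init__(self, n_estimators=10, random_state=None):
        self.n_estimators = n_estimators
        self.random_state = random_state


class UnseededDetector:
    """Hypothetical detector with no random_state parameter."""

    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors


def build_with_seed(cls, plan, engine_random_state=None):
    # Copy the params so the caller's plan dict is never mutated.
    params = dict(plan.get("params", {}))
    # Inject only when __init__ names random_state explicitly; a bare
    # **kwargs catch-all would not count as declaring it.
    declares_seed = "random_state" in inspect.signature(cls.__init__).parameters
    if declares_seed and engine_random_state is not None:
        # setdefault keeps a plan-level random_state over the engine default.
        params.setdefault("random_state", engine_random_state)
    return cls(**params)


plan = {"params": {"n_estimators": 5}}
det = build_with_seed(SeededDetector, plan, engine_random_state=42)
print(det.random_state)  # engine default applied
print(plan)              # caller's plan is unchanged
```

Using setdefault on a copy covers both guarantees at once: the plan-level value wins, and the caller's dict is left untouched.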
EmbeddingOD preset coverage is end-to-end: EmbeddingOD.__init__ accepts random_state, EmbeddingOD.fit forwards into resolve_detector(detector, contamination, random_state=...) which injects the seed into the inner shortcut detector (LUNAR by default), and EmbeddingOD._preprocess_fit passes the seed to PCA(n_components=self.reduce_dim, random_state=...) so a preset plan with reduce_dim is fully deterministic. The external encoder's own inference (sentence-transformers, DINOv2) is documented as NOT seeded.
#469 LODA non-reproducible
Added random_state to LODA.__init__. Threaded through sklearn.utils.check_random_state and replaced the two np.random.* call sites (np.random.randn for the projection matrix and np.random.permutation for the per-cut feature subset) with rng.randn and rng.permutation. LODA(random_state=42) is now bit-stable across reruns, and ADEngine(random_state=42) propagates the seed through the existing factory path.
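A simplified sketch of the pattern, not the LODA source: as_random_state is a reduced stand-in for sklearn.utils.check_random_state (which, unlike this stand-in, returns numpy's global RandomState when given None), and sample_projections is a hypothetical helper that only mimics the two call sites named above. The sparsity rule used here (keep roughly sqrt(n_features) nonzero entries per cut) is an assumption for illustration.

```python
import numpy as np


def as_random_state(seed):
    """Reduced stand-in for sklearn.utils.check_random_state (assumption)."""
    if isinstance(seed, np.random.RandomState):
        return seed
    # An int gives a reproducible stream; None seeds from OS entropy.
    return np.random.RandomState(seed)


def sample_projections(n_features, n_cuts, random_state=None):
    """Draw sparse random projection vectors from a local RandomState."""
    rng = as_random_state(random_state)
    n_nonzero = max(1, int(np.sqrt(n_features)))
    # rng.randn replaces np.random.randn for the projection matrix.
    projections = rng.randn(n_cuts, n_features)
    for i in range(n_cuts):
        # rng.permutation replaces np.random.permutation for the
        # per-cut choice of which features to zero out.
        zero_idx = rng.permutation(n_features)[n_nonzero:]
        projections[i, zero_idx] = 0.0
    return projections


a = sample_projections(n_features=9, n_cuts=4, random_state=42)
b = sample_projections(n_features=9, n_cuts=4, random_state=42)
print(np.array_equal(a, b))  # same seed, same projections
```

Because every draw goes through the local rng, other code that touches numpy's module-level state between calls cannot change the result.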
API compatibility
Soft API removal: the accidental arbitrary-**kwargs surface added to ABOD / KNN / LUNAR / SOD in commit b8f6c81 is gone. Code that relied on it (for example ABOD(some_unknown_kwarg=value)) now fails fast at the constructor instead of at the NearestNeighbors constructor inside fit. The six named forwarding parameters still work; this is the only meaningful behavior change.
ADEngine() without a seed retains v3.5.1 behavior (no determinism guarantee). Existing callers of LODA(), LUNAR(), EmbeddingOD() without random_state see no behavior change.
Tests
31 new regression tests across 6 test files. All pass locally. The 4 pre-existing TestFastABOD / TestKnnNearestNeighborsConfig / TestLUNARNearestNeighborsConfig / TestSODNearestNeighborsConfig failures on Windows are MKL DLL load errors that reproduce on a clean tree and are unrelated to this PR.
Install
pip install --upgrade pyod
or, with conda-forge (auto-released within a few hours):
conda install -c conda-forge pyod=3.5.2