github yzhao062/pyod v3.5.2

5 hours ago

Summary

Three reproducibility / kwargs-forwarding bug fixes from the PyOD 3 paper (KDD 2027 ADS Cycle 1) §5 evidence work, plus partial progress on a long-standing open issue. Reviewed via four rounds of implement-review with Codex (rounds 1 through 3 each surfaced real findings that were addressed before merging; round 4 cleared with no new findings).

  • Closes #685: ABOD / KNN / LUNAR / SOD over-forwarded **kwargs to sklearn's NearestNeighbors, crashing on any kwarg outside NearestNeighbors's signature (the sklearn-convention random_state, a typo like n_neighbours, etc.).
  • Closes #686: ADEngine.investigate was non-deterministic on byte-identical input because no public API pinned random_state.
  • Closes #469: LODA results were not reproducible because the constructor did not accept random_state and the inner np.random.* calls fell back to numpy's module-level state.
  • Partial progress on #599 (sklearn-style random_state across pyod): ADEngine, LUNAR, LODA, and EmbeddingOD now accept random_state. Deep-learning detectors (DIF, AutoEncoder, DeepSVDD, ...) remain follow-up work tracked under #599.

What changed

#685 ABOD / KNN / LUNAR / SOD kwargs leak

Removed **kwargs from each __init__ and stopped forwarding **self.kwargs / **kwargs to NearestNeighbors. The six named forwarding parameters added in b8f6c81 (algorithm, leaf_size, metric, p, metric_params, n_jobs) still cover the use case #654 originally asked for. Unknown kwargs on ABOD / KNN / SOD now raise a clean TypeError at construction that names the detector class and does NOT leak NearestNeighbors.
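
The fail-fast behavior described above can be illustrated with a minimal sketch. StrictNeighborsDetector and construction_error are hypothetical names invented for this example, not PyOD classes; the only assumption carried over from the release notes is that the constructor declares the six named forwarding parameters and no **kwargs.

```python
class StrictNeighborsDetector:
    """Hypothetical stand-in for a detector such as ABOD / KNN / SOD."""

    def __init__(self, n_neighbors=5, algorithm="auto", leaf_size=30,
                 metric="minkowski", p=2, metric_params=None, n_jobs=1):
        # No **kwargs: Python itself rejects anything not declared here.
        self.n_neighbors = n_neighbors
        self.algorithm = algorithm
        self.leaf_size = leaf_size
        self.metric = metric
        self.p = p
        self.metric_params = metric_params
        self.n_jobs = n_jobs

    def neighbor_params(self):
        """Return only the six named parameters meant for the neighbor search."""
        return {
            "algorithm": self.algorithm,
            "leaf_size": self.leaf_size,
            "metric": self.metric,
            "p": self.p,
            "metric_params": self.metric_params,
            "n_jobs": self.n_jobs,
        }


def construction_error(**kwargs):
    """Return the TypeError message raised at construction, or None on success."""
    try:
        StrictNeighborsDetector(**kwargs)
    except TypeError as exc:
        return str(exc)
    return None
```

With this shape, a typo such as n_neighbours or an unsupported random_state is rejected when the object is built, well before any call to fit, while the named parameters (for example metric="euclidean") keep working.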

LUNAR is the one #685 detector that is actually stochastic. Instead of rejecting random_state, LUNAR.__init__ declares an explicit random_state parameter (accepts int or numpy.random.RandomState) that threads through:

  1. torch.manual_seed (and torch.cuda.manual_seed_all when CUDA is available), before SCORE_MODEL / WEIGHT_MODEL construction and again in fit().
  2. The numpy RandomState returned by sklearn.utils.check_random_state.
  3. train_test_split(..., random_state=rng) for the validation split.
  4. generate_negative_samples(..., random_state=rng) for the synthetic anomaly generator (new signature).
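
The idea behind steps 2 through 4, one RandomState object shared by every stochastic step, can be sketched with numpy alone. This is a simplified illustration, not LUNAR's code: split_train_val, uniform_negative_samples and prepare_training_data are hypothetical helpers, the uniform sampler is an assumed stand-in for the real negative-sample generator, and the torch seeding of step 1 is left out.

```python
import numpy as np


def split_train_val(X, val_fraction, rng):
    """Hypothetical stand-in for train_test_split(..., random_state=rng)."""
    order = rng.permutation(len(X))
    n_val = int(round(val_fraction * len(X)))
    return X[order[n_val:]], X[order[:n_val]]


def uniform_negative_samples(X, n_samples, rng):
    """Hypothetical synthetic anomaly generator: uniform draws in the data's bounding box."""
    low, high = X.min(axis=0), X.max(axis=0)
    return rng.uniform(low, high, size=(n_samples, X.shape[1]))


def prepare_training_data(X, random_state=None):
    """Thread one RandomState through the split and the negative sampler."""
    if isinstance(random_state, np.random.RandomState):
        rng = random_state
    else:
        rng = np.random.RandomState(random_state)  # int seed or None
    train, val = split_train_val(X, 0.2, rng)
    negatives = uniform_negative_samples(train, len(train), rng)
    return train, val, negatives
```

Because both helpers draw from the same rng in a fixed order, two calls with the same integer seed produce identical splits and identical synthetic anomalies.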

#686 ADEngine non-determinism

Added random_state to ADEngine.__init__. Plumbed through ADEngine.build_detector -> build_detector_from_plan -> build_from_preset. The factory injects random_state into plan['params'] only for detector classes whose __init__ declares an explicit random_state parameter (verified via inspect.signature); detectors that do not declare it are instantiated unchanged. Plan-level random_state in params wins over the engine default. The factory copies plan['params'] before injecting so the caller's plan dict is not mutated.
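
The injection rule above (inject only when the constructor declares random_state, let the plan win, never mutate the caller's dict) can be sketched as follows. The detector classes and build_from_plan are hypothetical names for illustration; the real factory functions may differ in signature and detail.

```python
import inspect


class SeededDetector:
    """Hypothetical detector whose __init__ declares random_state."""

    def __init__(self, n_estimators=10, random_state=None):
        self.n_estimators = n_estimators
        self.random_state = random_state


class UnseededDetector:
    """Hypothetical detector with no random_state parameter."""

    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors


def accepts_random_state(cls):
    """True only if random_state is an explicitly declared __init__ parameter."""
    return "random_state" in inspect.signature(cls.__init__).parameters


def build_from_plan(cls, plan, random_state=None):
    """Instantiate cls from plan['params'], injecting the engine seed if allowed."""
    params = dict(plan.get("params", {}))  # copy so the caller's plan is untouched
    if random_state is not None and accepts_random_state(cls):
        # setdefault keeps a plan-level random_state if one is already present
        params.setdefault("random_state", random_state)
    return cls(**params)
```

Checking the signature instead of catching a TypeError keeps detectors without a random_state parameter on exactly the same construction path as before.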

EmbeddingOD preset coverage is end-to-end: EmbeddingOD.__init__ accepts random_state, EmbeddingOD.fit forwards into resolve_detector(detector, contamination, random_state=...) which injects the seed into the inner shortcut detector (LUNAR by default), and EmbeddingOD._preprocess_fit passes the seed to PCA(n_components=self.reduce_dim, random_state=...) so a preset plan with reduce_dim is fully deterministic. The external encoder's own inference (sentence-transformers, DINOv2) is documented as NOT seeded.

#469 LODA non-reproducible

Added random_state to LODA.__init__. Threaded through sklearn.utils.check_random_state and replaced the two np.random.* call sites (np.random.randn for the projection matrix and np.random.permutation for the per-cut feature subset) with rng.randn and rng.permutation. LODA(random_state=42) is now bit-stable across reruns, and ADEngine(random_state=42) propagates the seed through the existing factory path.
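
A minimal sketch of the pattern, assuming a simplified version of LODA's sparse random projections: as_random_state is a reduced stand-in for sklearn.utils.check_random_state, and sparse_projections (including the sqrt-based sparsity rule) is a hypothetical helper, not the library's code. The point is only that both draws come from a local rng instead of numpy's module-level state.

```python
import numpy as np


def as_random_state(seed):
    """Simplified stand-in for sklearn.utils.check_random_state."""
    if isinstance(seed, np.random.RandomState):
        return seed
    return np.random.RandomState(seed)  # int seed or None


def sparse_projections(n_features, n_cuts, random_state=None):
    """Draw n_cuts random projection vectors, zeroing a random feature subset in each."""
    rng = as_random_state(random_state)
    n_nonzero = max(1, int(np.sqrt(n_features)))  # assumed sparsity rule
    # rng.randn replaces np.random.randn
    projections = rng.randn(n_cuts, n_features)
    for i in range(n_cuts):
        # rng.permutation replaces np.random.permutation
        zero_idx = rng.permutation(n_features)[: n_features - n_nonzero]
        projections[i, zero_idx] = 0.0
    return projections
```

Two calls with the same integer seed return the same matrix, and nothing here reads or advances the global np.random state, so unrelated code cannot perturb the result.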

API compatibility

Soft API removal: the accidental arbitrary-**kwargs surface added to ABOD / KNN / LUNAR / SOD in commit b8f6c81 is gone. Code that relied on it (for example ABOD(some_unknown_kwarg=value)) now fails fast at the constructor instead of at the NearestNeighbors constructor inside fit. The six named forwarding parameters still work; this is the only meaningful behavior change.

ADEngine() without a seed retains v3.5.1 behavior (no determinism guarantee). Existing callers of LODA(), LUNAR(), EmbeddingOD() without random_state see no behavior change.

Tests

31 new regression tests across 6 test files. All pass locally. The 4 pre-existing TestFastABOD / TestKnnNearestNeighborsConfig / TestLUNARNearestNeighborsConfig / TestSODNearestNeighborsConfig failures on Windows are MKL DLL load errors that reproduce on a clean tree and are unrelated to this PR.

Install

pip install --upgrade pyod

or, with conda-forge (auto-released within a few hours):

conda install -c conda-forge pyod=3.5.2
