⭐️ Highlights
The main highlight of this release is that phase 1 of the Pandera internals re-write is complete 🎉🚀! This is a backwards-compatible re-write (unit tests FTW 😅) that should just work with your existing pandera code. Please submit bug reports if you encounter any regressions that weren't covered by the current test suite.
PRs #913, #1109, and #1110 address #381 and essentially decouple pandas-specific logic from the pandera schema specification. In summary:
- The pandera schema specifications are defined in `pandera.api`, containing:
  - schema base classes in `pandera.api.base`
  - pandera schema classes in `pandera.api.pandas`
  - the global check and hypothesis namespace in `pandera.api.checks.Check` and `pandera.api.hypotheses.Hypothesis`
  - decorators are provided in `pandera.api.extensions` to be able to register builtin and custom checks/hypotheses
- The pandera backend validation logic is defined in `pandera.backends`, containing:
  - backend base classes in `pandera.backends.base`
  - pandas-specific backend validators in `pandera.backends.pandas`
All pandas-specific logic is now isolated in dedicated modules, so support for additional non-pandas-compliant schema specifications and their associated backends can be implemented either as 1st-party-maintained libraries (see issues for supporting polars and ibis) or as 3rd-party libraries.
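To make the split between schema specification and validation backend more concrete, here is a minimal, stdlib-only sketch of the general pattern. It is not pandera's actual implementation: every name in it (`BaseSchema`, `BaseBackend`, `TableSchema`, `DictBackend`, `register_backend`) is hypothetical and chosen for illustration, and a plain `dict` of lists stands in for a non-pandas data container.

```python
from dataclasses import dataclass
from typing import Any, Callable, ClassVar, Dict, Type


class BaseBackend:
    """Hypothetical backend base class: knows how to validate one container type."""

    def validate(self, data: Any, schema: "BaseSchema") -> Any:
        raise NotImplementedError


class BaseSchema:
    """Hypothetical framework-agnostic schema base class.

    It holds no container-specific logic; it only looks up a backend
    registered for the type of the data it is asked to validate.
    """

    # Maps a container type to the backend class that can validate it.
    _backends: ClassVar[Dict[type, Type[BaseBackend]]] = {}

    @classmethod
    def register_backend(cls, data_type: type, backend: Type[BaseBackend]) -> None:
        cls._backends[data_type] = backend

    def validate(self, data: Any) -> Any:
        for data_type, backend in self._backends.items():
            if isinstance(data, data_type):
                return backend().validate(data, self)
        raise TypeError(f"no backend registered for {type(data).__name__}")


@dataclass
class TableSchema(BaseSchema):
    """Hypothetical schema specification: column name -> element-wise check."""

    columns: Dict[str, Callable[[Any], bool]]


class DictBackend(BaseBackend):
    """Hypothetical backend for dict-of-lists data (a stand-in for a real framework)."""

    def validate(self, data: Dict[str, list], schema: TableSchema) -> Dict[str, list]:
        for name, check in schema.columns.items():
            if name not in data:
                raise ValueError(f"missing column: {name}")
            failures = [value for value in data[name] if not check(value)]
            if failures:
                raise ValueError(f"column {name!r} failed check for values: {failures}")
        return data


# A 3rd-party package would perform this registration for its own container type.
TableSchema.register_backend(dict, DictBackend)

schema = TableSchema(columns={"age": lambda value: value >= 0})
validated = schema.validate({"age": [31, 42]})
```

In this sketch, supporting a new data container means writing a new backend class and registering it; the schema specification itself stays unchanged.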
🛣 Rewrite Roadmap
The bulk of the re-write is complete, but some items are still outstanding:
- Write validation backends for the existing pandas-like frameworks (dask, pyspark.pandas, modin). This may lead to refactoring some of the abstractions that came out of the rewrite.
- Write an alpha version of the `pandera-ibis` package, which will create a schema specification and validation backends for ibis data structures (see issue #1105)
- Document the process of writing your own 3rd party libraries based on pandera for any arbitrary statistical data container.
What's Changed
- Bugfix/996: strict="filter" doesn't work on spark dataframes by @nwoodbury in #1001
- unpin pandas-stubs version by @williamjamir in #1000
- add PR messages, DCO to contributing guide by @cosmicBboy in #1006
- Turn failure-cases to string to avoid hashing unhashable objects by @a-recknagel in #1014
- do not require `coerce==True` for PydanticModels by @the-matt-morris in #1011
- Schema Model manipulation docs by @a-recknagel in #1012
- Fix handling of decimals with scale=0 by @a-recknagel in #1010
- Add Union support to `check_types`: Bugfix/977 by @kr-hansen in #995
- Bugfix/997 by @joepatol in #1017
- update mypy plugin and tests by @cosmicBboy in #1007
- fix issue where @check_types-decorated function is an iterable by @cosmicBboy in #1022
- fix mypy extra unit tests, pin pandas-stubs for dev env by @cosmicBboy in #1056
- Feature/511: Copy columns in `DataFrameSchema.__init__()` by @NickCrews in #1055
- Unpinning ray from requirements-dev.txt by @erichamers in #1052
- core and backend pandera API internals rewrite by @cosmicBboy in #913
- Small fix to example by @brl0 in #1083
- Coerce dt indexes and series by @cristianmatache in #1057
- correctly type-check strings by @cosmicBboy in #1106
- fix lazy validation issue with regex columns if no column found by @cosmicBboy in #1107
- fix(dtypes.py): correction at function is_numeric docstring by @HenriqueAJNB in #1100
- internals rewrite: clean up checks and hypothesis functionality by @cosmicBboy in #1109
- rename pandera.core to pandera.api by @cosmicBboy in #1110
New Contributors
- @nwoodbury made their first contribution in #1001
- @williamjamir made their first contribution in #1000
- @kr-hansen made their first contribution in #995
- @joepatol made their first contribution in #1017
- @erichamers made their first contribution in #1052
- @brl0 made their first contribution in #1083
- @HenriqueAJNB made their first contribution in #1100
Full Changelog: v0.13.4...v0.14.0