✨ Highlights ✨
📣 Pandera now supports validation of polars.DataFrame and polars.LazyFrame 🐻‍❄️!
You can now do this:
import pandera.polars as pa
import polars as pl

class Schema(pa.DataFrameModel):
    state: str
    city: str
    price: int = pa.Field(in_range={"min_value": 5, "max_value": 20})

lf = pl.LazyFrame(
    {
        'state': ['FL','FL','FL','CA','CA','CA'],
        'city': [
            'Orlando',
            'Miami',
            'Tampa',
            'San Francisco',
            'Los Angeles',
            'San Diego',
        ],
        'price': [8, 12, 10, 16, 20, 18],
    }
)
Schema.validate(lf).collect()
And of course you can do functional validation with decorators like so:
from pandera.typing.polars import LazyFrame

@pa.check_types
def function(lf: LazyFrame[Schema]) -> LazyFrame[Schema]:
    return lf.filter(pl.col("state").eq("CA"))

function(lf).collect()
You can read more about the integration here. Not all pandera features are supported at this point, but depending on community demand/contributions we'll slowly add them. To learn more about what's currently supported, check out this table.
Special shoutout to @AndriiG13 and @FilipAisot for their contributions on the built-in checks and polars datatypes, respectively, and to @evanrasmussen9, @baldwinj30, @obiii, @Filimoa, @philiporlando, @r-bar, @alkment, @jjfantini, and @robertdj for their early feedback and bug reports during the 0.19.0 beta.
What's Changed
- Support polars DataFrames, LazyFrames by @cosmicBboy, @AndriiG13, and @FilipAisot in #1373
- bugfix: optional columns in polars schema should no longer raise errors when not present by @cosmicBboy in #1532
- check_nullable does not uselessly compute isna() anymore in pandas backend by @smarie in #1538
- Polars LazyFrames are validated at the schema-level by default by @cosmicBboy in #1534
- Enable from_format_kwargs for dict format by @ektar in #1539
- Convert docs to myst by @cosmicBboy in #1542
- fix README(tab to space) by @np-yoe in #1544
- pandas DataFrameModel accepts python generic types by @cosmicBboy in #1547
- Backend registration happens at schema initialization by @cosmicBboy in #1548
- do not format if test is not necessary by @mattB1989 in #1530
- Register default backends when restoring state by @alkment in #1550
- Bump actions/setup-python from 4 to 5 by @dependabot in #1452
- fix: prevent environment pollution when importing pyspark by @sam-goodwin in #1552
- use rst to speed up api docs generation by @cosmicBboy in #1557
- Add _GenericAlias.call patch by @cosmicBboy in #1561
- support typeguard < 3 for better compatibility by @cosmicBboy in #1563
- Add parse function to DataFrameModel in #1181
- localize GenericAlias patch to DataFrameBase subclasses by @cosmicBboy in #1571
- Bump idna from 3.4 to 3.7 by @dependabot in #1569
- docs: fix typo in env var name by @alekseik1 in #1562
- polars: fix element-wise checks, register backends by @cosmicBboy in #1572
- remove pytest ignore on modin, dask, pyspark tests with pandas >= 2 by @cosmicBboy in #1573
- make sure check name is propagated to error report by @cosmicBboy in #1574
- update ci to run pyspark, modin, dask with pandas >= v2 by @cosmicBboy in #1575
- use sphinx-design instead of sphinx-panels by @cosmicBboy in #1581
- Update bug_report.md by @philiporlando in #1585
- bugfix: polars column core checks now return check output by @cosmicBboy in #1586
- make pandera.typing.Series[TYPE] error in polars DataFrameModel more readable by @cosmicBboy in #1588
- implement timezone agnostic polars_engine.DateTime type by @cosmicBboy in #1589
- fix pyspark import error by @cosmicBboy in #1591
- fix pyspark tests when run on full test suite by @cosmicBboy in #1593
- Bugfix/1580 by @cosmicBboy in #1596
- Set pandas_io.from_frictionless_schema to use a raw string for docs by @mark-thm in #1597
- Add a generic Series type for polars by @baldwinj30 in #1595
- Add StructType and DDL extraction from Pandera schemas by @filipeo2-mck in #1570
- Clean up typing for pandas GenericDtype by @cosmicBboy in #1601
- Adding warning for unique in pyspark field and a test showing the issue as well as config when it works. by @zippeurfou in #1592
- bugfix/1607: coercion error should correctly report relevant failure cases by @cosmicBboy in #1608
- Create a common DataFrameSchema class, update mypy used in pre-commit by @cosmicBboy in #1609
- Dataframe column schema by @cosmicBboy in #1611
- bugfix: column-level coercion is properly implemented by @cosmicBboy in #1612
- update docs for polars by @cosmicBboy in #1613
- fix: properly coerce dtypes for columns with regex=True by @tesslinden in #1602
- rewrite Check class docstrings to remove pandas assumption by @cosmicBboy in #1614
- add tests for polars decorators by @cosmicBboy in #1615
New Contributors
- @smarie made their first contribution in #1538
- @ektar made their first contribution in #1539
- @np-yoe made their first contribution in #1544
- @alkment made their first contribution in #1550
- @sam-goodwin made their first contribution in #1552
- @alekseik1 made their first contribution in #1562
- @philiporlando made their first contribution in #1585
- @mark-thm made their first contribution in #1597
- @baldwinj30 made their first contribution in #1595
- @zippeurfou made their first contribution in #1592
- @tesslinden made their first contribution in #1602
Full Changelog: v0.18.3...v0.19.0