Highlights
pandera now supports pyspark dataframe validation via pyspark.pandas
The pandera koalas integration has now been deprecated.
You can now pip install pandera[pyspark] and validate pyspark.pandas dataframes:
import pyspark.pandas as ps
import pandas as pd
import pandera as pa
from pandera.typing.pyspark import DataFrame, Series


class Schema(pa.SchemaModel):
    state: Series[str]
    city: Series[str]
    price: Series[int] = pa.Field(in_range={"min_value": 5, "max_value": 20})


# create a pyspark.pandas dataframe that's validated on object initialization
df = DataFrame[Schema](
    {
        'state': ['FL', 'FL', 'FL', 'CA', 'CA', 'CA'],
        'city': [
            'Orlando',
            'Miami',
            'Tampa',
            'San Francisco',
            'Los Angeles',
            'San Diego',
        ],
        'price': [8, 12, 10, 16, 20, 18],
    }
)
print(df)
PydanticModel DataType Enables Row-wise Validation with a pydantic model
Pandera now supports row-wise validation by applying a pydantic model as a dataframe-level dtype:
import pandas as pd
import pandera as pa
from pandera.engines.pandas_engine import PydanticModel
from pydantic import BaseModel


class Record(BaseModel):
    name: str
    xcoord: str
    ycoord: int


class PydanticSchema(pa.SchemaModel):
    """Pandera schema using the pydantic model."""

    class Config:
        """Config with dataframe-level data type."""

        dtype = PydanticModel(Record)
        coerce = True  # this is required, otherwise a SchemaInitError is raised
⚠️ Warning: Row-wise validation with a pydantic model may lead to performance issues for very large dataframes.
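To make the idea of row-wise validation (and the reason for the performance warning) more concrete, here is a minimal conceptual sketch that uses only the Python standard library. It is not pandera's implementation: the Record class is re-created here as a dataclass stand-in for the pydantic model, and validate_rows is a hypothetical helper invented for illustration. The point is that each row is turned into one model instance, so the cost grows with one Python-level object construction per row.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass
class Record:
    """Stand-in for the pydantic model above (assumption: plain type checks only)."""

    name: str
    xcoord: str
    ycoord: int

    def __post_init__(self) -> None:
        # Reject any field whose value is not an instance of its annotated type.
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, f.type):
                raise TypeError(
                    f"{f.name}: expected {f.type.__name__}, "
                    f"got {type(value).__name__}"
                )


def validate_rows(rows: list[dict[str, Any]]) -> list[tuple[int, str]]:
    """Hypothetical helper: build one Record per row and collect the failures."""
    failures = []
    for index, row in enumerate(rows):
        try:
            Record(**row)  # one object construction per row
        except TypeError as exc:
            failures.append((index, str(exc)))
    return failures


rows = [
    {"name": "foo", "xcoord": "1.0", "ycoord": 4},
    {"name": "bar", "xcoord": "2.0", "ycoord": "five"},  # ycoord is not an int
    {"name": "baz", "xcoord": "3.0"},  # ycoord is missing
]
failures = validate_rows(rows)
for index, message in failures:
    print(f"row {index}: {message}")
```

Because every row goes through its own constructor call, this style of check cannot take advantage of vectorized column operations, which is one plausible reason row-wise validation is slower on very large dataframes.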
Improved conda installation experience
Before this release there were only two conda packages: one to install pandera-core and another to install pandera (which would install all extras functionality).
The conda packaging now supports finer-grained control:
conda install -c conda-forge pandera-hypotheses # hypothesis checks
conda install -c conda-forge pandera-io # yaml/script schema io utilities
conda install -c conda-forge pandera-strategies # data synthesis strategies
conda install -c conda-forge pandera-mypy # enable static type-linting of pandas
conda install -c conda-forge pandera-fastapi # fastapi integration
conda install -c conda-forge pandera-dask # validate dask dataframes
conda install -c conda-forge pandera-pyspark # validate pyspark dataframes
conda install -c conda-forge pandera-modin # validate modin dataframes
conda install -c conda-forge pandera-modin-ray # validate modin dataframes with ray
conda install -c conda-forge pandera-modin-dask # validate modin dataframes with dask
Enhancements
- Add option to disallow duplicate column names #758
- Make SchemaModel use class name, define own config #761
- implement coercion-on-initialization for DataFrame[SchemaModel] types #772
- Update filtering columns for performance reasons. #777
- implement pydantic model data type #779
- make finding coerce failure cases faster #792
- add pyspark support, deprecate koalas #793
- Add overloads to schema.to_yaml #790
- Add overloads to infer_schema #789
Bugfixes
Deprecations
Docs Improvements
- add imports to fastapi docs
- add documentation for pandas_engine.DateTime #780
- update docs for 0.10.0 #795
- update docs with fastapi #804