pandera 0.9.0 on Python PyPI

Highlights

FastAPI Integration [Docs]

pandera now integrates with FastAPI. You can annotate endpoint arguments and the endpoint's response_model with DataFrame[Schema] types, and the endpoint will validate incoming and outgoing data.

import pandera as pa


# schema definitions
class Transactions(pa.SchemaModel):
    id: pa.typing.Series[int]
    cost: pa.typing.Series[float] = pa.Field(ge=0, le=1000)

    class Config:
        coerce = True

class TransactionsOut(Transactions):
    id: pa.typing.Series[int]
    cost: pa.typing.Series[float]
    name: pa.typing.Series[str]

class TransactionsDictOut(TransactionsOut):
    class Config:
        to_format = "dict"
        to_format_kwargs = {"orient": "records"}

App endpoint example:

from fastapi import FastAPI
from pandera.typing import DataFrame

app = FastAPI()

@app.post("/transactions/", response_model=DataFrame[TransactionsDictOut])
def create_transactions(transactions: DataFrame[Transactions]):
    output = transactions.assign(name="foo")
    ...  # do other stuff, e.g. update backend database with transactions
    return output

Data Format Conversion [Docs]

The class-based API now supports automatically deserializing/serializing pandas dataframes in the context of @pa.check_types-decorated functions, @pydantic.validate_arguments-decorated functions, and fastapi endpoint functions.

import pandera as pa
from pandera.typing import DataFrame, Series

# base schema definitions
class InSchema(pa.SchemaModel):
    str_col: Series[str] = pa.Field(unique=True, isin=[*"abcd"])
    int_col: Series[int]

class OutSchema(InSchema):
    float_col: Series[float]

# read and validate data from a parquet file
class InSchemaParquet(InSchema):
    class Config:
        from_format = "parquet"

# output data as a list of dictionary records
class OutSchemaDict(OutSchema):
    class Config:
        to_format = "dict"
        to_format_kwargs = {"orient": "records"}

@pa.check_types
def transform(df: DataFrame[InSchemaParquet]) -> DataFrame[OutSchemaDict]:
    return df.assign(float_col=1.1)

The transform function can then take a filepath or buffer containing a parquet file that pandera automatically reads and validates:

import io
import json

import pandas as pd

buffer = io.BytesIO()
data = pd.DataFrame({"str_col": [*"abc"], "int_col": range(3)})
data.to_parquet(buffer)
buffer.seek(0)

dict_output = transform(buffer)
print(json.dumps(dict_output, indent=4))

Output:

[
    {
        "str_col": "a",
        "int_col": 0,
        "float_col": 1.1
    },
    {
        "str_col": "b",
        "int_col": 1,
        "float_col": 1.1
    },
    {
        "str_col": "c",
        "int_col": 2,
        "float_col": 1.1
    }
]
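
For comparison, the output step can be approximated in plain pandas. This is a sketch that assumes pandera hands to_format_kwargs to the matching pandas writer (DataFrame.to_dict when to_format = "dict"). That is an assumption about the internals, not something these notes state.

```python
import pandas as pd

# Same data as above, with the column that transform() adds.
df = pd.DataFrame({"str_col": [*"abc"], "int_col": range(3)}).assign(float_col=1.1)

# orient="records" yields one dictionary per row.
records = df.to_dict(orient="records")
```

Other orient values accepted by DataFrame.to_dict, such as "list" or "index", would change the shape of the result in the same way.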

Data Validation with GeoPandas [Docs]

DataFrameSchemas can now validate geopandas.GeoDataFrame and GeoSeries objects:

import geopandas as gpd
import pandas as pd
import pandera as pa
from shapely.geometry import Polygon

geo_schema = pa.DataFrameSchema({
    "geometry": pa.Column("geometry"),
    "region": pa.Column(str),
})

geo_df = gpd.GeoDataFrame({
    "geometry": [
        Polygon(((0, 0), (0, 1), (1, 1), (1, 0))),
        Polygon(((0, 0), (0, -1), (-1, -1), (-1, 0)))
    ],
    "region": ["NA", "SA"]
})

geo_schema.validate(geo_df)

You can also define SchemaModel classes with a GeoSeries field type annotation to create validated GeoDataFrames, or use them in @pa.check_types-decorated functions for input/output validation:

from pandera.typing import Series
from pandera.typing.geopandas import GeoDataFrame, GeoSeries


class Schema(pa.SchemaModel):
    geometry: GeoSeries
    region: Series[str]


# create a geodataframe that's validated on object initialization
df = GeoDataFrame[Schema](
    {
        "geometry": [
            Polygon(((0, 0), (0, 1), (1, 1), (1, 0))),
            Polygon(((0, 0), (0, -1), (-1, -1), (-1, 0)))
        ],
        "region": ["NA", "SA"]
    }
)

Enhancements

Support GeoPandas data structures (#732)
FastAPI integration (#741)
add title/description fields (#754)
add nullable float dtypes (#721)

Bugfixes

typed descriptors and setup.py only includes pandera (#739)
@pa.dataframe_check works correctly on pandas==1.1.5 (#735)
fix set_index with MultiIndex (#751)
strategies: correctly handle StringArray null values (#748)

Docs Improvements

fastapi docs, add to ci (#753)

Testing Improvements

Add Python 3.10 to CI matrix (#724)

Contributors

Big shout out to the following folks for your contributions on this release 🎉🎉🎉
