New Features
- Added EMD/CVMD/KS distribution diagnostics
  - `BalanceDF` now exposes Earth Mover's Distance (EMD), Cramér-von Mises distance (CVMD), and Kolmogorov-Smirnov (KS) statistics for comparing adjusted samples to targets.
  - These diagnostics support weighted or unweighted comparisons, apply discrete/continuous formulations, and respect `aggregate_by_main_covar` for one-hot categorical aggregation.
- Exposed outcome columns selection in the CLI
  - Added `--outcome_columns` to choose which columns are treated as outcomes instead of defaulting to all non-id/weight/covariate columns. Remaining columns are moved to `ignored_columns`.
- Improved missing data handling in `poststratify()`
  - `poststratify()` now accepts `na_action` to either drop rows with missing values or treat missing values as their own category during weighting.
  - Breaking change: the default behavior now fills missing values in poststratification variables with `"__NaN__"` and treats this as a distinct category during weighting. Previously, missing values were not handled explicitly, and their treatment depended on pandas `groupby` and `merge` defaults. To approximate the legacy behavior where missing values do not form their own category, pass `na_action="drop"` explicitly.
- Added formula support for `descriptive_stats` model matrices
  - `descriptive_stats()` now accepts a `formula` argument that is always applied to the data (including numeric-only frames), letting callers control which terms and dummy variables are included in summary statistics.
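
The two missing-value behaviors described in the `poststratify()` item above can be illustrated without the library. The following is a minimal sketch in plain Python, not balance's implementation: the helper `cell_weights` and the mode name `"category"` are hypothetical and chosen only for illustration (the notes themselves name only `na_action="drop"` and the `"__NaN__"` fill value). It assumes a single poststratification variable and weights each sample row by target cell count divided by sample cell count.

```python
from collections import Counter

NA_LABEL = "__NaN__"  # fill value named in the release notes


def cell_weights(sample, target, na_action="category"):
    """Per-row poststratification weights for a single variable.

    sample, target: lists of category labels, with None marking a missing value.
    na_action: "category" (hypothetical name) keeps missing values as their own
    cell; "drop" removes rows with missing values before weighting.
    """
    if na_action == "drop":
        sample = [v for v in sample if v is not None]
        target = [v for v in target if v is not None]
    elif na_action == "category":
        sample = [NA_LABEL if v is None else v for v in sample]
        target = [NA_LABEL if v is None else v for v in target]
    else:
        raise ValueError(f"unknown na_action: {na_action!r}")

    sample_counts = Counter(sample)
    target_counts = Counter(target)
    # Each sample row in a cell gets target_count / sample_count, so the
    # weighted sample reproduces the target's cell totals.
    return [(v, target_counts[v] / sample_counts[v]) for v in sample]


sample = ["a", "a", None, "b"]
target = ["a", "b", "b", None, None, None]

as_category = cell_weights(sample, target, na_action="category")
dropped = cell_weights(sample, target, na_action="drop")

print(as_category)  # the missing-value row keeps a weight of its own
print(dropped)      # the missing-value row is gone, so fewer rows come back
```

In this toy data, treating missing values as a category keeps all four sample rows and their weights sum to the six target rows; dropping leaves three rows whose weights sum to the three non-missing target rows.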
Documentation
- Documented the balance CLI
  - Added full API docstrings for `balance.cli` and a new CLI tutorial notebook.
- Created Balance CLI tutorial
  - Added CLI command echoing, a `load_data()` example, and richer diagnostics exploration with metric/variable listings and a browsable diagnostics table. https://import-balance.org/docs/tutorials/balance_cli_tutorial/
- Synchronized docstring examples with test cases
  - Updated user-facing docstrings so the documented examples mirror tested inputs and outputs.
Code Quality & Refactoring
- Added a warning when the target is much larger than the sample
  - `Sample.adjust()` now warns when the target exceeds 100k rows and is at least 10x larger than the sample, highlighting that uncertainty is dominated by the sample (akin to a one-sample comparison).
- Split util helpers into focused modules
  - Broke `balance.util` into `balance.utils` submodules for easier navigation.
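
The size check behind the new `Sample.adjust()` warning can be sketched as follows. This is an illustrative stand-alone function, not the library's code: the helper name `warn_if_target_dominates`, the constant names, and the message text are assumptions, and only the two thresholds (more than 100k target rows, at least 10x the sample size) come from the note above. The intuition is that for independent samples the variance of a difference in means is roughly the sum of each side's variance divided by its size, so a very large target contributes almost nothing.

```python
import warnings

# Thresholds taken from the release note: more than 100k target rows and
# at least 10x the sample size.
TARGET_ROWS_THRESHOLD = 100_000
SIZE_RATIO_THRESHOLD = 10


def warn_if_target_dominates(n_sample, n_target):
    """Warn when the target is large enough that uncertainty comes from the sample.

    Returns True if a warning was issued (hypothetical helper, for illustration).
    """
    is_large = n_target > TARGET_ROWS_THRESHOLD
    is_much_larger = n_target >= SIZE_RATIO_THRESHOLD * n_sample
    if is_large and is_much_larger:
        warnings.warn(
            f"Target ({n_target} rows) is much larger than sample "
            f"({n_sample} rows); uncertainty is dominated by the sample.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    big_target = warn_if_target_dominates(n_sample=1_000, n_target=200_000)
    similar_sizes = warn_if_target_dominates(n_sample=50_000, n_target=200_000)
    small_target = warn_if_target_dominates(n_sample=100, n_target=5_000)

print(big_target, similar_sizes, small_target, len(caught))
```

Both conditions must hold: a target of 5,000 rows is 50x a sample of 100 but stays below the row threshold, so no warning is raised in this sketch.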
Bug Fixes
- Updated `Sample.__str__()` to format weight diagnostics like `Sample.summary()`
  - Weight diagnostics (design effect, effective sample size proportion, effective sample size) are now displayed on separate lines instead of comma-separated on one line.
  - Replaced "eff." abbreviations with the full word "effective" for better readability.
  - Improves consistency with the `Sample.summary()` output format.
- Numerically stable CBPS probabilities
  - The CBPS helper now uses a stable logistic transform to avoid exponential overflow warnings during probability computation in constraint checks.
- Silenced pandas observed default warning
  - Explicitly sets `observed=False` in weighted categorical KLD calculations to retain current behavior and avoid future pandas default changes.
- Fixed `plot_qq_categorical` to respect the `weighted` parameter for target data
  - Previously, the target weights were always applied regardless of the `weighted=False` setting, causing inconsistent behavior between sample and target proportions in categorical QQ plots.
- Restored CBPS tutorial plots
  - Re-enabled scatter plots in the CBPS comparison tutorial notebook while avoiding GitHub Pages rendering errors and pandas colormap warnings. https://import-balance.org/docs/tutorials/comparing_cbps_in_r_vs_python_using_sim_data/
- Clearer validation errors in adjustment helpers
  - `trim_weights()` now accepts list/tuple inputs and reports invalid types explicitly.
  - `apply_transformations()` raises clearer errors for invalid inputs and empty transformations.
- Fixed `model_matrix` to drop NA rows when requested
  - `model_matrix(add_na=False)` now actually drops rows containing NA values while preserving categorical levels, matching the documented behavior.
  - Previously, `add_na=False` only logged a warning without dropping rows; code relying on the old behavior may now see fewer rows and should either handle missingness explicitly or use `add_na=True`.
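
The "stable logistic transform" mentioned in the CBPS fix above refers to a standard numerical technique. The sketch below shows one common way to write it in plain Python; it is not the library's code, and the function names `naive_sigmoid` and `stable_sigmoid` are illustrative. The idea is to branch on the sign of the input so that the exponential is only ever taken of a non-positive number.

```python
import math


def naive_sigmoid(x):
    """Textbook logistic function; exp(-x) overflows for large negative x."""
    return 1.0 / (1.0 + math.exp(-x))


def stable_sigmoid(x):
    """Logistic function 1 / (1 + exp(-x)) without overflow for large |x|."""
    if x >= 0:
        # exp(-x) <= 1 here, so it cannot overflow.
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, exp(x) <= 1; use the equivalent form exp(x) / (1 + exp(x)).
    z = math.exp(x)
    return z / (1.0 + z)


try:
    naive_sigmoid(-1000.0)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

print(naive_overflowed)
print(stable_sigmoid(-1000.0), stable_sigmoid(0.0), stable_sigmoid(1000.0))
```

With scalar `math.exp` the naive form raises `OverflowError`; array libraries typically emit an overflow warning instead, which is the symptom the note describes.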
Tests
- Aligned formatting toolchain between Meta internal and GitHub CI
  - Added `["fbcode/core_stats/balance"]` override to Meta's internal `tools/lint/pyfmt/config.toml` to use `formatter = "black"` and `sorter = "usort"`.
  - This ensures both internal (`pyfmt`/`arc lint`) and external (GitHub Actions) environments use the same Black 25.1.0 formatter, eliminating formatting drift.
  - Updated CI workflow, pre-commit config, and `requirements-fmt.txt` to use `black==25.1.0`.
- Added Pyre type checking to GitHub Actions via `.pyre_configuration.external` and a new `pyre` job in the workflow. Tests are excluded due to external typeshed stub differences; library code is fully type-checked.
- Added test coverage workflow and badge to README via `.github/workflows/coverage.yml`. The workflow collects coverage using pytest-cov, generates HTML and XML reports, uploads them as artifacts, and displays coverage metrics. A coverage badge is now shown in README.md alongside other workflow badges.
- Improved test coverage for edge cases and error handling paths
  - Added targeted tests for previously uncovered code paths across the library, addressing edge cases including empty inputs, verbose logging, error handling for invalid parameters, and boundary conditions in weighting methods (IPW, CBPS, rake).
  - Tests exercise defensive code paths that handle empty DataFrames, NaN convergence values, invalid model types, and non-convergence warnings.
- Split test_util.py into focused test modules
  - Split the large `test_util.py` file (2325 lines) into 5 modular test files that mirror the `balance/utils/` structure:
    - `test_util_data_transformation.py` - Tests for data transformation utilities
    - `test_util_input_validation.py` - Tests for input validation utilities
    - `test_util_model_matrix.py` - Tests for model matrix utilities
    - `test_util_pandas_utils.py` - Tests for pandas utilities (including high cardinality warnings)
    - `test_util_logging_utils.py` - Tests for logging utilities
  - This improves test organization and makes it easier to locate tests for specific utilities.
Full Changelog: 0.14.0...0.15.0