facebookresearch/balance 0.15.0 on GitHub

New Features

Added EMD/CVMD/KS distribution diagnostics
- BalanceDF now exposes Earth Mover's Distance (EMD), Cramér-von Mises distance (CVMD), and Kolmogorov-Smirnov (KS) statistics for comparing adjusted samples to targets.
- These diagnostics support weighted or unweighted comparisons, apply discrete or continuous formulations as appropriate, and respect aggregate_by_main_covar for one-hot categorical aggregation.
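
To illustrate what these statistics measure, here is a minimal pure-Python sketch of weighted KS and EMD for a single numeric covariate. It is not balance's implementation: the names weighted_ecdf and ks_and_emd are hypothetical, CVMD is omitted, and neither the discrete formulation nor aggregate_by_main_covar handling is shown.

```python
from bisect import bisect_right
from itertools import accumulate


def weighted_ecdf(values, weights):
    """Return F(x): the weighted share of observations that are <= x."""
    pairs = sorted(zip(values, weights))
    xs = [value for value, _ in pairs]
    cumulative = list(accumulate(weight for _, weight in pairs))
    total = cumulative[-1]

    def cdf(x):
        k = bisect_right(xs, x)  # number of observations <= x
        return cumulative[k - 1] / total if k else 0.0

    return cdf


def ks_and_emd(sample, sample_weights, target, target_weights):
    """Hypothetical helper: (KS, EMD) between two weighted 1-D samples."""
    f = weighted_ecdf(sample, sample_weights)
    g = weighted_ecdf(target, target_weights)
    grid = sorted(set(sample) | set(target))
    gaps = [abs(f(x) - g(x)) for x in grid]
    ks = max(gaps)  # largest vertical gap between the two ECDFs
    # EMD in one dimension is the area between the two step-function ECDFs.
    emd = sum(gap * (right - left) for gap, left, right in zip(gaps, grid, grid[1:]))
    return ks, emd
```

Passing all-ones weights gives the unweighted comparison; passing adjusted weights for the sample shows how far the adjusted distribution still is from the target.
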
Exposed outcome column selection in the CLI
- Added --outcome_columns to choose which columns are treated as outcomes
  instead of defaulting to all non-id/weight/covariate columns. Remaining columns are moved to ignored_columns.
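
The column partitioning described above can be sketched as follows. This is an assumption-laden illustration, not the CLI's code: split_columns and its parameters are hypothetical names chosen for this example.

```python
def split_columns(all_columns, id_column, weight_column, covariate_columns,
                  outcome_columns=None):
    """Hypothetical helper: return (outcomes, ignored) column lists."""
    reserved = {id_column, weight_column, *covariate_columns}
    remaining = [c for c in all_columns if c not in reserved]
    if outcome_columns is None:
        # Default described in the notes: every remaining column is an outcome.
        return remaining, []
    chosen = set(outcome_columns)
    outcomes = [c for c in remaining if c in chosen]
    ignored = [c for c in remaining if c not in chosen]
    return outcomes, ignored
```

With columns id, weight, age, happiness, and notes, where age is the only covariate, selecting happiness as the outcome leaves notes in the ignored list.
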
Improved missing data handling in poststratify()
- poststratify() now accepts na_action to either drop rows with missing
  values or treat missing values as their own category during weighting.
- Breaking change: the default behavior now fills missing values in
  poststratification variables with "__NaN__" and treats this as a distinct
  category during weighting. Previously, missing values were not handled
  explicitly, and their treatment depended on pandas groupby and merge
  defaults. To approximate the legacy behavior where missing values do not
  form their own category, pass na_action="drop" explicitly.
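
The difference between the two behaviors can be sketched with plain Python cell counts. This is only an illustration: cell_counts is a hypothetical helper, missing values are represented by None, and "category" is a made-up label for the non-drop option (only "drop" and the "__NaN__" fill value come from the notes above).

```python
from collections import Counter

NA_TOKEN = "__NaN__"  # fill value named in the release notes


def cell_counts(rows, variables, na_action="category"):
    """Hypothetical helper: count rows per poststratification cell."""
    counts = Counter()
    for row in rows:
        cell = tuple(row.get(v) for v in variables)
        if any(value is None for value in cell):
            if na_action == "drop":
                continue  # rows with missing values leave the analysis
            # Otherwise missing values become their own category.
            cell = tuple(NA_TOKEN if value is None else value for value in cell)
        counts[cell] += 1
    return counts
```

Because weights are computed per cell, the choice changes both the number of cells and how many rows receive a weight.
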
Added formula support for descriptive_stats model matrices
- descriptive_stats() now accepts a formula argument that is always
  applied to the data (including numeric-only frames), letting callers
  control which terms and dummy variables are included in summary statistics.

Documentation

Documented the balance CLI
- Added full API docstrings for balance.cli and a new CLI tutorial notebook.
Created Balance CLI tutorial
- Added CLI command echoing, a load_data() example, and richer diagnostics exploration with metric/variable listings and a browsable diagnostics table. https://import-balance.org/docs/tutorials/balance_cli_tutorial/
Synchronized docstring examples with test cases
- Updated user-facing docstrings so the documented examples mirror tested inputs
  and outputs.

Code Quality & Refactoring

Added a warning when the target is much larger than the sample
- Sample.adjust() now warns when the target exceeds 100k rows and is at
  least 10x larger than the sample, highlighting that uncertainty is
  dominated by the sample (akin to a one-sample comparison).
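
A minimal sketch of the stated condition, assuming it is evaluated on row counts alone. The function name, parameter names, and message text are hypothetical; only the 100k and 10x thresholds come from the note above.

```python
import warnings


def warn_if_target_dominates(n_sample, n_target,
                             min_target_rows=100_000, min_ratio=10):
    """Hypothetical helper: warn when the target dwarfs the sample."""
    triggered = n_target > min_target_rows and n_target >= min_ratio * n_sample
    if triggered:
        warnings.warn(
            f"Target ({n_target} rows) is much larger than sample "
            f"({n_sample} rows); uncertainty is dominated by the sample."
        )
    return triggered
```

Both conditions must hold: a 200,000-row target triggers the warning for a 1,000-row sample but not for a 50,000-row sample.
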
Split util helpers into focused modules
- Broke balance.util into balance.utils submodules for easier navigation.

Bug Fixes

Updated Sample.__str__() to format weight diagnostics like Sample.summary()
- Weight diagnostics (design effect, effective sample size proportion, effective sample size)
  are now displayed on separate lines instead of comma-separated on one line.
- Replaced the "eff." abbreviation with the full word "effective" for better readability.
- Improves consistency with Sample.summary() output format.
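
For readers unfamiliar with these three quantities, the sketch below computes them under the assumption that the design effect is Kish's. It is not taken from balance, and the function name and dictionary keys are hypothetical.

```python
def weight_diagnostics(weights):
    """Hypothetical helper: Kish design effect and derived quantities."""
    n = len(weights)
    total = sum(weights)
    # Kish's design effect: n * sum(w^2) / (sum(w))^2
    design_effect = n * sum(w * w for w in weights) / (total * total)
    return {
        "design_effect": design_effect,
        "effective_sample_size_proportion": 1.0 / design_effect,
        "effective_sample_size": n / design_effect,
    }
```

Equal weights give a design effect of 1, so the effective sample size equals the number of rows; the more the weights vary, the smaller the effective sample size becomes.
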
Numerically stable CBPS probabilities
- The CBPS helper now uses a stable logistic transform to avoid exponential
  overflow warnings during probability computation in constraint checks.
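
One common way to write a stable logistic transform is to branch on the sign of the input so the exponential is only ever taken of a non-positive number. The sketch below assumes that approach; the notes do not say which formulation the CBPS helper uses, and both function names are hypothetical.

```python
import math


def naive_sigmoid(x):
    """Textbook form; math.exp(-x) overflows for large negative x."""
    return 1.0 / (1.0 + math.exp(-x))


def stable_sigmoid(x):
    """Same function, written so exp() never receives a positive argument."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0, so z lies in [0, 1) and cannot overflow
    return z / (1.0 + z)
```

With math.exp the naive form raises OverflowError at x = -1000, while the stable form returns 0.0. With array libraries such as NumPy the same overflow typically appears as a runtime warning rather than an exception.
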
Silenced pandas observed default warning
- Explicitly sets observed=False in weighted categorical KLD calculations
  so that current behavior is retained even if the pandas default changes.
Fixed plot_qq_categorical to respect the weighted parameter for target data
- Previously, the target weights were always applied regardless of the
  weighted=False setting, causing inconsistent behavior between sample
  and target proportions in categorical QQ plots.
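
The intended behavior can be sketched as a single proportion function applied identically to sample and target. This is an illustration only: category_proportions is a hypothetical helper, not part of plot_qq_categorical.

```python
from collections import defaultdict


def category_proportions(values, weights, weighted=True):
    """Hypothetical helper: share of each category, optionally weighted."""
    totals = defaultdict(float)
    for value, weight in zip(values, weights):
        # With weighted=False every row counts once, whatever its weight.
        totals[value] += weight if weighted else 1.0
    grand_total = sum(totals.values())
    return {category: total / grand_total for category, total in totals.items()}
```

The bug described above corresponds to calling this with weighted=True for the target even when the caller asked for weighted=False.
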
Restored CBPS tutorial plots
- Re-enabled scatter plots in the CBPS comparison tutorial notebook while
  avoiding GitHub Pages rendering errors and pandas colormap warnings. https://import-balance.org/docs/tutorials/comparing_cbps_in_r_vs_python_using_sim_data/
Clearer validation errors in adjustment helpers
- trim_weights() now accepts list/tuple inputs and reports invalid types explicitly.
- apply_transformations() raises clearer errors for invalid inputs and empty transformations.
Fixed model_matrix to drop NA rows when requested
- model_matrix(add_na=False) now actually drops rows containing NA values while preserving categorical levels, matching the documented behavior.
- Previously, add_na=False only logged a warning without dropping rows; code relying on the old behavior may now see fewer rows and should either handle missingness explicitly or use add_na=True.

Tests

Aligned formatting toolchain between Meta internal and GitHub CI
- Added ["fbcode/core_stats/balance"] override to Meta's internal tools/lint/pyfmt/config.toml to use formatter = "black" and sorter = "usort".
- This ensures both internal (pyfmt/arc lint) and external (GitHub Actions) environments use the same Black 25.1.0 formatter, eliminating formatting drift.
- Updated CI workflow, pre-commit config, and requirements-fmt.txt to use black==25.1.0.
Added Pyre type checking to GitHub Actions
- Added .pyre_configuration.external and a new pyre job in the workflow.
- Tests are excluded due to external typeshed stub differences; library code is fully type-checked.
Added test coverage workflow and badge to README
- Added .github/workflows/coverage.yml, which collects coverage using pytest-cov, generates HTML and XML reports, uploads them as artifacts, and displays coverage metrics.
- A coverage badge is now shown in README.md alongside other workflow badges.
Improved test coverage for edge cases and error handling paths
- Added targeted tests for previously uncovered code paths across the library, addressing edge cases including empty inputs, verbose logging, error handling for invalid parameters, and boundary conditions in weighting methods (IPW, CBPS, rake).
- Tests exercise defensive code paths that handle empty DataFrames, NaN convergence values, invalid model types, and non-convergence warnings.
Split test_util.py into focused test modules
- Split the large test_util.py file (2325 lines) into 5 modular test files that mirror the balance/utils/ structure:
  - test_util_data_transformation.py - Tests for data transformation utilities
  - test_util_input_validation.py - Tests for input validation utilities
  - test_util_model_matrix.py - Tests for model matrix utilities
  - test_util_pandas_utils.py - Tests for pandas utilities (including high cardinality warnings)
  - test_util_logging_utils.py - Tests for logging utilities
- This improves test organization and makes it easier to locate tests for specific utilities.

Contributors

@neuralsorcerer, @talgalili

Full Changelog: 0.14.0...0.15.0

facebookresearch/balance 0.15.0 (2026-01-20) on GitHub
