github facebookresearch/balance 0.15.0
0.15.0 (2026-01-20)

New Features

  • Added EMD/CVMD/KS distribution diagnostics
    • BalanceDF now exposes Earth Mover's Distance (EMD), Cramér-von Mises distance (CVMD), and Kolmogorov-Smirnov (KS) statistics for comparing adjusted samples to targets.
    • These diagnostics support weighted or unweighted comparisons, apply discrete/continuous formulations, and respect aggregate_by_main_covar for one-hot categorical aggregation.
  • Exposed outcome column selection in the CLI
    • Added --outcome_columns to choose which columns are treated as outcomes
      instead of defaulting to all non-id/weight/covariate columns. Remaining columns are moved to ignored_columns.
  • Improved missing data handling in poststratify()
    • poststratify() now accepts na_action to either drop rows with missing
      values or treat missing values as their own category during weighting.
    • Breaking change: the default behavior now fills missing values in
      poststratification variables with "__NaN__" and treats this as a distinct
      category during weighting. Previously, missing values were not handled
      explicitly, and their treatment depended on pandas groupby and merge
      defaults. To approximate the legacy behavior where missing values do not
      form their own category, pass na_action="drop" explicitly.
  • Added formula support for descriptive_stats model matrices
    • descriptive_stats() now accepts a formula argument that is always
      applied to the data (including numeric-only frames), letting callers
      control which terms and dummy variables are included in summary statistics.
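
To make the KS and EMD diagnostics listed above more concrete, here is a minimal pure-Python sketch of both statistics for a single numeric variable, with optional weights. It is an illustration of the textbook definitions, not the BalanceDF implementation; the function names (weighted_ecdf, ecdf_at, ks_and_emd) are hypothetical, and the discrete formulation and aggregate_by_main_covar handling mentioned in the notes are not covered.

```python
from bisect import bisect_right


def weighted_ecdf(values, weights=None):
    """Return (sorted unique values, cumulative weight share at each value)."""
    if weights is None:
        weights = [1.0] * len(values)  # unweighted comparison
    total = float(sum(weights))
    xs, cdf, running = [], [], 0.0
    for x, w in sorted(zip(values, weights)):
        running += w
        if xs and xs[-1] == x:
            cdf[-1] = running / total  # merge tied values into one step
        else:
            xs.append(x)
            cdf.append(running / total)
    return xs, cdf


def ecdf_at(xs, cdf, point):
    """Evaluate a step ECDF at `point` (0.0 to the left of the support)."""
    idx = bisect_right(xs, point)
    return cdf[idx - 1] if idx else 0.0


def ks_and_emd(sample, target, sample_w=None, target_w=None):
    """Hypothetical helper: KS statistic and 1-D EMD between two samples."""
    sx, sc = weighted_ecdf(sample, sample_w)
    tx, tc = weighted_ecdf(target, target_w)
    grid = sorted(set(sx) | set(tx))
    # Absolute gap between the two ECDFs at every observed value.
    gaps = [abs(ecdf_at(sx, sc, g) - ecdf_at(tx, tc, g)) for g in grid]
    ks = max(gaps)  # largest vertical gap
    # In one dimension, EMD is the area between the two step functions.
    emd = sum(gap * (hi - lo) for gap, lo, hi in zip(gaps, grid, grid[1:]))
    return ks, emd
```

For example, comparing an unweighted sample [1, 2, 3] with a target [1, 2, 3, 4] gives a KS statistic of 0.25 and an EMD of 0.5. Giving the sample weights [1, 1, 2] shrinks the EMD to 0.25, which is the kind of before/after comparison these diagnostics are meant to support.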

Documentation

  • Documented the balance CLI
    • Added full API docstrings for balance.cli and a new CLI tutorial notebook.
  • Created Balance CLI tutorial
  • Synchronized docstring examples with test cases
    • Updated user-facing docstrings so the documented examples mirror tested inputs
      and outputs.

Code Quality & Refactoring

  • Added a warning when the target is much larger than the sample
    • Sample.adjust() now warns when the target exceeds 100k rows and is at
      least 10x larger than the sample, highlighting that uncertainty is
      dominated by the sample (akin to a one-sample comparison).
  • Split util helpers into focused modules
    • Broke balance.util into balance.utils submodules for easier navigation.
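
The size check behind the new Sample.adjust() warning can be sketched as follows. The two thresholds (more than 100k target rows, and at least 10x the sample size) come from the note above; the helper name, its signature, and the warning text are hypothetical and are not taken from the library.

```python
import warnings

# Thresholds as stated in the release note; everything else is illustrative.
TARGET_ROWS_THRESHOLD = 100_000
SIZE_RATIO_THRESHOLD = 10


def warn_if_target_dominates(sample_rows, target_rows):
    """Hypothetical helper: warn and return True when both conditions hold."""
    is_large = target_rows > TARGET_ROWS_THRESHOLD
    is_much_larger = target_rows >= SIZE_RATIO_THRESHOLD * sample_rows
    if is_large and is_much_larger:
        warnings.warn(
            "Target is much larger than the sample; uncertainty is dominated "
            "by the sample (akin to a one-sample comparison).",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False
```

Both conditions must hold: a 200,000-row target paired with a 1,000-row sample triggers the warning, while the same target paired with a 50,000-row sample does not.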

Bug Fixes

  • Updated Sample.__str__() to format weight diagnostics like Sample.summary()
    • Weight diagnostics (design effect, effective sample size proportion, effective sample size)
      are now displayed on separate lines instead of comma-separated on one line.
    • Replaced the "eff." abbreviation with the full word "effective" for better readability.
    • Improves consistency with Sample.summary() output format.
  • Numerically stable CBPS probabilities
    • The CBPS helper now uses a stable logistic transform to avoid exponential
      overflow warnings during probability computation in constraint checks.
  • Silenced pandas observed default warning
    • Explicitly sets observed=False in weighted categorical KLD calculations
      to retain current behavior regardless of future pandas default changes.
  • Fixed plot_qq_categorical to respect the weighted parameter for target data
    • Previously, the target weights were always applied regardless of the
      weighted=False setting, causing inconsistent behavior between sample
      and target proportions in categorical QQ plots.
  • Restored CBPS tutorial plots
  • Clearer validation errors in adjustment helpers
    • trim_weights() now accepts list/tuple inputs and reports invalid types explicitly.
    • apply_transformations() raises clearer errors for invalid inputs and empty transformations.
  • Fixed model_matrix to drop NA rows when requested
    • model_matrix(add_na=False) now actually drops rows containing NA values while preserving categorical levels, matching the documented behavior.
    • Previously, add_na=False only logged a warning without dropping rows; code relying on the old behavior may now see fewer rows and should either handle missingness explicitly or use add_na=True.
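
For readers curious about the "Numerically stable CBPS probabilities" fix above, the usual way to make a logistic transform stable is to branch on the sign of the input so that the exponential is only ever taken of a non-positive number. The sketch below shows that general technique in pure Python; it is not the CBPS helper's code, and the function name stable_sigmoid is hypothetical.

```python
import math


def stable_sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)) computed without overflow."""
    if z >= 0:
        # exp(-z) is at most 1 here, so it cannot overflow.
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, rewrite as exp(z) / (1 + exp(z)); exp(z) is below 1.
    e = math.exp(z)
    return e / (1.0 + e)
```

The naive form 1 / (1 + math.exp(-z)) raises OverflowError in pure Python for strongly negative inputs such as z = -1000 (NumPy would emit an overflow warning instead), whereas the branched version underflows quietly to 0.0.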

Tests

  • Aligned formatting toolchain between Meta internal and GitHub CI
    • Added ["fbcode/core_stats/balance"] override to Meta's internal tools/lint/pyfmt/config.toml to use formatter = "black" and sorter = "usort".
    • This ensures both internal (pyfmt/arc lint) and external (GitHub Actions) environments use the same Black 25.1.0 formatter, eliminating formatting drift.
    • Updated CI workflow, pre-commit config, and requirements-fmt.txt to use black==25.1.0.
  • Added Pyre type checking to GitHub Actions via .pyre_configuration.external and a new pyre job in the workflow. Tests are excluded due to external typeshed stub differences; library code is fully type-checked.
  • Added test coverage workflow and badge to README via .github/workflows/coverage.yml. The workflow collects coverage using pytest-cov, generates HTML and XML reports, uploads them as artifacts, and displays coverage metrics. A coverage badge is now shown in README.md alongside other workflow badges.
  • Improved test coverage for edge cases and error handling paths
    • Added targeted tests for previously uncovered code paths across the library, addressing edge cases including empty inputs, verbose logging, error handling for invalid parameters, and boundary conditions in weighting methods (IPW, CBPS, rake).
    • Tests exercise defensive code paths that handle empty DataFrames, NaN convergence values, invalid model types, and non-convergence warnings.
  • Split test_util.py into focused test modules
    • Split the large test_util.py file (2325 lines) into 5 modular test files that mirror the balance/utils/ structure:
      • test_util_data_transformation.py - Tests for data transformation utilities
      • test_util_input_validation.py - Tests for input validation utilities
      • test_util_model_matrix.py - Tests for model matrix utilities
      • test_util_pandas_utils.py - Tests for pandas utilities (including high cardinality warnings)
      • test_util_logging_utils.py - Tests for logging utilities
    • This improves test organization and makes it easier to locate tests for specific utilities.

Contributors

@neuralsorcerer, @talgalili

Full Changelog: 0.14.0...0.15.0
