pypi bm25s 0.3.0

4 hours ago

Breaking Changes

  • SciPy is no longer a required dependency. scipy has been removed from install_requires in setup.py, and the library now uses a pure NumPy-based CSC matrix builder by default. If you need SciPy's CSC builder, install scipy separately or via pip install bm25s[indexing], then pass csc_backend="scipy" to BM25().
  • Import change: internally, from bm25s import selection has become from bm25s import selection as selection_np. If you were importing selection directly from bm25s, update your imports.

New Features

High-Level API (bm25s.high_level)

A new simplified 1-line indexing and 1-line search API:

import bm25s.high_level as bm25

corpus = bm25.load("documents.csv", document_column="text")
retriever = bm25.index(corpus)
results = retriever.search(["your query"], k=5)

  • bm25.load() supports CSV, JSON, JSONL, and TXT files with automatic format detection.
  • bm25.index() handles tokenization (with stemming + stopword removal) and indexing in one call.
  • BM25Search.search() returns ranked results with document text, scores, and IDs.
  • Handles empty queries gracefully by filtering them before retrieval.
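
The empty-query handling mentioned above can be pictured as filtering blank queries before retrieval and restoring their positions afterward. The sketch below only illustrates that pattern and is not the library's code: `search_skipping_empty`, `search_fn`, and `fake_search` are hypothetical names, and returning an empty list for a blank query is an assumption.

```python
from typing import Callable, List, Sequence


def search_skipping_empty(
    queries: Sequence[str],
    search_fn: Callable[[List[str]], List[list]],
) -> List[list]:
    """Run search_fn on non-blank queries only, keeping output aligned with input."""
    # Remember the original position of every query that has content.
    kept = [(i, q) for i, q in enumerate(queries) if q.strip()]
    # Assumption: a blank query maps to an empty result list.
    results: List[list] = [[] for _ in queries]
    if kept:
        found = search_fn([q for _, q in kept])
        for (i, _), hits in zip(kept, found):
            results[i] = hits
    return results


def fake_search(batch: List[str]) -> List[list]:
    """Stand-in retriever (hypothetical): returns one pseudo-hit per query."""
    return [[f"hit for {q!r}"] for q in batch]


print(search_skipping_empty(["cats", "", "  ", "dogs"], fake_search))
```

Keeping the output list the same length as the input means callers can still pair each query with its results by position.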

Command-Line Interface (bm25 CLI)

A new terminal CLI via the bm25 console script entry point:

  • bm25 index <file> — Index documents from CSV, TXT, JSON, or JSONL files.
    • -o to specify output directory, -c to specify text column, -u to save to user directory (~/.bm25s/indices/).
  • bm25 search -i <index> "query" — Search an existing index.
    • -k for top-k, -s to save results as JSON, -u for user directory with interactive index picker.
  • Interactive index picker (requires pip install bm25s[cli] for Rich-based UI, falls back to plain text).
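
For scripted use, the commands above can be assembled and launched from Python. This is a minimal sketch using only the flags listed in these notes; the file name `documents.csv`, the column `text`, and the directory `my_index` are hypothetical, and `run_if_available` is a helper invented for this example.

```python
import shutil
import subprocess

# Hypothetical paths and column name; replace with your own.
index_cmd = ["bm25", "index", "documents.csv", "-o", "my_index", "-c", "text"]
search_cmd = ["bm25", "search", "-i", "my_index", "your query", "-k", "5"]


def run_if_available(cmd):
    """Run cmd only if its executable is on PATH; return the exit code, else None."""
    if shutil.which(cmd[0]) is None:
        print("skipping (executable not found):", " ".join(cmd))
        return None
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    # Index first, then search the freshly written index.
    if run_if_available(index_cmd) == 0:
        run_if_available(search_cmd)
```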

MCP Server (bm25s.mcp)

A built-in Model Context Protocol server to expose BM25 indices as tools for LLMs:

  • bm25 mcp launch --index-dir <path> — Launch an MCP server with retrieve and get_info tools.
  • Compatible with Claude Desktop and other MCP clients.
  • Install with pip install bm25s[mcp].
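
To register the server with Claude Desktop, an entry can be added under the `mcpServers` key of its `claude_desktop_config.json`. The sketch below generates such an entry from the launch command above. It assumes the standard `mcpServers` layout; the label `bm25s` and the path `/path/to/my_index` are placeholders, and the project README remains the authoritative source for the recommended configuration.

```python
import json

index_dir = "/path/to/my_index"  # hypothetical index directory

config = {
    "mcpServers": {
        "bm25s": {  # arbitrary server label chosen for this example
            "command": "bm25",
            "args": ["mcp", "launch", "--index-dir", index_dir],
        }
    }
}

# Print the JSON so it can be merged into claude_desktop_config.json by hand.
print(json.dumps(config, indent=2))
```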

Numba Compilation & Auto-Compile

  • New compile() method on BM25 for explicit JIT compilation of both the scorer and CSC builder.
  • New auto_compile=True parameter on BM25.__init__() — automatically compiles Numba JIT functions on initialization.
  • New warmup_numba_scorer() and warmup_numba_csc() methods to pre-trigger JIT compilation with dummy data.
  • New activate_numba_csc() method — applies Numba JIT to the CSC matrix builder for faster indexing.
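
The two compilation routes described above might be used as follows. This is a sketch under stated assumptions: it needs bm25s 0.3.0 or later with Numba installed, it assumes `compile()` can be called without arguments, and the wrapper function names are invented for this example. The functions are only defined here, not called, so the snippet loads even where bm25s is absent.

```python
import importlib.util

# True when some bm25s installation is importable; the calls below additionally
# assume version 0.3.0 or later with Numba available.
BM25S_INSTALLED = importlib.util.find_spec("bm25s") is not None


def retriever_with_auto_compile():
    """Request JIT compilation during construction via auto_compile=True."""
    import bm25s

    return bm25s.BM25(auto_compile=True)


def retriever_with_explicit_compile():
    """Construct first, then compile the scorer and CSC builder on demand."""
    import bm25s

    retriever = bm25s.BM25()
    retriever.compile()  # assumption: no arguments are required
    return retriever
```

Compiling up front moves the one-time JIT cost out of the first indexing or retrieval call, which can matter for latency-sensitive services.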

Pure NumPy CSC Matrix Construction

  • New csc_backend parameter on BM25(): choose "numpy" (default), "scipy", or "auto".
  • _np_csc_python() — Pure NumPy implementation using packed-index argsort.
  • _np_csc_jit_ready() — Numba-compilable implementation using counting sort (linear time).
  • Eliminates the scipy dependency for index construction.
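
The two construction strategies named above can be illustrated in plain Python. The sketch below is not the code of `_np_csc_python()` or `_np_csc_jit_ready()`; it only shows the underlying ideas on COO-style triplets, with function names invented for this example. It assumes no duplicate (row, column) pairs and that the input is ordered by row, so both approaches produce the same layout.

```python
def csc_by_packed_sort(rows, cols, vals, n_rows, n_cols):
    """Build CSC arrays by sorting one packed key per entry (O(n log n))."""
    # Pack (col, row) into a single integer so one sort orders by column, then row.
    packed = [c * n_rows + r for r, c in zip(rows, cols)]
    order = sorted(range(len(packed)), key=packed.__getitem__)  # argsort
    data = [vals[i] for i in order]
    indices = [rows[i] for i in order]
    indptr = [0] * (n_cols + 1)
    for c in cols:
        indptr[c + 1] += 1
    for c in range(n_cols):
        indptr[c + 1] += indptr[c]  # prefix sum gives column start offsets
    return data, indices, indptr


def csc_by_counting_sort(rows, cols, vals, n_rows, n_cols):
    """Build CSC arrays with a counting sort over column ids (O(n + n_cols))."""
    indptr = [0] * (n_cols + 1)
    for c in cols:
        indptr[c + 1] += 1
    for c in range(n_cols):
        indptr[c + 1] += indptr[c]
    next_slot = indptr[:-1]  # slicing copies: next free position per column
    data = [0.0] * len(vals)
    indices = [0] * len(vals)
    for r, c, v in zip(rows, cols, vals):
        pos = next_slot[c]
        data[pos] = v
        indices[pos] = r
        next_slot[c] += 1
    return data, indices, indptr


# 3x3 example matrix given in row-major order:
# [[1, 0, 2],
#  [0, 3, 0],
#  [4, 0, 5]]
rows = [0, 0, 1, 2, 2]
cols = [0, 2, 1, 0, 2]
vals = [1.0, 2.0, 3.0, 4.0, 5.0]

print(csc_by_packed_sort(rows, cols, vals, 3, 3))
print(csc_by_counting_sort(rows, cols, vals, 3, 3))
```

The counting-sort variant uses only simple loops over flat arrays, which is the kind of code Numba compiles well.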

Parameter Overrides on Load

  • BM25.load() now accepts override_params={} and **kwargs to override saved parameters at load time (e.g., change auto_compile, backend, etc.).
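
Conceptually, a load-time override is a dictionary merge in which later sources win. The sketch below is a hypothetical illustration rather than the library's implementation: `resolve_params` is an invented helper, the saved values are examples, and giving keyword arguments precedence over `override_params` is an assumption.

```python
def resolve_params(saved, override_params=None, **kwargs):
    """Merge saved parameters with load-time overrides; later sources win."""
    merged = dict(saved)  # copy so the saved parameters are left untouched
    merged.update(override_params or {})
    merged.update(kwargs)  # assumption: keyword arguments take final precedence
    return merged


# Example values standing in for parameters stored alongside an index.
saved = {"k1": 1.5, "b": 0.75, "backend": "numpy", "auto_compile": False}

print(resolve_params(saved, override_params={"backend": "numba"}, auto_compile=True))
```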

Improvements

  • Reduced disk footprint: Base install is now ~51MB (down from ~479MB) since scipy is no longer required. Documented in updated disk usage table in README.
  • _faketqdm fix: The fallback tqdm replacement now properly handles being called with no positional arguments (returns None instead of raising).
  • Consistent tqdm disabling: All modules (__init__, scoring, tokenization, hf, beir, corpus) now respect the DISABLE_TQDM environment variable uniformly.
  • dtype handling: _compute_relevance_from_scores now wraps dtype with np.dtype() for compatibility with Numba JIT.
  • Numba availability checks: Replaced selection_jit is None checks with a proper NUMBA_AVAILABLE boolean flag.
  • Numba JIT disable guard: activate_numba_scorer() now respects the NUMBA_DISABLE_JIT environment variable.
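
Two of the fixes above follow common patterns: reading an environment variable as an on/off switch, and a tqdm fallback that tolerates being called without an iterable. The sketch below illustrates both with invented helpers (`env_flag`, `fake_tqdm`); which values the library itself treats as "enabled" for `DISABLE_TQDM` is not stated in these notes, so the parsing rule here is an assumption.

```python
import os


def env_flag(name, default=False):
    """Interpret an environment variable as a boolean switch."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    # Assumption: empty, "0", "false", and "no" mean off; anything else means on.
    return raw.strip().lower() not in ("", "0", "false", "no")


def fake_tqdm(iterable=None, *args, **kwargs):
    """Minimal tqdm stand-in: pass the iterable through, or return None if absent."""
    return iterable


os.environ["DISABLE_TQDM"] = "1"
print(env_flag("DISABLE_TQDM"))        # progress bars would be switched off
print(env_flag("NUMBA_DISABLE_JIT"))   # depends on the caller's environment
print(list(fake_tqdm(range(3), desc="indexing")))
print(fake_tqdm())
```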

New Install Extras

Extra      Packages   Purpose
mcp        mcp        MCP server support
cli        rich       Rich terminal UI for interactive index picker
indexing   scipy      scipy-based CSC matrix construction

CI/CD

  • New test jobs: Separate test-numba and test-high-level CI jobs with proper thread-safety env vars (OMP_NUM_THREADS=1, etc.).
  • Coverage reporting: Core tests now run with coverage and report percentage.
  • Branch triggers: CI now runs on dev* branches in addition to main.
  • New workflows: Added claude.yml (Claude Code GitHub Action for issue/PR interaction) and claude-code-review.yml (automated PR code review).

New Test Coverage

  • tests/core/test_core_coverage.py — 447 lines of comprehensive core module tests.
  • tests/core/test_corpus.py, test_hf_utils.py, test_init_utils.py, test_json_functions.py, test_scoring.py, test_selection.py, test_tokenization_extended.py — Extended unit tests for core modules.
  • tests/high_level/test_high_level.py — 121 lines testing the high-level API.
  • tests/high_level/test_terminal.py — 647 lines testing the CLI terminal commands.
  • Test data files: tests/data/dummy.csv, dummy.jsonl, dummy.txt.

New Examples

  • examples/mcp/create_index.py — Create a test index for the MCP server.
  • examples/mcp/verify_server.py — Verify MCP server functionality.
  • examples/simple_load.py — Demonstrate the high-level load/index/search workflow.

Documentation

  • README significantly expanded with sections on High-Level API, CLI usage, MCP server setup (including Claude Desktop integration), and updated disk usage benchmarks.
  • Updated project tagline from "powered by Scipy sparse matrices" to "powered by Numpy".

Stats

  • +3,722 lines added, -69 lines removed across 35 files
  • 15 commits since v0.2.14
