bm25s 0.3.0 on Python PyPI

Breaking Changes

SciPy is no longer a required dependency. scipy has been removed from install_requires in setup.py, and the library now uses a pure NumPy-based CSC matrix builder by default. If you need scipy's CSC builder, install scipy separately (or via pip install bm25s[indexing]) and pass csc_backend="scipy" to BM25().
Import change: the library's internal import from bm25s import selection is now from bm25s import selection as selection_np. If you were importing selection directly from bm25s, update your imports.
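
For code that previously relied on the scipy builder, a minimal migration sketch follows. The helper name csc_backend_kwargs is hypothetical and not part of bm25s; it only decides which csc_backend value to pass, assuming the "numpy" and "scipy" values listed in these notes.

```python
import importlib.util


def csc_backend_kwargs(prefer_scipy: bool = True) -> dict:
    """Return keyword arguments for BM25() that choose a CSC backend.

    Hypothetical helper, not part of bm25s. It assumes BM25() accepts
    csc_backend="numpy" or csc_backend="scipy" as described in these notes.
    """
    # find_spec checks whether scipy is installed without importing it.
    scipy_installed = importlib.util.find_spec("scipy") is not None
    if prefer_scipy and scipy_installed:
        return {"csc_backend": "scipy"}
    # Fall back to the default pure NumPy builder.
    return {"csc_backend": "numpy"}


# Intended usage (requires bm25s >= 0.3.0):
#   import bm25s
#   retriever = bm25s.BM25(**csc_backend_kwargs())
```

The notes also list an "auto" value for csc_backend, which may make a helper like this unnecessary.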

New Features

High-Level API (`bm25s.high_level`)

A new simplified API with one-line indexing and one-line search:

import bm25s.high_level as bm25

corpus = bm25.load("documents.csv", document_column="text")
retriever = bm25.index(corpus)
results = retriever.search(["your query"], k=5)

bm25.load() supports CSV, JSON, JSONL, and TXT files with automatic format detection.
bm25.index() handles tokenization (with stemming + stopword removal) and indexing in one call.
BM25Search.search() returns ranked results with document text, scores, and IDs.
Empty queries are handled gracefully: they are filtered out before retrieval.
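
The empty-query handling can be pictured with the sketch below. It is not the library's code: search_skipping_empty and the search_fn callback are hypothetical, and returning an empty hit list for each skipped query is an assumption about the desired behavior.

```python
from typing import Callable, List, Sequence


def search_skipping_empty(
    queries: Sequence[str],
    search_fn: Callable[[List[str]], List[list]],
) -> List[list]:
    """Call search_fn on non-empty queries only and keep results aligned.

    Hypothetical illustration: empty or whitespace-only queries get an
    empty hit list (an assumption, not confirmed library behavior).
    """
    # Remember the original position of every query that has content.
    kept = [(i, q) for i, q in enumerate(queries) if q and q.strip()]
    results: List[list] = [[] for _ in queries]
    if kept:
        found = search_fn([q for _, q in kept])
        # Put each hit list back at the position of its query.
        for (i, _), hits in zip(kept, found):
            results[i] = hits
    return results


# Stand-in for a real retriever: echoes each query in upper case.
def fake_search(batch: List[str]) -> List[list]:
    return [[q.upper()] for q in batch]


print(search_skipping_empty(["cats", "", "   ", "dogs"], fake_search))
```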

Command-Line Interface (`bm25` CLI)

A new terminal CLI via the bm25 console script entry point:

bm25 index <file> — Index documents from CSV, TXT, JSON, or JSONL files.
- Options: -o sets the output directory, -c sets the text column, and -u saves to the user directory (~/.bm25s/indices/).
bm25 search -i <index> "query" — Search an existing index.
- Options: -k sets top-k, -s saves results as JSON, and -u uses the user directory with an interactive index picker.
The interactive index picker uses a Rich-based UI when installed with pip install bm25s[cli] and falls back to plain text otherwise.
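
For scripting the CLI from Python, the commands above can be assembled as argument lists. This is a sketch only: index_command and search_command are hypothetical helpers, the file and directory names are placeholders, and the placement of options after the positional arguments is an assumption. Only flags named in these notes are used.

```python
import shlex
from typing import List, Optional


def index_command(path: str, output_dir: Optional[str] = None,
                  column: Optional[str] = None, user_dir: bool = False) -> List[str]:
    """Build an argument list for `bm25 index` (hypothetical helper)."""
    cmd = ["bm25", "index", path]
    if output_dir is not None:
        cmd += ["-o", output_dir]   # output directory
    if column is not None:
        cmd += ["-c", column]       # text column
    if user_dir:
        cmd.append("-u")            # save under ~/.bm25s/indices/
    return cmd


def search_command(index: str, query: str, k: Optional[int] = None) -> List[str]:
    """Build an argument list for `bm25 search` (hypothetical helper)."""
    cmd = ["bm25", "search", "-i", index, query]
    if k is not None:
        cmd += ["-k", str(k)]       # top-k results
    return cmd


# shlex.join renders a list as a shell-safe command line for display.
print(shlex.join(index_command("documents.csv", output_dir="my_index", column="text")))
print(shlex.join(search_command("my_index", "your query", k=5)))
```

Either list can be passed to subprocess.run(cmd, check=True) once the bm25 console script is installed.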

MCP Server (`bm25s.mcp`)

A built-in Model Context Protocol server to expose BM25 indices as tools for LLMs:

bm25 mcp launch --index-dir <path> — Launch an MCP server with retrieve and get_info tools.
Compatible with Claude Desktop and other MCP clients.
Install with pip install bm25s[mcp].
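
As an illustration of the Claude Desktop integration, the sketch below generates an mcpServers entry that runs the launch command above. The server name "bm25s" and the index path are placeholders, and the README may recommend a different entry; treat this as an assumption-laden example rather than the documented configuration.

```python
import json


def claude_desktop_entry(index_dir: str, server_name: str = "bm25s") -> dict:
    """Build an mcpServers entry that starts `bm25 mcp launch`.

    server_name is an arbitrary label chosen for this example.
    """
    return {
        "mcpServers": {
            server_name: {
                "command": "bm25",
                "args": ["mcp", "launch", "--index-dir", index_dir],
            }
        }
    }


# Print JSON suitable for merging into claude_desktop_config.json.
print(json.dumps(claude_desktop_entry("/path/to/index"), indent=2))
```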

Numba Compilation & Auto-Compile

New compile() method on BM25 for explicit JIT compilation of both the scorer and CSC builder.
New auto_compile=True parameter on BM25.__init__() — automatically compiles Numba JIT functions on initialization.
New warmup_numba_scorer() and warmup_numba_csc() methods to pre-trigger JIT compilation with dummy data.
New activate_numba_csc() method — applies Numba JIT to the CSC matrix builder for faster indexing.

Pure NumPy CSC Matrix Construction

New csc_backend parameter on BM25(): choose "numpy" (default), "scipy", or "auto".
_np_csc_python() — Pure NumPy implementation using packed-index argsort.
_np_csc_jit_ready() — Numba-compilable implementation using counting sort (linear time).
Eliminates the scipy dependency for index construction.
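
To make the two strategies concrete, here is a dependency-free sketch of COO-to-CSC conversion, once with counting sort and once with a packed-index sort. It is not the library's implementation: the function names are hypothetical, plain lists stand in for NumPy arrays, and sorted() stands in for argsort.

```python
from typing import List, Sequence, Tuple

CSC = Tuple[List[float], List[int], List[int]]  # (data, indices, indptr)


def coo_to_csc_counting_sort(rows: Sequence[int], cols: Sequence[int],
                             data: Sequence[float], n_cols: int) -> CSC:
    """Counting sort by column: O(nnz + n_cols), stable within a column."""
    nnz = len(data)
    # Pass 1: count the entries in each column.
    indptr = [0] * (n_cols + 1)
    for c in cols:
        indptr[c + 1] += 1
    # A prefix sum turns the counts into column start offsets.
    for j in range(n_cols):
        indptr[j + 1] += indptr[j]
    # Pass 2: scatter every entry into the next free slot of its column.
    next_slot = indptr[:-1]  # slicing copies, so indptr stays intact
    out_data = [0.0] * nnz
    out_indices = [0] * nnz
    for r, c, v in zip(rows, cols, data):
        pos = next_slot[c]
        out_data[pos] = v
        out_indices[pos] = r
        next_slot[c] += 1
    return out_data, out_indices, indptr


def coo_to_csc_packed_sort(rows: Sequence[int], cols: Sequence[int],
                           data: Sequence[float], n_rows: int, n_cols: int) -> CSC:
    """Sort on a single packed key col * n_rows + row: O(nnz log nnz)."""
    order = sorted(range(len(data)), key=lambda i: cols[i] * n_rows + rows[i])
    indptr = [0] * (n_cols + 1)
    for c in cols:
        indptr[c + 1] += 1
    for j in range(n_cols):
        indptr[j + 1] += indptr[j]
    return [data[i] for i in order], [rows[i] for i in order], indptr


# 3x3 example matrix with entries listed in row-major order.
rows = [0, 0, 1, 2, 2]
cols = [0, 2, 1, 0, 2]
vals = [1.0, 2.0, 3.0, 4.0, 5.0]
print(coo_to_csc_counting_sort(rows, cols, vals, n_cols=3))
# → ([1.0, 4.0, 3.0, 2.0, 5.0], [0, 2, 1, 0, 2], [0, 2, 3, 5])
```

The counting sort uses only simple loops over integer arrays, which is the style of code a JIT compiler such as Numba handles well. The packed-key variant reduces the problem to one sort, which maps naturally onto a vectorized argsort.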

Parameter Overrides on Load

BM25.load() now accepts override_params={} and **kwargs to override saved parameters at load time (e.g., auto_compile or backend).
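
One way to picture the override behavior is as a layered dictionary merge. The sketch below is hypothetical: resolve_params is not a bm25s function, the parameter values are made up, and the precedence shown (keyword arguments over override_params over saved values) is an assumption these notes do not confirm.

```python
from typing import Any, Dict, Optional


def resolve_params(saved: Dict[str, Any],
                   override_params: Optional[Dict[str, Any]] = None,
                   **kwargs: Any) -> Dict[str, Any]:
    """Merge saved parameters with load-time overrides (hypothetical).

    Assumed precedence: kwargs > override_params > saved.
    """
    merged = dict(saved)                   # copy so saved is not mutated
    merged.update(override_params or {})   # explicit override dictionary
    merged.update(kwargs)                  # keyword overrides applied last
    return merged


# Illustrative saved parameters, not read from a real index.
saved = {"k1": 1.5, "b": 0.75, "backend": "numpy", "auto_compile": False}
print(resolve_params(saved, override_params={"backend": "numba"}, auto_compile=True))
```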

Improvements

Reduced disk footprint: Base install is now ~51MB (down from ~479MB) since scipy is no longer required. This is documented in the updated disk usage table in the README.
_faketqdm fix: The fallback tqdm replacement now properly handles being called with no positional arguments (returns None instead of raising).
Consistent tqdm disabling: All modules (__init__, scoring, tokenization, hf, beir, corpus) now respect the DISABLE_TQDM environment variable uniformly.
dtype handling: _compute_relevance_from_scores now wraps dtype with np.dtype() for compatibility with Numba JIT.
Numba availability checks: Replaced selection_jit is None checks with a proper NUMBA_AVAILABLE boolean flag.
Numba JIT disable guard: activate_numba_scorer() now respects the NUMBA_DISABLE_JIT environment variable.
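
The Numba availability flag and the JIT disable guard can be sketched as follows. This illustrates the pattern and is not the library's code: jit_disabled and should_activate_numba are hypothetical names, and treating any non-zero integer value of NUMBA_DISABLE_JIT as "disabled" follows Numba's documented convention for that variable.

```python
import os
from typing import Mapping

# A module-level boolean replaces scattered "is None" checks.
try:
    import numba  # noqa: F401
    NUMBA_AVAILABLE = True
except ImportError:
    NUMBA_AVAILABLE = False


def jit_disabled(environ: Mapping[str, str] = os.environ) -> bool:
    """Return True when NUMBA_DISABLE_JIT is set to a non-zero integer."""
    value = environ.get("NUMBA_DISABLE_JIT", "0").strip()
    try:
        return int(value) != 0
    except ValueError:
        # Unparseable values are treated as "not disabled" in this sketch.
        return False


def should_activate_numba(environ: Mapping[str, str] = os.environ) -> bool:
    """Activate JIT only if numba is importable and JIT is not disabled."""
    return NUMBA_AVAILABLE and not jit_disabled(environ)


print(NUMBA_AVAILABLE, should_activate_numba())
```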

New Install Extras

Extra	Packages	Purpose
`mcp`	`mcp`	MCP server support
`cli`	`rich`	Rich terminal UI for interactive index picker
`indexing`	`scipy`	scipy-based CSC matrix construction

CI/CD

New test jobs: Separate test-numba and test-high-level CI jobs with proper thread-safety env vars (OMP_NUM_THREADS=1, etc.).
Coverage reporting: Core tests now run with coverage and report percentage.
Branch triggers: CI now runs on dev* branches in addition to main.
New workflows: Added claude.yml (Claude Code GitHub Action for issue/PR interaction) and claude-code-review.yml (automated PR code review).

New Test Coverage

tests/core/test_core_coverage.py — 447 lines of comprehensive core module tests.
tests/core/test_corpus.py, test_hf_utils.py, test_init_utils.py, test_json_functions.py, test_scoring.py, test_selection.py, test_tokenization_extended.py — Extended unit tests for core modules.
tests/high_level/test_high_level.py — 121 lines testing the high-level API.
tests/high_level/test_terminal.py — 647 lines testing the CLI terminal commands.
Test data files: tests/data/dummy.csv, dummy.jsonl, dummy.txt.

New Examples

examples/mcp/create_index.py — Create a test index for the MCP server.
examples/mcp/verify_server.py — Verify MCP server functionality.
examples/simple_load.py — Demonstrate the high-level load/index/search workflow.

Documentation

README significantly expanded with sections on High-Level API, CLI usage, MCP server setup (including Claude Desktop integration), and updated disk usage benchmarks.
Updated project tagline from "powered by Scipy sparse matrices" to "powered by Numpy".

Stats

+3,722 lines added, -69 lines removed across 35 files
15 commits since v0.2.14
