Breaking Changes
- Scipy is no longer a required dependency. `scipy` has been removed from `install_requires` in `setup.py`. The library now uses a pure NumPy-based CSC matrix builder by default. If you need scipy's CSC builder, install it separately and pass `csc_backend="scipy"` to `BM25()`, or install via `pip install bm25s[indexing]`.
- Import change: `from bm25s import selection` is now `from bm25s import selection as selection_np` internally. If you were importing `selection` directly from `bm25s`, update your imports.
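For code that previously assumed scipy was installed alongside `bm25s`, one possible guard is sketched below. The helper `pick_csc_backend` is hypothetical and not part of `bm25s`; only the `csc_backend` values `"numpy"` and `"scipy"` come from these notes.

```python
import importlib.util


def pick_csc_backend(prefer_scipy: bool = True) -> str:
    """Return "scipy" only when scipy is importable, otherwise "numpy".

    Hypothetical helper for illustration; it is not part of bm25s.
    """
    scipy_installed = importlib.util.find_spec("scipy") is not None
    if prefer_scipy and scipy_installed:
        return "scipy"
    return "numpy"


backend = pick_csc_backend()
print(backend)

# With bm25s installed, the value can then be passed as described above:
#   import bm25s
#   retriever = bm25s.BM25(csc_backend=backend)
```

This keeps existing scripts working on a base install while still using scipy where it is already present.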
New Features
High-Level API (bm25s.high_level)
A new simplified 1-line indexing and 1-line search API:
import bm25s.high_level as bm25
corpus = bm25.load("documents.csv", document_column="text")
retriever = bm25.index(corpus)
results = retriever.search(["your query"], k=5)

- `bm25.load()` supports CSV, JSON, JSONL, and TXT files with automatic format detection.
- `bm25.index()` handles tokenization (with stemming + stopword removal) and indexing in one call.
- `BM25Search.search()` returns ranked results with document text, scores, and IDs.
- Handles empty queries gracefully by filtering them before retrieval.
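To make "automatic format detection" and empty-query filtering concrete, here is a minimal standard-library sketch. It is not the `bm25s.high_level` implementation: the function names `load_texts` and `drop_empty_queries`, the suffix-based detection, and the assumed JSON layout (a list of objects) are all illustrative assumptions.

```python
import csv
import json
import tempfile
from pathlib import Path


def load_texts(path, document_column="text"):
    """Load document strings, choosing a parser from the file suffix."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as f:
            return [row[document_column] for row in csv.DictReader(f)]
    if suffix == ".jsonl":
        # One JSON object per non-blank line.
        with path.open(encoding="utf-8") as f:
            return [json.loads(line)[document_column] for line in f if line.strip()]
    if suffix == ".json":
        # Assumed layout: a top-level list of objects.
        with path.open(encoding="utf-8") as f:
            return [record[document_column] for record in json.load(f)]
    if suffix == ".txt":
        # One document per non-blank line.
        with path.open(encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f if line.strip()]
    raise ValueError(f"Unsupported file type: {suffix!r}")


def drop_empty_queries(queries):
    """Remove queries that are empty or whitespace-only."""
    return [q for q in queries if q.strip()]


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "documents.csv"
    csv_path.write_text("id,text\n1,a cat\n2,a dog\n", encoding="utf-8")
    print(load_texts(csv_path))  # ['a cat', 'a dog']

print(drop_empty_queries(["your query", "", "   "]))  # ['your query']
```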
Command-Line Interface (bm25 CLI)
A new terminal CLI via the bm25 console script entry point:
- `bm25 index <file>`: Index documents from CSV, TXT, JSON, or JSONL files. Use `-o` to specify the output directory, `-c` to specify the text column, and `-u` to save to the user directory (`~/.bm25s/indices/`).
- `bm25 search -i <index> "query"`: Search an existing index. Use `-k` for top-k, `-s` to save results as JSON, and `-u` to use the user directory with an interactive index picker.
- Interactive index picker (requires `pip install bm25s[cli]` for the Rich-based UI; falls back to plain text).
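The command layout above can be mirrored with `argparse` as a rough sketch. This is not the actual `bm25` entry point: the destination names, the default for `-k`, and treating `-s` and `-u` as boolean switches are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented subcommands and short flags."""
    parser = argparse.ArgumentParser(prog="bm25")
    sub = parser.add_subparsers(dest="command", required=True)

    index = sub.add_parser("index", help="index documents from a file")
    index.add_argument("file")
    index.add_argument("-o", dest="output_dir", default=None, help="output directory")
    index.add_argument("-c", dest="column", default=None, help="text column")
    index.add_argument("-u", dest="user_dir", action="store_true",
                       help="save to the user directory")

    search = sub.add_parser("search", help="search an existing index")
    search.add_argument("query")
    search.add_argument("-i", dest="index", default=None, help="index to search")
    search.add_argument("-k", dest="top_k", type=int, default=10,  # assumed default
                        help="number of results")
    search.add_argument("-s", dest="save", action="store_true",  # assumed to be a switch
                        help="save results as JSON")
    search.add_argument("-u", dest="user_dir", action="store_true",
                        help="pick an index from the user directory")
    return parser


args = build_parser().parse_args(["search", "-i", "my_index", "-k", "5", "your query"])
print(args.command, args.index, args.top_k, args.query)  # search my_index 5 your query
```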
MCP Server (bm25s.mcp)
A built-in Model Context Protocol server to expose BM25 indices as tools for LLMs:
- `bm25 mcp launch --index-dir <path>`: Launch an MCP server with `retrieve` and `get_info` tools.
- Compatible with Claude Desktop and other MCP clients.
- Install with `pip install bm25s[mcp]`.
Numba Compilation & Auto-Compile
- New `compile()` method on `BM25` for explicit JIT compilation of both the scorer and CSC builder.
- New `auto_compile=True` parameter on `BM25.__init__()`: automatically compiles Numba JIT functions on initialization.
- New `warmup_numba_scorer()` and `warmup_numba_csc()` methods to pre-trigger JIT compilation with dummy data.
- New `activate_numba_csc()` method: applies Numba JIT to the CSC matrix builder for faster indexing.
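The compile-then-warm-up pattern behind these methods can be sketched as follows. `TinyScorer` and `bm25_term_score` are hypothetical names, not `bm25s` internals, and the way `NUMBA_DISABLE_JIT` is parsed here is an assumption. The sketch falls back to plain Python when Numba is not installed.

```python
import os

try:
    import numba
    NUMBA_AVAILABLE = True
except ImportError:
    numba = None
    NUMBA_AVAILABLE = False


def bm25_term_score(tf, doc_len, avg_doc_len, idf, k1, b):
    """Score one term in one document with the classic BM25 formula."""
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)


class TinyScorer:
    """Hypothetical scorer showing explicit and automatic compilation."""

    def __init__(self, auto_compile=False):
        self.score_fn = bm25_term_score
        self.compiled = False
        if auto_compile:
            self.compile()

    def compile(self):
        # Assumed parsing: any value other than "" or "0" disables JIT.
        jit_disabled = os.environ.get("NUMBA_DISABLE_JIT", "0") not in ("", "0")
        if NUMBA_AVAILABLE and not jit_disabled:
            self.score_fn = numba.njit(bm25_term_score)
            self.compiled = True
        self.warmup()

    def warmup(self):
        # The first call with dummy data pays the JIT cost up front,
        # so the first real query does not.
        self.score_fn(1.0, 10.0, 10.0, 1.0, 1.5, 0.75)


scorer = TinyScorer(auto_compile=True)
print(scorer.score_fn(1.0, 10.0, 10.0, 1.0, 1.5, 0.75))  # 1.0
```

Warming up at construction time trades a slower start for predictable latency on the first search.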
Pure NumPy CSC Matrix Construction
- New `csc_backend` parameter on `BM25()`: choose `"numpy"` (default), `"scipy"`, or `"auto"`.
- `_np_csc_python()`: Pure NumPy implementation using packed-index argsort.
- `_np_csc_jit_ready()`: Numba-compilable implementation using counting sort (linear time).
- Eliminates the scipy dependency for index construction.
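The two construction strategies named above can be illustrated with a small sketch that converts COO triples (row, column, value) into CSC arrays. The functions `csc_by_packed_argsort` and `csc_by_counting_sort` are hypothetical and are not the library's `_np_csc_python()` or `_np_csc_jit_ready()`; they only show the general techniques.

```python
import numpy as np


def csc_by_packed_argsort(rows, cols, data, n_rows, n_cols):
    """Sort entries by a single packed key (column-major), then slice by column."""
    rows = np.asarray(rows, dtype=np.int64)
    cols = np.asarray(cols, dtype=np.int64)
    data = np.asarray(data)
    packed = cols * n_rows + rows  # orders by column first, then by row
    order = np.argsort(packed, kind="stable")
    indptr = np.zeros(n_cols + 1, dtype=np.int64)
    indptr[1:] = np.cumsum(np.bincount(cols, minlength=n_cols))
    return data[order], rows[order], indptr


def csc_by_counting_sort(rows, cols, data, n_rows, n_cols):
    """Place each entry directly into its column bucket in linear time.

    n_rows is unused here; it is kept so both builders share a signature.
    Entries keep their input order within each column.
    """
    rows = np.asarray(rows, dtype=np.int64)
    cols = np.asarray(cols, dtype=np.int64)
    data = np.asarray(data)
    indptr = np.zeros(n_cols + 1, dtype=np.int64)
    for c in cols:  # count entries per column
        indptr[c + 1] += 1
    for c in range(n_cols):  # prefix sum turns counts into column offsets
        indptr[c + 1] += indptr[c]
    next_slot = indptr[:-1].copy()
    out_indices = np.empty(len(rows), dtype=np.int64)
    out_data = np.empty(len(data), dtype=data.dtype)
    for i in range(len(rows)):
        pos = next_slot[cols[i]]
        out_indices[pos] = rows[i]
        out_data[pos] = data[i]
        next_slot[cols[i]] += 1
    return out_data, out_indices, indptr


# Dense matrix being encoded:
# [[1, 0, 2],
#  [0, 3, 0],
#  [4, 0, 5]]
rows, cols, vals = [0, 0, 1, 2, 2], [0, 2, 1, 0, 2], [1.0, 2.0, 3.0, 4.0, 5.0]
d, idx, ptr = csc_by_packed_argsort(rows, cols, vals, 3, 3)
print(d.tolist(), idx.tolist(), ptr.tolist())
# [1.0, 4.0, 3.0, 2.0, 5.0] [0, 2, 1, 0, 2] [0, 2, 3, 5]
```

The counting-sort variant uses only plain loops over arrays, which is the style Numba compiles well, while the argsort variant stays vectorized for use without Numba.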
Parameter Overrides on Load
`BM25.load()` now accepts `override_params={}` and `**kwargs` to override saved parameters at load time (for example, `auto_compile` or `backend`).
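The override behavior can be pictured as a dictionary merge over the saved parameters. The sketch below is not the `BM25.load()` implementation: `load_params`, the `params.json` file name, the example values, and the precedence of keyword arguments over `override_params` are all assumptions.

```python
import json
import tempfile
from pathlib import Path


def load_params(path, override_params=None, **kwargs):
    """Read saved parameters, then apply overrides on top of them."""
    with open(path, encoding="utf-8") as f:
        params = json.load(f)
    params.update(override_params or {})
    params.update(kwargs)  # assumed: keyword arguments win over override_params
    return params


with tempfile.TemporaryDirectory() as tmp:
    saved = Path(tmp) / "params.json"  # hypothetical file name
    saved.write_text(
        json.dumps({"k1": 1.5, "b": 0.75, "backend": "numpy", "auto_compile": False}),
        encoding="utf-8",
    )
    params = load_params(saved, override_params={"backend": "numba"}, auto_compile=True)

print(params)
# {'k1': 1.5, 'b': 0.75, 'backend': 'numba', 'auto_compile': True}
```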
Improvements
- Reduced disk footprint: Base install is now ~51MB (down from ~479MB) since scipy is no longer required. Documented in the updated disk usage table in the README.
- `_faketqdm` fix: The fallback tqdm replacement now properly handles being called with no positional arguments (returns `None` instead of raising).
- Consistent tqdm disabling: All modules (`__init__`, `scoring`, `tokenization`, `hf`, `beir`, `corpus`) now respect the `DISABLE_TQDM` environment variable uniformly.
- dtype handling: `_compute_relevance_from_scores` now wraps `dtype` with `np.dtype()` for compatibility with Numba JIT.
- Numba availability checks: Replaced `selection_jit is None` checks with a proper `NUMBA_AVAILABLE` boolean flag.
- Numba JIT disable guard: `activate_numba_scorer()` now respects the `NUMBA_DISABLE_JIT` environment variable.
New Install Extras
| Extra | Packages | Purpose |
|---|---|---|
| `mcp` | `mcp` | MCP server support |
| `cli` | `rich` | Rich terminal UI for interactive index picker |
| `indexing` | `scipy` | scipy-based CSC matrix construction |
CI/CD
- New test jobs: Separate `test-numba` and `test-high-level` CI jobs with proper thread-safety env vars (`OMP_NUM_THREADS=1`, etc.).
- Coverage reporting: Core tests now run with `coverage` and report percentage.
- Branch triggers: CI now runs on `dev*` branches in addition to `main`.
- New workflows: Added `claude.yml` (Claude Code GitHub Action for issue/PR interaction) and `claude-code-review.yml` (automated PR code review).
New Test Coverage
- `tests/core/test_core_coverage.py`: 447 lines of comprehensive core module tests.
- `tests/core/test_corpus.py`, `test_hf_utils.py`, `test_init_utils.py`, `test_json_functions.py`, `test_scoring.py`, `test_selection.py`, `test_tokenization_extended.py`: Extended unit tests for core modules.
- `tests/high_level/test_high_level.py`: 121 lines testing the high-level API.
- `tests/high_level/test_terminal.py`: 647 lines testing the CLI terminal commands.
- Test data files: `tests/data/dummy.csv`, `dummy.jsonl`, `dummy.txt`.
New Examples
- `examples/mcp/create_index.py`: Create a test index for the MCP server.
- `examples/mcp/verify_server.py`: Verify MCP server functionality.
- `examples/simple_load.py`: Demonstrate the high-level load/index/search workflow.
Documentation
- README significantly expanded with sections on High-Level API, CLI usage, MCP server setup (including Claude Desktop integration), and updated disk usage benchmarks.
- Updated project tagline from "powered by Scipy sparse matrices" to "powered by Numpy".
Stats
- +3,722 lines added, -69 lines removed across 35 files
- 15 commits since v0.2.14