What's Changed
Fix crash tokenizing with empty word_to_id by @mgraczyk in #72
Create nltk_stemmer.py by @aflip in #77
aa31a23: Improves handling of unknown tokens during tokenization and retrieval, tightens error handling, and improves logging for easier debugging.
bm25s/__init__.py:
Added a check in the get_scores_from_ids method that raises a ValueError if max_token_id exceeds the number of tokens in the index.
Improved handling of empty queries in the _get_top_k_results method, which now returns zero scores for all documents.
bm25s/tokenization.py:
Fixed streaming_tokenize so that it correctly adds new tokens and updates word_to_id, word_to_stem, and stem_to_sid.
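The behaviors described in this commit can be illustrated with a minimal, library-independent sketch. It is not the bm25s implementation: the function names `check_token_ids`, `score_query`, and `add_tokens`, the list-of-dicts index layout, and the toy stemmer are all hypothetical. The sketch also assumes 0-based token IDs, so an ID equal to the vocabulary size is already out of range, and it assumes the three dictionaries map word to ID, word to stem, and stem to stem ID, as their names suggest.

```python
def check_token_ids(query_token_ids, vocab_size):
    """Raise ValueError if a query token ID is not present in the index."""
    if not query_token_ids:
        return
    max_token_id = max(query_token_ids)
    # Assumption: IDs are 0-based, so valid IDs are 0 .. vocab_size - 1.
    if max_token_id >= vocab_size:
        raise ValueError(
            f"max token ID {max_token_id} is out of range for an index "
            f"with {vocab_size} tokens"
        )


def score_query(query_token_ids, token_scores, num_docs):
    """Sum per-token document scores; an empty query scores zero everywhere.

    token_scores is a hypothetical layout: token_scores[token_id] is a dict
    mapping document index to that token's score in the document.
    """
    scores = [0.0] * num_docs
    if not query_token_ids:
        return scores  # empty query: zero score for every document
    check_token_ids(query_token_ids, len(token_scores))
    for token_id in query_token_ids:
        for doc_index, score in token_scores[token_id].items():
            scores[doc_index] += score
    return scores


def add_tokens(words, stemmer, word_to_id, word_to_stem, stem_to_sid):
    """Map words to stem IDs, growing all three vocabularies in step."""
    stem_ids = []
    for word in words:
        if word not in word_to_id:
            stem = stemmer(word)
            if stem not in stem_to_sid:
                stem_to_sid[stem] = len(stem_to_sid)
            word_to_stem[word] = stem
            word_to_id[word] = len(word_to_id)
        stem_ids.append(stem_to_sid[word_to_stem[word]])
    return stem_ids


def toy_stemmer(word):
    # Stand-in for a real stemmer: strips trailing "s" characters only.
    return word.rstrip("s")


TOKEN_SCORES = [{0: 1.5}, {1: 0.5, 2: 2.0}]  # two tokens, three documents
```

For example, `score_query([], TOKEN_SCORES, 3)` returns a zero score for each of the three documents, while `score_query([2], TOKEN_SCORES, 3)` raises a ValueError because only token IDs 0 and 1 exist. Calling `add_tokens` repeatedly with the same dictionaries reuses existing stem IDs and assigns new ones only to unseen stems, which is the kind of consistency the streaming_tokenize fix is described as restoring.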
New Contributors
Full Changelog: 0.2.3...0.2.4