New submodule for text processing!
It contains two classes to help you make text features ready for training:
- Tokenizer -- use this class to split text into tokens (with automatic lowercasing and punctuation removal)
- Dictionary -- with this class you create a dictionary which maps tokens to numeric identifiers. You then use these identifiers as new features.
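
To illustrate the idea behind these two steps, here is a minimal plain-Python sketch. It does not use the submodule's actual API: the helpers `tokenize` and `build_dictionary` are hypothetical stand-ins, and the real classes may differ in their options and in how they assign identifiers.

```python
import string


def tokenize(text):
    """Hypothetical helper: lowercase, drop ASCII punctuation, split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def build_dictionary(token_lists):
    """Hypothetical helper: give each distinct token an integer id,
    numbered in order of first appearance."""
    token_to_id = {}
    for tokens in token_lists:
        for token in tokens:
            # setdefault leaves existing ids untouched and assigns the next free id otherwise
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


docs = ["Cats love boxes.", "Boxes, boxes everywhere!"]

# Step 1: split each document into lowercase tokens without punctuation
tokenized = [tokenize(doc) for doc in docs]

# Step 2: map tokens to numeric identifiers
token_to_id = build_dictionary(tokenized)

# The identifier sequences can then serve as new features
features = [[token_to_id[token] for token in tokens] for tokens in tokenized]
print(features)
```

The same dictionary built on the training texts would be reused on new texts, so that a given token always maps to the same identifier.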
pandas.DataFrame with discontinuous columns, #1079
standalone_evaluator, PR #1083
- Huge speedup of preprocessing in the Python package for datasets with many samples (more than 10 million)
We have also released precompiled packages for Python 3.8