New submodule for text processing!
It contains two classes that help you prepare text features for training:
- Tokenizer -- use this class to split text into tokens (lowercasing and punctuation removal are applied automatically)
- Dictionary -- use this class to create a dictionary that maps tokens to numeric identifiers. You can then use these identifiers as new features.
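
To illustrate the idea behind these two classes, here is a minimal pure-Python sketch. It does not use the new submodule: the function names tokenize, build_dictionary and apply_dictionary are hypothetical stand-ins, and details such as ASCII-only punctuation removal, id assignment by first appearance and dropping unknown tokens are assumptions made for this example.

```python
import string


def tokenize(text):
    """Lowercase the text, strip ASCII punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def build_dictionary(tokenized_texts):
    """Give each distinct token an integer id, in order of first appearance."""
    token_to_id = {}
    for tokens in tokenized_texts:
        for token in tokens:
            # len() is evaluated before insertion, so ids run 0, 1, 2, ...
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


def apply_dictionary(token_to_id, tokens):
    """Replace known tokens with their ids; unknown tokens are dropped."""
    return [token_to_id[t] for t in tokens if t in token_to_id]


texts = ["Cats are so cute :)", "Mouse scares me!"]
tokenized = [tokenize(t) for t in texts]
vocab = build_dictionary(tokenized)
ids = [apply_dictionary(vocab, tokens) for tokens in tokenized]
```

The resulting lists of identifiers in ids are the kind of numeric representation that can be passed on as new features.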
New features:
- Enabled boost_from_average for MAPE loss function
Bug fixes:
- Fixed Pool creation from pandas.DataFrame with discontinuous columns, #1079
- Fixed standalone_evaluator, PR #1083
Speedups:
- Huge speedup of preprocessing in the Python package for datasets with many samples (more than 10 million)
This release also includes precompiled packages for Python 3.8.