Features
- #6150: Add not_supported_language_list to component to be able to define languages that a component can NOT handle. WhitespaceTokenizer is not able to process languages whose words are not separated by whitespace. WhitespaceTokenizer will throw an error if it is used with Chinese, Japanese, or Thai.
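
The language check described in this entry could look roughly like the sketch below. It is a minimal illustration, not the actual implementation: only the attribute name not_supported_language_list comes from this changelog. The class names, the can_handle_language helper, the error type, and the use of the language codes "zh", "ja", and "th" are assumptions made for the example.

```python
from typing import List, Optional


class UnsupportedLanguageError(Exception):
    """Hypothetical error type used only in this sketch."""


class SketchComponent:
    # Languages this component can NOT handle; None means "no restriction".
    not_supported_language_list: Optional[List[str]] = None

    @classmethod
    def can_handle_language(cls, language: str) -> bool:
        """Return False if the language is explicitly listed as unsupported."""
        if cls.not_supported_language_list is None:
            return True
        return language not in cls.not_supported_language_list


class SketchWhitespaceTokenizer(SketchComponent):
    # Assumed language codes for Chinese, Japanese, and Thai.
    not_supported_language_list = ["zh", "ja", "th"]


def validate_language(component: type, language: str) -> None:
    """Raise an error if the component cannot process the given language."""
    if not component.can_handle_language(language):
        raise UnsupportedLanguageError(
            f"{component.__name__} does not support language '{language}'."
        )


validate_language(SketchWhitespaceTokenizer, "en")  # passes silently
```

Declaring the restriction as a class attribute keeps the check independent of any component instance, so a pipeline can be rejected before anything is trained.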
Bugfixes
- #6150: WhitespaceTokenizer only removes an emoji if the complete token matches the emoji regex.
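
As a rough illustration of the fixed behavior, the sketch below drops a token only when the whole token is an emoji sequence and leaves tokens that merely contain an emoji untouched. The pattern, the function names, and the plain split-on-whitespace tokenization are assumptions for this example. The emoji regex used by WhitespaceTokenizer is not given in this changelog and may cover different ranges.

```python
import re
from typing import List

# Simplified emoji pattern (an assumption, covering a few common ranges only).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicator symbols (flags)
    "]+"
)


def remove_emoji(token: str) -> str:
    """Return an empty string only if the complete token is an emoji sequence."""
    if EMOJI_PATTERN.fullmatch(token) is not None:
        return ""
    return token


def tokenize(text: str) -> List[str]:
    """Split on whitespace and drop tokens that consist solely of emoji."""
    tokens = [remove_emoji(token) for token in text.split()]
    return [token for token in tokens if token]
```

Using fullmatch rather than search or sub is the point of the sketch: a token such as a word with a trailing emoji does not match in full, so it is kept unchanged instead of being partially stripped.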