- Add additional arguments to the function that downloads and loads the
krebsregister data. The argument `missing_values` is used to fill missing
values. Default: nothing is done. The argument `shuffle` is used to shuffle
the records. Default is True.
- Remove the last traces of the old package name. The new package name is
'Python Record Linkage Toolkit'.
- Better error messages when only matches or only non-matches are passed to
train the classifier.
- Add AirSpeedVelocity tests to test the performance.
- Fixed Compare for deduplication, which was broken.
- Parameterized tests for the `Compare` class and its algorithms, making use
of the `nose-parameterized` module.
- Update documentation about contributing.
- Bugfix/improvement when blocking on multiple columns with missing values.
- Fix bug #29. Package not working with pandas 0.18 and 0.17. Dropped support
for pandas 0.17 and fixed support for 0.18. Also added multi-dependency tests
for TravisCI.
- Support for dedicated deduplication algorithms.
- Special algorithm for full index in case of finding duplicates. Performance
is 100x better.
- Function `max_number_of_pairs` to get the maximum number of pairs.
- `low_memory` for compare class.
- Improved performance in case of comparing a large number of record pairs.
- New documentation about custom algorithms
- New documentation about the use of classifiers.
- Possible to compare arrays and series directly without using labels.
- Make a dataframe with random comparison vectors with the
`binary_comparisons` in the `recordlinkage.datasets.random` module.
- Set KMeans cluster centers by hand.
- Various documentation updates and improvements.
- Jellyfish is now a required dependency. Fixes bug #30.
- Added `tox.ini` to test packaging and installation of package.
- Drop requirements.txt file.
- Many small fixes and changes. Most of the changes cover the `Compare`
module. Especially label handling is improved.
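
The "maximum number of pairs" and the full index for finding duplicates mentioned in the list above can be illustrated with a short sketch. It assumes that linking two datasets can pair every record on one side with every record on the other, and that deduplication pairs each record with every other record exactly once. The helper names are hypothetical and are not the toolkit's API, and the sketch does not reproduce the toolkit's own algorithm.

```python
from itertools import combinations


def max_pairs_linking(n_left, n_right):
    """Upper bound on candidate pairs when linking two datasets (assumed n * m)."""
    return n_left * n_right


def max_pairs_dedup(n_records):
    """Upper bound on candidate pairs within one dataset (assumed n * (n - 1) / 2)."""
    return n_records * (n_records - 1) // 2


def full_index_dedup(record_ids):
    """Return every unordered pair of distinct record ids exactly once."""
    return list(combinations(record_ids, 2))


# Hypothetical record ids, for illustration only.
ids = ["a", "b", "c", "d"]
pairs = full_index_dedup(ids)
```

Under these assumptions, a deduplication index over the same records holds fewer than half the pairs of a naive self-join, because a pair and its mirror image count once and no record is paired with itself.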