github daac-tools/vibrato v0.5.0


Main changes

  • Add a Wasm demo #115
  • Handle locale on the Wasm demo #119
  • Add bi-gram feature info generator for MeCab models #121
  • Embed a magic number into a model #129

Precompiled model files

We provide precompiled models for Vibrato, allowing you to get started with tokenization easily. You can download them from Assets in this release. The licenses are contained in each file.

All models were compiled and modified as described in compile.md and map.md. We trained the connection ID mappings using the CORE data in BCCWJ v1.1 (except the PN category).

Note that all models are compressed with zstd. The Vibrato CLIs accept the compressed files directly, but the Vibrato APIs do not decompress them, so you need to decompress the files yourself before passing them to the APIs (see README).
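As a rough sketch of how to tell whether a downloaded model file still needs decompressing, the snippet below checks for the standard zstd frame magic number (bytes 28 B5 2F FD). The helper names has_zstd_magic and is_zstd_compressed are hypothetical and are not part of Vibrato. This is the zstd format's own magic number, unrelated to the model magic number added in #129, and the check does not perform the decompression itself.

```python
# Standard zstd frame magic number 0xFD2FB528, stored little-endian.
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"


def has_zstd_magic(data: bytes) -> bool:
    """Return True if `data` begins with the standard zstd frame magic."""
    return data[:4] == ZSTD_MAGIC


def is_zstd_compressed(path: str) -> bool:
    """Hypothetical helper: read the first 4 bytes of a file and test them."""
    with open(path, "rb") as f:
        return has_zstd_magic(f.read(4))
```

If the check returns True, the file is presumably still compressed and should be decompressed (for example with the zstd command-line tool) before it is handed to the APIs.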

Models trained using Vibrato

The following three variants were trained using BCCWJ v1.1 (except the PN category) and UniDic v3.1.1.

  • bccwj-suw+unidic-cwj-3_1_1: A standard version.
  • bccwj-suw+unidic-cwj-3_1_1+compact: A smaller version that compresses the connection matrix as described in small-dic.md.
  • bccwj-suw+unidic-cwj-3_1_1-extracted+compact: An even smaller version that contains only POS and pronunciation features.

These models were trained with L1-regularization.

Models converted from publicly-available resources

Statistics for compressed UniDic models

The following table shows the sizes of the UniDic models in two versions, without and with +compact. The sizes are for the decompressed files, not the zstd-compressed ones.

Model                        Standard   Compact
bccwj-suw+unidic-cwj-3_1_1   618 MB     248 MB
unidic-cwj-3_1_1             717 MB     252 MB
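
To put the table in relative terms, this short sketch computes the size reduction from +compact. The numbers are copied from the table; the variable and function names are illustrative only.

```python
# Sizes in MB, copied from the table above (not zstd-compressed).
sizes_mb = {
    "bccwj-suw+unidic-cwj-3_1_1": (618, 248),
    "unidic-cwj-3_1_1": (717, 252),
}


def reduction_percent(standard: float, compact: float) -> float:
    """Percentage by which the compact model is smaller than the standard one."""
    return round((1 - compact / standard) * 100, 1)


reductions = {name: reduction_percent(s, c) for name, (s, c) in sizes_mb.items()}
for name, pct in reductions.items():
    print(f"{name}: {pct}% smaller with +compact")
```

Going by these figures, +compact reduces the size of each model by roughly 60 to 65 percent.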
