What's Changed
Major New Features
- Significantly smaller file sizes
  - 54% smaller file sizes for English, 73% smaller for Chinese (see #806 for details)
  - This results in a ~50% decrease in runtime for first-time users (who do not yet have the data downloaded/cached)
- Significantly lower memory usage
  - Worker memory utilization in the web benchmark is reduced from 311 MB to 164 MB (47% reduction)
  - The lower memory footprint makes it feasible to use more workers, significantly improving performance for projects that utilize schedulers for parallel processing
- Compatible with iOS 17 (using default settings)
  - iOS 17 broke compatibility with Tesseract.js v4; upgrading to v5 should resolve the issue
    - See discussion section below for details
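As a rough illustration of the scheduler use case mentioned above, the sketch below spreads a batch of images across several workers. The function name recognizeAll and its workerCount parameter are made up for this example; the Tesseract module is passed in as an argument so the function does not depend on how the library is loaded.

```javascript
// Sketch: recognize several images in parallel with a pool of workers.
// `recognizeAll` and `workerCount` are illustrative names, not library API.
// `Tesseract` is the tesseract.js module, supplied by the caller.
async function recognizeAll(Tesseract, images, workerCount = 4, lang = "eng") {
  const scheduler = Tesseract.createScheduler();
  try {
    // Create the workers concurrently and register each with the scheduler.
    const workers = await Promise.all(
      Array.from({ length: workerCount }, () => Tesseract.createWorker(lang))
    );
    workers.forEach((worker) => scheduler.addWorker(worker));

    // Jobs are queued and handed to whichever worker is free.
    const results = await Promise.all(
      images.map((image) => scheduler.addJob("recognize", image))
    );
    return results.map((result) => result.data.text);
  } finally {
    // Terminates the scheduler together with the workers registered on it.
    await scheduler.terminate();
  }
}
```

In Node this could be called as recognizeAll(require("tesseract.js"), ["a.png", "b.png"]). The right worker count depends on available memory and CPU cores, so the default of 4 is only a placeholder.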
Breaking Changes Impacting Many Users
- createWorker arguments changed
  - Setting non-default language and OEM now happens in createWorker
    - E.g. createWorker("chi_sim", 1)
- worker.initialize and worker.loadLanguage functions now do nothing and can be deleted from code
  - Loading the language and initialization now occurs in createWorker
  - Workers can be re-initialized with different settings using worker.reinitialize
In other words, code should be modified from this:
const worker = await Tesseract.createWorker();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const ret = await worker.recognize(file);
To this:
const worker = await Tesseract.createWorker("eng");
const ret = await worker.recognize(file);
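Code that previously called worker.loadLanguage and worker.initialize a second time to switch languages can use worker.reinitialize instead. The sketch below is one way this might look; the function name and its image parameters are hypothetical, and the Tesseract module is passed in by the caller.

```javascript
// Sketch: reuse one worker for two languages via worker.reinitialize.
// `recognizeInTwoLanguages` and its parameters are illustrative names.
async function recognizeInTwoLanguages(Tesseract, englishImage, chineseImage) {
  // Language is now set when the worker is created.
  const worker = await Tesseract.createWorker("eng");
  try {
    const english = await worker.recognize(englishImage);

    // Switch the existing worker to Simplified Chinese instead of
    // creating a second worker.
    await worker.reinitialize("chi_sim");
    const chinese = await worker.recognize(chineseImage);

    return [english.data.text, chinese.data.text];
  } finally {
    await worker.terminate();
  }
}
```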
Breaking Changes Impacting Fewer Users
- Users who manually set corePath will need to update the contents of their corePath directory
  - corePath should point to a directory that contains all 4 of the files below from Tesseract.js-core v5:
    - tesseract-core.wasm.js
    - tesseract-core-simd.wasm.js
    - tesseract-core-lstm.wasm.js
    - tesseract-core-simd-lstm.wasm.js
  - Tesseract.js will automatically select the correct version to use
- worker.detect function disabled by default
  - Orientation + script detection is a function of the Legacy model only, which is no longer included by default
  - To enable, set arguments legacyCore: true and legacyLang: true in createWorker options
    - E.g. Tesseract.createWorker("eng", 1, {legacyCore: true, legacyLang: true});
- Language of progress logs standardized
  - This should only impact users who parse status logs (e.g. to update a loading bar)
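For the corePath change listed above, a small pre-deployment check can confirm that a self-hosted directory contains all four core files. This is a minimal Node sketch; missingCoreFiles and REQUIRED_CORE_FILES are names invented for this example, and only the four file names come from the list above.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// The four files a custom corePath directory should contain (Tesseract.js-core v5).
const REQUIRED_CORE_FILES = [
  "tesseract-core.wasm.js",
  "tesseract-core-simd.wasm.js",
  "tesseract-core-lstm.wasm.js",
  "tesseract-core-simd-lstm.wasm.js",
];

// Returns the names of required files that are absent from `corePathDir`.
// An empty array means the directory has everything listed above.
// `missingCoreFiles` is a hypothetical helper, not part of Tesseract.js.
function missingCoreFiles(corePathDir) {
  return REQUIRED_CORE_FILES.filter(
    (name) => !fs.existsSync(path.join(corePathDir, name))
  );
}
```

A build script could call missingCoreFiles on the directory it is about to publish and fail if the result is non-empty. The check only covers file presence, not whether the files actually come from Tesseract.js-core v5.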
Non-Breaking Changes
- Language data loaded from jsdelivr by default (rather than GitHub pages)
  - This should result in improved performance and uptime
- Separate "development" build (that produced tesseract.dev.js and worker.dev.js) removed
- Documentation and examples were modified to prevent new users from using Tesseract.recognize and Tesseract.detect
  - Users who already use these functions are encouraged to modify their code to use worker.recognize and worker.detect instead
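For code that currently calls Tesseract.recognize once per image, one possible migration is to create a single worker and reuse it for every image. The sketch below assumes that pattern; recognizeSequentially is a made-up name, and the Tesseract module is passed in by the caller.

```javascript
// Sketch: replace repeated Tesseract.recognize calls with one reusable worker.
// `recognizeSequentially` is an illustrative name, not library API.
async function recognizeSequentially(Tesseract, images, lang = "eng") {
  const worker = await Tesseract.createWorker(lang);
  const texts = [];
  try {
    // The same worker handles every image, one after another.
    for (const image of images) {
      const { data } = await worker.recognize(image);
      texts.push(data.text);
    }
  } finally {
    // Release the worker once all images are processed (or on error).
    await worker.terminate();
  }
  return texts;
}
```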
Considering upgrading from v2 to v5? See #771 for a full guide for updating.
Full Changelog: v4.1.3...v5.0.0