- `--output-type auto` (the default) again produces PDF/A whenever it can,
matching OCRmyPDF 16's "PDF/A by default" behavior. It first tries the fast
Ghostscript-free conversion (validated by veraPDF when available) and now
falls back to Ghostscript when that cannot produce PDF/A, only emitting a
regular PDF when even Ghostscript cannot safely convert (for example, an
input with non-embedded CID/CJK fonts, per {issue}`1561`). A consequence is
that the default path may once again invoke Ghostscript, which is slower and
may transcode images; use `--output-type pdf` to skip PDF/A conversion
entirely.
- Fixed detection of veraPDF 1.30.0 and newer: recent builds print JVM
warnings before their version string, which caused OCRmyPDF to report
veraPDF as unavailable and skip the fast PDF/A path.
- OCRmyPDF no longer silently corrupts a non-embedded CID (CJK) text layer when
producing PDF/A ({issue}`1561`). PDF/A requires all fonts to be embedded, so
Ghostscript substitutes and re-embeds non-embedded CID fonts — such as the OCR
text layer Adobe Acrobat adds to scanned CJK documents — which mangles the
text and destroys searchability. OCRmyPDF now detects non-embedded CID fonts
before conversion: with `--output-type auto` (the default) it produces a
regular PDF and preserves the existing text layer, and with an explicit
`--output-type pdfa*` it stops with an error rather than emit corrupted
output. Use `--output-type pdf` to keep the text layer, or `--force-ocr` to
rebuild it with embedded fonts.
- Writing the output PDF to standard output (`ocrmypdf input.pdf -`) is now
protected against corruption at the operating system level. Previously
OCRmyPDF relied on no in-process code (third-party libraries, plugins, or
stray `print()` calls) ever writing to stdout; a single accidental write
would silently corrupt the PDF. The command line program now saves the real
stdout at startup, before plugins are loaded or any worker process/thread is
started, and redirects file descriptor 1 to stderr, so that only OCRmyPDF's
final PDF output can reach stdout. A consequence is that a plugin which
intentionally prints to stdout will have that output redirected to stderr.
- Added the public API function {func}`ocrmypdf.configure_stdout_protection`,
which installs this same protection. Like {func}`ocrmypdf.configure_logging`,
it is optional and intended for callers that want command-line-like behavior;
applications that manage their own standard output should not call it.
- Fixed an uncaught `UnicodeDecodeError` when processing a PDF whose
`/DocumentInfo` dictionary contains a `/Name` key encoded in Latin-1 (or
another non-UTF-8 encoding), such as `/Saks#e5r`. `repair_docinfo_nuls` now
treats such a block as malformed, logs a message, and continues instead of
crashing the pipeline ({issue}`1540`). Current pikepdf releases tolerate these
keys by surrogate-escaping them, but older versions raised while iterating the
dictionary.
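
The stdout protection described in the notes above (save the real stdout, then redirect file descriptor 1 to stderr) can be illustrated with a minimal Python sketch. This is not OCRmyPDF's actual implementation: the helper names `protect_stdout` and `write_pdf` are hypothetical, and the descriptors are parameters only so the idea can be tried on pipes without touching the real stdout.

```python
import os


def protect_stdout(stdout_fd: int = 1, stderr_fd: int = 2) -> int:
    """Hypothetical helper: save the real stdout, then point stdout_fd at stderr.

    Returns a duplicate descriptor that still refers to the original stdout.
    """
    # Keep a private handle on whatever stdout_fd currently refers to.
    saved_fd = os.dup(stdout_fd)
    # From now on, anything written to stdout_fd lands on stderr instead.
    os.dup2(stderr_fd, stdout_fd)
    return saved_fd


def write_pdf(saved_fd: int, data: bytes) -> None:
    """Hypothetical helper: write the final PDF bytes to the saved stdout."""
    view = memoryview(data)
    while view:
        # os.write may write fewer bytes than requested, so loop until done.
        written = os.write(saved_fd, view)
        view = view[written:]


# Intended use at program startup (assumption: called before any plugin
# or worker process/thread is started):
#
#     real_stdout = protect_stdout()
#     print("stray message")            # ends up on stderr
#     write_pdf(real_stdout, pdf_bytes)  # only this reaches the real stdout
```

Because the redirect happens at the file descriptor level, it also covers writes that bypass Python's `sys.stdout`, such as output from native libraries or child processes that inherit descriptor 1.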