ocrmypdf/OCRmyPDF v17.8.0 on GitHub

--output-type auto (the default) again produces PDF/A whenever it can,
matching OCRmyPDF 16's "PDF/A by default" behavior. It first tries the fast
Ghostscript-free conversion (validated by veraPDF when available) and now
falls back to Ghostscript when that cannot produce PDF/A, only emitting a
regular PDF when even Ghostscript cannot safely convert (for example, an
input with non-embedded CID/CJK fonts, per {issue}1561). A consequence is
that the default path may once again invoke Ghostscript, which is slower and
may transcode images; use --output-type pdf to skip PDF/A conversion
entirely.
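
The fallback order described above can be pictured with a minimal sketch. This is not OCRmyPDF's implementation: `convert_auto`, `fast_path`, and `ghostscript_path` are hypothetical names, and each strategy is assumed to return an output path on success or `None` when it cannot safely produce PDF/A.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical converter type: returns the path of a PDF/A file on success,
# or None when this strategy cannot safely produce PDF/A.
Converter = Callable[[str], Optional[str]]


def convert_auto(input_pdf: str, strategies: Sequence[Converter]) -> Tuple[str, str]:
    """Try each PDF/A strategy in order; fall back to a regular PDF."""
    for strategy in strategies:
        result = strategy(input_pdf)
        if result is not None:
            return ("pdfa", result)
    # No strategy succeeded, so a regular PDF is emitted instead.
    return ("pdf", input_pdf)


# Stand-in strategies, for illustration only.
def fast_path(path: str) -> Optional[str]:
    return None  # pretend the Ghostscript-free conversion failed


def ghostscript_path(path: str) -> Optional[str]:
    return path.replace(".pdf", "_pdfa.pdf")


print(convert_auto("scan.pdf", [fast_path, ghostscript_path]))
```

Passing only `fast_path` in this sketch yields `("pdf", "scan.pdf")`, which corresponds to the case where no PDF/A conversion is possible.
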
Fixed detection of veraPDF 1.30.0 and newer: recent builds print JVM
warnings before their version string, which caused OCRmyPDF to report
veraPDF as unavailable and skip the fast PDF/A path.
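
To illustrate the kind of parsing that tolerates leading warnings, here is a small sketch. It is not the code OCRmyPDF uses: the function name `parse_verapdf_version`, the regular expression, and the sample output (including the warning text) are assumptions made for the example.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: "veraPDF" followed by a dotted three-part version.
_VERSION_RE = re.compile(r"veraPDF\s+v?(\d+)\.(\d+)\.(\d+)", re.IGNORECASE)


def parse_verapdf_version(output: str) -> Optional[Tuple[int, int, int]]:
    """Scan every line for a version, instead of trusting the first line."""
    for line in output.splitlines():
        match = _VERSION_RE.search(line)
        if match:
            major, minor, patch = (int(part) for part in match.groups())
            return (major, minor, patch)
    return None


# Made-up output: JVM warnings printed before the version string.
sample = (
    "WARNING: A restricted method in java.lang.System has been called\n"
    "WARNING: Use --enable-native-access=ALL-UNNAMED to avoid a warning\n"
    "veraPDF 1.30.0\n"
)
print(parse_verapdf_version(sample))
```

A parser that inspects only the first line of output would see a warning rather than the version in this sample, which is the failure mode described above.
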
OCRmyPDF no longer silently corrupts a non-embedded CID (CJK) text layer when
producing PDF/A ({issue}1561). PDF/A requires all fonts to be embedded, so
Ghostscript substitutes and re-embeds non-embedded CID fonts — such as the OCR
text layer Adobe Acrobat adds to scanned CJK documents — which mangles the
text and destroys searchability. OCRmyPDF now detects non-embedded CID fonts
before conversion: with --output-type auto (the default) it produces a
regular PDF and preserves the existing text layer, and with an explicit
--output-type pdfa* it stops with an error rather than emit corrupted
output. Use --output-type pdf to keep the text layer, or --force-ocr to
rebuild it with embedded fonts.
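
The check and the resulting decision can be sketched over plain dictionaries that mimic PDF font objects. This is a simplified model, not OCRmyPDF's code: the function names are hypothetical, `ValueError` stands in for whatever error the program actually raises, and a font is treated as embedded when its descriptor carries one of the standard `/FontFile`, `/FontFile2`, or `/FontFile3` streams.

```python
from typing import Any, Dict, List

FONT_FILE_KEYS = ("/FontFile", "/FontFile2", "/FontFile3")


def is_embedded(cid_font: Dict[str, Any]) -> bool:
    """A font program is embedded if the descriptor has a FontFile* entry."""
    descriptor = cid_font.get("/FontDescriptor", {})
    return any(key in descriptor for key in FONT_FILE_KEYS)


def has_nonembedded_cid_font(fonts: List[Dict[str, Any]]) -> bool:
    """Look inside composite (/Type0) fonts for CID fonts lacking a font file."""
    for font in fonts:
        if font.get("/Subtype") != "/Type0":
            continue
        for descendant in font.get("/DescendantFonts", []):
            if not is_embedded(descendant):
                return True
    return False


def resolve_output_type(requested: str, fonts: List[Dict[str, Any]]) -> str:
    """Hypothetical decision step mirroring the behavior described above."""
    if not has_nonembedded_cid_font(fonts):
        return requested
    if requested == "auto":
        return "pdf"  # keep the existing text layer untouched
    if requested.startswith("pdfa"):
        # Stand-in exception type; the real error class is not shown here.
        raise ValueError("non-embedded CID font: PDF/A would corrupt the text")
    return requested


nonembedded = [{
    "/Subtype": "/Type0",
    "/DescendantFonts": [{
        "/Subtype": "/CIDFontType2",
        "/FontDescriptor": {"/FontName": "/ExampleCJK"},
    }],
}]
print(resolve_output_type("auto", nonembedded))
```
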
Writing the output PDF to standard output (ocrmypdf input.pdf -) is now
protected against corruption at the operating system level. Previously
OCRmyPDF relied on no in-process code — third-party libraries, plugins, or
stray print() calls — ever writing to stdout; a single accidental write
would silently corrupt the PDF. The command line program now saves the real
stdout at startup, before plugins are loaded or any worker process/thread is
started, and redirects file descriptor 1 to stderr, so that only OCRmyPDF's
final PDF output can reach stdout. A consequence is that a plugin which
intentionally prints to stdout will have that output redirected to stderr.
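
The underlying technique, duplicating a file descriptor and then pointing the original at another target, can be sketched with `os.dup` and `os.dup2`. This is an illustration of the approach rather than OCRmyPDF's code; `save_and_redirect` and `protect_stdout` are hypothetical names.

```python
import os
import sys
from typing import BinaryIO


def save_and_redirect(fd: int, target_fd: int) -> int:
    """Duplicate fd, then make fd refer to whatever target_fd refers to.

    Returns the duplicate, which still points at the original destination.
    """
    saved = os.dup(fd)
    os.dup2(target_fd, fd)
    return saved


def protect_stdout() -> BinaryIO:
    """Hypothetical helper: reserve the real stdout, send fd 1 to stderr."""
    sys.stdout.flush()
    real_stdout_fd = save_and_redirect(1, 2)
    # Python-level prints now go to stderr as well.
    sys.stdout = sys.stderr
    # Only code holding this handle can write to the real stdout.
    return os.fdopen(real_stdout_fd, "wb")
```

Because the redirection happens at the descriptor level, it also covers writes from C extensions and child processes that inherit file descriptor 1, which replacing `sys.stdout` alone would not catch.
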
Added the public API function {func}ocrmypdf.configure_stdout_protection,
which installs this same protection. Like {func}ocrmypdf.configure_logging,
it is optional and intended for callers that want command-line-like behavior;
applications that manage their own standard output should not call it.
Fixed an uncaught UnicodeDecodeError when processing a PDF whose
/DocumentInfo dictionary contains a /Name key encoded in Latin-1 (or
another non-UTF-8 encoding), such as /Saks#e5r. repair_docinfo_nuls now
treats such a block as malformed, logs a message, and continues instead of
crashing the pipeline ({issue}1540). Current pikepdf releases tolerate these
keys by surrogate-escaping them, but older versions raised an exception while
iterating the dictionary.
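
To see why such a key fails to decode, the following sketch expands the `#xx` escape in the example name and decodes the result. It uses only Python's standard codecs and is not taken from pikepdf or OCRmyPDF; `unescape_pdf_name` and `decode_name` are hypothetical helpers.

```python
import re


def unescape_pdf_name(raw: bytes) -> bytes:
    """Expand #xx hex escapes in a PDF name token into raw bytes."""
    return re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        raw,
    )


def decode_name(raw: bytes) -> str:
    """Decode as UTF-8, surrogate-escaping bytes that are not valid UTF-8."""
    data = unescape_pdf_name(raw)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # 0xE5 is a Latin-1 letter, but in UTF-8 it starts a multi-byte
        # sequence that the following "r" does not complete.
        return data.decode("utf-8", errors="surrogateescape")


key = decode_name(b"/Saks#e5r")
# The original bytes can be recovered from the surrogate-escaped string.
original = key.encode("utf-8", errors="surrogateescape")
print(repr(key), original)
```
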