github ocrmypdf/OCRmyPDF v17.8.0

4 hours ago
  • --output-type auto (the default) again produces PDF/A whenever it can,
    matching OCRmyPDF 16's "PDF/A by default" behavior. It first tries the fast
    Ghostscript-free conversion (validated by veraPDF when available) and now
    falls back to Ghostscript when that cannot produce PDF/A, only emitting a
    regular PDF when even Ghostscript cannot safely convert (for example, an
    input with non-embedded CID/CJK fonts, per issue #1561). A consequence is
    that the default path may once again invoke Ghostscript, which is slower and
    may transcode images; use --output-type pdf to skip PDF/A conversion
    entirely.
  • Fixed detection of veraPDF 1.30.0 and newer: recent builds print JVM
    warnings before their version string, which caused OCRmyPDF to report
    veraPDF as unavailable and skip the fast PDF/A path.
  • OCRmyPDF no longer silently corrupts a non-embedded CID (CJK) text layer when
    producing PDF/A (issue #1561). PDF/A requires all fonts to be embedded, so
    Ghostscript substitutes and re-embeds non-embedded CID fonts — such as the OCR
    text layer Adobe Acrobat adds to scanned CJK documents — which mangles the
    text and destroys searchability. OCRmyPDF now detects non-embedded CID fonts
    before conversion: with --output-type auto (the default) it produces a
    regular PDF and preserves the existing text layer, and with an explicit
    --output-type pdfa* it stops with an error rather than emit corrupted
    output. Use --output-type pdf to keep the text layer, or --force-ocr to
    rebuild it with embedded fonts.
  • Writing the output PDF to standard output (ocrmypdf input.pdf -) is now
    protected against corruption at the operating system level. Previously
    OCRmyPDF relied on no in-process code — third-party libraries, plugins, or
    stray print() calls — ever writing to stdout; a single accidental write
    would silently corrupt the PDF. The command line program now saves the real
    stdout at startup, before plugins are loaded or any worker process/thread is
    started, and redirects file descriptor 1 to stderr, so that only OCRmyPDF's
    final PDF output can reach stdout. A consequence is that a plugin which
    intentionally prints to stdout will have that output redirected to stderr.
  • Added the public API function ocrmypdf.configure_stdout_protection(),
    which installs this same protection. Like ocrmypdf.configure_logging(),
    it is optional and intended for callers that want command-line-like behavior;
    applications that manage their own standard output should not call it.
  • Fixed an uncaught UnicodeDecodeError when processing a PDF whose
    /DocumentInfo dictionary contains a key whose name is encoded in Latin-1
    (or another non-UTF-8 encoding), such as /Saks#e5r. repair_docinfo_nuls now
    treats such a block as malformed, logs a message, and continues instead of
    crashing the pipeline (issue #1540). Current pikepdf releases tolerate these
    keys by surrogate-escaping them, but older versions raised while iterating the
    dictionary.
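
The veraPDF detection fix above comes down to finding a version string that is no longer on the first line of output. The following is only a sketch of that idea, not OCRmyPDF's actual code: the function name parse_verapdf_version, the assumed "veraPDF X.Y.Z" line format, and the sample JVM warning text are all illustrative assumptions.

```python
import re
from typing import Optional

# Assumption: the version line looks like "veraPDF 1.30.0".
# The real veraPDF output format may differ.
_VERSION_RE = re.compile(r"veraPDF\s+(\d+(?:\.\d+)+)", re.IGNORECASE)


def parse_verapdf_version(output: str) -> Optional[str]:
    """Return the first version number found on any line of output.

    Scanning every line, rather than only the first, tolerates
    warnings printed ahead of the version string.
    """
    for line in output.splitlines():
        match = _VERSION_RE.search(line)
        if match:
            return match.group(1)
    return None


# Hypothetical sample output: warnings first, version line last.
SAMPLE_OUTPUT = (
    "WARNING: A restricted method in java.lang.System has been called\n"
    "WARNING: Use --enable-native-access=ALL-UNNAMED to avoid a warning\n"
    "veraPDF 1.30.0\n"
)
```

With this approach, output that contains only warnings yields None, which a caller could treat as "veraPDF unavailable".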

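The stdout protection described above (save the real stdout, then point file descriptor 1 at stderr) can be sketched with os.dup and os.dup2. This is a minimal illustration under stated assumptions, not OCRmyPDF's implementation: the helper names protect_fd and write_protected are hypothetical, and the descriptors are parameters so the technique can be tried on pipes without touching the real stdout.

```python
import os
import sys

STDOUT_FD = 1
STDERR_FD = 2


def protect_fd(target_fd: int = STDOUT_FD, sink_fd: int = STDERR_FD) -> int:
    """Save target_fd, then redirect it to sink_fd.

    Returns a duplicate of the original target. After this call, any
    write to target_fd lands in the sink, and only code holding the
    returned descriptor can reach the original destination.
    """
    if target_fd == STDOUT_FD:
        # Flush buffered text first so it is not diverted to the sink later.
        sys.stdout.flush()
    saved_fd = os.dup(target_fd)   # private handle to the real destination
    os.dup2(sink_fd, target_fd)    # target now aliases the sink
    return saved_fd


def write_protected(saved_fd: int, data: bytes) -> None:
    """Write all of data to the saved descriptor, handling partial writes."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(saved_fd, view)
        view = view[written:]
```

In a command-line program, protect_fd() would be called once at startup, before plugins or worker processes exist, and the returned descriptor passed to write_protected() for the final output. Because the redirect happens at the descriptor level, it also covers child processes and native libraries that write to descriptor 1 directly, which replacing sys.stdout alone would not.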