ocrmypdf/OCRmyPDF v17.6.0 on GitHub

When the optimizer encounters an image it cannot process (for example, an
exotic colorspace that cannot be transcoded), it now logs a concise warning
that the image was left unchanged rather than printing an alarming
traceback. The output file was already valid in these cases; only the
reporting was misleading. The full traceback is still available at debug
verbosity (-v 1) ({issue}846).
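
The reporting change described above follows a common logging pattern, sketched here in Python. This is a minimal illustration, not OCRmyPDF's actual code: the function name `optimize_image`, the `transcode` callable, and the logger name are all hypothetical.

```python
import logging

log = logging.getLogger("optimizer_sketch")  # hypothetical logger name


def optimize_image(image_name, transcode):
    """Try to transcode one image; on failure, leave it unchanged.

    `transcode` is a hypothetical callable standing in for the real work.
    Returns True if the image was optimized, False if it was left as-is.
    """
    try:
        transcode(image_name)
    except Exception as exc:
        # One concise line at the normal log level...
        log.warning(
            "Image %s could not be optimized and was left unchanged: %s",
            image_name,
            exc,
        )
        # ...and the full traceback only when debug logging is enabled.
        log.debug("Traceback for %s", image_name, exc_info=True)
        return False
    return True
```

The failure is reported once at warning level, and the traceback is attached only to the debug record, so it appears only when debug logging is enabled.
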

--pdfa-image-compression=auto (the default) now selects lossless image
compression at -O0 so Ghostscript no longer transcodes lossless images to
JPEG during PDF/A generation. At -O1 and above, auto continues to defer
to Ghostscript's heuristic, which may recompress images lossily. -O1 (the
default level) is kept as a historical exception because coercing it to
lossless can substantially bloat output; users who want guaranteed lossless
image handling should pass --pdfa-image-compression=lossless or use -O0
({issue}1124).
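
The selection rule above can be summarized as a small decision function. This is only a sketch of the behavior as the note describes it; the helper name `resolve_pdfa_image_compression` is hypothetical and is not part of OCRmyPDF's API.

```python
def resolve_pdfa_image_compression(setting: str, optimize_level: int) -> str:
    """Return the effective compression mode for PDF/A generation.

    Hypothetical helper: `setting` is the value of
    --pdfa-image-compression and `optimize_level` is the -O level.
    """
    if setting != "auto":
        return setting  # an explicit choice is used as given
    if optimize_level == 0:
        return "lossless"  # -O0: avoid transcoding lossless images to JPEG
    return "auto"  # -O1 and above: defer to Ghostscript's heuristic
```

Under this reading, `resolve_pdfa_image_compression("auto", 0)` yields `"lossless"`, while the same setting at `-O1` stays `"auto"`.
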

--pdfa-image-compression=lossless now passes existing JPEG images through
unchanged rather than re-encoding them with a lossless codec. Re-encoding an
already-lossy JPEG losslessly cannot recover quality and only inflates the
file, so JPEGs are preserved while non-JPEG images are encoded losslessly.
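
As a rough illustration of the per-image rule, the sketch below decides what to do with an image from its PDF stream filter. The function name and return values are hypothetical; the only fact relied on is that `/DCTDecode` is the PDF filter name for JPEG data.

```python
def plan_lossless_encoding(pdf_filter: str) -> str:
    """Decide how to treat one image under 'lossless' mode (illustrative only)."""
    if pdf_filter == "/DCTDecode":  # DCTDecode marks JPEG-compressed image data
        return "keep"  # already lossy; re-encoding would only inflate the file
    return "encode-lossless"
```
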

OCRmyPDF now validates and repairs malformed page-boundary boxes
(/MediaBox, /CropBox, /TrimBox, /ArtBox, /BleedBox) in its
input, following the PDF 2.0 specification. Coordinates written in invalid
exponential notation are reinterpreted ({issue}1398); rectangles whose
corners are given in reversed order are normalized, which previously crashed
with NegativeDimensionError ({issue}1526); and a crop/trim/art/bleed box
that falls outside the MediaBox is clamped to their intersection, or discarded
when that intersection is empty, which previously produced an output with a
zero-height effective page that some viewers refused to open ({issue}1400).
When a box is discarded, clamped, or reinterpreted, OCRmyPDF logs a warning
recommending visual inspection of the output. Thanks @ajdlinux for the initial
fix in PR #1691.
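
The normalization and clamping steps can be sketched with plain tuples. This is an assumption-laden illustration rather than the code OCRmyPDF uses: the `Rect` alias and both function names are invented here, and the exponential-notation repair is not covered.

```python
from typing import Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def normalize(rect: Rect) -> Rect:
    """Reorder corners so the first is lower-left and the second upper-right."""
    x0, y0, x1, y1 = rect
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


def clamp_to_mediabox(box: Rect, mediabox: Rect) -> Optional[Rect]:
    """Intersect `box` with `mediabox`; return None if nothing remains."""
    bx0, by0, bx1, by1 = normalize(box)
    mx0, my0, mx1, my1 = normalize(mediabox)
    x0, y0 = max(bx0, mx0), max(by0, my0)
    x1, y1 = min(bx1, mx1), min(by1, my1)
    if x0 >= x1 or y0 >= y1:
        return None  # empty intersection: the caller would discard the box
    return (x0, y0, x1, y1)
```

A `None` result corresponds to the "discarded" case in the note: the page then falls back to its MediaBox instead of ending up with a zero-height effective page.
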

OCRmyPDF now discards an embedded Adobe full-text search index
(/Root/PieceInfo/SearchIndex) from its output. This proprietary index,
produced by Acrobat's "Embed Index" feature, is read only by Adobe Acrobat;
other viewers ignore it and search the text on the fly. Because any change to
a PDF invalidates the index, retaining it after OCRmyPDF rewrites the document
would leave a stale index that returns incorrect search results in Acrobat.
Modern viewers rebuild a search index on demand, so there is no loss of
search capability.

OCRmyPDF now discards embedded per-page thumbnail images (the optional
/Thumb image XObject on a page) from its output. OCRmyPDF alters page
appearance (deskew, clean, rasterize, re-render) and plugins may edit pages
arbitrarily, so a retained thumbnail would be stale and no longer match its
page. Embedded thumbnails are a navigation aid that modern viewers generate
on demand, so there is no loss of functionality.
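
Both removals (the search index and the page thumbnails) amount to deleting stale derived entries from the object tree. The sketch below models the document catalog and pages as plain Python dicts purely for illustration; it does not use a PDF library, and the function name `drop_stale_entries` is hypothetical.

```python
def drop_stale_entries(catalog: dict, pages: list) -> int:
    """Remove the embedded search index and page thumbnails from a dict model.

    `catalog` stands in for the document catalog (/Root) and `pages` for the
    page dictionaries. Both are plain dicts here, not real PDF objects.
    Returns the number of entries removed.
    """
    removed = 0
    piece_info = catalog.get("/PieceInfo", {})
    if piece_info.pop("/SearchIndex", None) is not None:
        removed += 1  # stale Acrobat full-text index
    for page in pages:
        if page.pop("/Thumb", None) is not None:
            removed += 1  # stale per-page thumbnail
    return removed
```

Other keys under `/PieceInfo` and on each page are left untouched in this model.
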

Fixed a regression in OCR quality for PDFs that paint a 1-bit image mask
(stencil) with a gray or colored fill color. Previously such pages were
rasterized as 1-bit black-and-white before OCR, so Ghostscript dithered
mid-tone text into an unreadable stipple and Tesseract failed to recognize
it. The rasterizer now inspects the fill color used to paint a mask and
promotes the page to grayscale or full color as needed, so mid-tone text
remains distinguishable for the OCR engine. This applies to both the Ghostscript and
pypdfium rasterizers. {issue}1688
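
One way to picture the promotion rule is a function that maps a mask's fill color to a raster mode. This is a simplified sketch under stated assumptions: the function name is invented, colors are RGB components in the range 0.0 to 1.0, and exact equality is used where a real implementation would likely apply a tolerance.

```python
def raster_mode_for_mask_fill(r: float, g: float, b: float) -> str:
    """Pick a raster mode for a stencil mask painted with fill color (r, g, b).

    Assumption: pure black or pure white fills are safe to render at 1 bit,
    while anything else needs more depth to avoid dithering.
    """
    if r == g == b:
        if r in (0.0, 1.0):
            return "mono"  # pure black or white survives 1-bit rendering
        return "gray"  # a mid-tone gray would be dithered at 1 bit
    return "color"  # unequal components mean a colored fill
```
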

The default 1-bit raster device for Ghostscript is now pngmonod
(error-diffusion) instead of pngmono (ordered dithering). It produces
better input for OCR on faint or anti-aliased scans at negligible cost and
with no change to output file size, since the rasterized image is an
intermediate that is discarded after OCR.

When rasterizing pages with Ghostscript, OCRmyPDF now enables text and
graphics anti-aliasing (-dTextAlphaBits=4 -dGraphicsAlphaBits=4) for the
grayscale and color raster devices. Ghostscript 10.x renders aliased glyphs
that OCR frequently misreads as extra word breaks or substituted characters;
anti-aliasing materially improves OCR accuracy on the Ghostscript
rasterization path, especially for small fonts at moderate resolution. The
1-bit monochrome devices are unaffected, since they perform their own
anti-aliased downscaling and older Ghostscript versions reject alpha-bit
options on them. Note that the default rasterizer (--rasterizer auto)
prefers pypdfium2, which already anti-aliases; this change benefits users who
select --rasterizer ghostscript or do not have pypdfium2 installed.
OCRmyPDF now also logs which rasterizer rendered each page at debug verbosity
(-v 1), and the --rasterizer help text explains the OCR-quality
trade-off, to make reports of poor OCR output easier to diagnose. {issue}1439
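
To make the device-dependent behavior concrete, the sketch below builds a Ghostscript argument list for one page and adds the alpha-bit options only for non-monochrome devices. The function name and parameters are hypothetical and this is not OCRmyPDF's actual invocation; the switches themselves (`-sDEVICE`, `-r`, `-dFirstPage`, `-dLastPage`, `-sOutputFile`, `-dTextAlphaBits`, `-dGraphicsAlphaBits`) are standard Ghostscript options.

```python
def ghostscript_raster_args(
    device: str, dpi: int, page: int, input_pdf: str, output_png: str
) -> list:
    """Build an argument list for rendering one PDF page with Ghostscript."""
    args = [
        "gs",
        "-dNOPAUSE",
        "-dBATCH",
        "-dSAFER",
        f"-sDEVICE={device}",
        f"-r{dpi}",
        f"-dFirstPage={page}",
        f"-dLastPage={page}",
        f"-sOutputFile={output_png}",
    ]
    # Alpha-bit anti-aliasing is added only for grayscale and color devices;
    # the 1-bit devices are left alone, as the release note describes.
    if device not in ("pngmono", "pngmonod"):
        args += ["-dTextAlphaBits=4", "-dGraphicsAlphaBits=4"]
    args.append(input_pdf)
    return args
```

The resulting list could be passed to `subprocess.run`; with `pnggray` or `png16m` the anti-aliasing switches are present, and with `pngmonod` they are omitted.
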

When Tesseract reports a page with many diacritics, OCRmyPDF still logs its
interpreted "lots of diacritics - possibly poor OCR" hint, but now also emits
Tesseract's raw message at debug verbosity (-v 1) so the original wording
is available for diagnosis. {issue}1566

Added --mode strip, which removes the invisible OCR text layer from a PDF
in place. Unlike --ocr-engine none --force-ocr, it does not rasterize the
page, so images and visible content are preserved unchanged and the output is
smaller rather than larger. Only text drawn as invisible (PDF text render mode
3) is removed; some OCR engines (and OCRmyPDF v2.2 and earlier) express
text as visible glyphs covered by an opaque image, and that text cannot be
removed this way. {issue}1435
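
The idea of removing only text drawn in render mode 3 can be sketched on a simplified content-stream model. This is not how OCRmyPDF implements `--mode strip`; it is an illustration that assumes the stream has already been parsed into `(operands, operator)` pairs, and the function name is hypothetical. It relies on two facts from the PDF specification: the `Tr` operator sets the text rendering mode (initially 0), and that mode is saved and restored by `q` and `Q`.

```python
SHOW_OPS = {"Tj", "TJ", "'", '"'}  # PDF text-showing operators


def strip_invisible_text(ops):
    """Drop text-showing operators executed while the render mode is 3.

    `ops` is a list of (operands, operator) pairs. Simplification: dropping
    the ' and " operators also drops their implicit line move, which a real
    implementation would need to preserve.
    """
    mode = 0  # initial text rendering mode is 0 (fill)
    stack = []  # render modes saved by q, restored by Q
    kept = []
    for operands, operator in ops:
        if operator == "q":
            stack.append(mode)
        elif operator == "Q" and stack:
            mode = stack.pop()
        elif operator == "Tr":
            mode = int(operands[0])
        if operator in SHOW_OPS and mode == 3:
            continue  # invisible text: remove it
        kept.append((operands, operator))
    return kept
```

Visible text (any mode other than 3) passes through unchanged, which matches the note's caveat that text drawn as visible glyphs under an opaque image cannot be removed this way.
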