github ocrmypdf/OCRmyPDF v17.6.0

7 hours ago
  • When the optimizer encounters an image it cannot process (for example, an
    exotic colorspace that cannot be transcoded), it now logs a concise warning
    that the image was left unchanged rather than printing an alarming
    traceback. The output file was already valid in these cases; only the
    reporting was misleading. The full traceback is still available at debug
    verbosity (-v 1) (issue #846).
  • --pdfa-image-compression=auto (the default) now selects lossless image
    compression at -O0 so Ghostscript no longer transcodes lossless images to
    JPEG during PDF/A generation. At -O1 and above, auto continues to defer
    to Ghostscript's heuristic, which may recompress images lossily. -O1 (the
    default level) is kept as a historical exception because coercing it to
    lossless can substantially bloat output; users who want guaranteed lossless
    image handling should pass --pdfa-image-compression=lossless or use -O0
    (issue #1124).
  • --pdfa-image-compression=lossless now passes existing JPEG images through
    unchanged rather than re-encoding them with a lossless codec. Re-encoding an
    already-lossy JPEG losslessly cannot recover quality and only inflates the
    file, so JPEGs are preserved while non-JPEG images are encoded losslessly.
  • OCRmyPDF now validates and repairs malformed page-boundary boxes
    (/MediaBox, /CropBox, /TrimBox, /ArtBox, /BleedBox) in its
    input, following the PDF 2.0 specification. Coordinates written in invalid
    exponential notation are reinterpreted (issue #1398); rectangles whose
    corners are given in reversed order are normalized, which previously crashed
    with NegativeDimensionError (issue #1526); and a crop/trim/art/bleed box
    that falls outside the MediaBox is clamped to their intersection, or discarded
    when that intersection is empty, which previously produced an output with a
    zero-height effective page that some viewers refused to open (issue #1400).
    When a box is discarded, clamped, or reinterpreted, OCRmyPDF logs a warning
    recommending visual inspection of the output. Thanks @ajdlinux for the initial
    fix in PR #1691.
  • OCRmyPDF now discards an embedded Adobe full-text search index
    (/Root/PieceInfo/SearchIndex) from its output. This proprietary index,
    produced by Acrobat's "Embed Index" feature, is read only by Adobe Acrobat;
    other viewers ignore it and search the text on the fly. Because any change to
    a PDF invalidates the index, retaining it after OCRmyPDF rewrites the document
    would leave a stale index that returns incorrect search results in Acrobat.
    Modern viewers rebuild a search index on demand, so there is no loss of
    search capability.
  • OCRmyPDF now discards embedded per-page thumbnail images (the optional
    /Thumb image XObject on a page) from its output. OCRmyPDF alters page
    appearance (deskew, clean, rasterize, re-render) and plugins may edit pages
    arbitrarily, so a retained thumbnail would be stale and no longer match its
    page. Embedded thumbnails are a navigation aid that modern viewers generate
    on demand, so there is no loss of functionality.
  • Fixed a regression in OCR quality for PDFs that paint a 1-bit image mask
    (stencil) with a gray or colored fill color. Previously such pages were
    rasterized as 1-bit black-and-white before OCR, so Ghostscript dithered
    mid-tone text into an unreadable stipple and Tesseract failed to recognize
    it. The rasterizer now inspects the fill color used to paint a mask and
    promotes the page to grayscale or full color as needed, so the tonal
    distinction between text and background is preserved for the OCR engine.
    This applies to both the Ghostscript and pypdfium rasterizers (issue #1688).
  • The default 1-bit raster device for Ghostscript is now pngmonod
    (error-diffusion) instead of pngmono (ordered dithering). It produces
    better input for OCR on faint or anti-aliased scans at negligible cost and
    no change to output file size, since the rasterized image is an
    intermediate that is discarded after OCR.
  • When rasterizing pages with Ghostscript, OCRmyPDF now enables text and
    graphics anti-aliasing (-dTextAlphaBits=4 -dGraphicsAlphaBits=4) for the
    grayscale and color raster devices. Ghostscript 10.x renders aliased glyphs
    that OCR frequently misreads as extra word breaks or substituted characters;
    anti-aliasing materially improves OCR accuracy on the Ghostscript
    rasterization path, especially for small fonts at moderate resolution. The
    1-bit monochrome devices are unaffected, since they perform their own
    anti-aliased downscaling and older Ghostscript versions reject alpha-bit
    options on them. Note that the default rasterizer (--rasterizer auto)
    prefers pypdfium2, which already anti-aliases; this change benefits users who
    select --rasterizer ghostscript or do not have pypdfium2 installed.
    OCRmyPDF now also logs which rasterizer rendered each page at debug verbosity
    (-v 1), and the --rasterizer help text explains the OCR-quality
    trade-off, to make reports of poor OCR quality easier to diagnose
    (issue #1439).
  • When Tesseract reports a page with many diacritics, OCRmyPDF still logs its
    interpreted "lots of diacritics - possibly poor OCR" hint, but now also emits
    Tesseract's raw message at debug verbosity (-v 1) so the original wording
    is available for diagnosis (issue #1566).
  • Added --mode strip, which removes the invisible OCR text layer from a PDF
    in place. Unlike --ocr-engine none --force-ocr, it does not rasterize the
    page, so images and visible content are preserved unchanged and the output is
    smaller rather than larger. Only text drawn as invisible (PDF text render mode
    3) is removed; some OCR engines (and OCRmyPDF v2.2 and earlier) express
    text as visible glyphs covered by an opaque image, and that text cannot be
    removed this way (issue #1435).
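
The --pdfa-image-compression rules described in the notes above can be read as a small decision table. The sketch below is only an illustration of that table, not OCRmyPDF's actual code: the function names and the returned labels ("defer-to-ghostscript", "passthrough", "encode-lossless", "ghostscript-decides") are hypothetical and chosen for readability.

```python
def resolve_image_compression(setting: str, optimize_level: int) -> str:
    """Map the user-facing setting to an effective policy (labels are hypothetical)."""
    if setting != "auto":
        # An explicit choice such as "lossless" is honored as given.
        return setting
    # auto: lossless at -O0, otherwise leave the decision to Ghostscript.
    return "lossless" if optimize_level == 0 else "defer-to-ghostscript"


def plan_for_image(policy: str, is_jpeg: bool) -> str:
    """Decide what happens to one image under the effective policy."""
    if policy == "lossless":
        # Existing JPEGs pass through unchanged; everything else is
        # encoded with a lossless codec.
        return "passthrough" if is_jpeg else "encode-lossless"
    return "ghostscript-decides"


if __name__ == "__main__":
    for level in (0, 1, 2):
        policy = resolve_image_compression("auto", level)
        print(f"-O{level}: auto resolves to {policy}")
```

Under these assumptions, only -O0 turns auto into lossless; an explicit lossless setting applies at every optimization level and still leaves JPEGs untouched.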

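The page-boundary repairs described in the notes above (reordering reversed corners, then clamping a box to the MediaBox or discarding it when the intersection is empty) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not OCRmyPDF's implementation: the Rect type and both helper functions are hypothetical, and the repair of invalid exponential notation is not covered.

```python
from typing import NamedTuple, Optional, Sequence


class Rect(NamedTuple):
    llx: float  # lower-left x
    lly: float  # lower-left y
    urx: float  # upper-right x
    ury: float  # upper-right y


def normalize(values: Sequence[float]) -> Rect:
    """Order the corners so (llx, lly) is lower-left and (urx, ury) is upper-right."""
    x0, y0, x1, y1 = (float(v) for v in values)
    return Rect(min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


def clamp_to_mediabox(box: Rect, mediabox: Rect) -> Optional[Rect]:
    """Return the intersection of box and mediabox, or None when it is empty."""
    clamped = Rect(
        max(box.llx, mediabox.llx),
        max(box.lly, mediabox.lly),
        min(box.urx, mediabox.urx),
        min(box.ury, mediabox.ury),
    )
    if clamped.llx >= clamped.urx or clamped.lly >= clamped.ury:
        return None  # empty intersection: a caller would discard the box
    return clamped


if __name__ == "__main__":
    mediabox = normalize([612, 792, 0, 0])      # corners given in reversed order
    cropbox = normalize([-10, 100, 300, 900])   # extends beyond the MediaBox
    print(mediabox)
    print(clamp_to_mediabox(cropbox, mediabox))
    print(clamp_to_mediabox(normalize([700, 0, 800, 100]), mediabox))
```

Normalizing before clamping matters here: the intersection test assumes ordered corners, so a reversed rectangle would otherwise look empty and be discarded by mistake.
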