Added
- OCR table inlining into markdown content (#421): When `output_format = Markdown` and OCR detects tables, the markdown pipe tables are now inlined into `result.content` at their correct vertical positions instead of only appearing in `result.tables`. Adds `OcrTableBoundingBox` to `OcrTable` for spatial positioning. Sets `metadata.output_format = "markdown"` to signal pre-formatted content and skip re-conversion.
- OCR table bounding boxes: OCR-detected tables now include bounding box coordinates (pixel-level) computed from TSV word positions, propagated through all bindings as `Table.bounding_box`.
- OCR table test images: Added balance sheet and financial table test images from issue #421 for integration testing.
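As a rough illustration of the two OCR entries above, the sketch below computes a table bounding box as the union of word boxes and then interleaves tables with text blocks by their top coordinate. It is not the project's implementation: `BBox`, `TextBlock`, `Table`, `union_bbox`, and `inline_tables` are hypothetical names, and the only detail taken from Tesseract is that each TSV word row carries `left`, `top`, `width`, and `height` in pixels.

```rust
/// Hypothetical pixel-space bounding box (a stand-in for `OcrTableBoundingBox`).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BBox {
    pub left: u32,
    pub top: u32,
    pub right: u32,
    pub bottom: u32,
}

/// Hypothetical block of OCR text with the pixel row where it starts.
pub struct TextBlock {
    pub top: u32,
    pub text: String,
}

/// Hypothetical table: an already rendered markdown pipe table plus its box.
pub struct Table {
    pub markdown: String,
    pub bounding_box: BBox,
}

/// Union of word boxes given as (left, top, width, height) tuples.
/// Returns `None` when there are no words.
pub fn union_bbox(words: &[(u32, u32, u32, u32)]) -> Option<BBox> {
    let mut iter = words.iter();
    let &(l, t, w, h) = iter.next()?;
    let mut bbox = BBox { left: l, top: t, right: l + w, bottom: t + h };
    for &(l, t, w, h) in iter {
        bbox.left = bbox.left.min(l);
        bbox.top = bbox.top.min(t);
        bbox.right = bbox.right.max(l + w);
        bbox.bottom = bbox.bottom.max(t + h);
    }
    Some(bbox)
}

/// Interleave text blocks and tables by their top coordinate and join them
/// with blank lines, so each table lands at its vertical position.
pub fn inline_tables(blocks: &[TextBlock], tables: &[Table]) -> String {
    let mut parts: Vec<(u32, &str)> =
        blocks.iter().map(|b| (b.top, b.text.as_str())).collect();
    parts.extend(tables.iter().map(|t| (t.bounding_box.top, t.markdown.as_str())));
    // The sort is stable, so items sharing a top value keep their input order.
    parts.sort_by_key(|(top, _)| *top);
    parts.iter().map(|(_, s)| *s).collect::<Vec<_>>().join("\n\n")
}
```

The sketch assumes text blocks that overlap a table region have already been dropped; a real pipeline would likely need that step to avoid emitting table cells twice.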
Fixed
- OCR `test_tsv_row_to_element` used wrong Tesseract level: Test specified `level: 4` (Line) but asserted `Word`. Fixed to `level: 5` (correct Tesseract word level).
- MSG recipients missing email addresses: The MSG extractor read `PR_DISPLAY_TO` which contains only display names (e.g. "John Jennings"), losing email addresses entirely. Now reads recipient substorages (`__recip_version1.0_#XXXXXXXX`) with `PR_EMAIL_ADDRESS` and `PR_RECIPIENT_TYPE` to produce full `"Name" <email>` output with correct To/CC/BCC separation.
- MSG date missing or incorrect: Date was parsed from `PR_TRANSPORT_MESSAGE_HEADERS` which is absent in many MSG files. Now reads `PR_CLIENT_SUBMIT_TIME` FILETIME directly from the MAPI properties stream, with fallback to transport headers.
- EML date mangled for non-standard formats: `mail_parser` parsed ISO 8601 dates (e.g. `2025-07-29T12:42:06.000Z`) into garbled output (`2000-00-20T00:00:00Z`) and replaced invalid dates with `2000-00-00T00:00:00Z`. Now extracts the raw `Date:` header text from the email bytes, preserving the original value.
- EML/MSG attachments line pollutes text output: `build_email_text_output()` appended an `Attachments: ...` line that doesn't represent message content. Removed from text output; attachment names remain in metadata.
- HTML script/style tags leak in email fallback: The regex-based HTML cleaner for email bodies used `.*?` which doesn't match across newlines, allowing multiline `<script>`/`<style>` content to leak into extracted text. Added `(?s)` flag for dotall matching.
- SVG CData content leaks JavaScript/CSS: `Event::CData` handler in the XML extractor didn't check SVG mode, causing `<script>` and `<style>` CDATA blocks to appear in SVG text output.
- RTF parser leaks metadata noise into text: The RTF extractor did not skip known destination groups (`fonttbl`, `stylesheet`, `colortbl`, `info`, `themedata`, etc.) or ignorable destinations (`{\*\...}`), causing ~17KB of font tables, color definitions, and internal metadata to appear in extracted text.
- RTF `\u` control word mishandled: Control words like `\ul` (underline) and `\uc1` were incorrectly interpreted as Unicode escapes (`\u` + numeric param), producing garbage characters instead of being treated as formatting commands.
- RTF paragraph breaks collapsed to spaces: `\par` control words emitted a single space instead of newlines, causing all paragraphs to merge into a single line. Now correctly emits double newlines for paragraph separation.
- RTF whitespace normalization destroys paragraph structure: `normalize_whitespace()` treated newlines as whitespace and collapsed them to spaces. Rewritten to preserve newlines while collapsing runs of spaces within lines.
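Two of the RTF fixes above are easy to picture in code. The following is a minimal sketch, not the extractor's actual source: `split_control_word` and `is_unicode_escape` are hypothetical helpers, and this `normalize_whitespace` only shares its name with the real function. It assumes a control word is passed without its leading backslash, and it also trims leading and trailing blanks on each line, which the changelog entry does not specify.

```rust
/// Split an RTF control word (text after the backslash) into its alphabetic
/// name and optional numeric parameter, e.g. "uc1" -> ("uc", Some(1)).
pub fn split_control_word(word: &str) -> (&str, Option<i32>) {
    // Control word names are ASCII letters, so the byte count is a valid
    // split index.
    let name_len = word.bytes().take_while(|b| b.is_ascii_alphabetic()).count();
    let (name, param) = word.split_at(name_len);
    (name, param.parse::<i32>().ok())
}

/// A Unicode escape is the control word named exactly "u" followed by a
/// numeric parameter. Words such as "ul" or "uc1" merely start with 'u'.
pub fn is_unicode_escape(word: &str) -> bool {
    matches!(split_control_word(word), ("u", Some(_)))
}

/// Collapse runs of spaces and tabs within each line while keeping every
/// newline, so paragraph breaks (blank lines) survive normalization.
pub fn normalize_whitespace(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for (i, line) in input.split('\n').enumerate() {
        if i > 0 {
            out.push('\n');
        }
        // split_whitespace skips leading/trailing blanks and inner runs.
        let mut words = line.split_whitespace();
        if let Some(first) = words.next() {
            out.push_str(first);
            for word in words {
                out.push(' ');
                out.push_str(word);
            }
        }
    }
    out
}
```

Matching on the full control word name rather than on a `u` prefix is what keeps `\ul` and `\uc1` out of the Unicode path, and splitting on `'\n'` before collapsing is what keeps the double newlines emitted for `\par` intact.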