github krrome/docling-hierarchical-pdf v0.1.1
Fix: Source DocumentStreams + Error handling

5 months ago

Makes it possible to pass the file path of the source file, or a stream, to ResultPostprocessor so that pymupdf can read it for metadata extraction.

If you run into the PDFFileNotFoundException, the source you passed to DocumentConverter().convert(source=source) was either of type str or of type DocumentStream, so the Docling conversion result unfortunately no longer holds a valid reference to the source file. Hence the postprocessor needs your help: if source was a string, pass source=source when instantiating ResultPostprocessor. Full example:

from docling.document_converter import DocumentConverter
from hierarchical.postprocessor import ResultPostprocessor

source = "my_file.pdf"  # document per local path or URL
converter = DocumentConverter()
result = converter.convert(source)
# the postprocessor modifies the result.document in place.
ResultPostprocessor(result, source=source).process()
# ...

If you used a DocumentStream object as source, you will unfortunately have to pass one of the following as the source argument to ResultPostprocessor: a valid Path to the PDF, or a new, open BytesIO stream or DocumentStream object. The reason is that docling closes the source stream when it is finished, so no further reading from that stream is possible.
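
The DocumentStream case could look roughly like the sketch below. It assumes the PDF bytes are still available in memory, so that a second, fresh BytesIO can be created for the postprocessor. The helper names fresh_stream and convert_and_postprocess are hypothetical and not part of either library. The docling imports sit inside the function so that the small helper can be tried without docling installed.

```python
from io import BytesIO


def fresh_stream(pdf_bytes: bytes) -> BytesIO:
    """Return a new, open in-memory stream over the same PDF bytes (hypothetical helper)."""
    return BytesIO(pdf_bytes)


def convert_and_postprocess(pdf_bytes: bytes, name: str = "my_file.pdf"):
    """Convert a PDF from memory, then post-process it with a second stream.

    Hypothetical wrapper; assumes docling and docling-hierarchical-pdf are installed.
    """
    from docling.datamodel.base_models import DocumentStream
    from docling.document_converter import DocumentConverter
    from hierarchical.postprocessor import ResultPostprocessor

    # First stream: handed to docling, which closes it once conversion is done.
    doc_stream = DocumentStream(name=name, stream=fresh_stream(pdf_bytes))
    result = DocumentConverter().convert(doc_stream)

    # Second stream: a new, open BytesIO over the same bytes for the postprocessor.
    # The postprocessor modifies result.document in place.
    ResultPostprocessor(result, source=fresh_stream(pdf_bytes)).process()
    return result.document
```

If the PDF also exists on disk, passing a Path to it as source avoids holding the bytes in memory twice.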
