github icereed/paperless-gpt v0.16.0
New Providers + Enhanced Processing

4 months ago

🌟 New Features

Enhanced PDF Processing (PR #353)

  • Smart Text Layer Integration

    • Creates searchable PDFs by embedding OCR text layers
    • Maintains original document appearance while adding invisible text layer
    • Preserves accurate text positioning using hOCR data
    • Works with Google Document AI's hOCR output
  • Paperless-ngx Integration

    • Automatic metadata preservation (tags, correspondent, created date)
    • Smart document replacement workflow
    • Processing status tracking via document tagging
    • Skip mechanism for already processed documents
  • Safety Features

    • Page count validation to prevent incomplete processing
    • Optional local file backup for verification
    • Configurable document replacement
    • Comprehensive error handling and logging
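
The skip mechanism, page count validation, and configurable replacement listed above can be read as a small decision procedure. The sketch below is illustrative only and is not taken from the paperless-gpt source: `OcrResult`, `decide_action`, and the returned action names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OcrResult:
    """Hypothetical summary of one OCR run; not a paperless-gpt type."""
    original_pages: int   # page count of the source document
    processed_pages: int  # page count of the generated searchable PDF


def decide_action(tags, result, complete_tag, replace_enabled):
    """Return 'skip', 'reject', 'replace', or 'upload' for one document."""
    # Skip mechanism: a document carrying the completion tag was already processed.
    if complete_tag in tags:
        return "skip"
    # Page count validation: a mismatch suggests incomplete processing.
    if result.processed_pages != result.original_pages:
        return "reject"
    # Configurable replacement: only replace the original when enabled.
    return "replace" if replace_enabled else "upload"
```

Checking the completion tag first means an already processed document never reaches the OCR comparison, and a page count mismatch blocks both upload and replacement.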

Cloud Provider Integrations

Azure OpenAI Support

  • Added native support for Azure OpenAI Service
  • New configuration options: OPENAI_API_TYPE and OPENAI_BASE_URL
  • Improved validation for Azure-specific environment variables

Docling Server Integration

  • Added Docling Server as a new OCR provider
  • Self-hosted OCR capabilities for enhanced privacy
  • Support for multiple OCR engines

Container Registry Support

  • Added GitHub Container Registry (GHCR) as an alternative image source
  • Multi-architecture support for both Docker Hub and GHCR

🔧 Configuration

New PDF Processing Variables:

CREATE_LOCAL_HOCR: Save hOCR files locally
LOCAL_HOCR_PATH: Directory for hOCR files
CREATE_LOCAL_PDF: Save PDFs locally
LOCAL_PDF_PATH: Directory for PDFs
PDF_UPLOAD: Enable paperless-ngx uploads
PDF_REPLACE: Control document replacement
PDF_COPY_METADATA: Enable metadata copying
PDF_OCR_TAGGING: Enable process tracking
PDF_OCR_COMPLETE_TAG: Tag for processed documents
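
As a rough illustration of how these variables fit together, the sketch below reads them from an environment mapping into one settings object. It rests on assumptions the release notes do not confirm: that the on/off variables take the string `true`, and that unset values fall back to an empty string. `PdfSettings` and `load_pdf_settings` are hypothetical names, not part of paperless-gpt.

```python
import os
from dataclasses import dataclass


def _flag(env, name):
    # Assumption: a flag is enabled by the string "true" (case-insensitive).
    return env.get(name, "").strip().lower() == "true"


@dataclass
class PdfSettings:
    """Hypothetical container for the PDF processing variables."""
    create_local_hocr: bool
    local_hocr_path: str
    create_local_pdf: bool
    local_pdf_path: str
    upload: bool
    replace: bool
    copy_metadata: bool
    ocr_tagging: bool
    ocr_complete_tag: str


def load_pdf_settings(env=None):
    """Build PdfSettings from a mapping; defaults to the process environment."""
    env = os.environ if env is None else env
    return PdfSettings(
        create_local_hocr=_flag(env, "CREATE_LOCAL_HOCR"),
        local_hocr_path=env.get("LOCAL_HOCR_PATH", ""),
        create_local_pdf=_flag(env, "CREATE_LOCAL_PDF"),
        local_pdf_path=env.get("LOCAL_PDF_PATH", ""),
        upload=_flag(env, "PDF_UPLOAD"),
        replace=_flag(env, "PDF_REPLACE"),
        copy_metadata=_flag(env, "PDF_COPY_METADATA"),
        ocr_tagging=_flag(env, "PDF_OCR_TAGGING"),
        ocr_complete_tag=env.get("PDF_OCR_COMPLETE_TAG", ""),
    )
```

Passing a plain dictionary instead of the real environment makes the sketch easy to try, for example `load_pdf_settings({"PDF_UPLOAD": "true"})`.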

Azure OpenAI Variables:

OPENAI_API_TYPE=azure
OPENAI_BASE_URL=https://<your-azure-openai-endpoint>.openai.azure.com/
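
The release mentions improved validation for the Azure-specific variables without describing the rules. The sketch below shows one plausible check and is not the project's actual validation: it assumes that `OPENAI_BASE_URL` becomes mandatory once `OPENAI_API_TYPE` is `azure` and that it must be an absolute HTTPS URL. `validate_openai_env` is a hypothetical helper.

```python
from urllib.parse import urlparse


def validate_openai_env(env):
    """Return a list of problems; an empty list means the settings look consistent."""
    problems = []
    # Only the Azure API type triggers the extra checks in this sketch.
    if env.get("OPENAI_API_TYPE", "").lower() != "azure":
        return problems
    base_url = env.get("OPENAI_BASE_URL", "")
    if not base_url:
        problems.append("OPENAI_BASE_URL is required when OPENAI_API_TYPE=azure")
        return problems
    parsed = urlparse(base_url)
    # Assumption: the endpoint must be an absolute https URL with a host.
    if parsed.scheme != "https" or not parsed.netloc:
        problems.append("OPENAI_BASE_URL must be an absolute https URL")
    return problems
```

Returning a list rather than raising on the first issue lets a caller report every misconfiguration at startup in one pass.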

For detailed setup instructions, please refer to the updated documentation.

Contributors

A big shoutout to @signorecello for the Docling integration and @gardar for the advanced hOCR support.
