🌟 New Features
Enhanced PDF Processing (PR #353)
-
Smart Text Layer Integration
- Creates searchable PDFs by embedding OCR text layers
- Maintains original document appearance while adding invisible text layer
- Preserves accurate text positioning using hOCR data
- Works with Google Document AI's hOCR output
-
Paperless-ngx Integration
- Automatic metadata preservation (tags, correspondent, created date)
- Smart document replacement workflow
- Processing status tracking via document tagging
- Skip mechanism for already processed documents
-
Safety Features
- Page count validation to prevent incomplete processing
- Optional local file backup for verification
- Configurable document replacement
- Comprehensive error handling and logging
Cloud Provider Integrations
Azure OpenAI Support
- Added native support for Azure OpenAI Service
- New configuration options:
OPENAI_API_TYPE
andOPENAI_BASE_URL
- Improved validation for Azure-specific environment variables
Docling Server Integration
- Added Docling Server as a new OCR provider
- Self-hosted OCR capabilities for enhanced privacy
- Support for multiple OCR engines
Container Registry Support
- Added GitHub Container Registry (GHCR) as an alternative image source
- Multi-architecture support for both Docker Hub and GHCR
🔧 Configuration
New PDF Processing Variables:
CREATE_LOCAL_HOCR: Save hOCR files locally
LOCAL_HOCR_PATH: Directory for hOCR files
CREATE_LOCAL_PDF: Save PDFs locally
LOCAL_PDF_PATH: Directory for PDFs
PDF_UPLOAD: Enable paperless-ngx uploads
PDF_REPLACE: Control document replacement
PDF_COPY_METADATA: Enable metadata copying
PDF_OCR_TAGGING: Enable process tracking
PDF_OCR_COMPLETE_TAG: Tag for processed documents
Azure OpenAI Variables:
OPENAI_API_TYPE=azure
OPENAI_BASE_URL=https://<your-azure-openai-endpoint>.openai.azure.com/
For detailed setup instructions, please refer to the updated documentation.
Contributors
A big shoutout to @signorecello for the docling integration and @gardar for the advanced hOCR support.