v1.2.0 - PDF Advanced Features Release
Date: October 23, 2025
This release significantly extends PDF extraction, adding OCR for scanned documents, support for password-protected files, table extraction, parallel page processing, and caching.
What's New
Enhanced PDF Support
- OCR for Scanned PDFs - Automatically extract text from scanned documents using Tesseract OCR
  - Intelligent fallback when text content is low (< 50 characters)
  - Works with pytesseract and Pillow
  - Command: --ocr
- Password-Protected PDFs - Handle encrypted PDF files securely
  - Clear error messages for authentication issues
  - Command: --password PASSWORD
- Table Extraction - Extract complex tables from PDF documents
  - Captures table data as structured 2D arrays
  - Includes metadata (bounding box, row/column counts)
  - Integrates seamlessly with skill references
  - Command: --extract-tables
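The OCR fallback listed above can be pictured as a length check on the text each page yields. The following is a minimal sketch, not the code in cli/pdf_extractor_poc.py: the function name `extract_with_ocr_fallback`, the `MIN_TEXT_CHARS` constant, and the injected `run_ocr` callable are hypothetical, and only the 50-character threshold comes from these notes.

```python
from typing import Callable, Tuple

# Threshold taken from the release notes: pages with fewer than
# 50 characters of extracted text are treated as likely scans.
MIN_TEXT_CHARS = 50


def extract_with_ocr_fallback(
    page_text: str,
    run_ocr: Callable[[], str],
    ocr_enabled: bool = True,
) -> Tuple[str, bool]:
    """Return (text, used_ocr) for one page.

    `run_ocr` is a hypothetical zero-argument callable that performs OCR
    on the page image; it is only invoked when the fallback triggers.
    """
    text = page_text.strip()
    if ocr_enabled and len(text) < MIN_TEXT_CHARS:
        ocr_text = run_ocr().strip()
        # Keep whichever result carries more text.
        if len(ocr_text) > len(text):
            return ocr_text, True
    return text, False
```

In a real pipeline, `run_ocr` would typically wrap a call such as `pytesseract.image_to_string(image)` on a rendered page image. Passing it in as a callable keeps the expensive OCR step out of the path for pages that already contain enough text.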
Performance Improvements
- 3x Faster Processing - Parallel page processing using multi-threading
  - Auto-detects CPU count or accepts custom worker specification
  - Activates automatically for PDFs with 5+ pages
  - Benchmark: 500-page PDF reduced from 4m 10s to 1m 15s
  - Commands: --parallel and --workers N
- Intelligent Caching - 50% faster on subsequent runs
  - In-memory cache for expensive operations (text extraction, code detection, quality scoring)
  - Enabled by default, disable with --no-cache
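The worker selection and page fan-out described above might look roughly like the sketch below. It is an illustration under stated assumptions rather than the project's implementation: `choose_workers`, `process_pages`, and `PARALLEL_MIN_PAGES` are hypothetical names, pages are modeled as plain strings, and caching is left out. Only the 5-page activation threshold and the CPU-count default come from these notes.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence

# From the release notes: parallel mode activates at 5 or more pages.
PARALLEL_MIN_PAGES = 5


def choose_workers(page_count: int, requested: Optional[int] = None) -> int:
    """Pick a worker count: 1 for small PDFs, else requested or CPU count."""
    if page_count < PARALLEL_MIN_PAGES:
        return 1
    workers = requested if requested is not None else (os.cpu_count() or 1)
    # Never use more workers than pages, and never fewer than one.
    return max(1, min(workers, page_count))


def process_pages(
    pages: Sequence[str],
    extract: Callable[[str], str],
    requested_workers: Optional[int] = None,
) -> List[str]:
    """Apply `extract` to every page, preserving page order."""
    workers = choose_workers(len(pages), requested_workers)
    if workers == 1:
        return [extract(page) for page in pages]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order.
        return list(pool.map(extract, pages))
```

Because `Executor.map` returns results in input order, the extracted pages line up with the original page sequence regardless of which thread finishes first.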
Usage Examples
Basic PDF Extraction
python3 cli/pdf_scraper.py --pdf docs/manual.pdf --name myskill

Maximum Performance
python3 cli/pdf_scraper.py --pdf docs/manual.pdf --name myskill \
--extract-tables \
--parallel \
--workers 8

Scanned PDFs
pip3 install pytesseract Pillow
python3 cli/pdf_scraper.py --pdf docs/scanned.pdf --name myskill --ocr

Password-Protected PDFs
python3 cli/pdf_scraper.py --pdf docs/encrypted.pdf --name myskill --password mypassword

All Features Combined
python3 cli/pdf_scraper.py --pdf docs/manual.pdf --name myskill \
--ocr \
--extract-tables \
--parallel \
--workers 8 \
--verbose

Performance Benchmarks
| Pages | Sequential | Parallel (4 workers) | Parallel (8 workers) |
|---|---|---|---|
| 50 | 25s | 10s (2.5x) | 8s (3.1x) |
| 100 | 50s | 18s (2.8x) | 15s (3.3x) |
| 500 | 4m 10s | 1m 30s (2.8x) | 1m 15s (3.3x) |
| 1000 | 8m 20s | 3m 00s (2.8x) | 2m 30s (3.3x) |
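The speedup figures in parentheses are the sequential time divided by the parallel time. As a small worked check, the sketch below recomputes them from the durations in the table; `to_seconds` and `speedup` are hypothetical helper names introduced only for this illustration.

```python
import re


def to_seconds(duration: str) -> int:
    """Convert a string such as '4m 10s' or '25s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?\s*(?:(\d+)s)?", duration.strip())
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {duration!r}")
    minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return minutes * 60 + seconds


def speedup(sequential: str, parallel: str) -> float:
    """Sequential time divided by parallel time, rounded to one decimal."""
    return round(to_seconds(sequential) / to_seconds(parallel), 1)
```

For the 500-page row, 4m 10s is 250 seconds and 1m 15s is 75 seconds, so `speedup("4m 10s", "1m 15s")` works out to 250 / 75, or about 3.3.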
Testing
- New Test Suite: test_pdf_advanced_features.py (26 comprehensive tests)
  - OCR Support (5 tests)
  - Password Protection (4 tests)
  - Table Extraction (5 tests)
  - Parallel Processing (4 tests)
  - Intelligent Caching (5 tests)
  - Integration (3 tests)
- Updated Tests: test_pdf_extractor.py (23 tests, all passing)
- Total PDF Tests: 49/49 passing (100%)
- Overall Project: 142/142 tests passing (100%)
Documentation
- New Guide: docs/PDF_ADVANCED_FEATURES.md (580 lines)
  - Complete usage guide
  - Installation instructions
  - Performance benchmarks
  - Best practices
  - Troubleshooting
  - API reference
Dependencies
New Required Dependencies
pip3 install Pillow==11.0.0 pytesseract==0.3.13
Optional System Dependency
- Tesseract OCR engine (for scanned PDF support)
  - Ubuntu/Debian: sudo apt-get install tesseract-ocr
  - macOS: brew install tesseract
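Since the Tesseract engine is a system binary installed separately from the Python packages, it can help to confirm all three pieces before running with --ocr. The check below is a hedged sketch using only the standard library; `ocr_dependency_status` is a hypothetical helper and is not part of cli/pdf_scraper.py.

```python
import importlib.util
import shutil
from typing import Dict


def ocr_dependency_status() -> Dict[str, bool]:
    """Report which OCR prerequisites are importable or on PATH."""
    return {
        # Python packages (note: Pillow is imported as "PIL").
        "pytesseract": importlib.util.find_spec("pytesseract") is not None,
        "Pillow": importlib.util.find_spec("PIL") is not None,
        # The Tesseract engine itself is a separate system binary.
        "tesseract": shutil.which("tesseract") is not None,
    }
```

A `False` entry for "tesseract" alongside `True` for "pytesseract" usually means the Python wrapper is installed but the engine still needs to be added through the system package manager.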
What's Changed
- Enhanced cli/pdf_extractor_poc.py with all advanced features
- Added cli/pdf_scraper.py for full workflow support
- Updated requirements.txt with new dependencies
- Updated README.md with advanced features showcase
- Updated docs/TESTING.md with comprehensive test documentation
- Added extensive PDF documentation (7 new guides)
Bug Fixes
- Fixed function signature mismatches in tests
- Updated language detection confidence thresholds
- Corrected chapter detection patterns
- Fixed code block merging with proper metadata
Full Changelog
See CHANGELOG.md for complete version history.
Full Diff: v1.1.0...v1.2.0
This release represents a major step forward in PDF documentation processing capabilities. Now you can extract comprehensive skills from virtually any PDF, whether it's a modern digital document, a scanned paper book, or an encrypted technical manual!