Changed:
- Merge v2 master into new-processor-api
- PAGE API: Update to latest generateDS 2.44.1, bertsky#21
- 🔥 logging: increase default root (not `ocrd`) level from `INFO` to `WARNING`
- 🔥 `initLogging`: do not remove any previous handlers/levels, unless `force_reinit`
- 🔥 `disableLogging`: remove all handlers, reset all levels - instead of being selective
- 🔥 Processor: replace `weakref` with `__del__` to trigger `shutdown`
- 🔥 `OCRD_MAX_PARALLEL_PAGES>1`: log via `QueueHandler` in subprocess, `QueueListener` in main
- 🔥 `ocrd_utils.initLogging`: also add handler to root logger (as in file config),
  but disable message propagation to avoid duplication
- only import `ocrd_network` in `src/ocrd/decorators/__init__.py` once needed
- `Processor.process_page_file`: skip computing `process_page_pcgts` if output already exists,
  but `OCRD_EXISTING_OUTPUT!=OVERWRITE`
- 🔥 `OCRD_MAX_PARALLEL_PAGES>1`: switch from multithreading to multiprocessing, depend on
  `loky` instead of stdlib `concurrent.futures`
- `OCRD_PROCESSING_PAGE_TIMEOUT>0`: actually enforce timeout within worker
- `OCRD_MAX_MISSING_OUTPUTS>0`: abort early if too many failures already, prospectively
- `Processor.process_workspace`: split up into overridable sub-methods:
  - `process_workspace_submit_tasks` (iterate input file group and schedule page tasks)
  - `process_workspace_submit_page_task` (download input files and submit single page task)
  - `process_workspace_handle_tasks` (monitor page tasks and aggregate results)
  - `process_workspace_handle_page_task` (await single page task and handle errors)
- 🔥 `Processor`/`Workspace.add_file`: always `force` if `OCRD_EXISTING_OUTPUT==OVERWRITE`
- 🔥 `Processor.verify`: revert 3.0.0b1 enforcing cardinality checks (stay backwards compatible)
- 🔥 `Processor.verify`: check output fileGrps, too
  (must not exist unless `OCRD_EXISTING_OUTPUT=OVERWRITE|SKIP` or disjoint `--page-id` range)
- lib.bash `input-files`: do not try to validate tasks here (now covered by `Processor.verify()`)
- `run_processor`: be robust if `ocrd_tool` is missing `steps`
- `PcGtsType.PageType.id` via `make_xml_id`: replace `/` with `_`
- 🔥 `ocrd_utils`, `ocrd_models`, `ocrd_modelfactory`, `ocrd_validators` and `ocrd_network` are not published
  as separate packages anymore, everything is contained in `ocrd` - you should adapt your `requirements.txt` accordingly
- 🔥 `Processor.parameter` now a property (attribute always exists, but `None` for non-processing contexts)
- 🔥 `Processor.parameter` is now a `frozendict` (contents immutable)
- 🔥 `Processor.parameter` validate when(ever) set instead of (just) the constructor
- setting `Processor.parameter` will also trigger (`Processor.shutdown()` and) `Processor.setup()`
- `get_processor(... instance_caching=True)`: use `min(max_instances, OCRD_MAX_PROCESSOR_CACHE)`
- 🔥 `Processor.verify` always validates fileGrp cardinalities (because we have `ocrd-tool.json` defaults now)
- 🔥 `OcrdMets.add_agent` without positional arguments
- `ocrd bashlib input-files` now uses normal Processor decorator, and gets passed actual `ocrd-tool.json` and tool name
  from bashlib's `ocrd__wrap`
- 🔥 `OcrdPage` as proxy of `PcGtsType` instead of alias; also contains `etree` and `mapping` now
- 🔥 `page_from_file`: removed kwarg `with_tree` - use `OcrdPage.etree` and `OcrdPage.mapping` instead
- 🔥 `Processor.zip_input_files` now can throw `ocrd.NonUniqueInputFile` and `ocrd.MissingInputFile`
  (the latter only if `OCRD_MISSING_INPUT=ABORT`)
- 🔥 `Processor.zip_input_files` does not by default use `require_first` anymore
  (so the first file in any input file tuple per page can be `None` as well)
- 🔥 no more `Workspace.overwrite_mode`, merely delegate to `OCRD_EXISTING_OUTPUT=OVERWRITE`
- 🎨 improve on docs result for `ocrd_utils.config`
- 🔥 Deprecate `Processor.process`
- update spec to v3.25.0, which requires annotating fileGrp cardinality in `ocrd-tool.json`
- 🔥 Remove passing non-processing kwargs to `Processor` constructor, add as members
  (i.e. `show_help`, `dump_json`, `dump_module_dir`, `list_resources`, `show_resource`, `resolve_resource`)
- 🔥 Deprecate passing processing arg / kwargs to `Processor` constructor
  (i.e. `workspace`, `page_id`, `input_file_grp`, `output_file_grp`; now all set by `run_processor`)
- 🔥 Deprecate passing `ocrd-tool.json` metadata to `Processor` constructor
- `ocrd.processor`: Handle loading of bundled `ocrd-tool.json` generically
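
The `QueueHandler`/`QueueListener` entry above can be pictured with the standard library alone. The following is a minimal sketch of that wiring, not the code used in `ocrd`: the names `ListHandler`, `make_worker_logger` and `run_demo` are hypothetical, and an in-process `queue.Queue` stands in for the inter-process queue that a real multiprocessing setup would need.

```python
import logging
import logging.handlers
import queue


class ListHandler(logging.Handler):
    """Hypothetical stand-in for the main side's real handlers: collects messages."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(f"{record.levelname} {record.name}: {record.getMessage()}")


def make_worker_logger(name, log_queue):
    """Worker side: the only handler is a QueueHandler, so nothing is written locally."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    logger.setLevel(logging.INFO)
    logger.propagate = False  # do not also emit via the root logger
    return logger


def run_demo():
    # Assumption: an in-process queue for brevity; subprocesses would need
    # a queue that can be shared between processes.
    log_queue = queue.Queue()
    collector = ListHandler()
    # Main side: the listener drains the queue and dispatches to the real handlers.
    listener = logging.handlers.QueueListener(log_queue, collector)
    listener.start()

    worker_log = make_worker_logger("demo.worker", log_queue)
    for page_id in ("PHYS_0001", "PHYS_0002"):
        worker_log.info("processed page %s", page_id)
    worker_log.debug("dropped, because the worker logger is set to INFO")

    listener.stop()  # handles all queued records before returning
    return collector.messages
```

Calling `run_demo()` returns the two `INFO` messages in the order they were logged. The point of the design is that formatting and output stay in one place (the listener), while workers only enqueue records.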
Fixed:
- `ocrd --help` output was broken for multiline config options, bertsky#25
- Call `initLogging` before instantiating processors in `ocrd_cli_wrap_processor`, bertsky#24, #1296
- PAGE API: Fully reversible mapping from/to XML element/generateDS instances, bertsky#21
- `initLogging`: only add root handler instead of multiple redundant handlers with `propagate=false`
- `setOverrideLogLevel`: override all currently active loggers' level
- `OcrdMets.get_physical_pages`: cover `return_divs` w/o `for_fileIds` and `for_pageIds`
- tests: ensure `ocrd_utils.config` gets reset whenever changing it globally
- `OcrdMetsServer.add_file`: pass on `force` kwarg
- `ocrd.cli.workspace`: consistently pass on `--mets-server-url` and `--backup`
- `ocrd.cli.validate "tasks"`: pass on `--mets-server-url`
- `ocrd.cli.bashlib "input-files"`: pass on `--mets-server-url`
- `lib.bash input-files`: pass on `--mets-server-url`, `--overwrite`, and parameters
- `lib.bash`: fix `errexit` handling
- `ocrd.cli.ocrd-tool "resolve-resource"`: forgot to actually print result
- `Processor.metadata_location`: `src` workaround respects namespace packages, qurator-spk/eynollah#134
- `Workspace.reload_mets`: handle ClientSideOcrdMets as well
- `disableLogging`: also re-instate root logger to Python defaults
- actually apply CLI `--log-filename`, and show in `--help`
- adapt to Pillow changes
- `ocrd workspace clone`: do pass on `--file-grp` (for download filtering)
Added:
- `ocrd-filter` processor to remove segments based on XPath expressions, bertsky#21
- XPath function `pc:pixelarea` for the number of pixels of the bounding box (or sum area on node sets), bertsky#21
- XPath function `pc:textequiv` for the first TextEquiv unicode string (or concatenated string on node sets), bertsky#21
- `OcrdPage`: new `PageType.get_ReadingOrderGroups()` to retrieve recursive RO as dict
- ocrd.cli.workspace `server`: add subcommands `reload` and `save`
- METS Server: export and delegate `physical_pages`
- processor CLI: delegate `--resolve-resource`, too
- `Processor.process_page_file`/`OcrdPageResultImage`: allow `None` besides `AlternativeImageType`
- `OcrdConfig.reset_defaults` to reset config variables to their defaults
- `Processor.max_workers`: class attribute to control per-page parallelism of this implementation
- `Processor.max_page_seconds`: class attribute to control per-page timeout of this implementation
- `OCRD_MAX_PARALLEL_PAGES` for whether and how many workers should process pages in parallel
- `OCRD_PROCESSING_PAGE_TIMEOUT` for whether and how long processors should wait for single pages
- `OCRD_MAX_MISSING_OUTPUTS` for maximum rate (fraction) of pages before making `OCRD_MISSING_OUTPUT=ABORT`
- `Processor.metadata_filename`: expose to make local path of `ocrd-tool.json` in Python distribution reusable+overridable
- `Processor.metadata_location`: expose to make absolute path of `ocrd-tool.json` reusable+overridable
- `Processor.metadata_rawdict`: expose to make in-memory contents of `ocrd-tool.json` reusable+overridable
- `Processor.metadata`: expose to make validated and default-expanded contents of `ocrd-tool.json` reusable+overridable
- `Processor.shutdown`: to shut down processor after processing, optional
- `Processor.max_instances`: class attribute to control instance caching of this implementation
- 👉 `OCRD_DOWNLOAD_INPUT` for whether input files should be downloaded before processing
- 👉 `OCRD_MISSING_INPUT` for how to handle missing input files (`SKIP` or `ABORT`)
- 👉 `OCRD_MISSING_OUTPUT` for how to handle processing failures (`SKIP` or `ABORT` or `COPY`)
  the latter behaves like ocrd-dummy for the failed page(s)
- 👉 `OCRD_EXISTING_OUTPUT` for how to handle existing output files (`SKIP` or `ABORT` or `OVERWRITE`)
- new CLI option `--debug` as short-hand for `ABORT` choices above
- `Processor.logger` set up by constructor already (for re-use by processor implementors)
- default-expand and validate `ocrd_tool.json` in `Processor` constructor, log invalidities
- handle JSON `deprecation` in `ocrd_tool.json` by reporting warnings
- `Processor.process_workspace`: process a complete workspace, with default implementation
- `Processor.process_page_file`: process an OcrdFile, with default implementation
- `Processor.process_page_pcgts`: process a single OcrdPage, produce a single OcrdPage, required to implement
- `Processor.verify`: handle fileGrp cardinality verification, with default implementation
- `Processor.setup`: to set up processor before processing, optional
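
How the `OCRD_MISSING_OUTPUT` choices and the `OCRD_MAX_MISSING_OUTPUTS` rate might interact can be sketched as a small decision function. This is an illustration under assumptions, not the logic implemented in `ocrd`: the helper `failure_action` is hypothetical, and it assumes the rate is the number of failed pages divided by the total number of pages, compared strictly against the limit.

```python
def failure_action(policy, failed_pages, total_pages, max_missing_rate=0.0):
    """Decide what to do after a page failed (hypothetical helper).

    policy           -- "SKIP", "ABORT" or "COPY" (the OCRD_MISSING_OUTPUT choices)
    failed_pages     -- number of pages that have failed so far, including this one
    total_pages      -- number of pages scheduled in this run
    max_missing_rate -- fraction of pages allowed to fail; 0 disables the check
    """
    if policy not in ("SKIP", "ABORT", "COPY"):
        raise ValueError(f"unknown policy: {policy!r}")
    if policy == "ABORT":
        return "abort"
    # Assumption: exceeding the allowed failure rate turns SKIP/COPY into ABORT,
    # even if all remaining pages would still succeed.
    if max_missing_rate > 0 and total_pages > 0:
        if failed_pages / total_pages > max_missing_rate:
            return "abort"
    # COPY passes the input through for the failed page, SKIP leaves a gap.
    return "copy" if policy == "COPY" else "skip"
```

For example, with a limit of `0.25` and 10 pages, the first failure under `SKIP` is skipped, while the third failure (a rate of 0.3) aborts the run.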