datalab-to/marker v1.7.0 on GitHub

Surya OCR 3 (inline math, better accuracy/speed)

Surya OCR 3 is a new OCR model that is more accurate, supports inline math, and runs faster on GPU. Pass the --format_lines option to have inline math OCRed properly.
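If you call marker from Python rather than the CLI, the same option can presumably be set through the config dict. This is a minimal sketch, assuming the config key is named format_lines like the CLI flag; build_config and convert are hypothetical helper names, and the marker imports are deferred so the config helper works on its own:

```python
def build_config(format_lines: bool = True, output_format: str = "markdown") -> dict:
    """Build a marker config dict (hypothetical helper).

    Assumption: the "format_lines" key mirrors the --format_lines CLI flag.
    """
    return {"format_lines": format_lines, "output_format": output_format}


def convert(path: str, config: dict):
    """Run the PDF converter with the given config (requires marker installed)."""
    # Imported lazily so build_config() can be used without marker present.
    from marker.config.parser import ConfigParser
    from marker.converters.pdf import PdfConverter
    from marker.models import create_model_dict

    config_parser = ConfigParser(config)
    converter = PdfConverter(
        config=config_parser.generate_config_dict(),
        artifact_dict=create_model_dict(),
        processor_list=config_parser.get_processors(),
        renderer=config_parser.get_renderer(),
    )
    return converter(path)
```

You would then call convert("FILEPATH", build_config()) to convert a document with line formatting turned on.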

More benchmarks are coming soon. Math recognition appears to be the best available, but this has not been fully validated yet.

Structured extraction (beta)

We now have an early version of structured extraction. You pass in a file and a pydantic schema to extract data. You can use it like this:

from marker.converters.extraction import ExtractionConverter
from marker.models import create_model_dict
from marker.config.parser import ConfigParser
from pydantic import BaseModel

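# Describe the data you want extracted as a pydantic model
class Links(BaseModel):
    links: list[str]

# The converter takes the JSON schema generated from the model
schema = Links.model_json_schema()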
config_parser = ConfigParser({
    "page_schema": schema
})

converter = ExtractionConverter(
    artifact_dict=create_model_dict(),
    config=config_parser.generate_config_dict(),
    llm_service=config_parser.get_llm_service(),
)
rendered = converter("FILEPATH")

This requires you to configure an LLM service. See the marker docs on LLM services for setup.
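Schemas can carry more than one field, and field descriptions end up in the generated JSON schema. The sketch below uses only pydantic. The Invoice model and the sample JSON string are hypothetical, and these notes do not say what shape the converter's output takes, so the validation step only illustrates checking data against the same model:

```python
from pydantic import BaseModel, Field


class Invoice(BaseModel):
    """Hypothetical schema, for illustration only."""

    vendor: str = Field(description="Name of the company issuing the invoice")
    total: float = Field(description="Total amount due")
    line_items: list[str] = Field(
        default_factory=list, description="One entry per billed item"
    )


# model_json_schema() returns a plain dict that can be passed as "page_schema"
schema = Invoice.model_json_schema()

# Made-up JSON, standing in for data you want to validate against the model
sample = '{"vendor": "Acme", "total": 12.5, "line_items": ["widget"]}'
invoice = Invoice.model_validate_json(sample)
```

Fields without a default (vendor and total here) are listed as required in the schema.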

There is also a structured extraction GUI app, which you can run with:

pip install streamlit streamlit-ace
marker_extract

OCR converter

You can now run OCR on its own from marker and keep the recognized characters. This allows block equations to be handled properly. You can use it like this:

from marker.converters.ocr import OCRConverter
from marker.models import create_model_dict

converter = OCRConverter(
    artifact_dict=create_model_dict(),
)
rendered = converter("FILEPATH")

Misc improvements

The PdfConverter can now take an io.BytesIO object instead of a file path.
Fixed some rare bugs with merging blocks together.
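The io.BytesIO support is useful when the PDF is already in memory, for example after a download. This is a minimal sketch, assuming the stream is passed in the same position as a file path; load_pdf_stream and convert_stream are hypothetical helper names:

```python
import io
from pathlib import Path


def load_pdf_stream(path: str) -> io.BytesIO:
    """Read a file from disk into an in-memory, seekable byte stream."""
    return io.BytesIO(Path(path).read_bytes())


def convert_stream(stream: io.BytesIO):
    """Convert an in-memory PDF (requires marker installed)."""
    # Imported lazily so load_pdf_stream() works without marker present.
    from marker.converters.pdf import PdfConverter
    from marker.models import create_model_dict

    converter = PdfConverter(artifact_dict=create_model_dict())
    # Assumption: the BytesIO object replaces the file path argument
    return converter(stream)
```

You would call convert_stream(load_pdf_stream("FILEPATH")), or build the BytesIO object directly from bytes you already hold.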

What's Changed

Keep chars by @VikParuchuri in #662
Keep chars by @VikParuchuri in #665
Structured extraction by @VikParuchuri in #687
WIP: Foundation Model Integration by @tarun-menta in #616
New OCR model, structured extraction beta by @VikParuchuri in #693

Full Changelog: v1.6.2...v1.7.0
