github datalab-to/marker v1.7.0
New OCR model; inline math; beta structured extraction

13 months ago

Surya OCR 3 (inline math, better accuracy/speed)

New OCR model that is more accurate, supports inline math, and is faster on GPU. Use the --format_lines option to OCR inline math properly.
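For Python users, the same option can probably be set through the config dictionary. The sketch below assumes that the --format_lines CLI flag corresponds to a config key named "format_lines", and that PdfConverter takes the constructor arguments shown in marker's README. Neither assumption is confirmed by these release notes. build_config and convert_with_inline_math are hypothetical helper names used only for illustration.

```python
def build_config(format_lines: bool = True) -> dict:
    """Build a marker config dict.

    Assumption: the "format_lines" key mirrors the --format_lines CLI flag.
    """
    return {"output_format": "markdown", "format_lines": format_lines}


def convert_with_inline_math(filepath: str):
    """Convert a PDF with line formatting enabled so inline math is OCRed."""
    # Imported lazily so build_config() works even without marker installed.
    from marker.config.parser import ConfigParser
    from marker.converters.pdf import PdfConverter
    from marker.models import create_model_dict

    config_parser = ConfigParser(build_config())
    converter = PdfConverter(
        config=config_parser.generate_config_dict(),
        artifact_dict=create_model_dict(),
        processor_list=config_parser.get_processors(),
        renderer=config_parser.get_renderer(),
    )
    return converter(filepath)
```

Calling convert_with_inline_math("FILEPATH") would load the models and run the conversion. Keeping the config in its own function makes it easy to inspect before any models are loaded.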

More benchmarks are coming soon. Math recognition appears to be the best available, but this has not been fully validated yet.

Structured extraction (beta)

We now have an early version of structured extraction. You pass in a file and a Pydantic schema, and marker extracts data matching that schema. You can use it like this:

from marker.converters.extraction import ExtractionConverter
from marker.models import create_model_dict
from marker.config.parser import ConfigParser
from pydantic import BaseModel

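# Describe the data to extract as a Pydantic model.
class Links(BaseModel):
    links: list[str]

# Pass the JSON schema of the model to marker through the config.
schema = Links.model_json_schema()
config_parser = ConfigParser({
    "page_schema": schema
})

# The extraction converter needs an LLM service in addition to the models.
converter = ExtractionConverter(
    artifact_dict=create_model_dict(),
    config=config_parser.generate_config_dict(),
    llm_service=config_parser.get_llm_service(),
)
rendered = converter("FILEPATH")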

This requires you to configure an LLM service; see the marker documentation for how to set one up.

There is also a structured extraction GUI app, which you can run with:

pip install streamlit streamlit-ace
marker_extract

OCR converter

You can now run OCR through marker and keep the individual characters in the output. This allows block equations to be handled properly. You can use it like this:

from marker.converters.ocr import OCRConverter
from marker.models import create_model_dict

converter = OCRConverter(
    artifact_dict=create_model_dict(),
)
rendered = converter("FILEPATH")

Misc improvements

  • The PdfConverter can now take an io.BytesIO object instead of a file path.
  • Fixed some rare bugs with merging blocks together.
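The io.BytesIO support is useful when a PDF arrives over the network or from a database and never touches disk. A minimal sketch follows, assuming the converter class is marker.converters.pdf.PdfConverter and that the buffer is passed in place of the file path. pdf_to_buffer and convert_from_memory are hypothetical helper names, not part of marker.

```python
import io
from pathlib import Path


def pdf_to_buffer(path: str) -> io.BytesIO:
    """Read a PDF from disk into an in-memory buffer positioned at the start."""
    return io.BytesIO(Path(path).read_bytes())


def convert_from_memory(buffer: io.BytesIO):
    """Convert a PDF held in memory instead of one referenced by file path."""
    # Imported lazily so pdf_to_buffer() works even without marker installed.
    from marker.converters.pdf import PdfConverter
    from marker.models import create_model_dict

    converter = PdfConverter(artifact_dict=create_model_dict())
    # Assumption: the BytesIO object is passed where the file path used to go.
    return converter(buffer)
```

In practice the buffer would more likely come from an upload or an HTTP response body than from a local file. Reading from disk here only keeps the example self-contained.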

What's Changed

Full Changelog: v1.6.2...v1.7.0