Overview
Version 0.1.0 (previously 0.1.0a6) is a large release, bringing many improvements over the previous 0.0.2 version.
High-level changes include:
- Organized dependencies into feature groups: install only the converters you need, or get everything with pip install markitdown[all]
- A new plugin-based architecture, allowing 3rd-party developers to add functionality to MarkItDown (see the sample plugin)
- All conversions are performed in-memory — no more temporary files
- Support for new formats including EPUB
- Option to keep data URIs in converted Markdown
- Option to override MIME type, extension, and charset in the command-line interface (useful when reading input from a pipe or stdin)
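The in-memory conversion and type-override points above can be sketched from Python as well. This is a minimal sketch, not the project's documented recipe: convert_bytes is a hypothetical helper, and it assumes that markitdown 0.1.0 exports MarkItDown and StreamInfo at the top level, that StreamInfo accepts extension, mimetype, and charset keyword arguments, and that convert_stream() accepts a stream_info keyword.

```python
import io
from typing import Optional


def convert_bytes(
    data: bytes,
    extension: Optional[str] = None,
    mimetype: Optional[str] = None,
    charset: Optional[str] = None,
) -> str:
    """Convert in-memory bytes to Markdown, passing optional type hints.

    Hypothetical helper for illustration; not part of markitdown itself.
    """
    # Imported lazily so the sketch can be loaded without markitdown installed.
    # Assumption: markitdown >= 0.1.0 exports both names at the top level.
    from markitdown import MarkItDown, StreamInfo

    # Bytes read from a pipe carry no filename, so the hints stand in for it.
    hints = StreamInfo(extension=extension, mimetype=mimetype, charset=charset)

    # Wrap the bytes in a binary stream; no temporary file is involved.
    result = MarkItDown().convert_stream(io.BytesIO(data), stream_info=hints)
    return result.text_content
```

With the package installed, a call such as convert_bytes(b"<h1>Hi</h1>", extension=".html") would exercise the same hints that the new command-line options provide when input arrives on stdin.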
Breaking changes
- As noted above, dependencies are now organized into optional feature groups. Use pip install markitdown[all] for backward-compatible behavior.
- convert_stream() now requires a binary file-like object (e.g., a file opened in binary mode, or an io.BytesIO object). This is a breaking change from the previous version, which also accepted text file-like objects, like io.StringIO.
- The DocumentConverter class interface has changed to read from file-like streams rather than file paths. No temporary files are created anymore. If you are the maintainer of a plugin or custom DocumentConverter, you likely need to update your code. Otherwise, if you're only using the MarkItDown class or CLI (as in these examples), you should not need to change anything.
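For callers that previously passed text streams to convert_stream(), one possible migration is to re-encode the text into a binary stream first. The following is a sketch using only the standard library; as_binary_stream is a hypothetical helper name, and UTF-8 is an assumed default encoding.

```python
import io


def as_binary_stream(stream, encoding="utf-8"):
    """Return a binary file-like object for `stream`.

    Text streams (io.StringIO, files opened in text mode) are read from their
    current position and re-encoded; binary streams are returned unchanged.
    Hypothetical helper for illustration; not part of markitdown itself.
    """
    if isinstance(stream, io.TextIOBase):
        return io.BytesIO(stream.read().encode(encoding))
    return stream


# A text stream that older code might have passed directly.
legacy = io.StringIO("<html><body><h1>Hello</h1></body></html>")

# The wrapped stream is binary, which is what convert_stream() now requires.
binary = as_binary_stream(legacy)
print(type(binary).__name__)
```

Because the helper reads the whole text stream into memory, it suits small inputs; for large files, opening the file in binary mode in the first place avoids the extra copy.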
Detailed list of contributions
- Cleanup and refactor, in preparation for plugin support. by @afourney in #318
- Skip generating md links in 'pre' blocks by @t-kalinowski in #322
- Fix a typo in sample RTF plugin by @rickygao in #320
- Added priority argument to all converter constructors. by @afourney in #324
- Doc Intelligence fixes for refactored code by @KennyZhang1 in #325
- Added CLI tests. by @afourney in #327
- Fix UnboundLocalError in MarkItDown._convert by @menezesandre in #1038
- add necessary imports by @tanreinama in #861
- fix: Implement retry logic for YouTube transcript fetching and fix URL decoding issue by @iw4p in #1035
- Add Support For PPTX Shape Groups (Fix in code design to not miss out on slide content) by @C0dingMast3r in #331
- Make sure extensions are unique in MarkItDown's convert methods. by @afourney in #1076
- Don't have ZipConverter accept OOXML files. by @afourney in #1078
- Print and log better exceptions when file conversions fail. by @afourney in #1080
- Exceptions should subclass Exception not BaseException. by @afourney in #1082
- [Draft] Exploring ways to allow Optional dependencies by @afourney in #1079
- Fixed property name by @afourney in #1085
- Update converter API, use streams rather than filepaths by @afourney in #1088
- Bump version. by @afourney in #1094
- Fixed loading of plugins. by @afourney in #1096
- Fixed version. by @afourney in #1097
- fix(README): correct pip install command formatting by @Piero24 in #1090
- Fixed deepcopy failure when passing llm_client by @scalabreseGD in #1089
- Fixed formatting. by @afourney in #1098
- feat: sort pptx shapes to be parsed in top-to-bottom, left-to-right order by @richardye101 in #1104
- feat(docker): improve dockerfile build by @syaghoubi00 in #220
- Fix exiftool in well-known paths. by @afourney in #1106
- fix typo in well-known path list by @0xmohit in #1109
- Switch from puremagic to magika. by @afourney in #1108
- Minimize guesses when guesses are compatible. by @afourney in #1114
- Added CLI options for extension, mime-types, and charset. by @afourney in #1115
- Fix string formatting in FileConversionException error message by @yushihang in #1121
- Handle not supported plot type in pptx by @EmanueleMeazzo in #1122
- Small fixes for autogen integration. by @afourney in #1124
- Added epub test file. by @afourney in #1130
- Fix remaining mypy errors. by @afourney in #1132
- Have magika read from the stream. by @afourney in #1136
- EPub Support. Adapted #123 to not use epublib. by @afourney in #1131
- Consider anything with a charset as plain text-convertible. by @afourney in #1142
- Adjust warning filters and update dependencies by @afourney in #1143
- Add support for preserving base64 encoded images by @BetterAndBetterII in #1140
- Resolve a console encoding error. by @afourney in #1149
- Bump version to 0.1.0 by @afourney in #1150
New Contributors
- @t-kalinowski made their first contribution in #322
- @rickygao made their first contribution in #320
- @menezesandre made their first contribution in #1038
- @tanreinama made their first contribution in #861
- @iw4p made their first contribution in #1035
- @C0dingMast3r made their first contribution in #331
- @Piero24 made their first contribution in #1090
- @scalabreseGD made their first contribution in #1089
- @richardye101 made their first contribution in #1104
- @syaghoubi00 made their first contribution in #220
- @0xmohit made their first contribution in #1109
- @yushihang made their first contribution in #1121
- @EmanueleMeazzo made their first contribution in #1122
- @BetterAndBetterII made their first contribution in #1140
Full Changelog: v0.0.2...v0.1.0