jgm/pandoc 1.13 on GitHub

New features

Added docx as an input format (Jesse Rosenthal). The docx reader includes conversion of native Word equations to pandoc LaTeX Math elements. Metadata is taken from paragraphs at the beginning of the document with styles Author, Title, Subtitle, Date, and Abstract.
Added epub as an input format (Matthew Pickering). The epub reader includes conversion of MathML to pandoc LaTeX Math elements.
Added t2t (Txt2Tags) as an input format (Matthew Pickering). Txt2tags is a lightweight markup format described at http://txt2tags.org/.
Added dokuwiki as an output format (Clare Macrae).
Added haddock as an output format.
Added --extract-media option to extract media contained in a zip container (docx or epub) while adjusting image paths to point to the extracted images.
Added a new markdown extension, compact_definition_lists, that restores the syntax for definition lists of pandoc 1.12.x, allowing tight definition lists with no blank space between items, and disallowing lazy wrapping. (See below under behavior changes.)
Added an extension epub_html_exts for parsing HTML in EPUBs.
Added extensions native_spans and native_divs to activate parsing of material in HTML span or div tags as Pandoc Span inlines or Div blocks.
--trace now works with the Markdown, HTML, Haddock, EPUB, Textile, and MediaWiki readers. This is an option intended for debugging parsing problems; ordinary users should not need to use it.

Behavior changes

Changed behavior of the markdown_attribute extension, to bring it in line with PHP markdown extra and multimarkdown. Setting markdown="1" on an outer tag affects all contained tags, recursively, until it is reversed with markdown="0" (#1378).
Revised markdown definition list syntax (#1429). Both the reader and writer are affected. This change brings pandoc's definition list syntax into alignment with that used in PHP markdown extra and multimarkdown (with the exception that pandoc is more flexible about the definition markers, allowing tildes as well as colons). Lazily wrapped definitions are now allowed. Blank space is required between list items. The space before a definition is used to determine whether it is a paragraph or a "plain" element. WARNING: This change may break existing documents! Either check your documents for definition lists without blank space between items, or use markdown+compact_definition_lists for the old behavior.
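One way to check existing documents for the old compact style is a small line-based scan. The following is a rough sketch, not part of pandoc: the function name `find_compact_definition_items` and the marker regex are assumptions made for illustration (a colon or tilde, indented at most two spaces). It flags any term line that sits directly under a non-blank line and directly above a definition marker. It does not understand code blocks, so expect some false positives.

```python
import re

# Assumed shape of a definition marker: ':' or '~', indented at most two
# spaces, followed by whitespace and some text.
DEF_MARKER = re.compile(r"^ {0,2}[:~]\s+\S")


def find_compact_definition_items(text):
    """Return 1-based line numbers of terms not preceded by a blank line."""
    lines = text.splitlines()
    hits = []
    for i in range(1, len(lines) - 1):
        prev_line, line, next_line = lines[i - 1], lines[i], lines[i + 1]
        if not line.strip() or DEF_MARKER.match(line):
            continue  # blank lines and definitions are never terms
        if prev_line.strip() and DEF_MARKER.match(next_line):
            hits.append(i + 1)
    return hits


old_style = "Term 1\n:   Definition 1\nTerm 2\n:   Definition 2\n"
new_style = "Term 1\n:   Definition 1\n\nTerm 2\n:   Definition 2\n"
print(find_compact_definition_items(old_style))
print(find_compact_definition_items(new_style))
```

Documents that produce hits can either be edited to add the blank lines or converted with markdown+compact_definition_lists.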
.numberLines now works in fenced code blocks even if no language is given (#1287, jgm/highlighting-kate#40).
Improvements to --filter:
Don't search PATH for a filter with an explicit path. This fixed a bug wherein --filter ./caps.py would run caps.py from the system path, even if there was a caps.py in the working directory.
Respect shebang if filter is executable (#1389).
Don't print misleading error message. Previously pandoc would say that a filter was not found, even in a case where the filter had a syntax error.
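For illustration, a filter like the caps.py mentioned above could look roughly like the sketch below. It assumes the JSON form of the document is a tree of lists and objects in which each element is an object with a "t" (tag) key and a "c" (contents) key; the helper names `walk` and `caps` are invented for this example. With the shebang line and the executable bit set, it would be invoked as --filter ./caps.py.

```python
#!/usr/bin/env python3
"""Sketch of a JSON filter that upper-cases all Str elements."""
import json
import sys


def walk(node, action):
    """Rebuild a JSON tree, letting `action` replace tagged objects."""
    if isinstance(node, list):
        return [walk(item, action) for item in node]
    if isinstance(node, dict):
        if "t" in node:
            replaced = action(node)
            if replaced is not None:
                return replaced
        return {key: walk(value, action) for key, value in node.items()}
    return node


def caps(node):
    """Return an upper-cased copy of a Str element, or None to keep walking."""
    if node.get("t") == "Str":
        return {"t": "Str", "c": node["c"].upper()}
    return None


def main():
    raw = sys.stdin.read()
    if raw.strip():  # do nothing when no document was piped in
        json.dump(walk(json.loads(raw), caps), sys.stdout)


if __name__ == "__main__" and not sys.stdin.isatty():
    main()
```

Because the filter reads JSON on standard input and writes JSON on standard output, a syntax error in it surfaces as a failure of the filter process itself, which is the case the improved error message is meant to cover.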
HTML reader:
Parse div and span elements even without --parse-raw, provided native_divs and native_spans extensions are set. Motivation: these now generate native pandoc Div and Span elements, not raw HTML.
Parse EPUB-specific elements if the epub_html_exts extension is enabled. These include switch, footnote, rearnote, noteref.
Org reader:
Support for inline LaTeX. Inline LaTeX is now accepted and parsed by the org-mode reader. Both math symbols (like \tau) and LaTeX commands (like \cite{Coffee}), can be used without any further escaping (Albert Krewinkel).
Textile reader and writer:
The raw_tex extension is no longer set by default. You can enable it with textile+raw_tex.
DocBook reader:
Support equation, informalequation, inlineequation elements with mml:math content. This is converted into LaTeX and put into a Pandoc Math inline.
Revised plain output, largely following the style of Project Gutenberg:
Emphasis is rendered with _underscores_, strong emphasis with ALL CAPS.
Headings are rendered differently, with space to set them off, not with setext style underlines. Level 1 headers are ALL CAPS.
Math is rendered using unicode when possible, but without the distracting emphasis markers around variables.
Footnotes use a regular [n] style.
Markdown writer:
Horizontal rules are now a line across the whole page.
Prettier pipe tables. Columns are now aligned (#1323).
Respect the raw_html extension. pandoc -t markdown-raw_html no longer emits any raw HTML, including span and div tags generated by Span and Div elements.
Use span with style for SmallCaps (#1360).
HTML writer:
Autolinks now have class uri, and email autolinks have class email, so they can be styled.
Docx writer:
Document formatting is carried over from reference.docx. This includes margins, page size, page orientation, header, and footer, including images in headers and footers.
Include abstract (if present) with Abstract style (#1451).
Include subtitle (if present) with Subtitle style, rather than tacking it on to the title (#1451).
Org writer:
Write empty span elements with an id attribute as org anchors. For example Span ("uid",[],[]) [] becomes <<uid>>.
LaTeX writer:
Put table captions above tables, to match the conventional standard. (Previously they appeared below tables.)
Use \(..\) instead of $..$ for inline math (#1464).
Use \nolinkurl in email autolinks. This allows them to be styled using \urlstyle{tt}. Thanks to Ulrike Fischer for the solution.
Use \textquotesingle for ' in inline code. Otherwise we get curly quotes in the PDF output (#1364).
Use \footnote<.>{..} for notes in beamer, so that footnotes do not appear before the overlays in which their markers appear (#1525).
Don't produce a \label{..} for a Div or Span element. Do produce a \hyperdef{..} (#1519).
EPUB writer:
If the metadata includes page-progression-direction (which can be ltr or rtl), the page-progression-direction attribute will be set in the EPUB spine (#1455).
Custom lua writers:
Custom writers now work with --template.
Removed HTML header scaffolding from sample.lua.
Made citation information available in lua writers.
--normalize and Text.Pandoc.Shared.normalize now consolidate adjacent RawBlocks when possible.

API changes

Added Text.Pandoc.Readers.Docx, exporting readDocx (Jesse Rosenthal).
Added Text.Pandoc.Readers.EPUB, exporting readEPUB (Matthew Pickering).
Added Text.Pandoc.Readers.Txt2Tags, exporting readTxt2Tags (Matthew Pickering).
Added Text.Pandoc.Writers.DokuWiki, exporting writeDokuWiki (Clare Macrae).
Added Text.Pandoc.Writers.Haddock, exporting writeHaddock.
Added Text.Pandoc.MediaBag, exporting MediaBag, lookupMedia, insertMedia, mediaDirectory, extractMediaBag. The docx and epub readers return a pair of a Pandoc document and a MediaBag with the media resources they contain. This can be extracted using --extract-media. Writers that incorporate media (PDF, Docx, ODT, EPUB, RTF, or HTML formats with --self-contained) will look for resources in the MediaBag generated by the reader, in addition to the file system or web.
Text.Pandoc.Readers.TexMath: Removed deprecated readTeXMath. Renamed readTeXMath' to texMathToInlines.
Text.Pandoc: Added Reader data type (Matthew Pickering). readers now associates names of readers with Reader structures. This allows inclusion of readers, like the docx reader, that take binary rather than textual input.
Text.Pandoc.Shared:
Added capitalize (Artyom Kazak), and replaced uses of map toUpper (which give bad results for many languages).
Added collapseFilePath, which removes intermediate . and .. from a path (Matthew Pickering).
Added fetchItem', which works like fetchItem but searches a MediaBag before looking on the net or file system.
Added withTempDir.
Added removeFormatting.
Added extractSpaces (from HTML reader) and generalized its type so that it can be used by the docx reader (Matthew Pickering).
Added ordNub.
Added normalizeInlines, normalizeBlocks.
normalize is now Pandoc -> Pandoc instead of Data a => a -> a. Some users may need to change their uses of normalize to the newly exported normalizeInlines or normalizeBlocks.
Text.Pandoc.Options:
Added writerMediaBag to WriterOptions.
Removed deprecated and no longer used readerStrict in ReaderOptions. This is handled by readerExtensions now.
Added Ext_compact_definition_lists.
Added Ext_epub_html_exts.
Added Ext_native_divs and Ext_native_spans. This allows users to turn off the default pandoc behavior of parsing contents of div and span tags in markdown and HTML as native pandoc Div blocks and Span inlines.
Text.Pandoc.Parsing:
Generalized readWith to readWithM (Matthew Pickering).
Export runParserT and Stream (Matthew Pickering).
Added HasQuoteContext type class (Matthew Pickering).
Generalized types of mathInline, smartPunctuation, quoted, singleQuoted, doubleQuoted, failIfInQuoteContext, applyMacros (Matthew Pickering).
Added custom token (Matthew Pickering).
Added stateInHtmlBlock to ParserState. This is used to keep track of the ending tag we're waiting for when we're parsing inside HTML block tags.
Added stateMarkdownAttribute to ParserState. This is used to keep track of whether the markdown attribute has been set in an enclosing tag.
Generalized type of registerHeader, using new type classes HasReaderOptions, HasIdentifierList, HasHeaderMap (Matthew Pickering). These allow certain common functions to be reused even in parsers that use custom state (instead of ParserState), such as the MediaWiki reader.
Moved inlineMath, displayMath from Markdown reader to Parsing, and generalized their types (Matthew Pickering).
Text.Pandoc.Pretty:
Added nestle.
Added blanklines, which guarantees a certain number of blank lines (and no more).

Bug fixes

Markdown reader:
Fixed parsing of indented code in list items. Indented code at the beginning of a list item must be indented eight spaces from the margin (or edge of the container), or four spaces from the list marker, whichever is greater.
Fixed small bug in HTML parsing with markdown_attribute, which caused incorrect tag nesting for input like <aside markdown="1">*hi*</aside>.
Fixed regression with intraword underscores (#1121).
Improved parsing of inline links containing quote characters (#1534).
Slight rewrite of enclosure/emphOrStrong code.
Revamped raw HTML block parsing in markdown (#1330). We no longer include trailing spaces and newlines in the raw blocks. We look for closing tags for elements (but without backtracking). Each block-level tag is its own RawBlock; we no longer try to consolidate them (though --normalize will do so).
Combine consecutive latex environments. This helps when you have two minipages which can't have blank lines between them (#690, #1196).
Support smallcaps through span. <span style="font-variant:small-caps;">foo</span> will be parsed as a SmallCaps inline, and will work in all output formats that support small caps (#1360).
Prevent spurious line breaks after list items (#1137). When the hard_line_breaks option was specified, pandoc would formerly produce a spurious line break after a tight list item.
Fixed table parsing bug (#1333).
Handle c++ and objective-c as language identifiers in github-style fenced blocks (#1318).
Inline math must have nonspace before final $ (#1313).
LaTeX reader:
Handle comments at the end of tables. This resolves the issue illustrated in http://stackoverflow.com/questions/24009489.
Correctly handle table rows with too few cells. LaTeX seems to treat them as if they have empty cells at the end (#241).
Handle leading/trailing spaces in \emph better. \emph{ hi } gets parsed as [Space, Emph [Str "hi"], Space] so that we don't get things like * hi * in markdown output. Also applies to \textbf and some other constructions (#1146).
Don't assume preamble doesn't contain environments (#1338).
Allow (and discard) optional argument for \caption (James Aspnes).
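The \emph{ hi } handling described above can be pictured with a toy model. This is only a sketch of the idea, not the reader's actual code: inlines are modelled as a Python list in which the string "Space" stands for a space and tuples stand for other elements, and the name `hoist_spaces` is made up for the example.

```python
def hoist_spaces(constructor, inlines):
    """Wrap `inlines` in `constructor`, keeping edge spaces outside the wrapper."""
    start = 0
    while start < len(inlines) and inlines[start] == "Space":
        start += 1
    end = len(inlines)
    while end > start and inlines[end - 1] == "Space":
        end -= 1
    core = inlines[start:end]
    # An empty core produces no wrapper at all, only the spaces.
    wrapped = [(constructor, core)] if core else []
    return inlines[:start] + wrapped + inlines[end:]


print(hoist_spaces("Emph", ["Space", ("Str", "hi"), "Space"]))
```

Keeping the spaces outside the emphasis is what lets a markdown writer emit " *hi* " rather than "* hi *".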
HTML reader:
Fixed major parsing problem with HTML tables. Table cells were being combined into one cell (#1341).
Fixed performance issue with malformed HTML tables. We let a </table> tag close an open <tr> or <td> (#1167).
Allow space between <col> and </col>.
Added audio and source in eitherBlockOrInline.
Moved video, svg, progress, script, noscript from blockTags to eitherBlockOrInline.
map and object were mistakenly in both lists; they have been removed from blockTags.
Ignore DOCTYPE and xml declarations.
MediaWiki reader:
Don't parse backslash escapes inside <source> (#1445).
Tightened up template parsing. The opening {{ must be followed by an alphanumeric or :. This prevents the exponential slowdown in #1033.
Support "Bild" for images.
DocBook reader:
Better handle elements inside code environments. Pandoc's document model does not allow structure inside code blocks, but at least this way we preserve the text (#1449).
Support <?asciidoc-br?> (#1236).
Textile reader:
Fixed list parsing. Lists can now start without an intervening blank line (#1513).
HTML block-level tags that do not start a line are parsed as inline HTML and do not interrupt paragraphs (as in RedCloth).
Org reader:
Make tildes create inline code (#1345). Also relabeled code and verbatim parsers to accord with the org-mode manual.
Respect :exports header argument in code blocks (Craig Bosma).
Fixed tight lists with sublists (#1437).
EPUB writer:
Avoid excess whitespace in nav.xhtml. This should improve TOC view in iBooks (#1392).
Fixed regression on cover image. In 1.12.4 and 1.12.4.2, the cover image would not appear properly, because the metadata id was not correct. Now we derive the id from the actual cover image filename, which we preserve rather than using "cover-image."
Keep newlines between block elements. This allows easier diff-ability (#1424).
Use stringify instead of custom plainify.
Use renderTags' for all tag rendering. This properly handles tags that should be self-closing. Previously <hr/> would appear in EPUB output as <hr></hr> (#1420).
Better handle HTML media tags.
Handle multiple dates with OPF event attributes. Note: in EPUB3 we can have only one dc:date, so only the first one is used.
LaTeX writer:
Correctly handle figures in notes. Notes can't contain figures in LaTeX, so we fake it to avoid an error (#1053).
Fixed strikeout + highlighted code (#1294). Previously strikeout highlighted code caused an error.
ConTeXt writer:
Improved detection of autolinks with URLs containing escapes.
RTF writer:
Improved image embedding: fetchItem' is now used to get the images, and calculated image sizes are indicated in the RTF.
Avoid extra paragraph tags in metadata (#1421).
HTML writer:
Deactivate "incremental" inside slide speaker notes (#1394).
Don't include empty items in the table of contents for slide shows. (These would result from creating a slide using a horizontal rule.)
MediaWiki writer:
Minor renaming of st prefixed names.
AsciiDoc writer:
Double up emphasis and strong emphasis markers in intraword contexts, as required by asciidoc (#1441).
Markdown writer:
Avoid wrapping that might start a list, blockquote, or header (#1013).
Use Span instead of (hackish) SmallCaps in plainify.
Don't use braced attributes for fenced code (#1416). If Ext_fenced_code_attributes is not set, the first class attribute will be printed after the opening fence as a bare word.
Separate adjacent lists of the same kind with an HTML comment (#1458).
PDF writer:
Fixed treatment of data uris for images (#1062).
Docx writer:
Use Compact style for empty table cells (#1353). Otherwise we get overly tall lines when there are empty table cells and the other cells are compact.
Create overrides per-image for media/ in reference docx. This should be somewhat more robust and cover more types of images.
Improved entryFromArchive to avoid an unneeded parse.
Section numbering carries over from reference.docx (#1305).
Simplified abstractNumId numbering. Instead of sequential numbering, we assign numbers based on the list marker styles.
Text.Pandoc.Options:
Removed Ext_fenced_code_attributes from markdown_github extensions.
Text.Pandoc.ImageSize:
Use default instead of failing if image size not found in exif header (#1358).
Ignore unknown exif header tag rather than crashing. Some images seem to have tag type of 256, which was causing a runtime error.
Text.Pandoc.Shared:
fetchItem: unescape URI encoding before reading local file (#1427).
fetchItem: strip a fragment like ?#iefix from the extension before doing mime lookup, to improve mime type guessing.
Improved logic of fetchItem: absolute URIs are fetched from the net; other things are treated as relative URIs if sourceURL is Just _, otherwise as file paths on the local file system.
fetchItem now properly handles links without a protocol (#1477).
fetchItem now escapes characters not allowed in URIs before trying to parse the URIs.
Fixed runtime error with compactify'DL on certain lists (#1452).
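The fetchItem lookup order described above can be sketched as follows. This is an illustrative model only, not the Haskell implementation: the function names `resolve_resource` and `extension_for_mime_lookup` are hypothetical, `source_url=None` stands in for a sourceURL of Nothing, and "absolute URI" is simplified to "has a scheme".

```python
import posixpath
import re
from urllib.parse import unquote, urljoin, urlparse


def resolve_resource(source_url, ref):
    """Return ("net", url) or ("file", path) for a resource reference."""
    if urlparse(ref).scheme:
        # Absolute URIs are always fetched from the net.
        return ("net", ref)
    if source_url is not None:
        # Otherwise resolve against the source URL when there is one.
        return ("net", urljoin(source_url, ref))
    # No source URL: treat as a local path, undoing URI escapes first.
    return ("file", unquote(ref))


def extension_for_mime_lookup(path):
    """Drop a trailing query or fragment such as ?#iefix before taking the extension."""
    clean = re.split(r"[?#]", path, maxsplit=1)[0]
    return posixpath.splitext(clean)[1]


print(resolve_resource("http://example.com/docs/index.html", "img/a.png"))
print(resolve_resource(None, "my%20image.png"))
print(extension_for_mime_lookup("fonts/icons.eot?#iefix"))
```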
pandoc.hs: Don't strip path off of writerSourceURL: the path is needed to resolve relative URLs when we fetch resources (#750).
Text.Pandoc.Parsing:
Simplified dash and ellipsis (#1419).
Removed (>>~) in favor of the equivalent (<*) (Matthew Pickering).
Generalized functions to use ParsecT (Matthew Pickering).
Added isbn and pmid to list of recognized schemes (Matthew Pickering).

Template changes

Added haddock template.
EPUB3: Added type attribute to link tags. They are supposed to be "advisory" in HTML5, but kindlegen seems to require them.
EPUB3: Put title page in section with epub:type="titlepage".
LaTeX: Made \subtitle work properly (#1327).
LaTeX/Beamer: remove conditional around date (#1321).
LaTeX: Added lot and lof variables, which can be set to get \listoftables and \listoffigures (#1407). Note that these variables can be set at the command line with -Vlot -Vlof or in YAML metadata.

Under the hood improvements

Rewrote normalize for efficiency (#1385).
Rewrote Haddock reader to use haddock-library (#1346).
This brings pandoc's rendering of haddock markup in line with the new haddock.
Fixed line breaks in @ code blocks.
alex and happy are no longer build-depends.
Added Text.Pandoc.Compat.Directory to allow building against different versions of the directory library.
Added Text.Pandoc.Compat.Except to allow building against different versions of mtl.
Code cleanup in some writers, using Reader monad to avoid passing options parameter around (Matej Kollar).
Improved readability in pandoc.hs.
Miscellaneous code cleanups (Artyom Kazak).
Avoid import Prelude hiding (catch) (#1309, thanks to Michael Thompson).
Changed http-conduit flag to https. Depend on http-client and http-client-tls instead of http-conduit. (Note: pandoc still depends on conduit via yaml.)
Require highlighting-kate >= 0.5.8.5 (#1271, #1317, Debian #753299). This change to highlighting-kate means that PHP fragments no longer need to start with <?php. It also fixes a serious bug causing failures with ocaml and fsharp.
Require latest texmath. This fixes \tilde{E} and allows \left to be used with ], ) etc. (#1319), among many other improvements.
Require latest zip-archive. This has fixes for unicode path names.
Added tests for plain writer.
Text.Pandoc.Templates:
Fail informatively on template syntax errors. With the move from parsec to attoparsec, we lost good error reporting. In fact, since we weren't testing for end of input, malformed templates would fail silently. Here we revert back to Parsec for better error messages.
Use ordNub (#1022).
Benchmarks:
Made benchmarks compile again (Artyom Kazak).
Fixed so that the failure of one benchmark does not prevent others from running (Artyom Kazak).
Use nfIO instead of the getLength trick to force full evaluation.
Changed benchmark to use only the test suite, so that benchmarks run more quickly.
Windows build script:
Add -windows to file name.
Use one install command for pandoc, pandoc-citeproc.
Force install of pandoc-citeproc.
make_osx_package: Call zip file pandoc-VERSION-osx.zip. The zip should not be named SOMETHING.pkg.zip, or OSX finder will extract it into a folder named SOMETHING.pkg, which it will interpret as a defective package (#1308).
README:
Made headers for all extensions so they have IDs and can be linked to (Beni Cherniavsky-Paskin).
Fixed typos (Phillip Alday).
Fixed documentation of attributes (#1315).
Clarified documentation on small caps (#1360).
Better documentation for fenced_code_attributes extension (Caleb McDaniel).
Documented fact that you can put YAML metadata in a separate file (#1412).
