-
--resource-path now accumulates if specified multiple times (#6152). Resource paths specified later on the command line are prepended to those specified earlier. Thus, --resource-path foo --resource-path bar:baz is equivalent to --resource-path bar:baz:foo. (The previous behavior was for the last --resource-path to replace all the rest.) resource-path in defaults files behaves the same way: it will be prepended to the resource path set by earlier command line options or defaults files. This change facilitates the use of multiple defaults files: each can specify a directory containing resources it refers to without clobbering the resource paths set by the others.
-
Allow defaults files to refer to the home directory, the user data directory, and the directory containing the defaults file itself (#5871, #5982, #5977). In fields that expect file paths (and only in these fields), ${VARIABLE} will expand to the value of the environment variable VARIABLE (and in particular ${HOME} will expand to the path of the home directory). A warning will be raised for undefined variables. ${USERDATA} will expand to the path of the user data directory in force when the defaults file is being processed. ${.} will expand to the directory containing the defaults file. (This allows defaults files to be placed in a directory containing resources they make use of.)
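The expansion rules for path fields in defaults files can be pictured with a rough Python sketch. This is not pandoc's implementation: the function name expand_path and its parameters are hypothetical, and substituting an empty string for an undefined variable is an assumption (the changelog only says that a warning is raised).

```python
import os
import re
import warnings


def expand_path(value, defaults_file, user_data_dir, env=None):
    """Expand ${VAR}, ${USERDATA} and ${.} in a path-valued field.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    # ${.} refers to the directory that contains the defaults file.
    here = os.path.dirname(os.path.abspath(defaults_file))

    def substitute(match):
        name = match.group(1)
        if name == ".":
            return here
        if name == "USERDATA":
            return user_data_dir
        if name in env:
            return env[name]
        # Assumption: warn and substitute nothing for undefined variables.
        warnings.warn(f"undefined variable: {name}")
        return ""

    return re.sub(r"\$\{([^}]+)\}", substitute, value)
```

For example, with HOME set to /home/me, the field value ${HOME}/styles/my.csl would expand to /home/me/styles/my.csl.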
-
When downloading content from URL arguments, be sensitive to the character encoding (#5600). We can properly handle UTF-8 and latin1 (ISO-8859-1); for others we raise an error. Fall back to latin1 if no charset is given in the mime type and UTF-8 decoding fails.
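The decoding policy described above can be sketched in Python as follows. The function name decode_body and the accepted spellings of the charset names are assumptions made for illustration; pandoc's own code is Haskell and may differ in detail.

```python
def decode_body(raw: bytes, charset=None) -> str:
    """Decode downloaded bytes following the policy in the entry above.

    Hypothetical helper; `charset` is the value taken from the MIME type,
    or None when the MIME type does not specify one.
    """
    if charset is None:
        # No charset given: try UTF-8 first, then fall back to latin1.
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            return raw.decode("latin-1")
    name = charset.strip().lower()
    if name in ("utf-8", "utf8"):
        return raw.decode("utf-8")
    if name in ("latin1", "latin-1", "iso-8859-1"):
        return raw.decode("latin-1")
    # Any other charset is reported as an error rather than guessed at.
    raise ValueError(f"unsupported charset: {charset}")
```

Since every byte sequence is valid latin1, the fallback never fails; it only risks showing the wrong characters when the content was in some other encoding.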
-
Allow abbreviations that don’t end in a period to be specified using --abbreviations (#7124).
-
Add new unexported module Text.Pandoc.XML.Light, as well as Text.Pandoc.XML.Light.Types, Text.Pandoc.XML.Light.Proc, Text.Pandoc.XML.Light.Output. (Closes #6001, #6565, #7091).
This module exports definitions of Element and Content that are isomorphic to xml-light’s, but with Text instead of String. This allows us to keep most of the code in existing readers that use xml-light, but avoid lots of unnecessary allocation.
We also add versions of the functions from xml-light’s Text.XML.Light.Output and Text.XML.Light.Proc that operate on our modified XML types, and functions that convert xml-light types to our types (since some of our dependencies, like texmath, use xml-light).
We export functions that use xml-conduit’s parser to produce an Element or [Content]. This allows existing pandoc code to use a better parser without much modification.
The new parser is used in all places where xml-light’s parser was previously used. Benchmarks show a significant performance improvement in parsing XML-based formats (with docbook, opml, jats, and docx almost twice as fast, odt and fb2 more than twice as fast).
In addition, the new parser gives us better error reporting than xml-light. We report XML errors, when possible, using the new PandocXMLError constructor in PandocError.
These changes revealed the need for some changes in the tests. The docbook-reader.docbook test lacked definitions for the entities it used; these have been added. And the docx golden tests have been updated, because the new parser does not preserve the order of attributes.
-
DocBook reader:
- Avoid expensive tree normalization step, as it is not necessary with the new XML parser.
- Support informalfigure (#7079) (Nils Carlson).
-
Docx reader:
- Use Map instead of list for Namespaces. This gives a speedup of about 5-10%. With this and the XML parsing changes, the docx reader is now about twice as fast as in the previous release.
-
HTML reader:
- Small performance tweaks.
- Also, remove exported class NamedTag(..) [API change]. This was just intended to smooth over the transition from String to Text and is no longer needed.
- As a result, the functions isInlineTag and isBlockTag are no longer polymorphic; they apply to a Tag Text [API change].
- Do a lookahead to find the right parser to use. This takes benchmarks from 34ms to 23ms, with less allocation.
- Fix bad handling of empty src attribute in iframe (#7099). If src is empty, we simply skip the iframe. If src is invalid or cannot be fetched, we issue a warning and skip instead of failing with an error.
-
JATS reader:
- Avoid tree normalization, which is no longer necessary given the new XML parser.
-
LaTeX reader:
- Don’t export tokenize, untokenize [API change]. These are internal implementation details, which were only exported for testing. They don’t belong in the public API.
- Improved efficiency of the parser. With these changes the reader is almost twice as fast as in the last release in our benchmarks.
- Code cleanup, removing some unnecessary things.
- Rewrite withRaw so it doesn’t rely on fragile assumptions about token positions (which break when macros are expanded) (#7092). This requires the addition of sEnableWithRaw and sRawTokens in LaTeXState, and a new combinator disablingWithRaw to disable collecting of raw tokens in certain contexts. Add parseFromToks to Text.Pandoc.Readers.LaTeX.Parsing. Fix parsing of single character tokens so it doesn’t mess up the new raw token collecting. These changes slightly increase allocations and have a small performance impact.
- Handle some bibtex/biblatex-specific commands that used to be dealt with in pandoc-citeproc (#7049).
- Optimize satisfyTok, avoiding unnecessary macro expansion steps. Benchmarks after this change show 2/3 of the run time and 2/3 of the allocation of the Feb. 10 benchmarks.
- Removed sExpanded in state. This isn’t actually needed and checking it doesn’t change anything.
- Improve braced'. Remove the parameter, have it parse the opening brace, and make it more efficient.
- Factor out pieces of the LaTeX reader to make the module smaller. This reduces memory demands when compiling. Created Text.Pandoc.Readers.LaTeX.{Math,Citation,Table,Macro,Inline}. Changed Text.Pandoc.Readers.LaTeX.SIunitx to export a command map instead of individual commands.
- Handle table cells containing & in \verb (#7129).
-
Make Text.Pandoc.Readers.LaTeX.Types an unexported module [API change].
-
Markdown reader:
- Improved handling of mmd link attributes in references (#7080). Previously they only worked for links that had titles.
- Improved efficiency of the parser (benchmarks show a 15% speedup).
-
OPML reader:
- Avoid tree normalization, which is no longer necessary with the new XML parser.
-
ODT reader:
- Finer-grained errors on parse failure (#7091).
- Give more information if the zip container can’t be unpacked.
-
Org reader:
- Support task_lists extension (Albert Krewinkel, #6336).
- Fix bug in org-ref citation parsing (Albert Krewinkel, #7101). The org-ref syntax allows listing multiple citations separated by commas. Previously commas were accepted as part of the citation id, so all citation lists were parsed as one single citation.
-
RST reader:
- Use getTimestamp instead of getCurrentTime to fetch timestamp. Setting SOURCE_DATE_EPOCH will allow reproducible builds.
- Fix handling of header in CSV tables (#7064). The interpretation of this line is not affected by the delim option.
-
Jira reader:
-
Text.Pandoc.Shared
- Remove formerly exported functions that are no longer used in the code base: splitByIndices, splitStringByIndicies, substitute, and underlineSpan (which had been deprecated in April 2020) [API change].
- Export handleTaskListItem (Albert Krewinkel) [API change].
- Change defaultUserDataDirs to defaultUserDataDir [API change]. We determine what is the default user data directory by seeing whether the XDG directory and/or legacy directory exist.
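One plausible reading of the defaultUserDataDir rule, as a Python sketch. The precedence shown (prefer the XDG directory unless only the legacy one exists) is an assumption, not something this changelog states, and the function and parameter names are hypothetical.

```python
import os


def default_user_data_dir(xdg_dir, legacy_dir):
    """Pick a single user data directory from two candidates.

    Assumed precedence: use the legacy directory only when it exists
    and the XDG directory does not; otherwise use the XDG directory.
    """
    if not os.path.isdir(xdg_dir) and os.path.isdir(legacy_dir):
        return legacy_dir
    return xdg_dir
```

Under this reading, a fresh system with neither directory present resolves to the XDG location.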
-
BibTeX writer:
- Use doclayout and doctemplate. This change allows bibtex/biblatex output to wrap as other formats do, depending on the settings of --wrap and --columns (#7068).
-
CSL JSON writer:
- Output [] if no references in input, instead of raising a PandocAppError as before.
-
Docx writer:
- Use getTimestamp instead of getCurrentTime for timestamp. Setting SOURCE_DATE_EPOCH will allow reproducible builds.
-
EPUB writer:
- Use getTimestamp instead of getCurrentTime for timestamp. Setting SOURCE_DATE_EPOCH will allow reproducible builds (#7093). This does not suffice to fully enable reproducible builds in EPUB, since a unique id is still being generated for each build.
- Support belongs-to-collection metadata (#7063) (Nick Berendsen).
-
JATS writer:
- Escape special chars in reference elements (Albert Krewinkel). Prevents the generation of invalid markup if a citation element contains an ampersand or another character with a special meaning in XML.
-
Jira writer:
- Use Span identifiers as anchors (Albert Krewinkel).
- Use {noformat} instead of {code} for unknown languages (Albert Krewinkel). Code blocks which are not marked as a language supported by Jira are rendered as preformatted text via {noformat} blocks.
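The choice between {code} and {noformat} can be illustrated with a small Python sketch. The set JIRA_LANGUAGES is a hypothetical, deliberately short stand-in for the languages Jira actually supports, and jira_code_block is not a pandoc function.

```python
# Hypothetical subset for illustration; the real list of languages
# that Jira can highlight is longer.
JIRA_LANGUAGES = {"java", "python", "sql", "xml"}


def jira_code_block(code, language=None):
    """Render a code block in Jira wiki markup."""
    if language in JIRA_LANGUAGES:
        # Known language: use a {code} block with the language parameter.
        return "{code:" + language + "}\n" + code + "\n{code}"
    # Unknown or missing language: fall back to preformatted text.
    return "{noformat}\n" + code + "\n{noformat}"
```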
-
LaTeX writer:
- Adjust hypertargets to beginnings of paragraphs (#7078). Use \vadjust pre so that the hypertarget takes you to the beginning of the paragraph rather than one line down. This makes a particular difference for links to citations using --citeproc and link-citations: true.
- Change BCP47 lang tag from jp to ja (Mauro Bieg, #7047).
- Use function instead of map for accent lookup (should be more efficient).
- Split the module to make it easier to compile on low-memory systems: added Text.Pandoc.Writers.LaTeX.{Util,Citation,Lang}.
-
Markdown writer:
- Handle math right before digit. We insert an HTML comment to avoid a $ right before a digit, which pandoc will not recognize as a math delimiter.
- Split the module to make it easier to compile on low-memory systems: added Text.Pandoc.Writers.Markdown.{Types,Inline}.
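The "math right before digit" case can be pictured with a simplified Python sketch. The inline representation (a list of kind/content pairs), the function name render_inline, and the exact comment text are all assumptions made for illustration; the writer itself works on pandoc's AST.

```python
def render_inline(parts):
    """Render a list of ("math", tex) and ("str", text) pairs as Markdown.

    Hypothetical, simplified model of a Markdown writer.
    """
    out = []
    for i, (kind, content) in enumerate(parts):
        if kind == "math":
            out.append("$" + content + "$")
            nxt = parts[i + 1] if i + 1 < len(parts) else None
            # A closing $ directly followed by a digit would not be read
            # back as a math delimiter, so separate them with a comment.
            if nxt is not None and nxt[0] == "str" and nxt[1][:1].isdigit():
                out.append("<!-- -->")
        else:
            out.append(content)
    return "".join(out)
```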
-
ODT writer:
- Use getTimestamp instead of getCurrentTime for timestamp. Setting SOURCE_DATE_EPOCH will allow reproducible builds.
- Update default ODT style (Lorenzo). Previously, the “First paragraph” style inherited from “Standard” but not from “Text body.” Now it is adjusted to inherit from “Text body”, to avoid some ugly spacing issues. It may be necessary to update a custom reference.odt in light of this change.
-
Org writer:
- Support task_lists extension (Albert Krewinkel, #6336).
-
Pptx writer:
- Use getTimestamp instead of getCurrentTime for timestamp. Setting SOURCE_DATE_EPOCH will allow reproducible builds.
-
JATS templates: tag author.name as string-name (Albert Krewinkel). Partitioning the components of a name into surname, given names, etc. is not always possible, or the components may not be available. Using author.name allows giving the full name as a fallback to be used when author.surname is not available.
-
Add default templates for bibtex and biblatex, so that the variables header-include, include-before, include-after (or alternatively the command line options --include-in-header, --include-before-body, --include-after-body) may be used.
-
LaTeX template:
-
revealjs template: Add ‘center’ option for vertical slide centering. (maurerle, #7104).
-
Text.Pandoc.XML: Improve efficiency of fromEntities.
-
Text.Pandoc.MIME
- Add exported function getCharset [API change].
-
Text.Pandoc.UTF8: change IO functions to return Text, not String [API change]. This affects readFile, getContents, writeFileWith, writeFile, putStrWith, putStr, putStrLnWith, putStrLn, hPutStrWith, hPutStr, hPutStrLnWith, hPutStrLn, hGetContents. This avoids the need to uselessly create a linked list of characters when emitting output.
-
Text.Pandoc.App
- Add parseOptionsFromArgs [API change, new exported function].
- Add fields for CSL options to Opt [API change]: optCSL, optBibliography, optCitationAbbreviations.
-
Text.Pandoc.Citeproc.BibTeX
- Text.Pandoc.Citeproc.writeBibTeXString now returns Doc Text instead of Text (#7068).
- Correctly handle pages (= page in CSL) (#7067).
- Correctly handle BibLaTeX langid (= language in CSL, #7067).
- In BibTeX output, protect foreign titles since there’s no language field (#7067).
- Clean up BibTeX parsing (#7049). Previously there was a messy code path that gave strange results in some cases, not passing through raw tex but trying to extract a string content. This was an artefact of trying to handle some special bibtex-specific commands in the BibTeX reader. Now we just handle these in the LaTeX reader and simplify parsing in the BibTeX reader. This does mean that more raw tex will be passed through (and currently this is not sensitive to the raw_tex extension; this should be fixed).
-
Text.Pandoc.Citeproc.MetaValue
- Correctly parse “raw” date value in markdown references metadata. (See jgm/citeproc#53.)
-
Text.Pandoc.Citeproc
- Use https URLs for links (Salim B, #7122).
-
Text.Pandoc.Class
- Add getTimestamp [API change]. This attempts to read the SOURCE_DATE_EPOCH environment variable and parse a UTC time from it (treating it as a unix date stamp, see https://reproducible-builds.org/specs/source-date-epoch/). If the variable is not set or can’t be parsed as a unix date stamp, then the function returns the current date.
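The fallback behavior of getTimestamp can be pictured with the following Python sketch. It is an illustration of the described logic rather than pandoc's Haskell code; the env parameter is a hypothetical addition that makes the function easy to exercise without touching the real environment.

```python
import os
from datetime import datetime, timezone


def get_timestamp(env=None):
    """Return a UTC time from SOURCE_DATE_EPOCH, else the current time."""
    env = os.environ if env is None else env
    raw = env.get("SOURCE_DATE_EPOCH")
    if raw is not None:
        try:
            # Interpret the value as seconds since the unix epoch.
            return datetime.fromtimestamp(int(raw), tz=timezone.utc)
        except (ValueError, OverflowError, OSError):
            # Unparseable or out-of-range value: fall through.
            pass
    return datetime.now(timezone.utc)
```

Running a build with SOURCE_DATE_EPOCH=1609459200 would then stamp the output with 2021-01-01 00:00:00 UTC regardless of when the build happens.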
-
Text.Pandoc.Error
- Add PandocUnsupportedCharsetError constructor for PandocError [API change].
- Export renderError [API change].
- Refactor handleError to use renderError. This allows us to render error messages without exiting.
-
Text.Pandoc.Extensions
-
Lua subsystem:
- Always load built-in Lua scripts from default data-dir (Albert Krewinkel). The Lua modules pandoc and pandoc.List are now always loaded from the system’s default data directory. Loading from a different directory by overriding the default path, e.g. via --data-dir, is no longer supported to avoid unexpected behavior and to address security concerns.
- Add module “pandoc.path” (Albert Krewinkel, #6001, #6565). The module makes it possible to work with file paths in a convenient and platform-independent manner.
- Use strict evaluation when retrieving AST value from the stack (Albert Krewinkel, #6674).
-
Text.Pandoc.PDF
- Disable smart extension when building PDF via LaTeX. This is to prevent accidental creation of ligatures like ?` and !` (especially in languages with quotations like German), and similar ligature issues. (See jgm/citeproc#54.)
-
Text.Pandoc.CSV:
- Fix parsing of unquoted values (#7112). Previously we didn’t allow unescaped quotes in unquoted values, but they are allowed in CSV.
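As an illustration of the intended behavior (not of pandoc's parser), Python's standard csv module follows the same convention: a quote character inside an unquoted value is kept as ordinary text. The helper name parse_row is hypothetical.

```python
import csv


def parse_row(line):
    """Parse a single CSV line into a list of field values."""
    return next(csv.reader([line]))
```

Here parse_row('a,b"c,d') yields the three fields a, b"c and d, while a field that starts with a quote, as in '"a,b",c', is still treated as a quoted value.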
-
Test suite:
- Use a more robust method for testing the executable. Many of our tests require running the pandoc executable. This is problematic for a few different reasons. First, cabal-install will sometimes run the test suite after building the library but before building the executable, which means the executable isn’t in place for the tests. One can work around that by first building, then building and running the tests, but that’s fragile. Second, we have to find the executable. So far, we’ve done that using a function findPandoc that attempts to locate it relative to the test executable (which can be located using findExecutablePath). But the logic here is delicate and doesn’t work with every combination of options. To solve both problems, we add an --emulate option to the test-pandoc executable. When --emulate occurs as the first argument passed to test-pandoc, the program simply emulates the regular pandoc executable, using the rest of the arguments (after --emulate). Thus, test-pandoc --emulate -f markdown -t latex is just like pandoc -f markdown -t latex. Since all the work is done by library functions, implementing this emulation just takes a couple lines of code and should be entirely reliable. With this change, we can test the pandoc executable by running the test program itself (locatable using findExecutablePath) with the --emulate option. This removes the need for the fragile findPandoc step, and it means we can run our integration tests even when we’re just building the library, not the executable. [Note: part of this change involved simplifying some complex handling to set environment variables for dynamic library paths. I have tested a build with --enable-dynamic-executable, and it works, but further testing may be needed.]
- Print accurate location if a test fails (Albert Krewinkel). Ensures that tasty-hunit reports the location of the failing test instead of the location of the helper test function.
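The --emulate dispatch described for test-pandoc above amounts to a check on the first argument. A minimal Python sketch of the idea, in which pandoc_main and run_tests are hypothetical stand-ins for the library entry point and the test runner:

```python
def pandoc_main(args):
    """Stand-in for the library function the real executable calls."""
    return "pandoc " + " ".join(args)


def run_tests(args):
    """Stand-in for the normal test-suite entry point."""
    return "running test suite"


def main(argv):
    # Only a leading --emulate switches the program into emulation mode;
    # the remaining arguments are passed on unchanged.
    if argv[:1] == ["--emulate"]:
        return pandoc_main(argv[1:])
    return run_tests(argv)
```

Because the test program can always locate itself, the tests no longer depend on finding a separately built executable.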
-
Documentation: Update URLs and use https where possible (#7122, Salim B).
-
Add doc/libraries.md, a description of libraries that support pandoc.
-
MANUAL.txt
- MANUAL: block-level formatting is not allowed in line blocks (#7107).
- Clarify tex_math_dollars extension. Note that no blank lines are allowed between the delimiters in display math.
- Add MANUAL section on reproducible builds.
- Document no template fallback for absolute path (#7077, Nixon Enraght-Moony.)
- Improve docs for cite-method.
- Update README and man page.
-
Makefile: in make bench, create CSV files for comparison and compare against previous benchmark run. Add timestamp to CSV filenames.
-
cabal.project: don’t explicitly set -trypandoc. If we do, this can’t be overridden on the cabal command line.
-
doc/lua-filters.md: improve documentation for pandoc.mediabag.insert, pandoc.mediabag.fetch, directory, normalize (Albert Krewinkel).
-
Allow base64-bytestring-1.2.* (Dmitrii Kovanikov)
-
Require jira-wiki-markup 1.3.3 (Albert Krewinkel)
-
Require citeproc 0.3.0.8, which correctly titlecases when titles contain non-ASCII characters.
-
Use skylighting 0.10.4. This version of skylighting uses xml-conduit rather than hxt. This speeds up parsing of XML syntax definitions fourfold, and removes four packages from pandoc’s dependency graph: hxt-charproperties, hxt-unicode, hxt-regex-xmlschema, hxt.
-
Add script tools/parseTimings.pl to help pin down which modules take the most time and memory to compile.
-
Avoid unnecessary use of NoImplicitPrelude pragma (#7089) (Albert Krewinkel)
-
Benchmarks
- Use the lighter-weight tasty-bench instead of criterion.
- Run writer benchmarks for binary formats too.
- Alphabetize benchmarks.
- Don’t run benchmarks for bibliography formats (yet; we need a special input for them).
- Show allocation data
- Clean up benchmark code.
- Allow specifying patterns using -p blah.
-
trypandoc: add 2 second timeout.
-
Use -split-sections in creating linux release binary. This reduces executable size significantly (by about 30%).
-
Remove weigh-pandoc. It’s not really useful any more, now that our regular benchmarks include data on allocation.
-
Improve linux package build process and add script to automate building an arm64 binary package.