pulldown_cmark, a fast pull parser for the CommonMark markdown standard written in Rust, has just seen its 0.3 release. It marks a milestone for renewed CommonMark test compliance and performance. We are now inviting existing and new users of the crate to try the new release.
new_algo branch was started way back in January of 2017 by crate maintainer Raph Levien. It has since gone through alternating periods of development and inactivity, receiving contributions from people such as Will Turner, Michael Howell and Scott Abbey.
In this blog post, we'll have a quick look at the goals of the rewrite, how they were achieved and what's next for the crate.
goals of the new branch
The goals of the rewrite were the following, ordered roughly by priority:
- restore full compliance with the Commonmark specification, now at version 0.28
- maintain backward compatibility with previous releases
- remove quadratic time behaviours and other poorly scaling code
- uphold or improve overall performance
One of the key design decisions for the initial releases of pulldown_cmark was to have a pull based parser. This meant that it emitted events, roughly corresponding to the opening and closing of elements in the traversal of the document tree, on request, each time advancing the parser through the document just far enough to produce another event. Thus the parser only kept track of relevant local information and did not need to construct a full document tree before it could start emitting events. An added benefit was that when a consumer was only interested in part of the document, they could simply stop pulling events at some point and no unnecessary work would have been done.
This approach worked fairly well and made pulldown_cmark one of the faster CommonMark parsers around. However, to parse certain markdown structures correctly, some context is required which may lay ahead of it in the document. For example, it is possible to refer to a link definition further down the page like this:
The pulldown_cmark crate is a library written in the [Rust] programming language. [rust]: https://rust-lang.org
In order to emit a link with the proper URL, we need to scan to the end of the snippet. Previously, some trickery had to be done to properly handle these cases.
Scanning only once also created performance pitfalls. Consider markdown's codeblocks for example, which are delimited by sequences of
`. With a one-scan design it's necessary to scan ahead to see whether there is a closing sequence for an opening sequence of backticks. But since we know nothing about the text that is to come, we may need to look until the end of the document to find the closing sequence. In the worst case, we may be scanning very far very frequently which is detrimental to performance: the time to parse a document may scale quadratically with the length of the document, which means that parsing time may grow out of control rapidly in some cases. This usually only happens in specially crafted examples, but this opens up applications to
denial-of-service attacks when they're parsing documents in an untrusted environment like the web. Preventing that simply wasn't feasible with the previous architecture.
An important goal of
new_algo branch was to provide a better foundation to tackle these problems. This branch introduced a two-phase design, with the first phase constructing an abstract syntax tree (AST) of the document, which an in-memory datastructure representing the blocks (paragraphs, codeblocks, lists, etc.) that make up the document and their nesting.
The second phase then traverses this tree, parses inline elements in details and generates the right events along the way. The idea is that by doing a quick first pass, we have a context which we can use in the second pass, eliminating the need to look ahead since we already know what's there! Some of these denial-of-service vulnerabilities still exist in 0.3, but they should be much
easier to iron out.
a need for speed
To make the construction and traversal of the AST fast, the crate uses an arena-backed tree implementation similar to the indextree crate. That means that all the nodes of the tree are laid out contiguously in memory and all references to nodes are just indices into the backing vector. This is super good for cache locality and insertion speed — we can just append the vector. Tree indices are then essentially just values of
0 represent the Nil pointer. To prevent using nil pointers by accident, pointers are encoded in the type system as
Option<NonZeroUsize>, which gets optimized to a single word!
Another substantial way to improve performance is to never copy or allocate memory when it's not strictly necessary. For example,
text nodes will usually be just be references into the source text. Copying those references around is much cheaper than
allocating memory and copying the text itself. The same is done for most other text-like data, like link URLs or their titles.
Generally, only when text needs to be (un)escaped, memory is allocated to store these strings. The new release introduces
a new copy-on-write string type that can store small strings inline, further reducing the need for allocations. Since this
type is exposed in the public API, this does constitute a compatibility break. However, it behaves very similarly to the
Cow<str> that was
previously used and we expect that most users will experience no breakage.
Finally, we noted that a large proportion of time is spent scanning text for special characters that could indicate the start of a new elements like links, lists or tables. This is done by looking at all bytes subsequently and case splitting on their value. Especially when documents contain large passages of text without any such special structures, this isn't optimal. We've started experimenting with SIMD instructions to quickly skip such passages. Instead of looking at a single byte at a time, we can use these special instructions to inspect up to 32 bytes simultaneously. The first results are promising, but since these improvements can be introduced without backward compatibility breaks, 0.3 has been released before these SIMD changes landed.
On typical workloads, the 0.3 release achieves around 70% higher throughput than the previous release, and it should only get faster!
The crate is now at a point where it once again passes all 633 CommonMark tests, the GFM's table extension tests, as well as the footnotes test set. Next, the plan is to find any issues that the tests haven't caught and fix them. To do that, we could
really use your help! Are you already using pulldown_cmark? Then upgrading should be a breeze. Just change the relevant line in your project's
pulldown_cmark = "0.3" and you should be all set.
Please let us know when you find any issues related to compatibility breaks, correctness, performance or otherwise in the project's issue tracker.