Eric Rahm

A Rust-based XML parser for Firefox

Goal: Replace Gecko’s XML parser, libexpat, with a Rust-based XML parser

Firefox currently uses an old, trimmed down, and slightly modified version of libexpat, a library written in C, to support parsing of XML documents. These files include plain old XML on the web, XSLT documents, SVG images, XHTML documents, RDF, and our own XUL UI format. While it’s served it’s job well it has long been unmaintained and has been a source of many security vulnerabilities, a few of which I’ve had the pleasure of looking into. It’s 13,000 lines of rather hard to understand code and tracing through everything when looking into security vulnerabilities can take days at a time.

It’s time for a change. I’d like us to switch over to a Rust-based XML parser to help improve our memory safety. We’ve done this already with at least two other projects: an mp4 parser, and a url parser. This seems to fit well into that mold: a standalone component with past security issues that can be easily swapped out.

There have been suggestions adding full XML 1.0 v5 support, there’s a 6-year old proposal to rewrite our XML stack which doesn’t include replacing expat, there’s talk of the latest and greatest, but not quite fully speced, XML5. These are all interesting projects, but they’re large efforts. I’d like to see us make a reasonable change now.

What do we want?

In order to avoid scope creep and actually implement something in the short term I just want a library we can drop in that has parity with the features of libexpat that we currently use. That means:

Why do we need UTF-16?

Short answer: That’s how our current XML parser stack works.

Slightly longer answer: In Firefox libexpat is wrapped by nsExpatDriver which implements nsITokenizer. nsITokenizer uses nsScanner which exposes the data it wraps as UTF-16 and takes in nsAString, which as you may have guessed is a wide string. It can also read in c-strings, but internally it performs a character conversion to UTF-16. On the other side all tokenized data is emitted as UTF-16 so all consumers would need to be updated as well. This extends further out, but hopefully that’s enough to explain that for a drop-in replacement it should support UTF-16.

What don’t we need?

We can drop the complexity of our parser by excluding parts of expat or more modern parsers that we don’t need. In particular:

What are our options?

There are three Rust-based parsers that I know of, none of which quite fit our needs:

Where do we go from here?

My recommendation is to implement our own parser that fits the needs and use cases of Firefox specifically. I’m not saying we’d necessarily start from scratch, it’s possible we could fork one of the existing libraries or just take inspiration from a little bit of all of them, but we have rather specific requirements that need to be met.