Packrat Parsing: a Practical Linear-Time Algorithm with Backtracking

Bryan Ford
Master's Thesis
Massachusetts Institute of Technology

Abstract

Packrat parsing is a novel and practical method for implementing linear-time parsers for grammars defined in Top-Down Parsing Language (TDPL). While TDPL was originally created as a formal model for top-down parsers with backtracking capability, this thesis extends TDPL into a powerful general-purpose notation for describing language syntax, providing a compelling alternative to traditional context-free grammars (CFGs). Common syntactic idioms that cannot be represented concisely in a CFG are easily expressed in TDPL, such as longest-match disambiguation and "syntactic predicates," making it possible to describe the complete lexical and grammatical syntax of a practical programming language in a single TDPL grammar.

Packrat parsing is an adaptation of a 30-year-old tabular parsing algorithm that was never put into practice until now. A packrat parser can recognize any string defined by a TDPL grammar in linear time, providing the power and flexibility of a backtracking recursive descent parser without the attendant risk of exponential parse time. A packrat parser can recognize any LL(k) or LR(k) language, as well as many languages requiring unlimited lookahead that cannot be parsed by shift/reduce parsers. Packrat parsing also provides better composition properties than LL/LR parsing, making it more suitable for dynamic or extensible languages. The primary disadvantage of packrat parsing is its storage cost, which is a constant multiple of the total input size rather than being proportional to the nesting depth of the syntactic constructs appearing in the input.

Monadic combinators and lazy evaluation enable elegant and direct implementations of packrat parsers in recent functional programming languages such as Haskell. Three different packrat parsers for the Java language are presented here, demonstrating the construction of packrat parsers in Haskell using primitive pattern matching, using monadic combinators, and by automatic generation from a declarative parser specification. The prototype packrat parser generator developed for the third case itself uses a packrat parser to read its parser specifications, and supports full TDPL notation extended with "semantic predicates," allowing parsing decisions to depend on the semantic values of other syntactic entities. Experimental results show that all of these packrat parsers run reliably in linear time, efficiently support "scannerless" parsing with integrated lexical analysis, and provide the user-friendly error-handling facilities necessary in practical applications.

Full Thesis

In PDF or PostScript

Pappy: a Parser Generator for Haskell

The full source code for Pappy, the prototype packrat parser generator described in the thesis, is available for browsing in this directory, or for downloading as a gzipped tar file. A brief breakdown of the source files follows:

Example Arithmetic Expression Parsers

Following are complete versions of the example parsers for the trivial arithmetic expression language used in the thesis:

Example Java Language Parsers

The three complete and working parsers for the Java language, which are described in the paper and used for analysis and comparison purposes, are available here:

The test suite of Java source files used to obtain the experimental results in the thesis are available in this gzipped tar file. All of these Java source files were taken from Cryptix version 3.2.0.

Enjoy!


Bryan Ford's Home Page