The "TypesTest" project

Multiple grammar support

It is possible to load multiple grammars simultaneously and to embed rules from one grammar inside another ("composable" grammars). When first parsed, every grammar must begin with a "%lang name" directive specifying the language name (later extensions to the grammar do not need this directive). Rules and templates from other grammars are accessed with the "." operator. Example of using rules from language "B" within language "A":

%lang A
a: "b" ":" B.b

At the time rule A.a is parsed, language B must exist (although rule "B.b" need not exist yet). Parsing of embedded rules is implemented by dynamically discovering which dfa the lexer should run. Pseudocode for the algorithm:

parse() {
    reduce()         // <-+ Swapped from the order in src/lib
    tokens = lex()   // <-+ Is this a problem? (I don't see one)
}

lex() {
  foreach stack top z:
    foreach possible shift sym s in z.shift:
      if s.grammar hasn't been checked yet:
        toks = rundfa(s.grammar.dfa)
        if toks != nil:
          return toks
  return nil
}
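The lex() pseudocode above might be sketched in runnable form as follows. Grammar, Sym, and StackTop are hypothetical stand-ins for the real parser structures, and run_dfa is a toy matcher rather than the real dfa runner:

```python
from dataclasses import dataclass

# Hypothetical stand-ins: a Grammar owns a "dfa" (here a tuple of
# (token_name, matcher) pairs), a shift symbol knows its grammar, and a
# stack top lists its possible shift symbols.

@dataclass(frozen=True)
class Grammar:
    name: str
    dfa: tuple

@dataclass
class Sym:
    grammar: Grammar

@dataclass
class StackTop:
    shift: list

def run_dfa(dfa, text, pos):
    """Try each matcher in `dfa` at `pos`; return tokens or None."""
    for tok_name, match in dfa:
        length = match(text, pos)
        if length:
            return [(tok_name, text[pos:pos + length])]
    return None

def lex(stack_tops, text, pos):
    """Pick the dfa to run by scanning the shift symbols of every stack
    top, trying each grammar's dfa at most once per call."""
    checked = set()
    for z in stack_tops:
        for s in z.shift:
            if s.grammar not in checked:
                checked.add(s.grammar)
                toks = run_dfa(s.grammar.dfa, text, pos)
                if toks is not None:
                    return toks
    return None
```

As in the pseudocode, the first grammar whose dfa tokenizes the input wins, which is exactly why the ambiguity described below is not supported.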

This algorithm assumes that the input can be tokenized properly by only one grammar at each point. Ambiguity in the token stream is not supported. Example:

%lang S
s: "v" "=" M.e ";"
s: "v" "=" P.e ";"

%lang M
e: "0"
e: e "*" e

%lang P
e: "0"
e: e "+" e

# Text to parse (starting at S.s)
v = 0 + 0 ;

# Tree should be
(S.s "v" "=" (P.e (P.e "0") "+" (P.e "0")) ";")

# Tree when M's dfa is tried first
(S.s "v" "=" (M.e (M.e "0") @error!

After reaching the "=", the parser detects a dfa ambiguity. If it tries M's dfa first, it will tokenize "0" and then expect a "*" token. However, a "+" token follows, so the parse fails.
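A toy illustration of this failure mode. The real dfas are table-driven; here each "dfa" is reduced to just the set of literal tokens it accepts, which is enough to show why committing to M first loses:

```python
# Once a dfa is chosen it must tokenize the rest of the rule's input;
# both M and P accept the leading "0", so whichever is tried first wins.
M_DFA = {"0", "*"}   # tokens of %lang M
P_DFA = {"0", "+"}   # tokens of %lang P

def tokenize(dfa, words):
    """Tokenize `words` with `dfa`; None means the parse fails."""
    toks = []
    for w in words:
        if w not in dfa:
            return None   # dfa cannot tokenize this word -> @error
        toks.append(w)
    return toks
```

Running both orders on the right-hand side of "v = 0 + 0 ;" shows M failing at "+" while P succeeds, matching the two trees above.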

The major problem with supporting token ambiguity is that the file pointer can differ in each stack top. This is too hard to solve for now, but would be interesting to revisit some day.

Templated Grammar Rules

Grammar rules can be declared as templates. Their main use is to define the pattern-matching syntax, so that is the running example:

# extensions to C
expr: expr "~" astmatch
expr: expr "~" astname
expr: "`" astmatch

astmatch(N,R): astname(N) "{" R "}"

%lang softid
id(N): N

The expr extensions show how templated rules are used within non-templated rules. The ast extensions show how to declare templates. All instances of a template must have the same number of parameters, so declaring astname as astname(N,R): would be a syntax error (since it is first seen in astmatch with one parameter).

To be useful, templates must be instantiated. This is done by binding a literal, regexp, or rule symbol to each parameter. Duplicate instantiations of a template (with the same parameters) are ignored. When a template is instantiated, it recursively instantiates all templates on its right-hand side. For example, the instantiation astmatch("expr",c.expr) produces:

astmatch: astmatch('expr,c.expr)
astname:  astname('expr)

astmatch('expr,c.expr): astname('expr) "{" c.expr "}"
astname('expr): "expr"

Note that the recursive astname instantiation within astmatch is bound to the specific instance astname('expr), not to the generic astname template. Instantiation is not done in the following way (which would have a different effect):

# Not what happens
astmatch("expr",c.expr): astname "{" c.expr "}"

Pattern Matching Implementation

The pattern matching syntax for C is defined above. The advantages are three-fold:

Other details of the pattern-matching implementation

Possible Optimizations


Maybe TODO / Misc