It is possible to load multiple grammars simultaneously and to embed rules from one grammar inside another (aka "composable" grammars). When first parsed, all grammars must begin with a "%lang name" directive specifying the language name (later extensions to the grammar do not need this directive). Rules and templates from other grammars can be accessed by the "." operator. Example of using rules from language "B" within language "A":
    %lang A
    a: "b" ":" B.b
At the time rule A.a is parsed, language B must exist (although the rule "B.b" does not have to exist). The parsing of embedded rules is implemented by dynamically discovering which dfa the lexer should run. Pseudocode for the algorithm:
    parse() {
        while(!eof) {
            reduce()        // <-+ Swapped from the order in src/lib
            tokens = lex()  // <-+ Is this a problem? (I don't see one)
            shift(tokens)
        }
    }

    lex() {
        foreach stack top z:
            foreach possible shift sym s in z.shift:
                if s.grammar hasn't been checked yet:
                    toks = rundfa(s.grammar.dfa)
                    if toks != nil:
                        return toks
        return nil
    }
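A runnable sketch of this discovery loop, in Python for illustration. The grammar names, the regex stand-ins for each grammar's dfa, and the rundfa/lex helpers are all assumptions of this sketch, not the real implementation:

```python
import re

# Hypothetical stand-ins: each grammar owns one dfa, approximated here by
# a regex over that grammar's terminals (M: 0 and *, P: 0 and +).
GRAMMARS = {
    "M": re.compile(r"0|\*"),
    "P": re.compile(r"0|\+"),
}

def rundfa(grammar, text, pos):
    """Run one grammar's dfa at pos; return the matched token or None."""
    m = GRAMMARS[grammar].match(text, pos)
    return m.group(0) if m else None

def lex(stack_tops, text, pos):
    """For each stack top, try each shiftable symbol's grammar once."""
    checked = set()
    for shift_syms in stack_tops:        # z.shift for each stack top z
        for grammar in shift_syms:       # s.grammar for each shift sym s
            if grammar in checked:
                continue
            checked.add(grammar)
            tok = rundfa(grammar, text, pos)
            if tok is not None:
                return grammar, tok
    return None

# One stack top whose shift symbols come from M and P:
print(lex([["M", "P"]], "v = 0 + 0", 4))   # ('M', '0')
print(lex([["M", "P"]], "v = 0 + 0", 6))   # ('P', '+')
```

Note that the first grammar whose dfa produces a token wins, which is exactly the first-match commitment that causes the ambiguity problem described next.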
This algorithm assumes that the input can only be tokenized properly by one grammar. Ambiguity in the token stream is not supported. Example:
    %lang S
    s: "v" "=" M.e ";"
    s: "v" "=" P.e ";"

    %lang M
    e: "0"
    e: e "*" e

    %lang P
    e: "0"
    e: e "+" e

    # Text to parse (starting at S.s)
    v = 0 + 0 ;

    # Tree should be
    (S.s "v" "=" (P.e (P.e "0") "+" (P.e "0")) ";")

    # Tree when M's dfa is tried first
    (S.s "v" "=" (M.e (M.e "0") @error!
After reaching the "=", the parser detects a dfa ambiguity. If it tries M's dfa first, it will tokenize 0 and then expect a * token. However, a + token follows, so the parse fails.
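The failure can be sketched by committing to one grammar's dfa for the whole expression (Python, with regex stand-ins for the M and P dfas; all names are illustrative):

```python
import re

# Regex stand-ins for the M and P dfas from the example above.
DFAS = {"M": re.compile(r"0|\*"), "P": re.compile(r"0|\+")}

def tokenize(grammar, text):
    """Commit to one grammar's dfa for the whole expression."""
    toks, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = DFAS[grammar].match(text, pos)
        if m is None:
            return None              # no dfa transition: the parse fails
        toks.append(m.group(0))
        pos = m.end()
    return toks

print(tokenize("M", "0 + 0"))   # None -- M's dfa has no '+' token
print(tokenize("P", "0 + 0"))   # ['0', '+', '0']
```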
The major problem with supporting token ambiguity is that the file pointer can differ in each stack top. This is too hard to solve as of now, but would be interesting to look at again some day.
Grammar rules can be declared as templates. Their main usage is to define the pattern matching syntax, so that is the running example:
    # extensions to C
    expr: expr "~" astmatch
    expr: expr "~" astname
    expr: "`" astmatch

    astmatch(N,R): astname(N) "{" R "}"
    astname(N): softid.id(N)

    %lang softid
    id(N): N
The expr extensions show how templated rules are used within non-templated rules. The ast extensions show how to declare templates. All instances of a template must have the same number of parameters, so declaring astname as astname(N,R): softid.id(N) would be a syntax error (since it is first seen in astmatch with one parameter).
To be useful, templates must be instantiated. This is done by binding literal, regexp, or rule symbols to each parameter. Duplicate (same parameter) instantiations of a template are ignored. When a template is instantiated, it recursively instantiates all templates on its right-hand side. For example, the instantiation astmatch("expr",c.expr) produces:
    astmatch: astmatch('expr,c.expr)
    astname: astname('expr)
    softid: softid.id('expr)

    astmatch('expr,c.expr): astname('expr) "{" c.expr "}"
    astname('expr): "expr"
    softid.id('expr): "expr"
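The recursive walk can be sketched as follows (Python; the template table and its string-based rule encoding are assumptions of this sketch, and it keeps the bound argument verbatim on the leaf rules rather than converting 'expr into the literal "expr"):

```python
# Sketch of recursive template instantiation. The template table, its
# string-based rule encoding, and all helper names here are assumptions.
TEMPLATES = {
    "astmatch":  (["N", "R"], ["astname(N)", '"{"', "R", '"}"']),
    "astname":   (["N"],      ["softid.id(N)"]),
    "softid.id": (["N"],      ["N"]),
}

def instantiate(name, args, out):
    """Bind args to params, then recursively instantiate templated rhs symbols."""
    key = (name, tuple(args))
    if key in out:                  # duplicate instantiations are ignored
        return
    out[key] = None                 # mark early so duplicates/cycles stop here
    params, rhs = TEMPLATES[name]
    env = dict(zip(params, args))
    body = []
    for sym in rhs:
        if "(" in sym:              # templated symbol, e.g. astname(N)
            callee, param = sym[:-1].split("(")
            body.append(f"{callee}({env[param]})")
            instantiate(callee, [env[param]], out)
        else:
            body.append(env.get(sym, sym))
    out[key] = body

rules = {}
instantiate("astmatch", ["'expr", "c.expr"], rules)
for (name, args), body in rules.items():
    print(f"{name}({','.join(args)}): {' '.join(body)}")
```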
Note that the recursive astname instantiation within astmatch is specific: the instantiated right-hand side refers to astname('expr), not to the generic astname. Instantiation is not done in the following way (which would have a different effect):
    # Not what happens
    astmatch("expr",c.expr): astname "{" c.expr "}"
The pattern matching syntax for C is defined above. The advantages are three-fold:
The syntax is concise
It "just works" with multiple languages (except for the "ambiguous token" problem described above). All that is needed is one template instantiation (on astmatch). For example, it can parse the following perl regular expression pattern (which contains an inner "}"):
    # The instantiation
    astmatch("perl.re", perl.re)

    # Supported syntax
    ast ~ perl.re{s/^}/)/}
This works because the {/^}/} pattern is parsed directly with the perl.re rule.
Pattern names are "soft keywords". A special language ("softid") is used to hide all keywords so that the keyword namespace is not polluted.
Other details of the pattern-matching implementation
Operator !~
Syntactic sugar for grammars:
    rule?         # option type
    rule*         # list types
    rule+
    [rule sep]    # separated list, eg. [R ","] ~ R , R , ... , R
    [rule sep]?
These expand to:
    rule?:
    rule?: rule

    rule*:
    rule*: rule* rule

    rule+: rule
    rule+: rule+ rule

    [rule sep]: rule
    [rule sep]: [rule sep] sep rule

    [rule sep]?:
    [rule sep]?: [rule sep]
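A sketch of this expansion as a function (Python; encoding rules as plain strings is an assumption of the sketch, not how the grammar is actually represented):

```python
# Sketch of the expansions above; rules are encoded as plain strings.
def expand(rule, kind, sep=None):
    """Return the generated rules for rule?, rule*, rule+ or [rule sep]."""
    if kind == "?":
        lhs = rule + "?"
        return [f"{lhs}:", f"{lhs}: {rule}"]
    if kind == "*":
        lhs = rule + "*"
        return [f"{lhs}:", f"{lhs}: {lhs} {rule}"]
    if kind == "+":
        lhs = rule + "+"
        return [f"{lhs}: {rule}", f"{lhs}: {lhs} {rule}"]
    if kind == "[]":
        lhs = f"[{rule} {sep}]"
        return [f"{lhs}: {rule}", f"{lhs}: {lhs} {sep} {rule}"]
    raise ValueError(kind)

print(expand("R", "*"))           # ['R*:', 'R*: R* R']
print(expand("R", "[]", '","'))   # ['[R ","]: R', '[R ","]: [R ","] "," R']
```

Note that [rule sep]? composes: expanding the "?" of the already-expanded name '[R ","]' yields the same two rules as the table above.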
Nesting and grouping would be cool, but are harder to use. For example, one could define ifs using grouping as below (note the braces). The problem: individual elifs cannot be pattern-matched because an anonymous rule is created for each group (the pattern-matching syntax requires a rule name so it knows what to parse).
if: "if" "(" expr ")" stmt { "elif" "(" expr ")" stmt }* { "else" stmt }?
Question: How are the lists represented? How can an extension writer iterate, splice, concat, etc. on lists declared using the *,+,[] syntaxes?
AST shape-checking (see notes elsewhere)
It bugs me that expr{ (0} } !~ expr{ 0 }. Maybe use a quick de-sugaring pass, or integrate de-sugaring with the parser?
Improve the lexer's power. I'd like to support two things:
A flag to disable slots, possibly %allotslots(rule). Slots only make sense within a pattern, so this will tell the parser to reject accidental uses of slots outside of patterns, such as:
int \foo() { return bar(); }
This would require adding state flags to Ystates, I think.
Though it is insanely unlikely, hash code collisions are possible. One solution:
    Let Ysym->code = nameof(Ysym->grammar->name + "." + Ysym->name)
    Let Ysym->rule = nameof(print("%R",rule))
These codes are guaranteed unique, but they are not persistent. They must be bound when an extension is loaded. For example:
    const char* sym_namestable[] = { "'int", "'bool", "/[0-9]+", ... };
    const char* rule_namestable[] = { "expr: expr '+ expr", ... };
    Name sym_codestable[S] = {0};   // S = nelem(sym_namestable)
    Name rule_codestable[R] = {0};  // R = nelem(rule_namestable)

    // All calls to destruct & restruct look up a code by index in the codes tables.

    // Special compiler-generated function, called when the extension loads
    void xinit() {
        for(i = 0..S-1): sym_codestable[i] = nameof(sym_namestable[i]);
        for(i = 0..R-1): rule_codestable[i] = nameof(rule_namestable[i]);
    }
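The nameof-style interning can be sketched in Python. Equal strings always get equal codes within one run, but the codes depend on first-use order, which is why they are not persistent across runs and must be rebound at load time (the dict-based implementation is an assumption of the sketch):

```python
# Sketch of nameof-style interning: equal strings get equal codes within
# a run, but codes depend on first-use order, so they are not persistent.
_interned = {}

def nameof(s):
    """Return the code for s, assigning the next free one on first use."""
    return _interned.setdefault(s, len(_interned))

sym_namestable = ["'int", "'bool", "/[0-9]+"]
sym_codestable = [nameof(s) for s in sym_namestable]

print(sym_codestable)       # [0, 1, 2]
print(nameof("'int"))       # 0 -- same string, same code
```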