Jim Roskind on C ambiguity

# From:         Jim Roskind x5570 - view profile
# Date:         Tues, Jan 14 1992 4:23 pm
# Email:                j...@florida.HQ.Ileaf.COM (Jim Roskind x5570)
# Groups:               comp.compilers
#
# show options
#
# Reply to Author | Forward | Print | View Thread | Show original | Report Abuse | Find messages by this author
#
# >   From: hje...@cs.cmu.edu (Mark Hjelm)
# >   Organization: School of Computer Science, Carnegie Mellon
#
# >   I have a parser, written using Yacc and Lex, for ANSI C.  The grammar is
# >   taken pretty much verbatim from the standard.  The scanner uses the symbol
# >   table to decide whether to return "identifier" or "typedef name" as the
# >   token type for an identifier.  How do I KNOW that there are no situations
# >   which, due to parser lookahead, would cause the scanner to return an
# >   incorrect token type for an identifier (i.e.  return "identifier", even
# >   though the identifier was just/will be made into a "typedef name")?  Is
# >   there a general answer to this question for other parsing strategies
# >   (possibly with other amounts of lookahead) and other grammars (languages)?
#
# This is an excellent and interesting question, and pulled out a very
# nice line of responses.  My answers are:
#
#         1) I have no "general method" of proving correctness.
#         2) In this case, correctness can be proved.
#         3) If your grammar came from the standard, the proof will
# identify an error in your grammar.
#         4) A correct grammar exists, and I distribute it for free.
#
# For folks that can't figure out why there is so much discussion about this
# problem, consider the code fragment:
#
#         T(*b)[4];
#
# Notice how the meaning is very different depending on whether T is a
# typedef-name, or an identifier (and hopefully a function in this case).
# For this reason most C lexers distinguish between typedef names and other
# identifiers.  Alas, as the poster points out, there is concern that the
# feedback loop (parser annotates symbol table, which lexer interrogates,
# and then feeds tokens to the parser) will fail in some complex scenario.
#
# Other threads of response included the use of a GLR parser, which tries
# all alternatives.  Alas, as will be shown, the ambiguity in C is too
# great, and such an approach will not provide a unique syntactically
# correct parse when it does exist.
#
# Other respondees mentioned that almost no yacc based parsers get this area
# of the standard right.  Since the grammar included in the standard is
# misleading, the cause for this problem will become abundantly clear. I
# will start with a description of how a proof of correctness could proceed,
# take a slight aside to show why the traditional grammar tends to cause
# this proof to fail, and explain how the grammar can be written (provably)
# correctly.  Since I am typing this on the fly, I will surely make some
# errors, but I hope that most of the readers will find the content
# interesting.  Note that my grammar distribution includes a very large
# paper that deals with many C/C++ ambiguities in much greater detail, and
# with many more examples.
#
# The fact that yacc uses LR(1) scanning is pivotal to the argument in this
# special case.  In this special case you can note that the typedef-ness of
# an identifier can *only* change after a declarator is complete.
# Specifically, the standard states that an identifier enters scope only
# after its declarator is complete.  The completion of a declarator is
# always marked by one of three tokens:
#
#         ";" end of declaration statement
#         "," end of individual declarator (another is comming)
#         "=" end of declarator, initializer is comming.
#
# To prove correctness, you must show that reductions involved with
# annotation of your symbol table take place while one of the three tokens I
# just mentioned is waiting in the lookahead buffer.  The argument notes
# that since none of the three tokens I identified relies on the
# typedef-ness of an identifier, everything will work out (i.e., the
# identifier vs typedef-name separation by the lexer can always be done in a
# correct and timely manner).  Basically, all you have to do is look at the
# position of the action code that is placed in the grammar to annotate the
# symbol table, and be sure that this happens before one of the above three
# terminating tokens (context counts) has being processed.
#
# As an example, suppose you had a rule that looked like:
#
# declaration: storage-class INT identifier ';' { /* annotate sym table */ }
#
# Please don't bother to argue that you would never have such a specific
# rule in the grammar, rather note that the action is being done *after*
# reading the ';'.  This is "bad."  If you had such a rule, then there is a
# chance that yacc would have read the next token (it is allowed to if it
# needs to), and there is a chance that the next token *might* be an
# identifier.  Hence, with the above rule, there is a chance that the lexer
# would process an upcoming identifier *prior* to the annotation of the
# symbol table, and hence incorrect results could arise.  For example:
#
#         typedef int T ;
#         T a;
#
# might be incorrectly parsed (the second "T" might be read *and* tokenized
# by the lexer before the symbol table is annotated).  A guaranteed correct
# rule could be written:
#
# declaration: storage-class INT identifier { /* annotate sym table*/ } ';'
#
# Note that here the action takes place with the parser at *most* looking
# ahead at the ';', and hence the lexer would not have had to process future
# identifier/typedef-name distinctions.
#
# Getting back to your example, since you copied the grammar from the ANSI
# Standard, your grammar is probably wrong.  To expose the problem, try
# parsing:
#
#         typedef int A, B(A); /* my bet is you get a syntax error*/
#
# When correctly parsed, the above is equivalent to:
#
#         typedef int A;
#         typedef int B(A);
#
# or simply:
#
#         typedef int A;
#         typedef int B(int); /* B's type is function taking an int
#                 returning and int */
#
# Again, to verify my interpretation, look at the verbose discussion of
# declarations in the ANSI standard and note that an identifier enters scope
# when its declarator is complete.  Despite this assertion in prose, most
# grammars are based on the K&R or ANSI grammars, which gathers an
# "init-declarator-list" before combining the declarators with the
# "declaration-specifiers." In the above example, gathering the entire list
# "A, B(A)" before annotating the symbol table for "A" would allow the lexer
# to incorrectly tokenize the second "A".
#
# You might wonder how GNU GCC (or others related compilers) manage to get
# the above parse right, when their grammar looks a lot like this standard
# (and I assert, faulty) grammar.  In the case of GCC, this area of the
# action code includes a hack wherein the "decl-specifiers" are fetched from
# lower points in the stack using $-1 (using negative numbers allows action
# code to reach down into questionable areas of the yacc stack, whereas
# positive numbers correspond to the tokens in the productions with the
# action).  Basically, by reaching backwards when they see a ',' (or '='),
# they are able to annotate the symbol table even though they use the
# standard grammar. Once again, it is provable that they get the correct
# parse (if they can prove that $-1 always contains the desired
# decl-specifier value.  This latter assertion is far from trivial to prove,
# as they must correctly process function declarations, which occasionally
# have no decl-specifiers!  They get it right, but a second hack is required
# atop the first hack, and these two hack unite to make C++ parsing a horror
# for them).
#
# For those compiler hackers out there, it is worth noting that the problems
# are greater than what was just given in the above example.  Not only can
# "typdef-ness" be established by a declarator, it can be taken away.  For
# example, consider the following:
#
# #include <assert.h>
# typedef char F;
# main() {
#         long a=sizeof(F), F, b=sizeof(F);
#         assert(1 == a);
#         assert(sizeof(long) == b);
#         }
#
# With proper rewriting, a grammar can be made to correctly process such
# declaration without "cheating" and looking at $-1 (or where ever else this
# info might be socked away).  The trick is to combine the declarator with
# the declaration specifier (and annotate the symbol table), then combine
# with the optional initializer (rather than forming an init-declarator),
# and then if there is a ',', form a token that once again acts just like
# the decl-specs to process the next declarator.  If you want to see the
# correct grammar, take a look at my distribution (which includes a C and a
# C++ grammar, along with a cheap FLEX input file).
#
# Since one of the respondees included suggesting that a GLR parser could
# just try all interpretations, and that only one would be valid, the
# following is a rather interesting example:
#
# typedef int T;
# main() {
#         int T=100, a=(T)+1;
#         assert(101 == T);
#         }
#
# Notice that the text "(T)+1" could incorrectly be interpreted, if "T" were
# still a typedef-name, as a cast to type T of the expression "+1".  Note
# that the GLR approach would yield two distinct and valid parses, and hence
# there would be no single winner (without additional heuristics).
#
# Another person focused on the issues involving redeclaration of
# typedefnames at inner scopes.  Indeed, this was something of a bear to
# tackle (without kludging).  Examples of the code that usually causes
# problems include:
#
# typedef int T1 ;
# typedef int T2 ;
# typedef int T3;
# main() {
#         const T1 T1; /* redeclares T1 to be an int */
#         const T2 (T2); /* redeclares T2 to be an int */
#         const T3; /* syntax error : missing declarator */
#         }
#
# Bruce Blodgett (formerly of Honeywell Corporation) and I independently
# implemented clean resolutions of this within a YACC grammar, but the
# results can be seen within my distribution.  Interestingly, this careful
# analysis of C identified some unaddressed scenarios in the ANSI standard.
# Since these results go beyond the scope of this posting, interested
# readers are encouraged to look at my paper which is included with my
# distribution.
#
# Doug Lea and Doug Schmidt have graciously offered to provide anonymous ftp
# sites for the 8 files, as well as the Berkeley YACC source (if you need
# it).
#
# ics.uci.edu (128.195.1.1) in the ftp/gnu directory (even though neither of
# the archives are GNU related) as:
#
#         c++grammar2.0.tar.Z
#         byacc1.8.tar.Z
#
# mach1.npac.syr.edu (128.230.7.14) in the ftp/pub/C++ directory as:
#
#         c++grammar2.0.tar.Z
#         byacc1.8.tar.z
#
# Jim Roskind
# Independent Consultant
# (407)729-4348 or (617)290-4990 x5570
# j...@hq.ileaf.com or ...!uunet!leafusa!jar
#
# p.s.: I'm always looking for interesting compiler and/or grammar work.
# --
# Send compilers articles to compil...@iecc.cambridge.ma.us or
# {ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.