[amitp@cs.stanford.edu: Re: PEGs]
Kragen Javier Sitaker
kragen at canonical.org
Mon Dec 1 21:39:52 EST 2008
Forwarded, slightly edited, with permission.
----- Forwarded message from Amit Patel <amitp at cs.stanford.edu> -----
From: Amit Patel <amitp at cs.stanford.edu>
To: kragen at pobox.com
Subject: Re: PEGs
Hi Kragen,
Neat article!
[referring to
http://github.com/~kragen/peg-bootstrap/tree/master/peg.md or
http://considerate.murch-sitaker.org/~kragen/peg.md.html -kjs]
On http://lambda-the-ultimate.org/node/3039, Josh Haberman thinks PEGs
aren't as cool as I and everyone else thinks they are. I don't have enough
experience with them to know.
For Yapps, I wrote parts of the parser by hand, and it was enough to parse
the grammar for Yapps, so after a few iterations I had Yapps producing its
own parser. That was fun :)
For parsers in general, I'm so happy that ANTLR and packrat got people
interested again. It seemed like the field was dead, even though there is so
much more to do. Three areas I'm interested in, but don't plan to work on
anytime soon:
- Abstraction. When I write a rule and code to handle a comma-separated
list of numbers, and then I write it all again for a semicolon-separated
list of CSS rules, and then I write it all again for a period-separated list
of names, I want some abstraction rule for List(separator, element) that not
only can parse those lists, but produce the output value.
- Inheritance. (A different form of abstraction) It might be useful to
take a grammar and add to it and override existing rules by a mechanism
similar to inheritance in classes. For example, Vax C let you use $ in
identifiers. It'd be neat to take an existing grammar for C, inherit from
it, and redefine the Token rule to also allow $.
- Libraries. (This goes along with abstraction) There are lots of
patterns that are pretty common and could be put into a library. The List()
rule is one of course but nesting, whether () or [] or {}, or strings
(single, double, triple quotes, escaping, etc.) would be nice to put into a
library.
- Better theory. In its purest form, parser theory tells you whether a
string matched. In practice, we want to extract parts of it and compute
functions of those parts. This means the grammar aa* and a*a, although
equivalent in theory (because they match the same set of strings), are
actually different once you take into account extracting data. So I think
the theorists started with the wrong problem, and I'd like to see theory
that addresses the actual problem that parsers are used to solve. Maybe it
won't make any difference; I don't know. But I have this bad feeling about
the foundations.
- Error handling. (This goes along with better theory) Parsers tend to
be fine when things go right, but I haven't seen any good theory around
handling errors. In practice it seems that people use heuristics and hacks.
- Diff parsing. (This is related to errors) In theory, a parser is run
on a string, and then you stop. But in practice, you're running a parser on
a string and then run the parser on a modified string. For example, if my
input is "(if (foo) bar)" and I change it to "(if (foo bar)" then by
looking at the change I made in the editor, you can tell that the error is
likely to be that foo should have ')' after it. But the parser, having
forgotten the previous version of the string, will tell me the error is
"missing )" at the very end of the string. The history can tell you
something about where the error occurred. Looking at it another way,
suppose you have the last successful parse, and you know which parts of the
parse tree correspond to the parts of the input string. When the input
string changes, try to preserve the parse tree to the left of and to the
right of the input string changes. That will give you a subtree that you
expect errors to occur within, and the missing ')' likely should be there,
not at the very end (rightmost leaf of the tree).
- Amit
----- End forwarded message -----
More information about the Kragen-discuss
mailing list