PyGgy reads in a specification file for a parser and generates tables that can be used by a parsing engine to parse a stream of tokens. The first step in using PyGgy is to construct a specification file which specifies the grammar to be parsed. An example of a simple spec file is given in examples/test2.pyg (see examples/test2.pyl for the lexer that goes along with it):
    # This grammar is ambiguous
    E -> E PLUS E
        | E TIMES E
        | ID
        ;
This specifies a grammar with one non-terminal (E) with three productions (E -> E PLUS E, E -> E TIMES E and E -> ID) and three terminals (tokens PLUS, TIMES and ID).
Building a parser from a specification file is similar to building a lexer from a PyLly specification file. The example in examples/test2.py illustrates this:
    import pyggy

    [...]

    # instantiate the lexer and parser
    l,ltab = pyggy.getlexer("test2.pyl")
    p,ptab = pyggy.getparser("test2.pyg")
    l.setinput("-")
    p.setlexer(l)

    # parse the input
    tree = p.parse()
    if tree == None :
        print "error!"
    else :
        print "parse done: ", exprstr(tree)
        # if you have dot, try uncommenting the following
        #pyggy.glr.dottree(tree)
The getparser function builds parser tables and a parser in much the same way as the getlexer function discussed previously. It returns both the parser and the generated module containing the parser tables. Once the parser is constructed, its input source is specified with the setlexer method. Finally, the parse method is called to parse the token stream from the lexer.
The parse method parses the tokens from the lexer and returns a parse tree. The tree has a slightly different shape than might be expected because the parse engine can parse ambiguous grammars. The root of the tree is a pyggy.glr.symbolnode instance. This instance refers to one of the terminals or non-terminals in the grammar. Its possibilities member holds a list of the possible productions derived from that symbol. If the parse is unambiguous, there will be exactly one item in the possibilities list.
Each possibility is a pyggy.glr.rulenode instance. The rulenode instance represents the left hand side of a production and has two members: rule, which specifies which rule was matched, and elements, which is a list of all of the parsed items in the right hand side. These elements are symbolnode instances.
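The shape described above can be tried out without PyGgy. The following is a minimal sketch in Python 3 that uses stand-in classes carrying only the members named in the text (sym, possibilities, rule, elements). The constructors, the helper names leaf, ident and is_ambiguous, the empty possibilities list on terminals, and the 0-based production indexes are all assumptions made for illustration; they are not PyGgy's implementation.

```python
# Stand-in classes that mirror only the members described in the text.
# They are NOT PyGgy's pyggy.glr classes; the constructors are assumed.
class symbolnode:
    def __init__(self, sym, possibilities=()):
        self.sym = sym                            # terminal or non-terminal
        self.possibilities = list(possibilities)  # rulenode instances

class rulenode:
    def __init__(self, rule, elements):
        self.rule = rule                 # which production was matched
        self.elements = list(elements)   # symbolnode instances (right hand side)

def is_ambiguous(node, seen=None):
    """Return True if any symbolnode reachable from node has several possibilities."""
    seen = set() if seen is None else seen
    if id(node) in seen:                 # shared node, already examined
        return False
    seen.add(id(node))
    if len(node.possibilities) > 1:
        return True
    return any(is_ambiguous(child, seen)
               for p in node.possibilities
               for child in p.elements)

def leaf(name, value):
    # Hypothetical helper: a terminal, assumed to have no possibilities.
    return symbolnode((name, value))

def ident(value):
    # Hypothetical helper: E -> ID (production index 2 is assumed).
    return symbolnode("E", [rulenode(("E", 1, 2), [leaf("ID", value)])])

# Hand-built result for the input "a+b*c" under the ambiguous grammar above.
a, b, c = ident("a"), ident("b"), ident("c")
plus, times = leaf("PLUS", "+"), leaf("TIMES", "*")
a_plus_b = symbolnode("E", [rulenode(("E", 3, 0), [a, plus, b])])
b_times_c = symbolnode("E", [rulenode(("E", 3, 1), [b, times, c])])
root = symbolnode("E", [
    rulenode(("E", 3, 1), [a_plus_b, times, c]),   # (a + b) * c
    rulenode(("E", 3, 0), [a, plus, b_times_c]),   # a + (b * c)
])

print(is_ambiguous(root))  # True: the root has two possibilities
print(is_ambiguous(a))     # False: a lone identifier has exactly one
```

A check like this is one way to decide whether a result can be treated as an ordinary tree before walking it.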
To clarify, consider the code from examples/test2.py:
    import string    # needed for string.join below (Python 2)

    def singleexprstr(kids) :
        if len(kids) == 1 :
            return kids[0].sym[1]
        else :
            return "(%s %s %s)" % (exprstr(kids[0]), kids[1].sym[1], exprstr(kids[2]))

    def exprstr(e) :
        res = []
        for p in e.possibilities :
            res.append(singleexprstr(p.elements))
        if len(res) == 1 :
            return res[0]
        else :
            return "[" + string.join(res, " or ") + "]"
The exprstr function is called to convert a parsed expression tree into a string. It is called with a symbolnode that always references the non-terminal "E". For each of the possible rulenode instances of "E", exprstr calls singleexprstr with the list of right hand side elements, and singleexprstr converts that list into a string. If there is only one item in the right hand side, it must be an identifier, and the identifier's value is retrieved from the symbol information. Otherwise there are three children: two expressions and an operator. The expressions are converted to strings and joined into a single string with the operator between them. When there is more than one possibility, exprstr joins the resulting strings with "or" inside square brackets.
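To experiment with this traversal without installing PyGgy, the two functions can be ported to Python 3 and run against a hand-built structure. This is only a sketch: the symbolnode and rulenode classes below are stand-ins with assumed constructors, the helpers leaf and ident are hypothetical, and the forest is written out by hand rather than produced by the parser.

```python
# Stand-in node classes (assumed constructors, not PyGgy's own classes).
class symbolnode:
    def __init__(self, sym, possibilities=()):
        self.sym = sym
        self.possibilities = list(possibilities)

class rulenode:
    def __init__(self, rule, elements):
        self.rule = rule
        self.elements = list(elements)

def singleexprstr(kids):
    # One right hand side: either a lone identifier or "E operator E".
    if len(kids) == 1:
        return kids[0].sym[1]
    return "(%s %s %s)" % (exprstr(kids[0]), kids[1].sym[1], exprstr(kids[2]))

def exprstr(e):
    # One string per possible parse of this symbol.
    res = [singleexprstr(p.elements) for p in e.possibilities]
    if len(res) == 1:
        return res[0]
    return "[" + " or ".join(res) + "]"

def leaf(name, value):
    # Hypothetical helper: a terminal whose sym is (token name, token value).
    return symbolnode((name, value))

def ident(value):
    # Hypothetical helper: E -> ID.
    return symbolnode("E", [rulenode(("E", 1, 2), [leaf("ID", value)])])

# Hand-built forest for "a+b*c"; the order of possibilities is chosen by hand.
a, b, c = ident("a"), ident("b"), ident("c")
plus, times = leaf("PLUS", "+"), leaf("TIMES", "*")
a_plus_b = symbolnode("E", [rulenode(("E", 3, 0), [a, plus, b])])
b_times_c = symbolnode("E", [rulenode(("E", 3, 1), [b, times, c])])
root = symbolnode("E", [
    rulenode(("E", 3, 1), [a_plus_b, times, c]),
    rulenode(("E", 3, 0), [a, plus, b_times_c]),
])

print(exprstr(root))  # [((a + b) * c) or (a + (b * c))]
```

The only change from the original functions is the use of the str.join method in place of string.join, which is not available in Python 3.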
It's informative to see what the output of this function looks like:
    $ echo "a+b*c" | python run2.py
    parse done:  [((a + b) * c) or (a + (b * c))]
Notice that there were two possible parses of this string. If you have GraphViz installed, edit examples/test2.py, uncomment the line pyggy.glr.dottree(tree), and rerun the previous test case. You will be shown a graphical representation of the parse tree.
In the graph, the symbolnode instances show up in red and the rulenode instances show up in black. You can also see the value of the symbol in each symbolnode and the rule in each rulenode. Notice that the symbolnode for each non-terminal has a sym value of the non-terminal name, and each terminal has a sym value that is a tuple of the token name and the token value. Also note that each rulenode has a rule that is a tuple of the name of the left hand side, the number of elements in the right hand side (which is also the number of items in its elements member), and the index of the production in the grammar.
The graphical view of the parse tree makes it obvious that it's not a parse tree at all! The parsing engine makes use of shared nodes whenever possible to avoid an exponential blowup in the number of nodes during an ambiguous parse. The parse tree will truly be a tree if there is a unique parse. The parse tree may have cycles if there is a production which can derive itself without consuming any input. If there are no such productions, there will be no cycles.
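The effect of node sharing can be made concrete with a small experiment. The sketch below, in Python 3, packs every parse of ID PLUS ID PLUS ... PLUS ID under the rule E -> E PLUS E into one shared graph, then compares the number of distinct parses with the number of nodes needed to hold them. The node classes are stand-ins, build_forest, count_parses and count_nodes are hypothetical helpers, and how closely this hand-built graph matches the one PyGgy's engine actually produces is an assumption. The graph built here is acyclic, which count_parses relies on.

```python
# Stand-in node classes (assumed constructors, not PyGgy's own classes).
class symbolnode:
    def __init__(self, sym, possibilities=()):
        self.sym = sym
        self.possibilities = list(possibilities)

class rulenode:
    def __init__(self, rule, elements):
        self.rule = rule
        self.elements = list(elements)

def build_forest(n_ids):
    """Pack every parse of n_ids identifiers joined by PLUS into one graph."""
    ids = [symbolnode("E", [rulenode(("E", 1, 2),
                                     [symbolnode(("ID", "x%d" % i))])])
           for i in range(n_ids)]
    plus = [symbolnode(("PLUS", "+")) for _ in range(n_ids - 1)]
    span = {(i, i): ids[i] for i in range(n_ids)}
    for width in range(1, n_ids):
        for i in range(n_ids - width):
            j = i + width
            # One possibility per split point; sub-spans are reused, not copied.
            span[i, j] = symbolnode("E", [
                rulenode(("E", 3, 0), [span[i, k], plus[k], span[k + 1, j]])
                for k in range(i, j)])
    return span[0, n_ids - 1]

def count_parses(node, memo=None):
    """Number of distinct trees packed under node (graph assumed acyclic)."""
    memo = {} if memo is None else memo
    if id(node) in memo:
        return memo[id(node)]
    if not node.possibilities:           # terminal: one way to read it
        total = 1
    else:
        total = 0
        for p in node.possibilities:
            ways = 1
            for child in p.elements:
                ways *= count_parses(child, memo)
            total += ways
    memo[id(node)] = total
    return total

def count_nodes(node, seen=None):
    """Number of distinct symbolnode objects reachable from node."""
    seen = set() if seen is None else seen
    if id(node) in seen:
        return 0
    seen.add(id(node))
    return 1 + sum(count_nodes(child, seen)
                   for p in node.possibilities
                   for child in p.elements)

forest = build_forest(10)
print(count_parses(forest))  # 4862 distinct parses
print(count_nodes(forest))   # 74 symbolnode objects hold all of them
```

The number of parses grows with the Catalan numbers, while the number of shared nodes in this construction grows only quadratically with the number of identifiers.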
Like the lexer specification file, the grammar specification file can be used to specify actions to be performed. These actions are not performed during parsing, as is traditionally done, but can be invoked after the parse is complete. The examples/pyg_calc.py example illustrates this. This is a small calculator test case based on the example from the PLY web site (http://systems.cs.uchicago.edu/ply/example.html). The examples/pyg_calc.pyg file specifies the grammar and actions:
    %left TIMES DIVIDE;
    %left PLUS;
    %right UNARYMINUS;

    statement ->
        NAME EQUALS expression :
            names[kids[0]] = kids[2]
        | expression :
            print kids[0]
        ;

    expression ->
        expression PLUS expression :
            return kids[0] + kids[2]
        | %prec(PLUS) expression MINUS expression :
            return kids[0] - kids[2]
        | expression TIMES expression :
            return kids[0] * kids[2]
        | expression DIVIDE expression :
            return kids[0] / kids[2]
        | %prec(UNARYMINUS) MINUS expression :
            return -kids[1]
        | LPAREN expression RPAREN :
            return kids[1]
        | NUMBER :
            return kids[0]
        | NAME :
            if not kids[0] in names :
                print "Undefined name '%s'" % kids[0]
                return 0
            return names[kids[0]]
        ;
Each production in the grammar is followed by a colon and a block of code, which is the action for that production. After the input is parsed into a parse tree, the actions can be applied to the parse tree with the proctree function, as is done in examples/pyg_calc.py:
    import sys
    import pyggy

    # build the lexer and parser
    l,ltab = pyggy.getlexer("pyg_calc.pyl")
    p,ptab = pyggy.getparser("pyg_calc.pyg")
    p.setlexer(l)

    while 1:
        sys.stdout.write("calc > ")
        line = sys.stdin.readline()
        if line == "" :
            break

        l.setinputstr(line)
        try :
            tree = p.parse()
        except pyggy.ParseError,e :
            print "parse error at '%s'" % e.str
            continue
        pyggy.proctree(tree, ptab)
proctree walks the tree in a depth-first manner and, at each node representing a production in the grammar, runs the action associated with that production. When running the code, the kids argument contains a list of the right hand side values of the production.
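The idea of a depth-first walk that hands each action a kids list can be sketched in a few lines of Python 3. This is an illustration of the technique, not PyGgy's proctree. The node classes are stand-ins, the walk function and the actions table keyed by production index are hypothetical, the three-production toy grammar is not the one in pyg_calc.pyg, and the choices to pass a terminal's token value as its kid and to reject ambiguous nodes are assumptions.

```python
# Stand-in node classes (assumed constructors, not PyGgy's own classes).
class symbolnode:
    def __init__(self, sym, possibilities=()):
        self.sym = sym
        self.possibilities = list(possibilities)

class rulenode:
    def __init__(self, rule, elements):
        self.rule = rule
        self.elements = list(elements)

# Hypothetical action table for a toy grammar: production index -> action.
actions = {
    0: lambda kids: kids[0] + kids[2],   # expression -> expression PLUS expression
    1: lambda kids: kids[0] * kids[2],   # expression -> expression TIMES expression
    2: lambda kids: kids[0],             # expression -> NUMBER
}

def walk(node):
    """Evaluate children first, then run the action of the matched production."""
    if not node.possibilities:           # terminal: use the token value
        return node.sym[1]
    if len(node.possibilities) != 1:     # this sketch handles unique parses only
        raise ValueError("ambiguous parse at %r" % (node.sym,))
    rule = node.possibilities[0]
    kids = [walk(child) for child in rule.elements]
    return actions[rule.rule[2]](kids)   # rule is (lhs name, length, index)

def number(value):
    # Hypothetical helper: expression -> NUMBER.
    return symbolnode("expression", [
        rulenode(("expression", 1, 2), [symbolnode(("NUMBER", value))])])

# Hand-built unambiguous tree for 1 + 2 * 3, grouped as 1 + (2 * 3).
plus, times = symbolnode(("PLUS", "+")), symbolnode(("TIMES", "*"))
product = symbolnode("expression", [
    rulenode(("expression", 3, 1), [number(2), times, number(3)])])
tree = symbolnode("expression", [
    rulenode(("expression", 3, 0), [number(1), plus, product])])

result = walk(tree)
print(result)  # 7
```

Because the children are evaluated before the action runs, each action sees already-computed values in kids, which is why the calculator actions above can simply return kids[0] + kids[2].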
See the PyGgy Home Page.