2.3 Using PyGgy

PyGgy reads in a specification file for a parser and generates tables that can be used by a parsing engine to parse a stream of tokens. The first step in using PyGgy is to construct a specification file which specifies the grammar to be parsed. An example of a simple spec file is given in examples/test2.pyg (see examples/test2.pyl for the lexer that goes along with it):

# This grammar is ambiguous
E -> E PLUS E
    | E TIMES E
    | ID
    ;

This specifies a grammar with one non-terminal (E) with three productions (E -> E PLUS E, E -> E TIMES E and E -> ID) and three terminals (tokens PLUS, TIMES and ID).
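The symbol counts above can be checked mechanically. The following is a minimal sketch in plain Python and is not part of PyGgy; the productions list and the classify helper are hypothetical names introduced only for illustration. It treats every symbol that appears on a left hand side as a non-terminal and every other symbol as a terminal.

```python
# Hypothetical in-memory form of the grammar in test2.pyg: (lhs, rhs) pairs.
productions = [
    ("E", ["E", "PLUS", "E"]),
    ("E", ["E", "TIMES", "E"]),
    ("E", ["ID"]),
]

def classify(productions):
    """Split the grammar symbols into (non-terminals, terminals)."""
    nonterminals = {lhs for lhs, _ in productions}
    terminals = {sym for _, rhs in productions for sym in rhs} - nonterminals
    return nonterminals, terminals

nonterminals, terminals = classify(productions)
print(sorted(nonterminals))  # ['E']
print(sorted(terminals))     # ['ID', 'PLUS', 'TIMES']
```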

Building a parser from a specification file is similar to building a lexer from a PyLly specification file. The example in examples/test2.py illustrates this:

import pyggy

[...]

# instantiate the lexer and parser
l,ltab = pyggy.getlexer("test2.pyl")
p,ptab = pyggy.getparser("test2.pyg")
l.setinput("-")
p.setlexer(l)

# parse the input
tree = p.parse()
if tree == None :
    print "error!"
else :
    print "parse done: ", exprstr(tree)
    # if you have dot, try uncommenting the following
    #pyggy.glr.dottree(tree)

The getparser function builds parser tables and a parser in the same manner as the getlexer function discussed previously. It returns both the parser and the generated module containing the parser tables. Once the parser has been built, its input source is specified with the setlexer method. Finally, the parse method is called to parse the token stream from the lexer.

The parse method parses the tokens from the lexer and returns a parse tree. The tree has a slightly different shape than might be expected because the parse engine can parse ambiguous grammars. The root of the tree is a pyggy.glr.symbolnode instance. This instance refers to one of the terminals or non-terminals in the grammar. Its possibilities member is a list of the possible productions derived by that symbol. If the parse is unambiguous, there will be exactly one item in the possibilities list. Each possibility is a pyggy.glr.rulenode instance. The rulenode instance represents the left hand side of a production and has two members: rule, which specifies which rule was matched, and elements, which is a list of all of the parsed items in the right hand side. These elements are symbolnode instances.
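The shape described above can be sketched without PyGgy installed. The SymbolNode and RuleNode classes below are hypothetical stand-ins that copy only the member names mentioned in the text (sym, possibilities, rule, elements); the rule labels are placeholder strings, and the assumption that terminals carry an empty possibilities list is made for this illustration only. The is_unambiguous helper is likewise not a PyGgy function.

```python
class SymbolNode:
    """Stand-in for pyggy.glr.symbolnode (illustrative only)."""
    def __init__(self, sym, possibilities=()):
        self.sym = sym                            # name, or (token name, value)
        self.possibilities = list(possibilities)  # RuleNode instances

class RuleNode:
    """Stand-in for pyggy.glr.rulenode (illustrative only)."""
    def __init__(self, rule, elements):
        self.rule = rule                # placeholder label for the matched rule
        self.elements = list(elements)  # SymbolNode instances

def is_unambiguous(node, seen=None):
    """True if no symbol node reachable from node has more than one possibility."""
    seen = set() if seen is None else seen
    if id(node) in seen:        # shared node or cycle: already checked
        return True
    seen.add(id(node))
    if len(node.possibilities) > 1:
        return False
    return all(is_unambiguous(kid, seen)
               for rule in node.possibilities
               for kid in rule.elements)

# Tree for the one-token input "a": E -> ID
leaf = SymbolNode(("ID", "a"))
unique = SymbolNode("E", [RuleNode("E -> ID", [leaf])])

# A root with two candidate derivations (right hand sides left empty for brevity)
ambiguous = SymbolNode("E", [RuleNode("E -> E PLUS E", []),
                             RuleNode("E -> E TIMES E", [])])

print(is_unambiguous(unique))     # True
print(is_unambiguous(ambiguous))  # False
```

The visited set keeps the walk from revisiting nodes, so the check still terminates if a node is reachable along more than one path.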

To clarify, consider the code from examples/test2.py:

def singleexprstr(kids) :
    if len(kids) == 1 :
        return kids[0].sym[1]
    else :
        return "(%s %s %s)" % (exprstr(kids[0]), kids[1].sym[1], exprstr(kids[2]))

def exprstr(e) :
    res = []
    for p in e.possibilities :
        res.append(singleexprstr(p.elements))
    if len(res) == 1 :
        return res[0]
    else :
        return "[" + string.join(res, " or ") + "]"

The exprstr function converts a parsed expression tree into a string. It is always called with a symbolnode that references the non-terminal "E". For each of the possible rulenodes of that symbol, exprstr calls singleexprstr with the list of right hand side elements, and singleexprstr converts that list into a string. If there is only one item in the right hand side, it must be an identifier, and the identifier's value is retrieved from the symbol information. Otherwise there are three children: two expressions and an operator. The expressions are converted to strings and joined into a single string with the operator between them. If there is more than one possible parse, exprstr joins the alternatives with "or" inside square brackets.

It is informative to see what the output of this function looks like:

$ echo "a+b*c" | python test2.py
parse done:  [((a + b) * c) or (a + (b * c))]

Notice that there were two possible parses of this string. If you have GraphViz installed, edit the examples/test2.py example, uncomment the line pyggy.glr.dottree(tree), and rerun the previous test case. You will be shown a graphical representation of the parse tree. In the graph, the symbolnodes show up in red and the rulenodes show up in black. You can also see the value of the symbol in each symbolnode and the rule in each rulenode. Notice that the symbolnode for each non-terminal has a sym value of the non-terminal name, and each terminal has a sym value that is a tuple of the token name and the token value. Also note that each rulenode has a rule that is a tuple of the name of the left hand side, the number of elements in the right hand side (which is also the number of items in its elements member), and the index of the production in the grammar.

The graphical view of the parse tree makes it obvious that it is not a tree at all! The parsing engine shares nodes whenever possible to avoid an exponential blowup in the number of nodes during an ambiguous parse. The parse tree will truly be a tree if there is a unique parse. The parse tree may have cycles if there is a production which can derive itself without consuming any input. If there are no such productions, there will be no cycles.
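To make the effect of sharing concrete, the sketch below builds by hand a graph for the input a+b*c with the two derivations shown earlier. It does not use PyGgy: SymbolNode and RuleNode are hypothetical stand-ins with the same member names, the rule labels are placeholder strings, and the exact set of nodes the real engine shares is an assumption. The exprstr function is a condensed port of the one in examples/test2.py.

```python
class SymbolNode:
    """Stand-in for pyggy.glr.symbolnode (illustrative only)."""
    def __init__(self, sym, possibilities=()):
        self.sym = sym
        self.possibilities = list(possibilities)

class RuleNode:
    """Stand-in for pyggy.glr.rulenode (illustrative only)."""
    def __init__(self, rule, elements):
        self.rule = rule
        self.elements = list(elements)

def tok(name, value):
    return SymbolNode((name, value))

def expr(rule, *elements):
    return SymbolNode("E", [RuleNode(rule, elements)])

# Leaves and sub-expressions are created once and referenced by both parses.
a, b, c = (expr("E -> ID", tok("ID", v)) for v in "abc")
plus, times = tok("PLUS", "+"), tok("TIMES", "*")
a_plus_b = expr("E -> E PLUS E", a, plus, b)
b_times_c = expr("E -> E TIMES E", b, times, c)
root = SymbolNode("E", [
    RuleNode("E -> E TIMES E", [a_plus_b, times, c]),
    RuleNode("E -> E PLUS E", [a, plus, b_times_c]),
])

def exprstr(e):
    """Render every possible parse of an E node as a string."""
    res = []
    for p in e.possibilities:
        kids = p.elements
        if len(kids) == 1:
            res.append(kids[0].sym[1])
        else:
            res.append("(%s %s %s)" % (exprstr(kids[0]), kids[1].sym[1], exprstr(kids[2])))
    return res[0] if len(res) == 1 else "[" + " or ".join(res) + "]"

def count_shared(node, seen=None):
    """Count distinct symbol nodes reachable from node."""
    seen = set() if seen is None else seen
    if id(node) in seen:
        return 0
    seen.add(id(node))
    return 1 + sum(count_shared(k, seen) for p in node.possibilities for k in p.elements)

def count_unshared(node):
    """Count symbol nodes as if every reference were a separate copy."""
    return 1 + sum(count_unshared(k) for p in node.possibilities for k in p.elements)

print(exprstr(root))         # [((a + b) * c) or (a + (b * c))]
print(count_shared(root))    # 11
print(count_unshared(root))  # 19
```

In this hand-built graph, 11 distinct symbol nodes stand in for the 19 that two fully separate trees would need. Note that count_unshared would not terminate on a graph with cycles, which is one reason code that walks an ambiguous parse should track visited nodes.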

Like the lexer specification file, the grammar specification file can be used to specify actions to be performed. These actions are not performed during parsing, as is traditionally done, but can be invoked after the parse is complete. The examples/pyg_calc.py example illustrates this. It is a small calculator test case based on the example from the PLY web site (http://systems.cs.uchicago.edu/ply/example.html). The examples/pyg_calc.pyg file specifies the grammar and actions:

%left TIMES DIVIDE;
%left PLUS;
%right UNARYMINUS;

statement -> NAME EQUALS expression :
        names[kids[0]] = kids[2]
    | expression :
        print kids[0]
    ;

expression -> expression PLUS expression :
        return kids[0] + kids[2]
    | %prec(PLUS) expression MINUS expression :
        return kids[0] - kids[2]
    | expression TIMES expression :
        return kids[0] * kids[2]
    | expression DIVIDE expression :
        return kids[0] / kids[2]
    | %prec(UNARYMINUS) MINUS expression :
        return -kids[1]
    | LPAREN expression RPAREN :
        return kids[1]
    | NUMBER :
        return kids[0]
    | NAME :
        if kids[0] not in names :
            print "Undefined name '%s'" % kids[0]
            return 0
        return names[kids[0]]
    ;

Each production in the grammar specifies a block of code following the final colon. After the input is parsed into a parse tree, the actions can be applied to the parse tree with the proctree function, as is done in examples/pyg_calc.py:

import sys
import pyggy

# build the lexer and parser
l,ltab = pyggy.getlexer("pyg_calc.pyl")
p,ptab = pyggy.getparser("pyg_calc.pyg")
p.setlexer(l)

while 1:
    sys.stdout.write("calc > ")
    line = sys.stdin.readline()
    if line == "" :
        break

    l.setinputstr(line)
    try :
        tree = p.parse()
    except pyggy.ParseError,e :
        print "parse error at '%s'" % e.str
        continue
    pyggy.proctree(tree, ptab)

The proctree function walks the tree in a depth-first manner and, at each node representing a production in the grammar, runs the action associated with that production. When the action code runs, the kids argument contains a list of the right hand side values of the production.
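The idea of running actions bottom-up over a finished parse can be sketched as follows. This is not the proctree implementation: the walk function, the actions table keyed by placeholder rule labels, and the SymbolNode and RuleNode stand-ins are all hypothetical. The sketch also assumes that a terminal contributes its token value to kids and that the parse is unambiguous; how the real function treats ambiguous nodes is not covered here.

```python
class SymbolNode:
    """Stand-in for pyggy.glr.symbolnode (illustrative only)."""
    def __init__(self, sym, possibilities=()):
        self.sym = sym
        self.possibilities = list(possibilities)

class RuleNode:
    """Stand-in for pyggy.glr.rulenode (illustrative only)."""
    def __init__(self, rule, elements):
        self.rule = rule
        self.elements = list(elements)

# Hypothetical action table: rule label -> function of the child values.
actions = {
    "expression -> expression PLUS expression": lambda kids: kids[0] + kids[2],
    "expression -> expression TIMES expression": lambda kids: kids[0] * kids[2],
    "expression -> NUMBER": lambda kids: kids[0],
}

def walk(node, actions):
    """Depth-first evaluation: children first, then the production's action."""
    if not node.possibilities:      # assumed: terminals carry (name, value)
        return node.sym[1]
    if len(node.possibilities) != 1:
        raise ValueError("ambiguous parse at %r" % (node.sym,))
    rule = node.possibilities[0]
    kids = [walk(kid, actions) for kid in rule.elements]
    return actions[rule.rule](kids)

def num(value):
    leaf = SymbolNode(("NUMBER", value))
    return SymbolNode("expression", [RuleNode("expression -> NUMBER", [leaf])])

def binop(rule, left, op, right):
    return SymbolNode("expression", [RuleNode(rule, [left, SymbolNode(op), right])])

# Hand-built tree for 2 + 3 * 4
product = binop("expression -> expression TIMES expression",
                num(3), ("TIMES", "*"), num(4))
tree = binop("expression -> expression PLUS expression",
             num(2), ("PLUS", "+"), product)

print(walk(tree, actions))  # 14
```

Because each action receives values that were already computed for its children, an action such as kids[0] + kids[2] operates on numbers and not on tree nodes.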
