The organization of the Parser |
topic started 3/29/2005; 11:02:06 PM last post 3/29/2005; 11:02:06 PM |

Davide P. Cervone - The organization of the Parser 3/29/2005; 11:02:06 PM (reads: 816, responses: 0) |

In
a previous discussion, Bob Byerly asks what is the best way to learn
about the new parser that I wrote for WeBWorK. Rather than continue
that (already long) topic, I thought I'd start a new one that gives an
overview of the layout of the parser and its design. It will not be
comprehensive, as the parser is a pretty complicated piece of coding,
but it should give you some guidance for how to proceed if you want to
read throught he code to understand the parser better. The parser is broken into two main packages, the Parser and the Value packages, and each serves a distinct purpose. The Value package is the one handles the computational aspects of the various objects (like complex numbers, real, points, vectors, and so on). It defines the structures for storing the data for the objects, for printing that data in various forms, for promoting one type to another, and so on. It also is the one that overloads the operators to make them work for these data types, and extends the functions like sin() and sqrt() to operate on them. The Value package was designed to work separately from the parser, if you just want a means of computing with complex numbers, points, and so on, except for the Formula class, which relies on the Parser. The Parser package is the one that knows how to interpret a string that represents an expression (like the ones the students type) and turn them into the operations required to evaluate the expression or to compare one expression to another. In contrast to the Value package, which is centered around computation, the Parser package works on a symbolic level to build representations of the structure of the expressions, rather than just producing the results of an expression. The Parser uses the Value package to handle constants of the various types, and to perform the computations when an expression is evaluated. The way the Parser interprets an expression is controlled by a Context object that defines the various operators, functions, data types, and so on, that are available to the Parser. For most of the data types, there is an object class in both the Parser and the Value packages that deals with that type. The ones in the Value package provide for computation with the data type, and the one in the Parser provide symbolic manipulation. The answer checkers are part of the Value package. The various constructors like Point(), Vector(), Real(), and Formula() all produce Value objects, so these are the ones that you deal with most frequently in your perl code. The Formula() object, however, is a shell around a Parser object, and a Formula evaluates to a Value object, so they are pretty intimately connected. Because the Value objects are the ones that you use directly, I will start by discussing them. The pg/lib/Value.pm file defines the basic Value structure, and the other objects (Point, Vector, Real, etc.) are subclasses of this class. The Value class defines all the basic utilitiy routines; ones that let you test if a value is a number, or a formula, or is zero, etc., or that convert a value to an appropriate Value object, or that return the object's class or type, or that extract the data from the object, or that convert the object to its string or TeX or perl representation. Many of these will be overridden by the subclasses, but lots of the basic structure is produced here. The individual Value classes are defined each in its own file. For example, the Real object is defined in pg/lib/Value/Real.pm. Each such class has a constructor (the "sub new") that tells how to create an instance of the object from the user-supplied data, and it usually accepts data in a variety of forms (as a string, or a perl array, or an appropriate perl constant, or another Value object that should be converted to this type, etc.). The "new" method does the typechecking and data conversion to produce the desired Value object. There is also a "make" method that assumes the data are already checked and converted, and simply constructs the object from them (this is used to produce instances without the overhead of the typechecking when you know the data are good). The individual object classes also implement methods to tell how to promote other objects to this class, how to get data out of the class, how to convert an instance to a formula, and so on. Most importantly, the class overloads the various operators (like + and ==) and implement the methods used to perform those operations. Typically, this includes methods like add, mult, compare, and so on. It also may include functions like sin(), or sqrt(), or new ones like norm() or arg() that can be bound to functions in the Parser's Context so that students can access them. Finally, the string, TeX and perl methods can be supplied to produce appropriate strings from the object's data. The pg/lib/Value/Formula.pm file is particularly complicated, because the overloaded comparison operator is pretty sophisticated. This is the file that implements most of the details for function comparison (picking random points, testing the values of functions, finding adaptive parameters, etc.) Two additional files (pg/lib/Value/AnswerChecker.pm and pg/lib/Value/WeBWorK.pm) implement the WeBWork-specific portions of the parser. WeBWorK.pm implements the Parser::Formula and Parser::Evaluate methods that allow you to create and evaluate formulas from student data with error trapping so that you can properly handle error conditions. It also implements the copying of values from global.conf and course.conf into the appropriate locations in the Context objects. The AnswerChecker.pm file implements all the parser's answer checkers. There are basically only two checkers, one for single items and one for lists. The answer checker calls the overloaded == operator to do the actual check to see if the student and professor's answers agree. The rest of the answer checker is devoted to parsing the student answer and reporting error conditions in a reasonable and consistent way. The individual Value classes each provide special methods that customize the common answer checkers to their purposes (e.g., define and implement the special flags needed for that type, and report special error messages appropriate for those types). The Value package is the heart of the computational side of the new parser, both in terms of manipulating values, and in comparing student and professor answers. The Parser package, on the other hand, handles all the details of converting student answers into Value objects and parsed expressions. It is much more complicated than the Value package in many ways, as it has to keep track of the structure of the expressions, not just the final results. The way that the Parser interprets an expression is controlled by the Parser's Context object. This object contains tables that describe the operators and their precedences, the functions, the types of parentheses, the variables that are available and their expected types, and so on. You can change these data to control what functions and operations are available to the student, and what those functions and operations do. The Context object provides a global context for the equations and values used within a problem. (This deserves a lot more discussion, but that will have to be postponed for now. Note, however, that both the Value and the Parser packages contain Context objects. The Parser Context is a subclass of the Value Context, and includes considerably more data. The pg/lib/Context/Default.pm contains the definitions of all the main contexts available for the parser. Some special-purpose contexts are in pg/macros/context*.pl) The main purpose of the Parser class is to build a parse tree, which is a representation of the structure of the mathematical expression that is being parsed. Since a Value::Formula object is really a Parser object, you can call Parser methods or access the Parser data of any Formula object. The parsed expression tree is available in the {tree} field of the Formula. For example, if $f = Formula("x+1"), then $f->{tree} is the parsed formula. The tree is made of Parser::Item objects, or rather, subclasses of this. There are classes for things like binary operators, unary operators, function calls, various types of constants, variables, and so on. These are defined in the files and subdirectories of the pg/lib/Parser directory. For example, pg/lib/Parser/BOP.pm implements the basic binary operator. This in turn is subclassed by the specific binary operators, like plus, times, power, and so on. A Formula object has methods like eval(), substitute(), reduce(), string(), TeX(), and perl(), and these recursively call the corresponding functions on the items in the Formula's tree. So each Parser::Item subclass implements these methods for its specific type of object. For example, the eval() method of the BOP item evaluates each of its operands to compute the values of those subformulas, and then combines the operands to produce the final result (depending on the actual binary operator involved). The subclasses of BOP implement methods for checking the types of the arguments for compatibility, for doing the actual combining of the operands, for reducing the formula based on the formulas in the operands, and so on. Each subclass of the Parser::Item has its own scheme of storing its required data. For example, the BOP objects have an {op} field (for the actual operator involved) and {lop} and {rop} fields, which are the left- and right-hand operands (as trees of Parser::Items). On the other hand, a function call has a {name} field for the name of the function, and a {params} field, which is a reference to an array of arguments to the function (these are trees of Parser::Items). All the Parser::Items have fields that point back to the parent equation (which has a pointer to the context, the original string, and so on), and to data about the position of the item in the original string, etc. They have methods that return the type of value that the item will produce when it is evaluated (so that operand type-checking can be performed, for example), and many also include pointers to the Context data that defines the object. For instance, the BOP data type has a {def} field that points to the Context's operator definition for the particular operator in use (this saves looking it up each time). Currently, there is no easy way to "walk the tree" of a parsed expression. I'll think about how best to do that, but for now, the only real way is to start at the root of the tree and use the item's ->class method to get the object's type (BOP, Function, etc.), then use that to determine how to handle the item on a case-by-case basis. If you are interested in restricting what structures are allowed to be entered by a student, one way to do this is to modify the Context to change how the Parser treats the student's expression. You might consider looking at pg/macros/contextLimited*.pl, which implement contexts in which the students can only enter certain kinds of values. For example, contextLimitedVector.pl implements a context in which the operators like + and * can only be used within the coordinates of a vector, not between vectors themselves. This is done by subclassing the standard BOP items and overridding the type-checking routines that test if the operands have compatible types. The various operators are then tied to the subclasses rather than the original ones. The contextLimitedComplex.pl shows how to restrict the operators based on the values of the left- and right-hand operands to allow only answers of the form a+b*i or r*e^(b*i). Well, I hope that gets you started with understanding the details of the new parser. When you get stuck, you can ask questions here. |