Thursday, July 31, 2008

Post 3: Expression Evaluator - The Expression Class and Evaluation

The Expression class is an n-ary tree representation of a given S-Expression.

In its simplest form, the class would look something like:

public class Expression
{
    protected Expression m_Parent = null;
    public virtual Expression Parent { get { return m_Parent; } }

    protected List<Expression> m_Children = new List<Expression>();
    public virtual List<Expression> Children { get { return m_Children; } }

    protected string m_TokenValue = String.Empty;
    public string TokenValue { get { return m_TokenValue; } }

    // ... other housekeeping properties ...
}
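
Much of the code later in this post leans on a handful of those housekeeping members - a Token property, the SetValue/SetNil mutators and a ValueAsNumber accessor. Here is a rough sketch of what they might look like inside the class, assuming the value is carried around internally as a string (the exact signatures are assumptions, not the definitive implementation):

protected E_TOKEN m_Token = E_TOKEN.MAX;
public E_TOKEN Token { get { return m_Token; } }

// the evaluated value, held internally as a string; null means nil
protected string m_Value = null;

public Expression SetValue(string value)     { m_Value = value; return this; }
public Expression SetValue(double value)     { m_Value = value.ToString(); return this; }
public Expression SetValue(Expression other) { m_Value = other.m_Value; return this; }
public Expression SetNil()                   { m_Value = null; return this; }

// typed view of the value; a failed conversion throws, which the Apply
// function shown below catches and turns into nil
public double ValueAsNumber { get { return double.Parse(m_Value); } }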

At this point in the evaluation cycle, we have simply created a different representation of the S-Expression.

We actually perform the evaluation of each node by recursively evaluating each of the children and merging their results. The obvious questions here are:

  • How exactly do we merge the results?
  • How do we evaluate the children?

It will also be important to discuss the type of the result.

Let's go back to a simple S-Expression:

(+ 1 2 3 4)

The Expression tree generated would be:

  • Root (1 Child)
    • >> (5 Children)
      • "+" (0 Children)
      • "1" (0 Children)
      • "2" (0 Children)
      • "3" (0 Children)
      • "4" (0 Children)

To give the expression meaning, we adopt the convention that the first child represents an operation to be performed, in order, on all of its siblings, and that applying this operation yields the value associated with the first child. Further, we define the value of a node to be the value of its first child.

In our example, the value of the Root node is the value of the anonymous ">>" node, which in turn is the value of the "+" node, which in turn is the result of applying the "+" function to "1", "2", "3" and "4".
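
Concretely, the round trip looks something like this (a sketch; it assumes the parser and handler from the earlier posts, and the ValueAsNumber accessor sketched above):

Expression Root = new Expression();
"(+ 1 2 3 4)".Parse(new ParseEventHandler(ParseEventHandler), new ParserContext(Root));

Expression result = Root.Evaluate();
Console.WriteLine(result.ValueAsNumber);    // prints 10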

Notice that we have now answered the first question: "How exactly do we merge the results?"

To answer the second question, let's take a closer look at the Evaluate function of the Expression class, shown here in its simplified form:

public Expression Evaluate()
{
    switch (this.Children.Count)
    {
        case 0:
            /* atomic value */
            {
                switch (this.Token)
                {
                    case E_TOKEN.SExprStart:
                        break;

                    case E_TOKEN.ValidName:
                    case E_TOKEN.Text:
                    case E_TOKEN.ValidNumber:
                    case E_TOKEN.SingleQuotedText:
                    case E_TOKEN.DoubleQuotedText:
                        {
                            this.SetValue(this.TokenValue);
                        }
                        break;

                    default:
                        {
                            throw new EvaluatorException(this, "Evaluate Error: Unknown Token!");
                        }
                }

                return this;
            }

        case 1:
            /* s-expr or root node */
            {
                return this.SetValue(this.Children[0].Evaluate());
            }

        default:
            /* function call */
            {
                // apply first child to the remaining children
                return this.SetValue(Apply(this.Children[0], this.Children.Skip(1).ToList()));
            }
    }
}

We can see the straightforward processing of the Root and Atom cases. The default case deals with the situation where the first child needs to be "applied" on the other children. Apply is the most elaborate function in this project, but it's quite simple really. If all we did was addition, this is what Apply() would look like:

private static Expression Apply(Expression Functor, List<Expression> Arguments)
{
    string strFunc = Functor.TokenValue;

    switch (strFunc.ToLower())
    {
        #region SUM/CONCAT
        case "+":
        case "add":
        case "sum":
            {
                try
                {
                    return Functor.SetValue(Arguments.Aggregate(0.0d, (result, x) => result + x.Evaluate().ValueAsNumber));
                }
                catch
                { }

                return Functor.SetNil();
            }
        #endregion

        default:
            {
                // in this simplified example, anything else just evaluates to nil
                return Functor.SetNil();
            }
    }
}

The lambda expressions and list extension methods in C# make writing the processor sections a breeze. It's easy to see how the first child's value (the Functor argument to Apply) is set to the sum of the other children (the Arguments argument to Apply).
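
For instance, a multiplication processor follows exactly the same pattern - the sketch below slots into the same switch in Apply (the "mul" alias is illustrative):

        #region PRODUCT
        case "*":
        case "mul":
            {
                try
                {
                    return Functor.SetValue(Arguments.Aggregate(1.0d, (result, x) => result * x.Evaluate().ValueAsNumber));
                }
                catch
                { }

                return Functor.SetNil();
            }
        #endregion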

We are only left with one major issue to discuss - that of the value type.

Wednesday, July 30, 2008

Post 2: Expression Evaluator - Building an Expression Tree

The Parse function described in the previous posts allows us to specify a ParseEventHandler delegate which will be called back by the parser.

We use this function to monitor the parsing of an expression string, and build a corresponding expression tree, which can then be evaluated. This is one of several approaches we could take for the evaluation process - alternatively, we could proceed with the evaluation in-situ, with a makeshift stack. I took this approach because it allows us to re-use the expression strings without having to re-parse them - which is something we'll need in order to support user-defined functions.

The building of the expression tree is straightforward. We require a simple class to keep track of the Root node of the expression, and the Current node - which represents the S-Expression being parsed currently.

We write a simple ParseEventHandler delegate, which hooks the SExprStart state to signify the creation of a sub-tree, and the Atom state to signify leaf nodes of the expression tree. We pass along a ParserContext object to keep a reference to the Root and the Current nodes of the expression tree.

public class ParserContext
{
    public Expression Root { get; private set; }
    public Expression Current { get; set; }

    public ParserContext(Expression Root)
    {
        this.Root = Root;
        this.Current = Root;
    }
}

The ParseEventHandler function is simple and elegant. All we're doing here is listening for three events:

  • SExprStart : Signifying a new S-Expression child. Our handler creates a child and sets it as the current parent, so future nodes will get bound to it.
  • SExprFinish: Signifying the completion of the current S-Expression. Our handler "pops the stack", as it were, and marks the parent node as the current parent.
  • Atom: Signifying a leaf-node child of the current parent. Simply creates and attaches a leaf-node, leaving the current parent intact.

protected static void ParseEventHandler(ParseEventArgs e)
{
    ParserContext pctx = e.ParserContext as ParserContext;
    if (pctx == null) return;

    switch (e.Token)
    {
        case E_TOKEN.SExprStart:
            {
                pctx.Current = new Expression(pctx.Current, ">>", e.State, e.Token);
            }
            return;

        case E_TOKEN.SExprFinish:
            {
                pctx.Current = pctx.Current.Parent;
            }
            return;
    }

    switch (e.State)
    {
        case E_PARSESTATE.Atom:
            {
                Expression exIgnore = pctx.Current.NewChild(e.TokenValue, e.State, e.Token);
            }
            return;

        case E_PARSESTATE.Failure:
            {
                throw new ParseException(String.Format("*** FAILURE {0}] : [ {1} ], {2}, {3}", e.ErrorDescription, e.TokenValue, e.State, e.Token));
            }
    }
}
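
The handler leans on two small pieces of the Expression class: a constructor that attaches the new node to the given parent, and a NewChild helper that does the same for leaf nodes. A rough sketch of those members, inferred from the calls above (the real signatures may differ):

public Expression(Expression Parent, string TokenValue, E_PARSESTATE State, E_TOKEN Token)
{
    m_Parent = Parent;
    m_TokenValue = TokenValue;
    // the State and Token values are stashed away in housekeeping fields

    if (Parent != null) { Parent.Children.Add(this); }
}

public Expression NewChild(string TokenValue, E_PARSESTATE State, E_TOKEN Token)
{
    return new Expression(this, TokenValue, State, Token);
}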

To actually call the parser, all we do is pass in a reference to the handler and context instance:

Expression Root = new Expression();

Parser.Parse(pszExpression, new ParseEventHandler(ParseEventHandler), new ParserContext(Root));

// ... at this point, Root has been built up, or parsing has failed and thrown an exception

We'll discuss the Expression class in detail in the next post.

Eden is Born!

Eden Trinity Azariah was born today in Moncton, Canada to my brother and sister-in-law!

Mom and baby are doing well!

Big sister Tabitha couldn't be happier!

Welcome, baby Eden!

Friday, July 25, 2008

Post 1: Expression Evaluator - The Parser

We can take the classic state-machine approach to build the S-Expression parser, and use the C# event constructs to perform appropriate callbacks as we encounter specific states.

The Parser class contains a single static function which can parse a string and call the handler at appropriate points for further processing. A context object is passed along to the handler for its use.

The Parse function consists of a single loop that walks the string from beginning to end, using the Tokenizer helper class to break up the string into lexically interesting chunks, and transitioning the state machine through the various states. We'll discuss Tokenizer::NextToken in detail next.

The state machine has the following states:

  • Start
  • Finish
  • SExprStart
  • SExprFinish
  • AtomOrSExpr
  • Atom
  • Failure

Each of these states is of interest to the caller of the Parse function, as each state represents effectively a result of parsing the string.
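
For reference, the enums and event plumbing used throughout the code below look roughly like this - a sketch reconstructed from the calls in Parse, so the actual declarations may differ in detail:

public enum E_PARSESTATE { Start, Finish, SExprStart, SExprFinish, AtomOrSExpr, Atom, Failure }

public enum E_TOKEN { SExprStart, SExprFinish, ValidName, ValidNumber, Text, SingleQuotedText, DoubleQuotedText, EOF, MAX }

public enum E_PARSERESULT { S_OK, E_FAILURE }

public delegate void ParseEventHandler(ParseEventArgs e);

public class ParseEventArgs
{
    public object ParserContext { get; private set; }
    public E_PARSESTATE State { get; private set; }
    public E_TOKEN Token { get; private set; }
    public string TokenValue { get; private set; }
    public string ErrorDescription { get; private set; }
    public int ErrorCode { get; private set; }

    public ParseEventArgs(string psz, object context, E_PARSESTATE state, E_TOKEN token, int sStart, int sFinish)
        : this(psz, context, state, token, sStart, sFinish, String.Empty, 0)
    {
    }

    public ParseEventArgs(string psz, object context, E_PARSESTATE state, E_TOKEN token, int sStart, int sFinish, string errorDescription, int errorCode)
    {
        this.ParserContext = context;
        this.State = state;
        this.Token = token;
        this.TokenValue = (sFinish > sStart) ? psz.Substring(sStart, sFinish - sStart) : String.Empty;
        this.ErrorDescription = errorDescription;
        this.ErrorCode = errorCode;
    }
}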

The Parse results for a simple string are given - it's easy to walk through the body of the Parse function and see why the results came out the way they did:

Input String: (test "(f1 a b)" (f2 c d))

  • Start,
  • SExprStart, (
  • Atom - ValidName, test
  • Atom - DoubleQuotedText, (f1 a b)
  • SExprStart, (
  • Atom - ValidName, f2
  • Atom - ValidName, c
  • Atom - ValidName, d
  • SExprFinish, )
  • SExprFinish, )
  • Finish,

The grammar we described earlier defines what state we want to be in after encountering a particular token, and this structure is reflected in the state machine. The Tokenizer also goes a little further and reports whether the text it encounters conforms to a valid variable name or a valid number. This helps us when we want to do some further processing of the parsed results.

public static E_PARSERESULT Parse(this string psz, ParseEventHandler ParseEventHandler, object context)
{
    int sStart = 0, sFinish = 0; int expr_nest = 0;

    E_PARSESTATE state = E_PARSESTATE.Start;

    while (sFinish <= psz.Length)
    {
        switch (state)
        {
            case E_PARSESTATE.Start:
                {
                    expr_nest = 0;

                    // dispense the notification.
                    ParseEventHandler(new ParseEventArgs(psz, context, state, E_TOKEN.MAX, 0, 0));

                    state = E_PARSESTATE.AtomOrSExpr;
                };
                break;
            case E_PARSESTATE.Finish:
                {
                    if (expr_nest != 0) { state = E_PARSESTATE.Failure; break; }

                    // dispense the notification.
                    ParseEventHandler(new ParseEventArgs(psz, context, state, E_TOKEN.MAX, 0, 0));

                    // we're done...
                    return E_PARSERESULT.S_OK;
                }
        }

        E_TOKEN tok = Tokenizer.NextToken(psz, sFinish, out sStart, out sFinish);

        switch (state)
        { 
            case E_PARSESTATE.SExprStart:
                {
                    // SExpr ::= SExprStart (Atom | SExpr)* SExprFinish
                    switch (tok)
                    {
                        case E_TOKEN.SExprStart: // (
                            {
                                expr_nest++;

                                ParseEventHandler(new ParseEventArgs(psz, context, E_PARSESTATE.SExprStart, tok, sStart, sFinish));
                                state = E_PARSESTATE.AtomOrSExpr;
                            }
                            break;

                        case E_TOKEN.EOF:
                            {
                                sFinish = sStart; state = E_PARSESTATE.Finish;
                            }
                            break;

                        default:
                            {
                                ParseEventHandler(new ParseEventArgs(psz, context, E_PARSESTATE.SExprStart, tok, sStart, sFinish, "Unexpected token found instead of SExprStart", 0));
                                state = E_PARSESTATE.Failure;
                            };
                            break;
                    }
                }
                break;

            case E_PARSESTATE.AtomOrSExpr:
                {
                    switch (tok)
                    {
                        case E_TOKEN.DoubleQuotedText:
                        case E_TOKEN.SingleQuotedText:
                            {
                                ParseEventHandler(new ParseEventArgs(psz, context, E_PARSESTATE.Atom, tok, sStart+1, sFinish-1));
                                state = E_PARSESTATE.AtomOrSExpr;
                            }
                            break;

                        case E_TOKEN.ValidName:
                        case E_TOKEN.ValidNumber:
                        case E_TOKEN.Text:
                            {
                                ParseEventHandler(new ParseEventArgs(psz, context, E_PARSESTATE.Atom, tok, sStart, sFinish));
                                state = E_PARSESTATE.AtomOrSExpr;
                            }
                            break;

                        case E_TOKEN.SExprStart: // (
                            {
                                sFinish = sStart; //rewind token
                                state = E_PARSESTATE.SExprStart;
                            }
                            break;

                        case E_TOKEN.SExprFinish: // )
                            {
                                sFinish = sStart; //rewind token
                                state = E_PARSESTATE.SExprFinish;
                            }
                            break;

                        case E_TOKEN.EOF:
                            {
                                state = E_PARSESTATE.Finish;
                            }
                            break;

                        default:
                            {
                                ParseEventHandler(new ParseEventArgs(psz, context, E_PARSESTATE.Failure, tok, sStart, sFinish, "Unexpected token found without matching SExprFinish", 0));
                                state = E_PARSESTATE.Failure;
                            };
                            break;
                    }
                }
                break;

            case E_PARSESTATE.SExprFinish:
                {
                    switch (tok)
                    {
                        case E_TOKEN.SExprFinish: // )
                            {
                                expr_nest--;

                                ParseEventHandler(new ParseEventArgs(psz, context, E_PARSESTATE.SExprFinish, tok, sStart, sFinish));
                                state = (expr_nest == 0) ? E_PARSESTATE.Finish : E_PARSESTATE.AtomOrSExpr;
                            }
                            break;

                        case E_TOKEN.EOF:
                            {
                                sFinish = sStart; state = E_PARSESTATE.Finish;
                            }
                            break;

                        default:
                            {
                                ParseEventHandler(new ParseEventArgs(psz, context, E_PARSESTATE.Failure, tok, sStart, sFinish, "Unexpected token found instead of SExprFinish", 0));
                                state = E_PARSESTATE.Failure;
                            };
                            break;
                    }
                }
                break;

            case E_PARSESTATE.Failure:
            default:
                {
                    ParseEventHandler(new ParseEventArgs(psz, context, E_PARSESTATE.Failure, E_TOKEN.MAX, sStart, sFinish, "Parser Failure", 0));
                }
                return E_PARSERESULT.E_FAILURE;

        }
    } // while (sFinish <= psz.Length)

    {
        // we fell out of the loop without reaching the Finish state - report failure
        ParseEventHandler(new ParseEventArgs(psz, context, E_PARSESTATE.Failure, E_TOKEN.MAX, sStart, sFinish));
    }

    return E_PARSERESULT.E_FAILURE;
}

The Tokenizer helper class has a bunch of text processing functions which simply make life easier. The primary function, that of tokenizing - or chopping up the input text - is carried out by the NextToken function:

public static E_TOKEN NextToken(string psz, int s, out int sTokBegin, out int sTokEnd)
{
    sTokBegin = -1; sTokEnd = psz.Length - 1;
    if (s < 0) { return E_TOKEN.EOF; }

    int sb = SnarfWhiteSpace(psz, s);

    sTokBegin = sb;
    if (sb == psz.Length) { return E_TOKEN.EOF; }

    // (
    if (IsOpenParen(psz[sb]))
    {
        sTokEnd = ++sb; return E_TOKEN.SExprStart;
    }

    // )
    if (IsCloseParen(psz[sb]))
    {
        sTokEnd = ++sb; return E_TOKEN.SExprFinish;
    }

    // '...'
    int se = sb;
    if (psz[sb] == '\'')
    {
        return SnarfQuotedText(psz, sb, out sTokBegin, out sTokEnd, psz[sb]) ? E_TOKEN.SingleQuotedText : E_TOKEN.EOF;
    }
    sb = se;

    // "..."
    se = sb;
    if (psz[sb] == '"')
    {
        return SnarfQuotedText(psz, sb, out sTokBegin, out sTokEnd, psz[sb]) ? E_TOKEN.DoubleQuotedText : E_TOKEN.EOF;
    }
    sb = se;

    {
        sb++;

        bool fOK = true;
        for (; (sb < psz.Length) && fOK; sb++)
        {
            char ch = psz[sb];
            fOK = (!(IsWhiteSpace(ch) || IsOpenParen(ch) || IsCloseParen(ch)));
        }

        if (!fOK) { --sb; }

        if (sb <= psz.Length)
        {
            sTokEnd = sb; sTokBegin = SnarfWhiteSpace(psz, sTokBegin);

            string strToken = psz.Substring(sTokBegin, sTokEnd - sTokBegin);

            if (IsValidName(strToken)) { return E_TOKEN.ValidName; }
            if (IsValidNumber(strToken)) { return E_TOKEN.ValidNumber; }

            return E_TOKEN.Text;
        }

        sTokEnd = sb; return E_TOKEN.EOF;
    }
}

The tokenizing approach is a throwback to my C/C++ days, when strings were null-delimited arrays of characters, and traipsing over them using a bunch of indices to keep track of sub-strings was the most efficient way of dealing with them. The NextToken function is passed a tentative starting point (typically the point where it left off last) and returns the next interesting token encountered after that point in the string.
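
Since the Tokenizer lives inside the Parser, driving it directly looks something like this (a sketch of a test harness; the caller just keeps feeding the returned end offset back in as the next starting point):

string psz = "(+ 1 2)";
int sStart = 0, sFinish = 0;

for (;;)
{
    E_TOKEN tok = Tokenizer.NextToken(psz, sFinish, out sStart, out sFinish);
    if (tok == E_TOKEN.EOF) { break; }

    Console.WriteLine("{0} : [{1}]", tok, psz.Substring(sStart, sFinish - sStart));
}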

A few points of note:

The Tokenizer class has been made internal to the Parser class and therefore has no interface footprint. Of course, with the C# partial class structure, we could put it in a separate file but keep it as a part of the Parser class.

We take care to make both the Parser and the Tokenizer static classes (with all static functions, obviously). This guarantees that no state is associated with the classes themselves, so they are multi-thread ready.

One sweet syntactic bonus is to write the Parse function as an extension method on the String type, so if you include the namespace containing the Parser, all strings automatically have the Parse function stuck on to them - very chic!
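
For example, a throwaway handler that just echoes the parse events, combined with the extension method syntax, is enough to reproduce the kind of trace shown earlier (a sketch; the handler name is illustrative):

protected static void TraceEventHandler(ParseEventArgs e)
{
    Console.WriteLine("{0}, {1}, {2}", e.State, e.Token, e.TokenValue);
}

// ...

E_PARSERESULT result = "(test \"(f1 a b)\" (f2 c d))".Parse(new ParseEventHandler(TraceEventHandler), null);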

Monday, July 21, 2008

Part 0: Building a Basic Expression Evaluator

Quite often in our work, we come across situations where we would have to customize some business-logic quickly and on-the-fly. One such situation is when we have a workflow coded up, but the logic for transitioning between states is not fully known or is likely to change when the customers finally decide what they want.

In some such situations, we would really love to store the state-transition logic in a form that is easily editable at customization time and dynamically evaluable, allowing the workflow's behaviour to evolve and refine over time.

Being a classic computer scientist at heart, I decided to see if we could embed a little LISP evaluation engine and perhaps write and store the customization logic as a LISP expression, which would be evaluated at run-time and allow for the flexibility we needed. Then, being a classic masochist, I decided to spend a few days and see if I could build the evaluation engine itself. (Actually, I did google up for an embeddable LISP engine, but found that the gymnastics involved in integrating the engine into our environment wouldn't be worth the trouble!)

The evaluator we built takes a string s-expression value, and reduces it to a single scalar value. We built it in stages, and I'll describe the building process similarly. Unlike a full blown LISP implementation, all our functions return a single scalar - so there isn't support for the cons primitive.

The lexical structure for the little language we're building is:

  • Program ::= Atom | SExpr
  • Atom ::= number | boolean | string 
  • SExpr ::= "(" (Atom | SExpr)* ")"
  • number is a pattern conforming to the following regular expression for an IEEE floating point: ^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]*\.?[0-9]+)?$ (see the snippet after this list)
  • boolean is one of the following strings: t, true, false, nil and f, each corresponding to its commonly known, self-explanatory value
  • string is a pattern conforming to an unquoted sequence of non-white-space characters, or a quoted sequence of characters including white-space
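
For example, the number pattern above translates directly into a Regex check - a sketch of how the IsValidNumber helper (part of the tokenizer described in the following posts) might be written:

// requires System.Text.RegularExpressions
private static readonly Regex NumberPattern = new Regex(@"^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]*\.?[0-9]+)?$", RegexOptions.Compiled);

private static bool IsValidNumber(string strToken)
{
    return NumberPattern.IsMatch(strToken);
}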

Semantically, we will use the general LISP convention where the car or first element of the SExpr is the function we want to apply to the cdr or the remainder of the SExpr. We will also pre-define some of the well-known functions we want to process.

So for example, we might want to evaluate the following strings and expect the associated results:

  • (+ 1 2 3 4) => 10
  • (+ 1 2 (* 3 2)) => 9
  • (+ "hello" " " "world") => "hello world"
  • (and (> 4 2 1) (<= 3 4 4 5)) => true

The cool thing to notice is that we want our engine to deal with dynamically typing the values into either numbers, boolean values or strings. The well-known functions also have to be intelligent and decide, for example, when to sum numbers and when to concatenate strings.
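
A sketch of how a well-known function might make that decision - try a numeric interpretation of every argument, and fall back to string concatenation if any of them refuses to be a number (the helper name and signature are illustrative):

// requires System.Linq
private static string SumOrConcat(List<string> args)
{
    var numbers = new List<double>();

    foreach (string arg in args)
    {
        double d;
        if (!double.TryParse(arg, out d))
        {
            // not all arguments are numeric - treat "+" as concatenation
            return String.Join(String.Empty, args.ToArray());
        }
        numbers.Add(d);
    }

    return numbers.Sum().ToString();
}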

Of course, as the engine evolves, we want to pay special attention to integrating it into an environment from which it could be called, and introduce support for custom functions to supplement the well-known ones. Further improvements would include support for recursive and mutually recursive functions.

The posts following will outline the approach to building the evaluation engine and providing the support we desire.