Friday, July 25, 2008

Post 1: Expression Evaluator - The Parser

We can take the classic state-machine approach to building the S-Expression parser, and use C# delegate callbacks to notify the caller as we encounter specific states.

The Parser class contains a single static function that parses a string, calling the handler at the appropriate points for further processing. A context object is passed along to the handler for its use.
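The supporting delegate and event-args declarations aren't shown in this post; a minimal sketch consistent with how Parse invokes them (the member names are my guesses) might be:

```csharp
using System;

// The callback the parser invokes at each interesting state.
public delegate void ParseEventHandler(ParseEventArgs e);

// Carries everything the handler needs: the source text, the caller's
// context object, the state reached, the token kind, and the [Start, Finish)
// range of the token within the source text.
public class ParseEventArgs : EventArgs
{
    public ParseEventArgs(string text, object context, E_PARSESTATE state,
                          E_TOKEN token, int start, int finish)
        : this(text, context, state, token, start, finish, null, 0) { }

    public ParseEventArgs(string text, object context, E_PARSESTATE state,
                          E_TOKEN token, int start, int finish,
                          string message, int errorCode)
    {
        Text = text; Context = context; State = state; Token = token;
        Start = start; Finish = finish; Message = message; ErrorCode = errorCode;
    }

    public string Text { get; private set; }
    public object Context { get; private set; }
    public E_PARSESTATE State { get; private set; }
    public E_TOKEN Token { get; private set; }
    public int Start { get; private set; }
    public int Finish { get; private set; }
    public string Message { get; private set; }
    public int ErrorCode { get; private set; }
}
```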

The Parse function consists of a single loop that walks the string from beginning to end, using the Tokenizer helper class to break the string up into lexically interesting chunks and transitioning the state machine through its various states. We'll discuss Tokenizer.NextToken in detail below.

The state machine has the following states:

  • Start
  • Finish
  • SExprStart
  • SExprFinish
  • AtomOrSExpr
  • Atom
  • Failure

Each of these states is of interest to the caller of the Parse function, as each state effectively represents a result of parsing the string.
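In code, these states (along with the token kinds and result codes that the Parse function uses) could be declared as simple enums. The member names are taken from the code in this post; the exact declarations are my reconstruction:

```csharp
public enum E_PARSESTATE
{
    Start, Finish, SExprStart, SExprFinish, AtomOrSExpr, Atom, Failure
}

public enum E_TOKEN
{
    SExprStart,       // (
    SExprFinish,      // )
    SingleQuotedText, // '...'
    DoubleQuotedText, // "..."
    ValidName,        // conforms to a valid variable name
    ValidNumber,      // conforms to a valid number
    Text,             // any other run of characters
    EOF,
    MAX               // sentinel used when no token applies
}

public enum E_PARSERESULT { S_OK, E_FAILURE }
```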

The Parse results for a simple string are given below - it's easy to walk through the body of the Parse function and see why the results came out the way they did:

Input String: (test "(f1 a b)" (f2 c d))

  • Start,
  • SExprStart, (
  • Atom - ValidName, test
  • Atom - DoubleQuotedText, (f1 a b)
  • SExprStart, (
  • Atom - ValidName, f2
  • Atom - ValidName, c
  • Atom - ValidName, d
  • SExprFinish, )
  • SExprFinish, )
  • Finish,
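A handler that produces a trace like the one above is just a function matching the ParseEventHandler delegate. Assuming ParseEventArgs exposes the state, token, and token range as properties (the property names here are my guesses), it could be as simple as:

```csharp
static void TraceHandler(ParseEventArgs e)
{
    // Slice the token text out of the source string, if this state carries one.
    string text = (e.Finish > e.Start)
        ? e.Text.Substring(e.Start, e.Finish - e.Start)
        : string.Empty;

    Console.WriteLine("{0} - {1}, {2}", e.State, e.Token, text);
}

// Invoked via the extension-method syntax described later in the post:
// "(test \"(f1 a b)\" (f2 c d))".Parse(TraceHandler, null);
```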

The grammar we described earlier defines what state we want to be in after encountering a particular token, and this structure is reflected in the state machine. The Tokenizer also goes a little further and reports whether the text it encounters conforms to a valid variable name or a valid number. This helps us when we want to do further processing of the parsed results.

public static E_PARSERESULT Parse(this string psz, ParseEventHandler ParseEventHandler, object context)
{
    int sStart = 0, sFinish = 0; int expr_nest = 0;

    E_PARSESTATE state = E_PARSESTATE.Start;

    while (sFinish <= psz.Length)
    {
        switch (state)
        {
            case E_PARSESTATE.Start:
                {
                    expr_nest = 0;

                    // dispense the notification.
                    ParseEventHandler(new ParseEventArgs(psz, context, state, E_TOKEN.MAX, 0, 0));

                    state = E_PARSESTATE.AtomOrSExpr;
                }
                break;
            case E_PARSESTATE.Finish:
                {
                    if (expr_nest != 0) { state = E_PARSESTATE.Failure; break; }

                    // dispense the notification.
                    ParseEventHandler(new ParseEventArgs(psz, context, state, E_TOKEN.MAX, 0, 0));

                    // we're done...
                    return E_PARSERESULT.S_OK;
                }
        }

        E_TOKEN tok = Tokenizer.NextToken(psz, sFinish, out sStart, out sFinish);

        switch (state)
        { 
            case E_PARSESTATE.SExprStart:
                {
                    // SExpr ::= SExprStart (Atom | SExpr)* SExprFinish
                    switch (tok)
                    {
                        case E_TOKEN.SExprStart: // (
                            {
                                expr_nest++;

                                ParseEventHandler(new ParseEventArgs(psz, context, E_PARSESTATE.SExprStart, tok, sStart, sFinish));
                                state = E_PARSESTATE.AtomOrSExpr;
                            }
                            break;

                        case E_TOKEN.EOF:
                            {
                                sFinish = sStart; state = E_PARSESTATE.Finish;
                            }
                            break;

                        default:
                            {
                                ParseEventHandler(new ParseEventArgs(psz, context, E_PARSESTATE.SExprStart, tok, sStart, sFinish, "Unexpected token found instead of SExprStart", 0));
                                state = E_PARSESTATE.Failure;
                            }
                            break;
                    }
                }
                break;

            case E_PARSESTATE.AtomOrSExpr:
                {
                    switch (tok)
                    {
                        case E_TOKEN.DoubleQuotedText:
                        case E_TOKEN.SingleQuotedText:
                            {
                                ParseEventHandler(new ParseEventArgs(psz, context, E_PARSESTATE.Atom, tok, sStart+1, sFinish-1));
                                state = E_PARSESTATE.AtomOrSExpr;
                            }
                            break;

                        case E_TOKEN.ValidName:
                        case E_TOKEN.ValidNumber:
                        case E_TOKEN.Text:
                            {
                                ParseEventHandler(new ParseEventArgs(psz, context, E_PARSESTATE.Atom, tok, sStart, sFinish));
                                state = E_PARSESTATE.AtomOrSExpr;
                            }
                            break;

                        case E_TOKEN.SExprStart: // (
                            {
                                sFinish = sStart; //rewind token
                                state = E_PARSESTATE.SExprStart;
                            }
                            break;

                        case E_TOKEN.SExprFinish: // )
                            {
                                sFinish = sStart; //rewind token
                                state = E_PARSESTATE.SExprFinish;
                            }
                            break;

                        case E_TOKEN.EOF:
                            {
                                state = E_PARSESTATE.Finish;
                            }
                            break;

                        default:
                            {
                                ParseEventHandler(new ParseEventArgs(psz, context, E_PARSESTATE.Failure, tok, sStart, sFinish, "Unexpected token found without matching SExprFinish", 0));
                                state = E_PARSESTATE.Failure;
                            }
                            break;
                    }
                }
                break;

            case E_PARSESTATE.SExprFinish:
                {
                    switch (tok)
                    {
                        case E_TOKEN.SExprFinish: // )
                            {
                                expr_nest--;

                                ParseEventHandler(new ParseEventArgs(psz, context, E_PARSESTATE.SExprFinish, tok, sStart, sFinish));
                                state = (expr_nest == 0) ? E_PARSESTATE.Finish : E_PARSESTATE.AtomOrSExpr;
                            }
                            break;

                        case E_TOKEN.EOF:
                            {
                                sFinish = sStart; state = E_PARSESTATE.Finish;
                            }
                            break;

                        default:
                            {
                                ParseEventHandler(new ParseEventArgs(psz, context, E_PARSESTATE.Failure, tok, sStart, sFinish, "Unexpected token found instead of SExprFinish", 0));
                                state = E_PARSESTATE.Failure;
                            }
                            break;
                    }
                }
                break;

            case E_PARSESTATE.Failure:
            default:
                {
                    ParseEventHandler(new ParseEventArgs(psz, context, E_PARSESTATE.Failure, E_TOKEN.MAX, sStart, sFinish, "Parser Failure", 0));
                }
                return E_PARSERESULT.E_FAILURE;

        }
    } // while (sFinish <= psz.Length)

    // we fell out of the loop without reaching the Finish state.
    ParseEventHandler(new ParseEventArgs(psz, context, E_PARSESTATE.Failure, E_TOKEN.MAX, sStart, sFinish));

    return E_PARSERESULT.E_FAILURE;
}

The Tokenizer helper class has a bunch of text-processing functions which simply make life easier. The primary function - that of tokenizing, or chopping up the input text - is carried out by the NextToken function:

public static E_TOKEN NextToken(string psz, int s, out int sTokBegin, out int sTokEnd)
{
    sTokBegin = -1; sTokEnd = psz.Length - 1;
    if (s < 0) { return E_TOKEN.EOF; }

    int sb = SnarfWhiteSpace(psz, s);

    sTokBegin = sb;
    if (sb == psz.Length) { return E_TOKEN.EOF; }

    // (
    if (IsOpenParen(psz[sb]))
    {
        sTokEnd = ++sb; return E_TOKEN.SExprStart;
    }

    // )
    if (IsCloseParen(psz[sb]))
    {
        sTokEnd = ++sb; return E_TOKEN.SExprFinish;
    }

    // '...'
    int se = sb;
    if (psz[sb] == '\'')
    {
        return SnarfQuotedText(psz, sb, out sTokBegin, out sTokEnd, psz[sb]) ? E_TOKEN.SingleQuotedText : E_TOKEN.EOF;
    }
    sb = se;

    // "..."
    se = sb;
    if (psz[sb] == '"')
    {
        return SnarfQuotedText(psz, sb, out sTokBegin, out sTokEnd, psz[sb]) ? E_TOKEN.DoubleQuotedText : E_TOKEN.EOF;
    }
    sb = se;

    {
        sb++;

        bool fOK = true;
        for (; (sb < psz.Length) && fOK; sb++)
        {
            char ch = psz[sb];
            fOK = (!(IsWhiteSpace(ch) || IsOpenParen(ch) || IsCloseParen(ch)));
        }

        if (!fOK) { --sb; }

        if (sb <= psz.Length)
        {
            sTokEnd = sb; sTokBegin = SnarfWhiteSpace(psz, sTokBegin);

            string strToken = psz.Substring(sTokBegin, sTokEnd - sTokBegin);

            if (IsValidName(strToken)) { return E_TOKEN.ValidName; }
            if (IsValidNumber(strToken)) { return E_TOKEN.ValidNumber; }

            return E_TOKEN.Text;
        }

        sTokEnd = sb; return E_TOKEN.EOF;
    }
}

The tokenizing approach is a throwback to my C/C++ days, when strings were null-terminated arrays of characters, and traipsing over them with a bunch of indices to keep track of sub-strings was the most efficient way of dealing with them. The NextToken function is passed a tentative starting point (typically the point where it left off last) and returns the next interesting token encountered after that point in the string.
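That contract is easy to exercise in isolation: feed the previous call's end position back in as the next starting point until EOF comes back. If Tokenizer were visible outside the Parser (it isn't - see below), a driver loop would look like this:

```csharp
string psz = "(add 12 'x')";
int sStart, sFinish = 0;

for (;;)
{
    // Resume scanning where the previous token ended.
    E_TOKEN tok = Tokenizer.NextToken(psz, sFinish, out sStart, out sFinish);
    if (tok == E_TOKEN.EOF) { break; }

    Console.WriteLine("{0}: '{1}'", tok, psz.Substring(sStart, sFinish - sStart));
}
```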

A few points of note:

The Tokenizer class has been made internal to the Parser class and therefore has no public interface footprint. Of course, with C# partial classes, we could put it in a separate file while keeping it part of the Parser class.

We take care to make the Parser and Tokenizer both static classes (with all static functions, obviously). This guarantees there is no state associated with the classes themselves, so they are multi-thread ready.

One sweet syntactic bonus is to write the Parse function as an extension method on the String type, so if you include the namespace containing the Parser, all strings automatically have the Parse function stuck on to them - very chic!
