Once again about the analysis of expressions using the recursive descent method

I was still a schoolboy when I fell into the hands of the book "Beginner Course C and C ++" from the publishing house Dialog MEPhI. It was from this book that I learned about the basics of object-oriented programming, and it also posed a problem for me that I could not solve for quite some time. At the end of it was an application listing programs, one of which was called a recursive calculator. This was my first encounter with the analysis of arithmetic expressions using the recursive descent method. I still remember those long hours that were spent trying to at least understand something in this code.

The article is intended primarily for those who are faced with the task of parsing expressions for the first time.

And so, before writing the parser of arithmetic expressions, we need to understand the structure of these very expressions. For simplicity, we will not consider the unary minus operation and the possibility of using functions in the expression for now.

To describe the structure, we use the Beckus-Naur forms

E -> T + E | T - E | T
T -> F * T | F / t | F
F -> N | (E)

Here's the vertical bar "|" means a logical operation "or", and E, T, F, N - various types of subexpressions into which the original expression will be split when using these sentences. N in this case denotes a number.

The first sentence means that an expression of type E can consist of an expression of type T plus an expression of type E, or an expression of type T minus an expression of type E or just an expression of type T.

For the second sentence, everything is similar, but instead of adding and subtracting, there will be multiplication and division.

The third sentence means that an expression of type F is either a number or an expression of type E enclosed in brackets.

Let's try to apply these three sentences to an example of an arithmetic expression

2 + 3 * (4 - 5) + 6 - 7

We apply the first sentence several times

E (2 + 3 * (4-5) + 6-7) -> 
T (2) + E (3 * (4-5) + 6-7) -> 
T (2) + T (3 * (4-5)) + E (6-7) -> 
T (2) + T (3 * (4-5)) + T (6) - T (7)

Now our expression consists only of subexpressions of type T, so we apply the second sentence to each of them.

T (2) -> F (2)
T (3 * (4-5)) -> F (3) * T ((4-5)) -> F (3) * F ((4-5))
T (6) -> F (6)
T (7) -> F (7)

It's time for the third sentence

F (2) -> N (2) -> 2
F (3) -> N (3) -> 3
F ((4-5)) -> E (4-5)
F (6) -> N (6) -> 6
F (7) -> N (7) -> 7

For the expression in brackets (4-5), we recursively moved on to the first sentence and will have to repeat all the actions by analogy with what it was for the rest of the expression. If the expression contained nested brackets, then there would be more such recursive transitions. I would recommend that you independently try to apply all three sentences to an expression that would contain nested levels of brackets, for example 2 + 2 * (3 + 4 * (5-6)), to make sure that they work for it too.

Now we just have to implement the actions that are carried out by the proposals in the form of code. This will be the analysis of the expression using the recursive descent method, but before you start writing code, let's simplify the sentences a bit by getting rid of the recursion in the first two of them.

If we endlessly substitute the first sentence into itself, then we get

E -> T ± T ± T ± T ± ... ± T

The second sentence can be transformed in a similar way.

T -> F * / F * / F * / * / ... * / F

In fact, this means that instead of recursion, we can use a loop here. Similarly, recursion cannot be excluded from the third sentence.

We will write the code in Java, because This is the most familiar programming language for me at the moment. The last remark before we start writing the code - I leave out the task of tokenization, i.e. splitting the expression into indivisible significant elements (tokens). Such tokens are numbers, signs of arithmetic operations and brackets. Suppose we already have an token array as input from the expression.

String[] tokens = {"2", "+", "3", "*", "(", "4", "-", "5", ")", "+", "6", "-", "7"};

Create the next parser class that will contain the methods

  • parse () starts parsing an expression
  • expression () to handle expressions of type E
  • term () for processing expressions of type T
  • factor () for processing expressions of type F and N

public class RecursiveDescentParser {

    private final String[] tokens;
    private int pos = 0; //   

    public RecursiveDescentParser(String[] tokens) {
        this.tokens = tokens;
    }

    public Double parse() {
        Double result = expression();
        if (pos != tokens.length) {
            throw new IllegalStateException("Error in expression at " + tokens[pos]));
        }
        return result;
    }

    // E -> T±T±T±T± ... ±T
    private Double expression() {
        //   
        Double first = term();

        while (pos < tokens.length) {
            String operator = tokens[pos];
            if (!operator.equals("+") && !operator.equals("-")) {
                break;
            } else {
                pos++;
            }

            //    ()
            Double second = term();
            if (operator.equals("+")) {
                first += second;
            } else {
                first -= second;
            }
        }
        return first;
    }

    // T -> F*/F*/F*/*/ ... */F
    private Double term() {
        //   
        Double first = factor();

        while (pos < tokens.length) {
            String operator = tokens[pos];
            if (!operator.equals("*") && !operator.equals("/")) {
                break;
            } else {
                pos++;
            }

            //    ()
            Double second = factor();
            if (operator.equals("*")) {
                first *= second;
            } else {
                first /= second;
            }
        }
        return first;
    }
    
    // F -> N | (E)
    private Double factor() {
        String next = tokens[pos];
        Double result;
        if (next.equals("(")) {
            pos++;
            //    ,        
            result = expression();  
            String closingBracket;
            if (pos < tokens.length) {
                closingBracket = tokens[pos];
            } else {
                throw new IllegalStateException("Unexpected end of expression");
            }
            if (pos < tokens.length && closingBracket.equals(")")) {
                pos++;
                return result;
            }
            throw new IllegalStateException("')' expected but " + closingBracket);
        }
        pos++;
        //       
        return Double.parseDouble(next);
    }
}

That's all. If desired, this parser can be taught to work with a unary minus or functions in an expression. The first thing to do is to change the original sentences so that they take into account a new element of the structure of the expression, and there may be more proposals from this. Once they are properly modified, the implementation becomes almost formal.

All Articles