Converting Regular Expressions to Postfix Notation with the Shunting-Yard Algorithm

Introduction

My latest coding project involved converting a regular expression to a nondeterministic finite automata (NFA) diagram (you can learn more about NFAs by reading one of my latest articles on my blog). My program would take in a regular expression as an input, such as a(a+b)*b and output an NFA diagram with states and transitions that would represent the given regular expression.

Early on into the coding, I ran into an issue. I needed to figure out a way to properly read in the regular expression so that my program would be able to build the NFA piece by piece. I had to ask myself, what would my program do when it runs into parentheses? How would it know which NFAs to combine? What kind of data structure would I use to parse the input?

As I kept thinking through the logic, I realized that my issue was merely a variation on the simple PEMDAS rule. Instead of numbers, multiplication, and addition; I would be using letters, concatenation, and union. I began searching for algorithms that implement this type of logic, and eventually came across the Shunting-Yard Algorithm.

What is the Shunting-Yard Algorithm?

This algorithm is an operator-precedence parser that is specifically designed to parse mathematical expressions into postfix notation for computation. Postfix notation (Reverse Polish notation) is a mathematical notation in which the operators follow the numbers.

For example, 4 + 5 will turn into 4 5 +

Postfix notation removes the need for parentheses and allows computer programs to read in mathematical expressions one symbol after the other, instead of worrying about operator precedence and parentheses during computation.

I will now show you how to convert a regular expression into postfix notation.

What is our regular expression?

The regular expression we will be using will be: a(a+b)*b

This expression will accept all strings that begin with the letter ‘a’ and end in the letter ‘b.’

Establish operator precedence

Before we can begin converting our expression to postfix notation, we must establish the operator precedence. Knowing the precedence will be necessary during the conversion when we push operators onto the stack. We will need to know which operator has higher precedence in order to decide the correct action to take.

The precedence from highest to lowest is:

  1. Closure (Kleene star) a*
  2. Concatenation ab
  3. Union a+b

Explicitly inserting a concatenation symbol

The concatenation expression usually does not have a symbol between the two letters ab. This will make our computation more difficult than necessary when we do the conversion. So, in order to easily handle this I will be using an ? symbol between the two letters for every concatenation.

So, ab will now turn into a?b

If we apply this change to our regular expression, we will now get: a?(a+b)*?b

Basic Rules of the Algorithm

  • If the input symbol is a letter… append it directly to the output queue
  • If the input symbol is an operator… if there exists an operator already on the top of the operator stack with higher or equal precedence than our current input symbol, remove the operator from the top of the operator stack and append it to the output queue. Do this until the current input symbol has a higher precedence than the symbol on the top of the operator stack, or the operator stack is empty.
  • If the input symbol is an operator AND there is a left parenthesis on top of the stack… push the input symbol onto the stack on top of the left parenthesis.
  • If the input symbol is an ( … append it to the operator stack
  • If the input symbol is an ) … pop all operators from the operator stack and append them to the output queue until you find an ( . Then, you can remove both of those parentheses and continue with the algorithm.

Algorithm in pseudo code here

Shunting-Yard Algorithm Visualized

Here is a visual representation of how the Shunting-Yard Algorithm works. I will be using this diagram to provide a step-by-step approach to the algorithm with our regular expression.

The Input section will move from right to left, almost like a conveyor belt, as it reads each character.

The Operator stack will hold all of the operators that pass through and respond to new operators by following the rules we used in the previous section.

The Output queue will be the final postfix notation.

The first symbol we will read is [a]

[a] is a letter, so we can append it directly to the output queue.

Move to the next symbol.

[?] is an operator and there are no operators in the stack yet. So, append [?] to the stack.

Move to the next symbol.

We encountered a left parenthesis, push it on to the operator stack

Move to the next symbol.

The next symbol is an [a]. Append it directly to the output queue.

Move to the next symbol.

The next symbol is a [+]. Now we have to look at the operator on the top of the stack to decide what to do. The operator on top of the stack is [(]. A left parenthesis does not have higher precedence than a union symbol. So we can append [+] to the stack.

Move to the next symbol

The next symbol is a [b]. Append it directly to the output queue.

Move to the next symbol.

We encountered a right parenthesis. We must now look at the operator stack and pop every operator that is on top of a left parenthesis to the output queue.

We have encountered the left parenthesis we were looking for. Good, that means we have a matching pair of parentheses. If we reach the bottom of the stack without finding a left parenthesis, that means we have mismatching parentheses and will result in an error.

We can now discard both parentheses.

Move to the next symbol.

The next symbol is a [*]. Now let’s check the operator stack to see what to do next… A Kleene star has higher precedence than a concatenation. So, append the Kleene star to the operator stack.

Move to the next symbol.

The next symbol is a [?]. Now we have to check the operator stack for comparison.

A Kleene star [*] is on top of the stack, which has higher precedence than [?]. Now we have to move [*] to the output queue.

Check the next symbol in the operator stack to compare again.

A concat [?] is on top of the stack. This is also the symbol of our current character. Because [?] == [?], we also have to pop the [?] from the stack and append it to the output queue.

The stack is now empty, so our current [?] can be pushed onto the operator stack.

Move to the next symbol.

The next symbol is a [b]. Append it directly to the output queue.

Now we have no more input symbols. Our final step is to simply pop all operators left on the operator stack and append them to the output queue.

Our final postfix notation is aab+*?b?

There you have it. I hope this article provides a visual representation of the Shunting-Yard Algorithm in action when working with regular expressions. Stay tuned for another article, where I will explain how to use this postfix regular expression to create an NFA!

Software Engineer at IBM — more about me at https://cernera.me/

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store