LL(1) Parsing

Here is an introduction to what you need to know to parse correct input with an LL(1) parser, for CS164 at UC Berkeley.

Note: The following discussion is informal and does not cover all of the cases that arise in left-recursion elimination, left factoring, and LL(1) parsing. To be sure you've covered all your bases, reread the text (the Dragon book) and the lecture notes.


The following grammar comes from Written Assignment 2.

E -> E + T | T
T -> T * F | F
F -> (E) | int

This is a grammar for arithmetic. There are several levels of precedence here: ()'s beat *, and * beats +. We'd like to parse sentences of this grammar using a top-down, recursive-descent algorithm.

This algorithm walks the grammar looking for matches between terminals (*, +, (, ), and int) and the input sentence. We do this search depth-first, which presents a problem. If we start at the start symbol E and expand it with the production E -> E + T, we still have an E as the left-most symbol. In recursive-descent parsing, we can only expand the left-most non-terminal at each step! So we expand that E the same way, and the next one the same way, forever: we're going to loop infinitely if we try to parse sentences with this grammar as written.
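To see the loop concretely, here is a minimal sketch, in Python, of the recursive-descent procedure you'd naively write for E -> E + T. (parse_T and expect are hypothetical stand-ins for the rest of the parser; only the structure matters.)

# Naive recursive-descent procedure for the alternative E -> E + T.
def parse_E(tokens, pos):
    # To recognize E -> E + T we must first recognize an E -- but that is
    # the procedure we are already in, and we haven't consumed any input yet.
    pos = parse_E(tokens, pos)        # calls itself forever: left recursion
    pos = expect(tokens, pos, '+')    # hypothetical helper: match one '+' token
    return parse_T(tokens, pos)       # hypothetical helper: parse a T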

How do we fix it? We use a technique known as left-recursion elimination to get rid of the left-recursive non-terminals that cause us to loop forever. (Note: not all grammars have left recursion. You can identify the ones with immediate left recursion by looking at all of the productions: if the non-terminal on the left side of the arrow is the same as the non-terminal in the left-most position of any phrase on the right side of the arrow, then the grammar is left-recursive. There are other forms of left recursion that only show up after you "recurse" down through several rules of the grammar; if any chain of expansions can eventually bring you back to the same non-terminal in the left-most position without consuming input, the grammar is left-recursive.)

Let's take a look at the first one:

E -> E + T | T

What we do is this. For each alternative on the right-hand side of the arrow that begins with the same non-terminal that appears on the left of the arrow (here, E + T begins with E), we take the part of that alternative after the E (+ T) and move it down into its own new production for a new non-terminal (we'll call it E').

E' -> + T

We're not done yet. Now add E' to the end of each of these new alternatives.

E' -> + T E'

Nope, still not done yet. Now add an extra alternative that goes to epsilon (the empty string).

E' -> + T E' | epsilon

Good. Now we must fix up the original E productions. Here, we take all of the right-hand sides that didn't start with E, and add E' to the end of them.

E  -> T E'

If we do this for the T productions as well, we get the following grammar:

E  -> T E'
E' -> + T E' | epsilon
T  -> F T'
T' -> * F T' | epsilon
F  -> (E) | int

Note, the F productions didn't get changed at all. That's because F doesn't appear in the left-most position of either of its own alternatives.
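If you'd like to see the transformation spelled out as an algorithm, here is a small optional sketch in Python. (The representation of each alternative as a list of symbols, with the empty list standing for epsilon, is just an assumption for illustration; it isn't part of the assignment.)

def eliminate_immediate_left_recursion(nonterm, alternatives):
    # Split  A -> A a1 | A a2 | ... | b1 | b2 | ...   into
    #        A  -> b1 A' | b2 A' | ...
    #        A' -> a1 A' | a2 A' | ... | epsilon
    # Each alternative is a list of symbols; [] stands for epsilon.
    prime = nonterm + "'"
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == nonterm]
    others = [alt for alt in alternatives if not alt or alt[0] != nonterm]
    if not recursive:
        return {nonterm: alternatives}      # nothing to do for this non-terminal
    return {
        nonterm: [beta + [prime] for beta in others],
        prime: [alpha + [prime] for alpha in recursive] + [[]],
    }

# E -> E + T | T   becomes   E -> T E'   and   E' -> + T E' | epsilon
print(eliminate_immediate_left_recursion('E', [['E', '+', 'T'], ['T']]))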


Once we have performed left-recursion elimination on our grammar, we need to construct our parser. We're going to use an algorithm called LL(1) parsing. This algorithm uses a lookup table to parse an expression. Across the top, we list all of the terminals (plus $, which signifies end-of-file), and down the left, we list the non-terminals.

      +              *              (              )              int            $
E
E'
T
T'
F

Our LL(1) parser also maintains a stack of grammar symbols. Initially, we'll put only the start symbol (E) on the stack. As we parse, we'll be pushing symbols onto this stack and popping them off. When they're all gone and our stack is empty (and there is no more input), we're done.

At each step, there is a grammar symbol at the top of the stack. When we see an input token from the lexer, we look in the table to see what to do. Each cell tells our LL(1) parser what to do when the terminal at the top of that column is the next input token and the non-terminal at the left of that row is on top of the stack.

Right now, our table is empty. Let's fill it up.

We do this by computing two functions called FIRST and FOLLOW.
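For the optional Python sketches in the rest of this handout, assume we've written the transformed grammar down as plain data: a dictionary that maps each non-terminal to its list of alternatives, where each alternative is a list of grammar symbols and the empty list stands for epsilon. (This particular representation is our own choice for illustration; it isn't part of the assignment.)

# The left-recursion-free grammar from above, as plain data.
GRAMMAR = {
    'E':  [['T', "E'"]],
    "E'": [['+', 'T', "E'"], []],       # [] stands for epsilon
    'T':  [['F', "T'"]],
    "T'": [['*', 'F', "T'"], []],
    'F':  [['(', 'E', ')'], ['int']],
}
TERMINALS = {'+', '*', '(', ')', 'int'}
START = 'E'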

FIRST is a function on each non-terminal (E, E', T, T', and F) that tells us which terminals (tokens from the lexer) can appear at the beginning of a string derived from that non-terminal. Epsilon (neither a terminal nor a non-terminal) may also be included in this FIRST set. (FIRST is also defined for terminals, but its value is just the terminal itself, so we won't talk about it more.) What this means is that the parser is going to invoke some of these productions in the grammar, and we need to know which one to pick when we see a particular token in the input stream.

So, let's start computing. What is FIRST(E)? That is, which terminals can appear at the beginning of the input when we're looking for an E? Well, E -> T E', so whatever can occur at the beginning of an E is the same as whatever can occur at the beginning of a T.

FIRST(E) => FIRST(T)

How about FIRST(E')? This one is easy: one alternative starts with the terminal +, and the other is epsilon.

FIRST(E') = { +, epsilon }

And we'll continue with the others:

FIRST(T) => FIRST(F)
FIRST(T') = { *, epsilon }
FIRST(F) = { (, int }

See? FIRST(F) is just the set of terminals that are at the beginnings of its productions.

So, to sum up:

FIRST(E) = { (, int }
FIRST(E') = { +, epsilon }
FIRST(T) = { (, int }
FIRST(T') = { *, epsilon }
FIRST(F) = { (, int }
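If you'd rather let a program grind out these sets, here is one way to sketch the FIRST computation in Python, using the GRAMMAR and TERMINALS assumed above. It loops until nothing changes, which takes care of chains like FIRST(E) => FIRST(T) => FIRST(F).

EPSILON = 'epsilon'              # a marker of our own choosing, not a grammar symbol

def union_into(target, items):
    # Add items to the target set; report whether the target actually grew.
    before = len(target)
    target |= items
    return len(target) != before

def compute_first(grammar, terminals):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for alt in alternatives:
                nullable = True                  # can this alternative derive epsilon?
                for sym in alt:
                    if sym in terminals:         # FIRST of a terminal is just itself
                        changed |= union_into(first[nt], {sym})
                        nullable = False
                        break
                    changed |= union_into(first[nt], first[sym] - {EPSILON})
                    if EPSILON not in first[sym]:
                        nullable = False
                        break
                if nullable:                     # empty alternative, or all symbols nullable
                    changed |= union_into(first[nt], {EPSILON})
    return first

first = compute_first(GRAMMAR, TERMINALS)
# first['E'] == {'(', 'int'}, first["E'"] == {'+', 'epsilon'}, and so on.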

Now, let's do FOLLOW. Just as FIRST shows us the terminals that can appear at the beginning of a derived non-terminal, FOLLOW shows us the terminals that can come right after a derived non-terminal. Note, this does not mean the last terminal derived from the non-terminal; it's the set of terminals that can come after it. We define FOLLOW for all the non-terminals in the grammar.

How do we figure out FOLLOW? Instead of looking at the first terminal of each phrase on the right side of the arrow, we find every place our non-terminal appears on the right side of any of the arrows, and then look at which terminals can come immediately after it. As we go through our example, you'll see almost all of the different ways to figure out the FOLLOW of a non-terminal.

First, however, let's pretend that our grammar starts with a unique starting production (it's not really part of the grammar):

S  -> E

We start our journey at S, but rewrite it a bit to reflect the EOF that can be at the end. In parser-land, EOF is represented by $. So our production is really:

S -> E $

What is FOLLOW(E)? (Note: We don't really care about FOLLOW(S) because it's just imaginary.) Look at all of the right-hand sides (after the arrow) of all of the productions in the grammar. What terminals appear immediately to the right of an E? Well, I see a $ (in S -> E $) and a ) (in F -> ( E )).

FOLLOW(E) = { $, ) }

How about FOLLOW(E')? There's nothing after either occurrence of E' in the grammar. What do we do now? Well, E' appears at the end of E -> T E', so whatever follows that E also follows the E'. E' also appears at the end of E' -> + T E', which only tells us that whatever follows E' follows E'. That's a tautology, so it adds nothing new.

FOLLOW(E') => FOLLOW(E)

Now, let's do the rest:

FOLLOW(T) = ?

T is always followed by E'. So whatever terminals can begin an E' must be terminals that can follow T. This means that FOLLOW(T) => FIRST(E'). It may seem a bit weird, but remember that when we're done, there will be no non-terminals left in the derived string, only terminals. Since E' will eventually turn into terminals, the terminals E' can start with are exactly the ones that can show up right after a T.

FOLLOW(T) => FIRST(E')
FOLLOW(T') => FOLLOW(T)
FOLLOW(F) => FIRST(T')

If we solve this system of equations, we get:

FOLLOW(S) = { $ }
FOLLOW(E) = { $, ) }
FOLLOW(E') = { $, ) }
FOLLOW(T) = { +, epsilon }

Hold on! What's that epsilon doing there? We can't have that; epsilon never belongs in a FOLLOW set. Where did it come from? Ah, it came from including FIRST(E'). Well, if E' -> epsilon, then whatever terminals follow E' (FOLLOW(E')) are the terminals we're looking for. (These are not the droids you're looking for.) Those are $ and ). Let's drop the epsilon and add those instead:

FOLLOW(T)  = { +, $, ) }
FOLLOW(T') = { +, $, ) }
FOLLOW(F)  = { *, epsilon }

There's that epsilon again. If T' -> epsilon, then we drop the epsilon and add FOLLOW(T') in there instead.

FOLLOW(F) = { *, +, $, ) }

Good, now we're done.
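The FOLLOW sets can be ground out mechanically too. Here is a sketch in the same style, reusing GRAMMAR, TERMINALS, EPSILON, union_into, and the first sets from the FIRST sketch above. (Again, this is just one way to do it, not part of the assignment.)

def compute_follow(grammar, terminals, first, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add('$')                   # EOF can follow the start symbol
    changed = True
    while changed:
        changed = False
        for lhs, alternatives in grammar.items():
            for alt in alternatives:
                for i, sym in enumerate(alt):
                    if sym in terminals:
                        continue
                    # Whatever can begin the rest of this alternative can follow sym.
                    rest_nullable = True
                    for nxt in alt[i + 1:]:
                        if nxt in terminals:
                            changed |= union_into(follow[sym], {nxt})
                            rest_nullable = False
                            break
                        changed |= union_into(follow[sym], first[nxt] - {EPSILON})
                        if EPSILON not in first[nxt]:
                            rest_nullable = False
                            break
                    # If the rest can vanish (or there is no rest), whatever follows
                    # the left-hand side can follow sym as well.
                    if rest_nullable:
                        changed |= union_into(follow[sym], follow[lhs])
    return follow

follow = compute_follow(GRAMMAR, TERMINALS, first, START)
# follow['E'] == {')', '$'}, follow['T'] == {'+', ')', '$'}, and so on.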


Now it's time to put these things back into our table. Each cell should be filled in with the production that the non-terminal in the left column takes when we see the terminal at the top of the column in our input stream. Note: we don't have an S row in this table because it isn't a very interesting transition; it just goes to E.

      +              *              (              )              int            $
E
E'
T
T'
F

Our FIRST calculations have told us exactly what we need! They tell us what terminals we're allowed to see on the input stream when we're looking for one of those non-terminals.

We'll reprint the FIRST equations to remind ourselves what the values were:

FIRST(E)  = { (, int }
FIRST(E') = { +, epsilon }
FIRST(T)  = { (, int }
FIRST(T') = { *, epsilon }
FIRST(F)  = { (, int }

So, let's fill in the first row:

      +              *              (              )              int            $
E     can't happen   can't happen   E -> T E'      can't happen   E -> T E'      can't happen
E'
T
T'
F

The production we put in is the only one available for E, and it goes in the columns for the terminals in FIRST(E): ( and int. Let's do another one.

      +              *              (              )              int            $
E     can't happen   can't happen   E -> T E'      can't happen   E -> T E'      can't happen
E'    E' -> + T E'
T
T'
F

Wait a minute. We've got that epsilon in there. There's no epsilon in the input stream! Remember what we did in the FOLLOW computation when we found the epsilon? We took the FOLLOW of that non-terminal, FOLLOW(E'), which turned out to be ) and $. So, let's take that epsilon production whenever we see ) or $.

      +              *              (              )              int            $
E     can't happen   can't happen   E -> T E'      can't happen   E -> T E'      can't happen
E'    E' -> + T E'   can't happen   can't happen   E' -> epsilon  can't happen   E' -> epsilon
T
T'
F

Let's fill in the next few lines:

      +              *              (              )              int            $
E     can't happen   can't happen   E -> T E'      can't happen   E -> T E'      can't happen
E'    E' -> + T E'   can't happen   can't happen   E' -> epsilon  can't happen   E' -> epsilon
T     can't happen   can't happen   T -> F T'      can't happen   T -> F T'      can't happen
T'    T' -> epsilon  T' -> * F T'   can't happen   T' -> epsilon  can't happen   T' -> epsilon
F     can't happen   can't happen   F -> ( E )     can't happen   F -> int       can't happen

And, we're done.
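Filling in the table by hand like this is exactly what an LL(1) table-construction routine does: put each production A -> alpha in row A under every terminal in FIRST(alpha), and, if alpha can derive epsilon, under every terminal in FOLLOW(A) as well. Here is a sketch under the same assumptions as the FIRST/FOLLOW code above; a missing entry plays the role of "can't happen", and if two different productions ever land in the same cell, the grammar isn't LL(1).

def build_ll1_table(grammar, terminals, first, follow):
    table = {}                      # maps (non-terminal, terminal) -> one alternative
    for lhs, alternatives in grammar.items():
        for alt in alternatives:
            # Compute FIRST of this particular alternative.
            alt_first = set()
            nullable = True
            for sym in alt:
                if sym in terminals:
                    alt_first.add(sym)
                    nullable = False
                    break
                alt_first |= first[sym] - {EPSILON}
                if EPSILON not in first[sym]:
                    nullable = False
                    break
            lookaheads = set(alt_first)
            if nullable:                       # the alternative can derive epsilon,
                lookaheads |= follow[lhs]      # so use it on anything in FOLLOW(lhs)
            for terminal in lookaheads:
                if (lhs, terminal) in table and table[(lhs, terminal)] != alt:
                    raise ValueError("grammar is not LL(1)")
                table[(lhs, terminal)] = alt
    return table

table = build_ll1_table(GRAMMAR, TERMINALS, first, follow)
# table[("E'", '+')] == ['+', 'T', "E'"]   and   table[("E'", ')')] == []  (epsilon)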


Now what do we do with this table? The table forms one part of a three-part data structure. The other two parts are a stack of grammar symbols (E, E', T, T', F, +, *, (, ), int, and $) and an input stream (the expression we want to parse, already tokenized into lexemes by the scanner).

We start our stack with the start symbol: E. Our input for this example is 3 + 5 * 7.

Input          Step #  Stack (top is on left)
3 + 5 * 7 $    1.      E

Now, look at the table. If we have an E on the stack and our input is a 3 (an integer, so we look in the int column), we pick the production E -> T E'. This means we pop E off the stack and replace it with T E'.

Input          Step #  Stack (top is on left)
3 + 5 * 7 $    1.      E
3 + 5 * 7 $    2.      T E'

Note, we haven't done anything to the input yet. We're still only dealing with non-terminals. Now we have a T on the top of the stack and an integer on the input, so we go to T -> F T'.

Input          Step #  Stack (top is on left)
3 + 5 * 7 $    1.      E
3 + 5 * 7 $    2.      T E'
3 + 5 * 7 $    3.      F T' E'

Let's continue another step. If we have an F on top of the stack and see an integer, we do F -> int.

Input          Step #  Stack (top is on left)
3 + 5 * 7 $    1.      E
3 + 5 * 7 $    2.      T E'
3 + 5 * 7 $    3.      F T' E'
3 + 5 * 7 $    4.      int T' E'

Now we have an int at the top of the stack, and an int beginning the input stream. We pop both of these off.

Input          Step #  Stack (top is on left)
3 + 5 * 7 $    1.      E
3 + 5 * 7 $    2.      T E'
3 + 5 * 7 $    3.      F T' E'
3 + 5 * 7 $    4.      int T' E'
+ 5 * 7 $      5.      T' E'

Now we have a T' on the top of the stack, and a + on the input. T' -> epsilon. This means we pop T' off our stack and do nothing to the input.

Input          Step #  Stack (top is on left)
3 + 5 * 7 $    1.      E
3 + 5 * 7 $    2.      T E'
3 + 5 * 7 $    3.      F T' E'
3 + 5 * 7 $    4.      int T' E'
+ 5 * 7 $      5.      T' E'
+ 5 * 7 $      6.      E'

We now have an E' on the top of the stack, and a + on the input. Now we do the production E' -> + T E'.

Input          Step #  Stack (top is on left)
3 + 5 * 7 $    1.      E
3 + 5 * 7 $    2.      T E'
3 + 5 * 7 $    3.      F T' E'
3 + 5 * 7 $    4.      int T' E'
+ 5 * 7 $      5.      T' E'
+ 5 * 7 $      6.      E'
+ 5 * 7 $      7.      + T E'

And again, we have a terminal at the top of the stack, and a matching terminal on the input stream. So we pop them both, and continue.

Input          Step #  Stack (top is on left)
3 + 5 * 7 $    1.      E
3 + 5 * 7 $    2.      T E'
3 + 5 * 7 $    3.      F T' E'
3 + 5 * 7 $    4.      int T' E'
+ 5 * 7 $      5.      T' E'
+ 5 * 7 $      6.      E'
+ 5 * 7 $      7.      + T E'
5 * 7 $        8.      T E'

At this point, we're in a pretty similar position to step 2. Let's do a few steps and see if you're still with us.

Input          Step #  Stack (top is on left)
3 + 5 * 7 $    1.      E
3 + 5 * 7 $    2.      T E'
3 + 5 * 7 $    3.      F T' E'
3 + 5 * 7 $    4.      int T' E'
+ 5 * 7 $      5.      T' E'
+ 5 * 7 $      6.      E'
+ 5 * 7 $      7.      + T E'
5 * 7 $        8.      T E'
5 * 7 $        9.      F T' E'
5 * 7 $        10.     int T' E'
* 7 $          11.     T' E'
* 7 $          12.     * F T' E'

Still with us? We parsed enough to pop the 5 off the input, and are now about to get rid of the * in another step.

Input          Step #  Stack (top is on left)
3 + 5 * 7 $    1.      E
3 + 5 * 7 $    2.      T E'
3 + 5 * 7 $    3.      F T' E'
3 + 5 * 7 $    4.      int T' E'
+ 5 * 7 $      5.      T' E'
+ 5 * 7 $      6.      E'
+ 5 * 7 $      7.      + T E'
5 * 7 $        8.      T E'
5 * 7 $        9.      F T' E'
5 * 7 $        10.     int T' E'
* 7 $          11.     T' E'
* 7 $          12.     * F T' E'
7 $            13.     F T' E'
7 $            14.     int T' E'
$              15.     T' E'
$              16.     E'
$              17.     DONE!

Whew! Finished. We know we're done because the stack is now empty and the input is at EOF ($). If there were more input when we ran out of stack symbols (or vice versa), we'd be in an error situation. And we'll leave that story for another time.
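In fact, the whole trace above is just the table-driven parsing loop run by hand. Here is a sketch of that loop under the same assumptions as the earlier code, with the scanner's token stream passed in as a list ending in '$' (so 3 + 5 * 7 arrives as int + int * int $):

def ll1_parse(tokens, table, terminals, start):
    stack = [start]                     # top of the stack is the end of the list
    pos = 0
    while stack:
        top = stack.pop()
        lookahead = tokens[pos]
        if top in terminals:
            if top != lookahead:
                raise SyntaxError("expected %s but saw %s" % (top, lookahead))
            pos += 1                    # matched a terminal: consume the input token
        else:
            alt = table.get((top, lookahead))
            if alt is None:             # a "can't happen" cell
                raise SyntaxError("unexpected %s while expanding %s" % (lookahead, top))
            stack.extend(reversed(alt)) # push the right-hand side, left-most symbol on top
    if tokens[pos] != '$':
        raise SyntaxError("leftover input after the stack emptied")
    print("accepted")

# 3 + 5 * 7, already tokenized by the scanner:
ll1_parse(['int', '+', 'int', '*', 'int', '$'], table, TERMINALS, START)

Note that, just as in the trace, the epsilon production pushes nothing onto the stack, and matching a terminal pops it from the stack and consumes one input token.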

That's most of what you need to know for parsing correct input using LL(1). You should read the Dragon book for more detailed (and theoretical) information to get the complete story.


Questions? Comments?

Andrew Begel abegel@cs.berkeley.edu