CS164 - Section 3 - February 10/11, 2004 Announcements: WA2 due Thursday at 12:30pm [Return WA1] WA3 out Thursday - Remember to put your ***userid*** PA3 due February 20 (start now!) and TA name/section on your written assignments! Outline: - Context-Free Grammars - LL Parsing ---------------------------------------------------------------------------- 0. Intro - Parsing Why aren't regular languages enough? After the lexing phase, we now have a stream of tokens. Now what? - analogy: We've identified the words in the sentence. Now we want to give structure to determine if they're valid sentences. Parser - What is the output of the parser? - What does the parser filter out? ---------------------------------------------------------------------------- 1. Context-Free Grammars Warm-up: S -> SS | (S) | \epsilon - A Context Free Grammar is N - a set of non-terminals T - a set of terminals S - a start symbol (a non-terminal) \delta - a set of productions - How do we tell if some sequence is in this language? - Let G be a CFG. Then the language of G is { a_1 ... a_n | S ->* a_1 ... a_n and every a_i is a terminal } Many CFGs describe the same language. The form is important. - What language does this grammar represent? - What aspect of this grammar is problematic for parsing? - Write a better grammar for the same language: Exercise: Consider the language with constants 't' and 'f', a prefix operator '!', and two infix operators '&&' and '||'. - Write a simple (ambiguous) grammar for this language. - What is a derivation? How does it relate to a parse tree? - Rewrite the grammar to give precendence to the operators in the following order: !, &&, || (That is, ! binds tightest and || is loosest.) Exercise: Every regular language is context-free. a. Write a CFG for 00*11* b. Let's prove it by constructing a CFG for an arbitrary RE. - Can you go the other direction? ---------------------------------------------------------------------------- 2. LL Parsing An LL(k) grammar is a CFG used by a parser that - Scans input left-to-right ("L"). - Leftmost derivation ("L"). - Uses k tokens of lookahead to predict the correct production How is a LL(0) parser different from a recursive-decent parser? We will concern ourselves with LL(1) grammars. We will generate a derivation from the start symbol using the input string as a guide. At each step, we expand the leftmost nonterminal. We use the next token (our one token of lookahead) to choose the appropriate production. Terminals: 'section' 'is' 'very' 'useful' 'boring' 'indeed' ',' Nonterminals: S M V A I (think "start", "modifier", "very", "adjective", and "indeed") S -> section is M A I M -> \epsilon | V V -> very , V | very A -> boring | useful I -> \epsilon | indeed Let's build an LL(1) parse table. Each row represents the leftmost nonterminal, and each column represents our one token of lookahead. section is very useful boring indeed , $ ------------------------------------------------------------------------ S ------------------------------------------------------------------------ M ------------------------------------------------------------------------ V ------------------------------------------------------------------------ A ------------------------------------------------------------------------ I ------------------------------------------------------------------------ What went wrong? Let's fix the grammar: (What's this process called?) Now let's build a better parse table: section is very useful boring indeed , $ ------------------------------------------------------------------------ S ------------------------------------------------------------------------ M ------------------------------------------------------------------------ V ------------------------------------------------------------------------ A ------------------------------------------------------------------------ I ------------------------------------------------------------------------ X ------------------------------------------------------------------------ Informally, describe the the process. It seems there are roughly two cases to think about. What are they? Now, how do we formalize this process? First(A): the set of all terminals that could occur first in an expansion of the terminal or nonterminal A (include \epsilon if A can expand to \epsilon) Follow(A): the set of all terminals that could follow an occurence of the terminal or nonterminal A in a (partial) derivation Let's compute these sets for the above grammar. Now we can use this information to construct a parse table. For each production "X -> A_1 ... A_n": - For each 1 <= i <= n, and for each b in First(A_i): Set T[X, b] = "X -> A_1 ... A_n". Stop when \epsilon is not in First(A_i). - If A_1 ... A_n ->* \epsilon, then for each b in Follow(X): Set T[X, b] = "\epsilon". Let's give it a try. section is very useful boring indeed , $ ------------------------------------------------------------------------ S ------------------------------------------------------------------------ M ------------------------------------------------------------------------ V ------------------------------------------------------------------------ A ------------------------------------------------------------------------ I ------------------------------------------------------------------------ X ------------------------------------------------------------------------ What's an example of a grammar that is not LL(k) for any k? Bonus Exercise: Now consider the following LL(1) grammar. S -> a T U T -> b S U -> \epsilon | a Construct Follow (and thus also First) sets. - How do we know we can write an algorithm that will terminate that computes the Follow sets?