Context-Free Grammars and Languages
We have seen that many languages are not regular. We therefore need to consider a larger class of languages, the Context-Free Languages (CFLs). These languages have a natural, recursive notation called Context-Free Grammars (CFGs). CFGs have played a central role in the study of natural languages since the 1950s and in compilers since the 1960s. Today CFLs are increasingly important for XML (Extensible Markup Language) and its DTDs (document type definitions). We'll look at: CFGs, the languages they generate, parse trees, pushdown automata, and closure properties of CFLs.
April 2006, Ankara Üniversitesi Bilgisayar Mühendisliği - TY
An Informal Example of CFGs
Consider the language of palindromes. A palindrome is a string that reads the same forward and backward. L_pal, the language of palindromes over {0,1}, is not a regular language.
Palindrome example
It is easy to show with the pumping lemma that L_pal is not a regular language. Suppose it were regular, with pumping-lemma constant n, and consider the palindrome ω = 0^n 1 0^n. Let ω = xyz such that y consists of one or more 0s from the first group. Then xy^0 z = xz is not a palindrome, because the numbers of 0s on the two sides of the 1 are not equal. This contradicts the pumping lemma, so L_pal is not regular.
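The pumping argument can be checked mechanically for small n. The sketch below is only an illustration, not a proof: the function name is hypothetical, and it assumes the usual pumping-lemma constraints |xy| ≤ n and |y| ≥ 1, which force y to lie inside the first group of 0s.

```python
def pumping_breaks_palindrome(n: int) -> bool:
    """Check that every legal split of 0^n 1 0^n stops being a palindrome when y is removed."""
    w = "0" * n + "1" + "0" * n
    for xy_len in range(1, n + 1):            # |xy| <= n, so xy lies in the first group of 0s
        for y_len in range(1, xy_len + 1):    # |y| >= 1
            x = w[:xy_len - y_len]
            z = w[xy_len:]
            pumped_down = x + z               # the string x y^0 z
            if pumped_down == pumped_down[::-1]:
                return False                  # this split would survive pumping
    return True


if __name__ == "__main__":
    print(all(pumping_breaks_palindrome(n) for n in range(1, 10)))
```

Every split leaves fewer 0s before the 1 than after it, so no split survives.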
Inductive Definition of L_pal
Basis: ε, 0, and 1 are palindromes.
Induction: If ω is a palindrome, so are 0ω0 and 1ω1.
No string of 0s and 1s is a palindrome unless it follows from this basis and induction rule.
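The inductive definition translates almost line by line into a recursive membership test. This is a minimal sketch; the function name `is_palindrome` is an illustrative choice.

```python
def is_palindrome(w: str) -> bool:
    """Decide membership in L_pal by following the inductive definition."""
    # Basis: ε, 0, and 1 are palindromes.
    if w in ("", "0", "1"):
        return True
    # Induction: w is a palindrome if w = 0x0 or w = 1x1 for some palindrome x.
    if len(w) >= 2 and w[0] == w[-1] and w[0] in ("0", "1"):
        return is_palindrome(w[1:-1])
    # No other string is a palindrome.
    return False


if __name__ == "__main__":
    for s in ["", "0110", "010", "0010"]:
        print(repr(s), is_palindrome(s))
```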
Formal Definition of CFGs
There are four important components in a grammatical description of a language:
1. A finite set of symbols that form the strings of the language. This set was {0,1} in the palindrome example. This alphabet is called the terminals, or terminal symbols.
2. A finite set of variables, which are also called nonterminals or syntactic categories. In our example there is only one variable, P.
3. One of the variables represents the language being defined; it is called the start symbol. In our example it is P.
4. A finite set of productions, or rules, that represent the recursive definition of the language. Each production consists of a variable (the head), the production symbol →, and a string of zero or more terminals and variables, which is called the body of the production.
Formal Definition of CFGs
A context-free grammar is a quadruple G = (V, T, P, S), where
V is a finite set of variables.
T is a finite set of terminals.
P is a finite set of productions of the form A → α, where A is a variable and α ∈ (V ∪ T)*.
S is a designated variable called the start symbol.
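The quadruple maps naturally onto a small record type. The sketch below is one possible encoding, not a standard API: the class and field names are hypothetical, the disjointness check on V and T is the usual convention rather than something stated above, and the palindrome productions are reconstructed from the inductive definition of L_pal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CFG:
    variables: frozenset    # V
    terminals: frozenset    # T
    productions: tuple      # P: pairs (head, body), body is a tuple of symbols
    start: str              # S

    def is_well_formed(self) -> bool:
        """Check the side conditions of the definition."""
        symbols = self.variables | self.terminals
        return (
            self.variables.isdisjoint(self.terminals)   # assumed convention
            and self.start in self.variables
            and all(
                head in self.variables and all(s in symbols for s in body)
                for head, body in self.productions
            )
        )


# Palindrome grammar: P → ε | 0 | 1 | 0P0 | 1P1
G_PAL = CFG(
    variables=frozenset({"P"}),
    terminals=frozenset({"0", "1"}),
    productions=(
        ("P", ()),                  # empty body stands for ε
        ("P", ("0",)),
        ("P", ("1",)),
        ("P", ("0", "P", "0")),
        ("P", ("1", "P", "1")),
    ),
    start="P",
)

if __name__ == "__main__":
    print(G_PAL.is_well_formed())
```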
Example
Regular expressions over {0,1} can be defined by the grammar G_regex = ({E}, {0,1}, A, E), where
A = {E → ε, E → 0, E → 1, E → E.E, E → E+E, E → E*, E → (E)}
or, equivalently,
A = {E → ε | 0 | 1 | E.E | E+E | E* | (E)}
The second representation of the productions is called the compact notation.
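Notice that in the expression grammar G = ({E, I}, T, P, E), with productions
E → I | E+E | E*E | (E)
I → a | b | Ia | Ib | I0 | I1
E and I are variables, the elements of T (the symbols +, *, (, ), a, b, 0, 1) are terminal symbols, P is the set of productions listed here, and E is the start symbol.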
Derivations Using a Grammar
There are two ways to show that a string belongs to the language of a variable:
Recursive inference, using productions from body to head.
Derivations, using productions from head to body.
[Figure: recursive inference example]
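Recursive inference can be sketched as a bottom-up fixpoint computation: start with no known strings and repeatedly apply productions from body to head. The code below assumes the expression grammar E → I | E+E | E*E | (E), I → a | b | Ia | Ib | I0 | I1. Restricting the inferred strings to substrings of a target string is a device of this sketch to keep the sets finite, and the function names are hypothetical.

```python
PRODUCTIONS = {
    "E": ["I", "E+E", "E*E", "(E)"],
    "I": ["a", "b", "Ia", "Ib", "I0", "I1"],
}


def expand(body: str, known: dict) -> list:
    """All strings obtained by replacing each variable in body with a known string."""
    results = [""]
    for sym in body:
        options = known[sym] if sym in known else {sym}   # a terminal stands for itself
        results = [r + o for r in results for o in options]
    return results


def infer(target: str) -> dict:
    """Infer, body to head, which substrings of target belong to each variable."""
    subs = {target[i:j] for i in range(len(target))
            for j in range(i + 1, len(target) + 1)}
    known = {v: set() for v in PRODUCTIONS}
    changed = True
    while changed:                      # repeat until no new string is inferred
        changed = False
        for head, bodies in PRODUCTIONS.items():
            for body in bodies:
                for s in expand(body, known):
                    if s in subs and s not in known[head]:
                        known[head].add(s)
                        changed = True
    return known


if __name__ == "__main__":
    known = infer("a*(a+b00)")
    print(sorted(known["I"]))
    print("a*(a+b00)" in known["E"])
```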
Derivations, Using Productions From Head to Body
Let G = (V, T, P, S) be a CFG. If A → γ is a production of G and α, β ∈ (V ∪ T)*, we write αAβ ⇒ αγβ and say that αAβ derives αγβ in one step.
We define ⇒* to be the reflexive and transitive closure of ⇒, i.e., ⇒* represents zero, one, or many derivation steps.
Example 5.5
The inference that a*(a+b00) is in the language of variable E can be reflected in a derivation of that string, starting with the string E. Here is one such derivation:
E ⇒ E*E ⇒ I*E ⇒ a*E ⇒ a*(E) ⇒ a*(E+E) ⇒ a*(I+E) ⇒ a*(a+E) ⇒ a*(a+I) ⇒ a*(a+I0) ⇒ a*(a+I00) ⇒ a*(a+b00)
We can conclude that E ⇒* a*(a+b00).
The two viewpoints, recursive inference and derivation, are equivalent. A string of terminals ω is inferred to be in the language of some variable A if and only if A ⇒* ω.
Example 5.5 Cont.
E ⇒ E*E ⇒ I*E ⇒ a*E ⇒ a*(E) ⇒ a*(E+E) ⇒ a*(I+E) ⇒ a*(a+E) ⇒ a*(a+I) ⇒ a*(a+I0) ⇒ a*(a+I00) ⇒ a*(a+b00)
Note 1: At each step we might have several rules to choose from, e.g. I*E ⇒ a*E ⇒ a*(E) versus I*E ⇒ I*(E) ⇒ a*(E).
Note 2: Not all choices lead to successful derivations of a particular string. For instance, E ⇒ E+E won't lead to a derivation of a*(a+b00).
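A single derivation step is a string rewrite: pick a variable occurrence and replace it by one of its production bodies. The sketch below replays the derivation of Example 5.5 as a list of (position, body) choices. It assumes the expression grammar E → I | E+E | E*E | (E), I → a | b | Ia | Ib | I0 | I1, with every symbol a single character; the helper names are hypothetical.

```python
PRODUCTIONS = {
    "E": ["I", "E+E", "E*E", "(E)"],
    "I": ["a", "b", "Ia", "Ib", "I0", "I1"],
}


def derive_step(form: str, pos: int, body: str) -> str:
    """Replace the variable at index pos of the sentential form by body."""
    head = form[pos]
    if body not in PRODUCTIONS.get(head, []):
        raise ValueError(f"{head} -> {body} is not a production")
    return form[:pos] + body + form[pos + 1:]


# Choices made in Example 5.5: which position to rewrite, and with which body.
CHOICES = [(0, "E*E"), (0, "I"), (0, "a"), (2, "(E)"), (3, "E+E"), (3, "I"),
           (3, "a"), (5, "I"), (5, "I0"), (5, "I0"), (5, "b")]


def run(start: str, choices: list) -> list:
    """Return all sentential forms visited, starting from start."""
    forms = [start]
    for pos, body in choices:
        forms.append(derive_step(forms[-1], pos, body))
    return forms


if __name__ == "__main__":
    print(" => ".join(run("E", CHOICES)))
```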
Leftmost and Rightmost Derivations
In order to restrict the number of choices we have in deriving a string, it is often useful to require that at each step we replace the leftmost (or rightmost) variable by one of its production bodies. Such a derivation is called a leftmost derivation (or rightmost derivation). A leftmost derivation step is denoted by ⇒_lm, and a rightmost derivation step by ⇒_rm.
Leftmost: Example 5.5
[Figure: comparison of the leftmost and rightmost derivations]
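In a leftmost or rightmost derivation the position to rewrite is forced, so a derivation is fully described by the sequence of production bodies used. The sketch below derives a*(a+b00) both ways in the expression grammar E → I | E+E | E*E | (E), I → a | b | Ia | Ib | I0 | I1. The two body sequences were worked out by hand for this illustration, and the function name is hypothetical.

```python
PRODUCTIONS = {
    "E": ["I", "E+E", "E*E", "(E)"],
    "I": ["a", "b", "Ia", "Ib", "I0", "I1"],
}


def derive(bodies: list, leftmost: bool = True) -> list:
    """Apply bodies in order, always rewriting the leftmost (or rightmost) variable."""
    form = "E"
    forms = [form]
    for body in bodies:
        positions = [i for i, s in enumerate(form) if s in PRODUCTIONS]
        pos = positions[0] if leftmost else positions[-1]
        if body not in PRODUCTIONS[form[pos]]:
            raise ValueError(f"{form[pos]} -> {body} is not a production")
        form = form[:pos] + body + form[pos + 1:]
        forms.append(form)
    return forms


LEFT = ["E*E", "I", "a", "(E)", "E+E", "I", "a", "I", "I0", "I0", "b"]
RIGHT = ["E*E", "(E)", "E+E", "I", "I0", "I0", "b", "I", "a", "I", "a"]

if __name__ == "__main__":
    print(" =>lm ".join(derive(LEFT, leftmost=True)))
    print(" =>rm ".join(derive(RIGHT, leftmost=False)))
```

Both derivations use the same eleven productions, only in a different order.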
The Language of a Grammar
If G = (V, T, P, S) is a CFG, then the language of G is L(G) = {ω ∈ T* : S ⇒* ω}, the set of terminal strings that have a derivation from the start symbol.
We shall prove that a string ω in {0,1}* is in L(G_pal) if and only if it is a palindrome.
Proof: (if direction) Suppose ω = ω^R. We show by induction on |ω| that ω ∈ L(G_pal).
Induction Hypothesis
Proof: (only if direction) We assume that ω ∈ L(G_pal) and must show that ω = ω^R, that is, ω is a palindrome.
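The claim can be sanity-checked by brute force for short strings. The sketch below enumerates every terminal string of length at most n derivable in the palindrome grammar and compares the result with the set of palindromes of that length. It assumes the productions P → ε | 0 | 1 | 0P0 | 1P1, reconstructed from the inductive definition of L_pal; the function names are hypothetical. Agreement up to a bound is supporting evidence only and does not replace the induction.

```python
from itertools import product

BODIES = ["", "0", "1", "0P0", "1P1"]   # P → ε | 0 | 1 | 0P0 | 1P1


def language_up_to(n: int) -> set:
    """All terminal strings of length <= n derivable from P."""
    found, seen, frontier = set(), {"P"}, ["P"]
    while frontier:
        form = frontier.pop()
        pos = form.find("P")
        if pos < 0:                     # no variable left: a terminal string
            found.add(form)
            continue
        for body in BODIES:
            new = form[:pos] + body + form[pos + 1:]
            # Terminals are never erased, so prune forms that are already too long.
            if len(new.replace("P", "")) <= n and new not in seen:
                seen.add(new)
                frontier.append(new)
    return found


def palindromes_up_to(n: int) -> set:
    """All palindromes over {0,1} of length <= n, found by direct reversal."""
    return {"".join(t) for k in range(n + 1)
            for t in product("01", repeat=k) if t == t[::-1]}


if __name__ == "__main__":
    print(language_up_to(6) == palindromes_up_to(6))
```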
Sentential Forms
Let G = (V, T, P, S) be a CFG and α ∈ (V ∪ T)*. If S ⇒* α, we say that α is a sentential form.
Examples of Sentential Forms
Example: E*(I+E) is a sentential form, since
E ⇒ E*E ⇒ E*(E) ⇒ E*(E+E) ⇒ E*(I+E)
This derivation is neither leftmost nor rightmost.
Parse Trees
Let G = (V, T, P, S) be a CFG. The parse trees for G are trees that satisfy the following conditions:
1. Each interior node is labelled by a variable in V.
2. Each leaf is labelled by a symbol in V ∪ T ∪ {ε}. Any ε-labelled leaf is the only child of its parent.
3. If an interior node is labelled A and its children (from left to right) are labelled X1, X2, ..., Xk, then A → X1 X2 ... Xk ∈ P.
Parse Tree Examples
In the grammar
1. E → I
2. E → E + E
3. E → E * E
4. E → (E)
[Figure: parse tree]
Parse Tree Examples
[Figure: a grammar and one of its parse trees]
Yield of a Parse Tree
If we look at the leaves of any parse tree and concatenate them from the left, we get a string, called the yield of the tree, which is always a string that is derived from the root variable. Of special importance are those parse trees such that:
1. The yield is a terminal string. That is, all leaves are labelled either with a terminal or with ε.
2. The root is labelled by the start symbol.
We shall see that the set of yields of these important parse trees is the language of the underlying grammar.
Yield of a Parse Tree: Example
[Figure: parse tree for a*(a+b00)]
Concatenating the leaves from the left gives a string, the yield of the tree. All leaves are labelled either with a terminal or with ε, and the root is labelled by the start symbol. The yield is a*(a+b00).
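A parse tree and its yield can be modelled with a small node class. The sketch below builds a parse tree for a*(a+b00), checks the production condition at every interior node, and reads off the yield. It assumes the expression grammar E → I | E+E | E*E | (E), I → a | b | Ia | Ib | I0 | I1; the class and helper names are hypothetical.

```python
PRODUCTIONS = {
    "E": ["I", "E+E", "E*E", "(E)"],
    "I": ["a", "b", "Ia", "Ib", "I0", "I1"],
}


class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)


def tree_yield(node: Node) -> str:
    """Concatenate the leaf labels from left to right (ε contributes nothing)."""
    if not node.children:
        return "" if node.label == "ε" else node.label
    return "".join(tree_yield(c) for c in node.children)


def is_parse_tree(node: Node) -> bool:
    """Every interior node A with children X1..Xk must use a production A → X1...Xk."""
    if not node.children:
        return True
    body = "".join(c.label for c in node.children)
    if body == "ε":                     # a lone ε leaf stands for the empty body
        body = ""
    return (body in PRODUCTIONS.get(node.label, [])
            and all(is_parse_tree(c) for c in node.children))


def ident(name: str) -> Node:
    """Parse tree rooted at I for an identifier such as 'a' or 'b00'."""
    tree = Node("I", [Node(name[0])])
    for ch in name[1:]:
        tree = Node("I", [tree, Node(ch)])
    return tree


def expr(tree: Node) -> Node:
    """Wrap an identifier tree using E → I."""
    return Node("E", [tree])


SUM = Node("E", [expr(ident("a")), Node("+"), expr(ident("b00"))])
PAREN = Node("E", [Node("("), SUM, Node(")")])
TREE = Node("E", [expr(ident("a")), Node("*"), PAREN])

if __name__ == "__main__":
    print(is_parse_tree(TREE), tree_yield(TREE))
```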
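Equivalence of Inference, Derivations, and Parse Trees
Let G = (V, T, P, S) be a CFG and A ∈ V. Then the following are equivalent:
1. The recursive inference procedure determines that the terminal string ω is in the language of A.
2. A ⇒* ω
3. A ⇒*_lm ω
4. A ⇒*_rm ω
5. There is a parse tree with root A and yield ω.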
Ambiguity in Grammars and Languages
In the grammar E → I | E+E | E*E | (E), the sentential form E+E*E has two derivations:
E ⇒ E+E ⇒ E+E*E and E ⇒ E*E ⇒ E+E*E
This gives us two parse trees.
[Figure: the two parse trees]
The mere existence of several derivations is not dangerous; it is the existence of several parse trees that ruins a grammar.
Example: In the same grammar the string a+b has several derivations, e.g.,
E ⇒ E+E ⇒ I+E ⇒ a+E ⇒ a+I ⇒ a+b and E ⇒ E+E ⇒ E+I ⇒ I+I ⇒ I+b ⇒ a+b
However, their parse trees are the same, and the structure of a+b is unambiguous.
Let G = (V, T, P, S) be a CFG. We say that G is ambiguous if there is a string in T* that has more than one parse tree. If every string in L(G) has at most one parse tree, G is said to be unambiguous.
Example: The terminal string a+a*a has two parse trees:
[Figure: the two parse trees for a+a*a]
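For short strings, ambiguity can be observed directly by enumerating all parse trees. The sketch below does this by exhaustive search over the ways of splitting the string among the symbols of each production body. It assumes the expression grammar E → I | E+E | E*E | (E), I → a | b | Ia | Ib | I0 | I1, and relies on that grammar having no ε-productions; the function names and the bracketed rendering are choices of this illustration.

```python
GRAMMAR = {
    "E": ("I", "E+E", "E*E", "(E)"),
    "I": ("a", "b", "Ia", "Ib", "I0", "I1"),
}


def parse_trees(sym: str, w: str):
    """Yield every parse tree with root sym and yield w, as nested tuples."""
    if sym not in GRAMMAR:              # terminal: matches only itself
        if w == sym:
            yield sym
        return
    for body in GRAMMAR[sym]:
        for children in splits(body, w):
            yield (sym, *children)


def splits(body: str, w: str):
    """Yield tuples of subtrees, one per symbol of body, that together yield w."""
    if len(body) == 1:
        for tree in parse_trees(body, w):
            yield (tree,)
        return
    # The first symbol covers w[:k]; every remaining symbol needs one character or more.
    for k in range(1, len(w) - len(body) + 2):
        for first in parse_trees(body[0], w[:k]):
            for rest in splits(body[1:], w[k:]):
                yield (first, *rest)


def show(tree) -> str:
    """Render a tree as a string, bracketing each use of E → E+E or E → E*E."""
    if isinstance(tree, str):
        return tree
    label, *kids = tree
    inner = "".join(show(k) for k in kids)
    if label == "E" and len(kids) == 3 and kids[1] in ("+", "*"):
        return "[" + inner + "]"
    return inner


if __name__ == "__main__":
    for t in parse_trees("E", "a+a*a"):
        print(show(t))
```

The two renderings correspond to the groupings a+(a*a) and (a+a)*a.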
Removing Ambiguity From Grammars
Good news: Sometimes we can remove ambiguity by hand.
Bad news: There is no algorithm to do it.
More bad news: Some CFLs have only ambiguous CFGs.
We are studying the grammar
E → I | E+E | E*E | (E)
I → a | b | Ia | Ib | I0 | I1
There are two problems:
1. There is no precedence between * and +.
2. A sequence of identical operators can group either from the left or from the right, for example E+E+E.
Solution: We introduce more variables, each representing expressions of the same binding strength.
1. A factor is an expression that cannot be broken apart by an adjacent * or +. Our factors are (a) identifiers and (b) parenthesized expressions.
2. A term is an expression that cannot be broken apart by +. For instance, a*b can be broken by an adjacent *: placing a1* to its left gives a1*a*b, which groups as (a1*a)*b and so breaks a*b apart. It cannot be broken by +, since e.g. a1+a*b is (by the precedence rules) the same as a1+(a*b), and a*b+a1 is the same as (a*b)+a1.
3. The rest are expressions, i.e. they can be broken apart by * or +.
Example 5.27
Let F stand for factors, T for terms, and E for expressions. In place of the previous form E → I | E+E | E*E | (E), consider the following grammar:
E → T | E+T
T → F | T*F
F → I | (E)
I → a | b | Ia | Ib | I0 | I1
Now a+a*a has only one parse tree.
[Figure: the parse tree for a+a*a]
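The effect of the rewrite can be checked on small inputs by counting parse trees in both grammars. The sketch below uses memoised recursion over spans of the input. It assumes single-character symbols and grammars without ε-productions or cycles of unit productions, which holds for both grammars here; the function and constant names are hypothetical. Counting a few strings illustrates the claim but does not prove that the new grammar is unambiguous.

```python
from functools import lru_cache

IDENT = ("a", "b", "Ia", "Ib", "I0", "I1")
AMBIGUOUS = {"E": ("I", "E+E", "E*E", "(E)"), "I": IDENT}
UNAMBIGUOUS = {"E": ("T", "E+T"), "T": ("F", "T*F"), "F": ("I", "(E)"), "I": IDENT}


def count_parse_trees(grammar: dict, start: str, w: str) -> int:
    """Number of parse trees with root start and yield w."""

    @lru_cache(maxsize=None)
    def trees(sym: str, i: int, j: int) -> int:
        """Parse trees with root sym and yield w[i:j]."""
        if sym not in grammar:          # terminal: must match exactly one character
            return int(j - i == 1 and w[i] == sym)
        return sum(ways(body, i, j) for body in grammar[sym])

    @lru_cache(maxsize=None)
    def ways(body: str, i: int, j: int) -> int:
        """Ways to derive w[i:j] from the symbols of body, in order."""
        if len(body) == 1:
            return trees(body, i, j)
        # The first symbol covers w[i:k]; each later symbol needs one character or more.
        return sum(trees(body[0], i, k) * ways(body[1:], k, j)
                   for k in range(i + 1, j - len(body) + 2))

    return trees(start, 0, len(w))


if __name__ == "__main__":
    for s in ["a+a*a", "a+a+a", "a*(a+b00)"]:
        print(s, count_parse_trees(AMBIGUOUS, "E", s),
              count_parse_trees(UNAMBIGUOUS, "E", s))
```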
Leftmost Derivations & Ambiguity
Derivations are not necessarily unique, even if the grammar is unambiguous. In an unambiguous grammar, however, the leftmost derivation of each string is unique, and so is the rightmost derivation. We shall consider leftmost derivations.
Theorem 5.29: For any CFG G, a terminal string ω has two distinct parse trees if and only if ω has two distinct leftmost derivations from the start symbol.
Example of Non-Unique Derivations
The parse trees and leftmost derivations for a+a*a.
[Figure: the two parse trees and their leftmost derivations]
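One direction of the correspondence between parse trees and leftmost derivations can be sketched in a few lines: expanding the leftmost unexpanded node of a parse tree, over and over, spells out a leftmost derivation. The two trees below are the two parse trees of a+a*a in the grammar E → I | E+E | E*E | (E), written as nested tuples (label, child, ...); this encoding and the function names are assumptions of the illustration.

```python
def render(form: list) -> str:
    """Sentential form: terminals as themselves, unexpanded subtrees by their root label."""
    return "".join(x if isinstance(x, str) else x[0] for x in form)


def leftmost_derivation(tree: tuple) -> list:
    """List the sentential forms of the leftmost derivation that matches tree."""
    form = [tree]
    steps = [render(form)]
    while True:
        pos = next((i for i, x in enumerate(form) if not isinstance(x, str)), None)
        if pos is None:                 # only terminals remain
            return steps
        _label, *kids = form[pos]
        form[pos:pos + 1] = kids        # expand the leftmost variable
        steps.append(render(form))


A = ("E", ("I", "a"))                   # subtree for E ⇒ I ⇒ a
SUM_AT_ROOT = ("E", A, "+", ("E", A, "*", A))       # groups as a+(a*a)
PRODUCT_AT_ROOT = ("E", ("E", A, "+", A), "*", A)   # groups as (a+a)*a

if __name__ == "__main__":
    print(" => ".join(leftmost_derivation(SUM_AT_ROOT)))
    print(" => ".join(leftmost_derivation(PRODUCT_AT_ROOT)))
```

The two derivations differ from their first step onward, matching the two distinct parse trees.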
Inherent Ambiguity
A CFL L is inherently ambiguous if all grammars for L are ambiguous.