Wednesday, April 20, 2011

Compiler Design Notes for B.Tech CSE and IT students

UNIT – I

OVERVIEW OF COMPILATION: Phases of Compilation, Lexical Analysis, Regular Grammar and
regular expression for common programming language features, passes and phases of translation,
interpretation, bootstrapping, data structures in compilation, LEX lexical analyzer generator.

A compiler is a computer program (or set of programs) that transforms source code written in a programming language (the source language) into another computer language (the target language, often having a binary form known as object code). The most common reason for wanting to transform source code is to create an executable program.
The phases of a compiler can be broadly classified into
the Front End (language specific) and the Back End (machine specific) parts of compilation
Also known as Analysis-Synthesis model of compilation
Compiler structure


- Front end phases are known as analysis phases
- Back end phases are known as synthesis phases
. Each phase performs a well-defined task
. Each phase handles a logical activity in the process of compilation
The Analysis-Synthesis model:
The front end phases are Lexical, Syntax and Semantic analyses. These form the "analysis phase"; as you can see, these all do some kind of analysis. The Back End phases are called the "synthesis phase" as they synthesize the intermediate and the target language, and hence the program, from the representation created by the Front End phases. The advantages are that not only can lots of code be reused, but also, since the compiler is well structured, it is easy to maintain & debug. The compiler is also retargetable.
. Source and machine independent code optimization is possible.
. Optimization phase can be inserted after the front and back end phases have been developed and deployed.

Final Compiler structure


Also, since each phase handles a logically different part of the working of a compiler, parts of the code can be reused to make new compilers. E.g., in C compilers for Intel & Athlon the front ends will be similar. For the same language, lexical, syntax and semantic analyses are similar, so code can be reused. Also, when adding optimization, improving the performance of one phase should not affect the performance of the other phases; this is possible to achieve in this model.
Bootstrapping .
. A compiler can be characterized by three languages: the source language (S), the target language (T), and the implementation language (I)
. The three languages S, I, and T can be quite different. Such a compiler is called a cross-compiler
Compilers are of two kinds: native and cross
Native compilers are written in the same language as the target language. For example, SMM is a compiler for language S that is written in a language that runs on machine M and generates output code that runs on machine M.
Cross compilers are written in a different language from the target language. For example, SNM is a compiler for language S that is written in a language that runs on machine N and generates output code that runs on machine M.
Lexical Analysis
. Recognize tokens and ignore white spaces, comments
Generates token stream
. Error reporting
. Model using regular expressions
. Recognize using Finite State Automata
The first phase of the compiler is lexical analysis. The lexical analyzer breaks a sentence into a sequence of words or tokens and ignores white spaces and comments. It generates a stream of tokens from the input. This is modeled through regular expressions and the structure is recognized through finite state automata. If the token is not valid i.e., does not fall into any of the identifiable groups, then the lexical analyzer reports an error. Lexical analysis thus involves recognizing the tokens in the source program and reporting errors, if any. We will study more about all these processes in the subsequent slides
Regular Expression Notation ....
. If r and s are regular expressions denoting the languages L(r) and L(s) then
. (r) | (s) is a regular expression denoting L(r) U L(s)
. (r)(s) is a regular expression denoting L(r)L(s)
. (r)* is a regular expression denoting (L(r))*
. (r) is a regular expression denoting L(r)
How to specify tokens
Examples

My email address: ska@iitk.ac.in
. Σ = letter U { @, . }
. letter → a | b | ... | z | A | B | ... | Z
. name → letter+
. address → name '@' name '.' name '.' name
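As a rough illustration, this specification can be tried out directly with library regular expressions. The following is a minimal Python sketch; the pattern only mirrors the toy definitions above and is not a general e-mail matcher:

import re

# name -> letter+ ; address -> name '@' name '.' name '.' name
name = r"[A-Za-z]+"
address = re.compile(name + "@" + name + r"\." + name + r"\." + name)

print(bool(address.fullmatch("ska@iitk.ac.in")))   # True
print(bool(address.fullmatch("abc@def")))          # False: too few name parts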
Finite Automata
. Regular expressions are declarative specifications
. A finite automaton is an implementation
. A finite automaton consists of
- An input alphabet Σ
- A set of states S
- A set of transitions state_i → state_j
- A set of final states F
- A start state n
. Transition s1 → s2 on input a is read:
in state s1, on input a, go to state s2
. If end of input is reached in a final state then accept
Pictorial notation
. A state . A final state
. Transition
. Transition from state i to state j on an input a
How to recognize tokens
. Consider
relop → < | <= | = | <> | >= | >
id → letter (letter | digit)*
num → digit+ ('.' digit+)? (E ('+' | '-')? digit+)?
delim → blank | tab | newline
ws → delim+
. Construct an analyzer that will return <token, attribute> pairs
We now consider the following grammar and try to construct an analyzer that will return <token, attribute> pairs (a sketch of such an analyzer is given after the grammar).
relop → < | <= | = | <> | >= | >
id → letter (letter | digit)*
num → digit+ ('.' digit+)? (E ('+' | '-')? digit+)?
delim → blank | tab | newline
ws → delim+
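A minimal sketch of such an analyzer in Python is given below. It is only an illustration: a real scanner produced by LEX would be table driven, whereas this sketch simply reuses library regular expressions for the token classes above, and the token names are our own.

import re

# Token specification mirroring the grammar above (order matters: longer
# patterns such as <= must be tried before <).
TOKEN_SPEC = [
    ("WS",    r"[ \t\n]+"),                         # ws    -> delim+
    ("NUM",   r"\d+(\.\d+)?(E[+-]?\d+)?"),          # num
    ("ID",    r"[A-Za-z][A-Za-z0-9]*"),             # id    -> letter(letter|digit)*
    ("RELOP", r"<=|>=|<>|<|>|="),                   # relop
]
MASTER = re.compile("|".join("(?P<%s>%s)" % (n, p) for n, p in TOKEN_SPEC))

def tokens(text):
    """Return a list of <token, lexeme> pairs, skipping white space."""
    out, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if not m:
            raise SyntaxError("illegal character %r at %d" % (text[pos], pos))
        if m.lastgroup != "WS":
            out.append((m.lastgroup, m.group()))
        pos = m.end()
    return out

print(tokens("count <= 60 <> x1"))
# [('ID', 'count'), ('RELOP', '<='), ('NUM', '60'), ('RELOP', '<>'), ('ID', 'x1')]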

Essay type questions:
1. Explain the different phases of a compiler, showing the output of each phase,
using the example of the following statement:
position := initial + rate * 60
2. Compare compiler and interpreter with suitable diagrams.
3. Explain the different phases of a compiler.
4. Write short notes on token specification.

5. Explain how input buffering helps lexical analyzer in compilation process.
6. Construct a DFA for (0* + 1*)011.
7. Write a lex program to identify comments in the program.
8.What is LEX? Explain, in detail, different sections of LEX program.
9. Write regular expressions for the following patterns
The set of words having a,e,i,o,u appearing in that order, although not
necessarily consecutively.
10. Explain the bootstrapping process with suitable diagrams.

Short answer questions:
1. Define compiler.
2. Define interpreter.
3. Write about preprocessors.
4. Write the declaration of a macro function.
5. What is the difference between a pass and a phase?
6. Write the role of the lexical analyzer.
7. What are the components of a lexical analyzer?
8. Write the structure of LEX.
9. What is LEX?
10. Which of the following strings has length 0?
(a) s (b) a (c) abc (d) ε
11. If the input alphabet is {a} then the closure of a is
(a) {ε, a, aa} (b) {ε} (c) {a, aa, ...} (d) {ε, a, aa, ...}
12. An application of finite automata is
(a) scanner (b) lexical analyzer (c) semantic analyzer (d) parser

UNIT II
TOP DOWN PARSING: Context free grammars, top down parsing, backtracking,
LL(1), recursive descent parsing, predictive parsing, preprocessing steps required for predictive parsing.

Notes: Context free grammars
T - a set of tokens (terminal symbols)
V - a set of non-terminal symbols
P - a set of productions of the form nonterminal → string of terminals & non-terminals
S - a start symbol
. A grammar derives strings by beginning with a start symbol and repeatedly replacing a non terminal by the right hand side of a production for that non terminal.
. The strings that can be derived from the start symbol of a grammar G form the language L(G) defined by the grammar.
we review the definition of a context free grammar and introduce terminology for talking about parsing. A context free grammar has four components:
A set of tokens , known as terminal symbols. Terminals are the basic symbols from which strings are formed.
A set of non-terminals . Non-terminals are syntactic variables that denote sets of strings. The non-terminals define sets of strings that help define the language generated by the grammar.
A set of productions . The productions of a grammar specify the manner in which the terminals and non-terminals can be combined to form strings. Each production consists of a non-terminal called the left side of the production, an arrow, and a sequence of tokens and/or non-terminals, called the right side of the production.
A designation of one of the non-terminals as the start symbol , and the set of strings it denotes is the language defined by the grammar.
The strings are derived from the start symbol by repeatedly replacing a non-terminal (initially the start symbol) by the right hand side of a production for that non-terminal.
Example:
list → list + digit
list → list - digit
list → digit
digit → 0 | 1 | ... | 9
is the grammar for a string of digits separated by + or -.
Derivation
list ⇒ list + digit
⇒ list - digit + digit
⇒ digit - digit + digit
⇒ 9 - digit + digit
⇒ 9 - 5 + digit
⇒ 9 - 5 + 2
Therefore, the string 9-5+2 belongs to the language specified by the grammar.

Parse tree for 9-5+2
Ambiguity
. A Grammar can have more than one parse tree for a string
Consider grammar
string → string + string
string → string - string
string → 0 | 1 | ... | 9
. String 9-5+2 has two parse trees
A grammar is said to be an ambiguous grammar if there is some string that it can generate in more than one way (i.e., the string has more than one parse tree or more than one leftmost derivation). A language is inherently ambiguous if it can only be generated by ambiguous grammars.
For example, consider the following grammar:
string → string + string
string → string - string
string → 0 | 1 | ... | 9
In this grammar, the string 9-5+2 has two possible parse trees, as shown in the figure below.
Ambiguity .
. Ambiguity is problematic because meaning of the programs can be incorrect
. Ambiguity can be handled in several ways
- Enforce associativity and precedence
- Rewrite the grammar (cleanest way)
. There are no general techniques for handling ambiguity
. It is impossible to convert automatically an ambiguous grammar to an unambiguous one
Ambiguity is harmful to the intent of the program. The input might be interpreted in a way which was not really the intention of the programmer, as shown above in the 9-5+2 example. There is no general technique to handle ambiguity, i.e., it is not possible to develop some feature which automatically identifies and removes ambiguity from any grammar. However, it can be removed, broadly speaking, in the two ways listed above: by enforcing associativity and precedence, or by rewriting the grammar.
Parsing
. Process of determination whether a string can be generated by a grammar
. Parsing falls in two categories:
- Top-down parsing: Construction of the parse tree starts at the root (from the start symbol) and proceeds towards leaves (token or terminals)
- Bottom-up parsing: Construction of the parse tree starts from the leaf nodes (tokens or terminals of the grammar) and proceeds towards root (start symbol)
Parsing is the process of analyzing a continuous stream of input (read from a file or a keyboard, for example) in order to determine its grammatical structure with respect to a given formal grammar. The task of the parser is essentially to determine if and how the input can be derived from the start symbol within the rules of the formal grammar. This can be done in essentially two ways:
. Top-down parsing - A parser can start with the start symbol and try to transform it to the input. Intuitively, the parser starts from the largest elements and breaks them down into incrementally smaller parts. LL parsers are examples of top-down parsers. We will study about these in detail in the coming slides.
. Bottom-up parsing - A parser can start with the input and attempt to rewrite it to the start symbol. Intuitively, the parser attempts to locate the most basic elements, then the elements containing these, and so on. LR parsers are examples of bottom-up parsers. We will study about these in detail in the coming slides.
TOP DOWN PARSING :
To construct a parse tree for a string, we initially create a tree consisting of a single node (root node) labeled by the start symbol. Thereafter, we repeat the following steps to construct the parse tree, starting at the root labeled by the start symbol:
. At a node labeled with non-terminal A, select one of the productions of A and construct the children nodes. Then find the next node at which a subtree is constructed.
Parse array [ num dotdot num ] of integer
. Cannot proceed as non terminal "simple" never generates a string beginning with token "array". Therefore, requires back-tracking.
. Back-tracking is not desirable, therefore, take help of a "look-ahead" token. The current token is treated as look- ahead token. ( restricts the class of grammars )
To construct a parse tree corresponding to the string array [ num dotdot num ] of integer , we start with the start symbol type . Then, we use the production type → simple to expand the tree further and construct the first child node. Now, finally, the non-terminal simple should lead to the original string. But, as we can see from the grammar, the expansion of the non-terminal simple never generates a string beginning with the token "array". So, at this stage, we come to know that we had used the wrong production to expand the tree in the first step and we should have used some other production. So, we need to backtrack now. This backtracking tends to cause a lot of overhead during the parsing of a string and is therefore not desirable. To overcome this problem, a " look-ahead " token can be used. In this method, the current token is treated as look-ahead token and the parse tree is expanded by using the production which is determined with the help of the look-ahead token.
Left recursion
. A top down parser with production A → A α may loop forever
. From the grammar A → A α | β, left recursion may be eliminated by transforming the grammar to
A → β R
R → α R | ε
Left recursion is an issue of concern in top down parsers. A grammar is left-recursive if we can find some non-terminal A which will eventually derive a sentential form with itself as the leftmost symbol. In other words, a grammar is left recursive if it has a non-terminal A such that there is a derivation
A ⇒+ A α for some string α. These derivations may lead to an infinite loop. Removal of left recursion:
For the grammar A → A α | β, left recursion can be eliminated by transforming the original grammar as:
A → β R
R → α R | ε

This slide shows the parse trees corresponding to the string βα* using the original grammar (with left recursion) and the modified grammar (without left recursion).
Example
. Consider grammar for arithmetic expressions
E → E + T | T
T → T * F | F
F → ( E ) | id
. After removal of left recursion the grammar becomes
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id
As another example, a grammar having left recursion and its modified version with left recursion removed has been shown.
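The transformed grammar is directly suitable for recursive descent parsing: each non-terminal becomes one procedure. The following is a minimal Python sketch (the class and method names are our own, and the token list is assumed to be already produced by the lexical analyzer):

class Parser:
    def __init__(self, tokens):
        self.toks = tokens + ["$"]   # end marker
        self.pos = 0

    def look(self):
        return self.toks[self.pos]

    def match(self, t):
        if self.look() == t:
            self.pos += 1
        else:
            raise SyntaxError("expected %r, found %r" % (t, self.look()))

    # E  -> T E'
    def E(self):
        self.T(); self.Eprime()

    # E' -> + T E' | epsilon
    def Eprime(self):
        if self.look() == "+":
            self.match("+"); self.T(); self.Eprime()

    # T  -> F T'
    def T(self):
        self.F(); self.Tprime()

    # T' -> * F T' | epsilon
    def Tprime(self):
        if self.look() == "*":
            self.match("*"); self.F(); self.Tprime()

    # F  -> ( E ) | id
    def F(self):
        if self.look() == "(":
            self.match("("); self.E(); self.match(")")
        else:
            self.match("id")

p = Parser(["id", "+", "id", "*", "id"])
p.E(); p.match("$")          # parses id + id * id without error

Note how the ε-alternatives of E' and T' simply correspond to "do nothing" when the lookahead token does not match + or *.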
Removal of left recursion
The general algorithm to remove the left recursion follows. Several improvements to this method have been made. For each rule of the form
A → A α1 | A α2 | ... | A αm | β1 | β2 | ... | βn
Where:
. A is a left-recursive non-terminal.
. α is a sequence of non-terminals and terminals that is not null (α ≠ ε).
. β is a sequence of non-terminals and terminals that does not start with A.
Replace the A-production by the production:
A → β1 A' | β2 A' | ... | βn A'
And create a new non-terminal
A' → α1 A' | α2 A' | ... | αm A' | ε
This newly created symbol is often called the "tail", or the "rest".
Left recursion hidden due to many productions
.Left recursion may also be introduced by two or more grammar rules. For example
S → A a | b
A → A c | S d | ε
there is a left recursion because
S ⇒ A a ⇒ S d a
. In such cases, left recursion is removed systematically
- Starting from the first rule and replacing all the occurrences of the first non terminal symbol
- Removing left recursion from the modified grammar
What we saw earlier was an example of immediate left recursion but there may be subtle cases where left recursion occurs involving two or more productions. For example in the grammar
S → A a | b
A → A c | S d | ε
there is a left recursion because
S ⇒ A a ⇒ S d a
More generally, for the non-terminals A 0, A 1,..., An , indirect left recursion can be defined as being of the form:
A0 → A1 α1
A1 → A2 α2
...
An → A0 α(n+1)
where α1, α2, ..., α(n+1) are sequences of non-terminals and terminals.
Following algorithm may be used for removal of left recursion in general case:
Input: Grammar G with no cycles or ε-productions.
Output: An equivalent grammar with no left recursion.
Algorithm:
Arrange the non-terminals in some order A1, A2, A3, ..., An.
for i := 1 to n do begin
for j := 1 to i - 1 do begin
replace each production of the form Ai → Aj γ
by the productions Ai → δ1 γ | δ2 γ | ... | δk γ,
where Aj → δ1 | δ2 | ... | δk are all the current Aj-productions;
end
eliminate the immediate left recursion among the Ai-productions.
end.
Removal of left recursion due to many productions .
. After the first step (substitute S by its rhs in the rules) the grammar becomes
S → A a | b
A → A c | A a d | b d | ε
. After the second step (removal of left recursion) the grammar becomes
S → A a | b
A → b d A' | A'
A' → c A' | a d A' | ε
After the first step (substitute S by its R.H.S. in the rules), the grammar becomes
S → A a | b
A → A c | A a d | b d | ε
After the second step (removal of left recursion from the modified grammar obtained after the first step), the grammar becomes
S → A a | b
A → b d A' | A'
A' → c A' | a d A' | ε
Left factoring
. In top-down parsing when it is not clear which production to choose for expansion of a symbol
defer the decision till we have seen enough input.
In general if A → α β1 | α β2
defer the decision by expanding A to α A'
we can then expand A' to β1 or β2
. Therefore A → α β1 | α β2
transforms to
A → α A'
A' → β1 | β2
Left factoring is a grammar transformation that is useful for producing a grammar suitable for predictive parsing. The basic idea is that when it is not clear which of two or more alternative productions to use to expand a non-terminal A, we defer the decision till we have seen enough input to make the right choice.
In general if A → α β1 | α β2, we defer the decision by expanding A to α A',
and then we can expand A' to β1 or β2.
Therefore A → α β1 | α β2 transforms to
A → α A'
A' → β1 | β2
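A classical illustration (not taken from these notes) is the conditional-statement fragment of a grammar:
stmt → if expr then stmt else stmt
stmt → if expr then stmt
stmt → other
Both if-productions begin with the common prefix "if expr then stmt", so on seeing the token if a predictive parser cannot decide which production to apply. Left factoring the common prefix gives
stmt → if expr then stmt optional_else | other
optional_else → else stmt | ε
and the decision about the else part is deferred until the next input token is seen (optional_else is simply the name we give the new non-terminal here).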
Predictive parsers
. A non recursive top down parsing method
. Parser "predicts" which production to use
. It removes backtracking by fixing one production for every non-terminal and input token(s)
. Predictive parsers accept LL(k) languages
- First L stands for left to right scan of input
- Second L stands for leftmost derivation
- k stands for the number of lookahead tokens
. In practice LL(1) is used
In general, the selection of a production for a non-terminal may involve trial-and-error; that is, we may have to try a production and backtrack to try another production if the first is found to be unsuitable. A production is unsuitable if, after using the production, we cannot complete the tree to match the input string. Predictive parsing is a special form of recursive-descent parsing, in which the current input token unambiguously determines the production to be applied at each step. After eliminating left recursion and left factoring, we can obtain a grammar that can be parsed by a recursive-descent parser that needs no backtracking . Basically, it removes the need of backtracking by fixing one production for every non-terminal and input tokens. Predictive parsers accept LL(k) languages where:
. First L : The input is scanned from left to right.
. Second L : Leftmost derivations are derived for the strings.
. k : The number of lookahead tokens is k.
However, in practice, LL(1) grammars are used i.e., one lookahead token is used.
Predictive parsing
. Predictive parser can be implemented by maintaining an external stack
Parse table is a two dimensional array M[X,a] where "X" is a non terminal and "a" is a terminal of the grammar
It is possible to build a non recursive predictive parser maintaining a stack explicitly, rather than implicitly via recursive calls. A table-driven predictive parser has an input buffer, a stack, a parsing table, and an output stream. The input buffer contains the string to be parsed, followed by $, a symbol used as a right end marker to indicate the end of the input string. The stack contains a sequence of grammar symbols with a $ on the bottom, indicating the bottom of the stack. Initially the stack contains the start symbol of the grammar on top of $. The parsing table is a two-dimensional array M [X,a] , where X is a non-terminal, and a is a terminal or the symbol $ . The key problem during predictive parsing is that of determining the production to be applied for a non-terminal. The non-recursive parser looks up the production to be applied in the parsing table.
Algorithm:
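The standard table-driven loop can be sketched as follows. This is a minimal Python sketch and only one possible organisation; the parse table, start symbol and terminal set are assumed to be given, and ε is written as "eps":

def predictive_parse(tokens, table, start, terminals):
    """Table-driven LL(1) parsing: table[(X, a)] gives the RHS of the
    production X -> RHS to use when X is on the stack and a is the lookahead."""
    stack = ["$", start]
    tokens = tokens + ["$"]
    i = 0
    while stack:
        X, a = stack.pop(), tokens[i]
        if X in terminals or X == "$":
            if X == a:
                i += 1                       # match: consume the input symbol
            else:
                raise SyntaxError("expected %r, found %r" % (X, a))
        elif (X, a) in table:
            rhs = table[(X, a)]
            for sym in reversed(rhs):        # push RHS in reverse, skip epsilon
                if sym != "eps":
                    stack.append(sym)
            print(X, "->", " ".join(rhs))    # report the production used
        else:
            raise SyntaxError("error entry M[%s, %s]" % (X, a))
    return True

For the expression grammar used in the example below, the table would contain entries such as table[("E", "id")] = ["T", "E'"], table[("E'", "+")] = ["+", "T", "E'"] and table[("E'", "$")] = ["eps"].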
Example
. Consider the grammar
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id
As an example, we shall consider the grammar shown.
Parse table for the grammar
Blank entries are error states. For example E cannot derive a string starting with '+'
A predictive parsing table for the grammar in the previous slide is shown. In the table, the blank entries denote the error states; non-blanks indicate a production with which to expand the top nonterminal on the stack
. Constructing parse table
. Table can be constructed if for every non terminal, every lookahead symbol can be handled by at most one production
. First(α) for a string of terminals and non-terminals α is
- the set of symbols that might begin the fully expanded (made of only tokens) version of α
. Follow(X) for a non terminal X is
- set of symbols that might follow the derivation of X in the input stream
The construction of the parse table is aided by two functions associated with a grammar G. These functions, FIRST and FOLLOW, allow us to fill in the entries of a predictive parsing table for G, whenever possible. If α is any string of grammar symbols, FIRST(α) is the set of terminals that begin the strings derived from α. If α ⇒* ε, then ε is also in FIRST(α).
If X is a non-terminal, FOLLOW(X) is the set of terminals a that can appear immediately to the right of X in some sentential form, that is, the set of terminals a such that there exists a derivation of the form S ⇒* α X a β for some α and β. Note that there may, at some time during the derivation, have been symbols between X and a, but if so, they derived ε and disappeared. Also, if X can be the rightmost symbol in some sentential form, then $ is in FOLLOW(X).
Example
. For the expression grammar
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id
First(E) = First(T) = First(F) = { (, id }
First(E') = {+, ε }
First(T') = { *, ε }
Consider the grammar shown above. For example, id and left parenthesis are added to FIRST(F) by rule (3) in the definition of FIRST with i = 1 in each case, since FIRST(id) = {id} and FIRST('(') = { ( } by rule (1). Then by rule (3) with i = 1, the production T → F T' implies that id and left parenthesis are in FIRST(T) as well. As another example, ε is in FIRST(E') by rule (2).
Compute follow sets
1. Place $ in follow(S)
2. If there is a production A → α B β then everything in first(β) (except ε) is in follow(B)
3. If there is a production A → α B then everything in follow(A) is in follow(B)
4. If there is a production A → α B β and First(β) contains ε then everything in follow(A) is in follow(B)
Since follow sets are defined in terms of follow sets, the last two steps have to be repeated until the follow sets converge
To compute FOLLOW ( A ) for all non-terminals A , apply the following rules until nothing can be added to any FOLLOW set:
1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input right endmarker.
2. If there is a production A → α B β, then everything in FIRST(β) except for ε is placed in FOLLOW(B).
3. If there is a production A → α B, or a production A → α B β where FIRST(β) contains ε (i.e., β ⇒* ε), then everything in FOLLOW(A) is in FOLLOW(B).
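These definitions translate almost directly into a fixed-point computation. The following is a minimal Python sketch; the grammar representation and helper names are our own, ε-productions are written as empty right-hand sides, and "eps" stands for ε:

def compute_first(grammar, terminals):
    """grammar: dict mapping a non-terminal to a list of productions,
    each production being a list of symbols (empty list = epsilon)."""
    first = {X: set() for X in grammar}
    changed = True
    while changed:
        changed = False
        for X, prods in grammar.items():
            for rhs in prods:
                for sym in rhs:
                    add = {sym} if sym in terminals else first[sym] - {"eps"}
                    if not add <= first[X]:
                        first[X] |= add
                        changed = True
                    if sym in terminals or "eps" not in first[sym]:
                        break
                else:                        # every symbol of rhs can derive eps
                    if "eps" not in first[X]:
                        first[X].add("eps")
                        changed = True
    return first

def compute_follow(grammar, terminals, start, first):
    follow = {X: set() for X in grammar}
    follow[start].add("$")                       # rule 1
    changed = True
    while changed:
        changed = False
        for A, prods in grammar.items():
            for rhs in prods:
                for i, B in enumerate(rhs):
                    if B in terminals:
                        continue
                    trailer, eps = set(), True   # FIRST of what follows B
                    for sym in rhs[i + 1:]:
                        if sym in terminals:
                            trailer.add(sym); eps = False; break
                        trailer |= first[sym] - {"eps"}
                        if "eps" not in first[sym]:
                            eps = False; break
                    new = trailer | (follow[A] if eps else set())   # rules 2 and 3
                    if not new <= follow[B]:
                        follow[B] |= new
                        changed = True
    return follow

G = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}
terms = {"+", "*", "(", ")", "id"}
fst = compute_first(G, terms)       # fst["E"] == {"(", "id"}, fst["E'"] == {"+", "eps"}
flw = compute_follow(G, terms, "E", fst)   # flw["E"] == {")", "$"}, flw["T"] == {"+", ")", "$"}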
Error handling
. Stop at the first error and print a message
- Compiler writer friendly
- But not user friendly
. Every reasonable compiler must recover from errors and identify as many errors as possible
. However, multiple error messages due to a single fault must be avoided
. Error recovery methods
- Panic mode
- Phrase level recovery
- Error productions
- Global correction
Error handling and recovery is also one of the important tasks for a compiler. Errors can occur at any stage during the compilation. There are many ways in which errors can be handled. One way is to stop as soon as an error is detected and print an error message. This scheme is easy for the programmer to program but not user friendly. Specifically, a good parser should, on encountering a parsing error, issue an error message and resume parsing in some way, repairing the error if possible. However, multiple error messages due to a single fault are undesirable and tend to cause confusion if displayed. Error recovery is thus a non trivial task. The following error recovery methods are commonly used:
1. Panic Mode
2. Phrase level recovery
3. Error productions
4. Global correction
Panic mode
. Simplest and the most popular method
. Most tools provide for specifying panic mode recovery in the grammar
. When an error is detected
- Discard tokens one at a time until a set of tokens is found whose role is clear
- Skip to the next token that can be placed reliably in the parse tree
Let us discuss each of these methods one by one. Panic mode error recovery is the simplest and most commonly used method. On discovering the error, the parser discards input symbols one at a time until one of the designated set of synchronizing tokens is found. The synchronizing tokens are usually delimiters, such as semicolon or end , whose role in the source program is clear. The compiler designer must select the synchronizing tokens appropriate for the source language, of course.
Panic mode .
. Consider following code
begin
a = b + c;
x = p r ;
h = x < 0;
end;
. The second expression has syntax error
. Panic mode recovery for begin-end block skip ahead to next ';' and try to parse the next expression
. It discards one expression and tries to continue parsing
. May fail if no further ';' is found
Consider the code shown in the example above. As we can see, the second expression has a syntax error. Now panic mode recovery for begin-end block states that in this situation skip ahead until the next ' ; ' is seen and try to parse the following expressions thereafter i.e., simply skip the whole expression statement if there is an error detected. However, this recovery method might fail if no further ' ; ' is found. While panic-mode correction often skips a considerable amount of input without checking it for additional errors, it has the advantage of simplicity and, unlike some other methods to be considered later, it is guaranteed not to go into an infinite loop. In situations where multiple errors in the same statement are rare, this method may be quite adequate.
Phrase level recovery
. Make local correction to the input
. Works only in limited situations
- A common programming error which is easily detected
- For example insert a ";" after closing "}" of a class definition
. Does not work very well!
Phrase level recovery is implemented by filling in the blank entries in the predictive parsing table with pointers to error routines. These routines may change, insert, or delete symbols on the input and issue appropriate error messages. They may also pop from the stack. It basically makes local corrections to the input; that is, it may replace a prefix of the remaining input by some string that allows the parser to continue. A typical local correction would be to replace a comma by a semicolon, delete an extraneous semicolon, or insert a missing semicolon. This type of replacement can correct any input string and has been used in several error-repairing compilers. However, its major drawback is the difficulty it has in coping with situations in which the actual error has occurred before the point of detection.
Error productions
. Add erroneous constructs as productions in the grammar
. Works only for most common mistakes which can be easily identified
. Essentially makes common errors as part of the grammar
. Complicates the grammar and does not work very well
If we have a good idea of the common errors that might be encountered, we can augment the grammar for the language at hand with productions that generate the erroneous constructs. We then use the grammar augmented by these error productions to construct a parser. If an error production is used by a parser, we can generate appropriate error diagnostics to indicate the error construct that has been recognized in the input. The main drawback of this approach is that it tends to complicate the grammar and thus does not work very well.
Global corrections
. Considering the program as a whole find a correct "nearby" program
. Nearness may be measured using certain metric
. PL/C compiler implemented this scheme: anything could be compiled!
. It is complicated and not a very good idea!
Ideally, we would like a compiler to make as few changes as possible in processing the incorrect input string. There are algorithms for choosing a minimal sequence of changes to obtain globally least-cost correction. Given an incorrect input string x and a grammar G, these algorithms will find a parse tree for a related string y such that the number of insertions, deletions, and changes of tokens required to transform x into y is as small as possible. Unfortunately, these methods are in general too costly to implement in terms of time and space, so these techniques are currently only of theoretical interest
Error Recovery in LL(1) parser
. Error occurs when a parse table entry M[A,a] is empty
. Skip symbols in the input until a token in a selected set (synch) appears
. Place symbols in follow(A) in synch set. Skip tokens until an element in follow(A) is seen. Pop(A) and continue parsing
. Add symbol in first(A) in synch set. Then it may be possible to resume parsing according to A if a symbol in first(A) appears in input.
Let us consider error recovery in an LL(1) parser by panic-mode recovery method. An error occurs when the terminal on top of the stack does not match the next input symbol or when non-terminal A is on top of the stack, a is the next input symbol, and the parsing table entry M[A, a ] is empty. Panic-mode error recovery is based on the idea of skipping symbols on the input until a token in a selected set of synchronizing tokens appears. Its effectiveness depends on the choice of synchronizing set. The sets should be chosen so that the parser recovers quickly from errors that are likely to occur in practice. Some heuristics are as follows:
1. As a starting point, we can place all symbols in FOLLOW(A) into the synchronizing set for non-terminal A. If we skip tokens until an element of FOLLOW(A) is seen and pop A from the stack, it is likely that parsing can continue.
2. If we add symbols in FIRST(A) to the synchronizing set for non-terminal A, then it may be possible to resume parsing according to A if a symbol in FIRST(A) appears in the input.
Essay type questions:
1.What is recursive descent parser? Construct recursive descent parser for the
following grammar.
E → E + T | T, T → TF | F, F → F* | a | b
2. What is ambiguous grammar? Eliminate ambiguities for the grammar:
E → E + E | E * E | (E) | id.

3. Consider the following grammar.
S → 0A | 1B | 0 | 1, A → 0S | 1B | 1, B → 0A | 1S
Construct leftmost derivations and parse trees for the following sentences
i. 0101 ii. 1100101
4. Construct predictive parsing table for the following grammar.
E → T E′, E′ → +T E′ | ε, T → F T′, T′ → ∗F T′ | ε, F → (E) | id
5. Write an algorithm for construction of predictive parsing table.
6. Consider the following grammar
E → T + E | T, T → V ∗ T | V, V → id
Write down the procedures for the nonterminals of the grammar to make a
recursive descent parser
7. Remove the left recursion from the following grammar
E → T E′, E′ → +T E′ | ε, T → F T′, T′ → ∗F T′ | ε, F → (E) | id
8. Write about top down parsing. What are the limitations of recursive descent parsers?
9. Consider the following grammar
S → (L) | a, L → L, S | S
Construct leftmost derivations and parse trees for the following sentences:
i. (a,(a,a)) ii. (a,((a,a),(a,a))).
10. What are the difficulties in top down parsing? Explain in detail.
Short answer questions:
1. Define context free grammar.
2. What is top down parsing?
3. What is backtracking?
4. Expand LL(1).
5. Define recursive descent parsing.
6. What is predictive parsing?
7. What is leftmost derivation?
8. What is rightmost derivation?
9. What is a parser?
10. Define parse tree.
11. What is left recursion?
12. What is left factoring?
13. Draw the structure of a parse table.
14. A grammar is a device:
(a) can't say (b) cognitive (c) accept (d) None
15. Any string of terminals that can be generated by the following CFG
S → XY, X → aX | bX | a, Y → Ya | Yb | a
(a) has at least one b (b) has at least two a's (c) has no consecutive a's or b's
(d) should end in an 'a'

UNIT III
BOTTOM UP PARSING: Shift Reduce parsing, LR and LALR parsing, error recovery in parsing,
handling ambiguous grammar, YACC - automatic
parser generator

Bottom up parsing
. Construct a parse tree for an input string beginning at leaves and going towards root OR
. Reduce a string w of input to start symbol of grammar
Consider a grammar
S → aABe
A → Abc | b
B → d
And reduction of a string
a b b c d e
a A b c d e
a A d e
a A B e
S
Rightmost derivation
S ⇒ a A B e
⇒ a A d e
⇒ a A b c d e
⇒ a b b c d e
We will now study a general style of bottom up parsing, also known as shift-reduce parsing. Shift-reduce parsing attempts to construct a parse tree for an input string beginning at the leaves (the bottom) and working up towards the root (the top). We can think of the process as one of "reducing" a string w to the start symbol of a grammar. At each reduction step a particular substring matching the right side of a production is replaced by the symbol on the left of that production, and if the substring is chosen correctly at each step, a rightmost derivation is traced out in reverse. For example, consider the grammar
S → aABe
A → Abc | b
B → d
The sentence a b b c d e can be reduced to S by the following steps:
a b b c d e
a A b c d e
a A d e
a A B e
S
These reductions trace out the following rightmost derivation in reverse:
S ⇒ a A B e
⇒ a A d e
⇒ a A b c d e
⇒ a b b c d e
Shift reduce parsing
. Split string being parsed into two parts
- Two parts are separated by a special character "."
- Left part is a string of terminals and non terminals
- Right part is a string of terminals
. Initially the input is .w
A convenient way to implement a shift reduce parser is by using an input buffer to hold the string w to be parsed. At any stage, the input string is split into two parts which are separated by the character ' . '. The left part consists of terminals and non-terminals while the right part consists of terminals only. Initially, the string " .w " is on the input.
Shift reduce parsing .
. Bottom up parsing has two actions
. Shift : move terminal symbol from right string to left string
if the string before the shift is α.pqr
then the string after the shift is αp.qr
. Reduce: immediately to the left of ".", identify a string same as the RHS of a production and replace it by the LHS
if the string before the reduce action is αβ.pqr
and A → β is a production
then the string after the reduction is αA.pqr
There are two primary operations of a shift-reduce parser, namely (1) shift and (2) reduce. In a shift action, the next input symbol is shifted from the right string to the left string. For example, if the string before the shift is "α.pqr" then the string after the shift would be "αp.qr". In a reduce action, the parser identifies a string which is the same as the RHS of a production and replaces it by the non-terminal on the LHS. For example, if the string before the reduce action is "αβ.pqr" and A → β is a production, then the string after the reduction is "αA.pqr".
Bottom up parsing .
. A more powerful parsing technique
. LR grammars - more expensive than LL
. Can handle left recursive grammars
. Can handle virtually all the programming languages
. Natural expression of programming language syntax
. Automatic generation of parsers (Yacc, Bison etc.)
. Detects errors as soon as possible
. Allows better error recovery
Bottom-up parsing is a more powerful parsing technique. Listed are some of its advantages: it is capable of handling almost all the programming languages; it can handle left recursion in the grammar; it detects errors as soon as possible and therefore allows better error recovery.

LR parsing
. Input contains the input string.
. Stack contains a string of the form S0 X1 S1 X2 ... Xn Sn where each Xi is a grammar symbol and each Si is a state.
. Tables contain action and goto parts.
. The action table is indexed by state and terminal symbols.
. The goto table is indexed by state and non-terminal symbols.
Actions in an LR (shift reduce) parser
. Assume Si is top of stack and ai is current input symbol
. Action [Si ,a i ] can have four values
1. shift ai to the stack and goto state Sj
2. reduce by a rule
3. Accept
4. error
Example

Consider the grammar and its parse table:
E → E + T | T
T → T * F | F
F → ( E ) | id



Constructing parse table
Augment the grammar
. G is a grammar with start symbol S
. The augmented grammar G' for G has a new start symbol S' and an additional production S' → S
. When the parser reduces by this rule it will stop with accept
The first step in the process of constructing the parse table is to augment the grammar to include one more rule. The input is accepted by the parser only when it reduces with this rule. We add a new start symbol in the augmented grammar (say S') and another rule (S' → S), where S is the start symbol in the original grammar. The parser stops and accepts only when it reduces with this rule.
LR(0) items
. An LR(0) item of a grammar G is a production of G with a special symbol "." at some position of the right side
. Thus production A → XYZ gives four LR(0) items
A → .XYZ
A → X.YZ
A → XY.Z
A → XYZ.
. An item indicates how much of a production has been seen at a point in the process of parsing
- Symbols on the left of "." are already on the stack
- Symbols on the right of "." are expected in the input
An LR(0) item represents a production in such a way that you keep track of the input already read (i.e., present on the stack) and the input yet to be expected. Consider a typical example A → Y.XZ; here the special symbol "." means that the expression to the left of it (i.e., Y) is present on the stack, while the one on its right is yet expected to complete this production.
Example
Grammar:
E' → E
E → E + T | T
T → T * F | F
F → ( E ) | id

I0: closure(E' → .E)
E' → .E
E → .E + T
E → .T
T → .T * F
T → .F
F → .(E)
F → .id

I1: goto(I0, E)
E' → E.
E → E. + T

I2: goto(I0, T)
E → T.
T → T. * F

I3: goto(I0, F)
T → F.

I4: goto(I0, ( )
F → (.E)
E → .E + T
E → .T
T → .T * F
T → .F
F → .(E)
F → .id

I5: goto(I0, id)
F → id.

I6: goto(I1, +)
E → E + .T
T → .T * F
T → .F
F → .(E)
F → .id

I7: goto(I2, *)
T → T * .F
F → .(E)
F → .id

I8: goto(I4, E)
F → (E.)
E → E. + T

I9: goto(I6, T)
E → E + T.
T → T. * F

I10: goto(I7, F)
T → T * F.

I11: goto(I8, ) )
F → (E).

Also: goto(I6, F) is I3, goto(I6, ( ) is I4, goto(I6, id) is I5, goto(I7, ( ) is I4, goto(I7, id) is I5, goto(I4, T) is I2, goto(I4, F) is I3, goto(I4, ( ) is I4, goto(I4, id) is I5, goto(I8, +) is I6, goto(I9, *) is I7.
Let's take an example here. We have earlier calculated the closure I0. Here, notice that we need to calculate goto(I0, E), goto(I0, T), goto(I0, F), goto(I0, ( ) and goto(I0, id). For calculating goto(I0, E), we take all the LR(0) items in I0 which expect E as input (i.e., are of the form A → α.Eβ), and advance ".". The closure is then taken of this set. Here, goto(I0, E) will be closure { E' → E. , E → E.+T }. The closure adds no item to this set, and hence goto(I0, E) = { E' → E. , E → E.+T }.
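The closure and goto computations described above can be written down quite directly. The following is a minimal Python sketch; the item representation (a triple of left-hand side, right-hand side and dot position) and the function names are our own:

def closure(items, grammar):
    """Items are (lhs, rhs, dot) triples; rhs is a tuple of symbols."""
    C = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot) in list(C):
            if dot < len(rhs) and rhs[dot] in grammar:   # dot before a non-terminal B
                B = rhs[dot]
                for prod in grammar[B]:
                    item = (B, tuple(prod), 0)           # add B -> .gamma
                    if item not in C:
                        C.add(item); changed = True
    return frozenset(C)

def goto(items, X, grammar):
    """Move the dot over X in every item that expects X, then take the closure."""
    moved = {(lhs, rhs, dot + 1)
             for (lhs, rhs, dot) in items
             if dot < len(rhs) and rhs[dot] == X}
    return closure(moved, grammar)

G = {
    "E'": [["E"]],
    "E":  [["E", "+", "T"], ["T"]],
    "T":  [["T", "*", "F"], ["F"]],
    "F":  [["(", "E", ")"], ["id"]],
}
I0 = closure({("E'", ("E",), 0)}, G)
I1 = goto(I0, "E", G)
# I1 contains E' -> E.  and  E -> E. + T, exactly as in the example above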

Essay type questions:

1. Construct SLR parsing table for the following grammar.
S → AS | b, A → SA | a
2. Construct LALR parsing table for the following grammar
S → CC, C → cC | d
3. What are the common conflicts that can be encountered in shift - reduce
parser.
4. Construct SLR parsing table for the following grammar.
R → R '|' R | RR | R∗ | (R) | a | b
5. Explain the shift reduce parser.
6. Explain canonical LR parsing.
7. What is an operator grammar? Give an example.
8. What are the common conflicts that can be encountered in shift - reduce
parser.
9. Explain the stack implementation of shift reduce parsing method with an
example.
10. Define handle. Give suitable example
Short answer questions:
1. What is bottom up parsing?
2. Define shift reduce parsing.
3. What is an LR parser?
4. What is an SLR parser?
5. Define CLR parser.
6. Define LALR parsing.
7. What is error recovery in parsing?
8. Define ambiguous grammar.
9. Write the structure of YACC.
10. What is an automatic parser generator?
11. Write the role of the parser.
12. What is the use of a parser?
13. How to avoid ambiguity in a grammar?
14. Which parser is the strongest in the LR family?
15. Write the formula for finding the lookahead symbol in a CLR parser.


UNIT IV
SEMANTIC ANALYSIS: Intermediate forms of source programs - abstract syntax tree,
polish notation and 3 address codes. Attributed grammars, SDT,
conversion of popular programming language constructs into intermediate code forms, type checker.

Abstract Syntax Tree/DAG
. Condensed form of a parse tree
. useful for representing language constructs
. Depicts the natural hierarchical structure of the source program
- Each internal node represents an operator
- Children of the nodes represent operands
- Leaf nodes represent operands
. DAG is more compact than abstract syntax tree because common sub expressions are eliminated
A syntax tree depicts the natural hierarchical structure of a source program. Its structure has already been discussed in earlier lectures.
DAGs are generated as a combination of trees: operands that are being reused are linked together, and nodes may be annotated with variable names (to denote assignments). This way, DAGs are highly compact, since they eliminate local common sub-expressions. On the other hand, they are not so easy to optimize, since they are more specific tree forms. However, it can be seen that proper building of DAG for a given sequence of instructions can compactly represent the outcome of the calculation.
An example of a syntax tree and DAG has been given in the next slide .
a := b * -c + b * -c
You can see that the node " * " comes only once in the DAG as well as the leaf " b ", but the meaning conveyed by both the representations (AST as well as the DAG) remains the same.

Postfix notation is a way of writing algebraic expressions without the use of parentheses or rules of operator precedence. The expression (A+B)/(C-D), for example, would be written as AB+CD-/ in postfix notation. (Don't panic! We'll explain this in a moment.) Postfix notation had its beginnings in the work of Jan Łukasiewicz (1878-1956), a Polish logician, mathematician, and philosopher. Łukasiewicz developed a parenthesis-free prefix notation that came to be called Polish notation and a postfix notation now called Reverse Polish Notation or RPN. From these ideas, Charles Hamblin developed a postfix notation for use in computers. Łukasiewicz's work dates from about 1920. Hamblin's work on postfix notation was in the mid-1950's. Calculators, notably those from Hewlett-Packard, used various postfix formats beginning in the 1960s. Postfix notation is a linearized representation of a syntax tree; it is a list of the nodes of the tree in which a node appears immediately after its children. You must read the three points written in the slide above to see how postfix expressions are made corresponding to a set of expressions.
Three address code
. It is a sequence of statements of the general form X := Y op Z where
- X, Y or Z are names, constants or compiler generated temporaries
- op stands for any operator such as a fixed- or floating-point arithmetic operator, or a logical operator
Three address code is a sequence of statements of the general form: x := y op z where x, y and z are names, constants, or compiler generated temporaries. op stands for any operator, such as a fixed or floating-point arithmetic operator, or a logical operator or boolean - valued data. Compilers use this form in their IR.

Three address instructions
. Assignment
- x = y op z
- x = op y
- x = y
. Jump
- goto L
- if x relop y goto L
. Function
- param x
- call p, n
- return y
. Pointer
- x = &y
- x = *y
- *x = y
. Indexed assignment
- x = y[i]
- x[i] = y
These are the various types of three-address statements. Statements can have a symbolic label and there are statements for flow of control. A symbolic label represents the index of a three-address statement in the array holding the intermediate code. Actual indices can be substituted for the labels either by making a separate pass, or by using backpatching.
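As an illustration of how a front end might emit such statements, here is a minimal Python sketch with newtemp/emit helpers for a tiny expression tree. The tuple-based tree format and function names are our own, used only to make the idea concrete:

temp_count = 0
code = []

def newtemp():
    global temp_count
    temp_count += 1
    return "t%d" % temp_count

def emit(stmt):
    code.append(stmt)

def gen(node):
    """node is a name (string), ('uminus', e), or (op, left, right)."""
    if isinstance(node, str):
        return node
    if node[0] == "uminus":
        p = gen(node[1])
        t = newtemp()
        emit("%s := uminus %s" % (t, p))
        return t
    op, l, r = node
    lp, rp = gen(l), gen(r)
    t = newtemp()
    emit("%s := %s %s %s" % (t, lp, op, rp))
    return t

# a := b * -c + b * -c  (syntax-tree version, common subexpressions not shared)
rhs = ("+", ("*", "b", ("uminus", "c")), ("*", "b", ("uminus", "c")))
emit("a := %s" % gen(rhs))
print("\n".join(code))
# t1 := uminus c
# t2 := b * t1
# t3 := uminus c
# t4 := b * t3
# t5 := t2 + t4
# a := t5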
Declarations
For each name create symbol table entry with information like type and relative address
P → {offset = 0} D
D → D ; D
D → id : T  { enter(id.name, T.type, offset); offset = offset + T.width }
T → integer { T.type = integer; T.width = 4 }
T → real    { T.type = real; T.width = 8 }
In the translation scheme nonterminal P generates a sequence of declarations of the form id : T . Before the first declaration is considered, offset is set to 0. As each new name is seen, that name is entered in the symbol table with the offset equal to the current value of offset, and offset is incremented by the width of the data object denoted by the name.
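The enter() action and the running offset can be mimicked directly. A minimal Python sketch (the dictionary-based table and names are our own; the widths follow the rules above, integer = 4 and real = 8):

WIDTH = {"integer": 4, "real": 8}
symtab = {}
offset = 0

def enter(name, typ):
    """Record the name with its type and relative address, then bump the offset."""
    global offset
    symtab[name] = {"type": typ, "offset": offset}
    offset += WIDTH[typ]

# processing the declarations  i : integer ; x : real ; j : integer
for name, typ in [("i", "integer"), ("x", "real"), ("j", "integer")]:
    enter(name, typ)

print(symtab)
# {'i': {'type': 'integer', 'offset': 0},
#  'x': {'type': 'real', 'offset': 4},
#  'j': {'type': 'integer', 'offset': 12}}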
Names in the Symbol table
S → id := E
{ p = lookup(id.name);
  if p <> nil then emit(p ':=' E.place)
  else error }
E → id
{ p = lookup(id.name);
  if p <> nil then E.place = p
  else error }
Type conversion within assignments
E → E1 + E2
E.place = newtmp;
if E1.type = integer and E2.type = integer
then emit(E.place ':=' E1.place 'int+' E2.place);
E.type = integer;
...
similar code if both E1.type and E2.type are real
...
else if E1.type = integer and E2.type = real
then
u = newtmp;
emit(u ':=' inttoreal E1.place);
emit(E.place ':=' u 'real+' E2.place);
E.type = real;
...
similar code if E1.type is real and E2.type is integer
Boolean Expressions
. compute logical values
. change the flow of control
. boolean operators are: and or not
Boolean expressions are used to compute logical values and as conditional expressions in statements that alter flow of control such as if-then statements. Boolean expressions are composed of the Boolean operators and, or, not - applied to boolean variables or relational expressions.
Relational expressions are of the form E1 relop E2 where E1 and E2 are arithmetic expressions. Boolean expressions can be generated by the following grammar-
E → E or E | E and E | not E | (E) | id relop id | true | false
Syntax directed translation of boolean expressions
E → E1 or E2
E.place := newtmp
emit(E.place ':=' E1.place 'or' E2.place)
E → E1 and E2
E.place := newtmp
emit(E.place ':=' E1.place 'and' E2.place)
E → not E1
E.place := newtmp
emit(E.place ':=' 'not' E1.place)
E → (E1)
E.place := E1.place
E → id1 relop id2
E.place := newtmp
emit('if' id1.place relop id2.place 'goto' nextstat+3)
emit(E.place '=' '0')
emit('goto' nextstat+2)
emit(E.place '=' '1')
E → true
E.place := newtmp
emit(E.place '=' '1')
E → false
E.place := newtmp
emit(E.place '=' '0')
In the above scheme, nextstat gives the index of the next three address code in the output sequence and emit increments nextstat after producing each three address statement.
Example: Code for a < b or c < d and e < f

100: if a < b goto 103
101: t1 = 0
102: goto 104
103: t1 = 1
104: if c < d goto 107
105: t2 = 0
106: goto 108
107: t2 = 1
108: if e < f goto 111
109: t3 = 0
110: goto 112
111: t3 = 1
112: t4 = t2 and t3
113: t5 = t1 or t4



A relational expression a < b is equivalent to the conditional statement if a < b then 1 else 0 and three address code for this expression is:
100: if a < b goto 103.
101: t = 0
102: goto 104
103: t = 1
104:
It is continued from 104 in the same manner as the above written block.







S → if E then S1 else S2
E.true = newlabel
E.false = newlabel
S1.next = S.next
S2.next = S.next
S.code = E.code
gen(E.true ':')
S1.code
gen('goto' S.next)
gen(E.false ':')
S2.code
For if-then-else
S -> if E then S1 else S2
E.true = newlabel
E.false = newlabel
S1.next = S.next
S2.next = S.next
S.code = E.code gen(E.true ':') S1.code gen('goto' S.next) gen(E.false ':') S2.code
In the above code, the labels E.true and E.false created are associated with the first three address code instructions for S1 and S2 respectively, so that if E is true, jump to S1 occurs and if E is false jump to S2 occurs. An explicit goto S.next is required after the code of S1 to ensure that after execution of code for S1 control moves to the statement after S instead of falling through the code of S2, in case E is true.



S → while E do S1
S.begin = newlabel
E.true = newlabel
E.false = S.next
S1.next = S.begin
S.code = gen(S.begin ':')
E.code
gen(E.true ':')
S1.code
gen('goto' S.begin)
For while-do
S -> while E do S1
S.begin = newlabel
E.true = newlabel
E.false = S.next
S1.next = S.begin
S.code = gen(S.begin ':') E.code gen(E.true ':') S1.code gen(goto S.begin)
Control flow translation of boolean expression

E → E1 or E2
E1.true := E.true
E1.false := newlabel
E2.true := E.true
E2.false := E.false
E.code := E1.code gen(E1.false ':') E2.code
E → E1 and E2
E1.true := newlabel
E1.false := E.false
E2.true := E.true
E2.false := E.false
E.code := E1.code gen(E1.true ':') E2.code
Case Statement
. switch expression
begin
case value: statement
case value: statement
..
case value: statement
default: statement
end
.evaluate the expression
. find which value in the list of cases is the same as the value of the expression
- The default value matches the expression if none of the values explicitly mentioned in the cases matches the expression
. execute the statement associated with the value found

Essay type questions:
1. Write an S-attributed grammar to connect the following grammar with prefix
notation
L → E, E → E + T | E - T | T, T → T * F | T / F | F,
F → P ↑ F | P, P → (E), P → id.
2. Write short notes on the following:
(a) S-attributed definitions. (b) L-attributed definitions. (c) Dependency graph.

3. What is a syntax tree? Write syntax directed definition for constructing a
syntax tree for an expression. The grammar for an expression is given below.
E → E + T | E − T | T, T → (E) | id | num
4. Write a note on polymorphic functions.
5. Construct triples for the expression: a ∗ −(b + c).
6. Write a note on the specification of a simple type checker.
7. Write detailed notes on type conversion.
8. Construct a quadruple form for the following expression: x := b*-c + b*-c
9. Write about three address form with an example.
10. Write about abstract syntax tree, Polish notation and DAG.

Short answer questions:
1. Write the role of semantic analysis.
2. What is the role of intermediate forms of source programs?
3. What is an abstract syntax tree?
4. Define polish notation.
5. What are the three address forms?
6. Define attributed grammar.
7. What is syntax directed translation?
8. What is a syntax directed definition?
9. What are programming language constructs?
10. What is a type checker?
11. Define type expression.
12. What is the role of the intermediate code generator?
13. What is type conversion?
14. Write the difference between S-attributed and L-attributed definitions.
15. What is an L-attributed grammar?
UNIT V
SYMBOL TABLES: Symbol table format, organization for block structured languages, hashing,
tree structure representation of scope information, block structured and non block structured
storage allocation for arrays, strings and records.

Symbol Table
. Compiler uses symbol table to keep track of scope and binding information about names
. symbol table is changed every time a name is encountered in the source; changes to table occur
- if a new name is discovered
- if new information about an existing name is discovered
. Symbol table must have mechanism to:
- add new entries
- find existing information efficiently
. Two common mechanism:
- linear lists, simple to implement, poor performance
- hash tables, greater programming/space overhead, good performance
. Compiler should be able to grow symbol table dynamically
. if size is fixed, it must be large enough for the largest program
A compiler uses a symbol table to keep track of scope and binding information about names. It is filled after the AST is made by walking through the tree, discovering and assimilating information about the names. There should be two basic operations - to insert a new name or information into the symbol table as and when discovered and to efficiently lookup a name in the symbol table to retrieve its information.
Two common data structures used for the symbol table are -
1. Linear lists:- simple to implement, poor performance.
2. Hash tables:- greater programming/space overhead, good performance.
Ideally a compiler should be able to grow the symbol table dynamically, i.e., insert new entries or information as and when needed. But if the size of the table is fixed in advance (an array implementation, for example), then the size must be big enough in advance to accommodate the largest possible program.
Symbol Table Entries
. each entry for a declaration of a name
. format need not be uniform because information depends upon the usage of the name
. each entry is a record consisting of consecutive words
. to keep records uniform some entries may be outside the symbol table
. information is entered into symbol table at various times
- keywords are entered initially
- identifier lexemes are entered by lexical analyzer
. symbol table entry may be set up when role of name becomes clear
. attribute values are filled in as information is available
For each declaration of a name, there is an entry in the symbol table. Different entries need to store different information because of the different contexts in which a name can occur. An entry corresponding to a particular name can be inserted into the symbol table at different stages depending on when the role of the name becomes clear. The various attributes that an entry in the symbol table can have are lexeme, type of name, size of storage and in case of functions - the parameter list etc.
Data Structures
. List data structure
- simplest to implement
- use a single array to store names and information
- search for a name is linear
- entry and lookup are independent operations
- cost of entry and search operations are very high and lot of time goes into book keeping
. Hash table
- The advantages are obvious

Representing Scope Information
. entries are declarations of names
. when a lookup is done, entry for appropriate declaration must be returned
. scope rules determine which entry is appropriate
. maintain separate table for each scope
. symbol table for a procedure or scope is the compile-time equivalent of an activation record
. information about non-locals is found by scanning the symbol tables of the enclosing procedures
. symbol table can be attached to abstract syntax of the procedure (integrated into intermediate representation)
most closely nested scope rule can be implemented in data structures discussed so far
. give each procedure a unique number
. blocks must also be numbered
. procedure number is part of all local declarations
. name is represented as a pair of number and name
. names are entered in symbol table in the order they occur
. most closely nested rule can be created in terms of following operations:
- lookup: find the most recently created entry
- insert: make a new entry
- delete: remove the most recently created entry
Most closely nested scope rules can be implemented by adapting the data structures discussed in the previous section. Each procedure is assigned a unique number; if the language is block-structured, the blocks are also assigned unique numbers. A name is then represented as a pair of a number and a name, and it is this pair that is entered in the symbol table. The most closely nested rule can be implemented in terms of the following operations (a sketch follows the list):
a) Lookup - find the most recently created entry.
b) Insert - make a new entry.
c) Delete - remove the most recently created entry.
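A rough sketch of these three operations in C, assuming entries carry a (procedure number, name) pair and are kept in a single most-recent-first list; the names used here are hypothetical:
#include <stdlib.h>
#include <string.h>

struct scoped_entry {
    int proc_number;              /* number of the procedure or block that declared the name */
    char *name;
    struct scoped_entry *older;   /* entry created just before this one */
};

static struct scoped_entry *top = NULL;

/* insert: make a new entry */
void scope_insert(int proc, char *name)
{
    struct scoped_entry *e = malloc(sizeof *e);
    e->proc_number = proc;
    e->name = name;
    e->older = top;
    top = e;
}

/* lookup: the most recently created entry for the name wins */
struct scoped_entry *scope_lookup(const char *name)
{
    struct scoped_entry *e;
    for (e = top; e != NULL; e = e->older)
        if (strcmp(e->name, name) == 0)
            return e;
    return NULL;
}

/* delete: remove the most recently created entry (on leaving a scope) */
void scope_delete(void)
{
    struct scoped_entry *e = top;
    top = e->older;
    free(e);
}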
Symbol table structure
. Assign variables to storage classes that prescribe scope, visibility, and lifetime
- scope rules prescribe the symbol table structure
- scope: unit of static program structure with one or more variable declarations
- scope may be nested
. Pascal: procedures are scoping units
. C: blocks, functions, files are scoping units
. Visibility, lifetimes, global variables
. Common (in Fortran)
. Automatic or stack storage
. Static variables
storage class : A storage class is an extra keyword at the beginning of a declaration which modifies the declaration in some way. Generally, the storage class (if any) is the first word in the declaration, preceding the type name. Ex. static, extern etc.
Scope: The scope of a variable is simply the part of the program where it may be accessed or written. It is the part of the program where the variable's name may be used. If a variable is declared within a function, it is local to that function. Variables of the same name may be declared and used within other functions without any conflicts. For instance,
int fun1() { int a; int b; .... }
int fun2() { int a; int c; .... }
Visibility: The visibility of a variable determines how much of the rest of the program can access that variable. You can arrange that a variable is visible only within one part of one function, or in one function, or in one source file, or anywhere in the program.
Local and Global variables: A variable declared within the braces {} of a function is visible only within that function; variables declared within functions are called local variables. On the other hand, a variable declared outside of any function is a global variable , and it is potentially visible anywhere within the program
Automatic vs static duration: How long do variables last? By default, local variables (those declared within a function) have automatic duration: they spring into existence when the function is called, and they (and their values) disappear when the function returns. Global variables, on the other hand, have static duration: they last, and the values stored in them persist, for as long as the program does. (Of course, the values can in general still be overwritten, so they don't necessarily persist forever.)
To give a local variable static duration (so that, instead of coming and going as the function is called, it persists for as long as the program does), precede its declaration with the static keyword: static int i;
By default, a declaration of a global variable (especially if it specifies an initial value) is the defining instance. To make it an external declaration, of a variable which is defined somewhere else, precede it with the keyword extern: extern int j;
Finally, to arrange that a global variable is visible only within its containing source file, precede it with the static keyword: static int k;
Notice that the static keyword can do two different things: it adjusts the duration of a local variable from automatic to static, or it adjusts the visibility of a global variable from truly global to private-to-the-file. A short C fragment illustrating these storage classes follows.
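The following fragment simply collects the cases described above in one place; the variable and function names are arbitrary:
/* illustrative only; j must be defined in some other source file before linking */
extern int j;          /* external declaration: j is defined elsewhere            */
static int k = 0;      /* file-scope static: visible only in this source file     */
int g = 1;             /* defining instance of a global variable                  */

int counter(void)
{
    int a = 0;         /* automatic duration: created afresh on every call        */
    static int i = 0;  /* static duration: its value persists across calls        */
    a = a + 1;         /* a is always 1 here; it does not persist                 */
    i = i + 1;         /* i counts 1, 2, 3, ... across successive calls           */
    return i;
}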

Activation Tree
The flow of control for the call quicksort(1,9) would be as follows:
Execution begins ..
enter readarray
exit readarray
enter quicksort(1,9)
enter partition(1,9)
exit partition(1,9)
enter quicksort(1,3)
exit quicksort(1,3)
enter quicksort(5,9)
exit quicksort(5,9)
exit quicksort(1,9)
Execution terminates.
When this information is represented as an activation tree, we get the tree as shown.
Storage organization

The runtime storage might be subdivided into
- Target code
- Data objects
- Stack to keep track of procedure activation
- Heap to keep all other information

This kind of organization of run-time storage is used for languages such as Fortran, Pascal and C. The size of the generated target code, as well as that of some of the data objects, is known at compile time. Thus, these can be stored in statically determined areas in the memory. Pascal and C use the stack for procedure activations. Whenever a procedure is called, execution of an activation gets interrupted, and information about the machine state (like register values) is stored on the stack. When the called procedure returns, the interrupted activation can be restarted after restoring the saved machine state. The heap may be used to store dynamically allocated data objects, and also other stuff such as activation information (in the case of languages where an activation tree cannot be used to represent lifetimes). Both the stack and the heap change in size during program execution, so they cannot be allocated a fixed amount of space. Generally they start from opposite ends of the memory and can grow as required, towards each other, until the space available has filled up.
Activation Record
. temporaries: used in expression evaluation
. local data: field for local data
. saved machine status: holds info about machine status before procedure call
. access link : to access non local data
. control link : points to activation record of caller
. actual parameters: field to hold actual parameters
. returned value : field for holding value to be returned

The activation record is used to store the information required by a single procedure call. Not all the fields shown in the figure may be needed for all languages. The record structure can be modified as per the language/compiler requirements. For Pascal and C, the activation record is generally stored on the run-time stack during the period when the procedure is executing. Of the fields shown in the figure, access link and control link are optional (e.g. Fortran doesn't need access links). Also, actual parameters and return values are often stored in registers instead of the activation record, for greater efficiency. The activation record for a procedure call is generated by the compiler. Generally, all field sizes can be determined at compile time. However, this is not possible in the case of a procedure which has a local array whose size depends on a parameter. The strategies used for storage allocation in such cases will be discussed in the coming slides.
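Purely as an illustration (a real compiler lays the record out as a frame on the run-time stack, not as a C struct, and the field names here are assumptions), the fields listed above could be pictured as:
struct activation_record {
    char *actual_params;          /* actual parameters supplied by the caller       */
    char *returned_value;         /* space for the value returned to the caller     */
    void *control_link;           /* points to the activation record of the caller  */
    void *access_link;            /* used to reach non-local data (optional)        */
    void *saved_machine_status;   /* saved registers, return address, ...           */
    char *local_data;             /* storage for the procedure's local names        */
    char *temporaries;            /* temporaries used in expression evaluation      */
};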
Storage Allocation Strategies
. Static allocation: lays out storage at compile time for all data objects
. Stack allocation: manages the runtime storage as a stack
. Heap allocation :allocates and de-allocates storage as needed at runtime from heap
These represent the different storage-allocation strategies used in the distinct parts of the run-time memory organization (as shown in slide 8). We will now look at the possibility of using these strategies to allocate memory for activation records. Different languages use different strategies for this purpose. For example, old FORTRAN used static allocation, Algol type languages use stack allocation, and LISP type languages use heap allocation.
Static allocation
. Names are bound to storage as the program is compiled
. No runtime support is required
. Bindings do not change at run time
. On every invocation of procedure names are bound to the same storage
. Values of local names are retained across activations of a procedure
These are the fundamental characteristics of static allocation. Since name binding occurs during compilation, there is no need for a run-time support package. The retention of local name values across procedure activations means that when control returns to a procedure, the values of the locals are the same as they were when control last left.
Stack Allocation
The figure shows the activation records that are pushed onto and popped off the run-time stack as control flows through the given activation tree. First the procedure sort is activated. readarray's activation is pushed onto the stack when control reaches the first line in sort; after control returns from readarray, its activation is popped. In the activation of sort, control then reaches a call of qsort with actuals 1 and 9, and an activation of qsort(1,9) is pushed onto the top of the stack. In the last stage, the activations for partition(1,3) and qsort(1,0) have begun and ended during the lifetime of qsort(1,3), so their activation records have come and gone from the stack, leaving the activation record for qsort(1,3) on top.
Calling Sequence
. A call sequence allocates an activation record and enters information into its fields
A call sequence allocates an activation record and enters information into its fields. A return sequence restores the state of the machine so that the calling sequence can continue execution. Calling sequence and activation records differ, even for the same language. The code in the calling sequence is often divided between the calling procedure and the procedure it calls. There is no exact division of runtime tasks between the caller and the callee. As shown in the figure, the register stack top points to the end of the machine status field in the activation record. This position is known to the caller, so it can be made responsible for setting up stack top before control flows to the called procedure. The code for the callee can access its temporaries and the local data using offsets from stack top.
Heap Allocation
. Stack allocation cannot be used if:
- The values of the local variables must be retained when an activation ends
- A called activation outlives the caller
. In such a case de-allocation of activation record cannot occur in last-in first-out fashion
. Heap allocation gives out pieces of contiguous storage for activation records
There are two aspects of dynamic allocation:
. Runtime allocation and de-allocation of data structures.
. Languages like Algol have dynamic data structures and reserve some part of memory for them.
If a procedure wants to keep a value that is to be used after its activation is over, then we cannot use the stack for that purpose; languages like Pascal therefore allow data to be allocated under program control. Also, in certain languages a called activation may outlive the caller. In such cases the last-in first-out discipline of the stack will not work, and we will require a data structure like a heap to store the activations. This last case does not arise in languages whose activation trees correctly depict the flow of control between procedures.
. Pieces may be de-allocated in any order
. Over time the heap will consist of alternate areas that are free and in use
. Heap manager is supposed to make use of the free space
. For efficiency reasons it may be helpful to handle small activations as a special case
. For each size of interest keep a linked list of free blocks of that size
Initializing data structures may require allocating memory, which raises the question of where to allocate it. After type inference we have to do storage allocation, which hands out chunks of bytes; a language like Lisp will try to give a contiguous chunk. Repeated allocation and de-allocation leads to fragmentation: holes develop, leaving the heap with alternating free and in-use areas, which makes it hard to hand a contiguous chunk of memory to a requesting program. So we have heap managers which manage the free space and the allocation and de-allocation of memory. It is efficient to handle small activations and activations of predictable size as a special case, as described in the next slide. The various allocation and de-allocation techniques used will be discussed later.
Access to non-local names
. Scope rules determine the treatment of non-local names
. A common rule is lexical scoping or static scoping (most languages use lexical scoping)
The scope rules of a language decide how to reference the non-local variables. There are two methods that are commonly used:
1. Static or Lexical scoping: It determines the declaration that applies to a name by examining the program text alone. E.g., Pascal, C and ADA.
2. Dynamic Scoping: It determines the declaration applicable to a name at run time, by considering the current activations. E.g., Lisp
Block
. Blocks can be nested
. The property is referred to as block structured
. Scope of the declaration is given by most closely nested rule
- The scope of a declaration in block B includes B
- If a name X is not declared in B, then an occurrence of X in B is in the scope of a declaration of X in a block B' such that
. B' has a declaration of X
. B' is most closely nested around B
A block contains its own local data structures. Blocks can be nested, and their beginnings and ends are marked by delimiters. Two blocks are either disjoint or one is nested inside the other; it is not possible for two blocks B1 and B2 to overlap in such a way that B1 begins, then B2 begins, but B1 ends before B2. This nesting property is called block structure. The scope of a declaration in a block-structured language is given by the most closely nested rule: 1. The scope of a declaration in a block B includes B. 2. If a name X is not declared in a block B, then an occurrence of X in B is in the scope of a declaration of X in an enclosing block B' such that B' has a declaration of X, and B' is more closely nested around B than any other block with a declaration of X.
There are two methods of implementing block structure:
1. Stack Allocation: This is based on the observation that the scope of a declaration does not extend outside the block in which it appears, so the space for a declared name can be allocated when the block is entered and de-allocated when control leaves the block. This view treats a block as a "parameterless procedure" called only from the point just before the block and returning only to the point just after the block.
2. Complete Allocation: Here you allocate the complete memory at one time. If there are blocks within the procedure, then allowance is made for the storage needed for declarations within the blocks. If two variables are never alive at the same time and are at the same depth, they can be assigned the same storage.
Lexical scope without nested procedures
. A procedure definition cannot occur within another
. Therefore, all non local references are global and can be allocated at compile time
. Any name non-local to one procedure is non-local to all procedures
. In absence of nested procedures use stack allocation
. Storage for non locals is allocated statically
. A non local name must be local to the top of the stack
. This scheme (stack allocation for locals, static allocation for non-locals) has advantages:
- Non locals have static allocations
- Procedures can be passed/returned as parameters
In languages like C, nested procedures are not allowed; that is, you cannot define a procedure inside another procedure. So if there is a non-local reference to a name in some function, then that variable must be a global variable. The scope of a global variable holds within all the functions except those in which the variable has been re-declared. Storage for all names declared globally can be allocated statically, so their positions are known at compile time. Locals are handled by stack allocation: any other name must be a local of the activation at the top of the stack, accessible through the top pointer. Nested procedures would cause this scheme to fail, because a non-local might then refer to a local of a parent procedure which may be buried deep in the stack and not at the top of the stack. An important benefit of static allocation for non-locals is that declared procedures can freely be passed as parameters and returned as results (a function is passed in C by passing a pointer to it).
. add 1 to the depth as we go from an enclosing to an enclosed procedure
Access to non-local names
. Include a field 'access link' in the activation record
. If p is nested in q, then the access link of p points to the access link in the most recent activation of q
Nesting Depth : The notion of nesting depth is used to implement lexical scope. The main program is assumed to be at nesting depth 1 and we add 1 to the nesting depth as we go from an enclosing to an enclosed procedure.
Access Links: To implement lexical scope for nested procedures we add a pointer called an access link to each activation record. If a procedure p is nested immediately within q in the source text, then the access link in an activation record for p points to the access link in the record for the most recent activation of q.
Displays
. Faster access to non locals
. Uses an array of pointers to activation records
. Non-locals at depth i are in the activation record pointed to by d[i]



Faster access to non locals than with access links can be obtained using an array d of pointers to activation records, called a display. We maintain display so that storage for a non local a at nesting depth i is in the activation record pointed to by display element d[i].
The display changes when a new activation occurs, and it must be reset when control returns from the new activation. When a new activation record for a procedure at nesting depth i is set up, we first save the value of d[i] in the new activation record and then set d[i] to point to the new activation record. Just before an activation ends, d[i] is reset to the saved value. A small sketch of this protocol follows.
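A minimal sketch of the save/restore protocol, assuming a fixed maximum nesting depth and hypothetical enter/exit routines (illustrative only):
#define MAXDEPTH 32                /* assumed maximum nesting depth */

static void *d[MAXDEPTH];          /* d[i]: activation record at nesting depth i */

/* On entry to a procedure at depth i: save the old d[i] inside the new
   activation record, then make d[i] point to the new record.           */
void display_enter(int i, void *new_record, void **saved_slot_in_record)
{
    *saved_slot_in_record = d[i];
    d[i] = new_record;
}

/* Just before the activation ends: restore the saved value. */
void display_exit(int i, void *saved_value)
{
    d[i] = saved_value;
}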
Dynamic Scope
. Bindings of non-local names to storage do not change when a new activation is set up
. A non local name a in the called activation refers to same storage that it did in the calling activation
In dynamic scope, a new activation inherits the existing bindings of non-local names to storage. A non-local name a in the called activation refers to the same storage that it did in the calling activation. New bindings are set up for the local names of the called procedure; these names refer to storage in the new activation record.
Dynamic Scope
. Deep Access
- Dispense with access links
- use control links to search into the stack
- term deep access comes from the fact that search may go deep into the stack
. Shallow Access
- hold current value of each name in static memory
- when a new activation of p occurs a local name n in p takes over the storage for n
- previous value of n is saved in the activation record of p
We will discuss two approaches to implement dynamic scope. They bear resemblance to the use of access links and displays, respectively, in the implementation of the lexical scope.
1. Deep Access : Dynamic scope results if access links point to the same activation records that control links do. A simple implementation is to dispense with access links and use control links to search into the stack, looking for the first activation record containing storage for the non- local name. The term deep access comes from the fact that search may go deep into the stack. The depth to which the search may go depends on the input of the program and cannot be determined at compile time.
2. Shallow Access : Here the idea is to hold the current value of each name in static memory. When a new activation of a procedure p occurs, a local name n in p takes over the storage for n. The previous value of n is saved in the activation record for p and is restored when the activation of p ends.
Parameter Passing
. Call by value
- actual parameters are evaluated and their rvalues are passed to the called procedure
- used in Pascal and C
- formal is treated just like a local name
- caller evaluates the actual parameters and places rvalue in the storage for formals
- call has no effect on the activation record of caller
This is, in a sense, the simplest possible method of passing parameters. The actual parameters are evaluated and their r-values are passed to the called procedure. Call-by-value is used in C, and Pascal parameters are usually passed this way. Call-by-Value can be implemented as follows:
1. A formal parameter is treated just like a local name, so the storage for the formals is in the activation record of the called procedure.
2. The caller evaluates the actual parameters and places their r-values in the storage for the formals. A distinguishing feature of call-by-value is that operations on the formal parameters do not affect values in the activation record of the caller.
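A small C example of this behaviour (the function and variable names are arbitrary):
#include <stdio.h>

void increment(int x)          /* x is a local copy of the actual parameter's r-value */
{
    x = x + 1;                 /* changes only the copy */
}

int main(void)
{
    int a = 5;
    increment(a);
    printf("%d\n", a);         /* prints 5: the call had no effect on the caller's a */
    return 0;
}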
Parameter Passing .
. Call by reference (call by address)
- the caller passes a pointer to each location of actual parameters
- if actual parameter is a name then lvalue is passed
- if actual parameter is an expression then it is evaluated in a new location and the address of that location is passed
When the parameters are passed by reference (also known as call-by-address or call-by location), the caller passes to the called procedure a pointer to the storage address of each actual parameter.
1. If an actual parameter is a name or an expression having an l-value, then that l-value itself is passed.
2. However, if the actual parameter is an expression, like a + b or 2, that has no l-value, then the expression is evaluated in a new location, and the address of that location is passed.
A reference to a formal parameter in the called procedure becomes, in the target code, an indirect reference through the pointer passed to the called procedure.
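A small C example, where the pointer passed by the caller carries the l-value of the actual parameter (names are arbitrary):
#include <stdio.h>

void increment(int *x)         /* receives the address (l-value) of the actual parameter */
{
    *x = *x + 1;               /* indirect reference through the passed pointer */
}

int main(void)
{
    int a = 5;
    increment(&a);
    printf("%d\n", a);         /* prints 6: the caller's variable was changed */
    return 0;
}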
Parameter Passing .
. Copy restore (copy-in copy-out, call by value result)
- actual parameters are evaluated, rvalues are passed by call by value, lvalues are determined before the call
- when control returns, the current rvalues of the formals are copied back into the lvalues of the actuals
This is a hybrid form between call-by-value and call-by-reference (also known as copy-in copy-out or value-result).
1. Before control flows to the called procedure, the actual parameters are evaluated. The r-values of the actuals are passed to the called procedure as in call-by-value. In addition, however, the l-values of those actual parameters having l-values are determined before the call.
2. When the control returns, the current r-values of the formal parameters are copied back into the l-values of the actuals, using the l-values computed before the call. Only the actuals having l-values are copied.
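Since C itself only provides call by value, the following sketch merely simulates copy-in copy-out by hand to show the effect; the names are hypothetical:
#include <stdio.h>

void double_value(int *formal)     /* the body works on the copied-in value */
{
    *formal = *formal * 2;
}

int main(void)
{
    int actual = 7;
    int copy = actual;             /* copy in: the r-value is passed as in call by value      */
    double_value(&copy);
    actual = copy;                 /* copy out: result written back to the actual's l-value   */
    printf("%d\n", actual);        /* prints 14 */
    return 0;
}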
Parameter Passing .
. Call by name (used in Algol)
- names are copied
- local names are different from names of calling procedure
Example: the call swap(i, a[i]) expands by the copy rule into
temp = i
i = a[i]
a[i] = temp
This is defined by the copy-rule as used in Algol.
1. The procedure is treated as if it were a macro; that is, its body is substituted for the call in the caller, with the actual parameters literally substituted for the formals. Such a literal substitution is called macro-expansion or inline expansion.
2. The local names of the called procedure are kept distinct from the names of the calling procedure. We can think of each local of the called procedure being systematically renamed into a distinct new name before macro-expansion is done.
3. The actual parameters are surrounded by parentheses if necessary to preserve their integrity.
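A C macro gives a rough feel for this textual substitution; this is only an analogy, since C has no call by name:
/* C has no call by name; a macro only mimics the textual substitution. */
#define SWAP(x, y) { int temp = (x); (x) = (y); (y) = temp; }

/* SWAP(i, a[i]) expands to:
       { int temp = i; i = a[i]; a[i] = temp; }
   After i = a[i], the name a[i] denotes a different array element, so the
   element originally referred to is never updated -- the classic failure
   of swap under call by name.                                            */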
Language Facility for Dynamic Storage Allocation
. Storage is usually taken from heap
. Allocated data is retained until deallocated
. Allocation can be either explicit or implicit
- Pascal : explicit allocation and de-allocation by new() and dispose()
- Lisp : implicit allocation when cons is used, and de- allocation through garbage collection
Storage allocation for statically scoped local names is usually done on the stack, as this is a convenient way to take care of the normal scoping rules, where the most recent values have to be considered, and when the scope ends, their values have to be removed. But for dynamic allocation no such prior information regarding the use of the variables is available, so we need the maximum possible flexibility; for this a heap is used. For more efficient utilization of memory, the stack grows downwards and the heap grows upwards, starting from different ends of the available memory. This makes sure that all available memory is utilized. Pascal allows explicit allocation and de-allocation of memory, using the new() and dispose() procedures. In Lisp, by contrast, checking for free memory is done continuously: when less than 20 percent of the memory is free, garbage collection is performed. In garbage collection, cells that can no longer be accessed are de-allocated. (Storage that has been allocated but can no longer be accessed is called 'garbage'.)
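As a C analogue (the notes use Pascal and Lisp; here malloc and free simply stand in for new() and dispose()):
#include <stdlib.h>

struct node { int value; struct node *next; };

int main(void)
{
    struct node *p = malloc(sizeof *p);   /* explicit allocation from the heap         */
    p->value = 42;
    p->next = NULL;
    /* p's storage is retained until it is explicitly de-allocated */
    free(p);                              /* explicit de-allocation                    */
    return 0;
}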
Essay type questions:
1. Explain indirection in symbol table entries with an example.

2. Explain self-organizing symbol tables with an example.
3. Draw and explain the symbol table organization for the C language with a program block.
4. Explain the importance of the following attributes of a symbol table.
(a) Variable name
(b) Object time address
5. What is heap storage allocation? Explain in detail.
6. Compare three different storage allocation strategies.
7. Write an algorithm to perform the table lookup and insertion operations for a hashed symbol table.
8. Explain about implicit and explicit storage requests.

9. Explain the importance of the following attributes of a symbol table.
(a) Type of a symbol
(b) Link field.
10. What is an ordered and unordered symbol table? What is the function of
the symbol table in the compilation process? Explain.
Short answer questions:
1. Draw the symbol table format.
2. What is a block-structured language?
3. What is a non-block-structured language?
4. Explain the symbol table organization for block-structured languages.
5. What is hashing?
6. Write the structure of the tree representation of scope information.
7. Write the differences between block-structured and non-block-structured languages.
8. Give examples of block-structured and non-block-structured languages.
9. What is storage allocation?
10. What are the storage allocation strategies?
11. Write the various storage organizations.
12. What is scope information?
UNIT VI
CODE OPTIMIZATION: Consideration for Optimization, Scope of Optimization, local
optimization, loop optimization, frequency reduction, folding, DAG representation.

Essay type questions:
1. Explain the different principal sources of optimization with suitable examples.

2. What is code optimization? What are its advantages?
3. What is a DAG? Construct the DAG for the following basic block:
D := B_C
E := A+B
B := B+C
A := E-D

4. What is a DAG. Explain its applications.
5. What are the problems in optimizing compiler design?

6. Write the three-address code for the following code.
begin
    PROD := 0;
    I := 1;
    do
    begin
        PROD := PROD + A[I]*B[I];
        I := I+1;
    end
    while I <= 20
end
7. Write an algorithm for partition of basic blocks.
8. Explain briefly about folding.

9. What are the legal evaluation orders and names for the values at the nodes
for the DAG of problem (a).
i. Assuming A, B and C are alive at the end of the basic block?
ii. Assuming only A is live at the end?
10. Explain in detail the Optimization technique “strength reduction”.
Short answer questions:
1. What are the considerations for optimization?
2. What is the scope of optimization?
3. What are the local optimization techniques?
4. What are the loop optimization techniques?
5. What is a DAG?
6. What is frequency reduction?
7. What is reduction in strength?
8. What is constant folding?
9. What is copy propagation?
10. What is call by value?
11. What is copy-in copy-out?
12. What is call by name?
13. Define optimization.
14. What is the role of optimization?
15. How does optimization affect the generated machine code?

UNIT VII
DATA FLOW ANALYSIS: Flow graph, data flow equation, global optimization, redundant sub-expression elimination, Induction variable elimination, Live variable analysis, Copy propagation

Essay type questions:
1. Write and explain live variable analysis algorithm.

2. Explain reducible and non-reducible flow graphs with an example.

3. Explain the use of algebraic transformations with an example


4. Explain natural loops and inner loops of a flow graph with an example.

5. What are the applications of du and ud chains.

6. Explain in detail the procedure that eliminate global common sub - expression.

7. Generate the flow-graphs for the following expressions:
S -> id := E | S ; S | if E then S else S | do S while E
E -> id + id | id

8. What is an Induction variable? Explain with an example.
9. Consider the following program, which counts the primes from 2 to n using the sieve
method on a suitably large array:
begin
read n
for i := 2 to n do
    a[i] := true /* initialize */
count := 0;
for i := 2 to n ** .5 do
    if a[i] then /* i is a prime */
    begin
        count := count + 1;
        for j := 2 * i to n by i do
            a[j] := false /* j is divisible by i */
    end;
print count
end
(a) Propagate out copy statements wherever possible.
(b) Is loop jamming possible? If so, do it.
(c) Eliminate the induction variables wherever possible.

10. Discuss how induction variables can be detected and how transformations can be applied.
Short answer questions:
1. What is a flow graph?
2. Define copy propagation.
3. What is a data flow equation?
4. What are the global optimization techniques?
5. How are redundant sub-expressions eliminated?
6. What is an induction variable?
7. What is a live variable?
8. What is the use of copy propagation?
9. What is the role of data flow equations?
10. What is the use of global optimization?


UNIT VIII
OBJECT CODE GENERATION: Object code forms, machine-dependent code optimization,
register allocation and assignment, generic code generation algorithms, DAG for register
allocation.

Essay type questions:
1. Explain the different issues in the design of a code generator.

2. Explain the concept of object code forms.

3. Generate optimal machine code for the following C program.
main()
{
int p,q[15];
while (p<=15) q[p] = 1;
}

4. State and explain different machine dependent code optimization techniques.

5. Describe how addressing modes can be used for reducing the memory access time.

6. Generate code for the following C statements:
i. s= f(a) + f(a) + f(a) ii. s= f(a) /g(b,c) iii. s= f(f(a)) iv. s= ++f(a)


7. Generate the code sequence using Code generation algorithm for the following
expression W:=(A-B)+(A-C)+(A-C)

8. Describe various register allocation optimization techniques with an example.

9. Explain the concept of a labeled tree in code generation.

10. What are the various addressing modes available? Give some example machine
instructions which reduces memory access time.

Short answer questions:
1. What is the scope of object code forms?
2. What is object code?
3. What is machine-dependent code?
4. What is machine-independent code?
5. What is a directed acyclic graph (DAG)?
6. What is the role of a DAG in object code generation?
7. What is register allocation?
8. What is the use of the generic code generation algorithm?
9. Explain the use of a DAG for register allocation.
10. What is a labeled tree?