Compilers ~ The Dragon Book - 1986
Chapter 1 - Introduction to Compiling
On the general subject of compiling.
Contents
componentsof a compilerenvironmentin which a compiler does its jobtoolsto make it easier to build compilers
1.1 Compilers
compiler - a program that reads a program written in the source language and translates it into an equivalent target program in the target language. The compiler reports errors in the source program as error messages.
- "basic tasks" all compilers perform "are the same"
- Thousands of source languages
- Target languages include
- Another programming language
- Machine language of any computer (microprocessor to supercomputer)
- Some compiler classifications - based on structure or function
- single-pass
- multi-pass
- load-and-go
- debugging
- optimizing
- etc.
- advances in compilers:
- implementation languages
- programming environments
- software tools
The Analysis-Synthesis Model of Compilation
The 2 parts of compilation are analysis and synthesis.
Analysis- Break thesource programinto pieces and create anintermediate representationof thesource program.Synthesis- Construct thetarget programfrom theintermediate representation*requires the most specialized techniques
Analysis
- Operations of source program are determined & recorded in a tree, (hierarchical structure).
- Often a
syntax treenode==operationchildren of node==arguments of operation- in the text diagram, operations can be arguments:
position := initial + rate * 60
- Many
software toolsthat manipulate source programs first perform some kind ofanalysisStructure Editors- text editor + autocomplete + linter + static analysis + etc.- real-time partial analysis
Pretty Printers- structure viewer for humansStatic Checkers- attempt to discover potential bugs without running the program. Similar tooptimizing compilers.- code that cannot be executed in the course of the program
- variable used before defined
- catch logical errors (type checking / use real variable as pointer)
Interpreters- Instead of producingtarget program, perform the operations implied bysource program
Compilers aren't just for machine code
The analysis portion of compiling shows up in these examples:
Text formatters- Format text for output in a textual formatSilicon compilers- Input is a conventional programmingsource language. Variables represent signals or signal groups in switching circuits. Outputs acircuit designin an appropriatetarget languageQuery interpreters- translates a predicate containing relational and boolean operators into commands to search a database for records satisfying the predicate
The Context of a Compiler
Other programs may be needed to create the executable target program.
- A
preprocessormight:- collect parts of the
source programstored in separate files - expand
macrosintosource languagestatements.
- collect parts of the
See fig 1.3 page 4.
- A
target programmay require further processing to run.assembler- translate assembly to machine codelinker- link to library routines
1.2 Analysis of the source program
Introducing analysis and illustrate its use in some text-formatting languages. Also see chapters 2-4, 6.
Three Phases of Analysis
Linear Analysis- AKA:
lexical analysisorscanning - read stream of
source programleft-to-right and group intotokens, streams of characters, with collective meaning
- AKA:
Hierarchical Analysis- AKA:
syntax analysisorparsing - hierarchically group characters/tokens into meaningful nested collections
- AKA:
Semantic Analysis- find semantic errors, identify operands, type checking
- check that components of a program fit together meaningfully
Lexical Analysis
Linear analysis is called lexical analysis or scanning. See example page 5.
Create tokens from the source program stream. Discard separators, e.g. whitespace.
Syntax Analysis
Hierarchical Analsysis is called parsing or syntax analysis.
Group the tokens into grammatical phrases for the compiler to synthesize output. Usually the grammatical phrases are represented by a parse tree or a syntax tree.
parse tree-statements,expressions, andidentifersare all nodes. A precise definition is not given, see fig. 1.4, page 6.syntax tree- Compressed representation of a parse tree. Operators appear as interior nodes and operands are the children of that operator's node. SeeSection 5.2.
There is a good example here, page 6.
expressions, recursive rules, (nonrecursive) basis rules, and definition rules, and statements are described in the example but not introduced as formal terms.
The division between lexical and syntactic analysis is somewhat arbitrary. We usually choose a division that simplifies the overall task of analysis.
Factors for division between lexical and syntactic analysis:
- Is the source language construct recursive or not?
Lexical constructs- do not require recursion.identifiers- strings ofletters+digitsbeginning with a letter- recursion not necessary to recognize
identifiers, typically scan untilnon-(letter+number)and group back to beginning - the grouped characters are recorded in a
symbol tableand removed from the input (so the pointer is back at the beginning) - the next token is then processed.
Syntactic constructsoften require recursion.- linear scan cannot analyze expressions or statements
- cannot match
parenthesesin expressions - cannot match
beginandendin statements - unless there is a
hierarchicalor nestingstructureON the input
Context-free grammars- formalization of recursive rule that can be used to guidesyntactic analysis- See chapter 2 for introduction, chapter 4 for extensive treatment.
Syntax-directed translation
The compiler uses the hierarchical structure on the input to help generate the output.
See: Chapters 2, 5.
Semantic Analysis
- check
source programforsemantic errors - gather
type informationfor subsequentcode-generation phase -
use
hierarchical structuredetermined bysyntax-analysis phaseto identifyoperatorsandoperandsofexpressionsandstatements -
type checking:compilerchecks that eachoperatorhas permittedoperandsbased onsource languagespecification. Must also check for operator coercions.- Part of
semantic analysis, see ch. 6. - example: binary arithmetic operator applied to integer & real, converts integer to real, see example 1.1 and fig. 1.5 inttoreal
- Part of
Analysis in Text Formatters
TEX uses hierarchical arrangement in boxes as part of the syntax analysis to describe the relative position of groups of vertical and horizontal characters. See p.8-9 for fig 1.6-1.7.
1.3 The Phases of a Compiler
- A compiler operates in conceptual
phases. Each transforms thesource programfrom one representation to another. - Some phases may be grouped together. Intermediate representations between grouped phases need not be explicity constructed.
- See fig. 1.9, page 10 for a "typical decomposition of a compiler".
Phases
There are six phases between source program and target program:
- Lexical Analyzer
- Syntax Analyzer
- Semantic Analyzer
- Intermediate Code Generator
- Code Optimizer
- Code Generator
See figure 1.10 , page 13, for a detailed translation of position := initial + rate * 60.
Supplementary Phases (used during every phase)
- Symbol-table manager
- Error handler
Symbol-Table Management
Symbol Table: data structure containing a record for each identifier with fields for the attributes of the identifier. Quickly find, store, and retrieve data for that identifier.- The
identifieris added by thelexical analyzerbut the identifier'sattributesare typically added and consumed during later phases. - An attribute could be something like storage, scope, type. If the identifier is a procedure: perhaps number and type of arguments, method of passing arguments (by reference, etc.), and type returned, if any.
- See chapters: 2, 7 for more about symbol tables.
Error Detection and Reporting
Each phase must encounter and deal with errors, and potentially proceed.
- The
syntax analyzerandsemantic analyzertypically handle a large fraction of errors. - The
lexical analyzermay encounter errors when remaining characters do not form a token. - The
syntax analyserfinds errors where thetoken streamviolates the structure rules. - The
semantic analyzercan find errors where a meaningless operation occurs, e.g. add an array to the name of a procedure. - Further discussion of errors for each phase is included in the part of the book devoted to that phase.
The Analysis Phase
- As
translationprogresses, the compiler's representatin of the source program changes. - note: In this diagram and chapter, the
intermediate representationis considered/proxied as thesyntax treeandsymbol table. The data structure for the tree infig. 1.10is shown infig. 1.11b. Lexical analysisis covered in chapter 3.
Intermediate Code Generation
- Some compilers generate an explicit
intermediate representationof thesource program, "a program for an abstract machine". - Variety of forms.
- Chapter 8 covers principle
intermediate representationsin compilers. Chapters 5, 8 cover algorithms for generating intermedaite code for typical programming language constructs.
Properties of an intermediate representation
- easy to produce
- easy to translate into the
target program
Three-address code
Three-address code consists of a sequence of instructions, each of which has at most three operands.
Assembly language for a machine in which every memory location can act like a register.
Code Optimization
Goal: improve intermediate code.
- Simple example on pages 14-15.
- Chapter 9 discusses simple optimizations.
- Chapter 10 gives technology used by most powerful optimizing compilers.
Code Generation
Goal: generate target code from intermediate code.
- Normally this gives
relocateable machine codeorassembly code. - A crucial aspect is assignment of variables to registers.
- Simple example on pages 15-16.
- Chapter 9 covers code generation.
1.4 Cousins of the Compiler
This section is about a compiler's typical operating context.
Preprocessors
Produce the input to compilers. Everything here seems to boil down to macros.
Macro processing: allow user to define macros as shorhand for longer constructs.File inclusion: include header files into the program text, e.g. inc language:#include <global.h>is replaced by the contents ofglobal.h"Rational" preprocessors: Add capabilities to a language, especially for flow control and data structures. Same as macros functionally?Language Extensions: More macros, different use. See example in book.
Example 1.2 on page 16.
Assemblers
Assembly code is a mnemonic version of machine code, where names are used instead of binary codes for operations. Names are also given to memory addresses.
It is customary for assembly languages to have macro facilities that are similar to those in the macro preprocessors described above.
Two-pass Assembly
Simple assemblers make two passes over the input file, where a pass reads a input file once.
- The first pass finds and stores
identifiers, assigning storage locations in thesymbol table. - The second pass translates
operationcode into the bits representing thatoperationin machine language. It also translates eachidentifierintot he address given for thatidentifierin thesymbol table.
The output of the second pass is usually relocateable machine code, which means it can be loaded starting in any memory location, L, in memory. If L is added to all addresses in the code, all references will be correct.
Thus, the output of the assembler must distinguish those portions of instructions that refer to addresses that can be relocated.
See example 3, page 18-19.
Loaders and Link-editors
A loader usually performs both loading and link-editing.
loading: Taking relocateable machine code, altering the relocateable address, and placing the altered instructions and data in memory at the proper locations.link-editor: Make a single program from several files of relocateable machine code. Many of the files result from separate compiles, and often are stored in separate locations for system-wide use. e.g. libraries.external reference: code in one file references a location in another file. May be data or an entry point to a procedure.- relocateable machine code file must retain the information in the symbol table for:
- each data location defined in one file and used in another
- each entry point of a procedure defined in one file and used in another
- if not known in advance, must include entire symbol table as part of relocateable machine code
- relocateable machine code file must retain the information in the symbol table for:
1.5 The Grouping of Phases
Multiple phases are often grouped together in compiler implementations.
Front and Back Ends
Front end: largely independent of target machine, based onsource language- lexical analysis
- syntactic analysis
- symbol table creation
- semantic analysis
- generate intermediate code
Back end:- code optimization
- code generation
- error handling
- symbol table operations
Fairly routine: keep the front end of a compiler and redo the back end for another machine. Parts of the back end can even be kept. Further discussion in Chapter 9.
It is also tempting to compile several different languages into the same intermediate language and use a common back end for the different front ends, thereby obtaining several compilers for one machine. However, because of subtle differences in the viewpoints of different languages, there has been only limited success in this direction.
Passes
Many phases can be done in one pass. Organizing phases into passes is complex and discussed for some specific implementations in chapter 12.
The whole back end may be one pass, immedately producing intermediate code.
We may think of the syntax analyzer as being 'in charge'. As it discovers grammatical structure, the parser can call intermediate code generation. See chapter 2 for a compiler with this implementation.
Reducing the Number of Passes
Goal: Have relatively few passes.
Goal: Don't use up all the system memory.
Reconciling these could create tension.
The interface between the lexical and syntactic analyzers can often be limited to a single token. It is often very hard to perform code generation until the intermediate representation has been completely generated.
backpatching: mergeintermediate codeandtarget codegeneration can often be merged into one pass by creating a slot and filling it once the information becomes available. See the example, page 21.
To ensure you do not use too much space, you can reduce the distance over which backpatching occurs.
1.6 Compiler-Construction Tools
Chapter eleven discusses software tools for implementing a compiler.
There are specialized tools for implementing various phases of a compiler
These tools are called variously:
compiler-compilerscompiler-generatorstranslator-writing systems
Specialized tools are typically oriented around a model of languages and are suitable for languages following that model.
Tools exist for automatic design of specific compiler components, they use specialized languages for specifying and implementing the component. The most successful hide details of generation algorithm and make components that are easily integrated into the remainder of the compiler.
A list of compiler construction tools
Parser generators: producesyntax analyzersnormally from in put based oncontext-free grammar. Syntax analysis used to be very time consuming but now is "one of the easiest phases to implement". Oftenparser generatorsuseparsing algorithmstoo complex to do by hand.Scanner generators: automatically generatelexical analyzers, normally from a specification based onregular expressions. Thelexical analyzeris effectively afinite automaton. Seechapter 3, implementations insection 3.5, section 3.8.Syntax-directed translation engines: produce collections of routines that walk the parse tree, generatingintermediate code. One or more translations are associated with each node of the parse tree, each translation defined in terms of its neighbor nodes in the tree. See chapter 5.Automatic code generators: take a collection of rules definingintermediate language->target languageand replace the intermediate rules with templates that represent sequences of machine instructions. Something about tiling intermediate code without a combinatorial explosion in compiler running time. See chapter 9.Data-flow engines:data flow analysisis used in optimization. How are values transmitted from one part of a program to each other part? Especially as the user defines relationships between intermediate code statements and the information being gathered. See Section 10.11.
Bibliographic Notes
Lots of parallel discovery, hard to attribute credit. Page 23-24 have references.