Course Overview
Overall Compiler Structure
Language Theory and Compilers
(言語理論とコンパイラ)
1st lecture, April 12, 2019
http://www.sw.it.aoyama.ac.jp/2019/Compiler/lecture1.html
Martin J. Dürst
duerst@it.aoyama.ac.jp, Building O, Room 529
© 2006-19 Martin J. Dürst, Aoyama Gakuin University
Today's Schedule
- Self-introduction
- About this course
- Difficulty of processing input
- Compiler structure
- Lexical analysis and parsing
- Example of an automaton
- Example of a formal language grammar
Self-Introduction
TA: Kōsuke Kurihara (栗原 光祐, first-year master's student)
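Position of this Course
- 3rd year, first semester
- Second course group (elective-compulsory, EJ course)
- Courses this course builds on:
- Introduction to Computers
- Mathematics for Information Science
- Computer Practice (programming)
Course Goals
- Understanding the relationship between (language) theory and applications
- Applications using tools
- Compilers
- Document processing
- Analysis of input
- Design of data formats
- Design and implementation of small languages
(e.g. domain-specific languages)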
The Importance of Compilers
Blog by Steve Yegge
Executive Summary:
- If you don't know how compilers work, then you don't know how computers
work.
- If you're not 100% sure whether you know how compilers work, then you
don't know how they work.
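How this Course Proceeds
- Sit toward the front
- Handouts are distributed in class and published on the Web
- Writing notes on the handouts is important
- Some exercises are done on notebook computers
- Homework, reports, mini-tests, final exam
- Missing class is a big loss
Grading
(approximate)
- In-class mini-tests: 20%
- Exercises: 30%
- Final exam: 50%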
Course Schedule
Schedule (tentative)
- April 19 (next week): No lecture because of conference
- There are no official makeup class days
- It would be good to make up the April 19 lecture early (May)
- Let's select a day and time
Books/References
(The course covers both language theory and compilers, but each reference book concentrates on one of the two)
Course Contents
| | Theory | Compilers | Other applications |
| Front end | language theory, automata | lexical analysis, parsing | regular expressions, text/data formats |
| Back end | | optimization, code generation | |
- Concentrate on input, not output
- Use various representations from theory to application
Example of Difference between Input and Output
(Computer Practice I, problems 04A1 and 04C1)
| Character itself (internal representation) | Escaping in HTML/XML (external representation) |
| ' | &apos; |
| " | &quot; |
| < | &lt; |
| > | &gt; |
| & | &amp; |
Which direction is more difficult?
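To make the question concrete, here is a minimal Python sketch of both directions, restricted to the five characters in the table. The names `escape` and `unescape` are illustrative and not taken from the course materials, and treating an unknown or unterminated reference as an error is an assumption of this sketch. Output needs one fixed replacement per character; input has to find where each reference ends and decide what to do with malformed text.

```python
import re

# The five characters from the table and their HTML/XML escapes.
ESCAPES = {"&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&apos;"}
UNESCAPES = {reference: char for char, reference in ESCAPES.items()}


def escape(text):
    """Output direction: replace each special character, one at a time."""
    return "".join(ESCAPES.get(ch, ch) for ch in text)


def unescape(text):
    """Input direction: find each reference and check that it is known."""
    def replace(match):
        reference = match.group(0)
        if reference not in UNESCAPES:
            # Assumption of this sketch: malformed input is rejected.
            raise ValueError(f"unknown or unterminated reference: {reference!r}")
        return UNESCAPES[reference]
    # An ampersand starts a reference; a well-formed one ends with a semicolon.
    return re.sub(r"&[A-Za-z]*;?", replace, text)


print(escape("if (a < 5 && b > 3)"))   # if (a &lt; 5 &amp;&amp; b &gt; 3)
```

Escaping cannot fail, whereas `unescape("a & b")` raises an error in this sketch because the bare ampersand is not a complete reference.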
Difficulties for Input
- Not structured (just a sequence of bytes/characters)
- Anything goes (including errors)
- Deciding whether some input is correct or not is a model for computation
in general
- Deciding whether input is correct is equivalent to recognition
The Function of a Compiler
Bridge between software and hardware
- Input: Program that can be understood by humans
- Language: High-level program language
- Medium: source (file/program)
- Output: Program that can be understood by a machine
- Language: assembly language, machine language
- Medium: object code, machine code
Example Compiler Input/Output
Input fragment:
sum += price * 25;
Output (assembly language):
LOAD R1, price ; load from price into R1 (register 1)
CONST R2, 25 ; put constant 25 into R2 (register 2)
MUL R1, R1, R2 ; put the product of R1 and R2 into R1
LOAD R2, sum ; load from sum into R2
ADD R2, R1, R2 ; put the sum of R1 and R2 into R2
STORE sum, R2 ; store the contents of R2 into sum
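As a check on the listing above, the following Python sketch simulates the six instructions. The meaning of each instruction is taken from the comments in the listing; the function `run`, the tuple encoding of the instructions, and the starting values (sum = 100, price = 3) are assumptions made for illustration only.

```python
def run(program, memory):
    """Execute a list of (opcode, operand, ...) tuples and return the memory."""
    registers = {}
    for op, *args in program:
        if op == "LOAD":        # LOAD Rd, var: copy a variable into Rd
            rd, var = args
            registers[rd] = memory[var]
        elif op == "CONST":     # CONST Rd, n: put the constant n into Rd
            rd, value = args
            registers[rd] = value
        elif op == "MUL":       # MUL Rd, Ra, Rb: Rd = Ra * Rb
            rd, ra, rb = args
            registers[rd] = registers[ra] * registers[rb]
        elif op == "ADD":       # ADD Rd, Ra, Rb: Rd = Ra + Rb
            rd, ra, rb = args
            registers[rd] = registers[ra] + registers[rb]
        elif op == "STORE":     # STORE var, Rs: copy Rs into a variable
            var, rs = args
            memory[var] = registers[rs]
        else:
            raise ValueError(f"unknown instruction: {op}")
    return memory


# The compiler output for: sum += price * 25;
PROGRAM = [
    ("LOAD", "R1", "price"),
    ("CONST", "R2", 25),
    ("MUL", "R1", "R1", "R2"),
    ("LOAD", "R2", "sum"),
    ("ADD", "R2", "R1", "R2"),
    ("STORE", "sum", "R2"),
]

result = run(PROGRAM, {"sum": 100, "price": 3})
print(result["sum"])   # 175, i.e. 100 + 3 * 25
```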
Logical Structure of a Compiler
- [preprocessor]
1. Lexical analysis
2. Parsing (syntax analysis)
3. Semantic analysis
4. Optimization (or 5.)
5. Code generation (or 4.)
- [assembler]
- [linker, loader]
Compiler Types and Related Software
- One pass compiler
- X-pass compiler (x between 1 and 70, the upper end being IBM's PL/1
compiler in the 1970s)
- Cross-compiler (e.g. compiling on PC for smartphone)
- Dynamic/just-in-time (JIT) compiler
- Preprocessor (runs before the compiler)
- Interpreter (e.g. for Ruby)
Example of Lexical Analysis
Fragment of input program (sequence of characters):
| s | u | m |   | + | = |   | p | r | i | c | e |   | * |   | 2 | 5 | ; | \n |
Corresponding output (sequence of tokens):
id("sum"), plusequal, id("price"), asterisk, int(25), semicolon
Overview of Lexical Analysis
- Convert a sequence of characters to a sequence of tokens
- Several characters become one token
- Similar to finding words in a natural language text
- Tokens can have attributes
Examples: id (identifier, with name), int (integer, with value)
- Many token names are derived from the symbol shape
(example: * is asterisk, not mult, because it can also be
used for pointers)
- White space is used during lexical analysis, but then discarded
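The conversion described above can be sketched in Python with the standard `re` module. This is a minimal sketch that only knows the token kinds needed for the example fragment; the names `TOKEN_SPEC` and `tokenize` and the `(name, attribute)` tuple representation are assumptions for illustration, not part of the course materials.

```python
import re

# Token patterns, tried in order. Token names follow the symbol shape.
TOKEN_SPEC = [
    ("int", r"\d+"),
    ("id", r"[A-Za-z_]\w*"),
    ("plusequal", r"\+="),
    ("asterisk", r"\*"),
    ("semicolon", r";"),
    ("space", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Convert a string of characters into a list of (name, attribute) tokens."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at position {pos}")
        name, text = match.lastgroup, match.group()
        if name == "int":
            tokens.append((name, int(text)))   # attribute: the integer value
        elif name == "id":
            tokens.append((name, text))        # attribute: the identifier name
        elif name != "space":                  # white space is matched, then dropped
            tokens.append((name, None))
        pos = match.end()
    return tokens


print(tokenize("sum += price * 25;\n"))
# [('id', 'sum'), ('plusequal', None), ('id', 'price'), ('asterisk', None), ('int', 25), ('semicolon', None)]
```

Several characters (`s`, `u`, `m`) become one token, and the white space between tokens leaves no trace in the result.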
Example of Parsing
Input (sequence of tokens):
id("sum"), plusequal, id("price"), asterisk, int(25), semicolon
Corresponding output (syntax tree):
Details of Syntax Trees
- Tokens become nodes in the syntax tree
- For expressions:
- Operators become parents (and represent the operation and its
result)
- Operands (including subexpressions) become children
- The structure of the tree shows priority and associativity of
operations
(operations at the bottom are evaluated first)
- Abstract constructs (e.g. statement,...) are added as parents
- Parentheses and other separators (",", ";", ...) are used during parsing,
but then discarded
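The points above can be illustrated with a small recursive-descent parser in Python. This is a sketch under assumptions, not the method used in the course: the grammar (a single `+=` statement whose right-hand side uses `+`, `*` and parentheses), the token names `plus`, `lparen` and `rparen`, the function name `parse_statement`, and the nested-tuple tree representation are all chosen for illustration. Tokens are `(name, attribute)` pairs.

```python
def parse_statement(tokens):
    """Parse 'id plusequal expression semicolon' into a nested-tuple syntax tree."""
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else "end"

    def take(expected):
        nonlocal pos
        if peek() != expected:
            raise SyntaxError(f"expected {expected}, found {peek()}")
        token = tokens[pos]
        pos += 1
        return token

    def factor():
        if peek() == "lparen":
            take("lparen")          # parentheses guide the parser ...
            tree = expression()
            take("rparen")          # ... but leave no node in the tree
            return tree
        if peek() == "int":
            return take("int")      # leaf: ("int", value)
        return take("id")           # leaf: ("id", name)

    def term():                     # asterisk binds tighter than plus
        tree = factor()
        while peek() == "asterisk":
            take("asterisk")
            tree = ("asterisk", tree, factor())   # operator becomes the parent
        return tree

    def expression():
        tree = term()
        while peek() == "plus":
            take("plus")
            tree = ("plus", tree, term())
        return tree

    target = take("id")
    take("plusequal")
    value = expression()
    take("semicolon")               # separator: checked, then discarded
    if peek() != "end":
        raise SyntaxError(f"unexpected {peek()} after statement")
    return ("statement", ("plusequal", target, value))


TOKENS = [("id", "sum"), ("plusequal", None), ("id", "price"),
          ("asterisk", None), ("int", 25), ("semicolon", None)]
print(parse_statement(TOKENS))
# ('statement', ('plusequal', ('id', 'sum'), ('asterisk', ('id', 'price'), ('int', 25))))
```

In the printed tree, the multiplication sits below the `+=` node, so it is evaluated first, and the semicolon no longer appears.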
More Examples
price = pretax / 100 * (108 - discount);
if (a > 5)
b = 15;
Example of an Automaton
Very simple automatic vending machine:
- Input: 50 Yen coins
- Output: Water bottle (price: 150 Yen)
State transition diagram:
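One possible reading of this machine, written as a transition table in Python. The sketch assumes the simplest design consistent with the description: the state is the amount inserted so far (0, 50 or 100 Yen), every input is one 50 Yen coin, and the third coin dispenses a bottle and returns the machine to the start state. The names `TRANSITIONS` and `vend` are illustrative.

```python
# State = amount inserted so far (in Yen).
# Each entry maps a state to (next state, output) for one 50 Yen coin.
TRANSITIONS = {
    0: (50, None),
    50: (100, None),
    100: (0, "water bottle"),   # third coin reaches 150 Yen: dispense and reset
}


def vend(number_of_coins, state=0):
    """Feed 50 Yen coins to the machine; return (final state, list of outputs)."""
    outputs = []
    for _ in range(number_of_coins):
        state, output = TRANSITIONS[state]
        if output is not None:
            outputs.append(output)
    return state, outputs


print(vend(3))   # (0, ['water bottle'])
print(vend(5))   # (100, ['water bottle'])
```

The machine needs no memory beyond its current state, which is what makes it a finite automaton.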
Language and Grammar
- Language is defined on two layers:
- Structure (syntax)
- Meaning (semantics)
- Grammar defines (restricts) the structure of a language
- Example (simple imperative sentences):
eat bread, read books, play music
- Grammar for imperative sentences:
ImperativeSentence → Verb Noun
- The grammar allows sentences such as
"read bread" or "play books",
but semantically, they do not make sense
Grammar of Imperative Sentences
ImperativeSentence → Verb Noun
Verb → eat
Verb → read
Verb → play
Noun → bread
Noun → music
Noun → books
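Because this grammar is finite, the language it defines can be enumerated completely. The Python sketch below stores the rules in a dictionary, generates every sentence, and uses the result to decide whether a given text belongs to the language. The names `GRAMMAR`, `generate` and `is_sentence` are assumptions for illustration; enumeration only works for grammars without recursion, such as this one.

```python
from itertools import product

# The grammar rules: nonterminal -> list of right-hand sides.
GRAMMAR = {
    "ImperativeSentence": [["Verb", "Noun"]],
    "Verb": [["eat"], ["read"], ["play"]],
    "Noun": [["bread"], ["music"], ["books"]],
}


def generate(symbol):
    """Return every word sequence derivable from symbol."""
    if symbol not in GRAMMAR:              # terminal symbol: a word of the language
        return [[symbol]]
    sentences = []
    for right_hand_side in GRAMMAR[symbol]:
        parts = [generate(s) for s in right_hand_side]
        for combination in product(*parts):
            sentences.append([word for part in combination for word in part])
    return sentences


LANGUAGE = {" ".join(words) for words in generate("ImperativeSentence")}


def is_sentence(text):
    """Recognition: does text belong to the language defined by the grammar?"""
    return " ".join(text.split()) in LANGUAGE


print(len(LANGUAGE))               # 9
print(is_sentence("read bread"))   # True: fits the grammar, although it makes no sense
print(is_sentence("bread eat"))    # False: wrong structure
```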
Homework
Deadline: April 25, 2019 (Thursday), 19:00
Where to submit: Box in front of room O-529 (building O, 5th floor)
Format: A4 single page (using both sides is okay; NO cover page), easily
readable handwriting (NO printouts), name (kanji and kana) and student number
at the top right
Problem: For the one-line C program fragment below, based on the examples
given in this lecture, write down:
- the result of lexical analysis
- the result of parsing
- the output of the compiler (in assembly language; comments are not
needed; use SUB for subtraction, and DIV for division)
grade = english - absent*5 + math/2;
Glossary
- lexical analysis: 字句解析
- parsing, syntax analysis: 構文解析
- automaton: オートマトン
- formal language: 形式言語
- grammar: 文法
- executive summary: 役員 (時間がない人) のための要約
- front end: フロントエンド
- back end: バックエンド
- optimization: 最適化
- code generation: コード生成
- regular expression: 正規表現
- text format: 文書形式
- data format: データ形式
- internal representation: 内部表現
- external representation: 外部表現
- (e.g. face) recognition: (顔) 認識
- high-level program language: 高級プログラム言語
- source (file/program): ソース (ファイル・プログラム)、原始プログラム
- object code: 目的プログラム
- machine code: 実行プログラム
- assembly language: アセンブリ言語
- register: レジスタ
- preprocessor: プリプロセッサ
- semantic analysis: 意味解析
- assembler: アセンブラ (アセンブリ言語を処理するソフト)
- linker: リンカ
- loader: ローダ
- one pass compiler: ワンパス・コンパイラ
- x-pass compiler: x-パス・コンパイラ
- cross-compiler: クロスコンパイラ
- dynamic/just-in-time (JIT) compiler: 動的コンパイラ
- interpreter: インタプリタ、通訳系
- token: トークン、記号、符
- natural language: 自然言語
- attribute: 属性
- identifier (発音: アイデンティファイア): 識別子
- syntax tree: 構文木
- operator: 演算子
- operand: 被演算子
- expression: 式
- subexpression: 部分式
- statement (of a program): 文
- separators: 区切り記号
- automatic vending machine: 自動販売機
- state transition diagram: 状態遷移図
- structure: 構造
- syntax: 構文
- semantics: 意味 (論)
- imperative sentence: 命令文
- verb: 動詞
- noun: 名詞