From Lexical Analysis to Parsing

(字句解析から構文解析へ)

6th lecture, May 13, 2016

Language Theory and Compilers

http://www.sw.it.aoyama.ac.jp/2016/Compiler/lecture6.html

Martin J. Dürst

Today's Schedule

Minitest
Remainders from last lecture
flex homework
Limitations of regular languages
Differences between lexical analysis and parsing
Context-free grammars
Push-down automata

Minitest

`flex` Homework: Lexical Analysis for C

Deadline: May 19, 2016 (Thursday), 19:00, box in front of room O-529 (building O, 5th floor)

Additional Hints for Homework

Start simple and proceed in small steps
Command line for all steps:
flex xml.l && gcc lex.yy.c && ./a <flex_in.txt | diff - flex_out.txt
Integer literals: 0345 (octal), 0xAC, 0XfE (hexadecimal), 123u, 123U (unsigned), 678l, 678L (long)
Character/string literals: L'a', L"abc" (wide characters), '\'', '\\', "ab\"c\ndef" (escaping)
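
The integer literal forms listed above can be tried out quickly before writing the flex rules. The following is only a rough Python sketch, not the homework's reference solution: the name `is_integer_literal` is made up for illustration, and the pattern handles only a single optional suffix (combined suffixes such as `ul` are left out).

```python
import re

# Illustrative pattern for the integer literal forms from the hints
# (an assumption for this sketch, not the official homework answer).
INTEGER_LITERAL = re.compile(
    r"""
    (?: 0[xX][0-9a-fA-F]+   # hexadecimal: 0xAC, 0XfE
      | 0[0-7]*             # octal, including plain 0: 0345
      | [1-9][0-9]*         # decimal: 123
    )
    [uUlL]?                 # one optional suffix: 123u, 678L
    """,
    re.VERBOSE,
)


def is_integer_literal(text):
    """Return True if the whole text is one integer literal of the forms above."""
    return INTEGER_LITERAL.fullmatch(text) is not None
```

The same alternatives carry over to a flex rule more or less directly, since flex patterns are regular expressions as well.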

Compilation Stages

1. Lexical analysis
2. Parsing (syntax analysis)
3. Semantic analysis
4. Optimization (or 5)
5. Code generation (or 4)

Table of Formal Language Types

Grammar	Type	Language type	Automaton
phrase structure grammar (psg)	0	phrase structure language	Turing machine
context-sensitive grammar (csg)	1	context-sensitive language	linear-bounded automaton
context-free grammar (cfg)	2	context-free language	push-down automaton
regular grammar (rg)	3	regular language	finite state automaton

Limitations of Regular Languages/Grammars and FSAs

Can the following languages be represented with a regular expression?

The language where all words consist of the symbols ( and ), properly nested as in a formula
The language where all words consist of n 0s followed by n 1s
The language where all words consist of symbols a, b, and c so that words are palindromes (e.g. acbca)

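None of these languages can be accepted by an FSA, because an FSA has only limited (finite) memory, while these languages require remembering an unbounded amount of information (e.g. the current nesting depth).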
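
To see where the unbounded memory comes in, here is a small Python sketch of recognizers for the first two languages. The function names are made up for illustration. Each recognizer keeps an integer counter that can grow as large as the input demands, which is exactly what a fixed, finite set of states cannot do.

```python
def is_properly_nested(word):
    """Accept words over ( and ) that are properly nested."""
    depth = 0  # number of currently open parentheses; has no upper bound
    for symbol in word:
        if symbol == "(":
            depth += 1
        elif symbol == ")":
            depth -= 1
            if depth < 0:  # a ) without a matching (
                return False
        else:
            return False  # symbol outside the alphabet
    return depth == 0


def is_n_zeros_n_ones(word):
    """Accept words consisting of n 0s followed by n 1s (n >= 0)."""
    count = 0  # number of 0s that still need a matching 1
    i = 0
    while i < len(word) and word[i] == "0":
        count += 1
        i += 1
    while i < len(word) and word[i] == "1":
        count -= 1
        i += 1
    # The whole word must be consumed and the counts must match.
    return i == len(word) and count == 0
```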

Comparing Lexical Analysis and Parsing

	Lexical Analysis	Parsing
Targets of analysis	literals, identifiers, keywords, operators,...	expressions, statements, functions, declarations, definitions,...
Requirement	speed	descriptive power
Notation	regular expression	context-free grammar
Device for (automatic) analysis	finite state automaton	push-down automaton

Regular Grammars and Context-Free Grammars

Regular grammar:

Right linear grammar or left linear grammar

Context-free grammar:

The left hand side of all rewriting rules is a single non-terminal symbol
The right hand side is not restricted (any number of terminals and non-terminals is possible)
Examples: A → cBd, B → ccB, S → cBcAd
Meaning of free: Not dependent, not influenced by
For programming languages, the correctness of syntax can be decided locally, and does not depend on context
(however, there may be semantic constraints, e.g. whether a variable is defined or not and what type it has)

Example of Context-Free Grammar

S → aSa | bSb | c

Examples of generated words: c, aca, bcb, abaabcbaaba

Language being generated: A single c in the middle, surrounded by 0 or more a and b so that the resulting word is a palindrome
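
The grammar S → aSa | bSb | c can be explored with a short Python sketch. The function names `generate` and `is_generated` are made up for illustration. The first function enumerates words by applying the rules a limited number of times, and the second checks a word by undoing the rules from the outside in.

```python
def generate(max_depth):
    """Return all words derivable with at most max_depth uses of S → aSa | bSb."""
    if max_depth == 0:
        return ["c"]  # only S → c is allowed
    words = ["c"]
    for inner in generate(max_depth - 1):
        words.append("a" + inner + "a")  # S → aSa
        words.append("b" + inner + "b")  # S → bSb
    return words


def is_generated(word):
    """Check whether word can be derived from S."""
    if word == "c":  # S → c
        return True
    if len(word) >= 3 and word[0] == word[-1] and word[0] in "ab":
        return is_generated(word[1:-1])  # undo S → aSa or S → bSb
    return False
```

For example, `generate(1)` yields c, aca, and bcb, and `is_generated("abaabcbaaba")` is true.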

This language cannot be accepted by an automaton with finite memory (e.g. FSA)

We need to extend FSAs to create more powerful automata

We will add a push-down stack

Push-Down Stack and Push-Down Symbols

Store push-down symbols
Push-down symbols are different from input symbols
Only the symbol on the top of the stack can be seen/checked
Symbols can be added to or removed from the stack one at a time
There is a special symbol at the bottom of the stack, the bottom marker
The bottom marker cannot be removed

A stack of trays at the cafeteria. Only the topmost tray is visible due to a built-in spring.
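
The properties listed above can be summarized in a minimal Python sketch. The class name `PushDownStack` and the choice of Z as bottom marker are assumptions made for illustration.

```python
class PushDownStack:
    """A stack of push-down symbols with a bottom marker that cannot be removed."""

    BOTTOM = "Z"  # assumption: Z serves as the bottom marker

    def __init__(self):
        self._symbols = [self.BOTTOM]  # at the start, only the bottom marker

    def top(self):
        """Only the topmost symbol can be seen."""
        return self._symbols[-1]

    def push(self, symbol):
        """Add one symbol on top of the stack."""
        if symbol == self.BOTTOM:
            raise ValueError("the bottom marker is not a regular push-down symbol")
        self._symbols.append(symbol)

    def pop(self):
        """Remove one symbol from the top of the stack."""
        if self.top() == self.BOTTOM:
            raise ValueError("the bottom marker cannot be removed")
        return self._symbols.pop()
```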

Example of Push-Down Automaton

Σ = {a, b, c}
The stack is written with the top on the left
A/BA means that if the top of the stack is A, then replace that with BA (i.e. put B on top)

(Figure: diagram of the push-down automaton)

How a Push-Down Automaton Works

In many ways the same as for FSAs (states, start state, final states, transitions,...)
At the start, the bottom marker is visible (i.e. the stack is empty)
The transition is selected based on:
- The next input symbol (examples: a, b)
- The stack symbol visible on the top of the stack (examples: B, Z)
A transition is one of:
- Remove one symbol from the top of the stack (example: b, B/ε)
- Keep the stack as is (example: c, A/A)
- Put a new symbol on top of the stack (example: a, B/AB)
The input is accepted if (variants can be shown to be equivalent):
1. The bottom marker is visible
2. The automaton is in a final state at the end of the input
3. Both conditions (1. and 2.) are met
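
The steps above can be sketched as a small simulator in Python. The transition table below is an assumption made for illustration (the automaton in the lecture's figure may use different states and symbols). It accepts the language of S → aSa | bSb | c, uses the state names p and q, writes the stack as a string with the top on the left, and applies acceptance variant 3.

```python
BOTTOM = "Z"          # bottom marker
FINAL_STATES = {"q"}

# (state, input symbol, top of stack) -> (new state, replacement for the top)
TRANSITIONS = {}
for top in "ABZ":
    TRANSITIONS[("p", "a", top)] = ("p", "A" + top)  # a, X/AX: put A on top
    TRANSITIONS[("p", "b", top)] = ("p", "B" + top)  # b, X/BX: put B on top
    TRANSITIONS[("p", "c", top)] = ("q", top)        # c, X/X: keep the stack
TRANSITIONS[("q", "a", "A")] = ("q", "")             # a, A/ε: remove A
TRANSITIONS[("q", "b", "B")] = ("q", "")             # b, B/ε: remove B


def accepts(word):
    """Run the deterministic push-down automaton on word."""
    state, stack = "p", BOTTOM
    for symbol in word:
        key = (state, symbol, stack[0])  # only the top of the stack is visible
        if key not in TRANSITIONS:
            return False  # no transition available: reject
        state, replacement = TRANSITIONS[key]
        stack = replacement + stack[1:]
    # Variant 3: final state and bottom marker visible at the end of the input.
    return state in FINAL_STATES and stack[0] == BOTTOM
```

For the input aca, the stack goes from Z to AZ on a, stays AZ on c (while the state changes from p to q), and returns to Z on the final a, so the word is accepted.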

Deterministic and Nondeterministic Push-Down Automata

The language generated by the grammar S → aSa | bSb | c can be accepted by a deterministic push-down automaton
The language generated by the grammar S → aSa | bSb | ε cannot be accepted by a deterministic push-down automaton
Reason: There is no marker in the middle of the word, so the automaton cannot know when to switch from adding symbols to removing them
Unlike for FSAs, deterministic and nondeterministic push-down automata differ in power
For efficient parsing, it is important to use a grammar whose language can be accepted by a deterministic push-down automaton
Fortunately, this is easy for programming languages

(there are other aspects of context-free languages/grammars that affect parsing speed)
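
To illustrate what nondeterminism means for S → aSa | bSb | ε, the following Python sketch tracks every configuration (state, stack) the automaton could be in, guessing after each symbol that the middle of the word may have been reached. The state names p and q, the stack symbols A, B, Z, and the function name are assumptions made for illustration.

```python
SYMBOL_FOR = {"a": "A", "b": "B"}  # push-down symbol used for each input symbol


def accepts_even_palindrome(word):
    """Simulate a nondeterministic push-down automaton for S → aSa | bSb | ε."""
    # The ε-move from p to q (guess: this is the middle) is possible
    # even before any input has been read.
    configs = {("p", "Z"), ("q", "Z")}
    for symbol in word:
        if symbol not in SYMBOL_FOR:
            return False  # symbol outside the alphabet
        next_configs = set()
        for state, stack in configs:
            if state == "p":
                pushed = SYMBOL_FOR[symbol] + stack
                next_configs.add(("p", pushed))  # still in the first half
                next_configs.add(("q", pushed))  # guess: the middle is right here
            elif stack[0] == SYMBOL_FOR[symbol]:
                next_configs.add(("q", stack[1:]))  # second half: remove the match
        configs = next_configs
    # Accept if some sequence of guesses ends in q with the bottom marker visible.
    return ("q", "Z") in configs
```

The set of configurations can grow with the length of the input, which hints at why simulating nondeterminism is more expensive than running a deterministic automaton.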

Homework

(no need to submit, but bring to lecture)

For a programming language that you know (e.g. C, Java, Ruby,...), search for a grammar on the Web, print it out, and carefully study it.

Glossary

unsigned: 符号無し
nested: 入れ子 (になっている)
palindrome: 回文、左右対称な語
push-down symbol: プッシュダウン記号
bottom marker: ボトムマーカ
deterministic push-down automaton: 決定性プッシュダウンオートマトン
nondeterministic push-down automaton: 非決定性プッシュダウンオートマトン