Importance, classification, and definition of formal languages; finite automata

(形式言語の重要性、種類、定義)

2nd lecture, April 15, 2016

Language Theory and Compiler

http://www.sw.it.aoyama.ac.jp/2016/Compiler/lecture2.html

Martin J. Dürst

Today's Schedule

Last week's homework
Definitions for formal language theory
- Definitions, operations, and properties for words
- Definitions, operations, and properties for languages
Automata, grammars, and derivation

Example Answers for Homework

Course Contents

	Theory	Compilers	Other applications
Front end	language theory, automata	lexical analysis, parsing	regular expressions, text/data formats
Back end		optimization, code generation

Importance of Formal Language Theory

Model for data formats and programming languages
Model for computation and recognition

Terms used for Natural Languages and Formal Languages

Field		Smallest Unit	Sequence	Set	Classification
natural language	Japanese	(単)語	文、文書	(自然)言語	(大)語族、語族、語派、語群
English	word	sentence, text	(natural) language	language macrofamily, family, group,...
formal language	Japanese	記号 (文字など)	語	(形式)言語	言語 (族)
English	symbol (letter,...)	word	(formal) language	language type,...

Basic Terms

Terms for formal languages:

A word is composed of symbols following some rules
A word is a string/sequence of symbols
Example: Words such as a, abc, aaabbb, and abcba can be created using symbols a, b, and c
The empty word (ε) is also a word

Definition of Word

A word or a language is defined over a finite set of symbols (or letters) Σ
Σ is called the alphabet (example: Σ = {a, b, c})
A word over Σ is a sequence of 0 or more symbols from Σ
The number of symbols in a word is called the length of a word
The length of a word w is written |w|
Example: |abcaba| = 6; |ε| = 0
Symbols are also words of length 1 (example: b)

Concatenation Operation for Words

A new word can be created by placing two words one after the other
This is called the concatenation operation on words
The concatenation operation is written without an explicit operator symbol
(similar to how multiplication is written in high school algebra)
Example: The concatenation of words w and v is written wv
Application example: With w = abc and v = cba, wv = abccba
The concatenation of a word (or symbol) with itself is written using an exponent: w² = ww = abcabc, a⁵ = aaaaa,...
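
The definitions above can be tried out in a minimal Python sketch. It assumes that words are modelled as Python strings and that the empty word ε is the empty string; the helper name `power` is hypothetical and chosen only for this illustration.

```python
# Words are modelled as Python strings; the empty word ε is "".
w = "abc"
v = "cba"

# In the notation above, concatenation has no operator symbol;
# for Python strings it corresponds to the + operator.
wv = w + v


def power(word: str, n: int) -> str:
    """Concatenate `word` with itself n times (word to the n-th power)."""
    return word * n


print(wv)                      # abccba
print(power(w, 2))             # abcabc
print(power("a", 5))           # aaaaa
print(len("abcaba"), len(""))  # 6 0
```

With this modelling, the zeroth power of any word is the empty word, which matches the convention that ε is the neutral element.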

Properties of Concatenation

Associativity: For any words w, v, and u: (wv)u = w(vu)
The neutral element is ε: wε = εw = w
Commutativity does not hold in general: usually wv ≠ vw (example: with w = abc and v = cba, wv = abccba ≠ cbaabc = vw)
The length of a concatenation is the sum of the lengths of its operands: |wv| = |w| + |v|
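
These properties can be checked on a small sample of words. The following is only a sketch, again assuming that words are Python strings; the helper `words_up_to` is a hypothetical name. Such a check on finitely many words illustrates the properties but does not prove them.

```python
from itertools import product

SIGMA = "abc"  # the example alphabet Σ = {a, b, c}


def words_up_to(max_len: int) -> list[str]:
    """All words over SIGMA with length 0..max_len (ε is the empty string)."""
    return ["".join(p) for n in range(max_len + 1) for p in product(SIGMA, repeat=n)]


words = words_up_to(2)  # 1 + 3 + 9 = 13 words

# Associativity and additivity of length, checked exhaustively on the sample.
assert all((w + v) + u == w + (v + u) for w in words for v in words for u in words)
assert all(len(w + v) == len(w) + len(v) for w in words for v in words)

# ε is the neutral element.
assert all(w + "" == w == "" + w for w in words)

# Commutativity fails for some pairs of words, although not for all of them.
assert "abc" + "cba" != "cba" + "abc"
assert "ab" + "abab" == "abab" + "ab"
```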

Definition of Language

A language over Σ is a set of words over Σ

Examples of languages over Σ = {a, b, c}:

Empty set: {}
Set containing only the empty word: {ε}
Σ (set of words of length 1 over Σ): {a,b,c}
Set of words of length 3 (over Σ) (size of set: 27)
Set of all words over Σ
Set of (all) words (over Σ) starting with a
Set of words representing the weather at each of the prefectural governments for each of the days of next week, if a stands for sunny, b stands for cloudy, and c stands for rainy/snowy (size of each word: 7; number of words: 47)
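
Some of these example languages can be written down directly in Python. This sketch assumes that a finite language is modelled as a Python set of strings and that an infinite language is modelled by a membership test; the name `starts_with_a` is hypothetical.

```python
from itertools import product

SIGMA = {"a", "b", "c"}

empty_language = set()   # {}
only_empty_word = {""}   # {ε}: a language with exactly one word
length_one = set(SIGMA)  # Σ itself, seen as the words of length 1

# All words of length 3 over Σ: 3 * 3 * 3 = 27 words.
length_three = {"".join(p) for p in product(sorted(SIGMA), repeat=3)}


def starts_with_a(word: str) -> bool:
    """Membership test for the (infinite) language of words starting with a."""
    return set(word) <= SIGMA and word.startswith("a")


print(len(length_three))       # 27
print(starts_with_a("abcba"))  # True
print(starts_with_a("cab"))    # False
```

Note that the empty language and the language containing only the empty word are different sets: the first has no words, the second has one.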

More Examples of Languages

Σ = {a,..., z}, set of keywords of the programming language C
Σ = {0,..., 9, a,..., f, A,..., F, x,...}, set of integer literals of C
Σ = {characters from ASCII or Unicode}, (set of) grammatically correct C programs
Σ = {a,..., z, (, ), ¬, ∧, ∨}, set of all well-formed formulæ of predicate logic
Σ = {Latin letters,...}, set of all English words
Σ = {Latin letters,...}, set of all French words
Σ = {Kanji, Kana,...}, set of all Japanese words
Σ = {Kanji, Kana,...}, set of all grammatically correct Japanese sentences

Operations on Languages

Operations on languages are combinations of operations on sets and operations on words.

Set union of languages
Set intersection of languages
Set difference of languages
Concatenation operation for languages: For languages A and B, their concatenation AB is the set { wv | w∈A, v∈B }
As for words, we write L² for LL,...
Kleene closure: Concatenating the same language 0 or more times
written L^*; L^* = ⋃_{i=0}^{∞} Lⁱ (with L⁰ = {ε})

Example: L = {a, b} => L^* = {ε, a, b, aa, ab, ba, bb, aaa, ...}
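
Concatenation of languages and the Kleene closure can be sketched in Python as follows, assuming languages are sets of strings. Because L^* is infinite whenever L contains a non-empty word, the hypothetical helper `kleene_up_to` lists only the words up to a given length.

```python
def concat(A: set[str], B: set[str]) -> set[str]:
    """Concatenation of languages: AB = { wv | w in A, v in B }."""
    return {w + v for w in A for v in B}


def kleene_up_to(L: set[str], max_len: int) -> set[str]:
    """Words of L^* whose length is at most max_len."""
    result = {""}    # L^0 = {ε}
    frontier = {""}  # words found in the previous round
    while frontier:
        # Extend every newly found word by one more word from L,
        # keeping only words that are short enough and not yet known.
        frontier = {w for w in concat(frontier, L) if len(w) <= max_len} - result
        result |= frontier
    return result


L = {"a", "b"}
print(sorted(concat(L, L)))
# ['aa', 'ab', 'ba', 'bb']
print(sorted(kleene_up_to(L, 2), key=lambda w: (len(w), w)))
# ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
```

Sorting by length first and alphabetically second gives the order used in the example above.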

Main Problems in Formal Language Theory

How can languages be defined in a way that is simple and easy to understand?
How can words be produced from definitions of languages?
How can we decide whether some sequence of symbols is a word in some language?
How can such decisions be implemented easily and executed quickly?
How can we associate syntax with semantics?

Languages and Automata and Grammars

An automaton is a model for a machine that accepts/recognizes/distinguishes words in a given language
A grammar is a set of rules for generating the words of a language
There are many different kinds of automata and grammars
These different kinds have different ranges of languages that can be accepted/generated
Language theory mainly distinguishes four types of languages
These four types are ordered by the subset relationship: every language of a more restricted type is also a language of each less restricted type
For each type of language, there is a corresponding type of automaton and a corresponding type of grammar

Table of Formal Language Types

(Chomsky hierarchy)

文法	grammar	Type	Language type	automaton
句構造文法	phrase structure grammar (psg)	0	phrase structure language	Turing machine
文脈依存文法	context-sensitive grammar (csg)	1	context-sensitive language	linear-bounded automaton
文脈自由文法	context-free grammar (cfg)	2	context-free language	push-down automaton
正規文法	regular grammar (rg)	3	regular language	finite state automaton

The Turing machine is a model for computation in general
Context-free languages are used for parsing
Regular languages are used for lexical analysis

Types of Automata

Automata types are distinguished by the restrictions on their "external memory":

0. The external memory is a tape of unlimited length: Turing machine

1. The external memory is a tape of limited length (bounded by a linear function of the input length): linear-bounded automaton

2. The external memory is a stack where only the top can be accessed: push-down automaton

3. There is no external memory: finite state automaton
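
To illustrate what "no external memory" means, here is a minimal sketch of a finite state automaton in Python. The language (words over {a, b, c} starting with a), the state names, and the table layout are all assumptions made for this illustration; they are not taken from the lecture.

```python
# Hypothetical transition table: (current state, symbol) -> next state.
TRANSITIONS = {
    ("start", "a"): "accept",
    ("start", "b"): "reject",
    ("start", "c"): "reject",
}
for symbol in "abc":
    TRANSITIONS[("accept", symbol)] = "accept"  # stay once the first symbol was a
    TRANSITIONS[("reject", symbol)] = "reject"  # no way back from a wrong start


def accepts(word: str) -> bool:
    """Run the automaton; the current state is the only thing it remembers."""
    state = "start"
    for symbol in word:
        if (state, symbol) not in TRANSITIONS:
            return False  # symbol is not in the alphabet
        state = TRANSITIONS[(state, symbol)]
    return state == "accept"


print(accepts("abcba"))  # True
print(accepts("cab"))    # False
print(accepts(""))       # False
```

The automaton reads each symbol exactly once and keeps nothing but its current state, which is why it needs no tape or stack.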

Example of a Grammar for a Formal Language

S, A: nonterminal symbols (upper case)
a, o, y: terminal symbols (lower case)
rewriting rules:
S → a S o
S → A
A → y a
S: initial symbol (start symbol)

Example of derivation of a word from the grammar:

S → a S o → a a S o o → a a A o o → a a y a o o

S ⇒ a a y a o o
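
The derivation above can be reproduced with a few lines of Python. This is a sketch under two assumptions: each symbol is a single character, and the second rule is S → A, as the step a a S o o → a a A o o indicates. The function name `derive` is hypothetical.

```python
def derive(n: int) -> list[str]:
    """Apply S -> aSo n times, then S -> A, then A -> ya; return all steps."""
    form = "S"
    steps = [form]
    for _ in range(n):
        form = form.replace("S", "aSo", 1)  # rule S -> a S o
        steps.append(form)
    form = form.replace("S", "A", 1)        # rule S -> A
    steps.append(form)
    form = form.replace("A", "ya", 1)       # rule A -> y a
    steps.append(form)
    return steps


print(" → ".join(derive(2)))
# S → aSo → aaSoo → aaAoo → aayaoo
```

Calling `derive` with other values of n produces other words of the same language; the only choice in this grammar is how often the first rule is applied.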

Definition of Grammar

A finite set of nonterminal symbols N (usually upper case)
A finite set of terminal symbols Σ (usually lower case, N ∩ Σ = {})
A finite set of rewriting rules P (also called production rules)
A start symbol S (S ∈ N, the symbol on the left side of the first rewriting rule if not explicitly specified)

A grammar is defined as a quadruple (N, Σ, P, S)
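
One possible way to represent such a quadruple in a program is sketched below. The class name `Grammar`, its field names, and the check `is_well_formed` are hypothetical, and the sketch assumes that every symbol is a single character. The example instance assumes the second rule of the earlier example grammar is S → A.

```python
from typing import NamedTuple


class Grammar(NamedTuple):
    """A grammar as the quadruple (N, Σ, P, S)."""
    nonterminals: frozenset[str]        # N
    terminals: frozenset[str]           # Σ
    rules: tuple[tuple[str, str], ...]  # P, each rule as (left side, right side)
    start: str                          # S


def is_well_formed(g: Grammar) -> bool:
    """Check N ∩ Σ = {}, S ∈ N, and that rules only use symbols from N ∪ Σ."""
    symbols = g.nonterminals | g.terminals
    return (
        not (g.nonterminals & g.terminals)
        and g.start in g.nonterminals
        and all(set(lhs + rhs) <= symbols for lhs, rhs in g.rules)
    )


example = Grammar(
    nonterminals=frozenset("SA"),
    terminals=frozenset("aoy"),
    rules=(("S", "aSo"), ("S", "A"), ("A", "ya")),
    start="S",
)
print(is_well_formed(example))  # True
```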

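Rewriting Rules

A single rewriting rule is written α → β
α is the left-hand side, β is the right-hand side
α and β are sequences of zero or more nonterminal and terminal symbols
The left-hand side contains at least one nonterminal symbol
Examples: aD → aDDb, EF → abc, F → Fb, D → ε
Counterexamples: bc → Dc, ε → b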
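
The condition on the left-hand side can be expressed as a one-line check. This sketch assumes the usual convention that upper case letters are nonterminal symbols and lower case letters are terminal symbols, and it writes ε as the empty string; the name `is_valid_rule` is hypothetical.

```python
def is_valid_rule(lhs: str, rhs: str) -> bool:
    """A rule is valid if its left-hand side has at least one nonterminal.

    The right-hand side may be any sequence, including the empty word.
    """
    return any(symbol.isupper() for symbol in lhs)


examples = [("aD", "aDDb"), ("EF", "abc"), ("F", "Fb"), ("D", "")]
counterexamples = [("bc", "Dc"), ("", "b")]

print([is_valid_rule(lhs, rhs) for lhs, rhs in examples])
# [True, True, True, True]
print([is_valid_rule(lhs, rhs) for lhs, rhs in counterexamples])
# [False, False]
```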

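Derivation

The process of creating a word from a grammar
Start with the start symbol
In a single derivation step, one rewriting rule is applied once:
- In the current sequence of (non)terminal symbols,
- identify a subsequence that is identical to the left-hand side of a rewriting rule
- Replace this subsequence with the right-hand side of that rewriting rule
If more than one derivation step is possible, choose arbitrarily
(different choices may produce different words)
The derivation ends as soon as the result consists of terminal symbols only
→ The result is a word of the language (defined by the grammar)
If no rewriting rule is applicable while nonterminal symbols remain, the derivation fails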
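
A single derivation step can be sketched in Python as a function that lists every result reachable by applying one rule once. The sketch assumes single-character symbols, upper case for nonterminal symbols, and a small sample grammar (S → aSo, S → A, A → ya); the names `successors` and `is_word` are hypothetical.

```python
RULES = [("S", "aSo"), ("S", "A"), ("A", "ya")]


def successors(form: str, rules: list[tuple[str, str]]) -> list[str]:
    """All sequences obtained from `form` by applying one rule at one position."""
    result = []
    for lhs, rhs in rules:
        start = form.find(lhs)
        while start != -1:
            # Replace this occurrence of the left-hand side by the right-hand side.
            result.append(form[:start] + rhs + form[start + len(lhs):])
            start = form.find(lhs, start + 1)
    return result


def is_word(form: str) -> bool:
    """The derivation has ended when only terminal symbols remain."""
    return not any(symbol.isupper() for symbol in form)


print(successors("S", RULES))     # ['aSo', 'A']
print(successors("aSo", RULES))   # ['aaSoo', 'aAo']
print(successors("ayao", RULES))  # []
print(is_word("ayao"))            # True
```

When `successors` returns more than one result, any of them may be chosen; when it returns none and `is_word` is false, the derivation has failed.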

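Example of a Grammar and a Derivation

Grammar:
S → aba (1), S → aDTa (2), T → CDTa (3), T → CDa (4), DC → QC (5), QC → QD (6), QD → CD (7),
aC → aa (8), Da → ba (9), Db → bb (10)

(The numbers are the numbers of the rewriting rules and underlines mark where a rule is applied; both are usually omitted, including in the homework)

Example of a derivation: S →₂ aDTa →₄ aDCDaa →₅ aQCDaa →₉ aQCbaa →₆ aQDbaa →₇ aCDbaa →₈ aaDbaa →₁₀ aabbaa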
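
A derivation written with rule numbers can be replayed mechanically, which is a convenient way to check it for mistakes. The sketch below assumes single-character symbols and always applies a rule at the leftmost matching position; that is a simplification, since a derivation may in general apply a rule at any matching position. The name `replay` is hypothetical.

```python
RULES = {
    1: ("S", "aba"), 2: ("S", "aDTa"), 3: ("T", "CDTa"), 4: ("T", "CDa"),
    5: ("DC", "QC"), 6: ("QC", "QD"), 7: ("QD", "CD"),
    8: ("aC", "aa"), 9: ("Da", "ba"), 10: ("Db", "bb"),
}


def replay(rule_numbers: list[int], start: str = "S") -> list[str]:
    """Apply the numbered rules in order, each at its leftmost match."""
    form = start
    steps = [form]
    for number in rule_numbers:
        lhs, rhs = RULES[number]
        if lhs not in form:
            raise ValueError(f"rule {number} is not applicable to {form}")
        form = form.replace(lhs, rhs, 1)
        steps.append(form)
    return steps


print(" → ".join(replay([2, 4, 5, 9, 6, 7, 8, 10])))
# S → aDTa → aDCDaa → aQCDaa → aQCbaa → aQDbaa → aCDbaa → aaDbaa → aabbaa
```

A `ValueError` signals that the given sequence of rule numbers does not describe a possible derivation under the leftmost-match assumption.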

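Types of Grammars

Grammar types are distinguished by the restrictions on their rewriting rules:

0. No particular restrictions: phrase structure grammar, (Chomsky) type 0 grammar

1. All rules have the form αAβ → αγβ (α and β are sequences of zero or more, γ is a sequence of one or more (non)terminal symbols):
context-sensitive grammar, (Chomsky) type 1 grammar

2. All rules have the form A → γ (γ is a sequence of zero or more (non)terminal symbols):
context-free grammar, (Chomsky) type 2 grammar

3. All rules have the form A → aB or A → a (alternatively, A → Ba or A → a):
regular grammar, (Chomsky) type 3 grammar

(In all cases, S → ε is additionally allowed as a special case)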
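
The restrictions above can be turned into a small classifier that reports the most restricted type whose rule forms a grammar satisfies. This is only a sketch: it assumes single-character symbols, upper case for nonterminal symbols, lower case for terminal symbols, and ε written as the empty string. All function names are hypothetical.

```python
def is_context_sensitive_rule(lhs: str, rhs: str) -> bool:
    """Check the form αAβ -> αγβ with γ of length 1 or more."""
    for i, symbol in enumerate(lhs):
        if symbol.isupper():  # candidate for the nonterminal A
            alpha, beta = lhs[:i], lhs[i + 1:]
            if len(rhs) >= len(lhs) and rhs.startswith(alpha) and rhs.endswith(beta):
                return True
    return False


def is_right_linear(rhs: str) -> bool:
    """Right side of the form a or aB."""
    return len(rhs) in (1, 2) and rhs[0].islower() and (len(rhs) == 1 or rhs[1].isupper())


def is_left_linear(rhs: str) -> bool:
    """Right side of the form a or Ba."""
    return len(rhs) in (1, 2) and rhs[-1].islower() and (len(rhs) == 1 or rhs[0].isupper())


def grammar_type(rules: list[tuple[str, str]], start: str = "S") -> int:
    """Return the largest type number (0 to 3) whose restrictions all rules meet."""
    rules = [rule for rule in rules if rule != (start, "")]  # S -> ε is always allowed
    single = all(len(lhs) == 1 and lhs.isupper() for lhs, _ in rules)
    if single and (all(is_right_linear(rhs) for _, rhs in rules)
                   or all(is_left_linear(rhs) for _, rhs in rules)):
        return 3
    if single:
        return 2
    if all(is_context_sensitive_rule(lhs, rhs) for lhs, rhs in rules):
        return 1
    return 0


print(grammar_type([("S", "aS"), ("S", "b")]))                # 3
print(grammar_type([("S", "aSo"), ("S", "A"), ("A", "ya")]))  # 2
print(grammar_type([("S", "aDa"), ("DC", "QC"), ("Da", "ba")]))  # 1
print(grammar_type([("aD", "D")]))                            # 0
```

The classifier looks only at the form of the rules. A language may still belong to a more restricted type than the grammar that happens to be used to define it.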

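Homework

Deadline and place: Submit by Thursday, April 21, 2016, 19:00, in the box in front of room O-529 (building O, 5th floor)

Format: One A4 sheet (the back side may also be used)

For L = { a, cb, ac }, list the 10 shortest words of L^*.
Advanced problem (optional): List all words of length 4 in L^*.
Using the grammar from the derivation example, write derivations for three words that differ from the example and from each other (include all intermediate steps; rule numbers and underlines are not needed). Guess what kind of language this grammar defines, and explain your guess.
(Hint: If guessing is not easy, there is most likely a problem with your derivations)
Advanced problem (optional): Try to prove your guess.
(No submission needed, but if you could not complete this, make sure to bring your notebook computer to the next lecture)
Install cygwin on your own notebook computer (detailed instructions with screenshots). During installation, make sure to select gcc, flex, bison, diff, make, and m4. If cygwin was installed previously, make sure to check and update it.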

Glossary

word: 語
derivation: 導出
classification: 分類
symbol: 記号
empty word: 空語
alphabet: アルファベット
(word/language) over Σ: Σ 上の (語・言語)
concatenation (operation): 連結 (演算)
associativity: 結合性 (結合律が成立つこと)
neutral element: 単位元
commutativity: 可換性
prefectural government (building): 県庁
keyword: 予約語
well-formed formula: 整論理式
Kleene closure: クリーン閉包
rule: 規則
type of language: 言語族
Chomsky hierarchy: チョムスキー階層
phrase structure language: 句構造言語
context-sensitive language: 文脈依存言語
context-free language: 文脈自由言語
regular language: 正規言語
Turing machine: チューリング機械
linear-bounded automaton: 線形束縛オートマトン
push-down automaton: プッシュダウンオートマトン
finite state automaton: 有限オートマトン
external memory: 外部メモリ
nonterminal symbol: 非終端記号
upper case (letter): 大文字
lower case (letter): 小文字
terminal symbol: 終端記号
rewriting rule/production rule: 書き換え規則・生成規則
initial/start symbol: 初期記号・開始記号
quadruple: 四字組