Definitions, operations, and properties for languages
Automata, grammars, and derivation
Schedule for Next Few Weeks
Today:
Importance, classification, and definition of formal languages; finite
automata
May 2
No lecture (Constitution Memorial Day)
May 10
Deterministic and non-deterministic finite automata, (left and right)
linear grammars, regular expressions
May 15 (makeup class: Wednesday, 1st period, room E-202)
Implementation of lexical analysis, use of tools for lexical
analysis
May 17
Applications of lexical analysis, exercises using tools for lexical
analysis
About makeup classes: The material in the makeup class is part of the final
exam. If you have another makeup class at the same time, please inform the
teacher today.
Course Contents
| | Theory | Compilers | Other applications |
|---|---|---|---|
| Front end | language theory, automata | lexical analysis, parsing | regular expressions, text/data formats |
| Back end | | optimization, code generation | |
Importance of Formal Language Theory
Model for data formats and programming languages
Model for computation and recognition
Terms used for Natural Languages and Formal Languages
Basic Concepts
Concepts for formal languages:
A word is composed of symbols following some rules
A word is a sequence of symbols
Example: Words such as a, abc, aaabbb,
and abcba can be created using symbols a,
b, and c
The empty word (ε) is also a word
Definition of Word
A word or a language is defined over a finite set of
symbols (or letters) Σ
Σ is called the alphabet (example: Σ =
{a, b, c})
A word over Σ is a sequence of 0 or more symbols from
Σ
The number of symbols in a word is called the length of the
word
The length of a word w is written |w|
Example: |abcaba| = 6;
|ε| = 0
Symbols are also words, of length 1
(∀s∈Σ: |s|=1; example:
|b|=1)
Concatenation Operation on Words
A new word can be created by lining up two words after each other
This is called concatenation on words
The concatenation operation is represented without an explicit symbol
(similar to multiplication in high school)
Example: The concatenation of words w and v is
written wv
Application example: w = abc, v =
cba ⇒ wv = abccba
The concatenation of a word (or symbol) with itself is written using an
exponent: w^2 = ww = abcabc, a^5 = aaaaa, a^1 = a, w^0 = ε, ...
Properties of Concatenation
Associativity: For any words w, v, and
u: (wv)u =
w(vu)
Neutral element: ε
(εw = w =
wε)
Commutativity does not hold in general: usually wv ≠
vw (example: abccba ≠
cbaabc)
The length of a concatenation is the sum of the lengths of its operands:
|wv| = |w| + |v|
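The properties above can be checked on concrete words. The following is a minimal sketch that models words as Python strings and the empty word ε as the empty string; the helper name `power` and the third word `u = ab` are assumptions made for this illustration.

```python
# Words are modelled as Python strings; the empty word ε is "".
EPSILON = ""

def power(word: str, n: int) -> str:
    """Return word^n: the word concatenated with itself n times (word^0 = ε)."""
    return word * n

w, v, u = "abc", "cba", "ab"   # u is an extra word chosen for this sketch

print(w + v)                             # concatenation wv
print(power(w, 2), power("a", 5))        # w^2 and a^5
print((w + v) + u == w + (v + u))        # associativity
print(EPSILON + w == w == w + EPSILON)   # ε is the neutral element
print(w + v == v + w)                    # False: commutativity fails for this pair
print(len(w + v) == len(w) + len(v))     # |wv| = |w| + |v|
```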
Definition of Language
A language over Σ is a set of words over
Σ
Examples of languages over Σ = {a, b,
c}:
Empty set: {}
Set containing only the empty word: {ε}
Σ (set of words of length 1 over Σ):
{a,b,c}
Set of all possible words of length 3 (over Σ) (size of set:
27)
Set of all possible words of length n (over Σ)
(size of set: |Σ|^n)
Set of all possible words over Σ
Set of (all) words (over Σ) starting with a
Set of words representing the weather at each of the prefectural
government locations for each of the days of next week, if a
stands for sunny, b stands for cloudy, and c stands
for rainy/snowy (size of each word: 7; number
of words: 47)
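Some of the finite examples above can be enumerated directly. The sketch below assumes words are Python strings and uses a hypothetical helper `words_of_length`; since the set of all words starting with a is infinite, it is restricted here to words of length 3.

```python
from itertools import product

SIGMA = ("a", "b", "c")  # the alphabet Σ from the examples

def words_of_length(alphabet, n):
    """Return all words of length n over the alphabet, as a set of strings."""
    return {"".join(symbols) for symbols in product(alphabet, repeat=n)}

length_3 = words_of_length(SIGMA, 3)
print(len(length_3))               # size of the set, |Σ|^3 = 27
print(words_of_length(SIGMA, 0))   # contains only the empty word

# Words of length 3 that start with a (a finite part of an infinite language)
starting_with_a = {w for w in length_3 if w.startswith("a")}
print(sorted(starting_with_a))
```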
More Examples of Languages
Σ = {a,..., z}, set of keywords of the programming language
C
Σ = {0,..., 9, a,..., f, A,..., F, x,...}, set of integer
literals of C
Σ = {characters from ASCII or Unicode}, (set of)
grammatically correct C programs
Σ = {a,..., z, (, ), ¬, ∧, ∨}, set of all well-formed
formulæ of propositional logic
Σ = {Latin letters,...}, set of all English words
Σ = {Latin letters,...}, set of all French words
Σ = {Kanji, Kana,...}, set of all Japanese words
Σ = {Kanji, Kana,...}, set of all grammatically correct
Japanese sentences
Operations on Languages
Operations on languages are combinations of operations on sets and
operations on words.
Set union of languages
Set intersection of languages
Set difference of languages
Concatenation operation for languages:
For languages A and B, their concatenation
AB is the set { wv |
w∈A, v∈B }
Example: A = { ab, ca }, B = {
a, bb }, AB = {
aba, abbb, caa, cabb }
As for words, we write L^2 for
LL, L^1 for L,
L^0 for {ε},
...
Kleene closure: Concatenating the same language 0 or more times
written L*; L* = L^0 ∪ L^1 ∪ L^2 ∪ L^3 ∪ ... = ⋃_{i=0}^{∞} L^i
Example: L = {a, b} ⇒
L* = {ε, a, b,
aa, ab, ba, bb, aaa,
...}
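Concatenation and Kleene closure of languages can be tried out with Python sets. This is a sketch under the assumption that languages are finite sets of strings; because L* is usually infinite, the hypothetical helper `kleene_up_to` only collects the words up to a given length.

```python
def concat(A, B):
    """Concatenation of languages: { wv | w in A, v in B }."""
    return {w + v for w in A for v in B}

def kleene_up_to(L, max_len):
    """Words of L* with length at most max_len (L* itself is usually infinite)."""
    result = {""}      # L^0 = {ε}
    frontier = {""}    # words found in the previous round
    while frontier:
        # Extend every newly found word by one more word of L.
        frontier = {x for x in concat(frontier, L)
                    if len(x) <= max_len and x not in result}
        result |= frontier
    return result

A, B = {"ab", "ca"}, {"a", "bb"}
print(sorted(concat(A, B)))
print(sorted(kleene_up_to({"a", "b"}, 2), key=lambda x: (len(x), x)))
```

Sorting by length first and then alphabetically lists the shortest words first, in the same order as the example for L* above.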
Main Problems in Formal Language Theory
How to define languages in a way that is simple and easy to
understand?
How to produce words from definitions of languages?
How to decide whether some sequence of symbols is a word in some
language?
How to implement such decisions easily, and execute them quickly?
How to associate syntax with semantics?
Languages and Automata and Grammars
An automaton is a model for a machine that
accepts/recognizes/distinguishes words in a given
language
A grammar is a set of rules to create (the words of) a
language
There are many different types of automata and grammars
These different types have different ranges of languages that can be
accepted/generated
Language theory mainly distinguishes four types of languages
These four types form a chain of subsets: each type is contained in the
next more general type
For each type of language, there is a corresponding type of automaton and
a corresponding type of grammar
Table of Formal Language Types
(Chomsky hierarchy)
| Type | Grammar | Language type | Automaton |
|---|---|---|---|
| 0 | phrase structure grammar (psg) | phrase structure language | Turing machine |
| 1 | context-sensitive grammar (csg) | context-sensitive language | linear-bounded automaton |
| 2 | context-free grammar (cfg) | context-free language | push-down automaton |
| 3 | regular grammar (rg) | regular language | finite state automaton |
The Turing machine is a model for computation in general
Context-free languages are used for parsing
Regular languages are used for lexical analysis
Types of Automata
Automata types are distinguished by the restrictions on their "external
memory":
0. The external memory is a tape of unlimited length: Turing machine
1. The external memory is a tape of limited length (bounded by a linear
function of the input length): linear-bounded automaton
2. The external memory is a stack (only the top can be accessed): push-down
automaton
3. There is no external memory: finite state automaton
Example of a Grammar for a Formal Language
S, B: nonterminal symbols (upper case)
a, o, y: terminal symbols (lower case)
rewriting rules:
S → aSo
S → B
B → ya
S: start symbol (initial symbol)
Example of derivation of a word from the grammar:
S → aSo →
aaSoo → aaBoo → aayaoo
S ⇒ aayaoo
(single steps in a derivation are written with →, the overall result with
⇒)
Definition of Grammar
A finite set of nonterminal symbols N (usually upper case)
A finite set of terminal symbols Σ (usually lower case,
N ∩ Σ = {})
A finite set of rewriting rules P (also called production
rules)
A start symbol S (S ∈ N, the symbol on
the left side of the first rewriting rule if not explicitly specified)
A grammar is defined as a quadruple (N, Σ,
P, S)
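The quadruple can be written down directly as a data structure. The sketch below is one possible representation, not a standard one: the names `Grammar` and `is_well_formed` are assumptions, each symbol is a single character, and the instance is the small grammar with rules S → aSo, S → B, B → ya.

```python
from typing import NamedTuple

class Grammar(NamedTuple):
    nonterminals: frozenset   # N
    terminals: frozenset      # Σ
    rules: tuple              # P, as (left-hand side, right-hand side) pairs
    start: str                # S

def is_well_formed(g: Grammar) -> bool:
    """Check N ∩ Σ = {}, S ∈ N, and that rules only use symbols from N ∪ Σ."""
    symbols = g.nonterminals | g.terminals
    return (not (g.nonterminals & g.terminals)
            and g.start in g.nonterminals
            and all(set(lhs + rhs) <= symbols for lhs, rhs in g.rules))

example = Grammar(
    nonterminals=frozenset("SB"),
    terminals=frozenset("ayo"),
    rules=(("S", "aSo"), ("S", "B"), ("B", "ya")),
    start="S",
)
print(is_well_formed(example))
```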
Rewriting Rule
(also: production rule)
Each rewriting rule is written as α → β
α is called left-hand side, β is called
right-hand side
α is a sequence of (nonterminal/terminal) symbols, with at
least one nonterminal symbol
β is a sequence of 0 or more (nonterminal/terminal)
symbols
Examples: aD → aDDb, EF →
abc, F → Fb, D →
ε
Counterexamples: bc → Dc, ε →
b
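The condition on the left-hand side can be checked mechanically. This sketch assumes that upper-case letters are nonterminals, lower-case letters are terminals, and the empty string stands for ε; the function name `is_valid_rule` is made up for the illustration.

```python
def is_valid_rule(lhs: str, rhs: str) -> bool:
    """A rule α → β is valid if α contains at least one nonterminal.

    Assumption: upper case = nonterminal, lower case = terminal, "" = ε.
    """
    symbols_ok = all(c.isalpha() for c in lhs + rhs)
    return symbols_ok and any(c.isupper() for c in lhs)

# The four examples followed by the two counterexamples from above
for lhs, rhs in [("aD", "aDDb"), ("EF", "abc"), ("F", "Fb"), ("D", ""),
                 ("bc", "Dc"), ("", "b")]:
    print(f"{lhs or 'ε'} → {rhs or 'ε'}: {is_valid_rule(lhs, rhs)}")
```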
Derivation
Process of creating words from a grammar
Start from the start symbol
In each derivation step, apply one rewriting rule as follows:
In the current sequence of (non)terminal symbols,
find a subsequence that is equal to the left-hand side of a rewriting
rule, and
replace it with the right-hand side of the rewriting rule
If more than one rewriting rule can be applied, choose one
(different choices may produce different words)
When the sequence contains only terminal symbols, the derivation is
complete
→ The result is a word of the language defined by this grammar
If there are still some nonterminal symbols,
but no production rule has a matching left-hand side,
the derivation fails
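The derivation process can be sketched as a search that tries every applicable rule at every position instead of choosing just one. The code below is an illustration, not an efficient algorithm: it uses the rules S → aSo, S → B, B → ya, assumes upper case = nonterminal, and discards any sequence longer than a hypothetical bound `max_len` so that the search terminates. For grammars whose rules can shorten a sequence, this cut-off may miss some words.

```python
from collections import deque

RULES = [("S", "aSo"), ("S", "B"), ("B", "ya")]

def derive_words(rules, start, max_len):
    """Collect the words derivable from start without ever exceeding
    max_len symbols in an intermediate sequence."""
    words, seen, queue = set(), {start}, deque([start])
    while queue:
        form = queue.popleft()
        if not any(c.isupper() for c in form):
            words.add(form)          # only terminals left: derivation complete
            continue
        for lhs, rhs in rules:
            pos = form.find(lhs)
            while pos != -1:         # try every position where lhs occurs
                new = form[:pos] + rhs + form[pos + len(lhs):]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
                pos = form.find(lhs, pos + 1)
    return words

print(sorted(derive_words(RULES, "S", 6), key=len))
```

Sequences that still contain nonterminals but match no rule are simply dropped, which corresponds to a failed derivation.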
Example of Grammar and Derivation
Grammar:
1. S → aba
2. S → aDTa
3. T → CDTa
4. T → CDa
5. DC → CD
6. aC → aa
7. Da → ba
8. Db → bb
Example of derivation: S →(2) aDTa →(4) aDCDaa →(5) aCDDaa →(7) aCDbaa →(8) aCbbaa →(6) aabbaa
(numbers indicate the rewriting rule that is applied, the underlined parts
indicate where the rules are applied)
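The derivation of aabbaa can be replayed step by step to confirm that each rule really matches where it is applied. In this sketch the rules are numbered from 1 in the order listed above; the positions in `STEPS` (0-based index of the rewritten part) were worked out by hand for this illustration, and `apply_rule` is a hypothetical helper.

```python
# The eight rules of the grammar, numbered from 1 in the order listed.
RULES = {1: ("S", "aba"), 2: ("S", "aDTa"), 3: ("T", "CDTa"), 4: ("T", "CDa"),
         5: ("DC", "CD"), 6: ("aC", "aa"), 7: ("Da", "ba"), 8: ("Db", "bb")}

def apply_rule(form: str, rule_no: int, pos: int) -> str:
    """Rewrite the left-hand side of rule rule_no found at index pos of form."""
    lhs, rhs = RULES[rule_no]
    if not form.startswith(lhs, pos):
        raise ValueError(f"rule {rule_no} does not match {form!r} at {pos}")
    return form[:pos] + rhs + form[pos + len(lhs):]

# (rule number, position) for each step of the derivation of aabbaa
STEPS = [(2, 0), (4, 2), (5, 1), (7, 3), (8, 2), (6, 0)]

form = "S"
trace = [form]
for rule_no, pos in STEPS:
    form = apply_rule(form, rule_no, pos)
    trace.append(form)
print(" → ".join(trace))
```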
Summary of this Lecture
A word over an alphabet Σ is a sequence of 0 or more symbols
from Σ
A language over an alphabet Σ is a set of
words over Σ
An automaton is a machine with states that
accepts all words of a language (and only those)
A grammar is a set of rewriting rules that can
produce all words of a language (and only those) by starting from a single
start symbol
Homework
Deadline: May 9, 2019 (Thursday), 19:00
Where to submit: Box in front of room O-529 (building O, 5th floor)
Format: A4 single page (using both sides is okay; NO cover page), easily
readable handwriting (NO printouts), name (kanji and kana) and student number
at the top right
For the language L = { a, bd,
ab }, list the 10 shortest words of L*.
Additional problem (solution voluntary): List all words of
L* of length 4.
Using the grammar from the slide "Example of Grammar and Derivation",
find 3 words (different from each other and from aabbaa)
produced by that grammar. Give the full derivation for each word (rule
numbers and underlines not needed). Guess and explain what language this
grammar defines (Hint: If your guess is not simple, maybe you have made a
mistake in the derivations).
Additional problem (solution voluntary): Prove or justify your guess.
(no need to submit, but bring your notebook PC with you to the next
lecture if you have any problems)
Install cygwin on your notebook computer (detailed
instructions with images). Make sure that you select/install
gcc (gcc-core), flex, bison, diff
(diffutils), make, and m4. If you have an earlier
cygwin installation, make sure to check/update.