Linear Grammars and Regular Expressions

(正規文法と正規表現)

4th lecture, April 29, 2022

Language Theory and Compilers

https://www.sw.it.aoyama.ac.jp/2022/Compiler/lecture4.html

Martin J. Dürst

Today's Schedule

Last week's homework, leftovers
Regular Grammars
Regular Expressions
- Formal definition
- Conversion to an NFA
- Conversion from an FSA
- Practical regular expressions

Schedule for Next Few Weeks

Friday, April 29 (Showa Day): Lectures take place, lecture 4
Monday, May 2, 19:00: Homework deadline
Friday, May 6: Lecture 5

Leftovers

Last Week's Homework 4

Check the versions of flex, bison, gcc, make, and m4 that you installed (no need to submit, but contact me by mail if you have a problem)

Last Week's Homework 1

Draw a state transition diagram for a finite state automaton that recognizes all inputs that (at the same time)

Start with gf
End with fg
Contain an even number of h
Contain no other symbols than f, g, and h

Last Week's Homework 2

Draw the state transition diagram for the NFA in the state transition table below

	ε	0	1
→S	{P}	{Q}	{E}
E	{}	{P}	{P, T}
P	{}	{T}	{}
*Q	{}	{}	{S, P}
T	{P}	{}	{Q}

Last Week's Homework 3

Create the state transition table of the DFA that is equivalent to the NFA in homework 2 (do not rename states).


Plan for this Lecture

Finite state automata (FSA)
- Deterministic finite automaton (DFA)
- Non-deterministic finite automaton (NFA)
Regular grammar
- Left linear grammar
- Right linear grammar
Regular expression

All of these are equivalent: each defines or accepts exactly the regular languages

Example of Right Linear Grammar

[Figure: state transition diagram of the finite state automaton]

A → fB | gA

B → gA | fC | f

C → gA | fC | f

Right Linear Grammars and FSAs

Right linear grammars and NFAs correspond
as follows (not considering ε transitions):

FSA	Right Linear Grammar
states	nonterminal symbols
start state	start symbol
transitions moving to an accepting state	constant rules (e.g. `A` → `c`)
all transitions	right linear rules (e.g. `A` → `gB`)

There is a similar correspondence for left linear grammars (imagine reading the input backwards)

From NFA to Right Linear Grammar

Convert all states to nonterminal symbols
(start state→start symbol)
Convert all transitions to right linear rules
Convert all transitions to accepting states to constant rules
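
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the NFA has no ε transitions, state names and symbols are single characters, and the function name `nfa_to_right_linear` and the dictionary layout are made up for this sketch. The automaton used as input is assumed to be the one matching the example grammar shown earlier (states A, B, C, with C accepting). A start state that is itself accepting would additionally need the special rule S → ε, which the sketch does not handle.

```python
def nfa_to_right_linear(transitions, accepting):
    """transitions: dict mapping (state, symbol) -> set of target states."""
    rules = {}
    for (state, symbol), targets in sorted(transitions.items()):
        for target in sorted(targets):
            alternatives = rules.setdefault(state, [])
            # every transition becomes a right linear rule, e.g. A -> gB
            alternatives.append(symbol + target)
            # a transition into an accepting state also gives a constant rule
            if target in accepting:
                alternatives.append(symbol)
    return rules

# Assumed automaton for the example grammar (C is the accepting state)
transitions = {
    ("A", "f"): {"B"}, ("A", "g"): {"A"},
    ("B", "f"): {"C"}, ("B", "g"): {"A"},
    ("C", "f"): {"C"}, ("C", "g"): {"A"},
}
grammar = nfa_to_right_linear(transitions, accepting={"C"})
for nonterminal, alternatives in grammar.items():
    print(nonterminal, "→", " | ".join(alternatives))
```

Up to the order of the alternatives, the printed rules are the ones of the example grammar.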

From Right Linear Grammar to NFA

Create a state for each nonterminal symbol
(start symbol→start state)
Convert all right linear rules to transitions
Create a new state (e.g. named F) only used for acceptance,
and convert all constant rules
to transitions to this state
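
The opposite direction can be sketched in the same way. The code below assumes single-character terminals and nonterminals, and that the name F is not already used by a nonterminal; `right_linear_to_nfa` and `accepts` are hypothetical helper names. It builds the NFA for the example grammar and then simulates it by tracking the set of states the NFA could be in.

```python
def right_linear_to_nfa(grammar, start, final="F"):
    """grammar: dict nonterminal -> list of alternatives such as "gA" or "f"."""
    transitions = {}
    for nonterminal, alternatives in grammar.items():
        for alt in alternatives:
            symbol = alt[0]
            # a constant rule has no nonterminal, so it leads to the new state F
            target = alt[1:] or final
            transitions.setdefault((nonterminal, symbol), set()).add(target)
    return transitions, start, {final}

def accepts(nfa, word):
    transitions, start, accepting = nfa
    current = {start}                      # all states the NFA may be in
    for symbol in word:
        current = {t for s in current
                     for t in transitions.get((s, symbol), set())}
    return bool(current & accepting)

grammar = {"A": ["fB", "gA"], "B": ["gA", "fC", "f"], "C": ["gA", "fC", "f"]}
nfa = right_linear_to_nfa(grammar, start="A")
print(accepts(nfa, "gff"))   # True
print(accepts(nfa, "fgf"))   # False
```

The rule B → fC | f produces two transitions labeled f out of B (to C and to F), which is why the result is in general an NFA and not a DFA.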

Linear Grammar Definitions

Simple Rewriting Rules
Rule Shape	Name
`A` → `gB`	right linear rule (nonterminal on the right)
`A` → `Bg`	left linear rule (nonterminal on the left)
`A` → `c`	constant rule

Right linear grammar =
right linear rules + constant rules

Left linear grammar = left linear rules + constant rules

(in both cases, a special rule S → ε is allowed)

Both left linear grammars and right linear grammars are regular grammars

(A grammar that contains both left linear rules and right linear rules is called a linear grammar. A linear grammar is not a regular grammar, but a kind of context-free grammar.)

Today's Outlook

Summary up to now:

Finite state automata (FSA): Deterministic finite automata (DFA) and non-deterministic finite automata (NFA)
Regular grammar: Left linear grammar and right linear grammar
All these have the same power, generating/recognizing regular languages.

Challenge: Find a more compact representation for regular languages.

Example of Regular Expression

To find DFA and NFA in a document,

use the regular expression (D|N)FA (also written /(D|N)FA/)

(// are the delimiters for regular expressions (in Ruby, Perl, JavaScript,...))

Properties of Regular Expressions

Denote a set of patterns or words (i.e. a language)
Match some input
Have a structure similar to arithmetic expressions (in general, expression ↔ 式)
Very compact
Widely used, very useful
Two main variants:
- Theoretical regular expressions
- Practical regular expressions

Theoretical Regular Expression:
Syntax

a: {a} (a single symbol denotes itself)
abc: {abc} (concatenation, single word)
a*: {ε, a, aa, aaa,...} (Kleene closure)
a|b: {a, b} (alternative)
(a|b): same as expression without parentheses

Examples of Regular Expressions

Number of symbols:
- Even: (aa)*
- Odd: a(aa)* or (aa)*a
- Remainder is 2 when divided by 3: aa(aaa)*
A specific symbol sequence at the start of a word: abc(a|b|c)*
A specific symbol sequence at the end of a word: (a|b|c)*abc
A specific symbol sequence in the middle of a word: (a|b|c)*abc(a|b|c)*
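
These examples can be checked with Python's `re` module, which is used here only as a convenient test bed. `re.fullmatch` requires the whole word to match, which corresponds to the theoretical notion of a word belonging to the language. The dictionary keys and the helper `matches` are names chosen for this sketch.

```python
import re

examples = {
    "even": r"(aa)*",
    "odd": r"a(aa)*",
    "remainder 2 mod 3": r"aa(aaa)*",
    "starts with abc": r"abc(a|b|c)*",
    "ends with abc": r"(a|b|c)*abc",
    "contains abc": r"(a|b|c)*abc(a|b|c)*",
}

def matches(name, word):
    """True if the whole word belongs to the language of the named expression."""
    return re.fullmatch(examples[name], word) is not None

print(matches("even", "aaaa"))               # True
print(matches("odd", "aaaa"))                # False
print(matches("remainder 2 mod 3", "aaaaa")) # True (5 = 3 + 2)
print(matches("contains abc", "ccabcb"))     # True
```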

Notation of Regular Expressions

Only characters themselves, concatenation, alternative, and repetition are represented
"Usual" characters represent themselves
A small set of characters has a special role (meta-characters: |, *, (, ), ε)
Meta-characters may need to be escaped

Formal Definition of Theoretical Regular Expressions

L(r) denotes the language defined by regular expression r.

Theoretical Regular Expressions over Alphabet Σ
Priority	Regular Expression	Condition	Defined Language	Notes
	ε, a	a ∈ Σ	{ε} or {a}	literals
very high	(`r`)	`r` is a regular expression	L((`r`)) = L(`r`)	grouping
high	`r`*	`r` is a regular expression	L(`r`*) = (L(`r`))*	Kleene closure
low	`rs`	`r`, `s` are regular expressions	L(`rs`) = L(`r`)L(`s`)	concatenation
very low	`r`\|`s`	`r`, `s` are regular expressions	L(`r`\|`s`) = L(`r`) ∪ L(`s`)	set union

Grammar for Theoretical Regular Expressions

Regular expressions themselves form a language
(set of all regular expressions)
Grammar: R → ε, R → a, R → b,..., R → R|R, R → RR, R → R*, R → (R)
This is not a regular language, but a context-free language
The alphabet of a regular expression is the alphabet of the target language (e.g. a, b,...) and the meta-characters (ε, |, *, (, ))

Caution: Priority

Make sure you get priorities right!

Expression 1	Matches 1	Expression 2	Matches 2
`ab\|c`	ab, c	`a(b\|c)`	ab, ac
`abc*`	ab, abc, abcc,...	`(abc)*`	ε, abc, abcabc,...
`a\|b\|c*`	a, b, ε, c, cc,...	`(a\|b\|c)*`	ε, a, b, c, aa, ab, ac, ba,...
`ab\|c*\|d`	ab, ε, c, cc, ccc,..., d	`a(b\|c)*d`	ad, abd, acd, abbd, abcd, acbd, accd,...
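
The effect of priorities can be made visible by listing all short words an expression matches. The sketch below enumerates every word over a small alphabet up to a given length and keeps those accepted by `re.fullmatch`; the helper name `language` is an assumption of this example, and Python's operator priorities agree with the table above for these operators.

```python
import re
from itertools import product

def language(pattern, alphabet, max_len):
    """All words over alphabet with length <= max_len matched in full."""
    words = []
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            word = "".join(letters)
            if re.fullmatch(pattern, word):
                words.append(word)
    return words

print(language(r"ab|c", "abc", 2))     # ['c', 'ab']
print(language(r"a(b|c)", "abc", 2))   # ['ab', 'ac']
print(language(r"abc*", "abc", 4))     # ['ab', 'abc', 'abcc']
print(language(r"(abc)*", "abc", 4))   # ['', 'abc']
```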

Regular Expression to NFA: Symbols, Concatenation

The NFA for a symbol a has two states and one arrow:

Start state
Accepting state
Arrow labeled a from start state to accepting state (same for ε)

The NFA for the regular expression rs
connects the accepting state of r
with the start state of s
through an ε transition.

Regular Expression to NFA: Alternative

The NFA for r|s is constructed from the NFAs for r and s as follows:

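Connect the overall start state to the start states of r and s, and the accepting states of r and s to the overall accepting state, with ε transitions.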

The additional ε connections are necessary to clearly commit to either r or s.

Regular Expression to NFA: Repetition

The NFA for r* is constructed as follows:

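Connect the following pairs with ε transitions: the overall start state to the start state of r, the accepting state of r to the overall accepting state, the overall start state to the overall accepting state, and the accepting state of r back to the start state of r (backwards!).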

Conversion: Regular Expression to NFA

Construct NFA bottom-up, starting with smallest subexpressions
Each subexpression is converted to an NFA
Each subexpression has one start state and one accepting state
When combining subexpressions, connect start states and accepting states to form a larger NFA
During construction:
- Start state on the left (no incoming arrow)
- Accepting state on the right (no double circle)
When finished, add an incoming arrow to the start state and a double circle to the accepting state

Example of Conversion

Regular expression: a|b*c

In some cases, some of the ε transitions may be eliminated, or the NFA may otherwise be simplified.
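
The bottom-up construction can be sketched as follows for the example a|b*c. This is a minimal illustration, not a full implementation: there is no parser, so the expression is built by hand according to the priority rules, and the class `NFA` with its methods `symbol`, `concat`, `alt`, and `star` is invented for this sketch. Each method returns a pair (start state, accepting state) for a subexpression; a label of `None` stands for ε.

```python
from itertools import count

class NFA:
    def __init__(self):
        self.edges = []                    # (from, label, to); label None = ε
        self._ids = count()

    def new_state(self):
        return next(self._ids)

    def symbol(self, a):                   # two states, one arrow labeled a
        s, t = self.new_state(), self.new_state()
        self.edges.append((s, a, t))
        return s, t

    def concat(self, r, s):                # accepting state of r --ε--> start of s
        self.edges.append((r[1], None, s[0]))
        return r[0], s[1]

    def alt(self, r, s):                   # new start/accepting state around r and s
        start, accept = self.new_state(), self.new_state()
        for frag in (r, s):
            self.edges.append((start, None, frag[0]))
            self.edges.append((frag[1], None, accept))
        return start, accept

    def star(self, r):                     # four ε transitions, one going backwards
        start, accept = self.new_state(), self.new_state()
        self.edges += [(start, None, r[0]), (r[1], None, accept),
                       (start, None, accept), (r[1], None, r[0])]
        return start, accept

    def closure(self, states):             # everything reachable through ε only
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for p, label, t in self.edges:
                if p == q and label is None and t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    def accepts(self, fragment, word):
        start, accept = fragment
        current = self.closure({start})
        for ch in word:
            step = {t for p, label, t in self.edges
                    if p in current and label == ch}
            current = self.closure(step)
        return accept in current

nfa = NFA()
# a|b*c is read as a | ((b*) c) because of the priorities
whole = nfa.alt(nfa.symbol("a"),
                nfa.concat(nfa.star(nfa.symbol("b")), nfa.symbol("c")))
for word in ["a", "c", "bbc", "ab", ""]:
    print(repr(word), nfa.accepts(whole, word))
```

The unsimplified result has 10 states, which illustrates why removing superfluous ε transitions afterwards is worthwhile.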

Conversion: FSA to Regular Expression

Algorithmic conversion is possible, but complicated

General procedure:

1. Create regular expressions for getting from state A to state B directly for all pairs of states
2. Select a single state, and create all regular expressions that pass through this intermediate state
3. Repeat step 2, increasing the number of intermediate states
4. Simplify intermediate regular expressions as much as possible (they can get quite complex)

Once you understand which language the FSA accepts, it is often easier to write a regular expression for that language directly.
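
The general procedure above can be written as a recurrence, usually attributed to Kleene. The notation is an assumption of this sketch and does not appear in the lecture: the states are numbered 1 to n with start state 1, and R_{ij}^{(k)} is a regular expression for all ways of getting from state i to state j using only states 1 to k as intermediate states. If there is no direct transition from i to j (and i ≠ j), R_{ij}^{(0)} denotes the empty language, and any alternative or concatenation containing it is simplified away.

```latex
\begin{align*}
R_{ij}^{(0)} &= \text{alternative of all symbols on direct transitions from } i \text{ to } j
                \quad (\text{plus } \varepsilon \text{ if } i = j) \\
R_{ij}^{(k)} &= R_{ij}^{(k-1)} \mid R_{ik}^{(k-1)} \bigl(R_{kk}^{(k-1)}\bigr)^{*} R_{kj}^{(k-1)}
                \qquad (1 \le k \le n) \\
r &= R_{1 f_1}^{(n)} \mid R_{1 f_2}^{(n)} \mid \dots
                \qquad \text{for the accepting states } f_1, f_2, \dots
\end{align*}
```

Each step adds one more allowed intermediate state k: a path either avoids k, or goes to k, loops through k any number of times, and then continues to j.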

Applications of Regular Expressions

Many patterns can be expressed compactly
Clear connection between theory and applications
Built-in (first-class objects) in many programming languages (Ruby, JavaScript, Perl,...)
Available as libraries in other programming languages (Java, C#, C, Python,...)
Usable in many tools (e.g. plain text editors)
Caution: Theoretical regular expressions and practical regular expressions differ in many ways

Practical Regular Expressions:
Syntax

Practical regular expressions have many additional functions and shortcut notations
(the corresponding theoretical regular expressions or simpler constructs are given in parentheses)

.: a single arbitrary character (a|b|c|...)
[acdfh]: character class: select a single character ((a|c|d|f|h))
[b-f]: shortcut for continuous range in character class ((b|c|d|e|f))
r+: one or more occurrences of r (rr*)
r?: r or nothing (r|ε; ε cannot be used in practical regular expressions)
r{m,n}: between m and n repetitions of r (r...rr?...r?)
\*,...: \ escapes meta-characters
Meta-characters: |*+?()[]{}.\^$
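
The correspondences in parentheses can be checked experimentally. The sketch below uses Python's `re` module and compares each shortcut with its longer equivalent on a handful of test words; the word list and the helper `same_on` are chosen for this example only, and agreement on a few words is of course no proof of equivalence. Because ε cannot be written, an empty alternative `(b|)` stands in for b|ε.

```python
import re

pairs = [
    (r"[acdfh]", r"(a|c|d|f|h)"),   # character class
    (r"[b-f]", r"(b|c|d|e|f)"),     # range inside a character class
    (r"a+", r"aa*"),                # one or more
    (r"ab?", r"a(b|)"),             # optional; empty alternative replaces ε
    (r"a{2,4}", r"aaa?a?"),         # between 2 and 4 repetitions
]
words = ["", "a", "b", "c", "f", "g", "aa", "ab", "aaa", "aaaa", "aaaaa"]

def same_on(words, p1, p2):
    """True if both patterns accept exactly the same test words."""
    return all(bool(re.fullmatch(p1, w)) == bool(re.fullmatch(p2, w))
               for w in words)

for practical, longhand in pairs:
    print(practical, longhand, same_on(words, practical, longhand))

# An escaped meta-character is an ordinary character
print(re.fullmatch(r"a\*", "a*") is not None)   # True
print(re.fullmatch(r"a\*", "aa") is not None)   # False
```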

Theoretical and Practical Regular Expressions:
Usage Differences

Theory: match a full word; practice: match part of a string
^/$ match the start/end of a string or line
The result of the match is not just yes/no, but includes the position of the match, the substring matched, the substrings before/after the match,...
If there are multiple possible matches, the leftmost, longest match is chosen
(leftmost is more important than longest)
Parts of a string matching parts of a regular expression in parentheses can be assigned to variables
Partial matches can be reused inside the regular expression
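
A short illustration of these differences, using Python's `re` module as one example of a practical engine (the lecture mentions Ruby, Perl, and JavaScript, whose details may differ). The example sentence is made up for this sketch.

```python
import re

text = "An NFA can be converted to a DFA."
m = re.search(r"(D|N)FA", text)              # search: match part of a string
print(m.group(0), m.span())                  # NFA (3, 6): substring and position
print(m.group(1))                            # N: captured by the parentheses
print(repr(text[:m.start()]), repr(text[m.end():]))   # text before / after

print(re.search(r"^DFA", text))              # None: text does not start with DFA
print(re.search(r"a+", "baacaaaa").span())   # (1, 3): leftmost beats longest
print(re.search(r"(a|b)\1", "abba").span())  # (1, 3): \1 reuses the captured text
```

One caveat about "longest" that holds at least for Python's engine: alternatives are tried from left to right, so `a|ab` applied to `ab` yields `a`. The longest-match behavior shown above comes from the greedy quantifier `+`.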

Use of Practical Regular Expressions

Text/document search
String replacements (single or multiple, e.g.: Ruby sub/gsub)
Cutting strings apart (e.g. Ruby split)
Application examples with very large regular expressions:
- C lexical analyzer in a single regular expression
- Fast Unicode Normalization (in pure Ruby, >100 times faster than Twitter implementation)
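
The first three uses look as follows in Python, where `re.sub` with `count=1` roughly plays the role of Ruby's sub, `re.sub` without a count that of gsub, and `re.split` that of split. The last part is a toy tokenizer built from a single regular expression with named groups; it is only meant to hint at the idea behind the large lexical-analyzer example and is not taken from it. The token names and the input string are assumptions of this sketch.

```python
import re

s = "DFA, NFA;  FSA"
print(re.sub(r"(D|N)FA", "automaton", s, count=1))  # automaton, NFA;  FSA
print(re.sub(r"(D|N)FA", "automaton", s))           # automaton, automaton;  FSA
print(re.split(r"[,;]\s*", s))                      # ['DFA', 'NFA', 'FSA']

# Toy tokenizer: one alternative per token kind, whitespace is skipped
token = re.compile(
    r"(?P<number>[0-9]+)"
    r"|(?P<name>[A-Za-z_][A-Za-z_0-9]*)"
    r"|(?P<op>[-+*/=()])"
)
tokens = [(m.lastgroup, m.group()) for m in token.finditer("x1 = 42*(y+7)")]
print(tokens)
```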

Notes on Practical Regular Expressions

Most regular expression engines are more powerful than DFA/NFA/regular languages
Most regular expression engines use backtracking
Some regular expressions may be very slow on some input
Example: String aⁿ, regular expression (a?)ⁿaⁿ (n=3: string: aaa, regular expression: a?a?a?aaa; really slow starting at around n=25)
This may lead to regular expression denial of service attacks
For further analysis, see e.g. https://regex101.com/
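
The slowdown can be observed with a small timing experiment. The sketch below uses Python's `re` module, which is a backtracking engine; the function name `seconds_to_match` and the chosen values of n are assumptions of this example, and the actual times depend on the engine, its version, and the machine. The values of n are kept small on purpose so that the script finishes quickly.

```python
import re
import time

def seconds_to_match(n):
    """Match the string a^n against (a?)^n a^n and measure the time needed."""
    pattern = "a?" * n + "a" * n
    start = time.perf_counter()
    matched = re.fullmatch(pattern, "a" * n) is not None
    return matched, time.perf_counter() - start

for n in (5, 10, 15, 18):
    matched, elapsed = seconds_to_match(n)
    print(n, matched, f"{elapsed:.4f} s")
```

The match always succeeds, but only after the engine has tried many ways of distributing the a's over the optional parts. A DFA-based matcher would read the input once, symbol by symbol.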

Theoretical vs. Practical Regular Expressions

	Theoretical	Practical
Meta-characters	`* \| ( )`	`\|*+?()[]{}.\^$`
ε	yes	no
character classes (`[]`)	no	yes
`+`, `?`, `{}` quantifiers	no	yes
`^`, `$` anchors	no	yes
match where	full word	part of a string
implementation	NFA→DFA	backtracking,...
descriptive power	regular language	more than regular language

Summary of this Lecture

Regular expressions, regular grammars, and finite state automata all
have the same power to generate/accept regular languages
Regular expressions are a very compact representation
DFAs are a very efficient way to implement recognition
Regular expressions and DFAs are very useful for lexical analysis
Creating a DFA by hand from a regular expression is tedious
In FSAs, the number of states is finite ⇒ there are languages that cannot be expressed, e.g. languages with corresponding pairs of parentheses
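
The last point can be illustrated with a small sketch. Checking corresponding pairs of parentheses needs a counter for the current nesting depth, and this counter can become arbitrarily large, so no fixed finite set of states is enough. The function names `balanced` and `max_depth` are chosen for this example.

```python
def balanced(word):
    """Check a word over the alphabet { '(', ')' } for matching pairs."""
    depth = 0                       # plays the role of the current state
    for ch in word:
        depth += 1 if ch == "(" else -1
        if depth < 0:               # a ')' without a preceding '('
            return False
    return depth == 0

def max_depth(word):
    """Largest counter value reached, i.e. how many distinct states were needed."""
    depth = deepest = 0
    for ch in word:
        depth += 1 if ch == "(" else -1
        deepest = max(deepest, depth)
    return deepest

for n in (1, 3, 10):
    word = "(" * n + ")" * n
    print(n, balanced(word), max_depth(word))

print(balanced("(()"))              # False
```

For every n there is a word that needs the counter value n, so an automaton with any fixed number of states fails on some sufficiently deeply nested input.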

Homework Submission

Deadline: May 2, 2022 (Monday!), 19:00

Format: A4 single page (using both sides is okay; NO cover page), easily readable handwriting (NO printouts), name (kanji and kana) and student number at the top right

Where to submit: Box in front of room O-529 (building O, 5th floor)

Homework

Construct the state transition diagram for the NFA corresponding to the following grammar
S → xB | yB | yC, A → xC | z | yS, B → zD | zC | xB | y, C → yA | aD | z

Convert the automaton defined by the following transition table to a right linear grammar

	0	1
→T	G	M
*G	K	L
H	M	G
K	H	-
*L	M	T
M	L	G

Construct the state transition diagram for the regular expression rp|h*s
Write down two versions:
1. The result of the full procedure (with all ε transitions)
2. A version that is as simple as possible
Bring your notebook PC with you next week (May 6).
Make sure you can use flex, bison, gcc, make, diff, and m4 (no need to submit)

Glossary

regular expression: 正規表現
minimization: 最小化
partition: 分割
isomorphic: 同型 (同形) の
(left/right) linear rule: (左・右) 線形規則
constant rule: 定数規則
(left/right) linear grammar: (左・右) 線形文法
delimiter: 区切り文字
alternative: 選択肢
repetition: 繰返し
meta-character: メタ文字
priority: 優先度
theoretical regular expressions: 理論的 (な) 正規表現
practical regular expressions: 実用的 (な) 正規表現
first-class object: 第一級オブジェクト、ファーストクラスオブジェクト
notation(al): 表記 (上の)
arbitrary: 任意
leftmost: できるだけ左