Linear Grammars and Regular Expressions
(正規文法と正規表現)
4th lecture, April 29, 2022
Language Theory and Compilers
https://www.sw.it.aoyama.ac.jp/2022/Compiler/lecture4.html
Martin J. Dürst

© 2005-22 Martin J. Dürst 青山学院大学
Today's Schedule
- Last week's homework, leftovers
- Regular Grammars
- Regular Expressions
- Formal definition
- Conversion to an NFA
- Conversion from an FSA
- Practical regular expressions
Schedule for Next Few Weeks
- Friday, April 29 (Showa Day): Lectures take place, lecture 4
- Monday, May 2, 19:00: Homework deadline
- Friday, May 6: Lecture 5
Leftovers
Last Week's Homework 4
Check the versions of flex, bison, gcc, make, and m4 that you installed (no
need to submit, but contact me by mail if you have a problem)
Last Week's Homework 1
Draw a state transition diagram for a finite state automaton that recognizes
all inputs that (at the same time)
- Start with gf
- End with fg
- Contain an even number of h
- Contain no other symbols than f, g, and h
Last Week's Homework 2
Draw the state transition diagram for the NFA in the state transition table
below:
| | ε | 0 | 1 |
| →S | {P} | {Q} | {E} |
| E | {} | {P} | {P, T} |
| P | {} | {T} | {} |
| *Q | {} | {} | {S, P} |
| T | {P} | {} | {Q} |
Last Week's Homework 3
Create the state transition table of the DFA that is equivalent to the NFA
in homework 2 (do not rename states).
Plan for this Lecture
- Finite state automata (FSA)
- Deterministic finite automaton (DFA)
- Non-deterministic finite automaton (NFA)
- Regular grammar
- Left linear grammar
- Right linear grammar
- Regular expression
All of these are equivalent: they define/accept exactly the regular languages
Example of Right Linear Grammar

A → fB | gA
B → gA | fC | f
C → gA | fC | f
Right Linear Grammars and FSAs
Right linear grammars and NFAs correspond as follows (not considering ε transitions):
| FSA | Right Linear Grammar |
| states | nonterminal symbols |
| start state | start symbol |
| transitions moving to an accepting state | constant rules (e.g. A → c) |
| all transitions | right linear rules (e.g. A → gB) |
There is a similar correspondence for left linear grammars (imagine reading
the input backwards)
From NFA to Right Linear Grammar
- Convert all states to nonterminal symbols
(start state→start symbol)
- Convert all transitions to right linear rules
- Convert all transitions to accepting states to constant rules
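The three steps above can be sketched in Python. This is only an illustration under stated assumptions: the NFA is given as a dictionary from (state, symbol) pairs to sets of target states, ε transitions and an accepting start state (rule S → ε) are not handled, and the names `nfa_to_right_linear` and `transitions` are made up for this sketch. The automaton used is one that corresponds to the example grammar shown earlier (accepting state C).

```python
def nfa_to_right_linear(transitions, accepting):
    """Return a dict: nonterminal -> list of right-hand sides (strings)."""
    rules = {}
    for (state, symbol), targets in sorted(transitions.items()):
        for target in sorted(targets):
            # every transition becomes a right linear rule: state -> symbol target
            rules.setdefault(state, []).append(symbol + target)
            # a transition into an accepting state also becomes a constant rule
            if target in accepting:
                rules.setdefault(state, []).append(symbol)
    return rules

# hypothetical encoding of an automaton with states A, B, C over {f, g}
transitions = {
    ("A", "f"): {"B"}, ("A", "g"): {"A"},
    ("B", "f"): {"C"}, ("B", "g"): {"A"},
    ("C", "f"): {"C"}, ("C", "g"): {"A"},
}
grammar = nfa_to_right_linear(transitions, accepting={"C"})
for nonterminal, alternatives in grammar.items():
    print(nonterminal, "→", " | ".join(alternatives))
```

Up to the order of alternatives, the printed rules are the rules of the example right linear grammar.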
From Right Linear Grammar to NFA
- Create a state for each nonterminal symbol
(start symbol→start state)
- Convert all right linear rules to transitions
- Create a new state (e.g. named F) only used for acceptance,
and convert all constant rules to transitions to this state
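As a rough sketch of this direction, the following Python code builds an NFA from the example grammar and then simulates it. It assumes single-character terminals and nonterminals, no ε rules, and that the name F is not already used as a nonterminal; the function names are hypothetical.

```python
def right_linear_to_nfa(grammar, start, final="F"):
    """grammar: dict nonterminal -> list of right-hand sides such as "fB" or "f"."""
    transitions = {}
    for nonterminal, alternatives in grammar.items():
        for rhs in alternatives:
            symbol, target = rhs[0], rhs[1:]
            # a constant rule has no nonterminal: go to the extra accepting state
            transitions.setdefault((nonterminal, symbol), set()).add(target or final)
    return transitions, start, {final}

def accepts(nfa, word):
    """Simulate the NFA by tracking the set of states it may be in."""
    transitions, start, accepting = nfa
    current = {start}
    for symbol in word:
        current = {t for s in current for t in transitions.get((s, symbol), ())}
    return bool(current & accepting)

grammar = {"A": ["fB", "gA"], "B": ["gA", "fC", "f"], "C": ["gA", "fC", "f"]}
nfa = right_linear_to_nfa(grammar, start="A")
print(accepts(nfa, "gff"), accepts(nfa, "fgf"))   # True False
```

The result is non-deterministic even for this small grammar: from B, the symbol f leads both to C and to the new state F.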
Linear Grammar Definitions
Simple Rewriting Rules
| Rule Shape | Name |
| A → gB | right linear rule (nonterminal on the right) |
| A → Bg | left linear rule (nonterminal on the left) |
| A → c | constant rule |
Right linear grammar = right linear rules + constant rules
Left linear grammar = left linear rules + constant rules
(in both cases, a special rule S → ε is allowed)
Both left linear grammars and right linear grammars are regular
grammars
(A grammar that contains both left linear rules and right linear rules is
called a linear grammar. A linear grammar is not a
regular grammar, but a kind of context-free grammar.)
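The definitions above can be turned into a small classifier. The sketch below is an illustration only: it assumes that uppercase letters are nonterminals and everything else is a terminal, it writes the special rule S → ε as an empty right-hand side without checking that it belongs to the start symbol, and `mixed_grammar` is a hypothetical example that is not part of the lecture.

```python
def rule_kind(rhs):
    """Classify one right-hand side; uppercase letters count as nonterminals."""
    positions = [i for i, ch in enumerate(rhs) if ch.isupper()]
    if not positions:
        # A → c, or the special rule S → ε written as ""
        return "constant" if len(rhs) <= 1 else "other"
    if len(positions) == 1 and len(rhs) == 2:
        return "right linear" if positions[0] == 1 else "left linear"
    return "other"

def grammar_kind(grammar):
    kinds = {rule_kind(rhs) for alternatives in grammar.values() for rhs in alternatives}
    if "other" in kinds:
        return "other"
    if {"left linear", "right linear"} <= kinds:
        return "linear, but not regular"
    return "left linear" if "left linear" in kinds else "right linear"

lecture_grammar = {"A": ["fB", "gA"], "B": ["gA", "fC", "f"], "C": ["gA", "fC", "f"]}
mixed_grammar = {"S": ["aT", "c"], "T": ["Sb"]}   # hypothetical example
print(grammar_kind(lecture_grammar))   # right linear
print(grammar_kind(mixed_grammar))     # linear, but not regular
```

The mixed grammar generates the words acb, aacbb, aaacbbb, and so on, with equally many a and b, which is not a regular language.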
Today's Outlook
Summary up to now:
- Finite state automata (FSA): Deterministic finite automata (DFA) and
non-deterministic finite automata (NFA)
- Regular grammar: Left linear grammar and right linear grammar
- All these have the same power, generating/recognizing regular
languages.
Challenge: Find a more compact representation for regular languages.
Example of Regular Expression
To find DFA and NFA in a document, use the regular expression (D|N)FA
(also written /(D|N)FA/; the slashes // are the delimiters for regular
expressions in Ruby, Perl, JavaScript,...)
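The same search can be tried out in Python, used here only as a stand-in for the languages named above. Python has no // delimiters, so the expression is passed as a string to the standard `re` module; the sample text is made up.

```python
import re

# hypothetical sample document
text = "Every NFA can be converted to an equivalent DFA; FSA covers both."
pattern = re.compile(r"(D|N)FA")
found = [m.group(0) for m in pattern.finditer(text)]
print(found)   # ['NFA', 'DFA']
```

FSA is not reported, because S is neither D nor N.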
Properties of Regular Expressions
- Denote a set of patterns or words (i.e. a language)
- Match some input
- Have a structure similar to arithmetic expressions (in general,
expression ↔ 式)
- Very compact
- Widely used, very useful
- Two main variants:
- Theoretical regular expressions
- Practical regular expressions
Theoretical Regular Expression: Syntax
- a: {a} (a single symbol denotes itself)
- abc: {abc} (concatenation, single word)
- a*: {ε, a, aa, aaa,...} (Kleene closure)
- a|b: {a, b} (alternative)
- (a|b): same as expression without parentheses
Examples of Regular Expressions
- Number of symbols:
  - Even: (aa)*
  - Odd: a(aa)* or (aa)*a
  - Remainder is 2 when divided by 3: aa(aaa)*
- A specific symbol sequence at the start of a word: abc(a|b|c)*
- A specific symbol sequence at the end of a word: (a|b|c)*abc
- A specific symbol sequence in the middle of a word: (a|b|c)*abc(a|b|c)*
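These examples can be checked mechanically. The sketch below uses Python's `re.fullmatch`, which requires the whole word to match and therefore comes close to the theoretical notion of matching; the dictionary keys are labels invented for this illustration.

```python
import re

examples = {
    "even":   r"(aa)*",
    "odd":    r"a(aa)*",
    "mod3=2": r"aa(aaa)*",
    "prefix": r"abc(a|b|c)*",
    "suffix": r"(a|b|c)*abc",
    "infix":  r"(a|b|c)*abc(a|b|c)*",
}

def matches(name, word):
    """True if the whole word is in the language of the named expression."""
    return re.fullmatch(examples[name], word) is not None

# lengths of words a...a (up to 8 symbols) accepted by aa(aaa)*
print([n for n in range(9) if matches("mod3=2", "a" * n)])   # [2, 5, 8]
print(matches("suffix", "cabc"), matches("prefix", "cabc"))  # True False
```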
Notation of Regular Expressions
- Only characters themselves, concatenation, alternative, and repetition
are represented
- "Usual" characters represent themselves
- A small set of characters has a special role (meta-characters: |, *, (, ), ε)
- Meta-characters may need to be escaped
Formal Definition of Theoretical Regular Expressions
L(r) denotes the language defined by regular expression r.
Theoretical Regular Expressions over Alphabet Σ
| Priority | Regular Expression | Condition | Defined Language | Notes |
| | ε, a | a ∈ Σ | {ε} or {a} | literals |
| very high | (r) | r is a regular expression | L((r)) = L(r) | grouping |
| high | r* | r is a regular expression | L(r*) = (L(r))* | Kleene closure |
| low | rs | r, s are regular expressions | L(rs) = L(r)L(s) | concatenation |
| very low | r|s | r, s are regular expressions | L(r|s) = L(r) ∪ L(s) | set union |
Grammar for Theoretical Regular Expressions
- Regular expressions themselves form a language
(set of all regular expressions)
- Grammar: R → ε, R → a, R → b,..., R → R|R, R → RR, R → R*, R → (R)
- This is not a regular language, but a context-free language
- The alphabet of a regular expression is the alphabet of the target
language (e.g. a, b,...) and the meta-characters (ε, |, *, (, ))
Caution: Priority
Make sure you get priorities right!
| Expression 1 | Matches 1 | Expression 2 | Matches 2 |
| ab|c | ab, c | a(b|c) | ab, ac |
| abc* | ab, abc, abcc,... | (abc)* | ε, abc, abcabc,... |
| a|b|c* | a, b, ε, c, cc,... | (a|b|c)* | ε, a, b, c, aa, ab, ac, ba,... |
| ab|c*|d | ab, ε, c, cc, ccc,..., d | a(b|c)*d | ad, abd, acd, abbd, abcd, acbd, accd,... |
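The effect of the priorities can be observed directly. The following sketch filters a hand-picked list of candidate words with Python's `re.fullmatch`; Python's `re` module uses the same priorities for `*`, concatenation, and `|` as the table of theoretical regular expressions. The helper name `language_sample` and the candidate list are assumptions of this illustration.

```python
import re

def language_sample(regex, candidates):
    """Return those candidate words that belong to the language of regex."""
    return [w for w in candidates if re.fullmatch(regex, w)]

candidates = ["", "a", "c", "ab", "ac", "abc", "abcc", "abcabc"]
for regex in ["ab|c", "a(b|c)", "abc*", "(abc)*"]:
    print(regex, language_sample(regex, candidates))
```

The empty string in the output for (abc)* corresponds to ε.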
Regular Expression to NFA: Symbols, Concatenation
The NFA for a symbol a has two states and one arrow:
- Start state
- Accepting state
- Arrow labeled a from start state to accepting state (same for ε)
The NFA for the regular expression rs connects the accepting state of r
with the start state of s through an ε transition.
Regular Expression to NFA: Alternative
The NFA for r|s is constructed from the NFAs for
r and s as follows:

The additional ε connections are necessary to clearly commit to
either r or s.
Regular Expression to NFA: Repetition
The NFA for r* is constructed as follows:

Conversion: Regular Expression to NFA
- Construct NFA bottom-up, starting with smallest subexpressions
- Each subexpression is converted to an NFA
- Each subexpression has one start state and one accepting state
- When combining subexpressions, connect start states and accepting states
to form a larger NFA
- During construction:
  - Start state on the left (no incoming arrow)
  - Accepting state on the right (no double circle)
- When finished, add incoming arrow for start state and double circle for
accepting state
Example of Conversion
Regular expression: a|b*c
In some cases, some of the ε transitions may be eliminated, or the NFA may
otherwise be simplified.
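The bottom-up construction can be sketched in Python as follows. Since the diagrams for alternative and repetition are not reproduced in this text, the ε wiring used here is the common textbook (Thompson-style) construction and may differ in detail from the lecture's figures. All names (`Builder`, `parse`, `accepts`) are hypothetical, and the parser does no error checking.

```python
from itertools import count

EPS = None  # label used for ε transitions (a convention of this sketch)

class Builder:
    """Collects edges; every fragment is a pair (start state, accepting state)."""
    def __init__(self):
        self.ids, self.edges = count(), []   # edges: (from, label, to)
    def new_state(self):
        return next(self.ids)
    def symbol(self, a):                     # two states, one arrow
        s, t = self.new_state(), self.new_state()
        self.edges.append((s, a, t))
        return s, t
    def concat(self, r, s):                  # accepting state of r -ε-> start of s
        self.edges.append((r[1], EPS, s[0]))
        return r[0], s[1]
    def alt(self, r, s):                     # new start/accepting states around r and s
        start, end = self.new_state(), self.new_state()
        for frag in (r, s):
            self.edges += [(start, EPS, frag[0]), (frag[1], EPS, end)]
        return start, end
    def star(self, r):                       # skip r entirely, or loop back into it
        start, end = self.new_state(), self.new_state()
        self.edges += [(start, EPS, r[0]), (r[1], EPS, end),
                       (start, EPS, end), (r[1], EPS, r[0])]
        return start, end

def parse(regex, b):
    """Recursive descent parser following the priorities | < concatenation < *."""
    pos = 0
    def peek():
        return regex[pos] if pos < len(regex) else None
    def alternation():
        nonlocal pos
        frag = concatenation()
        while peek() == "|":
            pos += 1
            frag = b.alt(frag, concatenation())
        return frag
    def concatenation():
        frag = repetition()
        while peek() is not None and peek() not in "|)":
            frag = b.concat(frag, repetition())
        return frag
    def repetition():
        nonlocal pos
        frag = atom()
        while peek() == "*":
            pos += 1
            frag = b.star(frag)
        return frag
    def atom():
        nonlocal pos
        ch = peek()
        pos += 1
        if ch == "(":
            frag = alternation()
            pos += 1                         # skip the closing parenthesis
            return frag
        return b.symbol(EPS if ch == "ε" else ch)
    return alternation()

def closure(states, edges):
    """All states reachable from the given states through ε transitions only."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for frm, label, to in edges:
            if frm == s and label is EPS and to not in seen:
                seen.add(to)
                stack.append(to)
    return seen

def accepts(regex, word):
    b = Builder()
    start, end = parse(regex, b)
    current = closure({start}, b.edges)
    for ch in word:
        moved = {to for frm, label, to in b.edges if frm in current and label == ch}
        current = closure(moved, b.edges)
    return end in current

print([w for w in ["", "a", "c", "bbc", "ab", "ac"] if accepts("a|b*c", w)])
```

For a|b*c this full procedure creates ten states, which illustrates why simplifying the result afterwards is worthwhile.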
Conversion: FSA to Regular Expression
Algorithmic conversion is possible, but complicated
General procedure:
1. Create regular expressions for getting from state A to state
B directly for all pairs of states
2. Select a single state, and create all regular expressions that pass
through this intermediate state
3. Repeat step 2, increasing the number of intermediate states
4. Simplify intermediate regular expressions as much as possible (they can
get quite complex)
Once you understand which language the FSA accepts, it is often easier for
a human to write a regular expression for this language directly.
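The general procedure can be sketched in Python for a DFA. This is an illustration under several assumptions: expressions are built as plain strings in Python `re` syntax, no simplification step is performed (so the output is correct but long), the empty set is represented by `None`, and the automaton is the three-state automaton corresponding to the example grammar from the beginning of this lecture. To stay independent of the exact output, the result is compared against a direct simulation of the DFA on all words of up to six symbols.

```python
import re
from itertools import product

def union(r, s):
    if r is None or s is None:
        return r if s is None else s
    return r if r == s else f"{r}|{s}"

def group(r):
    return f"({r})" if "|" in r else r       # conservative parenthesizing

def concat(*parts):
    if any(p is None for p in parts):
        return None                          # nothing can pass through an empty set
    return "".join(group(p) for p in parts)

def star(r):
    return "" if r is None else f"({r})*"    # "" stands for ε

def fsa_to_regex(states, delta, start, accepting):
    # step 1: direct (one-symbol) paths between every pair of states
    R = {(i, j): None for i in states for j in states}
    for (i, a), j in delta.items():
        R[i, j] = union(R[i, j], a)
    # steps 2 and 3: allow one more intermediate state k in each round
    for k in states:
        R = {(i, j): union(R[i, j], concat(R[i, k], star(R[k, k]), R[k, j]))
             for i in states for j in states}
    result = None
    for f in accepting:
        result = union(result, R[start, f])
    if start in accepting:                   # the empty word is accepted as well
        result = "" if result is None else f"({result})?"
    return result

delta = {("A", "f"): "B", ("A", "g"): "A", ("B", "f"): "C",
         ("B", "g"): "A", ("C", "f"): "C", ("C", "g"): "A"}
regex = fsa_to_regex("ABC", delta, "A", {"C"})

def dfa_accepts(word):
    state = "A"
    for a in word:
        state = delta[state, a]
    return state == "C"

words = ["".join(w) for n in range(7) for w in product("fg", repeat=n)]
agree = all((re.fullmatch(regex, w) is not None) == dfa_accepts(w) for w in words)
print(regex)
print(agree)
```

A human looking at the same automaton would probably just write (f|g)*ff, which shows how much the unsimplified intermediate expressions grow.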
Applications of Regular Expressions
- Many patterns can be expressed compactly
- Clear connection between theory and applications
- Built-in (first-class objects) in many programming languages (Ruby,
JavaScript, Perl,...)
- Available as libraries in other programming languages (Java, C#, C,
Python,...)
- Usable in many tools (e.g. plain text editors)
- Caution: Theoretical regular expressions and practical regular
expressions differ in many ways
Practical Regular Expressions: Syntax
Practical regular expressions have many additional functions and shortcut
notations
(the corresponding theoretical regular expressions or simpler constructs are
given in parentheses)
- . : a single arbitrary character (a|b|c|...)
- [acdfh] : character class, selects a single character ((a|c|d|f|h))
- [b-f] : shortcut for a contiguous range in a character class ((b|c|d|e|f))
- r+ : one or more occurrences of r (rr*)
- r? : r or nothing (r|ε; ε cannot be used in practical regular expressions)
- r{m,n} : between m and n repetitions of r (r...rr?...r?)
- \*,... : \ escapes meta-characters
- Meta-characters: |*+?()[]{}.\^$
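The correspondences in this list can be spot-checked with Python's `re` module, used here as one example of a practical regular expression engine. Each shortcut is compared with its longer equivalent on a small, hand-picked word list; an empty alternative such as `(a|)` stands in for ε, which Python does not offer. The word list and variable names are assumptions of this sketch.

```python
import re

equivalents = [
    (r"[acdfh]", r"(a|c|d|f|h)"),
    (r"[b-f]",   r"(b|c|d|e|f)"),
    (r"a+",      r"aa*"),
    (r"a?",      r"(a|)"),      # empty alternative instead of ε
    (r"a{2,4}",  r"aaa?a?"),
]
words = ["", "a", "aa", "aaa", "aaaa", "aaaaa", "b", "c", "f", "g", "h"]

checks = {}
for practical, basic in equivalents:
    # the two expressions must agree on every test word
    checks[practical] = all(
        bool(re.fullmatch(practical, w)) == bool(re.fullmatch(basic, w))
        for w in words)
    print(practical, "vs.", basic, checks[practical])

# escaping: \* matches a literal asterisk
print(re.fullmatch(r"a\*", "a*") is not None)   # True
```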
Theoretical and Practical Regular Expressions: Usage Differences
- Theory: match a full word; practice: match part of a string
  - ^/$ match the start/end of a string or line
- The result of the match is not just yes/no, but includes the position of
the match, the substring matched, the substrings before/after the
match,...
- If there are multiple possible matches, the leftmost, longest match is
chosen
(leftmost is more important than longest)
- Parts of a string matching parts of a regular expression in parentheses
can be assigned to variables
- Partial matches can be reused inside the regular expression
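Several of these differences can be seen in a short experiment. The sketch below uses Python's `re` module as a stand-in for the engines discussed here; method names such as `search`, `fullmatch`, and `groups` are Python's, not Ruby's, and the sample string is made up.

```python
import re

text = "x = 10; yy = 2005"

# part of a string: the leftmost run of digits wins, even though a longer one follows
m = re.search(r"[0-9]+", text)
print(m.group(0), m.start(), m.end())            # 10 4 6
print(repr(text[:m.start()]), repr(text[m.end():]))   # text before / after the match

# matching the full string, as in theory, fails here
print(re.fullmatch(r"[0-9]+", text))             # None

# anchors: no digits at the start, but digits at the end
print(re.search(r"^[0-9]+", text), re.search(r"[0-9]+$", text).group(0))

# parenthesized parts can be assigned to variables
name, value = re.search(r"([a-z]+) = ([0-9]+)", text).groups()
print(name, value)                               # x 10

# a partial match reused inside the expression: \1 repeats what group 1 matched
doubled = re.search(r"([a-z])\1", text).group(0)
print(doubled)                                   # yy
```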
Use of Practical Regular Expressions
- Text/document search
- String replacements (single or multiple, e.g. Ruby sub/gsub)
- Cutting strings apart (e.g. Ruby split)
- Application examples with very large regular expressions:
  - C lexical analyzer in a single regular expression
  - Fast Unicode Normalization (in pure Ruby, >100 times faster than
Twitter implementation)
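Replacement and splitting can be illustrated with Python counterparts of the Ruby methods mentioned above: `re.sub` with `count=1` behaves roughly like `sub`, `re.sub` without a count like `gsub`, and `re.split` like `split`. The input line is a made-up example.

```python
import re

line = "flex,  bison;gcc , make"   # hypothetical input with untidy separators

first_only = re.sub(r"[,;]", " and", line, count=1)   # replace the first match only
all_matches = re.sub(r"\s*[,;]\s*", ", ", line)       # replace every match
parts = re.split(r"\s*[,;]\s*", line)                 # cut the string at every match

print(first_only)    # flex and  bison;gcc , make
print(all_matches)   # flex, bison, gcc, make
print(parts)         # ['flex', 'bison', 'gcc', 'make']
```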
Notes on Practical Regular Expressions
- Most regular expression engines are more powerful than DFA/NFA/regular
languages
- Most regular expression engines use backtracking
- Some regular expressions may be very slow on some input
Example: string a^n, regular expression (a?)^n a^n (for n=3:
string aaa, regular expression a?a?a?aaa; really slow starting at
n ≈ 25)
- This may lead to regular expression denial of service attacks
- For further analysis, see e.g. https://regex101.com/
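The slowdown can be measured with a few lines of Python, whose `re` module is a backtracking engine. This is only a sketch: the values of n are deliberately kept small so that the script finishes quickly, the timings depend on machine and Python version, and the function name `timed_match` is made up.

```python
import re
import time

def timed_match(n):
    """Match the string a^n against the expression (a?)^n a^n and time it."""
    pattern = "a?" * n + "a" * n
    start = time.perf_counter()
    matched = re.fullmatch(pattern, "a" * n) is not None
    return matched, time.perf_counter() - start

for n in (5, 10, 15, 18):
    matched, seconds = timed_match(n)
    print(n, matched, f"{seconds:.4f} s")
```

The match always succeeds, but each a? first consumes an a and has to give it back during backtracking, so the time grows roughly exponentially with n. An implementation based on a DFA or on simulating the NFA would not show this behavior.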
Theoretical vs. Practical Regular Expressions
| | Theoretical | Practical |
| Meta-characters | *|() | |*+?()[]{}.\^$ |
| ε | yes | no |
| character classes ([]) | no | yes |
| +, ?, {} quantifiers | no | yes |
| ^, $ anchors | no | yes |
| match where | full word | part of a string |
| implementation | NFA→DFA | backtracking,... |
| descriptive power | regular language | more than regular language |
Summary of this Lecture
- Regular expressions, regular grammars, and finite state automata all
have the same power to generate/accept regular languages
- Regular expressions are a very compact representation
- DFAs are a very efficient way to implement recognition
- Regular expressions and DFAs are very useful for lexical analysis
- Creating a DFA by hand from a regular expression is tedious
- In FSAs, the number of states is finite ⇒ there are languages that
cannot be expressed, e.g. languages with corresponding pairs of
parentheses
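The last point can be made concrete with a small sketch: checking corresponding pairs of parentheses needs a counter that can grow without bound, and such a counter cannot be encoded in a finite number of states. The function below is an illustration only and is not taken from the lecture.

```python
def balanced(word):
    """Check corresponding pairs of parentheses with an unbounded counter."""
    depth = 0                      # may become arbitrarily large
    for ch in word:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # closing parenthesis without a partner
                return False
    return depth == 0

print(balanced("(()())"), balanced("(()"), balanced(")("))   # True False False
```

An FSA could only simulate this counter up to some fixed maximum nesting depth.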
Homework Submission
Deadline: May 2, 2022 (Monday!), 19:00
Format: A4 single page (using both sides is okay; NO cover page), easily
readable handwriting (NO printouts), name (kanji and kana) and student number
at the top right
Where to submit: Box in front of room O-529 (building O, 5th floor)
Homework
- Construct the state transition diagram for the NFA corresponding to the
following grammar
S → xB | yB | yC, A → xC | z | yS, B → zD | zC | xB | y, C → yA | aD | z
- Convert the automaton defined by the following transition table to a
right linear grammar
| | 0 | 1 |
| →T | G | M |
| *G | K | L |
| H | M | G |
| K | H | - |
| *L | M | T |
| M | L | G |
- Construct the state transition diagram for the regular expression
rp|h*s
Write down two versions:
- The result of the full procedure (with all ε transitions)
- A version that is as simple as possible
- Bring your notebook PC with you next week (May 6).
Make sure you can use flex, bison, gcc, make, diff, and m4
(no need to submit)
Glossary
- regular expression: 正規表現
- minimization: 最小化
- partition: 分割
- isomorphic: 同型 (同形) の
- (left/right) linear rule: (左・右) 線形規則
- constant rule: 定数規則
- (left/right) linear grammar: (左・右) 線形文法
- delimiter: 区切り文字
- alternative: 選択肢
- repetition: 繰返し
- meta-character: メタ文字
- priority: 優先度
- theoretical regular expressions: 理論的 (な) 正規表現
- practical regular expressions: 実用的 (な) 正規表現
- first-class object: 第一級オブジェクト、ファーストクラスオブジェクト
- notation(al): 表記 (上の)
- arbitrary: 任意
- leftmost: できるだけ左