Algorithms for String Matching
Data Structures and Algorithms
11th lecture, December 8, 2022
Martin J. Dürst

© 2009-22 Martin J. Dürst, Aoyama Gakuin University
Today's Schedule
- Summary/leftovers of last lecture
- Overview of string matching
- Simplistic implementation
- Rabin-Karp algorithm
- Knuth-Morris-Pratt algorithm
- Boyer-Moore algorithm
- String matching and character encoding
- Summary
Leftovers from Last Lecture
Summary of Last Lecture
- A hash table implements a dictionary ADT using a hash function
- The hash function converts the keys into a random-like distribution in a
repeatable way
- Hash tables allow search, insertion, and deletion all in time O(1) (on average)
- The main methods for collision resolution are chaining and
open addressing
- Expansion (and shrinking) of a hash table is O(n),
but this can be amortized to O(1) per insertion
- In Ruby and many other programming languages, dictionaries/hash tables
are very convenient data structures
- The implementation of hashing in Ruby uses open addressing starting with
Ruby 2.4 (before: chaining)
Remaining Schedule
- December 8 (today): this lecture
- December 15: 12th lecture (Dynamic Programming)
- December 22: 13th lecture (Algorithm Design Strategies)
- January 12: 14th lecture (NP-completeness, reducibility)
- January 19: 15th lecture (approximation algorithms)
- (most probably) January 26, 09:30-10:55 (85 min): Term Final Exam
Term Final Exam
- Coverage:
- Complete contents of lecture and handouts (incl. programs)
- No need to write Ruby code, but need to understand it, and to
write your own pseudocode
- Type of problems:
- Similar to problems in Discrete Mathematics I
- Past exams: 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, (2020 online)
- How to view example solutions:
- Press the [表示 (S)] (Show) button or the [S] key. To revert, press the [非表示 (P)] (Hide) button or the [P] key.
- Sometimes, more than one key press is needed to start switching.
- Some images and example solutions are missing.
- Important points:
- Read problems carefully (distinguish between calculation, proof, ...)
- Be able to explain concepts in your own words
- Combine and apply knowledge from different lectures
- Write clearly
- Answers can be in Japanese or English
Overview of String Matching
Goal: To find a short pattern p in a long text t
- Pattern p (needle): String of length m
(examples: single word, phrase,...)
- Text t (haystack): String of length n (in general, n ≫ m)
(examples: Single document, document collection)
- Task: Search pattern inside text (existence/location/number)
- The location is usually called shift
- The character in t at position s is written t[s]
- The substring of shift s and length m in text t is written $t_s$
Task: Find "pattern" in the text below
Background: String matching algorithms in various programming languages
Simplistic Implementation
- Compare the pattern with each substring of length m of the text
- There are n-m+1 = O(n) substrings
- Comparing the pattern with a substring takes time O(m)
text: e f a n y a k a m n
pattern: a o y a m a
next pattern position: a o y a m a
- A simplistic implementation will take time O(n) · O(m) = O(nm)
- If the comparison is stopped early:
- In the general case, time will be close to O(n)
- In the worst case, time will be O(nm)
Actual example: t = aaa....aaaab, p = aa..ab
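The simplistic approach can be sketched in a few lines of Ruby. This is only an illustration under assumptions, not the code from the handouts: the method name naive_search is hypothetical, and it returns the first shift only (or nil). The inner loop stops at the first mismatch, which is the "stopped early" variant.

```ruby
# Simplistic string matching: try every shift s from 0 to n-m.
# Returns the first shift where the pattern occurs, or nil.
def naive_search(text, pattern)
  n = text.length
  m = pattern.length
  (0..n - m).each do |s|
    j = 0
    # Compare character by character, stopping at the first mismatch.
    j += 1 while j < m && text[s + j] == pattern[j]
    return s if j == m
  end
  nil
end

p naive_search("efanyakamn", "aoyama")
p naive_search("aaaaab", "aab")
```

With a text such as aaa....aaaab and a pattern such as aa..ab, the inner loop runs almost to the end of the pattern at every shift, which is where the O(nm) worst case comes from.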
Overview of the Rabin-Karp Algorithm
- Define a hash function hf()
- Calculate the hash value hf(p) of pattern
- Calculate the hash value of each substring of length m of text
- Compare with hf(p)
- If the hash values are equal, check the actual substring against the pattern
- A simplistic implementation is O(nm)
- This can be improved by selecting an appropriate hash function
Selecting the Hash Function
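- Use a hash function so that given $hf(t_s)$, it is easy to calculate $hf(t_{s+1})$
- The hash of the pattern is
$hf(p) = (p[0] \cdot b^{m-1} + p[1] \cdot b^{m-2} + \dots + p[m-1] \cdot b^0) \bmod d$
(b is the size of the alphabet, d is an arbitrarily selected divisor)
- The hash of the candidate substrings is:
$hf(t_s) = (t[s] \cdot b^{m-1} + t[s+1] \cdot b^{m-2} + \dots + t[s+m-1] \cdot b^0) \bmod d$
$hf(t_{s+1}) = (t[s+1] \cdot b^{m-1} + t[s+2] \cdot b^{m-2} + \dots + t[s+m] \cdot b^0) \bmod d$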
Speeding up the Hash Function
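- Using the properties of the modulo function (Discrete Mathematics I, Modular Arithmetic):
$hf(t_{s+1}) = ((hf(t_s) - t[s] \cdot b^{m-1}) \cdot b + t[s+m]) \bmod d$
$= ((hf(t_s) - t[s] \cdot (b^{m-1} \bmod d)) \cdot b + t[s+m]) \bmod d$
- From $hf(t_s)$, $hf(t_{s+1})$ can be calculated in time O(1)
- Therefore, the overall time is O(n), provided that only few substrings with equal hash values have to be checked against the pattern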
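The O(1) update can be checked with a small Ruby sketch. The names hf and roll, the use of byte values as "digits", and the values b = 256 and d = 101 are assumptions chosen for illustration only. The loop compares the rolled hash with a hash computed from scratch at every shift.

```ruby
# Polynomial hash of a string, treating each byte as one digit in base b.
def hf(str, b: 256, d: 101)
  str.each_byte.reduce(0) { |h, byte| (h * b + byte) % d }
end

# O(1) update: remove the outgoing byte, shift by b, add the incoming byte.
# high is b**(m-1) mod d, computed once before the loop.
def roll(h, outgoing, incoming, high, b: 256, d: 101)
  ((h - outgoing * high) * b + incoming) % d
end

text = "string matching"
m = 5
high = 256.pow(m - 1, 101)
h = hf(text[0, m])
(0...text.length - m).each do |s|
  h = roll(h, text.getbyte(s), text.getbyte(s + m), high)
  raise "mismatch at shift #{s + 1}" unless h == hf(text[s + 1, m])
end
puts "rolling hash agrees with direct hash at every shift"
```

Ruby's % operator returns a non-negative result for a positive divisor, so the subtraction does not need an extra correction step.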
Example of Rabin-Karp
Pattern: 081205
Text: 28498608120598743297
b = 10, d = 9
(for manual calculation, casting out nines is convenient)
Example of implementing Rabin-Karp algorithm in Excel: BRabin-Karp.xls
Pseudocode of Rabin-Karp algorithm: Bstringmatch.rb
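The example above can be reproduced with the following Ruby sketch. It is a minimal illustration, not the contents of Bstringmatch.rb: the method name rabin_karp is hypothetical, and the code assumes that pattern and text consist of decimal digits, so that each character's value is simply its digit. With b = 10 and d = 9, the hash of the pattern 081205 is 7.

```ruby
# Rabin-Karp for digit strings: returns all shifts where pattern occurs.
def rabin_karp(text, pattern, b: 10, d: 9)
  n = text.length
  m = pattern.length
  return [] if m.zero? || m > n

  high = b.pow(m - 1, d)          # b**(m-1) mod d, used when rolling
  hp = 0                          # hash of the pattern
  ht = 0                          # hash of the current substring
  m.times do |i|
    hp = (hp * b + pattern[i].to_i) % d
    ht = (ht * b + text[i].to_i) % d
  end

  matches = []
  (0..n - m).each do |s|
    # Equal hashes are only candidates; verify against the pattern.
    matches << s if hp == ht && text[s, m] == pattern
    next if s == n - m
    ht = ((ht - text[s].to_i * high) * b + text[s + m].to_i) % d
  end
  matches
end

p rabin_karp("28498608120598743297", "081205")   # => [6]
```

A divisor as small as d = 9 produces many equal hash values that are not matches; it is convenient for manual calculation, while a practical implementation would use a much larger d.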
Overview of Knuth-Morris-Pratt Algorithm
- A simplistic implementation compares the same text character many times
- Basic idea:
Use knowledge from previous comparisons
- Precompute pattern shifts by pattern-internal comparisons
Details of Knuth-Morris-Pratt Algorithm
- Compare the current text location with the current pattern location
- As long as there is a match, continue comparison without changing s
(i.e. compare the next character in the text with the next character in the pattern)
- If the end of the pattern is reached, there is a successful match
- If there is no match:
At the start of the pattern, we move the pattern by one (increase
s by one)
(i.e. compare the next character in the text with the first character
in the pattern)
text: e f g h i j k l m n
pattern: a o y a m a
next pattern position: a o y a m a
In other positions, move the position in the pattern to the left
according to the precomputed value
(i.e. keep the current position in the text, move the pattern to the right)
text                  | e | f | a | o | y | a | k | l | m | n | o | p |
pattern               |   |   | a | o | y | a | m | a |
next pattern position |   |   |   |   |   | a | o | y | a | m | a |
Time Complexity of Knuth-Morris-Pratt Algorithm
- As a result of a comparison, either of two actions is taken:
- Shift the pattern to the right by one or more characters (maximum
n-m times)
- Shift the comparison position in the text to the right by one
character (maximum n times)
- The total number of operations is <2n, so the time
complexity is O(n)
- Except for the precomputation, the time complexity does not depend on m
- Advantage: The characters in the text are accessed strictly in order
(left to right)
Precomputation for Knuth-Morris-Pratt Algorithm
For all positions x in the pattern,
assuming that p[x] does not match
but that p[0] ... p[x-1] (length x) already match,
the length of the longest matching prefix and suffix of the already matching part
indicates the position in the pattern to match next
- In addition, if at the position in the pattern to be checked next, the
same character is used, then an additional comparison can be omitted
- Moving the position in the text by one and the pattern to the start again
is expressed as -1
Pseudocode for Knuth-Morris-Pratt algorithm: Bstringmatch.rb
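The precomputation and the search described above might look as follows in Ruby. This is a sketch under assumptions, not the contents of Bstringmatch.rb: the names kmp_next and kmp_search are hypothetical, the table is called nxt because next is a Ruby keyword, and the search returns only the first match. The table includes the optimization for equal characters and uses -1 for "move on in the text and restart the pattern".

```ruby
# Precompute, for each pattern position x, where to continue in the
# pattern when p[x] does not match (-1: advance text, restart pattern).
def kmp_next(pattern)
  m = pattern.length
  nxt = Array.new(m, -1)
  i = 0
  j = -1
  while i < m - 1
    j = nxt[j] while j >= 0 && pattern[i] != pattern[j]
    i += 1
    j += 1
    # Same character at the fallback position: it would fail again,
    # so reuse that position's own fallback.
    nxt[i] = pattern[i] == pattern[j] ? nxt[j] : j
  end
  nxt
end

# Returns the shift of the first match, or nil.
def kmp_search(text, pattern)
  m = pattern.length
  return 0 if m.zero?
  nxt = kmp_next(pattern)
  i = 0   # position in the text, never moves left
  j = 0   # position in the pattern
  while i < text.length
    if j == -1 || text[i] == pattern[j]
      i += 1
      j += 1
      return i - m if j == m
    else
      j = nxt[j]
    end
  end
  nil
end

p kmp_next("aoyaoa")                  # => [-1, 0, 0, -1, 0, 2]
p kmp_search("aoyaoyaoa", "aoyaoa")   # => 3
```

The text index i is never decreased, which reflects the advantage that the text is accessed strictly from left to right.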
Precomputation Example
Precomputation for pattern aoyaoa
(? is any non-matching character)
Green indicates matches, red indicates non-matches, blue indicates next comparison (maybe
with red ?).
Position not matched: 0, next: -1
text    | ? |
pattern | a | o | y | a | o | a |
next    |   | a | o | y | a | o | a |

Position not matched: 1, next: 0
text    | a | ? |
pattern | a | o | y | a | o | a |
next    |   | a | o | y | a | o | a |

Position not matched: 2, next: 0
text    | a | o | ? |
pattern | a | o | y | a | o | a |
next    |   |   | a | o | y | a | o | a |

Position not matched: 3, next: -1
text    | a | o | y | ? |
pattern | a | o | y | a | o | a |
next?   |   |   |   | a | o | y | a | o | a |
next    |   |   |   |   | a | o | y | a | o | a |

Position not matched: 4, next: 0
text    | a | o | y | a | ? |
pattern | a | o | y | a | o | a |
next    |   |   |   |   | a | o | y | a | o | a |

Position not matched: 5, next: 2
text    | a | o | y | a | o | ? |
pattern | a | o | y | a | o | a |
next    |   |   |   | a | o | y | a | o | a |
Overview of Boyer-Moore Algorithm
Idea Details
Two guidelines for shifting the pattern:
- Pattern-internal comparison
(back-to-front version of Knuth-Morris-Pratt algorithm)
- The rightmost position in the pattern of the non-matching character in
the text
text: e f g h i j k a m n o
pattern: a o g a k u
next pattern position: a o g a k u
Select the larger shift
Time Complexity of Boyer-Moore Algorithm
- In the worst case, same as Knuth-Morris-Pratt algorithm: O(n)
- If m is relatively small compared to b, in most
cases: O(n/m)
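The second guideline (the rightmost position in the pattern of the non-matching text character) can be sketched on its own in Ruby. This is a deliberately simplified variant and not the full Boyer-Moore algorithm: it omits the pattern-internal comparison, so it does not have the O(n) worst case. The method name bad_character_search is hypothetical.

```ruby
# Simplified Boyer-Moore using only the rightmost-position guideline.
# Returns all shifts where the pattern occurs.
def bad_character_search(text, pattern)
  n = text.length
  m = pattern.length
  return [] if m.zero? || m > n

  # Rightmost position of each character in the pattern.
  last = {}
  pattern.each_char.with_index { |c, i| last[c] = i }

  matches = []
  s = 0
  while s <= n - m
    j = m - 1
    # Compare from the back of the pattern towards the front.
    j -= 1 while j >= 0 && pattern[j] == text[s + j]
    if j < 0
      matches << s
      s += 1
    else
      # Align the rightmost occurrence of the text character with the
      # mismatch position; characters not in the pattern count as -1.
      s += [1, j - last.fetch(text[s + j], -1)].max
    end
  end
  matches
end

p bad_character_search("xxaogakuxx aogaku", "aogaku")   # => [2, 11]
```

When the text character at the mismatch does not occur in the pattern at all, the pattern jumps by up to m positions, which is the source of the O(n/m) behaviour in most cases.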
String Matching Context
- Relative size of n and m (in general, m ≪ n)
- Number of searches, number of patterns
(if there are many searches with the same text or pattern, then that
text or pattern can be preprocessed to make the search more efficient)
- Number of characters (size of alphabet, b)
- Sequence of bits: b = 2
- Genetics: b = 4 (nucleotides) or b = 21 (amino acids)
- Western language documents: b ≅ 26 to 256
- East Asian language documents: b ≅ several thousand
String Matching and Character Encoding
- In some character encodings, there is a large number of characters
- Implementation becomes simpler if working on bytes instead of characters
- For some character encodings, working byte-by-byte is impossible
- Possible: UTF-8
- Impossible: iso-2022-jp (JIS), Shift_JIS
Example for Shift_JIS
byte values      | 92 | 90 | 92 | 91 | 92 | 93 | 92 | 99 | 92 | 9a | 92 | 9b |
text             | 註      | 酎      | 駐      | 貯      | 丁      | 兆      |
matching pattern |    | 崇      | 葬      | 湯      | 剪      | 囃      |    |
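The false match in the Shift_JIS example can be reproduced on raw byte values. The following Ruby sketch works on arrays of bytes taken from the table, so it does not depend on any encoding conversion; the method name byte_matches is hypothetical, and the boundary check (even offsets only) is valid only because every character in this particular text is two bytes long.

```ruby
# All byte offsets at which pattern_bytes occurs in text_bytes.
def byte_matches(text_bytes, pattern_bytes)
  m = pattern_bytes.length
  (0..text_bytes.length - m).select { |s| text_bytes[s, m] == pattern_bytes }
end

# Byte values of the six two-byte characters in the example text.
text_bytes = [0x92, 0x90, 0x92, 0x91, 0x92, 0x93,
              0x92, 0x99, 0x92, 0x9a, 0x92, 0x9b]
# Byte values of the first "matching pattern" character in the table.
pattern_bytes = [0x90, 0x92]

hits = byte_matches(text_bytes, pattern_bytes)
p hits                      # => [1]
# Offset 1 is in the middle of a character, so it is not a real match.
p hits.select(&:even?)      # => []
```

The byte-level search reports a match that straddles two characters, which is why byte-by-byte matching is not safe for Shift_JIS.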
Character Encodings and Byte Patterns
- UTF-8: 0xxxxxxx;
110xxxxx 10xxxxxx;
1110xxxx 10xxxxxx 10xxxxxx;
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
- EUC-JP: 0xxxxxxx; 1xxxxxxx 1xxxxxxx
- Shift_JIS: 0xxxxxxx; 1xxxxxxx xxxxxxxx
- iso-2022-jp: 0xxxxxxx; 0xxxxxxx 0xxxxxxx
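The UTF-8 byte patterns explain why byte-by-byte matching is possible there: continuation bytes always have the form 10xxxxxx, and no first byte does. A minimal Ruby sketch (method names are hypothetical) that finds the character boundaries from the bytes alone:

```ruby
# A byte starts a character unless it has the form 10xxxxxx.
def utf8_char_start?(byte)
  (byte & 0b1100_0000) != 0b1000_0000
end

# Byte offsets at which characters start in a UTF-8 string.
def utf8_char_starts(string)
  bytes = string.bytes
  bytes.each_index.select { |i| utf8_char_start?(bytes[i]) }
end

# Characters of 1, 2, 3, and 4 bytes.
text = "a\u00E9\u9752\u{1F600}"
p utf8_char_starts(text)   # => [0, 1, 3, 6]
```

Because the first byte of a valid UTF-8 pattern can never be equal to a continuation byte, a byte-level match of such a pattern always starts at a character boundary.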
Other String Matching Problems
- Finding several patterns in the same text: Aho-Corasick algorithm
- Finding many patterns (e.g. words,...) in large text collections
- General problem for search engines
- Solved by building an index (dictionary ADT: words/n-grams ⇒
documents (incl. positions))
- Many others
- The lexical analysis of a program is advanced string matching
- Patterns are not only fixed character strings
- Using finite automata
- Patterns are specified using regular expressions
- These are topics of "Language Theory and Compilers" (3rd year spring term)
- Regular expressions can be used for text processing in Ruby as well as
many other programming languages
- Ruby string matching is reported to have bad worst-case performance;
improving it could be part of graduation research
Summary
- A simplistic implementation of string matching is
O(nm) in the worst case
- The Rabin-Karp algorithm is O(n), using a hash
function that can be extended to 2D matching
- The Knuth-Morris-Pratt algorithm is O(n), and views
the text strictly in input order
- The Boyer-Moore algorithm is O(n/m) in
most cases
Homework
- Review the Rabin-Karp algorithm using BRabinKarp.xls and Bstringmatch.rb
(change b/d to change calculation,
pattern or text to change matches)
- Review the Knuth-Morris-Pratt algorithm using Bstringmatch.rb (in particular
- Review matrix multiplication (linear algebra)
Glossary
- string matching: 文字列照合
- pattern: パターン
- text: 文書
- substring: 部分文字列
- simplistic: 素朴な、単純すぎる
- nucleotide: ヌクレオチド
- amino acid: アミノ酸
- precomputation: 事前計算
- casting out nines: 九去法
- prefix: 接頭部分列
- suffix: 末尾部分列
- character encoding: 文字コード
- index: 索引
- lexical analysis: 字句解析
- finite automaton (pl. automata): 有限オートマトン
- regular expression: 正規表現