Algorithms for String Matching
(文字列照合のアルゴリズム)
Data Structures and Algorithms
11th lecture, December 8, 2022
https://www.sw.it.aoyama.ac.jp/2022/DA/lecture11.html
Martin J. Dürst
© 2009-22 Martin J. Dürst, Aoyama Gakuin University (青山学院大学)
Today's Schedule
- Summary/leftovers of last lecture
- Overview of string matching
- Simplistic implementation
- Rabin-Karp algorithm
- Knuth-Morris-Pratt algorithm
- Boyer-Moore algorithm
- String matching and character encoding
- Summary
Leftovers from Last Lecture
Summary of Last Lecture
- A hash table implements a dictionary ADT using a hash
function
- The hash function converts the keys into a random-like distribution in a
repeatable way
- Hash tables allow search, insertion, and deletion all in expected time
O(1)
- The main methods for conflict resolution are chaining and
open addressing
- Expansion (and shrinking) of a hash table is O(n),
but this can be amortized to O(1) per insertion
- In Ruby and many other programming languages, dictionaries/hash tables
are very convenient data structures
- The implementation of hashing in Ruby uses open addressing starting with
Ruby 2.4 (before: chaining)
Remaining Schedule
- December 8 (today): this lecture
- December 15: 12th lecture (Dynamic Programming)
- December 22: 13th lecture (Algorithm Design Strategies)
- January 12: 14th lecture (NP-completeness, reducibility)
- January 19: 15th lecture (approximation algorithms)
- (most probably) January 26, 09:30-10:55 (85 min): Term Final Exam
Term Final Exam
- Coverage:
- Complete contents of lecture and handouts (incl. programs)
- No need to write Ruby code, but need to understand it, and to
write your own pseudocode
- Type of problems:
- Similar to problems in Discrete Mathematics I
- Past exams: 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020 (online), 2021
- How to view example solutions:
- Press the [表示 (S)] ("show") button or the [S] key. To revert, press the
[非表示 (P)] ("hide") button or the [P] key.
Sometimes, more than one key press is needed to start switching.
Some images and example solutions are missing.
- Important points:
- Read problems carefully (distinguish between calculation, proof,
explanation,...)
- Be able to explain concepts in your own
words
- Combine and apply knowledge from
different lectures
- Write clearly
- Answers can be in Japanese or English
Overview of String Matching
Goal: To find a short pattern p in a long text
t
- Pattern p (needle): String of length m
(examples: single word, phrase,...)
- Text t (haystack): String of length n (in general,
n ≫ m)
(examples: Single document, document collection)
- Task: Search pattern inside text (existence/location/number)
- The location is usually called shift
- The character in t at position s is written t[s]
- The substring of shift s and length m in text t is written $t_s$
Task: Find "pattern" in the text below
...paternpattnetrnternapatternatnetnpttepneretanetpat...
Background: String
matching algorithms in various programming languages
Simplistic Implementation
- Compare the pattern with each substring of length m of the
text
- There are n-m+1 = O(n) substrings
- Comparing the pattern with one substring takes time O(m)
text                  | e | f | a | n | y | a | k | a | m | n |
pattern               |   |   | a | o | y | a | m | a |   |   |
next pattern position |   |   |   | a | o | y | a | m | a |   |
- A simplistic implementation that always compares all m characters will
take time O(n) · O(m) = O(nm)
- If the comparison is stopped early (at the first non-matching character):
- In the typical case, time will be close to O(n)
- In the worst case, time will still be O(nm)
Worst-case example: t = aaa...aaaab, p = aa...ab
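The simplistic approach with early stopping can be sketched in Ruby as follows. This is only an illustration; the method name naive_search is an assumption and is not taken from Bstringmatch.rb.

```ruby
# Simplistic string matching: try every shift s from 0 to n-m.
# Returns the first shift at which the pattern occurs, or nil.
def naive_search(text, pattern)
  n = text.length
  m = pattern.length
  (0..n - m).each do |s|
    j = 0
    # compare character by character, stopping at the first mismatch
    j += 1 while j < m && text[s + j] == pattern[j]
    return s if j == m # all m characters matched
  end
  nil
end

text = "paternpattnetrnternapatternatnetnpttepneretanetpat"
puts naive_search(text, "pattern").inspect
```

With t = aaa...aaaab and p = aa...ab, the inner loop runs almost to the end of the pattern at every shift, which is where the O(nm) worst case comes from.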
Overview of the Rabin-Karp Algorithm
- Define a hash function hf()
- Calculate the hash value hf(p) of pattern
p
- Calculate the hash value of each substring of length m of text
t
- Compare with hf(p)
- If the hash values are equal, check the actual substring against the
pattern
- A simplistic implementation is O(nm)
- This can be improved by selecting an appropriate hash function
Selecting the Hash Function
- Use a hash function so that given $hf(t_s)$, it is easy to calculate $hf(t_{s+1})$
- The hash of the pattern is
$hf(p) = (p[0] \cdot b^{m-1} + p[1] \cdot b^{m-2} + \dots + p[m-2] \cdot b^{1} + p[m-1] \cdot b^{0}) \bmod d$
(b is the size of the alphabet, d is an arbitrarily selected divisor)
- The hash of the candidate substrings is:
$hf(t_s) = (t[s] \cdot b^{m-1} + t[s+1] \cdot b^{m-2} + \dots + t[s+m-2] \cdot b^{1} + t[s+m-1] \cdot b^{0}) \bmod d$
$hf(t_{s+1}) = (t[s+1] \cdot b^{m-1} + t[s+2] \cdot b^{m-2} + \dots + t[s+m-1] \cdot b^{1} + t[s+m] \cdot b^{0}) \bmod d$
Speeding up the Hash Function
- Using the properties of the modulo function (Discrete Mathematics I, Modular Arithmetic):
$hf(t_{s+1}) = ((hf(t_s) - t[s] \cdot b^{m-1}) \cdot b + t[s+m]) \bmod d$
$= ((hf(t_s) - t[s] \cdot (b^{m-1} \bmod d)) \cdot b + t[s+m]) \bmod d$
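- From $hf(t_s)$, $hf(t_{s+1})$ can be calculated in time O(1), because
$b^{m-1} \bmod d$ needs to be computed only once
- Therefore, the overall time is O(n), as long as only a few hash values match
and need a character-by-character check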
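The hash update above can be put together into a short Ruby sketch. It is not the code from Bstringmatch.rb; the method name rabin_karp, the default values b = 256 and d = 1_000_003, and the use of byte values as character codes are all assumptions made for illustration.

```ruby
# Rabin-Karp sketch: returns the first (byte) shift of pattern in text, or nil.
# b: size of the alphabet, d: divisor (both defaults are assumed values).
def rabin_karp(text, pattern, b = 256, d = 1_000_003)
  t = text.bytes
  pat = pattern.bytes
  n = t.length
  m = pat.length
  return 0 if m == 0
  return nil if m > n

  high = b.pow(m - 1, d)                            # b^(m-1) mod d, computed once
  hp = pat.reduce(0) { |h, c| (h * b + c) % d }     # hf(p)
  hs = t[0, m].reduce(0) { |h, c| (h * b + c) % d } # hf(t_0)

  (0..n - m).each do |s|
    # equal hash values only suggest a match, so check the actual substring
    return s if hs == hp && t[s, m] == pat
    break if s == n - m
    # hf(t_{s+1}) from hf(t_s) in O(1); Ruby's % never returns a negative value here
    hs = ((hs - t[s] * high) * b + t[s + m]) % d
  end
  nil
end

puts rabin_karp("28498608120598743297", "081205", 10, 9).inspect
```

A larger d makes accidental hash matches rarer; with a very small d such as 9, more substrings have to be checked character by character.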
Example of Rabin-Karp
Pattern: 081205
Text: 28498608120598743297
b = 10, d = 9
(for manual calculation, casting out nines is
helpful)
Example of implementing Rabin-Karp algorithm in Excel: BRabin-Karp.xls
Pseudocode of Rabin-Karp algorithm: Bstringmatch.rb
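As a check by hand, a few of the hash values for this example can be worked out as below (shifts are counted from 0, and casting out nines means the value mod 9 equals the digit sum mod 9). This is a sketch of the calculation, not a copy of the spreadsheet.

```latex
\begin{align*}
b^{m-1} \bmod d &= 10^{5} \bmod 9 = 1 \\
hf(p) &= 081205 \bmod 9 = (0+8+1+2+0+5) \bmod 9 = 16 \bmod 9 = 7 \\
hf(t_0) &= 284986 \bmod 9 = 37 \bmod 9 = 1 \\
hf(t_1) &= ((hf(t_0) - t[0] \cdot 1) \cdot 10 + t[6]) \bmod 9
         = ((1 - 2) \cdot 10 + 0) \bmod 9 = (-10) \bmod 9 = 8 \\
hf(t_4) &= 860812 \bmod 9 = 25 \bmod 9 = 7
         \quad \text{(hash values equal, but } 860812 \neq 081205\text{)} \\
hf(t_6) &= 081205 \bmod 9 = 16 \bmod 9 = 7
         \quad \text{(hash values equal, and the substring matches)}
\end{align*}
```

The value at shift 4 shows why the actual substring has to be compared whenever the hash values are equal.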
Overview of Knuth-Morris-Pratt Algorithm
- A simplistic implementation compares the same text character many
times
- Basic idea:
Use knowledge from previous comparisons
- Precompute pattern shifts by pattern-internal comparisons
Details of Knuth-Morris-Pratt Algorithm
- Compare the current text location with the current pattern location
- As long as there is a match, continue comparison without changing
s
(i.e. compare the next character in the text with the next character in the
pattern)
- If the end of the pattern is reached, there is a successful match
- If there is no match:
- At the start of the pattern, move the pattern by one (increase
s by one)
(i.e. compare the next character in the text with the first character
in the pattern)
text                  | e | f | g | h | i | j | k | l | m | n |
pattern               |   |   | a | o | y | a | m | a |   |   |
next pattern position |   |   |   | a | o | y | a | m | a |   |
- In other positions, move the position in the pattern to the left
according to the precomputed value
(i.e. keep the current position in the text, move the pattern to the
right)
text                  | e | f | a | o | y | a | k | l | m | n | o | p |
pattern               |   |   | a | o | y | a | m | a |   |   |   |   |
next pattern position |   |   |   |   |   | a | o | y | a | m | a |   |
Time Complexity of Knuth-Morris-Pratt Algorithm
- As a result of a comparison, either of two actions is taken:
- Shift the pattern to the right by one or more characters (maximum
n-m times)
- Shift the comparison position in the text to the right by one
character (maximum n times)
- The total number of operations is <2n, so the time
complexity is O(n)
- Except for the precomputation, the time complexity does not depend on
m
- Advantage: The characters in the text are accessed strictly in order
(left to right)
Precomputation for Knuth-Morris-Pratt Algorithm
For each position x in the pattern, assume that p[x] does not match
but that p[0] ... p[x-1] (length x) already match.
The length of the longest proper prefix of the already matching part that is
also a suffix of that part indicates the position in the pattern to compare next.
- In addition, if the character at the position to be compared next is the
same as p[x], that comparison is bound to fail as well and can be skipped
- Moving the position in the text by one and the pattern to the start again
is expressed as -1
Pseudocode for Knuth-Morris-Pratt algorithm: Bstringmatch.rb
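The precomputation and the search described above might look as follows in Ruby. This is a minimal sketch and not the content of Bstringmatch.rb; the names kmp_table and kmp_search are assumptions.

```ruby
# next_pos[x]: position in the pattern to compare next when p[x] does not match
# (-1 means: move on by one in the text and restart at the pattern start).
def kmp_table(pattern)
  m = pattern.length
  next_pos = Array.new(m, 0)
  return next_pos if m == 0
  next_pos[0] = -1
  k = 0 # length of the longest proper prefix of p[0...x] that is also its suffix
  (1...m).each do |x|
    # if the same character would be compared again, skip that comparison
    next_pos[x] = pattern[x] == pattern[k] ? next_pos[k] : k
    # update k for the next position
    k = next_pos[k] while k >= 0 && pattern[x] != pattern[k]
    k += 1
  end
  next_pos
end

# Returns the first shift of pattern in text, or nil.
def kmp_search(text, pattern)
  m = pattern.length
  return 0 if m == 0
  next_pos = kmp_table(pattern)
  i = 0 # position in the text, never moves to the left
  j = 0 # position in the pattern
  while i < text.length
    if text[i] == pattern[j]
      i += 1
      j += 1
      return i - m if j == m # end of pattern reached: successful match
    else
      j = next_pos[j]
      if j < 0 # restart the pattern at the next text character
        i += 1
        j = 0
      end
    end
  end
  nil
end

puts kmp_table("aoyaoa").inspect # [-1, 0, 0, -1, 0, 2]
```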
Precomputation Example
Precomputation for pattern aoyaoa
(? is any non-matching
character):
Green indicates matches, red indicates non-matches, blue indicates the next comparison (maybe
with red ?).
Position not matched | part    |   |   |   |   |   |   |   |   |   |   | next |
0                    | text    | ? |   |   |   |   |   |   |   |   |   |      |
                     | pattern | a | o | y | a | o | a |   |   |   |   |      |
                     | next    |   | a | o | y | a | o | a |   |   |   | -1   |
1                    | text    | a | ? |   |   |   |   |   |   |   |   |      |
                     | pattern | a | o | y | a | o | a |   |   |   |   |      |
                     | next    |   | a | o | y | a | o | a |   |   |   | 0    |
2                    | text    | a | o | ? |   |   |   |   |   |   |   |      |
                     | pattern | a | o | y | a | o | a |   |   |   |   |      |
                     | next    |   |   | a | o | y | a | o | a |   |   | 0    |
3                    | text    | a | o | y | ? |   |   |   |   |   |   |      |
                     | pattern | a | o | y | a | o | a |   |   |   |   |      |
                     | next?   |   |   |   | a | o | y | a | o | a |   |      |
                     | next    |   |   |   |   | a | o | y | a | o | a | -1   |
4                    | text    | a | o | y | a | ? |   |   |   |   |   |      |
                     | pattern | a | o | y | a | o | a |   |   |   |   |      |
                     | next    |   |   |   |   | a | o | y | a | o | a | 0    |
5                    | text    | a | o | y | a | o | ? |   |   |   |   |      |
                     | pattern | a | o | y | a | o | a |   |   |   |   |      |
                     | next    |   |   |   | a | o | y | a | o | a |   | 2    |
Overview of Boyer-Moore Algorithm
Idea Details
Two guidelines for shifting the pattern:
- Pattern-internal comparison
(back-to-front version of Knuth-Morris-Pratt algorithm)
- The rightmost position in the pattern of the non-matching character in
the text
text                  | e | f | g | h | i | j | k | a | m | n | o |
pattern               |   |   | a | o | g | a | k | u |   |   |   |
next pattern position |   |   |   |   | a | o | g | a | k | u |   |
Select the larger shift
Time Complexity of Boyer-Moore Algorithm
- In the worst case, same as Knuth-Morris-Pratt algorithm: O(n)
- If m is relatively small compared to b, in most
cases: O(n/m)
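To illustrate the back-to-front comparison, here is a simplified Ruby sketch that uses only the second guideline (the rightmost position in the pattern of the non-matching text character). It is not the full Boyer-Moore algorithm: without the pattern-internal guideline, its worst case is O(nm) rather than O(n). The method name bad_character_search is an assumption.

```ruby
# Simplified Boyer-Moore sketch using only the "rightmost position" guideline.
# Returns the first shift of pattern in text, or nil.
def bad_character_search(text, pattern)
  n = text.length
  m = pattern.length
  return 0 if m == 0

  # rightmost position of each character in the pattern
  last = {}
  pattern.each_char.with_index { |c, i| last[c] = i }

  s = 0
  while s <= n - m
    j = m - 1
    # compare from the end of the pattern towards its start
    j -= 1 while j >= 0 && text[s + j] == pattern[j]
    return s if j < 0 # whole pattern matched

    # align the rightmost occurrence of the non-matching text character
    # with its text position; characters not in the pattern count as -1
    shift = j - last.fetch(text[s + j], -1)
    s += [shift, 1].max # always move by at least one
  end
  nil
end

puts bad_character_search("efghijkaogakumno", "aogaku").inspect
```

When the alphabet is large compared to m, most text characters do not occur in the pattern at all, so the pattern often jumps by its full length; this is the source of the O(n/m) behavior.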
String Matching Context
- Relative size of n and m (in general, m
≪ n)
- Number of searches, number of patterns
(if there are many searches with the same text or pattern, then that
text or pattern can be preprocessed to make the search more
efficient)
- Number of characters (size of alphabet, b)
- Sequence of bits: b = 2
- Genetics: b = 4 (nucleotides) or b = 21 (amino
acids)
- Western language documents: b ≈ 26 to 256
- East Asian language documents: b ≈ several thousand
String Matching and Character Encoding
- In some character encodings, there is a large number of characters
- Implementation becomes simpler if working on bytes instead of
characters
- For some character encodings, working byte-by-byte is impossible
- Possible: UTF-8
- Impossible: iso-2022-jp (JIS), Shift_JIS (SJIS), EUC-JP (EUC)
Example for Shift_JIS:
byte values      | 92 | 90 | 92 | 91 | 92 | 93 | 92 | 99 | 92 | 9a | 92 | 9b |
text             | 註      | 酎      | 駐      | 貯      | 丁      | 兆      |
matching pattern |    | 崇      | 葬      | 湯      | 剪      | 囃      |    |
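The false match in the Shift_JIS example can be reproduced with a few lines of Ruby. The byte values are taken from the table above; the constant and method names are assumptions for illustration.

```ruby
# Bytes from the table above, assumed to be Shift_JIS encoded.
TEXT_BYTES    = [0x92, 0x90, 0x92, 0x91].pack("C*") # first two characters of the text
PATTERN_BYTES = [0x90, 0x92].pack("C*")             # first character of the "matching pattern" row

# Byte-by-byte search: ignores character boundaries.
def byte_match?(text, pattern)
  text.b.include?(pattern.b)
end

# Character-aware search: decode both strings first, then compare characters.
def char_match?(text, pattern)
  decode = ->(s) { s.dup.force_encoding("Shift_JIS").encode("UTF-8") }
  decode.(text).include?(decode.(pattern))
end

puts byte_match?(TEXT_BYTES, PATTERN_BYTES) # true: a false match across two characters
puts char_match?(TEXT_BYTES, PATTERN_BYTES) # false: the character does not occur
```

The byte-level search reports a match because the second byte of one character followed by the first byte of the next character looks like a different character.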
Character Encodings and Byte Patterns
- UTF-8: 0xxxxxxx;
110xxxxx 10xxxxxx;
1110xxxx 10xxxxxx 10xxxxxx;
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
- EUC-JP: 0xxxxxxx; 1xxxxxxx 1xxxxxxx
- Shift_JIS: 0xxxxxxx; 1xxxxxxx xxxxxxxx
- iso-2022-jp: 0xxxxxxx; 0xxxxxxx 0xxxxxxx
Other String Matching Problems
- Finding several patterns in the same text: Aho-Corasick algorithm
- Finding many patterns (e.g. words,...) in large text collections
- General problem for search engines
- Solved by building an index (dictionary ADT: words/n-grams ⇒
documents (incl. positions))
- Many others
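The index idea can be sketched with a Ruby hash that maps each word to the documents and positions where it occurs. This is a toy illustration only; the method name build_index, the sample documents, and the splitting into words with a regular expression are assumptions.

```ruby
# Build an index: word => list of [document number, word position] pairs.
def build_index(documents)
  index = Hash.new { |hash, word| hash[word] = [] }
  documents.each_with_index do |doc, doc_id|
    doc.downcase.scan(/\w+/).each_with_index do |word, pos|
      index[word] << [doc_id, pos]
    end
  end
  index
end

docs = ["String matching is fun", "Hash tables and string search"]
index = build_index(docs)
puts index["string"].inspect # [[0, 0], [1, 3]]
```

After the index is built once, looking up a word no longer requires scanning the text collection.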
Outlook
- The lexical analysis of a program is advanced string matching
- Patterns are not only fixed character strings
- Using finite automata
- Patterns are specified using regular expressions
- These are topics of "Language Theory and Compilers" (3rd year spring
term)
- Regular expressions can be used for text processing in Ruby as well as
many other programming languages
- Ruby string matching is reported to have bad worst-case performance;
improving it could be a topic for graduation research
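As a small illustration of patterns that are not fixed character strings, the following Ruby lines search the sample text from the overview with a regular expression. The pattern /pat+ern/ (one or more t) is a made-up example.

```ruby
text = "paternpattnetrnternapatternatnetnpttepneretanetpat"

# =~ returns the position of the first match, or nil if there is none
first = text =~ /pat+ern/
puts first.inspect # 0

# scan returns all non-overlapping matches
puts text.scan(/pat+ern/).inspect # ["patern", "pattern"]
```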
Summary
- A simplistic implementation of string matching is
O(nm) in the worst case
- The Rabin-Karp algorithm is O(n); it uses a hash function that can be
updated in O(1), and it can be extended to 2D matching
- The Knuth-Morris-Pratt algorithm is O(n), and views
the text strictly in input order
- The Boyer-Moore algorithm is O(n/m) in
most cases
Homework
- Review the Rabin-Karp algorithm using BRabinKarp.xls and Bstringmatch.rb
(change b/d to change calculation,
pattern or text to change matches)
- Review the Knuth-Morris-Pratt algorithm using Bstringmatch.rb (in particular
show_kmp_preparation)
- Review matrix multiplication (linear algebra)
Glossary
- string matching: 文字列照合
- pattern: パターン
- text: 文書
- substring: 部分文字列
- simplistic: 素朴な、単純すぎる
- nucleotide: ヌクレオチド
- amino acid: アミノ酸
- precomputation: 事前計算
- casting out nines: 九去法
- prefix: 接頭部分列
- suffix: 末尾部分列
- character encoding: 文字コード
- index: 索引
- lexical analysis: 字句解析
- finite automaton (pl. automata): 有限オートマトン
- regular expression: 正規表現