Algorithms for String Matching

(文字列照合のアルゴリズム)

Data Structures and Algorithms

11th lecture, December 7, 2017

http://www.sw.it.aoyama.ac.jp/2017/DA/lecture11.html

Martin J. Dürst

Today's Schedule

Summary and leftovers of last lecture
Overview of string matching
Simplistic implementation
Rabin-Karp algorithm
Knuth-Morris-Pratt algorithm
Boyer-Moore algorithm
String matching and character encoding
Summary

Summary of Last Lecture

A hash table implements a dictionary ADT using a hash function
The hash function converts the keys into a random-like distribution in a repeatable way
Hash tables allow search, insertion, and deletion all in expected time O(1)
The main methods for collision resolution are chaining and open addressing
In Ruby and many other programming languages, hash tables are very convenient data structures
The implementation of hashing in Ruby uses open addressing starting with Ruby 2.4 (before: chaining)

Leftovers of Last Lecture

Overview of String Matching

Goal: To find a short pattern p in a long text t

Text t (haystack): String of length n
Pattern p (needle): String of length m
Search pattern inside text (existence/location/number)
The location is usually called shift
The substring of shift s and length m in text t is written t_s
The character in t at position s is written t[s]

Background: String matching algorithms in various programming languages

...paternpattnetrnternapatternatnetnpttepneretanetpat...

String Matching Context

Relative size of n and m (in general, m ≪ n)
Number of searches, number of patterns
(if there are many searches with the same text or pattern, then that text or pattern can be preprocessed to make the search more efficient)
Number of characters (size of alphabet, b)
- Sequence of bits: b = 2
- Genetics: b = 4 (nucleotides) or b = 21 (amino acids)
- Western documents: b ≈ 26 to 256
- East Asian documents: b ≈ several thousand

Simplistic Implementation

Compare the pattern with each substring of length m of the text
There are n-m+1 = O(n) substrings

Comparing the pattern with a substring takes time O(m)

text	e	f	a	n	y	a	k	a	m	n
pattern			a	o	y	a	m	a
next pattern position				a	o	y	a	m	a

A simplistic implementation will take time O(n) · O(m) = O(nm)
If the comparison is stopped early:
- In the typical case, time will be close to O(n)
- In the worst case, time will be O(nm)
  Worst-case example: t = aaa....aaaab, p = aa..ab
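
The simplistic approach can be sketched in Ruby as follows. This is a minimal illustration, not the code from Bstringmatch.rb; the method name simplistic_match is made up for this example. The inner loop stops at the first mismatch, which is the early stop described above.

```ruby
# Simplistic string matching: try every shift s from 0 to n-m.
# Returns an array with all shifts at which pattern occurs in text.
def simplistic_match(text, pattern)
  n = text.length
  m = pattern.length
  shifts = []
  (0..n - m).each do |s|
    j = 0
    # compare character by character, stopping early at the first mismatch
    j += 1 while j < m && text[s + j] == pattern[j]
    shifts << s if j == m
  end
  shifts
end

p simplistic_match("efanyakamn", "aoyama")  # no occurrence in this text
p simplistic_match("aaaaaaab", "aab")       # nearly every shift fails only at the last character
```

In the second call, almost every shift needs all m comparisons before it fails, which is the O(nm) behaviour of the worst-case example.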

Overview of the Rabin-Karp Algorithm

Use a hash function
Calculate the hash value of each substring of length m of text t
Compare with hash value of pattern p
If the hash values are equal, check the actual substring against the pattern
A simplistic implementation is O(nm), because calculating each hash value from scratch takes time O(m)
This can be improved by selecting an appropriate hash function

Selecting the Hash Function

Use a hash function so that hf(t_{s+1}) can be easily calculated from hf(t_s)
The hash of the pattern is hf(p) = (p[0]·b^{m-1} + p[1]·b^{m-2} + ... + p[m-2]·b^1 + p[m-1]·b^0) mod d
(b is the size of the alphabet, d is an arbitrarily selected divisor)
The hash of the candidate substrings is:
hf(t_s) = (t[s]·b^{m-1} + t[s+1]·b^{m-2} + ... + t[s+m-2]·b^1 + t[s+m-1]·b^0) mod d
hf(t_{s+1}) = (t[s+1]·b^{m-1} + t[s+2]·b^{m-2} + ... + t[s+m-1]·b^1 + t[s+m]·b^0) mod d
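
As an illustration of the formula for hf, here is a small Ruby sketch. It makes some assumptions that are not part of the slides: characters are converted to numbers with String#ord, b = 256, and d = 1_000_003; the method names hf_direct and hf are chosen for this example only. The first method follows the formula term by term, the second computes the same value with Horner's rule.

```ruby
# Hash as in the formula: the characters are treated as the digits of a
# base-b number, which is then reduced modulo d.
# Assumption: each character is mapped to a number with String#ord.
def hf_direct(str, b, d)
  m = str.length
  sum = 0
  str.each_char.with_index do |c, i|
    sum += c.ord * b**(m - 1 - i)
  end
  sum % d
end

# Same value, computed with Horner's rule and reduced mod d at every step,
# so that the intermediate numbers stay small.
def hf(str, b, d)
  str.each_char.reduce(0) { |h, c| (h * b + c.ord) % d }
end

b = 256        # assumed alphabet size (one byte per character)
d = 1_000_003  # arbitrarily selected divisor
puts hf_direct("aoyama", b, d)
puts hf("aoyama", b, d)   # same value as the line above
```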

Speeding up the Hash Function

Using the properties of the modulo function (Discrete Mathematics I, Modular Arithmetic)
hf(t_{s+1}) = ((hf(t_s) - t[s]·b^{m-1}) · b + t[s+m]) mod d
= ((hf(t_s) - t[s]·(b^{m-1} mod d)) · b + t[s+m]) mod d
hf(t_{s+1}) can be calculated from hf(t_s) in time O(1)
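Therefore, the overall time is O(n), as long as only few substrings need the full check (if most hash values are equal, the time is O(nm) again)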
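
The rolling update can be put together into a complete search, sketched below. This is one possible implementation, not the code from Bstringmatch.rb. The method name rabin_karp, the use of String#ord, and the default values b = 256 and d = 1_000_003 are assumptions made for this example.

```ruby
# Rabin-Karp sketch: returns all shifts at which pattern occurs in text.
# Assumptions: characters are mapped to numbers with String#ord,
# b = 256, d = 1_000_003 (an arbitrarily selected divisor).
def rabin_karp(text, pattern, b = 256, d = 1_000_003)
  n = text.length
  m = pattern.length
  return [] if m == 0 || m > n

  t = text.each_char.map(&:ord)
  pc = pattern.each_char.map(&:ord)

  high = b.pow(m - 1, d)   # b^(m-1) mod d, calculated only once
  hp = 0                   # hash of the pattern
  ht = 0                   # hash of the current substring t_s
  m.times do |i|
    hp = (hp * b + pc[i]) % d
    ht = (ht * b + t[i]) % d
  end

  shifts = []
  (0..n - m).each do |s|
    # equal hash values may be a spurious hit, so check the actual substring
    shifts << s if hp == ht && text[s, m] == pattern
    next if s == n - m
    # rolling update: remove t[s], multiply by b, add t[s+m]
    # (Ruby's % gives a non-negative result even if the left side is negative)
    ht = ((ht - t[s] * high) * b + t[s + m]) % d
  end
  shifts
end

p rabin_karp("28498608120598743297", "081205")
```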

Example of Rabin-Karp

Pattern: 081205

Text: 28498608120598743297

b = 10, d = 9

(for manual calculation, casting out nines is helpful)
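
The example can also be traced with a few lines of Ruby. This is a sketch for checking manual calculations, not the content of the spreadsheet; the method name rabin_karp_table and the status labels are made up for this illustration. With d = 9, the hash of the pattern is 7 (the digit sum 16, after casting out nines), and substrings with the same hash value that differ from the pattern show up as spurious hits.

```ruby
# For every shift s, record the substring, its hash value, and whether
# the hash comparison leads to a match, a spurious hit, or is skipped.
def rabin_karp_table(text, pattern, b, d)
  m = pattern.length
  digits = text.each_char.map(&:to_i)
  hp = pattern.each_char.reduce(0) { |h, c| (h * b + c.to_i) % d }
  high = b.pow(m - 1, d)
  ht = digits.first(m).reduce(0) { |h, x| (h * b + x) % d }
  rows = []
  (0..text.length - m).each do |s|
    status =
      if ht != hp
        :skip          # hash values differ, no character comparison needed
      elsif text[s, m] == pattern
        :match
      else
        :spurious      # same hash value, but different substring
      end
    rows << [s, text[s, m], ht, status]
    # rolling update for the next shift, if there is one
    ht = ((ht - digits[s] * high) * b + digits[s + m]) % d if s + m < text.length
  end
  rows
end

rabin_karp_table("28498608120598743297", "081205", 10, 9).each do |s, sub, h, status|
  puts format("s=%2d  %s  hash=%d  %s", s, sub, h, status)
end
```

A divisor as small as 9 is convenient for manual calculation, but it produces many spurious hits; a larger d makes them rarer.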

Example of implementing Rabin-Karp algorithm in Excel: BRabin-Karp.xls

Pseudocode of Rabin-Karp algorithm: Bstringmatch.rb

Overview of Knuth-Morris-Pratt Algorithm

A simplistic implementation compares the same text character many times
Basic idea:
Use knowledge from previous comparisons
Precompute the pattern shifts by pattern-internal comparisons

Details of Knuth-Morris-Pratt Algorithm

Compare the current text location with the current pattern location
As long as there is a match, continue comparison without changing s
(i.e. compare the next character in the text with the next character in the pattern)
If the end of the pattern is reached, there is a successful match

If there is no match:

At the start of the pattern, we move the pattern by one (increase s by one)
(i.e. compare the next character in the text with the first character in the pattern)

text	e	f	g	h	i	j	k	l	m	n
pattern			a	o	y	a	m	a
next pattern position				a	o	y	a	m	a

In other positions, move the position in the pattern to the left according to the precomputed value
(i.e. keep the current position in the text, move the pattern to the right)

text	e	f	a	o	y	a	k	l	m	n	o	p
pattern			a	o	y	a	m	a
next pattern position						a	o	y	a	m	a

Time Complexity of Knuth-Morris-Pratt Algorithm

As a result of a comparison, either of two actions is taken:
1. Shift the pattern to the right by one or more characters (maximum n-m times)
2. Shift the comparison position by one character to the right (maximum n times)
The total number of operations is about 2n, so the time complexity is O(n)
Except for the precomputation, the time complexity does not depend on m
Advantage: The characters in the text are accessed strictly in order (left to right)

Precomputation for Knuth-Morris-Pratt Algorithm

For all positions x in the pattern,
assuming that p[x] does not match
but that p[0] ... p[x-1] (length x) already match,
the length of the longest matching prefix and suffix of the already matching part
indicates the position in the pattern to match next

In addition, if at the position in the pattern to be checked next, the same character is used, then an additional comparison can be omitted
Moving the pattern to start again is expressed as -1
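
The precomputation and the search can be sketched in Ruby as follows. This follows a common textbook formulation of the algorithm and is not necessarily identical to Bstringmatch.rb; the method names kmp_table and kmp_search are assumptions for this example. The table uses -1 for "move the pattern and start again", and it reuses an earlier entry when the next pattern character is the same as the one that just failed.

```ruby
# table[x]: position in the pattern to compare next if p[x] does not match.
# -1 means: move on to the next text character and restart the pattern.
def kmp_table(pattern)
  m = pattern.length
  table = [-1]
  pos = 1   # position for which the entry is being computed
  cnd = 0   # length of the current matching prefix/suffix candidate
  while pos < m
    if pattern[pos] == pattern[cnd]
      # same character would fail again, so skip that comparison
      table[pos] = table[cnd]
    else
      table[pos] = cnd
      cnd = table[cnd] while cnd >= 0 && pattern[pos] != pattern[cnd]
    end
    pos += 1
    cnd += 1
  end
  table[pos] = cnd   # used to continue after a complete match
  table
end

# Returns all shifts at which pattern occurs in text.
def kmp_search(text, pattern)
  m = pattern.length
  return [] if m == 0
  table = kmp_table(pattern)
  shifts = []
  j = 0   # current position in the text (never moves backwards)
  k = 0   # current position in the pattern
  while j < text.length
    if pattern[k] == text[j]
      j += 1
      k += 1
      if k == m
        shifts << j - k
        k = table[k]
      end
    else
      k = table[k]
      if k < 0
        j += 1
        k += 1
      end
    end
  end
  shifts
end

p kmp_table("aoyama")
p kmp_search("xxaoyamaoyama", "aoyama")
```

The variable j is only ever increased, which corresponds to accessing the text strictly from left to right.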

Pseudocode for Knuth-Morris-Pratt algorithm: Bstringmatch.rb

Overview of Boyer-Moore Algorithm

Start comparing at the end of the pattern
Consider the actual text character when comparing
(i.e. not only match/no match, but match/no match with 'a'/no match with 'b'/...)
Increase the distance by which the pattern can be shifted

Example: If the last character of the pattern doesn't match the text, and the candidate character in the text does not appear in the pattern, then the pattern can be shifted by m at once

text	e	f	g	h	i	j	k	l	m	n	o	p	q	r
pattern			a	o	y	a	m	a
next pattern position									a	o	y	a	m	a

Idea Details

Two guidelines for shifting the pattern:

Pattern-internal comparison
(back-to-front version of Knuth-Morris-Pratt algorithm)

The rightmost position in the pattern of the non-matching character in the text

text	e	f	g	h	i	j	k	a	m	n	o
pattern			a	o	g	a	k	u
next pattern position					a	o	g	a	k	u

Select the larger shift
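
The second guideline alone already gives a working search, sketched below in Ruby. Note that this is a simplification: it implements only the shift based on the rightmost position of the text character in the pattern, and leaves out the pattern-internal comparison, so it is not the full Boyer-Moore algorithm. The method name bm_bad_char_search is made up for this example.

```ruby
# Simplified Boyer-Moore sketch using only the character-based shift.
# Returns all shifts at which pattern occurs in text.
def bm_bad_char_search(text, pattern)
  n = text.length
  m = pattern.length
  return [] if m == 0 || m > n

  # rightmost position of each character in the pattern (-1 if absent)
  last = Hash.new(-1)
  pattern.each_char.with_index { |c, i| last[c] = i }

  shifts = []
  s = 0
  while s <= n - m
    j = m - 1
    # compare from the end of the pattern towards its start
    j -= 1 while j >= 0 && pattern[j] == text[s + j]
    if j < 0
      shifts << s
      s += 1
    else
      # align the rightmost occurrence of the text character with position j;
      # always shift by at least one
      s += [1, j - last[text[s + j]]].max
    end
  end
  shifts
end

p bm_bad_char_search("efghijkamno", "aogaku")
p bm_bad_char_search("xxaogakuxx aogaku", "aogaku")
```

In the first call, the comparison at shift 2 fails on the text character 'a', whose rightmost position in the pattern is 3, so the pattern moves by two positions, as in the table above.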

Time Complexity of Boyer-Moore Algorithm

In the worst case, same as Knuth-Morris-Pratt algorithm: O(n)
If m is relatively small compared to b, in most cases: O(n/m)

String Matching and Character Encoding

In some character encodings, there is a large number of characters
Implementation becomes simpler if working on bytes instead of characters
For some character encodings, working byte-by-byte is impossible, because a match on the byte level may start or end in the middle of a character
- Possible: UTF-8
- Impossible: iso-2022-jp (JIS), Shift_JIS (SJIS), EUC-JP (EUC)

Character Encodings and Byte Patterns

UTF-8: 0xxxxxxx;
  110xxxxx 10xxxxxx;
  1110xxxx 10xxxxxx 10xxxxxx;
  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
EUC-JP: 0xxxxxxx; 1xxxxxxx 1xxxxxxx
Shift_JIS: 0xxxxxxx; 1xxxxxxx xxxxxxxx
iso-2022-jp: 0xxxxxxx; 0xxxxxxx 0xxxxxxx
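
The difference between these byte patterns can be observed in Ruby. The following sketch uses a well-known case as an illustration: in Shift_JIS, the second byte of the katakana letter SO (U+30BD) has the same value as the ASCII backslash. The method name byte_match? is made up for this example.

```ruby
# Byte-by-byte search: both strings are converted to the given encoding
# and then compared as raw bytes (String#b gives a binary view).
def byte_match?(text, pattern, encoding)
  text.encode(encoding).b.include?(pattern.encode(encoding).b)
end

so = "\u30BD"       # KATAKANA LETTER SO, written as an escape
backslash = "\\"

p so.encode("Shift_JIS").bytes.map { |x| format("%02X", x) }  # ["83", "5C"]
p byte_match?(so, backslash, "Shift_JIS")  # true: false match on the second byte
p byte_match?(so, backslash, "UTF-8")      # false: all bytes have the form 1xxxxxxx
```

In UTF-8, the first byte of a character can never be confused with a later byte, so a match on the byte level always corresponds to a match on the character level.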

Outlook

The lexical analysis of a program is advanced string matching
Not only fixed character strings as patterns
Using finite automata
Patterns are specified using regular expressions
These are topics of the course "Language Theory and Compilers" (3rd year spring term)
Regular expressions can be used for text processing in Ruby as well as many other programming languages
Ruby's string matching is reported to have bad worst-case performance; improving it could be a topic for graduation research
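
As a small illustration of patterns that are not fixed strings, the following Ruby lines apply a regular expression to the sample text from the overview. The pattern /pat+ern/ (one or more "t") is chosen only for this example.

```ruby
text = "...paternpattnetrnternapatternatnetnpttepneretanetpat..."

p text.index("pattern")   # fixed string: position of the first occurrence
p text =~ /pat+ern/       # regular expression: position of the first match
p text.scan(/pat+ern/)    # all substrings matched by the regular expression
```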

Summary

A simplistic implementation of string matching is O(nm) in the worst case
The Rabin-Karp algorithm is O(n); it uses a hash function and can be extended to 2D matching
The Knuth-Morris-Pratt algorithm is O(n), and views the text strictly in input order
The Boyer-Moore algorithm is O(n/m) in most cases

Homework:

Review the Rabin-Karp algorithm using BRabinKarp.xls and Bstringmatch.rb
Review the Knuth-Morris-Pratt algorithm using Bstringmatch.rb (in particular show_kmp_preparation)

Glossary

string matching: 文字列照合
pattern: パターン
text: 文書
substring: 部分文字列
simplistic: 素朴な、単純すぎる
nucleotide: ヌクレオチド
amino acid: アミノ酸
precomputation: 事前計算
casting out nines: 九去法
prefix: 接頭部分列
suffix: 末尾部分列
character encoding: 文字コード
lexical analysis: 字句解析
finite automaton (pl. automata): 有限オートマトン
regular expression: 正規表現