Hash Functions and Hash Tables
(ハッシュ関数とハッシュ表)
Data Structures and Algorithms
10th lecture, November 22, 2018
http://www.sw.it.aoyama.ac.jp/2018/DA/lecture10.html
Martin J. Dürst
© 2009-18 Martin J. Dürst, Aoyama Gakuin University
Today's Schedule
- Leftovers and summary of last lecture
- Additional speedup for dictionary
- Overview of hashing
- Hash functions
- Conflict resolution
- Evaluation of hashing
- Hashes in Ruby
- Summary
Leftovers
Summary of Last Lecture
- Balanced trees keep search/insertion/deletion in a dictionary ADT at
O(log n) worst-case time
- 2-3-4 trees and B(+) trees increase the degree of the nodes beyond that of a
binary tree, but keep all leaves at the same depth
- Red-black trees and AVL trees impose limitations on the variation of the
tree height
- B-trees and B+ trees are very useful for file systems and databases on
secondary storage
Time Complexity for Known Dictionary Implementations
Implementation | Search | Insertion | Deletion
Sorted array | O(log n) | O(n) | O(n)
Unordered array/linked list | O(n) | O(1) | O(n)
Balanced tree | O(log n) | O(log n) | O(log n)
Direct Addressing
- Use an array with an element for each key value
- Search: value = array[key], time: O(1)
- Insertion/replacement: array[key] = value, time: O(1)
- Deletion: array[key] = nil, time: O(1)
- Example:
students = []
students[15817000] = "I.T. Aoyama"
Problem: Array size, non-numeric keys
Solution: Transform key with hash function
Overview of Hashing
(also called scatter storage technique)
- Transform the key k to a compact space using the hash function hf
- Use hf(k) instead of k in the same way as direct addressing
- The data is contained in the hash table (table below)
- The hash function can be evaluated in constant time (O(1))
- Search: value = table[hf(key)], time: O(1)
- Insertion/replacement: table[hf(key)] = value, time: O(1)
- Deletion: table[hf(key)] = nil, time: O(1)
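The three operations above can be sketched in Ruby as follows. This is only a minimal illustration that ignores conflicts; TABLE_SIZE and the method names table_insert, table_search, and table_delete are assumptions made for this example, not part of the lecture material.

```ruby
# Minimal sketch of hashing WITHOUT conflict resolution.
# TABLE_SIZE and all method names are assumptions for this example.
TABLE_SIZE = 100

def hf(key)
  key % TABLE_SIZE            # compact the key into 0...TABLE_SIZE
end

def table_insert(table, key, value)
  table[hf(key)] = value      # insertion/replacement, O(1)
end

def table_search(table, key)
  table[hf(key)]              # search, O(1)
end

def table_delete(table, key)
  table[hf(key)] = nil        # deletion, O(1)
end

table = Array.new(TABLE_SIZE)
table_insert(table, 15815042, "I.T. Aoyama")
puts table_search(table, 15815042)   # found in bin 42
table_delete(table, 15815042)
p table_search(table, 15815042)      # nil after deletion
```

The table needs only 100 elements here, where direct addressing would need an array with more than 15 million elements.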
Problems with Hashing
- Choice/design of hash function
Example 1: remainder: def hf(k); k % 100; end
students[15815000 % 100] = "I.T.Aoyama"
Example 2: sum of codepoints: def hf(k); k.codepoints.sum; end
students["HanakoAoyama".codepoints.sum] = ...
- Resolution of conflicts
What happens with the following:
students[15815000 % 100] = "I.T.Aoyama"
students[15715000 % 100] = "K.S.Aoyama"
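To see what happens, the two example hash functions can be tried directly. The sketch below uses the hypothetical names hf_remainder and hf_codepoints for the two examples. The anagram at the end is an extra illustration of why the codepoint sum is a weak hash function.

```ruby
# Sketch: both example hash functions produce conflicts easily.
# hf_remainder and hf_codepoints are hypothetical names for Examples 1 and 2.
def hf_remainder(k)
  k % 100
end

def hf_codepoints(k)
  k.codepoints.sum
end

students = []
students[hf_remainder(15815000)] = "I.T.Aoyama"
students[hf_remainder(15715000)] = "K.S.Aoyama"   # same bin: overwrites the first entry

p hf_remainder(15815000)   # 0
p hf_remainder(15715000)   # 0
p students[0]              # "K.S.Aoyama"

# The codepoint sum ignores character order, so anagrams conflict too:
p hf_codepoints("HanakoAoyama") == hf_codepoints("AoyamaHanako")   # true
```

Without conflict resolution, the second assignment silently replaces the first student.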
Overview of Hash Function
- Goals:
- From a key, calculate an index that is smoothly (randomly)
distributed
(this is the reason for the word hash, as in hashed beef or
hash browns)
- Adjust the range of the result to the size of the hash table
- Steps:
- Calculate a large integer (e.g. int in C) from the key
- Adjust this large integer to the hash table size using a modulo
operation
Step 2 is easy. Therefore, we concentrate on step 1.
(often step 1 alone is called 'hash function')
Hash Function Example 1
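unsigned int sdbm_hash(const char key[])
{
    /* unsigned: overflow wraps around instead of being undefined */
    unsigned int hash = 0;
    while (*key) {
        /* parentheses are needed: + and - bind tighter than << in C */
        hash = (unsigned char)*key++ + (hash << 6)
               + (hash << 16) - hash;
    }
    return hash;
}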
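For comparison, here is a Ruby sketch of the computation that sdbm is intended to perform, followed by step 2 (the modulo operation). Ruby integers do not overflow, so the sketch masks the result to 32 bits; the mask and the helper name bin_index are assumptions made for this example.

```ruby
# Sketch: sdbm in Ruby, assuming 32-bit unsigned arithmetic.
MASK32 = 0xFFFFFFFF

# Step 1: calculate a large integer from the key.
def sdbm_hash(key)
  hash = 0
  key.each_byte do |c|
    # same as hash * 65599 + c, because 65599 = 2**6 + 2**16 - 1
    hash = (c + (hash << 6) + (hash << 16) - hash) & MASK32
  end
  hash
end

# Step 2: adjust the large integer to the table size m.
def bin_index(key, m)
  sdbm_hash(key) % m
end

p sdbm_hash("a")          # 97
p sdbm_hash("ab")         # 6363201
p bin_index("ab", 100)    # 1
```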
Hash Function Example 2
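(simplified from MurmurHash3; for 32-bit machines)
#include <stdint.h>
#define ROTL32(n,by) (((n)<<(by)) | ((n)>>(32-(by))))
/* C1 (multiplier) and R1 (rotation count, 1 to 31) are constants */
uint32_t too_simple_hash(const uint32_t key[], int length)
{
    uint32_t h = 0;   /* unsigned: shifts and overflow are well-defined */
    for (int i=0; i<length; i++) {
        uint32_t k = key[i] * C1;  // mix each 32-bit block of the key
        h ^= ROTL32(k, R1);        // rotate, then fold into the hash
    }
    h ^= h >> 13;                  // final mixing
    h *= 0xc2b2ae35;
    return h;
}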
Frequent operations in hash functions: Addition (+), multiplication (*),
bitwise XOR (^), shift (<<, >>)
Evaluation of Hash Functions
- Quality of distribution
- Execution speed
- Ease of implementation
Precautions for Hash Functions
- Use all parts of the key
Counterexample: Using only characters 3 and 4 of a string → bad
distribution
- Do not use data besides the key
If some data attributes (e.g. price of a product, student's total marks)
are used and later change, the hash value will change and the data will not be found anymore
- Collapse equivalent keys
Examples: For text: Upper/lower case letters, For the game of Go:
top/bottom, left/right, diagonal, and black/white symmetries
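Two of these precautions can be illustrated with a short Ruby sketch. The method names partial_hash and case_insensitive_hash are hypothetical, and the codepoint sum is used only because it is easy to follow, not because it is a good hash function.

```ruby
# Counterexample: uses only characters 3 and 4 of the key (indices 2 and 3).
def partial_hash(key)
  key[2, 2].to_s.codepoints.sum
end

# Collapses equivalent keys: upper/lower case variants get the same hash value.
def case_insensitive_hash(key)
  key.downcase.codepoints.sum
end

# Different keys that share characters 3 and 4 ("ya") always conflict:
p partial_hash("Aoyama") == partial_hash("Toyama")                    # true

# Keys that should be treated as equal end up in the same bin:
p case_insensitive_hash("Aoyama") == case_insensitive_hash("AOYAMA")  # true
```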
Conflicts
- A conflict happens when hf(k1) = hf(k2) but k1 ≠ k2
- Conflicts happen quite easily
- This requires special treatment
- Main solutions: chaining and open addressing
Terms and Variables for Conflict Resolution
- Number of data items: n
- Fields in hash table: bins/buckets
- Number of bins: m
(equal to the range of values of the hash function after the modulo
operation)
- Load factor (average number of data items per bin): α (= n/m)
- For a good (close to random) hash function, the variation in the number
of data items for each bin is low
(Poisson distribution)
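The following sketch measures these quantities for a concrete set of keys. It relies on Ruby's built-in hash method. The helper names bin_counts and load_factor and the generated keys are assumptions made for this example.

```ruby
# Sketch: count how many data items fall into each of the m bins.
def bin_counts(keys, m)
  counts = Array.new(m, 0)
  keys.each { |k| counts[k.hash % m] += 1 }   # Ruby's % is non-negative for m > 0
  counts
end

# Load factor alpha = n / m.
def load_factor(n, m)
  n.to_f / m
end

keys = (1..3000).map { |i| "student#{i}" }
counts = bin_counts(keys, 1000)
puts "load factor: #{load_factor(keys.size, counts.size)}"   # 3.0
puts "fullest bin: #{counts.max}, emptiest bin: #{counts.min}"
```

The exact bin counts differ from run to run, because Ruby seeds its string hash function differently for each process.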
Chaining
- Store conflicting data items in a linked list
- Each bin in the hash table is the head of a linked list
- If the linked list is short, then search/insertion/deletion will be
fast
- The average length of the linked list is equal to load factor α
- The load factor is usually greater than 1 (e.g. 3≦α≦6)
- All operations are carried out in three steps:
- Use hf(k) mod m to find the bin
- Use hf(k) (without modulo operation) to find a
candidate entry in the linked list
- Use the actual key k to confirm that we found the correct
data item in the linked list
Implementation of Chaining
- Implementation in Ruby: Ahashdictionary.rb
- Uses Array in place of linked list
- Uses Ruby's hash function
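The following is a minimal sketch of chaining along the same lines. It is not the contents of Ahashdictionary.rb. The class name ChainedDictionary and its method names are assumptions. Like the implementation mentioned above, it uses an Array for each bin and Ruby's hash method.

```ruby
# Sketch of a dictionary using chaining; all names are hypothetical.
class ChainedDictionary
  def initialize(m = 8)
    @bins = Array.new(m) { [] }   # one Array per bin, in place of a linked list
  end

  def insert(key, value)
    h = key.hash
    bin = @bins[h % @bins.size]   # step 1: hf(k) mod m selects the bin
    entry = find_entry(bin, h, key)
    if entry
      entry[2] = value            # key already present: replace the value
    else
      bin << [h, key, value]
    end
    value
  end

  def search(key)
    h = key.hash
    entry = find_entry(@bins[h % @bins.size], h, key)
    entry && entry[2]
  end

  def delete(key)
    h = key.hash
    bin = @bins[h % @bins.size]
    entry = find_entry(bin, h, key)
    return nil unless entry
    bin.delete_if { |e| e.equal?(entry) }
    entry[2]
  end

  private

  # step 2: compare the full hash values (cheap);
  # step 3: confirm with the actual key
  def find_entry(bin, h, key)
    bin.find { |e| e[0] == h && e[1].eql?(key) }
  end
end

months = ChainedDictionary.new(4)
months.insert("January", 31)
months.insert("February", 28)
months.insert("February", 29)   # replacement
p months.search("February")     # 29
p months.delete("January")      # 31
p months.search("January")      # nil
```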
Open Addressing
- Store key and data in hash table itself
- In case of conflict, successively check different bins
- For check number i, use hash function ohf(key, i)
- Linear probing: ohf(key, i) = hf(key) + i
- Quadratic probing: ohf(key, i) = hf(key) + c1·i + c2·i²
- Many other variations exist
- The load factor has to be between 0 and 1; ≦0.5 is reasonable
- Problem: Deletion is difficult
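A minimal sketch of open addressing with linear probing is shown below. The class name LinearProbingTable and the DELETED marker are assumptions made for this example. The marker (often called a tombstone) is one common way to handle the deletion problem: a deleted bin cannot simply be set to nil, because that would cut off the probe sequence for keys stored further along.

```ruby
# Sketch of open addressing with linear probing; all names are hypothetical.
class LinearProbingTable
  DELETED = Object.new   # tombstone: marks a bin whose entry was removed

  def initialize(m = 16)
    @m = m
    @slots = Array.new(m)   # each bin: nil, DELETED, or [key, value]
  end

  def insert(key, value)
    first_free = nil
    @m.times do |i|
      idx = probe(key, i)
      slot = @slots[idx]
      if slot.nil?
        first_free ||= idx
        break                      # end of probe sequence: key is not present
      elsif slot.equal?(DELETED)
        first_free ||= idx         # reusable, but the key may still follow
      elsif slot[0].eql?(key)
        slot[1] = value            # replacement
        return value
      end
    end
    raise "hash table is full" if first_free.nil?
    @slots[first_free] = [key, value]
    value
  end

  def search(key)
    idx = find_index(key)
    idx && @slots[idx][1]
  end

  def delete(key)
    idx = find_index(key)
    return nil unless idx
    value = @slots[idx][1]
    @slots[idx] = DELETED
    value
  end

  private

  # ohf(key, i) = hf(key) + i, adjusted to the table size
  def probe(key, i)
    (key.hash + i) % @m
  end

  def find_index(key)
    @m.times do |i|
      idx = probe(key, i)
      slot = @slots[idx]
      return nil if slot.nil?          # empty bin: key is not in the table
      next if slot.equal?(DELETED)     # skip tombstones and keep probing
      return idx if slot[0].eql?(key)
    end
    nil
  end
end

table = LinearProbingTable.new(8)
table.insert("January", 31)
table.insert("February", 28)
p table.search("February")   # 28
table.delete("January")
p table.search("January")    # nil
p table.search("February")   # 28
```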
Time Complexity of Hashing
(average, for chaining)
- Calculation of hash function
- Dependent on key length
- O(1) if key length is constant or limited
- Search in bin
- Dependent on load factor
- O(1) if load factor is below a given constant
- O(n) in worst case, but this can be avoided by choice of hash function
Expansion and Shrinking of Hash Table
- The efficiency of hashing depends on the load factor
- If the number of data items increases, the hash table has to be
expanded
- If the number of data items decreases, it is desirable to shrink the hash
table
- Expansion/shrinking can be implemented by re-inserting the data into a
new hash table
(changing the divisor of the modulo operation)
- Expansion/shrinking is heavy (time: O(n))
Analysis of the Time Complexity of Expansion
- If the hash table is expanded for every data insertion, this is extremely
inefficient
- Limit the number of expansions:
- Increase the size of the hash table whenever the number of data items
doubles
- The time needed for the insertion of n data items (n = 2^x) is
2 + 4 + 8 + ... + n/2 + n < 2n = O(n)
- The time complexity per data item is O(n)/n = O(1)
(This is a simple example of amortized analysis.)
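The bound can be checked with a small simulation. The sketch below assumes that the table starts with capacity 1, doubles whenever it is full, and that the cost of an expansion is the number of data items that have to be re-inserted. The method name total_reinsertions is hypothetical.

```ruby
# Sketch: count the re-insertions caused by doubling while inserting n items.
def total_reinsertions(n)
  capacity = 1
  size = 0
  copies = 0
  n.times do
    if size == capacity     # table is full: expand before inserting
      copies += size        # every existing item is re-inserted
      capacity *= 2
    end
    size += 1
  end
  copies
end

p total_reinsertions(1024)   # 1023, i.e. 1 + 2 + 4 + ... + 512
```

Under these assumptions the total stays below 2n, so the average extra cost per inserted data item is constant.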
Special Purpose Hash Functions
- Universal hashing
- Perfect hash function
- Cryptographic hash function
Universal Hashing
- Include a random number to create a different hash function for each
program execution
- Protects against some denial-of-service attacks, in which an attacker:
- Provides lots of data with the same hash value
- Degrades the efficiency of the hash table from O(1) to O(n)
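As an illustration, the sketch below uses one classic universal family for integer keys, h(k) = ((a·k + b) mod p) mod m, where p is a prime larger than any key and a and b are chosen at random. This particular family, the class name UniversalHash, and the choice p = 2^61 − 1 are assumptions made for this example; they are not taken from the lecture.

```ruby
# Sketch of universal hashing for non-negative integer keys smaller than P.
class UniversalHash
  P = 2**61 - 1   # a prime larger than any key used here

  # m: number of bins; rng: source of randomness (fixed seeds only for demos)
  def initialize(m, rng = Random.new)
    @m = m
    @a = rng.rand(1...P)   # random multiplier, never 0
    @b = rng.rand(0...P)   # random offset
  end

  def call(key)
    ((@a * key + @b) % P) % @m
  end
end

hf1 = UniversalHash.new(100)
hf2 = UniversalHash.new(100)
# Each instance is a different hash function, so data prepared to
# conflict under one of them is unlikely to conflict under the other.
p hf1.call(15815000)
p hf2.call(15815000)
```

Because a and b are not known in advance, an attacker cannot prepare a fixed set of keys that all land in the same bin.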
Perfect Hash Function
- Custom-designed hash function without conflicts
- Useful when data is completely predefined
- In the best case, the hash table is completely filled
- Application: Keywords in programming languages
- Example implementation: GNU gperf
(used in Ruby character property lookups)
Cryptographic Hash Function
- Used for electronic signatures, ...
- Differences from general hash functions:
- Output usually longer (e.g. 128/256/384/512/... bits)
- Practically impossible to generate same output from different input
- Much more difficult to invert (find k from hf(k))
- Evaluation may take longer
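As a small illustration, Ruby's standard digest library provides several cryptographic hash functions. The sketch below uses SHA-256, chosen here only as an example. The wrapper name sha256_hex is hypothetical.

```ruby
require 'digest'

# Returns the SHA-256 hash of data as a string of hexadecimal digits.
def sha256_hex(data)
  Digest::SHA256.hexdigest(data)
end

puts sha256_hex("Aoyama")
puts sha256_hex("Aoyamb")              # one changed character: unrelated-looking output
puts sha256_hex("Aoyama").length * 4   # 256 (bits; each hex digit encodes 4 bits)
```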
Evaluation of Hashing
Advantages:
- Search/insertion/deletion are possible in (average) constant time
- Reasonably good actual performance
- No need for keys to be numeric or ordered
- Wide field of application
Problems:
- Sorting needs to be done separately
(Ruby Hashes preserve insertion order, but not key order)
- Proximity/similarity search is impossible
- Expansion/shrinking requires time (a single operation may be interrupted by a long pause)
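The first problem can be seen directly with a Ruby Hash: iteration follows insertion order, and key order requires a separate O(n log n) sorting step. The helper name sorted_by_key is hypothetical.

```ruby
# Sketch: a Hash does not keep its keys sorted; sorting is a separate step.
def sorted_by_key(hash)
  hash.sort.to_h   # sorts the [key, value] pairs by key: O(n log n)
end

months = { 'March' => 31, 'January' => 31, 'February' => 28 }
p months.keys                  # ["March", "January", "February"] (insertion order)
p sorted_by_key(months).keys   # ["February", "January", "March"] (key order)
```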
Comparison of Dictionary Implementations
Implementation | Search | Insertion | Deletion | Sorting
Sorted array | O(log n) | O(n) | O(n) | O(n)
Unordered array/linked list | O(n) | O(1) | O(n) | O(n log n)
Balanced tree | O(log n) | O(log n) | O(log n) | O(n)
Hash table | O(1) | O(1) | O(1) | O(n log n)
The Ruby Hash Class
(Perl: hash; Java: HashMap; Python: dict)
- Because dictionary ADTs are often implemented using hashing,
in many programming languages, dictionaries are also called "hash"
- Creation: my_hash = {} or my_hash = Hash.new
- Initialization:
months = {'January' => 31, 'February' => 28, 'March' => 31, ... }
- Insertion/replacement: months['February'] = 29
- Lookup: this_month_length = months[this_month]
Hash in Ruby has more functionality than in other programming languages
Implementation of Hashing in Ruby
- Source: st.c
- Used chaining until 2016
(originally by Peter Moore, University of California Berkeley (1989))
- On Nov. 7, 2016 replaced by open addressing (by Vladimir Makarov, with help from Yura Sokolov)
- Reason: Faster because open addressing works better with the modern cache hierarchy
- Used inside Ruby, too:
- Lookup of global identifiers such as class names
- Lookup of methods for each class
- Lookup of instance variables for each object
Summary
- A hash table implements a dictionary ADT using a hash function
- Main design points:
- Selection of hash function
- Conflict resolution methods (chaining or open addressing)
- Reasonably good actual performance
- Wide field of application
Glossary
- direct addressing
- 直接アドレス表
- hashing, scatter storage technique
- ハッシュ法、挽き混ぜ法
- hash function
- ハッシュ関数
- hash table
- ハッシュ表
- game of Go
- 囲碁
- joseki
- 定石 (囲碁)
- universal hashing
- 万能ハッシュ法
- denial of service attack
- DOS 攻撃、サービス拒否攻撃
- perfect hash function
- 完全ハッシュ関数
- cryptographic hash function
- 暗号技術的ハッシュ関数
- electronic signatures
- 電子署名
- conflict
- 衝突
- Poisson distribution
- ポアソン分布
- chaining
- チェイン法、連鎖法
- open addressing
- 開番地法、オープン法
- load factor
- 占有率
- linear probing
- 線形探査法
- quadratic probing
- 二次関数探査法
- divisor
- (割り算の) 法
- amortized analysis
- 償却分析
- proximity search
- 近接探索
- similarity search
- 類似探索