Hash Functions and Hash Tables
(ハッシュ関数とハッシュ表)
Data Structures and Algorithms
10th lecture, November 19, 2015
http://www.sw.it.aoyama.ac.jp/2015/DA/lecture10.html
Martin J. Dürst
© 2009-15 Martin J. Dürst, Aoyama Gakuin University
Today's Schedule
- Additional speedup for dictionary
- Overview of hashing
- Hash functions
- Conflict resolution
- Evaluation of hashing
- Hash in Ruby
- Summary
Time Complexity for Dictionary Implementations up to Here
Implementation              | Search   | Insertion | Deletion
----------------------------|----------|-----------|---------
Sorted array                | O(log n) | O(n)      | O(n)
Unordered array/linked list | O(n)     | O(1)      | O(n)
Balanced tree               | O(log n) | O(log n)  | O(log n)
Direct Addressing
- Use an array with an element for each key value
- Search: value = array[key], time: O(1)
- Insertion/replacement: array[key] = value, time: O(1)
- Deletion: array[key] = nil, time: O(1)
- Example:
  students = []
  students[15814000] = "Hanako Aoyama"
Problem: Array size, non-numeric keys
Solution: key transformation
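A minimal Ruby sketch of direct addressing, using the student-number example above (the concrete key and value are just illustrations):

# Direct addressing: the key itself is the array index
students = []
students[15814000] = "Hanako Aoyama"   # insertion/replacement: O(1)
p students[15814000]                   # search: O(1)  => "Hanako Aoyama"
students[15814000] = nil               # deletion: O(1)

Note that the single assignment above already makes Ruby allocate an array of more than 15 million elements, which illustrates the array-size problem mentioned above.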
Overview of Hashing
(also called scatter storage technique)
- Transform the key k to a compact space using the hash function hf
- Use hf(k) instead of k in the same way as direct addressing
- The array is called a hash table
- The hash function is evaluated in O(1) time
- Search: value = table[hf(key)], time: O(1)
- Insertion/replacement: table[hf(key)] = value, time: O(1)
- Deletion: table[hf(key)] = nil, time: O(1)
Problem 1: Design of hash function
Problem 2: Resolution of conflicts
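As a rough sketch (ignoring conflicts for now), the same three operations look as follows in Ruby; the table size M and the use of Ruby's built-in hash method as hf are assumptions for illustration only:

M = 1000                                 # assumed hash table size
table = Array.new(M)

def hf(key)                              # hash function, evaluated in O(1)
  key.hash % M
end

table[hf("Hanako Aoyama")] = 15814000    # insertion/replacement: O(1)
p table[hf("Hanako Aoyama")]             # search: O(1)  => 15814000
table[hf("Hanako Aoyama")] = nil         # deletion: O(1)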
Overview of Hash Function
- Goals:
  - From a key, calculate an index that is as evenly distributed as possible
  - Adjust the range of the result to the hash table size
- Steps:
  - Step 1: Calculate a large integer (e.g. int in C) from the key
  - Step 2: Adjust this large integer to the hash table size using a modulo operation
Step 2 is easy. Therefore, we concentrate on step 1.
(Often, step 1 alone is called 'hash function'.)
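A hedged Ruby sketch of the two steps for a string key; the multiplier 31 is only an illustrative choice, not a value from the lecture:

# Step 1: calculate a large integer from the key
def hash_to_integer(key)
  h = 0
  key.each_byte { |b| h = h * 31 + b }   # combine every byte of the key
  h
end

# Step 2: adjust the integer to the hash table size m with a modulo operation
def hf(key, m)
  hash_to_integer(key) % m
end

p hf("January", 100)                     # an index between 0 and 99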
Hash Function Example 1
(SDBM Hash Function)
unsigned int sdbm_hash(char key[])
{
    unsigned int hash = 0;
    while (*key) {
        /* the shifts need parentheses because of operator precedence */
        hash = *key++ + (hash << 6) + (hash << 16) - hash;
    }
    return hash;    /* return the accumulated hash value */
}
Hash Function Example 2
(simplified from MurmurHash3; for 32-bit machines)
#define ROTL32(n,by) (((n)<<(by)) | ((n)>>(32-(by))))
#define C1 0xcc9e2d51   /* constant taken from MurmurHash3 */
#define R1 15           /* rotation amount taken from MurmurHash3 */

int too_simple_hash(int key[], int length)
{
    int h = 0;
    for (int i = 0; i < length; i++) {
        int k = key[i] * C1;    /* mix each word of the key with a constant */
        h ^= ROTL32(k, R1);     /* rotate and fold into the running hash */
    }
    h ^= h >> 13;               /* finalization: spread the bits */
    h *= 0xc2b2ae35;
    return h;
}
Evaluation of Hash Functions
- Quality of distribution
- Execution speed
- Ease of implementation
Precautions for Hash Functions
- Use all parts of the key
  Counterexample: Using only characters 3 and 4 of a string → bad distribution
- Do not use data besides the key
  If some data attributes (e.g. price of a product, student's total marks) change, the hash value will change and the data cannot be found anymore
- Collapse equivalent keys
  Examples: Upper/lower case letters, top/bottom/left/right/black/white symmetries for joseki in Go
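For the last point, a small hypothetical sketch: normalizing the key before hashing collapses upper and lower case into the same bin.

# Collapse upper/lower case: "Apple", "APPLE" and "apple" hash identically
def case_insensitive_hf(key, m)
  key.downcase.hash % m
end

p case_insensitive_hf("Apple", 100) == case_insensitive_hf("APPLE", 100)   # => true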
Special Hash Functions
- Universal hashing
  - Use a random function to create a different hash function for each program execution
  - Implementation: Change a constant in the hash function (see the sketch after this list)
  - Solution for some denial-of-service attacks (reference)
- Perfect hash function
  - Custom-designed hash function without conflicts
  - Useful when data is completely predefined
  - In the best case, the hash table is completely filled
  - Application: Keywords in programming languages
  - Example implementation: gnu gperf (in Japanese)
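A rough sketch of the universal hashing idea; the random seed and the multiplier 131 are assumptions for illustration, not the lecture's implementation:

SEED = rand(1 << 31)             # a constant chosen anew on each program run

def universal_hf(key, m)
  h = SEED
  key.each_byte { |b| h = (h * 131 + b) & 0xffffffff }
  h % m
end

Ruby's own String#hash is seeded randomly per process for the same reason (protection against hash-flooding denial-of-service attacks).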
Cryptographic Hash Function
- Used in electronic signatures
- Differences from general hash functions:
- Practically impossible to find different inputs that produce the same output
- Output usually longer (e.g. 128/256/384/512/... bits)
- Evaluation may take longer
Conflict
- A conflict happens when hf(k1) = hf(k2) even though k1 ≠ k2
- This requires special treatment
- Main solutions: chaining and open addressing (described below)
Terms and Variables for Conflict Resolution
- Fields in hash table: bins/buckets
- Number of bins: m (equal to the range of values of the hash function after the modulo operation)
- Number of data items: n
- Load factor (average number of data items per bin): α = n/m
- For a good (close to random) hash function, the variation in the number of data items for each bin is low (Poisson distribution)
Chaining
- Store conflicting data items in a linked list
- Each bin in the hash table is the head of a linked list
- If the linked list is short, then search/insertion/deletion will be fast
- The average length of the linked list is α
- All operations are carried out in two steps:
- Use hf(k) to determine the bin
- Use the key to find the data item in the linked list
Implementation of Chaining
- Implementation in Ruby: Ahashdictionary.rb
- Uses an Array in place of a linked list
- Uses Ruby's hash function
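A minimal independent sketch of chaining in Ruby (not the implementation referenced above); the class and method names are made up, and an Array of [key, value] pairs plays the role of each bin's linked list:

class ChainedHash
  def initialize(m = 16)
    @m = m
    @bins = Array.new(m) { [] }          # each bin starts as an empty list
  end

  def hf(key)
    key.hash % @m                        # step 1 (Ruby's hash) and step 2 (modulo)
  end

  def insert(key, value)                 # average O(1 + α)
    bin = @bins[hf(key)]
    pair = bin.find { |k, _| k == key }
    if pair
      pair[1] = value                    # replace existing value
    else
      bin << [key, value]                # append new pair
    end
  end

  def search(key)                        # average O(1 + α)
    pair = @bins[hf(key)].find { |k, _| k == key }
    pair && pair[1]
  end

  def delete(key)                        # average O(1 + α)
    @bins[hf(key)].reject! { |k, _| k == key }
  end
end

d = ChainedHash.new
d.insert('January', 31)
p d.search('January')                    # => 31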
Open Addressing
- Store key and data in the hash table itself
- In case of conflict, successively check different bins
- For check number i, use hash function ohf(key, i)
  - Linear probing: ohf(key, i) = hf(key) + i
  - Quadratic probing: ohf(key, i) = hf(key) + c1·i + c2·i^2
  - Many other variations exist
- The load factor has to be between 0 and 1; α ≤ 0.5 is reasonable
- Problem: Deletion is difficult
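A sketch of open addressing with linear probing; only insertion and search are shown, since deletion is exactly the difficult part (the class and method names are made up for this example):

class LinearProbingHash
  def initialize(m = 16)
    @m = m
    @keys   = Array.new(m)               # keys and values live in the table itself
    @values = Array.new(m)
  end

  def ohf(key, i)
    (key.hash + i) % @m                  # linear probing: hf(key) + i
  end

  def insert(key, value)
    @m.times do |i|
      j = ohf(key, i)
      if @keys[j].nil? || @keys[j] == key
        @keys[j], @values[j] = key, value
        return
      end
    end
    raise 'hash table full'              # load factor must stay below 1
  end

  def search(key)
    @m.times do |i|
      j = ohf(key, i)
      return nil        if @keys[j].nil?   # empty bin: key is not present
      return @values[j] if @keys[j] == key
    end
    nil
  end
end

t = LinearProbingHash.new
t.insert('March', 31)
p t.search('March')                      # => 31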
Time Complexity of Hashing
(average, for chaining)
- Calculation of hash function
  - Dependent on key length
  - O(1) if key length is constant or limited
- Search in bin
  - Dependent on load factor
  - O(1) if load factor is below a given constant
  - O(n) in worst case, but this can be avoided by choice of hash function
Expansion and Shrinking of Hash Table
- The efficiency of hashing depends on the load factor
- If the number of data items increases, the hash table has to be expanded
- If the number of data items decreases, it is desirable to shrink the hash table
- Expansion/shrinking can be implemented by re-inserting the data into a new hash table (changing the divisor of the modulo operation)
- Expansion/shrinking is heavy (time: O(n))
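A sketch of expansion by re-insertion, written for the chained representation used in the earlier sketch (an Array of bins, each holding [key, value] pairs); doubling the number of bins is an assumed, common choice:

# Re-insert every pair into a new table with twice as many bins;
# the modulo divisor changes, so most keys move to a different bin.
def expand(bins)
  new_m    = bins.size * 2
  new_bins = Array.new(new_m) { [] }
  bins.each do |bin|
    bin.each { |key, value| new_bins[key.hash % new_m] << [key, value] }
  end
  new_bins                               # O(n) in total
end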
Analysis of the Time Complexity of Expansion
- If the hash table is expanded for each data insertion, this is extremely inefficient
- Limit the number of expansions:
  - Increase the size of the hash table whenever the number of data items doubles
  - The time needed for the insertion of n data items (n = 2^x) is 2 + 4 + 8 + ... + n/2 + n < 2n = O(n)
  - The time complexity per data item is O(n)/n = O(1)
(simple example of amortized analysis)
Evaluation of Hashing
Advantages:
- Search/insertion/deletion are possible in (average) constant time
- Reasonably good actual performance
- No need for keys to be ordered
- Wide field of application
Problems:
- Sorting needs to be done separately
- Proximity/similarity search is impossible
- Expansion/shrinking requires time (and may temporarily interrupt operations)
Comparison of Dictionary Implementations
Implementation              | Search   | Insertion | Deletion | Sorting
----------------------------|----------|-----------|----------|-----------
Sorted array                | O(log n) | O(n)      | O(n)     | O(n)
Unordered array/linked list | O(n)     | O(1)      | O(1)     | O(n log n)
Balanced tree               | O(log n) | O(log n)  | O(log n) | O(n)
Hash table                  | O(1)     | O(1)      | O(1)     | O(n log n)
The Ruby Hash Class
(Perl: hash; Java: HashMap; Python: dict)
- Because dictionary ADTs are often implemented using hashing, in many programming languages dictionaries are called hashes
- Creation: my_hash = {} or my_hash = Hash.new
- Initialization: months = {'January' => 31, 'February' => 28, 'March' => 31, ... }
- Insertion/replacement: months['February'] = 29
- Lookup: this_month_length = months[this_month]
Hash in Ruby has more functionality than in other programming languages (presentation)
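Collected into one runnable snippet (the month values follow the slide's example):

# Creation
my_hash = {}                             # or: my_hash = Hash.new

# Initialization
months = { 'January' => 31, 'February' => 28, 'March' => 31 }

# Insertion/replacement
months['February'] = 29                  # leap year

# Lookup
this_month = 'March'
this_month_length = months[this_month]
p this_month_length                      # => 31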
Implementation of Hashing in Ruby
- Originally by Peter Moore, University of California Berkeley (1989)
- Source: st.c
- Uses chaining
- Expansion by a factor of about 2 whenever a load factor of 5.0 is reached (ST_DEFAULT_MAX_DENSITY); see the sketch after this list
- The size of the hash table is a power of 2 (next_pow2)
  (earlier, it was a prime number close to a power of 2)
- Even if there are many deletions, the hash table is never shrunk
- Used inside Ruby, too:
- Lookup of global identifiers such as class names
- Lookup of methods for each class
- Lookup of instance variables for each object
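A rough Ruby sketch of the expansion policy described above; the helper names follow the bullet points, but this is not the actual st.c code:

ST_DEFAULT_MAX_DENSITY = 5.0             # maximum load factor before expansion

def needs_expansion?(n, m)               # n data items, m bins
  n.fdiv(m) > ST_DEFAULT_MAX_DENSITY
end

def next_pow2(x)                         # smallest power of 2 that is >= x
  power = 1
  power <<= 1 while power < x
  power
end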
Summary
- A hash table implements a dictionary ADT using a hash function
- Main points: Selection of hash function, conflict resolution method
- Reasonably good actual performance
- Wide field of application
Glossary
- direct addressing: 直接アドレス表
- hashing, scatter storage technique: ハッシュ法、かき混ぜ法
- hash function: ハッシュ関数
- hash table: ハッシュ表
- joseki: 定石 (囲碁)
- universal hashing: 万能ハッシュ法
- denial of service attack: DoS 攻撃、サービス拒否攻撃
- perfect hash function: 完全ハッシュ関数
- cryptographic hash function: 暗号技術的ハッシュ関数
- electronic signatures: 電子署名
- conflict: 衝突
- Poisson distribution: ポアソン分布
- chaining: チェイン法、連鎖法
- open addressing: 開番地法、オープン法
- load factor: 占有率
- linear probing: 線形探査法
- quadratic probing: 二次関数探査法
- divisor: (割り算の) 法
- amortized analysis: 償却分析
- proximity search: 近接探索
- similarity search: 類似探索