Hash Functions and Hash Tables

(ハッシュ関数とハッシュ表)

Data Structures and Algorithms

10th lecture, November 30, 2017

http://www.sw.it.aoyama.ac.jp/2017/DA/lecture10.html

Martin J. Dürst

Today's Schedule

Summary of last lecture
Additional speedup for dictionary
Overview of hashing
Hash functions
Conflict resolution
Evaluation of hashing
Hashes in Ruby
Summary

Summary of Last Lecture

Balanced trees keep search/insertion/deletion in a dictionary ADT at O(log n) worst-case time
2-3-4 trees and B(+)trees increase the degree of the nodes beyond that of a binary tree, but keep all leaves at the same depth
Red-black-trees and AVL-trees impose limitations on the variation of the tree height
B-trees and B+ trees are very useful for file systems and databases on secondary storage

Time Complexity for Known Dictionary Implementations

Implementation	Search	Insertion	Deletion
Sorted array	`O`(log `n`)	`O`(`n`)	`O`(`n`)
Unordered array/linked list	`O`(`n`)	`O`(1)	`O`(`n`)
Balanced tree	`O`(log `n`)	`O`(log `n`)	`O`(log `n`)

Direct Addressing

Use an array with an element for each key value
Search: value = array[key], time: O(1)
Insertion/replacement: array[key] = value, time: O(1)
Deletion: array[key] = nil, time: O(1)
Example:
students = []
students[15815000] = "I.T. Aoyama"

Problem: Array size, non-numeric keys

Solution: Transform key with hash function

Overview of Hashing

(also called scatter storage technique)

Transform the key k to a compact space using the hash function hf
Use hf(k) instead of k in the same way as direct addressing
The array containing the data is called the hash table (table below)
The hash function can be evaluated in constant time (O(1))
Search: value = table[hf(key)], time: O(1)
Insertion/replacement: table[hf(key)] = value, time: O(1)
Deletion: table[hf(key)] = nil, time: O(1)

Problems with Hashing

Choice/design of hash function
Example 1: remainder: def hf(k); k % 100; end

students[15815000 % 100] = "I.T.Aoyama"

Example 2: sum of codepoints: def hf(k); k.codepoints.sum; end

students["HanakoAoyama".codepoints.sum] = ...
Resolution of conflicts
What happens with the following:

students[15815000 % 100] = "I.T.Aoyama"

students[15715000 % 100] = "K.S.Aoyama"
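
A minimal sketch of the conflict above, using the remainder function from Example 1. The name `table` is chosen for illustration only.

```ruby
# Remainder hash function from Example 1
def hf(k)
  k % 100
end

table = []
table[hf(15815000)] = "I.T.Aoyama"
table[hf(15715000)] = "K.S.Aoyama"   # same bin as above

# Both keys end in 00, so both map to bin 0 and the second
# assignment silently overwrites the first entry.
puts hf(15815000) == hf(15715000)   # true
puts table[hf(15815000)]            # K.S.Aoyama
```

Without conflict resolution, a lookup for the first student returns the data of the second one.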

Overview of Hash Function

Goals:
1. From a key, calculate an index that is evenly (randomly) distributed
  (this is the reason for the word hash, as in hashed beef or hash browns)
2. Adjust the range of the result to the hash table size
Steps:
1. Calculate a large integer (e.g. int in C) from the key
2. Adjust this large integer to the hash table size using a modulo operation

Step 2 is easy. Therefore, we concentrate on step 1.
(often step 1 alone is called 'hash function')
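
The two steps can be sketched as follows. The method names and the multiplier 31 are assumptions made for this illustration, not part of the lecture material.

```ruby
# Step 1: calculate a large integer from the key. Here: a simple
# polynomial over the codepoints, kept within 32 bits.
def large_hash(key)
  key.each_codepoint.reduce(0) { |h, c| (h * 31 + c) & 0xFFFFFFFF }
end

# Step 2: adjust the large integer to the hash table size.
def bin_index(key, table_size)
  large_hash(key) % table_size
end

puts large_hash("ab")                   # 3105 (97 * 31 + 98)
puts bin_index("Hanako Aoyama", 101)    # some index between 0 and 100
```

Only step 2 depends on the table size, so the table can be resized without changing step 1.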

Hash Function Example 1

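unsigned int sdbm_hash(const char key[])
{
    unsigned int hash = 0;  // unsigned: overflow wraps around
    while (*key) {
        // parentheses are needed: + and - bind tighter than <<
        hash = (unsigned char)*key++ + (hash << 6)
               + (hash << 16) - hash;
    }
    return hash;
}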

Hash Function Example 2

(simplified from MurmurHash3; for 32-bit machines)

#define ROTL32(n,by) (((n)<<(by)) | ((n)>>(32-(by))))
unsigned int too_simple_hash(unsigned int key[], int length)
{
    unsigned int h = 0;  // unsigned, assumed to be 32 bits wide
    for (int i=0; i<length; i++) {
        unsigned int k = key[i] * C1;  // C1 is a constant
        h ^= ROTL32(k, R1);  // R1 is a constant
    }
    h ^= h >> 13;
    h *= 0xc2b2ae35;
    return h;
}

Evaluation of Hash Functions

Quality of distribution
Execution speed
Ease of implementation

Precautions for Hash Functions

Use all parts of the key
Counterexample: Using only characters 3 and 4 of a string → bad distribution
Do not use data besides the key
If some data attributes (e.g. price of a product, student's total marks) change, the hash value will change and the data will not be found anymore
Collapse equivalent keys
Examples: Upper/lower case letters, top/bottom, left/right, diagonal, and black/white symmetries for the game of Go
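
A small sketch of collapsing equivalent keys, for the upper/lower case example. The names `normalize` and `case_insensitive_hf` are hypothetical; the hash itself is the sum of codepoints from Example 2.

```ruby
# Map all equivalent keys to one representative before hashing.
def normalize(key)
  key.downcase
end

def case_insensitive_hf(key)
  normalize(key).codepoints.sum
end

puts case_insensitive_hf("Aoyama") == case_insensitive_hf("AOYAMA")   # true
```

The same normalization also has to be applied when keys are compared, otherwise equivalent keys end up in the same bin but are still treated as different.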

Special Purpose Hash Functions

Universal hashing
Perfect hash function
Cryptographic hash function

Universal Hashing

Include a random number to create a different hash function for each program execution
Solution for some denial-of-service attacks:
- Provide lots of data with same hash value
- Efficiency of hash degrades from O(1) to O(n)
(reference)
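
A minimal sketch of the idea for integer keys. The slide does not say how the random number is used; the form ((a·k + b) mod p) mod m is one well-known textbook family and is an assumption here, as are all names.

```ruby
UH_PRIME = 2**61 - 1           # a Mersenne prime, larger than any key used here
UH_A = rand(1...UH_PRIME)      # random multiplier, chosen once per execution
UH_B = rand(0...UH_PRIME)      # random offset, chosen once per execution

# Hash an integer key k into m bins.
def universal_hf(k, m)
  ((UH_A * k + UH_B) % UH_PRIME) % m
end

puts universal_hf(15815000, 100)   # differs from one program execution to the next
```

Because an attacker does not know the random values, it is hard to prepare many keys in advance that all fall into the same bin.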

Perfect Hash Function

Custom-designed hash function without conflicts
Useful when data is completely predefined
In the best case, the hash table is completely filled
Application: Keywords in programming languages
Example implementation: GNU gperf (in Japanese)
(used in Ruby character property lookups)
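
A toy sketch of searching for a conflict-free hash function for a fixed keyword set. It only tries different multipliers and is not how gperf works internally; the keyword list, the method names, and the table size 7 are assumptions.

```ruby
KEYWORDS = %w[if else while def end class].freeze

# Hash of a word for a given multiplier, kept within 32 bits.
def keyword_hash(word, mult)
  word.each_codepoint.reduce(0) { |h, c| (h * mult + c) & 0xFFFFFFFF }
end

# Try multipliers until all words land in different bins.
# Returns nil if no such multiplier is found within the limit.
def find_perfect_multiplier(words, m, limit = 10_000)
  (1..limit).find do |mult|
    words.map { |w| keyword_hash(w, mult) % m }.uniq.size == words.size
  end
end

mult = find_perfect_multiplier(KEYWORDS, 7)
if mult
  KEYWORDS.each { |w| puts "#{w} -> bin #{keyword_hash(w, mult) % 7}" }
else
  puts "no conflict-free multiplier found within the limit"
end
```

The search is done once, ahead of time; at run time only the resulting function is evaluated.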

Cryptographic Hash Function

Used for electronic signatures, ...
Differences from general hash functions:
- Practically impossible to generate same output from different input
- Output usually longer (e.g. 128/256/384/512/... bits)
- Evaluation may take longer
- Much more difficult to invert (find k from hf(k))
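
As an illustration of the longer output, SHA-256 from Ruby's standard `digest` library produces 256 bits. The input strings are arbitrary examples.

```ruby
require 'digest'

digest1 = Digest::SHA256.hexdigest("I.T. Aoyama")
digest2 = Digest::SHA256.hexdigest("I.T. Aoyamb")   # last letter changed

puts digest1.length * 4    # 256 (each hexadecimal digit encodes 4 bits)
puts digest1 == digest2    # false
```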

Conflicts

A conflict happens when hf(k₁) = hf(k₂) even though k₁ ≠ k₂
Conflicts happen quite easily
This requires special treatment
Main solutions:
- Chaining
- Open addressing

Terms and Variables for Conflict Resolution

Number of data items: n
Fields in hash table: bins/buckets
Number of bins: m
(equal to the range of values of the hash function after the modulo operation)
Load factor (average number of data items per bin): α (= n/m)
For a good (close to random) hash function, the variation in the number of data items for each bin is low
(Poisson distribution)
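
The terms above can be tried out with a short sketch that counts the data items per bin. It uses Ruby's built-in `hash` method; the key names and the sizes n = 1000 and m = 250 are arbitrary assumptions.

```ruby
# Count how many of the keys fall into each of the m bins.
def bin_counts(keys, m)
  counts = Array.new(m, 0)
  keys.each { |k| counts[k.hash % m] += 1 }
  counts
end

keys = (1..1000).map { |i| "student#{i}" }   # n = 1000
m = 250
counts = bin_counts(keys, m)

puts "load factor: #{keys.size.to_f / m}"    # 4.0
puts "fullest bin: #{counts.max}"
puts "empty bins:  #{counts.count(0)}"
```

The exact counts differ between program executions because Ruby's string hash uses a random seed.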

Chaining

Store conflicting data items in a linked list
Each bin in the hash table is the head of a linked list
If the linked list is short, then search/insertion/deletion will be fast
The average length of the linked list is equal to load factor α
The load factor is usually greater than 1 (e.g. 3≦α≦6)
All operations are carried out in three steps:
1. Use hf(k) (after modulo operation) to determine the bin
2. Use hf(k) (without modulo operation) to find a candidate entry in the linked list
3. Use the actual key k to confirm that we found the correct data item in the linked list

Implementation of Chaining

Implementation in Ruby: Ahashdictionary.rb
Uses Array in place of linked list
Uses Ruby's hash function
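
The following is a minimal sketch in the same spirit (Array in place of linked list, Ruby's `hash` method). It is not the course file; the class name and the entry layout [hash value, key, value] are assumptions.

```ruby
class ChainedHash
  def initialize(bins = 8)
    @bins = Array.new(bins) { [] }   # each bin holds a chain of entries
  end

  def [](key)
    entry = find_entry(key)
    entry && entry[2]
  end

  def []=(key, value)
    entry = find_entry(key)
    if entry
      entry[2] = value                                # replacement
    else
      chain_for(key.hash) << [key.hash, key, value]   # insertion
    end
  end

  def delete(key)
    entry = find_entry(key)
    return nil unless entry
    chain_for(key.hash).delete_if { |e| e.equal?(entry) }
    entry[2]
  end

  private

  # Step 1: the hash value modulo the number of bins selects the bin.
  def chain_for(h)
    @bins[h % @bins.size]
  end

  def find_entry(key)
    h = key.hash
    # Step 2: compare the full hash values; step 3: compare the keys.
    chain_for(h).find { |eh, ek, _v| eh == h && ek.eql?(key) }
  end
end

dict = ChainedHash.new(2)      # only 2 bins, so chains are needed
dict["Aoyama"] = 15815000
dict["Shibuya"] = 15715000
dict["Sagamihara"] = 15615000
puts dict["Shibuya"]           # 15715000
```

The sketch never expands the table, so the chains grow with the number of data items.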

Open Addressing

Store key and data in hash table itself
In case of conflict, successively check different bins
For check number i, use hash function ohf(key, i)
- Linear probing: ohf(key, i) = (hf(key) + i) mod m
- Quadratic probing: ohf(key, i) = (hf(key) + c₁ i + c₂ i²) mod m
- Many other variations exist
The load factor has to be between 0 and 1; ≦0.5 is reasonable
Problem: Deletion is difficult
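
A sketch of linear probing with a fixed-size table. Deletion is handled with a special marker (often called a tombstone), which is one common approach and an assumption here; simply clearing the bin would cut off later entries of the same probe sequence. The class name is hypothetical, and `nil` cannot be used as a key in this sketch.

```ruby
class LinearProbingHash
  DELETED = Object.new   # marker for bins whose entry was deleted

  def initialize(size = 16)
    @keys = Array.new(size)     # nil means: bin never used
    @values = Array.new(size)
  end

  def [](key)
    b = find(key)
    b && @values[b]
  end

  def []=(key, value)
    free = nil
    @keys.size.times do |i|
      b = probe(key, i)
      k = @keys[b]
      if k.nil?
        free ||= b
        break                   # end of probe sequence: key is not stored
      elsif k.equal?(DELETED)
        free ||= b              # reusable, but the key may follow later
      elsif k.eql?(key)
        return @values[b] = value   # replacement
      end
    end
    raise "hash table is full" unless free
    @keys[free] = key
    @values[free] = value
  end

  def delete(key)
    b = find(key)
    return nil unless b
    @keys[b] = DELETED
    value, @values[b] = @values[b], nil
    value
  end

  private

  # ohf(key, i) = (hf(key) + i) mod m
  def probe(key, i)
    (key.hash + i) % @keys.size
  end

  def find(key)
    @keys.size.times do |i|
      b = probe(key, i)
      k = @keys[b]
      return nil if k.nil?
      return b if !k.equal?(DELETED) && k.eql?(key)
    end
    nil
  end
end
```

A full implementation would also expand the table before the load factor gets close to 1.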

Time Complexity of Hashing

(average, for chaining)

Calculation of hash function
- Dependent on key length
- O(1) if key length is constant or limited
Search in bin
- Dependent on load factor
- O(1) if load factor is below a given constant
- O(n) in worst case, but this can be avoided by choice of hash function

Expansion and Shrinking of Hash Table

The efficiency of hashing depends on the load factor
If the number of data items increases, the hash table has to be expanded
If the number of data items decreases, it is desirable to shrink the hash table
Expansion/shrinking can be implemented by re-inserting the data into a new hash table
(changing the divisor of the modulo operation)
Expansion/shrinking is heavy (time: O(n))

Analysis of the Time Complexity of Expansion

If the hash table is expanded for every data insertion, this is extremely inefficient
Limit the number of expansions:
- Increase the size of the hash table whenever the number of data items doubles
- The time needed for the insertion of n data items (n=2^x) is
  2 + 4 + 8 + ... + n/2 + n < 2n = O(n)
- The time complexity per data item is O(n)/n = O(1)

(simple example of amortized analysis)
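
The sum above can be checked with a short sketch. It assumes, as in the calculation above, that an expansion to size s costs s re-insertions; the method name is hypothetical.

```ruby
# Total expansion cost for inserting n items when the table size
# doubles at 2, 4, 8, ... items.
def expansion_cost(n)
  cost = 0
  size = 2
  while size <= n
    cost += size    # re-insert the data into the larger table
    size *= 2
  end
  cost
end

puts expansion_cost(1024)          # 2046, which is less than 2 * 1024
puts expansion_cost(1024) / 1024.0 # about 2 per data item, independent of n
```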

Evaluation of Hashing

Advantages:

Search/insertion/deletion are possible in (average) constant time
Reasonably good actual performance
No need for keys to be numeric or ordered
Wide field of application

Problems:

Sorting needs to be done separately
(Ruby Hashes store insertion order, but not key order)
Proximity/similarity search is impossible
Expansion/shrinking requires time (possible operation interrupt)

Comparison of Dictionary Implementations

Implementation	Search	Insertion	Deletion	Sorting
Sorted array	`O`(log `n`)	`O`(`n`)	`O`(`n`)	`O`(`n`)
Unordered array/linked list	`O`(`n`)	`O`(1)	`O`(`n`)	`O`(`n` log `n`)
Balanced tree	`O`(log `n`)	`O`(log `n`)	`O`(log `n`)	`O`(`n`)
Hash table	`O`(1)	`O`(1)	`O`(1)	`O`(`n` log `n`)

The Ruby `Hash` Class

(Perl: hash; Java: HashMap; Python: dict)

Because dictionary ADTs are often implemented using hashing,
in many programming languages, dictionaries are called hashes
Creation: my_hash = {} or my_hash = Hash.new
Initialization: months = {'January' => 31, 'February' => 28, 'March' => 31, ... }
Insertion/replacement: months['February'] = 29
Lookup: this_month_length = months[this_month]
Hash in Ruby has more functionality than in other programming languages (presentation)
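
A few of the operations above in one runnable sketch. The helper `letter_counts` is a hypothetical example of a default value.

```ruby
months = { 'January' => 31, 'February' => 28, 'March' => 31 }
months['February'] = 29          # replacement of an existing key
months['April'] = 30             # insertion of a new key
puts months['March']             # 31
puts months['May'].inspect       # nil (key not present)
puts months.keys.inspect         # keys in insertion order
puts months.sort.first.inspect   # ["April", 30] (sorting by key is done separately)

# Count letters, using 0 as the default value for missing keys.
def letter_counts(text)
  counts = Hash.new(0)
  text.each_char { |c| counts[c] += 1 }
  counts
end
puts letter_counts("hash table")['h']   # 2
```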

Implementation of Hashing in Ruby

Source: st.c
Used chaining until recently
(originally by Peter Moore, University of California Berkeley (1989))
On Nov. 7, 2016 replaced by open addressing (by Vladimir Makarov, with help from Yura Sokolov)
Reason: Faster because open addressing works better with the modern cache hierarchy
Used inside Ruby, too:
- Lookup of global identifiers such as class names
- Lookup of methods for each class
- Lookup of instance variables for each object

Summary

A hash table implements a dictionary ADT using a hash function
Main design points:
Selection of hash function
Conflict resolution methods (chaining or open addressing)
Reasonably good actual performance
Wide field of application

Glossary

direct addressing: 直接アドレス表
hashing, scatter storage technique: ハッシュ法、挽き混ぜ法
hash function: ハッシュ関数
hash table: ハッシュ表
game of Go: 囲碁
joseki: 定石 (囲碁)
universal hashing: 万能ハッシュ法
denial of service attack: DOS 攻撃、サービス拒否攻撃
perfect hash function: 完全ハッシュ関数
cryptographic hash function: 暗号技術的ハッシュ関数
electronic signatures: 電子署名
conflict: 衝突
Poisson distribution: ポアソン分布
chaining: チェイン法、連鎖法
open addressing: 開番地法、オープン法
load factor: 占有率
linear probing: 線形探査法
quadratic probing: 二次関数探査法
divisor: (割り算の) 法
amortized analysis: 償却分析
proximity search: 近接探索
similarity search: 類似探索