Hash Functions and Hash Tables
(ハッシュ関数とハッシュ表)
Data Structures and Algorithms
10th lecture, November 19, 2015
http://www.sw.it.aoyama.ac.jp/2015/DA/lecture10.html
Martin J. Dürst
© 2009-15 Martin J. Dürst, Aoyama Gakuin University
Today's Schedule
- Additional speedup for dictionary
- Overview of hashing
- Hash functions
- Conflict resolution
- Evaluation of hashing
- Hash in Ruby
- Summary
Time Complexity for Dictionary Implementations up to Here
Implementation              | Search   | Insertion | Deletion
----------------------------|----------|-----------|---------
Sorted array                | O(log n) | O(n)      | O(n)
Unordered array/linked list | O(n)     | O(1)      | O(n)
Balanced tree               | O(log n) | O(log n)  | O(log n)
Direct Addressing
- Use an array with an element for each key value
- Search: value = array[key], time: O(1)
- Insertion/replacement: array[key] = value, time: O(1)
- Deletion: array[key] = nil, time: O(1)
- Example:
  students = []
  students[15814000] = "Hanako Aoyama"
Problem: Array size, non-numeric keys
Solution: key transformation
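A minimal Ruby sketch of direct addressing, using the student-number example above (the concrete key and value are just illustrations):

# Direct addressing: the key itself is the array index
students = []
students[15814000] = "Hanako Aoyama"   # insertion/replacement: O(1)
p students[15814000]                   # search: O(1)  => "Hanako Aoyama"
students[15814000] = nil               # deletion: O(1)

Note that the single assignment above already makes Ruby allocate an array of more than 15 million elements, which illustrates the array-size problem mentioned above.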
Overview of Hashing
(also called scatter storage technique)
- Transform the key k to a compact space using the hash function hf
- Use hf(k) instead of k in the same way as direct addressing
- The array is called a hash table
- The hash function is evaluated in O(1) time
- Search: value = table[hf(key)], time: O(1)
- Insertion/replacement: table[hf(key)] = value, time: O(1)
- Deletion: table[hf(key)] = nil, time: O(1)
Problem 1: Design of hash function
Problem 2: Resolution of conflicts
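As a rough sketch (ignoring conflicts for now), the same three operations look as follows in Ruby; the table size M and the use of Ruby's built-in hash method as hf are assumptions for illustration only:

M = 1000                                 # assumed hash table size
table = Array.new(M)

def hf(key)                              # hash function, evaluated in O(1)
  key.hash % M
end

table[hf("Hanako Aoyama")] = 15814000    # insertion/replacement: O(1)
p table[hf("Hanako Aoyama")]             # search: O(1)  => 15814000
table[hf("Hanako Aoyama")] = nil         # deletion: O(1)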
Overview of Hash Function
- Goals:
  - From a key, calculate an index that is as evenly distributed as possible
  - Adjust the range of the result to the hash table size
- Steps:
  - Step 1: Calculate a large integer (e.g. int in C) from the key
  - Step 2: Adjust this large integer to the hash table size using a modulo operation
Step 2 is easy. Therefore, we concentrate on step 1.
(Often, step 1 alone is called 'hash function'.)
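A hedged Ruby sketch of the two steps for a string key; the multiplier 31 is only an illustrative choice, not a value from the lecture:

# Step 1: calculate a large integer from the key
def hash_to_integer(key)
  h = 0
  key.each_byte { |b| h = h * 31 + b }   # combine every byte of the key
  h
end

# Step 2: adjust the integer to the hash table size m with a modulo operation
def hf(key, m)
  hash_to_integer(key) % m
end

p hf("January", 100)                     # an index between 0 and 99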
Hash Function Example 1
(SDBM Hash Function)
unsigned int sdbm_hash(char key[])
{
    unsigned int hash = 0;
    while (*key) {
        /* the shifts need parentheses because of operator precedence */
        hash = *key++ + (hash << 6) + (hash << 16) - hash;
    }
    return hash;    /* return the accumulated hash value */
}
Hash Function Example 2
(simplified from MurmurHash3; for 32-bit machines)
#define ROTL32(n,by) (((n)<<(by)) | ((n)>>(32-(by))))
#define C1 0xcc9e2d51   /* constant taken from MurmurHash3 */
#define R1 15           /* rotation amount taken from MurmurHash3 */

int too_simple_hash(int key[], int length)
{
    int h = 0;
    for (int i = 0; i < length; i++) {
        int k = key[i] * C1;    /* mix each word of the key with a constant */
        h ^= ROTL32(k, R1);     /* rotate and fold into the running hash */
    }
    h ^= h >> 13;               /* finalization: spread the bits */
    h *= 0xc2b2ae35;
    return h;
}
Evaluation of Hash Functions
- Quality of distribution
- Execution speed
- Ease of implementation
Precautions for Hash Functions
- Use all parts of the key
  Counterexample: Using only characters 3 and 4 of a string → bad distribution
- Do not use data besides the key
  If some data attributes (e.g. price of a product, student's total marks) change, the hash value will change and the data cannot be found anymore
- Collapse equivalent keys
  Examples: Upper/lower case letters, top/bottom/left/right/black/white symmetries for joseki in Go
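For the last point, a small hypothetical sketch: normalizing the key before hashing collapses upper and lower case into the same bin.

# Collapse upper/lower case: "Apple", "APPLE" and "apple" hash identically
def case_insensitive_hf(key, m)
  key.downcase.hash % m
end

p case_insensitive_hf("Apple", 100) == case_insensitive_hf("APPLE", 100)   # => true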
Special Hash Functions
- Universal hashing
  - Use a random function to create a different hash function for each program execution
  - Implementation: Change a constant in the hash function (see the sketch after this list)
  - Solution for some denial-of-service attacks (reference)
- Perfect hash function
  - Custom-designed hash function without conflicts
  - Useful when data is completely predefined
  - In the best case, the hash table is completely filled
  - Application: Keywords in programming languages
  - Example implementation: gnu gperf (in Japanese)
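A rough sketch of the universal hashing idea; the random seed and the multiplier 131 are assumptions for illustration, not the lecture's implementation:

SEED = rand(1 << 31)             # a constant chosen anew on each program run

def universal_hf(key, m)
  h = SEED
  key.each_byte { |b| h = (h * 131 + b) & 0xffffffff }
  h % m
end

Ruby's own String#hash is seeded randomly per process for the same reason (protection against hash-flooding denial-of-service attacks).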
Cryptographic Hash Function
- Used in electronic signatures
- Differences from general hash functions:
- Practically impossible to find different inputs that produce the same output
- Output usually longer (e.g. 128/256/384/512/... bits)
- Evaluation may take longer
Conflict
- A conflict happens when hf(k1) = hf(k2) even though k1 ≠ k2
- This requires special treatment
- Main solutions: chaining and open addressing (described below)
Terms and Variables for Conflict Resolution
- Fields in hash table: bins/buckets
- Number of bins: m (equal to the range of values of the hash function after the modulo operation)
- Number of data items: n
- Load factor (average number of data items per bin): α = n/m
- For a good (close to random) hash function, the variation in the number of data items for each bin is low (Poisson distribution)
Chaining
- Store conflicting data items in a linked list
- Each bin in the hash table is the head of a linked list
- If the linked list is short, then search/insertion/deletion will be fast
- The average length of the linked list is α
- All operations are carried out in two steps:
- Use hf(k) to determine the bin
- Use the key to find the data item in the linked list
Implementation of Chaining
- Implementation in Ruby: Ahashdictionary.rb
- Uses an Array in place of a linked list
- Uses Ruby's hash function
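A minimal independent sketch of chaining in Ruby (not the implementation referenced above); the class and method names are made up, and an Array of [key, value] pairs plays the role of each bin's linked list:

class ChainedHash
  def initialize(m = 16)
    @m = m
    @bins = Array.new(m) { [] }          # each bin starts as an empty list
  end

  def hf(key)
    key.hash % @m                        # step 1 (Ruby's hash) and step 2 (modulo)
  end

  def insert(key, value)                 # average O(1 + α)
    bin = @bins[hf(key)]
    pair = bin.find { |k, _| k == key }
    if pair
      pair[1] = value                    # replace existing value
    else
      bin << [key, value]                # append new pair
    end
  end

  def search(key)                        # average O(1 + α)
    pair = @bins[hf(key)].find { |k, _| k == key }
    pair && pair[1]
  end

  def delete(key)                        # average O(1 + α)
    @bins[hf(key)].reject! { |k, _| k == key }
  end
end

d = ChainedHash.new
d.insert('January', 31)
p d.search('January')                    # => 31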
Open Addressing
- Store key and data in the hash table itself
- In case of conflict, successively check different bins
- For check number i, use hash function ohf(key, i)
  - Linear probing: ohf(key, i) = hf(key) + i
  - Quadratic probing: ohf(key, i) = hf(key) + c1·i + c2·i^2
  - Many other variations exist
- The load factor has to be between 0 and 1; α ≤ 0.5 is reasonable
- Problem: Deletion is difficult
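A sketch of open addressing with linear probing; only insertion and search are shown, since deletion is exactly the difficult part (the class and method names are made up for this example):

class LinearProbingHash
  def initialize(m = 16)
    @m = m
    @keys   = Array.new(m)               # keys and values live in the table itself
    @values = Array.new(m)
  end

  def ohf(key, i)
    (key.hash + i) % @m                  # linear probing: hf(key) + i
  end

  def insert(key, value)
    @m.times do |i|
      j = ohf(key, i)
      if @keys[j].nil? || @keys[j] == key
        @keys[j], @values[j] = key, value
        return
      end
    end
    raise 'hash table full'              # load factor must stay below 1
  end

  def search(key)
    @m.times do |i|
      j = ohf(key, i)
      return nil        if @keys[j].nil?   # empty bin: key is not present
      return @values[j] if @keys[j] == key
    end
    nil
  end
end

t = LinearProbingHash.new
t.insert('March', 31)
p t.search('March')                      # => 31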
Time Complexity of Hashing
(average, for chaining)
- Calculation of hash function
  - Dependent on key length
  - O(1) if key length is constant or limited
- Search in bin
  - Dependent on load factor
  - O(1) if load factor is below a given constant
  - O(n) in worst case, but this can be avoided by choice of hash function
Expansion and Shrinking of Hash Table
- The efficiency of hashing depends on the load factor
- If the number of data items increases, the hash table has to be expanded
- If the number of data items decreases, it is desirable to shrink the hash table
- Expansion/shrinking can be implemented by re-inserting the data into a new hash table (changing the divisor of the modulo operation)
- Expansion/shrinking is heavy (time: O(n))
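A sketch of expansion by re-insertion, written for the chained representation used in the earlier sketch (an Array of bins, each holding [key, value] pairs); doubling the number of bins is an assumed, common choice:

# Re-insert every pair into a new table with twice as many bins;
# the modulo divisor changes, so most keys move to a different bin.
def expand(bins)
  new_m    = bins.size * 2
  new_bins = Array.new(new_m) { [] }
  bins.each do |bin|
    bin.each { |key, value| new_bins[key.hash % new_m] << [key, value] }
  end
  new_bins                               # O(n) in total
end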
Analysis of the Time Complexity of Expansion
- If the hash table is expanded for each data insertion, this is extremely inefficient
- Limit the number of expansions:
  - Increase the size of the hash table whenever the number of data items doubles
  - The time needed for the insertion of n data items (n = 2^x) is 2 + 4 + 8 + ... + n/2 + n < 2n = O(n)
  - The time complexity per data item is O(n)/n = O(1)
(simple example of amortized analysis)
Evaluation of Hashing
Advantages:
- Search/insertion/deletion are possible in (average) constant time
- Reasonably good actual performance
- No need for keys to be ordered
- Wide field of application
Problems:
- Sorting needs to be done separately
- Proximity/similarity search is impossible
- Expansion/shrinking requires time (and may temporarily interrupt operations)
Comparison of Dictionary Implementations
Implementation              | Search   | Insertion | Deletion | Sorting
----------------------------|----------|-----------|----------|-----------
Sorted array                | O(log n) | O(n)      | O(n)     | O(n)
Unordered array/linked list | O(n)     | O(1)      | O(1)     | O(n log n)
Balanced tree               | O(log n) | O(log n)  | O(log n) | O(n)
Hash table                  | O(1)     | O(1)      | O(1)     | O(n log n)
The Ruby Hash Class
(Perl: hash; Java: HashMap; Python: dict)
- Because dictionary ADTs are often implemented using hashing, in many programming languages dictionaries are called hashes
- Creation: my_hash = {} or my_hash = Hash.new
- Initialization: months = {'January' => 31, 'February' => 28, 'March' => 31, ... }
- Insertion/replacement: months['February'] = 29
- Lookup: this_month_length = months[this_month]
Hash in Ruby has more functionality than in other programming languages (presentation)
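Collected into one runnable snippet (the month values follow the slide's example):

# Creation
my_hash = {}                             # or: my_hash = Hash.new

# Initialization
months = { 'January' => 31, 'February' => 28, 'March' => 31 }

# Insertion/replacement
months['February'] = 29                  # leap year

# Lookup
this_month = 'March'
this_month_length = months[this_month]
p this_month_length                      # => 31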
Implementation of Hashing in Ruby
- Originally by Peter Moore, University of California Berkeley (1989)
- Source: st.c
- Uses chaining
- Expansion by a factor of about 2 whenever a load factor of 5.0 is reached (ST_DEFAULT_MAX_DENSITY); see the sketch after this list
- The size of the hash table is a power of 2 (next_pow2)
  (earlier, it was a prime number close to a power of 2)
- Even if there are many deletions, the hash table is never shrunk
- Used inside Ruby, too:
- Lookup of global identifiers such as class names
- Lookup of methods for each class
- Lookup of instance variables for each object
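A rough Ruby sketch of the expansion policy described above; the helper names follow the bullet points, but this is not the actual st.c code:

ST_DEFAULT_MAX_DENSITY = 5.0             # maximum load factor before expansion

def needs_expansion?(n, m)               # n data items, m bins
  n.fdiv(m) > ST_DEFAULT_MAX_DENSITY
end

def next_pow2(x)                         # smallest power of 2 that is >= x
  power = 1
  power <<= 1 while power < x
  power
end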
Summary
- A hash table implements a dictionary ADT using a hash function
- Main points: Selection of hash function, conflict resolution method
- Reasonably good actual performance
- Wide field of application
Glossary
- direct addressing: 直接アドレス表
- hashing, scatter storage technique: ハッシュ法、かき混ぜ法
- hash function: ハッシュ関数
- hash table: ハッシュ表
- joseki: 定石 (囲碁)
- universal hashing: 万能ハッシュ法
- denial of service attack: DoS 攻撃、サービス拒否攻撃
- perfect hash function: 完全ハッシュ関数
- cryptographic hash function: 暗号技術的ハッシュ関数
- electronic signatures: 電子署名
- conflict: 衝突
- Poisson distribution: ポアソン分布
- chaining: チェイン法、連鎖法
- open addressing: 開番地法、オープン法
- load factor: 占有率
- linear probing: 線形探査法
- quadratic probing: 二次関数探査法
- divisor: (割り算の) 法
- amortized analysis: 償却分析
- proximity search: 近接探索
- similarity search: 類似探索