Balanced Trees
(平衡木)
Data Structures and Algorithms
9th lecture, November 12, 2015
http://www.sw.it.aoyama.ac.jp/2015/DA/lecture9.html
Martin J. Dürst
© 2009-15 Martin J. Dürst, Aoyama Gakuin University
Today's Schedule
- Summary of last lecture
- Balanced trees for internal use
- 2-3-4 tree
- red-black-tree
- AVL-tree
- Balanced trees for secondary storage
Summary of Last Lecture
- A dictionary is an ADT allowing the insertion, deletion, and
search of data items using a key
- With a simplistic implementation, some operations take
O(n) time
- With a binary search tree, all operations are O(log n) on average, but
O(n) in the worst case
- Unlike for sorting, this cannot be improved using
randomization
(for quicksort, we can randomly select a pivot, but the order of insertions
and deletions for a dictionary is determined externally)
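The worst case can be made visible with a small experiment. The following is a minimal sketch, not the lecture's code; `BstNode`, `bst_insert`, `bst_height`, and `build_bst` are hypothetical names chosen for illustration. It builds an unbalanced binary search tree once from keys in sorted order and once from the same keys in shuffled order, and compares the heights.

```ruby
# Minimal unbalanced binary search tree (illustrative names, not from the lecture).
BstNode = Struct.new(:key, :left, :right)

# Inserts key below node and returns the (possibly new) subtree root.
def bst_insert(node, key)
  return BstNode.new(key) if node.nil?
  if key < node.key
    node.left = bst_insert(node.left, key)
  elsif key > node.key
    node.right = bst_insert(node.right, key)
  end
  node
end

# Number of nodes on the longest path from node down to a leaf.
def bst_height(node)
  node.nil? ? 0 : 1 + [bst_height(node.left), bst_height(node.right)].max
end

def build_bst(keys)
  keys.reduce(nil) { |root, key| bst_insert(root, key) }
end

keys = (1..500).to_a
# Sorted insertion order: every node has only a right child, so the height is n.
sorted_height = bst_height(build_bst(keys))
# Shuffled insertion order: the height stays a small multiple of log n.
shuffled_height = bst_height(build_bst(keys.shuffle(random: Random.new(1))))
puts "sorted: #{sorted_height}, shuffled: #{shuffled_height}"
```

The dictionary cannot shuffle its input like this in practice, because the caller decides the order of the operations; that is why the tree itself has to guarantee balance.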
Strengthening or Weakening the Invariants of Binary Trees
- For the implementation of a priority queue, we
- Weakened the total order (of a binary search tree) to a local order
(between parent and child only)
- Strengthened the shape of a (general) binary tree to a complete
binary tree
- We have to consider strengthening or weakening invariants to improve
worst-case performance of a binary search tree
Top-Down 2-3-4 Trees
- Each (internal) node has 2, 3, or 4 children
- A node with k children stores k-1 keys and data
items
(if all nodes have 2 children, a 2-3-4 tree is equal to a binary search
tree)
- The keys in the internal nodes separate the key ranges in the
subtrees
- The tree is of uniform height
- In the lowest layer of the tree, the nodes have no children
(implemented as a single unique empty node)
Search in 2-3-4 Trees
- Start from the root node
- If the key being searched for is found in the current node, then return
the corresponding data item
- Otherwise, select the subtree based on this node's keys, and continue
recursively
(each operation on a 2-3-4 tree is a generalization of the same operation on
a binary search tree)
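The search steps above can be sketched as follows. This is a minimal illustration under assumptions: the node layout `Node234` (parallel sorted arrays `keys` and `values`, plus `children`) and the method name `search234` are hypothetical, and a leaf is modeled with an empty `children` array instead of the single shared empty node mentioned earlier.

```ruby
# Hypothetical node layout: keys and values are parallel, sorted arrays;
# children is empty for a leaf, otherwise it holds keys.size + 1 subtrees.
Node234 = Struct.new(:keys, :values, :children)

# Returns the data item stored under key, or nil if the key is absent.
def search234(node, key)
  i = 0
  # Skip all keys in this node that are smaller than the search key.
  i += 1 while i < node.keys.size && key > node.keys[i]
  # Found in the current node: return the corresponding data item.
  return node.values[i] if i < node.keys.size && key == node.keys[i]
  # Reached the lowest layer without finding the key.
  return nil if node.children.empty?
  # Child i covers exactly the key range between keys[i-1] and keys[i].
  search234(node.children[i], key)
end

leaf1 = Node234.new([10], ['ten'], [])
leaf2 = Node234.new([30, 40], ['thirty', 'forty'], [])
leaf3 = Node234.new([60, 70, 80], ['sixty', 'seventy', 'eighty'], [])
root  = Node234.new([20, 50], ['twenty', 'fifty'], [leaf1, leaf2, leaf3])

puts search234(root, 40)          # found in the middle leaf
puts search234(root, 50)          # found in the root itself
puts search234(root, 45).inspect  # absent key
```

With exactly one key per node, the loop makes a single comparison and the method reduces to ordinary binary search tree lookup.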
Insertion into 2-3-4 Trees
- Basic operation: Search downwards, insert new data item into leaf
node
- If there are already 3 data items in the leaf node, this node has to be
split
- If a node has to be split, a key and data item have to be inserted into
the parent node
- This may trigger further splits in parents, potentially up to the
root
- To avoid splits that propagate upwards after insertion (difficult to
implement), nodes with 4 children are split preemptively on the way from
the root to the leaf
- This version of the 2-3-4 tree is called a top-down 2-3-4 tree
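A compact sketch of top-down insertion may help to see why preemptive splitting is enough. It is an illustration under assumptions, not the lecture's 9234tree.rb: the class `Tree234`, its node layout, and all method names are hypothetical, and leaves are modeled with an empty `children` array.

```ruby
# Top-down 2-3-4 tree supporting insertion only (illustrative sketch).
class Tree234
  Node = Struct.new(:keys, :values, :children) do
    def leaf?; children.empty?; end
    def full?; keys.size == 3; end   # 3 keys = 4 children
  end

  def initialize
    @root = Node.new([], [], [])
  end

  def insert(key, value)
    if @root.full?
      # Splitting the root is the only way the tree grows in height.
      @root = Node.new([], [], [@root])
      split_child(@root, 0)
    end
    node = @root                     # invariant: node was not full on arrival
    loop do
      i = node.keys.index { |k| k >= key } || node.keys.size
      if i < node.keys.size && node.keys[i] == key
        node.values[i] = value       # key already present: overwrite
        return
      end
      if node.leaf?
        node.keys.insert(i, key)
        node.values.insert(i, value)
        return
      end
      if node.children[i].full?
        split_child(node, i)         # the parent has room for the middle key
        next                         # re-examine this node after the split
      end
      node = node.children[i]
    end
  end

  # Depths of all leaves, duplicates removed; one entry means uniform height.
  def leaf_depths(node = @root, depth = 1)
    return [depth] if node.leaf?
    node.children.flat_map { |c| leaf_depths(c, depth + 1) }.uniq
  end

  def height(node = @root)
    node.leaf? ? 1 : 1 + height(node.children.first)
  end

  # Yields all keys in ascending order.
  def each_key(node = @root, &block)
    node.keys.each_index do |i|
      each_key(node.children[i], &block) unless node.leaf?
      block.call(node.keys[i])
    end
    each_key(node.children.last, &block) unless node.leaf?
  end

  private

  # Splits the full child at index i; its middle key moves up into parent.
  def split_child(parent, i)
    child = parent.children[i]
    left  = Node.new(child.keys[0, 1], child.values[0, 1], child.children.take(2))
    right = Node.new(child.keys[2, 1], child.values[2, 1], child.children.drop(2))
    parent.keys.insert(i, child.keys[1])
    parent.values.insert(i, child.values[1])
    parent.children[i, 1] = [left, right]
  end
end

tree = Tree234.new
(1..20).each { |k| tree.insert(k, "item #{k}") }  # sorted input, worst case for a plain BST
p tree.leaf_depths   # a single entry: all leaves are at the same depth
```

Because every full node is split before the search descends into it, the parent of a node being split always has room for the key that moves up, so no split ever has to propagate back towards the root.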
Deletion from 2-3-4 Trees
- More complicated than insertion (as with a binary search tree)
- Find data item to be deleted, using search
- If the item to be deleted is not in a leaf, exchange with an item in a
leaf
- Remove the item in the leaf
- If this results in a leaf node without data items, move (borrow) items
from neighboring leaves
- If the situation cannot be fixed using moving, merge some nodes
- If the situation cannot be fixed using merging, address the problem one
layer higher
- If the problem cannot be solved in the top layer, reduce the number of
the layers
Efficiency of 2-3-4 Trees
- Maximum number of data items in a 2-3-4 tree of height h:
$n = 4^h - 1$
- Minimum number of data items in a 2-3-4 tree of height h:
$n = 2^h - 1$
- ⇒ The height of the tree is O(log n)
- The time needed for each operation is proportional to the height of the
tree and therefore O(log n)
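The two bounds can be turned around to give the range of possible heights for a given number of items. The sketch below assumes the method names `max_items_234`, `min_items_234`, and `height_range_234`, which are made up for this illustration; height counts layers of nodes, as in the formulas above.

```ruby
# Maximum number of items: every node has 4 children and 3 items.
def max_items_234(h)
  4**h - 1
end

# Minimum number of items: every node has 2 children and 1 item.
def min_items_234(h)
  2**h - 1
end

# Smallest and largest height a 2-3-4 tree with n >= 1 items can have.
def height_range_234(n)
  min_h = 1
  min_h += 1 while max_items_234(min_h) < n   # smallest h with 4^h - 1 >= n
  max_h = (n + 1).bit_length - 1              # largest h with 2^h - 1 <= n
  [min_h, max_h]
end

p height_range_234(100)        # [4, 6]
p height_range_234(1_000_000)  # [10, 19]
```

Both ends of the range grow logarithmically in n, which is the O(log n) height stated above.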
Implementation of 2-3-4 Tree
- Implementation in Ruby: 9234tree.rb, 9driver.rb
- Implementation of 2-3-4 trees is quite complicated
- Some memory (in nodes with 2 or 3 children) is unused
- Therefore, other balanced trees have been proposed
Red-Black-Trees
- Implementation of a 2-3-4 tree with a binary tree
- The edges of the original tree are black
- Nodes with 3 or 4 children are split into multiple nodes, coloring the
internal edges red
- Two consecutive red edges are impossible/forbidden
- If this invariant is violated, rotations are used for
restoration
- If only black edges are counted, the tree is of uniform height
- When all edges are considered, the maximum depth of a leaf is at most
twice the minimum depth
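The two invariants can be checked mechanically. The following is a sketch under assumptions: `RBNode` and `black_height` are hypothetical names, and the color of an edge is stored in the child node it leads to (a common implementation choice, not necessarily the one used in the lecture).

```ruby
# color is the color of the edge from the parent to this node (:red or :black).
RBNode = Struct.new(:key, :color, :left, :right)

# Returns the number of black edges on every path down from node,
# or nil if one of the two invariants is violated below node.
def black_height(node, parent_red = false)
  return 0 if node.nil?
  red = node.color == :red
  return nil if red && parent_red           # two consecutive red edges
  left  = black_height(node.left, red)
  right = black_height(node.right, red)
  # Counting only black edges, the tree has to be of uniform height.
  return nil if left.nil? || right.nil? || left != right
  left + (red ? 0 : 1)
end

# A 2-3-4 node with keys 10, 20, 30 becomes a black node with two red children.
four_node = RBNode.new(20, :black, RBNode.new(10, :red), RBNode.new(30, :red))
p black_height(four_node)   # 1

# Hanging another red node below 30 creates two consecutive red edges.
broken = RBNode.new(20, :black,
                    RBNode.new(10, :red),
                    RBNode.new(30, :red, nil, RBNode.new(40, :red)))
p black_height(broken)      # nil
```

A checker like this is useful mainly for testing; a working red-black tree restores the invariants with rotations and recoloring during insertion and deletion instead of validating afterwards.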
AVL-Trees
- Proposed by Adelson-Velskii and Landis (Адельсон-Вельский
and Ландис) in 1962
- Oldest (binary) balanced tree
- Invariant: At each internal node, the difference between the heights of
the subtrees is 1 or less
- The difference between the heights of the left and the right subtrees
(-1, 0, 1) is stored in each internal node and kept up to date
- The tree height is limited to approximately $1.44 \log_2 n$
- Searching is slightly faster than for a red-black-tree
- Insertion and deletion are slightly more complicated than for a
red-black-tree
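The AVL invariant and the simplest repair step can be sketched as follows. The names `AvlNode`, `avl_height`, `balance_factor`, `avl_balanced?`, and `rotate_left` are hypothetical. For brevity the sketch recomputes heights recursively, whereas an actual AVL tree stores the height difference in each node, as described above.

```ruby
AvlNode = Struct.new(:key, :left, :right)

def avl_height(node)
  node.nil? ? 0 : 1 + [avl_height(node.left), avl_height(node.right)].max
end

# Height of the left subtree minus height of the right subtree.
def balance_factor(node)
  avl_height(node.left) - avl_height(node.right)
end

# True if the difference is -1, 0, or 1 at every node.
def avl_balanced?(node)
  return true if node.nil?
  balance_factor(node).abs <= 1 &&
    avl_balanced?(node.left) && avl_balanced?(node.right)
end

# Single left rotation: the right child becomes the new subtree root.
def rotate_left(node)
  pivot = node.right
  node.right = pivot.left
  pivot.left = node
  pivot
end

# Inserting 1, 2, 3 in order into a plain BST gives a right-leaning chain.
chain = AvlNode.new(1, nil, AvlNode.new(2, nil, AvlNode.new(3)))
p balance_factor(chain)   # -2: invariant violated at the root
fixed = rotate_left(chain)
p fixed.key               # 2
p avl_balanced?(fixed)    # true
```

The other cases (left-leaning, and the two zig-zag cases that need a double rotation) follow the same pattern and are omitted here.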
Secondary Storage
| | Internal Memory | Secondary Storage | Secondary Storage |
| Access principle | random | random | linear |
| Technology | dynamic RAM | HD, SSD | magnetic tape |
| Unit of access | word | page | record |
| Example unit size | 32/64 bits (4/8 bytes) | 512/1024/2048/4096/... bytes | varying |
| Access speed | nanoseconds | milliseconds | seconds or minutes |
B-Trees
- Variant of 2-3-4 tree
- Each page is a node in the tree
- Maximise the number of keys per page
- The minimum number of keys per page is about half of the maximum
Page of a B-Tree
| ref. to subtree | key | data | ref. to subtree | key | data | ref. to subtree | ... | key | data | ref. to subtree |
B+ Trees
Starting with a B-tree, all data (except keys) is moved to the lowest layer of
the tree
⇒ The number of keys and child nodes per internal node increases
(in practical applications, the size of a key is much smaller than the size of
the data)
⇒ The height of the tree shrinks
⇒ Access to data is faster
(the overall access time is dominated by the number of pages that have to be
fetched from secondary memory)
Internal Page of a B+ Tree
| ref. to subtree | key | ref. to subtree | key | ref. to subtree | key | ref. to subtree | ... | key | ref. to subtree |
Leaf Page of a B+ Tree
| key | data | key | data | key | data | ... | key | data |
Definition of Variables for B+ Trees
- $n$: Overall number of data items (example: 50,000)
- $L_p$: Page size (example: 1024 bytes)
- $L_k$: Key size (example: 4 bytes)
- $L_d$: Data size (one item, except key) (example: 240 bytes)
- $L_{pp}$: Size of page number (page reference) (example: 4 bytes)
- $\alpha_{\min}$: minimum occupancy (usually 0.5)
Items per Page for B+ Trees
($\lfloor a \rfloor$ is the floor function of $a$, the greatest integer less
than or equal to $a$)
- $d_{\max} = \lfloor L_p / (L_k + L_d) \rfloor$ (example: 4)
(maximum number of data items per leaf)
- $d_{\min} = \lfloor d_{\max} \alpha_{\min} \rfloor$ (example: 2)
(minimum number of data items per leaf)
- $k_{\max} = \lfloor L_p / (L_k + L_{pp}) \rfloor$ (example: 128)
(maximum number of children per internal node)
- $k_{\min} = \lfloor k_{\max} \alpha_{\min} \rfloor$ (example: 64)
(minimum number of children per internal node)
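The four formulas can be evaluated directly with the example values from the variable definitions. This is a small sketch; the method name `items_per_page` and its keyword parameters are made up for illustration. Integer division in Ruby already rounds down for positive integers, so it plays the role of the floor function.

```ruby
# Returns d_max, d_min, k_max, k_min for the given sizes (all in bytes).
def items_per_page(page_size:, key_size:, data_size:, page_ref_size:, alpha_min: 0.5)
  d_max = page_size / (key_size + data_size)       # integer division = floor
  k_max = page_size / (key_size + page_ref_size)
  {
    d_max: d_max,
    d_min: (d_max * alpha_min).floor,
    k_max: k_max,
    k_min: (k_max * alpha_min).floor
  }
end

p items_per_page(page_size: 1024, key_size: 4, data_size: 240, page_ref_size: 4)
# {:d_max=>4, :d_min=>2, :k_max=>128, :k_min=>64} (exact formatting depends on the Ruby version)
```

The large gap between 4 items per leaf and 128 children per internal node is what makes it worthwhile to keep the data out of the internal pages.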
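Number of Nodes for B+ Trees
($\lceil a \rceil$ is the ceiling function of $a$, the smallest integer
greater than or equal to $a$)
- $N_{d\max} = \lceil n / d_{\min} \rceil$ (example: 25,000)
(maximum number of leaves)
- $N_{d\min} = \lceil n / d_{\max} \rceil$ (example: 12,500)
(minimum number of leaves)
- $N_{k\max} = \lceil N_{d\max} / k_{\min} \rceil + \lceil N_{d\max} / k_{\min}^2 \rceil + \dots$
(maximum number of internal nodes; the sum ends with the first term equal to
1, the root)
(example: 391 + 7 + 1 = 399; height of B+ tree: 4; total number of nodes:
25,399)
- $N_{k\min} = \lceil N_{d\min} / k_{\max} \rceil + \lceil N_{d\min} / k_{\max}^2 \rceil + \dots$
(minimum number of internal nodes; the sum ends with the first term equal to
1, the root)
(example: 98 + 1 = 99; height of B+ tree: 3; total number of nodes: 12,599)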
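The sums for the number of internal nodes can be checked with a short loop. The sketch below uses the example values from above (50,000 items, 2 to 4 items per leaf, 64 to 128 children per internal node); `ceil_div` and `internal_levels` are hypothetical helper names.

```ruby
# Ceiling of a / b for positive integers, without floating point.
def ceil_div(a, b)
  (a + b - 1) / b
end

# Number of internal nodes on each layer, from just above the leaves
# up to the root: ceil(leaves / fanout), ceil(leaves / fanout^2), ...
# The loop stops at the first layer that consists of a single node.
def internal_levels(leaves, fanout)
  levels = []
  power = fanout
  loop do
    levels << ceil_div(leaves, power)
    break if levels.last == 1
    power *= fanout
  end
  levels
end

n = 50_000
max_leaves = ceil_div(n, 2)   # n / d_min = 25,000
min_leaves = ceil_div(n, 4)   # n / d_max = 12,500

worst = internal_levels(max_leaves, 64)    # k_min children per node
best  = internal_levels(min_leaves, 128)   # k_max children per node

p worst              # [391, 7, 1]
p worst.sum          # 399
p worst.size + 1     # 4 (height, including the leaf layer)
p best               # [98, 1]
p best.sum           # 99
p best.size + 1      # 3
```

Even in the worst case, any of the 50,000 items is reached by fetching only 4 pages.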
Summary
- 2-3-4 trees and B(+) trees increase the degree of the nodes beyond that of
a binary tree, but keep the tree height uniform (all leaves are at the same
depth)
- Red-black-trees and AVL-trees impose limitations on the variation of the
tree height
- Balanced trees make it possible to implement the basic operations of a
dictionary ADT in O(log n) time
- B-trees and B+ trees are extremely important for the implementation of
file systems and databases on secondary storage
Glossary
- red-black-tree: 赤黒木 (あかくろぎ)
- AVL-tree: AVL 木
- secondary storage: 二次記憶装置
- B-tree: B 木
- B+ tree: B+木
- strengthen: 強化 (する)
- weaken: 緩和 (する)
- uniform: 一定 (の)
- lowest layer: 最下層
- occupancy: 占有率
- floor function: 床関数
- ceiling function: 天井関数
- degree: 次数