B-Trees
- BSTs can either be bushy or spindly.
- A spindly BST tree is slow to access elements. Similar to a linked list. $O(N)$ time for accessing elements.
- A bushy BST is fast to access elements, $O(\log N)$ time to access elements.
- This motivates us to construct a “Balancing-Tree”, a tree that always attempts to maintain a bushy structure.
Note about Big-Theta versus Big-O. Big Theta is typically more informative than Big O as it restricts us to one family of functions. Something that is O(N) could also be O(N^2). The upper bound can keep going up without consequence.
B-Tree Operations
- BST performance is reliant on Height and Depth
- Height: The depth of the deepest leaf and is that tree-wide property.
- Depth: The distance from the root of a node and is node-specific.
- Average Depth: The mean of the depth of every node.
- The core-concept behind a B-Tree is the ability to overstuff the leaf nodes.
- The problem with overstuffing is that with sufficiently enough items in a node, searchign sequentially within a node requires linear search.
Moving items up.
- Once a node becomes saturated with enough values, we may move up a value to the parent node.
- To address this, we create many children for each parent overstuffed parent node.
- In the above example, we have nodes representing intervals for $(-\infty, 15), [15,17], (17, +\infty)$
- We can set some constant $L$ as the limit on our node size.
- This would make our search time increase only by a constant factor, so we preserve our asymptotic behavior.
- When we add a node, we may cause a chain reaction where parent nodes successively split.
- A B-Tree can have different names based on the number of children it may have in a node.
- We work with 2-3 trees, which contains a limit of 2 items per node, where each node can have 2 or three children.
B-Tree Usage
- B-Trees are used for two main contexts
- We use a small L for self-balancing search trees.
- We use a large L in the thousands for databses and filesystems.
- The construction of a B-Tree grants it two invariants:
- All leaves are the same distance from the root.
- A non-leaf node with k items must have exactly k+1 children.
- This guaratees us a B-Tree with $\log N$ height
B-Tree Runtime Analysis
- let $L$ be the maximum items per node and $N$ be the number of items. The maximum height should be $\log_{L+1} N$ in the best case when all nodes have $L$ items and $\log_{2}N$ worst case when each node has 1 item.
- Worst case forces us to look at up to $L$ items per node, but since height is logarithmic, the runtime of
contains
is bounded by $O(L \log N)$. This tells use that the runtime ofcontains
is $O(\log N)$Runtime for
add
- The logic for
add
is similar tocontains
exceptadd
might break up certain nodes. The act of breaking up nodes results in doing $\log N$ additional split operations. THus means that the runtime ofadd
is still $O(\log N)$