Balanced BSTs
Background
Suppose we want to store a collection of distinct integers (henceforth, we call them keys, for no good reason except that I can refer to them with the 🔑 emoji) with support for:
insert(x)
insertx
into the collection if it is not there alreadydelete(x)
deletingx
from the collection if it is presentsearch(x)
determine ifx
belongs to the collection
The motivation for building binary search trees is to capture a happy middle-ground between the following extremes:
- if I store my data fully sorted, then it is hard to insert and delete1, but easy to search2, but
- if I store my data in chaos-mode, then it is easy to insert and delete3, but hard to search4.
Binary search trees are the natural answer to the question:
If binary search was a data structure, what would it look like?
To achieve O(\log n) complexity for search and insertion (and deletion), we want a relatively “loose” sense of structure on the data: it should be fluid enough for insertion and deletion to be manageable, but meaningful enough for searches to be fast.
One way to do this is to store everything in a binary tree where every node u
stores a key k_u with the assurance that:
- all nodes that belong to the subtree rooted at the right child of
u
have keys whose values are greater than k_u; - all nodes that belong to the subtree rooted at the left child of
u
have keys whose values are smaller than k_u.
With this structure in place, search(x)
works exactly as you would expect: if the binary search tree is rooted at r, check if x
equals k_r: if yes, then we’ve found what we are looking for, so abort and celebrate; on the other hand, if not, then repeat the following until you’ve either found x
or have run out of elements in the tree:
- if
x
is greater than k_r, continue searching in the 🌳 rooted at the right child of r - if
x
is smaller than k_r, continue searching in the 🌳 rooted at the left child of r
Notice that if you hit a leaf without a match, you can safely conclude that x
does not belong to the collection. Also notice that the complexity of search is proportional to the length of the longest root-to-leaf path in the tree5 in the worst case.
To insert(x)
, first search(x)
: if x
is found, there’s nothing to do; if x
is not found, then make note of the leaf where the search stopped — say \ell: then if x
is greater than k_\ell, introduce a right child of \ell and assign to it the key value x
; on the other hand if x
is less than k_\ell, introduce a left child of \ell and assign to it the key value x
. Notice the tree with the newly added node is still a valid binary search tree.
What about delete(x)
? If x
is not in the collection, there’s nothing to do. If x
does belong to the collection and is associated with a leaf node, then the node can simply be knocked out, no further action required. However, what if the node whose key value is x
— say v — is an serious one, with a left and a right child? One possibility is to swap x
with an appropriate value ascribed to some leaf node, and then knock out the leaf: which now carries the value of x
. Which leaf should we swap with? It’s clear that it can’t be just any leaf: but you could pick the smallest value in the right subtree or the largest value in the left subtree — both of these reside and leaves, and the swaps with them would leave you with a valid BST after the deed is done.
To find the smallest value in the right subtree, you can take the right turn from v and then go left until you are stuck. If you prefer the largest value in left subtree instead, take the left turn from v and go right until you are stuck. Either of these should work fine. You can find some examples of deletions in a BST here.
Notice that the complexity of all three operations are proportional to the height of the tree: so if we can control the height, we control the complexity of these operations! Notice also that the height of a BST — if built by implementing the algorithms above — can be as bad as linear in the number of keys6, which is no good. Play around with this visualization of BSTs to get a feel!
(2,3) Trees
One way of controlling the height of a BST is to simply demand it: this is called solving a problem by definition :) What if we could insist on having BSTs where the underlying tree structure must have the following additional property:
- all nodes have either 0 or two children
- all leaves are at the same distance from the root.
This would automatically force the structure of the underlying tree to be complete i.e, if the tree has height h, any node whose distance from the root is less than h must be a branching node (indeed, if not, then it is a leaf by property (1), but then we will have a violation of property (2)). Why is this great? Because if you have n nodes packed into a complete tree of height h, then the height must, in fact, be O(\log n) — which is what we wanted.
There is, however, a small catch with this: such trees may not exist for some values of n, the total number of keys that we are trying to accommodate :( We can fix this in two ways: add some dummy keys with values that either duplicate existing values till we have a feasible number of keys to work with (ugh), or make room for multi-ary search trees (cool!).
Since we want the elegant approach, let’s push our definition further:
- all nodes have either 0, two, or three children
- nodes that have two children store one key value and satisfy the standard BST property
- nodes with three children store two key values, say
a
andb
; and further:
- the subtree rooted at the left-most child only hosts values smaller than
a
- the subtree rooted at the right-most child only hosts values larger than
b
- the subtree rooted at the middle child only hosts values strictly between
a
andb
- all leaves are at the same distance from the root.
Adapting search(x)
to multi-ary trees is extremely natural.
If the tree is rooted at r and r is a node with two children, proceed as before.
Otherwise, suppose r hosts the values a_r and b_r. Then check if x
equals either a_r or b_r: if yes, then we’ve found what we are looking for, so abort and celebrate; on the other hand, if not, then repeat the following until you’ve either found x
or have run out of elements in the tree:
- if
x
is greater than b_r, continue searching in the 🌳 rooted at the right-most child of r - if
x
is smaller than a_r, continue searching in the 🌳 rooted at the left-mostt child of r - otherwise, continue searching in the 🌳 rooted at the middle child of r
What about insert(x)
? We search for x
as before. If found, there’s nothing to be done. If not, there are two scenarios:
- The search terminates at a leaf that has one key value.
- The search terminates at a leaf that has two key values.
Note that unlike before, we cannot just have the leaf sprout a child to accommodate the new key value (remember that all leaves have to be at the same distance from the root). So our algorithm pushes upwards, and works as follows:
If there’s room in the current node (i.e, the node is one with two children or a leaf with one key value), accommodate the new value into that node, reorganizing as necessary. In particular, when a node with two children + one value transforms into one with three children + two values, then the subtrees have to be appropriately reorganized.
If there’s no room in the current node, then move up to the parent. If there’s no parent: then you’ve hit the root! If the root turns out to be crowded too, then you’ll have to split the root (note that this causes the height of the tree to increase, but does not happen too often), but if the root has room then follow the procedure in the previous step.
Footnotes
can be as bad as linear time↩︎
can be as good as logarithmic time↩︎
constant time!↩︎
can be as bad as linear time↩︎
Since it’s going to be awkward to keep talking about the “length of the longest root-to-leaf path in the tree”, we will simply refer to this as the height of the tree.↩︎
Try, for example, inserting 1,2,3,\ldots,n in that order.↩︎