menu +

Finding Matches Quickly, Even With Very Little Space

Representing a string of balanced parentheses is a favorite (read: fundamental) problem in succinct data structures. The question involves storing a string of balanced parentheses with good support for the operations of finding matching parentheses and the nearest enclosing parentheses. Here, we are going discuss a representation that requires only $2n + o(n)$ bits, and supports both operations in constant time, where we $2n$ is the total length of the string considered.

Formally, a balanced parentheses is subset of $\{(,)\}^*$ defined recursively as follows: the empty string is a balanced parentheses string, and if A and B are balanced parentheses, then so is the string A(B). Further, the operations that we would like to support are the following.

findopen(x): To find the index of the opening parenthesis that matches a given closing parenthesis x.

findclose(x): To find the index of the closing parenthesis that matches a given opening parenthesis x.

enclose(x): To find the opening parenthesis of the pair that most tightly encloses x.

q l

An example of a balanced parentheses string.

This exposition focuses on motivating the details of the data structure constructed based on the need to find matching parentheses quickly. It turns out that support for findenclose is almost a free consequence. We begin by enumerating some natural attempts at supporting these operations, without any ambitious goals with respect to space and time.

You Win Some, You Lose Some: Simple Trade-Offs

If we represent the string in an array, we need only $2n$ bits of space, but both operations require linear time. On the other hand, we could store all the answers explicitly, that is, for each parenthesis, also keep pointers to the matched and the nearest enclosing parenthesis. This gives us constant time lookups for both operations. However, in the bargain, the amount of space we need has increased by an $O(\log n)$ factor — since every pointer stored will demand $\log 2n$ bits.

Also, the total number of strings with $n$ balanced parentheses is asymptotically $\frac{4^n}{n^{3/2} \sqrt{\pi}}$, thus we cannot avoid having to use $\log (4^n) = \log (2^{2n}) = 2n$ bits, ignoring $\log$ factors. This is the bound on $C_{2n}$, the $(2n)^{th}$ Catalan number, which also happens to be a count of the number of balanced parentheses of length $2n$ (why?).

In what proceeds, we will describe how we can achieve constant time lookups for both operations by using only $2n + o(n)$ bits, which is only $o(n)$ on top of what is absolutely necessary. These method discussed here is due to Geary, Rahman, Raman and Raman 1. Our presentation is largely loyal to the paper, except for minor differences in terminology.

Divide, Divide, till you Conquer

The basic idea is to divide the string into small enough chunks, to the point where storing answers explicitly whenever the information is available within a chunk is affordable. Then we are left with the task of making the information that’s split “across” the chunks accessible. This is done by identifying a small number of important members in each chunk, which happen to double up as efficient pointers to all the remaining information.

The First Hack

In the first step, we’re going to split the string into “blocks” of size $(\log n)/2$. Let the blocks be labelled $B_1, \ldots, B_{\beta}$, where $\beta$ is $\lceil 4n/(\log n) \rceil$ (recall that the length of the string is $2n$). For a parenthesis $x$, we use $b(x)$ to refer to the block that $x$ lies in, and we denote the matching parenthesis by $x^*$. Call a parenthesis a near parenthesis if $b(x) = b(x^*)$, and call it a far parenthesis otherwise.

q l

An illustration of the initial division of the long string into smaller blocks.


We claim that we can already afford to store all the information we want about the near parentheses explicitly. Notice that simply storing a table with all the information for each block separately will simply add up to too much: each block will demand $O(\log n)$ units of space and naively multiplying this by the number of blocks brings us up to $O(n)$ space on top of the base array, well beyond our budget.

However, if we are just a little observant, then we might imagine that many of the tables for different blocks have the same information — indeed, if the near parentheses in two different blocks happen to be exactly the same string, then the tables corresponding to them are identical, and we certainly don’t need multiple copies: one copy of the table with a “tag” indicating which blocks share this table would suffice.

q l

An example demonstrating how identical near strings can be different when it comes to exact locations for matches.

This remark comes with a caveat, however: even if the near parentheses strings look completely identical, they might be scattered around the far parentheses in different ways. For example, denoting near parentheses with () and the far ones by {}, let’s consider (see figure on the left) the two strings ()}{((()){) and ()((}{({))). These have the same “configuration” as far as the near parentheses are concerned, both of them “boil down to” ()((())). However, the location of the matches will be different depending on how the far parentheses are intercepting this string. Nonetheless, it turns out that we will be able to make this distinction by storing a dictionary structure on top of our string, a detail that we will return to momentarily.

In the meantime, is the redundancy that is we gain after neglecting the far parentheses enough for us to argue a $o(n)$ space bound on explicit tables?

Let $N_i$ be the string of balanced parentheses obtained by considering only the near parentheses in the block $B_i$. Formally, if

$$w_i[1] w_i[2] \ldots w_i[r]$$

is the string corresponding to block $B_i$, and the indices of the near parentheses in $B_i$ are $j_1, \ldots, j_{b_i}$, then $N_i$ is simply the following string:

$$w_i[j_1]w_i[j_2] \ldots w_i[j_{b_i}].$$

Let ${\cal B}(t)$ be the collection all $N_i$ for which $|N_i|$ is $t$. How many distinct strings can there be in ${\cal B}(t)$? Since we know that the number of strings of balanced parentheses of length $k$ is the $k^{th}$ Catlan number, the size of ${\cal B}(t)$ is at most $C_t$, or

$$|{\cal B}(t)| \leq {2t \choose t} \frac{1}{t+1} \leq \left(\frac{2t e}{t}\right)^t \frac{1}{t+1} = \frac{(2e)^t }{t+1} \leq \frac{(2e)^{(\log n)/2} }{t+1} \leq \frac{\sqrt{n}}{t+1} = O(\sqrt{n}).$$

The computation above is rather straightforward, we only need to remember that $t \leq (\log n)/2$ for obtaining the third inequality.

Now, for every string $s$ in ${\cal B}(t)$, and for every location $x$ in $s$, the plan is to store, explicitly, information about $x$ that might be useful — for instance, the location of the matching parenthesis of $x$. (We might need to store more pointers to record other kinds of information, but for now we’ll just analyze the cost of having one explicit table, and having more tables will only pile on a constant multiplicative factor.) So we are storing, for every position in the string $s$, addresses that range from $1$ to $\log n$. The resulting table, therefore, requires $O( \sqrt{n} \log n \log \log n)$ bits, where the three terms correspond, respectively, to the number of strings in ${\cal B}(t)$, the number of locations in each such string, and the number of bits required to store the pointers for each of these locations. Finally, since there are such tables for every even number $t$ in the range $(2,(\log n)/2)$, the total amount of space consumed by the tables amounts to:

$$O(\sqrt{n} (\log n)^2 \log \log n).$$

To wrap up, for every table, we just store a list of blocks for which the table is relevant. This contributes only an additional $(\log n)^2$ bits.

So with this, we have the ability to store a fair amount of explicit information with only a $o(n)$ overhead. This information is almost enough to support findopen(x) when $x$ is a near parenthesis: indeed, suppose $x$ is in block $B_i$, then we will appeal to the table that realizes the pattern of $N_i$, and look up the match stored in that table. What remains to be done is to deduce, from this match, the actual location in the original string, which might be distorted because of some far parentheses that come in the way. We’ll come back to this in due course.

Remark: Along with the matches, we will also have a similar table that stores the location of the most tightly enclosing opening and closing parenthesis (whenever they happen to be near parentheses in the same block) for every near parenthesis.

The Second Attack

We now have to deal with remaining parentheses, which are those whose matches lie in blocks different from their own. Recall that we referred to these as the far parentheses. Let us begin by considering $F_i$, which are all the far parentheses in the block $B_i$, defined analogously to $N_i$. For the sake of comprehending the situation, let’s stare at the block numbers of the matching parentheses. Notationally, let there be $t$ opening far parentheses in $B_i$ at indices $o_1, \ldots, o_t$:

$$F_i(O) := w_i[o_1]w_i[o_2] \ldots w_i[o_{t}],$$

so that we are considering the sequence:

$$ b(w_i[o_1]^*), b(w_i[o_2]^*), \ldots b(w_i[o_{t}]^*).$$

Note that these are numbers larger than $i$ and are in a non-increasing sequence, since the earliest opening parentheses will be matched out the farthest. The situation for all the far closing parentheses is symmetric. (See figure below.)

q l

A picture illustring how the matches of the far parentheses are laid out across blocks.


Notice that if we draw the lines as we have drawn above (from a parentheses to it’s matching partner in a later or earlier block) for all far parentheses in the whole sequence, then we can always draw the lines in a non-crossing fashion, since the parentheses are properly nested (and crossings would imply non-nested parentheses).

Notice also that it while it is tempting to mimic the process of explicit storage that worked for the near parentheses, it is too expensive. The length of $F_i$ most $(\log n)/2$, and since the far parentheses strings necessarily look like a sequence of closed parentheses followed by a sequence of opening parentheses (why?), the collection of all the $F_i$ can have at most $(\log n)/2$ distinct strings. However, observe that even the strings $F_i$ and $F_j$ are identical for two distinct blocks, the corresponding tables — that have pointers to matching parentheses — can be wildly different. Thus there is no natural notion of redundancy to exploit, and this failure is accompanied by the hint that we need to hunt for global structure.

Let us return to the matching parentheses block sequence for the opening far parentheses in $B_i$ (although we will talk about finding matches for opening parentheses, the same arguments hold for closing parentheses by symmetry):

$$ b(w_i[o_1]^*), b(w_i[o_2]^*), \ldots b(w_i[o_{t}]^*).$$

Notice that the runs (contiguous subsequences of identical numbers) of the sequence above correspond to parentheses whose matching partners are in the same block. For example, if the sequence above was $(7,7,7,6,5,5,5,5,5,5)$, then it means that all of the last six parentheses have their matching partners in the fifth block. It is natural to wonder if these “runs can be compressed” and if we can get away with storing the information only at the “breakpoints” — let us call the $i^{th}$ element of the sequence is a breakpoint if it is different from the $(i-1)^{th}$ element (note that there are three breakpoints in the example sequence). The first question we would have to address is the following: how do we deal with a far parenthesis $x$ corresponding to a location which is not a breakpoint?

Let us say that a far parenthesis $x$ is taken care of by the parenthesis $b$ if $b$ is the parenthesis corresponding to the nearest breakpoint before $x$. Notice that if we could jump from $x$ to $b$ in constant time, then we already know the block in which it’s matching parenthesis lies. It would then just a matter of pinning down exactly where the match is within that block. It is tempting to hope that once we’ve arrived in the correct block, we will be able to navigate within it, possibly with the help of some more precomputed tables.

Thus far, the breakpoints appear promising: let us identify the ingredients that we will need to make it work. Let us refer to a parenthesis that corresponds to a breakpoint a pioneer parenthesis. First, we need a way of jumping, in constant time, from an arbitrary far parenthesis, to the nearest pioneer before it. Then we need to be sure that we can find matches among the pioneers in constant time. Finally, we need to be able to work our way backward from the match of a pioneer to the matches of the far parentheses that are taken care of by the pioneer.

q l

A schematic showing a pioneer parenthesis.

The second of these requirements is met by imposing a recursive implementation of the balanced parenthesis data structure on top of the pioneers. (One might wonder why we cannot store the matches explicitly for the pioneers. After we have a better grip on the number of pioneers, it will be easy to see that this would be too expensive. However, it will also be easy to argue that with one more level of recursion explicit storage becomes affordable, so in all, the whole data structure comprises of two levels of recursion.)

Towards the first requirement, we need to find the last pioneer that precedes any given location on the string. Consider the string $W$ of length $2n$ on the alphabet $\{0,1\}$ obtained by setting the $i^{th}$ bit to $1$ if and only if the $i^{th}$ location corresponds to a pioneer parentheses. Then for every location $j$ of $W$ that is a 0, what we need is the location of the nearest 1 that appears before $j$. This is the so-called rank operation. It is known that this (along with the select(i) operation, which returns the $i^{th}$ 1 in the string) can be supported in constant time using a dictionary data structure. This will require an additional

$$ \left\lceil \log {2n \choose M} \right\rceil + O(2n \log \log 2n / \log 2n) $$

bits (here $M$ is the number of ones in $W$). Notice that:

$$ \log {2n \choose M} \leq \log { \left( \frac{2ne}{M} \right)^M } \leq M \log (2n/M).$$

So if the number of ones in $W$, or in turn, the number of pioneers in the string of balanced parentheses, was bounded by $O(2n/\log 2n)$, then space used by the dictionary would be $o(n)$. If we can show that the number of pioneer parenthesis is linear in the number of blocks, then notice that we are done - recall that we have $(4n/ \log 2n)$ blocks.

Imagine that the blocks are vertices of a graph laid out on a line, and whenever we have an opening pioneer parenthesis $p$ in a block $b_i$, we draw an edge between vertices corresponding to blocks $b_i$ and $b_j$, where $b_j$ is $b(p^*)$, the block of the matching parenthesis. Notice that edges of this graph count the number of opening pioneers — and because the parenthesis are nested, these edges can always be drawn in a way that they do not cross. If we choose to draw the edges in the half-plane below the line on which the vertices where laid out, then we see that we have on our hands an outerplanar graph without multiedges (why?). It is well-known (and not difficult to prove, starting from Euler’s formula for planar graphs) that an outerplanar graph on $n$ vertices has $(2n-3)$ edges. Since there are no multiedges, this implies that the number of opening pioneers is bounded by $O(2n/\log 2n)$, and repeating the argument allows us to conclude the same about the closing pioneers.

In fact, note that if we expanded the set of pioneers to include all parentheses that are matches of pioneers but not pioneers themselves (convince yourself that such parentheses can exist), even then, the size of the expanded pioneer family will double (at most), and we stay within limits. The dictionary is implemented on the expanded pioneer family, and so is the recursive data structure.

So far, we have been able to ensure comfortable navigation within the pioneers, we’re able to jump to the nearest preceding pioneer from anywhere in the string, and among the pioneers, we have recursive support for finding matches, all with only a $o(n)$ overhead, thanks to the dictionary data structure and the fact that the number of pioneers is small enough to fit right in.

We now turn to our last requirement, which involved identifying the match of a far parenthesis given the block in which it’s match lies. More precisely, we are given the match of it’s pioneer parenthesis. Let the far parenthesis be $x$ and let $p$ the pioneer parenthesis that takes care of $x$. We know the location of $p^*$ and we need to get to $x^*$. Let’s look at the “distance” in the string between $p$ and $x$. If we knew that the distance between $p$ and $x$ is the same as the distance between $p^*$ and and $x^*$, then we would have attempted to precompute a table that recorded the distance of $x$ from the pioneer that was taking care of it. Since every opening far parentheses between $p$ and $x$ has to be closed between $p^*$ and $x^*$, it is conceivable that these distances might be the same (observe that if $x$ is an opening parentheses, then there are no closing far parentheses between $x$ and $p$). However, these distances are not equal in general, because of the near parentheses that are strewn in between $p$ and $x$!

However, notice that for every near parenthesis $y$ that is an opening parenthesis between $p$ and $x$, $y^*$ also lies between $p$ and $x$ (use the fact that the parentheses are nested). And this is true of the near parentheses between $x^*$ and $p^*$ as well. So the invariant that we can exploit is the number of unmatched parentheses. If we stored, for every opening far parentheses $x$, the number of unmatched opening parentheses between $x$ and the pioneer that takes care of $x$, then we are in good shape: let’s say that the number of unmatched opening parentheses between $x$ and $p$ is $j$, then then $x^*$ is the $j^{th}$ closing far parenthesis behind $p^*$.

At this point we will fit two birds in one nest: recall that for identifying matches among the near parentheses, we needed to be able to find the $i^{th}$ near parenthesis in a block. So given a block $B_i$, let us construct $W_i$, a string on $\{0,1\}$ whose $i^{th}$ bit is 0 if the corresponding location in the block is a near parenthesis, and is 1 otherwise. Now we can appeal to the dictionary structure again, with support for the rank and select operations (for both ones and zeroes simultaneously): this will enable us the find the $i^{th}$ near or far parenthesis in constant time. Notice that there are only $2^{(\log n)/2} = O(\sqrt{n})$ possible strings to worry about. So we’ll need the dictionary data structure for $O(\sqrt{n})$ strings of length $(\log n)/2$ each, which is still $o(n)$.

Finally, we need to store the number of unmatched parentheses between every far parenthesis $x$ and the pioneer that takes care of $x$. Notice that the number of unmatched parentheses at location $i$ is either one more or one less than the number of unmatched parentheses at location $(i-1)$. Notice that the first far parenthesis of a block is always a pioneer, so the relevant part of a block is everything after the first far parenthesis — let us refer to this part of the block as $B_f$. Associate the string $b_f$ over the alphabet $\{-1,+1\}$, with $B_f$ in the natural way: at location $i$, $b_f$ is $+1$ if the number of unmatched parentheses at location $i$ is one more than the number at $(i-1)$, and it is $-1$ otherwise. With respect to the information that we want to store, it is clear that if the strings associated with two blocks are identical, then the tables corresponding to these blocks are the same. Thus, again we have essentially $2^{(\log n)/2} = O(\sqrt{n})$ possible strings to worry about, and for every such string we are storing $O(\log n\log \log n)$ bits of information to store, so we’re again within the $o(n)$ limit.

Now with all the ingredients in place, let’s summarize the data structure and the constant time process for findopen(x).

Putting things together

So we first chunk the input string into pieces of size $(\log 2n)/2$ each, and we have explicit tables for storing the matches of the near parentheses. Among the far parentheses, we identified pioneer parentheses, and implemented a recursive data structure on the expanded pioneer family, along with a dictionary that supports rank and select for this family. We also have various precomputed tables that require only $o(n)$ bits.

Let us summarize the process of finding the match of, say, an opening parentheses $x$. Let $b(x) = B_i$.

Case 1. $x$ is a near parenthesis. There is a table that matches the pattern of the near parentheses in $B_i$. One lookup in this table gives us a number $i$, which is the location of $x^*$ in $N_i$ (which was the string $B_i$ restricted to the near parentheses). We also have a dictionary on the word $W_i$, which gives us the location of the $i^{th}$ 1 in $B_i$, which is clearly the location of $x^*$.

Case 2. $x$ is a far parenthesis. If $x$ is a pioneer, use the recursive data structure. If $x$ is not a pioneer, let $y$ be the pioneer that takes care of $x$. Using a table lookup, identify $j$, the number of unmatched parentheses between $y$ and $x$. Find $y^*$ using the recursive data structure. Use the dictionary implemented for the extended pioneer family to find the location of $y^*$ in the input string. Let the block that $y^*$ belongs to be $B_i$. Using the dictionary for $W_i$, find the $j^{th}$ far parenthesis behind $y^*$. This is the location of $x^*$.

The process for finding the match of a closing parenthesis is symmetric.

Now we turn to finding the opening parenthesis of the parenthesis pair that most tightly encloses a given parenthesis $x$. Let $z$ be the parenthesis that we are looking for. As before, let $b(x) = B_i$.

First, using table lookups, we can find if either $z$ or $z^*$ lie in $B_i$. If $z$ lies in $B_i$, then we can find the location of $z$ in $N_i$ using a table lookup and as before, we can find $z$ in the string using the dictionary for $W_i$. If $z^*$ is in $B_i$, we find $z^*$ and then appeal to findopen() to find $z$.

So we are left with the case when $z$ and $z^*$ lie outside of $B_i$. Let us first find the earliest pioneer $y$ that lies after $x$, using a dictionary lookup. If this happens to be a closing parentheses, then we find $y^*$, and let $p= y^*$. If this $y$ is an opening parenthesis, then using the recursive data structure, find the pair of pioneer parentheses that most tightly enclose $y$. Let $p$ be the opening parenthesis of this pair. In both cases, it is easy to observe that $p$ is the pioneer parentheses that most tightly encloses $x$.

To get to $z$ from $p$, find the location of the pioneer parenthesis that follows $p$, let this be $q$. The parenthesis $z$ is clearly sandwiched between $p$ and $q$ — if $q$ lies in a block different from $b(p)$, then $z$ is simply the last opening far parenthesis in the block of $p$, and if $b(p) = b(q)$, then $z$ is the last far parenthesis to the left of $q$, and both these locations can be derived from table lookups.

  1. A simple optimal representation for balanced parentheses, Richard F. Geary, Naila Rahman, Rajeev Raman, Venkatesh Raman. In: Theor. Comput. Sci