*by Carl Burch, Hendrix College, August 2012*

Persistent data structures by Carl Burch is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.

A **persistent data structure** is one in which no
operations result in permanent changes to the underlying
structure.
It's called “persistent” because as the structure goes
through successive operations, all versions of the
structure persist over time.
The notion of a structure with no permanent changes may sound
like an oxymoron; the simplest example, a stack, will illustrate how it works.

(Many of the structures described in this chapter
are selected from Chris Okasaki's book *Purely Functional
Data Structures* (1998). The book is an adaptation of his
Ph.D. dissertation.)

A classic stack has four operations, and each happens to map precisely to a Haskell operation on lists. Following are the Java calls and the Haskell equivalents.

| Java method | Haskell equivalent |
| --- | --- |
| `stack.peekTop();` | `head stack` |
| `stack.isEmpty();` | `null stack` |
| `stack.push(value);` | `value : stack` |
| `stack.pop();` | `tail stack` |

Of course, none of the Haskell functions actually changes the stack; but if we have a Java program using a stack, we can translate each stack operation into the corresponding Haskell expression.
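For instance, here is a brief runnable sketch (the variable names are illustrative) showing these list operations acting as stack operations, and showing that a "push" leaves the old version of the stack intact:

```haskell
-- Lists as persistent stacks: each "operation" builds or inspects
-- a list without modifying any existing version.
main :: IO ()
main = do
  let stack  = [2, 1]       -- a stack with 2 on top
      stack' = 3 : stack    -- push 3: a new version of the stack
  print (head stack')       -- peekTop on the new version: 3
  print (tail stack')       -- pop: recovers [2,1]
  print (null stack)        -- isEmpty: False
  print stack               -- the original version persists: [2,1]
```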

Consider, for example, the problem of writing a method to
check whether a string contains parentheses
and braces that match: That is, we want the string
“`{ab(c)}`”
to pass, but not “`{ab(c})`” or “`{ab(c))`.”
Such a program is easy to write in Java using a stack that
tracks which unmatched opening brackets we've found so far.

```java
public boolean bracketsMatch(String str) {
    Stack<String> stack = new Stack<String>();
    for (int i = 0; i < str.length(); i++) {
        char c = str.charAt(i);
        if (c == '(' || c == '{') {
            stack.push("" + c);
        } else if (c == ')') {
            if (stack.isEmpty() || !stack.pop().equals("(")) return false;
        } else if (c == '}') {
            if (stack.isEmpty() || !stack.pop().equals("{")) return false;
        }
    }
    return stack.isEmpty();
}
```

The Haskell translation is as follows. You can see that the places in the code that correspond to stack operations use the Haskell functions illustrated earlier.

```haskell
bracketsMatch str = sub str []
  where sub [] stack = null stack
        sub ('(':rest) stack = sub rest ('(':stack)
        sub ('{':rest) stack = sub rest ('{':stack)
        sub (')':rest) ('(':bottom) = sub rest bottom
        sub (')':rest) _ = False
        sub ('}':rest) ('{':bottom) = sub rest bottom
        sub ('}':rest) _ = False
        sub (_:rest) stack = sub rest stack
```
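We can check the function on the examples given earlier; the definition is repeated in this sketch so that it runs on its own.

```haskell
bracketsMatch :: String -> Bool
bracketsMatch str = sub str []
  where sub [] stack = null stack
        sub ('(':rest) stack = sub rest ('(':stack)
        sub ('{':rest) stack = sub rest ('{':stack)
        sub (')':rest) ('(':bottom) = sub rest bottom
        sub (')':_) _ = False
        sub ('}':rest) ('{':bottom) = sub rest bottom
        sub ('}':_) _ = False
        sub (_:rest) stack = sub rest stack

main :: IO ()
main = mapM_ (print . bracketsMatch) ["{ab(c)}", "{ab(c})", "{ab(c))"]
-- prints True, then False, then False
```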

The value of `stack` technically never changes; that is, we
don't change any pointers in the stack, for after all that's
impossible to accomplish within Haskell. The `stack` parameter
is different, though, for different levels of recursion, each
value referring to a different version of the stack.

Since functional programming doesn't allow changes to pointers, persistent data structures are essential to writing substantive programs in functional languages. But the fact that functional languages need the concept shouldn't lead us to believe that their usefulness is restricted to them. In imperative languages, it's sometimes useful for a program to have a structure in which several different versions of the structure can be stored simultaneously. Persistent structures are ideal for this.

It's easy to translate the persistent-stack concept into Java.

```java
public class Stack {
    private Object value;
    private Stack rest;

    public Stack(Object value, Stack rest) { this.value = value; this.rest = rest; }
    public Stack() { this.rest = this; }

    public Stack push(Object value) { return new Stack(value, this); }
    public Stack pop() { return rest; }
    public boolean isEmpty() { return rest == this; }
    public Object peekTop() { return value; }
}
```

A tree is a somewhat more complex example of a data structure that adapts conveniently into a persistent structure. First we define the data type and a couple of query functions.

```haskell
data Tree a = Empty | Node a (Tree a) (Tree a)

isEmpty Empty = True
isEmpty _     = False

contains value Empty = False
contains value (Node nval left right)
    | value == nval = True
    | value < nval  = contains value left
    | value > nval  = contains value right
```

The traditional implementation of insertion, in which we search down to where the inserted value belongs along the tree's fringe and change the pointer there to point to a new node, won't work in a persistent structure, because of that "change the pointer" step. Insertion, then, is written differently: instead of changing a pointer, we create a leaf and re-create the nodes leading down to the leaf, while sharing the subtrees not visited during the insertion. In the figure below, note how after the insertion both the original tree and the new tree are available (though with different roots).

The code to implement tree insertion is as follows.

```haskell
insert value Empty = Node value Empty Empty
insert value (Node nval left right)
    | value < nval = Node nval (insert value left) right
    | otherwise    = Node nval left (insert value right)
```
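To illustrate the sharing, here is a self-contained sketch (repeating the definitions from the text) in which both the pre-insertion and post-insertion versions of a tree are queried:

```haskell
data Tree a = Empty | Node a (Tree a) (Tree a)

contains :: Ord a => a -> Tree a -> Bool
contains _ Empty = False
contains value (Node nval left right)
    | value == nval = True
    | value < nval  = contains value left
    | otherwise     = contains value right

insert :: Ord a => a -> Tree a -> Tree a
insert value Empty = Node value Empty Empty
insert value (Node nval left right)
    | value < nval = Node nval (insert value left) right
    | otherwise    = Node nval left (insert value right)

main :: IO ()
main = do
  let t0 = foldr insert Empty [5, 2, 8]   -- one version of the tree
      t1 = insert 3 t0                    -- a second version
  print (contains 3 t0)   -- False: the original version is unchanged
  print (contains 3 t1)   -- True: the new version sees the insertion
```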

Removing a node is slightly more complicated, as it is with non-persistent binary search trees. Following is an implementation.

```haskell
remove value Empty = Empty
remove value (Node nval left right)
    | value < nval  = Node nval (remove value left) right
    | value > nval  = Node nval left (remove value right)
    | isEmpty right = left
    | isEmpty left  = right
    | otherwise     = Node sval left newright
  where (sval, newright) = removeSmallest right
        removeSmallest (Node mval Empty mright) = (mval, mright)
        removeSmallest (Node mval mleft mright) =
            (val', Node mval left' mright)
          where (val', left') = removeSmallest mleft
```
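As with insertion, removal leaves the prior version of the tree intact. A self-contained sketch (definitions repeated from the text, with a hand-built example tree):

```haskell
data Tree a = Empty | Node a (Tree a) (Tree a)

isEmpty Empty = True
isEmpty _     = False

contains _ Empty = False
contains value (Node nval left right)
    | value == nval = True
    | value < nval  = contains value left
    | otherwise     = contains value right

remove _ Empty = Empty
remove value (Node nval left right)
    | value < nval  = Node nval (remove value left) right
    | value > nval  = Node nval left (remove value right)
    | isEmpty right = left
    | isEmpty left  = right
    | otherwise     = Node sval left newright
  where (sval, newright) = removeSmallest right
        removeSmallest (Node mval Empty mright) = (mval, mright)
        removeSmallest (Node mval mleft mright) =
            (val', Node mval left' mright)
          where (val', left') = removeSmallest mleft

main :: IO ()
main = do
  let t0 = Node 5 (Node 2 Empty Empty) (Node 8 (Node 6 Empty Empty) Empty)
      t1 = remove 5 t0
  print (contains 5 t1)   -- False: 5 is gone from the new version
  print (contains 5 t0)   -- True: the original version still has it
```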

If the tree maintains an `O`(log `n`) depth, then insertion and
deletion in this persistent tree take `O`(log `n`) time, which
is equivalent to the time taken by a non-persistent tree
(although the constant factor will admittedly be
larger). (If the assumption of `O`(log `n`) depth
disturbs you, it should. But adapting the techniques to
a balanced binary search tree (e.g., a red-black tree) is not
difficult.)
However, there is a loss in terms of memory consumption:
Whereas a standard tree requires
`O`(1) memory for insertion and deletion,
the addition of persistence comes at the price of `O`(log `n`) memory
per operation.

Representing a queue is a more significant challenge than representing a stack or a tree: a queue must keep its elements in a linear structure while still allowing modification at both ends. We cannot do this through a straightforward translation of the non-persistent structure.

One approach is to “buffer” items added onto the end of the queue until they are needed. In this scheme, the queue consists of two lists: a list of items found at the front of the queue, and a list of the items found after them in the queue, listed in reverse order. In such a scheme, the front of the queue is the first thing in the first list (or if that list is empty, the last thing in the second list); and the rear of the queue is the first thing in the second list (or if that list is empty, the last thing in the first list).

```haskell
type Queue a = ([a], [a])

emptyQueue = ([], [])

isEmpty ([], []) = True
isEmpty (_, _)   = False

peekFront ((val:_), _) = val
peekFront ([], second) = last second
```

Finding the front of the queue is problematic when
the first list is empty: Note that we call `last`, which
will end up stepping through the entire second list to find whatever
is at the end. So that this doesn't happen, in what follows
we'll be careful to maintain the invariant that the first of the two
lists is never empty unless the entire queue is empty. This means that
the second case of `peekFront` will never occur, and so
the function will always take `O`(1) time.

Enqueuing a value in this structure is simple: Since the value at the front of the second list is the queue's rear, we want simply to add the enqueued value there. The first case is to guarantee the invariant (that the first list is empty only when the entire queue is empty).

```haskell
enqueue value ([], []) = ([value], [])
enqueue value (first, second) = (first, value : second)
```

In dequeuing, we want to simply remove the first item from the first list. If this empties that list, then we need to refill it from the second to maintain the invariant.

```haskell
dequeue ([], _)           = error "queue is empty"
dequeue (_:[], second)    = (reverse second, [])
dequeue (_:first, second) = (first, second)
```

Note, in the first case, that the invariant implies that an empty first list means the entire queue is empty.
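A self-contained sketch (repeating the queue definitions from the text) shows the two-list representation at work, including the refill on `dequeue` and the persistence of the old version:

```haskell
type Queue a = ([a], [a])

emptyQueue :: Queue a
emptyQueue = ([], [])

peekFront :: Queue a -> a
peekFront ((val:_), _) = val
peekFront ([], second) = last second

enqueue :: a -> Queue a -> Queue a
enqueue value ([], []) = ([value], [])
enqueue value (first, second) = (first, value : second)

dequeue :: Queue a -> Queue a
dequeue ([], _)           = error "queue is empty"
dequeue (_:[], second)    = (reverse second, [])
dequeue (_:first, second) = (first, second)

main :: IO ()
main = do
  let q = foldl (flip enqueue) emptyQueue [1, 2, 3]   -- ([1], [3,2])
  print (peekFront q)             -- 1
  print (peekFront (dequeue q))   -- 2: the reversal refills the first list
  print (peekFront q)             -- still 1: the old version persists
```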

Given the invariant, `peekFront`, `isEmpty`, and `enqueue`
all take `O`(1) time.
The `dequeue` function takes `O`(`n`) time in the case
that the first list contains only one item. One would hope that
we could guarantee that this happens only rarely, rarely
enough that `dequeue` takes `O`(1) time *on
average*. Such a guarantee would be possible were it true that
any given version of the queue would have only one operation
performed on it. But such is not the case: if we
had a queue in which the first list held only one element, then
we could call `dequeue` twice on that same queue, repeatedly,
and each time the second `dequeue` would take `O`(`n`)
time. (It may seem that the first `dequeue` in such a
situation would take `O`(`n`) time, since that's the one for
which `reverse` is called; but Haskell is lazy,
and so it would not complete the call to `reverse`
until the result is needed, during the
second call to `dequeue`.)

We could guarantee that `dequeue` takes `O`(1) time on
average if we could maintain the invariant that the second list
is never longer than the first list. Developing a structure that
maintains this invariant is left as an exercise to the reader.

A random-access structure is the abstract data type corresponding to an array: it is an indexed structure supporting two operations, one to retrieve a value based on an integer index, and the other to update the structure based on an integer index and a value.

```haskell
get :: RandomAccess a -> Int -> a
set :: RandomAccess a -> Int -> a -> RandomAccess a
```

The frequent usage of arrays in imperative languages would lead one to believe that random-access structures are extraordinarily important. It's easy to exaggerate this importance, though: often, imperative programs use arrays only for iterating through elements, and this can be done as easily and as efficiently by iterating through lists. (We're talking about efficiency here in big-O terms; there would likely be a constant-factor slowdown.) Still, there are those other times when we genuinely want a random-access structure, and lists turn out not to be an adequate substitute.

From imperative programming, we are accustomed to using arrays
to implement the random-access structure, which gives `O`(1)
time both for reading and writing. The array concept, however,
does not carry over cleanly as a persistent structure, as the
`O`(1) `set` function relies on being able to change
existing memory. The straightforward way of adding persistence
is to have the `set` function create a clone of the
array (with the one requested position altered), but the
cloning would take `O`(`n`) time.

One way to reduce the `set` time is
to add a "buffer" accumulating the changes in an auxiliary list
structure. We'll clone the array only when this list
reaches a length of `O`(√`n`).
As a result, we'll pay the `O`(`n`) clone cost only once every
`O`(√`n`) operations, and other calls to `set` will
take `O`(1) time, for an average cost of `O`(√`n`).
Unfortunately, the `get` function would also need to be
modified to search through the list before looking into the
array, and searching the list will take `O`(√`n`) time.
This buffer approach ends up trading off `set` time for
`get` time.

However, trees provide a very different approach that works even better. In this approach, we'll maintain a nearly-complete binary tree, and individual array elements can be maintained at the leaves. We define the array type as follows.

```haskell
data ArrayNode a = Leaf a | Node (ArrayNode a) (ArrayNode a)

type Array a = (Int, ArrayNode a)
```

An array here is a tuple holding the array length followed by the root of a tree. Within the tree, each leaf holds a value, and each internal node has two subtrees.

When we create an array, we'll pass in the array size and a
function that maps each array index to an array value.
In this way, a call to "`makeArray 4 (\i -> 3*i)`" generates
the array represented by the list `[0,3,6,9]`.

```haskell
makeArray size f = (size, sub 0 size)
  where sub i 1 = Leaf (f i)
        sub i n = Node (sub i lsz) (sub (i + lsz) (n - lsz))
          where lsz = n `div` 2
```
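We can confirm the example above with a self-contained sketch; here `toList`, which flattens the leaves from left to right, is an illustrative helper not found in the text.

```haskell
data ArrayNode a = Leaf a | Node (ArrayNode a) (ArrayNode a)
type Array a = (Int, ArrayNode a)

makeArray :: Int -> (Int -> a) -> Array a
makeArray size f = (size, sub 0 size)
  where sub i 1 = Leaf (f i)
        sub i n = Node (sub i lsz) (sub (i + lsz) (n - lsz))
          where lsz = n `div` 2

-- toList (a helper for illustration) flattens the leaves, left to right.
toList :: Array a -> [a]
toList (_, root) = go root []
  where go (Leaf x) acc   = x : acc
        go (Node l r) acc = go l (go r acc)

main :: IO ()
main = print (toList (makeArray 4 (\i -> 3 * i)))   -- [0,3,6,9]
```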

The tree generated by the `sub` subfunction here
will be complete except for the bottom level. For
example, a tree with 13 nodes, as you can see, will have 6
nodes in the left subtree and 7 in the right subtree.

The `get` function simply traverses down this tree.
As it traverses down, it will need to track the index desired
within the current subtree, as well as the size of the current
subtree.

```haskell
get (size, root) index = sub index size root
  where sub 0 _ (Leaf x) = x
        sub _ _ (Leaf _) = error "index out of bounds"
        sub i n (Node left right) =
            if i < lsz then sub i lsz left
                       else sub (i - lsz) (n - lsz) right
          where lsz = n `div` 2
```

Finally, the `set` function will do the same, except
that it will have to sew the tree back up, just as we saw with
binary search trees previously.

```haskell
set (size, root) index value = (size, sub index size root)
  where sub 0 _ (Leaf _) = Leaf value
        sub _ _ (Leaf _) = error "index out of bounds"
        sub i n (Node left right) =
            if i < lsz then Node (sub i lsz left) right
                       else Node left (sub (i - lsz) (n - lsz) right)
          where lsz = n `div` 2
```
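A self-contained sketch ties these pieces together; the definitions are repeated from the text, with `set` wrapping its result back into the `(size, root)` pair so that successive operations compose. As with the earlier structures, the pre-update version of the array persists alongside the new one.

```haskell
data ArrayNode a = Leaf a | Node (ArrayNode a) (ArrayNode a)
type Array a = (Int, ArrayNode a)

makeArray size f = (size, sub 0 size)
  where sub i 1 = Leaf (f i)
        sub i n = Node (sub i lsz) (sub (i + lsz) (n - lsz))
          where lsz = n `div` 2

get (size, root) index = sub index size root
  where sub 0 _ (Leaf x) = x
        sub _ _ (Leaf _) = error "index out of bounds"
        sub i n (Node left right) =
            if i < lsz then sub i lsz left
                       else sub (i - lsz) (n - lsz) right
          where lsz = n `div` 2

set (size, root) index value = (size, sub index size root)
  where sub 0 _ (Leaf _) = Leaf value
        sub _ _ (Leaf _) = error "index out of bounds"
        sub i n (Node left right) =
            if i < lsz then Node (sub i lsz left) right
                       else Node left (sub (i - lsz) (n - lsz) right)
          where lsz = n `div` 2

main :: IO ()
main = do
  let a0 = makeArray 4 (\i -> 3 * i)   -- [0,3,6,9]
      a1 = set a0 2 100
  print (get a1 2)   -- 100 in the new version
  print (get a0 2)   -- 6 in the original version, which persists
```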

Since the tree is complete except possibly at the bottom level,
its depth is ⌈lg `n`⌉, and so both reading and writing
in this tree-based implementation take `O`(log `n`) time.

We can summarize the performance of the three techniques we've examined here in a table.

| structure | reading | writing |
| --- | --- | --- |
| array | `O`(1) | `O`(`n`) |
| array with buffered writes | `O`(√`n`) | `O`(√`n`) |
| tree | `O`(log `n`) | `O`(log `n`) |

There's no real reason to use the array with buffered writes, as the tree-based implementation beats it on both operations. The regular array implementation is still useful, though, in cases involving many more reads than writes.

This is not the end of the story with persistent random-access
structures, however. How close one can come to `O`(1) time
is an active research question. One technique, too complex to
describe here, claims `O`(1)
time for writing but still `O`(log `n`) time for
reading [O'Neill and Burton, “A new method for functional
arrays,” *Journal of Functional Programming*, 1997,
487–513]. An even more complex technique improves the read time
further to `O`(log log `n`) while maintaining `O`(1) access
time [Dietz, “Fully persistent arrays (extended
abstract),” *Lecture Notes in Computer Science* vol. 382,
Springer-Verlag, 1989, 67–74].