Storing My Outlines

2004-02-25

One significant advantage a more conventional outline has over the outline structure I've built up here is that it is much easier to store the traditional outline in a file. Using XML, traditional outlines are almost trivial to store:

<node text="A">
  <node text="B" />
  <node text="C" attribute="D" />
  </node>

Even if you don't read XML, you can quickly learn to read this format. About the only difference between what I just wrote and the official OPML specification is that OPML adds a <head> section that includes some metadata about the outline, and uses "outline" instead of "node".

Right now I'm working on how to store my outlines "natively" in Iron Lute. First of all, using that simple style is not an option, because it doesn't allow for anything but a strict tree. There are two other basic choices that I can see. Both require the addition of node IDs, since both need to be able to refer to earlier nodes (and possibly to nodes that come later in the file):

<node text="A" id="1">
  <node text="B" id="2" />
  <node text="C" attribute="D" id="CWQ!" />
  </node>

IDs

For the IDs, I face two choices: do I want them to be globally unique, or just unique within a given file?

Globally unique IDs have some advantages in that they make excellent link targets and having a GUID can preserve the identity of a node, even if you move it from one document to another. But I faced an issue with ensuring that the GUIDs would be truly globally unique, and I couldn't come up with a solution I found satisfactory. (I'm not overly concerned about speed, but all the attempts I tried really slowed Iron Lute down, even with the trivial outline sizes I'm currently dealing in.)

For my purposes, it suffices that IDs are unique within a file/stream, and that's much easier to guarantee. I only use these IDs in the context of the file itself, so they don't even have to stay constant. Other 'anchors' will be created if we want to link into an outline later.

The Two Basic Choices

Now, let us say that I am representing an outline that looks like A->B->A. The two basic choices I see for representing that outline look like this:

<node text="A" id="1">
  <node text="B" id="2">
    <noderef targetid="1" />
    </node>
  </node>

<node text="A" id="1" root="true" />
<node text="B" id="2" />
<link from="1" to="2" />
<link from="2" to="1" />

The first one preserves the current natural representation of outlines, and uses some sort of placeholder element for nodes that already exist in the file and are re-appearing later. The second one gives up the natural hierarchy in XML and just lists the nodes, the links, and how they correspond; re-assembling them is still fairly easy, though slightly less intuitive.
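To make the first representation concrete, here is a minimal sketch of a loader for it. It is not Iron Lute's actual code: the Node class and the load_nested function are hypothetical names, and the only things taken from the example above are the <node> and <noderef> elements and their attributes. References are resolved after the whole tree has been walked, so a <noderef> may point at a node that appears earlier or later in the file.

```python
import xml.etree.ElementTree as ET


class Node:
    """Hypothetical in-memory outline node (not Iron Lute's real class)."""

    def __init__(self, text):
        self.text = text
        self.children = []


def load_nested(xml_text):
    """Load the nested format, replacing each <noderef> with the node it names."""
    by_id = {}    # file-local id -> Node
    pending = []  # (parent, slot, target id) for references to fix up at the end

    def build(element):
        node = Node(element.get("text"))
        by_id[element.get("id")] = node
        for child in element:
            if child.tag == "noderef":
                # Reserve the slot now; the target may not have been read yet.
                node.children.append(None)
                pending.append((node, len(node.children) - 1, child.get("targetid")))
            else:
                node.children.append(build(child))
        return node

    root = build(ET.fromstring(xml_text))
    for parent, slot, target_id in pending:
        parent.children[slot] = by_id[target_id]
    return root


root = load_nested("""
<node text="A" id="1">
  <node text="B" id="2">
    <noderef targetid="1" />
  </node>
</node>
""")
b = root.children[0]
looped_back = b.children[0] is root  # B's only child is A itself
```

After loading, A -> B -> A is a real loop in memory: following B's child leads back to the same object as the root, not to a copy of it.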

The first example has the benefit of being more human readable... but since I anticipate supporting a lot of formats natively in Iron Lute, and already natively support OPML, for instance, I'm actually not so concerned about "human readability" for the "native" format of Iron Lute. Iron Lute's native format needs to be powerful enough to handle everything anybody can throw at it, and human readability just isn't in the cards for that.

The second format also has an intriguing possibility to it. Right now, I'm assuming that the order matters, so when the first <link> element comes along, the code that is re-assembling the outline(/graph) structure just "knows" that that is the first child of the node with id="1". If we were willing to specify the order explicitly, we get an interesting property:

<node text="A" id="1" root="true" />
<node text="B" id="2" />
<link from="1" to="2" child="0" />
<link from="2" to="1" child="0" />

What's interesting about this list is that the elements can be arbitrarily re-ordered, and the final assembled product (assuming a correctly written assembler) is the same. The above is exactly the same as

<link from="1" to="2" child="0" />
<node text="B" id="2" />
<link from="2" to="1" child="0" />
<node text="A" id="1" root="true" />

and is exactly the same as all the other 22 orderings of those four things.
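The claim is easy to check with a small sketch. The assemble function below is a hypothetical assembler, not Iron Lute's, and the <outline> wrapper element is only an assumption added so that a standard XML parser will accept a fragment with several top-level elements. The assembler records what each element says as it arrives and only builds the child lists at the end, which is what makes the arrival order irrelevant.

```python
import itertools
import xml.etree.ElementTree as ET


def assemble(elements):
    """Build {node id: (text, [child ids in order])} from elements in any order."""
    texts = {}     # node id -> text
    children = {}  # node id -> {child index: target id}
    for element in elements:
        if element.tag == "node":
            texts[element.get("id")] = element.get("text")
        elif element.tag == "link":
            # Either end of the link may not have arrived yet (a forward
            # reference); only ids are recorded here, so that is harmless.
            slots = children.setdefault(element.get("from"), {})
            slots[int(element.get("child"))] = element.get("to")
    outline = {}
    for node_id, text in texts.items():
        slots = children.get(node_id, {})
        outline[node_id] = (text, [slots[i] for i in sorted(slots)])
    return outline


FRAGMENT = """
<node text="A" id="1" root="true" />
<node text="B" id="2" />
<link from="1" to="2" child="0" />
<link from="2" to="1" child="0" />
"""

elements = list(ET.fromstring("<outline>" + FRAGMENT + "</outline>"))
results = [assemble(order) for order in itertools.permutations(elements)]
all_identical = all(result == results[0] for result in results)
```

All 24 orderings of the four elements assemble to the same structure: A with the single child B, and B with the single child A.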

(Actually, in the final Iron Lute file format, the nodes will carry a list of link IDs in them, rather than each link specifying which child it is. The advantage is that once you have a node, you know how many children it has and, by extension, whether it has children at all; putting the child index in the link has only the advantage of looking nicer. Showing the links in the node requires something like <links><link id="4"/><link id="5"/></links>. Since this is a discussion, not code, I show the child indicators in the link for now.)

This intrinsically only has the "advantage" of making the re-assembly process a bit more complicated, since in addition to understanding backward references (references to already-existing nodes), you must now understand forward references (references to nodes or links that haven't arrived yet), but that is a relatively minor point in general, especially as this is code that will not be modified by many people.

What is really interesting are the secondary effects that you can get from playing with the order an outline is saved in. Let us imagine some relatively complicated outline with a few hundred nodes, in which the last child of the root node has no children. Now, imagine what happens as the outliner reads in that outline, specified in format #1 above. The last child of the root node is the last node to be seen.

What this implies is that the outliner can never be sure that it has all the children of even the root node until the entire file is loaded. Therefore, the program really can't display the outline until the entire outline has been loaded. In theoretical terms, this is because the outline is stored using a depth-first traversal. Re-ordering the file is not an option, since the order in the file directly reflects the order in the outline.

But what if we're not tied to that order? What if I use the second style of file, and dump the nodes out in breadth-first order? Then we can know whether or not we've seen all the children of a given node well before the rest of the file has arrived. We could then go ahead and display the outline to the user, and allow it to keep loading in the background.

Breadth-first loading here has the advantage that unless the user immediately dives deeply into the outline, or immediately issues an "Expand All" command, we can most likely allow the user to immediately begin using the outline, even if only a fraction of it is loaded. (Iron Lute already weakens "Expand All" anyhow, because of the possibility of looping paths in the outline.)
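Here is a minimal sketch of what writing the second format in breadth-first order could look like. The dump_breadth_first function and the dictionary layout of the in-memory outline are assumptions made for the example, not Iron Lute's internals. Each node is written together with its own links, so a reader has everything it needs to draw a level before anything deeper shows up, and a "seen" set keeps looping paths from being queued twice.

```python
from collections import deque
from xml.sax.saxutils import quoteattr


def dump_breadth_first(outline, root_id):
    """Yield flat <node>/<link> lines, visiting nodes breadth-first from the root.

    `outline` maps node id -> (text, [child ids]); this layout is hypothetical.
    """
    seen = {root_id}
    queue = deque([root_id])
    while queue:
        node_id = queue.popleft()
        text, child_ids = outline[node_id]
        root_attr = ' root="true"' if node_id == root_id else ""
        yield f"<node text={quoteattr(text)} id={quoteattr(node_id)}{root_attr} />"
        for index, child_id in enumerate(child_ids):
            yield (f"<link from={quoteattr(node_id)} to={quoteattr(child_id)}"
                   f' child="{index}" />')
            if child_id not in seen:  # a looping path must not be queued again
                seen.add(child_id)
                queue.append(child_id)


# The root has children X and Y; X has a child Z; Z links back to the root.
outline = {
    "1": ("root", ["2", "3"]),
    "2": ("X", ["4"]),
    "3": ("Y", []),
    "4": ("Z", ["1"]),
}
lines = list(dump_breadth_first(outline, "1"))
node_lines = [line for line in lines if line.startswith("<node")]
```

In this example Y, the childless last child of the root, is written third, right after X. A depth-first dump would have written it last, after everything under X.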

This appeals to me because the most important thing about a program is not its raw speed, but its responsiveness to human demands. Even if I take ten times longer to read in an outline than some other program (which is not impossible, since they may be written in C and Python, whatever other virtues it has, is not the fastest language), if I display the outline and allow the user to use it ten times sooner, my program looks and feels faster, and will make the user happier. This isn't an issue for most normal-sized outlines stored on disk, but if we want to pull outlines from a remote source over a slow link, this can become a huge advantage.

In the ultimate case, I visualize someday having full Co-Outlining implemented in Iron Lute, with Iron Lute connected to another instance of Iron Lute. Even at dial-up speeds, connecting to another Iron Lute outline would allow you to start manipulating the outline within seconds of the request to connect, even if the outline you've connected to is many megabytes in size (as long as it doesn't have too many top-level nodes, which you will avoid if you want dial-up users to use your outline). If you do open something that isn't loaded yet, your Iron Lute could send a request to the remote Iron Lute to start loading those nodes first, minimizing the time you wait before you can manipulate them.

Another interesting but subtle property of this file format is that it turns the file into a series of commands. In the above example, it is better to think of the <link> and <node> elements as commands to "add a link" or "add a node", rather than trying to puzzle them out as some kind of structure. Once you start thinking of it that way, you can add a new node with the same ID later in the file and interpret that as "replace the old node with this one", or you can add simple delete commands to remove old nodes or links.
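A minimal sketch of that command-stream reading follows. The replay function is hypothetical, and so are the <delnode> and <dellink> element names and the <outline> wrapper; the text above only says that simple delete commands could exist, not what they would be called. A later <node> with an ID that is already known simply overwrites the earlier one.

```python
import xml.etree.ElementTree as ET


def replay(stream_xml):
    """Apply each element as a command, in file order, and return the final state."""
    nodes = {}  # node id -> text
    links = {}  # (from id, child index) -> to id
    for command in ET.fromstring(stream_xml):
        if command.tag == "node":
            # "Add a node", or "replace the old node" if the id is already known.
            nodes[command.get("id")] = command.get("text")
        elif command.tag == "link":
            key = (command.get("from"), int(command.get("child")))
            links[key] = command.get("to")
        elif command.tag == "delnode":  # hypothetical delete command
            nodes.pop(command.get("id"), None)
        elif command.tag == "dellink":  # hypothetical delete command
            links.pop((command.get("from"), int(command.get("child"))), None)
    return nodes, links


STREAM = """
<outline>
  <node text="A" id="1" root="true" />
  <node text="B" id="2" />
  <link from="1" to="2" child="0" />
  <link from="2" to="1" child="0" />
  <node text="B, edited" id="2" />
  <dellink from="2" child="0" />
</outline>
"""

nodes, links = replay(STREAM)
```

One caveat with this reading: once replacements and deletions are in the stream, the relative order of commands that touch the same node or link matters, because a replacement has to come after the thing it replaces. The sketch also leaves a deleted node's links in place; a real implementation would have to decide what happens to them.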

Why is this interesting? For one thing, I intend to use this as Iron Lute's emergency backup technique; Iron Lute can always append to the file rather than re-write it from scratch, so you won't get that massive pause every time the program does an automatic backup, like I've seen in Microsoft Word. The backup time is proportional to the changes you made since the last save, not the size of the document itself. (Every once in a while, Iron Lute might save the file from scratch to avoid filling the disk, but a human can't generate events very quickly, so in the common case it would be a long time before the user used a significant proportion of the disk.)
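The append-only idea can be sketched in a few lines. Everything here is an assumption made for illustration: the function names, the file name, one command per line, and a file with no enclosing root element (so that nothing needs to be re-written to add to the end). The loader wraps the contents in a throwaway <stream> element only so a standard XML parser will accept them.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def append_commands(path, commands):
    """Append commands to the end of the file without touching what is already there."""
    with open(path, "a", encoding="utf-8", newline="\n") as f:
        for command in commands:
            f.write(command + "\n")


def load_commands(path):
    """Read every command back, wrapping the rootless stream for the XML parser."""
    with open(path, encoding="utf-8") as f:
        return list(ET.fromstring("<stream>" + f.read() + "</stream>"))


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "backup.xml")  # hypothetical file name
    append_commands(path, [
        '<node text="A" id="1" root="true" />',
        '<node text="B" id="2" />',
        '<link from="1" to="2" child="0" />',
    ])
    size_after_first_save = os.path.getsize(path)

    # The user edits one node; the backup writes only that one command.
    edit = '<node text="B, edited" id="2" />'
    append_commands(path, [edit])
    bytes_written = os.path.getsize(path) - size_after_first_save
    commands = load_commands(path)
```

The second save writes only the one edited node plus a newline, no matter how large the rest of the file is. The occasional save from scratch would amount to replaying the commands and writing out only the final state.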

Another thing I'm not sure about yet, but... I'm thinking that the co-outlining will be implementable as exactly the same stream as the file storage system uses; as long as Iron Lute understands "stream still in progress" and has some system for generating stream updates, it should work. A central participant in the co-outlining will be responsible for collating the incoming events, and for resolving resulting conflicts (two people modifying the same node at roughly the same time, one person modifying a node the other has deleted, etc.), and I think that's nearly all that will be necessary for a first-cut, but functional, outline sharing. This is one design goal I haven't talked about but it is affecting my decisions; I don't want to lock myself out of this possibility inadvertently.

My current plans for Iron Lute are to get this file saving working correctly, then swing back to the GUI and make it usable as at least a simple outliner. I would then have a project that could theoretically be released at any time, and the decision about when to release it would be driven by whether or not it did anything useful yet, and how much I would mind having others build on top of it. It is my guesstimate that completing the file saving/loading and GUI work I want to do will take at least three months, so we'll see where to go from there.