What Is An Outline, part 2

In my previous post, I discussed what I believe is the outline data structure in current use by outliners, and why it is inadequate for my purposes. I discussed why transclusion doesn't work very well under that model (an algorithm can sort of hack transclusion on top of the tree, but it's not "real" enough), and the difficulties of replacing some of the tree concepts when you have a full graph. Today's post covers the "transclusion" issue.

Up to this point, the programming model I'm using has not mattered; almost anything can handle a simple tree structure. It's not too difficult to manage trees even if all you have is a flat array, as in old Basic. But as the data structure gets more complex, the question of how you represent it starts to matter. I refer you to my Bowers' Law essay for a justification of why we should assume OO from now on.

I will assume that you now agree enough with the ideas expressed in that essay that I can fruitfully discuss how the "outline data structure" will "behave": we take it as given that the outline data structure is encapsulated and that all outline programmers correctly use the encapsulated access. As it happens, I'm writing this in Python, so we get most of the standard OO capabilities like classes and inheritance, but that is not, strictly speaking, necessary.

Tree Class Model

The initial class diagram for the Tree model of outlines is simple: A Node class, which defines a Node that may include other Nodes, and optionally, a Document class which represents the top of the document and which may perform such extra duties as tracking the "title" of the document. Nodes also hold their data, their "text" and other attributes. Node Types in this case are now trivially represented as different classes of nodes in the tree; this works so naturally that there isn't any need to further discuss how to do node types in the internal representation right now. (Note that Radio Userland does an OO-based implementation of its node types in the procedural language, doing the virtual lookup table by hand without support from the language. This mostly works but you do lose the advantages of being truly OO, with language support for inheritance and such. Such things are livable and you can always hack in the support you need; it's just a manual and hence error-prone process, rather than something automatic.)
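To make this concrete, here is a minimal Python sketch of the tree model. The class and attribute names are mine, chosen for illustration; they are not Iron Lute's actual code.

  class Node:
      """A node that holds its own data and may contain other Nodes."""
      def __init__(self, text="", **attributes):
          self.text = text
          self.attributes = attributes
          self.parent = None
          self.children = []

      def addChild(self, node):
          # Attach a child and record the back-link to the parent.
          node.parent = self
          self.children.append(node)

  class Document(Node):
      """The top of the document; performs extra duties like tracking
      the document's title."""
      def __init__(self, title="", **attributes):
          super().__init__(text=title, **attributes)
          self.title = title

A specialized node type is then just a subclass of Node that overrides whatever behavior it needs, with full language support for inheritance.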

The problem with this data structure is simply that it is very rigid; either the outline is in memory or it is not. You can try to kludge around this by dynamically loading a document when you need it, but that's fairly coarse-grained loading support, and it does nothing for loading outline information from other sources. We'd like to load on a node-by-node basis, for the obvious reason that any node in an OPML file may be a link, an RSS file, or some other currently-unknown node type that involves pulling from external data. That said, there are three basic options:

Design Decision 2: Special Nodes or "Data Sources"?

Issue: How do we handle the idea of "nodes that load data from a remote source"?

Options:

  • Create special nodes that load their data remotely when the data is needed.
  • Break the concept of node into two pieces: One to represent the structure, one to represent the source of the node's data.
  • Ignore the problem and shift it to other programmers, leaving them to do their own loading.

Cons of shifting the problem to client programmers: Two programmers are liable to walk on each other's concept of "loading data". Or, looked at from the other direction, without a standardized concept of "loading", Iron Lute itself can't understand how to load nodes, meaning I can't do anything interesting with it in the GUI. Thread safety cannot be enforced. The outline data structure can't know anything about what other programmers will do. (This is as opposed to supporting it in the outline code itself, where we can define things like "Will loading this data require loading another file?")

Pros of shifting the problem to client programmers: There are no pros. OK, well, it would be easier to implement the outline structure without worrying about this, but even before I ask other programmers to use the outline structure, I'll have already paid for it in additional complexity in internal Iron Lute code!

Pros of creating special node types: This most closely matches how the feature appears to work from the user's point of view. It's a fairly natural choice.

Cons of creating special node types: Imagine a "transclude" node before and after it is activated. If we have two special node types for that case, then after activation we need to take the old "OPMLTransclude" node and completely replace it with an "OPMLTranscludedDocument" node; removing one node and adding another fires all sorts of events at the GUI when there's no need for it. It also requires writing code for every kind of special loading node; it would be much cleaner if any node could load remotely without breaking the data structure. (See discussion in later sections.) Of course, we could create one node type that constantly checks whether it has loaded the remote data yet, but that's very kludgey and we still fail to obtain the benefits in the next paragraph.

Pros of creating a separate "data source" concept: It makes sense to split the "structural" aspect of the node from the "data" aspect of the node. It makes "transclusion" simple: just replace the data source, not the whole node. Any node type can be loaded from any defined source type, which avoids having to write so much code.

Cons of creating a separate "data source" concept: More parts. More memory consumption. More indirection. Most of these are marginal concerns, though; I'm already assuming that at any given moment, the data loaded into actual in-memory nodes will be significantly smaller than the memory of a modern computer (even if the "virtual outline" is massively larger).

Discussion: Shifting the problem out of the outline data structure is not even remotely feasible; at least some critical aspects of the problem must be solved here in a principled manner or we'll have no way to reason about the behavior of the nodes in Iron Lute.

The real deciding point between the other two choices is that where the data is loaded from is orthogonal to how the data is represented in memory. OPML transclusion is only defined by convention for HTTP, but it trivially extends to FTP, RPC calls returning the string for one document, and a wide variety of other sources. Moreover, a lot of data other than OPML data can come via those methods, so it's best to be able to reuse "FTP loading code" for other node types as well.

Writing special node types means needing to write (node type count) * (data source count) different nodes for full coverage. Separating the node from where its data comes from means writing (node type count) + (data source count) chunks of code. With, say, five node types and four data sources, that's twenty special classes versus nine reusable pieces. This is a big win, quickly.

As we'll see, this not only fixes the problem at hand but brings many unanticipated benefits as well.

Data Sources

We will go ahead and split the node into two pieces. One piece keeps track of incoming links, which we assume are always known, and is responsible for providing the methods that implement the particular node type. (Since we're still in a tree model, the "incoming link" is just a link back to the parent.) The other piece is responsible for providing and mediating access to the data the node represents, typically the node's text and attributes. It also provides access to the child nodes, because at the moment you ask for the children, they may not be loaded yet.

Since I believe in making programming as easy as possible, the node also provides a "facade" around the data source so that if you don't care about the details of a data source, you don't need to worry about where the data is coming from. For instance, if a is a node, a[0] will return the first child of the node.
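In Python, that split might look something like the following sketch. DataSource, InMemorySource, and the method names here are hypothetical names of my own for illustration, not Iron Lute's actual API.

  class DataSource:
      """Mediates access to the data a node represents: its text,
      attributes, and children."""
      def getText(self):
          raise NotImplementedError
      def getChildren(self):
          raise NotImplementedError

  class InMemorySource(DataSource):
      """The ordinary case: data that is simply sitting in memory."""
      def __init__(self, text="", children=None):
          self._text = text
          self._children = children if children is not None else []
      def getText(self):
          return self._text
      def getChildren(self):
          return self._children

  class Node:
      """Tracks the incoming link (just the parent, in a tree) and
      implements the node type's behavior; all data access is
      delegated to the data source."""
      def __init__(self, source, parent=None):
          self.source = source
          self.parent = parent
      def getText(self):
          return self.source.getText()
      def __getitem__(self, index):
          # The facade: a[0] returns the first child without the
          # caller caring where that child actually came from.
          return self.source.getChildren()[index]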

What does this additional "complication" gain us? Well, compare OPML transclusion under the "monolithic node" model and this model. Transclusion isn't precisely defined in the OPML spec, but is generally accepted as a "'link' type node with a url ending in '.opml' that leads to an OPML file". In the procedural model with monolithic nodes, it might work like this:

When the user tries to expand the node, do the following if it's a link node:
  1. Go load the remote file.
  2. Parse it into outline nodes.
  3. Add the outline nodes into the current outline.
  4. (Optional: Drop a marker that says these nodes don't belong to this document; if we re-save the parent document, we want to save the link, not the nodes from the foreign file. Depending on the internal implementation, the 'link' node itself may be the marker.)
  5. Return the user to their outline view with the new nodes.

There are two major problems with this structure:

  1. The very first line: it gets too tied up with "what the user did". The real trigger should be "when the data is accessed" for any reason, including user interaction. A procedural model must have hooks for data access explicitly added, and programmer-users must be sure to access the data only via those hooks. With OO, we can better enforce that users go through our access routines. The last part of the trigger, "if it's a link node", is also a problem; having to explicitly check for a link node causes trouble as soon as some later programmer forgets.
  2. Loading in the nodes and dropping them into place is a bit of a kludge; it isn't well reflected in the data model. If you ask a link node how many children it has before loading the node, it'll say "0". After it loads, it'll say at least 1.

Compare this with a dedicated OPML data source:

When we initially load the source OPML file, we notice the "link" node type. We create an "OPML Link" node that understands the OPML link type, including how to set the destination of the link.

For what it's worth, that's almost an exact transliteration of the OPMLLink class in the currently existing Iron Lute; it also opens a browser if you expand a non-OPML link node, which is the accepted response to that action. The OPML Link class is only about 60 lines, and most of that is spent munging the URL, managing the document's name, and such.

Continuing:

When any part of the program tries to access the children or other data of the node:
  1. Load in the remote OPML file.
  2. Parse it into outline nodes.
  3. Replace the original data source with the standard "in memory" data source containing those nodes. (An "in memory" data source is the default data source you'd normally expect, populated with real values from the original file.)
  4. Return the appropriate value to the original call.
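Continuing the hypothetical sketch from above, the whole lazy-load-and-swap dance fits in a few lines. OPMLLinkSource and parseOPML are again my names, and a real implementation would need error handling:

  from urllib.request import urlopen

  class OPMLLinkSource(DataSource):
      """Stands in for a remote OPML file until something asks for
      the data."""
      def __init__(self, node, url):
          self.node = node
          self.url = url

      def _load(self):
          # Steps 1 and 2: load the remote file and parse it into nodes.
          children = parseOPML(urlopen(self.url).read())  # hypothetical parser
          # Step 3: replace ourselves with a plain in-memory source.
          self.node.source = InMemorySource(children=children)
          return self.node.source

      def getChildren(self):
          # Step 4: answer the original call from the loaded data.
          return self._load().getChildren()

After the first access, the link source has vanished entirely; the node behaves exactly like any other in-memory node.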

This structure removes both problems I mentioned in the "standard procedural" case. Any part of the program, user interface or otherwise, that needs the data can get it.

(Parenthetical: A slight culture change is needed in the programmers; you can no longer totally predict how long some operations like "counting children" will take. As a practical matter, I provide a lot of relatively fine-grained query methods on the nodes; for instance, there's a separate "hasChildren" vs. "countChildren". The OPML Link node assumes that it has children, so it can answer "True" to a "hasChildren" query without loading the file, whereas to get a true count you can't help but load the file. Also, some operations may eventually take a parameter that says "Don't load anything to answer this question", for callers who don't care enough to invoke a load. In practice, this seems to be working out even better than I expected, so I'm not really concerned by this issue.)
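In the sketch's vocabulary, those two queries would be two more methods on the hypothetical OPMLLinkSource:

  class OPMLLinkSource(DataSource):
      # __init__ and _load as in the earlier sketch, plus:

      def hasChildren(self):
          # A transcluded file presumably has content; answering True
          # requires no load at all.
          return True

      def countChildren(self):
          # A true count can't help but load the remote file.
          return len(self._load().getChildren())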

The design also provides a clean data representation of the idea of "a file loaded from a remote source". The loaded file will have a full Document node on top, and all of the internal nodes can know they aren't part of the original document. The change doesn't actually disturb anything else, since we give up on the idea of tracking child count at all times.

So far this is merely elegant, but there's more. The quantitative change of "easier to extend" (because the data model itself mediates the data access, rather than asking you to carefully check all cases) becomes a qualitative change, as we can now make anything a data source for a node. We can theoretically load RSS files, files off the disk, computed values from some remote Internet server, and, basically, any value you can pull into a computer program can become a data source, with a read and write interface, if you're willing to write both parts of the data source code. Every node can have a different data source if it likes.
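For instance, a data source backed by a file on disk could be as small as this sketch (hypothetical names as before, and missing the error handling a real one would need):

  class FileSource(DataSource):
      """Backs a node's text with a file on disk, read and write."""
      def __init__(self, path):
          self.path = path
      def getText(self):
          with open(self.path) as f:
              return f.read()
      def setText(self, text):
          with open(self.path, "w") as f:
              f.write(text)
      def getChildren(self):
          return []  # a plain file contributes no outline children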

And the real benefit is that this is transparent to client programs, such as the User Interface or scripts. You don't need to check to see whether the given node is "normal", or an "OPML link", or some node you've never heard of. You just use the node normally, and it all works.

The upshot is that instead of a concrete outline structure, we get a dynamic outline structure where various pieces may be yanked from any source we need, or even compiled from a "non-outline" source. Virtually, the entire set of outlines reachable from a given outline is all "there", springing transparently into existence when manipulated... the "outline web" is just an index away: opmlLinkNode[0]. Pretty cool.

I write this as if I knew all of these things in advance, before deciding to use this "Data Source" concept. In reality, I adopted it because I needed it for the transclusion reason initially given; the rest I discovered over the next couple of weeks. When your design works out like that, it's usually a good sign that you're on the right track.

This is just one of several improvements we need to make before we have a truly flexible outlining arrangement, though. In the next post, we shall explore how to make the outlining structure itself more flexible, while still retaining the outline spirit.