Something's rotten in the state of [web] Devmark...

2007-01-03

I've got a Django review in the pipe, and it's generally positive. It's still cooking both so I can gather more experience, and while some fact-checking occurs.

But there's something still missing that I can't put my finger on. I think cutting away even more of the general cruft of making web apps is bringing it out. I've been programming on the web for nearly ten years now, and something's not right. Despite the fact that I have gotten generally good at programming, and I keep refactoring and refactoring, I keep writing the same web page over and over again: I've got a tree of heterogeneous objects that I need to render, which itself a view of a graph-like structure, I allow the user to manipulate it somehow, and I have to propagate those changes back to the database. And for some damn reason, no matter what I do, there's always something super-special about this form that requires special treatment and requires further extension of the frameworks or libraries I'm using.

(Note this programming post meanders a lot; part of the point is that I don't entirely know where I'm going with this.)

I'm not entirely certain what the problem is, or I'd talk about it directly. I have identified some distinct sub-problems, which may or may not relate to a larger whole.

One subproblem is that relational databases as they currently exist have some serious weaknesses that are so deeply embedded in our conception of what a database is that we can't hardly seem to see them anymore. Some of these problems I can describe concretely, some I can clearly see but would be hard to describe concretely, and some I have only a vague sense of.

One of the concrete problems that I can describe, and that I keep hitting, is that we can't seem to store graphs or trees worth squat with current technology. A data structure is what it does. A "tree" is something where you can easily say "give me all the children of this node", and there are certain guarantees (like no cycles). A directed graph is a structure where you can for instance say "Give me the entire sub-graph that you can reach from this node". In both cases, you can further specify limits about what things are explored.

We can store things in so-called "relational" databases that are topologically equivalent to trees and graphs, in that you can point to the nodes and the arcs, but you can't query them like a tree or a graph. Therefore, even though they topologically look like trees or graphs, in a very important way, they aren't trees or graphs. There is, as far as I know, no real one-table or two-table solution to storing even a homogeneous tree or graph in this sense. I've looked online and spent some time thinking about it. There are some hacks, but they all have far, far worse performance than they should and/or they still cover only very specific use cases, nowhere near every reasonable operation you might want to run on a tree.

One solution is to store the full path to the root in the node as a freaking string containing the node ids of the parents, which along with being stupid and slow, makes even simple operations like moving one node a pain, requiring you to re-write the ancestor list of every node that is a child of the moved node.

The other basic solution involves loading the entire tree into memory and running a clever labeling routine over it while allows you to quickly query for the children of a given node, but really, not much else.

I've examined this solution and while I haven't quite laid it out with mathematical rigor, I've convinced myself there's no way to use simple numbers as the index, avoid the need to recompute the labels on every tree change, and not run out of numbers really quickly in the general case. (You could do it trivially with actual-factual real numbers, as in "a member of ℝ", but computers don't have reals.)

With both solutions, you can filter out nodes based on standard SQL criteria but there's still no way to say "If you come across a node with a label starting with 'A', stop descending.", because there's no actual descent occurring. It's a hack that gives you 15% of a tree, and not even the most useful 15%, just the 15% that happened to be implementable.

And graphs are even more hopeless.

Interestingly, I happened to get lucky on programming.reddit.com and find this post which led to this story about the SQL standard WITH RECURSIVE query, which does exactly what I want, and if I'm reading this correctly, does it exactly how I want it. Apparently this is a little-known feature even among database mavens; this is the first I'd heard about it despite it being in the ANSI standard since "the early 1990s", according to the article. My guess is that few enough people really "get" recursion that nobody really recognizes it will solve their problem, and therefore nobody asks for it, uses it, or talks about it. (This is what I get for not reading the standard and instead reading books and online tutorials to learn SQL.)

Relational databases, which are what nearly all "real" websites back to, are really, really useful. Having the ability query and join and aggregate built into everything you use is much better than trying to design it on your own. Things like ACID and handling concurrency somewhat reasonably and having several distinct implementations are really, truly good, and that's probably why no technology has unseated them. But I think the limitations of SQL have held back the database community for long enough. It's time to survey the amazingly complicated stored procedures people build, look for common patterns that appear over and over, and look for places where otherwise well-structured code was forced to bend around SQL, and start fixing these problems. The inability to store graphs and trees is, as I said, merely one of the more concrete problems, and I bet fixing that would fix a lot of other fundamental SQL problems as well.

I haven't got much experience with object databases. Maybe they solve this problem. But I get the sense that if they were any good they'd be much more popular by now. I think that they are designed to solve the OO-Relational impedance mismatch, and to attack the heterogeneous object problem, but not the inability to store and query graphs. Other techs exist to store graphs but I don't know of any mature enough to trust on my personal website, let alone recommend to a commercial user.

Another aspect of the problem is that we seem to be doing the same thing over and over, but it's always just different enough to require some exception or other, often a major one.

I and many other people have taken a crack at solving this problem directly, by trying to create a declarative language where you say what you have and what you want to do, and the framework tries to do the rest. This approach fails because there is a fundamental contradiction: In order for the framework to have any value, it needs to be simpler than writing real code. In order to be simpler, it must have a smaller space of problems it can cleanly solve. In the real world, the problems are very complicated and you basically need the full programming power of a real programming language, and the last thing you want in your declarative model is a full programming language. Odds are that in the end you end up with this buggy language that isn't anywhere near as capable as the base language and just gets in your way.

(See also my post Configuration Files Should Be Programs. Using the full, real programming language to make your declarations can have a great bang-for-the-buck.)

Declarative languages aren't the way forward. Metadata can still be darned useful, but it's never enough.

Another problem that has triggered this thought-train: Caching and the way Django does "views".

Django templating system does not allow arbitrary programs to run within a template. As a result, you are forced to declare up-front exactly what values will be passed into a template. This has the effect of completely separating your views from the templates.

This up-front declaration of all data you need, in function form, is making me wonder if we can push this any further. Right now there are two major problems with the web development I've done (not just Django) that this might impact: Query flurries, and caching.

What I refer to as "query flurries" arise when you use an ORM to deal with rows in your table. The abstraction/encapsulation benefits of an ORM are difficult to say no to, especially when you can guarantee that all access to the database will go through the same ORM implementation, but there are certain problems that arise too. If you have a table row A that relates to some Bs that relates to some Cs, it's easy to get all the Bs in one shot: A.get_all_B(). But if you write the natural loop:

for b in a.get_all_B():
    for c in b.get_all_C():
        # do something

You're now getting one query per C. Since queries are usually implemented synchronously, the overhead of a query can be significant, as even a simple query retrieving a small value from a database hosted on the local machine still has to go through the full RPC process being used (local pipe, local socket, whatever) synchronously in both directions. That starts to add up. Add another couple of relationship layers and some decently sized result sets and it's easy to end up with arbitrarily many queries being fired (think "tens of thousands" here) for something that could have been done in a handful of simple queries; in this case, you could have followed the relationships back the other way, by asking for all the Cs that relate to A through a B.

Rewriting it in this case is simple. But sometimes it's not so simple, and right now at least I'm doing this reversal by hand every single time. It seems like there ought to be a way to more fully automate this without also requiring the user to write a query in advance that says exactly what objects will be used.

The other problem I mentioned is caching. When I fully render an entry, such as on the page you are reading now, I have the opportunity to cache everything related to the entry, such as the rendered text, the rendered comments, etc. When any comment is added or changed (you can not currently edit your comments but I plan to add that ability later), I'd really like it if there was a clean way for that change to propagate back up to the cached entry so it can have its cached value wiped out. Right now there's nothing stopping me from doing that except the fact that I have to write the "resolving" routine to get back to the entry manually.

What do the query flurries and this desire to propagate changes backward have in common? Why, it's our good friend the graph from a few paragraphs ago. That "natural loop" is the traversal of a heterogeneous graph. This time I can't really blame the SQL language directly, because if you poke it right you can write a single query that only grabs the correct values; this time the problem lies somewhere else. Maybe the eager evaluation of relationship queries? And the caching problem is even trickier; the root problem is that I need to be able to overlay at least some aspects of a graph on top of my database that is somewhat independent of the actual relationships in the database. For instance, suppose I manage to cache the sidebar, which has things like my blogroll. That has no relationship to my entries at all in the database. Yet if I have cached my sidebar, and I am trying to cache entire pages, I need to invalidate all of the pages, even though in database terms the sidebar just floats off in its own little table, independent of anything else.

(This reminds me of another limitation of SQL I'd like to see lifted: I'd like to be able to return heterogeneous rows, so I can easily query on one table, then easily query all related values in another table, and do this in one query and get two distinct kinds of rows back, without the cartesian product resulting from a join slowing everything down. In theory the DB communication protocol could mitigate this at the wire protocol level, although I don't know if they do, but even so, my target language is going to be creating a lot of extra data it shouldn't have to create.)

I started writing this out because I have found nothing really makes you think through something like writing it out for public consumption, and now I begin to see that maybe there is a pattern here: There are relationships (not in the database sense; if anything the database is fighting this) that are not being captured by any web development platform I've ever seen, and ways we could exploit those relationships that we could be taking advantage of. I've already got the beginnings of some code for this laid out. (It's been difficult to wrap my head around exactly what I'm looking for; I've been struggling with it for the past few days in my spare time. English allows you to sort of babble, but a computer want something a bit more precise....) I guess I'll just see where it goes. Django is as good a base as any to build this experiment on top of.