Failure of Simple Tree Matching

2003-04-14

One major outcome of this project is the discovery that simple tree matching algorithms don't work.

I tried an algorithm that took the three best nodes from a tree and compared them against a database of the three best nodes from each of the collected text works. The expectation was that the numerical value of each sense represents meaning spread out across many words, and so would to some degree represent the "high points" of the text snippet.

By simply matching the senses and their parents (with a bit of a decay factor), the hope was that similar topics of discussion could be bundled together.
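As a rough illustration of the kind of matching described above, here is a minimal sketch. It is not the project's actual code. The toy hierarchy PARENT, the DECAY value of 0.5, the reading of "best" as "highest-valued", the product-sum score, and all function names are assumptions made for the example.

```python
from typing import Dict, Optional

# Hypothetical toy hierarchy: each sense maps to its parent (None at the root).
PARENT: Dict[str, Optional[str]] = {
    "thing": None,
    "animal": "thing",
    "dog": "animal",
    "cat": "animal",
    "vehicle": "thing",
    "car": "vehicle",
}

DECAY = 0.5  # assumed decay applied at each step up to a parent


def top_senses(weights: Dict[str, float], k: int = 3) -> Dict[str, float]:
    """Keep the k highest-valued senses of a snippet."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])


def expand(senses: Dict[str, float]) -> Dict[str, float]:
    """Credit each sense and all of its ancestors, decaying per level."""
    out: Dict[str, float] = {}
    for sense, weight in senses.items():
        node: Optional[str] = sense
        while node is not None:
            out[node] = out.get(node, 0.0) + weight
            weight *= DECAY
            node = PARENT.get(node)
    return out


def match_score(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Sum the products of weights over senses both snippets share."""
    ea, eb = expand(top_senses(a)), expand(top_senses(b))
    return sum(ea[s] * eb[s] for s in ea.keys() & eb.keys())


# Made-up snippets, each given as sense -> numerical value.
pets = {"dog": 1.0, "cat": 0.8}
more_pets = {"cat": 1.0}
traffic = {"car": 1.0}

related = match_score(pets, more_pets)
unrelated = match_score(pets, traffic)
```

In this toy setup the two pet snippets score higher than the pet and traffic pair, but the unrelated pair still scores above zero, purely through the shared root sense "thing".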

This did not work well. One problem was that text with very few identifiable nouns tended to come out as the best match for a given text snippet, despite having no value as a match. Another was that common nouns tended to dominate the matching, as if everything were matching on "thing", which made the resulting matches almost random.

The root problem seems to be that average text snippets have too much in common with one another: the commonality dominates the differences, and it is the differences we are actually interested in.

To get around this, we hope to use a better matching methodology that matches on whole trees. Unfortunately, on the timescale of this project, this will preclude frequent searching, as matching a tree against 1000 other trees will be time-consuming.