This week I want to talk about a somewhat controversial aspect of the Assimilation database schema, one for which the jury is still out.
I chose to represent the Assimilation node type hierarchy with relationships which currently serve no purpose other than to represent the types of nodes in the database. This post will talk about why I put the type hierarchy in, and why it might be a good idea, or maybe not.
Originally, I had not put the type hierarchy in, as I saw no reason for it - each node is already labelled with its type through a field called nodetype, and most of my nodes are held in separate indexes, one per nodetype. For example, Drones are indexed in a Drone table keyed by name (or designation, if you prefer), NICs in a NIC table keyed by MAC address, IP addresses in a table keyed by canonical-form IP address, and so on.
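Here is a toy sketch of that arrangement in plain Python. It is not the CMA's code and it does not use the Neo4j API: the dicts stand in for the per-type indexes, and the helper names (`add_node`, `canonical_ip`), the sample keys, and the use of Python's `ipaddress` module for the canonical form are all assumptions made for illustration.

```python
import ipaddress

# Toy stand-in for the per-type indexes: one dict per nodetype,
# each keyed by that type's natural key.
indexes = {"Drone": {}, "NIC": {}, "IPaddr": {}}

def canonical_ip(text):
    """Assumed canonical form: whatever Python's ipaddress module prints."""
    return str(ipaddress.ip_address(text))

def add_node(nodetype, key, **attrs):
    """Create a node carrying its own nodetype field and index it by key."""
    node = {"nodetype": nodetype, **attrs}
    indexes[nodetype][key] = node
    return node

add_node("Drone", "drone-01", designation="drone-01")
add_node("NIC", "00:1b:21:aa:bb:cc", macaddr="00:1b:21:aa:bb:cc")
ip_key = canonical_ip("2001:0DB8:0000:0000:0000:0000:0000:0001")
add_node("IPaddr", ip_key, ipaddr=ip_key)

# Lookups go straight to the right index; no type hierarchy is involved.
nic = indexes["NIC"]["00:1b:21:aa:bb:cc"]
ip = indexes["IPaddr"][canonical_ip("2001:db8::1")]
```

Every lookup here already knows which index to consult, which is why the nodetype field plus the indexes looked sufficient on their own.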
Given the nodetype field and all these indexes, and the fact that my code doesn't use the type hierarchy relationships, why would I do this? The first answer is not very compelling - I saw some other Neo4j applications which did it, and since I was (and still am) a Neo4j newbie who is still learning to think in graphs, I figured it must be a good idea - and maybe I would learn something along the way.
Below you'll see an example of what this looks like with one node of each type relating to its nodetype node. Each of the objects (NIC, IPaddr, Switch, Drone, etc) has an IS_A relationship with its nodetype. Each nodetype also has an IS_A relationship with node zero – sometimes called the reference node in Neo4j.
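The same shape can be sketched in a few lines of plain Python. This is only a toy in-memory model, not the CMA's code or the Neo4j API; the `Node` class, the `make` and `is_a_chain` helpers, and the sample MAC address are hypothetical.

```python
class Node:
    """Minimal graph node: a nodetype field, a name, and outgoing relationships."""
    def __init__(self, nodetype, name):
        self.nodetype = nodetype
        self.name = name
        self.rels = []  # list of (relationship type, target node)

    def relate(self, reltype, target):
        self.rels.append((reltype, target))

# Node zero (the reference node) sits at the top of the hierarchy.
node_zero = Node("reference", "node zero")

# One nodetype node per type, each with an IS_A relationship to node zero.
type_nodes = {}
for typename in ("Drone", "NIC", "IPaddr", "Switch"):
    type_node = Node("nodetype", typename)
    type_node.relate("IS_A", node_zero)
    type_nodes[typename] = type_node

def make(nodetype, name):
    """Create an object node and give it an IS_A relationship to its nodetype node."""
    node = Node(nodetype, name)
    node.relate("IS_A", type_nodes[nodetype])
    return node

def is_a_chain(node):
    """Follow IS_A relationships upward and return the names visited."""
    chain = [node.name]
    while True:
        parents = [target for (reltype, target) in node.rels if reltype == "IS_A"]
        if not parents:
            return chain
        node = parents[0]
        chain.append(node.name)

nic = make("NIC", "00:1b:21:aa:bb:cc")
print(is_a_chain(nic))
```

Note that the IS_A relationship to the nodetype node carries the same information as the `nodetype` field the node already has.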
As you can see, this seems a bit redundant - after all, I have the indexes of everything anyway. My original reason for putting these in is simple: debugging.
Neo4j has a shell, and the way the shell operates, it's easy to follow relationships to see how the program is creating the nodes and relationships - if you start from node zero and go from there. This works quite nicely and makes it easy to see what the CMA did. The Neo4j shell is somewhat reminiscent of a normal UNIX-style shell, with each node in the graph treated analogously to a directory or folder. So if I've added new relationships or node types, I can conveniently find any type of node by starting at node zero.
There's another reason as well. Hypothetically, if I wanted to look for some attribute of a node that there was no index for, the entire query would run more quickly (and be easier to write) if I did it all in terms of relationships rather than walking through the indexes first. So, I can imagine it being useful in the future - but right now, I'm paying the cost of creating all these IS_A relationships, which I currently use primarily for debugging.
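To make that hypothetical concrete, here is a toy sketch in plain Python of finding nodes by an unindexed attribute purely by following IS_A relationships down from node zero. It is not Neo4j code and says nothing about real query performance; the node ids, the `find_by_attr` helper, and the `mtu` attribute are all made up for illustration.

```python
from collections import defaultdict

props = {}                    # node id -> property dict
incoming = defaultdict(list)  # target id -> [(relationship type, source id)]

def add_node(node_id, **properties):
    props[node_id] = properties

def relate(src, reltype, dst):
    """Record a relationship src -[reltype]-> dst so it can be walked from dst."""
    incoming[dst].append((reltype, src))

def instances_of(node_id):
    """Ids of all nodes that have an IS_A relationship to node_id."""
    return [src for (reltype, src) in incoming[node_id] if reltype == "IS_A"]

def find_by_attr(typename, attr, value):
    """Walk node zero -> nodetype node -> instances, filtering on any attribute."""
    for type_id in instances_of(0):
        if props[type_id].get("name") == typename:
            return [n for n in instances_of(type_id) if props[n].get(attr) == value]
    return []

# Node zero, two nodetype nodes, and a few instances.
add_node(0, name="node zero")
add_node(1, name="NIC")
relate(1, "IS_A", 0)
add_node(2, name="Drone")
relate(2, "IS_A", 0)
add_node(10, nodetype="NIC", macaddr="00:1b:21:aa:bb:cc", mtu=1500)
relate(10, "IS_A", 1)
add_node(11, nodetype="NIC", macaddr="00:1b:21:aa:bb:cd", mtu=9000)
relate(11, "IS_A", 1)
add_node(20, nodetype="Drone", designation="drone-01")
relate(20, "IS_A", 2)

# No index on mtu exists, yet the NICs are reachable through the hierarchy.
jumbo_nics = find_by_attr("NIC", "mtu", 9000)
```

The only entry point the search needs is node zero; everything else is reached by relationships, which is the property that might pay off later.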
So, I get some debugging advantage, but I pay a price. In the future, I might get some performance advantages from it, but that day hasn't come yet - and may never come. So, is it a good idea? At this point in time, I'm not yet concerned about performance, and the debugging simplicity I get makes it worthwhile to me - for now. If I want to remove the IS_A relationships later, they're all created in one place, so it would be easy to change.
I know that some of the people who read this blog know a lot more about Neo4j than I do. So, what do you think? Does this seem reasonable to you?