This week I want to talk about an aspect of the Assimilation database schema which is somewhat controversial - an aspect for which the jury is still out.
I chose to represent the Assimilation node type hierarchy with relationships which currently serve no purpose other than to represent the types of nodes in the database. This post will talk about why I put the type hierarchy in, and why it might be a good idea, or maybe not.
Originally, I had not put the type hierarchy in, as I saw no reason for it - each node already carries a nodetype attribute telling what type of node it is, and most node types have their own index: Drones are indexed in a Drone index by name (or designation, if you prefer), NICs in a NIC index by MAC address, IP addresses in an IPaddr index by canonical-form IP address, and so on.
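For the curious, here is roughly what one of those per-type lookups looks like. This is only a minimal sketch using today's official Neo4j Python driver and label-based lookups, not the legacy node-index API the CMA actually uses; the connection details, names and property values are made up for illustration.

    from neo4j import GraphDatabase

    # Illustrative connection details - not the CMA's real configuration.
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    def find_drone(tx, designation):
        # Look a Drone up by its designation, analogous to the per-type
        # index lookup described above.
        record = tx.run(
            "MATCH (d:Drone {designation: $designation}) RETURN d",
            designation=designation,
        ).single()
        return record["d"] if record else None

    with driver.session() as session:
        drone = session.execute_read(find_drone, "servidor.example.com")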
Given the nodetype field and all these indexes - and the fact that my code doesn't otherwise use the IS_A relationships - why would I do this? The first answer is not very compelling: I saw some other Neo4j applications which did it, and since I was (and still am) a Neo4j newbie who is still learning to think in graphs, I figured it must be a good idea - and maybe I would learn something along the way.
Below you'll see an example of what this looks like, with one node of each type relating to its nodetype node. Each of the objects (NIC, IPaddr, Switch, Drone, etc.) has an IS_A relationship with its nodetype node. Each nodetype node in turn has an IS_A relationship with node zero – sometimes called the reference node in Neo4j.
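To make that concrete, here is a minimal sketch of how those IS_A edges could be wired up. It uses the current Python driver and Cypher MERGE (neither of which existed when this was written), and since modern Neo4j no longer has a built-in reference node, a dedicated root node stands in for node zero; the label and property names are illustrative, not the CMA's actual identifiers.

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    def register(tx, nodetype, name):
        # Hang the instance off its nodetype node, and the nodetype node
        # off a single root node standing in for Neo4j's old reference node.
        tx.run(
            """
            MERGE (root:ReferenceRoot {name: 'AssimilationRoot'})
            MERGE (t:NodeType {nodetype: $nodetype})
            MERGE (t)-[:IS_A]->(root)
            MERGE (n:Thing {nodetype: $nodetype, name: $name})
            MERGE (n)-[:IS_A]->(t)
            """,
            nodetype=nodetype, name=name,
        )

    with driver.session() as session:
        session.execute_write(register, "Drone", "servidor.example.com")
        session.execute_write(register, "NIC", "00:1b:fc:1b:a8:73")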
As you can see, this seems a bit redundant -- after all, I have indexes of everything anyway. My original (and real) reason for putting these relationships in is simple - debugging.
Neo4j has a shell, and the way it operates makes it easy to follow relationships and see how the program is creating nodes and relationships - if you start from node zero and go from there. The shell is reminiscent of a normal UNIX-style shell, with each node in the graph treated analogously to a directory or folder, which makes seeing what the CMA did quite easy. So if I've added new relationships or node types, I can conveniently find any node of any type by starting from node zero.
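A typical poke-around session looks something like the following. cd and ls are real neo4j-shell commands, but the node ids (and the idea that node 42 happens to be the Drone nodetype node) are made up for illustration:

    $ neo4j-shell
    neo4j-sh (0)$ ls       # the shell starts at node zero; ls shows its IS_A relationships
    neo4j-sh (0)$ cd 42    # follow one into, say, the Drone nodetype node
    neo4j-sh (42)$ ls      # every Drone in the database is one hop away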
There's also another reason... Hypothetically, if I wanted to look for some attribute of a node that there was no index for, the query would run more quickly (and be easier to write) if I did it all in terms of relationships - walking from the nodetype node to its instances - rather than going through the indexes first. So, I can imagine it being useful in the future – but right now, I'm paying the cost of creating all these IS_A relationships – which I currently use primarily for debugging.
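For instance, a query shaped like the one below could find every NIC whose carrier is down by walking the IS_A relationships, without consulting any index. Same caveats as above: modern driver, modern Cypher, and the 'carrier' attribute and label names are invented for the sketch.

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    def nics_with_carrier_down(tx):
        # Walk root -> nodetype -> instances, then filter on an attribute
        # that has no index of its own ('carrier' is an invented example).
        result = tx.run(
            """
            MATCH (:ReferenceRoot)<-[:IS_A]-(:NodeType {nodetype: 'NIC'})<-[:IS_A]-(nic)
            WHERE nic.carrier = false
            RETURN nic
            """
        )
        return [record["nic"] for record in result]

    with driver.session() as session:
        down_nics = session.execute_read(nics_with_carrier_down)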
So, I get some debugging advantage, but I pay a price. In the future, I might get some performance advantage from it, but that day hasn't come yet – and may never come. So, is it a good idea? At this point in time, I'm not yet concerned about performance, and the debugging simplicity makes it worthwhile to me – for now. If I want to remove the relationships later, the code that creates them is all in one place, so it would be easy to change.
I know that some of the people who read this blog know a lot more about Neo4j than I do. So, what do you think? Does this seem reasonable to you?
Alan,
in general this is a sound thing. There are two caveats. The first is that you create contention on your reference node when you create and attach new nodes from multiple threads. (A solution for that could be adding batching for updating this "secondary index" structure.)
The second is on the read side: the ref-node will get hundreds of millions of rels in some domains, and that hurts, especially if you have traversals that accidentally touch the reference node (by following the instance_of relationship) and then explore all the many other rels in the next hop.
Both can be alleviated by creating a tree structure. The first level can be concrete ref-nodes per type, and if that's still too much you can introduce levels beneath that, partitioned e.g. by a domain aspect like time, location, subnet, customer, datacenter, you name it.
See http://docs.neo4j.org/chunked/milestone/cypher-cookbook-path-tree.html
Cheers
Michael
Posted by: Michael Hunger | 14 August 2012 at 00:40
This is good information to know. I don't currently expect my problem domain to create large numbers of nodes of the same type in a short period of time on a routine basis. That would imply that lots of new hardware showed up all at once - which is likely to occur only during initial installation. But I much appreciate your expert opinion, and will keep your advice in mind as we go forward. Thanks Much!
Posted by: Alan R. | 14 August 2012 at 00:55