The title for this blog post comes from a T-shirt I had made for the Assimilation Project. I wore a nicer version of it at my recent talk at LinuxCon 2012.
The Assimilation project has some significant and unique claims to scalability. Some of these have been discussed before. This blog article will explain the different aspects of the project and how each of them measures up in terms of scalability.
Before we get started, there is some geeky computer science terminology that needs explaining. When the amount of work needed to do something doesn't go up as the number of servers goes up, we geeky folk call it O(1). This notation is called big-O notation. When the work doubles when you double the number of servers, goes up by ten when the number of servers goes up by ten, and so on, we call that O(n) in big-O notation - because the work goes up in proportion to the number (n) of servers.
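To make the distinction concrete, here's a toy Python sketch (purely illustrative - not Assimilation code) contrasting the work a central poller does as the fleet grows with the fixed amount of work any single server does in a neighbor-checking scheme:

```python
# Illustration only - not Assimilation code.

def central_poller_work(n_servers):
    """A traditional central monitoring server checks every machine itself,
    so its work grows with the number of servers: O(n)."""
    return n_servers          # one check per server per polling cycle

def per_server_work(n_servers):
    """If each server checks only a fixed number of neighbors, the work
    done by any one machine stays constant as the fleet grows: O(1)."""
    return 2                  # e.g. two ring neighbors, regardless of n

for n in (10, 1000, 1000000):
    print(n, central_poller_work(n), per_server_work(n))
```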
With that out of the way, let's cover the liveness (or membership) aspects of the system - the ones discussed in an earlier post - and do a little work on definitions. I typically say that the liveness checking is essentially O(1). That is, the amount of work done by any one server doesn't go up as the number of servers goes up. The programmers and lawyers out there will have noticed the word "essentially". The part that is purely O(1) is the part that discovers that nothing is wrong (the "well-baby checks"). The work for that part never goes up as servers are added. However, the work required to process failures does go up as the number of failures goes up - and the number of failures goes up as you add servers. As one back-of-the-envelope data point: if you have a million servers, and each server fails about once every five and a half years, you will see about 500 server failures per day. That's a pretty small number of events to process - and many orders of magnitude fewer than the number of well-baby checks that this method avoids sending to the CMA. As far as I know at the time of this writing, this is a unique attribute of the Assimilation Project.
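To give a feel for why the per-server work stays flat, here is a minimal sketch of the idea, assuming a simple ring where each server heartbeats only its two neighbors and only a missed heartbeat generates any report. The class and method names below are mine, not the project's:

```python
import time

class RingMonitor:
    """Minimal sketch: each node watches only its two ring neighbors,
    so its monitoring work stays constant no matter how large the ring grows."""

    def __init__(self, my_id, left_neighbor, right_neighbor, timeout=10.0):
        self.my_id = my_id
        self.neighbors = {left_neighbor: time.time(),
                          right_neighbor: time.time()}
        self.timeout = timeout

    def heartbeat_received(self, neighbor_id):
        # A neighbor is alive; note the time and send nothing upstream.
        self.neighbors[neighbor_id] = time.time()

    def check_neighbors(self):
        # Only a missed heartbeat produces any traffic toward the CMA.
        now = time.time()
        return [n for n, last in self.neighbors.items()
                if now - last > self.timeout]
```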
Note that this failure-processing work still has to be done, but because of the way it is distributed, it is not concentrated on a single machine or small collection of machines - and that is why I say it's scalable.
For security reasons, we control the nanoprobes from the CMA, so when they come online the CMA has to splice them into its rings. But the CMA would need to know that a nanoprobe came online in any case, so that part is inevitably O(n) anyway. Creating rings out of them increases the work somewhat, but doesn't change the order of complexity of the system.
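For illustration, here is a hedged sketch of that splicing bookkeeping, using a plain Python list as a stand-in for a ring; the function name and the list representation are my simplification, not the project's data structures:

```python
def splice_into_ring(ring, new_node):
    """Insert new_node into a ring (represented here as a Python list).
    Only three nodes are affected - the newcomer and its two new neighbors -
    so the cost doesn't depend on how many nodes the ring already holds."""
    ring.append(new_node)
    left = ring[-2] if len(ring) > 1 else new_node
    right = ring[0]
    # In the real system, the CMA would now send updated heartbeat
    # instructions to exactly these three nanoprobes.
    return left, new_node, right

print(splice_into_ring(['alpha', 'bravo', 'charlie'], 'delta'))
# ('charlie', 'delta', 'alpha')
```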
What about service failures, or other exceptions that don't involve servers crashing or rebooting? Those are still O(n). Well-baby service checks are performed purely locally; until something fails, they follow the "no news is good news" philosophy, so only failures get propagated upstream. Some other systems do this (so-called passive checks), but in our case these are the norm, not the exception. And because they are combined with the reliable liveness checks, the no-news-is-good-news behavior for service checks is itself reliable.
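As a rough sketch of how no-news-is-good-news service checking might look - with illustrative names, not the project's API - a local checker can run its checks on the machine itself and talk upstream only when something newly fails:

```python
import subprocess

def check_services(checks, previously_ok, report_failure):
    """Run each check locally; report only services that have newly failed.
    'checks' maps a service name to a shell command whose zero exit status
    means healthy - the no-news-is-good-news pattern."""
    still_ok = set()
    for name, command in checks.items():
        if subprocess.call(command, shell=True) == 0:
            still_ok.add(name)       # healthy: nothing is sent anywhere
        elif name in previously_ok:
            report_failure(name)     # the only traffic that goes upstream
    return still_ok
```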
The last item is Stealth Discovery™. Discovery information is also roughly O(n) - but the n here is the number of configuration changes, rather than the number of nodes. Like most things, practice is more complicated than theory, so let's look at those practical details.
At the present time, we have decided not to keep any persistent discovery information on the systems themselves. So whenever a system reboots, it doesn't know whether a particular discovery item has been announced before - and it therefore sends all of its discovery information once at each reboot. After that, it follows the no-news-is-good-news philosophy and doesn't resend anything unless it changes. Note that this is curable by adding persistent state to the discovery data. If that turns out to be needed, it can be added - with some careful checking to deal with things like restoring data from backups, cloning of systems, and other common administrative tasks.
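One way to get the "send once at boot, then only on change" behavior without any persistent state on the client is to keep an in-memory checksum of the last result for each discovery type and resend only when it differs. This is just a sketch under that assumption, and the names are mine:

```python
import hashlib
import json

_last_sent = {}   # in-memory only: empty after a reboot, so everything is resent once

def maybe_send_discovery(kind, data, send):
    """Send a discovery result upstream only when it differs from the last one sent."""
    digest = hashlib.sha1(json.dumps(data, sort_keys=True).encode()).hexdigest()
    if _last_sent.get(kind) != digest:
        send(kind, data)             # first report since boot, or something changed
        _last_sent[kind] = digest
```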
It is the difficulty of dealing with this properly (which is harder than it seems) that has led me to the current decision.
In summary, the largest amount of effort that most monitoring systems perform centrally is now performed in a fair and evenly distributed fashion - resulting in an O(1) designation for these "well-baby checks". A number of things are still O(n), but they are drastically reduced by our no-news-is-good-news design philosophy.
Credits: The words on the T-shirt were inspired by a T-shirt that Splunk gave out a while back.
Does this make sense to you? Does the pun in the title and shirt hurt your soul, make you groan, or does it make you laugh?