In my previous post on automatic service monitoring rules, I explained the general principles behind the Assimilation project's rules for automatically monitoring services with LSB-style init scripts, based on what we've discovered. In this post, I'll extend that further and explain how to monitor using Open Cluster Framework (OCF) resource agents - which are far more powerful, but a bit more interesting to configure.
As you will recall from the previous post, the rules for LSB resource agents were written using JSON with comments. The rules for recognizing and configuring OCF resource agents are very similar - and are also described in JSON. Faithful readers may recall that I wrote a blog post on how OCF monitoring works - which may be of interest to review.
With an LSB resource agent, all you had to do was recognize that it controlled this particular service - there is nothing to configure. Either it matches, and you can issue a status operation, or it doesn't. OCF resource agents, by contrast, support configuring multiple instances of a resource on the same machine, through a series of name/value parameters passed in the environment. Some parameters have defaults, but some have to be configured in order to properly monitor an instance. The challenge for the Assimilation code is to figure out these parameters from the discovery information and plug them into the right places.
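To make "name/value parameters in the environment" concrete, here is a minimal Python sketch. The OCF convention is that each parameter reaches the agent as an environment variable named OCF_RESKEY_ followed by the parameter name, and that agents live under OCF_ROOT/resource.d/provider/type. The helper names below are hypothetical, this is not the Assimilation code, and the ipport value is just an illustrative address (the parameter itself is described further down).

```python
import os

def ocf_environment(params, ocf_root="/usr/lib/ocf"):
    """Build the environment an OCF resource agent would be run with.

    params is a dict of OCF parameter names to values; each one is
    exported as OCF_RESKEY_<name>. (Hypothetical helper for illustration.)
    """
    env = dict(os.environ)
    env["OCF_ROOT"] = ocf_root
    for name, value in params.items():
        env["OCF_RESKEY_" + name] = str(value)
    return env

def agent_path(provider, agent_type, ocf_root="/usr/lib/ocf"):
    """Conventional location of an OCF agent: $OCF_ROOT/resource.d/<provider>/<type>."""
    return os.path.join(ocf_root, "resource.d", provider, agent_type)

# Illustrative value only: an assumed loopback address and port
env = ocf_environment({"ipport": "127.0.0.1:7474"})
print(env["OCF_RESKEY_ipport"])
print(agent_path("assimilation", "neo4j"))
```

An agent invoked with this environment and the argument "monitor" would then read its configuration from those variables, which is why getting each parameter right matters.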
As part of the Assimilation project, we've written a neo4j OCF resource agent for monitoring and controlling Neo4j, which we'll discuss here. The LSB init script merely looks to see if the neo4j process is running. The OCF resource agent goes further - it issues a query to Neo4j and makes sure it gets a sane result returned. For good measure, it implements multiple levels of monitoring - one with a simpler query and one with a more complex query. This allows us to detect when services hang or give erratic results - which is far more interesting than merely looking to see if the process is in the process table.
There are a few interesting parameters for this resource agent which we can derive from our discovery information. These are:
- ipport - the IP:port address that neo4j is listening on for REST queries
- neo4j_home - the root directory of the neo4j installation
- neo4j - the name of this Neo4j instance
For the purposes of monitoring, only the first parameter is of interest. The other two parameters are necessary in order to know how to properly start and stop the service (which is part of the OCF resource agent specification). So, let's see how we tell the Assimilation code how to find and fill in these parameters.
{
    "class": "ocf",
    "type": "neo4j",
    "provider": "assimilation",
    "classconfig": [
        # OCF parameter    expression-to-evaluate    regular expression
        [null, "@basename()", "java$"],
        [null, "argv[-1]", "org\\.neo4j\\.server\\.Bootstrapper$"],
        ["ipport", "@serviceipport()", "..."],
        ["neo4j_home", "@argequals(-Dneo4j.home)", "/"],
        ["neo4j", "@basename(@argequals(-Dneo4j.home))", "."]
    ]
}
The last two columns are the same as in the LSB example in my previous blog post, but the first column is new. It gives the name of the parameter that corresponds to the expression in the second column. If an expression and its regular expression are only used for recognizing the service and don't have a corresponding parameter, then we use the JSON value null to indicate that. Our previous matching patterns are the same as they were, and don't have corresponding parameters to go with them. To fill in the ipport parameter in our example, we call the serviceipport() function, which examines the IP address/port combinations that neo4j is listening on and returns a single IP/port from among them. This function prefers to return IPv4 addresses and lower-numbered ports. If the service is listening on an ANY address, it returns the appropriate loopback address (127.0.0.1 or ::1).
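The selection policy just described for serviceipport() can be sketched in a few lines of Python. This is an illustration of the stated preferences, not the project's implementation: the function name pick_service_ipport, the input format (a list of address/port pairs), the choice to rank IPv4 ahead of port number, and the bracketed form for IPv6 results are all assumptions of mine.

```python
import ipaddress

def pick_service_ipport(listeners):
    """Pick one "ip:port" string from (address, port) pairs.

    Assumed policy: IPv4 before IPv6, then the lowest port number;
    an ANY address (0.0.0.0 or ::) is replaced by the matching loopback.
    """
    def sort_key(item):
        addr, port = item
        # version is 4 or 6, so IPv4 sorts first; ties go to the lower port
        return (ipaddress.ip_address(addr).version, port)

    addr, port = min(listeners, key=sort_key)
    ip = ipaddress.ip_address(addr)
    if ip.is_unspecified:
        addr = "127.0.0.1" if ip.version == 4 else "::1"
    if ip.version == 6:
        return "[%s]:%d" % (addr, port)  # bracket style is an assumption
    return "%s:%d" % (addr, port)

# Illustrative listening sockets, not real discovery output
listeners = [("::", 7474), ("0.0.0.0", 7474), ("0.0.0.0", 7473)]
chosen = pick_service_ipport(listeners)
print(chosen)
```

Returning loopback for an ANY address means the monitoring query goes to an address that is guaranteed to be local to the machine doing the monitoring.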
The data that serviceipport() returns is derived from data discovered by our tcpdiscovery discovery agent. The tcpdiscovery agent currently uses netstat to gather this information. Because of this, we know exactly how to monitor this particular instance of neo4j. The other two parameters use the argequals() function. This function searches the command line of this service (collected by the tcpdiscovery agent) for a "-Dneo4j.home=something" argument and returns the something part. Because you're one of my brilliant readers, you have doubtless now figured out that neo4j is invoked with that argument as a parameter - enabling us to extract it from the command line.
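Here is a rough Python sketch of how argequals() and the rule rows could fit together. It is a simplified model under stated assumptions, not the Assimilation code: the command line is invented for illustration, @basename() with no argument is assumed to mean the basename of the executable, the regular expressions are assumed to be applied as searches, and the ipport row is left out because it depends on network discovery data.

```python
import os
import re

def argequals(argv, name):
    """Return the value of the first 'name=value' argument, or None."""
    prefix = name + "="
    for arg in argv:
        if arg.startswith(prefix):
            return arg[len(prefix):]
    return None

# Hypothetical command line, for illustration only
argv = ["/usr/bin/java", "-server", "-Dneo4j.home=/var/lib/neo4j",
        "org.neo4j.server.Bootstrapper"]

# Each row: (OCF parameter name or None, computed value, regular expression)
rows = [
    (None, os.path.basename(argv[0]), r"java$"),
    (None, argv[-1], r"org\.neo4j\.server\.Bootstrapper$"),
    ("neo4j_home", argequals(argv, "-Dneo4j.home"), r"/"),
    ("neo4j", os.path.basename(argequals(argv, "-Dneo4j.home") or ""), r"."),
]

def apply_rows(rows):
    """Return the OCF parameters if every row matches, otherwise None."""
    params = {}
    for name, value, pattern in rows:
        if value is None or re.search(pattern, value) is None:
            return None          # one failed row means the rule does not apply
        if name is not None:     # rows with no parameter name only recognize
            params[name] = value
    return params

params = apply_rows(rows)
print(params)
```

In this model, a process that is not java, or that lacks the -Dneo4j.home argument, yields None and the rule is simply skipped for that service.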
As time goes on and we start recognizing more and more services, we will no doubt add more functions to simplify writing these rules. When the Assimilation code is extended to use other monitoring APIs, these same principles will enable us to configure those monitoring methods using nearly identical formats to either the LSB or OCF formats discussed in these two blog posts.
What's your reaction to this? Do these kinds of rules seem straightforward? Incomprehensible? Useful? A waste of time?