Sunday, May 23, 2010

Distributed Nagios

The blog isn't dead, I swear!  I've been working on this article for a long time trying to work out how to word it.  I've probably failed to make it as clear as it should be, but I think it conveys the ideas at least adequately.

This is going to be the first of what will probably be several posts (that will most likely eschew my whole format ::gasp::, ok.. maybe not) describing the Nagios monitoring system I have been working on for what feels like a century.  For now I'm going to discuss the goals of the system, the variety of devices that need to be watched and the different pieces of relevant information for these devices.  It should be noted that 'devices' in this context may in fact refer to an entire system; for example, we have a set of storage systems that consists of multiple computers, RAID arrays, networks, etc.

To start, we'll label the Internet as Network 0.

Our primary storage device has connectivity to Network 0, as well as a link to the 10 Gbps Ethernet network for our cluster, which we'll call Network 1.  Finally, it has a backend management network we will label Network 2.

Our next device of primary concern is the computational cluster.  This has 3 network links as well.  The first is to Network 0.  The second is to Network 1, in the form of an InfiniBand to 10 Gbps switch with Twinax 10 Gbps cables connecting to the 10 Gbps communication switch.  Finally, it has a 1 Gbps management network we will label Network 3.

The final 'major' device is very similar to the first one, as its task is to mirror the primary storage.  It has a connection to Network 0, which is used for keeping the devices linked and synced asynchronously.  It also has a management network which we will label Network 4.

Each of these devices also has a server situated near it with access to the back-end private networks of the individual systems.  This is required because these networks have no pathway to the internet.  Among the myriad tasks of these servers is to serve up a set of virtual machines running CentOS and Nagios.  This allows us to keep an eye on the devices without exposing the Nagios traffic to the internet.
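To make the layout a bit more concrete, here is a rough sketch of it as data in Python.  The network numbers are the labels from this post; the device and gatherer names are made up for illustration, and I'm assuming each monitoring VM can see Network 0 plus the management network of the device it sits next to.

```python
# Network numbers follow the labels used in this post.
# Device and gatherer names are hypothetical.
DEVICE_NETWORKS = {
    "primary_storage": {0, 1, 2},
    "cluster": {0, 1, 3},
    "mirror_storage": {0, 4},
}

# Assumption: each gatherer VM reaches the internet (Network 0) and the
# management network of the one device it sits next to.
GATHERER_NETWORKS = {
    "gatherer_storage": {0, 2},
    "gatherer_cluster": {0, 3},
    "gatherer_mirror": {0, 4},
}

# The management networks have no pathway to the internet.
MANAGEMENT_NETWORKS = {2, 3, 4}


def gatherers_for(device):
    """Return the gatherers that share a management network with a device."""
    private = DEVICE_NETWORKS[device] & MANAGEMENT_NETWORKS
    return sorted(
        name for name, nets in GATHERER_NETWORKS.items() if nets & private
    )
```

Asking `gatherers_for("cluster")` gives back only the gatherer sitting on Network 3, which is the whole point: every management network has exactly one Nagios instance able to see it.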

However, this also means we have three separate locations to keep an eye on and review if there is a problem being reported.  This doesn't necessarily have to be the case though; thanks to some industrious coding on the part of some of the Nagios contributors, you can use the NSCA tool to forward these status updates to a central location.  Alternatively you could use the (still listed as experimental) MySQL data storage system to store all the information in a single area, but this has several caveats, like not being able to use the default Nagios display system and relying on your MySQL server not dying to keep an eye on what's going on.  Due to these caveats, as well as the desire to have an isolated implementation that isn't dependent on outside systems (these systems don't even mount an NFS share), we went with NSCA.
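For a feel of what gets forwarded, here is a minimal sketch of handing one service result to the `send_nsca` client from Python.  By default `send_nsca` reads tab-separated lines of host, service description, return code and plugin output on standard input.  The function names, the config path and the example host are my own assumptions and not part of the setup described here.

```python
import subprocess

# Standard Nagios return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def format_service_result(host, service, return_code, output):
    """Build one passive service result line in send_nsca's default format."""
    for field in (host, service, output):
        if "\t" in field or "\n" in field:
            raise ValueError("fields must not contain tabs or newlines")
    return f"{host}\t{service}\t{return_code}\t{output}\n"


def send_nsca_command(head_address, config_path="/etc/nagios/send_nsca.cfg"):
    """Build the send_nsca command line (config path is an assumption)."""
    return ["send_nsca", "-H", head_address, "-c", config_path]


def forward(head_address, line):
    """Pipe one formatted result line to send_nsca; raises if it fails."""
    subprocess.run(
        send_nsca_command(head_address), input=line, text=True, check=True
    )
```

Something like `forward("192.0.2.10", format_service_result("storage01", "RAID", CRITICAL, "disk 4 failed"))` would then show up on the head as a passive check result, provided the NSCA daemon is listening there and the host and service names match what the head has in its own config.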

This of course is only part of the puzzle.  Anyone that's explored Nagios in even a cursory manner will know that the backbone of the system is the config files used to tell it about the different systems it is required to keep an eye on.  However, you can't just use the same config files on each of the systems and call it a day.  This is a short path to a truly massive headache, since you'll then have copies of Nagios without the capacity to reach a given host that is listed in their config file.  Also, your aggregator will be hammering on phantoms it has no way to reach (or just making people around the world with the same IPs very, very confused).

So now that we know we can do it but there are some problems, how do we fix them?  I decided to kill as many birds with as few stones as possible.  It does make configuration a bit trickier, but I personally find it easier to configure and maintain than the alternatives.  In this case that required a few steps.  First, there is 1 master list of configuration files, and these, when pushed out, will overwrite any changes made on the outlying systems.  I've placed this master set on the aggregator 'head' for Nagios.  To make this work the aggregator has subfolders of config files that are important to each of the gatherer systems, a central set of config files that everything needs for things like templates and command definitions, and a head folder with special config information for the head (hostgroups that need to exist but aren't important from the gatherers' point of view).  The head then imports this main folder, while the gatherers import the folder labeled as theirs and the 'central' folder to get their command and template definitions.
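As a rough illustration of that layout, the sketch below works out which `cfg_dir` lines each role would need in its `nagios.cfg`.  Nagios reads a `cfg_dir` recursively, so the head only needs to point at the top of the master tree.  The root path and the gatherer names are hypothetical, and I'm assuming the gatherers keep their copies under the same path as the head.

```python
from pathlib import PurePosixPath

# Hypothetical location of the master config tree.
MASTER_ROOT = PurePosixPath("/etc/nagios/master")


def config_dirs(role, gatherers):
    """Return the directories a given role should import.

    The head imports the whole tree (cfg_dir recurses into subfolders).
    A gatherer imports only the shared 'central' folder and its own folder.
    """
    if role == "head":
        return [MASTER_ROOT]
    if role not in gatherers:
        raise ValueError(f"unknown role: {role}")
    return [MASTER_ROOT / "central", MASTER_ROOT / role]


def render_cfg_dirs(dirs):
    """Render directories as cfg_dir lines for nagios.cfg."""
    return "".join(f"cfg_dir={d}\n" for d in dirs)
```

With gatherers named `storage`, `cluster` and `mirror`, the `cluster` gatherer ends up with two lines (`central` and `cluster`) and never sees a host it can't reach, while the head gets everything from a single line.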

This is all well and good to say, but it doesn't say how those configs are given to the gatherers and the like.  To handle this, I first made it so that you can't just SSH into the gatherers by putting an ALL operator in their hosts.deny files, then added a special exception for the Nagios head, so while nobody else can SSH into these hosts, the head can.  Eventually the head's mirror will be allowed as well.  I can still get a desktop on these hosts at will, since they are VMs run on VMware Server 2.  After this I configured an SSH key from the head to the gatherers.  Currently it's a root key, but I'm trying to find a way around this.  The key allows the head to SCP the config files to the clients and give the nagios service a 'reload' command.  After that I can modify the config files on the head and simply execute a command called '' to push those updates out to the gatherers.
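The push itself boils down to an `scp` followed by a remote reload.  Here is a minimal sketch of that idea in Python; it is not the actual push command, and the function names, paths and addresses are all assumptions for illustration.

```python
import subprocess


def push_commands(gatherer, address,
                  master_root="/etc/nagios/master",
                  remote_root="/etc/nagios/master"):
    """Build the commands that push one gatherer's configs and reload Nagios.

    Assumes key-based root SSH from the head, as described in the post.
    Paths are hypothetical.
    """
    target = f"root@{address}"
    return [
        # Copy the shared folder and this gatherer's own folder.
        ["scp", "-r",
         f"{master_root}/central", f"{master_root}/{gatherer}",
         f"{target}:{remote_root}/"],
        # Ask the remote Nagios to re-read its configuration.
        ["ssh", target, "service", "nagios", "reload"],
    ]


def push_all(gatherers):
    """Push to every gatherer in a {name: address} mapping, stopping on error."""
    for name, address in gatherers.items():
        for command in push_commands(name, address):
            subprocess.run(command, check=True)
```

One thing to keep in mind with plain `scp -r` is that it overwrites files but never deletes ones that were removed from the master set, so a stale config can linger on a gatherer until it is cleaned up by hand.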

For a quick but less granular setup you'll want to disable active host and active service checks on the head.  This has the side effect of not allowing it to test the availability of the gatherers, but it also won't create spurious requests and false failures from systems the head simply can't reach (or false positives from hosts on the internet that happen to share an IP address with a hidden host).
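In `nagios.cfg` terms this comes down to a handful of main-config directives.  The sketch below rewrites a config's text with those values; the directive names are standard Nagios ones, but the helper itself and the choice to append missing settings at the end are just my illustration.

```python
# Main-config directives for a head that only accepts passive results.
HEAD_OVERRIDES = {
    "execute_service_checks": "0",
    "execute_host_checks": "0",
    "accept_passive_service_checks": "1",
    "accept_passive_host_checks": "1",
}


def apply_overrides(cfg_text, overrides):
    """Return cfg_text with each directive set to the overriding value.

    Existing uncommented settings are replaced in place; directives that
    are not present are appended at the end. Comments are left alone.
    """
    remaining = dict(overrides)
    lines = []
    for line in cfg_text.splitlines():
        key = line.split("=", 1)[0].strip()
        is_setting = "=" in line and not line.lstrip().startswith("#")
        if is_setting and key in remaining:
            lines.append(f"{key}={remaining.pop(key)}")
        else:
            lines.append(line)
    lines.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(lines) + "\n"
```

The head also needs `check_external_commands=1` for NSCA to be able to hand results over, since the NSCA daemon submits them through the external command file.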

The next articles will discuss how to iron out some of the caveats from this article and how to monitor some specific types of systems using SNMP (well supported in Nagios), IPMI (a few unreliable plugins are available; the PNI has published our own to hopefully solve this) and NRPE, a client-side monitoring and request program provided by the Nagios dev team.

The Blue Pill:

It works.  Nagios is almost infinitely flexible, so you can mold it to any of a variety of behaviors that you might wish to see and it will do it with a little massaging.  On top of that, writing plugins to make it handle data that it doesn't have previously existing handlers for is very straightforward and probably the least painful part of the whole thing.
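To show how little a plugin needs, here is a minimal sketch of the core of one in Python.  A plugin just prints a single status line (optionally followed by `|` and performance data) and exits with 0, 1, 2 or 3 for OK, WARNING, CRITICAL or UNKNOWN.  The metric name and thresholds are hypothetical, and a real plugin would read the value from the device rather than take it as an argument.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a reading to a status, treating higher values as worse."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def format_output(name, value, warn, crit):
    """Return (status, text) with performance data after the pipe."""
    status = evaluate(value, warn, crit)
    text = (
        f"{name.upper()} {LABELS[status]} - {name} is {value}"
        f" | {name}={value};{warn};{crit}"
    )
    return status, text


def run(argv):
    """Parse VALUE WARN CRIT from argv, print the status line, return the code."""
    try:
        value, warn, crit = (float(arg) for arg in argv[1:4])
    except ValueError:
        print("TEMP UNKNOWN - usage: check_temp VALUE WARN CRIT")
        return UNKNOWN
    status, text = format_output("temp", value, warn, crit)
    print(text)
    return status

# When installed as a plugin, the script would end with:
#     import sys; sys.exit(run(sys.argv))
```

Nagios only cares about the exit code and the first line of output, so anything that can produce those two things (shell, Perl, Python, a compiled binary) works as a plugin.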

The Red Pill:

Configuration is a BEAST.  The vast majority of the time I've spent on this project has been working out how to properly configure everything so as not to have duplicate configs / phantom configs while still keeping everything talking happily.  A graphical config system would be helpful, but it would have to be incredibly sophisticated to properly handle the layout I'm using, as the network isn't flat and the gatherers and head are isolated systems using the internet as their common communication path.  That said, I've been very happy with it thus far; it has already helped us catch some things that our systems should have been notifying us about and were not (like failed disks in a storage system and a few cluster nodes that have dead BMC monitors).

I covered a bit more ground in this post than I had originally intended but it seemed relevant to explain how these systems have to talk to each other so that I can get to the nitty gritty next time.