Ranting, Technically Speaking

February 25, 2010

Nagios Agents (NRPE)

Filed under: Uncategorized — archangel @ 10:42 am

In an earlier post, I mentioned Nagios as a system monitoring tool. It’s simple, it’s flexible, and out of the box, you can monitor network services without any software installed on the monitored systems.

Now if you want to monitor other aspects of a system, like its disk usage, you can either make that information generically visible on the network (say with SNMP) or you can install an agent for Nagios. The most common agent is NRPE.

Like everything else in Nagios, you first need a plugin so that Nagios can talk to NRPE, and there’s a standard one called, well, check_nrpe. Use your package manager of choice to install this plugin (nagios-plugins-nrpe in Fedora). I found that although this installed the Nagios plugin, it did not create a command definition, so I created one myself. First run the check_nrpe command manually to see what arguments it takes and then add your command definition to your Nagios configuration. It should look something like this:

# 'check_nrpe' command definition
define command{
        command_name    check_nrpe
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ $ARG2$
        }

The command definition specifies the name of the command and then simply its invocation. The macros given ($USER1$, etc.) are pretty generic and it’s pretty easy to work from existing command definitions or the Nagios documentation.

Now once you get NRPE installed on a client, the service definition is going to look something like this:

define service{
        use                             generic-service
        host_name                       Hudson
        service_description             DISK_ROOT
        check_command                   check_nrpe!check_root
        }
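
To see how the two definitions fit together, here is a rough bash sketch of the substitution Nagios performs: the check_command is split on "!" and the pieces are dropped into the command_line macros. The function name, the address 192.0.2.10 and the plugin directory are made up for illustration (on a real install $USER1$ normally comes from resource.cfg), and this only mimics the substitution, it is not how Nagios is implemented.

```shell
#!/bin/bash
# expand_check_command: hypothetical helper, for illustration only.
#   $1 = check_command from the service definition (e.g. check_nrpe!check_root)
#   $2 = value to use for $HOSTADDRESS$ (assumed address of the monitored host)
#   $3 = value to use for $USER1$ (assumed plugin directory)
expand_check_command() {
    local check_command=$1 host_address=$2 plugin_dir=$3
    local _name arg1 arg2

    # Everything after the command name is split on "!" into $ARG1$, $ARG2$
    IFS='!' read -r _name arg1 arg2 <<< "$check_command"

    # command_line copied from the check_nrpe command definition
    local line='$USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ $ARG2$'
    line=${line//'$USER1$'/$plugin_dir}
    line=${line//'$HOSTADDRESS$'/$host_address}
    line=${line//'$ARG1$'/$arg1}
    line=${line//'$ARG2$'/$arg2}

    # An empty $ARG2$ leaves one trailing space; drop it
    printf '%s\n' "${line% }"
}
```

Calling expand_check_command 'check_nrpe!check_root' 192.0.2.10 /usr/lib64/nagios/plugins prints the command line you could paste into a shell on the Nagios server to run the same check by hand.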

You should be able to get the NRPE agent installed on many “Linux” distros from the package manager. The agent can run either under xinetd (preferred) or as a stand-alone daemon. If you are using xinetd (which you should), make sure you specify the Nagios server in the only_from line, enable the service, and then restart xinetd. Since you’re using xinetd, basically all the service configuration lives there, leaving really only the command definitions in NRPE’s main config file (/etc/nagios/nrpe.cfg). In the main config file, you specify the commands that can be run. Here’s the definition for the check_root command:

command[check_root]=/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /

As you can see, the command definition provides all the arguments needed such that the Nagios server should not ever have to pass any arguments to NRPE. This is for both safety and simplicity.
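
Since the server can only ask for commands by name, it helps to know which names a client actually defines. A small sketch, assuming the command[NAME]=... line format shown above; the function name is my own invention and the default path is just the config file mentioned earlier:

```shell
#!/bin/bash
# list_nrpe_commands: hypothetical helper that prints the NAME part of every
# command[NAME]=... line in an NRPE config file (commented-out lines are skipped
# because they start with "#" rather than "command[").
list_nrpe_commands() {
    local cfg=${1:-/etc/nagios/nrpe.cfg}
    sed -n 's/^[[:space:]]*command\[\([^]]*\)]=.*/\1/p' "$cfg"
}
```

Each name it prints is something you can put after the "!" in a check_nrpe service definition.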

Now you’re done! Reload your NRPE and Nagios processes and check back in a few minutes to ensure your service check is working. If it’s not, typical issues are that the port is firewalled (TCP 5666 by default) or the Nagios host was not specified correctly in the only_from line (or the allowed_hosts line if not using xinetd).
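
If you suspect the firewall, one quick test from the Nagios server is to see whether the TCP port accepts a connection at all. This is only a sketch: it assumes a bash built with /dev/tcp support and the coreutils timeout command, and the function name is made up.

```shell
#!/bin/bash
# port_open: hypothetical helper; succeeds if a TCP connection to host:port
# can be opened within 3 seconds.  The port defaults to NRPE's 5666.
port_open() {
    local host=$1 port=${2:-5666}
    # bash's /dev/tcp pseudo-device attempts the connection; running it in a
    # child shell under timeout keeps a silently dropped packet from hanging us
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

For example, "port_open hudson && echo reachable || echo blocked". A reachable port with a failing check points at only_from / allowed_hosts rather than the firewall.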

Next up is to monitor a Windows host. Since Microsoft doesn’t have a convenient software repository of third-party applications, you get to go download and install an agent yourself. There are a handful of choices but generally, NSClient++ (NSCP) will be the one you want. It supports a variety of protocols including NRPE and NSCA (NSCA is for submitting passive checks). When you install NSCP, the installer will let you enable NRPE and should handle setting up NRPE as a service and opening the firewall for it. The one thing you have to do is either enable external scripts (preferred) or enable arguments. There are a handful of stock scripts and aliases provided which get you most of the basic functionality, like checking disk usage.

One last note is that you can always quickly check if the NRPE (or NSCP) process is talking to the server okay by simply running the check_nrpe plugin manually, giving it only the host. It will report the NRPE version if NRPE is working or an error if it is not:

[root@alma nagios]# /usr/lib/nagios/plugins/check_nrpe -H hudson
Connection refused by host
[root@alma nagios]# /usr/lib/nagios/plugins/check_nrpe -H hudson
NRPE v2.12

- Arch

January 26, 2010

Essential Application Plugins

Filed under: Uncategorized — archangel @ 11:33 am

The nice thing about programs like Firefox and Thunderbird is that you can get a lot of community-created plugins to make the program look and do what you want. The downside of programs like Firefox and Thunderbird is that there are (at least for me) a few plugins that have to be installed before they work well. So to that end, I’ve started building up a list of essential plugins.

The plugin model isn’t perfect, but it far exceeds the alternative which is that your applications all suck (Microsoft, I mean you). Heck, Nagios at the core doesn’t do anything at all for you, it’s all from plugins and I can’t rave enough about how great an application Nagios is.

- Arch

December 31, 2009

Fire Bad!

Filed under: Uncategorized — archangel @ 3:47 pm

Battery backup at home went off today BEEEEEEEEEEEEEEEEEEEEEEP BEEEEEEEEEEEEEEEP! Everything shuts itself down and I go to reboot the UPS when *sniff* *sniff* ah yes, the distinctive smell of burned electronics. So that’s it finally. Adios APC BackUPS 350. You will torture us no more with your intermittent failures!

Now I have to look for a new UPS. Preferably small in size (it has to go under my desk) and monitored. APC’s successor to the UPS model I had wasn’t monitored last time I checked; maybe they’ve got a newer model that is, though. If not, I’ll have to look at other manufacturers, and then that means looking at software support, etc.

Oh well, out with the old! Happy New Year’s!

- Arch

November 6, 2009

Nagios Rules All

Filed under: General — archangel @ 1:44 pm

Nagios is a network monitoring application which itself provides no actual monitoring but rather specializes in scheduling checks and notifications. As a module framework, it works very well and there are a lot of monitoring plugins and all told, there aren’t many (or any) systems that really compare, F/OSS, proprietary, or otherwise.

Since it’s not a complete solution in and of itself, I, at least, found it a bit daunting to get into. So I got this book:

Building a Monitoring Infrastructure with Nagios by David Josephsen

It’s not a huge book like say, HP’s OpenView manual(s), so read it first.

Nagios is super cool. You build definitions for each host on your network and each service on each host. Nagios checks each service, recording the service’s status. When a service fails, Nagios will send a notification once it is sure the service is down and then periodically until it comes back up.

Fine, that’s the basic premise. Now the configuration works pretty well because any host can inherit its configuration from any other host definition, including host definition templates. So set your general parameters once, and then override where necessary. It’s the same for services. And you also have host and service groups which allow you to logically group hosts (or services).

Nagios doesn’t have any built-in way to check services, it’s all through plugins. A plugin is simply an external script or program which exits with a status of 0 for OK, 1 for warning, 2 for critical, or 3 for unknown, and optionally prints one line of standard output as status text. Nagios has many standard plugins available, for example the check_ping plugin. This plugin is a little wrapper around ping which is invoked with arguments specifying the warning and critical thresholds for response time and packet loss. So in testing a plugin, you can simply invoke the plugin with the arguments that Nagios would be feeding it.
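
To make the plugin contract concrete, here is a minimal bash sketch of one. The name check_value and its arguments are hypothetical, not a stock plugin. It compares a number against warning and critical thresholds, prints one status line, and returns 0, 1 or 2 accordingly, plus 3 (which Nagios treats as UNKNOWN) when it is not given a number. It is written as a function so it is easy to try; a stand-alone plugin script would use exit instead of return.

```shell
#!/bin/bash
# check_value: hypothetical example plugin logic.
#   $1 = label for the status text, $2 = measured value,
#   $3 = warning threshold, $4 = critical threshold (all whole numbers)
check_value() {
    local label=$1 value=$2 warn=$3 crit=$4

    # Anything that is not a whole number cannot be judged: UNKNOWN
    if ! [[ $value =~ ^[0-9]+$ ]]; then
        echo "$label UNKNOWN - could not read a value"
        return 3
    fi

    if [ "$value" -ge "$crit" ]; then
        echo "$label CRITICAL - value is $value"
        return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "$label WARNING - value is $value"
        return 1
    fi

    echo "$label OK - value is $value"
    return 0
}
```

Running "check_value USERS 7 5 10; echo $?" by hand shows both halves of what Nagios looks at: the status line and the exit status.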

Now if a service goes down, Nagios will check if the host is down. Again, this is a plugin of the same type as for the service. Typically, this means check_ping. So you don’t really need to have a check_ping “service” check, just for the host. So if your host runs a webserver, you would use check_http and if that fails, Nagios will check_ping on the host to see if that’s down. If the host is down, well then obviously all services on that host are a write-off so Nagios will send a notification for that host once rather than for each individual service.

And when I say Nagios will send a notification, it doesn’t know how to do that either. Notifications are also defined but typically the stock notification will suffice. On Fedora, it uses the “mail” program to send a mail message.

Ah, so who does it notify? Well, each service and each host defines contact groups and also contact hours. So Nagios will notify everyone in a contact group if it’s during notification hours. So you can monitor your development systems as well as production ones and only get notifications when appropriate.

Nagios also provides escalations. So if a service (or host) remains down, you can define an escalation path. Maybe level one is help desk, and if they don’t respond, it escalates to supervisors as well, and if they still don’t respond, then it escalates to on-call staff, managers and eventually the head cheese.

What else is cool? Oh yeah, parent-child relationships. On each host, you can define parent hosts. So if you have say several routers throughout your network, connectivity to hosts would depend on connectivity to their routers. So if a router goes down, as with services on a host, Nagios will know to only notify of the router being down and not all the children individually.

There is also an agent for Nagios called NRPE. It is totally optional, but if you want local system checks, like disk, CPU, or running processes, and not just network service checks, then NRPE lets you do this. NRPE is available for “Linux” and Windows. You install it on your monitored hosts and it is, I think, like a little baby Nagios invoked by the mothership. You install service check plugins alongside NRPE, and then on the server your service checks look like check_nrpe!check_disk … or something like that, so the server sends the service check to NRPE on the monitored system. I haven’t used this yet, but will definitely be doing so.

The NEB is another cool part of Nagios. The Nagios Event Broker is an interface where you can write programs which hook into Nagios’s regular operations. There are a couple dozen callback functions you can hook into and this makes the possibilities for Nagios virtually endless.

The part I’ve left for last is the user interface. Well, once again, there is none. You configure it, it fires off notifications, that’s the core. There is a web interface Nagios provides that you can use if you want. It will show you host and service status, and you can acknowledge alarms through it and schedule downtime for hosts. Now you can hook in more functionality; for example, historical graphs can be very useful. If you’re checking disk usage with Nagios, why not keep a record of it? Well, when you get a service check result back, it comes back with one line of text from standard output, right? Well, there are packages that will build graphs from this data so that you can have your service status and historical reports too! Josephsen has a pretty extensive discussion on doing this kind of stuff and some great info on some of the options out there.

So, yeah. Get that book! Use Nagios! Monitor everything with it! Let it tell you when your toast is toasted or your beer needs a refill!

- Arch

October 1, 2009

Fedora Bootable USB

Filed under: Uncategorized — archangel @ 1:22 pm

LiveUSB Creator, it’s a wonderful thing. Connect a USB key, get the LiveUSB Creator on your PC (Windows or “Linux”), point it either to a local .iso file for a Fedora live CD or let it download the version you want for you, click go, and shazzam! (yes, “shazzam”) You’ve now got a bootable Fedora USB key. And if you gave it a block of persistent storage, you’ve got, well, persistent storage to use in this OS for data files etc.

- Arch

September 30, 2009

Processing Deferred Messages in Postfix

Filed under: Uncategorized — archangel @ 2:54 pm

For anyone who’s had to clean up some mail problems with Postfix configuration (or more often with other things, like anti-spam, tied in but not part of Postfix), it may be common enough that a large spool of mail gets queued up and needs to be pushed out. The easy way to do this is to run either “postfix flush” or “postqueue -f”, which basically force Postfix to re-process pending messages (actually “deferred” usually) and send them out.

However, if either the queue is huge, or you don’t really know if you have your problems resolved and want to try a few messages before unleashing the masses, I found it was not clear how this can be done. There is a straightforward way to do this, which is to put everything on hold using “postsuper -h ALL deferred”, and then un-hold whichever messages you do want processed with “postsuper -H” and the queue ID of each message.
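
Picking out a handful of queue IDs by hand gets tedious, so here is a sketch that pulls them from the "postqueue -p" listing. It assumes the usual listing layout (queue ID in the first column, optionally followed by "*" or "!", then the size); the function name is made up. postsuper reads queue IDs from standard input when given "-" in place of an ID.

```shell
#!/bin/bash
# queue_ids: hypothetical helper.  Reads `postqueue -p` output on stdin and
# prints one queue ID per line, dropping the "*" (active) or "!" (on hold)
# marker.  Header, reason, recipient and summary lines are skipped because
# they do not start with an ID followed by a numeric size.
queue_ids() {
    awk '/^[0-9A-Za-z]+[*!]?[ \t]/ && $2 ~ /^[0-9]+$/ {
             id = $1
             sub(/[*!]$/, "", id)
             print id
         }'
}

# Usage sketch (run as root on the mail server; not executed here):
#   postsuper -h ALL deferred                               # hold everything deferred
#   postqueue -p | queue_ids | head -n 5 | postsuper -H -   # release the first five
#   postqueue -f                                            # retry the released ones
```

If those five go out cleanly, "postsuper -H ALL" releases the rest.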

Tres handy

September 11, 2009

Let’s FUSE him with this juice!

Filed under: HOWTO — archangel @ 8:07 am

Back in the olden days, like a year or two ago, Filesystem in Userspace (FUSE) was a fancy feature that allowed users to mount file systems. Using FUSE means that you can create a file system driven by an application rather than a driver (e.g. a kernel module). When I first tried it, it meant customizing your kernel to include this feature and building a bunch of utilities and drivers, and generally it was awesome, but not something one does for a “quick fix”.

Fast forward to a few months later (or aeons in OSS terms) and there’s standard kernels and packages to operate FUSE. You can pull everything you need from your distro’s stock repository.

In particular, there is sshfs which is hella tight. “sshfs” is, as you might guess, a file system over SSH, i.e. in FUSE. This means you get the security and features of SSH, including SSH keys and all that good fun. Installing “sshfs” and FUSE is a simple three step process:

  1. yum install sshfs (or aptitude install sshfs for Debian / Ubuntu users)
  2. ?
  3. Profit!

Similarly, once you’ve installed “sshfs”, using it is a simple three step process:

  1. sshfs myhost.example.com:/some/remote/path /some/local/path
  2. ?
  3. Profit!

What could be simpler? If you’re finding your virtual file system access in Gnome or KDE produces odd behaviour sometimes, just FUSE your remote file system instead. You get fully functional and secure access to remote file systems.

Oh, and just one last note, you use a FUSE command to disconnect the mount:

fusermount -u /some/local/path

Thanks, Toddz for mentioning FUSE the other day and getting me to revisit it.

Ciao,
- Arch

(title for this post nicked from an Invader Zim quote)

September 4, 2009

Crappy Power

Filed under: Uncategorized — archangel @ 10:08 am

I’ve had some problems in the somewhat recent past where my UPS goes into panic mode and because the battery was old / crappy, this made things “very bad”. I’ve had no issues since replacing the battery, but now I’m getting a picture of why it was so awful from apcupsd:

Mon Aug 31 11:13:36 PDT 2009  Power is back. UPS running on mains.
Mon Aug 31 11:13:34 PDT 2009  Power failure.
Thu Aug 27 11:20:20 PDT 2009  Power is back. UPS running on mains.
Thu Aug 27 11:20:18 PDT 2009  Power failure.
Sat Aug 22 16:59:32 PDT 2009  Power is back. UPS running on mains.
Sat Aug 22 16:59:29 PDT 2009  Power failure.
Sat Aug 22 16:56:29 PDT 2009  Power is back. UPS running on mains.
Sat Aug 22 16:56:27 PDT 2009  Power failure.
Fri Aug 21 00:12:33 PDT 2009  Power is back. UPS running on mains.
Fri Aug 21 00:12:31 PDT 2009  Power failure.
Fri Aug 21 00:11:52 PDT 2009  Power is back. UPS running on mains.
Fri Aug 21 00:11:50 PDT 2009  Power failure.
... etc

There are a lot of power events going on. Given that each “power failure” lasts only 2 or 3 seconds, my guess is that this just means power is fluctuating. I’ve lived in places where this happened a bit and where it happened not at all, but this is the worst I’ve seen.
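
The length of each blip can be read straight off the log. A small sketch, assuming the newest-first ordering and line format of the excerpt above, and that no outage crosses midnight; the function name is my own:

```shell
#!/bin/bash
# outage_seconds: hypothetical helper.  Reads event lines on stdin (newest
# first) and prints the length in seconds of each failure/recovery pair.
outage_seconds() {
    awk '
        # Convert HH:MM:SS (field 4 of each line) to seconds since midnight
        function secs(hms,   t) {
            split(hms, t, ":")
            return t[1] * 3600 + t[2] * 60 + t[3]
        }
        /Power is back/              { back = secs($4); have_back = 1 }
        /Power failure/ && have_back { print back - secs($4); have_back = 0 }
    '
}
```

Feed it the events listing on standard input and it prints one number per outage.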

The only thing I can say is: get a UPS if you don’t have one! You may not need battery backup per se, but this is the kind of stuff that will send the power supply unit in your PC to an early grave. And if you’re unlucky, the PSU may just take other components of your PC with it.

- Archangel

August 27, 2009

Rolling dice in Bash

Filed under: HOWTO — archangel @ 11:23 am

I often need short random numbers at work. For example, if I’m scheduling a whole bunch of servers to do the same automated tasks and I want them to not run at exactly the same time, I’ll use a random number from 0 to 59 to have them run on different minutes. You can do this somewhat easily in bash using the $RANDOM variable and a mod operation like so:

echo $((RANDOM%60))
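
As a sketch of how that number gets used, the function below drops it into the minute field of an hourly crontab line. The function name and the task path in the example are made up; the only real ingredient is the $RANDOM trick above.

```shell
#!/bin/bash
# stagger_cron_line: hypothetical helper.  Prints a crontab line that runs
# the given command once an hour, at a randomly chosen minute (0 to 59).
stagger_cron_line() {
    local task=$1
    local minute=$((RANDOM % 60))
    printf '%d * * * * %s\n' "$minute" "$task"
}
```

Running "stagger_cron_line /usr/local/bin/nightly-task" on each server gives every box its own minute, give or take the odd collision.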

However, it’s a bit long to type and sometimes I need batches of numbers. So I looked around at dice rolling programs but most were too fancy. So I wrote a simple script I called “roll” which returns sets of random numbers.

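#!/bin/bash

# Roll
#   This script returns the values and sum of a set of dice rolls.  The first
#   arg is optional and gives the number of dice.  The second arg is the number
#   of sides on the dice.  For example "roll 2 6" will give two values from 1
#   to 6 and also returns their sum.
#
# (c)2009 Dominic Lepiane <dlepiane@gmail.com>

sides=6
dice=1
total=0
c=0

if [ $# -eq 2 ] ; then
        dice=$1
        sides=$2
elif [ $# -eq 1 ] ; then
        sides=$1
else
        echo "Usage: $0 [# of dice] <# of sides>" >&2
        exit 1
fi

# Both values must be positive whole numbers; zero or a non-number would
# break the arithmetic below.
for n in "$dice" "$sides" ; do
        if ! [[ $n =~ ^[1-9][0-9]*$ ]] ; then
                echo "Usage: $0 [# of dice] <# of sides>" >&2
                exit 1
        fi
done

#echo "Rolling ${dice}d${sides}"

# $RANDOM tops out at 32767, so this only makes sense up to 32768 sides.
while [ "$c" -lt "$dice" ] ; do
        c=$((c+1))
        roll=$((RANDOM%sides + 1))
        total=$((total+roll))
        echo -n "$roll "
done

# Only show the sum when there is more than one die.
if [ "$dice" -gt 1 ] ; then
        echo -n " = $total"
fi

echo ""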

So if I want 12 numbers from 1 to 60, it looks like this:

./roll 12 60
21 32 30 38 56 36 27 19 25 34 25 48  = 391

Very handy!

July 15, 2009

VMware and Unity

Filed under: General, HOWTO, Rant, Troubleshooting — archangel @ 1:34 pm

I’ve been running Fedora 11 (x64) on my workstation at work and running Windows XP (32-bit) in a VMware virtual machine. It was a VM I’d created with VMware Server, so all I needed to run it was the free VMware Player.

First, installing VMware Player was a bit of a problem. The install from RPM didn’t work; it hosed right away. Then the install from the bundle also failed… much like it does for many users online, it turned out, so there was a community-created patch, which worked just fine.

Then there was running the VM. Initially, it seemed great. I was running Windows XP full-screen on my right-screen and had my Fedora desktop / apps on my left screen. But it was pretty wonky about mouse control so I got to the point where I was firing up the Windows VM only when I needed it and then not in full-screen mode.

But I discovered that VMware’s Unity mode helps bridge the gap. It pulls you out of console mode and launches any apps from the guest VM in their own windows in your desktop environment. This is especially useful for, say, running MSIE or MS Outlook. It’s still a little weird because the apps *look* like they should be running natively yet the responsiveness is clearly far behind local apps, but the only real gap is that I can’t Shift+Right-Click -> Run As… on tools like Active Directory Users and Computers (which I need). I tried switching back to the console, doing the Run As… and then switching back to Unity, but the escalated app doesn’t show up.

Well, it’s great and closes the gap some, but for now I’ll just keep updating Player and Tools and see if eventually that full-screen mode just gets fixed and works transparently.

- Arch
