LowEndBox - Cheap VPS, Hosting and Dedicated Server Deals

Remote server monitoring with nagios (CentOS)

lowendtutorial

There are many ways to remotely monitor your servers, but nagios is one of the most configurable and flexible ones. Once you have set up nagios, your localhost is being checked. But when there are real issues with your localhost, nagios probably isn’t going to send you any messages. That’s why you should set up remote monitoring: a nagios server that checks all your servers. Ideally, you set up multiple nagios servers or monitor your single nagios server as well. This article assumes a single nagios server.

So, remote monitoring. Nagios has a plugin named Nagios Remote Plugin Executor (NRPE for short). This plugin enables nagios to execute check scripts on remote hosts. What it checks, how often it checks it and when it sends out a warning can be easily configured (per server). In order for these remote checks to work, they need to be defined in two places:

  • Nagios server: so it knows what to check and how to check it
  • Remote host: so it knows what the nagios server is allowed to check

This article assumes you have nagios up and running for localhost. It also assumes you use CentOS or RHEL (though it may very well work on other Linux distributions as well). The first thing to do is to make sure the NRPE plugin is installed on the nagios server:

yum install nagios-plugins-nrpe

Once that is done, we need to fulfil two preconditions for the rest of the guide to work. I won’t go into the details of these, as the Ubuntu guide doesn’t cover them either (Ubuntu satisfies these preconditions by default). First, create /etc/nagios/conf.d/hostgroups.cfg and paste the following into it:

define hostgroup {
hostgroup_name  all
alias           All Servers
members         *
}

define hostgroup {
hostgroup_name  generic-servers
alias           Generic servers
}

define hostgroup {
hostgroup_name  http-servers
alias           HTTP servers
}

Second, add the following lines to the bottom of /etc/nagios/objects/commands.cfg:

define command {
command_name    check_nrpe
command_line    /usr/lib64/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -a $ARG2$
}

# this command runs a program $ARG1$ with no arguments
define command {
command_name    check_nrpe_1arg
command_line    /usr/lib64/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}

Next, you usually need to determine what services you would like to check and whether there is a plugin for it. You can write your own nagios plugins in whatever language you want. I’ve worked with some PHP nagios plugins, so it’s really not that hard. However, that’s something for another article and we’re going to work with some of the stock plugins now. I’m going to use the following plugins in my examples:

  • Current Users (check_users)
  • Current Load (check_load)
  • Disk Space (check_all_disks)
  • Zombie Processes (check_zombie_procs)
  • Total Processes (check_total_procs)
  • Swap (check_swap)

Nagios server

The first thing to do is to define the services on the nagios server. The nagios server needs to know which services to check for a remote host and how to check them. I’ve added my services to /etc/nagios/conf.d/services.cfg. We’re going to make a service definition block for every remote check we need. I’m going to show one example here and attach the full list of definitions to this article (for reference or copy-paste). Here is the first remote service definition:

define service {
hostgroup_name generic-servers
service_description Current Users NRPE
check_command check_nrpe_1arg!check_users
use generic-service
notification_interval 0
}

I’m going to explain this block line by line, so you really understand what it all does. The ‘define service’ kind of speaks its purpose, so I’ll skip that.

hostgroup_name generic-servers

This assigns the service to a hostgroup, in this case the generic-servers hostgroup we created in hostgroups.cfg. The hostgroup has to exist already; nagios will not create it for you. You can later add a host to that hostgroup and nagios will automatically execute this check for that host (and all other hosts in that hostgroup).

service_description Current Users NRPE

This describes the service and is used for displaying this service on both the nagios web interface and in any nagios software you could use (like nagstamon or aNag).

check_command check_nrpe_1arg!check_users

This defines the actual command of the check. The first part, ‘check_nrpe_1arg’, says this is a check without any additional arguments. It says 1arg, because that one argument is ‘check_users’ (the exclamation mark is used as a separator for arguments). So this piece makes sure that the ‘check_users’ command is executed on the remote host.
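To make that expansion a bit more concrete, here is a small shell sketch of what nagios effectively does with this line: it splits check_command on the exclamation mark and drops the pieces into the command_line of check_nrpe_1arg. The function name expand_check_nrpe_1arg is made up for this illustration; nagios does all of this internally.

```shell
#!/bin/sh
# Illustration only: mimic how nagios expands "check_nrpe_1arg!check_users"
# into the command_line defined for check_nrpe_1arg in commands.cfg.
# expand_check_nrpe_1arg is a hypothetical helper, not part of nagios.
expand_check_nrpe_1arg() {
    check_command=$1    # e.g. check_nrpe_1arg!check_users
    host_address=$2     # value of $HOSTADDRESS$ for the host being checked

    arg1=${check_command#*!}   # everything after the first "!" becomes $ARG1$

    printf '/usr/lib64/nagios/plugins/check_nrpe -H %s -c %s\n' \
        "$host_address" "$arg1"
}

# Prints: /usr/lib64/nagios/plugins/check_nrpe -H 192.0.2.21 -c check_users
expand_check_nrpe_1arg 'check_nrpe_1arg!check_users' 192.0.2.21
```

Running the printed command by hand on the nagios server is a handy way to debug a single remote check once the remote host has been set up.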

use generic-service

The use keyword enables you to include templates. This time it’s the generic-service template. A template is actually a normal nagios configuration, in this case a service definition like the one we’re adding now. I use the default template that is installed with nagios. On CentOS the stock templates should be in /etc/nagios/objects/templates.cfg (the same objects/ directory as commands.cfg); adjust the path if your package puts them elsewhere. To see the service definition:

less /etc/nagios/objects/templates.cfg

This template defines a lot of options that you now don’t have to define with every service definition you write. So with the use keyword, you can easily share basic settings between groups of services or hosts. Now back to the last line:

notification_interval 0

This defines the number of time units before re-sending a notification in case of a problem with this service. A time unit is defined as interval_length (defined in /etc/nagios/nagios.cfg), which defaults to 60 seconds. The number here is thus in minutes (x times 60 seconds). Set this to 0 to receive just one notification.
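The arithmetic is simple enough, but here it is as a tiny shell sketch in case you want to try other values. renotify_seconds is just a name I picked for this example, and the 60 is the interval_length default mentioned above.

```shell
#!/bin/sh
# Seconds between repeat notifications = notification_interval * interval_length.
# renotify_seconds is a hypothetical helper name used only for this example.
renotify_seconds() {
    notification_interval=$1
    interval_length=${2:-60}   # default from nagios.cfg
    echo $((notification_interval * interval_length))
}

renotify_seconds 30   # 30 time units of 60 seconds: prints 1800
renotify_seconds 0    # prints 0, which nagios treats as "notify once, never repeat"
```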

Now we’ve established what the definition does, you can add service definitions for all the plugins I listed and we can move on to the remote host. You may also skip doing this and add more checks later.

Remote host

Now on to the remote host we want to check. We start by installing the NRPE server together with the plugins for the checks we want to run:

yum install nrpe nagios-plugins-load nagios-plugins-swap nagios-plugins-ssh nagios-plugins-http nagios-plugins-ping nagios-plugins-disk nagios-plugins-procs nagios-plugins-users

Then make sure the plugins are actually there:

ls -al /usr/lib64/nagios/plugins/
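If you would rather not scan the listing by eye, a short loop can do the looking for you. This is only a sketch: missing_plugins is a made-up helper name, and the list of file names is my guess at what the checks in this article need.

```shell
#!/bin/sh
# Print the name of every plugin that is missing or not executable.
# missing_plugins is a hypothetical helper; the first argument is the
# plugin directory, the remaining arguments are plugin file names.
missing_plugins() {
    plugin_dir=$1
    shift
    for plugin in "$@"; do
        [ -x "$plugin_dir/$plugin" ] || echo "$plugin"
    done
}

# No output means every plugin listed here was found.
missing_plugins /usr/lib64/nagios/plugins \
    check_users check_load check_disk check_procs check_swap
```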

Next, make sure NRPE starts when your server starts:

/sbin/chkconfig nrpe on

chkconfig is a tool that manages the symlinks in the various /etc/rc[0-6].d directories. Those symlinks determine which services are started in each runlevel.

Next, we fix the default nrpe.cfg. I’ve got a bug report open at Red Hat about this (https://bugzilla.redhat.com/show_bug.cgi?id=963703). Basically, NRPE may override several custom checks with default values. Open up /etc/nagios/nrpe.cfg and look for these lines:

# INCLUDE CONFIG DIRECTORY
# This directive allows you to include definitions from config files (with a
# .cfg extension) in one or more directories (with recursion).

include_dir=/etc/nrpe.d/

Cut them from their current position (or copy them and remove them) and paste them back at the bottom of the file. Your checks will now not be overwritten.
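If you have to do this on more than a handful of hosts, the same move can be scripted. The sketch below is one way to do it, not the only one: move_include_dir_last is a made-up helper name, it only moves the include_dir= line itself (the comment block stays where it was), and I would run it on a copy first.

```shell
#!/bin/sh
# Move every active include_dir= line to the end of an NRPE config file.
# move_include_dir_last is a hypothetical helper; try it on a copy first.
move_include_dir_last() {
    cfg=$1
    tmp=$(mktemp) || return 1

    grep -v '^include_dir=' "$cfg" > "$tmp"   # everything except include_dir lines
    grep '^include_dir=' "$cfg" >> "$tmp"     # include_dir lines go last

    cat "$tmp" > "$cfg"   # rewrite in place so owner and mode stay the same
    rm -f "$tmp"
}

# Usage (on a copy, then compare with the original before replacing it):
#   cp /etc/nagios/nrpe.cfg /tmp/nrpe.cfg
#   move_include_dir_last /tmp/nrpe.cfg
#   diff /etc/nagios/nrpe.cfg /tmp/nrpe.cfg
```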

Now it is important to make sure the nagios server is allowed to connect to the remote host. First of all, if you use a firewall, you need to open up port 5666 (TCP; NRPE does not use UDP). Second, you need to define the allowed_hosts parameter in /etc/nrpe.d/nrpe_local.cfg. Any .cfg file in /etc/nrpe.d will be included by NRPE, so you can add more if you wish. I’m just going to use the single one for now.

Make sure the nrpe_local.cfg file looks something like this:

######################################
# Do any local nrpe configuration here
######################################
allowed_hosts=127.0.0.1,192.0.2.20

I always leave 127.0.0.1 there. Although it is not strictly necessary, it could be useful for local testing. You can add as many IP addresses as you deem necessary there. I just added one IP, the one of the nagios server.
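A typo in allowed_hosts is an easy way to lose an evening, so here is a small sketch that checks whether an address is really in the list. host_is_allowed is a name I made up, and it only does an exact string match on plain IP addresses; it knows nothing about hostnames or network ranges.

```shell
#!/bin/sh
# Succeed if the given IP is listed in allowed_hosts= in an NRPE config file.
# host_is_allowed is a hypothetical helper: exact match on plain IPs only.
host_is_allowed() {
    cfg=$1
    ip=$2
    # Use the last allowed_hosts line in this file and drop any spaces.
    hosts=$(sed -n 's/^allowed_hosts=//p' "$cfg" | tail -n 1 | tr -d ' ')
    case ",$hosts," in
        *",$ip,"*) return 0 ;;
        *)         return 1 ;;
    esac
}

# Usage on the remote host:
#   host_is_allowed /etc/nrpe.d/nrpe_local.cfg 192.0.2.20 && echo "nagios server is allowed"
```

Wrapping the list in commas before matching keeps 192.0.2.2 from being mistaken for 192.0.2.20.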

After setting this and restarting NRPE, the nagios server can connect to the remote host. However, it cannot execute any checks yet. In order to be able to execute checks, the checks that were defined on the nagios server need to be defined here as well. Open up the file /etc/nrpe.d/nrpe_local.cfg again and add the following lines (you can disable any of them later by commenting it out with #):

command[check_users]=/usr/lib64/nagios/plugins/check_users -w 5 -c 10
command[check_load]=/usr/lib64/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
command[check_all_disks]=/usr/lib64/nagios/plugins/check_disk -w 20% -c 10%
command[check_zombie_procs]=/usr/lib64/nagios/plugins/check_procs -w 5 -c 10 -s Z
command[check_total_procs]=/usr/lib64/nagios/plugins/check_procs -w 200 -c 250
command[check_swap]=/usr/lib64/nagios/plugins/check_swap -w 50% -c 25%

By adding these lines, we have mapped the checks defined on the nagios server (the part behind check_nrpe_1arg!) to actual checks on the remote host. Let’s look at the first line:

command[check_users]=/usr/lib64/nagios/plugins/check_users -w 5 -c 10

This maps check_users to /usr/lib64/nagios/plugins/check_users. You can even test the check by executing /usr/lib64/nagios/plugins/check_users directly. If you do that, it gives a nice warning that you forgot two parameters:

Usage:
check_users -w <users> -c <users>

In the command definition, we’ve set these to 5 and 10. The -w sets the warning threshold and the -c sets the critical threshold. If more than five users are logged in when the nagios server runs the check, it triggers a warning. If more than 10 users are logged in, it triggers a critical alert.

Now run the check again to see the actual result:

/usr/lib64/nagios/plugins/check_users -w 5 -c 10

It should now say:

USERS OK - 1 users currently logged in |users=1;5;10;0

As you can see, every command has different threshold values. Notice the % with check_all_disks and check_swap. If you ever add a check, be sure to run it directly on the remote host to see if it returns the expected result. It’s never a waste to play with these checks a bit and fine-tune them.
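While you are playing with thresholds, it helps to know that nagios looks at the exit code of a plugin, not at the text: 0 is OK, 1 is WARNING and 2 is CRITICAL. The sketch below imitates that for the users check. It is not the real check_users, users_status is a made-up name, and I am assuming the "more than the threshold" comparison.

```shell
#!/bin/sh
# Rough imitation of how check_users turns a count and two thresholds into a
# nagios status. users_status is a hypothetical helper, not the real plugin.
# Exit codes follow the plugin convention: 0 = OK, 1 = WARNING, 2 = CRITICAL.
users_status() {
    count=$1
    warn=$2
    crit=$3
    if [ "$count" -gt "$crit" ]; then
        state=CRITICAL; code=2
    elif [ "$count" -gt "$warn" ]; then
        state=WARNING; code=1
    else
        state=OK; code=0
    fi
    # Text before "|" is for humans, text after it is performance data.
    echo "USERS $state - $count users currently logged in |users=$count;$warn;$crit;0"
    return "$code"
}

# Prints: USERS OK - 1 users currently logged in |users=1;5;10;0
users_status 1 5 10
```

After running a real plugin by hand, echo $? shows which of these states nagios would see.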

Finally, restart the NRPE server so the new config gets loaded:

/etc/init.d/nrpe restart
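Because every check has to exist on both sides, it is easy to end up with a service on the nagios server that has no matching command on the remote host. The sketch below compares the two files and prints whatever is missing. missing_nrpe_commands is a made-up name, and it assumes you have copied one of the two files over so both are readable on the same machine.

```shell
#!/bin/sh
# List remote commands that services call via check_nrpe_1arg but that have no
# matching command[...] line in the NRPE config.
# missing_nrpe_commands is a hypothetical helper; both files must be local.
missing_nrpe_commands() {
    services_cfg=$1   # e.g. a copy of /etc/nagios/conf.d/services.cfg
    nrpe_cfg=$2       # e.g. /etc/nrpe.d/nrpe_local.cfg

    # Pull out the name after "check_nrpe_1arg!" from every check_command line.
    sed -n 's/^[[:space:]]*check_command[[:space:]]\{1,\}check_nrpe_1arg!\([^[:space:]]*\).*/\1/p' \
        "$services_cfg" |
    sort -u |
    while read -r name; do
        # Report the name unless an active command[name]= line exists.
        grep -q "^command\[$name]=" "$nrpe_cfg" || echo "$name"
    done
}

# Usage: no output means every remote check has a command definition.
#   missing_nrpe_commands /tmp/services.cfg /etc/nrpe.d/nrpe_local.cfg
```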

Back to the nagios server

The last thing we need to do is to add a host to the hostgroup we have just added the check for. Still following me? We added the remote check to a hostgroup. If we now add hosts to that hostgroup, the hosts will automatically get the checks from that hostgroup.

On your nagios server, create a file in /etc/nagios/conf.d. How you name it is up to you, but I use a scheme where I name the file after the host I define in it. So, say I want to add a host definition for alpha.example.net, I would use:

sudo vim /etc/nagios/conf.d/alpha.example.net.cfg

Feel free to add a servers/ folder in conf.d/: nagios loads all config files in that directory recursively. Now to the contents of the file. They should look like this:

define host {
host_name alpha
alias alpha.example.net
hostgroups generic-servers
address 192.0.2.21
use generic-host
}

The lines host_name and alias are used to identify the host within nagios. Contrary to what you might expect, the host_name is the short version and the alias the long version of the host identification.

The hostgroups line is similar to what we’ve seen before. We have now added this host to just one hostgroup, the one we defined earlier. However, a host can belong to many more hostgroups. You can add it to more by comma-separating a list of hostgroups, like:

hostgroups generic-servers,http-servers

The address refers to the IP address of the remote host. This may also be a fully qualified domain name, so this is valid as well:

address alpha.example.net

The use statement has been explained before. Feel free to check the definition of generic-host and figure out what it does; it should lead to a better understanding of the host definition.
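Once you have more than a few hosts, typing these blocks by hand gets old. Here is a minimal sketch that prints a host definition from four arguments, using the generic-servers hostgroup our services are attached to. make_host_cfg is a name I made up, and the second host in the usage comment is purely an example.

```shell
#!/bin/sh
# Print a host definition in the same shape as the one shown above.
# make_host_cfg is a hypothetical helper; arguments are short name,
# long name (alias), address and a comma-separated list of hostgroups.
make_host_cfg() {
    cat <<EOF
define host {
host_name $1
alias $2
hostgroups $4
address $3
use generic-host
}
EOF
}

make_host_cfg alpha alpha.example.net 192.0.2.21 generic-servers

# Usage: write one file per host, following the naming scheme from above.
#   make_host_cfg beta beta.example.net 192.0.2.22 generic-servers,http-servers \
#       > /etc/nagios/conf.d/beta.example.net.cfg
```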

Now we’ve defined the host, restart nagios on the nagios server:

/etc/init.d/nagios restart

…and you’re all set! If you now go to your nagios web interface, you should see the remote host and the services we’ve defined! You can now add all your hosts to nagios and have an overview of the full status of all of your servers!
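One habit worth picking up: let nagios verify its configuration with the -v option before you restart it, so a typo in a new host file doesn’t take your monitoring down. The wrapper below is only a sketch. safe_restart and the three variables are names I made up, and it assumes the nagios binary is in your PATH.

```shell
#!/bin/sh
# Restart nagios only if its configuration check passes.
# safe_restart and the variables below are hypothetical names for this sketch.
NAGIOS_BIN="${NAGIOS_BIN:-nagios}"
NAGIOS_CFG="${NAGIOS_CFG:-/etc/nagios/nagios.cfg}"
NAGIOS_RESTART="${NAGIOS_RESTART:-/etc/init.d/nagios restart}"

safe_restart() {
    # "nagios -v <main config>" parses the whole configuration and exits
    # non-zero when it finds errors.
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG" >/dev/null 2>&1; then
        $NAGIOS_RESTART   # intentionally unquoted: command plus its argument
    else
        echo "config check failed, not restarting" >&2
        return 1
    fi
}

# Usage on the nagios server (as root):
#   safe_restart
```

If the check fails, run nagios -v /etc/nagios/nagios.cfg by hand to see the actual error messages.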

Full list of service definitions

For reference, the full list of service definitions:

# NRPE Services
define service {
hostgroup_name generic-servers
service_description Current Users NRPE
check_command check_nrpe_1arg!check_users
use generic-service
notification_interval 0
}

define service {
hostgroup_name generic-servers
service_description Current Load NRPE
check_command check_nrpe_1arg!check_load
use generic-service
notification_interval 0
}

define service {
hostgroup_name generic-servers
service_description Disk Space NRPE
check_command check_nrpe_1arg!check_all_disks
use generic-service
notification_interval 0
}

define service {
hostgroup_name generic-servers
service_description Zombie Processes NRPE
check_command check_nrpe_1arg!check_zombie_procs
use generic-service
notification_interval 0
}

define service {
hostgroup_name generic-servers
service_description Total Processes NRPE
check_command check_nrpe_1arg!check_total_procs
use generic-service
notification_interval 0
}

define service {
hostgroup_name generic-servers
service_description Swap NRPE
check_command check_nrpe_1arg!check_swap
use generic-service
notification_interval 0
}

mpkossen

11 Comments

  1. I think you should list a tut on observium too, nice piece of stuff !

    July 27, 2013 @ 2:14 pm | Reply
  2. I use Glances.O(∩_∩)O~

    July 27, 2013 @ 2:24 pm | Reply
  3. DalComp:

    Thanks mpkossen for writing and Liam for proof-reading. :)

    July 27, 2013 @ 3:51 pm | Reply
  4. Shawn_ky:

    Awesome!!! Thanks

    July 28, 2013 @ 5:23 am | Reply
  5. Shawn_Ky:

    Looks like in some places you have general-servers and generic-servers….

    July 28, 2013 @ 11:45 pm | Reply
  6. Nagios is a great tool, but as there are too many people it can be a problem to support it inspite of the fact it is open source. May be it is better sometimes to pay a little bit and to have other tools which are not bad e.g. Anturis or Cacti.

    August 4, 2013 @ 5:11 pm | Reply
  7. Do not forget to add nrpe: ALL and allow single IP in /etc/hosts.allow and /etc/hosts.deny for better security.

    August 9, 2013 @ 6:08 pm | Reply
  9. henry:

    great!

    December 2, 2014 @ 3:31 am | Reply
