LowEndBox - Cheap VPS, Hosting and Dedicated Server Deals

Remote server monitoring with nagios

There are many ways to remotely monitor your servers, but nagios is one of the most configurable and flexible ones. Once you have set up nagios, your localhost is being checked. But when there are real issues with your localhost, nagios probably isn’t going to send you any messages, because it runs on that same machine. That’s why you should set up remote monitoring: a nagios server that checks all your servers. Ideally, you set up multiple nagios servers or monitor your single nagios server as well. This article assumes a single nagios server.

So, remote monitoring. Nagios has a plugin named Nagios Remote Plugin Executor (NRPE for short). This plugin enables nagios to execute check scripts on a remote host. What it checks, how often it checks and when it sends out a warning can all be configured per server. In order for these remote checks to work, they need to be defined in two places:

  • Nagios server: so it knows what to check and how to check it
  • Remote host: so it knows what the nagios server is allowed to check

This article assumes you have nagios up and running for localhost. It also assumes you use Ubuntu or Debian (though it may very well work on other Linux distributions as well). The first thing to do is to make sure the NRPE plugin is installed on the nagios server:

sudo apt-get install nagios-nrpe-plugin

Once that is done, check that a number of files (plugins) are listed:

ls -al /usr/lib/nagios/plugins/

Next, you usually need to determine what services you would like to check and whether there is a plugin for it. You can write your own nagios plugins in whatever language you want. I’ve worked with some PHP nagios plugins, so it’s really not that hard. However, that’s something for another article and we’re going to work with some of the stock plugins now. I’m going to use the following plugins in my examples:

  • Current Users (check_users)
  • Current Load (check_load)
  • Disk Space (check_all_disks)
  • Zombie Processes (check_zombie_procs)
  • Total Processes (check_total_procs)
  • Swap (check_swap)

Nagios server

The first thing you need to do is define the services on the nagios server. The nagios server needs to know what services to check for a remote host and how it needs to do that. I’ve added my services to /etc/nagios3/conf.d/services_nagios2.cfg (which is the default file on Ubuntu). The file consists of blocks that look like:

define service {
hostgroup_name local-servers
service_description SSH
check_command check_ssh
use generic-service
notification_interval 0
}

We’re going to make such a block for every remote check we need. I’m going to show one example here and attach the full list of definitions to this article (for reference or copy-paste). Here is the first remote service definition:

define service {
hostgroup_name generic-servers
service_description Current Users NRPE
check_command check_nrpe_1arg!check_users
use generic-service
notification_interval 0
}

I’m going to explain this block line by line, so you really understand what it all does. The ‘define service’ line speaks for itself, so I’ll skip that.

hostgroup_name generic-servers

This assigns this service to a hostgroup. The hostgroup itself has to exist: nagios does not create it for you, so if it isn’t defined yet, add a hostgroup definition with this name. You can later add a host to that hostgroup and nagios will automatically execute this check for that host (and all other hosts in that hostgroup).
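In case nagios complains that the hostgroup cannot be found, a definition along these lines could be added, for example to /etc/nagios3/conf.d/hostgroups_nagios2.cfg. This is only a sketch: the alias text is made up for illustration, and no members line is included because hosts can join the group through their own hostgroups line.

```
# Hostgroup that the NRPE service definitions are attached to
define hostgroup {
hostgroup_name generic-servers
alias Generic servers
}
```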

service_description Current Users NRPE

This describes the service and is used for displaying this service on both the nagios web interface and in any nagios software you could use (like nagstamon or aNag).

check_command check_nrpe_1arg!check_users

This defines the actual command of the check. The exclamation mark separates the command name from its arguments. The first part, ‘check_nrpe_1arg’, is a command that takes exactly one argument: the name of the check to run on the remote host, without any further arguments for that check. Here that one argument is ‘check_users’, so this line makes sure that the ‘check_users’ command is executed on the remote host.
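For reference, on Debian and Ubuntu the nagios-nrpe-plugin package usually ships this command definition in /etc/nagios-plugins/config/check_nrpe.cfg. The path and exact contents are an assumption here and may differ per version, but the definition tends to look roughly like this:

```
# Runs the remote command named in $ARG1$, without passing it any arguments
define command {
command_name check_nrpe_1arg
command_line /usr/lib/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}
```

$HOSTADDRESS$ is filled in from the address of the host being checked, and $ARG1$ is whatever follows the exclamation mark in check_command.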

use generic-service

The use keyword enables you to include templates. This time it’s the generic-service template. A template is actually a normal nagios configuration, in this case a service definition like the one we’re adding now. I use the default template that is installed with nagios (on Ubuntu). To see the service definition:

sudo less /etc/nagios3/conf.d/generic-service_nagios2.cfg

This template defines a lot of options that you now don’t have to define with every service definition you write. So with the use keyword, you can easily share basic definitions between groups of services. Now back to the last line:

notification_interval 0

This defines the number of time units before re-sending a notification in case of a problem with this service. A time unit is defined as interval_length (defined in /etc/nagios3/nagios.cfg), which defaults to 60 seconds. The number here is thus in minutes (x times 60 seconds). Set this to 0 to receive just one notification.
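The arithmetic can be sketched in a few lines of shell. The function name renotify_seconds is made up for this example, and the 60 is the default interval_length mentioned above.

```shell
# Hypothetical helper: convert a notification_interval into seconds.
# $1 = notification_interval, $2 = interval_length (optional, defaults to 60)
renotify_seconds() {
    echo $(( $1 * ${2:-60} ))
}
```

For example, renotify_seconds 30 prints 1800, so a notification_interval of 30 would mean a reminder every half hour with the default interval_length.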

Now that we’ve established what the definition does, you can add service definitions for all the plugins I listed and we can move on to the remote host. You may also skip this for now and add more checks later.
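Since the six definitions only differ in their description and command name, they could also be generated. The following is a minimal sketch, not part of nagios itself: the function names are made up, and it simply prints blocks in the same format as the example above.

```shell
# Hypothetical helper: print one NRPE service definition.
# $1 = human-readable description, $2 = command name on the remote host
nrpe_service_block() {
    printf 'define service {\n'
    printf 'hostgroup_name generic-servers\n'
    printf 'service_description %s NRPE\n' "$1"
    printf 'check_command check_nrpe_1arg!%s\n' "$2"
    printf 'use generic-service\n'
    printf 'notification_interval 0\n'
    printf '}\n\n'
}

# Print the blocks for all six checks used in this article.
nrpe_all_service_blocks() {
    nrpe_service_block "Current Users" check_users
    nrpe_service_block "Current Load" check_load
    nrpe_service_block "Disk Space" check_all_disks
    nrpe_service_block "Zombie Processes" check_zombie_procs
    nrpe_service_block "Total Processes" check_total_procs
    nrpe_service_block "Swap" check_swap
}
```

Calling nrpe_all_service_blocks prints the six blocks to the terminal, so you can review the output before pasting it into your services file.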

Remote host

Now on to the remote host we want to check. We start by installing the NRPE server (which includes the default set of plugins):

sudo apt-get install nagios-nrpe-server

Again, make sure the plugins are actually there:

ls -al /usr/lib/nagios/plugins/

Now it is important to make sure the nagios server is allowed to connect to the remote host. First of all, if you use a firewall, you need to open up TCP port 5666 for the nagios server (NRPE does not use UDP). Second, you need to define the allowed_hosts parameter in /etc/nagios/nrpe_local.cfg. Notice the ‘_local’, because there is also an nrpe.cfg, which contains the default configuration. You can override these configuration options in the nrpe_local.cfg file without getting merge issues when upgrading.

Make sure the nrpe_local.cfg file looks something like this:

######################################
# Do any local nrpe configuration here
######################################
allowed_hosts=127.0.0.1,192.0.2.20

I always leave 127.0.0.1 there. Although it is not strictly necessary, it could be useful for local testing. You can add as many IP addresses as you deem necessary there. I just added one IP, the one of the nagios server.
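A quick way to double-check that an address really is in the list is a small shell test. This is only a sketch: is_allowed_host is a made-up name, and it does a plain string comparison per entry, so anything other than literal IP addresses in the list is not handled.

```shell
# Hypothetical helper: succeed if the IP in $2 is listed in the
# allowed_hosts line given in $1. Exact string match per entry only.
is_allowed_host() {
    list=${1#allowed_hosts=}
    old_ifs=$IFS
    IFS=,
    for entry in $list; do
        if [ "$entry" = "$2" ]; then
            IFS=$old_ifs
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}
```

With the line above, is_allowed_host 'allowed_hosts=127.0.0.1,192.0.2.20' 192.0.2.20 succeeds, while any address that is not listed fails.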

After setting this and restarting the nagios-nrpe-server, the nagios server can connect to the remote host. However, it cannot execute any checks yet. In order to be able to execute checks, the checks that were defined on the nagios server need to be defined here as well. Open up the file /etc/nagios/nrpe_local.cfg again and add the following lines (you can comment out any you don’t need with #):

command[check_users]=/usr/lib/nagios/plugins/check_users -w 5 -c 10
command[check_load]=/usr/lib/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
command[check_all_disks]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10%
command[check_zombie_procs]=/usr/lib/nagios/plugins/check_procs -w 5 -c 10 -s Z
command[check_total_procs]=/usr/lib/nagios/plugins/check_procs -w 150 -c 200
command[check_swap]=/usr/lib/nagios/plugins/check_swap -w 50% -c 25%

By adding these lines, we have mapped the checks defined on the nagios server (the part after check_nrpe_1arg!) to actual checks on the remote host. Let’s look at the first line:

command[check_users]=/usr/lib/nagios/plugins/check_users -w 5 -c 10

This maps check_users to /usr/lib/nagios/plugins/check_users. You can even test the check by executing /usr/lib/nagios/plugins/check_users directly. If you do that, it gives a nice warning that you forgot two parameters:

Usage:
check_users -w <users> -c <users>

In the command definition, we’ve set these to 5 and 10. The -w sets the warning threshold and the -c sets the critical threshold. If more than five users are logged in at the moment the nagios server runs the check, it triggers a warning. If more than 10 users are logged in, it triggers a critical alert.
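To make the threshold logic concrete, here is a small sketch in shell. It is not the plugin’s actual code: users_state is a made-up name, and it assumes the thresholds mean ‘more than’, which is how the plugin’s help text describes them.

```shell
# Sketch of the threshold logic, not the plugin's actual code.
# $1 = logged-in users, $2 = warning threshold, $3 = critical threshold
users_state() {
    if [ "$1" -gt "$3" ]; then
        echo CRITICAL
    elif [ "$1" -gt "$2" ]; then
        echo WARNING
    else
        echo OK
    fi
}
```

With the thresholds from the command definition, users_state 1 5 10 prints OK, users_state 6 5 10 prints WARNING and users_state 11 5 10 prints CRITICAL.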

Now run the check again to see the actual result:

/usr/lib/nagios/plugins/check_users -w 5 -c 10

It should now say:

USERS OK - 1 users currently logged in |users=1;5;10;0
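The part after the | is performance data. Following the usual nagios plugin convention, it reads as label=value;warn;crit;min, so here you can recognise the 5 and 10 from the command definition. A rough sketch of picking it apart in shell (both function names are made up for this example):

```shell
# Hypothetical helpers for picking apart the performance data of a plugin line.

# Print everything after the first "|" of the output line in $1.
perfdata_of() {
    case $1 in
        *'|'*) printf '%s\n' "${1#*|}" ;;
        *)     return 1 ;;
    esac
}

# Split one item of the form label=value;warn;crit;min into named fields.
perfdata_fields() {
    label=${1%%=*}
    rest=${1#*=}
    old_ifs=$IFS
    IFS=';'
    set -- $rest        # word-split the values on ";"
    IFS=$old_ifs
    printf 'label=%s value=%s warn=%s crit=%s min=%s\n' "$label" "$1" "$2" "$3" "$4"
}
```

Feeding the output line above through perfdata_of and then perfdata_fields gives label=users value=1 warn=5 crit=10 min=0.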

As you can see, every command has different threshold values. Notice the % with check_all_disks and check_swap. If you ever add a check, be sure to run it directly on the remote host to see if it returns the expected result. It’s never a waste to play with these checks a bit and fine-tune them.

Finally, restart the NRPE server so the new config gets loaded:

sudo service nagios-nrpe-server restart
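At this point it can help to confirm from the nagios server that the remote host actually answers. A minimal sketch, assuming check_nrpe lives in the usual Debian/Ubuntu plugin directory; the wrapper name nrpe_check and the CHECK_NRPE variable are made up for this example, while -H and -c are the host and command options of check_nrpe.

```shell
# Hypothetical wrapper around check_nrpe, to be run on the nagios server.
# CHECK_NRPE is an assumption: the usual plugin path on Debian/Ubuntu.
# $1 = address of the remote host, $2 = command name from nrpe_local.cfg
nrpe_check() {
    "${CHECK_NRPE:-/usr/lib/nagios/plugins/check_nrpe}" -H "$1" -c "$2"
}
```

Running nrpe_check 192.0.2.21 check_users on the nagios server should print the same kind of USERS line as the local test did. If it doesn’t, the firewall and the allowed_hosts line are the first things to look at.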

Back to the nagios server

The last thing we need to do is to add a host to the hostgroup we have just added the check for. Still following me? We added the remote check to a hostgroup. If we now add hosts to that hostgroup, the hosts will automatically get the checks from that hostgroup.

On your nagios server, create a file in /etc/nagios3/conf.d. How you name it is up to you, but I use a scheme where I name the file after the host I define in it. So, say I want to add a host definition for alpha.example.net, I would use:

sudo vim /etc/nagios3/conf.d/alpha.example.net.cfg

Feel free to add a servers/ folder in conf.d/: nagios loads all config files in that directory recursively. Now to the contents of the file. They should look like this:

define host {
host_name alpha
alias alpha.example.net
hostgroups generic-servers
address 192.0.2.21
use generic-host
}

The lines host_name and alias are used to identify the host within nagios. Contrary to what you might expect, the host_name is the short version and the alias the long version of the host identification.

The hostgroups line is similar to what we’ve seen before. We have now added this host to just one hostgroup, the one we defined earlier. However, a host can belong to many more hostgroups. You can add it to more by comma-separating a list of hostgroups, like:

hostgroups generic-servers,http-servers,critical-servers

The address refers to the IP address of the remote host. This may also be a fully qualified domain name, so this is valid as well:

address alpha.example.net

The use statement has been explained before. Feel free to check the definition of generic-host and figure out what it does; it should lead to a better understanding of the host definition.

Now that we’ve defined the host, reload (or restart, but reloading is faster) the nagios server:

sudo service nagios3 reload

…and you’re all set! If you now go to your nagios web interface, you should see the remote host and the services we’ve defined! You can now add all your hosts to nagios and have an overview of the full status of all of your servers!
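If the reload fails or the host does not show up, it can help to validate the configuration before reloading. A minimal sketch, assuming the Ubuntu nagios3 package (binary nagios3, main config /etc/nagios3/nagios.cfg); the function name and the three override variables are made up for this example, and it needs to run as root.

```shell
# Validate the configuration first; reload only if validation succeeds.
# NAGIOS_BIN, NAGIOS_CFG and RELOAD_CMD are assumptions matching the
# Ubuntu nagios3 package and can be overridden.
check_then_reload() {
    "${NAGIOS_BIN:-nagios3}" -v "${NAGIOS_CFG:-/etc/nagios3/nagios.cfg}" || {
        echo "config check failed, not reloading" >&2
        return 1
    }
    ${RELOAD_CMD:-service nagios3 reload}
}
```

The -v run prints the file and line number of any definition nagios cannot resolve, such as a hostgroup name that doesn’t exist.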

Full list of service definitions

For reference, the full list of service definitions:

# NRPE Services
define service {
hostgroup_name generic-servers
service_description Current Users NRPE
check_command check_nrpe_1arg!check_users
use generic-service
notification_interval 0
}

define service {
hostgroup_name generic-servers
service_description Current Load NRPE
check_command check_nrpe_1arg!check_load
use generic-service
notification_interval 0
}

define service {
hostgroup_name generic-servers
service_description Disk Space NRPE
check_command check_nrpe_1arg!check_all_disks
use generic-service
notification_interval 0
}

define service {
hostgroup_name generic-servers
service_description Zombie Processes NRPE
check_command check_nrpe_1arg!check_zombie_procs
use generic-service
notification_interval 0
}

define service {
hostgroup_name generic-servers
service_description Total Processes NRPE
check_command check_nrpe_1arg!check_total_procs
use generic-service
notification_interval 0
}

define service {
hostgroup_name generic-servers
service_description Swap NRPE
check_command check_nrpe_1arg!check_swap
use generic-service
notification_interval 0
}

mpkossen

20 Comments

  1. Maarten Kossen:

    Here’s my second article guys, I hope you like it. This one was written before the previous one, so this one doesn’t yet include CentOS instructions. Any articles I write from now on will have those included as well. I may even post them separately later, if there is enough demand for it :-)

    May 11, 2013 @ 11:58 am | Reply
  2. Really nice and useful tutorial :). Awesome work mate.

    May 11, 2013 @ 12:39 pm | Reply
  3. Maarten,

    Keep up the good work!

    May 11, 2013 @ 3:47 pm | Reply
  4. Steve Hill:

    Have you used nagios xi yet? We recently implemented it and love it. It came as preconfigured virtual machine and we had our config ported over, up and running within an hour.

    May 11, 2013 @ 3:49 pm | Reply
  5. Steve Hill:

    P.S. Not sure if it will still work (looks like it will), but here is the link we used for 10% off: http://www.nagios.com/nagiosxi10?ref=JB10

    May 11, 2013 @ 3:54 pm | Reply
  6. Liam: Hello. I am the webmaster of http://www.vpsspy.net. I’m honored to reprint this article, translated into Chinese. Thank you for providing such a good article. I hope we can become friends.

    Chinese address, http://www.vpsspy.net/437.html

    May 11, 2013 @ 4:11 pm | Reply
  7. Great stuff. Full of details and explanations. Keep it up.

    May 12, 2013 @ 6:52 am | Reply
  8. The main issue with Nagios is that it doesn’t come with graphing by default.

    May 12, 2013 @ 7:59 pm | Reply
    • Maarten Kossen:

      Yeah, it isn’t perfect. I may write something about icinga in the future :-)

      May 14, 2013 @ 4:05 am | Reply
  9. Rolz:

    Hey, thanks! i’ve been using zabbix for a while but this gave me enough motivation to switch to nagios!

    It’s working great

    May 13, 2013 @ 7:49 am | Reply
  10. Maarten Kossen:

    Thanks guys, for all your positive feedback! Much appreciated!

    May 14, 2013 @ 4:06 am | Reply
  11. Again nice article Maarten. Maybe you could mention the folder structure of the /etc/nagios/ folder, once you get bigger everything in one file is not really readable. For example

    -/etc/nagios/conf.d/
    --/hosts/
    ---/webservers/
    ----/server1.example.com.cfg
    --/services/
    ---/http/
    ----/sparklingnetwork.cfg
    --/contacts/
    ---/ops/
    ----/maarten.cfg
    --/templates

    and so on..

    But still, good article! Well done.

    May 15, 2013 @ 4:06 am | Reply
    • Maarten Kossen:

      Good point! In my next nagios guide (there’s going to be another one in the future) I’ll make sure I’ll include this. Thanks!

      May 15, 2013 @ 12:07 pm | Reply
  12. Use the nagios configurator generator (still immature but works for me): http://tools.asimz.com/nagios-configurator/

    On a debian/ubuntu server I do

    ~: wget labs.asimz.com/setup.sh

    then the following to install the server

    ~: bash setup.sh nagios

    then the following to install the client

    ~: bash setup.sh nagiosclient

    I know the server IP is hardcoded and there are certain plugins and things I do that may not be standard but it works for me :)

    May 15, 2013 @ 6:12 pm | Reply
  13. Vasim:

    Hi,

    we are using Nagios® Core™ 3.3.1 in our network for monitoring.

    we have changed and scrapped a few Servers in our network. when i comment using # in nagios.cfg file, for some servers it works that is nagios stops monitoring those servers which is expected behaviour. whereas, for some servers it does not work nagios still monitors those servers and notifies them as status DOWN on http URL of Nagios monitoring Server.

    Kindly advise what needs to be done to remove these servers from nagios monitoring, or why the behaviour is different for these servers.

    Thanks & Regards,
    Vasim

    June 10, 2013 @ 6:52 pm | Reply
  14. sanjay:

    Hello,

    I am using CentOS 5.9. Can I run the nagios server on another port alongside my other website? At the moment the nagios web front end and my website’s front end conflict.

    Thanks you………

    August 7, 2013 @ 1:33 pm | Reply
  15. abhishek kushwahaa:

    As I am new for Nagios tool, I want to configure Linux server on Nagios so what changes i have to do in these files.
    command.cfg
    localhost.cfg
    service.cfg

    or should i have to add new separate files for this regarding configuration.

    Thank you

    November 28, 2016 @ 12:04 pm | Reply
  16. hringriin:

    Hi, how can I check that worked for me? In the webinterface I can only see my localhost and the services for localhost. If I add a hostgroup in the file `/etc/nagios3/conf.d/hostgroups_nagios2.cfg` and add for hostgroup_name and alias “general-servers” and for members my server “example.net” it throws an error

    Error: Could not find any host matching 'example.net' (config file '/etc/nagios3/conf.d/hostgroups_nagios2.cfg', starting on line 31)
    Error: Could not expand members specified in hostgroup (config file '/etc/nagios3/conf.d/hostgroups_nagios2.cfg', starting on line 31)
    

    If I do not add the new hostgroups, nothing happens. It is very hard to understand the file structure in addition to the structure of hosts, hostgroups, services, service-groups … May someone provide any help or links or comment to that situation?

    Regards!
    hringriin

    January 7, 2018 @ 12:57 pm | Reply
