====== Nagios with Nginx on Fedora (or Amazon Linux AMI) ======

I was browsing around Amazon EC2 Infrastructure when I decided to sign up for a micro instance, which is free, and experiment with the cloud a bit.

Once past the deciphering period of both the Amazon terminology (load your EC2, attach an EBS, don't forget the CloudWatch, ...) and the management interface, I ended up with a shell on what seems to be a revamped Fedora with Amazon's logo.

Since the bandwidth is fairly limited (15 GB a month), I don't want to enable bandwidth-hungry services on this system, but I still feel like I can use it for something useful.

So I've decided to re-install a nagios system to supervise the linuxwall domain (4+ servers). This article is about installing and configuring Nagios with Nginx on a Fedora style system.

<note>Note for French readers: there is a similar article in French [[http://wiki.linuxwall.info/doku.php/fr:ressources:dossiers:supervision:nagios3|here]]. This one might be more up to date though.</note>

===== Installing Nagios =====

This is going to be basic, since Fedora comes with yum:

<code>
# yum install nagios nagios-plugins-all
</code>

We can start the service and enable it at runlevel 3 with the following commands:

<code>

# service nagios start
Starting nagios: done.

# chkconfig --level 3 nagios on
# chkconfig --list |grep nagios
nagios         	0:off	1:off	2:off	3:on	4:off	5:off	6:off

</code>

The installation creates the directory **/etc/nagios**. The file **nagios.cfg** contains the core of the configuration; we will get back to it later. You can also take a look at **passwd**, which lists the users allowed to connect to the Nagios web interface. By default, this file only contains a //nagiosadmin// user:

<code>
# cat passwd 
nagiosadmin:Oqd8214Hd37q1hd
</code>

The format of this file follows the htpasswd syntax: user:crypt(password). So a simple perl script can generate it (or the htpasswd tool if you have apache installed).

<code perl>
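#!/usr/bin/perl
# Print an htpasswd-style line: <username>:<crypt(password)>
use strict;
use warnings;

if ( @ARGV != 2 ) {
	print "usage: ./htpasswd.pl <username> <password>\n";
	exit 1;
}
# Use a random two-character salt: salting with the password itself
# would leak its first two characters in the resulting hash.
my @chars = ( '.', '/', 0 .. 9, 'A' .. 'Z', 'a' .. 'z' );
my $salt  = join '', map { $chars[ rand @chars ] } 1 .. 2;
print $ARGV[0] . ":" . crypt( $ARGV[1], $salt ) . "\n";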
</code>

To add a user, run it as follows:

<code>
# perl /root/htpasswd.pl toto superpassword >> /etc/nagios/passwd
</code>


===== Configuration of Nginx to serve the Nagios Interface =====

Nginx is a very powerful web server, but to communicate with Nagios it requires some additional software.

Nginx will handle incoming HTTP(S) requests and serve the response. It needs to pass the incoming requests to Nagios in order for it to generate a response (an HTML page) that will be returned to the client.

The Nagios web interface is composed of a set of C programs (CGI) that need to be launched to produce the HTML code. Nginx cannot launch them itself. So we need an external program, a wrapper, that is in charge of launching the Nagios CGI programs. That wrapper, in turn, needs to be started and attached to a socket by another program, a spawner; Nginx then reaches the wrapper through that socket. It's a bit complex, but the illustration below should help you understand it.

The wrapper is **[[https://github.com/gnosek/fcgiwrap|fcgiwrap]]**. And the spawner is [[http://redmine.lighttpd.net/projects/spawn-fcgi|spawn-fcgi]].

{{:en:ressources:dossiers:supervision:nagios.png|}}
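
To make the wrapper's role more concrete, here is a minimal Python sketch of what a CGI wrapper conceptually does once a request reaches it: it exports the request details as environment variables, runs the CGI program, and splits the output into headers and body. This is not fcgiwrap's actual code. It ignores the FastCGI protocol entirely, assumes the program separates headers from body with a plain blank line, and the function name ''run_cgi'' is made up for the illustration.

```python
import os
import subprocess

def run_cgi(script_filename, remote_user, query_string=""):
    """Run one CGI program and split its output into headers and body."""
    # CGI programs receive the request details through environment variables.
    env = dict(os.environ,
               SCRIPT_FILENAME=script_filename,
               REMOTE_USER=remote_user,
               QUERY_STRING=query_string,
               REQUEST_METHOD="GET")
    result = subprocess.run([script_filename], env=env,
                            stdout=subprocess.PIPE, check=True)
    # A CGI response is a header block, a blank line, then the document.
    raw_headers, _, body = result.stdout.partition(b"\n\n")
    headers = {}
    for line in raw_headers.decode().splitlines():
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers, body
```

On the Nagios box, something like ''run_cgi("/usr/lib/nagios/cgi-bin/tac.cgi", "nagiosadmin")'' would be the rough equivalent of one request going through the chain.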

The programs of the Nagios web interface are located in **/usr/lib/nagios/cgi-bin**.

<code>
# ls /usr/lib/nagios/cgi-bin/
avail.cgi    config.cgi     histogram.cgi  notifications.cgi  
showlog.cgi  statusmap.cgi  statuswrl.cgi  tac.cgi
cmd.cgi      extinfo.cgi    history.cgi    outages.cgi        
status.cgi   statuswml.cgi  summary.cgi    trends.cgi
</code>

==== Installation of fcgiwrap ====

Unfortunately, fcgiwrap is not shipped with Amazon's packaged version of Fedora, so we cannot use yum here.

Apparently, Amazon removed a bunch of packages from the regular Fedora repositories when they built their own version. As a result, libfcgi is not in the repository either, and fcgiwrap needs it.

I worked around that by building fcgiwrap on a Debian i686 box and then transferring the library and the binary.

The tar is here (32-bit systems only): {{:en:ressources:dossiers:supervision:fcgiwrap-libfcgi.tar|}}

Get the tar and extract it. Then move fcgiwrap to /usr/bin and libfcgi.so.0.0.0 to /usr/lib as follows:

<code>
# tar -xvf fcgiwrap-libfcgi.tar
# mv fcgiwrap /usr/bin/
# chown root:root /usr/bin/fcgiwrap

# mv usr/lib/libfcgi.so.0.0.0 /usr/lib/
# ln -s /usr/lib/libfcgi.so.0.0.0 /usr/lib/libfcgi.so
# ln -s /usr/lib/libfcgi.so.0.0.0 /usr/lib/libfcgi.so.0

# /usr/bin/fcgiwrap 
Status: 403 Forbidden
Content-type: text/plain

Cannot get script name, are DOCUMENT_ROOT and SCRIPT_NAME (or SCRIPT_FILENAME) set and is the script executable?
403
</code>

The last command shows that fcgiwrap starts properly. Now the next step is to configure spawn-fcgi to call fcgiwrap.

==== Installation of spawn-fcgi ====

When installing spawn-fcgi with **yum install spawn-fcgi**, a startup script is created in /etc/init.d/.
spawn-fcgi needs a list of options to know what to do when it is called. Those options are stored in **/etc/sysconfig/spawn-fcgi**.

<file>
# You must set some working options before the "spawn-fcgi" service will work.
# If SOCKET points to a file, then this file is cleaned up by the init script.
#
# See spawn-fcgi(1) for all possible options.
#
# Example :
#SOCKET=/var/run/php-fcgi.sock
#OPTIONS="-u apache -g apache -s $SOCKET -S -M 0600 -C 32 -F 1 -P /var/run/spawn-fcgi.pid -- /usr/bin/php-cgi"
</file>

What we want is to enable a TCP socket on localhost:9001 that will call /usr/bin/fcgiwrap when woken up.
For that purpose, we set the following option line in the file:

<file>
OPTIONS="-u nginx -g nginx -a 127.0.0.1 -p 9001 -f /usr/bin/fcgiwrap -P /var/run/spawn-fcgi.pid"
</file>

We can now start the spawn-fcgi service and verify that it is listening on port 9001.

<code>
# service spawn-fcgi start
Starting spawn-fcgi:                                       [  OK  ]

# netstat -taupen |grep LISTEN |grep 9001
tcp        0      0 127.0.0.1:9001              0.0.0.0:*                   LISTEN      0          6075       3092/fcgiwrap       

</code>
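
If you prefer to script this verification rather than read the netstat output, a small Python check along these lines could do it. It is only a sketch: it merely confirms that something accepts TCP connections on the port, not that the listener is really fcgiwrap, and the function name ''is_listening'' is made up.

```python
import socket

def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, unreachable host, ...
        return False
```

With the configuration above, ''is_listening("127.0.0.1", 9001)'' should return True once the service is started.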

To add spawn-fcgi to the default runlevel, use chkconfig.
<code>
# chkconfig --level 3 spawn-fcgi on
</code>

==== Almost done, some PHP ====

Apparently, the latest version of Nagios includes some PHP code. Like the C programs, the PHP code needs to be executed through spawn-fcgi, this time with php-cgi instead of fcgiwrap.

So, to enable this, we simply copy the init script of spawn-fcgi and create a new options file that launches php-cgi and listens on TCP port 9002.

Here is how it's done:

<code>
# cp /etc/init.d/spawn-fcgi /etc/init.d/spawn-fcgi-php
# vim /etc/init.d/spawn-fcgi-php

[edit lines 24/25 to replace spawn-fcgi with spawn-fcgi-php]

# cp /etc/sysconfig/spawn-fcgi /etc/sysconfig/spawn-fcgi-php
# vim /etc/sysconfig/spawn-fcgi-php

[modify the OPTION line so it matches the line below]
OPTIONS="-u nginx -g nginx -a 127.0.0.1 -p 9002 -f /usr/bin/php-cgi -P /var/run/spawn-fcgi-php.pid"

# service spawn-fcgi-php start
Starting spawn-fcgi-php:                                   [  OK  ]
# netstat -taupen |grep 9002
tcp        0      0 127.0.0.1:9002              0.0.0.0:*                   LISTEN      0          12930      10364/php-cgi
</code>

And now add the new service to chkconfig and enable it at runlevel 3.

<code>
# chkconfig --add spawn-fcgi-php
# chkconfig --level 3 spawn-fcgi-php on
</code>

That's all for now, let's move to Nginx configuration.

==== Back to Nginx ====

Alright, now we have a method to execute the C and PHP scripts from the Nagios web interface through FastCGI. Nginx, however, is not configured yet. We need to tell it where to find those scripts and how to handle them.

<note warning>Before anything else, you **MUST** check the permissions of the files in **/etc/nagios** and **/usr/share/nagios**. In particular, /etc/nagios/passwd and /usr/share/nagios/html/config.inc.php were unreadable to the group //nginx// (permissions were given to apache).</note>

There are two locations: the envelope of the interface and the libraries.

So, first, we handle the PHP and HTML files that compose the envelope. Those files are stored in **/usr/share/nagios/html**. The Nginx configuration is done in **/etc/nginx/conf.d/ssl.conf**:

<file>
    location / {
        auth_basic "Access to the web interface is restricted";
        auth_basic_user_file /etc/nagios/passwd;

        rewrite ^/nagios/(.*) /$1 break;

        root /usr/share/nagios/html;
        index  index.php;
        include fastcgi_params;
        fastcgi_param  SCRIPT_FILENAME  $document_root$fastcgi_script_name;
        if ($uri ~ "\.php"){
            fastcgi_pass   127.0.0.1:9002;
        }

    }

</file>

Some explanation is necessary:
  * The authentication is required at the root of the server, and nginx bases it on the content of **/etc/nagios/passwd** that we looked at earlier.
  * The rewrite rule removes the unnecessary /nagios prefix from the URI when browsing the interface
  * We include the FastCGI parameters from the file **/etc/nginx/fastcgi_params** (take a look at it, it's interesting)
  * We add an additional parameter that gives the complete path of the script to execute
  * Finally, if the requested file name ends with .php, we send the request to php-cgi through the spawn-fcgi socket (php-cgi will return the HTML code to send to the client)


Now, the section that executes the libraries is somewhat similar, except that the files are located in /usr/lib and we send them to a different socket.
Also, Nagios wants to know who is connecting, so we pass the FastCGI parameters AUTH_USER and REMOTE_USER.
<file>
    location /nagios/cgi-bin/ {
        root /usr/lib/;
        include /etc/nginx/fastcgi_params;
        auth_basic "Restricted";
        auth_basic_user_file /etc/nagios/passwd;
        fastcgi_param  AUTH_USER $remote_user;
        fastcgi_param  REMOTE_USER $remote_user;
        if ($uri ~ "\.cgi$"){
            fastcgi_pass   127.0.0.1:9001;
        }
    }
</file>
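
To check your understanding of the two locations, the mapping from a request URI to a file on disk and a FastCGI backend can be modelled in a few lines of Python. This is a deliberately simplified model, not a description of how Nginx works internally: it ignores the index directive, query strings and authentication, and the function name ''route'' is made up.

```python
HTML_ROOT = "/usr/share/nagios/html"   # root of "location /"
CGI_ROOT = "/usr/lib"                  # root of "location /nagios/cgi-bin/"

def route(uri):
    """Return (file on disk, FastCGI backend or None) for a request URI."""
    if uri.startswith("/nagios/cgi-bin/"):
        # No rewrite here: the root is simply prepended to the full URI.
        backend = "127.0.0.1:9001" if uri.endswith(".cgi") else None
        return CGI_ROOT + uri, backend
    if uri.startswith("/nagios/"):
        # Same effect as: rewrite ^/nagios/(.*) /$1 break;
        uri = "/" + uri[len("/nagios/"):]
    backend = "127.0.0.1:9002" if uri.endswith(".php") else None
    return HTML_ROOT + uri, backend
```

For instance, ''route("/nagios/cgi-bin/status.cgi")'' yields the file /usr/lib/nagios/cgi-bin/status.cgi and the fcgiwrap socket, while ''route("/nagios/index.php")'' yields /usr/share/nagios/html/index.php and the php-cgi socket.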


The complete definition of the host looks like this:

<code>
server {
    listen       443;
    server_name  example_server;

    ssl                  on;
    ssl_certificate      /etc/ssl/certs/example_server/example_server.pem;
    ssl_certificate_key  /etc/ssl/certs/example_server/example_server.key;

    ssl_session_timeout  5m;

    ssl_protocols  SSLv3 TLSv1;
    ssl_ciphers  ALL:!ADH:!EXPORT56:RC4+RSA:+HIGH:!MEDIUM:!LOW:!SSLv2:+EXP;
    ssl_prefer_server_ciphers   on;

    location / {
        auth_basic "Access to the web interface is restricted";
        auth_basic_user_file /etc/nagios/passwd;

        rewrite ^/nagios/(.*) /$1 break;

        root /usr/share/nagios/html;
        index  index.php;
        include fastcgi_params;
        fastcgi_param  SCRIPT_FILENAME  $document_root$fastcgi_script_name;
        if ($uri ~ "\.php"){
            fastcgi_pass   127.0.0.1:9002;
        }

    }

    location /nagios/cgi-bin/ {
        root /usr/lib/;
        include /etc/nginx/fastcgi_params;
        auth_basic "Restricted";
        auth_basic_user_file /etc/nagios/passwd;
        fastcgi_param  AUTH_USER $remote_user;
        fastcgi_param  REMOTE_USER $remote_user;
        if ($uri ~ "\.cgi$"){
            fastcgi_pass   127.0.0.1:9001;
        }
    }


}
</code>

Restart Nginx and you should be able to admire this:

{{:en:ressources:dossiers:supervision:nagiosadmin.jpg|}}

===== Nagios configuration =====

There are a number of ways to monitor hosts with Nagios, and I cannot cover them all. Besides, the documentation shipped with the default Nagios installation is extremely well written (check out the Documentation section in the left column), so rewriting it here would be useless.

Instead, I will describe the configuration of the services that I want to monitor: some HTTP servers, a DNS master and slave, CPU/disk/memory on the remote host (using SNMP), and so on.

==== Understanding the logic ====

=== Templates ===

Nagios tries to limit the duplication of information. It uses **templates** that define a type of host, contact or service, along with the associated checks and/or data.

Templates are basically regular object definitions that are not registered (register value = 0). They define parameters that can be used by other, registered objects through inheritance. So you can define a generic server template with basic parameters, and then a more specific object for, say, toto.example.net, that inherits the basic parameters of the generic template and adds its own (if needed).

Inheritance works as follows: if both the //generic-server// template and the toto.example.net object declare the same parameter, then the second one (from toto) is the one used.

<note tip>Further reading about object inheritance can be found here: [[http://nagios.sourceforge.net/docs/2_0/templaterecursion.html|Template Recursion]]</note>

The template definitions are stored in **/etc/nagios/objects/templates.cfg** by default. If you take a closer look at this file, you will see three types of templates: contact, host and service.

The template for **contact**, named //generic-contact//, states that this type of contact can be notified about all services and hosts on a 24/7 basis, and that notifications to this contact must be sent by email.

<file>
define contact{
        name                            generic-contact     ; The name of this contact template
        service_notification_period     24x7            ; service notifications can be sent anytime
        host_notification_period        24x7            ; host notifications can be sent anytime
        service_notification_options    w,u,c,r,f,s     ; send notifications for all service states, flapping events, and scheduled downtime events
        host_notification_options       d,u,r,f,s       ; send notifications for all host states, flapping events, and scheduled downtime events
        service_notification_commands   notify-service-by-email ; send service notifications via email
        host_notification_commands      notify-host-by-email    ; send host notifications via email
        register                        0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL CONTACT, JUST A TEMPLATE!
        }

</file>


The first template for **host**, named //generic-host//, defines specific parameters on the monitoring of this type of host. The declaration is self-explanatory:
<file>
define host{
        name                            generic-host    ; The name of this host template
        notifications_enabled           1           ; Host notifications are enabled
        event_handler_enabled           1           ; Host event handler is enabled
        flap_detection_enabled          1           ; Flap detection is enabled
        failure_prediction_enabled      1           ; Failure prediction is enabled
        process_perf_data               1           ; Process performance data
        retain_status_information       1           ; Retain status information across program restarts
        retain_nonstatus_information    1           ; Retain non-status information across program restarts
        notification_period             24x7        ; Send host notifications at any time
        register                        0           ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
        }
</file>

Right after it, a //linux-server// template is declared that inherits the //generic-host// parameters. Both being templates, they are not registered. If a host uses //linux-server//, it will inherit from both //linux-server// and //generic-host//.

<file>
define host{
    name                    linux-server    ; The name of this host template
    use                     generic-host    ; This template inherits other values from the generic-host template
    check_period            24x7        ; By default, Linux hosts are checked round the clock
    check_interval          5       ; Actively check the host every 5 minutes
    retry_interval          1       ; Schedule host check retries at 1 minute intervals
    max_check_attempts      10      ; Check each Linux host 10 times (max)
    check_command           check-host-alive ; Default command to check Linux hosts
    notification_period     workhours   ; Linux admins hate to be woken up, so we only notify during the day
                                        ; Note that the notification_period variable is being overridden from
                                        ; the value that is inherited from the generic-host template!
    notification_interval   120     ; Resend notifications every 2 hours
    notification_options    d,u,r   ; Only send notifications for specific host states
    contact_groups          admins  ; Notifications get sent to the admins by default
    register                0       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
    }
</file>
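
The inheritance chain can be pictured with a short Python sketch that merges an object with everything it inherits through ''use''. It is only an illustration under a few assumptions: the templates are trimmed to a handful of parameters, the host toto.example.net and its address are made up, only a single parent per object is handled, and the ''name'', ''use'' and ''register'' directives are treated as belonging to each object rather than being passed down.

```python
NOT_INHERITED = {"name", "use", "register"}

def resolve(name, objects):
    """Merge an object with everything it inherits through 'use'."""
    obj = objects[name]
    merged = {}
    if "use" in obj:
        parent = resolve(obj["use"], objects)
        merged.update({key: value for key, value in parent.items()
                       if key not in NOT_INHERITED})
    # Values declared locally win over inherited ones.
    merged.update(obj)
    return merged

objects = {
    "generic-host": {"name": "generic-host", "notification_period": "24x7",
                     "notifications_enabled": "1", "register": "0"},
    "linux-server": {"name": "linux-server", "use": "generic-host",
                     "notification_period": "workhours",
                     "check_interval": "5", "register": "0"},
    # Hypothetical host, for illustration only.
    "toto": {"use": "linux-server", "host_name": "toto.example.net",
             "address": "192.0.2.10"},
}

host = resolve("toto", objects)
```

Here ''host["notification_period"]'' ends up as workhours (the //linux-server// value overrides the //generic-host// one), while ''notifications_enabled'' still comes from //generic-host//.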

There are a few other templates defined (windows server, printers, ...). But let's skip to the service definition.

Almost at the end of the file, a //generic-service// template is defined. This is the template for all services that check the state of //something// and return a status (OK, WARNING, UNKNOWN, or CRITICAL). 
<file>
define service{
        name                            generic-service     ; The 'name' of this service template
        active_checks_enabled           1               ; Active service checks are enabled
        passive_checks_enabled          1               ; Passive service checks are enabled/accepted
        parallelize_check               1               ; Active service checks should be parallelized (disabling this can lead to major performance problems)
        obsess_over_service             1               ; We should obsess over this service (if necessary)
        check_freshness                 0               ; Default is to NOT check service 'freshness'
        notifications_enabled           1               ; Service notifications are enabled
        event_handler_enabled           1               ; Service event handler is enabled
        flap_detection_enabled          1               ; Flap detection is enabled
        failure_prediction_enabled      1               ; Failure prediction is enabled
        process_perf_data               1               ; Process performance data
        retain_status_information       1               ; Retain status information across program restarts
        retain_nonstatus_information    1               ; Retain non-status information across program restarts
        is_volatile                     0               ; The service is not volatile
        check_period                    24x7            ; The service can be checked at any time of the day
        max_check_attempts              3               ; Re-check the service up to 3 times in order to determine its final (hard) state
        normal_check_interval           10              ; Check the service every 10 minutes under normal conditions
        retry_check_interval            2               ; Re-check the service every two minutes until a hard state can be determined
        contact_groups                  admins          ; Notifications get sent out to everyone in the 'admins' group
        notification_options            w,u,c,r         ; Send notifications about warning, unknown, critical, and recovery events
        notification_interval           60              ; Re-notify about service problems every hour
        notification_period             24x7            ; Notifications can be sent out at any time
        register                        0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        }
</file>

=== Hosts ===

Now, the whole point of using templates is to reduce the number of parameters you have to set when you declare a new host. You just reuse the template.

Therefore, a basic host definition is fairly short. Take a look at the **localhost** host definition in **localhost.cfg** and you will see that it contains almost nothing:

<file>
define host{
        use                     linux-server            ; Name of host template to use
                            ; This host definition will inherit all variables that are defined
                            ; in (or inherited by) the linux-server host template definition.
        host_name               localhost
        alias                   localhost
        address                 127.0.0.1
        }
</file>

=== Services ===

Now, the previous host definition does not declare any service to check. For that, we need to declare services and link those services to our host. This is still done in **localhost.cfg**, but further down:

<file>
# Define a service to "ping" the local machine

define service{
        use                             local-service         ; Name of service template to use
        host_name                       localhost
        service_description             PING
        check_command                   check_ping!100.0,20%!500.0,60%
        }


# Define a service to check the disk space of the root partition
# on the local machine.  Warning if < 20% free, critical if
# < 10% free space on partition.

define service{
        use                             local-service         ; Name of service template to use
        host_name                       localhost
        service_description             Root Partition
        check_command                   check_local_disk!20%!10%!/
        }
</file>

The **check_command** parameter is the key: it calls a command that will check the status of a service. Commands are declared in **/etc/nagios/objects/commands.cfg**. There, we can find the **check_local_disk** command above:

<file>
# 'check_local_disk' command definition
define command{
        command_name    check_local_disk
        command_line    $USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
        }
</file>

As we see, this command uses a plugin called **check_disk** and launches this plugin using the 3 arguments passed by the **check_command** line.
  * $ARG1$ is replaced with 20%
  * $ARG2$ is replaced with 10%
  * $ARG3$ is replaced with /

If we take a look at the **check_disk** plugin located in **/usr/lib/nagios/plugins/**, we see that it is a binary program that checks the usage level of a volume:

<code>
# file check_disk 
check_disk: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, stripped

# ./check_disk -h
check_disk v1.4.15 (nagios-plugins 1.4.15)
Copyright (c) 1999 Ethan Galstad <nagios@nagios.org>
Copyright (c) 1999-2008 Nagios Plugin Development Team
	<nagiosplug-devel@lists.sourceforge.net>

This plugin checks the amount of used disk space on a mounted file system
and generates an alert if free space is less than one of the threshold values


Usage:
 check_disk -w limit -c limit [-W limit] [-K limit] {-p path | -x device}
[-C] [-E] [-e] [-g group ] [-k] [-l] [-M] [-m] [-R path ] [-r path ]
[-t timeout] [-u unit] [-v] [-X type]

[... etc ... ]
</code>

We can even try this plugin directly from the command line, using the same arguments as those defined in Nagios:

<code>
# ./check_disk -w 20% -c 10% -p /
DISK OK - free space: / 7031 MB (88% inode=97%);| /=949MB;6450;7256;0;8063
</code>

The language in which a plugin is written does not matter. Some of them are written in C, some in Perl. For example, take a look at **check_file_age**: it's a Perl plugin that checks the age of a file.

<code>
# ./check_file_age -w 30 -f /root/htpasswd.pl 
FILE_AGE CRITICAL: /root/htpasswd.pl is 63208 seconds old and 169 bytes
</code>

Writing plugins is easy, because the only thing that matters is the return value (the exit code of the plugin), which must be one of OK (0), WARNING (1), CRITICAL (2), or UNKNOWN (3).
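
As an illustration, here is a minimal sketch of a plugin written in Python. It follows the usual plugin convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN) and prints a single status line. It is a toy cousin of check_disk, not a replacement for it, and the function names and the output format are made up.

```python
import shutil

# Exit codes that Nagios understands.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def classify(free_pct, warn_pct, crit_pct):
    """Map a free-space percentage to a plugin exit code."""
    if free_pct < crit_pct:
        return CRITICAL
    if free_pct < warn_pct:
        return WARNING
    return OK

def check_free_space(path, warn_pct, crit_pct):
    """Print one status line and return the matching exit code."""
    try:
        usage = shutil.disk_usage(path)
        free_pct = 100.0 * usage.free / usage.total
    except (OSError, ZeroDivisionError) as error:
        # Anything that prevents the measurement is reported as UNKNOWN.
        print("DISK UNKNOWN - %s" % error)
        return UNKNOWN
    code = classify(free_pct, warn_pct, crit_pct)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[code]
    print("DISK %s - %.0f%% free on %s" % (label, free_pct, path))
    return code
```

A real plugin would end with something like ''sys.exit(check_free_space("/", 20, 10))'' so that Nagios receives the code as the exit status of the process.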

The illustration below summarizes everything we just saw.

{{:en:ressources:dossiers:supervision:nagios_arch.png|}}

This is enough theory to understand the logic behind Nagios. With that in hand, we can start declaring hosts and services for our infrastructure.

==== Monitoring "public" services ====

Public facing services, such as web pages, dns servers, email servers, are probably the easiest to monitor. The Nagios document called "How to monitor a publicly available service (HTTP, FTP, SSH, etc.)" gives a good description on how to do this. But basically, we will reuse the knowledge we just acquired to create a host and add services to it.

=== Create a host ===

We want to monitor myserver1.example.net. To avoid mixing things up with the files shipped with Nagios, we will create a new directory dedicated to example.net and create a host file for myserver1 in it.

<code>
# mkdir /etc/nagios/example.net
# vim /etc/nagios/example.net/myserver1.cfg

define host{
    use          linux-server
    host_name    myserver1
    alias        myserver1
    address      11.22.33.44
    }

</code>

Now, this directory will not be loaded by Nagios by default. For that, we have to modify **/etc/nagios/nagios.cfg** and add the line

<file>
cfg_dir=/etc/nagios/example.net
</file>

This will automatically load the content of **/etc/nagios/example.net** at startup.

=== Service check_http ===

As discussed before, a service is composed of a command calling a plugin. The service is then attached to a host.
myserver1 hosts 3 websites behind a Haproxy load balancer. We will have 4 services: one for each website and one for haproxy.

**check_http** can take a good number of parameters (see ./check_http -h for the complete list). To check that a website is up and running, we expect to receive a 200 HTTP code back from the server. This is what the plugin checks by default. So a basic declaration looks like this:

<file>
define service{
        use                             generic-service
        host_name                       myserver1
        service_description             dontputyourcatinthemicrowave.com
	check_command		        check_http!-w 5 -c 10 -H dontputyourcatinthemicrowave.com
        }
</file>

The service definition will connect to myserver1 using an HTTP/1.1 request containing dontputyourcatinthemicrowave.com in the Host header. This request will pass through haproxy and reach the web server, which will reply with a 200 code.
check_http will evaluate the returned HTTP code and also the response time. If it is higher than 5 seconds (-w 5), a warning is issued. If it is higher than 10 seconds (-c 10), it's a critical.


The second website does not have anything at the root: it redirects users (HTTP 302) directly to /blog. We need to tell check_http to connect to /blog directly, otherwise it will consider the 302 as a potential warning. To make check_http request a specific URI, use the -u parameter.

<file>
define service{
        use                             generic-service
        host_name                       myserver1
        service_description             victoriasecretsforgrandmothers.org
	check_command		        check_http!-w 5 -c 10 -H victoriasecretsforgrandmothers.org -u "/blog/"
        }
</file>

The third website uses basic HTTP authentication. Once again, this can be specified on the command line, using the -a parameter.

<file>
define service{
        use                             generic-service
        host_name                       myserver1
        service_description             mybosswifeishot.com
        check_command                   check_http!-w 5 -c 10 -H mybosswifeishot.com -a raymond:mypassword
        }

</file>

Note that I kept the response time thresholds pretty high. Feel free to reduce them. Even when crossing the Atlantic, I see response times as low as 0.5 seconds (0.2s when no PHP is involved).


Now, concerning Haproxy, we need to find a way to test it. It is in the path of all three websites we tested above, but if those websites crash, you won't know if it's haproxy's fault or the web server's.
However, haproxy does not normally reply to clients itself, except when the request coming from the client cannot be routed. 

For example, when trying to access a website that does not exist on this server, haproxy will reply with HTTP code 503 Service Unavailable.

So, what we can do is create a service that connects to myserver1 and asks for a nonexistent website, then checks that haproxy properly replied with the expected 503 code. This is done as follows:

<file>
define service{
        use                   generic-service
        host_name             myserver1
        service_description   HAPROXY
        ; check for a nonexistent virtual host, if 503 is returned, then haproxy is alive
	check_command         check_http!-w 2 -c 5 -H nonexistanthost.com -e "HTTP/1.0 503 Service Unavailable"   
        }
</file>
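
To see what these HTTP checks boil down to, here is a rough Python sketch of the same idea: connect to the server's address, name the virtual host in the Host header, then judge the status code and the response time. It is not how check_http is implemented and it is much cruder (no SSL, no redirects, no authentication, and any unexpected status is simply treated as critical). The function name and its parameters are made up for the illustration.

```python
import http.client
import time

def check_http(address, host_header, warn, crit, uri="/", expect=200, port=80):
    """Return (state, status code, elapsed seconds) for one GET request."""
    start = time.monotonic()
    try:
        conn = http.client.HTTPConnection(address, port, timeout=crit)
        # Connect to the server's address but name the virtual host explicitly.
        conn.request("GET", uri, headers={"Host": host_header})
        status = conn.getresponse().status
        conn.close()
    except (OSError, http.client.HTTPException):
        return "CRITICAL", None, time.monotonic() - start
    elapsed = time.monotonic() - start
    if status != expect or elapsed >= crit:
        return "CRITICAL", status, elapsed
    if elapsed >= warn:
        return "WARNING", status, elapsed
    return "OK", status, elapsed
```

The haproxy trick then becomes something like ''check_http("11.22.33.44", "nonexistanthost.com", 2, 5, expect=503)'': the check is OK precisely when the 503 comes back quickly.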


=== Service check_ldap ===

The plugin **check_ldap** takes a few arguments, binds to an LDAP directory and returns a status code. There is only one trick: if you are using LDAPS (with SSL on port 636), then you need to make sure that check_ldap can verify the X.509 certificate returned by the LDAP directory. To do that, check_ldap looks into **/etc/pki/tls/cert.pem** for the X.509 certificate of the Certificate Authority that signed the certificate of the LDAP directory.

If, like me, you use your own personal CA, you must add the PEM encoded CA certificate into **/etc/pki/tls/cert.pem**.

<code>
# openssl x509 -in ca-linuxwall.crt -text >> /etc/pki/tls/cert.pem
</code>

check_ldap will now agree to connect to the LDAP directory:

<code>
# /usr/lib/nagios/plugins/check_ldap -H ldap.example.net -D "cn=nagioscheck,ou=infrastructure,dc=example,dc=net" -P "totodanslecaniveau" -b dc=example,dc=net --ssl -p 636 -3 -w 2 -c 5
LDAP OK - 0.644 seconds response time|time=0.644208s;;;0.000000
</code>

<note important>If you are having difficulties, add the **-v** switch at the command line. It increases the verbosity.</note>

The command for check_ldap does not exist in the command file. We need to create it in **/etc/nagios/objects/commands.cfg**:

<file>
# 'check_ldap' command definition
define command{
    command_name    check_ldap
    command_line    $USER1$/check_ldap $ARG1$
    }
</file>

And add that command to the services of our host in **/etc/nagios/example.net/myserver1.cfg**

<file>
define service{
        use                     generic-service
        host_name               myserver1
        service_description     LDAPS
	check_command	        check_ldap!-H ldap.example.net -D "cn=nagioscheck,ou=infrastructure,dc=example,dc=net" -P "totodanslecaniveau" -b dc=example,dc=net --ssl -p 636 -3 -w 2 -c 5
        }

</file>

=== Service check_dns ===

The plugin available for checking DNS, **check_dns**, asks a DNS server to resolve a name and compares the answer with the IP address you expect. If no server is specified with -s, it uses the local resolver listed in /etc/resolv.conf.

The command does not exist by default, so we need to create it in **commands.cfg**:

<file>
# 'check_dns' command definition
define command{
    command_name    check_dns
    command_line    $USER1$/check_dns $ARG1$
    }
</file>

The service is not too complicated to set up. You need to find an entry in your DNS that will most probably never change (like the IP of one of the name servers) and use it in the command line.
The plugin will ask the target server (-s) to resolve the requested FQDN (-H) and compare the answer with the expected IP (-a). If it notices a difference (or if it can't connect to the target server), it will raise an alarm.

<file>
define service{
        use                     generic-service
        host_name               myserver1
        service_description     DNS
	check_command	        check_dns!-H ns0.example.net -s myserver1.example.net -a 55.66.77.88 -w 2 -c 5
        }
</file>
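
The heart of such a check is a comparison between the address returned for a name and the address you expect. The Python sketch below shows only that comparison. It is an approximation with a clear limitation: Python's ''socket.gethostbyname'' always goes through the system resolver, so unlike the -s option above it cannot target a specific name server. The function name and the pluggable ''resolve'' parameter are made up.

```python
import socket

def check_dns(name, expected_ip, resolve=socket.gethostbyname):
    """Return (state, message) after comparing a lookup with an expected IP."""
    try:
        answer = resolve(name)
    except OSError as error:
        # socket.gaierror (failed lookup) is a subclass of OSError.
        return "CRITICAL", "lookup of %s failed: %s" % (name, error)
    if answer != expected_ip:
        return "CRITICAL", "%s resolved to %s, expected %s" % (
            name, answer, expected_ip)
    return "OK", "%s resolved to %s" % (name, answer)
```

Calling ''check_dns("ns0.example.net", "55.66.77.88")'' would mirror the service definition above, minus the choice of server.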

=== Service PING ===

Ping is essential to check the status of a system. Nagios sends ICMP echo requests and expects to receive replies. The plugin checks the Round Trip Average (RTA) and the packet loss, and triggers a warning if either is above an acceptable limit.

As always, this limit is set in the command line of the service, as follows:

<file>
define service{
        use                             generic-service
        host_name                       myserver1
        service_description             PING
        check_command                   check_ping!300.0,20%!1000.0,60%
        }
</file>

<note>The default values are a bit lower than this, but since the servers I monitor are not in the same room but everywhere on the interwebz, I was getting warnings all the time.</note>

The check_command takes 2 groups of arguments: the first group '300.0,20%' triggers the warning, while the second group '1000.0,60%' triggers the critical.
In each group, the first value (300.0 and 1000.0) represents the round trip average in milliseconds (the time an ICMP packet takes to reach the target and come back). The second value is the percentage of packet loss. By default, check_ping sends 5 ICMP packets each time, so if you lose 1, you get a warning, and if you lose 3, a critical.
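
The way the two groups are interpreted can be written down as a small Python sketch. It only models the threshold logic, it does not send any packet, and the function names are made up.

```python
def parse_threshold(text):
    """Turn '300.0,20%' into (300.0, 20.0): RTA in ms, packet loss in %."""
    rta, loss = text.split(",")
    return float(rta), float(loss.rstrip("%"))

def ping_state(rta_ms, loss_pct, warning, critical):
    """Compare a measured RTA and packet loss with the two threshold groups."""
    warn_rta, warn_loss = parse_threshold(warning)
    crit_rta, crit_loss = parse_threshold(critical)
    # Either value reaching its limit is enough to change the state.
    if rta_ms >= crit_rta or loss_pct >= crit_loss:
        return "CRITICAL"
    if rta_ms >= warn_rta or loss_pct >= warn_loss:
        return "WARNING"
    return "OK"
```

With 5 packets sent, losing one means 20% loss, so ''ping_state(120.0, 20.0, "300.0,20%", "1000.0,60%")'' gives WARNING, and losing three (60%) gives CRITICAL.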


=== Services SMTP, IMAP, SSH, XMPP, FTP ===

I group these services together because there is not much to say about them. The check is basic, and there is a plugin for each. If the command doesn't already exist (for XMPP), you just create it.

The entries in the host file look like this:

<file>
define service{
        use                    generic-service
        host_name              myserver1
        service_description    SSH
        check_command          check_ssh!-p 2222
        }

define service{
        use                             generic-service
        host_name                       myserver1
        service_description             SMTP
	check_command                   check_smtp!--fqdn nagios-myserver1.example.net --starttls -w 5 -c 10
        }

define service{
        use                             generic-service
        host_name                       myserver1
        service_description             IMAP
	check_command                   check_imap!-p 993 --ssl -w 5 -C 10
        }

define service{
        use                             generic-service
        host_name                       myserver1
        service_description             XMPP
	check_command                   check_jabber!-p 5222 -w 5 -c 10
        }

</file>

==== Checking with SNMP ====

FIXME
work in progress....

<code>
./check_snmp -H myserver1.example.net -P 3 -C public -U nagios-myserver1 -L authPriv -a SHA -A eiohfwoih2892 -x AES -X oiurhw89ehf2 -o .1.3.6.1.4.1.2021.10.1.3.1
</code>