====== Nagios with Nginx on Fedora (or Amazon Linux AMI) ======

I was browsing around the Amazon EC2 infrastructure when I decided to sign up for a micro instance, which is free, and experiment with the cloud a bit. Once past the deciphering period of both the Amazon terminology (load your EC2, attach an EBS, don't forget the CloudWatch, ...) and the management interface, I ended up with a shell on what seems to be a revamped Fedora with Amazon's logo.

Since the bandwidth is fairly limited (15 gigs a month), I don't want to enable bandwidth-costly services on this system, but I still feel like I can use it for something useful. So I've decided to re-install a Nagios system to supervise the linuxwall domain (4+ servers).

This article is about installing and configuring Nagios with Nginx on a Fedora-style system.

Note for French readers: there is a similar article in French [[http://wiki.linuxwall.info/doku.php/fr:ressources:dossiers:supervision:nagios3|here]]. This one might be more up to date though.

===== Installing Nagios =====

This is going to be basic, since Fedora comes with yum:

<code>
# yum install nagios nagios-plugins-all
</code>

We can start the service and add it to runlevel 3 with the following commands:

<code>
# service nagios start
Starting nagios: done.
# chkconfig --level 3 nagios on
# chkconfig --list |grep nagios
nagios          0:off   1:off   2:off   3:on    4:off   5:off   6:off
</code>

The installation creates a directory in **/etc/nagios**. The file **nagios.cfg** contains the core of the configuration; we will get back to it later. You can also take a look at **passwd**, which contains the users allowed to connect to the Nagios interface. By default, this file only contains a default //nagiosadmin// user:

<code>
# cat passwd
nagiosadmin:Oqd8214Hd37q1hd
</code>

The format of this file follows the htpasswd syntax: user:crypt(password). So a simple Perl script can generate it (or the htpasswd tool if you have Apache installed).
<code perl>
#!/usr/bin/perl
use strict;
use warnings;

# Print an htpasswd-style "user:crypt(password)" line.
if ( @ARGV != 2 ) {
    print "usage: ./htpasswd.pl <user> <password>\n";
    exit 1;
}

# Use a random two-character salt: salting with the password itself
# would expose its first two characters in the resulting hash.
my @chars = ( 'a' .. 'z', 'A' .. 'Z', '0' .. '9', '.', '/' );
my $salt  = join '', map { $chars[ rand @chars ] } 1 .. 2;
print $ARGV[0] . ":" . crypt( $ARGV[1], $salt ) . "\n";
</code>

To add a user, run it as follows:

<code>
# perl /root/htpasswd.pl toto superpassword >> /etc/nagios/passwd
</code>

===== Configuration of Nginx to serve the Nagios Interface =====

Nginx is a very powerful web server, but to communicate with Nagios it requires additional software. Nginx handles incoming HTTP(S) requests and serves the response. It needs to pass the incoming requests to Nagios so that Nagios can generate a response (an HTML page) that is returned to the client.

The Nagios web interface is composed of a set of C programs that need to be launched to produce the HTML code. Nginx cannot launch them. So we need an external program, a wrapper, that is in charge of launching the Nagios CGI programs. And that wrapper needs to be started by another program, a spawner, which binds the socket that Nginx connects to. It's a bit complex but the illustration below should help you understand it.

The wrapper is **[[https://github.com/gnosek/fcgiwrap|fcgiwrap]]**. And the spawner is [[http://redmine.lighttpd.net/projects/spawn-fcgi|spawn-fcgi]].

{{:en:ressources:dossiers:supervision:nagios.png|}}

The programs of the Nagios web interface are located in **/usr/lib/nagios/cgi-bin**.

<code>
# ls /usr/lib/nagios/cgi-bin/
avail.cgi  config.cgi   histogram.cgi  notifications.cgi  showlog.cgi  statusmap.cgi  statuswrl.cgi  tac.cgi
cmd.cgi    extinfo.cgi  history.cgi    outages.cgi        status.cgi   statuswml.cgi  summary.cgi    trends.cgi
</code>

==== Installation of fcgiwrap ====

Unfortunately, fcgiwrap is not shipped with the Amazon packaged version of Fedora, so we cannot use yum here. Apparently, Amazon removed a bunch of packages from the regular Fedora repositories when they built their own version. So there is no libfcgi in the repository either, and you need that one. I worked around that by building fcgiwrap on a Debian i686 box and then transferring the library and the binary.
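
For readers who cannot use a 32-bit binary, the build itself is short. What follows is only a sketch of what such a build might look like on a Debian box, not a record of the commands used for this article: the package list is an assumption, and the autoreconf/configure/make sequence is simply the usual one for an autotools project such as fcgiwrap. The steps are wrapped in a function (build_fcgiwrap, a name chosen for this example) so that nothing runs until it is called explicitly.

```shell
#!/bin/sh
# Sketch only: build fcgiwrap from source on a Debian build host.
# The package names below are assumptions; adjust them to your release.
build_fcgiwrap() {
    apt-get install -y build-essential autoconf automake pkg-config git libfcgi-dev &&
    git clone https://github.com/gnosek/fcgiwrap.git &&
    cd fcgiwrap &&
    autoreconf -i &&
    ./configure &&
    make
    # The resulting ./fcgiwrap binary and the libfcgi shared library of the
    # build host are the two files to copy over to the target system.
}
```

Calling build_fcgiwrap as root on the build host runs the whole chain and stops at the first step that fails.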

The tar is here, for 32-bit systems only: {{:en:ressources:dossiers:supervision:fcgiwrap-libfcgi.tar|}}

Get the tar and extract it. Then move fcgiwrap to /usr/bin and libfcgi.so.0.0.0 to /usr/lib as follows:

<code>
# tar -xvf fcgiwrap-libfcgi.tar
# mv fcgiwrap /usr/bin/
# chown root:root /usr/bin/fcgiwrap
# mv usr/lib/libfcgi.so.0.0.0 /usr/lib/
# ln -s /usr/lib/libfcgi.so.0.0.0 /usr/lib/libfcgi.so
# ln -s /usr/lib/libfcgi.so.0.0.0 /usr/lib/libfcgi.so.0
# /usr/bin/fcgiwrap
Status: 403 Forbidden
Content-type: text/plain

Cannot get script name, are DOCUMENT_ROOT and SCRIPT_NAME (or SCRIPT_FILENAME) set and is the script executable?
403
</code>

The last command shows that fcgiwrap starts properly (the 403 is expected, since no script was requested). Now the next step is to configure spawn-fcgi to call fcgiwrap.

==== Installation of spawn-fcgi ====

When installing spawn-fcgi with **yum install spawn-fcgi**, a startup script is created in /etc/init.d/. spawn-fcgi needs a list of options to know what to do when it is called. Those options are stored in **/etc/sysconfig/spawn-fcgi**.

<code>
# You must set some working options before the "spawn-fcgi" service will work.
# If SOCKET points to a file, then this file is cleaned up by the init script.
#
# See spawn-fcgi(1) for all possible options.
#
# Example :
#SOCKET=/var/run/php-fcgi.sock
#OPTIONS="-u apache -g apache -s $SOCKET -S -M 0600 -C 32 -F 1 -P /var/run/spawn-fcgi.pid -- /usr/bin/php-cgi"
</code>

What we want is a TCP socket on 127.0.0.1:9001 that hands incoming requests to /usr/bin/fcgiwrap. For that purpose, we set the following option line in the file:

<code>
OPTIONS="-u nginx -g nginx -a 127.0.0.1 -p 9001 -f /usr/bin/fcgiwrap -P /var/run/spawn-fcgi.pid"
</code>

We can now start the spawn-fcgi service and verify that the socket is listening on port 9001.

<code>
# service spawn-fcgi start
Starting spawn-fcgi:                                       [  OK  ]
# netstat -taupen |grep LISTEN |grep 9001
tcp        0      0 127.0.0.1:9001      0.0.0.0:*       LISTEN      0          6075       3092/fcgiwrap
</code>

To add spawn-fcgi to the default runlevel, use chkconfig.
<code>
# chkconfig --level 3 spawn-fcgi on
</code>

==== Almost done, some PHP ====

Apparently the latest version of Nagios includes some PHP code. Similarly to the C programs, PHP needs to be executed through spawn-fcgi, with php-cgi behind it (not fcgiwrap this time). So, to enable this, we simply copy the init script of spawn-fcgi and create a new OPTIONS file that launches php-cgi and listens on TCP port 9002. Here is how it's done:

<code>
# cp /etc/init.d/spawn-fcgi /etc/init.d/spawn-fcgi-php
# vim /etc/init.d/spawn-fcgi-php
[edit lines 24/25 to replace spawn-fcgi with spawn-fcgi-php]

# cp /etc/sysconfig/spawn-fcgi /etc/sysconfig/spawn-fcgi-php
# vim /etc/sysconfig/spawn-fcgi-php
[modify the OPTIONS line so it matches the line below]
OPTIONS="-u nginx -g nginx -a 127.0.0.1 -p 9002 -f /usr/bin/php-cgi -P /var/run/spawn-fcgi-php.pid"

# service spawn-fcgi-php start
Starting spawn-fcgi-php:                                   [  OK  ]
# netstat -taupen |grep 9002
tcp        0      0 127.0.0.1:9002      0.0.0.0:*       LISTEN      0          12930      10364/php-cgi
</code>

And now add the new service to chkconfig and enable it at runlevel 3.

<code>
# chkconfig --add spawn-fcgi-php
# chkconfig --level 3 spawn-fcgi-php on
</code>

That's all for now, let's move on to the Nginx configuration.

==== Back to Nginx ====

Alright, now we have a method to execute the C programs and PHP scripts of the Nagios web interface through FastCGI. Nginx, however, is not configured yet. We need to tell it where to find those scripts and how to handle them.

Before anything else, you **MUST** check the permissions of the files in **/etc/nagios** and **/usr/share/nagios**. In particular, on my system /etc/nagios/passwd and /usr/share/nagios/html/config.inc.php were unreadable by the group //nginx// (permissions were given to apache).

There are two locations: the envelope of the interface and the libraries. So, first, we handle the PHP and HTML files that compose the envelope. Those files are stored in **/usr/share/nagios/html**.
The Nginx configuration is done in **/etc/nginx/conf.d/ssl.conf**:

<code>
location / {
    auth_basic "Access to the web interface is restricted";
    auth_basic_user_file /etc/nagios/passwd;

    rewrite ^/nagios/(.*) /$1 break;
    root /usr/share/nagios/html;
    index index.php;

    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;

    if ($uri ~ "\.php"){
        fastcgi_pass 127.0.0.1:9002;
    }
}
</code>

Some explanation is necessary:

  * Authentication is required at the root of the server, and Nginx bases it on the content of **/etc/nagios/passwd** that we looked at earlier.
  * The rewrite rule removes the unnecessary /nagios prefix from the URI when browsing the interface.
  * We include the FastCGI parameters from the file **/etc/nginx/fastcgi_params** (take a look at it, it's interesting).
  * We add one more parameter, SCRIPT_FILENAME, which gives the complete path of the script to execute.
  * Finally, if a requested file name ends with .php, we send the request to php-cgi through the spawn-fcgi socket (php-cgi returns the HTML code to send to the client).

Now, the section that executes the libraries is somewhat similar, except that the files are located in /usr/lib and we send the requests to a different socket. Also, Nagios wants to know who is connecting, so we pass the FastCGI parameters AUTH_USER and REMOTE_USER.
<code>
location /nagios/cgi-bin/ {
    root /usr/lib/;
    include /etc/nginx/fastcgi_params;

    auth_basic "Restricted";
    auth_basic_user_file /etc/nagios/passwd;

    fastcgi_param AUTH_USER $remote_user;
    fastcgi_param REMOTE_USER $remote_user;

    if ($uri ~ "\.cgi$"){
        fastcgi_pass 127.0.0.1:9001;
    }
}
</code>

The complete definition of the host looks like this:

<code>
server {
    listen 443;
    server_name example_server;

    ssl on;
    ssl_certificate /etc/ssl/certs/example_server/example_server.pem;
    ssl_certificate_key /etc/ssl/certs/example_server/example_server.key;
    ssl_session_timeout 5m;
    ssl_protocols SSLv3 TLSv1;
    ssl_ciphers ALL:!ADH:!EXPORT56:RC4+RSA:+HIGH:!MEDIUM:!LOW:!SSLv2:+EXP;
    ssl_prefer_server_ciphers on;

    location / {
        auth_basic "Access to the web interface is restricted";
        auth_basic_user_file /etc/nagios/passwd;

        rewrite ^/nagios/(.*) /$1 break;
        root /usr/share/nagios/html;
        index index.php;

        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;

        if ($uri ~ "\.php"){
            fastcgi_pass 127.0.0.1:9002;
        }
    }

    location /nagios/cgi-bin/ {
        root /usr/lib/;
        include /etc/nginx/fastcgi_params;

        auth_basic "Restricted";
        auth_basic_user_file /etc/nagios/passwd;

        fastcgi_param AUTH_USER $remote_user;
        fastcgi_param REMOTE_USER $remote_user;

        if ($uri ~ "\.cgi$"){
            fastcgi_pass 127.0.0.1:9001;
        }
    }
}
</code>

Restart Nginx and you should be able to admire this:

{{:en:ressources:dossiers:supervision:nagiosadmin.jpg|}}

===== Nagios configuration =====

There are a number of ways to monitor hosts with Nagios. I cannot cover them all. Plus, the documentation shipped with the default Nagios installation is extremely well written (check out the Documentation section in the left column), so rewriting it here would be useless. Instead, I will describe the configuration of the services that I want to monitor: some HTTP servers, a DNS master and slave, CPU/disk/memory on the remote host (using SNMP), and so on...

==== Understanding the logic ====

=== Templates ===

Nagios tries to limit the duplication of information.
It uses **templates** that define a type of host, contact or service and the associated tests and/or data. Templates are basically regular object definitions that are not registered (register value = 0). They define parameters that can be used by other, registered objects through inheritance. So you can define a generic server template with basic parameters, and then a more specific object for, say, toto.example.net, that inherits the basic parameters of the generic template and adds its own (if needed).

Inheritance works as follows: if both the //generic-server// template and the toto.example.net object declare the same parameter, then the second one (from toto) is the one used. Further reading about object inheritance can be found here: [[http://nagios.sourceforge.net/docs/2_0/templaterecursion.html|Template Recursion]]

The template definitions are stored in **/etc/nagios/objects/templates.cfg** by default. If you take a closer look at this file, you will see three types of templates: contact, host and service.

The template for **contact**, named //generic-contact//, states that this type of contact is notified about all services and hosts on a 24/7 basis, and that notifications to this contact must be sent by email.
<code>
define contact{
        name                            generic-contact         ; The name of this contact template
        service_notification_period     24x7                    ; service notifications can be sent anytime
        host_notification_period        24x7                    ; host notifications can be sent anytime
        service_notification_options    w,u,c,r,f,s             ; send notifications for all service states, flapping events, and scheduled downtime events
        host_notification_options       d,u,r,f,s               ; send notifications for all host states, flapping events, and scheduled downtime events
        service_notification_commands   notify-service-by-email ; send service notifications via email
        host_notification_commands      notify-host-by-email    ; send host notifications via email
        register                        0                       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL CONTACT, JUST A TEMPLATE!
        }
</code>

The first template for **host**, named //generic-host//, defines specific parameters for the monitoring of this type of host. The declaration is self-explanatory:

<code>
define host{
        name                            generic-host    ; The name of this host template
        notifications_enabled           1               ; Host notifications are enabled
        event_handler_enabled           1               ; Host event handler is enabled
        flap_detection_enabled          1               ; Flap detection is enabled
        failure_prediction_enabled      1               ; Failure prediction is enabled
        process_perf_data               1               ; Process performance data
        retain_status_information       1               ; Retain status information across program restarts
        retain_nonstatus_information    1               ; Retain non-status information across program restarts
        notification_period             24x7            ; Send host notifications at any time
        register                        0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
        }
</code>

Right after it, a //linux-server// template is declared that inherits the //generic-host// parameters. Both of those being templates, they are not registered. And if a host uses //linux-server//, it inherits from both //linux-server// and //generic-host//.
<code>
define host{
        name                            linux-server    ; The name of this host template
        use                             generic-host    ; This template inherits other values from the generic-host template
        check_period                    24x7            ; By default, Linux hosts are checked round the clock
        check_interval                  5               ; Actively check the host every 5 minutes
        retry_interval                  1               ; Schedule host check retries at 1 minute intervals
        max_check_attempts              10              ; Check each Linux host 10 times (max)
        check_command                   check-host-alive ; Default command to check Linux hosts
        notification_period             workhours       ; Linux admins hate to be woken up, so we only notify during the day
                                                        ; Note that the notification_period variable is being overridden from
                                                        ; the value that is inherited from the generic-host template!
        notification_interval           120             ; Resend notifications every 2 hours
        notification_options            d,u,r           ; Only send notifications for specific host states
        contact_groups                  admins          ; Notifications get sent to the admins by default
        register                        0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
        }
</code>

There are a few other templates defined (Windows servers, printers, ...). But let's skip to the service definition. Almost at the end of the file, a //generic-service// is defined. This is the template for all services that check the state of //something// and return a status (OK, WARNING, UNKNOWN, or CRITICAL).
<code>
define service{
        name                            generic-service ; The 'name' of this service template
        active_checks_enabled           1               ; Active service checks are enabled
        passive_checks_enabled          1               ; Passive service checks are enabled/accepted
        parallelize_check               1               ; Active service checks should be parallelized (disabling this can lead to major performance problems)
        obsess_over_service             1               ; We should obsess over this service (if necessary)
        check_freshness                 0               ; Default is to NOT check service 'freshness'
        notifications_enabled           1               ; Service notifications are enabled
        event_handler_enabled           1               ; Service event handler is enabled
        flap_detection_enabled          1               ; Flap detection is enabled
        failure_prediction_enabled      1               ; Failure prediction is enabled
        process_perf_data               1               ; Process performance data
        retain_status_information       1               ; Retain status information across program restarts
        retain_nonstatus_information    1               ; Retain non-status information across program restarts
        is_volatile                     0               ; The service is not volatile
        check_period                    24x7            ; The service can be checked at any time of the day
        max_check_attempts              3               ; Re-check the service up to 3 times in order to determine its final (hard) state
        normal_check_interval           10              ; Check the service every 10 minutes under normal conditions
        retry_check_interval            2               ; Re-check the service every two minutes until a hard state can be determined
        contact_groups                  admins          ; Notifications get sent out to everyone in the 'admins' group
        notification_options            w,u,c,r         ; Send notifications about warning, unknown, critical, and recovery events
        notification_interval           60              ; Re-notify about service problems every hour
        notification_period             24x7            ; Notifications can be sent out at any time
        register                        0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        }
</code>

=== Hosts ===

Now, the whole point of using templates is to reduce the number of parameters you have to set when you declare a new host.
You just reuse the template. Therefore, a basic host definition is fairly short. Take a look at the **localhost** host definition in **localhost.cfg** and you will see that it contains almost nothing:

<code>
define host{
        use                     linux-server    ; Name of host template to use
                                                ; This host definition will inherit all variables that are defined
                                                ; in (or inherited by) the linux-server host template definition.
        host_name               localhost
        alias                   localhost
        address                 127.0.0.1
        }
</code>

=== Services ===

Now, the previous host definition does not declare any service to check. For that, we need to declare services and link those services to our host. This is still done in **localhost.cfg**, but further down:

<code>
# Define a service to "ping" the local machine

define service{
        use                     local-service   ; Name of service template to use
        host_name               localhost
        service_description     PING
        check_command           check_ping!100.0,20%!500.0,60%
        }


# Define a service to check the disk space of the root partition
# on the local machine.  Warning if < 20% free, critical if
# < 10% free space on partition.

define service{
        use                     local-service   ; Name of service template to use
        host_name               localhost
        service_description     Root Partition
        check_command           check_local_disk!20%!10%!/
        }
</code>

The **check_command** parameter is the key: it calls a command that checks the status of a service. Commands are declared in **/etc/nagios/objects/commands.cfg**. There, we can find the **check_local_disk** command used above:

<code>
# 'check_local_disk' command definition
define command{
        command_name    check_local_disk
        command_line    $USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
        }
</code>

As we see, this command uses a plugin called **check_disk** and launches it with the 3 arguments passed on the **check_command** line, separated by exclamation marks.
  * $ARG1$ is replaced with 20%
  * $ARG2$ is replaced with 10%
  * $ARG3$ is replaced with /

If we take a look at the **check_disk** plugin located in **/usr/lib/nagios/plugins/**, we see that it is a binary program that checks the usage level of a volume:

<code>
# file check_disk
check_disk: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, stripped
# ./check_disk -h
check_disk v1.4.15 (nagios-plugins 1.4.15)
Copyright (c) 1999 Ethan Galstad
Copyright (c) 1999-2008 Nagios Plugin Development Team

This plugin checks the amount of used disk space on a mounted file system
and generates an alert if free space is less than one of the threshold values

Usage:
 check_disk -w limit -c limit [-W limit] [-K limit] {-p path | -x device}
[-C] [-E] [-e] [-g group ] [-k] [-l] [-M] [-m] [-R path ] [-r path ]
[-t timeout] [-u unit] [-v] [-X type]
[... etc ... ]
</code>

We can even try this plugin directly from the command line, using the same arguments as defined in Nagios:

<code>
# ./check_disk -w 20% -c 10% -p /
DISK OK - free space: / 7031 MB (88% inode=97%);| /=949MB;6450;7256;0;8063
</code>

The language in which a plugin is written does not matter. Some of them are written in C, some in Perl. For example, take a look at **check_file_age**: it's a Perl plugin that checks the age of a file.

<code>
# ./check_file_age -w 30 -f /root/htpasswd.pl
FILE_AGE CRITICAL: /root/htpasswd.pl is 63208 seconds old and 169 bytes
</code>

Writing plugins is easy, because the only thing that matters is the return value, which must be one of OK, WARNING, UNKNOWN, or CRITICAL.

The illustration below summarizes everything we just saw.

{{:en:ressources:dossiers:supervision:nagios_arch.png|}}

This is enough theory to understand the logic behind Nagios. With that in hand, we can start declaring hosts and services for our infrastructure.

==== Monitoring "public" services ====

Public-facing services, such as web pages, DNS servers and email servers, are probably the easiest to monitor.
The Nagios document called "How to monitor a publicly available service (HTTP, FTP, SSH, etc.)" gives a good description of how to do this. But basically, we will reuse the knowledge we just acquired to create a host and add services to it.

=== Create a host ===

We want to monitor myserver1.example.net. To avoid mixing things up with the files furnished by Nagios, we create a new directory dedicated to example.net and a host file for myserver1 in it.

<code>
# mkdir /etc/nagios/example.net
# vim /etc/nagios/example.net/myserver1.cfg
define host{
        use             linux-server
        host_name       myserver1
        alias           myserver1
        address         11.22.33.44
        }
</code>

Now, this directory is not loaded by Nagios on its own. For that, we have to modify **/etc/nagios/nagios.cfg** and add the line

<code>
cfg_dir=/etc/nagios/example.net
</code>

This will automatically load the content of **/etc/nagios/example.net** at startup.

=== Service check_http ===

As discussed before, a service is composed of a command calling a plugin. The service is then attached to a host. myserver1 hosts 3 websites behind an Haproxy load balancer. We will have 4 services: one for each website and one for haproxy.

**check_http** can take a good number of parameters (see ./check_http -h for the complete list). To check that a website is up and running, we expect to receive a 200 HTTP code back from the server. This is what the plugin checks by default. So a basic declaration looks like this:

<code>
define service{
        use                     generic-service
        host_name               myserver1
        service_description     dontputyourcatinthemicrowave.com
        check_command           check_http!-w 5 -c 10 -H dontputyourcatinthemicrowave.com
        }
</code>

This service connects to myserver1 and sends an HTTP/1.1 request with dontputyourcatinthemicrowave.com in the Host header. The request passes through haproxy and reaches the web server, which replies with a 200 code. check_http evaluates the returned HTTP code and also the response time. If it is higher than 5 seconds (-w 5), a warning is issued.
Higher than 10 seconds (-c 10), it's a critical.

The second website does not have anything at the root; it redirects (HTTP 302) users directly to /blog. We need to tell check_http to connect to /blog directly, otherwise it will consider the 302 as a potential warning. To point check_http at a specific URI, use the -u parameter.

<code>
define service{
        use                     generic-service
        host_name               myserver1
        service_description     victoriasecretsforgrandmothers.org
        check_command           check_http!-w 5 -c 10 -H victoriasecretsforgrandmothers.org -u "/blog/"
        }
</code>

The third website, now, uses basic HTTP authentication. Once again, this can be specified on the command line, using the -a parameter.

<code>
define service{
        use                     generic-service
        host_name               myserver1
        service_description     mybosswifeishot.com
        check_command           check_http!-w 5 -c 10 -H mybosswifeishot.com -a raymond:mypassword
        }
</code>

Note that I kept the response time thresholds pretty high. Feel free to reduce them. Traversing the Atlantic, I have response times as low as 0.5 seconds (0.2s when no PHP is involved).

Now, concerning Haproxy, we need to find a way to test it. It is in the path of all three websites we tested above, but if those websites go down, you won't know whether it's the fault of haproxy or of the web server. Haproxy does not normally reply to clients itself, except when a request cannot be routed to any backend. For example, when a client tries to access a website that does not exist on this server, haproxy replies with HTTP code 503 Service Unavailable. So, what we can do is create a service that connects to myserver1 and asks for a nonexistent website, then checks that haproxy properly replied with the expected 503 code.
This is done as follows:

<code>
define service{
        use                     generic-service
        host_name               myserver1
        service_description     HAPROXY
        ; check for a nonexistent virtual host, if 503 returned, then haproxy is alive
        check_command           check_http!-w 2 -c 5 -H nonexistanthost.com -e "HTTP/1.0 503 Service Unavailable"
        }
</code>

=== Service check_ldap ===

The plugin **check_ldap** takes a few arguments to bind to an LDAP directory and returns a status code. There is only one trick: if you are using LDAPS (with SSL on port 636), then you need to make sure that check_ldap can verify the X.509 certificate returned by the LDAP directory. To do that, check_ldap looks into **/etc/pki/tls/cert.pem** for the X.509 certificate of the Certificate Authority that signed the certificate of the LDAP directory. If, like me, you use your own personal CA, you must add the PEM-encoded CA certificate to **/etc/pki/tls/cert.pem**.

<code>
# openssl x509 -in ca-linuxwall.crt -text >> /etc/pki/tls/cert.pem
</code>

check_ldap will now accept to connect to the LDAP directory:

<code>
# /usr/lib/nagios/plugins/check_ldap -H ldap.example.net -D "cn=nagioscheck,ou=infrastructure,dc=example,dc=net" -P "totodanslecaniveau" -b dc=example,dc=net --ssl -p 636 -3 -w 2 -c 5
LDAP OK - 0.644 seconds response time|time=0.644208s;;;0.000000
</code>

If you are having difficulties, add the **-v** switch to the command line. It increases the verbosity.

The command for check_ldap does not exist in the command file.
We need to create it in **/etc/nagios/objects/commands.cfg**:

<code>
# 'check_ldap' command definition
define command{
        command_name    check_ldap
        command_line    $USER1$/check_ldap $ARG1$
        }
</code>

And add that command to the services of our host in **/etc/nagios/example.net/myserver1.cfg**:

<code>
define service{
        use                     generic-service
        host_name               myserver1
        service_description     LDAPS
        check_command           check_ldap!-H ldap.example.net -D "cn=nagioscheck,ou=infrastructure,dc=example,dc=net" -P "totodanslecaniveau" -b dc=example,dc=net --ssl -p 636 -3 -w 2 -c 5
        }
</code>

=== Service check_dns ===

The **check_dns** plugin resolves a name and compares the answer with an expected address. It queries the DNS server given with -s, or the local resolver listed in /etc/resolv.conf when no server is specified. The command does not exist by default, so we need to create it in **commands.cfg**:

<code>
# 'check_dns' command definition
define command{
        command_name    check_dns
        command_line    $USER1$/check_dns $ARG1$
        }
</code>

The service is not too complicated to set up. You need to find an entry in your DNS that will most probably never change (like the IP of one of the nameservers) and use it on the command line. The plugin asks the target server (-s) to resolve the requested FQDN (-H) and compares the answer with the IP you furnished (-a). If it notices a difference (or if it can't reach the target server), it raises an alarm.

<code>
define service{
        use                     generic-service
        host_name               myserver1
        service_description     DNS
        check_command           check_dns!-H ns0.example.net -s myserver1.example.net -a 55.66.77.88 -w 2 -c 5
        }
</code>

=== Service PING ===

The ping is essential to check the status of a system. Nagios sends ICMP echo requests and expects to receive replies. The plugin checks the Round Trip Average (RTA) and triggers a warning if it's above an acceptable limit.
As always, this limit is set on the command line of the service, as follows:

<code>
define service{
        use                     generic-service
        host_name               myserver1
        service_description     PING
        check_command           check_ping!300.0,20%!1000.0,60%
        }
</code>

The default values are a bit lower than this, but since the servers I monitor are not in the same room but everywhere on the interwebz, I was getting warnings all the time.

The check_command takes 2 groups of arguments: the first group '300.0,20%' triggers the warning, while the second group '1000.0,60%' triggers the critical. In each group, the first value (300.0 and 1000.0) is the round trip average in milliseconds (the time an ICMP packet takes to reach the target and come back). The second value is the percentage of packet loss. By default, check_ping sends 5 ICMP packets each time, so if you lose 1 (20%), you get a warning, and if you lose 3 (60%), a critical.

=== Services SMTP, IMAP, SSH, XMPP, FTP ===

I group those services together because there is not much to say about them. The check is basic, and you have a plugin for each. And if the command doesn't already exist (for XMPP), you just create it. The entries in the host file look like this:

<code>
define service{
        use                     generic-service
        host_name               myserver1
        service_description     SSH
        check_command           check_ssh!-p 2222
        }

define service{
        use                     generic-service
        host_name               myserver1
        service_description     SMTP
        check_command           check_smtp!--fqdn nagios-myserver1.example.net --starttls -w 5 -c 10
        }

define service{
        use                     generic-service
        host_name               myserver1
        service_description     IMAP
        check_command           check_imap!-p 993 --ssl -w 5 -c 10
        }

define service{
        use                     generic-service
        host_name               myserver1
        service_description     XMPP
        check_command           check_jabber!-p 5222 -w 5 -c 10
        }
</code>

==== Checking with SNMP ====

FIXME work in progress....

<code>
./check_snmp -H myserver1.example.net -P 3 -C public -U nagios-myserver1 -L authPriv -a SHA -A eiohfwoih2892 -x AES -X oiurhw89ehf2 -o .1.3.6.1.4.1.2021.10.1.3.1
</code>
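
Whatever the final service definition ends up looking like, the value fetched above still has to be turned into a Nagios status. As a rough sketch of the plugin contract described earlier (one status line, one return value), the shell function below classifies a load value against two thresholds. It assumes the conventional plugin exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN); the name load_status and the thresholds are made up for the illustration, and the -w and -c options of check_snmp itself may well make such a wrapper unnecessary.

```shell
#!/bin/sh
# Hypothetical helper: classify a load average against two thresholds,
# print a Nagios-style status line and return the matching plugin code.
# usage: load_status <value> <warning> <critical>
load_status() {
    if [ "$#" -ne 3 ]; then
        echo "LOAD UNKNOWN - usage: load_status <value> <warning> <critical>"
        return 3
    fi
    # awk does the floating point comparison that plain sh lacks;
    # its exit status becomes the return value of the function.
    awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
        if (v >= c)      { printf "LOAD CRITICAL - load1=%s\n", v; exit 2 }
        else if (v >= w) { printf "LOAD WARNING - load1=%s\n", v; exit 1 }
        else             { printf "LOAD OK - load1=%s\n", v; exit 0 }
    }'
}
```

For instance, load_status 5 4 8 prints a WARNING line and returns 1, which is all Nagios needs to update the state of the service.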