Nagios installation and configuration of the monitoring server

To start, install the nagios package on server01. In a terminal, enter:

sudo apt-get install nagios3 nagios-nrpe-plugin

You will be prompted to enter a password for the nagiosadmin user. User accounts are stored in /etc/nagios3/htpasswd.users. To change the nagiosadmin password, or to add other users who may run the Nagios CGI scripts, use the htpasswd utility, which is part of the apache2-utils package.

For example, to change the password for the nagiosadmin user, enter in a terminal:

sudo htpasswd /etc/nagios3/htpasswd.users nagiosadmin

To add a user:

sudo htpasswd /etc/nagios3/htpasswd.users steve

Next, install the nagios-nrpe-server package on server02. In a terminal on server02, enter:

sudo apt-get install nagios-nrpe-server

NRPE allows you to run local checks on remote hosts. The same goal can also be reached with other Nagios plugins and with other types of checks.

Configuration files overview

There are several directories containing Nagios configuration files as well as check files.

1. /etc/nagios3: contains configuration files for running the nagios daemon, CGI files, computer descriptions, etc.

2. /etc/nagios-plugins: configuration files for service checks.

3. /etc/nagios: on the remote host, contains the nagios-nrpe-server configuration files.

4. /usr/lib/nagios/plugins/: contains the check binaries. To see the options of a check, use the "-h" option.

For example: /usr/lib/nagios/plugins/check_dhcp -h

There are many Nagios checks that can be configured to run on any host. In this example, Nagios will be configured to check disk space, the DNS service, and a MySQL host group. The DNS check will run against server02, and the MySQL host group will include both server01 and server02.

See the section HTTPD - Apache2 Web Server for more detailed Apache settings, Domain Name Service (DNS) for DNS setup, and MySQL for MySQL setup.

A few terms will also help you set up Nagios:

Host: a server, workstation, network device, etc. that is being monitored.

Host group: a group of similar hosts. For example, you can group all web servers, file servers, etc.

Service: a service that is being monitored on a host. For example HTTP, DNS, NFS, etc.

Service group: allows you to group multiple services together. For example, this is useful for grouping the HTTP services of several web servers.

Contact: the person who will be notified when an event occurs. Nagios can be configured to send email, SMS messages, etc.

By default, Nagios is configured to check HTTP, disk space, SSH, current users, processes, and load on the local host. Nagios also checks the gateway with ping.

Large Nagios installations can be quite complex to configure. It is better to start small, with one or two hosts, get the configuration the way you want it, and then expand.

Configuration

1.1. First, create a host configuration file for server02. Unless otherwise noted, run all of these commands on server01. In a terminal, enter:

sudo cp /etc/nagios3/conf.d/localhost_nagios2.cfg \
/etc/nagios3/conf.d/server02.cfg

In the example above and in the following ones, replace "server01", "server02", 172.18.100.100 and 172.18.100.101 with the host names and IP addresses of your servers.

1.2. Next, edit /etc/nagios3/conf.d/server02.cfg:

define host {
    use        generic-host   ; Name of host template to use
    host_name  server02
    alias      Server 02
    address    172.18.100.101
}

# check DNS service.
define service {
    use                  generic-service
    host_name            server02
    service_description  DNS
    check_command        check_dns!172.18.100.101
}

1.3. Restart the nagios daemon to activate the new settings:

sudo /etc/init.d/nagios3 restart

2.1. Now add a service definition for the MySQL check by adding the following lines to /etc/nagios3/conf.d/services_nagios2.cfg:

# check MySQL servers.
define service {
    hostgroup_name         mysql-servers
    service_description    MySQL
    check_command          check_mysql_cmdlinecred!nagios!secret!$HOSTADDRESS$
    use                    generic-service
    notification_interval  0   ; set > 0 if you want to be renotified
}

2.2. The mysql-servers host group now needs to be defined. Edit /etc/nagios3/conf.d/hostgroups_nagios2.cfg, adding the following:

# MySQL host group.
define hostgroup {
    hostgroup_name  mysql-servers
    alias           MySQL servers
    members         localhost, server02
}

2.3. The Nagios check needs to authenticate to MySQL. To add a nagios user to MySQL, enter:

mysql -u root -p -e "create user nagios identified by 'secret';"

The nagios user must exist on all hosts in the mysql-servers host group.

2.4. Restart nagios to start checking the MySQL servers:

sudo /etc/init.d/nagios3 restart

3.1. Finally, configure NRPE to check disk space on server02.

On server01, add a service check to /etc/nagios3/conf.d/server02.cfg:

# NRPE disk check.
define service {
    use                  generic-service
    host_name            server02
    service_description  nrpe-disk
    check_command        check_nrpe_1arg!check_all_disks!172.18.100.101
}

3.2. Now, on server02, edit /etc/nagios/nrpe.cfg and change:

allowed_hosts=172.18.100.100

In the command definition area, add:

command[check_all_disks]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -e

3.3. Finally, restart nagios-nrpe-server:

sudo /etc/init.d/nagios-nrpe-server restart

3.4. On server01, nagios also needs to be restarted:

sudo /etc/init.d/nagios3 restart

You should now be able to see your hosts and service checks in the Nagios CGI pages. To access them, open http://server01/nagios3 in your browser. You will be prompted for the nagiosadmin username and password.

Links

This section has covered only a small part of what Nagios can do. The nagios-plugins-extra and nagios-snmp-plugins packages contain many more service checks.

1. For more detailed information, refer to the documentation on the official Nagios website.

2. Narrowly focused documentation on Nagios.

3. There are several books on Nagios and network monitoring.

4. The Nagios Ubuntu Wiki page also contains a lot of documentation.

Benefits and new opportunities for system monitoring

Tracking and analyzing large amounts of information about the state of different computers (for example, CPU and network card utilization) takes a lot of effort, but the open-source Nagios software copes well with monitoring and prompt notification.

It is extremely important to understand that Nagios is not a tool for measuring system parameters such as CPU utilization. It is a utility that reports monitoring results as "working", "unreliable", and "faulty" states. This helps the operator focus on the most important and critical issues, based on predefined and customizable criteria.

Nagios software implements functionality to report the amount of time lost due to downtime, which can be useful for tracking the quality of service delivery according to a service level agreement (SLA). As will be shown in subsequent articles, Nagios also offers features for downtime accounting and creating dependencies on services and systems. This introductory article shows how easy it is to create small, customized solutions for specific monitoring requirements.

Installation

Most Linux® distributions ship a packaged version of Nagios. In this case, the product integrates easily with the Apache Web server. To install or update such a configuration, run the command:

yum install nagios

or apt-get install nagios-text. Binaries for the AIX® platform are available for download from the NagiosExchange Web site.

For other platforms, the Nagios source code can be downloaded from the Nagios.org Web site. The following developer tools and packages are required to build Nagios from source.

  • Tools:
    • autoconf
    • automake
  • Packages (libraries and header files):
    • libgd
    • openssl

Many plugins related to SNMP (Simple Network Management Protocol) will also require Perl and the Net::SNMP package.

Once Nagios is installed and configured, it can be accessed at the standard URL http://your.host.name/nagios. The page shows which systems and services are up or down.

Setting up Nagios

By default, all Nagios configuration files are located in the /etc/nagios directory. The Apache-related configuration files can conveniently be linked into the Apache configuration directory with symbolic links. The configuration is split across several files, each dedicated to a separate part of the configuration.

The first components to set up are contacts and contact groups. Contacts are the people who are notified when a system or service goes down. By default, Nagios offers email and pager notifications, but extensions allow you to send notifications via Jabber protocol and many other methods that may be convenient in various circumstances.

Contacts are stored in the contacts.cfg file and are defined as shown in Listing 1.

Listing 1. Configuration 1: Basic contact information
define contact {
    contact_name                   jdoe
    alias                          John Doe
    service_notification_commands  notify-by-email
    host_notification_commands     host-notify-by-email
    email                          [email protected]
}

Contacts can be grouped, so that Nagios notifies the appropriate group instead of individual people when a system or service changes state. Sometimes it makes sense to define the same person several times, with different addresses or notification commands, and then add all of those contacts to the contact groups the person belongs to (Listing 2).

Listing 2. Configuration 2: Grouping contacts
define contactgroup {
    contactgroup_name  server-admins
    alias              Server Administrators
    members            jdoe,albundy
}

The next step is to set up the systems that Nagios will monitor. You must add every computer that runs services you want to monitor, or that you want to check periodically for availability. Systems are stored in the hosts.cfg file. Listing 3 shows an example of a host definition.

Listing 3. Configuration 3: Adding a new computer
define host {
    host_name                     ubuntu_1_2
    alias                         Ubuntu test server
    address                       192.168.1.2
    check_command                 check-host-alive
    max_check_attempts            20
    notifications_enabled         1
    event_handler_enabled         0
    flap_detection_enabled        0
    process_perf_data             1
    retain_status_information     1
    retain_nonstatus_information  1
    notification_interval         60
    notification_period           24x7
}

The last step in the Nagios configuration is defining the services for the configured systems. The example shown in Listing 4 uses a predefined ping plugin for Nagios that sends ICMP (Internet Control Message Protocol) pings to determine if the computer is responding or not.

Listing 4. Configuration 4: Adding a new service
define service {
    use                   service-template
    host_name             ubuntu_1_2
    service_description   PING
    check_period          24x7
    contact_groups        server-admins
    notification_options  c,r
    check_command         check_ping!300.0,20%!1000.0,60%
}

After preparing this configuration, you need to restart the Nagios daemon, and then, after waiting a few seconds while Nagios initializes, check if the ping services appeared in the admin web interface.
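The two arguments passed to check_ping in Listing 4 are warning and critical thresholds, each a round-trip time in milliseconds followed by a packet-loss percentage. The following Python sketch illustrates how such a pair of thresholds could be turned into a state. The function names are hypothetical, and the comparison with >= is an assumption of this sketch; the real check_ping plugin may treat the boundary values differently.

```python
def parse_ping_threshold(text):
    """Split a threshold such as '300.0,20%' into (rta_ms, loss_pct)."""
    rta, loss = text.split(",")
    return float(rta), float(loss.rstrip("%"))


def classify_ping(rta_ms, loss_pct, warning, critical):
    """Return a state code: 0 = OK, 1 = WARNING, 2 = CRITICAL."""
    warn_rta, warn_loss = parse_ping_threshold(warning)
    crit_rta, crit_loss = parse_ping_threshold(critical)
    # Check the critical thresholds first so the worst state wins.
    if rta_ms >= crit_rta or loss_pct >= crit_loss:
        return 2
    if rta_ms >= warn_rta or loss_pct >= warn_loss:
        return 1
    return 0
```

With the thresholds from Listing 4, a host answering in 50 ms without packet loss would be classified as OK, while one answering in 400 ms would be classified as WARNING.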

Writing plugins for Nagios

The most interesting aspect of Nagios is that you can easily write your own plugin for it, which requires you to learn a few simple rules. To manage a plugin, Nagios simply spawns a child process each time it requests the state of a service and uses the output and return code of that command to determine the state. Service state return codes are interpreted as follows:

  • OK - return code 0 - the service is working normally;
  • WARNING - return code 1 - a warning that the service may have problems;
  • CRITICAL - return code 2 - the service is in a critical state;
  • UNKNOWN - return code 3 - the state of the service is unknown.

The last state simply means that the plugin was unable to determine the state of the service. This can happen, for example, as a result of an internal error.
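To make the contract between Nagios and a plugin concrete, here is a small Python sketch that runs a command the way a monitoring process might and translates its return code into a state name. The helper run_check is hypothetical and is not part of Nagios. Mapping any return code outside 0 to 3 to UNKNOWN is a choice made for this sketch, not a statement about how Nagios itself handles such codes.

```python
import subprocess

# Return codes and the states they stand for, as listed above.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_check(argv):
    """Run a plugin command and return (state_name, first_line_of_output)."""
    proc = subprocess.run(argv, stdout=subprocess.PIPE, universal_newlines=True)
    # Assumption of this sketch: unexpected return codes count as UNKNOWN.
    state = STATES.get(proc.returncode, "UNKNOWN")
    lines = proc.stdout.splitlines()
    return state, (lines[0] if lines else "")
```

For example, run_check(["/usr/lib/nagios/plugins/check_dhcp", "-h"]) would return the state derived from the exit code together with the first line the plugin printed.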

Listing 5 is an example Python script that checks the UNIX® load average. It treats a level of 2.0 or more as a warning condition and a level of 5.0 or more as a critical condition. These values are hard-coded, and the load average for the last minute is always used.

Listing 5. Python plugin - working plugin example
#!/usr/bin/env python
import os
import sys

# load averages for the last 1, 5 and 15 minutes
(d1, d2, d3) = os.getloadavg()

if d1 >= 5.0:
    print("GETLOADAVG CRITICAL: Load average is %.2f" % d1)
    sys.exit(2)
elif d1 >= 2.0:
    print("GETLOADAVG WARNING: Load average is %.2f" % d1)
    sys.exit(1)
else:
    print("GETLOADAVG OK: Load average is %.2f" % d1)
    sys.exit(0)

Now that the small executable is ready, we need to register the plugin with Nagios and create a service definition that checks the load average.

This is quite simple: first create the file /etc/nagios-plugins/config/mygetloadavg.cfg with the content shown in Listing 6, then add the service to the services.cfg file as shown in Listing 7. Remember that localhost must be defined in the hosts.cfg configuration file.

Listing 6. Plugin example - Nagios registration
define command {
    command_name  check_mygetloadavg
    command_line  /path/to/check_getloadavg
}

Listing 7. Creating a service using the sample plugin
define service {
    use                   service-template
    host_name             localhost
    service_description   LoadAverage
    check_period          24x7
    contact_groups        server-admins
    notification_options  c,r
    check_command         check_mygetloadavg
}

Writing a complete plugin

The previous example showed the limitations of a hard-coded plugin, which cannot be reconfigured at run time. In practice it is better to write configurable plugins. You then create and maintain a single plugin, register it once in Nagios, and pass it arguments that set the warning and critical levels for each situation. The following example also prints a usage message, which is especially useful for plugins that are used or maintained by several developers or administrators.

Another useful technique is to catch all exceptions and report UNKNOWN as the service state, so that Nagios can send an appropriate notification. Plugins that let exceptions escape most often exit with return code 1, which Nagios treats as a WARNING state. It is important that the plugin correctly distinguishes between the WARNING and UNKNOWN states. Note that notifications for WARNING states are often disabled, but disabling notifications for UNKNOWN states is not a good idea.
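The exception-handling advice above can be sketched as a small wrapper. The names run_plugin and main are hypothetical. The wrapper assumes the actual check is a function that returns a (return code, message) pair, which is a convention of this sketch and not something Nagios requires.

```python
import sys


def run_plugin(check, label):
    """Call check() and return (exit_code, message).

    Any exception raised by the check is reported as UNKNOWN (3)
    instead of escaping and producing exit code 1 (WARNING).
    """
    try:
        code, text = check()
    except Exception as exc:
        return 3, "%s UNKNOWN: %s" % (label, exc)
    return code, text


def main(check, label):
    """Print the plugin message and exit with the matching return code."""
    code, message = run_plugin(check, label)
    print(message)
    sys.exit(code)
```

A plugin script would end with a call such as main(my_check, "GETLOADAVG"), where my_check is the function that performs the measurement.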

Writing a Python Plugin

The assumptions above (run-time parametrization, usage reporting, and improved exception handling) result in a plugin whose source code is several times larger than the previous one. However, this adds safe error handling and the ability to reuse the plugin in different situations.

Listing 8. Python plugin - complete plugin for getting load average data
#!/usr/bin/env python
import getopt
import os
import sys


def usage():
    print("""Usage: check_getloadavg [-h|--help] [-m|--mode 1|2|3]
    [-w|--warning level] [-c|--critical level]

Mode: 1 - last minute; 2 - last 5 minutes; 3 - last 15 minutes
Warning level defaults to 2.0
Critical level defaults to 5.0""")
    sys.exit(3)


# parse the command line, skipping the script name
try:
    options, args = getopt.getopt(
        sys.argv[1:], "hm:w:c:",
        ["help", "mode=", "warning=", "critical="])
except getopt.GetoptError:
    usage()

argMode = "1"
argWarning = 2.0
argCritical = 5.0

for name, value in options:
    if name in ("-h", "--help"):
        usage()
    if name in ("-m", "--mode"):
        if value not in ("1", "2", "3"):
            usage()
        argMode = value
    if name in ("-w", "--warning"):
        try:
            argWarning = float(value)
        except ValueError:
            print("Unable to convert to floating point value\n")
            usage()
    if name in ("-c", "--critical"):
        try:
            argCritical = float(value)
        except ValueError:
            print("Unable to convert to floating point value\n")
            usage()

try:
    (d1, d2, d3) = os.getloadavg()
except Exception:
    print("GETLOADAVG UNKNOWN: error while getting load average")
    sys.exit(3)

# pick the load average selected by --mode
if argMode == "1":
    d = d1
elif argMode == "2":
    d = d2
elif argMode == "3":
    d = d3

if d >= argCritical:
    print("GETLOADAVG CRITICAL: Load average is %.2f" % d)
    sys.exit(2)
elif d >= argWarning:
    print("GETLOADAVG WARNING: Load average is %.2f" % d)
    sys.exit(1)
else:
    print("GETLOADAVG OK: Load average is %.2f" % d)
    sys.exit(0)

To use the new plugin, register it in the /etc/nagios-plugins/config/mygetloadavg2.cfg file, as shown in Listing 9.

Listing 9. Python plugin - Nagios registration
define command {
    command_name  check_mygetloadavg2
    command_line  /path/to/check_getloadavg2 -m $ARG1$ -w $ARG2$ -c $ARG3$
}

You also need to add or change the entry for this service in the services.cfg file, as shown in Listing 10. Note that the exclamation mark (!) separates the plugin parameters. As before, localhost must be defined in the hosts.cfg configuration file.

Listing 10. Creating a service using the Python plugin
define service {
    use                   service-template
    host_name             localhost
    service_description   LoadAverage2
    check_period          24x7
    contact_groups        server-admins
    notification_options  c,r
    check_command         check_mygetloadavg2!1!3.0!6.0
}
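The relationship between the check_command value in the service and the $ARG1$ to $ARG3$ macros in the command definition can be illustrated with a short Python sketch. The function expand_check_command is hypothetical and only mimics the basic substitution; it is an assumption-laden simplification that ignores escaped exclamation marks and all other Nagios macros.

```python
def expand_check_command(check_command, commands):
    """Substitute $ARGn$ macros in the command_line registered for a command.

    check_command: value from the service definition, e.g. 'name!a!b'
    commands:      mapping of command_name to command_line
    """
    # The first field is the command name, the rest are the arguments.
    name, *args = check_command.split("!")
    line = commands[name]
    for index, value in enumerate(args, start=1):
        line = line.replace("$ARG%d$" % index, value)
    return line


commands = {
    "check_mygetloadavg2":
        "/path/to/check_getloadavg2 -m $ARG1$ -w $ARG2$ -c $ARG3$",
}
print(expand_check_command("check_mygetloadavg2!1!3.0!6.0", commands))
```

Under these assumptions the service from Listing 10 would run the plugin with mode 1, a warning level of 3.0 and a critical level of 6.0.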

Writing a Tcl Plugin

The last example is a Tcl plugin that checks exchange rates from xmethods.net using SOAP (Simple Object Access Protocol) and WSDL (Web Services Description Language) technology. SOAP provides the plugin with the current exchange rates to compare them with the configured values. If the value is within the warning range, then the state is considered to be OK . If the value is above or below the warning level, but not below the critical limit, then the state is considered to be WARNING . Otherwise, the state is considered CRITICAL unless a network failure occurs, in which case the state is set to UNKNOWN .

The plugin recognizes configurable parameters so that different rates can be checked with different ranges to check. It can also be used to check the exchange rates of various countries (Listing 11).

Listing 11. Tcl plugin - check current exchange rates
#!/usr/bin/env tclsh

# parse arguments
package require cmdline
set options {
    {country1.arg "" "Country 1"}
    {country2.arg "" "Country 2"}
    {lowerwarning.arg "" "Lower warning limit"}
    {upperwarning.arg "" "Upper warning limit"}
    {lowercritical.arg "" "Lower critical limit"}
    {uppercritical.arg "" "Upper critical limit"}
}
array set opt [cmdline::getoptions argv $options {: }]

# if the user didn't supply all arguments,
# then show the help message
foreach necessary {country1 country2 lowerwarning upperwarning lowercritical uppercritical} {
    if {$opt($necessary) == ""} {
        set argv "-help"
        catch {cmdline::getoptions argv $options {: }} usage
        puts stderr $usage
        exit 3
    }
}

# load the TclWebServices package
package require WS::Client

if {[catch {
    # ... WS::Client calls that store the current rate in $result (elided)
} error]} {
    # if for some reason the rate could not be loaded, report it
    puts "EXCHANGERATE UNKNOWN: $error"
    exit 3
}

if {($result < $opt(lowercritical)) || ($result > $opt(uppercritical))} {
    puts "EXCHANGERATE CRITICAL: rate is $result"
    exit 2
}

if {($result < $opt(lowerwarning)) || ($result > $opt(upperwarning))} {
    puts "EXCHANGERATE WARNING: rate is $result"
    exit 1
}

puts "EXCHANGERATE OK: rate is $result"
exit 0

Now we need to register this command so that Nagios knows how to call it. To do this, create the file /etc/nagios-plugins/config/exchangerate.cfg with content similar to the previous configurations and the following command line:

command_line /path/to/check_exchangerate -country1 $ARG1$ -country2 $ARG2$ -lowercritical \
    $ARG3$ -lowerwarning $ARG4$ -upperwarning $ARG5$ -uppercritical $ARG6$

The command name check_exchangerate is used in the example below.

Then create a service that uses the plugin to track exchange rates. The following service definition binds the service to the localhost server. Although the check is not tied to any real computer, it still has to be bound to a host. If the check calls SOAP methods on servers inside the monitored network, add a real host for that server and bind the service to it. The code in Listing 12 checks that the exchange rate of the British pound against the Japanese yen is between 225 and 275.

Listing 12. Adding the Tcl plugin as a new service
define service {
    use                   service-template
    host_name             localhost
    service_description   EXCHANGERATE
    check_period          24x7
    contact_groups        other-admins
    notification_options  c,r
    check_command         check_exchangerate!England!Japan!200!225!275!300
}
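The threshold logic of the exchange rate plugin is easy to get wrong, so here is the same decision written as a Python sketch. The function classify_rate is hypothetical and only mirrors the comparisons of the Tcl listing; fetching the rate itself is left out.

```python
def classify_rate(rate, lower_critical, lower_warning,
                  upper_warning, upper_critical):
    """Return (exit_code, message) for a rate and two nested ranges."""
    # Outside the outer (critical) range.
    if rate < lower_critical or rate > upper_critical:
        return 2, "EXCHANGERATE CRITICAL: rate is %s" % rate
    # Inside the critical range but outside the inner (warning) range.
    if rate < lower_warning or rate > upper_warning:
        return 1, "EXCHANGERATE WARNING: rate is %s" % rate
    return 0, "EXCHANGERATE OK: rate is %s" % rate
```

With the limits from Listing 12 (200, 225, 275, 300), a rate of 250 is OK, a rate of 210 is a WARNING, and a rate of 310 is CRITICAL.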

Conclusion

Nagios can be used to monitor all kinds of software and hardware. The ability to write your own plugins lets you monitor anything the Nagios server can talk to. Any programming language that accepts command-line arguments and can set a return code will do, so the possibilities are almost limitless.

An experienced system administrator can extend the SOAP example, in Tcl or any other language, to talk to Web services on the intranet and write plugins that verify that these services work correctly.

You can also write plugins in C, or use the C integration of a dynamic language (PyInline in Python, Inline in Perl, or Critcl in Tcl), to combine operating system APIs in C with a plugin written in a high-level language.

Another Nagios feature worth attention is the passive check. The monitoring described in this article relies on short-lived executables: Nagios runs them and reads their results to determine the state. With passive checks, Nagios does not run plugins itself. Instead, separate applications send status messages, either periodically or when the state of a service changes. Such an application can collect alerts from various sources, aggregate them, and pass a prepared summary to Nagios. Nagios can also assume that a service has gone down if no message arrives within a certain period of time. Implementing passive checks with Nagios will be described in the next article.
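As a small preview of passive checks, the following Python sketch builds the line an external application would submit for a service result, using the PROCESS_SERVICE_CHECK_RESULT external command. The function name, host name and service description are hypothetical. Writing the line to the Nagios external command file is not shown, because the location of that file depends on the installation.

```python
import time


def passive_service_result(host, service, return_code, output, timestamp=None):
    """Format a passive service check result as an external command line."""
    if timestamp is None:
        timestamp = int(time.time())
    # Fields after the command name are separated by semicolons.
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s" % (
        timestamp, host, service, return_code, output)


print(passive_service_result("server02", "Backup", 0, "BACKUP OK: finished"))
```

The return code uses the same values as active plugins: 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN.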

The advantage of Nagios plugins is the ease with which they can be created and shared. Plugins cover the situations that network and system administrators deal with every day, and in most cases you can reuse work that someone has already done. As with popular wikis and Web resources, contributing a useful example takes little effort, while the combined capabilities of all available plugins are very large.

Nagios is an open-source program designed to monitor computer systems and networks. It watches the specified hosts and services and notifies the administrator when a service stops (or resumes) working. Nagios also lets you view the status of hosts and services through a web interface. The latest version at the moment is nagios3.

    For nagios3 to work you will need
  • Apache
  • GCC compiler and development libraries
  • GD development libraries

I won't describe how to install Apache. The libraries are installed with these commands:

sudo apt-get install build-essential
sudo apt-get install libgd2-xpm-dev

Nagios3 did not work for me without these libraries. I will describe running Nagios with Apache2.

Nagios3 is installed with a single command.

sudo apt-get install nagios3

After installation, nagios is already up and running. Now let's create a separate Apache virtual host for nagios. Create a configuration file for the nagios host in the /etc/apache2/sites-enabled directory. In the host settings, set the document root to /usr/share/nagios3/htdocs, which is where the nagios3 web files usually live. You also need to include the nagios settings in the Apache configuration. Add the following line to /etc/apache2/apache2.conf:

Include /etc/nagios3/apache2.conf

After that, open the new host in a browser. If you did everything right, the browser should ask for a password that you don't know yet, which means everything works.

Now let's configure nagios. All configuration files are in /etc/nagios3/. The main configuration file, nagios.cfg, includes all other configuration files and holds the settings for nagios itself. If you create your own configuration file, do not forget to include it in this file.

Let's move on to the cgi.cfg file. It holds all CGI script settings and the access rights to the web interface. By default, the nagiosadmin user has full access. To grant rights to other users, add them separated by commas. My configuration looks like this:

default_user_name=myuser
authorized_for_system_information=nagiosadmin,myuser
authorized_for_configuration_information=nagiosadmin,myuser
authorized_for_system_commands=nagiosadmin,myuser
authorized_for_all_services=nagiosadmin,myuser
authorized_for_all_hosts=nagiosadmin,myuser
authorized_for_all_service_commands=nagiosadmin,myuser
authorized_for_all_host_commands=nagiosadmin,myuser

Here myuser is my user name. Now create the file with users and passwords. Go to the /etc/nagios3/ directory and run htpasswd:

cd /etc/nagios3/
sudo htpasswd -c htpasswd.users myuser

and enter a password for the user myuser.

By default, nagios looks up users for authentication in /etc/nagios3/htpasswd.users, but you can keep them elsewhere. To do so, change the AuthUserFile parameter in /etc/nagios3/apache2.conf to your own path.

Now restart nagios for the changes to take effect:

sudo /etc/init.d/nagios3 restart

You can also check the entire nagios configuration before restarting:

sudo nagios3 -v /etc/nagios3/nagios.cfg

This checks the nagios.cfg file and all files included from it, and prints detailed information about any errors it finds. I recommend running this check after every change to the configuration files.

That's all. Now open the virtual host you created for nagios and enter your user name and password.
You will see the status of your services. By default, nagios checks localhost and the gateway. You can add your own hosts and services to check, and we will now look at how.

Let's say I want to see when my colleagues turn their computers on and off. To do this, the hosts must first be described. Create the file my-hosts.cfg in the /etc/nagios3/conf.d directory and put the hosts into it:

# a host definition for my friends comps
define host {
    host_name  volodya         ; host name
    alias      Volodya comp    ; description
    address    192.168.140.3   ; IP address
    use        generic-host
}

define host {
    host_name  lexa
    alias      lexa comp
    address    192.168.140.4
    use        generic-host
}

define host {
    host_name  xz1
    alias      xz1 comp
    address    192.168.140.5
    use        generic-host
}

define host {
    host_name  xz2
    alias      xz2 comp
    address    192.168.140.8
    use        generic-host
}

define host {
    host_name  diman
    alias      diman comp
    address    192.168.140.10
    use        generic-host
}

Because this file is in the /etc/nagios3/conf.d directory, it does not need to be included separately in /etc/nagios3/nagios.cfg: by default, that file already includes every file from /etc/nagios3/conf.d.
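When many similar hosts have to be described, the definitions can be generated instead of typed by hand. The following Python sketch is one possible approach. The function names are hypothetical, and the sketch assumes every host uses the generic-host template, as in the example above.

```python
def host_definition(name, alias, address, template="generic-host"):
    """Return the text of one 'define host' block."""
    lines = [
        "define host {",
        "    host_name  %s" % name,
        "    alias      %s" % alias,
        "    address    %s" % address,
        "    use        %s" % template,
        "}",
    ]
    return "\n".join(lines)


def hosts_file(hosts):
    """Join the definitions for a list of (name, alias, address) tuples."""
    blocks = [host_definition(n, a, ip) for n, a, ip in hosts]
    return "\n\n".join(blocks) + "\n"


print(hosts_file([
    ("lexa", "lexa comp", "192.168.140.4"),
    ("xz1", "xz1 comp", "192.168.140.5"),
]))
```

The printed text could be saved as a file in /etc/nagios3/conf.d and then validated with the nagios3 -v check before restarting.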

Let's combine these hosts into a group. Add the following to the host group configuration file /etc/nagios3/conf.d/hostgroups_nagios2.cfg:

# Define my group
define hostgroup {
    hostgroup_name  my-friends                        ; group name
    alias           my-friends comps                  ; description
    members         lexa, volodya, xz1, xz2, diman    ; group members
}

Now we need to configure a service that will check this group of hosts. Add the following to /etc/nagios3/conf.d/services_nagios2.cfg, or create your own file with it:

# check that my friends comps are up
define service {
    hostgroup_name       my-friends                          ; group name to check
    service_description  PING
    check_command        check_ping!100.0,20%!500.0,60%      ; check command
    use                  generic-service
}

Next, define a contact who will receive the notifications:

define contact {
    contact_name                   pasha                      ; name
    alias                          pasha
    service_notification_period    24x7                       ; service notification period
    host_notification_period       24x7                       ; host notification period
    service_notification_options   w,u,c,r                    ; which service states to notify about
    host_notification_options      d                          ; notify when the host is down
    service_notification_commands  notify-service-by-email    ; how to notify
    host_notification_commands     notify-host-by-email       ; how to notify
    email                          [email protected] yandex.ru   ; mail
}

Time periods are defined in the file /etc/nagios3/conf.d/timeperiods_nagios2.cfg. Several periods are already defined there by default, and you can easily define your own by analogy.

When something breaks in the system or starts to behave in an unusual way, users suffer in unison. Therefore, in this case, you need to notify someone about the breakdown as soon as possible. It would be even better to anticipate the occurrence of problems in advance. This note will describe the installation and configuration of Nagios, which allows you to quite successfully solve such problems.

Invariants

Most systems have a number of invariants that should never be violated. Here are some examples of possible violations:

  • The load average on one of the machines has exceeded X;
  • There is less than X free memory on one of the machines;
  • One of the machines has less than X free disk space;
  • Too many open file descriptors on machine X;
  • The CPU is running very hot, a disk is about to fail, the UPS charge is low;
  • High network traffic, high disk I/O, little free swap, and so on;
  • One of the hosts does not answer pings or answers with too high an RTT;
  • Something has stopped resolving via DNS;
  • Newer versions of installed packages are available;
  • Suspiciously many users are logged into one of the machines;
  • There are critical errors in the logs for the last X minutes;
  • The number of non-critical errors in the last X minutes has exceeded Y;
  • PostgreSQL, Redis, RabbitMQ, ... is down or slow to respond;
  • An SSL certificate will expire soon;
  • The 99th percentile of service response time is much longer than usual;
  • Mail, SMS, or push notifications are not going out, ...;
  • The balance in a third-party service (AWS, Logentries, ...) needs topping up;
  • Suspiciously high costs in a third-party service;
  • The production backup could not be restored in the test environment;
  • The service has become unavailable from Zelenograd and South Africa;
  • According to the service's internal health checks, one of the thread pools is exhausted;

As you can see, in almost any service, you can easily find two dozen invariants, or even more, that should never be violated, and which are quite easy to monitor automatically. If something breaks, we start sending letters to admins, SMS to superiors, and making phone calls to coders.
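As an illustration of how one such invariant can be checked automatically, here is a Python sketch for the free disk space case. The function names are hypothetical, and the default thresholds of 20% and 10% are only example values; they are an assumption of this sketch.

```python
import shutil


def free_percent(path="/"):
    """Return the percentage of free space on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def disk_state(free_pct, warning=20.0, critical=10.0):
    """Map a free-space percentage to (exit_code, message)."""
    if free_pct < critical:
        return 2, "DISK CRITICAL: %.1f%% free" % free_pct
    if free_pct < warning:
        return 1, "DISK WARNING: %.1f%% free" % free_pct
    return 0, "DISK OK: %.1f%% free" % free_pct


code, message = disk_state(free_percent("/"))
print(message)
```

A script built this way would finish by exiting with the returned code, so that a monitoring system can tell the three states apart.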

Installing Nagios

By the way, thanks to my acquaintance with Nagios, I began to understand people who advocate manual sharding and manual failover much better. But this is perhaps a topic for a separate note.

How are you monitoring your system?

Before configuring nagios, install the necessary dependencies

# apt install build-essential apache2 php libapache2-mod-php7.0 php-gd libgd-dev mailutils

Add the user and group that nagios will run as

# useradd nagios
# groupadd nagcmd
# usermod -a -G nagcmd nagios
# usermod -a -G nagcmd www-data

Go to the build directory and download the source code for nagios and the plugins

# cd /usr/src/
# wget https://sourceforge.net/projects/nagios/files/nagios-4.x/nagios-4.2.3/nagios-4.2.3.tar.gz
# wget https://nagios-plugins.org/download/nagios-plugins-2.1.4.tar.gz

Unpack the downloaded archives

# tar xzvf nagios-4.2.3.tar.gz
# tar xzvf nagios-plugins-2.1.4.tar.gz

Go to the directory with the nagios source code and run configure

# cd nagios-4.2.3
# ./configure --prefix=/etc/nagios --with-command-group=nagcmd --with-httpd-conf=/etc/apache2/sites-available --with-mail=/usr/bin/mail

Let's build

#make all

Install nagios

# make install

Install an init script in /etc/init.d and enable auto start

# make install-init # update-rc.d nagios defaults

Set the rights to the directory for storing external batch files

# make install-commandmode

Install nagios configuration files

# make install-config

Set up the nagios config for apache

# make install-webconf

Copy the sample event handler scripts to the nagios libexec folder and set the owner of the copied folder

# cp -R contrib/eventhandlers/ /etc/nagios/libexec/
# chown -R nagios:nagios /etc/nagios/libexec/eventhandlers

Let's check the installed configuration

# /etc/nagios/bin/nagios -v /etc/nagios/etc/nagios.cfg
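The same verification command is handy later on, before each restart of the service. Below is a minimal sketch of that idea, assuming that the verification command exits with a non-zero status when the check fails. The function name restart_if_valid is hypothetical and is not part of nagios.

```shell
# Hypothetical helper: run a verification command and only run the
# restart command if the verification succeeded.
# Usage: restart_if_valid VERIFY_CMD RESTART_CMD  (each a single command string)
restart_if_valid() {
    if sh -c "$1" >/dev/null 2>&1; then
        sh -c "$2"
    else
        echo "configuration check failed; not restarting" >&2
        return 1
    fi
}

# Intended use on the monitoring server (paths taken from this article):
# restart_if_valid '/etc/nagios/bin/nagios -v /etc/nagios/etc/nagios.cfg' 'service nagios restart'
```

Because the real command is commented out, the block only defines the function and changes nothing when it is sourced.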

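Go to the directory with the source code of the plugins, configure, build and install them

# cd /usr/src/nagios-plugins-2.1.4/
# ./configure --prefix=/etc/nagios --with-nagios-user=nagios --with-nagios-group=nagios
# make
# make install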

Enable the nagios configuration in apache and activate the necessary modules

# a2ensite nagios
# a2enmod rewrite cgi

Restart the apache service

# service apache2 restart

Let's start nagios and check the status

# service nagios start
# service nagios status
● nagios.service - Nagios
   Loaded: loaded (/etc/systemd/system/nagios.service; enabled; vendor preset: enabled)
   Active: active (running)

Add a nagios administrator

# htpasswd -c /etc/nagios/etc/htpasswd.users nagiosadmin

Now let's deal with the nagios configuration files.

/etc/nagios/etc/cgi.cfg - defines the settings for the web interface, as well as access rights to the nagios web console.

/etc/nagios/etc/htpasswd.users - database of users and their passwords for accessing the nagios web interface.

/etc/nagios/etc/nagios.cfg - contains the main settings and paths to *.cfg files.

/etc/nagios/etc/resource.cfg - defines the variable ($USER1$) that holds the path to the directory with plugins.

/etc/nagios/etc/objects/commands.cfg - contains command definitions.

/etc/nagios/etc/objects/contacts.cfg - defines the mail contacts to which nagios notifications will be sent.

/etc/nagios/etc/objects/templates.cfg - contains templates for contacts, hosts and services.

/etc/nagios/etc/objects/timeperiods.cfg - contains definitions of time periods.

/etc/nagios/etc/objects/localhost.cfg - configuration for monitoring the nagios server itself. It defines the host itself, the host group for linux servers and monitoring services.

/etc/nagios/etc/objects/printer.cfg - configuration for monitoring a printer. It defines a sample printer, a host group for printers, and monitoring services.

/etc/nagios/etc/objects/switch.cfg - configuration for monitoring a switch. It defines a sample switch, a host group for switches, and monitoring services.

/etc/nagios/etc/objects/windows.cfg - configuration for monitoring a windows host. It defines a sample windows host, a host group for windows servers and monitoring services.

The descriptions show that localhost.cfg, printer.cfg, switch.cfg and windows.cfg each contain host group definitions. For convenience, it makes sense to move them into a separate file, /etc/nagios/etc/objects/hostgroups.cfg, and to comment them out in the original files, because the same object must not be defined in more than one configuration file. We will also add the file /etc/nagios/etc/objects/servicegroups.cfg, in which the service groups will be defined. Since this article sets up monitoring for linux and windows hosts, we will define groups only for them; groups for printers and switches are defined by analogy. We will also create the /etc/nagios/etc/servers/ folder, which will store the files that define the hosts to monitor. Let's make the appropriate changes to the nagios.cfg file

# nano /etc/nagios/etc/nagios.cfg

. . .
# You can specify individual object config files as shown below:
cfg_file=/etc/nagios/etc/objects/commands.cfg
cfg_file=/etc/nagios/etc/objects/contacts.cfg
cfg_file=/etc/nagios/etc/objects/timeperiods.cfg
cfg_file=/etc/nagios/etc/objects/templates.cfg

# Definitions for hostgroups and servicegroups
cfg_file=/etc/nagios/etc/objects/hostgroups.cfg
cfg_file=/etc/nagios/etc/objects/servicegroups.cfg

# Definitions for monitoring the local (Linux) host
cfg_file=/etc/nagios/etc/objects/localhost.cfg

# Definitions for monitoring a Windows machine
#cfg_file=/etc/nagios/etc/objects/windows.cfg

# Definitions for monitoring a router/switch
#cfg_file=/etc/nagios/etc/objects/switch.cfg

# Definitions for monitoring a network printer
#cfg_file=/etc/nagios/etc/objects/printer.cfg

# You can also tell Nagios to process all config files (with a .cfg
# extension) in a particular directory by using the cfg_dir
# directive as shown below:
cfg_dir=/etc/nagios/etc/servers
#cfg_dir=/etc/nagios/etc/printers
#cfg_dir=/etc/nagios/etc/switches
#cfg_dir=/etc/nagios/etc/routers
. . .

Create the files for the host and service groups and set their owner and permissions

# cd /etc/nagios/etc/objects/
# touch hostgroups.cfg servicegroups.cfg
# chown nagios:nagios hostgroups.cfg servicegroups.cfg
# chmod 664 hostgroups.cfg servicegroups.cfg

Create the directory /etc/nagios/etc/servers/ and set its owner and permissions

# mkdir /etc/nagios/etc/servers/
# chown nagios:nagios /etc/nagios/etc/servers/
# chmod 775 /etc/nagios/etc/servers/
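After edits like these it is easy to leave a cfg_file or cfg_dir line in nagios.cfg pointing at a path that has not been created yet. The following is a minimal sketch of a check for that. The function name missing_cfg_paths is hypothetical, and the sketch only looks at active (uncommented) cfg_file= and cfg_dir= lines.

```shell
# Hypothetical helper: print every path named by an active cfg_file= or
# cfg_dir= line in the given nagios.cfg that does not exist on disk.
# Usage: missing_cfg_paths /path/to/nagios.cfg
missing_cfg_paths() {
    sed -n -e 's/^[[:space:]]*cfg_file=//p' \
           -e 's/^[[:space:]]*cfg_dir=//p' "$1" |
    while read -r path; do
        # Commented-out lines never reach this loop because they start with "#".
        [ -e "$path" ] || printf '%s\n' "$path"
    done
}

# Intended use on the monitoring server (path taken from this article):
# missing_cfg_paths /etc/nagios/etc/nagios.cfg
```

An empty result means every referenced file and directory exists.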

Add to hostgroups.cfg the host group definitions for linux and windows servers, taken from localhost.cfg and windows.cfg respectively

# nano /etc/nagios/etc/objects/hostgroups.cfg

# Define an optional hostgroup for Linux machines
# All hosts that use the linux-server template will automatically be a member of this group
define hostgroup{
        hostgroup_name  linux-servers   ; The name of the hostgroup
        alias           Linux Servers   ; Long name of the group
        }

# Define a hostgroup for Windows machines
# All hosts that use the windows-server template will automatically be a member of this group
define hostgroup{
        hostgroup_name  windows-servers ; The name of the hostgroup
        alias           Windows Servers ; Long name of the group
        }

Since the path to windows.cfg is commented out in nagios.cfg, the host group definition in windows.cfg does not need to be commented out, but in localhost.cfg this is mandatory

# nano /etc/nagios/etc/objects/localhost.cfg

. . .
# Define an optional hostgroup for Linux machines
#define hostgroup{
#        hostgroup_name  linux-servers   ; The name of the hostgroup
#        alias           Linux Servers   ; Long name of the group
#        members         localhost       ; Comma separated list of hosts that belong to this group
#        }
. . .

When a windows server object is created, it automatically becomes a member of the windows-servers group. This behavior is defined in the templates.cfg file. For linux servers to be added to the linux-servers group automatically as well, you need to make the following change (the hostgroups line)

# nano /etc/nagios/etc/objects/templates.cfg

. . .
# Linux host definition template - This is NOT a real host, just a template!
define host{
        name                    linux-server    ; The name of this host template
        use                     generic-host    ; This template inherits other values from the generic-host template
        check_period            24x7            ; By default, Linux hosts are checked round the clock
        check_interval          5               ; Actively check the host every 5 minutes
        retry_interval          1               ; Schedule host check retries at 1 minute intervals
        max_check_attempts      10              ; Check each Linux host 10 times (max)
        check_command           check-host-alive ; Default command to check Linux hosts
        notification_period     workhours       ; Linux admins hate to be woken up, so we only notify during the day
                                                ; Note that the notification_period variable is being overridden from
                                                ; the value that is inherited from the generic-host template!
        notification_interval   120             ; Resend notifications every 2 hours
        notification_options    d,u,r           ; Only send notifications for specific host states
        contact_groups          admins          ; Notifications get sent to the admins by default
        hostgroups              linux-servers   ; Host groups that linux servers should be a member of
        register                0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
        }
. . .

To set up notifications, you need to specify the email address of the system administrator in contacts.cfg

# nano /etc/nagios/etc/objects/contacts.cfg

. . .
define contact{
        contact_name    nagiosadmin     ; Short name of user
        use             generic-contact ; Inherit default values from generic-contact template (defined above)
        alias           Nagios Admin    ; Full name of user
        email           [email protected]  ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
        }
. . .

Restart the nagios service

# service nagios restart

You can check that alerts work as follows: select “Hosts” on the left in the web interface, click on “localhost”, click on “Send custom host notification” in the menu on the right, write anything in the “Comment” line and click “Commit”.

An alert should be sent to the email address specified in contacts.cfg.


A little about plugins

The /etc/nagios/libexec folder contains installed plugins. At the beginning of the article, it was noted that nagios receives all information through plugins. For example, let's look at the uptime of the system

# /etc/nagios/libexec/check_uptime
Uptime OK: 0 day(s) 3 hour(s) 52 minute(s) | uptime=232.000000;;;

Most plugins work only with arguments. For example, let's see the status of the system swap file

# /etc/nagios/libexec/check_swap -w 20 -c 10
SWAP OK - 100% free (2044 MB out of 2044 MB) |swap=2044MB;0;0;0;2044

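There are two arguments, -w and -c.

-w - warning: the threshold of free space below which a warning event occurs.

-c - critical: the threshold of free space below which a critical event occurs.

Written with a percent sign (-w 20% -c 10%), the thresholds mean 20% and 10% of free space. Without the percent sign, as in the command above, check_swap reads the values as bytes, which is why the performance data shows thresholds of 0 MB.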
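The threshold logic described above can be sketched as a small plugin-style shell function. This is a minimal sketch and not the source of check_swap: the function name state_for_free_pct is hypothetical and the values are plain percentages. The exit codes 0 (OK), 1 (WARNING) and 2 (CRITICAL) follow the usual Nagios plugin convention.

```shell
# Hypothetical helper: map a "percent free" value to a Nagios state.
# Usage: state_for_free_pct FREE_PCT WARN_PCT CRIT_PCT
state_for_free_pct() {
    free=$1; warn=$2; crit=$3
    if [ "$free" -lt "$crit" ]; then
        echo "CRITICAL - ${free}% free"
        return 2    # exit code 2 is reported as CRITICAL
    elif [ "$free" -lt "$warn" ]; then
        echo "WARNING - ${free}% free"
        return 1    # exit code 1 is reported as WARNING
    fi
    echo "OK - ${free}% free"
    return 0        # exit code 0 is reported as OK
}

# Example: 15% free with thresholds of 20% (warning) and 10% (critical)
# prints "WARNING - 15% free" and returns 1.
state_for_free_pct 15 20 10 || true
```

The critical threshold is tested first, so a value below both thresholds is reported as CRITICAL rather than WARNING.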

Plugins are either local or "general". The examples above are local: the check_swap plugin, for instance, cannot check the status of the paging file on a remote host. The check_ping plugin, on the other hand, can check the availability of both local and remote hosts

# /etc/nagios/libexec/check_ping -H localhost -w 100.0,20% -c 500.0,60%
PING OK - Packet loss = 0%, RTA = 0.04 ms|rta=0.036000ms;100.000000;500.000000;0.000000 pl=0%;20;60;0
# /etc/nagios/libexec/check_ping -H 192.168.1.16 -w 100.0,20% -c 500.0,60%
PING OK - Packet loss = 0%, RTA = 0.27 ms|rta=0.273000ms;100.000000;500.000000;0.000000 pl=0%;20;60;0
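In the output above, everything before the "|" character is the status text and everything after it is performance data. A minimal sketch of separating the two parts follows. The function names plugin_text and plugin_perfdata are hypothetical helpers and not part of the plugins package.

```shell
# Hypothetical helpers: split one line of plugin output at the first "|".
plugin_text() {
    # Everything before the first "|", with trailing whitespace removed.
    printf '%s\n' "${1%%|*}" | sed 's/[[:space:]]*$//'
}

plugin_perfdata() {
    # Everything after the first "|", or an empty line if there is none.
    case $1 in
        *"|"*) printf '%s\n' "${1#*|}" | sed 's/^[[:space:]]*//' ;;
        *)     printf '\n' ;;
    esac
}

line='PING OK - Packet loss = 0%, RTA = 0.04 ms|rta=0.036000ms;100.000000;500.000000;0.000000 pl=0%;20;60;0'
plugin_text "$line"
plugin_perfdata "$line"
```

The sample line is copied from the check_ping output above, so the sketch can be tried without nagios installed.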


A little about NRPE

NRPE - Nagios Remote Plugin Executor. The nrpe plugin is used so that nagios can receive information from remote hosts, such as disk or CPU usage. Through the nrpe plugin, nagios contacts an nrpe server installed on a remote linux/unix host. The nrpe server runs the local plugins there and passes the results back to the nagios server. Important! The nrpe server and the nrpe plugin must be the same version, otherwise errors may occur.


Installing the NRPE Plugin

Before installing the nrpe plugin, you need to install the dependency

# apt install libssl-dev

You can download the latest version of nrpe from the nagios website. Go to the build directory, download and unpack nrpe

# cd /usr/src/
# wget https://github.com/NagiosEnterprises/nrpe/archive/3.0.1.tar.gz
# tar xzvf 3.0.1.tar.gz

Let's go to the folder with nrpe and configure

# cd nrpe-3.0.1
# ./configure --prefix=/etc/nagios

Let's build and install the nrpe plugin

# make check_nrpe
# make install-plugin

Add nrpe support to /etc/nagios/etc/objects/commands.cfg

# nano /etc/nagios/etc/objects/commands.cfg

. . .
# "check_nrpe" command definition
define command{
        command_name    check_nrpe
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
        }
. . .

After the changes made, you need to restart the nagios service

# service nagios restart


Installing the NRPE server

On the linux host that we will be monitoring, we need to install the nrpe server and plugins.

Install the required dependencies

# apt install build-essential libssl-dev

Go to the build directory, download nrpe and the plugins, and unpack them

# cd /usr/src/
# wget https://github.com/NagiosEnterprises/nrpe/archive/3.0.1.tar.gz
# wget https://nagios-plugins.org/download/nagios-plugins-2.1.4.tar.gz
# tar xzvf 3.0.1.tar.gz
# tar xzvf nagios-plugins-2.1.4.tar.gz

Let's go to the directory with nrpe and configure

# cd nrpe-3.0.1
# ./configure --prefix=/etc/nagios

Let's build

# make nrpe

Add the user and group that the nrpe server will run as

# make install-groups-users

Let's install the server and the configuration file

# make install-daemon
# make install-config

Install the start script

# make install-init
# systemctl enable /lib/systemd/system/nrpe.service

Let's go to the directory with the source code of the plugins and configure

# cd /usr/src/nagios-plugins-2.1.4/
# ./configure --prefix=/etc/nagios --with-nagios-user=nagios --with-nagios-group=nagios

Build and install plugins

# make
# make install

In /etc/nagios/etc/nrpe.cfg we will allow the nagios server to query this host, and in the predefined disk check command we will specify the actual name of the disk partition to be monitored

# nano /etc/nagios/etc/nrpe.cfg

. . .
allowed_hosts=127.0.0.1,192.168.1.13
. . .
command[check_users]=/etc/nagios/libexec/check_users -w 5 -c 10
command[check_load]=/etc/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
command[check_sda1]=/etc/nagios/libexec/check_disk -w 20% -c 10% -p /dev/sda1
command[check_zombie_procs]=/etc/nagios/libexec/check_procs -w 5 -c 10 -s Z
command[check_total_procs]=/etc/nagios/libexec/check_procs -w 150 -c 200
. . .

192.168.1.13 should be replaced with the address of your nagios server.

Start the nrpe server and check its status

# service nrpe start
# service nrpe status
● nrpe.service - Nagios Remote Program Executor
   Loaded: loaded (/lib/systemd/system/nrpe.service; enabled; vendor preset: enabled)
   Active: active (running)


Adding a linux host to the monitoring system

To do this, we will create a linux-serv.cfg file in the servers folder

# nano /etc/nagios/etc/servers/linux-serv.cfg

define host{
        use             linux-server
        host_name       linux-serv
        alias           linux-serv
        address         192.168.1.12
        }

define service{
        use                     generic-service
        host_name               linux-serv
        service_description     CPU Load
        check_command           check_nrpe!check_load
        }

define service{
        use                     generic-service
        host_name               linux-serv
        service_description     Current Users
        check_command           check_nrpe!check_users
        }

define service{
        use                     generic-service
        host_name               linux-serv
        service_description     /dev/sda1 Free Space
        check_command           check_nrpe!check_sda1
        }

define service{
        use                     generic-service
        host_name               linux-serv
        service_description     Total Processes
        check_command           check_nrpe!check_total_procs
        }

define service{
        use                     generic-service
        host_name               linux-serv
        service_description     Zombie Processes
        check_command           check_nrpe!check_zombie_procs
        }

192.168.1.12 needs to be replaced with the address of your linux server.

The use directive points to the name of a template in templates.cfg that defines the default settings. In order for the new host to appear in the web interface, you need to restart the nagios service

# service nagios restart
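When several linux hosts are added, their files in the servers folder differ only in the host name, the address and the list of checks. The following is a minimal sketch of generating such a file. The function name linux_host_cfg is hypothetical, and for brevity it reuses the NRPE check name as the service description instead of the descriptive names used above.

```shell
# Hypothetical generator: print a host definition plus one NRPE-based
# service per check name.
# Usage: linux_host_cfg HOST_NAME ADDRESS CHECK_NAME...
linux_host_cfg() {
    host=$1; addr=$2; shift 2
    cat <<EOF
define host{
        use             linux-server
        host_name       $host
        alias           $host
        address         $addr
        }
EOF
    for check in "$@"; do
        cat <<EOF

define service{
        use                     generic-service
        host_name               $host
        service_description     $check
        check_command           check_nrpe!$check
        }
EOF
    done
}

# Example: print the definitions to stdout. Redirect the output to a file
# in /etc/nagios/etc/servers/ once it looks right.
linux_host_cfg linux-serv 192.168.1.12 check_load check_users
```

Each check name passed to the function must also be defined as a command on the monitored host's nrpe server.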

To check if nrpe is working, you can run the command

# /etc/nagios/libexec/check_nrpe -H 192.168.1.12
NRPE v3.0.1

To check a specific service, you need to add an argument with the name of the check

# /etc/nagios/libexec/check_nrpe -H 192.168.1.12 -c check_sda1
DISK OK - free space: /var/tmp 14549 MB (85% inode=88%);| /var/tmp=2527MB;14411;16212;0;18014

We defined the name of the check in the nrpe.cfg file

command[check_sda1]=/etc/nagios/libexec/check_disk -w 20% -c 10% -p /dev/sda1

and in the linux-serv.cfg file

define service{
        use                     generic-service
        host_name               linux-serv
        service_description     /dev/sda1 Free Space
        check_command           check_nrpe!check_sda1
        }
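Since the name after check_nrpe! has to match a command[...] entry in nrpe.cfg on the monitored host, a mismatch is an easy mistake to make. Below is a minimal sketch of a cross-check, assuming copies of both files are available on one machine. All three function names are hypothetical helpers.

```shell
# Hypothetical helper: list command names defined as command[NAME]=...
# in nrpe.cfg-style text read from stdin.
nrpe_command_names() {
    sed -n 's/^[[:space:]]*command\[\([^]]*\)][[:space:]]*=.*/\1/p'
}

# Hypothetical helper: list names referenced as "check_command check_nrpe!NAME"
# in host definition text read from stdin.
referenced_nrpe_names() {
    sed -n 's/^.*check_command[[:space:]]\{1,\}check_nrpe!\([^[:space:]!]*\).*/\1/p'
}

# Print every referenced name that has no matching definition.
# Usage: missing_nrpe_commands NRPE_CFG HOST_CFG
missing_nrpe_commands() {
    defined=$(nrpe_command_names < "$1")
    referenced_nrpe_names < "$2" | while read -r name; do
        printf '%s\n' "$defined" | grep -Fqx -- "$name" || printf '%s\n' "$name"
    done
}

# Intended use, with a copy of the remote nrpe.cfg in the current directory:
# missing_nrpe_commands ./nrpe.cfg /etc/nagios/etc/servers/linux-serv.cfg
```

An empty result means every check referenced in the host file is defined in nrpe.cfg.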


Windows host monitoring

The check_nt plugin is used to monitor windows hosts. It is included in the base plugins and does not need to be installed separately. Through check_nt, nagios accesses NSClient++ installed on the windows host. NSClient++ queries its modules for information about the system and sends the result back to the nagios server.


Installing NSClient++

NSClient++ needs to be installed on the windows host. Download the latest version and run the installer as administrator.

Click "Next"

Click "Next" again

Choose "Typical"

Specify the address of the nagios server, password and leave the first two checkboxes. Click "Next"

Click "Install"

Click "Finish"


Adding a windows host to the monitoring system

To do this, we will create a windows-serv.cfg file in the servers folder

# nano /etc/nagios/etc/servers/windows-serv.cfg

define host{
        use             windows-server
        host_name       windows-serv
        alias           My Windows Server
        address         192.168.1.33
        }

define service{
        use                     generic-service
        host_name               windows-serv
        service_description     NSClient++ Version
        check_command           check_nt!CLIENTVERSION
        }

define service{
        use                     generic-service
        host_name               windows-serv
        service_description     Uptime
        check_command           check_nt!UPTIME
        }

define service{
        use                     generic-service
        host_name               windows-serv
        service_description     CPU Load
        check_command           check_nt!CPULOAD!-l 5,80,90
        }

define service{
        use                     generic-service
        host_name               windows-serv
        service_description     Memory Usage
        check_command           check_nt!MEMUSE!-w 80 -c 90
        }

define service{
        use                     generic-service
        host_name               windows-serv
        service_description     C:\ Drive Space
        check_command           check_nt!USEDDISKSPACE!-l c -w 80 -c 90
        }

define service{
        use                     generic-service
        host_name               windows-serv
        service_description     VMTools
        check_command           check_nt!SERVICESTATE!-d SHOWALL -l VMTools
        }

define service{
        use                     generic-service
        host_name               windows-serv
        service_description     Explorer
        check_command           check_nt!PROCSTATE!-d SHOWALL -l explorer.exe
        }

192.168.1.33 must be replaced with the address of your windows server.

If a password was specified when installing NSClient++, you need to add it to the check_nt command in commands.cfg (MegaPass in this example)

# nano /etc/nagios/etc/objects/commands.cfg

. . .
# "check_nt" command definition
define command{
        command_name    check_nt
        command_line    $USER1$/check_nt -H $HOSTADDRESS$ -p 12489 -s MegaPass -v $ARG1$ $ARG2$
        }
. . .

And restart the nagios service

# service nagios restart


Service group definition

A service group brings together the same kind of service check from several hosts. Let's create a CPU Load service group that combines the processor load checks of all hosts.

# nano /etc/nagios/etc/objects/servicegroups.cfg

define servicegroup{
        servicegroup_name       cpuload
        alias                   CPU Load
        members                 linux-serv,CPU Load,localhost,Current Load,windows-serv,CPU Load
        }

Group members are defined in the members directive as a list of host and service pairs, according to the principle
members=<host1>,<service1>,<host2>,<service2>,…,<hostn>,<servicen>
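Because the members value is a flat, comma-separated list, it is easy to drop one half of a pair when editing it by hand. The following is a minimal sketch that builds the value from explicit pairs. The function name servicegroup_members and the "host:service description" input format are hypothetical conventions chosen for this sketch.

```shell
# Hypothetical helper: build a servicegroup "members" value from
# "host:service description" pairs.
# Usage: servicegroup_members 'HOST:SERVICE' ...
servicegroup_members() {
    members=
    for pair in "$@"; do
        host=${pair%%:*}        # text before the first colon
        service=${pair#*:}      # text after the first colon
        # Add a separating comma only when the list is not empty yet.
        members="${members:+$members,}$host,$service"
    done
    printf '%s\n' "$members"
}

# Example with the three services from the cpuload group above; prints
# linux-serv,CPU Load,localhost,Current Load,windows-serv,CPU Load
servicegroup_members 'linux-serv:CPU Load' 'localhost:Current Load' 'windows-serv:CPU Load'
```

Only the first colon separates host from service, so a description such as C:\ Drive Space is passed through unchanged.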

To make the group available, you need to restart the nagios service

# service nagios restart


Location of host links on the map

By default, all hosts on the map are connected to Nagios Process. There are times when you need to override this behavior. For example, the connection should not come from Nagios Process, but from another point on the map (as an example, a server connection through a switch). This is done by adding the parents directive to the host description section. For an illustrative example, let's change the windows-serv connection from Nagios Process to linux-serv

# nano /etc/nagios/etc/servers/windows-serv.cfg

define host{
        use             windows-server
        host_name       windows-serv
        alias           My Windows Server
        address         192.168.1.33
        parents         linux-serv
        }
. . .

And restart the nagios service

# service nagios restart

In the first case, all hosts had a connection with Nagios Process, in the second case, the windows-serv connection starts from linux-serv.


Enable icons

nagios can display icons next to the host name. The icons are located in the /etc/nagios/share/images/logos folder. You can use the ready-made set or download another one from the Internet. To enable the display of icons, you need to add the icon_image and statusmap_image lines to the templates in templates.cfg

# nano /etc/nagios/etc/objects/templates.cfg

. . .
# Linux host definition template - This is NOT a real host, just a template!
define host{
        name                    linux-server    ; The name of this host template
        use                     generic-host    ; This template inherits other values from the generic-host template
        check_period            24x7            ; By default, Linux hosts are checked round the clock
        check_interval          5               ; Actively check the host every 5 minutes
        retry_interval          1               ; Schedule host check retries at 1 minute intervals
        max_check_attempts      10              ; Check each Linux host 10 times (max)
        check_command           check-host-alive ; Default command to check Linux hosts
        notification_period     workhours       ; Linux admins hate to be woken up, so we only notify during the day
                                                ; Note that the notification_period variable is being overridden from
                                                ; the value that is inherited from the generic-host template!
        notification_interval   120             ; Resend notifications every 2 hours
        notification_options    d,u,r           ; Only send notifications for specific host states
        contact_groups          admins          ; Notifications get sent to the admins by default
        hostgroups              linux-servers   ; Host groups that linux servers should be a member of
        icon_image              linux40.png
        statusmap_image         linux40.gd2
        register                0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
        }

# Windows host definition template - This is NOT a real host, just a template!
define host{
        name                    windows-server  ; The name of this host template
        use                     generic-host    ; Inherit default values from the generic-host template
        check_period            24x7            ; By default, Windows servers are monitored round the clock
        check_interval          5               ; Actively check the server every 5 minutes
        retry_interval          1               ; Schedule host check retries at 1 minute intervals
        max_check_attempts      10              ; Check each server 10 times (max)
        check_command           check-host-alive ; Default command to check if servers are "alive"
        notification_period     24x7            ; Send notification out at any time - day or night
        notification_interval   30              ; Resend notifications every 30 minutes
        notification_options    d,r             ; Only send notifications for specific host states
        contact_groups          admins          ; Notifications get sent to the admins by default
        hostgroups              windows-servers ; Host groups that Windows servers should be a member of
        icon_image              win40.png
        statusmap_image         win40.gd2
        register                0               ; DONT REGISTER THIS - ITS JUST A TEMPLATE
        }
. . .

Restart the nagios service

# service nagios restart

A computer