[Image: Learning NAGIOS 3.0]
Nagios Configuration
Nagios stores its configuration in a separate directory, usually either /etc/nagios or /usr/local/etc/nagios. If you followed the steps for a manual installation (as described earlier), it is in /etc/nagios.
Main Configuration File
The main configuration file is called nagios.cfg, and it is the file that is loaded during Nagios startup. Its syntax is simple: a line beginning with # is a comment, and every line of the form <parameter>=<value> sets a value. Some parameters may be repeated, for example to specify additional files or directories to read.
The following is a sample of Nagios's main configuration file:
# log file to use
log_file=/var/nagios/nagios.log
# object configuration directory
cfg_dir=/etc/nagios/objects
# storage information
resource_file=/etc/nagios/resource.cfg
status_file=/var/nagios/status.dat
status_update_interval=10
(…)
The main configuration file needs to define a log file to use, and this has to be the first option in the file. It also configures various parameters that tune Nagios's behavior and performance. The following are some of the commonly changed options:
[Table: commonly changed main configuration options]
For a complete list of accepted parameters, please consult the Nagios documentation on http://nagios.sourceforge.net/docs/3_0/configmain.html.
The resource_file option defines the file in which all user variables are stored. This file can be used to store additional information that can be accessed in all object definitions. It usually contains sensitive data, because its variables can only be used in object definitions and cannot be read from the web interface. This makes it possible to hide passwords of various sensitive services from Nagios administrators who do not have adequate privileges. There can be up to 32 macros, named $USER1$, $USER2$ … $USER32$. The $USER1$ macro defines the path to the Nagios plugins and is commonly used in check command definitions.
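As a minimal sketch, a resource file could look like the following. It assumes the plugins live in /opt/nagios/plugins (the path that appears in the expanded command later in this chapter); the password value and the check-mysql-login command name are made up for illustration.

```
# /etc/nagios/resource.cfg
# Path to the Nagios plugins, used by the standard command definitions
$USER1$=/opt/nagios/plugins
# Hypothetical database password, kept out of the object definitions
$USER2$=examplepassword

# In an object configuration file, a hypothetical command can use both macros
define command
{
    command_name  check-mysql-login
    command_line  $USER1$/check_mysql -H $HOSTADDRESS$ -u nagios -p $USER2$
}
```

Keeping the password in the resource file means that it can be changed in one place, and the file itself can be made readable only by the user that runs Nagios.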
The cfg_file and cfg_dir options specify the files that should be read for object definitions. The first specifies a single file to read, and the second specifies a directory in which all files with a .cfg extension, including those in subdirectories, are read. Each file may contain different types of objects. The following sections describe each type of definition that Nagios uses.
One of the first things that needs to be decided is how your Nagios configuration should be stored. In order to create a configuration that is maintainable as your IT infrastructure changes, it is worth investing some time in planning out how you want your host definitions set up and how they could be most easily placed in a configuration file structure. Throughout this book, various approaches on how to make your configuration maintainable are discussed. It's also recommended that you set up a small Nagios system to get a better understanding of Nagios configuration, before proceeding to larger setups.
Sometimes, it is best to have configuration grouped into separate directories defined according to the locations that hosts and/or services are in. In other cases, it might be best to keep definitions of all servers with similar functionalities in one directory.
A good directory separation makes it much easier to control the Nagios configuration, for example to disable all objects related to a particular part of the IT infrastructure at once. Even though it is recommended to use downtimes, it is sometimes useful to simply remove all such entries from the Nagios configuration.
Throughout all configuration examples in this book, we use a directory structure. A separate directory is used for each object type, and similar objects are grouped within a single file. For example, all command definitions are stored in the commands/ subdirectory. All host definitions are stored in the hosts/<hostname>.cfg files.
In order for Nagios to read configuration from these directories, edit your main Nagios configuration file (/etc/nagios/nagios.cfg), remove all cfg_file and cfg_dir entries, and add the following ones:
cfg_dir=/etc/nagios/commands
cfg_dir=/etc/nagios/timeperiods
cfg_dir=/etc/nagios/contacts
cfg_dir=/etc/nagios/hosts
cfg_dir=/etc/nagios/services
In order to use the default Nagios plugins, copy the default Nagios command definitions file /etc/nagios/objects/commands.cfg to /etc/nagios/commands/default.cfg.
In addition, please make sure that the following options are set as shown in your nagios.cfg file:
check_external_commands=1
interval_length=60
accept_passive_service_checks=1
accept_passive_host_checks=1
If any of these options is set to a different value, change it. If an option is not present in the file, add it to the end of the file.
After making these changes to the Nagios setup, you can move on to the next sections and prepare a working configuration for your Nagios installation.
Macro Definitions
The ability to use macro definitions is one of the key features of Nagios. Macros offer a lot of flexibility in object and command definitions. Nagios 3 also provides custom macro definitions, which make it easier to use object templates for specifying parameters common to a group of similar objects.
All command definitions can use macros. Macro definitions allow parameters from other objects, such as hosts, services, and contacts, to be referenced so that a command does not need to have everything passed as an argument. Each macro invocation begins and ends with a $ sign.
A typical example is the HOSTADDRESS macro, which references the address field from the host object. All host definitions provide the value of the address parameter. For the following host and command definition:
define host
{
    host_name      somemachine
    address        10.0.0.1
    check_command  check-host-alive
}
define command
{
    command_name  check-host-alive
    command_line  $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
}
this command will be invoked:
/opt/nagios/plugins/check_ping -H 10.0.0.1 -w 3000.0,80% -c 5000.0,100% -p 5
In addition, note that the USER1 macro was also used and was expanded to the path of the Nagios plugins directory. This macro references data contained in the file that is specified by the resource_file configuration directive. Even though the USER1 macro does not have to point to the plugins directory, all standard command definitions that come with Nagios use it this way, so it is recommended that you do not change it.
Some of the macro definitions are listed in the following table:
[Table: commonly used macro definitions]
This table is not complete and only covers commonly used macro definitions. A complete list of available macros can be found in the Nagios documentation at http://nagios.sourceforge.net/docs/3_0/macros.html. Moreover, remember that all macro definitions need to be prefixed and suffixed with a $ sign; for example, $HOSTADDRESS$ maps to the HOSTADDRESS macro definition.
Nagios also supports on-demand macro definitions. These macros are not defined in advance and are not exported as environment variables, but if one is found in a command definition, it is parsed and substituted accordingly. On-demand macros accept one or more arguments inside the macro name, each passed after a colon. This is mainly used to read specific values not related to the current object. For example, to read the contact email for user jdoe, regardless of who the current contact person is, the macro would be $CONTACTEMAIL:jdoe$, which means getting the CONTACTEMAIL macro definition in the context of the jdoe contact.
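The following sketch shows how an on-demand macro might appear in a command definition. The command name and the paths to printf and mail are assumptions made for illustration; $HOSTNAME$ and $HOSTSTATE$ are standard Nagios macros.

```
# Hypothetical notification command that always mails jdoe,
# regardless of which contact the notification is being sent for
define command
{
    command_name  notify-jdoe-by-email
    command_line  /usr/bin/printf "%b" "Host $HOSTNAME$ is $HOSTSTATE$\n" | /bin/mail -s "Host alert: $HOSTNAME$" $CONTACTEMAIL:jdoe$
}
```

A plain $CONTACTEMAIL$ in the same place would instead expand to the address of whichever contact is currently being notified.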
Nagios 3 also offers custom macro definitions. This works in a way that allows administrators to define additional attributes in each type of object, and the macro can then be used inside a command. This is used to store additional parameters related to an object—for example, you can store a MAC address in a host definition and use it in certain types of host checks.
It works in such a way that an object has a directive that starts with an underscore and is written in uppercase. It is referenced in one of the following ways, based on the object type it is defined in:
- $_HOST<variable>$: for directives defined within a host object
- $_SERVICE<variable>$: for directives defined within a service object
- $_CONTACT<variable>$: for directives defined within a contact object
A sample host definition that includes an additional directive with a MAC address would be as follows:
define host
{
host_name somemachine
address 10.0.0.1
_MAC 12:12:12:12:12:12
check_command check-host-by-mac
}
and a corresponding check command that uses this attribute inside a check:
define command
{
command_name check-host-by-mac
command_line $USER1$/check_hostmac -H $HOSTADDRESS$ -m $_HOSTMAC$
}
Since Nagios 3, the majority of standard macro definitions are exported to check commands as environment variables. The environment variable names are the same as the macro names, but are prefixed with NAGIOS_; for example, HOSTADDRESS is passed as the NAGIOS_HOSTADDRESS variable. On-demand macros are not made available. For security reasons, the $USERn$ variables are also not passed to commands as environment variables.
Configuring Hosts
Hosts are objects that describe machines that should be monitored, either physical hardware or virtual machines. A host consists of a short name, a descriptive name, and an IP address. The host definition also tells Nagios when and how the system should be monitored, as well as who should be contacted with regard to any problems related to this host. It also specifies how often the host should be checked, how retrying the checks should be handled, and how often notifications about problems should be sent out.
A sample definition of a host is as follows:
define host
{
    host_name              linuxbox01
    hostgroups             linuxservers
    alias                  Linux Server 01
    address                10.0.2.1
    check_command          check-host-alive
    check_interval         5
    retry_interval         1
    max_check_attempts     5
    check_period           24x7
    contact_groups         linux-admins
    notification_interval  30
    notification_period    24x7
    notification_options   d,u,r
}
This defines a Linux box that uses the check-host-alive command to make sure the box is up and running. The test is performed every five minutes, and after five failed tests, Nagios assumes the host is down. If it is down, a notification is sent out every 30 minutes.
The following is a table of common directives that can be used to describe hosts. Items in bold are required while specifying a host.
[Table: common host directives]
For a complete list of accepted parameters, please consult the Nagios documentation at http://nagios.sourceforge.net/docs/3_0/objectdefinitions.html#host.
By default, Nagios assumes all host states to be up. If the check_command option is not specified for a host, then it will always be in the up state. When a command to perform host checks is specified, regularly scheduled checks take place and the host state is monitored, using the value of check_interval as the number of minutes between checks.
Nagios uses soft and hard state logic to handle host states. If a host state has changed from UP to DOWN since the last hard state, Nagios assumes that the host is in a soft DOWN state and retries the test, waiting retry_interval minutes between each test. Once the result is the same after max_check_attempts attempts, Nagios assumes that the DOWN state is a hard state. The same mechanism applies to DOWN to UP transitions.
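To make the timing concrete, here is a rough worked example based on the values used for linuxbox01. It assumes that interval_length is 60 (so intervals are in minutes), that checks return immediately, and that the first failed check counts as attempt 1, which is how the Nagios documentation on state types describes it; the clock times are purely illustrative.

```
# Relevant directives from the host definition
#   check_interval      5
#   retry_interval      1
#   max_check_attempts  5
#
# 12:00  check succeeds                    hard UP
# 12:05  check fails     (attempt 1 of 5)  soft DOWN
# 12:06  retry fails     (attempt 2 of 5)  soft DOWN
# 12:07  retry fails     (attempt 3 of 5)  soft DOWN
# 12:08  retry fails     (attempt 4 of 5)  soft DOWN
# 12:09  retry fails     (attempt 5 of 5)  hard DOWN, notifications are sent
# 12:14  next regular check, five minutes later
```

If any of the retries had succeeded, the host would have returned to the UP state without a notification being sent.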
The parents directive of the host object is used to define the topology of the network. Usually, this directive points to a switch, router, or any other device that is responsible for forwarding network packets. The host is assumed to be unreachable if the parent host is currently in a hard DOWN state. For example, if a router is down, then all machines accessed through it are considered unreachable and no tests will be performed on them.
If your network consists of servers connected via a switch and routers to a different network, then the parent for all of the servers in the local network, as well as the router, would be the switch. The parent of the router on the other side of the link would be the local router. The following diagram shows the actual network infrastructure and indicates how Nagios hosts should be configured in terms of parents for each element of the network:
[Figure: Configuring Hosts. Actual network topology and the corresponding Nagios parent host setup]
The actual network topology is shown on the left, and the parent hosts setup for the machines is shown on the right. Each arrow represents a mapping from a host to a parent host. There is no need to define a parent for hosts that are directly on the network with your Nagios server. So in this case, switch1 should not have a parent host defined.
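The parent relationships described above could be written roughly as follows. Only switch1 is named in the text; the other host names and all addresses are assumptions, and the remaining host directives (check command, contacts, and so on) are left out for brevity.

```
# Directly on the same network as the Nagios server: no parents directive
define host
{
    host_name  switch1
    address    10.0.0.2
}
# A server and the local router are both reached through the switch
define host
{
    host_name  server1
    address    10.0.0.10
    parents    switch1
}
define host
{
    host_name  router1
    address    10.0.0.1
    parents    switch1
}
# The router on the other side of the link is reached through the local router
define host
{
    host_name  router2
    address    10.1.0.1
    parents    router1
}
```

With this layout, a hard DOWN state on router1 would make router2 and everything behind it unreachable rather than down.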
Even though some devices, such as switches, cannot be easily checked to see if they are down, it is still a good idea to describe them as a part of your topology. In this case, you might use scheduled downtime to keep track of when the device is going to be offline, or mark it as DOWN manually. This helps in diagnosing other problems: Nagios will not scan hosts that have a router somewhere along the path that is currently scheduled for downtime. This way, you won't be flooded with notifications about hosts being down when they are actually unreachable.
Check and notification periods specify the time periods during which checks for host state and notifications are to be performed. These can be specified so that different hosts can be monitored at different times.
It is also possible to create a setup in which the information that a host is down is kept, but nobody is notified about it. This can be done by specifying a notification_period that tells Nagios when notifications should be sent out. No notifications are sent outside of this time period.
A typical example is a server that is only required during business hours and has a daily maintenance window between 10 PM and 4 AM. You can set up Nagios so as to not monitor host availability outside of business hours, or you can make Nagios monitor it, but without notifying that it is actually down. If monitoring is not done at all, Nagios will perform fewer operations during this period. In the second case, it is possible to gather statistics on how much of the maintenance window is used—which can be used to see if changes to the window need to be made.
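The second approach, monitoring at all times but notifying only during business hours, might be sketched like this. The host name and address are hypothetical; the 24x7 and workinghours time periods and the admins contact group are the ones defined elsewhere in this chapter.

```
define host
{
    host_name              officeserver            ; hypothetical host
    alias                  Office server
    address                10.0.2.10               ; hypothetical address
    check_command          check-host-alive
    check_interval         5
    retry_interval         1
    max_check_attempts     5
    check_period           24x7                    ; state is tracked at all times
    contact_groups         admins
    notification_interval  30
    notification_period    workinghours            ; nobody is notified outside business hours
    notification_options   d,u,r
}
```

Setting check_period to workinghours as well would give the first approach, in which the host is not checked at all outside business hours.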
Nagios allows the grouping of multiple hosts in order to effectively manage them. In order to do this, Nagios offers host group objects, which are a group of one or more machines. A host may be a member of more than one host group. Usually, grouping is done either by the type of machines or by the location they are in.
Each host group has a unique short name that is specified along with a descriptive name, and one or more hosts that are members of the group.
Example host group definitions that define two groups of hosts, and a third group that combines both of them, are as follows:
define hostgroup
{
    hostgroup_name  linux-servers
    alias           Linux servers
    members         linuxbox1,linuxbox2
}
define hostgroup
{
    hostgroup_name  aix-servers
    alias           AIX servers
    members         aixbox1,aixbox2
}
define hostgroup
{
    hostgroup_name     unix-servers
    alias              UNIX servers
    hostgroup_members  linux-servers,aix-servers
}
The following table shows the directives that can be used to describe host groups. Items in bold are required when specifying a host group.
[Table: host group directives]
Host groups can also be used when defining services or dependencies. For example, it is possible to tell Nagios that all Linux servers should have their SSH service monitored and that all AIX servers should have a telnet service accepting connections.
It is also possible to define dependencies between hosts. They are, in a way, similar to a parent-host relationship, but dependencies offer more complex configuration options. Nagios will only issue host and service checks for a host if all the hosts that it depends on are currently up. More details on dependencies can be found in Chapter 5.
For the purpose of this book, we will define at least one host in our Nagios configuration directory structure.
To be able to monitor the local server that the Nagios installation is running on, we will need to add its definition to the /etc/nagios/hosts/localhost.cfg file as follows:
define host
{
    host_name              localhost
    alias                  Localhost
    address                127.0.0.1
    check_command          check-host-alive
    check_interval         5
    retry_interval         1
    max_check_attempts     5
    check_period           24x7
    contact_groups         admins
    notification_interval  60
    notification_period    24x7
    notification_options   d,u,r
}
If you are planning to monitor other servers as well, you will want to add them too, either in a single file or in multiple files.
Configuring Services
Services are objects that describe the functionality a particular host is offering. This can be virtually anything—network servers such as FTP, or resources such as storage space or CPU load.
A service is always tied to a host that it is running on. It is also identified by its description, which needs to be unique within a particular host. A service also defines when and how Nagios should check to see if it is running properly, and how to notify people responsible for this service, if it is not.
A short example of a web server service that is defined on the linuxbox01 machine created earlier is as follows:
define service
{
    host_name              linuxbox01
    service_description    WWW
    check_command          check_http
    check_interval         10
    check_period           24x7
    retry_interval         3
    max_check_attempts     3
    notification_interval  30
    notification_period    24x7
    notification_options   w,c,u,r
    contact_groups         linux-admins
}
This definition tells Nagios to check that the web server is working correctly every 10 minutes.
The following table shows the common directives that can be used to describe a service. Items in bold are required when specifying a service.
[Table: common service directives]
For a complete list of accepted parameters, refer to the Nagios documentation at http://nagios.sourceforge.net/docs/3_0/objectdefinitions.html#service.
Very often, the same service is offered by more than one host. In such cases, it is possible to specify a service that will be provided by multiple machines, or even specify host groups for which all hosts will be checked. It is also possible to specify the hosts for which checks will not be performed—for example, if a service is present on all hosts in a group except for a specific box. To do that, an exclamation mark needs to be added before a host name or a host group name.
For example, to tell Nagios that SSH should be checked on all Linux servers except for linux01, as well as on the aix01 machine, a service definition similar to the one shown here can be created:
define service
{
hostgroup_name linux-servers
host_name !linux01,aix01
service_description SSH
check_command check_ssh
check_interval 10
check_period 24x7
retry_interval 2
max_check_attempts 3
notification_interval 30
notification_period 24x7
notification_options w,c,r
contact_groups linux-admins
}
Services can be grouped in a similar way to host objects. This can be done to manage services more conveniently. It also aids in viewing service reports on the Nagios web interface. Service groups are also used to configure dependencies in a more convenient way.
The following table describes the attributes that can be used to define a service group. Items in bold are required when specifying a service group.
[Table: service group directives]
The format of the members directive of a service group object is one or more <host>,<service> pairs.
An example of a service group is shown here:
define servicegroup
{
servicegroup_name databaseservices
alias All services related to databases
members linux01,mysql,linux01,pgsql,aix01,db2
}
This service group consists of the mysql and pgsql services on the linux01 host and the db2 service on the aix01 machine. It is uniquely identified by its name, databaseservices.
It is also possible to specify the groups that a service should be a member of inside the service definition itself. To do this, list all of these groups in the servicegroups directive of the service definition.
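As a sketch of this alternative, the mysql service from the previous example could declare its own membership, leaving the group definition without a members directive. The check_mysql command name is an assumption; a matching command definition would need to exist.

```
define servicegroup
{
    servicegroup_name  databaseservices
    alias              All services related to databases
}
define service
{
    host_name              linux01
    service_description    mysql
    servicegroups          databaseservices        ; membership declared on the service
    check_command          check_mysql             ; hypothetical command name
    check_interval         10
    check_period           24x7
    retry_interval         2
    max_check_attempts     3
    notification_interval  30
    notification_period    24x7
    notification_options   w,c,r
    contact_groups         linux-admins
}
```

This keeps the group membership next to the service it belongs to, which can be easier to maintain when services are spread across many files.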
Services may be configured to be dependent on one another, similar to how hosts can. In this case, Nagios will only perform checks on a service if all the services that it depends on are working correctly. More details on dependencies can be found in Chapter 5, Advanced Configuration.
Nagios requires that at least one service is defined for every host, and that at least one service is defined in order for it to run. That is why we will now create a sample service in our configuration directory structure. For this purpose, we will monitor the secure shell protocol.
In order to check if the SSH server is running on the machine that hosts the Nagios installation, we will need to add its definition to the /etc/nagios/hosts/localhost.cfg file:
define service
{
    host_name              localhost
    service_description    ssh
    check_command          check_ssh
    check_interval         5
    retry_interval         1
    max_check_attempts     3
    check_period           24x7
    contact_groups         admins
    notification_interval  60
    notification_period    24x7
    notification_options   w,c,u,r
}
If you are planning on monitoring other services as well, you will want to add them to the same file.
Configuring Commands
Command definitions describe how host/service checks should be done. They can also define how notifications about problems or event handlers should work. A command definition has two parameters—name and command line. The first parameter is a name that is then used for defining checks and notifications. The second parameter is an actual command that will be run, along with all required parameters for the command.
Commands are used by hosts and services. They define what system command to execute when making sure a host or service is working properly. A check command is identified by its unique name.
When referenced from other object definitions, a command can also take additional arguments, using an exclamation mark as a delimiter. Commands with parameters have the following syntax: command_name[!arg1][!arg2][!arg3][...].
A command name is often the same as the plugin that it runs, but it can be different. The command line includes macro definitions (such as $HOSTADDRESS$). Check commands also use the macros $ARG1$, $ARG2$ … $ARG32$ if the check command for the host or service passes additional arguments.
The following is an example that defines a command for trying to ping a host to make sure it is working properly. It does not use any arguments.
define command
{
command_name check-host-alive
command_line $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
}
A very short host definition that uses this check command could be similar to the one shown here:
define host
{
host_name somemachine
address 10.0.0.1
check_command check-host-alive
}
Such a check is usually done as part of the host checks. This allows Nagios to make sure that a machine is working properly if it responds to ICMP requests.
Commands can also accept arguments, which offers a more flexible way of defining checks. A definition accepting parameters would be as follows:
define command
{
command_name check-host-alive-limits
command_line $USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5
}
and the corresponding host definition would be:
define host
{
host_name othermachine
address 10.0.0.2
check_command check-host-alive-limits!3000.0,80%!5000.0,100%
}
Another example is setting up a check command for a previously-defined service:
define command
{
    command_name  check_http
    command_line  $USER1$/check_http -H $HOSTADDRESS$
}
This check can then be used when defining a service to be monitored by Nagios. Chapter 4, Overview of Nagios Plugins, covers standard Nagios plugins along with sample command definitions. Sample Nagios configurations are also included in the sources and installed by the make-config target.
Configuring Time Periods
Time periods are definitions of dates and times during which an action should be performed or specified people should be notified. They describe date and time ranges, and can be re-used across various operations.
A time period definition includes a name that uniquely identifies it in Nagios. It also contains a description, and one or more days or dates along with time spans.
A typical example of a time period would be working hours, which defines that a valid time to perform an action is from Monday to Friday during business hours. Another definition of a time period can be weekends, which means Saturday and Sunday, all day long.
The following is a sample time period for working hours:
define timeperiod
{
    timeperiod_name  workinghours
    alias            Working Hours, from Monday to Friday
    monday           09:00-17:00
    tuesday          09:00-17:00
    wednesday        09:00-17:00
    thursday         09:00-17:00
    friday           09:00-17:00
}
This particular example tells Nagios that the acceptable time to perform something is from Monday to Friday between 9 AM and 5 PM. Each entry in a time period contains information on a date or weekday. It also contains a range of hours. Nagios first checks if the current date matches any of the dates specified. If it does, then it checks if the current time matches the time ranges specified for the date.
There are multiple ways of specifying a date. Depending on what type of date it is, one definition might take precedence over another. For example, a definition for December 24th is more important than a generic definition that every weekday an action should be performed between 9 AM and 5 PM.
Possible date types are mentioned here:
- Calendar date: For example, 2009-11-01, which means November 1st, 2009 (Nagios accepts dates in the format YYYY-MM-DD)
- Date recurring every year: For example, july 4, which means the 4th of July every year
- Specific day within a month: For example, day 14, which means the 14th of every month
- Specific weekday, along with an offset in a month: For example, monday 1 september, which means the first Monday in September; monday -1 may would mean the last Monday in May
- Specific weekday in all months: For example, monday 1, which means the 1st Monday of every month
- Weekday: For example, monday, which means every Monday
The above list shows all date types in the order in which Nagios ranks them in terms of importance. This means that a date recurring every year will always be used in preference to an entry describing what should be done every Monday.
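The ranking can be illustrated with a time period that mixes several date types. This is only a sketch: the time period name and the shortened hours are assumptions chosen for illustration.

```
define timeperiod
{
    timeperiod_name  workinghours-exceptions        ; hypothetical name
    alias            Working hours with exceptions
    monday           09:00-17:00
    tuesday          09:00-17:00
    wednesday        09:00-17:00
    thursday         09:00-17:00
    friday           09:00-17:00
    monday 1         10:00-17:00                    ; first Monday of every month, ranks above plain monday
    december 24      09:00-12:00                    ; recurring yearly date, ranks above both weekday entries
}
```

On December 24th, the 09:00-12:00 range would apply whatever the weekday, because a date recurring every year ranks above the weekday entries. On the first Monday of any other month, the 10:00-17:00 range would be used instead of the generic Monday hours.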
In order to be able to configure all objects correctly, we will now create some standard time periods that will be used in the configuration. The following example periods are used in the remaining sections of this chapter, and it is recommended that you put them in the /etc/nagios/timeperiods/default.cfg file:
define timeperiod
{
    timeperiod_name  workinghours
    alias            Working Hours, from Monday to Friday
    monday           09:00-17:00
    tuesday          09:00-17:00
    wednesday        09:00-17:00
    thursday         09:00-17:00
    friday           09:00-17:00
}
define timeperiod
{
    timeperiod_name  weekends
    alias            Weekends all day long
    saturday         00:00-24:00
    sunday           00:00-24:00
}
define timeperiod
{
    timeperiod_name  24x7
    alias            24 hours a day 7 days a week
    monday           00:00-24:00
    tuesday          00:00-24:00
    wednesday        00:00-24:00
    thursday         00:00-24:00
    friday           00:00-24:00
    saturday         00:00-24:00
    sunday           00:00-24:00
}
The last time period is also used by the WWW service for the linuxbox01 host, defined earlier. This way, the web server will be monitored all the time.
Configuring Contacts
Contacts define people who can either be owners of specific machines, or people who should be contacted in case of problems. Depending on how your organization chooses to contact people in case of problems, the definition of a contact may vary a lot. A contact consists of a unique name, a descriptive name, and one or more email addresses and/or pager numbers. Contact definitions can also contain additional data specific to how a person can be contacted.
A basic contact definition is shown here, and specifies the unique contact name, an alias, and contact information. It also specifies the event types that the person should receive and time periods during which notifications should be sent.
define contact
{
    contact_name                   jdoe
    alias                          John Doe
    email                          john.doe@yourcompany.com
    host_notification_period       workinghours
    service_notification_period    workinghours
    host_notification_options      d,u,r
    service_notification_options   w,u,c,r
    host_notification_commands     host-notify-by-email
    service_notification_commands  notify-by-email
}
The following table describes all available directives when defining a contact. Items in bold are required when specifying a contact.
[Table: contact directives]
Contacts are also mapped to users that log in to the Nagios web interface. This means that all operations performed via the interface are logged as having been executed by that particular user, and the web interface uses the access granted to particular contact objects when evaluating whether an operation should be allowed. The contact_name field from a contact object maps to the user name in the Nagios web interface.
Contacts can be grouped. Usually, grouping is used to keep a list of which users are responsible for which tasks, and the group maps to job responsibilities for particular people. It also makes it possible to define people who should be responsible for handling problems at specific time periods, and Nagios will automatically contact the right people depending on the time at which a problem has occurred.
A sample definition of a contact group is as follows:
define contactgroup
{
    contactgroup_name  linux-admins
    alias              Linux Administrators
    members            jdoe,asmith
}
This group is also used as the contact group for the linuxbox01 host and its WWW service, defined earlier. This means that both jdoe and asmith will receive information on the status of this host and service.
The following is a complete list of directives that can be used to describe contact groups. Items in bold are required while specifying a contact group.
[Table: contact group directives]
Members of a contact group can be specified either in the contact group definition or by using the contactgroups directive in a contact definition. It is also possible to combine both methods: some of the members can be specified in the contact group definition, and others can be specified in their contact object definitions.
Contacts are used to specify who should be contacted if the status of one or more hosts or services changes. Nagios accepts both contacts and contact groups in its object definitions. This allows making either specific people or entire groups responsible for particular machines or services.
It is also possible to specify different people or groups for handling host-related and service-related problems—for example, hardware administrators for handling host problems and system administrators for handling service issues.
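Such a split might look like the following sketch. The hardware-admins and system-admins contact group names are assumptions and would need to be defined separately; the other host and service directives are left out for brevity.

```
# Host problems go to the hardware administrators
define host
{
    host_name       linuxbox01
    contact_groups  hardware-admins        ; hypothetical contact group
}
# Problems with the web server go to the system administrators
define service
{
    host_name            linuxbox01
    service_description  WWW
    contact_groups       system-admins     ; hypothetical contact group
}
```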
In order for Nagios to function properly, we need to create at least one contact that will be used by Nagios, and put this definition in the /etc/nagios/contacts/nagiosadmin.cfg file:
define contact
{
    contact_name                   nagiosadmin
    contactgroups                  admins
    alias                          Nagios administrator
    email                          administrator@yourcompany.com
    host_notification_period       workinghours
    service_notification_period    workinghours
    host_notification_options      d,u,r
    service_notification_options   w,u,c,r
    host_notification_commands     host-notify-by-email
    service_notification_commands  notify-by-email
}
We also need to define the admins group in the /etc/nagios/contacts/groups.cfg file:
define contactgroup {
    contactgroup_name   admins
    alias               System administrators
}
If you are not very familiar with Nagios, it is recommended that you leave the contact's name as nagiosadmin, as this will also be the user for all web interface operations.
Templates and Object Inheritance
In order to allow the flexible configuration of machines, Nagios offers a powerful inheritance engine. The main concept is that administrators can set up templates that define common parameters, and re-use these templates in actual host or service definitions. The mechanism even offers the possibility to create templates that inherit parameters from other templates.
Templates are plain Nagios objects that set the register directive to 0, which means that they will not be registered as actual hosts or services to monitor. Objects that inherit parameters from a template or from another host should have a use directive pointing to the short name of the template object they are using.
When defining a template, its name is always specified using the name directive. This is slightly different from how typical hosts and services are registered, as they require the host_name and/or service_description parameters.
Inheritance can be used to define a template for basic host checks, with only basic parameters such as IP address being defined for each particular host. For example:
define host {
    name                    generic-server
    check_command           check-host-alive
    check_interval          5
    retry_interval          1
    max_check_attempts      5
    check_period            24x7
    notification_interval   30
    notification_period     24x7
    notification_options    d,u,r
    register                0
}

define host {
    use                     generic-server
    host_name               linuxbox01
    alias                   Linux Server 01
    address                 10.0.2.1
    contact_groups          linux-admins
}
Version 3 of Nagios also introduces inheriting from multiple templates. To do this, simply put multiple names in the use directive, separated by commas. This allows the host to use several templates, each of which defines some or all of the directives. If multiple templates specify the same parameter, the value from the first template specifying it will be used. For example:
define service {
    name                    generic-service
    check_interval          10
    retry_interval          2
    max_check_attempts      3
    check_period            24x7
    register                0
}

define service {
    name                    workinghours-service
    check_period            workinghours
    notification_interval   30
    notification_period     workinghours
    notification_options    w,c,u,r
    register                0
}

define service {
    use                     workinghours-service,generic-service
    contact_groups          linux-admins
    host_name               linuxbox01
    service_description     SSH
    check_command           check_ssh
}
In this case, values from both templates will be used. The value of workinghours will be used for the check_period directive, as this directive was first specified in the workinghours-service template. Changing the order in the use directive to generic-service,workinghours-service would cause the value of the check_period parameter to be 24x7.
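To make the merge concrete, the SSH service above should behave roughly as if it had been written out in full like this. This is a hand-derived illustration of the rule just described, not output produced by Nagios.

```cfg
define service {
    host_name               linuxbox01
    service_description     SSH
    check_command           check_ssh
    contact_groups          linux-admins
    check_period            workinghours    ; from workinghours-service (listed first)
    notification_interval   30              ; from workinghours-service
    notification_period     workinghours    ; from workinghours-service
    notification_options    w,c,u,r         ; from workinghours-service
    check_interval          10              ; from generic-service
    retry_interval          2               ; from generic-service
    max_check_attempts      3               ; from generic-service
}
```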
Nagios also accepts creating multiple levels of templates. For example, you can set up a generic service template, and inherit from it to create additional templates for various types of checks such as local services, resource-sensitive checks, and passive-only checks.
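A minimal sketch of such a hierarchy is shown below. The template names passive-service and resource-sensitive-service and the values they set are hypothetical and chosen only for illustration.

```cfg
define service {
    name                    generic-service
    check_interval          10
    retry_interval          2
    max_check_attempts      3
    check_period            24x7
    register                0
}

define service {
    name                    passive-service     ; hypothetical second-level template
    use                     generic-service
    active_checks_enabled   0                   ; do not actively run the check
    passive_checks_enabled  1                   ; accept passive check results
    register                0
}

define service {
    name                    resource-sensitive-service  ; hypothetical second-level template
    use                     generic-service
    check_interval          30                  ; overrides the value from generic-service
    register                0
}
```

A real service would then name one of the second-level templates in its use directive and pick up the generic-service settings through it.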
Let's consider the following objects and template structures:
define host {
    host_name   linuxserver1
    use         generic-linux,template-chicago
    .....
}

define host {
    register    0
    name        generic-linux
    use         generic-server
    .....
}

define host {
    register    0
    name        generic-server
    use         generic-host
    .....
}

define host {
    register    0
    name        template-chicago
    use         contacts-chicago,misc-chicago
    .....
}
The following illustration shows how Nagios will search for values for all directives.
data:image/s3,"s3://crabby-images/ab7d8/ab7d8077bae6a19da68d3345dd9f776a439f0ab3" alt="Templates and Object Inheritance"
When looking for parameters, Nagios will first look for the value in the linuxserver1 object definition. Next, it will use the following templates, in this order: generic-linux, generic-server, generic-host, template-chicago, contacts-chicago, and finally misc-chicago.
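As a hypothetical illustration of that order, suppose two of the templates set the same directive. The values are invented for this example, and only the relevant directive is shown for each template.

```cfg
define host {
    register                0
    name                    generic-host
    notification_interval   60      ; reached through generic-linux, then generic-server
}

define host {
    register                0
    name                    template-chicago
    notification_interval   30      ; searched only after the generic-linux chain
}
```

Under the search order described above, linuxserver1 would use 60, provided that linuxserver1, generic-linux, and generic-server do not set notification_interval themselves.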
It is also possible to set up host or service dependencies that will be inherited from a template. In this case, the dependent hosts or services can't be templates themselves, and need to be registered as objects that will be monitored by the Nagios daemon.
Introduction to Notifications
Notifications are the way in which Nagios lets people know that something is either wrong or has returned to normal operation. They are not objects on their own, but provide very important functionality in Nagios. Configuring notifications correctly might seem a bit tricky in the beginning.
When and how notifications are sent out is configured as part of contact configuration. Each contact has configuration directives on when notifications can be sent out, and how he or she should be contacted. Contacts also contain information about contact details—telephone number, email address, Jabber/MSN address, and so on. Each host and service is configured for when the information about it should be sent, and who should be contacted. Nagios then combines all of this information in order to notify people of the changes in status.
Notifications may be sent out in one of the following situations:
- The host changes to the DOWN or UNREACHABLE state; a notification is sent out after the number of minutes given by first_notification_delay in the corresponding host object
- The host remains in the DOWN or UNREACHABLE state; a notification is sent out every notification_interval minutes, as specified in the corresponding host object
- The host recovers to the UP state; a notification is sent out immediately and only once
- The host starts or stops flapping; a notification is sent out immediately
- The host remains flapping; a notification is sent out every notification_interval minutes, as specified in the corresponding host object
- The service changes to the WARNING, CRITICAL, or UNKNOWN state; a notification is sent out after the number of minutes given by first_notification_delay in the corresponding service object
- The service remains in the WARNING, CRITICAL, or UNKNOWN state; a notification is sent out every notification_interval minutes, as specified in the corresponding service object
- The service recovers to the OK state; a notification is sent out immediately and only once
- The service starts or stops flapping; a notification is sent out immediately
- The service remains flapping; a notification is sent out every notification_interval minutes, as specified in the corresponding service object
If one of these conditions occurs, Nagios starts evaluating whether information about it should be sent out and to whom.
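The two timing directives mentioned in the list above could be set on a host as in the following sketch. The values are illustrative only, and the generic-server template from the inheritance examples is assumed to supply the remaining directives.

```cfg
define host {
    use                         generic-server
    host_name                   linuxbox01
    address                     10.0.2.1
    first_notification_delay    10      ; wait before sending the first problem notification
    notification_interval       30      ; repeat while the problem persists
}
```

Both values are counted in time units, which correspond to minutes unless the interval_length option in the main configuration file has been changed from its default of 60 seconds.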
First of all, the current date and time is checked against the notification time period. The time period is taken from the notification_period directive of the current host or service definition. The notification will be sent out only if the time period includes the current time.
Next, a list of users based on the contacts and contact_groups fields is created. A complete list of users is made based on all members of all groups, and included groups, as well as all the contacts directly bound to the current host or service.
Each of the matched users is then checked to see whether he or she should be notified about the current event. Each user's time period is also checked to see if it includes the current date and time. The host_notification_period or service_notification_period directive is used, depending on whether the notification is for a host or a service.
For host notifications, the host_notification_options directive for each contact is also used to determine whether that particular person should be contacted. For example, different users might be contacted about an unreachable host than those contacted if the host is actually down. For service notifications, the service_notification_options parameter is used to check whether each user should be notified about the issue. The section on hosts and services configuration describes what values these directives take.
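As a sketch of this per-contact filtering, the two contacts below would be told about different host states even if both are attached to the same host. The choice of options and the email addresses are assumptions made for illustration; the time period and command names are the ones used in the nagiosadmin example.

```cfg
define contact {
    contact_name                    jdoe
    email                           jdoe@yourcompany.com    ; placeholder address
    host_notification_period        workinghours
    service_notification_period     workinghours
    host_notification_options       d,r         ; host DOWN and recovery only
    service_notification_options    w,u,c,r
    host_notification_commands      host-notify-by-email
    service_notification_commands   notify-by-email
}

define contact {
    contact_name                    asmith
    email                           asmith@yourcompany.com  ; placeholder address
    host_notification_period        workinghours
    service_notification_period     workinghours
    host_notification_options       u,r         ; host UNREACHABLE and recovery only
    service_notification_options    c,r         ; service CRITICAL and recovery only
    host_notification_commands      host-notify-by-email
    service_notification_commands   notify-by-email
}
```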
If all of these criteria have been met, Nagios will send a notification to this user, using the commands specified in the host_notification_commands and service_notification_commands directives.
Multiple commands can be specified for notifications, so Nagios can be set up to send both an email and a message on an instant messaging system.
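A sketch of such a setup follows. The notify-by-jabber command and the notify_jabber.sh script it calls are hypothetical and would have to be provided separately. The Jabber address is a placeholder stored in the contact's address1 directive, which the command reads through the $CONTACTADDRESS1$ macro.

```cfg
define command {
    command_name    notify-by-jabber    ; hypothetical command
    ; notify_jabber.sh is a hypothetical script, assumed to live in the $USER1$ directory
    command_line    $USER1$/notify_jabber.sh "$CONTACTADDRESS1$" "$NOTIFICATIONTYPE$: $HOSTNAME$/$SERVICEDESC$ is $SERVICESTATE$"
}

define contact {
    contact_name                    jdoe
    email                           jdoe@yourcompany.com        ; placeholder address
    address1                        jdoe@jabber.yourcompany.com ; placeholder Jabber ID
    host_notification_period        workinghours
    service_notification_period     workinghours
    host_notification_options       d,u,r
    service_notification_options    w,u,c,r
    host_notification_commands      host-notify-by-email
    service_notification_commands   notify-by-email,notify-by-jabber
}
```

Each command in the comma-separated list is run for every service notification sent to this contact.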
Nagios also offers escalations that allow emails to be sent to other people when a problem remains unresolved for too long. This can be used to propagate problems to higher management, or to teams that might be affected by unresolved problems. It is a very powerful mechanism and is split between host- and service-based escalations. This functionality is described in more detail in Chapter 6, Notifications and Events.
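As a brief preview, a service escalation might be sketched as follows. The it-management group and the numeric values are assumptions for illustration; the full set of options is covered in the chapter referenced above.

```cfg
define serviceescalation {
    host_name               linuxbox01
    service_description     SSH
    first_notification      3       ; start escalating with the third notification
    last_notification       0       ; 0 means keep escalating until the problem is resolved
    notification_interval   60
    contact_groups          linux-admins,it-management  ; it-management is hypothetical
}
```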