Learning NAGIOS 3.0

Nagios Configuration

Nagios stores its configuration in a separate directory, usually either /etc/nagios or /usr/local/etc/nagios. If you followed the steps for a manual installation described earlier, it is in /etc/nagios.

Main Configuration File

The main configuration file is called nagios.cfg and is the file that Nagios loads during startup. Its syntax is simple: a line beginning with # is a comment, and every line of the form <parameter>=<value> sets a value. Some parameters may be repeated, for example to specify additional files or directories to read.

The following is a sample of Nagios's main configuration file:

# log file to use
log_file=/var/nagios/nagios.log
# object configuration directory
cfg_dir=/etc/nagios/objects
# storage information
resource_file=/etc/nagios/resource.cfg
status_file=/var/nagios/status.dat
status_update_interval=10
(…)

The main configuration file needs to define a log file to use, and that has to be passed as the first option in the file. It also configures various parameters that tune Nagios's behavior and performance. The following are some of the commonly changed options:

For a complete list of accepted parameters, please consult the Nagios documentation on http://nagios.sourceforge.net/docs/3_0/configmain.html.

The Nagios option resource_file defines the file in which all user variables are stored. This file can be used to store additional information that can be accessed in all object definitions. Because these variables can only be used in object definitions and cannot be read from the web interface, this file is usually where sensitive data is kept. This makes it possible to hide passwords of various sensitive services from Nagios administrators who do not have adequate privileges. There can be up to 32 macros, named $USER1$, $USER2$, ..., $USER32$. Macro definition $USER1$ defines the path to the Nagios plugins and is commonly used in check command definitions.
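
As a minimal sketch, a resource file could look like the following. The plugin directory is an assumption and should be replaced with the directory your plugins were installed in; the $USER2$ entry and its value are purely hypothetical and only illustrate keeping a credential out of the object definitions.

```cfg
# Sketch of /etc/nagios/resource.cfg (the file named by resource_file)

# Path to the Nagios plugins; the directory shown here is an assumption
$USER1$=/opt/nagios/plugins

# Hypothetical example: a password that check commands can refer to as $USER2$
$USER2$=examplepassword
```

A command definition can then refer to $USER2$ in its command line, so the value itself never appears in the object configuration.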

Options cfg_file and cfg_dir are used to specify the files that should be read for object definitions. The first option specifies a single file to read and the second specifies the directory in which all files should be read. Each file may contain different types of objects. The following sections describe each type of definition that Nagios uses.
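
The two options can be combined in the same nagios.cfg, as in the short sketch below. Both paths are hypothetical and only show the difference between the options; with cfg_dir, Nagios reads the files ending in .cfg that it finds in the directory.

```cfg
# Read object definitions from one specific file (hypothetical path)
cfg_file=/etc/nagios/templates.cfg

# Read object definitions from all .cfg files in a directory (hypothetical path)
cfg_dir=/etc/nagios/servers
```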

One of the first things that needs to be decided is how your Nagios configuration should be stored. In order to create a configuration that is maintainable as your IT infrastructure changes, it is worth investing some time in planning out how you want your host definitions set up and how they could be most easily placed in a configuration file structure. Throughout this book, various approaches on how to make your configuration maintainable are discussed. It's also recommended that you set up a small Nagios system to get a better understanding of Nagios configuration, before proceeding to larger setups.

Sometimes, it is best to have configuration grouped into separate directories defined according to the locations that hosts and/or services are in. In other cases, it might be best to keep definitions of all servers with similar functionalities in one directory.

A good directory separation makes it much easier to control Nagios configuration, for example, to disable all objects related to a particular part of the IT infrastructure at once. Even though using downtimes is recommended, it is sometimes useful to simply remove the entries from Nagios configuration.

Throughout all configuration examples in this book, we use a directory structure. A separate directory is used for each object type and similar objects are grouped within a single file. For example, all command definitions are stored in the commands/ subdirectory. All host definitions are stored in the hosts/<hostname>.cfg files.

In order for Nagios to read configuration from these directories, edit your main Nagios configuration file (/etc/nagios/nagios.cfg), remove all cfg_file and cfg_dir entries, and add the following ones:

cfg_dir=/etc/nagios/commands
cfg_dir=/etc/nagios/timeperiods
cfg_dir=/etc/nagios/contacts
cfg_dir=/etc/nagios/hosts
cfg_dir=/etc/nagios/services

In order to use the default Nagios plugins, copy the default Nagios command definitions file /etc/nagios/objects/commands.cfg to /etc/nagios/commands/default.cfg.

In addition, please make sure that the following options are set as shown in your nagios.cfg file:

check_external_commands=1
interval_length=60
accept_passive_service_checks=1
accept_passive_host_checks=1

If any of these options is set to a different value, change it. If an option is not present in the file, add it at the end.

After making these changes to the Nagios setup, you can move on to the next sections and prepare a working configuration for your Nagios installation.

Macro Definitions

The ability to use macro definitions is one of the key features of Nagios. Macros offer a lot of flexibility in object and command definitions. Nagios 3 also provides custom macro definitions, which make it easier to use object templates for specifying parameters common to a group of similar objects.

All command definitions can use macros. Macro definitions allow parameters from other objects, such as hosts, services, and contacts, to be referenced so that a command does not need to have everything passed as an argument. Each macro invocation begins and ends with a $ sign.

A typical example is a HOSTADDRESS macro, which references the address field from the host object. All host definitions provide the value of the address parameter. For the following host and command definition:

  define host
  {
    host_name     somemachine
    address       10.0.0.1
    check_command check-host-alive
  }
  define command
  {
    command_name  check-host-alive
    command_line  $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
  }

this command will be invoked:

/opt/nagios/plugins/check_ping -H 10.0.0.1 -w 3000.0,80% -c 5000.0,100% -p 5

In addition, note that the USER1 macro was also used and was expanded to the path of the Nagios plugins directory. This macro references data contained in the file specified by the resource_file configuration directive. Even though the USER1 macro does not have to point to the plugins directory, all standard command definitions that come with Nagios use it this way, so it is recommended that you do not change it.

Some of the macro definitions are listed in the following table:

This table is not complete and only covers commonly used macro definitions. A complete list of available macros can be found in the Nagios documentation at http://nagios.sourceforge.net/docs/3_0/macros.html. Moreover, remember that all macro definitions need to be prefixed and suffixed with a $ sign—for example, $HOSTADDRESS$ maps to the HOSTADDRESS macro definition.

Another feature is on-demand macros. These are not defined for the current object and are not exported as environment variables, but when one is found in a command definition, it is parsed and substituted accordingly. An on-demand macro takes one or more arguments inside the macro name, each passed after a colon. This is mainly used to read values that are not related to the current object. For example, to read the contact email for user jdoe regardless of who the current contact is, use $CONTACTEMAIL:jdoe$, which evaluates the CONTACTEMAIL macro in the context of the jdoe contact.
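
A sketch of a command that always mails jdoe, whoever the current contact is, might look like this. The command name is hypothetical, and the locations of printf and mail are assumptions that vary between systems; $HOSTNAME$ and $HOSTSTATE$ are standard macros.

```cfg
# Hypothetical notification command that uses an on-demand macro.
# The paths to printf and mail are assumptions; adjust them for your system.
define command
{
  command_name  notify-jdoe-by-email
  command_line  /usr/bin/printf "%b" "Host: $HOSTNAME$\nState: $HOSTSTATE$\n" | /bin/mail -s "Host alert" $CONTACTEMAIL:jdoe$
}
```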

Nagios 3 also offers custom macro definitions. This works in a way that allows administrators to define additional attributes in each type of object, and the macro can then be used inside a command. This is used to store additional parameters related to an object—for example, you can store a MAC address in a host definition and use it in certain types of host checks.

A custom directive is written in uppercase and starts with an underscore. It is referenced in one of the following ways, based on the type of object it is defined in:

  • $_HOST<variable>$ – for directives defined within a host object
  • $_SERVICE<variable>$ – for directives defined within a service object
  • $_CONTACT<variable>$ – for directives defined within a contact object

A sample host definition that includes an additional directive with a MAC address would be as follows:

  define host
  {
    host_name     somemachine
    address       10.0.0.1
    _MAC          12:12:12:12:12:12
    check_command check-host-by-mac
  }

and a corresponding check command that uses this attribute inside a check:

  define command
  {
    command_name  check-host-by-mac
    command_line  $USER1$/check_hostmac -H $HOSTADDRESS$ -m $_HOSTMAC$
  }

Since Nagios 3, a majority of standard macro definitions are exported to check commands as environment variables. The environment variable names are the same as macros, but are prefixed with NAGIOS_—for example, HOSTADDRESS is passed as the NAGIOS_HOSTADDRESS variable. On-demand variables are not made available. For security reasons, the $USERn$ variables are also not passed to commands as environment variables.

Configuring Hosts

Hosts are objects that describe machines that should be monitored, either physical hardware or virtual machines. A host consists of a short name, a descriptive name, and an IP address. The host also tells Nagios when and how the system should be monitored, as well as who should be contacted with regard to any problems related to this host. It also specifies how often the host should be checked, how retrying the checks should be handled, and how often a notification about problems should be sent out.

A sample definition of a host is as follows:

  define host
  {
    host_name                       linuxbox01
    hostgroups                      linuxservers
    alias                           Linux Server 01
    address                         10.0.2.1
    check_command                   check-host-alive
    check_interval                  5
    retry_interval                  1
    max_check_attempts              5
    check_period                    24x7
    contact_groups                  linux-admins
    notification_interval           30
    notification_period             24x7
    notification_options            d,u,r
  }

This defines a Linux box that will use the check-host-alive command to make sure the box is up and running. The test will be performed every five minutes, and after five failed tests, it will assume the host is down. If it is down, a notification will be sent out every 30 minutes.

The following is a table of common directives that can be used to describe hosts. Items in bold are required while specifying a host.

For a complete list of accepted parameters, please consult the Nagios documentation at http://nagios.sourceforge.net/docs/3_0/objectdefinitions.html#host.

By default, Nagios assumes all host states to be up. If the check_command option is not specified for a host, then it will always be in the up state. When the command to perform host checks is specified, then the regularly-scheduled checks will take place and the host state will be monitored using the value of check_interval as the number of minutes between checks.

Nagios uses soft and hard state logic to handle host states. If a host state has changed from UP to DOWN since the last hard state, Nagios assumes that the host is in a soft DOWN state and retries the test, waiting retry_interval minutes between tests. If the result is still the same once max_check_attempts is reached, Nagios treats DOWN as a hard state. The same mechanism applies to DOWN to UP transitions.

The host object parents directive is used to define the topology of the network. Usually, this directive points to a switch, router or any other device that is responsible for forwarding network packets. The host is assumed to be unreachable if the parent host is currently in a hard DOWN state. For example, if a router is down, then all machines accessed through it are considered unreachable and no tests will be performed on them.

If your network consists of servers connected via a switch and routers to a different network, then the parent for all of the servers in the local network, as well as the router, would be the switch. The parent of the router on the other side of the link would be the local router. The following diagram shows the actual network infrastructure and indicates how Nagios hosts should be configured in terms of parents for each element of the network:

[Figure: actual network topology (left) and the corresponding Nagios parent host setup (right)]

The actual network topology is shown on the left, and the parent hosts setup for the machines is shown on the right. Each arrow represents a mapping from a host to a parent host. There is no need to define a parent for hosts that are directly on the network with your Nagios server. So in this case, switch1 should not have a parent host defined.
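
The parent relationships described above could be written roughly as follows. This is an abbreviated sketch: only host_name and parents are shown, every definition still needs its usual directives, and all host names other than switch1 are hypothetical.

```cfg
# Abbreviated sketch: each definition still needs its usual directives
# (address, check_command, and so on). Host names other than switch1
# are hypothetical.

# switch1 is on the same network as the Nagios server, so it has no parents
define host
{
  host_name  switch1
}

# Local servers and the local router are reached through the switch
define host
{
  host_name  server1
  parents    switch1
}

define host
{
  host_name  router1
  parents    switch1
}

# The router on the other side of the link is reached through the local router
define host
{
  host_name  router2
  parents    router1
}
```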

Even though some devices, such as switches, cannot be easily checked to see if they are down, it is still a good idea to describe them as a part of your topology. In this case, you might use a functionality such as scheduled downtime to keep track of when the device is going to be offline, or mark it as DOWN manually. This helps in determining other problems: Nagios will not check hosts whose path includes a device that is currently scheduled for downtime. This way, you won't be flooded with notifications reporting hosts as down when they are actually unreachable.

Check and notification periods specify the time periods during which checks for host state and notifications are to be performed. These can be specified so that different hosts can be monitored at different times.

It is also possible to create a setup where information that a host is down is kept, but nobody is notified about it. This can be done by specifying a notification_period that will tell Nagios when a notification should be sent out. No notifications will be sent out outside of this time period.

A typical example is a server that is only required during business hours and has a daily maintenance window between 10 PM and 4 AM. You can set up Nagios so as to not monitor host availability outside of business hours, or you can make Nagios monitor it, but without notifying that it is actually down. If monitoring is not done at all, Nagios will perform fewer operations during this period. In the second case, it is possible to gather statistics on how much of the maintenance window is used—which can be used to see if changes to the window need to be made.
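
The second approach, where the host is monitored all the time but nobody is notified during the maintenance window, could be sketched as follows. The time period name and host name are hypothetical, the host definition is abbreviated, and a 24x7 time period is assumed to be defined elsewhere.

```cfg
# Hypothetical time period covering every day except 10 PM to 4 AM
define timeperiod
{
  timeperiod_name  outside-maintenance
  alias            Every day outside the maintenance window
  monday           04:00-22:00
  tuesday          04:00-22:00
  wednesday        04:00-22:00
  thursday         04:00-22:00
  friday           04:00-22:00
  saturday         04:00-22:00
  sunday           04:00-22:00
}

# Abbreviated host definition (hypothetical name): checks run all the time,
# but notifications are only sent outside the maintenance window
define host
{
  host_name            appserver01
  check_period         24x7
  notification_period  outside-maintenance
}
```

To skip monitoring during the window altogether, check_period could point to the same restricted time period instead of 24x7.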

Nagios allows the grouping of multiple hosts in order to effectively manage them. In order to do this, Nagios offers host group objects, which are a group of one or more machines. A host may be a member of more than one host group. Usually, grouping is done either by the type of machines or by the location they are in.

Each host group has a unique short name that is specified along with a descriptive name, and one or more hosts that are members of this group.

Example host group definitions that define groups of hosts and a group that combines both groups, are given as follows:

  define hostgroup
  {
    hostgroup_name                 linux-servers
    alias                          Linux servers
    members                        linuxbox1,linuxbox2
  }
  define hostgroup
  {
    hostgroup_name                 aix-servers
    alias                          AIX servers
    members                        aixbox1,aixbox2
  }
  define hostgroup
  {
    hostgroup_name                 unix-servers
    alias                          UNIX servers
    hostgroup_members              linux-servers,aix-servers
  }

The following table shows the directives that can be used to describe host groups. Items in bold are required when specifying a host group.

Host groups can also be used when defining services or dependencies. For example, it is possible to tell Nagios that all Linux servers should have their SSH service monitored and all AIX servers should have telnet accepting connections.

It is also possible to define dependencies between hosts. They are, in a way, similar to a parent-host relationship, but dependencies offer more complex configuration options. Nagios will only issue host and service checks if all of the hosts that the host depends on are currently up. More details on dependencies can be found in Chapter 5.

For the purpose of this book, we will define at least one host in our Nagios configuration directory structure.

To be able to monitor the local server that the Nagios installation is running on, we will need to add its definition into the /etc/nagios/hosts/localhost.cfg file as follows:

  define host
  {
    host_name                       localhost
    alias                           Localhost
    address                         127.0.0.1
    check_command                   check-host-alive
    check_interval                  5
    retry_interval                  1
    max_check_attempts              5
    check_period                    24x7
    contact_groups                  admins
    notification_interval           60
    notification_period             24x7
    notification_options            d,u,r
  }

If you are planning to monitor other servers as well, you will want to add them—either in a single file, or multiple files.

Configuring Services

Services are objects that describe the functionality a particular host is offering. This can be virtually anything—network servers such as FTP, or resources such as storage space or CPU load.

A service is always tied to a host that it is running on. It is also identified by its description, which needs to be unique within a particular host. A service also defines when and how Nagios should check to see if it is running properly, and how to notify people responsible for this service, if it is not.

A short example of a web server that is defined on the linuxbox01 machine created earlier is as follows:

  define service
  {
    host_name                      linuxbox01
    service_description            WWW
    check_command                  check_http
    check_interval                 10
    check_period                   24x7
    retry_interval                 3
    max_check_attempts             3
    notification_interval          30
    notification_period            24x7
    notification_options           w,c,u,r
    contact_groups                 linux-admins
  }

This definition tells Nagios to check that the web server is working correctly every 10 minutes.

The following table shows the common directives that can be used to describe a service. Items in bold are required when specifying a service.

For a complete list of accepted parameters, refer to the Nagios documentation at http://nagios.sourceforge.net/docs/3_0/objectdefinitions.html#service.

Very often, the same service is offered by more than one host. In such cases, it is possible to specify a service that will be provided by multiple machines, or even specify host groups for which all hosts will be checked. It is also possible to specify the hosts for which checks will not be performed—for example, if a service is present on all hosts in a group except for a specific box. To do that, an exclamation mark needs to be added before a host name or a host group name.

For example, to tell Nagios that SSH should be checked on all Linux servers except for linux01, as well as on the aix01 machine, a service definition similar to the one shown here can be created:

  define service
  {
    hostgroup_name                 linux-servers
    host_name                      !linux01,aix01
    service_description            SSH
    check_command                  check_ssh
    check_interval                 10
    check_period                   24x7
    retry_interval                 2
    max_check_attempts             3
    notification_interval          30
    notification_period            24x7
    notification_options           w,c,r
    contact_groups                 linux-admins
  }

Services can be grouped in a similar way to host objects. This can be done to manage services more conveniently. It also aids in viewing service reports on the Nagios web interface. Service groups are also used to configure dependencies in a more convenient way.

The following table describes the attributes that can be used to define a group. Items in bold are required when specifying a service group.

The members directive of a service group object takes one or more <host>,<service> pairs, separated by commas.

An example of a service group is shown here:

  define servicegroup
  {
    servicegroup_name  databaseservices
    alias              All services related to databases
    members            linux01,mysql,linux01,pgsql,aix01,db2
  }

This service group consists of the mysql and pgsql services on the linux01 host and db2 on the aix01 machine. It is uniquely identified by its name, databaseservices.

It is also possible to specify the groups that a service should be a member of inside the service definition itself. To do this, list all of the groups in the servicegroups directive of the service definition.
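
As a sketch, the mysql service on linux01 could join the databaseservices group from its own definition as shown below. The definition is abbreviated, and check_mysql is a hypothetical command name that would need its own command definition.

```cfg
# Abbreviated service definition; the remaining service directives are omitted.
# check_mysql is a hypothetical command name.
define service
{
  host_name            linux01
  service_description  mysql
  check_command        check_mysql
  servicegroups        databaseservices
}
```

With this approach, the members directive of the service group no longer has to list the linux01,mysql pair.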

Services may be configured to be dependent on one another, similar to how hosts can. In this case, Nagios will only perform checks on a service if all of the services it depends on are working correctly. More details on dependencies can be found in Chapter 5, Advanced Configuration.

Nagios requires at least one service to be defined for every host, and at least one service must be defined for Nagios to run at all. That is why we will now create a sample service in our configuration directory structure. For this purpose, we will monitor the secure shell protocol.

In order to check if the SSH server is running on the Nagios installation, we will need to add its definition into the /etc/nagios/hosts/localhost.cfg file:

  define service
  {
    host_name                       localhost
    service_description             ssh
    check_command                   check_ssh
    check_interval                  5
    retry_interval                  1
    max_check_attempts              3
    check_period                    24x7
    contact_groups                  admins
    notification_interval           60
    notification_period             24x7
    notification_options            w,c,u,r
  }

If you are planning on monitoring other services as well, you will want to add them to the same file.

Configuring Commands

Command definitions describe how host/service checks should be done. They can also define how notifications about problems or event handlers should work. A command definition has two parameters—name and command line. The first parameter is a name that is then used for defining checks and notifications. The second parameter is an actual command that will be run, along with all required parameters for the command.

Commands are used by hosts and services. They define what system command to execute when making sure a host or service is working properly. A check command is identified by its unique name.

When used with other object definitions, it can also have additional arguments, and uses an exclamation mark as a delimiter. The commands with parameters have the following syntax: command_name[!arg1][!arg2][!arg3][...].

A command name is often the same as the plugin that it runs, but it can be different. The command line includes macro definitions (such as $HOSTADDRESS$). Check commands also use the macros $ARG1$, $ARG2$, ..., $ARG32$ if the check command for the host or service passes additional arguments.

The following is an example that defines a command for trying to ping a host to make sure it is working properly. It does not use any arguments.

  define command
  {
    command_name  check-host-alive
    command_line  $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
  }

and a very short host definition that would use this check command, could be similar to the one shown here:

  define host
  {
    host_name     somemachine
    address       10.0.0.1
    check_command check-host-alive
  }

Such a check is usually done as part of the host checks. This allows Nagios to make sure that a machine is working properly if it responds to ICMP requests.

Commands can accept arguments, which offers a more flexible way of defining checks. A definition accepting parameters would be as follows:

  define command
  {
    command_name  check-host-alive-limits
    command_line  $USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5
  }

and the corresponding host definition would be:

  define host
  {
    host_name     othermachine
    address       10.0.0.2
    check_command check-host-alive-limits!3000.0,80%!5000.0,100%
  }

Another example is setting up a check command for a previously-defined service:

  define command
  {
    command_name  check_http
    command_line  $USER1$/check_http -H $HOSTADDRESS$
  }

This check can then be used when defining a service to be monitored by Nagios. Chapter 4, Overview of Nagios Plugins, covers standard Nagios plugins along with sample command definitions. Sample Nagios configurations are also included in the sources and are installed by the make install-config target.
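
Arguments work for service checks in the same way as for host checks. The sketch below assumes a second web server listening on port 8080 on linuxbox01; the command name, service description, and port are hypothetical, and the service definition is abbreviated. The -p option of check_http selects the port to connect to.

```cfg
# Hypothetical variant of check_http that takes the port as its first argument
define command
{
  command_name  check_http_port
  command_line  $USER1$/check_http -H $HOSTADDRESS$ -p $ARG1$
}

# Abbreviated service definition; the remaining service directives are omitted
define service
{
  host_name            linuxbox01
  service_description  WWW-8080
  check_command        check_http_port!8080
}
```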

Configuring Time Periods

Time periods are definitions of dates and times during which an action should be performed or specified people should be notified. They describe date and time ranges, and can be re-used across various operations.

A time period definition includes a name that uniquely identifies it in Nagios. It also contains a description, and one or more days or dates along with time spans.

A typical example of a time period would be working hours, which defines that a valid time to perform an action is from Monday to Friday during business hours. Another definition of a time period can be weekends, which means Saturday and Sunday, all day long.

The following is a sample time period for working hours:

  define timeperiod
  {
    timeperiod_name  workinghours
    alias            Working Hours, from Monday to Friday
    monday           09:00-17:00
    tuesday          09:00-17:00
    wednesday        09:00-17:00
    thursday         09:00-17:00
    friday           09:00-17:00
  }

This particular example tells Nagios that the acceptable time to perform something is from Monday to Friday between 9 AM and 5 PM. Each entry in a time period contains information on a date or weekday. It also contains a range of hours. Nagios first checks if the current date matches any of the dates specified. If it does, then it checks if the current time matches the time ranges specified for the date.

There are multiple ways of specifying a date. Depending on what type of date it is, one definition might take precedence over another. For example, a definition for December 24th is more important than a generic definition that every weekday an action should be performed between 9 AM and 5 PM.

Possible date types are mentioned here:

  • Calendar date: For example, 2009-11-01, which means November 1st, 2009 (Nagios accepts dates in the format YYYY-MM-DD)
  • Date recurring every year: For example, july 4, which means 4th of July every year
  • Specific day within a month: For example, day 14, which means the 14th of every month
  • Specific weekday, along with an offset in a month: For example, monday 1 september, which means the first Monday in September; monday -1 may would mean the last Monday in May
  • Specific weekday in all months: For example, monday 1, which means the 1st Monday of every month
  • Weekday: For example, monday, which means every Monday

The above list shows all date types in the order in which Nagios ranks them in terms of importance. This means that a date recurring every year will always be used in preference to an entry describing what should be done every Monday.
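
The December 24th case mentioned earlier could be sketched as follows. The time period name and the shortened hours are hypothetical. On December 24th, the yearly date entry is used instead of the weekday entry, whatever day of the week it falls on.

```cfg
# Hypothetical time period: regular working hours, with a short day on December 24th
define timeperiod
{
  timeperiod_name  workinghours-short-dec24
  alias            Working hours with a short day on December 24th
  monday           09:00-17:00
  tuesday          09:00-17:00
  wednesday        09:00-17:00
  thursday         09:00-17:00
  friday           09:00-17:00
  december 24      09:00-13:00
}
```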

In order to be able to correctly configure all objects, we will now create some standard time periods that will be used in configuration. The following example periods will be used in the remaining sections of this chapter, and it is recommended that you put them in the /etc/nagios/timeperiods/default.cfg file:

  define timeperiod
  {
    timeperiod_name  workinghours
    alias            Working Hours, from Monday to Friday
    monday           09:00-17:00
    tuesday          09:00-17:00
    wednesday        09:00-17:00
    thursday         09:00-17:00
    friday           09:00-17:00
  }
  define timeperiod
  {
    timeperiod_name  weekends
    alias            Weekends all day long
    saturday         00:00-24:00
    sunday           00:00-24:00
  }
  define timeperiod
  {
    timeperiod_name  24x7
    alias            24 hours a day 7 days a week
    monday           00:00-24:00
    tuesday          00:00-24:00
    wednesday        00:00-24:00
    thursday         00:00-24:00
    friday           00:00-24:00
    saturday         00:00-24:00
    sunday           00:00-24:00
  }

The last time period is also used by the WWW service for the linuxbox01 host, defined earlier. This way, the web server will be monitored all the time.

Configuring Contacts

Contacts define people who can either be owners of specific machines, or people who should be contacted in case of problems. Depending on how your organization chooses to contact people in case of problems, the definition of a contact may vary a lot. A contact consists of a unique name, a descriptive name, and one or more email addresses and/or pager numbers. Contact definitions can also contain additional data specific to how a person can be contacted.

A basic contact definition is shown here, and specifies the unique contact name, an alias, and contact information. It also specifies the event types that the person should receive and time periods during which notifications should be sent.

  define contact
  {
    contact_name                   jdoe
    alias                          John Doe
    email                          john.doe@yourcompany.com
    host_notification_period       workinghours
    service_notification_period    workinghours
    host_notification_options      d,u,r
    service_notification_options   w,u,c,r
    host_notification_commands     host-notify-by-email
    service_notification_commands  notify-by-email
  }

The following table describes all available directives when defining a contact. Items in bold are required when specifying a contact.

Contacts are also mapped to users that log into the Nagios web interface. This means that all operations performed via the interface will be logged as having been executed by that particular user and the web interface will use access granted to particular contact objects when evaluating whether an operation should be allowed or not. The contact_name field from a contact object maps to the user name in the Nagios web interface.

Contacts can be grouped. Usually, grouping is used to keep a list of which users are responsible for which tasks, and the group maps to job responsibilities for particular people. It also makes it possible to define people who should be responsible for handling problems at specific time periods, and Nagios will automatically contact the right people depending on the time at which a problem has occurred.

A sample definition of a contact group is as follows:

  define contactgroup
  {
    contactgroup_name              linux-admins
    alias                          Linux Administrators
    members                        jdoe,asmith
  }

This group is also used as the contact group for the linuxbox01 host and its WWW service, both defined earlier. This means that both jdoe and asmith will receive information on the status of this host and service.

The following is a complete list of directives that can be used to describe contact groups. Items in bold are required while specifying a contact group.

Members of a contact group can be specified either in the contact group definition or by using the contactgroups directive in a contact definition. It is also possible to combine both methods: some of the members can be specified in the contact group definition, and others can be specified in their contact object definition.
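As a sketch of the combined approach, the group below lists jdoe directly, while asmith joins the same group from the contact side. The alias and email address for asmith are made up for illustration. The opening brace is written on the define line, as in the sample files shipped with Nagios.

```nagios
# jdoe is listed in the group definition itself
define contactgroup {
    contactgroup_name              linux-admins
    alias                          Linux Administrators
    members                        jdoe
}

# asmith joins linux-admins through the contactgroups directive
# (alias and email are assumptions, not taken from the text)
define contact {
    contact_name                   asmith
    contactgroups                  linux-admins
    alias                          Alice Smith
    email                          asmith@yourcompany.com
    host_notification_period       workinghours
    service_notification_period    workinghours
    host_notification_options      d,u,r
    service_notification_options   w,u,c,r
    host_notification_commands     host-notify-by-email
    service_notification_commands  notify-by-email
}
```

With both definitions loaded, linux-admins should contain jdoe and asmith, just as if both had been listed in members.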

Contacts are used to specify who should be contacted if the status of one or more hosts or services changes. Nagios accepts both contacts and contact groups in its object definitions. This allows making either specific people or entire groups responsible for particular machines or services.

It is also possible to specify different people or groups for handling host-related and service-related problems—for example, hardware administrators for handling host problems and system administrators for handling service issues.
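A minimal sketch of such a split is shown below. The host linuxbox02, its address, and the hardware-admins and system-admins groups are hypothetical; both groups, the 24x7 time period, and the check commands are assumed to be defined elsewhere.

```nagios
# Host problems go to the (hypothetical) hardware-admins group
define host {
    host_name               linuxbox02
    alias                   Linux Server 02
    address                 10.0.2.2
    check_command           check-host-alive
    max_check_attempts      5
    check_period            24x7
    notification_interval   30
    notification_period     24x7
    contact_groups          hardware-admins
}

# Problems with the SSH service on the same host go to system-admins
define service {
    host_name               linuxbox02
    service_description     SSH
    check_command           check_ssh
    max_check_attempts      3
    check_interval          10
    retry_interval          2
    check_period            24x7
    notification_interval   30
    notification_period     24x7
    contact_groups          system-admins
}
```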

For Nagios to function properly, we need to create at least one contact for it to use, and put this definition in the /etc/nagios/contacts/nagiosadmin.cfg file:

  define contact
  {
    contact_name                   nagiosadmin
    contactgroups                  admins
    alias                          Nagios administrator
    email                          administrator@yourcompany.com
    host_notification_period       workinghours
    service_notification_period    workinghours
    host_notification_options      d,u,r
    service_notification_options   w,u,c,r
    host_notification_commands     host-notify-by-email
    service_notification_commands  notify-by-email
  }

We also need to define the admins group in the /etc/nagios/contacts/groups.cfg file:

  define contactgroup
  {
    contactgroup_name              admins
    alias                          System administrators
  }

If you are not very familiar with Nagios, it is recommended that you leave the contact's name as nagiosadmin, as this will also be the user for all web interface operations.

Templates and Object Inheritance

In order to allow the flexible configuration of machines, Nagios offers a powerful inheritance engine. The main concept is that administrators can set up templates that define common parameters, and re-use these templates in actual host or service definitions. The mechanism even offers the possibility to create templates that inherit parameters from other templates.

Templates are plain Nagios objects that specify the register directive and set it to 0, which means that they will not be registered as an actual host or service to monitor. Objects that inherit parameters from a template or another host should have a use directive pointing to the short name of the template object they are using.

When defining a template, its name is always specified using the name directive. This is slightly different to how typical hosts and services are registered, as they require the host_name and/or service_description parameters.

Inheritance can be used to define a template for basic host checks, with only basic parameters such as IP address being defined for each particular host. For example:

  define host
  {
    name                            generic-server
    check_command                   check-host-alive
    check_interval                  5
    retry_interval                  1
    max_check_attempts              5
    check_period                    24x7
    notification_interval           30
    notification_period             24x7
    notification_options            d,u,r
    register                        0
  }
  define host
  {
    use                             generic-server
    host_name                       linuxbox01
    alias                           Linux Server 01
    address                         10.0.2.1
    contact_groups                  linux-admins
  }

Version 3 of Nagios also introduces inheritance from multiple templates. To do this, simply put multiple names in the use directive, separated by commas. This allows an object to use several templates, each of which defines some or all of the directives. If multiple templates specify the same parameter, the value from the first template specifying it will be used. For example:

  define service
  {
    name                        generic-service
    check_interval              10
    retry_interval              2
    max_check_attempts          3
    check_period                24x7
    register                    0
  }
  define service
  {
    name                        workinghours-service
    check_period                workinghours
    notification_interval       30
    notification_period         workinghours
    notification_options        w,c,u,r
    register                    0
  }
  define service
  {
    use                         workinghours-service,generic-service
    contact_groups              linux-admins
    host_name                   linuxbox01
    service_description         SSH
    check_command               check_ssh
  }

In this case, values from both templates will be used. The value of workinghours will be used for the check_period directive, as this directive was first specified in the workinghours-service template. Changing the order in the use directive to generic-service,workinghours-service would cause the value of the check_period parameter to be 24x7.

Nagios also accepts creating multiple levels of templates. For example, you can set up a generic service template, and inherit it to create additional templates for various types of checks such as local services, resource sensitive checks, and templates for passive-only checks.
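The following sketch shows one way such a hierarchy might look. The template names local-service and passive-service, and the values they override, are illustrative assumptions rather than definitions from this chapter.

```nagios
# First level: settings shared by every service
define service {
    name                    generic-service
    check_interval          10
    retry_interval          2
    max_check_attempts      3
    check_period            24x7
    register                0
}

# Second level: checks of the local machine, run more frequently
define service {
    name                    local-service
    use                     generic-service
    check_interval          5
    register                0
}

# Second level: results are submitted to Nagios, never actively polled
define service {
    name                    passive-service
    use                     generic-service
    active_checks_enabled   0
    passive_checks_enabled  1
    register                0
}
```

An actual service would then name local-service or passive-service in its use directive and receive the generic-service settings indirectly.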

Let's consider the following objects and template structures:

  define host
  {
    host_name      linuxserver1
    use            generic-linux,template-chicago
    .....
  }
  define host
  {
    register       0
    name           generic-linux
    use            generic-server
    .....
  }
  define host
  {
    register       0
    name           generic-server
    use            generic-host
    .....
  }
  define host
  {
    register       0
    name           template-chicago
    use            contacts-chicago,misc-chicago
    .....
  }

The following illustration shows how Nagios will search for values for all directives.

[Illustration: Templates and Object Inheritance]

When looking for parameters, Nagios will first look for the value in the linuxserver1 object definition. Next, it will search the templates in the following order: generic-linux, generic-server, generic-host, template-chicago, contacts-chicago, and finally misc-chicago.
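To make the search order concrete, suppose that two of the templates in this structure set the directives shown below. The values and the chicago-admins group are assumptions made up for this illustration, and only the directives relevant to it are shown.

```nagios
# generic-server is searched before any of the Chicago templates
define host {
    name                    generic-server
    use                     generic-host
    notification_interval   30
    register                0
}

# contacts-chicago is only searched after the whole generic-linux branch
define host {
    name                    contacts-chicago
    notification_interval   60
    contact_groups          chicago-admins
    register                0
}
```

Assuming no other object in the chain sets these directives, linuxserver1 would take notification_interval 30 from generic-server, because that template is found first, and contact_groups chicago-admins from contacts-chicago, because nothing earlier in the search defines it.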

It is also possible to set up host or service dependencies that will be inherited from a template. In this case, the dependent hosts or services can't be templates themselves, and need to be registered as objects that will be monitored by the Nagios daemon.
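As a rough sketch, a dependency template might hold the failure criteria, while each real dependency names registered hosts. The template name generic-hostdependency and the host router01 are hypothetical; linuxbox01 is the host defined earlier.

```nagios
# Template: shared criteria, not registered as a real dependency
define hostdependency {
    name                           generic-hostdependency
    execution_failure_criteria     n
    notification_failure_criteria  d,u
    register                       0
}

# Real dependency: linuxbox01 depends on router01 (a hypothetical host);
# both must be registered hosts, not templates
define hostdependency {
    use                            generic-hostdependency
    host_name                      router01
    dependent_host_name            linuxbox01
}
```

With these criteria, notifications for linuxbox01 would be suppressed while router01 is DOWN or UNREACHABLE, and checks of linuxbox01 would continue to run.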

Introduction to Notifications

Notifications are the way in which Nagios lets people know that something is either wrong or has returned to normal operation. They are not objects on their own, but provide very important functionality in Nagios. Configuring notifications correctly might seem a bit tricky in the beginning.

When and how notifications are sent out is configured as part of contact configuration. Each contact has configuration directives on when notifications can be sent out, and how he or she should be contacted. Contacts also contain information about contact details—telephone number, email address, Jabber/MSN address, and so on. Each host and service is configured for when the information about it should be sent, and who should be contacted. Nagios then combines all of this information in order to notify people of the changes in status.

Notifications may be sent out in one of the following situations:

  1. The host changes to a DOWN or UNREACHABLE state; a notification is sent out after the number of minutes specified by first_notification_delay in the corresponding host object
  2. The host remains in a DOWN or UNREACHABLE state; a notification is sent out every notification_interval minutes, as specified in the corresponding host object
  3. The host recovers to an UP state; a notification is sent out immediately and only once
  4. The host starts or stops flapping; a notification is sent out immediately
  5. The host remains flapping; a notification is sent out every notification_interval minutes, as specified in the corresponding host object
  6. The service changes to a WARNING, CRITICAL or UNKNOWN state; a notification is sent out after the number of minutes specified by first_notification_delay in the corresponding service object
  7. The service remains in a WARNING, CRITICAL or UNKNOWN state; a notification is sent out every notification_interval minutes, as specified in the corresponding service object
  8. The service recovers to an OK state; a notification is sent out immediately and only once
  9. The service starts or stops flapping; a notification is sent out immediately
  10. The service remains flapping; a notification is sent out every notification_interval minutes, as specified in the corresponding service object

If one of these conditions occurs, Nagios starts evaluating whether information about it should be sent out and to whom.
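The delays and intervals named in the list above are set per host or service. The host template below is a sketch of how they might be combined; the template name and the chosen values are assumptions for illustration.

```nagios
# Hypothetical host template showing the notification timing directives
define host {
    name                      delayed-notify-host
    # wait 10 minutes after the problem is confirmed before the first notification
    first_notification_delay  10
    # while the problem persists, repeat the notification every 30 minutes
    notification_interval     30
    # notify on DOWN, UNREACHABLE, recovery, and flapping start/stop
    notification_options      d,u,r,f
    flap_detection_enabled    1
    notification_period       24x7
    register                  0
}
```

The same directives are available in service definitions, with w, u, c, r, and f as the values for notification_options.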

First of all, the current date and time is checked against the notification time period. The time period is taken from the notification_period field of the current host or service definition. The notification will be sent out only if the time period includes the current time.

Next, a list of users based on the contacts and contact_groups fields is created. A complete list of users is made based on all members of all groups, and included groups, as well as all the contacts directly bound to the current host or service.
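The sketch below shows how the three sources of users might be combined for one host. The all-admins group and the host linuxbox03 are hypothetical; linux-admins, nagiosadmin, and generic-server are the objects defined earlier.

```nagios
# A group that has no direct members but includes another group
define contactgroup {
    contactgroup_name       all-admins
    alias                   All Administrators
    contactgroup_members    linux-admins
}

# The host names one contact directly and one group
define host {
    use                     generic-server
    host_name               linuxbox03
    alias                   Linux Server 03
    address                 10.0.2.3
    contacts                nagiosadmin
    contact_groups          all-admins
}
```

For this host, the resulting list would consist of nagiosadmin, named directly, together with every member of linux-admins, reached through the included group.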

Each of the matched users is checked to see whether he or she should be notified about the current event. In this case, each user's time period is also checked to see if it includes the current date and time. The directive host_notification_period or service_notification_period is used depending on whether the notification is for the host or the service.

For host notifications, the host_notification_options directive for each contact is also used to determine whether that particular person should be contacted; for example, different users might be contacted about an unreachable host than those contacted if the host is actually down. For service notifications, the service_notification_options parameter is used to check whether each user should be notified about the issue. The section on hosts and services configuration describes what values these directives take.
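As an illustration of this per-contact filtering, the two hypothetical contacts below could both be attached to the same host yet receive different notifications. The names, addresses, and the choice of options are assumptions; the 24x7 and workinghours time periods are expected to exist.

```nagios
# Network staff: told about UNREACHABLE hosts at any time, no service notifications
define contact {
    contact_name                   netadmin
    alias                          Network Administrator
    email                          netadmin@yourcompany.com
    host_notification_period       24x7
    service_notification_period    24x7
    host_notification_options      u,r
    service_notification_options   n
    host_notification_commands     host-notify-by-email
    service_notification_commands  notify-by-email
}

# System staff: told about DOWN hosts and all service problems, working hours only
define contact {
    contact_name                   sysadmin
    alias                          System Administrator
    email                          sysadmin@yourcompany.com
    host_notification_period       workinghours
    service_notification_period    workinghours
    host_notification_options      d,r
    service_notification_options   w,u,c,r
    host_notification_commands     host-notify-by-email
    service_notification_commands  notify-by-email
}
```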

If all of these criteria have been met, Nagios will send a notification to this user. It will now use commands specified in the host_notification_commands and service_notification_commands directives.

It is possible to specify multiple commands that will be used for notifications. So it is possible to set up Nagios such that it sends both an email as well as a message on an instant messaging system.
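Multiple commands are given as a comma-separated list. In the sketch below, which is a variant of the jdoe contact shown earlier, host-notify-by-jabber and notify-by-jabber are hypothetical command names that would have to be defined as command objects elsewhere.

```nagios
# Variant of the earlier jdoe contact with two commands per notification type
define contact {
    contact_name                   jdoe
    alias                          John Doe
    email                          john.doe@yourcompany.com
    host_notification_period       workinghours
    service_notification_period    workinghours
    host_notification_options      d,u,r
    service_notification_options   w,u,c,r
    # each listed command is run for every notification sent to this contact
    host_notification_commands     host-notify-by-email,host-notify-by-jabber
    service_notification_commands  notify-by-email,notify-by-jabber
}
```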

Nagios also offers escalations that allow emails to be sent to other people when a problem remains unresolved for too long. This can be used to propagate problems to higher management, or to teams that might be affected by unresolved problems. It is a very powerful mechanism and is split between host- and service-based escalations. This functionality is described in more detail in Chapter 6, Notifications and Events.
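As a brief preview, a service escalation might look like the sketch below. The it-management group is hypothetical and the numbers are illustrative; the escalation is attached to the SSH service on linuxbox01 defined earlier.

```nagios
# From the third notification onwards, also involve management
define serviceescalation {
    host_name               linuxbox01
    service_description     SSH
    # the escalation takes effect starting with the third notification
    first_notification      3
    # 0 keeps it in effect until the problem is resolved
    last_notification       0
    # while escalated, notify every 60 minutes
    notification_interval   60
    contact_groups          linux-admins,it-management
}
```

The original linux-admins group is repeated in contact_groups so that it continues to be notified alongside the additional group while the escalation is active.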