Installing and Configuring Nagios on Linux

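1. About This Article

This article is based on David_Tang's article at http://www.cnblogs.com/mchina/archive/2013/02/20/2883404.html and on other material found online; most of the content is reproduced from David_Tang's article.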

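2. Introduction to Nagios

Nagios is an open-source tool for monitoring computer systems and networks. It can monitor the state of Windows, Linux, and Unix hosts, network devices such as switches and routers, and printers. When a system or service enters an abnormal state, it sends an email or SMS alert to notify operations staff immediately; when the state recovers, it sends a recovery notification by email or SMS.

Nagios was originally named NetSaint and was developed by Ethan Galstad, who has maintained it ever since. The name NAGIOS is an acronym for "Nagios Ain't Gonna Insist On Sainthood"; "Agios" is the Greek word for "saint". Nagios was developed to run on Linux, but it also works very well on other Unix systems.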

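Main features

    •Monitoring of network services (SMTP, POP3, HTTP, NNTP, ICMP, SNMP, FTP, SSH)
    •Monitoring of host resources (CPU load, disk usage, system logs), including Windows hosts (using the NSClient++ plugin)
    •Custom plugins can collect data over the network to monitor almost anything (temperature, alarms, ...)
    •Scripts can be run remotely by configuring the Nagios Remote Plugin Executor (NRPE)
    •Remote monitoring is supported through SSH or SSL encrypted tunnels
    •A simple plugin design lets users easily develop the service checks they need in many languages (shell scripts, C++, Perl, Ruby, Python, PHP, C#, etc.)
    •Many graphing plugins are available (Nagiosgraph, Nagiosgrapher, PNP4Nagios, etc.)
    •Service checks can run in parallel
    •A hierarchy of network hosts can be defined so that checks proceed level by level, starting from the parent host
    •Notifications are sent when a service or host has a problem, via email, pager, SMS, or any user-defined plugin
    •Custom event handlers can be defined to restart a failed service or host
    •Automatic log rotation
    •Support for redundant monitoring
    •A web interface for viewing current network status, notifications, problem history, log files, and more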

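3. How Nagios Works

Nagios exists to monitor services and hosts, but it does not implement any checks itself; all monitoring and checking is carried out by plugins.

Once started, Nagios periodically invokes plugins to check server state. It maintains a queue into which every status result returned by a plugin is placed; Nagios reads results from the head of the queue, processes them, and displays the resulting state in the web interface.

Many plugins are available for Nagios, and they make it easy to monitor a wide range of services. After installation, the libexec directory under the Nagios home directory holds all the bundled plugins; for example, check_disk checks disk space and check_load checks CPU load. Run ./check_xxx -h to see what a plugin does and how to use it.

Nagios recognizes four return states: 0 (OK) means normal and is shown in green; 1 (WARNING) means a warning, shown in yellow; 2 (CRITICAL) means a very serious error, shown in red; and 3 (UNKNOWN) means an unknown error, shown in dark yellow. Nagios uses the value returned by a plugin to determine the state of the monitored object and displays it in the web interface so that administrators can spot failures promptly.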

[Figure: the four monitoring states]
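The exit-code convention described above can be illustrated with a small plugin-like script. This is only a sketch, not one of the bundled plugins: the function names, the 80%/90% thresholds, and the message format are assumptions chosen for the example.

```python
import shutil

# Exit codes that Nagios interprets, as listed in the text above.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(used_percent, warn=80.0, crit=90.0):
    """Map a usage percentage to a status code (thresholds are illustrative)."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (status_code, message) for the filesystem that holds `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # Anything we cannot measure is reported as UNKNOWN, not as a failure.
        return UNKNOWN, f"DISK UNKNOWN - cannot inspect {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero capacity"
    used_percent = 100.0 * usage.used / usage.total
    status = classify(used_percent, warn, crit)
    return status, f"DISK {LABELS[status]} - {path} is {used_percent:.1f}% used"


def main(path="/"):
    """Print one line of output and return the exit code for the caller."""
    code, message = check_disk(path)
    print(message)
    return code

# A script used as a real plugin would finish with:
#   import sys; sys.exit(main())
```

Nagios only looks at the exit status and the printed text, which is why plugins can be written in any language.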

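Alerting comes next. A monitoring system that detects a problem but cannot raise an alert is of little use, so alerting is one of the most important Nagios features. As with checks, however, Nagios itself contains no alert-delivery code and not even a plugin for it; delivery is left to the user or to other open-source projects.

Installing Nagios means installing the base platform, that is, the Nagios package itself. It is the framework of the monitoring system and the foundation for all monitoring.

According to the official Nagios documentation, Nagios has almost no dependencies; it only requires Linux or another supported operating system. Without Apache (an HTTP server), though, there is no convenient interface for viewing monitoring information, so Apache can be treated as a prerequisite. Installation guides for Apache are widely available; after installing it, verify that it works.

With the plugin mechanism for managing server objects covered, the next question is how Nagios manages remote server objects. Nagios provides an add-on for this, NRPE, and obtains the state of remote servers by running it periodically. The relationship between the components is shown in the figure below: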

[Figure: Nagios managing remote services through NRPE]

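1. Nagios runs the check_nrpe plugin installed on the monitoring server and tells it which services to check.

2. check_nrpe connects over SSL to the NRPE daemon on the remote machine.

3. NRPE runs local plugins (check_disk, etc.) to check local services and state.

4. NRPE returns the results to check_nrpe on the monitoring server, which places them in the Nagios status queue.

5. Nagios reads the entries in the queue in order and displays the results.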
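The flow in the steps above can be pictured with a toy simulation. It is not how Nagios or NRPE is implemented: there is no network or SSL here, and the plugin callables, host name, and function names are hypothetical stand-ins used only to show results entering a queue and being read from the head in order.

```python
from collections import deque

STATE_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_remote_check(host, service, plugins):
    """Stand-in for steps 1-4: run the plugin for `service` and return its result.

    `plugins` maps a service name to a callable returning (exit_code, text).
    In a real deployment this work is done by check_nrpe and the NRPE daemon.
    """
    plugin = plugins.get(service)
    if plugin is None:
        return host, service, 3, f"no plugin configured for {service}"
    code, text = plugin()
    return host, service, code, text


def process_queue(queue):
    """Step 5: read results from the head of the queue, oldest first."""
    lines = []
    while queue:
        host, service, code, text = queue.popleft()
        lines.append(f"{host}/{service}: {STATE_NAMES.get(code, 'UNKNOWN')} - {text}")
    return lines


# Hypothetical plugins that would run on the remote host.
plugins = {
    "check_disk": lambda: (0, "/ is 42% used"),
    "check_load": lambda: (1, "load average 5.2"),
}

queue = deque()
for service in ("check_disk", "check_load", "check_mysql"):
    queue.append(run_remote_check("node2", service, plugins))

report = process_queue(queue)
for line in report:
    print(line)
```

The check for check_mysql ends up UNKNOWN because no plugin is registered for it in this toy setup.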

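4. Test Environment

Host Name	OS	IP	Software
node1	rhel5.4	192.168.1.151 192.168.11.164	hadoop0.20.2, namenode, dns, nfs, apache, php, nagios, nagios-plugins
node2	rhel5.4	192.168.1.152 192.168.11.167	hadoop0.20.2, datanode, mysql, nagios-plugins, nrpe
node3	rhel5.4	192.168.1.153 192.168.11.166	hadoop0.20.2, datanode, hive

node1 runs Nagios. It processes the monitoring data and provides the web interface for viewing and management; it can also monitor itself.

node2 runs NRPE and the other client components. It executes checks when the monitoring server requests them and sends the results back.

The firewall is disabled (iptables: Firewall is not running).

SELinux is disabled (SELINUX=disabled).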

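5. Monitoring Goals

Host	Services to monitor
node1	CPU load
	Number of logged-in users
	Whether port 80 is open
	Whether the host is alive
	Usage of the / partition
	Total number of processes
	Whether the SSH service is running
	Swap usage
	Whether the DNS service is running
node2	Whether the host is alive
	datanode process
	MySQL database
node3	Whether the host is alive
	datanode process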

6. Installing the Nagios Server

6.1 Required base packages: gcc glibc glibc-common gd gd-devel xinetd openssl-devel

[root@node1 nagios]# rpm -q gcc glibc glibc-common gd gd-devel xinetd openssl-devel
gcc-4.1.2-46.el5
glibc-2.5-42
glibc-common-2.5-42
gd-2.0.33-9.4.el5_4.2
gd-devel-2.0.33-9.4.el5_4.2
xinetd-2.3.14-10.el5
openssl-devel-0.9.8e-26.el5_9.1
---- If any of these packages are missing, install them with yum

6.2 Create the nagios user and group

[root@node1 app]# useradd nagios
[root@node1 app]# mkdir /usr/local/nagios
[root@node1 app]# chown -R nagios.nagios /usr/local/nagios
[root@node1 app]# ll -d /usr/local/nagios/
drwxr-xr-x 2 nagios nagios 4096 Sep 24 12:02 /usr/local/nagios/

6.3 Compile and install Nagios

[root@node1 app]# cd nagios
[root@node1 nagios]# ./configure --prefix=/usr/local/nagios
*** Configuration summary for nagios 3.3.1 07-25-2011 ***:

 General Options:
-------------------------
        Nagios executable:  nagios
        Nagios user/group:  nagios,nagios
       Command user/group:  nagios,nagios
            Embedded Perl:  no
             Event Broker:  yes
        Install ${prefix}:  /usr/local/nagios
                Lock file:  ${prefix}/var/nagios.lock
Check result directory:  ${prefix}/var/spool/checkresults
           Init directory:  /etc/rc.d/init.d
  Apache conf.d directory:  /etc/httpd/conf.d
             Mail program:  /bin/mail
                  Host OS:  linux-gnu

 Web Interface Options:
------------------------
                 HTML URL:  http://localhost/nagios/
                  CGI URL:  http://localhost/nagios/cgi-bin/
 Traceroute (used by WAP):  /bin/traceroute


Review the options above for accuracy.  If they look okay,
type 'make all' to compile the main program and CGIs.

[root@node1 nagios]# make all
cd ./base && make
make[1]: Entering directory `/app/nagios/base'
*** Support Notes *******************************************

If you have questions about configuring or running Nagios,
please make sure that you:

     - Look at the sample config files
     - Read the documentation on the Nagios Library at:
           http://library.nagios.com

before you post a question to one of the mailing lists.
Also make sure to include pertinent information that could
help others help you.  This might include:

     - What version of Nagios you are using
     - What version of the plugins you are using
     - Relevant snippets from your config files
     - Relevant error messages from the Nagios log file

For more information on obtaining support for Nagios, visit:

       http://support.nagios.com

*************************************************************

Enjoy.

[root@node1 nagios]# make install
*** Main program, CGIs and HTML files installed ***

You can continue with installing Nagios as follows (type 'make'
without any arguments for a list of all possible options):

  make install-init
- This installs the init script in /etc/rc.d/init.d

  make install-commandmode
- This installs and configures permissions on the
       directory for holding the external command file

  make install-config
- This installs sample config files in /usr/local/nagios/etc

make[1]: Leaving directory `/app/nagios'

[root@node1 nagios]# make install-init
/usr/bin/install -c -m 755 -d -o root -g root /etc/rc.d/init.d
/usr/bin/install -c -m 755 -o root -g root daemon-init /etc/rc.d/init.d/nagios

*** Init script installed ***

[root@node1 nagios]# make install-commandmode
/usr/bin/install -c -m 775 -o nagios -g nagios -d /usr/local/nagios/var/rw
chmod g+s /usr/local/nagios/var/rw

*** External command directory configured ***

[root@node1 nagios]# make install-config
/usr/bin/install -c -m 775 -o nagios -g nagios -d /usr/local/nagios/etc
/usr/bin/install -c -m 775 -o nagios -g nagios -d /usr/local/nagios/etc/objects
/usr/bin/install -c -b -m 664 -o nagios -g nagios sample-config/nagios.cfg /usr/local/nagios/etc/nagios.cfg
/usr/bin/install -c -b -m 664 -o nagios -g nagios sample-config/cgi.cfg /usr/local/nagios/etc/cgi.cfg
/usr/bin/install -c -b -m 660 -o nagios -g nagios sample-config/resource.cfg /usr/local/nagios/etc/resource.cfg
/usr/bin/install -c -b -m 664 -o nagios -g nagios sample-config/template-object/templates.cfg /usr/local/nagios/etc/objects/templates.cfg
/usr/bin/install -c -b -m 664 -o nagios -g nagios sample-config/template-object/commands.cfg /usr/local/nagios/etc/objects/commands.cfg
/usr/bin/install -c -b -m 664 -o nagios -g nagios sample-config/template-object/contacts.cfg /usr/local/nagios/etc/objects/contacts.cfg
/usr/bin/install -c -b -m 664 -o nagios -g nagios sample-config/template-object/timeperiods.cfg /usr/local/nagios/etc/objects/timeperiods.cfg
/usr/bin/install -c -b -m 664 -o nagios -g nagios sample-config/template-object/localhost.cfg /usr/local/nagios/etc/objects/localhost.cfg
/usr/bin/install -c -b -m 664 -o nagios -g nagios sample-config/template-object/windows.cfg /usr/local/nagios/etc/objects/windows.cfg
/usr/bin/install -c -b -m 664 -o nagios -g nagios sample-config/template-object/printer.cfg /usr/local/nagios/etc/objects/printer.cfg
/usr/bin/install -c -b -m 664 -o nagios -g nagios sample-config/template-object/switch.cfg /usr/local/nagios/etc/objects/switch.cfg

*** Config files installed ***

Remember, these are *SAMPLE* config files.  You'll need to read
the documentation for more information on how to actually define
services, hosts, etc. to fit your particular needs.

[root@node1 nagios]# chkconfig --add nagios
[root@node1 nagios]# chkconfig --level 35 nagios on
[root@node1 nagios]# chkconfig --list nagios
nagios             0:off    1:off    2:off    3:on    4:on    5:on    6:off

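6.4 Verify the installation

Change to the installation path (/usr/local/nagios here) and check that the five directories etc, bin, sbin, share, and var exist. If they do, the program has been installed correctly. The Nagios directories are used as follows:

Directory	Purpose
bin	Nagios executables
etc	Nagios configuration files
sbin	Nagios CGI files, that is, the files needed to execute external commands
share	Nagios web page files
libexec	Nagios external plugins
var	Nagios log files, the lock file, and similar files
var/archives	Automatically archived Nagios logs
var/rw	Holds the external command file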
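The directory check described above is easy to script. The following is a minimal sketch; the function names are made up for this example, and the default prefix is the one used in this article.

```python
from pathlib import Path

# The five directories the text above says should exist after installation.
EXPECTED_DIRS = ("bin", "etc", "sbin", "share", "var")


def missing_dirs(prefix, expected=EXPECTED_DIRS):
    """Return the expected subdirectories that are absent under `prefix`."""
    root = Path(prefix)
    return [name for name in expected if not (root / name).is_dir()]


def install_report(prefix="/usr/local/nagios"):
    """Summarize the check as a single human-readable line."""
    missing = missing_dirs(prefix)
    if missing:
        return f"{prefix}: missing {', '.join(missing)}"
    return f"{prefix}: all expected directories present"
```

Calling print(install_report()) on the monitoring server reports whether any of the five directories is missing.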

6.5 Install the Nagios plugins

[root@node1 nagios-plugins-1.4.15]# ./configure --prefix=/usr/local/nagios
config.status: creating po/Makefile
--with-apt-get-command: 
--with-ping6-command: /bin/ping6 -n -U -w %d -c %d %s
--with-ping-command: /bin/ping -n -U -w %d -c %d %s
--with-ipv6: yes
--with-mysql: no
--with-openssl: yes
--with-gnutls: no
--enable-extra-opts: no
--with-perl: /usr/bin/perl
--enable-perl-modules: no
--with-cgiurl: /nagios/cgi-bin
--with-trusted-path: /bin:/sbin:/usr/bin:/usr/sbin
--enable-libtap: no
[root@node1 nagios-plugins-1.4.15]# make && make install

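6.6 Install and configure Apache and PHP

Apache and PHP are not strictly required by Nagios, but Nagios provides a web interface that clearly shows the state of monitored hosts and resources, so installing a web server is well worth it.

Note that from Nagios 3.1.x onward, the web interface requires PHP support. The article this guide follows used nagios-3.4.3 (the build output in section 6.3 reports 3.3.1), so after compiling Apache you also need to build the PHP module; PHP 5.4.10 is used here.

a. Install Apache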

# wget http://archive.apache.org/dist/httpd/httpd-2.2.23.tar.gz

# tar zxvf httpd-2.2.23.tar.gz

# cd httpd-2.2.23

# ./configure --prefix=/usr/local/apache2

# make && make install

If configure reports an error, re-running it with --with-included-apr may resolve it.

b. Install PHP

# wget http://cn2.php.net/distributions/php-5.4.10.tar.gz

# tar zxvf php-5.4.10.tar.gz

# cd php-5.4.10

# ./configure --prefix=/usr/local/php --with-apxs2=/usr/local/apache2/bin/apxs 

# make && make install

c. Configure Apache

Open the Apache configuration file /usr/local/apache2/conf/httpd.conf.

---- Find:
User daemon
Group daemon
---- Change to:
User nagios
Group nagios
---- Then find:
<IfModule dir_module>
   DirectoryIndex index.html
</IfModule>
---- Change to:
<IfModule dir_module>
   DirectoryIndex index.html index.php
</IfModule>
---- Then add the following line:
AddType application/x-httpd-php .php

For security, the Nagios web interface should normally be accessible only after authentication. To add the authentication settings, append the following to the end of httpd.conf:

#setting for nagios
ScriptAlias /nagios/cgi-bin "/usr/local/nagios/sbin"
<Directory "/usr/local/nagios/sbin">
     AuthType Basic
     Options ExecCGI
     AllowOverride None
     Order allow,deny
     Allow from all
     AuthName "Nagios Access"
     # Password file used to authenticate access to this directory
     AuthUserFile /usr/local/nagios/etc/htpasswd
     Require valid-user
</Directory>
Alias /nagios "/usr/local/nagios/share"
<Directory "/usr/local/nagios/share">
     AuthType Basic
     Options None
     AllowOverride None
     Order allow,deny
     Allow from all
     AuthName "nagios Access"
     AuthUserFile /usr/local/nagios/etc/htpasswd
     Require valid-user
</Directory>

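d. Create the Apache authentication file

The configuration above refers to the authentication file htpasswd, which now has to be created:

# /usr/local/apache2/bin/htpasswd -c /usr/local/nagios/etc/htpasswd david

This creates the htpasswd file in /usr/local/nagios/etc. Accessing http://192.168.11.164/nagios/ now requires a user name and password.

e. View the contents of the authentication file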

# cat /usr/local/nagios/etc/htpasswd

f. Start the Apache service

# /usr/local/apache2/bin/apachectl start

At this point the Nagios installation is essentially complete, and the web interface can be reached from a browser.


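7. Configuring Nagios

Nagios is mainly used to monitor one or more local and remote hosts, covering both local resources and externally facing services. Out of the box, the Nagios configuration contains only sample and template files rather than the hosts you want to monitor. To make Nagios useful, you must edit the configuration files and add the hosts and services to monitor, as described in detail below.

7.1 Default configuration files

After installation, the default configuration files are in /usr/local/nagios/etc.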

[root@node1 ~]# cd /usr/local/nagios/
[root@node1 nagios]# ls
bin  etc  include  libexec  sbin  share  var
[root@node1 nagios]# cd etc/
[root@node1 etc]# ls
cgi.cfg  contacts.cfg  hosts.cfg  htpasswd  nagios.cfg  objects  resource.cfg  services.cfg  timeperiods.cfg
[root@node1 etc]# cd objects/
[root@node1 objects]# ls
commands.cfg  localhost.cfg  switch.cfg     templates.cfg.bak  windows.cfg
contacts.cfg  printer.cfg    templates.cfg  timeperiods.cfg

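The purpose of each file or directory is listed below:

File or directory	Purpose
cgi.cfg	Configuration file that controls CGI access
nagios.cfg	Main Nagios configuration file
resource.cfg	Variable definitions, also called the resource file; variables such as $USER1$ are defined here so that other configuration files can reference them
objects	A directory containing many configuration file templates used to define Nagios objects
objects/commands.cfg	Command definitions, which other configuration files can reference
objects/contacts.cfg	Defines contacts and contact groups
objects/localhost.cfg	Defines monitoring of the local host
objects/printer.cfg	Template for monitoring printers; not enabled by default
objects/switch.cfg	Template for monitoring switches and routers; not enabled by default
objects/templates.cfg	Host and service templates that other configuration files can reference
objects/timeperiods.cfg	Defines the time periods Nagios uses for monitoring
objects/windows.cfg	Template for monitoring Windows hosts; not enabled by default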

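7.2 Relationships between configuration files

Configuring Nagios involves several kinds of definitions: hosts and host groups, services and service groups, contacts and contact groups, monitoring time periods, monitoring commands, and so on. As these definitions suggest, the Nagios configuration files are interrelated and reference one another.

To configure a working Nagios monitoring system, you need to understand which configuration files depend on which. Four points matter most:

First: define which hosts, host groups, services, and service groups to monitor;

Second: define which command performs each check;

Third: define the time periods for monitoring;

Fourth: define the contacts and contact groups to notify when a host or service has a problem.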

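7.3 Configuring Nagios

To keep things clear and easy to maintain, it is advisable to put each kind of Nagios object definition in its own configuration file:

Create hosts.cfg to define hosts and host groups

Create services.cfg to define services

Use the default contacts.cfg to define contacts and contact groups

Use the default commands.cfg to define commands

Use the default timeperiods.cfg to define monitoring time periods

Use the default templates.cfg as the template file that other definitions reference

a. The templates.cfg file

Nagios mainly monitors host resources and services, which are called objects in the Nagios configuration. To avoid defining the same attributes repeatedly, Nagios provides a template configuration file in which shared attributes are defined once as templates and then referenced many times. This is the role of templates.cfg.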
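How a definition picks up values from a template through the use directive can be sketched as follows. This is a simplified model, not Nagios's own resolution code: it handles a single parent only, and it merges every directive, whereas real Nagios treats some directives such as name and register specially. The dictionary layout and the resolve function are assumptions made for this example; the directive values are taken from the templates.cfg shown in this article.

```python
def resolve(name, templates, _seen=None):
    """Return the effective directives for `name`, following `use` chains.

    `templates` maps a template name to its own directives. Values set on the
    child override inherited ones. Only single inheritance is handled.
    """
    seen = _seen or ()
    if name in seen:
        raise ValueError(f"circular 'use' chain at {name}")
    own = templates[name]
    merged = {}
    parent = own.get("use")
    if parent is not None:
        # Start from everything the parent provides...
        merged.update(resolve(parent, templates, seen + (name,)))
    # ...then let the child's own directives win.
    merged.update({k: v for k, v in own.items() if k != "use"})
    return merged


templates = {
    "generic-host": {
        "name": "generic-host",
        "notification_period": "24x7",
        "notifications_enabled": "1",
    },
    "linux-server": {
        "name": "linux-server",
        "use": "generic-host",
        "notification_period": "workhours",
        "check_command": "check-host-alive",
    },
}

effective = resolve("linux-server", templates)
print(effective["notification_period"])    # workhours
print(effective["notifications_enabled"])  # 1
```

The child's notification_period replaces the inherited 24x7, which matches the note in the linux-server template about overriding the value from generic-host.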

---- In this file you may need to change contact_groups ----
[root@node1 objects]# cat templates.cfg
###############################################################################
# TEMPLATES.CFG - SAMPLE OBJECT TEMPLATES
#
# Last Modified: 10-03-2007
#
# NOTES: This config file provides you with some example object definition
#        templates that are refered by other host, service, contact, etc.
#        definitions in other config files.
#       
#        You don't need to keep these definitions in a separate file from your
#        other object definitions.  This has been done just to make things
#        easier to understand.
#
###############################################################################



###############################################################################
###############################################################################
#
# CONTACT TEMPLATES
#
###############################################################################
###############################################################################

# Generic contact definition template - This is NOT a real contact, just a template!

define contact{
        name                            generic-contact        ; The name of this contact template
        service_notification_period     24x7            ; service notifications can be sent anytime
        host_notification_period        24x7            ; host notifications can be sent anytime
        service_notification_options    w,u,c,r,f,s        ; send notifications for all service states, flapping events, and scheduled downtime events
        host_notification_options       d,u,r,f,s        ; send notifications for all host states, flapping events, and scheduled downtime events
        service_notification_commands   notify-service-by-email    ; send service notifications via email
        host_notification_commands      notify-host-by-email    ; send host notifications via email
        register                        0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL CONTACT, JUST A TEMPLATE!
        }




###############################################################################
###############################################################################
#
# HOST TEMPLATES
#
###############################################################################
###############################################################################

# Generic host definition template - This is NOT a real host, just a template!

define host{
        name                            generic-host    ; The name of this host template
        notifications_enabled           1           ; Host notifications are enabled
        event_handler_enabled           1           ; Host event handler is enabled
        flap_detection_enabled          1           ; Flap detection is enabled
        failure_prediction_enabled      1           ; Failure prediction is enabled
        process_perf_data               1           ; Process performance data
        retain_status_information       1           ; Retain status information across program restarts
        retain_nonstatus_information    1           ; Retain non-status information across program restarts
        notification_period             24x7        ; Send host notifications at any time
        register                        0           ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
        }


# Linux host definition template - This is NOT a real host, just a template!

define host{
        name                            linux-server    ; The name of this host template
        use                             generic-host    ; This template inherits other values from the generic-host template
        check_period                    24x7            ; By default, Linux hosts are checked round the clock
        check_interval                  1               ; Actively check the host every minute
        retry_interval                  1               ; Schedule host check retries at 1 minute intervals
        max_check_attempts              2               ; Check each Linux host 2 times (max)
        check_command                   check-host-alive ; Default command to check Linux hosts
        notification_period             workhours       ; Linux admins hate to be woken up, so we only notify during the day
                                                        ; Note that the notification_period variable is being overridden from
                                                        ; the value that is inherited from the generic-host template!
        notification_interval           120             ; Resend notifications every 2 hours
        notification_options            d,u,r           ; Only send notifications for specific host states
        contact_groups                  ts              ; Notifications get sent to the ts contact group
        notifications_enabled           1
        register                        0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
        }
---- linux-server3 and linux-server2 below are newly added ----
define host{
        name                            linux-server3    ; The name of this host template
        use                             generic-host    ; This template inherits other values from the generic-host template
        check_period                    24x7            ; By default, Linux hosts are checked round the clock
        check_interval                  1               ; Actively check the host every minute
        retry_interval                  1               ; Schedule host check retries at 1 minute intervals
        max_check_attempts              2               ; Check each Linux host 2 times (max)
        check_command                   check-host-alive ; Default command to check Linux hosts
        notification_period             workhours       ; Linux admins hate to be woken up, so we only notify during the day
                                                        ; Note that the notification_period variable is being overridden from
                                                        ; the value that is inherited from the generic-host template!
        notification_interval           120             ; Resend notifications every 2 hours
        notification_options            d,u,r           ; Only send notifications for specific host states
        contact_groups                  ts              ; Notifications get sent to the ts contact group
        notifications_enabled           1
        register                        0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
        }

define host{
        name                            linux-server2    ; The name of this host template
        use                             generic-host    ; This template inherits other values from the generic-host template
        check_period                    24x7            ; By default, Linux hosts are checked round the clock
        check_interval                  5               ; Actively check the host every 5 minutes
        retry_interval                  1               ; Schedule host check retries at 1 minute intervals
        max_check_attempts              10              ; Check each Linux host 10 times (max)
        check_command                   check-host-alive ; Default command to check Linux hosts
        notification_period             workhours       ; Linux admins hate to be woken up, so we only notify during the day
                                                        ; Note that the notification_period variable is being overridden from
                                                        ; the value that is inherited from the generic-host template!
        notification_interval           120             ; Resend notifications every 2 hours
        notification_options            d,u,r           ; Only send notifications for specific host states
        contact_groups                  ts              ; Notifications get sent to the admins by default
        notifications_enabled           1
        register                        0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
        }


# Windows host definition template - This is NOT a real host, just a template!

define host{
    name            windows-server    ; The name of this host template
    use            generic-host    ; Inherit default values from the generic-host template
    check_period        24x7        ; By default, Windows servers are monitored round the clock
    check_interval        5        ; Actively check the server every 5 minutes
    retry_interval        1        ; Schedule host check retries at 1 minute intervals
    max_check_attempts    10        ; Check each server 10 times (max)
    check_command        check-host-alive    ; Default command to check if servers are "alive"
    notification_period    24x7        ; Send notification out at any time - day or night
    notification_interval    30        ; Resend notifications every 30 minutes
    notification_options    d,r        ; Only send notifications for specific host states
    contact_groups        ts        ; Notifications get sent to the admins by default
    hostgroups        windows-servers ; Host groups that Windows servers should be a member of
    register        0        ; DONT REGISTER THIS - ITS JUST A TEMPLATE
    }


# We define a generic printer template that can be used for most printers we monitor

define host{
    name            generic-printer    ; The name of this host template
    use            generic-host    ; Inherit default values from the generic-host template
    check_period        24x7        ; By default, printers are monitored round the clock
    check_interval        5        ; Actively check the printer every 5 minutes
    retry_interval        1        ; Schedule host check retries at 1 minute intervals
    max_check_attempts    10        ; Check each printer 10 times (max)
    check_command        check-host-alive    ; Default command to check if printers are "alive"
    notification_period    workhours        ; Printers are only used during the workday
    notification_interval    30        ; Resend notifications every 30 minutes
    notification_options    d,r        ; Only send notifications for specific host states
    contact_groups        ts        ; Notifications get sent to the admins by default
    register        0        ; DONT REGISTER THIS - ITS JUST A TEMPLATE
    }


# Define a template for switches that we can reuse
define host{
    name            generic-switch    ; The name of this host template
    use            generic-host    ; Inherit default values from the generic-host template
    check_period        24x7        ; By default, switches are monitored round the clock
    check_interval        5        ; Switches are checked every 5 minutes
    retry_interval        1        ; Schedule host check retries at 1 minute intervals
    max_check_attempts    10        ; Check each switch 10 times (max)
    check_command        check-host-alive    ; Default command to check if routers are "alive"
    notification_period    24x7        ; Send notifications at any time
    notification_interval    30        ; Resend notifications every 30 minutes
    notification_options    d,r        ; Only send notifications for specific host states
    contact_groups        ts        ; Notifications get sent to the admins by default
    register        0        ; DONT REGISTER THIS - ITS JUST A TEMPLATE
    }




###############################################################################
###############################################################################
#
# SERVICE TEMPLATES
#
###############################################################################
###############################################################################

# Generic service definition template - This is NOT a real service, just a template!

define service{
        name                            generic-service     ; The 'name' of this service template
        active_checks_enabled           1               ; Active service checks are enabled
        passive_checks_enabled          1                   ; Passive service checks are enabled/accepted
        parallelize_check               1               ; Active service checks should be parallelized (disabling this can lead to major performance problems)
        obsess_over_service             1               ; We should obsess over this service (if necessary)
        check_freshness                 0               ; Default is to NOT check service 'freshness'
        notifications_enabled           1               ; Service notifications are enabled
        event_handler_enabled           1               ; Service event handler is enabled
        flap_detection_enabled          1               ; Flap detection is enabled
        failure_prediction_enabled      1               ; Failure prediction is enabled
        process_perf_data               1               ; Process performance data
        retain_status_information       1               ; Retain status information across program restarts
        retain_nonstatus_information    1               ; Retain non-status information across program restarts
        is_volatile                     0               ; The service is not volatile
        check_period                    24x7            ; The service can be checked at any time of the day
        max_check_attempts              3            ; Re-check the service up to 3 times in order to determine its final (hard) state
        normal_check_interval           10            ; Check the service every 10 minutes under normal conditions
        retry_check_interval            2            ; Re-check the service every two minutes until a hard state can be determined
        contact_groups                  ts            ; Notifications get sent out to everyone in the 'ts' group
        notification_options            w,u,c,r            ; Send notifications about warning, unknown, critical, and recovery events
        notification_interval           60            ; Re-notify about service problems every hour
        notification_period             24x7            ; Notifications can be sent out at any time
        register                        0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        }


# Local service definition template - This is NOT a real service, just a template!

define service{
        name                            local-service         ; The name of this service template
        use                             generic-service        ; Inherit default values from the generic-service definition
        max_check_attempts              4            ; Re-check the service up to 4 times in order to determine its final (hard) state
        normal_check_interval           5            ; Check the service every 5 minutes under normal conditions
        retry_check_interval            1            ; Re-check the service every minute until a hard state can be determined
        register                        0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        }

b. The resource.cfg file
resource.cfg is the Nagios variable definition file. Here it contains a single line:

[root@node1 etc]# cat resource.cfg 
$USER1$=/usr/local/nagios/libexec

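Here the variable $USER1$ specifies the path where the Nagios plugins are installed; if the plugins are installed elsewhere, only this line needs to change. Note that a variable must be defined before other configuration files can reference it.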
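The effect of defining $USER1$ once and referencing it elsewhere can be shown with a small substitution sketch. This is not Nagios's macro processor; the function names are hypothetical, the $ARG1$ to $ARG3$ values are illustrative, and the command line is the check_local_disk definition from the sample commands.cfg.

```python
import re


def parse_resource_lines(lines):
    """Parse `$USERn$=value` lines, skipping blank lines and # comments."""
    macros = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep and re.fullmatch(r"\$USER\d+\$", key.strip()):
            macros[key.strip()] = value.strip()
    return macros


def expand(command_line, macros):
    """Replace every known $NAME$ macro; unknown macros are left untouched."""
    return re.sub(
        r"\$[A-Z0-9_]+\$",
        lambda m: macros.get(m.group(0), m.group(0)),
        command_line,
    )


macros = parse_resource_lines(["# resource.cfg", "$USER1$=/usr/local/nagios/libexec"])
# Illustrative arguments that a service definition would supply.
macros.update({"$ARG1$": "20%", "$ARG2$": "10%", "$ARG3$": "/"})

expanded = expand("$USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$", macros)
print(expanded)  # /usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /
```

If the plugins move to another directory, only the $USER1$ line changes and every command definition that references it follows automatically.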

c. The commands.cfg file

This file exists by default and can be used without modification. When a new command is needed, add it to this file.

[root@node1 etc]# cat objects/commands.cfg 
###############################################################################
# COMMANDS.CFG - SAMPLE COMMAND DEFINITIONS FOR NAGIOS 3.3.1
#
# Last Modified: 05-31-2007
#
# NOTES: This config file provides you with some example command definitions
#        that you can reference in host, service, and contact definitions.
#       
#        You don't need to keep commands in a separate file from your other
#        object definitions.  This has been done just to make things easier to
#        understand.
#
###############################################################################


################################################################################
#
# SAMPLE NOTIFICATION COMMANDS
#
# These are some example notification commands.  They may or may not work on
# your system without modification.  As an example, some systems will require 
# you to use "/usr/bin/mailx" instead of "/usr/bin/mail" in the commands below.
#
################################################################################


# 'notify-host-by-email' command definition
define command{
    command_name    notify-host-by-email
    command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" |/bin/mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" $CONTACTEMAIL$
    }

# 'notify-service-by-email' command definition
define command{
    command_name    notify-service-by-email
    command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$\n" |/bin/mail -s "** $NOTIFICATIONTYPE$ Service Alert: $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" $CONTACTEMAIL$
    }





################################################################################
#
# SAMPLE HOST CHECK COMMANDS
#
################################################################################


# This command checks to see if a host is "alive" by pinging it
# The check must result in a 100% packet loss or 5 second (5000ms) round trip 
# average time to produce a critical error.
# Note: Five ICMP echo packets are sent (determined by the '-p 5' argument)

# 'check-host-alive' command definition
define command{
        command_name    check-host-alive
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
        }




################################################################################
#
# SAMPLE SERVICE CHECK COMMANDS
#
# These are some example service check commands.  They may or may not work on
# your system, as they must be modified for your plugins.  See the HTML 
# documentation on the plugins for examples of how to configure command definitions.
#
# NOTE:  The following 'check_local_...' functions are designed to monitor
#        various metrics on the host that Nagios is running on (i.e. this one).
################################################################################

# 'check_local_disk' command definition
define command{
        command_name    check_local_disk
        command_line    $USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
        }


# 'check_local_load' command definition
define command{
        command_name    check_local_load
        command_line    $USER1$/check_load -w $ARG1$ -c $ARG2$
        }


# 'check_local_procs' command definition
define command{
        command_name    check_local_procs
        command_line    $USER1$/check_procs -w $ARG1$ -c $ARG2$ -s $ARG3$
        }


# 'check_local_users' command definition
define command{
        command_name    check_local_users
        command_line    $USER1$/check_users -w $ARG1$ -c $ARG2$
        }


# 'check_local_swap' command definition
define command{
    command_name    check_local_swap
    command_line    $USER1$/check_swap -w $ARG1$ -c $ARG2$
    }


# 'check_local_mrtgtraf' command definition
define command{
    command_name    check_local_mrtgtraf
    command_line    $USER1$/check_mrtgtraf -F $ARG1$ -a $ARG2$ -w $ARG3$ -c $ARG4$ -e $ARG5$
    }


################################################################################
# NOTE:  The following 'check_...' commands are used to monitor services on
#        both local and remote hosts.
################################################################################

# 'check_ftp' command definition
define command{
        command_name    check_ftp
        command_line    $USER1$/check_ftp -H $HOSTADDRESS$ $ARG1$
        }


# 'check_hpjd' command definition
define command{
        command_name    check_hpjd
        command_line    $USER1$/check_hpjd -H $HOSTADDRESS$ $ARG1$
        }


# 'check_snmp' command definition
define command{
        command_name    check_snmp
        command_line    $USER1$/check_snmp -H $HOSTADDRESS$ $ARG1$
        }


# 'check_http' command definition
define command{
        command_name    check_http
        command_line    $USER1$/check_http -I $HOSTADDRESS$ $ARG1$
        }


# 'check_ssh' command definition
define command{
    command_name    check_ssh
    command_line    $USER1$/check_ssh $ARG1$ $HOSTADDRESS$
    }


# 'check_dhcp' command definition
define command{
    command_name    check_dhcp
    command_line    $USER1$/check_dhcp $ARG1$
    }


# 'check_ping' command definition
define command{
        command_name    check_ping
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5
        }


# 'check_pop' command definition
define command{
        command_name    check_pop
        command_line    $USER1$/check_pop -H $HOSTADDRESS$ $ARG1$
        }


# 'check_imap' command definition
define command{
        command_name    check_imap
        command_line    $USER1$/check_imap -H $HOSTADDRESS$ $ARG1$
        }


# 'check_smtp' command definition
define command{
        command_name    check_smtp
        command_line    $USER1$/check_smtp -H $HOSTADDRESS$ $ARG1$
        }


# 'check_tcp' command definition
define command{
    command_name    check_tcp
    command_line    $USER1$/check_tcp -H $HOSTADDRESS$ -p $ARG1$ $ARG2$
    }


# 'check_udp' command definition
define command{
    command_name    check_udp
    command_line    $USER1$/check_udp -H $HOSTADDRESS$ -p $ARG1$ $ARG2$
    }


# 'check_nt' command definition
define command{
    command_name    check_nt
    command_line    $USER1$/check_nt -H $HOSTADDRESS$ -p 12489 -v $ARG1$ $ARG2$
    }



################################################################################
#
# SAMPLE PERFORMANCE DATA COMMANDS
#
# These are sample performance data commands that can be used to send performance
# data output to two text files (one for hosts, another for services).  If you
# plan on simply writing performance data out to a file, consider using the 
# host_perfdata_file and service_perfdata_file options in the main config file.
#
################################################################################


# 'process-host-perfdata' command definition
define command{
    command_name    process-host-perfdata
    command_line    /usr/bin/printf "%b" "$LASTHOSTCHECK$\t$HOSTNAME$\t$HOSTSTATE$\t$HOSTATTEMPT$\t$HOSTSTATETYPE$\t$HOSTEXECUTIONTIME$\t$HOSTOUTPUT$\t$HOSTPERFDATA$\n" >> /usr/local/nagios/var/host-perfdata.out
    }


# 'process-service-perfdata' command definition
define command{
    command_name    process-service-perfdata
    command_line    /usr/bin/printf "%b" "$LASTSERVICECHECK$\t$HOSTNAME$\t$SERVICEDESC$\t$SERVICESTATE$\t$SERVICEATTEMPT$\t$SERVICESTATETYPE$\t$SERVICEEXECUTIONTIME$\t$SERVICELATENCY$\t$SERVICEOUTPUT$\t$SERVICEPERFDATA$\n" >> /usr/local/nagios/var/service-perfdata.out
    }

# 'check_nrpe' command definition
define command{
        command_name    check_nrpe
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
        }

# The following three commands were added for this setup
define command{
        command_name    check_jps
        command_line    /usr/local/nagios/libexec/check_jps $ARG1$ $ARG2$
        }

define command{
        command_name    check_zhulh
        command_line    /usr/local/nagios/libexec/check_zhulh $ARG1$ $ARG2$
        }

define command{
        command_name    check_jps2
        command_line    /usr/local/nagios/libexec/check_jps2 $ARG1$ $ARG2$
        }
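
The three added commands call custom plugins whose source is not shown in this article. As a rough illustration of what such a plugin has to do, here is a minimal sketch, assuming the plugin only needs to find a process name in a process listing. The function name check_process_sketch and the optional third argument are hypothetical; the exit codes are the standard plugin return values (0 OK, 2 CRITICAL, 3 UNKNOWN).

```shell
#!/bin/sh
# Hypothetical stand-in for a check_jps-style plugin (the article's real
# script is not shown). It reports through the plugin exit codes:
# 0 = OK, 2 = CRITICAL, 3 = UNKNOWN.
check_process_sketch() {
    name=$1    # process name to look for, e.g. DataNode ($ARG1$)
    host=$2    # label used only in the status text ($ARG2$)
    if [ -z "$name" ] || [ -z "$host" ]; then
        echo "UNKNOWN - usage: check_process_sketch <process> <host> [listing]"
        return 3
    fi
    # Optional third argument: a process listing to search instead of `ps`.
    if [ "$#" -ge 3 ]; then
        listing=$3
    else
        listing=$(ps -e -o args)
    fi
    # Fixed-string match of the process name against the listing.
    if printf '%s\n' "$listing" | grep -qF -- "$name"; then
        echo "OK - $name is running on $host"
        return 0
    fi
    echo "CRITICAL - $name is not running on $host"
    return 2
}
```

In a standalone plugin file the last lines would be something like `check_process_sketch "$@"` followed by `exit $?`, so that Nagios sees the return value as the plugin's exit status.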

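d. The hosts.cfg file

This file does not exist by default and must be created by hand. hosts.cfg specifies the addresses of the monitored hosts and their related attributes. For this setup it is configured as follows: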

[root@node1 etc]# cat hosts.cfg 
define host{
        use                     linux-server2
        host_name               node2
        alias                   Nagios-node2
        address                 192.168.11.167
        }
define host{
        use                     linux-server3
        host_name               node3
        alias                   Nagios-node3
        address                 192.168.11.166
        }
define hostgroup{      
        hostgroup_name          bsmart-servers      
        alias                   bsmart servers        
        members                 node2,node3
        }

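Note: /usr/local/nagios/etc/objects contains two configuration files by default, localhost.cfg and windows.cfg. localhost.cfg defines the monitoring of the monitoring host itself, and windows.cfg defines Windows hosts; both contain the host definitions and the related service definitions. Adjust them to your needs. The details are as follows: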

localhost.cfg

[root@node1 etc]# cat objects/localhost.cfg 
###############################################################################
# LOCALHOST.CFG - SAMPLE OBJECT CONFIG FILE FOR MONITORING THIS MACHINE
#
# Last Modified: 05-31-2007
#
# NOTE: This config file is intended to serve as an *extremely* simple 
#       example of how you can create configuration entries to monitor
#       the local (Linux) machine.
#
###############################################################################




###############################################################################
###############################################################################
#
# HOST DEFINITION
#
###############################################################################
###############################################################################

# Define a host for the local machine

define host{
        use                     linux-server            ; Name of host template to use
                                                        ; This host definition will inherit all variables that are defined
                                                        ; in (or inherited by) the linux-server host template definition.
        host_name               node1
        alias                   node1
        address                 192.168.11.164
        }



###############################################################################
###############################################################################
#
# HOST GROUP DEFINITION
#
###############################################################################
###############################################################################

# Define an optional hostgroup for Linux machines

define hostgroup{
        hostgroup_name  linux-servers ; The name of the hostgroup
        alias           Linux Servers ; Long name of the group
        members         node1     ; Comma separated list of hosts that belong to this group
        }



###############################################################################
###############################################################################
#
# SERVICE DEFINITIONS
#
###############################################################################
###############################################################################


# Define a service to "ping" the local machine

define service{
        use                             local-service         ; Name of service template to use
        host_name                       node1
        service_description             PING
        check_command                   check_ping!100.0,20%!500.0,60%
        }


# Define a service to check the disk space of the root partition
# on the local machine.  Warning if < 20% free, critical if
# < 10% free space on partition.

define service{
        use                             local-service         ; Name of service template to use
        host_name                       node1
        service_description             Root Partition
        check_command                   check_local_disk!20%!10%!/
        }



# Define a service to check the number of currently logged in
# users on the local machine.  Warning if > 20 users, critical
# if > 50 users.

define service{
        use                             local-service         ; Name of service template to use
        host_name                       node1
        service_description             Current Users
        check_command                   check_local_users!20!50
        }


# Define a service to check the number of currently running procs
# on the local machine.  Warning if > 250 processes, critical if
# > 400 processes.

define service{
        use                             local-service         ; Name of service template to use
        host_name                       node1
        service_description             Total Processes
        check_command                   check_local_procs!250!400!RSZDT
        }



# Define a service to check the load on the local machine. 

define service{
        use                             local-service         ; Name of service template to use
        host_name                       node1
        service_description             Current Load
        check_command                   check_local_load!5.0,4.0,3.0!10.0,6.0,4.0
        }



# Define a service to check the swap usage on the local machine.
# Critical if less than 10% of swap is free, warning if less than 20% is free

define service{
        use                             local-service         ; Name of service template to use
        host_name                       node1
        service_description             Swap Usage
        check_command                   check_local_swap!20!10
        }



# Define a service to check SSH on the local machine.
# Notifications are enabled here; the stock sample disables them by default,
# as not all users may have SSH enabled.

define service{
        use                             local-service         ; Name of service template to use
        host_name                       node1
        service_description             SSH
        check_command                   check_ssh
        notifications_enabled           1
        }



# Define a service to check HTTP on the local machine.
# Notifications are enabled here; the stock sample disables them by default,
# as not all users may have HTTP enabled.

define service{
        use                             local-service         ; Name of service template to use
        host_name                       node1
        service_description             HTTP
        check_command                   check_http
        notifications_enabled           1
        }

define service{
        use                             local-service         ; Name of service template to use
        host_name                       node1
        service_description             dns on node1
        check_command                   check_jps!dns!node1
        notifications_enabled           1
        }

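windows.cfg is omitted here.

e. The services.cfg file

This file does not exist by default either and must be created by hand. services.cfg defines the services and host resources to monitor, for example the http service, the ftp service, host disk space, host system load and so on.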

[root@node1 etc]# cat services.cfg 

define service{
        use                     local-service
        host_name               node3
        service_description     check-host-alive
        check_command           check-host-alive
        }

define service{
        use                             local-service         ; Name of service template to use
        host_name                       node3
        service_description             datanode on node3
        check_command                   check_jps2!DataNode!node3
        notifications_enabled           1
        }

define service{
        use                     local-service
        host_name               node2
        service_description     check-host-alive
        check_command           check-host-alive
        }

define service{
        use                             local-service         ; Name of service template to use
        host_name                       node2
        service_description             datanode on node2
        check_command                   check_jps2!DataNode!node2
        notifications_enabled           1
        }


define service{
        use                             local-service
        host_name                       node2
        service_description             mysql
        check_command                   check_nrpe!check_mysql
        notifications_enabled           1
        check_interval                  1               ; Actively check the service every minute
        retry_interval                  1               ; Schedule service check retries at 1 minute intervals
        max_check_attempts              2
        }

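f. The contacts.cfg file

contacts.cfg defines contacts and contact groups. When a monitored host or service fails, Nagios sends the information through the configured notification method (e-mail or SMS) to the contacts specified here.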

[root@node1 etc]# cat contacts.cfg 
define contact{
        contact_name                    David
        use                             generic-contact
        alias                           Nagios Admin
        email                           zlh200868@gmail.com
        }
define contact{
        contact_name                    Jack
        use                             generic-contact
        alias                           Nagios Admin2
        email                           zlh10@163.com
        }

define contactgroup{
        contactgroup_name       ts                             
        alias                   Technical Support               
        members                 David,Jack                 
        }

g. The timeperiods.cfg file

This file is mainly used to define the monitoring time periods. Below is a configured example:

[root@node1 etc]# cat timeperiods.cfg 

define timeperiod{  
        timeperiod_name 24x7  
        alias           24 Hours A Day, 7 Days A Week  
        sunday          00:00-24:00  
        monday          00:00-24:00  
        tuesday         00:00-24:00  
        wednesday       00:00-24:00  
        thursday        00:00-24:00  
        friday          00:00-24:00  
        saturday        00:00-24:00  
        }
define timeperiod{  
        timeperiod_name workhours   
        alias           Normal Work Hours  
        monday          09:00-17:00  
        tuesday         09:00-17:00  
        wednesday       09:00-17:00  
        thursday        09:00-17:00  
        friday          09:00-17:00  
        }

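h. The cgi.cfg file

This file controls the CGI scripts. To run CGI actions from the Nagios web interface, such as restarting the Nagios process, disabling Nagios notifications or stopping host checks, cgi.cfg must be configured. Because the Nagios web interface authenticates as the user david, it is enough to grant this user the relevant permissions in cgi.cfg. The settings that need to be changed are: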

default_user_name=david
authorized_for_system_information=nagiosadmin,david  
authorized_for_configuration_information=nagiosadmin,david  
authorized_for_system_commands=david
authorized_for_all_services=nagiosadmin,david  
authorized_for_all_hosts=nagiosadmin,david
authorized_for_all_service_commands=nagiosadmin,david  
authorized_for_all_host_commands=nagiosadmin,david
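
Editing each authorized_for_* line by hand is easy to get wrong, so the change can also be previewed with a small filter. The following is only a sketch: the function name grant_cgi_user is made up for this example, and it assumes every permission line has the form authorized_for_<something>=<comma-separated users>, as in the excerpt above. It prints the modified file to standard output and does not touch the original.

```shell
# Print cgi.cfg with `user` appended to every authorized_for_* list that
# does not already contain it. Other lines are passed through unchanged.
grant_cgi_user() {
    cfg=$1
    user=$2
    awk -v u="$user" '
        /^authorized_for_[a-z_]+=/ {
            sub(/[ \t\r]+$/, "")              # drop trailing whitespace
            eq = index($0, "=")
            key = substr($0, 1, eq - 1)
            value = substr($0, eq + 1)
            n = split(value, names, ",")
            present = 0
            for (i = 1; i <= n; i++)
                if (names[i] == u) present = 1
            if (!present)
                $0 = key "=" (value == "" ? u : value "," u)
        }
        { print }
    ' "$cfg"
}
```

A possible use is `grant_cgi_user /usr/local/nagios/etc/cgi.cfg david > /tmp/cgi.cfg.new`, followed by a diff against the original before copying the result into place.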

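i. The nagios.cfg file

nagios.cfg, located by default at /usr/local/nagios/etc/nagios.cfg, is the core Nagios configuration file. Every object configuration file must be declared in it to take effect; here it is enough to reference the object configuration files from nagios.cfg.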

# You can specify individual object config files as shown below:
cfg_file=/usr/local/nagios/etc/objects/commands.cfg
#cfg_file=/usr/local/nagios/etc/objects/contacts.cfg
#cfg_file=/usr/local/nagios/etc/objects/timeperiods.cfg
cfg_file=/usr/local/nagios/etc/objects/templates.cfg


# Definitions for monitoring the local (Linux) host
cfg_file=/usr/local/nagios/etc/objects/localhost.cfg
#cfg_file=/usr/local/nagios/etc/contactgroups.cfg
cfg_file=/usr/local/nagios/etc/contacts.cfg
#cfg_file=/usr/local/nagios/etc/hostgroups.cfg
cfg_file=/usr/local/nagios/etc/hosts.cfg
cfg_file=/usr/local/nagios/etc/services.cfg
cfg_file=/usr/local/nagios/etc/timeperiods.cfg

# Definitions for monitoring a Windows machine
#cfg_file=/usr/local/nagios/etc/objects/windows.cfg

# Definitions for monitoring a router/switch
#cfg_file=/usr/local/nagios/etc/objects/switch.cfg

status_update_interval=10

nagios_user=nagios
nagios_group=nagios

check_external_commands=0

command_check_interval=10s

interval_length=60
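
Since hosts.cfg and services.cfg are created by hand, a cfg_file entry can easily point at a file that does not exist yet. A quick way to spot this before running the full verification is sketched below; the function name missing_cfg_files is hypothetical, and only uncommented cfg_file= lines are considered (cfg_dir entries are ignored).

```shell
# List every file referenced by an active cfg_file= line in the main
# configuration file that does not exist on disk.
missing_cfg_files() {
    main_cfg=$1
    # sed prints only the path part of uncommented cfg_file= lines.
    sed -n 's/^cfg_file=//p' "$main_cfg" | while IFS= read -r path; do
        [ -f "$path" ] || printf '%s\n' "$path"
    done
}
```

For example, `missing_cfg_files /usr/local/nagios/etc/nagios.cfg` prints nothing when every referenced file is present.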

7.4 Verifying the Nagios configuration files

Nagios handles configuration verification very well; a single command does the whole job:

[root@node1 etc]# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
Total Warnings: 0
Total Errors:   0

Things look okay - No serious problems were detected during the pre-flight check

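This verification feature is very useful: error messages usually name the faulty configuration file and the line within it, which makes configuring Nagios much easier. Warnings can usually be ignored, since they are generally only advisory.
Output like the above means the configuration is fine, and the Nagios service can then be started.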
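
The "verify first, then start" habit can be wrapped in a small helper so that a broken configuration never reaches the running daemon. This is a sketch under assumptions: the function name verify_then and the NAGIOS_BIN / NAGIOS_CFG override variables are invented for this example; the defaults are the paths used in this article.

```shell
# Run the pre-flight check and execute the given command only if it passes.
verify_then() {
    nagios_bin=${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}
    main_cfg=${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}
    if "$nagios_bin" -v "$main_cfg" >/dev/null 2>&1; then
        "$@"
    else
        echo "pre-flight check failed; not running: $*" >&2
        return 1
    fi
}
```

Typical use would be `verify_then service nagios start`. When the check fails, running the `nagios -v` command directly shows the detailed error messages.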

8. Starting and stopping Nagios

8.1 Starting Nagios

# service nagios start

8.2 Starting Nagios manually
Start the Nagios daemon with the "-d" option of the nagios command:

# /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg

8.3 Stopping Nagios manually

# kill <nagios_pid>
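
The PID for the kill command has to come from somewhere. One option, sketched below, is to read it from the lock file that the daemon writes. The path /usr/local/nagios/var/nagios.lock is an assumption (it is controlled by the lock_file setting in nagios.cfg, so check that setting first), and the function name nagios_pid is made up for this example.

```shell
# Print the PID stored in the Nagios lock file, or fail if the file is
# unreadable or does not contain a plain number.
nagios_pid() {
    lock_file=${1:-/usr/local/nagios/var/nagios.lock}   # assumed default path
    [ -r "$lock_file" ] || return 1
    pid=$(tr -d '[:space:]' < "$lock_file")
    case $pid in
        ''|*[!0-9]*) return 1 ;;    # empty or not purely digits
    esac
    printf '%s\n' "$pid"
}
```

With that in place, stopping the daemon becomes `kill "$(nagios_pid)"`.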

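9. Using NRPE to monitor "local information" on remote Linux hosts

The sections above already monitor whether the remote Linux hosts are alive, which can be checked with ping. Other services on a remote host, such as ftp, ssh and http, are exposed to the network, so even without Nagios they can be tested from any machine simply by trying to reach them. "Local information" such as disk capacity and CPU load is different: on its own, Nagios can only measure these on the host it runs on, because reading them on another machine requires appropriate access to that machine. To solve this problem Nagios has an add-on called NRPE, which makes it possible to monitor the "local information" of Linux hosts.

9.1 How NRPE works

[Figure: how NRPE works]

NRPE consists of two parts:

    •the check_nrpe plugin, which lives on the monitoring host
    •the NRPE daemon, which runs on the remote Linux host (usually the monitored machine)

As the figure above shows, the monitoring process is as follows.

When Nagios needs to monitor a service or resource on a remote Linux host:

Nagios runs the check_nrpe plugin and tells it what to check;

the check_nrpe plugin connects to the remote NRPE daemon over SSL;

the NRPE daemon runs the corresponding Nagios plugin to perform the check;

the NRPE daemon returns the result of the check to the check_nrpe plugin, which hands it to Nagios for processing.

Note: the NRPE daemon requires the Nagios plugins to be installed on the remote Linux host; without them the daemon cannot perform any checks.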
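
The step in which the daemon "runs the corresponding plugin" is driven by nrpe.cfg on the monitored host, where each allowed check is declared as a line of the form command[<name>]=<command line>. The sketch below imitates only that lookup, to make the mapping concrete; the function name nrpe_command_line is hypothetical and the real daemon of course does much more (access control, SSL, timeouts).

```shell
# Print the command line that nrpe.cfg associates with a check name.
# Returns 1 when the name is not defined, which is the case in which
# check_nrpe would report the command as not defined.
nrpe_command_line() {
    cfg=$1
    name=$2
    prefix="command[$name]="
    while IFS= read -r line; do
        case $line in
            # The quoted prefix is matched literally, brackets included.
            "$prefix"*) printf '%s\n' "${line#"$prefix"}"; return 0 ;;
        esac
    done < "$cfg"
    return 1
}
```

For instance, `nrpe_command_line /usr/local/nagios/etc/nrpe.cfg check_load` shows which plugin and thresholds a check_nrpe request for check_load would trigger on that host.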

9.2 On the monitored hosts (node2, node3)

a. Add the user and set its password

# useradd nagios

# passwd nagios

b. Install the Nagios plugins

# tar zxvf nagios-plugins-1.4.16.tar.gz
# cd nagios-plugins-1.4.16
# ./configure --prefix=/usr/local/nagios
# make && make install

After this step, three directories (include, libexec and share) are created under /usr/local/nagios/.

Fix the directory ownership:

# chown nagios:nagios /usr/local/nagios
# chown -R nagios:nagios /usr/local/nagios/libexec

c. Install NRPE

# wget http://prdownloads.sourceforge.net/sourceforge/nagios/nrpe-2.13.tar.gz
# tar zxvf nrpe-2.13.tar.gz
# cd nrpe-2.13
# ./configure
*** Configuration summary for nrpe 2.13 11-11-2011 ***:

 General Options:
-------------------------
 NRPE port:    5666
 NRPE user:    nagios
 NRPE group:   nagios
 Nagios user:  nagios
 Nagios group: nagios


Review the options above for accuracy.  If they look okay,
type 'make all' to compile the NRPE daemon and client.

[root@node2 nrpe-2.13]# make all
cd ./src/; make ; cd ..
make[1]: Entering directory `/app/nrpe-2.13/src'
gcc -g -O2 -I/usr/include/openssl -I/usr/include -DHAVE_CONFIG_H -o nrpe nrpe.c utils.c acl.c -L/usr/lib  -lssl -lcrypto -lnsl -lwrap  
gcc -g -O2 -I/usr/include/openssl -I/usr/include -DHAVE_CONFIG_H -o check_nrpe check_nrpe.c utils.c -L/usr/lib  -lssl -lcrypto -lnsl 
make[1]: Leaving directory `/app/nrpe-2.13/src'

*** Compile finished ***

If the NRPE daemon and client compiled without any errors, you
can continue with the installation or upgrade process.

Read the PDF documentation (NRPE.pdf) for information on the next
steps you should take to complete the installation or upgrade.

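Next, install the NRPE plugin, the daemon and the sample configuration file.

c.1 Install check_nrpe

Only the monitoring host needs the check_nrpe plugin; the monitored hosts do not. It is installed here purely for testing.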

[root@node2 nrpe-2.13]# make install-plugin
cd ./src/ && make install-plugin
make[1]: Entering directory `/app/nrpe-2.13/src'
/usr/bin/install -c -m 775 -o nagios -g nagios -d /usr/local/nagios/libexec
/usr/bin/install -c -m 775 -o nagios -g nagios check_nrpe /usr/local/nagios/libexec
make[1]: Leaving directory `/app/nrpe-2.13/src'

c.2 Install the daemon

[root@node2 nrpe-2.13]# make install-daemon
cd ./src/ && make install-daemon
make[1]: Entering directory `/app/nrpe-2.13/src'
/usr/bin/install -c -m 775 -o nagios -g nagios -d /usr/local/nagios/bin
/usr/bin/install -c -m 775 -o nagios -g nagios nrpe /usr/local/nagios/bin
make[1]: Leaving directory `/app/nrpe-2.13/src'

c.3 Install the configuration file

[root@node2 nrpe-2.13]# make install-daemon-config
/usr/bin/install -c -m 775 -o nagios -g nagios -d /usr/local/nagios/etc
/usr/bin/install -c -m 644 -o nagios -g nagios sample-config/nrpe.cfg /usr/local/nagios/etc

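Following the installation documentation, the NRPE daemon is run as a service under xinetd. xinetd must therefore be installed first, although most systems ship with it by default.

d. Install the xinetd script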

[root@node2 nrpe-2.13]# make install-xinetd
/usr/bin/install -c -m 644 sample-config/nrpe.xinetd /etc/xinetd.d/nrpe

This creates the file /etc/xinetd.d/nrpe.

Edit this script:

[root@node2 ~]# cat /etc/xinetd.d/nrpe 
# default: on
# description: NRPE (Nagios Remote Plugin Executor)
service nrpe
{
        flags           = REUSE
        socket_type     = stream
        port            = 5666
        wait            = no
        user            = nagios
        group           = nagios
        server          = /usr/local/nagios/bin/nrpe
        server_args     = -c /usr/local/nagios/etc/nrpe.cfg --inetd
        log_on_failure  += USERID
        disable         = no
        only_from       = 192.168.11.164 127.0.0.1
}

Add the IP address of the monitoring host after only_from.

Edit /etc/services and add the NRPE service:

[root@node2 ~]# tail -n 4 /etc/services 
iqobject    48619/tcp            # iqobject
iqobject    48619/udp            # iqobject
# Local services
nrpe            5666/tcp                        #nrpe

Restart the xinetd service:

[root@node2 ~]# service xinetd restart
Stopping xinetd:                                           [  OK  ]
Starting xinetd:                                           [  OK  ]

Check whether NRPE has started:

[root@node2 ~]# netstat -an|grep 5666
tcp        0      0 0.0.0.0:5666                0.0.0.0:*                   LISTEN

Port 5666 is now listening.

e. Test that NRPE works

Use the check_nrpe plugin installed on the monitored host above to test whether NRPE is working:

# /usr/local/nagios/libexec/check_nrpe -H localhost

It returns the current NRPE version:

[root@node2 ~]# /usr/local/nagios/libexec/check_nrpe -H localhost
NRPE v2.13

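In other words, connecting to the NRPE daemon locally with check_nrpe works.

Note: for the following steps to go smoothly, make sure the local firewall opens port 5666 so that the external monitoring host can reach it.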
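
The article does not show how the port is opened. On a host that uses iptables (an assumption; the firewall in use is not stated), a rule that accepts TCP connections to 5666 from the monitoring host only would look like the output of the sketch below. The helper nrpe_firewall_rule is hypothetical and deliberately only prints the command, so that it can be reviewed before being run as root.

```shell
# Print (not execute) an iptables rule that lets one monitoring host reach
# the NRPE port. Arguments: monitoring host IP, optional port (default 5666).
nrpe_firewall_rule() {
    monitor_ip=$1
    port=${2:-5666}
    printf 'iptables -I INPUT -p tcp -s %s --dport %s -j ACCEPT\n' "$monitor_ip" "$port"
}

# Example for the monitoring host used in this article:
nrpe_firewall_rule 192.168.11.164
```

Restricting the source address mirrors the only_from setting in the xinetd configuration, so both layers admit the same host.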

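9.3 On the monitoring host (node1)

Nagios is already running, so what remains is to:

    •install the check_nrpe plugin;
    •create the check_nrpe command definition in commands.cfg, because only commands defined in commands.cfg can be used in services.cfg;
    •create the monitoring items for the monitored hosts.

9.3.1 Install the check_nrpe plugin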

# tar zxvf nrpe-2.13.tar.gz 
# cd nrpe-2.13
# ./configure
# make all
# make install-plugin

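Only the install-plugin step needs to be run here, because the monitoring host needs nothing but the check_nrpe plugin.

NRPE is already installed on node2 and node3, so now test the communication between check_nrpe on the monitoring host and the NRPE daemon running on a monitored host.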

[root@node1 etc]# /usr/local/nagios/libexec/check_nrpe -H 192.168.11.167
NRPE v2.13

The NRPE version is returned correctly, which shows that everything is working.
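
The same test can be repeated for every monitored host in one go. The loop below is a sketch: the function name probe_nrpe and the CHECK_NRPE override variable are invented for this example, and the only check_nrpe option used is -H, as in the command above.

```shell
# Run check_nrpe against each host given as an argument and report the reply.
# Returns 1 if at least one host did not answer successfully.
probe_nrpe() {
    check_nrpe=${CHECK_NRPE:-/usr/local/nagios/libexec/check_nrpe}
    failed=0
    for host in "$@"; do
        if reply=$("$check_nrpe" -H "$host" 2>&1); then
            printf '%s: %s\n' "$host" "$reply"
        else
            printf '%s: FAILED (%s)\n' "$host" "$reply"
            failed=1
        fi
    done
    return "$failed"
}
```

For the hosts in this article the call would be `probe_nrpe 192.168.11.167 192.168.11.166`.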

9.3.2 Add the check_nrpe definition to commands.cfg

[root@node1 etc]# cat objects/commands.cfg

# 'check_nrpe' command definition
define command{
        command_name    check_nrpe
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
        }

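The $ARG1$ parameter after -c is the check command that is passed to the NRPE daemon for execution. As mentioned earlier, it must be one of the commands defined in nrpe.cfg on the monitored host (five in the setup described earlier); a command such as check_mysql, used below, must be defined there as well. When check_nrpe is used in services.cfg, this parameter is attached with "!".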
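
To make the "!" convention concrete, the sketch below splits a check_command value into the command name and the numbered arguments that end up in $ARG1$, $ARG2$ and so on. It is an illustration only: the function name expand_check_command is made up, and escaped separators are not handled.

```shell
# Split a check_command value on "!" and print the command name followed by
# the ARGn values. The body runs in a subshell so that the IFS and globbing
# changes do not leak into the caller.
expand_check_command() (
    spec=$1
    [ -n "$spec" ] || exit 1
    IFS='!'
    set -f              # no filename expansion while splitting
    set -- $spec        # intentional word splitting on "!"
    printf 'command_name=%s\n' "$1"
    shift
    i=1
    for arg in "$@"; do
        printf 'ARG%d=%s\n' "$i" "$arg"
        i=$((i + 1))
    done
)
```

Calling `expand_check_command 'check_nrpe!check_mysql'` prints command_name=check_nrpe and ARG1=check_mysql, which is the value the check_nrpe command_line passes after -c.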

9.3.3 Define the monitoring of the Nagios-Linux hosts

The monitoring of the Nagios-Linux hosts can now be defined in services.cfg.

[root@node1 etc]# cat services.cfg 

define service{
        use                     local-service
        host_name               node3
        service_description     check-host-alive
        check_command           check-host-alive
        }

define service{
        use                             local-service         ; Name of service template to use
        host_name                       node3
        service_description             datanode on node3
        check_command                   check_jps2!DataNode!node3
        notifications_enabled           1
        }

define service{
        use                     local-service
        host_name               node2
        service_description     check-host-alive
        check_command           check-host-alive
        }

define service{
        use                             local-service         ; Name of service template to use
        host_name                       node2
        service_description             datanode on node2
        check_command                   check_jps2!DataNode!node2
        notifications_enabled           1
        }


define service{
        use                             local-service
        host_name                       node2
        service_description             mysql
        check_command                   check_nrpe!check_mysql
        notifications_enabled           1
        check_interval                  1               ; Actively check the service every minute
        retry_interval                  1               ; Schedule service check retries at 1 minute intervals
        max_check_attempts              2
        }

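9.3.4 View the resulting configuration:

[Figure: screenshot of the resulting configuration]

10. Configuring Nagios e-mail alerts

10.1 Install the sendmail components

First make sure that the sendmail components are fully installed. They can be installed with the following command: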

# yum install -y sendmail*

Then restart the sendmail service:

# service sendmail restart

Then send a test e-mail to verify that sendmail works:

# echo "Hello World" | mail zlh10@163.com

10.2 Configure e-mail alerts

The file /usr/local/nagios/etc/contacts.cfg was already briefly configured above; Nagios sends alert e-mails to the e-mail addresses in that configuration file.
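
What actually gets mailed is produced by the notify-host-by-email command in commands.cfg, which feeds a printf "%b" template into mail. The sketch below reproduces a shortened version of that template by hand (the Date/Time line is left out) so that the message format can be tested without waiting for a real alert. The function names render_host_alert and render_host_subject are hypothetical, and the positional arguments stand in for the macros that Nagios substitutes itself.

```shell
# Build the body of a host alert. Arguments stand in for $NOTIFICATIONTYPE$,
# $HOSTNAME$, $HOSTSTATE$, $HOSTADDRESS$ and $HOSTOUTPUT$.
render_host_alert() {
    # %b expands the \n escapes, as the command_line in commands.cfg does.
    printf '%b' "***** Nagios *****\n\nNotification Type: $1\nHost: $2\nState: $3\nAddress: $4\nInfo: $5\n"
}

# Build the matching subject line from type, host name and host state.
render_host_subject() {
    printf '%s\n' "** $1 Host Alert: $2 is $3 **"
}
```

A manual test message could then be sent with `render_host_alert PROBLEM node2 DOWN 192.168.11.167 "test" | mail -s "$(render_host_subject PROBLEM node2 DOWN)" <address>`, using one of the addresses from contacts.cfg.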

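10.3 Nagios notifications

[Figure: Nagios notifications]

11. Important notes

11.1 Monitoring a remote MySQL

Monitoring a remote MySQL with Nagios

11.2 Because the DataNode processes on node2 and node3 need to be monitored, passwordless SSH login must be set up between node1, node2 and node3.

11.3 Starting Nagios reports an error: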

[root@rhel5 etc]# service nagios start
Starting nagios:This account is currently not available.
 done.

Edit /etc/passwd and change the login shell of the nagios user from /sbin/nologin to /bin/bash.
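
Before and after the change it helps to confirm which shell the account really has. The following sketch reads the seventh field of the passwd entry; the function name login_shell_of is made up for this example, and the optional second argument exists only so that the function can be tried on a copy of the file.

```shell
# Print the login shell recorded for a user in a passwd-format file.
# Returns 1 if the user has no entry.
login_shell_of() {
    user=$1
    passwd_file=${2:-/etc/passwd}
    awk -F: -v u="$user" '
        $1 == u { print $7; found = 1 }
        END     { exit !found }
    ' "$passwd_file"
}
```

`login_shell_of nagios` should print /sbin/nologin while the error occurs. Instead of editing /etc/passwd by hand, `usermod -s /bin/bash nagios` makes the same change.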

