Ganglia+nagios 监控hadoop资源与报警

阅读量：4585 次

发布时间：2019-06-09

本文共 29420 字，大约阅读时间需要 98 分钟。

全篇主要依赖下面２篇文章

http://quenlang.blog.51cto.com/4813803/1571635

http://www.cnblogs.com/mchina/archive/2013/02/20/2883404.html#!comments

一资源下载

nagios : http://sourceforge.net/projects/nagios/files/nagios-4.x/nagios-4.1.1/nagios-4.1.1.tar.gz/download

nagios-plugs :

nrpe : http://sourceforge.net/projects/nagios/files/nrpe-2.x/nrpe-2.15/nrpe-2.15.tar.gz/download

二 ganglia 安装

hadoop1安装ganglia的gmetad、gmond及ganglia-web

2.1 依赖检验,安装

新建一个 ganglia.rpm 文件,写入以下依赖组件

$ vim ganglia.rpmapr-develapr-utilcheck-develcairo-develpango-devellibxml2-develglib2-develdbus-develfreetype-develfontconfig-develgcc-c++expat-develpython-devel rrdtool rrdtool-devellibXrender-develzliblibart_lgpllibpngdejavu-lgc-sans-mono-fontsdejavu-sans-mono-fontsperl-ExtUtils-CBuilderperl-ExtUtils-MakeMaker

查看这些组件是否有安装

$ rpm -q `cat ganglia.rpm`package apr-devel is not installedapr-util-1.3.9-3.el6_0.1.x86_64check-devel-0.9.8-1.1.el6.x86_64cairo-devel-1.8.8-3.1.el6.x86_64pango-devel-1.28.1-10.el6.x86_64libxml2-devel-2.7.6-14.el6_5.2.x86_64glib2-devel-2.28.8-4.el6.x86_64dbus-devel-1.2.24-7.el6_3.x86_64freetype-devel-2.3.11-14.el6_3.1.x86_64fontconfig-devel-2.8.0-5.el6.x86_64gcc-c++-4.4.7-11.el6.x86_64package expat-devel is not installedpython-devel-2.6.6-52.el6.x86_64libXrender-devel-0.9.8-2.1.el6.x86_64zlib-1.2.3-29.el6.x86_64libart_lgpl-2.3.20-5.1.el6.x86_64libpng-1.2.49-1.el6_2.x86_64package dejavu-lgc-sans-mono-fonts is not installedpackage dejavu-sans-mono-fonts is not installedperl-ExtUtils-CBuilder-0.27-136.el6.x86_64perl-ExtUtils-MakeMaker-6.55-136.el6.x86_64

使用 yum install 安装机器上没有的组件

还要安装 confuse

下载地址:http://www.nongnu.org/confuse/

$ tar -zxf confuse-2.7.tar.gz$ cd confuse-2.7$ ./configure CFLAGS=-fPIC --disable-nls$ make && make install

2.2 安装gangali

hadoop1上安装

$ tar -xvf /home/hadoop/ganglia-3.6.0.tar.gz -C /opt/soft/## 安装gmetad$ ./configure --prefix=/usr/local/ganglia --with-gmetad --with-libpcre=no --enable-gexec --enable-status --sysconfdir=/etc/ganglia$ make && make install$ cp gmetad/gmetad.init /etc/init.d/gmetad$ cp /usr/local/ganglia/sbin/gmetad /usr/sbin/$ chkconfig --add gmetad## 安装gmond$ cp gmond/gmond.init /etc/init.d/gmond$ cp /usr/local/ganglia/sbin/gmond /usr/sbin/$ gmond --default_config>/etc/ganglia/gmond.conf$ chkconfig --add gmond

gmetad、gmond安装成功，接着安装ganglia-web，首先要安装php和httpd

yum install php httpd -y

修改httpd的配置文件/etc/httpd/conf/httpd.conf，只把监听端口改为8080

Listen 8080

安装ganglia-web

$ tar xf ganglia-web-3.6.2.tar.gz  -C /opt/soft/$ cd /opt/soft/$ chmod -R 777 ganglia-web-3.6.2/ $ mv ganglia-web-3.6.2/ /var/www/html/ganglia $ cd /var/www/html/ganglia $ useradd www-data $ make install $ chmod 777 /var/lib/ganglia-web/dwoo/cache/ $ chmod 777 /var/lib/ganglia-web/dwoo/compiled/

至此ganglia-web安装完成，修改conf_default.php修改文件，指定ganglia-web的目录及rrds的数据目录，修改如下两行：

36 # Where gmetad stores the rrd archives.37 $conf['gmetad_root'] = "/var/www/html/ganglia"; ## 改为web程序的安装目录38 $conf['rrds'] = "/var/lib/ganglia/rrds";        ## 指定rrd数据存放的路径

创建rrd数据存放目录并授权

$ mkdir /var/lib/ganglia/rrds -p$ chown nobody:nobody /var/lib/ganglia/rrds/ -R

到这里，hadoop1上的ganglia的所有安装工作就完成了，接下来就是要在其他所有节点上安装ganglia的gmond客户端。

其他节点安装上gmond

也是要先安装依赖,然后在安装gmond,所有节点安装都是一样的,所以这里写个脚本

$ vim install_ganglia.sh#!/bin/sh#安装依赖  这是是我已经知道我缺少哪些依赖,所以只安装这些,具体按照你的环境来列出需要安装哪些yum install -y apr-devel expat-devel rrdtool rrdtool-develmkdir /opt/soft;cd /opt/softtar -xvf /home/hadoop/confuse-2.7.tar.gzcd confuse-2.7./configure CFLAGS=-fPIC --disable-nlsmake && make installcd /opt/soft#安装 ganglia gmondtar -xvf /home/hadoop/ganglia-3.6.0.tar.gzcd ganglia-3.6.0/./configure --prefix=/usr/local/ganglia --with-libpcre=no --enable-gexec --enable-status --sysconfdir=/etc/gangliamake && make installcp gmond/gmond.init /etc/init.d/gmondcp /usr/local/ganglia/sbin/gmond /usr/sbin/gmond --default_config>/etc/ganglia/gmond.confchkconfig --add gmond

将这个脚本复制到所有节点执行

2.3 配置ganglia

分为服务端和客户端的配置，服务端的配置文件为gmetad.conf,客户端的配置文件为gmond.conf

首先配置hadoop1上的gmetad.conf,这个文件只有hadoop1上有

$ vi  /etc/ganglia/gmetad.conf## 定义数据源的名字及监听地址，gmond会将收集的数据发送到数据源监听机器上的rrd数据目录中 ## hadoop cluster 为自己定义 data_source "hadoop cluster" 192.168.0.101:8649

接着配置 gmond.conf

$ head -n 80 /etc/ganglia/gmond.conf/* This configuration is as close to 2.5.x default behavior as possible   The values closely match ./gmond/metric.h definitions in 2.5.x */globals {  daemonize = yes        ## 以守护进程运行  setuid = yes             user = nobody          ## 运行gmond的用户  debug_level = 0        ## 改为1会在启动时打印debug信息  max_udp_msg_len = 1472  mute = no              ## 哑巴，本节点将不会再广播任何自己收集到的数据到网络上  deaf = no              ## 聋子，本节点将不再接收任何其他节点广播的数据包  allow_extra_data = yes  host_dmax = 86400 /*secs. Expires (removes from web interface) hosts in 1 day */  host_tmax = 20 /*secs */  cleanup_threshold = 300 /*secs */  gexec = no  # By default gmond will use reverse DNS resolution when displaying your hostname  # Uncommeting following value will override that value.  # override_hostname = "mywebserver.domain.com"  # If you are not using multicast this value should be set to something other than 0.  # Otherwise if you restart aggregator gmond you will get empty graphs. 60 seconds is reasonable  send_metadata_interval = 0 /*secs */ } /* * The cluster attributes specified will be used as part of the 
     
       * tag that will wrap all hosts collected by this instance. */cluster {  name = "hadoop cluster"    ## 指定集群的名字  owner = "nobody"           ## 集群的所有者  latlong = "unspecified"  url = "unspecified"} /* The host section describes attributes of the host, like the location */host {  location = "unspecified"} /* Feel free to specify as many udp_send_channels as you like.  Gmond   used to only support having a single channel */udp_send_channel {  #bind_hostname = yes # Highly recommended, soon to be default.                       # This option tells gmond to use a source address                       # that resolves to the machine's hostname.  Without                       # this, the metrics may appear to come from any                       # interface and the DNS names associated with                       # those IPs will be used to create the RRDs.#  mcast_join = 239.2.11.71    ## 单播模式要注释调这行  host = 192.168.0.101    ## 单播模式，指定接受数据的主机  port = 8649             ## 监听端口  ttl = 1} /* You can specify as many udp_recv_channels as you like as well. */udp_recv_channel {  #mcast_join = 239.2.11.71    ## 单播模式要注释调这行  port = 8649  #bind = 239.2.11.71          ## 单播模式要注释调这行  retry_bind = true  # Size of the UDP buffer. If you are handling lots of metrics you really  # should bump it up to e.g. 10MB or even higher.  # buffer = 10485760} /* You can specify as many tcp_accept_channels as you like to share   an xml description of the state of the cluster */tcp_accept_channel {  port = 8649  # If you want to gzip XML output  gzip_output = no} /* Channel to receive sFlow datagrams */#udp_recv_channel {#  port = 6343#} /* Optional sFlow settings */

好了，hadoop1上的gmetad.conf和gmond.conf配置文件已经修改完成，这时，直接将hadoop1上的gmond.conf文件scp到其他节点上相同的路径下覆盖原来的gmond.conf即可。

2.4 启动 ganglia

所有节点启动 gmond 服务

/etc/init.d/gmond start

hadoop1 节点启动 gmetad httpd 服务

/etc/init.d/gmetad start/etc/init.d/httpd start

2.5 在浏览器中访问hadoop1:8080/ganglia,就会出现下面的页面

配置完成

三配置hadoop

此时，ganglia只是监控了各主机基本的性能，并没有监控到hadoop，接下来需要配置hadoop配置文件，这里以hadoop1上的配置文件为例，其他节点对应的配置文件应从hadoop1上拷贝，首先需要修改的是hadoop配置目录下的hadoop-metrics2.properties

$ cd /usr/local/hadoop-2.6.0/etc/hadoop/$ vim hadoop-metrics2.properties# for Ganglia 3.1 support *.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31 *.sink.ganglia.period=10# default for supportsparse is false *.sink.ganglia.supportsparse=true*.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both*.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40# Tag values to use for the ganglia prefix. If not defined no tags are used.# If '*' all tags are used. If specifiying multiple tags separate them with # commas. Note that the last segment of the property name is the context name.##*.sink.ganglia.tagsForPrefix.jvm=ProcesName#*.sink.ganglia.tagsForPrefix.dfs=#*.sink.ganglia.tagsForPrefix.rpc=#*.sink.ganglia.tagsForPrefix.mapred=namenode.sink.ganglia.servers=192.168.0.101:8649 datanode.sink.ganglia.servers=192.168.0.101:8649 resourcemanager.sink.ganglia.servers=192.168.0.101:8649 nodemanager.sink.ganglia.servers=192.168.0.101:8649 mrappmaster.sink.ganglia.servers=192.168.0.101:8649 jobhistoryserver.sink.ganglia.serve=192.168.0.101:8649

复制到所有节点,重启hadoop集群

此时在监控中已经可以看到关于hadoop指标的监控了

四 nagios 安装

4.1 hadoop1 机器

新建nagios用户

# useradd -s /sbin/nologin nagios# mkdir /usr/local/nagios# chown -R nagios.nagios /usr/local/nagios

4.1.1 编译安装nagios

$ cd /opt/soft$ tar zxvf nagios-3.4.3.tar.gz$ cd nagios-3.4.3$ ./configure --prefix=/usr/local/nagios$ make al$ make install$ make install-init$ make install-config$ make install-commandmode$ make install-webconf

切换目录到安装路径（这里是/usr/local/nagios），看是否存在etc、bin、sbin、share、var 这五个目录，如果存在则可以表明程序被正确的安装到系统了

4.1.2 编译安装　nagios-plugs

$ cd /opt/soft$ tar zxvf nagios-plugins-1.4.16.tar.gz$ cd nagios-plugins-1.4.16 $ mkdir /user/local/nagios $ ./configure --prefix=/usr/local/nagios$ make && make install

4.1.3 安装　check_nrpe　插件

$ cd /opt/soft/$ tar -xvf /home/hadoop/nrpe-2.15.tar.gz$ cd nrpe-2.15/$ ./configure$ make all$ make install-plugin

4.2 datanode 节点

datanode只要安装nagios-plugs 与 nrpe.

因为所有节点是一样的，这里写个脚本

#!/bin/shadduser nagioscd /opt/softtar xvf /home/hadoop/nagios-plugins-2.1.1.tar.gzcd nagios-plugins-2.1.1mkdir /usr/local/nagios./configure --prefix=/usr/local/nagiosmake && make installchown nagios.nagios /usr/local/nagioschown -R nagios.nagios /usr/local/nagios/libexec #安装xinetd.看你的机器是否有xinetd,如果没有就安装，有的话就不用了 yum install xinetd -y cd ../tar xvf /home/hadoop/nrpe-2.15.tar.gzcd nrpe-2.15./configuremake allmake install-daemonmake install-daemon-configmake install-xinetd

安装完成后

修改nrpe.cfg

$ vim /usr/local/nagios/etc/nrpe.cfg log_facility=daemonpid_file=/var/run/nrpe.pid## nagios的监听端口server_port=5666nrpe_user=nagiosnrpe_group=nagios ## nagios服务器主机地址 allowed_hosts=xx.xxx.x.xx dont_blame_nrpe=0 allow_bash_command_substitution=0 debug=0 command_timeout=60 connection_timeout=300 ## 监控负载 command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20 ## 当前系统用户数 command[check_users]=/usr/local/nagios/libexec/check_users -w 5 -c 10 ## 根分区空闲容量 command[check_sda2]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/sda2 ## mysql状态 command[check_mysql]=/usr/local/nagios/libexec/check_mysql -H localhost -P 3306 -d kora -u kora -p upbjsxt ## 主机是否存活 command[check_ping]=/usr/local/nagios/libexec/check_ping -H localhost -w 100.0,20% -c 500.0,60% ## 当前系统的进程总数 command[check_total_procs]=/usr/local/nagios/libexec/check_procs -w 150 -c 200 ## swap使用情况 command[check_swap]=/usr/local/nagios/libexec/check_swap -w 20 -c 10

只有在被监控机器的这个配置文件中定义的命令，在监控机器（也就是hadoop1）上才能通过nrpe插件获取．也就是想监控机器的什么指标必须现在此处定义

同步到其他所有datanode节点

可以看到创建了这个文件/etc/xinetd.d/nrpe。

编辑这个脚本（图用的其他文章的图，版本号跟配置不一样，意思到就行了）：

在only_from 后增加监控主机的IP地址。

编辑/etc/services 文件，增加NRPE服务

重启xinted 服务

# service xinetd restart

查看NRPE 是否已经启动

可以看到5666端口已经在监听了。

4.3 配置

在hadoop1上

要想让nagios与ganglia整合起来，就需要在hadoop1上把ganglia安装包中的ganglia的插件放到nagios的插件目录下

$ /opt/soft/ganglia-3.6.0$ cp contrib/check_ganglia.py /usr/local/nagios/libexec/

默认的check_ganglia.py 插件中只有监控项的实际值大于critical阀值的情况，这里需要增加监控项的实际值小于critical阀值的情况，即最后添加的一段代码

$ vim  /usr/local/nagios/libexec/check_ganglia.py 88   if critical > warning: 89     if value >= critical: 90       print "CHECKGANGLIA CRITICAL: %s is %.2f" % (metric, value) 91       sys.exit(2) 92     elif value >= warning: 93       print "CHECKGANGLIA WARNING: %s is %.2f" % (metric, value) 94       sys.exit(1) 95     else: 96       print "CHECKGANGLIA OK: %s is %.2f" % (metric, value) 97       sys.exit(0) 98   else: 99     if critical >=value:100       print "CHECKGANGLIA CRITICAL: %s is %.2f" % (metric, value)101       sys.exit(2)102     elif warning >=value:103       print "CHECKGANGLIA WARNING: %s is %.2f" % (metric, value)104       sys.exit(1)105     else:106       print "CHECKGANGLIA OK: %s is %.2f" % (metric, value)107       sys.exit(0)

最后改成上面这样

hadoop1上配置各个主机及对应的监控项

没配置前，现在目录结构是这样的

$ cd /usr/local/nagios/etc/objects/$ lltotal 48-rw-rw-r-- 1 nagios nagios  8010 9月  11 14:59 commands.cfg-rw-rw-r-- 1 nagios nagios  2138 9月  11 11:35 contacts.cfg-rw-rw-r-- 1 nagios nagios  5375 9月  11 11:35 localhost.cfg-rw-rw-r-- 1 nagios nagios  3096 9月  11 11:35 printer.cfg-rw-rw-r-- 1 nagios nagios  3265 9月  11 11:35 switch.cfg-rw-rw-r-- 1 nagios nagios 10621 9月  11 11:35 templates.cfg-rw-rw-r-- 1 nagios nagios  3180 9月  11 11:35 timeperiods.cfg-rw-rw-r-- 1 nagios nagios  3991 9月  11 11:35 windows.cfg

注意：cfg的文件跟在配置后面的说明注释一定要用逗号，而不是＃号．我就是因为一开始用了＃号，结果一直出问题找不到是什么原因

修改　commands.cfg

在文件最后加上如下内容

# 'check_ganglia' command definitiondefine command{        command_name    check_ganglia        command_line    $USER1$/check_ganglia.py -h $HOSTADDRESS$ -m $ARG1$ -w $ARG2$ -c $ARG3$        }# 'check_nrpe' command definitiondefine command{        command_name    check_nrpe        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$        }

修改templates.cfg

我有18台datanode机器，这里篇幅原因只截取５个，后面依次再加就行了

define service {         use generic-service         name ganglia-service1     ；这里的配置在service1.cfg中用到        hostgroup_name a01    ；这里的配置在hadoop1.cfg中用到        service_groups ganglia-metrics1    ；这里的配置在service1.cfg中用到        register        0} define service {         use generic-service            name ganglia-service2    ；这里的配置在service2.cfg中用到         hostgroup_name a02    ；这里的配置在hadoop2.cfg中用到        service_groups ganglia-metrics2    ；这里的配置在service2.cfg中用到        register        0}define service {         use generic-service         name ganglia-service3    ；这里的配置在service3.cfg中用到         hostgroup_name a03    ；这里的配置在hadoop3.cfg中用到        service_groups ganglia-metrics3    ；这里的配置在service3.cfg中用到        register        0}define service {         use generic-service         name ganglia-service4    ；这里的配置在service4.cfg中用到         hostgroup_name a04    ；这里的配置在hadoop4.cfg中用到        service_groups ganglia-metrics4    ；这里的配置在service4.cfg中用到        register        0}define service {         use generic-service             name ganglia-service5    ；这里的配置在service5.cfg中用到             hostgroup_name a05    ；这里的配置在hadoop5.cfg中用到            service_groups ganglia-metrics5    ；这里的配置在service5.cfg中用到        register        0}

hadoop1.cfg　配置

这个默认是没有，用localhost.cfg 拷贝来

$cp localhost.cfg hadoop1.cfg

# vim hadoop1.cfg define host{           use                     linux-server         host_name               a01        alias                   a01        address                a01        } define hostgroup {         hostgroup_name  a01        alias  a01        members a01        }define service{        use                             local-service        host_name                       a01        service_description             PING        check_command                   check_ping!100,20%!500,60%        } define service{        use                             local-service        host_name                      a01        service_description             根分区        check_command                   check_local_disk!20%!10%!/#       contact_groups                  admins        } define service{        use                             local-service        host_name                       a01        service_description             用户数量        check_command                   check_local_users!20!50        } define service{        use                             local-service        host_name                       a01        service_description             进程数        check_command                   check_local_procs!550!650!RSZDT        } define service{         use                             local-service                 host_name                       a01        service_description             系统负载        check_command                   check_local_load!5.0,4.0,3.0!10.0,6.0,4.0}

service1.cfg 配置

默认没有service１.cfg，新建一个

$ vim service1.cfgdefine servicegroup {         servicegroup_name ganglia-metrics1        alias Ganglia Metrics1} ## 这里的check_ganglia为commonds.cfg中声明的check_ganglia命令define service{         use                             ganglia-service1        service_description             内存空闲        check_command                   check_ganglia!mem_free!200!50}  define service{        use                             ganglia-service1        service_description             NameNode同步        check_command                   check_ganglia!dfs.namenode.SyncsAvgTime!10!50 }

hadoop2.cfg 配置

需要注意使用check_nrpe插件的监控项必须要在hadoop2上的nrpe.cfg中声明

也就是每个service里的check_command必须在这台机器的　nrpe.cfg　中声明了才有用，比且要保证名称一样

$ cp localhost.cfg hadoop2.cfg $ vim hadoop2.cfg

define host{        use                     linux-server            ; Name of host template to use                                                        ; This host definition will inherit all variables that are defined                                                        ; in (or inherited by) the linux-server host template definition.        host_name               a02        alias                   a02        address                 a02        }# Define an optional hostgroup for Linux machinesdefine hostgroup{        hostgroup_name  a02; The name of the hostgroup        alias           a02 ; Long name of the group        members         a02    ; Comma separated list of hosts that belong to this group        }# Define a service to "ping" the local machinedefine service{        use                             local-service         ; Name of service template to use        host_name                       a02        service_description             PING        check_command                   check_nrpe!check_ping        }# Define a service to check the disk space of the root partition# on the local machine.  Warning if < 20% free, critical if# < 10% free space on partition.define service{        use                             local-service         ; Name of service template to use        host_name                       a02        service_description             Root Partition        check_command                   check_nrpe!check_sda2        }# Define a service to check the number of currently logged in# users on the local machine.  Warning if > 20 users, critical# if > 50 users.define service{        use                             local-service         ; Name of service template to use        host_name                       a02        service_description             Current Users        check_command                   check_nrpe!check_users        }# Define a service to check the number of currently running procs# on the local machine.  Warning if > 250 processes, critical if# > 400 users.define service{        use                             local-service         ; Name of service template to use        host_name                       a02        service_description             Total Processes        check_command                   check_nrpe!check_total_procs        }define service{        use                             local-service         ; Name of service template to use        host_name                       a02        service_description             Current Load        check_command                   check_nrpe!check_load        }# Define a service to check the swap usage the local machine. # Critical if less than 10% of swap is free, warning if less than 20% is freedefine service{        use                             local-service         ; Name of service template to use        host_name                       a02        service_description             Swap Usage        check_command                   check_nrpe!check_swap        }

hadoop2的设置完，拷贝16份，因为datanode配置基本一样，就是hostname有点小区别

$ for i in {
   3..18};do cp hadoop2.cfg hadoop$i.cfg;done

将剩下里面hostname改下就行，后面就不说了

service2.cfg 配置

新建文件并配置

$ vim service2.cfg define servicegroup {        servicegroup_name ganglia-metrics2        alias Ganglia Metrics2}define service{        use     ganglia-service2        service_description     内存空闲        check_command   check_ganglia!mem_free!200!50}define service{        use     ganglia-service2        service_description     RegionServer_Get        check_command   check_ganglia!yarn.NodeManagerMetrics.AvailableVCores!７!７}define service{        use     ganglia-service2        service_description     DateNode_Heartbeat        check_command   check_ganglia!dfs.datanode.HeartbeatsAvgTime!15!40

service2的设置完，拷贝16份，因为datanode配置基本一样，就是servicegroup_name,use有点小区别

$ for i in {
   3..18};do scp service2.cfg service$i.cfg;done

改成对应的编号

修改　nagios.cfg

$ vim  ../nagios.cfgcfg_file=/usr/local/nagios/etc/objects/commands.cfgcfg_file=/usr/local/nagios/etc/objects/contacts.cfgcfg_file=/usr/local/nagios/etc/objects/timeperiods.cfgcfg_file=/usr/local/nagios/etc/objects/templates.cfg#引进host文件cfg_file=/usr/local/nagios/etc/objects/hadoop1.cfgcfg_file=/usr/local/nagios/etc/objects/hadoop2.cfgcfg_file=/usr/local/nagios/etc/objects/hadoop3.cfgcfg_file=/usr/local/nagios/etc/objects/hadoop4.cfgcfg_file=/usr/local/nagios/etc/objects/hadoop5.cfgcfg_file=/usr/local/nagios/etc/objects/hadoop6.cfgcfg_file=/usr/local/nagios/etc/objects/hadoop7.cfgcfg_file=/usr/local/nagios/etc/objects/hadoop8.cfgcfg_file=/usr/local/nagios/etc/objects/hadoop9.cfgcfg_file=/usr/local/nagios/etc/objects/hadoop10.cfgcfg_file=/usr/local/nagios/etc/objects/hadoop11.cfgcfg_file=/usr/local/nagios/etc/objects/hadoop12.cfgcfg_file=/usr/local/nagios/etc/objects/hadoop13.cfgcfg_file=/usr/local/nagios/etc/objects/hadoop14.cfgcfg_file=/usr/local/nagios/etc/objects/hadoop15.cfgcfg_file=/usr/local/nagios/etc/objects/hadoop16.cfgcfg_file=/usr/local/nagios/etc/objects/hadoop17.cfgcfg_file=/usr/local/nagios/etc/objects/hadoop18.cfg#引进监控项的文件cfg_file=/usr/local/nagios/etc/objects/service1.cfgcfg_file=/usr/local/nagios/etc/objects/service2.cfgcfg_file=/usr/local/nagios/etc/objects/service3.cfgcfg_file=/usr/local/nagios/etc/objects/service4.cfgcfg_file=/usr/local/nagios/etc/objects/service5.cfgcfg_file=/usr/local/nagios/etc/objects/service6.cfgcfg_file=/usr/local/nagios/etc/objects/service7.cfgcfg_file=/usr/local/nagios/etc/objects/service8.cfgcfg_file=/usr/local/nagios/etc/objects/service9.cfgcfg_file=/usr/local/nagios/etc/objects/service10.cfgcfg_file=/usr/local/nagios/etc/objects/service11.cfgcfg_file=/usr/local/nagios/etc/objects/service12.cfgcfg_file=/usr/local/nagios/etc/objects/service13.cfgcfg_file=/usr/local/nagios/etc/objects/service14.cfgcfg_file=/usr/local/nagios/etc/objects/service15.cfgcfg_file=/usr/local/nagios/etc/objects/service16.cfgcfg_file=/usr/local/nagios/etc/objects/service17.cfgcfg_file=/usr/local/nagios/etc/objects/service18.cfg

验证配置是否正确

$ pwd/usr/local/nagios/etc$ ../bin/nagios -v nagios.cfg Nagios Core 4.1.1Copyright (c) 2009-present Nagios Core Development Team and Community ContributorsCopyright (c) 1999-2009 Ethan GalstadLast Modified: 08-19-2015License: GPLWebsite: https://www.nagios.orgReading configuration data...   Read main config file okay...   Read object config files okay...Running pre-flight check on configuration data...Checking objects...    Checked 161 services.    Checked 18 hosts.    Checked 18 host groups.    Checked 18 service groups.    Checked 1 contacts.    Checked 1 contact groups.    Checked 26 commands.    Checked 5 time periods.    Checked 0 host escalations.    Checked 0 service escalations.Checking for circular paths...    Checked 18 hosts    Checked 0 service dependencies    Checked 0 host dependencies    Checked 5 timeperiodsChecking global event handlers...Checking obsessive compulsive processor commands...Checking misc settings...Total Warnings: 0Total Errors:   0Things look okay - No serious problems were detected during the pre-flight check

没有错误，这时就可以启动hadoop1上的nagios服务

$ /etc/init.d/nagios startStarting nagios: done.

因为之前datanode上的nrpe已经启动了

测试hadoop1与datanode上nrpe通信是否正常

]$ for i in {
   10..28};do /usr/local/nagios/libexec/check_nrpe -H xx.xxx.x.$i;doneNRPE v2.15NRPE v2.15NRPE v2.15NRPE v2.15NRPE v2.15NRPE v2.15NRPE v2.15NRPE v2.15NRPE v2.15NRPE v2.15NRPE v2.15NRPE v2.15NRPE v2.15NRPE v2.15NRPE v2.15NRPE v2.15

ok，通信正常，验证check_ganglia.py插件是否工作正常

$ /usr/local/nagios/libexec/check_ganglia.py -h a01 -m mem_free -w 200 -c 50CHECKGANGLIA OK: mem_free is 61840868.00

工作正常，现在我们可以nagios的web页面，看是否监控成功。

localhost:8080/nagios

4.4 邮件报警配置

先检查服务器是否安装sendmail

$ rpm -q sendmail $ yum install sendmail  #如果没有就安装sendmail $ service sendmail restart  #重启sendmail

因为给外部发邮件，需要服务器自己有邮件服务器，这很麻烦并且非常占资源．这里我们配置一下，使用现有的STMP服务器

配置地址　/etc/mail.rc

$ vim /etc/mail.rcset from=systeminformation@xxx.comset smtp=mail.xxx.com smtp-auth-user=systeminformation smtp-auth-password=111111 smtp-auth=login

配置完毕之后，就可以先命令行测试一下，是否可以发邮件了

$ echo "hello world" |mail -s "test" pingjie@xxx.com

如果看你的邮件已经收到邮件了，说明sendmail已经没有问题．

下面配置nagios的邮件告警配置

$ vim /usr/local/nagios/etc/objects/contacts.cfgdefine contact{        contact_name                    nagiosadmin             ; Short name of user        use                             generic-contact         ; Inherit default values from generic-contact template (defined above)        alias                           Nagios Admin            ; Full name of user        ## 告警时间段        service_notification_period     24x7        host_notification_period        24x7        ## 告警信息格式        service_notification_options    w,u,c,r,f,s        host_notification_options       d,u,r,f,s        ## 告警方式为邮件        service_notification_commands   notify-service-by-email        host_notification_commands      notify-host-by-email        email                           pingjie@xxx.com       ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******        }# We only have one contact in this simple configuration file, so there is# no need to create more than one contact group.define contactgroup{        contactgroup_name       admins        alias                   Nagios Administrators        members                 nagiosadmin        }

至此配置全部完成

脚本监控hadoop进程

1.监控datanode的脚本

就是用python 读取HDFS页面，再正则匹配到Live Nodes这部分

1 #!/usr/bin/env python 2  3 import commands 4 import sys 5 from optparse import OptionParser 6 import urllib 7 import re 8  9 def get_value():10     urlItem = urllib.urlopen("http://namenode:50070/dfshealth.jsp")11     html = urlItem.read()12     urlItem.close()13     return float(re.findall('.+Live Nodes  :\\s+(\d+)\\s+\\(Decommissioned: \d+\\).+', html)[0])14 15 if __name__ == '__main__':16 17     parser = OptionParser(usage="%prog [-w] [-c]", version="%prog 1.0")18     parser.add_option("-w", "--warning", type="int", dest="w", default=16)19     parser.add_option("-c", "--critical", type="int", dest="c", default=15)20     (options, args) = parser.parse_args()21 22     if(options.c >= options.w):23         print '-w must greater then -c'24         sys.exit(1)25 26     value = get_value()27 28     if(value <= options.c ) :29         print 'CRITICAL - Live Nodes %d' %(value)30         sys.exit(2)31     elif(value <= options.w):32         print 'WARNING - Live Nodes %d' %(value)33         sys.exit(1)34     else:35         print 'OK - Live Nodes %d' %(value)36         sys.exit(0)

2.监控dfs空间：

#!/usr/bin/env pythonimport commandsimport sysfrom optparse import OptionParserimport urllibimport redef get_dfs_free_percent():    urlItem = urllib.urlopen("http://namenode:50070/dfshealth.jsp")    html = urlItem.read()    urlItem.close()    return float(re.findall('.+ DFS Remaining% :\\s+(\d+\\.\d+)%.+', html)[0])if __name__ == '__main__':    parser = OptionParser(usage="%prog [-w] [-c]", version="%prog 1.0")    parser.add_option("-w", "--warning", type="int", dest="w", default=30, help="total dfs used percent")    parser.add_option("-c", "--critical", type="int", dest="c", default=20, help="total dfs used percent")    (options, args) = parser.parse_args()    if(options.c >= options.w):        print '-w must greater then -c'        sys.exit(1)    dfs_free_percent = get_dfs_free_percent()    if(dfs_free_percent <= options.c ) :        print 'CRITICAL - DFS free %d%%' %(dfs_free_percent)        sys.exit(2)    elif(dfs_free_percent <= options.w):        print 'WARNING - DFS free %d%%' %(dfs_free_percent)        sys.exit(1)    else:        print 'OK - DFS free %d%%' %(dfs_free_percent)        sys.exit(0)

如果脚本出错，就进ｐｙｔｈｏｎ命令行，根据html的结果调试一下正则部分即可

拷贝这２个脚本到/usr/local/nagios/etc/objects/

这２个脚本单独在命令行使用 ./check_hadoop_datanode.py 这种方式执行一下试试，如果报这个错

: No such file or directory

vim打开文件后,命令模式执行 :set ff=unix , 然后保存就行了

3. 修改nagios配置

commands.cfg　增加如下２个command

$ vim /usr/local/nagios/etc/objects/commands.cfgdefine command{        command_name    check_datanode        command_line    $USER1$/check_hadoop_datanode.py -w $ARG1$ -c $ARG2$        }define command{        command_name    check_dfs        command_line    $USER1$/check_hadoop_dfs.py -w $ARG1$ -c $ARG2$        }

修改server1.cfg,增加如下２个service

$ vim service1.cfg define service{        use     ganglia-service1        service_description     datanode存活个数        check_command   check_datanode!16!15}define service{        use     ganglia-service1        service_description     dfs剩余空间        check_command   check_dfs!30!20}

完成

五问题记录

5.1 ganglia监控的指标有问题

问题描述：为了测试nagios报警功能，然后我就kill了一个节点的datanode，但是看nagios上一直显示这个datanode是正常的．因为nagios这些指标是从ganglia来的，于是就找到ganglia上，发现也是正常的．这个问题就很奇怪了，为啥datanode已经kill了还一直发心跳

解决方案：没有，有知道的请赐教。曲线救国，nagios使用脚本方式监控进程

转载于:https://www.cnblogs.com/pingjie/p/4809489.html

你可能感兴趣的文章