Ganglia + Nagios: Monitoring Hadoop Resources and Alerting
Published: 2019-06-09


This post draws heavily on the following two articles:

http://quenlang.blog.51cto.com/4813803/1571635

http://www.cnblogs.com/mchina/archive/2013/02/20/2883404.html#!comments

1 Resource downloads

nagios : http://sourceforge.net/projects/nagios/files/nagios-4.x/nagios-4.1.1/nagios-4.1.1.tar.gz/download

nagios-plugins :

nrpe : http://sourceforge.net/projects/nagios/files/nrpe-2.x/nrpe-2.15/nrpe-2.15.tar.gz/download

 

2 Ganglia installation

On hadoop1, install Ganglia's gmetad, gmond, and ganglia-web.

2.1 Checking and installing dependencies

Create a ganglia.rpm file listing the required dependency packages:

$ vim ganglia.rpm
apr-devel
apr-util
check-devel
cairo-devel
pango-devel
libxml2-devel
glib2-devel
dbus-devel
freetype-devel
fontconfig-devel
gcc-c++
expat-devel
python-devel
rrdtool
rrdtool-devel
libXrender-devel
zlib
libart_lgpl
libpng
dejavu-lgc-sans-mono-fonts
dejavu-sans-mono-fonts
perl-ExtUtils-CBuilder
perl-ExtUtils-MakeMaker

Check which of these packages are already installed:

$ rpm -q `cat ganglia.rpm`
package apr-devel is not installed
apr-util-1.3.9-3.el6_0.1.x86_64
check-devel-0.9.8-1.1.el6.x86_64
cairo-devel-1.8.8-3.1.el6.x86_64
pango-devel-1.28.1-10.el6.x86_64
libxml2-devel-2.7.6-14.el6_5.2.x86_64
glib2-devel-2.28.8-4.el6.x86_64
dbus-devel-1.2.24-7.el6_3.x86_64
freetype-devel-2.3.11-14.el6_3.1.x86_64
fontconfig-devel-2.8.0-5.el6.x86_64
gcc-c++-4.4.7-11.el6.x86_64
package expat-devel is not installed
python-devel-2.6.6-52.el6.x86_64
libXrender-devel-0.9.8-2.1.el6.x86_64
zlib-1.2.3-29.el6.x86_64
libart_lgpl-2.3.20-5.1.el6.x86_64
libpng-1.2.49-1.el6_2.x86_64
package dejavu-lgc-sans-mono-fonts is not installed
package dejavu-sans-mono-fonts is not installed
perl-ExtUtils-CBuilder-0.27-136.el6.x86_64
perl-ExtUtils-MakeMaker-6.55-136.el6.x86_64

Use yum install to install whatever packages are missing.
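As a sketch, the packages reported as missing can be pulled straight out of the rpm -q output and handed to yum in one pass (assuming the ganglia.rpm file created above):

# Install everything rpm -q reported as "not installed".
rpm -q `cat ganglia.rpm` | grep "not installed" | awk '{print $2}' | xargs -r yum install -y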

 

You also need to install libconfuse.

Download: http://www.nongnu.org/confuse/

$ tar -zxf confuse-2.7.tar.gz
$ cd confuse-2.7
$ ./configure CFLAGS=-fPIC --disable-nls
$ make && make install

2.2 Installing Ganglia

Install on hadoop1:

$ tar -xvf /home/hadoop/ganglia-3.6.0.tar.gz -C /opt/soft/
$ cd /opt/soft/ganglia-3.6.0

## Install gmetad
$ ./configure --prefix=/usr/local/ganglia --with-gmetad --with-libpcre=no --enable-gexec --enable-status --sysconfdir=/etc/ganglia
$ make && make install
$ cp gmetad/gmetad.init /etc/init.d/gmetad
$ cp /usr/local/ganglia/sbin/gmetad /usr/sbin/
$ chkconfig --add gmetad

## Install gmond
$ cp gmond/gmond.init /etc/init.d/gmond
$ cp /usr/local/ganglia/sbin/gmond /usr/sbin/
$ gmond --default_config > /etc/ganglia/gmond.conf
$ chkconfig --add gmond

With gmetad and gmond installed, the next step is ganglia-web, which first needs PHP and httpd:

yum install php httpd -y

Edit httpd's configuration file /etc/httpd/conf/httpd.conf; the only change is the listen port:

Listen 8080

 

Install ganglia-web:

$ tar xf ganglia-web-3.6.2.tar.gz -C /opt/soft/
$ cd /opt/soft/
$ chmod -R 777 ganglia-web-3.6.2/
$ mv ganglia-web-3.6.2/ /var/www/html/ganglia
$ cd /var/www/html/ganglia
$ useradd www-data
$ make install
$ chmod 777 /var/lib/ganglia-web/dwoo/cache/
$ chmod 777 /var/lib/ganglia-web/dwoo/compiled/

That completes the ganglia-web installation. Now edit conf_default.php to point at the ganglia-web directory and the rrds data directory; change these two lines:

# Where gmetad stores the rrd archives.
$conf['gmetad_root'] = "/var/www/html/ganglia"; ## set to the web app's install directory
$conf['rrds'] = "/var/lib/ganglia/rrds";        ## path where the rrd data is stored

Create the rrd data directory and set its ownership:

$ mkdir -p /var/lib/ganglia/rrds
$ chown -R nobody:nobody /var/lib/ganglia/rrds/

At this point all of the Ganglia installation work on hadoop1 is done. Next, install the gmond client on every other node.

 

Installing gmond on the other nodes

Again, install the dependencies first and then gmond. The procedure is identical on every node, so wrap it in a script:

$ vim install_ganglia.sh
#!/bin/sh
# Install dependencies. In my environment these were the only missing packages;
# adjust the list for yours.
yum install -y apr-devel expat-devel rrdtool rrdtool-devel
mkdir /opt/soft; cd /opt/soft
tar -xvf /home/hadoop/confuse-2.7.tar.gz
cd confuse-2.7
./configure CFLAGS=-fPIC --disable-nls
make && make install
cd /opt/soft
# Install ganglia gmond
tar -xvf /home/hadoop/ganglia-3.6.0.tar.gz
cd ganglia-3.6.0/
./configure --prefix=/usr/local/ganglia --with-libpcre=no --enable-gexec --enable-status --sysconfdir=/etc/ganglia
make && make install
cp gmond/gmond.init /etc/init.d/gmond
cp /usr/local/ganglia/sbin/gmond /usr/sbin/
gmond --default_config > /etc/ganglia/gmond.conf
chkconfig --add gmond

Copy this script to every node and run it, for example with a loop like the sketch below.
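A minimal sketch of that distribution step, assuming passwordless SSH from hadoop1 and that the other nodes resolve as a02 through a18 (adjust the hostnames for your cluster):

# Push the install script to each node and run it there.
for h in a{02..18}; do
    scp install_ganglia.sh $h:/tmp/ && ssh $h "sh /tmp/install_ganglia.sh"
done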

2.3 Configuring Ganglia

Configuration is split between server and client: the server side is gmetad.conf, the client side is gmond.conf.

First configure gmetad.conf on hadoop1; only hadoop1 has this file.

$ vi /etc/ganglia/gmetad.conf
## Define the data source name and listen address. gmond sends its collected
## data to the rrd data directory on the data source machine.
## "hadoop cluster" is a name of your own choosing.
data_source "hadoop cluster" 192.168.0.101:8649

Next, configure gmond.conf:

$ head -n 80 /etc/ganglia/gmond.conf
/* This configuration is as close to 2.5.x default behavior as possible
   The values closely match ./gmond/metric.h definitions in 2.5.x */
globals {
  daemonize = yes          ## run as a daemon
  setuid = yes
  user = nobody            ## user gmond runs as
  debug_level = 0          ## set to 1 to print debug info at startup
  max_udp_msg_len = 1472
  mute = no                ## if yes, this node stops broadcasting its own collected data
  deaf = no                ## if yes, this node stops receiving data broadcast by other nodes
  allow_extra_data = yes
  host_dmax = 86400 /*secs. Expires (removes from web interface) hosts in 1 day */
  host_tmax = 20 /*secs */
  cleanup_threshold = 300 /*secs */
  gexec = no
  # By default gmond will use reverse DNS resolution when displaying your hostname
  # Uncommenting following value will override that value.
  # override_hostname = "mywebserver.domain.com"
  # If you are not using multicast this value should be set to something other than 0.
  # Otherwise if you restart aggregator gmond you will get empty graphs. 60 seconds is reasonable
  send_metadata_interval = 0 /*secs */
}

/*
 * The cluster attributes specified will be used as part of the <CLUSTER>
 * tag that will wrap all hosts collected by this instance.
 */
cluster {
  name = "hadoop cluster"   ## cluster name
  owner = "nobody"          ## cluster owner
  latlong = "unspecified"
  url = "unspecified"
}

/* The host section describes attributes of the host, like the location */
host {
  location = "unspecified"
}

/* Feel free to specify as many udp_send_channels as you like.  Gmond
   used to only support having a single channel */
udp_send_channel {
  #bind_hostname = yes # Highly recommended, soon to be default.
                       # This option tells gmond to use a source address
                       # that resolves to the machine's hostname.  Without
                       # this, the metrics may appear to come from any
                       # interface and the DNS names associated with
                       # those IPs will be used to create the RRDs.
  # mcast_join = 239.2.11.71   ## comment this out for unicast mode
  host = 192.168.0.101         ## unicast mode: the host that receives the data
  port = 8649                  ## listen port
  ttl = 1
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
  #mcast_join = 239.2.11.71    ## comment this out for unicast mode
  port = 8649
  #bind = 239.2.11.71          ## comment this out for unicast mode
  retry_bind = true
  # Size of the UDP buffer. If you are handling lots of metrics you really
  # should bump it up to e.g. 10MB or even higher.
  # buffer = 10485760
}

/* You can specify as many tcp_accept_channels as you like to share
   an xml description of the state of the cluster */
tcp_accept_channel {
  port = 8649
  # If you want to gzip XML output
  gzip_output = no
}

/* Channel to receive sFlow datagrams */
#udp_recv_channel {
#  port = 6343
#}

/* Optional sFlow settings */

That's it for hadoop1's gmetad.conf and gmond.conf. Now just scp hadoop1's gmond.conf to the same path on every other node, overwriting the existing gmond.conf, for example:
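A minimal sketch, under the same hostname assumptions as before:

# Overwrite every other node's gmond.conf with hadoop1's copy.
for h in a{02..18}; do
    scp /etc/ganglia/gmond.conf $h:/etc/ganglia/gmond.conf
done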

2.4 Starting Ganglia

Start the gmond service on all nodes:

/etc/init.d/gmond start

On hadoop1, start the gmetad and httpd services:

/etc/init.d/gmetad start
/etc/init.d/httpd start

2.5 Browse to hadoop1:8080/ganglia and the Ganglia dashboard should appear.

Configuration complete.
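As a quick sanity check that gmond is actually serving data, you can also pull the cluster XML from its tcp_accept_channel (a sketch; requires nc to be installed):

# gmond dumps the cluster state as XML on port 8649; any output means collection is working.
nc 192.168.0.101 8649 | head -n 5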

3 Configuring Hadoop

At this point Ganglia is only monitoring basic host metrics, not Hadoop itself. Next, edit Hadoop's configuration files; hadoop1's copies are shown here, and the other nodes should get theirs copied from hadoop1. The first file to modify is hadoop-metrics2.properties in the Hadoop configuration directory:

$ cd /usr/local/hadoop-2.6.0/etc/hadoop/
$ vim hadoop-metrics2.properties
# for Ganglia 3.1 support
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31

*.sink.ganglia.period=10

# default for supportsparse is false
*.sink.ganglia.supportsparse=true

*.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both
*.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40

# Tag values to use for the ganglia prefix. If not defined no tags are used.
# If '*' all tags are used. If specifying multiple tags separate them with
# commas. Note that the last segment of the property name is the context name.
#
#*.sink.ganglia.tagsForPrefix.jvm=ProcessName
#*.sink.ganglia.tagsForPrefix.dfs=
#*.sink.ganglia.tagsForPrefix.rpc=
#*.sink.ganglia.tagsForPrefix.mapred=

namenode.sink.ganglia.servers=192.168.0.101:8649
datanode.sink.ganglia.servers=192.168.0.101:8649
resourcemanager.sink.ganglia.servers=192.168.0.101:8649
nodemanager.sink.ganglia.servers=192.168.0.101:8649
mrappmaster.sink.ganglia.servers=192.168.0.101:8649
jobhistoryserver.sink.ganglia.servers=192.168.0.101:8649

Copy it to all nodes and restart the Hadoop cluster; a loop like the one below works.
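A minimal sketch of that distribution step (same hostname assumptions as earlier; stop-all.sh and start-all.sh are the stock Hadoop 2.6 scripts):

# Distribute the metrics config, then restart the cluster from hadoop1.
for h in a{02..18}; do
    scp hadoop-metrics2.properties $h:/usr/local/hadoop-2.6.0/etc/hadoop/
done
/usr/local/hadoop-2.6.0/sbin/stop-all.sh
/usr/local/hadoop-2.6.0/sbin/start-all.sh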

Hadoop metrics should now be visible in the monitoring UI.

 

4 Nagios installation

4.1 On hadoop1

Create the nagios user:

# useradd -s /sbin/nologin nagios
# mkdir /usr/local/nagios
# chown -R nagios.nagios /usr/local/nagios

4.1.1 Building and installing Nagios

$ cd /opt/soft
$ tar zxvf nagios-4.1.1.tar.gz
$ cd nagios-4.1.1
$ ./configure --prefix=/usr/local/nagios
$ make all
$ make install
$ make install-init
$ make install-config
$ make install-commandmode
$ make install-webconf

Change to the install path (here /usr/local/nagios) and check that the five directories etc, bin, sbin, share, and var all exist; if they do, the program was installed correctly.
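For example:

$ ls /usr/local/nagios
bin  etc  sbin  share  var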

4.1.2 Building and installing nagios-plugins

$ cd /opt/soft
$ tar zxvf nagios-plugins-1.4.16.tar.gz
$ cd nagios-plugins-1.4.16
$ mkdir /usr/local/nagios
$ ./configure --prefix=/usr/local/nagios
$ make && make install

4.1.3 Installing the check_nrpe plugin

$ cd /opt/soft/
$ tar -xvf /home/hadoop/nrpe-2.15.tar.gz
$ cd nrpe-2.15/
$ ./configure
$ make all
$ make install-plugin

4.2 On the datanodes

The datanodes only need nagios-plugins and NRPE.

Since every node is the same, use a script:

#!/bin/sh
adduser nagios
cd /opt/soft
tar xvf /home/hadoop/nagios-plugins-2.1.1.tar.gz
cd nagios-plugins-2.1.1
mkdir /usr/local/nagios
./configure --prefix=/usr/local/nagios
make && make install
chown nagios.nagios /usr/local/nagios
chown -R nagios.nagios /usr/local/nagios/libexec

# Install xinetd if the machine does not already have it.
yum install xinetd -y
cd ../
tar xvf /home/hadoop/nrpe-2.15.tar.gz
cd nrpe-2.15
./configure
make all
make install-daemon
make install-daemon-config
make install-xinetd

After installation completes, edit nrpe.cfg:

$ vim /usr/local/nagios/etc/nrpe.cfg
log_facility=daemon
pid_file=/var/run/nrpe.pid
## NRPE listen port
server_port=5666
nrpe_user=nagios
nrpe_group=nagios
## Nagios server address
allowed_hosts=xx.xxx.x.xx
dont_blame_nrpe=0
allow_bash_command_substitution=0
debug=0
command_timeout=60
connection_timeout=300
## system load
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
## number of logged-in users
command[check_users]=/usr/local/nagios/libexec/check_users -w 5 -c 10
## free space on the root partition
command[check_sda2]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/sda2
## mysql status
command[check_mysql]=/usr/local/nagios/libexec/check_mysql -H localhost -P 3306 -d kora -u kora -p upbjsxt
## host alive
command[check_ping]=/usr/local/nagios/libexec/check_ping -H localhost -w 100.0,20% -c 500.0,60%
## total number of processes
command[check_total_procs]=/usr/local/nagios/libexec/check_procs -w 150 -c 200
## swap usage
command[check_swap]=/usr/local/nagios/libexec/check_swap -w 20 -c 10

Only commands defined in this file on the monitored machine can be fetched through the NRPE plugin by the monitoring machine (hadoop1). In other words, whatever metric you want to monitor must be declared here first.
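For example, to watch anything else on this machine, declare a command for it here first (a hypothetical extra check shown; hadoop1 would then reference it as check_nrpe!check_data_disk):

## hypothetical: free space on a /data partition
command[check_data_disk]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /data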

Sync this file to all other datanodes.

 

You can see that the file /etc/xinetd.d/nrpe has been created.

Edit this file (the screenshots in the original were borrowed from another article, so the version numbers don't match this configuration; the idea is what matters):

Add the monitoring host's IP address to only_from.

Then edit /etc/services and add the NRPE service. Both edits are sketched below.
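The two edits amount to something like this (192.168.0.101 is the monitoring host's address used elsewhere in this post):

# /etc/xinetd.d/nrpe -- let the Nagios server connect
only_from       = 127.0.0.1 192.168.0.101

# /etc/services -- register the NRPE service
nrpe            5666/tcp                # NRPE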

Restart the xinetd service:

# service xinetd restart

Check whether NRPE has started:

Port 5666 should now be listening.
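An equivalent check from the shell:

# Confirm that xinetd is listening on the NRPE port.
$ netstat -lnt | grep 5666
tcp        0      0 0.0.0.0:5666        0.0.0.0:*           LISTEN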

4.3 Configuration

On hadoop1.

To integrate Nagios with Ganglia, copy the Ganglia plugin shipped in the Ganglia source package into Nagios's plugin directory on hadoop1:

$ cd /opt/soft/ganglia-3.6.0
$ cp contrib/check_ganglia.py /usr/local/nagios/libexec/

The stock check_ganglia.py only handles the case where a metric's actual value is above the critical threshold; we also need the case where the value falls below the critical threshold, i.e. the else branch appended at the end:

$ vim /usr/local/nagios/libexec/check_ganglia.py
  if critical > warning:
    if value >= critical:
      print "CHECKGANGLIA CRITICAL: %s is %.2f" % (metric, value)
      sys.exit(2)
    elif value >= warning:
      print "CHECKGANGLIA WARNING: %s is %.2f" % (metric, value)
      sys.exit(1)
    else:
      print "CHECKGANGLIA OK: %s is %.2f" % (metric, value)
      sys.exit(0)
  else:
    if critical >= value:
      print "CHECKGANGLIA CRITICAL: %s is %.2f" % (metric, value)
      sys.exit(2)
    elif warning >= value:
      print "CHECKGANGLIA WARNING: %s is %.2f" % (metric, value)
      sys.exit(1)
    else:
      print "CHECKGANGLIA OK: %s is %.2f" % (metric, value)
      sys.exit(0)

The tail of the script should end up looking like the above.

 

On hadoop1, configure each host and its corresponding monitored items.

Before any changes, the directory looks like this:

$ cd /usr/local/nagios/etc/objects/
$ ll
total 48
-rw-rw-r-- 1 nagios nagios  8010 Sep 11 14:59 commands.cfg
-rw-rw-r-- 1 nagios nagios  2138 Sep 11 11:35 contacts.cfg
-rw-rw-r-- 1 nagios nagios  5375 Sep 11 11:35 localhost.cfg
-rw-rw-r-- 1 nagios nagios  3096 Sep 11 11:35 printer.cfg
-rw-rw-r-- 1 nagios nagios  3265 Sep 11 11:35 switch.cfg
-rw-rw-r-- 1 nagios nagios 10621 Sep 11 11:35 templates.cfg
-rw-rw-r-- 1 nagios nagios  3180 Sep 11 11:35 timeperiods.cfg
-rw-rw-r-- 1 nagios nagios  3991 Sep 11 11:35 windows.cfg

Note: trailing comments after a directive in these .cfg files must use a semicolon (;), not #. I used # at first and it caused problems that took a long time to track down.

 

Edit commands.cfg

Append the following to the end of the file:

# 'check_ganglia' command definition
define command{
        command_name    check_ganglia
        command_line    $USER1$/check_ganglia.py -h $HOSTADDRESS$ -m $ARG1$ -w $ARG2$ -c $ARG3$
        }

# 'check_nrpe' command definition
define command{
        command_name    check_nrpe
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
        }

Edit templates.cfg

I have 18 datanode machines; for space only the first 5 entries are shown here. Add the rest in the same pattern (or generate them, as sketched after this block).

define service {
        use generic-service
        name ganglia-service1              ; referenced in service1.cfg
        hostgroup_name a01                 ; referenced in hadoop1.cfg
        service_groups ganglia-metrics1    ; referenced in service1.cfg
        register        0
}

define service {
        use generic-service
        name ganglia-service2              ; referenced in service2.cfg
        hostgroup_name a02                 ; referenced in hadoop2.cfg
        service_groups ganglia-metrics2    ; referenced in service2.cfg
        register        0
}

define service {
        use generic-service
        name ganglia-service3              ; referenced in service3.cfg
        hostgroup_name a03                 ; referenced in hadoop3.cfg
        service_groups ganglia-metrics3    ; referenced in service3.cfg
        register        0
}

define service {
        use generic-service
        name ganglia-service4              ; referenced in service4.cfg
        hostgroup_name a04                 ; referenced in hadoop4.cfg
        service_groups ganglia-metrics4    ; referenced in service4.cfg
        register        0
}

define service {
        use generic-service
        name ganglia-service5              ; referenced in service5.cfg
        hostgroup_name a05                 ; referenced in hadoop5.cfg
        service_groups ganglia-metrics5    ; referenced in service5.cfg
        register        0
}
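Instead of writing all 18 blocks by hand, a loop like this sketch can generate the templates.cfg entries (using the a01..a18 naming from above):

# Emit one service template per node and append to templates.cfg.
for i in $(seq 1 18); do
    n=$(printf '%02d' $i)
    cat >> templates.cfg <<EOF
define service {
        use generic-service
        name ganglia-service$i
        hostgroup_name a$n
        service_groups ganglia-metrics$i
        register        0
}
EOF
done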

hadoop1.cfg

This file doesn't exist by default; create it from localhost.cfg:

$ cp localhost.cfg hadoop1.cfg
$ vim hadoop1.cfg
define host{
        use                     linux-server
        host_name               a01
        alias                   a01
        address                 a01
        }

define hostgroup {
        hostgroup_name  a01
        alias           a01
        members         a01
        }

define service{
        use                             local-service
        host_name                       a01
        service_description             PING
        check_command                   check_ping!100,20%!500,60%
        }

define service{
        use                             local-service
        host_name                       a01
        service_description             Root Partition
        check_command                   check_local_disk!20%!10%!/
#       contact_groups                  admins
        }

define service{
        use                             local-service
        host_name                       a01
        service_description             Current Users
        check_command                   check_local_users!20!50
        }

define service{
        use                             local-service
        host_name                       a01
        service_description             Total Processes
        check_command                   check_local_procs!550!650!RSZDT
        }

define service{
        use                             local-service
        host_name                       a01
        service_description             Current Load
        check_command                   check_local_load!5.0,4.0,3.0!10.0,6.0,4.0
}

service1.cfg

There is no service1.cfg by default; create one:

$ vim service1.cfg
define servicegroup {
        servicegroup_name ganglia-metrics1
        alias Ganglia Metrics1
}

## check_ganglia here is the check_ganglia command declared in commands.cfg
define service{
        use                             ganglia-service1
        service_description             Free memory
        check_command                   check_ganglia!mem_free!200!50
}

define service{
        use                             ganglia-service1
        service_description             NameNode syncs
        check_command                   check_ganglia!dfs.namenode.SyncsAvgTime!10!50
}

hadoop2.cfg

Note that every monitored item that uses the check_nrpe plugin must be declared in nrpe.cfg on hadoop2. In other words, each service's check_command only works if a command with exactly the same name is declared in that machine's nrpe.cfg.

$ cp localhost.cfg hadoop2.cfg
$ vim hadoop2.cfg
define host{
        use                     linux-server    ; Name of host template to use
                                                ; This host definition will inherit all variables that are defined
                                                ; in (or inherited by) the linux-server host template definition.
        host_name               a02
        alias                   a02
        address                 a02
        }

# Define an optional hostgroup for Linux machines
define hostgroup{
        hostgroup_name  a02     ; The name of the hostgroup
        alias           a02     ; Long name of the group
        members         a02     ; Comma separated list of hosts that belong to this group
        }

# Define a service to "ping" the local machine
define service{
        use                             local-service   ; Name of service template to use
        host_name                       a02
        service_description             PING
        check_command                   check_nrpe!check_ping
        }

# Define a service to check the disk space of the root partition.
# Warning if < 20% free, critical if < 10% free space on partition.
define service{
        use                             local-service
        host_name                       a02
        service_description             Root Partition
        check_command                   check_nrpe!check_sda2
        }

# Define a service to check the number of currently logged in users.
# Warning if > 20 users, critical if > 50 users.
define service{
        use                             local-service
        host_name                       a02
        service_description             Current Users
        check_command                   check_nrpe!check_users
        }

# Define a service to check the number of currently running procs.
# Warning if > 250 processes, critical if > 400 processes.
define service{
        use                             local-service
        host_name                       a02
        service_description             Total Processes
        check_command                   check_nrpe!check_total_procs
        }

define service{
        use                             local-service
        host_name                       a02
        service_description             Current Load
        check_command                   check_nrpe!check_load
        }

# Define a service to check the swap usage of the local machine.
# Critical if less than 10% of swap is free, warning if less than 20% is free.
define service{
        use                             local-service
        host_name                       a02
        service_description             Swap Usage
        check_command                   check_nrpe!check_swap
        }

That's hadoop2 done. Make 16 more copies, since the datanode configs are identical apart from the hostname:

$ for i in {3..18}; do cp hadoop2.cfg hadoop$i.cfg; done

Then just change the hostnames in the remaining copies (not spelled out again below); see the sed sketch that follows.
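A sketch of that fix-up with sed, assuming the zero-padded a01..a18 naming used above:

# Point each copy at its own host: a03 for hadoop3.cfg, and so on.
for i in {3..18}; do
    sed -i "s/a02/a$(printf '%02d' $i)/g" hadoop$i.cfg
done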

service2.cfg

Create and populate the file:

$ vim service2.cfg
define servicegroup {
        servicegroup_name ganglia-metrics2
        alias Ganglia Metrics2
}

define service{
        use                     ganglia-service2
        service_description     Free memory
        check_command           check_ganglia!mem_free!200!50
}

define service{
        use                     ganglia-service2
        service_description     RegionServer_Get
        check_command           check_ganglia!yarn.NodeManagerMetrics.AvailableVCores!7!7
}

define service{
        use                     ganglia-service2
        service_description     DataNode_Heartbeat
        check_command           check_ganglia!dfs.datanode.HeartbeatsAvgTime!15!40
}

That's service2 done. Make 16 copies; the datanode service configs differ only in servicegroup_name and the use line:

$ for i in {3..18}; do cp service2.cfg service$i.cfg; done

Then change each copy to its own number.
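Again a sed sketch, under the same assumptions:

# Renumber the servicegroup and template references in each copy.
for i in {3..18}; do
    sed -i "s/ganglia-metrics2/ganglia-metrics$i/g; s/ganglia-service2/ganglia-service$i/g; s/Metrics2/Metrics$i/g" service$i.cfg
done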

Edit nagios.cfg

$ vim ../nagios.cfg
cfg_file=/usr/local/nagios/etc/objects/commands.cfg
cfg_file=/usr/local/nagios/etc/objects/contacts.cfg
cfg_file=/usr/local/nagios/etc/objects/timeperiods.cfg
cfg_file=/usr/local/nagios/etc/objects/templates.cfg

# host files
cfg_file=/usr/local/nagios/etc/objects/hadoop1.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop2.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop3.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop4.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop5.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop6.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop7.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop8.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop9.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop10.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop11.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop12.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop13.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop14.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop15.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop16.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop17.cfg
cfg_file=/usr/local/nagios/etc/objects/hadoop18.cfg

# monitored-item (service) files
cfg_file=/usr/local/nagios/etc/objects/service1.cfg
cfg_file=/usr/local/nagios/etc/objects/service2.cfg
cfg_file=/usr/local/nagios/etc/objects/service3.cfg
cfg_file=/usr/local/nagios/etc/objects/service4.cfg
cfg_file=/usr/local/nagios/etc/objects/service5.cfg
cfg_file=/usr/local/nagios/etc/objects/service6.cfg
cfg_file=/usr/local/nagios/etc/objects/service7.cfg
cfg_file=/usr/local/nagios/etc/objects/service8.cfg
cfg_file=/usr/local/nagios/etc/objects/service9.cfg
cfg_file=/usr/local/nagios/etc/objects/service10.cfg
cfg_file=/usr/local/nagios/etc/objects/service11.cfg
cfg_file=/usr/local/nagios/etc/objects/service12.cfg
cfg_file=/usr/local/nagios/etc/objects/service13.cfg
cfg_file=/usr/local/nagios/etc/objects/service14.cfg
cfg_file=/usr/local/nagios/etc/objects/service15.cfg
cfg_file=/usr/local/nagios/etc/objects/service16.cfg
cfg_file=/usr/local/nagios/etc/objects/service17.cfg
cfg_file=/usr/local/nagios/etc/objects/service18.cfg

 

Validate the configuration:

$ pwd
/usr/local/nagios/etc
$ ../bin/nagios -v nagios.cfg

Nagios Core 4.1.1
Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 08-19-2015
License: GPL
Website: https://www.nagios.org
Reading configuration data...
   Read main config file okay...
   Read object config files okay...

Running pre-flight check on configuration data...

Checking objects...
    Checked 161 services.
    Checked 18 hosts.
    Checked 18 host groups.
    Checked 18 service groups.
    Checked 1 contacts.
    Checked 1 contact groups.
    Checked 26 commands.
    Checked 5 time periods.
    Checked 0 host escalations.
    Checked 0 service escalations.
Checking for circular paths...
    Checked 18 hosts
    Checked 0 service dependencies
    Checked 0 host dependencies
    Checked 5 timeperiods
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...

Total Warnings: 0
Total Errors:   0

Things look okay - No serious problems were detected during the pre-flight check

No errors, so start the Nagios service on hadoop1:

$ /etc/init.d/nagios start
Starting nagios: done.

NRPE is already running on the datanodes, so test that hadoop1 can communicate with each node's NRPE:

$ for i in {10..28}; do /usr/local/nagios/libexec/check_nrpe -H xx.xxx.x.$i; done
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15
NRPE v2.15

OK, communication is normal. Next verify that the check_ganglia.py plugin works:

$ /usr/local/nagios/libexec/check_ganglia.py -h a01 -m mem_free -w 200 -c 50
CHECKGANGLIA OK: mem_free is 61840868.00

It works. Now open the Nagios web page to see whether monitoring is up:

localhost:8080/nagios

4.4 Email alert configuration

First check whether sendmail is installed on the server:

$ rpm -q sendmail
$ yum install sendmail      # install it if missing
$ service sendmail restart  # restart sendmail

Sending mail to external addresses would normally require running our own mail server, which is cumbersome and resource-hungry. Instead, configure the system to relay through an existing SMTP server.

The configuration goes in /etc/mail.rc:

$ vim /etc/mail.rc
set from=systeminformation@xxx.com
set smtp=mail.xxx.com smtp-auth-user=systeminformation smtp-auth-password=111111 smtp-auth=login

With that in place, test from the command line whether mail can be sent:

$ echo "hello world" |mail -s "test" pingjie@xxx.com

If the message shows up in your inbox, sendmail is working.

Now configure Nagios's email alerting:

$ vim /usr/local/nagios/etc/objects/contacts.cfg
define contact{
        contact_name                    nagiosadmin             ; Short name of user
        use                             generic-contact         ; Inherit default values from generic-contact template (defined above)
        alias                           Nagios Admin            ; Full name of user
        ## notification time periods
        service_notification_period     24x7
        host_notification_period        24x7
        ## notification options
        service_notification_options    w,u,c,r,f,s
        host_notification_options       d,u,r,f,s
        ## notify by email
        service_notification_commands   notify-service-by-email
        host_notification_commands      notify-host-by-email
        email                           pingjie@xxx.com         ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
        }

# We only have one contact in this simple configuration file, so there is
# no need to create more than one contact group.
define contactgroup{
        contactgroup_name       admins
        alias                   Nagios Administrators
        members                 nagiosadmin
        }

That completes the configuration.

 

Monitoring Hadoop processes with scripts

1. Script to monitor the datanodes

The idea is to fetch the HDFS status page with Python and regex out the Live Nodes figure.

#!/usr/bin/env python

import sys
from optparse import OptionParser
import urllib
import re

def get_value():
    # Fetch the NameNode status page and extract the Live Nodes count.
    urlItem = urllib.urlopen("http://namenode:50070/dfshealth.jsp")
    html = urlItem.read()
    urlItem.close()
    return float(re.findall('.+Live Nodes  :\\s+(\d+)\\s+\\(Decommissioned: \d+\\).+', html)[0])

if __name__ == '__main__':

    parser = OptionParser(usage="%prog [-w] [-c]", version="%prog 1.0")
    parser.add_option("-w", "--warning", type="int", dest="w", default=16)
    parser.add_option("-c", "--critical", type="int", dest="c", default=15)
    (options, args) = parser.parse_args()

    if(options.c >= options.w):
        print '-w must be greater than -c'
        sys.exit(1)

    value = get_value()

    if(value <= options.c):
        print 'CRITICAL - Live Nodes %d' % (value)
        sys.exit(2)
    elif(value <= options.w):
        print 'WARNING - Live Nodes %d' % (value)
        sys.exit(1)
    else:
        print 'OK - Live Nodes %d' % (value)
        sys.exit(0)

2. Script to monitor DFS free space:

#!/usr/bin/env python

import sys
from optparse import OptionParser
import urllib
import re

def get_dfs_free_percent():
    # Fetch the NameNode status page and extract "DFS Remaining%".
    urlItem = urllib.urlopen("http://namenode:50070/dfshealth.jsp")
    html = urlItem.read()
    urlItem.close()
    return float(re.findall('.+ DFS Remaining% :\\s+(\d+\\.\d+)%.+', html)[0])

if __name__ == '__main__':
    parser = OptionParser(usage="%prog [-w] [-c]", version="%prog 1.0")
    parser.add_option("-w", "--warning", type="int", dest="w", default=30, help="total dfs used percent")
    parser.add_option("-c", "--critical", type="int", dest="c", default=20, help="total dfs used percent")
    (options, args) = parser.parse_args()

    if(options.c >= options.w):
        print '-w must be greater than -c'
        sys.exit(1)

    dfs_free_percent = get_dfs_free_percent()

    if(dfs_free_percent <= options.c):
        print 'CRITICAL - DFS free %d%%' % (dfs_free_percent)
        sys.exit(2)
    elif(dfs_free_percent <= options.w):
        print 'WARNING - DFS free %d%%' % (dfs_free_percent)
        sys.exit(1)
    else:
        print 'OK - DFS free %d%%' % (dfs_free_percent)
        sys.exit(0)

If a script errors out, drop into the Python REPL and adjust the regex against the actual HTML returned.

Copy these two scripts to /usr/local/nagios/libexec/ (the directory $USER1$ points to in the command definitions below).

Run each script once from the command line, e.g. ./check_hadoop_datanode.py. If it fails with:

: No such file or directory

open the file in vim, run :set ff=unix in command mode, and save; the scripts have DOS line endings. A shell equivalent is sketched below.
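Equivalently, from the shell (the filenames match the command definitions below):

# Strip DOS carriage returns and make the scripts executable.
sed -i 's/\r$//' check_hadoop_datanode.py check_hadoop_dfs.py
chmod +x check_hadoop_datanode.py check_hadoop_dfs.py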

3. Update the Nagios configuration

Add these two commands to commands.cfg:

$ vim /usr/local/nagios/etc/objects/commands.cfg
define command{
        command_name    check_datanode
        command_line    $USER1$/check_hadoop_datanode.py -w $ARG1$ -c $ARG2$
        }

define command{
        command_name    check_dfs
        command_line    $USER1$/check_hadoop_dfs.py -w $ARG1$ -c $ARG2$
        }

Edit service1.cfg and add these two services:

$ vim service1.cfg
define service{
        use                     ganglia-service1
        service_description     Live datanodes
        check_command           check_datanode!16!15
}

define service{
        use                     ganglia-service1
        service_description     DFS free space
        check_command           check_dfs!30!20
}

Done.

 

5 Problem log

5.1 A Ganglia metric that wouldn't go stale

Problem: to test Nagios alerting, I killed the datanode process on one node, but Nagios kept showing that datanode as healthy. Since those Nagios checks come from Ganglia, I looked at Ganglia, and it also showed the node as normal. Odd: why would a killed datanode keep reporting heartbeats?

Resolution: none found; if you know the cause, please share. (Plausibly gmond just keeps serving the last value it received, since most metrics here use the default dmax = 0 and never expire.) As a workaround, Nagios monitors the processes directly with the scripts above.

 

Reposted from: https://www.cnblogs.com/pingjie/p/4809489.html

查看>>