Basic Monitoring: Switch Monitoring

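The metrics collected from the switches include power supply, port status, port mapping (physical port & logical port), temperature, memory, CPU (max CPU, average CPU), inbound/outbound port traffic (peak traffic has to be computed), port bandwidth utilization, and so on. The metrics differ by brand. There are four main brands, brocade, blade, vdx and huawei, and only brocade exposes every metric. The collection interval is 1 minute. This article mainly describes the back-end logic of the switch monitoring system.

The monitored metrics are mainly bandwidth utilization, traffic fluctuation (fluctuation ratio), CPU usage, memory usage, temperature, port status (up->down) and port attenuation. We also need the traffic peak and the traffic total for each ope (carrier).

The company has a little over 120 switches, all registered in the CMDB. The record includes ip, name, bandwidth info, owner, fixed-asset number, brand and model, total port count, price, purchase date, warranty period, IDC, rack, size in U, rack-mount date, status, response level and remarks. Our system mainly needs the ip, the name (shown in alerts), the owner (mostly the infrastructure ops team), the brand and model, and the bandwidth info (the ope, its registered physical ports, and the bandwidth).

With that overview of the switches in place, on to the program. The architecture is simple and uses SNMP v2 to collect data: first configure routing and SNMP on the collection clients, then fetch the switch info from the CMDB, and then collect, report, display, and parse for alerts. Routing is configured by the infrastructure ops team; tell them which clients you need and they will set it up. The architecture of the program is as follows: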

[Figure: architecture of the switch monitoring program]

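For disaster tolerance and easy horizontal scaling, an ip scheduling center was added. It distributes the CMDB ips across the collection agents, which sit in different IDCs. If an IDC becomes unavailable, the scheduling center can assign its ips to agents in other IDCs. To scale out, add an agent, set up its snmp environment and routing, and update the scheduling center configuration.

The back end is written entirely in python and has three parts: the ip scheduling center, the collection client, and alert parsing.

I. IP scheduling center

The scheduling center does very little: it hands each collector the same number of ips. So I did not write a dedicated client/server program. It is just a shared interface, published under the open API (covered elsewhere in this ops monitoring series), that each collector calls.

The scheduling center configuration lists every collector ip. When a collector calls in, the interface fetches the switch info from the CMDB and splits it evenly among the collectors. The CMDB returns the records in a fixed order, so each collector always receives the same ip list; if an ip is added to the CMDB it usually lands on the last collector, so nothing is missed. As is easy to see, this is not really a random algorithm. (Laziness, really: the clients call the interface, every client calls it once, and if they call at the same time the server cannot assign accurately. A truly random assignment only needs a program that shuffles the ips across the collectors once and ships the result to each collector as a file for it to read.) To add a collector, just add its ip to agentList in the configuration. My program currently refreshes the collection list every 10 minutes. The code is below (some field names in the sql statement have been changed):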

def getConf(db):
    """Return {agent: {ip: info}}, splitting the CMDB switch list evenly."""
    retInfo = {}
    sql = ("select id, ip, model, width, name from "
           "switchConf where currentStatus=3 and ip > 0 "
           "order by id asc")
    code, count, errMsg = db.doSql(sql)
    if 0 != code:
        return retInfo

    info = {}
    ipList = []
    res = db.getRet()
    for each in res:
        ip = int(each[1])
        brand = each[2].strip()
        bw = each[3]
        name = each[4].strip()
        if ip in info:  # skip duplicate switch ips
            continue
        ipList.append(ip)  # keep the CMDB order
        info[ip] = dict(brand=brand, bandWidth=bw, name=name)

    agentNums = len(agentList)  # agentList: configured list of collector ips
    if 0 == agentNums:  # no collector configured, nothing to assign
        return retInfo
    ipNums = len(ipList)
    step = ipNums // agentNums  # integer division (also correct on python 3)
    start = 0
    for index in range(agentNums):
        agent = agentList[index]
        retInfo[agent] = {}
        if index == agentNums - 1:  # the last agent is filled in below
            break
        stop = start + step
        for i in range(start, stop):  # take the next `step` ips in order
            ip = ipList[i]
            retInfo[agent][ip] = info[ip]
        start += step
    for i in range(start, ipNums):  # the last agent takes all remaining ips
        ip = ipList[i]
        retInfo[agent][ip] = info[ip]

    return retInfo
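
The parenthetical above mentions that a genuinely random assignment only needs a one-off shuffle. The following is a minimal sketch of that idea, not the author's code: the function name `assign_randomly` and the `seed` parameter are hypothetical, and it targets Python 3.

```python
import random


def assign_randomly(ip_list, agent_list, seed=None):
    """Shuffle the switch ips once and deal them out round-robin.

    Hypothetical helper for illustration; not part of the original system.
    """
    if not agent_list:
        raise ValueError("agent_list must not be empty")
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    shuffled = list(ip_list)   # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    assignment = {agent: [] for agent in agent_list}
    for i, ip in enumerate(shuffled):
        # Round-robin keeps the per-agent counts within one of each other.
        assignment[agent_list[i % len(agent_list)]].append(ip)
    return assignment
```

Because the split is computed once in one place, concurrent calls from the collectors can no longer disagree; each collector would simply read its own list from the file it is sent.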

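II. Collection program

1. Collection instructions

One tedious part of switch monitoring is working out, from the vendor documentation, the object id of every metric to collect. We currently have four brands (brocade, blade, vdx, huawei) and every brand uses different oids. Sorting them out and verifying them took at least 2 weeks. [See Appendix (1)]

2. Collection command

Collection uses the snmp v2 protocol. The infrastructure ops team has to set up the snmpwalk environment first. For example, the command to collect the blade power supply status is:

snmpwalk -v 2c -c ** ip 1.3.6.1.4.1.26543.102.102.14.20

Here ** is the value passed to -c, the SNMP community string, which is defined per device configuration; I am not familiar with the details of that configuration, so I will not go into it. ip is, for example, 10.1.1.1, and the trailing 1.3.6.*** is the oid of the metric being collected.

Note: python also has an snmpwalk library, but other python libraries had often proved unstable for us, so I did not use it; calling the shell command directly is more convenient and more stable.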
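
As an illustration of calling the shell command from python, here is a minimal Python 3 sketch. The helper names `build_cmd`, `parse_walk` and `walk` are hypothetical, and the parser assumes the usual net-snmp output shape of `OID = TYPE: value`, one entry per line. The `timeout` parameter is optional and defaults to no limit.

```python
import subprocess


def build_cmd(community, host, oid):
    """Build the snmpwalk command shown above as an argument list."""
    return ["snmpwalk", "-v", "2c", "-c", community, host, oid]


def parse_walk(output):
    """Parse 'OID = TYPE: value' lines into (oid, type, value) tuples."""
    rows = []
    for line in output.splitlines():
        if " = " not in line:
            continue  # skip blank lines and error text
        oid, _, rest = line.partition(" = ")
        if ": " in rest:
            vtype, _, value = rest.partition(": ")
        else:
            vtype, value = "", rest  # e.g. an empty string value
        rows.append((oid.strip(), vtype.strip(), value.strip()))
    return rows


def walk(community, host, oid, timeout=None):
    """Run snmpwalk and return the parsed rows, or [] on any failure."""
    try:
        out = subprocess.check_output(
            build_cmd(community, host, oid),
            stderr=subprocess.STDOUT, timeout=timeout)
    except (OSError, subprocess.SubprocessError):
        return []  # binary missing, non-zero exit status, or timeout
    return parse_walk(out.decode("utf-8", "replace"))
```

Passing the command as a list avoids shell quoting problems with the community string.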

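3. Collection logic

Every collector runs a complete copy of the collection program and collects from the ip list assigned by the scheduling center. It calls the shell commands through the subprocess library and uses a thread pool with one thread per ip, so that all of the collector's ips finish within 1 minute as far as possible. (No timeout is needed, because every command takes a different amount of time and one ip may need several commands. For more speed one could use one thread per command, but subprocess spawns a child process each time, which is relatively expensive in system resources.) The main function is: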

    def run(self):
        self.getConfFromOPI()
        if not self.ipInfo:  # nothing to collect
            self.amcAlert(SC.SM_SYS_ERROR, SC.bindIp,
                          "get nothing(no ip) from cache/CMDB",
                          SC.admins, 0)
            self.log.close()
            self.db.close()
            return
        POOL_SIZE = len(self.ipInfo)  # pool size = ip nums
        self.log.info("init thread pool (size=%s)" % POOL_SIZE)
        pool = Pool.ThreadPool(POOL_SIZE)  # size also counts blacklisted ips
        for ip in self.ipInfo:
            ipStr = self.ipInfo[ip]["ipStr"]
            if ipStr in SC.CONF_BLACK_IP:
                self.log.warn("refuse black list ip: %s" % ipStr)
                continue
            kw = dict(handlerKey=ip, ip=ip)  # kw for test
            pool.addJob(self.getSwitchInfo, **kw)  # JUST FOR TEST
        pool.waitForComplete()  # wait for all ip done sleep 0.1
        if self.CMDBInfo is not None:
            self.writeConfToCache(self.CMDBInfo)  # written here to stay in step with the parser
        self.log.info("all of ip(%s) done job" % POOL_SIZE)
        self.log.close()
        self.db.close()
        return

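The thread pool is sized by the number of ips the collector received. The per-thread collection logic lives mainly in self.getSwitchInfo, which picks a different collection function for each brand. Once the data is collected, the values that need further calculation are written to redis, and the data used for display and parsing is saved to mysql.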
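
The Pool.ThreadPool class used above is the author's own and is not shown. As a rough equivalent, the same one-thread-per-ip pattern can be sketched with the standard library of Python 3. The function `collect_all` and its parameters are hypothetical names chosen for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def collect_all(ip_info, collect_one, black_list=()):
    """Run collect_one(ip) for every non-blacklisted ip, one thread per ip."""
    targets = [ip for ip in ip_info
               if ip_info[ip]["ipStr"] not in black_list]
    results = {}
    if not targets:
        return results
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        futures = {ip: pool.submit(collect_one, ip) for ip in targets}
        for ip, future in futures.items():
            try:
                results[ip] = future.result()  # blocks until this ip is done
            except Exception as exc:
                # One failing switch must not stop the others.
                results[ip] = exc
    return results
```

Leaving the `with` block waits for all submitted jobs, which plays the role of pool.waitForComplete() in the original.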

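4. Traffic collection

Traffic is the most important of the collected metrics. On brocade the command returns the current rate directly (Shows the bit rate, in kilobits per second, received on a 10 Gigabit or faster interface within a five minute interval). The other brands only expose counters, so the rate in Mbit/s is the difference between two samples divided by the time between them. The first sample is therefore stored in a local cache file, in ini format so that it is easy to read back. After the second sample, the cached value is read and combined with it to obtain the current rate, which is saved to redis and mysql. The code is as follows: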

        def getPortInfo(getPort, getPortStatus, getEth, VALUE, CACHE=False):
            # ip, ipStr, brand, brandLog, bandInfo, ethFlow, ethUtil and
            # portStatus are not defined here; they are taken from the
            # enclosing scope. VALUE is the counter maximum; CACHE=True
            # means the rate is computed from the cached previous sample.
            lastCacheFile = os.path.join(
                    SC.LAST_CACHE_PATH, "%s-%s.cache" % (ipStr, brand))
            if not os.path.isfile(lastCacheFile):
                open(lastCacheFile, 'w').close()  # create an empty cache file
            portInfo = getPort(ipStr, brandLog)
            if not portInfo["pPort"]:  # no ports returned
                if CACHE:  # flow needs the cache: delete the cache file
                    os.remove(lastCacheFile)
                return
            ps = getPortStatus(ipStr, brandLog)
            rhMinKey = "%s-%s" % (ip, self.reportTimeInt)  # redis hash name
            rhMinEth = dict()
            rhPeakKey = "%s-%s" % (ip, self.curYM)
            rhPeakEth = self.peakRedis.hgetall(rhPeakKey)
            if not rhPeakEth:
                rhPeakFirst = 1  # first occurrence
            else:
                rhPeakFirst = 0
            lastCache = ConfigParser.ConfigParser()
            section = brand.lower()
            if CACHE:  # without CACHE the file stays empty
                with open(lastCacheFile) as fd:
                    lastCache.readfp(fd)
                if section not in lastCache.sections():
                    lastCache.add_section(section)  # same lower-case name used below

            for ope in bandInfo:
                if ope not in ethFlow:
                    ethFlow[ope] = dict(ethFlowOut=0, ethFlowIn=0)
                for p in bandInfo[ope]["pPort"]:  # CMDB phy port
                    if p not in portInfo["pPort"]:  # snmpwalk phy port
                        continue
                    port = portInfo["pPort"][p]  # int logic port
                    bandInfo[ope]["pPort"][p] = port  # update port from None
                    self.portInfo[ip]["pPort"][p] = port  # phy -> logic
                    port = str(port)
                    self.portInfo[ip]["lPort"][port] = p  # logic -> phy
                    if port in ps:
                        portStatus[p] = ps[port]
                    else:
                        portStatus[p] = SC.PORT_STATUS_DOWN  # not exists
                    ethInfo = getEth(ipStr, port, brandLog)
                    # util is None if the brand does not collect it, and 0
                    # if it is collected but the command failed.
                    # Every brand collects flow, so flow is never None.
                    # Some models do not report port utilization; if both
                    # values are None, ethUtil stays an empty {}.
                    if (ethInfo["ethUtilOut"] is not None) and (
                            ethInfo["ethUtilIn"] is not None):
                        if 0 > ethInfo["ethUtilOut"]:
                            ethInfo["ethUtilOut"] = 0
                        if 0 > ethInfo["ethUtilIn"]:
                            ethInfo["ethUtilIn"] = 0
                        if ope not in ethUtil:
                            ethUtil[ope] = dict()
                        ethUtil[ope][p] = dict(  # physical port
                                ethUtilOut=ethInfo["ethUtilOut"],
                                ethUtilIn=ethInfo["ethUtilIn"])

                        rInKey = "%s-%s-inUtil" % (ope, p)  # physical port
                        rOutKey = "%s-%s-outUtil" % (ope, p)
                        rhMinEth[rInKey] = ethInfo["ethUtilIn"]
                        rhMinEth[rOutKey] = ethInfo["ethUtilOut"]

                    if not CACHE:  # the value is already a rate: just sum it
                        ethFlow[ope]["ethFlowOut"] += ethInfo["ethFlowOut"]
                        ethFlow[ope]["ethFlowIn"] += ethInfo["ethFlowIn"]
                        continue

                    curTime = int(time.time())  # now time
                    opeLow = ope.lower()
                    flowInKey = "eth_flow_in_%s_%s_%s" % (ip, opeLow, port)
                    flowOutKey = "eth_flow_out_%s_%s_%s" % (ip, opeLow, port)
                    timeKey = "eth_flow_time_%s_%s_%s" % (ip, opeLow, port)
                    options = lastCache.options(section)
                    if timeKey not in options:
                        timeInter = 0  # no previous timestamp: use 0
                    else:
                        timeVal = lastCache.getint(section, timeKey)
                        timeInter = curTime - timeVal
                        if timeInter <= 0:
                            timeInter = 0
                    if flowInKey not in options:
                        inInter = 0  # the first time is 0
                    else:
                        if 0 > ethInfo["ethFlowIn"]:  # below 0 means failure
                            inInter = 0
                        else:
                            flowIn = lastCache.getfloat(section, flowInKey)
                            if 0 > flowIn:
                                inInter = 0
                            else:  # int() avoids float error
                                inInter = int(ethInfo["ethFlowIn"] - flowIn)
                                if inInter < 0:  # 64-bit cannot lap twice in 1 minute
                                    inInter += VALUE
                    if flowOutKey not in options:
                        outInter = 0  # the first time is 0
                    else:
                        if 0 > ethInfo["ethFlowOut"]:  # below 0 means failure
                            outInter = 0
                        else:
                            flowOut = lastCache.getfloat(section, flowOutKey)
                            if 0 > flowOut:
                                outInter = 0
                            else:  # int() avoids float error
                                outInter = int(ethInfo["ethFlowOut"] - flowOut)
                                if outInter < 0:  # 64-bit cannot lap twice in 1 minute
                                    outInter += VALUE
                    lastCache.set(section, timeKey, curTime)
                    lastCache.set(
                                section, flowInKey, ethInfo["ethFlowIn"])
                    lastCache.set(
                                section, flowOutKey, ethInfo["ethFlowOut"])
                    if 0 == timeInter:  # no usable time interval
                        ethFlow[ope]["ethFlowIn"] += 0
                        ethFlow[ope]["ethFlowOut"] += 0
                    else:
                        # float() keeps the fraction; int / int truncates
                        # on python 2 and makes round(..., 2) pointless.
                        if 0 == inInter:
                            ethFlow[ope]["ethFlowIn"] += 0
                        else:
                            ethFlow[ope]["ethFlowIn"] += round(
                                    inInter / float(timeInter), 2)
                        if 0 == outInter:
                            ethFlow[ope]["ethFlowOut"] += 0
                        else:
                            ethFlow[ope]["ethFlowOut"] += round(
                                    outInter / float(timeInter), 2)

                ethFlow[ope]["ethFlowOut"] = round(
                        ethFlow[ope]["ethFlowOut"], 2)
                ethFlow[ope]["ethFlowIn"] = round(
                        ethFlow[ope]["ethFlowIn"], 2)
            if CACHE:  # sync the cache file
                options = lastCache.options(section)
                for option in options:
                    port = option.split('_')[-1]  # the key ends with the port
                    if port not in self.portInfo[ip]["lPort"]:
                        lastCache.remove_option(section, option)
                with open(lastCacheFile, 'w') as fd:  # close the file after writing
                    lastCache.write(fd)
            self.ipInfo[ip]["bandInfo"] = bandInfo

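(1) Each ip has its own cache file, so an ip that runs for more than 1 minute does not affect the traffic calculation of any other ip. That ip's own traffic figure may still be wrong, of course, if the command that overran was the traffic collection command.

(2) When computing from the cache file, check whether a field is missing; a missing field is treated as 0 (collection failed). After every calculation the cache file must be synced so that it only holds the current ports. The system computes traffic per port, so a port counted twice would inflate the traffic.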
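
Points (1) and (2) can be sketched in isolation as follows. This is an illustrative Python 3 version using the standard configparser module, not the author's code; the function names `load_last` and `save_samples` are hypothetical, and keys are assumed to end in `_<port>` as in the listing above.

```python
import configparser


def load_last(path, section, key, default=None):
    """Return the cached float for key, or default if anything is missing."""
    cache = configparser.ConfigParser()
    cache.read(path)  # a missing file is silently ignored
    if not cache.has_option(section, key):
        return default
    return cache.getfloat(section, key)


def save_samples(path, section, samples, live_ports):
    """Merge new samples into the cache and drop options of vanished ports.

    live_ports is a set of port names as strings.
    """
    cache = configparser.ConfigParser()
    cache.read(path)
    if not cache.has_section(section):
        cache.add_section(section)
    for key, value in samples.items():
        cache.set(section, key, str(value))  # configparser stores strings
    for option in cache.options(section):
        if option.rsplit("_", 1)[-1] not in live_ports:
            cache.remove_option(section, option)  # stale port: remove it
    with open(path, "w") as fd:
        cache.write(fd)
```

Keeping one file per ip and brand, as the original does, means a slow switch only ever rewrites its own file.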

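(3) The last point to watch is that the difference between two samples can be negative. That means the counter has wrapped around, and the counter maximum has to be added to get the real difference. We use 64-bit counters, so the maximum is pow(2, 64)/1000.0/1000.0*8=147573952589676.41. This value is enormous and our fastest port is only 10Gbit/s, so the difference between two samples can never be 0 because of the counter making exactly one full lap. One problem does show up: the vdx counter behaves differently from the others. When its current value is lower than the previous one, the calculation above yields a rate of more than 2000G, which is obviously wrong. So we use a compromise: whenever the computed rate of a port exceeds 10Gbit/s, the previous value (read from redis) replaces the rate for the current minute.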
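
The wrap-around correction and the 10Gbit/s fallback from point (3) can be condensed into one small function. This is a minimal Python 3 sketch under stated assumptions: the name `flow_rate` and its parameters are hypothetical, counter samples are assumed to be already converted to Mbit (which is what the constant above implies), and `last_rate` stands for the value the original reads back from redis.

```python
COUNTER64_MAX_MBIT = pow(2, 64) / 1000.0 / 1000.0 * 8  # counter range in Mbit
PORT_LIMIT_MBIT_S = 10 * 1000.0  # 10Gbit/s, the fastest port mentioned above


def flow_rate(prev, cur, seconds, last_rate,
              wrap=COUNTER64_MAX_MBIT, limit=PORT_LIMIT_MBIT_S):
    """Return the rate in Mbit/s between two counter samples."""
    if seconds <= 0 or prev < 0 or cur < 0:
        return 0.0  # no usable interval, or one of the samples failed
    delta = cur - prev
    if delta < 0:
        delta += wrap  # the counter wrapped once between the samples
    rate = round(delta / float(seconds), 2)
    if rate > limit:
        return last_rate  # implausible value: reuse the previous minute
    return rate
```

For example, two samples of 100.0 and 700.0 Mbit taken 60 seconds apart give 10.0 Mbit/s, while a sample that goes backwards on a 64-bit counter produces a rate far above the limit and falls back to `last_rate`.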

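III. Alert parsing

Every minute, the data reported by the collectors for the previous minute is checked against the configured thresholds to detect switch anomalies and raise alerts. Each brand has its own parser, which then sends rtx, wechat and sms alerts.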
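
The parsers themselves are not shown in this article. As a rough illustration of the two kinds of check described (threshold comparison and the port up->down transition), here is a minimal Python 3 sketch. The function `find_alerts`, the shape of the sample dicts and the threshold values are all assumptions made for this example; the status codes are the standard ifOperStatus values up(1) and down(2).

```python
PORT_UP, PORT_DOWN = 1, 2  # ifOperStatus: up(1), down(2)


def find_alerts(prev, cur, thresholds):
    """Compare two one-minute samples of one switch; return alert strings."""
    alerts = []
    for metric, limit in sorted(thresholds.items()):
        value = cur.get(metric)
        if value is not None and value > limit:
            alerts.append("%s=%s exceeds %s" % (metric, value, limit))
    prev_ports = prev.get("ports", {})
    for port, status in sorted(cur.get("ports", {}).items()):
        # Only a change from up to down alerts, not a port that stays down.
        if prev_ports.get(port) == PORT_UP and status == PORT_DOWN:
            alerts.append("port %s changed up->down" % port)
    return alerts
```

Alerting only on the transition keeps a port that is permanently down from raising the same alert every minute.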

IV. Display

Shows switch anomalies, status charts and historical data.

V. Appendix

(1) Collection oids for the four brands

1. Brocade

1.1 Port information

    1.3.6.1.4.1.1991.1.1.3.3.5.1.18

1.2 Port status

    1.3.6.1.4.1.1991.1.1.3.3.5.1.11

1.3 Inbound/outbound utilization and inbound/outbound port traffic

    1.3.6.1.4.1.1991.1.1.3.3.5.1.52.$port

    1.3.6.1.4.1.1991.1.1.3.3.5.1.53.$port

    1.3.6.1.4.1.1991.1.1.3.3.5.1.54.$port

    1.3.6.1.4.1.1991.1.1.3.3.5.1.55.$port

1.4 Power supply status

    1.3.6.1.4.1.1991.1.1.1.2.1.1.3

1.5 Fan status

    1.3.6.1.4.1.1991.1.1.1.3.1.1.3

1.6 Temperature

    1.3.6.1.4.1.1991.1.1.2.13.1.1.4

1.7 CPU usage

    1.3.6.1.4.1.1991.1.1.2.11.1.1.5

1.8 Memory usage

    1.3.6.1.4.1.1991.1.1.2.1.53

2. Blade

2.1 Port information

    ifDescr

2.2 Port status

    ifOperStatus

2.3 Inbound/outbound port traffic

    ifHCInOctets.$port

    ifHCOutOctets.$port

2.4 Power supply information

    1.3.6.1.4.1.26543.102.102.14.20

    1.3.6.1.4.1.26543.102.102.14.21

2.5 Temperature

    1.3.6.1.4.1.26543.102.102.14.11

    1.3.6.1.4.1.26543.102.102.14.12

    1.3.6.1.4.1.26543.102.102.14.13

3. Vdx

3.1 Port information

    ifDescr

3.2 Port status

    ifOperStatus

3.3 Inbound/outbound port traffic

    1.3.6.1.2.1.31.1.1.1.6.$port

    1.3.6.1.2.1.31.1.1.1.10.$port

3.4 CPU usage

    1.3.6.1.4.1.1588.2.1.1.1.26.1

3.5 Memory usage

    1.3.6.1.4.1.1588.2.1.1.1.26.6

3.6 Temperature/fan/power supply information

    1.3.6.1.4.1.1588.2.1.1.1.1.22.1.2

    1.3.6.1.4.1.1588.2.1.1.1.1.22.1.4

4. Huawei

4.1 Port information

    ifDescr

4.2 Port status

    ifOperStatus

4.3 Inbound/outbound port traffic

    ifHCInOctets.$port

    ifHCOutOctets.$port

4.4 CPU usage

    1.3.6.1.4.1.2011.6.3.4.1.2

4.5 Memory usage

    1.3.6.1.4.1.1588.2.1.1.1.26.6
