8. HBase in Action, Chapter 1: Introducing HBase (1.2.2 Capturing Incremental Data)

Time: 2023-11-25 09:44:32

Data often trickles in and is added to an existing data store for further use, such as analytics, processing, and serving. Many HBase use cases fall into this category: HBase acts as the data store that captures incremental data arriving from various sources. These sources can be, for example, web crawls (the canonical Bigtable use case discussed earlier), advertisement impression data recording which user saw which advertisement and for how long, or time-series data generated by recording metrics of various kinds. Let’s look at a few successful use cases and the companies behind them.

http://www.uifanr.com/

CAPTURING METRICS: OPENTSDB

Web-based products serving millions of users typically have hundreds or thousands of servers in their back-end infrastructure. These servers are spread across various functions: serving traffic, capturing logs, storing data, processing data, and so on. To keep the products up and running, it’s critical to monitor the health of the servers as well as the software running on them (from the OS right up to the application the user is interacting with). Monitoring the entire stack at scale requires systems that can collect and store metrics of all kinds from these different sources. Every company has its own way of achieving this. Some use proprietary tools to collect and visualize metrics; others use open source frameworks.


StumbleUpon built an open source framework that allows the company to collect metrics of all kinds into a single system. Metrics collected over time are essentially time-series data: that is, data collected and recorded over time. The framework that StumbleUpon built is called OpenTSDB, which stands for Open Time Series Database. It uses HBase at its core to store and access the collected metrics. The goal was an extensible metrics collection system that could store metrics and keep them available for access over a long period of time, and that would allow new kinds of metrics to be added as more features are added to the product. StumbleUpon uses OpenTSDB to monitor all of its infrastructure and software, including its HBase clusters. We cover OpenTSDB in detail in chapter 7 as a sample application built on top of HBase.
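To make the idea of storing time-series metrics in a sorted key-value store more concrete, here is a minimal Python sketch of one possible row-key layout. It is an illustration only and not necessarily the schema OpenTSDB uses: the `METRIC_IDS` table, the field widths, and the one-hour bucket width are all assumptions made for this example.

```python
import struct

# Hypothetical metric-name -> numeric ID table (an assumption for this sketch);
# a real system would persist this mapping somewhere durable.
METRIC_IDS = {"sys.cpu.user": 1, "sys.mem.free": 2}

BUCKET_SECONDS = 3600  # assumed bucket width: one row per metric per hour


def row_key(metric: str, timestamp: int) -> bytes:
    """Build a sortable key: 3-byte metric ID followed by 4-byte bucket start time."""
    metric_id = METRIC_IDS[metric]
    base_time = timestamp - (timestamp % BUCKET_SECONDS)
    return metric_id.to_bytes(3, "big") + struct.pack(">I", base_time)


def column_qualifier(timestamp: int) -> bytes:
    """Offset in seconds from the start of the bucket, stored as 2 bytes."""
    return struct.pack(">H", timestamp % BUCKET_SECONDS)
```

With this layout, all data points for one metric within the same hour share a row, and rows for the same metric sort in time order, so a range scan over one metric and a time window touches a contiguous block of keys.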


CAPTURING USER-INTERACTION DATA: FACEBOOK AND STUMBLEUPON

Metrics captured for monitoring are one category. There are also metrics about user interaction with a product. How do you keep track of the site activity of millions of people? How do you know which site features are most popular? How do you use one page view to directly influence the next? For example, who saw what, and how many times was a particular button clicked? Remember the Like button in Facebook and the Stumble and +1 buttons in StumbleUpon? Does this smell like a counting problem? Each of these buttons increments a counter every time a user likes a particular topic.


StumbleUpon started out on MySQL, but as the service became more popular, that technology choice failed it. The online demand of the growing user load was too much for the MySQL clusters, and ultimately StumbleUpon chose HBase to replace them. At the time, HBase didn’t directly support the necessary features, so StumbleUpon implemented atomic increment in HBase and contributed it back to the project.
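The reason an atomic increment matters can be shown with a small, self-contained Python sketch. It is a toy in-memory stand-in, not HBase code: the `CounterTable` class, its method names, and the `simulate_likes` helper are hypothetical and exist only to illustrate the semantics of a counter whose read, add, and write happen as one indivisible step.

```python
import threading
from collections import defaultdict


class CounterTable:
    """Toy in-memory stand-in for a table with an atomic increment operation."""

    def __init__(self):
        self._cells = defaultdict(int)
        self._lock = threading.Lock()

    def increment(self, row: str, column: str, amount: int = 1) -> int:
        # The read, add, and write happen under one lock, so no update is lost.
        with self._lock:
            self._cells[(row, column)] += amount
            return self._cells[(row, column)]

    def get(self, row: str, column: str) -> int:
        with self._lock:
            return self._cells[(row, column)]


def simulate_likes(table: CounterTable, page: str, clicks_per_user: int, users: int) -> int:
    """Have several concurrent 'users' click Like and return the final count."""

    def worker():
        for _ in range(clicks_per_user):
            table.increment(page, "likes")

    threads = [threading.Thread(target=worker) for _ in range(users)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return table.get(page, "likes")
```

If each client instead read the current value, added one, and wrote the result back as separate steps, two concurrent clicks could read the same value and one of them would be lost. Performing the increment on the server side as a single operation avoids that race.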


Facebook uses the counters in HBase to count the number of times people like a particular page. Content creators and page owners can get near real-time metrics about how many users like their pages, which allows them to make more informed decisions about what content to generate. Facebook built a system called Facebook Insights, which needs to be backed by a scalable storage system. The company looked at various options, including RDBMS, in-memory counters, and Cassandra, before settling on HBase. This way, Facebook can scale horizontally and provide the service to millions of users, while drawing on its existing experience in running large-scale HBase clusters. The system handles tens of billions of events per day and records hundreds of metrics.


TELEMETRY: MOZILLA AND TREND MICRO

Operational and software-quality data includes more than just metrics. Crash reports are an example of useful software-operational data that can be used to gain insight into the quality of the software and to plan the development roadmap. Such data isn’t necessarily related to web servers serving applications: HBase has been used successfully to capture and store crash reports generated by software crashes on users’ computers.


The Mozilla Foundation is responsible for the Firefox web browser and the Thunderbird email client. These tools are installed on millions of computers worldwide and run on a wide variety of OSs. When one of these tools crashes, it may send a crash report back to Mozilla in the form of a bug report. How does Mozilla collect these reports? What use are they once collected? The reports are collected via a system called Socorro and are used to direct development efforts toward more stable products. Socorro’s data storage and analytics are built on HBase.


The introduction of HBase enabled basic analysis over far more data than was previously possible. This analysis was used to direct Mozilla’s developer focus to great effect, resulting in the most bug-free release ever.


Trend Micro provides internet security and threat-management services to corporate clients. A key aspect of security is awareness, and log collection and analysis are critical for providing that awareness in computer systems. Trend Micro uses HBase to manage its web reputation database, which requires both row-level updates and support for batch processing with MapReduce. Much as in Mozilla’s Socorro, HBase is also used to collect and analyze log activity, collecting billions of records every day. The flexible schema in HBase allows data to evolve easily over time, and Trend Micro can add new attributes as analysis processes are refined.


ADVERTISEMENT IMPRESSIONS AND CLICK STREAM

Over the last decade or so, online advertisements have become a major source of revenue for web-based products. The model has been to provide free services to users but attach ads that are targeted to the user using the service at the time. This kind of targeting requires detailed capture and analysis of user-interaction data to understand the user’s profile. The ad to be displayed is then selected based on that profile. Fine-grained user-interaction data can lead to better models, which in turn lead to better ad targeting and hence more revenue. But this kind of data has two properties: it comes in the form of a continuous stream, and it can be easily partitioned based on the user. In an ideal world, this data should be available to use as soon as it’s generated, so the user-profile models can be improved continuously without delay, that is, in an online fashion.
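The two properties above (a continuous stream that partitions naturally by user) suggest keying each event by user first and time second. The following Python sketch shows that idea with a toy sorted store. The `SortedEventStore` class, the `#` separator, and the zero-padded timestamp format are assumptions made for this illustration and are not taken from any particular production schema.

```python
import bisect


class SortedEventStore:
    """Toy key-value store that keeps its keys in sorted order."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, user_id: str, timestamp: int, event: str) -> None:
        # Zero-pad the timestamp so that string order matches time order.
        key = f"{user_id}#{timestamp:010d}"
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = event

    def scan_user(self, user_id: str) -> list:
        """Return one user's events in time order using a prefix range scan."""
        prefix = f"{user_id}#"
        start = bisect.bisect_left(self._keys, prefix)
        events = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            events.append(self._values[key])
        return events
```

Because the user ID leads the key, new events can be appended as they arrive, and everything known about one user sits in a contiguous key range that can be read with a single scan when the profile model is updated.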


Online vs. offline systems

The terms online and offline have come up a couple of times. For the uninitiated, these terms describe the conditions under which a software system is expected to perform. Online systems have low-latency requirements. In some cases, it’s better for these systems to respond with no answer than to take too long producing the correct answer. You can think of a system as online if there’s a user at the other end impatiently tapping their foot. Offline systems don’t have this low-latency requirement. There’s a user waiting for an answer, but that response isn’t expected immediately.

Whether an application is meant to be an online or an offline system influences many technology decisions during implementation. HBase is an online system. Its tight integration with Hadoop MapReduce also makes it capable of offline access.


These factors make collecting user-interaction data a perfect fit for HBase. HBase has been used successfully to capture raw clickstream and user-interaction data incrementally and then process it (clean it, enrich it, use it) with different processing mechanisms, MapReduce being one of them. If you look for companies that do this, you’ll find plenty of examples.
