Toward Scalable Systems for Big Data Analytics: A Technology Tutorial (I - III)


ABSTRACT Recent technological advancements have led to a deluge of data from distinctive domains (e.g., health care and scientific sensors, user-generated data, Internet and financial companies, and supply chain systems) over the past two decades. The term big data was coined to capture the meaning of this emerging trend. In addition to its sheer volume, big data also exhibits other unique characteristics compared with traditional data. For instance, big data is commonly unstructured and requires more real-time analysis. This development calls for new system architectures for data acquisition, transmission, and storage, and for large-scale data processing mechanisms. In this paper, we present a literature survey and system tutorial for big data analytics platforms, aiming to provide an overall picture for nonexpert readers and to instill a do-it-yourself spirit in advanced audiences to customize their own big-data solutions. First, we present the definition of big data and discuss big data challenges. Next, we present a systematic framework to decompose big data systems into four sequential modules, namely data generation, data acquisition, data storage, and data analytics. These four modules form a big data value chain. Following that, we present a detailed survey of numerous approaches and mechanisms from the research and industry communities. In addition, we present the prevalent Hadoop framework for addressing big data challenges. Finally, we outline several evaluation benchmarks and potential research directions for big data systems.

INDEX TERMS Big data analytics, cloud computing, data acquisition, data storage, data analytics, Hadoop

I.
INTRODUCTION

The emerging big-data paradigm, owing to its broad impact, has profoundly transformed our society and will continue to attract diverse attention from both technological experts and the public in general. It is obvious that we are living in a data deluge era, evidenced by the sheer volume of data from a variety of sources and its growing rate of generation. For instance, an IDC report predicts that, from 2005 to 2020, the global data volume will grow by a factor of 300, from 130 exabytes to 40,000 exabytes, doubling every two years. The term "big data" was coined to capture the profound meaning of this data-explosion trend, and indeed data has been touted as the new oil that is expected to transform our society. For example, a McKinsey report estimates that global personal location data could generate $100 billion in revenue for service providers over the next ten years and as much as $700 billion in value for consumer and business end users. The huge potential associated with big data has led to an emerging research field that has quickly attracted tremendous interest from diverse sectors, for example, industry, government, and the research community. The broad interest is first exemplified by coverage in both industrial reports and public media (e.g., the Economist, the New York Times, and National Public Radio (NPR)). Governments have also played a major role in creating new programs to accelerate progress in tackling big data challenges. Finally, Nature and Science have published special issues to discuss the big-data phenomenon and its challenges, extending its impact beyond technological domains. As a result, this growing interest in big data from diverse domains demands a clear and intuitive understanding of its definition, evolutionary history, building technologies, and potential challenges.

This tutorial paper focuses on scalable big-data systems, which include a set of tools and mechanisms to load, extract, and improve disparate data while leveraging massively parallel processing power to perform complex transformations and analysis. Owing to the uniqueness of big data, designing a scalable big-data system faces a series of technical challenges, including the following:

  • First, due to the variety of disparate data sources and the sheer volume, it is difficult to collect and integrate data with scalability from distributed locations. For instance, more than 175 million tweets containing text, images, video, and social relationships are generated by millions of accounts distributed globally.

  • Second, big data systems need to store and manage the gathered massive and heterogeneous datasets while providing functional and performance guarantees in terms of fast retrieval, scalability, and privacy protection. For example, Facebook needs to store, access, and analyze over 30 petabytes of user-generated data.

  • Third, big data analytics must effectively mine massive datasets at different levels in real time or near real time - including modeling, visualization, prediction, and optimization - such that inherent value can be revealed to improve decision making and acquire further advantages.

These technological challenges demand a thorough re-examination of current data management systems, ranging from their architectural principles to their implementation details. Indeed, many leading industry companies have discarded traditional solutions to embrace the emerging big data platforms.

However, traditional data management and analysis systems, mainly based on the relational database management system (RDBMS), are inadequate for tackling the aforementioned big-data challenges. Specifically, the mismatch between traditional RDBMSs and the emerging big-data paradigm falls into the following two aspects:

  • From
    the perspective of data structure, RDBMSs can only support
    structured data, but offer little support for semi-structured or
    unstructured data.

  • From the perspective of scalability, RDBMSs scale up with expensive hardware and cannot scale out with commodity hardware in parallel, which makes them unsuitable for coping with the ever-growing data volume.

To address these challenges, the research community and industry have proposed various solutions for big data systems in an ad hoc manner. Cloud computing can be deployed as the infrastructure layer for big data systems to meet certain infrastructure requirements, such as cost-effectiveness, elasticity, and the ability to scale up or down. Distributed file systems and NoSQL databases are suitable for persistent storage and the management of massive schema-free datasets. MapReduce, a programming framework, has achieved great success in processing group-aggregation tasks, such as website ranking. Hadoop integrates data storage, data processing, system management, and other modules to form a powerful system-level solution, which is becoming the mainstay in handling big data challenges. We can construct various big data applications based on these innovative technologies and platforms. In light of the proliferation of big-data technologies, a systematic framework is needed to capture the fast evolution of big-data research and development efforts and to put developments on different frontiers in perspective.

In this paper, drawing on our first-hand experience of building a big-data solution on our private modular data center testbed, we strive to offer a systematic tutorial for scalable big-data systems, focusing on the enabling technologies and the architectural principles. It is our humble expectation that the paper can serve as a first stop for domain experts, big-data users, and the general audience looking for information and guidance on their specific needs for big-data solutions. For example, domain experts could follow our guidelines to develop their own big-data platforms and conduct research in the big-data domain; big-data users can use our framework to evaluate alternative solutions proposed by their vendors; and the general audience can understand the basics of big data and its impact on their work and life. For this purpose, we first present a list of alternative definitions of big data, supplemented with the history of big data and big-data paradigms. Following that, we introduce a generic framework to decompose big data platforms into four components, i.e., data generation, data acquisition, data storage, and data analysis. For each stage, we survey current research and development efforts and provide engineering insights for architectural design. Moving toward a specific solution, we then delve into Hadoop, the de facto choice for big data analytics platforms, and provide benchmark results for big-data platforms.

The rest of this paper is organized as follows. In Section II, we present the definition of big data and its brief history, in addition to processing paradigms. Then, in Section III, we introduce the big data value chain (which is composed of four phases), the big data technology map, the layered system architecture, and challenges. The next four sections describe the different phases associated with the big data value chain. Specifically, Section IV focuses on big data generation and introduces representative big data sources. Section V discusses big data acquisition and presents data collection, data transmission, and data preprocessing technologies. Section VI investigates big data storage approaches and programming models. Section VII discusses big data analytics, and several applications are discussed in Section VIII. Section IX introduces Hadoop, which is the current mainstay of the big data movement. Section X outlines several benchmarks for evaluating the performance of big data systems. A brief conclusion with recommendations for future studies is presented in Section XI.

II.
BIG DATA: DEFINITION, HISTORY AND PARADIGMS

In
this section, we first present a list of popular definitions of big
data, followed by a brief history of its evolution. This section
also discusses two alternative paradigms, streaming processing and
batch processing.

A.
BIG DATA DEFINITION

Given
its current popularity, the definition of big data is rather diverse,
and reaching a consensus is difficult. Fundamentally, big data means
not only a large volume of data but also other features that
differentiate it from the concepts of “massive data” and “very
large data”. In fact, several definitions for big data are found in
the literature, and three types of definitions play an important role
in shaping how big data is viewed:

  • Attribute Definition: IDC is a pioneer in studying big data and its impact. It defined big data in a 2011 report sponsored by EMC (the cloud computing leader): "Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis." This definition delineates the four salient features of big data, i.e., volume, variety, velocity, and value. As a result, the "4Vs" definition has been used widely to characterize big data. A similar description appeared in a 2001 research report in which META Group (now Gartner) analyst Doug Laney noted that data growth challenges and opportunities are three-dimensional, i.e., increasing volume, velocity, and variety. Although this description was not originally meant to define big data, Gartner and much of the industry, including IBM and certain Microsoft researchers, continue to use this "3Vs" model to describe big data 10 years later.

  • Comparative Definition: In 2011, McKinsey's report defined big data as "datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze." This definition is subjective and does not define big data in terms of any particular metric. However, it incorporates an evolutionary aspect in the definition (over time or across sectors) of what a dataset must be to be considered big data.

  • Architectural Definition: The National Institute of Standards and Technology (NIST) suggests that "Big data is where the data volume, acquisition velocity, or data representation limits the ability to perform effective analysis using traditional relational approaches or requires the use of significant horizontal scaling for efficient processing." In particular, big data can be further categorized into big data science and big data frameworks. Big data science is "the study of techniques covering the acquisition, conditioning, and evaluation of big data," whereas big data frameworks are "software libraries along with their associated algorithms that enable distributed processing and analysis of big data problems across clusters of computer units." An instantiation of one or more big data frameworks is known as big data infrastructure.

Concurrently,
there has been much discussion in various industries and academia
about what big data actually means.

However,
reaching a consensus about the definition of big data is difficult,
if not impossible. A logical choice might be to embrace all the
alternative definitions, each of which focuses on a specific aspect
of big data. In this paper, we take this approach and embark on
developing an understanding of common problems and approaches in big
data science and engineering.

C.
BIG-DATA PARADIGMS: STREAMING VS. BATCH

Big
data analytics is the process of using analysis algorithms running on
powerful supporting platforms to uncover the potential value concealed in big
data, such as hidden patterns or unknown correlations. According to
the processing time requirement, big data analytics can be
categorized into two alternative paradigms:

  • Streaming Processing: The starting point for the streaming processing paradigm is the assumption that the potential value of data depends on data freshness. Thus, the streaming processing paradigm analyzes data as soon as possible to derive its results. In this paradigm, data arrive in a stream. Because the stream is fast, continuous, and enormous in volume, only a small portion of it is stored in limited memory, and one or a few passes over the stream are made to find approximate results. Streaming processing theory and technology have been studied for decades. Representative open source systems include Storm, S4, and Kafka. The streaming processing paradigm is used for online applications, commonly at the second, or even millisecond, level.

  • Batch Processing: In the batch-processing paradigm, data are first stored and then analyzed. MapReduce has become the dominant batch-processing model. The core idea of MapReduce is that data are first divided into small chunks. Next, these chunks are processed in parallel and in a distributed manner to generate intermediate results. The final result is derived by aggregating all the intermediate results. This model schedules computation close to the data location, which avoids the communication overhead of data transmission. The MapReduce model is simple and widely applied in bioinformatics, web mining, and machine learning (a minimal sketch of this flow is given after this list).
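
To make the batch-processing flow concrete, the following minimal Python sketch simulates the MapReduce steps described above (divide into chunks, map over each chunk, group the intermediate pairs, then reduce) on a single machine. It is an illustration only, not tied to Hadoop or any particular framework, and the function names are ours.

    # Minimal, single-machine illustration of the MapReduce flow:
    # split -> map over chunks -> group intermediate pairs -> reduce.
    from collections import defaultdict

    def map_phase(chunk):
        # Emit (word, 1) pairs for every word in one chunk of text.
        return [(word, 1) for word in chunk.split()]

    def shuffle(intermediate_pairs):
        # Group intermediate values by key, as the framework would do.
        groups = defaultdict(list)
        for key, value in intermediate_pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(key, values):
        # Aggregate all intermediate values for one key.
        return key, sum(values)

    if __name__ == "__main__":
        chunks = ["big data systems", "big data analytics", "data storage"]
        intermediate = [pair for chunk in chunks for pair in map_phase(chunk)]
        result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
        print(result)  # {'big': 2, 'data': 3, 'systems': 1, 'analytics': 1, 'storage': 1}

In a real deployment, the framework would distribute the chunks across worker nodes and move the map tasks to where the data reside, as described above.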

There
are many differences between these two processing paradigms. In
general, the streaming processing paradigm is suitable for
applications in which data are generated in the form of a stream and
rapid processing is required to obtain approximate results.
Therefore, the streaming processing paradigm's application domains
are relatively narrow. Recently, most applications have adopted the
batch-processing paradigm; even certain real-time processing
applications use the batch-processing paradigm to achieve a faster
response. Moreover, some research effort has been made to integrate
the advantages of these two paradigms.

Big
data platforms can use alternative processing paradigms; however, the
differences in these two paradigms will cause architectural
distinctions in the associated platforms. For example,
batch-processing-based platforms typically encompass complex data
storage and management systems, whereas streaming-processing-based
platforms do not. In practice, we can customize the platform
according to the data characteristics and application requirements.
Because the batch-processing paradigm is widely adopted, we only
consider batch-processing-based big data platforms in this paper.

III.
BIG-DATA SYSTEM ARCHITECTURE

In
this section, we focus on the value chain for big data analytics.
Specifically, we describe a big data value chain that consists of
four stages (generation, acquisition, storage, and processing). Next,
we present a big data technology map that associates the leading
technologies in this domain with specific phases in the big data
value chain and a time stamp.

A.
BIG-DATA SYSTEM: A VALUE-CHAIN VIEW

A big-data system is complex, providing functions to deal with different phases of the digital data life cycle, ranging from its birth to its destruction. At the same time, the system usually involves multiple distinct phases for different applications. In this case, we adopt a systems-engineering approach, well accepted in industry, to decompose a typical big-data system into four consecutive phases: data generation, data acquisition, data storage, and data analytics. Notice that data visualization is an assistive method for data analysis; in general, one first visualizes the data to find rough patterns and then employs specific data mining methods. We discuss this in the data analytics section. The details of each phase are explained as follows.

Data
generation
concerns how data are generated. In this case, the
term “big data” is designed to mean large, diverse, and complex
datasets that are generated from various longitudinal and/or
distributed data sources, including sensors, video, click streams,
and other available digital sources. Normally, these datasets are
associated with different levels of domain-specific values. In this
paper, we focus on datasets from three prominent domains, business,
Internet, and scientific research, for which values are relatively
easy to understand. However, there are overwhelming technical
challenges in collecting, processing, and analyzing these datasets
that demand new solutions to embrace the latest advances in the
information and communications technology (ICT) domain.

Data acquisition refers to the process of obtaining information and is subdivided into data collection, data transmission, and data pre-processing. First, because data may come from a diverse set of sources - websites that host formatted text, images, and/or videos, for example - data collection refers to dedicated data collection technology that acquires raw data from a specific data production environment. Second, after collecting raw data, we need a high-speed transmission mechanism to transmit the data into the proper storage system for various types of analytical applications. Finally, collected datasets might contain much meaningless data, which unnecessarily increases the amount of storage space required and affects the subsequent data analysis. For instance, redundancy is common in most datasets collected from sensors deployed to monitor the environment, and data compression technology can be applied to address this issue. Thus, we must perform data pre-processing operations for efficient storage and mining.
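
As a hedged illustration of the pre-processing step described above, the following Python sketch discards malformed records and removes exact duplicates from a batch of collected sensor readings before they are handed to the storage system. The record format (sensor id, timestamp, value) is an assumption made for the example.

    # Illustrative pre-processing before storage: discard malformed records
    # and remove exact duplicates from a batch of collected sensor readings.
    def preprocess(raw_records):
        seen = set()
        cleaned = []
        for record in raw_records:
            # Assumed record format: (sensor_id, timestamp, value).
            if not isinstance(record, tuple) or len(record) != 3:
                continue              # drop malformed records
            if record in seen:
                continue              # drop exact duplicates
            seen.add(record)
            cleaned.append(record)
        return cleaned

    raw = [("s1", 100, 21.5), ("s1", 100, 21.5), ("s2", 101, 19.8), "corrupt-line"]
    print(preprocess(raw))  # [('s1', 100, 21.5), ('s2', 101, 19.8)]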

Data storage concerns persistently storing and managing large-scale datasets. A data storage system can be divided into two parts: hardware infrastructure and data management. The hardware infrastructure consists of a pool of shared ICT resources organized in an elastic way for various tasks in response to their instantaneous demand. The hardware infrastructure should be able to scale up and out and to be dynamically reconfigured to address different types of application environments. Data management software is deployed on top of the hardware infrastructure to maintain large-scale datasets. Additionally, to support analysis of and interaction with the stored data, storage systems must provide several interface functions, fast querying, and other programming models.
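
The kind of interface such data management software exposes can be sketched, in highly simplified form, as a key-value store with put, get, and range-scan operations. This is a conceptual Python illustration only and does not reproduce the API of any particular distributed file system or NoSQL store.

    # Conceptual key-value storage interface: put / get / ordered scan.
    # Real systems partition and replicate data; this sketch is in-memory only.
    class SimpleKeyValueStore:
        def __init__(self):
            self._data = {}

        def put(self, key, value):
            self._data[key] = value

        def get(self, key):
            return self._data.get(key)

        def scan(self, start_key, end_key):
            # Return key-value pairs whose keys fall in [start_key, end_key).
            return [(k, self._data[k]) for k in sorted(self._data)
                    if start_key <= k < end_key]

    store = SimpleKeyValueStore()
    store.put("user:001", {"name": "alice"})
    store.put("user:002", {"name": "bob"})
    print(store.get("user:001"))             # {'name': 'alice'}
    print(store.scan("user:001", "user:003"))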

Data
analysis
leverages analytical methods or tools to inspect,
transform, and model data to extract value. Many application fields
leverage opportunities presented by abundant data and domain-specific
analytical methods to derive the intended impact. Although various
fields pose different application requirements and data
characteristics, a few of these fields may leverage similar
underlying technologies. Emerging analytics research can be
classified into six critical technical areas: structured data
analytics, text analytics, multimedia analytics, web analytics,
network analytics, and mobile analytics. This classification is
intended to highlight the key data characteristics of each area.

B.
BIG-DATA TECHNOLOGY MAP

Big
data research is a vast field that connects with many enabling
technologies. In this section, we present a big data technology map.
In this technology map, we associate a list of enabling technologies,
both open source and proprietary, with different stages in the big
data value chain.

This
map reflects the development trends of big data. In
the data generation stage, the structure of big data becomes
increasingly complex, from structured or unstructured to a mixture of
different types, whereas data sources become increasingly diverse. In
the data acquisition stage, data collection, data
pre-processing, and data transmission research emerge at different
times. Most research in the data storage stage began in approximately
2005. The fundamental methods of data analytics were built before
2000, and subsequent research attempts to leverage these methods to
solve domain-specific problems. Moreover, qualified technology or
methods associated with different stages can be chosen from this map
to customize a big data system.

C.
BIG-DATA SYSTEM: A LAYERED VIEW

Alternatively,
the big data system can be decomposed into a layered structure. The
layered structure is divisible into three layers, i.e., the
infrastructure layer, the computing layer, and the application layer,
from bottom to top. This layered view only provides a conceptual
hierarchy to underscore the complexity of a big data system. The
function of each layer is as follows.

  • The
    infrastructure layer consists of a pool of ICT resources, which can
    be organized by cloud computing infrastructure and enabled by
    virtualization technology. These resources will be exposed to
    upper-layer systems in a fine-grained manner with specific
    service-level agreements (SLAs). Within this model, resources must be
    allocated to meet the big data demand while achieving resource
    efficiency by maximizing system utilization, energy awareness,
    operational simplification, etc.

  • The
    computing layer encapsulates various data tools into a middleware
    layer that runs over raw ICT resources. In the context of big data,
    typical tools include data integration, data management, and the
    programming model. Data integration means acquiring data from
    disparate sources and integrating the dataset into a unified form
    with the necessary data pre-processing operations. Data management
    refers to mechanisms and tools that provide persistent data storage
    and highly efficient management, such as distributed file systems
    and SQL or NoSQL data stores. The programming model implements
    abstract application logic and facilitates the data analysis
    applications. MapReduce, Dryad, Pregel, and Dremel exemplify
    programming models.

  • The
    application layer exploits the interface provided by the programming
    models to implement various data analysis functions, including
    querying, statistical analyses, clustering, and classification;
    then, it combines basic analytical methods to develop various
    field-related applications. McKinsey presented five potential big data
    application domains: health care, public sector administration,
    retail, global manufacturing, and personal location data.

D.
BIG-DATA SYSTEM CHALLENGES

Designing
and deploying a big data analytics system is not a trivial or
straightforward task. As one of its definitions suggests, big data is
beyond the capability of current hardware and software platforms. The
new hardware and software platforms in turn demand new infrastructure
and models to address the wide range of challenges of big data.
Recent works have discussed potential obstacles to the growth of big
data applications. In this paper, we strive to classify these
challenges into three categories: data collection and management,
data analytics, and system issues.

Data
collection and management addresses massive amounts of heterogeneous
and complex data. The following challenges of big data must be met:

  • Data
    Representation:
    Many datasets are heterogeneous in type,
    structure, semantics, organization, granularity, and accessibility.
    A competent data representation should be designed to reflect the
    structure, hierarchy, and diversity of the data, and an integration
    technique should be designed to enable efficient operations across
    different datasets.

  • Redundancy Reduction and Data Compression: Typically, there is a large amount of redundant data in raw datasets. Redundancy reduction and data compression without sacrificing potential value are efficient ways to lessen overall system overhead (a small compression sketch follows this list).

  • Data Life-Cycle Management: Pervasive sensing and computing is generating data at an unprecedented rate and scale that far exceeds the comparatively slow advances in storage system technologies. One of the urgent challenges is that current storage systems cannot host the massive data. In general, the value concealed in big data depends on data freshness; therefore, we should set up a data importance principle, associated with the analytical value, to decide which parts of the data should be archived and which should be discarded.

  • Data Privacy and Security: With the proliferation of online services and mobile phones, privacy and security concerns regarding accessing and analyzing personal information are growing. It is critical to understand what support for privacy must be provided at the platform level to eliminate privacy leakage and to facilitate various analyses. The potential pillars of privacy and security mechanisms are mandatory access control and secure communication, multi-granularity access control, privacy-aware data mining and analysis, and secure storage and management.
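
To illustrate the redundancy-reduction point in the list above, the following Python sketch compresses a highly redundant batch of monitoring readings with the standard zlib module; the data and the resulting ratio are illustrative assumptions, not measurements from a real deployment.

    # Illustration of redundancy reduction: redundant sensor readings
    # compress far better than their raw size suggests.
    import json
    import zlib

    # Simulated monitoring data: many near-identical readings (assumed format).
    readings = [{"sensor": "temp-01", "value": 21.5} for _ in range(1000)]
    raw_bytes = json.dumps(readings).encode("utf-8")
    compressed = zlib.compress(raw_bytes, level=6)

    print(len(raw_bytes), len(compressed))
    print("compression ratio: %.1fx" % (len(raw_bytes) / len(compressed)))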

Significant impact will result from advances in big data analytics, including interpretation, modeling, prediction, and simulation. Unfortunately, massive amounts of data, heterogeneous data structures, and diverse applications present tremendous challenges, such as the following.


  • Approximate Analytics: As data sets grow and the real-time requirement becomes stricter, analysis of the entire dataset is becoming more difficult. One way to potentially solve this problem is to provide approximate results, such as by means of an approximation query. The notion of approximation has two dimensions: the accuracy of the result and the groups omitted from the output (see the sampling sketch after this list).

  • Connecting Social Media: Social media possesses unique properties, such as vastness, statistical redundancy, and the availability of user feedback. Various extraction techniques have been successfully used to identify references from social media to specific product names, locations, or people on websites. By connecting inter-field data with social media, applications can achieve high levels of precision and distinct points of view.

  • Deep Analytics: One of the drivers of excitement around big data is the expectation of gaining novel insights. Sophisticated analytical technologies, such as machine learning, are necessary to unlock such insights. However, effectively leveraging these analysis toolkits requires an understanding of probability and statistics.
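
As a simple, hedged illustration of the approximation idea in the first item above, the following Python sketch estimates the mean of a large dataset from a uniform random sample and attaches a rough 95% confidence interval. It is a textbook sampling estimate, not a description of any specific approximate query engine; in the terminology above, it trades result accuracy for processing time.

    # Approximate analytics by uniform sampling: estimate the mean of a large
    # dataset from a small sample and attach a rough 95% confidence interval.
    import random
    import statistics

    random.seed(42)
    full_dataset = [random.gauss(100.0, 15.0) for _ in range(1_000_000)]

    sample = random.sample(full_dataset, 10_000)
    estimate = statistics.mean(sample)
    std_error = statistics.stdev(sample) / len(sample) ** 0.5

    print("estimated mean: %.2f +/- %.2f (95%% CI)" % (estimate, 1.96 * std_error))
    print("exact mean:     %.2f" % statistics.mean(full_dataset))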

Finally,
large-scale parallel systems generally confront several common
issues; however, the emergence of big data has amplified the
following challenges, in particular.

  • Energy Management: The energy consumption of large-scale computing systems has attracted greater concern from economic and environmental perspectives. Data transmission, storage, and processing will inevitably consume progressively more energy as data volume and analytics demand increase. Therefore, system-level power control and management mechanisms must be considered in a big data system, while continuing to provide extensibility and accessibility.

  • Scalability: A big data analytics system must be able to support very large datasets created now and in the future. All the components in big data systems must be capable of scaling to address the ever-growing size of complex datasets.

  • Collaboration: Big data analytics is an interdisciplinary research field that requires specialists from multiple professional fields collaborating to mine hidden values. A comprehensive big data cyberinfrastructure is necessary to allow broad communities of scientists and engineers to access the diverse data, apply their respective expertise, and cooperate to accomplish the goals of analysis.

In
the remainder of this paper, we follow the value-chain framework to
investigate the four phases of the big-data analytic platform.