Advantages of combining Hadoop with any database

There are so many different databases.

relational databases
NoSQL databases
- key/value
- document store
- wide columns store
- graph databases

And database technologies

in-memory
column oriented

All have their advantages and disadvantages. For me it is very difficult to understand how to evaluate or choose a suitable database for a big data project.

I am considering Hadoop, which offers many ways to store data in HDFS or to access different databases for analytics.

Is it right to say that Hadoop makes it easier to choose the right database, because it can be used first as a data store? So if I have Hadoop HDFS as my main data storage, can I still change the database for my application afterwards, or use multiple databases?

2 Solutions

#1

First and foremost, Hadoop is not a database. It is a framework for distributed storage and processing, and its storage layer, HDFS, is a distributed filesystem.

For me it is very difficult to understand how to evaluate or choose a suitable database for a big data project.

The choice of database for a project depends on these factors:

Nature of the data storage and retrieval

If it is meant for transactions, it is highly recommended that you stick to an ACID-compliant database.
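
To make the ACID point concrete, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for any transactional database. The `accounts` table, the `transfer` helper, and the sample balances are assumptions made up for illustration. It shows atomicity: when the second statement of a transfer fails, the first one is undone as well.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one atomic unit (illustrative helper)."""
    try:
        # The connection context manager commits on success and
        # rolls back if an exception is raised inside the block.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance; the credit
        # applied by the first UPDATE has been rolled back too.
        return False

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # within alice's balance
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After both calls, only the first transfer is reflected in `balances`; the rejected one leaves no partial update behind.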

If it is to be used for web applications or random access, you have a wide variety of choices, from traditional SQL databases to newer database technologies that use HDFS as their storage layer, such as HBase. Traditional SQL databases are well suited to random access because they have strong support for constraints and indexes.
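
The role of indexes in random access can be sketched with Python's built-in `sqlite3` module, again only as a stand-in for a traditional SQL database. The `users` table, the `idx_users_email` index, and the `plan` helper are hypothetical names chosen for this example. The query plan is inspected before and after the index is created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)

def plan(sql, params=()):
    """Return SQLite's query plan for `sql` as one string (illustrative helper)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)

query = "SELECT name FROM users WHERE email = ?"
args = ("user42@example.com",)

before = plan(query, args)  # no index on email yet: the table is scanned
conn.execute("CREATE UNIQUE INDEX idx_users_email ON users (email)")
after = plan(query, args)   # the planner can now look the row up via the index

row = conn.execute(query, args).fetchone()
print(before)
print(after)
```

The exact plan wording differs between SQLite versions, so the sketch prints it rather than assuming a particular string.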

If analytical batch processing is the concern, the choice can be made among all the available options, based on the complexity of the data structure and its volume.

Data Format or Structure

Most SQL databases support structured data (data that can be formatted into tables); some extend their support beyond that to store JSON and similar formats.

If the data is unstructured, especially flat files, it can be stored and processed easily with big data technologies such as Hadoop, Spark, or Storm. Again, these technologies come into the picture only if the volume is high.

Different database technologies suit different data formats. For example, graph databases are well suited to storing structures that represent relationships or graphs.
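
As a rough illustration of why relationship-shaped data maps naturally onto a graph model, the sketch below stores a small "follows" graph as an adjacency list in plain Python and answers a "who is reachable within N hops" question with a breadth-first search. The data and the `within_hops` function are invented for this example; a real graph database would expose this kind of traversal through its own query language.

```python
from collections import deque

# Hypothetical "follows" relationships, stored as an adjacency list.
follows = {
    "ann": ["bob", "cid"],
    "bob": ["dee"],
    "cid": ["dee"],
    "dee": ["eve"],
    "eve": [],
}

def within_hops(graph, start, max_hops):
    """Return {node: hop_count} for nodes reachable from start in <= max_hops."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, []):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    distance.pop(start)  # the start node itself is not a result
    return distance

reachable = within_hops(follows, "ann", 2)
```

Expressing the same question against tables would typically take one self-join per hop, which is the kind of query graph databases are designed to avoid.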

Size

This is the next big concern: the more data you have, the greater the need for scalability. So it is better to choose a technology that supports a scale-out architecture (Hadoop, NoSQL) than one that only scales up. Otherwise, storage could become a bottleneck in the future when you plan to store more.
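
The idea behind scaling out can be sketched in a few lines: data is partitioned across nodes by key, so adding nodes lowers the load on each one. The `shard_for` and `distribute` functions and the key names below are hypothetical, and this is not how any particular system assigns data. In particular, plain modulo hashing moves most keys when the node count changes, which is why real systems tend to use schemes such as consistent hashing or range partitioning.

```python
import hashlib

def shard_for(key, num_nodes):
    """Map a key to a node index using a stable hash (illustrative scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes

def distribute(keys, num_nodes):
    """Group keys by the node they would be stored on."""
    nodes = [[] for _ in range(num_nodes)]
    for key in keys:
        nodes[shard_for(key, num_nodes)].append(key)
    return nodes

keys = [f"user-{i}" for i in range(10_000)]

three_nodes = distribute(keys, 3)
six_nodes = distribute(keys, 6)

# Doubling the node count roughly halves the number of keys per node.
print([len(node) for node in three_nodes])
print([len(node) for node in six_nodes])
```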

I am considering Hadoop, which offers many ways to store data in HDFS or to access different databases for analytics.

Yes, you can use HDFS as your storage layer and use any database that supports HDFS to do the processing. (The choice of processing framework, from batch to near real time to real time, is a separate concern.) Note that relational databases generally do not support HDFS as their storage. Some NoSQL databases do: HBase stores its data on HDFS, while others, such as MongoDB, integrate with Hadoop through a connector rather than storing their data on HDFS.

If I have Hadoop HDFS as my main data storage, can I still change the database for my application afterwards, or use multiple databases?

This could be tricky, depending on which database you want to pair with it afterwards.
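
One pattern that can keep this option open is to treat the raw files as the source of record and load them into whichever store an application needs. The sketch below is only an illustration under stated assumptions: a list of JSON lines stands in for files kept in HDFS, `sqlite3` stands in for a relational database, and a plain dictionary stands in for a key/value store. All names (`RAW_LINES`, `read_raw`, `load_relational`, `load_key_value`) are hypothetical.

```python
import json
import sqlite3

# Stand-in for raw files kept in HDFS: one JSON document per line.
RAW_LINES = [
    '{"id": 1, "customer": "ann", "amount": 12.5}',
    '{"id": 2, "customer": "bob", "amount": 7.0}',
    '{"id": 3, "customer": "ann", "amount": 3.5}',
]

def read_raw(lines):
    """Parse the raw records; every target store is loaded from this."""
    for line in lines:
        yield json.loads(line)

def load_relational(records):
    """Load the records into a relational table (sqlite3 as a stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (:id, :customer, :amount)", records)
    conn.commit()
    return conn

def load_key_value(records):
    """Load the same records into a key/value view (dict as a stand-in)."""
    return {record["id"]: record for record in records}

sql_db = load_relational(read_raw(RAW_LINES))
kv_db = load_key_value(read_raw(RAW_LINES))

total_for_ann = sql_db.execute(
    "SELECT SUM(amount) FROM orders WHERE customer = ?", ("ann",)
).fetchone()[0]
order_two = kv_db[2]
```

Because both stores are rebuilt from the same raw records, replacing one of them later means writing a new loader rather than migrating data between databases. The cost is that each database holds a copy that has to be kept in sync with the raw data.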

#2

HDFS is not a POSIX-compatible filesystem, so you can't just use it as general-purpose storage and then deploy any database on top of it. The database you deploy must have explicit support for HDFS. There are a few options: HBase, Hive, Impala, Solr.
