Why is HDFS not suitable for applications that require low latency?

Time: 2022-06-28 12:21:14

I am new to Hadoop and HDFS, and it confuses me why HDFS is not preferred for applications that require low latency. In a big data scenario the data is spread over many commodity machines, so accessing the data should be faster.

3 Answers

#1


Hadoop is fundamentally a batch processing system, designed to store and analyze structured, unstructured and semi-structured data.

Hadoop's MapReduce framework is relatively slow, since it is designed to support data of varying formats and structures at huge volume.

We should not say that HDFS itself is slow, since the HBase NoSQL database and MPP-based engines such as Impala and HAWQ sit on top of HDFS. These engines respond faster because they do not go through MapReduce execution for data retrieval and processing.

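To make the contrast concrete, here is a minimal sketch of an HBase point lookup (the "users" table, the row key and the "profile" column family are made up for illustration). A read like this touches only the relevant blocks on HDFS and returns in milliseconds, without any MapReduce job:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PointLookup {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {
                // Fetch a single row by key instead of scanning the whole dataset.
                Get get = new Get(Bytes.toBytes("user#42"));
                get.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("email"));
                Result result = table.get(get);
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("email"))));
            }
        }
    }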

The slowness comes from the nature of MapReduce-based execution: it produces a lot of intermediate data, much of which is exchanged between nodes, causing heavy disk and network I/O latency. Furthermore, it has to persist a lot of data to disk for synchronization between phases so that it can support job recovery from failures. There is also no way in MapReduce to cache all or a subset of the data in memory.

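As a rough illustration of where those disk writes happen, here is the classic word-count job as a minimal sketch (input and output paths are assumed to come from the command line); the comments mark the steps that hit disk or the network:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws java.io.IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (token.isEmpty()) continue;
                    word.set(token);
                    // Map output is buffered, spilled to local disk, then shuffled over the network.
                    ctx.write(word, ONE);
                }
            }
        }

        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws java.io.IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                // Reduce output is persisted back to HDFS before the job is considered done.
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Even on a tiny input the job still pays for task scheduling, the spill, the shuffle and the final HDFS write, which is why end-to-end latency is measured in seconds to minutes rather than milliseconds.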

Apache Spark is yet another batch processing system, but it is relatively faster than Hadoop MapReduce since it caches much of the input data in memory as RDDs and keeps intermediate data in memory itself, only writing data to disk upon completion or when required.

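A minimal sketch of that caching behaviour using Spark's Java API (the HDFS path is made up for illustration): the first action reads from HDFS and fills the cache, the second is served from executor memory without touching HDFS again:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class CachedFilter {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("cached-filter");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // Load once from HDFS and keep the RDD in executor memory.
                JavaRDD<String> lines = sc.textFile("hdfs:///data/events").cache();
                long errors = lines.filter(l -> l.contains("ERROR")).count();   // reads HDFS, populates the cache
                long warnings = lines.filter(l -> l.contains("WARN")).count();  // served from memory
                System.out.println(errors + " errors, " + warnings + " warnings");
            }
        }
    }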

#2


There is also the fact that HDFS, as a filesystem, is optimized for big chunks of data. For instance, a single block is usually 64-128 MB instead of the more usual 0.5-4 KB. So even for small operations there will be significant delay on reading from or writing to disk. Add to that its distributed nature and you get significant overhead (indirection, synchronization, replication, etc.) compared to a traditional filesystem.

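The block granularity is visible directly through the HDFS client API. A minimal sketch (the path to inspect is taken from the command line) that prints the file's block size, by default 134217728 bytes (128 MB) on recent Hadoop versions, and where the replicas of each block live:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockInfo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            try (FileSystem fs = FileSystem.get(conf)) {
                FileStatus status = fs.getFileStatus(new Path(args[0]));
                System.out.println("block size: " + status.getBlockSize() + " bytes");
                BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
                for (BlockLocation b : blocks) {
                    // Each block is replicated across several DataNodes.
                    System.out.println("offset " + b.getOffset() + ", length " + b.getLength()
                            + ", hosts " + String.join(",", b.getHosts()));
                }
            }
        }
    }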

That is from the point of view of HDFS, which I read to be your main question. Hadoop as a data processing framework has its own set of tradeoffs and inefficiencies (better explained in @hserus's answer), but they basically aim for the same niche: reliable bulk processing.

#3


Low-latency or real-time applications usually need specific data. They have to quickly serve a small amount of data that an end user or application is waiting for.

HDFS is designed to store large data in a distributed environment that provides fault tolerance and high availability. The actual location of the data is known only to the NameNode, and the data is placed more or less arbitrarily on the DataNodes after the files have been split into smaller blocks of fixed size. So a specific piece of data cannot be served to a real-time application quickly, because of the network latency, the distribution of the data, and the filtering needed to find that specific data. On the other hand, this layout helps when running MapReduce or other data-intensive jobs, because the executable program is shipped to the machines that hold the data locally (the data-locality principle).

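To see why a small targeted read is comparatively expensive, here is a minimal sketch of what the HDFS client does for it (the file path and offset are made up for illustration): opening the file goes to the NameNode for the block locations, and the bytes are then streamed from one of the DataNodes holding that block, so each step adds network round trips on top of the disk access:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SmallRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            try (FileSystem fs = FileSystem.get(conf);
                 // open() asks the NameNode where the blocks of this file live.
                 FSDataInputStream in = fs.open(new Path("/data/events/part-00000"))) {
                byte[] buf = new byte[1024];
                in.seek(4096);        // position inside a block that may be 128 MB on disk
                int n = in.read(buf); // bytes are streamed from a DataNode over the network
                System.out.println("read " + n + " bytes");
            }
        }
    }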
