Hadoop: The Definitive Guide, 4th Edition - HDFS Study Notes

Posted: 2024-03-23 07:48:23

When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines.

Filesystems that manage the storage across a network of machines are called distributed filesystems.

Since they are network-based, all the complications of network programming kick in, thus making distributed filesystems more complex than regular disk filesystems.

For example, one of the biggest challenges is making the filesystem tolerate node failure without suffering data loss.

Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem.


1 The Design of HDFS

HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

(1) Very large files

“Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.

(2) Streaming data access

HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. The time to read the whole dataset is more important than the latency in reading the first record.
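
As a minimal sketch of what this read-many pattern looks like in practice, the following Java snippet streams an entire file from HDFS using the Hadoop FileSystem API. The path /data/logs/part-00000 is a hypothetical example file; the cluster address comes from the configuration files on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class StreamWholeFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();  // reads core-site.xml from the classpath
            FileSystem fs = FileSystem.get(conf);      // the default filesystem (HDFS, if so configured)
            FSDataInputStream in = null;
            try {
                // Open the (hypothetical) file and scan it from start to finish.
                // HDFS is optimized for exactly this whole-dataset streaming read,
                // not for seeking quickly to the first record.
                in = fs.open(new Path("/data/logs/part-00000"));
                IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }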

(3) Commodity hardware

Hadoop doesn’t require expensive, highly reliable hardware. It’s designed to run on clusters of commodity hardware (commonly available hardware that can be obtained from multiple vendors), for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.

These are areas where HDFS is not a good fit today:

(1) Low-latency data access

HDFS is optimized for delivering a high throughput of data, and this may be at the expense of latency. HBase is currently a better choice for low-latency access.

(2) Lots of small files

Because the namenode holds the filesystem metadata in memory, the number of files a cluster can store is limited by the amount of memory on the namenode.

Each file, directory, and block takes about 150 bytes.

So if you had one million files, each taking one block, that is two million objects (one per file plus one per block), requiring at least 2,000,000 × 150 bytes = 300 MB of memory. Although storing millions of files is feasible, billions is beyond the capability of current hardware.

(With very large numbers of small files, the memory needed just to hold the file metadata exceeds what the hardware can supply.)
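
The arithmetic above can be checked with a small back-of-the-envelope sketch in plain Java, using the roughly-150-bytes-per-object rule of thumb from the text (the exact per-object cost varies by Hadoop version):

    public class NamenodeMemoryEstimate {
        static final long BYTES_PER_OBJECT = 150;  // rough cost of one file, directory, or block

        public static void main(String[] args) {
            long files = 1_000_000L;                       // one million files...
            long blocksPerFile = 1L;                       // ...each occupying a single block
            long objects = files + files * blocksPerFile;  // one object per file, one per block
            long bytes = objects * BYTES_PER_OBJECT;
            // 2,000,000 objects * 150 bytes = 300,000,000 bytes, i.e. about 300 MB
            System.out.printf("%,d objects -> ~%d MB of namenode heap%n",
                    objects, bytes / 1_000_000);
        }
    }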

(3) Multiple writers, arbitrary file modifications

Files in HDFS may be written to by a single writer. Writes are always made at the end of the file, in append-only fashion. There is no support for multiple writers or for modifications at arbitrary offsets in the file.
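
A minimal sketch of this single-writer, append-only model with the Hadoop FileSystem API follows. The path and record contents are hypothetical, and the append() call assumes the cluster has appends enabled:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendOnlyWrite {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/data/events.log");  // hypothetical file

            // The single writer creates the file and writes sequentially.
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeBytes("first record\n");
            }

            // Later writes can only go at the end of the file, via append().
            try (FSDataOutputStream out = fs.append(file)) {
                out.writeBytes("second record\n");
            }

            // Note there is no call for writing at an arbitrary offset:
            // FSDataOutputStream has no seek-for-write, so in-place
            // modification of existing bytes is simply not expressible.
        }
    }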

Extension 1: What is a file?


Extension 2: What is a filesystem?


Extension 3: How do you design a filesystem?


Extension 4: What is metadata?


Below is a passage from Chekhov's short story "The Man in a Case", describing a woman named Varenka:

(She) was no longer young, about thirty, tall, with a shapely figure, black eyebrows, and rosy cheeks; in a word, not a girl but marmalade. She was so lively, so noisy, forever humming Little Russian ballads and laughing loudly, at the slightest excuse breaking into a string of ringing laughter: ha, ha, ha!

This passage gives us several pieces of information: age (about thirty), height (tall), appearance (shapely figure, black eyebrows, rosy cheeks), and personality (lively, noisy, forever humming Little Russian ballads, laughing loudly). With this information, we can roughly picture what kind of person Varenka is. By extension, given the same few categories of information about anyone else, we could form a picture of them, too.

The "age", "height", "appearance", and "personality" in this example are metadata, because they are data/information used to describe other data/information.

Of course, these few metadata fields are not precise enough to characterize a person. We have all filled out some kind of personal information form while growing up, with entries for name, gender, ethnicity, political affiliation, a one-inch photo, education, professional title, and so on. That set of metadata is reasonably complete.

Metadata is everywhere in daily life: for any category of things, you can define a set of metadata to describe them.

(Quoted from Ruan Yifeng's blog post: http://www.ruanyifeng.com/blog/2007/03/metadata.html)

2 HDFS Concepts

2.1 Blocks
