How to store thousands of CSV files on Hadoop HDFS

Time: 2022-09-01 22:21:52

What's the situation? I have tens of thousands of CSV files (each 250 kB - 270 kB) that I would like to process using Spark (PySpark to be precise). Each CSV file represents process data for one specific event; you could say that one CSV file represents one object. Since I want to store the data on HDFS, I have to find a way to concatenate the data, because storing large amounts of tiny files on HDFS is inefficient.

Snippet of one CSV file (simplified).

Time        Module   v1   v2      v3      v4   v5   v6      v7      v8
00:00:00    Start    0    26,2    26,0    0    0    25,899  25,7    0
00:00:06    2: M1    0    26,1    26,2    0    0    25,8    25,899  0
00:01:06    2: M1    0    26,6    26,6    0    0    26,8    26,799  0
00:02:05    2: M1    0    27,1    27,0    0    0    27,7    27,7    0
00:03:06    2: M1    0    27,3    27,5    0    0    28,1    28,1    0

The full data has 45-50 columns and around 1000 rows.

My idea so far. I was thinking of transforming each CSV into one JSON object and then concatenating the JSON objects, as seen below:

{
 "Event": "MLV14092",
 "Values": [
  {
   "Time": "00:00:00",
   "Module": "Start",
   "v1": "33.299"
   ...
  },
  {
   "Time": "00:00:06",
   "Module": "2: M1",
   "v1": "33.4"
   ... 
  }
 ]
}
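
For illustration, a minimal sketch of how one CSV could be turned into such a JSON object, assuming pandas is available; the input directory, the semicolon delimiter, and the idea that the event id (e.g. MLV14092) is taken from the filename are all assumptions, not something stated above:

import json
from pathlib import Path

import pandas as pd

def csv_to_event_json(path):
    # Assumption: the event id (e.g. MLV14092) only appears in the filename,
    # so it is taken from there.
    event = Path(path).stem
    # sep and decimal are assumptions: the snippet above uses decimal commas,
    # so the delimiter is presumably not a comma; adjust both to the real files.
    df = pd.read_csv(path, sep=";", decimal=",")
    records = df.astype(str).to_dict(orient="records")  # string values, as in the example
    return json.dumps({"Event": event, "Values": records})

# One JSON object per event, each written as a single line to a hypothetical output file.
with open("events.jsonl", "w") as out:
    for path in sorted(Path("csv_dir").glob("*.csv")):  # csv_dir is hypothetical
        out.write(csv_to_event_json(path) + "\n")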

Question. Is that a valid approach? I'm relatively new to the Hadoop environment and I've done some tutorials with JSON files. However, in those tutorials I was always able to store one JSON object per line, so I didn't have to worry about where HDFS splits the file. With one JSON object being so "big", it won't fit into a single line. Is there a better way to proceed?

1 solution

#1


Generally, you would not want to store many small files in HDFS -- small being files < ~64-128MB in size.

From your description, it also looks like the "Event" name/id will be very important, but it is not part of the existing csv files (i.e. it's in the filename, but not in the file).

Given that the size and number of the files is still not that large, have you considered writing a small shell or Python script to do the following:

  • Remove the header from each csv
  • Prepend/append a column to each csv containing the "Event" name/id
  • Store the result in a new file

You would apply the script to each file, which would give you a transformed output file. (Your script could also do this to the entire set or a subset of the files in batches.)
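
A minimal Python sketch of such a script, run once per file; the directory names, the semicolon delimiter, and the assumption that the event id is the filename (e.g. MLV14092.csv) are hypothetical:

import csv
from pathlib import Path

IN_DIR = Path("raw_csv")        # hypothetical directory with the original files
OUT_DIR = Path("transformed")   # hypothetical directory for the transformed files
OUT_DIR.mkdir(exist_ok=True)

def transform(path):
    event = path.stem  # assumption: the "Event" name/id is the filename
    with open(path, newline="") as src, open(OUT_DIR / path.name, "w", newline="") as dst:
        reader = csv.reader(src, delimiter=";")  # adjust to the real delimiter
        writer = csv.writer(dst, delimiter=";")
        next(reader)                             # remove the header row
        for row in reader:
            writer.writerow([event] + row)       # prepend the "Event" id as a new column

for path in IN_DIR.glob("*.csv"):
    transform(path)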

You could then concatenate the transformed output files and store the concatenated file(s) in HDFS. The concatenated file(s) would be space-efficient, line-delimited, and well-suited for exploration and analysis using tools such as PySpark/Spark and Hive.
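
For illustration, a sketch of reading such a concatenated file back with PySpark; the HDFS path is hypothetical, the file is assumed to have been concatenated (e.g. cat transformed/*.csv > events_all.csv) and uploaded with hdfs dfs -put, and the column names mirror the simplified snippet above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("event-analysis").getOrCreate()

df = (spark.read
      .option("sep", ";")  # same delimiter as the transformed files
      .csv("hdfs:///data/events/events_all.csv")
      .toDF("Event", "Time", "Module",
            "v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8"))  # simplified column set

# Sanity check: one group per event, roughly 1000 rows each
df.groupBy("Event").count().show()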

On a separate note, there are more optimal file formats than CSV for such analysis, but consider exploring the columnar file format topic after this initial set of steps. For Spark, you may want to look into later storing this data in Parquet format, and for Hive, in ORC format. You could convert the data into those formats using the very same tools.
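
For example, a sketch of such a conversion with PySpark, under the same assumptions about paths and delimiter as above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the concatenated CSV and persist it as Parquet for Spark workloads.
df = spark.read.option("sep", ";").csv("hdfs:///data/events/events_all.csv")
df.write.mode("overwrite").parquet("hdfs:///data/events/events_parquet")

# For Hive, ORC is the more common choice:
# df.write.mode("overwrite").orc("hdfs:///data/events/events_orc")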
