I'm very new to Hadoop MapReduce/Spark. For my target project, I want to perform data preprocessing with Hadoop MapReduce/Spark. I know the basics of Hadoop MapReduce, but I don't know how to implement preprocessing algorithms/methods with this framework. For Hadoop MapReduce, I have to define Map() and Reduce(), which use <key, value> pairs as the transmission type from Mappers to Reducers. But with database tables, how can I handle table entries in <key, value> format? Using the primary key as the key seems like nonsense. It's a similar situation for Spark, since I need to specify the key.
For example, for some data entries in the database table, certain fields may be missing, so I want to fill in default values for those fields using some kind of imputation strategy. How can I process the data entries in a <key, value> way? Setting the primary key as the key here is nonsense, because in that case every <key, value> pair would have a distinct key, so aggregation doesn't help at all.
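To make the question concrete, here is a minimal PySpark sketch of the kind of imputation I have in mind (the input file name and column names are made up for illustration). Note that no key is involved at all: the transformation is purely per record, which in MapReduce terms would be a map-only job with zero reducers.

    # Minimal PySpark sketch; the input file and column names are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("imputation-sketch").getOrCreate()

    # Load the table (could also be spark.read.jdbc(...) against a real database).
    df = spark.read.csv("people.csv", header=True, inferSchema=True)

    # Fill missing fields with per-column defaults. This is a purely map-side,
    # per-record transformation: no key, no shuffle, no aggregation.
    imputed = df.fillna({"age": 0, "city": "unknown"})

    imputed.write.csv("people_imputed", header=True)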
1 Solution
#1
MapReduce is a fairly low-level programming model. You can start with higher-level abstractions like Hive and Pig.
If you are dealing with structured data, go with Hive, which provides a SQL-like interface and internally converts SQL queries into MR jobs.
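For instance, the missing-field imputation from the question could be written in HiveQL roughly like this (the table and column names are hypothetical); Hive compiles the query into MR jobs for you:

    -- Hypothetical table "people" with nullable columns age and city;
    -- COALESCE substitutes a default wherever the field is NULL.
    INSERT OVERWRITE TABLE people_imputed
    SELECT id,
           COALESCE(age, 0) AS age,
           COALESCE(city, 'unknown') AS city
    FROM people;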
Hope this helps.