We want to use AWS Data Pipeline to automate our data ingestion process. In our ingestion process we mainly copy CSV files into an S3 bucket and run Hive queries on them, for more than 100 different tables.
We want to create a single pipeline in which we can process all 100 tables.
I would like to know whether we can run multiple Hive activities and S3 copy activities in parallel. I couldn't find anything in the AWS documentation about whether pipeline activities run serially or in parallel.
1 Solution
#1
You can use a HadoopActivity that calls the Hive query from a Java executable. AWS Data Pipeline supports parallel execution of HadoopActivities.
Documentation: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-hadoopactivity.html
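For reference, here is a minimal boto3 sketch of what such a pipeline definition could look like: two HadoopActivity objects with no dependency between them, sharing one EmrCluster, which leaves Data Pipeline free to schedule them concurrently. The bucket names, the hive-runner.jar, and the table arguments are placeholders for illustration, not anything from the question or the linked docs.

```python
# Sketch: a Data Pipeline definition with two independent HadoopActivity objects.
# Because neither activity depends on the other, Data Pipeline can run them in parallel.
# All S3 URIs, jar names, and table names below are hypothetical placeholders.
import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

pipeline_id = dp.create_pipeline(
    name="csv-hive-ingest", uniqueId="csv-hive-ingest-demo"
)["pipelineId"]

def hadoop_activity(obj_id, table):
    """Build one HadoopActivity that invokes a (hypothetical) Hive-runner jar for one table."""
    return {
        "id": obj_id,
        "name": obj_id,
        "fields": [
            {"key": "type", "stringValue": "HadoopActivity"},
            # placeholder jar that wraps the Hive query for this table
            {"key": "jarUri", "stringValue": "s3://my-code-bucket/hive-runner.jar"},
            {"key": "argument", "stringValue": table},
            {"key": "runsOn", "refValue": "EmrClusterForHive"},
        ],
    }

pipeline_objects = [
    {
        "id": "Default",
        "name": "Default",
        "fields": [
            {"key": "scheduleType", "stringValue": "ondemand"},
            {"key": "failureAndRerunMode", "stringValue": "CASCADE"},
            {"key": "role", "stringValue": "DataPipelineDefaultRole"},
            {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
            {"key": "pipelineLogUri", "stringValue": "s3://my-log-bucket/datapipeline/"},
        ],
    },
    {
        "id": "EmrClusterForHive",
        "name": "EmrClusterForHive",
        "fields": [
            {"key": "type", "stringValue": "EmrCluster"},
            {"key": "terminateAfter", "stringValue": "2 Hours"},
        ],
    },
    # Two independent activities -> eligible to run concurrently on the shared cluster.
    hadoop_activity("HiveForTableA", "table_a"),
    hadoop_activity("HiveForTableB", "table_b"),
]

dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=pipeline_objects)
dp.activate_pipeline(pipelineId=pipeline_id)
```

Scaling this to 100+ tables would just mean generating one such activity per table; whether they actually execute at the same time also depends on the capacity of the EMR cluster they run on.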