Recently I've started learning Spark to speed up my processing. In my situation, the input RDD of the Spark application does not contain all the data required for the batch processing. As a result, I have to run some SQL queries in each worker thread.
Preprocessing all of the input data is possible, but it takes too long.
I know the following questions may be too "general", but any experience will help.
- Is it possible to run SQL queries in worker threads?
- Will scheduling on the database server become the bottleneck if a single query is complicated?
- Which database suits this situation (ideally one with good concurrency)? MongoDB? *SQL?
1 Answer
#1
It is hard to answer some of your questions without a specific use case, but the following generic answers might be of some help:
- Yes. You can access external data sources (RDBMS, Mongo, etc.). You can use mapPartitions to further improve performance by creating the connection only once per partition instead of once per record. See an example here.
- Can't answer without looking at a specific example.
- Database selection depends on the use case.
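The mapPartitions pattern in the first point can be sketched as follows. This is a minimal illustration, not Spark code: an in-memory sqlite3 database stands in for the external data server, and plain Python lists stand in for the RDD's partitions. With a real RDD you would pass the same kind of generator function to rdd.mapPartitions(...), so the connection is opened once per partition rather than once per element.

```python
import sqlite3

def enrich_partition(rows):
    """Open ONE connection for the whole partition, run a lookup
    query per element, and close the connection at the end.
    In Spark this function would be passed to rdd.mapPartitions()."""
    # Stand-in for the external database: an in-memory sqlite3 table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE lookup (id INTEGER PRIMARY KEY, label TEXT)")
    conn.executemany("INSERT INTO lookup VALUES (?, ?)",
                     [(1, "a"), (2, "b"), (3, "c")])
    try:
        for row_id in rows:
            cur = conn.execute("SELECT label FROM lookup WHERE id = ?",
                               (row_id,))
            hit = cur.fetchone()
            # Emit the input joined with whatever the query returned.
            yield (row_id, hit[0] if hit else None)
    finally:
        conn.close()

# Simulate an RDD with two partitions using plain lists.
partitions = [[1, 2], [3, 4]]
result = [rec for part in partitions for rec in enrich_partition(part)]
print(result)  # [(1, 'a'), (2, 'b'), (3, 'c'), (4, None)]
```

The key point is that the connection setup cost is paid once per partition, so with a few hundred partitions you open a few hundred connections total, regardless of how many millions of records the RDD holds.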