Running a C++ application on a Spark cluster

I am working on a school project. There is a video duplicate detection application written in C++. The application is designed to run on a single machine, and I would like to create a Spark cluster and run the application on that cluster. Is this possible? Is it difficult?

1 Solution

#1

Let me try to answer your question:

First, figure out what form the C++ application comes in: is it source code or a binary executable?

Source code

You can reimplement the algorithm in Java or Scala so that it makes full use of the cluster's resources and runs much faster than the single-machine version.

If your time is limited, you can compile the C++ source code with GCC (g++) and follow the approach for a binary executable described next.

Binary executable

Java and Scala compile to bytecode (.class files) that runs on the JVM. That bytecode is not compatible with native machine code, whose format is determined by the combination of compiler and operating system, so a Spark job cannot call into the executable directly.

In that case, the only option is to start a new process that runs the C++ executable and to collect its results through inter-process communication, such as a pipe.

In short, start a new process on the driver node to run the duplicate detection, then use the Spark engine for the subsequent parallel computation to speed up your project.

Starting a new process is easy in Scala or Java; please refer to the documentation: Process
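To make the process-plus-pipe idea concrete, here is a minimal sketch in plain Java (no Spark dependency) that starts an external executable with `ProcessBuilder` and reads what it prints through the pipe connected to its standard output. The class name `NativeRunner` and the method `runAndCapture` are made up for illustration, and the sketch assumes the C++ program writes its results to stdout as lines of text, which the question does not confirm.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: the class and method names are illustrative only.
public class NativeRunner {

    /** Runs an external command and returns the lines it printed. */
    public static List<String> runAndCapture(List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Merge stderr into stdout so a single reader drains both streams.
        builder.redirectErrorStream(true);
        Process process = builder.start();

        List<String> lines = new ArrayList<>();
        // getInputStream() is the read end of the pipe attached to the child's stdout.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }

        // Wait for the child to exit and treat a non-zero exit code as a failure.
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command exited with code " + exitCode + ": " + command);
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java NativeRunner <executable> [args...]");
            return;
        }
        // The executable path and its arguments come from the command line.
        for (String line : runAndCapture(Arrays.asList(args))) {
            System.out.println(line);
        }
    }
}
```

Assuming the compiled binary were called `./video_dedup` (a placeholder name), it could be invoked as `java NativeRunner ./video_dedup input.mp4`. Inside a Spark job, the same helper could be called from driver code, with the returned lines then handed to Spark for the parallel part.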
