I am working on a school project. There is a video duplication detection application written in C++. The application is designed to run on a single machine, and I would like to create a Spark cluster and run the application on that cluster. Is this possible? Is it difficult?
1 solution
#1
0
Let me try to answer your question.
First, you should figure out which form the C++ application comes in: is it source code or a binary executable?
- Source code
You can reimplement the algorithm in Java/Scala and make the most of the cluster resources, which should make your job much faster than the single-machine version.
If your time is limited, you can compile the C++ source code with GCC (g++) and follow the next method.
- Binary executable
Java/Scala bytecode (the .class format) runs on the JVM and is not compatible with native machine code, whose format is determined by the combination of compiler and operating system.
In that case, the straightforward choice is to start a new process that runs the C++ executable, and get the results you want through inter-process communication, such as a pipe.
In short, you should start a new process on the driver node to de-duplicate your data, and then use the Spark engine for the subsequent parallel computation to speed up your project.
Starting a new process is easy in Scala or Java; please refer to the docs: Process
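To make the process-plus-pipe idea above more concrete, here is a minimal Java sketch using the standard `ProcessBuilder` API. It starts an external command, reads whatever the child writes to its standard output through the pipe, and waits for it to exit. The class name `NativeRunner` and the method `runAndCollect` are my own illustrative names, and I am assuming your C++ program prints its results as text lines on stdout; adapt it if yours writes to files instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: runs a native executable and collects its output lines.
final class NativeRunner {

    /** Runs the given command and returns the lines it wrote to stdout. */
    static List<String> runAndCollect(List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Merge stderr into stdout so only one pipe has to be drained.
        builder.redirectErrorStream(true);
        Process process = builder.start();

        List<String> lines = new ArrayList<>();
        // getInputStream() is the read end of the pipe attached to the child's stdout.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }

        // Wait for the child to finish and treat a non-zero exit code as a failure.
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command " + command + " exited with code " + exitCode);
        }
        return lines;
    }
}
```

You would call it with something like `NativeRunner.runAndCollect(List.of("./video_dedup", "input.mp4"))`, where both the executable name and its argument are placeholders for whatever your application actually expects. The returned lines can then be handed to Spark for the parallel part.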