I'm doing some testing with Apache Spark for my final college project. I have a data set that I use to build a decision tree and make predictions on new data.
In the future I would like to take this project into production. I would build the decision tree in a batch job, receive new data through a web interface or a mobile application, predict the class of each new entry, and return the result to the user immediately. I would also store these new entries so that, after a while, I can build a new decision tree (again as a batch job), and repeat this cycle continuously.
Although Apache Spark is designed mainly for batch processing, it also has a streaming API for receiving real-time data. In my application, the incoming data would only be scored by a decision tree model that was built in a batch job, and because prediction is quite fast, the user should get an answer quickly.
My question is: what are the best ways to integrate Apache Spark with a web application? (I plan to use the Scala version of the Play Framework.)
4 solutions
#1
One of the issues you will run into with Spark is that it takes some time to start up and build a SparkContext. If you want to run Spark queries via web calls, it is not practical to fire up spark-submit every time. Instead, you will want to turn your driver application (these terms will make more sense later) into an RPC server.
In my application I embed a web server (http4s), so I can issue XMLHttpRequests from JavaScript to query my application directly, and it returns JSON objects.
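The idea of a long-lived driver that answers web calls can be sketched as follows. This is only an illustration under assumptions, not the answerer's actual code: it uses the JDK's built-in `com.sun.net.httpserver` instead of http4s so that it needs no extra dependencies, and `PredictionContext`, `DriverServer`, the `/predict` path and the `x` parameter are hypothetical names standing in for the real SparkContext, model and API.

```scala
import com.sun.net.httpserver.{HttpExchange, HttpServer}
import java.net.InetSocketAddress
import java.nio.charset.StandardCharsets

// Hypothetical stand-in for the expensive-to-build SparkContext plus trained model.
final class PredictionContext {
  // Imagine several seconds of start-up work here; it runs once per process.
  def predict(feature: Double): String =
    if (feature > 0.5) "positive" else "negative"
}

object DriverServer {
  // Built once on first use, then shared by every request.
  lazy val context = new PredictionContext

  // Turns a query string such as "x=0.7" into a small JSON document.
  // A missing or unparsable "x" falls back to 0.0.
  def handleQuery(query: String): String = {
    val feature = Option(query).getOrElse("")
      .split("&").map(_.split("=", 2))
      .collectFirst { case Array("x", v) => v }
      .flatMap(_.toDoubleOption)
      .getOrElse(0.0)
    s"""{"x":$feature,"prediction":"${context.predict(feature)}"}"""
  }

  // Starts the embedded server; the context outlives individual requests.
  def start(port: Int): HttpServer = {
    val server = HttpServer.create(new InetSocketAddress(port), 0)
    server.createContext("/predict", (exchange: HttpExchange) => {
      val body = handleQuery(exchange.getRequestURI.getQuery)
        .getBytes(StandardCharsets.UTF_8)
      exchange.getResponseHeaders.add("Content-Type", "application/json")
      exchange.sendResponseHeaders(200, body.length.toLong)
      exchange.getResponseBody.write(body)
      exchange.close()
    })
    server.start()
    server
  }
}
```

After `DriverServer.start(8080)` (the port is arbitrary), a browser request to `/predict?x=0.7` is answered from the already-initialised context, so no request pays the start-up cost.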
#2
Spark is a fast, large-scale data processing platform. The key phrase here is large-scale data. In most cases, processing that data will not be fast enough to meet the expectations of a typical web app user. It is far better practice to run the processing offline and write the results of your Spark processing to, e.g., a database. Your web app can then retrieve those results efficiently by querying that database.
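The split between an offline batch writer and a cheap online lookup might look roughly like this. It is a minimal sketch under assumptions: an in-memory `TrieMap` stands in for the real database, and `PrecomputedResults`, `writeBatch` and `lookup` are hypothetical names. Note that this pattern only helps when the keys the web app will ask for are known at batch time.

```scala
import scala.collection.concurrent.TrieMap

object PrecomputedResults {
  // Hypothetical stand-in for a database table keyed by record id.
  private val store = TrieMap.empty[String, String]

  // Batch side: in a real system the Spark job would write its output here.
  def writeBatch(results: Map[String, String]): Unit = {
    store ++= results
    ()
  }

  // Web side: a plain key lookup, with no Spark involved at request time.
  def lookup(id: String): Option[String] = store.get(id)
}
```

The web handler then only calls `PrecomputedResults.lookup(id)` and can treat `None` as "not processed yet".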
That being said, Spark Job Server provides a REST API for submitting Spark jobs.
#3
Spark (< v1.6) uses Akka underneath. So does Play. You should be able to write a Spark action as an actor that communicates with a receiving actor in the Play system (that you also write).
You can let Akka worry about de/serialization, which will work as long as both systems have the same class definitions on their classpaths.
If you want to go further than that, you can write Akka Streams code that tees the data stream to your Play application.
#4
Check out the link below. The idea is to run Spark in local mode on your web server and to save the offline-trained ML model in S3. The web app can then load the model from S3 and cache it just once, while a Spark context keeps running in local mode.
https://commitlogs.com/2017/02/18/serve-spark-ml-model-using-play-framework-and-s3/
Another approach is to use Livy, which exposes Spark through REST API calls:
https://index.scala-lang.org/luqmansahaf/play-livy-module/play-livy/1.0?target=_2.11
I think the S3 option is the way forward. If the batch model changes, you need to refresh the web app's cache, which means a few minutes of downtime.
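The cache refresh need not always block requests. Assuming the web process has enough memory to hold the old and the new model at the same time, one option is to load the new model in the background and swap a single reference once loading has finished. The sketch below is a suggestion, not something taken from the linked article; `Model`, `ModelCache` and the `load` function are hypothetical stand-ins for the real decision tree and the S3 download.

```scala
import java.util.concurrent.atomic.AtomicReference

// Hypothetical stand-in for a trained decision tree loaded from storage.
final case class Model(version: Int, threshold: Double) {
  def predict(x: Double): String = if (x > threshold) "yes" else "no"
}

object ModelCache {
  // Starts with an initial model; in practice this would be loaded at start-up.
  private val current = new AtomicReference[Model](Model(1, 0.5))

  // Request path: always reads whichever model is currently published.
  def predict(x: Double): String = current.get().predict(x)

  def version: Int = current.get().version

  // Refresh path: `load` runs to completion first, then the reference is
  // swapped, so requests keep using the old model while loading is in progress.
  def refresh(load: () => Model): Unit = current.set(load())
}
```

A scheduled task could call `ModelCache.refresh` after each batch run; requests that arrive during the load are still served by the previous model.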
Also look into these links:
https://github.com/openforce/spark-mllib-scala-play/blob/master/app/modules/SparkUtil.scala
https://github.com/openforce/spark-mllib-scala-play
Thanks Sri