Why do we need to distribute files in Spark, e.g. --py-files?

Date: 2022-07-31 20:51:53

As I read in many blogs and posts here on SO, for example this one (in the first few paragraphs), quoted as follows:

Not to get into too many details, but when you run different transformations on a RDD (map, flatMap, filter and others), your transformation code (closure) is:

  1. serialized on the driver node,
  2. shipped to the appropriate nodes in the cluster,
  3. deserialized,
  4. and finally executed on the nodes

OK, here is my take on this:

I define some custom transformation/action functions in the driver, and those custom functions are then serialized and shipped to all the executors to run the job.
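
To make this concrete, here is a minimal PySpark sketch of what I mean (the app name, function and values are made up for illustration):

    from pyspark import SparkContext

    sc = SparkContext(appName="closure-demo")  # hypothetical app name

    # Defined on the driver; captured by the lambda below.
    def add_tax(price):
        return price * 1.07

    prices = sc.parallelize([10.0, 20.0, 30.0])

    # The closure (the lambda plus the add_tax function it references) is
    # pickled on the driver, shipped to the executors, unpickled there,
    # and executed against each partition.
    with_tax = prices.map(lambda p: add_tax(p)).collect()
    print(with_tax)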

Then what's the point of shipping extra py-files to all the nodes? Since everything the executors need will be serialized to them, what the heck is going on here?
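
To be concrete about the case I am asking about: a hypothetical job split across two files, submitted with --py-files (module and file names made up). As far as I can tell, the pickled closure only holds a reference to the imported module rather than embedding its source -- is that the only reason the file has to be shipped separately?

    # helpers.py -- a separate module that exists only on the driver machine
    def add_tax(price):
        return price * 1.07

    # job.py
    from pyspark import SparkContext
    import helpers  # imported on the driver

    sc = SparkContext(appName="py-files-demo")

    # The pickled closure does not embed the source of helpers; it only
    # records a reference to helpers.add_tax, so every executor must be
    # able to "import helpers" locally. That import fails unless helpers.py
    # is present on the node, which is what --py-files delivers:
    #
    #     spark-submit --py-files helpers.py job.py
    result = sc.parallelize([10.0, 20.0]).map(helpers.add_tax).collect()
    print(result)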

1 Solution

#1


Not sure, but use Spark 2.x and the DataFrame API to avoid serialization and to ship Scala code to your nodes without dealing with an extra Python container on your nodes.
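
Roughly, the idea in PySpark terms (a sketch, not a drop-in answer; the data and names are made up): the built-in DataFrame/column operations are turned into a plan that runs on the JVM executors, so no Python closure needs to be pickled and shipped.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

    df = spark.createDataFrame([(10.0,), (20.0,)], ["price"])

    # Built-in column expressions are compiled by Catalyst and executed on
    # the JVM side, so no Python function is pickled and sent to executors.
    df.withColumn("with_tax", F.col("price") * 1.07).show()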
