I understand that you can send individual files as dependencies with Python Spark programs. But what about full-fledged libraries (e.g. numpy)?
Does Spark have a way to use a provided package manager (e.g. pip) to install library dependencies? Or does this have to be done manually before Spark programs are executed?
If the answer is manual, then what are the "best practice" approaches for synchronizing libraries (installation path, version, etc.) over a large number of distributed nodes?
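For reference, the single-file mechanism I am referring to looks roughly like this (a minimal sketch; my_helper.py and my_job.py are hypothetical names, and sc.addPyFile is the programmatic equivalent):

spark-submit --py-files my_helper.py my_job.py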
1 Solution
#1
Having actually tried it, I think the link I posted as a comment doesn't do exactly what you want with dependencies. What you are quite reasonably asking for is a way to have Spark play nicely with setuptools and pip when installing dependencies. It blows my mind that this isn't supported better in Spark. The third-party dependency problem is largely solved in general-purpose Python, but under Spark the assumption seems to be that you'll fall back to manual dependency management or something.
I have been using an imperfect but functional pipeline based on virtualenv. The basic idea is:
- Create a virtualenv purely for your Spark nodes
- Each time you run a Spark job, run a fresh pip install of all your own in-house Python libraries. If you have set these up with setuptools, this will install their dependencies
- Zip up the site-packages dir of the virtualenv. This will include your library and its dependencies, which the worker nodes will need, but not the standard Python library, which they already have
- Pass the single .zip file, containing your libraries and their dependencies, as an argument to --py-files (a minimal sketch of steps 1 and 4 follows this list)
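Steps 1 and 4 might look like this (a minimal sketch; /path/to/spark_venv, deps.zip and my_job.py are hypothetical names, and the script below automates steps 2 and 3):

# step 1: create a virtualenv dedicated to your Spark jobs
virtualenv /path/to/spark_venv
# step 4: once steps 2 and 3 have produced the zip, ship it to the workers
spark-submit --py-files deps.zip my_job.py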
Of course you would want to code up some helper scripts to manage this process. Here is a helper script adapted from one I have been using, which could doubtless be improved a lot:
#!/usr/bin/env bash
# helper script to fulfil Spark's python packaging requirements.
# Installs everything in a designated virtualenv, then zips up the virtualenv for use as the value
# supplied to the --py-files argument of `pyspark` or `spark-submit`
# First argument should be the top-level virtualenv
# Second argument is the zipfile which will be created, and
# which you can subsequently supply as the --py-files argument to
# spark-submit
# Subsequent arguments are all the private packages you wish to install
# If these are set up with setuptools, their dependencies will be installed
VENV=$1; shift
ZIPFILE=$1; shift
PACKAGES=$*
. $VENV/bin/activate
for pkg in $PACKAGES; do
pip install --upgrade $pkg
done
TMPZIP="$TMPDIR/$RANDOM.zip" # abs path. Use random number to avoid clashes with other processes
( cd "$VENV/lib/python2.7/site-packages" && zip -q -r $TMPZIP . )
mv $TMPZIP $ZIPFILE
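For example, if the script above is saved as make_pyfiles_zip.sh (a hypothetical name) and ./my_private_lib is one of your setuptools-based packages, an invocation might look like:

./make_pyfiles_zip.sh /path/to/spark_venv deps.zip ./my_private_lib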
I have a collection of other simple wrapper scripts I run to submit my Spark jobs. I simply call this script first as part of that process and make sure that the second argument (the name of a zip file) is then passed as the --py-files argument when I run spark-submit (as documented in the comments). I always run these scripts, so I never end up accidentally running old code. Compared to the Spark overhead, the packaging overhead is minimal for my small-scale project.
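A stripped-down sketch of such a wrapper, reusing the hypothetical names from above (make_pyfiles_zip.sh, deps.zip, my_job.py):

#!/usr/bin/env bash
# rebuild the dependency zip, then submit the job with it attached
set -e
./make_pyfiles_zip.sh /path/to/spark_venv deps.zip ./my_private_lib
spark-submit --py-files deps.zip my_job.py "$@"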
There are loads of improvements that could be made – e.g. being smart about when to create a new zip file, or splitting it into two zip files, one containing your often-changing private packages and one containing your rarely changing dependencies, which don't need to be rebuilt as often. You could be smarter about checking for file changes before rebuilding the zip. Checking the validity of the arguments would also be a good idea. However, for now this suffices for my purposes.
The solution I have come up with is not designed for large-scale dependencies like NumPy specifically (although it may work for them). Also, it won't work if you are building C-based extensions and your driver node has a different architecture from your cluster nodes.
I have seen recommendations elsewhere to just run a Python distribution like Anaconda on all your nodes, since it already includes NumPy (and many other packages), and that might be the better way to get NumPy as well as other C-based extensions going. Still, we can't always expect Anaconda to have the PyPI package we want in the right version, and you might not have enough control over your Spark environment to put Anaconda on it, so I think this virtualenv-based approach is still helpful.