I'm trying to run some external shared library functions with Cloud Dataflow, similar to what is described here: Running external libraries with Cloud Dataflow for grid-computing workloads.
I have a couple of questions about this approach.
The article mentioned earlier contains the following passage:
In the case of making a call to an external library, you need to do this step manually for that library. The approach is to:
在调用外部库的情况下,您需要手动为该库执行此步骤。方法是:
- Store the code (along with versioning information) in Cloud Storage; this removes any concerns about throughput if running 10,000s of cores in the flow.
- In the @beginBundle [sic] method, create a synchronized block to check if the file is available on the local resource. If not, use the Cloud Storage client library to pull the file across.
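My understanding of that pattern is a sketch along these lines (the bucket my-bucket, the object native/liblookup.so, and the local path are all hypothetical placeholders; the download calls are from the google-cloud-storage client library):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

class NativeLibraryStager {

  private static final Object LOCK = new Object();
  // Hypothetical local path; adjust to your worker's layout.
  private static final Path LOCAL_PATH = Paths.get("/tmp/liblookup.so");

  // Called from the bundle (or setup) method; downloads the library at most once per worker.
  static Path stage() throws Exception {
    synchronized (LOCK) {
      if (!Files.exists(LOCAL_PATH)) {
        Storage storage = StorageOptions.getDefaultInstance().getService();
        Blob blob = storage.get(BlobId.of("my-bucket", "native/liblookup.so"));
        blob.downloadTo(LOCAL_PATH);
      }
    }
    return LOCAL_PATH;
  }
}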
However, with my Java package, I simply put the library .so file into the src/main/resources/linux-x86-64 directory and call the library functions the following way (stripped to a bare minimum for brevity):
import com.sun.jna.Library;
import com.sun.jna.Native;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

public class HostLookupPipeline {

  // JNA mapping of the native function exported by liblookup.so.
  public interface LookupLibrary extends Library {
    String Lookup(String domain);
  }

  static class LookupFn extends DoFn<String, KV<String, String>> {

    private static LookupLibrary lookup;

    @StartBundle
    public void startBundle() {
      // Loads src/main/resources/linux-x86-64/liblookup.so from the classpath.
      lookup = Native.loadLibrary("lookup", LookupLibrary.class);
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
      String domain = c.element();
      String results = lookup.Lookup(domain);
      if (results != null) {
        c.output(KV.of(domain, results));
      }
    }
  }
}
Is such an approach considered acceptable, or does extracting the .so file from the JAR perform poorly compared to downloading it from GCS? If it is not acceptable, where should I put the file after downloading to make it accessible to the Cloud Dataflow worker?
I've noticed that the transformation calling the external library function runs rather slowly (about 90 elements/s) while utilizing 15 Cloud Dataflow workers (autoscaling, default max workers). If my rough calculations are correct, it should be twice as fast. I suppose that's because I call the external library function for every element.
Are there any best practices for improving the performance of external libraries when running with Java?
1 Answer
#1
The guidance in that blog post is slightly incorrect - a much better place to put the initialization code is the @Setup method, not @StartBundle.
@Setup is called to initialize an instance of your DoFn in every thread on every worker that will be executing it. It is the intended place for heavy setup code. Its counterpart is @Teardown.
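Applied to the code in the question, the load moves into @Setup, roughly like this (a sketch, not a drop-in replacement):

static class LookupFn extends DoFn<String, KV<String, String>> {

  private LookupLibrary lookup;

  @Setup
  public void setup() {
    // Runs once per DoFn instance, before it processes any bundles.
    lookup = Native.loadLibrary("lookup", LookupLibrary.class);
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    String domain = c.element();
    String results = lookup.Lookup(domain);
    if (results != null) {
      c.output(KV.of(domain, results));
    }
  }
}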
@StartBundle and @FinishBundle are much finer granularity: per bundle, which is a quite low-level concept, and I believe the only common legitimate use for them is writing batches of elements to an external service: typically you would initialize the next batch in @StartBundle and flush it in @FinishBundle.
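As an illustration of that batching pattern (unrelated to the lookup pipeline above; flushBatch stands in for a hypothetical call to your external service):

import java.util.ArrayList;
import java.util.List;

static class BatchingWriteFn extends DoFn<String, Void> {

  private transient List<String> batch;

  @StartBundle
  public void startBundle() {
    // Start a fresh batch for this bundle.
    batch = new ArrayList<>();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    batch.add(c.element());
  }

  @FinishBundle
  public void finishBundle() {
    // Flush everything accumulated during this bundle in one call.
    flushBatch(batch);
  }

  private void flushBatch(List<String> elements) {
    // Hypothetical: write the whole batch to your external service here.
  }
}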
Generally, to debug the performance, try adding logging to your DoFn's methods and see how many milliseconds the calls take and how that compares against your expectations. If you get stuck, include a Dataflow job ID in the question and an engineer will take a look at it.
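For example, timing the native call inside LookupFn with SLF4J (the logging facade Beam itself uses) might look like this sketch, showing only the relevant parts:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

private static final Logger LOG = LoggerFactory.getLogger(LookupFn.class);

@ProcessElement
public void processElement(ProcessContext c) {
  long start = System.currentTimeMillis();
  String results = lookup.Lookup(c.element());
  LOG.info("Lookup took {} ms", System.currentTimeMillis() - start);
  if (results != null) {
    c.output(KV.of(c.element(), results));
  }
}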