Google Cloud - What are the products for time series data cleaning?

Time: 2022-07-20 23:11:26

I have around 20 TB of time series data stored in BigQuery.

The current pipeline I have is:

raw data in BigQuery => joins in BigQuery to create more BigQuery datasets => store them in buckets

Then I download a subset of the files in the bucket:

I then work on interpolating/resampling the data using Python/SFrame, because some of the time series have missing timestamps and are not evenly sampled.

However, it takes a long time on a local PC, and I'm guessing it will take days to go through that 20TB of data.


Since the data are already in buckets, I'm wondering what the best Google tools for interpolation and resampling would be?

After resampling and interpolation I might use Facebook's Prophet or Auto ARIMA to create some forecasts. But that would be done locally.


There are a few services from Google that seem like good options.

  1. Cloud Dataflow: I have no experience with Apache Beam, but it looks like the Python API for Apache Beam is missing functions compared to the Java version? I know how to write Java, but I'd like to use one programming language for this task.

  2. Cloud Dataproc: I know how to write PySpark. I don't really need any real-time or stream processing, but Spark has time series interpolation, so this might be the only option?

  3. Cloud Dataprep: Looks like a GUI for cleaning data, but it's in beta. Not sure if it can do time series resampling/interpolation.

Does anyone have any idea which might best fit my use case?

Thanks

2 answers

#1


I would use PySpark on Dataproc, since Spark is not just for real-time/streaming but also for batch processing.

You can choose the size of your cluster (and use some preemptibles to save costs) and run this cluster only for the time you actually need to process this data. Afterwards kill the cluster.

Spark also works very nicely with Python (though not as nicely as with Scala), but for all intents and purposes the main difference is performance, not reduced API functionality.

Even with batch processing, you can use a WindowSpec for effective time series interpolation.
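To make the interpolation idea concrete, here is a minimal sketch in plain Python, not Spark code. For every point on a regular time grid it looks up the last known reading before it and the first known reading after it, then interpolates linearly between the two. In PySpark those two look-ups would typically be expressed as window functions (for example `last`/`first` with `ignorenulls=True`) over a window partitioned by series and ordered by timestamp. The function name `resample_linear` and the sample readings are made up for illustration.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def resample_linear(readings, step):
    """Resample (timestamp, value) pairs onto a regular grid.

    Assumes `readings` is non-empty and sorted by timestamp.
    Grid points without a reading are filled by linear interpolation.
    """
    times = [t for t, _ in readings]
    values = [v for _, v in readings]
    out = []
    t = times[0]
    while t <= times[-1]:
        i = bisect_left(times, t)  # index of the first reading at or after t
        if times[i] == t:
            # A real reading exists exactly at this grid point.
            out.append((t, values[i]))
        else:
            # Neighbours: last reading before t and first reading after t.
            t0, v0 = times[i - 1], values[i - 1]
            t1, v1 = times[i], values[i]
            frac = (t - t0) / (t1 - t0)  # timedelta / timedelta gives a float
            out.append((t, v0 + frac * (v1 - v0)))
        t += step
    return out


# Hypothetical series with the 00:02 reading missing.
readings = [
    (datetime(2022, 1, 1, 0, 0), 10.0),
    (datetime(2022, 1, 1, 0, 1), 12.0),
    (datetime(2022, 1, 1, 0, 3), 16.0),
]
resampled = resample_linear(readings, timedelta(minutes=1))
for ts, value in resampled:
    print(ts.isoformat(), value)
```

The missing 00:02 point is filled with 14.0, halfway between its neighbours. On a cluster the same logic would run per series key, which is what partitioning the window by series gives you.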

To be fair: I don't have a lot of experience with Dataflow or Dataprep, but that's because our use case is somewhat similar to yours and Dataproc works well for that.

#2


Cloud Dataflow is a batch and stream data processing service, Cloud Dataproc is a managed Spark and Hadoop service, and Cloud Dataprep is used to transform/clean raw data. All of them can be used to perform interpolation/resampling of data.

I would discard Cloud Dataprep. It might change in backward-incompatible ways because it is in beta release. The main difference between Cloud Dataflow and Cloud Dataproc is the cluster management capabilities of the latter. If you do not expect a clear payoff from managing clusters yourself, Cloud Dataflow is the product in which you can set up the mentioned operations most easily.

The Apache Beam Java SDK is older than the Python SDK, since Apache Beam 1.X supported only Java. The newer 2.X version supports both languages, with no apparent Python/Java difference.

You will find this Dataflow timeseries example in Java useful if you decide that Dataflow is the best-suited option.
