Counting with group by in Google Dataflow

I have the following data in Google Cloud Storage:

Advertiser | Event
__________________
100 | Click
101 | Impression
100 | Impression
100 | Impression
101 | Impression

The output of my pipeline should look something like this:

Advertiser | Count
__________________
100 | 3
101 | 2

First I used GroupByKey, and the output looks like this:

100 Click, Impression, Impression
101 Impression, Impression

How do I proceed from here?

2 Answers

#1

Instead of GroupByKey, you may want to use a combine transform, which is a composite that optimizes the work before and after the group-by-key step. Your pipeline could look something like this:

Python

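import apache_beam as beam

# Sample (advertiser, event) records standing in for the data in Cloud Storage.
collection_contents = [(100, 'Click'),
                       (101, 'Impression'),
                       (100, 'Impression'),
                       (100, 'Impression'),
                       (101, 'Impression')]

with beam.Pipeline() as pipeline:
    input_collection = pipeline | beam.Create(collection_contents)

    # Count the elements for each key; yields (100, 3) and (101, 2).
    counts = input_collection | beam.combiners.Count.PerKey()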

This should output a collection with the shape you are looking for. The Count family of transforms (Count.Globally, Count.PerKey, Count.PerElement) is available as apache_beam.transforms.combiners.Count, which is also reachable as beam.combiners.Count.
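
To make the "optimizes before and after the group by key" remark concrete, here is a plain-Python sketch of the idea behind a per-key count combiner. It does not use Beam and is not how Beam is implemented internally; the split into two bundles and the helper names partial_counts and merge_counts are assumptions made purely for illustration. Each bundle is counted locally first, so only small partial counts (rather than every record) would need to be shuffled and merged.

```python
from collections import Counter


def partial_counts(bundle):
    """Pre-combine step: count keys locally within one bundle of records."""
    return Counter(key for key, _ in bundle)


def merge_counts(partials):
    """Post-shuffle step: add up the partial counts for each key."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


# Hypothetical split of the sample data into two bundles (e.g. two workers).
bundle_a = [(100, "Click"), (101, "Impression"), (100, "Impression")]
bundle_b = [(100, "Impression"), (101, "Impression")]

counts = merge_counts([partial_counts(bundle_a), partial_counts(bundle_b)])
print(counts)  # {100: 3, 101: 2}
```

With a plain GroupByKey, by contrast, every event value for a key has to be brought together before anything can be counted.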

Java

The same transforms exist for Java in the org.apache.beam.sdk.transforms package:

// inputColl is a PCollection<KV<Integer, String>>; Count.perKey() emits the count as a Long.
PCollection<KV<Integer, Long>> resultColl = inputColl.apply(Count.perKey());

#2

This counting pattern is described in the 'word count' sample of Apache Beam.

See wordcount.py in the Apache Beam examples on GitHub. The counting starts at line 95.
