I have the following data in Google Cloud Storage:
Advertiser | Event
__________________
100 | Click
101 | Impression
100 | Impression
100 | Impression
101 | Impression
The output of my pipeline should look something like this:
Advertiser | Count
100 | 3
101 | 2
First I used groupByKey, which gives output like this:
100 Click, Impression, Impression
101 Impression, Impression
How do I proceed from here?
2 solutions
#1
Instead of a GroupByKey, you may want to use a combine transform, which is a composite transform that lets the runner optimize before and after the group by key (for example, by partially combining values before they are shuffled). Your pipeline can look something like this:
Python
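import apache_beam as beam

collection_contents = [(100, 'Click'),
                       (101, 'Impression'),
                       (100, 'Impression'),
                       (100, 'Impression'),
                       (101, 'Impression')]
input_collection = pipeline | beam.Create(collection_contents)
# Count.PerKey() counts the elements for each key and ignores the values
counts = input_collection | beam.combiners.Count.PerKey()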
This should output a collection with the shape you are looking for. The Count family of transforms lives in the apache_beam.transforms.combiners module and is also reachable as beam.combiners.Count.
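To make the "optimizes before and after the group by key" remark more concrete, here is a minimal plain-Python sketch of the idea behind a per-key combiner. It does not use Beam, and the split into two bundles is a hypothetical stand-in for two workers each reading part of the input. Each bundle is counted locally first, then the partial counts are merged per key.

```python
from collections import Counter

events = [(100, "Click"), (101, "Impression"), (100, "Impression"),
          (100, "Impression"), (101, "Impression")]

# Hypothetical split into two bundles, as if two workers each read a part.
bundles = [events[:3], events[3:]]

# Before the shuffle: each worker counts the keys it has seen locally.
partial_counts = [
    Counter(advertiser for advertiser, _event in bundle)
    for bundle in bundles
]

# After the shuffle: partial counts for the same key are added together.
total = Counter()
for partial in partial_counts:
    total.update(partial)  # Counter.update adds counts instead of replacing

print(sorted(total.items()))
```

Only the small partial counts need to cross the group-by-key boundary, rather than every individual event, which is the saving a combine step is meant to offer over a bare GroupByKey.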
Java
The same transforms exist for Java in the org.apache.beam.sdk.transforms package:
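// Assumes inputColl is a PCollection<KV<Integer, String>>; Count.perKey() yields Long counts.
PCollection<KV<Integer, Long>> resultColl = inputColl.apply(Count.<Integer, String>perKey());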
#2
This counting pattern is covered in Apache Beam's 'word count' example. See wordcount.py among the Apache Beam samples on GitHub; the counting starts at line 95.
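For readers who have not opened that sample: as far as I recall, its counting step pairs each word with a 1 and then sums the ones per key. The sketch below mirrors that idea in plain Python on the advertiser data from the question. It is an illustration of the pattern only, not the sample's actual code, and it does not use Beam.

```python
from collections import defaultdict

events = [(100, "Click"), (101, "Impression"), (100, "Impression"),
          (100, "Impression"), (101, "Impression")]

# Step 1: pair each advertiser with a 1, dropping the event name.
pairs = [(advertiser, 1) for advertiser, _event in events]

# Step 2: group the ones by key, which is what a group-by-key step produces.
grouped = defaultdict(list)
for advertiser, one in pairs:
    grouped[advertiser].append(one)

# Step 3: sum the ones in each group to get the count per advertiser.
counts = {advertiser: sum(ones) for advertiser, ones in grouped.items()}

print(counts)
```

Step 3 is also the answer to "how to proceed" from an existing groupByKey result: map each (key, values) pair to (key, number of values).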