We are running logfile-parsing jobs in Google Dataflow using the Python SDK. The data is spread over several hundred daily logs, which we read via a file pattern from Cloud Storage. The total volume across all files is about 5-8 GB (gz files), with 50-80 million lines in total.
loglines = p | ReadFromText('gs://logfile-location/logs*-20180101')
In addition, we have a simple (small) mapping CSV that maps logfile entries to human-readable text. It has about 400 lines and is about 5 KB in size.
For example, a logfile entry with [param=testing2] should be mapped to "Customer requested 14day free product trial" in the final output.
We do this in a simple beam.Map with a side input, like so:
customerActions = loglines | beam.Map(map_logentries,mappingTable)
where map_logentries is the mapping function and mappingTable is said mapping table.
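To make the setup concrete, here is a minimal sketch of the mapping function and the native-Python loading step. The CSV layout (two columns, key then description, no header), the `[param=...]` regex, and the helper name `load_mapping` are assumptions for illustration, not the actual production code.

```python
import csv
import io
import re

# Assumed log format: the parameter appears as "[param=<value>]" in the line.
PARAM_RE = re.compile(r"\[param=([^\]]+)\]")


def load_mapping(csv_text):
    """Parse 'key,description' CSV rows into a dict (hypothetical helper)."""
    reader = csv.reader(io.StringIO(csv_text))
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def map_logentries(logline, mapping_table):
    """Return the readable text for the line's param, or the line unchanged."""
    match = PARAM_RE.search(logline)
    if match is None:
        return logline
    return mapping_table.get(match.group(1), logline)
```

With beam.Map(map_logentries, mappingTable), Beam calls map_logentries(element, mappingTable) for each log line, so the dict built by load_mapping is passed as the second argument.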
However, this only works if we read the mapping table in native Python via open() / read(). If we do the same through the Beam pipeline via ReadFromText() and pass the resulting PCollection as a side input to the Map, like so:
mappingTable = p | ReadFromText('gs://side-inputs/category-mapping.csv')
customerActions = loglines | beam.Map(map_logentries,beam.pvalue.AsIter(mappingTable))
performance breaks down completely, to about 2-3 items per second.
Now, my questions:
- Why would performance break down so badly? What is wrong with passing a PCollection as a side input?
- If using PCollections as side inputs is not recommended, how is one supposed to build a pipeline that needs mappings that cannot or should not be hard-coded into the mapping function?
For us, the mapping does change frequently and I need to find a way to have "normal" users provide it. The idea was to have the mapping csv available in Cloud Storage, and simply incorporate it into the Pipeline via ReadFromText(). Reading it locally involves providing the mapping to the workers, so only the tech-team can do this.
I am aware that there are caching issues with side inputs, but surely this should not apply to a 5 KB input.
All code above is pseudo code to explain the problem. Any ideas and thoughts on this would be highly appreciated!
2 Answers
#1
2
For more efficient side inputs (of small to medium size), you can use beam.pvalue.AsList(mappingTable), since AsList causes Beam to materialize the data, so you can be sure you will get an in-memory list for that PCollection.
Intended for use in side-argument specification---the same places where AsSingleton and AsIter are used, but forces materialization of this PCollection as a list.
Source: https://beam.apache.org/documentation/sdks/pydoc/2.2.0/apache_beam.pvalue.html?highlight=aslist#apache_beam.pvalue.AsList
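A rough sketch of how this could be wired up, assuming the CSV has two columns (key, description) and no header row. `parse_mapping_line`, `PARAM_RE`, and the body of `map_logentries` are hypothetical; the bucket paths are the ones from the question.

```python
import csv
import re

# Assumed log format: the parameter appears as "[param=<value>]" in the line.
PARAM_RE = re.compile(r"\[param=([^\]]+)\]")


def parse_mapping_line(line):
    """Turn one CSV line into a (key, description) tuple."""
    row = next(csv.reader([line]))
    return row[0], row[1]


def map_logentries(logline, mapping_rows):
    """mapping_rows arrives as an in-memory list of (key, description) tuples."""
    match = PARAM_RE.search(logline)
    if match is None:
        return logline
    for key, description in mapping_rows:
        if key == match.group(1):
            return description
    return logline


def run():
    # Imported here so the helpers above stay usable without Beam installed.
    import apache_beam as beam
    from apache_beam.io import ReadFromText

    with beam.Pipeline() as p:
        loglines = p | "ReadLogs" >> ReadFromText("gs://logfile-location/logs*-20180101")
        mapping_table = (
            p
            | "ReadMapping" >> ReadFromText("gs://side-inputs/category-mapping.csv")
            | "ParseMapping" >> beam.Map(parse_mapping_line)
        )
        # AsList hands the whole mapping to the function as a Python list.
        customer_actions = loglines | "MapEntries" >> beam.Map(
            map_logentries, beam.pvalue.AsList(mapping_table)
        )
        # Further transforms and a sink for customer_actions would follow here.
```

Parsing the CSV once in its own step keeps the per-element work in map_logentries down to a regex match and a scan over roughly 400 tuples.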
#2
0
- The code looks fine. However, since mappingTable is a mapping, wouldn't beam.pvalue.AsDict be more appropriate for your use case?
- Your mappingTable is small enough that a side input is a good fit here. Given that mappingTable is also static, you can load it from GCS in the start_bundle function of your DoFn. See the answer to this post for more details. If mappingTable becomes very large in the future, you can also consider converting your log entries (the input to map_logentries) and mappingTable into PCollections of key-value pairs and joining them using CoGroupByKey.
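The AsDict suggestion in the first point could look roughly like the sketch below. It assumes a two-column CSV (key, description) without a header and a `[param=...]` pattern in the log lines; `to_key_value` and `PARAM_RE` are hypothetical names, and the bucket paths are taken from the question.

```python
import csv
import re

# Assumed log format: the parameter appears as "[param=<value>]" in the line.
PARAM_RE = re.compile(r"\[param=([^\]]+)\]")


def to_key_value(line):
    """Parse one CSV line into the (key, value) pair that AsDict expects."""
    row = next(csv.reader([line]))
    return row[0], row[1]


def map_logentries(logline, mapping):
    """mapping is a regular dict, so each lookup is a single hash access."""
    match = PARAM_RE.search(logline)
    if match is None:
        return logline
    return mapping.get(match.group(1), logline)


def run():
    # Imported here so the helpers above stay usable without Beam installed.
    import apache_beam as beam
    from apache_beam.io import ReadFromText

    with beam.Pipeline() as p:
        loglines = p | "ReadLogs" >> ReadFromText("gs://logfile-location/logs*-20180101")
        mapping_table = (
            p
            | "ReadMapping" >> ReadFromText("gs://side-inputs/category-mapping.csv")
            | "ToKeyValue" >> beam.Map(to_key_value)
        )
        # AsDict requires a PCollection of 2-tuples and passes it in as a dict.
        customer_actions = loglines | "MapEntries" >> beam.Map(
            map_logentries, beam.pvalue.AsDict(mapping_table)
        )
        # Further transforms and a sink for customer_actions would follow here.
```

Note that duplicate keys in the CSV would collapse to a single dict entry, so this form only fits if each key appears once.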