Custom source for reading Parquet files in Cloud Dataflow

Time: 2021-12-16 15:47:46

I have a requirement to read a Parquet file in my Dataflow pipeline, written in Java, and upload the data to BigQuery. Since there is no out-of-the-box functionality for this yet, I understand I have to write a custom source using hadoopFileFormat, but I cannot find any documentation on it. Can somebody point me to code or documentation on how to write a custom source, or to any other approach for reading a Parquet file in Cloud Dataflow?


1 answer

#1


The Apache Beam documentation for Built-in I/O Transforms provides a list of the I/O transforms that are still a work in progress in Apache Beam. That list actually includes reading Apache Parquet files in Java, whose progress can be followed in the BEAM-214 Jira.


So, as of now, you are right: there is no out-of-the-box solution for working with Parquet files in Apache Beam / Cloud Dataflow. However, progress is being made in that area, so feel free to keep an eye on the Jira I shared above.

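In the meantime, a common interim workaround is to skip the custom-source/hadoopFileFormat route entirely and open each file with parquet-avro's AvroParquetReader inside a DoFn, converting each GenericRecord to a TableRow for BigQueryIO. Below is a minimal, untested sketch of that idea (it is not a splittable source, so each file is read by a single worker). The bucket path, table name and schema are placeholders; the Avro-to-TableRow conversion is deliberately naive (everything becomes a string); and resolving gs:// paths through the Hadoop Path API assumes the GCS connector is on the worker classpath.

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Arrays;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

public class ParquetToBigQuery {

  /** Reads every record of one Parquet file and emits it as a BigQuery TableRow. */
  static class ReadParquetFn extends DoFn<String, TableRow> {
    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
      // Assumes the GCS connector is available so gs:// paths resolve via the Hadoop FileSystem API.
      try (ParquetReader<GenericRecord> reader =
          AvroParquetReader.<GenericRecord>builder(new Path(c.element())).build()) {
        GenericRecord record;
        while ((record = reader.read()) != null) {
          TableRow row = new TableRow();
          // Naive conversion: every field becomes a string. Real code should map Avro types
          // (logical types, nested records, arrays) to the proper BigQuery column types.
          for (Schema.Field field : record.getSchema().getFields()) {
            Object value = record.get(field.name());
            row.set(field.name(), value == null ? null : value.toString());
          }
          c.output(row);
        }
      }
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Placeholder schema matching the naive string conversion above.
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("id").setType("STRING"),
        new TableFieldSchema().setName("name").setType("STRING")));

    p.apply("FilePaths", Create.of("gs://my-bucket/data/part-00000.parquet"))  // placeholder path
     .apply("ReadParquet", ParDo.of(new ReadParquetFn()))
     .apply("WriteToBQ", BigQueryIO.writeTableRows()
         .to("my-project:my_dataset.my_table")  // placeholder table
         .withSchema(schema)
         .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
         .withWriteDisposition(WriteDisposition.WRITE_APPEND));

    p.run();
  }
}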

Also, you should know that Stack Overflow is not the appropriate site to ask for code or external tutorials/documentation on how to do something, so it is really unlikely that you will get that type of information here. As per the Help Center:


  Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.

Instead, I would suggest that you first try an implementation yourself and then come back here with specific questions that can be better answered by the community.
