Can leading rows be skipped when reading files in Google Dataflow?

Date: 2023-01-28 15:34:44

I want to skip leading rows when reading files in Google Dataflow. Is that feature available in the latest version? The files are stored in Google Cloud Storage, and I will be writing them to BigQuery.

The bq load command has a --skip_leading_rows option, which skips the given number of leading rows when reading from the files.

I want a similar feature in Google Dataflow. My input is in the following format.

I want Dataflow to ignore the first line and write only the remaining lines to BigQuery.

1 answer

#1


This feature is not supported directly in Dataflow or its ParDo transforms.

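Instead, apply a Filter transform with a predicate that rejects the header line: Filter.byPredicate() in the Dataflow SDK 1.x, or Filter.by() in Apache Beam.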

For example:

PCollection<X> rows = ...;
PCollection<X> nonHeaders =
   rows.apply(Filter.by(new MatchIfNonHeader()));
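
The snippet above does not show MatchIfNonHeader, so here is a minimal sketch of what such a predicate might look like. Because the elements of a PCollection have no guaranteed order, the predicate cannot rely on "the first line" and has to recognize the header by its content. The constructor argument, the dropHeader helper, and the placeholder strings are assumptions made for illustration, not part of the original answer. To keep the sketch runnable without the SDK it implements java.util.function.Predicate; in a real pipeline the class would presumably implement the SDK's SerializableFunction<String, Boolean> and expose apply(String) instead of test(String).

```java
import java.io.Serializable;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/**
 * Hypothetical predicate: accepts every line except the known header line.
 * Serializable because pipeline functions are shipped to worker machines.
 */
final class MatchIfNonHeader implements Predicate<String>, Serializable {
    private static final long serialVersionUID = 1L;

    // Assumption: the exact text of the header line is known up front.
    private final String header;

    MatchIfNonHeader(String header) {
        this.header = header;
    }

    @Override
    public boolean test(String line) {
        // Match on content, not position: element order is not guaranteed.
        return !header.equals(line);
    }

    /** Plain-Java stand-in for rows.apply(Filter.by(...)) on an in-memory list. */
    static List<String> dropHeader(List<String> lines, String header) {
        return lines.stream()
                .filter(new MatchIfNonHeader(header))
                .collect(Collectors.toList());
    }
}
```

One consequence of matching on content is that any data row identical to the header is dropped as well, and if several input files each carry the same header, all of those header lines are removed.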
