Cloud Dataflow: Generating tables in BigQuery

Time: 2021-12-27 14:08:10

I have a pipeline that reads streaming data from Cloud Pub/Sub, processes it with Dataflow, and saves it into one large BigQuery table. Each Pub/Sub message includes an associated account_id. Is there a way to create new tables on the fly when a new account_id is identified, and then populate them with the data for that account_id?

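For context, here is a minimal Beam (Python) sketch of the setup described above; the topic, table name, schema, and parse step are placeholders, not details from the question:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/my-topic")
        | "Parse" >> beam.Map(json.loads)  # each message carries an account_id
        | "WriteBQ" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.events",  # the one large shared table
            schema="account_id:STRING,payload:STRING,event_time:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```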

I know this could be done by updating the pipeline for each new account, but in an ideal world Cloud Dataflow would generate these tables programmatically from within the code.

2 Answers

#1 (score: 1)

I wanted to share a few options I see.

Option 1 - wait for the Partition on non-date field feature
It is not known when this will be implemented and made available, so it may not be what you want right now. But when it goes live, it will be the best option for scenarios like this.

Option 2 - you can hash your account_id into a predefined number of buckets. In this case you can pre-create all of those tables, and have logic in your code that picks the destination table for each row based on the account hash. The same hashing logic then needs to be used in the queries that read that data. A sketch of this is shown below.

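A minimal sketch of this idea in the Beam Python SDK, assuming a hypothetical bucket count and table naming scheme; WriteToBigQuery accepts a callable, so each row can be routed to its bucket table:

```python
import hashlib

import apache_beam as beam

NUM_BUCKETS = 16  # assumed; pick a count that fits your data volume


def bucket_table(row):
    # Use a stable hash (Python's built-in hash() is randomized per
    # process) to map each account_id onto a pre-created bucket table.
    digest = hashlib.md5(row["account_id"].encode("utf-8")).hexdigest()
    return f"my-project:my_dataset.events_bucket_{int(digest, 16) % NUM_BUCKETS}"


def write_bucketed(rows):
    # Tables are pre-created, hence CREATE_NEVER; the callable picks
    # the destination table for each row.
    return rows | "WriteBucketed" >> beam.io.WriteToBigQuery(
        table=bucket_table,
        schema="account_id:STRING,payload:STRING,event_time:TIMESTAMP",
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
    )
```

Readers of the data must then apply the same MD5-mod-NUM_BUCKETS logic to find the right bucket table for a given account_id.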

#2 (score: 0)

The API for creating BigQuery Tables is at https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/insert.

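For example, here is a sketch using the google-cloud-bigquery Python client, which issues that tables.insert request under the hood; the dataset, table, and schema names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Define and create a hypothetical per-account table programmatically.
schema = [
    bigquery.SchemaField("account_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("payload", "STRING"),
    bigquery.SchemaField("event_time", "TIMESTAMP"),
]
table = bigquery.Table("my-project.my_dataset.account_12345", schema=schema)
table = client.create_table(table)  # the tables.insert call
print(f"Created {table.full_table_id}")
```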

Nevertheless, it would probably be easier to store all accounts in one static table that contains account_id as a column.

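With that single-table design, reading one account's data is just a filtered query; a sketch with the same hypothetical names:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Parameterized query over the one shared table.
job = client.query(
    "SELECT * FROM `my-project.my_dataset.events` WHERE account_id = @account_id",
    job_config=bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("account_id", "STRING", "acct-123"),
        ]
    ),
)
for row in job:  # iterating waits for and streams the results
    print(dict(row))
```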
