Streaming .csv files into Cloud Storage with Pub/Sub

Date: 2022-05-30 15:35:09

A general question, if anyone can point me in the right direction: what is the best way to get incoming streaming .csv files into BigQuery at large scale (with some transformations applied using Dataflow), using Pub/Sub? I'm thinking of using Pub/Sub to handle the many large raw streams of incoming .csv files.


for example the approach I’m thinking of is:


1. incoming raw .csv file > 2. Pub/Sub > 3. Cloud Storage > 4. Cloud Function (to trigger Dataflow) > 5. Dataflow (to transform) > 6. BigQuery


Let me know if there are any issues with this approach at scale, or if there is a better alternative.


If that is a good approach, how do I get Pub/Sub to pick up the .csv files, and how do I construct this?


Thanks


Ben


1 solution

#1



There are a couple of different ways to approach this, but much of your use case can be solved using the Google-provided Dataflow templates. When using the templates, light transformations can be done within a JavaScript UDF. This saves you from maintaining an entire pipeline; you only write the transformations necessary for your incoming data.


If you're accepting many files as input streamed into Cloud Pub/Sub, remember that Cloud Pub/Sub makes no guarantees about ordering, so records from different files would likely get intermixed in the output. If you're looking to capture an entire file as-is, uploading directly to GCS would be the better approach.

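Since records from different files can interleave, one way to keep them attributable is to prepend a source-file identifier to each line before publishing, then split it back out on the consuming side. This is a sketch, not a feature of the templates themselves; tagLine and untagLine are hypothetical helper names:

```javascript
// Hypothetical helpers for keeping intermixed records attributable:
// tag each CSV line with its source filename before publishing to
// Pub/Sub, and recover the tag when consuming. The "|" delimiter is
// an assumption; pick a character that cannot appear in a filename.
function tagLine(sourceFile, line) {
  return sourceFile + '|' + line;
}

function untagLine(tagged) {
  var idx = tagged.indexOf('|');
  return {
    sourceFile: tagged.substring(0, idx),
    line: tagged.substring(idx + 1)
  };
}

var tagged = tagLine('sales-2018-01.csv', '2018-01-08,Product1,99.99,79.99,Visa');
var parsed = untagLine(tagged);
// parsed.sourceFile is 'sales-2018-01.csv'
// parsed.line is the original CSV record, unchanged
```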

Using the provided templates, either Cloud Pub/Sub to BigQuery or GCS to BigQuery, you could use a simple UDF to transform the data from CSV format into a JSON format matching the BigQuery output table schema.


For example, if you had CSV records such as:


transactionDate,product,retailPrice,cost,paymentType
2018-01-08,Product1,99.99,79.99,Visa

You could write a UDF to transform that data into your output schema like so:


function transform(line) {
  var values = line.split(',');

  // Construct the output record and apply transformations.
  // parseFloat keeps the numeric fields numeric so they serialize as
  // JSON numbers matching a FLOAT column in the BigQuery schema.
  var obj = new Object();
  obj.transactionDate = values[0];
  obj.product = values[1];
  obj.retailPrice = parseFloat(values[2]);
  obj.cost = parseFloat(values[3]);
  // Derived field: margin as a fraction of the retail price
  obj.marginPct = (obj.retailPrice - obj.cost) / obj.retailPrice;
  obj.paymentType = values[4];

  return JSON.stringify(obj);
}
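As a sanity check before deploying, the UDF can be run locally under plain Node.js. The sketch below repeats the transform function so it runs standalone, and also skips the file's header row, since each line of a GCS file is passed to the UDF; whether returning undefined drops the element can vary by template version, so verify that against the template documentation:

```javascript
// Standalone copy of the UDF for local testing, extended to skip the
// CSV header row. parseFloat keeps the numeric fields numeric so they
// match FLOAT columns in the BigQuery schema.
var HEADER = 'transactionDate,product,retailPrice,cost,paymentType';

function transform(line) {
  if (line === HEADER) {
    return;  // header row: emit nothing for this element
  }
  var values = line.split(',');
  var obj = new Object();
  obj.transactionDate = values[0];
  obj.product = values[1];
  obj.retailPrice = parseFloat(values[2]);
  obj.cost = parseFloat(values[3]);
  obj.marginPct = (obj.retailPrice - obj.cost) / obj.retailPrice;
  obj.paymentType = values[4];
  return JSON.stringify(obj);
}

var record = JSON.parse(transform('2018-01-08,Product1,99.99,79.99,Visa'));
console.log(record.product);    // "Product1"
console.log(record.marginPct);  // roughly 0.2
```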
