Apache Beam unable to read Avro file

Time: 2021-06-30 15:36:08

I need to read an Avro file from local disk or GCS via Java. I followed the example from the docs at https://beam.apache.org/documentation/sdks/javadoc/2.0.0/index.html?org/apache/beam/sdk/io/AvroIO.html


Pipeline p = ...;

// A Read from a GCS file (runs locally and using remote execution):
Schema schema = new Schema.Parser().parse(new File("schema.avsc"));
PCollection<GenericRecord> records =
    p.apply(AvroIO.readGenericRecords(schema)
            .from("gs://my_bucket/path/to/records-*.avro"));

But when I try to process it through a DoFn, there doesn't appear to be any data. The Avro file does have data, and I was able to run a function to generate a schema from it. If anybody has advice, please share.

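To make the setup concrete, here is a minimal, self-contained sketch of what such a pipeline might look like with a simple DoFn attached to print each record. The LogRecordFn class and the main method are my own illustration (not from the original post), just to show where the data would be expected to appear.

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class ReadAvroExample {

  // Hypothetical DoFn that prints each record, to confirm whether any data arrives.
  static class LogRecordFn extends DoFn<GenericRecord, String> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      System.out.println(c.element());
      c.output(c.element().toString());
    }
  }

  public static void main(String[] args) throws IOException {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Parse the Avro schema from a local .avsc file.
    Schema schema = new Schema.Parser().parse(new File("schema.avsc"));

    // Read GenericRecords from GCS, as in the AvroIO javadoc example.
    PCollection<GenericRecord> records =
        p.apply(AvroIO.readGenericRecords(schema)
                .from("gs://my_bucket/path/to/records-*.avro"));

    // If the read works, each record should be printed here.
    records.apply(ParDo.of(new LogRecordFn()));

    p.run().waitUntilFinish();
  }
}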

2 Answers

#1 (score: 1)

I absolutely agree with Andrew; more information would be required. However, I think you should consider using AvroIO.Read, which is a more appropriate transform for reading records from one or more Avro files.


https://cloud.google.com/dataflow/model/avro-io#reading-with-avroio


PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);

// Parse the Avro schema from a local .avsc file.
Schema schema = new Schema.Parser().parse(new File("schema.avsc"));

// Read GenericRecords matching the GCS glob.
PCollection<GenericRecord> records =
    p.apply(AvroIO.Read.named("ReadFromAvro")
                       .from("gs://my_bucket/path/records-*.avro")
                       .withSchema(schema));
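
Note that AvroIO.Read.named(...) is the pre-Beam Dataflow SDK 1.x API described in the linked Dataflow docs. If the project is on Beam 2.0.0, as the javadoc link in the question suggests, a rough equivalent (my own sketch, reusing the same bucket path and schema file) would pass the name to apply() instead:

PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);

Schema schema = new Schema.Parser().parse(new File("schema.avsc"));

// In Beam 2.x, the transform name goes on apply() and the schema on readGenericRecords().
PCollection<GenericRecord> records =
    p.apply("ReadFromAvro",
            AvroIO.readGenericRecords(schema)
                  .from("gs://my_bucket/path/records-*.avro"));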

#2 (score: 0)

Hey guys, thanks for looking into this. I can't share any code because it belongs to clients. I did not receive any error messages, and the debugger did see data, but we were not able to see the data from the Avro file (via ParDo).


I did manage to fix the issue by recreating the Dataflow project using the Eclipse wizard. I even used the same code. I wonder why I did not receive any error messages.

