Apache Pig - Projection after a join results in NULL

Time: 2021-03-24 13:44:41

The code below works as expected:

a = load 'data_a' using PigStorage('\t') as (a1, a2, a3);
b = load 'data_b' using PigStorage('\t') as (b1, b2, b3);
a_b = join a by a1, b by b1; --inner join

When I inspect the fields, they are populated correctly.

However, once I add a projection into the mix, it doesn't work.

a = load 'data_a' using PigStorage('\t') as (a1, a2, a3);
b = load 'data_b' using PigStorage('\t') as (b1, b2, b3);
a_b = join a by a1, b by b1; --inner join
ab = foreach a_b generate a1 as a1, a2 as a2, b2 as b2;

In ab, all cells in the fields from b are NULL.

The same thing happens if I do this:

a = load 'data_a' using PigStorage('\t') as (a1, a2, a3);
a2 = foreach a generate a1, a2;
b = load 'data_b' using PigStorage('\t') as (b1, b2, b3);
b2 = foreach b generate b1, b2;
ab = join a2 by a1, b2 by b1;

I use the following workaround, but hate being bogged down by the store/load:

a = load 'data_a' using PigStorage('\t') as (a1, a2, a3);
b = load 'data_b' using PigStorage('\t') as (b1, b2, b3);
a_b = join a by a1, b by b1; --inner join
store a_b into 'hdfs:///a_b_temp' using PigStorage('\t','-schema');
a_b2 = load 'hdfs:///a_b_temp' using PigStorage('\t');
ab = foreach a_b2 generate a1 as a1, a2 as a2, b2 as b2;

With this workaround, the fields in ab are no longer NULL. However, if I then group and perform aggregations, I typically get the error:

ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR: org.apache.pig.data.DataByteArray cannot be cast to java.lang.Long

However, this error goes away if I skip the last projection.

I am new to Pig - are there any known bugs/issues that could be causing this? I have observed it happening several times with different data sets.

I am using Pig 0.12 on Amazon AWS EMR.

Thanks for any help!

1 answer

#1


I tried your second approach; here is the code.

a = load '/user/root/pig/file1.txt' using PigStorage('\t') as (a1:int, a2:chararray, a3:chararray);
b = load '/user/root/pig/file2.txt' using PigStorage('\t') as (b1:int, b2:chararray, b3:chararray);

--inner join
a_b = join a by a1, b by b1; 

--If your goal is to get selected fields from relation b based on the join condition.
--a::a1 means "the field a1 that came from relation a".
ab = foreach a_b generate a::a1, a2, b2;

--If your goal is to get all matching data on id from both relations.
--ab = foreach a_b generate $0..;

DUMP ab;
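
If the projection still returns unexpected values, it may help to look at the schema Pig assigns to the joined relation. The sketch below reuses the load statements from this answer and adds DESCRIBE calls; after a join, Pig prefixes each field with its source alias (a:: or b::), so qualifying every projected field removes any ambiguity. The output aliases (as a1, as a2, as b2) are just illustrative names, not something this answer requires.

```pig
-- Same input paths and schemas as in the code above.
a = load '/user/root/pig/file1.txt' using PigStorage('\t') as (a1:int, a2:chararray, a3:chararray);
b = load '/user/root/pig/file2.txt' using PigStorage('\t') as (b1:int, b2:chararray, b3:chararray);

a_b = join a by a1, b by b1;

-- Print the schema of the joined relation; fields appear as a::a1, b::b2, and so on.
describe a_b;

-- Qualify every field with its source relation, then rename it for downstream use.
ab = foreach a_b generate a::a1 as a1, a::a2 as a2, b::b2 as b2;

-- Confirm that the projected relation has the expected names and types.
describe ab;
```

Running DESCRIBE before and after the projection shows whether the fields from b are still present with the types you declared.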

Hope this helps.
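
Regarding the "DataByteArray cannot be cast to java.lang.Long" error in the question: fields loaded without a declared type default to bytearray in Pig, so one possible cause (an assumption, not something I have confirmed against your data) is a mismatch between the declared and the actual type when the aggregation runs. A minimal sketch that declares types at load time follows; the column types chosen here (chararray keys, long values) and the use of SUM are assumptions for illustration, since the question does not say what the columns contain.

```pig
-- Assumed types: a1/b1 are string keys, a2/b2 are numeric values.
a = load 'data_a' using PigStorage('\t') as (a1:chararray, a2:long, a3:chararray);
b = load 'data_b' using PigStorage('\t') as (b1:chararray, b2:long, b3:chararray);

a_b = join a by a1, b by b1;

-- Project with fully qualified names so the declared types carry through.
ab = foreach a_b generate a::a1 as a1, a::a2 as a2, b::b2 as b2;

-- Group on the key and aggregate the typed numeric fields.
grouped = group ab by a1;
totals = foreach grouped generate group as a1, SUM(ab.a2) as a2_total, SUM(ab.b2) as b2_total;

dump totals;
```

With typed fields there is no need for the store/load round trip just to attach a schema.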