What is the difference between :: and . in Pig?
When do I use one vs the other?
E.g., I know that :: is needed in a join when a field exists in both aliases:
A = foreach (join B by (x), C by (y)) generate B::y as b_y, C::y as c_y;
and that I need . when accessing the fields of group:
A = foreach (group B by (x,y)) generate group.x as x, group.y as y, SUM(B?z) as z;
However, do I pass B::z or B.z to SUM above instead of B?z?
2 solutions
#1
1
IIRC you get :: as a side effect of some statements. You need not worry about it unless (as you mentioned) the same name exists under two different prefixes.
The . is different in that you are going inside the structure. group.x as x, group.y as y is equivalent to FLATTEN(group) as (x, y).
SUM(B?z): here you should use SUM(B.z), to specify which particular field you want to SUM.
#2
5
In Pig, :: is used as a disambiguation tool after operations that could create naming collisions. Notably, this happens with JOIN, CROSS, and FLATTEN. Consider two relations, A:{(id:int, name:chararray)} and B:{(id:int, location:chararray)}. If you want to associate names with locations, naturally you would do:
C = JOIN A BY id, B BY id;
Without the disambiguation operator, your schema would be:
C:{(id:int, name:chararray, id:int, location:chararray)}
Now you can't tell which field id refers to. To avoid this, Pig instead produces:
C:{(A::id:int, A::name:chararray, B::id:int, B::location:chararray)}
Likewise, you could FLATTEN two bags whose tuples have fields with the same name, and they would also collide, so the same operator is used in that case as well. When there is no such conflict, you do not need to use the full name: name is unambiguous here. To simplify C, then, you can do this:
D = FOREACH C GENERATE A::id, name, location;
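The join above can be tried end to end with a short script. This is only a sketch: the file names names.tsv and locations.tsv are hypothetical, and the AS id rename is an optional addition that gives the output field a plain name instead of the prefixed one.

```pig
-- Hypothetical tab-separated inputs; adjust the paths to your own data.
A = LOAD 'names.tsv' AS (id:int, name:chararray);
B = LOAD 'locations.tsv' AS (id:int, location:chararray);

-- Both inputs have a field called id, so Pig prefixes the joined fields.
C = JOIN A BY id, B BY id;
DESCRIBE C;

-- id must be qualified with ::, while name and location are unambiguous.
D = FOREACH C GENERATE A::id AS id, name, location;
DESCRIBE D;
```

DESCRIBE prints the schema of each alias, which is a quick way to check which names carry a :: prefix.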
The . operator, by contrast, projects fields from bags and tuples. If you have a bag b with schema {(x:int, y:int, z:int)}, the projection b.y yields a bag with just the specified field: {(y:int)}. You can project multiple fields at once with parentheses: b.(y,z) yields {(y:int, z:int)}.
When used with tuples, the projection works on the fields of a single tuple. If the tuple t has schema (x:int, y:int, z:int), then t.x yields the single field x:int (which is why group.x can be used directly as a scalar in the question), and t.(y,z) is the tuple (y:int, z:int).
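A small script makes both kinds of projection easy to inspect. It is a sketch under assumptions: points.tsv and tuples.txt are hypothetical input files, and GROUP ... ALL is used only as a convenient way to obtain a bag.

```pig
-- Hypothetical input with three int columns per line.
B = LOAD 'points.tsv' AS (x:int, y:int, z:int);

-- GROUP ... ALL collects every input tuple into a single bag named B.
G = GROUP B ALL;

-- Bag projection: one field, then two fields at once.
P = FOREACH G GENERATE B.y AS ys, B.(y, z) AS yzs;
DESCRIBE P;

-- Tuple projection on a hypothetical input whose single column is a tuple.
T = LOAD 'tuples.txt' AS (t:tuple(x:int, y:int, z:int));
Q = FOREACH T GENERATE t.x, t.(y, z);
DESCRIBE Q;
```

Comparing the two DESCRIBE outputs shows how the result of . depends on whether it is applied to a bag or to a tuple.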
To your specific question about SUM: note that SUM, along with the other summary statistic UDFs, takes a bag as its argument. After a GROUP, B is a bag holding the tuples of each group, so you need to derive from it a bag with just the one field per tuple that you want to sum. You do that with the projection operator: B.z.
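Putting this together, the statement from the question could be written as below. This is a minimal sketch: the input file data.tsv and its schema are assumptions, and the grouping is split into its own alias only for readability.

```pig
-- Hypothetical input with three int columns.
B = LOAD 'data.tsv' AS (x:int, y:int, z:int);

-- Each output tuple holds group (a tuple of x and y) and B (a bag of matching rows).
G = GROUP B BY (x, y);

-- group.x and group.y read fields of the group tuple;
-- B.z projects the bag down to the one column that SUM should add up.
A = FOREACH G GENERATE group.x AS x, group.y AS y, SUM(B.z) AS z;
DUMP A;
```

B::z would not apply here, because no JOIN, CROSS, or FLATTEN has prefixed the field names of B.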