Fill blank fields with values from rows that share the same key

Time: 2021-03-10 11:45:55

I have a set of data (10 columns, 1000 rows) indexed by an ID number that several rows can share. To give a small example to illustrate my point, consider this table:

ID       Name      Location
5014     John     
5014     Kate     California
5014     Jim
5014     Ryan     California
5018     Pete     
5018     Pat      Indiana
5019     Jeff     Arizona
5020     Chris    Kentucky
5020     Mike
5021     Will     Indiana
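
To make the example reproducible, the table above could be built roughly like this. The name df1 and the choice to store blanks as empty strings and IDs as character are assumptions, not something stated in the question.

```r
# Sample data matching the table above.
# Assumptions: blank locations are "" (not NA) and IDs are character,
# since the IDs are not guaranteed to be purely numeric.
df1 <- data.frame(
  ID       = c("5014", "5014", "5014", "5014", "5018",
               "5018", "5019", "5020", "5020", "5021"),
  Name     = c("John", "Kate", "Jim", "Ryan", "Pete",
               "Pat", "Jeff", "Chris", "Mike", "Will"),
  Location = c("", "California", "", "California", "",
               "Indiana", "Arizona", "Kentucky", "", "Indiana"),
  stringsAsFactors = FALSE
)
```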

I need every row to have a value in the Location field, and I'm having a hell of a time trying to do it.

Things to note:

  1. Every unique ID number has at least one row with the Location field populated.
  2. If two rows have the same ID number, they have the same location.
  3. Two different ID numbers can have the same location.
  4. ID numbers are not necessarily consecutive, nor are they necessarily entirely numeric. Their order isn't important to me, since any rows that are related share the same ID number.
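
The first two notes can be checked before filling anything in. This is only a sketch on a small made-up table; the names dt, check and n_locations are hypothetical, and it assumes blanks are stored as empty strings.

```r
library(data.table)

# Small illustrative table (hypothetical data, blanks stored as "").
dt <- data.table(
  ID       = c("5014", "5014", "5018", "5018", "5019"),
  Location = c("", "California", "", "Indiana", "Arizona")
)

# Count the distinct non-blank locations within each ID.
check <- dt[, .(n_locations = uniqueN(Location[Location != ""])), by = ID]

# An ID with 0 has no location at all; an ID with 2 or more has
# conflicting locations. An empty result means both notes hold.
check[n_locations != 1L]
```

If this returns any rows, taking the first non-blank value per ID would either leave NA behind or silently pick one of several conflicting locations.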

Any ideas for a solution? I'm currently using R with the data.table package, but I'm relatively new to it.

1 solution

#1

We can convert the 'data.frame' to a 'data.table' (setDT(df1)), group by 'ID', and take the elements of Location that are not '' (Location[Location != '']). If a group has more than one non-blank element, [1L] selects the first one. The result is assigned (:=) back to Location, where the single value is recycled across every row of the group.

library(data.table)
setDT(df1)[, Location := Location[Location != ''][1L], by = ID][]
#     ID  Name   Location
# 1: 5014  John California
# 2: 5014  Kate California
# 3: 5014   Jim California
# 4: 5014  Ryan California
# 5: 5018  Pete    Indiana
# 6: 5018   Pat    Indiana
# 7: 5019  Jeff    Arizona
# 8: 5020 Chris   Kentucky
# 9: 5020  Mike   Kentucky
#10: 5021  Will    Indiana
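
One caveat, as a sketch under a different assumption: if the blanks were read in as NA rather than '', then Location != '' evaluates to NA for those rows, and subsetting by NA keeps an NA element, so [1L] can still return NA. Excluding NA explicitly avoids that. The table dt here is made-up illustration data.

```r
library(data.table)

# Hypothetical data where blanks appear both as NA and as "".
dt <- data.table(
  ID       = c("5014", "5014", "5018", "5018"),
  Location = c(NA, "California", "", "Indiana")
)

# Keep only values that are neither NA nor "", then take the first per ID.
dt[, Location := Location[!is.na(Location) & Location != ""][1L], by = ID]

dt$Location
# [1] "California" "California" "Indiana"    "Indiana"
```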

Or we can use setdiff as suggested by @Frank

setDT(df1)[, Location := setdiff(Location, '')[1L], by = ID][]
