I have a set of data (10 columns, 1000 rows) that is indexed by an ID number that one or more of these rows can share. To give a small example to illustrate my point, consider this table:
ID    Name   Location
5014  John
5014  Kate   California
5014  Jim
5014  Ryan   California
5018  Pete
5018  Pat    Indiana
5019  Jeff   Arizona
5020  Chris  Kentucky
5020  Mike
5021  Will   Indiana
I need for all entries to have something in the Location field and I'm having a hell of a time trying to do it.
Things to note:
- Every unique ID number has at least one row with the location field populated.
- If two rows have the same ID number, they have the same location.
- Two different ID numbers can have the same location.
- ID numbers are not necessarily consecutive, nor are they necessarily completely numeric. The arrangement of them isn't of importance to me, since any rows that are related share the same ID number.
Any ideas for a solution? I'm currently using R with the data.table package, but I'm relatively new to it.
1 solution
We can convert the 'data.frame' to a 'data.table' with setDT(df1). Then, grouped by 'ID', we take the elements of Location that are not '' and keep the first one (Location[Location != ''][1L]). If a group has more than one non-blank element, [1L] selects the first of them. The result is assigned by reference (:=) back to Location, where it is recycled across every row of the group.
library(data.table)
setDT(df1)[, Location := Location[Location != ''][1L], by = ID][]
# ID Name Location
# 1: 5014 John California
# 2: 5014 Kate California
# 3: 5014 Jim California
# 4: 5014 Ryan California
# 5: 5018 Pete Indiana
# 6: 5018 Pat Indiana
# 7: 5019 Jeff Arizona
# 8: 5020 Chris Kentucky
# 9: 5020 Mike Kentucky
#10: 5021 Will Indiana
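To try this without the original data, here is a minimal reproducible sketch. It assumes the blank locations are stored as empty strings ('') rather than NA, and it rebuilds only the three columns shown in the question; the name df1 comes from the answer, everything else is typed in from the sample table.

```r
library(data.table)

# Rebuild the question's sample table; "" marks a missing location (assumption).
df1 <- data.frame(
  ID = c(5014, 5014, 5014, 5014, 5018, 5018, 5019, 5020, 5020, 5021),
  Name = c("John", "Kate", "Jim", "Ryan", "Pete", "Pat",
           "Jeff", "Chris", "Mike", "Will"),
  Location = c("", "California", "", "California", "", "Indiana",
               "Arizona", "Kentucky", "", "Indiana"),
  stringsAsFactors = FALSE
)

# Convert in place, then fill each ID group with its first non-blank location.
setDT(df1)[, Location := Location[Location != ""][1L], by = ID]

# Sanity check: no blank or NA locations should remain.
stopifnot(!anyNA(df1$Location), all(df1$Location != ""))
```

If a group had no non-blank location at all, the subset would be empty and [1L] would yield NA for that group, so the final check doubles as a test of the "at least one populated row per ID" condition.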
Or we can use setdiff, as suggested by @Frank:
setDT(df1)[, Location:= setdiff(Location,'')[1L], by = ID][]
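The setdiff form also extends naturally if some blanks were read in as NA instead of ''. The question does not say this happens, so treat the following as a hedged variant on made-up data: the table dt and its values are hypothetical, and the IDs are deliberately non-numeric since the question allows that.

```r
library(data.table)

# Hypothetical data: blanks appear both as "" and as NA.
dt <- data.table(
  ID = c("A1", "A1", "A1", "B2", "B2"),
  Location = c("", "Ohio", NA, NA, "Texas")
)

# setdiff() removes every value listed in its second argument and
# de-duplicates what is left; [1L] then keeps a single value per group.
dt[, Location := setdiff(Location, c("", NA))[1L], by = ID]

# Each group now carries its one known location.
stopifnot(identical(dt$Location, c("Ohio", "Ohio", "Ohio", "Texas", "Texas")))
```

Because setdiff returns unique values, a group whose rows disagree on the location would silently get the first one seen; counting the distinct non-blank values per ID before filling is a cheap way to confirm the "same ID, same location" condition holds.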