I have this data frame, df.1:
month a b c
1 0 0.000000000 0.000000000
2 0 0.000000000 0.001503194
3 0 0.000000000 0.000000000
4 0 0.000000000 0.000000000
5 0 0.000000000 0.000000000
6 0 0.000000000 0.000000000
7 0 0.000000000 0.000000000
8 0 0.000000000 0.000000000
9 0 0.000000000 0.000000000
10 0 0.000000000 0.000000000
11 NA NA NA
12 NA NA NA
1 0 0.000000000 0.000000000
2 0 0.001537279 0.006917756
3 0 0.000000000 0.003669725
4 0 0.000000000 0.000000000
5 0 0.000000000 0.000000000
6 0 0.000000000 0.000000000
7 0 0.000000000 0.000000000
8 0 0.000000000 0.000000000
9 0 0.000000000 0.000000000
10 0 0.000000000 0.000000000
11 0 0.000000000 0.013513514
12 NA NA NA
and this data frame, df.2:
month a b c
1 0.03842077 0.002266291 0.000000000
2 0.01359501 0.001027937 0.000000000
3 0.08631519 0.008732519 0.001376147
4 0.26564710 0.083635347 0.019053692
5 0.34839088 0.152203121 0.021010075
6 0.31767367 0.152029019 0.029397773
7 0.31507761 0.110973916 0.023445471
8 0.29773872 0.096458381 0.026745770
9 0.31226976 0.109342562 0.023996392
10 0.23841220 0.081582743 0.021674228
11 0.04379016 0.003519300 0.000000000
12 0.02244389 0.002493766 0.000000000
I would like to substitute the NA values (and only the NA values) in df.1[, 2:4] with the values in df.2[, 2:4] wherever the index in column 1 (month) is the same. I tried this code:
res_new <- data.frame(matrix(nrow=nrow(df.1),ncol=3))
for (n in 1:12){
res_new <- data.frame(ifelse(is.na(df.1[which(df.1[,1] == n),2:4])==TRUE,df.2[which(df.2[,1] == n),2:4],df.1[,n]))
}
but the result is a big new matrix in which each NA value in df.1 is substituted with all the values in df.2.
How can I do it? (My actual data frames are much bigger.)
4 Solutions
#1
1
Assuming that the missing values occur as complete rows that you want to fill in, you can do this in two steps using which and match.
# find the rows of df.1 that contain missing values
missRows <- which(!complete.cases(df.1))
# fill in missing rows with rows in df.2 with matching months
df.1[missRows, ] <- df.2[match(df.1$month[missRows], df.2$month, nomatch=0),]
Note that rows with missing values are identified with !complete.cases. The nomatch=0 argument makes match return 0 for months that are not found in df.2, and a 0 index selects no row, so the assignment assumes that every month in the missing rows does occur in df.2.
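The two steps can be tried on a small example. This is a minimal sketch using made-up toy values (not the asker's data); the explicit check on the result of match is an addition of this sketch, intended to fail early if a missing month has no counterpart in df.2.

```r
# Toy data (hypothetical values): month 2 is missing twice in df.1
df.1 <- data.frame(month = c(1, 2, 3, 2),
                   a = c(10, NA, 30, NA),
                   b = c(1, NA, 3, NA))
df.2 <- data.frame(month = 1:3,
                   a = c(100, 200, 300),
                   b = c(11, 22, 33))

# Step 1: rows of df.1 that contain at least one NA
missRows <- which(!complete.cases(df.1))

# Step 2: for each of those rows, the position of its month in df.2
idx <- match(df.1$month[missRows], df.2$month)

# Guard added for this sketch: stop if any missing month is absent from df.2
stopifnot(!anyNA(idx))

# Overwrite the incomplete rows with the matching rows of df.2
df.1[missRows, ] <- df.2[idx, ]
df.1
```

Because whole rows are overwritten, any non-NA value that happens to sit in an incomplete row is replaced as well, which is why this approach fits the case where rows are missing in full.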
#2
1
The first 12 rows of the data:
df.1 <- data.frame(
month = 1:12,
a = c(rep(0, 10), NA, NA),
b = c(rep(0, 10), NA, NA),
c = c(0, 0.001503194, rep(0, 8), NA, NA)
)
df.2 <- data.frame(
month = 1:12,
a = c(0.03842077, 0.01359501, 0.08631519, 0.2656471, 0.34839088, 0.31767367,
0.31507761, 0.29773872, 0.31226976, 0.2384122, 0.04379016, 0.02244389),
b = c(0.002266291, 0.001027937, 0.008732519, 0.083635347, 0.152203121,
0.152029019, 0.110973916, 0.096458381, 0.109342562, 0.081582743,
0.0035193, 0.002493766 ),
c = c(0, 0, 0.001376147, 0.019053692, 0.021010075, 0.029397773, 0.023445471,
0.02674577, 0.023996392, 0.021674228, 0, 0)
)
Solution
This solution also handles rows in which only some of the columns are NA. It might take some time on big data, but it gets the job done.
for (row in 1:nrow(df.1)) {
for (col in names(df.1)[-1]) {
if (is.na(df.1[row, col]) && df.1[row, "month"] == df.2[row, "month"]) {
df.1[row, col] <- df.2[row, col]
}
}
}
df.1
month a b c
1 1 0.00000000 0.000000000 0.000000000
2 2 0.00000000 0.000000000 0.001503194
3 3 0.00000000 0.000000000 0.000000000
4 4 0.00000000 0.000000000 0.000000000
5 5 0.00000000 0.000000000 0.000000000
6 6 0.00000000 0.000000000 0.000000000
7 7 0.00000000 0.000000000 0.000000000
8 8 0.00000000 0.000000000 0.000000000
9 9 0.00000000 0.000000000 0.000000000
10 10 0.00000000 0.000000000 0.000000000
11 11 0.04379016 0.003519300 0.000000000
12 12 0.02244389 0.002493766 0.000000000
Explanation
Using a double loop, we check every element in columns a to c. If the element is not NA, we move on to the next one. Otherwise, we check whether the month in the same row of df.2 is the same, and if that is TRUE we replace the element with the corresponding one from df.2. Note that the comparison is made row by row, so it relies on df.2 listing the months in the same rows as df.1, as in the 12-row example above.
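When df.1 repeats the months (as in the question's 24-row data), the rows of the two data frames no longer line up. One possible variation, sketched here with made-up toy values, looks up each month with match instead of comparing the same row number, and loops over the columns only. The names cols, idx and na are introduced for this sketch.

```r
# Toy data (hypothetical values): months repeat in df.1
df.1 <- data.frame(month = c(1, 2, 3, 1, 2, 3),
                   a = c(0, NA, 0, 0, 0, NA),
                   b = c(NA, NA, 0, 0, 5, NA))
df.2 <- data.frame(month = 1:3,
                   a = c(0.1, 0.2, 0.3),
                   b = c(1.1, 1.2, 1.3))

cols <- setdiff(names(df.1), "month")   # value columns to fill
idx  <- match(df.1$month, df.2$month)   # row of df.2 for every row of df.1

for (col in cols) {
  na <- is.na(df.1[[col]])              # which cells of this column are NA
  # Replace only the NA cells; a month missing from df.2 simply stays NA
  df.1[[col]][na] <- df.2[[col]][idx[na]]
}
df.1
```

Only cells that are NA are touched, so partially filled rows keep their existing values.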
#3
0
Considering that you have a larger data frame, I would try to avoid merging the tables. You can use ifelse together with match to get the job done.
month <- c(1:12, 1:12)
a <- c(rep(0,10), NA, NA, rep(0,11), NA)
b <- c(rep(0,10), NA, NA, 0,.0015,rep(0,9), NA)
c <- c(0,.0015,rep(0,8), NA, NA, 0,.0069, .0036,rep(0,7), .0135, NA)
df.1 <- data.frame(month,a,b,c)
df.2 <- data.frame(month=c(1:12), a=rep(1,12), b=rep(2,12), c=rep(3,12))
df.1$a <- ifelse(is.na(df.1$a), df.2$a[match(df.1$month, df.2$month)], df.1$a)
df.1$b <- ifelse(is.na(df.1$b), df.2$b[match(df.1$month, df.2$month)], df.1$b)
df.1$c <- ifelse(is.na(df.1$c), df.2$c[match(df.1$month, df.2$month)], df.1$c)
> df.1
month a b c
1 1 0 0.0000 0.0000
2 2 0 0.0000 0.0015
3 3 0 0.0000 0.0000
4 4 0 0.0000 0.0000
5 5 0 0.0000 0.0000
6 6 0 0.0000 0.0000
7 7 0 0.0000 0.0000
8 8 0 0.0000 0.0000
9 9 0 0.0000 0.0000
10 10 0 0.0000 0.0000
11 11 1 2.0000 3.0000
12 12 1 2.0000 3.0000
13 1 0 0.0000 0.0000
14 2 0 0.0015 0.0069
15 3 0 0.0000 0.0036
16 4 0 0.0000 0.0000
17 5 0 0.0000 0.0000
18 6 0 0.0000 0.0000
19 7 0 0.0000 0.0000
20 8 0 0.0000 0.0000
21 9 0 0.0000 0.0000
22 10 0 0.0000 0.0000
23 11 0 0.0000 0.0135
24 12 1 2.0000 3.0000
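With many value columns, repeating the ifelse line once per column becomes tedious. As a rough sketch (toy values, and the helper names cols, idx, m1, m2 and na are made up for illustration), the same match lookup can be applied to all columns at once by working on numeric matrices:

```r
# Toy data (hypothetical values), with repeated months in df.1
df.1 <- data.frame(month = c(1, 2, 3, 1, 2, 3),
                   a = c(0, NA, 0, 0, 0, NA),
                   b = c(0, NA, 0, 4, 0, NA),
                   c = c(NA, 0, 0, 0, 0, NA))
df.2 <- data.frame(month = 1:3, a = rep(1, 3), b = rep(2, 3), c = rep(3, 3))

cols <- c("a", "b", "c")
idx  <- match(df.1$month, df.2$month)    # row of df.2 for every row of df.1

m1 <- as.matrix(df.1[cols])              # current values
m2 <- as.matrix(df.2[idx, cols])         # candidate replacements, same shape as m1
na <- is.na(m1)                          # logical matrix marking the NA cells

m1[na] <- m2[na]                         # fill only the NA cells
df.1[cols] <- m1                         # write the columns back
df.1
```

This assumes all value columns are numeric, since as.matrix would otherwise convert everything to character.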
#4
0
This may not be the best way, but an approach like this could work!
df1 <- data.frame(month = 1:12,
a = c(rep(1, 10), NA, NA),
b = c(rep(2, 11), NA))
df2 <- data.frame(month = 1:12,
a = rnorm(12),
b = rnorm(12))
# first, merge both data frames by the key (in this case, the month)
new_df <- merge(df1, df2, by = "month")
# then use a vectorized operation with the ifelse function
new_df$imp_a <- ifelse(!is.na(new_df$a.x), new_df$a.x, new_df$a.y)
# then drop the temporary columns or keep only the
# newly imputed columns
new_df
If you need to impute many columns, you could wrap the ifelse step in a function, like this:
impute <- function(df, col1, col2) {
# impute NA values in col1 with values from col2, creating a new column
new_name <- paste("new", col1, sep = "_")
df[[new_name]] <- ifelse(!is.na(df[[col1]]), df[[col1]], df[[col2]])
df
}
impute(new_df, "a.x", "a.y")
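To finish the merge approach for several columns, the imputation and the clean-up of the .x/.y columns can be done in one loop. The following is only a sketch with made-up toy values; it relies on merge's default suffixes (.x and .y) and uses all.x = TRUE so that rows of df1 without a match in df2 are kept.

```r
# Toy data (hypothetical values)
df1 <- data.frame(month = 1:4, a = c(1, NA, 3, NA), b = c(NA, 2, 2, 2))
df2 <- data.frame(month = 1:4, a = c(10, 20, 30, 40), b = c(50, 60, 70, 80))

# Merge on the key; shared column names become a.x/a.y and b.x/b.y
new_df <- merge(df1, df2, by = "month", all.x = TRUE)

for (col in c("a", "b")) {
  x <- new_df[[paste0(col, ".x")]]       # values from df1
  y <- new_df[[paste0(col, ".y")]]       # values from df2
  new_df[[col]] <- ifelse(is.na(x), y, x)
}

# Keep the key and the imputed columns, dropping the temporary ones
result <- new_df[c("month", "a", "b")]
result
```

Keep in mind that merge sorts the result by the key by default, so the row order of the output may differ from that of df1 when months repeat.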