I have a data.frame and some columns have NA values. I want to replace the NAs with zeros. How do I do this?
14 answers
#1
614
See my comment on @gsk3's answer. A simple example:
> m <- matrix(sample(c(NA, 1:10), 100, replace = TRUE), 10)
> d <- as.data.frame(m)
> d
   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1   4  3 NA  3  7  6  6 10  6   5
2   9  8  9  5 10 NA  2  1  7   2
3   1  1  6  3  6 NA  1  4  1   6
4  NA  4 NA  7 10  2 NA  4  1   8
5   1  2  4 NA  2  6  2  6  7   4
6  NA  3 NA NA 10  2  1 10  8   4
7   4  4  9 10  9  8  9  4 10  NA
8   5  8  3  2  1  4  5  9  4   7
9   3  9 10  1  9  9 10  5  3   3
10  4  2  2  5 NA  9  7  2  5   5
> d[is.na(d)] <- 0
> d
   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1   4  3  0  3  7  6  6 10  6   5
2   9  8  9  5 10  0  2  1  7   2
3   1  1  6  3  6  0  1  4  1   6
4   0  4  0  7 10  2  0  4  1   8
5   1  2  4  0  2  6  2  6  7   4
6   0  3  0  0 10  2  1 10  8   4
7   4  4  9 10  9  8  9  4 10   0
8   5  8  3  2  1  4  5  9  4   7
9   3  9 10  1  9  9 10  5  3   3
10  4  2  2  5  0  9  7  2  5   5
There's no need to apply apply. =)
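One caveat, as a side note: on a data frame that also has character columns, a blanket d[is.na(d)] <- 0 writes into those as well (as the string "0"). A minimal sketch of limiting the replacement to numeric columns in base R follows; the data frame mixed and the name num_cols are made up for this illustration.

```r
# Toy data frame with numeric and character columns (invented for this sketch).
mixed <- data.frame(
  n1  = c(1, NA, 3),
  n2  = c(NA, 5L, 6L),
  txt = c("a", NA, "c"),
  stringsAsFactors = FALSE
)

# Logical vector marking which columns are numeric.
num_cols <- vapply(mixed, is.numeric, logical(1))

# Replace NAs only inside the numeric columns; txt keeps its NA.
mixed[num_cols][is.na(mixed[num_cols])] <- 0
mixed
```

The character column is left alone, so its missing value is still NA afterwards.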
EDIT
You should also take a look at the norm package. It has a lot of nice features for missing data analysis. =)
#2
110
The hybrid dplyr/Base R option, mutate_all(funs(replace(., is.na(.), 0))), is roughly twice as fast as the base R d[is.na(d)] <- 0 option. (Please see the benchmark analyses below.)
If you are struggling with massive dataframes, data.table is the fastest option of all: 30% less time than dplyr, and 3 times faster than the Base R approaches. It also modifies the data in place, effectively allowing you to work with nearly twice as much of the data at once.
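For anyone who has not used set() before, here is a minimal sketch of the in-place loop on a toy table. The table dt and its columns are invented for illustration; the loop mirrors the DT.for.set functions listed in the analysis.

```r
library(data.table)

# Toy table with a few missing values (made up for this sketch).
dt <- data.table(a = c(1, NA, 3), b = c(NA, 5, 6))

# set() assigns by reference, so dt is updated without copying the table.
for (j in seq_len(ncol(dt))) {
  set(dt, i = which(is.na(dt[[j]])), j = j, value = 0)
}
dt
```

Because nothing is copied, there is no need to assign the result back to dt.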
A clustering of other helpful tidyverse replacement approaches
Locationally:
- index: mutate_at(c(5:10), funs(replace(., is.na(.), 0)))
- direct reference: mutate_at(vars(var5:var10), funs(replace(., is.na(.), 0)))
- fixed match: mutate_at(vars(contains("1")), funs(replace(., is.na(.), 0)))
  - or in place of contains(), try ends_with(), starts_with()
- pattern match: mutate_at(vars(matches("\\d{2}")), funs(replace(., is.na(.), 0)))
Conditionally:
(change just the numeric columns and leave the string columns alone.)
- integers: mutate_if(is.integer, funs(replace(., is.na(.), 0)))
- doubles: mutate_if(is.numeric, funs(replace(., is.na(.), 0))) (note that is.numeric also matches integer columns; use is.double to target doubles only)
- strings: mutate_if(is.character, funs(replace(., is.na(.), 0)))
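For readers on a current dplyr: funs() has been deprecated since dplyr 0.8.0, and the scoped verbs (mutate_all, mutate_at, mutate_if) were superseded by across() in dplyr 1.0.0. A rough sketch of the equivalents follows. The toy data frame and its column names are made up for illustration, and the benchmark in this answer was not run with this form.

```r
library(dplyr)

# Toy data, invented for this sketch.
df <- tibble(
  var1 = c(1, NA, 3),
  var2 = c(NA, 5, 6),
  name = c("a", NA, "c")
)

# All columns of a given type (the mutate_if analogue).
num_only <- df %>%
  mutate(across(where(is.numeric), ~ replace(.x, is.na(.x), 0)))

# Columns picked by name (the mutate_at analogue).
by_name <- df %>%
  mutate(across(starts_with("var"), ~ replace(.x, is.na(.x), 0)))
```

In both calls the name column is not selected, so its NA is left as it is.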
The Complete Analysis -
Approaches tested:
# Base R:
baseR.sbst.rssgn   <- function(x) { x[is.na(x)] <- 0; x }
baseR.replace      <- function(x) { replace(x, is.na(x), 0) }
# return x explicitly; a for loop on its own evaluates to NULL
baseR.for          <- function(x) { for(j in 1:ncol(x)) x[[j]][is.na(x[[j]])] = 0; x }

# tidyverse
## dplyr
library(tidyverse)
dplyr_if_else      <- function(x) { mutate_all(x, funs(if_else(is.na(.), 0, .))) }
dplyr_coalesce     <- function(x) { mutate_all(x, funs(coalesce(., 0))) }

## tidyr
tidyr_replace_na   <- function(x) { replace_na(x, as.list(setNames(rep(0, 10), as.list(c(paste0("var", 1:10)))))) }

## hybrid
hybrd.ifelse       <- function(x) { mutate_all(x, funs(ifelse(is.na(.), 0, .))) }
hybrd.rplc_all     <- function(x) { mutate_all(x, funs(replace(., is.na(.), 0))) }
hybrd.rplc_at.idx  <- function(x) { mutate_at(x, c(1:10), funs(replace(., is.na(.), 0))) }
hybrd.rplc_at.nse  <- function(x) { mutate_at(x, vars(var1:var10), funs(replace(., is.na(.), 0))) }
hybrd.rplc_at.stw  <- function(x) { mutate_at(x, vars(starts_with("var")), funs(replace(., is.na(.), 0))) }
hybrd.rplc_at.ctn  <- function(x) { mutate_at(x, vars(contains("var")), funs(replace(., is.na(.), 0))) }
hybrd.rplc_at.mtc  <- function(x) { mutate_at(x, vars(matches("\\d+")), funs(replace(., is.na(.), 0))) }
hybrd.rplc_if      <- function(x) { mutate_if(x, is.numeric, funs(replace(., is.na(.), 0))) }

# data.table
library(data.table)
# set() modifies x by reference, one column at a time
DT.for.set.nms     <- function(x) { for (j in names(x)) set(x,which(is.na(x[[j]])),j,0) }
DT.for.set.sqln    <- function(x) { for (j in seq_len(ncol(x))) set(x,which(is.na(x[[j]])),j,0) }
The code for this analysis:
library(microbenchmark)

# 20% NA filled dataframe of 5 Million rows and 10 columns
set.seed(42) # to recreate the exact dataframe
dfN <- as.data.frame(matrix(sample(c(NA, as.numeric(1:4)), 5e6*10, replace = TRUE),
                            dimnames = list(NULL, paste0("var", 1:10)),
                            ncol = 10))

# Running 250 trials with each replacement method
# (the functions are executed locally - so that the original dataframe remains unmodified in all cases)
perf_results <- microbenchmark(
    hybrid.ifelse     = hybrd.ifelse(copy(dfN)),  # the function is defined above as hybrd.ifelse
    dplyr_if_else     = dplyr_if_else(copy(dfN)),
    baseR.sbst.rssgn  = baseR.sbst.rssgn(copy(dfN)),
    baseR.replace     = baseR.replace(copy(dfN)),
    dplyr_coalesce    = dplyr_coalesce(copy(dfN)),
    hybrd.rplc_at.nse = hybrd.rplc_at.nse(copy(dfN)),
    hybrd.rplc_at.stw = hybrd.rplc_at.stw(copy(dfN)),
    hybrd.rplc_at.ctn = hybrd.rplc_at.ctn(copy(dfN)),
    hybrd.rplc_at.mtc = hybrd.rplc_at.mtc(copy(dfN)),
    hybrd.rplc_at.idx = hybrd.rplc_at.idx(copy(dfN)),
    hybrd.rplc_if     = hybrd.rplc_if(copy(dfN)),
    tidyr_replace_na  = tidyr_replace_na(copy(dfN)),
    baseR.for         = baseR.for(copy(dfN)),
    DT.for.set.nms    = DT.for.set.nms(copy(dfN)),
    DT.for.set.sqln   = DT.for.set.sqln(copy(dfN)),
    times = 250L
)
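Before trusting any timings, it is worth confirming that the approaches agree on the result. The following is only a small sketch of such a check using the three base R variants. The 1,000-row size and the names small, by_subset, by_replace, by_loop and results_agree are my own choices for illustration and are not part of the benchmark.

```r
set.seed(42)
# Scaled-down version of the benchmark data: 1,000 rows, 10 columns, about 20% NA.
small <- as.data.frame(matrix(sample(c(NA, as.numeric(1:4)), 1000 * 10, replace = TRUE),
                              ncol = 10,
                              dimnames = list(NULL, paste0("var", 1:10))))

# Subset-and-reassign on the whole data frame.
by_subset <- small
by_subset[is.na(by_subset)] <- 0

# replace() with a logical matrix index.
by_replace <- replace(small, is.na(small), 0)

# Column-by-column loop.
by_loop <- small
for (j in seq_along(by_loop)) by_loop[[j]][is.na(by_loop[[j]])] <- 0

# All three should be free of NAs and hold the same values.
results_agree <- !any(is.na(by_subset)) &&
  isTRUE(all.equal(by_subset, by_replace)) &&
  isTRUE(all.equal(by_subset, by_loop))
results_agree
```

The same pattern could be extended to the dplyr, tidyr and data.table variants if those packages are loaded.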
Summary of Results
> perf_results
Unit: milliseconds
              expr       min        lq      mean    median        uq      max neval
     hybrid.ifelse 5250.5259 5620.8650 5809.1808 5759.3997 5947.7942 6732.791   250
     dplyr_if_else 3209.7406 3518.0314 3653.0317 3620.2955 3746.0293 4390.888   250
  baseR.sbst.rssgn 1611.9227 1878.7401 1964.6385 1942.8873 2031.5681 2485.843   250
     baseR.replace 1559.1494 1874.7377 1946.2971 1920.8077 2002.4825 2516.525   250
    dplyr_coalesce  949.7511 1231.5150 1279.3015 1288.3425 1345.8662 1624.186   250
 hybrd.rplc_at.nse  735.9949  871.1693 1016.5910 1064.5761 1104.9590 1361.868   250
 hybrd.rplc_at.stw  704.4045  887.4796 1017.9110 1063.8001 1106.7748 1338.557   250
 hybrd.rplc_at.ctn  723.9838  878.6088 1017.9983 1063.0406 1110.0857 1296.024   250
 hybrd.rplc_at.mtc  686.2045  885.8028 1013.8293 1061.2727 1105.7117 1269.949   250
 hybrd.rplc_at.idx  696.3159  880.7800 1003.6186 1038.8271 1083.1932 1309.635   250
     hybrd.rplc_if  705.9907  889.7381 1000.0113 1036.3963 1083.3728 1338.190   250
  tidyr_replace_na  680.4478  973.1395  978.2678 1003.9797 1051.2624 1294.376   250
         baseR.for  670.7897  965.6312  983.5775 1001.5229 1052.5946 1206.023   250
    DT.for.set.nms  496.8031  569.7471  695.4339  623.1086  861.1918 1067.640   250
   DT.for.set.sqln  500.9945  567.2522  671.4158  623.1454  764.9744 1033.463   250
Boxplot of Results (on a log scale)
# adjust the margins to prepare for better boxplot printing
par(mar=c(8,5,1,1) + 0.1)
# generate boxplot (perf_results is the microbenchmark object created above)
boxplot(perf_results, las = 2, xlab = "", ylab = "log(time)[milliseconds]")
Color-coded Scatterplot of Trials (on a log scale)
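# perf_results is the microbenchmark object created above; time is recorded in nanoseconds
qplot(y=time/10^9, data=perf_results, colour=expr) +
    labs(y = "log10 Scaled Elapsed Time per Trial (secs)", x = "Trial Number") +
    scale_y_log10(breaks=c(1, 2, 4))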
A note on the other high performers
When the datasets get larger, tidyr's replace_na had historically pulled out in front. With the current collection of 50M data points to run through, it performs almost exactly as well as a Base R for loop. I am curious to see what happens for different sized dataframes.
Additional examples for the mutate and summarize _at and _all function variants can be found here: https://rdrr.io/cran/dplyr/man/summarise_all.html
Additionally, I found helpful demonstrations and collections of examples here: https://blog.exploratory.io/dplyr-0-5-is-awesome-heres-why-be095fd4eb8a
Attributions and Appreciations
With special thanks to:
- Tyler Rinker and Akrun for demonstrating microbenchmark.
- alexis_laz for working on helping me understand the use of local(), and (with Frank's patient help, too) the role that silent coercion plays in speeding up many of these approaches.
- ArthurYip for the poke to add the newer coalesce() function in and update the analysis.
- Gregor for the nudge to figure out the data.table functions well enough to finally include them in the lineup.
- Base R For loop: alexis_laz
- data.table For Loops: Matt_Dowle
(Of course, please reach over and give them upvotes, too, if you find those approaches useful.)
Note on my use of numerics: If you do have a pure integer dataset, all of your functions will run faster. Please see alexis_laz's work for more information. IRL, I can't recall encountering a data set containing more than 10-15% integers, so I am running these tests on fully numeric dataframes.
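The coercion mentioned above is easy to see on a tiny vector. This is just an illustration with made-up values, not a reproduction of alexis_laz's analysis: assigning the double 0 into an integer vector silently converts the whole vector to double, while assigning the integer literal 0L keeps it integer.

```r
x <- c(1L, NA, 3L)                   # integer vector with a missing value

as_double <- x
as_double[is.na(as_double)] <- 0     # 0 is a double, so the vector is coerced

as_integer <- x
as_integer[is.na(as_integer)] <- 0L  # 0L is an integer, so the type is kept

typeof(as_double)   # "double"
typeof(as_integer)  # "integer"
```

So on integer data, writing 0L instead of 0 avoids the type change.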
#3
101
For a single vector:
x <- c(1,2,NA,4,5)
x[is.na(x)] <- 0
For a data.frame, make a function out of the above, then apply it to the columns.
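One way to read "make a function, then apply it to the columns" is sketched below. The helper name na_to_zero and the toy data frame are mine, not from the answer; df[] <- lapply(...) is used so that the result stays a data.frame.

```r
# Hypothetical helper wrapping the vector idiom above.
na_to_zero <- function(x) {
  x[is.na(x)] <- 0
  x
}

# Toy data frame, invented for this sketch.
df <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))

# Assigning into df[] keeps the data.frame structure and column names.
df[] <- lapply(df, na_to_zero)
df
```

Plain lapply(df, na_to_zero) without the df[] on the left would return a list rather than a data frame.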
Please provide a reproducible example next time as detailed here:
How to make a great R reproducible example?
#4
58
dplyr example:
library(dplyr)
df1 <- df1 %>%
    mutate(myCol1 = if_else(is.na(myCol1), 0, myCol1))
Note: This works on one selected column at a time. If we need to do this for all columns, see @reidjax's answer using mutate_each.
#5
40
I know the question is already answered, but doing it this way might be more useful to some:
Define this function:
na.zero <- function (x) {
    x[is.na(x)] <- 0
    return(x)
}
Now whenever you need to convert NAs in a vector to zeros you can do:
na.zero(some.vector)
#6
39
If we are trying to replace NAs when exporting, for example when writing to csv, then we can use:
write.csv(data, "data.csv", na = "0")
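Note that this only changes the written file; the data frame in memory keeps its NAs. A quick round trip through a temporary file shows the effect. The data frame here is a made-up example, and row.names = FALSE is added only to keep the file tidy.

```r
# Toy data with one missing score (invented for this sketch).
data <- data.frame(id = 1:3, score = c(10, NA, 30))

path <- tempfile(fileext = ".csv")
write.csv(data, path, na = "0", row.names = FALSE)

readLines(path)            # the missing score is written as 0
reread <- read.csv(path)

anyNA(data$score)          # TRUE: the original object is untouched
reread$score               # 10 0 30
```

If the zeros are also needed in the R session, one of the in-memory approaches is still required.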
#7
18
A more general approach is to use replace() on a matrix or vector to replace NA with 0.
For example:
> x <- c(1,2,NA,NA,1,1)
> x1 <- replace(x,is.na(x),0)
> x1
[1] 1 2 0 0 1 1
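The same call works unchanged on a matrix, since replace() simply performs an indexed assignment and a logical matrix is a valid index. A small sketch with invented values:

```r
# 2 x 3 matrix with two missing cells (made up for this sketch).
m <- matrix(c(1, NA, 3, NA, 5, 6), nrow = 2)

m0 <- replace(m, is.na(m), 0)
m0
dim(m0)   # dimensions are preserved: 2 3
```

The original m is not modified; replace() returns a new object.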
This is also an alternative to using ifelse() in dplyr:
library(dplyr)
df <- data.frame(col = c(1,2,NA,NA,1,1))
df <- df %>% mutate(col = replace(col,is.na(col),0))
#8
15
With dplyr 0.5.0 or later, you can use the coalesce function, which can be easily integrated into a %>% pipeline by doing coalesce(vec, 0). This replaces all NAs in vec with 0.
Say we have a data frame with NAs:
library(dplyr)
df <- data.frame(v = c(1, 2, 3, NA, 5, 6, 8))
df
#    v
# 1  1
# 2  2
# 3  3
# 4 NA
# 5  5
# 6  6
# 7  8
df %>% mutate(v = coalesce(v, 0))
#   v
# 1 1
# 2 2
# 3 3
# 4 0
# 5 5
# 6 6
# 7 8
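coalesce() also accepts more than one fallback, taking the first non-missing value position by position, so a constant can be the last resort after another column. A small sketch, with made-up vectors primary and backup:

```r
library(dplyr)

primary <- c(1, NA, NA)
backup  <- c(9, 2, NA)

# Use primary where present, then backup, then 0.
filled <- coalesce(primary, backup, 0)
filled   # 1 2 0
```

All arguments here are doubles; mixing integer and double inputs may behave differently depending on the dplyr version.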
#9
8
Another example using the imputeTS package:
library(imputeTS)
# newer imputeTS releases (3.0 and later) name this function na_replace
na.replace(yourDataframe, 0)
#10
8
If you want to replace NAs in factor variables, this might be useful:
n <- length(levels(data.vector))+1
data.vector <- as.numeric(data.vector)
data.vector[is.na(data.vector)] <- n
data.vector <- as.factor(data.vector)
levels(data.vector) <- c("level1","level2",...,"leveln", "NAlevel")
It transforms a factor vector into a numeric vector and adds another artificial numeric factor level, which is then transformed back to a factor vector with one extra "NA level" of your choice.
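If the goal is simply an explicit level for the missing values, base R's addNA() gets there without the round trip through numeric codes. This is an alternative sketch, not the method above; the level name "missing" and the sample factor are arbitrary choices of mine.

```r
# Sample factor with one missing value (invented for this sketch).
f <- factor(c("a", NA, "b", "a"))

# addNA() turns NA into a real level whose name is NA ...
f2 <- addNA(f)
# ... which can then be given an ordinary name.
levels(f2)[is.na(levels(f2))] <- "missing"

levels(f2)        # "a" "b" "missing"
as.character(f2)  # "a" "missing" "b" "a"
```

The existing level names are kept as they are, so there is no need to retype them.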
#11
6
Would've commented on @ianmunoz's post but I don't have enough reputation. You can combine dplyr's mutate_each and replace to take care of the NA to 0 replacement. Using the dataframe from @aL3xa's answer...
> m <- matrix(sample(c(NA, 1:10), 100, replace = TRUE), 10)
> d <- as.data.frame(m)
> d
   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1   4  8  1  9  6  9 NA  8  9   8
2   8  3  6  8  2  1 NA NA  6   3
3   6  6  3 NA  2 NA NA  5  7   7
4  10  6  1  1  7  9  1 10  3  10
5  10  6  7 10 10  3  2  5  4   6
6   2  4  1  5  7 NA NA  8  4   4
7   7  2  3  1  4 10 NA  8  7   7
8   9  5  8 10  5  3  5  8  3   2
9   9  1  8  7  6  5 NA NA  6   7
10  6 10  8  7  1  1  2  2  5   7
> d %>% mutate_each( funs_( interp( ~replace(., is.na(.),0) ) ) )
   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1   4  8  1  9  6  9  0  8  9   8
2   8  3  6  8  2  1  0  0  6   3
3   6  6  3  0  2  0  0  5  7   7
4  10  6  1  1  7  9  1 10  3  10
5  10  6  7 10 10  3  2  5  4   6
6   2  4  1  5  7  0  0  8  4   4
7   7  2  3  1  4 10  0  8  7   7
8   9  5  8 10  5  3  5  8  3   2
9   9  1  8  7  6  5  0  0  6   7
10  6 10  8  7  1  1  2  2  5   7
We're using standard evaluation (SE) here, which is why we need the underscore on "funs_". We also use lazyeval's interp/~, and the . refers to whatever we are working with, which here is each column of the data frame in turn. Now there are zeros!
#12
4
You can use replace()
For example:
> x <- c(-1,0,1,0,NA,0,1,1)
> x1 <- replace(x,5,1)
> x1
[1] -1  0  1  0  1  0  1  1
> x1 <- replace(x,5,mean(x,na.rm=T))
> x1
[1] -1.00  0.00  1.00  0.00  0.29  0.00  1.00  1.00
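The position 5 is hard-coded above. Locating the missing values with is.na() makes the same idea work for any vector; a minimal sketch using the same x (the name x_mean is mine):

```r
x <- c(-1, 0, 1, 0, NA, 0, 1, 1)

# Replace every NA, wherever it is, with the mean of the observed values.
x_mean <- replace(x, is.na(x), mean(x, na.rm = TRUE))
round(x_mean, 2)
```

Swapping mean(x, na.rm = TRUE) for 0 gives the plain NA-to-zero replacement asked about in the question.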
#13
3
Another dplyr pipe-compatible option, using the tidyr method replace_na, that works for several columns:
require(dplyr)
require(tidyr)
m <- matrix(sample(c(NA, 1:10), 100, replace = TRUE), 10)
d <- as.data.frame(m)
myList <- setNames(lapply(vector("list", ncol(d)), function(x) x <- 0), names(d))
df <- d %>% replace_na(myList)
You can easily restrict to e.g. numeric columns:
d$str <- c("string", NA)
myList <- myList[sapply(d, is.numeric)]
df <- d %>% replace_na(myList)
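replace_na() also takes a hand-written named list, which is handy when different columns need different fill values. A short sketch; the column names and the "unknown" label are invented for the example.

```r
library(tidyr)

# Toy data frame with one numeric and one character column.
df <- data.frame(
  count = c(1, NA, 3),
  label = c("x", NA, "z"),
  stringsAsFactors = FALSE
)

# One replacement per column, matched by name.
filled <- replace_na(df, list(count = 0, label = "unknown"))
filled
```

Columns that are not named in the list are left untouched.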
#14
3
This simple function extracted from Datacamp could help:
replace_missings <- function(x, replacement) {
    is_miss <- is.na(x)
    x[is_miss] <- replacement
    message(sum(is_miss), " missings replaced by the value ", replacement)
    x
}
Then
replace_missings(df, replacement = 0)