Suppose I write the following R code:
first.value <- sample(100, 100, replace=TRUE)
second.value <- sample(10, 100, replace=TRUE)
X <- data.frame(first.value, second.value)
split.X <- split(X, second.value)
This code creates a data frame with two fields and splits it into bins according to the second field. Now suppose I want to normalize each bin, i.e., subtract the bin's mean and divide by its standard deviation. I could accomplish this with
normalized.first.value <- sapply(split.X, function(X) {(X$first.value - mean(X$first.value)) / sd(X$first.value)})
But this creates a new list with the normalized versions of each bin. What I really want to do is replace the copy of the data in split.X with its normalized version.
To illustrate, here's some sample output:
> first.value <- sample(100, 100, replace=TRUE)
> second.value <- sample(10, 100, replace=TRUE)
> X <- data.frame(first.value, second.value)
> split.X <- split(X, second.value)
> normalized.first.value <- sapply(split.X, function(X) {(X$first.value - mean(X$first.value)) / sd(X$first.value)})
> split.X[[1]]
   first.value second.value
4           34            1
8           40            1
24          21            1
31          34            1
37          23            1
40          22            1
40 22 1
> normalized.first.value[[1]]
[1] 0.625 1.375 -1.000 0.625 -0.750 -0.875
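As a sanity check on the output above, the first bin can be normalized by hand. This is only a small sketch using the six values printed for split.X[[1]]; the names x and z are illustrative and are not part of the original code.

```r
# Values of first.value in the first bin, copied from the output above
x <- c(34, 40, 21, 34, 23, 22)

# mean(x) is 29 and sd(x) is 8, so each value becomes (value - 29) / 8
z <- (x - mean(x)) / sd(x)
print(z)

# Base R's scale() applies the same centering and scaling to a vector
stopifnot(isTRUE(all.equal(z, as.vector(scale(x)))))
```

The printed vector should match normalized.first.value[[1]] shown above.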
What I really want to do is to put the values of normalized.first.value[[1]] into split.X[[1]]$first.value, and the same for the other indices.
This could be achieved with a for loop as follows:
for (i in seq_along(split.X)) {
  # normalize first.value within bin i and write it back in place
  x <- split.X[[i]]$first.value
  split.X[[i]]$first.value <- (x - mean(x)) / sd(x)
}
But for loops are BAD in R, and I'd like to use sapply, lapply, etc. if I can. Unfortunately, when dealing with a list of data frames, sapply and lapply don't seem to iterate in the way I want.
2 Answers
#1
1
You can use Map, as both lists have the same length. It works by replacing the 'first.value' column of each data frame in 'split.X' with the corresponding list element in 'normalized.first.value'.
Map(function(x,y) {x[['first.value']] <- y;x} ,split.X, normalized.first.value)
Or we can loop over the indices of 'split.X', extract the elements of 'split.X' and 'normalized.first.value' at each index, and then replace the column.
lapply(seq_along(split.X), function(i) {
x1 <- split.X[[i]]
x1[,'first.value'] <- normalized.first.value[[i]]
x1})
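Both calls above return a new list rather than modifying 'split.X', so the result has to be assigned back. A related sketch, not from the original answer: because lapply keeps whatever the function returns, returning the whole modified data frame normalizes and replaces in one pass, without the intermediate 'normalized.first.value'. The deterministic data, the argument name bin, and the result name normalized.split.X are assumptions made for illustration.

```r
# Deterministic stand-in for the sampled data in the question (assumption),
# so that every bin has exactly 10 rows and a non-zero standard deviation
first.value <- 1:100
second.value <- rep(1:10, times = 10)
X <- data.frame(first.value, second.value)
split.X <- split(X, second.value)

# Return the modified data frame from the function, not just the column
normalized.split.X <- lapply(split.X, function(bin) {
  bin$first.value <- (bin$first.value - mean(bin$first.value)) / sd(bin$first.value)
  bin
})

# Each element is still a data frame with both columns
str(normalized.split.X[[1]])
```

Note that a bin with a single row, or with all values equal, would yield NA or NaN here, since sd() is NA or 0 in those cases.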
#2
2
Here's a more arcane way (though I still reckon the for loop is fine in this case):
new.split.X <- mapply(`[<-`, split.X, T, 'first.value', normalized.first.value,
SIMPLIFY=F)
How it works: it applies [<- to each split.X[[i]]. The T is the i index to replace (i.e. all rows), 'first.value' is the j index to replace (that column), and normalized.first.value contains the replacements.
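To make the [<- trick less opaque, here is a minimal sketch of the same call on a single data frame. The data frame df and the replacement values are made up for illustration; only the calling convention mirrors the mapply line above.

```r
# A small illustrative data frame (not from the question)
df <- data.frame(a = 1:3, b = 4:6)

# Functional form of df[TRUE, 'a'] <- c(10, 20, 30):
# arguments are the object, the i index, the j index, and the replacement
out <- `[<-`(df, TRUE, 'a', c(10, 20, 30))

print(out)  # modified copy: column a replaced, column b untouched
print(df)   # the original data frame is unchanged
```

Called this way, [<- returns the modified copy instead of assigning it, which is why the mapply result has to be stored in new.split.X.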
The loop may be easier to read in the end though, and probably not slower than tricksy *apply solutions.
library(rbenchmark)
benchmark(loop={
for (i in 1:length(split.X))
split.X[[i]]$first.value <- normalized.first.value[[i]]
},
mapply={
mapply(`[<-`, split.X, T, 'first.value', normalized.first.value,
SIMPLIFY=F)
},
Map={
Map(function(x,y) {x[['first.value']] <- y;x} ,split.X, normalized.first.value)
},
lapply={
lapply(seq_along(split.X), function(i) {
x1 <- split.X[[i]]
x1[,'first.value'] <- normalized.first.value[[i]]
x1})
})
    test replications elapsed relative user.self sys.self user.child sys.child
4 lapply          100   0.034    4.857     0.035        0          0         0
1   loop          100   0.007    1.000     0.007        0          0         0
3    Map          100   0.012    1.714     0.013        0          0         0
2 mapply          100   0.030    4.286     0.032        0          0         0
So the explicit loop is the fastest in this benchmark, and the easiest to read anyway.