I have the data frame below. I want to add a column that classifies the rows according to column 1 (h_no), so that the first run of h_no (1, 2, 3, 4) is class 1, the second run of h_no (1 to 7) is class 2, and so on, as indicated in the last column.
h_no  h_freq   h_freqsq
1     0.09091  0.008264628  1
2     0.00000  0.000000000  1
3     0.04545  0.002065702  1
4     0.00000  0.000000000  1
1     0.13636  0.018594050  2
2     0.00000  0.000000000  2
3     0.00000  0.000000000  2
4     0.04545  0.002065702  2
5     0.31818  0.101238512  2
6     0.00000  0.000000000  2
7     0.50000  0.250000000  2
1     0.13636  0.018594050  3
2     0.09091  0.008264628  3
3     0.40909  0.167354628  3
4     0.04545  0.002065702  3
5 Answers
#1
135
You can add a column to your data using various techniques. The quotes below come from the "Details" section of the relevant help text, [[.data.frame.
Data frames can be indexed in several modes. When [ and [[ are used with a single vector index (x[i] or x[[i]]), they index the data frame as if it were a list.
my.dataframe["new.col"] <- a.vector
my.dataframe[["new.col"]] <- a.vector
The data.frame method for $ treats x as a list.
my.dataframe$new.col <- a.vector
When [ and [[ are used with two indices (x[i, j] and x[[i, j]]) they act like indexing a matrix.
my.dataframe[ , "new.col"] <- a.vector
If you don't specify whether you are working with rows or columns, the method for data.frame assumes you mean columns.
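As a quick check of the indexing modes quoted above, the following sketch adds the same vector in all four ways and confirms that the resulting columns are identical. The names my.dataframe and a.vector follow the snippets above; the toy values and the column names col.a to col.d are made up for illustration.

```r
# toy data, invented for illustration
my.dataframe <- data.frame(id = 1:3)
a.vector <- c(10, 20, 30)

my.dataframe["col.a"] <- a.vector      # list-style, single bracket
my.dataframe[["col.b"]] <- a.vector    # list-style, double bracket
my.dataframe$col.c <- a.vector         # $ method
my.dataframe[, "col.d"] <- a.vector    # matrix-style, column index

# compare every other new column against col.a
same <- sapply(my.dataframe[c("col.b", "col.c", "col.d")],
               identical, my.dataframe$col.a)
all(same)
```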
For your example, this should work:
# make some fake data
your.df <- data.frame(no = c(1:4, 1:7, 1:5), h_freq = runif(16), h_freqsq = runif(16))
# find where each sequence starts (where no == 1) ...
from <- which(your.df$no == 1)
# ... and where it ends (the row before the next start, or the last row)
to <- c((from - 1)[-1], nrow(your.df))
# for each sequence, compute its length and repeat a consecutive group number that many times
# (SIMPLIFY = FALSE keeps a list even when all sequences have the same length)
get.seq <- mapply(from, to, seq_along(from), FUN = function(x, y, z) {
  len <- length(seq(from = x[1], to = y[1]))
  return(rep(z, times = len))
}, SIMPLIFY = FALSE)
# unlist to get a single vector and append it to your original data.frame
your.df$group <- unlist(get.seq)
# since this is designating a group, it makes sense to make it a factor
your.df$group <- as.factor(your.df$group)
no h_freq h_freqsq group
1 1 0.40998238 0.06463876 1
2 2 0.98086928 0.33093795 1
3 3 0.28908651 0.74077119 1
4 4 0.10476768 0.56784786 1
5 1 0.75478995 0.60479945 2
6 2 0.26974011 0.95231761 2
7 3 0.53676266 0.74370154 2
8 4 0.99784066 0.37499294 2
9 5 0.89771767 0.83467805 2
10 6 0.05363139 0.32066178 2
11 7 0.71741529 0.84572717 2
12 1 0.10654430 0.32917711 3
13 2 0.41971959 0.87155514 3
14 3 0.32432646 0.65789294 3
15 4 0.77896780 0.27599187 3
16 5 0.06100008 0.55399326 3
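Because each group is just its index repeated once per row, the mapply step could also be written as a single vectorised rep call. This is a sketch of an equivalent formulation, not part of the answer above; it assumes from and to are computed the same way, so the run lengths are to - from + 1.

```r
# same fake h_no layout as above, without the random columns
no <- c(1:4, 1:7, 1:5)
from <- which(no == 1)                  # start row of each sequence
to <- c((from - 1)[-1], length(no))     # end row of each sequence
# repeat group index i once for every row in sequence i
group <- factor(rep(seq_along(from), times = to - from + 1))
table(group)
```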
#2
10
Easily. Suppose your data frame is A:
b <- A[,1]
b <- b==1
b <- cumsum(b)
Then b is the column you want.
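Applied to the h_no values from the question, the three lines collapse into one assignment. This is a minimal sketch; the column name class is an assumption, and it relies on every run starting at 1.

```r
# h_no values from the question
A <- data.frame(h_no = c(1:4, 1:7, 1:4))
# each 1 starts a new run, so the running count of 1s is the class
A$class <- cumsum(A[, 1] == 1)
A$class
```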
#3
7
If I understand the question correctly, you want to detect when h_no doesn't increase and then increment the class. (I'm going to walk through how I solved this problem; there is a self-contained function at the end.)
Working
We only care about the h_no column for the moment, so we can extract that from the data frame:
> h_no <- data$h_no
We want to detect when h_no doesn't go up, which we can do by working out when the difference between successive elements is either negative or zero. R provides the diff function, which gives us the vector of differences:
> d.h_no <- diff(h_no)
> d.h_no
[1] 1 1 1 -3 1 1 1 1 1 1 -6 1 1 1
Once we have that, it is a simple matter to find the ones that are non-positive:
> nonpos <- d.h_no <= 0
> nonpos
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[13] FALSE FALSE
In R, TRUE and FALSE are treated as 1 and 0 in arithmetic, so if we take the cumulative sum of nonpos, it will increase by 1 in (almost) the appropriate spots. The cumsum function (which is basically the opposite of diff) can do this.
> cumsum(nonpos)
[1] 0 0 0 1 1 1 1 1 1 1 2 2 2 2
But there are two problems: the numbers are one too small, and we are missing the first element (there should be four in the first class).
The first problem is simply solved: 1 + cumsum(nonpos). The second just requires adding a 1 to the front of the vector, since the first element is always in class 1:
> classes <- c(1, 1 + cumsum(nonpos))
> classes
[1] 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3
Now we can attach it back onto our data frame with cbind (by using the class= syntax, we can give the column the class heading):
> data_w_classes <- cbind(data, class=classes)
And data_w_classes now contains the result.
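To try the walkthrough end to end, here is a sketch that rebuilds the first two columns of the question's table and compares the computed classes with the ones given in its last column. The h_freqsq column is left out for brevity, and the variable expected is introduced only for this check.

```r
# first two columns of the table in the question
data <- data.frame(
  h_no = c(1:4, 1:7, 1:4),
  h_freq = c(0.09091, 0, 0.04545, 0, 0.13636, 0, 0, 0.04545,
             0.31818, 0, 0.5, 0.13636, 0.09091, 0.40909, 0.04545)
)

nonpos <- diff(data$h_no) <= 0         # TRUE wherever h_no fails to increase
classes <- c(1, 1 + cumsum(nonpos))    # shift by one and prepend the first row
data_w_classes <- cbind(data, class = classes)

# classes listed in the last column of the question
expected <- rep(1:3, times = c(4, 7, 4))
all(data_w_classes$class == expected)
```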
Final result
We can compress the lines together and wrap it all up into a function to make it easier to use:
classify <- function(data) {
  cbind(data, class = c(1, 1 + cumsum(diff(data$h_no) <= 0)))
}
Or, since it makes sense for the class to be a factor:
classify <- function(data) {
  cbind(data, class = factor(c(1, 1 + cumsum(diff(data$h_no) <= 0))))
}
You use either function like this:
> classified <- classify(data) # doesn't overwrite data
> data <- classify(data) # data now has the "class" column
(This way of solving the problem is good because it avoids explicit iteration, which is generally recommended in R, and avoids generating lots of intermediate vectors, lists, etc. It's also kinda neat that it can be written on one line :) )
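One practical difference between detecting a non-increase with diff and counting occurrences of 1 shows up when a run does not start at 1. The sketch below uses made-up h_no values to illustrate this; which behaviour is appropriate depends on how the data are defined, so treat it as an illustration only.

```r
# made-up values: the first and third runs do not start at 1
h_no <- c(2:4, 1:3, 3:5)

by_diff <- c(1, 1 + cumsum(diff(h_no) <= 0))  # new class whenever h_no fails to increase
by_one <- cumsum(h_no == 1)                   # new class only when h_no equals 1

rbind(h_no, by_diff, by_one)
```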
#4
2
In addition to Roman's answer, something like this might be even simpler. Note that I haven't tested it because I do not have access to R right now.
# Note that I use a global variable here.
# This is normally not advisable, but I liked
# using it here to make the code shorter.
index <<- 0
new_column <- sapply(df$h_no, function(x) {
  # <<- updates the global counter; a plain = would only create a local copy
  if (x == 1) index <<- index + 1
  return(index)
})
The function iterates over the values in h_no and always returns the category that the current value belongs to. If a value of 1 is detected, we increase the global variable index and continue.
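If the global variable is a concern, the same counting idea can be kept inside a closure. This is a sketch of one possible variant, not the answer's own code; make_counter is a hypothetical helper name, and the h_no values are taken from the question.

```r
# hypothetical helper: returns a function that carries its own private counter
make_counter <- function() {
  index <- 0
  function(x) {
    if (x == 1) index <<- index + 1   # updates index in the enclosing function
    index
  }
}

h_no <- c(1:4, 1:7, 1:4)
new_column <- sapply(h_no, make_counter())
new_column
```

Each call to make_counter() starts again from zero, so nothing is left behind in the global environment between runs.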
#5
1
Data.frame[,'h_new_column'] <- as.integer(Data.frame[,'h_no'], breaks=c(1, 4, 7))