I have a data frame with a numeric ID variable that identifies the Primary, Secondary and Ultimate Sampling Units from a multistage sampling scheme. I want to split the original ID variable into three new variables, each identifying one of the sampling units:
Example:
> df[1:2, ]
ID_Var var1 var2 var3 var4  var5
501901    9 SP.1    1    W 12.10
501901    9 SP.1    2    W 17.68
What I want:
> df[1:2, ]
ID1 ID2 ID3 var1 var2 var3 var4  var5
  5  01 901    9 SP.1    1    W 12.10
  5  01 901    9 SP.1    2    W 17.68
I know there are functions in R for splitting character strings, but I could not find similar facilities for numbers.
Thank you,
Juan
7 answers
#1
10
Yet another alternative is to re-read the first column using read.fwf and specify the widths:
cbind(read.fwf(file = textConnection(as.character(df[, 1])),
widths = c(1, 2, 3), colClasses = "character",
col.names = c("ID1", "ID2", "ID3")),
df[-1])
# ID1 ID2 ID3 var1 var2 var3 var4 var5
# 1 5 01 901 9 SP.1 1 W 12.10
# 2 5 01 901 9 SP.1 2 W 17.68
One advantage here is being able to set the resulting column names in a convenient manner, and ensure that the columns are characters, thus retaining any leading zeroes that might be present.
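To see the effect of colClasses, here is a small self-contained sketch. The two-row df is made up for illustration (it is not the asker's data); the point is only that ID2 comes back as the text "01" rather than the number 1.

```r
# Toy data standing in for the asker's data frame (assumed, for illustration only)
df <- data.frame(ID = c(501901, 501902), var1 = c(9, 9))

# Re-read the ID column as fixed-width text: 1 + 2 + 3 characters
ids <- read.fwf(file = textConnection(as.character(df$ID)),
                widths = c(1, 2, 3), colClasses = "character",
                col.names = c("ID1", "ID2", "ID3"))

res <- cbind(ids, df[-1])
str(res)   # ID1, ID2 and ID3 are character columns
```

Dropping colClasses would let read.fwf guess the column types, and the leading zero in ID2 would most likely be lost.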
#2
18
You could, for example, use substring:
df <- data.frame(ID = c(501901, 501902))
splitted <- t(sapply(df$ID, function(x) substring(x, first=c(1,2,4), last=c(1,3,6))))
cbind(df, splitted)
# ID 1 2 3
#1 501901 5 01 901
#2 501902 5 01 902
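One thing to watch with this approach: substring() relies on R's default number-to-text conversion, which can switch to scientific notation for some values (a round ID such as 100000 is a likely candidate). A hedged sketch that formats the ID explicitly first; the fixed width of 6 is an assumption taken from the example IDs, and the second ID is invented to show the edge case.

```r
df <- data.frame(ID = c(501901, 100000))

# Format as a 6-character, zero-padded string with no exponent
id_chr <- sprintf("%06.0f", df$ID)

# substring() is vectorised, so no sapply() is needed here
parts <- data.frame(ID1 = substring(id_chr, 1, 1),
                    ID2 = substring(id_chr, 2, 3),
                    ID3 = substring(id_chr, 4, 6),
                    stringsAsFactors = FALSE)
cbind(df, parts)
```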
#3
5
This should work:
# paste() converts the numeric ID to character before the regex is applied.
df <- cbind(do.call(rbind, strsplit(gsub('(.)(..)(...)', '\\1 \\2 \\3', paste(df[,1])), ' ')), df[,-1])
Or with substr():
df <- cbind(ID1=substr(df[, 1],1,1), ID2=substr(df[, 1],2,3), ID3=substr(df[, 1],4,6), df[, -1])
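If the regex route appeals, base R (3.4.0 and later, as far as I know) also has utils::strcapture(), which returns the capture groups directly as a data frame. A minimal sketch on made-up data; the column names in proto are my own choice, not something from the question.

```r
# Toy data (assumed, for illustration only)
df <- data.frame(ID = c(501901, 501902), var1 = c(9, 9))

# Template describing the names and types of the captured pieces
proto <- data.frame(ID1 = character(), ID2 = character(), ID3 = character(),
                    stringsAsFactors = FALSE)

# One capture group per sampling unit: 1, 2 and 3 characters
parts <- strcapture("^(.)(..)(...)$", as.character(df$ID), proto)
res <- cbind(parts, df[-1])
res
```

Because the pieces stay as text, the leading zero in ID2 is kept.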
#4
5
Since they are numbers, you will have to do some math to extract the digits you want. A number represented in radix-10 can be written as:
d0*10^0 + d1*10^1 + d2*10^2 ... etc. where d0..dn are the digits of the number.
Thus, suppose you want to extract the most significant digit from a 6-digit number, which is mathematically represented as:
number = d5*10^5 + d4*10^4 + d3*10^3 + d2*10^2 + d1*10^1 + d0*10^0
As you can see, dividing this number by 10^5 will get you:
number / 10^5 = d5*10^0 + d4*10^(-1) + d3*10^(-2) + d2*10^(-3) + d1*10^(-4) + d0*10^(-5)
Voila! If you truncate the result to an integer, you have extracted the most significant digit: every other digit now has a negative exponent, so its weight is less than 1 and those terms together add up to less than 1. You can extract the other digits in a similar way. For the least significant digits, use a modulo operation instead of division.
Examples:
501901 / 10^5 = 5              // first digit ("/" is integer division here)
501901 % 10^1 = 1              // last digit
(501901 / 10^4) % 10^1 = 0     // second digit
(501901 / 10^2) % 10^2 = 19    // third and fourth digits
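In R, the same arithmetic can be written with %/% (integer division) and %% (modulo). This is a sketch for the 1-2-3 digit split asked about in the question, using made-up IDs; note that the pieces are numbers, so a leading zero is not preserved.

```r
id <- c(501901, 501902)

id1 <- id %/% 10^5             # leading digit
id2 <- (id %/% 10^3) %% 10^2   # digits 2-3, as a number (so "01" becomes 1)
id3 <- id %% 10^3              # last three digits

data.frame(ID1 = id1, ID2 = id2, ID3 = id3)
```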
#5
4
Several neat answers were posted years ago, but a solution I find useful, using the outer function, has not been mentioned. In this age of search engines, I am putting it here in case others find it handy.
I was faced with a slightly simpler problem: turning a column of 6-digit numbers into 6 columns representing each digit. This can be solved using a combination of outer, integer division (%/%) and modulo (%%).
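# floor() makes the IDs whole 6-digit numbers; raw runif() output has a fractional part
DF <- data.frame("ID" = floor(runif(3, 10^5, 10^6)), "a" = sample(letters, 3, TRUE))
DF <- cbind(DF, "ID" = outer(DF$ID, 10^c(5:0), function(a, b) a %/% b %% 10))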
DF
# ID a ID.1 ID.2 ID.3 ID.4 ID.5 ID.6
# 1 814895 z 8 1 4 8 9 5
# 2 417209 q 4 1 7 2 0 9
# 3 545797 c 5 4 5 7 9 7
The question asked here is slightly more complex, requiring different values for both integer division and modulo.
# floor() keeps the IDs whole, so the printed ID matches the extracted parts
DF <- data.frame("ID" = floor(runif(3, 10^5, 10^6)), "a" = sample(letters, 3, TRUE))
DF <- cbind(DF, "ID" = outer(DF$ID, c(1:3), function(a,b) a %/% 10^c(5,3,0)[b] %% 10^b))
DF
#       ID a ID.1 ID.2 ID.3
# 1 809372 q    8    9  372
# 2 954789 g    9   54  789
# 3 166969 l    1   66  969
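The hard-coded c(5,3,0) and 10^b only fit the 1-2-3 split. A possible generalisation, offered as a sketch rather than as part of the original answer: derive both the divisor and the modulus from a vector of field widths. The function name split_by_widths and the "ID" column prefix are my own inventions.

```r
# Split whole numbers into fields of the given digit widths (left to right).
# split_by_widths is a hypothetical helper, not an existing R function.
split_by_widths <- function(id, widths) {
  # Number of digits to the right of each field, e.g. c(5, 3, 0) for c(1, 2, 3)
  right <- rev(cumsum(rev(widths))) - widths
  out <- outer(id, seq_along(widths),
               function(a, b) (a %/% 10^right[b]) %% 10^widths[b])
  colnames(out) <- paste0("ID", seq_along(widths))
  out
}

res <- split_by_widths(c(501901, 954789), widths = c(1, 2, 3))
res
```

The result is a numeric matrix with one row per ID, so it can be attached to a data frame with cbind() in the same way as above.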
#6
3
If you don't want to convert to character for some reason, the following is one way to achieve what you want:
DF <- data.frame(ID = c(501901, 501902), var1 = c("a", "b"), var2 = c("c", "d"))
result <- t(sapply(DF$ID, function(y) {
c(y%/%1e+05, (y - y%/%1e+05 * 1e+05)%/%1000, y - y%/%1000 * 1000)
}))
DF <- cbind(result, DF[, -1])
names(DF)[1:3] <- c("ID1", "ID2", "ID3")
DF
## ID1 ID2 ID3 var1 var2
## 1 5 1 901 a c
## 2 5 1 902 b d
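Note that ID2 prints as 1 rather than 01, because the pieces are numeric. If the zero-padded form from the question is needed for display, one option is to format each piece back to its original width with sprintf(). A small sketch, with the numeric pieces typed in by hand so it runs on its own:

```r
# Numeric pieces as produced by the arithmetic approach (entered by hand here)
parts <- data.frame(ID1 = c(5, 5), ID2 = c(1, 1), ID3 = c(901, 902))

# Zero-pad each piece to its original width (1, 2 and 3 digits)
padded <- data.frame(ID1 = sprintf("%01.0f", parts$ID1),
                     ID2 = sprintf("%02.0f", parts$ID2),
                     ID3 = sprintf("%03.0f", parts$ID3),
                     stringsAsFactors = FALSE)
padded
```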
#7
2
With so many answers it felt like I needed to come up with something :)
library(qdap)
x <- colSplit(dat$ID_Var, col.sep="")
data.frame(ID1=x[, 1], ID2=paste2(x[, 2:3], sep=""),
ID3=paste2(x[, 4:6],sep=""), dat[, -1])
## ID1 ID2 ID3 var1 var2 var3 var4 var5
## 1 5 01 901 9 SP.1 1 W 12.10
## 2 5 01 901 9 SP.1 2 W 17.68