
Time: 2021-07-11 21:25:45

I want to create new columns by splitting a vector in a data frame.


I have a data frame like this:


YEAR Variable1 Variable2 
2009 000000    00000001
2010 000000    00000001
2011 000000    00000001
2009 000000    00000002
2010 000000    00000002
2009 000000    00000003
2009 100000    10000001
2010 100000    10000001
2009 100000    10000011

As you can see, Variable2 is related to Variable1 (Variable2 = Variable1 + two final digits, e.g. 01, 02, 03, ..., indicating subcategories). I want to split Variable2 into as many variables as there are subcategories. The result should be:


YEAR Variable1 Variable2 Variable3 Variable4 ... 
2009 000000    00000001  0         0        
2010 000000    00000001  0         0
2011 000000    00000001  0         0
2009 000000    0         00000002  0
2010 000000    0         00000002  0
2009 000000    0         0         00000003
2009 100000    10000001  0         0     
2010 100000    10000001  0         0     
2009 100000    0         0         0       ...      10000011 

How would you proceed? I thought I should try to recode Variable2 in a loop. I tried manipulating strings, but I didn't solve the problem.


6 Answers



This will work. First let's build the data.


library(data.table)

# Four example codes, sampled with replacement into a 10-row table
values <- paste0("0000000", 1:4)
dt <- data.table(val = sample(values, 10, replace = TRUE))

A for loop is enough to define the new columns.


for (level_var in dt[, unique(val)]) {
  # (level_var) makes data.table use the string's value as the new column name
  dt[, (level_var) := ifelse(val == level_var, level_var, "0")]
}
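For readers without data.table, the same loop can be sketched in base R on the question's sample data. This is only an illustration under assumptions: the data frame name `df` is made up here, and the codes are stored as character so the leading zeros survive.

```r
# Question's sample data; codes kept as character to preserve leading zeros
df <- data.frame(
  YEAR = c(2009, 2010, 2011, 2009, 2010, 2009, 2009, 2010, 2009),
  Variable1 = c(rep("000000", 6), rep("100000", 3)),
  Variable2 = c("00000001", "00000001", "00000001", "00000002", "00000002",
                "00000003", "10000001", "10000001", "10000011"),
  stringsAsFactors = FALSE
)

# One new column per distinct code: the code where it matches, "0" elsewhere
for (level_var in unique(df$Variable2)) {
  df[[level_var]] <- ifelse(df$Variable2 == level_var, level_var, "0")
}
```

Note that this creates one column per full code, so 00000001 and 10000001 end up in different columns.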



Here's another suggestion. The code is somewhat longer, but I believe it does the trick, and I hope that it can be understood easily. I'm assuming that the original data is stored in a tab-separated file named 'data.dat'. The output of the code is stored in the matrix 'new_matrix'. The entries are characters, but it shouldn't be a problem to convert them into integers if necessary.


data <- read.table('data.dat', sep='\t', header = TRUE, colClasses = "character")
var2 <- data[3]
nc <- nchar(var2[1,1])
last2 <-substr(var2[,1],nc-1,nc)
subcat <-levels(factor(last2))
mrows <- nrow(data)
mcols <- length(subcat)
varnames <-paste0("Variable",as.character(c(1:(mcols+1))))
new_matrix <- matrix(paste(replicate(nc,"0"),collapse=""),nrow=mrows,ncol=mcols+2)
colnames(new_matrix) <- c("YEAR",varnames)
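for (i in 1:mcols) {
    relevant_rows <- which(last2 == subcat[i])
    # Assumed completion (the listing breaks off after the line above):
    # copy each code into the column of its subcategory
    new_matrix[relevant_rows, i + 2] <- var2[relevant_rows, 1]
}
new_matrix[, 1] <- data[, 1]  # YEAR
new_matrix[, 2] <- data[, 2]  # Variable1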

Hope this helps.
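If integers are needed, one possible way to convert a character matrix is to change its storage mode, which keeps the dimensions and column names. The small matrix `m` below is a made-up stand-in for 'new_matrix', not the actual output of the code above.

```r
# Hypothetical stand-in for a character matrix such as 'new_matrix'
m <- matrix(c("2009", "2010", "00000001", "00000000"), nrow = 2,
            dimnames = list(NULL, c("YEAR", "Variable2")))

# Coerce every entry to integer; dim and dimnames are preserved
m_int <- m
storage.mode(m_int) <- "integer"
```

Keep in mind that the leading zeros are lost in the conversion, so "00000001" becomes 1.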




Using reshape2. A one-line solution. Another line if we'd like to remove the NA values.


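library(reshape2)

# Sample data, matching the rows of the output shown below
df <- data.frame(YEAR=c(2009,2010,2011,2009,2010,2009,2009,2010,2009),
                 Var1=c('000000','000000','000000','000000','000000','000000','100000','100000','100000'),
                 Var2=c('0000001','0000001','0000001','0000002','0000002','0000003','1000001','1000001','1000011'),
                 stringsAsFactors = FALSE)
# Spread Var2 into one column per code, then drop the original Var2 column
df <- dcast(df, YEAR + Var1 + Var2 ~ Var2, value.var = "Var2")[, -3]
df[is.na(df)] <- 0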


  YEAR   Var1 0000001 0000002 0000003 1000001 1000011
1 2009 000000 0000001       0       0       0       0
2 2009 000000       0 0000002       0       0       0
3 2009 000000       0       0 0000003       0       0
4 2009 100000       0       0       0 1000001       0
5 2009 100000       0       0       0       0 1000011
6 2010 000000 0000001       0       0       0       0
7 2010 000000       0 0000002       0       0       0
8 2010 100000       0       0       0 1000001       0
9 2011 000000 0000001       0       0       0       0



Here's another approach. Note that I choose to make the subcat dummy variables into binary indicator variables to reduce redundancy:



data <- read.table(header=TRUE, text='
  year var1      var2
  2009 000000    00000001
  2010 000000    00000001
  2009 000000    00000002
  2010 000000    00000002
  2009 000000    00000003
  2009 100000    10000001
  2009 100000    10000004
  2010 100000    10000010                 
', colClasses = c('character', 'character', 'character'))

Simplifying var2 column:


# Keep only the last two characters (the subcategory) of each code
subCat <- function(s) {
  substr(s, nchar(s) - 1, nchar(s))
}
data$var2 <- subCat(data$var2)

Creating dummies:

Method 1:

t <- table(1:length(data$var2), data$var2)
data <- cbind(data, as.data.frame.matrix(t))
data$var2 <- NULL


  year   var1 01 02 03 04 10
1 2009 000000  1  0  0  0  0
2 2010 000000  1  0  0  0  0
3 2009 000000  0  1  0  0  0
4 2010 000000  0  1  0  0  0
5 2009 000000  0  0  1  0  0
6 2009 100000  1  0  0  0  0
7 2009 100000  0  0  0  1  0
8 2010 100000  0  0  0  0  1


Method 2:

library(dummies)  # dummy() is presumably the one from the 'dummies' package

data$var2 <- subCat(data$var2)
data3 <- cbind(data, dummy(data$var2))
data3$var2 <- NULL


  year   var1 data01 data02 data03 data04 data10
1 2009 000000      1      0      0      0      0
2 2010 000000      1      0      0      0      0
3 2009 000000      0      1      0      0      0
4 2010 000000      0      1      0      0      0
5 2009 000000      0      0      1      0      0
6 2009 100000      1      0      0      0      0
7 2009 100000      0      0      0      1      0
8 2010 100000      0      0      0      0      1


Method 3:

dummies <- sapply(unique(data$var2), function(x) as.numeric(data$var2 == x))
data <- cbind(data, dummies)
data$var2 = NULL


  year   var1 X01 X02 X03 X04 X10
1 2009 000000   1   0   0   0   0
2 2010 000000   1   0   0   0   0
3 2009 000000   0   1   0   0   0
4 2010 000000   0   1   0   0   0
5 2009 000000   0   0   1   0   0
6 2009 100000   1   0   0   0   0
7 2009 100000   0   0   0   1   0
8 2010 100000   0   0   0   0   1



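library(dplyr)
library(reshape2)

# Sample data, matching the rows of the output shown below
df <- data.frame(YEAR=c(2009,2010,2011,2009,2010,2009,2009,2010,2009),
                 Var1=c('000000','000000','000000','000000','000000','000000','100000','100000','100000'),
                 Var2=c('0000001','0000001','0000001','0000002','0000002','0000003','1000001','1000001','1000011'),
                 stringsAsFactors = FALSE)

# Build a unique tag per row so dcast() never has to aggregate
df <- mutate(df, tag=paste(YEAR, Var1, Var2, sep='-'))
df <- dcast(df, YEAR + Var1 + tag ~ Var2, value.var = "tag", fun.aggregate = NULL)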
df$tag <- NULL
df <- apply(df, 2, function(x) sub('^(.*)-(.*)-', '', x))
df[is.na(df)] <- 0
df <- as.data.frame(df)


  YEAR   Var1 0000001 0000002 0000003 1000001 1000011
1 2009 000000 0000001       0       0       0       0
2 2009 000000       0 0000002       0       0       0
3 2009 000000       0       0 0000003       0       0
4 2009 100000       0       0       0 1000001       0
5 2009 100000       0       0       0       0 1000011
6 2010 000000 0000001       0       0       0       0
7 2010 000000       0 0000002       0       0       0
8 2010 100000       0       0       0 1000001       0
9 2011 000000 0000001       0       0       0       0



Thank you for all these answers. I found the solution by combining Michele Usuelli's answer with Synergist's comment on it. I also learnt more about data.table.


NbTabelle <- data.table(val=Netz)
for (level_var in namesvec) {
  # Characters 7-8 are the last two digits of the eight-character code
  NbTabelle[, eval(level_var) := ifelse(substr(eval(val), 7, 8) == level_var, val, 0)]
}

Here namesvec is the vector of variable names that I created from the previously generated tables, leaving aside the variable val. I appreciated the generality of Synergist's code, but for my purpose I needed only the last two digits.
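A self-contained base R sketch of the same idea, keyed on the last two digits, might look like the following. It is an illustration under assumptions: `Netz` is filled with the question's sample codes, `namesvec` is derived from the data rather than from earlier tables, and the `sub_` column prefix and the names `suffix`, `cols` and `result` are made up here.

```r
# Assumed input: the question's Variable2 codes, stored as character
Netz <- c("00000001", "00000001", "00000001", "00000002", "00000002",
          "00000003", "10000001", "10000001", "10000011")

# Last two digits of each code identify the subcategory
suffix <- substr(Netz, nchar(Netz) - 1, nchar(Netz))
namesvec <- sort(unique(suffix))

# One column per subcategory: the full code where the suffix matches, "0" elsewhere
cols <- lapply(namesvec, function(s) ifelse(suffix == s, Netz, "0"))
names(cols) <- paste0("sub_", namesvec)
result <- data.frame(val = Netz, cols, stringsAsFactors = FALSE)
```

With this keying, 00000001 and 10000001 share one column, as in the layout requested in the question.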




