Split a string without losing characters - R

Time: 2021-10-27 02:29:40

I have two columns in a much larger data frame that I am having difficulty splitting. I have used strsplit in the past to split on a space, "," or some other delimiter. The hard part here is that I don't want to lose any characters, and for some rows the split will leave a piece missing. I would like to end up with four columns. Here is a sample of a few rows of what I have now.

age-gen  surv-camp
45M      1LC
9F       0
12M      1AC
67M      1LC

Here is what I would like to ultimately get.

age   gen   surv   camp
45    M     1      LC
9     F     0      
12    M     1      AC
67    M     1      LC

I've done quite a lot of hunting around on here and have found a number of responses for Java, C++, HTML, etc., but I haven't found anything that explains how to do this in R when some of the data is missing.

I saw this about adding a space between values and then just splitting on the space, but I don't see how this would work 1) with missing data, 2) when I don't have consistent numeric or character values in each row.

1 solution

#1


We loop through the columns of 'df1' with lapply(df1, ...), insert a delimiter after the leading numeric substring using sub, read each resulting vector as a data.frame with read.table, cbind the list of data.frames, and change the column names of the output.

res <- do.call(cbind, lapply(df1, function(x)
      read.table(text=sub("(\\d+)", "\\1,", x), 
          header=FALSE, sep=",", stringsAsFactors=FALSE)))
colnames(res) <- scan(text=names(df1), sep=".", what="", quiet = TRUE)
res
#  age gen surv camp
#1  45   M    1   LC
#2   9   F    0     
#3  12   M    1   AC
#4  67   M    1   LC
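
To see what the sub call produces before read.table parses it, here is a minimal sketch on the surv.camp values from the sample data. The variable names are only for illustration.

```r
# The surv.camp values from the sample data
x <- c("1LC", "0", "1AC", "1LC")

# Insert a comma after the first run of digits; a value with no
# trailing letters ("0") simply gets an empty second field
with_delim <- sub("(\\d+)", "\\1,", x)
with_delim
# with_delim is c("1,LC", "0,", "1,AC", "1,LC")
```

Because every value ends up with exactly one comma, read.table sees two fields per row, and nothing is dropped for the row that has no letters.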

Or using separate from tidyr

library(tidyr)
library(dplyr)
separate(df1, age.gen, into = c("age", "gen"), "(?<=\\d)(?=[A-Za-z])", convert= TRUE) %>% 
       separate(surv.camp, into = c("surv", "camp"), "(?<=\\d)(?=[A-Za-z])", convert = TRUE)
#  age gen surv camp
#1  45   M    1   LC
#2   9   F    0 <NA>
#3  12   M    1   AC
#4  67   M    1   LC
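
Note that the two outputs differ in how the missing camp value appears: the read.table result holds an empty string, while separate fills in NA. If a single convention is wanted, a small base R sketch like the following could convert either way. The vector names are illustrative and not part of the answer.

```r
camp_blank <- c("LC", "", "AC", "LC")  # as in the read.table output
camp_na    <- c("LC", NA, "AC", "LC")  # as in the separate output

# Empty strings to NA
to_na <- replace(camp_blank, camp_blank == "", NA)

# NA to empty strings
to_blank <- replace(camp_na, is.na(camp_na), "")
```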

Or as @Frank mentioned, we can use tstrsplit from data.table

library(data.table)
setDT(df1)[, unlist(lapply(.SD, function(x) 
    tstrsplit(x, "(?<=[0-9])(?=[a-zA-Z])", perl=TRUE, 
                        type.convert=TRUE)), recursive = FALSE)]

EDIT: Added convert = TRUE in separate so that the column types are converted after the split.

data

df1 <- structure(list(age.gen = c("45M", "9F", "12M", "67M"), surv.camp = c("1LC", 
 "0", "1AC", "1LC")), .Names = c("age.gen", "surv.camp"), 
class = "data.frame", row.names = c(NA, -4L))
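
For comparison, the same split can be sketched in base R without read.table or any package, using two sub calls per column. This is an alternative illustration rather than one of the approaches above; the helper name split_num_alpha is hypothetical, and it assumes every value starts with digits.

```r
df1 <- data.frame(age.gen = c("45M", "9F", "12M", "67M"),
                  surv.camp = c("1LC", "0", "1AC", "1LC"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: split "45M" into the leading digits and the rest
split_num_alpha <- function(x) {
  data.frame(
    num   = as.integer(sub("^(\\d+).*$", "\\1", x)),  # keep only the leading digits
    alpha = sub("^\\d+", "", x),                      # drop the leading digits
    stringsAsFactors = FALSE
  )
}

res2 <- cbind(split_num_alpha(df1$age.gen), split_num_alpha(df1$surv.camp))
colnames(res2) <- c("age", "gen", "surv", "camp")
res2
```

Here a row with no letters keeps an empty string in the second column, so no characters are lost and no row is dropped.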
