I have two columns in a much larger dataframe that I am having difficulty splitting. I have used strsplit in the past when splitting on a space, a comma, or some other delimiter. The hard part here is that I don't want to lose any information, and splitting some values will leave missing entries. I would like to end up with four columns. Here's a sample of a couple of rows of what I have now.
age-gen surv-camp
45M 1LC
9F 0
12M 1AC
67M 1LC
Here is what I would like to ultimately get.
age gen surv camp
45 M 1 LC
9 F 0
12 M 1 AC
67 M 1 LC
I've done quite a lot of hunting around on here and have found a number of responses in Java, C++, html etc., but I haven't found anything that explains how to do this in R and when you have missing data.
I saw this about adding a space between values and then just splitting on the space, but I don't see how this would work 1) with missing data, 2) when I don't have consistent numeric or character values in each row.
1 Solution
#1
We loop through the columns of 'df1' (lapply(df1, ...)), create a delimiter after the numeric substring using sub, read the vector as a data.frame with read.table, cbind the list of data.frames, and change the column names of the output.
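# For each column, insert a comma after the leading digits, then parse it as CSV
res <- do.call(cbind, lapply(df1, function(x)
    read.table(text = sub("(\\d+)", "\\1,", x),
               header = FALSE, sep = ",", stringsAsFactors = FALSE)))
# Split the original names ("age.gen", "surv.camp") on "." to get the new names
colnames(res) <- scan(text = names(df1), sep = ".", what = "", quiet = TRUE)
res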
# age gen surv camp
#1 45 M 1 LC
#2 9 F 0
#3 12 M 1 AC
#4 67 M 1 LC
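To see why the row with no 'camp' value still parses, it can help to run the two inner steps on a single column. This is only a minimal sketch using the surv.camp values from the sample data; the names x, with_delim, and parsed are made up for illustration.

```r
# surv.camp values from the sample data
x <- c("1LC", "0", "1AC", "1LC")

# Step 1: put a comma after the first run of digits in each string
with_delim <- sub("(\\d+)", "\\1,", x)
with_delim
# "1,LC" "0," "1,AC" "1,LC"

# Step 2: parse the delimited strings as a two-column table;
# the "0," row simply has an empty second field
parsed <- read.table(text = with_delim, header = FALSE, sep = ",",
                     stringsAsFactors = FALSE)
str(parsed)
```

Because sub replaces only the first match, a value with more than one run of digits would still get a single delimiter.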
Or using separate from tidyr:
library(tidyr)
library(dplyr)
separate(df1, age.gen, into = c("age", "gen"),
         sep = "(?<=\\d)(?=[A-Za-z])", convert = TRUE) %>%
    separate(surv.camp, into = c("surv", "camp"),
             sep = "(?<=\\d)(?=[A-Za-z])", convert = TRUE)
# age gen surv camp
#1 45 M 1 LC
#2 9 F 0 <NA>
#3 12 M 1 AC
#4 67 M 1 LC
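The pattern (?<=\\d)(?=[A-Za-z]) matches the empty position that has a digit just before it and a letter just after it, so the split consumes no characters. A rough base R illustration of the same idea, assuming strsplit with perl = TRUE (the helper name split_num_alpha is hypothetical):

```r
# Zero-width pattern: "after a digit, before a letter"
pat <- "(?<=\\d)(?=[A-Za-z])"

# Hypothetical helper: split each string and pad missing pieces with NA
split_num_alpha <- function(x) {
  parts <- strsplit(x, pat, perl = TRUE)
  data.frame(
    num   = as.integer(vapply(parts, function(p) p[1], character(1))),
    alpha = vapply(parts, function(p) p[2], character(1)),
    stringsAsFactors = FALSE
  )
}

out <- split_num_alpha(c("1LC", "0", "1AC", "1LC"))
out$num
# 1 0 1 1
out$alpha
# "LC" NA "AC" "LC"
```

A string such as "0" contains no digit-to-letter boundary, so it comes back as a single piece and the second piece is filled with NA rather than dropped.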
Or as @Frank mentioned, we can use tstrsplit from data.table:
library(data.table)
setDT(df1)[, unlist(lapply(.SD, function(x)
tstrsplit(x, "(?<=[0-9])(?=[a-zA-Z])", perl=TRUE,
type.convert=TRUE)), recursive = FALSE)]
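The call above does not name the resulting columns age, gen, surv, and camp directly. One possible variation, offered as a sketch rather than as the answer's own method, assigns the pieces by reference with := and then drops the source columns. It assumes data.table is installed; dt and pat are illustrative names.

```r
library(data.table)

# Same sample data as df1, built as a data.table
dt <- data.table(age.gen   = c("45M", "9F", "12M", "67M"),
                 surv.camp = c("1LC", "0", "1AC", "1LC"))

pat <- "(?<=[0-9])(?=[a-zA-Z])"

# Split each column and assign the pieces to explicitly named columns;
# strings with no match ("0") get NA in the second piece
dt[, c("age", "gen")   := tstrsplit(age.gen,   pat, perl = TRUE, type.convert = TRUE)]
dt[, c("surv", "camp") := tstrsplit(surv.camp, pat, perl = TRUE, type.convert = TRUE)]

# Drop the original combined columns
dt[, c("age.gen", "surv.camp") := NULL]
dt
```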
EDIT: Added convert = TRUE in separate to change the type of the columns after the split.
data
df1 <- structure(list(age.gen = c("45M", "9F", "12M", "67M"), surv.camp = c("1LC",
"0", "1AC", "1LC")), .Names = c("age.gen", "surv.camp"),
class = "data.frame", row.names = c(NA, -4L))