Split a dataframe string column into multiple separate columns

Time: 2021-11-14 22:57:57

What I am trying to accomplish is splitting a column into multiple columns. I would prefer the first column to contain "F", the second "US", the third "CA6" or "DL", and the fourth "Z13" or "U13", and so on. My entire df follows the same pattern of X.XX.XXXX.XXX or X.XX.XXX.XXX or X.XX.XX.XXX, and I know the third column is where my problem lies because its length varies. I have only used substr in the past, and I could use that here with some if statements, but I would like to learn how to use the stringr package and POSIX to do this (unless there is a better option). Thank you in advance.

Here is my df:

c("F.US.CLE.V13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", 
"F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", 
"F.US.DL.U13", "F.US.DL.U13", "F.US.DL.U13", "F.US.DL.Z13", "F.US.DL.Z13"
)

3 Solutions

#1


44  

A very direct way is to just use read.table on your character vector:

> read.table(text = text, sep = ".", colClasses = "character")
   V1 V2  V3  V4
1   F US CLE V13
2   F US CA6 U13
3   F US CA6 U13
4   F US CA6 U13
5   F US CA6 U13
6   F US CA6 U13
7   F US CA6 U13
8   F US CA6 U13
9   F US  DL U13
10  F US  DL U13
11  F US  DL U13
12  F US  DL Z13
13  F US  DL Z13

colClasses needs to be specified, otherwise F gets converted to FALSE (which is something I need to fix in "splitstackshape", otherwise I would have recommended that :) )
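
To see why colClasses matters, here is a small sketch that compares the two calls on a single value from the question. The variable names `x`, `guessed`, and `kept` are made up for illustration.

```r
# One value from the question's data
x <- "F.US.CLE.V13"

# Without colClasses, read.table guesses column types, and a column
# that holds only "F" is read as a logical value
guessed <- read.table(text = x, sep = ".")
class(guessed$V1)

# With colClasses = "character", every field stays a string
kept <- read.table(text = x, sep = ".", colClasses = "character")
kept$V1
```

The same guessing would apply to a column holding only "T", so forcing character columns is the safer default for codes like these.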


Update (> a year later)...

Alternatively, you can use my cSplit function, like this:

library(data.table)       # provides as.data.table
library(splitstackshape)  # provides cSplit

cSplit(as.data.table(text), "text", ".")
#     text_1 text_2 text_3 text_4
#  1:      F     US    CLE    V13
#  2:      F     US    CA6    U13
#  3:      F     US    CA6    U13
#  4:      F     US    CA6    U13
#  5:      F     US    CA6    U13
#  6:      F     US    CA6    U13
#  7:      F     US    CA6    U13
#  8:      F     US    CA6    U13
#  9:      F     US     DL    U13
# 10:      F     US     DL    U13
# 11:      F     US     DL    U13
# 12:      F     US     DL    Z13
# 13:      F     US     DL    Z13

Or, separate from "tidyr", like this:

library(dplyr)
library(tidyr)

as.data.frame(text) %>% separate(text, into = paste("V", 1:4, sep = "_"))
#    V_1 V_2 V_3 V_4
# 1    F  US CLE V13
# 2    F  US CA6 U13
# 3    F  US CA6 U13
# 4    F  US CA6 U13
# 5    F  US CA6 U13
# 6    F  US CA6 U13
# 7    F  US CA6 U13
# 8    F  US CA6 U13
# 9    F  US  DL U13
# 10   F  US  DL U13
# 11   F  US  DL U13
# 12   F  US  DL Z13
# 13   F  US  DL Z13
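
Since the question asks about stringr specifically, here is a minimal sketch using `str_split_fixed`. It assumes every string has exactly four dot-separated pieces; the shortened `text` vector and the column names `V1` to `V4` are only for illustration.

```r
library(stringr)

text <- c("F.US.CLE.V13", "F.US.CA6.U13", "F.US.DL.Z13")

# fixed(".") treats the dot literally instead of as a regex wildcard;
# n = 4 asks for exactly four pieces per string
parts <- str_split_fixed(text, fixed("."), n = 4)

# Name the columns and turn the character matrix into a data frame
colnames(parts) <- paste0("V", 1:4)
df <- as.data.frame(parts, stringsAsFactors = FALSE)
df$V3
```

Because the split is on the separator rather than on character positions, the varying length of the third piece does not matter.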

#2


15  

Is this what you are trying to do?

# Our data
text <- c("F.US.CLE.V13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", 
"F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", "F.US.CA6.U13", 
"F.US.DL.U13", "F.US.DL.U13", "F.US.DL.U13", "F.US.DL.Z13", "F.US.DL.Z13"
)

#  Split into individual elements by the '.' character
#  Remember to escape it, because '.' by itself matches any single character
elems <- unlist( strsplit( text , "\\." ) )

#  We know the dataframe should have 4 columns, so make a matrix
m <- matrix( elems , ncol = 4 , byrow = TRUE )

#  Coerce to data.frame - head() is just to illustrate the top portion
head( as.data.frame( m ) )
#  V1 V2  V3  V4
#1  F US CLE V13
#2  F US CA6 U13
#3  F US CA6 U13
#4  F US CA6 U13
#5  F US CA6 U13
#6  F US CA6 U13
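
The comment about escaping '.' can be checked directly. The sketch below compares an unescaped pattern, an escaped one, and `fixed = TRUE`, which is another way to split on a literal dot; the variable names are illustrative.

```r
x <- "F.US.DL.Z13"

# An unescaped "." is a regex that matches every character, so nothing
# is left between the matches except empty strings
unescaped <- strsplit(x, ".")[[1]]

# Escaping the dot, or passing fixed = TRUE, splits on literal dots
escaped <- strsplit(x, "\\.")[[1]]
literal <- strsplit(x, ".", fixed = TRUE)[[1]]

escaped
```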

#3


6  

The way via unlist and matrix seems a bit convoluted, and it requires you to hard-code the number of elements, which is actually a pretty big no-go. (Of course, you could avoid hard-coding that number and determine it at run-time.)
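
For reference, determining that number at run-time could look roughly like this. It is a sketch that assumes every string splits into the same number of pieces and stops otherwise; `parts` and `n_pieces` are illustrative names.

```r
text <- c("F.US.CLE.V13", "F.US.CA6.U13", "F.US.DL.Z13")
parts <- strsplit(text, ".", fixed = TRUE)

# Number of pieces in each string; the matrix approach only makes
# sense when every string splits into the same number of pieces
n_pieces <- unique(lengths(parts))
stopifnot(length(n_pieces) == 1)

# Fill the matrix row by row, one row per original string
m <- matrix(unlist(parts), ncol = n_pieces, byrow = TRUE)
dim(m)
```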

I would go a different route, and construct a data frame directly from the list that strsplit returns. For me, this is conceptually simpler. There are essentially two ways of doing this:

  1. as.data.frame – but since the list is exactly the wrong way round (we have a list of rows rather than a list of columns) we have to transpose the result. We also clear the rownames since they are ugly by default (but that’s strictly unnecessary!):

    `rownames<-`(t(as.data.frame(strsplit(text, '\\.'))), NULL)
    
  2. Alternatively, use rbind to bind the list of rows together (the result is a character matrix, as the output below shows). We use do.call to call rbind with all the rows as separate arguments:

    do.call(rbind, strsplit(text, '\\.'))
    

Both ways yield the same result:

     [,1] [,2] [,3]  [,4]
[1,] "F"  "US" "CLE" "V13"
[2,] "F"  "US" "CA6" "U13"
[3,] "F"  "US" "CA6" "U13"
[4,] "F"  "US" "CA6" "U13"
[5,] "F"  "US" "CA6" "U13"
[6,] "F"  "US" "CA6" "U13"
…

Clearly, the second way is much simpler than the first.
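
If an actual data frame is wanted rather than a matrix, one more step is needed. This is a sketch only: the shortened `text` vector is for illustration, and the column names are hypothetical, since the question does not say what the four pieces mean.

```r
text <- c("F.US.CLE.V13", "F.US.CA6.U13", "F.US.DL.Z13")

# do.call(rbind, ...) gives a character matrix, one row per string
m <- do.call(rbind, strsplit(text, ".", fixed = TRUE))

# Convert to a data frame; stringsAsFactors = FALSE keeps the columns
# as character on R versions before 4.0 as well
df <- as.data.frame(m, stringsAsFactors = FALSE)
names(df) <- c("type", "country", "site", "code")  # hypothetical names
str(df)
```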
