Split a string at the first space

Time: 2022-08-22 12:54:49

I'd like to split a vector of character strings (people's names) into two columns (vectors). The problem is that some people have a two-word last name. I'd like to split the first and last names into two columns. I can split out and take the first names using the code below, but the last name eludes me. (Look at obs 29 in the sample set below to get an idea: the Ford has a "last name" of Pantera L that must be kept together.)

What I have attempted so far:

x <- rownames(mtcars)
unlist(strsplit(x, " .*"))  # keeps only the text before the first space

What I'd like it to look like:

            MANUF       MAKE
27          Porsche     914-2
28          Lotus       Europa
29          Ford        Pantera L
30          Ferrari     Dino
31          Maserati    Bora
32          Volvo       142E

6 solutions

#1


25  

The regular expression rexp matches the word at the start of the string, an optional space, then the rest of the string. The parentheses delimit subexpressions, which are accessed as backreferences \\1 and \\2.

rexp <- "^(\\w+)\\s?(.*)$"
y <- data.frame(MANUF=sub(rexp,"\\1",x), MAKE=sub(rexp,"\\2",x))
tail(y)
#       MANUF      MAKE
# 27  Porsche     914-2
# 28    Lotus    Europa
# 29     Ford Pantera L
# 30  Ferrari      Dino
# 31 Maserati      Bora
# 32    Volvo      142E
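
To see how this pattern treats names with no space or with several spaces, here is a minimal sketch. The helper name split_first_word is hypothetical (it is not part of the answer), and it only wraps the two sub() calls shown above.

```r
# Hypothetical helper: wraps the two sub() calls from the answer above.
split_first_word <- function(x) {
  rexp <- "^(\\w+)\\s?(.*)$"
  data.frame(
    MANUF = sub(rexp, "\\1", x),  # first run of word characters
    MAKE  = sub(rexp, "\\2", x),  # everything after the optional space
    stringsAsFactors = FALSE
  )
}

# A three-word name, a name with no space, and an ordinary two-word name.
y <- split_first_word(c("Ford Pantera L", "Valiant", "Porsche 914-2"))
y$MAKE  # "Pantera L" "" "914-2"
```

A name without a space ends up with an empty MAKE. Because \\w does not match a hyphen, a first word such as "Rolls-Royce" would be cut at the hyphen; using \\S+ for the first group is one possible way around that.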

#2


17  

For me, Hadley's colsplit function in the reshape2 package is the most intuitive for this purpose. Joshua's way is more general (i.e., it can be used wherever a regex could be used) and more flexible (if you want to change the specification), but the colsplit function is perfectly suited to this specific setting:

library(reshape2)
y <- colsplit(x," ",c("MANUF","MAKE"))
tail(y)
#      MANUF      MAKE
#27  Porsche     914-2
#28    Lotus    Europa
#29     Ford Pantera L
#30  Ferrari      Dino
#31 Maserati      Bora
#32    Volvo      142E

#3


11  

Here are two approaches:

1) strsplit. This approach uses only functions in the core of R and no complex regular expressions. Replace the first space with a semicolon (using sub and not gsub), strsplit on the semicolon and then rbind it into a 2 column matrix:

mat <- do.call("rbind", strsplit(sub(" ", ";", x), ";"))
colnames(mat) <- c("MANUF", "MAKE")

2) strapply in the gsubfn package. Here is a one-liner using strapply in the gsubfn package. The two parenthesized portions of the regular expression capture the desired first and second columns respectively, and the function (specified in formula notation; it is the same as specifying function(x, y) c(MANUF = x, MAKE = y)) grabs them and adds names. The simplify = rbind argument is used to turn the result into a matrix, as in the prior solution.

library(gsubfn)
mat <- strapply(x, "(\\S+)\\s+(.*)", ~ c(MANUF = x, MAKE = y), simplify = rbind)

Note: In either case a "character" matrix, mat, is returned. If a data frame of "character" columns is desired then add this:

DF <- as.data.frame(mat, stringsAsFactors = FALSE)

If "factor" columns are wanted instead, pass stringsAsFactors = TRUE. (Before R 4.0.0 that was the default, so omitting the argument had the same effect.)
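
One caveat with approach 1): a name that contains no space (for example "Valiant", row 6 of mtcars) yields a single piece, which rbind recycles, so MAKE repeats MANUF; a name that already contains a semicolon would also be split in the wrong place. The following is a minimal base R sketch of one way to avoid both, assuming NA is an acceptable MAKE for names without a space. It is an alternative, not the answer's method.

```r
x <- rownames(mtcars)

# Position of the first space in each name; -1 means no space was found.
# as.integer() drops the match attributes that regexpr() attaches.
pos <- as.integer(regexpr(" ", x, fixed = TRUE))
has_space <- pos > 0

# Text before the first space (or the whole name if there is none).
manuf <- ifelse(has_space, substr(x, 1, pos - 1), x)
# Text after the first space (NA if there is none; an assumption of this sketch).
make <- ifelse(has_space, substr(x, pos + 1, nchar(x)), NA_character_)

mat <- cbind(MANUF = manuf, MAKE = make)
mat[c(6, 29), ]
```

No temporary separator character is needed, so the contents of the names cannot interfere with the split.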

#4


7  

Yet another way of doing it:

str_split from stringr will handle the split, but returns it in a different form (a list, like strsplit does). Manipulating into the correct form is straightforward though.

library(stringr)
split_x <- str_split(x, " ", 2)
(y <- data.frame(
  MANUF = sapply(split_x, head, n = 1),
  MAKE  = sapply(split_x, tail, n = 1)
))

Or, as Hadley mentioned in the comments, with str_split_fixed.

y <- as.data.frame(str_split_fixed(x, " ", 2))
colnames(y) <- c("MANUF", "MAKE")
y
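
For readers without stringr, a similar two-piece split can be sketched in base R with regmatches and invert = TRUE. This is an illustrative analogue, not part of the answer; padding with NA for names that have no space is an assumption of the sketch.

```r
x <- rownames(mtcars)

# regexpr() finds only the first space; invert = TRUE returns the text
# on either side of that match rather than the match itself.
parts <- regmatches(x, regexpr(" ", x, fixed = TRUE), invert = TRUE)

# Names without a space come back as a single piece; pad them with NA
# so that every row has exactly two cells.
mat <- t(vapply(parts, function(p) c(p, NA_character_)[1:2], character(2)))
colnames(mat) <- c("MANUF", "MAKE")

y <- as.data.frame(mat, stringsAsFactors = FALSE)
tail(y)
```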

#5


0  

If you can do pattern and group matching, I'd try something like this (untested):

\s+(.*)\s+(.*)
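
Since the pattern is marked untested, here is a quick check in R. The helper name groups is hypothetical, and the anchored variant at the end is only one possible adjustment, not the answerer's code.

```r
# Pattern from the answer above, with backslashes doubled for an R string.
pat <- "\\s+(.*)\\s+(.*)"

# Hypothetical helper: full match followed by the capture groups,
# or character(0) when the pattern does not match.
groups <- function(pattern, s) regmatches(s, regexec(pattern, s))[[1]]

# The match can only start at the first space, so "Ford" is not captured.
two_spaces <- groups(pat, "Ford Pantera L")

# The pattern needs two separate runs of whitespace, so this does not match.
one_space <- groups(pat, "Porsche 914-2")

# Anchored variant: first word in group 1, the remainder in group 2.
anchored <- groups("^(\\S+)\\s+(.*)$", "Ford Pantera L")
```

As written, the pattern drops the manufacturer and fails on names with a single space, so it does not produce the requested two columns without changes.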

#6


0  

I think searching for [^\s]+ would work. Untested.
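
A short sketch of how this idea could be tried in R. It passes perl = TRUE so that \s inside the brackets is read as a whitespace class; that choice, and the second step that recovers the remainder, are assumptions added here.

```r
x <- c("Ford Pantera L", "Porsche 914-2")

# First run of non-whitespace characters in each string.
first <- regmatches(x, regexpr("[^\\s]+", x, perl = TRUE))

# The pattern alone yields only the first word; strip it (and the
# whitespace after it) to recover the rest of the name.
rest <- sub("^[^\\s]+\\s*", "", x, perl = TRUE)

data.frame(MANUF = first, MAKE = rest, stringsAsFactors = FALSE)
```

Note that regmatches with regexpr silently skips strings that have no match, so the two vectors stay aligned only if every string contains at least one non-whitespace character.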
