Split a character string on more than one word

Date: 2021-10-27 02:29:52

I have the following character string:

endvotes <- "Yes106No85EH2NT6ES0P1"

I'd like to get a data.frame looking like this:

    Yes    No   EH   NT   ES  P
    106    85   2    6    0   1

I know how to split each one of those, for example like this:

library(stringr)

# Take everything before "No", then drop the "Yes" prefix
yes <- unlist(str_split(endvotes, "No"))[1]
yes <- as.integer(unlist(str_split(yes, "Yes"))[2])

yes
[1] 106

One possibility would be to split by position, but the numbers do not always have the same width (one, two, or three digits), so I'd like to split on the answer labels (Yes, No, etc.) instead. Of course I could do this for every answer as above, but I'm sure there is a more elegant way. Can anyone tell me how to do this nicely? Thanks.

4 Answers

#1


3  

endvotes <- "Yes106No85EH2NT6ES0P1"

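# Splitting on runs of digits leaves the labels
names <- strsplit(endvotes, "[[:digit:]]+")[[1]]
# Splitting on runs of letters leaves the numbers; [-1] drops the leading ""
numbers <- strsplit(endvotes, "[[:alpha:]]+")[[1]][-1]

# One-row data frame: numbers as values, labels as column names
setNames(as.data.frame(t(as.numeric(numbers))), names)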
#  Yes No EH NT ES P
#1 106 85  2  6  0 1

#2


3  

There is no need to use a regex at all. Try this function from the stringi package, which splits a character vector by character class (such as digits, letters, or punctuation):

library(stringi)
# Split on digits: the labels remain
stri_split_charclass(str = endvotes, "\\p{N}", omit_empty = TRUE)[[1]]
## [1] "Yes" "No"  "EH"  "NT"  "ES"  "P"  
# Split on letters: the numbers remain
stri_split_charclass(str = endvotes, "\\p{L}", omit_empty = TRUE)[[1]]
## [1] "106" "85"  "2"   "6"   "0"   "1"  

str is the input character vector; \p{N} and \p{L} are the character classes to split on (N means numbers, L means letters). omit_empty = TRUE removes the empty strings ("") from the result.

#3


2  

You can use a regex like this one; each match has the label in the first capturing group and the value in the second:

([a-zA-Z]+)([0-9]+)

This matches a run of letters followed by a run of digits. The parentheses are capturing groups, which let you retrieve the label and the value separately.
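
In R, the two capturing groups can be pulled out with regexec() and regmatches(). The following is only a minimal base R sketch, assuming the endvotes string from the question; the variable names pairs, groups, labels, counts and votes are illustrative and not part of the answer.

```r
endvotes <- "Yes106No85EH2NT6ES0P1"
pattern <- "([a-zA-Z]+)([0-9]+)"

# All label/number pairs found in the string, e.g. "Yes106", "No85", ...
pairs <- regmatches(endvotes, gregexpr(pattern, endvotes))[[1]]

# For each pair, regexec() gives the full match followed by each capturing group
groups <- regmatches(pairs, regexec(pattern, pairs))

labels <- vapply(groups, `[`, character(1), 2)              # first capturing group
counts <- as.integer(vapply(groups, `[`, character(1), 3))  # second capturing group

# One-row data frame with the labels as column names
votes <- setNames(as.data.frame(t(counts)), labels)
votes
```

Element 1 of each entry in groups is the whole match, which is why the groups sit at positions 2 and 3.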

#4


2  

You can also try this regex:

# Split at every boundary between a letter and a digit (zero-width lookarounds)
strsplit(endvotes, split = "(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])", perl = TRUE)
## [[1]]
##  [1] "Yes" "106" "No"  "85"  "EH"  "2"   "NT"  "6"   "ES"  "0"   "P"   "1"  
##

To get the desired format:

S <- strsplit(endvotes, split = "(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])", perl = TRUE)[[1]]
# Even positions hold the numbers, odd positions hold the labels
res <- data.frame(t(S[seq_along(S) %% 2 == 0]))
names(res) <- S[seq_along(S) %% 2 == 1]
res
##   Yes No EH NT ES P
## 1 106 85  2  6  0 1  

Or, with regmatches:

res <- data.frame(t(regmatches(endvotes, gregexpr("[0-9]+", endvotes))[[1]]))
names(res) <- regmatches(endvotes, gregexpr("[A-Za-z]+", endvotes))[[1]]
res
##   Yes No EH NT ES P
## 1 106 85  2  6  0 1
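
Note that both versions above build the data frame from a character vector, so the columns hold strings (or factors in R versions before 4.0.0) rather than numbers. If numeric columns are wanted, one option is to convert the counts before building the frame. This is only a sketch in base R; the names labels, counts and votes are illustrative and not from the answer.

```r
endvotes <- "Yes106No85EH2NT6ES0P1"

# Extract the runs of letters and the runs of digits separately
labels <- regmatches(endvotes, gregexpr("[A-Za-z]+", endvotes))[[1]]
counts <- as.integer(regmatches(endvotes, gregexpr("[0-9]+", endvotes))[[1]])

# t() turns the integer vector into a 1-row matrix, so each count becomes a column
votes <- setNames(as.data.frame(t(counts)), labels)

# Check the column types
sapply(votes, class)
```

Converting before the data frame is built avoids the factor pitfall, where as.integer() on a factor column would return level codes instead of the counts.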
