I would like to split a column of strings on the first two colons, but not on any subsequent colons:
my.data <- read.table(text='
my.string some.data
123:34:56:78 -100
87:65:43:21 -200
a4:b6:c8888 -300
11:bbbb:ccccc -400
uu:vv:ww:xx -500', header = TRUE)
desired.result <- read.table(text='
my.string1 my.string2 my.string3 some.data
123 34 56:78 -100
87 65 43:21 -200
a4 b6 c8888 -300
11 bbbb ccccc -400
uu vv ww:xx -500', header = TRUE)
I have searched extensively and the following question is the closest to my current dilemma:
Split on first comma in string
Thank you for any suggestions. I prefer to use base R.
EDIT:
The number of characters before the first colon is not always two and the number of characters between the first two colons is not always two. So, I edited the example to reflect this.
5 solutions
#1
3
In base R:
> my.data <- read.table(text='
+
+ my.string some.data
+ 123:34:56:78 -100
+ 87:65:43:21 -200
+ a4:b6:c8888 -300
+ 11:bbbb:ccccc -400
+ uu:vv:ww:xx -500', header = TRUE,stringsAsFactors=FALSE)
> m <- regexec ("^([^:]+):([^:]+):(.*)$",my.data$my.string)
> my.data$my.string1 <- unlist(lapply(regmatches(my.data$my.string,m),'[',c(2)))
> my.data$my.string2 <- unlist(lapply(regmatches(my.data$my.string,m),'[',c(3)))
> my.data$my.string3 <- unlist(lapply(regmatches(my.data$my.string,m),'[',c(4)))
> my.data
my.string some.data my.string1 my.string2 my.string3
1 123:34:56:78 -100 123 34 56:78
2 87:65:43:21 -200 87 65 43:21
3 a4:b6:c8888 -300 a4 b6 c8888
4 11:bbbb:ccccc -400 11 bbbb ccccc
5 uu:vv:ww:xx -500 uu vv ww:xx
You'll see I've used stringsAsFactors=FALSE to ensure that my.string can be processed as a vector of strings.
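The three near-identical regmatches() calls above can be collapsed into one. The following is only a sketch of that idea, assuming every string matches the pattern (a non-matching string yields an empty vector and would break the rbind step); the names x and parts are illustrative and not part of the original answer.

```r
# Same sample strings as in the question.
x <- c("123:34:56:78", "87:65:43:21", "a4:b6:c8888",
       "11:bbbb:ccccc", "uu:vv:ww:xx")

# regexec() records the position of the full match and of each capture group.
m <- regexec("^([^:]+):([^:]+):(.*)$", x)

# regmatches() returns one character vector of length 4 per string:
# the full match followed by the three groups. Bind them into a matrix
# and drop the first column (the full match).
parts <- do.call(rbind, regmatches(x, m))[, -1, drop = FALSE]
colnames(parts) <- paste0("my.string", 1:3)
parts
```

The resulting matrix can then be attached to the data frame with cbind(), much as the original answer does column by column.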
#2
3
Using package stringr:
library(stringr)
str_match(my.data$my.string, "(.+?):(.+?):(.*)")
[,1] [,2] [,3] [,4]
[1,] "123:34:56:78" "123" "34" "56:78"
[2,] "87:65:43:21" "87" "65" "43:21"
[3,] "a4:b6:c8888" "a4" "b6" "c8888"
[4,] "11:bbbb:ccccc" "11" "bbbb" "ccccc"
[5,] "uu:vv:ww:xx" "uu" "vv" "ww:xx"
UPDATE: with latest example (above) and Hadley's comment solution:
str_split_fixed(my.data$my.string, ":", 3)
[,1] [,2] [,3]
[1,] "123" "34" "56:78"
[2,] "87" "65" "43:21"
[3,] "a4" "b6" "c8888"
[4,] "11" "bbbb" "ccccc"
[5,] "uu" "vv" "ww:xx"
#3
1
Replace first two ":" with ",", and then split on ",".
# sub() replaces only the first match, so colons after the second one are left alone.
x <- sub("([[:alnum:]]*):([[:alnum:]]*):(.)","\\1,\\2,\\3","12:34:56:78")
strsplit(x,",")
Applying this to the data frame:
a.list <- sapply(my.data$my.string, function(x) strsplit(sub("([[:alnum:]]*):([[:alnum:]]*):(.)","\\1,\\2,\\3",x),","))
a.vect <- unlist(a.list)
a.df <- as.data.frame(matrix(a.vect,ncol=3,byrow=TRUE), stringsAsFactors = FALSE)
names(a.df) <- c("my.string1", "my.string2", "my.string3")
a.df$some.data <- my.data$some.data
a.df
#4
1
I'm a bit late to the game, and my solution overlaps a lot with the earlier answers. Nevertheless, it might be useful to someone:
# Replace first two colons with commas.
new.string = gsub(pattern="(^[^:]+):([^:]+):(.+$)",
replacement="\\1,\\2,\\3",
x=my.data$my.string)
# Split on commas, producing a list.
split.data = strsplit(new.string, ",")
# Change list into matrix, then data.frame.
new.data = data.frame(do.call(rbind, split.data))
names(new.data) = paste("my.string", seq(ncol(new.data)), sep="")
my.data$my.string = NULL
my.data = cbind(new.data, my.data)
my.data
# my.string1 my.string2 my.string3 some.data
# 1 123 34 56:78 -100
# 2 87 65 43:21 -200
# 3 a4 b6 c8888 -300
# 4 11 bbbb ccccc -400
# 5 uu vv ww:xx -500
As noted by @topchef, commas (or whatever other character is used as the temporary separator) must be guaranteed to be absent from the data.
Also, at least two colons must be present in each string; otherwise the pattern doesn't match anything, the string is returned unchanged, and no splitting occurs.
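The second caveat is easy to check before splitting. Below is a small sketch using made-up strings (not from the question) and the same pattern as above, with sub() in place of gsub(); the two behave the same here because the pattern is anchored at both ends.

```r
# Illustrative input: only the first string has at least two colons.
x <- c("123:34:56:78", "only:one", "none")
pattern <- "^([^:]+):([^:]+):(.+)$"

# grepl() flags the strings the pattern can actually split.
ok <- grepl(pattern, x)

# sub() returns non-matching strings unchanged, so they keep their colons
# and would produce fewer than three pieces after strsplit().
replaced <- sub(pattern, "\\1,\\2,\\3", x)

ok
replaced
```

Rows where ok is FALSE would need separate handling (for example, padding with NA) before the do.call(rbind, ...) step, since rbind recycles shorter vectors instead of failing.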
#5
0
Couldn't you just use strsplit(sub(":\\s*", XX, x), XX), where XX is some placeholder string (like the example in the other question you linked), to split on the first colon, then take the second half and split that on its first colon again?
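One way to read this suggestion in base R is sketched below. Note that it swaps the placeholder trick for regexpr() and substr(), which avoids having to pick a marker string that never occurs in the data. The helper split_first() is hypothetical, written only for this illustration.

```r
# Hypothetical helper: split each string at its first colon only.
split_first <- function(x) {
  # as.integer() drops the match attributes; the value is -1 when no colon exists.
  pos <- as.integer(regexpr(":", x, fixed = TRUE))
  before <- ifelse(pos > 0, substr(x, 1, pos - 1), x)
  after  <- ifelse(pos > 0, substr(x, pos + 1, nchar(x)), NA_character_)
  list(before = before, after = after)
}

x <- c("123:34:56:78", "a4:b6:c8888", "uu:vv:ww:xx")

first  <- split_first(x)            # split on the first colon
second <- split_first(first$after)  # split the remainder on its first colon

result <- data.frame(my.string1 = first$before,
                     my.string2 = second$before,
                     my.string3 = second$after,
                     stringsAsFactors = FALSE)
result
```

Any colons after the second one stay inside my.string3, which is what the question asks for.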