This question already has an answer here:
- Chopping a string into a vector of fixed width character elements 10 answers
I have a string such as:
"aabbccccdd"
I want to break this string into a vector of substrings of length 2:
"aa" "bb" "cc" "cc" "dd"
5 answers
#1
39
Here is one way:
substring("aabbccccdd", seq(1, 9, 2), seq(2, 10, 2))
#[1] "aa" "bb" "cc" "cc" "dd"
Or, more generally:
text <- "aabbccccdd"
substring(text, seq(1, nchar(text)-1, 2), seq(2, nchar(text), 2))
#[1] "aa" "bb" "cc" "cc" "dd"
Edit: This is much, much faster:
sst <- strsplit(text, "")[[1]]
out <- paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
It first splits the string into single characters. Then it pastes each odd-positioned character (selected by c(TRUE, FALSE)) to the even-positioned character that follows it (selected by c(FALSE, TRUE)).
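To see why the two logical masks pick out alternating characters, the sketch below prints the intermediate vectors for the example string. It relies only on R recycling a short logical index along the vector; the names odd and even are introduced here for illustration and are not part of the answer.

```r
text <- "aabbccccdd"
sst <- strsplit(text, "")[[1]]   # one element per character

# A logical index shorter than the vector is recycled along it,
# so c(TRUE, FALSE) keeps positions 1, 3, 5, ... and
# c(FALSE, TRUE) keeps positions 2, 4, 6, ...
odd  <- sst[c(TRUE, FALSE)]
even <- sst[c(FALSE, TRUE)]

odd
# [1] "a" "b" "c" "c" "d"
even
# [1] "a" "b" "c" "c" "d"

# paste0 is vectorised: it joins odd[i] with even[i]
out <- paste0(odd, even)
out
# [1] "aa" "bb" "cc" "cc" "dd"
```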
Timings
text <- paste(rep(paste0(letters, letters), 1000), collapse="")
g1 <- function(text) {
  substring(text, seq(1, nchar(text)-1, 2), seq(2, nchar(text), 2))
}
g2 <- function(text) {
  sst <- strsplit(text, "")[[1]]
  paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
}
identical(g1(text), g2(text))
#[1] TRUE
library(rbenchmark)
benchmark(g1=g1(text), g2=g2(text))
#  test replications elapsed relative user.self sys.self user.child sys.child
#1   g1          100  95.451 79.87531    95.438        0          0         0
#2   g2          100   1.195  1.00000     1.196        0          0         0
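If rbenchmark is not installed, a rough comparison can be made with base R's system.time. This is only a sketch: the input is shorter than in the benchmark above so that it finishes quickly, the loop count of 20 is arbitrary, and the timings will vary by machine and R version, so no particular numbers are implied.

```r
# Shorter input than in the benchmark above (100 repeats instead of 1000)
text <- paste(rep(paste0(letters, letters), 100), collapse = "")

g1 <- function(text) {
  substring(text, seq(1, nchar(text) - 1, 2), seq(2, nchar(text), 2))
}
g2 <- function(text) {
  sst <- strsplit(text, "")[[1]]
  paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
}

# Both functions must agree before their speed is worth comparing
same <- identical(g1(text), g2(text))

# Elapsed seconds for 20 calls each
t1 <- system.time(for (i in 1:20) g1(text))[["elapsed"]]
t2 <- system.time(for (i in 1:20) g2(text))[["elapsed"]]
c(g1 = t1, g2 = t2)
```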
#2
8
string <- "aabbccccdd"
# total length of string
num.chars <- nchar(string)
# the indices where each substr will start
starts <- seq(1, num.chars, by = 2)
# chop it up
sapply(starts, function(ii) {
  substr(string, ii, ii + 1)
})
Which gives:
[1] "aa" "bb" "cc" "cc" "dd"
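The same start-index idea generalises to any chunk width. The function below is a sketch and not part of the answer above; the name chop and its argument n are made up for illustration. It uses the vectorised substring instead of sapply, and relies on substring truncating at the end of the string so that a short final piece is kept.

```r
# Hypothetical helper: split a single string x into pieces of width n
chop <- function(x, n = 2) {
  stopifnot(is.character(x), length(x) == 1, n >= 1)
  if (nchar(x) == 0) return(character(0))
  starts <- seq(1, nchar(x), by = n)
  # substring is vectorised over start and stop; a stop past the
  # end of the string is simply truncated
  substring(x, starts, starts + n - 1)
}

chop("aabbccccdd")
# [1] "aa" "bb" "cc" "cc" "dd"
chop("abc")
# [1] "ab" "c"
chop("ATGAATAAAG", 3)
# [1] "ATG" "AAT" "AAA" "G"
```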
#3
5
There are two easy possibilities:
s <- "aabbccccdd"
- gregexpr and regmatches:
regmatches(s, gregexpr(".{2}", s))[[1]] # [1] "aa" "bb" "cc" "cc" "dd"
- strsplit:
strsplit(s, "(?<=.{2})", perl = TRUE)[[1]] # [1] "aa" "bb" "cc" "cc" "dd"
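With the gregexpr pattern ".{2}", a trailing single character is not matched and is therefore dropped when the string has odd length. One possible variant, shown here as a sketch, uses the quantifier {1,2} so that a shorter final piece is still matched. The variable odd_s is introduced only for illustration.

```r
odd_s <- "aabbc"   # odd length, for illustration

# ".{2}" only matches full pairs, so the final "c" is lost
regmatches(odd_s, gregexpr(".{2}", odd_s))[[1]]
# [1] "aa" "bb"

# ".{1,2}" is greedy: it takes two characters when it can, one otherwise
pieces <- regmatches(odd_s, gregexpr(".{1,2}", odd_s))[[1]]
pieces
# [1] "aa" "bb" "c"
```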
#4
1
One can use a matrix to group the characters:
s2 <- function(x) {
  m <- matrix(strsplit(x, '')[[1]], nrow=2)
  apply(m, 2, paste, collapse='')
}
s2('aabbccddeeff')
## [1] "aa" "bb" "cc" "dd" "ee" "ff"
Unfortunately, this breaks for a string of odd length: the characters are recycled to fill the matrix, and R gives a warning:
s2('abc')
## [1] "ab" "ca"
## Warning message:
## In matrix(strsplit(x, "")[[1]], nrow = 2) :
## data length [3] is not a sub-multiple or multiple of the number of rows [2]
More unfortunately, g1 and g2 from @GSee silently return incorrect results for a string of odd length:
g1('abc')
## [1] "ab"
g2('abc')
## [1] "ab" "cb"
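The intermediate values show where each function goes wrong for 'abc'. The sketch below simply re-evaluates the pieces of g1 and g2 by hand; the names odd, even and res are introduced here for illustration.

```r
x <- "abc"

# g1: the start and stop sequences never reach the third character
seq(1, nchar(x) - 1, 2)
# [1] 1
seq(2, nchar(x), 2)
# [1] 2

# g2: the two masks select vectors of different lengths ...
sst  <- strsplit(x, "")[[1]]
odd  <- sst[c(TRUE, FALSE)]   # "a" "c"
even <- sst[c(FALSE, TRUE)]   # "b"

# ... and paste0 silently recycles the shorter one
res <- paste0(odd, even)
res
# [1] "ab" "cb"
```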
Here is a function in the spirit of s2 that takes the number of characters per group as a parameter and leaves the last entry short if necessary:
s <- function(x, n) {
  sst <- strsplit(x, '')[[1]]
  m <- matrix('', nrow=n, ncol=(length(sst)+n-1)%/%n)
  m[seq_along(sst)] <- sst
  apply(m, 2, paste, collapse='')
}
s('hello world', 2)
## [1] "he" "ll" "o " "wo" "rl" "d"
s('hello world', 3)
## [1] "hel" "lo " "wor" "ld"
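A simple sanity check for any chunking function is that pasting the pieces back together reproduces the input, and that only the last piece may be shorter than n. The sketch below repeats s from above so that it runs on its own; the helper check_chunks is hypothetical and not part of the answer.

```r
# s as defined in the answer above
s <- function(x, n) {
  sst <- strsplit(x, '')[[1]]
  m <- matrix('', nrow = n, ncol = (length(sst) + n - 1) %/% n)
  m[seq_along(sst)] <- sst
  apply(m, 2, paste, collapse = '')
}

# Hypothetical helper: TRUE if the pieces rebuild x and every piece
# except possibly the last has exactly n characters
check_chunks <- function(x, n) {
  pieces <- s(x, n)
  k <- length(pieces)
  rebuilt  <- identical(paste(pieces, collapse = ''), x)
  full     <- all(nchar(pieces[-k]) == n)
  short_ok <- nchar(pieces[k]) <= n
  rebuilt && full && short_ok
}

check_chunks('hello world', 2)
# [1] TRUE
check_chunks('hello world', 3)
# [1] TRUE
check_chunks('aabbccccdd', 2)
# [1] TRUE
```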
(s is indeed slower than g2, but faster than g1 by about a factor of 7.)
#5
1
Ugly, but it works:
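sequenceString <- "ATGAATAAAG"
J <- 3  # maximum sequence length in file
# all complete pieces of width J (assumes at least J characters)
sequenceSmallVecStart <-
  substring(sequenceString,
            seq(1, nchar(sequenceString) - J + 1, J),
            seq(J, nchar(sequenceString), J))
# whatever is left after the last complete piece
sequenceSmallVecEnd <-
  substring(sequenceString, max(seq(J, nchar(sequenceString), J)) + 1)
sequenceSmallVec <- c(sequenceSmallVecStart, sequenceSmallVecEnd)
# drop the empty remainder when the length is a multiple of J
sequenceSmallVec <- sequenceSmallVec[nzchar(sequenceSmallVec)]
cat(sequenceSmallVec, sep = "\n")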
This gives:
ATG
AAT
AAA
G