This seems like a really simple task, but I can't find a good solution in base R. I have a character string with 2N characters. How do I split this into a character vector of length N, with each element being a 2-character string?
I could use something like substr with Vectorize:
vss <- Vectorize(substr, c("start", "stop"))
ch <- paste(rep("a", 1e6), collapse="")
vss(ch, seq(1, nchar(ch), by=2), seq(2, nchar(ch), by=2))
but this is really slow for long strings (O(N^2) I believe).
1 Answer
If you want speed, Rcpp is always a good choice:
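library(Rcpp);
cppFunction('
List strsplitN(std::vector<std::string> v, int N ) {
    // Split each string in v into consecutive chunks of N bytes;
    // the last chunk of a string may be shorter than N.
    if (N < 1) throw std::invalid_argument("N must be >= 1.");
    List res(v.size());
    for (int i = 0; i < v.size(); ++i) {
        // number of chunks: ceiling(size / N)
        int num = v[i].size()/N + (v[i].size()%N == 0 ? 0 : 1);
        // preallocate all chunks for this string in one shot
        std::vector<std::string> resCur(num,std::string(N,0));
        for (int j = 0; j < num; ++j) resCur[j].assign(v[i].substr(j*N,N));
        res[i] = resCur;
    }
    return res;
}
');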
ch <- paste(rep('a',1e6),collapse='');
system.time({ res <- strsplitN(ch,2L); });
## user system elapsed
## 0.109 0.015 0.121
head(res[[1L]]); tail(res[[1L]]);
## [1] "aa" "aa" "aa" "aa" "aa" "aa"
## [1] "aa" "aa" "aa" "aa" "aa" "aa"
length(res[[1L]]);
## [1] 500000
Useful reference: http://gallery.rcpp.org/articles/strings_with_rcpp/.
More demos:
strsplitN(c('abcd','efgh'),2L);
## [[1]]
## [1] "ab" "cd"
##
## [[2]]
## [1] "ef" "gh"
##
strsplitN(c('abcd','efgh'),3L);
## [[1]]
## [1] "abc" "d"
##
## [[2]]
## [1] "efg" "h"
##
strsplitN(c('abcd','efgh'),1L);
## [[1]]
## [1] "a" "b" "c" "d"
##
## [[2]]
## [1] "e" "f" "g" "h"
##
strsplitN(c('abcd','efgh'),5L);
## [[1]]
## [1] "abcd"
##
## [[2]]
## [1] "efgh"
##
strsplitN(character(),5L);
## list()
strsplitN(c('abcd','efgh'),0L);
## Error: N must be >= 1.
There are two important caveats with the above implementation:
1: It doesn't handle NAs correctly. Rcpp seems to stringify NA to 'NA' when it's forced to come up with a std::string. You can easily solve this on the R side with a wrapper that replaces the offending list components with a true NA.
x <- c('a',NA); strsplitN(x,1L);
## [[1]]
## [1] "a"
##
## [[2]]
## [1] "N" "A"
##
x <- c('a',NA); ifelse(is.na(x),NA,strsplitN(x,1L));
## [[1]]
## [1] "a"
##
## [[2]]
## [1] NA
##
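The wrapper idea above can also be packaged as a small reusable function. This is only a sketch: split_keep_na and its splitter argument are hypothetical names introduced here, not part of Rcpp or of the code above. Unlike the ifelse version, it never passes NA to the splitter, and it returns a character NA rather than a logical one.

```r
# Hypothetical helper (name and signature are assumptions).
# `splitter` is any function(x, n) that returns a list with one character
# vector per element of x, such as strsplitN above.
split_keep_na <- function(x, n, splitter) {
  res <- vector("list", length(x))
  na <- is.na(x)
  res[na] <- list(NA_character_)                  # true NA for missing inputs
  if (any(!na)) res[!na] <- splitter(x[!na], n)   # split only non-missing strings
  res
}

# strsplit(x, "") stands in for strsplitN(x, 1L) so this runs without Rcpp.
out <- split_keep_na(c("a", NA), 1L, function(x, n) strsplit(x, ""))
str(out)
```

With the compiled function available, the call would presumably be split_keep_na(x, 1L, strsplitN).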
2: It doesn't handle multibyte characters correctly. This is a tougher problem, and would require a rewrite of the core function implementation to use a Unicode-aware traversal. Fixing this problem would also incur a significant performance penalty, since you wouldn't be able to preallocate each vector in one shot prior to the assignment loop.
strsplitN('aΩ',1L);
## [[1]]
## [1] "a" "\xce" "\xa9"
##
strsplit('aΩ','');
## [[1]]
## [1] "a" "Ω"
##
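If correctness on multibyte input matters more than raw speed, one possible fallback is to stay in base R, where nchar(type = "chars") and substring count characters rather than bytes. The sketch below is an assumption-laden illustration, not a rewrite of strsplitN: split_chars is a hypothetical name, and no claim is made that it comes anywhere near the Rcpp timing on very long strings.

```r
# Hypothetical helper (not part of the answer above): split each element of x
# into chunks of n characters, not bytes, using base R only.
split_chars <- function(x, n) {
  stopifnot(length(n) == 1L, n >= 1L)
  lapply(x, function(s) {
    if (is.na(s)) return(NA_character_)       # keep missing values missing
    len <- nchar(s, type = "chars")           # length in characters
    if (len == 0L) return(character(0))       # empty string: no chunks
    starts <- seq(1L, len, by = n)            # first character of each chunk
    substring(s, starts, starts + n - 1L)     # last chunk may be shorter
  })
}

split_chars("a\u03a9", 1L)   # "\u03a9" is the Omega from the example above
```

Because the chunk boundaries are computed from the character count, a multibyte character is never cut in half, and NA inputs are handled in the same pass.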