I have two character vectors (names of objects) and I want to extract the longest common substring from each pair of corresponding elements.
a <- c('blahABCfoo', 'blahDEFfoo')
b <- c('XXABC-123', 'XXDEF-123')
I want the following as a result:
[1] "ABC" "DEF"
These vectors as input should give the same result:
a <- c('textABCxx', 'textDEFxx')
b <- c('zzABCblah', 'zzDEFblah')
These examples are representative. The strings contain identifying elements, and the remainder of the text in each vector element is common, but unknown.
Is there a solution in one of the following places (in order of preference)?
- Base R
- Recommended packages
- Packages available on CRAN
The answer to the supposed duplicate does not fulfill these requirements.
3 solutions
#1
9
Here's a CRAN package for that:
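library(qualV)
# Split each pair of strings into single characters, run LCS, and glue the result back together.
# Note: qualV::LCS finds the longest common subsequence; for these inputs it matches the substring.
sapply(seq_along(a), function(i)
  paste(LCS(strsplit(a[i], '')[[1]], strsplit(b[i], '')[[1]])$LCS,
        collapse = ""))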
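Since base R was the first preference, here is a minimal base R sketch of a true longest common substring, using a table of common-suffix lengths. The function name `longest_common_substring` is my own invention, not something from base R or qualV. As far as I know, qualV's `LCS` returns the longest common subsequence, which can differ from the substring when the shared characters are not contiguous. On ties, this sketch returns the first match found in the first string.

```r
# Longest common substring of two single strings (base R only).
# Hypothetical helper, not part of any package.
longest_common_substring <- function(x, y) {
  xs <- strsplit(x, "")[[1]]
  ys <- strsplit(y, "")[[1]]
  if (length(xs) == 0 || length(ys) == 0) return("")
  # prev[j + 1] is the length of the common run ending at xs[i - 1] and ys[j]
  prev <- integer(length(ys) + 1)
  best_len <- 0L
  best_end <- 0L
  for (i in seq_along(xs)) {
    cur <- integer(length(ys) + 1)
    for (j in seq_along(ys)) {
      if (xs[i] == ys[j]) {
        cur[j + 1] <- prev[j] + 1L      # extend the run ending one step earlier
        if (cur[j + 1] > best_len) {
          best_len <- cur[j + 1]
          best_end <- i
        }
      }
    }
    prev <- cur
  }
  if (best_len == 0L) return("")
  paste(xs[(best_end - best_len + 1):best_end], collapse = "")
}

a <- c('blahABCfoo', 'blahDEFfoo')
b <- c('XXABC-123', 'XXDEF-123')
# Apply pairwise over the two vectors
result <- mapply(longest_common_substring, a, b, USE.NAMES = FALSE)
result
```

The two nested loops make this quadratic in the string lengths, which should be fine for short object names but slow for very long strings.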
#2
9
If you don't mind using Bioconductor packages, you can use Rlibstree. The installation is pretty straightforward.
source("http://bioconductor.org/biocLite.R")
biocLite("Rlibstree")
Then, you can do:
require(Rlibstree)
ll <- list(a,b)
lapply(data.frame(do.call(rbind, ll), stringsAsFactors=FALSE),
       function(x) getLongestCommonSubstring(x))
# $X1
# [1] "ABC"
# $X2
# [1] "DEF"
On a side note: I'm not quite sure whether Rlibstree uses libstree 0.42 or libstree 0.43. Both libraries are present in the source package. I remember running into a memory leak (and hence an error) on a huge array in Perl that was using libstree 0.42. Just a heads-up.
#3
0
Because I have too many things I don't want to do, I did this instead:
Rgames> longs <- integer(100)  # must exist before longs[jj] is assigned
Rgames> for(jj in 1:100) {
+ str2 <- sample(letters, 100, replace=TRUE)
+ str1 <- sample(letters, 100, replace=TRUE)
+ longs[jj] <- length(lcstring(str1, str2)[[1]])
+ }
Rgames> table(longs)
longs
 2  3  4
59 39  2
Anyone care to do a statistical estimate of the actual distribution of matching strings? (lcstring is just a brute-force home-rolled function; the output contains all the maximum-length strings, which is why I only look at the first list element.)
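The lcstring function itself isn't shown, so here is a minimal sketch of what such a brute-force function might look like, just to make the experiment reproducible. Everything in it is an assumption on my part: the body of `lcstring`, the `set.seed(1)` call, and the initialisation of `longs` are not from the original post. It takes two character vectors (one letter per element) and returns a list of all the longest common runs.

```r
# Hypothetical stand-in for the unpublished lcstring(): try every pair of
# start positions, extend the match as far as it goes, keep all longest runs.
lcstring <- function(x, y) {
  best <- 0L
  runs <- list()
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      k <- 0L
      # && short-circuits, so the comparison never reads past either end
      while (i + k <= length(x) && j + k <= length(y) && x[i + k] == y[j + k]) {
        k <- k + 1L
      }
      if (k > best) {
        best <- k
        runs <- list(x[i:(i + k - 1L)])          # new record: start over
      } else if (k == best && k > 0L) {
        runs <- c(runs, list(x[i:(i + k - 1L)])) # tie: keep this run too
      }
    }
  }
  unique(runs)
}

set.seed(1)               # assumption: fixed seed so reruns give the same table
longs <- integer(100)
for (jj in 1:100) {
  str2 <- sample(letters, 100, replace = TRUE)
  str1 <- sample(letters, 100, replace = TRUE)
  longs[jj] <- length(lcstring(str1, str2)[[1]])
}
table(longs)
```

The counts will depend on the seed, so don't expect them to match the table above exactly. Note that `[[1]]` would fail if two vectors shared no letter at all, which is extremely unlikely with 100 random letters each.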