I have a question regarding finding the longest common substring in R. While searching through a few posts on *, I got to know about the qualV package. However, I see that the LCS function in this package actually finds all characters from string1 which are present in string2, even if they are not contiguous.
我有一个关于在R中查找最长公共子字符串的问题。在*上搜索几个帖子时,我了解了qualV包。但是,我看到这个包中的LCS函数实际上找到了string1中存在的所有字符,即使它们不是连续的。
To explain, if the strings are string1 : "hello" string2 : "hel12345lo" I expect the output to be hel, however I get the output as hello. I must be doing something wrong. Please see my code below.
为了解释,如果字符串是string1:“hello”string2:“hel12345lo”我希望输出为hel,但是我得到输出为hello。我一定做错了什么。请参阅下面的代码。
library(qualV)
a= "hello"
b="hel123l5678o"
sapply(seq_along(a), function(i)
paste(LCS(substring(a[i], seq(1, nchar(a[i])), seq(1, nchar(a[i]))),
substring(b[i], seq(1, nchar(b[i])), seq(1, nchar(b[i]))))$LCS,
collapse = ""))
I have also tried the Rlibstree method but I still get substrings which are not contiguous. Also, the length of the substring is also off from my expectation.s Please see below.
我也尝试过Rlibstree方法,但我仍然得到不连续的子串。此外,子串的长度也与我的预期不同。请参阅下文。
> a = "hello"
> b = "h1e2l3l4o5"
> ll <- list(a,b)
> lapply(data.frame(do.call(rbind, ll), stringsAsFactors=FALSE), function(x) getLongestCommonSubstring(x))
$do.call.rbind..ll.
[1] "h" "e" "l" "o"
> nchar(lapply(data.frame(do.call(rbind, ll), stringsAsFactors=FALSE), function(x) getLongestCommonSubstring(x)))
do.call.rbind..ll.
21
4 个解决方案
#1
7
Here are three possible solutions.
这有三种可能的解决方案。
library(stringi)
library(stringdist)
a <- "hello"
b <- "hel123l5678o"
## get all forward substrings of 'b'
sb <- stri_sub(b, 1, 1:nchar(b))
## extract them from 'a' if they exist
sstr <- na.omit(stri_extract_all_coll(a, sb, simplify=TRUE))
## match the longest one
sstr[which.max(nchar(sstr))]
# [1] "hel"
There are also adist()
and agrep()
in base R, and the stringdist
package has a few functions that run the LCS method. Here's a look at stringsidt
. It returns the number of unpaired characters.
基础R中还有adist()和agrep(),stringdist包有一些运行LCS方法的函数。以下是stringsidt。它返回未配对字符的数量。
stringdist(a, b, method="lcs")
# [1] 7
Filter("!", mapply(
stringdist,
stri_sub(b, 1, 1:nchar(b)),
stri_sub(a, 1, 1:nchar(b)),
MoreArgs = list(method = "lcs")
))
# h he hel
# 0 0 0
Now that I've explored this a bit more, I think adist()
might be the way to go. If we set counts=TRUE
we get a sequence of Matches, Insertions, etc. So if you give that to stri_locate()
we can use that matrix to get the matches from a to b.
现在我已经对此进行了更多的探索,我认为adist()可能是最佳选择。如果我们设置counts = TRUE,我们得到一系列Matches,Insertions等。所以如果你把它赋给stri_locate(),我们可以使用那个矩阵从a到b得到匹配。
ta <- drop(attr(adist(a, b, counts=TRUE), "trafos")))
# [1] "MMMIIIMIIIIM"
So the M
values denote straight across matches. We can go and get the substrings with stri_sub()
所以M值表示直接匹配。我们可以用stri_sub()获取子串
stri_sub(b, stri_locate_all_regex(ta, "M+")[[1]])
# [1] "hel" "l" "o"
Sorry I haven't explained that very well as I'm not well versed in string distance algorithms.
对不起,我没有解释得那么好,因为我不熟悉字符串距离算法。
#2
1
I'm not sure what you did to get your output of "hello". Based on trial-and-error experiments below, it appears that the LCS
function will (a) not regard a string as an LCS if a character follows what would otherwise be an LCS; (b) find multiple, equally-long LCS's (unlike sub() that finds just the first); (c) the order of the elements in the strings doesn't matter -- which has no illustration below; and (b) the order of the string in the LCS call doesn't matter -- also not shown.
我不知道你做了什么来得到你的输出“你好”。基于下面的反复试验,看来LCS函数将(a)如果一个字符遵循LCS的字符,则不将字符串视为LCS; (b)找到多个同样长的LCS(不像找到第一个的sub()); (c)字符串中元素的顺序无关紧要 - 下面没有说明; (b)LCS呼叫中字符串的顺序无关紧要 - 也未显示。
So, your "hello" of a had no LCS in b since the "hel" of b was followed by a character. Well, that's my current hypothesis.
所以,你的“hello”a中没有LCS,因为b的“hel”后跟一个角色。嗯,这是我目前的假设。
Point A above:
以上A点:
a= c("hello", "hel", "abcd")
b= c("hello123l5678o", "abcd")
print(LCS(a, b)[4]) # "abcd" - perhaps because it has nothing afterwards, unlike hello123...
a= c("hello", "hel", "abcd1") # added 1 to abcd
b= c("hello123l5678o", "abcd")
print(LCS(a, b)[4]) # no LCS!, as if anything beyond an otherwise LCS invalidates it
a= c("hello", "hel", "abcd")
b= c("hello1", "abcd") # added 1 to hello
print(LCS(a, b)[4]) # abcd only, since the b hello1 has a character
Point B above:
B点以上:
a= c("hello", "hel", "abcd")
b= c("hello", "abcd")
print(LCS(a, b)[4]) # found both, so not like sub vs gsub of finding first or all
#3
0
Leveraging @RichardScriven's insight that adist
could be used, but this function combines it all,
利用@ RichardScriven的洞察力可以使用adist,但是这个功能结合了所有,
EDIT This was tricky because we needed to get the longest_string
in two contexts, so I made this function:
编辑这很棘手,因为我们需要在两个上下文中获取longest_string,所以我做了这个函数:
longest_string <- function(s){return(s[which.max(nchar(s))])}
This combines @RichardSriven's work using the library...
这结合了@ RichardSriven使用图书馆的工作......
library(stringi)
library(stringdist)
lcsbstr <- function(a,b) {
sbstr_locations<- stri_locate_all_regex(drop(attr(adist(a, b, counts=TRUE), "trafos")), "M+")[[1]]
cmn_sbstr<-stri_sub(longest_string(c(a,b)), sbstr_locations)
longest_cmn_sbstr <- longest_string(cmn_sbstr)
return(longest_cmn_sbstr)
}
We can rewrite it to avoid the use of any external libraries (but still using the adist
)...
我们可以重写它以避免使用任何外部库(但仍使用adist)...
lcsbstr_no_lib <- function(a,b) {
matches <- gregexpr("M+", drop(attr(adist(a, b, counts=TRUE), "trafos")))[[1]];
lengths<- attr(matches, 'match.length')
which_longest <- which.max(lengths)
index_longest <- matches[which_longest]
length_longest <- lengths[which_longest]
longest_cmn_sbstr <- substring(longest_string(c(a,b)), index_longest , index_longest + length_longest - 1)
return(longest_cmn_sbstr )
}
All of identify only 'hello '
as the longest common substring, instead of 'hello r'
:
所有只识别'hello'作为最长的公共子串,而不是'hello r':
identical('hello ',
lcsbstr_no_lib('hello world', 'hello there'),
lcsbstr( 'hello world', 'hello there'))
EDIT And since the edit, works regardless of which argument is the longer of the two:
编辑自编辑以来,无论哪个参数都是两者中的较长者,它们都可以工作:
identical('hello',
lcsbstr_no_lib('hello', 'hello there'),
lcsbstr( 'hello', 'hello there'),
lcsbstr_no_lib('hello there', 'hello'),
lcsbstr( 'hello there', 'hello'))
LAST EDIT But this is only good if you accept this behavior. Notice this result:
最后编辑但是,如果您接受此行为,这只会很好。注意这个结果:
lcsbstr('hello world', 'hello')
#[1] 'hell'
I was expecting 'hello'
, but since the transformation actually moves (via deletion) the world to become the hello, so only the hell part is considered a match according to the M
:
我期待'你好',但由于转换实际上移动(通过删除)世界成为你好,所以根据M,只有地狱部分被认为是匹配:
drop(attr(adist('hello world', 'hello', counts=TRUE), "trafos"))
#[1] "MMMMDDDMDDD"
#[1] vvvv v
#[1] "hello world"
This behavior is observed using [this Levenstein tool] -- it gives two possible solutions, equivalent to these two transforms; can we tell adist which one we prefer? (the one with the greater number of consecutive M
)
使用[这个Levenstein工具]观察到这种行为 - 它提供了两种可能的解决方案,相当于这两种转换;我们可以告诉adist我们更喜欢哪一个吗? (具有更多连续M的那个)
#[1] "MMMMDDDMDDD"
#[1] "MMMMMDDDDDD"
Finally, don't forget adist allows you to pass in ignore.case = TRUE
(FALSE
is the default)
最后,不要忘记adist允许你传入ignore.case = TRUE(FALSE是默认值)
#4
0
df <- data.frame(A. = c("Australia", "Network"),
B. = c("Austria", "Netconnect"), stringsAsFactors = FALSE)
auxFun <- function(x) {
a <- strsplit(x[[1]], "")[[1]]
b <- strsplit(x[[2]], "")[[1]]
lastchar <- suppressWarnings(which(!(a == b)))[1] - 1
if(lastchar > 0){
out <- paste0(a[1:lastchar], collapse = "")
} else {
out <- ""
}
return(out)
}
df$C. <- apply(df, 1, auxFun)
df
A. B. C.
1 Australia Austria Austr
2 Network Netconnect Net
#1
7
Here are three possible solutions.
这有三种可能的解决方案。
library(stringi)
library(stringdist)
a <- "hello"
b <- "hel123l5678o"
## get all forward substrings of 'b'
sb <- stri_sub(b, 1, 1:nchar(b))
## extract them from 'a' if they exist
sstr <- na.omit(stri_extract_all_coll(a, sb, simplify=TRUE))
## match the longest one
sstr[which.max(nchar(sstr))]
# [1] "hel"
There are also adist()
and agrep()
in base R, and the stringdist
package has a few functions that run the LCS method. Here's a look at stringsidt
. It returns the number of unpaired characters.
基础R中还有adist()和agrep(),stringdist包有一些运行LCS方法的函数。以下是stringsidt。它返回未配对字符的数量。
stringdist(a, b, method="lcs")
# [1] 7
Filter("!", mapply(
stringdist,
stri_sub(b, 1, 1:nchar(b)),
stri_sub(a, 1, 1:nchar(b)),
MoreArgs = list(method = "lcs")
))
# h he hel
# 0 0 0
Now that I've explored this a bit more, I think adist()
might be the way to go. If we set counts=TRUE
we get a sequence of Matches, Insertions, etc. So if you give that to stri_locate()
we can use that matrix to get the matches from a to b.
现在我已经对此进行了更多的探索,我认为adist()可能是最佳选择。如果我们设置counts = TRUE,我们得到一系列Matches,Insertions等。所以如果你把它赋给stri_locate(),我们可以使用那个矩阵从a到b得到匹配。
ta <- drop(attr(adist(a, b, counts=TRUE), "trafos")))
# [1] "MMMIIIMIIIIM"
So the M
values denote straight across matches. We can go and get the substrings with stri_sub()
所以M值表示直接匹配。我们可以用stri_sub()获取子串
stri_sub(b, stri_locate_all_regex(ta, "M+")[[1]])
# [1] "hel" "l" "o"
Sorry I haven't explained that very well as I'm not well versed in string distance algorithms.
对不起,我没有解释得那么好,因为我不熟悉字符串距离算法。
#2
1
I'm not sure what you did to get your output of "hello". Based on trial-and-error experiments below, it appears that the LCS
function will (a) not regard a string as an LCS if a character follows what would otherwise be an LCS; (b) find multiple, equally-long LCS's (unlike sub() that finds just the first); (c) the order of the elements in the strings doesn't matter -- which has no illustration below; and (b) the order of the string in the LCS call doesn't matter -- also not shown.
我不知道你做了什么来得到你的输出“你好”。基于下面的反复试验,看来LCS函数将(a)如果一个字符遵循LCS的字符,则不将字符串视为LCS; (b)找到多个同样长的LCS(不像找到第一个的sub()); (c)字符串中元素的顺序无关紧要 - 下面没有说明; (b)LCS呼叫中字符串的顺序无关紧要 - 也未显示。
So, your "hello" of a had no LCS in b since the "hel" of b was followed by a character. Well, that's my current hypothesis.
所以,你的“hello”a中没有LCS,因为b的“hel”后跟一个角色。嗯,这是我目前的假设。
Point A above:
以上A点:
a= c("hello", "hel", "abcd")
b= c("hello123l5678o", "abcd")
print(LCS(a, b)[4]) # "abcd" - perhaps because it has nothing afterwards, unlike hello123...
a= c("hello", "hel", "abcd1") # added 1 to abcd
b= c("hello123l5678o", "abcd")
print(LCS(a, b)[4]) # no LCS!, as if anything beyond an otherwise LCS invalidates it
a= c("hello", "hel", "abcd")
b= c("hello1", "abcd") # added 1 to hello
print(LCS(a, b)[4]) # abcd only, since the b hello1 has a character
Point B above:
B点以上:
a= c("hello", "hel", "abcd")
b= c("hello", "abcd")
print(LCS(a, b)[4]) # found both, so not like sub vs gsub of finding first or all
#3
0
Leveraging @RichardScriven's insight that adist
could be used, but this function combines it all,
利用@ RichardScriven的洞察力可以使用adist,但是这个功能结合了所有,
EDIT This was tricky because we needed to get the longest_string
in two contexts, so I made this function:
编辑这很棘手,因为我们需要在两个上下文中获取longest_string,所以我做了这个函数:
longest_string <- function(s){return(s[which.max(nchar(s))])}
This combines @RichardSriven's work using the library...
这结合了@ RichardSriven使用图书馆的工作......
library(stringi)
library(stringdist)
lcsbstr <- function(a,b) {
sbstr_locations<- stri_locate_all_regex(drop(attr(adist(a, b, counts=TRUE), "trafos")), "M+")[[1]]
cmn_sbstr<-stri_sub(longest_string(c(a,b)), sbstr_locations)
longest_cmn_sbstr <- longest_string(cmn_sbstr)
return(longest_cmn_sbstr)
}
We can rewrite it to avoid the use of any external libraries (but still using the adist
)...
我们可以重写它以避免使用任何外部库(但仍使用adist)...
lcsbstr_no_lib <- function(a,b) {
matches <- gregexpr("M+", drop(attr(adist(a, b, counts=TRUE), "trafos")))[[1]];
lengths<- attr(matches, 'match.length')
which_longest <- which.max(lengths)
index_longest <- matches[which_longest]
length_longest <- lengths[which_longest]
longest_cmn_sbstr <- substring(longest_string(c(a,b)), index_longest , index_longest + length_longest - 1)
return(longest_cmn_sbstr )
}
All of identify only 'hello '
as the longest common substring, instead of 'hello r'
:
所有只识别'hello'作为最长的公共子串,而不是'hello r':
identical('hello ',
lcsbstr_no_lib('hello world', 'hello there'),
lcsbstr( 'hello world', 'hello there'))
EDIT And since the edit, works regardless of which argument is the longer of the two:
编辑自编辑以来,无论哪个参数都是两者中的较长者,它们都可以工作:
identical('hello',
lcsbstr_no_lib('hello', 'hello there'),
lcsbstr( 'hello', 'hello there'),
lcsbstr_no_lib('hello there', 'hello'),
lcsbstr( 'hello there', 'hello'))
LAST EDIT But this is only good if you accept this behavior. Notice this result:
最后编辑但是,如果您接受此行为,这只会很好。注意这个结果:
lcsbstr('hello world', 'hello')
#[1] 'hell'
I was expecting 'hello'
, but since the transformation actually moves (via deletion) the world to become the hello, so only the hell part is considered a match according to the M
:
我期待'你好',但由于转换实际上移动(通过删除)世界成为你好,所以根据M,只有地狱部分被认为是匹配:
drop(attr(adist('hello world', 'hello', counts=TRUE), "trafos"))
#[1] "MMMMDDDMDDD"
#[1] vvvv v
#[1] "hello world"
This behavior is observed using [this Levenstein tool] -- it gives two possible solutions, equivalent to these two transforms; can we tell adist which one we prefer? (the one with the greater number of consecutive M
)
使用[这个Levenstein工具]观察到这种行为 - 它提供了两种可能的解决方案,相当于这两种转换;我们可以告诉adist我们更喜欢哪一个吗? (具有更多连续M的那个)
#[1] "MMMMDDDMDDD"
#[1] "MMMMMDDDDDD"
Finally, don't forget adist allows you to pass in ignore.case = TRUE
(FALSE
is the default)
最后,不要忘记adist允许你传入ignore.case = TRUE(FALSE是默认值)
#4
0
df <- data.frame(A. = c("Australia", "Network"),
B. = c("Austria", "Netconnect"), stringsAsFactors = FALSE)
auxFun <- function(x) {
a <- strsplit(x[[1]], "")[[1]]
b <- strsplit(x[[2]], "")[[1]]
lastchar <- suppressWarnings(which(!(a == b)))[1] - 1
if(lastchar > 0){
out <- paste0(a[1:lastchar], collapse = "")
} else {
out <- ""
}
return(out)
}
df$C. <- apply(df, 1, auxFun)
df
A. B. C.
1 Australia Austria Austr
2 Network Netconnect Net