Excel VBA: a kind of fuzzy matching

Time: 2021-11-19 19:25:29

I want to automate a script in Excel VBA and I am stuck.

I have a string like "The patient population is x. From this, a lot of patients are males. A particular male patint has 3 deadly diseases." (the real strings will be longer).

What I want to do is count how many times the word "patient" appears in this string, even if the word is misspelled or written in a different form.

My idea was to match the word "patient" against every word in the string with a confidence of, let's say, 80%. The result I am aiming for is: there are 3 matches, and the matching words in the string are "patient", "patients" and "patint". Is there a way to do this?

4 solutions

#1


0  

The concept you are looking for is called "full text search".

I'm not 100% sure, but I think it is not native to Excel or VBA. To the best of my knowledge, not even MS Access supports it.

Check out the Add-In suggested by Alex K or look at embedding a real database in your app.

#2


2  

YMMV of course but two things to look at are:

Fuzzy Lookup Add-In for Excel

... performs fuzzy matching of textual data in Microsoft Excel. It can be used to identify fuzzy duplicate rows within a single table or to fuzzy join similar rows between two different tables. The matching is robust to a wide variety of errors including spelling mistakes, abbreviations, synonyms and added/missing data.

Calculating the Levenshtein Distance may also be useful.
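
For reference, the Levenshtein distance between two strings a and b is usually defined by the recurrence below, where d(i, j) is the distance between the first i characters of a and the first j characters of b. The symbols d, a and b are notation chosen here for illustration, not taken from the answer.

```latex
\begin{aligned}
d(i, 0) &= i, \qquad d(0, j) = j, \\
d(i, j) &= \min
\begin{cases}
d(i-1, j) + 1 & \text{(deletion)} \\
d(i, j-1) + 1 & \text{(insertion)} \\
d(i-1, j-1) + [a_i \neq b_j] & \text{(substitution, free if the characters match)}
\end{cases}
\end{aligned}
```

For example, "patient" becomes "patint" by deleting one letter, so their distance is 1.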

#3


1  

Here is a VBA implementation of the Levenshtein distance. You can put a threshold on the returned distance to fit your needs.

Public Function Levenshtein(str1 As String, str2 As String) As Integer
On Error GoTo ErrHandler
    Dim arrLev() As Integer, arrStr1() As String, arrStr2() As String
    Dim intLen1 As Integer, intLen2 As Integer
    Dim i As Integer, j As Integer, intMini As Integer

    intLen1 = Len(str1)
    intLen2 = Len(str2)

    ' Explicit lower bounds, so the code does not depend on Option Base
    ReDim arrStr1(0 To intLen1)
    ReDim arrStr2(0 To intLen2)
    ReDim arrLev(0 To intLen1, 0 To intLen2)

    ' First column: the first i characters of str1 vs. an empty string
    arrLev(0, 0) = 0
    For i = 1 To intLen1
        arrLev(i, 0) = i
        arrStr1(i) = Mid(str1, i, 1)
    Next

    ' First row: an empty string vs. the first j characters of str2
    For j = 1 To intLen2
        arrLev(0, j) = j
        arrStr2(j) = Mid(str2, j, 1)
    Next

    ' Fill the rest of the table; each cell is the cheapest way to get there
    For j = 1 To intLen2
        For i = 1 To intLen1
            If arrStr1(i) = arrStr2(j) Then
                arrLev(i, j) = arrLev(i - 1, j - 1) 'characters match, no cost
            Else
                intMini = arrLev(i - 1, j) 'deletion
                If intMini > arrLev(i, j - 1) Then intMini = arrLev(i, j - 1) 'insertion
                If intMini > arrLev(i - 1, j - 1) Then intMini = arrLev(i - 1, j - 1) 'substitution

                arrLev(i, j) = intMini + 1
            End If
        Next
    Next

    Levenshtein = arrLev(intLen1, intLen2)
    Exit Function

ErrHandler:
    MsgBox Err.Description
    Exit Function
End Function
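
To tie this back to the "80% confidence" in the question, one possible approach is to turn the distance into a similarity score, 1 - distance / (length of the longer word), and count the words that reach the threshold. The sketch below assumes that definition of similarity and that words are separated by spaces and common punctuation. CountFuzzyMatches and its parameters are hypothetical names, not part of the answer; it calls the Levenshtein function above.

```vba
' Hypothetical helper: counts the words in sentence whose similarity to
' target is at least minSimilarity (0..1). Assumes the Levenshtein
' function from this answer is in the same project.
Public Function CountFuzzyMatches(ByVal sentence As String, ByVal target As String, _
                                  Optional ByVal minSimilarity As Double = 0.8) As Long
    Dim words() As String
    Dim word As String
    Dim punct As Variant
    Dim i As Long, maxLen As Long
    Dim similarity As Double

    ' Normalise: lower-case everything and replace common punctuation with spaces
    sentence = LCase$(sentence)
    target = LCase$(target)
    For Each punct In Array(".", ",", ";", ":", "!", "?", "(", ")", """")
        sentence = Replace(sentence, punct, " ")
    Next punct

    words = Split(sentence, " ")
    For i = LBound(words) To UBound(words)
        word = words(i)
        If Len(word) > 0 Then   ' skip empty pieces left by double spaces
            maxLen = Len(word)
            If Len(target) > maxLen Then maxLen = Len(target)

            ' Assumed definition: similarity = 1 - distance / longer length
            similarity = 1 - Levenshtein(word, target) / maxLen
            If similarity >= minSimilarity Then
                CountFuzzyMatches = CountFuzzyMatches + 1
                Debug.Print word, Format$(similarity, "0.00")
            End If
        End If
    Next i
End Function
```

With the example string from the question and target "patient", "patients" and "patint" are each one edit away (similarity 1 - 1/8 and 1 - 1/7), so a call such as CountFuzzyMatches(myString, "patient") should report 3 matches at the 0.8 threshold. Other definitions of similarity are possible and would shift where the cutoff falls.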

#4


0  

You could use the Soundex2 algorithm to match similar-sounding words. This SO post has some pointers on soundex in VBA.
Note that the algorithm relies on characteristics predominantly found in English.
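
As an illustration of the phonetic approach, here is a minimal sketch of the classic American Soundex code. Note that this is not the Soundex2 variant named above, and it is simplified (H and W are treated like vowels). SimpleSoundex is a hypothetical name. Two words "match" when their codes are equal, so this gives a yes/no answer rather than a percentage.

```vba
' Hypothetical helper: simplified classic Soundex (NOT Soundex2).
' Returns a letter followed by three digits, e.g. "P353",
' or an empty string if the word contains no letters A-Z.
Public Function SimpleSoundex(ByVal word As String) As String
    ' Digit for each letter A..Z; "0" marks letters that are not coded
    Const CODES As String = "01230120022455012623010202"
    Dim i As Long, n As Long
    Dim code As String, lastCode As String
    Dim result As String

    word = UCase$(word)
    For i = 1 To Len(word)
        n = Asc(Mid$(word, i, 1))
        If n >= 65 And n <= 90 Then              ' only A..Z are considered
            code = Mid$(CODES, n - 64, 1)
            If Len(result) = 0 Then
                result = Chr$(n)                 ' keep the first letter as-is
            ElseIf code <> "0" And code <> lastCode Then
                result = result & code           ' new consonant group
            End If
            lastCode = code
        End If
        If Len(result) = 4 Then Exit For
    Next i

    ' Pad short codes with zeros to the usual length of four
    If Len(result) > 0 Then result = Left$(result & "000", 4)
    SimpleSoundex = result
End Function
```

For the words in the question, "patient", "patients" and "patint" all get the code P353, so comparing SimpleSoundex(word) with SimpleSoundex("patient") would pick out the same three words.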
