Checking the similarity between two words with NLTK

Time: 2022-01-04 14:08:30

I have two lists, and I want to check the similarity between each pair of words across the two lists and find the maximum similarity. Here is my code:

from nltk.corpus import wordnet

list1 = ['Compare', 'require']
list2 = ['choose', 'copy', 'define', 'duplicate', 'find', 'how', 'identify', 'label', 'list', 'listen', 'locate', 'match', 'memorise', 'name', 'observe', 'omit', 'quote', 'read', 'recall', 'recite', 'recognise', 'record', 'relate', 'remember', 'repeat', 'reproduce', 'retell', 'select', 'show', 'spell', 'state', 'tell', 'trace', 'write']
list = []

for word1 in list1:
    for word2 in list2:
        wordFromList1 = wordnet.synsets(word1)[0]
        wordFromList2 = wordnet.synsets(word2)[0]
        s = wordFromList1.wup_similarity(wordFromList2)
        list.append(s)

print(max(list)) 

But this results in an error:

wordFromList2 = wordnet.synsets(word2)[0]
        IndexError: list index out of range

Please help me fix this.
Thank you.

2 solutions

#1


10  

You get the error when a synset list is empty and you try to take the element at the (non-existent) index zero. But why check only the zeroth element? If you want to check everything, try all pairs of elements in the returned synsets. You can use itertools.product() to save yourself two for-loops:

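from itertools import product
from nltk.corpus import wordnet

sims = []

for word1, word2 in product(list1, list2):
    syns1 = wordnet.synsets(word1)
    syns2 = wordnet.synsets(word2)
    # An empty synset list simply contributes no pairs, so no IndexError.
    for sense1, sense2 in product(syns1, syns2):
        d = wordnet.wup_similarity(sense1, sense2)
        # Store the two senses that were compared, not the full lists.
        sims.append((d, sense1, sense2))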

This is inefficient because the same synsets are looked up again and again, but it is the closest to the logic of your code. If you have enough data to make speed an issue, you can speed it up by collecting the synsets for all words in list1 and list2 once, and taking the product of the synsets.

>>> allsyns1 = set(ss for word in list1 for ss in wordnet.synsets(word))
>>> allsyns2 = set(ss for word in list2 for ss in wordnet.synsets(word))
>>> best = max((wordnet.wup_similarity(s1, s2) or 0, s1, s2) for s1, s2 in
...            product(allsyns1, allsyns2))
>>> print(best)
(0.9411764705882353, Synset('command.v.02'), Synset('order.v.01'))
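
If you would rather keep the per-word loop but avoid repeated lookups, one option is to cache them. The following is only a sketch under assumptions: best_pair, toy_synsets, toy_similarity and the toy data are hypothetical names made up for illustration, and the lookup and scoring functions are passed in as parameters so the example runs without WordNet data. It also passes key= and default= to max(), so ties never compare the senses themselves and an empty input does not raise.

```python
from functools import lru_cache
from itertools import product


def best_pair(words1, words2, get_synsets, similarity):
    """Return (score, sense1, sense2) for the most similar pair, or None."""

    @lru_cache(maxsize=None)
    def senses(word):
        # Each distinct word is looked up only once; later calls hit the cache.
        return tuple(get_synsets(word))

    scored = (
        (similarity(s1, s2) or 0.0, s1, s2)  # treat a missing score as 0.0
        for w1, w2 in product(words1, words2)
        for s1, s2 in product(senses(w1), senses(w2))
    )
    # key= compares scores only; default= covers the case of no pairs at all.
    return max(scored, key=lambda t: t[0], default=None)


# Hypothetical toy stand-ins (not real WordNet data) so the sketch is runnable.
TOY_SENSES = {"copy": ["copy.v.01", "copy.n.01"], "duplicate": ["duplicate.v.01"]}
TOY_SCORES = {("copy.v.01", "duplicate.v.01"): 0.8}


def toy_synsets(word):
    return TOY_SENSES.get(word, [])  # unknown words give an empty list


def toy_similarity(a, b):
    return TOY_SCORES.get((a, b))  # None when no score is known


print(best_pair(["copy", "how"], ["duplicate"], toy_synsets, toy_similarity))
# prints (0.8, 'copy.v.01', 'duplicate.v.01')
```

With NLTK installed and the WordNet corpus downloaded, you would presumably call it as best_pair(list1, list2, wordnet.synsets, lambda a, b: a.wup_similarity(b)).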

#2


8  

Try checking whether these lists are empty before you use them:

from nltk.corpus import wordnet

list1 = ['Compare', 'require']
list2 = ['choose', 'copy', 'define', 'duplicate', 'find', 'how', 'identify', 'label', 'list', 'listen', 'locate', 'match', 'memorise', 'name', 'observe', 'omit', 'quote', 'read', 'recall', 'recite', 'recognise', 'record', 'relate', 'remember', 'repeat', 'reproduce', 'retell', 'select', 'show', 'spell', 'state', 'tell', 'trace', 'write']
sims = []  # renamed from "list" so the built-in is not shadowed

for word1 in list1:
    for word2 in list2:
        wordFromList1 = wordnet.synsets(word1)
        wordFromList2 = wordnet.synsets(word2)
        if wordFromList1 and wordFromList2: #Thanks to @alexis' note
            s = wordFromList1[0].wup_similarity(wordFromList2[0])
            if s is not None:  # wup_similarity can return None when no path is found
                sims.append(s)

print(max(sims))
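
The emptiness check above skips words silently, so it can be useful to see which words were dropped and which pair produced each score. A rough sketch of that idea follows; first_sense_scores and the toy data are hypothetical names invented for this example, and the lookup and similarity functions are injected so it runs without WordNet installed.

```python
def first_sense_scores(words1, words2, get_synsets, similarity):
    """Score first senses only; also report words that had no synsets."""
    scores, missing = [], set()
    for w1 in words1:
        for w2 in words2:
            syns1, syns2 = get_synsets(w1), get_synsets(w2)
            if not syns1:
                missing.add(w1)
            if not syns2:
                missing.add(w2)
            if syns1 and syns2:
                s = similarity(syns1[0], syns2[0])
                if s is not None:  # skip pairs with no score
                    scores.append((s, w1, w2))
    return scores, sorted(missing)


# Hypothetical toy data standing in for WordNet lookups.
TOY = {"compare": ["compare.v.01"], "match": ["match.v.01"], "copy": ["copy.v.01"]}
TOY_SIM = {("compare.v.01", "match.v.01"): 0.5}

scores, missing = first_sense_scores(
    ["compare"],
    ["match", "zzz", "copy"],
    lambda w: TOY.get(w, []),
    lambda a, b: TOY_SIM.get((a, b)),
)
print(scores)   # [(0.5, 'compare', 'match')]
print(missing)  # ['zzz']
```

Keeping the words next to each score means max(scores) also tells you which pair was the most similar, not just the number.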
