Is there a way to remove duplicate and consecutive words/phrases in a string?

Time: 2021-02-27 01:35:22

Is there a way to remove duplicate and consecutive words/phrases in a string? E.g.

[in]: foo foo bar bar foo bar

[out]: foo bar foo bar

I have tried this:

>>> s = 'this is a foo bar bar black sheep , have you any any wool woo , yes sir yes sir three bag woo wu wool'
>>> [i for i,j in zip(s.split(),s.split()[1:]) if i!=j]
['this', 'is', 'a', 'foo', 'bar', 'black', 'sheep', ',', 'have', 'you', 'any', 'wool', 'woo', ',', 'yes', 'sir', 'yes', 'sir', 'three', 'bag', 'woo', 'wu']
>>> " ".join([i for i,j in zip(s.split(),s.split()[1:]) if i!=j]+[s.split()[-1]])
'this is a foo bar black sheep , have you any wool woo , yes sir yes sir three bag woo wu'

What happens when it gets a little more complicated and I want to remove repeated phrases (let's say a phrase can be made up of up to 5 words)? How can it be done? E.g.

[in]: foo bar foo bar foo bar

[out]: foo bar

Another example:

[in]: this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate . sentence are not prhases .

[out]: this is a sentence where phrases duplicate . sentence are not prhases .

6 Solutions

#1


13  

You can use the re module for that.

>>> import re
>>> s = 'foo foo bar bar'
>>> re.sub(r'\b(.+)\s+\1\b', r'\1', s)
'foo bar'

>>> s = 'foo bar foo bar foo bar'
>>> re.sub(r'\b(.+)\s+\1\b', r'\1', s)
'foo bar foo bar'

If you want to match any number of consecutive occurrences:

>>> s = 'foo bar foo bar foo bar'
>>> re.sub(r'\b(.+)(\s+\1\b)+', r'\1', s)
'foo bar'    

Edit: an addition for your last example. To handle it, you have to keep calling re.sub for as long as duplicate phrases remain. So:

>>> s = 'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate'
>>> while re.search(r'\b(.+)(\s+\1\b)+', s):
...   s = re.sub(r'\b(.+)(\s+\1\b)+', r'\1', s)
...
>>> s
'this is a sentence where phrases duplicate'
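
The question caps phrases at five words. Below is a minimal sketch of the same repeat-until-stable loop with that cap built into the pattern, assuming tokens are separated by whitespace. The names collapse_repeats and max_words are illustrative and not part of the answer above.

```python
import re

def collapse_repeats(text, max_words=5):
    # Hypothetical helper: a phrase is 1..max_words whitespace-separated tokens.
    # (?<!\S) and (?!\S) pin the match to token edges, so a lone "." counts
    # as a token and "foo foobar" is not treated as a repeat.
    pattern = re.compile(
        r'(?<!\S)(\S+(?:\s+\S+){0,%d})(?:\s+\1)+(?!\S)' % (max_words - 1)
    )
    while True:
        collapsed = pattern.sub(r'\1', text)
        if collapsed == text:   # nothing left to collapse
            return collapsed
        text = collapsed

s = ('this is a sentence sentence sentence this is a sentence where phrases '
     'phrases duplicate where phrases duplicate . sentence are not prhases .')
print(collapse_repeats(s))
```

Because tokens are matched with \S+ rather than \w+, the trailing ". sentence are not prhases ." part of the question's last example is kept intact.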

#2


6  

I love itertools. It seems like every time I want to write something, itertools already has it. In this case, groupby takes an iterable and groups consecutive equal items, yielding (item_value, iterator_of_those_values) tuples. Use it here like this:

>>> from itertools import groupby
>>> s = 'this is a foo bar bar black sheep , have you any any wool woo , yes sir yes sir three bag woo wu wool'
>>> ' '.join(item[0] for item in groupby(s.split()))
'this is a foo bar black sheep , have you any wool woo , yes sir yes sir three bag woo wu wool'
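
To make the (key, group) pairs concrete, here is a small sketch on a fragment of the same sentence. The variable names are illustrative.

```python
from itertools import groupby

words = 'any any wool woo , yes sir yes sir'.split()

# Each run of equal neighbours becomes one (key, group) pair.
runs = [(key, len(list(group))) for key, group in groupby(words)]
print(runs)

# Only adjacent repeats are merged: "yes sir yes sir" is left alone because
# no single word is immediately followed by itself.
print(' '.join(key for key, _ in groupby(words)))
```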

So let's extend that a little with a function that takes a sequence of word lists (chunks), drops consecutive repeated chunks, and flattens what is left back into a single list of words:

from itertools import chain, groupby

def dedupe(lst):
    return list(chain(*[item[0] for item in groupby(lst)]))

That's great for one-word phrases, but not helpful for longer phrases. What to do? Well, first, we'll want to check for longer phrases by striding over our original phrase:

def stride(lst, offset, length):
    if offset:
        yield lst[:offset]

    while True:
        yield lst[offset:offset + length]
        offset += length
        if offset >= len(lst):
            return
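
A quick sketch of what stride yields for two offsets may help; the function is restated from the answer so the snippet runs on its own, and the sample list is made up.

```python
def stride(lst, offset, length):
    # Restated from the answer above.
    if offset:
        yield lst[:offset]          # leading partial chunk before the first stride

    while True:
        yield lst[offset:offset + length]
        offset += length
        if offset >= len(lst):
            return

words = 'a b c d e'.split()
print(list(stride(words, 0, 2)))   # [['a', 'b'], ['c', 'd'], ['e']]
print(list(stride(words, 1, 2)))   # [['a'], ['b', 'c'], ['d', 'e']]
```

Shifting the offset changes where the chunk boundaries fall, which is why every offset below the chunk length has to be tried.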

Now we're cooking! OK. So our strategy here is to first remove all the single-word duplicates. Next, we'll remove the two-word duplicates, starting from offset 0 then 1. After that, three-word duplicates starting at offsets 0, 1, and 2, and so on until we've hit five-word duplicates:

def cleanse(list_of_words, max_phrase_length):
    for length in range(1, max_phrase_length + 1):
        for offset in range(length):
            list_of_words = dedupe(stride(list_of_words, offset, length))

    return list_of_words
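
To see which passes actually do the work, here is a hedged sketch that records every (length, offset) pass that shortens the list. cleanse_trace is a hypothetical variant written for illustration; stride and dedupe are restated from the answer.

```python
from itertools import chain, groupby

def stride(lst, offset, length):
    if offset:
        yield lst[:offset]
    while True:
        yield lst[offset:offset + length]
        offset += length
        if offset >= len(lst):
            return

def dedupe(lst):
    return list(chain(*[item[0] for item in groupby(lst)]))

def cleanse_trace(words, max_phrase_length):
    # Same passes as cleanse, plus a log of (length, offset, words_left)
    # for every pass that removed something.
    changes = []
    for length in range(1, max_phrase_length + 1):
        for offset in range(length):
            shorter = dedupe(stride(words, offset, length))
            if len(shorter) != len(words):
                changes.append((length, offset, len(shorter)))
            words = shorter
    return words, changes

a = ('this is a sentence sentence sentence this is a sentence where phrases '
     'phrases duplicate where phrases duplicate . sentence are not prhases .')
result, changes = cleanse_trace(a.split(), 5)
print(changes)   # [(1, 0, 20), (3, 2, 17), (4, 0, 13)]
print(' '.join(result))
```

On this input the single-word pass, the three-word pass at offset 2, and the four-word pass at offset 0 are the only ones that remove anything.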

Putting it all together:

from itertools import chain, groupby

def stride(lst, offset, length):
    if offset:
        yield lst[:offset]

    while True:
        yield lst[offset:offset + length]
        offset += length
        if offset >= len(lst):
            return

def dedupe(lst):
    return list(chain(*[item[0] for item in groupby(lst)]))

def cleanse(list_of_words, max_phrase_length):
    for length in range(1, max_phrase_length + 1):
        for offset in range(length):
            list_of_words = dedupe(stride(list_of_words, offset, length))

    return list_of_words

a = 'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate . sentence are not prhases .'

b = 'this is a sentence where phrases duplicate . sentence are not prhases .'

print(' '.join(cleanse(a.split(), 5)) == b)

#3


0  

Personally, I do not think we need to use any other modules for this (although I admit some of them are GREAT). I just managed this with simple looping by first converting the string into a list. I tried it on all the examples listed above. It works fine.

sentence = input("Please enter your sentence:\n")

word_list = sentence.split()

def check_if_same(i, j):
    # Compare the phrase word_list[i:j] with the equally long phrase right after it.
    end = (2 * j) - i   # end point of the second phrase (essentially j + phrase_len)
    return word_list[i:j] == word_list[j:end]

phrase_len = 1

while phrase_len <= len(word_list) // 2:  # checks the sentence for different phrase lengths

    curr_word_index = 0

    while curr_word_index < len(word_list):  # checks all the words of the sentence for this phrase length

        if check_if_same(curr_word_index, curr_word_index + phrase_len):
            # delete one copy of the repeated phrase, then re-check the same index
            del word_list[curr_word_index : curr_word_index + phrase_len]
        else:
            curr_word_index += 1

    phrase_len += 1

print("Answer: " + ' '.join(word_list))
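
The script above reads from standard input and tries every phrase length up to half the sentence. Here is a sketch of the same scan as a reusable function, capped at the question's five words; the name remove_repeated_phrases and the max_phrase_len parameter are assumptions for illustration.

```python
def remove_repeated_phrases(words, max_phrase_len=5):
    # Hypothetical function form of the loop above; works on a copy.
    words = list(words)
    for phrase_len in range(1, max_phrase_len + 1):
        i = 0
        # Two full phrases must fit from index i, otherwise no repeat is possible.
        while i + 2 * phrase_len <= len(words):
            first = words[i:i + phrase_len]
            second = words[i + phrase_len:i + 2 * phrase_len]
            if first == second:
                del words[i:i + phrase_len]   # drop one copy, re-check the same index
            else:
                i += 1
    return words

a = ('this is a sentence sentence sentence this is a sentence where phrases '
     'phrases duplicate where phrases duplicate . sentence are not prhases .')
print(' '.join(remove_repeated_phrases(a.split())))
```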

#4


0  

With a pattern similar to sharcashmo's pattern, you can use subn, which also returns the number of replacements, inside a while loop:

import re

txt = r'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate . sentence are not phrases .'

pattern = re.compile(r'(\b\w+(?: \w+)*)(?: \1)+\b')
repl = r'\1'

res = txt

while True:
    # subn returns (new_string, number_of_substitutions)
    res, nbr = pattern.subn(repl, res)
    if nbr == 0:
        break

print(res)

When there are no more replacements, the while loop stops.

With this method you can catch all overlapping matches (which is impossible with a single pass in a replacement context), without running the same pattern twice per iteration.
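
To show why more than one pass is needed, this sketch logs the count that subn reports on each pass over the same text. The counts list is added for illustration only.

```python
import re

pattern = re.compile(r'(\b\w+(?: \w+)*)(?: \1)+\b')

res = ('this is a sentence sentence sentence this is a sentence where phrases '
       'phrases duplicate where phrases duplicate . sentence are not phrases .')

counts = []
while True:
    res, nbr = pattern.subn(r'\1', res)
    counts.append(nbr)      # replacements made in this pass
    if nbr == 0:
        break

print(counts)   # [2, 2, 0]
print(res)
```

The first pass only collapses the single repeated words; the longer repeated phrases become adjacent afterwards and are caught by the second pass.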

#5


-1  

txt1 = 'this is a foo bar bar black sheep , have you any any wool woo , yes sir yes sir three bag woo wu wool'
txt2 =  'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate'

def remove_duplicate_words(txt):
    # Note: this drops every repeated word, not only consecutive repeats.
    result = []
    for word in txt.split():
        if word not in result:
            result.append(word)
    return ' '.join(result)

Output:

In [7]: remove_duplicate_words(txt1)
Out[7]: 'this is a foo bar black sheep , have you any wool woo yes sir three bag wu'

In [8]: remove_duplicate_words(txt2)
Out[8]: 'this is a sentence where phrases duplicate'

#6


-1  

This should fix any number of adjacent duplicates, and works with both of your examples. I convert the string to a list, fix it, then convert back to a string for output:

mywords = "foo foo bar bar foo bar"
words = mywords.split()  # renamed from `list` to avoid shadowing the built-in

def remove_adjacent_dups(alist):
    result = []
    most_recent_elem = None
    for e in alist:
        if e != most_recent_elem:  # keep a word only if it differs from the previous one
            result.append(e)
            most_recent_elem = e
    return ' '.join(result)

print(remove_adjacent_dups(words))

Output:

foo bar foo bar
