高效删除Ruby中数组中其他元素的所有子串

时间:2022-06-17 21:42:36

I have a complex problem of in-place editing of an array at hand. I have an array in which some of the elements are sub-strings of other elements.I want to delete all the sub-strings and keep the supersets/strings only. i.e. Array => ['1', '1 1', '1 1 1', '1 1 1 2', '1 2 3 1', '1 2', '2 3'] After operation I should have a sanitized array => ['1 1 1 2', '1 2 3 1']

我有一个复杂的问题,即手头编辑一个数组。我有一个数组,其中一些元素是其他元素的子字符串。我想删除所有子字符串并仅保留超集/字符串。即Array => ['1','1 1','1 1 1','1 1 1 2','1 2 3 1','1 2','2 3']术后我应该有一个sanitized array => ['1 1 1 2','1 2 3 1']

Is there an efficient algorithm to achieve the same ?

是否有一种有效的算法来实现相同的目标?

3 个解决方案

#1


3  

This approach uses some array math to remove itself from the array and then checks to see if it shows up as a substring. I have no idea how performant this is.

这种方法使用一些数组数学从数组中删除自己,然后检查它是否显示为子字符串。我不知道这是多么高效。

a = ['1', '1 1', '1 1 1', '1 1 1 2', '1 2 3 1', '1 2', '2 3']
a.uniq.delete_if { |i| (a-[i]).any? {|j| j.include? i } }

I changed to using a delete_if as it will improve performance as you are shortening your array anytime a substring is found making subsequent checks slightly faster.

我改为使用delete_if,因为它可以提高性能,因为您可以在找到子字符串时缩短数组,从而使后续检查更快一些。

UPDATE: Cary Swoveland found an issue when the array contains duplicates. I have added a uniq to dedupe the array first although it's not entirely clear what should happen if an element is repeated, should both be removed since they are substrings of each other? I have solved this problem on the assumption that duplicates result in only one item showing in the output, but this could be wrong.

更新:当数组包含重复项时,Cary Swoveland发现了一个问题。我已经添加了一个uniq来重复数组,尽管不完全清楚如果一个元素重复会发生什么,两者都应该被删除,因为它们是彼此的子串?我已经解决了这个问题,假设重复导致输出中只显示一个项目,但这可能是错误的。

#2


2  

It uses less memory, performs less computations. This one will delete substrings both ways, looping will be less. Brought

它使用更少的内存,执行更少的计算。这个将以两种方式删除子串,循环将更少。带

             user       system     total       real
    First    0.000000   0.000000   0.000000 (  0.000076)
    Second   0.010000   0.000000   0.010000 (  0.000037)
    Third    0.000000   0.000000   0.000000 (  0.000019)

Above mentioned is the benchmark results for the 2 algos mentioned above(First and Second) and this one(Third).

上面提到的是上面提到的2个算法(第一个和第二个)和第一个(第三个)的基准结果。

array = ['1 1 1', '1', '1 1', '1 1 1 2', '1 2 3 1', '1 2', '2 3', '1 2 3', '1 1 1']

i1 = 0
arr_len = array.length
last_index = arr_len - 1

while i1 <= last_index
  w1 = array[i1]
  i2 = i1 + 1
  while i2 <= last_index
    w2 = array[i2]
    # If w2 is a subset of w1
    if w1.include? w2
      # Delete from index i2
      array.delete_at(i2)
      # Decrement the array_length as one element is deleted
      arr_len -= 1
      # Decrement last index, as one element is deleted
      last_index -= 1
      next
    end
    # If w1 comes out to be a subset of w2
    if w2.include? w1
      # Delete the value from that index
      array.delete_at(i1)
      # Decrement the array_length as one element is deleted
      arr_len -= 1
      # Decrement last index, as one element is deleted
      last_index -= 1
      # Reset value of w1 as it is deleted in this operation
      w1 = array[i1]
      # Reset index of 2nd loop to start matching again
      i2 = i1 + 1
      # Move next from here only
      next
    end
    i2 += 1
  end
  i1 += 1
end

#3


1  

Here's a way that removes substrings as they are found.

这是一种在找到子字符串时删除子字符串的方法。

a = ['1', '1 1', '1 1 1', '1 1 1 2', '1 2 3 1', '1 2', '2 3']

b = a.dup
b.size.times do
  first, *rest = b
  (rest.any? { |t| t.include? first }) ? b.shift : b.rotate!
end
b #=> ["1 1 1 2", "1 2 3 1"]

To see what's happening, insert

要查看发生了什么,请插入

puts "first=\"#{first}\n, rest=#{rest}"

after first,*rest = b. That prints the following (before I reformatted).

首先,* rest = b。这打印出以下内容(在我重新格式化之前)。

first="1",       rest=["1 1", "1 1 1", "1 1 1 2", "1 2 3 1", "1 2", "2 3"]
first="1 1",     rest=["1 1 1", "1 1 1 2", "1 2 3 1", "1 2", "2 3"]
first="1 1 1",   rest=["1 1 1 2", "1 2 3 1", "1 2", "2 3"]
first="1 1 1 2", rest=["1 2 3 1", "1 2", "2 3"]
first="1 2 3 1", rest=["1 2", "2 3", "1 1 1 2"]
first="1 2",     rest=["2 3", "1 1 1 2", "1 2 3 1"]
first="2 3",     rest=["1 1 1 2", "1 2 3 1"]

#1


3  

This approach uses some array math to remove itself from the array and then checks to see if it shows up as a substring. I have no idea how performant this is.

这种方法使用一些数组数学从数组中删除自己,然后检查它是否显示为子字符串。我不知道这是多么高效。

a = ['1', '1 1', '1 1 1', '1 1 1 2', '1 2 3 1', '1 2', '2 3']
a.uniq.delete_if { |i| (a-[i]).any? {|j| j.include? i } }

I changed to using a delete_if as it will improve performance as you are shortening your array anytime a substring is found making subsequent checks slightly faster.

我改为使用delete_if,因为它可以提高性能,因为您可以在找到子字符串时缩短数组,从而使后续检查更快一些。

UPDATE: Cary Swoveland found an issue when the array contains duplicates. I have added a uniq to dedupe the array first although it's not entirely clear what should happen if an element is repeated, should both be removed since they are substrings of each other? I have solved this problem on the assumption that duplicates result in only one item showing in the output, but this could be wrong.

更新:当数组包含重复项时,Cary Swoveland发现了一个问题。我已经添加了一个uniq来重复数组,尽管不完全清楚如果一个元素重复会发生什么,两者都应该被删除,因为它们是彼此的子串?我已经解决了这个问题,假设重复导致输出中只显示一个项目,但这可能是错误的。

#2


2  

It uses less memory, performs less computations. This one will delete substrings both ways, looping will be less. Brought

它使用更少的内存,执行更少的计算。这个将以两种方式删除子串,循环将更少。带

             user       system     total       real
    First    0.000000   0.000000   0.000000 (  0.000076)
    Second   0.010000   0.000000   0.010000 (  0.000037)
    Third    0.000000   0.000000   0.000000 (  0.000019)

Above mentioned is the benchmark results for the 2 algos mentioned above(First and Second) and this one(Third).

上面提到的是上面提到的2个算法(第一个和第二个)和第一个(第三个)的基准结果。

array = ['1 1 1', '1', '1 1', '1 1 1 2', '1 2 3 1', '1 2', '2 3', '1 2 3', '1 1 1']

i1 = 0
arr_len = array.length
last_index = arr_len - 1

while i1 <= last_index
  w1 = array[i1]
  i2 = i1 + 1
  while i2 <= last_index
    w2 = array[i2]
    # If w2 is a subset of w1
    if w1.include? w2
      # Delete from index i2
      array.delete_at(i2)
      # Decrement the array_length as one element is deleted
      arr_len -= 1
      # Decrement last index, as one element is deleted
      last_index -= 1
      next
    end
    # If w1 comes out to be a subset of w2
    if w2.include? w1
      # Delete the value from that index
      array.delete_at(i1)
      # Decrement the array_length as one element is deleted
      arr_len -= 1
      # Decrement last index, as one element is deleted
      last_index -= 1
      # Reset value of w1 as it is deleted in this operation
      w1 = array[i1]
      # Reset index of 2nd loop to start matching again
      i2 = i1 + 1
      # Move next from here only
      next
    end
    i2 += 1
  end
  i1 += 1
end

#3


1  

Here's a way that removes substrings as they are found.

这是一种在找到子字符串时删除子字符串的方法。

a = ['1', '1 1', '1 1 1', '1 1 1 2', '1 2 3 1', '1 2', '2 3']

b = a.dup
b.size.times do
  first, *rest = b
  (rest.any? { |t| t.include? first }) ? b.shift : b.rotate!
end
b #=> ["1 1 1 2", "1 2 3 1"]

To see what's happening, insert

要查看发生了什么,请插入

puts "first=\"#{first}\n, rest=#{rest}"

after first,*rest = b. That prints the following (before I reformatted).

首先,* rest = b。这打印出以下内容(在我重新格式化之前)。

first="1",       rest=["1 1", "1 1 1", "1 1 1 2", "1 2 3 1", "1 2", "2 3"]
first="1 1",     rest=["1 1 1", "1 1 1 2", "1 2 3 1", "1 2", "2 3"]
first="1 1 1",   rest=["1 1 1 2", "1 2 3 1", "1 2", "2 3"]
first="1 1 1 2", rest=["1 2 3 1", "1 2", "2 3"]
first="1 2 3 1", rest=["1 2", "2 3", "1 1 1 2"]
first="1 2",     rest=["2 3", "1 1 1 2", "1 2 3 1"]
first="2 3",     rest=["1 1 1 2", "1 2 3 1"]