Removing duplicates from an array before saving

Time: 2022-02-06 00:06:12

I periodically fetch the latest tweets with a certain hashtag and save them locally. In order to prevent saving duplicates, I use the method below. Unfortunately, it does not seem to be working... so what's wrong with this code:

    def remove_duplicates
      before = @tweets.size
      # Drop any fetched tweet whose twitter_id is already in the database
      @tweets.delete_if do |tweet|
        !Tweet.all(:conditions => { :twitter_id => tweet.twitter_id }).empty?
      end
      duplicates = before - @tweets.size
      puts "#{duplicates} duplicates found"
    end

Where @tweets is an array of Tweet objects fetched from Twitter. I'd appreciate any solution that works, and especially one that might be more elegant...

4 solutions

#1


You can use validates_uniqueness_of :twitter_id in the Tweet model (which is where this code should live). This will cause duplicates to fail to save.

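A minimal sketch of what that might look like, assuming a Rails-style ActiveRecord model named Tweet (the model and column names come from the question; the save loop is illustrative):

    class Tweet < ActiveRecord::Base
      # Refuses to save a record whose twitter_id is already in the table,
      # so duplicates simply fail validation instead of being inserted.
      validates_uniqueness_of :twitter_id
    end

    # save returns false on a failed validation rather than raising,
    # so duplicate tweets are skipped quietly here.
    @tweets.each { |tweet| tweet.save }

Note that validates_uniqueness_of performs a check-then-insert and can still race between concurrent writers; pairing it with a unique database index (as #4 ends up doing) closes that gap.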

#2


Since it sounds like you're using the Twitter search API, a better solution is to use the since_id parameter. Keep track of the last Twitter status id you got from your previous query and use that as the since_id parameter on your next query.

More information is available at Twitter Search API Method: search

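A rough sketch of that bookkeeping, assuming the old JSON search endpoint (search.twitter.com, long since retired) and its results/id response fields; the method name and how @since_id is persisted are illustrative:

    require 'net/http'
    require 'json'
    require 'uri'

    # Fetch only tweets newer than the newest status id seen last time.
    # @since_id should be persisted between runs (a file, a DB column, ...).
    def fetch_new_tweets(hashtag)
      uri = URI('http://search.twitter.com/search.json')
      uri.query = URI.encode_www_form(:q => "##{hashtag}", :since_id => @since_id.to_i)
      results = JSON.parse(Net::HTTP.get(uri))['results'] || []
      # Remember the highest id so the next query skips everything seen so far.
      @since_id = results.map { |t| t['id'] }.max || @since_id
      results
    end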

#3


array.uniq!

Removes duplicate elements from self. Returns nil if no changes are made (that is, no duplicates are found).

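Note that a bare uniq! compares whole objects, and two Tweet instances built from the same status are still distinct objects; to deduplicate by id, the block form (Ruby 1.9.2 or newer) can key the comparison on twitter_id. Either way, this only removes in-memory duplicates and does not check what is already in the database:

    # Keyed on twitter_id, so two Tweet objects wrapping the same status
    # count as duplicates. Returns nil when nothing was removed.
    @tweets.uniq! { |tweet| tweet.twitter_id }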

#4


Ok, turns out the problem was of a somewhat different nature: when looking closer into it, I found out that multiple Tweets were saved with the twitter_id 2147483647... This is the upper limit for integer fields :)

Changing the field to bigint solved the problem. It took me very long to figure out, since MySQL failed silently and just clamped every overflowing value to the maximum for as long as it could (until I added the unique index). I quickly tried it with Postgres, which returned a nice "integer out of range" error, and that pointed me to the real cause of the problem.

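A hypothetical migration along those lines (the tweets table name is assumed): :limit => 8 makes ActiveRecord emit a BIGINT on MySQL, and the unique index is what turns silent duplicates into hard errors:

    class WidenTwitterId < ActiveRecord::Migration
      def self.up
        # 8 bytes => BIGINT; the default 4-byte integer tops out at
        # 2147483647, exactly the value that kept showing up.
        change_column :tweets, :twitter_id, :integer, :limit => 8
        add_index :tweets, :twitter_id, :unique => true
      end

      def self.down
        remove_index :tweets, :twitter_id
        change_column :tweets, :twitter_id, :integer
      end
    end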

Thanks Ben for the validation and indexing tips; they make for much cleaner code now!
