Eliminating duplicates in Python using a dictionary

Time: 2021-02-07 07:38:29

I'm trying to eliminate duplicate tweets from a CouchDB database. I want to eliminate the retweeted tweets by using the retweeted_status ID field. Here is my code; it doesn't work and returns the error "string indices must be characters". Any help will be appreciated.

# initialize a dictionary of tweet ids
# the first time an id is found, put it into the dict as a key (with value 1 (not used))
uniqueIDs = {}

numtweets = len(search_results)
numdeleted = 0

for tweet in search_results:
    # find retweeted_status
    if 'retweeted_status' in tweet.keys():
        retweetID = [retweeted_status[id] for retweeted_status in tweet ['retweeted_status']]
        #tweetID = retweeted_status['id']
        # get the tweetid from the keys
        #tweetID = tweet['id']
        # if it is already in the id dictionary then delete this one
        if retweetID in uniqueIDs.keys():
            db.delete(tweet)
            numdeleted += 1
        # otherwise add it to the unique ids
        else:
            uniqueIDs[retweetID] = 1
    else:
        # reduce the count if we skipped one
        numtweets -= 1


print "Number of tweets at beginning = ", numtweets
print "Number of tweets deleted = ", numdeleted

2 solutions

#1


You're declaring retweetID as a list, but then using it as a single value. Instead of using [... for ... in ...], you should just loop through tweet['retweeted_status']. You'd have something like this:

if 'retweeted_status' in tweet:  # Note, don't need .keys()
    for retweetID in tweet['retweeted_status']:
        if retweetID in uniqueIDs:  # Again, don't need .keys()
            ...
        else:
            uniqueIDs[retweetID] = 1
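
One thing worth checking before looping: if retweeted_status in your documents is a single nested tweet object (a dict), as it usually is in Twitter API payloads, then iterating over it yields its keys, which are strings. Indexing one of those strings with anything other than an integer raises the TypeError from the question. Here is a minimal sketch under that assumption; the sample tweet and the helper name indexing_key_fails are made up for illustration.

```python
# Hypothetical sample document; field names follow the question's code.
tweet = {"id": 2, "retweeted_status": {"id": 1, "text": "hello"}}

# Iterating over a dict yields its keys, which are strings here.
keys = list(tweet["retweeted_status"])


def indexing_key_fails(key):
    """Return True if indexing a string key with 'id' raises TypeError."""
    try:
        key["id"]
    except TypeError:
        return True
    return False


fails = indexing_key_fails(keys[0])

# If retweeted_status is a dict, the original tweet's ID is a direct lookup.
retweet_id = tweet["retweeted_status"]["id"]
```

If that matches your data, retweet_id is a single value and can be used as the dictionary key directly, with no list comprehension or inner loop.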

#2


 retweetID = [retweeted_status[id] for retweeted_status in tweet ['retweeted_status']]

The above line sets retweetID to a list. Maybe the following is what you wanted to do.

for tweet in search_results:
    # find retweeted_status
    if 'retweeted_status' in tweet:
        # assumes tweet['retweeted_status'] is a list of dicts, each with an 'id' key
        retweetIDs = [retweeted_status['id'] for retweeted_status in tweet['retweeted_status']]
        for tweet_id in retweetIDs:
            # if it is already in the id dictionary then delete this one
            if tweet_id in uniqueIDs:
                db.delete(tweet)
                numdeleted += 1
            # otherwise add it to the unique ids
            else:
                uniqueIDs[tweet_id] = 1
    else:
        # reduce the count if we skipped one
        numtweets -= 1

This checks whether each of those retweet IDs is already in uniqueIDs. If you just want the unique tweet IDs, you can use a set instead of a dictionary.

unique_ids = set()
for tweet in search_results:
    if 'retweeted_status' not in tweet:
        continue
    # assumes tweet['retweeted_status'] is a list of dicts, each with an 'id' key
    retweetIDs = [retweeted_status['id'] for retweeted_status in tweet['retweeted_status']]
    unique_ids.update(retweetIDs)
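
For completeness, here is a self-contained sketch of the set-based approach that also picks out which documents are duplicates. It assumes each retweeted_status is a single dict with an 'id' key rather than a list; the function name find_duplicate_retweets and the sample data are hypothetical and not part of any library.

```python
def find_duplicate_retweets(tweets):
    """Return tweets whose retweeted_status ID was already seen earlier."""
    seen_ids = set()
    duplicates = []
    for tweet in tweets:
        original = tweet.get("retweeted_status")
        if not isinstance(original, dict):
            continue  # not a retweet, or an unexpected shape: skip it
        retweet_id = original["id"]
        if retweet_id in seen_ids:
            duplicates.append(tweet)
        else:
            seen_ids.add(retweet_id)
    return duplicates


# Hypothetical sample data standing in for search_results.
sample = [
    {"id": 10, "retweeted_status": {"id": 1}},
    {"id": 11, "retweeted_status": {"id": 1}},
    {"id": 12},
    {"id": 13, "retweeted_status": {"id": 2}},
]
duplicates = find_duplicate_retweets(sample)
```

Each document in duplicates could then be passed to db.delete(tweet) as in the question. Keeping the detection separate from the database calls makes it easy to test on plain dicts first.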
