I'm trying to eliminate duplicate tweets from a CouchDB database. I want to remove the retweeted tweets by using the ID field of retweeted_status. Here is my code; it doesn't work and returns the error "string indices must be characters". Any help will be appreciated.
# initialize a dictionary of tweet ids
# the first time an id is found, put it into the dict as a key (with value 1 (not used))
uniqueIDs = {}
numtweets = len(search_results)
numdeleted = 0
for tweet in search_results:
    # find retweeted_status
    if 'retweeted_status' in tweet.keys():
        retweetID = [retweeted_status[id] for retweeted_status in tweet ['retweeted_status']]
        #tweetID = retweeted_status['id']
        # get the tweetid from the keys
        #tweetID = tweet['id']
        # if it is already in the id dictionary then delete this one
        if retweetID in uniqueIDs.keys():
            db.delete(tweet)
            numdeleted += 1
        # otherwise add it to the unique ids
        else:
            uniqueIDs[retweetID] = 1
    else:
        # reduce the count if we skipped one
        numtweets -= 1
print "Number of tweets at beginning = ", numtweets
print "Number of tweets deleted = ", numdeleted
2 Answers
#1
You're building retweetID as a list, but then using it as a single value. Instead of using [... for ... in ...], you should just loop through tweet['retweeted_status']. You'd have something like this:
if 'retweeted_status' in tweet:  # Note, don't need .keys()
    for retweetID in tweet['retweeted_status']:
        if retweetID in uniqueIDs:  # Again, don't need .keys()
            ...
        else:
            uniqueIDs[retweetID] = 1
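To illustrate the list-versus-single-value point, here is a minimal sketch. The ID value 12345 and the variable names are made up for the example, and it is written to run on Python 3 as well as Python 2. It shows that a list cannot be stored as a dictionary key, while a single ID can.

```python
unique_ids = {}
retweet_id_list = [12345]  # what a list comprehension produces: a list
retweet_id = 12345         # a single ID value (made-up number)

# A list is unhashable, so using it as a dictionary key raises TypeError.
try:
    unique_ids[retweet_id_list] = 1
    list_key_error = None
except TypeError as exc:
    list_key_error = str(exc)

# A single ID works as a key, and membership can be tested without .keys().
unique_ids[retweet_id] = 1
print(list_key_error)
print(retweet_id in unique_ids)
```

The exact wording of the TypeError message depends on the Python version, so the sketch only prints it.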
#2
retweetID = [retweeted_status[id] for retweeted_status in tweet ['retweeted_status']]
The above line sets retweetID to a list. Maybe the following is what you wanted to do:
for tweet in search_results:
    # find retweeted_status
    if 'retweeted_status' in tweet.keys():
        retweetIDs = [retweeted_status[id] for retweeted_status in tweet ['retweeted_status']]
        #tweetID = retweeted_status['id']
        # get the tweetid from the keys
        #tweetID = tweet['id']
        # if it is already in the id dictionary then delete this one
        for tweet_id in retweetIDs:
            if tweet_id in uniqueIDs.keys():
                db.delete(tweet)
                numdeleted += 1
            # otherwise add it to the unique ids
            else:
                uniqueIDs[tweet_id] = 1
    else:
        # reduce the count if we skipped one
        numtweets -= 1
This checks whether each of those retweet IDs is already in uniqueIDs. If you just want the unique tweet IDs, you can use a set instead of a dictionary:
unique_ids = set()
for tweet in search_results:
    if 'retweeted_status' not in tweet.keys():
        continue
    retweetIDs = [retweeted_status[id] for retweeted_status in tweet ['retweeted_status']]
    unique_ids.update(retweetIDs)
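One more possibility, which the question does not confirm: in Twitter API payloads, retweeted_status is normally a single embedded tweet object (a dict), not a list. If that is the case here, iterating over it yields its key strings, and indexing a string with the builtin id function (written without quotes) would raise a "string indices must be integers" TypeError, which resembles the reported error. Below is a rough sketch under that assumption. FakeDB is a hypothetical in-memory stand-in for the real database handle, and dedupe_retweets and the sample data are made up for illustration.

```python
class FakeDB(object):
    """Hypothetical in-memory stand-in for the CouchDB database handle."""

    def __init__(self):
        self.deleted = []

    def delete(self, doc):
        # Record the document instead of touching a real database.
        self.deleted.append(doc)


def dedupe_retweets(search_results, db):
    """Delete every retweet whose original tweet ID was already seen.

    Assumes tweet['retweeted_status'] is a dict holding the original
    tweet, with its ID stored under the string key 'id'.
    """
    seen_ids = set()
    num_deleted = 0
    for tweet in search_results:
        original = tweet.get('retweeted_status')
        if original is None:
            continue  # not a retweet, nothing to compare
        original_id = original['id']  # string key 'id', not the builtin id
        if original_id in seen_ids:
            db.delete(tweet)
            num_deleted += 1
        else:
            seen_ids.add(original_id)
    return num_deleted


# Made-up sample data: two retweets of tweet 100, one plain tweet,
# and one retweet of tweet 200.
sample = [
    {'id': 1, 'retweeted_status': {'id': 100}},
    {'id': 2, 'retweeted_status': {'id': 100}},
    {'id': 3},
    {'id': 4, 'retweeted_status': {'id': 200}},
]
db = FakeDB()
print(dedupe_retweets(sample, db))
```

With the real database you would pass your own db object instead of FakeDB; the set only needs to remember which original IDs have been seen, so no dummy value of 1 is required.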