I am writing a Python search engine that reads a list of jokes, tokenizes the words, creates an index (keys: words, values: ids of the documents each word appears in), and then searches the index. The tokenization and search methods work fine, but my indexing method is giving me issues and I can't figure out why.
Code
def create_index(tokens):
    lst = -1
    duplicates =[]
    index = {}
    docnum = len(tokens)
    for list in tokens:
        lst += 1
        xst = -1
        for x in list:
            xst += 1
            element = tokens[lst][xst]
            value = ""
            elin = 0
            if element not in duplicates:
                for r in range(docnum):
                    if element in tokens[r]:
                        if elin > 0:
                            value += " "
                        value += "%d" % r
                        elin += 1
                        print(tokens[r])
                        print(r)
                if len(value) > 1:
                    index[element] = value.split()
                    duplicates += element
                else:
                    index[element] = value
                    duplicates += element
    return index
The print statements are there just so I could see where the problem lies, and indeed I saw that certain r values in the for loop are being skipped entirely. I'm pretty sure I'm not modifying what I'm iterating over, which I believe is what usually causes for loops to skip.
Input (tokens)
[['What', 'did', 'the', 'little', 'boy', 'tell', 'the', 'game', 'warden', 'His', 'dad', 'was', 'in', 'the', 'kitchen', 'poaching', 'eggs'], ['What', 'do', 'you', 'call', 'a', 'chicken', 'crossing', 'the', 'road', 'Poultry', 'in', 'motion'], ['What', 'do', 'you', 'call', 'it', 'when', 'a', 'cat', 'sues', 'another', 'cat', 'A', 'Clawsuit'], ['What', 'does', 'an', 'envelope', 'say', 'when', 'you', 'lick', 'it', 'Nothing', 'It', 'just', 'shuts', 'up'], ['How', 'can', 'you', 'tell', 'the', 'ocean', 'is', 'friendly', 'It', 'waves'], ['What', 's', 'black', 'white', 'green', 'and', 'bumpy', 'A', 'pickle', 'wearing', 'a', 'tuxedo'], ['When', 'was', 'meat', 'so', 'high', 'When', 'the', 'cow', 'jumped', 'over', 'the', 'moon'], ['What', 'happened', 'to', 'the', 'wind', 'It', 'blew', 'away'], ['What', 'starts', 'with', 'T', 'is', 'full', 'of', 'T', 'and', 'ends', 'with', 'T', 'A', 'teapot'], ['What', 'is', 'a', 'hermit', 'A', 'girl', 's', 'baseball', 'glove'], ['What', 'does', 'a', 'television', 'have', 'in', 'common', 'with', 'a', 'rabbit', 'His', 'ears'], ['What', 'did', 'the', 'crop', 'say', 'to', 'the', 'farmer', 'Why', 'are', 'you', 'always', 'picking', 'on', 'me'], ['What', 'did', 'the', 'guy', 'say', 'when', 'he', 'walked', 'into', 'the', 'bar', 'Ouch'], ['How', 'is', 'a', 'locksmith', 'like', 'a', 'typewritter', 'They', 'both', 'have', 'a', 'lot', 'of', 'keys'], ['What', 'has', 'four', 'legs', 'and', 'goes', 'booo', 'A', 'cow', 'with', 'a', 'cold'], ['What', 'is', 'a', 'caterpillar', 'afraid', 'of', 'A', 'dogerpillar'], ['Who', 'has', 'the', 'strongest', 'underwear', 'Arnold', 'Shortsineger'], ['Why', 'did', 'the', 'elephant', 'decide', 'not', 'to', 'move', 'Because', 'he', 'couldn', 't', 'lift', 'his', 'trunk'], ['Which', 'are', 'the', 'stronger', 'days', 'of', 'the', 'week', 'Saturday', 'and', 'Sunday', 'The', 'rest', 'are', 'weekdays'], ['Which', 'runs', 'faster', 'hot', 'or', 'cold', 'Hot', 'Everyone', 'can', 'catch', 'a', 'cold'], ['Why', 'did', 'the', 
'strawberry', 'cross', 'the', 'road', 'Because', 'his', 'mother', 'was', 'in', 'a', 'jam'], ['How', 'do', 'you', 'keep', 'a', 'lion', 'from', 'charging', 'Take', 'away', 'its', 'credit', 'cards'], ['What', 'do', 'you', 'call', 'a', 'cow', 'with', 'a', 'twitch', 'Beef', 'jerky'], ['What', 'is', 'the', 'best', 'way', 'to', 'keep', 'water', 'from', 'running', 'Don', 't', 'pay', 'the', 'water', 'bill'], ['Where', 'do', 'cows', 'go', 'to', 'have', 'fun', 'The', 'moovies'], ['What', 'time', 'was', 'is', 'when', 'the', 'elephant', 'sat', 'on', 'a', 'chair', 'Time', 'to', 'get', 'a', 'new', 'chair'], ['What', 'did', 'the', 'flower', 'say', 'to', 'the', 'bike', 'Petal'], ['Why', 'do', 'we', 'not', 'tell', 'secrets', 'in', 'a', 'corn', 'patch', 'Too', 'many', 'ears'], ['What', 's', 'yellow', 'and', 'writes', 'A', 'ball', 'point', 'banana'], ['Did', 'people', 'laugh', 'when', 'the', 'lady', 'fell', 'on', 'the', 'ice', 'No', 'but', 'the', 'ice', 'cracked', 'up'], ['What', 'word', 'is', 'always', 'spelled', 'incorrectly', 'Incorrectly'], ['What', 'should', 'you', 'do', 'if', 'your', 'dog', 'is', 'missing', 'Check', 'the', 'Lost', 'and', 'Hound'], ['What', 'have', 'you', 'seen', 'that', 'you', 'will', 'never', 'see', 'again', 'Yesterday'], ['Why', 'did', 'the', 'turkey', 'cross', 'the', 'road', 'To', 'get', 'to', 'the', 'chicken'], ['Why', 'was', 'Cinderella', 'Late', 'for', 'the', 'ball', 'She', 'forgot', 'to', 'swing', 'the', 'bat'], ['How', 'can', 'you', 'tell', 'there', 's', 'a', 'hippo', 'in', 'your', 'oven', 'The', 'oven', 'door', 'won', 't', 'close'], ['What', 'do', 'you', 'get', 'when', 'you', 'cross', 'a', 'shark', 'and', 'Flipper', 'A', 'fat', 'shark'], ['What', 'did', 'the', 'lamp', 'say', 'to', 'the', 'other', 'lamp', 'You', 'turn', 'me', 'on'], ['What', 'did', 'the', 'sidewalk', 'do', 'when', 'he', 'heard', 'a', 'funny', 'joke', 'He', 'cracked', 'up'], ['Why', 'did', 'the', 'chicken', 'cross', 'the', 'road', 'He', 'wanted', 'to', 'prove', 'to', 'the', 'armadillo', 
'that', 'it', 'could', 'be', 'done'], ['Why', 'did', 'the', 'dalmation', 'need', 'glasses', 'He', 'was', 'seeing', 'spots'], ['What', 'did', 'the', 'book', 'say', 'to', 'the', 'page', 'Don', 't', 'turn', 'away', 'from', 'me'], ['What', 'do', 'you', 'call', 'a', 'pig', 'in', 'a', 'butcher', 'shop', 'A', 'pork', 'chop'], ['In', 'France', 'what', 'do', 'frogs', 'eat', 'French', 'Flies'], ['What', 'is', 'yellow', 'and', 'wears', 'a', 'mask', 'The', 'Lone', 'Lemon'], ['What', 'do', 'astronauts', 'eat', 'for', 'dinner', 'Launch', 'meat'], ['Why', 'is', 'the', 'little', 'duck', 'always', 'so', 'sad', 'Because', 'he', 'always', 'sees', 'a', 'bill', 'in', 'front', 'of', 'his', 'face'], ['What', 'did', 'they', 'digital', 'clock', 'say', 'to', 'its', 'mom', 'Look', 'mom', 'no', 'hands'], ['What', 'is', 'the', 'best', 'way', 'to', 'raise', 'a', 'child', 'In', 'an', 'elevator'], ['What', 'is', 'always', 'behind', 'the', 'time', 'The', 'back', 'of', 'the', 'clock'], ['What', 's', 'green', 'and', 'sings', 'Elvis', 'Parsley'], ['What', 'has', '10', 'letters', 'and', 'starts', 'with', 'gas', 'An', 'automobile'], ['Why', 'did', 'they', 'bury', 'the', 'battery', 'Because', 'it', 'was', 'dead'], ['Why', 'did', 'the', 'chicken', 'go', 'to', 'the', 'library', 'To', 'check', 'out', 'a', 'bawk', 'bawk', 'bawkbawk'], ['If', 'a', 'woodchuck', 'had', 'a', 'name', 'what', 'would', 'it', 'be', 'Chuck', 'Wood'], ['How', 'do', 'you', 'catch', 'a', 'squirrel', 'Climb', 'a', 'tree', 'and', 'act', 'like', 'a', 'nut'], ['What', 'did', 'one', 'penny', 'say', 'to', 'the', 'other', 'Let', 's', 'get', 'together', 'and', 'make', 'some', 'sense'], ['Why', 'don', 't', 'lobsters', 'share', 'Because', 'they', 'are', 'shellfish'], ['Why', 'does', 'the', 'man', 'wish', 'he', 'could', 'be', 'a', 'guitar', 'player', 'in', 'a', 'room', 'full', 'of', 'beautiful', 'girls', 'Because', 'if', 'he', 'was', 'a', 'guitar', 'player', 'he', 'would', 'have', 'his', 'pick'], ['What', 'is', 'a', 'cat', 's', 'favorite', 
'color', 'Purrple'], ['Why', 'did', 'the', 'ghost', 'float', 'across', 'the', 'road', 'Because', 'he', 'couldn', 't', 'walk'], ['What', 'kind', 'of', 'star', 'could', 'hurt', 'you', 'A', 'shooting', 'star']]
Output (shortened)
['What', 'did', 'the', 'little', 'boy', 'tell', 'the', 'game', 'warden', 'His', 'dad', 'was', 'in', 'the', 'kitchen', 'poaching', 'eggs']
0
['What', 'do', 'you', 'call', 'a', 'chicken', 'crossing', 'the', 'road', 'Poultry', 'in', 'motion']
1
['What', 'do', 'you', 'call', 'it', 'when', 'a', 'cat', 'sues', 'another', 'cat', 'A', 'Clawsuit']
2
['What', 'does', 'an', 'envelope', 'say', 'when', 'you', 'lick', 'it', 'Nothing', 'It', 'just', 'shuts', 'up']
3
['What', 's', 'black', 'white', 'green', 'and', 'bumpy', 'A', 'pickle', 'wearing', 'a', 'tuxedo']
5
['What', 'happened', 'to', 'the', 'wind', 'It', 'blew', 'away']
7
['What', 'starts', 'with', 'T', 'is', 'full', 'of', 'T', 'and', 'ends', 'with', 'T', 'A', 'teapot']
8
['What', 'is', 'a', 'hermit', 'A', 'girl', 's', 'baseball', 'glove']
9
['What', 'does', 'a', 'television', 'have', 'in', 'common', 'with', 'a', 'rabbit', 'His', 'ears']
10
['What', 'did', 'the', 'crop', 'say', 'to', 'the', 'farmer', 'Why', 'are', 'you', 'always', 'picking', 'on', 'me']
11
['What', 'did', 'the', 'guy', 'say', 'when', 'he', 'walked', 'into', 'the', 'bar', 'Ouch']
12
['What', 'has', 'four', 'legs', 'and', 'goes', 'booo', 'A', 'cow', 'with', 'a', 'cold']
14
['What', 'is', 'a', 'caterpillar', 'afraid', 'of', 'A', 'dogerpillar']
15
['What', 'do', 'you', 'call', 'a', 'cow', 'with', 'a', 'twitch', 'Beef', 'jerky']
22
['What', 'is', 'the', 'best', 'way', 'to', 'keep', 'water', 'from', 'running', 'Don', 't', 'pay', 'the', 'water', 'bill']
23
['What', 'time', 'was', 'is', 'when', 'the', 'elephant', 'sat', 'on', 'a', 'chair', 'Time', 'to', 'get', 'a', 'new', 'chair']
25
['What', 'did', 'the', 'flower', 'say', 'to', 'the', 'bike', 'Petal']
26
['What', 's', 'yellow', 'and', 'writes', 'A', 'ball', 'point', 'banana']
28
['What', 'word', 'is', 'always', 'spelled', 'incorrectly', 'Incorrectly']
30
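One possible reading of this output, assuming the two print calls sit inside the `if element in tokens[r]:` block (the post's indentation cannot be recovered with certainty): the loop reaches every r, but prints only for documents that contain the current word. The numbers shown (0, 1, 2, 3, 5, 7, ...) are the documents that contain 'What'. A minimal sketch with made-up data:

```python
# Hypothetical three-document corpus, used only for illustration.
tokens = [["What", "did"], ["How", "can"], ["What", "is"]]

visited = []   # every r the loop actually reaches
printed = []   # the r values a conditional print would show

for r in range(len(tokens)):
    visited.append(r)
    if "What" in tokens[r]:
        # A print placed here fires only for matching documents.
        printed.append(r)

print(visited)  # [0, 1, 2]
print(printed)  # [0, 2]
```

Under that reading, no iteration is skipped; document 1 is simply never printed because it does not contain the word.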
With the input being a list of lists, the idea is to search through it and add every unique word as a key, with the value being the documents it appears in. However, the index produced
Index
{'walked': ['12'], 'across': ['60'], 'incorrectly': ['30'], 'Nothing': '3', 'keys': ['13'], 'frogs': ['43'], 'meat': ['6', '45'], 'be': ['39', '54', '58'], 'mother': ['20'], 'Petal': ['26'], 'typewritter': ['13'], 'see': ['32'], 'keep': ['21', '23'], 'jam': ['20'], 'black': '5', 'lick': '3', 'girls': ['58'], 'sues': '2', 'sidewalk': ['38'], 'book': ['41'], 'strongest': ['16'], 'shop': ['42'], 'favorite': ['59'], 'A': ['2', '5', '8', '9', '14', '15', '28', '36', '42', '61'], 'seen': ['32'], 'Did': ['29'], 'glove': '9', 'hands': ['47'], 'room': ['58'], 'patch': ['27'], 'T': '8', 'legs': ['14'], 'hippo': ['35'], 'his': ['17', '20', '46', '58'], 'cows': ['24'], 'new': ['25'], 'cards': ['21'], 'you': ['1', '2', '3', '4', '11', '21', '22', '31', '32', '35', '36', '42', '55', '61'], 'Look': ['47'], 'tree': ['55'], 'tell': ['0', '4', '27', '35'], 'missing': ['31'], 'front': ['46'], 'lion': ['21'], 'automobile': ['51'], 'eggs': '0', 'spots': ['40'], 'water': ['23'], 'full': ['8', '58'], 'moon': '6', 'need': ['40'], 'When': '6', 'moovies': ['24'], 'rest': ['18'], 'cat': ['2', '59'], 'boy': '0', 'fell': ['29'], 'hermit': '9', 'dogerpillar': ['15'], 'sings': ['50'], 'couldn': ['17', '60'], 'dad': '0', 'clock': ['47', '49'], 'turkey': ['33'], 'sat': ['25'], 'He': ['38', '39', '40'], 'bike': ['26'], 'seeing': ['40'], 'elevator': ['48'], 'Parsley': ['50'], 'secrets': ['27'], 'They': ['13'], 'flower': ['26'], 'both': ['13'], 'close': ['35'], 'bill': ['23', '46'], 'crossing': '1', 'glasses': ['40'], 'bawkbawk': ['53'], 'many': ['27'], 'penny': ['56'], 'guy': ['12'], 'chop': ['42'], 'had': ['54'], 'oven': ['35'], 'are': ['11', '18', '57'], 'jerky': ['22'], 'some': ['56'], 'Launch': ['45'], 'Ouch': ['12'], 'An': ['51'], 'Where': ['24'], 'caterpillar': ['15'], 'faster': ['19'], 'do': ['1', '2', '21', '22', '24', '27', '31', '36', '38', '42', '43', '45', '55'], 'lamp': ['37'], 'white': '5', 'Hot': ['19'], 'friendly': '4', 'pick': ['58'], 'me': ['11', '37', '41'], 'high': '6', 'always': 
['11', '30', '46', '49'], 'crop': ['11'], 'face': ['46'], 'In': ['43', '48'], 'sad': ['46'], 'spelled': ['30'], 'poaching': '0', 'color': ['59'], 'twitch': ['22'], 'To': ['33', '53'], 'door': ['35'], 'Lemon': ['44'], 'jumped': '6', 'lobsters': ['57'], 'raise': ['48'], 'armadillo': ['39'], 'locksmith': ['13'], 'bawk': ['53'], 'It': ['3', '4', '7'], 'astronauts': ['45'], 'Flies': ['43'], 'Too': ['27'], 'digital': ['47'], 'Because': ['17', '20', '46', '52', '57', '58', '60'], 'joke': ['38'], 'ocean': '4', 'make': ['56'], 'hurt': ['61'], 'go': ['24', '53'], 'four': ['14'], 'with': ['8', '10', '14', '22', '51'], 'You': ['37'], 'laugh': ['29'], 'Don': ['23', '41'], 'Hound': ['31'], 'tuxedo': '5', 'farmer': ['11'], 'yellow': ['28', '44'], 'that': ['32', '39'], 'Wood': ['54'], 'if': ['31', '58'], 'eat': ['43', '45'], 'star': ['61'], 'happened': '7', 'catch': ['19', '55'], 'again': ['32'], 'check': ['53'], 'trunk': ['17'], 'together': ['56'], 'what': ['43', '54'], 'ice': ['29'], 'cross': ['20', '33', '36', '39'], 'banana': ['28'], 'mask': ['44'], 'Time': ['25'], 'Incorrectly': ['30'], 'underwear': ['16'], 'goes': ['14'], 'Cinderella': ['34'], 'they': ['47', '52', '57'], 'does': ['3', '10', '58'], 'child': ['48'], 'is': ['4', '8', '9', '13', '15', '23', '25', '30', '31', '44', '46', '48', '49', '59'], 'lady': ['29'], 'game': '0', 'the': ['0', '1', '4', '6', '7', '11', '12', '16', '17', '18', '20', '23', '25', '26', '29', '31', '33', '34', '37', '38', '39', '40', '41', '46', '48', '49', '52', '53', '56', '58', '60'], 'chicken': ['1', '33', '39', '53'], 'float': ['60'], 'His': ['0', '10'], 'mom': ['47'], '10': ['51'], 'people': ['29'], 'elephant': ['17', '25'], 'one': ['56'], 'will': ['32'], 'wish': ['58'], 'television': ['10'], 'can': ['4', '19', '35'], 'ends': '8', 'duck': ['46'], 'fat': ['36'], 'chair': ['25'], 'Clawsuit': '2', 'kind': ['61'], 'common': ['10'], 'its': ['21', '47'], 'wind': '7', 'shooting': ['61'], 'best': ['23', '48'], 'time': ['25', '49'], 'Poultry': '1', 
'no': ['47'], 'not': ['17', '27'], 'picking': ['11'], 'won': ['35'], 'sees': ['46'], 'on': ['11', '25', '29', '37'], 'funny': ['38'], 'pig': ['42'], 'of': ['8', '13', '15', '18', '46', '49', '58', '61'], 'behind': ['49'], 'an': ['3', '48'], 'little': ['0', '46'], 'word': ['30'], 'point': ['28'], 'get': ['25', '33', '36', '56'], 'don': ['57'], 'could': ['39', '58', '61'], 'in': ['0', '1', '10', '20', '27', '35', '42', '46', '58'], 'away': ['7', '21', '41'], 'Everyone': ['19'], 'Let': ['56'], 'swing': ['34'], 'it': ['2', '3', '39', '52', '54'], 'Late': ['34'], 'way': ['23', '48'], 'rabbit': ['10'], 'to': ['7', '11', '17', '23', '24', '25', '26', '33', '34', '37', '39', '41', '47', '48', '53', '56'], 'have': ['10', '13', '24', '32', '58'], 'just': '3', 'heard': ['38'], 'over': '6', 'shark': ['36'], 'so': ['6', '46'], 'teapot': '8', 'strawberry': ['20'], 'from': ['21', '23', '41'], 'call': ['1', '2', '22', '42'], 'Lone': ['44'], 'weekdays': ['18'], 'ghost': ['60'], 'bumpy': '5', 'or': ['19'], 'girl': '9', 'Chuck': ['54'], 'guitar': ['58'], 'we': ['27'], 'Why': ['11', '17', '20', '27', '33', '34', '39', '40', '46', '52', '53', '57', '58', '60'], 'bat': ['34'], 'there': ['35'], 'The': ['18', '24', '35', '44', '49'], 'your': ['31', '35'], 'would': ['54', '58'], 'act': ['55'], 'writes': ['28'], 'No': ['29'], 'pickle': '5', 'beautiful': ['58'], 'running': ['23'], 'corn': ['27'], 'week': ['18'], 'wears': ['44'], 'French': ['43'], 'walk': ['60'], 'name': ['54'], 'sense': ['56'], 'never': ['32'], 'Take': ['21'], 'out': ['53'], 'woodchuck': ['54'], 'pay': ['23'], 'shellfish': ['57'], 'cold': ['14', '19'], 'letters': ['51'], 'butcher': ['42'], 'another': '2', 'runs': ['19'], 'ball': ['28', '34'], 'Sunday': ['18'], 'up': ['3', '29', '38'], 'like': ['13', '55'], 'page': ['41'], 'man': ['58'], 'If': ['54'], 'Which': ['18', '19'], 'nut': ['55'], 'share': ['57'], 'say': ['3', '11', '12', '26', '37', '41', '47', '56'], 'How': ['4', '13', '21', '35', '55'], 'Beef': ['22'], 'wanted': 
['39'], 'dalmation': ['40'], 'baseball': '9', 'dinner': ['45'], 'ears': ['10', '27'], 'What': ['0', '1', '2', '3', '5', '7', '8', '9', '10', '11', '12', '14', '15', '22', '23', '25', '26', '28', '30', '31', '32', '36', '37', '38', '41', '42', '44', '45', '47', '48', '49', '50', '51', '56', '59', '61'], 'has': ['14', '16', '51'], 'prove': ['39'], 'Lost': ['31'], 'road': ['1', '20', '33', '39', '60'], 'green': ['5', '50'], 'wearing': '5', 'fun': ['24'], 'move': ['17'], 'cracked': ['29', '38'], 'shuts': '3', 'was': ['0', '6', '20', '25', '34', '40', '52', '58'], 'blew': '7', 'stronger': ['18'], 'Climb': ['55'], 'motion': '1', 'decide': ['17'], 'kitchen': '0', 'turn': ['37', '41'], 'cow': ['6', '14', '22'], 'he': ['12', '17', '38', '46', '58', '60'], 'days': ['18'], 'Yesterday': ['32'], 'envelope': '3', 'Arnold': ['16'], 'hot': ['19'], 'waves': '4', 'She': ['34'], 'bury': ['52'], 'pork': ['42'], 'afraid': ['15'], 'when': ['2', '3', '12', '25', '29', '36', '38'], 'other': ['37', '56'], 'warden': '0', 'did': ['0', '11', '12', '17', '20', '26', '33', '37', '38', '39', '40', '41', '47', '52', '53', '56', '60'], 'Purrple': ['59'], 'forgot': ['34'], 'Check': ['31'], 'charging': ['21'], 'and': ['5', '8', '14', '18', '28', '31', '36', '44', '50', '51', '55', '56'], 'player': ['58'], 'gas': ['51'], 'for': ['34', '45'], 'dog': ['31'], 'Flipper': ['36'], 'Elvis': ['50'], 'library': ['53'], 'credit': ['21'], 'Shortsineger': ['16'], 'but': ['29'], 'battery': ['52'], 'lot': ['13'], 'lift': ['17'], 'booo': ['14'], 'Who': ['16'], 'Saturday': ['18'], 'back': ['49'], 'should': ['31'], 'starts': ['8', '51'], 'done': ['39'], 'France': ['43'], 'squirrel': ['55'], 'into': ['12'], 'dead': ['52'], 'bar': ['12']}
is incorrect because my for loop is skipping steps. Is it possible that the amount of looping being done is messing with the loop? After adding the two print statements for debugging the output was 11373 lines long. If so, can anyone recommend a better alternative? Apologies if I'm missing an obvious answer.
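A likely cause of the missing keys, offered as a sketch rather than a definitive diagnosis: `duplicates += element` extends the list with the individual characters of the string instead of adding the word itself. After 'What' is processed, 'a' and 't' already count as seen, which would explain why the posted index appears to have no entry for 'a'. Using `append` (or a set) records the whole word:

```python
duplicates = []
duplicates += "What"        # list += str extends with each character
print(duplicates)           # ['W', 'h', 'a', 't']
print("a" in duplicates)    # True, so the token 'a' would be skipped

seen = []
seen.append("What")         # append stores the whole word
print(seen)                 # ['What']
print("a" in seen)          # False
```

The variable `seen` is only an illustrative name; in the posted function the same change would apply to `duplicates`.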
1 answer
#1
I'm not sure I understand you correctly, but:
from collections import defaultdict

def create_index(tokens):
    index = defaultdict(list)
    for docid, ls in enumerate(tokens):
        for word in set(ls):
            index[word].append(docid)
    return index
That makes index a dictionary from words to lists of document ids:
>>> tokens = ["this is a test".split(), "that is another test".split()]
>>> create_index(tokens)
defaultdict(<class 'list'>, {'test': [0, 1], 'is': [0, 1], 'this': [0], 'that': [1], 'a': [0], 'another': [1]})
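For completeness, here is a sketch of how such an index might be queried. The `search` helper is hypothetical (it is not part of the answer above) and assumes a document matches only if it contains every word in the query:

```python
from collections import defaultdict

def create_index(tokens):
    index = defaultdict(list)
    for docid, ls in enumerate(tokens):
        for word in set(ls):
            index[word].append(docid)
    return index

def search(index, query):
    """Hypothetical helper: ids of documents containing every query word."""
    words = query.split()
    if not words:
        return []
    # .get avoids creating empty entries in the defaultdict for unknown words
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)

tokens = ["this is a test".split(), "that is another test".split()]
index = create_index(tokens)
print(search(index, "is test"))    # [0, 1]
print(search(index, "this test"))  # [0]
print(search(index, "missing"))    # []
```

Intersecting sets of document ids keeps the lookup simple; a different matching rule (any word instead of all words) would use a union instead.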