
Time: 2022-09-13 09:35:48

I'm writing a project that extracts the semantic orientation of a review stored in a text file. I have a 400-by-2 array in which each row contains a word and its weight. I want to check which of these words occur in the text file and calculate the weight of the whole content.


My question is: what is the most efficient way to do this? Should I search for each word separately, for example with a for loop? Is there any benefit to storing the content of the text file in a string object?


2 solutions





This may work for you. You can use find
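
A minimal sketch of what a find-based search could look like, assuming Python (the answer does not name a language) and assuming the review has already been read into a single string. The names total_weight, review and lexicon are made up for illustration. Note that str.find matches substrings, so "great" would also be found inside "greatest", and each word is counted at most once here.

```python
def total_weight(text, weighted_words):
    """Sum the weights of the lexicon words that occur somewhere in text."""
    total = 0.0
    for word, weight in weighted_words:
        # str.find returns the index of the first match, or -1 if absent.
        if text.find(word) != -1:
            total += weight
    return total


# Hypothetical sample data standing in for the file content and the 400-row array.
review = "the plot was great but the ending was terrible"
lexicon = [("great", 1.0), ("terrible", -1.5), ("boring", -0.5)]
print(total_weight(review, lexicon))
```

Keeping the whole file in one string is what makes a loop like this possible; whether that is fast enough depends on the size of the reviews.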




This may be out-of-the-box thinking, but if you don't care about the semantic or grammatical connections between the words:


  • Sort all words from the text by length.
  • Sort your array by length.


  • Write a for-loop:
  • Call len() (length) on each word from the text.
  • Then check it only against those words that have the same length.

With some tinkering, this might give you a good performance boost over the "naive" search.
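
The length-based idea above could be sketched roughly like this. It is only an illustration under a few assumptions: the text is split on whitespace (punctuation is not stripped), the words are grouped by length in a dict instead of being sorted, and every occurrence of a word adds its weight. The function names are invented for the example.

```python
from collections import defaultdict


def build_length_buckets(weighted_words):
    """Group the (word, weight) pairs by word length."""
    buckets = defaultdict(list)
    for word, weight in weighted_words:
        buckets[len(word)].append((word, weight))
    return buckets


def total_weight_by_length(text, weighted_words):
    """Sum weights, comparing each token only with same-length lexicon words."""
    buckets = build_length_buckets(weighted_words)
    total = 0.0
    for token in text.split():
        # Only lexicon words with the same length as the token are candidates.
        for word, weight in buckets.get(len(token), []):
            if token == word:
                total += weight
                break
    return total


# Hypothetical sample data.
lexicon = [("good", 1.0), ("bad", -2.0), ("dull", -0.5)]
print(total_weight_by_length("a good good bad film", lexicon))
```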


Also look into search algorithms if you want an additional boost: for example, find the first of the 400 words with, e.g., 6 letters, then go "down" the list until the first word with 5 letters comes up, then stop.
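
One way to read this is a binary search for the run of words with a given length. The sketch below assumes Python's standard bisect module and a list sorted longest-first; the lengths are negated because bisect expects ascending order. The helper name words_of_length and the sample data are hypothetical.

```python
from bisect import bisect_left, bisect_right


def words_of_length(sorted_words, neg_lengths, n):
    """Return the slice of sorted_words whose words have exactly n letters."""
    start = bisect_left(neg_lengths, -n)   # first word with n letters
    stop = bisect_right(neg_lengths, -n)   # first word with fewer letters
    return sorted_words[start:stop]


# Hypothetical sample data.
lexicon = [("bad", -2.0), ("superb", 2.0), ("great", 1.0),
           ("boring", -0.5), ("awful", -1.5)]
# Longest words first, so going "down" the list means shorter words.
sorted_words = sorted(lexicon, key=lambda pair: len(pair[0]), reverse=True)
# Negated lengths are ascending, which is what bisect requires.
neg_lengths = [-len(word) for word, _ in sorted_words]

print(words_of_length(sorted_words, neg_lengths, 6))
```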


Alternatively, you could build an index array holding the indexes of the first and last of all 5-letter words (and analogously for the other lengths), assuming your words don't change.
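
A rough sketch of such an index, assuming the word list is already sorted by length and never changes. A dict keyed by word length stands in for the index array here, and build_length_index and lookup are names made up for the example.

```python
def build_length_index(sorted_words):
    """Map each word length to the (first, last) indexes of its run."""
    index = {}
    for i, (word, _) in enumerate(sorted_words):
        n = len(word)
        # Keep the first index seen for this length, update the last one.
        first, _last = index.get(n, (i, i))
        index[n] = (first, i)
    return index


def lookup(token, sorted_words, index):
    """Return the weight of token, or None if it is not in the list."""
    span = index.get(len(token))
    if span is None:
        return None
    first, last = span
    for word, weight in sorted_words[first:last + 1]:
        if word == token:
            return weight
    return None


# Hypothetical sample data, already sorted longest-first.
sorted_words = [("superb", 2.0), ("boring", -0.5), ("great", 1.0),
                ("awful", -1.5), ("bad", -2.0)]
index = build_length_index(sorted_words)
print(index)
print(lookup("awful", sorted_words, index))
```

The index is built once up front, so each lookup only scans the words that share the token's length.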





