How do I search for a set of words in a text file?

Date: 2022-09-13 09:35:48

I'm writing a project that extracts the semantic orientation of a review stored in a text file. I have a 400×2 array; each row contains a word and its weight. I want to check which of these words appear in the text file and calculate the weight of the whole content.

My question is: what is the most efficient way to do this? Should I search for each word separately, for example with a for loop? Is there any benefit to storing the content of the text file in a string object?
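
For reference, the approach described in the question (read the file into one string, then search for each word separately in a for loop) could look roughly like the sketch below. The names `score_review` and `weighted_words` are hypothetical, the match is case-insensitive and whole-word, and each listed word is counted once no matter how often it occurs; all of these are assumptions, not details given in the question.

```python
import re


def score_review(text, weighted_words):
    """Return the listed words found in text and the sum of their weights.

    weighted_words is assumed to be an iterable of (word, weight) pairs,
    standing in for the 400x2 array from the question.
    """
    lowered = text.lower()
    found = []
    total = 0.0
    # Search for each word separately, as the question proposes.
    for word, weight in weighted_words:
        # \b anchors keep "great" from matching inside "greatness".
        pattern = r"\b" + re.escape(word.lower()) + r"\b"
        if re.search(pattern, lowered):
            found.append(word)
            total += weight  # each word counts once (assumption)
    return found, total
```

To use it on a file, read the file once with `open(path).read()` and pass the resulting string as `text`, so the file is not re-read for every word.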

2 answers

#1


https://docs.python.org/3.6/library/mmap.html

This may work for you. You can memory-map the file and use the mmap object's find() method to search it.
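
A minimal sketch of that suggestion, assuming a non-empty file and UTF-8 encoded words. The function name `words_in_file` is made up for illustration. Note that `find()` does a case-sensitive byte-substring search, so "great" would also match inside "greatness".

```python
import mmap


def words_in_file(path, words):
    """Return the words whose bytes occur somewhere in the file at path."""
    present = []
    with open(path, "rb") as f:
        # Length 0 maps the whole file; an empty file cannot be mapped.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for word in words:
                # find() returns -1 when the byte string is absent.
                if mm.find(word.encode("utf-8")) != -1:
                    present.append(word)
    return present
```

Memory-mapping avoids loading the whole file into a Python string, which mainly matters for large files; for a short review the difference is likely to be small.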

#2


This may be out-of-the-box thinking, but if you don't care about the semantic or grammatical connection between the words:

  • sort all the words from the text by length
  • sort your array by length

  • Write a for-loop:
  • Call len() on each word from the text.
  • Then check it only against the words in your array that have the same length.

With some tinkering, this might give you a good performance boost over the "naive" search.
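
The length-based filtering above can be sketched as follows. Instead of sorting, this version groups the weighted words into buckets keyed by length, which has the same effect of comparing only words of equal length. The helper names are hypothetical, and the sketch assumes the text is already split into words and that every occurrence adds its weight.

```python
from collections import defaultdict


def build_length_buckets(weighted_words):
    """Group (word, weight) pairs by the length of the word."""
    buckets = defaultdict(list)
    for word, weight in weighted_words:
        buckets[len(word)].append((word, weight))
    return buckets


def score_by_length(text_words, buckets):
    """Sum the weights of text words that match a listed word."""
    total = 0.0
    for token in text_words:
        # Only compare against listed words of the same length.
        for word, weight in buckets.get(len(token), []):
            if token == word:
                total += weight  # every occurrence counts (assumption)
                break
    return total
```

Whether this beats a plain search depends on how evenly the 400 words are spread across lengths, so it is worth timing on real data.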

Also look into search algorithms if you want an additional boost. For example, find the first word (of the 400) with, say, 6 letters, then go "down" the list until the first word with 5 letters comes up, then stop.

Alternatively, you could build an index array holding the indexes of the first and last of all 5-letter words (and analogously for the other lengths), assuming your words don't change.
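
One way to read the index idea is sketched below, under the assumption that the word list is sorted from longest to shortest and never changes afterwards. A dict mapping each length to a (first, last) pair stands in for the "index array"; the function names are invented for this example.

```python
def build_length_index(words):
    """Sort words by length (longest first) and record each length's span."""
    ordered = sorted(words, key=len, reverse=True)
    index = {}
    for i, word in enumerate(ordered):
        # Keep the first position seen for this length, update the last.
        first, _ = index.get(len(word), (i, i))
        index[len(word)] = (first, i)
    return ordered, index


def contains(token, ordered, index):
    """Check token only against the slice of words with the same length."""
    span = index.get(len(token))
    if span is None:
        return False
    first, last = span
    return token in ordered[first:last + 1]
```

For example, with the words "bad", "great", "awful" and "superb", the 5-letter words occupy positions 1 to 2 of the sorted list, so a 5-letter token is compared against only those two entries.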
