Removing non-relevant English words from 1 million sentences

Time: 2021-10-12 12:46:17

I am trying to train a Naive Bayes classifier with positive/negative words extracted from sentences based on their sentiment. Example:

I love this movie :))

I hate when it rains :(

The idea is that I extract positive or negative sentences based on the emoticons used, in order to train a classifier and persist it into a database.

The problem is that I have more than 1 million such sentences, so if I train it word by word the database will be overwhelmed. I want to remove all non-relevant words, for example 'I', 'this', 'when' and 'it', so that the number of database queries I have to make is smaller.
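
For illustration, here is a minimal sketch of the emoticon-based labelling and stop-word filtering described above; the emoticon sets, stop list, and function name are invented for the example:

```python
import re

# Hypothetical emoticon sets -- extend these to match the data.
POSITIVE_EMOTICONS = {":)", ":))", ":-)", ":D"}
NEGATIVE_EMOTICONS = {":(", ":-(", ":'("}

# A tiny illustrative stop list; a curated one would be larger.
STOP_WORDS = {"i", "this", "when", "it", "a", "the"}

def label_and_clean(sentence):
    """Return ('pos' | 'neg' | None, content words) for one sentence."""
    # Match simple emoticons first, then ordinary word characters.
    tokens = re.findall(r"[:;][-']?[()D]+|\w+", sentence)
    label = None
    if any(t in POSITIVE_EMOTICONS for t in tokens):
        label = "pos"
    elif any(t in NEGATIVE_EMOTICONS for t in tokens):
        label = "neg"
    # Keep only alphabetic tokens that are not on the stop list.
    words = [t.lower() for t in tokens if t.isalpha() and t.lower() not in STOP_WORDS]
    return label, words

print(label_and_clean("I love this movie :))"))    # ('pos', ['love', 'movie'])
print(label_and_clean("I hate when it rains :("))  # ('neg', ['hate', 'rains'])
```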

Please help me resolve this issue, or suggest better ways of doing it.

Thank you

3 solutions

#2


8  

There are two common approaches:

  1. Compile a stop list.
  2. POS-tag the sentences and throw out those parts of speech that you think are not interesting.
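
A minimal sketch of both approaches, assuming NLTK (the answer names no library) with its stopwords corpus and tagger models already downloaded:

```python
import nltk
from nltk.corpus import stopwords

# Assumes the 'stopwords', 'punkt', and 'averaged_perceptron_tagger'
# resources have been fetched via nltk.download().
STOP = set(stopwords.words("english"))

tokens = nltk.word_tokenize("I love this movie")

# Approach 1: drop anything on the stop list.
content = [w for w in tokens if w.lower() not in STOP]

# Approach 2: keep only parts of speech likely to carry sentiment
# (adjectives JJ*, verbs VB*, adverbs RB*; the exact set is a judgment call).
tagged = nltk.pos_tag(tokens)
by_pos = [w for w, tag in tagged if tag.startswith(("JJ", "VB", "RB"))]

print(content)  # ['love', 'movie']
print(by_pos)   # ['love']
```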

In both cases, determining which words/POS tags are relevant may be done using a measure such as PMI (pointwise mutual information).
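
For concreteness, PMI here is PMI(w, c) = log2(P(w, c) / (P(w) * P(c))); words whose PMI with every class is near zero carry little sentiment signal and are candidates for removal. A sketch over a made-up toy corpus:

```python
import math
from collections import Counter

# Toy labelled corpus; in practice this is the emoticon-labelled data.
docs = [(["love", "movie"], "pos"),
        (["hate", "rains"], "neg"),
        (["love", "rains"], "pos")]

# Joint and marginal counts over all (word, class) occurrences.
joint, word, cls, total = Counter(), Counter(), Counter(), 0
for words, c in docs:
    for w in words:
        joint[w, c] += 1
        word[w] += 1
        cls[c] += 1
        total += 1

def pmi(w, c):
    """PMI(w, c) = log2(p(w, c) / (p(w) * p(c)))."""
    if joint[w, c] == 0:
        return float("-inf")
    return math.log2((joint[w, c] / total) / ((word[w] / total) * (cls[c] / total)))

print(round(pmi("love", "pos"), 3))   # 0.585
print(round(pmi("rains", "neg"), 3))  # 0.585
```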

Mind you: standard stop lists from information retrieval may or may not work in sentiment analysis. I recently read a paper (no reference, sorry) where it was claimed that ! and ?, commonly removed in search engines, are valuable clues for sentiment analysis. (So may 'I', esp. when you also have a neutral category.)

Edit: you can also safely throw away everything that occurs only once in the training set (so-called hapax legomena). Words that occur once have little information value for your classifier, but may take up a lot of space.
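
A sketch of this hapax pruning, again over a made-up corpus:

```python
from collections import Counter

# Toy labelled corpus; 'movie' and 'hate' occur only once (hapax legomena).
docs = [(["love", "movie"], "pos"),
        (["hate", "rains"], "neg"),
        (["love", "rains"], "pos")]

# Count each word's frequency across the whole training set.
freq = Counter(w for words, _ in docs for w in words)

# Keep only words seen at least twice.
vocab = {w for w, n in freq.items() if n > 1}
pruned = [([w for w in words if w in vocab], label) for words, label in docs]

print(pruned)  # [(['love'], 'pos'), (['rains'], 'neg'), (['love', 'rains'], 'pos')]
```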

#3


0  

To reduce the amount of data retrieved from your database, you may create in your database a dictionary -- a table that maps words* to numbers** -- and then retrieve only a number vector for training and a complete sentence for manually marking its sentiment.

* No scientific publication comes to mind, but maybe it is enough to use only stems or lemmas instead of words. That would reduce the size of the dictionary.

** If this operation kills your database, you can create the dictionary in a local application -- one that uses a text indexing engine (e.g., Apache Lucene) -- and store only the result in your database.
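
A minimal sketch of such a dictionary table using sqlite3; the schema and helper name are illustrative, not a prescribed design:

```python
import sqlite3

# In-memory database for the example; any SQL database works the same way.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dictionary (id INTEGER PRIMARY KEY, word TEXT UNIQUE)")

def word_id(word):
    """Return the integer id for a word, adding it to the dictionary if new."""
    conn.execute("INSERT OR IGNORE INTO dictionary (word) VALUES (?)", (word,))
    return conn.execute("SELECT id FROM dictionary WHERE word = ?", (word,)).fetchone()[0]

# Each sentence is then stored and retrieved as a compact vector of ids
# rather than as raw text.
vector = [word_id(w) for w in ["love", "movie"]]
print(vector)  # [1, 2]
```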
