Analyzing data that cannot fit into memory

Time: 2021-12-12 22:54:47

I have a database which has raw text that needs to be analysed. For example, I have collected the title tags of hundreds of millions of individual webpages and clustered them based on topic. I am now interested in performing some additional tests on subsets of each topic cluster. The problem is two-fold. First, I cannot fit all of the text into memory to evaluate it. Second, I need to run several of these analyses in parallel, so even if I could fit a subset into memory, I certainly could not fit many subsets into memory.

I have been working with generators, but it is often necessary to know information about rows of data that have already been loaded and evaluated.

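One way to combine generators with "memory of what has already been seen" is to stream rows out of MySQL with a server-side cursor and keep only small running aggregates (counts, a Counter of tokens) rather than the rows themselves. A minimal sketch, assuming a hypothetical `titles` table with `cluster_id` and `title_text` columns and the MySQLdb driver (PyMySQL exposes an equivalent `SSCursor`):

```python
import MySQLdb
import MySQLdb.cursors
from collections import Counter

def iter_titles(cluster_id, batch_size=10000):
    """Yield title strings for one cluster without loading the whole result set."""
    # SSCursor is a server-side cursor: rows are streamed from the server
    # instead of being fetched into client memory all at once.
    conn = MySQLdb.connect(host="localhost", user="user", passwd="secret",
                           db="crawl", cursorclass=MySQLdb.cursors.SSCursor)
    try:
        cur = conn.cursor()
        cur.execute("SELECT title_text FROM titles WHERE cluster_id = %s", (cluster_id,))
        while True:
            rows = cur.fetchmany(batch_size)
            if not rows:
                break
            for (title,) in rows:
                yield title
    finally:
        conn.close()

# Keep only the summary state you need about rows already seen, not the rows.
token_counts = Counter()
rows_seen = 0
for title in iter_titles(cluster_id=42):
    rows_seen += 1
    token_counts.update(title.lower().split())
```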

My question is this: what are the best methods for handling and analysing data that cannot fit into memory? The data necessarily must be extracted from some sort of database (currently MySQL, but I will likely be switching to a more powerful solution soon).

I am building the software that handles the data in Python.

Thank you,

EDIT

I will be researching and brainstorming on this all day and plan on continuing to post my thoughts and findings. Please leave any input or advice you might have.

IDEA 1: Tokenize words and n-grams and save them to a file. For each string pulled from the database, tokenize it using the tokens in an already existing file. If a token does not exist, create it. For each word token, combine from right to left until a single representation of all the words in the string exists. Search an existing list (one that can fit in memory) of reduced tokens to find potential matches and similarities. Each reduced token will contain an identifier that indicates its token categories. If a reduced token (one created by combining word tokens) is found to match a tokenized string of interest categorically, but not directly, then the reduced token will be broken down into its constituent word tokens and compared word token by word token against the string of interest.

I have no idea if there already exists a library or module that can do this, nor am I sure how much benefit I would gain from it. However, my priorities are: 1) conserve memory, 2) only then worry about runtime. Thoughts?

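A rough sketch of the core of IDEA 1 (persistent word-token ids plus a "reduced" per-string representation), leaving out the category identifiers; the file name and helper names are just placeholders for illustration:

```python
import json
import os

VOCAB_FILE = "tokens.json"   # hypothetical persistent token file

def load_vocab():
    """Load the word -> id mapping created by earlier runs, if any."""
    if os.path.exists(VOCAB_FILE):
        with open(VOCAB_FILE) as f:
            return json.load(f)
    return {}

def save_vocab(vocab):
    with open(VOCAB_FILE, "w") as f:
        json.dump(vocab, f)

def tokenize(title, vocab):
    """Map each word to its id, creating ids for words not seen before."""
    ids = []
    for word in title.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)    # create the missing token
        ids.append(vocab[word])
    return tuple(ids)                   # the "reduced" representation

def overlap(reduced_a, reduced_b):
    """Word-token by word-token comparison of two reduced representations."""
    return sum(1 for a, b in zip(reduced_a, reduced_b) if a == b)

vocab = load_vocab()
a = tokenize("analysing data that does not fit in memory", vocab)
b = tokenize("analysing data that fits in memory", vocab)
print(overlap(a, b))
save_vocab(vocab)
```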

EDIT 2

Hadoop is definitely going to be the solution to this problem. I found some great resources on natural language processing in Python and Hadoop; see the links below (and a minimal Hadoop Streaming sketch after them):

  1. http://www.cloudera.com/blog/2010/03/natural-language-processing-with-hadoop-and-python
  2. http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf
  3. http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python
  4. https://github.com/klbostee/dumbo/wiki/Short-tutorial
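For reference, here is roughly what the Hadoop Streaming approach from the tutorials above looks like: the mapper and reducer are plain Python scripts that read stdin and write stdout, so no single process ever holds the whole dataset. This is the classic word count over title tags, with file names chosen for illustration.

```python
#!/usr/bin/env python
# mapper.py: emit (word, 1) for every word in every title line on stdin
import sys

for line in sys.stdin:
    for word in line.strip().lower().split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py: Hadoop sorts mapper output by key, so equal words arrive adjacent
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)

if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

Both scripts get submitted with the hadoop-streaming jar (the exact path depends on the Hadoop install), passed as -mapper and -reducer with -input/-output pointing at HDFS directories; they can also be tested locally with `cat titles.txt | python mapper.py | sort | python reducer.py`.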

Thanks for your help!

2 Answers

#1 (3 votes)

Map/Reduce was created for this purpose.

The best map-reduce engine is Hadoop, but it has a steep learning curve and needs many nodes for it to be worth it. If this is a small project, you could use MongoDB, which is a really easy-to-use database and includes an internal map-reduce engine that uses JavaScript. The map-reduce framework is really simple and easy to learn, but it lacks all the tools that you could get in the JDK using Hadoop.

WARNING: You can only run one map reduce job at a time on MongoDB's map reduce engine. This is alright for chaining jobs or medium datasets (<100GB), but it lacks Hadoop's parallelism.

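As a concrete illustration of the MongoDB route, here is a minimal sketch driving its JavaScript map-reduce from Python with PyMongo, counting titles per cluster. The collection and field names are assumptions, and note that the `map_reduce()` helper exists in PyMongo 3.x but was removed in 4.0.

```python
from pymongo import MongoClient
from bson.code import Code

client = MongoClient("localhost", 27017)
db = client["crawl"]                      # hypothetical database name

# map and reduce run inside MongoDB as JavaScript
mapper = Code("""
    function () {
        emit(this.cluster_id, 1);         // one count per title document
    }
""")

reducer = Code("""
    function (key, values) {
        return Array.sum(values);
    }
""")

# Results are written to the 'cluster_counts' collection, which is returned.
result = db.titles.map_reduce(mapper, reducer, out="cluster_counts")
for doc in result.find():
    print(doc["_id"], doc["value"])
```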

#2 (0 votes)

"currently mysql but likely will be switching to a more powerful solution soon."

Please don't waste time: for most types of tasks, a tuned MySQL is the best solution.

For processing huge amounts of data, use itertools or build a basic Python iterator (a minimal sketch follows at the end of this answer).

As for how to iterate over the data: it depends on your algorithm.

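A small sketch of the itertools / basic-iterator suggestion: wrap whatever row source you have (for example a server-side MySQL cursor that has already executed its SELECT) in a chunking generator, so only one batch is ever in memory at a time.

```python
import itertools

def chunks(iterable, size):
    """Yield lists of `size` items at a time from any iterable, lazily."""
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk

def analyse(rows):
    # `rows` can be a database cursor; only `size` rows are held at once
    for batch in chunks(rows, 5000):
        for row in batch:
            pass  # per-row analysis goes here

# chunks() works on any iterable, not just cursors:
for batch in chunks(range(12), 5):
    print(batch)   # [0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11]
```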
