If I wanted to do large amounts of data fitting using matrices that are too large to fit in memory, what tools/libraries should I look into? Specifically, if the data comes from a website that normally runs on PHP + MySQL, how would you suggest building an offline process that could run large matrix operations in a reasonable amount of time?
Possible answers might be like "you should use this language with this distributed matrix algorithm to map-reduce on many machines". I imagine that PHP isn't the best language for this, so the flow would be more like: some other offline process reads the data from the database, does the learning, and stores the rules back in a format that PHP can make use of later (since the other parts of the site are built in PHP).
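To make that flow concrete, here is a rough sketch of the kind of offline job I have in mind (Python with pymysql and scikit-learn is purely an assumption on my part; the table and column names are made up):

```python
# Hypothetical offline job: stream rows out of MySQL in chunks, fit an
# out-of-core linear model, and dump the learned weights as JSON so the
# PHP side can json_decode() them later.
import json
import numpy as np
import pymysql
from sklearn.linear_model import SGDRegressor

conn = pymysql.connect(host="localhost", user="app", password="secret",
                       database="site")
model = SGDRegressor()          # learns incrementally via partial_fit
chunk_size = 10000
offset = 0

with conn.cursor() as cur:
    while True:
        cur.execute("SELECT f1, f2, f3, target FROM training_data "
                    "LIMIT %s OFFSET %s", (chunk_size, offset))
        rows = cur.fetchall()
        if not rows:
            break
        data = np.asarray(rows, dtype=float)
        X, y = data[:, :-1], data[:, -1]
        model.partial_fit(X, y)   # never holds more than one chunk in memory
        offset += chunk_size
conn.close()

# Store the learned "rules" (weights) where PHP can pick them up.
with open("model.json", "w") as f:
    json.dump({"coef": model.coef_.tolist(),
               "intercept": float(model.intercept_[0])}, f)
```

The PHP side could then json_decode() the stored weights and apply them at request time.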
Not sure if this is the right place to ask this one (would have asked it in the machine learning SE but it never made it out of beta).
2 Solutions
#1
There are lots of things you need to do if you want to process large amounts of data. One way of processing web-scale data is to use Map/Reduce (a rough sketch of that idea follows the list below); you could also look at Apache Mahout, which is a scalable machine learning package containing:
- Collaborative Filtering
- User and Item based recommenders
- K-Means, Fuzzy K-Means clustering
- And many more.
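To give a flavour of the map/reduce idea applied to a matrix fit (this is plain single-machine Python, not Mahout; the chunks/*.npy files are assumed to already hold your data split into pieces): the per-chunk products X^T X and X^T y can be computed independently ("map") and then summed ("reduce"), so no worker ever needs the full matrix in memory.

```python
# Map/reduce-style least squares: each chunk contributes X_i^T X_i and
# X_i^T y_i, and the partial results are summed before one small solve.
from multiprocessing import Pool
import glob
import numpy as np

def map_chunk(path):
    """Map step: compute the partial sums for one chunk stored on disk."""
    data = np.load(path)              # assumed layout: features..., target
    X, y = data[:, :-1], data[:, -1]
    return X.T @ X, X.T @ y

if __name__ == "__main__":
    paths = sorted(glob.glob("chunks/*.npy"))   # hypothetical chunk files
    with Pool() as pool:
        partials = pool.map(map_chunk, paths)

    # Reduce step: sum the small (n_features x n_features) pieces.
    XtX = sum(p[0] for p in partials)
    Xty = sum(p[1] for p in partials)
    weights = np.linalg.solve(XtX, Xty)         # ordinary least squares
    np.save("weights.npy", weights)
```

The same splitting pattern is what a real Map/Reduce job would distribute across many machines instead of local processes.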
The specific thing you want to do might already be available in some open-source project, such as Weka, but you might need to migrate or create code to turn it into a distributed job.
Hope the above gives you an idea.
#2
Machine learning is a wide field and can be used for many different things (for instance supervised predictive modelling and unsupervised data exploration). Depending on what you want to achieve and on the nature and dimensions of your data, finding algorithms that are interesting in terms of the quality of the model they output, their ability to leverage large training sets, and their speed and memory consumption at prediction time is a hard problem that cannot be answered in general. Some algorithms are scalable because they are online (i.e. they learn incrementally without having to load the whole dataset at once); others are scalable because they can be divided into subtasks that can be executed in parallel. It all depends on what you are trying to achieve and on which kind of data you collected / annotated in the past.
For instance, for text classification, simple linear models like logistic regression with good features (TF-IDF normalization, optionally bi-grams, and optionally chi2 feature selection) can scale to very large datasets (millions of documents) without the need for any kind of parallelization on a cluster. Have a look at liblinear and vowpal wabbit for building such scalable classification models.
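For illustration only, here is roughly what such a pipeline could look like using scikit-learn, whose LogisticRegression can be backed by liblinear (the toy documents, k=10, and the other parameters below are placeholder choices, not recommendations):

```python
# Sketch of the recipe above: TF-IDF features with bi-grams, chi2
# feature selection, and a liblinear-backed logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

texts = ["great product, works well", "terrible, broke after a day"]  # toy data
labels = [1, 0]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),   # uni- and bi-grams
    ("chi2", SelectKBest(chi2, k=10)),                # keep the 10 best features
    ("logreg", LogisticRegression(solver="liblinear")),
])
clf.fit(texts, labels)
print(clf.predict(["works great"]))
```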