快速搜索一个巨大的数组matlab

I am currently implementing an algorithm into matlab that searches through a database of customers which bought certain articles. This database looks as following:

我目前正在用matlab实现一种算法，该算法可以通过购买特定文章的客户数据库进行搜索。该数据库如下所示:

[ 0   1   2   3   4   5 NaN NaN;
  4   6   7   8 NaN NaN NaN NaN;
...]

Just the size of that thing is size(data) = [90810 30]. Now I want to find frequent itemsets within that database (without making too much use of toolboxes). I will provide a toyexample here:

它的大小就是size(data) =[90810 30]。现在，我希望在该数据库中找到频繁的项目集(不需要过多地使用工具箱)。我在这里举个例子:

toyset = [
  0,  1,  2,  3,  4,  5,  6,  7,  8,  9;
  5,  6,  7,NaN,NaN,NaN,NaN,NaN,NaN,NaN;
  5,  6,  7,NaN,NaN,NaN,NaN,NaN,NaN,NaN;
  1,  6,  7,  9, 10, 11,NaN,NaN,NaN,NaN;
  2,  4,  8, 11, 12,NaN,NaN,NaN,NaN,NaN];

This would generate the following itemsets when applying a minimum support of 0.5 [support = (occurences_of_set) / (all_sets) ]:

当应用0.5的最小支持时，这会生成以下项集[support = (occurrence es_of_set) / (all_sets)]:

frequent_itemsets = [
  7,NaN,NaN;
  6,NaN,NaN;
  5,NaN,NaN;
  6,  7,NaN;
  5,  7,NaN;
  5,  6,NaN;
  5,  6,  7];

My problem now is finding how frequent the itemset is in the dataset. Currently I use the following algorithm (which works perfectly fine btw):

我现在的问题是发现条目集在数据集中的频率。目前我使用的算法如下(顺便说一句，非常好用):

function list = preprocess(subjectArray, combinations, progressBar)
% =========================================================================
% 
% Creates a list which indicates how often an article-combination given by
% combinations is present in the array of Customers
% 
% =========================================================================
% 
%   preprocesses the array; Finds the frequency of articles
%   subjectArray    - Array that contains customer data
%   combinations    - The article combinations to be found
%   progressBar     - The progress bar to indicate the progress of the 
%                     algorithm 
% 
% =========================================================================

    [countCustomers,maxSizeCustomers] = size(subjectArray);
    [countCombinations,sizeCombinations] = size(combinations);
    list=zeros(1,countCombinations);

    for i = 1:countCustomers
        waitbar(i/countCustomers,progressBar,sprintf('Preprocess: %.0f/%.0f\nSet size:%.0f',i,countCustomers,sizeCombinations));
        for k = 1 : countCombinations
            helpArray = zeros(1,maxSizeCustomers);
            help2Array = zeros(1,sizeCombinations);
            for j = 1:sizeCombinations
                helpArray = helpArray + (subjectArray(i,:) == combinations(k,j));
                help2Array(j) = any(helpArray);
            end
            list(k) = list(k) + all(help2Array);
        end
    end
end

My only problem is that is takes AGES!!! Literally!! Is there any simple possibility (except for sets of length 1, I know that can be made faster by simple counting) to make this faster?

我唯一的问题是那需要很长时间!!真的! !是否有任何简单的可能性(除了长度为1的集合，我知道通过简单的计数可以更快地实现)使这个更快?

I think that this:

我认为:

helpArray = helpArray + (subjectArray(i,j) == combinations(k,:));

is the bottleneck? But I am not sure since I don't know how fast matlab does certain operations.

是瓶颈?但我不确定，因为我不知道matlab的某些操作有多快。

Thanks for looking into it, mind_

谢谢你的调查，介意吗

What I ended up with doing:

我最后做的是:

function list = preprocess(subjectArray, combinations)
% =========================================================================
% 
% Creates a list which indicates how often an article-combination given by
% combinations is present in the array of Customers
% 
% =========================================================================
% 
%   preprocesses the array; Finds the frequency of articles
%   subjectArray    - Array that contains customer data
%   combinations    - The article combinations to be found
% 
% =========================================================================

    [countCustomers,maxSizeCustomers] = size(subjectArray);
    [countCombinations,sizeCombinations] = size(combinations);
    list=zeros(1,countCombinations);


    if sizeCombinations == 1
        for i = 1 : countCustomers
            for j = 1 : maxSizeCustomers
                x = subjectArray(i,j) + 1;
                if isnan(x), break; end
                list(x+1) = list(x+1) + 1;
            end
        end
    else
        for i = 1:countCombinations
            logical = zeros(size(subjectArray));
            for j = 1:sizeCombinations
                logical = logical + (subjectArray == combinations(i,j));
            end
            list(i) = sum(sum(logical,2) == sizeCombinations);
        end
    end
end

Thanks for all the support!

谢谢大家的支持!

2 个解决方案

#1

Three suggestions I see right off the bat:

我马上就能看到三个建议:

First, your waitbar is adding an additional three and half minutes to your search. According to this thread: http://www.mathworks.com/matlabcentral/newsreader/view_thread/261380 it takes code going through 240,000 items an extra 550 seconds to execute if you include the waitbar, scale that to 90,000 and you still have 3 and a half minutes of extra time.

首先，你的waitbar增加了3分半钟的搜索时间。根据这个线程:http://www.mathworks.com/matlabcentral/newsreader/view_thread/261380，如果您包含waitbar，那么将代码扩展到90,000，您仍然有3.5分钟的额外时间。

To calculate the initially frequent options, use the sum of logical indexing, for example, see how frequent a 7 occurs in your dataset.

要计算最初的频繁选项，请使用逻辑索引的总和，例如，查看7在数据集中出现的频率。

logical7=subjectArray==7;
numOf7s=sum(sum(logical7));

Do this for each value, I have a feeling that even though there will be extra code, it will speed up the initial processing quite a bit.

对每个值都这样做，我有一种感觉，即使有额外的代码，它也会大大加快初始处理。

To make that code nicer, you can do things like

为了使代码更美观，您可以做类似的事情

preallocate logical mat, with each 3d slice representing a number (6th slice represents freq. = 5, 7th slice represents freq. = 6)

预分配逻辑席，每个3d切片表示一个数字(第6片代表freq. = 5，第7片代表freq. = 6)

logMat=zeros([size(subjectArray) maxPossibleVal+1]) %Max Possible val is 9 in toy box ex.

在玩具箱ex中，最大可能的val是9。

then fill each slice with the logical# matricies

然后用逻辑#矩阵填充每个片

for i=0:maxPossibleVal
  logMat(:,:,i+1) = subjectArray==i;
end

Once again, you can get your sums from each logical slice and those that are less than a certain threshold, you can remove from your log. mat (I would also use logical indexing to remove the slices that don't meet the threshold)

同样，您可以从每个逻辑片和那些小于某个阈值的逻辑片中获取和，您可以从日志中删除它们。mat(我还将使用逻辑索引删除不满足阈值的切片)

Now the nice thing about having everything logically indexed is you can combine your slices with addition or multiplication to get the different combination frequencies. You could even rotate the result of these operations and then use the "sum" command, followed by logical indexing to get the frequency that the two or three numbers occur together.

现在让所有东西都有逻辑索引的好处是你可以把你的切片和加法或乘法组合起来得到不同的组合频率。您甚至可以旋转这些操作的结果，然后使用“sum”命令，然后使用逻辑索引获得两个或三个数字一起出现的频率。

logM7=logMat(:,:,8)
logM8=logMat(:,:,9)

combo7and8=logical(double(logM7)+double(logM8))

combo7and8 =逻辑(双(logM7)+双(logM8))

%You could probably replace this maybe with an | to make this simpler/faster

%你可以用|替换它，使它更简单/更快

freq7and8=sum(sum(combo7and8')==2)

%sum by default finds sum of columns, turn our rows into columns, then figure out what rows are equal to 2, add all the logical 1's together, and you have the freq. of the 7 and 8s in each dataset.

默认情况下，%sum会查找列的和，将我们的行转换为列，然后计算出哪些行等于2，将所有逻辑1相加，然后在每个数据集中有7和8s的freq。

This entire post can be summarized by two things:

这篇文章可以概括为两件事:

Take off the waitbar

脱下waitbar

Know that it is possible to use logical indexing almost everywhere in your code, which is much faster than for loops

要知道，在代码中几乎所有地方都可以使用逻辑索引，这比循环要快得多

#2

Sorry but I cannot comment (my reputation is too low, I suppose) Frequent itemset mining is quite complex. If you have a huge dataset and you choose a low threshold for an item(set) to be frequent, with your approach (apriori?) you have to be prepared to wait a long time :) Usually when you deal with nested for loops with matlab you experience low performance, too. What threshold did you choose? how large is your dataset?

对不起，我不能评论(我的声誉可能太低了)频繁的项目集挖掘非常复杂。如果您有一个庞大的数据集，并且您选择了一个项(集合)的低阈值为频繁，那么您的方法(apriori?)您必须准备好等待很长一段时间:)通常，当您使用matlab处理嵌套for循环时，您的性能也会很低。你选择了什么门槛?你的数据集有多大?

#1