递归搜索大型数据集中的索引的性能

I came across a question today of search efficiency for large sets today and I've done by best to boil it down to the most basic case. I feel like this sort of thing probably relates to some classic problem or basic concept I'm missing, so a pointer to that would be great.

今天我遇到了一个关于大型套装搜索效率的问题，我最好将其归结为最基本的情况。我觉得这种事情可能与我遗漏的一些经典问题或基本概念有关，所以指向它的指针会很棒。

Suppose I have a table definition like

假设我有一个表定义，如

CREATE TABLE foo(
    id int,
    type bool,
    reference int,
    PRIMARY KEY(id),
    FOREIGN KEY(reference) REFERENCES foo(id),
    UNIQUE KEY(reference)
) Engine=InnoDB;

Populated with n rows where n/2 are randomly assigned type=1. Each row references another with its same type except for the first, which has reference=null.

填充n行，其中n / 2随机分配类型= 1。每行引用另一个具有相同类型的行，除了第一行，它具有reference = null。

Now we want to print all items with type 1. I assume that at some point, it will be faster to recursively call something like

现在我们要打印所有类型为1的项目。我假设在某些时候，递归调用类似的东西会更快

function printFoo1($ref){
    if($ref==null)
        return;
    $q = 'SELECT id, reference FROM foo WHERE id='.$ref;
    $arr = mysql_fetch_array( mysql_query($q) );
    echo $arr[0];
    printFoo1($arr[1]);
}

As opposed to

相反

function printFoo2($ref){
    $q = 'SELECT id FROM foo WHERE type=1';
    $res = mysql_query($q);
    while( $id = mysql_fetch_array($res) ){
        echo $id[0];
    }
}

The main point here being that function 1 searches for "id", which is indexed, whereas function 2 has to make n/2 comparisons that don't result in a hit, but that the overhead of multiple queries is going to be significantly greater than the single SELECT.

这里的要点是函数1搜索“id”，它被索引，而函数2必须进行n / 2次比较，这不会导致命中，但是多次查询的开销会大大增加比单个SELECT。

Is my assumption correct? If so, how large of a data set would we need before function 1 outperforms function 2?

我的假设是否正确？如果是这样，在函数1优于函数2之前，我们需要多大的数据集？

1 个解决方案

#1

Your example is a bit difficult to parse, but ill start at the top:

你的例子有点难以解析，但在顶部开始不好：

Your first function does not return all of the elements with type = 1. It returns all of the elements that are dependent (based on references) to the element you pass in. From the PHP standpoint, since the link/handle is already open there is a non-trivial overhead from your function call with each successive request, not to mention the string concatenation incurring a cost with each execution of that line.

你的第一个函数没有返回type = 1的所有元素。它返回所有依赖的元素（基于引用）到你传入的元素。从PHP的角度来看，因为链接/句柄已经打开了对于每个连续的请求，函数调用是一个非平凡的开销，更不用说字符串连接在每次执行该行时都会产生成本。

Typically it is better to use the second function styling because it only queries the database one time and will return the elements you are requesting without further work. It will come down to a profiler of course, to determine which is going to return faster, but from my tests the second is hands down the better choice:

通常情况下，最好使用第二个函数样式，因为它只查询数据库一次，并返回您请求的元素而无需进一步的工作。当然，它将归结为一个分析器，以确定哪个会更快地返回，但从我的测试中，第二个是更好的选择：

This was executed with n = 5000 elements in the db (n/2 = 2500 type 1 and passing in reference = highest id with type = 1 from query of db).

这是在db中使用n = 5000个元素执行的（n / 2 = 2500类型1，并且从db的查询中传入reference =最高id，类型= 1）。

printFoo1: 3.591840 seconds
printFoo2: 0.010340 seconds

It wouldn't really make sense for it to work any other way. If you were able to do what you propose that would make JOIN calls have to perform less efficient as well.

以任何其他方式工作都没有意义。如果你能够做你的建议，那么JOIN调用也必须降低效率。

Code

码

$res = mysql_query('SELECT MAX( id ) as `MAX_ID` FROM `foo` WHERE `type` = 1', $link);
$res2 = mysql_fetch_assoc($res);

$id = $res2['MAX_ID'];

// cleanup result and free resources here

echo "printFoo1: ";
$start = microtime(true);
printFoo1($id);
echo microtime(true) - $start;

echo '<br />';

echo "printFoo2: ";
$start = microtime(true);
printFoo2();
echo microtime(true) - $start;

mysql_close($link);

All of this was tested on PHP 5.2.17 running on Linux

所有这些都是在Linux上运行的PHP 5.2.17上测试的

#1