So I have a quick question on how to verify the big O of a function.
for example: a quicksort algorithm sorting an array of 50 elements yields a time interval of 0.008524 seconds, while running the same algorithm with 100 elements yields 0.017909 seconds. How would I check whether my quicksort is or is not big O of n*log(n)?
What I think I understand: n increased by a factor of 2, so the runtime should increase by 2*log(2)?
f(n) = 0.008524 -> n*log(n)
f(2n) = 0.017909 -> 2n*log(2n)
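A quick sanity check, sketched below with the two timings above plugged in as placeholder values, is to compare the measured ratio against the ratio an n*log(n) cost would predict when n doubles:

// Minimal sketch: compare the measured doubling ratio with the n*log(n) prediction.
// The size and timings below are the ones quoted above, used as placeholders.
#include <cmath>
#include <iostream>

int main() {
    const double n    = 50.0;       // smaller input size
    const double t_n  = 0.008524;   // measured time for n elements
    const double t_2n = 0.017909;   // measured time for 2n elements

    // If the cost is k*n*log(n), then t(2n)/t(n) = 2*log(2n)/log(n),
    // a little more than 2 and independent of the log base.
    const double predicted = 2.0 * std::log(2.0 * n) / std::log(n);
    const double observed  = t_2n / t_n;

    std::cout << "predicted ratio: " << predicted << "\n"   // about 2.35 for n = 50
              << "observed ratio:  " << observed  << "\n";  // about 2.10
    return 0;
}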
5 Answers
#1 (score 2)
Big-O notation is asymptotic. That means it only applies in the limit as n gets large.
There are many reasons why sorting with 50 and 100 elements might not track O(n log n), caching effects being a likely candidate. If you try 100,000 versus 200,000 versus 1 million, you will probably find it tracks a little better.
Another possibility is that most quicksort implementations are only O(n log n) on average; some inputs will take longer. The odds of encountering such a pathological input are higher for 50 elements than for 100,000.
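To make "pathological input" concrete, here is a minimal sketch (mine, not from the answer above) of a quicksort that always takes the first element as the pivot; already-sorted input drives it to its O(n^2) worst case:

// Sketch: a textbook quicksort with a first-element pivot. Already-sorted input
// forces the most unbalanced partitions, degrading it from O(n log n) to O(n^2).
#include <vector>
#include <algorithm>

void naive_quicksort(std::vector<int>& a, int lo, int hi) {
    if (lo >= hi) return;
    int pivot = a[lo];                     // first-element pivot: the weak spot
    int i = lo;
    for (int j = lo + 1; j <= hi; ++j)     // Lomuto-style partition around the pivot
        if (a[j] < pivot)
            std::swap(a[++i], a[j]);
    std::swap(a[lo], a[i]);
    naive_quicksort(a, lo, i - 1);         // on sorted input one side is always empty,
    naive_quicksort(a, i + 1, hi);         // so the recursion depth grows to n
}

Calling naive_quicksort(v, 0, (int)v.size() - 1) on an already-sorted vector is dramatically slower than on shuffled data (and the deep recursion can overflow the stack), which is why library implementations choose pivots more carefully.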
Ultimately, though, you do not "verify" big-O run time; you prove it based on what the algorithm does.
#2 (score 1)
Big-O notation is not normally concerned with run-time in seconds. It's concerned with the number of operations that were carried out by the algorithm. The only way to establish that is to look at the code for the function.
Not only will run-time be affected by lower-order terms (remember that big-O notation only concerns itself with the highest-order term), but also things such as program startup overhead, cache efficiency and branch prediction.
Having said all that, it's possible that none of these other effects will be significant in your case. In which case, if n is doubled, then you would expect the run-time to increase from k*n*log(n) to k*2n*log(2n) = k*(2n*log(2) + 2n*log(n)).
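One way to count operations rather than seconds is to pass a counting comparator to the sort. This is only a sketch, and std::sort is typically introsort rather than a pure quicksort, but the comparison count should still grow roughly like n*log(n):

// Sketch: count comparisons instead of wall-clock time. If the algorithm is
// O(n log n), the last column (count / (n*log2(n))) should stay roughly flat.
#include <vector>
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <cmath>
#include <iostream>

int main() {
    for (std::size_t n = 1000; n <= 1000000; n *= 10) {
        std::vector<int> v(n);
        for (std::size_t i = 0; i < n; ++i) v[i] = std::rand();

        long long comparisons = 0;
        std::sort(v.begin(), v.end(),
                  [&comparisons](int a, int b) { ++comparisons; return a < b; });

        std::cout << n << "\t" << comparisons << "\t"
                  << comparisons / (n * std::log2(double(n))) << "\n";
    }
    return 0;
}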
#3 (score 1)
There are a couple of points to keep in mind: first and foremost, as you change sizes, the hardware (at least on a typical computer) will have an effect as well. In particular, when your data becomes too large to fit in a particular size of cache, you can expect to see a substantial jump in run-time that's completely independent of the algorithm in question.
To get a good idea of the algorithm proper, you should (probably) start by comparing to some algorithm with a really obvious complexity, but working with the same size of data. For one obvious possibility, time how long it takes to fill your array with random numbers. At least assuming a reasonably typical PRNG, that should certainly be linear.
Then time your algorithm, and see how it changes relative to the linear algorithm for the same sizes. For example, you might use some code like this:
#include <vector>
#include <algorithm>
#include <iostream>
#include <time.h>    // clock(), CLOCKS_PER_SEC
#include <stdlib.h>  // rand()
#include <string>
#include <iomanip>

// RAII timer: prints the elapsed CPU time (in seconds) when it goes out of scope.
class timer {
    clock_t begin;
    std::ostream &os;
    std::string d;
public:
    timer(std::string const &delim = "\n", std::ostream &rep = std::cout)
        : begin(clock()), os(rep), d(delim)
    {}
    ~timer() { os << double(clock() - begin) / CLOCKS_PER_SEC << d; }
};

int main() {
    static const unsigned int meg = 1024 * 1024;
    std::cout << std::setw(10) << "Size" << "\tfill\tsort\n";
    for (unsigned size = 10000; size < 512 * meg; size *= 2) {
        std::vector<int> numbers(size);
        std::cout << std::setw(10) << size << "\t";
        {
            // Linear baseline: fill the array with random values.
            timer fill_time("\t");
            std::fill_n(numbers.begin(), size, 0);
            for (unsigned i = 0; i < size; i++)
                numbers[i] = rand();
        }
        {
            // The algorithm under test: sort the same data.
            timer sort_time;
            std::sort(numbers.begin(), numbers.end());
        }
    }
    return 0;
}
If I graph both the time to fill and the time to sort, I get something like this:
Since our sizes are exponential, our linear algorithm shows a (roughly) exponential curve. The time to sort is obviously growing (somewhat) faster still.
Edit: In fairness, I should probably add that log(N) grows so slowly that for almost any practical amount of data, it contributes very little. For most practical purposes, you can just about treat quicksort (for example) as being linear on the size, just with a somewhat greater constant factor than filling memory. Growing the size linearly and graphing the results makes this more apparent:
If you look carefully, you can probably see the upper line showing just a slight upward curve from the "log(N)" factor. On the other hand, I'm not sure I'd have noticed any curvature at all if I hadn't already known that it should be there.
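Another way to read a table like the one this program prints, without graphing it at all, is to divide each measured time by n*log(n): if the sort really is O(n log n), that normalized value should level off to a roughly constant k as n grows, while for the linear fill it keeps shrinking. A small sketch of that post-processing step (the sample numbers are placeholders, not real measurements):

// Sketch: normalize measured times by n*log2(n). For an O(n log n) sort the last
// column should be roughly constant once n is large; for a linear fill it keeps falling.
#include <cmath>
#include <cstdio>

int main() {
    // (size, seconds) pairs as produced by a timing loop like the one above;
    // the values here are placeholders, not real measurements.
    const struct { double n, seconds; } samples[] = {
        {100000, 0.012}, {200000, 0.026}, {400000, 0.055}, {800000, 0.117},
    };
    for (const auto& s : samples) {
        double k = s.seconds / (s.n * std::log2(s.n));
        std::printf("%10.0f\t%8.3f\t%.3e\n", s.n, s.seconds, k);
    }
    return 0;
}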
#4 (score 0)
Two data points aren't really enough.
However, 2*n*log(2*n) = 2*n*(log(n) + log(2)), so you can see that when you double the size the multiplier should be roughly 2*(1 + log(2)/log(n)), i.e. a little more than 2. That looks plausible for the numbers you gave. You should add a few more points and double-check.
Note that there is likely to be some constant term in your timing, at least if you include program startup in it, and that term might be significant. If, on the other hand, you time only the sorting phase and exclude the startup, the measurement is more accurate. You should also repeat the process over many permutations of the input data to ensure you get a representative set of values.
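As a sketch of that last suggestion, the loop below times only the sorting phase and averages it over several random permutations of the same data, so no single lucky or unlucky input dominates the result (std::chrono is used here instead of the clock()-based timer above):

// Sketch: time only the sorting phase, averaged over several random permutations.
#include <vector>
#include <algorithm>
#include <numeric>
#include <random>
#include <chrono>
#include <cstddef>
#include <iostream>

int main() {
    const std::size_t n = 1000000;
    const int runs = 10;
    std::vector<int> base(n);
    std::iota(base.begin(), base.end(), 0);   // the values themselves don't matter
    std::mt19937 rng(12345);

    double total = 0.0;
    for (int r = 0; r < runs; ++r) {
        std::vector<int> v = base;
        std::shuffle(v.begin(), v.end(), rng);          // new permutation, not timed
        auto start = std::chrono::steady_clock::now();  // timing starts here
        std::sort(v.begin(), v.end());
        auto stop = std::chrono::steady_clock::now();
        total += std::chrono::duration<double>(stop - start).count();
    }
    std::cout << "mean sort time for n=" << n << ": " << total / runs << " s\n";
    return 0;
}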
#5 (score 0)
With the code you are looking at, what matters is how long the method takes on 50 elements compared to 100, i.e. how the work grows with the input, not the raw timings themselves. For instance, iterating through an array is linear time, O(n), because the loop has to visit every index to the end; n represents how large the array is. Besides, Big O notation is about the overall trend in the long run. If you timed 50, 100, 1000, 100000, and 1000000 elements and averaged out the noise, you would see that on the whole it behaves like O(n*log(n)).