So basically I needed to optimize this piece of code today. It tries to find the longest sequence produced by some function for the first million starting numbers:
所以基本上我今天需要优化这段代码。它试图找到由一些函数产生的最长序列的第一百万起始数:
public static void main(String[] args) {
int mostLen = 0;
int mostInt = 0;
long currTime = System.currentTimeMillis();
for(int j=2; j<=1000000; j++) {
long i = j;
int len = 0;
while((i=next(i)) != 1) {
len++;
}
if(len > mostLen) {
mostLen = len;
mostInt = j;
}
}
System.out.println(System.currentTimeMillis() - currTime);
System.out.println("Most len is " + mostLen + " for " + mostInt);
}
static long next(long i) {
if(i%2==0) {
return i/2;
} else {
return i*3+1;
}
}
My mistake was to try to introduce multithreading:
我的错误是尝试引入多线程:
void doSearch() throws ExecutionException, InterruptedException {
final int numProc = Runtime.getRuntime().availableProcessors();
System.out.println("numProc = " + numProc);
ExecutorService executor = Executors.newFixedThreadPool(numProc);
long currTime = System.currentTimeMillis();
List<Future<ValueBean>> list = new ArrayList<Future<ValueBean>>();
for (int j = 2; j <= 1000000; j++) {
MyCallable<ValueBean> worker = new MyCallable<ValueBean>();
worker.setBean(new ValueBean(j, 0));
Future<ValueBean> f = executor.submit(worker);
list.add(f);
}
System.out.println(System.currentTimeMillis() - currTime);
int mostLen = 0;
int mostInt = 0;
for (Future<ValueBean> f : list) {
final int len = f.get().getLen();
if (len > mostLen) {
mostLen = len;
mostInt = f.get().getNum();
}
}
executor.shutdown();
System.out.println(System.currentTimeMillis() - currTime);
System.out.println("Most len is " + mostLen + " for " + mostInt);
}
public class MyCallable<T> implements Callable<ValueBean> {
public ValueBean bean;
public void setBean(ValueBean bean) {
this.bean = bean;
}
public ValueBean call() throws Exception {
long i = bean.getNum();
int len = 0;
while ((i = next(i)) != 1) {
len++;
}
return new ValueBean(bean.getNum(), len);
}
}
public class ValueBean {
int num;
int len;
public ValueBean(int num, int len) {
this.num = num;
this.len = len;
}
public int getNum() {
return num;
}
public int getLen() {
return len;
}
}
long next(long i) {
if (i % 2 == 0) {
return i / 2;
} else {
return i * 3 + 1;
}
}
Unfortunately, the multithreaded version worked 5 times slower than the single-threaded on 4 processors (cores).
不幸的是,多线程版本的工作速度比4个处理器(内核)上的单线程慢5倍。
Then I tried a bit more crude approach:
然后我尝试了一些更原始的方法:
static int mostLen = 0;
static int mostInt = 0;
synchronized static void updateIfMore(int len, int intgr) {
if (len > mostLen) {
mostLen = len;
mostInt = intgr;
}
}
public static void main(String[] args) throws InterruptedException {
long currTime = System.currentTimeMillis();
final int numProc = Runtime.getRuntime().availableProcessors();
System.out.println("numProc = " + numProc);
ExecutorService executor = Executors.newFixedThreadPool(numProc);
for (int i = 2; i <= 1000000; i++) {
final int j = i;
executor.execute(new Runnable() {
public void run() {
long l = j;
int len = 0;
while ((l = next(l)) != 1) {
len++;
}
updateIfMore(len, j);
}
});
}
executor.shutdown();
executor.awaitTermination(30, TimeUnit.SECONDS);
System.out.println(System.currentTimeMillis() - currTime);
System.out.println("Most len is " + mostLen + " for " + mostInt);
}
static long next(long i) {
if (i % 2 == 0) {
return i / 2;
} else {
return i * 3 + 1;
}
}
and it worked much faster, but still it was slower than the single thread approach.
并且它工作得更快,但它仍然比单线程方法慢。
I hope it's not because I screwed up the way I'm doing multithreading, but rather this particular calculation/algorithm is not a good fit for parallel computation. If I change calculation to make it more processor intensive by replacing method next
with:
我希望不是因为我搞砸了多线程的方式,而是这种特殊的计算/算法不适合并行计算。如果我通过替换下面的方法来改变计算以使其更加处理器:
long next(long i) {
Random r = new Random();
for(int j=0; j<10; j++) {
r.nextLong();
}
if (i % 2 == 0) {
return i / 2;
} else {
return i * 3 + 1;
}
}
both multithreaded versions start to execute more than twice as fast than the singlethreaded version on a 4 core machine.
两个多线程版本的启动速度比4核机器上的单线程版本快两倍。
So clearly there must be some threshold that you can use to determine if it is worth to introduce multithreading and my question is:
所以显然必须有一些阈值可以用来确定是否值得引入多线程,我的问题是:
What is the basic rule that would help decide if a given calculation is intensive enough to be optimized by running it in parallel (without spending effort to actually implement it?)
有什么基本规则可以帮助确定给定的计算是否足够密集以通过并行运行来优化(不花费精力实际实现它?)
4 个解决方案
#1
2
I think there is another component to this which you are not considering. Parallelization works best when the units of work have no dependence on each other. Running a calculation in parallel is sub-optimal when later calculation results depend on earlier calculation results. The dependence could be strong in the sense of "I need the first value to compute the second value". In that case, the task is completely serial and later values cannot be computed without waiting for earlier computations. There could also be a weaker dependence in the sense of "If I had the first value I could compute the second value faster". In that case, the cost of parallelization is that some work may be duplicated.
我认为还有另一个组件,你没有考虑。当工作单元彼此不依赖时,并行化最有效。当后面的计算结果取决于早期的计算结果时,并行运行计算是次优的。从“我需要第一个值来计算第二个值”的意义上来说,依赖性可能很强。在这种情况下,任务是完全串行的,并且在不等待早期计算的情况下无法计算以后的值。在“如果我有第一个值,我可以更快地计算第二个值”的意义上,也可能存在较弱的依赖性。在这种情况下,并行化的成本是某些工作可能重复。
This problem lends itself to being optimized without multithreading because some of the later values can be computed faster if you have the previous results already in hand. Take, for example j == 4
. Once through the inner loop produces i == 2
, but you just computed the result for j == 2
two iterations ago, if you saved the value of len
you can compute it as len(4) = 1 + len(2).
这个问题有助于在没有多线程的情况下进行优化,因为如果您已经掌握了之前的结果,则可以更快地计算某些后期值。拿,例如j == 4.一旦通过内循环产生i == 2,但你刚刚计算了j == 2两次迭代前的结果,如果保存了len的值,你可以将它计算为len(4 )= 1 + len(2)。
Using an array to store previously computed values of len
and a little bit twiddling in the next
method, you can complete the task >50x faster.
使用数组存储先前计算的len值并在下一个方法中稍微麻烦,您可以更快地完成任务> 50倍。
#2
4
The key to efficiently implementing multithreading is to make sure the cost is not too high. There are no fixed rules as they heavily depend on your hardware.
有效实现多线程的关键是确保成本不会太高。没有固定的规则,因为它们严重依赖于您的硬件。
Starting and stopping threads has a high cost. Of course you already used the executor service which reduces these costs considerably because it uses a bunch of worker threads to execute your Runnables. However each Runnable still comes with some overhead. Reducing the number of runnables and increasing the amount of work each one has to do will improve performance, but you still want to have enough runnables for the executor service to efficiently distribute them over the worker threads.
启动和停止线程的成本很高。当然,您已经使用了执行器服务,它可以大大降低这些成本,因为它使用一堆工作线程来执行Runnables。但是每个Runnable仍然会带来一些开销。减少可运行数量并增加每个人必须完成的工作量将提高性能,但是您仍然希望有足够的可运行执行服务,以便有效地将它们分配到工作线程上。
You have choosen to create one runnable for each starting value so you end up creating 1000000 runnables. You would probably be getting much better results of you let each Runnable do a batch of say 1000 start values. Which means you only need 1000 runnables greatly reducing the overhead.
您已选择为每个起始值创建一个runnable,因此最终创建1000000 runnables。你可能会得到更好的结果,让每个Runnable做一批1000个起始值。这意味着您只需要1000个runnable就可以大大减少开销。
#3
2
"Will the performance gain be greater than the cost of context switching and thread creation?"
“性能增益是否会高于上下文切换和线程创建的成本?”
That is a very OS, language, and hardware, dependent cost; this question has some discussion about the cost in Java, but has some numbers and some pointers to how to calculate the cost.
这是一个非常操作系统,语言和硬件,依赖成本;这个问题有一些关于Java成本的讨论,但有一些数字和一些指示如何计算成本。
You also want to have one thread per CPU, or less, for CPU intensive work. Thanks to David Harkness for the pointer to a thread on how to work out that number.
对于CPU密集型工作,您还希望每个CPU或更少的一个线程。感谢David Harkness指向如何计算出这个数字的线程。
#4
1
Estimate amount of work which a thread can do without interaction with other threads (directly or via common data). If that piece of work can be completed in 1 microsecond or less, overhead is too much and multithreading is of no use. If it is 1 millisecond or more, multithreading should work well. If it is in between, experimental testing required.
估计线程可以在不与其他线程(直接或通过公共数据)交互的情况下完成的工作量。如果这项工作可以在1微秒或更短的时间内完成,则开销太大而且多线程是没有用的。如果它是1毫秒或更长,多线程应该运行良好。如果介于两者之间,则需要进行实验测试。
#1
2
I think there is another component to this which you are not considering. Parallelization works best when the units of work have no dependence on each other. Running a calculation in parallel is sub-optimal when later calculation results depend on earlier calculation results. The dependence could be strong in the sense of "I need the first value to compute the second value". In that case, the task is completely serial and later values cannot be computed without waiting for earlier computations. There could also be a weaker dependence in the sense of "If I had the first value I could compute the second value faster". In that case, the cost of parallelization is that some work may be duplicated.
我认为还有另一个组件,你没有考虑。当工作单元彼此不依赖时,并行化最有效。当后面的计算结果取决于早期的计算结果时,并行运行计算是次优的。从“我需要第一个值来计算第二个值”的意义上来说,依赖性可能很强。在这种情况下,任务是完全串行的,并且在不等待早期计算的情况下无法计算以后的值。在“如果我有第一个值,我可以更快地计算第二个值”的意义上,也可能存在较弱的依赖性。在这种情况下,并行化的成本是某些工作可能重复。
This problem lends itself to being optimized without multithreading because some of the later values can be computed faster if you have the previous results already in hand. Take, for example j == 4
. Once through the inner loop produces i == 2
, but you just computed the result for j == 2
two iterations ago, if you saved the value of len
you can compute it as len(4) = 1 + len(2).
这个问题有助于在没有多线程的情况下进行优化,因为如果您已经掌握了之前的结果,则可以更快地计算某些后期值。拿,例如j == 4.一旦通过内循环产生i == 2,但你刚刚计算了j == 2两次迭代前的结果,如果保存了len的值,你可以将它计算为len(4 )= 1 + len(2)。
Using an array to store previously computed values of len
and a little bit twiddling in the next
method, you can complete the task >50x faster.
使用数组存储先前计算的len值并在下一个方法中稍微麻烦,您可以更快地完成任务> 50倍。
#2
4
The key to efficiently implementing multithreading is to make sure the cost is not too high. There are no fixed rules as they heavily depend on your hardware.
有效实现多线程的关键是确保成本不会太高。没有固定的规则,因为它们严重依赖于您的硬件。
Starting and stopping threads has a high cost. Of course you already used the executor service which reduces these costs considerably because it uses a bunch of worker threads to execute your Runnables. However each Runnable still comes with some overhead. Reducing the number of runnables and increasing the amount of work each one has to do will improve performance, but you still want to have enough runnables for the executor service to efficiently distribute them over the worker threads.
启动和停止线程的成本很高。当然,您已经使用了执行器服务,它可以大大降低这些成本,因为它使用一堆工作线程来执行Runnables。但是每个Runnable仍然会带来一些开销。减少可运行数量并增加每个人必须完成的工作量将提高性能,但是您仍然希望有足够的可运行执行服务,以便有效地将它们分配到工作线程上。
You have choosen to create one runnable for each starting value so you end up creating 1000000 runnables. You would probably be getting much better results of you let each Runnable do a batch of say 1000 start values. Which means you only need 1000 runnables greatly reducing the overhead.
您已选择为每个起始值创建一个runnable,因此最终创建1000000 runnables。你可能会得到更好的结果,让每个Runnable做一批1000个起始值。这意味着您只需要1000个runnable就可以大大减少开销。
#3
2
"Will the performance gain be greater than the cost of context switching and thread creation?"
“性能增益是否会高于上下文切换和线程创建的成本?”
That is a very OS, language, and hardware, dependent cost; this question has some discussion about the cost in Java, but has some numbers and some pointers to how to calculate the cost.
这是一个非常操作系统,语言和硬件,依赖成本;这个问题有一些关于Java成本的讨论,但有一些数字和一些指示如何计算成本。
You also want to have one thread per CPU, or less, for CPU intensive work. Thanks to David Harkness for the pointer to a thread on how to work out that number.
对于CPU密集型工作,您还希望每个CPU或更少的一个线程。感谢David Harkness指向如何计算出这个数字的线程。
#4
1
Estimate amount of work which a thread can do without interaction with other threads (directly or via common data). If that piece of work can be completed in 1 microsecond or less, overhead is too much and multithreading is of no use. If it is 1 millisecond or more, multithreading should work well. If it is in between, experimental testing required.
估计线程可以在不与其他线程(直接或通过公共数据)交互的情况下完成的工作量。如果这项工作可以在1微秒或更短的时间内完成,则开销太大而且多线程是没有用的。如果它是1毫秒或更长,多线程应该运行良好。如果介于两者之间,则需要进行实验测试。