I realize this is more of a hardware question, but it's also very relevant to software, especially when programming for multi-threaded, multi-core/CPU environments.
Which is better, and why, whether in terms of efficiency, speed, productivity, usability, etc.?
1.) A computer/server with 4 quad-core CPUs?
or
2.) A computer/server with 16 single-core CPUs?
Please assume all other factors (speed, cache, bus speeds, bandwidth, etc.) are equal.
Edit:
I'm interested in the performance aspect in general. If one option is particularly good at one aspect and horrible (or not preferable) at another, then I'd like to know that as well.
And if I have to choose, I'd be most interested in which is better with regard to I/O-bound applications and compute-bound applications.
4 Answers
#1
That's not an easy question to answer. Computer architecture is unsurprisingly rather complicated. Below are some guidelines but even these are simplifications. A lot of this will come down to your application and what constraints you're working within (both business and technical).
CPUs generally have several (2-3) levels of caching on the die. Some modern CPUs also have a memory controller on the die; that can greatly improve the speed of sharing memory between cores. Memory I/O between CPUs has to go over an external bus, which tends to be slower.
AMD/ATI chips use HyperTransport, which is a point-to-point protocol.
Complicating all this, however, is the bus architecture. Intel's Core 2 Duo/Quad system uses a shared bus. Think of this like Ethernet or cable internet, where there is only so much bandwidth to go round and every new participant just takes another share from the whole. Core i7 and newer Xeons use QuickPath, which is pretty similar to HyperTransport.
More cores will occupy less space, use less power, and cost less (unless you're using really low-powered CPUs), both in per-core terms and in the cost of other hardware (e.g. motherboards).
Generally speaking, one CPU will be the cheapest (both in terms of hardware AND software). Commodity hardware can be used for this. Once you go to the second socket, you tend to have to use different chipsets, more expensive motherboards and often more expensive RAM (e.g. ECC fully buffered RAM), so you take a massive cost hit going from one CPU to two. It's one reason so many large sites (including Flickr, Google and others) use thousands of commodity servers (although Google's servers are somewhat customized to include things like a 9V battery, but the principle is the same).
Your edits don't really change much. "Performance" is a highly subjective concept. Performance at what? Bear in mind, though, that if your application isn't sufficiently multithreaded (or multiprocess) to take advantage of extra cores, then you can actually decrease performance by adding more cores.
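As a concrete illustration of matching parallelism to the hardware, here's a minimal C++ sketch (my own, not from the original answer; the worker body is a placeholder) that sizes its worker pool from what the machine reports rather than hard-coding a count, since oversubscribing cores often just adds scheduling and synchronization overhead:

```cpp
#include <algorithm>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    // How many hardware threads (cores, possibly x SMT) are available?
    // hardware_concurrency() may return 0 if unknown, so fall back to 1.
    unsigned hw = std::thread::hardware_concurrency();
    unsigned workers = std::max(1u, hw);
    std::cout << "hardware threads reported: " << hw << "\n";

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < workers; ++i) {
        pool.emplace_back([i] {
            // ... process slice i of the workload here ...
            (void)i;
        });
    }
    for (auto& t : pool) t.join();
}
```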
I/O-bound applications probably won't prefer one over the other; they are, after all, bound by I/O, not CPU.
For compute-bound applications, it depends on the nature of the computation. If you're doing lots of floating point, you may benefit far more from using a GPU to offload calculations (e.g. using Nvidia CUDA). You can get a huge performance benefit from this. Take a look at the GPU client for Folding@Home for an example of this.
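One rough way to see whether a workload is compute-bound, and how it scales with cores, is to time the same work at different thread counts. A C++ sketch of my own (not from the answer; sizes and names are arbitrary): a compute-bound job should speed up until it runs out of physical cores, while an I/O-bound job timed the same way would barely change:

```cpp
#include <chrono>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

// Compute-bound work: sum of squares over a big array, split across threads.
// Each thread writes one partial result, so there is no sharing in the hot loop.
double parallelSumSquares(const std::vector<double>& data, unsigned nThreads) {
    std::vector<double> partial(nThreads, 0.0);
    std::vector<std::thread> threads;
    std::size_t chunk = data.size() / nThreads;
    for (unsigned t = 0; t < nThreads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = (t == nThreads - 1) ? data.size() : begin + chunk;
        threads.emplace_back([&, begin, end, t] {
            double s = 0.0;
            for (std::size_t i = begin; i < end; ++i) s += data[i] * data[i];
            partial[t] = s;  // single write per thread at the end
        });
    }
    for (auto& th : threads) th.join();
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}

int main() {
    std::vector<double> data(1 << 24, 1.5);  // ~128 MB of doubles
    for (unsigned n : {1u, 2u, 4u, 8u}) {
        auto t0 = std::chrono::steady_clock::now();
        double s = parallelSumSquares(data, n);
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - t0).count();
        std::cout << n << " threads: " << ms << " ms (sum=" << s << ")\n";
    }
}
```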
In short, your question doesn't lend itself to a specific answer because the subject is complicated and there's just not enough information. Technical architecture is something that has to be designed for the specific application.
#2
Well, the point is that all other factors can't really be equal.
The main problem with multiple CPUs is latency and bandwidth when the two CPU sockets have to intercommunicate, and this has to happen constantly to make sure their local caches aren't out of sync. This incurs latency and can sometimes be the bottleneck of your code. (Not always, of course.)
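A common way this coherence traffic bites in real code is false sharing: two threads writing to different variables that happen to live on the same cache line, so the line ping-pongs between cores (and, worse, between sockets). A small C++ sketch of my own to demonstrate it (struct names and iteration counts are arbitrary); the padded version typically runs several times faster:

```cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

// Two counters on the same cache line: every increment on one core
// invalidates the line in the other core's cache (coherence traffic).
struct Shared {
    std::atomic<long> a{0}, b{0};
};

// Padding each counter to its own 64-byte cache line removes the ping-pong.
struct Padded {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <typename T>
long long run() {
    T c;
    auto t0 = std::chrono::steady_clock::now();
    std::thread t1([&] { for (int i = 0; i < 10000000; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (int i = 0; i < 10000000; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join(); t2.join();
    return std::chrono::duration_cast<std::chrono::milliseconds>(
               std::chrono::steady_clock::now() - t0).count();
}

int main() {
    std::cout << "same cache line: " << run<Shared>() << " ms\n";
    std::cout << "padded:          " << run<Padded>() << " ms\n";
}
```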
#3
More cores on fewer CPUs is definitely faster, as SPWorley writes. His answer is close to three years old now, but the trends are there and I believe it needs some clarification. First, some history.
In the early eighties, the 80286 became the first microprocessor on which virtual memory was feasible. Not that it hadn't been tried before, but Intel integrated the management of virtual memory onto the chip (on-die) instead of using an off-die solution. This made their memory management much faster than their competitors' because all memory management (especially the translation of virtual to physical addresses) was designed into, and part of, the general processing.
Remember those big, clunky P2 & P3 processors from Intel, and the early Athlons & Durons from AMD, that stood on their side and came in a big plastic package? The reason for this was to fit a cache chip next to the processor chip, since the fabrication processes of the time made it unfeasible to fit the cache onto the processor die itself. Voilà: an off-die, on-processor solution. Due to timing limitations, these cache chips would run at a fraction (50% or so) of the CPU's clock frequency. As soon as the manufacturing processes caught up, caches were moved on-die and began to run at the internal clock frequency.
A few years ago AMD moved the RAM memory controller from the Northbridge (off-die) onto the processor (on-die). Why? Because it makes memory operations more efficient (faster): it halves the external addressing wiring and eliminates the trip through the Northbridge (CPU-wiring-Northbridge-wiring-RAM became CPU-wiring-RAM). The change also made it possible to have several independent memory controllers, each with its own set of RAM, operating simultaneously on the same die, which increases the memory bandwidth of the processor.
To get back to the clarification: there is a long-term trend toward moving performance-critical functionality off the motherboard and onto the processor die. In addition to the examples above, we have seen multiple cores integrated onto the same die, and off-die L2/on-die L1 caches become off-die L3/on-die L1 and L2 caches, which in turn are now on-die L1, L2 and L3 caches. The caches have become larger and larger, to the extent that they take up more space than the cores themselves.
So, to sum up: any time you need to go off-die, things slow down dramatically. The answer: stay on-die as much as possible, and streamline the design of anything that does need to go off-die.
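The same advice applies inside your own code: keep the working set where on-die caches can serve it. A C++ sketch of my own (not from the answer; the matrix size is arbitrary) that does identical arithmetic twice; the row-order walk mostly hits on-die cache, while the column-order walk keeps going off-die to RAM and is usually several times slower:

```cpp
#include <chrono>
#include <iostream>
#include <vector>

constexpr std::size_t N = 4096;  // 4096 x 4096 doubles, ~128 MiB row-major

int main() {
    std::vector<double> m(N * N, 1.0);
    auto time = [&](auto body) {
        auto t0 = std::chrono::steady_clock::now();
        double sum = body();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - t0).count();
        std::cout << ms << " ms (sum=" << sum << ")\n";
    };
    // Row-major walk: consecutive elements share cache lines, so most
    // accesses are served on-die from L1/L2.
    time([&] {
        double s = 0;
        for (std::size_t r = 0; r < N; ++r)
            for (std::size_t c = 0; c < N; ++c) s += m[r * N + c];
        return s;
    });
    // Column-major walk: each access touches a new cache line, so the
    // CPU keeps going off-die to RAM. Same arithmetic, far slower.
    time([&] {
        double s = 0;
        for (std::size_t c = 0; c < N; ++c)
            for (std::size_t r = 0; r < N; ++r) s += m[r * N + c];
        return s;
    });
}
```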
#4
It depends on the architecture to some extent; BUT a quad-core CPU is pretty much the same as (or better than) 4 physically separate CPUs, due to the reduced communication (i.e. signals don't have to go off-die or travel very far, which is a factor) and shared resources.