How does random-access memory work? Why is random access constant-time?

Time: 2021-09-02 13:28:51

Or in other words, why does accessing an arbitrary element in an array take constant time (instead of O(n) or some other time)?

I googled my heart out looking for an answer to this and did not find a very good one so I'm hoping one of you can share your low level knowledge with me.

Just to give you an idea of how low of an answer I'm hoping for, I'll tell you why I THINK it takes constant time.

When I say array[4] = 12 in a program, I'm really just storing the bit representation of the memory address into a register. This physical register in the hardware will turn on the corresponding electrical signals according to the bit representation I fed it. Those electrical signals will then somehow magically ( hopefully someone can explain the magic ) access the right memory address in physical/main memory.

I know that was rough, but it was just to give you an idea of what kind of answer I'm looking for.

(editor's note: From the OP's later comments, he understands that address calculations take constant time, and just wonders about what happens after that.)

4 Answers

#1


13  

Because software likes O(1) "working" memory and thus the hardware is designed to behave that way

The basic point is that the address space of a program is thought of as abstractly having O(1) access performance, i.e. whatever memory location you want to read, it should take some constant time (which, in any case, isn't related to the distance between it and the last memory access). So, since arrays are nothing more than contiguous chunks of address space, they should inherit this property (accessing an element of an array is just a matter of adding the index to the start address of the array, and then dereferencing the obtained pointer).

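To make that last parenthesis concrete, here is a minimal C++ sketch (my illustration, not part of the original answer): indexing is literally "add the index to the start address, then dereference", and the amount of work does not depend on which index you pick.

#include <cassert>
#include <cstddef>

int main()
{
    int a[10] = {};
    std::size_t i = 7;              // any index: the work is the same
    // a[i] is defined as *(a + i): take the base address, add i
    // (scaled by sizeof(int) under the hood), then dereference.
    assert(&a[i] == a + i);
    assert(a[i] == *(a + i));
    return 0;
}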

This property comes from the fact that, in general, the address space of a program has some correspondence with the physical RAM of the PC, which, as the name (random access memory) partially implies, should by itself have the property that, whatever location in the RAM you want to access, you get to it in constant time (as opposed, for example, to a tape drive, where the seek time depends on the actual length of tape you have to move to get there).

Now, for "regular" RAM this property is (at least AFAIK) true - when the processor/motherboard/memory controller asks to a RAM chip to get some data, it does so in constant time; the details aren't really relevant for software development, and the internals of memory chips changed many times in the past and will change again in the future. If you are interested in an overview of the details of current RAMs, you can have a look here about DRAMs.

The general concept is that RAM chips don't contain a tape that must be moved, or a disk arm that must be positioned; when you ask them for a byte at some location, the work (mostly changing the settings of some hardware muxes that connect the output to the cells where the byte's state is stored) is the same for any location you could be asking for; thus, you get O(1) performance.

There is some overhead behind this (the logical address has to be mapped to a physical address by the MMU, the various motherboard pieces have to talk with each other to tell the RAM to fetch the data and bring it back to the processor, ...), but the hardware is designed to do so in more or less constant time.

So:

arrays map over address space, which is mapped over RAM, which has O(1) random access; since all of these mappings are (more or less) O(1), arrays keep the O(1) random access performance of RAM.


The point that does matter to software developers, instead, is that, although we see a flat address space and it normally maps over RAM, on modern machines it's false that accessing any element has the same cost. In fact, accessing elements that are in the same zone can be way cheaper than jumping around the address space, because the processor has several onboard caches (= smaller but faster on-chip memories) that keep recently used data and the memory in its immediate neighborhood; thus, if you have good data locality, continuous operations in memory won't keep hitting the RAM (which has much longer latency than the caches), and in the end your code will run way faster.

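As a concrete illustration of the locality point, here is a sketch of mine (not code from the answer): the two functions below sum exactly the same bytes of a row-major N*N matrix, but the row-by-row walk visits consecutive addresses and on typical hardware runs much faster than the column-by-column walk, purely because of the caches.

#include <iostream>
#include <vector>
#include <cstddef>

constexpr std::size_t N = 2048;           // N*N ints, stored row-major in one contiguous block

long long sumRowMajor(const std::vector<int>& m)
{
    long long total = 0;
    for (std::size_t r = 0; r < N; ++r)   // inner loop walks consecutive addresses
        for (std::size_t c = 0; c < N; ++c)
            total += m[r * N + c];
    return total;
}

long long sumColMajor(const std::vector<int>& m)
{
    long long total = 0;
    for (std::size_t c = 0; c < N; ++c)   // inner loop jumps N*sizeof(int) bytes per step
        for (std::size_t r = 0; r < N; ++r)
            total += m[r * N + c];
    return total;
}

int main()
{
    std::vector<int> m(N * N, 1);
    // Same result, very different memory-access pattern (and usually very different timing).
    std::cout << sumRowMajor(m) << " " << sumColMajor(m) << "\n";
}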

Also, under memory pressure, operating systems that provide virtual memory can decide to move rarely used pages of your address space to disk, and fetch them on demand when they are accessed (in response to a page fault); such an operation is very costly and, again, strongly deviates from the idea that accessing any virtual memory address is the same.

#2


8  

The calculation to get from the start of the array to any given element takes only two operations: a multiplication (by sizeof(element)) and an addition. Both of those operations are constant time. Often with today's processors it can be done in essentially no time at all, as the processor is optimized for this kind of access.

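A tiny sketch of that calculation (my own illustration, with a made-up Record type): the element's address is the base plus index times element size, one multiply and one add, whatever the index happens to be.

#include <cassert>
#include <cstddef>
#include <cstdint>

struct Record { char payload[24]; };

int main()
{
    Record table[100];
    std::size_t i = 42;                                    // arbitrary index

    auto base = reinterpret_cast<std::uintptr_t>(table);
    auto addr = base + i * sizeof(Record);                 // one multiply, one add

    assert(addr == reinterpret_cast<std::uintptr_t>(&table[i]));
    return 0;
}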

#3


4  

When I say array[4] = 12 in a program, I'm really just storing the bit representation of the memory address into a register. This physical register in the hardware will turn on the corresponding electrical signals according to the bit representation I fed it. Those electrical signals will then somehow magically ( hopefully someone can explain the magic ) access the right memory address in physical/main memory.

I am not quite sure what you are asking, but I don't see any answers related to what is really going on in the magic of the hardware. Hopefully I understood enough to go through this long-winded explanation (which is still very high level).

array[4] = 12;

So from the comments it sounds like it is understood that you have to get the base address of the array, and then multiply by the size of an array element (or shift, if that optimization is possible) to get the address (from your program's perspective) of the memory location. Right off the bat we have a problem. Are these items already in registers or do we have to go get them? The base address for the array may or may not be in a register depending on the code that surrounds this line of code, in particular the code that precedes it. That address might be on the stack or in some other location depending on where you declared it and how. And that may or may not matter as to how long it takes. An optimizing compiler may (often) go so far as to pre-compute the address of array[4] and place it somewhere so it can go into a register and the multiply never happens at runtime, so it is absolutely not true that the computation of array[4] for a random access takes a fixed amount of time compared to other random accesses. Depending on the processor, some immediate patterns take one instruction while others take more, and whether this address is read from .text or the stack, etc., is also a factor. To not chicken-and-egg that problem to death, assume we have the address of array[4] computed.

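To illustrate the point about where the address comes from, here is a hedged sketch (mine, not the answer's): with a global array and a constant index, a typical optimizing compiler can fold the whole address of array[4] ahead of time, while a runtime index forces the scale-and-add (or a scaled addressing mode) into the instruction stream. Neither outcome is guaranteed; it depends on the surrounding code and the compiler.

#include <cstddef>

int array[64];                 // global: the base address is a link-time constant

void constantIndex()
{
    // The address of array[4] can be a compile/link-time constant here;
    // a typical compiler emits a store to a fixed address, no multiply.
    array[4] = 12;
}

void runtimeIndex(std::size_t i)
{
    // The index is only known at run time, so the address has to be
    // computed (or formed by a scaled addressing mode) before the store.
    array[i] = 12;
}

int main()
{
    constantIndex();
    runtimeIndex(37);
    return 0;
}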

This is a write operation, from the programmer's perspective. Start with a simple processor: no cache, no write buffer, no MMU, etc. Eventually the simple processor will put the address on the edge of the processor core, with a write strobe and data. Each processor's bus is different from other processor families', but it is roughly the same: the address and data can come out in the same cycle or in separate cycles. The command type (read, write) can come out at the same time or a different one, but the command comes out. The edge of the processor core is connected to a memory controller that decodes that address. The result is a destination: is this a peripheral, and if so which one and on what bus; is this memory, and if so on what memory bus; and so on. Assume RAM, and assume this simple processor has SRAM, not DRAM. SRAM is more expensive and faster in an apples-to-apples comparison. The SRAM has an address, write/read strobes, and other controls. Eventually you will have the transaction type (read/write), the address, and the data. The SRAM, whatever its geometry, will route and store the individual bits in their individual pairs/groups of transistors.

A write cycle can be fire-and-forget. All the information that is needed to complete the transaction - this is a write, this is the address, this is the data - is known right then and there. The memory controller can, if it chooses, tell the processor that the write transaction is complete, even if the data is nowhere near the memory. That address/data pair will take its time getting to the memory, and the processor can keep operating. In some systems, though, the design is such that the processor's write transaction waits until a signal comes back to indicate that the write has made it all the way to the RAM. In a fire-and-forget type setup, that address/data will be queued up somewhere and work its way to the RAM. The queue can't be infinitely deep, otherwise it would be the RAM itself, so it is finite, and it is possible and likely that many writes in a row can fill that queue faster than the other end can write to RAM. At that point the current and/or next write has to wait for the queue to indicate there is room for one more. So in situations like this, how fast your write happens - whether your simple processor is I/O bound or not - has to do with prior transactions, which may or may not be write instructions that preceded the instruction in question.

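Here is a toy model of that finite write queue (purely illustrative, my own names and depth, not how any particular memory controller is built): posted writes are accepted immediately while there is room, and the "processor" only stalls once the buffer is full and has to wait for the slow end to drain an entry.

#include <cstddef>
#include <cstdint>
#include <deque>
#include <iostream>
#include <utility>

// Hypothetical posted-write buffer with a fixed depth of 4 entries.
struct WriteBuffer {
    static constexpr std::size_t depth = 4;
    std::deque<std::pair<std::uint32_t, std::uint32_t>> pending;   // {address, data}

    // Returns true if the write was accepted without waiting ("fire and forget"),
    // false if the caller had to stall while the RAM drained one entry first.
    bool post(std::uint32_t addr, std::uint32_t data)
    {
        bool stalled = false;
        if (pending.size() == depth) {      // queue full: wait for one write to reach RAM
            pending.pop_front();
            stalled = true;
        }
        pending.push_back({addr, data});
        return !stalled;
    }
};

int main()
{
    WriteBuffer wb;
    for (std::uint32_t i = 0; i < 8; ++i) {
        bool fast = wb.post(0x1000 + 4 * i, i);
        std::cout << "write " << i << (fast ? ": accepted immediately\n" : ": stalled\n");
    }
}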

Now add some complexity: ECC, or whatever name you want to call it (EDAC is another one). The way an ECC memory works is that the writes are all a fixed size; even if your implementation is four 8-bit-wide memory parts giving you 32 bits of data per write, you have to have a fixed width that the ECC covers, and you have to write the data bits plus the ECC bits all at the same time (you have to compute the ECC over the full width). So if this was an 8-bit write, for example, into a 32-bit ECC-protected memory, then that write cycle requires a read cycle. Read the 32 bits (checking the ECC on that read), modify the new 8 bits in that 32-bit pattern, compute the new ECC pattern, write the 32 bits plus ECC bits. Naturally, the read portion of the write cycle can end up with an ECC error, which just makes life even more fun. Single-bit errors can usually be corrected (what good is an ECC/EDAC if it can't), multi-bit errors not. How the hardware is designed to handle these faults affects what happens next; the read fault may just trickle back to the processor, faulting the write transaction, or it may come back as an interrupt, etc. But here is another place where one random access is not the same as another: depending on the memory being accessed and the size of the access, a read-modify-write definitely takes longer than a simple write.

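A sketch of why a narrow store into ECC-protected memory becomes a read-modify-write. The check byte below is a stand-in of my own (real EDAC uses a proper SEC-DED code, not this toy XOR); only the sequencing matters here.

#include <cstdint>
#include <iostream>

// Toy "ECC": one check byte over a 32-bit word, purely to show the ordering of steps.
std::uint8_t toyEcc(std::uint32_t word)
{
    return static_cast<std::uint8_t>(word ^ (word >> 8) ^ (word >> 16) ^ (word >> 24));
}

struct EccWord {
    std::uint32_t data = 0;
    std::uint8_t  ecc  = toyEcc(0);
};

// Writing a single byte: the controller must read the whole word (and check it),
// merge the new byte, recompute the check bits, and write the whole word back.
void writeByte(EccWord& mem, unsigned byteIndex, std::uint8_t value)
{
    std::uint32_t word = mem.data;                     // 1. read (ECC checked here)
    if (toyEcc(word) != mem.ecc)
        std::cout << "ECC error on the read half of the write!\n";
    word &= ~(0xFFu << (8 * byteIndex));               // 2. merge the new byte
    word |=  std::uint32_t(value) << (8 * byteIndex);
    mem.data = word;                                   // 3. write the data ...
    mem.ecc  = toyEcc(word);                           //    ... plus fresh check bits
}

int main()
{
    EccWord w;
    writeByte(w, 2, 0xAB);                             // an 8-bit store costs a full read-modify-write
    std::cout << std::hex << w.data << "\n";           // prints ab0000
}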

DRAM can also fall into this fixed-width category, even without ECC. Actually all memory falls into this category at some point. The memory array is optimized on the silicon for a certain height and width in units of bits. You cannot violate that; the memory can only be read and written in units of that width at that level. The silicon libraries will include many geometries of RAM, the designers will choose those geometries for their parts, the parts will have fixed limits, and often you can use multiple parts to get some integer multiple of that width. Sometimes the design will allow you to write to only one of those parts if only some of the bits are changing, or some designs will force all parts to light up. Notice how, with each new DDR family of modules that you plug into your home computer or laptop, the first wave has many parts on both sides of the board. Then as that technology gets older and more boring, it may change to fewer parts on both sides of the board, eventually becoming fewer parts on one side of the board before that technology is obsolete and we are already into the next.

This fixed-width category also carries with it alignment penalties. Unfortunately most folks learn on x86 machines, which don't restrict you to aligned accesses like many other platforms. There is a definite performance penalty on x86 or others for unaligned accesses, if allowed. It is usually when folks move to MIPS, or more often ARM on some battery-powered device, that they first learn as programmers about aligned accesses, and sadly they find them to be painful rather than a blessing (due to the simplicity both in programming and in the hardware benefits that come from it). In a nutshell, if your memory is say 32 bits wide and can only be accessed, read or write, 32 bits at a time, that means it is limited to aligned accesses only. A memory bus on a 32-bit-wide memory usually does not have the lower address bits a[1:0] because there is no use for them; those lower bits from a programmer's perspective are zeros. Say our write was 32 bits against one of these 32-bit memories and the address was 0x1002. Then somebody along the line has to read the memory at address 0x1000, take two of our bytes and modify that 32-bit value, then write it back; then take the 32 bits at address 0x1004, modify two bytes, and write it back. Four bus cycles for a single write. If we were writing 32 bits to address 0x1008, though, it would be a simple 32-bit write, no reads.

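The 0x1002 example, written out as a toy model (mine; it assumes a little-endian memory that can only be accessed in aligned 32-bit words): the aligned store costs one bus cycle, the unaligned one costs four.

#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

// Memory that can only be read/written in aligned 32-bit words.
struct Word32Memory {
    std::vector<std::uint32_t> words = std::vector<std::uint32_t>(2048);   // 8 KB backing store
    int busCycles = 0;

    std::uint32_t readWord(std::uint32_t addr)               { ++busCycles; return words[addr / 4]; }
    void writeWord(std::uint32_t addr, std::uint32_t value)  { ++busCycles; words[addr / 4] = value; }

    // A 32-bit store at an arbitrary byte address, expressed in aligned word cycles.
    void store32(std::uint32_t addr, std::uint32_t value)
    {
        if (addr % 4 == 0) {                     // aligned: one simple write
            writeWord(addr, value);
            return;
        }
        // Unaligned (e.g. 0x1002): read-modify-write the two words it straddles.
        std::uint8_t bytes[8];
        std::uint32_t lo = readWord(addr & ~3u);
        std::uint32_t hi = readWord((addr & ~3u) + 4);
        std::memcpy(bytes,     &lo, 4);
        std::memcpy(bytes + 4, &hi, 4);
        std::memcpy(bytes + (addr % 4), &value, 4);          // splice in the new bytes
        std::memcpy(&lo, bytes,     4);
        std::memcpy(&hi, bytes + 4, 4);
        writeWord(addr & ~3u, lo);
        writeWord((addr & ~3u) + 4, hi);                     // four bus cycles in total
    }
};

int main()
{
    Word32Memory mem;
    mem.store32(0x1008, 0x12345678);             // aligned: 1 cycle
    mem.store32(0x1002, 0x12345678);             // unaligned: 4 cycles (2 reads + 2 writes)
    std::cout << mem.busCycles << " bus cycles\n";   // prints 5
}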

SRAM vs DRAM. DRAM is painfully slow, but super cheap: a quarter or so of the transistors per bit (for example 4 for SRAM, 1 for DRAM). SRAM remembers the bit so long as the power is on. DRAM has to be refreshed like a rechargeable battery: even if the power stays on, a single bit will only be remembered for a very short period of time. So some hardware along the way (the DDR controller, etc.) has to regularly perform bus cycles telling that RAM to remember a certain chunk of the memory, and those cycles steal time from your processor wanting to access that memory. DRAM is very slow; it may say 2133MHz (2.133GHz) on the box, but it is really more like 133MHz RAM, right, 0.133GHz. The first cheat is DDR. Normally things in the digital world happen once per clock cycle: the clock goes to an asserted state, then goes to a deasserted state (ones and zeros), and one cycle is one clock. DDR means that it can do something on both the high half-cycle and on the low half-cycle, so that 2133MHz memory really uses a 1066MHz clock. Then pipeline-like parallelisms happen: you can shove in commands, in bursts, at that high rate, but eventually that RAM has to actually get accessed. Overall, DRAM is non-deterministic and very slow. SRAM, on the other hand, requires no refreshes; it remembers so long as the power is on, can be several times faster (133MHz * N), and so on. It can be deterministic.

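The marketing arithmetic from that paragraph, written out as a sketch (nominal figures; real modules round these values slightly differently):

#include <iostream>

int main()
{
    const double ioClockMHz   = 1066.67;            // the actual I/O clock
    const double transfersMHz = ioClockMHz * 2.0;   // DDR: one transfer on each clock edge
    const double busBytes     = 8.0;                // a 64-bit module
    std::cout << transfersMHz << " MT/s (the \"2133MHz\" on the box), peak "
              << transfersMHz * busBytes / 1000.0   // roughly 17 GB/s burst, not sustained
              << " GB/s\n";
}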

The next hurdle: cache. Cache is good and bad. Cache is generally made from SRAM. Hopefully you have an understanding of a cache. If the processor or someone upstream has marked the transaction as non-cacheable, then it goes through uncached to the memory bus on the other side. If cacheable, then a portion of the address is looked up in a table, and that will result in a hit or a miss. This being a write, depending on the cache and/or transaction settings, if it is a miss it may pass through to the other side. If there is a hit, then the data will be written into the cache memory; depending on the cache type it may also pass through to the other side, or that data may sit in the cache waiting for some other chunk of data to evict it, and then it gets written to the other side. Caches definitely make reads, and sometimes writes, non-deterministic. Sequential accesses have the most benefit as your eviction rate is lower; the first access to a cache line is slow relative to the others, then the rest are fast, which is where we get this term of random access anyway. Random accesses go against the schemes that are designed to make sequential accesses faster.

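What "a portion of the address is looked up in a table" means, as a sketch of a hypothetical direct-mapped cache (64-byte lines, 256 sets; the sizes and names are made up for illustration):

#include <array>
#include <cstdint>
#include <iostream>

constexpr std::uint32_t lineBytes = 64;    // assumed line size
constexpr std::uint32_t numSets   = 256;   // assumed number of table entries (16 KB total)

struct Line { bool valid = false; std::uint32_t tag = 0; };
std::array<Line, numSets> cache;

// Returns true on a hit; on a miss, (re)fills the line as a real cache would.
bool access(std::uint32_t addr)
{
    std::uint32_t set = (addr / lineBytes) % numSets;   // which table entry to check
    std::uint32_t tag =  addr / lineBytes / numSets;    // which chunk of memory it would hold
    Line& l = cache[set];
    bool hit = l.valid && l.tag == tag;
    if (!hit) { l.valid = true; l.tag = tag; }
    return hit;
}

int main()
{
    std::cout << access(0x1000);                        // miss: line not yet cached
    std::cout << access(0x1004);                        // hit: same 64-byte line
    std::cout << access(0x1000 + numSets * lineBytes);  // miss: same set, different tag (evicts)
    std::cout << access(0x1000) << "\n";                // miss again: it was just evicted
}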

Sometimes the far side of your cache has a write buffer: a relatively small queue/pipe/buffer/FIFO that holds some number of write transactions. Another fire-and-forget deal, with the same benefits.

Multiple layers of caches: L1, L2, L3... L1 is usually the fastest, either by its technology or its proximity, and usually the smallest, and it goes up from there in speed and size; some of that has to do with the cost of the memory. We are doing a write, but when you do a cache-enabled read, understand that if L1 has a miss it goes to L2, which if it has a miss goes to L3, which if it has a miss goes to main memory, and then L3, L2 and L1 will all store a copy. So a miss on all 3 is of course the most painful and is slower than if you had no cache at all, but sequential reads will give you the cached items, which are now in L1 and super fast. For the cache to be useful, sequential reads over the cache line should take less time overall than reading that much memory directly from the slow DRAM. A system doesn't have to have 3 layers of caches; it can vary. Likewise some systems can separate instruction fetches from data reads and can have separate caches which don't evict each other, while in others the caches are not separate and instruction fetches can evict data brought in by data reads.

Caches help with alignment issues, but of course there is an even more severe penalty for an unaligned access across cache lines. Caches tend to operate using chunks of memory called cache lines. These are often some integer multiple of the width of the memory on the other side; for a 32-bit memory, the cache line might be 128 bits or 256 bits, for example. So if and when the cache line is in the cache, then a read-modify-write due to an unaligned write is against faster memory - still more painful than aligned, but not as painful. If it were an unaligned read and the address were such that part of the data is on one side of a cache line boundary and the rest on the other, then two cache lines have to be read. A 16-bit read, for example, can cost you many bytes read against the slowest memory, obviously several times slower than if you had no caches at all. Depending on how the caches and the memory system in general are designed, if you do a write across a cache line boundary it may be similarly painful, or perhaps not as much: one fraction might be written to the cache, and the other fraction go out on the far side as a smaller-sized write.

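The line-boundary arithmetic as a small sketch (64-byte lines assumed, my own example addresses): an access that straddles a boundary touches two lines, an aligned one touches one.

#include <cstdint>
#include <iostream>

constexpr std::uint32_t lineBytes = 64;    // assumed cache-line size

// How many cache lines does an access of `size` bytes at `addr` touch?
std::uint32_t linesTouched(std::uint32_t addr, std::uint32_t size)
{
    std::uint32_t first = addr / lineBytes;
    std::uint32_t last  = (addr + size - 1) / lineBytes;
    return last - first + 1;
}

int main()
{
    std::cout << linesTouched(0x103E, 2) << "\n";   // 16-bit read ending at 0x103F: 1 line
    std::cout << linesTouched(0x103F, 2) << "\n";   // crosses the 0x1040 boundary: 2 lines
}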

The next layer of complexity is the MMU. It gives the processor and programmer the illusion of flat memory spaces, and/or control over what is cached or not, and/or memory protection, and/or the illusion that all programs are running in the same address space (so your toolchain can always compile/link for address 0x8000, for example). The MMU takes a portion of the virtual address on the processor core side and looks it up in a table, or a series of tables. Those lookups are often in system address space, so each one of those lookups may be one or more of everything stated above, as each is a memory cycle against system memory. Those lookups can result in ECC faults even though you are trying to do a write. Eventually, after one or two or three or more reads, the MMU has determined what the address on the other side of the MMU is, and its properties (cacheable or not, etc.), and that is passed on to the next thing (L1, etc.) and all of the above applies. Some MMUs have a bit of a cache of some number of prior translations (a TLB); remember, because programs are sequential, the tricks used to boost the illusion of memory performance are based on sequential accesses, not random accesses. So some number of lookups might be stored in the MMU so it doesn't have to go out to main memory right away...

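A sketch of the address-splitting part of that walk, using a hypothetical classic two-level 32-bit layout with 4 KB pages (10 + 10 + 12 bits); the real number of levels and field widths depend entirely on the architecture.

#include <cstdint>
#include <iostream>

// Hypothetical 32-bit virtual address, 4 KB pages, two-level table:
// bits [31..22] index the page directory, [21..12] index a page table,
// [11..0] are the byte offset within the page. Each indexed lookup is
// itself a memory access against system RAM (unless it hits the TLB).
void splitVirtualAddress(std::uint32_t va)
{
    std::uint32_t dirIndex   = (va >> 22) & 0x3FF;
    std::uint32_t tableIndex = (va >> 12) & 0x3FF;
    std::uint32_t pageOffset =  va        & 0xFFF;
    std::cout << std::hex << "va 0x" << va
              << " -> dir "    << dirIndex
              << ", table "    << tableIndex
              << ", offset 0x" << pageOffset << "\n";
}

int main()
{
    splitVirtualAddress(0x00008000);   // the "always link at 0x8000" example above
    splitVirtualAddress(0x12345678);
}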

So in a modern computer with MMUs, caches, and DRAM, sequential reads in particular, but also sequential writes, are likely to be faster than random access. The difference can be dramatic. The first transaction in a sequential read or write is, at that moment, a random access, as it has not been seen ever or for a while. Once the sequence continues, though, the optimizations fall in order and the next few are noticeably faster. The size and alignment of your transaction plays an important role in performance as well. While there are so many non-deterministic things going on, as a programmer with this knowledge you can modify your programs to run much faster, or, if unlucky or on purpose, modify them to run much slower. Sequential access is going to be, in general, faster on one of these systems; random access is going to be very non-deterministic. array[4]=12; followed by array[37]=12; - those two high-level operations could take dramatically different amounts of time, both in the computation of the write address and in the actual writes themselves. But, for example, discarded_variable=array[3]; array[3]=11; array[4]=12; can quite often execute significantly faster than array[3]=11; array[4]=12;

#4


2  

Arrays in C and C++ have random access because they are stored in RAM - Random Access Memory - in a finite, predictable order. As a result, a simple linear operation is required to determine the location of a given record (&a[i] = a + i, i.e. the base address plus i * sizeof(a[0]) bytes). This calculation takes constant time. From the CPU's perspective, no "seek" or "rewind" operation is required; it simply tells the memory "load the value at address X".

However: on a modern CPU the idea that it takes constant time to fetch data is no longer true. It takes constant amortized time, depending on whether a given piece of data is in cache or not.

Still - the general principle is that the time to fetch a given set of 4 or 8 bytes from RAM is the same regardless of the address. E.g. if, from a clean slate, you access RAM[0] and RAM[4294967292] the CPU will get the response within the same number of cycles.

#include <iostream>
#include <cstring>
#include <chrono>

// 8Kb of space.
char smallSpace[8 * 1024];

// 64Mb of space (larger than cache)
char bigSpace[64 * 1024 * 1024];

void populateSpaces()
{
    memset(smallSpace, 0, sizeof(smallSpace));
    memset(bigSpace, 0, sizeof(bigSpace));
    std::cout << "Populated spaces" << std::endl;
}

unsigned int doWork(char* ptr, size_t size)
{
    unsigned int total = 0;
    const char* end = ptr + size;
    while (ptr < end) {
        total += *(ptr++);
    }
    return total;
}

using namespace std;
using namespace chrono;

void doTiming(const char* label, char* ptr, size_t size)
{
    cout << label << ": ";
    const high_resolution_clock::time_point start = high_resolution_clock::now();
    auto result = doWork(ptr, size);
    const high_resolution_clock::time_point stop = high_resolution_clock::now();
    auto delta = duration_cast<nanoseconds>(stop - start).count();
    cout << "took " << delta << "ns (result is " << result << ")" << endl;
}

int main()
{
    cout << "Timer resultion is " << 
        duration_cast<nanoseconds>(high_resolution_clock::duration(1)).count()
        << "ns" << endl;

    populateSpaces();

    doTiming("first small", smallSpace, sizeof(smallSpace));
    doTiming("second small", smallSpace, sizeof(smallSpace));
    doTiming("third small", smallSpace, sizeof(smallSpace));
    doTiming("bigSpace", bigSpace, sizeof(bigSpace));
    doTiming("bigSpace redo", bigSpace, sizeof(bigSpace));
    doTiming("smallSpace again", smallSpace, sizeof(smallSpace));
    doTiming("smallSpace once more", smallSpace, sizeof(smallSpace));
    doTiming("smallSpace last", smallSpace, sizeof(smallSpace));
}

Live demo: http://ideone.com/9zOW5q

Output (from ideone, which may not be ideal)

Success  time: 0.33 memory: 68864 signal:0
Timer resolution is 1ns
Populated spaces
doWork/small: took 8384ns (result is 8192)
doWork/small: took 7702ns (result is 8192)
doWork/small: took 7686ns (result is 8192)
doWork/big: took 64921206ns (result is 67108864)
doWork/big: took 65120677ns (result is 67108864)
doWork/small: took 8237ns (result is 8192)
doWork/small: took 7678ns (result is 8192)
doWork/small: took 7677ns (result is 8192)
Populated spaces
strideWork/small: took 10112ns (result is 16384)
strideWork/small: took 9570ns (result is 16384)
strideWork/small: took 9559ns (result is 16384)
strideWork/big: took 65512138ns (result is 134217728)
strideWork/big: took 65005505ns (result is 134217728)

What we are seeing here are the effects of the cache on memory access performance. The first time we hit smallSpace it takes ~8400ns to access all 8KB of small space. But when we call it again immediately after, twice, it takes roughly 700ns less, at ~7700ns.

Now we go away and do bigSpace, which is bigger than the current CPU cache, so we know we've blown away the L1 and L2 caches.

Coming back to small, which we're sure is not cached now, we again see ~8200ns for the first pass and ~7700ns for the second two.

We blow the cache out, and now we introduce a different behavior: a strided loop version (not included in the listing above; a sketch follows below). This amplifies the "cache miss" effect and significantly bumps the timing, although "small space" fits into L2 cache, so we still see a reduction between pass 1 and the following 2 passes.

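The strided variant itself is not shown above; below is a guess at what it might look like (the exact stride and accumulation behind the strideWork numbers are not recoverable from the output). It touches the same bytes as doWork but hops stride bytes at a time, so consecutive accesses keep landing on different cache lines. It is meant to slot into the program above and be timed the same way.

#include <cstddef>

// Hypothetical strided counterpart to doWork (not the original author's code):
// several passes over the buffer, each pass jumping `stride` bytes per access,
// so a cache line is revisited only after many other lines have been touched.
unsigned int doWorkStrided(char* ptr, std::size_t size, std::size_t stride)
{
    unsigned int total = 0;
    for (std::size_t offset = 0; offset < stride && offset < size; ++offset)
        for (std::size_t i = offset; i < size; i += stride)
            total += ptr[i];
    return total;
}

// Example call, alongside the doTiming calls above (stride chosen arbitrarily):
//   doWorkStrided(smallSpace, sizeof(smallSpace), 64);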
