What is the cost of an L1 cache miss?

Date: 2021-05-19 03:49:45

Edit: For reference purposes (if anyone stumbles across this question), Igor Ostrovsky wrote a great post about cache misses. It discusses several different issues and shows example numbers. End Edit

I did some testing <long story goes here> and am wondering if a performance difference is due to memory cache misses. The following code demonstrates the issue and boils it down to the critical timing portion. The following code has a couple of loops that visit memory in random order and then in ascending address order.

I ran it on an XP machine (compiled with VS2005: cl /O2) and on a Linux box (gcc -Os). Both produced similar times. These times are in milliseconds. I believe all loops are running and are not optimized out (otherwise it would run "instantly").

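For anyone re-running this, the builds amount to something like the following (the source file name is my own placeholder):

   cl /O2 cachetest.c
   gcc -Os -o cachetest cachetest.c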

*** Testing 20000 nodes
Total Ordered Time: 888.822899
Total Random Time: 2155.846268

Do these numbers make sense? Is the difference primarily due to L1 cache misses, or is something else going on as well? There are 20,000^2 = 4×10^8 memory accesses, and if every one of them were a cache miss, the ~1,267 ms difference between the random and ordered passes works out to about 3.2 nanoseconds per miss. The XP (P4) machine I tested on is 3.2GHz and I suspect (but don't know) that it has a 32KB L1 cache and 512KB L2. With 20,000 entries (80KB), I assume there is not a significant number of L2 misses. So this would be (3.2×10^9 cycles/second) × (3.2×10^-9 seconds/miss) = 10.1 cycles/miss. That seems high to me. Maybe it's not, or maybe my math is bad. I tried measuring cache misses with VTune, but I got a BSOD. And now I can't get it to connect to the license server (grrrr).

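As a sanity check on that arithmetic, here is the same estimate as a tiny stand-alone program (the timing constants are just the values from my run above):

#include <stdio.h>

int main( void )
{
   double dOrderedMS = 888.822899;           // ordered-pass time, ms
   double dRandomMS  = 2155.846268;          // random-pass time, ms
   double dAccesses  = 20000.0 * 20000.0;    // 4e8 inner-loop stores
   double dGHz       = 3.2;                  // P4 clock rate

   // extra cost of the random pass, spread over every access
   double dNSPerMiss = ( dRandomMS - dOrderedMS ) * 1.0e6 / dAccesses;

   printf( "%.2f ns/miss = %.1f cycles/miss\n", dNSPerMiss, dNSPerMiss * dGHz );
   return 0;
}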

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <assert.h>

#if defined( WIN32 )
#include <windows.h>          // LONGLONG, QueryPerformanceCounter
#else
typedef long long LONGLONG;   // LONGLONG is a Windows type; define it ourselves
#endif

typedef struct stItem
{
   long     lData;
   //char     acPad[20];
} LIST_NODE;



#if defined( WIN32 )
void StartTimer( LONGLONG *pt1 )
{
   QueryPerformanceCounter( (LARGE_INTEGER*)pt1 );
}

void StopTimer( LONGLONG t1, double *pdMS )
{
   LONGLONG t2, llFreq;

   QueryPerformanceCounter( (LARGE_INTEGER*)&t2 );
   QueryPerformanceFrequency( (LARGE_INTEGER*)&llFreq );
   *pdMS = ((double)( t2 - t1 ) / (double)llFreq) * 1000.0;
}
#else
// doesn't need 64-bit integer in this case
void StartTimer( LONGLONG *pt1 )
{
   // Just use clock(), this test doesn't need higher resolution
   *pt1 = clock();
}

void StopTimer( LONGLONG t1, double *pdMS )
{
   LONGLONG t2 = clock();
   *pdMS = (double)( t2 - t1 ) * 1000.0 / CLOCKS_PER_SEC;   // avoid integer division
}
#endif



long longrand()
{
   #if defined( WIN32 )
   // Stupid cheesy way to make sure it is not just a 16-bit rand value
   return ( rand() << 16 ) | rand();
   #else
   return rand();
   #endif
}

// get random value in the given range
int randint( int m, int n )
{
   int ret = longrand() % ( n - m + 1 );
   return ret + m;
}

// I think I got this out of Programming Pearls (Bentley).
void ShuffleArray
(
   long *plShuffle,  // (O) return array of "randomly" ordered integers
   long lNumItems    // (I) length of array
)
{
   long i;
   long j;
   long t;

   for ( i = 0; i < lNumItems; i++ )
      plShuffle[i] = i;

   for ( i = 0; i < lNumItems; i++ )
      {
      j = randint( i, lNumItems - 1 );

      t = plShuffle[i];
      plShuffle[i] = plShuffle[j];
      plShuffle[j] = t;
      }
}



int main( int argc, char* argv[] )
{
   long          *plDataValues;
   LIST_NODE     *pstNodes;
   long          lNumItems = 20000;
   long          i, j;
   LONGLONG      t1;  // for timing
   double dms;

   if ( argc > 1 && atoi(argv[1]) > 0 )
      lNumItems = atoi( argv[1] );

   printf( "\n\n*** Testing %u nodes\n", lNumItems );

   srand( (unsigned int)time( 0 ));

   // allocate the nodes as one single chunk of memory
   pstNodes = (LIST_NODE*)malloc( lNumItems * sizeof( LIST_NODE ));
   assert( pstNodes != NULL );

   // Create an array that gives the access order for the nodes
   plDataValues = (long*)malloc( lNumItems * sizeof( long ));
   assert( plDataValues != NULL );

   // Access the data in order
   for ( i = 0; i < lNumItems; i++ )
      plDataValues[i] = i;

   StartTimer( &t1 );

   // Loop through and access the memory a bunch of times
   for ( j = 0; j < lNumItems; j++ )
      {
      for ( i = 0; i < lNumItems; i++ )
         {
         pstNodes[plDataValues[i]].lData = i * j;
         }
      }

   StopTimer( t1, &dms );
   printf( "Total Ordered Time: %f\n", dms );

   // now access the array positions in a "random" order
   ShuffleArray( plDataValues, lNumItems );

   StartTimer( &t1 );

   for ( j = 0; j < lNumItems; j++ )
      {
      for ( i = 0; i < lNumItems; i++ )
         {
         pstNodes[plDataValues[i]].lData = i * j;
         }
      }

   StopTimer( t1, &dms );
   printf( "Total Random Time: %f\n", dms );

   free( plDataValues );
   free( pstNodes );

   return 0;
}

8 Answers

#1


21  

While I can't offer an answer to whether or not the numbers make sense (I'm not well versed in cache latencies, but for the record a ~10 cycle L1 cache miss sounds about right), I can offer you Cachegrind as a tool to help you actually see the differences in cache performance between your two tests.

Cachegrind is a Valgrind tool (the framework that powers the always-lovely memcheck) which profiles cache and branch hits/misses. It will give you an idea of how many cache hits/misses you are actually getting in your program.

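In case it's useful, a minimal Cachegrind session looks roughly like this (the binary name is just a placeholder for however you built the test):

   valgrind --tool=cachegrind ./cachetest
   cg_annotate cachegrind.out.<pid>

The summary printed on exit includes separate counts for first-level and last-level data cache read/write misses, which you can compare between the ordered and random passes.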

#2


48  

Here is an attempt to provide insight into the relative cost of cache misses by analogy with baking chocolate chip cookies ...

Your hands are your registers. It takes you 1 second to drop chocolate chips into the dough.

The kitchen counter is your L1 cache, twelve times slower than registers. It takes 12 x 1 = 12 seconds to step to the counter, pick up the bag of walnuts, and empty some into your hand.

The fridge is your L2 cache, four times slower than L1. It takes 4 x 12 = 48 seconds to walk to the fridge, open it, move last night's leftovers out of the way, take out a carton of eggs, open the carton, put 3 eggs on the counter, and put the carton back in the fridge.

The cupboard is your L3 cache, three times slower than L2. It takes 3 x 48 = 2 minutes and 24 seconds to take three steps to the cupboard, bend down, open the door, root around to find the baking supply tin, extract it from the cupboard, open it, dig to find the baking powder, put it on the counter and sweep up the mess you spilled on the floor.

And main memory? That's the corner store, 5 times slower than L3. It takes 5 x 2:24 = 12 minutes to find your wallet, put on your shoes and jacket, dash down the street, grab a litre of milk, dash home, take off your shoes and jacket, and get back to the kitchen.

Note that all these accesses are constant complexity -- O(1) -- but the differences between them can have a huge impact on performance. Optimizing purely for big-O complexity is like deciding whether to add chocolate chips to the batter 1 at a time or 10 at a time, but forgetting to put them on your grocery list.

Moral of the story: Organize your memory accesses so the CPU has to go for groceries as rarely as possible.

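To make that moral concrete (my own illustration, not part of the original analogy), the classic example is walking a 2-D array in layout order versus jumping across it:

static long alGrid[1024][1024];   // 4-8 MB depending on sizeof(long)

void RowMajorFill( void )   // addresses increase by sizeof(long): few grocery runs
{
   int i, j;
   for ( i = 0; i < 1024; i++ )
      for ( j = 0; j < 1024; j++ )
         alGrid[i][j] = i + j;
}

void ColMajorFill( void )   // each store lands a whole row's bytes away: constant grocery runs
{
   int i, j;
   for ( j = 0; j < 1024; j++ )
      for ( i = 0; i < 1024; i++ )
         alGrid[i][j] = i + j;
}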

Numbers were taken from the CPU Cache Flushing Fallacy blog post, which indicates that for a particular 2012-era Intel processor, the following is true:

  • register access = 4 instructions per cycle
  • L1 latency = 3 cycles (12 x register)
  • L2 latency = 12 cycles (4 x L1, 48 x register)
  • L3 latency = 38 cycles (3 x L2, 12 x L1, 144 x register)
  • DRAM latency = 65 ns = 195 cycles on a 3 GHz CPU (5 x L3, 15 x L2, 60 x L1, 720 x register)

The Gallery of Processor Cache Effects also makes good reading on this topic.

#3


17  

3.2ns for an L1 cache miss is entirely plausible. For comparison, on one particular modern multicore PowerPC CPU, an L1 miss is about 40 cycles -- a little longer for some cores than others, depending on how far they are from the L2 cache (yes really). An L2 miss is at least 600 cycles.

Cache is everything in performance; CPUs are so much faster than memory now that you're really almost optimizing for the memory bus instead of the core.

#4


6  

Well yeah that does look like it will mainly be L1 cache misses.

10 cycles for an L1 cache miss does sound about reasonable, probably a little on the low side.

A read from RAM is going to take on the order of hundreds or maybe even thousands of cycles (I'm too tired to attempt to do the maths right now ;)), so it's still a huge win over that.

#5


3  

If you plan on using cachegrind, please note that it is a cache hit/miss simulator only. It won't always be accurate. For example: if you access some memory location, say 0x1234 in a loop 1000 times, cachegrind will always show you that there was only one cache miss (the first access) even if you have something like:

clflush 0x1234 in your loop.

On x86, this will cause all 1000 cache misses.

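A sketch of that pathological loop in C (x86 only; _mm_clflush is the real SSE2 intrinsic, the surrounding scaffolding is mine):

#include <emmintrin.h>   // _mm_clflush

volatile long lVictim;

void ThousandMisses( void )
{
   int i;
   for ( i = 0; i < 1000; i++ )
      {
      lVictim = i;                            // on real hardware, misses every time after the first flush
      _mm_clflush( (const void*)&lVictim );   // evict the line again
      }
}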

#6


2  

Some numbers for a 3.4GHz P4 from a Lavalys Everest run:

  • the L1 dcache is 8K (cacheline 64 bytes)
  • L2 is 512K
  • L1 fetch latency is 2 cycles
  • L2 fetch latency is about double what you are seeing: 20 cycles

More here: http://www.freeweb.hu/instlatx64/GenuineIntel0000F25_P4_Gallatin_MemLatX86.txt

(for the latencies look at the bottom of the page)

#7


0  

It's difficult to say anything for sure without a lot more testing, but in my experience that scale of difference definitely can be attributed to the CPU L1 and/or L2 cache, especially in a scenario with randomized access. You could probably make it even worse by ensuring that each access is at least some minimum distance from the last.
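One hypothetical way to do that with the test above: instead of shuffling, fill the order array with a large fixed stride, so consecutive accesses are guaranteed to be far apart (lStride must be coprime with lNumItems so every node is still visited exactly once):

// (O) fill plOrder with indices spaced lStride apart, wrapping around
void StrideArray( long *plOrder, long lNumItems, long lStride )
{
   long i;
   long lIdx = 0;

   for ( i = 0; i < lNumItems; i++ )
      {
      plOrder[i] = lIdx;
      lIdx = ( lIdx + lStride ) % lNumItems;
      }
}

// e.g. StrideArray( plDataValues, lNumItems, 4099 );   // 4099 is prime, so coprime with 20000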

#8


-2  

The easiest thing to do is to take a scaled photograph of the target CPU and physically measure the distance between the core and the level-1 cache. Divide that distance by the speed at which signals travel through copper to get the travel time, then figure out how many clock cycles fit in that same time. That's the minimum number of CPU cycles you'll waste on an L1 cache miss.

You can also work out the minimum cost of fetching data from RAM in the same way, in terms of the number of CPU cycles wasted. You might be amazed.

Notice that what you're seeing here definitely has something to do with cache misses (be it L1, or both L1 and L2), because normally the cache pulls in all the data on a cache line at once; once you access anything on that line, touching the rest of it requires no extra trips to RAM.
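That is exactly what the commented-out acPad field in the question's LIST_NODE pokes at; a quick way to see how many nodes share a line (assuming 64-byte cache lines and a 4-byte long):

#include <stdio.h>

typedef struct stItem
{
   long     lData;
   //char     acPad[20];   // uncomment to drop from 16 nodes per line to 2
} LIST_NODE;

int main( void )
{
   printf( "nodes per 64-byte line: %u\n", (unsigned)( 64 / sizeof( LIST_NODE ) ) );
   return 0;
}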

However, what you're probably also seeing is the fact that RAM (even though it's called Random Access Memory) still prefers linear memory access.
