I am very interested to know which method of memory allocation, static or dynamic, is better for performance (e.g., running time) when you know the exact number of objects/items, in C on Linux. I am interested in the cost both for a small number of objects (a small amount of memory) and for a large number of objects (a huge amount of memory).
e.g., type A[N]
vs type *A = malloc(sizeof(type) * N)
Please let me know. Thank you.
Note: We can benchmark this and probably know the answer. But I would like to know the concepts that explain the performance differences between these two allocation methods.
4 Answers
#1
10
Static allocation will be much faster. Static allocation can happen at global scope, and on the stack.
In global scope, statically allocated memory is built into the binary image: the total size of the memory required, and where it is to be located in the running binary, is computed at compile time. Then, when the program loads, the operating system loader will allocate enough memory for all the global static arrays. I'm pretty sure this happens in constant time for all the allocations (i.e., more allocations don't cost more time).
In local scope, static allocations are allocated on the stack. This involves simply reserving a fixed number of bytes on the stack, and happens in constant time per allocation. Stack space is very limited.
Dynamic memory must be allocated from a heap, and even in the best case most allocations will take time that scales more than linearly with the number of allocations, like n log n time or so.
Also, practically speaking, dynamic allocation will be many times slower than static allocation.
@update: As has been pointed out in the comments below, stack allocations are not technically static allocations (but they are allocations with the syntactic form used in the OP's question).
Also, when speaking about "allocation time", I'm considering the total time to manage the memory (alloc and free).
In some dynamic allocators allocation time is faster than freeing time.
It is also true that some fast allocators trade memory efficiency for allocation speed. In these cases static is still "better", in that static and stack allocations take zero time and constant time, respectively, while allocating exact-sized blocks.
Dynamic allocators that are designed to be fast trade off significant memory efficiency (e.g., buddy allocators round up to the next power-of-two block size, so a 33 KiB allocation will use 64 KiB).
#2
5
True static allocations (globals, and local variables marked static) are glued into sections and loaded along with the rest of the section at load time (execve) using mmap. They're essentially free, already covered by the cost of loading the program.
Automatic arrays with statically known sizes can be glued together with other local variables and allocated by adjusting the stack pointer. That's essentially one integer subtraction (the stack grows downwards) or something close to 1 ns on modern processors.
Variable length arrays can't be glued to other variables so they should cost about 1 ns each.
I tried measuring mallocs with different sizes in a single-threaded process, and the results imply that malloc+free pairs cost about 50-100 times as much as stack variables for allocations <16 MiB. After that, the cost spikes upwards (32 MiB and 64 MiB both cost about a hundred times as much as allocations <=16 MiB do).
These are just the costs of obtaining the pointer though. Actual access may cause page faults and thereby further increase the cost of the memory block. The page faults should be extremely rare with stack allocations, but likely with first-time accesses to dynamically allocated memory.
Additionally, many small dynamic allocations will put quite a strain on caches as your memory will be fragmented. (You can reduce this fragmentation by using memory pools, either your own or the obstack functionality available as part of glibc.)
#3
4
Global memory is normally allocated when your program (or shared module/DLL) loads, and it is pre-populated with its initial values. It typically has its own memory section.
The stack is a block of memory (a series of memory pages) allocated by the kernel when you create a new thread. Allocating memory on the stack is typically done by decrementing the stack pointer (one machine instruction; stacks are typically full-descending, so newer allocations have lower addresses), and when a stack object is deleted, the stack pointer is incremented. Stacks therefore have strict first-in, last-out semantics. Stacks are also fixed in size, so when you run out you get a stack overflow.
When one uses malloc (in C) or new (C++), memory is allocated on the heap/free store, which can contain any number of memory blocks. When more memory is needed than has already been allocated, malloc goes to the kernel to request more memory from the system. Typically, malloc will reuse freed memory it has already obtained from the system.
Calls to malloc must be assumed to be slow and should be avoided in any performance-critical or real-time routines. This is for two reasons.
- The heap needs to allocate and free any size of memory in any order. Older algorithms walk a list of freed blocks until one of suitable size is found; since memory can be highly fragmented, this can be slow. If the heap is used by multiple threads, some sort of locking needs to be provided. Much research has been done to improve this situation, and modern allocators such as jemalloc and tcmalloc do improve things. However, don't count on this.
- If the required memory cannot be allocated from what is already managed by the heap allocator, the heap allocator will need to ask the kernel for more memory. (On Unix this is done with the mmap or sbrk system calls.) The kernel needs to find some unallocated memory pages and then map those pages into your process's memory space. Any form of caching goes out the window. So expect this to be super slow (for a computer).
So if you need to do any performance-critical processing, allocate all the memory up front and then reuse the already allocated memory. Always assume calls to malloc and free will be slow.
#4
2
If you know exactly how much space you need to set aside, your primary concern is performance with respect to memory management, and your code doesn't need to be re-entrant, then you will want to allocate objects with static storage duration, either by declaring them at file scope or by using the static storage class specifier.
int data[SOME_NUMBER];
void foo( /* some list of parameters here */ )
{
static int some_more_data[SOME_OTHER_NUMBER];
...
}
Both data and some_more_data exist over the lifetime of the program, but some_more_data is only visible within the foo function¹.
In practice, items with static storage duration have space set aside for them in the binary image itself, so that memory is available as soon as the program is loaded. This may have an advantage as far as locality is concerned; if the data and the code are in the same page, there's no need to swap. Of course, it depends on the code.
The obvious drawback is that your executable file will have a larger footprint. Another drawback is that your code isn't re-entrant; every invocation of foo is working on the same block of memory when it reads or writes some_more_data. Depending on what your code does, that may or may not be a big deal.
If your code needs to be re-entrant, then you can't use objects with static storage duration. If the block of data is relatively small, use objects with automatic storage duration (i.e., regular local variables).
void foo( /* some list of parameters here */ )
{
int some_more_data[SOME_SMALLISH_NUMBER];
...
}
In this case, some_more_data only exists for the lifetime of the foo function; each time you call foo, it allocates another some_more_data object automatically. Whatever overhead there is in setting aside the memory is part of the overhead of calling the function in the first place (at least in my experience). The stack pointer still gets adjusted whether you're using local variables or not, so using local variables won't make it any slower. The main issue is that the available memory for automatic objects is relatively small; for objects above a certain size, this approach simply won't work.
If your code needs to be re-entrant and you need to allocate large blocks of memory, you're pretty much stuck with dynamic memory management (malloc/calloc/realloc and free). Depending on how you design your code, you can minimize some of the performance issues.
1. Visibility rules are enforced during translation from source to machine code; they don't really apply at run time.