I have a question about dynamic memory allocation.
我有一个关于动态内存分配的问题。
Context: I'm writing a program that reads a text file of words and counts the frequency with which each word occurs (one word per line).
上下文:我正在编写一个程序来读取单词的文本文件,并计算每个单词出现的频率(每行一个单词)。
This particular function reads the file, counts the lines and characters, then dynamically allocates memory to the array of string pointers, an array storing the count of characters for each line and the strings themselves. (The other parts are less directly relevant to my question).
这个特殊的函数读取文件,计算行和字符,然后动态地将内存分配给字符串指针数组,一个数组存储每行的字符数和字符串本身。 (其他部分与我的问题不太直接相关)。
Question: How often should I reallocate memory if I run out of space? I set a constant ("memstart") for setting the initial memory allocation value. In the below code snippet I realloc for every line over the value of "memstart". Would the program process faster if a reallocated a larger block of memory instead of increasing the memory space by 1 "variable type" each time?
问题:如果空间不足,我应该多久重新分配一次内存?我设置了一个常量(“memstart”)来设置初始内存分配值。在下面的代码片段中,我重新分配了“memstart”值的每一行。如果重新分配更大的内存块而不是每次增加1“变量类型”的内存空间,程序是否会更快?
What would be best practice for something like this?
这样的事情最好的做法是什么?
Code Snip:
int read_alloc(FILE* fin, FILE *tmp, char **wdp, int *sz){
int line_cnt= 0, chr, let=1;
do{
chr=getc(fin);
let++;
//count characters
if(chr!=EOF){
chr=tolower(chr);
fputc(chr, tmp);
}
//convert to lcase and write to temp file
if ('\n' == chr || chr==EOF){
sz[(line_cnt)]=((let)*sizeof(char)); //save size needed to store string in array
*(wdp+(line_cnt))=malloc((let)*sizeof(char)); //allocate space for the string
if ((line_cnt-1) >= memstart){
realloc(wdp, (sizeof(wdp)*(memstart+line_cnt))); //if more space needed increase size
realloc(sz, (sizeof(sz)*(memstart+line_cnt)));
}
line_cnt++;
let=1;
}
} while (EOF != chr);
return (line_cnt);
}
2 个解决方案
#1
7
While the question is about how often realloc
should be called, looking at OP's code, I think it's better to start with how safely it should be done.
虽然问题是关于调用realloc的频率,看看OP的代码,我认为最好从应该如何安全地开始。
The C11 standard states (n1570 draft, § 7.22.3.5, The realloc function, emphasis mine):
C11标准规定(n1570草案,§7.22.3.5,realloc函数,强调我的):
Synopsis
#include <stdlib.h>
void *realloc(void *ptr, size_t size);
#include
void * realloc(void * ptr,size_t size); Description
The realloc function deallocates the old object pointed to by ptr and returns a pointer to a new object that has the size specified by size. The contents of the new object shall be the same as that of the old object prior to deallocation, up to the lesser of the new and old sizes. Any bytes in the new object beyond the size of the old object have indeterminate values.
If ptr is a null pointer, the realloc function behaves like the malloc function for the specified size. (...). If memory for the new object cannot be allocated, the old object is not deallocated and its value is unchanged.
Returns
The realloc function returns a pointer to the new object (which may have the same value as a pointer to the old object), or a null pointer if the new object could not be allocated.说明realloc函数释放ptr指向的旧对象,并返回指向具有size指定大小的新对象的指针。新对象的内容应与解除分配之前的旧对象的内容相同,直到新旧大小中的较小者为止。新对象中超出旧对象大小的任何字节都具有不确定的值。如果ptr是空指针,则realloc函数的行为类似于指定大小的malloc函数。 (......)。如果无法分配新对象的内存,则不会释放旧对象,并且其值不会更改。返回realloc函数返回指向新对象的指针(可能与指向旧对象的指针具有相同的值),如果无法分配新对象,则返回空指针。
Now let's consider this snippet from the question, where sz
is declared as int* sz;
现在让我们从问题中考虑这个片段,其中sz被声明为int * sz;
realloc(sz, (sizeof(sz)*(memstart+line_cnt)));
The return value is lost, so we can't know if the call succeeded and if it did, sz
is invalidated. Moreover, sizeof(sz)
is the size of the pointer, not of the pointed type (int
).
返回值丢失,因此我们无法知道调用是否成功,如果成功,则sz无效。而且,sizeof(sz)是指针的大小,而不是指向类型(int)的大小。
A more safe (and correct) pattern would be:
更安全(和更正确)的模式是:
size_t new_size = /* Whatever, let's say */ size + SOME_COSTANT + size / 2;
void *tmp = realloc(ptr, new_size * sizeof *ptr);
if ( tmp == NULL ) {
/* Deal with the error, e.g. log a message with perror, return NULL
(if this is in a function) or just give up, but remeber that
realloc doesn't invalidate nor free 'ptr' on failure */
exit(EXIT_FAILURE);
}
ptr = tmp; // <- on success, realloc invalidated ptr
size = new_size;
Now, to answer the question, realloc
should be called only when needed, because it involves potentially expansive system calls. So either allocate a big chunk ahead or choose a growing stratey like doubling (or 1.5 times) the size every time.
现在,要回答这个问题,只应在需要时调用realloc,因为它涉及可能广泛的系统调用。因此要么预先分配一大块,要么选择不断增长的策略,例如每次加倍(或1.5倍)大小。
It's worth noting that if possible, the OS could perform the reallocation without copying any element of the original array.
值得注意的是,如果可能,操作系统可以执行重新分配而不复制原始数组的任何元素。
#2
4
The classic answer is to double each time, but a factor of 1.5 might be better. The important bit is that you multiply your array size by some factor each time, rather than adding additional space each time.
经典的答案是每次加倍,但1.5倍可能更好。重要的是,每次将数组大小乘以某个因子,而不是每次都增加额外的空间。
Each re-allocation might need to copy the previous array into a new one. We'd like to minimize these copies. If we will be adding n
items, and we start with an array of size a
, increase by a factor of r
each re-allocation, to end with a value of n
, the sequence of (re-)allocations will be a, ar, ar^2, ar^3, ..., n. The sum of that sequence is (nr-a)/(r-1). Thus the total space is of order O(n).
每次重新分配可能需要将先前的阵列复制到新阵列中。我们想尽量减少这些副本。如果我们将添加n个项目,并且我们从大小为a的数组开始,每次重新分配增加r因子,以n值结束,(重新)分配的顺序将是a,ar ,ar ^ 2,ar ^ 3,...,n。该序列的总和是(nr-a)/(r-1)。因此,总空间为O(n)阶。
Suppose instead we start with a, and this time add r each time. The sequence is a, a+r, a+2r, a+3r, ..., n. The sum of that sequence will be 0.5*((n^2-a^2)/r + a + n). In this case the total space of order O(n^2). Much worse!
假设我们从a开始,这次每次都添加r。序列是a,a + r,a + 2r,a + 3r,...,n。该序列的总和将是0.5 *((n ^ 2-a ^ 2)/ r + a + n)。在这种情况下,订单的总空间为O(n ^ 2)。更糟糕!
With a constant factor of 2, the array will be in the worse case 1/2 empty. That's probably ok. You can always shrink the allocation when you're done and know the final size.
在常数因子为2的情况下,阵列将处于最差的情况下1/2空。那可能还可以。完成后,您可以随时缩小分配并了解最终大小。
As pointed out in another answer, there are several bugs in the manner in which you call realloc()
, but that wasn't the question.
正如另一个答案所指出的那样,你调用realloc()的方式有几个错误,但这不是问题。
#1
7
While the question is about how often realloc
should be called, looking at OP's code, I think it's better to start with how safely it should be done.
虽然问题是关于调用realloc的频率,看看OP的代码,我认为最好从应该如何安全地开始。
The C11 standard states (n1570 draft, § 7.22.3.5, The realloc function, emphasis mine):
C11标准规定(n1570草案,§7.22.3.5,realloc函数,强调我的):
Synopsis
#include <stdlib.h>
void *realloc(void *ptr, size_t size);
#include
void * realloc(void * ptr,size_t size); Description
The realloc function deallocates the old object pointed to by ptr and returns a pointer to a new object that has the size specified by size. The contents of the new object shall be the same as that of the old object prior to deallocation, up to the lesser of the new and old sizes. Any bytes in the new object beyond the size of the old object have indeterminate values.
If ptr is a null pointer, the realloc function behaves like the malloc function for the specified size. (...). If memory for the new object cannot be allocated, the old object is not deallocated and its value is unchanged.
Returns
The realloc function returns a pointer to the new object (which may have the same value as a pointer to the old object), or a null pointer if the new object could not be allocated.说明realloc函数释放ptr指向的旧对象,并返回指向具有size指定大小的新对象的指针。新对象的内容应与解除分配之前的旧对象的内容相同,直到新旧大小中的较小者为止。新对象中超出旧对象大小的任何字节都具有不确定的值。如果ptr是空指针,则realloc函数的行为类似于指定大小的malloc函数。 (......)。如果无法分配新对象的内存,则不会释放旧对象,并且其值不会更改。返回realloc函数返回指向新对象的指针(可能与指向旧对象的指针具有相同的值),如果无法分配新对象,则返回空指针。
Now let's consider this snippet from the question, where sz
is declared as int* sz;
现在让我们从问题中考虑这个片段,其中sz被声明为int * sz;
realloc(sz, (sizeof(sz)*(memstart+line_cnt)));
The return value is lost, so we can't know if the call succeeded and if it did, sz
is invalidated. Moreover, sizeof(sz)
is the size of the pointer, not of the pointed type (int
).
返回值丢失,因此我们无法知道调用是否成功,如果成功,则sz无效。而且,sizeof(sz)是指针的大小,而不是指向类型(int)的大小。
A more safe (and correct) pattern would be:
更安全(和更正确)的模式是:
size_t new_size = /* Whatever, let's say */ size + SOME_COSTANT + size / 2;
void *tmp = realloc(ptr, new_size * sizeof *ptr);
if ( tmp == NULL ) {
/* Deal with the error, e.g. log a message with perror, return NULL
(if this is in a function) or just give up, but remeber that
realloc doesn't invalidate nor free 'ptr' on failure */
exit(EXIT_FAILURE);
}
ptr = tmp; // <- on success, realloc invalidated ptr
size = new_size;
Now, to answer the question, realloc
should be called only when needed, because it involves potentially expansive system calls. So either allocate a big chunk ahead or choose a growing stratey like doubling (or 1.5 times) the size every time.
现在,要回答这个问题,只应在需要时调用realloc,因为它涉及可能广泛的系统调用。因此要么预先分配一大块,要么选择不断增长的策略,例如每次加倍(或1.5倍)大小。
It's worth noting that if possible, the OS could perform the reallocation without copying any element of the original array.
值得注意的是,如果可能,操作系统可以执行重新分配而不复制原始数组的任何元素。
#2
4
The classic answer is to double each time, but a factor of 1.5 might be better. The important bit is that you multiply your array size by some factor each time, rather than adding additional space each time.
经典的答案是每次加倍,但1.5倍可能更好。重要的是,每次将数组大小乘以某个因子,而不是每次都增加额外的空间。
Each re-allocation might need to copy the previous array into a new one. We'd like to minimize these copies. If we will be adding n
items, and we start with an array of size a
, increase by a factor of r
each re-allocation, to end with a value of n
, the sequence of (re-)allocations will be a, ar, ar^2, ar^3, ..., n. The sum of that sequence is (nr-a)/(r-1). Thus the total space is of order O(n).
每次重新分配可能需要将先前的阵列复制到新阵列中。我们想尽量减少这些副本。如果我们将添加n个项目,并且我们从大小为a的数组开始,每次重新分配增加r因子,以n值结束,(重新)分配的顺序将是a,ar ,ar ^ 2,ar ^ 3,...,n。该序列的总和是(nr-a)/(r-1)。因此,总空间为O(n)阶。
Suppose instead we start with a, and this time add r each time. The sequence is a, a+r, a+2r, a+3r, ..., n. The sum of that sequence will be 0.5*((n^2-a^2)/r + a + n). In this case the total space of order O(n^2). Much worse!
假设我们从a开始,这次每次都添加r。序列是a,a + r,a + 2r,a + 3r,...,n。该序列的总和将是0.5 *((n ^ 2-a ^ 2)/ r + a + n)。在这种情况下,订单的总空间为O(n ^ 2)。更糟糕!
With a constant factor of 2, the array will be in the worse case 1/2 empty. That's probably ok. You can always shrink the allocation when you're done and know the final size.
在常数因子为2的情况下,阵列将处于最差的情况下1/2空。那可能还可以。完成后,您可以随时缩小分配并了解最终大小。
As pointed out in another answer, there are several bugs in the manner in which you call realloc()
, but that wasn't the question.
正如另一个答案所指出的那样,你调用realloc()的方式有几个错误,但这不是问题。