I have a data set that looks like this:
我有一个如下所示的数据集:
000 100 200 300 010 020 030 001 002 003
001 101 201 301 011 021 031 000 002 003
002 102 202 302 012 022 032 001 000 003
003 103 203 303 013 023 033 001 002 000
010 110 210 310 000 020 030 011 012 013
020 120 220 320 010 000 030 021 022 023
030 130 230 330 010 020 000 031 032 033
033 133 233 333 013 023 003 031 032 030
100 000 200 300 110 120 130 101 102 103
133 033 233 333 113 123 103 131 132 130
200 100 000 300 210 220 230 201 202 203
233 133 033 333 213 223 203 231 232 230
300 100 200 000 310 320 330 301 302 303
303 103 203 003 313 323 333 301 302 300
313 113 213 013 303 323 333 311 312 310
330 130 230 030 310 320 300 331 332 333
331 131 231 031 311 321 301 330 332 333
332 132 232 032 312 322 302 331 330 333
333 133 233 033 313 323 303 331 332 330
What I intend to do is to generate list of unique strings from it, yielding:
我打算做的是从中生成唯一字符串列表,产生:
000
001
002
003
010
011
012
013
020
021
022
023
030
031
032
033
100
101
102
103
110
113
120
123
130
131
132
133
200
201
202
203
210
213
220
223
230
231
232
233
300
301
302
303
310
311
312
313
320
321
322
323
330
331
332
333
The code I have to generate that is this. But it is very memory consumptive. Because in reality the string is of length >36 and there are more than 35 million lines in a file. Each line with >36*3 number of columns/entries. Is there a memory efficient way to do it?
我必须生成的代码就是这个。但这是非常记忆消耗的。因为实际上字符串的长度大于36,并且文件中的行数超过3500万行。每行具有> 36 * 3列数/条目。有记忆效率的方法吗?
#include <iostream>
#include <vector>
#include <fstream>
#include <sstream>
#include <map>
using namespace std;
int main ( int arg_count, char *arg_vec[] ) {
if (arg_count !=2 ) {
cerr << "expected one argument" << endl;
return EXIT_FAILURE;
}
string line;
ifstream myfile (arg_vec[1]);
map <string,int> Tags;
if (myfile.is_open())
{
while (getline(myfile,line) )
{
stringstream ss(line);
string Elem;
while (ss >> Elem) {
Tags[Elem] = 1;
}
}
myfile.close();
}
else { cout << "Unable to open file";}
for (map <string,int>::iterator iter = Tags.begin(); iter !=
Tags.end();iter++) {
cout << (*iter).first << endl;
}
return 0;
}
9 个解决方案
#1
This depends a bit on the characteristics of your dataset. In the worse case, where all strings are unique, you will need either O(n) memory to record your seen-set, or O(n^2) time to re-scan the entire file on each word. However, there are improvements that can be made.
这取决于数据集的特征。在更糟糕的情况下,在所有字符串都是唯一的情况下,您将需要O(n)内存来记录您的看见集,或者需要O(n ^ 2)时间来重新扫描每个单词上的整个文件。但是,可以进行改进。
First off, if your dataset only consists of 3-digit integers, then a simple array of 1000 bools will be much more memory effieicnt than a map.
首先,如果您的数据集只包含3位整数,那么1000个bool的简单数组将比地图具有更高的内存效率。
If you're using general data, then another good approach would be to sort the set, so copies of the same string end up adjacent, then simply remove adjacent words. There are well-researched algorithms for sorting a dataset too large to fit in memory. This is most effective when a large percentage of the words in the set are unique, and thus holding a set of seen words in memory is prohibitively expensive.
如果你使用的是通用数据,那么另一个好的方法是对集合进行排序,因此相同字符串的副本最终会相邻,然后只需删除相邻的单词。有很好的研究算法可以对数据集进行排序,因为数据集太大而无法适应内存。当集合中的大部分单词是唯一的时,这是最有效的,因此在存储器中保持一组看到的单词非常昂贵。
Incidentally, this can be implemented easily with a shell pipeline, as GNU sort does the external sort for you:
顺便说一下,这可以通过shell管道轻松实现,因为GNU sort为你做了外部排序:
tr " " "\n" < testdata | LC_ALL=C sort | uniq
Passing LC_ALL=C to sort disables locale processing and multibyte character set support, which can give a significant speed boost here.
传递LC_ALL = C进行排序会禁用区域设置处理和多字节字符集支持,这可以显着提高速度。
#2
O(1) memory [ram]:
O(1)记忆[ram]:
If you want to use no memory at all (besides a couple temp variables) you could simply read 1 entry at a time and add it to the output file if it doesn't already exist in the output file. This would be slower on time though since you'd have to read 1 entry at a time from the output file.
如果你想根本不使用任何内存(除了几个临时变量),你可以简单地一次读取1个条目,如果它不存在于输出文件中,则将其添加到输出文件中。由于您必须一次从输出文件中读取1个条目,因此时间会慢一些。
You could insert the entry into the output file in alphabetical order though and then you would be able to see if the entry already exists or not in O(logn) time via binary search per entry being inserted. To actually insert you need to re-create the file though which is O(nlogn). You do this n times for each input string, so overall the algorithm would run in O(n^2logn) (which includes lookup to find insertion pos + insertion) and use O(1) RAM memory.
您可以按字母顺序将条目插入到输出文件中,然后通过插入每个条目的二进制搜索,您可以在O(logn)时间内查看条目是否已存在。要实际插入,您需要重新创建文件,尽管是O(nlogn)。你对每个输入字符串执行n次操作,因此整个算法将在O(n ^ 2logn)中运行(包括查找以查找插入位置+插入)并使用O(1)RAM内存。
Since your output file is already in alphabetical order though future lookups would also only be O(logn) via binary search.
由于您的输出文件已按字母顺序排列,但未来的查找也只能通过二进制搜索进行O(logn)。
You could also minimize the re-creation phase of the file by leaving excessive space between entries in the file. WHen the algorithm was done you could do a vacuum on the file. This would bring it down to O(nlogn).
您还可以通过在文件中的条目之间留出过多空间来最小化文件的重新创建阶段。算法完成后,您可以对文件执行真空操作。这将把它归结为O(nlogn)。
Another way to reduce memory:
另一种减少记忆的方法:
If it's common that your strings share common prefixes then you can use a trie and probably use less memory overall since you mentioned your strings are > length 36. This would still use a lot of memory though.
如果您的字符串共享公共前缀是很常见的,那么您可以使用trie并且可能使用更少的内存,因为您提到您的字符串>长度36.这仍然会使用大量内存。
#3
Well, std::set might be slightly faster and use less memory than std::map.
好吧,std :: set可能会比std :: map稍微快些一些并且使用更少的内存。
#4
It seems given that large number of entries, there will be a reasonable amount of overlap in the sequences of symbols. You could build tree using the entries at each position each sequence as nodes. Say you have an entry 12345 and 12346 then the first four entries in the sequence overlap and thus could be stored in a tree with 6 nodes.
似乎给出了大量的条目,符号序列中将存在合理的重叠量。您可以使用每个位置的条目构建树,每个序列作为节点。假设您有一个条目12345和12346,那么序列中的前四个条目重叠,因此可以存储在具有6个节点的树中。
You could walk the tree to see if a given symbol is contained at a given position, if not you would add it. When you reach the end of the given string you would just mark that node as terminating a string node. Reproducing the unique entries would be a matter of a depth first traversal of the tree the path from the root node to each terminator represents a unique entry.
你可以走树,看看给定的符号是否包含在给定的位置,如果没有,你会添加它。当您到达给定字符串的末尾时,您只需将该节点标记为终止字符串节点。再现唯一条目将是树的深度优先遍历的问题,从根节点到每个终结器的路径表示唯一条目。
If you partition the dataset, say into X thousand line chunks and aggregate the trees it would make a nice map-reduce job.
如果你对数据集进行分区,比如说进入X千行的块并聚合树,那么它将成为一个很好的map-reduce作业。
You're looking at a node space of 10^36 so if the data is entirely random you're looking at a large possible number of nodes. If there's a good deal of overlap and a smaller number of unique entries you will probably find you use a good deal less
您正在查看10 ^ 36的节点空间,因此如果数据完全是随机的,那么您正在查看可能数量很大的节点。如果存在大量重叠和较少数量的唯一条目,您可能会发现您使用的优惠更少
#5
You could probably just write an in-place sort for it. You're still going to have to load the file to memory though, because in-place sorting with IO wouldn't be efficient. You'd have to read and write for every comparison.
你可能只是为它写一个就地排序。您仍然需要将文件加载到内存中,因为使用IO进行就地排序效率不高。你必须为每次比较阅读和写作。
#6
std::set
would be better, but you should look into hash sets. Those are not available in the current C++ standard library (although it is supposed to be in C++0x's) but some compilers have implementations. Visual C++ has stdext::hash_set and boost has some kind of stuff for this, see Boost Multi-index Containers Library.
std :: set会更好,但你应该研究哈希集。那些在当前的C ++标准库中不可用(尽管它应该在C ++ 0x中),但是一些编译器有实现。 Visual C ++有stdext :: hash_set和boost有一些东西,请参阅Boost Multi-index Containers Library。
#7
Try STXXL as an external stl for huge datasets.
尝试将STXXL作为大型数据集的外部stl。
#8
The solution I would suggest is to use memory mapped file to access the data and radix sort to sort the list. This is trivial if the strings are of same length. If the strings are of different lengths, use radix sort for a fast presorting using the n first characters, then sort the unsorted sub lists with whatever method is most appropriate. With very short sublists, a simple bubble sort would do it. With longer lists use quick sort or use the STL set with a custom class providing the compare operator.
我建议的解决方案是使用内存映射文件来访问数据和基数排序来对列表进行排序。如果字符串长度相同,这是微不足道的。如果字符串具有不同的长度,则使用基数排序来使用n个第一个字符进行快速预分类,然后使用最合适的方法对未排序的子列表进行排序。使用非常短的子列表,可以执行简单的冒泡操作。使用较长的列表使用快速排序或使用STL集与提供比较运算符的自定义类。
With memory mapping and radix sort, the memory requirement and performance would be optimal. All you need is the mapping space (~size of file) and a table of 32bit values with one value per string to hold the linked list for radix sort.
通过内存映射和基数排序,内存要求和性能将是最佳的。您所需要的只是映射空间(文件大小)和32位值表,每个字符串有一个值来保存基数列表以进行基数排序。
You could also save memory and speed up the sorting by using a more compact encoding of the strings. A simple encoding would use 2 bits per character, using values 1,2 and 3 for the three letters and 0 to signal the end of string. Or more efficient, use a base 3 encoding, packed into integers and encode the length of string in front. Let say you have characters c1, c2, C3, c4 the encoding would yield the integer c1*3^3 + c2*3^2 + c3*3^1 + c4*3^0. This suppose you assign a value from 0 to 2 to each character. By using this encoding you'll save memory and will optimize sorting if the number of strings to sort is huge.
您还可以使用更紧凑的字符串编码来节省内存并加快排序速度。一个简单的编码将使用每个字符2位,使用值1,2和3表示三个字母,0表示字符串结束。或者更高效,使用base 3编码,打包成整数并编码前面的字符串长度。假设您有字符c1,c2,C3,c4,编码将产生整数c1 * 3 ^ 3 + c2 * 3 ^ 2 + c3 * 3 ^ 1 + c4 * 3 ^ 0。这假设您为每个字符分配一个0到2的值。通过使用此编码,您将节省内存,并且如果要排序的字符串数量很大,将优化排序。
#9
You can look at the excellent library Judy arrays. A compact trie based, very fast and memory efficient data structure witch scale to billons strings. Better than any search-tree. You can use the JudySL functions to your purpose. You can use it similar to your program, but it is much faster and more memory efficient.
你可以看看优秀的Judy数组库。基于紧凑的基于trie,非常快速且内存有效的数据结构,可扩展到字符串。比任何搜索树都好。您可以将JudySL功能用于您的目的。您可以使用它类似于您的程序,但速度更快,内存效率更高。
#1
This depends a bit on the characteristics of your dataset. In the worse case, where all strings are unique, you will need either O(n) memory to record your seen-set, or O(n^2) time to re-scan the entire file on each word. However, there are improvements that can be made.
这取决于数据集的特征。在更糟糕的情况下,在所有字符串都是唯一的情况下,您将需要O(n)内存来记录您的看见集,或者需要O(n ^ 2)时间来重新扫描每个单词上的整个文件。但是,可以进行改进。
First off, if your dataset only consists of 3-digit integers, then a simple array of 1000 bools will be much more memory effieicnt than a map.
首先,如果您的数据集只包含3位整数,那么1000个bool的简单数组将比地图具有更高的内存效率。
If you're using general data, then another good approach would be to sort the set, so copies of the same string end up adjacent, then simply remove adjacent words. There are well-researched algorithms for sorting a dataset too large to fit in memory. This is most effective when a large percentage of the words in the set are unique, and thus holding a set of seen words in memory is prohibitively expensive.
如果你使用的是通用数据,那么另一个好的方法是对集合进行排序,因此相同字符串的副本最终会相邻,然后只需删除相邻的单词。有很好的研究算法可以对数据集进行排序,因为数据集太大而无法适应内存。当集合中的大部分单词是唯一的时,这是最有效的,因此在存储器中保持一组看到的单词非常昂贵。
Incidentally, this can be implemented easily with a shell pipeline, as GNU sort does the external sort for you:
顺便说一下,这可以通过shell管道轻松实现,因为GNU sort为你做了外部排序:
tr " " "\n" < testdata | LC_ALL=C sort | uniq
Passing LC_ALL=C to sort disables locale processing and multibyte character set support, which can give a significant speed boost here.
传递LC_ALL = C进行排序会禁用区域设置处理和多字节字符集支持,这可以显着提高速度。
#2
O(1) memory [ram]:
O(1)记忆[ram]:
If you want to use no memory at all (besides a couple temp variables) you could simply read 1 entry at a time and add it to the output file if it doesn't already exist in the output file. This would be slower on time though since you'd have to read 1 entry at a time from the output file.
如果你想根本不使用任何内存(除了几个临时变量),你可以简单地一次读取1个条目,如果它不存在于输出文件中,则将其添加到输出文件中。由于您必须一次从输出文件中读取1个条目,因此时间会慢一些。
You could insert the entry into the output file in alphabetical order though and then you would be able to see if the entry already exists or not in O(logn) time via binary search per entry being inserted. To actually insert you need to re-create the file though which is O(nlogn). You do this n times for each input string, so overall the algorithm would run in O(n^2logn) (which includes lookup to find insertion pos + insertion) and use O(1) RAM memory.
您可以按字母顺序将条目插入到输出文件中,然后通过插入每个条目的二进制搜索,您可以在O(logn)时间内查看条目是否已存在。要实际插入,您需要重新创建文件,尽管是O(nlogn)。你对每个输入字符串执行n次操作,因此整个算法将在O(n ^ 2logn)中运行(包括查找以查找插入位置+插入)并使用O(1)RAM内存。
Since your output file is already in alphabetical order though future lookups would also only be O(logn) via binary search.
由于您的输出文件已按字母顺序排列,但未来的查找也只能通过二进制搜索进行O(logn)。
You could also minimize the re-creation phase of the file by leaving excessive space between entries in the file. WHen the algorithm was done you could do a vacuum on the file. This would bring it down to O(nlogn).
您还可以通过在文件中的条目之间留出过多空间来最小化文件的重新创建阶段。算法完成后,您可以对文件执行真空操作。这将把它归结为O(nlogn)。
Another way to reduce memory:
另一种减少记忆的方法:
If it's common that your strings share common prefixes then you can use a trie and probably use less memory overall since you mentioned your strings are > length 36. This would still use a lot of memory though.
如果您的字符串共享公共前缀是很常见的,那么您可以使用trie并且可能使用更少的内存,因为您提到您的字符串>长度36.这仍然会使用大量内存。
#3
Well, std::set might be slightly faster and use less memory than std::map.
好吧,std :: set可能会比std :: map稍微快些一些并且使用更少的内存。
#4
It seems given that large number of entries, there will be a reasonable amount of overlap in the sequences of symbols. You could build tree using the entries at each position each sequence as nodes. Say you have an entry 12345 and 12346 then the first four entries in the sequence overlap and thus could be stored in a tree with 6 nodes.
似乎给出了大量的条目,符号序列中将存在合理的重叠量。您可以使用每个位置的条目构建树,每个序列作为节点。假设您有一个条目12345和12346,那么序列中的前四个条目重叠,因此可以存储在具有6个节点的树中。
You could walk the tree to see if a given symbol is contained at a given position, if not you would add it. When you reach the end of the given string you would just mark that node as terminating a string node. Reproducing the unique entries would be a matter of a depth first traversal of the tree the path from the root node to each terminator represents a unique entry.
你可以走树,看看给定的符号是否包含在给定的位置,如果没有,你会添加它。当您到达给定字符串的末尾时,您只需将该节点标记为终止字符串节点。再现唯一条目将是树的深度优先遍历的问题,从根节点到每个终结器的路径表示唯一条目。
If you partition the dataset, say into X thousand line chunks and aggregate the trees it would make a nice map-reduce job.
如果你对数据集进行分区,比如说进入X千行的块并聚合树,那么它将成为一个很好的map-reduce作业。
You're looking at a node space of 10^36 so if the data is entirely random you're looking at a large possible number of nodes. If there's a good deal of overlap and a smaller number of unique entries you will probably find you use a good deal less
您正在查看10 ^ 36的节点空间,因此如果数据完全是随机的,那么您正在查看可能数量很大的节点。如果存在大量重叠和较少数量的唯一条目,您可能会发现您使用的优惠更少
#5
You could probably just write an in-place sort for it. You're still going to have to load the file to memory though, because in-place sorting with IO wouldn't be efficient. You'd have to read and write for every comparison.
你可能只是为它写一个就地排序。您仍然需要将文件加载到内存中,因为使用IO进行就地排序效率不高。你必须为每次比较阅读和写作。
#6
std::set
would be better, but you should look into hash sets. Those are not available in the current C++ standard library (although it is supposed to be in C++0x's) but some compilers have implementations. Visual C++ has stdext::hash_set and boost has some kind of stuff for this, see Boost Multi-index Containers Library.
std :: set会更好,但你应该研究哈希集。那些在当前的C ++标准库中不可用(尽管它应该在C ++ 0x中),但是一些编译器有实现。 Visual C ++有stdext :: hash_set和boost有一些东西,请参阅Boost Multi-index Containers Library。
#7
Try STXXL as an external stl for huge datasets.
尝试将STXXL作为大型数据集的外部stl。
#8
The solution I would suggest is to use memory mapped file to access the data and radix sort to sort the list. This is trivial if the strings are of same length. If the strings are of different lengths, use radix sort for a fast presorting using the n first characters, then sort the unsorted sub lists with whatever method is most appropriate. With very short sublists, a simple bubble sort would do it. With longer lists use quick sort or use the STL set with a custom class providing the compare operator.
我建议的解决方案是使用内存映射文件来访问数据和基数排序来对列表进行排序。如果字符串长度相同,这是微不足道的。如果字符串具有不同的长度,则使用基数排序来使用n个第一个字符进行快速预分类,然后使用最合适的方法对未排序的子列表进行排序。使用非常短的子列表,可以执行简单的冒泡操作。使用较长的列表使用快速排序或使用STL集与提供比较运算符的自定义类。
With memory mapping and radix sort, the memory requirement and performance would be optimal. All you need is the mapping space (~size of file) and a table of 32bit values with one value per string to hold the linked list for radix sort.
通过内存映射和基数排序,内存要求和性能将是最佳的。您所需要的只是映射空间(文件大小)和32位值表,每个字符串有一个值来保存基数列表以进行基数排序。
You could also save memory and speed up the sorting by using a more compact encoding of the strings. A simple encoding would use 2 bits per character, using values 1,2 and 3 for the three letters and 0 to signal the end of string. Or more efficient, use a base 3 encoding, packed into integers and encode the length of string in front. Let say you have characters c1, c2, C3, c4 the encoding would yield the integer c1*3^3 + c2*3^2 + c3*3^1 + c4*3^0. This suppose you assign a value from 0 to 2 to each character. By using this encoding you'll save memory and will optimize sorting if the number of strings to sort is huge.
您还可以使用更紧凑的字符串编码来节省内存并加快排序速度。一个简单的编码将使用每个字符2位,使用值1,2和3表示三个字母,0表示字符串结束。或者更高效,使用base 3编码,打包成整数并编码前面的字符串长度。假设您有字符c1,c2,C3,c4,编码将产生整数c1 * 3 ^ 3 + c2 * 3 ^ 2 + c3 * 3 ^ 1 + c4 * 3 ^ 0。这假设您为每个字符分配一个0到2的值。通过使用此编码,您将节省内存,并且如果要排序的字符串数量很大,将优化排序。
#9
You can look at the excellent library Judy arrays. A compact trie based, very fast and memory efficient data structure witch scale to billons strings. Better than any search-tree. You can use the JudySL functions to your purpose. You can use it similar to your program, but it is much faster and more memory efficient.
你可以看看优秀的Judy数组库。基于紧凑的基于trie,非常快速且内存有效的数据结构,可扩展到字符串。比任何搜索树都好。您可以将JudySL功能用于您的目的。您可以使用它类似于您的程序,但速度更快,内存效率更高。