I need to store a 2d matrix containing zip codes and the distance in km between each one of them. My client has an application that calculates the distances which are then stored in an Excel file. Currently, there are 952 places. So the matrix would have 952x952 = 906304 entries.
我需要存储一个包含邮政编码的二维矩阵,以及每个矩阵之间的距离。我的客户端有一个计算距离的应用程序,然后存储在Excel文件中。目前,有952个地方。因此矩阵将具有952x952 = 906304个条目。
I tried to map this into a HashMap[Integer, Float]. The Integer is the hash code of the two Strings for two places, e.g. "A" and "B". The float value is the distance in km between them.
我试图将其映射到HashMap [Integer,Float]。 Integer是两个字符串的哈希码,例如: “A”和“B”。浮点值是它们之间以km为单位的距离。
While filling in the data I run into OutOfMemoryExceptions after 205k entries. Do you have a tip how I can store this in a clever way? I even don't know if it's clever to have the whole bunch in memory. My options are SQL and MS Access...
填写数据时,我会在205k条目后运行OutOfMemoryExceptions。你有一个提示,我怎么能以聪明的方式存储它?我甚至不知道在整个记忆中是否都很聪明。我的选择是SQL和MS Access ......
The problem is that I need to access the data very quickly and possibly very often which is why I chose the HashMap because it runs in O(1) for the look up.
问题是我需要非常快速地访问数据,这可能是我选择HashMap的原因,因为它在O(1)中运行以进行查找。
Thansk for your replies and suggestions!
Thansk的回复和建议!
Marco
马尔科
9 个解决方案
#1
2
Can you simply boost the memory available to the JVM ?
你能简单地增加JVM可用的内存吗?
java -Xmx512m ...
By default the maximum memory configuration is 64Mb. Some more tuning tips here. If you can do this then you can keep the data in-process and maximise the performance (i.e. you don't need to calculate on the fly).
默认情况下,最大内存配置为64Mb。这里还有一些调整技巧。如果您可以这样做,那么您可以保持数据在进行中并最大化性能(即您不需要动态计算)。
#2
8
A 2d array would be more memory efficient. You can use a small hashmap to map the 952 places into a number between 0 and 951 . Then, just do:
2d阵列将更有效地记忆。您可以使用小型散列映射将952个位置映射到0到951之间的数字。那么,就这样做:
float[][] distances= new float[952][952];
To look things up, just use two hash lookups to convert the two places into two integers, and use them as indexes into the 2d array.
要查找,只需使用两个哈希查找将两个位置转换为两个整数,并将它们用作二维数组的索引。
By doing it this way, you avoid the boxing of floats, and also the memory overhead of the large hashmap.
通过这种方式,你可以避免浮动的装箱,以及大型hashmap的内存开销。
However, 906304 really isn't that many entries, you may just need to increase the Xmx maximum heap size
但是,906304确实没有那么多条目,你可能只需要增加Xmx的最大堆大小
#3
5
I would have thought that you could calculate the distances on the fly. Presumably someone has already done this, so you simply need to find out what algorithm they used, and the input data; e.g. longitude/latitude of the notional centres of each ZIP code.
我原以为你可以在飞行中计算距离。大概有人已经这样做了,所以你只需要找出他们使用的算法和输入数据;例如每个邮政编码的名义中心的经度/纬度。
EDIT: There are two commonly used algorithms for finding the (approximate) geodesic distance between two points given by longitude/latitude pairs.
编辑:有两种常用的算法用于查找由经度/纬度对给出的两点之间的(近似)测地距离。
-
The Vicenty formula is based on an ellipsoid approximation. It is more accurate, but more complicated to implement.
Vicenty公式基于椭球近似。它更准确,但实施起来更复杂。
-
The Haversine formula is based on a spherical approximation. It is less accurate (0.3%), but simpler to implement.
Haversine公式基于球面近似。它不太准确(0.3%),但实施起来更简单。
#4
2
I upvoted Chi's and Benjamin's answers, because they're telling you what you need to do, but while I'm here, I'd like to stress that using the hashcode of the two strings directly will get you into trouble. You're likely to run into the problem of hash collisions.
我赞同Chi和Benjamin的答案,因为他们告诉你你需要做什么,但是当我在这里时,我想强调直接使用两个字符串的哈希码会让你陷入困境。您可能会遇到哈希冲突的问题。
This would not be a problem if you were concatenating the two strings (being careful to use a delimiter which cannot appear in the place designators), and letting HashMap do its magic, but the method you suggested, using the hashcodes for the two strings as a key, that's going to get you into trouble.
如果你连接两个字符串(小心使用不能出现在地方指示符中的分隔符),并让HashMap发挥其魔力,但是你建议的方法,使用两个字符串的哈希码,这不会是一个问题。一把钥匙,这会让你陷入困境。
#5
1
You will simply need more memory. When starting your Java process, kick it off like so:
你只需要更多的内存。在启动Java进程时,请将其启动,如下所示:
java -Xmx256M MyClass
java -Xmx256M MyClass
The -Xmx defines the max heap size, so this says the process can use up to 256 MB of memory for the heap. If you still run out, keep bumping that number up until you hit the physical limit.
-Xmx定义了最大堆大小,因此这表示进程最多可以为堆使用256 MB内存。如果你仍然用光,请继续撞到该数字,直到达到物理极限。
#6
1
Lately I've managed similar requisites for my master thesis.
最近我为我的硕士论文设定了类似的要求。
I ended with a Matrix class that uses a double[]
, not a double[][]
, in order to alleviate double deref costs (data[i]
that is an array, then array[i][j]
that is a double
) while allowing the VM to allocate a big, contiguous chunk of memory:
我以一个使用double []而不是double [] []的Matrix类结束,以减轻双重deref成本(data [i]是一个数组,然后array [i] [j]是一个double )允许VM分配一大块连续的内存:
public class Matrix {
private final double data[];
private final int rows;
private final int columns;
public Matrix(int rows, int columns, double[][] initializer) {
this.rows = rows;
this.columns = columns;
this.data = new double[rows * columns];
int k = 0;
for (int i = 0; i < initializer.length; i++) {
System.arraycopy(initializer[i], 0, data, k, initializer[i].length);
k += initializer[i].length;
}
}
public Matrix set(int i, int j, double value) {
data[j + i * columns] = value;
return this;
}
public double get(int i, int j) {
return data[j + i * columns];
}
}
this class should use less memory than an HashMap
since it uses a primitive array (no boxing needed): it needs only 906304 * 8 ~ 8 Mb
(for doubles) or 906304 * 4 ~ 4 Mb
(for floats). My 2 cents.
这个类应该使用比HashMap更少的内存,因为它使用原始数组(不需要装箱):它只需要906304 * 8~8 Mb(用于双打)或906304 * 4~4 Mb(用于浮点数)。我的2美分。
NB I've omitted some sanity checks for simplicity's sake
NB为了简单起见,我省略了一些健全性检查
#7
1
Stephen C. has a good point: if the distances are as-the-crow-flies, then you could probably save memory by doing some calculations on the fly. All you'd need is space for the longitude and latitude for 952 zip codes and then you could use the vicenty formula to do your calculation when you need to. This would make your memory usage O(n) in zipcodes.
斯蒂芬C.有一个很好的观点:如果距离像蝇一样苍蝇,那么你可以通过动态计算来节省记忆。所有你需要的是952邮政编码的经度和纬度的空间,然后你可以使用副词公式在你需要时进行计算。这将使您的内存使用O(n)在zipcodes中。
Of course, that solution makes some assumptions that may turn out to be false in your particular case, i.e. that you have longitude and latitude data for your zipcodes and that you're concerned with as-the-crow-flies distances and not something more complicated like driving directions.
当然,该解决方案会做出一些假设,在您的特定情况下可能会变得错误,即您拥有拉链码的经度和纬度数据,并且您关注的是蝇蝇距离而不是更多像行车路线一样复杂。
If those assumptions are true though, trading a few computes for a whole bunch of memory might help you scale in the future if you ever need to handle a bigger dataset.
如果这些假设是正确的,那么如果您需要处理更大的数据集,那么为一大堆内存交换一些计算可能会帮助您在将来扩展。
#8
1
The above suggestions regarding heap size will be helpful. However, I am not sure if you gave an accurate description of the size of your matrix.
关于堆大小的上述建议将有所帮助。但是,我不确定您是否准确描述了矩阵的大小。
Suppose you have 4 locations. Then you need to assess the distances between A->B, A->C, A->D, B->C, B->D, C->D. This suggests six entries in your HashMap (4 choose 2).
假设您有4个位置。然后,您需要评估A-> B,A-> C,A-> D,B-> C,B-> D,C-> D之间的距离。这表示您的HashMap中有六个条目(4个选择2)。
That would lead me to believe the actual optimal size of your HashMap is (952 choose 2)=452,676; NOT 952x952=906,304.
这会让我相信你的HashMap的实际最佳大小是(952选择2)= 452,676; NOT 952x952 = 906,304。
This is all assuming, of course, that you only store one-way relationships (i.e. from A->B, but not from B->A, since that is redundant), which I would recommend since you are already experiencing problems with memory space.
当然,这只是假设你只存储单向关系(即从A-> B,但不是从B-> A,因为那是多余的),我建议你因为你已经遇到内存问题空间。
Edit: Should have said that the size of your matrix is not optimal, rather than saying the description was not accurate.
编辑:应该说你的矩阵的大小不是最优的,而不是说描述不准确。
#9
0
Create a new class with 2 slots for location names. Have it always put the alphabetically first name in the first slot. Give it a proper equals and hashcode method. Give it a compareTo (e.g. order alphabetically by names). Throw them all in an array. Sort it.
创建一个包含2个位置名称的新类。它始终将字母顺序的名字放在第一个插槽中。给它一个适当的equals和hashcode方法。给它一个compareTo(例如按名称的字母顺序排序)。把它们全部扔进阵列中。把它分类。
Also, hash1 = hash2 does not imply object1 = object2. Don't ever do this. It's a hack.
另外,hash1 = hash2并不意味着object1 = object2。不要这样做。这是一个黑客。
#1
2
Can you simply boost the memory available to the JVM ?
你能简单地增加JVM可用的内存吗?
java -Xmx512m ...
By default the maximum memory configuration is 64Mb. Some more tuning tips here. If you can do this then you can keep the data in-process and maximise the performance (i.e. you don't need to calculate on the fly).
默认情况下,最大内存配置为64Mb。这里还有一些调整技巧。如果您可以这样做,那么您可以保持数据在进行中并最大化性能(即您不需要动态计算)。
#2
8
A 2d array would be more memory efficient. You can use a small hashmap to map the 952 places into a number between 0 and 951 . Then, just do:
2d阵列将更有效地记忆。您可以使用小型散列映射将952个位置映射到0到951之间的数字。那么,就这样做:
float[][] distances= new float[952][952];
To look things up, just use two hash lookups to convert the two places into two integers, and use them as indexes into the 2d array.
要查找,只需使用两个哈希查找将两个位置转换为两个整数,并将它们用作二维数组的索引。
By doing it this way, you avoid the boxing of floats, and also the memory overhead of the large hashmap.
通过这种方式,你可以避免浮动的装箱,以及大型hashmap的内存开销。
However, 906304 really isn't that many entries, you may just need to increase the Xmx maximum heap size
但是,906304确实没有那么多条目,你可能只需要增加Xmx的最大堆大小
#3
5
I would have thought that you could calculate the distances on the fly. Presumably someone has already done this, so you simply need to find out what algorithm they used, and the input data; e.g. longitude/latitude of the notional centres of each ZIP code.
我原以为你可以在飞行中计算距离。大概有人已经这样做了,所以你只需要找出他们使用的算法和输入数据;例如每个邮政编码的名义中心的经度/纬度。
EDIT: There are two commonly used algorithms for finding the (approximate) geodesic distance between two points given by longitude/latitude pairs.
编辑:有两种常用的算法用于查找由经度/纬度对给出的两点之间的(近似)测地距离。
-
The Vicenty formula is based on an ellipsoid approximation. It is more accurate, but more complicated to implement.
Vicenty公式基于椭球近似。它更准确,但实施起来更复杂。
-
The Haversine formula is based on a spherical approximation. It is less accurate (0.3%), but simpler to implement.
Haversine公式基于球面近似。它不太准确(0.3%),但实施起来更简单。
#4
2
I upvoted Chi's and Benjamin's answers, because they're telling you what you need to do, but while I'm here, I'd like to stress that using the hashcode of the two strings directly will get you into trouble. You're likely to run into the problem of hash collisions.
我赞同Chi和Benjamin的答案,因为他们告诉你你需要做什么,但是当我在这里时,我想强调直接使用两个字符串的哈希码会让你陷入困境。您可能会遇到哈希冲突的问题。
This would not be a problem if you were concatenating the two strings (being careful to use a delimiter which cannot appear in the place designators), and letting HashMap do its magic, but the method you suggested, using the hashcodes for the two strings as a key, that's going to get you into trouble.
如果你连接两个字符串(小心使用不能出现在地方指示符中的分隔符),并让HashMap发挥其魔力,但是你建议的方法,使用两个字符串的哈希码,这不会是一个问题。一把钥匙,这会让你陷入困境。
#5
1
You will simply need more memory. When starting your Java process, kick it off like so:
你只需要更多的内存。在启动Java进程时,请将其启动,如下所示:
java -Xmx256M MyClass
java -Xmx256M MyClass
The -Xmx defines the max heap size, so this says the process can use up to 256 MB of memory for the heap. If you still run out, keep bumping that number up until you hit the physical limit.
-Xmx定义了最大堆大小,因此这表示进程最多可以为堆使用256 MB内存。如果你仍然用光,请继续撞到该数字,直到达到物理极限。
#6
1
Lately I've managed similar requisites for my master thesis.
最近我为我的硕士论文设定了类似的要求。
I ended with a Matrix class that uses a double[]
, not a double[][]
, in order to alleviate double deref costs (data[i]
that is an array, then array[i][j]
that is a double
) while allowing the VM to allocate a big, contiguous chunk of memory:
我以一个使用double []而不是double [] []的Matrix类结束,以减轻双重deref成本(data [i]是一个数组,然后array [i] [j]是一个double )允许VM分配一大块连续的内存:
public class Matrix {
private final double data[];
private final int rows;
private final int columns;
public Matrix(int rows, int columns, double[][] initializer) {
this.rows = rows;
this.columns = columns;
this.data = new double[rows * columns];
int k = 0;
for (int i = 0; i < initializer.length; i++) {
System.arraycopy(initializer[i], 0, data, k, initializer[i].length);
k += initializer[i].length;
}
}
public Matrix set(int i, int j, double value) {
data[j + i * columns] = value;
return this;
}
public double get(int i, int j) {
return data[j + i * columns];
}
}
this class should use less memory than an HashMap
since it uses a primitive array (no boxing needed): it needs only 906304 * 8 ~ 8 Mb
(for doubles) or 906304 * 4 ~ 4 Mb
(for floats). My 2 cents.
这个类应该使用比HashMap更少的内存,因为它使用原始数组(不需要装箱):它只需要906304 * 8~8 Mb(用于双打)或906304 * 4~4 Mb(用于浮点数)。我的2美分。
NB I've omitted some sanity checks for simplicity's sake
NB为了简单起见,我省略了一些健全性检查
#7
1
Stephen C. has a good point: if the distances are as-the-crow-flies, then you could probably save memory by doing some calculations on the fly. All you'd need is space for the longitude and latitude for 952 zip codes and then you could use the vicenty formula to do your calculation when you need to. This would make your memory usage O(n) in zipcodes.
斯蒂芬C.有一个很好的观点:如果距离像蝇一样苍蝇,那么你可以通过动态计算来节省记忆。所有你需要的是952邮政编码的经度和纬度的空间,然后你可以使用副词公式在你需要时进行计算。这将使您的内存使用O(n)在zipcodes中。
Of course, that solution makes some assumptions that may turn out to be false in your particular case, i.e. that you have longitude and latitude data for your zipcodes and that you're concerned with as-the-crow-flies distances and not something more complicated like driving directions.
当然,该解决方案会做出一些假设,在您的特定情况下可能会变得错误,即您拥有拉链码的经度和纬度数据,并且您关注的是蝇蝇距离而不是更多像行车路线一样复杂。
If those assumptions are true though, trading a few computes for a whole bunch of memory might help you scale in the future if you ever need to handle a bigger dataset.
如果这些假设是正确的,那么如果您需要处理更大的数据集,那么为一大堆内存交换一些计算可能会帮助您在将来扩展。
#8
1
The above suggestions regarding heap size will be helpful. However, I am not sure if you gave an accurate description of the size of your matrix.
关于堆大小的上述建议将有所帮助。但是,我不确定您是否准确描述了矩阵的大小。
Suppose you have 4 locations. Then you need to assess the distances between A->B, A->C, A->D, B->C, B->D, C->D. This suggests six entries in your HashMap (4 choose 2).
假设您有4个位置。然后,您需要评估A-> B,A-> C,A-> D,B-> C,B-> D,C-> D之间的距离。这表示您的HashMap中有六个条目(4个选择2)。
That would lead me to believe the actual optimal size of your HashMap is (952 choose 2)=452,676; NOT 952x952=906,304.
这会让我相信你的HashMap的实际最佳大小是(952选择2)= 452,676; NOT 952x952 = 906,304。
This is all assuming, of course, that you only store one-way relationships (i.e. from A->B, but not from B->A, since that is redundant), which I would recommend since you are already experiencing problems with memory space.
当然,这只是假设你只存储单向关系(即从A-> B,但不是从B-> A,因为那是多余的),我建议你因为你已经遇到内存问题空间。
Edit: Should have said that the size of your matrix is not optimal, rather than saying the description was not accurate.
编辑:应该说你的矩阵的大小不是最优的,而不是说描述不准确。
#9
0
Create a new class with 2 slots for location names. Have it always put the alphabetically first name in the first slot. Give it a proper equals and hashcode method. Give it a compareTo (e.g. order alphabetically by names). Throw them all in an array. Sort it.
创建一个包含2个位置名称的新类。它始终将字母顺序的名字放在第一个插槽中。给它一个适当的equals和hashcode方法。给它一个compareTo(例如按名称的字母顺序排序)。把它们全部扔进阵列中。把它分类。
Also, hash1 = hash2 does not imply object1 = object2. Don't ever do this. It's a hack.
另外,hash1 = hash2并不意味着object1 = object2。不要这样做。这是一个黑客。