I have a big file, expected to be around 12 GB. I want to load it all into memory on a beefy 64-bit machine with 16 GB RAM, but I think Java does not support byte arrays that big:
File f = new File(file);
long size = f.length();
byte data[] = new byte[size]; // <- does not compile, not even on 64bit JVM
Is it possible with Java?
The compile error from the Eclipse compiler is:
Type mismatch: cannot convert from long to int
javac gives:
possible loss of precision
found : long
required: int
byte data[] = new byte[size];
11 answers
#1
18
Java array indices are of type int (4 bytes or 32 bits), so I'm afraid you're limited to 2^31 − 1 or 2147483647 slots in your array. I'd read the data into another data structure, like a 2D array.
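To make the 2D-array idea concrete, here is a minimal sketch of allocating a jagged byte[][] whose rows together hold a long number of bytes. The class name ArrayLimitDemo and the rowSize parameter are made up for illustration; they do not come from the answer above.

```java
final class ArrayLimitDemo {

    /** Number of rows needed to hold totalBytes when each row has at most rowSize bytes. */
    static int rowsNeeded(long totalBytes, int rowSize) {
        // Round up; toIntExact throws if the row count itself would not fit in an int.
        return Math.toIntExact((totalBytes + rowSize - 1) / rowSize);
    }

    /** Allocates a jagged 2D array; every row is rowSize long except possibly the last. */
    static byte[][] allocate(long totalBytes, int rowSize) {
        int rows = rowsNeeded(totalBytes, rowSize);
        byte[][] data = new byte[rows][];
        long remaining = totalBytes;
        for (int i = 0; i < rows; i++) {
            data[i] = new byte[(int) Math.min(rowSize, remaining)];
            remaining -= data[i].length;
        }
        return data;
    }

    public static void main(String[] args) {
        // Small numbers so the demo runs anywhere: 10 bytes in rows of 4 gives rows of 4, 4 and 2.
        for (byte[] row : allocate(10, 4)) {
            System.out.println(row.length);
        }
    }
}
```

For the 12 GB file in the question, a row size of 1 GiB (1 << 30) would give about 12 rows, each comfortably below the int limit.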
#2
13
package com.deans.rtl.util;

import java.io.FileInputStream;
import java.io.IOException;

/**
 * @author william.deans@gmail.com
 *
 * Written to work with byte arrays requiring address space larger than 32 bits.
 */
public class ByteArray64 {

    private static final long CHUNK_SIZE = 1024 * 1024 * 1024; // 1 GiB

    long size;
    byte[][] data;

    public ByteArray64(long size) {
        this.size = size;
        if (size == 0) {
            data = null;
        } else {
            int chunks = (int) (size / CHUNK_SIZE);
            int remainder = (int) (size - ((long) chunks) * CHUNK_SIZE);
            data = new byte[chunks + (remainder == 0 ? 0 : 1)][];
            for (int idx = chunks; --idx >= 0; ) {
                data[idx] = new byte[(int) CHUNK_SIZE];
            }
            if (remainder != 0) {
                data[chunks] = new byte[remainder];
            }
        }
    }

    public byte get(long index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("Error attempting to access data element " + index + ". Array is " + size + " elements long.");
        }
        int chunk = (int) (index / CHUNK_SIZE);
        int offset = (int) (index - (((long) chunk) * CHUNK_SIZE));
        return data[chunk][offset];
    }

    public void set(long index, byte b) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("Error attempting to access data element " + index + ". Array is " + size + " elements long.");
        }
        int chunk = (int) (index / CHUNK_SIZE);
        int offset = (int) (index - (((long) chunk) * CHUNK_SIZE));
        data[chunk][offset] = b;
    }

    /**
     * Simulates a single read which fills the entire array via several smaller reads.
     *
     * @param fileInputStream the stream to read from
     * @throws IOException if the stream ends before the array is full
     */
    public void read(FileInputStream fileInputStream) throws IOException {
        if (size == 0) {
            return;
        }
        for (int idx = 0; idx < data.length; idx++) {
            // read() may return fewer bytes than requested, so keep reading until the chunk is full.
            int filled = 0;
            while (filled < data[idx].length) {
                int n = fileInputStream.read(data[idx], filled, data[idx].length - filled);
                if (n < 0) {
                    throw new IOException("short read");
                }
                filled += n;
            }
        }
    }

    public long size() {
        return size;
    }
}
#3
6
If necessary, you can load the data into an array of arrays, which gives you a maximum of Integer.MAX_VALUE squared bytes, more than even the beefiest machine could hold in memory.
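As a quick check of that upper bound, the arithmetic can be worked out in a few lines. This is only a sketch; the class name JaggedArrayCapacity is made up for illustration.

```java
final class JaggedArrayCapacity {

    /** Bytes addressable by a byte[][] if both dimensions could reach Integer.MAX_VALUE. */
    static long maxBytes() {
        long max = Integer.MAX_VALUE; // widen first so the multiplication is done in 64 bits
        return max * max;             // (2^31 - 1)^2 still fits in a signed long
    }

    public static void main(String[] args) {
        System.out.println(maxBytes()); // 4611686014132420609, just under 2^62 bytes (4 EiB)
    }
}
```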
#4
2
I suggest you define some "block" objects, each of which holds (say) 1 GB in an array, then make an array of those.
#5
2
No, arrays are indexed by ints (except some versions of JavaCard that use shorts). You will need to slice it up into smaller arrays, probably wrapping them in a type that gives you get(long), set(long,byte), etc. With sections of data that large, you might want to map the file using java.nio.
#6
2
You might consider using FileChannel and MappedByteBuffer to memory-map the file:
FileChannel fCh = new RandomAccessFile(file, "rw").getChannel();
long size = fCh.size();
ByteBuffer map = fCh.map(FileChannel.MapMode.READ_WRITE, 0, size);
Edit:
OK, I'm an idiot: it looks like ByteBuffer only takes a 32-bit index as well, which is odd since the size parameter to FileChannel.map is a long... But if you decide to break the file up into multiple 2 GB chunks for loading, I'd still recommend memory-mapped IO, as there can be pretty large performance benefits. You're basically moving all IO responsibility to the OS kernel.
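Here is a rough sketch of what mapping the file in several chunks could look like. The class ChunkedMappedFile and its chunkSize parameter are hypothetical names, not part of java.nio; only FileChannel.open, FileChannel.map and MappedByteBuffer.get are real API calls. It maps read-only to keep the example short.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ChunkedMappedFile {
    private final MappedByteBuffer[] chunks;
    private final long chunkSize;
    private final long size;

    /** Maps the whole file read-only as consecutive regions of at most chunkSize bytes. */
    ChunkedMappedFile(Path path, long chunkSize) throws IOException {
        if (chunkSize <= 0 || chunkSize > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("chunkSize must be in 1.." + Integer.MAX_VALUE);
        }
        long fileSize;
        MappedByteBuffer[] mapped;
        // The mappings stay valid after the channel is closed.
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            fileSize = ch.size();
            int n = Math.toIntExact((fileSize + chunkSize - 1) / chunkSize);
            mapped = new MappedByteBuffer[n];
            for (int i = 0; i < n; i++) {
                long start = i * chunkSize;
                long len = Math.min(chunkSize, fileSize - start);
                mapped[i] = ch.map(FileChannel.MapMode.READ_ONLY, start, len);
            }
        }
        this.chunkSize = chunkSize;
        this.size = fileSize;
        this.chunks = mapped;
    }

    /** Reads one byte at an absolute 64-bit offset. */
    byte get(long index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return chunks[(int) (index / chunkSize)].get((int) (index % chunkSize));
    }

    long size() {
        return size;
    }
}
```

Something like new ChunkedMappedFile(path, 1L << 30) would map a 12 GB file as roughly a dozen 1 GiB regions. Whether that beats plain reads depends on the access pattern and the OS, so treat it as something to measure rather than a guarantee.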
#7
1
Java arrays use integers for their indices. As a result, the maximum array size is Integer.MAX_VALUE.
(Unfortunately, I can't find any proof from Sun themselves about this, but there are plenty of discussions on their forums about it already.)
I think the best solution you could do in the meantime would be to make a 2D array, e.g.:
byte[][] data;
#8
1
As others have said, all Java arrays of all types are indexed by int, and so can have a maximum size of 2^31 − 1, or 2147483647 elements (~2 billion). This is specified by the Java Language Specification, so switching to another operating system or Java Virtual Machine won't help.
If you wanted to write a class to overcome this, as suggested above, you could use an array of arrays (for a lot of flexibility) or change types (a long is 8 bytes, so a long[] can hold 8 times as much data as a byte[]).
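The "change types" idea could look roughly like this: pack eight bytes into each long and do the shifting in get and set. PackedBytes is a made-up name and this is only a sketch under that assumption, not a tuned implementation.

```java
final class PackedBytes {
    private final long[] words; // each long stores 8 consecutive bytes, lowest index in the low bits
    private final long size;

    PackedBytes(long size) {
        if (size < 0 || size > 8L * Integer.MAX_VALUE) {
            throw new IllegalArgumentException("size out of range: " + size);
        }
        this.size = size;
        this.words = new long[(int) ((size + 7) / 8)];
    }

    byte get(long index) {
        checkIndex(index);
        int shift = (int) (index & 7) * 8;
        return (byte) (words[(int) (index >>> 3)] >>> shift);
    }

    void set(long index, byte b) {
        checkIndex(index);
        int word = (int) (index >>> 3);
        int shift = (int) (index & 7) * 8;
        // Clear the target byte, then OR in the new value (masked so negative bytes don't sign-extend).
        words[word] = (words[word] & ~(0xFFL << shift)) | ((b & 0xFFL) << shift);
    }

    long size() {
        return size;
    }

    private void checkIndex(long index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
    }
}
```

A 12 GB file needs about 1.5 billion longs this way, which is still below the int limit, but the JVM has to find that much contiguous heap in one piece, so an array of arrays is usually easier to live with.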
#9
1
I think the idea of memory-mapping the file (using the CPU's virtual memory hardware) is the right approach, except that MappedByteBuffer has the same 2 GB limitation as native arrays. This guy claims to have solved the problem with a pretty simple alternative to MappedByteBuffer:
http://nyeggen.com/post/2014-05-18-memory-mapping-%3E2gb-of-data-in-java/
https://gist.github.com/bnyeggen/c679a5ea6a68503ed19f#file-mmapper-java
Unfortunately, the JVM crashes when you read beyond 500 MB.
#10
1
Don't limit yourself to Integer.MAX_VALUE.
Although this question was asked many years ago, I wanted to contribute a simple example that uses only Java SE, without any external libraries.
First, let's say it's theoretically impossible but practically possible.
A new way to look at it: if an array is an object holding elements, what about an object that holds an array of arrays?
Here's the example:
import java.lang.reflect.Array;
import java.util.ArrayList;
import java.util.List;

/**
 * @author Anosa
 */
public class BigArray<T>
{
    private static final int ARRAY_LENGTH = 1000000;

    public final long length;
    private final List<T[]> arrays;

    public BigArray(long length, Class<T> clazz)
    {
        this.length = length;
        arrays = new ArrayList<>();
        setupInnerArrays(clazz);
    }

    @SuppressWarnings("unchecked") // Array.newInstance returns Object, so the cast to T[] is unchecked
    private void setupInnerArrays(Class<T> clazz)
    {
        long numberOfArrays = length / ARRAY_LENGTH;
        long remainder = length % ARRAY_LENGTH;
        /*
        We could also use Java 8 lambdas and streams:
        LongStream.range(0, numberOfArrays)
                .forEach(i -> arrays.add((T[]) Array.newInstance(clazz, ARRAY_LENGTH)));
        */
        for (long i = 0; i < numberOfArrays; i++)
        {
            arrays.add((T[]) Array.newInstance(clazz, ARRAY_LENGTH));
        }
        if (remainder > 0)
        {
            // The remainder is always less than ARRAY_LENGTH (an int), so the cast cannot overflow.
            arrays.add((T[]) Array.newInstance(clazz, (int) remainder));
        }
    }

    public void put(T value, long index)
    {
        if (index >= length || index < 0)
        {
            throw new IndexOutOfBoundsException("out of the range of the array, your index must be in this range [0, " + length + ")");
        }
        int indexOfArray = (int) (index / ARRAY_LENGTH);
        // Multiply as long: indexOfArray * ARRAY_LENGTH overflows int for large indices.
        int indexInArray = (int) (index - ((long) indexOfArray * ARRAY_LENGTH));
        arrays.get(indexOfArray)[indexInArray] = value;
    }

    public T get(long index)
    {
        if (index >= length || index < 0)
        {
            throw new IndexOutOfBoundsException("out of the range of the array, your index must be in this range [0, " + length + ")");
        }
        int indexOfArray = (int) (index / ARRAY_LENGTH);
        // Multiply as long: indexOfArray * ARRAY_LENGTH overflows int for large indices.
        int indexInArray = (int) (index - ((long) indexOfArray * ARRAY_LENGTH));
        return arrays.get(indexOfArray)[indexInArray];
    }
}
And here's the test:
public static void main(String[] args)
{
    // Note: the inner arrays are allocated eagerly, so this length asks the heap for
    // about 60 billion references; use a smaller value to try it out.
    long length = 60085147514L;
    BigArray<String> array = new BigArray<>(length, String.class);
    array.put("peace be upon you", 1);
    array.put("yes it works", 1755);
    String text = array.get(1755);
    System.out.println(text + " i am a string coming from an array ");
}
This code is limited only by Long.MAX_VALUE and the Java heap, which you can increase as needed (I set it to 3800 MB).
I hope this is useful and provides a simple answer.
#11
0
Java doesn't currently support a direct array with more than 2^31 − 1 elements; I hope to see this feature in Java in the future.