What's an efficient way of splitting a String into chunks of 1024 bytes in Java? If there is more than one chunk, then the header (a fixed-size string) needs to be repeated in all subsequent chunks.
4 Answers
#1
Strings and bytes are two completely different things, so wanting to split a String into bytes is as meaningless as wanting to split a painting into verses.
What is it that you actually want to do?
To convert between strings and bytes, you need to specify an encoding that can encode all the characters in the String. Depending on the encoding and the characters, some of them may span more than one byte.
You can either split the String into chunks of 1024 characters and encode those as bytes, but then each chunk may be more than 1024 bytes.
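A character-based split is a short helper around substring() (the method name here is illustrative, not from any library); keep in mind that the encoded byte length of each chunk then depends on the encoding:

```java
import java.util.ArrayList;
import java.util.List;

public class CharChunks {
    // Split into chunks of at most `size` characters. The encoded byte
    // length of a chunk may exceed `size` for multi-byte characters, and
    // a boundary can in principle split a surrogate pair.
    static List<String> splitByChars(String s, int size) {
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < s.length(); i += size) {
            chunks.add(s.substring(i, Math.min(i + size, s.length())));
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(splitByChars("abcdefgh", 3)); // [abc, def, gh]
    }
}
```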
Or you can encode the original string into bytes and then split them into chunks of 1024, but then you have to make sure to append them as bytes before decoding the whole into a String again, or you may get garbled characters at the split points when a character spans more than 1 byte.
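Sketched in code (the helper names are illustrative): encode once, split the bytes, and concatenate the bytes again before decoding, so that a multi-byte character cut at a chunk boundary survives the round trip.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ByteChunks {
    // Split the encoded form into chunks of at most `size` bytes.
    static List<byte[]> split(String s, int size) {
        byte[] all = s.getBytes(StandardCharsets.UTF_8);
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < all.length; off += size) {
            chunks.add(Arrays.copyOfRange(all, off, Math.min(off + size, all.length)));
        }
        return chunks;
    }

    // Reassemble as bytes FIRST, then decode once. Decoding each chunk
    // separately would garble any multi-byte character split at a boundary.
    static String join(List<byte[]> chunks) {
        int total = 0;
        for (byte[] c : chunks) total += c.length;
        byte[] all = new byte[total];
        int off = 0;
        for (byte[] c : chunks) {
            System.arraycopy(c, 0, all, off, c.length);
            off += c.length;
        }
        return new String(all, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "h\u00e9llo w\u00f6rld";  // 11 chars, 13 UTF-8 bytes
        List<byte[]> chunks = split(original, 2);   // some chunks cut a character in half
        System.out.println(chunks.size());          // 7
        System.out.println(join(chunks).equals(original)); // true
    }
}
```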
If you're worried about memory usage when the String can be very long, you should use streams (the java.io package) to do the encoding/decoding and splitting, in order to avoid keeping several copies of the data in memory. Ideally, you should avoid having the original String in one piece at all and instead use streams to read it in small chunks from wherever you get it.
#2
You have two ways: the fast way and the memory-conservative way. But first, you need to know what characters are in the String. ASCII? Are there umlauts (characters between 128 and 255) or even Unicode (s.charAt(i) returns something > 255)? Depending on that, you will need to use a different encoding. If you have binary data, try "iso-8859-1" because it will preserve the data in the String. If you have Unicode, try "utf-8". I'll assume binary data:
String encoding = "iso-8859-1";
The fastest way:
ByteArrayInputStream in = new ByteArrayInputStream (string.getBytes(encoding));
Note that the String is Unicode internally, so every character takes two bytes. You will have to specify the encoding (don't rely on the "platform default"; that will only cause pain later).
Now you can read it in 1024-byte chunks using
byte[] buffer = new byte[1024];
int len;
while ((len = in.read(buffer)) > 0) { ... }
This needs about three times as much RAM as the original String.
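Assembled into one runnable sketch (collecting the chunks into a list stands in for whatever per-chunk processing is actually needed):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FastChunks {
    static List<byte[]> readChunks(String string, String encoding) throws IOException {
        ByteArrayInputStream in = new ByteArrayInputStream(string.getBytes(encoding));
        List<byte[]> chunks = new ArrayList<>();
        byte[] buffer = new byte[1024];
        int len;
        while ((len = in.read(buffer)) > 0) {
            chunks.add(Arrays.copyOf(buffer, len)); // copy: the buffer is reused
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2500; i++) sb.append('a');
        List<byte[]> chunks = readChunks(sb.toString(), "iso-8859-1");
        System.out.println(chunks.size() + " chunks, last one " + chunks.get(2).length + " bytes");
        // 3 chunks, last one 452 bytes
    }
}
```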
A more memory conservative way is to write a converter which takes a StringReader and an OutputStreamWriter (which wraps a ByteArrayOutputStream). Copy bytes from the reader to the writer until the underlying buffer contains one chunk of data:
When it does, copy the data to the real output (prepending the header), copy the additional bytes (which the Unicode->byte conversion may have generated) to a temp buffer, call buffer.reset() and write the temp buffer to buffer.
Code looks like this (untested):
StringReader r = new StringReader(string);
ByteArrayOutputStream buffer = new ByteArrayOutputStream(1024 * 2); // twice as large as necessary
OutputStreamWriter w = new OutputStreamWriter(buffer, encoding);
char[] cbuf = new char[100];
byte[] tempBuf;
int len;
while ((len = r.read(cbuf, 0, cbuf.length)) > 0) {
    w.write(cbuf, 0, len);
    w.flush();
    if (buffer.size() >= 1024) {
        tempBuf = buffer.toByteArray();
        // ... ready to process one chunk ...
        buffer.reset();
        if (tempBuf.length > 1024) {
            buffer.write(tempBuf, 1024, tempBuf.length - 1024);
        }
    }
}
// ... check if some data is left in buffer and process that, too ...
This only needs a couple of kilobytes of RAM.
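The original question also wants a fixed header repeated in every chunk. At the byte level that could look like the sketch below (the helper name, and the decision that the header counts toward the chunk size, are assumptions):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class HeaderChunker {
    // Split `body` into chunks of at most `size` bytes, prepending the
    // same fixed `header` bytes to every chunk.
    static List<byte[]> chunkWithHeader(String header, String body, int size) {
        byte[] h = header.getBytes(StandardCharsets.ISO_8859_1);
        byte[] b = body.getBytes(StandardCharsets.ISO_8859_1);
        int payload = size - h.length;               // room left after the header
        if (payload <= 0) throw new IllegalArgumentException("header too large for chunk size");
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < b.length; off += payload) {
            int len = Math.min(payload, b.length - off);
            byte[] chunk = new byte[h.length + len];
            System.arraycopy(h, 0, chunk, 0, h.length);
            System.arraycopy(b, off, chunk, h.length, len);
            chunks.add(chunk);
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<byte[]> chunks = chunkWithHeader("HDR:", "0123456789", 8);
        System.out.println(chunks.size());  // 3: two full chunks and one short one
        System.out.println(new String(chunks.get(2), StandardCharsets.ISO_8859_1)); // HDR:89
    }
}
```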
[EDIT] There has been a lengthy discussion about binary data in Strings in the comments. First of all, it's perfectly safe to put binary data into a String as long as you are careful when creating it and storing it somewhere. To create such a String, take a byte[] array and:
String safe = new String (array, "iso-8859-1");
In Java, ISO-8859-1 (a.k.a. ISO-Latin-1) is a 1:1 mapping. This means the bytes in the array will not be interpreted in any way. Now you can use substring() and the like on the data, search it with indexOf(), run regexps on it, etc. For example, find the position of a 0-byte:
int pos = safe.indexOf('\u0000');
This is especially useful if you don't know the encoding of the data and want to have a look at it before some codec messes with it.
To write the data somewhere, the reverse operation is:
byte[] data = safe.getBytes("iso-8859-1");
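The round trip can be verified end to end: nothing is lost, and indexOf() works directly on the raw bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BinarySafe {
    public static void main(String[] args) {
        byte[] array = {72, 0, 105, (byte) 0xFF};   // arbitrary binary data
        String safe = new String(array, StandardCharsets.ISO_8859_1);
        System.out.println(safe.indexOf('\u0000')); // position of the 0-byte: 1
        byte[] data = safe.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(array, data)); // true: lossless round trip
    }
}
```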
Never use the default methods new String(array) or String.getBytes()! One day, your code is going to be executed on a different platform and it will break.
Now to the problem of characters > 255 in the String. If you use this method, you won't ever have any such character in your Strings. That said, be aware that if one did slip in, getBytes("iso-8859-1") would not throw an Exception: the standard getBytes methods silently replace characters that ISO-Latin-1 cannot express (typically with '?'), so the code can fail silently unless you encode with a strictly configured CharsetEncoder.
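To get a hard failure instead of silent replacement, a CharsetEncoder can be configured to report unmappable characters (a sketch; the helper name is illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictEncode {
    static byte[] encodeStrict(String s) throws CharacterCodingException {
        CharsetEncoder enc = StandardCharsets.ISO_8859_1.newEncoder()
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .onMalformedInput(CodingErrorAction.REPORT);
        ByteBuffer out = enc.encode(CharBuffer.wrap(s)); // throws on chars > 255
        byte[] bytes = new byte[out.remaining()];
        out.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        try {
            System.out.println(encodeStrict("abc").length); // 3
            encodeStrict("\u20ac");                         // euro sign: not in ISO-Latin-1
            System.out.println("no exception");
        } catch (CharacterCodingException e) {
            System.out.println("unmappable character detected");
        }
    }
}
```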
Some might argue that this is not safe enough and you should never mix bytes and Strings. In this day and age, we don't have that luxury. A lot of data has no explicit encoding information (files, for example, don't have an "encoding" attribute the way they have access permissions or a name). XML is one of the few formats with explicit encoding information, and there are editors like Emacs or jEdit which use comments to specify this vital information. This means that, when processing streams of bytes, you must always know which encoding they are in. As of now, it's not possible to write code which will always work, no matter where the data comes from.
Even with XML, you must read the header of the file as bytes to determine the encoding before you can decode the meat.
The important point is to sit down and figure out which encoding was used to generate the data stream you have to process. If you do that, you're good, if you don't, you're doomed. The confusion originates from the fact that most people are not aware that the same byte can mean different things depending on the encoding or even that there is more than one encoding. Also, it would have helped if Sun hadn't introduced the notion of "platform default encoding."
Important points for beginners:
- There is more than one encoding (charset).
- There are more characters than the English language uses. There are even several sets of digits (ASCII, full width, Arabic-Indic, Bengali).
- You must know which encoding was used to generate the data which you are processing.
- You must know which encoding you should use to write the data you are processing.
- You must know the correct way to specify this encoding information so the next program can decode your output (XML header, HTML meta tag, special encoding comment, whatever).
The days of ASCII are over.
#3
I know I am late; however, I was looking for a solution myself, and this is the best answer I ended up with:
private static String chunk_split(String original, int length, String separator) throws IOException {
    // specify an encoding rather than relying on the platform default
    ByteArrayInputStream bis = new ByteArrayInputStream(original.getBytes("iso-8859-1"));
    byte[] buffer = new byte[length];
    int n;
    StringBuilder result = new StringBuilder();
    while ((n = bis.read(buffer)) > 0) {
        // append only the n bytes actually read; the last chunk may be shorter
        for (int i = 0; i < n; i++) {
            result.append((char) buffer[i]);
        }
        result.append(separator);
    }
    return result.toString();
}
Example:
public static void main(String[] args) throws IOException{
String original = "abcdefghijklmnopqrstuvwxyz";
System.out.println(chunk_split(original,5,"\n"));
}
Output:
abcde
fghij
klmno
pqrst
uvwxy
z
#4
I was trying this for myself: I needed to chunk a huge String (nearly 10 MB) into 1 MB pieces. This chunks the data in a minimal amount of time (less than a second).
private static ArrayList<String> chunkLogMessage(String logMessage) throws Exception {
    ArrayList<String> messages = new ArrayList<>();
    byte[] data = logMessage.getBytes("UTF-8");
    if (data.length > CHUNK_SIZE) {
        Log.e("chunk_started", System.currentTimeMillis() + "");
        ByteArrayInputStream inputStream = new ByteArrayInputStream(data);
        byte[] buffer = new byte[CHUNK_SIZE];
        int len;
        // read() returns the number of bytes actually read, or -1 at the end;
        // using that value directly avoids manual bookkeeping (and the endless
        // loop the manual counter could cause once the remainder hit zero)
        while ((len = inputStream.read(buffer)) > 0) {
            // caveat: a chunk boundary can still split a multi-byte UTF-8 character
            messages.add(new String(buffer, 0, len, "UTF-8"));
        }
        Log.e("chunk_ended", System.currentTimeMillis() + "");
        return messages;
    }
    messages.add(logMessage);
    return messages;
}
Logcat:
22:08:00.262 3382-3425/com.sample.app E/chunk_started: 1533910080261
22:08:01.228 3382-3425/com.sample.app E/chunk_ended: 1533910081228
22:08:02.468 3382-3425/com.sample.app E/chunk_started: 1533910082468
22:08:03.478 3382-3425/com.sample.app E/chunk_ended: 1533910083478
22:09:19.801 3382-3382/com.sample.app E/chunk_started: 1533910159801
22:09:20.662 3382-3382/com.sample.app E/chunk_ended: 1533910160662