I am trying to read the contents of a file into any readable form. I am using a FileInputStream to read the file into a byte array, and am then trying to convert that byte array into a String.
So far, I have tried 3 different ways:
FileInputStream inputStream = new FileInputStream(file);
byte[] clearTextBytes = new byte[(int) file.length()];
inputStream.read(clearTextBytes);
String s = IOUtils.toString(inputStream); //first way
String str = new String(clearTextBytes, "UTF-8"); //second way
String string = Arrays.toString(clearTextBytes); //third way
String[] byteValue = string.substring(1, string.length() - 1).split(",");
byte[] bytes = new byte[byteValue.length];
for (int i = 0, len = bytes.length; i < len; i++) {
    bytes[i] = Byte.parseByte(byteValue[i].trim());
}
String newStr = new String(bytes);
When I print out each of the Strings: 1) prints out nothing, and 2 & 3) print out a lot of weird characters, such as: PK!�Q���[Content_Types].xml �(���MO�@��&��f��]���pP<*���v �ݏ�,_��i�I�(zi�N��}fڝ�
��h�5)�&��6Sf����c|�"�d��R�d���Eo�r�� �l�������:0Tɭ�"Э�p'䧘��tn��&� q(=X����!.���,�_�WF�L8W......
I would love any advice on how to properly convert my byte array to a String.
4 Solutions
#1
4
As others have noted, the data doesn't look like it contains any text, so it is quite possibly binary data rather than text. Note that files which start with PK could be in PKZIP format, and the randomness of your data does suggest it could be compressed. http://www.garykessler.net/library/file_sigs.html Try renaming the file to have .ZIP at the end and see if you can open it in file explorer.
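The "starts with PK" check can also be done in code before attempting any text conversion. The following is a minimal sketch, not part of the original answer: the class and method names are hypothetical, and it assumes Java 11 or later for `InputStream.readNBytes`. It compares the first four bytes against the ZIP local file header signature 50 4B 03 04.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, named for illustration only.
class ZipSignature {
    // ZIP local file header signature: 50 4B 03 04 ("PK" followed by 0x03 0x04).
    private static final byte[] MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Returns true if the given bytes begin with the ZIP signature. */
    static boolean looksLikeZip(byte[] head) {
        if (head.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads at most the first four bytes of the file and checks them. */
    static boolean looksLikeZip(Path path) throws IOException {
        try (InputStream in = Files.newInputStream(path)) {
            return looksLikeZip(in.readNBytes(MAGIC.length));
        }
    }
}
```

If this returns true, the file is almost certainly an archive (or an archive-based format) and should not be decoded as text at all.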
From the link above, the start of a DOCX file looks as follows.
50 4B 03 04 14 00 06 00 PK...... DOCX, PPTX, XLSX
Microsoft Office Open XML Format (OOXML) Document NOTE: There is no subheader for MS OOXML files as there is with DOC, PPT, and XLS files. To better understand the format of these files, rename any OOXML file to have a .ZIP extension and then unZIP the file; look at the resultant file named [Content_Types].xml to see the content types. In particular, look for the <Override PartName= tag, where you will find word, ppt, or xl, respectively. Trailer: Look for 50 4B 05 06 (PK..) followed by 18 additional bytes at the end of the file.
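Instead of renaming and unzipping by hand, the same inspection can be sketched with the JDK's `java.util.zip` package. This is an illustration under assumptions rather than anything from the linked page: the class and method names are hypothetical, and the path is whatever file you are investigating. For an OOXML document, the returned names would be expected to include `[Content_Types].xml`.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper class, named for illustration only.
class ListZipEntries {
    /** Returns the names of all entries in the archive at the given path. */
    static List<String> entryNames(String path) throws IOException {
        List<String> names = new ArrayList<>();
        // ZipFile throws ZipException (an IOException) if the file is not a valid archive.
        try (ZipFile zip = new ZipFile(path)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }
}
```

A `ZipException` here is itself useful information: it means the PK prefix was a coincidence or the file is damaged.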
Assuming you do have text data, most likely the character encoding is neither your platform default nor UTF-8. You need to a) check what the encoding actually is, and b) check that the corruption is not happening when you output the string rather than when you read the input.
You can try brute force to find a character set which doesn't produce any unknown characters.
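import java.nio.charset.Charset;
import java.util.LinkedHashSet;
import java.util.Set;

// Returns every installed charset that decodes the bytes without producing U+FFFD.
public static Set<Charset> possibleCharsets(byte[] bytes) {
    Set<Charset> charsets = new LinkedHashSet<>();
    for (Charset charset : Charset.availableCharsets().values()) {
        // U+FFFD is the replacement character substituted for undecodable input
        if (!new String(bytes, charset).contains("\uFFFD"))
            charsets.add(charset);
    }
    return charsets;
}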
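A stricter variant of the same idea is to ask the decoder to report errors instead of silently substituting a replacement character. The sketch below is an alternative technique, not the answer's own method, and its class and method names are hypothetical. It uses `CharsetDecoder` with `CodingErrorAction.REPORT`, so malformed or unmappable input raises a `CharacterCodingException`.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper class, named for illustration only.
class StrictDecode {
    /** Returns true if the bytes are a valid encoding in the given charset. */
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // The decoder hit a byte sequence that is not valid in this charset.
            return false;
        }
    }
}
```

Keep in mind that a clean decode only shows the bytes are valid in that charset, not that it is the right one: single-byte charsets such as ISO-8859-1 accept every possible byte sequence.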
#2
0
UTF-8 can represent a very large number of characters (a four-byte sequence has room for 2,097,152 values), and any character that has no glyph is displayed as a question mark. Try the classic DOS code page instead:
new String(clearTextBytes, "IBM437"); // Java registers DOS code page 437 as "IBM437"; "DOS-US" is not a recognized charset name
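As a self-contained illustration, assuming "the classic DOS code page" means code page 437 (which Java exposes under the charset name `IBM437`, normally available in a full JDK), decoding might look like the sketch below. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, named for illustration only.
class DosCodePage {
    /** Decodes the bytes as DOS code page 437, one character per byte. */
    static String decodeCp437(byte[] bytes) {
        // Charset.forName throws UnsupportedCharsetException if IBM437 is not installed.
        return new String(bytes, Charset.forName("IBM437"));
    }
}
```

Because code page 437 assigns a character to every byte value, the result never contains replacement characters. For binary input such as a ZIP archive it will still be gibberish, just gibberish drawn from a different character set.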
#3
0
To get the text contents of a Word file, try the following. You will need the Apache POI libraries.
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
[...]
XWPFDocument docx = new XWPFDocument(new FileInputStream("file.docx"));
XWPFWordExtractor we = new XWPFWordExtractor(docx);
System.out.println(we.getText());
#4
0
I've written a very basic program to read the contents of a file and to print each string on a new line in the console. Here is the content of the file:
Here is the program I wrote:
import java.io.*;
import java.util.*;

class Test {
    public static void main(String args[]) throws FileNotFoundException {
        File file = new File("File1.txt");
        Scanner input = new Scanner(file);
        while (input.hasNext()) {
            System.out.println(input.next());
        }
        input.close();
    } // main()
} // class Test
This is the output to the console:
apples
pears
1
2
3
oranges
carrots
bananas
pineapples
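For a file that really is plain text, the byte-array route from the question also works, provided the charset is named explicitly rather than left to the platform default. A minimal sketch follows; the class and method names are hypothetical, and the correct charset depends on how the file was written.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, named for illustration only.
class ReadTextFile {
    /** Reads the whole file into memory and decodes it with the given charset. */
    static String readAll(Path path, Charset charset) throws IOException {
        // readAllBytes reads until end of file, unlike a single InputStream.read call,
        // which may return before the buffer is full.
        byte[] bytes = Files.readAllBytes(path);
        return new String(bytes, charset);
    }
}
```

It could be called as `ReadTextFile.readAll(Path.of("File1.txt"), StandardCharsets.UTF_8)`, assuming the file was saved as UTF-8.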