I have a text file with 1000 lines in the following format:
19 x 75 Bullnose Architrave/Skirting £1.02
I am writing a method that reads the file in line by line. This works OK.
I then want to split each string using "£" as a delimiter and write it out to an ArrayList<String> in the following format:
19 x 75 Bullnose Architrave/Skirting, Metre, 1.02
This is how I have approached it (productList is the ArrayList, declared/instantiated outside the try block):
try {
    br = new BufferedReader(new FileReader(aFile));
    String inputLine = br.readLine();
    String delim = "£";
    while (inputLine != null) {
        String[] halved = inputLine.split(delim, 2);
        String lineOut = halved[0] + ", Metre, " + halved[1]; // Array out of bounds
        productList.add(lineOut);
        inputLine = br.readLine();
    }
}
The String is not splitting and I keep getting an ArrayIndexOutOfBoundsException. I'm not very familiar with regex. I've also tried using the old StringTokenizer but get the same result.
Is there an issue with £ as a delimiter, or is it something else? I did wonder if it has something to do with the second token not being read as a String.
Any ideas would be helpful.
3 Answers
#1
6
Here are some of the possible causes:
- The encoding of the file doesn't match the encoding that you are using to read it, and the "pound" character in the file is getting "mangled" into something else.
- The file and your source code are using different pound-like characters. For instance, Unicode has two code points that look like a "pound sign": the Pound Sterling character (U+00A3) and the Lira character (U+20A4). Then there is the Roman semuncia character (U+10192).
- You are trying to compile a UTF-8 encoded source file without telling the compiler that it is UTF-8 encoded.
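To illustrate the second cause, here is a minimal sketch showing that a line containing the Lira sign does not split on the Pound sign, even though the two look alike on screen. The class and method names are made up for this example, and the characters are written as Unicode escapes so that the source file's own encoding cannot interfere.

```java
public class PoundLookalikes {
    // U+00A3 POUND SIGN, written as an escape so the source encoding cannot alter it.
    static final String POUND = "\u00A3";
    // U+20A4 LIRA SIGN, which looks almost identical in many fonts.
    static final String LIRA = "\u20A4";

    /** Formats each code point of s as U+XXXX, separated by single spaces. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Hypothetical input line that uses the Lira sign instead of the Pound sign.
        String line = "19 x 75 Bullnose Architrave/Skirting " + LIRA + "1.02";

        // No match for the delimiter, so split returns the whole line as one element;
        // accessing halved[1] would throw ArrayIndexOutOfBoundsException.
        String[] halved = line.split(POUND, 2);
        System.out.println("pieces: " + halved.length);

        System.out.println(codePoints(POUND) + " vs " + codePoints(LIRA));
    }
}
```

Printing the code points of the delimiter and of a real line from the file is usually the quickest way to tell which of the three causes applies.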
Judging from your comments, this is an encoding mismatch problem; i.e. the "default" encoding being used by Java doesn't match the actual encoding of the file. There are two ways to address this:
- Change the encoding of the file to match Java's default encoding. You seem to have tried that and failed. (And it wouldn't be the way I'd do this ...)
- Change the program to open the file with a specific (non-default) encoding; e.g. change
      new FileReader(aFile)
  to
      new InputStreamReader(new FileInputStream(aFile), encoding)
  where encoding is the name of the file's actual character encoding. (FileReader itself only accepts an encoding from Java 11 onwards, and then as a Charset rather than a name: new FileReader(aFile, Charset.forName(encoding)).) The names of the encodings understood by Java are listed in the JDK's supported-encodings documentation, but my guess is that it is "ISO-8859-1" (aka Latin-1).
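As a rough sketch of the second option, the reading loop from the question could look like the following. It assumes Java 7 or later (for Files.newBufferedReader and try-with-resources); the class name ProductReader and the helper formatLine are invented for this example, and ISO-8859-1 is only the guess made above, not a confirmed fact about the file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class ProductReader {
    // U+00A3 POUND SIGN as an escape, so the compiler's source encoding does not matter.
    static final String DELIM = "\u00A3";

    /** Turns "description £price" into "description, Metre, price". */
    static String formatLine(String inputLine) {
        String[] halved = inputLine.split(DELIM, 2);
        if (halved.length < 2) {
            // Fail with a readable message instead of ArrayIndexOutOfBoundsException.
            throw new IllegalArgumentException("No pound sign found in: " + inputLine);
        }
        // trim() drops the space that sits before the pound sign in the input.
        return halved[0].trim() + ", Metre, " + halved[1];
    }

    /** Reads the file with an explicit charset instead of the platform default. */
    static List<String> readProducts(Path file, Charset charset) throws IOException {
        List<String> productList = new ArrayList<>();
        try (BufferedReader br = Files.newBufferedReader(file, charset)) {
            String inputLine;
            while ((inputLine = br.readLine()) != null) {
                productList.add(formatLine(inputLine));
            }
        }
        return productList;
    }

    public static void main(String[] args) throws IOException {
        // Demo only: write a one-line Latin-1 file, then read it back with the same charset.
        Path file = Files.createTempFile("products", ".txt");
        String sample = "19 x 75 Bullnose Architrave/Skirting \u00A31.02\n";
        Files.write(file, sample.getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(readProducts(file, StandardCharsets.ISO_8859_1));
        Files.delete(file);
    }
}
```

If the file turns out to be UTF-8, passing StandardCharsets.UTF_8 to readProducts is the only change needed.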
#2
0
This is probably a case of encoding mismatch. To check for this,
- Print delim.length() and make sure it is 1.
- Print inputLine.length() and make sure it is the right value (42).
If one of them is not the expected value then you have to make sure you are using UTF-8 everywhere.
You say delim.length() is 1, so this is good. On the other hand, if inputLine.length() is 34, this is very wrong. For "19 x 75 Bullnose Architrave/Skirting £1.02" you should get 42 if all was as expected. If your file was UTF-8 encoded but read as ISO-8859-1 or similar, you would have gotten 43.
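The 42 and 43 figures can be reproduced without touching the file. The sketch below (class and method names are made up) simulates both mismatch directions in memory: UTF-8 bytes decoded as ISO-8859-1, and ISO-8859-1 bytes decoded as UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class LengthCheck {
    static final String POUND = "\u00A3";
    static final String LINE = "19 x 75 Bullnose Architrave/Skirting " + POUND + "1.02";

    /** Simulates a UTF-8 file being read as ISO-8859-1. */
    static String utf8ReadAsLatin1(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    /** Simulates an ISO-8859-1 file being read as UTF-8. */
    static String latin1ReadAsUtf8(String text) {
        return new String(text.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(LINE.length()); // 42

        // The pound sign is two bytes in UTF-8, so it becomes two characters here.
        String a = utf8ReadAsLatin1(LINE);
        System.out.println(a.length()); // 43

        // The single Latin-1 byte 0xA3 is not valid UTF-8 on its own, so the decoder
        // substitutes U+FFFD: the length stays 42 but the pound sign is gone.
        String b = latin1ReadAsUtf8(LINE);
        System.out.println(b.length() + " " + b.contains(POUND)); // 42 false
    }
}
```

Note that in the second direction the length looks correct even though the delimiter has disappeared, so checking the length alone does not rule out an encoding mismatch.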
Now I am a little at a loss. To debug this you could print individually each character of the string and check what is wrong with them.
for (int i = 0; i < inputLine.length(); i++)
    System.err.println("debug: " + i + ": " + inputLine.charAt(i) + " (" + inputLine.codePointAt(i) + ")");
#3
-1
Many thanks for all your replies.
Specifying the encoding in the read and saving the original text file as UTF-8 has worked.
However, the experience has taught me that delimiting text using "£" or indeed other characters that may have multiple representations in different encodings is a poor strategy.
I have decided to take a different approach:
1) Find the last space in the input string & replace it with "xxx" or similar.
2) Split this using the delimiter "xxx", which should split the strings & rip out the "£".
3) Carry on..
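A possible sketch of this idea is shown below. It is a variation rather than the exact steps listed: instead of inserting an "xxx" placeholder and splitting on it, it cuts the string at the last space with lastIndexOf and then strips any leading non-digit characters from the price, so the result does not depend on how the currency symbol was decoded. The class and method names are invented, and it assumes the price is always the last space-separated field.

```java
public class LastSpaceSplit {
    /** Turns "description <symbol>price" into "description, Metre, price". */
    static String formatLine(String inputLine) {
        int lastSpace = inputLine.lastIndexOf(' ');
        if (lastSpace < 0) {
            throw new IllegalArgumentException("No space found in: " + inputLine);
        }
        String description = inputLine.substring(0, lastSpace);
        // Remove whatever precedes the first digit: the pound sign, a look-alike,
        // or the debris left behind by a wrong decoding.
        String price = inputLine.substring(lastSpace + 1).replaceFirst("^[^0-9]+", "");
        return description + ", Metre, " + price;
    }

    public static void main(String[] args) {
        System.out.println(formatLine("19 x 75 Bullnose Architrave/Skirting \u00A31.02"));
    }
}
```

This avoids putting the pound sign in the source code at all, at the cost of relying on the layout of the line.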