In my software I need to split a string into words. I currently have more than 19,000,000 documents with more than 30 words each.
Which of the following two ways is the better way to do this (in terms of performance)?
StringTokenizer sTokenize = new StringTokenizer(s, " ");
while (sTokenize.hasMoreTokens()) {
    String word = sTokenize.nextToken(); // process each word
}
or
String[] splitS = s.split(" ");
for (int i = 0; i < splitS.length; i++) {
    String word = splitS[i]; // process each word
}
10 Answers
#1
61
If your data is already in a database and you need to parse the strings of words, I would suggest using indexOf repeatedly. It's many times faster than either solution.
However, getting the data from a database is still likely to be much more expensive.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

public class SplitPerfTest {
    public static void main(String... args) {
        // Build a sample string of 60 space-separated six-digit "words".
        StringBuilder sb = new StringBuilder();
        for (int i = 100000; i < 100000 + 60; i++)
            sb.append(i).append(' ');
        String sample = sb.toString();

        int runs = 100000;
        for (int i = 0; i < 5; i++) {
            {
                long start = System.nanoTime();
                for (int r = 0; r < runs; r++) {
                    StringTokenizer st = new StringTokenizer(sample);
                    List<String> list = new ArrayList<String>();
                    while (st.hasMoreTokens())
                        list.add(st.nextToken());
                }
                long time = System.nanoTime() - start;
                System.out.printf("StringTokenizer took an average of %.1f us%n", time / runs / 1000.0);
            }
            {
                long start = System.nanoTime();
                // Compiled once; its cost is amortized over all the runs.
                Pattern spacePattern = Pattern.compile(" ");
                for (int r = 0; r < runs; r++) {
                    List<String> list = Arrays.asList(spacePattern.split(sample, 0));
                }
                long time = System.nanoTime() - start;
                System.out.printf("Pattern.split took an average of %.1f us%n", time / runs / 1000.0);
            }
            {
                long start = System.nanoTime();
                for (int r = 0; r < runs; r++) {
                    List<String> list = new ArrayList<String>();
                    int pos = 0, end;
                    while ((end = sample.indexOf(' ', pos)) >= 0) {
                        list.add(sample.substring(pos, end));
                        pos = end + 1;
                    }
                }
                long time = System.nanoTime() - start;
                System.out.printf("indexOf loop took an average of %.1f us%n", time / runs / 1000.0);
            }
        }
    }
}
prints
StringTokenizer took an average of 5.8 us
Pattern.split took an average of 4.8 us
indexOf loop took an average of 1.8 us
StringTokenizer took an average of 4.9 us
Pattern.split took an average of 3.7 us
indexOf loop took an average of 1.7 us
StringTokenizer took an average of 5.2 us
Pattern.split took an average of 3.9 us
indexOf loop took an average of 1.8 us
StringTokenizer took an average of 5.1 us
Pattern.split took an average of 4.1 us
indexOf loop took an average of 1.6 us
StringTokenizer took an average of 5.0 us
Pattern.split took an average of 3.8 us
indexOf loop took an average of 1.6 us
The cost of opening a file will be about 8 ms. As the files are so small, your cache may improve performance by a factor of 2-5x. Even so, it is going to spend ~10 hours just opening files. The cost of using split vs StringTokenizer is far less than 0.01 ms each. Parsing 19 million documents x 30 words x 8 letters per word should take about 10 seconds (at about 1 GB per 2 seconds).
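To spell out the arithmetic: 19,000,000 files x 8 ms is about 152,000 seconds, roughly 42 hours, so even a 2-5x cache speedup leaves you around the ~10-hour mark. Meanwhile 19,000,000 x 30 words x 8 letters is about 4.6 GB of text, which at 1 GB per 2 seconds parses in roughly 10 seconds.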
If you want to improve performance, I suggest you have far fewer files, e.g. use a database. If you don't want to use an SQL database, I suggest using one of these: http://nosql-database.org/
#2
14
split in Java 7 just calls indexOf for this input: a single-character delimiter that is not a regex metacharacter takes a fast path that skips the regex engine entirely; see the source. split should therefore be very fast, close to repeated calls of indexOf.
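For intuition, that fast path does roughly the following (a simplified sketch, not the actual OpenJDK source):

import java.util.ArrayList;
import java.util.List;

class FastSplit {
    // Roughly what the single-character fast path does: scan with indexOf,
    // never touching the regex engine. (The real implementation also strips
    // trailing empty strings from the result.)
    static List<String> split(String s, char delim) {
        List<String> parts = new ArrayList<String>();
        int off = 0, next;
        while ((next = s.indexOf(delim, off)) != -1) {
            parts.add(s.substring(off, next));
            off = next + 1;
        }
        parts.add(s.substring(off)); // the remainder after the last delimiter
        return parts;
    }
}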
#3
5
The Java API specification recommends using split. See the documentation of StringTokenizer.
#4
4
Another important thing, undocumented as far as I noticed, is that asking StringTokenizer to return the delimiters along with the tokens (by using the constructor StringTokenizer(String str, String delim, boolean returnDelims)) also reduces processing time. So, if you're looking for performance, I would recommend using something like:
private static final String DELIM = "#";

public void splitIt(String input) {
    // returnDelims = true: the delimiters come back as their own tokens
    StringTokenizer st = new StringTokenizer(input, DELIM, true);
    while (st.hasMoreTokens()) {
        String next = getNext(st);
        System.out.println(next);
    }
}

private String getNext(StringTokenizer st) {
    String value = st.nextToken();
    if (DELIM.equals(value))
        value = null;   // two delimiters in a row: an empty field
    else if (st.hasMoreTokens())
        st.nextToken(); // swallow the delimiter that follows the value
    return value;
}
Despite the overhead introduced by the getNext() method, which discards the delimiters for you, it's still 50% faster according to my benchmarks.
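For example, with the code above, splitIt("a#b##c") prints a, b, null and then c: the null marks the empty field between the two consecutive delimiters.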
#5
3
Use split.
StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method instead.
#6
2
What do the 19,000,000 documents have to do with it? Do you have to split words in all the documents on a regular basis, or is it a one-shot problem?
If you display/request one document at a time, with only 30 words, this is such a tiny problem that any method would work.
If you have to process all the documents at once, with only 30 words each, this is still such a tiny problem that you are more likely to be IO-bound anyway.
#7
2
While running micro (and in this case, even nano) benchmarks, there is a lot that affects your results: JIT optimizations and garbage collection, to name just a few.
In order to get meaningful results out of micro benchmarks, check out the JMH library. It has excellent samples bundled showing how to write good benchmarks.
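A minimal JMH benchmark for this question might look like the following sketch (the class name and sample string are illustrative; it assumes the standard org.openjdk.jmh dependency):

import java.util.StringTokenizer;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
public class SplitBench {
    String sample = "the quick brown fox jumps over the lazy dog";

    @Benchmark
    public String[] split() {
        return sample.split(" ");
    }

    @Benchmark
    public int tokenizer() {
        StringTokenizer st = new StringTokenizer(sample, " ");
        int n = 0;
        while (st.hasMoreTokens()) {
            st.nextToken();
            n++;
        }
        return n; // return a value so the JIT cannot eliminate the work
    }
}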
#8
2
Regardless of its legacy status, I would expect StringTokenizer to be significantly quicker than String.split() for this task, because it doesn't use regular expressions: it just scans the input directly, much as you would yourself via indexOf(). In fact, String.split() has to compile the regex every time you call it, so it isn't even as efficient as using a regular expression directly yourself.
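If that per-call compilation is the concern, compiling the pattern once and reusing it (as the Pattern.split case in #1's benchmark does) avoids it; a minimal sketch:

import java.util.regex.Pattern;

class ReusedPattern {
    // Compile the delimiter pattern once and reuse it for every document,
    // instead of paying any per-call setup in s.split(" ").
    private static final Pattern SPACE = Pattern.compile(" ");

    static String[] words(String s) {
        return SPACE.split(s);
    }
}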
#9
1
This could be a reasonable benchmark, using Java 1.6.0:
http://www.javamex.com/tutorials/regular_expressions/splitting_tokenisation_performance.shtml#.V6-CZvnhCM8
#10
1
Performance-wise, StringTokenizer is way better than split. Check the code below.
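(The code this answer refers to is not shown. As a hypothetical stand-in, a naive comparison might look like the sketch below; per #7, a bare timing loop like this is easily misleading.)

import java.util.StringTokenizer;

public class QuickCompare {
    public static void main(String[] args) {
        String s = "one two three four five six seven eight nine ten";
        int runs = 1000000;

        long t0 = System.nanoTime();
        for (int r = 0; r < runs; r++) {
            StringTokenizer st = new StringTokenizer(s, " ");
            while (st.hasMoreTokens())
                st.nextToken();
        }
        System.out.printf("StringTokenizer: %d ms%n", (System.nanoTime() - t0) / 1000000);

        long t1 = System.nanoTime();
        for (int r = 0; r < runs; r++) {
            String[] parts = s.split(" ");
        }
        System.out.printf("String.split:    %d ms%n", (System.nanoTime() - t1) / 1000000);
    }
}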
But according to the Java docs, its use is discouraged. Check here.