Why is String.intern() so slow?

Date: 2021-05-21 21:19:15

Before anyone questions the use of String.intern() at all, let me say that I need it in my particular application for memory and performance reasons. [1]

So, until now I used String.intern() and assumed it was the most efficient way to do it. However, I have long noticed that it is a bottleneck in the software. [2]

Then, just recently, I tried replacing String.intern() with a huge map in which I put and look up the strings so as to obtain a unique instance each time. I expected this to be slower... but it was exactly the opposite! It was tremendously faster! Replacing intern() with a put/get on a map (which achieves exactly the same thing) made this step more than an order of magnitude faster.
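For readers who want to picture the map-based replacement, here is a minimal sketch of what it might look like. The class name `MapInterner` and its methods are hypothetical (the actual benchmark code lives behind the pastebin link and is not reproduced here), and the sketch is not thread-safe.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: a map-backed stand-in for String.intern(). Not thread-safe. */
final class MapInterner {
    // Each key maps to itself, so the map doubles as a set of canonical instances.
    private final Map<String, String> pool = new HashMap<String, String>();

    /** Returns the canonical instance equal to s, registering s if it is the first one seen. */
    String intern(String s) {
        String canonical = pool.get(s);
        if (canonical == null) {
            pool.put(s, s);
            canonical = s;
        }
        return canonical;
    }

    /** Number of distinct strings currently held. */
    int size() {
        return pool.size();
    }

    public static void main(String[] args) {
        MapInterner interner = new MapInterner();
        String a = interner.intern(new String("token"));
        String b = interner.intern(new String("token"));
        System.out.println(a == b); // prints "true": both calls return the same instance
    }
}
```

One difference from String.intern() worth keeping in mind: strings held by such a map stay strongly reachable for as long as the map itself does.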

The question is: why is intern() so slow?!? Why isn't it simply backed by a map (or actually, just a customized set), which would be tremendously faster? I'm puzzled.

[1]: For the unconvinced: this is natural language processing software that has to process gigabytes of text, so it needs to avoid holding many instances of the same string (to keep memory from blowing up) and to compare strings by reference (to be fast enough).

[2]: Without it (using normal strings) the task is impossible; with it, this particular step remains the most computationally intensive one.

EDIT:

Due to the surprising interest in this post, here is some code to test it out:

http://pastebin.com/4CD8ac69

And the results of interning a bit more than 1 million strings:

  • HashMap: 4 seconds
  • String.intern(): 54 seconds

To rule out warm-up effects, OS I/O caching, and the like, the experiment was repeated with the order of the two benchmarks inverted:

  • String.intern(): 69 seconds
  • HashMap: 3 seconds

As you can see, the difference is very noticeable: more than tenfold. (This was with OpenJDK 1.6.0_22, 64-bit, but I think the Sun JDK gave similar results.)

5 answers

#1


3  

Most likely reason for the performance difference: String.intern() is a native method, and calling a native method incurs massive overhead.

So why is it a native method? Probably because it uses the constant pool, which is a low-level VM construct.

#2


6  

This article discusses the implementation of String.intern(). In Java 6 and 7, the implementation used a fixed-size hash table (1009 buckets), so as the number of entries grew, performance became O(n). The fixed size can be changed using -XX:StringTableSize=N. Apparently, in Java 8 the default size is larger, but the issue remains.
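To make the O(n) claim concrete, the following back-of-the-envelope sketch estimates how long each bucket chain gets when the bucket count is fixed. It assumes the hash spreads entries evenly across buckets; the class and method names are purely illustrative and are not part of the JVM.

```java
/** Illustrative estimate of bucket chain length in a fixed-size hash table. */
final class ChainLengthEstimate {
    /** Average entries per bucket, assuming the hash spreads entries evenly. */
    static double averageChainLength(long entries, int buckets) {
        return (double) entries / buckets;
    }

    public static void main(String[] args) {
        int buckets = 1009; // the fixed table size cited above
        long[] sizes = {1000L, 100000L, 1000000L};
        for (long n : sizes) {
            // A lookup has to walk (part of) one chain, so its cost grows with this number.
            System.out.printf("%d entries: about %.0f per bucket%n",
                    n, averageChainLength(n, buckets));
        }
    }
}
```

With a million interned strings and 1009 buckets, each lookup faces a chain of roughly 991 entries, whereas a resizing HashMap keeps its chains short by growing the bucket array as entries are added.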

#3


3  

@Michael Borgwardt said this in a comment:

intern() is not synchronized, at least at the Java language level.

I think you mean that the String.intern() method is not declared as synchronized in the source code of the String class. And indeed, that is a true statement.

However:

  • Declaring intern() as synchronized would only lock the current String instance, because it is an instance method, not a static method. So they couldn't implement string pool synchronization that way.

  • If you step back and think about it, the string pool has to perform some kind of internal synchronization. If it didn't, it would be unusable in a multi-threaded application, because there is simply no practical way for all code that uses the intern() method to do external synchronization.

So, the internal synchronization that the string pool performs could be a bottleneck in a multi-threaded application that uses intern() heavily.
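If a map-based pool is used instead of intern() in a multi-threaded application, that pool needs its own synchronization as well. A minimal sketch, assuming a hypothetical `ConcurrentInterner` class built on the standard `ConcurrentHashMap`:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical thread-safe string pool; an illustration, not the JVM's mechanism. */
final class ConcurrentInterner {
    private final ConcurrentMap<String, String> pool =
            new ConcurrentHashMap<String, String>();

    /** Returns the canonical instance equal to s; safe to call from several threads. */
    String intern(String s) {
        // putIfAbsent returns the value already present, or null if s was just inserted.
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }
}
```

The atomic putIfAbsent call ensures that when two threads race to register equal strings, both end up with the same instance.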

#4


2  

I can't speak from any great experience with it, but from the String docs:

"When the intern method is invoked, if the pool already contains a string equal to this String object as determined by the equals(Object) method, then the string from the pool is returned. Otherwise, this String object is added to the pool and a reference to this String object is returned."
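The contract quoted above can be checked with a few reference comparisons. This is a small illustration; the class name `InternSemantics` is made up for the example.

```java
/** Illustrates the documented contract of String.intern() with reference comparisons. */
final class InternSemantics {
    /** Returns { copy == literal, copy.equals(literal), copy.intern() == literal }. */
    static boolean[] check() {
        String literal = "hello";          // string literals are interned automatically
        String copy = new String("hello"); // a distinct object with equal contents
        return new boolean[] {
            copy == literal,          // false: two different objects
            copy.equals(literal),     // true: same characters
            copy.intern() == literal  // true: intern() returns the pooled instance
        };
    }

    public static void main(String[] args) {
        for (boolean result : check()) {
            System.out.println(result);
        }
    }
}
```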

When dealing with large numbers of objects, any solution involving hashing will outperform one that doesn't. I think you're just seeing the result of misusing a Java language feature. Interning isn't there to act as a Map of strings for your use. You should use a Map for that (or Set, as appropriate). The String table is for optimization at the language level, not the app level.

#5


-1  

The accepted answer is wrong. String.intern() becomes slow for two reasons:
1. The -XX:StringTableSize limitation.
Java uses an internal hash table to manage the string cache. In Java 6 the default StringTableSize is 1009, which means String.intern() is O(number of string objects / 1009); as more and more string objects are created, it becomes slower.

\openjdk7\hotspot\src\share\vm\classfile\symbolTable.cpp

oop StringTable::intern(Handle string_or_null, jchar* name,  
                        int len, TRAPS) {  
  unsigned int hashValue = java_lang_String::hash_string(name, len);  
  int index = the_table()->hash_to_index(hashValue);  
  oop string = the_table()->lookup(index, name, len, hashValue);  
  // Found  
  if (string != NULL) return string;  
  // Otherwise, add to symbol to table  
  return the_table()->basic_add(index, string_or_null, name, len,  
                                hashValue, CHECK_NULL);  
}

2. In Java 6 the string pool lives in the perm area (PermGen), not in the heap, and most of the time the perm size is configured to be relatively small.
