Java 8 (since update 20) introduced String Deduplication, which can be enabled by launching the JVM with the -XX:+UseStringDeduplication option. It saves memory by letting equal String objects share the same underlying character data instead of keeping duplicate copies. Of course, its effectiveness varies from program to program depending on how Strings are used, but I think it is safe to say that it would benefit most applications (if not all), which makes me wonder about a few things:
Why is it not enabled by default? Is it because of the costs associated with deduplication, or simply because G1GC is still considered new?
Are there (or could there be) any edge cases where you would not want to use deduplication?
2 Answers
#1
16
Cases where String de-duplication could be harmful include:
- Lots of strings but a very low probability of duplicates: the time overhead of looking for duplicates and the time and space overhead of the de-duping hashtable would not be repaid.
- A reasonable probability of duplicates, but most strings die within a couple of GC cycles [1]: de-duplication has much less benefit if the de-duped strings are going to be GC'ed soon anyway.
(The second case is not about strings that don't survive the first GC cycle. It would make no sense for the GC to even try to de-dup strings that it knows to be garbage.)
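To make the overhead argument concrete, here is a toy model of a de-duplication table. It is only a sketch: ToyString, DedupSketch and the counters are hypothetical names for illustration, and the real work happens inside the garbage collector rather than in application code. What it shows is that every candidate string costs a lookup and every distinct string costs a table entry, while memory is only saved when a duplicate is actually found.

```java
import java.util.HashMap;
import java.util.Map;

class DedupSketch {
    /** Toy stand-in for java.lang.String: an object that owns a backing array. */
    static final class ToyString {
        char[] value;
        ToyString(String s) { value = s.toCharArray(); }
    }

    /** Content -> canonical backing array. One entry per distinct string seen. */
    private final Map<String, char[]> table = new HashMap<>();
    int lookups = 0;       // time overhead: one hash lookup per candidate
    int arraysShared = 0;  // benefit: backing arrays that could be dropped

    void deduplicate(ToyString s) {
        lookups++;
        char[] canonical = table.putIfAbsent(new String(s.value), s.value);
        if (canonical != null && canonical != s.value) {
            s.value = canonical;   // the old array becomes garbage
            arraysShared++;
        }
    }

    int tableEntries() { return table.size(); }

    public static void main(String[] args) {
        DedupSketch dedup = new DedupSketch();
        ToyString a = new ToyString("order-42");
        ToyString b = new ToyString("order-42");
        ToyString c = new ToyString("order-43");
        dedup.deduplicate(a);
        dedup.deduplicate(b);
        dedup.deduplicate(c);
        System.out.println("a and b share an array: " + (a.value == b.value));
        System.out.println(dedup.arraysShared + " shared after " + dedup.lookups
                + " lookups, table entries: " + dedup.tableEntries());
    }
}
```

With few duplicates, the lookup count and table size grow while the shared count stays near zero, which is the first harmful case listed above.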
We can only speculate as to why the Java team didn't turn on de-duping by default, but they are in a much better position to make rational (i.e. evidence-based) decisions on this than you and I. My understanding is that they have access to many large real-world applications for benchmarking and trying out the effects of optimizations. They may also have close contacts with a number of partner or customer organizations with similarly large code-bases and concerns about efficiency, whom they could ask for information on which optimizations really work.
[1] This depends on the value of the StringDeduplicationAgeThreshold JVM setting. This defaults to 3, meaning that (roughly) a string has to survive 3 minor collections or a major collection to be considered for de-duping. But in any case, if a string is de-duped and then found to be unreachable shortly afterwards, the de-duping overhead will not be repaid for that string.
If you are asking when you should consider enabling de-duping, my advice would be to try it and see if it helps on a per-application basis. But you need to do some application-level benchmarking (which takes effort!) to be sure that the de-duping is beneficial ...
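When benchmarking, it helps to confirm that the flags you intended are really in effect in the running JVM. The following is a minimal sketch, assuming a HotSpot-based JVM where the com.sun.management.HotSpotDiagnosticMXBean is available; the class name DedupFlagCheck and the "unavailable" fallback are my own choices for illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

class DedupFlagCheck {
    /** Returns the current value of a -XX option, or "unavailable" if it cannot be read. */
    static String vmOption(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return bean.getVMOption(name).getValue();
        } catch (RuntimeException e) {
            // Unknown option, or not a HotSpot JVM.
            return "unavailable";
        }
    }

    public static void main(String[] args) {
        System.out.println("UseStringDeduplication = "
                + vmOption("UseStringDeduplication"));
        System.out.println("StringDeduplicationAgeThreshold = "
                + vmOption("StringDeduplicationAgeThreshold"));
    }
}
```

Running it once with and once without -XX:+UseStringDeduplication should show the first value change, which is a quick sanity check before comparing benchmark numbers.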
A careful read of JEP 192 would also help you understand the issues, and make a judgment on how they might apply to your Java application.
#2
11
I fully understand that this does not answer the question; I just wanted to mention that JDK 9 introduced one more optimization, which is on by default:
-XX:+CompactStrings
With it, a String that contains only Latin1 characters stores one byte per character instead of two (previously every character was held in a 2-byte char). Because of that change, many internal methods of String have changed: they behave the same for the user, but internally they are faster in a lot of cases.
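As a rough illustration of the space effect, the sketch below estimates the size of the character data with and without the compact encoding. The helper names fitsLatin1 and payloadBytes are hypothetical and this is not how the JDK implements it internally; the only assumption is that a string is stored with one byte per character when every char is at most 0xFF, and two bytes per character otherwise.

```java
class CompactStringsSketch {
    /** True when every char fits in one byte (0..255, i.e. Latin1). */
    static boolean fitsLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    /** Approximate bytes used by the character data only (object headers ignored). */
    static int payloadBytes(String s) {
        return fitsLatin1(s) ? s.length() : 2 * s.length();
    }

    public static void main(String[] args) {
        System.out.println(payloadBytes("hello"));        // 5
        System.out.println(payloadBytes("h\u00e9llo"));   // 5, e-acute is U+00E9
        System.out.println(payloadBytes("hello\u20ac"));  // 12, the euro sign forces 2 bytes per char
    }
}
```

Note that a single non-Latin1 character moves the whole string to the 2-byte encoding.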
Also, when two Strings are concatenated with the plus sign, javac now generates different bytecode.
There is no bytecode instruction that concatenates two Strings, so until JDK 9 javac would generate a chain of StringBuilder#append calls behind the scenes.
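For reference, the pre-JDK 9 translation is roughly equivalent to writing the StringBuilder calls by hand. The sketch below compares the two forms; the class and method names are made up for illustration, and the exact bytecode javac emitted may differ in detail.

```java
class ConcatBeforeJdk9 {
    /** What you write in source code. */
    static String withPlus(String a, String b) {
        return a + b;
    }

    /** Roughly what javac generated for it before JDK 9. */
    static String withBuilder(String a, String b) {
        return new StringBuilder().append(a).append(b).toString();
    }

    public static void main(String[] args) {
        System.out.println(withPlus("foo", "bar"));     // foobar
        System.out.println(withBuilder("foo", "bar"));  // foobar
    }
}
```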
Now the bytecode delegates to StringConcatFactory#makeConcatWithConstants or StringConcatFactory#makeConcat via the invokedynamic bytecode instruction:
0: aload_0
1: aload_2
2: aload_1
3: invokedynamic #8, 0 // InvokeDynamic #0:makeConcatWithConstants:(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;)Ljava/lang/String;
8: areturn
How the strings are concatenated is now a runtime decision: it could still be a StringBuilder, or it could be a concatenation of byte arrays, and so on. All you need to know is that the implementation can change without recompiling your code, and that the runtime is free to pick the fastest strategy available.
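To see what the invokedynamic call site resolves to, you can call the bootstrap method yourself. This is a minimal sketch that requires JDK 9 or later; the class name IndyConcatSketch and the greeting recipe are invented for the example. In the recipe string, the character \u0001 marks the position of each dynamic argument.

```java
import java.lang.invoke.CallSite;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.StringConcatFactory;

class IndyConcatSketch {
    /** Builds a concatenation call site by hand and invokes it once. */
    static String greet(String greeting, String name) throws Throwable {
        CallSite site = StringConcatFactory.makeConcatWithConstants(
                MethodHandles.lookup(),
                "makeConcatWithConstants",
                // (String, String) -> String, like the descriptor javac would emit
                MethodType.methodType(String.class, String.class, String.class),
                // recipe: argument, ", ", argument, "!"
                "\u0001, \u0001!");
        MethodHandle concat = site.getTarget();
        return (String) concat.invokeExact(greeting, name);
    }

    public static void main(String[] args) throws Throwable {
        System.out.println(greet("Hello", "world"));  // Hello, world!
    }
}
```

In compiled code the JVM performs the bootstrap step once per call site and then reuses the resulting method handle, so the setup cost here is not paid on every concatenation.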
EDIT
I've just stepped through this in a debugger and saw that there are quite a few strategies for concatenating these Strings:
private enum Strategy {
/**
* Bytecode generator, calling into {@link java.lang.StringBuilder}.
*/
BC_SB,
/**
* Bytecode generator, calling into {@link java.lang.StringBuilder};
* but trying to estimate the required storage.
*/
BC_SB_SIZED,
/**
* Bytecode generator, calling into {@link java.lang.StringBuilder};
* but computing the required storage exactly.
*/
BC_SB_SIZED_EXACT,
/**
* MethodHandle-based generator, that in the end calls into {@link java.lang.StringBuilder}.
* This strategy also tries to estimate the required storage.
*/
MH_SB_SIZED,
/**
* MethodHandle-based generator, that in the end calls into {@link java.lang.StringBuilder}.
* This strategy also estimate the required storage exactly.
*/
MH_SB_SIZED_EXACT,
/**
* MethodHandle-based generator, that constructs its own byte[] array from
* the arguments. It computes the required storage exactly.
*/
MH_INLINE_SIZED_EXACT
}
In JDK 9 the default is MH_INLINE_SIZED_EXACT.