为什么StringBuilder比字符串操作快,而List比LinkedList快?

时间:2022-02-07 07:16:45

So we are told that StringBuilder should be used when you are doing more than a few operations on a string (I've heard as low as three). Therefore we should replace this:

所以我们被告知,当你在一个字符串上做超过几个操作时,应该使用StringBuilder(我听说过只有三个操作)。因此,我们应该替换这个:

string s = "";
foreach (var item in items) // where items is IEnumerable<string>
    s += item;

With this:

用这个:

string s = new StringBuilder(items).ToString();

I assume that internally StringBuilder holds references to each Appended string, combining then on request. Lets compare this to the HybridDictionary, that uses a LinkedList for the first 10 elements, then swaps to a HashTable when the list grows more then 10. As we can see the same kind of pattern is here, small number of references = linkedList, else make ever increasing blocks of arrays.

我假设StringBuilder内部保存对每个附加字符串的引用,然后根据请求进行组合。让我们将其与HybridDictionary进行比较,后者对前10个元素使用LinkedList,然后在列表增长超过10时将其交换为HashTable。正如我们在这里看到的相同的模式,少量的引用= linkedList,否则会产生越来越多的数组块。

Lets look at how a List works. Start off with a list size (internal default is 4). Add elements to the internal array, if the array is full, make a new array of double the size of the current array, copy the current array's elements across, then add the new element and make the new array the current array.

让我们看看列表是如何工作的。开始一个列表大小(内部默认4)。将元素添加到内部数组,如果数组已满,新建一个双电流大小的数组,数组复制当前数组的元素,然后加入新元素,使新数组当前数组。

Can you see my confusion as to the performance benefits? For all elements besides strings, we make new arrays, copy old values and add the new value. But for strings that's bad? because we know that "a" + "b" makes a new string reference from the two old references, "a" and "b".

你能看出我对性能好处的困惑吗?对于字符串之外的所有元素,我们创建新数组、复制旧值并添加新值。但是对于字符串来说,这是不好的吗?因为我们知道“a”+“b”从两个旧的引用“a”和“b”中创建了一个新的字符串引用。

Hope my question isn't too confusing. Why does there seem to be a double standard between string concatenation and array concatenation (I know strings are arrays of chars)?

希望我的问题不要让人迷惑。为什么字符串连接和数组连接之间似乎有一个双重标准(我知道字符串是字符的数组)?

String: Making new references is bad!

字符串:做新的引用是不好的!

T : where T != String: Making new references is good!

T: where T != String:新建引用很好!

Edit: Maybe what I'm really asking here, is when does making new, bigger arrays and copying the old values across, start being faster than have references to randomly places objects all over the heap?

编辑:也许我真正想问的是,什么时候创建新的、更大的数组并复制旧值,开始比在堆中随机放置对象的引用快?

Double edit: By faster I mean reading, writing and finding variables, not inserting or removing (i.e. LinkedList would kickass at inserting for example, but I don't care about that).

重复编辑:我说的更快的意思是读、写和查找变量,而不是插入或删除(例如,LinkedList会在插入时出错,但我不在乎)。

Final edit: I don't care about StringBuilder, I'm interested in the trade off in time taken to copy data from one part of the heap to another for cache alignments, vs just taking the cache misses from teh cpu and have references all over the heap. When does one become faster then the other?*

最后编辑:我不关心StringBuilder,我感兴趣的是将数据从堆的一个部分复制到另一个部分以获得缓存对齐,而不是只从cpu中获取缓存缺失并在堆中有引用。什么时候一个比另一个快

3 个解决方案

#1


4  

Therefore we should replace this:

因此,我们应该替换这个:

No you shouldn't. The first case you showed string concatenation that can take place at compile time and have replaced it with string concatenation that takes place a runtime. The former is much more desirable, and will execute faster than the latter.

不你不应该。第一个例子中,您展示了可以在编译时发生的字符串连接,并将其替换为运行时发生的字符串连接。前者更可取,而且执行速度要比后者快。

It's important to use a string builder when the number of strings being concatted is not known at compile time. Often (but not always) this means concatting strings in a loop.

当编译时不知道字符串的数量时,使用字符串构建器是很重要的。通常(但不总是)这意味着在循环中连接字符串。

Earlier versions of String Builder (before 4.0, if memory serves), did internally look more or less like a List<char>, and it's correct that post 4.0 it looks more like a LinkedList<char[]>. However, the key difference here between using a StringBuilder and using regular string concatenation in a loop is not the difference between a linked list style in which objects contain references to the next object in the "chain" and an array-based style in which an internal buffer overallocates space and is reallocated occasionally as needed, but rather the difference between a mutable object and an immutable object. The problem with traditional string concatenation is that, since strings are immutable, each concatenation must copy all of the memory from both strings into a new string. When using a StringBuilder the new string only needs to be copied onto the end of some type of data structure, leaving all of the existing memory as it is. What type of data structure that is isn't terribly important here; we can rely on Microsoft to use a structure/algorithm that has been proven to have the best performance characteristics for the most common situations.

早期版本的String Builder(在4.0之前,如果有内存的话)在内部看起来或多或少像一个列表 ,而post 4.0看起来更像一个LinkedList 。然而,使用StringBuilder之间的关键区别和使用常规字符串连接在一个循环中不是一个链表的区别的风格在哪些对象包含的下一个对象引用的“链”和一个基于数组的风格重新分配一个内部缓冲区overallocates空间和偶尔会根据需要,而是一个可变对象之间的区别和一个不可变的对象。传统字符串连接的问题是,由于字符串是不可变的,每个连接必须将两个字符串中的所有内存复制到一个新的字符串中。当使用StringBuilder时,新字符串只需要复制到某种数据结构的末尾,保留所有现有的内存。什么类型的数据结构在这里不是很重要;我们可以依赖微软的结构/算法,已经被证明在最常见的情况下具有最佳的性能特征。 []>

#2


1  

It seems to me that you are conflating the resizing of a list with the evaluation of a string expression, and assuming that the two should behave the same way.

在我看来,您是在将一个列表的大小调整为一个字符串表达式的值,并假设两者的行为应该是相同的。

Consider your example: string s = "a" + "b" + "c" + "d"

考虑您的示例:string s = "a" + "b" + "c" + "d"

Assuming no optimisations of the constant expression (which the compiler would handle automatically), what this will do is evaluate each operation in turn:

假设常量表达式没有优化(编译器会自动处理),那么这将依次对每个操作进行评估:

string s = (("a" + "b") + "c") + "d"

This results in the strings "ab" and "abc" being created as part of that single expression. This has to happen, because strings in.NET are immutable, which means their values cannot be changed once created. This is because, if strings were mutable, you'd have code like this:

这导致字符串“ab”和“abc”作为单个表达式的一部分被创建。这是必须的,因为弦在里面。NET是不可变的,这意味着它们的值一旦创建就不能更改。这是因为,如果字符串是可变的,就会有这样的代码:

string a = "hello";
string b = a;       // would assign b the same reference as a
string b += "world"; // would update the string it references
// now a == "helloworld"

If this were a List, the code would make more sense, and doesn't even need explanation:

如果这是一个列表,代码会更有意义,甚至不需要解释:

var a = new List<int> { 1, 2, 3 };
var b = a;
b.Add(4);
// now a == { 1, 2, 3, 4 }

So the reason that non-string "list" types allocate extra memory early is for reasons of efficiency, and to reduce allocations when the list is extended. The reason that a string does not do that is because a string's value is never updated once created.

因此,非字符串“列表”类型提前分配额外内存的原因是为了提高效率,并在扩展列表时减少分配。字符串不这样做的原因是字符串的值在创建后永远不会更新。

Your assumption about the operation of the StringBuilder is irrelevant, but the purpose of a StringBuilder is essentially to create a non-immutable object that reduces the overhead of multiple string operations.

关于StringBuilder操作的假设是无关的,但是StringBuilder的目的本质上是创建一个非不可变的对象,以减少多个字符串操作的开销。

#3


0  

The backing store of a StringBuilder is a char[] that gets resized as needed. Nothing is turned into a string until you invoke StringBuilder.ToString() on it.

StringBuilder的后台存储是一个char[],可以根据需要调整大小。在调用StringBuilder.ToString()之前,没有任何东西会变成一个字符串。

The backing store of List<T> is a T[] that gets resized as needed.

List 的后置存储是一个T[],可以根据需要调整大小。

The problem with something like

问题是

string s = a + b + c + d ;

is that the compiler parses it as

编译器是这样分析的吗

  +
 / \
a   +
   / \
  b   +
     / \
    c   d

and, unless it can see opportunities for optimization, do something like

而且,除非它能看到优化的机会,否则做一些类似的事情

string t1 = c +  d ;
string t2 = b + t1 ;
string s  = a + t2 ;

thus creating two temporaries and the final string. With a StringBuilder, though, it's going to build out the character array it needs and at the end create one string.

这样就创建了两个临时字符串和最后一个字符串。但是,对于StringBuilder,它将构建它需要的字符数组,并在最后创建一个字符串。

This is a win because strings, once created, are immutable (can't be changed) and are generally interned in the string pool (meaning that there is only ever one instance of the string...no matter how many time your create the string "abc", every instance will always be a reference to the same object in the string pool.

这是一个胜利,因为字符串一旦创建,是不可变的(不能更改),并且通常在字符串池中进行交互(意味着该字符串只有一个实例……)无论您创建字符串“abc”的时间有多长,每个实例都始终是字符串池中相同对象的引用。

This adds cost to string creation as well: having determined the candidate string, the runtime has to check the string pool to see if it already exists. If it does, that reference is used; if it does not the candidate string is added to the string pool.

这也增加了字符串创建的成本:在确定了候选字符串之后,运行时必须检查字符串池,看看它是否已经存在。如果是,则使用该引用;如果没有,候选字符串将被添加到字符串池中。

Your example, though:

不过,你的例子:

string s = "a" + "b" + "c" + "d" ;

is a non-sequitur: the compile sees the constant expression and does an optimization called constant folding, so it becomes (even in debug mode):

是一个非序列:编译看到常量表达式并进行名为常量折叠的优化,因此它变成(即使在调试模式下):

string s = "abcd" ;

Similar optimizations happen with arithmetic expressions:

算术表达式也有类似的优化:

int x = 12 / 3 ;

is going to be optimized away to

会被优化到

int x = 4 ;

#1


4  

Therefore we should replace this:

因此,我们应该替换这个:

No you shouldn't. The first case you showed string concatenation that can take place at compile time and have replaced it with string concatenation that takes place a runtime. The former is much more desirable, and will execute faster than the latter.

不你不应该。第一个例子中,您展示了可以在编译时发生的字符串连接,并将其替换为运行时发生的字符串连接。前者更可取,而且执行速度要比后者快。

It's important to use a string builder when the number of strings being concatted is not known at compile time. Often (but not always) this means concatting strings in a loop.

当编译时不知道字符串的数量时,使用字符串构建器是很重要的。通常(但不总是)这意味着在循环中连接字符串。

Earlier versions of String Builder (before 4.0, if memory serves), did internally look more or less like a List<char>, and it's correct that post 4.0 it looks more like a LinkedList<char[]>. However, the key difference here between using a StringBuilder and using regular string concatenation in a loop is not the difference between a linked list style in which objects contain references to the next object in the "chain" and an array-based style in which an internal buffer overallocates space and is reallocated occasionally as needed, but rather the difference between a mutable object and an immutable object. The problem with traditional string concatenation is that, since strings are immutable, each concatenation must copy all of the memory from both strings into a new string. When using a StringBuilder the new string only needs to be copied onto the end of some type of data structure, leaving all of the existing memory as it is. What type of data structure that is isn't terribly important here; we can rely on Microsoft to use a structure/algorithm that has been proven to have the best performance characteristics for the most common situations.

早期版本的String Builder(在4.0之前,如果有内存的话)在内部看起来或多或少像一个列表 ,而post 4.0看起来更像一个LinkedList 。然而,使用StringBuilder之间的关键区别和使用常规字符串连接在一个循环中不是一个链表的区别的风格在哪些对象包含的下一个对象引用的“链”和一个基于数组的风格重新分配一个内部缓冲区overallocates空间和偶尔会根据需要,而是一个可变对象之间的区别和一个不可变的对象。传统字符串连接的问题是,由于字符串是不可变的,每个连接必须将两个字符串中的所有内存复制到一个新的字符串中。当使用StringBuilder时,新字符串只需要复制到某种数据结构的末尾,保留所有现有的内存。什么类型的数据结构在这里不是很重要;我们可以依赖微软的结构/算法,已经被证明在最常见的情况下具有最佳的性能特征。 []>

#2


1  

It seems to me that you are conflating the resizing of a list with the evaluation of a string expression, and assuming that the two should behave the same way.

在我看来,您是在将一个列表的大小调整为一个字符串表达式的值,并假设两者的行为应该是相同的。

Consider your example: string s = "a" + "b" + "c" + "d"

考虑您的示例:string s = "a" + "b" + "c" + "d"

Assuming no optimisations of the constant expression (which the compiler would handle automatically), what this will do is evaluate each operation in turn:

假设常量表达式没有优化(编译器会自动处理),那么这将依次对每个操作进行评估:

string s = (("a" + "b") + "c") + "d"

This results in the strings "ab" and "abc" being created as part of that single expression. This has to happen, because strings in.NET are immutable, which means their values cannot be changed once created. This is because, if strings were mutable, you'd have code like this:

这导致字符串“ab”和“abc”作为单个表达式的一部分被创建。这是必须的,因为弦在里面。NET是不可变的,这意味着它们的值一旦创建就不能更改。这是因为,如果字符串是可变的,就会有这样的代码:

string a = "hello";
string b = a;       // would assign b the same reference as a
string b += "world"; // would update the string it references
// now a == "helloworld"

If this were a List, the code would make more sense, and doesn't even need explanation:

如果这是一个列表,代码会更有意义,甚至不需要解释:

var a = new List<int> { 1, 2, 3 };
var b = a;
b.Add(4);
// now a == { 1, 2, 3, 4 }

So the reason that non-string "list" types allocate extra memory early is for reasons of efficiency, and to reduce allocations when the list is extended. The reason that a string does not do that is because a string's value is never updated once created.

因此,非字符串“列表”类型提前分配额外内存的原因是为了提高效率,并在扩展列表时减少分配。字符串不这样做的原因是字符串的值在创建后永远不会更新。

Your assumption about the operation of the StringBuilder is irrelevant, but the purpose of a StringBuilder is essentially to create a non-immutable object that reduces the overhead of multiple string operations.

关于StringBuilder操作的假设是无关的,但是StringBuilder的目的本质上是创建一个非不可变的对象,以减少多个字符串操作的开销。

#3


0  

The backing store of a StringBuilder is a char[] that gets resized as needed. Nothing is turned into a string until you invoke StringBuilder.ToString() on it.

StringBuilder的后台存储是一个char[],可以根据需要调整大小。在调用StringBuilder.ToString()之前,没有任何东西会变成一个字符串。

The backing store of List<T> is a T[] that gets resized as needed.

List 的后置存储是一个T[],可以根据需要调整大小。

The problem with something like

问题是

string s = a + b + c + d ;

is that the compiler parses it as

编译器是这样分析的吗

  +
 / \
a   +
   / \
  b   +
     / \
    c   d

and, unless it can see opportunities for optimization, do something like

而且,除非它能看到优化的机会,否则做一些类似的事情

string t1 = c +  d ;
string t2 = b + t1 ;
string s  = a + t2 ;

thus creating two temporaries and the final string. With a StringBuilder, though, it's going to build out the character array it needs and at the end create one string.

这样就创建了两个临时字符串和最后一个字符串。但是,对于StringBuilder,它将构建它需要的字符数组,并在最后创建一个字符串。

This is a win because strings, once created, are immutable (can't be changed) and are generally interned in the string pool (meaning that there is only ever one instance of the string...no matter how many time your create the string "abc", every instance will always be a reference to the same object in the string pool.

这是一个胜利,因为字符串一旦创建,是不可变的(不能更改),并且通常在字符串池中进行交互(意味着该字符串只有一个实例……)无论您创建字符串“abc”的时间有多长,每个实例都始终是字符串池中相同对象的引用。

This adds cost to string creation as well: having determined the candidate string, the runtime has to check the string pool to see if it already exists. If it does, that reference is used; if it does not the candidate string is added to the string pool.

这也增加了字符串创建的成本:在确定了候选字符串之后,运行时必须检查字符串池,看看它是否已经存在。如果是,则使用该引用;如果没有,候选字符串将被添加到字符串池中。

Your example, though:

不过,你的例子:

string s = "a" + "b" + "c" + "d" ;

is a non-sequitur: the compile sees the constant expression and does an optimization called constant folding, so it becomes (even in debug mode):

是一个非序列:编译看到常量表达式并进行名为常量折叠的优化,因此它变成(即使在调试模式下):

string s = "abcd" ;

Similar optimizations happen with arithmetic expressions:

算术表达式也有类似的优化:

int x = 12 / 3 ;

is going to be optimized away to

会被优化到

int x = 4 ;