How can I split a string by strings and include the delimiters using .NET?

There are many similar questions, but apparently no perfect match, which is why I'm asking.

I'd like to split a random string (e.g. 123xx456yy789) by a list of string delimiters (e.g. xx, yy) and include the delimiters in the result (here: 123, xx, 456, yy, 789).

Good performance is a nice bonus. Regex should be avoided, if possible.

Update: I did some performance checks and compared the results (too lazy to formally check them though). The tested solutions are (in random order):

Gabe
Guffa
Mafu
Regex

Other solutions were not tested because either they were similar to another solution or they came in too late.

This is the test code:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Program
{
    private static readonly List<Func<string, List<string>, List<string>>> Functions;
    private static readonly List<string> Sources;
    private static readonly List<List<string>> Delimiters;

    static Program ()
    {
        Functions = new List<Func<string, List<string>, List<string>>> ();
        Functions.Add ((s, l) => s.SplitIncludeDelimiters_Gabe (l).ToList ());
        Functions.Add ((s, l) => s.SplitIncludeDelimiters_Guffa (l).ToList ());
        Functions.Add ((s, l) => s.SplitIncludeDelimiters_Naive (l).ToList ());
        Functions.Add ((s, l) => s.SplitIncludeDelimiters_Regex (l).ToList ());

        Sources = new List<string> ();
        Sources.Add ("");
        Sources.Add (Guid.NewGuid ().ToString ());

        string str = "";
        for (int outer = 0; outer < 10; outer++) {
            for (int i = 0; i < 10; i++) {
                str += i + "**" + DateTime.UtcNow.Ticks;
            }
            str += "-";
        }
        Sources.Add (str);

        Delimiters = new List<List<string>> ();
        Delimiters.Add (new List<string> () { });
        Delimiters.Add (new List<string> () { "-" });
        Delimiters.Add (new List<string> () { "**" });
        Delimiters.Add (new List<string> () { "-", "**" });
    }

    private class Result
    {
        public readonly int FuncID;
        public readonly int SrcID;
        public readonly int DelimID;
        public readonly long Milliseconds;
        public readonly List<string> Output;

        public Result (int funcID, int srcID, int delimID, long milliseconds, List<string> output)
        {
            FuncID = funcID;
            SrcID = srcID;
            DelimID = delimID;
            Milliseconds = milliseconds;
            Output = output;
        }

        public void Print ()
        {
            Console.WriteLine ("S " + SrcID + "\tD " + DelimID + "\tF " + FuncID + "\t" + Milliseconds + "ms");
            Console.WriteLine (Output.Count + "\t" + string.Join (" ", Output.Take (10).Select (x => x.Length < 15 ? x : x.Substring (0, 15) + "...").ToArray ()));
        }
    }

    static void Main (string[] args)
    {
        var results = new List<Result> ();

        for (int srcID = 0; srcID < 3; srcID++) {
            for (int delimID = 0; delimID < 4; delimID++) {
                for (int funcId = 3; funcId >= 0; funcId--) { // i tried various orders in my tests
                    Stopwatch sw = new Stopwatch ();
                    sw.Start ();

                    var func = Functions[funcId];
                    var src = Sources[srcID];
                    var del = Delimiters[delimID];

                    for (int i = 0; i < 10000; i++) {
                        func (src, del);
                    }
                    var list = func (src, del);
                    sw.Stop ();

                    var res = new Result (funcId, srcID, delimID, sw.ElapsedMilliseconds, list);
                    results.Add (res);
                    res.Print ();
                }
            }
        }
    }
}

As you can see, it was really just a quick and dirty test, but I ran it multiple times and in different orders, and the results were always very consistent. The measured times range from milliseconds up to seconds for the larger datasets. I ignored the values in the low-millisecond range in the following evaluation because they seemed negligible in practice. Here's the output on my box:

S 0     D 0     F 3     11ms
1
S 0     D 0     F 2     7ms
1
S 0     D 0     F 1     6ms
1
S 0     D 0     F 0     4ms
0
S 0     D 1     F 3     28ms
1
S 0     D 1     F 2     8ms
1
S 0     D 1     F 1     7ms
1
S 0     D 1     F 0     3ms
0
S 0     D 2     F 3     30ms
1
S 0     D 2     F 2     8ms
1
S 0     D 2     F 1     6ms
1
S 0     D 2     F 0     3ms
0
S 0     D 3     F 3     30ms
1
S 0     D 3     F 2     10ms
1
S 0     D 3     F 1     8ms
1
S 0     D 3     F 0     3ms
0
S 1     D 0     F 3     9ms
1       9e5282ec-e2a2-4...
S 1     D 0     F 2     6ms
1       9e5282ec-e2a2-4...
S 1     D 0     F 1     5ms
1       9e5282ec-e2a2-4...
S 1     D 0     F 0     5ms
1       9e5282ec-e2a2-4...
S 1     D 1     F 3     63ms
9       9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 1     D 1     F 2     37ms
9       9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 1     D 1     F 1     29ms
9       9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 1     D 1     F 0     22ms
9       9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 1     D 2     F 3     30ms
1       9e5282ec-e2a2-4...
S 1     D 2     F 2     10ms
1       9e5282ec-e2a2-4...
S 1     D 2     F 1     10ms
1       9e5282ec-e2a2-4...
S 1     D 2     F 0     12ms
1       9e5282ec-e2a2-4...
S 1     D 3     F 3     73ms
9       9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 1     D 3     F 2     40ms
9       9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 1     D 3     F 1     33ms
9       9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 1     D 3     F 0     30ms
9       9e5282ec - e2a2 - 4265 - 8276 - 6dbb50fdae37
S 2     D 0     F 3     10ms
1       0**634226552821...
S 2     D 0     F 2     109ms
1       0**634226552821...
S 2     D 0     F 1     5ms
1       0**634226552821...
S 2     D 0     F 0     127ms
1       0**634226552821...
S 2     D 1     F 3     184ms
21      0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226552821... -
S 2     D 1     F 2     364ms
21      0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226552821... -
S 2     D 1     F 1     134ms
21      0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226552821... -
S 2     D 1     F 0     517ms
20      0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226552821... - 0**634226552821... -
S 2     D 2     F 3     688ms
201     0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 634226552821217... **
S 2     D 2     F 2     2404ms
201     0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 634226552821217... **
S 2     D 2     F 1     874ms
201     0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 634226552821217... **
S 2     D 2     F 0     717ms
201     0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 634226552821217... **
S 2     D 3     F 3     1205ms
221     0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 634226552821217... **
S 2     D 3     F 2     3471ms
221     0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 634226552821217... **
S 2     D 3     F 1     1008ms
221     0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 634226552821217... **
S 2     D 3     F 0     1095ms
220     0 ** 634226552821217... ** 634226552821217... ** 634226552821217... ** 634226552821217... **

I compared the results and this is what I found:

All 4 functions are fast enough for common usage.
The naive version (i.e. what I wrote initially) is the worst in terms of computation time.
Regex is a bit slow on small datasets (probably due to initialization overhead).
Regex does well on large data and reaches a speed similar to the non-regex solutions.
Overall, Guffa's version seems to perform best, which is to be expected from the code.
Gabe's version sometimes omits an item (a bug?), but I did not investigate this.

To conclude this topic, I suggest using Regex, which is reasonably fast. If performance is critical, I'd prefer Guffa's implementation.

7 solutions

#1

Despite your reluctance to use regex, it preserves the delimiters nicely if you use a capturing group with the Regex.Split method:

string input = "123xx456yy789";
string pattern = "(xx|yy)";
string[] result = Regex.Split(input, pattern);

If you remove the parentheses from the pattern, using just "xx|yy", the delimiters are not preserved. Be sure to use Regex.Escape on the delimiters if they contain any metacharacters that hold special meaning in regex. The characters include \, *, +, ?, |, {, [, (, ), ^, $, ., and #. For instance, a delimiter of . should be escaped as \. Given a list of delimiters, you need to "OR" them using the pipe | symbol, and that too is a character that gets escaped. To properly build the pattern, use the following code (thanks to @gabe for pointing this out):

var delimiters = new List<string> { ".", "xx", "yy" };
string pattern = "(" + String.Join("|", delimiters.Select(d => Regex.Escape(d))
                                                  .ToArray())
                  + ")";

The parentheses are concatenated rather than included in the pattern since they would be incorrectly escaped for your purposes.

EDIT: In addition, if the delimiters list happens to be empty, the final pattern would incorrectly be () and this would cause blank matches. To prevent this a check for the delimiters can be used. With all this in mind the snippet becomes:

string input = "123xx456yy789";
// to reach the else branch, set delimiters to new List<string>();
var delimiters = new List<string> { ".", "xx", "yy", "()" };
if (delimiters.Count > 0)
{
    string pattern = "("
                     + String.Join("|", delimiters.Select(d => Regex.Escape(d))
                                                  .ToArray())
                     + ")";
    string[] result = Regex.Split(input, pattern);
    foreach (string s in result)
    {
        Console.WriteLine(s);
    }
}
else
{
    // nothing to split
    Console.WriteLine(input);
}

If you need a case-insensitive match for the delimiters use the RegexOptions.IgnoreCase option: Regex.Split(input, pattern, RegexOptions.IgnoreCase)
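
The pieces above (escaping each delimiter, wrapping the alternation in a capturing group, guarding against an empty list, and optionally passing RegexOptions) could be combined into one helper. This is only a minimal sketch: the class name RegexSplitter and the method name SplitKeepDelimiters are hypothetical, and skipping null or empty delimiters is an assumption that is not part of the snippets above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexSplitter
{
    // Hypothetical helper: splits input on any of the delimiters and keeps
    // the delimiters in the result by using a capturing group.
    public static string[] SplitKeepDelimiters(
        string input,
        IEnumerable<string> delimiters,
        RegexOptions options = RegexOptions.None)
    {
        // Escape each delimiter separately so that the "|" used to join
        // them is not escaped itself.
        // Assumption: null or empty delimiters are simply ignored.
        string[] escaped = delimiters
            .Where(d => !string.IsNullOrEmpty(d))
            .Select(d => Regex.Escape(d))
            .ToArray();

        // With no usable delimiter the pattern would be "()", so return
        // the input unchanged instead.
        if (escaped.Length == 0)
        {
            return new[] { input };
        }

        string pattern = "(" + string.Join("|", escaped) + ")";
        return Regex.Split(input, pattern, options);
    }
}
```

Called as RegexSplitter.SplitKeepDelimiters("123xx456yy789", new[] { ".", "xx", "yy", "()" }), this returns 123, xx, 456, yy, 789. Passing RegexOptions.IgnoreCase as the third argument makes "123XX456yy789" split the same way, with the delimiter returned as it appears in the input (XX).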

EDIT #2: the solution so far matches split tokens that might be a substring of a larger string. If the split token should be matched completely, rather than part of a substring, such as a scenario where words in a sentence are used as the delimiters, then the word-boundary \b metacharacter should be added around the pattern.

For example, consider this sentence (yea, it's corny): "Welcome to stackoverflow... where the stack never overflows!"

If the delimiters were { "stack", "flow" } the current solution would split "stackoverflow" and return 3 strings { "stack", "over", "flow" }. If you needed an exact match, then the only place this would split would be at the word "stack" later in the sentence and not "stackoverflow".

To achieve an exact match behavior alter the pattern to include \b as in \b(delim1|delim2|delimN)\b:

string pattern = @"\b("
                + String.Join("|", delimiters.Select(d => Regex.Escape(d)))
                + @")\b";

Finally, if trimming the spaces before and after the delimiters is desired, add \s* around the pattern as in \s*(delim1|delim2|delimN)\s*. This can be combined with \b as follows:

string pattern = @"\s*\b("
                + String.Join("|", delimiters.Select(d => Regex.Escape(d)))
                + @")\b\s*";
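
To see the effect of \b and \s* on a concrete input, here is a small sketch. The class name WholeWordSplitter, the trimSpaces flag, and the sample sentence are made up for illustration, and the sketch assumes a non-empty delimiter list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WholeWordSplitter
{
    // Hypothetical helper: splits only where a delimiter occurs as a whole
    // word, optionally swallowing the whitespace around it.
    // Assumption: delimiters contains at least one non-empty string.
    public static string[] Split(
        string input, IEnumerable<string> delimiters, bool trimSpaces)
    {
        string alternation =
            string.Join("|", delimiters.Select(d => Regex.Escape(d)));

        // \b on both sides rejects matches inside a longer word.
        string pattern = @"\b(" + alternation + @")\b";

        // \s* sits outside the capturing group, so the surrounding
        // whitespace is consumed but not returned.
        if (trimSpaces)
        {
            pattern = @"\s*" + pattern + @"\s*";
        }

        return Regex.Split(input, pattern);
    }
}
```

For the input "stackoverflow: the stack never overflows" and the delimiters { "stack", "flow" }, the call with trimSpaces set to false yields three strings: "stackoverflow: the ", "stack" and " never overflows". With trimSpaces set to true the outer two become "stackoverflow: the" and "never overflows". Neither "stackoverflow" nor "overflows" is split, because the delimiters are not whole words there.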

#2

Ok, sorry, maybe this one:

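    string source = "123xx456yy789";
    string[] delimiters = { "xx", "yy" };
    // Note: this assumes ';' never occurs in the source or in a delimiter.
    foreach (string delimiter in delimiters)
        source = source.Replace(delimiter, ";" + delimiter + ";");
    string[] parts = source.Split(';');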

#3

Here's a solution that doesn't use a regular expression and doesn't make more strings than necessary:

public static List<string> Split(string searchStr, string[] separators)
{
    List<string> result = new List<string>();
    int length = searchStr.Length;
    int lastMatchEnd = 0;
    for (int i = 0; i < length; i++)
    {
        for (int j = 0; j < separators.Length; j++)
        {
            string str = separators[j];
            int sepLen = str.Length;
            if (((searchStr[i] == str[0]) && (sepLen <= (length - i))) && ((sepLen == 1) || (String.CompareOrdinal(searchStr, i, str, 0, sepLen) == 0)))
            {
                result.Add(searchStr.Substring(lastMatchEnd, i - lastMatchEnd));
                result.Add(separators[j]);
                i += sepLen - 1;
                lastMatchEnd = i + 1;
                break;
            }
        }
    }
    if (lastMatchEnd != length)
        result.Add(searchStr.Substring(lastMatchEnd));
    return result;
}

#4

I came up with a solution for something similar a while back. To efficiently split a string you can keep a list of the next occurrence of each delimiter. That way you minimise the number of times that you have to look for each delimiter.

This algorithm will perform well even for a long string and a large number of delimiters:

string input = "123xx456yy789";
string[] delimiters = { "xx", "yy" };

int[] nextPosition = delimiters.Select(d => input.IndexOf(d)).ToArray();
List<string> result = new List<string>();
int pos = 0;
while (true) {
  int firstPos = int.MaxValue;
  string delimiter = null;
  for (int i = 0; i < nextPosition.Length; i++) {
    if (nextPosition[i] != -1 && nextPosition[i] < firstPos) {
      firstPos = nextPosition[i];
      delimiter = delimiters[i];
    }
  }
  if (firstPos != int.MaxValue) {
    result.Add(input.Substring(pos, firstPos - pos));
    result.Add(delimiter);
    pos = firstPos + delimiter.Length;
    for (int i = 0; i < nextPosition.Length; i++) {
      if (nextPosition[i] != -1 && nextPosition[i] < pos) {
        nextPosition[i] = input.IndexOf(delimiters[i], pos);
      }
    }
  } else {
    result.Add(input.Substring(pos));
    break;
  }
}

(With reservations for any bugs; I just threw this version together now and I haven't tested it thoroughly.)
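
For reuse, the same next-occurrence idea could be wrapped in a method along these lines. This is a sketch rather than a tested drop-in: the class and method names are hypothetical, and two details are assumptions that differ from the snippet above. Empty delimiters are filtered out, since an empty delimiter matches at every position and the loop would never advance, and the searches use ordinal comparison.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NextOccurrenceSplitter
{
    // Hypothetical wrapper around the "next occurrence per delimiter" idea.
    public static List<string> SplitKeepDelimiters(
        string input, IEnumerable<string> delimiters)
    {
        // Assumption: null or empty delimiters are ignored.
        string[] delims = delimiters
            .Where(d => !string.IsNullOrEmpty(d))
            .ToArray();

        // next[i] is the next known position of delims[i], or -1 if none.
        int[] next = delims
            .Select(d => input.IndexOf(d, StringComparison.Ordinal))
            .ToArray();

        var result = new List<string>();
        int pos = 0;
        while (true)
        {
            // Find the delimiter whose next occurrence comes first.
            int first = -1;
            for (int i = 0; i < next.Length; i++)
            {
                if (next[i] != -1 && (first == -1 || next[i] < next[first]))
                {
                    first = i;
                }
            }

            if (first == -1)
            {
                // No delimiter left: the rest of the input is the last part.
                result.Add(input.Substring(pos));
                return result;
            }

            int at = next[first];
            result.Add(input.Substring(pos, at - pos));
            result.Add(delims[first]);
            pos = at + delims[first].Length;

            // Only re-search delimiters whose cached position is now stale.
            for (int i = 0; i < next.Length; i++)
            {
                if (next[i] != -1 && next[i] < pos)
                {
                    next[i] = input.IndexOf(delims[i], pos, StringComparison.Ordinal);
                }
            }
        }
    }
}
```

With the input "123xx456yy789" and the delimiters { "xx", "yy" }, this returns 123, xx, 456, yy, 789. Like the snippet above, it keeps empty parts, so an input that ends with a delimiter produces a trailing empty string.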

#5

A naive implementation

public IEnumerable<string> SplitX (string text, string[] delimiters)
{
    var split = text.Split (delimiters, StringSplitOptions.None);

    foreach (string part in split) {
        yield return part;
        text = text.Substring (part.Length);

        string delim = delimiters.FirstOrDefault (x => text.StartsWith (x));
        if (delim != null) {
            yield return delim;
            text = text.Substring (delim.Length);
        }
    }
}

#6

This will have the same semantics as String.Split with StringSplitOptions.RemoveEmptyEntries (so empty tokens are not included).

It can be made faster by using unsafe code to iterate over the source string, though this requires you to write the iteration mechanism yourself rather than using yield return. It allocates the absolute minimum (a substring per non separator token plus the wrapping enumerator) so realistically to improve performance you would have to:

use even more unsafe code (by using 'CompareOrdinal' I effectively am)
- mainly in avoiding the overhead of character lookup on the string with a char buffer
make use of domain specific knowledge about the input sources or tokens.
- you may be happy to eliminate the null check on the separators
- you may know that the separators are almost never individual characters

The code is written as an extension method

public static IEnumerable<string> SplitWithTokens(
    this string str,
    string[] separators)
{
    if (separators == null || separators.Length == 0)
    {
        yield return str;
        yield break;
    }
    int prev = 0;
    for (int i = 0; i < str.Length; i++)
    {
        foreach (var sep in separators)
        {
            if (!string.IsNullOrEmpty(sep))
            {
                if (((str[i] == sep[0]) && 
                          (sep.Length <= (str.Length - i))) 
                     &&
                    ((sep.Length == 1) || 
                    (string.CompareOrdinal(str, i, sep, 0, sep.Length) == 0)))
                {
                    if (i - prev != 0)
                        yield return str.Substring(prev, i - prev);
                    yield return sep;
                    i += sep.Length - 1;
                    prev = i + 1;
                    break;
                }
            }
        }
    }
    if (str.Length - prev > 0)
        yield return str.Substring(prev, str.Length - prev);
}
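
The empty-token behaviour this answer refers to can be compared with the two StringSplitOptions modes of String.Split. A tiny sketch, with a hypothetical class name and sample input: StringSplitOptions.None keeps the empty entries between and after separators, while StringSplitOptions.RemoveEmptyEntries drops them, which is how the method above treats empty tokens.

```csharp
using System;

public static class EmptyTokenDemo
{
    // Keeps empty entries, e.g. between two adjacent separators.
    public static string[] KeepEmpty(string s, string[] separators)
    {
        return s.Split(separators, StringSplitOptions.None);
    }

    // Drops empty entries, so only non-empty tokens are returned.
    public static string[] DropEmpty(string s, string[] separators)
    {
        return s.Split(separators, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

For "a--b-" and the separator "-", KeepEmpty returns four entries ("a", "", "b", "") and DropEmpty returns two ("a", "b").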

#7

My first post/answer...this is a recursive approach.

    static void Split(string src, string[] delims, ref List<string> final)
    {
        if (src.Length == 0)
            return;

        int endTrimIndex = src.Length;
        foreach (string delim in delims)
        {
            //get the index of the first occurrence of this delim
            int indexOfDelim = src.IndexOf(delim);
            //check to see if this delim is at the beginning of src
            if (indexOfDelim == 0)
            {
                endTrimIndex = delim.Length;
                break;
            }
            //see if this delim comes before previously searched delims
            else if (indexOfDelim < endTrimIndex && indexOfDelim != -1)
                endTrimIndex = indexOfDelim;
        }
        final.Add(src.Substring(0, endTrimIndex));
        Split(src.Remove(0, endTrimIndex), delims, ref final);
    }
