我应该推出自己的ParseInt32版本吗?

时间:2022-10-29 20:47:37

I am writing a high performance parser, and it seems to me that Int32.Parse can be too slow. I wrote a simple version that assumes correct input, and it's performing much better. So should I create my own version instead? Or is there another faster method already available?

我正在编写一个高性能的解析器,在我看来Int32.Parse可能太慢了。我写了一个假设正确输入的简单版本,它的表现要好得多。那么我应该创建自己的版本吗?或者还有另一种更快的方法吗?

My method is like this:

我的方法是这样的:

// parse simple int, assuming relatively correct input (i.e. all digits)
public static int ParseInt32Simply(string str) {
    if (str == null) throw new ArgumentNullException("str");
    if (str.Length == 0) throw new ArgumentException("str is empty");

    int sign = 1, index = 0;
    if (str[0] == '-') { sign = -1; index = 1; }
    else if (str[0] == '+') { index = 1; }

    int result = 0;
    for (; index < str.Length; ++index) {
        result = 10 * result + (str[index] - '0');
    }

    if (result < 0) throw new OverflowException(str + " is too large for Int32");

    return result * sign;
}

My results are very different from the builtin equivalent:

我的结果与内置的等价物非常不同:

Int32.Parse      took 8.2775453 seconds
ParseInt32Simply took 0.6511523 seconds
Int32.Parse      took 6.7625807 seconds
ParseInt32Simply took 0.4677390 seconds

(Running 25 million iterations on my machine; a P4 3 GHz, running VS 2008 SP1)

(在我的机器上运行2500万次迭代; P4 3 GHz,运行VS 2008 SP1)

So, should I use my version? Or is there another method available that I can use?

那么,我应该使用我的版本吗?或者我可以使用另一种方法吗?

11 个解决方案

#1


If your parsing a format of which you know to be valid numbers, you can indeed write a faster custom parser. I've written a Double.Parse function for the same purpose once. And it faster to begin with the least significant digit. That way you can just increment the power of the digit your parsing.

如果您解析的格式是您知道的有效数字,那么您确实可以编写更快的自定义解析器。我曾为同一目的写过一个Double.Parse函数。从最低有效数字开始更快。这样你就可以增加解析数字的功效。

I've created a quick implementation of this,

我已经创建了一个快速实现的,

public static Int32 ParseValidNumberAsInt32(string str)
{
    if (str == null) 
        throw new ArgumentNullException("str");
    if (str.Length == 0) 
        throw new ArgumentException("str is empty");
    Int32 result = 0;
    Int32 currentPower = 1;
    Boolean isNegative = str[0] == '-';

    for (int currentCharIndex = str.Length - 1; currentCharIndex > 0; currentCharIndex--)
    {
        result += (str[currentCharIndex] - '0') * currentPower;
        currentPower *= 10;
    }
    return isNegative ? -1 * result : result + ((str[0] - '0') * currentPower);
}

If you really want speed, you can write a unsafe implementation..

如果你真的想要速度,你可以编写一个不安全的实现..

If your parsing a big file, you could read the files as raw bytes and work with those. That will make it a lot faster (no converting to unicode string, no splitting the strings in lines, no splitting the lines in substrings, no parsing the substrings), but you're going to lose maintainability.

如果您解析一个大文件,您可以将文件作为原始字节读取并使用它们。这将使它快得多(不转换为unicode字符串,不在行中拆分字符串,不在子字符串中拆分行,不解析子字符串),但是你将失去可维护性。

#2


Have you profiled your code yet to determine that ParseInt32 is actually the bottleneck? I wouldn't replace something that's part of the "standard library" of the environment you're coding in unless you know for certain that you're going to see a benefit.

您是否已经分析了您的代码以确定ParseInt32实际上是瓶颈?除非你肯定知道你会看到一个好处,否则我不会替换你正在编码的环境的“标准库”中的一部分。

#3


In .net Int32.Parse is very very quick, when it's successful.

在.net Int32.Parse非常快,当它成功。

When it fails it throws an exception - then it's very slow because exceptions are slow.

当它失败时会抛出异常 - 然后它非常慢,因为异常很慢。

You need to expand your test - you need to check times for a pattern of good and bad strings that's similar to what you need it to do.

您需要扩展测试 - 您需要检查时间是否与您需要的类似的好的和坏的字符串模式。

If you are pretty sure that all your strings are valid ints then Int32.Parse is the way to go. If you suspect that any more than a negligible few are going to be valid then it is much quicker to use Int32.TryParse, rather than a try-catch in your loop.

如果您非常确定所有字符串都是有效的int,那么Int32.Parse就是您的选择。如果你怀疑任何可以忽略不计的几个都是有效的,那么使用Int32.TryParse要快得多,而不是循环中的try-catch。

Typically if your try-catch is outside the loop use Int32.Parse - you'll get an exception and stop the first time you get an invalid value.

通常,如果您的try-catch在循环外部使用Int32.Parse - 您将获得异常并在第一次获得无效值时停止。

If your try-catch is inside the loop use Int32.TryParse instead.

如果你的try-catch在循环中,请使用Int32.TryParse。

Both Int32.Parse and Int32.TryParse are pretty highly optimised and are relatively mature - I'd expect them to be very tough to improve on unless you have specialist circumstances.

Int32.Parse和Int32.TryParse都经过了相当高度的优化并且相对成熟 - 除非你有专业的情况,否则我认为它们很难改进。

#4


My view would be that if the time savings you are getting are significant and of benefit for your application, then go for it.

我的观点是,如果节省的时间对您的应用程序而言非常重要且有益,那么请继续使用。

We had a vaguely similar issue with XML parsing and opted to do it manually for performance reasons, but it was based on a known environment - we were feeding the XML, so we could fairly safely take shortcuts in the parsing.

我们在XML解析方面有一个模糊的类似问题,并且出于性能原因选择手动执行,但它基于一个已知的环境 - 我们正在提供XML,因此我们可以相当安全地在解析中使用快捷方式。

Obviously the risk is that it is not likely to be complete as the standard library version and so new developers to the team will need to be made aware of this, lest they do something to break it.

显然风险在于它不太可能像标准库版本那样完整,因此需要让团队的新开发人员意识到这一点,以免他们做些什么来打破它。

#5


Yes - you can use your own version of parsing int as long that you are 100% certain that the source data is something you have control over (and thus always conforms to your format of Int32). Also, you should use your own code isolated from the rest of the world because if you've got this in some library you're publishing, people might want to have the standard behaviour of Int32.Parse. If you can't provide it, it's no good for them. However, as many people here suggests, you should really be certain that this is what really needs doing if you're trying to squeeze out the most of your performance. However, you probably know your own code better than anyone here.

是的 - 只要您100%确定源数据是您可以控制的内容(因此始终符合您的Int32格式),您就可以使用自己的解析int版本。此外,您应该使用自己的代码来隔离世界其他地方,因为如果您在某个库中发布了这个代码,那么人们可能希望拥有Int32.Parse的标准行为。如果你不能提供,那对他们没有好处。然而,正如这里的许多人所暗示的那样,如果你试图挤出你的大部分表现,你应该确定这才是真正需要做的事情。但是,您可能比这里的任何人都更了解您自己的代码。

Personally I would try and avoid changing parsing. IF there are other bottlenecks then those might be worth investigating first.

我个人会尝试避免更改解析。如果还有其他瓶颈,那么首先可能值得调查。

#6


If your tests are verifiable and you truly need the performance gains (e.g. you call the function like tens of thousands of times a second) than go for it.

如果您的测试是可验证的,并且您确实需要性能提升(例如,您将该函数调用为每秒数万次),而不是它。

I would just change the name... because ParseInt32Simply does not tell the maintenance programmer anything. I think a name like TrustedSourceInt32Parse or GuaranteedInt32Parse or something along those lines is a better name.

我只想更改名称...因为ParseInt32Simply并没有告诉维护程序员什么。我认为像TrustedSourceInt32Parse或GuaranteedInt32Parse之类的名称或其他类似的名称是更好的名称。

#7


I think the main problem here is your sentence assumes correct input. From reading your code, it doesn't appear to handle "12x" properly.

我认为这里的主要问题是你的句子假定正确输入。从阅读代码开始,它似乎没有正确处理“12x”。

There's lots of things that Int32.Parse does to validate the input, and might even take note of your culture to handle some culture-differences, although I can't think of any specifically for Int32.

Int32.Parse有很多东西可以验证输入,甚至可以记录你的文化来处理一些文化差异,虽然我不能想到任何专门用于Int32的东西。

You're sure that the bottleneck is Int32 in your code?

您确定代码中的瓶颈是Int32吗?

#8


How do you measure the speed? I tried this:

你如何衡量速度?我试过这个:

Stopwatch sw = new Stopwatch();
Random rand = new Random();

for (int n = 0; n < 10; n++)
{
    sw.Start();
    for (int i = 0; i < 1000000; i++)
    {
        ParseInt32Simply(rand.Next().ToString());
    }
    sw.Stop();
    Console.WriteLine(sw.Elapsed.Ticks + " - ParseInt32Simply");
    sw.Reset();

    sw.Start();
    for (int i = 0; i < 1000000; i++)
    {
        int.Parse(rand.Next().ToString());
    }
    sw.Stop();
    Console.WriteLine(sw.Elapsed.Ticks + " - int.Parse");
    sw.Reset();
    Console.WriteLine();
}

and the results are quite different:

2932852 - ParseInt32Simply
4684522 - int.Parse

结果完全不同:2932852 - ParseInt32Simply 4684522 - int.Parse

3003988 - ParseInt32Simply
4666928 - int.Parse

3003988 - ParseInt32Simply 4666928 - int.Parse

2892545 - ParseInt32Simply
4660209 - int.Parse

2892545 - ParseInt32Simply 4660209 - int.Parse

2888998 - ParseInt32Simply
4636007 - int.Parse

2888998 - ParseInt32Simply 4636007 - int.Parse

2955727 - ParseInt32Simply
4668501 - int.Parse

2955727 - ParseInt32Simply 4668501 - int.Parse

2929210 - ParseInt32Simply
4653799 - int.Parse

2929210 - ParseInt32Simply 4653799 - int.Parse

2893706 - ParseInt32Simply
4671503 - int.Parse

2893706 - ParseInt32Simply 4671503 - int.Parse

2899547 - ParseInt32Simply
4633957 - int.Parse

Your simple method is still faster, but less than 2x (that is very good performance actually!).

2899547 - ParseInt32Simply 4633957 - int.Parse您的简单方法仍然更快,但不到2倍(实际上这是非常好的表现!)。

#9


Take a look at this blog entry: Fast string to integer conversion by Karl Seguin.

看一下这篇博客文章:Karl Seguin的快速字符串到整数转换。

#10


The validation for null and empty string is not enough, you should check if parameter is valid integer.

null和空字符串的验证是不够的,你应该检查参数是否是有效的整数。

#11


how does your test look like? It seems your test is not ok.

你的测试怎么样?看来你的测试还不行。

I only have a small difference when I loop 50000 times and then I have a difference of about 30K ticks in favour of your custom method, but that is neglectable for the advantages of the normal method

当我循环50000次时,我只有一个小的差异,然后我有大约30K的差异,而不是你的自定义方法,但这对于普通方法的优点是可忽略的

#1


If your parsing a format of which you know to be valid numbers, you can indeed write a faster custom parser. I've written a Double.Parse function for the same purpose once. And it faster to begin with the least significant digit. That way you can just increment the power of the digit your parsing.

如果您解析的格式是您知道的有效数字,那么您确实可以编写更快的自定义解析器。我曾为同一目的写过一个Double.Parse函数。从最低有效数字开始更快。这样你就可以增加解析数字的功效。

I've created a quick implementation of this,

我已经创建了一个快速实现的,

public static Int32 ParseValidNumberAsInt32(string str)
{
    if (str == null) 
        throw new ArgumentNullException("str");
    if (str.Length == 0) 
        throw new ArgumentException("str is empty");
    Int32 result = 0;
    Int32 currentPower = 1;
    Boolean isNegative = str[0] == '-';

    for (int currentCharIndex = str.Length - 1; currentCharIndex > 0; currentCharIndex--)
    {
        result += (str[currentCharIndex] - '0') * currentPower;
        currentPower *= 10;
    }
    return isNegative ? -1 * result : result + ((str[0] - '0') * currentPower);
}

If you really want speed, you can write a unsafe implementation..

如果你真的想要速度,你可以编写一个不安全的实现..

If your parsing a big file, you could read the files as raw bytes and work with those. That will make it a lot faster (no converting to unicode string, no splitting the strings in lines, no splitting the lines in substrings, no parsing the substrings), but you're going to lose maintainability.

如果您解析一个大文件,您可以将文件作为原始字节读取并使用它们。这将使它快得多(不转换为unicode字符串,不在行中拆分字符串,不在子字符串中拆分行,不解析子字符串),但是你将失去可维护性。

#2


Have you profiled your code yet to determine that ParseInt32 is actually the bottleneck? I wouldn't replace something that's part of the "standard library" of the environment you're coding in unless you know for certain that you're going to see a benefit.

您是否已经分析了您的代码以确定ParseInt32实际上是瓶颈?除非你肯定知道你会看到一个好处,否则我不会替换你正在编码的环境的“标准库”中的一部分。

#3


In .net Int32.Parse is very very quick, when it's successful.

在.net Int32.Parse非常快,当它成功。

When it fails it throws an exception - then it's very slow because exceptions are slow.

当它失败时会抛出异常 - 然后它非常慢,因为异常很慢。

You need to expand your test - you need to check times for a pattern of good and bad strings that's similar to what you need it to do.

您需要扩展测试 - 您需要检查时间是否与您需要的类似的好的和坏的字符串模式。

If you are pretty sure that all your strings are valid ints then Int32.Parse is the way to go. If you suspect that any more than a negligible few are going to be valid then it is much quicker to use Int32.TryParse, rather than a try-catch in your loop.

如果您非常确定所有字符串都是有效的int,那么Int32.Parse就是您的选择。如果你怀疑任何可以忽略不计的几个都是有效的,那么使用Int32.TryParse要快得多,而不是循环中的try-catch。

Typically if your try-catch is outside the loop use Int32.Parse - you'll get an exception and stop the first time you get an invalid value.

通常,如果您的try-catch在循环外部使用Int32.Parse - 您将获得异常并在第一次获得无效值时停止。

If your try-catch is inside the loop use Int32.TryParse instead.

如果你的try-catch在循环中,请使用Int32.TryParse。

Both Int32.Parse and Int32.TryParse are pretty highly optimised and are relatively mature - I'd expect them to be very tough to improve on unless you have specialist circumstances.

Int32.Parse和Int32.TryParse都经过了相当高度的优化并且相对成熟 - 除非你有专业的情况,否则我认为它们很难改进。

#4


My view would be that if the time savings you are getting are significant and of benefit for your application, then go for it.

我的观点是,如果节省的时间对您的应用程序而言非常重要且有益,那么请继续使用。

We had a vaguely similar issue with XML parsing and opted to do it manually for performance reasons, but it was based on a known environment - we were feeding the XML, so we could fairly safely take shortcuts in the parsing.

我们在XML解析方面有一个模糊的类似问题,并且出于性能原因选择手动执行,但它基于一个已知的环境 - 我们正在提供XML,因此我们可以相当安全地在解析中使用快捷方式。

Obviously the risk is that it is not likely to be complete as the standard library version and so new developers to the team will need to be made aware of this, lest they do something to break it.

显然风险在于它不太可能像标准库版本那样完整,因此需要让团队的新开发人员意识到这一点,以免他们做些什么来打破它。

#5


Yes - you can use your own version of parsing int as long that you are 100% certain that the source data is something you have control over (and thus always conforms to your format of Int32). Also, you should use your own code isolated from the rest of the world because if you've got this in some library you're publishing, people might want to have the standard behaviour of Int32.Parse. If you can't provide it, it's no good for them. However, as many people here suggests, you should really be certain that this is what really needs doing if you're trying to squeeze out the most of your performance. However, you probably know your own code better than anyone here.

是的 - 只要您100%确定源数据是您可以控制的内容(因此始终符合您的Int32格式),您就可以使用自己的解析int版本。此外,您应该使用自己的代码来隔离世界其他地方,因为如果您在某个库中发布了这个代码,那么人们可能希望拥有Int32.Parse的标准行为。如果你不能提供,那对他们没有好处。然而,正如这里的许多人所暗示的那样,如果你试图挤出你的大部分表现,你应该确定这才是真正需要做的事情。但是,您可能比这里的任何人都更了解您自己的代码。

Personally I would try and avoid changing parsing. IF there are other bottlenecks then those might be worth investigating first.

我个人会尝试避免更改解析。如果还有其他瓶颈,那么首先可能值得调查。

#6


If your tests are verifiable and you truly need the performance gains (e.g. you call the function like tens of thousands of times a second) than go for it.

如果您的测试是可验证的,并且您确实需要性能提升(例如,您将该函数调用为每秒数万次),而不是它。

I would just change the name... because ParseInt32Simply does not tell the maintenance programmer anything. I think a name like TrustedSourceInt32Parse or GuaranteedInt32Parse or something along those lines is a better name.

我只想更改名称...因为ParseInt32Simply并没有告诉维护程序员什么。我认为像TrustedSourceInt32Parse或GuaranteedInt32Parse之类的名称或其他类似的名称是更好的名称。

#7


I think the main problem here is your sentence assumes correct input. From reading your code, it doesn't appear to handle "12x" properly.

我认为这里的主要问题是你的句子假定正确输入。从阅读代码开始,它似乎没有正确处理“12x”。

There's lots of things that Int32.Parse does to validate the input, and might even take note of your culture to handle some culture-differences, although I can't think of any specifically for Int32.

Int32.Parse有很多东西可以验证输入,甚至可以记录你的文化来处理一些文化差异,虽然我不能想到任何专门用于Int32的东西。

You're sure that the bottleneck is Int32 in your code?

您确定代码中的瓶颈是Int32吗?

#8


How do you measure the speed? I tried this:

你如何衡量速度?我试过这个:

Stopwatch sw = new Stopwatch();
Random rand = new Random();

for (int n = 0; n < 10; n++)
{
    sw.Start();
    for (int i = 0; i < 1000000; i++)
    {
        ParseInt32Simply(rand.Next().ToString());
    }
    sw.Stop();
    Console.WriteLine(sw.Elapsed.Ticks + " - ParseInt32Simply");
    sw.Reset();

    sw.Start();
    for (int i = 0; i < 1000000; i++)
    {
        int.Parse(rand.Next().ToString());
    }
    sw.Stop();
    Console.WriteLine(sw.Elapsed.Ticks + " - int.Parse");
    sw.Reset();
    Console.WriteLine();
}

and the results are quite different:

2932852 - ParseInt32Simply
4684522 - int.Parse

结果完全不同:2932852 - ParseInt32Simply 4684522 - int.Parse

3003988 - ParseInt32Simply
4666928 - int.Parse

3003988 - ParseInt32Simply 4666928 - int.Parse

2892545 - ParseInt32Simply
4660209 - int.Parse

2892545 - ParseInt32Simply 4660209 - int.Parse

2888998 - ParseInt32Simply
4636007 - int.Parse

2888998 - ParseInt32Simply 4636007 - int.Parse

2955727 - ParseInt32Simply
4668501 - int.Parse

2955727 - ParseInt32Simply 4668501 - int.Parse

2929210 - ParseInt32Simply
4653799 - int.Parse

2929210 - ParseInt32Simply 4653799 - int.Parse

2893706 - ParseInt32Simply
4671503 - int.Parse

2893706 - ParseInt32Simply 4671503 - int.Parse

2899547 - ParseInt32Simply
4633957 - int.Parse

Your simple method is still faster, but less than 2x (that is very good performance actually!).

2899547 - ParseInt32Simply 4633957 - int.Parse您的简单方法仍然更快,但不到2倍(实际上这是非常好的表现!)。

#9


Take a look at this blog entry: Fast string to integer conversion by Karl Seguin.

看一下这篇博客文章:Karl Seguin的快速字符串到整数转换。

#10


The validation for null and empty string is not enough, you should check if parameter is valid integer.

null和空字符串的验证是不够的,你应该检查参数是否是有效的整数。

#11


how does your test look like? It seems your test is not ok.

你的测试怎么样?看来你的测试还不行。

I only have a small difference when I loop 50000 times and then I have a difference of about 30K ticks in favour of your custom method, but that is neglectable for the advantages of the normal method

当我循环50000次时,我只有一个小的差异,然后我有大约30K的差异,而不是你的自定义方法,但这对于普通方法的优点是可忽略的