Fastest way to determine whether a string contains a real or an integer value

Time: 2021-02-27 01:34:40

I'm trying to write a function that is able to determine whether a string contains a real or an integer value.

This is the simplest solution I could think of:

int containsStringAnInt(char* strg){
  for (int i =0; i < strlen(strg); i++) {if (strg[i]=='.') return 0;}
  return 1;
}

But this solution is really slow when the string is long... Any optimization suggestions? Any help would really be appreciated!

8 answers

#1


You are using strlen, which means you are not worried about Unicode. In that case, why use strlen or strchr at all? Just check for '\0' (the null character).

int containsStringAnInt(char* strg){
  for (int i = 0; strg[i] != '\0'; i++) {
    if (strg[i] == '.') return 0;
  }
  return 1;
}

This makes only one pass through the string, rather than a pass through the string on each iteration of the loop.

#2


What's the syntax of your real numbers?

1e-6 is a valid C++ literal, but your test will classify it as an integer.
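
To make that concrete, here is a minimal sketch comparing the dot test with what strtod makes of the same text. The helper names no_dot_means_int and parses_as_double are made up for this illustration, and the default "C" locale is assumed.

```c
#include <assert.h>
#include <stdlib.h>

/* The question's test: "integer" simply means "contains no '.'". */
int no_dot_means_int(const char *s) {
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (*s == '.') return 0;
    }
    return 1;
}

/* Returns 1 and stores the value in *out if strtod consumes all of s. */
int parses_as_double(const char *s, double *out) {
    char *end;
    assert(s != NULL && out != NULL);
    *out = strtod(s, &end);
    return end != s && *end == '\0';
}
```

With these, no_dot_means_int("1e-6") returns 1, even though strtod parses the same text as a value strictly between 0 and 1.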

#3


Is your string hundreds of characters long? If not, don't worry about possible performance issues. The only inefficiency is that you call strlen() on every loop iteration, which means a lot of passes over the string (inside strlen). For a simpler solution with O(n) time complexity, and probably slightly faster, use strchr().
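
A minimal sketch of the strchr() variant, keeping the question's contract (1 if there is no '.', 0 otherwise). The function name is hypothetical, and the parameter is made const here as an assumption about how it would be called.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if strg contains no '.', 0 otherwise.
   strchr stops at the first match or at the terminating '\0',
   so the string is scanned at most once. */
int containsStringAnIntStrchr(const char *strg) {
    assert(strg != NULL);
    return strchr(strg, '.') == NULL;
}
```

Like the original, this only looks for a dot, so it inherits the same blind spots (exponent notation, non-numeric text).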

#4


Your function does not take into account the exponential notation of reals (1E7 and 1E-7 are both doubles).

Use strtol() to try to convert the string to an integer first; through its end-pointer argument it also reports the first position in the string where parsing stopped (this will be '.' if the number is real). If the parsing stopped at '.', use strtod() to try to convert to double. Again, the function reports the position in the string where the parsing stopped.

Don't worry about performance until you have profiled the program. If it does turn out to matter, then for the fastest possible code, construct a regular expression that describes the acceptable syntax of numbers, and hand-convert it first into an FSM, then into highly optimized code.

#5


So, the standard note first: please don't worry too much about performance if you haven't profiled yet :)

I'm not sure about the manual loop that checks for a dot. Two issues:

  • Depending on the locale, the decimal separator can actually be a "," too (here in Germany that's the case :)
  • As others noted, there is the issue with numbers like 1e7

Previously I had a version using sscanf here. But measuring performance showed that sscanf is significantly slower for bigger data sets. So I'll show the faster solution first (well, it's also a whole lot simpler: I had several bugs in the sscanf version until I got it working, while the strto[ld] version worked on the first try):

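#include <stdlib.h>

enum {
    REAL,
    INTEGER,
    NEITHER_NOR
};

int what(char const* strg){
    char *endp;
    /* Integer if strtol consumes the whole (non-empty) string. */
    strtol(strg, &endp, 10);
    if(*strg && !*endp)
        return INTEGER;
    /* Otherwise real if strtod consumes the whole string. */
    strtod(strg, &endp);
    if(*strg && !*endp)
        return REAL;
    return NEITHER_NOR;
}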


Just for fun, here is the version using sscanf:

#include <cstdio>

int what(char const* strg) {
    // test for int
    { 
        int d;     // converted value
        int n = 0; // number of chars read
        int rd = std::sscanf(strg, "%d %n", &d, &n);
        if(!strg[n] && rd == 1) {
            return INTEGER;
        }
    }
    // test for double
    { 
        double v;     // converted value
        int n = 0; // number of chars read
        int rd = std::sscanf(strg, "%lf %n", &v, &n);
        if(!strg[n] && rd == 1) {
            return REAL;
        }
    }
    return NEITHER_NOR;
}

I think that should work. Have fun.

The test converted randomly chosen (short) test strings 10000000 times in a loop:

  • 6.6s for sscanf
  • 1.7s for strto[dl]
  • 0.5s for manual looping until "."

A clear win for strto[ld] over sscanf. And since it parses numbers correctly, I'd call it the winner over manual looping too. Anyway, a difference of 1.2s / 10000000 = 0.00000012s (about 120 ns) per conversion isn't all that much in the end.
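
The harness behind these numbers is not shown. Purely as an assumption about how such a measurement could be set up, a sketch using clock() might look like the following; the names time_classifier and is_integer_strtol are hypothetical, and actual timings will differ by machine and compiler.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Classifier under test: 1 if s parses completely as a base-10 integer. */
int is_integer_strtol(const char *s) {
    char *end;
    assert(s != NULL);
    strtol(s, &end, 10);
    return *s != '\0' && *end == '\0';
}

/* Calls classify on randomly chosen samples `iterations` times and
   returns the elapsed CPU time in seconds. The number of positive
   results is stored in *hits so the calls cannot be optimized away. */
double time_classifier(int (*classify)(const char *),
                       const char *const *samples, size_t nsamples,
                       long iterations, long *hits) {
    assert(classify != NULL && samples != NULL && nsamples > 0 && hits != NULL);
    long count = 0;
    clock_t start = clock();
    for (long i = 0; i < iterations; i++) {
        count += classify(samples[(size_t)rand() % nsamples]);
    }
    *hits = count;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Passing each candidate function to time_classifier with the same sample set and iteration count gives figures that can be compared against each other, which matters more than the absolute values.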

#6


strlen walks the string to find its length.

You are calling strlen on every pass of the loop, so you are walking the string many more times than necessary. This tiny change should give you a huge performance improvement:

#include <string.h>

int containsStringAnInt(char* strg){
  size_t len = strlen(strg);  /* computed once, not on every iteration */
  for (size_t i = 0; i < len; i++) {
    if (strg[i] == '.') return 0;
  }
  return 1;
}

Note that all I did was find the length of the string once, at the start of the function, and refer to that value repeatedly in the loop.

Please let us know what kind of performance improvement this gets you.

#7


@Aaron, your way also traverses the string twice: once within strlen, and once again in the for loop. The best way to traverse an ASCII string in a for loop is to check for the null character in the loop itself. Have a look at my answer: it passes over the string only once, within the for loop, and may stop early if it finds a '.' before the end. That way, if a string is like 0.01xxx (another 100 chars), you need not go to the end to find the length.

#8


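#include <stdlib.h>

/* Returns 1 when the parsed value has no fractional part.
   Caveats: "1.0" counts as an integer, non-numeric text such as "abc"
   parses as 0 in both calls and so also returns 1, and atoi's behavior
   is undefined when the value does not fit in an int. */
int containsStringAnInt(char* strg){
    if (atof(strg) == atoi(strg))
        return 1;
    return 0;
}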
