I'm working on some text conversion routines in Ruby that parse time values in different formats. They are growing in complexity, and I'm currently testing a better approach to this problem.
I'm currently testing a way to use scanf. Why? I always thought that was faster than a regex, but what happened in Ruby? It was much slower!
What am I doing wrong?
Note: I'm using ruby-1.9.2-p290 [ x86_64 ] (MRI)
First Ruby test:
require "scanf"
require "benchmark"

# Validate the format with a regex, then split on ":" and convert each field.
def duration_in_seconds_regex(duration)
  if duration =~ /^\d{2,}:\d{2}:\d{2}$/
    h, m, s = duration.split(":").map { |n| n.to_i }
    h * 3600 + m * 60 + s
  end
end

# Parse the three integer fields with String#scanf.
def duration_in_seconds_scanf(duration)
  a = duration.scanf("%d:%d:%d")
  a[0] * 3600 + a[1] * 60 + a[2]
end

n = 500000

Benchmark.bm do |x|
  x.report { for i in 1..n; duration_in_seconds_scanf("00:10:30"); end }
end

Benchmark.bm do |x|
  x.report { for i in 1..n; duration_in_seconds_regex("00:10:30"); end }
end
This is what I got using scanf first and a regex second:
      user     system      total        real
 95.020000   0.280000  95.300000 ( 96.364077)
      user     system      total        real
  2.820000   0.000000   2.820000 (  2.835170)
Second test using C:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <sys/types.h>
#include <regex.h>

/* Return a malloc'd copy of the first match of patrn in string, or NULL.
 * The pattern is compiled on every call, so that cost is part of the timing. */
char *regexp(const char *string, const char *patrn, int *begin, int *end) {
    int len;
    char *word = NULL;
    regex_t rgT;
    regmatch_t match;

    if (regcomp(&rgT, patrn, REG_EXTENDED) != 0) {
        return NULL;
    }
    if (regexec(&rgT, string, 1, &match, 0) == 0) {
        *begin = (int) match.rm_so;
        *end = (int) match.rm_eo;
        len = *end - *begin;
        word = malloc(len + 1);
        if (word != NULL) {
            memcpy(word, string + *begin, len);
            word[len] = '\0';
        }
    }
    regfree(&rgT);
    return word;
}

int main(void) {
    const char *str = "00:01:30";
    int h, m, s;
    int i, b, e;
    double start_time, end_time, time_elapsed;
    char delims[] = ":";

    start_time = (double) clock() / CLOCKS_PER_SEC;
    for (i = 0; i < 500000; i++) {
        if (sscanf(str, "%d:%d:%d", &h, &m, &s) == 3) {
            s = h * 3600 + m * 60 + s;
        }
    }
    end_time = (double) clock() / CLOCKS_PER_SEC;
    time_elapsed = end_time - start_time;
    printf("sscanf_time (500k iterations): %.4f", time_elapsed);

    start_time = (double) clock() / CLOCKS_PER_SEC;
    for (i = 0; i < 500000; i++) {
        char *match = regexp(str, "[0-9]{2,}:[0-9]{2}:[0-9]{2}", &b, &e);
        /* The match must cover the whole string; regexp() may return NULL. */
        if (match != NULL && strcmp(match, str) == 0) {
            /* sizeof(str) is only the pointer size, so use strlen + 1. */
            char *str2 = malloc(strlen(str) + 1);
            if (str2 != NULL) {
                strcpy(str2, str);
                /* strtok returns char *, so convert each field with atoi. */
                h = atoi(strtok(str2, delims));
                m = atoi(strtok(NULL, delims));
                s = atoi(strtok(NULL, delims));
                s = h * 3600 + m * 60 + s;
                free(str2);
            }
        }
        free(match);
    }
    end_time = (double) clock() / CLOCKS_PER_SEC;
    time_elapsed = end_time - start_time;
    printf("\n\nregex_time (500k iterations): %.4f", time_elapsed);

    return EXIT_SUCCESS;
}
The C code is obviously faster, and, as expected, the regex version is slower than the sscanf version:
sscanf_time (500k iterations): 0.1774
regex_time (500k iterations): 3.9692
It is obvious that C runs faster, so please don't comment that Ruby is interpreted and things like that.
This is the related gist.
2 answers
#1
4
The problem is not that it's interpreted, but that everything in Ruby is an object. You can explore "scanf.rb" in your Ruby distribution and compare it to the scanf implementation in C.
The Ruby implementation of scanf is based on Regexp matching. Every atom like "%d" is an object in Ruby, while it's only one case item in C. So, to my mind, the reason for such an execution time is the large amount of object allocation and deallocation.
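The allocation argument can be made concrete with a toy sketch. The helper `toy_scanf` below is hypothetical and far simpler than the real scanf.rb; it only illustrates the general idea of turning a format string into a Regexp on every call. `count_allocations` is also an illustrative helper, and it assumes Ruby 2.2 or later, where GC.stat accepts the :total_allocated_objects key.

```ruby
# Toy illustration only: this is NOT the real scanf.rb, which supports many
# more directives. `toy_scanf` is a hypothetical helper that handles "%d" only.
def toy_scanf(input, format)
  # Each "%d" becomes a capture group; everything else is matched literally.
  pattern = format.split(/(%d)/).map { |part|
    part == "%d" ? "([-+]?\\d+)" : Regexp.escape(part)
  }.join
  match = Regexp.new("\\A" + pattern).match(input)
  match ? match.captures.map { |c| c.to_i } : []
end

# Hypothetical helper: number of objects allocated while the block runs.
# Assumes Ruby 2.2+ for the :total_allocated_objects key.
def count_allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

h, m, s = toy_scanf("00:10:30", "%d:%d:%d")
puts h * 3600 + m * 60 + s

allocated = count_allocations { toy_scanf("00:10:30", "%d:%d:%d") }
puts "objects allocated by one toy_scanf call: #{allocated}"
```

Even this reduced version builds several strings, an array, and a new Regexp on each call. The exact count depends on the Ruby version, so it is worth measuring rather than assuming.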
#2
2
Assuming MRI: scanf is written in Ruby (scanf.rb), apparently 10 years ago and never touched since (and it does look complex!). split, map, and regexes are implemented in heavily optimized C.
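Following that reasoning, one option is to stay entirely within the built-ins implemented in C. The sketch below is an assumption about how that could look, not something taken from the question: it uses a single anchored regex with capture groups, so validation and field extraction happen in one match. The names `DURATION_PATTERN` and `duration_in_seconds` are made up for illustration.

```ruby
# Hypothetical constant: \A and \z anchor the whole string, whereas
# ^ and $ would only anchor a single line within it.
DURATION_PATTERN = /\A(\d{2,}):(\d{2}):(\d{2})\z/

# Returns the duration in seconds, or nil if the input is not "HH:MM:SS".
def duration_in_seconds(duration)
  m = DURATION_PATTERN.match(duration)
  return nil unless m
  m[1].to_i * 3600 + m[2].to_i * 60 + m[3].to_i
end

puts duration_in_seconds("00:10:30")
p duration_in_seconds("10:30")
```

Whether this is faster than the split-and-map version would need to be checked with Benchmark on the Ruby version in use.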