Fast I/O on stdin/stdout in C

Time: 2020-12-21 07:31:18

In a coding competition specified at this link there is a task where you need to read a lot of data from stdin, do some calculations, and write a whole lot of data to stdout.

In my benchmarking it is almost only the I/O that takes time, although I have tried to optimize it as much as possible.

The input is a string (1 <= len <= 100'000) and q rows of pairs of ints, where q is also in the range 1 <= q <= 100'000.

I benchmarked my code on a 100 times larger dataset (len = 10M, q = 10M) and this is the result:

 Activity            Time (s)  Accumulated (s)

 Read text:          0.004     0.004
 Read numbers:       0.146     0.150
 Parse numbers:      0.200     0.350
 Calc answers:       0.001     0.351
 Format output:      0.037     0.388
 Print output:       0.143     0.531

By implementing my own formatting and number parsing inline, I managed to get the time down to a third of what it was when using printf and scanf.

However, when I uploaded my solution to the competition's webpage, my solution took 1.88 seconds (I think that is the total time over 22 datasets). When I look at the high scores there are several implementations (in C++) that finished in 0.05 seconds, nearly 40 times faster than mine! How is that possible?

I guess that I could speed it up a bit by using 2 threads; then I could start calculating and writing to stdout while still reading from stdin. In the theoretical best case, however, that only reduces the time by min(0.150, 0.143) on my large dataset, so I'm still nowhere close to the high score.

(An image in the original post showed statistics of the consumed time.)

The program gets compiled by the website with these options:

gcc -g -O2 -std=gnu99 -static my_file.c -lm

and timed like this:

time ./a.out < sample.in > sample.out

My code looks like this:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LEN (100000 + 1)
#define ROW_LEN (6 + 1)
#define DOUBLE_ROW_LEN (2*ROW_LEN)

int main(int argc, char *argv[])
{
    int ret = 1;

    // Set custom buffers for stdin and out
    char stdout_buf[16384];
    setvbuf(stdout, stdout_buf, _IOFBF, 16384);
    char stdin_buf[16384];
    setvbuf(stdin, stdin_buf, _IOFBF, 16384);

    // Read stdin to buffer
    char *buf = malloc(MAX_LEN);
    if (!buf) {
        printf("Failed to allocate buffer");
        return 1;
    }
    if (!fgets(buf, MAX_LEN, stdin))
        goto EXIT_A;

    // Get the num tests
    int m ;
    scanf("%d\n", &m);

    char *num_buf = malloc(DOUBLE_ROW_LEN);
    if (!num_buf) {
        printf("Failed to allocate num_buffer");
        goto EXIT_A;
    }

    int *nn;
    int *start = calloc(m, sizeof(int));
    int *stop = calloc(m, sizeof(int));
    int *staptr = start; 
    int *stpptr = stop;
    char *cptr;
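    // Parse each row: digits accumulate into start[i] until a space is
    // seen, then into stop[i]; the loop stops at '\n' (or '\0').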
    for(int i=0; i<m; i++) {
        fgets(num_buf, DOUBLE_ROW_LEN, stdin);
        nn = staptr++;
        cptr = num_buf-1;
        while(*(++cptr) > '\n') {
            if (*cptr == ' ')
                nn = stpptr++;
            else
                *nn = *nn*10 + *cptr-'0';
        }
    }


    // Count for each test
    char *buf_end = strchr(buf, '\0');
    int len, shift;
    char outbuf[ROW_LEN];
    char *ptr_l, *ptr_r, *out;
    for(int i=0; i<m; i++) {
        ptr_l = buf + start[i];
        ptr_r = buf + stop[i];
        while(ptr_r < buf_end && *ptr_l == *ptr_r) {
            ++ptr_l;
            ++ptr_r;
        }

        // Print length of same sequence
        shift = len = (int)(ptr_l - (buf + start[i]));
        out = outbuf;
        do {
            out++;
            shift /= 10;
        } while (shift);
        *out = '\0';
        do {
            *(--out) = "0123456789"[len%10];
            len /= 10;
        } while(len);
        puts(outbuf);
    }



    ret = 0;

    free(start);
    free(stop);
EXIT_A:
    free(buf);
    return ret;
}

3 Answers

#1

Thanks to your question, I went and solved the problem myself. Your time is better than mine, but I'm still using some stdio functions.

I simply do not think the high score of 0.05 seconds is bona fide. I suspect it's the product of a highly automated system that returned that result in error, and that no one ever verified it.

How to defend that assertion? There's no real algorithmic complexity: the problem is O(n). The "trick" is to write specialized parsers for each aspect of the input (and avoid work done only in debug mode). The total time for 22 trials is 50 milliseconds, meaning each trial averages about 2.3 ms? We're down near the threshold of measurability.

Competitions like the problem you addressed yourself to are unfortunate, in a way. They reinforce the naive idea that performance is the ultimate measure of a program (there's no score for clarity). Worse, they encourage going around things like scanf "for performance" while, in real life, getting a program to run correctly and fast basically never entails avoiding or even tuning stdio. In a complex system, performance comes from things like avoiding I/O, passing over the data only once, and minimizing copies. Using the DBMS effectively is often key (as it were), but such things never show up in programming challenges.

Parsing and formatting numbers as text does take time, and in rare circumstances can be a bottleneck. But the answer is hardly ever to rewrite the parser. Rather, the answer is to parse the text into a convenient binary form, and use that. In short: compilation.

That said, a few observations may help.

You don't need dynamic memory for this problem, and it's not helping. The problem statement says the input array may be up to 100,000 elements, and the number of trials may be as many as 100,000. Each trial is two integer strings of up to 6 digits each, separated by a space and terminated by a newline: 6 + 1 + 6 + 1 = 14. The maximum total input is therefore 100,000 + 1 + 6 + 1 + 100,000 * 14 bytes: about 1.5 MB. You are allowed 1 GB of memory.

I just allocated a single buffer large enough for the maximum input, and read it all in at once with read(2). Then I made a single pass over that input.

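In outline, that approach can look something like the sketch below; the buffer size and the elided parsing are placeholders, not my exact code:

#include <stdlib.h>
#include <unistd.h>

#define BUF_MAX (2 * 1000 * 1000)   /* comfortably above the ~1.5 MB worst case */

static char in[BUF_MAX];

int main(void)
{
    /* Read all of stdin with read(2); loop because read may return short counts. */
    size_t len = 0;
    ssize_t n;
    while ((n = read(0, in + len, BUF_MAX - len)) > 0)
        len += (size_t)n;
    if (n < 0)
        return 1;

    /* in[0..len) now holds the entire input; a single pointer walk over it can
       pick out the text, the count m and the number pairs. */
    return 0;
}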

You got suggestions to use asynchronous I/O and threads. The problem statement says you're measured on CPU time, so neither of those helps. The shortest distance between two points is a straight line; a single read into statically allocated memory wastes no motion.

One ridiculous aspect of the way they measure performance is that they compile with gcc -g and without -DNDEBUG. That means assert(3) is invoked in code that is measured for performance! I couldn't get under 4 seconds on test 22 until I removed my asserts.

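As an alternative to deleting the asserts, they can be compiled out by defining NDEBUG before <assert.h> is included, which has the same effect as passing -DNDEBUG on the compile line:

#define NDEBUG              /* turns assert(3) into a no-op */
#include <assert.h>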

In sum, you did pretty well, and I suspect the winner you're baffled by is a phantom. Your code does faff about a bit, and you can dispense with dynamic memory and tuning stdio. I bet your time can be trimmed by simplifying it. To the extent that performance matters, that's where I'd direct your attention.

#2

You should allocate all your buffers contiguously: allocate one buffer the size of all your buffers combined (num_buf, start, stop), then point the corresponding pointers at offsets inside it according to their sizes. This can reduce your cache misses / page faults.

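Inside main, after m has been read, a minimal sketch of that could look like this (reusing the names and sizes from the question; the layout is just one possibility):

    /* One calloc for start, stop and num_buf; the pointers are offsets into it. */
    size_t ints_sz = (size_t)m * sizeof(int);
    char *block = calloc(1, 2 * ints_sz + DOUBLE_ROW_LEN);
    if (!block)
        return 1;

    int  *start   = (int *)block;               /* m ints, zero-initialized */
    int  *stop    = (int *)(block + ints_sz);   /* m ints, zero-initialized */
    char *num_buf = block + 2 * ints_sz;        /* row buffer at the end (no alignment issue) */

    /* ... use them exactly as before ... */
    free(block);                                /* one free releases all three */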

Since the read and write operations seem to consume a lot of time, you should consider adding threads. One thread should deal with I/O and another should deal with the computation. (It is worth checking whether another thread for the printing could speed things up as well.) Make sure you don't use any locks while doing this.

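A rough sketch of that split, using one reader thread and a C11 atomic counter to publish each parsed pair so that no locks are needed (this assumes compiling with -std=c11 -pthread, and leaves the problem-specific computation as a comment):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define MAX_Q 100000

static int start[MAX_Q], stop[MAX_Q];
static atomic_int filled;            /* how many pairs have been parsed so far */
static int m;                        /* number of query rows */

static void *reader(void *arg)
{
    (void)arg;
    for (int i = 0; i < m && scanf("%d %d", &start[i], &stop[i]) == 2; i++)
        atomic_store_explicit(&filled, i + 1, memory_order_release);   /* publish pair i */
    return NULL;
}

int main(void)
{
    /* ... read the text line here, then the number of rows ... */
    if (scanf("%d", &m) != 1 || m > MAX_Q)
        return 1;

    pthread_t t;
    pthread_create(&t, NULL, reader, NULL);

    for (int i = 0; i < m; i++) {
        while (atomic_load_explicit(&filled, memory_order_acquire) <= i)
            ;                        /* busy-wait: single producer, single consumer */
        /* ... compute and print the answer for (start[i], stop[i]) here ... */
    }
    return pthread_join(t, NULL);
}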

#3

Answering this question is tricky because optimization heavily depends on the problem you have. One idea is to look at the content of the file you are trying to read and see if there are patterns or things that you can use in your favor. The code you wrote is a "general" solution for reading from a file, executing something and then writing to a file. But if the file is not randomly generated each time and the content is always the same, why not try to write a solution for that specific file?

On the other hand, you could try to use low-level system functions. One that comes to mind is mmap, which allows you to map a file directly into memory and access that memory instead of using scanf and fgets.

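A minimal sketch of the mmap idea, assuming stdin is redirected from a regular file (as the judge does with ./a.out < sample.in):

#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(void)
{
    struct stat st;
    if (fstat(0, &st) != 0 || st.st_size <= 0)
        return 1;                                   /* stdin is not a mappable file */

    char *data = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, 0, 0);
    if (data == MAP_FAILED)
        return 1;

    /* data[0 .. st.st_size) is the whole input; walk it with plain pointers
       instead of calling scanf/fgets. */
    printf("mapped %lld bytes\n", (long long)st.st_size);

    munmap(data, (size_t)st.st_size);
    return 0;
}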

Another thing I found that might help is that in your solution you have two while loops; why not try to use only one? Another thing would be to do some asynchronous I/O reading: instead of reading the whole file in one loop and then doing the calculation in another loop, you could read a portion at the beginning, start processing it asynchronously and continue reading. This link might help for the async part.
