多个子进程+从流中读取

时间:2021-08-20 14:21:26

referring to my last question (Multiple child process), i am now trying to make an external sorting implementation using multiple child process.

参考我的上一个问题(多子进程),我现在尝试使用多个子进程进行外部排序实现。

...
fp = fopen(pathname, "r"); // open inputfile in r mode
fgets(trash, 10, fp); // ignore first line

for (i=0; i<numberOfProcess; ++i) {
    #ifdef DBG
        fprintf(stderr, "\nDBG: Calling fork()\n"); 
    #endif

    if ((pids[i] = fork()) < 0) {
        perror("fork error");
        exit(EXIT_FAILURE);

    } else if (pids[i] == 0) { // Child Code

        if (numbersToSort % numberOfProcess == 0) { // 16 % 4 = 0
            partialDataSize = numbersToSort / numberOfProcess;          

            for (j=0; j<partialDataSize; j++) { 
                fscanf(fp, "%d", &arrayPartialData[j]);
                qsort(arrayPartialData, partialDataSize, sizeof(int), (void *)comp_num);

                //printf("%d\n", arrayPartialData[j]);
                // TODO: qsort data until partialDataSize
            }

        } 
        printf("pid: %d child process %d outputs: ", getpid(), pids[i]);
        printArray(arrayPartialData, partialDataSize);
        //break;
        exit(0);
    }  
}   

/* Wait for children to exit. */

while (numberOfProcess > 0) {
    pid = wait(&status);
    --numberOfProcess;
}

fclose(fp);

but of course this code outputs the same sequence of sorted integers from inputfile because of fscanf.. for example if the beginning of input file includes 5 1 4, then it outputs:

但是当然这个代码从输入文件输出相同的排序整数序列,因为fscanf ..例如,如果输入文件的开头包含5 1 4,那么它输出:

(1st child) 1 4 5
(2nd child) 1 4 5

(第1个孩子)1 4 5(第2个孩子)1 4 5

(with two child process).. because fscanf starts to read integers from the beginning of input stream.

(有两个子进程)..因为fscanf从输入流的开头开始读取整数。

my problem now is how can i continue to read the numbers starting from the point where the previous child process left? for example, if input file includes 5 1 4 8 5 10, then it can output:

我现在的问题是如何从上一个子进程离开的点开始继续读取数字?例如,如果输入文件包含5 1 4 8 5 10,那么它可以输出:

(1st child) 1 4 5

(第1个孩子)1 4 5

(2nd child) 5 8 10

(第2个孩子)5 8 10

thanks in advance;)

提前致谢;)

4 个解决方案

#1


I'd use the lower level open() and read() rather than the streams equivalent as otherwise you'll have to worry about synchronizing the stdio buffers with the underlying file descriptor. Note you'll still have issues reading complete numbers, so you'll probably need some sync between the processes.

我使用较低级别的open()和read()而不是等效的流,否则你将不得不担心将stdio缓冲区与底层文件描述符同步。请注意,读取完整数字时仍然存在问题,因此您可能需要在这些进程之间进行一些同步。

As an alternative I would suggest a single process to read the file and write a subset of the lines to subprocesses that do the sorting (using pipe()), which they then write to another process doing the merge.

作为替代方案,我建议单个进程读取文件并将行的子集写入进行排序的子进程(使用pipe()),然后将它们写入另一个进行合并的进程。

#2


If you're using fscanf, the only thing you can do is have each process read and discard numbers until it gets to those that it should work on. In your case discard i*partialdatasize numbers.

如果您正在使用fscanf,那么您唯一能做的就是让每个进程都读取并丢弃数字,直到找到它应该处理的数字。在你的情况下丢弃i * partialdatasize数字。

So e.g. 5 7 3 1 4 8 5 10 2 you might have 5 7 3

所以例如5 7 3 1 4 8 5 10 2你可能有5 7 3

1 4 8

1 4 8

5 10 2

5 10 2

which would sort to give

哪种可以给予

3 5 7

3 5 7

1 4 8

1 4 8

2 5 10.

2 5 10。

Then you have to work out how to merge the sorted results.

然后你必须弄清楚如何合并排序的结果。

#3


If you can store your integers as binary. You can have the first thread read it's block

如果您可以将整数存储为二进制。你可以让第一个线程读取它的块

fread(&arrayPartialData[j], sizeof(int), partialDataSize, fp);

Than the 2nd thread can skip the block which has already been read (Because you know the size of each block). Then you can begin reading from there, without needing to discard any data.

比第二个线程可以跳过已经读取的块(因为你知道每个块的大小)。然后你可以从那里开始阅读,而不需要丢弃任何数据。

fseek(partialDataSize * threadNumber);

I also recommend you use threads, as forking is very expensive. threads tutorial

我还建议你使用线程,因为分叉是非常昂贵的。线程教程

#4


You were working with linked channels.

您正在使用关联渠道。

from glibc 13.5.1 (emphasis is mine)

来自glibc 13.5.1(重点是我的)

Channels that come from a single opening share the same file position; we call them linked channels. Linked channels result when you make a stream from a descriptor using fdopen, when you get a descriptor from a stream with fileno, when you copy a descriptor with dup or dup2, and when descriptors are inherited during fork.

来自单个开口的频道共享相同的文件位置;我们称之为链接渠道。当您使用fdopen从描述符创建流时,使用fileno从流中获取描述符,使用dup或dup2复制描述符,以及在fork期间继承描述符时,会生成链接通道。

Apparently, you can not do I/O from both the streams simultaneously.

显然,您无法同时从两个流中执行I / O.

If you have been using a stream for I/O (or have just opened the stream), and you want to do I/O using another channel (either a stream or a descriptor) that is linked to it, you must first clean up the stream that you have been using.

如果您一直在使用I / O流(或者刚刚打开了流),并且您希望使用链接到它的另一个通道(流或描述符)进行I / O,则必须先清理你一直在使用的流。

#1


I'd use the lower level open() and read() rather than the streams equivalent as otherwise you'll have to worry about synchronizing the stdio buffers with the underlying file descriptor. Note you'll still have issues reading complete numbers, so you'll probably need some sync between the processes.

我使用较低级别的open()和read()而不是等效的流,否则你将不得不担心将stdio缓冲区与底层文件描述符同步。请注意,读取完整数字时仍然存在问题,因此您可能需要在这些进程之间进行一些同步。

As an alternative I would suggest a single process to read the file and write a subset of the lines to subprocesses that do the sorting (using pipe()), which they then write to another process doing the merge.

作为替代方案,我建议单个进程读取文件并将行的子集写入进行排序的子进程(使用pipe()),然后将它们写入另一个进行合并的进程。

#2


If you're using fscanf, the only thing you can do is have each process read and discard numbers until it gets to those that it should work on. In your case discard i*partialdatasize numbers.

如果您正在使用fscanf,那么您唯一能做的就是让每个进程都读取并丢弃数字,直到找到它应该处理的数字。在你的情况下丢弃i * partialdatasize数字。

So e.g. 5 7 3 1 4 8 5 10 2 you might have 5 7 3

所以例如5 7 3 1 4 8 5 10 2你可能有5 7 3

1 4 8

1 4 8

5 10 2

5 10 2

which would sort to give

哪种可以给予

3 5 7

3 5 7

1 4 8

1 4 8

2 5 10.

2 5 10。

Then you have to work out how to merge the sorted results.

然后你必须弄清楚如何合并排序的结果。

#3


If you can store your integers as binary. You can have the first thread read it's block

如果您可以将整数存储为二进制。你可以让第一个线程读取它的块

fread(&arrayPartialData[j], sizeof(int), partialDataSize, fp);

Than the 2nd thread can skip the block which has already been read (Because you know the size of each block). Then you can begin reading from there, without needing to discard any data.

比第二个线程可以跳过已经读取的块(因为你知道每个块的大小)。然后你可以从那里开始阅读,而不需要丢弃任何数据。

fseek(partialDataSize * threadNumber);

I also recommend you use threads, as forking is very expensive. threads tutorial

我还建议你使用线程,因为分叉是非常昂贵的。线程教程

#4


You were working with linked channels.

您正在使用关联渠道。

from glibc 13.5.1 (emphasis is mine)

来自glibc 13.5.1(重点是我的)

Channels that come from a single opening share the same file position; we call them linked channels. Linked channels result when you make a stream from a descriptor using fdopen, when you get a descriptor from a stream with fileno, when you copy a descriptor with dup or dup2, and when descriptors are inherited during fork.

来自单个开口的频道共享相同的文件位置;我们称之为链接渠道。当您使用fdopen从描述符创建流时,使用fileno从流中获取描述符,使用dup或dup2复制描述符,以及在fork期间继承描述符时,会生成链接通道。

Apparently, you can not do I/O from both the streams simultaneously.

显然,您无法同时从两个流中执行I / O.

If you have been using a stream for I/O (or have just opened the stream), and you want to do I/O using another channel (either a stream or a descriptor) that is linked to it, you must first clean up the stream that you have been using.

如果您一直在使用I / O流(或者刚刚打开了流),并且您希望使用链接到它的另一个通道(流或描述符)进行I / O,则必须先清理你一直在使用的流。