Task with a number of child tasks

The scenario is this: I have 4 specific URLs, and each of those pages contains many links to other web pages from which I need to extract some information. I'm planning to use nested tasks for this job, that is, multiple child tasks inside one parent task, something like the code below.

        var t1Actions = new List<Action>();
        var t1 = Task.Factory.StartNew(() =>
            {
                foreach (var action in t1Actions)
                {
                    Task.Factory.StartNew(action, TaskCreationOptions.AttachedToParent);
                }
            });

        var t2Actions = new List<Action>();
        var t2 = Task.Factory.StartNew(() =>
            {
                foreach (var action in t2Actions)
                {
                    Task.Factory.StartNew(action, TaskCreationOptions.AttachedToParent);
                }
            });

        var t3Actions = new List<Action>();
        var t3 = Task.Factory.StartNew(() =>
            {
                foreach (var action in t3Actions)
                {
                    Task.Factory.StartNew(action, TaskCreationOptions.AttachedToParent);
                }
            });

        var t4Actions = new List<Action>();
        var t4 = Task.Factory.StartNew(() =>
            {
                foreach (var action in t4Actions)
                {
                    Task.Factory.StartNew(action, TaskCreationOptions.AttachedToParent);
                }
            });

        // Await the combined task; without await the returned task is simply discarded.
        await Task.WhenAll(t1, t2, t3, t4);

Here are my questions:

Is this a good way to do a job like the one described above?
Which is more efficient: replacing the child tasks with Parallel.Invoke(action), or leaving the code as it is?
How should I notify something else (the UI, for example) when a nested task completes? Do I have any control over the nested tasks?

Any advice will be helpful.

1 answer

#1

The actual problem isn't how to handle child tasks. It's how to get a list of URLs from some directory pages, retrieve those pages and process them.

This can be done easily using .NET's Dataflow library. Each step can be implemented as a block that reads one URL and produces an output.

The first block can be a TransformManyBlock that accepts one directory page URL and returns a list of page URLs.
The second block can be a TransformBlock that accepts a single page URL and returns its contents.
The third block can be an ActionBlock that accepts the page and does whatever is needed with it.

For example:

var listBlock = new TransformManyBlock<Uri,Uri>(async uri=> 
{
    var content=await httpClient.GetStringAsync(uri);
    //ProcessThePage is a placeholder for whatever extracts the page links from the HTML
    var uris=ProcessThePage(content);
    return uris;
});


var downloadBlock = new TransformBlock<Uri,(Uri,string)>(async uri=> 
{
    var content=await httpClient.GetStringAsync(uri);
    return (uri,content);
});

var processingBlock = new ActionBlock<(Uri uri,string content)>(async msg=> 
{
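    //Do something with the page, eg save it to disk.
    //PathFromUri is a placeholder for a method that maps a URI to a file path
    var path=PathFromUri(msg.uri);
    File.WriteAllText(path,msg.content);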
});

var linkOptions=new DataflowLinkOptions{PropagateCompletion=true};

listBlock.LinkTo(downloadBlock,linkOptions);    
downloadBlock.LinkTo(processingBlock,linkOptions);

Each block runs using its own Task. You can specify that a block may use more than one task, e.g. to download multiple pages concurrently.

Each block has an input and an output buffer. You can specify a limit on the input buffer to avoid flooding a block with more messages than it can process. When a block reaches its limit, the upstream blocks pause. This way you could prevent, e.g., the downloadBlock from flooding a slow processingBlock with thousands of pages.
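To make the buffer limit concrete, here is a minimal sketch of what a full block does with new messages. The class and method names (BackpressureSketch, CountAcceptedPostsAsync) are made up for illustration, and the handler is deliberately stalled on a gate so that nothing leaves the buffer while messages are being offered.

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class BackpressureSketch
{
    // Offers `offered` messages to a bounded block whose handler is stalled,
    // and returns how many of them Post() actually accepted.
    public static async Task<int> CountAcceptedPostsAsync(int capacity, int offered)
    {
        // The gate keeps every handler invocation waiting until we release it.
        var gate = new TaskCompletionSource<bool>(
            TaskCreationOptions.RunContinuationsAsynchronously);

        var slowBlock = new ActionBlock<int>(
            async _ => { await gate.Task; },
            new ExecutionDataflowBlockOptions { BoundedCapacity = capacity });

        int accepted = 0;
        for (int i = 0; i < offered; i++)
        {
            // Post() does not wait: it returns false when the block has no room.
            if (slowBlock.Post(i)) accepted++;
        }

        // Release the handler and let the block drain what it accepted.
        gate.SetResult(true);
        slowBlock.Complete();
        await slowBlock.Completion;
        return accepted;
    }
}
```

Calling BackpressureSketch.CountAcceptedPostsAsync(2, 5) should report 2 accepted messages; the other three are rejected rather than queued. Linked blocks do not lose messages this way, because the upstream block keeps them in its output buffer until the target has room. When feeding a bounded block by hand, awaiting SendAsync instead of calling Post gives the same waiting behaviour.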

Once you have a pipeline, you can post messages to the first block. When you're done, you can tell the block to Complete(). Each block in the pipeline will finish processing messages in its input buffer and propagate the completion call to the next linked block.

You can wait for all messages to finish by awaiting the last block's Completion task.

var directoryPages=new Uri[]{..};

foreach(var uri in directoryPages)
{
    listBlock.Post(uri);
}

listBlock.Complete();

await processingBlock.Completion;
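If you want to try the whole flow without touching the network, the sketch below wires the same three blocks together around a fake download. FakeFetchAsync, PipelineSketch and the linksPerDirectory parameter are stand-ins invented for this example; in real code the HttpClient calls and the link extraction would take their place.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class PipelineSketch
{
    // Hypothetical stand-in for httpClient.GetStringAsync: no network access.
    static async Task<string> FakeFetchAsync(Uri uri)
    {
        await Task.Delay(10);
        return "content of " + uri;
    }

    public static async Task<List<string>> RunAsync(
        IEnumerable<Uri> directoryPages, int linksPerDirectory)
    {
        var processed = new ConcurrentBag<string>();

        // Step 1: one directory page in, many page URLs out.
        var listBlock = new TransformManyBlock<Uri, Uri>(async dir =>
        {
            await FakeFetchAsync(dir);
            IEnumerable<Uri> links = Enumerable.Range(1, linksPerDirectory)
                .Select(i => new Uri(dir, "page" + i))
                .ToList();
            return links;
        });

        // Step 2: one page URL in, the URL plus its content out.
        var downloadBlock = new TransformBlock<Uri, (Uri uri, string content)>(
            async uri => (uri, await FakeFetchAsync(uri)));

        // Step 3: consume the page; here we only record which URL was handled.
        var processingBlock = new ActionBlock<(Uri uri, string content)>(
            msg => processed.Add(msg.uri.AbsoluteUri));

        var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
        listBlock.LinkTo(downloadBlock, linkOptions);
        downloadBlock.LinkTo(processingBlock, linkOptions);

        foreach (var uri in directoryPages)
        {
            await listBlock.SendAsync(uri);
        }

        // Signal the head of the pipeline, then wait for the tail to drain.
        listBlock.Complete();
        await processingBlock.Completion;

        return processed.OrderBy(s => s, StringComparer.Ordinal).ToList();
    }
}
```

Passing two directory URIs and linksPerDirectory = 3 should yield six processed pages. Because completion is awaited on the last block, the method returns only after every page has gone through all three steps.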

ExecutionDataflowBlockOptions can be used to specify the use of multiple tasks and the input buffer limit, e.g.:

var options=new ExecutionDataflowBlockOptions 
            {
                BoundedCapacity=10, 
                MaxDegreeOfParallelism=4,
            };

var downloadBlock = new TransformBlock<Uri,(Uri,string)>(...,options);

This means that downloadBlock will accept up to 10 URIs before signalling listBlock to pause, and will process up to 4 URIs concurrently.
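One way to convince yourself of the concurrency limit is to count how many handler invocations are in flight at once. The sketch below does that with a simulated download; ParallelismSketch and MeasurePeakConcurrencyAsync are illustrative names, and Task.Delay stands in for the real HTTP call.

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class ParallelismSketch
{
    // Pushes `itemCount` messages through a block and returns the highest
    // number of handler invocations that were observed running at once.
    public static async Task<int> MeasurePeakConcurrencyAsync(
        int maxDegreeOfParallelism, int itemCount)
    {
        var sync = new object();
        int running = 0, peak = 0;

        var options = new ExecutionDataflowBlockOptions
        {
            BoundedCapacity = 10,
            MaxDegreeOfParallelism = maxDegreeOfParallelism,
        };

        var block = new ActionBlock<int>(async _ =>
        {
            lock (sync)
            {
                running++;
                peak = Math.Max(peak, running);
            }

            await Task.Delay(50); // simulated download

            lock (sync)
            {
                running--;
            }
        }, options);

        for (int i = 0; i < itemCount; i++)
        {
            // SendAsync waits whenever the 10-slot input buffer is full.
            await block.SendAsync(i);
        }

        block.Complete();
        await block.Completion;
        return peak;
    }
}
```

With MeasurePeakConcurrencyAsync(4, 12) the reported peak should never exceed 4, whatever the timing on a given machine turns out to be.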
