Terminating multiple threads when any one of them completes its task

Date: 2022-04-12 13:51:37

I am new to both python, and to threads. I have written python code which acts as a web crawler and searches sites for a specific keyword. My question is, how can I use threads to run three different instances of my class at the same time. When one of the instances finds the keyword, all three must close and stop crawling the web. Here is some code.


class Crawler:
    def __init__(self):
        # the actual code for finding the keyword
        pass

def main():
    Crawl = Crawler()

if __name__ == "__main__":
    main()

How can I use threads to have Crawler do three different crawls at the same time?


5 Answers

#1 (53 votes)

There doesn't seem to be a (simple) way to terminate a thread in Python.


Here is a simple example of running multiple HTTP requests in parallel:


import threading
import urllib.request

def crawl():
    data = urllib.request.urlopen("http://www.google.com/").read()
    print("Read google.com")

threads = []

for n in range(10):
    thread = threading.Thread(target=crawl)
    thread.start()
    threads.append(thread)

# wait until all ten threads are finished
print("Waiting...")

for thread in threads:
    thread.join()

print("Complete.")

With some additional overhead, you can use a multiprocess approach that's more powerful and allows you to terminate the thread-like processes.


I've extended the example to use that. I hope this will be helpful to you:


import multiprocessing
import urllib.request

def crawl(result_queue):
    data = urllib.request.urlopen("http://news.ycombinator.com/").read()
    print("Requested...")

    if True:  # placeholder for "keyword found in data"
        result_queue.put("result!")

    print("Read site.")

if __name__ == "__main__":  # required on platforms that spawn processes
    processes = []
    result_queue = multiprocessing.Queue()

    for n in range(4):  # start 4 processes crawling for the result
        process = multiprocessing.Process(target=crawl, args=[result_queue])
        process.start()
        processes.append(process)

    print("Waiting for result...")

    result = result_queue.get()  # blocks until any process has `.put()` a result

    for process in processes:  # then kill them all off
        process.terminate()

    print("Got result:", result)

#2 (6 votes)

Starting a thread is easy:


thread = threading.Thread(target=function_to_call_inside_thread)
thread.start()

Create an event object to notify when you are done:


event = threading.Event()
event.wait() # call this in the main thread to wait for the event
event.set() # call this in a thread when you are ready to stop

You'll need to give your crawlers a stop() method; once the event fires, the main thread calls it on each one:


for crawler in crawlers:
    crawler.stop()

Then call join() on each thread:


thread.join() # waits for the thread to finish
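
Putting these pieces together, a minimal, self-contained sketch might look like this (the Crawler class with run() and stop() methods is assumed for illustration; the "c2 finds the keyword on page 5" condition is a placeholder, not from the original answer):

import threading
import time

class Crawler:
    def __init__(self, name, done_event):
        self.name = name
        self.done = done_event
        self._stopped = False

    def run(self):
        for page in range(100):
            if self._stopped:
                print(self.name, "stopped early")
                return
            time.sleep(0.01)                      # pretend to fetch a page
            if self.name == "c2" and page == 5:   # pretend keyword found
                print(self.name, "found the keyword")
                self.done.set()                   # wake the main thread
                return

    def stop(self):
        self._stopped = True

done = threading.Event()
crawlers = [Crawler(f"c{i}", done) for i in range(3)]
threads = [threading.Thread(target=c.run) for c in crawlers]
for t in threads:
    t.start()

done.wait()          # block until some crawler signals success
for c in crawlers:
    c.stop()         # ask the remaining crawlers to stop
for t in threads:
    t.join()
print("All crawlers stopped.")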

If you do any amount of this kind of programming, you'll want to look at the eventlet module. It allows you to write "threaded" code without many of the disadvantages of threading.


#3 (5 votes)

First off, if you're new to Python, I wouldn't recommend tackling threads yet. Get comfortable with the language first, then take on multithreading.


With that said, if your goal is to parallelize (you said "run at the same time"), you should know that in Python (or at least in the default implementation, CPython) multiple threads WILL NOT truly run in parallel, even if multiple processor cores are available. Read up on the GIL (Global Interpreter Lock) for more information.


Finally, if you still want to go on, check the Python documentation for the threading module. I'd say Python's docs are as good as references get, with plenty of examples and explanations.


#4 (0 votes)

For this problem, you can use either the threading module (which, as others have said, will not give true parallelism because of the GIL) or the multiprocessing module (depending on which version of Python you're using). They have very similar APIs, but I recommend multiprocessing, as it is more Pythonic, and I find communicating between processes with Pipes pretty easy.


You'll want a main loop that creates your processes, and each of these processes should run your crawler and have a pipe back to the main process. Each process should listen for a message on the pipe, do some crawling, and send a message back over the pipe if it finds something (before terminating). Your main loop should loop over each of the pipes, listening for this "found something" message. Once it hears it, it should resend it over the pipes to the remaining processes, then wait for them to complete.

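A minimal, self-contained sketch of that protocol might look like the following (the site names and the "site-2 finds the keyword" condition are placeholders for illustration, not from the original answer):

import multiprocessing

def crawler(conn, site):
    # Stand-in for real crawling: pretend only one site has the keyword.
    if site == "site-2":
        conn.send("found something")
    elif conn.poll(timeout=10):      # wait for a stop message from main
        conn.recv()
    conn.close()

def main():
    pipes, processes = [], []
    for i in range(3):
        parent_conn, child_conn = multiprocessing.Pipe()
        p = multiprocessing.Process(target=crawler,
                                    args=(child_conn, f"site-{i}"))
        p.start()
        pipes.append(parent_conn)
        processes.append(p)

    # Poll each pipe until one crawler reports a result.
    winner = None
    while winner is None:
        for i, conn in enumerate(pipes):
            if conn.poll(timeout=0.1):
                print(conn.recv())
                winner = i
                break

    # Tell the remaining crawlers to stop, then wait for all of them.
    for i, conn in enumerate(pipes):
        if i != winner:
            conn.send("stop")
    for p in processes:
        p.join()

if __name__ == "__main__":
    main()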

More information can be found here: http://docs.python.org/library/multiprocessing.html


#5 (0 votes)

First of all, threading is not a solution here. Due to the GIL, threads do not run in parallel in Python. You can handle this with multiprocessing instead, though you'll be limited by the number of processor cores.


What's the goal of your work? Do you want a working crawler, or do you have an academic goal (learning about threading and Python, etc.)?


One more point: crawling consumes more resources than most programs, so what is the scale of your crawl?

