What is a good practice/design in Python for work spanning multiple SQL queries?

Time: 2022-05-16 12:58:10

I am extracting information from a website and storing it in a database using Python with MySQLdb and BeautifulSoup.

The website is organized into about 15 different cities, and each city has anywhere from 10 to 150 pages. There are about 500 pages in total.

For each page in each city, I open the site with BeautifulSoup, extract all the necessary information, then perform an INSERT INTO or UPDATE SQL query.
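
A minimal sketch of that insert-or-update step with MySQLdb, assuming an illustrative listings table with a UNIQUE key on page_url (the real schema, column names, and the scraped values page_url, city, title may differ):

cur = con.cursor()
# one parameterized statement covers both the insert and the update case,
# assuming the table has a UNIQUE key on page_url (illustrative schema)
cur.execute( """
    INSERT INTO listings (page_url, city, title)
    VALUES (%s, %s, %s)
    ON DUPLICATE KEY UPDATE city = VALUES(city), title = VALUES(title)
""", ( page_url, city, title ) )
con.commit()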

Currently I am not using threads, and it takes a few minutes to go through all 500 pages, because the Python program has to:

  1. Open a page.
  2. Extract information.
  3. Perform SQL query.
  4. Open the next page...

Ideally I would want to load-balance the work by having, say, 10 concurrent threads that open about 50 pages each. But I think that may be too complicated to code.

So instead I am thinking of having one thread per city. How would I accomplish this?

Currently my code looks something like this:

# import threading
from BeautifulSoup import BeautifulSoup
import urllib2
import MySQLdb

con = MySQLdb.connect( ... )

def open_page( url ):
    cur = con.cursor()
    # do SQL query

# List of city URLs

cities = [
    'http://example.com/atlanta/',
    'http://example.com/los-angeles/',
    ...
    'http://example.com/new-york/'
]

for city_url in cities:
    soup = BeautifulSoup( urllib2.urlopen( city_url ) )

    # find every page per city
    pages = soup.findAll( 'div', { 'class' : 'page' } )

    for page in pages:
        page_url = page.find( 'a' )[ 'href' ]
        open_page( page_url )

2 Solutions

#1



Your initial idea is absolutely feasible. Just start 10 worker threads that all wait for input on one and the same queue. Then your main process puts the URLs into this queue. The load balancing will happen automatically.
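
A minimal sketch of that pattern, assuming Python 2 to match the question's imports (the stdlib queue module is named Queue there); scrape_page and all_page_urls are hypothetical stand-ins for the per-page parsing work and the list of page URLs:

import threading
import Queue  # named "queue" in Python 3

url_queue = Queue.Queue()

def worker():
    while True:
        page_url = url_queue.get()
        try:
            scrape_page( page_url )  # hypothetical: parse one page and run its INSERT/UPDATE
        finally:
            url_queue.task_done()

# start 10 worker threads that all wait on the same queue
for _ in range( 10 ):
    t = threading.Thread( target = worker )
    t.daemon = True
    t.start()

# the main process feeds every page URL into the queue; whichever
# worker is free picks up the next one, so the load balances itself
for page_url in all_page_urls:  # hypothetical list of all ~500 page URLs
    url_queue.put( page_url )

url_queue.join()  # block until every queued page has been processed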

If your SQL bindings are thread-safe, you can do the INSERT or UPDATE stuff in the worker threads. Otherwise, I'd add one more thread for the SQL stuff, waiting for input on a different queue. Then your worker threads would put the query into this queue, and the SQL thread would execute it.

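A sketch of that variant, where only one dedicated thread ever touches the database connection and the workers just queue (statement, params) tuples; query_queue and the example INSERT are illustrative, and con is the connection from the question, now used by this one thread only:

query_queue = Queue.Queue()

def sql_writer():
    # the only thread that ever touches the MySQL connection
    cur = con.cursor()
    while True:
        sql, params = query_queue.get()
        cur.execute( sql, params )
        con.commit()
        query_queue.task_done()

threading.Thread( target = sql_writer ).start()

# a worker thread never executes SQL itself; it just queues the statement, e.g.
# query_queue.put( ("INSERT INTO listings (city, title) VALUES (%s, %s)", (city, title)) )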

If you google for "python worker threads queue" you'll find a few examples.


#2



  1. Parse URLs in N threads.
  2. Create a .CSV source from the parsed URLs.
  3. Create a TEMP table.
  4. Insert into the TEMP table from the CSV via http://dev.mysql.com/doc/refman/5.1/en/load-data.html
  5. Insert WITHOUT dupes into the MAIN table from the TEMP table (see the sketch after this list).
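
A rough sketch of steps 3-5 with MySQLdb, assuming the parser threads wrote a file called scraped_rows.csv, that the main table (called listings here purely for illustration) has a UNIQUE key defining what counts as a dupe, and that local_infile is enabled on the connection:

cur = con.cursor()

# 3. temporary table with the same layout as the main table
cur.execute( "CREATE TEMPORARY TABLE listings_tmp LIKE listings" )

# 4. bulk-load the CSV produced by the parser threads
cur.execute( """
    LOAD DATA LOCAL INFILE 'scraped_rows.csv'
    INTO TABLE listings_tmp
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    LINES TERMINATED BY '\\n'
""" )

# 5. copy into the main table, skipping rows that collide with its UNIQUE key
cur.execute( "INSERT IGNORE INTO listings SELECT * FROM listings_tmp" )
con.commit()

The appeal of this approach is that one LOAD DATA call is typically much faster than issuing ~500 individual INSERT/UPDATE statements.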
