Is there a more efficient way to query HTML data than in an SQLite database?

Posted: 2022-10-17 23:21:15

Maybe my question will be closed because it is not "constructive" enough, but anyway... I've already searched for answers, but most of them are too general. For my master's thesis project I have to crawl lots of (i.e. several thousand) webpages and store them in their entirety in a database. This is necessary because I have to analyze them in different ways, try out several machine learning algorithms, and parse them in different ways. At the moment I'm using an SQLite database for this purpose, in combination with Django as the preferred web framework.

I've put the entire HTML of a single webpage into a Django TextField, i.e. for each webpage there is a separate row in the database table, and all of the webpage's content is stored in a single column of that table. The problem now is that querying and sorting the data, and especially iterating over it, is very slow. I've crawled around 1000 webpages so far and the database file is already over 2 GB in size. Furthermore, my 4 GB of RAM fill up entirely and my computer (MacBook Pro mid-2009, Core2Duo 2.26 GHz, 500 GB HDD, OS X 10.8) becomes unresponsive. If I restrict the number of items to be pre-cached, processing the data becomes even slower because the number of queries increases. Since I have to crawl even more data, my current setup doesn't scale well enough.

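For concreteness, my setup roughly looks like the sketch below (model and field names are illustrative, not my actual code), including how I iterate over the rows:

```python
# Rough sketch of my current setup (names are illustrative).
from django.db import models

class Webpage(models.Model):
    url = models.URLField(unique=True)
    crawled_at = models.DateTimeField(auto_now_add=True)
    html = models.TextField()  # the entire raw HTML of one page

# Iterating looks roughly like this: iterator() streams rows instead of
# caching the whole queryset, but with 2 GB of TextField data every pass
# over the table is still very slow.
for page in Webpage.objects.all().iterator():
    print(page.url, len(page.html))  # stand-in for the real analysis step
```
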
My question now is the following: how can I store my HTML data more efficiently so that querying it is faster? Does it make sense to switch to another RDBMS such as MySQL or PostgreSQL? Or would you recommend a document-oriented database system such as MongoDB? So far I only have experience with SQLite, so I have no real idea what to use. As the deadline for my master's thesis is drawing nearer, I don't have the time to try out lots of different setups.

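For what it's worth, switching the backend in Django itself looks trivial, so integration effort is probably not the deciding factor. A sketch of what I'd change, assuming PostgreSQL with the psycopg2 driver (all values below are placeholders):

```python
# settings.py sketch: pointing Django at PostgreSQL instead of SQLite.
# Requires the psycopg2 driver; all values below are placeholders.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'crawldb',
        'USER': 'crawler',
        'PASSWORD': 'secret',
        'HOST': 'localhost',
        'PORT': '5432',
    }
}
```
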
To help you help me, here are some further requirements:

  • better performance when querying large amounts of HTML data than SQLite offers, while not eating up all of my computer's memory (the workload cannot be distributed to other computers)
  • more or less good integration with Django
  • this is research work only, so it will never run in a production environment, only on my computer (maybe also on my professor's)

It would be great if you could help me decide which direction to take, because I feel somewhat lost among this huge number of possibilities. Thank you very much in advance! :)

1 Solution

#1

These are not so much an answer as suggestions you should consider. Your problem is far too specific to be taken care of with a code snippet or a hint.

First: try to reduce the fetched data as much as possible. That said, why store the complete HTML code? I guess you are only interested in the text published on the webpage (a.k.a. the content). So strip the HTML markup from the data during the crawl and store only the pure information. If I am wrong and you are interested in something else, feel free to correct me.

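A minimal sketch of what I mean, assuming BeautifulSoup as the extractor (lxml or a similar parser would do just as well):

```python
# Sketch: strip the markup during the crawl, store only the text.
from bs4 import BeautifulSoup

def extract_text(raw_html):
    soup = BeautifulSoup(raw_html, 'html.parser')
    for tag in soup(['script', 'style']):  # these are never "content"
        tag.decompose()
    # Collapse whitespace so the stored text stays compact.
    return ' '.join(soup.get_text(separator=' ').split())
```
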
Second: try to produce queryable data. Your crawler should write the data to the database in a fashion that lets you process it more easily. A map-reduce-like approach could be the way to go. This will take more time while crawling, but it enables fast data retrieval afterwards. It pretty much means storing only a map of all the pages you crawled rather than the complete content, or at least making sure your queries don't touch the full data tables and rely on the map-reduced content first.

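As a sketch of the idea (model and field names invented): let the crawler fill a small, derived table next to the raw pages, and point your queries at that first:

```python
# Sketch: store derived, queryable facts beside the raw HTML so that
# everyday queries never touch the multi-gigabyte text column.
from django.db import models

class PageSummary(models.Model):
    url = models.URLField(unique=True)
    title = models.CharField(max_length=500)
    word_count = models.IntegerField()
    language = models.CharField(max_length=8)

# A query like "all long German pages" then only scans the small table:
#   PageSummary.objects.filter(language='de', word_count__gt=1000)
```
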
Third: upgrade your hardware. You want to process a lot of data? Be prepared (or bring plenty of time). Stick as much RAM into your MacBook as you want and can (you can put RAM inside, right?! Please say you can upgrade the RAM in Apple machines...), since it is really cheap.

Fourth: SQLite is HDD-heavy, since it relies on the OS I/O cache and so on, and sometimes needs ages to re-fetch data. If you can, try to put it on an SSD (which will be unhealthy for the SSD in the long run ;-) ), or use a remote database with a fast connection to your PC, so that the HDD->RAM->CPU cycle is not your limitation but only RAM and maybe CPU (I guess your program is not multi-core, right?).

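And if you have to stay on SQLite for now, you can at least give it a bigger cache of its own instead of leaning on the OS. A sketch, with values that are guesses you would have to tune, not recommendations:

```python
# Sketch: software-side tuning while staying on SQLite.
import sqlite3

conn = sqlite3.connect('crawl.db')  # path is a placeholder
conn.execute('PRAGMA cache_size = -200000')  # ~200 MB page cache (negative means KiB)
conn.execute('PRAGMA journal_mode = WAL')    # write-ahead log: fewer syncs on writes
conn.execute('PRAGMA synchronous = NORMAL')  # trade a little durability for speed
```
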
Fifth and final: even though I hate throwing in fancy buzzwords that are all over the media right now, have a look at IBM's article about Hadoop.
