I’m in the process of writing a web crawler that collects all the local (same-site) links on one particular website. The goal is to identify which links are products and feed them into my price comparison software.
The problem I’m finding is that my crawl of the site is still incomplete and already stands at 5.4 million links. At those numbers, storing the collected links in memory in a HashSet and then saving them out to a flat text file is prohibitive. The HashSet is blowing up memory consumption, and I only have around 5 GB of memory available.
Each time I acquire a new link, I need to check whether it has been captured before, and a HashSet seemed the fastest way to do this comparison. With the memory issues, and my text files topping 1.5 GB in size, I thought it would be better to switch to a database: MySQL 5.6, which I’m running on Windows 7 64-bit, in developer mode.
I have migrated all the captured data into a MySQL 5.6 database using “LOAD DATA LOCAL INFILE”. This seems to have worked well, but the URL column is just a varchar(400).
The problem I’m having now is that a query to see whether a URL is present in the table takes around 10-15 seconds. Is there any way I can dramatically improve this performance?
One thing I did try was setting the field to unique (with a smaller field length limit), but with that in place the database seemed to become unresponsive when running the LOAD DATA INFILE on the file with 5.4 million records.
I’m currently developing in C#, using SQLconnector.
What I would like to know is: can I improve the performance of lookups on this text field, and are there any alternative ways of storing and querying this data?
Thanks
2 Solutions
#1
2
You could look into partitioning your table in MySQL: http://dev.mysql.com/doc/refman/5.5/en/partitioning-types.html
You mentioned trying to store all the data in memory, but that it was too much. You could put a memory cache in front of your database to regain some of the performance: memcached, or I think MySQL has one of its own now.
#2
1
You've got a couple of options:
- First and foremost, put an index on the field. It likely takes 10-15 seconds because the query is doing a full table scan rather than using an index. You can check that by looking at the execution plan (EXPLAIN in MySQL). It doesn't have to be a unique index (unless you want the DB to reject inserts of a duplicate value).
- Another thing you can do will help with table search as well as with memory pressure. Instead of holding entire URLs in memory, which can be quite lengthy, compute the MD5 (or any other hash function) of every URL and store that in memory. Similarly, in the DB, store the MD5 signature of the URL alongside the URL itself, and then search by that value (also indexed). This way far fewer bytes need to be compared per lookup, so it will be faster.
- Combine your DB and memory approaches by keeping a limited cache in memory and the full store in the DB. In memory, keep the MD5 keys and how old they are (by time, FIFO order, or distance from your current page in the website's link graph). When you need to check a link, check your memory cache first. On a hit, you know you have already visited the URL. Only on a cache miss do you go to the database to really see whether it has been visited. This will hopefully reduce the number of database queries you need to make (how much depends on how often links repeat themselves).
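The three options above can be combined, roughly as in the sketch below. This is only an illustration under some loud assumptions: it is written in Python rather than C#, it uses the standard-library sqlite3 module as a stand-in for MySQL, and the names SeenUrls, url_digest, links and url_md5 are made up for the example. The idea carries over: an indexed 16-byte digest column in the DB, with a bounded cache of digests in front of it.

```python
import hashlib
import sqlite3
from collections import OrderedDict


def url_digest(url: str) -> bytes:
    """Return the 16-byte MD5 digest of a URL (a lookup key, not a security measure)."""
    return hashlib.md5(url.encode("utf-8")).digest()


class SeenUrls:
    """Bounded in-memory cache of digests in front of the full store in the DB."""

    def __init__(self, conn: sqlite3.Connection, cache_size: int = 100_000) -> None:
        self.conn = conn
        self.cache_size = cache_size
        self.cache = OrderedDict()  # digest -> None, oldest entry first
        self.db_lookups = 0         # how many checks had to go to the DB
        # Assumption: SQLite stands in for MySQL here. The PRIMARY KEY gives
        # the digest column a unique index, so a lookup searches the index
        # instead of scanning the whole table.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS links ("
            "url_md5 BLOB PRIMARY KEY, url TEXT NOT NULL)"
        )

    def _remember(self, key: bytes) -> None:
        """Put the digest in the cache, evicting the least recently used one."""
        self.cache[key] = None
        self.cache.move_to_end(key)
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)

    def add_if_new(self, url: str) -> bool:
        """Record the URL; return True only if it had not been seen before."""
        key = url_digest(url)
        if key in self.cache:
            # Cache hit: the URL is known, no database query needed.
            self.cache.move_to_end(key)
            return False
        # Cache miss: only now ask the database.
        self.db_lookups += 1
        row = self.conn.execute(
            "SELECT 1 FROM links WHERE url_md5 = ?", (key,)
        ).fetchone()
        if row is None:
            self.conn.execute(
                "INSERT INTO links (url_md5, url) VALUES (?, ?)", (key, url)
            )
        self._remember(key)
        return row is None


if __name__ == "__main__":
    seen = SeenUrls(sqlite3.connect(":memory:"), cache_size=2)
    for link in ["http://example.com/a", "http://example.com/a", "http://example.com/b"]:
        print(link, "new" if seen.add_if_new(link) else "seen before")
    print("database lookups:", seen.db_lookups)
```

For a sense of scale, 5.4 million 16-byte digests come to about 86 MB of raw key data, before any per-object overhead the language runtime adds. Two different URLs with the same MD5 would be treated as one link here; keeping the full URL in the table leaves room to verify a match if that ever matters.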
Other things to consider for optimizations:
1. Some sites have redundant links that look different but are the same for your purposes. Examples would be printable versions, mobile versions, feedback view vs price view, etc. You may want to study the site's URL structure to learn which links are interesting to you and which are not, and discard the latter from your memory/DB.
2. Some sites don't really have links in the form of anchor tags, and instead use JavaScript event handling to decide whether something is clickable and how to process it (e.g. via jQuery selectors). You may be missing parts of a site if it employs such techniques.
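As a rough illustration of collapsing redundant links before they are stored, something like the following could work. It is a sketch in Python, not a definitive rule set: the function name canonical_url and the parameter names in IGNORED_PARAMS are hypothetical, and which parameters are safe to drop (and whether parameter order matters) depends entirely on the site being crawled.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical query parameters that only change how a page is presented.
# The real names have to come from studying the target site's URLs.
IGNORED_PARAMS = {"print", "view", "mobile"}


def canonical_url(url: str) -> str:
    """Reduce a URL to one canonical form so equivalent links compare equal."""
    parts = urlsplit(url)
    # Drop presentation-only parameters and sort the rest, assuming the
    # site does not care about parameter order.
    kept = sorted(
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in IGNORED_PARAMS
    )
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),   # host names are case-insensitive
        parts.path or "/",      # treat an empty path as the root
        urlencode(kept),
        "",                     # drop the fragment; it is never sent to the server
    ))


if __name__ == "__main__":
    print(canonical_url("HTTP://Example.com/item?id=7&print=1#reviews"))
    print(canonical_url("http://example.com/item?view=mobile&id=7"))
```

Running each link through a function like this before hashing or storing it means the printable and mobile variants of a page end up as a single entry.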
Hope this helps.