So I am writing a simple website crawler for maintaining in-house sites. It goes through each link, adding new links as it finds them and noting down title and h1 tags, etc.
Occasionally it records duplicate titles and H1 tags, even though there is only one in the source when I check it manually.
This is happening because the crawl script runs via cron and the runs appear to be overlapping, so the same page gets processed twice.
The script basically grabs a page that has not been crawled yet, then, if the HTTP response is 200, marks it as crawled and processes what it needs to.
So somewhere between the SELECT and the UPDATE, another thread of the script is running on the same row that was SELECTed.
Is there a way to either SELECT and UPDATE in the same query, or lock the row returned in the SELECT so it cannot be returned again in another query in another thread until I am finished with it?
I have had a look at http://dev.mysql.com/doc/refman/5.0/en/innodb-locking-reads.html and SELECT ... FOR UPDATE in general, but I am still unsure.
Edit
I am using something like this
START TRANSACTION;
SELECT .. FOR UPDATE;
UPDATE .... ;
COMMIT;
But it's not liking it. I am definitely using InnoDB on that table. I am thinking this may not be the way forward, as it simply puts off the processing of the row until after the commit, whereas I want it to be physically impossible to SELECT the row again.
I have tried to cover this by doing the SELECT and then an UPDATE that flags the row as crawled before processing it, but because the two steps are not atomic, the problem remains. I need a way to SELECT and UPDATE the field atomically, or to SELECT the row and stop it being SELECTed again until I say so.
2 solutions
#1
3
You answered the question yourself :). SELECT FOR UPDATE is exactly what you need, if I understand your question correctly. Remember to turn off autocommit, start a transaction before the SELECT, and commit the transaction after the UPDATE.
Update:
I think this will do what you want:
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- MySQL uses START TRANSACTION (or BEGIN) and a plain COMMIT
START TRANSACTION;
SELECT .. FOR UPDATE;
UPDATE .... ;
COMMIT;
#2
3
When you lock the row (via SELECT ... FOR UPDATE), the other transaction will wait for the lock to be released instead of skipping the row and selecting the next one. A better strategy is to have a flag column in the table (none, processing, completed), maybe with a timestamp. The cron job grabs a row, sets the flag to 'processing', and starts processing the page. When another instance of the script is running, it selects rows that are not in the 'processing' state. When the cron job finishes, it updates the record once again, to 'completed'.
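The flag-column idea could be sketched roughly like this. It is only a minimal illustration, not the poster's actual script: the `pages` table, its `status`, `worker` and `claimed_at` columns, and the function names are all made up for the example, and it uses Python's built-in sqlite3 purely so it runs standalone (the same two statements should also work on MySQL/InnoDB). The important part is that the UPDATE repeats `status = 'none'` in its WHERE clause and the script checks the affected-row count, so if two overlapping cron runs SELECT the same row, only one of them gets to claim it.

```python
import sqlite3


def claim_next_page(conn, worker_id):
    """Try to claim one uncrawled page; return (id, url), or None if none are left."""
    while True:
        row = conn.execute(
            "SELECT id, url FROM pages WHERE status = 'none' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None  # nothing left to crawl
        page_id, url = row
        # Conditional UPDATE: it only matches if the row is still unclaimed,
        # so a competing instance that got there first makes this a no-op.
        cur = conn.execute(
            "UPDATE pages SET status = 'processing', worker = ?, "
            "claimed_at = CURRENT_TIMESTAMP "
            "WHERE id = ? AND status = 'none'",
            (worker_id, page_id),
        )
        conn.commit()
        if cur.rowcount == 1:
            return page_id, url  # this instance owns the row now
        # rowcount == 0: someone else claimed it; loop and try the next row.


def mark_completed(conn, page_id):
    """Flag a claimed page as done once it has been processed."""
    conn.execute("UPDATE pages SET status = 'completed' WHERE id = ?", (page_id,))
    conn.commit()


def demo():
    """Small in-memory run: two pages, three claim attempts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pages ("
        " id INTEGER PRIMARY KEY,"
        " url TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'none',"  # none / processing / completed
        " worker TEXT,"
        " claimed_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO pages (url) VALUES (?)",
        [("http://intranet.example/a",), ("http://intranet.example/b",)],
    )
    conn.commit()

    first = claim_next_page(conn, "cron-1")
    second = claim_next_page(conn, "cron-2")
    mark_completed(conn, first[0])
    third = claim_next_page(conn, "cron-3")  # both pages are already taken
    statuses = [r[0] for r in conn.execute("SELECT status FROM pages ORDER BY id")]
    return first, second, third, statuses
```

The `claimed_at` timestamp is there so that rows stuck in 'processing' after a crashed run could be reset to 'none' by a separate cleanup query; how long to wait before doing that is a judgment call that depends on how long a crawl normally takes.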