I have two tables, each holding a list of URLs fetched from a different source.
I want to find the common entries and put them in a separate table.
This is what I'm doing:
- Compute the MD5 hash of each URL while fetching it.
- Store the hash in a column.
- Fetch one table as an array, loop through it, and insert the values from the other table where the MD5 hash is the same.
EDIT: Should I strip "http://" and "www." from the URLs?
Is there a better and faster way to do this?
I am using PHP + MySQL.
3 Answers
#1
3
MD5 is a little slow if you need real speed. Try MurmurHash.
You should apply the following transformations before calculating the hash:
- Strip "http://" and "www."
- Strip the trailing slash
- Normalize the URL (urlencode it)
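The transformations above could be sketched in PHP roughly as follows. This is only an illustration: `normalize_url` and `url_hash` are hypothetical names, stripping "https://" as well as "http://" is an assumption, and decoding before encoding (so that "a b" and "a%20b" end up identical) is one possible reading of "urlencode it". It keeps `md5` because that is what the question already stores.

```php
<?php
// Hypothetical helper: apply the listed transformations to one URL.
function normalize_url(string $url): string
{
    $url = trim($url);
    // Strip the scheme. Matching "https://" too is an assumption.
    $url = preg_replace('#^https?://#i', '', $url);
    // Strip a leading "www.".
    $url = preg_replace('#^www\.#i', '', $url);
    // Strip trailing slashes.
    $url = rtrim($url, '/');
    // Decode first, then encode, so encoded and unencoded spellings agree.
    return urlencode(urldecode($url));
}

// Hypothetical helper: the value to store in the hash column.
function url_hash(string $url): string
{
    return md5(normalize_url($url));
}

// Both spellings produce the same hash after normalization.
var_dump(url_hash('http://www.example.com/path/') === url_hash('example.com/path'));
```

Whether two URLs that differ only in scheme or "www." should count as the same entry depends on the data, so the first two steps may need to be dropped.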
#2
0
Try something like:
INSERT INTO table3 (url) SELECT table1.url FROM table1 INNER JOIN table2 ON table1.hash = table2.hash;
Treat this as a sketch rather than a tested statement, but an INSERT ... SELECT like this should read the URLs in table1 whose hash also appears in table2 and put them in table3.
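To see the idea end to end, here is a minimal PHP sketch that lets the database do the matching instead of a PHP loop. It assumes the `pdo_sqlite` extension is available and uses an in-memory SQLite database as a stand-in for MySQL so that it runs on its own; the table and column names follow the thread, and the sample URLs are made up.

```php
<?php
// Assumption: in-memory SQLite stands in for the MySQL database.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$pdo->exec('CREATE TABLE table1 (url TEXT, hash CHAR(32))');
$pdo->exec('CREATE TABLE table2 (url TEXT, hash CHAR(32))');
$pdo->exec('CREATE TABLE table3 (url TEXT)');

// Made-up sample data; only the "b" URL appears in both tables.
$insert1 = $pdo->prepare('INSERT INTO table1 (url, hash) VALUES (?, ?)');
foreach (['http://example.com/a', 'http://example.com/b'] as $url) {
    $insert1->execute([$url, md5($url)]);
}
$insert2 = $pdo->prepare('INSERT INTO table2 (url, hash) VALUES (?, ?)');
foreach (['http://example.com/b', 'http://example.com/c'] as $url) {
    $insert2->execute([$url, md5($url)]);
}

// One statement copies every URL whose hash exists in both tables.
$copied = $pdo->exec(
    'INSERT INTO table3 (url)
     SELECT table1.url FROM table1
     INNER JOIN table2 ON table1.hash = table2.hash'
);

$common = $pdo->query('SELECT url FROM table3')->fetchAll(PDO::FETCH_COLUMN);
print_r($common);
```

Against MySQL the same statement should work with a MySQL DSN in place of `sqlite::memory:`; the connection details are left out here because the thread does not give them.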
EDIT: If you want to sanitize your input URLs (e.g. by removing GET variables), I'd do that before saving them to table1 and table2. I wouldn't remove "http" and "www", because "https://somesite" and "http://somesite", as well as "www.somesite.com" and "somesite.com", may have different content.
#3
0
SELECT * FROM table1 WHERE hash IN (SELECT hash FROM table2)
You may also want to look at the concept of table joins.
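As a rough comparison of the two forms, the PHP sketch below runs the `IN` subquery and an equivalent join against the same made-up data. It again assumes `pdo_sqlite` and an in-memory SQLite database in place of MySQL; the index names are hypothetical. One detail worth noticing: a plain join repeats a row from table1 once per matching row in table2, so `DISTINCT` is needed to get the same result as `IN` when table2 holds duplicate hashes.

```php
<?php
// Assumption: in-memory SQLite stands in for the MySQL database.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$pdo->exec('CREATE TABLE table1 (url TEXT, hash CHAR(32))');
$pdo->exec('CREATE TABLE table2 (url TEXT, hash CHAR(32))');
// Indexes on the compared column (hypothetical names) help both query forms.
$pdo->exec('CREATE INDEX idx_table1_hash ON table1 (hash)');
$pdo->exec('CREATE INDEX idx_table2_hash ON table2 (hash)');

// Made-up sample data; table2 deliberately contains the "b" URL twice.
$insert1 = $pdo->prepare('INSERT INTO table1 (url, hash) VALUES (?, ?)');
foreach (['http://example.com/a', 'http://example.com/b'] as $url) {
    $insert1->execute([$url, md5($url)]);
}
$insert2 = $pdo->prepare('INSERT INTO table2 (url, hash) VALUES (?, ?)');
foreach (['http://example.com/b', 'http://example.com/b', 'http://example.com/c'] as $url) {
    $insert2->execute([$url, md5($url)]);
}

// Subquery form from the answer above.
$viaIn = $pdo->query(
    'SELECT url FROM table1 WHERE hash IN (SELECT hash FROM table2)'
)->fetchAll(PDO::FETCH_COLUMN);

// Join form without DISTINCT: one output row per matching pair.
$viaPlainJoin = $pdo->query(
    'SELECT table1.url FROM table1
     INNER JOIN table2 ON table1.hash = table2.hash'
)->fetchAll(PDO::FETCH_COLUMN);

// Join form with DISTINCT: same rows as the IN query for this data.
$viaJoin = $pdo->query(
    'SELECT DISTINCT table1.url FROM table1
     INNER JOIN table2 ON table1.hash = table2.hash'
)->fetchAll(PDO::FETCH_COLUMN);

var_dump($viaIn === $viaJoin, count($viaPlainJoin));
```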