What is the best and fastest way to compare two URLs?

Time: 2022-09-19 22:51:43

I have two tables with lists of URLs fetched from different sources.

I want to find the common entries and put them in a separate table.

This is what I'm doing:

  1. Compute the MD5 hash of each URL while fetching it.
  2. Store the hash in a column.
  3. Fetch one table as an array, loop through it, and insert the values from the other table where the MD5 hash is the same.

EDIT: Should I strip "http://" and "www." from the URLs?

I would like to know of any other method that is better and faster for doing the above.

I am using PHP + MySQL.

3 solutions

#1


3  

MD5 is a little slow if you need real speed. Try MurmurHash.

You should do the following transformations before hash calculation:

  • Strip "http://" and "www."
  • Strip the trailing slash
  • Normalize the URL (urlencode it)
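
The transformations listed above could be sketched in PHP roughly as follows. The function names `normalize_url` and `url_hash` are hypothetical, and the rules are only one reasonable reading of the list. The sketch leaves out percent-encoding, because calling urlencode() on a whole URL would also encode its slashes. On PHP 8.1 or newer, hash('murmur3a', ...) could be swapped in for md5().

```php
<?php
// Hypothetical helper: reduce a URL to a canonical form before hashing.
function normalize_url(string $url): string
{
    $url = trim($url);
    // Drop the scheme ("http://" or "https://"), case-insensitively.
    $url = preg_replace('#^https?://#i', '', $url);
    // Drop a leading "www.".
    $url = preg_replace('#^www\.#i', '', $url);
    // Drop trailing slashes.
    $url = rtrim($url, '/');
    // Host names are case-insensitive, so lowercase the host part only
    // and leave the path, query, and fragment untouched.
    $hostLength = strcspn($url, '/?#');
    return strtolower(substr($url, 0, $hostLength)) . substr($url, $hostLength);
}

// Hypothetical helper: hash the normalized form so equivalent URLs collide.
function url_hash(string $url): string
{
    return md5(normalize_url($url));
}

// Both spellings reduce to "example.com/page" and therefore share a hash.
var_dump(url_hash('http://www.Example.com/page/') === url_hash('example.com/page'));
```

Computing the hash once at fetch time and storing it in an indexed column keeps the later comparison inside the database.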

#2


0  

Try something like:

INSERT INTO table3  (SELECT url FROM table1, table2 WHERE table1.hash = table2.hash)

That's not a valid SQL statement, but a nested query like that should read URLs from table1 and table2 that match by their hash and put them in table3.
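
One way the idea might be turned into a valid statement is shown below, as a sketch only. It assumes each source table has `url` and `hash` columns and that table3 has a single `url` column. An in-memory SQLite database (via pdo_sqlite) stands in for MySQL so the example runs on its own, and the sample rows are invented.

```php
<?php
// Assumption: in-memory SQLite stands in for MySQL to keep this self-contained.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Hypothetical schema: each source table stores a URL and its precomputed hash.
$pdo->exec('CREATE TABLE table1 (url TEXT, hash TEXT)');
$pdo->exec('CREATE TABLE table2 (url TEXT, hash TEXT)');
$pdo->exec('CREATE TABLE table3 (url TEXT)');

// Sample rows, invented for illustration.
$rows = [
    'table1' => ['example.com/a', 'example.com/b'],
    'table2' => ['example.com/b', 'example.com/c'],
];
foreach ($rows as $table => $urls) {
    $stmt = $pdo->prepare("INSERT INTO $table (url, hash) VALUES (?, ?)");
    foreach ($urls as $url) {
        $stmt->execute([$url, md5($url)]);
    }
}

// Qualify the url column and join explicitly on the hash; DISTINCT keeps
// duplicate hashes in table2 from inserting the same URL more than once.
$pdo->exec(
    'INSERT INTO table3 (url)
     SELECT DISTINCT table1.url
     FROM table1
     INNER JOIN table2 ON table1.hash = table2.hash'
);

$common = $pdo->query('SELECT url FROM table3')->fetchAll(PDO::FETCH_COLUMN);
print_r($common); // only example.com/b appears in both tables
```

The same INSERT ... SELECT syntax is accepted by MySQL, where an index on each `hash` column would normally be worth adding.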

EDIT: If you want to sanitize your input URLs (e.g. by removing GET variables), I'd do that before saving them to table1 and table2. I wouldn't remove http and www, since "https://somesite" and "http://somesite", as well as "www.somesite.com" and "somesite.com", may have different content.

#3


0  

SELECT * FROM table1 WHERE hash IN (SELECT hash FROM table2)

You may also want to have a look at the concept of table joins.
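
As a rough illustration of what a join offers over the IN subquery, the sketch below returns the matching URL from both tables side by side. The aliases `t1` and `t2`, the placeholder hash values, and the sample rows are all invented, and in-memory SQLite again stands in for MySQL.

```php
<?php
// Assumption: in-memory SQLite stands in for MySQL to keep this self-contained.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Hypothetical tables; 'h1', 'h2', 'h3' are placeholder hash values.
$pdo->exec('CREATE TABLE table1 (url TEXT, hash TEXT)');
$pdo->exec('CREATE TABLE table2 (url TEXT, hash TEXT)');
$pdo->exec("INSERT INTO table1 VALUES ('http://example.com/a', 'h1')");
$pdo->exec("INSERT INTO table1 VALUES ('http://example.com/b', 'h2')");
$pdo->exec("INSERT INTO table2 VALUES ('https://www.example.com/b', 'h2')");
$pdo->exec("INSERT INTO table2 VALUES ('https://www.example.com/c', 'h3')");

// An inner join keeps only the rows whose hash appears in both tables and,
// unlike the IN form, can select columns from either side.
$pairs = $pdo->query(
    'SELECT t1.url AS url1, t2.url AS url2
     FROM table1 AS t1
     INNER JOIN table2 AS t2 ON t1.hash = t2.hash'
)->fetchAll(PDO::FETCH_ASSOC);

print_r($pairs);
```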
