Getting the most repeated similar fields in a MySQL database

Time: 2021-08-21 19:14:41

Let's assume we have a database like:

Actions_tbl:

-------------------------------------------------------
id | Action_name                              | user_id
-------------------------------------------------------
1  | John reads one book                      | 1
2  | reading the book by john                 | 1
3  | Joe is jumping over fire                 | 2
4  | reading another book                     | 2
5  | John reads the book in library           | 1
6  | Joe read a book                          | 2
7  | read a book                              | 3
8  | jumping with no reason is Ronald's habit | 3
-------------------------------------------------------

Users_tbl:

-----------------------
user_id |    user_name |
-----------------------
1       |     John
2       |     Joe
3       |     Ronald
4       |     Araz
-----------------------

I'm wondering whether I can pick the most frequently repeated similar action, regardless of its user, and replace that user's name with my own user_name.

"reads one book", "reading the book", "reading another book", "reads the book in library", "read a book" and "read a book" are the actions that share the most WORDS, so the action of reading a book is repeated 6 times. My system should show one of those six sentences at random, with the user_name replaced by Araz.

Like: Araz reads the book

My idea was to run something like

SELECT REPLACE(a.Action_name, b.user_name, 'Araz') FROM Actions_tbl a JOIN Users_tbl b ON a.user_id = b.user_id

and then check the similarities one by one in php using

levenshtein()

But this approach performs very poorly!

Assume that I want to do the same thing for a big database and for a few different tables. This will destroy my server!

Any better ideas?

At http://www.artfulsoftware.com/infotree/queries.php#552 the levenshtein() function is implemented as a MySQL function. But first, do you think it performs well enough? And second, how would I use it in my case? Maybe a self-join can fix this issue, but I'm not that good with SQL!

* Similar actions are actions that have more than X% of their words in common.


**More information and notes:**

  1. I'm limited to PHP and MySQL.

  2. This is just an example; in my real project the actions are long paragraphs, which is why performance matters. The real scenario: users enter a description for each of several projects, and those descriptions may be very similar (users tend to work in the same area). I want to pre-fill the description of the next project automatically, based on previous entries, to save time.

  3. I would appreciate any pragmatic solution. I looked at NLP-related solutions; although they are interesting, I don't think many of them would be accurate or could be implemented in PHP.

  4. The output should make sense and be a proper paragraph like all other projects. That's why I was thinking of choosing from previous ones.


Thanks for your thoughtful answers; I'd really appreciate it if you could shed some light on this.

2 solutions

#1


2  

What you are talking about is a text clustering process. You are trying to find similar pieces of text, and arbitrarily choosing one of them. I am not familiar with any database that does this form of text mining.

For what you describe, a pretty basic text mining technique would probably work. Create a term-document matrix with all the words except the user names. Then use singular value decomposition to get the largest singular value and vector (this is the first principal component of the correlation matrix). The similar activities should cluster along this line.
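
A minimal sketch of that idea, in Python for illustration (the logic should port to PHP). It makes several assumptions that are not in the answer: the helper names are hypothetical, power iteration stands in for a full singular value decomposition, the matrix holds raw term counts without centring or scaling, and the sample rows are the ones from the question.

```python
import math
import re

def tokenize(text, stop_words):
    """Lowercase, keep runs of letters, and drop stop words (here: user names)."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stop_words]

def term_document_matrix(docs, stop_words=frozenset()):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    tokenized = [tokenize(d, stop_words) for d in docs]
    vocab = sorted({w for words in tokenized for w in words})
    column = {w: j for j, w in enumerate(vocab)}
    matrix = [[0.0] * len(vocab) for _ in docs]
    for i, words in enumerate(tokenized):
        for w in words:
            matrix[i][column[w]] += 1.0
    return vocab, matrix

def leading_singular_vector(matrix, iterations=100):
    """Approximate the first right singular vector by power iteration on A^T A."""
    n_docs, n_terms = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_terms)] * n_terms
    for _ in range(iterations):
        # u = A v gives one value per document; w = A^T u gives one per term.
        u = [sum(a * b for a, b in zip(row, v)) for row in matrix]
        w = [sum(matrix[i][j] * u[i] for i in range(n_docs)) for j in range(n_terms)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def project(matrix, v):
    """Score each document by its projection onto the leading direction."""
    return [sum(a * b for a, b in zip(row, v)) for row in matrix]

actions = [
    "John reads one book",
    "reading the book by john",
    "Joe is jumping over fire",
    "reading another book",
    "John reads the book in library",
    "Joe read a book",
    "read a book",
    "jumping with no reason is Ronald's habit",
]
user_names = {"john", "joe", "ronald"}

vocab, A = term_document_matrix(actions, stop_words=user_names)
v = leading_singular_vector(A)
scores = project(A, v)
for text, score in zip(actions, scores):
    print(f"{score:6.3f}  {text}")
```

On this sample the book-related rows end up with clearly larger scores than the two jumping rows, which is the clustering "along this line" described above. For long paragraphs a stop-word list and some term weighting would probably be needed as well.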

If you have a limited vocabulary and have the terms in a table, you could measure distance between two actions by the proportion of words that overlap. Do you have a list of all words in the actions?
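
The proportion of overlapping words can be sketched as a Jaccard similarity over word sets. This is only one way to define "proportion"; the function names and the `ignore` parameter (used to drop user names) are assumptions for illustration, and the code is Python rather than PHP.

```python
import re

def word_set(text, ignore=frozenset()):
    """Distinct lowercase words in the text, minus any ignored words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in ignore}

def overlap(a, b, ignore=frozenset()):
    """Jaccard similarity: shared words divided by all distinct words in either text."""
    wa, wb = word_set(a, ignore), word_set(b, ignore)
    if not wa and not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

names = {"john", "joe", "ronald"}
print(overlap("Joe read a book", "read a book", names))
print(overlap("John reads one book", "reading another book", names))
print(overlap("Joe is jumping over fire", "read a book", names))
```

Because "reads", "reading" and "read" count as different words here, a stemming step would raise the scores for the book-related rows.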

#2


1  

First off, you'll have to decide whether you want to compare a given input to all existing texts, or do a pairwise comparison of all texts. Your question asks for the latter, but the application you outline sounds more like the former.

If you compare only a single input with your database, then I'd have hoped Levenshtein distance computation to be fast enough up to medium database sizes. And there will probably be few ways to make things any faster unless you store some form of intermediate data structure about the current content of your text base. Recomputing anything for every new input will probably be just as costly.
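
As a rough sketch of the single-input case: compute the edit distance from the new text to every stored text and keep the closest few. PHP has levenshtein() built in; the Python reimplementation below and the `closest` helper are illustrative assumptions, not code from the question.

```python
def levenshtein(a, b):
    """Edit distance between two strings, keeping only two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a                      # keep the shorter string as the row
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

def closest(query, stored, k=3):
    """Return the k stored texts with the smallest edit distance to the query."""
    q = query.lower()
    return sorted(stored, key=lambda text: levenshtein(q, text.lower()))[:k]

stored = ["reading another book", "Joe is jumping over fire", "read a book"]
print(closest("read a book", stored, k=1))
```

Each comparison costs time proportional to the product of the two text lengths, so with long paragraphs this is one pass of that cost per stored row for every new input.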

If you want to do a comparison for every pair, then a levenshtein computation for each of them will take too much time. You'll have to devise some other concept of similarity. The first thing that comes to my mind, which would be somewhat resilient to different forms of a word, would be a suffix tree. You could insert all paragraphs into that tree. Where suffix trees normally store a single pointer, you might want to store a pair of indices, one identifying the database row and the other denoting a position in the text of that row. After building the tree, you could traverse it to identify common substrings, and increment some similarity counter for the corresponding pair. You'll have to experiment a bit to tune this measure. You might want to impose a minimum length for a common string before you increment a counter. As long texts have a larger chance of common words even if they are semantically unrelated, you might have to compensate for length in some way. I doubt there is a canonical way to do this.
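
A real suffix tree is too long to show here, so the sketch below swaps in a simpler structure: an index from word n-grams to (row id, position) pairs. It is not a suffix tree, but it follows the same outline: record where each fragment occurs, count the fragments a pair of rows has in common, require a minimum fragment length (`n`), and compensate for text length. All names and the normalisation are assumptions.

```python
import re
from collections import defaultdict
from itertools import combinations

def word_ngrams(text, n):
    """Yield (position, n-gram) for every run of n consecutive words."""
    words = re.findall(r"[a-z]+", text.lower())
    for pos in range(len(words) - n + 1):
        yield pos, tuple(words[pos:pos + n])

def shared_ngram_scores(rows, n=2):
    """rows maps row id -> text. Returns {(id_a, id_b): score} for pairs that share n-grams."""
    index = defaultdict(list)   # n-gram -> [(row id, position), ...]
    counts = {}                 # row id -> number of n-grams in that row
    for row_id, text in rows.items():
        grams = list(word_ngrams(text, n))
        counts[row_id] = len(grams)
        for pos, gram in grams:
            index[gram].append((row_id, pos))   # position kept to locate matches later

    shared = defaultdict(int)
    for occurrences in index.values():
        ids = sorted({row_id for row_id, _ in occurrences})
        for a, b in combinations(ids, 2):
            shared[(a, b)] += 1                 # one distinct n-gram common to both rows

    # Crude length compensation: divide by the n-gram count of the shorter row.
    return {pair: hits / min(counts[pair[0]], counts[pair[1]])
            for pair, hits in shared.items()}

rows = {3: "Joe is jumping over fire", 6: "Joe read a book", 7: "read a book"}
print(shared_ngram_scores(rows, n=2))
```

Only pairs that actually share a fragment are ever touched, which is what makes this cheaper than comparing every pair directly. Unlike a suffix tree, whole-word n-grams do not tolerate different forms of a word; character n-grams would be closer in that respect.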

The term-document matrix approach suggested by Gordon sounds interesting as well, and you should be able to implement that in PHP, too. That approach will be more sensitive to changes of word form, even if the root is the same. On the other hand, it might be easier to keep a suitable matrix for that stored in your database, and to keep that structure in sync when you update your main text table. Both of these approaches have a fundamental difference to Levenshtein distance: they care less about the overall order. I believe that this is a good thing in your case, because they'll consider the texts “John read a book after he went swimming in the lake” more similar to “After swimming in the lake, Joe read a book” than Levenshtein distance would.
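
One way to picture a stored structure that stays in sync is a (term, document id) index that is updated whenever a text is inserted, changed or deleted. The class below is an in-memory Python stand-in for such a table; the class and method names are hypothetical, and in MySQL it would presumably be an indexed two-column table.

```python
import re
from collections import Counter, defaultdict

class TermIndex:
    """In-memory stand-in for a (term, doc_id) table kept in sync with the texts."""

    def __init__(self):
        self.postings = defaultdict(set)   # term -> ids of documents containing it
        self.terms_of = {}                 # doc id -> set of that document's terms

    @staticmethod
    def terms(text):
        return set(re.findall(r"[a-z]+", text.lower()))

    def add(self, doc_id, text):
        """Insert or update one document, touching only its own terms."""
        self.remove(doc_id)
        self.terms_of[doc_id] = self.terms(text)
        for term in self.terms_of[doc_id]:
            self.postings[term].add(doc_id)

    def remove(self, doc_id):
        """Drop a document from the index; unknown ids are ignored."""
        for term in self.terms_of.pop(doc_id, set()):
            self.postings[term].discard(doc_id)

    def shared_terms(self, text):
        """Count, per stored document, how many of the query's terms it contains."""
        hits = Counter()
        for term in self.terms(text):
            for doc_id in self.postings.get(term, ()):
                hits[doc_id] += 1
        return hits

index = TermIndex()
index.add(1, "John read a book after he went swimming in the lake")
index.add(2, "Joe is jumping over fire")
print(index.shared_terms("After swimming in the lake, Joe read a book"))
```

The reordered lake sentence shares eight terms with document 1 and only the name with document 2, regardless of word order.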

Your example indicates that you not only want to rank similarities, but also decide on cluster boundaries, i.e. say “these form a group” and “those belong to distinct groups”. There won't be a clean cut-off for this, so you'll have to experiment with heuristics for that as well, unless always choosing the most similar text, or the k most similar texts, is enough for your application. In any case, I'd concentrate on the similarity computation first, and add things like user name replacement later on.
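
One possible heuristic, sketched under stated assumptions: treat two texts as belonging to the same group when their word overlap reaches a threshold (the "X%" from the question), merge groups transitively with a union-find, then pick a random member of the largest group and swap in the new user name. The function names, the Jaccard-style overlap and the threshold value are all illustrative choices, and the loop is still pairwise, only with a much cheaper comparison than edit distance.

```python
import random
import re

def words(text, ignore):
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in ignore}

def cluster(texts, threshold, ignore=frozenset()):
    """Group indices of texts whose word overlap reaches the threshold (single-link)."""
    sets = [words(t, ignore) for t in texts]
    parent = list(range(len(texts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            union = sets[i] | sets[j]
            if union and len(sets[i] & sets[j]) / len(union) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(texts)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values(), key=len, reverse=True)

def suggest(actions, threshold, new_user, rng=random):
    """actions is a list of (user_name, text). Returns a text from the largest
    group with its author's name replaced by new_user."""
    names = {name.lower() for name, _ in actions}
    largest = cluster([text for _, text in actions], threshold, names)[0]
    author, text = actions[rng.choice(largest)]
    pattern = r"\b" + re.escape(author) + r"\b"
    return re.sub(pattern, lambda _: new_user, text, flags=re.IGNORECASE)

actions = [
    ("John", "John reads one book"),
    ("John", "reading the book by john"),
    ("Joe", "Joe is jumping over fire"),
    ("Joe", "reading another book"),
    ("John", "John reads the book in library"),
    ("Joe", "Joe read a book"),
    ("Ronald", "read a book"),
    ("Ronald", "jumping with no reason is Ronald's habit"),
]
print(suggest(actions, threshold=0.15, new_user="Araz"))
```

With a threshold of 0.15 the six book-related rows from the question fall into one group and the two jumping rows into another; a higher threshold splits the book rows apart, which is the kind of tuning mentioned above.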
