Find possible duplicates in two columns, ignoring case and special characters

Time: 2021-11-30 22:44:23

Query

SELECT COUNT(*), name, number
FROM   tbl
GROUP  BY name, number
HAVING COUNT(*) > 1

This query sometimes fails to find duplicates that differ only in letter case.
E.g., sunny and Sunny don't show up as duplicates.
So how can I find all possible duplicates in PostgreSQL across two columns?

3 Answers

#1


14  

lower()/ upper()

Use one of these to fold characters to either lower or upper case. Special characters are not affected:

SELECT count(*), lower(name), number
FROM   tbl
GROUP  BY lower(name), number
HAVING count(*) > 1;
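
When reviewing the hits it can help to see which original spellings were folded together. A minimal sketch, assuming the table tbl and the columns name and number from the question; the output column names are made up for illustration:

```sql
-- Sketch: one row per case-insensitive duplicate group,
-- with the original spellings collected into an array.
SELECT lower(name)                   AS name_folded,  -- hypothetical alias
       number,
       count(*)                      AS ct,
       array_agg(name ORDER BY name) AS spellings     -- original values in the group
FROM   tbl
GROUP  BY lower(name), number
HAVING count(*) > 1;
```

The grouping is the same as in the query above; array_agg() only adds the raw values so you can judge whether the rows are real duplicates.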

unaccent()

If you actually want to ignore diacritic signs, like your comments imply, install the additional module unaccent, which provides a text search dictionary that removes accents and also the general purpose function unaccent():

CREATE EXTENSION unaccent;

That makes it very simple:

SELECT lower(unaccent('Büßercafé')) AS norm

Result:

busercafe

This doesn't strip non-letters. For that, add regexp_replace() as @Craig mentioned:

SELECT lower(unaccent(regexp_replace('$s^o&f!t Büßercafé', '\W', '', 'g'))) AS norm;

Result:

softbusercafe
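
Applied to the original question, the same expression can replace lower(name) in the duplicate search. This is a sketch that assumes the unaccent extension is already installed and reuses tbl, name and number from the question:

```sql
-- Sketch: group on the fully normalized name (symbols stripped,
-- accents removed, lower-cased) together with number.
SELECT lower(unaccent(regexp_replace(name, '\W', '', 'g'))) AS norm,
       number,
       count(*) AS ct
FROM   tbl
GROUP  BY 1, 2          -- ordinal references to norm and number
HAVING count(*) > 1;
```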

You can even build a functional index on top of that expression.
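
A possible shape for such an index, as a sketch only. unaccent() is declared STABLE rather than IMMUTABLE, so it cannot be used in an index expression directly; the usual workaround is an IMMUTABLE wrapper. The names f_unaccent and tbl_name_norm_idx are hypothetical, and the code assumes the extension lives in schema public:

```sql
-- Hypothetical wrapper: pins the dictionary explicitly and is declared
-- IMMUTABLE so that it is allowed in an index expression.
-- Assumes the unaccent extension is installed in schema public.
CREATE OR REPLACE FUNCTION f_unaccent(text)
  RETURNS text
  LANGUAGE sql IMMUTABLE STRICT AS
$func$
SELECT public.unaccent('public.unaccent', $1)
$func$;

-- Hypothetical index on the normalized name plus number.
CREATE INDEX tbl_name_norm_idx
ON tbl (lower(f_unaccent(regexp_replace(name, '\W', '', 'g'))), number);
```

Queries would have to use exactly the same expression, including f_unaccent() instead of unaccent(), for the planner to consider the index.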

#2


3  

String comparisons in PostgreSQL are case-sensitive by default. You can make a search case-insensitive by converting all values to a single case:

SELECT COUNT(*), lower(name), number FROM tbl
GROUP BY lower(name), number HAVING COUNT(*) > 1;
  • NOTE: This has not been tested in Postgres

#3


1  

(Updated answer after clarification from the poster): The idea of "unaccenting", or stripping accents (diacritics), is generally bogus. It's OK-ish if you're matching data to find out whether some misguided user or application munged résumé into resume, but it's totally wrong to change one into the other, as they're different words. Even then it will only partly work, and it should be combined with a string-similarity matching system such as trigrams or Levenshtein distance.
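
As a rough illustration of the trigram approach, here is a sketch that assumes the additional module pg_trgm can be installed and reuses tbl, name and number from the question. The 0.6 threshold is an arbitrary assumption and would need tuning against real data:

```sql
-- Sketch: pair up rows with the same number whose names are similar
-- by trigram similarity. Requires the pg_trgm extension.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

SELECT a.name AS name_a,
       b.name AS name_b,
       a.number,
       similarity(a.name, b.name) AS sim
FROM   tbl a
JOIN   tbl b ON  b.number = a.number
             AND a.name < b.name       -- each unordered pair once, identical strings skipped
WHERE  similarity(a.name, b.name) > 0.6  -- arbitrary threshold (assumption)
ORDER  BY sim DESC;
```

This compares every pair of rows sharing a number, so it is only practical as written on modest tables.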

The idea of "unaccenting" presumes that any accented character has a single valid unaccented equivalent, or at least that any given accented character is replaced with at most one unaccented character in an ASCII-ized representation of the word. That simply isn't true; in one language ö might be a "u" sound, while in another it might be a long "oo", and the ASCII-ized spelling conventions might reflect that. Thus, in one language the correct "un-accenting" of the made-up dummy word "Tapö" might be "Tapu", and in another this imaginary word might be ASCII-ized to "Tapoo". In neither case will the "un-accented" form "Tapo" match what people actually write when forced into the ASCII character set. Words with diacritics may also be ASCII-ized into a hyphenated word.

You can see this in English with ligatures, where the word dæmon is ASCII-ized as daemon. If you stripped the ligature you'd get dmon, which wouldn't match daemon, the common spelling. The same is true of æther, which is typically ASCII-ized to aether or ether. You can also see this in German with ß, typically "expanded" to ss.

If you must attempt to "un-accent", "normalize" accents or "strip" accents:

You can use a character class regular expression to strip out all but a specified set of characters. In this case we use the \W escape (shorthand for the character class [^[:alnum:]_] as per the manual) to exclude "symbols" but not accented characters:

regress=# SELECT regexp_replace(lower(x),'\W','','g') 
          FROM ( VALUES ('$s^o&f!t'),('Café') ) vals(x);
 regexp_replace 
----------------
 soft
 café
(2 rows)

If you want to filter out accented chars too you can define your own character class:

regress=# SELECT regexp_replace(lower(x),'[^a-z0-9]','','g')
          FROM ( VALUES ('$s^o&f!t'),('Café') ) vals(x);
 regexp_replace 
----------------
 soft
 caf
(2 rows)

If you actually intend to replace some accented characters with similar unaccented characters, you could use translate() as per this wiki article:

regress=# SELECT translate(
        lower(x),
        'âãäåāăąÁÂÃÄÅĀĂĄèééêëēĕėęěĒĔĖĘĚìíîïìĩīĭÌÍÎÏÌĨĪĬóôõöōŏőÒÓÔÕÖŌŎŐùúûüũūŭůÙÚÛÜŨŪŬŮ',
        'aaaaaaaaaaaaaaaeeeeeeeeeeeeeeeiiiiiiiiiiiiiiiiooooooooooooooouuuuuuuuuuuuuuuu'
    )
    FROM ( VALUES ('$s^o&f!t'),('Café') ) vals(x);

 translate 
-----------
 $s^o&f!t
 cafe
(2 rows)
