正则表达式匹配第一个非重复字符。

时间:2022-08-28 21:45:39

TL;DR

博士TL;

re.search("(.)(?!.*\1)", text).group() doesn't match the first non-repeating character contained in text (it always returns a character at or before the first non-repeated character, or before the end of the string if there are no non-repeated characters. My understanding is that re.search() should return None if there were no matches). I'm only interested in understanding why this regex is not working as intended using the Python re module, not in any other method of solving the problem

re.search(“(.)(? ! * \ 1)。group()不匹配文本中包含的第一个非重复字符(如果没有非重复字符,它总是在第一个非重复字符前或第一个非重复字符前返回字符,或者在字符串结束前返回字符。我的理解是,如果没有匹配,re.search()应该不返回任何值)。我感兴趣的是理解为什么这个regex没有按照预期使用Python re模块工作,而不是使用任何其他解决问题的方法

Full Background

完整的背景

The problem description comes from https://www.codeeval.com/open_challenges/12/. I've already solved this problem using a non-regex method, but revisited it to expand my understanding of Python's re module. The regular expressions i thought would work (named vs unnamed backreferences) are:

问题描述来自https://www.codeeval.com/open_challenger /12/。我已经使用非regex方法解决了这个问题,但是重新访问它以扩展我对Python re模块的理解。我认为有用的正则表达式(命名和未命名的反向引用)是:

(?P<letter>.)(?!.*(?P=letter)) and (.)(?!.*\1) (same results in python2 and python3)

(? P <信> 。)(? !。*(? P =信))和()。(? ! . * \ 1)(python2和python3相同的结果)

My entire program looks like this

整个程序是这样的

import re
import sys
with open(sys.argv[1], 'r') as test_cases:
    for test in test_cases:
        print(re.search("(?P<letter>.)(?!.*(?P=letter))",
                        test.strip()
                       ).group()
             )

and some input/output pairs are:

一些输入/输出对是:

rain | r
teetthing | e
cardiff | c
kangaroo | k
god | g
newtown | e
taxation | x
refurbished | f
substantially | u

According to what I've read at https://docs.python.org/2/library/re.html:

根据我在https://docs.python.org/2/library/re.html上读到的:

  • (.) creates a named group that matches any character and allows later backreferences to it as \1.
  • (.)创建一个命名组,该组与任何字符匹配,并允许以后以\1的形式对其进行反向引用。
  • (?!...) is a negative lookahead which restricts matches to cases where ... does not match.
  • (?!)是一种消极的前瞻,它将匹配限制在……不匹配。
  • .*\1 means any number (including zero) of characters followed by whatever was matched by (.) earlier
  • 。*\1是指前面与(.)匹配的任何字符(包括零)后面的任何数字(包括零)
  • re.search(pattern, string) returns only the first location where the regex pattern produces a match (and would return None if no match could be found)
  • search(pattern, string)只返回regex模式生成匹配项的第一个位置(如果找不到匹配项,则返回None)
  • .group() is equivalent to .group(0) which returns the entire match
  • group()等价于.group(0),它返回整个匹配

I think these pieces together should solve the stated problem, and it does work like I think it should for most inputs, but failed on teething. Throwing similar problems at it reveals that it seems to ignore repeated characters if they are consecutive:

我认为这两部分结合在一起就可以解决问题了,而且它确实像我认为的那样适用于大多数输入,但在开始阶段就失败了。向它抛出类似的问题表明,如果重复字符是连续的,它似乎会忽略重复字符:

tooth | o      # fails on consecutive repeated characters
aardvark | d   # but does ok if it sees them later
aah | a        # verified last one didn't work just because it was at start
heh | e        # but it works for this one
hehe | h       # What? It thinks h matches (lookahead maybe doesn't find "heh"?)
heho | e       # but it definitely finds "heh" and stops "h" from matching here
hahah | a      # so now it won't match h but will match a
hahxyz | a     # but it realizes there are 2 h characters here...
hahxyza | h    # ... Ok time for *

I know lookbehind and negative lookbehind are limited to 3-character-max fixed length strings, and cannot contain backreferences even if they evaluate to a fixed length string, but I didn't see the documentation specify any restrictions on negative lookahead.

我知道lookbehind和negative lookbehind都被限制为3字符最多的固定长度字符串,即使它们计算的是一个固定长度的字符串,也不能包含反向引用,但我没有看到文档对负面的lookahead有任何限制。

4 个解决方案

#1


13  

Well let's take your tooth example - here is what the regex-engine does (a lot simplified for better understanding)

让我们以您的牙齿为例——这是regex引擎的功能(为了更好的理解,简化了很多)

Start with t then look ahead in the string - and fail the lookahead, as there is another t.

从t开始,然后在字符串中向前看——失败的前视,因为还有另一个t。

tooth
^  °

Next take o, look ahead in the string - and fail, as there is another o.

接下来取o,在字符串中向前看——然后失败,因为还有另一个o。

tooth
 ^°

Next take the second o, look ahead in the string - no other o present - match it, return it, work done.

下一个是第二个o,在弦上向前看——没有其他o现在——匹配它,返回它,完成工作。

tooth
  ^

So your regex doesn't match the first unrepeated character, but the first one, that has no further repetitions towards the end of the string.

所以你的正则表达式不匹配第一个不重复的字符,但是第一个字符,在字符串末尾没有重复。

#2


11  

Sebastian's answer already explains pretty well why your current attempt doesn't work.

Sebastian的回答已经很好地解释了为什么你现在的尝试没有成功。

.NET

Since you're revo is interested in a .NET flavor workaround, the solution becomes trivial:

既然revo对。net风格的解决方案感兴趣,那么解决方案就变得微不足道了:

(?<letter>.)(?!.*?\k<letter>)(?<!\k<letter>.+?)

Demo link

演示链接

This works because .NET supports variable-length lookbehinds. You can also get that result with Python (see below).

这是因为。net支持可变长度的外观。您还可以使用Python获得该结果(参见下面)。

So for each letter (?<letter>.) we check:

所以对于每一个字母(? <字母> )我们检查:

  • if it's repeated further in the input (?!.*?\k<letter>)
  • 如果它在输入中被重复(?! *?\k <字母> )
  • if it was already encountered before (?<!\k<letter>.+?)
    (we have to skip the letter we're testing when going backwards, hence the +).
  • 如果它之前已经遇到(? .+?)

Python

The Python regex module also supports variable-length lookbehinds, so the regex above will work with a small syntactical change: you need to replace \k with \g (which is quite unfortunate as with this module \g is a group backreference, whereas with PCRE it's a recursion).

Python regex模块还支持可变长度的查找,因此上面的regex将使用一个小的语法更改:您需要用\g替换\k(这很不幸,因为这个模块\g是一个组回引用,而PCRE是一个递归)。

The regex is:

正则表达式是:

(?<letter>.)(?!.*?\g<letter>)(?<!\g<letter>.+?)

And here's an example:

这里是一个例子:

$ python
Python 2.7.10 (default, Jun  1 2015, 18:05:38)
[GCC 4.9.2] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
>>> regex.search(r'(?<letter>.)(?!.*?\g<letter>)(?<!\g<letter>.+?)', 'tooth')
<regex.Match object; span=(4, 5), match='h'>

PCRE

Ok, now things start to get dirty: since PCRE doesn't support variable-length lookbehinds, we need to somehow remember whether a given letter was already encountered in the input or not.

好了,现在事情开始变得复杂起来:由于PCRE不支持可变长度的外观,我们需要以某种方式记住输入中是否已经遇到了给定的字母。

Unfortunately, the regex engine doesn't provide random access memory support. The best we can get in terms of generic memory is a stack - but that's not sufficient for this purpose, as a stack only lets us access its topmost element.

不幸的是,regex引擎不提供随机访问内存支持。就一般内存而言,我们能得到的最好的内存是堆栈——但这还不够,因为堆栈只允许我们访问它的最顶层元素。

If we accept to restrain ourselves to a given alphabet, we can abuse capturing groups for the purpose of storing flags. Let's see this on a limited alphabet of the three letters abc:

如果我们接受将自己限制在给定的字母表中,我们可以滥用捕获组来存储标志。我们来看看abc三个字母的有限字母:

# Anchor the pattern
\A

# For each letter, test to see if it's duplicated in the input string
(?(?=[^a]*+a[^a]*a)(?<da>))
(?(?=[^b]*+b[^b]*b)(?<db>))
(?(?=[^c]*+c[^c]*c)(?<dc>))

# Skip any duplicated letter and throw it away
[a-c]*?\K

# Check if the next letter is a duplicate
(?:
  (?(da)(*FAIL)|a)
| (?(db)(*FAIL)|b)
| (?(dc)(*FAIL)|c)
)

Here's how that works:

这是如何工作的:

  • First, the \A anchor ensures we'll process the input string only once
  • 首先,\锚确保我们只处理一次输入字符串
  • Then, for each letter X of our alphabet, we'll set up a is duplicate flag dX:
    • The conditional pattern (?(cond)then|else) is used there:
      • The condition is (?=[^X]*+X[^X]*X) which is true if the input string contains the letter X twice.
      • 条件(? =(^ X)* + X[X]^ *)这是事实如果输入字符串包含字母X的两倍。
      • If the condition is true, the then clause is (?<dX>), which is an empty capturing group that will match the empty string.
      • 如果条件为真,则then子句为(? ),它是一个空捕获组,将匹配空字符串。
      • If the condition is false, the dX group won't be matched
      • 如果条件为false,则不会匹配dX组
    • 条件模式(?(电导率),那么其他|):使用条件(? =(^ X)* + X[X]^ *)这是事实如果输入字符串包含字母X的两倍。如果条件为真,则then子句为(? ),它是一个空捕获组,将匹配空字符串。如果条件为false,则不会匹配dX组
    • Next, we lazily skip valid letters from our alphabet: [a-c]*?
    • 接下来,我们懒洋洋地跳过字母表中的有效字母:[a-c]*?
    • And we throw them out in the final match with \K
    • 我们在最后一场比赛中把他们扔出去。
    • Now, we're trying to match one letter whose dX flag is not set. For this purpose, we'll do a conditional branch: (?(dX)(*FAIL)|X)
      • If dX was matched (meaning that X is a duplicated character), we (*FAIL), forcing the engine to backtrack and try a different letter.
      • 如果dX被匹配(意味着X是一个重复的字符),我们(*失败),迫使引擎后退并尝试另一个字母。
      • If dX was not matched, we try to match X. At this point, if this succeeds, we know that X is the first non-duplicated letter.
      • 如果dX不匹配,我们试着匹配X,此时,如果这个成功,我们知道X是第一个非重复的字母。
    • 现在,我们正在尝试匹配一个字母,它的dX标志没有设置。为此,我们将做一个条件分支:(?(?)(*FAIL)|X)如果匹配dX(意味着X是一个重复的字符),我们(*失败),迫使引擎返回并尝试不同的字母。如果dX不匹配,我们试着匹配X,此时,如果这个成功,我们知道X是第一个非重复的字母。
  • 对于每个字母X的字母,我们建立了一个重复的国旗dX:条件模式(?(电导率)那么其他|):使用条件(? =(^ X)* + X[X]^ *)这是事实如果输入字符串包含字母X的两倍。如果条件为真,则then子句为(? ),它是一个空捕获组,将匹配空字符串。如果条件是假的,那么dX组将无法匹配,我们会延迟地跳过字母表中的有效字母:[a-c]*?我们扔在决赛\ K,现在我们要匹配一个字母的dX国旗没有设置。为此,我们将做一个条件分支:(?(dX)(*失败)| X)如果dX是匹配(即X是一个重复的字符),我们(*失败),迫使发动机回溯和尝试不同的信。如果dX不匹配,我们试着匹配X,此时,如果这个成功,我们知道X是第一个非重复的字母。

That last part of the pattern could also be replaced with:

该模式的最后一部分也可以被替换为:

(?:
  a (*THEN) (?(da)(*FAIL))
| b (*THEN) (?(db)(*FAIL))
| c (*THEN) (?(dc)(*FAIL))
)

Which is somewhat more optimized. It matches the current letter first and only then checks if it's a duplicate.

这是更优化的。它首先匹配当前的字母,然后只检查它是否是重复的。

The full pattern for the lowercase letters a-z looks like this:

小写字母a-z的完整模式如下:

# Anchor the pattern
\A

# For each letter, test to see if it's duplicated in the input string
(?(?=[^a]*+a[^a]*a)(?<da>))
(?(?=[^b]*+b[^b]*b)(?<db>))
(?(?=[^c]*+c[^c]*c)(?<dc>))
(?(?=[^d]*+d[^d]*d)(?<dd>))
(?(?=[^e]*+e[^e]*e)(?<de>))
(?(?=[^f]*+f[^f]*f)(?<df>))
(?(?=[^g]*+g[^g]*g)(?<dg>))
(?(?=[^h]*+h[^h]*h)(?<dh>))
(?(?=[^i]*+i[^i]*i)(?<di>))
(?(?=[^j]*+j[^j]*j)(?<dj>))
(?(?=[^k]*+k[^k]*k)(?<dk>))
(?(?=[^l]*+l[^l]*l)(?<dl>))
(?(?=[^m]*+m[^m]*m)(?<dm>))
(?(?=[^n]*+n[^n]*n)(?<dn>))
(?(?=[^o]*+o[^o]*o)(?<do>))
(?(?=[^p]*+p[^p]*p)(?<dp>))
(?(?=[^q]*+q[^q]*q)(?<dq>))
(?(?=[^r]*+r[^r]*r)(?<dr>))
(?(?=[^s]*+s[^s]*s)(?<ds>))
(?(?=[^t]*+t[^t]*t)(?<dt>))
(?(?=[^u]*+u[^u]*u)(?<du>))
(?(?=[^v]*+v[^v]*v)(?<dv>))
(?(?=[^w]*+w[^w]*w)(?<dw>))
(?(?=[^x]*+x[^x]*x)(?<dx>))
(?(?=[^y]*+y[^y]*y)(?<dy>))
(?(?=[^z]*+z[^z]*z)(?<dz>))

# Skip any duplicated letter and throw it away
[a-z]*?\K

# Check if the next letter is a duplicate
(?:
  a (*THEN) (?(da)(*FAIL))
| b (*THEN) (?(db)(*FAIL))
| c (*THEN) (?(dc)(*FAIL))
| d (*THEN) (?(dd)(*FAIL))
| e (*THEN) (?(de)(*FAIL))
| f (*THEN) (?(df)(*FAIL))
| g (*THEN) (?(dg)(*FAIL))
| h (*THEN) (?(dh)(*FAIL))
| i (*THEN) (?(di)(*FAIL))
| j (*THEN) (?(dj)(*FAIL))
| k (*THEN) (?(dk)(*FAIL))
| l (*THEN) (?(dl)(*FAIL))
| m (*THEN) (?(dm)(*FAIL))
| n (*THEN) (?(dn)(*FAIL))
| o (*THEN) (?(do)(*FAIL))
| p (*THEN) (?(dp)(*FAIL))
| q (*THEN) (?(dq)(*FAIL))
| r (*THEN) (?(dr)(*FAIL))
| s (*THEN) (?(ds)(*FAIL))
| t (*THEN) (?(dt)(*FAIL))
| u (*THEN) (?(du)(*FAIL))
| v (*THEN) (?(dv)(*FAIL))
| w (*THEN) (?(dw)(*FAIL))
| x (*THEN) (?(dx)(*FAIL))
| y (*THEN) (?(dy)(*FAIL))
| z (*THEN) (?(dz)(*FAIL))
)

And here's the demo on regex101, complete with unit tests.

这是regex101的演示,包含单元测试。

You can expand on this pattern if you need a larger alphabet, but obviously this is not a general-purpose solution. It's primarily of educational interest and should not be used for any serious application.

如果需要更大的字母表,可以扩展此模式,但显然这不是通用的解决方案。它主要是教育意义,不应该用于任何严肃的应用。


For other flavors, you may try to tweak the pattern to replace PCRE features with simpler equivalents:

对于其他口味,您可以尝试调整模式,用更简单的对等物替换PCRE特性:

  • \A becomes ^
  • \一变成了^
  • X (*THEN) (?(dX)(*FAIL)) can be replaced with (?(dX)(?!)|X)
  • X(*)(?(dX)(*失败))可以取代(?(dX)(? !)| X)
  • You may throw away the \K and replace the last noncapturnig group (?:...) with a named group like (?<letter>...) and treat its content as the result.
  • 您可能会丢弃\K并将最后一个非capturnig组(?:…)替换为一个名为(? …)的命名组,并将其内容作为结果。

The only required but somewhat unusual construct is the conditional group (?(cond)then|else).

唯一需要但有点不寻常的构造是条件组(?(cond)然后|else)。

#3


5  

Regular expressions are not optimal for the task even if you use alternative implementations of re that do not limit lookbehind by fixed length strings (such as Matthew Barnett's regex).

正则表达式对于任务来说不是最优的,即使您使用的是不受固定长度字符串(如Matthew Barnett的regex)限制的re的替代实现。

The easiest way is to count occurrences of letters and print the first one with frequency equal to 1:

最简单的方法是计算字母的出现次数,并将第一个频率为1的字母打印出来。

import sys
from collections import Counter, OrderedDict

# Counter that remembers that remembers the order entries were added
class OrderedCounter(Counter, OrderedDict):
    pass

# Calling next() once only gives the first entry
first=next

with open(sys.argv[1], 'r') as test_cases:
    for test in test_cases:
        lettfreq = OrderedCounter(test)
        print(first((l for l in lettfreq if lettfreq[l] == 1)))

#4


3  

The reason why your regex is not working is that it will not match a character that is followed by the same character, but there is nothing to prevent it from matching a character that isn't followed by the same character, even if it is preceded by the same character.

regex不工作的原因是它不会匹配后面跟着相同字符的字符,但是没有什么可以阻止它匹配后面跟着相同字符的字符,即使前面有相同字符。

#1


13  

Well let's take your tooth example - here is what the regex-engine does (a lot simplified for better understanding)

让我们以您的牙齿为例——这是regex引擎的功能(为了更好的理解,简化了很多)

Start with t then look ahead in the string - and fail the lookahead, as there is another t.

从t开始,然后在字符串中向前看——失败的前视,因为还有另一个t。

tooth
^  °

Next take o, look ahead in the string - and fail, as there is another o.

接下来取o,在字符串中向前看——然后失败,因为还有另一个o。

tooth
 ^°

Next take the second o, look ahead in the string - no other o present - match it, return it, work done.

下一个是第二个o,在弦上向前看——没有其他o现在——匹配它,返回它,完成工作。

tooth
  ^

So your regex doesn't match the first unrepeated character, but the first one, that has no further repetitions towards the end of the string.

所以你的正则表达式不匹配第一个不重复的字符,但是第一个字符,在字符串末尾没有重复。

#2


11  

Sebastian's answer already explains pretty well why your current attempt doesn't work.

Sebastian的回答已经很好地解释了为什么你现在的尝试没有成功。

.NET

Since you're revo is interested in a .NET flavor workaround, the solution becomes trivial:

既然revo对。net风格的解决方案感兴趣,那么解决方案就变得微不足道了:

(?<letter>.)(?!.*?\k<letter>)(?<!\k<letter>.+?)

Demo link

演示链接

This works because .NET supports variable-length lookbehinds. You can also get that result with Python (see below).

这是因为。net支持可变长度的外观。您还可以使用Python获得该结果(参见下面)。

So for each letter (?<letter>.) we check:

所以对于每一个字母(? <字母> )我们检查:

  • if it's repeated further in the input (?!.*?\k<letter>)
  • 如果它在输入中被重复(?! *?\k <字母> )
  • if it was already encountered before (?<!\k<letter>.+?)
    (we have to skip the letter we're testing when going backwards, hence the +).
  • 如果它之前已经遇到(? .+?)

Python

The Python regex module also supports variable-length lookbehinds, so the regex above will work with a small syntactical change: you need to replace \k with \g (which is quite unfortunate as with this module \g is a group backreference, whereas with PCRE it's a recursion).

Python regex模块还支持可变长度的查找,因此上面的regex将使用一个小的语法更改:您需要用\g替换\k(这很不幸,因为这个模块\g是一个组回引用,而PCRE是一个递归)。

The regex is:

正则表达式是:

(?<letter>.)(?!.*?\g<letter>)(?<!\g<letter>.+?)

And here's an example:

这里是一个例子:

$ python
Python 2.7.10 (default, Jun  1 2015, 18:05:38)
[GCC 4.9.2] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
>>> regex.search(r'(?<letter>.)(?!.*?\g<letter>)(?<!\g<letter>.+?)', 'tooth')
<regex.Match object; span=(4, 5), match='h'>

PCRE

Ok, now things start to get dirty: since PCRE doesn't support variable-length lookbehinds, we need to somehow remember whether a given letter was already encountered in the input or not.

好了,现在事情开始变得复杂起来:由于PCRE不支持可变长度的外观,我们需要以某种方式记住输入中是否已经遇到了给定的字母。

Unfortunately, the regex engine doesn't provide random access memory support. The best we can get in terms of generic memory is a stack - but that's not sufficient for this purpose, as a stack only lets us access its topmost element.

不幸的是,regex引擎不提供随机访问内存支持。就一般内存而言,我们能得到的最好的内存是堆栈——但这还不够,因为堆栈只允许我们访问它的最顶层元素。

If we accept to restrain ourselves to a given alphabet, we can abuse capturing groups for the purpose of storing flags. Let's see this on a limited alphabet of the three letters abc:

如果我们接受将自己限制在给定的字母表中,我们可以滥用捕获组来存储标志。我们来看看abc三个字母的有限字母:

# Anchor the pattern
\A

# For each letter, test to see if it's duplicated in the input string
(?(?=[^a]*+a[^a]*a)(?<da>))
(?(?=[^b]*+b[^b]*b)(?<db>))
(?(?=[^c]*+c[^c]*c)(?<dc>))

# Skip any duplicated letter and throw it away
[a-c]*?\K

# Check if the next letter is a duplicate
(?:
  (?(da)(*FAIL)|a)
| (?(db)(*FAIL)|b)
| (?(dc)(*FAIL)|c)
)

Here's how that works:

这是如何工作的:

  • First, the \A anchor ensures we'll process the input string only once
  • 首先,\锚确保我们只处理一次输入字符串
  • Then, for each letter X of our alphabet, we'll set up a is duplicate flag dX:
    • The conditional pattern (?(cond)then|else) is used there:
      • The condition is (?=[^X]*+X[^X]*X) which is true if the input string contains the letter X twice.
      • 条件(? =(^ X)* + X[X]^ *)这是事实如果输入字符串包含字母X的两倍。
      • If the condition is true, the then clause is (?<dX>), which is an empty capturing group that will match the empty string.
      • 如果条件为真,则then子句为(? ),它是一个空捕获组,将匹配空字符串。
      • If the condition is false, the dX group won't be matched
      • 如果条件为false,则不会匹配dX组
    • 条件模式(?(电导率),那么其他|):使用条件(? =(^ X)* + X[X]^ *)这是事实如果输入字符串包含字母X的两倍。如果条件为真,则then子句为(? ),它是一个空捕获组,将匹配空字符串。如果条件为false,则不会匹配dX组
    • Next, we lazily skip valid letters from our alphabet: [a-c]*?
    • 接下来,我们懒洋洋地跳过字母表中的有效字母:[a-c]*?
    • And we throw them out in the final match with \K
    • 我们在最后一场比赛中把他们扔出去。
    • Now, we're trying to match one letter whose dX flag is not set. For this purpose, we'll do a conditional branch: (?(dX)(*FAIL)|X)
      • If dX was matched (meaning that X is a duplicated character), we (*FAIL), forcing the engine to backtrack and try a different letter.
      • 如果dX被匹配(意味着X是一个重复的字符),我们(*失败),迫使引擎后退并尝试另一个字母。
      • If dX was not matched, we try to match X. At this point, if this succeeds, we know that X is the first non-duplicated letter.
      • 如果dX不匹配,我们试着匹配X,此时,如果这个成功,我们知道X是第一个非重复的字母。
    • 现在,我们正在尝试匹配一个字母,它的dX标志没有设置。为此,我们将做一个条件分支:(?(?)(*FAIL)|X)如果匹配dX(意味着X是一个重复的字符),我们(*失败),迫使引擎返回并尝试不同的字母。如果dX不匹配,我们试着匹配X,此时,如果这个成功,我们知道X是第一个非重复的字母。
  • 对于每个字母X的字母,我们建立了一个重复的国旗dX:条件模式(?(电导率)那么其他|):使用条件(? =(^ X)* + X[X]^ *)这是事实如果输入字符串包含字母X的两倍。如果条件为真,则then子句为(? ),它是一个空捕获组,将匹配空字符串。如果条件是假的,那么dX组将无法匹配,我们会延迟地跳过字母表中的有效字母:[a-c]*?我们扔在决赛\ K,现在我们要匹配一个字母的dX国旗没有设置。为此,我们将做一个条件分支:(?(dX)(*失败)| X)如果dX是匹配(即X是一个重复的字符),我们(*失败),迫使发动机回溯和尝试不同的信。如果dX不匹配,我们试着匹配X,此时,如果这个成功,我们知道X是第一个非重复的字母。

That last part of the pattern could also be replaced with:

该模式的最后一部分也可以被替换为:

(?:
  a (*THEN) (?(da)(*FAIL))
| b (*THEN) (?(db)(*FAIL))
| c (*THEN) (?(dc)(*FAIL))
)

Which is somewhat more optimized. It matches the current letter first and only then checks if it's a duplicate.

这是更优化的。它首先匹配当前的字母,然后只检查它是否是重复的。

The full pattern for the lowercase letters a-z looks like this:

小写字母a-z的完整模式如下:

# Anchor the pattern
\A

# For each letter, test to see if it's duplicated in the input string
(?(?=[^a]*+a[^a]*a)(?<da>))
(?(?=[^b]*+b[^b]*b)(?<db>))
(?(?=[^c]*+c[^c]*c)(?<dc>))
(?(?=[^d]*+d[^d]*d)(?<dd>))
(?(?=[^e]*+e[^e]*e)(?<de>))
(?(?=[^f]*+f[^f]*f)(?<df>))
(?(?=[^g]*+g[^g]*g)(?<dg>))
(?(?=[^h]*+h[^h]*h)(?<dh>))
(?(?=[^i]*+i[^i]*i)(?<di>))
(?(?=[^j]*+j[^j]*j)(?<dj>))
(?(?=[^k]*+k[^k]*k)(?<dk>))
(?(?=[^l]*+l[^l]*l)(?<dl>))
(?(?=[^m]*+m[^m]*m)(?<dm>))
(?(?=[^n]*+n[^n]*n)(?<dn>))
(?(?=[^o]*+o[^o]*o)(?<do>))
(?(?=[^p]*+p[^p]*p)(?<dp>))
(?(?=[^q]*+q[^q]*q)(?<dq>))
(?(?=[^r]*+r[^r]*r)(?<dr>))
(?(?=[^s]*+s[^s]*s)(?<ds>))
(?(?=[^t]*+t[^t]*t)(?<dt>))
(?(?=[^u]*+u[^u]*u)(?<du>))
(?(?=[^v]*+v[^v]*v)(?<dv>))
(?(?=[^w]*+w[^w]*w)(?<dw>))
(?(?=[^x]*+x[^x]*x)(?<dx>))
(?(?=[^y]*+y[^y]*y)(?<dy>))
(?(?=[^z]*+z[^z]*z)(?<dz>))

# Skip any duplicated letter and throw it away
[a-z]*?\K

# Check if the next letter is a duplicate
(?:
  a (*THEN) (?(da)(*FAIL))
| b (*THEN) (?(db)(*FAIL))
| c (*THEN) (?(dc)(*FAIL))
| d (*THEN) (?(dd)(*FAIL))
| e (*THEN) (?(de)(*FAIL))
| f (*THEN) (?(df)(*FAIL))
| g (*THEN) (?(dg)(*FAIL))
| h (*THEN) (?(dh)(*FAIL))
| i (*THEN) (?(di)(*FAIL))
| j (*THEN) (?(dj)(*FAIL))
| k (*THEN) (?(dk)(*FAIL))
| l (*THEN) (?(dl)(*FAIL))
| m (*THEN) (?(dm)(*FAIL))
| n (*THEN) (?(dn)(*FAIL))
| o (*THEN) (?(do)(*FAIL))
| p (*THEN) (?(dp)(*FAIL))
| q (*THEN) (?(dq)(*FAIL))
| r (*THEN) (?(dr)(*FAIL))
| s (*THEN) (?(ds)(*FAIL))
| t (*THEN) (?(dt)(*FAIL))
| u (*THEN) (?(du)(*FAIL))
| v (*THEN) (?(dv)(*FAIL))
| w (*THEN) (?(dw)(*FAIL))
| x (*THEN) (?(dx)(*FAIL))
| y (*THEN) (?(dy)(*FAIL))
| z (*THEN) (?(dz)(*FAIL))
)

And here's the demo on regex101, complete with unit tests.

这是regex101的演示,包含单元测试。

You can expand on this pattern if you need a larger alphabet, but obviously this is not a general-purpose solution. It's primarily of educational interest and should not be used for any serious application.

如果需要更大的字母表,可以扩展此模式,但显然这不是通用的解决方案。它主要是教育意义,不应该用于任何严肃的应用。


For other flavors, you may try to tweak the pattern to replace PCRE features with simpler equivalents:

对于其他口味,您可以尝试调整模式,用更简单的对等物替换PCRE特性:

  • \A becomes ^
  • \一变成了^
  • X (*THEN) (?(dX)(*FAIL)) can be replaced with (?(dX)(?!)|X)
  • X(*)(?(dX)(*失败))可以取代(?(dX)(? !)| X)
  • You may throw away the \K and replace the last noncapturnig group (?:...) with a named group like (?<letter>...) and treat its content as the result.
  • 您可能会丢弃\K并将最后一个非capturnig组(?:…)替换为一个名为(? …)的命名组,并将其内容作为结果。

The only required but somewhat unusual construct is the conditional group (?(cond)then|else).

唯一需要但有点不寻常的构造是条件组(?(cond)然后|else)。

#3


5  

Regular expressions are not optimal for the task even if you use alternative implementations of re that do not limit lookbehind by fixed length strings (such as Matthew Barnett's regex).

正则表达式对于任务来说不是最优的,即使您使用的是不受固定长度字符串(如Matthew Barnett的regex)限制的re的替代实现。

The easiest way is to count occurrences of letters and print the first one with frequency equal to 1:

最简单的方法是计算字母的出现次数,并将第一个频率为1的字母打印出来。

import sys
from collections import Counter, OrderedDict

# Counter that remembers that remembers the order entries were added
class OrderedCounter(Counter, OrderedDict):
    pass

# Calling next() once only gives the first entry
first=next

with open(sys.argv[1], 'r') as test_cases:
    for test in test_cases:
        lettfreq = OrderedCounter(test)
        print(first((l for l in lettfreq if lettfreq[l] == 1)))

#4


3  

The reason why your regex is not working is that it will not match a character that is followed by the same character, but there is nothing to prevent it from matching a character that isn't followed by the same character, even if it is preceded by the same character.

regex不工作的原因是它不会匹配后面跟着相同字符的字符,但是没有什么可以阻止它匹配后面跟着相同字符的字符,即使前面有相同字符。