PHP Regex to check if two strings share two common characters

I'm just getting to know regular expressions, but after doing quite a bit of reading (and learning quite a lot), I still have not been able to figure out a good solution to this problem.

Let me be clear, I understand that this particular problem might be better solved not using regular expressions, but for the sake of brevity let me just say that I need to use regular expressions (trust me, I know there are better ways to solve this).

Here's the problem. I'm given a big file, each line of which is exactly 4 characters long.

This is a regex that defines "valid" lines:

"/^[AB][CD][EF][GH]$/m"

In English, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.

What I'm trying to do is given one of those lines, match all other lines that contain 2 or more common characters.

The below example assumes the following:

$line is always a valid format
BigFileOfLines.txt contains only valid lines

Example:

// Matches all other lines in string that share 2 or more characters in common
// with "$line"
function findMatchingLines($line, $subject) {
    $regex = "magic regex I'm looking for here";
    $matchingLines = array();
    preg_match_all($regex, $subject, $matchingLines);
    return $matchingLines;
}

// Example Usage
$fileContents = file_get_contents("BigFileOfLines.txt");
$matchingLines = findMatchingLines("ACFG", $fileContents);

/*
 * Desired return value (Note: this is an example set, there 
 * could be more or less than this)
 * 
 * BCEG
 * ADFG
 * BCFG
 * BDFG
*/

One way I know that will work is to have a regex like the following (the following regex would only work for "ACFG"):

"/^(?:AC.{2}|.CF.|.{2}FG|A.F.|A.{2}G|.C.G)$/m"

This works alright, and performance is acceptable. What bothers me about it though is that I have to generate this based off of $line, where I'd rather have it be ignorant of what the specific parameter is. Also, this solution doesn't scale terribly well if later the code is modified to match, say, 3 or more characters, or if the size of each line grows from 4 to 16.
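
Generating that alternation from $line does not have to be done by hand. A minimal sketch, assuming a hypothetical helper named buildPositionalRegex() (not part of the original code) and the same fixed-length lines as above; it pins every combination of $minMatches positions and leaves the remaining positions as wildcards:

```php
<?php
// Hypothetical helper: builds the "one alternative per set of positions" regex
// for any reference line and any minimum number of matching positions.
function buildPositionalRegex(string $line, int $minMatches): string
{
    $length = strlen($line);
    $alternatives = [];

    // Enumerate every subset of positions with a bitmask and keep the subsets
    // that pin exactly $minMatches positions; all other positions become ".".
    for ($mask = 0; $mask < (1 << $length); $mask++) {
        $pattern = '';
        $pinned = 0;
        for ($i = 0; $i < $length; $i++) {
            if ($mask & (1 << $i)) {
                $pattern .= preg_quote($line[$i], '/');
                $pinned++;
            } else {
                $pattern .= '.';
            }
        }
        if ($pinned === $minMatches) {
            $alternatives[] = $pattern;
        }
    }

    return '/^(?:' . implode('|', $alternatives) . ')$/m';
}

// Assumed sample data standing in for BigFileOfLines.txt.
$subject = "BCEG\nADFG\nBDEH\nBCFG\nBDFG\nADEH";
preg_match_all(buildPositionalRegex('ACFG', 2), $subject, $found);
print_r($found[0]); // BCEG, ADFG, BCFG, BDFG
```

For "ACFG" and 2 this yields the same six alternatives as the hand-written pattern. The number of alternatives grows as C(n, k) and the bitmask loop is exponential in the line length, so this illustrates the scaling concern rather than removing it.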

It just feels like there's something remarkably simple that I'm overlooking. Also seems like this could be a duplicate question, but none of the other questions I've looked at really seem to address this particular problem.

Thanks in advance!

Update:

It seems that the norm with Regex answers is for SO users to simply post a regular expression and say "This should work for you."

I think that's kind of a halfway answer. I really want to understand the regular expression, so if you can include in your answer a thorough (within reason) explanation of why that regular expression:

A. Works
B. Is the most efficient (I feel there are a sufficient number of assumptions that can be made about the subject string that a fair amount of optimization can be done).

Of course, if you give an answer that works, and nobody else posts the answer *with* a solution, I'll mark it as the answer :)

Update 2:

Thank you all for the great responses, a lot of helpful information, and a lot of you had valid solutions. I chose the answer I did because after running performance tests, it was the best solution, averaging equal runtimes with the other solutions.

The reasons I favor this answer:

The regular expression given provides excellent scalability for longer lines
The regular expression looks a lot cleaner, and is easier for mere mortals such as myself to interpret.

However, a lot of credit goes to the below answers as well for being very thorough in explaining why their solution is the best. If you've come across this question because it's something you're trying to figure out, please give them all a read; they helped me tremendously.

7 Solutions

#1

Why don't you just use this regex $regex = "/.*[$line].*[$line].*/m";?

For your example, that translates to $regex = "/.*[ACFG].*[ACFG].*/m";
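
A quick way to try this pattern, as a sketch on assumed sample data (the $subject string below is made up, not taken from the question). Note that the character class ignores position: it gives the right answer for valid lines only because each letter can occur at exactly one position, which the deliberately invalid line CAEH illustrates.

```php
<?php
// Assumed sample data; the last line is intentionally NOT in the valid format.
$subject = "BCEG\nADFG\nBDEH\nCAEH";
$line = 'ACFG';

// Any two characters of a line that are also in $line make the pattern match,
// no matter where in the line they appear.
$regex = "/.*[$line].*[$line].*/m";

preg_match_all($regex, $subject, $matches);
print_r($matches[0]); // BCEG, ADFG and CAEH are reported; BDEH is not
```

CAEH shares no position with ACFG, yet it matches because it contains both C and A somewhere in the line.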

#2

This is a regex that defines "valid" lines:

/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m

In english, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.

That's not what that regex means. That regex means that each line has either A or B or a pipe at position 0, C or D or a pipe at position 1, etc; [A|B] means "either 'A' or '|' or 'B'". The '|' only means 'or' outside of character classes.

Also, {1} is a no-op; lacking any quantifier, everything has to appear exactly once. So a correct regex for the above English is this:

/^[AB][CD][EF][GH]$/

or, alternatively:

/^(A|B)(C|D)(E|F)(G|H)$/

That second one has the side effect of capturing the letter in each position, so that the first captured group will tell you whether the first character was A or B, and so on. If you don't want the capturing, you can use non-capture grouping:

/^(?:A|B)(?:C|D)(?:E|F)(?:G|H)$/

But the character-class version is by far the usual way of writing this.

As to your problem, it is ill-suited to regular expressions; by the time you deconstruct the string, stick it back together in the appropriate regex syntax, compile the regex, and do the test, you would probably have been much better off just doing a character-by-character comparison.
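
The character-by-character comparison could look roughly like this. It is a sketch only: countSharedPositions() and findMatchingLinesByComparison() are hypothetical names, and the input is assumed to be newline-separated lines as in the question.

```php
<?php
// Hypothetical helper: counts the positions at which both strings hold the
// same character.
function countSharedPositions(string $a, string $b): int
{
    $shared = 0;
    $length = min(strlen($a), strlen($b));
    for ($i = 0; $i < $length; $i++) {
        if ($a[$i] === $b[$i]) {
            $shared++;
        }
    }
    return $shared;
}

// Hypothetical wrapper mirroring the question's findMatchingLines() signature.
function findMatchingLinesByComparison(string $line, string $subject, int $minShared = 2): array
{
    $matching = [];
    foreach (explode("\n", trim($subject)) as $candidate) {
        if (countSharedPositions($line, $candidate) >= $minShared) {
            $matching[] = $candidate;
        }
    }
    return $matching;
}

print_r(findMatchingLinesByComparison('ACFG', "BCEG\nADFG\nBDEH\nBCFG"));
```

Changing the threshold or the line length needs no new pattern here; only the $minShared argument changes.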

I would rewrite your "ACFG" regex thus: /^(?:AC|A.F|A..G|.CF|.C.G|..FG)/ (the trailing $ can go, since every line is known to be exactly 4 characters long), but that's just appearance; I can't think of a better solution using regex. (Although as Mike Ryan indicated, it would be better still as /^(?:A(?:C|.F|..G)|.C(?:F|.G)|..FG)/ - but that's still the same solution, just in a more efficiently-processed form.)

#3

You've already answered how to do it with a regex, and noted its shortcomings and inability to scale, so I don't think there's any need to flog the dead horse. Instead, here's a way that'll work without the need for a regex:

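function findMatchingLines($line) {
    static $file = null;
    // Load the file once; drop the trailing newlines so they are not compared.
    if( !$file) $file = file("BigFileOfLines.txt", FILE_IGNORE_NEW_LINES);

    $search = str_split($line);
    $found = array();
    foreach($file as $l) {
        $test = str_split($l);
        // Each letter is only valid at one position, so the size of the
        // intersection equals the number of positions the two lines share.
        $matches = count(array_intersect($search,$test));
        if( $matches >= 2) // define number of matches required here - optionally make it an argument
            $found[] = $l;
    }
    // empty array if there are no matches
    return $found;
}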

#4

There are 6 possibilities that at least two characters match out of 4: MM.., M.M., M..M, .MM., .M.M, and ..MM ("M" meaning a match and "." meaning a non-match).

So, you need only to convert your input into a regex that matches any of those possibilities. For an input of ACFG, you would use this:

"/^(AC..|A.F.|A..G|.CF.|.C.G|..FG)$/m"

This, of course, is the conclusion you're already at--so far so good.

The key issue is that Regex isn't a language for comparing two strings, it's a language for comparing a string to a pattern. Thus, either your comparison string must be part of the pattern (which you've already found), or it must be part of the input. The latter method would allow you to use a general-purpose match, but does require you to mangle your input.

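function findMatchingLines($line, $subject) {
  // The lookahead captures the 4 characters of the last line (the appended
  // $line) as \1..\4; the alternation then compares the current line to them.
  // Single quotes keep \1 etc. as backreferences instead of PHP escapes.
  $regex = '/^(?=[\s\S]*\n(.)(.)(.)(.)\z)'
      . '(?:\1\2..|\1.\3.|\1..\4|.\2\3.|.\2.\4|..\3\4)$/m';
  $matchingLines = array();
  preg_match_all($regex, rtrim($subject, "\n") . "\n" . $line, $matchingLines);
  return $matchingLines[0]; // index 0 holds the full matched lines
}

What this function does is append the line you want to match against to your input string, then use a pattern that compares each line before the last one back to the last line's 4 characters (that's the lookahead at the start of the pattern working). A lookbehind cannot do this job in PCRE, because a lookbehind may not contain an unbounded quantifier; that is why the reference line goes at the end and is read with a lookahead. The lookahead rescans the rest of the input for every line, so expect this to be much slower on a big file than a pattern generated from $line.

If you also want to validate those matching lines against the "rules", just replace the . in each pattern with the appropriate character class (\1\2[EF][GH], etc.).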

#5

People may be confused by your first regex. You give:

"/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m"

And then say:

In english, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.

But that's not what that regex means at all.

This is because the | operator has the lowest precedence here. So, what that regex really says, in English, is: Either A or | or B in the first position, OR C or | or D anywhere in the line, OR E or | or F anywhere in the line, OR G or | or H in the last position.

This is because [A|B] means a character class with one of the three given characters (including the |), because {1} means one character (it is also completely superfluous and could be dropped), and because the outer | alternates between everything around it, so the ^ belongs only to the first alternative and the $ only to the last. In my English expression above, each capitalized OR stands for one of your alternating |'s. (And I started counting positions at 1, not 0 -- I didn't feel like typing the 0th position.)

To get your English description as a regex, you would want:

/^[AB][CD][EF][GH]$/

The regex will go through and check the first position for A or B (in the character class), then check C or D in the next position, etc.

- - -

EDIT:

You want to test for only two of these four characters matching.

Strictly speaking, and picking up from @Mark Reed's answer, the fastest regex (after it's been parsed) is likely to be:

/^(?:A(?:C|.F|..G)|.C(?:F|.G)|..FG)/

as compared to:

/^(?:AC|A.F|A..G|.CF|.C.G|..FG)/

This is because of how the regex implementation steps through text. You first test if A is in the first position. If that succeeds, then you test the sub-cases. If that fails, then you're done with all those possible cases (of which there are 3). If you don't yet have a match, you then test if C is in the 2nd position. If that succeeds, then you test for the two subcases. And if none of those succeed, you test for FG in the 3rd and 4th positions.

This regex is specifically created to fail as fast as possible. With each case listed out separately, a failing line has to be tested against 6 different cases (each of the six alternatives), instead of 3 cases (at a minimum). And in cases of A not being in the first position, you would immediately go on to test the 2nd position, without hitting the first position two more times. Etc.

(Note that I don't know exactly how PHP compiles regex's -- it's possible that they compile to the same internal representation, though I suspect not.)

- - -

EDIT: One additional point. "Fastest regex" is a somewhat ambiguous term. Fastest to fail? Fastest to succeed? And given what possible range of sample data of succeeding and failing rows? All of these would have to be clarified to really determine what criteria you mean by fastest.
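
One way to pin those criteria down is to time each pattern separately on data that always matches and on data that never matches. A rough sketch, assuming PHP 7.3+ for hrtime(); timePattern() is a hypothetical helper, the subjects are synthetic, and the two patterns are the flat and factored "ACFG" alternations anchored with ^ only (every line is assumed to be exactly 4 characters long).

```php
<?php
// Hypothetical micro-benchmark helper: runs one pattern over one subject
// several times and reports the match count and the elapsed milliseconds.
function timePattern(string $regex, string $subject, int $runs = 20): array
{
    $count = 0;
    $start = hrtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $count = preg_match_all($regex, $subject);
    }
    $elapsedMs = (hrtime(true) - $start) / 1e6;
    return ['matches' => $count, 'ms' => $elapsedMs];
}

$flat     = '/^(?:AC|A.F|A..G|.CF|.C.G|..FG)/m';
$factored = '/^(?:A(?:C|.F|..G)|.C(?:F|.G)|..FG)/m';

// Assumed synthetic data: one subject where every line matches, one where
// no line matches.
$allMatch = str_repeat("ACFG\n", 20000);
$noMatch  = str_repeat("BDEH\n", 20000);

foreach (['flat' => $flat, 'factored' => $factored] as $name => $regex) {
    $hit  = timePattern($regex, $allMatch);
    $miss = timePattern($regex, $noMatch);
    printf("%-8s hit: %.2f ms (%d)  miss: %.2f ms (%d)\n",
        $name, $hit['ms'], $hit['matches'], $miss['ms'], $miss['matches']);
}
```

The timings depend on the PHP and PCRE versions and on whether the PCRE JIT is enabled, so the numbers are only meaningful when compared on the same machine with realistic data.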

#6

Here's something that uses Levenshtein distance instead of regex and should be extensible enough for your requirements:

$lines = array_map('rtrim', file('file.txt')); // load file into array removing \n
$common = 2; // number of common characters required
$match = 'ACFG'; // string to match

$matchingLines = array_filter($lines, function ($line) use ($common, $match) {
    // error checking here if necessary - $line and $match must be same length
    return (levenshtein($line, $match) <= (strlen($line) - $common));
});

var_dump($matchingLines);

#7

I bookmarked the question yesterday in the evening to post an answer today, but seems that I'm a little late ^^ Here is my solution anyways:

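/^[^ACFG\n]*+(?:[ACFG][^ACFG\n]*+){2,}$/m

It looks for two or more occurrences of one of the ACFG characters surrounded by any other characters (except line breaks, so a match never spans two lines). The loop is unrolled and uses possessive quantifiers, to improve performance a bit.

Can be generated using:

function getRegexMatchingNCharactersOfLine($line, $num) {
    $chars = preg_quote($line, '/');
    // Built by concatenation: inside double quotes PHP would expand {$num}
    // to the bare number and drop the braces the quantifier needs.
    return "/^[^$chars\\n]*+(?:[$chars][^$chars\\n]*+){" . (int) $num . ",}$/m";
}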
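
To check a generated pattern against the question's rule, it can be compared with a plain positional count over all 16 valid lines. This is a sketch under assumptions: the generator is repeated here in an "at least $num" form so the snippet runs on its own, and sharedPositions() is a hypothetical helper.

```php
<?php
function getRegexMatchingNCharactersOfLine($line, $num) {
    $chars = preg_quote($line, '/');
    // "\\n" keeps the negated classes from running across line breaks.
    return "/^[^$chars\\n]*+(?:[$chars][^$chars\\n]*+){" . (int) $num . ",}$/m";
}

// Hypothetical helper: number of positions at which two lines agree.
function sharedPositions($a, $b) {
    $n = 0;
    for ($i = 0; $i < strlen($a); $i++) {
        if ($a[$i] === $b[$i]) $n++;
    }
    return $n;
}

// All 16 valid lines of the form [AB][CD][EF][GH].
$valid = array();
foreach (array('A', 'B') as $p0)
    foreach (array('C', 'D') as $p1)
        foreach (array('E', 'F') as $p2)
            foreach (array('G', 'H') as $p3)
                $valid[] = $p0 . $p1 . $p2 . $p3;

$line = 'ACFG';
preg_match_all(getRegexMatchingNCharactersOfLine($line, 2), implode("\n", $valid), $m);

$expected = array_values(array_filter($valid, function ($candidate) use ($line) {
    return sharedPositions($line, $candidate) >= 2;
}));

var_dump($m[0] === $expected); // the regex and the positional count agree
echo count($m[0]), "\n";       // 11 of the 16 valid lines share 2+ positions
```

The agreement relies on the valid format: the pattern counts shared characters regardless of position, which equals the positional count only because each letter can appear at one position.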
