I'm trying to write a regular expression to sift through 3mb of text and find certain strings. Right now it works relatively well, except for one problem.
The current expression I'm using is
pattern = re.compile(r'[A-Z]{4} \d{3}.{4,40} \(\d\)')
This effectively searches through the enormous string and finds all occurrences of 4 uppercase alphabetic characters followed by a space, followed by 3 digits, followed by 4-40 characters of any kind, followed by a space, followed by (n), where n is any digit.
What I'm looking for is something like ACCT 220 Principles of Accounting I (3)
This is exactly what I want, except that it sometimes starts the match too early. In some places in the document, another class code immediately precedes the class where the match is supposed to start. For example, I'll end up with BMGT 310.ACCT 220 Principles of Accounting I (3)
I figured one way to get around this would be to not allow the .{4,40} portion of the regular expression to contain 4 uppercase letters in a row. I've tried using ^ to no avail.
For example, I tried something along the lines of [A-Z]{4} \d{3}([^A-Z]{4}){4,40} \(\d\) but then I end up with an empty list, since the expression didn't find anything.
I think I just don't understand regex syntax well enough yet. If anyone knows how to fix my expression so that it finds all instances of 4 uppercase letters followed by a space, followed by three digits, followed by 4-40 characters of any kind that do NOT contain 4 capital letters in a row, followed by a space, followed by (n), where n is a digit, that would be awesome and greatly appreciated.
I understand this question might be rather confusing. If you need any more information from me, please let me know.
1 solution
If you don't want to match 4 uppercase letters in a row, you can use a negative lookahead and then match one character at a time, repeated with {4,40}:
Piece of your current working regex:
.{4,40}
To be changed to:
(?:(?![A-Z]{4}).){4,40}
A negative lookahead (?! ... ) makes the match fail at the current position if what's inside it matches there. Since we have (?![A-Z]{4}), the match fails wherever the next 4 characters are all uppercase letters. Lookaheads are zero-width assertions: they consume no characters, so the final match isn't affected at all, which is also why I'm still using a . for the actual matching.
A simple example that might help explain how a negative lookahead works, and what "zero-width" means, is this:
w(?!o)
This regex will match the w (note that no o is part of the match) in way, whole, and below, but not the w in word.
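The w(?!o) example can be checked with a minimal Python sketch. Python's re module is assumed here because the question uses re.compile; the variable names are only illustrative.

```python
import re

# w(?!o): match a "w" only when the next character is not "o".
pattern = re.compile(r'w(?!o)')

for word in ['way', 'whole', 'below', 'word']:
    m = pattern.search(word)
    # When there is a match, m.group() is just "w":
    # the lookahead checks the next character without consuming it.
    print(word, '->', m.group() if m else None)
```

The match never includes the character that the lookahead inspected, which is what "zero-width" means in practice.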
(?![A-Z]{4}). will thus match any character, unless that character is an uppercase letter followed by 3 more uppercase letters (making 4 consecutive uppercase letters).
To repeat this . now, you cannot just use (?![A-Z]{4}).{4,40} because the negative lookahead will only apply to the first . and not to the others. The trick is thus to put (?![A-Z]{4}). in a group and then repeat the group:
((?![A-Z]{4}).){4,40}
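The difference between the two placements can be seen in a short Python sketch that uses the sample string from the question. The names checked_once and checked_each are made up for this illustration.

```python
import re

# Sample string taken from the question.
text = 'BMGT 310.ACCT 220 Principles of Accounting I (3)'

# Lookahead written once, in front of .{4,40}:
# it is only checked at the first character of the run.
checked_once = re.compile(r'[A-Z]{4} \d{3}(?![A-Z]{4}).{4,40} \(\d\)')

# Lookahead inside the repeated group:
# it is checked again before every single character.
checked_each = re.compile(r'[A-Z]{4} \d{3}((?![A-Z]{4}).){4,40} \(\d\)')

print(checked_once.search(text).group())
print(checked_each.search(text).group())
```

With the lookahead checked only once, the match still starts at BMGT, because the first character after 310 is a period and the check is never repeated at ACCT. With the grouped form, the run stops at ACCT after a single character, so the attempt starting at BMGT fails and the match starts at ACCT 220 instead.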
Last, I prefer using non-capture groups (?: ... ) because they make the regex a bit more efficient since they don't store captures:
(?:(?![A-Z]{4}).){4,40}
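Putting the pieces together, here is a sketch of the full pattern from the question with the .{4,40} portion swapped out. The sample text is hypothetical apart from the first course line, and it assumes the matches are collected with re.findall, which the "empty list" in the question suggests but does not confirm.

```python
import re

# Hypothetical sample text; only the first course line comes from the question.
text = ('Prerequisite: BMGT 310.ACCT 220 Principles of Accounting I (3) '
        'Basic theory. ACCT 221 Principles of Accounting II (3)')

# Pattern from the question: the first match starts too early, at BMGT.
original = re.compile(r'[A-Z]{4} \d{3}.{4,40} \(\d\)')

# Same pattern with the lookahead checked before every character.
fixed = re.compile(r'[A-Z]{4} \d{3}(?:(?![A-Z]{4}).){4,40} \(\d\)')

# Capturing variant, shown only to illustrate how findall treats groups.
capturing = re.compile(r'[A-Z]{4} \d{3}((?![A-Z]{4}).){4,40} \(\d\)')

print(original.findall(text))
print(fixed.findall(text))
print(capturing.findall(text))
```

In Python there is a second reason to prefer the non-capture group here: when a pattern contains a capturing group, re.findall returns the group's contents rather than the whole match, so the capturing variant yields only the last character matched by the group for each hit.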