Regex where parentheses may be unbalanced

Time: 2022-02-02 23:35:09

I have to pull some text out of a PDF stream as a string. The stream contains both the markup that describes the appearance of the text and the text itself. The string my regex runs on will never contain carriage returns or line feeds. The text I am interested in is always inside parentheses (and there may be parentheses inside the parentheses), and the final closing parenthesis is always followed by the letters 'Tj'. In short, what I am after always follows this convention:

(.....) Tj

At the moment, the regex I have works as long as the parentheses are all balanced:

\((?:[^()]|(?'paren'\()|(?'-paren'\)))+(?(paren)(?!))\)

However, if the text itself contains unbalanced parentheses, this regex does not pull out what I want, and I am not sure how to change it to handle unbalanced parentheses.
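
To make the failure concrete, here is a minimal sketch of balanced matching written as a plain depth counter in Python. It is not the .NET balancing-group engine, only an illustration of the same idea; the function names `balanced_end` and `extract_balanced` are made up for this example. Starting from the outer '(', a stray extra '(' means the depth never returns to zero, so no balanced match exists.

```python
def balanced_end(text, start):
    """Return the index of the ')' that balances the '(' at text[start], or -1."""
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return i
    return -1  # ran out of input with parentheses still open


def extract_balanced(text):
    """Return the text inside the first '(' if it has a balancing ')', else None."""
    start = text.find("(")
    if start == -1:
        return None
    end = balanced_end(text, start)
    return None if end == -1 else text[start + 1:end]


balanced = "0 g  (RE:  Request for Additional Information) Tj"
unbalanced = "0 g  (RE:  Request for (Additional Information) Tj"

print(extract_balanced(balanced))    # RE:  Request for Additional Information
print(extract_balanced(unbalanced))  # None
```

Any approach that insists on balance has this limitation, which suggests keying on the ') Tj' terminator instead of on nesting depth.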

Here is a sample of what would be considered a 'normal' string:

q  Q  /Tx BMC  q  0 0 471.34 407.34 re  W  n  BT  1 0 0 1 2 397.16 Tm  /Helv 12 Tf  0 g  (RE:  Request for Additional Information) Tj

So obviously, I want to get the string 'RE: Request for Additional Information' out of that.

And here is an example case that my regex fails on (I have added unbalanced parentheses):

q  Q  /Tx BMC  q  0 0 471.34 407.34 re  W  n  BT  1 0 0 1 2 397.16 Tm  /Helv 12 Tf  0 g  (RE:  Request for (Additional Information) Tj 0 g  1 0 0 1 2 383.29 Tm  0 g  (     13. Processing TT Instructions -) Audit Note 12) Tj  0 g  1 0 0 1 2 369.42 Tm  0 g  () Tj  0 g  1 0 0 1 2 355.55 Tm  0 g  (Dear test:) Tj  0 g  1 0 0 1 2 341.68 Tm  0 g  () Tj  0 g  1 0 0 1 2 327.8 Tm  0 g  (Thank you for the more random words here.  )Unfortunately, more words here) terminating (words here) Tj  

There are also empty sets of parentheses in here, which look like:

() Tj

These represent carriage returns and line feeds when the PDF is rendered. Any help is appreciated. Thank you in advance.

--- UPDATE to answer questions below

Any type of user input can be placed between the opening and closing parentheses. I want to extract all content exactly as provided, even if the user forgot to balance their parentheses. The only guarantee is that the text between the parentheses is user input; how they enter it is up to them, so it does NOT follow a predefined format such as ([abbrev]: [content]). The content is only guaranteed to sit between an opening parenthesis and a closing parenthesis, with the letters 'Tj' after the closing parenthesis.
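
Given that guarantee, one possible approach is to ignore nesting entirely and scan for the terminator. The following Python sketch does that without a regex. It rests on assumptions that go beyond what is stated above: exactly one space separates ')' from 'Tj', the user text never contains ') Tj' itself, and no '(' appears in the markup between one 'Tj' and the next string. The name `extract_tj_strings` is hypothetical.

```python
def extract_tj_strings(stream, terminator=") Tj"):
    """Collect the text between each opening '(' and the next ') Tj'.

    Assumes the terminator never occurs inside the user text itself.
    """
    pieces = []
    pos = 0
    while True:
        start = stream.find("(", pos)  # first '(' after the previous 'Tj'
        if start == -1:
            break
        end = stream.find(terminator, start + 1)  # nearest ') Tj' after it
        if end == -1:
            break
        pieces.append(stream[start + 1:end])
        pos = end + len(terminator)
    return pieces


sample = (
    "0 g  (RE:  Request for (Additional Information) Tj 0 g  "
    "(     13. Processing TT Instructions -) Audit Note 12) Tj  0 g  "
    "() Tj  0 g  (Dear test:) Tj"
)
pieces = extract_tj_strings(sample)
for piece in pieces:
    print(repr(piece))
```

Empty '() Tj' entries come back as empty strings, so the caller can decide whether to treat them as line breaks or drop them.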

1 Answer

#1


As I mentioned in a comment, I can't help with .NET, but I can give you an expression that might help. I think the solution requires negative lookahead, which Perl offers. The problem is that I haven't used Perl in so long that I've forgotten how to get it to march through the entire stream. If I break the stream into chunks of "(...) Tj", each on its own line, my script works on all your examples:

$ cat pdf_data_line_by_line.txt
q  Q  /Tx BMC  q  0 0 471.34 407.34 re  W  n  BT  1 0 0 1 2 397.16 Tm  /Helv 12 Tf  0 g  (RE:  Request for Additional Information) Tj
q  Q  /Tx BMC  q  0 0 471.34 407.34 re  W  n  BT  1 0 0 1 2 397.16 Tm  /Helv 12 Tf  0 g  (RE:  Request for (Additional Information) Tj
0 g  1 0 0 1 2 383.29 Tm  0 g  (     13. Processing TT Instructions -) Audit Note 12) Tj
0 g  1 0 0 1 2 369.42 Tm  0 g  () Tj
0 g  1 0 0 1 2 355.55 Tm  0 g  (Dear test:) Tj
0 g  1 0 0 1 2 341.68 Tm  0 g  () Tj
0 g  1 0 0 1 2 327.8 Tm  0 g  (Thank you for the more random words here.  )Unfortunately, more words here) terminating (words here) Tj
$ cat get_pdf_text.pl
#!/usr/bin/perl
while (<>) {
   # find some text
   if ( /[^(]*\((?!\)).*\) Tj/ ) {
      # strip off leading junk
      s/[^(]*\((?!\))[ ]*([^)].*)\) Tj/$1/;
      # output saved part of match
      print $_;
      print "YOUR DELIMITER HERE\n";
   }
}
$ cat pdf_data_line_by_line.txt | ./get_pdf_text.pl
RE:  Request for Additional Information
YOUR DELIMITER HERE
RE:  Request for (Additional Information
YOUR DELIMITER HERE
13. Processing TT Instructions -) Audit Note 12
YOUR DELIMITER HERE
Dear test:
YOUR DELIMITER HERE
Thank you for the more random words here.  )Unfortunately, more words here) terminating (words here
YOUR DELIMITER HERE

However, if I combine the examples into a single stream, it stops after the first one. I tried using "g" at the end of the 's' command, but it didn't help:

$ cat pdf_data_single_stream.txt
q  Q  /Tx BMC  q  0 0 471.34 407.34 re  W  n  BT  1 0 0 1 2 397.16 Tm  /Helv 12 Tf  0 g  (RE:  Request for (Additional Information) Tj 0 g  1 0 0 1 2 383.29 Tm  0 g  (     13. Processing TT Instructions -) Audit Note 12) Tj 0 g  1 0 0 1 2 369.42 Tm  0 g  () Tj  0 g  1 0 0 1 2 355.55 Tm  0 g  (Dear test:) Tj 0 g  1 0 0 1 2 341.68 Tm  0 g  () Tj  0 g  1 0 0 1 2 327.8 Tm  0 g  (Thank you for the more random words here.  )Unfortunately, more words here) terminating (words here) Tj
$ cat pdf_data_single_stream.txt | ./get_pdf_text.pl
RE:  Request for (Additional Information) Tj 0 g  1 0 0 1 2 383.29 Tm  0 g  (     13. Processing TT Instructions -) Audit Note 12) Tj 0 g  1 0 0 1 2 369.42 Tm  0 g  () Tj  0 g  1 0 0 1 2 355.55 Tm  0 g  (Dear test:) Tj 0 g  1 0 0 1 2 341.68 Tm  0 g  () Tj  0 g  1 0 0 1 2 327.8 Tm  0 g  (Thank you for the more random words here.  )Unfortunately, more words here) terminating (words here
YOUR DELIMITER HERE

The substitution command ...

s/[^(]*\((?!\))[ ]*([^)].*)\) Tj/$1/

... does the following: match zero or more characters that are NOT '(', then a single '(' that is NOT followed by a ')' (this is where the negative lookahead is needed; it eliminates the '() Tj' cases), then zero or more spaces. It then captures one character that is not a ')' plus everything after it, up to a ') Tj', and replaces the whole match with the captured text. Because '.*' is greedy, the capture runs to the last ') Tj' in the input, which is why only the line-by-line version works. If anyone can suggest the (probably very simple) way to get the script to march all the way through the stream, that should solve the problem at hand.
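
One way the stream might be walked in a single pass is to make the quantifier lazy and iterate over every match. The sketch below uses Python's `re` module rather than Perl or .NET, purely for illustration; the pattern is a simplified port of the one above, and the names `GREEDY`, `LAZY`, and `stream` are made up for this example. It assumes the user text never contains ') Tj'.

```python
import re

# Same shape as the pattern above: '(' not followed by ')', optional leading
# spaces, captured text, then ') Tj'. The only difference is ".*" vs ".*?".
GREEDY = re.compile(r"\((?!\))[ ]*(.*)\) Tj")
LAZY = re.compile(r"\((?!\))[ ]*(.*?)\) Tj")

stream = (
    "0 g  (RE:  Request for (Additional Information) Tj 0 g  "
    "(     13. Processing TT Instructions -) Audit Note 12) Tj  0 g  "
    "() Tj  0 g  (Dear test:) Tj"
)

# Greedy: a single match that swallows everything up to the LAST ') Tj'.
greedy = GREEDY.search(stream).group(1)

# Lazy + finditer: each match stops at the NEAREST ') Tj', then the search
# resumes after it, so the whole stream is covered. '() Tj' is skipped by
# the negative lookahead, as in the original script.
lazy = [m.group(1) for m in LAZY.finditer(stream)]

print(greedy)
for text in lazy:
    print(text)
```

Both Perl and .NET support lazy quantifiers and repeated matching as well, so the same change should carry over, though the exact calls differ by language.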
