
Time: 2022-02-02 23:35:09

I have to pull some text out of a PDF stream as a string. The stream contains both the markup that describes the appearance of the text and the text itself. The string my regex runs on will never contain carriage returns or line feeds. The text I am interested in is always inside parentheses (and there may be parentheses inside parentheses), and the final closing parenthesis is followed by the letters 'Tj'. In short, what I am after always follows the convention:


(.....) Tj

At the moment, the regex I have works as long as the parentheses are all balanced:



However, if the text itself contains unbalanced parentheses, this regex will not pull what I want, and I am not sure how to change it to handle them.


Here is a sample of what would be considered a 'normal' string:


q  Q  /Tx BMC  q  0 0 471.34 407.34 re  W  n  BT  1 0 0 1 2 397.16 Tm  /Helv 12 Tf  0 g  (RE:  Request for Additional Information) Tj

So obviously, I want to get the string 'RE: Request for Additional Information' out of that.


And here is an example that my regex fails on (I have added unbalanced parentheses):


q  Q  /Tx BMC  q  0 0 471.34 407.34 re  W  n  BT  1 0 0 1 2 397.16 Tm  /Helv 12 Tf  0 g  (RE:  Request for (Additional Information) Tj 0 g  1 0 0 1 2 383.29 Tm  0 g  (     13. Processing TT Instructions -) Audit Note 12) Tj  0 g  1 0 0 1 2 369.42 Tm  0 g  () Tj  0 g  1 0 0 1 2 355.55 Tm  0 g  (Dear test:) Tj  0 g  1 0 0 1 2 341.68 Tm  0 g  () Tj  0 g  1 0 0 1 2 327.8 Tm  0 g  (Thank you for the more random words here.  )Unfortunately, more words here) terminating (words here) Tj  

There are also empty sets of parentheses in here, which look like:


() Tj

These represent carriage returns and line feeds when the PDF is rendered. Any help is appreciated. Thank you in advance.


--- UPDATE to answer questions below


Any type of user input can be placed between the opening and closing parentheses. I want to extract all content as provided, whatever it may be, even if the user forgot to balance their parentheses. The only guarantee is that the text between the parentheses is user input; how they enter the text is up to them, so it does NOT follow a predefined format such as ([abbrev]: [content]). The content is only guaranteed to sit between an opening parenthesis and a closing parenthesis, and the closing parenthesis is followed by the letters 'Tj'.
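
As a rough sketch of the convention described above (an opening parenthesis, arbitrary user text, then a closing parenthesis followed by ' Tj'), a lazy match that stops at the first ') Tj' does not need the parentheses to balance. It is written in Python rather than .NET purely for illustration. The function name extract_tj_strings is made up here, and the sketch assumes that the user text never contains the sequence ') Tj' itself and that no other parenthesized operand precedes a text run.

```python
import re

# Lazy match: from an opening parenthesis up to the FIRST ") Tj".
# Assumption: the user text never contains ") Tj" itself.
TJ_PATTERN = re.compile(r"\((.*?)\) Tj")


def extract_tj_strings(stream):
    """Return the text of every "(...) Tj" run, in order.

    Empty runs "() Tj" come back as empty strings so the caller can
    treat them as line breaks.
    """
    return TJ_PATTERN.findall(stream)


sample = (
    "/Helv 12 Tf  0 g  (RE:  Request for (Additional Information) Tj "
    "0 g  1 0 0 1 2 369.42 Tm  0 g  () Tj  "
    "0 g  1 0 0 1 2 355.55 Tm  0 g  (Dear test:) Tj"
)

for text in extract_tj_strings(sample):
    print(repr(text))
```

The pattern uses only a lazy quantifier and a capture group, so it should carry over to the .NET Regex class with little or no change, though that has not been tried here.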


1 Solution



As I mentioned in a comment, I can't help with .NET, but I can give you an expression that might help. I think the solution requires negative lookahead, which Perl offers. The problem is that I haven't used Perl in so long that I've forgotten how to get it to march through the entire stream. If I break the stream into chunks of "(...) Tj", each on its own line, my script works on all your examples:


$ cat pdf_data_line_by_line.txt
q  Q  /Tx BMC  q  0 0 471.34 407.34 re  W  n  BT  1 0 0 1 2 397.16 Tm  /Helv 12 Tf  0 g  (RE:  Request for Additional Information) Tj
q  Q  /Tx BMC  q  0 0 471.34 407.34 re  W  n  BT  1 0 0 1 2 397.16 Tm  /Helv 12 Tf  0 g  (RE:  Request for (Additional Information) Tj
0 g  1 0 0 1 2 383.29 Tm  0 g  (     13. Processing TT Instructions -) Audit Note 12) Tj
0 g  1 0 0 1 2 369.42 Tm  0 g  () Tj
0 g  1 0 0 1 2 355.55 Tm  0 g  (Dear test:) Tj
0 g  1 0 0 1 2 341.68 Tm  0 g  () Tj
0 g  1 0 0 1 2 327.8 Tm  0 g  (Thank you for the more random words here.  )Unfortunately, more words here) terminating (words here) Tj
$ cat get_pdf_text.pl
while (<>) {
   # find some text
   if ( /[^(]*\((?!\)).*\) Tj/ ) {
      # strip off leading junk
      s/[^(]*\((?!\))[ ]*([^)].*)\) Tj/$1/;
      # output saved part of match
      print $_;
      print "YOUR DELIMITER HERE\n";
   }
}
$ cat pdf_data_line_by_line.txt | ./get_pdf_text.pl
RE:  Request for Additional Information
RE:  Request for (Additional Information
13. Processing TT Instructions -) Audit Note 12
Dear test:
Thank you for the more random words here.  )Unfortunately, more words here) terminating (words here

However, if I combine the examples into a single stream, it stops after the first one. I tried using "g" at the end of the 's' command, but it didn't help:


$ cat pdf_data_single_stream.txt
q  Q  /Tx BMC  q  0 0 471.34 407.34 re  W  n  BT  1 0 0 1 2 397.16 Tm  /Helv 12 Tf  0 g  (RE:  Request for (Additional Information) Tj 0 g  1 0 0 1 2 383.29 Tm  0 g  (     13. Processing TT Instructions -) Audit Note 12) Tj 0 g  1 0 0 1 2 369.42 Tm  0 g  () Tj  0 g  1 0 0 1 2 355.55 Tm  0 g  (Dear test:) Tj 0 g  1 0 0 1 2 341.68 Tm  0 g  () Tj  0 g  1 0 0 1 2 327.8 Tm  0 g  (Thank you for the more random words here.  )Unfortunately, more words here) terminating (words here) Tj
$ cat pdf_data_single_stream.txt | ./get_pdf_text.pl
RE:  Request for (Additional Information) Tj 0 g  1 0 0 1 2 383.29 Tm  0 g  (     13. Processing TT Instructions -) Audit Note 12) Tj 0 g  1 0 0 1 2 369.42 Tm  0 g  () Tj  0 g  1 0 0 1 2 355.55 Tm  0 g  (Dear test:) Tj 0 g  1 0 0 1 2 341.68 Tm  0 g  () Tj  0 g  1 0 0 1 2 327.8 Tm  0 g  (Thank you for the more random words here.  )Unfortunately, more words here) terminating (words here

The substitution command ...


s/[^(]*\((?!\))[ ]*([^)].*)\) Tj/$1/

... does the following: match zero or more characters that are NOT '(', followed by a single '(' that is NOT followed by a ')' (this is where the negative lookahead comes in, and it eliminates the '() Tj' cases), followed by zero or more spaces; then capture one character that is not a ')' plus any characters after it, as long as they are followed by ') Tj'; finally, replace the whole match with the captured string. If anyone can suggest the (probably very simple) way to get the script to march all the way through the stream, that should solve the problem at hand.
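
One possible way to march through the whole stream, sketched here in Python rather than Perl or .NET: make the capture lazy (.*?) so that each match stops at the nearest ') Tj' instead of the last one, and iterate over all matches instead of doing a single substitution. The greedy .* appears to be why the single-stream run above returns everything up to the final ') Tj'. The leading [^(]* is dropped because a search already skips the text before each '('. The name tj_texts is made up for this illustration, and the sketch assumes the user text never contains ') Tj' itself.

```python
import re

# The answer's pattern with a lazy capture, spelled out piece by piece.
PATTERN = re.compile(
    r"""
    \(          # an opening parenthesis ...
    (?!\))      # ... not immediately followed by ')', which skips "() Tj"
    [ ]*        # optional leading spaces, left out of the capture
    ([^)].*?)   # capture: one non-')' character, then as little as possible
    \)[ ]Tj     # up to the nearest ") Tj"
    """,
    re.VERBOSE,
)


def tj_texts(stream):
    """Yield the captured text of every non-empty "(...) Tj" run."""
    for match in PATTERN.finditer(stream):
        yield match.group(1)


stream = (
    "0 g  (RE:  Request for (Additional Information) Tj "
    "0 g  (     13. Processing TT Instructions -) Audit Note 12) Tj "
    "0 g  () Tj  0 g  (Dear test:) Tj"
)

for text in tj_texts(stream):
    print(text)
```

Iterating over matches replaces the per-line loop, so the stream no longer has to be split into one "(...) Tj" chunk per line first.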



