How can I make this regular expression work?

Time: 2022-11-29 23:25:41

I'm trying to match sets of data from a PDF document. Because this PDF was generated by OCR, and PDFs in general don't arrange their data in a way a program can easily fetch, the data I receive looks like this, for instance:

12/26 CORRECTION Card Ending in 1111 427.85 3,611.31 Some avenue name12/26 OFF-US ATM WITHDRAWAL 803.00 2,808.31 OAK* SQUARE OFFICE PALM BCH GDNSFLUS 12/26 ATM WITHDRAWAL 419.46 2,388.85 Some avenue name 12/26 SERVICE CHARGES 8.39 2,380.46 Foreign Transaction Fee 12/29 OFF-US ATM WITHDRAWAL 802.50 1,577.96 THE BREAKERS PALM BCH PALM BEACH FLUS 12/30 ATM WITHDRAWAL 600.00 977.96 11111 US HWY 1, PALM BEACH, FL 12/31 ACH DEBIT 207.94 770.02 PAYBYPHONE-PYMT PHONE PYMT 1111 Dec 31 12/31 ACH DEBIT 138.00 632.02 BK OF AM CRD ACH PAYBYPHONE 01111111 Dec 31

I'm trying to extract from it a date, a header, two numeric values, and then a comment that may or may not be present. The fields are hopefully separated by spaces, but the spaces may or may not be there. This is how far I got with my regular expression:

/(\d{1,2}\/\d{1,2})\s*(.+?)\s*([\d,]+\.\d\d)\s*([\d,]+\.\d\d-?)\s*(.*?)/g

And this is the live example: https://regex101.com/r/yU2bN7/1

The problem is that it matches everything it should except the comment. The final lazy (.*?) matches nothing, and if I make it greedy, it swallows the remaining data sets as if they were part of the first match. How can I solve this problem?

1 Answer

#1


3  

Add a positive lookahead for end-of-string or start-of-next-pattern at the end of the expression, right after the final (.*?):

(?=$|\d{1,2}\/\d{1,2})
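
For illustration, here is a minimal sketch of the combined pattern, written in Python rather than in whatever flavor the regex101 link uses. The names RECORD, sample and records are made up for this example, the sample is a shortened piece of the data from the question, and stripping trailing whitespace from each field is an extra step that the answer does not mention. With the lookahead in place, the lazy comment group has to keep expanding until the next date or the end of the string comes into view, so it no longer matches nothing.

```python
import re

# The question's pattern with the answer's lookahead appended.
# The lookahead consumes nothing, so the next match can start on the next date.
RECORD = re.compile(
    r"(\d{1,2}/\d{1,2})\s*"     # date, e.g. 12/26
    r"(.+?)\s*"                 # header (lazy)
    r"([\d,]+\.\d\d)\s*"        # first numeric value
    r"([\d,]+\.\d\d-?)\s*"      # second numeric value, optional trailing minus
    r"(.*?)"                    # comment (lazy, may be empty)
    r"(?=$|\d{1,2}/\d{1,2})"    # stop at end of string or at the next date
)

# Shortened sample taken from the question (hypothetical variable name).
sample = (
    "12/26 CORRECTION Card Ending in 1111 427.85 3,611.31 Some avenue name"
    "12/26 OFF-US ATM WITHDRAWAL 803.00 2,808.31 OAK* SQUARE OFFICE PALM BCH GDNSFLUS "
    "12/26 SERVICE CHARGES 8.39 2,380.46 Foreign Transaction Fee"
)

# Strip each field, because a comment can pick up the space before the next date.
records = [
    tuple(field.strip() for field in match.groups())
    for match in RECORD.finditer(sample)
]

for record in records:
    print(record)
```

One caveat with this approach: a comment that itself contains something shaped like a date (digits, a slash, digits) would be cut short at that point, so it is worth checking the results against the real data.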
