Extracting numbers from text with inconsistent line breaks

Time: 2022-09-13 09:57:19

I have text with 6 numbers that are typically stored on one line:

SomeData\n0.00 0.00 0.00 31,570.07 0.00 31,570.07\nSomeData
SomeData\n0.00 0.00 0.00 485,007.24 0.00 485,007.24\nSomeData

This regex worked fine on it:

\n[0-9,.-]* [0-9,.-]* [0-9,.-]* [0-9,.-]* [0-9,.-]* [0-9,.-]*\n

I noticed that every once in a while I get this:

SomeData\n0.00 0.00 10,921,594\n.89\n-\n9,563,271.0\n6\n0.00 1,358,323.83\nSomeData

Note how line breaks are inserted at random, after a sign or in the middle of a number, as if the system stored the values without filtering out line breaks.
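
To reproduce the problem, here is a minimal Python sketch. Python is an assumption, since no language is named here, and so is the reading that each \n in the samples is a real line-break character. The names SINGLE_LINE, CLEAN, BROKEN and matches_single_line are made up for illustration. It checks the one-line regex against both kinds of input:

```python
import re

# The original one-line expression: six tokens separated by single spaces,
# enclosed by line breaks.
SINGLE_LINE = re.compile(
    r"\n[0-9,.-]* [0-9,.-]* [0-9,.-]* [0-9,.-]* [0-9,.-]* [0-9,.-]*\n"
)

# Sample inputs copied from the question; "\n" is assumed to be a real newline.
CLEAN = "SomeData\n0.00 0.00 0.00 31,570.07 0.00 31,570.07\nSomeData"
BROKEN = (
    "SomeData\n0.00 0.00 10,921,594\n.89\n-\n9,563,271.0\n6\n"
    "0.00 1,358,323.83\nSomeData"
)


def matches_single_line(text):
    """Return True if the one-line regex finds a six-number row in text."""
    return SINGLE_LINE.search(text) is not None


print(matches_single_line(CLEAN))   # True
print(matches_single_line(BROKEN))  # False
```

The second input fails because the expression requires a space between every pair of tokens, and a line break inside a number leaves no position from which six space-separated tokens can be read.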

I am struggling to extract this. I tried various expressions, but my most successful one was [0-9,.-][\n]{0,1}[0-9,.-][ ]{0,1} to match an individual number.

What expression can I use to match both variations of the number format, preferably already stripping out the inconsistent line breaks?

Update: I am going with [-\n]{0,2}[0-9,]+[\n.0-9]{3,4}[\n ]{0,1} for now. Please let me know if there is a better way.

1 solution

#1


One way is to write an exact description of what constitutes a number; in your case [-+]?[0-9]+[0-9,]*(?:\.[0-9]+)? would do the trick. This helps because the search then knows where a number starts and where it ends (thanks to rules such as: a sign can only appear at the start, a dot cannot appear more than once, and so on). Next, you want to match tuples of six numbers delimited by either a new line or a space, so wrap the number pattern in a capture group and repeat it exactly six times: (...[ \n]*){6,6} (which is equivalent to {6}). This helps because the regex engine, knowing how many numbers it has to match, can work out through backtracking what to treat as a single number. Then you want to allow new lines in pretty much any position, so add the new line to each character class. You might also want to anchor the numbers on both sides, but this is not strictly necessary, because the regex engine will now try to identify valid tuples of 6 numbers. The end result is:

SomeData\n([-+]?[0-9\n]+[0-9,\n]*(?:\.[0-9\n]+)?[ \n]){6,6}SomeData

This will find tuples of 6 numbers no matter where the line breaks are. Here is an example: https://regex101.com/r/jD5nT8/1
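
As a rough illustration of how this could be used to also strip the stray line breaks, here is a Python sketch. Python is an assumption, as the question does not name a language, and NUMBER, PATTERN and extract_six are hypothetical names. Because Python's re module keeps only the last iteration of a repeated group, the sketch writes out six explicit groups instead of (...){6,6}; the number pattern itself is the one from the expression above.

```python
import re
from decimal import Decimal

# One number, with "\n" tolerated inside every character class.
NUMBER = r"[-+]?[0-9\n]+[0-9,\n]*(?:\.[0-9\n]+)?"

# Six explicit capture groups separated by a space or a line break, so that
# every number can be read back individually after the match.
PATTERN = re.compile(
    r"SomeData\n" + r"[ \n]".join([f"({NUMBER})"] * 6) + r"[ \n]SomeData"
)


def extract_six(text):
    """Return the six numbers as Decimal values, or None if nothing matches."""
    match = PATTERN.search(text)
    if match is None:
        return None
    # Drop the stray line breaks and the thousands separators before parsing.
    return [
        Decimal(group.replace("\n", "").replace(",", ""))
        for group in match.groups()
    ]


# Sample inputs from the question; "\n" is assumed to be a real newline.
clean = "SomeData\n0.00 0.00 0.00 31,570.07 0.00 31,570.07\nSomeData"
broken = (
    "SomeData\n0.00 0.00 10,921,594\n.89\n-\n9,563,271.0\n6\n"
    "0.00 1,358,323.83\nSomeData"
)

print(extract_six(clean))
print(extract_six(broken))
```

For the broken sample, the third and fourth values come back as 10921594.89 and -9563271.06. Note that the pattern would also accept a "number" made up only of line breaks, in which case Decimal would raise an error, so real data may need an extra check.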
