Splitting text from a PDF file with a regular expression

Time: 2023-01-22 21:10:53

I have a PDF file that I converted to .txt using an online tool. Now I want to parse the data in it and split it using a regular expression. I am almost done but stuck at one point.

An example of the data:

00 41 53 Bid Form – Design/Build (Single-Prime Contract)

27 05 13.23 T1 Services

I want to split it so that one entry is "00 41 53 Bid Form – Design/Build (Single-Prime Contract)" and the other is "27 05 13.23 T1 Services".

The regular expression I'm using is [0-9](\d|\ |\.)*(\D)*

Each entry starts with numbers separated by spaces and/or dots, followed by text that can contain letters, dots, commas, parentheses, hyphens, and digits.

I cannot match an entry if its text contains a digit, like the "T1 Services" above.

2 Solutions

#1


2  

If I understood this correctly, you are trying to split on the newline character. This is in C#.

string[] Result = Regex.Split(inputText, "[\r\n]+");
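
If you also need to separate the leading number from the text on each line, the split above can be combined with a per-line pattern. The following is only a sketch under assumptions: the class name SectionParser and the group names "number" and "title" are made up for illustration, and it assumes every entry sits on its own line and the number is made of digit groups separated by a single space or dot. Because the title is matched with .+ rather than \D*, a digit inside the text (as in "T1 Services") no longer stops the match.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class SectionParser
{
    // number: digit groups joined by one space or dot, e.g. "27 05 13.23".
    // title:  the rest of the line; it may contain digits ("T1 Services").
    private static readonly Regex EntryPattern =
        new Regex(@"^(?<number>\d+(?:[ .]\d+)*)\s+(?<title>.+)$");

    public static List<KeyValuePair<string, string>> Parse(string inputText)
    {
        var entries = new List<KeyValuePair<string, string>>();

        // Split on runs of CR/LF so blank lines between entries are skipped.
        foreach (string line in Regex.Split(inputText, "[\r\n]+"))
        {
            Match m = EntryPattern.Match(line.Trim());
            if (m.Success)
            {
                entries.Add(new KeyValuePair<string, string>(
                    m.Groups["number"].Value, m.Groups["title"].Value));
            }
        }
        return entries;
    }
}
```

Calling SectionParser.Parse(inputText) on the sample data should give two pairs, one per line, with the number as the key and the text as the value. One caveat: if a title begins with a bare number followed by a space (for example "3 Phase Wiring"), that number would be absorbed into the key, so the pattern would need adjusting for such data.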

#2


0  

You can also do it without regex, like this:

string phrase = ".......\n,,,,.ll..\r\n....";
string[] words;

words = phrase.Split(new string[] { "\n", "\r" }, StringSplitOptions.RemoveEmptyEntries);

If you want a regex-only approach, use @mhasan's solution.
