How to find/extract page numbers from text?

Date: 2022-05-11 19:32:43

I have been running OCR on some images that belong to different documents, and they have page numbers at the bottom. I have figured out a way to identify each document, but the images are not in sequence and I want to sort them by their page numbers. One hiccup is that the page number format varies, e.g.:

  • Page 1 of 35
  • Page 1-35
  • Page 35

The word "Page" can also be lowercase ("page"). What I am looking for is a generic regex-based method to extract the page number from each page. It would be great if a single regex could handle all cases, since I expect one compiled pattern to be faster than a separate pattern for each case. Thanks.

2 answers

#1


1  

Try the regex below:

page\s[\d]?[\s\d\-of]+

Use the 'I' flag for case-insensitive matching.

RegexDemo
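As a rough sketch of how this pattern could be used from Python: the snippet below compiles it once with `re.I` and then takes the first run of digits inside the match as the page number. The helper name `page_number` and the "first number is the current page" rule are assumptions made for illustration, not part of the answer.

```python
import re

# The pattern from this answer, compiled once with the case-insensitive flag.
PAGE_RE = re.compile(r'page\s[\d]?[\s\d\-of]+', re.I)

def page_number(text):
    """Return the first number after 'page' as an int, or None if absent.

    Hypothetical helper: assumes the first number in the match is the
    current page (e.g. the 1 in 'Page 1 of 35').
    """
    match = PAGE_RE.search(text)
    if match is None:
        return None
    # The character class also accepts 'o', 'f', spaces and '-', so a match
    # may contain no digits at all; guard against that case.
    digits = re.search(r'\d+', match.group())
    return int(digits.group()) if digits else None

for sample in ('Page 1 of 35', 'page 1-35', 'Page 35'):
    print(sample, '->', page_number(sample))
```

Note that the pattern expects exactly one whitespace character after "page", so OCR output with extra spaces there would not match.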

#2


0  

Please see if the commands below suit your purpose. Thanks.

>>> import re
>>> re.findall(r'\w*\s\w*\d{1,5}','Page 1-35')
['Page 1']
>>> re.findall(r'\w*\s\w*\d{1,5}','Page 35')
['Page 35']
>>> re.findall(r'\w*\s\w*\d{1,5}','Page 1 of 35')[0]
'Page 1'
>>> re.findall(r'\w*\s\w*\d{1,5}','page 1 of 35')[0]
'page 1'
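The calls above return strings such as 'Page 1', so an integer still has to be pulled out before sorting. One possible way to do both steps with a single compiled pattern is sketched below. The pattern, the names `parse_page` and `sort_key`, and the choice to use the last marker in the text (on the assumption that the page number sits at the bottom of the page) are illustrative and not taken from either answer.

```python
import re

# Hypothetical single pattern: "page N", optionally followed by "of M" or "-M".
PAGE_RE = re.compile(
    r'\bpage\s+(?P<current>\d+)(?:\s*(?:of|-)\s*(?P<total>\d+))?',
    re.IGNORECASE,
)

def parse_page(text):
    """Return (current, total) from the last page marker in text, or None."""
    matches = list(PAGE_RE.finditer(text))
    if not matches:
        return None
    last = matches[-1]  # assumption: the marker is near the end of the OCR text
    total = last.group('total')
    return (int(last.group('current')), int(total) if total else None)

def sort_key(text):
    """Sort by current page; texts without a marker go last."""
    parsed = parse_page(text)
    return (parsed is None, parsed[0] if parsed else 0)

# Made-up OCR snippets, one per scanned image.
ocr_pages = ['some text Page 3 of 35', 'other text page 1-35', 'more text Page 2']
ordered = sorted(ocr_pages, key=sort_key)
print(ordered)
```

This sketch reads "Page 1-35" as page 1 of 35, which matches how the question lists it. If a document uses that form to mean a range instead, the `total` group would need a different interpretation.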
