Is there any way to get the page number of a search pattern in a PDF?

Time: 2021-08-04 21:12:27

I have a PDF named test.pdf and I need to search for the text "My name" in that PDF.

The following command does the job:

pdftotext test.pdf - | grep 'My name'

Is there any way to get the number of the page on which the text "My name" appears, from the terminal itself?

1 answer

#1


3  

If you just want the linear page number (as opposed to the number which appears on the page), then you can do it by counting form-feed characters while you search for your text. pdftotext puts a form-feed at the end of every page, so the number of form-feeds prior to your text is one less than the (linear) page number the text is on. (Or thereabouts. Sometimes PDF files are not what they seem.)

Something like the following should work:

pdftotext test.pdf - |
awk -vRS=$'\f' -vNAME="My name" \
    'index($0,NAME){printf "%d: %s\n", NR, NAME;}'
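
To check the form-feed counting without a PDF at hand, the same awk logic can be fed synthetic text in which pages are separated by form-feeds. This is only a sketch: the function name page_of and the sample text are made up for illustration, and it sets RS='\f' through awk's -v escape processing so that it does not depend on bash's $'\f' quoting.

```shell
# page_of PATTERN: read pdftotext-style text on stdin (pages separated by
# form-feeds) and print "PAGE: PATTERN" for every page containing PATTERN.
# page_of is a hypothetical helper name, not part of pdftotext or awk.
page_of() {
    awk -v RS='\f' -v NAME="$1" \
        'index($0, NAME) { printf "%d: %s\n", NR, NAME }'
}

# Synthetic three-page input standing in for: pdftotext test.pdf -
printf 'first page\n\fsecond page\nMy name\n\fMy name again\n\f' |
    page_of 'My name'
```

With this input the function reports pages 2 and 3. For a real file, the printf line would be replaced by pdftotext test.pdf - as in the command above.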

The following slightly more complicated solution will prove useful if you want to scan for more than one pattern. Unlike the simple solution above, this one will give you one line per pattern match, even if the same pattern matches twice on the same page:

pdftotext test.pdf - |
grep -F -o -e $'\f' -e 'My name' |
awk 'BEGIN{page=1} /\f/{++page;next} 1{printf "%d: %s\n", page, $0;}'

You can add as many patterns as you like to the grep command (by adding another -e string argument). The -F causes it to match exact strings, but that's not essential; you could use -E and a regex. The awk script assumes that all of the matches will either be a form-feed or a string that was matched, which is what you will get with the -o option to grep.
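
As an illustration of adding a second pattern, the sketch below searches for two fixed strings at once and runs on synthetic input so it can be tried without a PDF. The helper name match_pages, the second pattern 'Your name', and the sample text are assumptions made for the example.

```shell
# Form-feed character, built with printf so the sketch does not rely on
# bash-only $'\f' quoting.
ff=$(printf '\f')

# match_pages: hypothetical helper; reads pdftotext-style text on stdin and
# prints "PAGE: MATCH" for each occurrence of either fixed string.
match_pages() {
    grep -F -o -e "$ff" -e 'My name' -e 'Your name' |
        awk 'BEGIN { page = 1 } /\f/ { ++page; next } { printf "%d: %s\n", page, $0 }'
}

# Page 1 contains "My name" twice; page 2 contains one of each pattern.
printf 'My name and My name\n\fYour name\nMy name\n\f' | match_pages
```

Each occurrence gets its own output line, so page 1 is listed twice for "My name" and page 2 once for each pattern.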

If you are looking for phrases, be aware that they might have line breaks (or even page breaks) in the middle. There's not a lot you can do about page breaks, but the first (pure awk) solution will handle line breaks if you change the call to index to a regular expression match, and write the regular expression with [[:space:]]+ replacing every single space in the original phrase:

pdftotext test.pdf - |
awk -vRS=$'\f' \
    '/My[[:space:]]+name/{printf "%d: %s\n", NR, "My name";}'
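
Writing the regular expression by hand gets tedious for longer phrases, so one possible approach is to build it with sed. The following is a sketch under stated assumptions: phrase_pages is a hypothetical helper name, the phrase is assumed to contain no regex metacharacters or backslashes, and the awk in use is assumed to support POSIX character classes such as [[:space:]].

```shell
# phrase_pages PHRASE: hypothetical helper. Replaces each run of spaces in
# PHRASE with [[:space:]]+ and prints "PAGE: PHRASE" for every page whose
# text matches. Assumes PHRASE has no regex metacharacters or backslashes,
# because the result is used as a regular expression.
phrase_pages() {
    re=$(printf '%s' "$1" | sed 's/  */[[:space:]]+/g')
    awk -v RS='\f' -v re="$re" -v phrase="$1" \
        '$0 ~ re { printf "%d: %s\n", NR, phrase }'
}

# The phrase is split across two lines on the second synthetic page.
printf 'nothing here\n\fMy\nname\n\f' | phrase_pages 'My name'
```

Here the phrase is found on page 2 even though a line break separates its two words.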

In theory, you could extract the visible page number (or "page label" as it is called), but many PDF files do not retain this metadata and you'd need a real PDF parser to extract it.
