We receive a large amount of data from our clients as PDF files in varying formats [layout-wise]. These files are typically report output and are usually properly annotated [they rarely need OCR], but they are not formatted well enough for simply copying several hundred pages of text out of Acrobat to work.
The best approach I've found so far is to write a script that parses the nearly valid XML output of the command-line pdftoipe utility (which converts PDF files for a program called ipe). The output is only nearly valid because the comments are malformed and many characters are escaped in varying ways: é becomes [[[e9]]]é, $ becomes \$, % becomes \%... It gives me text elements with their positions on each page [see sample below]. That works well enough for reports where the same values are in the same place on every page I care about, but it would require extra scripting effort to import matrix [cross-tab] PDF files. pdftoipe is not at all intended for this, and on Windows the best you can do is compile it manually using Cygwin.
Are there libraries that make this easy from some scripting language I can tolerate? A graphical tool would be awesome too. And a pony.
pdftoipe output of this sample looks like this:
<ipe creator="pdftoipe 2006/10/09"><info media="0 0 612 792"/>
<-- Page: 1 1 -->
<page gridsize="8">
<path fill="1 1 1" fillrule="wind">
64.8 144 m
486 144 l
486 727.2 l
64.8 727.2 l
64.8 144 l
h
</path>
<path fill="1 1 1" fillrule="wind">
64.8 144 m
486 144 l
486 727.2 l
64.8 727.2 l
64.8 144 l
h
</path>
<path fill="1 1 1" fillrule="wind">
64.8 144 m
486 144 l
486 727.2 l
64.8 727.2 l
64.8 144 l
h
</path>
<text stroke="1 0 0" pos="0 0" size="18" transformable="yes" matrix="1 0 0 1 181.8 707.88">This is a sample PDF fil</text>
<text stroke="1 0 0" pos="0 0" size="18" transformable="yes" matrix="1 0 0 1 356.28 707.88">e.</text>
<text stroke="1 0 0" pos="0 0" size="18" transformable="yes" matrix="1 0 0 1 368.76 707.88"> </text>
<text stroke="0 0 0" pos="0 0" size="12.6" transformable="yes" matrix="1 0 0 1 67.32 692.4"> </text>
<text stroke="0 0 0" pos="0 0" size="12.6" transformable="yes" matrix="1 0 0 1 67.32 677.88"> </text>
<text stroke="0 0 0" pos="0 0" size="12.6" transformable="yes" matrix="1 0 0 1 67.32 663.36"> </text>
<text stroke="0 0 0" pos="0 0" size="12.6" transformable="yes" matrix="1 0 0 1 67.32 648.84"> </text>
<text stroke="0 0 0" pos="0 0" size="12.6" transformable="yes" matrix="1 0 0 1 67.32 634.32"> </text>
<text stroke="0 0 0" pos="0 0" size="12.6" transformable="yes" matrix="1 0 0 1 67.32 619.8"> </text>
<text stroke="0 0 0" pos="0 0" size="12.6" transformable="yes" matrix="1 0 0 1 67.32 605.28"> </text>
<text stroke="0 0 0" pos="0 0" size="12.6" transformable="yes" matrix="1 0 0 1 67.32 590.76"> </text>
<text stroke="0 0 0" pos="0 0" size="12.6" transformable="yes" matrix="1 0 0 1 67.32 576.24"> </text>
<text stroke="0 0 0" pos="0 0" size="12.6" transformable="yes" matrix="1 0 0 1 67.32 561.72"> </text>
<text stroke="0 0 0" pos="0 0" size="12.6" transformable="yes" matrix="1 0 0 1 67.32 547.2"> </text>
<text stroke="0 0 0" pos="0 0" size="12.6" transformable="yes" matrix="1 0 0 1 67.32 532.68"> </text>
<text stroke="0 0 0" pos="0 0" size="12.6" transformable="yes" matrix="1 0 0 1 67.32 518.16"> </text>
<text stroke="0 0 0" pos="0 0" size="12.6" transformable="yes" matrix="1 0 0 1 67.32 503.64"> </text>
<text stroke="0 0 0" pos="0 0" size="12.6" transformable="yes" matrix="1 0 0 1 67.32 489.12"> </text>
<text stroke="0 0 0" pos="0 0" size="12.6" transformable="yes" matrix="1 0 0 1 67.32 474.6"> </text>
<text stroke="0 0 1" pos="0 0" size="16.2" transformable="yes" matrix="1 0 0 1 67.32 456.24">If you can read this</text>
<text stroke="0 0 1" pos="0 0" size="16.2" transformable="yes" matrix="1 0 0 1 214.92 456.24">,</text>
<text stroke="0 0 1" pos="0 0" size="16.2" transformable="yes" matrix="1 0 0 1 219.48 456.24"> you already have A</text>
<text stroke="0 0 1" pos="0 0" size="16.2" transformable="yes" matrix="1 0 0 1 370.8 456.24">dobe Acrobat </text>
<text stroke="0 0 1" pos="0 0" size="16.2" transformable="yes" matrix="1 0 0 1 67.32 437.64">Reader i</text>
<text stroke="0 0 1" pos="0 0" size="16.2" transformable="yes" matrix="1 0 0 1 131.28 437.64">n</text>
<text stroke="0 0 1" pos="0 0" size="16.2" transformable="yes" matrix="1 0 0 1 141.12 437.64">stalled on your computer.</text>
<text stroke="0 0 0" pos="0 0" size="16.2" transformable="yes" matrix="1 0 0 1 337.92 437.64"> </text>
<text stroke="0 0.502 0" pos="0 0" size="12.6" transformable="yes" matrix="1 0 0 1 342.48 437.64"> </text>
<image width="800" height="600" rect="-92.04 800.64 374.4 449.76" ColorSpace="DeviceRGB" BitsPerComponent="8" Filter="DCTDecode" length="369925">
feedcafebabe...
</image>
</page>
</ipe>
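For reference, the kind of script described above might look roughly like the following. This is only a sketch under assumptions, not the actual script: the function names (`clean_ipe`, `unescape`, `extract_lines`) are made up for illustration, the escape handling covers only the three cases mentioned in the question, and `[[[e9]]]` is assumed to be a marker that can simply be dropped because the real character follows it. It strips the malformed comments, parses what is left with Python's standard `xml.etree.ElementTree`, and joins text fragments that share the same y coordinate into lines.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Trimmed copy of the sample output shown above.
SAMPLE = """<ipe creator="pdftoipe 2006/10/09"><info media="0 0 612 792"/>
<-- Page: 1 1 -->
<page gridsize="8">
<text stroke="1 0 0" pos="0 0" size="18" transformable="yes" matrix="1 0 0 1 181.8 707.88">This is a sample PDF fil</text>
<text stroke="1 0 0" pos="0 0" size="18" transformable="yes" matrix="1 0 0 1 356.28 707.88">e.</text>
<text stroke="1 0 0" pos="0 0" size="18" transformable="yes" matrix="1 0 0 1 368.76 707.88"> </text>
<text stroke="0 0 0" pos="0 0" size="12.6" transformable="yes" matrix="1 0 0 1 67.32 692.4"> </text>
<text stroke="0 0 1" pos="0 0" size="16.2" transformable="yes" matrix="1 0 0 1 67.32 456.24">If you can read this</text>
<text stroke="0 0 1" pos="0 0" size="16.2" transformable="yes" matrix="1 0 0 1 214.92 456.24">,</text>
<text stroke="0 0 1" pos="0 0" size="16.2" transformable="yes" matrix="1 0 0 1 219.48 456.24"> you already have A</text>
<text stroke="0 0 1" pos="0 0" size="16.2" transformable="yes" matrix="1 0 0 1 370.8 456.24">dobe Acrobat </text>
<text stroke="0 0 1" pos="0 0" size="16.2" transformable="yes" matrix="1 0 0 1 67.32 437.64">Reader i</text>
<text stroke="0 0 1" pos="0 0" size="16.2" transformable="yes" matrix="1 0 0 1 131.28 437.64">n</text>
<text stroke="0 0 1" pos="0 0" size="16.2" transformable="yes" matrix="1 0 0 1 141.12 437.64">stalled on your computer.</text>
<text stroke="0 0 0" pos="0 0" size="16.2" transformable="yes" matrix="1 0 0 1 337.92 437.64"> </text>
</page>
</ipe>"""


def clean_ipe(raw):
    """Drop the "<-- ... -->" comments, which are not valid XML comments."""
    return re.sub(r"<--.*?-->", "", raw, flags=re.S)


def unescape(text):
    """Undo the escapes named in the question (assumed forms, not a full list)."""
    # Assumption: "[[[e9]]]" is a marker placed before the real character.
    text = re.sub(r"\[\[\[[0-9a-fA-F]+\]\]\]", "", text)
    # "\$" and "\%" become "$" and "%".
    return re.sub(r"\\([$%])", r"\1", text)


def extract_lines(raw):
    """Return, per page, a list of (y, text) tuples ordered top to bottom."""
    root = ET.fromstring(clean_ipe(raw))
    pages = []
    for page in root.iter("page"):
        rows = defaultdict(list)
        for el in page.iter("text"):
            # The last two numbers of the matrix attribute are the x/y position.
            *_, x, y = (float(v) for v in el.get("matrix").split())
            rows[y].append((x, unescape(el.text or "")))
        lines = []
        # PDF coordinates start at the bottom-left, so a larger y is higher up.
        for y in sorted(rows, reverse=True):
            text = "".join(t for _, t in sorted(rows[y])).strip()
            if text:
                lines.append((y, text))
        pages.append(lines)
    return pages


if __name__ == "__main__":
    for y, text in extract_lines(SAMPLE)[0]:
        print(y, text)
```

Grouping on the exact y value is enough for this sample because fragments on one line share the same baseline; real reports may need a small tolerance.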
4 answers
#1
3
We use Xpdf in one of our applications. It's a C++ library that is primarily used for PDF rendering, although it does have a text extractor which could be useful for this project.
#2
1
If you're fine with calling something external, you can use Ghostscript: look at the ps2ascii script included with the distribution. I'm not sure what you want from a graphical tool. A big button that you push to choose the input and output files? A preview? You might be able to use GSView, depending on what you want.
#3
1
pdftohtml -xml
although pdftoipe seems more detailed!!
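As a rough illustration of what could be done with that output: `pdftohtml -xml` writes `<page>` elements containing `<text>` elements that carry `top`, `left`, `width` and `height` attributes, which is enough to rebuild rows for a cross-tab report. The sketch below is not a tested recipe. The function name `rows_from_pdf2xml`, the `tolerance` parameter and the sample data are all made up for illustration, and the sample is hand-written rather than real tool output.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hand-written sample in the shape of pdftohtml -xml output (hypothetical data).
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="120" left="200" width="20" height="12" font="0">10</text>
<text top="100" left="50" width="60" height="12" font="0"><b>Region</b></text>
<text top="101" left="300" width="20" height="12" font="0">Q2</text>
<text top="100" left="200" width="20" height="12" font="0">Q1</text>
<text top="120" left="50" width="60" height="12" font="0">North</text>
<text top="121" left="300" width="20" height="12" font="0">12</text>
</page>
</pdf2xml>"""


def rows_from_pdf2xml(xml_text, tolerance=3):
    """Group text elements into rows by 'top', ordered left to right.

    Elements whose 'top' values fall into the same bucket of `tolerance`
    units are treated as one row. Returns one list of rows per page.
    """
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.iter("page"):
        rows = defaultdict(list)
        for el in page.iter("text"):
            # itertext() also picks up text nested in <b> or <i> children.
            text = "".join(el.itertext()).strip()
            if not text:
                continue
            top = float(el.get("top"))
            left = float(el.get("left"))
            rows[top // tolerance].append((left, text))
        # 'top' is measured from the top of the page, so ascending order
        # of the bucket key is reading order.
        pages.append(
            [[t for _, t in sorted(cells)] for _, cells in sorted(rows.items())]
        )
    return pages


if __name__ == "__main__":
    for row in rows_from_pdf2xml(SAMPLE_XML)[0]:
        print(row)
```

Fixed-size buckets can split a row whose values straddle a bucket boundary, so a real script would probably cluster on the gaps between sorted `top` values instead.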
#4
0
Have you looked at Aspose? We're using it for an ASP.NET app and I've seen some examples of VBScript using it as well. It's not particularly expensive either.