I want to know how similar two PDF files are, but I don't want to do a detailed content comparison. Is there a solution that works only from the files' external structure? Is that possible? Thanks!
3 solutions
#1
That sounds potentially tough, but here is some low-hanging fruit from the PDF metadata, in order of difficulty.
- Document metadata such as eBook-title and Title
- Number of pages in the document (counting /Page directives)
- Compare the metadata for each page, such as MediaBox, CropBox, BleedBox, TrimBox
- Look for embedded content like images and document-specific fonts and see if they are a perfect match.
- Pull out the plain text and compare the words: word counts, most common words, etc. For Western languages, you could just run the PDF through a string-finder like strings on Linux. Or you can go into the file and find (blah blah blah) Tj, which is how most text is stored in PDF content.
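The page-count and page-box checks from the list above can be sketched with nothing more than regular expressions over the raw bytes. This is a rough illustration, not a real PDF parser: the names `fingerprint` and `structural_similarity` and the scoring formula are made up for this example, and it will miss pages that live inside compressed object streams.

```python
import re

# Matches "/Type /Page" but not "/Type /Pages" (the page-tree node).
PAGE_RE = re.compile(rb"/Type\s*/Page(?![A-Za-z])")
# Matches e.g. "/MediaBox [0 0 612 792]" and captures the name and numbers.
BOX_RE = re.compile(
    rb"/(MediaBox|CropBox|BleedBox|TrimBox)\s*\[\s*([-\d.\s]+?)\s*\]"
)


def fingerprint(data: bytes) -> dict:
    """Collect a page count and the set of page-box values from raw PDF bytes."""
    boxes = set()
    for name, numbers in BOX_RE.findall(data):
        try:
            values = tuple(float(n) for n in numbers.split())
        except ValueError:
            continue  # skip anything that is not a plain list of numbers
        boxes.add((name.decode("ascii"), values))
    return {"pages": len(PAGE_RE.findall(data)), "boxes": boxes}


def structural_similarity(a: bytes, b: bytes) -> float:
    """Hypothetical score in [0, 1]: page-count ratio averaged with box overlap."""
    fa, fb = fingerprint(a), fingerprint(b)
    most = max(fa["pages"], fb["pages"])
    page_score = min(fa["pages"], fb["pages"]) / most if most else 1.0
    union = fa["boxes"] | fb["boxes"]
    box_score = len(fa["boxes"] & fb["boxes"]) / len(union) if union else 1.0
    return (page_score + box_score) / 2
```

To use it, read each file with `open(path, "rb").read()` and pass the two byte strings to `structural_similarity`. A high score only says the documents have a similar shape, not similar content.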
Finally, you may be able to cheat by converting them to a raster format with Ghostscript or another library and then comparing them that way. If you convert to a low resolution, like 100px, the rough details of similar documents might look similar.
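Here is one way the raster trick might look, assuming Ghostscript is installed as `gs` and using its `pgmraw` device so the output can be read without an image library. The function names, the 12 dpi default (roughly 100px across a Letter page), and the similarity formula are choices made for this sketch, and it only looks at the first page.

```python
import re
import subprocess

# Binary PGM header: magic, width, height, maxval, with optional # comments.
PGM_HEADER = re.compile(
    rb"P5(?:\s|#[^\n]*\n)+(\d+)(?:\s|#[^\n]*\n)+(\d+)(?:\s|#[^\n]*\n)+(\d+)\s"
)


def ghostscript_command(pdf_path, out_path, dpi=12):
    """Build a Ghostscript call that renders page 1 as a small grayscale PGM."""
    return [
        "gs", "-q", "-dSAFER", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=pgmraw", f"-r{dpi}",
        "-dFirstPage=1", "-dLastPage=1",
        f"-sOutputFile={out_path}", pdf_path,
    ]


def read_pgm(data):
    """Parse a binary (P5) PGM with 8-bit samples into (width, height, pixels)."""
    match = PGM_HEADER.match(data)
    if match is None:
        raise ValueError("not a binary PGM file")
    width, height, maxval = (int(group) for group in match.groups())
    if maxval > 255:
        raise ValueError("only 8-bit PGM files are handled in this sketch")
    start = match.end()
    return width, height, data[start:start + width * height]


def render_first_page(pdf_path, out_path, dpi=12):
    """Run Ghostscript and load the resulting raster."""
    subprocess.run(ghostscript_command(pdf_path, out_path, dpi), check=True)
    with open(out_path, "rb") as handle:
        return read_pgm(handle.read())


def raster_similarity(a, b):
    """1.0 for identical rasters; 0.0 for opposite ones or mismatched sizes."""
    (wa, ha, pa), (wb, hb, pb) = a, b
    if (wa, ha) != (wb, hb) or not pa:
        return 0.0
    diff = sum(abs(x - y) for x, y in zip(pa, pb))
    return 1.0 - diff / (255 * len(pa))
```

Something like `raster_similarity(render_first_page("a.pdf", "a.pgm"), render_first_page("b.pdf", "b.pgm"))` would then give a rough visual score. Treating different page sizes as a score of 0.0 is a shortcut; scaling both pages to a common size first would be gentler.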
If you've never worked directly with PDF, it's not scary! It's just a text file (after you decompress it) which you can more-or-less parse line-by-line. I discuss PDF more in the HTML document to PDF answer.
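To give a feel for the "decompress it, then read it" idea, here is a minimal sketch that inflates Flate-compressed streams with Python's `zlib` and pulls out the `(...) Tj` strings. It is deliberately naive: it ignores other filters, `TJ` arrays, nested parentheses, and font encodings, so the text it finds may be incomplete or unreadable for many real files. The function names are invented for the example.

```python
import re
import zlib

# Everything between a "stream" keyword and the next "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A literal string followed by the Tj (show text) operator.
TJ_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def inflate_streams(data: bytes):
    """Yield every stream body, inflated when it is valid zlib data."""
    for match in STREAM_RE.finditer(data):
        body = match.group(1)
        try:
            # decompressobj tolerates the newline left before "endstream".
            yield zlib.decompressobj().decompress(body)
        except zlib.error:
            yield body  # not Flate-compressed, or uses another filter


def shown_strings(data: bytes) -> list:
    """Return the strings passed to Tj, decoded byte-for-byte as Latin-1."""
    return [
        text.decode("latin-1")
        for stream in inflate_streams(data)
        for text in TJ_RE.findall(stream)
    ]


def word_overlap(a: list, b: list) -> float:
    """Jaccard overlap of the lowercase words in two lists of strings."""
    words_a = set(" ".join(a).lower().split())
    words_b = set(" ".join(b).lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0
```

Calling `word_overlap(shown_strings(pdf_a), shown_strings(pdf_b))` on the bytes of two files gives a crude word-level score without comparing the documents line by line.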
#2
You can tell if two files are different by running a hash on them (like md5) but that won't tell you the degree of similarity between them.
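For completeness, the hash check is only a few lines with Python's `hashlib`. The helper names here are just for illustration, and as noted above the result is a yes/no answer, not a similarity score.

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=65536):
    """Hash a file in chunks so large PDFs do not need to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(path_a, path_b):
    """True only when the two files are byte-for-byte identical."""
    return file_digest(path_a) == file_digest(path_b)
```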
There are binary diff programs that can tell you where two binary files differ, with reasonable results, but many binary files, especially document containers, can show a lot of binary difference when there are only minor internal content differences.
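A small experiment shows why this happens with compressed containers. The sketch below compares two nearly identical byte strings before and after `zlib` compression, using `difflib.SequenceMatcher` as a stand-in for a binary diff tool; the helper names are made up, and the exact ratios depend on the input.

```python
import difflib
import zlib


def byte_similarity(a: bytes, b: bytes) -> float:
    """Similarity ratio in [0, 1] over the raw byte sequences."""
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def compare_raw_and_compressed(a: bytes, b: bytes):
    """Return (ratio of the raw bytes, ratio of their compressed forms)."""
    raw = byte_similarity(a, b)
    packed = byte_similarity(zlib.compress(a), zlib.compress(b))
    return raw, packed


def sample_pair():
    """Two texts that differ in a single word near the start."""
    original = b" ".join(b"line %d of the report" % i for i in range(60))
    edited = original.replace(b"line 3 of", b"page 3 of", 1)
    return original, edited
```

Running `compare_raw_and_compressed(*sample_pair())` gives a raw ratio very close to 1, while the compressed ratio tends to be noticeably lower, because a small change early in the input alters the compressed bytes that follow it.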
I'm not familiar with the details of the PDF format. Maybe somebody else knows of a built-in mechanism that might help.
#3
A PDF is not just a text file. It's a binary dump of a B-tree. With compressed objects, you can also get object data compressed inside other binary objects, so you cannot see it.
If you want to do low-level text manipulation, you really need to use a decent tool. Acrobat 9.0 has a menu option to browse the internal PDF structure, or you can use something like iText.