I have been able to copy the raw data from an otherwise inaccessible USB drive into a monolithic file of about 250MB. Somewhere in that blob of bytes are about 40 Word documents.
Where do I find documentation about the internal structure of Word documents such that I can parse the byte-stream, recognise where a Word doc starts and finishes and extract a copy?
Are there any libraries in any programming language specific to this task?
Can anyone suggest an already existing software solution to this issue?
2 solutions
#1
5
Two approaches:
You can mount files as volumes in Linux. Provided your binary blob isn't too corrupted, you'll probably be able to walk the filesystem and find out where your files are located. Is (was) it a FAT partition or NTFS?
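To answer the FAT-or-NTFS question before trying to mount anything, you can peek at the first sector of the blob. This is only a rough sketch: the function name `guess_filesystem` is made up for illustration, and it assumes the blob starts at the volume's boot sector. If the copy is of the whole stick, the first sector may be an MBR with a partition table instead, and the real boot sector sits further in.

```python
def guess_filesystem(sector: bytes) -> str:
    """Best-effort guess of the filesystem from a 512-byte boot sector.

    Assumption: `sector` is the first sector of the volume, not of the
    whole disk. Only the well-known identification strings are checked.
    """
    if len(sector) < 512:
        return "too short"
    if sector[510:512] != b"\x55\xaa":
        return "no boot signature"
    if sector[3:11] == b"NTFS    ":      # OEM ID field
        return "NTFS"
    if sector[3:11] == b"EXFAT   ":      # exFAT file system name field
        return "exFAT"
    if sector[82:90] == b"FAT32   ":     # FAT32 type string
        return "FAT32"
    if sector[54:59] in (b"FAT12", b"FAT16"):  # FAT12/16 type string
        return sector[54:59].decode("ascii")
    return "unknown (possibly an MBR; check the partition table)"
```

Feed it the first 512 bytes read from the blob in binary mode. If it reports a known filesystem, a read-only loop mount (`mount -o loop,ro`) is worth trying first.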
If that doesn't work, I'd look for this string of bytes:
D0 CF 11 E0 A1 B1 1A E1
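These are the "magic bytes" at the start of the OLE2 compound file format used by the older binary Office documents (.doc, .xls, .ppt). They might occur randomly in other data, but it's a start. You're going to run into MAJOR issues if the files are fragmented, because then the bytes of one document are not contiguous in the blob.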
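A minimal sketch of that scan, for what it's worth. The name `find_signature` and the `alignment` parameter are my own. The 512-byte default is an assumption: files normally start on a sector boundary, so an unaligned hit is more likely to be a coincidence.

```python
import mmap

# The signature quoted above: D0 CF 11 E0 A1 B1 1A E1
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

def find_signature(path, signature=OLE2_MAGIC, alignment=512):
    """Return the offsets in the file at `path` where `signature` occurs.

    Hits whose offset is not a multiple of `alignment` are skipped
    (assumption: files start on sector boundaries). Pass alignment=1
    to keep every hit.
    """
    hits = []
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as blob:
        pos = blob.find(signature)
        while pos != -1:
            if pos % alignment == 0:
                hits.append(pos)
            pos = blob.find(signature, pos + 1)
    return hits
```

Each offset it returns is only a candidate start of a document. Where the document ends, and whether its sectors are contiguous, still has to be worked out from the compound file header.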
Also, try recreating pieces of the document(s) in Word as they were, save them to a file, and extract chunks to search for in the blob (using a binary grep or whatever). Provided you have info from all parts of a file, you should be able to work out WHERE in the blob they are. Piecing it back into a working DOC binary seems far-fetched, but recovering the rest of the text shouldn't be impossible.
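If you remember a phrase from one of the documents, you can also search for it directly. Here is a sketch, assuming the text is stored either as 8-bit Windows-1252 or as UTF-16LE (as far as I know, binary .doc files use one or the other for a given run of text). The helper name `find_phrase` is hypothetical.

```python
def find_phrase(blob: bytes, phrase: str):
    """Return sorted (offset, encoding) pairs where `phrase` occurs in `blob`.

    Assumption: text is stored as cp1252 or UTF-16LE. Both are tried.
    """
    hits = []
    for encoding in ("cp1252", "utf-16-le"):
        try:
            needle = phrase.encode(encoding)
        except UnicodeEncodeError:
            continue  # phrase cannot be represented in this encoding
        pos = blob.find(needle)
        while pos != -1:
            hits.append((pos, encoding))
            pos = blob.find(needle, pos + 1)
    return sorted(hits)
```

The offsets tell you roughly where a document's text lives, even when the start of the file is somewhere else because of fragmentation.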
#2
2
The Apache POI project has a library for reading and writing all kinds of MS Office docs. If the files are in the new XML-based OOXML format (.docx), you'll be looking for the start of a zip file, as the XML is compressed.
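For the zip case, a rough sketch of what "looking for the start of a zip file" could mean in practice: scan for the local file header signature `PK\x03\x04` and read the entry name that follows it. The function names are made up. Treating an entry called `[Content_Types].xml` as the start of a package is an assumption: Word usually writes it first, but nothing guarantees that.

```python
import struct

ZIP_LOCAL_HEADER = b"PK\x03\x04"

def find_zip_entries(blob: bytes):
    """Yield (offset, filename) for each plausible ZIP local file header."""
    pos = blob.find(ZIP_LOCAL_HEADER)
    while pos != -1:
        if pos + 30 <= len(blob):
            # The filename length is a little-endian uint16 at header
            # offset 26. The name itself starts at header offset 30.
            (name_len,) = struct.unpack_from("<H", blob, pos + 26)
            name = blob[pos + 30 : pos + 30 + name_len]
            if 0 < name_len == len(name):
                yield pos, name.decode("utf-8", errors="replace")
        pos = blob.find(ZIP_LOCAL_HEADER, pos + 1)

def likely_docx_starts(blob: bytes):
    """Offsets of entries named like the usual first member of an OOXML file.

    Assumption: the package begins with "[Content_Types].xml".
    """
    return [off for off, name in find_zip_entries(blob)
            if name == "[Content_Types].xml"]
```

Entries named `word/document.xml` shortly after such an offset suggest a Word document rather than a spreadsheet or presentation.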