I want to write a program in C (only C, not C++ or Java) that reads .doc, .docx, and .pdf files, and I want to make it available on GitHub for anyone who needs it. I started with the .doc format. I found that if I open a .doc file in plain Notepad, it shows all of the text along with some extra content that is easy to trim. So I wrote a simple C program that reads the .doc file in both 'r' and 'rb' mode, but both times it gives me only 5-9 characters of the file, and those are not readable either. I don't know why this is happening. Any comment or discussion would be very helpful.
Here is the link to the GitHub source code. Please help me complete all three formats.
3 Solutions
#1
To answer your specific question: your little application stops reading because it mistakenly thinks there is an EOF character in your file.
Look at your code:
    char ch;
    int nol=0, not=0, nob=0, noc=0;
    FILE *fp;
    fp = fopen("file.doc","rb");
    while(1)
    {
        ch = fgetc(fp);
        if(ch==EOF)
        {
            break;
        }
You store the result of fgetc(fp) in a variable of type char, which is a single-byte variable. However, the return type of fgetc is very purposefully int, not char.
fgetc always returns a non-negative result in the range 0 to 255 (the byte read, as an unsigned char converted to int), except when you reach the end of the file or a read error occurs, in which case it returns EOF, a negative value that is often implemented as -1.
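If you read a byte of value 255 and store it in an int, everything is OK: it is stored as the value 255 and your loop can continue. If you store the result in a char, then on platforms where char is signed the byte 255 becomes -1, which compares equal to EOF, and your loop stops. (Where char is unsigned, the opposite bug appears: the comparison with EOF never succeeds and the loop never ends.)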
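A minimal sketch of the fix, assuming all you want for now is to walk the whole file: keep the result of fgetc in an int and only then compare it with EOF. The helper name count_bytes is hypothetical and is not part of the original program.

```c
#include <stdio.h>

/* Counts the bytes in a file by reading it with fgetc().
   Returns the byte count, or -1 if the file cannot be opened.
   count_bytes is a hypothetical helper name, not from the original code. */
long count_bytes(const char *path)
{
    FILE *fp = fopen(path, "rb");    /* binary mode: no newline translation */
    if (fp == NULL)
        return -1;

    long count = 0;
    int ch;                          /* int, not char: must hold 0..255 and EOF */
    while ((ch = fgetc(fp)) != EOF)
        count++;

    fclose(fp);
    return count;
}
```

With ch declared as int, a 0xFF byte in the middle of the file no longer ends the loop, so count_bytes("file.doc") should match the file size reported by the operating system.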
#2
Don't expect to get anywhere with this idea. .doc is a huge binary file format that is inhumanly complicated to parse. With that said, Cubia mentioned the offset where the text section of the document starts. I'm not familiar with the details of the format, but if the raw text is contained in one location, use fseek to get at it and stop when you reach the end. This won't be the case for the other formats because they are very different.
.docx and .pdf should be easier to parse because they are more modern formats. If you want to read anything from a docx you need to read from a zip file with a ton of xml in it and use a parser to figure out which text you want.
.pdf should be the easiest of the three because you might be able to find a library out there that can almost do what you want.
As for why you are getting strange output from your program, remember that .doc is a binary format and the vast majority of the data is garbage from your perspective. Dumping it to the terminal will yield readable text but also a bunch of control characters that can mess up your terminal.
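One way to keep the control characters away from the terminal is to print only the printable bytes. The following is a rough sketch under that assumption, not a .doc parser; copy_printable is a hypothetical helper name, and it relies on the default "C" locale, where isprint accepts only printable ASCII.

```c
#include <ctype.h>
#include <stdio.h>

/* Copies only printable bytes (plus newline and tab) from in to out.
   Returns the number of bytes that were skipped.
   copy_printable is a hypothetical helper name used for illustration. */
long copy_printable(FILE *in, FILE *out)
{
    long skipped = 0;
    int ch;                          /* int so EOF stays distinguishable */
    while ((ch = fgetc(in)) != EOF) {
        if (isprint(ch) || ch == '\n' || ch == '\t')
            fputc(ch, out);          /* safe to show on a terminal */
        else
            skipped++;               /* control or non-ASCII byte: drop it */
    }
    return skipped;
}
```

Calling copy_printable(fp, stdout) on a file opened with "rb" gives output similar to what Notepad shows, minus the bytes that would confuse the terminal. It will still print stray printable bytes that happen to sit inside the binary structures.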
As a last note - don't try to read docx files directly using fread - they are ZIP-compressed, so you won't recover the text unaltered. Take a look at libarchive. Also - expect to have to read the document specifications. docx is Microsoft's Office Open XML format (standardized as ECMA-376), which is separate from the OpenOffice (OpenDocument) format. See this and some PDF specification documents (there are multiple versions).
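Since the three formats need completely different handling, a first step could be to look at the leading bytes of the file and decide which path to take. This is only a sketch: sniff_format is a hypothetical helper, and it checks just the well-known signatures (the OLE2 compound-file header used by legacy .doc, the ZIP local-file header used by .docx, and the "%PDF-" marker). A ZIP signature alone does not prove the archive is a Word document.

```c
#include <stddef.h>
#include <string.h>

/* Guesses a container format from the first bytes of a file.
   sniff_format is a hypothetical helper; it returns a static string. */
const char *sniff_format(const unsigned char *buf, size_t len)
{
    /* OLE2 compound file signature, used by legacy .doc files */
    static const unsigned char ole[] = {0xD0, 0xCF, 0x11, 0xE0,
                                        0xA1, 0xB1, 0x1A, 0xE1};
    if (len >= 8 && memcmp(buf, ole, 8) == 0)
        return "ole2";
    /* ZIP local file header signature; .docx is a ZIP archive */
    if (len >= 4 && memcmp(buf, "PK\x03\x04", 4) == 0)
        return "zip";
    /* PDF files begin with "%PDF-" followed by a version number */
    if (len >= 5 && memcmp(buf, "%PDF-", 5) == 0)
        return "pdf";
    return "unknown";
}
```

Read the first 8 bytes of the file with fread into an unsigned char buffer and pass them in; a "zip" result means the file should go to an archive library such as libarchive rather than being scanned as raw bytes.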
#3
Look at the .doc file type as a txt file but with extra non-printable characters before, in the middle of, and after your content. These non-printable characters are used for defining special formatting, metadata, and other information.
With this said, all .doc files follow a certain structure.
If you open two different .doc files in a hex editor, you will notice that the text content of both files starts at an offset of 0xA00 (2560 bytes) from the beginning of the file. This means that when you first open your file, you can ignore the first 2560 bytes (take a look at the fseek() function).
From this point on, you can read the contents of your file until you reach '\0'.
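The approach described above (skip to 0xA00, then read until '\0') can be sketched as follows. The offset is the one observed in this answer and is an assumption, not something the format guarantees; read_doc_text and DOC_TEXT_OFFSET are hypothetical names. Files that store their text as 16-bit characters will contain '\0' in every other byte, so this would stop after the first character for them.

```c
#include <stdio.h>

/* Offset quoted in the answer; an assumption, not guaranteed by the format. */
#define DOC_TEXT_OFFSET 0xA00L

/* Reads bytes starting at DOC_TEXT_OFFSET into buf until a '\0' byte,
   end of file, or a full buffer. Always NUL-terminates buf.
   Returns the number of bytes stored, or -1 on open/seek failure.
   read_doc_text is a hypothetical helper name. */
long read_doc_text(const char *path, char *buf, size_t cap)
{
    if (cap == 0)
        return -1;

    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    if (fseek(fp, DOC_TEXT_OFFSET, SEEK_SET) != 0) {
        fclose(fp);
        return -1;
    }

    size_t n = 0;
    int ch;                              /* int so EOF is not confused with 0xFF */
    while (n + 1 < cap && (ch = fgetc(fp)) != EOF && ch != '\0')
        buf[n++] = (char)ch;
    buf[n] = '\0';

    fclose(fp);
    return (long)n;
}
```

A call such as read_doc_text("file.doc", buf, sizeof buf) with a char buf[4096] is enough to try it on a small document. Comparing the result with what a hex editor shows at 0xA00 tells you quickly whether the assumption holds for your files.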
I have not seen the implementation of a .pdf or a .docx file, but you can open both files with a hex editor and figure out what pattern you can use to isolate the important contents of the files.
Hope this helps.
EDIT: You can always find documentation on the different file formats that you want to manipulate. Here are the specifications of the PDF file type:
http://www.adobe.com/devnet/pdf/pdf_reference.html
http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf