How do I debug a corrupted docx file?

Time: 2023-01-15 08:31:18

I have an issue where .doc and .pdf files are coming out OK but a .docx file is coming out corrupt.

In order to solve that I am trying to debug why the .docx is corrupt.

I learned that the docx format is much stricter with regard to extra characters than either .pdf or .doc. Therefore I have searched the various xml files WITHIN the docx file looking for invalid XML. But I can't find any. It all validates fine.
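A .docx file is a ZIP container, so valid XML inside it does not rule out damage at the container level. The following is a minimal sketch in Python (standard library only) that checks both layers: whether the archive opens and every entry passes its CRC check, and whether each XML part is well formed. The function name check_docx is hypothetical, and this only tests well-formedness, not conformance to the Open XML schemas.

```python
import io
import zipfile
import zlib
import xml.etree.ElementTree as ET
from typing import List


def check_docx(data: bytes) -> List[str]:
    """Return human-readable problems found in the bytes of a .docx file."""
    try:
        archive = zipfile.ZipFile(io.BytesIO(data))
    except zipfile.BadZipFile as exc:
        # The ZIP directory sits at the end of the file, so a truncated
        # or padded upload often fails right here.
        return ["not a readable ZIP container: %s" % exc]

    problems = []
    with archive:
        for info in archive.infolist():
            try:
                # read() decompresses the entry and verifies its CRC-32.
                payload = archive.read(info)
            except (zipfile.BadZipFile, zlib.error, EOFError) as exc:
                problems.append("%s: damaged ZIP entry (%s)" % (info.filename, exc))
                continue
            if info.filename.endswith((".xml", ".rels")):
                try:
                    ET.fromstring(payload)
                except ET.ParseError as exc:
                    problems.append("%s: XML is not well formed (%s)" % (info.filename, exc))
    return problems
```

Reading the uploaded file in binary mode and passing its bytes to check_docx returns an empty list when both layers look intact. In that case the damage is more likely in bytes that extraction tools tolerate but Word does not.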

Could anyone suggest directions for me to investigate now?

UPDATE:

The full listing of files inside the folder is as follows:

/_rels
    .rels

/customXml
    /_rels
        .rels
    item1.xml
    itemProps1.xml

/docProps
    app.xml
    core.xml

/word
    /_rels
        document.xml.rels
    /media
        image1.jpeg
    /theme
        theme1.xml
    document.xml
    fontTable.xml
    numbering.xml
    settings.xml
    styles.xml
    stylesWithEffects.xml
    webSettings.xml

[Content_Types].xml

UPDATE 2:

I should also have mentioned that the cause of the corruption is almost certainly a bad binary file POST on my part.

Why are .docx files corrupted by a binary POST when .doc and .pdf files are fine?

UPDATE 3:

I have tried the demo versions of various docx repair tools. They all seem to repair the file OK but give no clue as to the cause of the error.

My next step is to compare the contents of the corrupted file with the repaired version.
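One way to make that comparison less manual is to compare the two packages part by part. This is a rough sketch in Python (standard library only), assuming both files can still be opened as ZIP archives. The names part_crcs and differing_parts are hypothetical, and the CRC values are the ones recorded in each archive's directory, so this only narrows down which parts changed, not why.

```python
import io
import zipfile
from typing import Dict


def part_crcs(data: bytes) -> Dict[str, int]:
    """Map each part name in a package to the CRC-32 stored in its ZIP directory."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return {info.filename: info.CRC for info in archive.infolist()}


def differing_parts(corrupted: bytes, repaired: bytes) -> Dict[str, str]:
    """Describe every part that is missing from one file or differs between them."""
    before = part_crcs(corrupted)
    after = part_crcs(repaired)
    report = {}
    for name in sorted(set(before) | set(after)):
        if name not in after:
            report[name] = "only in corrupted file"
        elif name not in before:
            report[name] = "only in repaired file"
        elif before[name] != after[name]:
            report[name] = "content differs"
    return report
```

An empty result means the repair tool left every part untouched, which would point at the ZIP container itself rather than at any XML part.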

If anybody knows of a docx repair tool that gives a decent error message I'd appreciate hearing about it. In fact I might post that as a separate question.

UPDATE 4 (2017)

I never solved this problem. I have tried all the tools suggested in the answers below but none of them worked for me.

I have since progressed a little further and found a block of 0000 missing when opening the .docx in Sublime Text. More details in the new question here: What could be causing this corruption in .docx files during httpwebrequest?
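A missing block like that can be located exactly by comparing the bytes that were sent with the bytes that arrived, instead of eyeballing them in an editor. The sketch below is in Python (standard library only). The names first_difference and describe_difference are hypothetical, and it assumes both the original and the received file are available on disk.

```python
from typing import Optional


def first_difference(original: bytes, received: bytes) -> Optional[int]:
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (sent, got) in enumerate(zip(original, received)):
        if sent != got:
            return offset
    if len(original) != len(received):
        # One file is a prefix of the other: they diverge where the shorter ends.
        return min(len(original), len(received))
    return None


def describe_difference(original: bytes, received: bytes, context: int = 8) -> str:
    """Summarise the first mismatch as hex so dropped or added bytes stand out."""
    offset = first_difference(original, received)
    if offset is None:
        return "files are identical"
    return "first difference at offset %d: original=%s received=%s (lengths %d vs %d)" % (
        offset,
        original[offset:offset + context].hex(),
        received[offset:offset + context].hex(),
        len(original),
        len(received),
    )
```

If the lengths differ by exactly the size of the missing block, the upload code is probably dropping bytes rather than altering them.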

4 solutions

#1


3  

Usually, when there is an error in a particular XML file, Word tells you the file and line where the error occurs. So I believe the problem comes from either the zipping of the file or the folder structure.

Here is the folder structure of a Word file:

The .docx format is a zipped file that contains the following folders:

+--docProps
|  +  app.xml
|  \  core.xml
+  res.log
+--word //this folder contains most of the files that control the content of the document
|  +  document.xml //the actual content of the document
|  +  endnotes.xml
|  +  fontTable.xml
|  +  footer1.xml //contains the elements in the footer of the document
|  +  footnotes.xml
|  +--media //this folder contains all images embedded in the document
|  |  \  image1.jpeg
|  +  settings.xml
|  +  styles.xml
|  +  stylesWithEffects.xml
|  +--theme
|  |  \  theme1.xml
|  +  webSettings.xml
|  \--_rels
|     \  document.xml.rels //this file tells Word where the images are located
+  [Content_Types].xml
\--_rels
   \  .rels

It seems that you only have the contents of the word folder, is that right? If this doesn't help, could you please either send the corrupted docx or post the folder structure inside your zip?
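The structure check can also be scripted. Here is a small sketch in Python (standard library only) that reports which of the usual top-level parts are missing and whether everything was zipped inside one extra wrapping folder, a common mistake when re-zipping by hand. The names EXPECTED_PARTS, missing_parts and wrapping_folder are hypothetical. The list covers only the parts from the tree above that a Word-produced file normally has, and word/document.xml is the conventional location of the main document rather than a strict requirement.

```python
import io
import zipfile
from typing import List, Optional

# Parts a Word-produced .docx normally has at these exact paths (assumption:
# the main document lives at its conventional location).
EXPECTED_PARTS = ("[Content_Types].xml", "_rels/.rels", "word/document.xml")


def missing_parts(data: bytes) -> List[str]:
    """Return the expected part names that are absent from the archive."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = set(archive.namelist())
    return [part for part in EXPECTED_PARTS if part not in names]


def wrapping_folder(data: bytes) -> Optional[str]:
    """Return the folder name if every entry sits under one top-level folder."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = archive.namelist()
    top_level = {name.split("/", 1)[0] for name in names if "/" in name}
    loose_files = [name for name in names if "/" not in name]
    if not loose_files and len(top_level) == 1:
        return top_level.pop()
    return None
```

A correctly built package has [Content_Types].xml at the root, so wrapping_folder returns None for it. Any other result suggests the folder was zipped instead of its contents.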

#2


3  

I used the "Open XML SDK 2.5 Productivity Tool" (http://www.microsoft.com/en-us/download/details.aspx?id=30425) to find a problem with a broken hyperlink reference.

You have to download/install the SDK first, then the tool. The tool will open and analyze the document for problems.

#3


1  

Many years late, but I found this which actually worked for me. (From https://msdn.microsoft.com/en-us/library/office/bb497334.aspx)

(wordDoc is a WordprocessingDocument)

using DocumentFormat.OpenXml.Validation;

        try
        {
            // Check the opened document against the Open XML format rules
            // and print every error the validator reports.
            var validator = new OpenXmlValidator();
            var count = 0;
            foreach (var error in validator.Validate(wordDoc))
            {
                count++;
                Console.WriteLine("Error " + count);
                Console.WriteLine("Description: " + error.Description);
                Console.WriteLine("ErrorType: " + error.ErrorType);
                Console.WriteLine("Node: " + error.Node);
                Console.WriteLine("Path: " + error.Path.XPath);
                Console.WriteLine("Part: " + error.Part.Uri);
                Console.WriteLine("-------------------------------------------");
            }

            // Total number of validation errors found.
            Console.WriteLine("count={0}", count);
        }
        catch (Exception ex)
        {
            // Validation itself can throw on a badly damaged document.
            Console.WriteLine(ex.Message);
        }

#4


-2  

A web-based docx validator worked for me: http://ucd.eeonline.org/validator/index.php
