How do I convert a docx document to HTML using PHP?

Time: 2022-10-30 13:49:21

I want to be able to upload an MS Word document and publish it as a page on my site.

Is there any way to accomplish this?

5 solutions

#1


20  

// FUNCTION :: read a docx file and return its text as a string
function readDocx($filePath) {
    // Create new ZIP archive
    $zip = new ZipArchive;
    $dataFile = 'word/document.xml';
    // Open received archive file
    if (true === $zip->open($filePath)) {
        // If done, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false) {
            // If found, read it to the string
            $data = $zip->getFromIndex($index);
            // Close archive file
            $zip->close();
            if ($data === false || $data === '') {
                return "";
            }
            // Load XML from a string, skipping errors and warnings.
            // loadXML() is an instance method; calling it statically fails on PHP 8.
            // LIBXML_NOENT is left out so entities in an uploaded file are not expanded.
            $xml = new DOMDocument();
            if (!$xml->loadXML($data, LIBXML_NOERROR | LIBXML_NOWARNING)) {
                return "";
            }
            // Return data without XML formatting tags
            return strip_tags($xml->saveXML());
        }
        $zip->close();
    }
    // In case of failure return empty string
    return "";
}

ZipArchive and DOMDocument both ship with PHP (as the zip and dom extensions), so you don't need to install/include/require third-party libraries. On some installations the zip extension has to be enabled first.
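
Note that readDocx() returns plain text, not HTML. As a rough sketch of how the same ZipArchive and DOMDocument approach could emit markup instead, the function below wraps each Word paragraph in a <p> element. The name docxToHtml() is made up for this example, and it assumes the document uses the standard (transitional) WordprocessingML namespace. Formatting, tables, images and lists are ignored.

```php
<?php
// Sketch only: turn the paragraphs of a .docx file into simple HTML.
// docxToHtml() is a hypothetical helper, not part of PHP or any library.
function docxToHtml(string $filePath): string
{
    $zip = new ZipArchive();
    if ($zip->open($filePath) !== true) {
        return '';
    }
    // The main document body lives in word/document.xml inside the archive
    $data = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($data === false || $data === '') {
        return '';
    }

    $doc = new DOMDocument();
    if (!$doc->loadXML($data, LIBXML_NONET | LIBXML_NOERROR | LIBXML_NOWARNING)) {
        return '';
    }

    // Assumption: the file uses the transitional WordprocessingML namespace
    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace(
        'w',
        'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
    );

    $html = '';
    foreach ($xpath->query('//w:body//w:p') as $paragraph) {
        // A paragraph's text is split across one or more w:t elements
        $text = '';
        foreach ($xpath->query('.//w:t', $paragraph) as $textNode) {
            $text .= $textNode->textContent;
        }
        if ($text !== '') {
            // Escape the text so document content cannot inject markup
            $html .= '<p>' . htmlspecialchars($text, ENT_QUOTES, 'UTF-8') . "</p>\n";
        }
    }
    return $html;
}
```

Usage would be something like echo docxToHtml($uploadedPath); where $uploadedPath stands for wherever your upload handler stored the file.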

#2


3  

One may use PHPDocX.

It supports practically all HTML and CSS styles. Moreover, you may use templates to add extra formatting to your HTML via the replaceTemplateVariableByHTML method.

The HTML methods of PHPDocX also allow for the direct use of Word styles. You may use something like this:

$docx->embedHTML($myHTML, array('tableStyle' => 'MediumGrid3-accent5PHPDOCX'));

This makes all of your tables use the MediumGrid3-accent5 Word style. The embedHTML method, as well as its template counterpart (replaceTemplateVariableByHTML), preserves inheritance, meaning that you may use a predefined Word style and override any of its properties with CSS.

You may also extract selected parts of your HTML using jQuery-style selectors.

#3


3  

This might be helpful for you: How to Convert Docx to HTML

#4


1  

You can convert Word docx documents to HTML using the Print2flash library. Here is a PHP excerpt from my client's site which converts a document to HTML:

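// The COM class is only available in PHP on Windows (com_dotnet extension)
include("const.php"); // from the Print2flash SDK; presumably defines the HTML5 constant
$p2fServ = new COM("Print2Flash4.Server2");
$p2fServ->DefaultProfile->DocumentType=HTML5;
$p2fServ->ConvertFile($wordfile,$htmlFile);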

It converts the document whose path is given in the $wordfile variable to an HTML page file at the path given in the $htmlFile variable. All formatting, hyperlinks and charts are retained. You can get the required const.php file, together with a fuller sample, from the Print2flash SDK.

#5


0  

If you don't mind using a REST API, you can use:

  • Apache Tika, a proven open-source leader for text extraction.
  • RawText, if you don't want the hassle of configuration and want a ready-to-go solution. Note that it is not free.

Sample code for RawText:

$result = $rawText->parse($your_file);
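
For the Apache Tika option, a minimal sketch of calling a Tika server from PHP could look like this. It assumes a tika-server instance is already running and reachable (http://localhost:9998 is its usual default), that its /tika endpoint accepts a PUT upload and returns HTML when sent Accept: text/html, and that PHP's curl extension is enabled. tikaToHtml() is a hypothetical helper name, not part of Tika.

```php
<?php
// Sketch only: ask a running Apache Tika server to render a document as HTML.
// tikaToHtml() and the default URL are assumptions made for this example.
function tikaToHtml(string $filePath, string $tikaUrl = 'http://localhost:9998/tika'): string
{
    // Read the document; give up quietly if it cannot be read
    $body = @file_get_contents($filePath);
    if ($body === false) {
        return '';
    }

    $ch = curl_init($tikaUrl);
    curl_setopt_array($ch, [
        CURLOPT_CUSTOMREQUEST  => 'PUT',       // upload the raw file as the request body
        CURLOPT_POSTFIELDS     => $body,
        CURLOPT_HTTPHEADER     => [
            'Accept: text/html',               // request HTML rather than plain text
            'Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document',
        ],
        CURLOPT_RETURNTRANSFER => true,        // return the response instead of printing it
        CURLOPT_TIMEOUT        => 30,
    ]);

    $response = curl_exec($ch);
    $status   = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // Treat anything other than a 200 response as a failure
    return ($response !== false && $status === 200) ? $response : '';
}
```

The function returns an empty string on any failure, in the same spirit as readDocx() above, so the caller can check for '' before writing the result into a page.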
