Reading/Writing MS Word Files with PHP

Time: 2022-09-16 09:48:27

Is it possible to read and write Word (2003 and 2007) files in PHP without using a COM object? I know that I can:

$file = fopen('c:\file.doc', 'w+');
fwrite($file, $text);
fclose($file); // fclose() needs the handle returned by fopen()

but Word will read it as an HTML file, not a native .doc file.

15 answers

#1


27  

Reading binary Word documents would involve creating a parser according to the published file format specifications for the DOC format. I don't think this is a really feasible solution.

You could use the Microsoft Office XML formats for reading and writing Word files - this is compatible with the 2003 and 2007 version of Word. For reading you have to ensure that the Word documents are saved in the correct format (it's called Word 2003 XML-Document in Word 2007). For writing you just have to follow the openly available XML schema. I've never used this format for writing out Office documents from PHP, but I'm using it for reading in an Excel worksheet (naturally saved as XML-Spreadsheet 2003) and displaying its data on a web page. As the files are plainly XML data it's no problem to navigate within and figure out how to extract the data you need.
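
As a rough illustration of the reading side, here is a minimal sketch that pulls paragraph text out of a Word 2003 XML document with PHP's DOM extension. The helper name `wordMlToText` and the inline sample are made up for this example; the namespace URI is the Word 2003 WordprocessingML one. Real files saved by Word contain far more markup, so treat this as a starting point only.

```php
<?php
// Hypothetical helper: extracts paragraph text from a Word 2003 XML string.
function wordMlToText(string $xml): string
{
    $dom = new DOMDocument();
    if ($xml === '' || !@$dom->loadXML($xml)) {
        return '';
    }

    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', 'http://schemas.microsoft.com/office/word/2003/wordml');

    $paragraphs = [];
    foreach ($xpath->query('//w:p') as $p) {
        // A paragraph's text is spread over one or more w:t elements.
        $text = '';
        foreach ($xpath->query('.//w:t', $p) as $t) {
            $text .= $t->textContent;
        }
        $paragraphs[] = $text;
    }
    return implode("\n", $paragraphs);
}

// Tiny inline sample standing in for a file saved as "Word 2003 XML Document".
$sample = '<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">'
    . '<w:body>'
    . '<w:p><w:r><w:t>First paragraph</w:t></w:r></w:p>'
    . '<w:p><w:r><w:t>Second </w:t></w:r><w:r><w:t>paragraph</w:t></w:r></w:p>'
    . '</w:body></w:wordDocument>';

echo wordMlToText($sample), "\n";
```

For a real document you would pass the result of `file_get_contents()` on the saved XML file instead of the inline sample.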

The other option - a Word 2007 only option (if the OpenXML file formats are not installed in your Word 2003) - would be to resort to OpenXML. As databyss pointed out here, the DOCX file format is just a ZIP archive with XML files included. There are a lot of resources on MSDN regarding the OpenXML file format, so you should be able to figure out how to read the data you want. Writing will be much more complicated, I think - it just depends on how much time you'll invest.

Perhaps you can have a look at PHPExcel which is a library able to write to Excel 2007 files and read from Excel 2007 files using the OpenXML standard. You could get an idea of the work involved when trying to read and write OpenXML Word documents.

#2


17  

This works with versions older than Office 2007 and it's pure PHP, no COM crap. Still trying to figure out 2007.

<?php

/*****************************************************************
This approach uses detection of NUL (chr(00)) and end line (chr(13))
to decide where the text is:
- divide the file contents up by chr(13)
- reject any slices containing a NUL
- stitch the rest together again
- clean up with a regular expression
*****************************************************************/

function parseWord($userDoc)
{
    // Read the whole file as raw bytes; file_get_contents() returns false on failure.
    $contents = @file_get_contents($userDoc);
    if ($contents === false) {
        return "";
    }

    $outtext = "";
    foreach (explode(chr(0x0D), $contents) as $thisline) {
        // Keep only non-empty slices that contain no NUL byte.
        if ($thisline !== "" && strpos($thisline, chr(0x00)) === false) {
            $outtext .= $thisline . " ";
        }
    }

    // Strip everything except ASCII letters, digits, whitespace and a little punctuation.
    return preg_replace('/[^a-zA-Z0-9\s,.\-@\/_()]/', "", $outtext);
}

$userDoc = "cv.doc";

$text = parseWord($userDoc);
echo $text;

?>

#3


8  

You can use Antiword; it is a free MS Word reader for Linux and most other popular operating systems.

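$document_file = 'c:\file.doc';
// escapeshellarg() quotes the path so spaces or shell metacharacters cannot break the command
$text_from_doc = shell_exec('/usr/local/bin/antiword ' . escapeshellarg($document_file));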

#4


6  

Just updating the code

<?php

/*****************************************************************
This approach uses detection of NUL (chr(00)) and end line (chr(13))
to decide where the text is:
- divide the file contents up by chr(13)
- reject any slices containing a NUL
- stitch the rest together again
- clean up with a regular expression
*****************************************************************/

function parseWord($userDoc) 
{
    $fileHandle = fopen($userDoc, "rb");
    $word_text = @fread($fileHandle, filesize($userDoc));
    fclose($fileHandle);
    $line = "";
    $tam = filesize($userDoc);
    $nulos = 0;
    $caracteres = 0;
    for($i=1536; $i<$tam; $i++)
    {
        $line .= $word_text[$i];

        if( $word_text[$i] === chr(0x00) ) // compare against the NUL byte, not the integer 0
        {
            $nulos++;
        }
        else
        {
            $nulos=0;
            $caracteres++;
        }

        if( $nulos>1996)
        {   
            break;  
        }
    }

    //echo $caracteres;

    $lines = explode(chr(0x0D),$line);
    //$outtext = "<pre>";

    $outtext = "";
    foreach($lines as $thisline)
    {
        $tam = strlen($thisline);
        if( !$tam )
        {
            continue;
        }

        $new_line = ""; 
        for($i=0; $i<$tam; $i++)
        {
            $onechar = $thisline[$i];
            if( $onechar > chr(240) )
            {
                continue;
            }

            if( $onechar >= chr(0x20) )
            {
                $caracteres++;
                $new_line .= $onechar;
            }

            if( $onechar == chr(0x14) )
            {
                $new_line .= "</a>";
            }

            if( $onechar == chr(0x07) )
            {
                $new_line .= "\t";
                if( isset($thisline[$i+1]) )
                {
                    if( $thisline[$i+1] == chr(0x07) )
                    {
                        $new_line .= "\n";
                    }
                }
            }
        }
        //replace with hyperlink
        $new_line = str_replace("HYPERLINK" ,"<a href=",$new_line); 
        $new_line = str_replace("\o" ,">",$new_line); 
        $new_line .= "\n";

        //image links
        $new_line = str_replace("INCLUDEPICTURE" ,"<br><img src=",$new_line); 
        $new_line = str_replace("\*" ,"><br>",$new_line); 
        $new_line = str_replace("MERGEFORMATINET" ,"",$new_line); 


        $outtext .= nl2br($new_line);
    }

 return $outtext;
} 

$userDoc = "Cultura.doc";
$text = parseWord($userDoc);

echo $text;


?>

#5


5  

I don't know about reading native Word documents in PHP, but if you want to write a Word document in PHP, WordprocessingML (aka WordML) might be a good solution. All you have to do is create an XML document in the correct format. I believe Word 2003 and 2007 both support WordML.
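
To give an idea of what "creating an XML document in the correct format" can look like, here is a minimal sketch that builds a Word 2003 XML (WordprocessingML) document with plain paragraphs only. The helper name `buildWordMl` and the output file name are assumptions made for this example; styles, tables and images would need considerably more markup.

```php
<?php
// Hypothetical helper: builds a minimal Word 2003 XML (WordprocessingML) document.
function buildWordMl(array $paragraphs): string
{
    $body = '';
    foreach ($paragraphs as $text) {
        // Escape &, < and > so the text cannot break the XML.
        $safe = htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');
        $body .= "<w:p><w:r><w:t>{$safe}</w:t></w:r></w:p>";
    }

    // The mso-application processing instruction tells Windows to open the file in Word.
    return '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>' . "\n"
        . '<?mso-application progid="Word.Document"?>' . "\n"
        . '<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">'
        . "<w:body>{$body}</w:body></w:wordDocument>";
}

$xml = buildWordMl(['Hello from PHP', 'Fish & chips']);

// Write to the temp directory; the file name is only a placeholder.
file_put_contents(sys_get_temp_dir() . '/hello-wordml.xml', $xml);
```

The text is escaped before it is embedded, so user-supplied content containing `<` or `&` still yields well-formed XML.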

#6


4  

Most probably you won't be able to read Word documents without COM.

Writing was covered in this topic

#7


2  

www.phplivedocx.org is a SOAP-based service, which means that you always need to be online to test it, and it does not have enough usage examples. Strangely, I only found out after 2 days of downloading (it additionally requires Zend Framework) that it's a SOAP-based program (curse me!!!)... I think without COM it's just not possible on a Linux server, and the only idea is to convert the doc file into another usable format which PHP can parse...

#8


1  

2007 might be a bit complicated as well.

The .docx format is a zip file that contains a few folders with other files in them for formatting and other stuff.

Rename a .docx file to .zip and you'll see what I mean.

So if you can work within zip files in PHP, you should be on the right path.
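
A minimal sketch of that idea, assuming PHP's zip and DOM extensions are available: open the .docx with `ZipArchive`, take `word/document.xml` out of it, and collect the text of each paragraph. The helper name `docxToText` and the file name `example.docx` are made up for this example, and formatting, tables, headers and footnotes are ignored.

```php
<?php
// Hypothetical helper: pulls the plain text out of a .docx file.
function docxToText(string $path): string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return '';
    }
    // The main document body lives in word/document.xml inside the archive.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false || $xml === '') {
        return '';
    }

    $dom = new DOMDocument();
    if (!@$dom->loadXML($xml)) {
        return '';
    }

    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');

    $paragraphs = [];
    foreach ($xpath->query('//w:p') as $p) {
        // Concatenate every text run that belongs to this paragraph.
        $text = '';
        foreach ($xpath->query('.//w:t', $p) as $t) {
            $text .= $t->textContent;
        }
        $paragraphs[] = $text;
    }
    return implode("\n", $paragraphs);
}

// "example.docx" is a placeholder; an empty string is returned if it cannot be opened.
echo docxToText('example.docx');
```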

#9


1  

phpLiveDocx is a Zend Framework component and can read and write DOC and DOCX files in PHP on Linux, Windows and Mac.

See the project web site at:

http://www.phplivedocx.org

#10


1  

One way to manipulate Word files with PHP that you may find interesting is with the help of PHPDocX. You can see how it works by having a look at its online tutorial. You can insert or extract contents or even merge multiple Word files into a single one.

#11


0  

Office 2007 .docx should be possible since it's an XML standard. Word 2003 most likely requires COM to read, even with the standards now published by MS, since those standards are huge. I haven't seen many libraries written to match them yet.

#12


0  

I don't know what you are going to use it for, but I needed .doc support for search indexing. What I did was use a little command-line tool called "catdoc". This converts the contents of the Word document to plain text so it can be indexed. If you need to keep formatting and such, this is not your tool.

#13


0  

Would the .rtf format work for your purposes? .rtf can easily be converted to and from .doc format, but it is written in plaintext (with control commands embedded). This is how I plan to integrate my application with Word documents.
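
For illustration, a minimal sketch of writing RTF from PHP. The helper name `buildRtf`, the font choice and the output file name are assumptions for this example, and it only handles plain ASCII text (non-ASCII characters would need RTF `\u` escapes, which are left out here).

```php
<?php
// Hypothetical helper: wraps plain ASCII paragraphs in a minimal RTF document.
function buildRtf(array $paragraphs): string
{
    $body = '';
    foreach ($paragraphs as $text) {
        // Backslash and braces are RTF control characters and must be escaped.
        // The backslash is replaced first so the escapes added afterwards are not doubled.
        $safe = str_replace(['\\', '{', '}'], ['\\\\', '\\{', '\\}'], $text);
        $body .= $safe . "\\par\n";
    }

    // Header: RTF version 1, ANSI character set, one font (f0) at 12 pt (fs24 = half-points).
    return "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Arial;}}\\f0\\fs24 " . $body . "}";
}

$rtf = buildRtf(['Hello from PHP', 'Braces {like these} are escaped']);

// Write to the temp directory; the file name is only a placeholder.
file_put_contents(sys_get_temp_dir() . '/hello.rtf', $rtf);
```

Word opens such a file directly, and it can then be saved as .doc from within Word if a native file is needed.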

#14


0  

I'm working on the same kind of project (an online word processor), but I've chosen C#/.NET and ASP.NET. From the survey I did, I learned that by using the Open XML SDK and VSTO (Visual Studio Tools for Office) we can easily work with Word files, manipulate them, and even convert them internally into several formats such as .odt, .pdf, .docx, etc.

So, go to msdn.microsoft.com and read through the Office development tab thoroughly. It's the easiest way to do this, as all the functions we need to implement are already available in .NET!

But as you want to do your project in PHP, you may still be able to do it in Visual Studio and .NET, since PHP can be compiled for .NET (for example with Phalanger).

#15


0  

I have the same case. I guess I am going to use a cheap 50 MB Windows-based hosting plan with a free domain to convert my files for the PHP server. Linking them is easy: all you need is an ASP.NET page that receives the doc file via POST and returns the result over HTTP, so a simple cURL call would do it.
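
The PHP side of that setup could look roughly like the sketch below, assuming the cURL extension is installed. The helper name `convertViaRemote`, the endpoint URL and the form field name `file` are all hypothetical; they depend entirely on how the ASP.NET page is written.

```php
<?php
// Hypothetical helper: uploads a .doc file to a remote converter page and
// returns the response body, or null if the request failed.
// The endpoint and the form field name "file" are assumptions, not a real service.
function convertViaRemote(string $endpoint, string $docPath): ?string
{
    $ch = curl_init($endpoint);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        // CURLFile makes cURL send the file as a multipart/form-data upload.
        CURLOPT_POSTFIELDS     => ['file' => new CURLFile($docPath, 'application/msword')],
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 30,
    ]);
    $response = curl_exec($ch);

    return $response === false ? null : $response;
}

// Example call (placeholder URL, not executed here):
// $converted = convertViaRemote('http://converter.example/convert.aspx', 'cv.doc');
```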
