I want to extract the text content from the word document with PHP.
我想用PHP从word文档中提取文本内容。
I have created a new word document in Microsoft Word for Mac 2011. Edit: have also tested by creating the same document in Microsoft Word under Windows 7.
我在Microsoft Word for Mac 2011中创建了一个新的word文档。编辑:还通过在Windows 7下的Microsoft Word中创建相同的文档进行了测试。
The contents of the document is
该文件的内容是
The quick brown fox jumps over the lazy dog
I have saved it to disk as a Word 97-2004 Document (.doc).
我已将其作为Word 97-2004文档(.doc)保存到磁盘。
I'm using phpoffice/phpword and this code to extract the text:
我正在使用phpoffice / phpword和这段代码来提取文本:
<?php
$source = "word.doc";
$phpWord = \PhpOffice\PhpWord\IOFactory::load($source, 'MsDoc');
$text = '';
$sections = $phpWord->getSections();
foreach ($sections as $s) {
$els = $s->getElements();
foreach ($els as $e) {
if (get_class($e) === 'PhpOffice\PhpWord\Element\Text') {
$text .= $e->getText();
} elseif (get_class($e) === 'PhpOffice\PhpWord\Section\TextBreak') {
$text .= " \n";
} else {
throw new Exception('Unknown class type ' . get_class($e));
}
}
}
print $text;
The output of this code is only parts of the text:
此代码的输出只是文本的一部分:
The quick brown fox j
Is there a problem with the code, or is it some kind of compatibility issue?
代码有问题,还是某种兼容性问题?
Edit:
If I add a var_dump($els);
before foreach ($els as $e) {
the output is this:
如果我添加一个var_dump($ els);在foreach之前($ els as $ e){输出是这样的:
array(1) {
[0]=>
object(PhpOffice\PhpWord\Element\Text)#1265 (14) {
["text":protected]=>
string(21) "The quick brown fox j"
["fontStyle":protected]=>
object(PhpOffice\PhpWord\Style\Font)#1267 (25) {
["aliases":protected]=>
array(1) {
["line-height"]=>
string(10) "lineHeight"
}
["type":"PhpOffice\PhpWord\Style\Font":private]=>
string(4) "text"
["name":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["hint":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["size":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["color":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["bold":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["italic":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["underline":"PhpOffice\PhpWord\Style\Font":private]=>
string(4) "none"
["superScript":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["subScript":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["strikethrough":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["doubleStrikethrough":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["smallCaps":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["allCaps":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["fgColor":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["scale":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["spacing":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["kerning":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["paragraph":"PhpOffice\PhpWord\Style\Font":private]=>
object(PhpOffice\PhpWord\Style\Paragraph)#1266 (26) {
["aliases":protected]=>
array(1) {
["line-height"]=>
string(10) "lineHeight"
}
["basedOn":"PhpOffice\PhpWord\Style\Paragraph":private]=>
string(6) "Normal"
["next":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["alignment":"PhpOffice\PhpWord\Style\Paragraph":private]=>
string(0) ""
["indentation":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["spacing":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["lineHeight":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["widowControl":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(true)
["keepNext":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(false)
["keepLines":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(false)
["pageBreakBefore":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(false)
["numStyle":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["numLevel":"PhpOffice\PhpWord\Style\Paragraph":private]=>
int(0)
["tabs":"PhpOffice\PhpWord\Style\Paragraph":private]=>
array(0) {
}
["shading":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["borderTopSize":protected]=>
NULL
["borderTopColor":protected]=>
NULL
["borderLeftSize":protected]=>
NULL
["borderLeftColor":protected]=>
NULL
["borderRightSize":protected]=>
NULL
["borderRightColor":protected]=>
NULL
["borderBottomSize":protected]=>
NULL
["borderBottomColor":protected]=>
NULL
["styleName":protected]=>
NULL
["index":protected]=>
NULL
["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
bool(false)
}
["shading":"PhpOffice\PhpWord\Style\Font":private]=>
NULL
["rtl":"PhpOffice\PhpWord\Style\Font":private]=>
bool(false)
["styleName":protected]=>
NULL
["index":protected]=>
NULL
["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
bool(false)
}
["paragraphStyle":protected]=>
object(PhpOffice\PhpWord\Style\Paragraph)#1266 (26) {
["aliases":protected]=>
array(1) {
["line-height"]=>
string(10) "lineHeight"
}
["basedOn":"PhpOffice\PhpWord\Style\Paragraph":private]=>
string(6) "Normal"
["next":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["alignment":"PhpOffice\PhpWord\Style\Paragraph":private]=>
string(0) ""
["indentation":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["spacing":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["lineHeight":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["widowControl":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(true)
["keepNext":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(false)
["keepLines":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(false)
["pageBreakBefore":"PhpOffice\PhpWord\Style\Paragraph":private]=>
bool(false)
["numStyle":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["numLevel":"PhpOffice\PhpWord\Style\Paragraph":private]=>
int(0)
["tabs":"PhpOffice\PhpWord\Style\Paragraph":private]=>
array(0) {
}
["shading":"PhpOffice\PhpWord\Style\Paragraph":private]=>
NULL
["borderTopSize":protected]=>
NULL
["borderTopColor":protected]=>
NULL
["borderLeftSize":protected]=>
NULL
["borderLeftColor":protected]=>
NULL
["borderRightSize":protected]=>
NULL
["borderRightColor":protected]=>
NULL
["borderBottomSize":protected]=>
NULL
["borderBottomColor":protected]=>
NULL
["styleName":protected]=>
NULL
["index":protected]=>
NULL
["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
bool(false)
}
["phpWord":protected]=>
object(PhpOffice\PhpWord\PhpWord)#1247 (3) {
["sections":"PhpOffice\PhpWord\PhpWord":private]=>
array(1) {
[0]=>
object(PhpOffice\PhpWord\Element\Section)#1261 (16) {
["container":protected]=>
string(7) "Section"
["style":"PhpOffice\PhpWord\Element\Section":private]=>
object(PhpOffice\PhpWord\Style\Section)#1262 (28) {
["orientation":"PhpOffice\PhpWord\Style\Section":private]=>
string(8) "portrait"
["paper":"PhpOffice\PhpWord\Style\Section":private]=>
object(PhpOffice\PhpWord\Style\Paper)#1263 (8) {
["sizes":"PhpOffice\PhpWord\Style\Paper":private]=>
array(6) {
["A3"]=>
array(3) {
[0]=>
int(297)
[1]=>
int(420)
[2]=>
string(2) "mm"
}
["A4"]=>
array(3) {
[0]=>
int(210)
[1]=>
int(297)
[2]=>
string(2) "mm"
}
["A5"]=>
array(3) {
[0]=>
int(148)
[1]=>
int(210)
[2]=>
string(2) "mm"
}
["Folio"]=>
array(3) {
[0]=>
float(8.5)
[1]=>
int(13)
[2]=>
string(2) "in"
}
["Legal"]=>
array(3) {
[0]=>
float(8.5)
[1]=>
int(14)
[2]=>
string(2) "in"
}
["Letter"]=>
array(3) {
[0]=>
float(8.5)
[1]=>
int(11)
[2]=>
string(2) "in"
}
}
["size":"PhpOffice\PhpWord\Style\Paper":private]=>
string(2) "A4"
["width":"PhpOffice\PhpWord\Style\Paper":private]=>
int(11870)
["height":"PhpOffice\PhpWord\Style\Paper":private]=>
int(16787)
["styleName":protected]=>
NULL
["index":protected]=>
NULL
["aliases":protected]=>
array(0) {
}
["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
bool(false)
}
["pageSizeW":"PhpOffice\PhpWord\Style\Section":private]=>
int(11906)
["pageSizeH":"PhpOffice\PhpWord\Style\Section":private]=>
int(16838)
["marginTop":"PhpOffice\PhpWord\Style\Section":private]=>
int(1417)
["marginLeft":"PhpOffice\PhpWord\Style\Section":private]=>
int(1417)
["marginRight":"PhpOffice\PhpWord\Style\Section":private]=>
int(1417)
["marginBottom":"PhpOffice\PhpWord\Style\Section":private]=>
int(1417)
["gutter":"PhpOffice\PhpWord\Style\Section":private]=>
int(0)
["headerHeight":"PhpOffice\PhpWord\Style\Section":private]=>
int(720)
["footerHeight":"PhpOffice\PhpWord\Style\Section":private]=>
int(720)
["pageNumberingStart":"PhpOffice\PhpWord\Style\Section":private]=>
NULL
["colsNum":"PhpOffice\PhpWord\Style\Section":private]=>
int(1)
["colsSpace":"PhpOffice\PhpWord\Style\Section":private]=>
int(720)
["breakType":"PhpOffice\PhpWord\Style\Section":private]=>
NULL
["lineNumbering":"PhpOffice\PhpWord\Style\Section":private]=>
NULL
["borderTopSize":protected]=>
NULL
["borderTopColor":protected]=>
NULL
["borderLeftSize":protected]=>
NULL
["borderLeftColor":protected]=>
NULL
["borderRightSize":protected]=>
NULL
["borderRightColor":protected]=>
NULL
["borderBottomSize":protected]=>
NULL
["borderBottomColor":protected]=>
NULL
["styleName":protected]=>
NULL
["index":protected]=>
NULL
["aliases":protected]=>
array(0) {
}
["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
bool(false)
}
["headers":"PhpOffice\PhpWord\Element\Section":private]=>
array(0) {
}
["footers":"PhpOffice\PhpWord\Element\Section":private]=>
array(0) {
}
["elements":protected]=>
array(1) {
[0]=>
*RECURSION*
}
["phpWord":protected]=>
*RECURSION*
["sectionId":protected]=>
int(1)
["docPart":protected]=>
string(7) "Section"
["docPartId":protected]=>
int(1)
["elementIndex":protected]=>
int(1)
["elementId":protected]=>
NULL
["relationId":protected]=>
NULL
["nestedLevel":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
int(0)
["parentContainer":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
NULL
["mediaRelation":protected]=>
bool(false)
["collectionRelation":protected]=>
bool(false)
}
}
["collections":"PhpOffice\PhpWord\PhpWord":private]=>
array(5) {
["Bookmarks"]=>
object(PhpOffice\PhpWord\Collection\Bookmarks)#1248 (1) {
["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
array(0) {
}
}
["Titles"]=>
object(PhpOffice\PhpWord\Collection\Titles)#1249 (1) {
["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
array(0) {
}
}
["Footnotes"]=>
object(PhpOffice\PhpWord\Collection\Footnotes)#1250 (1) {
["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
array(0) {
}
}
["Endnotes"]=>
object(PhpOffice\PhpWord\Collection\Endnotes)#1251 (1) {
["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
array(0) {
}
}
["Charts"]=>
object(PhpOffice\PhpWord\Collection\Charts)#1252 (1) {
["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
array(0) {
}
}
}
["metadata":"PhpOffice\PhpWord\PhpWord":private]=>
array(3) {
["DocInfo"]=>
object(PhpOffice\PhpWord\Metadata\DocInfo)#1253 (12) {
["creator":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""
["lastModifiedBy":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""
["created":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
int(1483515248)
["modified":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
int(1483515248)
["title":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""
["description":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""
["subject":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""
["keywords":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""
["category":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""
["company":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""
["manager":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
string(0) ""
["customProperties":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
array(0) {
}
}
["Protection"]=>
object(PhpOffice\PhpWord\Metadata\Protection)#1254 (1) {
["editing":"PhpOffice\PhpWord\Metadata\Protection":private]=>
NULL
}
["Compatibility"]=>
object(PhpOffice\PhpWord\Metadata\Compatibility)#1255 (1) {
["ooxmlVersion":"PhpOffice\PhpWord\Metadata\Compatibility":private]=>
int(12)
}
}
}
["sectionId":protected]=>
NULL
["docPart":protected]=>
string(7) "Section"
["docPartId":protected]=>
int(1)
["elementIndex":protected]=>
int(1)
["elementId":protected]=>
string(6) "5d531b"
["relationId":protected]=>
NULL
["nestedLevel":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
int(0)
["parentContainer":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
string(7) "Section"
["mediaRelation":protected]=>
bool(false)
["collectionRelation":protected]=>
bool(false)
}
}
2 个解决方案
#1
4
Try to create your reader before
尝试先创建您的阅读器
$source = "word.doc";
// create your reader object
$phpWordReader = \PhpOffice\PhpWord\IOFactory::createReader('MsDoc');
// read source
if($phpWordReader->canRead($source)) {
$phpWord = $phpWordReader->load($source);
... // rest of your code
}
Answer is based on this example and API documentation
答案基于此示例和API文档
#2
1
You can extract txt from a word document using catdoc http://www.wagner.pp.ru/~vitus/software/catdoc/
你可以使用catdoc从word文档中提取txt http://www.wagner.pp.ru/~vitus/software/catdoc/
It can be installed on Ubuntu using
它可以安装在Ubuntu上使用
sudo apt-get install catdoc
Once you have catdoc working on your system you can call it from php using shell_exec()
一旦你有catdoc在你的系统上工作,你可以使用shell_exec()从php调用它
<?php
$text = shell_exec('/(fullpath)/catdoc /(fullpath)/word.doc');
print $text;
?>
Be sure to substitute (fullpath) with the actual path to catdoc and your word doc.
一定要用catdoc和你的单词doc的实际路径替换(fullpath)。
EDIT ---- Addition
编辑----加法
If you can save your files as .docx rather than .doc it is a little bit easier. You can use unzip rather than catdoc.
如果您可以将文件保存为.docx而不是.doc,则更容易一些。你可以使用解压缩而不是catdoc。
Simply replace:
$text = shell_exec('/(fullpath)/catdoc /(fullpath)/word.doc');
with
$text = shell_exec("/(fullpath)/unzip -p /(fullpath)/word.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'");
You could use this same technique with most other command line document to text converters. Just replace the command in the shell_exec() with the command that works on your system. You can check How to extract just plain text from .doc & .docx files? (unix) for other unix/linux alternatives
您可以使用与大多数其他命令行文档相同的技术到文本转换器。只需使用适用于您系统的命令替换shell_exec()中的命令即可。您可以查看如何从.doc和.docx文件中提取纯文本? (unix)用于其他unix / linux替代品
For other PHP alternatives check out How to extract text from word file .doc,docx,.xlsx,.pptx php
对于其他PHP替代品,请查看如何从word文件.doc,docx,.xlsx,.pptx php中提取文本
#1
4
Try to create your reader before
尝试先创建您的阅读器
$source = "word.doc";
// create your reader object
$phpWordReader = \PhpOffice\PhpWord\IOFactory::createReader('MsDoc');
// read source
if($phpWordReader->canRead($source)) {
$phpWord = $phpWordReader->load($source);
... // rest of your code
}
Answer is based on this example and API documentation
答案基于此示例和API文档
#2
1
You can extract txt from a word document using catdoc http://www.wagner.pp.ru/~vitus/software/catdoc/
你可以使用catdoc从word文档中提取txt http://www.wagner.pp.ru/~vitus/software/catdoc/
It can be installed on Ubuntu using
它可以安装在Ubuntu上使用
sudo apt-get install catdoc
Once you have catdoc working on your system you can call it from php using shell_exec()
一旦你有catdoc在你的系统上工作,你可以使用shell_exec()从php调用它
<?php
$text = shell_exec('/(fullpath)/catdoc /(fullpath)/word.doc');
print $text;
?>
Be sure to substitute (fullpath) with the actual path to catdoc and your word doc.
一定要用catdoc和你的单词doc的实际路径替换(fullpath)。
EDIT ---- Addition
编辑----加法
If you can save your files as .docx rather than .doc it is a little bit easier. You can use unzip rather than catdoc.
如果您可以将文件保存为.docx而不是.doc,则更容易一些。你可以使用解压缩而不是catdoc。
Simply replace:
$text = shell_exec('/(fullpath)/catdoc /(fullpath)/word.doc');
with
$text = shell_exec("/(fullpath)/unzip -p /(fullpath)/word.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'");
You could use this same technique with most other command line document to text converters. Just replace the command in the shell_exec() with the command that works on your system. You can check How to extract just plain text from .doc & .docx files? (unix) for other unix/linux alternatives
您可以使用与大多数其他命令行文档相同的技术到文本转换器。只需使用适用于您系统的命令替换shell_exec()中的命令即可。您可以查看如何从.doc和.docx文件中提取纯文本? (unix)用于其他unix / linux替代品
For other PHP alternatives check out How to extract text from word file .doc,docx,.xlsx,.pptx php
对于其他PHP替代品,请查看如何从word文件.doc,docx,.xlsx,.pptx php中提取文本