I run a job search site, and I need to convert DOC, DOCX, and PDF files to HTML on a Linux CentOS server running PHP. People submit these files as resumes. So far, I have found PHPDocx to be great at converting DOCX to HTML, but I am stuck on DOC and PDF. pdftohtml gives a "bad color" error when I run tests. As for DOC, I have only found wvWare, which seems complex and bulky to install.
Does anyone have any ideas on how to easily convert DOC/PDF to HTML?
4 solutions
#1
3
As far as .doc files go, how about trying OpenOffice/LibreOffice? Something like: lowriter --convert-to html doc_file.doc
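A minimal sketch of that conversion wrapped in a shell function, assuming LibreOffice is installed and its `soffice` binary is on the PATH. The function name `doc_to_html` and the `SOFFICE` override variable are hypothetical names introduced here for illustration; `--headless`, `--convert-to`, and `--outdir` are LibreOffice command-line parameters.

```shell
# Sketch: convert a .doc resume to HTML with LibreOffice, without a GUI.
# SOFFICE is a hypothetical override so a different binary path can be used.
doc_to_html() {
    soffice_bin="${SOFFICE:-soffice}"
    # $1 = input file, $2 = output directory (defaults to the current one).
    # --headless runs without a display, --convert-to picks the target format,
    # --outdir sets where the resulting .html file is written.
    "$soffice_bin" --headless --convert-to html --outdir "${2:-.}" "$1"
}

# Usage (hypothetical paths):
#   doc_to_html /tmp/uploads/resume.doc /tmp/converted
```

Running the conversion headless matters on a server, since there is usually no display for the office suite to attach to.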
As far as PDF goes, if the PDF is a graphical representation of text, then you're out of luck; the best you can do is convert it to an image with ImageMagick. If it contains proper text, it should convert easily.
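One way to tell the two kinds of PDF apart before converting is to check whether any text can be extracted at all. The sketch below assumes `pdftotext` from poppler-utils is installed, which the original answer does not mention; `pdf_has_text` and the `PDFTOTEXT` override are hypothetical names.

```shell
# Sketch: succeed (exit status 0) only if the PDF has an extractable text layer.
# Assumes pdftotext (poppler-utils); PDFTOTEXT is a hypothetical override.
pdf_has_text() {
    pdftotext_bin="${PDFTOTEXT:-pdftotext}"
    # "-" as the output name sends the extracted text to stdout.
    # All whitespace is stripped so a page of blank lines counts as empty.
    text=$("$pdftotext_bin" "$1" - 2>/dev/null | tr -d '[:space:]')
    [ -n "$text" ]
}

# Usage (hypothetical file name):
#   if pdf_has_text resume.pdf; then echo "text PDF"; else echo "image-only PDF"; fi
```

An image-only result is the case where falling back to an image, as suggested above, is probably the only option short of OCR.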
#2
3
The only thing I can think of is FPDF. It is intended for creating PDF files in PHP, but it can also open PDF files. Maybe you can use it as a base and develop some sort of toHTML function for it.
It is completely free to use and it has some extensions already. It MIGHT help you.
http://www.fpdf.org
EDIT: Thanks to Pierre for this addition in the comments:
You can use FPDI: http://www.setasign.de/products/pdf-php-solutions/fpdi but it treats the input PDF just like an image.
I haven't taken a look at it myself so far, but this might help.
#3
2
There are various tools out there already to do this, such as http://dag.wieers.com/home-made/unoconv/ and http://www.phpdocx.com/ (which you've already tried).
http://www.phplivedocx.org/2009/08/13/convert-docx-doc-rtf-to-html-in-php/ looks promising.
Or you could install a portable version of LibreOffice on your server, which allows command-line conversion: https://help.libreoffice.org/Common/Starting_the_Software_With_Parameters
I'm sure there are tutorials out there (in the LibreOffice support area).
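Since resumes arrive in several formats, a command-line setup like this probably needs a small routing step in front of it. The sketch below only decides which kind of converter an upload should go to, based on its extension. The function name `pick_converter` and the labels `office`, `pdf`, and `unsupported` are hypothetical, and sending both .doc and .docx to the office suite is an assumption made for illustration.

```shell
# Sketch: route an uploaded resume to a converter family by file extension.
# Prints a label and returns non-zero for anything it does not recognise.
pick_converter() {
    case "$1" in
        # Take the text after the last dot and lower-case it,
        # so "CV.PDF" and "cv.pdf" are treated alike.
        *.*) ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]') ;;
        *)   ext="" ;;
    esac
    case "$ext" in
        doc|docx) echo "office" ;;       # e.g. LibreOffice or unoconv
        pdf)      echo "pdf" ;;          # a dedicated PDF-to-HTML tool
        *)        echo "unsupported"; return 1 ;;
    esac
}

# Usage (hypothetical file name):
#   kind=$(pick_converter "Resume.DOC")
```

An extension check is only a first filter; uploads can be misnamed, so the converter's own exit status should still be checked.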
#4
1
To easily convert PDF to HTML, I would suggest pdf2htmlEX, which produces outstanding HTML and is fast enough for runtime conversion. You should first put some effort into optimizing and building it for your system. There is a simple build how-to on the project link.
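Once pdf2htmlEX is built, a call from a script might look like the sketch below. It assumes the binary is on the PATH under the name `pdf2htmlEX`; the function name `pdf_to_html` and the `PDF2HTMLEX` override are hypothetical, while `--dest-dir` is the tool's option for choosing the output directory.

```shell
# Sketch: convert a PDF resume to HTML with pdf2htmlEX.
# PDF2HTMLEX is a hypothetical override for the binary location.
pdf_to_html() {
    pdf2htmlex_bin="${PDF2HTMLEX:-pdf2htmlEX}"
    # $1 = input PDF, $2 = output directory (defaults to the current one).
    # The output file is named after the input unless a name is given.
    "$pdf2htmlex_bin" --dest-dir "${2:-.}" "$1"
}

# Usage (hypothetical paths):
#   pdf_to_html /tmp/uploads/resume.pdf /tmp/converted
```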