
Date: 2022-04-02 08:48:24

Every now and then I receive a Word document that I have to display as a web page. I'm currently using Django's flatpages for this, grabbing the HTML content that MS Word generates, but that HTML is quite messy. Is there a better way, using Python, to generate very simple HTML?


6 answers



A good solution is to upload the document to Google Docs and export the HTML version from it. (There must be an API for that?)


The export does a lot of the cleanup for you; Beautiful Soup can then be used to make any further changes, as appropriate. It is the most powerful and elegant HTML parsing library on the planet.


This is a well-known approach at journalism companies.




I found this web page: http://www.textfixer.com/html/convert-word-to-html.php


It converts formatted text to simple HTML markup, preserving bold, italic, links and paragraphs, but without adding tags for font sizes and faces. Exactly what I needed to save some time.
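If you would rather do the same kind of filtering locally in Python, here is a minimal sketch using only the standard library's html.parser. It is not how that site works (I don't know its internals); the whitelist in ALLOWED, the SKIP_CONTENT set and the names SimpleHtmlCleaner and clean_html are all my own illustrative choices. It keeps paragraphs, bold, italic and links, and drops everything else, including Word's style blocks and font attributes.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelist: tags to keep, and the attributes allowed on each.
ALLOWED = {"p": (), "b": (), "strong": (), "i": (), "em": (), "a": ("href",)}
# Elements whose text content should be discarded entirely.
SKIP_CONTENT = {"style", "script", "title"}


class SimpleHtmlCleaner(HTMLParser):
    """Re-emit only whitelisted tags; text inside other tags is kept."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif tag in ALLOWED:
            kept = "".join(
                ' {}="{}"'.format(name, escape(value or "", quote=True))
                for name, value in attrs
                if name in ALLOWED[tag]
            )
            self.parts.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ALLOWED:
            self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        # Comments (including Word's conditional comments) never reach
        # this method, so they are dropped automatically.
        if self.skip_depth == 0:
            self.parts.append(escape(data, quote=False))


def clean_html(messy):
    parser = SimpleHtmlCleaner()
    parser.feed(messy)
    parser.close()
    return "".join(parser.parts)
```

Call clean_html() on the Word-generated markup before saving it. Note that this is a formatting filter, not a security sanitizer: it does not check what the href values point to.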




My super-simple app WordOff has an API for cleaning up cruft from Word-exported HTML. You could override the save method of your flatpages model to pipe your HTML through the API the first time it gets saved. Something like this:


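# Python 3 version (urllib2 no longer exists in Python 3)
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def decruft(html):
    # POST the Word-exported HTML to the WordOff API and return the cleaned markup
    data = urlencode({'html': html}).encode('utf-8')
    req = Request('http://wordoff.org/api/clean', data)
    with urlopen(req) as response:
        # assumes the API replies in UTF-8
        return response.read().decode('utf-8')

# goes on your flatpages model class
def save(self, *args, **kwargs):
    if not self.pk:  # only de-cruft when content is first added
        self.content = decruft(self.content)
    super(FlatPage, self).save(*args, **kwargs)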



It depends on how much formatting and how many images you're dealing with. I do one of two things:


  • Google Docs: Probably the closest you'll get to the original formatting and usable HTML.
  • Markdown: Abandon formatting. Paste it into a plain text editor, run it through Markdown and fix the rest by hand.



You can also use Abiword/wvWare to convert the Word document to XHTML and then parse it with BeautifulSoup/ElementTree/etc. to preprocess it if you need to. In my experience, Abiword does a pretty good job of converting Word files and produces relatively clean XHTML files.


I should mention that Abiword can be run on the command line, so it's easy to integrate it in an automated process.
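A rough sketch of what that automated process could look like, using ElementTree for the preprocessing step. Treat the details as assumptions rather than a tested recipe: I'm assuming the abiword binary is on your PATH, that it accepts --to=html, and that it writes the output next to the input file with an .html suffix (check abiword --help on your system). The function names and the DROP_ATTRS list are made up for illustration.

```python
import subprocess
import xml.etree.ElementTree as ET
from pathlib import Path

# Presentation-only attributes to drop (an illustrative choice).
DROP_ATTRS = ("style", "class", "lang", "dir")


def abiword_to_xhtml(doc_path):
    """Convert a Word file with AbiWord and return the XHTML text.

    Assumptions: `abiword` is on PATH, accepts `--to=html`, and writes
    the result next to the input with an .html suffix.
    """
    subprocess.run(["abiword", "--to=html", str(doc_path)], check=True)
    return Path(doc_path).with_suffix(".html").read_text(encoding="utf-8")


def simplify_xhtml(xhtml_text):
    """Return the body's children as markup without styling attributes.

    Assumes well-formed XML with no undeclared entities such as &nbsp;.
    """
    root = ET.fromstring(xhtml_text)
    for element in root.iter():
        if isinstance(element.tag, str):
            # Drop the "{namespace}" prefix so the output has plain tag names.
            element.tag = element.tag.split("}", 1)[-1]
        for attr in DROP_ATTRS:
            element.attrib.pop(attr, None)
    body = root.find("body")
    if body is None:
        body = root
    return "".join(ET.tostring(child, encoding="unicode") for child in body)
```

Something like simplify_xhtml(abiword_to_xhtml("report.doc")) would then give you a fragment you can store as the flatpage content. If the converter's output is not strict XML, a more forgiving parser such as BeautifulSoup is the safer choice.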




Word 2010 can save a document as a filtered web page (the "Web Page, Filtered" file type in the Save As dialog). This eliminates the overwhelming majority of the extra HTML that Word inserts.



