I'm wondering how you clean up the special characters that MS Word adds, such as em- and en-dashes and curly quotes?
I often find myself copying clients' content from Word and pasting it into a static HTML page, but the content ends up with weird characters because the special characters are not converted to their correct ASCII codes and therefore show up as garbled text. (For these basic websites, I'm using Dreamweaver.)
I have also seen a lot of similar problems when clients copy content from Word into text-only fields (mostly textareas). When I put that content into a PDF (through PHP), or when it shows up on the page, it too has garbled text.
How do you deal with this? Is there a cleaning service or program you use?
6 solutions
#1
8
With regard to clients posting text copied and pasted from Word into textareas:
The most reliable way to ensure that the client sends you text in a particular encoding (and thus, hopefully, does any conversion from CP-1252 [or whatever Word uses] for you) is to add the accept-charset="..." attribute to all your <form>s. E.g.:
<form ... accept-charset="UTF-8">
...
</form>
Most browsers will obey that and make sure any "Word-specific" characters are converted to the appropriate character set before they get to your website.
Once invalid text gets to your website, there's very little you can do to fix it reliably, so it's best to simply check that all input is valid in whatever character set you use, and discard any requests that contain invalid text. This is necessary even with accept-charset, because undoubtedly there are some clients out there that will ignore it.
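A minimal sketch of that server-side check, assuming the site uses UTF-8 throughout. The function names and the "comment" field are hypothetical, chosen only for illustration; the check relies on PCRE's /u modifier, which makes preg_match() fail on input that is not well-formed UTF-8.

```php
<?php
// Hypothetical helper: returns true when $text is well-formed UTF-8.
// With the /u modifier, preg_match() returns false (not 1) for a subject
// that is not valid UTF-8, so the empty pattern works as a validity test.
function is_valid_utf8(string $text): bool
{
    return preg_match('//u', $text) === 1;
}

// Hypothetical usage: accept a submitted "comment" field only if it is
// a string containing valid UTF-8; otherwise signal rejection with null.
function accept_comment(array $post): ?string
{
    $comment = $post['comment'] ?? '';
    if (!is_string($comment) || !is_valid_utf8($comment)) {
        return null; // the caller should reject the request
    }
    return $comment;
}
```

In a real handler you would pass $_POST to accept_comment() and answer with an error when it returns null, rather than trying to guess what the bytes were meant to be.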
#2
5
You can use a preg_replace function call to remove all of Word's special characters (and any other non-ASCII characters) from your string:
$str = preg_replace('/[^\x00-\x7F]+/', '', $str); // keeps only ASCII bytes
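Note that the pattern above deletes the quotes and dashes outright instead of replacing them. If you would rather keep plain-ASCII stand-ins, here is a rough sketch of one way to do it, assuming the input is already UTF-8. The function name and the choice of replacements are my own assumptions, not part of the answer above.

```php
<?php
// Hypothetical helper: maps common Word "smart" punctuation to ASCII.
// Assumes $text is UTF-8; the list of characters is illustrative, not complete.
function word_punctuation_to_ascii(string $text): string
{
    $map = [
        "\u{2018}" => "'",   // left single quote
        "\u{2019}" => "'",   // right single quote / apostrophe
        "\u{201C}" => '"',   // left double quote
        "\u{201D}" => '"',   // right double quote
        "\u{2013}" => '-',   // en dash
        "\u{2014}" => '--',  // em dash
        "\u{2026}" => '...', // horizontal ellipsis
        "\u{00A0}" => ' ',   // non-breaking space
    ];
    // strtr() with an array replaces every key with its value in one pass.
    return strtr($text, $map);
}
```

Running this before the preg_replace call means only characters without an ASCII stand-in in the map get stripped.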
#3
4
Take care to specify an encoding everywhere and use UTF-8; then those "special" characters should survive just fine. But once they've gone through an encoding that can't represent them, the information about which character it originally was is lost, so it can't be repaired (except in some specific, though probably very common, cases such as a mix-up between Cp1252 and ISO-8859-1).
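For the Cp1252 case mentioned above, a sketch of the repair might look like this. It assumes you already know the bytes are Cp1252 (for example, text that was labelled ISO-8859-1 but really came from Word on Windows); the function name is hypothetical, and iconv() is used only as one way to do the conversion.

```php
<?php
// Hypothetical helper: treats $bytes as Windows-1252 (Cp1252) and returns
// the same text encoded as UTF-8, or null if iconv() cannot convert it.
function cp1252_to_utf8(string $bytes): ?string
{
    // @ suppresses the notice iconv() raises on bytes it cannot convert;
    // in that case it returns false, which is reported here as null.
    $converted = @iconv('CP1252', 'UTF-8', $bytes);
    return $converted === false ? null : $converted;
}
```

Only apply this when you are sure of the source encoding; running it on text that is already UTF-8 will garble it further.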
#5
1
Make sure Word is configured to use UTF-8 for "Save As..." HTML.
This is in Options > Word Options > Advanced > Web Options > Encoding
#6
0
If it's a Word file that's just text (i.e., no graphics, tables, etc.), you might try saving it as HTML from within Word, copy/pasting the resulting HTML into your document in Dreamweaver, and then using Dreamweaver's "Clean Up Word HTML" function (under the Command menu).
As an alternative, you can try fix my HTML, though I've not personally tried it with Word text, so results may vary.