Illegal non-standard quotes in XML

Date: 2021-06-02 10:18:53

I'm allowing some user input on my website, which is later read as XML. Every once in a while I get these weird single or double quotes like this ”’. These are copied directly from the source that broke my XML. I'm wondering if there is an easy way to correct these types of characters in my XML. htmlentities did not seem to touch them.

Where do these characters come from? I'm not even sure how I'd go about typing them out unintentionally.

EDIT: I forgot to clarify that these quotes are not being used in attributes, but in the following way:

<SomeTag>User’s Input</SomeTag>

5 answers

#1


2  

Don't disallow or modify foreign characters; that's just annoying for your users! This is just an encoding issue. I don't know what parser you're using to read the XML, but if it's reasonably sophisticated, you can solve your problem by including the following encoding declaration at the top of your XML files:

<?xml version="1.0" encoding="UTF-8"?>

There may also be a UTF-8 option in the parser's API.
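If the XML is generated from PHP, one way to get that declaration and consistent encoding is to let the DOM extension build the document. This is only a sketch: it assumes the DOM extension is enabled and that the user input really is UTF-8, and the variable names are made up for illustration.

```php
<?php
// Assumption: the user input arrives as UTF-8 (the curly apostrophe is written as raw bytes here).
$input = "User\xE2\x80\x99s Input";

// Declare the document as UTF-8 so saveXML() emits encoding="UTF-8".
$doc = new DOMDocument('1.0', 'UTF-8');

// createTextNode() escapes markup characters and keeps the quote as a literal character.
$node = $doc->createElement('SomeTag');
$node->appendChild($doc->createTextNode($input));
$doc->appendChild($node);

$xml = $doc->saveXML();
echo $xml;
```

Parsing the result back should return the original string unchanged, which is a quick way to confirm that the quote survived the round trip.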

Edit: I just read that you're reading the XML directly in a browser. Most browsers honor the encoding declaration!

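Edit 2: Apparently, those quotes aren't arriving as legal UTF-8 in your case (the characters themselves do exist in UTF-8, so the bytes are presumably in some other encoding), so the declaration alone won't fix it. Instead, you might find what you're looking for here, where a similar problem is being discussed.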

#2


2  

Are these quotes being used in text content, or to delimit attributes? For attribute delimiters, XML requires typewriter quotes (single or double). Microsoft Word and other word-processing applications often try to be smart and replace typewriter quotes with typographical quotes, which is almost certainly the answer to the question "where are they coming from?".

If you need to get rid of them, a simple global replace using a text editor will do the job fine.

But you might first try to work out why they are causing a problem. Perhaps your data flow can't handle ANY non-ASCII characters, in which case that's a deeper problem that you really ought to fix (it would typically imply that some unwanted transcoding is happening somewhere along the line).
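A rough way to check for this in PHP is to look at the raw bytes and normalize anything that is not valid UTF-8. The sketch below assumes the mbstring extension is available and that invalid input is Windows-1252; that fallback is a guess, not something the question confirms, and `to_utf8` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: return $s as UTF-8.
// Assumption: anything that is not already valid UTF-8 is Windows-1252.
function to_utf8(string $s): string
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($s, 'UTF-8', 'Windows-1252');
}

// 0x92 is the right single quotation mark in Windows-1252.
$raw = "User\x92s Input";

echo bin2hex($raw), "\n";          // inspect the bytes as they arrived
echo bin2hex(to_utf8($raw)), "\n"; // the quote becomes the UTF-8 bytes e2 80 99
```

Dumping the bytes at each stage (form submission, storage, XML output) usually shows where the encoding changes.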

#3


1  

If the input string is UTF-8 encoded, maybe you need to specify that to htmlentities(), for example:

$html = htmlentities( '”’', ENT_COMPAT, "utf-8" );
echo $html;

For me this gives:

&rdquo;&rsquo;

whereas

$html = htmlentities( '”’' );
echo $html;

gets confused:

&acirc;??&acirc;??

If the input string is non-UTF-8, then you'd need to adjust the encoding arg for htmlentities() accordingly.
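One caveat when the result goes into XML rather than HTML: named entities such as `&rdquo;` are HTML entities. XML itself predefines only `&lt;`, `&gt;`, `&amp;`, `&apos;` and `&quot;`, so a parser without a suitable DTD will reject the others. A minimal sketch, assuming UTF-8 input, that escapes only the markup characters and leaves the curly quotes as literal characters (`xml_text` is a hypothetical helper name):

```php
<?php
// Hypothetical helper: escape a UTF-8 string for use as XML text content.
function xml_text(string $s): string
{
    // ENT_XML1 limits the output to XML's predefined entities;
    // non-ASCII characters such as U+2019 pass through unchanged.
    return htmlspecialchars($s, ENT_XML1 | ENT_QUOTES, 'UTF-8');
}

$input = "User\xE2\x80\x99s <b> & Input";
echo '<SomeTag>' . xml_text($input) . '</SomeTag>', "\n";
```

This relies on the document being served or declared as UTF-8, since the quote is written out as a literal character rather than a reference.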

#4


1  

Stay away from Microsoft Office apps. Word, Excel, etc. have a nasty habit of replacing matching pairs of single quotes and double quotes with non-standard "smart quotes".

These quote characters never made it into the official Latin-1 (ISO-8859-1) character set; they come from Microsoft's Windows-1252 extension of it, although they do exist in Unicode. All the MS Office apps "helpfully" replace standard quote characters with these abominations.

Just google for "undoing smartquotes" or "convert smartquotes back" for hints, tips and regexes to get rid of these.
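For reference, a regex is not strictly needed. Here is a minimal sketch using a plain lookup table with `strtr()`, assuming the input is UTF-8 and that only the four common curly quotes need converting (`straighten_quotes` is a hypothetical name):

```php
<?php
// Hypothetical helper: replace the four common curly quotes with ASCII quotes.
// Assumption: $s is UTF-8; the keys are the UTF-8 byte sequences of each character.
function straighten_quotes(string $s): string
{
    return strtr($s, [
        "\xE2\x80\x98" => "'", // U+2018 left single quotation mark
        "\xE2\x80\x99" => "'", // U+2019 right single quotation mark
        "\xE2\x80\x9C" => '"', // U+201C left double quotation mark
        "\xE2\x80\x9D" => '"', // U+201D right double quotation mark
    ]);
}

echo straighten_quotes("\xE2\x80\x9CUser\xE2\x80\x99s Input\xE2\x80\x9D"), "\n";
```

Writing the keys as byte escapes keeps the snippet independent of the encoding the source file itself is saved in.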

#5


0  

Use

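$s = 'User’s Input';
// The /u modifier makes PCRE treat the pattern and subject as UTF-8,
// so each curly quote is matched as one character rather than as separate bytes.
$descriptfix = preg_replace('/[“”]/u', '"', $s);
$descriptfix = preg_replace('/[‘’]/u', "'", $descriptfix);
// Function calls are not evaluated inside string literals, so concatenate instead.
// ENT_XML1 escapes only the characters that are special in XML.
echo '<SomeTag>' . htmlspecialchars($descriptfix, ENT_XML1, 'UTF-8') . '</SomeTag>';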
