
Time: 2021-06-02 10:18:53

I'm allowing some user input on my website that is later read as XML. Every once in a while I get weird single or double quotes like these: ”’. They are copied directly from the source that broke my XML. I'm wondering if there is an easy way to correct these types of characters in my XML. htmlentities did not seem to touch them.


Where do these characters come from? I'm not even sure how I'd go about typing them out unintentionally.


EDIT: I forgot to clarify that these quotes are not being used in attributes, but in the following way:


<SomeTag>User’s Input</SomeTag>

5 Answers



Don't disallow and/or modify foreign characters; that's just annoying for your users! This is just an encoding issue. I don't know what parser you're using to read the XML, but if it's reasonably sophisticated, you can solve your problem by including the following encoding pragma at the top of your XML files:


<?xml version="1.0" encoding="UTF-8"?>

There may also be a UTF-8 option in the parser's API.
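As a rough sketch of this approach in PHP (the language used elsewhere in this thread), the following builds the document with DOMDocument and declares UTF-8 explicitly. The helper name build_some_tag is made up for illustration, and the sketch assumes the user input is already valid UTF-8.

```php
<?php
// Hypothetical helper: wraps user text in <SomeTag> and serializes it
// as a UTF-8 XML document. Assumes $text is already valid UTF-8.
function build_some_tag(string $text): string
{
    // Passing 'UTF-8' makes saveXML() emit the encoding declaration.
    $doc = new DOMDocument('1.0', 'UTF-8');
    $tag = $doc->createElement('SomeTag');
    // Text nodes are escaped (&, <, >) when the document is serialized.
    $tag->appendChild($doc->createTextNode($text));
    $doc->appendChild($tag);
    return $doc->saveXML();
}

// "\u{2019}" is the right single quotation mark from the question.
echo build_some_tag("User\u{2019}s Input");
```

Letting the DOM API do the serialization means the declaration and the escaping stay consistent with each other, which is harder to guarantee when the XML is assembled by string concatenation.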


Edit: I just read that you're reading the XML directly in a browser. Most browsers listen to the encoding pragma!


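Edit 2: Those quotes are valid Unicode characters (U+2018 to U+201D) and encode fine in UTF-8. If the declaration alone does not help, the input is probably arriving in another encoding such as Windows-1252, whose single bytes for these quotes (0x91 to 0x94) are not valid UTF-8 on their own. You might find what you're looking for here, where a similar problem is being discussed.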




Are these quotes being used in text content, or to delimit attributes? For attribute delimiters, XML requires typewriter quotes (single or double). Microsoft Word and other word-processing applications often try to be smart and replace typewriter quotes with typographical quotes, which is almost certainly the answer to the question "where are they coming from?".


If you need to get rid of them, a simple global replace using a text editor will do the job fine.


But you might first try to work out why they are causing a problem. Perhaps your data flow can't handle ANY non-ASCII characters, in which case that's a deeper problem that you really ought to fix (it would typically imply that some unwanted transcoding is happening somewhere along the line).
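One way to rule out unwanted transcoding is to normalize the input to UTF-8 at the point where it enters the system. The sketch below is only an illustration: the helper name to_utf8 is hypothetical, it relies on PHP's mbstring extension, and it assumes that any input which is not valid UTF-8 is Windows-1252 (the usual encoding of text pasted from Word on Windows).

```php
<?php
// Hypothetical helper: returns $input as UTF-8.
// Assumption: anything that is not valid UTF-8 is Windows-1252.
function to_utf8(string $input): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($input, 'UTF-8', 'Windows-1252');
}

// "\x92" is the Windows-1252 byte for the right single quote (U+2019).
echo to_utf8("User\x92s Input"), "\n";
```

If the real source encoding is something else, the third argument to mb_convert_encoding() would need to change accordingly.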




If the input string is UTF-8 encoded, maybe you need to specify that to htmlentities(), for example:


$html = htmlentities( '”’', ENT_COMPAT, "utf-8" );
echo $html;

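For me, this gives:

&rdquo;&rsquo;

whereas omitting the encoding argument:

$html = htmlentities( '”’' );
echo $html;

gets confused and mangles the characters (before PHP 5.4, the default charset for htmlentities() was ISO-8859-1).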


If the input string is non-UTF-8, then you'd need to adjust the encoding arg for htmlentities() accordingly.
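One caveat for the XML use case in the question: named entities such as &rdquo; come from HTML, and XML predefines only &lt;, &gt;, &amp;, &apos; and &quot;. A possible alternative, sketched below, is to emit numeric character references instead. The helper name xml_escape_ascii is made up for illustration, and the sketch assumes valid UTF-8 input and the mbstring extension.

```php
<?php
// Hypothetical helper: escapes UTF-8 text for use as XML element content,
// writing every non-ASCII character as a numeric character reference.
function xml_escape_ascii(string $text): string
{
    // Escape the XML special characters first.
    $escaped = htmlspecialchars($text, ENT_QUOTES | ENT_XML1, 'UTF-8');
    // Conversion map: code points 0x80..0x10FFFF, offset 0, mask 0x1FFFFF.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($escaped, $map, 'UTF-8');
}

echo xml_escape_ascii("User\u{2019}s Input"), "\n"; // User&#8217;s Input
```

The result is pure ASCII, so it survives even if something later in the pipeline mislabels the encoding.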




Stay away from Microsoft Office apps. Word, Excel etc. have a nasty habit of replacing matching pairs of single quotes and double quotes with "smart quotes".


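These quote characters never made it into the official Latin-1 (ISO-8859-1) character set; they come from Microsoft's Windows-1252 extension of it, although they are standard Unicode characters (U+2018 to U+201D). All the MS Office apps "helpfully" replace plain ASCII quote characters with these abominations.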


Just google for "undoing smartquotes" or "convert smartquotes back" for hints, tips, and regexes to get rid of these.





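// Normalize typographic quotes to plain ASCII quotes, then escape for XML.
$s = 'User’s Input';
// The /u modifier makes the character classes match whole UTF-8 characters.
$descriptfix = preg_replace('/[“”]/u', '"', $s);
$descriptfix = preg_replace('/[‘’]/u', "'", $descriptfix);
// A function call is not interpolated inside a string, so concatenate instead.
echo '<SomeTag>' . htmlspecialchars($descriptfix, ENT_QUOTES | ENT_XML1, 'UTF-8') . '</SomeTag>';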


