How to ensure that text entered in a form is encoded as UTF-8

Time: 2022-10-24 13:18:51

I have an HTML text box in which users may enter text. I would like to ensure all text entered in the box is either encoded in UTF-8 or converted to UTF-8 when a user finishes typing. Furthermore, I don't quite understand how the various UTF encodings are chosen when text is entered into a text box.

Generally I'm curious about the following:

  • How does a browser determine which encodings to use when a user is typing into a text box?
  • How can javascript determine the encoding of a string value in an html text box?
  • Can I force the browser to only use UTF-8 encoding?
  • How can I convert arbitrary encodings to UTF-8? I assume there is a JavaScript library for this.

**Edit**

Removed some questions unnecessary to my goals.

This tutorial helped me understand JavaScript character codes better, but is buggy and does not actually translate character codes to utf-8 in all cases. http://www.webtoolkit.info/javascript-base64.html

3 solutions

#1


15  

  • How does a browser determine which encodings to use when a user is typing into a text box?

By default, it uses the encoding that the page was decoded with. According to the spec, you should be able to override this with the accept-charset attribute of the <form> element, but IE is buggy, so you shouldn't rely on this (I've seen several different sources describe several different bugs, and I don't have all the relevant versions of IE in front of me to test, so I'll leave it at that).

  • How can javascript determine the encoding of a string value in an html text box?

All strings in JavaScript are encoded in UTF-16. The browser will map everything into UTF-16 for JavaScript, and from UTF-16 into whatever the page is encoded in.

UTF-16 is an encoding that grew out of UCS-2. Originally, it was thought that 65,536 code points would be enough for all of Unicode, and so a 16-bit character encoding would be sufficient. It turned out that this is not the case, and so the character set was expanded to 1,114,112 code points. In order to maintain backwards compatibility, a few unused ranges of the 16-bit character set were set aside for surrogate pairs, in which two 16-bit code units are used to encode a single character. Read up on UTF-16 and UCS-2 on Wikipedia for details.

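The surrogate-pair arithmetic described above can be sketched in a few lines. This is only an illustration, not something from the answer itself; the helper name `toSurrogatePair` is made up for the example.

```javascript
// Split a code point above U+FFFF into its two UTF-16 code units.
// toSurrogatePair is a hypothetical helper name used only for this sketch.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("code point must be in the range U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000;    // 20-bit value
  const high = 0xD800 + (offset >> 10);  // top 10 bits -> high (lead) surrogate
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits -> low (trail) surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F600); // U+1F600, outside the BMP
console.log(high.toString(16), low.toString(16));            // d83d de00
console.log(String.fromCharCode(high, low) === "\u{1F600}"); // true
```

The ranges 0xD800..0xDBFF and 0xDC00..0xDFFF are the ones set aside for surrogates, which is why neither half is a valid character on its own.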

The upshot is that when you have a string str in JavaScript, str.length does not give you the number of characters; it gives you the number of code units, where two code units may be used to encode a single character if that character is not within the Basic Multilingual Plane. For instance, "abc".length gives you 3, but "😀😁😂".length gives you 6; and "😀😁😂".substring(0,1) gives what looks like an empty string, since half of a surrogate pair cannot be displayed, but the string still contains that invalid character (I will not guarantee this works cross-browser; I believe it is acceptable to drop broken characters). To get a valid character, you must use "😀😁😂".substring(0,2).

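A short sketch of the difference between code units and characters, using three characters outside the BMP written as escapes so the source file's own encoding does not matter. The helper name `countCodePoints` is hypothetical.

```javascript
// countCodePoints is a hypothetical helper name for this sketch.
// String iteration (which Array.from uses) walks code points, not code units.
function countCodePoints(str) {
  return Array.from(str).length;
}

const astral = "\u{1F600}\u{1F601}\u{1F602}"; // three characters outside the BMP

console.log("abc".length);            // 3
console.log(astral.length);           // 6 (code units)
console.log(countCodePoints(astral)); // 3 (code points)

// substring works on code units, so (0, 1) cuts a surrogate pair in half.
console.log(astral.substring(0, 1).charCodeAt(0).toString(16)); // d83d
console.log(astral.substring(0, 2) === "\u{1F600}");            // true
```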

  • Can I force the browser to only use UTF-8 encoding?

The best way to do this is to deliver your page in UTF-8. Ensure that your web server is sending the appropriate Content-Type: text/html; charset=UTF-8 header. You may also want to embed a <meta charset="UTF-8"> element in your <head> element, for cases in which the Content-Type does not get set properly (such as if your page is loaded from the local disk).

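As a rough sketch of both declarations together, assuming a Node.js server: the handler below has the (req, res) shape that Node's http.createServer accepts. The names `FORM_PAGE` and `serveFormPage`, the form field `comment`, and the `/submit` action are all made up for illustration.

```javascript
// Hypothetical page: the <meta charset> tag covers cases where the HTTP
// header is missing, such as opening the file from local disk.
const FORM_PAGE = `<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <title>Form</title>
</head>
<body>
  <form method="post" action="/submit">
    <input type="text" name="comment">
    <button type="submit">Send</button>
  </form>
</body>
</html>`;

// Hypothetical handler; could be passed to Node's http.createServer.
function serveFormPage(req, res) {
  // Declare the encoding in the HTTP header...
  res.setHeader("Content-Type", "text/html; charset=UTF-8");
  // ...and write the body as UTF-8 bytes so the declaration is true.
  res.end(FORM_PAGE, "utf8");
}
```

Because the page is parsed as UTF-8, a form on it is submitted as UTF-8 by default, without relying on accept-charset.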

  • How can I convert arbitrary encodings to UTF-8? I assume there is a JavaScript library for this.

There isn't much need in JavaScript to encode text in particular encodings. If you are simply writing to the DOM, or reading or filling in form controls, you should just use JavaScript strings, which are treated as sequences of UTF-16 code units. XMLHttpRequest, when used to send(data) via POST, will use UTF-8 (if you pass it a document with a different encoding declared in the <?xml ...> declaration, it may or may not convert that to UTF-8, so for compatibility you generally shouldn't use anything other than UTF-8).

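To see what UTF-8 looks like in a request body, here is a small sketch using the standard URLSearchParams and encodeURIComponent functions, both of which percent-encode non-ASCII text as UTF-8 bytes. The field name `comment` is an assumption for the example.

```javascript
// Build a form-style body from a JavaScript (UTF-16) string.
// "\u00E9" is the character é; "comment" is a made-up field name.
const body = new URLSearchParams({ comment: "caf\u00E9" }).toString();

console.log(body);                         // comment=caf%C3%A9
console.log(encodeURIComponent("\u00E9")); // %C3%A9
```

The bytes C3 A9 are the UTF-8 encoding of é. In a browser, a string like `body` is the sort of value one might pass to send(data) on a POST request.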

#2


4  

I would like to ensure all text entered in the box is either encoded in UTF-8

Text in an HTML DOM, including input fields, has no intrinsic byte encoding; it is stored as Unicode characters (specifically, at the DOM and ECMAScript standard level, UTF-16 code units; in the rare case that you use characters outside the Basic Multilingual Plane it is possible to see the difference, e.g. '😀'.length is 2).

It is only when the form is sent that the text is serialised into bytes using a particular encoding, by default the same encoding as was used to parse the page. So you should serve the page containing the form as UTF-8 (via the Content-Type header's charset parameter and/or the equivalent <meta> tag).

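To illustrate that the bytes only exist once an encoding is chosen, the sketch below serialises the same one-character string three ways. It assumes Node.js, since Buffer is a Node type rather than a browser API.

```javascript
// Node.js-only sketch: the same text becomes different bytes depending on
// the encoding picked at serialisation time.
const text = "\u00E9"; // the character é, a single UTF-16 code unit

const asUtf8 = Array.from(Buffer.from(text, "utf8"));
const asLatin1 = Array.from(Buffer.from(text, "latin1"));
const asUtf16le = Array.from(Buffer.from(text, "utf16le"));

console.log(asUtf8);    // [ 195, 169 ]
console.log(asLatin1);  // [ 233 ]
console.log(asUtf16le); // [ 233, 0 ]
```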

Whilst in principle there is an override for this in the accept-charset attribute of the <form> element, it doesn't work correctly (and is actively harmful in many cases) in IE. So avoid that one.

There are no explicit encoding-handling functions available in JavaScript itself. You can hack together a Unicode-to-UTF-8-bytes encoder by chaining unescape(encodeURIComponent(str)) (and similarly the other way round with the inverse function), but that's about it.

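A sketch of the unescape(encodeURIComponent(str)) trick, next to the TextEncoder and TextDecoder APIs that current browsers and Node.js provide and that give the same bytes more directly. The function names here are made up for the example, and unescape is a legacy function kept only for compatibility.

```javascript
// Legacy hack: percent-encode the string as UTF-8, then turn each %XX back
// into one character whose char code is the byte value.
function utf8BytesLegacy(str) {
  const binary = unescape(encodeURIComponent(str));
  return Array.from(binary, (ch) => ch.charCodeAt(0));
}

// The same bytes from the built-in encoder.
function utf8Bytes(str) {
  return Array.from(new TextEncoder().encode(str));
}

// The other direction: UTF-8 bytes back to a JavaScript string.
function fromUtf8Bytes(bytes) {
  return new TextDecoder("utf-8").decode(Uint8Array.from(bytes));
}

console.log(utf8BytesLegacy("\u00E9"));                       // [ 195, 169 ]
console.log(utf8Bytes("\u{1F600}"));                          // [ 240, 159, 152, 128 ]
console.log(fromUtf8Bytes([240, 159, 152, 128]) === "\u{1F600}"); // true
```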

#3


1  

The text in a text box is not encoded in any way; it is "text", an abstract series of characters. In almost every contemporary application, that text is expressed as a sequence of Unicode code points, which are integers mapped to particular abstract characters. Text doesn't get "encoded" until it is turned into a sequence of bytes, as when submitting the form. At that time, the encoding is determined by the encoding of the HTML page in which the form appears, or by the accept-charset attribute of the form element.
