When implementing HTTP services in Node.js, a lot of sample code like the following is used to read the whole request entity (the data uploaded by the client, for example a POST with a JSON body):
var http = require('http');

var server = http.createServer(function(req, res) {
    var data = '';
    req.setEncoding('utf8');
    req.on('data', function(chunk) {
        data += chunk;
    });
    req.on('end', function() {
        // parse data
    });
});
Using req.setEncoding('utf8') automatically decodes the incoming bytes into a string, assuming the input is UTF-8 encoded. But I get the feeling this can break: what if we receive a chunk of data that ends in the middle of a multi-byte UTF-8 character? We can simulate this:
> new Buffer("café")
<Buffer 63 61 66 c3 a9>
> new Buffer("café").slice(0,4)
<Buffer 63 61 66 c3>
> new Buffer("café").slice(0,4).toString('utf8')
'caf�'
So we get a replacement character instead of waiting for the next bytes to properly decode the last character.
Therefore, unless the request object takes care of this, making sure that only completely decoded characters are pushed into chunks, this ubiquitous code sample is broken.
The alternative would be to use buffers, and handle the problem of buffer size limits:
var http = require('http');
var MAX_REQUEST_BODY_SIZE = 16 * 1024 * 1024;

var server = http.createServer(function(req, res) {
    // A better way to do this could be to start with a small buffer
    // and grow it geometrically until the limit is reached.
    var requestBody = new Buffer(MAX_REQUEST_BODY_SIZE);
    var requestBodyLength = 0;

    req.on('data', function(chunk) {
        if (requestBodyLength + chunk.length >= MAX_REQUEST_BODY_SIZE) {
            res.statusCode = 413; // Request Entity Too Large
            return;
        }
        chunk.copy(requestBody, requestBodyLength, 0, chunk.length);
        requestBodyLength += chunk.length;
    });

    req.on('end', function() {
        if (res.statusCode == 413) {
            // handle 413 error
            return;
        }
        requestBody = requestBody.toString('utf8', 0, requestBodyLength);
        // process requestBody as string
    });
});
Am I right, or is this already taken care of by the HTTP request class?
3 Answers
#1
7
This is taken care of automatically. There is a string_decoder module in Node which is loaded when you call setEncoding. The decoder checks the last few bytes received and, if they do not form a complete character, stores them between emits of 'data', so 'data' will always deliver a correct string. If you do not call setEncoding, and don't use string_decoder yourself, then the emitted buffers can have the issue you mentioned, though.
The docs aren't much help (http://nodejs.org/docs/latest/api/string_decoder.html), but you can see the module here: https://github.com/joyent/node/blob/master/lib/string_decoder.js
The implementation of setEncoding and the logic for emitting also make this clearer.
#2
1
Just add response.setEncoding('utf8'); inside the request.on('response') callback function. In my case that was sufficient.
#3
0
// Posted: 'tèéïst3 ùél'
// Node returns: 't%C3%A8%C3%A9%C3%AFst3+%C3%B9%C3%A9l'
decodeURI('t%C3%A8%C3%A9%C3%AFst3+%C3%B9%C3%A9l');
// Returns 'tèéïst3+ùél'