Parsing HTTP messages with http-parser: a detailed guide

Date: 2023-01-18 04:38:16

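Overview

I used http-parser in a project, so here is a brief explanation of how to use it.

Download: https://github.com/joyent/http-parser

Its README is very detailed.

Open-source examples

Source code from the open-source tcpflow 1.4.4 that uses http-parser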

<span xmlns="http://www.w3.org/1999/xhtml" style="">/* -*- mode: C++; c-basic-offset: 4; indent-tabs-mode: nil -*- */
/**
*
* scan_http:
* Decodes HTTP responses
*/

#include "config.h"

#include "tcpflow.h"
#include "tcpip.h"
#include "tcpdemux.h"

#include "http-parser/http_parser.h"

#include "mime_map.h"

#ifdef HAVE_SYS_WAIT_H
#include <sys/wait.h>
#endif


#ifdef HAVE_LIBZ
# define ZLIB_CONST
# ifdef GNUC_HAS_DIAGNOSTIC_PRAGMA
# pragma GCC diagnostic ignored "-Wundef"
# pragma GCC diagnostic ignored "-Wcast-qual"
# endif
# ifdef HAVE_ZLIB_H
# include <zlib.h>
# endif
#else
# define z_stream void * // prevents z_stream from generating an error
#endif

#define MIN_HTTP_BUFSIZE 80 // don't bother parsing smaller than this

#include <sys/types.h>
#include <iostream>
#include <algorithm>
#include <map>
#include <iomanip>

#define HTTP_CMD "http_cmd"
#define HTTP_ALERT_FD "http_alert_fd"

/* options */
std::string http_cmd; // command to run on each http object
int http_subproc_max = 10; // how many subprocesses are we allowed?
int http_subproc = 0; // how many do we currently have?
int http_alert_fd = -1; // where should we send alerts?


/* define a callback object for sharing state between scan_http() and its callbacks
*/
class scan_http_cbo {
private:
typedef enum {NOTHING,FIELD,VALUE} last_on_header_t;
scan_http_cbo(const scan_http_cbo& c); // not implemented
scan_http_cbo &operator=(const scan_http_cbo &c); // not implemented

public:
virtual ~scan_http_cbo(){
on_message_complete(); // make sure message was ended
}
scan_http_cbo(const std::string& path_,const char *base_,std::stringstream *xmlstream_) :
path(path_), base(base_),xmlstream(xmlstream_),xml_fo(),request_no(0),
headers(), last_on_header(NOTHING), header_value(), header_field(),
output_path(), fd(-1), first_body(true),bytes_written(0),unzip(false),zs(),zinit(false),zfail(false){};
private:

const std::string path; // where data gets written
const char *base; // where data started in memory
std::stringstream *xmlstream; // if present, where to put the fileobject annotations
std::stringstream xml_fo; // xml stream for this file object
int request_no; // request number

/* parsed headers */
std::map<std::string, std::string> headers;

/* placeholders for possibly-incomplete header data */
last_on_header_t last_on_header;
std::string header_value, header_field;
std::string output_path;
int fd; // fd for writing
bool first_body; // first call to on_body after headers
uint64_t bytes_written;

/* decompression for gzip-encoded streams. */
bool unzip; // should we be decompressing?
z_stream zs; // zstream (avoids casting and memory allocation)
bool zinit; // we have initialized the zstream
bool zfail; // zstream failed in some manner, so ignore the rest of this stream

/* The static functions are callbacks; they wrap the method calls */
#define CBO (reinterpret_cast<scan_http_cbo*>(parser->data))
public:
static int scan_http_cb_on_message_begin(http_parser * parser) { return CBO->on_message_begin();}
static int scan_http_cb_on_url(http_parser * parser, const char *at, size_t length) { return 0;}
static int scan_http_cb_on_header_field(http_parser * parser, const char *at, size_t length) { return CBO->on_header_field(at,length);}
static int scan_http_cb_on_header_value(http_parser * parser, const char *at, size_t length) { return CBO->on_header_value(at,length); }
static int scan_http_cb_on_headers_complete(http_parser * parser) { return CBO->on_headers_complete();}
static int scan_http_cb_on_body(http_parser * parser, const char *at, size_t length) { return CBO->on_body(at,length);}
static int scan_http_cb_on_message_complete(http_parser * parser) {return CBO->on_message_complete();}
#undef CBO
private:
int on_message_begin();
int on_url(const char *at, size_t length);
int on_header_field(const char *at, size_t length);
int on_header_value(const char *at, size_t length);
int on_headers_complete();
int on_body(const char *at, size_t length);
int on_message_complete();
};


/**
* on_message_begin:
* Increment request number. Note that the first request is request_no = 1
*/

int scan_http_cbo::on_message_begin()
{
request_no ++;
return 0;
}

/**
* on_url currently not implemented.
*/

int scan_http_cbo::on_url(const char *at, size_t length)
{
return 0;
}


/* Note 1: The state machine is defined in http-parser/README.md
* Note 2: All header field names are converted to lowercase.
* This is consistent with the RFC.
*/

int scan_http_cbo::on_header_field(const char *at,size_t length)
{
std::string field(at,length);
std::transform(field.begin(), field.end(), field.begin(), ::tolower);

switch(last_on_header){
case NOTHING:
// Allocate new buffer and copy callback data into it
header_field = field;
break;
case VALUE:
// New header started.
// Copy current name,value buffers to headers
// list and allocate new buffer for new name
headers[header_field] = header_value;
header_field = field;
break;
case FIELD:
// Previous name continues. Reallocate name
// buffer and append callback data to it
header_field.append(field);
break;
}
last_on_header = FIELD;
return 0;
}

int scan_http_cbo::on_header_value(const char *at, size_t length)
{
const std::string value(at,length);
switch(last_on_header){
case FIELD:
//Value for current header started. Allocate
//new buffer and copy callback data to it
header_value = value;
break;
case VALUE:
//Value continues. Reallocate value buffer
//and append callback data to it
header_value.append(value);
break;
case NOTHING:
// this shouldn't happen
DEBUG(10)("Internal error in http-parser");
break;
}
last_on_header = VALUE;

return 0;
}

/**
* called when last header is read.
* Determine the filename based on request_no and extension.
* Also see if decompressing is happening...
*/

int scan_http_cbo::on_headers_complete()
{
tcpdemux *demux = tcpdemux::getInstance();

/* Add the most recently read header to the map, if any */
if (last_on_header==VALUE) {
headers[header_field] = header_value;
header_field="";
}

/* Set output path to <path>-HTTPBODY-nnn.ext for each part.
* This is not consistent with tcpflow <= 1.3.0, which supported only one HTTPBODY,
* but it's correct...
*/

std::stringstream os;
os << path << "-HTTPBODY-" << std::setw(3) << std::setfill('0') << request_no << std::setw(0);

/* See if we can guess a file extension */
std::string extension = get_extension_for_mime_type(headers["content-type"]);
if (extension.size()) {
os << "." << extension;
}

output_path = os.str();

/* Choose an output function based on the content encoding */
std::string content_encoding(headers["content-encoding"]);

if ((content_encoding == "gzip" || content_encoding == "deflate") && (demux->opt.gzip_decompress)){
#ifdef HAVE_LIBZ
DEBUG(10) ( "%s: detected zlib content, decompressing", output_path.c_str());
unzip = true;
#else
/* We can't decompress, so just give it a .gz */
output_path.append(".gz");
DEBUG(5) ( "%s: refusing to decompress since zlib is unavailable", output_path.c_str() );
#endif
}

/* Open the output path */
fd = demux->retrying_open(output_path.c_str(), O_WRONLY|O_CREAT|O_BINARY|O_TRUNC, 0644);
if (fd < 0) {
DEBUG(1) ("unable to open HTTP body file %s", output_path.c_str());
}
if(http_alert_fd>=0){
std::stringstream ss;
ss << "open\t" << output_path << "\n";
const std::string &sso = ss.str();
if(write(http_alert_fd,sso.c_str(),sso.size())!=(int)sso.size()){
perror("write");
}
}

first_body = true; // next call to on_body will be the first one

/* We can do something smart with the headers here.
*
* For example, we could:
* - Record all headers into the report.xml
* - Pick the intended filename if we see Content-Disposition: attachment; name="..."
* - Record headers into filesystem extended attributes on the body file
*/
return 0;
}

/* Write to fd, optionally decompressing as we go */
int scan_http_cbo::on_body(const char *at,size_t length)
{
if (fd < 0) return -1; // no open fd? (internal error)
if (length==0) return 0; // nothing to write

if(first_body){ // stuff for first time on_body is called
xml_fo << " <byte_run file_offset='" << (at-base) << "'><fileobject><filename>" << output_path << "</filename>";
first_body = false;
}

/* If not decompressing, just write the data and return. */
if(unzip==false){
int rv = write(fd,at,length);
if(rv<0) return -1; // write error; that's bad
bytes_written += rv;
return 0;
}

#ifndef HAVE_LIBZ
assert(0); // shouldn't have gotten here
#endif
if(zfail) return 0; // stream was corrupt; ignore rest
/* set up this round of decompression, using a small local buffer */

/* Call init if we are not initialized */
char decompressed[65536]; // where decompressed data goes
if (!zinit) {
memset(&zs,0,sizeof(zs));
zs.next_in = (Bytef*)at;
zs.avail_in = length;
zs.next_out = (Bytef*)decompressed;
zs.avail_out = sizeof(decompressed);

int rv = inflateInit2(&zs, 32 + MAX_WBITS); /* 32 auto-detects gzip or deflate */
if (rv != Z_OK) {
/* fail! */
DEBUG(3) ("decompression failed at stream initialization; rv=%d bad Content-Encoding?",rv);
zfail = true;
return 0;
}
zinit = true; // successfully initted
} else {
zs.next_in = (Bytef*)at;
zs.avail_in = length;
zs.next_out = (Bytef*)decompressed;
zs.avail_out = sizeof(decompressed);
}

/* iteratively decompress, writing each time */
while (zs.avail_in > 0) {
/* decompress as much as possible */
int rv = inflate(&zs, Z_SYNC_FLUSH);

if (rv == Z_STREAM_END) {
/* are we done with the stream? */
if (zs.avail_in > 0) {
/* ...no. */
DEBUG(3) ("decompression completed, but with trailing garbage");
return 0;
}
} else if (rv != Z_OK) {
/* some other error */
DEBUG(3) ("decompression failed (corrupted stream?)");
zfail = true; // ignore the rest of this stream
return 0;
}

/* successful decompression, at least partly */
/* write the result */
int bytes_decompressed = sizeof(decompressed) - zs.avail_out;
ssize_t written = write(fd, decompressed, bytes_decompressed);

if (written < bytes_decompressed) {
DEBUG(3) ("writing decompressed data failed");
zfail= true;
return 0;
}
bytes_written += written;

/* reset the buffer for the next iteration */
zs.next_out = (Bytef*)decompressed;
zs.avail_out = sizeof(decompressed);
}
return 0;
}


/**
* called at the conclusion of each HTTP body.
* Clean out all of the state for this HTTP header/body pair.
*/

int scan_http_cbo::on_message_complete()
{
/* Close the file */
headers.clear();
header_field = "";
header_value = "";
last_on_header = NOTHING;
if(fd >= 0) {
if (::close(fd) != 0) {
perror("close() of http body");
}
fd = -1;
}

/* Erase zero-length files and update the DFXML */
if(bytes_written>0){
/* Update DFXML */
if(xmlstream){
xml_fo << "<filesize>" << bytes_written << "</filesize></fileobject></byte_run>\n";
if(xmlstream) *xmlstream << xml_fo.str();
}
if(http_alert_fd>=0){
std::stringstream ss;
ss << "close\t" << output_path << "\n";
const std::string &sso = ss.str();
if(write(http_alert_fd,sso.c_str(),sso.size()) != (int)sso.size()){
perror("write");
}
}
if(http_cmd.size()>0 && output_path.size()>0){
/* If we are at maximum number of subprocesses, wait for one to exit */
std::string cmd = http_cmd + " " + output_path;
#ifdef HAVE_FORK
int status=0;
pid_t pid = 0;
while(http_subproc >= http_subproc_max){
pid = wait(&status);
http_subproc--;
}
/* Fork off a child */
pid = fork();
if(pid<0) die("Cannot fork child");
if(pid==0){
/* We are the child */
exit(system(cmd.c_str()));
}
http_subproc++;
#else
system(cmd.c_str());
#endif
}
} else {
/* Nothing written; erase the file */
if(output_path.size() > 0){
::unlink(output_path.c_str());
}
}

/* Erase the state variables for this part */
xml_fo.str("");
output_path = "";
bytes_written=0;
unzip = false;
if(zinit){
inflateEnd(&zs);
zinit = false;
}
zfail = false;
return 0;
}


/***
* the HTTP scanner plugin itself
*/

extern "C"
void scan_http(const class scanner_params &sp,const recursion_control_block &rcb)
{
if(sp.sp_version!=scanner_params::CURRENT_SP_VERSION){
std::cerr << "scan_http requires sp version " << scanner_params::CURRENT_SP_VERSION << "; "
<< "got version " << sp.sp_version << "\n";
exit(1);
}

if(sp.phase==scanner_params::PHASE_STARTUP){
sp.info->name = "http";
sp.info->flags = scanner_info::SCANNER_DISABLED; // default disabled
sp.info->get_config(HTTP_CMD,&http_cmd,"Command to execute on each HTTP attachment");
sp.info->get_config(HTTP_ALERT_FD,&http_alert_fd,"File descriptor to send information about completed HTTP attachments");
return; /* No feature files created */
}

if(sp.phase==scanner_params::PHASE_SCAN){
/* See if there is an HTTP response */
if(sp.sbuf.bufsize>=MIN_HTTP_BUFSIZE && sp.sbuf.memcmp(reinterpret_cast<const uint8_t *>("HTTP/1."),0,7)==0){
/* Smells enough like HTTP to try parsing */
/* Set up callbacks */
http_parser_settings scan_http_parser_settings;
memset(&scan_http_parser_settings,0,sizeof(scan_http_parser_settings)); // in the event that new callbacks get created
scan_http_parser_settings.on_message_begin = scan_http_cbo::scan_http_cb_on_message_begin;
scan_http_parser_settings.on_url = scan_http_cbo::scan_http_cb_on_url;
scan_http_parser_settings.on_header_field = scan_http_cbo::scan_http_cb_on_header_field;
scan_http_parser_settings.on_header_value = scan_http_cbo::scan_http_cb_on_header_value;
scan_http_parser_settings.on_headers_complete = scan_http_cbo::scan_http_cb_on_headers_complete;
scan_http_parser_settings.on_body = scan_http_cbo::scan_http_cb_on_body;
scan_http_parser_settings.on_message_complete = scan_http_cbo::scan_http_cb_on_message_complete;

if(sp.sxml) (*sp.sxml) << "\n <byte_runs>\n";
for(size_t offset=0;;){
/* Set up a parser instance for the next chunk of HTTP responses and data.
* This might be repeated several times due to connection re-use and multiple requests.
* Note that the parser is not a C++ library but it can pass a "data" to the
* callback. We put the address for the scan_http_cbo object in the data and
* recover it with a cast in each of the callbacks.
*/

/* Make an sbuf for the remaining data.
* Note that this may not be necessary, because in our test runs the parser
* processed all of the data the first time through...
*/
sbuf_t sub_buf(sp.sbuf, offset);

const char *base = reinterpret_cast<const char*>(sub_buf.buf);
http_parser parser;
http_parser_init(&parser, HTTP_RESPONSE);

scan_http_cbo cbo(sp.sbuf.pos0.path,base,sp.sxml);
parser.data = &cbo;

/* Parse */
size_t parsed = http_parser_execute(&parser, &scan_http_parser_settings,
base, sub_buf.size());
assert(parsed <= sub_buf.size());

/* Indicate EOF (flushing callbacks) and terminate if we parsed the entire buffer.
*/
if (parsed == sub_buf.size()) {
http_parser_execute(&parser, &scan_http_parser_settings, NULL, 0);
break;
}

/* Stop parsing if we parsed nothing, as that indicates something header! */
if (parsed == 0) {
break;
}

/* Stop parsing if we're a connection upgrade (e.g. WebSockets) */
if (parser.upgrade) {
DEBUG(9) ("upgrade connection detected (WebSockets?); cowardly refusing to dump further");
break;
}

/* Bump the offset for next iteration */
offset += parsed;
}
if(sp.sxml) (*sp.sxml) << " </byte_runs>";
}
}
}</span>

Here, struct http_parser_settings is used to set the callbacks, and http_parser does the parsing.


Usage in the open-source libtnet (master branch)

#include "httpparser.h"

#include "httputil.h"

#include "log.h"

using namespace std;

namespace tnet
{
struct http_parser_settings ms_settings;

class HttpParserSettings
{
public:
HttpParserSettings();

static int onMessageBegin(struct http_parser*);
static int onUrl(struct http_parser*, const char*, size_t);
static int onStatusComplete(struct http_parser*);
static int onHeaderField(struct http_parser*, const char*, size_t);
static int onHeaderValue(struct http_parser*, const char*, size_t);
static int onHeadersComplete(struct http_parser*);
static int onBody(struct http_parser*, const char*, size_t);
static int onMessageComplete(struct http_parser*);
};

HttpParserSettings::HttpParserSettings()
{
ms_settings.on_message_begin = &HttpParserSettings::onMessageBegin;
ms_settings.on_url = &HttpParserSettings::onUrl;
ms_settings.on_status_complete = &HttpParserSettings::onStatusComplete;
ms_settings.on_header_field = &HttpParserSettings::onHeaderField;
ms_settings.on_header_value = &HttpParserSettings::onHeaderValue;
ms_settings.on_headers_complete = &HttpParserSettings::onHeadersComplete;
ms_settings.on_body = &HttpParserSettings::onBody;
ms_settings.on_message_complete = &HttpParserSettings::onMessageComplete;
}

static HttpParserSettings initObj;

int HttpParserSettings::onMessageBegin(struct http_parser* parser)
{
HttpParser* p = (HttpParser*)parser->data;
return p->onParser(HttpParser::Parser_MessageBegin, 0, 0);
}

int HttpParserSettings::onUrl(struct http_parser* parser, const char* at, size_t length)
{
HttpParser* p = (HttpParser*)parser->data;
return p->onParser(HttpParser::Parser_Url, at, length);
}

int HttpParserSettings::onStatusComplete(struct http_parser* parser)
{
HttpParser* p = (HttpParser*)parser->data;
return p->onParser(HttpParser::Parser_StatusComplete, 0, 0);
}

int HttpParserSettings::onHeaderField(struct http_parser* parser, const char* at, size_t length)
{
HttpParser* p = (HttpParser*)parser->data;
return p->onParser(HttpParser::Parser_HeaderField, at, length);
}

int HttpParserSettings::onHeaderValue(struct http_parser* parser, const char* at, size_t length)
{
HttpParser* p = (HttpParser*)parser->data;
return p->onParser(HttpParser::Parser_HeaderValue, at, length);
}

int HttpParserSettings::onHeadersComplete(struct http_parser* parser)
{
HttpParser* p = (HttpParser*)parser->data;
return p->onParser(HttpParser::Parser_HeadersComplete, 0, 0);
}

int HttpParserSettings::onBody(struct http_parser* parser, const char* at, size_t length)
{
HttpParser* p = (HttpParser*)parser->data;
return p->onParser(HttpParser::Parser_Body, at, length);
}

int HttpParserSettings::onMessageComplete(struct http_parser* parser)
{
HttpParser* p = (HttpParser*)parser->data;
return p->onParser(HttpParser::Parser_MessageComplete, 0, 0);
}


HttpParser::HttpParser(enum http_parser_type type)
{
http_parser_init(&m_parser, type);

m_parser.data = this;

m_lastWasValue = true;
}

HttpParser::~HttpParser()
{

}

int HttpParser::onParser(Event event, const char* at, size_t length)
{
switch(event)
{
case Parser_MessageBegin:
return handleMessageBegin();
case Parser_Url:
return onUrl(at, length);
case Parser_StatusComplete:
return 0;
case Parser_HeaderField:
return handleHeaderField(at, length);
case Parser_HeaderValue:
return handleHeaderValue(at, length);
case Parser_HeadersComplete:
return handleHeadersComplete();
case Parser_Body:
return onBody(at, length);
case Parser_MessageComplete:
return onMessageComplete();
default:
break;
}

return 0;
}

int HttpParser::handleMessageBegin()
{
m_curField.clear();
m_curValue.clear();
m_lastWasValue = true;

m_errorCode = 0;

return onMessageBegin();
}

int HttpParser::handleHeaderField(const char* at, size_t length)
{
if(m_lastWasValue)
{
if(!m_curField.empty())
{
onHeader(HttpUtil::normalizeHeader(m_curField), m_curValue);
}

m_curField.clear();
m_curValue.clear();
}

m_curField.append(at, length);

m_lastWasValue = 0;

return 0;
}

int HttpParser::handleHeaderValue(const char* at, size_t length)
{
m_curValue.append(at, length);
m_lastWasValue = 1;

return 0;
}

int HttpParser::handleHeadersComplete()
{
if(!m_curField.empty())
{
string field = HttpUtil::normalizeHeader(m_curField);
onHeader(field, m_curValue);
}

return onHeadersComplete();
}

int HttpParser::execute(const char* buffer, size_t count)
{
int n = http_parser_execute(&m_parser, &ms_settings, buffer, count);
if(m_parser.upgrade)
{
onUpgrade(buffer + n, count - n);
return 0;
}
else if(n != count)
{
int code = (m_errorCode != 0 ? m_errorCode : 400);

HttpError error(code, http_errno_description((http_errno)m_parser.http_errno));

LOG_ERROR("parser error %s", error.message.c_str());

onError(error);

return code;
}

return 0;
}
}



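Explanation


Summary

http-parser is an HTTP message parser written in C. It parses both requests and responses, and it is designed for high-performance HTTP applications. It makes no system calls, allocates no heap memory, does not buffer data, and can be interrupted at any time. Depending on your architecture, it needs only about 40 bytes of data per message stream (in a web server, that is per connection), which is the size of the parser's own basic data structure.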

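Features:

- No third-party dependencies
- Handles persistent streams (keep-alive)
- Decodes chunked encoding
- Supports protocol Upgrade (in practice, this almost always means WebSocket)
- Defends against buffer overflow attacks

The parser extracts the following information from HTTP messages:

- Header fields and values
- Content-Length
- Request method
- Response status code
- Transfer-Encoding
- HTTP version
- Request URL
- Message body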

Basic usage:

Use one http_parser object per TCP connection. Initialize the struct with http_parser_init and set up the callbacks. For a request parser, that might look something like this:

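// Set up the callbacks
http_parser_settings settings;
memset(&settings, 0, sizeof(settings)); // callbacks that are not set must be NULL
settings.on_url = my_url_callback;
settings.on_header_field = my_header_field_callback;
/* ... */

// Allocate memory for the struct
http_parser *parser = malloc(sizeof(http_parser));
// Initialize the parser
http_parser_init(parser, HTTP_REQUEST);
// Store caller data for use inside the callbacks
parser->data = my_socket;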

When data arrives, run the parser and check for errors:

size_t len = 80*1024; // size of the receive buffer: 80 KB
size_t nparsed;       // number of bytes parsed
char buf[len];        // receive buffer
ssize_t recved;       // number of bytes actually received

// Receive data
recved = recv(fd, buf, len, 0);

// A negative count means reading from the socket failed
if (recved < 0) {
    /* Handle error. */
}

/* Start up / continue the parser.
 * Note we pass recved==0 to signal that EOF has been received.
 */
// Run the parser
// parser:    the parser object
// &settings: the callbacks to invoke while parsing
// buf:       the data to parse
// recved:    the number of bytes to parse
nparsed = http_parser_execute(parser, &settings, buf, recved);

// The connection was upgraded to another protocol (e.g. WebSocket)
if (parser->upgrade) {
    /* handle new protocol */
// Parse error: fewer bytes were parsed than were passed to http_parser_execute
} else if (nparsed != (size_t)recved) {
    /* Handle error. Usually just close the connection. */
}

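HTTP needs to know where the stream ends.

For example, some servers send responses without a Content-Length header and expect the client to keep reading from the socket until EOF. To tell http_parser about EOF, pass 0 as the last argument to http_parser_execute. Callbacks may still fire and errors may still occur while the parser handles EOF, so be prepared to handle both.

Note: in other words, http_parser_execute has to be called several times whenever the data from the HTTP server or client cannot be received in one go. Call http_parser_execute once after each piece of data arrives; when the socket reports EOF, stop parsing and notify http_parser that the stream has ended.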
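The receive loop described above can be sketched without the library itself. This is a minimal sketch under stated assumptions: execute_fn is a hypothetical stand-in for a wrapper around http_parser_execute, read_fn is a hypothetical stand-in for recv, and mem_src, mem_read, count_ctx and count_exec exist only so the loop can be exercised from memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a wrapper around http_parser_execute():
 * consumes len bytes and returns how many were parsed; len == 0 means EOF. */
typedef size_t (*execute_fn)(void *ctx, const char *buf, size_t len);

/* Hypothetical stand-in for recv(): returns bytes read, 0 at EOF, < 0 on error. */
typedef long (*read_fn)(void *src, char *buf, size_t cap);

/* Feed every piece to the parser, then signal EOF with a zero-length call.
 * Returns 0 on success, -1 on a read error, -2 on a parse error. */
int feed_until_eof(read_fn rd, void *src, execute_fn exec, void *ctx)
{
    char buf[4096];
    assert(rd != NULL && exec != NULL);
    for (;;) {
        long n = rd(src, buf, sizeof buf);
        size_t parsed;
        if (n < 0) return -1;
        /* n == 0 is passed through on purpose: it tells the parser about EOF */
        parsed = exec(ctx, buf, (size_t)n);
        if (parsed != (size_t)n) return -2;
        if (n == 0) return 0;
    }
}

/* In-memory source that hands out at most `step` bytes per read. */
struct mem_src { const char *data; size_t len, pos, step; };

long mem_read(void *src, char *buf, size_t cap)
{
    struct mem_src *s = src;
    size_t n = s->len - s->pos;
    if (n > s->step) n = s->step;
    if (n > cap) n = cap;
    memcpy(buf, s->data + s->pos, n);
    s->pos += n;
    return (long)n;
}

/* Toy "parser" that only counts what it is given. */
struct count_ctx { size_t bytes; int calls; int eof_seen; };

size_t count_exec(void *ctx, const char *buf, size_t len)
{
    struct count_ctx *c = ctx;
    (void)buf;
    c->calls++;
    c->bytes += len;
    if (len == 0) c->eof_seen = 1;
    return len;
}
```

In a real program, exec would wrap http_parser_execute with the same parser object on every call, and a return of -2 would be the place to also check parser->upgrade before treating the short count as an error.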

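Scalar message information such as status_code, method, and the HTTP version is stored in the parser structure. This data is held in http_parser only temporarily and is reset for each new message (for example, when several keep-alive messages share one parser). If you need it later, copy it out before on_headers_complete returns.

Note: initializing a separate parser for each HTTP connection keeps data from different connections apart, but the fields are still overwritten by each new message on the same connection.

The parser decodes Transfer-Encoding for both HTTP requests and responses transparently. That is, chunked encoding is decoded before on_body is called.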

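The special case of Upgrade

HTTP supports upgrading a connection to a different protocol. An increasingly common example is the WebSocket protocol, which sends a request like this: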

GET /demo HTTP/1.1
Upgrade: WebSocket
Connection: Upgrade
Host: example.com
Origin: http://example.com
WebSocket-Protocol: sample

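Once the WebSocket request headers have been sent, the data that follows is no longer HTTP.

For details on the WebSocket protocol, see: http://tools.ietf.org/html/draft-hixie-thewebsocketprotocol-75

To support protocols like WebSocket, the parser treats the request as a normal HTTP message without a body (headers only) and invokes both the on_headers_complete and on_message_complete callbacks. In any case, http_parser_execute stops parsing at the end of the headers and returns.

After http_parser_execute returns, check whether parser->upgrade has been set to 1. The non-HTTP data (everything after the HTTP headers) begins in the supplied buffer at the offset given by the return value of http_parser_execute.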
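The offset arithmetic is small but easy to get wrong, so here is a minimal sketch of it. It assumes nparsed is the value returned by http_parser_execute and received is the byte count passed to it; struct span and bytes_after_headers are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* A pointer/length pair describing the bytes that follow the HTTP headers. */
struct span { const char *ptr; size_t len; };

/* Given the receive buffer, the number of bytes received and the number of
 * bytes the parser consumed, return the leftover non-HTTP bytes. */
struct span bytes_after_headers(const char *buf, size_t received, size_t nparsed)
{
    struct span rest;
    assert(buf != NULL && nparsed <= received);
    rest.ptr = buf + nparsed;       /* first byte of the new protocol */
    rest.len = received - nparsed;  /* may be 0 if nothing followed yet */
    return rest;
}
```

This is the same computation the libtnet code above performs when it calls onUpgrade(buffer + n, count - n).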

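Callbacks

While http_parser_execute runs, the callbacks set in http_parser_settings are invoked. The parser maintains its own state and never looks back at earlier data, so there is no need to buffer the input. If you need to keep certain data for later use, save it from within the callbacks.

There are two types of callbacks:

Notification callbacks: typedef int (*http_cb) (http_parser *); These are on_message_begin, on_headers_complete, and on_message_complete.

Data callbacks: typedef int (*http_data_cb) (http_parser *, const char *at, size_t length); These are on_url (requests only) and, for both requests and responses, on_header_field, on_header_value, and on_body.

A callback should return 0 on success. Returning a non-zero value tells the parser that an error occurred, and the parser exits immediately.

If you feed an HTTP message to the parser in pieces (for example, read() the request line from the socket and parse it, then read half of the headers and parse again, and so on), your data callbacks may be called more than once for the same element. http-parser guarantees that the data pointer passed to a callback is valid only for the duration of that callback, because it reports results as a pointer and a length into the buffer being parsed. If it suits your application, you can also read() into a heap-allocated buffer to avoid unnecessary copying.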
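Because a header name or value can arrive split across several callback invocations, the callbacks have to append rather than overwrite. The following is a minimal sketch of that bookkeeping, in the same spirit as the tcpflow and libtnet code quoted earlier. It does not use the library: header_acc, acc_on_field, acc_on_value and acc_flush are hypothetical names, the fixed buffer sizes are an assumption, and over-long data is silently truncated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum last_cb { LAST_NONE, LAST_FIELD, LAST_VALUE };

struct header_acc {
    enum last_cb last;      /* which callback fired most recently */
    char field[64];         /* header name being assembled */
    char value[256];        /* header value being assembled */
    int completed;          /* number of complete pairs seen so far */
    char done_field[64];    /* most recently completed name */
    char done_value[256];   /* most recently completed value */
};

/* Append len bytes to a NUL-terminated buffer, truncating if it is full. */
static void append(char *dst, size_t cap, const char *at, size_t len)
{
    size_t used = strlen(dst);
    if (len > cap - 1 - used) len = cap - 1 - used;
    memcpy(dst + used, at, len);
    dst[used + len] = '\0';
}

/* Publish the pending pair, if a value has been seen for it.
 * Also call this from the headers-complete callback. */
void acc_flush(struct header_acc *a)
{
    assert(a != NULL);
    if (a->last != LAST_VALUE) return;
    strcpy(a->done_field, a->field);
    strcpy(a->done_value, a->value);
    a->completed++;
    a->field[0] = '\0';
    a->value[0] = '\0';
    a->last = LAST_NONE;
}

/* Body of an on_header_field-style callback. */
int acc_on_field(struct header_acc *a, const char *at, size_t len)
{
    assert(a != NULL);
    acc_flush(a);                       /* a new name ends the previous pair */
    append(a->field, sizeof a->field, at, len);
    a->last = LAST_FIELD;
    return 0;                           /* 0 tells the parser to continue */
}

/* Body of an on_header_value-style callback. */
int acc_on_value(struct header_acc *a, const char *at, size_t len)
{
    assert(a != NULL);
    append(a->value, sizeof a->value, at, len);
    a->last = LAST_VALUE;
    return 0;
}
```

The struct must be zeroed before first use (for example with memset). A real application would store each completed pair in a map or list instead of keeping only the latest one.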

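The naive approach is to pass whatever you have just read to http_parser_execute after every read.

Note: when one complete HTTP message is parsed across several calls, you must use the same parser object for all of them!

In practice, though, things are more complicated:

First, following the rules for HTTP headers, keep reading from the socket until you see \r\n\r\n, which marks the end of the header block. At that point you can hand the data to http_parser, or keep reading the entity body according to the rules below.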
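Detecting the end of the header block in the bytes accumulated so far can be sketched as follows. find_header_end is a hypothetical helper, not part of http-parser; it simply scans for the blank line.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the offset of the first byte after "\r\n\r\n",
 * or -1 if the header block is not complete yet. */
long find_header_end(const char *buf, size_t len)
{
    size_t i;
    assert(buf != NULL || len == 0);
    for (i = 0; i + 4 <= len; i++) {
        if (memcmp(buf + i, "\r\n\r\n", 4) == 0)
            return (long)(i + 4);
    }
    return -1;
}
```

A return of -1 means more data has to be read before the headers are complete; a non-negative return is where the entity body (if any) begins.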

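If the message uses Content-Length to give the size of the entity body, then both HTTP clients and servers should go on to read exactly the number of body bytes that Content-Length specifies.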
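A minimal sketch of the Content-Length bookkeeping, assuming the header value has already been extracted as a run of decimal digits. parse_content_length and body_bytes_missing are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Parse a Content-Length value consisting only of decimal digits.
 * Returns 0 and stores the result in *out, or -1 if the value is empty,
 * contains a non-digit, or does not fit in size_t. */
int parse_content_length(const char *value, size_t len, size_t *out)
{
    size_t i, n = 0;
    assert(value != NULL && out != NULL);
    if (len == 0) return -1;
    for (i = 0; i < len; i++) {
        size_t digit;
        if (value[i] < '0' || value[i] > '9') return -1;
        digit = (size_t)(value[i] - '0');
        if (n > (SIZE_MAX - digit) / 10) return -1;   /* would overflow */
        n = n * 10 + digit;
    }
    *out = n;
    return 0;
}

/* How many body bytes still have to be read from the socket. */
size_t body_bytes_missing(size_t content_length, size_t body_received)
{
    return body_received >= content_length ? 0 : content_length - body_received;
}
```

Reading stops once body_bytes_missing returns 0; any further bytes on a keep-alive connection belong to the next message.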

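For an entity body sent in chunks, the transfer encoding is chunked, that is, Transfer-Encoding: chunked. Chunked encoding is generally used only for HTTP responses (an HTTP request can also specify chunked encoding, but not every HTTP server supports it). In this case you can read a fixed amount of data (say 4096 bytes), hand it to the parser, and repeat until the chunked body ends.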
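To make the chunked wire format concrete, here is a minimal sketch that decodes a complete chunked body held in one buffer. This is not how http-parser works internally (the library decodes incrementally and delivers the data through on_body); decode_chunked is a hypothetical helper that ignores chunk extensions and trailers.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode a complete chunked body from in[0..len) into out[0..cap).
 * Returns the decoded length, or -1 if the input is malformed,
 * incomplete, or does not fit in the output buffer. */
long decode_chunked(const char *in, size_t len, char *out, size_t cap)
{
    size_t i = 0, o = 0;
    assert(in != NULL && out != NULL);
    for (;;) {
        size_t size = 0;
        int digits = 0;
        /* chunk size: one or more hexadecimal digits */
        while (i < len && isxdigit((unsigned char)in[i])) {
            int c = (unsigned char)in[i];
            int v = isdigit(c) ? c - '0' : tolower(c) - 'a' + 10;
            if (size > (SIZE_MAX >> 4)) return -1;    /* would overflow */
            size = (size << 4) | (size_t)v;
            digits++;
            i++;
        }
        if (digits == 0) return -1;
        /* skip any chunk extension up to the CRLF that ends the size line */
        while (i < len && in[i] != '\r') i++;
        if (i + 2 > len || in[i] != '\r' || in[i + 1] != '\n') return -1;
        i += 2;
        if (size == 0) return (long)o;    /* last chunk; trailers are ignored */
        if (size > len - i || size > cap - o) return -1;
        memcpy(out + o, in + i, size);
        o += size;
        i += size;
        /* every chunk's data is followed by CRLF */
        if (i + 2 > len || in[i] != '\r' || in[i + 1] != '\n') return -1;
        i += 2;
    }
}
```

With http-parser itself none of this is needed: the raw bytes are passed to http_parser_execute as they arrive and on_body receives the decoded data.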


Simple, isn't it? Go ahead and use it in your own project!


References:

https://github.com/joyent/http-parser

https://github.com/simsong/tcpflow

https://github.com/siddontang/libtnet

http://rootk.com/post/tutorial-for-http-parser.html