What is the simplest way to grab a web page in C?

Date: 2021-09-26 14:03:07

I'm working on an old school linux variant (QNX to be exact) and need a way to grab a web page (no cookies or login, the target URL is just a text file) using nothing but sockets and arrays.

Anyone got a snippet for this?

Note: I don't control the server, and I've got very little to work with besides what is already on the box (adding additional libraries is not really "easy" given the constraints -- although I do love libcurl).

2 Answers

#1


8  

I do have some code, but it also supports (Open)SSL so it's a bit long to post here.

In essence:

  • parse the URL (split out the URL scheme, host name, port number, and scheme-specific part)

  • create the socket:

    s = socket(PF_INET, SOCK_STREAM, proto);

  • populate a sockaddr_in structure with the remote IP and port

  • connect the socket to the far end:

    err = connect(s, &addr, sizeof(addr));

  • make the request string:

    n = snprintf(headers, sizeof(headers), "GET /%s HTTP/1.0\r\nHost: %s\r\n\r\n", ...);

  • send the request string:

    write(s, headers, n);

  • read the data:

    while ((n = read(s, buffer, bufsize)) > 0) { ... }

  • close the socket:

    close(s);

NB: the pseudo-code above collects both the response headers and the body; the split between the two is the first blank line.
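
Putting those steps together, here is a minimal sketch of the whole fetch. The host name and path are placeholders for illustration, everything is plain HTTP/1.0 on port 80, and error handling is kept to the bare minimum; gethostbyname and a blocking socket are used to stay within plain BSD sockets:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    int main(void)
    {
        const char *host = "example.com";   /* placeholder host */
        const char *path = "file.txt";      /* placeholder remote file */
        char headers[512], buffer[4096];
        struct hostent *he;
        struct sockaddr_in addr;
        int s, n;

        /* resolve the host name to an IPv4 address */
        if ((he = gethostbyname(host)) == NULL) {
            fprintf(stderr, "cannot resolve %s\n", host);
            return 1;
        }

        /* create the socket */
        if ((s = socket(PF_INET, SOCK_STREAM, 0)) < 0) {
            perror("socket");
            return 1;
        }

        /* populate sockaddr_in with the remote IP and port 80 */
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(80);
        memcpy(&addr.sin_addr, he->h_addr_list[0], he->h_length);

        /* connect the socket to the far end */
        if (connect(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");
            return 1;
        }

        /* make and send the request string */
        n = snprintf(headers, sizeof(headers),
                     "GET /%s HTTP/1.0\r\nHost: %s\r\n\r\n", path, host);
        write(s, headers, n);

        /* read the response (headers, blank line, then the body) */
        while ((n = read(s, buffer, sizeof(buffer))) > 0)
            fwrite(buffer, 1, n, stdout);

        close(s);
        return 0;
    }

With HTTP/1.0 the server closes the connection after sending the body, so the read loop simply runs until read() returns 0. The output above still contains the response headers; to separate them from the payload, scan the accumulated data for the first blank line (the byte sequence "\r\n\r\n") and keep everything after it.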

#2


9  

I'd look at libcurl if you want SSL support or anything fancy.

However, if you just want to get a simple web page from port 80, then just open a TCP socket, send "GET /index.html HTTP/1.0\r\n\r\n", and parse the output.
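
For reference, a sketch of that request as it would look in code, assuming a socket descriptor s that is already connected; note that virtual-hosted servers will usually also want a Host: header even with HTTP/1.0:

    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>

    /* Send the minimal HTTP/1.0 request from answer #2 on an already
     * connected TCP socket.  Note the "\r\n" line endings and the
     * final blank line that terminates the header block. */
    static ssize_t send_index_request(int s)
    {
        const char *request = "GET /index.html HTTP/1.0\r\n\r\n";
        return write(s, request, strlen(request));
    }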
