What is the simplest way to grab a web page in C?

Date: 2021-09-26 14:03:07

I'm working on an old school linux variant (QNX to be exact) and need a way to grab a web page (no cookies or login, the target URL is just a text file) using nothing but sockets and arrays.

Anyone got a snippet for this?

Note: I don't control the server, and I've got very little to work with besides what is already on the box (adding additional libraries is not really "easy" given the constraints -- although I do love libcurl).

2 Answers

#1


8  

I do have some code, but it also supports (Open)SSL so it's a bit long to post here.

In essence:

  • parse the URL (split out the URL scheme, host name, port number, and scheme-specific part)

  • create the socket:

    s = socket(PF_INET, SOCK_STREAM, proto);

  • populate a sockaddr_in structure with the remote IP and port

  • connect the socket to the far end:

    err = connect(s, &addr, sizeof(addr));

  • make the request string:

    n = snprintf(headers, sizeof(headers), "GET /%s HTTP/1.0\r\nHost: %s\r\n\r\n", ...);

  • send the request string:

    write(s, headers, n);

  • read the data:

    while ((n = read(s, buffer, bufsize)) > 0) { ... }

  • close the socket:

    close(s);

NB: the pseudo-code above collects both the response headers and the body; the split between the two is the first blank line.
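
Putting those steps together, here is a minimal sketch of the whole fetch. The host name and path are placeholders for illustration, everything is plain HTTP/1.0 on port 80, and error handling is kept to the bare minimum; gethostbyname and a blocking socket are used to stay within plain BSD sockets:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    int main(void)
    {
        const char *host = "example.com";   /* placeholder host */
        const char *path = "file.txt";      /* placeholder remote file */
        char headers[512], buffer[4096];
        struct hostent *he;
        struct sockaddr_in addr;
        int s, n;

        /* resolve the host name to an IPv4 address */
        if ((he = gethostbyname(host)) == NULL) {
            fprintf(stderr, "cannot resolve %s\n", host);
            return 1;
        }

        /* create the socket */
        if ((s = socket(PF_INET, SOCK_STREAM, 0)) < 0) {
            perror("socket");
            return 1;
        }

        /* populate sockaddr_in with the remote IP and port 80 */
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(80);
        memcpy(&addr.sin_addr, he->h_addr_list[0], he->h_length);

        /* connect the socket to the far end */
        if (connect(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");
            return 1;
        }

        /* make and send the request string */
        n = snprintf(headers, sizeof(headers),
                     "GET /%s HTTP/1.0\r\nHost: %s\r\n\r\n", path, host);
        write(s, headers, n);

        /* read the response (headers, blank line, then the body) */
        while ((n = read(s, buffer, sizeof(buffer))) > 0)
            fwrite(buffer, 1, n, stdout);

        close(s);
        return 0;
    }

With HTTP/1.0 the server closes the connection after sending the body, so the read loop simply runs until read() returns 0. The output above still contains the response headers; to separate them from the payload, scan the accumulated data for the first blank line (the byte sequence "\r\n\r\n") and keep everything after it.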

#2


9  

I'd look at libcurl if you want SSL support or anything fancy.

However, if you just want to get a simple web page from port 80, then just open a TCP socket, send "GET /index.html HTTP/1.0\r\n\r\n", and parse the output.
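
For reference, a sketch of that request as it would look in code, assuming a socket descriptor s that is already connected; note that virtual-hosted servers will usually also want a Host: header even with HTTP/1.0:

    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>

    /* Send the minimal HTTP/1.0 request from answer #2 on an already
     * connected TCP socket.  Note the "\r\n" line endings and the
     * final blank line that terminates the header block. */
    static ssize_t send_index_request(int s)
    {
        const char *request = "GET /index.html HTTP/1.0\r\n\r\n";
        return write(s, request, strlen(request));
    }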
