Parsing HTML code to find fields

Date: 2021-07-30 07:46:29

I have this page: http://www.elseptimoarte.net/. The page has a search field; if I enter, for instance, "batman", it gives me some search results, each with its own URL: http://www.elseptimoarte.net/busquedas.html?cx=003284578463992023034%3Alraatm7pya0&cof=FORID%3A11&ie=ISO-8859-1&oe=ISO-8859-1&q=batman#978


I would like to parse the HTML code to get the URL of, for example, the first link: www.elseptimoarte.net/peliculas/batman-begins-1266.html


The problem is that I use curl (in bash), but when I do curl -L -s http://www.elseptimoarte.net/busquedas.html?cx=003284578463992023034%3Alraatm7pya0&cof=FORID%3A11&ie=ISO-8859-1&oe=ISO-8859-1&q=batman#978 it doesn't give me the link.
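
A likely culprit, incidentally, is shell quoting: an unquoted '&' tells bash to background the command, so curl only ever receives the URL up to the first ampersand. A minimal sketch using the URL from the question (the check at the end runs offline, without a network call):

```shell
# The URL from the question. The single quotes matter: unquoted, the shell
# treats each '&' as a background operator, so curl would only receive the
# URL up to the first ampersand.
url='http://www.elseptimoarte.net/busquedas.html?cx=003284578463992023034%3Alraatm7pya0&cof=FORID%3A11&ie=ISO-8859-1&oe=ISO-8859-1&q=batman#978'

# Pass it to curl as one quoted argument:
#   curl -L -s "$url"
# Offline sanity check that the query string survived intact:
case "$url" in
  *'q=batman'*) echo 'query string intact' ;;
esac
```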


Any help?

Many thanks, and sorry for my English!


6 solutions

#1


You don't get the link using cURL because the page uses JavaScript to fetch that data.


Using FireBug I found the real URL to be here - quite monstrous!


#2


This might not be exactly what you're looking for, but it gives me the same response as your example. Perhaps you can adjust it to suit your needs:


From bash, type:


$ wget -U 'Mozilla/5.0' -O - 'http://www.google.com/search?q=batman+site%3Awww.elseptimoarte.net' | sed 's/</\
</g' | sed -n '/href="http:\/\/www\.elseptimoarte\.net/p'

The line break before "</g" is intentional: it is the literal newline that sed inserts before each "<". Don't include the prompt ($). Someone more familiar with sed might do a better job than I do. You can replace the query string 'batman' and/or the duplicated site URL strings to suit your needs.

The following was my output:


<a href="http://www.elseptimoarte.net/peliculas/batman-begins-1266.html" class=l>
<a href="http://www.elseptimoarte.net/peliculas/batman:-the-dark-knight-30.html" class=l>El Caballero Oscuro (2008) - El Séptimo Arte
<a href="http://www.elseptimoarte.net/-batman-3--y-sus-rumores-4960.html" class=l>&#39;
<a href="http://www.elseptimoarte.net/esp--15-17-ago--batman-es-lider-y-triunfadora-aunque-no-bate-record-4285.html" class=l>(Esp. 15-17 Ago.) 
<a href="http://www.elseptimoarte.net/peliculas/batman-gotham-knight-1849.html" class=l>
<a href="http://www.elseptimoarte.net/cine-articulo541.html" class=l>Se ponen en marcha las secuelas de &#39;
<a href="http://www.elseptimoarte.net/trailers-de-buena-calidad-para--indiana--e--batman--3751.html" class=l>Tráilers en buena calidad de &#39;Indiana&#39; y &#39;
<a href="http://www.elseptimoarte.net/usa-8-10-ago--impresionante--batman-sigue-lider-por-4%C2%AA-semana-consecutiva-4245.html" class=l>(USA 8-10 Ago.) Impresionante. 
<a href="http://www.elseptimoarte.net/usa-25-27-jul--increible--batman-en-su-segunda-semana-logra-75-millones-4169.html" class=l>(USA 25-27 Jul.) Increíble. 
<a href="http://www.elseptimoarte.net/cine-articulo1498.html" class=l>¿Aparecerá Catwoman en &#39;
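
The two sed stages of that pipeline can also be exercised offline against a small HTML sample (a sketch; printf stands in for the wget fetch here):

```shell
# Same two sed stages as the command above: break the HTML so every tag
# starts a new line, then print only lines containing an elseptimoarte link.
printf '%s' '<div><a href="http://www.elseptimoarte.net/peliculas/batman-begins-1266.html" class=l>Batman Begins</a><a href="http://other.example/">other</a></div>' |
sed 's/</\
</g' | sed -n '/href="http:\/\/www\.elseptimoarte\.net/p'
# -> <a href="http://www.elseptimoarte.net/peliculas/batman-begins-1266.html" class=l>Batman Begins
```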

#3


I'll give you a more thorough command-line answer in a second, but in the mean time, have you considered using Yahoo Pipes? It's little more than a proof-of-concept now, but it has everything you need.


#4


Pepe,

Here's the command you can use to get what you want:


$ wget -U 'Mozilla/5.0' -O - 'http://www.google.com/search?q=batman+site%3Awww.elseptimoarte.net' | sed 's/</\
</g' | sed -n 's/<a href="\(http:\/\/www\.elseptimoarte\.net[^"]*\).*$/\1/gp' > myfile.txt

It's a slight alteration of the command above. It puts line breaks between the URLs, but it wouldn't be difficult to change it to produce your exact output.
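
The extraction stage can likewise be checked offline on a sample anchor tag (printf stands in for the wget fetch; the capture group keeps only what sits inside href="..."):

```shell
# The substitution keeps only the captured URL from inside href="..." and
# discards the rest of the line; tested on an inline sample, not live data.
printf '%s' '<a href="http://www.elseptimoarte.net/peliculas/batman-begins-1266.html" class=l>Batman Begins</a>' |
sed 's/</\
</g' | sed -n 's/<a href="\(http:\/\/www\.elseptimoarte\.net[^"]*\).*$/\1/gp'
# -> http://www.elseptimoarte.net/peliculas/batman-begins-1266.html
```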


#5


curl and wget share many uses. I'm sure people have their preferences, but I tend to go to wget first for crawling, as it has auto-following of links to a specified depth and tends to be a bit more versatile with common text web pages, while I use curl when I need a less-common protocol or I have to interact with form data.


You can use curl if you prefer it, though I think wget is better suited here. In the command above, just replace 'wget' with 'curl' and '-U' with '-A'. Omit '-O -' (I believe curl defaults to stdout; if not on your machine, use its appropriate flag) and leave everything else the same. You should get the same output.
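
As an illustration, that substitution can even be applied mechanically; the resulting curl command is printed rather than run, since actually running it would hit the network:

```shell
# The wget command from answer #4, and the three edits described above:
# 'wget' -> 'curl', '-U' -> '-A', and drop '-O -' (curl writes to stdout).
wget_cmd="wget -U 'Mozilla/5.0' -O - 'http://www.google.com/search?q=batman+site%3Awww.elseptimoarte.net'"
curl_cmd=$(printf '%s' "$wget_cmd" | sed -e 's/^wget /curl /' -e 's/ -U / -A /' -e 's/ -O - / /')
echo "$curl_cmd"
# -> curl -A 'Mozilla/5.0' 'http://www.google.com/search?q=batman+site%3Awww.elseptimoarte.net'
```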


#6


There is Watir for Java.


And if you are on .NET C#/VB you can use WatiN which is an awesome browser manipulation tool.


It is sort of a testing framework with tools to manipulate the browser DOM and poke around in it, but I believe you can also use those outside of a "testing" context.

