Parsing a web page's source code with Objective-C

Is there a way to parse a website's source on the iPhone to get the URLs of the photos on that page? If so, how would you do that?

Thanks

6 Answers

#1

I recommend regular expressions. There's a great open source Regex library for Cocoa called RegexKit. For the most part, you can just drop it in your code and it'll "just work".

Getting all the urls of images wouldn't be too difficult (less than 20 lines of code) if you assume that all images are going to be in <img> tags. You'd just grab all the image tags (something like: <img\s+[^>]+>), then iterate through those matches. For each match, you'd pull out whatever's in the src attribute: src\s*=\s*("|')?\s*([^\s"']+)(\s|"|')

You might need to tweak that a bit, but it shouldn't be too bad.
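
As a rough illustration of the two-step idea above (grab each <img> tag, then pull out its src value), here is a minimal sketch in plain C using the POSIX regex API, which compiles unchanged inside an Objective-C file. It is not RegexKit code. The function name extract_img_srcs, the MAX_URL_LEN limit, and the fixed-size output array are assumptions made for the example. POSIX has no \s, so [[:space:]] stands in for it, and the src pattern additionally requires whitespace before "src" so that attributes such as data-src are skipped.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

#define MAX_URL_LEN 256  /* assumed limit for this sketch */

/* Hypothetical helper: copies the src value of each <img> tag found in html
 * into urls[], up to max_urls entries. Returns the number of URLs stored,
 * or -1 if a pattern fails to compile. */
int extract_img_srcs(const char *html, char urls[][MAX_URL_LEN], int max_urls)
{
    assert(html != NULL && urls != NULL);

    regex_t tag_re, src_re;
    /* Step 1 pattern: a whole <img ...> tag. */
    if (regcomp(&tag_re, "<img[[:space:]]+[^>]+>",
                REG_EXTENDED | REG_ICASE) != 0)
        return -1;
    /* Step 2 pattern: the src attribute; group 1 is the value. */
    if (regcomp(&src_re,
                "[[:space:]]src[[:space:]]*=[[:space:]]*[\"']?[[:space:]]*"
                "([^[:space:]\"'>]+)",
                REG_EXTENDED | REG_ICASE) != 0) {
        regfree(&tag_re);
        return -1;
    }

    int count = 0;
    const char *cursor = html;
    regmatch_t tag_m[1], src_m[2];

    while (count < max_urls && regexec(&tag_re, cursor, 1, tag_m, 0) == 0) {
        /* Copy the tag so the src search cannot run past its closing '>'. */
        size_t tag_len = (size_t)(tag_m[0].rm_eo - tag_m[0].rm_so);
        char *tag = malloc(tag_len + 1);
        if (tag == NULL)
            break;
        memcpy(tag, cursor + tag_m[0].rm_so, tag_len);
        tag[tag_len] = '\0';

        if (regexec(&src_re, tag, 2, src_m, 0) == 0) {
            size_t len = (size_t)(src_m[1].rm_eo - src_m[1].rm_so);
            if (len >= MAX_URL_LEN)
                len = MAX_URL_LEN - 1;  /* truncate overly long values */
            memcpy(urls[count], tag + src_m[1].rm_so, len);
            urls[count][len] = '\0';
            count++;
        }
        free(tag);
        cursor += tag_m[0].rm_eo;  /* continue after this tag */
    }

    regfree(&tag_re);
    regfree(&src_re);
    return count;
}
```

Calling it with a buffer such as char urls[16][MAX_URL_LEN] and the page source as a C string fills the array with whatever appears in each src attribute. Relative URLs come back as written, so they would still need to be resolved against the page's base URL.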

#2

I'd say go for regular expressions. There is a one-page library that wraps C regexes, which you can drop into your project.

#3

There is no super easy way. When I had to do it, I wrote a libxml2 SAX parser. libxml2 has an HTML reader that works fairly well with malformed HTML, and libxml2 is included with the base system.

#4

You could try regular expressions, but I wouldn't recommend it. Have a look at NSXMLParser, assuming the web page is XHTML-compliant. TouchXML is another good library.

#5

Take a look at "Event Driven XML Parsing" in the iPhone reference library.

#6

Are you OK with whichever approach you use not picking up images loaded dynamically via JavaScript?

The closest thing I can see working is to parse out any JavaScript imports, load those too, and then run a regular expression across the whole text looking for anything that ends in .jpg, .gif, or .png, grabbing the full URL from each match. The libxml approach would miss references to images that are not in img tags, but it might well be good enough.
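
The "scan everything for image extensions" idea could be sketched roughly as below, again in plain C with POSIX regex. The function name find_image_urls and the IMAGE_URL_MAX limit are made up for the example. The sketch assumes only absolute http/https URLs are of interest, and that the page and any imported scripts have already been downloaded and concatenated into one string, which is not shown.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

#define IMAGE_URL_MAX 512  /* assumed limit for this sketch */

/* Hypothetical helper: scans arbitrary text (HTML, JavaScript, CSS) for
 * absolute URLs ending in .jpg, .gif or .png and copies them into urls[].
 * Returns the number of URLs stored, or -1 if the pattern fails to compile. */
int find_image_urls(const char *text, char urls[][IMAGE_URL_MAX], int max_urls)
{
    assert(text != NULL && urls != NULL);

    regex_t re;
    /* A URL here is "http(s)://" followed by characters that are not
     * whitespace, quotes, angle brackets or parentheses, ending in one
     * of the three extensions (case-insensitive). */
    if (regcomp(&re,
                "https?://[^[:space:]\"'<>()]+\\.(jpg|gif|png)",
                REG_EXTENDED | REG_ICASE) != 0)
        return -1;

    int count = 0;
    const char *cursor = text;
    regmatch_t m[1];

    while (count < max_urls && regexec(&re, cursor, 1, m, 0) == 0) {
        size_t len = (size_t)(m[0].rm_eo - m[0].rm_so);
        if (len >= IMAGE_URL_MAX)
            len = IMAGE_URL_MAX - 1;  /* truncate overly long matches */
        memcpy(urls[count], cursor + m[0].rm_so, len);
        urls[count][len] = '\0';
        count++;
        cursor += m[0].rm_eo;  /* continue after this match */
    }

    regfree(&re);
    return count;
}
```

This is deliberately loose: it does not check what follows the extension, it ignores relative paths, and it cannot see URLs that a script assembles at run time, so it is a heuristic rather than a complete answer.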
