Is there a way to parse a website's source on the iPhone to get the URLs of the photos on that page? If so, how would you do that?
Thanks
6 Solutions
#1
I recommend regular expressions. There's a great open source Regex library for Cocoa called RegexKit. For the most part, you can just drop it in your code and it'll "just work".
Getting all the image URLs wouldn't be too difficult (fewer than 20 lines of code) if you assume that all images are going to be in <img> tags. You'd just grab all the image tags (something like: <img\s+[^>]+>), then iterate through those matches. For each match, you'd pull out whatever's in the src attribute: src\s*=\s*("|')?\s*([^\s"']+)(\s|"|')
You might need to tweak that a bit, but it shouldn't be too bad.
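As a rough illustration of the two-step idea above, here is a minimal sketch. It is written in Python only to exercise the patterns; on the iPhone the same expressions would go through RegexKit or whichever regex library you use. The function name image_urls is made up for the example, and the src pattern is a tweaked version of the one above (the closing quote has to match the opening one, and unquoted values are allowed), not the original.

```python
import re

# Step 1: find whole <img ...> tags (pattern taken from the answer above).
IMG_TAG = re.compile(r"<img\s+[^>]+>", re.IGNORECASE)

# Step 2: pull the src value out of one tag. This is an adapted pattern:
# double-quoted, single-quoted, or unquoted value, and src must follow
# whitespace so that attributes like data-src are not picked up.
SRC_ATTR = re.compile(
    r"""\ssrc\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))""",
    re.IGNORECASE,
)

def image_urls(html):
    """Return the src values of all <img> tags found in html (hypothetical helper)."""
    urls = []
    for tag in IMG_TAG.findall(html):
        match = SRC_ATTR.search(tag)
        if match:
            # Exactly one of the three groups is set, depending on the quoting style.
            value = next(g for g in match.groups() if g is not None).strip()
            if value:
                urls.append(value)
    return urls
```

The values come back exactly as written in the page, so relative paths still have to be resolved against the page's URL.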
#2
I'd say go for regular expressions - there is a one-page library that wraps C regexes that you can drop into your project.
#3
There is no super easy way. When I had to do it, I wrote a libxml2 SAX parser. libxml2 has an HTML reader that works fairly well with malformed HTML, and libxml2 is included with the base system.
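To show the shape of the event-driven approach without reproducing the libxml2 C API, here is a small sketch that uses Python's standard html.parser as a stand-in. That substitution is an assumption made for illustration; it is not libxml2, but it is also callback-based and tolerant of sloppy markup. The names ImageCollector and image_urls_sax are invented for the example.

```python
from html.parser import HTMLParser

class ImageCollector(HTMLParser):
    """Collects src values from <img> start tags as the parser streams through the page."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.urls.append(value)

def image_urls_sax(html):
    """Feed the whole document to the collector and return what it found (hypothetical helper)."""
    collector = ImageCollector()
    collector.feed(html)
    collector.close()
    return collector.urls
```

The point of the callback style is that nothing has to be well-formed: unclosed tags and stray end tags simply produce events (or none), and the handler only reacts to the img start tags it cares about.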
#4
You could try it using regular expressions, but I wouldn't recommend that. You should have a look at NSXMLParser, assuming the webpage is coded to be XHTML compliant. TouchXML is another good library.
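The XHTML assumption matters because a strict XML parser gives up on ordinary tag-soup HTML. A minimal sketch of that behaviour, using Python's xml.sax as a stand-in for an event-driven XML parser such as NSXMLParser (the stand-in and the names ImgHandler and image_urls_strict are assumptions for illustration):

```python
import xml.sax

class ImgHandler(xml.sax.ContentHandler):
    """Records the src attribute of every img element the parser reports."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def startElement(self, name, attrs):
        if name == "img" and "src" in attrs:
            self.urls.append(attrs["src"])

def image_urls_strict(xhtml):
    """Return img src values, or None if the page is not well-formed XML (hypothetical helper)."""
    handler = ImgHandler()
    try:
        xml.sax.parseString(xhtml.encode("utf-8"), handler)
    except xml.sax.SAXParseException:
        # Unclosed tags, mismatched nesting, and similar errors end up here.
        return None
    return handler.urls
```

A page with a bare <img src="..."> that is never closed makes the parse fail outright, which is why this route only fits pages you know to be valid XHTML.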
#5
Take a look at Event Driven XML Parsing in the iPhone reference library.
#6
Are you OK with whatever approach you use not picking up images loaded dynamically via JavaScript?
The closest thing I could see working is to parse out any JavaScript imports, load those up too, and then use a regular expression across the whole file looking for anything that ends in ".jpg/.gif/.png" and grab the full URL out from that. The libxml approach would miss out on references to images not in img tags, but it might well be good enough.
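A rough sketch of the last step only (scanning text you have already fetched, whether page source or script files, for anything ending in .jpg, .gif, or .png). It is written in Python for illustration; the pattern, the function name image_refs, and the choice of delimiter characters are all assumptions and would need tuning against real pages.

```python
import re
from urllib.parse import urljoin

# A run of characters that cannot contain whitespace, quotes, parentheses, or
# angle brackets, ending in one of the three extensions named above.
IMAGE_REF = re.compile(r"""[^\s"'()<>]+\.(?:jpg|gif|png)\b""", re.IGNORECASE)

def image_refs(text, base_url):
    """Return absolute URLs for every image-looking reference in text (hypothetical helper)."""
    seen = set()
    urls = []
    for ref in IMAGE_REF.findall(text):
        # Resolve relative references against the page they were found on.
        url = urljoin(base_url, ref)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

This catches CSS backgrounds and string literals in scripts that a tag-based parser never sees, at the cost of false positives and of missing URLs that a script assembles at run time.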