How to scrape text and images from a random web page?

Time: 2021-12-27 08:09:42

I need a way to visually represent a random web page on the internet.


Let's say for example this web page.


Currently, these are the standard assets I can use:


  • Favicon: Too small, too abstract.

  • Title: Very specific, but poor visual aesthetics.

  • URL: Nobody cares to read it.

  • Icon: Too abstract.

  • Thumbnail: Hard to get, and too ugly (many elements crammed into a small space).

I need to visually represent a random website in a way that is very meaningful and inviting for others to click on it.


I need something like what Facebook does when you share a link:


It scrapes the link for images and then creates a beautiful, meaningful tile that is inviting to click on.

Is there any way I can scrape the images and text from websites? I'm primarily interested in an Objective-C/JavaScript combo, but anything will do and will be selected as the approved answer.

Edit: Re-wrote the post and changed the title.


3 Solutions

#1


6  

Websites will often provide meta information for user-friendly social media sharing, such as Open Graph protocol tags. In fact, in your own example, the reddit page has Open Graph tags which make up the information in the link preview (look for meta tags with og: properties).
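As a purely illustrative sketch, here is one way to read those tags once you have the page's DOM (for example in the browser console on the page itself, or after parsing fetched HTML with DOMParser); the helper name getOpenGraphData is mine, not part of any standard:

    // Collect every Open Graph property from a document into a plain object
    function getOpenGraphData(doc) {
      const data = {};
      doc.querySelectorAll('meta[property^="og:"]').forEach(function (tag) {
        data[tag.getAttribute('property')] = tag.getAttribute('content');
      });
      return data;
    }

    // e.g. { "og:title": "...", "og:image": "...", "og:description": "..." }
    console.log(getOpenGraphData(document));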

A fallback approach would be to implement site-specific parsing code for the most popular websites that don't already conform to a standardized format, or to try to generically guess what the most prominent content on a given website is (for example, the biggest image above the fold, the first few sentences of the first paragraph, text in heading elements, etc.); a rough sketch of the latter follows below.
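To make the guessing idea concrete, here is a deliberately naive sketch, assuming Node 18+ (for the global fetch) and the cheerio package; the function name and the thresholds are arbitrary choices for illustration:

    const cheerio = require('cheerio');

    async function guessPreview(url) {
      const html = await (await fetch(url)).text();
      const $ = cheerio.load(html);

      // Prefer an image with an explicit width of at least 200px, else take the first one
      const images = $('img').toArray();
      const big = images.find(function (img) {
        return Number($(img).attr('width')) >= 200;
      });
      const image = $(big || images[0]).attr('src');

      // First heading plus the opening of the first paragraph
      const heading = $('h1, h2').first().text().trim();
      const text = $('p').first().text().trim().slice(0, 200);

      return { heading: heading, text: text, image: image };
    }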

The problem with the former approach is that you have to maintain the parsers as those websites change and evolve; with the latter, you simply cannot reliably predict what's important on a page, and you can't expect to always find what you're looking for either (images for the thumbnail, for example).

Since you will never be able to generate meaningful previews for 100% of websites, it boils down to a simple question: what's an acceptable rate of successful link previews? If it's close to what you can get by parsing standard meta information, I'd stick with that and save myself a lot of headache. If not, as an alternative to the libraries shared above, you can also have a look at paid services/APIs, which will likely cover more use cases than you could on your own.

#2


2  

This is what the OpenGraph standard is for. For instance, if you go to the Reddit post in the example, you can view the page information provided by HTML <meta /> tags (all the ones with property names starting with 'og:'):
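The original answer showed a screenshot of those tags; as a stand-in, markup of this general shape is what you would see in the page source (the values here are invented placeholders, not the actual Reddit content):

    <meta property="og:title" content="Example post title" />
    <meta property="og:description" content="Short summary of the page" />
    <meta property="og:image" content="https://example.com/preview.jpg" />
    <meta property="og:url" content="https://example.com/post/123" />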


However, it is not possible for you to get the data from inside a web browser; CORS prevents the request to the URL. In fact, what Facebook seems to do is send the URL to their servers, have them perform a request to get the required information, and send it back.
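A minimal sketch of that server-side approach, assuming Node 18+ (built-in http module and global fetch); the /preview path, the url query parameter, and the regex-based extraction are simplifications chosen for the example, and a real implementation should use a proper HTML parser:

    const http = require('http');

    // GET /preview?url=... -> fetch the page server-side (no CORS restriction here)
    // and return whatever og: tags were found, as JSON.
    http.createServer(async function (req, res) {
      const target = new URL(req.url, 'http://localhost').searchParams.get('url');
      const html = await (await fetch(target)).text();

      const data = {};
      for (const match of html.matchAll(/<meta[^>]+property="(og:[^"]+)"[^>]+content="([^"]*)"/g)) {
        data[match[1]] = match[2];
      }

      res.setHeader('Content-Type', 'application/json');
      res.end(JSON.stringify(data));
    }).listen(3000);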

#3


1  

You can develop your own Link Preview plugin or use existing third party available plugins.


Posting a full example here is not possible, but I can give the URLs of popular Link Preview plugins, which may be free or paid.

You can check a demo with your URL here, which returns the response as JSON and raw data. You can also use the API.
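Purely for illustration, calling such a preview API from JavaScript might look like the sketch below; the endpoint, query parameter, and response field names are hypothetical placeholders, since the actual service isn't linked in this copy of the answer:

    // Hypothetical preview-API client; swap in the real endpoint and fields.
    async function fetchPreview(pageUrl) {
      const resp = await fetch('https://example.com/api/preview?url=' + encodeURIComponent(pageUrl));
      return resp.json(); // e.g. { title, description, image } -- assumed shape
    }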

Hope it helps.

