How can I see all the notes of a Tumblr post from Python?

Time: 2021-12-27 08:10:18

Say I look at the following Tumblr post: http://ronbarak.tumblr.com/post/40692813…
It (currently) has 292 notes.

I'd like to get all of the above notes using a Python script (e.g., via urllib2, BeautifulSoup, simplejson, or the Tumblr API). Some extensive Googling did not turn up anything related to extracting notes from Tumblr.

Can anyone point me in the right direction on which tool will enable me to do that?

4 solutions

#1


7  

Unfortunately, it looks like the Tumblr API has some limitations (lack of meta information about reblogs, notes limited to 50), so you can't get all the notes.
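
For reference, here is a minimal sketch of what that limited API access looks like in Python 3 (standard library only). It assumes the v2 posts endpoint with its notes_info parameter; YOUR_API_KEY is a placeholder for a key you get by registering an application with Tumblr, and the note fields printed are just examples.

import json
import urllib.request

API_KEY = "YOUR_API_KEY"  # placeholder: register an app with Tumblr to get one
url = ("https://api.tumblr.com/v2/blog/ronbarak.tumblr.com/posts"
       "?api_key=%s&id=40692813320&notes_info=true" % API_KEY)

resp = urllib.request.urlopen(url)
data = json.loads(resp.read().decode("utf-8"))

for post in data["response"]["posts"]:
    # Only a capped number of notes (around 50) comes back, not all 292.
    for note in post.get("notes", []):
        print(note.get("type"), note.get("blog_name"))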

According to the Terms of Service, page scraping is also forbidden.

"You may not do any of the following while accessing or using the Services: (...) scrape the Services, and particularly scrape Content (as defined below) from the Services, without Tumblr's express prior written consent;"

Source:

https://groups.google.com/forum/?fromgroups=#!topic/tumblr-api/ktfMIdJCOmc

#2


5  

Without JS, you get separate pages that contain only the notes. For the mentioned blog post, the first page would be:

http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy

Subsequent pages are linked at the bottom, e.g.:

(See my answer on how to find the next URL in the a element's onclick attribute.)

Now you could use various tools to download/parse the data.

The following wget command should download all notes pages for that post:

wget --recursive --domains=ronbarak.tumblr.com --include-directories=notes http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy
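
If you would rather stay in Python than shell out to wget, a rough equivalent is sketched below (untested against current Tumblr markup): fetch each notes page, print its notes, and follow the path embedded in the onclick attribute of the "more notes" link. The ol.notes selector and the onclick format are assumptions and may need adjusting.

import re
import time
import urllib.request

from bs4 import BeautifulSoup

base = "http://ronbarak.tumblr.com"
url = base + "/notes/40692813320/4Y70Zzacy"

while url:
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")

    # Each note is assumed to be an <li> inside the notes <ol>.
    for li in soup.select("ol.notes li"):
        print(li.get_text(" ", strip=True))

    # The next page's path is assumed to sit inside the onclick attribute
    # of a "more notes" anchor, e.g. onclick="...('/notes/40692813320/...')".
    more = soup.find("a", onclick=re.compile(r"/notes/"))
    match = re.search(r"(/notes/[^'\"]+)", more["onclick"]) if more else None
    url = base + match.group(1) if match else None

    time.sleep(1)  # be gentle with Tumblr's servers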

#3


3  

As Fabio implies, it is better to use the API.

If for whatever reason you cannot, then the tools you use will depend on what you want to do with the data in the posts.

  • for a data dump: urllib will return a string of the page you want
  • looking for a specific section in the html: lxml is pretty good
  • looking for something in unruly html: definitely beautifulsoup
  • looking for a specific item in a section: beautifulsoup, lxml, text parsing is what you need.
  • need to put the data in a database/file: use scrapy

The Tumblr URL scheme is simple: url/scheme/1, url/scheme/2, url/scheme/3, etc., until you get to the end of the posts and the server just does not return any data anymore.

So if you are going to brute-force your way through scraping, you can easily tell your script to dump all the data to your hard drive until, say, the contents tag is empty.

One last word of advice: please remember to put a small sleep(1000) in your script, because otherwise you could put some stress on Tumblr's servers.
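
A minimal sketch of that brute-force loop, assuming the blog paginates at /page/N and that the end can be detected by a page with no post markup (the article selector is theme-dependent and only an example):

import time
import urllib.request

from bs4 import BeautifulSoup

page = 1
while True:
    html = urllib.request.urlopen(
        "http://ronbarak.tumblr.com/page/%d" % page).read()

    # Dump the raw page to disk, as suggested above.
    with open("page_%03d.html" % page, "wb") as f:
        f.write(html)

    # Stop once the page no longer contains any posts; the tag/class that
    # marks a post depends on the blog's theme, so adjust the selector.
    soup = BeautifulSoup(html, "html.parser")
    if not soup.select("article"):
        break

    page += 1
    time.sleep(1)  # the recommended pause; note that time.sleep() takes seconds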

#4


0  

How to load all notes on tumblr? also covers the topic, but unor's answer (above) addresses it very well.
