NodeJS x-ray web-scraper: how to follow links and get content from a subpage

Time: 2021-04-20 20:33:52

I am trying to scrape some content with the Node.js x-ray scraping framework. I can get the content from a single page, but I can't work out how to follow links and get content from a subpage in one go.

There is a sample on the x-ray GitHub page, but it returns empty data if I change the code to point at some other site.

I have simplified my code and made it crawl Stack Overflow (SO) questions for this sample.

The following works fine:

var Xray = require('x-ray');
var x = Xray();

// Scrape a single question page, scoped to #content.
x('http://stackoverflow.com/questions/9202531/minimizing-nexpectation-for-a-custom-distribution-in-mathematica', '#content', [{
  title: '#question-header h1',
  question: '.question .post-text'
}])(function(err, obj) {
  console.log(err);
  console.log(obj);
});

This also works:

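var Xray = require('x-ray');
var x = Xray();

// For each question summary on the listing page, follow the title link
// and pull the question text from the linked page.
x('http://stackoverflow.com/questions', '#questions .question-summary .summary', [{
  title: 'h3',
  question: x('h3 a@href', '#content .question .post-text')
}])(function(err, obj) {
  console.log(err);
  console.log(obj);
});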

But this gives me an empty details result, and I can't figure out what is wrong:

var Xray = require('x-ray');
var x = Xray();

x('http://stackoverflow.com/questions', '#questions .question-summary .summary', [{
  title: 'h3',
  link: 'h3 a@href',
  // This nested call is the part that comes back empty.
  details: x('h3 a@href', '#content', [{
    title: 'h1',
    question: '.question .post-text'
  }])
}])(function(err, obj) {
  console.log(err);
  console.log(obj);
});

I would like my spider to crawl the page that lists the questions, then follow the link to each question and retrieve additional information.
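
When a nested crawl silently returns nothing, it can help to list which items came back without usable details before changing selectors. The following is a minimal sketch, not part of x-ray: the helper names isEmpty and linksWithEmptyDetails are hypothetical, and it assumes each result item has the link and details fields used in the question's code.

```javascript
// Hypothetical helpers, not part of x-ray.
// A value counts as "empty" if it is null/undefined/'' or if it is an
// array or object whose members are all empty by the same rule.
function isEmpty(value) {
  if (value == null || value === '') return true;
  if (Array.isArray(value)) return value.every(isEmpty);
  if (typeof value === 'object') {
    return Object.keys(value).every(function(key) {
      return isEmpty(value[key]);
    });
  }
  return false;
}

// Return the `link` of every scraped item whose `details` is empty.
// Assumes items shaped like { title, link, details } as in the question.
function linksWithEmptyDetails(items) {
  return items
    .filter(function(item) { return isEmpty(item.details); })
    .map(function(item) { return item.link; });
}
```

Inside the x-ray callback, something like console.log(linksWithEmptyDetails(obj)) would then show which links produced no subpage data.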

2 answers

#1


5  

With some help I figured out what the problem was. I am posting this answer in case somebody else has the same problem.

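Working example (the nested x() call takes only the link selector and a plain object, with no '#content' scope and no array wrapper):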

var Xray = require('x-ray');
var x = Xray();

x('http://stackoverflow.com/questions', '#questions .question-summary .summary', [{
  title: 'h3',
  link: 'h3 a@href',
  // Follow the link and map the subpage onto a plain object.
  details: x('h3 a@href', {
    title: 'h1',
    question: '.question .post-text'
  })
}])(function(err, obj) {
  console.log(err);
  console.log(obj);
});
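
The same pattern can be wrapped in a small function so the crawl is reusable and its result is delivered as a Promise. This is only a sketch under stated assumptions: scrapeQuestions is a hypothetical name, the x-ray instance is passed in as a parameter rather than required inside the block, and the selectors are copied from the working example above.

```javascript
// Hypothetical wrapper around the working example.
// `x` is assumed to be an x-ray instance, e.g. require('x-ray')().
function scrapeQuestions(x, listUrl) {
  var crawl = x(listUrl, '#questions .question-summary .summary', [{
    title: 'h3',
    link: 'h3 a@href',
    // Follow the link; pass a plain object, no scope and no array wrapper.
    details: x('h3 a@href', {
      title: 'h1',
      question: '.question .post-text'
    })
  }]);

  // x-ray hands back a function that accepts a Node-style callback;
  // adapt it to a Promise.
  return new Promise(function(resolve, reject) {
    crawl(function(err, items) {
      if (err) return reject(err);
      resolve(items);
    });
  });
}
```

With a real instance this would be called as scrapeQuestions(Xray(), 'http://stackoverflow.com/questions').then(console.log). Passing the instance in also makes it possible to substitute a stub when checking the selector layout without touching the network.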

#2


1  

Version 2.0.2 does work. There is an open issue on GitHub to follow: https://github.com/lapwinglabs/x-ray/issues/189
