I am trying to scrape some content with the Node.js x-ray scraping framework. I can get the content from a single page, but I can't work out how to follow links and get content from a subpage in one go.
There is a sample on the x-ray GitHub page, but it returns empty data if I change the code to point at another site.
I have simplified my code for this example so that it crawls Stack Overflow questions.
The following works fine:
    var Xray = require('x-ray');
    var x = Xray();
    x('http://*.com/questions/9202531/minimizing-nexpectation-for-a-custom-distribution-in-mathematica', '#content', [{
      title: '#question-header h1',
      question: '.question .post-text'
    }])
    (function(err, obj) {
      console.log(err);
      console.log(obj);
    })
This also works:
    var Xray = require('x-ray');
    var x = Xray();
    x('http://*.com/questions', '#questions .question-summary .summary', [{
      title: 'h3',
      question: x('h3 a@href', '#content .question .post-text'),
    }])
    (function(err, obj) {
      console.log(err);
      console.log(obj);
    })
But this gives me an empty details result, and I can't figure out what is wrong:
    var Xray = require('x-ray');
    var x = Xray();
    x('http://*.com/questions', '#questions .question-summary .summary', [{
      title: 'h3',
      link: 'h3 a@href',
      details: x('h3 a@href', '#content', [{
        title: 'h1',
        question: '.question .post-text',
      }])
    }])
    (function(err, obj) {
      console.log(err);
      console.log(obj);
    })
I would like my spider to crawl the page with listed questions and then follow the link to each question and retrieve additional information.
2 Solutions
#1
5
With some help I figured out what the problem was. I am posting this answer in case somebody else has the same problem.
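Working example (compared with the failing code, the nested call drops the '#content' scope and the array wrapper and passes a plain object as the selector):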
    var Xray = require('x-ray');
    var x = Xray();
    x('http://*.com/questions', '#questions .question-summary .summary', [{
      title: 'h3',
      link: 'h3 a@href',
      details: x('h3 a@href', {
        title: 'h1',
        question: '.question .post-text',
      })
    }])
    (function(err, obj) {
      console.log(err);
      console.log(obj);
    })
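Because the original symptom was a details field that silently came back empty, it may help to check the result before using it. The following is only a sketch in plain JavaScript. `isEmpty` and `findEmptyDetails` are hypothetical helpers, not part of x-ray, and the sample data is made up to mirror the shape of `obj` in the callback above.

```javascript
// Hypothetical helpers (NOT part of x-ray): report scraped items whose
// nested `details` value is missing or contains no text at all.
function isEmpty(value) {
  if (value === undefined || value === null) return true;
  if (typeof value === 'string') return value.trim() === '';
  // An empty array, or an array of empty entries, counts as empty.
  if (Array.isArray(value)) return value.every(isEmpty);
  // An object counts as empty when every one of its fields is empty.
  if (typeof value === 'object') return Object.values(value).every(isEmpty);
  return false;
}

function findEmptyDetails(results) {
  return results.filter(function (item) {
    return isEmpty(item.details);
  });
}

// Made-up sample data shaped like the `obj` passed to the callback.
var sample = [
  { title: 'First question', link: 'http://example.com/q/1',
    details: { title: 'First question', question: 'Some body text' } },
  { title: 'Second question', link: 'http://example.com/q/2', details: [] },
  { title: 'Third question', link: 'http://example.com/q/3' }
];

var broken = findEmptyDetails(sample);
console.log(broken.map(function (item) { return item.title; }));
```

Inside the x-ray callback, `findEmptyDetails(obj)` could be called the same way to list the questions whose subpage selectors matched nothing.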
#2
1
Version 2.0.2 does work. There is an open issue on GitHub that you can follow: https://github.com/lapwinglabs/x-ray/issues/189