Grab elements containing text with Nokogiri XPath

Time: 2021-08-17 22:20:23

I'm still learning how to use Nokogiri, and so far I can grab elements by CSS. There is a page I want to scrape, http://www.bbc.co.uk/sport/football/results, and I want to get all the results for the Barclays Premier League. Those can be rendered on their own via an Ajax call, but from what I have read that is not possible with Nokogiri.

The link I have provided has many results for all the different leagues, so can I grab only the ones titled Barclays Premier League, which is contained in

class="competition-title"

So far I can grab all results like so:

require 'open-uri'
require 'nokogiri'

def get_results # Get me all results
  doc = Nokogiri::HTML(URI.open(RESULTS_URL))
  doc.css('#results-data h2').each do |h2_tag|
    date = Date.parse(h2_tag.text.strip)
    # The element right after each date heading holds that day's matches
    matches = h2_tag.xpath('following-sibling::*[1]').css('tr.report')
    matches.each do |match|
      home_team = match.css('.team-home').text.strip
      away_team = match.css('.team-away').text.strip
      score = match.css('.score').text.strip
      Result.create!(home_team: home_team, away_team: away_team, score: score, fixture_date: date)
    end
  end
end

Any help appreciated

Edit

OK, so it seems I can use some Ruby here, using select? I'm not sure how to implement it, though. Example below:

.select{|th|th.text =~ /Barclays Premier League/}

Or, some more reading says XPath can be used:

matches = h2_tag.xpath('//th[contains(text(), "Barclays Premier League")]').css('tr.report')

or

matches = h2_tag.xpath('//b/a[contains(text(),"Barclays")]/../following-sibling::*[1]').css('tr.report')

I have tried the XPath way, but it is clearly wrong, as nothing is being saved.

Thanks

2 Solutions

#1


2  

I prefer an approach where you drill down to precisely what you need. Looking at the source, you need the match details:

    <td class='match-details'>
        <p>
            <span class='team-home teams'><a href='...'>Brechin</a></span>
            <span class='score'><abbr title='Score'> 0-2 </abbr></span>
            <span class='team-away teams'><a href='...'>Alloa</a></span>
        </p>
    </td>

You need the three text content items within the p element. You need this for only "Barclays Premier League".

Viewing the source, notice that the elements you need above happen to be in a table that contains scores only for that league. How convenient! The table can be identified by a <th> element containing "Barclays Premier League". All you then have to do is identify that table using XPath:

matches = doc.xpath('//table[.//th[contains(., "Barclays Premier League")]]//td/p')

The td/p is sufficient because the match-details cell is the only one containing a p, but you can add the class to the td if you want.

Then you grab your information exactly the way you have done it:

matches.each do |match|
  home_team = match.css('.team-home').text.strip
  away_team = match.css('.team-away').text.strip
  score = match.css('.score').text.strip
  ...
end

One remaining task: getting the date of each match. Looking back at the source, you can follow back up to the first containing table, and see that the first preceding h2 node has it. You can express this in XPath:

date = match.at_xpath('ancestor::table[1]/preceding-sibling::h2[1]').text

Putting it all together

require 'open-uri'
require 'nokogiri'

def get_results
  doc = Nokogiri::HTML(URI.open(RESULTS_URL))
  # Every <p> inside a table whose header mentions the league
  matches = doc.xpath('//table[.//th[contains(., "Barclays Premier League")]]//td/p')
  matches.each do |match|
    home_team = match.css('.team-home').text.strip
    away_team = match.css('.team-away').text.strip
    score = match.css('.score').text.strip
    # The date lives in the nearest h2 before the containing table
    date = Date.parse(match.at_xpath('ancestor::table[1]/preceding-sibling::h2[1]').text)
    Result.create!(home_team: home_team, away_team: away_team, score: score, fixture_date: date)
  end
end

#2


1  

Just for fun, here's how I would transform @Mark Thomas's solution:

require 'open-uri'
require 'nokogiri'

def get_results
  doc = Nokogiri::HTML(URI.open(RESULTS_URL))
  doc.search('h2.table-header').each do |h2|
    date = Date.parse(h2.text)
    # Skip headings whose adjacent table belongs to another competition
    next unless h2.at('+ table th[2]').text['Barclays Premier League']
    h2.search('+ table tbody tr').each do |tr|
      home_team = tr.at('.team-home').text.strip
      away_team = tr.at('.team-away').text.strip
      score = tr.at('.score').text.strip
      Result.create!(home_team: home_team, away_team: away_team, score: score, fixture_date: date)
    end
  end
end

By iterating over those h2s first, you get:

Pros:

  • pulling the date outside of a loop

  • much simpler expressions (you might not be too worried about those but think about the guy who comes along after you.)

Cons:

  • a few extra bytes of code
