DOM traversal with Cheerio - how to get all elements with their corresponding text

Time: 2022-10-28 20:35:49

I'm using Cheerio, a jQuery-like library for Node on the server side that parses HTML text and lets you traverse it just as you would with jQuery. I need the plain text of the HTML body, and for each piece of text I also need the element it came from and that element's ordinal number. For example, if the plain text was found in the third paragraph element, I would have something like:

{
    text: <element plaintext>,
    element: "p-3"
}

I currently have the following function that attempts to do this:

var plaintext_elements = traverse_tree($('body'));    

function traverse_tree(root, found_elements = {}, return_array = []) {
    if (root.children().length) {
        //root has children, call traverse_tree on that subtree
        traverse_tree(root.children().first(), found_elements, return_array);
    }
    root.nextAll().each(function(i, elem) {
        if ($(elem).children().length) {
            //if the element has children call traverse_tree on the element's first child
            traverse_tree($(elem).children().first(), found_elements, return_array)
        }
        else {
            if (!found_elements[$(elem)[0].name]) {
                found_elements[$(elem)[0].name] = 1;
            }
            else {
                found_elements[$(elem)[0].name]++
            }
            if ($(elem).text() !== '') {
                return_array.push({
                    text: $(elem).text(),
                    element: $(elem)[0].name + '-' + found_elements[$(elem)[0].name]
                })
            }
        }
    })


    if (root[0].name == 'body') {
        return return_array;
    }

}

Am I going in the right direction, or should I attempt something else? Any help would be appreciated. Again, this is not jQuery but Cheerio on the server side (they are very similar, however).

1 solution

#1


I think much of the manual traversal is unnecessary if you use the * CSS selector, since $('body *') matches every descendant element of body in document order:

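const cheerio = require('cheerio')

// Collect every leaf element under <body> that has text, labelled "<tag>-<n>"
function textElements($){
  const found = {}  // per-tag counter, e.g. { p: 3, span: 1 }
  return $('body *').map(function(){
    // Skip elements that contain child elements or have no text;
    // returning undefined drops the element from the mapped result
    if ( $(this).children().length || $(this).text() === '' ) return
    found[this.name] = found[this.name] ? 1 + found[this.name] : 1
    return {
      text: $(this).text(),
      element: `${this.name}-${found[this.name]}`,
    }
  }).get()
}

// html is the markup string to parse
textElements(cheerio.load(html))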
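
To make the numbering behaviour easier to see without Cheerio installed, here is a minimal sketch of the same idea on a plain-object tree. The tree shape ({ name, text, children }) and the labelLeaves helper are hypothetical stand-ins for illustration, not Cheerio's node type or API. The sketch visits nodes depth-first in document order, skips anything that has children or empty text, and keeps one counter per tag name.

```javascript
// Hypothetical plain-object tree standing in for parsed HTML.
// This is NOT Cheerio's node structure; it only illustrates the logic.
const body = {
  name: 'body',
  children: [
    { name: 'p', text: 'first', children: [] },
    {
      name: 'div',
      children: [
        { name: 'p', text: 'second', children: [] },
        { name: 'span', text: '', children: [] }, // empty leaf, skipped
      ],
    },
    { name: 'p', text: 'third', children: [] },
  ],
};

// Label every non-empty leaf below `root` as "<tag>-<n>", visiting
// descendants depth-first in document order.
function labelLeaves(root) {
  const counts = {}; // per-tag counter
  const out = [];
  const visit = (node) => {
    for (const child of node.children) {
      if (child.children.length > 0) {
        visit(child); // not a leaf: descend, but do not label or count it
      } else if (child.text) {
        counts[child.name] = (counts[child.name] || 0) + 1;
        out.push({
          text: child.text,
          element: `${child.name}-${counts[child.name]}`,
        });
      }
    }
  };
  visit(root);
  return out;
}

const labelled = labelLeaves(body);
console.log(labelled);
```

Note that with this scheme the counter only advances for leaves that actually carry text, so "p-3" means the third text-bearing leaf p, which is not necessarily the third p in the document. If the position among all p elements is what you need, the counter would have to be incremented before the skip check instead.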
