Getting the text of a DOM node with Puppeteer and headless Chrome

Date: 2022-12-20 18:45:47

I'm trying to use headless Chrome and Puppeteer to run our JavaScript tests, but I can't extract the results from the page. Based on this answer, it looks like I should use page.evaluate(). That section even has an example that looks like what I need.

const bodyHandle = await page.$('body');
const html = await page.evaluate(body => body.innerHTML, bodyHandle);
await bodyHandle.dispose();

As a full example, I tried to convert that to a script that will extract my name from my user profile on Stack Overflow. Our project is using Node 6, so I converted the await expressions to use .then().

const puppeteer = require('puppeteer');

puppeteer.launch().then(function(browser) {
    browser.newPage().then(function(page) {
        page.goto('https://*.com/users/4794').then(function() {
            page.$('h2.user-card-name').then(function(heading_handle) {
                page.evaluate(function(heading) {
                    return heading.innerText;
                }, heading_handle).then(function(result) {
                    console.info(result);
                    browser.close();
                }, function(error) {
                    console.error(error);
                    browser.close();
                });
            });
        });
    });
});

When I run that, I get this error:

$ node get_user.js 
TypeError: Converting circular structure to JSON
    at Object.stringify (native)
    at args.map.x (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/helper.js:30:43)
    at Array.map (native)
    at Function.evaluationString (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/helper.js:30:29)
    at Frame.<anonymous> (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:376:31)
    at next (native)
    at step (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:355:24)
    at Promise (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:373:12)
    at fn (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:351:10)
    at Frame._rawEvaluate (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:375:3)

The problem seems to be with serializing the input parameter to page.evaluate(). I can pass in strings and numbers, but not element handles. Is the example wrong, or is it a problem with Node 6? How can I extract the text of a DOM node?

3 answers

#1


8  

I found three solutions to this problem, depending on how complicated your extraction is. The simplest option is a related function that I hadn't noticed: page.$eval(). It basically does what I was trying to do: combines page.$() and page.evaluate(). Here's an example that works:

const puppeteer = require('puppeteer');

puppeteer.launch().then(function(browser) {
    browser.newPage().then(function(page) {
        page.goto('https://*.com/users/4794').then(function() {
            page.$eval('h2.user-card-name', function(heading) {
                return heading.innerText;
            }).then(function(result) {
                console.info(result);
                browser.close();
            });
        });
    });
});

That gives me the expected result:

$ node get_user.js 
Don Kirkby top 2% overall

I wanted to extract something more complicated, but I finally realized that the evaluation function is running in the context of the page. That means you can use any tools that are loaded in the page, and then just send strings and numbers back and forth. In this example, I use jQuery in a string to extract what I want:

const puppeteer = require('puppeteer');

puppeteer.launch().then(function(browser) {
    browser.newPage().then(function(page) {
        page.goto('https://*.com/users/4794').then(function() {
            page.evaluate("$('h2.user-card-name').text()").then(function(result) {
                console.info(result);
                browser.close();
            });
        });
    });
});

That gives me a result with the whitespace intact:

$ node get_user.js 

                            Don Kirkby

                                top 2% overall

In my real script, I want to extract the text of several nodes, so I need a function instead of a simple string:

const puppeteer = require('puppeteer');

puppeteer.launch().then(function(browser) {
    browser.newPage().then(function(page) {
        page.goto('https://*.com/users/4794').then(function() {
            page.evaluate(function() {
                return $('h2.user-card-name').text();
            }).then(function(result) {
                console.info(result);
                browser.close();
            });
        });
    });
});

That gives the exact same result. Now I need to add error handling, and maybe reduce the indentation levels.
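
One way to do both, as a sketch rather than a definitive version: return each promise so the chain stays flat, and use a single rejection handler at the end. The helper name getTexts and its parameters are made up for this illustration. It takes the puppeteer module as an argument, uses document.querySelectorAll instead of jQuery so it doesn't depend on what the page happens to load, and passes the selector into the page as a plain string.

```javascript
// Hypothetical helper: resolves to the trimmed text of every node matching
// `selector` on the page at `url`. `puppeteer` is the required module.
function getTexts(puppeteer, url, selector) {
    var browser;
    return puppeteer.launch().then(function(b) {
        browser = b;
        return browser.newPage();
    }).then(function(page) {
        return page.goto(url).then(function() {
            // This function runs inside the page; only the selector string
            // goes in and only an array of strings comes back.
            return page.evaluate(function(sel) {
                return Array.from(document.querySelectorAll(sel)).map(function(node) {
                    return node.textContent.trim();
                });
            }, selector);
        });
    }).then(function(texts) {
        // Success: close the browser, then hand the texts to the caller.
        return browser.close().then(function() { return texts; });
    }, function(error) {
        // Failure: close the browser if it was launched, then rethrow.
        var closed = browser ? browser.close() : Promise.resolve();
        return closed.then(function() { throw error; });
    });
}
```

It would be called as getTexts(require('puppeteer'), url, 'h2.user-card-name').then(console.info, console.error), so the caller decides how to report a failure.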

#2


2  

Using await/async and $eval, the syntax looks like the following:

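await page.goto('https://*.com/users/4794')
// $eval passes the matched DOM element to the callback. DOM elements have
// no text() method (that is jQuery), so read innerText instead.
const name = await page.$eval('h2.user-card-name', el => el.innerText)
console.log(name)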
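
Those lines only run inside an async function, which needs Node 7.6 or later rather than the Node 6 mentioned in the question. A minimal sketch of a complete wrapper follows. The function name getName and its parameters are hypothetical, the puppeteer module is passed in as an argument, and it reads innerText the way the first answer does.

```javascript
// Hypothetical wrapper: resolves to the text of the first
// h2.user-card-name element on the page at `url`.
async function getName(puppeteer, url) {
    const browser = await puppeteer.launch();
    try {
        const page = await browser.newPage();
        await page.goto(url);
        // The callback runs in the page and returns a plain string.
        return await page.$eval('h2.user-card-name', el => el.innerText);
    } finally {
        // Runs whether the extraction succeeded or threw.
        await browser.close();
    }
}
```

The try/finally closes the browser even when navigation or the selector lookup fails, so a failed run does not leave a Chrome process behind.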

#3


0  

I had success using the following:

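const browser = await puppeteer.launch();
try {
  const page = await browser.newPage();
  await page.goto(url);
  // Wait for the element itself rather than sleeping for a fixed 2 seconds;
  // waitForSelector resolves to a handle for the matched element.
  const handle = await page.waitForSelector('.element-class-name');
  const html_content = await page.evaluate(el => el.innerHTML, handle);
  console.log(html_content);
} catch (err) {
  console.error(err);
} finally {
  // Close the browser whether or not the extraction succeeded.
  await browser.close();
}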

Hope it helps.
