How to determine the size (in pixels) of an HTML table, given an HTML file

Time: 2021-02-04 02:48:08

I have an HTML file that has various HTML tags in it, including a bunch of tables. I am processing this file using Python. How do I find out the size (length x width in pixels) of each table when it is rendered by a browser (preferably Chrome or Firefox)?

I am essentially looking for the information you see when you do "inspect element" in a browser, which shows the size of the various elements. I want to access this size in my Python code.

I am using lxml to parse my html and can use selenium if needed.

edit: added #node.js in case I can use it to spit out the size of all the tables from a shell script and then grab it in Python.

2 answers

#1


1  

You're going to want to use Selenium WebDriver to open the HTML file in an actual browser installed on the computer that your Python code is running on.

I'm not sure how you'd use the Selenium WebDriver API to find out how tall a rendered table is, but the value_of_css_property method might do it.
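For what it's worth, here is a minimal sketch of the Selenium route, assuming Selenium 4, a recent Chrome, and a matching driver on the machine. In the Python bindings each WebElement has a `size` property (a dict with `width` and `height`) that reports the rendered size, which is probably closer to what "inspect element" shows than `value_of_css_property`, since the latter returns CSS values as strings. The helper names `file_url` and `table_sizes`, the window size, and the `page.html` path are made up for illustration.

```python
from pathlib import Path


def file_url(path):
    """Turn a local path into a file:// URL that a browser can open."""
    return Path(path).resolve().as_uri()


def table_sizes(html_path, window=(1280, 800)):
    """Return [(width, height), ...] in pixels for every <table> in the file.

    Assumes Selenium 4 and a local Chrome install; rendered sizes depend on
    the window size, so it is fixed here to keep results repeatable.
    """
    # Imported here so the module loads even where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # no visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.set_window_size(*window)
        driver.get(file_url(html_path))
        tables = driver.find_elements(By.TAG_NAME, "table")
        # WebElement.size is a dict like {"width": 300, "height": 120}
        return [(t.size["width"], t.size["height"]) for t in tables]
    finally:
        driver.quit()


if __name__ == "__main__":
    for i, (w, h) in enumerate(table_sizes("page.html")):  # hypothetical file
        print(f"table {i}: {w} x {h} px")
```

The numbers depend on the browser, its default styles, and the window width, so they should only be expected to match "inspect element" when those are the same.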

#2


0  

If you can shell out to a script, and you can use Node.js, I'm assuming you could also install and use PhantomJS, which is a headless WebKit port. (I.e. an actual honest-to-goodness WebKit renderer that just doesn't require a window to work.) This will let you use JavaScript and the familiar web libraries to manipulate the document. As an example, the following gets you the width of the logo element at the upper left of the Stack Overflow site:

var page = require('webpage').create(); // create a new "browser"

page.open('http://*.com/', function(status) {
  // callback when loading completes; status is 'success' or 'fail'
  if (status !== 'success') {
    console.log('failed to load page');
    phantom.exit(1);
    return;
  }

  var logoWidth = page.evaluate(function() {
    // This runs in the rendered page and uses the version of jQuery that SO loads.
    return $('#hlogo').width();
  });

  console.log(logoWidth); // prints 250, the same as Chrome.

  phantom.exit(); // the script keeps running until you exit explicitly
});

The documentation for PhantomJS will tell you more about what you can do with it and how.
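To get the numbers back into Python, as the question's edit suggests, one option is to have the PhantomJS script print JSON and read it through `subprocess`. A rough sketch, assuming `phantomjs` is on the PATH and that a script named `table_sizes.js` (hypothetical, not shown here) prints a JSON array such as `[{"width": 300, "height": 120}]` for the file passed as its first argument:

```python
import json
import subprocess


def parse_sizes(stdout_text):
    """Convert the script's JSON output into a list of (width, height) tuples."""
    return [(item["width"], item["height"]) for item in json.loads(stdout_text)]


def phantom_table_sizes(html_path, script="table_sizes.js"):
    """Run the (hypothetical) PhantomJS script and return the sizes it reports."""
    result = subprocess.run(
        ["phantomjs", script, html_path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if PhantomJS exits non-zero
    )
    return parse_sizes(result.stdout)


if __name__ == "__main__":
    print(phantom_table_sizes("page.html"))  # hypothetical input file
```

Printing JSON rather than bare numbers keeps the Python side simple when a file contains more than one table.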

One caveat, however, is that loading a page takes a while, since it needs to fetch CSS and scripts and generally do everything a browser does. I'm not sure whether or how PhantomJS does any caching; if it does, it might make sense to reuse the same process for multiple scrapes of the same site.
