How do I convert an HTML table to CSV?

Date: 2022-02-03 07:25:04

How do I convert the contents of an HTML table (<table>) to CSV format? Is there a library or Linux program that does this? This is similar to copying tables in Internet Explorer and pasting them into Excel.


15 solutions

#1


37  

This method is not really a library OR a program, but for ad hoc conversions you can


  • put the HTML for a table in a text file called something.xls
  • open it with a spreadsheet
  • save it as CSV.

I know this works with Excel, and I believe I've done it with the OpenOffice spreadsheet.


But you probably would prefer a Perl or Ruby script...


#2


16  

Sorry for resurrecting an ancient thread, but I recently wanted to do this and wanted a 100% portable bash script to do it. So here's my solution using only grep and sed.

The below was bashed out very quickly and could be made much more elegant, but I'm really just getting started with sed/awk etc...

curl "http://www.webpagewithtableinit.com/" 2>/dev/null | grep -i -e '</\?TABLE\|</\?TD\|</\?TR\|</\?TH' | sed 's/^[\ \t]*//g' | tr -d '\n' | sed 's/<\/TR[^>]*>/\n/Ig'  | sed 's/<\/\?\(TABLE\|TR\)[^>]*>//Ig' | sed 's/^<T[DH][^>]*>\|<\/\?T[DH][^>]*>$//Ig' | sed 's/<\/T[DH][^>]*><T[DH][^>]*>/,/Ig'

As you can see I've got the page source using curl, but you could just as easily feed in the table source from elsewhere.


Here's the explanation:


Get the Contents of the URL using cURL, dump stderr to null (no progress meter)


curl "http://www.webpagewithtableinit.com/" 2>/dev/null 

.

I only want Table elements (return only lines with TABLE,TR,TH,TD tags)


| grep -i -e '</\?TABLE\|</\?TD\|</\?TR\|</\?TH'

.

Remove any Whitespace at the beginning of the line.


| sed 's/^[\ \t]*//g' 

.

Remove newlines


| tr -d '\n\r' 

.

Replace </TR> with newline


| sed 's/<\/TR[^>]*>/\n/Ig'  

.

Remove TABLE and TR tags


| sed 's/<\/\?\(TABLE\|TR\)[^>]*>//Ig' 

.

Remove ^<TD>, ^<TH>, </TD>$, </TH>$


| sed 's/^<T[DH][^>]*>\|<\/\?T[DH][^>]*>$//Ig' 

.

Replace </TD><TD> with comma


| sed 's/<\/T[DH][^>]*><T[DH][^>]*>/,/Ig'

.

Note that if any of the table cells contain commas, you may need to escape them first, or use a different delimiter.


Hope this helps someone!


#3


15  

Here's a ruby script that uses nokogiri -- http://nokogiri.rubyforge.org/nokogiri/


require 'nokogiri'

doc = Nokogiri::HTML(table_string)

doc.xpath('//table//tr').each do |row|
  row.xpath('td').each do |cell|
    print '"', cell.text.gsub("\n", ' ').gsub('"', '\"').gsub(/(\s){2,}/m, '\1'), "\", "
  end
  print "\n"
end

Worked for my basic test case.


#4


6  

I'm not sure if there is a pre-made library for this, but if you're willing to get your hands dirty with a little Perl, you could likely do something with Text::CSV and HTML::Parser.
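
For illustration, a minimal sketch of that idea (my own rough code, not from the original answer): HTML::Parser fires start/end/text events, and Text::CSV takes care of quoting each row. It assumes one simple table read from STDIN and ignores nested tables, colspan and rowspan.

use strict;
use warnings;
use HTML::Parser;
use Text::CSV;

my $html = do { local $/; <STDIN> };   # slurp the whole page from STDIN

my $csv = Text::CSV->new({ binary => 1, eol => "\n" });
my (@row, $in_cell, $cell);

my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ($tag) = @_;
        if ($tag eq 'td' or $tag eq 'th') { $in_cell = 1; $cell = ''; }
    }, 'tagname' ],
    end_h => [ sub {
        my ($tag) = @_;
        if ($tag eq 'td' or $tag eq 'th') { push @row, $cell; $in_cell = 0; }
        elsif ($tag eq 'tr' and @row)     { $csv->print(\*STDOUT, \@row); @row = (); }
    }, 'tagname' ],
    text_h => [ sub { $cell .= $_[0] if $in_cell }, 'dtext' ],   # dtext = entity-decoded text
);

$p->parse($html);
$p->eof;

Something like perl html2csv.pl < page.html > table.csv would run it (the file name is just a placeholder).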

#5


6  

Here's a short Python program I wrote to complete this task. It was written in a couple of minutes, so it can probably be made better. Not sure how it'll handle nested tables (probably it'll do bad stuff) or multiple tables (probably they'll just appear one after another). It doesn't handle colspan or rowspan. Enjoy.


from HTMLParser import HTMLParser
import sys
import re


class HTMLTableParser(HTMLParser):
    def __init__(self, row_delim="\n", cell_delim="\t"):
        HTMLParser.__init__(self)
        self.despace_re = re.compile(r'\s+')
        self.data_interrupt = False
        self.first_row = True
        self.first_cell = True
        self.in_cell = False
        self.row_delim = row_delim
        self.cell_delim = cell_delim

    def handle_starttag(self, tag, attrs):
        self.data_interrupt = True
        if tag == "table":
            self.first_row = True
            self.first_cell = True
        elif tag == "tr":
            if not self.first_row:
                sys.stdout.write(self.row_delim)
            self.first_row = False
            self.first_cell = True
            self.data_interrupt = False
        elif tag == "td" or tag == "th":
            if not self.first_cell:
                sys.stdout.write(self.cell_delim)
            self.first_cell = False
            self.data_interrupt = False
            self.in_cell = True

    def handle_endtag(self, tag):
        self.data_interrupt = True
        if tag == "td" or tag == "th":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            #if self.data_interrupt:
            #   sys.stdout.write(" ")
            sys.stdout.write(self.despace_re.sub(' ', data).strip())
            self.data_interrupt = False


parser = HTMLTableParser() 
parser.feed(sys.stdin.read()) 

#6


5  

With Perl you can use the HTML::TableExtract module to extract the data from the table and then use Text::CSV_XS to create a CSV file or Spreadsheet::WriteExcel to create an Excel file.

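For what it's worth, a rough sketch of that pipeline (assuming simple tables and default module settings; HTML::TableExtract hands back each row as an arrayref of cell text, which Text::CSV_XS can print directly):

use strict;
use warnings;
use HTML::TableExtract;
use Text::CSV_XS;

my $html = do { local $/; <STDIN> };    # slurp the HTML from STDIN

my $te = HTML::TableExtract->new;       # no headers/depth/count constraints, so every table is extracted
$te->parse($html);

my $csv = Text::CSV_XS->new({ binary => 1, eol => "\n" });
for my $table ($te->tables) {
    for my $row ($table->rows) {        # each row is an arrayref of cell text
        $csv->print(\*STDOUT, $row);
    }
}

Swapping Text::CSV_XS for Spreadsheet::WriteExcel, as in the pQuery answer further down, would give you an .xls file instead of CSV.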

#7


5  

Assuming that you've designed an HTML page containing a table, I would recommend this solution. Worked like a charm for me.

$(document).ready(function() {
$("#btnExport").click(function(e) {
    //getting values of current time for generating the file name
    var dt = new Date();
    var day = dt.getDate();
    var month = dt.getMonth() + 1;
    var year = dt.getFullYear();
    var hour = dt.getHours();
    var mins = dt.getMinutes();
    var postfix = day + "." + month + "." + year + "_" + hour + "." + mins;
    //creating a temporary HTML link element (they support setting file names)
    var a = document.createElement('a');
    //getting data from our div that contains the HTML table
    var data_type = 'data:application/vnd.ms-excel';
    var table_div = document.getElementById('dvData');
    var table_html = table_div.outerHTML.replace(/ /g, '%20');
    a.href = data_type + ', ' + table_html;
    //setting the file name
    a.download = 'exported_table_' + postfix + '.xls';
    //triggering the function
    a.click();
    //just in case, prevent default behaviour
    e.preventDefault();
});
});

Courtesy : http://www.kubilayerdogan.net/?p=218


You can change the file format to .csv by editing the download line: a.download = 'exported_table_' + postfix + '.csv';

#8


4  

Just to add to these answers (as I've recently been attempting a similar thing): if Google Spreadsheets is your spreadsheet program of choice, simply do these two things.

1. Strip everything out of your html file around the Table opening/closing tags and resave it as another html file.


2. Import that html file directly into google spreadsheets and you'll have your information beautifully imported (Top tip: if you used inline styles in your table, they will be imported as well!)


Saved me loads of time and figuring out different conversions.


#9


3  

Based on audiodude's answer, but simplified by using the built-in CSV library


require 'nokogiri'
require 'csv'

doc = Nokogiri::HTML(table_string)
csv = CSV.open("output.csv", 'w')

doc.xpath('//table//tr').each do |row|
    tarray = [] #temporary array
    row.xpath('td').each do |cell|
        tarray << cell.text #Build array of that row of data.
    end
    csv << tarray #Write that row out to csv file
end

csv.close

I did wonder if there was any way to take the Nokogiri NodeSet (row.xpath('td')) and write it out as an array to the csv file in one step, but I could only figure out how to do it by iterating over each cell and building the temporary array of each cell's content.

#10


3  

Here's a simple solution without any external lib:

https://www.codexworld.com/export-html-table-data-to-csv-using-javascript/


It works for me without any issue


#11


2  

Here are a few options:

http://groups.google.com/group/ruby-talk-google/browse_thread/thread/cfae0aa4b14e5560?hl=nn


http://ouseful.wordpress.com/2008/10/14/data-scraping-wikipedia-with-google-spreadsheets/


How can I scrape an HTML table to CSV?


https://addons.mozilla.org/en-US/firefox/addon/1852


#12


2  

Here is an example using pQuery and Spreadsheet::WriteExcel:


use strict;
use warnings;

use Spreadsheet::WriteExcel;
use pQuery;

my $workbook = Spreadsheet::WriteExcel->new( 'data.xls' );
my $sheet    = $workbook->add_worksheet;
my $row = 0;

pQuery( 'http://www.blahblah.site' )->find( 'tr' )->each( sub{
    my $col = 0;
    pQuery( $_ )->find( 'td' )->each( sub{
        $sheet->write( $row, $col++, $_->innerHTML );
    });
    $row++;
});

$workbook->close;

The example simply extracts all tr tags that it finds into an Excel file. You can easily tailor it to pick up a specific table, or even trigger a new Excel file per table tag.

Further things to consider:


  • You may want to pick up td tags to create excel header(s).
  • And you may have issues with rowspan & colspan.

To see if rowspan or colspan is being used you can:


pQuery( $data )->find( 'td' )->each( sub{ 
    my $number_of_cols_spanned = $_->getAttribute( 'colspan' );
});

#13


1  

OpenOffice.org can view HTML tables. Simply use the open command on the HTML file, or select and copy the table in your browser and then Paste Special in OpenOffice.org. It will query you for the file type, one of which should be HTML. Select that and voila!


#14


1  

This is a very old thread, but maybe someone like me will bump into it. I have made some additions to audiodude's script to read the HTML from a file instead of adding it to the code, plus another parameter that controls printing of the header lines.

The script should be run like this:

ruby <script_name> <file_name> [<print_headers>]

The code is:

require 'nokogiri'

print_header_lines = ARGV[1]

File.open(ARGV[0]) do |f|

  table_string=f
  doc = Nokogiri::HTML(table_string)

  doc.xpath('//table//tr').each do |row|
    if print_header_lines
      row.xpath('th').each do |cell|
        print '"', cell.text.gsub("\n", ' ').gsub('"', '\"').gsub(/(\s){2,}/m, '\1'), "\", "
      end
    end
    row.xpath('td').each do |cell|
      print '"', cell.text.gsub("\n", ' ').gsub('"', '\"').gsub(/(\s){2,}/m, '\1'), "\", "
    end
    print "\n"
  end
end

#15


0  

This is based on atomicules' answer but is more succinct and also processes th (header) cells as well as td cells. I also added the strip method to get rid of the extra whitespace.

CSV.open("output.csv", 'w') do |csv|
  doc.xpath('//table//tr').each do |row|
    csv << row.xpath('th|td').map {|cell| cell.text.strip}
  end
end

Wrapping the code inside the CSV block ensures that the file will be closed properly.



If you just want the text and don't need to write it to a file, you can use this:


doc.xpath('//table//tr').inject('') do |result, row|
  result << row.xpath('th|td').map {|cell| cell.text.strip}.to_csv
end
