I have a ".CSV" file that I'm trying to parse using CSV in Ruby. The file has two rows of headers, which I've never encountered before, and I don't know how to handle it. Below is an example of the headers and rows.
Row 1
"Institution ID","Institution","Game Date","Uniform Number","Last Name","First Name","Rushing","","","","","Passing","","","","","","Total Off.","","Receiving","","","Pass Int","","","Fumble Ret","","","Punting","","Punt Ret","","","KO Ret","","","Total TD","Off xpts","","","","Def xpts","","","","FG","","Saf","Points"
Row 2
"","","","","","","Rushes","Gain","Loss","Net","TD","Att","Cmp","Int","Yards","TD","Conv","Plays","Yards","No.","Yards","TD","No.","Yards","TD","No.","Yards","TD","No.","Yards","No.","Yards","TD","No.","Yards","TD","","Kicks Att","Kicks Made","R/P Att","R/P Made","Kicks Att","Kicks Made","Int/Fum Att","Int/Fum Made","Att","Made"
Row 3
"721","AirForce","09/01/12","19","BASKA","DAVID","","","","","","","","","","","","0","0","","","","","","","","","","2","85","","","","","","","","","","","","","","","","","","","0"
The actual file does not contain the extra line breaks shown above; I added them only to make the example easier to read. Does CSV have methods for handling this structure, or will I have to write my own? Thanks!
5 Answers
#1
3
You'll have to write your own logic. CSV is really just rows and columns; by itself it has no inherent idea of what each column or row represents. It's just raw data. CSV has no concept of two header rows (that's a human convention), so you'll need to build your own heuristics.
Given that your data rows look like:
"721","Air Force","09/01/12",
When you start parsing your data, check the first column: if it converts to an integer and that integer is > 0, then you know you're dealing with a valid data row and not a header.
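As a rough sketch of that heuristic, the snippet below keeps only the rows whose first column parses as a positive integer. The helper name data_row? and the shortened three-column sample are assumptions made for illustration; they are not part of the question's file or of the CSV library.

```ruby
require 'csv'

# Hypothetical helper: a row counts as data when its first column
# parses as a positive base-10 integer (e.g. "721"). Header cells such
# as "Institution ID" or "" fail the conversion and return nil.
def data_row?(row)
  id = Integer(row.first.to_s, 10, exception: false)
  !id.nil? && id > 0
end

# Sample input modeled on the question, shortened to three columns.
sample = <<~CSV
  "Institution ID","Institution","Game Date"
  "","",""
  "721","Air Force","09/01/12"
CSV

data_rows = CSV.parse(sample).select { |row| data_row?(row) }
data_rows.each { |row| puts row.inspect }
```

This only works as long as no header cell in the first column looks like a positive integer, so it is a heuristic rather than a guarantee.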
#2
7
It looks like your CSV file was produced from an Excel spreadsheet that has columns grouped like this:
... | Rushing | Passing | ...
... |Rushes|Gain|Loss|Net|TD|Att|Cmp|Int|Yards|TD|Conv| ...
(Not sure if I restored the groups properly.)
There are no standard tools for working with this kind of CSV file, AFAIK. You have to do the job manually.
- Read the first line, treat it as the first header line.
- Read the second line, treat it as the second header line.
- Read the third line, treat it as the first data line.
- ...
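The steps above can be sketched as follows. This is one possible way to do it, not a standard recipe: it assumes that a blank cell in the first header row belongs to the group named in the nearest non-blank cell to its left, and it joins the group name and the field name with a space. The sample data is a shortened, made-up version of the question's file, and the variable names are illustrative.

```ruby
require 'csv'

# Sample input modeled on the question, shortened for readability.
sample = <<~CSV
  "Institution ID","Institution","Rushing","","Passing",""
  "","","Rushes","Gain","Att","Cmp"
  "721","Air Force","3","12","",""
CSV

rows = CSV.parse(sample)

# The first two rows are the header lines; everything left is data.
groups, fields = rows.shift(2)

# Carry each group name forward across the blank cells that follow it,
# then join it with the field name from the second header row.
current = nil
headers = groups.zip(fields).map do |group, field|
  current = group unless group.to_s.empty?
  field.to_s.empty? ? current : "#{current} #{field}"
end

# Pair every data row with the combined headers.
records = rows.map { |row| headers.zip(row).to_h }
puts records.first.inspect
```

Combining the two rows keeps repeated field names such as "Yards" and "TD" distinct, since each one is prefixed with its group.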
#3
4
I'd recommend using the smarter_csv gem and manually providing the correct headers:
require 'smarter_csv'
options = {
  :user_provided_headers => ["Institution ID","Institution","Game Date","Uniform Number","Last Name","First Name", ... provide all headers here ... ],
  :headers_in_file => false
}
data = SmarterCSV.process(filename, options)
data.shift # to ignore the first header line
data.shift # to ignore the second header line
# data now contains an array of hashes with your data
Please check the GitHub page for the options and examples: https://github.com/tilo/smarter_csv
The option to use is :user_provided_headers; simply specify the headers you want in an array. This way you can work around cases like this.
Because the file's own header lines are then parsed as the first two data rows, you will have to call data.shift twice to drop them.
#4
1
Read the CSV in and skip the first two lines when iterating:
require 'csv'
arr_of_arrs = CSV.read("path/to/file.csv")
arr_of_arrs.drop(2).each do |x|
  # operation here
end
#5
1
It's really easy to do this with CSV. Just watch the line number of the line that was last read, and skip rows until you're past the headers:
require 'csv'
CSV.foreach('test.csv') do |row|
  next unless $. > 2
  puts "'" + row.join("', '") + "'"
end
When run, this is the output:
'721', 'Air Force', '09/01/12', '19', 'BASKA', 'DAVID', '', '', '', '', '', '', '', '', '', '', '', '0', '0', '', '', '', '', '', '', '', '', '', '2', '85', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '0'
$. is the line number of the last line read from the most recently read file. So the loop skips rows until more than two lines have been read.
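Since $. tracks reads on the underlying file rather than parsed rows, it may not match the row count if the CSV library reads the file in larger chunks or if a quoted field spans several lines. A variation that counts rows explicitly is sketched below, assuming two header rows as in the question. The HEADER_ROWS constant and the temporary sample file are illustrative; the sketch relies on CSV.foreach returning an enumerator when called without a block.

```ruby
require 'csv'
require 'tempfile'

HEADER_ROWS = 2 # number of header rows, as in the question's file

data_rows = []

# Write a small sample file so the sketch is self-contained.
Tempfile.create(['sample', '.csv']) do |file|
  file.write(%("Institution ID","Institution"\n"",""\n"721","Air Force"\n))
  file.flush

  # with_index(1) numbers the parsed rows starting at 1.
  CSV.foreach(file.path).with_index(1) do |row, row_number|
    next if row_number <= HEADER_ROWS
    data_rows << row
  end
end

data_rows.each { |row| puts "'" + row.join("', '") + "'" }
```

Like the original loop, this streams the file row by row instead of loading it all into memory first.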