I have a few thousand reports with consistently formatted tabular data embedded in them that I need to extract.
I have a few ideas, but thought I'd post to see whether there's a better way than what I have in mind: extract the tabular data, write it to a new file, then parse that file as a tabular file.
Here's a sample input and output, where the output is read and written row by row to a database.
INPUT_FILE
MiscText MiscText MiscText
MiscText MiscText MiscText
MiscText MiscText MiscText
SubHeader
PASS 1283019238 alksdjalskdjl
FAIL 102310928301 kajdlkajsldkaj
PASS 102930192830 aoisdajsdoiaj
PASS 192830192301 jiasdojoasi
MiscText MiscText MiscText
MiscText MiscText MiscText
MiscText MiscText MiscText
OUTPUT (read/write row-by-row from text-file to DB)
ROW-01{column01,column02,column03}
...
ROW-nth{column01,column02,column03}
3 Solutions
#1
2
Recognizing when to start processing tabular data is easy: you've got the marker line. The difficulty is recognizing when to stop. You can apply the heuristic of stopping as soon as the split doesn't yield the expected number of columns.
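use strict;
use warnings;

my $tab_data;    # true once the marker line has been seen
my $num_cols;    # column count of the first data row

while ( <> ) {
    # The marker line starts the table; skip the marker itself.
    $tab_data = 1, next if $_ eq "SubHeader\n";
    next unless $tab_data;
    chomp;
    my @cols = split /\t/;
    $num_cols ||= scalar @cols;
    # Stop at the first line whose column count differs.
    last if $num_cols and $num_cols != scalar @cols;
    print join( "\t", @cols ), "\n";
}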
Save as etd.pl (etd = extract tabular data, what did you think?), and call it like this from the command line:
perl etd.pl < your-mixed-input.txt
#2
1
If you know how to extract data, why create a new file instead of processing it immediately?
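A minimal sketch of that idea, reusing the SubHeader marker and the column-count heuristic from the first answer: each row is handed to a callback as soon as it is parsed, so no intermediate file is needed. The names process_rows and $handle_row are hypothetical, and the callback here only collects rows in memory; in practice it might execute a prepared database insert instead. Tab-separated columns are assumed, as in the first answer.

```perl
use strict;
use warnings;

# Hypothetical helper: read lines from $fh and call $handle_row->(@cols)
# for every row of the table that follows the "SubHeader" marker line.
sub process_rows {
    my ( $fh, $handle_row ) = @_;
    my ( $in_table, $num_cols );
    while ( my $line = <$fh> ) {
        chomp $line;
        if ( !$in_table ) {
            $in_table = 1 if $line eq 'SubHeader';
            next;
        }
        my @cols = split /\t/, $line;    # assumes tab-separated fields
        $num_cols ||= scalar @cols;      # first row fixes the column count
        # A line with a different column count ends the table.
        last if $num_cols and $num_cols != scalar @cols;
        $handle_row->(@cols);
    }
    return;
}

# Usage: read from an in-memory report and collect the rows.
# A database insert could replace the push in the callback.
my $report = join "\n",
    "MiscText MiscText MiscText",
    "SubHeader",
    "PASS\t1283019238\talksdjalskdjl",
    "FAIL\t102310928301\tkajdlkajsldkaj",
    "MiscText MiscText MiscText",
    "";
open my $fh, '<', \$report or die "Cannot open in-memory file: $!";
my @rows;
process_rows( $fh, sub { push @rows, [@_] } );
close $fh;
print join( ',', @$_ ), "\n" for @rows;
```

Passing a callback keeps the parsing separate from whatever is done with each row, so the same loop can feed a database, a file, or a test.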
#3
0
If this is fixed-width data, I would strongly suggest using unpack or plain old substr.
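For illustration, a minimal sketch of the unpack approach. The column widths (5, 13, and the remainder of the line) are assumptions made up for this example, since the sample in the question does not show a real fixed-width layout; parse_fixed is a hypothetical helper name.

```perl
use strict;
use warnings;

# Assumed layout: status in columns 1-5, id in columns 6-18, label after that.
# "A" fields are space-padded text; unpack strips their trailing spaces.
my $template = 'A5 A13 A*';

# Hypothetical helper: split one fixed-width line into its three fields.
sub parse_fixed {
    my ($line) = @_;
    return unpack $template, $line;
}

# Build a sample line with the assumed widths, then parse it back.
my $line = sprintf '%-5s%-13s%s', 'PASS', '1283019238', 'alksdjalskdjl';
my ( $status, $id, $label ) = parse_fixed($line);
print join( ',', $status, $id, $label ), "\n";

# The same columns via substr( $string, $offset, $length ); padding is kept.
my $status_raw = substr $line, 0, 5;     # 'PASS '
my $id_raw     = substr $line, 5, 13;    # '1283019238   '
```

With unpack the whole layout lives in one template string, whereas substr needs one call per column and leaves the padding to be trimmed separately.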