Splitting a file in Linux based on content (duplicate)

Time: 2021-12-23 15:31:01

This question already has an answer here:

I have an email dump of around 400 MB. I want to split it into .txt files, with one mail in each file. Every e-mail starts with the standard HTML header specifying the doctype.

This means I have to split the file on that header. How do I go about it in Linux?

5 answers

#1


56  

Suppose you have a file mail.txt:

$ cat mail.txt
<html>
    mail A
</html>

<html>
    mail B
</html>

<html>
    mail C
</html>

Run csplit to split on every <html> line:

$ csplit mail.txt '/^<html>$/' '{*}'

 - mail.txt    => input file
 - /^<html>$/  => pattern match every `<html>` line
 - {*}         => repeat the previous pattern as many times as possible

Check the output:

$ ls
mail.txt  xx00  xx01  xx02  xx03
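
In the listing above, xx00 is empty because the first <html> sits on line 1 of the sample. A minimal sketch, assuming GNU csplit (the -s, -z, -f and -b options come from GNU coreutils; the "mail_" prefix and the scratch-directory sample are illustrative and not part of the original answer), that drops the empty piece and writes .txt names directly:

```shell
# Work in a scratch directory so the demo does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the three-mail sample; printf reuses the format for A, B and C.
printf '<html>\n    mail %s\n</html>\n\n' A B C > mail.txt

# -s: do not print the byte count of each piece
# -z: do not keep empty output files
# -f: output name prefix (hypothetical name "mail_")
# -b: printf-style suffix, so every piece ends in .txt
csplit -s -z -f mail_ -b '%03d.txt' mail.txt '/^<html>$/' '{*}'

ls
```

On other csplit implementations these options may be missing, so check the local man page first.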

If you want to do it in awk:

$ awk '/<html>/{if (filename) close(filename); filename=NR".txt"}; {print > filename}' mail.txt
$ ls
1.txt  5.txt  9.txt  mail.txt
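
The NR-based names above depend on where each header happens to sit in the file. A hedged variant, assuming a POSIX awk (the "mail_%04d.txt" naming and the variable names out and n are illustrative): it numbers the outputs sequentially, closes each file before opening the next so a large dump does not run out of open file descriptors, and ignores any lines that come before the first header.

```shell
# Scratch directory and the same three-mail sample as in the answer.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '<html>\n    mail %s\n</html>\n\n' A B C > mail.txt

awk '
    /^<html>$/ {
        if (out) close(out)                   # finish the previous mail
        out = sprintf("mail_%04d.txt", ++n)   # hypothetical naming scheme
    }
    out { print > out }                       # skip lines before the first header
' mail.txt

ls
```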

#2


4  

The csplit program solves your problem elegantly:

csplit "$FILE" '/<!DOCTYPE/' '{*}'
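
As a quick sanity check for a DOCTYPE-based split, the number of pieces can be compared with the number of header lines. A sketch, assuming GNU csplit and a tiny made-up dump (the file name dump.txt, the "part_" prefix and the shortened DOCTYPE line are placeholders, not the asker's real data):

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two fake mails, each starting with a DOCTYPE line.
for body in one two; do
    printf '<!DOCTYPE html>\n<html><body>%s</body></html>\n' "$body"
done > dump.txt

# Split before every line that starts with the header; -z drops the
# empty piece that precedes the first header.
csplit -s -z -f part_ dump.txt '/^<!DOCTYPE/' '{*}'

headers=$(grep -c '^<!DOCTYPE' dump.txt)
pieces=$(ls part_* | wc -l)
echo "headers=$headers pieces=$pieces"
```

If the two numbers differ on real data, the pattern is probably matching text inside a mail body or missing a header variant.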

#3


2  

csplit is the best solution to this problem. I just thought I'd post a bash solution to show that there is no need to reach for Perl for this task:

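#!/usr/bin/env bash

MAIL='mail'        # path to the huge mail file

# Collect the line number of every opening <html> header.
# The pattern '<html>' does not match the closing '</html>' tag.
mapfile -t starts < <(grep -n '<html>' "$MAIL" | cut -d: -f1)

for i in "${!starts[@]}"; do
    start=${starts[i]}
    # A mail ends just before the next header, or at end of file.
    if (( i + 1 < ${#starts[@]} )); then
        end=$(( starts[i+1] - 1 ))
    else
        end='$'    # sed address for the last line
    fi
    echo "$start, $end"
    sed -n "${start},${end}p" "$MAIL" > "${MAIL}${i}.txt"
done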

#4


1  

I agree with fge. With Perl it would be a lot simpler. You can try something like this:

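#!/usr/bin/perl
use strict;
use warnings;

undef $/;            # slurp mode: read the whole input in one go
my $data = <>;
my $n = 0;

# The lookahead splits before each header, so the header stays with its mail.
for my $match (split /(?=HEADER_FORMAT)/, $data) {
    open(my $out, '>', 'mail' . ++$n) or die "cannot open output: $!";
    print $out $match;
    close($out);
}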

Replace HEADER_FORMAT with your header type.

#5


1  

It is doable with some Perl "magic"... Many people would call this ugly, but here goes.

The trick is to set $/ (the input record separator) to the header you want to split on and then read your input, like this:

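#!/usr/bin/perl -W
use strict;
my $i = 1;

# Use the full header line (including its newline) as the record separator.
$/ = <<EOF;
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head> <xmeta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
EOF

open(my $in, '<', "/path/to/inputfile") or die "cannot open input: $!";

while (my $mail = <$in>) {
    # Strip the trailing separator; unlike substr/index, chomp leaves the
    # last record intact when it does not end with the header.
    chomp $mail;
    open(my $out, '>', "/path/to/emailfile.$i.txt") or die "cannot open output: $!";
    $i++;
    print $out $mail;
    close $out;
}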

Edit: fixed. I always forget that $/ is included in the input. Also, the first file will always be empty, but that is easy to handle.
