How to split a file into n parts

Time: 2022-08-03 21:37:06

I have a file containing some number of lines. I want to split it into n files with particular names. It doesn't matter how many lines end up in each file; I just want a particular number of files (say 5). The problem is that the number of lines in the original file keeps changing, so I need to count the lines and then split the file into 5 parts. If possible, each part should also be sent to a different directory.

4 answers

#1


21  

In bash, you can use the split command to split a file by a desired number of lines per output file, and the wc command to work out what that number should be. Here is wc combined with split in one line.

For example, to split onepiece.log into 5 parts:

    split -d -a 4 -l "$(( $(wc -l < onepiece.log) / 5 ))" onepiece.log onepiece.split.log

This will create files like onepiece.split.log0000 ...


Note: bash integer division rounds down, so if the line count is not a multiple of 5, the leftover lines go into a 6th part file.
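
One way around the 6th file is to round the per-file line count up instead of down. The following is only a sketch of that idea: the temporary directory, the 23-line sample file, and the variable names are assumptions made for illustration, and it relies on the same -d option as the command above.

```shell
# Sketch: ceiling division so that at most 5 parts are produced.
# The temp directory and the generated sample file are illustrative assumptions.
workdir=$(mktemp -d)
input="$workdir/onepiece.log"
seq 1 23 > "$input"                 # 23 lines: not a multiple of 5

parts=5
total=$(wc -l < "$input")
# Ceiling division: (total + parts - 1) / parts
per_file=$(( (total + parts - 1) / parts ))

# Same naming scheme as above: onepiece.split.log0000, 0001, ...
split -d -a 4 -l "$per_file" "$input" "$workdir/onepiece.split.log"
ls "$workdir"
```

With rounding up, the last part simply holds fewer lines. Very small files can still produce fewer than 5 parts, and an empty file gives a per-file count of 0, which split rejects, so a real script would want to check for those cases.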


#2


6  

Assuming you are processing a text file, use wc -l to determine the total number of lines and split -l to split the file into pieces of a specified number of lines (total / 5 in your case). This works on UNIX/Mac and on Windows (if you have Cygwin installed).
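
As a rough sketch of those two steps, followed by the "different directories" part of the question: the sample file, the part. prefix, and the dir1 ... dir5 directory names below are all hypothetical, and the per-file count is rounded up (rather than plain total / 5) so that a remainder does not create an extra file.

```shell
# Sketch: count lines, split, then move each piece into its own directory.
# File names, prefix and directory names are illustrative assumptions.
workdir=$(mktemp -d)
input="$workdir/data.txt"
seq 1 20 > "$input"                          # sample input: 20 lines

total=$(wc -l < "$input")                    # step 1: count the lines
per_file=$(( (total + 4) / 5 ))              # lines per piece, rounded up
split -l "$per_file" "$input" "$workdir/part."   # step 2: part.aa, part.ab, ...

# Move every piece into a separate directory: dir1, dir2, ...
i=1
for piece in "$workdir"/part.*; do
    mkdir -p "$workdir/dir$i"
    mv "$piece" "$workdir/dir$i/"
    i=$((i + 1))
done
```

Only -l and the default alphabetic suffixes are used here, so this should behave the same with GNU and BSD split.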


#3


5  

On Linux, there is a split command:

split --lines=1m /path/to/large/file /path/to/output/file/prefix

Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default size is 1000 lines, and default PREFIX is 'x'. With no INPUT, or when INPUT is -, read standard input.


...

-l, --lines=NUMBER put NUMBER lines per output file

...

You would have to calculate the actual size of the splits beforehand, though.
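
If the split in question is the GNU coreutils one, its -n (--number) option may avoid the calculation altogether: l/5 asks for 5 output files without breaking any line. This is a sketch under that assumption; the sample file and the prefix are made up for illustration, and other split implementations may not accept the l/N form.

```shell
# Sketch: let GNU split pick the chunk boundaries itself.
# Assumes GNU coreutils split; the sample file and prefix are illustrative.
workdir=$(mktemp -d)
input="$workdir/large.txt"
seq 1 1000 > "$input"                # sample input: 1000 lines

# l/5 = five output files, split on line boundaries only
split -n l/5 -d -a 2 "$input" "$workdir/prefix"

wc -l "$workdir"/prefix*             # show how the lines were distributed
```

The pieces come out roughly equal in bytes rather than in line count, which fits the question, since the number of lines per file does not matter there.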


#4


0  

I can think of a few ways to do it. Which you would use depends a lot on the data.


  1. Lines are fixed length: Find the size of the file by reading its directory entry and divide by the line length to get the number of lines. Use this to determine how many lines go in each file.

  2. The files only need to have approximately the same number of lines: Again, read the file size from the directory entry. Read the first N lines (N should be small, but still a reasonable fraction of the file) to calculate an average line length. Calculate the approximate number of lines from the file size and that predicted average line length. This assumes that the line length follows a normal distribution. If not, adjust your method to sample lines at random (using seek() or something similar). Rewind the file after you have your average, then split it based on the predicted line length.

  3. Read the file twice: The first time, count the number of lines. The second time, split the file into the requisite pieces.

EDIT: Using a shell script (according to your comments), the randomized version of #2 would be hard unless you wrote a small program to do that for you. You should be able to use ls -l to get the file size, wc -l to count the exact number of lines, and head -nNNN | wc -c to calculate the average line length.
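
A minimal sketch of that estimate, under a few assumptions: the sample file and variable names are invented for illustration, wc -c is used for the file size instead of parsing ls -l output, and the sample is simply the first 100 lines.

```shell
# Sketch: estimate the line count from the file size and a sampled line length.
# The sample file and all variable names are illustrative assumptions.
workdir=$(mktemp -d)
input="$workdir/data.txt"
seq 1000 1999 > "$input"             # 1000 lines, 5 bytes each including newline

sample=100                                            # lines to sample
size=$(wc -c < "$input")                              # total size in bytes
sample_bytes=$(head -n "$sample" "$input" | wc -c)    # bytes in the sample

# estimated lines = size / (sample_bytes / sample), rearranged so the
# integer arithmetic does not round the average line length first
est_lines=$(( size * sample / sample_bytes ))
exact_lines=$(wc -l < "$input")                       # exact count, for comparison

echo "estimated: $est_lines, exact: $exact_lines"
```

The estimate is only as good as the sample: it assumes the file has at least that many lines and that the first lines are representative of the rest. Since wc -l gives the exact count anyway, the estimate mainly pays off when reading the whole file twice is too expensive.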
