Reading a huge fixed-width file

I have a requirement to read a huge flat file without keeping the entire file in memory. The file consists of multiple record sets: each set starts with a header record, identified by an 'H' at the beginning of the line, followed by any number of detail lines, and then the next header record begins. This pattern repeats, for example:

HXYZ CORP  12/12/2016
R1 234 qweewwqewewq wqewe
R1 234 qweewwqewewq wqewe
R1 234 qweewwqewewq wqewe
R2 344 dfgdfgdf gfd  df g
HABC LTD  12/12/2016
R1 234 qweewwqewewq wqewe
R2 344 dfgdfgdf gfd  df g
HDRE CORP  12/12/2016
R1 234 qweewwqewewq wqewe
R2 344 dfgdfgdf gfd  df g
R2 344 dfgdfgdf gfd  df g

I want to read one record set at a time, for example:

HDRE CORP  12/12/2016
R1 234 qweewwqewewq wqewe
R2 344 dfgdfgdf gfd  df g
R2 344 dfgdfgdf gfd  df g

How can I achieve this, keeping in mind that I do not want to hold the entire file in memory? Is there a standard library I can use for this purpose? I have tried a few implementations without much success. I have used Apache's LineIterator, but that reads line by line.

Any help or suggestions will be much appreciated.

6 solutions

#1

The data is stored by line, and you don't know the record has ended until you read the header line of the next record. You need to read line-by-line. Something like this should work:

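// requires java.io.BufferedReader, java.io.FileReader,
//          java.util.ArrayList, java.util.List
BufferedReader br = new BufferedReader( new FileReader( file ) );
List<String> record = new ArrayList<>();
String line;

// loop is explicitly broken when file ends
for ( ;; )
{
    line = br.readLine();

    // no more lines - process what's in record (if anything) and break the loop
    if ( null == line )
    {
        if ( !record.isEmpty() )
        {
            ProcessRecord( record );
        }
        break;
    }

    // new header line, process what's in record and clear it
    // for the new record (nothing to process before the very first header)
    if ( line.startsWith( "H" ) && !record.isEmpty() )
    {
        // ProcessRecord must not keep a reference to the list: it is reused
        ProcessRecord( record );
        record.clear();
    }

    // add the current line to the current record
    record.add( line );
}
br.close();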

#2

You should aim to achieve your goal with line-by-line reading (using the Apache LineIterator you already tried, or Java 8's Files.lines()).

Use two loops. The outer loop runs until EOF is reached; the inner loop reads one record set at a time. Once you have processed a whole record set, drop your references to its lines so the garbage collector can reclaim them. The outer loop then moves on to the next record set.
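
The two-loop idea can be sketched roughly as follows. This is only an illustration under some assumptions: the class name RecordSetReader, the method readRecordSets and the Consumer-based callback are made up for the example, and a header is taken to be any line that starts with "H", as in the question's sample data.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not part of any library.
class RecordSetReader {

    /**
     * Reads header-delimited record sets from the reader and passes each
     * one to the handler. Only the current record set is held in memory.
     */
    static void readRecordSets(BufferedReader reader, Consumer<List<String>> handler)
            throws IOException {
        String line = reader.readLine();

        // Outer loop: one iteration per record set, until EOF.
        while (line != null) {
            // A fresh list per record set, so the handler may keep or discard it.
            List<String> recordSet = new ArrayList<>();
            recordSet.add(line); // normally the header line

            // Inner loop: collect lines until the next header or EOF.
            while ((line = reader.readLine()) != null && !line.startsWith("H")) {
                recordSet.add(line);
            }

            handler.accept(recordSet);
        }
    }

    public static void main(String[] args) throws IOException {
        // Small in-memory sample; a real caller would wrap a FileReader instead.
        String sample = "HXYZ CORP  12/12/2016\nR1 234 a\nR2 344 b\n"
                      + "HABC LTD  12/12/2016\nR1 234 c\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(sample))) {
            readRecordSets(reader, rs -> System.out.println(rs.size() + " lines: " + rs.get(0)));
        }
    }
}
```

Because the inner loop stops on the line that starts the next record set, that line is carried over into the next outer iteration instead of being read twice.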

If you use lambdas and Java 8's Files.lines(...), you may want to group (collect) the lines that belong to the same record set and then process those grouped objects.

#3

In Java 8, you can use the NIO Files.lines() method together with Stream.map() and PrintWriter.

I updated the code so that it writes line by line to a new file, appending the current date to each header line.

import java.util.stream.Stream;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.io.IOException;

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;    

public class Main {

    public static void main(String[] args) {

        String input =  "C://data.txt";
        String output = "C://data1.txt";
        String date = getDate();

        addDate(input,output,date);

    }

    public static void addDate(String in, String out,String date)
    {

        try (Stream<String> stream = Files.lines(Paths.get(in));
             PrintWriter output = new PrintWriter(out, "UTF-8"))
        {    
         stream.map(x -> {
            if(x.startsWith("H")) return (x +" "+date); 
            else return x;
            }
         ).forEach(output::println);
        }
        catch(IOException e){e.printStackTrace();}
    }

    public static String getDate(){
        DateTimeFormatter dtf = DateTimeFormatter.ofPattern("dd/MM/yyyy");
        LocalDate localDate = LocalDate.now();
        return dtf.format(localDate);
    }
}

#4

I would just go with the built-in BufferedReader and read the file line by line.

I don't know what you mean by a fixed-width file, because in your comment you mention that

R1, R2, R3 are all optional, repeatable, and of varying widths.

In any case, based on your description, your format is structured so that you can:

1. Read the first character to get the TOKEN
2. Check if TOKEN equals "H" or "R"
3. Split the line and parse it based on what type of TOKEN it is.

If R1, R2, and R3 are separate tokens, then you would need to check whether the line is an R entry, and then check the next character as needed.

For step 3, you may consider splitting on spaces if each field in the line is separated by a space. Or, if each record type has a fixed width, it may be acceptable to use substring to extract each field.
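
As a rough sketch of those three steps: the class LineParser, the Kind enum and both helper methods are hypothetical names for illustration, and any column offsets passed to fixedField are assumptions, since the real layout is not given in the question.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for classifying and splitting one line.
class LineParser {

    /** Kind of line, decided by its leading characters. */
    enum Kind { HEADER, R1, R2, R3, UNKNOWN }

    /** Steps 1 and 2: look at the leading token to decide the line type. */
    static Kind classify(String line) {
        if (line.startsWith("H"))  return Kind.HEADER;
        if (line.startsWith("R1")) return Kind.R1;
        if (line.startsWith("R2")) return Kind.R2;
        if (line.startsWith("R3")) return Kind.R3;
        return Kind.UNKNOWN;
    }

    /** Step 3, variant A: fields separated by one or more whitespace characters. */
    static List<String> splitOnWhitespace(String line) {
        return Arrays.asList(line.trim().split("\\s+"));
    }

    /**
     * Step 3, variant B: a fixed-width field occupying columns [start, end).
     * Short lines yield a shorter (possibly empty) value instead of an exception.
     */
    static String fixedField(String line, int start, int end) {
        if (start >= line.length()) {
            return "";
        }
        return line.substring(start, Math.min(end, line.length())).trim();
    }

    public static void main(String[] args) {
        String line = "R2 344 dfgdfgdf gfd  df g";
        System.out.println(classify(line));          // line type
        System.out.println(splitOnWhitespace(line)); // whitespace-separated fields
        System.out.println(fixedField(line, 3, 6));  // assumed columns 3..5
    }
}
```

Whitespace splitting is simpler, but it breaks down if a field may itself contain spaces (as the company name in the header lines does); in that case fixed column offsets are the safer choice.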

I'm not sure what you mean by

My use case requires reading an entire record set at a time.

#5

Following @firephil's suggestion, I used the Java 8 Stream API for this requirement. A StringBuilder serves as a buffer for the lines between one header record and the next. Finally, I obtain an iterator from the Stream so that I can fetch one full record set (H+R1+R2+R3) from the file at a time. There was a problem with the last record set: with this approach a record set is only emitted when the next header arrives, so the last one was getting lost. I therefore had to concatenate a fake header record to the end of the original Stream. This will do for now, but I am sure there is a better way to handle it.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Iterator;
import java.util.stream.Stream;

// Buffer holding the lines of the record set currently being assembled.
public static StringBuilder sbTemp;

public static Iterator<String> process(String in) throws IOException
{
    sbTemp = new StringBuilder();
    // Fake header appended so that the last real record set gets flushed.
    Stream<String> fakeRecordStream = Stream.of("H Fake Line");
    // Note: the stream from Files.lines() is never closed here, and the
    // caller cannot close it through the returned iterator.
    Stream<String> stream = Files.lines(Paths.get(in)).sequential();
    Stream<String> finalStream = Stream.concat(stream, fakeRecordStream);

    return finalStream.map(x -> {
        if (x.startsWith("H")) {
            // New header: emit the buffered record set and start a new buffer.
            String s = sbTemp.toString();
            sbTemp = new StringBuilder();
            sbTemp.append(x);
            return s;
        }
        else {
            // Detail line: keep buffering and emit an empty placeholder.
            sbTemp.append("\n").append(x);
            return "";
        }
    })
    // Drop the empty placeholders, keeping only completed record sets.
    .filter(line -> line.startsWith("H"))
    .iterator();
}
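
One possible way to avoid the fake record is to wrap the line iterator in a second iterator that reads one line ahead. The sketch below is an illustration only, not the approach used above: RecordSetIterator is a hypothetical class, and it again assumes that every line starting with "H" begins a new record set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical wrapper: turns an iterator of lines into an iterator of record sets.
class RecordSetIterator implements Iterator<List<String>> {

    private final Iterator<String> lines;
    // First line of the next record set, already read; null when input is exhausted.
    private String pending;

    RecordSetIterator(Iterator<String> lines) {
        this.lines = lines;
        this.pending = lines.hasNext() ? lines.next() : null;
    }

    @Override
    public boolean hasNext() {
        return pending != null;
    }

    @Override
    public List<String> next() {
        if (pending == null) {
            throw new NoSuchElementException();
        }
        List<String> recordSet = new ArrayList<>();
        recordSet.add(pending);
        pending = null;

        // Consume lines until the next header, which is kept for the next call.
        while (lines.hasNext()) {
            String line = lines.next();
            if (line.startsWith("H")) {
                pending = line;
                break;
            }
            recordSet.add(line);
        }
        // If the input ran out, pending stays null and the last set is still returned.
        return recordSet;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "HABC LTD  12/12/2016", "R1 234 a",
                "HDRE CORP  12/12/2016", "R1 234 b", "R2 344 c");
        Iterator<List<String>> it = new RecordSetIterator(sample.iterator());
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

With a real file, the lines could come from Files.lines(path).iterator(), with the stream opened in a try-with-resources block around the loop so that the file is closed once iteration is finished.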

#6

One library for this purpose is BeanIO.

There are a lot of unsupported libraries for fixed-width file formats out there.

Flatpack is more recent, but I haven't tried it.