Efficient reading and writing of a large text file in Java

Date: 2022-03-22 09:17:22

I have a big text file that contains source-target node pairs and a threshold for each pair. I store all the distinct nodes in a HashSet, then filter the edges by a user-supplied threshold and store the nodes of the filtered edges in a separate HashSet. I want to find a way to do this processing as fast as possible.

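import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.util.HashSet;

public class Simulator {

    // Distinct nodes over all edges, and over the edges that pass the threshold.
    static HashSet<Integer> Alledgecount = new HashSet<>();
    static HashSet<Integer> FilteredEdges = new HashSet<>();

    static void process(BufferedReader reader, double userThres) throws IOException {
        String line;
        int l = 0;

        // try-with-resources closes the writer even if parsing throws.
        try (BufferedWriter writer = new BufferedWriter(new FileWriter("C:/users/mario/desktop/edgeList.txt"))) {
            // Stop at end of file or after 50 million lines, whichever comes first.
            while ((line = reader.readLine()) != null && l < 50_000_000) {
                String[] intArr = line.split("\\s+");

                // Parse each node id once and reuse it for both sets.
                int source = Integer.parseInt(intArr[1]);
                int target = Integer.parseInt(intArr[2]);
                checkDuplicate(source, target, Alledgecount);

                double threshold = Double.parseDouble(intArr[3]);
                if (threshold > userThres) {
                    writeToFile(intArr[1], intArr[2], writer);
                    checkDuplicate(source, target, FilteredEdges);
                }
                l++;
            }
        }
    }

    // Writes one edge as "source,target" followed by a Windows line ending.
    static void writeToFile(String param1, String param2, Writer writer) throws IOException {
        writer.write(param1 + "," + param2);
        writer.write("\r\n");
    }

    // checkDuplicate(...) and the rest of the class are not shown in the question.
}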

The Graph class runs a BFS and writes the nodes to a separate file. I timed the processing with some of the functionality disabled; the timings are below.

Timings with 50 million lines read in process()

without calling BFS(),checkDuplicates,writeAllEdgesToFile() -> 54s
without calling BFS(),writeAllEdgesToFile() -> 50s
without calling writeAllEdgesToFile() -> 1min

Timings with 300 million lines read in process()

without calling writeAllEdges() -> 5min

2 Answers

#1


3  

Reading a file doesn't depend only on CPU cores.
I/O on a file is limited by the physical constraints of classic disks, which, unlike CPU cores, cannot perform operations in parallel.

What you could do is use one thread for the I/O operations and one or more other threads for the data processing. That only makes sense if the data processing takes long enough to justify a dedicated thread, because threads have a cost in terms of CPU scheduling.
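
A minimal sketch of that split, under stated assumptions: one thread reads lines and hands them to a single worker through a bounded BlockingQueue. The class name ReaderWorkerSketch, the POISON end-of-input marker, the queue size of 10,000, and the "source target threshold" column layout are all hypothetical choices for illustration, not taken from the question. A real version would also need to handle a worker that dies on a malformed line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class ReaderWorkerSketch {

    // Sentinel object that tells the worker there are no more lines (hypothetical name).
    private static final String POISON = new String("<EOF>");

    /** Reads lines on the calling thread, parses them on a worker thread, returns the distinct node ids. */
    static Set<Integer> distinctNodes(BufferedReader reader) throws InterruptedException {
        // Bounded queue: the reader blocks instead of running far ahead of the worker.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);
        Set<Integer> nodes = new HashSet<>(); // touched only by the worker until join()

        Thread worker = new Thread(() -> {
            try {
                String next;
                // Identity comparison on purpose: only the sentinel object ends the loop.
                while ((next = queue.take()) != POISON) {
                    // Assumed layout: "source target threshold", whitespace separated.
                    String[] parts = next.trim().split("\\s+");
                    nodes.add(Integer.parseInt(parts[0]));
                    nodes.add(Integer.parseInt(parts[1]));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();

        try {
            String line;
            while ((line = reader.readLine()) != null) {
                queue.put(line); // blocks while the queue is full
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            queue.put(POISON); // always release the worker
        }
        worker.join(); // after join(), the worker's writes to 'nodes' are visible here
        return nodes;
    }

    public static void main(String[] args) throws InterruptedException {
        BufferedReader in = new BufferedReader(new StringReader("1 2 0.9\n2 3 0.1\n"));
        System.out.println(distinctNodes(in));
    }
}
```

Passing every line through a queue has its own cost, so for cheap per-line work like this the hand-off may well be slower than the single-threaded loop; measuring both is the only way to know.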

#2


2  

Getting a multi-threaded Java program to run correctly can be very tricky. It needs some deep understanding of things like synchronization issues etc. Without the knowledge/experience necessary, you'll have a hard time searching for bugs that occur sometimes but aren't reliably reproducible.

So, before trying multi-threading, find out if there are easier ways to achieve acceptable performance:

Find the part of your program that takes the time!

First question: is it I/O or CPU? Have a look at Task Manager. Does your single-threaded program occupy one core (e.g. CPU close to 25% on a 4-core machine)? If it's far below that, then I/O must be the limiting factor, and changing your program probably won't help much - buy a faster HD. (In some situations, the software style of doing I/O might influence the hardware performance, but that's rare.)
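
A complementary check, sketched here as a swapped-in technique rather than something from this answer: time a loop that only reads the lines and discards them. If that baseline takes nearly as long as the full run, the time is going into reading rather than into parsing or the HashSet work. The class name ReadBaselineSketch and the command-line handling are hypothetical. Note that the operating system's file cache can make a second run over the same file look faster than the first.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Paths;

class ReadBaselineSketch {

    /** Reads every line and discards it; returns the number of lines read. */
    static long countLines(BufferedReader reader) throws IOException {
        long lines = 0;
        while (reader.readLine() != null) {
            lines++;
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Pass a file path as the first argument; without one, a tiny in-memory sample is used.
        try (BufferedReader reader = args.length > 0
                ? Files.newBufferedReader(Paths.get(args[0]))
                : new BufferedReader(new StringReader("1 2 0.9\n2 3 0.1\n"))) {
            long start = System.nanoTime();
            long lines = countLines(reader);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println(lines + " lines read in " + millis + " ms");
        }
    }
}
```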

If it's CPU, use a profiler, e.g. the JVisualVM contained in the JDK, to find the method that takes most of the runtime and think about alternatives. One candidate might be line.split("\\s+"), which uses a regular expression. Regular expressions are slow, especially if the expression isn't compiled to a Pattern beforehand - but that's nothing more than a guess, and the profiler will most probably point you to some very different place.
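
If the profiler does point at the split, two alternatives can be sketched as follows: a Pattern compiled once and reused, and a hand-written scan that avoids the regex engine entirely. The class and method names (SplitSketch, splitPrecompiled, splitManual) are hypothetical. The two are not drop-in equivalents: the regex split returns an empty first element when a line starts with whitespace, while the manual scan skips leading separators, so the column indices used by the caller may shift.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class SplitSketch {

    // Compiled once and reused; String.split("\\s+") compiles the pattern again on every call.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Same result as line.split("\\s+"), without recompiling the regex. */
    static String[] splitPrecompiled(String line) {
        return WHITESPACE.split(line);
    }

    /** Regex-free scan; any Character.isWhitespace character counts as a separator. */
    static List<String> splitManual(String line) {
        List<String> tokens = new ArrayList<>(4);
        int i = 0;
        int n = line.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(line.charAt(i))) {
                i++; // skip a run of separators
            }
            int start = i;
            while (i < n && !Character.isWhitespace(line.charAt(i))) {
                i++; // consume one token
            }
            if (start < i) {
                tokens.add(line.substring(start, i));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitPrecompiled("12 34\t0.5")));
        System.out.println(splitManual("  12 34\t0.5 "));
    }
}
```

Whether either variant is measurably faster on the real file is something only a profiler run or a timing comparison can confirm.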
