How to write a 1GB file efficiently in C#

Time: 2022-09-23 17:52:31

I have a .txt file (containing more than a million rows) which is around 1GB, and I have a list of strings. I am trying to remove from the file all the rows that exist in the list of strings and write the result to a new file, but it is taking a very long time.


using (StreamReader reader = new StreamReader(_inputFileName))
{
    using (StreamWriter writer = new StreamWriter(_outputFileName))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (!_lstLineToRemove.Contains(line))
                writer.WriteLine(line);
        }
    }
}

How can I enhance the performance of my code?


5 solutions

#1


4  

You may get some speedup by using PLINQ to do the work in parallel; switching from a list to a hash set will also greatly speed up the Contains() check. HashSet is thread-safe for read-only operations.


private HashSet<string> _hshLineToRemove;

void ProcessFiles()
{
    var inputLines = File.ReadLines(_inputFileName);
    var filteredInputLines = inputLines.AsParallel().AsOrdered().Where(line => !_hshLineToRemove.Contains(line));
    File.WriteAllLines(_outputFileName, filteredInputLines);
}
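
The sketch assumes _hshLineToRemove is already populated. If your removal strings currently live in the _lstLineToRemove list from the question, building the set from it is a one-liner (pass a StringComparer if the comparison should ignore case):

    // Build the lookup set once from the existing list (field names taken from the question).
    _hshLineToRemove = new HashSet<string>(_lstLineToRemove);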

If it does not matter that the output file is in the same order as the input file, you can remove the .AsOrdered() call and get some additional speed.


Beyond this you are really just I/O bound; the only way to make it any faster is to run it on faster drives.


#2


0  

The code is particularly slow because the reader and writer never execute in parallel. Each has to wait for the other.


You can almost double the speed of file operations like this by having a reader thread and a writer thread. Put a BlockingCollection between them so you can communicate between the threads and limit how many rows you buffer in memory.


If the computation is really expensive (it isn't in your case), a third thread with another BlockingCollection doing the processing can help too.

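A minimal sketch of that reader/writer split, assuming the _inputFileName, _outputFileName and _hshLineToRemove fields used elsewhere in this thread; the bounded capacity of 10,000 lines is an arbitrary example value, not a tuned one:

    // Requires: using System.Collections.Concurrent; using System.IO; using System.Threading.Tasks;
    void ProcessFilesPipelined()
    {
        // Bounded queue so the reader cannot run arbitrarily far ahead of the writer.
        var queue = new BlockingCollection<string>(boundedCapacity: 10000);

        var readerTask = Task.Run(() =>
        {
            foreach (var line in File.ReadLines(_inputFileName))
            {
                if (!_hshLineToRemove.Contains(line))   // filter on the reader side
                    queue.Add(line);
            }
            queue.CompleteAdding();                     // tell the writer no more lines are coming
        });

        var writerTask = Task.Run(() =>
        {
            using (var writer = new StreamWriter(_outputFileName))
            {
                foreach (var line in queue.GetConsumingEnumerable())
                    writer.WriteLine(line);
            }
        });

        Task.WaitAll(readerTask, writerTask);
    }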

#3


0  

Do not use buffered text routines. Use binary, unbuffered library routines and make your buffer size as big as possible. That's how to make it the fastest.

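One way to read this advice in C#, as a sketch in that spirit (large explicit buffers and a sequential-scan hint) rather than a literal unbuffered binary implementation: open the underlying FileStreams yourself instead of relying on the small defaults. The 1 MB buffer size below is an arbitrary example, not a measured optimum, and the field names are taken from the question.

    // Requires: using System.IO; using System.Text;
    const int BufferSize = 1 << 20; // 1 MB; example value, not a measured optimum

    using (var input = new FileStream(_inputFileName, FileMode.Open, FileAccess.Read,
                                      FileShare.Read, BufferSize, FileOptions.SequentialScan))
    using (var output = new FileStream(_outputFileName, FileMode.Create, FileAccess.Write,
                                       FileShare.None, BufferSize))
    using (var reader = new StreamReader(input, Encoding.UTF8, true, BufferSize))
    using (var writer = new StreamWriter(output, Encoding.UTF8, BufferSize))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (!_hshLineToRemove.Contains(line))
                writer.WriteLine(line);
        }
    }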

#4


0  

Have you considered using AWK?


AWK is a very powerful tool for processing text files. You can find more information about how to filter lines that match certain criteria here: Filter text with AWK.


#5


0  

From what I can see, the Read and Write parts of your code should normally finish well inside the "15 minutes" for 1GB that you quote in your comments. I can process more than 1GB per minute on my laptop using your read and write code. I cannot say whether the processing where you skip certain lines is well optimized or not, but that is beside the point I am about to make.


Because the Read and Write methods you use should normally already be "fast", I recommend the following strategy to determine the best speed you could hope to get close to, and where the bottleneck behind your slow run lies.


  1. Manually copy this large file from the source area to the destination area and note the time the copy takes. If this time is already too slow, your problem is most likely the machine you are using. But you could just as easily be killing your performance by copying from or to a network drive, working only on a network drive, or something like that (USB drives, drives that are already under a very high I/O load, etc.).
  2. Adjust your code so it simply reads the file and writes it back out without any extra processing, and note the time the task takes (a minimal pass-through sketch follows this list). If you notice a big difference from step 1, you need to optimize this part first. I see some good suggestions here to try, and sometimes the answer can be exotic.
  3. If the times from steps 1 and 2 are almost the same, and both are nice and fast, then the processing you perform between the read and the write is the problem, and you need to optimize that part of the code. Gradually add code back until you identify the bottleneck. Loops, string operations, lists, and dictionaries can all murder your performance, but so can a simple logic mistake. I see some suggestions here for using a HashSet, etc., that could help speed up potentially slow parts of your code, but you need to understand why it is slow, or get lucky trying out random changes (not recommended).
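
A minimal timing harness for step 2, as mentioned above: a straight line-by-line pass-through copy with no filtering, so the elapsed time approximates your raw read/write cost. The paths in the usage comment are placeholders.

    // Requires: using System; using System.Diagnostics; using System.IO;
    static TimeSpan TimePassThroughCopy(string inputFileName, string outputFileName)
    {
        var sw = Stopwatch.StartNew();

        using (var reader = new StreamReader(inputFileName))
        using (var writer = new StreamWriter(outputFileName))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                writer.WriteLine(line);   // pass-through only: no Contains() check
        }

        sw.Stop();
        return sw.Elapsed;
    }

    // Example: Console.WriteLine(TimePassThroughCopy(@"C:\data\input.txt", @"C:\data\output.txt"));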
