Most efficient way to make sure a line exists in a plain text file

Date: 2022-09-23 17:12:59

I'm using C# (.Net 2.0), and I have a fairly large text file (~1600 lines on average) that I need to check periodically to make sure a certain line of text is there.

What is the most efficient way of doing this? Do I really have to load the entire file into memory each time?

Is there a file-content-search api of some sort that I could use?

Thanks for any help/advice.

7 Answers

#1


Well, you can always use a FileSystemWatcher to give you an event when the file has changed; that way you only scan the file on demand.

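For illustration, here is a minimal sketch of wiring one up; the directory, file name, and notify filters are assumptions for the example, not part of the original answer:

using System;
using System.IO;

class Watcher
{
    static void Main()
    {
        // Watch a single file for writes; the path and name are placeholders.
        FileSystemWatcher watcher = new FileSystemWatcher(@"C:\data", "status.txt");
        watcher.NotifyFilter = NotifyFilters.LastWrite | NotifyFilters.Size;
        watcher.Changed += delegate(object sender, FileSystemEventArgs e)
        {
            // Re-scan the file here, only when it has actually changed.
            Console.WriteLine("Changed: " + e.FullPath);
        };
        watcher.EnableRaisingEvents = true;
        Console.ReadLine(); // keep the process alive while watching
    }
}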

#2


If the line of text is always going to be the same, then using Regex to match the text of the line is probably more efficient than looping through the file and comparing each line using String.Equals() or ==.

That said, I don't know of any way in C# to find text in a file without opening the file into memory and reading the lines.

This link is a nice tutorial on using RegEx to match lines in a file using c#.

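As a rough sketch of that idea (the pattern is a placeholder for the actual line, and the \r? lets the match tolerate Windows line endings):

using System.IO;
using System.Text.RegularExpressions;

class WholeFileRegexSearch
{
    static bool ContainsLine(string path)
    {
        // Multiline mode anchors ^ and $ at line boundaries, so the pattern
        // can only match a whole line, like a full-string comparison would.
        Regex target = new Regex("^expected line of text\r?$",
                                 RegexOptions.Multiline | RegexOptions.Compiled);
        return target.IsMatch(File.ReadAllText(path));
    }
}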

#3


Unless they are very long lines, in modern computing terms 1600 lines is not a lot! The file IO will be handled by the runtime, and will be buffered, and will be astonishingly fast, and the memory footprint astonishingly unremarkable.

Simply read the file line by line, or use System.IO.File.ReadAllLines(), and then see if the line exists, e.g. using a whole-line comparison with a string.

This isn't going to be your bottleneck.

Your bottleneck might occur if you are polling frequently and/or using regular expressions unnecessarily. It's best to use a file system watcher to avoid parsing the file at all if it is unchanged.

#4


It really depends on your definition of "efficient".

If you mean memory-efficient, then you could use a stream reader so that you only have one line of text in memory at a time; unfortunately this is slower than loading the whole thing in at once, and it may lock the file.

If you mean in the shortest possible time, then this is a task that will gain great benefits from a parallel architecture. Split the file into chunks and pass each chunk off to a different thread to process. Of course that isn't especially CPU efficient, as it may put all your cores at a high level of usage.

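As a rough sketch of that chunked approach using .NET 2.0-era APIs (the chunk count and the shared found flag are illustrative simplifications, not a tuned implementation):

using System;
using System.IO;
using System.Threading;

class ParallelSearch
{
    static bool ContainsLine(string path, string target, int chunkCount)
    {
        string[] lines = File.ReadAllLines(path);
        Thread[] workers = new Thread[chunkCount];
        int chunkSize = (lines.Length + chunkCount - 1) / chunkCount;
        bool found = false;

        for (int i = 0; i < chunkCount; i++)
        {
            // Each worker scans its own slice of the line array.
            int start = i * chunkSize;
            int end = Math.Min(start + chunkSize, lines.Length);
            workers[i] = new Thread(delegate()
            {
                for (int j = start; j < end && !found; j++)
                {
                    if (lines[j] == target)
                        found = true; // only ever flips false -> true
                }
            });
            workers[i].Start();
        }

        foreach (Thread worker in workers)
            worker.Join();
        return found;
    }
}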

If you are looking to do the least amount of work, is there anything you already know about the file? How often will it be updated? Are the first 10 characters of each line always the same? If you looked at 100 lines last time, do you need to rescan those lines again? Any of these could create huge savings in both time and memory usage.

At the end of the day, though, there is no magic bullet, and searching a file is (in the worst case) an O(n) operation.


Sorry, I just re-read that, and it may come across as sarcastic, which I don't mean it to be. I just meant to emphasize that any gains you make in one area are likely to be losses elsewhere, and "efficient" is a very ambiguous term in circumstances like these.

#5


List<String> lines = System.IO.File.ReadAllLines(file).ToList();
lines.Contains("foo");
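
Note that ToList() requires LINQ (System.Linq), which isn't available on the question's .NET 2.0 target; a rough 2.0-compatible equivalent, with "file" again being the path, would be:

// Array.IndexOf returns -1 when the value is not present.
bool found = Array.IndexOf(System.IO.File.ReadAllLines(file), "foo") >= 0;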

#6


You should be able to just loop over the lines like this:

// Assumes "file" is an open StreamReader and "pattern" is the regular
// expression for the line being sought (Regex lives in
// System.Text.RegularExpressions).
String line;
while ((line = file.ReadLine()) != null)
{
    if (Regex.IsMatch(line, pattern))
        return true;
}
return false;

The ReadLine method only loads a single line of the file into memory, not the whole file. When the loop runs again, the previous reference to that line is dropped, and so the line will be garbage collected when needed.

#7


I would combine a couple of techniques used here:

1). Set a FileSystemWatcher on the file. Set the necessary filters to prevent false positives. You don't want to check the file unnecessarily.

2). When the FSW raises the event, grab the contents using string fileString = File.ReadAllText().

3). Use a simple regex to find the match for your string.

4). If the match succeeds, then the file contains the string, and the match's index tells you where it is; see the sketch below.

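Putting steps 2-4 together, the event handler might look roughly like this; the pattern is a placeholder for the actual line, and the watcher setup from #1 is assumed:

using System;
using System.IO;
using System.Text.RegularExpressions;

class Checker
{
    // Handler for the FileSystemWatcher's Changed event.
    static void OnChanged(object sender, FileSystemEventArgs e)
    {
        string fileString = File.ReadAllText(e.FullPath);
        Match match = Regex.Match(fileString, "^expected line of text\r?$",
                                  RegexOptions.Multiline);
        if (match.Success)
        {
            // The line occurs at character offset match.Index in the file.
            Console.WriteLine("Found at offset " + match.Index);
        }
    }
}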

You've successfully avoided having to parse the file line by line, though you have potentially loaded a large amount of data (although 1600 lines of text is hardly that large) into memory. When the string goes out of scope it'll be reclaimed by the garbage collector.
