How do I split a huge file into words?

Date: 2022-07-22 13:33:04

How can I read a very long string from a text file and then process it (split it into words)?

I tried the StreamReader.ReadLine() method, but I get an OutOfMemoryException. Apparently, my lines are extremely long. This is my code for reading the file:

using (var streamReader = File.OpenText(_filePath))
{
    int lineNumber = 1;
    string currentString = String.Empty;
    while ((currentString = streamReader.ReadLine()) != null)
    {
        ProcessString(currentString, lineNumber);
        Console.WriteLine("Line {0}", lineNumber);
        lineNumber++;
    }
}

And the code which splits a line into words:

var wordPattern = @"\w+";
var matchCollection = Regex.Matches(text, wordPattern);
var words = (from Match word in matchCollection
             select word.Value.ToLowerInvariant()).ToList();
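
As a quick sanity check, the splitting snippet can be wrapped in a small helper and run on a short sample string (a minimal sketch; the `ExtractWords` wrapper, the class name, and the sample text are illustrative, not part of the original code):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

class WordSplitDemo
{
    // Illustrative wrapper around the question's splitting snippet.
    public static List<string> ExtractWords(string text)
    {
        var wordPattern = @"\w+";
        var matchCollection = Regex.Matches(text, wordPattern);
        return (from Match word in matchCollection
                select word.Value.ToLowerInvariant()).ToList();
    }

    static void Main()
    {
        var words = ExtractWords("Hello, World! Split_me twice");
        Console.WriteLine(string.Join(" ", words));
        // prints: hello world split_me twice
    }
}
```

Note that Regex.Matches needs the whole input string in memory, which is exactly what fails here on very long lines.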

3 Answers

#1


5  

You could read the file character by character, building up words as you go, and use yield return to make the enumeration deferred, so you don't have to hold the entire file in memory at once:

private static IEnumerable<string> ReadWords(string filename)
{
    using (var reader = new StreamReader(filename))
    {
        var builder = new StringBuilder();

        while (!reader.EndOfStream)
        {
            char c = (char)reader.Read();

            // Mimics regex /w/ - almost.
            if (char.IsLetterOrDigit(c) || c == '_')
            {
                builder.Append(c);
            }
            else
            {
                if (builder.Length > 0)
                {
                    yield return builder.ToString();
                    builder.Clear();
                }
            }
        }

        // Flush the final word once the stream ends (skip if empty).
        if (builder.Length > 0)
        {
            yield return builder.ToString();
        }
    }
}
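
For completeness, a small self-contained sketch of how the method might be consumed (the temp path and sample file contents are illustrative); because the enumerable is lazy, only one word is held in memory at a time. The method body is repeated here so the sketch compiles on its own:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

class ReadWordsDemo
{
    // Same logic as the answer's ReadWords, repeated so this compiles alone.
    public static IEnumerable<string> ReadWords(string filename)
    {
        using (var reader = new StreamReader(filename))
        {
            var builder = new StringBuilder();
            while (!reader.EndOfStream)
            {
                char c = (char)reader.Read();
                if (char.IsLetterOrDigit(c) || c == '_')
                {
                    builder.Append(c);
                }
                else if (builder.Length > 0)
                {
                    yield return builder.ToString();
                    builder.Clear();
                }
            }
            if (builder.Length > 0)
            {
                yield return builder.ToString();
            }
        }
    }

    static void Main()
    {
        var path = Path.Combine(Path.GetTempPath(), "words-demo.txt");
        File.WriteAllText(path, "one, two; three_3");

        // Words are produced one at a time; the file is never read whole.
        foreach (var word in ReadWords(path))
        {
            Console.WriteLine(word);
        }
        // prints: one, two, three_3 (one per line)
    }
}
```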

The code reads the file character by character; when it encounters a non-word character, it yields the word built up so far (and, thanks to the length check, does so only once per run of non-word characters). A StringBuilder is used to accumulate the current word.

Char.IsLetterOrDigit() behaves almost like the regex word class \w, except that \w also matches the underscore (among other characters), hence the extra check for '_'. If your input contains more characters you wish to treat as part of a word, you'll have to extend the if().

#2


0  

Cut it into bite-size sections. Instead of trying to read the whole 4 GB at once, try reading it as eight 500 MB chunks; that should help.
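
A minimal sketch of what chunked reading could look like, assuming a fixed char buffer (the 1 MB default and the `CountChars` example are illustrative; a real word splitter would also have to handle words that straddle a chunk boundary):

```csharp
using System;
using System.IO;

class ChunkReadDemo
{
    // Reads the file in fixed-size chunks, so memory use is bounded
    // by the buffer size rather than by line length.
    public static long CountChars(string path, int bufferSize = 1 << 20)
    {
        long total = 0;
        using (var reader = new StreamReader(path))
        {
            var buffer = new char[bufferSize];
            int read;
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Process buffer[0..read) here, e.g. scan it for word boundaries.
                total += read;
            }
        }
        return total;
    }

    static void Main()
    {
        var path = Path.Combine(Path.GetTempPath(), "chunk-demo.txt");
        File.WriteAllText(path, "hello world");
        Console.WriteLine(CountChars(path, 4));
        // prints: 11
    }
}
```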

#3


0  

Garbage collection may be a solution, though I am not sure it is the source of the problem. If it is, a simple GC.Collect() is often insufficient, and for performance reasons it should only be called when really required. Try the following procedure, which triggers garbage collection when the available memory is too low (below the threshold passed as a parameter):

int charReadSinceLastMemCheck = 0;
using (var streamReader = File.OpenText(_filePath))
{
    int lineNumber = 1;
    string currentString = String.Empty;
    while ((currentString = streamReader.ReadLine()) != null)
    {
        ProcessString(currentString, lineNumber);
        Console.WriteLine("Line {0}", lineNumber);
        lineNumber++;
        charReadSinceLastMemCheck += currentString.Length;
        if (charReadSinceLastMemCheck > 1000000)
        {
            // Check memory left every MB read, and collect garbage if required
            CollectGarbage(100);
            charReadSinceLastMemCheck = 0;
        }
    }
}

internal static void CollectGarbage(int SizeToAllocateInMb)
{
    long[,] TheArray;
    // Probe for free memory: each row of 125,000 longs is about 1 MB.
    try { TheArray = new long[SizeToAllocateInMb, 125000]; }
    catch { TheArray = null; GC.Collect(); GC.WaitForPendingFinalizers(); GC.Collect(); }
    TheArray = null;
}
