Multiline regex search across an entire file

Date: 2023-01-14 19:26:19

I've found loads of examples of how to replace text in files using regex. However, it all boils down to two versions:
1. Iterate over all lines in the file and apply the regex to each single line.
2. Load the whole file.

No. 2 is not feasible with "my" files - they're about 2GiB...
As to No. 1: this is currently my approach, but I was wondering... what if I need to apply a regex spanning more than one line?

6 Answers

#1


2  

Here's the Answer:
There is no easy way

I found a StreamRegex class which might be able to do what I am looking for.
From what I could grasp of the algorithm:

  • Start at the beginning of the file with an empty buffer
  • do (
    • add a chunk of the file to the buffer
    • if there is a match in the buffer
      • mark the match
      • drop all data which appeared before the end of the match from the buffer
  • ) while there is still something of the file left

That way it is not necessary to load the full file -- or at least the chances of loading the full file into memory are reduced...
However: the worst case is that there is no match in the whole file - in that case the full file will be loaded into memory.
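
Below is a minimal sketch of that buffered approach. The class and method names are illustrative (this is not the actual StreamRegex class): read the file chunk by chunk, keep a growing buffer, report any matches, and drop everything up to the end of the last match before reading the next chunk.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

static class ChunkedRegexSearch
{
    // Illustrative sketch only, not the real StreamRegex implementation.
    public static IEnumerable<string> Matches(string path, Regex regex, int chunkSize = 64 * 1024)
    {
        var buffer = new StringBuilder();
        var chunk = new char[chunkSize];

        using (var reader = new StreamReader(path))
        {
            int read;
            while ((read = reader.Read(chunk, 0, chunk.Length)) > 0)   // while there is still something of the file left
            {
                buffer.Append(chunk, 0, read);                         // add a chunk of the file to the buffer
                string text = buffer.ToString();
                int lastEnd = 0;

                foreach (Match m in regex.Matches(text))               // if there is a match in the buffer
                {
                    yield return m.Value;                              // "mark" (report) the match
                    lastEnd = m.Index + m.Length;
                }

                if (lastEnd > 0)
                    buffer.Remove(0, lastEnd);                         // drop all data before the end of the last match
            }
        }
    }
}
```

As described above, if the pattern never matches, the buffer ends up holding the whole file; also, a match that straddles a chunk boundary is only found once the rest of it has been read into the buffer.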

#2


1  

Regex is not the way to go, especially not with such large amounts of text. Create a little parser of your own:

  • read the file line by line;
  • for each line:
    • loop through the line char by char, keeping track of any opening/closing string literals
    • when you encounter '/*' (and you're not 'inside' a string), store that offset and loop until you encounter the first '*/' and store that offset as well

That will give you all the starting and closing offsets of the comment blocks. You should now be able to replace them by creating a temp file and writing the text from the original file to the temp file (and writing something else if you're inside a comment block, of course).
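
A rough sketch of such a scanner, assuming the goal is stripping /* ... */ blocks as in the answer above (the names are made up, and it ignores escape sequences and char literals for brevity):

```csharp
using System;
using System.IO;

static class CommentStripper
{
    // Sketch only: copies input to output, replacing each /* ... */ block
    // that occurs outside of string literals with a single space.
    public static void StripBlockComments(string inputPath, string outputPath)
    {
        using (var reader = new StreamReader(inputPath))
        using (var writer = new StreamWriter(outputPath))
        {
            bool inString = false;
            int c;
            while ((c = reader.Read()) != -1)
            {
                if (!inString && c == '/' && reader.Peek() == '*')
                {
                    reader.Read();                              // consume the '*'; we are inside a comment now
                    int prev = -1, cur;
                    while ((cur = reader.Read()) != -1)
                    {
                        if (prev == '*' && cur == '/') break;   // found the closing */
                        prev = cur;
                    }
                    writer.Write(' ');                          // write "something else" in place of the block
                    continue;
                }

                if (c == '"') inString = !inString;             // naive string-literal tracking (no escapes)
                writer.Write((char)c);
            }
        }
    }
}
```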

Edit: source files of 2GiB??

#3


0  

Perhaps you could load in 2 lines at a time (or more, depending on how many lines you think your matches are going to span) and overlap them, e.g.: load lines 1-2, then on the next loop load lines 2-3, then 3-4; and run your multiline regexes over the combined lines in each loop.
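
For example, a simple two-line sliding window could look roughly like this (illustrative names; note that a match lying entirely inside the shared line will be seen twice, so you may need to de-duplicate by offset):

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

static class SlidingWindowSearch
{
    // Runs a (possibly multiline) pattern over overlapping pairs of lines: 1-2, 2-3, 3-4, ...
    public static void Search(string path, Regex regex)
    {
        string previous = null;

        using (var reader = new StreamReader(path))
        {
            string current;
            while ((current = reader.ReadLine()) != null)
            {
                string window = previous == null ? current : previous + "\n" + current;

                foreach (Match m in regex.Matches(window))
                    Console.WriteLine(m.Value);

                previous = current;   // slide the window forward by one line
            }
        }
    }
}
```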

#4


0  

I would say you should pre-parse/normalize the data before doing your replacements, so that each line describes one possible set of data that needs replacements applied. Otherwise you run into data-integrity complications that cannot really be solved without a host of other difficulties.

If there is a way to chunk the data into logical blocks, then you could build a program that uses a MapReduce pattern to parse the data.

#5


0  

I'm with Bart; you really should be using some kind of parser for this.

Or, if you don't mind spawning a child process, you could just use sed (there's a native port on Windows, or you can use Cygwin).

#6


0  

If you don't mind getting your hands a little dirty (and your regex is simple enough, or perhaps you have a strong desire for speed and don't mind suffering a bit), you can use Ragel. It can target C#, though the site doesn't mention it. To use this with large files, though, you'll need to wrap a FileStream to provide a buffered indexer, or use a memory-mapped file (with unsafe pointers) in a 64-bit process.
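
For instance, a memory-mapped view can be handed to a byte-oriented scanner (such as Ragel-generated code) roughly like this. This is only a sketch: the newline-counting body is a placeholder for the real scanner, and it needs to be compiled with /unsafe.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

static class MappedScan
{
    // Sketch: map the whole file and expose a raw byte* range to whatever scanner you generate.
    public static unsafe void Scan(string path)
    {
        long length = new FileInfo(path).Length;

        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var view = mmf.CreateViewAccessor(0, length, MemoryMappedFileAccess.Read))
        {
            byte* p = null;
            view.SafeMemoryMappedViewHandle.AcquirePointer(ref p);
            try
            {
                // A Ragel-generated machine would typically take (p, p + length) as its data range.
                // As a stand-in, just count newlines to show the pointer is usable.
                long lines = 0;
                for (long i = 0; i < length; i++)
                    if (p[i] == (byte)'\n') lines++;
                Console.WriteLine("lines: " + lines);
            }
            finally
            {
                view.SafeMemoryMappedViewHandle.ReleasePointer();
            }
        }
    }
}
```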
