SED - multiline regular expression

Time: 2021-07-28 15:29:10

I'm stuck with this for several hours now and cycled through a wealth of different tools to get the job done. Without success. It would be fantastic, if someone could help me out with this.

Here is the problem:

I have a very large CSV file (400mb+) that is not formatted correctly. Right now it looks something like this:

This is a long abstract describing something. What follows is the tile for this sentence."   
,Title1  
This is another sentence that is running on one line. On the next line you can find the title.   
,Title2

As you can probably see, the titles ",Title1" and ",Title2" should actually be on the same line as the preceding sentence. The result would then look something like this:

This is a long abstract describing something. What follows is the tile for this sentence.",Title1  
This is another sentence that is running on one line. On the next line you can find the title.,Title2

Please note that the end of the sentence can contain quotes or not. In the end they should be replaced too.

Here is what I came up with so far:

sed -n '1h;1!H;${;g;s/\."?.*,//g;p;}' out.csv > out1.csv

This should actually get the job done of matching the expression over multiple lines. Unfortunately it doesn't :)

The expression is looking for the dot at the end of the sentence and the optional quotes plus a newline character that I'm trying to match with .*.

Help much appreciated. And it doesn't really matter what tool gets the job done (awk, perl, sed, tr, etc.).

Thanks, Chris

2 solutions

#1


16  

Multiline in sed isn't necessarily tricky per se; it's just that it uses commands most people aren't familiar with, and those commands have certain side effects. For example, when you use 'N' to append the next line to the pattern space, the current line and the next line end up separated by a '\n'.
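To make that embedded newline visible, here is a minimal sketch. The two-line input and the function name `show_pattern_space` are made up for illustration. `N` appends the second line, and `l` prints the pattern space in an unambiguous form, so the separator shows up as a literal `\n` and the end of the pattern space as `$`.

```shell
# Hypothetical two-line input, not taken from the CSV in the question.
# N appends line 2 to the pattern space; l prints it in visible form.
show_pattern_space() {
  printf 'first\n,second\n' | sed -n 'N;l'
}

show_pattern_space   # prints: first\n,second$
```

This is why an address such as /\n,/ can be used to ask "does the appended line start with a comma?".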

Anyway, it's much easier if you match on a line that starts with a comma to decide whether or not to remove the newline, so that's what I did here:

sed 'N;/\n,/s/"\? *\n//;P;D' title_csv

Input

$ cat title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence."
,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.
,Title2
also, don't touch this line

Output

$ sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence.,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.,Title2
also, don't touch this line

#2


12  

Yours works with a couple of small changes:

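sed -n '1h;1!H;${;g;s/"\?\n,/,/g;p;}' inputfile

The ? needs to be escaped as \? (otherwise it is a literal question mark), and the newline is matched explicitly with \n rather than with .*, which is greedy and would run on to the last comma in the file. The period is left out of the pattern and the comma is put back by the replacement, so neither is deleted from the joined line.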
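A small sketch of the two points, using made-up inputs and hypothetical function names. It assumes GNU sed, where `\?` is an extension to basic regular expressions.

```shell
# Made-up inputs; \? is a GNU sed extension to basic regular expressions.
strip_unescaped() { printf 'end."\n' | sed 's/"?$//'; }   # ? is a literal question mark here
strip_escaped()   { printf 'end."\n' | sed 's/"\?$//'; }  # \? makes the quote optional

# Matching the newline explicitly with \n after N has joined two lines.
join_on_comma()   { printf 'end.\n,Title\n' | sed 'N;s/\n,/,/'; }

strip_unescaped   # prints: end."
strip_escaped     # prints: end.
join_on_comma     # prints: end.,Title
```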

Here's another way to do it which doesn't require using the hold space:

sed -n '${p;q};N;/\n,/{s/"\?\n//p;b};P;D' inputfile

Here is a commented version:

sed -n '
# on the last input line, print it and quit
${
  p
  q
}
# otherwise, append the next line
N
# if the appended line starts with a comma
/\n,/{
  # delete an optional quote and the newline, then print the result
  s/"\?\n//p
  # branch to the end of the script to start the next cycle
  b
}
# the appended line does not start with a comma, so print the first line
P
# delete the first line of the pair (it has just been printed) and loop to the top
D
' inputfile
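As a quick check, the one-liner above can be run against a small sample modelled on the question's data. The five input lines and the wrapper name `join_titles` are made up for illustration, and GNU sed is assumed because of `\?`.

```shell
# Hypothetical wrapper around the one-liner; reads stdin, writes stdout.
join_titles() {
  sed -n '${p;q};N;/\n,/{s/"\?\n//p;b};P;D'
}

# Sample modelled on the question; the contents are made up.
printf '%s\n' \
  'Abstract one."' \
  ',Title1' \
  'untouched line' \
  'Abstract two.' \
  ',Title2' | join_titles
# prints:
#   Abstract one.,Title1
#   untouched line
#   Abstract two.,Title2
```

Because this version only ever holds two lines in the pattern space, it does not have to read the whole 400mb+ file into memory the way the hold-space version does.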
