Flatten a nested list pattern using a sed/perl-like regex or awk

Date: 2022-03-28 20:15:06

I have a simple nested list-type pattern that I'd like to flatten, so that each child item is prefixed with its parent item, using a regular expression (if that's possible) in sed or command-line perl. I appreciate that it's fairly trivial to do this using loops/recursion in a simple perl program, but I'm interested in whether it can be done via regex. If it's not possible via regex, then I'll consider alternatives via awk or similar (e.g. trivial perl on the command line) that can still be easily used in a Unix pipe.

Notes / Assumptions:

  1. For my particular usage, I'm piping input/output as part of a larger data transformation chain that already has several command line perl regexes, hence the preference to be consistent.
  2. Performance isn't a particular concern - there will be fewer than 100 items in the list and items will typically be shorter than 50 characters.
  3. No requirement to handle edge cases like parents with no children, or badly formatted list structures (assume the data is in the correct format).
  4. The tokens that delimit parent/child items are unimportant - the example below is using '< ' for parent and '> ' for child, but these could be anything.
  5. The separator between parent and child in the output is unimportant - the example below is using '.' just as an example.
  6. There is only one level of nesting (assume I can derive how to manage further nesting levels, should I need to).
  7. The number of parents and the number of children (total and per parent) are unknown.
  8. The number of children can differ between parents.

Example input:

< Parent1
> Child1
> Child2
< Parent2
> Child3
< Parent3
> Child4
> Child5
> Child6
> Child7

Desired output:

Parent1.Child1
Parent1.Child2
Parent2.Child3
Parent3.Child4
Parent3.Child5
Parent3.Child6
Parent3.Child7

Best attempt:

perl -0pe 's/< (.*)\n> (.*)\n/\1.\2\n/g'

Best attempt output:

Parent1.Child1
> Child2
Parent2.Child3
Parent3.Child4
> Child5
> Child6
> Child7

Obviously my best attempt only handles the first child of each parent as part of the multi-line match. I know why; I just don't know what technique would allow the parent capture group to be printed again for each child capture group.

Thanks in advance.

3 solutions

#1


2  

Not bothering with a regex, but using perl

perl -lne '$p=$_ if s/< //; print "$p.$_" if s/> //' file.txt

Btw, the reason using a single regex for this problem is silly is that you're trying to do more than one transformation. You want to prefix the children with their parent's name, and you also want to strip the parent lines. Those are 2 distinct operations, so trying to dream up a way to combine them doesn't make any sense.

The below uses 3 regexes to accomplish the transformation that you want, but obviously the above is a lot more clear.

perl -0777 -pe '
    s/(^<.*\n)((?:>.*\n)*)/$2$1/mg;
    s/^> (?=.*?^< ([^\n]*))/$1./smg;
    s/^<.*\n//mg;
  ' file.txt

#2


1  

Using sed

sed '/</{h;ba};G;s/[><] //g;s/\(.*\)\n\(.*\)/\2\.\1/p;:a;d' file

Explanation

  • This is an if-then-else-fi construct written in sed.
  • /</ acts like the condition in the if
  • {h;ba}; acts like the commands after then
  • G;s/[><] //g;s/\(.*\)\n\(.*\)/\2\.\1/p; acts like the commands after else
  • :a;d acts like fi
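
The if-then-else shape is perhaps easier to follow with one command per line. The sketch below is a variant, not the exact script above: it assumes sed -n with a bare b (branch to the end of the script) in place of the :a label and the final d, and the names flatten_sed and input are invented for the example.

```shell
#!/bin/sh
# Shortened sample data based on the question (variable name is illustrative).
input='< Parent1
> Child1
> Child2
< Parent2
> Child3'

# flatten_sed: hypothetical wrapper, reads the list on stdin.
flatten_sed() {
  sed -n '
    # if: the line is a parent line
    /^</{
      # then: save it in the hold space and skip the rest of the script
      h
      b
    }
    # else: append the saved parent, giving "> Child" newline "< Parent"
    G
    # remove both markers
    s/[><] //g
    # swap the two halves around a dot and print the result
    s/\(.*\)\n\(.*\)/\2.\1/p
  '
}

printf '%s\n' "$input" | flatten_sed
```

With -n nothing is printed automatically, so parent lines vanish simply because the then branch never prints them.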

Using awk

awk '/^</{s=$2;next}{$0=s"."$2}1' file
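
Note that $2 is only the first word after the marker, which is fine for the sample data. If item names could contain spaces, a variant along these lines might be worth considering. It is a sketch under that assumption; flatten_awk and the sample names are made up for the example.

```shell
#!/bin/sh
# Hypothetical sample data with spaces in the item names.
input='< Parent One
> Child A
> Child B
< Parent Two
> Child C'

# flatten_awk: illustrative wrapper, reads the list on stdin.
flatten_awk() {
  awk '
    # Parent line: sub() strips the marker and returns 1, so the action runs.
    sub(/^< /, "") { parent = $0; next }
    # Child line: strip the marker and print parent, separator, child.
    sub(/^> /, "") { print parent "." $0 }
  '
}

printf '%s\n' "$input" | flatten_awk
```

Using sub() as the pattern keeps the whole remainder of the line in $0 instead of relying on field splitting.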

#3


0  

This might work for you (GNU sed):

sed -r '$!N;/^(< (Parent.*))\n> (Child.*)/{s//\2.\3\n\1/;P};D' file

This pairs each Parent with the Child that follows it; when two Parents occur in a row, it discards the first Parent.

N.B. the literal Parent and Child in the regexp are superfluous:

sed -r '$!N;/^(< (.*))\n> (.*)/{s//\2.\3\n\1/;P};D' file

would work also.
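
The N/P/D loop can be hard to read on one line, so here is the same script laid out with a comment per command. It is a sketch that assumes GNU sed (for \n in the replacement) and uses -E, which GNU sed accepts as an equivalent of -r; the names flatten_npd and input are hypothetical.

```shell
#!/bin/sh
# Shortened sample data based on the question (variable name is illustrative).
input='< Parent1
> Child1
> Child2
< Parent2
> Child3'

# flatten_npd: illustrative wrapper, reads the list on stdin.
flatten_npd() {
  sed -E '
    # Append the next input line to the pattern space (except on the last line).
    $!N
    # If the pattern space is a parent line followed by a child line ...
    /^(< (.*))\n> (.*)/{
      # ... rewrite it as Parent.Child, a newline, then the original parent line,
      s//\2.\3\n\1/
      # and print only the first line (the flattened pair).
      P
    }
    # Delete up to the first newline and restart without reading new input.
    D
  '
}

printf '%s\n' "$input" | flatten_npd
```

After each D the parent line is what remains in the pattern space, so it gets paired with the next child on the following pass, or dropped when another parent arrives.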
