awk or sed: remove text before one character and after another in a file

Date: 2021-09-10 16:51:59

I have a file in which I am trying to use awk to remove the text before the (), while keeping the text inside the (). I also want to remove the whitespace and text after the _# and then print the whole line. Maybe sed is a better choice, but I am not sure how to do it.

file

chr4    100009839   100009851   426_1201_128(ADH5)_1    0   -
chr4    100006265   100006367   426_1202_128(ADH5)_2    0   -
chr4    100003125   100003267   426_1203_128(ADH5)_3    0   -

desired output

chr4    100009839   100009851   ADH5_1  
chr4    100006265   100006367   ADH5_2  
chr4    100003125   100003267   ADH5_3

awk attempt

awk -F'()_*' '{print $1,$2,$3,$4}' file

2 solutions

#1


awk -F'[\t()]' '{OFS="\t"; print $1, $2, $3, $5 $6}' file

Output:

chr4    100009839       100009851       ADH5_1
chr4    100006265       100006367       ADH5_2
chr4    100003125       100003267       ADH5_3

#2


Using sed with a substitution:

$ sed 's/[^ ]*(\([^)]*\))\(_[^ ]*\).*$/\1\2/' infile
chr4    100009839   100009851   ADH5_1
chr4    100006265   100006367   ADH5_2
chr4    100003125   100003267   ADH5_3

Taking apart the regex:

[^ ]*(       # Non-spaces up to and including opening parenthesis
\(           # Start first capture group
    [^)]*    # Content between parentheses: everything but a closing parenthesis
\)           # End of first capture group
)            # Closing parenthesis, not captured
\(           # Start second capture group
    _[^ ]*   # Underscore and non-spaces, '_1' etc.
\)           # End of second capture group
.*$          # Rest of line, not captured
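
Note that `[^ ]` in the expression above only treats a space as a column boundary, so it relies on the file being space-separated. If the columns are separated by tabs instead, a variant with `[^[:space:]]` may be more robust. The following is a sketch, not the answer's own command, and it assumes a sed that supports extended regular expressions through `-E` (GNU and BSD sed both do); the variable names are illustrative.

```shell
# Recreate one sample line, this time with tab-separated columns
line=$(printf 'chr4\t100009839\t100009851\t426_1201_128(ADH5)_1\t0\t-')

# With -E, \( and \) are literal parentheses and plain ( ) are capture groups.
# [^[:space:]] stops at spaces and tabs alike.
result=$(printf '%s\n' "$line" |
    sed -E 's/[^[:space:]]*\(([^)]*)\)(_[^[:space:]]*).*$/\1\2/')

printf '%s\n' "$result"
```

The structure is the same as in the breakdown above: the first group captures the text between the parentheses, the second captures the underscore suffix, and everything else from the start of the fourth column to the end of the line is dropped.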
