I've spent days trying to use awk, sed, cut, and tr to solve my problem. I have a data set that's delimited with "@", as shown below...
1@2@11@11/8@11/8@11@11/2
2@4@31 1/2@31 1/2@31/2@21@21/2
3@10@116 1/4@98@911 3/4@410@38 1/2
4@1@21@21/8@21/8@33@49 1/4
5@11@74@75@67 1/2@511 1/2@511 1/2
6@9@106@108 1/4@89 1/4@613 1/2@616
7@7@96@118 1/4@1313 1/2@715@717 3/4
8@12@127 3/4@129 3/4@1212 1/2@816 1/2@817 3/4
9@6@63@ 64 1/2@79@916 1/2@918
10@13@139 3/4@1311 1/4@1112@1017@1019 3/4
11@3@42@42@43 1/2@1118 1/2@1126 1/4
12@5@84 1/2@87@1011 3/4@1219 1/2@1228 1/4
13@8@52 1/2@53 1/2@57@1324@1332 3/4
What I would like to do is split the ranks (numbers like the row numbers in the first column) from the rest of the digits in each of the other columns, from column 3 onward. The end result would look like this...
1@2@1@1@1@1/8@1@1/8@1@1@1@1/2
2@4@3@1 1/2@3@1 1/2@3@1/2@2@1@2@1/2
3@10@11@6 1/4@9@8@9@11 3/4@4@10@3@8 1/2
4@1@2@1@2@1/8@2@1/8@3@3@4@9 1/4
5@11@7@4@7@5@6@7 1/2@5@11 1/2@5@11 1/2
6@9@10@6@10@8 1/4@8@9 1/4@6@13 1/2@6@16
7@7@9@6@11@8 1/4@13@13 1/2@7@15@7@17 3/4
8@12@12@7 3/4@12@9 3/4@12@12 1/2@8@16 1/2@8@17 3/4
9@6@6@3@6@4 1/2@7@9@9@16 1/2@9@18
10@13@13@9 3/4@13@11 1/4@11@12@10@17@10@19 3/4
11@3@4@2@4@2@4@3 1/2@11@18 1/2@11@26 1/4
12@5@8@4 1/2@8@7@10@11 3/4@12@19 1/2@12@28 1/4
13@8@5@2 1/2@5@3 1/2@5@7@13@24@13@32 3/4
I was thinking that I could use an "if statement". Something like "if the integer starts with [2-9], then split after one character; elif it starts with [1] and its length is 3 or more (before the space and fraction), then split after the first two characters." I have no idea how to go about solving this problem. I have thousands of similar files and need to change the structure of all of them, so the solution will have to be run in a loop.
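The rule described above can be written out fairly directly. What follows is only a minimal Python sketch of it, not a tested tool: the names `split_rank` and `convert_line` are made up for illustration, and it assumes (as the sample data suggests, though the question does not say so) that a two-digit whole part starting with 1 is a one-digit rank followed by a one-digit value.

```python
import re


def split_rank(field: str) -> str:
    """Insert '@' between the leading rank and the value in one field."""
    field = field.strip()  # row 9 of the sample has a stray space after '@'
    # The whole-number part is everything before the first space or '/'.
    whole = re.split(r"[ /]", field, maxsplit=1)[0]
    if not whole.isdigit() or len(whole) < 2:
        return field  # nothing to split off
    first = whole[0]
    if first in "23456789":
        cut = 1  # ranks 2-9 are a single character
    elif first == "1" and len(whole) >= 3:
        cut = 2  # assumed: ranks 10-19 when three or more digits follow
    elif first == "1":
        cut = 1  # assumed: "1x" means rank 1 followed by the digit x
    else:
        return field  # leading 0 is not covered by the rule; leave as-is
    return field[:cut] + "@" + field[cut:]


def convert_line(line: str) -> str:
    """Leave the first two columns alone and split every later column."""
    fields = line.rstrip("\n").split("@")
    return "@".join(fields[:2] + [split_rank(f) for f in fields[2:]])
```

A field such as 110 is inherently ambiguous (rank 1 with value 10, or rank 11 with value 0); this sketch follows the stated rule and treats it as rank 11.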
4 solutions
#1
Here's a fun one:
perl -F@ -lape '$_ = join "@", shift(@F), shift(@F), map {s/(1\d|\d)(\d+)/$1\@$2/g; $_} @F' file
With a little commentary
perl -F@ -lape '
    $_ = join "@",              # join the following things, using "@"
        shift(@F),              # the first field
        shift(@F),              # the second field
        map {                   # then, transform the rest with this expr
            s{                  # search for:
                (1\d | \d)      # 1 plus a digit, or a digit
                (\d+)           # followed by one or more digits
            }{$1\@$2}xg;        # add an "@" in between
            $_                  # and return the new string
        } @F
' file
The options:
- -a and -F@ -- split each line into the array @F, using the @ character as the separator
- -l -- handle line endings automatically
- -p -- automatically print the variable $_ after processing each line
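For readers who don't use Perl, the same substitution can be expressed in Python. This is only a rough port for illustration: the names `RANK_VALUE` and `convert_line` are invented here, and unlike the one-liner it also strips stray spaces around each field (the space after the "@" in row 9 would otherwise be carried into the output).

```python
import re

# Same pattern as the Perl one-liner: a rank ("1" plus a digit, or a
# single digit) followed by one or more digits.
RANK_VALUE = re.compile(r"(1\d|\d)(\d+)")


def convert_line(line: str) -> str:
    """Keep the first two fields; insert '@' after the rank in the rest."""
    fields = [f.strip() for f in line.rstrip("\n").split("@")]
    rest = [RANK_VALUE.sub(r"\1@\2", f) for f in fields[2:]]
    return "@".join(fields[:2] + rest)
```

Because the substitution is applied globally within each field, a fraction with a two-digit numerator or denominator (none appear in the sample data) would be split as well. Passing `count=1` to `sub` would restrict the change to the first match in each field.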
#2
Here is pretty much a transcription of the logic as you described it, in awk (I added the assumption that a number starting with 1 and having length 2 should be split after the first character). I also noticed that in row 9 there is a space after the @ delimiter, so I added that possibility to the field separator, as you can see in the BEGIN block. Maybe you don't need that with the real data, but be aware of it. In the end I did get your expected output, but you will probably want to desk-check this on larger data sets in case there are more cases that haven't been taken into account.
$ cat jd.awk
BEGIN { FS = " *@ *"; OFS = "@" }
{
    for (i=3; i<=NF; ++i) {
        # if integers start with [2-9] then split after one character
        if (substr($i, 1, 1) ~ /[2-9]/) {
            $i = substr($i, 1, 1) "@" substr($i, 2)
        }
        else {
            split($i, parts, "[ /]")
            # else if it starts with [1] and length is equal to 2
            # (before the space and fraction) then split the first character
            if (substr($i, 1, 1) == "1" && length(parts[1]) == 2) {
                $i = substr($i, 1, 1) "@" substr($i, 2)
            }
            # else if it starts with [1] and length is equal to 3 or more
            # (before the space and fraction) then split the firsts two characters.
            else if (substr($i, 1, 1) == "1" && length(parts[1]) >= 3) {
                $i = substr($i, 1, 2) "@" substr($i, 3)
            }
        }
    }
    print
}
$ cat jd.txt
1@2@11@11/8@11/8@11@11/2
2@4@31 1/2@31 1/2@31/2@21@21/2
3@10@116 1/4@98@911 3/4@410@38 1/2
4@1@21@21/8@21/8@33@49 1/4
5@11@74@75@67 1/2@511 1/2@511 1/2
6@9@106@108 1/4@89 1/4@613 1/2@616
7@7@96@118 1/4@1313 1/2@715@717 3/4
8@12@127 3/4@129 3/4@1212 1/2@816 1/2@817 3/4
9@6@63@ 64 1/2@79@916 1/2@918
10@13@139 3/4@1311 1/4@1112@1017@1019 3/4
11@3@42@42@43 1/2@1118 1/2@1126 1/4
12@5@84 1/2@87@1011 3/4@1219 1/2@1228 1/4
13@8@52 1/2@53 1/2@57@1324@1332 3/4
$ awk -f jd.awk jd.txt
1@2@1@1@1@1/8@1@1/8@1@1@1@1/2
2@4@3@1 1/2@3@1 1/2@3@1/2@2@1@2@1/2
3@10@11@6 1/4@9@8@9@11 3/4@4@10@3@8 1/2
4@1@2@1@2@1/8@2@1/8@3@3@4@9 1/4
5@11@7@4@7@5@6@7 1/2@5@11 1/2@5@11 1/2
6@9@10@6@10@8 1/4@8@9 1/4@6@13 1/2@6@16
7@7@9@6@11@8 1/4@13@13 1/2@7@15@7@17 3/4
8@12@12@7 3/4@12@9 3/4@12@12 1/2@8@16 1/2@8@17 3/4
9@6@6@3@6@4 1/2@7@9@9@16 1/2@9@18
10@13@13@9 3/4@13@11 1/4@11@12@10@17@10@19 3/4
11@3@4@2@4@2@4@3 1/2@11@18 1/2@11@26 1/4
12@5@8@4 1/2@8@7@10@11 3/4@12@19 1/2@12@28 1/4
13@8@5@2 1/2@5@3 1/2@5@7@13@24@13@32 3/4
#3
This might work for you (GNU sed):
sed -r 's/^(([^@]*@){2})/\1\n/;ta;:a;/\n[0-9]?$/s/\n//;t;/\n(1[0-9]|[0-9])([0-9][0-9]?)/s//\1@\2\n/;ta;/\n([0-9]?[^0-9\n]) ?/s//\1\n/;ta' file
This inserts a newline after the second field, then loops over a series of pattern matches. Each successive match moves the newline forward until it reaches the end of the line, where the newline is removed.
#4
Thank you for all of your quick responses.
They helped me with the parsing of my data sets, which needed to be done before an upload. The solution that I ended up using was the one based on Perl, due to its simplicity. Again, thank you for your answers.
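Since the question mentions thousands of files, here is one way the per-file loop might look. This is a sketch only: `convert_tree`, the `*.txt` pattern, and the choice to write converted copies into a separate directory rather than editing in place are all assumptions made for illustration, and the per-line conversion is a Python rendering of the Perl substitution rather than the Perl command itself.

```python
import re
from pathlib import Path

# Rank ("1" plus a digit, or a single digit) followed by more digits.
RANK_VALUE = re.compile(r"(1\d|\d)(\d+)")


def convert_line(line: str) -> str:
    """Keep the first two fields; insert '@' after the rank in the rest."""
    fields = [f.strip() for f in line.split("@")]
    rest = [RANK_VALUE.sub(r"\1@\2", f) for f in fields[2:]]
    return "@".join(fields[:2] + rest)


def convert_tree(src_dir, dst_dir, pattern="*.txt"):
    """Convert every matching file in src_dir into dst_dir.

    Originals are left untouched; returns the list of files written.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.glob(pattern)):
        lines = path.read_text().splitlines()
        out = dst / path.name
        out.write_text("".join(convert_line(l) + "\n" for l in lines))
        written.append(out)
    return written
```

Writing to a separate directory makes it easy to diff a few converted files against the originals before trusting the result on the whole set.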