bash / regex for restructuring a partially inconsistent file using complex patterns

I need to reconstruct a file by splitting each line into 4 segments and inserting a delimiter, such as a pipe or a colon, between the segments. My problem is that the structure is somewhat inconsistent...

The file looks like this:

MIKE TESTUSER Some Text 21 - Etc BLA 43 BLA  - Some, Additional..12 info

STEVE NOBODY 43 More `Text and So on BLA (MORE ADDITIONAL info)

LEROY ANYONE Again some text chars numbers BLABLA

and I need to split it into name : address : city and optional zip : optional additional info, like this:

MIKE TESTUSER|Some Text 21 - Etc|BLA43 BLA|- Some, Additional..12 info

STEVE NOBODY|43 More `Text and So on|BLA|(MORE ADDITIONAL info)

LEROY ANYONE|Again some text chars, numbers|BLABLA

The first segment is always in uppercase, with no numbers or special characters. The second segment consists of anything except words in uppercase. The third segment is only uppercase and sometimes numbers. The last segment can be anything except words in uppercase.

It would be great if someone has a solution for this or can point me in a direction that gets me close (it doesn't have to be perfect).

First of all, thanks for the quick replies! I've tried to explode each line into array elements using the blanks and then check each element for upper/lowercase, numbers etc., somewhat like Charlie's awk approach. The problem is that I can't always determine where my delimiter has to be placed, since a segment sometimes ends with a number or non-alphanumeric character and the next segment starts with a number or non-alphanumeric character.

For example:

THIS NAME 23 Rue da guerre 321 12345 MARSEILLE - Info

should look like:

THIS NAME|23 Rue da guerre 321|12345 MARSEILLE|- Info

The file has a couple of thousand lines and is really messy. Quite often the zip code comes in front of the city, sometimes it comes behind it, and there are various other inconsistencies.

I know I'll have to re-edit it manually in any case, but I was hoping to find a solution that makes it less time-consuming :)
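For the zip-before-city case described above, one possible post-processing step is to look at lines that have already been split and move a zip code from the end of the address to the front of the city. The sketch below is in Python and rests on assumptions that are not in the original post: the helper name `move_zip_to_city` is hypothetical, and a zip is taken to be exactly five digits, as in the MARSEILLE sample. A five-digit house number at the end of an address would be moved by mistake.

```python
import re


def move_zip_to_city(segments, zip_pattern=r"\d{5}"):
    """Move a trailing zip code from the address segment to the city segment.

    `segments` is a list like [name, address, city, info...].
    `zip_pattern` is an assumption (five digits); adjust it to the real data.
    """
    if len(segments) < 3:
        return list(segments)          # nothing to fix without a city segment
    name, address, city, *rest = segments
    # The address must keep at least one non-blank token before the zip.
    match = re.fullmatch(rf"(.*\S)\s+({zip_pattern})", address)
    if match:
        address = match.group(1)
        city = f"{match.group(2)} {city}"
    return [name, address, city, *rest]


row = "THIS NAME|23 Rue da guerre 321 12345|MARSEILLE|- Info".split("|")
print("|".join(move_zip_to_city(row)))
# THIS NAME|23 Rue da guerre 321|12345 MARSEILLE|- Info
```

Lines whose address does not end in something zip-like are returned unchanged, so the step can be run over the whole file.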

3 solutions

#1

Must it be just bash? I'd seriously think about writing something like a simple Awk program.

Say, as a start:

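awk '{
    out = $1                               # the line starts with the upper-case name
    upper = 1                              # flag: currently inside an upper-case run
    for (i = 2; i <= NF; i++) {
        isup = ($i ~ /^[A-Z]+$/)           # is this field an all-caps word?
        if (isup != upper) {               # the run changed: emit the delimiter
            out = out "|" $i
            upper = isup
        } else {
            out = out " " $i
        }
        # numbers and punctuation still need cases of their own here
    }
    print out
}' file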

The idea is to check each blank-delimited field for case, and keep a flag to remember whether you're in a run of upper-case items or lower-case items. Whenever that changes, add your pipe character to the output.
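The same flag idea can be extended to the four segments from the question. The following is a minimal sketch in Python rather than awk, under a few assumptions of mine: the function names are hypothetical, an "upper-case word" means a token made only of upper-case letters, and plain numbers are allowed inside the city segment. A single capital letter inside an address (for example "A") would wrongly start the city segment.

```python
def is_upper_word(token):
    """True for tokens made only of upper-case letters, e.g. 'BLA'."""
    return token.isalpha() and token.isupper()


def split_line(line, delimiter="|"):
    """Split one line into name, address, city and info using a state flag."""
    segments = [[], [], [], []]            # name, address, city, info
    state = 0                              # index of the segment being filled
    for token in line.split():
        if state == 0 and not is_upper_word(token):
            state = 1                      # the name ends at the first non-caps token
        elif state == 1 and is_upper_word(token):
            state = 2                      # the city starts at the next caps word
        elif state == 2 and not (is_upper_word(token) or token.isdigit()):
            state = 3                      # anything else starts the trailing info
        segments[state].append(token)
    # Empty segments (e.g. no trailing info) are dropped from the output.
    return delimiter.join(" ".join(seg) for seg in segments if seg)


print(split_line("STEVE NOBODY 43 More `Text and So on BLA (MORE ADDITIONAL info)"))
# STEVE NOBODY|43 More `Text and So on|BLA|(MORE ADDITIONAL info)
```

A zip code that precedes the city ends up in the address segment with this approach, so such lines still need a second pass or manual editing.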

#2

You really need a full-blown language like Perl. It'd be something like this:

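use strict;
use warnings;

# Three-argument open with a lexical filehandle
open my $fh, '<', 'myFileName' or die qq(Can't open "myFileName" for reading: $!\n);
while (my $line = <$fh>) {
    chomp $line;
    # Four captures: name, address, city/zip, additional info
    if ($line =~ /([A-Z\s]+)(.*)([A-Z\d\s]{2,})(.*)/) {
        print join("|", $1, $2, $3, $4), "\n";
    } else {
        print "$line\n";    # no match: pass the line through unchanged
    }
}
close $fh;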

The big trick is the regular expression in:

$line =~ /([A-Z\s]+)(.*)([A-Z\d\s]{2,})(.*)/;

That's what breaks the line into four parts (which are then represented by $1 through $4). I simply don't have enough data to even start testing it.
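To try the four-capture idea against the sample lines from the question, here is a hedged sketch in Python. It is not the pattern above but a stricter variant of my own: the address group is made lazy so that it stops at the first upper-case word, and the city may start with a zip code. The names `LINE_RE` and `split_with_regex` are hypothetical, and the five-digit zip is an assumption taken from the MARSEILLE example.

```python
import re

LINE_RE = re.compile(r"""
    ^(?P<name>[A-Z]+(?:\s+[A-Z]+)*)                 # upper-case words only
    \s+(?P<address>.+?)                             # anything, as short as possible
    \s+(?P<city>(?:\d{5}\s+)?[A-Z]{2,}              # optional 5-digit zip, then a caps word
               (?:\s+(?:[A-Z]{2,}|\d+))*)(?=\s|$)   # more caps words or numbers
    (?:\s+(?P<info>.*))?$                           # optional trailing info
""", re.VERBOSE)


def split_with_regex(line, delimiter="|"):
    """Return the line with delimiters inserted, or unchanged if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return line                        # leave odd lines for manual editing
    # groups() yields name, address, city, info; a missing info is skipped.
    return delimiter.join(part for part in match.groups() if part)


print(split_with_regex("THIS NAME 23 Rue da guerre 321 12345 MARSEILLE - Info"))
# THIS NAME|23 Rue da guerre 321|12345 MARSEILLE|- Info
```

Because the city group also accepts numbers after the first upper-case word, a zip that follows the city stays in the city segment as well.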

Can you attach about 4 to 5 lines of the file to your question? Then I'll work something out.

#3

This might work for you:

sed 's/^\([A-Z ]*\) \(.*\)/\2\n\1|/;s/[A-Z]\{2\}/|&/;s/\([^|]*|\)\(.*\)/\2\1/;s/\([^A-Z0-9 ]\)/|\1/;s/\([^\n]*\)\n\(.*\)/\2\1/;s/|$//' file
