Split a line into words and remove string and numeric words using Java regex

Date: 2020-12-12 21:39:41

How can I split a line into words and remove the string and numeric words using Java regex? Below, Input is what I receive and Output is what I want:

Input:
05  ECPRF-057     PIC S9(4) VALUE +0057 COMP-3.

Output:
ECPRF-057
PIC
S
VALUE
COMP-3


Input:
88  ACCT-LVL-CHG     VALUE 'WT ' 'WTO', "AA ",

Output:
ACCT-LVL-CHG
VALUE

Thanks in advance, Kishore

2 Answers

#1


4  

No, you didn't.

05  ECPRF-057     PIC S9(4) VALUE +0057 COMP-3.
05  ECPRF-057     COMP-3 PIC S9(4) VALUE +0057.
05  ECPRF-057     VALUE +0057 PIC S9(4) USAGE COMP-3.
05  ECPRF-057     VALUE +0057 PIC S9(4) USAGE IS COMP-3.
05  ECPRF-057     VALUE +0057 PICTURE IS S9(4) USAGE IS COMP-3.
05  ECPRF-057     VALUE +0057 PICTURE S9(4) USAGE IS COMP-3.
05  ECPRF-057     VALUE +0057 PIC S9(4) COMP-3.
05  ECPRF-057     VALUE +0057 COMP-3 PIC S9(4).
05  ECPRF-057     VALUE +0057 PACKED-DECIMAL PIC S9(4).
05  ECPRF-057     VALUE +0057 PIC S9(4).

In addition, COMP-3 can be written as COMPUTATIONAL-3 or PACKED-DECIMAL. And none of these have to be on the same line, and very often won't be.

These are all the same, and there are many, many more combinations. Am I sure even about that last one? Yes, because somewhere before it (immediately, or any number of lines earlier) is:

02  ECPRF-057-GROUP COMP-3. (which may also have the combinations relating to COMP-3)

That's without getting to the 88-level in your second example.

That's without:

05  PIC X(20) VALUE SPACE.

That's without duplicate data-names which are valid when "qualified" by a higher-level data-name, requiring the use of either IN or OF.

That's without REDEFINES.

That's without COMP/COMP-4/COMP-5/BINARY where things like the maximum value that can be held depend upon a compiler option.

Please don't attempt to do this unless all the data you are processing is already rigorously normalised.
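
To make the point about spelling variants concrete, here is a minimal Java sketch that checks whether a single statement names packed decimal under any of the three spellings mentioned above, wherever the clause appears. `UsageSketch` and `declaresPackedDecimal` are names invented for this illustration. It looks only at the text it is given, so it cannot see a USAGE inherited from a group item, and it would be fooled by a literal that happens to contain one of the keywords.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class UsageSketch {
    // Any of the three spellings, not embedded in a longer word such as MY-COMP-3-FLD.
    private static final Pattern PACKED = Pattern.compile(
            "(?<![\\w-])(?:COMP-3|COMPUTATIONAL-3|PACKED-DECIMAL)(?![\\w-])");

    /** True if this statement text itself names packed decimal (hypothetical helper). */
    static boolean declaresPackedDecimal(String statement) {
        return PACKED.matcher(statement.toUpperCase(Locale.ROOT)).find();
    }

    public static void main(String[] args) {
        String[] statements = {
            "05  ECPRF-057     COMP-3 PIC S9(4) VALUE +0057.",
            "05  ECPRF-057     VALUE +0057 PICTURE IS S9(4) USAGE IS COMP-3.",
            "05  ECPRF-057     VALUE +0057 PACKED-DECIMAL PIC S9(4).",
            "05  ECPRF-057     VALUE +0057 PIC S9(4)."
        };
        for (String s : statements) {
            System.out.println(declaresPackedDecimal(s) + "  " + s);
        }
    }
}
```

The last statement reports `false` even though the item may still be packed decimal through its group, which is exactly the case a line-by-line approach cannot resolve.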

Plus, the word VALUE is useless to you. It is the actual content of the VALUE clause which you want. VALUE is optional on levels 01-49 and, when present, takes a single value; on a level 88 it can list an unlimited number of items. You are also ignoring the number of digits, or bytes (the latter varies with the PICture, with whether the number of digits is even or odd, and with compile options).

You were previously asked why you were processing a COBOL program programmatically, and you didn't mention this.

If you want a program to understand COBOL on the Mainframe, it already exists: it is the Enterprise COBOL compiler.

If you really want to do something by trying to "understand" a COBOL program, at least make your task orders of magnitude easier and use the compile listing which the compiler produces. You will still have to work out the number of decimal places, and the number of times something OCCURS, but these are minor things which can be specifically sought within a limited context which can be provided by data on the compile listing.

And, if you genuinely need to ignore the values associated with VALUE, you have the figurative-constants (SPACE(S), LOW-VALUES, HIGH-VALUES, ZERO(S/ES), QUOTE(S)) to deal with as well, plus NULL, which you may find as a VALUE on a USAGE POINTER item. You also need to be aware that these can be specified on the group that a given data-item is part of.

Time now allows some expansion, so have a look at these:

   01  A-GROUP VALUE ZERO. 
       05  PIC 9. 
       05  A-NAME-1 PIC S9(4). 
       05  A-NAME-2 PIC S9999. 
       05  A-NAME-3 REDEFINES A-NAME-2 PIC 9999.
   01  B-GROUP BINARY. 
       05  PIC 9. 
       05  B-NAME-1 PIC S9(4). 
       05  B-NAME-2 PIC S9999. 
       05  B-NAME-3 REDEFINES B-NAME-2 PIC 9999.
   01  C-GROUP COMPUTATIONAL-3. 
       05  PIC 9. 
       05  C-NAME-1 PIC S9(4). 
       05  C-NAME-2 PIC S9999. 
       05  C-NAME-3 REDEFINES C-NAME-2 PIC 9999.

   01  D-GROUP SIGN LEADING SEPARATE. 
       05  PIC 9. 
       05  D-NAME-1 PIC S9(4). 
       05  D-NAME-2 PIC S9999. 
       05  FILLER  REDEFINES D-NAME-2. 
           10  FILLER PIC X. 
           10  D-NAME-3 PIC 9999. 

If you look at the 05-level definitions, all these fields look the same from group to group. They are not: they are all different because of the additional clauses on the 01-level.
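
To put numbers on this, here is a rough Java sketch of how many bytes the same `PIC S9(4)` item occupies under each of the four group-level clauses above. It assumes the usual IBM Enterprise COBOL storage rules (zoned decimal with an overpunched sign, halfword/fullword/doubleword binary, packed decimal at two digits per byte plus a sign nibble) and ignores SYNCHRONIZED and compiler options. `StorageSizeSketch`, `Usage` and `bytes` are names made up for illustration.

```java
public class StorageSizeSketch {
    enum Usage { DISPLAY, DISPLAY_SIGN_SEPARATE, BINARY, PACKED_DECIMAL }

    /** Bytes occupied by a numeric item with 1..18 digit positions (assumed rules, see lead-in). */
    static int bytes(int digits, boolean signed, Usage usage) {
        switch (usage) {
            case DISPLAY:
                return digits;                        // sign shares the last byte
            case DISPLAY_SIGN_SEPARATE:
                return signed ? digits + 1 : digits;  // extra byte only for signed items
            case BINARY:
                return digits <= 4 ? 2 : digits <= 9 ? 4 : 8;
            case PACKED_DECIMAL:
                return digits / 2 + 1;                // integer division; last nibble is the sign
            default:
                throw new IllegalArgumentException("unknown usage: " + usage);
        }
    }

    public static void main(String[] args) {
        // The same PIC S9(4) under A-GROUP, D-GROUP, B-GROUP and C-GROUP respectively.
        System.out.println(bytes(4, true, Usage.DISPLAY));               // 4
        System.out.println(bytes(4, true, Usage.DISPLAY_SIGN_SEPARATE)); // 5
        System.out.println(bytes(4, true, Usage.BINARY));                // 2
        System.out.println(bytes(4, true, Usage.PACKED_DECIMAL));        // 3
    }
}
```

The five bytes for the sign-separate case match the `FILLER PIC X` plus `PIC 9999` redefinition in D-GROUP.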

I've not even scratched the surface. COBOL has a very wide range of data-definitions which can be easily applied to produce complex data-structures.

COBOL is an old language. Many COBOL programs are old programs changed already by many people with different coding styles and different levels of knowledge of COBOL. Will you find definitions like the above in all your programs? No. Will you find them in some? Maybe. Can't have Maybes when you are processing data.

The data you are extracting does not make sense to me. The level-number is significant, the content of the value is significant. The number of digits in a field is significant as well as the size of a field in bytes. Perhaps you don't need any of these, but I doubt it.

Abandon this route.

If you seriously need to "understand" a COBOL program on an IBM Mainframe, compile it, with all the listing options, and use the listing. Or look at the SYSADATA appendix in the Enterprise COBOL Programming Guide and use the compiler option to generate that data (this will take longer to compile, but will leave you with less work to do if you need to accomplish several distinct tasks (you have two already)).

If you try to do anything else, you are looking at a very considerable amount of work. If you are not knowledgeable in COBOL and have no source of good knowledge available for the design, your results will be "patchy" at best.

If you'd answered more fully on your previous question relating to this, you'd have saved yourself all of the above as well.

Here are some links to SO questions which may aid you if you look to continue with other solutions:

Generating Record Layouts for EBCDIC Data Files.

Is there a Python library to parse and manipulate COBOL code?

Is there a free (as in beer) Flow chart generator for COBOL Code?

#2


0  

I got a solution in 3 phases.

Phase 1: Remove all strings first.

String line = srcLine.replaceAll("\"(?:[^\"]|\"\")*\"|'(?:[^']|'')*'", "");
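
A small Java sketch of this phase follows. The pattern here expresses "a non-quote character or a doubled quote" as a group alternation, `(?:[^']|'')*`, rather than inside a character class; in Java a class such as `[[^']|'']` is a union that matches any character, so two literals on one line would be swallowed together with whatever sits between them. `StripLiteralsSketch` and `stripLiterals` are illustrative names, and the sketch does not handle literals continued across lines.

```java
public class StripLiteralsSketch {
    // A literal: opening quote, then non-quote characters or doubled quotes, then closing quote.
    private static final String LITERAL = "\"(?:[^\"]|\"\")*\"|'(?:[^']|'')*'";

    /** Removes every single- or double-quoted literal from one source line. */
    static String stripLiterals(String srcLine) {
        return srcLine.replaceAll(LITERAL, "");
    }

    public static void main(String[] args) {
        String src = "88  ACCT-LVL-CHG     VALUE 'WT ' 'WTO', \"AA \",";
        // Each literal is removed on its own; the separators between them survive.
        System.out.println("[" + stripLiterals(src) + "]");
        // A word between two literals is kept.
        System.out.println("[" + stripLiterals("VALUE 'A' THRU 'B'") + "]");
    }
}
```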

Phase 2: Break the line into words.

Pattern pattern = Pattern.compile("\\b(?:(?<=\")[^\"]*(?=\")|(?<=\\')[^\\']*(?=\\')|[\\w\\d-]+)\\b");

Phase 3: Finally, discard the numeric data.
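
Put together, the three phases might look like the Java sketch below. It is an illustration under assumptions, not the answerer's exact code: the word pattern is reduced to `\b[\w-]+\b` (the last alternative of the Phase 2 pattern, on the assumption that quoted literals are already gone), "numeric" is taken to mean a token made only of digits, and `CobolWordsSketch` and `words` are invented names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CobolWordsSketch {
    // Phase 1: quoted literals, with doubled quotes allowed inside.
    private static final String LITERAL = "\"(?:[^\"]|\"\")*\"|'(?:[^']|'')*'";
    // Phase 2: runs of word characters and hyphens.
    private static final Pattern WORD = Pattern.compile("\\b[\\w-]+\\b");

    static List<String> words(String srcLine) {
        String line = srcLine.replaceAll(LITERAL, "");
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(line);
        while (m.find()) {
            String token = m.group();
            // Phase 3: drop tokens that consist only of digits (level numbers, 0057, ...).
            if (!token.matches("\\d+")) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("05  ECPRF-057     PIC S9(4) VALUE +0057 COMP-3."));
        // [ECPRF-057, PIC, S9, VALUE, COMP-3]
        System.out.println(words("88  ACCT-LVL-CHG     VALUE 'WT ' 'WTO', \"AA \","));
        // [ACCT-LVL-CHG, VALUE]
    }
}
```

Note that the first line yields `S9`, while the desired output in the question shows `S`. Nothing in the thread says how that `9` should be removed, so the sketch leaves the token as it is.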
