Regular expression to split on forward slash

Date: 2021-12-06 21:40:14

I have a parse tree that contains some information. To extract what I need, I use code that splits the string on the forward slash (/), but that code is not perfect. Details follow:

I used this code in an earlier project and it worked perfectly. But the parse trees in my new dataset are more complicated, and the code sometimes makes the wrong decision.

The parse tree looks something like this:

(TOP~did~1~1 (S~did~2~2 (NPB~I~1~1 I/PRP ) (VP~did~3~1 did/VBD not/RB (VP~read~2~1 read/VB (NPB~article~2~2 the/DT article/NN ./PUNC. ) ) ) ) ) 

As you can see, the leaves of the tree are the words immediately before the forward slashes. To get these words, I previously used this code:

parse_tree.split("/");

But now, in my new data, I see instances like these:

1) (TOP Source/NN http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htm/X ./. )

where there are multiple slashes because of a website address (in this case, only the last slash is the word separator).

2) (NPB~sister~2~2 Your/PRP$ sister/NN //PUNC: )

where the slash is itself a word.
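
To make the problem concrete, here is a minimal sketch of what a plain split on "/" does with the two inputs above. The class and method names (NaiveSplitDemo, naiveSplit) are hypothetical and only serve the illustration.

```java
import java.util.Arrays;

class NaiveSplitDemo {
    // Plain split on every forward slash, as in the original one-liner.
    static String[] naiveSplit(String parseTree) {
        return parseTree.split("/");
    }

    public static void main(String[] args) {
        String url = "(TOP Source/NN http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htm/X ./. )";
        String slashWord = "(NPB~sister~2~2 Your/PRP$ sister/NN //PUNC: )";

        // The URL is cut at every slash, including the two in "http://",
        // so it is scattered over several pieces.
        System.out.println(Arrays.toString(naiveSplit(url)));

        // The word "/" is lost: "//" produces an empty piece between
        // "NN " and "PUNC: )" instead of keeping one slash as a word.
        System.out.println(Arrays.toString(naiveSplit(slashWord)));
    }
}
```

The first input comes back as 10 pieces and the second as 5, where 4 pieces each would be expected if only the word separators were used.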

Could you please help me replace my current simple regular expression with one that can handle these cases?

To summarize: I need a regular expression that splits on the forward slash but handles two exceptions: 1) if there is a website address, it must split on the last slash only; 2) if there are two consecutive slashes, it must split on the second slash (the first slash must NOT be treated as a separator; it is a WORD).

3 answers

#1


3  

I achieved what you asked for by following this article:

http://www.rexegg.com/regex-best-trick.html

To summarize, here is the overall strategy:

First, create a regex in this format:

NotThis | NeitherThis | (IWantThis)

After that, capture group 1 will be set only for the slashes you actually want to split on.

You can then replace those slashes with a token that is unlikely to occur in the data, and then split on that token.

So, with this strategy in mind, here's the code:

Regex (written as a Java string literal, hence the doubled backslashes):

\\/(?=\\/)|(?:http:\\/\\/)?www[\\w\\.\\/\\-]*(?=\\/)|(\\/)

Explanation:

The NotThis term is a slash followed by another slash, using a lookahead so that only the first slash is consumed:

\\/(?=\\/)

The NeitherThis term is a basic URL check, with a lookahead so that the last \/ is not consumed:

(?:http:\\/\\/)?www[\\w\\.\\/\\-]*(?=\\/)

The IWantThis term is simply the slash:

(\\/)
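
Before putting the pieces together, it can help to see which alternative claims each match. The sketch below uses the same three-part pattern and labels every match as "split" (group 1 is set) or "skip" (one of the first two alternatives consumed it). The class and method names (MatchClassifierDemo, classify) are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchClassifierDemo {
    // Same pattern as above: NotThis | NeitherThis | (IWantThis).
    static final Pattern SLASHES = Pattern.compile(
            "\\/(?=\\/)|(?:http:\\/\\/)?www[\\w\\.\\/\\-]*(?=\\/)|(\\/)");

    // Returns one line per match, labelled "split" when group 1 took part
    // in the match and "skip" when an earlier alternative consumed the text.
    static List<String> classify(String input) {
        List<String> report = new ArrayList<>();
        Matcher m = SLASHES.matcher(input);
        while (m.find()) {
            String label = (m.group(1) != null) ? "split" : "skip";
            report.add(label + " at " + m.start() + ": " + m.group());
        }
        return report;
    }

    public static void main(String[] args) {
        // Three matches: the slash in "sister/NN" (split), the first
        // slash of "//" (skip) and the second slash of "//" (split).
        classify("sister/NN //PUNC:").forEach(System.out::println);
    }
}
```

Because the regex engine tries the alternatives from left to right at each position, the unwanted cases are consumed before the bare slash in group 1 gets a chance to match them.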

In Java you can put this all together like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile("\\/(?=\\/)|(?:http:\\/\\/)?www[\\w\\.\\/\\-]*(?=\\/)|(\\/)");

Matcher m = p.matcher("(TOP~did~1~1 (S~did~2~2 (NPB~I~1~1 I/PRP ) (VP~did~3~1 did/VBD not/RB (VP~read~2~1 read/VB (NPB~article~2~2 the/DT article/NN ./PUNC. ) ) ) ) )\n(TOP Source/NN http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htm/X ./. )\n(NPB~sister~2~2 Your/PRP$ sister/NN //PUNC: )");
StringBuffer b = new StringBuffer();
while (m.find()) {
    // Group 1 is set only for the slashes we want to split on.
    if (m.group(1) != null) m.appendReplacement(b, "Superman");
    // Anything else (first slash of "//", or a URL) is copied back unchanged;
    // quoteReplacement keeps any '$' or '\' in the match from being interpreted.
    else m.appendReplacement(b, Matcher.quoteReplacement(m.group(0)));
}
m.appendTail(b);
String replaced = b.toString();
System.out.println("\n" + "*** Replacements ***");
System.out.println(replaced);

// Split on the placeholder token instead of on the raw slash.
String[] splits = replaced.split("Superman");
System.out.println("\n" + "*** Splits ***");
for (String split : splits) System.out.println(split);

Output:

*** Replacements ***
(TOP~did~1~1 (S~did~2~2 (NPB~I~1~1 ISupermanPRP ) (VP~did~3~1 didSupermanVBD notSupermanRB (VP~read~2~1 readSupermanVB (NPB~article~2~2 theSupermanDT articleSupermanNN .SupermanPUNC. ) ) ) ) )
(TOP SourceSupermanNN http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htmSupermanX .Superman. )
(NPB~sister~2~2 YourSupermanPRP$ sisterSupermanNN /SupermanPUNC: )

*** Splits ***
(TOP~did~1~1 (S~did~2~2 (NPB~I~1~1 I
PRP ) (VP~did~3~1 did
VBD not
RB (VP~read~2~1 read
VB (NPB~article~2~2 the
DT article
NN .
PUNC. ) ) ) ) )
(TOP Source
NN http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htm
X .
. )
(NPB~sister~2~2 Your
PRP$ sister
NN /
PUNC: )
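
If the placeholder token could ever occur in the data, one possible variation is to skip the replacement step and cut the string directly at the offsets where group 1 matched. This is only a sketch under that assumption; the class and method names (OffsetSplitDemo, splitOnWantedSlashes) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OffsetSplitDemo {
    // Same pattern as above: NotThis | NeitherThis | (IWantThis).
    static final Pattern SLASHES = Pattern.compile(
            "\\/(?=\\/)|(?:http:\\/\\/)?www[\\w\\.\\/\\-]*(?=\\/)|(\\/)");

    // Cuts the input at every slash captured by group 1. Matches from the
    // other two alternatives are left in place as ordinary text.
    static List<String> splitOnWantedSlashes(String input) {
        List<String> pieces = new ArrayList<>();
        Matcher m = SLASHES.matcher(input);
        int pieceStart = 0;
        while (m.find()) {
            if (m.group(1) != null) {
                pieces.add(input.substring(pieceStart, m.start(1)));
                pieceStart = m.end(1);
            }
        }
        pieces.add(input.substring(pieceStart)); // text after the last separator
        return pieces;
    }

    public static void main(String[] args) {
        String tree = "(NPB~sister~2~2 Your/PRP$ sister/NN //PUNC: )";
        // Prints four pieces; the third one is "NN /".
        splitOnWantedSlashes(tree).forEach(System.out::println);
    }
}
```

Unlike String.split, this version keeps a trailing empty piece when the input ends with a separator, which may or may not be what you want.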

#2


1  

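You should be able to use a negative lookbehind in the regex. This would need a bigger sample of inputs to be sure, and as the output below shows it only partly handles your two cases: the // after http: is left alone, but the URL path is still split at every other slash, and for //PUNC: the split lands on the first slash rather than the second: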

    import java.util.Arrays;
    import java.util.List;

    // Split on a slash that is not immediately preceded by ':' or '/'.
    String pattern = "(?<![\\:\\/])\\/";

    String s1 = "(TOP Source/NN http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htm/X ./. )";
    List<String> a = Arrays.asList(s1.split(pattern));

    System.out.println("first case:");
    System.out.println(String.join(",\n", a));
    System.out.println("\n");

    String s2 = "(NPB~sister~2~2 Your/PRP$ sister/NN //PUNC: )";
    a = Arrays.asList(s2.split(pattern));
    System.out.println("second case");
    System.out.println(String.join(",\n", a));

This outputs:

first case:
(TOP Source,
NN http://www.alwatan.com.sa,
daily,
2007-01-31,
first_page,
first_page01.htm,
X .,
. )


second case
(NPB~sister~2~2 Your,
PRP$ sister,
NN ,
/PUNC: )
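
A lookahead can express the "last slash in the token" rule more directly. The sketch below is not the pattern from this answer but a possible variation: it treats a slash as a separator only when no further slash appears before the next whitespace. It assumes that tokens are separated by whitespace and that tags are non-empty and never contain a slash; the class, constant and method names are hypothetical.

```java
import java.util.Arrays;

class LastSlashSplitDemo {
    // A slash is a separator only if the rest of its token (up to the next
    // whitespace or the end of input) is non-empty and has no further slash.
    static final String LAST_SLASH_IN_TOKEN = "/(?=[^/\\s]+(?:\\s|$))";

    static String[] splitOnLastSlash(String parseTree) {
        return parseTree.split(LAST_SLASH_IN_TOKEN);
    }

    public static void main(String[] args) {
        String s1 = "(TOP Source/NN http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htm/X ./. )";
        String s2 = "(NPB~sister~2~2 Your/PRP$ sister/NN //PUNC: )";

        // The URL stays in one piece; only the slash before "X" splits it off.
        System.out.println(String.join(",\n", Arrays.asList(splitOnLastSlash(s1))));
        System.out.println();
        // The first slash of "//" stays attached to "NN " as the word "/".
        System.out.println(String.join(",\n", Arrays.asList(splitOnLastSlash(s2))));
    }
}
```

Both sample inputs come back as four pieces under these assumptions; a tag that itself contains a slash would break the rule, so this needs checking against the real tag set.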

#3


0  

Filter your matches further so that they exclude anything matched by the regex below, which matches any http/https/ftp URL; you can add as many protocols as you like:

(?<protocol>http(s)?|ftp)://(?<server>([A-Za-z0-9-]+\.)*(?<basedomain>[A-Za-z0-9-]+\.[A-Za-z0-9]+))+ ((/?)(?<path>(?<dir>[A-Za-z0-9\._\-]+)))*

and then match runs of consecutive slashes with (/)+
The '+' here is a greedy quantifier, which means it will match as many consecutive slashes as it can, however many there are.

Hope this helps.
