Splitting a string on "." (dot) while handling abbreviations

I'm finding this fairly hard to explain, so I'll kick off with a few examples of before/after of what I'd like to achieve.

Example of input:

Hello.World

This.Is.A.Test

The.S.W.A.T.Team

S.W.A.T.

s.w.a.t.

2001.A.Space.Odyssey

Wanted output:

Hello World

This Is A Test

The SWAT Team

SWAT

swat

2001 A Space Odyssey

Essentially, I'd like to create something that's capable of splitting strings by dots, but at the same time handles abbreviations.

My definition of an abbreviation is something that has at least two characters (case is irrelevant) and two dots, e.g. "A.B." or "a.b.". It shouldn't apply to digits, e.g. "1.a.".

I've tried all kinds of things with regex, but it isn't exactly my strong suit, so I'm hoping that someone here has some ideas or pointers that I can use.

2 solutions

#1

How about using a regex to remove the dots that need to disappear, and then replacing the remaining dots with spaces? Written as a Java string literal, the regex can look like (?<=(^|[.])[\\S&&\\D])[.](?=[\\S&&\\D]([.]|$)).

String[] data = { 
        "Hello.World", 
        "This.Is.A.Test", 
        "The.S.W.A.T.Team",
        "S.w.a.T.", 
        "S.w.a.T.1", 
        "2001.A.Space.Odyssey" };

for (String s : data) {
    // First delete the dots inside abbreviations, then turn the remaining dots into spaces.
    System.out.println(s.replaceAll(
            "(?<=(^|[.])[\\S&&\\D])[.](?=[\\S&&\\D]([.]|$))", "")
            .replace('.', ' '));
}

result

Hello World
This Is A Test
The SWAT Team
SwaT 
SwaT 1
2001 A Space Odyssey

In the regex I needed to escape the special meaning of the dot character. I could do that with \\. but I prefer [.].
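
To make the difference concrete, here is a small sketch comparing an unescaped dot with the two escaped forms. The class name DotEscapeDemo and the helper mark are made up for this illustration; only String.replaceAll is standard Java.

```java
class DotEscapeDemo {
    // Hypothetical helper: replaces every regex match with '-' so the matches are visible.
    static String mark(String input, String regex) {
        return input.replaceAll(regex, "-");
    }

    public static void main(String[] args) {
        System.out.println(mark("a.b", "."));    // unescaped dot matches every character: ---
        System.out.println(mark("a.b", "\\."));  // escaped dot matches only the literal dot: a-b
        System.out.println(mark("a.b", "[.]"));  // character class, same effect: a-b
    }
}
```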

So at the center of the regex we have a literal dot. This dot is surrounded by (?<=...) and (?=...). These are parts of the look-around mechanism, called look-behind and look-ahead.

A dot that needs to be removed has a dot (or the start of the data, ^) followed by a single non-whitespace (\\S), non-digit (\\D) character before it, so I can test for that using (?<=(^|[.])[\\S&&\\D])[.].

A dot that needs to be removed also has a single non-whitespace, non-digit character and then another dot (or the end of the data, $) after it, which can be written as [.](?=[\\S&&\\D]([.]|$)).
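
As a rough illustration of why both conditions are needed, the sketch below applies each half on its own and then both together. It assumes [a-zA-Z] in place of [\\S&&\\D] to keep the demo to letters; the class name LookAroundDemo and the helper strip are invented for the example.

```java
class LookAroundDemo {
    // Assumption: [a-zA-Z] stands in for [\\S&&\\D] to keep the demo to English letters.
    static final String BEHIND = "(?<=(^|[.])[a-zA-Z])";
    static final String AHEAD = "(?=[a-zA-Z]([.]|$))";

    // Hypothetical helper: deletes every match of the given regex.
    static String strip(String input, String regex) {
        return input.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String s = "The.S.W.A.T.Team";
        // Look-behind only: also eats the dot between "T" and "Team".
        System.out.println(strip(s, BEHIND + "[.]"));          // The.SWATTeam
        // Look-ahead only: also eats the dot between "The" and "S".
        System.out.println(strip(s, "[.]" + AHEAD));           // TheSWAT.Team
        // Both: only the dots inside the abbreviation are removed.
        System.out.println(strip(s, BEHIND + "[.]" + AHEAD));  // The.SWAT.Team
    }
}
```

Each half alone removes one dot too many; only the combination leaves the two word boundaries intact.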

Depending on your needs, [\\S&&\\D], which besides letters also matches characters like !@#$%^&*()-_=+..., can be replaced with [a-zA-Z] for English letters only, or \\p{IsAlphabetic} for all letters in Unicode.
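
One possible way to package this is sketched below, using the \\p{IsAlphabetic} variant. The class name AbbreviationSplitter and the method split are hypothetical, and the trailing trim() is an addition of this sketch (not part of the code above) so that "S.W.A.T." comes out without the trailing space seen in the result listing.

```java
class AbbreviationSplitter {
    // Same shape as the regex above, with \\p{IsAlphabetic} in place of [\\S&&\\D].
    private static final String ABBREVIATION_DOT =
            "(?<=(^|[.])\\p{IsAlphabetic})[.](?=\\p{IsAlphabetic}([.]|$))";

    static String split(String input) {
        return input.replaceAll(ABBREVIATION_DOT, "") // drop dots inside abbreviations
                .replace('.', ' ')                    // remaining dots become spaces
                .trim();                              // assumption: trailing space is unwanted
    }

    public static void main(String[] args) {
        String[] data = { "Hello.World", "The.S.W.A.T.Team", "S.W.A.T.",
                "s.w.a.t.", "2001.A.Space.Odyssey" };
        for (String s : data) {
            System.out.println(split(s));
        }
    }
}
```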

#2

Since every word starts with a capital (uppercase) letter, I would suggest that you first remove all dots, replacing each with an empty string (""). Then iterate over all characters and put a space between a lowercase letter and a following uppercase letter. Also, if you encounter an uppercase letter followed by a lowercase letter, put the space before the uppercase letter.

It will work for all examples you provided, but I am not sure if there are any exceptions to my observation.
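
A minimal sketch of one reading of these steps follows; the class name CaseBoundarySplitter and the method split are invented for the illustration. Note that under this reading digits are neither lowercase nor uppercase, so "2001.A.Space.Odyssey" appears to come out as "2001A Space Odyssey", which may be one of the exceptions worth checking.

```java
class CaseBoundarySplitter {
    static String split(String input) {
        String s = input.replace(".", ""); // step 1: remove all dots
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (i > 0 && Character.isUpperCase(c)) {
                // Space between a lowercase letter and a following uppercase letter.
                boolean afterLower = Character.isLowerCase(s.charAt(i - 1));
                // Space before an uppercase letter that is followed by a lowercase letter.
                boolean beforeLower = i + 1 < s.length()
                        && Character.isLowerCase(s.charAt(i + 1));
                if (afterLower || beforeLower) {
                    out.append(' ');
                }
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(split("This.Is.A.Test"));       // This Is A Test
        System.out.println(split("The.S.W.A.T.Team"));     // The SWAT Team
        System.out.println(split("2001.A.Space.Odyssey")); // 2001A Space Odyssey
    }
}
```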
