I have the following regex in Java:
String regex = "[^\\s\\p{L}\\p{N}]";
Pattern p = Pattern.compile(regex);
String phrase = "Time flies: \"when you're having fun!\" Can't wait, 'until' next summer :)";
String delimited = p.matcher(phrase).replaceAll("");
Right now this regex removes every character that is not whitespace, a letter, or a digit.
Input: Time flies: "when you're having fun!" Can't wait, 'until' next summer :)
Output: Time flies when youre having fun Cant wait until next summer
The problem is that I want to keep single quotes inside words, such as you're, can't, etc., but remove single quotes that are at the end of a sentence or that surround a word, such as 'hello'. This is what I want:
Input: Time flies: "when you're having fun!" Can't wait, 'until' next summer :)
Output: Time flies when you're having fun Can't wait until next summer
How can I update my current regex to do this? I need to keep \p{L} and \p{N}, as it has to work for more than one language.
Thanks!
1 Answer
This should do what you want, or come close:
String regex = "[^\\s\\p{L}\\p{N}']|(?<=(^|\\s))'|'(?=($|\\s))";
The regex has three alternatives separated by |. It will match:
- Any character that is not whitespace, a letter, a number, or a single quote.
- A single quote, if it is preceded by the beginning of the input or a whitespace character (therefore, a quote at the beginning of a word). This uses positive lookbehind.
- A single quote, if it is followed by the end of the input or a whitespace character (therefore, a quote at the end of a word). This uses positive lookahead.
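To check the three alternatives against the input from the question, here is a minimal runnable sketch. The class name QuoteStripDemo and the helper strip are made up for illustration; only the regex itself comes from the answer. Note that removing ":)" leaves the space that preceded it, so the result ends with a trailing space unless you also call trim().

```java
import java.util.regex.Pattern;

class QuoteStripDemo {
    // The regex from the answer, compiled once and reused.
    private static final Pattern PUNCTUATION =
            Pattern.compile("[^\\s\\p{L}\\p{N}']|(?<=(^|\\s))'|'(?=($|\\s))");

    // Hypothetical helper: deletes every match of the pattern.
    static String strip(String text) {
        return PUNCTUATION.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String phrase =
                "Time flies: \"when you're having fun!\" Can't wait, 'until' next summer :)";
        // Brackets make the trailing space visible.
        System.out.println("[" + strip(phrase) + "]");
        // prints: [Time flies when you're having fun Can't wait until next summer ]
    }
}
```

The apostrophes in you're and Can't survive because each has a letter on both sides, so neither the lookbehind nor the lookahead alternative applies to them.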
It works on the example you give. Where it might not work the way you want is if a word has a quote mark on one side but not the other: "'Tis a shame that we couldn't visit James' house". Since the lookbehind and lookahead only look at the character right before and right after the quote, and don't check whether (say) a quote mark at the beginning of a word is matched by a quote mark at the end of the word, the regex will delete the quote marks on 'Tis and James'.
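The limitation can be reproduced with a short sketch. As before, the class name QuoteEdgeCases and the helper strip are hypothetical. The second input is not from the answer; it is an extra case that follows from the same reasoning: because lookarounds inspect the original text, a closing quote followed directly by punctuation rather than whitespace is not matched and stays in the output.

```java
import java.util.regex.Pattern;

class QuoteEdgeCases {
    // Same regex as in the answer.
    private static final Pattern PUNCTUATION =
            Pattern.compile("[^\\s\\p{L}\\p{N}']|(?<=(^|\\s))'|'(?=($|\\s))");

    // Hypothetical helper: deletes every match of the pattern.
    static String strip(String text) {
        return PUNCTUATION.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // One-sided quotes are treated like surrounding quotes and removed.
        System.out.println(strip("'Tis a shame that we couldn't visit James' house"));
        // prints: Tis a shame that we couldn't visit James house

        // The closing quote is followed by a comma, not whitespace, so it is kept.
        System.out.println(strip("wait, 'until', next"));
        // prints: wait until' next
    }
}
```

If the second case matters for your data, one option is to run the replacement twice, since the first pass removes the comma and leaves the quote next to whitespace.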