I thought I'd share this interesting problem with everyone here. I am trying to remove unbalanced/unpaired double quotes from a string.
My work is in progress and I may be close to a solution, but I don't have a working one yet: I am not able to delete all the unpaired double quotes from the string.
Example Input
string1=injunct! alter ego."
string2=successor "alter ego" single employer" "proceeding "citation assets"
Expected Output
string1=injunct! alter ego.
string2=successor "alter ego" single employer proceeding "citation assets"
This problem sounds similar to "Using Java remove unbalanced/unpartnered parenthesis".
Here is my code so far (it doesn't delete all the unpaired double quotes):
private String removeUnattachedDoubleQuotes(String stringWithDoubleQuotes) {
    String firstPass = "";
    String openingQuotePattern = "\\\"[a-z0-9\\p{Punct}]";
    String closingQuotePattern = "[a-z0-9\\p{Punct}]\\\"";
    int doubleQuoteLevel = 0;
    for (int i = 0; i < stringWithDoubleQuotes.length() - 3; i++) {
        String c = stringWithDoubleQuotes.substring(i, i + 2);
        if (c.matches(openingQuotePattern)) {
            doubleQuoteLevel++;
            firstPass += c;
        }
        else if (c.matches(closingQuotePattern)) {
            if (doubleQuoteLevel > 0) {
                doubleQuoteLevel--;
                firstPass += c;
            }
        }
        else {
            firstPass += c;
        }
    }
    String secondPass = "";
    doubleQuoteLevel = 0;
    for (int i = firstPass.length() - 1; i >= 0; i--) {
        String c = stringWithDoubleQuotes.substring(i, i + 2);
        if (c.matches(closingQuotePattern)) {
            doubleQuoteLevel++;
            secondPass = c + secondPass;
        }
        else if (c.matches(openingQuotePattern)) {
            if (doubleQuoteLevel > 0) {
                doubleQuoteLevel--;
                secondPass = c + secondPass;
            }
        }
        else {
            secondPass = c + secondPass;
        }
    }
    String result = secondPass;
    return result;
}
2 Solutions
#1
You could use something like this (Perl notation):
s/("(?=\S)[^"]*(?<=\S)")|"/$1/g;
In Java, that would be:
str = str.replaceAll("(\"(?=\\S)[^\"]*(?<=\\S)\")|\"", "$1");
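To try this against the two inputs from the question, the call can be wrapped in a small class. This is only a sketch: the class name UnpairedQuoteRemover and the method name removeUnpairedQuotes are made up for illustration. Because Java strings are immutable, the method returns the result of the replacement rather than modifying its argument.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper class; the names are illustrative only.
class UnpairedQuoteRemover {
    // Group 1 captures a quoted span whose opening quote is directly followed
    // by non-whitespace and whose closing quote is directly preceded by
    // non-whitespace. Any other double quote matches the second alternative.
    private static final Pattern QUOTES =
            Pattern.compile("(\"(?=\\S)[^\"]*(?<=\\S)\")|\"");

    static String removeUnpairedQuotes(String input) {
        // "$1" keeps valid pairs; for a stray quote group 1 did not
        // participate, so the quote is replaced with nothing.
        return QUOTES.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        // prints: injunct! alter ego.
        System.out.println(removeUnpairedQuotes("injunct! alter ego.\""));
        // prints: successor "alter ego" single employer proceeding "citation assets"
        System.out.println(removeUnpairedQuotes(
                "successor \"alter ego\" single employer\" \"proceeding \"citation assets\""));
    }
}
```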
#2
It could probably be done in a single regex if there is no nesting.
There is a roughly defined concept of delimiters, and it is possible to 'bias'
those rules to get a better outcome.
It all depends on what rules are set forth. This regex takes into account
three possible scenarios, in order:
- Valid Pair
- Invalid Pair (with bias)
- Invalid Single
It also doesn't match a quoted pair across the end of a line, although it does
process multiple lines combined as a single string. To change that, remove \n
wherever you see it in the regex.
Global context, raw find regex (shortened):
(?:("[a-zA-Z0-9\p{Punct}][^"\n]*(?<=[a-zA-Z0-9\p{Punct}])")|(?<![a-zA-Z0-9\p{Punct}])"([^"\n]*)"(?![a-zA-Z0-9\p{Punct}])|")
replacement grouping
$1$2 or \1\2
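Since the test case in this answer is Perl only, here is a possible Java port of the shortened regex, offered as a sketch rather than a tested equivalent. The class BiasedQuoteCleaner, the method clean and the spanLines flag are hypothetical names; the flag simply drops \n from the two negated classes, as described above. One caveat: Java's \p{Punct} covers ASCII punctuation only, so results may differ from Perl on non-ASCII punctuation.

```java
import java.util.regex.Pattern;

// Hypothetical port of the regex above; names are illustrative only.
class BiasedQuoteCleaner {
    // Characters that count as "attached" text next to a quote.
    private static final String C = "[a-zA-Z0-9\\p{Punct}]";

    // Default form: a pair may not cross a line break.
    private static final Pattern SINGLE_LINE = Pattern.compile(build("[^\"\\n]*"));
    // Variant with \n removed: a pair may span several lines.
    private static final Pattern MULTI_LINE = Pattern.compile(build("[^\"]*"));

    private static String build(String body) {
        return "(?:(\"" + C + body + "(?<=" + C + ")\")"       // valid pair -> $1
             + "|(?<!" + C + ")\"(" + body + ")\"(?!" + C + ")" // invalid pair -> $2
             + "|\")";                                          // invalid single
    }

    static String clean(String input, boolean spanLines) {
        Pattern p = spanLines ? MULTI_LINE : SINGLE_LINE;
        // Groups that did not participate are replaced with nothing.
        return p.matcher(input).replaceAll("$1$2");
    }
}
```

For example, clean("\"alter\nego\"", false) drops both quotes because the pair crosses a line break, while clean("\"alter\nego\"", true) keeps the pair intact.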
Expanded raw regex:
(?:                                  // Grouping
                                     // Try to line up a valid pair
   (                                 // Capt grp (1) start
      "                              // "
      [a-zA-Z0-9\p{Punct}]           // 1 of [a-zA-Z0-9\p{Punct}]
      [^"\n]*                        // 0 or more characters other than " or newline
      (?<=[a-zA-Z0-9\p{Punct}])      // 1 of [a-zA-Z0-9\p{Punct}] behind us
      "                              // "
   )                                 // End capt grp (1)
 |                                   // OR, try to line up an invalid pair
   (?<![a-zA-Z0-9\p{Punct}])         // Bias, not 1 of [a-zA-Z0-9\p{Punct}] behind us
   "                                 // "
   ( [^"\n]* )                       // Capt grp (2) - 0 or more characters other than " or newline
   "                                 // "
   (?![a-zA-Z0-9\p{Punct}])          // Bias, not 1 of [a-zA-Z0-9\p{Punct}] ahead of us
 |                                   // OR, this single " is considered invalid
   "                                 // "
)                                    // End Grouping
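The expanded layout can also be kept in Java source by compiling with Pattern.COMMENTS, which ignores whitespace in the pattern and treats # as the start of a comment that runs to the end of the line. The following is a sketch under that assumption; the class name CommentedQuoteRegex and the method name clean are invented for illustration, and the // annotations above become # comments.

```java
import java.util.regex.Pattern;

// Hypothetical class showing the expanded regex in Java's free-spacing mode.
class CommentedQuoteRegex {
    private static final Pattern QUOTES = Pattern.compile(
          "(?:                             # grouping\n"
        + "  (                             # group 1: a valid pair\n"
        + "    \" [a-zA-Z0-9\\p{Punct}]    # opening quote, then attached text\n"
        + "    [^\"\\n]*                   # anything except a quote or newline\n"
        + "    (?<=[a-zA-Z0-9\\p{Punct}])  # attached text just behind us\n"
        + "    \"                          # closing quote\n"
        + "  )\n"
        + "  |                             # or: an invalid pair (with bias)\n"
        + "  (?<![a-zA-Z0-9\\p{Punct}])    # no attached text behind the quote\n"
        + "  \" ( [^\"\\n]* ) \"           # group 2: text between the quotes\n"
        + "  (?![a-zA-Z0-9\\p{Punct}])     # no attached text ahead of the quote\n"
        + "  |                             # or: a single stray quote\n"
        + "  \"\n"
        + ")",
        Pattern.COMMENTS);

    static String clean(String input) {
        // Keep group 1 or group 2, whichever took part in the match.
        return QUOTES.matcher(input).replaceAll("$1$2");
    }
}
```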
Perl test case (I don't have Java available):
$str = '
string1=injunct! alter ego."
string2=successor "alter ego" single employer "a" free" proceeding "citation assets"
';
print "\n'$str'\n";
$str =~ s
/
   (?:
      (
         "[a-zA-Z0-9\p{Punct}]
         [^"\n]*
         (?<=[a-zA-Z0-9\p{Punct}])
         "
      )
    |
      (?<![a-zA-Z0-9\p{Punct}])
      "
      ( [^"\n]* )
      " (?![a-zA-Z0-9\p{Punct}])
    |
      "
   )
/$1$2/xg;
print "\n'$str'\n";
Output
'
string1=injunct! alter ego."
string2=successor "alter ego" single employer "a" free" proceeding "citation assets"
'
'
string1=injunct! alter ego.
string2=successor "alter ego" single employer "a" free proceeding "citation assets"
'