A few notes on String replacement operations

Date: 2022-07-22 08:59:59

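A recent project required scraping trending posts from our school's 百合 (Lily) board, which meant heavy use of regular expressions and String replacement operations. I ran into a few problems along the way that are worth writing down.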

Here is a snippet of one such operation:

Pattern textareaContent = Pattern.compile("(?s)(<table)(.*?)<textarea.*?class=hide>(.*?)</textarea>");
Matcher contentMatcher = textareaContent.matcher(resultHTML);
StringBuffer buff = new StringBuffer();
while (contentMatcher.find()) {
    contentMatcher.appendReplacement(buff, contentMatcher.group(1) + " style='BORDER: 2px solid;BORDER-COLOR: D0F0C0;' "
            + contentMatcher.group(2) + contentMatcher.group(3));
}
resultHTML = contentMatcher.appendTail(buff).toString();
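Because the scraped content may contain characters such as '$' and '\', passing it to appendReplacement(StringBuffer, String replacement) can fail: in a replacement string, '$' is the selector for a capturing group and '\' is the escape character. The JDK source shows exactly how appendReplacement handles them: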

char nextChar = replacement.charAt(cursor);
if (nextChar == '\\') { // on '\\': skip it and append the following char to the buffer as-is
    cursor++;
    nextChar = replacement.charAt(cursor);
    result.append(nextChar);
    cursor++;
} else if (nextChar == '$') { // on '$': what happens depends on the next char
    // Skip past $ (note: the '$' itself is consumed here!)
    cursor++;
    // A StringIndexOutOfBoundsException is thrown if
    // this "$" is the last character in replacement
    // string in current implementation, a IAE might be
    // more appropriate.
    nextChar = replacement.charAt(cursor);
    int refNum = -1;
    if (nextChar == '{') {
        cursor++; // skip '{'
        StringBuilder gsb = new StringBuilder();
        while (cursor < replacement.length()) { // collect the letters and digits after '{'
            nextChar = replacement.charAt(cursor);
            if (ASCII.isLower(nextChar) ||
                ASCII.isUpper(nextChar) ||
                ASCII.isDigit(nextChar)) {
                gsb.append(nextChar);
                cursor++;
            } else {
                break;
            }
        }
        if (gsb.length() == 0) // nothing collected: report an error
            throw new IllegalArgumentException(
                "named capturing group has 0 length name");
        if (nextChar != '}')
            throw new IllegalArgumentException(
                "named capturing group is missing trailing '}'");
        String gname = gsb.toString();
        if (ASCII.isDigit(gname.charAt(0))) // a group name cannot start with a digit
            throw new IllegalArgumentException(
                "capturing group name {" + gname +
                "} starts with digit character");
        if (!parentPattern.namedGroups().containsKey(gname)) // look the group up in the pattern
            throw new IllegalArgumentException(
                "No group with name {" + gname + "}");
        refNum = parentPattern.namedGroups().get(gname);
        cursor++;
    } else { // otherwise the next char must be a digit
        // The first number is always a group
        refNum = (int)nextChar - '0';
        if ((refNum < 0)||(refNum > 9))
            throw new IllegalArgumentException(
                "Illegal group reference");
        cursor++;
        // Capture the largest legal group string
        boolean done = false;
        while (!done) {
            if (cursor >= replacement.length()) {
                break;
            }
            int nextDigit = replacement.charAt(cursor) - '0';
            if ((nextDigit < 0)||(nextDigit > 9)) { // not a number
                break;
            }
            int newRefNum = (refNum * 10) + nextDigit;
            if (groupCount() < newRefNum) {
                done = true;
            } else {
                refNum = newRefNum;
                cursor++;
            }
        }
    }
    // Append group
    if (start(refNum) != -1 && end(refNum) != -1)
        result.append(text, start(refNum), end(refNum));
}
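To make the parsing rules above concrete, here is a minimal sketch that feeds a few replacement strings through appendReplacement. The class name ReplacementSyntaxDemo, the helper replaceEach and the sample inputs are made up for illustration and are not part of the original project.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplacementSyntaxDemo {
    // Hypothetical helper: replaces every match of regex in input via
    // Matcher.appendReplacement, so the replacement string is parsed
    // by the logic quoted above.
    static String replaceEach(String input, String regex, String replacement) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuffer buf = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(buf, replacement);
        }
        m.appendTail(buf);
        return buf.toString();
    }

    public static void main(String[] args) {
        // "$1" is read as a reference to capturing group 1.
        System.out.println(replaceEach("price: 42", "(\\d+)", "<$1>"));  // price: <42>

        // A backslash followed by '$' yields a literal '$'; the second '$' is still a group reference.
        System.out.println(replaceEach("price: 42", "(\\d+)", "\\$$1")); // price: $42

        // '$' followed by something that is neither a digit nor '{' is rejected.
        try {
            replaceEach("price: 42", "(\\d+)", "$x");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The third case is the one that bites when the replacement is built from scraped text: the content decides whether the call succeeds.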
The fix is Matcher.quoteReplacement(), which is implemented as follows:

if ((s.indexOf('\\') == -1) && (s.indexOf('$') == -1))
    return s;
StringBuilder sb = new StringBuilder();
for (int i=0; i<s.length(); i++) {
    char c = s.charAt(i);
    if (c == '\\' || c == '$') {
        sb.append('\\');
    }
    sb.append(c);
}
return sb.toString();

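It inserts a '\' in front of every special character ('\' or '$'), so that appendReplacement later emits each of them literally.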
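Applied to a find/appendReplacement loop like the one at the top of these notes, this could look roughly as follows. It is only a sketch: the class QuoteReplacementDemo, the helper bracketBold and the simplified `<b>` pattern are hypothetical stand-ins for the real scraping code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteReplacementDemo {
    // Hypothetical helper: rewrites <b>text</b> as [text]. The captured text may
    // contain '$' or '\', so the whole replacement is quoted before use.
    static String bracketBold(String html) {
        Matcher m = Pattern.compile("(?s)<b>(.*?)</b>").matcher(html);
        StringBuffer buf = new StringBuffer();
        while (m.find()) {
            String replacement = "[" + m.group(1) + "]";
            // Without quoteReplacement, "$5" in the captured text would be
            // parsed as a reference to group 5.
            m.appendReplacement(buf, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(buf);
        return buf.toString();
    }

    public static void main(String[] args) {
        System.out.println(bracketBold("cost <b>$5</b>, path <b>C:\\tmp</b>"));
        // cost [$5], path [C:\tmp]
    }
}
```

Quoting the fully assembled replacement works here because it contains no intended group references; if it did, only the dynamic parts should be quoted.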

In addition, String.replace()

public String replace(CharSequence target, CharSequence replacement) {
    return Pattern.compile(target.toString(), Pattern.LITERAL).matcher(
            this).replaceAll(Matcher.quoteReplacement(replacement.toString()));
}

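is implemented on top of Matcher.replaceAll, at least in the JDK version quoted here: the target is compiled as a LITERAL pattern and the replacement goes through Matcher.quoteReplacement, so neither argument is interpreted as regex syntax and no manual escaping of '$' or '\' is needed.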
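The practical difference between String.replace and String.replaceAll can be seen in a small sketch. The class and method names below are invented for the example.

```java
import java.util.regex.Matcher;

class ReplaceVsReplaceAllDemo {
    // Literal replacement: "." and "$" are taken as plain text.
    static String literalReplace(String s) {
        return s.replace(".", "$");
    }

    // Regex replacement: the dot must be escaped in the pattern and the
    // replacement quoted to get the same result.
    static String regexReplace(String s) {
        return s.replaceAll("\\.", Matcher.quoteReplacement("$"));
    }

    public static void main(String[] args) {
        System.out.println(literalReplace("a.b.c"));          // a$b$c
        System.out.println(regexReplace("a.b.c"));            // a$b$c
        // An unescaped "." in replaceAll matches every character.
        System.out.println("a.b.c".replaceAll(".", "-"));     // -----
    }
}
```

When no regex is needed, String.replace is the simpler choice because it does the quoting on both sides itself.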