I have a regex to get the src and the remaining attributes for all the images present in the content.
<img *((.|\s)*?) *src *= *['"]([^'"]*)['"] *((.|\s)*?) */*>
If the content I am matching against looks like this:
<img src=src1"/> <img src=src2"/>
then find(index) hangs, and I see the following in the thread dump:
at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
Is there a solution or a workaround for solving this issue?
2 solutions
#1
1
A workaround is to use an HTML parser such as jsoup, for example:
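import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupImages {
    public static void main(String[] args) {
        Document doc = Jsoup.parse("<html><img src=\"src1\"/> <img src=\"src2\"/></html>");
        // Select every <img> element that carries a src attribute.
        Elements elements = doc.select("img[src]");
        for (Element element : elements) {
            System.out.println(element.attr("src"));
            System.out.println(element.attr("alt")); // empty string when the attribute is absent
            System.out.println(element.attr("height"));
            System.out.println(element.attr("width"));
        }
    }
}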
#2
0
It looks like what you've got is an "evil regex", which is not uncommon when you try to construct one complicated regex to match one thing (src) within another thing (img). In particular, evil regexes usually arise when you apply repetition to a complex subexpression, which is what you are doing with (.|\s)*?.
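To make the problem concrete: in (.|\s)*? both branches of the alternation can match a space or a tab, so when the overall match fails the engine may retry the same text along a very large number of equivalent paths. The repeating LazyLoop/GroupHead/Branch frames in the thread dump are consistent with that kind of deep backtracking. The sketch below is one possible single-regex rewrite, not the original poster's code: it swaps each (.|\s)*? for the character class [^>]*, which cannot run past the end of the tag. The class name ImgSrcPattern and the method srcValues are made up for illustration, and the pattern assumes attribute values never contain '>'.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgSrcPattern {
    // Hypothetical rewrite of the question's pattern.
    // group 1: attributes before src, group 2: src value, group 3: attributes after src.
    // [^>]* is a single character class, so there is no alternation to retry,
    // and it stops at the first '>' instead of scanning the rest of the input.
    private static final Pattern IMG = Pattern.compile(
        "<img\\b([^>]*?)\\bsrc\\s*=\\s*['\"]([^'\"]*)['\"]([^>]*)>",
        Pattern.CASE_INSENSITIVE);

    /** Returns the src value of every well-formed img tag found in html. */
    public static List<String> srcValues(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = IMG.matcher(html);
        while (m.find()) {
            result.add(m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        // Well-formed tags: prints [src1, src2]
        System.out.println(srcValues("<img alt='a' src=\"src1\"/> <img src='src2' width=\"5\">"));
        // The malformed input from the question (missing opening quotes): prints []
        System.out.println(srcValues("<img src=src1\"/> <img src=src2\"/>"));
    }
}
```

On the malformed input the matcher gives up at the first '>' of each tag, so a failed match costs time bounded by the tag length rather than by the whole document.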
A better approach would be to use two regexes: one to match all <img> tags, and another to match the src attribute within each tag.
My Java's rusty, so I'll just give you the pseudocode solution:
foreach( imgTag in input.match( /<img .*?>/ig ) ) {
    src = imgTag.match( /\bsrc *= *(['\"])(.*?)\1/i );
    // if you want to get other attributes, you can do that the same way:
    alt = imgTag.match( /\balt *= *(['\"])(.*?)\1/i );
    // even better, you can get all the attributes in one go:
    attrs = imgTag.match( /\b(\w+) *= *(['\"])(.*?)\2/g );
    // attrs is now an array where the first group is the attr name
    // (alt, height, width, src, etc.) and the third group is the
    // attr value (the second group is the quote character)
}
Note the use of a backreference to match the appropriate type of closing quote (i.e., this will match src='abc' and src="abc"). Also note that the quantifiers are lazy here (*? instead of just *); this is necessary to prevent too much from being consumed.
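As a small illustration of both points, here is a sketch in Java. The class name QuoteBackreference and its two helper methods are made up for this example. The first helper uses the lazy pattern from the pseudocode, the second uses a greedy variant for comparison.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteBackreference {
    // Group 1 captures the opening quote; \1 requires the same character to close the value.
    private static final Pattern LAZY =
        Pattern.compile("\\bsrc *= *(['\"])(.*?)\\1", Pattern.CASE_INSENSITIVE);
    // Same pattern with a greedy quantifier, shown only for comparison.
    private static final Pattern GREEDY =
        Pattern.compile("\\bsrc *= *(['\"])(.*)\\1", Pattern.CASE_INSENSITIVE);

    /** Returns the src value, or null when no consistently quoted src attribute is found. */
    public static String src(String tag) {
        Matcher m = LAZY.matcher(tag);
        return m.find() ? m.group(2) : null;
    }

    /** Greedy counterpart of src(); it runs on to the last matching quote in the tag. */
    public static String srcGreedy(String tag) {
        Matcher m = GREEDY.matcher(tag);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(src("<img src='abc'>"));                  // abc
        System.out.println(src("<img src=\"it's.png\">"));           // it's.png
        System.out.println(src("<img src='abc\">"));                 // null (quotes do not match)
        System.out.println(srcGreedy("<img src=\"a\" alt=\"b\">"));  // a" alt="b
    }
}
```

The last line shows what "too much being consumed" means in practice: the greedy version swallows the following attribute as part of the src value.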
EDIT: even though my Java's rusty, I was able to crank out an example. Here's the solution in Java:
import java.util.regex.*;

public class Regex {
    public static void main( String[] args ) {
        String input = "<img alt=\"altText\" src=\"src\" height=\"50\" width=\"50\"/> <img alt='another image' src=\"foo.jpg\" />";
        // group 1: attribute name, group 2: quote character, group 3: attribute value
        Pattern attrPat = Pattern.compile( "\\b(\\w+) *= *(['\"])(.*?)\\2" );
        // first pass: find each <img ...> tag
        Matcher imgMatcher = Pattern.compile( "<img .*?>" ).matcher( input );
        while( imgMatcher.find() ) {
            String imgTag = imgMatcher.group();
            System.out.println( imgTag );
            // second pass: pull the attributes out of this tag only
            Matcher attrMatcher = attrPat.matcher( imgTag );
            while( attrMatcher.find() ) {
                String attr = attrMatcher.group(1);
                String value = attrMatcher.group(3);
                System.out.format( "\tattr: %s, value: %s\n", attr, value );
            }
        }
    }
}
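If the attributes are needed as data rather than printed, the same two-pass idea can be wrapped in a method. This is only a sketch under a few assumptions: the class name ImgAttributes and the method parse are invented for illustration, the tag pattern uses [^>]* (so it assumes attribute values never contain '>'), and attribute names are lower-cased so that lookups such as get("src") also work for SRC=.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgAttributes {
    // First pass: one img tag, from "<img" up to the next '>'.
    private static final Pattern IMG_TAG =
        Pattern.compile("<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Second pass: name = quote value quote, with \2 matching the opening quote.
    private static final Pattern ATTR =
        Pattern.compile("\\b(\\w+) *= *(['\"])(.*?)\\2");

    /** Returns one map of attribute name to value per img tag, in document order. */
    public static List<Map<String, String>> parse(String html) {
        List<Map<String, String>> images = new ArrayList<>();
        Matcher img = IMG_TAG.matcher(html);
        while (img.find()) {
            // LinkedHashMap keeps the attributes in the order they appear in the tag.
            Map<String, String> attrs = new LinkedHashMap<>();
            Matcher attr = ATTR.matcher(img.group());
            while (attr.find()) {
                attrs.put(attr.group(1).toLowerCase(Locale.ROOT), attr.group(3));
            }
            images.add(attrs);
        }
        return images;
    }

    public static void main(String[] args) {
        String input = "<img alt=\"altText\" src=\"src\" height=\"50\" width=\"50\"/> "
            + "<img alt='another image' src=\"foo.jpg\" />";
        for (Map<String, String> attrs : parse(input)) {
            System.out.println(attrs);
        }
        // prints:
        // {alt=altText, src=src, height=50, width=50}
        // {alt=another image, src=foo.jpg}
    }
}
```

For anything beyond simple, well-formed markup, an HTML parser is still the safer choice.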