I'm writing a PowerShell script that extracts URLs from ASPX files and tests whether their HTTP status code is 200.
I found the following regex to extract the URLs:
$regex = "(http[s]?|[s]?ftp[s]?)(:\/\/)([^\s,]+)"
select-string -Path $path -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value }
But the output looks like this:
https://code.jquery.com/ui/1.9.0/themes/base/jquery-ui.css"/>
https://code.jquery.com/ui/1.11.4/jquery-ui.min.js"></script>
As you can see, the match does not stop at the end of the URL and includes the trailing HTML tag characters.
How can I change my regex to get the URLs without the trailing HTML tag characters?
1 solution
#1
If you look at the [^\s,] negated character class, you will see that it matches any character except whitespace and a comma. In the input you have, the ", < and > characters can therefore all be matched by [^\s,].
A fix for the current situation is to add the <, > and " characters to the negated character class, so that the regex engine stops when it reaches any of them.
Note that since you only extract whole match values, you can also simplify the pattern: remove the unnecessary groups and turn the first one into a non-capturing group:
$regex = '(?:http|s?ftp)s?://[^\s,<>"]+'
Note that in .NET patterns, / does not need to be escaped (it is not a special regex metacharacter or operator).
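As a quick check of the pattern, here is a minimal sketch that runs it against two sample lines instead of a file. The $sampleLines content is an assumption, modelled on the output shown in the question; the real ASPX markup may differ.

```powershell
# Hypothetical sample input, modelled on the output shown in the question.
$sampleLines = @(
    '<link rel="stylesheet" href="https://code.jquery.com/ui/1.9.0/themes/base/jquery-ui.css"/>',
    '<script src="https://code.jquery.com/ui/1.11.4/jquery-ui.min.js"></script>'
)

# Single-quoted string, so PowerShell passes the pattern to the regex engine unchanged.
$regex = '(?:http|s?ftp)s?://[^\s,<>"]+'

# Select-String accepts pipeline input, so no file is needed for this check.
$urls = $sampleLines |
    Select-String -Pattern $regex -AllMatches |
    ForEach-Object { $_.Matches } |
    ForEach-Object { $_.Value }

# Print the extracted URLs, one per line.
$urls
```

With this input, each match ends just before the closing " of the attribute. If some attributes in your files use single quotes, the ' character would also have to go into the negated character class (written as '' inside a single-quoted PowerShell string).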