Regex to read URLs from ASPX files with PowerShell

I'm writing a PowerShell script that extracts URLs from ASPX files and tests whether their HTTP status code is 200.

I found the following regex to get the URLs:

$regex = "(http[s]?|[s]?ftp[s]?)(:\/\/)([^\s,]+)"
select-string -Path $path -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value }

But the output looks like this:

https://code.jquery.com/ui/1.9.0/themes/base/jquery-ui.css"/>
https://code.jquery.com/ui/1.11.4/jquery-ui.min.js"></script>

As you can see, the matches still include the trailing characters of the HTML tags.

How can I edit my regex to get the URLs without the HTML tag characters at the end?

1 Solution

#1

If you have a look at the [^\s,] negated character class, you will see it matches any character except whitespace and a comma. If you look at the input you have, you will notice that ", < and > are all matched by [^\s,].

A fix for the current situation is to add the <, > and " chars to the negated character class so that the regex engine "stops" when it comes across any of them.
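
As a minimal sketch of that fix, the original pattern can be kept as it is with only the three extra characters added to the class. The two sample lines below are an assumption: the question shows only the matched output, so the surrounding HTML is made up for illustration.

```powershell
# Assumed sample input: HTML lines that would produce the output shown in the question.
$lines = @(
    '<link href="https://code.jquery.com/ui/1.9.0/themes/base/jquery-ui.css"/>',
    '<script src="https://code.jquery.com/ui/1.11.4/jquery-ui.min.js"></script>'
)

# Original pattern with <, > and " added to the negated character class.
# Single quotes are used so the " inside the pattern needs no escaping.
$regex = '(http[s]?|[s]?ftp[s]?)(:\/\/)([^\s,<>"]+)'

# Same pipeline as in the question, fed with strings instead of -Path.
$urls = $lines |
    Select-String -Pattern $regex -AllMatches |
    ForEach-Object { $_.Matches } |
    ForEach-Object { $_.Value }

$urls
```

With this input, each match should end right before the closing " of the attribute, so the "/> and "></script> tails are no longer part of the result.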

Note that since you extract whole matches, you can refactor the pattern a bit: remove the unnecessary groups and turn the first one into a non-capturing group:

$regex = '(?:http|s?ftp)s?://[^\s,<>"]+'

Mind that in .NET patterns, / does not need to be escaped (it is not a special regex metacharacter/operator).
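
Putting the pieces together, the refactored pattern could feed the status check mentioned in the question roughly as sketched here. Get-UrlFromText and Test-UrlStatus are hypothetical helper names, not existing cmdlets, and using a HEAD request is an assumption: some servers reject HEAD, and the check only makes sense for http(s) URLs, not for ftp ones.

```powershell
# Refactored pattern from above.
$regex = '(?:http|s?ftp)s?://[^\s,<>"]+'

# Hypothetical helper: returns every URL found in a string.
function Get-UrlFromText {
    param(
        [string]$Text,
        [string]$Pattern = $regex
    )
    [regex]::Matches($Text, $Pattern) | ForEach-Object { $_.Value }
}

# Hypothetical helper: returns $true when the URL answers with HTTP 200.
function Test-UrlStatus {
    param([string]$Url)
    try {
        $response = Invoke-WebRequest -Uri $Url -Method Head -UseBasicParsing
        return ($response.StatusCode -eq 200)
    }
    catch {
        # Error status codes and network failures both end up here.
        return $false
    }
}

# Assumed sample input, built from one of the URLs in the question.
$sample = '<script src="https://code.jquery.com/ui/1.11.4/jquery-ui.min.js"></script>'
$found = @(Get-UrlFromText -Text $sample)
$found

# Possible usage against a real file ($path as in the question); needs network access:
# Get-UrlFromText -Text (Get-Content -Path $path -Raw) |
#     Where-Object { -not (Test-UrlStatus -Url $_) }
```

The commented usage would list only the URLs that do not return 200, which is usually the interesting part when checking links.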
