Getting parts of a URL (Regex)

Time: 2022-02-25 16:52:47

Given the URL (single line):
http://test.example.com/dir/subdir/file.html

How can I extract the following parts using regular expressions:

  1. The Subdomain (test)
  2. The Domain (example.com)
  3. The path without the file (/dir/subdir/)
  4. The file (file.html)
  5. The path with the file (/dir/subdir/file.html)
  6. The URL without the path (http://test.example.com)
  7. (add any other that you think would be useful)

The regex should work correctly even if I enter the following URL:
http://example.example.com/example/example/example.html

Thank you.

26 answers

#1


121  

A single regex to parse and break up a full URL, including query parameters and anchors, e.g.

https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash

^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$

RegExp positions:

url: RegExp['$&'],
protocol: RegExp.$2,
host: RegExp.$3,
path: RegExp.$4,
file: RegExp.$6,
query: RegExp.$7,
hash: RegExp.$8

You could then further parse the host ('.' delimited) quite easily.

What I would do is use something like this:

/*
    ^(.*:)//([A-Za-z0-9\-\.]+)(:[0-9]+)?(.*)$
*/
proto $1
host $2
port $3
the-rest $4

Then further parse 'the rest' to be as specific as possible. Doing it in one regex is, well, a bit crazy.
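
The two-step idea can be sketched in JavaScript roughly as follows. This is a minimal sketch that assumes an absolute URL with a scheme; the function name splitUrl and the second pattern (REST) are illustrative assumptions and are not part of the answer.

```javascript
// Step 1: the coarse pattern from the answer (scheme, host, optional port, rest).
const COARSE = /^(.*:)\/\/([A-Za-z0-9\-.]+)(:[0-9]+)?(.*)$/;

// Step 2 (assumption): split "the rest" into directory, file, query and hash.
const REST = /^(\/(?:[^\/?#]*\/)*)?([^\/?#]*)(\?[^#]*)?(#.*)?$/;

function splitUrl(url) {
  const coarse = COARSE.exec(url);
  if (coarse === null) return null;
  const [, proto, host, port = '', rest] = coarse;

  const fine = REST.exec(rest);
  if (fine === null) return null; // e.g. user info, which the coarse pattern does not cover
  const [, dir = '', file = '', query = '', hash = ''] = fine;

  return { proto, host, port, dir, file, query, hash };
}

console.log(splitUrl('http://test.example.com/dir/subdir/file.html'));
```

Keeping the second pattern separate means it can be tightened or replaced without touching the part that finds the host.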

#2


75  

I realize I'm late to the party, but there is a simple way to let the browser parse a url for you without a regex:

var a = document.createElement('a');
a.href = 'http://www.example.com:123/foo/bar.html?fox=trot#foo';

['href','protocol','host','hostname','port','pathname','search','hash'].forEach(function(k) {
    console.log(k+':', a[k]);
});

/*//Output:
href: http://www.example.com:123/foo/bar.html?fox=trot#foo
protocol: http:
host: www.example.com:123
hostname: www.example.com
port: 123
pathname: /foo/bar.html
search: ?fox=trot
hash: #foo
*/

#3


43  

I'm a few years late to the party, but I'm surprised no one has mentioned the Uniform Resource Identifier specification has a section on parsing URIs with a regular expression. The regular expression, written by Berners-Lee, et al., is:

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9

The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis). We refer to the value matched for subexpression <n> as $<n>. For example, matching the above expression to

http://www.ics.uci.edu/pub/ietf/uri/#Related

results in the following subexpression matches:

$1 = http:
$2 = http
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu
$5 = /pub/ietf/uri/
$6 = <undefined>
$7 = <undefined>
$8 = #Related
$9 = Related

For what it's worth, I found that I had to escape the forward slashes in JavaScript:

^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
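
As a quick check, the escaped pattern can be applied in JavaScript like this. It is a minimal sketch; the function name parseUri and the property names (taken from the RFC's component names) are assumptions made for illustration.

```javascript
// RFC 3986, Appendix B pattern with forward slashes escaped for a JS regex literal.
const URI_RE = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// Map the numbered groups onto names: $2, $4, $5, $7 and $9.
function parseUri(uri) {
  const m = URI_RE.exec(uri); // every part is optional, so this always matches
  return {
    scheme: m[2],
    authority: m[4],
    path: m[5],
    query: m[7],
    fragment: m[9],
  };
}

console.log(parseUri('http://www.ics.uci.edu/pub/ietf/uri/#Related'));
```

Groups that did not take part in the match come back as undefined, which corresponds to the <undefined> entries in the table above.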

#4


30  

I found the highest voted answer (hometoast's answer) doesn't work perfectly for me. Two problems:

  1. It cannot handle a port number.
  2. The hash part is broken.

The following is a modified version:

^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$

The positions of the parts are as follows:

int SCHEMA = 2, DOMAIN = 3, PORT = 5, PATH = 6, FILE = 8, QUERYSTRING = 9, HASH = 12

Edit posted by anon user:

function getFileName(path) {
    return path.match(/^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/[\w\/-]+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$/i)[8];
}

#5


10  

I needed a regular Expression to match all urls and made this one:

/(?:([^\:]*)\:\/\/)?(?:([^\:\@]*)(?:\:([^\@]*))?\@)?(?:([^\/\:]*)\.(?=[^\.\/\:]*\.[^\.\/\:]*))?([^\.\/\:]*)(?:\.([^\/\.\:]*))?(?:\:([0-9]*))?(\/[^\?#]*(?=.*?\/)\/)?([^\?#]*)?(?:\?([^#]*))?(?:#(.*))?/

It matches all urls, any protocol, even urls like

ftp://user:pass@www.cs.server.com:8080/dir1/dir2/file.php?param1=value1#hashtag

The result (in JavaScript) looks like this:

["ftp", "user", "pass", "www.cs", "server", "com", "8080", "/dir1/dir2/", "file.php", "param1=value1", "hashtag"]

A URL like

mailto://admin@www.cs.server.com

looks like this:

["mailto", "admin", undefined, "www.cs", "server", "com", undefined, undefined, undefined, undefined, undefined] 

#6


6  

This is not a direct answer, but most web libraries have a function that accomplishes this task. The function is often called something similar to CrackUrl. If such a function exists, use it; it is almost guaranteed to be more reliable and more efficient than any hand-crafted code.

#7


6  

I was trying to solve this in javascript, which should be handled by:

var url = new URL('http://a:b@example.com:890/path/wah@t/foo.js?foo=bar&bingobang=&king=kong@kong.com#foobar/bing/bo@ng?bang');

since (in Chrome, at least) it parses to:

{
  "hash": "#foobar/bing/bo@ng?bang",
  "search": "?foo=bar&bingobang=&king=kong@kong.com",
  "pathname": "/path/wah@t/foo.js",
  "port": "890",
  "hostname": "example.com",
  "host": "example.com:890",
  "password": "b",
  "username": "a",
  "protocol": "http:",
  "origin": "http://example.com:890",
  "href": "http://a:b@example.com:890/path/wah@t/foo.js?foo=bar&bingobang=&king=kong@kong.com#foobar/bing/bo@ng?bang"
}

However, this isn't cross browser (https://developer.mozilla.org/en-US/docs/Web/API/URL), so I cobbled this together to pull the same parts out as above:

^(?:(?:(([^:\/#\?]+:)?(?:(?:\/\/)(?:(?:(?:([^:@\/#\?]+)(?:\:([^:@\/#\?]*))?)@)?(([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])(?:\:([0-9]+))?))?)?)?((?:\/?(?:[^\/\?#]+\/+)*)(?:[^\?#]*)))?(\?[^#]+)?)(#.*)?

Credit for this regex goes to https://gist.github.com/rpflorence who posted this jsperf http://jsperf.com/url-parsing (originally found here: https://gist.github.com/jlong/2428561#comment-310066) who came up with the regex this was originally based on.

The parts are in this order:

var keys = [
    "href",                    // http://user:pass@host.com:81/directory/file.ext?query=1#anchor
    "origin",                  // http://user:pass@host.com:81
    "protocol",                // http:
    "username",                // user
    "password",                // pass
    "host",                    // host.com:81
    "hostname",                // host.com
    "port",                    // 81
    "pathname",                // /directory/file.ext
    "search",                  // ?query=1
    "hash"                     // #anchor
];

There is also a small library which wraps it and provides query params:

https://github.com/sadams/lite-url (also available on bower)

If you have an improvement, please create a pull request with more tests and I will accept and merge with thanks.
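
Where the URL object mentioned at the top of this answer is available (current browsers and Node.js), the six parts from the question can be derived from it without any regex. The following is only a sketch: the function name urlParts and its property names are made up for illustration, and the subdomain/domain split naively assumes that the registrable domain is the last two host labels.

```javascript
// Derive the parts asked for in the question from the standard URL object.
// Assumption: the registrable domain is the last two host labels, which is
// wrong for suffixes such as "co.uk".
function urlParts(input) {
  const u = new URL(input);
  const labels = u.hostname.split('.');
  const lastSlash = u.pathname.lastIndexOf('/');
  return {
    subdomain: labels.slice(0, -2).join('.'),
    domain: labels.slice(-2).join('.'),
    directory: u.pathname.slice(0, lastSlash + 1),
    file: u.pathname.slice(lastSlash + 1),
    pathWithFile: u.pathname,
    urlWithoutPath: u.origin,
  };
}

console.log(urlParts('http://test.example.com/dir/subdir/file.html'));
console.log(urlParts('http://example.example.com/example/example/example.html'));
```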

#8


5  

Subdomain and domain are difficult because the subdomain can have several parts, as can the top-level domain, e.g. http://sub1.sub2.domain.co.uk/

 the path without the file : http://[^/]+/((?:[^/]+/)*(?:[^/]+$)?)  
 the file : http://[^/]+/(?:[^/]+/)*((?:[^/.]+\.)+[^/.]+)$  
 the path with the file : http://[^/]+/(.*)  
 the URL without the path : (http://[^/]+/)  

(Markdown isn't very friendly to regexes)

#9


5  

This improved version should work as reliably as a parser.

   // Applies to URI, not just URL or URN:
   //    http://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Relationship_to_URL_and_URN
   //
   // http://labs.apache.org/webarch/uri/rfc/rfc3986.html#regexp
   //
   // (?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?
   //
   // http://en.wikipedia.org/wiki/URI_scheme#Generic_syntax
   //
   // $@ matches the entire uri
   // $1 matches scheme (ftp, http, mailto, mshelp, ymsgr, etc)
   // $2 matches authority (host, user:pwd@host, etc)
   // $3 matches path
   // $4 matches query (http GET REST api, etc)
   // $5 matches fragment (html anchor, etc)
   //
   // Match specific schemes, non-optional authority, disallow white-space so can delimit in text, and allow 'www.' w/o scheme
   // Note the schemes must match ^[^\s|:/?#]+(?:\|[^\s|:/?#]+)*$
   //
   // (?:()(www\.[^\s/?#]+\.[^\s/?#]+)|(schemes)://([^\s/?#]*))([^\s?#]*)(?:\?([^\s#]*))?(#(\S*))?
   //
   // Validate the authority with an orthogonal RegExp, so the RegExp above won’t fail to match any valid urls.
   function uriRegExp( flags, schemes/* = null*/, noSubMatches/* = false*/ )
   {
      if( !schemes )
         schemes = '[^\\s:\/?#]+'
      else if( !RegExp( /^[^\s|:\/?#]+(?:\|[^\s|:\/?#]+)*$/ ).test( schemes ) )
         throw TypeError( 'expected URI schemes' )
      return noSubMatches ? new RegExp( '(?:www\\.[^\\s/?#]+\\.[^\\s/?#]+|' + schemes + '://[^\\s/?#]*)[^\\s?#]*(?:\\?[^\\s#]*)?(?:#\\S*)?', flags ) :
         new RegExp( '(?:()(www\\.[^\\s/?#]+\\.[^\\s/?#]+)|(' + schemes + ')://([^\\s/?#]*))([^\\s?#]*)(?:\\?([^\\s#]*))?(?:#(\\S*))?', flags )
   }

   // http://en.wikipedia.org/wiki/URI_scheme#Official_IANA-registered_schemes
   function uriSchemesRegExp()
   {
      return 'about|callto|ftp|gtalk|http|https|irc|ircs|javascript|mailto|mshelp|sftp|ssh|steam|tel|view-source|ymsgr'
   }

#10


5  

Try the following:

^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((/?\w+/)+|/?)(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?

It supports HTTP / FTP, subdomains, folders, files etc.

I found it from a quick google search:

http://geekswithblogs.net/casualjim/archive/2005/12/01/61722.aspx

#11


4  

/^((?P<scheme>https?|ftp):\/)?\/?((?P<username>.*?)(:(?P<password>.*?)|)@)?(?P<hostname>[^:\/\s]+)(?P<port>:([^\/]*))?(?P<path>(\/\w+)*\/)(?P<filename>[-\w.]+[^#?\s]*)?(?P<query>\?([^#]*))?(?P<fragment>#(.*))?$/

From my answer on a similar question. Works better than some of the others mentioned because they had some bugs (such as not supporting username/password, not supporting single-character filenames, fragment identifiers being broken).

#12


4  

I propose a much more readable solution (in Python, but it applies to any regex):

import re

def url_path_to_dict(path):
    # One named group per URL part; the lazy quantifiers keep each part
    # from swallowing the parts that follow it.
    pattern = (r'^'
               r'((?P<schema>.+?)://)?'
               r'((?P<user>.+?)(:(?P<password>.*?))?@)?'
               r'(?P<host>.*?)'
               r'(:(?P<port>\d+?))?'
               r'(?P<path>/.*?)?'
               r'(?P<query>[?].*?)?'
               r'$'
               )
    regex = re.compile(pattern)
    m = regex.match(path)
    d = m.groupdict() if m is not None else None

    return d

def main():
    print(url_path_to_dict('http://example.example.com/example/example/example.html'))

if __name__ == '__main__':
    main()

Prints:

{
'host': 'example.example.com', 
'user': None, 
'path': '/example/example/example.html', 
'query': None, 
'password': None, 
'port': None, 
'schema': 'http'
}

#13


2  

You can get the scheme (http/https), host, port, path, and query by using the Uri object in .NET. The difficult task is breaking the host into subdomain, domain name, and TLD.

There is no standard way to do so, and you can't simply use string parsing or a regex to produce the correct result. At first I used a regex function, but it could not parse the subdomain correctly for every URL. The practical way is to use a list of TLDs. Once the TLD of a URL is identified, the part to its left is the domain and whatever remains is the subdomain.

However, the list needs to be maintained, since new TLDs can be added. As far as I know, publicsuffix.org maintains the latest list, and you can use the domainname-parser tool from Google Code to parse the public suffix list and get the subdomain, domain, and TLD easily through a DomainName object: domainName.SubDomain, domainName.Domain, and domainName.TLD.

This answer is also helpful: Get the subdomain from a URL

CaLLMeLaNN
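
The list-based approach might look roughly like this in JavaScript. This is a sketch under loud assumptions: the three-entry SUFFIXES set is a stand-in for the real public suffix list, wildcard and exception rules are ignored, and the function name splitHost and its property names are hypothetical, only loosely mirroring the DomainName object described above.

```javascript
// Stand-in for the public suffix list; a real program would load the full,
// current list from publicsuffix.org. These three entries are illustrative.
const SUFFIXES = new Set(['com', 'uk', 'co.uk']);

// Split a host name into subdomain, domain and TLD using the longest
// matching suffix from the set above.
function splitHost(hostname) {
  const labels = hostname.toLowerCase().split('.');
  // Starting at i = 0 tries the longest candidate suffix first.
  for (let i = 0; i < labels.length; i++) {
    const tld = labels.slice(i).join('.');
    if (SUFFIXES.has(tld)) {
      return {
        subDomain: labels.slice(0, Math.max(i - 1, 0)).join('.'),
        domain: i > 0 ? labels[i - 1] : '',
        tld,
      };
    }
  }
  return null; // no known suffix
}

console.log(splitHost('sub1.sub2.domain.co.uk'));
console.log(splitHost('test.example.com'));
```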

#14


2  

I would recommend not using regex. An API call like WinHttpCrackUrl() is less error prone.

http://msdn.microsoft.com/en-us/library/aa384092%28VS.85%29.aspx

#15


2  

Sadly, this doesn't work with some URLs. Take, for example, this one: http://www.example.org/&value=329

Neither does &value=329

Or even with no parameters at all (a simple URL)!

I understand that the regex is expecting some seriously complex/long URL, but it should be able to work on simple ones as well, am I right?

#16


2  

None of the above worked for me. Here's what I ended up using:

/^(?:((?:https?|s?ftp):)\/\/)([^:\/\s]+)(?::(\d*))?(?:\/([^\s?#]+)?([?][^?#]*)?(#.*)?)?/

#17


2  

I like the regex that was published in "JavaScript: The Good Parts". It's not too short and not too complex. This page on GitHub also has the JavaScript code that uses it, but it can be adapted for any language. https://gist.github.com/voodooGQ/4057330

#18


1  

Java offers a URL class that will do this. Query URL Objects.

On a side note, PHP offers parse_url().

#19


1  

Here is one that is complete and doesn't rely on any protocol.

function getServerURL(url) {
    var m = url.match("(^(?:(?:.*?)?//)?[^/?#;]*)");
    console.log(m[1]) // Remove this
    return m[1];
}

getServerURL("http://dev.test.se")
getServerURL("http://dev.test.se/")
getServerURL("//ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js")
getServerURL("//")
getServerURL("www.dev.test.se/sdas/dsads")
getServerURL("www.dev.test.se/")
getServerURL("www.dev.test.se?abc=32")
getServerURL("www.dev.test.se#abc")
getServerURL("//dev.test.se?sads")
getServerURL("http://www.dev.test.se#321")
getServerURL("http://localhost:8080/sads")
getServerURL("https://localhost:8080?sdsa")

Prints

http://dev.test.se
http://dev.test.se
//ajax.googleapis.com
//
www.dev.test.se
www.dev.test.se
www.dev.test.se
www.dev.test.se
//dev.test.se
http://www.dev.test.se
http://localhost:8080
https://localhost:8080

#20


0  

Using http://www.fileformat.info/tool/regex.htm, hometoast's regex works great.

But here is the deal: I want to use different regex patterns in different situations in my program.

For example, I have this URL, and I have an enumeration that lists all supported URLs in my program. Each object in the enumeration has a method getRegexPattern that returns the regex pattern, which will then be used to compare with a URL. If the particular regex pattern returns true, then I know that this URL is supported by my program. So each enumeration has its own regex, depending on where it should look inside the URL.

Hometoast's suggestion is great, but in my case I think it wouldn't help (unless I copy and paste the same regex into all enumerations).

That is why I wanted the answer to give the regex for each situation separately. Although +1 for hometoast. ;)

#21


0  

I know you're claiming language-agnostic on this, but can you tell us what you're using, just so we know what regex capabilities you have?

If you have the capabilities for non-capturing matches, you can modify hometoast's expression so that subexpressions that you aren't interested in capturing are set up like this:

(?:SOMESTUFF)

You'd still have to copy and paste (and slightly modify) the regex into multiple places, but this makes sense: you're not just checking to see if the subexpression exists, but rather if it exists as part of a URL. Using the non-capturing modifier for subexpressions can give you what you need and nothing more, which, if I'm reading you correctly, is what you want.

Just as a small, small note, hometoast's expression doesn't need to put brackets around the 's' for 'https', since he only has one character in there. Quantifiers quantify the one character (or character class or subexpression) directly preceding them. So:

https?

would match 'http' or 'https' just fine.
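
As an illustration of both points, here is a JavaScript sketch of hometoast's pattern in which every group except the host is made non-capturing and http[s]? is shortened to https?. The constant name HOST_ONLY is an assumption; the pattern is otherwise the one from the top answer.

```javascript
// hometoast's pattern with every group made non-capturing except the host,
// so the host becomes group 1. "http[s]?" is also shortened to "https?".
const HOST_ONLY = /^(?:(?:https?|ftp):\/)?\/?([^:\/\s]+)(?:(?:\/\w+)*\/)(?:[\w\-.]+[^#?\s]+)(?:.*)?(?:#[\w\-]+)?$/;

const m = HOST_ONLY.exec('http://test.example.com/dir/subdir/file.html');
console.log(m[1]);     // the host
console.log(m.length); // the full match plus one capturing group
```

The whole URL still has to match, so the host is only reported when it appears in the context of a complete URL.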

#22


0  

A regexp to get the URL path without the file:

url = 'http://domain/dir1/dir2/somefile'
url.scan(%r{^(http://[^/]+)((?:/[^/]+)+(?=/))?/?(?:[^/]+)?$}i).to_s

It can be useful for adding a relative path to this URL.

#23


0  

String s = "https://www.thomas-bayer.com/axis2/services/BLZService?wsdl";

String regex = "(^http.?://)(.*?)([/\\?]{1,})(.*)";

System.out.println("1: " + s.replaceAll(regex, "$1"));
System.out.println("2: " + s.replaceAll(regex, "$2"));
System.out.println("3: " + s.replaceAll(regex, "$3"));
System.out.println("4: " + s.replaceAll(regex, "$4"));

Will provide the following output:
1: https://
2: www.thomas-bayer.com
3: /
4: axis2/services/BLZService?wsdl

If you change the URL to
String s = "https://www.thomas-bayer.com?wsdl=qwerwer&ttt=888"; the output will be the following :
1: https://
2: www.thomas-bayer.com
3: ?
4: wsdl=qwerwer&ttt=888

enjoy..
Yosi Lev

#24


0  

The regex to do full parsing is quite horrendous. I've included named backreferences for legibility, and broken each part into separate lines, but it still looks like this:

^(?:(?P<protocol>\w+(?=:\/\/))(?::\/\/))?
(?:(?P<host>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^\/?#:]+)(?::(?P<port>[0-9]+))?)\/)?
(?:(?P<path>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)\/)?
(?P<file>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)
(?:\?(?P<querystring>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^#])+))?
(?:#(?P<fragment>.*))?$

The reason it has to be so verbose is that, apart from the protocol and the port, any of the parts can contain HTML entities, which makes delineating the fragment quite tricky. So in the last few cases (the host, path, file, querystring, and fragment) we allow either any HTML entity or any character that isn't a ? or #. The regex for an HTML entity looks like this:

$htmlentity = "&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);"

When that is extracted (I used a mustache syntax to represent it), it becomes a bit more legible:

^(?:(?P<protocol>(?:ht|f)tps?|\w+(?=:\/\/))(?::\/\/))?
(?:(?P<host>(?:{{htmlentity}}|[^\/?#:])+(?::(?P<port>[0-9]+))?)\/)?
(?:(?P<path>(?:{{htmlentity}}|[^?#])+)\/)?
(?P<file>(?:{{htmlentity}}|[^?#])+)
(?:\?(?P<querystring>(?:{{htmlentity}};|[^#])+))?
(?:#(?P<fragment>.*))?$

In JavaScript, of course, you can't use named backreferences, so the regex becomes

^(?:(\w+(?=:\/\/))(?::\/\/))?(?:((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^\/?#:]+)(?::([0-9]+))?)\/)?(?:((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)\/)?((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)(?:\?((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^#])+))?(?:#(.*))?$

and in each match, the protocol is \1, the host is \2, the port is \3, the path \4, the file \5, the querystring \6, and the fragment \7.
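
For what it's worth, JavaScript engines that implement ECMAScript 2018 or later do accept named groups, written (?<name>...) rather than (?P<name>...). The sketch below rebuilds the legible version from its pieces under that assumption. The names htmlEntity, URL_RE and parseWithNames are illustrative, and the stray ; after the entity in the querystring line is dropped to match the fully expanded pattern.

```javascript
// The entity fragment from the answer, kept as a raw string so that
// backslashes would survive unchanged.
const htmlEntity = String.raw`&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);`;

// One line per URL part, with the entity fragment substituted in.
// As in the answer, the port group sits inside the host group.
const URL_RE = new RegExp(
  String.raw`^(?:(?<protocol>\w+(?=:\/\/)):\/\/)?` +
  String.raw`(?:(?<host>(?:${htmlEntity}|[^\/?#:])+(?::(?<port>[0-9]+))?)\/)?` +
  String.raw`(?:(?<path>(?:${htmlEntity}|[^?#])+)\/)?` +
  String.raw`(?<file>(?:${htmlEntity}|[^?#])+)` +
  String.raw`(?:\?(?<querystring>(?:${htmlEntity}|[^#])+))?` +
  String.raw`(?:#(?<fragment>.*))?$`
);

// Returns the named groups, or null when the input does not match.
function parseWithNames(url) {
  const m = URL_RE.exec(url);
  return m === null ? null : m.groups;
}

console.log(parseWithNames('http://test.example.com/dir/subdir/file.html?x=1#top'));
```

Groups that do not take part in a match, such as the port here, are reported as undefined.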

#25


0  

I tried a few of these and none covered my needs, especially the highest-voted one, which didn't match a URL without a path (http://example.com/).

Also, the lack of group names made it unusable in Ansible (or perhaps my Jinja2 skills are lacking).

So this is my version, slightly modified from the highest-voted answer here:

^((?P<protocol>http[s]?|ftp):\/)?\/?(?P<host>[^:\/\s]+)(?P<path>((\/\w+)*\/)([\w\-\.]+[^#?\s]+))*(.*)?(#[\w\-]+)?$

#26


-1  

//USING REGEX
/**
 * Parse URL to get information
 *
 * @param   url     the URL string to parse
 * @return  parsed  the URL parsed or null
 */
var UrlParser = function (url) {
    "use strict";

    var regx = /^(((([^:\/#\?]+:)?(?:(\/\/)((?:(([^:@\/#\?]+)(?:\:([^:@\/#\?]+))?)@)?(([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])(?:\:([0-9]+))?))?)?)?((\/?(?:[^\/\?#]+\/+)*)([^\?#]*)))?(\?[^#]+)?)(#.*)?/,
        matches = regx.exec(url),
        parser = null;

    if (null !== matches) {
        parser = {
            href              : matches[0],
            withoutHash       : matches[1],
            url               : matches[2],
            origin            : matches[3],
            protocol          : matches[4],
            protocolseparator : matches[5],
            credhost          : matches[6],
            cred              : matches[7],
            user              : matches[8],
            pass              : matches[9],
            host              : matches[10],
            hostname          : matches[11],
            port              : matches[12],
            pathname          : matches[13],
            segment1          : matches[14],
            segment2          : matches[15],
            search            : matches[16],
            hash              : matches[17]
        };
    }

    return parser;
};

var parsedURL=UrlParser(url);
console.log(parsedURL);

#1


121  

A single regex to parse and breakup a full URL including query parameters and anchors e.g.

单个正则表达式来解析和分解完整的URL,包括查询参数和锚点。

https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash

https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c哈希

^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$

^(http[s]? | ftp):\ /)? \ / ?([^:\ / \]+)((\ / \ w +)* \ /)((\ w \ \]+[^ # ? \ s]+)(. *)?(#(\ w \]+)?美元

RexEx positions:

RexEx职位:

url: RegExp['$&'],

url:正则表达式(“$ &’),

protocol:RegExp.$2,

协议:RegExp。2美元,

host:RegExp.$3,

主持人:RegExp。3美元,

path:RegExp.$4,

路径:RegExp。4美元,

file:RegExp.$6,

文件:RegExp。6美元,

query:RegExp.$7,

查询:RegExp。7美元,

hash:RegExp.$8

散列:RegExp。8美元

you could then further parse the host ('.' delimited) quite easily.

然后可以进一步解析主机('。很容易的”分隔)。

What I would do is use something like this:

我要做的是用这样的东西:

/*
    ^(.*:)//([A-Za-z0-9\-\.]+)(:[0-9]+)?(.*)$
*/
proto $1
host $2
port $3
the-rest $4

the further parse 'the rest' to be as specific as possible. Doing it in one regex is, well, a bit crazy.

进一步解析“其余”尽可能具体。在一个正则表达式里做这个,有点疯狂。

#2


75  

I realize I'm late to the party, but there is a simple way to let the browser parse a url for you without a regex:

我意识到我已经迟到了,但是有一种简单的方法可以让浏览器解析一个没有regex的url:

var a = document.createElement('a');
a.href = 'http://www.example.com:123/foo/bar.html?fox=trot#foo';

['href','protocol','host','hostname','port','pathname','search','hash'].forEach(function(k) {
    console.log(k+':', a[k]);
});

/*//Output:
href: http://www.example.com:123/foo/bar.html?fox=trot#foo
protocol: http:
host: www.example.com:123
hostname: www.example.com
port: 123
pathname: /foo/bar.html
search: ?fox=trot
hash: #foo
*/

#3


43  

I'm a few years late to the party, but I'm surprised no one has mentioned the Uniform Resource Identifier specification has a section on parsing URIs with a regular expression. The regular expression, written by Berners-Lee, et al., is:

我在晚会上迟到了好几年,但是我很惊讶没有人提到统一资源标识符规范有一个关于解析uri和正则表达式的部分。由Berners-Lee等人编写的正则表达式是:

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9

The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis). We refer to the value matched for subexpression as $. For example, matching the above expression to

上面第二行中的数字只是为了帮助可读性;它们表示每个子表达式的引用点(即:,每个配对的括号)。我们引用与子表达式匹配的值为$。例如,匹配上面的表达式。

http://www.ics.uci.edu/pub/ietf/uri/#Related

http://www.ics.uci.edu/pub/ietf/uri/相关

results in the following subexpression matches:

结果如下的子表达式匹配:

$1 = http:
$2 = http
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu
$5 = /pub/ietf/uri/
$6 = <undefined>
$7 = <undefined>
$8 = #Related
$9 = Related

For what it's worth, I found that I had to escape the forward slashes in JavaScript:

因为它的价值,我发现我不得不逃避JavaScript的前斜杠:

^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

^([^:\ / ? #]+):)?(\ / \ /((^ \ / ? #)*))?([^ ? #]*)(\ ?([^ #]*))?(#(. *))?

#4


30  

I found the highest voted answer (hometoast's answer) doesn't work perfectly for me. Two problems:

我找到了最高的答案(hometoast的回答)对我来说并不是很完美。两个问题:

  1. It can not handle port number.
  2. 它不能处理端口号。
  3. The hash part is broken.
  4. 散列部分被破坏。

The following is a modified version:

以下是修改后的版本:

^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$

Position of parts are as follows:

零件的位置如下:

int SCHEMA = 2, DOMAIN = 3, PORT = 5, PATH = 6, FILE = 8, QUERYSTRING = 9, HASH = 12

Edit posted by anon user:

由anon用户编辑:

function getFileName(path) {
    return path.match(/^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/[\w\/-]+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$/i)[8];
}

#5


10  

I needed a regular Expression to match all urls and made this one:

我需要一个正则表达式来匹配所有的url,并做了这个:

/(?:([^\:]*)\:\/\/)?(?:([^\:\@]*)(?:\:([^\@]*))?\@)?(?:([^\/\:]*)\.(?=[^\.\/\:]*\.[^\.\/\:]*))?([^\.\/\:]*)(?:\.([^\/\.\:]*))?(?:\:([0-9]*))?(\/[^\?#]*(?=.*?\/)\/)?([^\?#]*)?(?:\?([^#]*))?(?:#(.*))?/

It matches all urls, any protocol, even urls like

它匹配所有的url,任何协议,甚至url。

ftp://user:pass@www.cs.server.com:8080/dir1/dir2/file.php?param1=value1#hashtag

The result (in JavaScript) looks like this:

结果(在JavaScript中)是这样的:

["ftp", "user", "pass", "www.cs", "server", "com", "8080", "/dir1/dir2/", "file.php", "param1=value1", "hashtag"]

An url like

一个网址

mailto://admin@www.cs.server.com

looks like this:

是这样的:

["mailto", "admin", undefined, "www.cs", "server", "com", undefined, undefined, undefined, undefined, undefined] 

#6


6  

This is not a direct answer but most web libraries have a function that accomplishes this task. The function is often called something similar to CrackUrl. If such a function exists, use it, it is almost guaranteed to be more reliable and more efficient than any hand-crafted code.

这不是一个直接的答案,但是大多数web库都有一个完成这个任务的函数。这个函数通常被称为类似于CrackUrl的函数。如果存在这样的函数,使用它,几乎可以保证它比任何手工编写的代码更可靠、更高效。

#7


6  

I was trying to solve this in javascript, which should be handled by:

我试着用javascript来解决这个问题,应该通过以下方法来处理:

var url = new URL('http://a:b@example.com:890/path/wah@t/foo.js?foo=bar&bingobang=&king=kong@kong.com#foobar/bing/bo@ng?bang');

since (in Chrome, at least) it parses to:

因为(至少在Chrome浏览器中),它可以解析为:

{
  "hash": "#foobar/bing/bo@ng?bang",
  "search": "?foo=bar&bingobang=&king=kong@kong.com",
  "pathname": "/path/wah@t/foo.js",
  "port": "890",
  "hostname": "example.com",
  "host": "example.com:890",
  "password": "b",
  "username": "a",
  "protocol": "http:",
  "origin": "http://example.com:890",
  "href": "http://a:b@example.com:890/path/wah@t/foo.js?foo=bar&bingobang=&king=kong@kong.com#foobar/bing/bo@ng?bang"
}

However, this isn't cross browser (https://developer.mozilla.org/en-US/docs/Web/API/URL), so I cobbled this together to pull the same parts out as above:

然而,这并不是跨浏览器(https://developer.mozilla.org/en-US/docs/Web/API/URL),所以我将这些内容拼凑在一起,将相同的部分拉出:

^(?:(?:(([^:\/#\?]+:)?(?:(?:\/\/)(?:(?:(?:([^:@\/#\?]+)(?:\:([^:@\/#\?]*))?)@)?(([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])(?:\:([0-9]+))?))?)?)?((?:\/?(?:[^\/\?#]+\/+)*)(?:[^\?#]*)))?(\?[^#]+)?)(#.*)?

Credit for this regex goes to https://gist.github.com/rpflorence who posted this jsperf http://jsperf.com/url-parsing (originally found here: https://gist.github.com/jlong/2428561#comment-310066) who came up with the regex this was originally based on.

该regex的信用将转到https://gist.github.com/rpflorence,他发布了这个jsperf http://jsperf.com/url-解析(最初在这里找到:https://gist.github.com/jlong/2428561# comm-310066),这是基于regex的。

The parts are in this order:

var keys = [
    "href",                    // http://user:pass@host.com:81/directory/file.ext?query=1#anchor
    "origin",                  // http://user:pass@host.com:81
    "protocol",                // http:
    "username",                // user
    "password",                // pass
    "host",                    // host.com:81
    "hostname",                // host.com
    "port",                    // 81
    "pathname",                // /directory/file.ext
    "search",                  // ?query=1
    "hash"                     // #anchor
];
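
To illustrate how the key order lines up with the capture groups, here is a rough Python sketch (Python and the `parse_url` helper are assumptions made for illustration, not part of the original answer). The pattern is the regex above copied unchanged; the whole match supplies `href` and the ten groups supply the remaining keys.

```python
import re

# The regex from the answer, copied verbatim into a raw string.
URL_RE = re.compile(
    r"^(?:(?:(([^:\/#\?]+:)?(?:(?:\/\/)(?:(?:(?:([^:@\/#\?]+)(?:\:([^:@\/#\?]*))?)@)?"
    r"(([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])(?:\:([0-9]+))?))?)?)?"
    r"((?:\/?(?:[^\/\?#]+\/+)*)(?:[^\?#]*)))?(\?[^#]+)?)(#.*)?"
)

KEYS = ["href", "origin", "protocol", "username", "password", "host",
        "hostname", "port", "pathname", "search", "hash"]


def parse_url(url):
    """Hypothetical helper: pair each key with its capture group."""
    match = URL_RE.match(url)
    # group(0) is the whole match (href); groups() holds the other ten parts.
    return dict(zip(KEYS, (match.group(0),) + match.groups()))


parsed = parse_url("http://user:pass@host.com:81/directory/file.ext?query=1#anchor")
for key in KEYS:
    print(key, "=", parsed[key])
```

Groups that do not take part in a match come back as `None`, which corresponds to `undefined` in the JavaScript result.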

There is also a small library which wraps it and provides query params:

https://github.com/sadams/lite-url (also available on Bower)

If you have an improvement, please create a pull request with more tests and I will accept and merge with thanks.

#8


5  

Subdomain and domain are difficult because the subdomain can have several parts, as can the top-level domain, e.g. http://sub1.sub2.domain.co.uk/

 the path without the file : http://[^/]+/((?:[^/]+/)*)(?:[^/]+$)?  
 the file : http://[^/]+/(?:[^/]+/)*((?:[^/.]+\.)+[^/.]+)$  
 the path with the file : http://[^/]+/(.*)  
 the URL without the path : (http://[^/]+/)  
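
As a quick check of these four patterns against the URL from the question, here is a small Python sketch (Python's `re` module and the variable names are assumptions for illustration). Note that in this sketch the first pattern closes its capture group before the optional file name, so the group holds only the directories.

```python
import re

url = "http://test.example.com/dir/subdir/file.html"

# Each pattern keeps the part of interest in capture group 1.
path_without_file = re.match(r"http://[^/]+/((?:[^/]+/)*)(?:[^/]+$)?", url).group(1)
file_name = re.match(r"http://[^/]+/(?:[^/]+/)*((?:[^/.]+\.)+[^/.]+)$", url).group(1)
path_with_file = re.match(r"http://[^/]+/(.*)", url).group(1)
url_without_path = re.match(r"(http://[^/]+/)", url).group(1)

print(path_without_file)  # dir/subdir/
print(file_name)          # file.html
print(path_with_file)     # dir/subdir/file.html
print(url_without_path)   # http://test.example.com/
```

The captured paths have no leading slash because the slash after the host is consumed outside the group.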

(Markdown isn't very friendly to regexes)

#9


5  

This improved version should work as reliably as a parser.

   // Applies to URI, not just URL or URN:
   //    http://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Relationship_to_URL_and_URN
   //
   // http://labs.apache.org/webarch/uri/rfc/rfc3986.html#regexp
   //
   // (?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?
   //
   // http://en.wikipedia.org/wiki/URI_scheme#Generic_syntax
   //
   // $& matches the entire uri
   // $1 matches scheme (ftp, http, mailto, mshelp, ymsgr, etc)
   // $2 matches authority (host, user:pwd@host, etc)
   // $3 matches path
   // $4 matches query (http GET REST api, etc)
   // $5 matches fragment (html anchor, etc)
   //
   // Match specific schemes, non-optional authority, disallow white-space so can delimit in text, and allow 'www.' w/o scheme
   // Note the schemes must match ^[^\s|:/?#]+(?:\|[^\s|:/?#]+)*$
   //
   // (?:()(www\.[^\s/?#]+\.[^\s/?#]+)|(schemes)://([^\s/?#]*))([^\s?#]*)(?:\?([^\s#]*))?(?:#(\S*))?
   //
   // Validate the authority with an orthogonal RegExp, so the RegExp above won’t fail to match any valid urls.
   function uriRegExp( flags, schemes/* = null*/, noSubMatches/* = false*/ )
   {
      if( !schemes )
         schemes = '[^\\s:\/?#]+'
      else if( !RegExp( /^[^\s|:\/?#]+(?:\|[^\s|:\/?#]+)*$/ ).test( schemes ) )
         throw TypeError( 'expected URI schemes' )
      return noSubMatches ? new RegExp( '(?:www\\.[^\\s/?#]+\\.[^\\s/?#]+|' + schemes + '://[^\\s/?#]*)[^\\s?#]*(?:\\?[^\\s#]*)?(?:#\\S*)?', flags ) :
         new RegExp( '(?:()(www\\.[^\\s/?#]+\\.[^\\s/?#]+)|(' + schemes + ')://([^\\s/?#]*))([^\\s?#]*)(?:\\?([^\\s#]*))?(?:#(\\S*))?', flags )
   }

   // http://en.wikipedia.org/wiki/URI_scheme#Official_IANA-registered_schemes
   function uriSchemesRegExp()
   {
      return 'about|callto|ftp|gtalk|http|https|irc|ircs|javascript|mailto|mshelp|sftp|ssh|steam|tel|view-source|ymsgr'
   }
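
For comparison, the generic RFC 3986 expression quoted in the comments can be tried on its own. The following is a sketch in Python, with `^` and `$` anchors added and a hypothetical `split_uri` helper; the query and fragment on the sample URL are invented for illustration.

```python
import re

# Generic URI pattern from the comment block above, anchored at both ends.
URI_RE = re.compile(
    r"^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$"
)


def split_uri(uri):
    """Hypothetical helper: return the five generic components of a single-line URI."""
    scheme, authority, path, query, fragment = URI_RE.match(uri).groups()
    return {
        "scheme": scheme,
        "authority": authority,
        "path": path,
        "query": query,
        "fragment": fragment,
    }


print(split_uri("http://test.example.com/dir/subdir/file.html?x=1#top"))
print(split_uri("mailto:admin@example.com"))
```

Components that are absent, such as the authority of a `mailto:` URI, are reported as `None`.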

#10


5  

Try the following:

^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((/?\w+/)+|/?)(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?

It supports HTTP / FTP, subdomains, folders, files etc.

I found it with a quick Google search:

http://geekswithblogs.net/casualjim/archive/2005/12/01/61722.aspx

#11


4  

/^((?P<scheme>https?|ftp):\/)?\/?((?P<username>.*?)(:(?P<password>.*?)|)@)?(?P<hostname>[^:\/\s]+)(?P<port>:([^\/]*))?(?P<path>(\/\w+)*\/)(?P<filename>[-\w.]+[^#?\s]*)?(?P<query>\?([^#]*))?(?P<fragment>#(.*))?$/

From my answer on a similar question. Works better than some of the others mentioned because they had some bugs (such as not supporting username/password, not supporting single-character filenames, fragment identifiers being broken).
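
Because this pattern uses the `(?P<name>...)` syntax, it can be used unchanged in Python, for example. A minimal sketch, assuming Python's `re` module and dropping the surrounding `/` delimiters:

```python
import re

# The pattern from this answer, split across lines only for readability.
URL_RE = re.compile(
    r"^((?P<scheme>https?|ftp):\/)?\/?"
    r"((?P<username>.*?)(:(?P<password>.*?)|)@)?"
    r"(?P<hostname>[^:\/\s]+)(?P<port>:([^\/]*))?"
    r"(?P<path>(\/\w+)*\/)(?P<filename>[-\w.]+[^#?\s]*)?"
    r"(?P<query>\?([^#]*))?(?P<fragment>#(.*))?$"
)

parts = URL_RE.match("http://test.example.com/dir/subdir/file.html").groupdict()
print(parts["scheme"])    # http
print(parts["hostname"])  # test.example.com
print(parts["path"])      # /dir/subdir/
print(parts["filename"])  # file.html
```

Groups that do not take part in the match (username, password, port, query and fragment here) are `None` in the resulting dictionary.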

#12


4  

I propose a much more readable solution (in Python, but it applies to any regex):

import re


def url_path_to_dict(path):
    # Each named group captures one component of the URL.
    pattern = (r'^'
               r'((?P<schema>.+?)://)?'
               r'((?P<user>.+?)(:(?P<password>.*?))?@)?'
               r'(?P<host>.*?)'
               r'(:(?P<port>\d+?))?'
               r'(?P<path>/.*?)?'
               r'(?P<query>[?].*?)?'
               r'$'
               )
    regex = re.compile(pattern)
    m = regex.match(path)
    # groupdict() maps each group name to its match, or None if absent.
    d = m.groupdict() if m is not None else None

    return d

def main():
    print(url_path_to_dict('http://example.example.com/example/example/example.html'))

if __name__ == '__main__':
    main()

Prints:

{
'host': 'example.example.com', 
'user': None, 
'path': '/example/example/example.html', 
'query': None, 
'password': None, 
'port': None, 
'schema': 'http'
}

#13


2  

You can get the scheme (http/https), host, port, path, and query by using the Uri object in .NET. The only difficult task is breaking the host into subdomain, domain name, and TLD.

There is no standard way to do so, and you can't simply use string parsing or a regex to produce the correct result. At first I used a regex function, but it could not parse the subdomain correctly for every URL. The practical way is to use a list of TLDs. Once the TLD of a URL is identified, the label to its left is the domain and whatever remains is the subdomain.

However, the list needs to be maintained, since new TLDs can be added. As far as I know, publicsuffix.org currently maintains the latest list, and you can use the domainname-parser tool from Google Code to parse the public suffix list and get the subdomain, domain, and TLD easily by using the DomainName object: domainName.SubDomain, domainName.Domain and domainName.TLD.
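
The list-based approach can be sketched as follows. This is only an illustration in Python: `SAMPLE_SUFFIXES` is a tiny hard-coded stand-in for the real list from publicsuffix.org, `split_host` is a hypothetical helper, and the wildcard and exception rules of the real list are ignored.

```python
# Hypothetical sample; a real program would load the full public suffix list.
SAMPLE_SUFFIXES = {"com", "org", "uk", "co.uk"}


def split_host(host, suffixes=SAMPLE_SUFFIXES):
    """Return (subdomain, domain, tld), or None if no known suffix matches."""
    labels = host.lower().split(".")
    # Try the longest candidate suffix first, always leaving at least
    # one label to the left of it for the domain.
    for i in range(1, len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in suffixes:
            return ".".join(labels[:i - 1]), labels[i - 1], suffix
    return None


print(split_host("sub1.sub2.domain.co.uk"))  # ('sub1.sub2', 'domain', 'co.uk')
print(split_host("test.example.com"))        # ('test', 'example', 'com')
```

Checking the longest suffix first is what keeps `co.uk` from being split into a domain `co` and a TLD `uk`.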

This answer is also helpful: Get the subdomain from a URL

CaLLMeLaNN

#14


2  

I would recommend not using regex. An API call like WinHttpCrackUrl() is less error-prone.

http://msdn.microsoft.com/en-us/library/aa384092%28VS.85%29.aspx

#15


2  

Sadly, this doesn't work with some URLs. Take, for example, this one: http://www.example.org/&value=329

Neither does &value=329

Or even with no parameters at all (a simple URL)!

I understand that the regex is expecting some seriously complex/long URL, but it should be able to work on simple ones as well, am I right?

#16


2  

None of the above worked for me. Here's what I ended up using:

/^(?:((?:https?|s?ftp):)\/\/)([^:\/\s]+)(?::(\d*))?(?:\/([^\s?#]+)?([?][^?#]*)?(#.*)?)?/

#17


2  

I like the regex that was published in "JavaScript: The Good Parts". It's not too short and not too complex. This page on GitHub also has the JavaScript code that uses it, but it can be adapted for any language. https://gist.github.com/voodooGQ/4057330

#18


1  

Java offers a URL class that will do this. See "Query URL Objects".

On a side note, PHP offers parse_url().

#19


1  

Here is one that is complete, and doesn't rely on any protocol.

function getServerURL(url) {
    var m = url.match("(^(?:(?:.*?)?//)?[^/?#;]*)");
    console.log(m[1]) // Remove this
    return m[1];
}

getServerURL("http://dev.test.se")
getServerURL("http://dev.test.se/")
getServerURL("//ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js")
getServerURL("//")
getServerURL("www.dev.test.se/sdas/dsads")
getServerURL("www.dev.test.se/")
getServerURL("www.dev.test.se?abc=32")
getServerURL("www.dev.test.se#abc")
getServerURL("//dev.test.se?sads")
getServerURL("http://www.dev.test.se#321")
getServerURL("http://localhost:8080/sads")
getServerURL("https://localhost:8080?sdsa")

Prints

http://dev.test.se

http://dev.test.se

//ajax.googleapis.com

//

www.dev.test.se

www.dev.test.se

www.dev.test.se

www.dev.test.se

//dev.test.se

http://www.dev.test.se

http://localhost:8080

https://localhost:8080

#20


0  

Using http://www.fileformat.info/tool/regex.htm, hometoast's regex works great.

But here is the deal: I want to use different regex patterns in different situations in my program.

For example, I have this URL, and I have an enumeration that lists all supported URLs in my program. Each object in the enumeration has a method getRegexPattern that returns the regex pattern which will then be used to compare with a URL. If the particular regex pattern returns true, then I know that this URL is supported by my program. So, each enumeration has its own regex depending on where it should look inside the URL.

Hometoast's suggestion is great, but in my case, I think it wouldn't help (unless I copy and paste the same regex in all enumerations).

That is why I wanted the answer to give the regex for each situation separately. Although +1 for hometoast. ;)

#21


0  

I know you're claiming language-agnostic on this, but can you tell us what you're using just so we know what regex capabilities you have?

If you have the capabilities for non-capturing matches, you can modify hometoast's expression so that subexpressions that you aren't interested in capturing are set up like this:

(?:SOMESTUFF)

You'd still have to copy and paste (and slightly modify) the Regex into multiple places, but this makes sense--you're not just checking to see if the subexpression exists, but rather if it exists as part of a URL. Using the non-capturing modifier for subexpressions can give you what you need and nothing more, which, if I'm reading you correctly, is what you want.

Just as a small, small note, hometoast's expression doesn't need to put brackets around the 's' for 'https', since he only has one character in there. Quantifiers quantify the one character (or character class or subexpression) directly preceding them. So:

https?

would match 'http' or 'https' just fine.
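
To make the effect of non-capturing groups concrete, here is a small Python sketch. Both patterns are simplified, hypothetical stand-ins rather than hometoast's actual expression, and the variable names are assumptions.

```python
import re

url = "http://test.example.com/dir/subdir/file.html"

# Every part is captured; http[s]? is written with the redundant brackets.
capturing = re.compile(r"^(http[s]?)://([^/]+)(/.*)?$")

# Same structure, but only the host is captured; the scheme and path are
# matched through non-capturing groups, and https? drops the brackets.
host_only = re.compile(r"^(?:https?)://([^/]+)(?:/.*)?$")

print(capturing.match(url).groups())  # ('http', 'test.example.com', '/dir/subdir/file.html')
print(host_only.match(url).groups())  # ('test.example.com',)
```

Both patterns accept exactly the same URLs; they differ only in which parts are handed back.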

#22


0  

Regexp to get the URL path without the file.

url = 'http://domain/dir1/dir2/somefile'
url.scan(%r{^(http://[^/]+)((?:/[^/]+)+(?=/))?/?(?:[^/]+)?$}i).to_s

It can be useful for adding a relative path to this url.

#23


0  

String s = "https://www.thomas-bayer.com/axis2/services/BLZService?wsdl";

String regex = "(^http.?://)(.*?)([/\\?]{1,})(.*)";

System.out.println("1: " + s.replaceAll(regex, "$1"));
System.out.println("2: " + s.replaceAll(regex, "$2"));
System.out.println("3: " + s.replaceAll(regex, "$3"));
System.out.println("4: " + s.replaceAll(regex, "$4"));

Will provide the following output:
1: https://
2: www.thomas-bayer.com
3: /
4: axis2/services/BLZService?wsdl

If you change the URL to
String s = "https://www.thomas-bayer.com?wsdl=qwerwer&ttt=888"; the output will be the following :
1: https://
2: www.thomas-bayer.com
3: ?
4: wsdl=qwerwer&ttt=888

enjoy..
Yosi Lev

#24


0  

The regex to do full parsing is quite horrendous. I've included named capture groups for legibility and broken each part onto a separate line, but it still looks like this:

^(?:(?P<protocol>\w+(?=:\/\/))(?::\/\/))?
(?:(?P<host>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^\/?#:]+)(?::(?P<port>[0-9]+))?)\/)?
(?:(?P<path>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)\/)?
(?P<file>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)
(?:\?(?P<querystring>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^#])+))?
(?:#(?P<fragment>.*))?$

The thing that requires it to be so verbose is that except for the protocol or the port, any of the parts can contain HTML entities, which makes delineation of the fragment quite tricky. So in the last few cases - the host, path, file, querystring, and fragment, we allow either any html entity or any character that isn't a ? or #. The regex for an html entity looks like this:

$htmlentity = "&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);"

When that is extracted (I used a mustache syntax to represent it), it becomes a bit more legible:

^(?:(?P<protocol>(?:ht|f)tps?|\w+(?=:\/\/))(?::\/\/))?
(?:(?P<host>(?:{{htmlentity}}|[^\/?#:])+(?::(?P<port>[0-9]+))?)\/)?
(?:(?P<path>(?:{{htmlentity}}|[^?#])+)\/)?
(?P<file>(?:{{htmlentity}}|[^?#])+)
(?:\?(?P<querystring>(?:{{htmlentity}}|[^#])+))?
(?:#(?P<fragment>.*))?$
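
One way to turn the templated form into a working pattern is plain string substitution. The sketch below assumes Python; the names `HTML_ENTITY`, `TEMPLATE` and `URL_RE` are made up for illustration, and the query and fragment on the sample URL are invented. The querystring line is written without a `;` after the placeholder, since the entity pattern already ends in one.

```python
import re

# The HTML entity sub-pattern, wrapped in a non-capturing group on substitution.
HTML_ENTITY = (r"&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash"
               r"|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);")

# The templated regex, one piece per URL component.
TEMPLATE = (
    r"^(?:(?P<protocol>(?:ht|f)tps?|\w+(?=:\/\/))(?::\/\/))?"
    r"(?:(?P<host>(?:{{htmlentity}}|[^\/?#:])+(?::(?P<port>[0-9]+))?)\/)?"
    r"(?:(?P<path>(?:{{htmlentity}}|[^?#])+)\/)?"
    r"(?P<file>(?:{{htmlentity}}|[^?#])+)"
    r"(?:\?(?P<querystring>(?:{{htmlentity}}|[^#])+))?"
    r"(?:#(?P<fragment>.*))?$"
)

# str.replace is used instead of str.format so the {1,3} quantifier is untouched.
URL_RE = re.compile(TEMPLATE.replace("{{htmlentity}}", "(?:" + HTML_ENTITY + ")"))

parts = URL_RE.match("http://test.example.com/dir/subdir/file.html?x=1#top").groupdict()
print(parts)
```

With this pattern the path comes back without its leading and trailing slashes, because those are matched outside the named group.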

In JavaScript, of course, you can't use the (?P<name>...) named-group syntax, so the regex becomes

^(?:(\w+(?=:\/\/))(?::\/\/))?(?:((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^\/?#:]+)(?::([0-9]+))?)\/)?(?:((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)\/)?((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)(?:\?((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^#])+))?(?:#(.*))?$

and in each match, the protocol is \1, the host is \2, the port is \3, the path \4, the file \5, the querystring \6, and the fragment \7.

#25


0  

I tried a few of these, but they didn't cover my needs, especially the highest-voted one, which didn't catch a URL without a path (http://example.com/).

Also, the lack of group names made it unusable in Ansible (or perhaps my Jinja2 skills are lacking).

So this is my slightly modified version, based on the highest-voted version here:

^((?P<protocol>http[s]?|ftp):\/)?\/?(?P<host>[^:\/\s]+)(?P<path>((\/\w+)*\/)([\w\-\.]+[^#?\s]+))*(.*)?(#[\w\-]+)?$

#26


-1  

//USING REGEX
/**
 * Parse URL to get information
 *
 * @param   url     the URL string to parse
 * @return  parsed  the URL parsed or null
 */
var UrlParser = function (url) {
    "use strict";

    var regx = /^(((([^:\/#\?]+:)?(?:(\/\/)((?:(([^:@\/#\?]+)(?:\:([^:@\/#\?]+))?)@)?(([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])(?:\:([0-9]+))?))?)?)?((\/?(?:[^\/\?#]+\/+)*)([^\?#]*)))?(\?[^#]+)?)(#.*)?/,
        matches = regx.exec(url),
        parser = null;

    if (null !== matches) {
        parser = {
            href              : matches[0],
            withoutHash       : matches[1],
            url               : matches[2],
            origin            : matches[3],
            protocol          : matches[4],
            protocolseparator : matches[5],
            credhost          : matches[6],
            cred              : matches[7],
            user              : matches[8],
            pass              : matches[9],
            host              : matches[10],
            hostname          : matches[11],
            port              : matches[12],
            pathname          : matches[13],
            segment1          : matches[14],
            segment2          : matches[15],
            search            : matches[16],
            hash              : matches[17]
        };
    }

    return parser;
};

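// Example input: the URL from the question.
var url = "http://test.example.com/dir/subdir/file.html";
var parsedURL = UrlParser(url);
console.log(parsedURL);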